Is tokenization needed for masked particle modeling?

In this work, we significantly enhance masked particle modeling (MPM), a self-supervised learning scheme for constructing highly expressive representations of unordered sets relevant to developing foundation models for high-energy physics. In MPM, a model is trained to recover the missing elements of a set, a learning objective that requires no labels and can be applied directly to experimental data. We achieve significant performance improvements over previous work on MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder. We compare several pre-training tasks and introduce new reconstruction methods that utilize conditional generative models without data tokenization or discretization. We show that these new methods outperform the tokenized learning objective from the original MPM on a new test bed for foundation models for jets, which includes using a wide variety of downstream tasks relevant to jet physics, such as classification, secondary vertex finding, and track identification.
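To make the objective concrete, below is a minimal sketch of a masked-set pre-training step in PyTorch: particles in a jet are randomly hidden, a permutation-equivariant transformer encodes the remaining set, and a head regresses the continuous features of the masked particles directly, with no tokenization of the targets. The architecture, masking fraction, and MSE loss here are illustrative assumptions, not the authors' implementation (the paper's non-tokenized objectives use conditional generative models rather than a plain regression loss).

```python
# Minimal sketch of a masked-particle-modeling step (PyTorch).
# All names and hyperparameters are hypothetical choices for illustration.
import torch
import torch.nn as nn

N_FEATURES = 4   # e.g. (pt, eta, phi, energy); assumed feature set
D_MODEL = 64
MASK_FRAC = 0.3  # fraction of particles to hide per jet

class MPMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(N_FEATURES, D_MODEL)
        # Learned placeholder inserted at masked positions
        self.mask_token = nn.Parameter(torch.zeros(D_MODEL))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Continuous head: predicts masked features directly, so the
        # targets are never tokenized or discretized.
        self.head = nn.Linear(D_MODEL, N_FEATURES)

    def forward(self, particles, mask):
        # particles: (batch, n_particles, N_FEATURES)
        # mask: (batch, n_particles) bool, True where a particle is hidden
        h = self.embed(particles)
        h = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(h), h)
        # No positional encoding: a jet is an unordered set of particles
        h = self.encoder(h)
        return self.head(h)

# One illustrative training step on toy data
torch.manual_seed(0)
model = MPMSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

jets = torch.randn(8, 16, N_FEATURES)   # stand-in for real jet constituents
mask = torch.rand(8, 16) < MASK_FRAC    # hide roughly 30% of particles

pred = model(jets, mask)
# MSE on masked positions only; the paper replaces this with conditional
# generative reconstruction, which this simple loss only approximates.
loss = ((pred - jets) ** 2)[mask].mean()
loss.backward()
opt.step()
print(f"masked-reconstruction loss: {loss.item():.4f}")
```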

Bibliographic Details
Main Authors: Matthew Leigh, Samuel Klein, François Charton, Tobias Golling, Lukas Heinrich, Michael Kagan, Inês Ochoa, Margarita Osadchy
Format: Article
Language: English
Published: IOP Publishing, 2025-01-01
Series: Machine Learning: Science and Technology
Subjects: jet; self-supervised learning; high-energy physics; conditional generative models; jet physics
ISSN: 2632-2153
Online Access: https://doi.org/10.1088/2632-2153/addb98
Citation: Machine Learning: Science and Technology, Vol. 6, No. 2, 025075 (2025). DOI: 10.1088/2632-2153/addb98
Author Affiliations:
Matthew Leigh, University of Geneva, Geneva, Switzerland (ORCID: 0000-0003-1406-1413)
Samuel Klein, University of Geneva, Geneva, Switzerland (ORCID: 0000-0002-2999-6150)
François Charton, Meta FAIR, Paris, France
Tobias Golling, University of Geneva, Geneva, Switzerland (ORCID: 0000-0001-8535-6687)
Lukas Heinrich, Technical University of Munich, Munich, Germany
Michael Kagan, SLAC National Accelerator Laboratory, Menlo Park, CA, United States of America (ORCID: 0000-0002-3386-6869)
Inês Ochoa, LIP, Lisbon, Portugal (ORCID: 0000-0001-6156-1790)
Margarita Osadchy, University of Haifa, Haifa, Israel (ORCID: 0000-0001-5480-5099)