The big picture: The European Organization for Nuclear Research (CERN) is home to one of the most ambitious engineering and scientific endeavors ever undertaken. The Large Hadron Collider (LHC) is the world's largest and most powerful particle accelerator, which scientists use to probe the structure of the subatomic world, and it is capable of generating tens of petabytes of data every year.
CERN recently had to upgrade its backend IT systems in preparation for the new experimental phase of the Large Hadron Collider (LHC Run 3). This phase is expected to generate 1 petabyte of data daily by the end of 2025. The previous database systems were no longer sufficient to process the "high cardinality" data generated by the collider's primary experiments, such as CMS.
The Compact Muon Solenoid (CMS) is the LHC's general-purpose detector with a broad physics program, ranging from studies of the Standard Model (including the Higgs boson) to searches for extra dimensions and particles that could make up dark matter. CERN describes the experiment as one of the largest scientific collaborations in history, involving approximately 5,500 people from 241 institutes in 54 countries.
CMS and other LHC experiments underwent a major modernization period from 2018 to 2022 and are now ready to resume colliding subatomic particles during the three-year "Run 3" data collection period. During the shutdown, CERN experts also made significant upgrades to the detector systems and the computing infrastructure supporting CMS.
Brij Kishor Jashal, a scientist working on CMS, said his team collected 30 terabytes of data over a 30-day period to monitor infrastructure performance. He explained that Run 3 operations run at higher luminosity, resulting in a significant increase in data volume. The previous monitoring backend was built on the open source time series database (TSDB) InfluxDB, which relies on compression algorithms to process this data efficiently, alongside the Prometheus monitoring system.
However, InfluxDB and Prometheus experienced performance, scalability and reliability issues, particularly when dealing with high-cardinality data. High cardinality refers to a very large number of unique label values, and therefore unique time series, which arises, for example, when applications are redeployed many times in new instances, each carrying fresh identifiers. To address these challenges, the CMS monitoring team decided to replace both InfluxDB and Prometheus with the VictoriaMetrics TSDB.
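To make the cardinality issue concrete, the following minimal sketch (with made-up service and pod names, not CERN's actual metrics) shows how ephemeral per-instance labels multiply the number of unique time series a TSDB has to track:

```python
# Minimal sketch: why per-instance labels cause high cardinality.
# Each unique combination of label values becomes its own time series.

from itertools import product

# Hypothetical label values for an illustrative monitoring setup.
services = [f"service-{i}" for i in range(50)]        # 50 monitored services
metrics = ["cpu_usage", "memory_usage", "job_latency"]
# Every redeployment creates a fresh pod/instance identifier...
pod_ids = [f"pod-{i}" for i in range(2000)]           # instances churned over time

# Total unique series = product of distinct values per label.
series = {(m, s, p) for m, s, p in product(metrics, services, pod_ids)}
print(f"unique time series: {len(series):,}")         # 300,000 series

# Dropping the ephemeral pod label collapses the cardinality dramatically.
stable_series = {(m, s) for m, s, _ in series}
print(f"after removing pod label: {len(stable_series):,}")  # 150 series
```

The series count grows with instance churn rather than with the number of things actually being measured, which is what strains the storage and query layers of a TSDB.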
VictoriaMetrics now serves as both a backend storage and monitoring system for CMS, effectively solving the cardinality problem that occurred previously. Jashal noted that the CMS team is currently happy with the performance of clusters and services. Although there is still room for scalability, the services operate in “high availability mode” within CMS’s dedicated Kubernetes clusters to provide enhanced reliability guarantees. CERN’s data center is based on an OpenStack service running on a cluster of robust x86 machines.
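Because VictoriaMetrics exposes a Prometheus-compatible HTTP query API, existing dashboards and scripts can generally be pointed at it with little change. The snippet below is an illustrative sketch of such a query; the hostname, port, metric and label names are assumptions for demonstration, not CERN's actual configuration.

```python
# Hedged sketch: querying a VictoriaMetrics instance through its
# Prometheus-compatible HTTP API. Endpoint and metric names are placeholders.

import requests

VM_URL = "http://victoria-metrics.example.internal:8428"  # hypothetical single-node endpoint

def instant_query(expr: str) -> list[dict]:
    """Run an instant PromQL/MetricsQL query and return the matching series."""
    resp = requests.get(f"{VM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

if __name__ == "__main__":
    # Example: average job latency per service over the last 5 minutes
    # (the metric and label names are made up for illustration).
    for series in instant_query('avg by (service) (rate(job_latency_seconds_sum[5m]))'):
        print(series["metric"].get("service"), series["value"])
```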