Preparing datasets for ML studies

Good Training for a good estimation

Scipy_conversion

The code to prepare numpy arrays with the EdbSegP information and store them in a pandas DataFrame is the following:

https://github.com/antonioiuliano2/Tutorial-DESY19/blob/master/scipy_conversion/desy19_fedrautils.py

It can build a DataFrame with positions, angles, true MC info and other required information.
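
For illustration, here is a minimal sketch of the conversion step, assuming PyROOT bindings that expose the usual EdbSegP accessors (X(), Y(), Z(), TX(), TY(), MCEvt()); the actual script stores more columns:

```python
import pandas as pd

def segments_to_dataframe(segments):
    """Convert an iterable of FEDRA EdbSegP objects into a pandas DataFrame.

    Minimal sketch: desy19_fedrautils.py stores additional columns.
    """
    rows = [{
        "x": seg.X(), "y": seg.Y(), "z": seg.Z(),  # positions
        "tx": seg.TX(), "ty": seg.TY(),            # slopes (angles)
        "mc_event": seg.MCEvt(),                   # true MC info
    } for seg in segments]
    return pd.DataFrame(rows)
```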

I have also added an option to project the positions to a new list of z positions, in case we want to merge data from different configurations (the z positions should not differ too much, to avoid large uncertainties on the projected positions).
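
A hedged sketch of such a projection, assuming the column names above: each segment is propagated as a straight line from its own z to the new z.

```python
def project_to_z(df, znew):
    """Project segment positions to a new z via straight-line extrapolation.

    Sketch only; valid for small dz, otherwise the positional
    uncertainty grows too much.
    """
    out = df.copy()
    dz = znew - out["z"]
    out["x"] = out["x"] + out["tx"] * dz
    out["y"] = out["y"] + out["ty"] * dz
    out["z"] = znew
    return out
```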

Background/signal dataset preparation

The signal, consisting of electromagnetic showers, is taken from simulation.

The background is taken from real data, which is dominated by background, especially in the first plates. Two possible datasets are envisioned:

  1. First and second plates, background only: we take the data from the first two plates, then duplicate them with simple transformations (1D reflections on single features) to reproduce them in later plates (see the sketch after this list). The signal is stored from all plates.

  2. First 10 plates, background and signal: we take all the data from the first 10 plates of a RUN, and the first 10 plates from simulation. This allows us to study correlations between the features of nearby segments, to separate them from the background. We cannot take more plates, otherwise the signal taken from data and treated as background might become significant.
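
A minimal sketch of the duplication idea in dataset 1, assuming the column names above; the helper name, the choice of reflected feature and the "plate" column are illustrative, not the exact transformations used:

```python
def duplicate_background(df, new_plate, feature="tx"):
    """Duplicate background segments for a later plate via a 1D reflection.

    'feature' and the 'plate' column are assumptions of this sketch.
    """
    dup = df.copy()
    dup[feature] = -dup[feature]  # 1D reflection on a single feature
    dup["plate"] = new_plate      # assign the copy to a later plate
    return dup
```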

Dataset construction algorithm

The algorithm follows the steps designed by Maria during her thesis.

A dataset containing the base-tracks for shower candidates is produced, then provided as input to a Random Forest classifier.

There are two versions: one for the simulation CSV files and the other for the data CSV files.

Lettura_csv analyses the ML output and produces histograms.
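
As a reference for the classification step, a minimal scikit-learn sketch, assuming a DataFrame df with base-track features and a binary signal label (the feature names are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

features = ["tx", "ty", "ip", "dtheta"]  # placeholder feature names
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["signal"], test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```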

Dataset preparation from simulation

The scripts need to be launched in the following order (Concat_dataframe.py is launched twice):

  1. Proiezioni.py

  2. Inizio_sciame.py

  3. Rect.py

  4. Rect_crescenti.py

  5. Taglio_Theta.py

  6. Concat_dataframe.py

  7. Ricerca_new.py

  8. Concat_dataframe.py

  9. Random_Forest_Ishower.py

Proiezioni.py computes the "next" variables with the projections of the coordinates onto the next plate. Inizio_sciame.py provides a list of shower injectors. After this, a 1 cm x 1 cm selection in the transverse plane is performed (Rect.py). A pyramid is then built with increasing rectangles, with slope 140 micron and intercept 500 micron (Rect_crescenti.py). Taglio_Theta.py applies selections in angle and impact parameter. Concat_dataframe.py merges the files from different showers (it is called twice). Ricerca_new.py performs the actual final dataset building, with the selection over the projections. Random_Forest_Ishower.py finally performs the training and testing on the showers.
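
The pyramid can be read as a z-dependent acceptance window around the projected shower axis. A hedged sketch, assuming the half-width grows linearly with the number of plates downstream of the shower start (the slope units, micron per plate, are an assumption):

```python
def in_pyramid(dx, dy, nplates, slope=140.0, intercept=500.0):
    """Check whether a segment lies inside the growing rectangle.

    dx, dy: transverse distances (micron) from the projected shower axis;
    nplates: number of plates downstream of the shower start.
    """
    half_width = intercept + slope * nplates
    return abs(dx) < half_width and abs(dy) < half_width
```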

Dataset preparation from data

The scripts need to be launched in the following order (Concat_dataframe.py is launched twice):

  1. Proiezioni_Theta.py

Between steps 1 and 2, produce "Inizio_candidati_sciami.csv": select segments with the same TrackID within the first three plates and with theta <= 50 mrad, then take the last segment of each track (a sketch of this step is shown below, after the list).

  2. Data_rect.py

  3. Data_rect_crescenti.py

  4. Data_taglio_Theta.py

  5. Concat_dataframe.py

  6. Ricerca_complete.py

  7. Concat_dataframe.py

  8. Random_Forest_Ishower.py

The scripts work like the previous ones, but use real data as input instead of true Monte Carlo information.
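
A hedged sketch of the intermediate step producing "Inizio_candidati_sciami.csv", assuming columns track_id, plate, tx and ty; theta is taken as the space angle of the segment:

```python
import numpy as np

def shower_candidate_starts(df, max_theta=0.050, first_plates=3):
    """Select shower-candidate starting segments (column names assumed).

    Keep segments in the first plates with theta <= 50 mrad, group them
    by TrackID and take the last segment of each track.
    """
    theta = np.sqrt(df["tx"] ** 2 + df["ty"] ** 2)
    sel = df[(df["plate"] <= first_plates) & (theta <= max_theta)]
    # the last segment of each track is the one on the largest plate
    return sel.sort_values("plate").groupby("track_id").tail(1)
```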

Energy measurement and histogram production

Let us assume we are analysing a true Monte Carlo simulation. First, we need to launch Lunghezza_sciami_ricostruiti_RF, which collects events according to their classification (11, 00, 10, 01).
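
A minimal sketch of this collection step, assuming a CSV with a true label and a predicted label (the column names, the file name and the digit order true/predicted are assumptions):

```python
import pandas as pd

df = pd.read_csv("rf_output.csv")  # hypothetical file name
outcome = (df["true_label"].astype(int).astype(str)
           + df["pred_label"].astype(int).astype(str))
groups = {key: sub for key, sub in df.groupby(outcome)}
# e.g. groups["11"]: signal classified as signal,
#      groups["10"]: signal classified as background, etc.
```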

Then, the produced file can be read with Istogramma_pyroot.py to produce histograms. Finally, Erec.py provides an estimate of the energy resolution (without the calibration step: we assume the calibration parameters are already available).
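
A hedged sketch of the final estimate, assuming a linear calibration between the number of selected base-tracks and the shower energy (the functional form and the parameters a, b are assumptions; as stated above, the parameters are taken as already known):

```python
import numpy as np

def estimate_energy(n_basetracks, a, b):
    """E_rec from a linear calibration E = a * N + b (assumed form)."""
    return a * np.asarray(n_basetracks) + b

def energy_resolution(e_rec, e_true):
    """Width of the relative residuals (E_rec - E_true) / E_true."""
    residuals = (np.asarray(e_rec) - np.asarray(e_true)) / np.asarray(e_true)
    return residuals.std()
```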

HTCondor workflow

To speed things up by reconstructing the 360 showers in parallel, I have developed a procedure which uses HTCondor submission jobs (see the HTCondor section under Other Software). The procedure is now as follows:

  1. Launch "Proiezioni.py"

  2. Launch "Inizio_sciame.py"

  3. Submit HTCondor script "PreReco.sub" (calling "PreReco.sh")

  4. Merge dataframes with "Concat_dataframe.py" (afterthetacut())

  5. Submit HTCondor script "Ricerca_new.sub" (calling "Ricerca_new.sh")

  6. Merge dataframes with "Concat_dataframe.py" (afternewvariables())

  7. Launch "Random_Forest_Ishower.py"

Note: the Python scripts now take input options via argparse; check the available options first by printing the help, for example "python Proiezioni.py -h".
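
A minimal sketch of the argparse pattern (the options shown are placeholders; the real ones differ per script, hence the -h check):

```python
import argparse

parser = argparse.ArgumentParser(description="Dataset preparation step")
parser.add_argument("-i", "--input", help="input CSV file")    # placeholder
parser.add_argument("-o", "--output", help="output CSV file")  # placeholder
args = parser.parse_args()
print(args.input, args.output)
```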

For data, the workflow is the same; only the file names change:

  1. Launch "Proiezioni_Theta.py"

  2. Launch "Inizio_sciame.py"

  3. Submit HTCondor script "PreReco_Data.sub" (calling "PreReco_Data.sh")

  4. Merge dataframes with "Concat_dataframe.py" (afterthetacut())

  5. Submit HTCondor script "Ricerca_complete.sub" (calling "Ricerca_complete.sh")

  6. Merge dataframes with "Concat_dataframe.py" (afternewvariables())

  7. Launch "Random_Forest_Ishower.py"

File Locations:

  • $DESYMACROS/condor_scripts/ (HTCondor scripts)

  • $DESYMACROS/Dataset_preparation_ML_analysis/RUN3_simulazioni (scripts for Monte Carlo)

  • $DESYMACROS/Dataset_preparation_ML_analysis/RUN3_Data (scripts for real data)
