Machine Learning FAQ

Q: How many folds are used for k-fold cross-validation?

A: The number of folds varies with the training set size, to balance accuracy against computation time. Smaller training sets use more folds than larger ones. The minimum number of folds is 3.

 

Q: What do “supported” and “unsupported” mean in the predicted vs actual plot?

A: They describe whether a point lies “within” (supported) or “on the boundary of” (unsupported) the training data. Supported predictions are typically more accurate and have lower uncertainty. For problems with little data, most points will be unsupported; adding more data will help.

 

Q: What is the GTME and how is it calculated?

A: GTME stands for “Guess The Mean Error”, which is the RMSE that the model would have if it always guessed the average value. The “RMSE as fraction of naive model performance (GTME)” is the model's RMSE divided by the GTME, so it is almost always between 0 and 1, where 0 is a perfect model and 1 is equivalent to guessing the mean.
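
As a rough illustration, the ratio can be computed as in the sketch below. This is not Citrination's implementation: the function names are hypothetical, and it assumes the mean is taken over the same actual values that the RMSE is evaluated on.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def gtme(actual):
    """RMSE of a naive model that always guesses the mean of `actual`.

    Assumption: the mean is computed from the values being scored.
    """
    mean = sum(actual) / len(actual)
    return rmse(actual, [mean] * len(actual))


def rmse_fraction_of_gtme(actual, predicted):
    """Model RMSE divided by GTME: 0 is perfect, 1 matches guessing the mean."""
    return rmse(actual, predicted) / gtme(actual)


# Made-up values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
ratio = rmse_fraction_of_gtme(actual, predicted)
print(f"RMSE as fraction of GTME: {ratio:.3f}")
```

A model that does worse than guessing the mean gives a ratio above 1, which is why the value is "almost always" rather than always below 1.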


Q: When used through the Citrination platform, does Lolo do any hyperparameter optimization?

A: Presently, Lolo is used as a random forest imbued with uncertainty estimates. The best practice for such forests is for them to be grown to full depth with as many trees as possible. Therefore, the Citrination use case doesn't have any true hyperparameters: more trees are always better.


Q: For the histogram of the residuals, how does the Gaussian ("ideal" curve) get set?

A: The histogram shows standard residuals, which are defined as the prediction error (the difference between the predicted and actual values) divided by the predicted uncertainty. When the models produce uncertainty estimates, they are intended to communicate the width of a normal distribution of output values. If the predicted uncertainty is “well calibrated”, the standard residuals should therefore follow a normal distribution (the ideal bell curve), and that is why the ideal curve is normal.
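
A minimal sketch of the calculation follows. The function name and the sample numbers are made up for illustration, and the sign convention (predicted minus actual) is an assumption; it does not affect the shape of the histogram.

```python
import statistics


def standard_residuals(actual, predicted, uncertainty):
    """Error of each prediction in units of its own predicted uncertainty.

    Assumption: the error is taken as predicted minus actual.
    """
    return [(p - a) / u for a, p, u in zip(actual, predicted, uncertainty)]


# Made-up values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.2, 1.7, 3.1, 4.4]
uncertainty = [0.2, 0.3, 0.1, 0.4]

z = standard_residuals(actual, predicted, uncertainty)

# With well-calibrated uncertainties and enough points, the mean of the
# standard residuals should be near 0 and their spread near 1.
print("mean:", statistics.mean(z))
print("spread:", statistics.pstdev(z))
```

Four points are far too few to judge calibration; the comparison against the bell curve only becomes meaningful with many predictions.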


Q: How are PIFs turned into training rows for machine learning?

A: The PIF is a deeply structured format that, in the general case, does not contain enough information to be flattened unambiguously into a set of key-value pairs. For machine learning on Citrination, we use a particular two-step procedure to turn a PIF into a set of rows:

  1. The properties are grouped together by their conditions.  If two properties have exactly the same condition values, we assume those properties were measured in a consistent context and should be present together in the same row.  Each group of properties and conditions will result in a different set of rows.  Exception: if a property has a single value (list with one element) and no conditions, we assume it is a general property of a material that should be included in every resulting row.
  2. Within each property/condition set, properties and conditions which are lists are zipped together.  Properties and conditions which have a single value (list with one element) are assumed to be general within that particular property/condition set.

For example, consider the following PIF:

{
    "chemicalFormula": "CH4",
    "preparation": [
        {
            "details": [
                {
                    "name": "Annealing temperature",
                    "scalars": [{"value": "1000.2"}]
                }
            ],
            "name": "Prep"
        }
    ],
    "properties": [
        {
            "conditions": [
                {
                    "name": "temperature",
                    "scalars": [
                        {"value": "273"},
                        {"value": "300"},
                        {"value": "400"}
                    ]
                },
                {
                    "name": "pressure",
                    "scalars": [{"value": "1.0"}]
                }
            ],
            "name": "density",
            "scalars": [
                {"value": "2.7"},
                {"value": "3.0"},
                {"value": "1.5"}
            ]
        },
        {
            "conditions": [
                {
                    "name": "temperature",
                    "scalars": [
                        {"value": "273"},
                        {"value": "300"}
                    ]
                }
            ],
            "name": "conductivity",
            "scalars": [
                {"value": "0.03"},
                {"value": "0.01"}
            ]
        }
    ]
}

This would result in 5 rows:

{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "273", "density": "2.7", "pressure": "1.0"}
{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "300", "density": "3.0", "pressure": "1.0"}
{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "400", "density": "1.5", "pressure": "1.0"}
{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "273", "conductivity": "0.03"}
{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "300", "conductivity": "0.01"}
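
The two-step procedure above can be sketched roughly as follows. This is an illustrative reading of the rules, not Citrination's code: the function names are hypothetical, only "chemicalFormula", properties and conditions are handled (preparation details are left out for brevity), and all multi-valued lists within one group are assumed to have the same length.

```python
def values_of(field):
    """Return the list of scalar values on a property or condition."""
    return [s["value"] for s in field.get("scalars", [])]


def pif_to_rows(pif):
    """Flatten a PIF-like dict into training rows (hypothetical sketch)."""
    general = {}  # columns copied into every row
    if "chemicalFormula" in pif:
        general["chemicalFormula"] = pif["chemicalFormula"]

    # Step 1: group properties that share exactly the same conditions.
    groups = {}  # condition signature -> {column name: list of values}
    for prop in pif.get("properties", []):
        conditions = prop.get("conditions", [])
        values = values_of(prop)
        if not conditions and len(values) == 1:
            # Exception: a single-valued, condition-free property is general.
            general[prop["name"]] = values[0]
            continue
        key = tuple(sorted((c["name"], tuple(values_of(c))) for c in conditions))
        columns = groups.setdefault(
            key, {c["name"]: values_of(c) for c in conditions}
        )
        columns[prop["name"]] = values

    # Step 2: zip the lists in each group; single values are repeated.
    rows = []
    for columns in groups.values():
        n_rows = max(len(v) for v in columns.values())
        for i in range(n_rows):
            row = dict(general)
            for name, vals in columns.items():
                row[name] = vals[0] if len(vals) == 1 else vals[i]
            rows.append(row)
    return rows


def scalars(*vals):
    """Helper for building the example: wrap values as PIF scalars."""
    return [{"value": v} for v in vals]


# The example PIF from above, without the preparation block.
pif = {
    "chemicalFormula": "CH4",
    "properties": [
        {
            "name": "density",
            "scalars": scalars("2.7", "3.0", "1.5"),
            "conditions": [
                {"name": "temperature", "scalars": scalars("273", "300", "400")},
                {"name": "pressure", "scalars": scalars("1.0")},
            ],
        },
        {
            "name": "conductivity",
            "scalars": scalars("0.03", "0.01"),
            "conditions": [
                {"name": "temperature", "scalars": scalars("273", "300")},
            ],
        },
    ],
}

rows = pif_to_rows(pif)
for row in rows:
    print(row)
```

Run on the example, this yields five rows: three for density and two for conductivity, because the two properties have different condition values and so land in separate groups.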


 

