Machine Learning FAQs

How many folds are used for k-fold cross-validation?
The number of folds depends on the size of the training set, balancing accuracy against computational time. Smaller training sets use more folds than larger training sets. The minimum number of folds is 3.


When used through the Citrination platform, does Lolo do any hyperparameter optimization?
No. Currently, Lolo is used as a random forest that provides uncertainty estimates. The best practice for such forests is to grow them to full depth with as many trees as possible. Therefore, the Citrination use case doesn't have any true hyperparameters to optimize: more trees are always better.


What do “not extrapolating” and “extrapolating” mean in the Predicted vs. Actual plot?
They describe the placement of the point “within” or “on the boundary of” the training data, respectively. Typically, "not extrapolating" predictions are more accurate and have lower uncertainty. For problems with a small amount of data, most of the points will be "extrapolating." Adding more data will help!


What is the Non-dimensional model error and how is it calculated?
The Non-dimensional model error (NDE) is the ratio of the root mean square error (RMSE) of the predictions to the standard deviation of the data. The standard deviation can be thought of as the error the model would have if it always guessed the mean value. Therefore, the NDE is almost always between 0 and 1, where 0 is a perfect model and 1 is equivalent to guessing the mean.
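
As a rough illustration of the ratio described above, the following Python sketch computes the NDE from lists of actual and predicted values. The function name is hypothetical, and the use of the population standard deviation (dividing by n) is an assumption; it is not taken from the Citrination implementation.

```python
import math


def non_dimensional_error(actual, predicted):
    """Return RMSE divided by the standard deviation of the actual values.

    Assumption: the population standard deviation (divide by n) is used.
    """
    n = len(actual)
    mean = sum(actual) / n
    # Root mean square error of the predictions.
    rmse = math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)
    # Spread of the data, i.e. the RMSE of always guessing the mean.
    std = math.sqrt(sum((a - mean) ** 2 for a in actual) / n)
    if std == 0:
        raise ValueError("standard deviation is zero; NDE is undefined")
    return rmse / std


actual = [1.0, 2.0, 3.0, 4.0]
perfect = non_dimensional_error(actual, actual)             # perfect model
mean_guess = non_dimensional_error(actual, [2.5] * 4)       # always guess the mean
```

Here a perfect model gives an NDE of 0, and always guessing the mean gives an NDE of 1.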


For the histogram of the residuals, how does the Gaussian ("ideal" curve) get set?
The histogram shows standard residuals, which are defined as the difference between the predicted and actual values divided by the predicted uncertainty. When the models produce uncertainty estimates, those estimates are intended to communicate the width of a normal distribution of output values. If the predicted uncertainty is “well calibrated,” the standard residuals should therefore follow a normal distribution with unit width, which is the ideal bell curve shown on the plot.
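
The definition above can be sketched in a few lines of Python. The function names are hypothetical, and the calibration check (the fraction of standard residuals within one unit of zero, which is about 68% for a unit normal distribution) is an illustrative addition rather than something the platform is known to report.

```python
def standard_residuals(actual, predicted, uncertainty):
    """Return (predicted - actual) / predicted uncertainty for each point."""
    return [(p - a) / s for a, p, s in zip(actual, predicted, uncertainty)]


def fraction_within(residuals, k=1.0):
    """Return the fraction of residuals with absolute value at most k."""
    return sum(1 for r in residuals if abs(r) <= k) / len(residuals)


# Hypothetical values, for illustration only.
residuals = standard_residuals(
    actual=[1.0, 2.0],
    predicted=[1.5, 1.0],
    uncertainty=[0.5, 2.0],
)
```

If the uncertainties are well calibrated, `fraction_within(residuals, 1.0)` should approach roughly 0.68 as the number of points grows.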


How are PIFs turned into training rows for machine learning?
The PIF is a deeply structured format that does not contain sufficient information to unambiguously flatten into a set of key-value pairs in the general case. In the context of machine learning on Citrination, we use a particular two-step procedure to turn a PIF into a set of rows:

  1. The properties are grouped together by their conditions.  If two properties have exactly the same condition values, we assume those properties were measured in a consistent context and should be present together in the same row.  Each group of properties and conditions will result in a different set of rows.  Exception: if a property has a single value (list with one element) and no conditions, we assume it is a general property of a material that should be included in every resulting row.
  2. Within each property/condition set, properties and conditions which are lists are zipped together.  Properties and conditions which have a single value (list with one element) are assumed to be general within that particular property/condition set.

For example, consider the following PIF:

{
    "chemicalFormula": "CH4",
    "preparation": [
        {
            "details": [
                {
                    "name": "Annealing temperature",
                    "scalars": [{"value": "1000.2"}]
                }
            ],
            "name": "Prep"
        }
    ],
    "properties": [
        {
            "conditions": [
                {
                    "name": "temperature",
                    "scalars": [
                        {"value": "273"},
                        {"value": "300"},
                        {"value": "400"}
                    ]
                },
                {
                    "name": "pressure",
                    "scalars": [{"value": "1.0"}]
                }
            ],
            "name": "density",
            "scalars": [
                {"value": "2.7"},
                {"value": "3.0"},
                {"value": "1.5"}
            ]
        },
        {
            "conditions": [
                {
                    "name": "temperature",
                    "scalars": [
                        {"value": "273"},
                        {"value": "300"}
                    ]
                }
            ],
            "name": "conductivity",
            "scalars": [
                {"value": "0.03"},
                {"value": "0.01"}
            ]
        }
    ]
}

This would result in 5 rows:

{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "273", "density": "2.7", "pressure": "1.0"}
{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "300", "density": "3.0", "pressure": "1.0"}
{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "400", "density": "1.5", "pressure": "1.0"}
{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "273", "conductivity": "0.03"}
{"chemicalFormula": "CH4", "annealing temperature": "1000.2", "temperature": "300", "conductivity": "0.01"}
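
The two-step procedure can be sketched in Python as follows. This is an illustration only, not the Citrination implementation: the names scalar_values and pif_to_rows are hypothetical, lowercasing the preparation detail names is an assumption made to match the rows shown above, and list-valued fields within one group are assumed to have equal length.

```python
def scalar_values(field):
    """Return the list of scalar values stored in a PIF field."""
    return [s["value"] for s in field.get("scalars", [])]


def pif_to_rows(pif):
    # Values that appear in every row.
    base = {"chemicalFormula": pif["chemicalFormula"]}
    for step in pif.get("preparation", []):
        for detail in step.get("details", []):
            # Assumption: detail names are lowercased, as in the example rows.
            base[detail["name"].lower()] = scalar_values(detail)[0]

    # Step 1: group properties that share exactly the same condition values.
    groups = {}
    for prop in pif.get("properties", []):
        values = scalar_values(prop)
        conds = {c["name"]: scalar_values(c) for c in prop.get("conditions", [])}
        if not conds and len(values) == 1:
            # Exception: a single-valued property with no conditions
            # is treated as general and included in every row.
            base[prop["name"]] = values[0]
            continue
        key = tuple(sorted((name, tuple(v)) for name, v in conds.items()))
        group = groups.setdefault(key, dict(conds))
        group[prop["name"]] = values

    # Step 2: zip list-valued fields; single values repeat in each row.
    rows = []
    for columns in groups.values():
        length = max(len(v) for v in columns.values())
        for i in range(length):
            row = dict(base)
            for name, vals in columns.items():
                row[name] = vals[0] if len(vals) == 1 else vals[i]
            rows.append(row)
    return rows


# The example PIF from above, written compactly.
example_pif = {
    "chemicalFormula": "CH4",
    "preparation": [{
        "name": "Prep",
        "details": [{"name": "Annealing temperature",
                     "scalars": [{"value": "1000.2"}]}],
    }],
    "properties": [
        {"name": "density",
         "scalars": [{"value": "2.7"}, {"value": "3.0"}, {"value": "1.5"}],
         "conditions": [
             {"name": "temperature",
              "scalars": [{"value": "273"}, {"value": "300"}, {"value": "400"}]},
             {"name": "pressure", "scalars": [{"value": "1.0"}]},
         ]},
        {"name": "conductivity",
         "scalars": [{"value": "0.03"}, {"value": "0.01"}],
         "conditions": [
             {"name": "temperature",
              "scalars": [{"value": "273"}, {"value": "300"}]},
         ]},
    ],
}

rows = pif_to_rows(example_pif)
```

Under these assumptions, rows holds the same five rows listed above: three for density and two for conductivity.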

If you have more questions, check out our tutorial resources or contact us.
