Data Views FAQs
General Data Views
A data view is the Citrination way of performing data analysis over one or more datasets.
The Populate With Data button uses a machine learning model to fill in gaps in the output properties in the dataset.
By default, the public Citrination platform keeps your data views private, so that other users cannot access them. To share a data view with other users, click the "Access" tab. From there, you can either make the data view public so that all users can see it, or share it with specific groups or teams. To create a new group to share your data view with, click your user name in the top right of the browser window, and click the "Groups" item in the menu.
Machine Learning Configuration
On the Set Column Types step, why do I need to specify whether a given column is Real, Categorical, Organic, or Inorganic?
- Real means that the property has a numerical value and is continuous. For example, the yield strength of a material is a real-valued property; it can be 100 MPa or 101 MPa. We will automatically determine the bounds (Min/Max) on the range of possible values based on the data, or you can set them yourself.
- Categorical means that the property can have only a few discrete values. For example, Heusler type is a categorical property, because a Heusler alloy can be a full Heusler, half Heusler, or inverse Heusler. You can choose which categories to include for ML.
- Organic and Inorganic apply to chemical formulas, which are treated differently in the Citrination platform. Our platform parses these formulas and calculates many different features from each formula, taking into account whether it is organic or inorganic.
- Inputs will be used as input features to the machine learning model. Each model needs at least one input.
- Outputs are the properties that you would like the model to predict. Each model needs at least one output.
- Ignore means that you would like to keep track of this property, but don't want to use it as an input or output of the model. Often names and sample IDs are tagged with Ignore.
- Latent Variables are properties that you would like the model to predict, and which you think could also be useful in predicting other outputs. The latent variables are used to build hierarchical models, so the predictions of the latent variable are then used as inputs to help predict the other model outputs.
Reports
More information about the Pearson correlation and the distance correlation can be found on their respective Wikipedia pages, here and here.
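As a rough illustration of how the two measures differ, the sketch below computes both with the standard textbook definitions in plain Python. It is not the platform's implementation; the function names and the sample data are hypothetical. The example uses a symmetric quadratic relationship, where the Pearson correlation is zero because there is no linear trend, while the distance correlation is clearly positive because the variables are still dependent.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation: strength of the linear relationship between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def _centered_distances(values):
    """Pairwise distance matrix with row, column, and grand means removed."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]  # matrix is symmetric, so these are also the column means
    grand_mean = sum(row_means) / n
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Mean of the element-wise product of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2


def distance_correlation(x, y):
    """Sample distance correlation for one-dimensional x and y (0 means no detected dependence)."""
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = max(_mean_product(a, b), 0.0)  # guard against tiny negative rounding errors
    dvar = _mean_product(a, a) * _mean_product(b, b)
    if dvar == 0.0:
        return 0.0
    return math.sqrt(dcov2 / math.sqrt(dvar))


# Hypothetical data: y = x^2 on a symmetric range of x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]
print("Pearson:", pearson_correlation(x, y))
print("Distance:", distance_correlation(x, y))
```

A feature with a near-zero Pearson correlation but a sizable distance correlation is a hint that its relationship to the output is nonlinear rather than absent.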
The t-SNE projection is a dimensionality reduction technique that can be used to visualize high-dimensional data. Like principal component analysis, it is used to project data down to a lower number of dimensions. t-SNE is useful because it preserves neighborhoods: points that are near each other in the original high-dimensional space should still be near each other after the t-SNE projection. This property makes t-SNE useful for visualizing clusters within the dataset. More information on t-SNE can be found here.
Model Report Tab
The Important Features are the input features that the machine learning model depends on most strongly. The feature importances can be useful in gaining some insight into how the machine learning model is making its predictions.
When a chemical formula is used as an input to the model, the Citrination platform automatically parses that formula and generates dozens of features from it to use as inputs to the model.
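To give a feel for what "parsing a formula into features" can mean, here is a minimal sketch that turns a simple formula string into element amounts and atomic fractions. It is only an illustration: the function names are hypothetical, the parser handles only flat formulas without parentheses, and the features Citrination actually generates are not described here.

```python
import re

# An element symbol (capital letter plus optional lowercase letter) followed by an optional amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula):
    """Return {element: amount} for a flat formula such as 'Fe2O3' (no parentheses)."""
    amounts = {}
    for element, count in _TOKEN.findall(formula):
        amounts[element] = amounts.get(element, 0.0) + (float(count) if count else 1.0)
    return amounts


def atomic_fractions(formula):
    """Return {element: fraction of atoms} so that the fractions sum to 1."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    return {element: amount / total for element, amount in amounts.items()}


# Example features derived from a single formula string.
fractions = atomic_fractions("Fe2O3")
features = {
    "number_of_elements": len(fractions),
    "max_atomic_fraction": max(fractions.values()),
}
print(fractions)
print(features)
```

Real featurizers typically go further, for example by combining these fractions with tabulated elemental properties, but that step depends on data not shown here.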
- The RMSE (root mean squared error) is a useful and popular statistical metric for model quality. Lower RMSE means the model is more accurate.
- The NDE is the ratio between the RMSE and the standard deviation of the output variable. The NDE is a useful non-dimensional model quality metric. A value of NDE=0 is a perfect model. If NDE=1, then the model is uninformative: it does no better than always predicting the mean of the output.
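The two metrics above can be sketched in a few lines of Python. This is an illustrative calculation only; the function names are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the exact convention used by the platform is not stated.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predictions and actual values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def nde(predicted, actual):
    """RMSE divided by the standard deviation of the actual values.

    Assumption: population standard deviation (divide by n).
    """
    n = len(actual)
    mean = sum(actual) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in actual) / n)
    return rmse(predicted, actual) / std


# Hypothetical yield strengths in MPa.
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [105.0, 118.0, 135.0, 166.0]
print("RMSE:", rmse(predicted, actual))
print("NDE:", nde(predicted, actual))

# A "model" that always predicts the mean has NDE = 1 under this convention.
mean_only = [sum(actual) / len(actual)] * len(actual)
print("NDE of mean-only model:", nde(mean_only, actual))
```

Because the NDE divides out the spread of the data, it can be compared across outputs with different units, which the RMSE alone cannot.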
What is a good value of the NDE?
This depends on what you would like to use the model for.
- Generally, we would say that a value of NDE > 0.9 indicates a model with very high uncertainty.
- If 0.9 > NDE > 0.6, this model is typically a good candidate for experimental design.
- Lower values of NDE indicate increasingly accurate models.
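The rules of thumb above can be written as a small helper. This is a sketch with a hypothetical function name; how values exactly equal to 0.9 or 0.6 are classified is an assumption, since the guidance does not say.

```python
def describe_nde(nde):
    """Map an NDE value to the rule-of-thumb guidance for that range.

    Assumption: the boundary values 0.9 and 0.6 fall into the lower band.
    """
    if nde > 0.9:
        return "very high uncertainty"
    if nde > 0.6:
        return "good candidate for experimental design"
    return "increasingly accurate as NDE decreases"


for value in (0.95, 0.75, 0.3):
    print(value, "->", describe_nde(value))
```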
What is the Predicted vs. Actual plot?
The Predicted vs. Actual plot, sometimes called a parity plot, is generated via cross-validation. It compares the model predictions to the actual output values in the training set. A perfect model would have predictions that lie on the line y=x. The Extrapolating and Not extrapolating labels indicate whether the specified points are extrapolations in the input feature space.
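The idea of generating parity-plot points by cross-validation can be sketched as follows: each point is predicted by a model that was trained without it. This is a simplified stand-in, not the platform's procedure. It uses a one-input least-squares line instead of the platform's machine learning model, the function names are hypothetical, and the extrapolation flag (input outside the range seen in training) is an assumed definition chosen only for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_val_parity(xs, ys, n_folds=3):
    """Return one (actual, predicted, extrapolating) tuple per data point.

    Each point is predicted by a model fitted on the other folds only.
    """
    points = [None] * len(xs)
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        train_x = [xs[i] for i in train]
        slope, intercept = fit_line(train_x, [ys[i] for i in train])
        low, high = min(train_x), max(train_x)
        for i in test:
            predicted = slope * xs[i] + intercept
            extrapolating = not (low <= xs[i] <= high)  # assumed definition
            points[i] = (ys[i], predicted, extrapolating)
    return points


# Hypothetical data that lies exactly on y = 2x + 1, so every point falls on y = x in the parity plot.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
for actual, predicted, extrapolating in cross_val_parity(xs, ys):
    print(actual, predicted, extrapolating)
```

Plotting predicted against actual for these tuples, colored by the flag, gives the kind of picture described above.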
The standard residuals are given by (Predicted - Actual)/(Uncertainty Estimate). This plot is useful for assessing whether the uncertainty estimates are well-calibrated for this model. In the case of perfectly calibrated uncertainty estimates, this distribution would be expected to follow a standard normal distribution (mean 0, standard deviation 1).
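A quick way to apply the formula above is to compute the residuals and check what fraction falls within one and two units of zero. For a standard normal distribution these fractions are about 68% and 95%. The sketch below uses hypothetical function names and made-up numbers.

```python
def standard_residuals(predicted, actual, uncertainty):
    """(Predicted - Actual) / (Uncertainty Estimate) for each point."""
    return [(p - a) / u for p, a, u in zip(predicted, actual, uncertainty)]


def fraction_within(residuals, k=1.0):
    """Fraction of residuals whose absolute value is at most k."""
    return sum(1 for r in residuals if abs(r) <= k) / len(residuals)


# Hypothetical predictions, measurements, and uncertainty estimates.
predicted = [10.0, 12.0, 9.0, 15.0]
actual = [11.0, 12.0, 10.0, 12.0]
uncertainty = [1.0, 2.0, 2.0, 1.5]

residuals = standard_residuals(predicted, actual, uncertainty)
print(residuals)
print("within 1:", fraction_within(residuals, 1.0))
print("within 2:", fraction_within(residuals, 2.0))
```

With a realistic number of points, a distribution much wider than a standard normal suggests the uncertainty estimates are too small, and a much narrower one suggests they are too large.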
Predict and Design
Rather than making a single prediction, Citrination uses cutting-edge uncertainty quantification to assess model uncertainty. These uncertainty estimates are discussed in detail in this paper.
- The MEI (Maximum Expected Improvement) strategy simply selects the candidate with the highest (or lowest, for minimization) predicted target value among all un-measured candidates. The MEI Score is normalized from 0-100 and corresponds to the likelihood that candidates will meet your objective (minimization or maximization) and satisfy any constraints.
- The MLI (Maximum Likelihood of Improvement) strategy selects the un-measured candidate that is the most likely to have a higher (or lower, for minimization) target value than the best previously measured material. This strategy incorporates uncertainty estimates in its prediction. The MLI Score, normalized between 0 and 100, indicates the percent likelihood of a given candidate to out-perform all the materials in the training data in terms of optimizing the target and satisfying constraints. This score takes into account both predicted output values and uncertainty. As a result, MLI tends to suggest experiments that explore new, high-uncertainty regions of the design space.
- Candidates with a known value for the target output (i.e. those in your dataset) will have a score of 0.
- MEI and MLI scores are explained in more detail in this paper.
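The difference between the two strategies can be sketched with a toy example. The code below is an illustration under stated assumptions, not the platform's scoring: it treats each prediction as a normal distribution with the given mean and uncertainty, ignores constraints, and uses hypothetical function and field names. It computes only a likelihood of improvement; the MEI Score normalization is not reproduced because its formula is not given here.

```python
import math


def mli_score(mean, sigma, best_measured, maximize=True):
    """Percent likelihood that a candidate beats the best measured value.

    Assumption: the prediction is normally distributed with this mean and sigma.
    """
    z = (mean - best_measured) / sigma
    if not maximize:
        z = -z
    return 100.0 * 0.5 * math.erfc(-z / math.sqrt(2.0))  # 100 * normal CDF at z


def select_mei(candidates, maximize=True):
    """Pick the candidate with the best predicted value, ignoring uncertainty."""
    pick = max if maximize else min
    return pick(candidates, key=lambda c: c["mean"])["name"]


def select_mli(candidates, best_measured, maximize=True):
    """Pick the candidate most likely to improve on the best measured value."""
    return max(
        candidates,
        key=lambda c: mli_score(c["mean"], c["sigma"], best_measured, maximize),
    )["name"]


# Hypothetical un-measured candidates and best measured value (maximization).
candidates = [
    {"name": "A", "mean": 10.0, "sigma": 0.5},  # good prediction, low uncertainty
    {"name": "B", "mean": 9.0, "sigma": 3.0},   # lower prediction, high uncertainty
]
best_measured = 10.5

print("MEI picks:", select_mei(candidates))
print("MLI picks:", select_mli(candidates, best_measured))
for c in candidates:
    print(c["name"], mli_score(c["mean"], c["sigma"], best_measured))
```

In this example the two strategies disagree: the candidate with the highest predicted value is unlikely to beat the current best because its uncertainty is small, while the more uncertain candidate has a larger chance of doing so.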