# Data Views FAQs

## General Data Views

**What is a data view?**

A data view is the Citrination platform's way of performing data analysis over one or more datasets.

**What does the *Populate With Data* button do?**

The *Populate With Data* button uses a machine learning model to fill in gaps in the *output* properties in the dataset.

**Can I keep my data view private? Can I share it?**

By default, the public Citrination platform keeps your data views private, so that other users cannot access them. In order to share them with other users, click the "Access" tab. From there, you can choose to either make the data view public so that all users can see it, or you can share it with specific groups or teams. To create a new group to share your data view with, click your user name in the top right of the browser window, and click the "Groups" item in the menu.

## Machine Learning Configuration

**On the Set Column Types step, why do I need to specify whether a given column is *Real*, *Categorical*, *Organic*, or *Inorganic*?**

*Real* means that the property has a numerical value and is continuous. For example, the yield strength of a material is a real-valued property; it can be 100 MPa or 101 MPa. We will automatically determine the bounds (Min/Max) on the range of possible values based on the data, or you can set them yourself.

*Categorical* means that the property can take only a few discrete values. For example, Heusler type is a categorical property, because a Heusler alloy can be a full Heusler, half Heusler, or inverse Heusler. You can choose which categories to include for ML.

*Chemical formulas* are treated differently on the Citrination platform. The platform can parse a formula and calculate many different features from it, depending on whether it is organic or inorganic.

**On the Set Column Types step, what is the difference between choosing *Input*, *Output*, *Ignore*, and *Latent Variable*?**

*Inputs* are used as input features to the machine learning model. Each model needs at least one input.

*Outputs* are the properties that you would like the model to predict. Each model needs at least one output.

*Ignore* means that you would like to keep track of this property but don't want to use it as an input or output of the model. Names and sample IDs are often tagged with *Ignore*.

*Latent Variables* are properties that you would like the model to predict, and which you think could also be useful in predicting other outputs. Latent variables are used to build hierarchical models: the predictions for a latent variable are then used as inputs to help predict the other model outputs.

**Which "rows" (materials with properties) will be used for training?**

Only rows with data in *all* inputs and *at least one* output or latent variable are included in the training set (up to 4000 rows by default).

**What does the "Group By Key" checkbox do?**

Checking this box will group the examples by that property when creating folds for cross-validation (e.g. you might not want the same chemical formula appearing in both the training and test sets).
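For illustration, scikit-learn's `GroupKFold` implements the same grouping idea; this is a sketch of the concept with toy data, not the platform's internal implementation:

```python
from sklearn.model_selection import GroupKFold

# Toy data: two rows per chemical formula (the "Group By Key" column).
formulas = ["Fe2O3", "Fe2O3", "TiO2", "TiO2", "SiO2", "SiO2"]
X = [[i] for i in range(6)]

# GroupKFold keeps all rows that share a formula in the same fold, so a
# formula never appears in both the training and test sets of a split.
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, groups=formulas):
    train_formulas = {formulas[i] for i in train_idx}
    test_formulas = {formulas[i] for i in test_idx}
    assert train_formulas.isdisjoint(test_formulas)
```

Without this grouping, an ordinary shuffled split could place duplicate formulas on both sides of a fold, making cross-validation scores look better than they really are.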

## Reports

### Data Summary Tab

**What is the difference between the Pearson Correlation and Distance Correlation?**

The Pearson correlation measures the strength of the *linear* relationship between two variables, while the distance correlation can also detect nonlinear dependence. More information about both can be found on their respective Wikipedia pages.

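The difference can be illustrated numerically: for a symmetric nonlinear relationship, the Pearson correlation is near zero while the distance correlation is not. Below is a minimal NumPy sketch of the sample distance correlation (the platform's computation may differ):

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation via double-centered pairwise distances."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

x = np.linspace(-1, 1, 101)
y = x ** 2  # nonlinear, symmetric dependence
pearson = np.corrcoef(x, y)[0, 1]
dcor = distance_correlation(x, y)
print(round(pearson, 3), round(dcor, 3))  # Pearson ~ 0; distance correlation clearly > 0
```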
**What is the t-SNE projection?**

The t-SNE projection is a dimensionality reduction technique that can be used to visualize high-dimensional data. Like principal component analysis, it is used to project data down to a lower number of dimensions. t-SNE is useful because it preserves neighborhoods; points that are near each other in the original high-dimensional space should still be near each other after the t-SNE projection. This property makes t-SNE useful for visualizing clusters within the data set.

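As a minimal sketch of this behavior using scikit-learn on hypothetical toy data (not the platform's implementation):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy high-dimensional data: two well-separated clusters in 50 dimensions.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(30, 50))
cluster_b = rng.normal(loc=5.0, scale=0.1, size=(30, 50))
X = np.vstack([cluster_a, cluster_b])

# Project down to 2 dimensions; points that were neighbors in 50-D
# stay neighbors in the embedding, so the two clusters remain visible.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(embedding.shape)  # (60, 2)
```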
### Model Report Tab

**What are the Important Features?**

The Important Features are the input features that the machine learning model depends on most strongly. The feature importances can be useful in gaining some insight into how the machine learning model is making its predictions.

**How come some of the Important Features are not input features that I selected for this model?**

When a chemical formula is used as an input, the Citrination platform automatically parses it and generates dozens of formula-based features to use as inputs to the model.

**What is the root mean square error (RMSE)? What is the non-dimensional model error (NDE)?**

- The RMSE is a useful and popular statistical metric for model quality. Lower RMSE means the model is more accurate.
- The NDE is the ratio of the RMSE to the standard deviation of the output variable, making it a useful non-dimensional model quality metric. An NDE of 0 indicates a perfect model; an NDE of 1 indicates an uninformative model (no more accurate than always predicting the mean of the training outputs).
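These two metrics can be computed in a few lines; `predicted` and `actual` here are hypothetical arrays of cross-validated predictions and measured values:

```python
import numpy as np

def rmse_and_nde(predicted, actual):
    """Compute RMSE and the non-dimensional error NDE = RMSE / std(actual)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    rmse = np.sqrt(np.mean((predicted - actual) ** 2))
    nde = rmse / np.std(actual)
    return rmse, nde

# A model that always predicts the mean of the outputs has NDE = 1
# (uninformative), matching the definition above.
actual = np.array([1.0, 2.0, 3.0, 4.0])
baseline = np.full_like(actual, actual.mean())
_, nde = rmse_and_nde(baseline, actual)
print(round(nde, 6))  # 1.0
```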

**What is a good value of the NDE?**

This depends on what you would like to use the model for.

- Generally, we would say that a value of NDE > 0.9 indicates a model with very high uncertainty.
- If 0.9 > NDE > 0.6, this model is typically a good candidate for experimental design.
- Lower values of NDE indicate increasingly accurate models.

**What is the Predicted vs. Actual plot?**

The Predicted vs. Actual plot, sometimes called a *parity plot*, is generated via cross-validation. It compares the model predictions to the actual output values in the training set. A perfect model would have predictions that lie on the line *y = x*. The *Extrapolating* and *Not extrapolating* labels indicate whether the specified points are extrapolations in the input feature space.

**What are the standard residuals?**

The standard residuals are given by (Predicted - Actual)/(Uncertainty Estimate). This plot is useful for assessing whether the uncertainty estimates are well calibrated for this model. If the uncertainty estimates were perfectly calibrated, the standard residuals would follow a standard normal distribution (mean 0, standard deviation 1).
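A sketch of this calibration check, using hypothetical arrays of predictions, measured values, and per-point uncertainty estimates:

```python
import numpy as np

def standard_residuals(predicted, actual, uncertainty):
    """Standard residuals: (predicted - actual) / uncertainty estimate."""
    return (np.asarray(predicted) - np.asarray(actual)) / np.asarray(uncertainty)

# Simulated well-calibrated case: the actual prediction errors have the
# same scale (0.5) as the reported uncertainty estimates.
rng = np.random.default_rng(0)
actual = rng.normal(size=1000)
uncertainty = np.full(1000, 0.5)
predicted = actual + rng.normal(scale=0.5, size=1000)

r = standard_residuals(predicted, actual, uncertainty)
print(round(r.mean(), 2), round(r.std(), 2))  # close to 0 and 1
```

If the reported uncertainties were too small, the residuals would spread wider than a standard normal; if too large, narrower.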

## Predict and Design

Rather than making a single prediction, Citrination uses cutting-edge uncertainty quantification to assess model uncertainty. These uncertainty estimates are discussed in detail in this paper.

**What can I use Design for?**

Design can be used to solve the inverse design problem of determining the inputs that match specifications for the output properties. For example, Design can be used to suggest alloy compositions that might match specifications for mechanical properties.

**In Design, what are MEI and MLI scores?**

- The MEI (Maximum Expected Improvement) strategy simply selects the candidate with the highest (or lowest, for minimization) predicted target value among all unmeasured candidates. The MEI score is normalized from 0 to 100 and corresponds to the likelihood that a candidate will meet your objective (minimization or maximization) and satisfy any constraints.
- The MLI (Maximum Likelihood of Improvement) strategy selects the unmeasured candidate that is most likely to have a higher (or lower, for minimization) target value than the best previously measured material. This strategy incorporates uncertainty estimates in its prediction. The MLI score, normalized from 0 to 100, indicates the percent likelihood that a given candidate will outperform all materials in the training data in terms of optimizing the target and satisfying constraints. Because this score accounts for both the predicted output values and their uncertainty, it tends to suggest experiments that explore new regions of design space (regions with high uncertainty).
- Candidates with a known value for the target output (i.e. those in your dataset) will have a score of 0.
- MEI and MLI scores are explained in more detail in this paper.
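The likelihood-of-improvement idea can be sketched with a normal-CDF calculation; this illustrates the general acquisition-function concept, not Citrination's exact scoring formula:

```python
import math

def likelihood_of_improvement(pred_mean, pred_std, best_measured):
    """Probability that a candidate's true value exceeds the best measured
    value, assuming the prediction is normally distributed (mean, std)."""
    z = (pred_mean - best_measured) / pred_std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A candidate predicted slightly below the current best, but with high
# uncertainty, still has a sizable chance of improvement.
print(round(likelihood_of_improvement(9.5, 2.0, 10.0), 3))  # 0.401

# A confident prediction just below the best has essentially no chance.
print(round(likelihood_of_improvement(9.5, 0.1, 10.0), 6))  # essentially 0
```

This is why an MLI-style strategy can favor uncertain candidates over ones with slightly better predicted means: the uncertainty itself raises the chance of beating the best measured material.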