SKEDSOFT

Six Sigma

Preparing “Flat Files” and Missing Data:

a)      Probably the hardest step in analyzing “on-hand” data is getting the data into a format that regression software can use. The term “fields” refers to factors (columns) in the database, “points” refers to rows, and “entries” refers to individual field values for specific data points.

b)      The term “flat file” refers to a database of entries that are formatted well enough to facilitate easy analysis with software. The process of creating flat files often requires over 80% of the analysis time.

c)       Also, piecing together a database from multiple sources often produces a flat file with many missing entries. If the missing entries relate to factors not included in the model, then these entries are not relevant. In other cases, several approaches can be considered to address the missing entries. The simplest strategy (Strategy 1) is to remove all points that have missing entries from the database before fitting the model. Many software packages such as Regression implement this automatically.

d)      In general, removing the data points with missing entries can be the safest, most conservative approach, yielding the highest standard of evidence possible.

e)      However, in some cases other strategies are of interest and might even increase the believability of the results. For these cases, a common strategy is to substitute an average value for the missing responses and then see how sensitive the final results are to changes in these made-up values (Strategy 2). Reasons for adopting this second strategy could be:

1.      The missing entries constitute a sizable fraction of entries in the database and many completed entries would be lost if the data points were discarded.

2.      The data most relevant to the questions being asked contain missing entries.

Making up data should always be done with caution, and any relevant documentation should clearly label what is made up.
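The two strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the flat file, the field names x1, x2, and y, the helper names `fit_linear` and `fill_missing`, and the choice to shift the filled-in values by one standard deviation are all assumptions made for the example. It also fills any missing entry, input or response, with its field average.

```python
import numpy as np

# Hypothetical flat file: rows are data points, columns are the fields x1, x2, y.
# np.nan marks a missing entry; all numbers are made up for illustration.
data = np.array([
    [1.0, 10.0, 12.1],
    [2.0, np.nan, 14.2],
    [3.0, 14.0, 17.9],
    [4.0, 18.0, np.nan],
    [5.0, 20.0, 24.0],
    [6.0, 25.0, 27.8],
])

def fit_linear(rows):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns [b0, b1, b2]."""
    X = np.column_stack([np.ones(len(rows)), rows[:, 0], rows[:, 1]])
    coef, *_ = np.linalg.lstsq(X, rows[:, 2], rcond=None)
    return coef

# Strategy 1: remove every data point that has at least one missing entry.
complete = data[~np.isnan(data).any(axis=1)]
coef_drop = fit_linear(complete)
print("Strategy 1 coefficients:", np.round(coef_drop, 3))

def fill_missing(rows, shift=0.0):
    """Strategy 2: replace each missing entry with its field average,
    optionally shifted by `shift` field standard deviations."""
    means = np.nanmean(rows, axis=0)
    stds = np.nanstd(rows, axis=0)
    return np.where(np.isnan(rows), means + shift * stds, rows)

# Sensitivity check: refit while moving the made-up values up and down.
for shift in (-1.0, 0.0, 1.0):
    coef = fit_linear(fill_missing(data, shift))
    print(f"Strategy 2, shift {shift:+.0f} sd:", np.round(coef, 3))
```

If the coefficients change little across the shifts, the conclusions do not depend strongly on the made-up values; large swings suggest falling back on Strategy 1.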

 

Evaluating Models and DOE Theory:

To a great extent, analyzing a flat file using regression is an art. Determining which terms should be included in the functional form is not obvious unless one of the design of experiments (DOE) methods in previous chapters has been applied to planning and data collection. Even if a DOE method and randomization have been applied, several checks are still necessary before the results can be treated as proof.

 

Variance Inflation Factors (VIFs) are numbers that indicate whether reliable predictions and inferences can be derived from the combination of model form and input pattern. A common rule is that every VIF must be less than 10. Note that this rule applies only to formulas involving “standardized” inputs.
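As a rough illustration of the rule above, the sketch below computes VIFs for a first-order model, where each VIF equals 1 / (1 - R_j^2) and R_j^2 comes from regressing input j on the other inputs. It uses the equivalent shortcut of taking the diagonal of the inverse correlation matrix of the standardized inputs. The function name `variance_inflation_factors` and both input patterns are assumptions made for the example.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (n points by k inputs, no constant column)."""
    X = np.asarray(X, dtype=float)
    # Standardize each input to zero mean and unit standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Correlation matrix of the inputs; the VIFs are the diagonal of its inverse.
    R = (Z.T @ Z) / len(Z)
    return np.diag(np.linalg.inv(R))

# A planned two-level factorial pattern: the inputs are uncorrelated.
factorial = [[-1, -1], [1, -1], [-1, 1], [1, 1]]
print("Factorial VIFs:", variance_inflation_factors(factorial))

# Hypothetical on-hand data where the third input nearly equals x1 + x2.
on_hand = np.array([
    [1, 2, 3.1],
    [2, 1, 2.9],
    [3, 4, 7.2],
    [4, 3, 6.8],
    [5, 6, 11.1],
    [6, 5, 10.9],
])
vifs = variance_inflation_factors(on_hand)
print("On-hand VIFs:", np.round(vifs, 1))
print("Passes the VIF < 10 rule:", bool((vifs < 10).all()))
```

The planned pattern gives VIFs of 1, the best possible value, while the nearly redundant third input in the on-hand data pushes the VIFs far above 10.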

 

Normal Plots of Residuals are graphs that indicate whether the hypothesis tests on coefficients can be trusted and whether specific data points are likely to be representative of the systems of interest. Generally, points that fall far from the line are potential outliers.
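The coordinates of such a plot can be computed without any plotting software, as sketched below. Sorted residuals are paired with standard normal scores; if the residuals are roughly normal, the pairs fall near a straight line. The plotting position (i - 0.5) / n is one common convention and is an assumption here, as are the function name `normal_plot_points` and the residual values.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Return (normal_score, residual) pairs for a normal probability plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Expected standard normal value for the i-th smallest residual.
    scores = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(scores, ordered))

# Hypothetical residuals from a fitted model; 2.5 is a suspicious value.
residuals = [0.3, -0.2, 0.1, -0.4, 0.05, -0.1, 2.5, 0.2]
for score, resid in normal_plot_points(residuals):
    print(f"normal score {score:6.2f}   residual {resid:6.2f}")
```

Plotting the pairs with the normal score on one axis and the residual on the other gives the graph described above; the last pair would sit well off the line formed by the rest.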

 

Summary Statistics are numbers that describe the goodness of fit. For example, R² prediction describes the fraction of the variation in the response that is explainable by the data. It cannot always be calculated, but when it is available it is relatively reliable.
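One common way to compute R² prediction is from the PRESS statistic, the sum of squared leave-one-out prediction errors. The sketch below assumes that definition and an ordinary least-squares fit; the function name `r2_summary` and the data are hypothetical. When a data point has leverage equal to 1, the division below fails, which is one reason the statistic cannot always be calculated.

```python
import numpy as np

def r2_summary(X, y):
    """Return (R2, R2_prediction) for a least-squares fit of y on X.
    X must already include a column of ones for the constant term."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    H = X @ np.linalg.pinv(X)          # hat matrix: fitted values are H @ y
    resid = y - H @ y                  # ordinary residuals
    sst = np.sum((y - y.mean()) ** 2)  # total variation in the response
    r2 = 1.0 - np.sum(resid ** 2) / sst
    # Leave-one-out residual for point i is resid[i] / (1 - leverage[i]).
    press = np.sum((resid / (1.0 - np.diag(H))) ** 2)
    return r2, 1.0 - press / sst

# Hypothetical data: one input and a nearly linear response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
X = np.column_stack([np.ones_like(x), x])

r2, r2_pred = r2_summary(X, y)
print(f"R2 = {r2:.4f}, R2 prediction = {r2_pred:.4f}")
```

R² prediction is never larger than the ordinary R², because each leave-one-out error is at least as large in magnitude as the corresponding residual.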