Data Science Lifecycle

In our data science projects, we follow the data science lifecycle proposed in the “Cross-Industry Standard Process for Data Mining” (CRISP-DM) by Wirth and Hipp [2000].

Note

To learn more about this framework, review this presentation about CRISP-DM.

[Figure: CRISP-DM]

Next, we show the most crucial steps of the framework.

Business understanding

  1. Define your (business) goal

  2. Frame the problem (regression, classification, …)

  3. Choose a performance measure (RMSE, …)

  4. Show the data processing components (data pipeline)
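
To make the performance measure from step 3 concrete, here is a minimal sketch of RMSE in plain Python. The function name `rmse` and the example values are hypothetical and chosen only for illustration; in practice a modeling library would usually provide this metric.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observed values and model predictions.
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
score = rmse(observed, predicted)
print(f"RMSE: {score:.3f}")
```

Because the errors are squared before averaging, RMSE penalizes large errors more heavily than small ones, and it is expressed in the same unit as the outcome variable.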

Data understanding

  1. Import data

  2. Clean data

  3. Format data properly (numeric or categorical)

  4. Create new variables

  5. Get an overview of the complete data

  6. Split data into training and test sets using stratified sampling

  7. Discover and visualize the data to gain insights (on a copy of the training data)
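
The stratified split in step 6 could be sketched roughly as follows. This is an illustration in plain Python, not the tooling used in the projects described here; the helper `stratified_split`, its parameters, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.2, seed=42):
    """Split rows into (train, test) so each class keeps a similar share in both."""
    rng = random.Random(seed)

    # Group the rows by their class label (the stratum).
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Take the same fraction from every stratum.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical data: 80 rows of class "a" and 20 rows of class "b".
rows = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
train, test = stratified_split(rows, label_of=lambda row: row[0])
print(len(train), len(test))
```

Sampling within each class separately keeps a rare class from being under-represented in the test set by chance. Exploring and visualizing only the training part (step 7) then avoids tuning decisions to the test data.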

Data preparation

  1. Perform feature selection (choose predictor variables)

  2. Do feature engineering (mainly with recipes)

  3. Create a validation set from the training data (e.g., with k-fold cross-validation)
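
As a rough sketch of how k-fold cross-validation in step 3 carves validation sets out of the training data, the following plain-Python generator produces row indices for each fold. The name `k_fold_indices` and its parameters are hypothetical; resampling libraries offer equivalent, more complete functionality.

```python
import random


def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (analysis_idx, assessment_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]

    for i, assessment in enumerate(folds):
        # All other folds together form the part used for fitting.
        analysis = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield analysis, assessment


splits = list(k_fold_indices(n_rows=10, k=5))
for analysis, assessment in splits:
    print(sorted(analysis), sorted(assessment))
```

Every training row is used for validation exactly once, so the averaged validation score depends less on one lucky or unlucky split than a single hold-out set would.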

Modeling

  1. Specify the models

  2. Bundle the data preprocessing recipe and model in a workflow

  3. Compare model performance on the validation set

  4. Pick the model that does best on the validation set

  5. Train your best model with all of the training data

  6. Double-check that model against the test set
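
The modeling steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the two candidate models (`fit_mean`, `fit_line`), the helper `score`, the synthetic data, and the simple index-based splits are all hypothetical, and plain functions stand in for the recipe and workflow objects mentioned above.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training outcomes."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares line with a single predictor."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Hypothetical data: y = 2x + 1 plus a small alternating disturbance.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]

# Hold out a test set, then carve a validation set out of the training rows.
test_idx = [i for i in xs if i % 5 == 4]
train_idx = [i for i in xs if i % 5 != 4]
valid_idx = train_idx[::4]
fit_idx = [i for i in train_idx if i not in valid_idx]


def score(fit, fit_rows, eval_rows):
    """Fit on fit_rows and return the RMSE on eval_rows."""
    model = fit([xs[i] for i in fit_rows], [ys[i] for i in fit_rows])
    return rmse([ys[i] for i in eval_rows], [model(xs[i]) for i in eval_rows])


# Steps 1 to 4: specify the candidates, compare them on the validation set, pick the best.
candidates = {"mean_baseline": fit_mean, "linear": fit_line}
validation_rmse = {name: score(fit, fit_idx, valid_idx) for name, fit in candidates.items()}
best_name = min(validation_rmse, key=validation_rmse.get)

# Steps 5 and 6: refit the winner on all training rows, then check it once on the test set.
test_rmse = score(candidates[best_name], train_idx, test_idx)
print(best_name, validation_rmse, test_rmse)
```

The test rows are touched only in the last line of the comparison, after the model has been chosen. That keeps the test score an honest estimate of performance on new data rather than another quantity the selection was tuned against.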