Data science and machine learning in engineering and science¶

This class is about data, models, and data analysis in chemical engineering. We will cover topics including:

  • Reading data from different sources (URLs, files, etc.)

  • Visualization of data

  • Working with multidimensional data

  • Using Pandas

  • Using scikit-learn for machine learning

The class will utilize Python and Jupyter notebooks. I am going to assume you have some basic fluency in Python and Jupyter notebooks, e.g., that you are familiar with the content at https://kitchingroup.cheme.cmu.edu/pycse. Everything here builds on that.

Recommended reading:

  1. https://jakevdp.github.io/PythonDataScienceHandbook/ We don’t follow it exactly, but there is a lot of good information, and it is freely available.

The general path we take here is a gradual transition from what is possible with numpy and scipy to Pandas and scikit-learn. You can do a lot with numpy and scipy, but it quickly gets tedious and error-prone. Pandas was created to alleviate some of this, but that comes at a price: you have to learn a new model for thinking about data and interacting with it. scikit-learn goes a step further in abstracting away model building and fitting. This too comes at a cost: you give up explicit control over writing the model, and flexibility in adapting it. You gain, however, short code that is fit for purpose, so you can focus on other aspects of the problem, like what it means.

%load_ext jupyter_black
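
To make that tradeoff concrete, here is the same straight-line fit done two ways: with numpy, where you write out the model form yourself, and with scikit-learn, where a model object handles the fitting details. The data here is made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# numpy: explicit and flexible, but you manage the details yourself
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)

# scikit-learn: shorter and fit for purpose, but more of a black box
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.coef_[0], model.intercept_)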

Data¶

We will start with some high level thinking about what we mean by data, why we get it, and what we do with it.

Data are things we measure, assume to be facts, and use to learn about the process they were collected from. Data are usually a set of collected numerical values. It is critical to know something about your data so you understand what analysis may be appropriate. Data is a plural word; datum is the singular form.

For example, here are two sets of data on my weight:

  1. [7.5, 46, 150, 157]

  2. [156, 155, 158, 157]

We are missing some context on these. The first set is data over four decades, while the second set is over four days. It doesn’t really make sense to average the first set, whereas the average of the second set gives you a good idea of how my weight fluctuates on a daily basis.
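
A quick computation makes the point concrete (the variable names here are just for illustration): the average of the first set describes no actual period of my life, while the average of the second set is a reasonable summary.

import numpy as np

decades = np.array([7.5, 46, 150, 157])  # one weight per decade
days = np.array([156, 155, 158, 157])  # one weight per day

# 90.125 lbs: a "typical" weight I never actually had
print(np.mean(decades))

# 156.5 lbs: a reasonable summary of my recent weight
print(np.mean(days))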

Data by itself is not helpful. It is the analysis of data that is helpful, but you have to know what the data is supposed to represent to judge whether the analysis is appropriate.

There are many kinds of analysis one can do: statistical, regression, integration, etc. Each of these has the purpose of extracting information from the data.

Let’s consider the average and standard deviation of the second set of weights above. To perform this analysis we need a computational tool; we will use Python. We will use numpy arrays extensively for data analysis. We start by making an array in a variable called weights. Then, we simply call the mean and std functions of that array inside a formatted string.

import numpy as np

# store the four daily measurements in a numpy array
weights = np.array([156, 155, 158, 157])
print(f"My average weight is {np.mean(weights)} +/- {np.std(weights):1.1f} lbs.")
My average weight is 156.5 +/- 1.1 lbs.

This analysis makes sense if we think my weight fluctuates about some average with a normal distribution of fluctuations. We do not have enough data to determine if the distribution is normal here, but it is worth noting that this assumption underlies the analysis. Note that we also assume each measurement is independent, and uncorrelated with the previous and next measurements. If I weigh myself only once a day, that is probably reasonable. If these are sequential weights separated by 1 minute, then either something is wrong with the scale, or I am doing something funny in how I weigh myself.
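
One quick way to probe the independence assumption is to compute the correlation between each measurement and the one that follows it (the lag-1 correlation). This is only a sketch; with four points the number is illustrative, not conclusive.

import numpy as np

weights = np.array([156, 155, 158, 157])

# correlation between each weight and the next one (lag-1);
# about -0.14 here, i.e. no obvious dependence between successive days
r = np.corrcoef(weights[:-1], weights[1:])[0, 1]
print(f"lag-1 correlation: {r:1.2f}")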

What factors could affect the weight measurement?

  1. What am I wearing?

  2. What and when did I last eat/drink?

  3. When was the last time I exercised and for how long?

  4. Are all the measurements from the same scale?

The answers to all of these constitute the metadata, which is data about the data. If we had access to this metadata, we might ask if any of these factors influence the measurements. As we consider more dimensions like this, it becomes inconvenient to visualize and build models with conventional tools, and we will then turn to machine learning.
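
One natural way to keep metadata attached to the measurements is a table with one column per factor. The values below are hypothetical, since no metadata was actually recorded; this is just a sketch of the idea, using the Pandas library we will cover later.

import pandas as pd

# hypothetical metadata; none of these values were actually recorded
df = pd.DataFrame(
    {
        "weight_lbs": [156, 155, 158, 157],
        "clothed": [True, True, False, True],
        "hours_since_meal": [2, 10, 1, 3],
        "scale": ["home", "home", "gym", "home"],
    }
)
print(df)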

There is a lot to learn about using data before we get to machine learning though.

Reference material¶

Reading material¶

Please start reading at https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html and read through Chapter 2 to the end, https://jakevdp.github.io/PythonDataScienceHandbook/02.09-structured-data-numpy.html. We will cover some of this material next.