1. Basic Curve Fitting#

1.1. Lesson overview#

Analyzing measured data with a theoretical model is a common task for a scientist and engineer. Often we end up “fitting” a dataset to some sort of mathematical function like a line, a sinusoid wave, or an exponentially decaying function. The fitted parameters from this analysis often end up being related to important materials properties, processing conditions, or physical phenomena.

Least squares regression is a commonly used when fitting data. You may have done this before in other software packages (e.g., Microsoft Excel’s LINEST function). In this lesson, we will go over how to do basic least squares regression analysis using the SciPy library. We first will go over how to perform a regression analysis by fitting a line through a dataset, and then we will demonstrate how perform curve fitting with a non-linear function.

1.2. Topics taught#

  • Performing least squares analysis using the scipy.optimize.curve_fit() function.

  • Reporting fitting parameters and standard errors / uncertainties.

  • Reading data from a file using the numpy.loadtxt() function.

  • Plotting measured data and fit lines with the Matplotlib library.

1.3. Line fitting#

In this example, we will fit a line through a data set containing current-voltage measurements of an electronic device in order to extract the resistance of the device. As a quick review, Ohm’s law demonstrates that there is a linear relationship between the voltage across a resistor and the current passing through a resistor via the equation,

\[ V = IR \]

where \(V\) is the voltage across a resistor (units: volt, V), \(I\) is the current passing through a resistor (units: amps, A), and \(R\) is the inherent electrical “resistance” of the device (units: V/A = ohm). In circuit analysis, electrical devices and materials are often modeled as resistors, so measuring the resistance is a common task for a scientist or engineer.

Data for this exercise can be found in the file IV_data_no_uncertainty.csv. This file has two columns of data: (1) the sourced current and (2) measured voltage of the measurement. The first row has the labels for each column and the second row has the units for each column. Since this is a CSV file, each column is separated with a , symbol.

We will need to use three external libraries in order to read the data file, perform the regression analysis, and plot the data and fitted line. These are the NumPy, SciPy, and Matplotlib libraries, respectively. Let’s first load the necessary libraries for the analysis:

import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit

The import calls for the NumPy and Matplotlib libraries should be familiar to you if you have read the Introduction to Scientific Python module (specially the introduction to NumPy section and the basic plotting with Matplotlib section). The new import call in this section is for the scipy.optimize.curve_fit() function. We will review the details of this function soon.

Now, we create the function we want to fit the data with, in this case a line. Recall that a line follows the formula,

\[ y = mx + b \]

where \(y\) is the dependent variable in the measurement (i.e., usually what is measured), \(x\) is the independent variable (i.e., usually what is sourced), \(m\) is the slope of the line, and \(b\) is the \(y\)-intercept (i.e., the \(y\) value when \(x\) = 0). The function call below creates a creates our line that will be fitted soon:

def linearFunc(x, intercept, slope):
    y = intercept + slope * x
    return y

Now let’s load our data from IV_data_no_uncertainty.csv using the numpy.loadtxt() function. We have demonstrated how to use this function before in our Introduction to Scientific Python module’s section on working with CSV files. The code below loads the data to a NumPy data array called data1 and prints out the entire array to the shell just to confirm it has been loaded correctly:

data1 = np.loadtxt("./example_data/IV_data_no_uncertainty.csv", 
                   delimiter = ",", 
                   skiprows = 2)
print(data1)
[[1.00e-03 3.20e-01]
 [2.00e-03 6.70e-01]
 [3.00e-03 9.20e-01]
 [4.00e-03 1.33e+00]
 [5.00e-03 1.64e+00]
 [6.00e-03 2.12e+00]]

Notice the skiprows=2 argument prevents NumPy from trying to load the header rows in the CSV file to data array. Failure to do so would cause the following ValueError:

data_bad_import = np.loadtxt("./example_data/IV_data_no_uncertainty.csv", 
                   delimiter = ",")
print(data_bad_import)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: could not convert string to float: ''

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 data_bad_import = np.loadtxt("./example_data/IV_data_no_uncertainty.csv", 
      2                    delimiter = ",")
      3 print(data_bad_import)

File /srv/docs_venv/lib/python3.10/site-packages/numpy/lib/npyio.py:1373, in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, quotechar, like)
   1370 if isinstance(delimiter, bytes):
   1371     delimiter = delimiter.decode('latin1')
-> 1373 arr = _read(fname, dtype=dtype, comment=comment, delimiter=delimiter,
   1374             converters=converters, skiplines=skiprows, usecols=usecols,
   1375             unpack=unpack, ndmin=ndmin, encoding=encoding,
   1376             max_rows=max_rows, quote=quotechar)
   1378 return arr

File /srv/docs_venv/lib/python3.10/site-packages/numpy/lib/npyio.py:1016, in _read(fname, delimiter, comment, quote, imaginary_unit, usecols, skiplines, max_rows, converters, ndmin, unpack, dtype, encoding)
   1013     data = _preprocess_comments(data, comments, encoding)
   1015 if read_dtype_via_object_chunks is None:
-> 1016     arr = _load_from_filelike(
   1017         data, delimiter=delimiter, comment=comment, quote=quote,
   1018         imaginary_unit=imaginary_unit,
   1019         usecols=usecols, skiplines=skiplines, max_rows=max_rows,
   1020         converters=converters, dtype=dtype,
   1021         encoding=encoding, filelike=filelike,
   1022         byte_converters=byte_converters)
   1024 else:
   1025     # This branch reads the file into chunks of object arrays and then
   1026     # casts them to the desired actual dtype.  This ensures correct
   1027     # string-length and datetime-unit discovery (like `arr.astype()`).
   1028     # Due to chunking, certain error reports are less clear, currently.
   1029     if filelike:

ValueError: could not convert string '\ufeffI' to float64 at row 0, column 1.

Now let’s create two variables to hold the current and voltage data values:

current = data1[:, 0]
voltage = data1[:, 1]

With the data now loaded, let’s perform the least squares fit. The line of code below performs the least squares fit:

popt, pcov = curve_fit(linearFunc, current, voltage)

Notice the three positional arguments: linearFunc, current, and voltage. As detailed in the online documentation for scipy.optimize.curve_fit(), the first positional argument is the model function to fit with, which in our case is our linearFunc function from earlier. The second positional argument is our independent variable data current (i.e., the “x” value in the data; calledxdata in the documentation). The final positional value is our dependent variable data voltage (i.e., the “y” value in the data; called ydata in the documentation). There are many additional keyword arguments that can be used, but for this basic tutorial, these will suffice.

In our call of scipy.optimize.curve_fit() we return two variables: popt and pcov. According to the documentation, popt is a 1D array that contains the optimized fitted parameters and pcov is a 2D array that contains the variances and covariances of the fitted parameters. For our needs, the square root of the diagonal represents the standard errors of the fitted parameters (a.k.a. the standard uncertainties of each parameter).

Let’s now extract the fitting parameters and their standard uncertainties with the following block of code:

intercept = popt[0]
slope = popt[1]
intercept_err = np.sqrt(pcov[0][0])
slope_err = np.sqrt(pcov[1][1])

With this block of code, we have all we need in the regression analysis! For completion, let’s make a plot that shows the optimized line with our data. The code below goes over all the necessary steps to make this happen. If you need a review on the basics of plotting, look at our Introduction to Scientific Python module’s section on basic plotting with Matplotlib.

# Create figure
fig1 = plt.figure()
ax = fig1.add_subplot(1, 1, 1)

# Plot measured data
ax.plot(current, voltage,
        label="measured data",
        marker="o",
        markersize=8,
        linestyle="none",
        color="blue")

# Create fitted line
yfit = intercept + slope * current

# Plot fit data
ax.plot(current, yfit, 
        color="red",
        label="fit",
        linestyle="dashed")

# Figure options
ax.set_title("Ohm's law analysis on sample")
ax.set_xlabel("I [A]")
ax.set_ylabel("V [V]")
ax.tick_params(axis="both", direction="in")
ax.tick_params(top="on")
ax.tick_params(right="on")
ax.set_xlim(left=0, right=6.5E-3)
ax.set_ylim(bottom=0, top=2.5)
ax.legend(frameon=False)

# Show plot
plt.show()
_images/Curve_Fitting_15_0.png

Comments have been provided in the code above for readability. To create the fit line, we first create yfit, which contains calculated voltages based on our input current array and the fitting results intercept and slope (notice the line equation format). We then include this data set to the graph as an overlay.

Finally, we can also report the fitting results to the shell with a set of four print() calls. Here, we use f-formatting of strings and number rounding in strings to readability purposes.

print(f"intercept = {intercept:.3f} V")
print(f"intercept std. error = {intercept_err:.3f} V")
print(f"slope = {slope:.1f} ohm")
print(f"slope std. error = {slope_err:.1f} ohm")
intercept = -0.065 V
intercept std. error = 0.058 V
slope = 352.0 ohm
slope std. error = 14.9 ohm

1.4. Non-linear curve fitting#

We can also use scipy.optimize.curve_fit() to fit datasets that do not have a linear relationship. In order to perform “non-linear curve fitting”, we simply need to rewrite our function to our desired mathematical relationship and account for all additional fit parameters in the code. To see this in action, first download the CSV file MR_data_no_uncertainty.csv, which contains a set of measured data from a sample whose electrical resistance changes as function of magnetic field. This data will show a second-order polynomial-like dependence (i.e., a parabola) that follows the functional form:

\[ y = A + Bx + Cx^2 \]

where \(A\) is a constant, \(B\) is the linear parameter, and \(C\) is the second-order or “quadratic” term. The block of code below is the entire code needed to fully run this regression analysis:

# Load libraries
import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit

# Create function to fit
def parabolaFunc(x, constant, linear, quadratic):
    y = constant + linear * x + quadratic * (x**2)
    return y

# Load data and separate columns
data2 = np.loadtxt("./example_data/MR_data_no_uncertainty.csv", 
                   delimiter = ",", 
                   skiprows = 2)
magnetic_field = data2[:,0]
resistance = data2[:,1]

# Perform the least-squares fitting
popt, pcov = curve_fit(parabolaFunc, magnetic_field, resistance)

# Extract out the fitted parameters and standard errors
constant = popt[0]
linear = popt[1]
quadratic = popt[2]
constant_err = np.sqrt(pcov[0][0])
linear_err = np.sqrt(pcov[1][1])
quadratic_err = np.sqrt(pcov[2][2])

# Create figure
fig1 = plt.figure()
ax = fig1.add_subplot(1, 1, 1)

# Plot measured data
ax.plot(magnetic_field, resistance,
        label="measured data",
        marker="d",
        markersize=8,
        linestyle="none",
        color="green")

# Create fitted curve
yfit = constant + (linear * magnetic_field) + (quadratic * (magnetic_field**2))

# Plot fit data
ax.plot(magnetic_field, yfit, 
        color="red",
        label="fit",
        linestyle="dashed")

# Figure options
ax.set_title("Magnetoresistance")
ax.set_xlabel("B [T]")
ax.set_ylabel("R [ohm]")
ax.tick_params(axis="both", direction="in")
ax.tick_params(top="on")
ax.tick_params(right="on")
#ax.set_xlim(left=0, right=2.5)
#ax.set_ylim(bottom=0.3525, top=0.3535)
ax.legend(frameon=False)

# Show plot
plt.show()

# Report values to shell
print(f"constant = {constant:.7f} ohm")
print(f"constant std. error = {constant_err:.7f} ohm")
print(f"linear = {linear:.2E} ohm/T")
print(f"linear std. error = {linear_err:.2E} ohm/T")
print(f"quadratic = {quadratic:.4E} ohm/T^2")
print(f"quadratic std. error = {quadratic_err:.2E} ohm/T^2")
_images/Curve_Fitting_19_0.png
constant = 0.3698562 ohm
constant std. error = 0.0000014 ohm
linear = -1.19E-08 ohm/T
linear std. error = 5.02E-07 ohm/T
quadratic = 2.4583E-05 ohm/T^2
quadratic std. error = 3.04E-07 ohm/T^2

We have reloaded the three external libraries completeness in case you would like to re-use this code block for your own needs. Structurally, there is no real difference from the previous line fitting example. The only differences in the code are replacing our linear function linearFunc() with the polynomial function parabolaFunc(), renaming the variables for the measured data, and adjusting the code to handle the new fitting parameters. In short, once you know how to perform the line fit, you can fit any other function!

1.5. Conclusion#

Hopefully you find this lesson useful in learning the basics of curve fitting. The scipy.optimize.curve_fit() function has additional optional arguments available that go beyond the scope of this lesson.

For example, we can use the keyword argument sigma to account for any uncertainty on the y-value for each data point in the regression analysis. This way if our data has “error bars” on the y-value for each data point, we can include this in the fitting process. We will cover this in an upcoming lesson. There are also ways to enter initial values for each fitting parameter and specify bounds for each value. If you are interested in this additional options, please look at the documentation for curve_fit().

1.6. Want to learn more?#

Wikipedia - Least Squares
Shane Burns - Data Fitting
SciPy Library - The curve_fit() Function
NumPy - Documentation Page
Matplotlib - Home Page