Wednesday, January 1, 2025

Polynomial regression in Scikit-learn

Imagine you're trying to draw a line to connect a bunch of dots on a piece of paper. If all the dots lie roughly along a straight line, you just draw a straight line, right? When a straight line passes through all of the dots, or most of them, that's linear regression.
But what do you do when the dots form a curve, like the shape of a hill or a rollercoaster? A straight line won't fit very well. Instead, you need a bendy line that can go up and down to match the curve. That's where polynomial regression comes in!

What is Polynomial Regression?

Polynomial regression is like upgrading from a straight ruler to a flexible ruler that can bend. Instead of fitting just a straight line (\(y = mx + c\)), you use a formula of the form: \begin{equation} y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_n x^n \end{equation} where:
  • \(x\) - the input (the dots on the paper)
  • \(y\) - the output (the line you're drawing)
  • \(a_0, a_1, a_2, \ldots, a_n\) - the coefficients, numbers that the math figures out to make the line fit the dots
  • \(x^2, x^3, \ldots, x^n\) - the terms that make the line bend. The higher the power \(n\), the bendier the line can be.
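In scikit-learn, those bending terms \(x^2, x^3, \ldots\) are generated by `PolynomialFeatures`. A tiny sketch of what it produces for two sample input values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One input column; PolynomialFeatures expands each value v into [1, v, v^2, v^3].
x = np.array([[2.0], [3.0]])
X = PolynomialFeatures(degree=3).fit_transform(x)
print(X)
# Row for v=2 is [1, 2, 4, 8]; row for v=3 is [1, 3, 9, 27].
```

A plain linear regression fitted on these expanded columns is exactly a polynomial regression in the original \(x\).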
The natural question is: when should you use polynomial regression? Polynomial regression is appropriate when:
  1. The data doesn't fit a straight line but follows a curve
  2. You notice patterns like ups and downs (e.g., growth trends, hills, valleys)
  3. You want a model that's simple but flexible enough to capture curves.

How to use polynomial regression?

The application of polynomial regression is illustrated by the following example:
  1. Look at the data - Suppose you're measuring how fast a toy car rolls down a hill over time. The speed might increase slowly at first, then zoom up fast. The graph of this data would look like a curve.
  2. Pick a polynomial degree (\(n\)) - The idea is to start from the lowest degree, \(n=2\) (a simple bendy line, a parabola). If that's not curvy enough, try \(n=3\), \(n=4\), etc. But don't make it too bendy, or it might wiggle too much and fit random noise instead of the real pattern.
  3. Fit the equation - Use a computer to calculate the coefficients (\(a_0\), \(a_1\), \(a_2\), ...) that make the line match your data as closely as possible.
  4. Check the fit - Does the line match the dots? If not, adjust the degree of the polynomial.
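The four steps above can be sketched in scikit-learn. The toy-car numbers below are made up for illustration (roughly quadratic, as the story suggests):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

# Step 1: made-up toy-car data -- time (s) vs. speed, following a curved trend.
t = np.array([0, 1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
speed = np.array([0.0, 0.4, 1.7, 3.9, 7.1, 11.0])

# Steps 2-3: pick degree 2 and fit the coefficients.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(t, speed)

# Step 4: check the fit (R^2 close to 1 means the curve matches the dots).
score = r2_score(speed, model.predict(t))
print(f"R^2 = {score:.3f}")
```

If the score were poor, you would go back to step 2 and try a higher degree.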

Key Things to Remember

  1. Don't overdo it: If you make the polynomial too bendy (\(n\) too high), it will try to fit every single dot perfectly, even the random little bumps (noise). That's bad because the model won't work well on new data; this is called overfitting.
  2. Balance simplicity and accuracy: find the lowest degree \(n\) that fits the curve well.
It’s like building a toy car track. Sometimes a straight ramp is enough, but other times you need to add curves to make it exciting! That’s the magic of polynomial regression.
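The overfitting warning can be seen numerically: as the degree grows, the training error keeps shrinking even past the true degree, because the extra bends chase the noise. A small sketch with made-up data (the underlying curve here is quadratic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(0, 0.3, x.size)  # quadratic + noise

# Training error never goes up as the degree grows -- but past degree 2
# the "improvement" is just the curve fitting random noise.
rss = {}
for n in (1, 2, 5, 9):
    coeffs = np.polyfit(x, y, n)
    rss[n] = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    print(f"degree {n}: residual sum of squares = {rss[n]:.3f}")
```

The honest way to pick the degree is to check the error on data the model has not seen, not on the training points.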

Example 1 - Estimating plant growth from sunlight exposure.

You’re trying to figure out the relationship between the number of hours a plant gets sunlight (x) and how tall it grows (y). Your measurements are:
  \(x\) (hours of sunlight)    \(y\) (height in cm)
  1                            2
  2                            6
  3                            10
  4                            18
  5                            26
The data from the table is graphically shown in Figure 1.
Figure 1 - Height in cm versus hours of sunlight
From Figure 1 it can be seen that the points cannot be fitted with a straight line. So, we will try polynomial regression of degree 2 (\(y = a_0 + a_1 x + a_2 x^2\)).
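Since this post is about scikit-learn, here is how the degree-2 fit looks there, as a short sketch: `PolynomialFeatures` generates the \(x\) and \(x^2\) columns, and `LinearRegression` finds the coefficients.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]], dtype=float)  # hours of sunlight
y = np.array([2, 6, 10, 18, 26], dtype=float)         # height in cm

# Expand x into [x, x^2]; LinearRegression supplies the intercept a0 itself.
X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X, y)

a0 = model.intercept_
a1, a2 = model.coef_
print(round(a0, 3), round(a1, 3), round(a2, 3))
```

The hand calculation in the steps below arrives at the same coefficients.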

Step 1: Set up the equation

For degree 2, the general form of the equation is: \begin{equation} y = a_0 + a_1 x + a_2 x^2 \end{equation} In this equation we have to find \(a_0\), \(a_1\), and \(a_2\), which are called the intercept, the linear term, and the quadratic term.

Step 2: Organize the data

  \(x\)    \(y\)    \(x^2\)
  1        2        1
  2        6        4
  3        10       9
  4        18       16
  5        26       25

Step 3: Write the system of equations

To solve for \(a_0\), \(a_1\), and \(a_2\), we use the normal equations derived from least squares:
  1. Sum of \(y\): \begin{equation} \sum y = na_0 + a_1\sum x + a_2\sum x^2 \end{equation}
  2. Sum of \(xy\): \begin{equation} \sum xy = a_0\sum x + a_1\sum x^2 + a_2\sum x^3 \end{equation}
  3. Sum of \(x^2y\): \begin{equation} \sum x^2y = a_0\sum x^2 + a_1\sum x^3 + a_2\sum x^4 \end{equation}

Step 4: Plug in the data

Now we have to calculate all the sums. \begin{equation} \sum x = 1+2+3+4+5 = 15 \end{equation} \begin{equation} \sum x^2 = 1+4+9+16+25 = 55 \end{equation} \begin{equation} \sum x^3 = 1+8+27+64+125 = 225 \end{equation} \begin{equation} \sum x^4 = 1 + 16 + 81 + 256 + 625 = 979 \end{equation} \begin{equation} \sum y = 2 + 6 + 10 + 18 + 26 = 62 \end{equation} \begin{equation} \sum xy = 1\cdot 2 + 2 \cdot 6 + 3 \cdot 10 + 4 \cdot 18 + 5 \cdot 26 = 246 \end{equation} \begin{equation} \sum x^2 y = 1 \cdot 2 + 4 \cdot 6 + 9 \cdot 10 + 16 \cdot 18 + 25 \cdot 26 = 1054 \end{equation} Substituting the obtained sums into the equations for \(\sum y\), \(\sum xy\), and \(\sum x^2 y\) gives the following system of linear equations: \begin{eqnarray} 62 &=& 5a_0 + 15a_1 + 55a_2\\ \nonumber 246 &=& 15a_0 + 55a_1 + 225 a_2 \\ \nonumber 1054 &=& 55a_0 + 225a_1 + 979 a_2 \end{eqnarray} These three equations can be solved by hand or with a calculator: simplify where possible to isolate \(a_0\), \(a_1\), and \(a_2\), then use substitution or elimination to find the coefficients. Solving the three equations with three unknowns gives: \begin{equation} a_0 = 0.4, \quad a_1 = \tfrac{6}{7} \approx 0.857, \quad a_2 = \tfrac{6}{7} \approx 0.857 \end{equation}
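The sums and the solution of the 3×3 system can be double-checked numerically; a minimal sketch with NumPy:

```python
import numpy as np

# Data from the table: hours of sunlight vs. plant height.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 6, 10, 18, 26], dtype=float)

# The sums that appear in the normal equations.
sx, sx2, sx3, sx4 = x.sum(), (x**2).sum(), (x**3).sum(), (x**4).sum()
sy, sxy, sx2y = y.sum(), (x * y).sum(), (x**2 * y).sum()
print(sx, sx2, sx3, sx4, sy, sxy, sx2y)

# Assemble and solve the 3x3 normal-equation system A @ [a0, a1, a2] = b.
n = len(x)
A = np.array([[n, sx, sx2], [sx, sx2, sx3], [sx2, sx3, sx4]])
b = np.array([sy, sxy, sx2y])
a0, a1, a2 = np.linalg.solve(A, b)
print(round(a0, 3), round(a1, 3), round(a2, 3))
```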

Step 5: Write the final equation

The polynomial regression equation can then be written as: \begin{equation} y = 0.4 + 0.857x + 0.857x^2 \end{equation} The result is graphically shown in Figure 2.
Figure 2 - Approximation of the data using polynomial regression.
As seen from Figure 2, polynomial regression gives a function that successfully approximates the coordinates shown as blue points. The model is not overfitted, since the curve does not pass through every single data sample.

Step 6: Use the equation

Now you can predict the plant height for any number of sunlight hours. For example, for \(x = 6\) (6 hours of sunlight), the predicted plant height is: \begin{equation} y = 0.4 + 0.857\cdot 6 + 0.857\cdot 6^2 \approx 36.4\ [\mathrm{cm}] \end{equation}
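The prediction step as code, using the exact least-squares coefficients for this data set (\(a_0 = 0.4\), \(a_1 = a_2 = 6/7\)):

```python
# Predict plant height (cm) from hours of sunlight using the exact
# least-squares coefficients for this data set: a0 = 0.4, a1 = a2 = 6/7.
def predicted_height(hours):
    return 0.4 + (6 / 7) * hours + (6 / 7) * hours**2

print(round(predicted_height(6), 1))  # 36.4 cm for 6 hours of sunlight
```

Keep in mind that predictions far outside the measured range (say, 20 hours) extrapolate the parabola and should not be trusted.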
