REGRESSION ANALYSIS


Hi, everyone!

Today I decided to make a post about one of the most widely used statistical methods – regression analysis. My plan is to keep it short and simple, so that even readers who are not familiar with the topic can follow along.

Regression analysis is a group of statistical procedures and techniques used for modeling relations between two or more variables. Those relations can be presented in the form of a regression equation. Depending on the number of variables included in one equation, regression analysis can be either simple or multiple. My explanation will be based on a simple analysis with only two variables. One of them is the independent (also called explanatory) variable and the other one is the dependent (criterion) variable.
How to recognize them?
-The independent variable is always the one whose change affects the value of the other variable.

Examples:

Independent variable (X) | Dependent variable (Y)
income                   | consumption
price                    | demand

There are a few different ways to label them, but most commonly we use the letter X for the independent and Y for the dependent variable. To make it clearer, I will be using the first example from the table above (income and consumption).

The real relation between the variables can be shown in the form of the following equation, called the population regression line:

E(Y) = β0 + βX

Explanation:

Y – dependent variable (consumption)
X – independent variable (income)
E(Y) – expected value of consumption (Y) that can also be interpreted as an average consumption of people with the same amount of income (X)
β0 – a parameter that shows the value of consumption when income is zero
β – a parameter that shows the increase in consumption when income increases by one unit
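
The population regression line is easy to evaluate once the parameters are known. As a minimal sketch, the snippet below uses made-up values β0 = 50 and β = 0.8 (they are not from the post, just illustration) and shows how each parameter behaves:

```python
# Hypothetical parameter values, chosen only for illustration:
# beta0 = 50 (consumption when income is zero), beta = 0.8
beta0 = 50.0
beta = 0.8

def expected_consumption(income):
    """Population regression line: E(Y) = beta0 + beta * X."""
    return beta0 + beta * income

print(expected_consumption(0))    # beta0 alone: 50.0
print(expected_consumption(100))  # 50 + 0.8 * 100 = 130.0
# A one-unit increase in income raises E(Y) by beta:
print(round(expected_consumption(101) - expected_consumption(100), 6))
```

Note how setting income to zero recovers β0, and a one-unit increase in income changes the expected consumption by exactly β.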

The approximate relation between two variables can also be shown graphically in a chart called a dispersion diagram (scatter plot).

Dispersion diagrams are used to display the main pattern in the distribution of data. The graph shows each (X, Y) pair plotted as an individual point. It shows the range of the data and how the individual observations are distributed within that range.

Example:
A dataset of daily consumption and income per capita for one fictitious family is presented in the table below. All numbers are made up.

[Table: made-up daily income and consumption per capita]

[Figure: dispersion diagram of consumption (Y) against income (X)]

Now, if we draw a line through the points on the graph, we can easily see that the slope is positive, which is logical: when income increases, consumption also increases. But if we take the next example from the table (price and demand), we can expect the line to have a negative slope, because of the nature of the relation between those variables.
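
The sign of the slope can be checked numerically with the usual least-squares slope formula. The data below is invented for illustration (the post's original table image is lost), but it follows the same pattern: consumption rising with income, demand falling with price:

```python
def slope(xs, ys):
    """Least-squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

# Made-up data: consumption rises with income -> positive slope
income = [100, 150, 200, 250, 300]
consumption = [80, 110, 150, 170, 210]
print(slope(income, consumption) > 0)   # True

# Made-up data: demand falls as price rises -> negative slope
price = [1, 2, 3, 4, 5]
demand = [50, 42, 35, 30, 22]
print(slope(price, demand) < 0)         # True
```

The slope comes out positive for the income/consumption pair and negative for the price/demand pair, matching the intuition above.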

Usually, the real value of consumption (Y) is not equal to its expected value (E(Y)). A number of factors lead to this difference, and all of them are captured by the part of the equation known as the error term (ε).
When the error term is included in the equation, the real value of the dependent variable becomes:

Y = E(Y) + ε,

i.e.

Y = β0 + βX + ε

The main goal of regression analysis is to estimate the unknown parts of the equation: the parameters β0 and β, and through them Y. The first thing we need to do is estimate the values of the parameters, and the estimation is based on data from a sample. Using an appropriate method (most commonly OLS – Ordinary Least Squares), we get the least squares regression line:

Ŷ = b0 + bX

Explanation:

Ŷ – estimated (fitted) value of consumption
b0 – estimated value of β0
b – estimated value of β

Based on the previous relation and the dataset, we are now able to estimate the value of the dependent variable. Since the estimated values of the parameters β and β0 are constants, any change in income translates directly into a change in the estimated consumption.
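
The estimation step can be sketched in plain Python using the standard closed-form OLS formulas (b = Sxy/Sxx, b0 = ȳ − b·x̄). The income/consumption figures are made up, standing in for the post's lost table:

```python
def ols_fit(xs, ys):
    """OLS estimates for a simple regression: b = Sxy / Sxx, b0 = ybar - b * xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    b0 = ybar - b * xbar
    return b0, b

# Made-up daily income/consumption figures
income = [100, 150, 200, 250, 300]
consumption = [80, 110, 150, 170, 210]

b0, b = ols_fit(income, consumption)
y_hat = [b0 + b * x for x in income]   # fitted values: Y-hat = b0 + b * X
print(round(b0, 2), round(b, 2))       # 16.0 0.64
```

With the fitted b0 and b, the same line Ŷ = b0 + bX can also be used to predict consumption for income levels not in the sample.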

Also, the estimated value of consumption (Ŷ) is rarely equal to its real value (Y). The difference between them, e = Y − Ŷ, is called the residual.

Y > Ŷ → e > 0,

Y < Ŷ → e < 0

Given that the whole point of the estimation is to make it as precise as possible, we need to minimize the residuals, which can be thought of as estimation errors. This is exactly what OLS is based on. But since residuals vary from positive to negative, we minimize the sum of their squares, as the method's name suggests.
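
This minimization property can be verified directly: the OLS estimates give a sum of squared residuals no larger than any other choice of parameters. A self-contained sketch, again on made-up data:

```python
def ssr(xs, ys, b0, b):
    """Sum of squared residuals for the line y = b0 + b * x."""
    return sum((y - (b0 + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up income/consumption data
income = [100, 150, 200, 250, 300]
consumption = [80, 110, 150, 170, 210]

# OLS estimates from the usual closed-form formulas
xbar = sum(income) / len(income)
ybar = sum(consumption) / len(consumption)
b = (sum((x - xbar) * (y - ybar) for x, y in zip(income, consumption))
     / sum((x - xbar) ** 2 for x in income))
b0 = ybar - b * xbar

best = ssr(income, consumption, b0, b)
# Nudging either parameter away from the OLS solution can only
# increase the sum of squared residuals:
print(best <= ssr(income, consumption, b0 + 1, b))      # True
print(best <= ssr(income, consumption, b0, b + 0.01))   # True
```

Any perturbed line leaves a larger (or at best equal) sum of squared residuals, which is what "least squares" means.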

In the end, what do we use regression analysis for?
-Well, it is widely used for prediction and forecasting in many scientific and other fields.

References:
https://en.wikipedia.org/wiki/Regression_analysis
Mladenović Z., Petrović P. (2017), Uvod u ekonometriju, Ekonomski fakultet Univerziteta u Beogradu

I hope you like my post!

Bojana :)
