Correlation and Regression

Correlation studies the relationship between two or more variables. Regression uses correlation to estimate a predictive function that relates a dependent variable to an independent variable, or to a set of independent variables. The regression can be linear or non-linear. Linear regression is provided in most spreadsheets and is performed by a least-squares method. We refer to a very useful textbook on the matter by Edwards (1979).

It is important to stress that correlation does not prove a causal relationship between a pair of variables. In addition, an understanding of the data-generating processes is required before causality can be concluded.

Parametric correlation

The product-moment correlation coefficient ("r"), in its full glory, is:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}

and simple linear regression of a variable Y on X is illustrated in the figure below.

Although for the correlation coefficient it does not matter whether we correlate Y with X or X with Y, the regression of Y on X is, in general, not the same as the regression of X on Y.
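
As a minimal illustration (using invented numbers, not data from the text), a short Python sketch can make this concrete: the correlation coefficient is symmetric in the two variables, while the two regression slopes differ.

import numpy as np

# Invented example data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.7])

# Product-moment correlation coefficient r (the same for (x, y) and (y, x)).
r = np.corrcoef(x, y)[0, 1]

# Least-squares regression of Y on X: y = b0 + b1 * x
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# The regression of X on Y has a different slope (not 1/b1 unless r**2 equals 1).
c1 = np.cov(x, y, ddof=1)[0, 1] / np.var(y, ddof=1)
c0 = x.mean() - c1 * y.mean()

print(f"r = {r:.3f}, slope of Y on X = {b1:.3f}, slope of X on Y = {c1:.3f}")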

A few points worth mentioning, in relation to the use of correlation and regression in prospect appraisal, notably GAEAPAS:

A typical regression equation with one independent variable x would look like this:

Y = Beta0 + Beta1.x

There are two important parameters: the slope (usually denoted by Beta1) and the intercept (Beta0), the value at which the regression line crosses the Y-axis.

Coefficients. A regression equation with 6 independent variables is used by Gaeapas to estimate the HC charge volumes. The coefficients are based on the "calibration set" of well-known accumulations with a worldwide spread. These data are stored in a database of statistical parameters that can be updated, but normally remain unchanged unless a new calibration is carried out. In this way the experience gained from a sizable geological database is encapsulated in a small set of constants that can be used in a prediction program.

The R-square is the fraction of the variance in the volume regression that is explained by the independent variables. In the calibration carried out by Shell Intl. in the early eighties this was about 0.64, with a sample size of 160 accumulations, and statistically significant. There were 6 X-variables in the regression, representing geological variables of the petroleum system, all individually significant contributors to the prediction and almost totally independent.

The Gaeapas model for HC charge is in principle a multiplication (Sluijk & Nederlof, 1984); therefore the X-variables entering the product are log-transformed before the regression analysis, which allows a linear regression model to be used.

Although most regression calculations will produce a result, the interpretation may be more difficult. First of all, regression does not prove a causal relationship. Second, one of the requirements of the regression model is that the data are "iid" ("independent and identically distributed"). For time series (also called longitudinal data) this is often not the case, because of autocorrelation in the data. In multiple regression (several independent X-variables), independence is also often debatable. One of the worst examples is when regression is applied with a variable X as well as X-squared, a parabolic fit. Then the relationship between X and X-squared causes an inflation of the statistical significance. It works, but interpretation and forecasting must be done very carefully. For a multiple regression the variables are chosen to be as independent as possible. If some of the independent variables were strongly inter-correlated, an ill-conditioned variance/covariance matrix would create problems for the calculations, but also for the interpretation of the results. When the variables are represented by vectors in space, independent vectors would be perpendicular to each other: "orthogonal".
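
To make the log-transform idea concrete, here is a small, self-contained Python sketch with an invented multiplicative model and invented factors; it is not the actual Gaeapas calibration, only an illustration of how a product of factors becomes a linear regression after taking logarithms:

import numpy as np

# Hypothetical multiplicative charge model: volume = c * f1**b1 * f2**b2 * ...
# Log-transforming both sides turns it into a linear model that ordinary
# least squares can handle. The factors and coefficients below are invented
# placeholders, not the actual Gaeapas calibration set.
rng = np.random.default_rng(1)
n = 160
factors = rng.lognormal(mean=0.0, sigma=1.0, size=(n, 6))   # six X-variables
true_beta = np.array([0.9, 0.5, 1.2, 0.3, 0.7, 0.4])
volume = 2.0 * np.prod(factors ** true_beta, axis=1) * rng.lognormal(0.0, 0.5, n)

# Linear regression in log space: log(volume) = beta0 + sum(beta_i * log(f_i))
X = np.column_stack([np.ones(n), np.log(factors)])
beta, *_ = np.linalg.lstsq(X, np.log(volume), rcond=None)
print("intercept:", beta[0].round(3), "coefficients:", beta[1:].round(3))

# A quick multicollinearity check: correlations between the log-factors
# should stay small ("orthogonal" predictors) for a stable, interpretable fit.
print(np.corrcoef(np.log(factors), rowvar=False).round(2))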

Residuals are the deviations from the regression line. If the correlation is perfect, all the residuals are equal to zero. In practice they depict the scatter of points around the regression line, but the mean of the residuals is zero. In Gaeapas some of the input variables are correlated with depth, for instance the API gravity. To combine a worldwide prior for API with local data, the worldwide regression of API on depth is used to find the relevant distribution of residuals. The technique is shown in the following graph. So, to a small extent, the independence requirement is not satisfied.
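
A minimal sketch of the residual technique, with invented API and depth numbers rather than the worldwide dataset, might look as follows:

import numpy as np

# Invented illustration: regress API gravity on depth for a "worldwide" sample,
# then use the distribution of residuals (depth trend removed) as the scatter
# around the trend value at a prospect depth.
rng = np.random.default_rng(2)
depth = rng.uniform(1000.0, 4000.0, 200)              # metres (hypothetical)
api = 20.0 + 0.006 * depth + rng.normal(0.0, 4.0, 200)

slope, intercept = np.polyfit(depth, api, 1)          # worldwide trend
residuals = api - (intercept + slope * depth)
print("mean of residuals (should be ~0):", residuals.mean().round(3))
print("standard deviation of residuals :", residuals.std(ddof=2).round(2))

# At a prospect depth the expected API is the trend value; the residual
# distribution supplies the uncertainty around it.
prospect_depth = 2500.0
print("trend API at prospect depth:", round(intercept + slope * prospect_depth, 1))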

Prediction using a regression equation

The prediction using a regression equation involves introducing all the uncertainties around the regression line for the particular set of X-variables. The uncertainties reflect three different variances: the scatter of the individual points around the line (the residual variance), the uncertainty of the mean level of the line, and the uncertainty of the slope, which increases with the distance of the X value from the mean of X.

The uncertainties are combined in the standard error of prediction, given as a formula here and illustrated in a figure below. For a single X-variable the standard error of prediction at a value x is:

s_{pred} = s_{y.x} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1)\, s_x^2}}

A similar logic holds when there is more than one X-variable.

The following example concerns the regression of the condensate-gas ratio (CGR, in barrels per million cubic feet of gas) on reservoir pressure P (psia). We have the following regression results:

Parameter                               Value
Number of data points                   73
R-square                                0.455
Intercept, Beta0                        -67.781
Slope, Beta1                            0.0244
Standard error of estimate (s_y.x)      69.324
Mean of P                               5250
Variance of P (s_x^2)                   6676634
Mean of CGR                             60.37
Variance of CGR                         8744

First of all, a prediction of a single value of CGR, given a value of P (the x in the above formula), is a draw from a normal distribution with a mean equal to:

CGR = -67.781 + 0.0244 * P

and a standard deviation of approximately 69.324. More correct, however, is the standard error of prediction obtained with the formula above.

We use the formula to calculate the standard deviation of the prediction at two values of P: the first at the mean of P and the second in the "outskirts" of P, at P = 1000 psia.

Pressure P   Terms under the root sign                               Standard deviation
5250         1 + 1/73 + (5250 - 5250)^2/(72 * 6676634) = 1.0137      1.0068 * 69.324 = 69.795
1000         1 + 1/73 + (1000 - 5250)^2/(72 * 6676634) = 1.0513      1.0253 * 69.324 = 71.080

Note that the predictive standard deviation at P = 1000 is about 2% larger than at the mean of P. A simulation of the regression would use a routine to create a number of random draws from a standard normal distribution with a mean of zero and a variance of 1. For each P value the associated mean and standard deviation would translate the standard normal deviate into a CGR value. Note that the sample size is larger than 50. This means that the above use of the normal distribution is warranted. If the sample were smaller, the t-distribution with the residual degrees of freedom (n - 2 for a simple regression) would have to be generated instead of the normal, because the more exact formula requires that. For our sample size the t-distribution coincides with the normal well enough, which keeps the procedure simpler.
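
The calculation and the simulation described above can be sketched in a few lines of Python, using the regression results from the table (the random seed and the number of draws are arbitrary choices):

import numpy as np

# Regression results from the CGR-on-pressure example above.
n = 73
b0, b1 = -67.781, 0.0244
s_yx = 69.324                      # standard error of estimate
x_mean, x_var = 5250.0, 6676634.0  # mean and variance of P

def predictive_sd(p):
    # s_yx * sqrt(1 + 1/n + (p - mean of P)^2 / ((n - 1) * variance of P))
    return s_yx * np.sqrt(1.0 + 1.0 / n + (p - x_mean) ** 2 / ((n - 1) * x_var))

for p in (5250.0, 1000.0):
    print(f"P = {p:6.0f}: standard error of prediction = {predictive_sd(p):.3f}")

# Monte Carlo simulation at the mean of P: translate standard normal draws into
# CGR values. For a small sample, rng.standard_t(n - 2) would replace the
# standard normal draws.
rng = np.random.default_rng(42)
p = 5250.0
cgr = b0 + b1 * p + predictive_sd(p) * rng.standard_normal(10_000)
print("simulated mean and sd of CGR:", round(cgr.mean(), 1), round(cgr.std(), 1))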

The t-values. The regression calculation not only gives the values of the coefficients, but also the standard deviations of the coefficients. If a coefficient is zero, it certainly does not contribute to the prediction of Y. Therefore it is useful to check whether a beta is significantly different from zero. When we divide a coefficient by its standard deviation we obtain the t-value. Tables of t-values are available for different degrees of freedom, and as a function in spreadsheets. If we assume that the uncertainty of a coefficient is more or less normal, and there are enough degrees of freedom (say > 25), a t-value of 2.0 would already indicate a (mildly) significant coefficient at about p < 0.05, assuming a two-tailed test. Two-tailed means that you had no idea beforehand whether the coefficient beta should be negative or positive, and hence the same for the t-value. If, on the contrary, you had assumed a positive beta, a one-tailed test applies, with p < 0.025 for the same t-value.
The t-values have another interesting use: the squared t-value of a coefficient is proportional to the part of the R-square associated with it. In other words, the squared t-values indicate the relative contributions of the variables to the regression, much as in an ANOVA, since the squared t-values play the role of the sums of squares in the F-test for significance.
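
As an illustration, the t-value and the corresponding p-values can be computed directly from a coefficient and its standard deviation; the standard deviation below is an invented number, not one from the condensate example:

import scipy.stats as stats

# Hypothetical regression output: coefficient, its standard deviation, residual d.f.
beta1 = 0.0244
se_beta1 = 0.0055          # invented standard deviation of the coefficient
df = 71

t_value = beta1 / se_beta1
p_two_tailed = 2.0 * stats.t.sf(abs(t_value), df)   # sign of beta not assumed beforehand
p_one_tailed = stats.t.sf(t_value, df)              # a positive beta assumed a priori

print(f"t = {t_value:.2f}, two-tailed p = {p_two_tailed:.5f}, one-tailed p = {p_one_tailed:.5f}")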

Autocorrelation

When analyzing a sequence of, say, yearly data Y for randomness, it is useful to regress Y on itself, where e.g. y(i) is correlated with y(i-1). Here a lag of one year is used, but larger lags are possible too, so that an autocorrelogram can be drawn. Often the correlation coefficients are positive and gradually decline with increasing lag. Such a condition may be described by an "Auto Regressive Moving Average (ARMA) model". The yearly averages of the global temperature anomaly show strong autocorrelation, as shown in the following autocorrelogram:

But the correlogram may also show a wavy pattern, indicating cycles in the sequence of X. With autocorrelated data in time series, the utmost care must be taken when interpreting the results of regressions. When considering the correlation of two time series, all kinds of "spurious" correlations may occur (Fawcett, 2015). This means that very strong and significant correlations are found, although there is clearly no causal relationship!

The effect of autocorrelation is that the significance of correlations may be seriously overestimated. Autocorrelation means that the data in the series are not independent of the previous values. Hence the number of degrees of freedom is overestimated. One remedy is to use the first differences of the values. In the case of the temperature series above, the autocorrelation then disappears and the results become more informative.
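
A simple autocorrelogram, and the effect of taking first differences, can be computed along the following lines (the series here is an invented trend-plus-noise stand-in, not the actual temperature record):

import numpy as np

def autocorrelogram(y, max_lag):
    # Correlation of y(i) with y(i - lag) for lag = 1 .. max_lag.
    return [np.corrcoef(y[lag:], y[:-lag])[0, 1] for lag in range(1, max_lag + 1)]

# Invented trending series standing in for a yearly temperature anomaly.
rng = np.random.default_rng(0)
years = np.arange(1960, 2020)
anomaly = 0.015 * (years - 1960) + rng.normal(0.0, 0.1, years.size)

print("raw series:       ", np.round(autocorrelogram(anomaly, 5), 2))
# First differences remove most of the trend-induced autocorrelation.
print("first differences:", np.round(autocorrelogram(np.diff(anomaly), 5), 2))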

An interesting way to study causality is "Granger causality", a method that establishes whether X values that precede Y have a significant effect on Y, as explained on Wikipedia. Although this method does not prove real causality, a significant result supports a physical relationship between X and Y, even in the case of autocorrelation. For the temperature anomaly (HadCRUT4) and CO2 series of the last 60 years (1959-2018) we see that temperature (Granger-)causes a rise in CO2 a year later and, to a lesser extent, two years later, while there is no influence of CO2 on temperature in this short time series.
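
For reference, the statsmodels package offers a ready-made Granger test; the two series below are invented stand-ins, not the HadCRUT4 and CO2 records, so the sketch only shows the mechanics:

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Invented series: x drives y with a one-year lag.
rng = np.random.default_rng(0)
x = rng.normal(size=62)
y = np.empty_like(x)
y[0] = 0.0
y[1:] = 0.8 * x[:-1] + rng.normal(scale=0.5, size=61)

# Column 0 is the series to be explained, column 1 the candidate cause.
data = np.column_stack([y[2:], x[2:]])
results = grangercausalitytests(data, maxlag=2)   # F-tests for lags 1 and 2 in the returned dict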

One of the consequences of autocorrelation, if significant, is that the iid requirement of regressions of Y with other variables is violated. Then the R-square and other statistics from the analysis may be suspect.

Autocorrelation can also occur along a borehole. Data derived from wireline logs will often show autocorrelation. A regression of porosity in samples on density may then seem more significant, or more successful, than the real relationship warrants. A similar problem arises in trend analysis. I once studied the porosity in several wells in the Groningen gas field. A second-degree trend surface had been calculated in order to predict porosity (dependent variable Z) at new well locations in the field. This involves the location coordinates X, Y, X^2, Y^2 and XY. The dependence among these "independent" variables tends to inflate the R-square (coefficient of determination) and results in false precision of the predicted porosity. Probably the best way to study autocorrelation in one or two dimensions is kriging, a system in which the autocorrelation is an advantage rather than a nuisance!

Cross-validation

With the various problems that may arise in a simple regression, it may help to apply cross-validation. This means, for example, that the regression is calculated on the first half of the data and Y is estimated for the second half, and vice versa. This usually shows that the predictive power of the regression equation is smaller than expected from the original result with all data pairs (see also the section on R-square).

Another trick to obtain insight into the predictive value is the "Jackknife".
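
As an illustration of both ideas, the sketch below (with invented data pairs) fits a straight line on one half of the data and checks the prediction on the other half, and then does a leave-one-out, jackknife-style check:

import numpy as np

def fit_line(x, y):
    # Least-squares fit of y = b0 + b1 * x.
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y.mean() - b1 * x.mean(), b1

# Invented data pairs, for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 60)
y = 3.0 + 0.8 * x + rng.normal(0.0, 2.0, 60)

# Split-half cross-validation: fit on one half, predict the other, and vice versa.
half = x.size // 2
for train, test in [(slice(None, half), slice(half, None)),
                    (slice(half, None), slice(None, half))]:
    b0, b1 = fit_line(x[train], y[train])
    r_out = np.corrcoef(b0 + b1 * x[test], y[test])[0, 1]
    print("out-of-sample correlation:", round(r_out, 3))

# Jackknife-style leave-one-out: refit with each pair omitted in turn.
loo_pred = []
for i in range(x.size):
    keep = np.arange(x.size) != i
    b0, b1 = fit_line(x[keep], y[keep])
    loo_pred.append(b0 + b1 * x[i])
print("leave-one-out correlation:", round(np.corrcoef(loo_pred, y)[0, 1], 3))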

Rank correlation

Relationships between variables can also be studied with other methods. Non-parametric correlation is sometimes useful and almost as powerful as the parametric method described above for linear correlation. Important tools are the "rank correlations", of which at least two flavours exist: Kendall's tau (Kendall, 1970) and Spearman's rho (Spearman, 1904). The data are not required to be measured on an interval scale; only the rank order is used. In Gaeapas a rank correlation is used in the analysis of variance, because of the nature of the data distributions involved. The rank correlations do not expect a linear relationship. If Y has a monotone, but non-linear, relationship with X, they can give a significant result, which might be less significant for the Pearson correlation.
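
For a monotone but non-linear relationship, the difference between the Pearson coefficient and the rank correlations is easy to demonstrate with scipy's spearmanr and kendalltau; the data are artificial:

import numpy as np
from scipy import stats

# A strictly increasing but strongly curved relationship.
x = np.arange(1.0, 21.0)
y = np.exp(0.4 * x)

print("Pearson r   :", round(stats.pearsonr(x, y)[0], 3))
print("Spearman rho:", round(stats.spearmanr(x, y)[0], 3))
print("Kendall tau :", round(stats.kendalltau(x, y)[0], 3))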

Lag regression

Lag correlation is used when the process in time that creates the Y variable is influenced by the X variable at an earlier time. An interesting example is the gas production in the Groningen gas field, presumably causing earthquakes that cause multi-billion euro damage to houses and infrastructure, or loss of revenue in case production is reduced to mitigate the effect. Here changes in production level can be related to the number and intensity of the quakes, although the exact causal relationship is enormously complex and insufficiently understood.

Having yearly data of production and of the number of earthquakes of magnitude >= 1.5 per year (1989 - 2017), a lag regression of this seismic activity on production gave the following results:

Lag    R-square    d.f.    Probability
0      0.1229      27      0.03115 *
1      0.2893      26      0.00158 **
2      0.1927      25      0.01010 *
3      0.1363      24      0.03174 *

It seems that the immediate seismic reaction to changes in production level is statistically significant at the 5% level, but too weak for prediction. By far the most important predictor appears to be the production in the year before. Earlier years also show statistical significance, but when a multiple regression is made with all the lags shown, the lag-1 regression is hardly trumped by the four-lag combination. So we settle for the lag-1 correlation for prediction.
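
A lag regression of this kind can be set up by shifting one series relative to the other, as in the sketch below; the two series are invented stand-ins with a built-in lag-1 dependence, not the actual Groningen production and earthquake data:

import numpy as np
from scipy import stats

def lag_regression(x, y, lag):
    # Regress y(t) on x(t - lag) and return R-square, degrees of freedom and p-value.
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    res = stats.linregress(x, y)
    return res.rvalue ** 2, x.size - 2, res.pvalue

# Invented yearly series standing in for production and quake counts (1989-2017).
rng = np.random.default_rng(0)
production = 30.0 + rng.normal(0.0, 5.0, 29)
quakes = np.empty(29)
quakes[0] = 15.0
quakes[1:] = 5.0 + 0.4 * production[:-1] + rng.normal(0.0, 2.0, 28)

for lag in range(4):
    r2, df, p = lag_regression(production, quakes, lag)
    print(f"lag {lag}: R-square = {r2:.3f}, d.f. = {df}, p = {p:.5f}")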

Two estimates are made: first, an expectation curve for earthquake frequency in 2018, based upon production in 2017; second, an expectation curve for, say, 2019, if production in 2018 were stopped completely. In both cases there is still a considerable chance of having a number of earthquakes of magnitude >= 1.5, as shown in the graphs below, but zero production suggests an important reduction.

A few caveats: there is strong autocorrelation in both the production sequence and the seismic sequence. This has not been taken care of, and it inflates the apparent number of degrees of freedom. Also, zero production is an extrapolation far below the minimum production in the dataset.

If the above analysis is sound, the results suggest that zero production just about halves the probability of having a given number of earthquakes, compared to maintaining the 2017 level of production. But zero earthquakes would require zero production during several years. The effect of natural earthquakes has not been considered here.
