# If I want to remain bluntly quantitative

### My editorial

I am still mining my database in order to create some kind of theoretical model for explaining the relative importance of renewable energies in a given society. Now, I am operating with two variables for measuring said importance. Firstly, it is the percentage of renewables in the final consumption of energy (https://data.worldbank.org/indicator/EG.FEC.RNEW.ZS). This is renewable energy put into the whole dish of energy that we, humans, use in ways other than feeding ourselves: driving around, air-conditioning, texting to girlfriends and boyfriends, launching satellites, waging war on each other and whatnot. The second estimate is the percentage of renewable energy in the primary production of electricity (https://data.worldbank.org/indicator/EG.ELC.RNEW.ZS). That one is, obviously, a much nicer and gentler variable, less entangled sociologically and biologically. These two are correlated with each other. In my database, they are Pearson-correlated at r = 0,676. This is a lot, for a Pearson product-moment correlation, and still it means that these two mutually explain their respective variances at more or less R2 = 0,676² = 0,456976. Yes, this is the basic meaning of that R2 coefficient of determination, which kind of comes along whenever I or someone else presents the results of quantitative regression. I take each explanatory variable in my equation, so basically each variable I can find on the right side, and I multiply it, for each empirical observation, by the coefficient of regression attached to this variable.

When I am done with multiplication, I do addition and subtraction: I sum up those partial products, and I subtract this sum, for each particular observation, from the actual value of the explained variable, or the one on the left side of the equation. What I get is a residual, so basically the part of the actually observed explained variable which remains unmatched by this sum of products ‘coefficient of regression times the value of the explanatory variable’. I take the arithmetical average of those residuals, and I have the general constant in the equations I present to you on this blog whenever I report the results of my quantitative tests. Once I have that general function, I trace it as a line, and I compute the correlation between this line and the actual distribution of my left-hand variable, the explained one. This correlation tells me how closely my theoretical line follows the really observed variable.
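The recipe above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the numbers are made up, there is just one explanatory variable, and the names (`pearson_r`, `leftovers`, `fitted`) are hypothetical, not anything taken from the database behind this post.

```python
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

# Made-up data: x is the explanatory variable, y the explained one.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.6]

# Coefficient of regression: the least-squares slope for one explanatory variable.
mx, my = mean(x), mean(y)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)

# For each observation: actual y minus 'coefficient times explanatory variable'.
leftovers = [yi - slope * xi for xi, yi in zip(x, y)]

# The general constant is the arithmetical average of those leftovers.
constant = mean(leftovers)

# The theoretical line, and its correlation with the observed y.
fitted = [constant + slope * xi for xi in x]
r = pearson_r(fitted, y)
r_squared = r ** 2

# Cross-check: the share of the variance of y that the line accounts for.
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - my) ** 2 for yi in y)
r_squared_check = 1 - ss_res / ss_tot

print(round(r_squared, 6), round(r_squared_check, 6))
```

Both printed numbers should coincide: in a least-squares fit with a constant, the squared correlation between the line and the observed variable is the same thing as the share of variance explained.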

Now, why the heck raise this coefficient of correlation to the power of two, so why convert the ‘r’ into the capital ‘R2’? Well, correlations, as you probably know, can be positive or negative in their sign. What I want, though, is kind of a universal indicator of how close I landed to real life in my statistical simulations. As saying something like ‘you are minus 30% away from reality’ sounds a bit awkward, and as you cannot really have a negative distance, the good idea is to get rid of the minuses by raising them to the power of two. Any even power would kill the minus, by the way. Still, the square has two things going for it. Firstly, in a regression with a constant, the squared correlation is exactly the share of variance in the explained variable that the model accounts for, which no other even power gives. Secondly, coefficients of correlation are fractions, whose absolute value is never greater than one. If you raise a decimal smaller than one to the power of 22, you get something so tiny you even have problems thinking about it without having smoked something interesting beforehand. Thus, R2 is simply handier than r raised to the power of 22, with no prejudice to the latter.
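To see the arithmetic at work, here is a tiny Python check with the r = 0,676 quoted earlier. The choice of powers 2, 4 and 22 is just for illustration.

```python
r = 0.676  # the Pearson correlation reported earlier in this post

for power in (2, 4, 22):
    # An even power gives the same result whatever the sign of r,
    # and the higher the power, the tinier the number.
    print(power, (-r) ** power, r ** power)

r_squared = r ** 2
tiny = r ** 22
```

The square stays readable as a share of something, whilst the 22nd power of the same correlation lands far below one thousandth.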

Oh, I started lecturing, kind of just by letting myself be carried away. All right, so this is going to be a didactic one. I don’t mind: when I write as if I were lecturing you, I am lecturing myself, as a matter of fact, and it is always a good thing to learn something valuable from someone interesting, for one. For two, this blog is supposed to have educational value. Now, the good move consists in asking myself what exactly I want to lecture myself about. What kind of surprise in empirical reality made me behave in this squid-like way, i.e. release a cloud of ink? By experience, I know that what makes me lecture is cognitive dissonance, which, in turn, pokes its head out of my mind when I experience too much incoherence in the facts of life. When I have smeared the jam of my understanding over too big a toast of reality, I feel like adding more jam on the toast.

As I am wrestling with those shares of renewable energies in the total consumption of energy, and in the primary generation of electricity, what I encounter are very different social environments, with very different shares of renewables in their local cocktails of energy, and those shares seem not to scale exactly with the big, respectable socio-economic traits, like GDP per capita or capital stock per capita. These idiosyncrasies go as far as looking like paradoxes, in some instances. In Europe, we have practically no off-grid electricity from renewable sources. In Africa or in Asia, they have plenty. Building a power source off-grid means, basically, that the operator of the power grid doesn’t give a s*** about it and you are financially on your own. Hence, what you need is capital. Logically, there should be more off-grid power systems in regions with lots of capital per capita, and with a reasonably high density of population. Lots of capital per capita times lots of capita per square kilometre gives lots of money to finance any project. Besides, lots of capital per capita is usually correlated with a better education in the average capita, so with a better understanding of how important it is to have reliable and clean, local sources of energy. Still, it is exactly the opposite that happens: those off-grid, green power systems tend to pop up where there is much less capital per capita and where the average capita has much poorer access to education.

At the brutal bottom line, it seems that what drives people to install solar farms or windfarms in their vicinity is the lack of access to electricity from power grids – so the actual lack of, and need for, electricity – much more than the fact of being wealthy and well educated. Let’s name it honestly: poverty makes people figure out, and carry out, many more new things than wealth does. I already have in my database one variable very closely related to poverty: it is food deficit, at the very core of being poor. Dropping food deficit into a model related to very nearly any socio-economic phenomenon instantaneously makes those R2’s ramp up. Still, a paradox emerges: when I put food deficit, or any other variable reflecting true poverty, into a quantitative model, I can test it only on those cases where this variable takes a non-null value. Where food deficit is zero, I have a null value associated with non-null values in other variables, and such observations are automatically cut out of my sample. With food deficit in an equation, empirical tests yield their results only regarding those countries and years where and when local populations actually starved. I can test with Ethiopia, but I cannot test with Belgium. What can I do in such a case? Well, this is where I can use that tool called ‘control variable’. If dropping a variable into an equation proves kind of awkward, I can find a way around it by keeping that variable out of the equation but kind of close by. This is exactly what I did when I tested some of my regressions in various classes of food deficit (see, for example, ‘Cases of moderate deprivation’).
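Here is a small Python sketch of that trade-off. It rests on an assumption of mine: that the zero-deficit observations get cut because the variable enters the equation as a natural logarithm, and ln(0) is undefined. The rows, the labels ‘A’ to ‘D’ and the two class names are all hypothetical.

```python
import math

# Hypothetical rows: (label, food deficit, share of renewables in %).
rows = [
    ("A", 250.0, 90.0),
    ("B", 0.0, 20.0),
    ("C", 40.0, 35.0),
    ("D", 0.0, 12.0),
]

def log_or_none(value):
    """Natural logarithm, or None where it is undefined (value <= 0)."""
    return math.log(value) if value > 0 else None

# Food deficit INSIDE the equation: only rows with a defined logarithm survive.
inside = [
    (label, log_or_none(deficit), math.log(share))
    for label, deficit, share in rows
    if log_or_none(deficit) is not None
]

# Food deficit kept OUTSIDE, as a control variable: every row stays,
# and the variable only decides which class a row falls into.
def deficit_class(deficit):
    return "no deficit" if deficit == 0 else "some deficit"

classes = {}
for label, deficit, share in rows:
    classes.setdefault(deficit_class(deficit), []).append((label, math.log(share)))

print([label for label, _, _ in inside])
print({name: [label for label, _ in members] for name, members in classes.items()})
```

In the first variant, two of the four made-up rows vanish from the sample. In the second one all four stay, and the equation can then be tested separately in each class.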

Good, so I have that control variable, and different versions of my basic model, according to the interval of values in said control variable. I kind of have two or more special cases inside a general theoretical framework. The theory I can make out of it is basically that there are some irreducible idiosyncrasies in my reality. Going 100% green in a local community in Africa or in Asia is so different from going the same way inside the European Union that it risks being awkward to mix those two in the same cauldron. If I want that SEAP, or Sustainable Energy Action Plan (see the website of the Global Covenant of Mayors for more information), and I want it to be a truly good SEAP, it has to be based on different socio-economic assumptions according to the way local communities work. One SEAP for those who starve more or less, and have problems with basic access to electricity. Another SEAP for the wealthy and well-educated ones, whose striving to go 100% green is driven by cultural constructs rather than by bare needs.

Right, it is time to be a bit selfish, thus to focus on my social environment, i.e. Poland and Europe in general, where no food deficit is officially reported at the national scale. I take that variable from the World Bank – the percentage of renewable energy in the primary production of electricity (https://data.worldbank.org/indicator/EG.ELC.RNEW.ZS) – and I name it ‘%RenEl’, and I am building a model of its occurrence. It is quite hard to pin down the correlates of this variable as such. Each country seems to have a lot of history under its belt as regards its power system, and therefore it is hard to capture those big, macroeconomic connections. Interestingly, its strongest correlation is with that other metric of energy structure, namely the percentage of renewables in the final consumption of energy (https://data.worldbank.org/indicator/EG.FEC.RNEW.ZS), or ‘%Ren’ for short. This is logical: the ways we produce energy are linked to the ways we consume it. Still, I have a basic intuition that in relatively non-starving societies people have energy to think, so they have labs, and they make inventions in those labs, and they kind of speed up their technological change with those inventions. Another intuition that I have regarding my home continent is that we have big governments, with lots of liquid assets in their governmental accounts. Those liquid public assets are technically the residual difference between gross public debt and net public debt. Hence, I take the same formal hypothesis I made in ‘Those new SUVs are visibly purchased with some capital rent’ and I pepper it with the natural logarithms of, respectively, the number of patent applications per million people (‘PatApp/Pop’), and the share of liquid public assets in the GDP (‘LPA/GDP’). That makes me state formally that

ln(%RenEl) = a1*ln(GDP per capita) + a2*ln(Pop) + a3*ln(Labsh) + a4*ln(Capital stock per capita) + a5*ln(%Ren) + a6*ln(LPA/GDP) + a7*ln(PatApp/Pop) + residual
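For readers who like to see the mechanics, below is a minimal Python sketch of how an equation of this log-log type can be estimated by least squares. Everything in it is an assumption made for illustration: the data are synthetic, only two of the right-hand variables are used as stand-ins, a constant is included in the fit, and the helper names `solve` and `ols` are hypothetical. It is not the procedure of the statistical package used for this post.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda idx: abs(m[idx][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row_idx in range(col + 1, n):
            factor = m[row_idx][col] / m[col][col]
            for k in range(col, n + 1):
                m[row_idx][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def ols(columns, y):
    """Least squares with a constant; returns [constant, a1, a2, ...]."""
    design = [[1.0] + list(values) for values in zip(*columns)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

random.seed(0)
n = 200
gdp = [random.uniform(1_000, 50_000) for _ in range(n)]  # stand-in for GDP per capita
ren = [random.uniform(1, 90) for _ in range(n)]          # stand-in for %Ren

# Synthetic %RenEl, built with known exponents 0.3 and 0.8 plus some noise.
ren_el = [
    math.exp(0.5 + 0.3 * math.log(g) + 0.8 * math.log(s) + random.gauss(0, 0.1))
    for g, s in zip(gdp, ren)
]

coeffs = ols(
    [[math.log(g) for g in gdp], [math.log(s) for s in ren]],
    [math.log(v) for v in ren_el],
)
print([round(c, 3) for c in coeffs])  # constant, then the two estimated exponents
```

Because the synthetic data were generated with exponents 0.3 and 0.8, the two estimated coefficients should land close to those values. With real data, of course, nobody hands me the true exponents in advance.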

When I test this equation in my database, I can see an interesting phenomenon. The fact of adding the ln(PatApp/Pop) factor to my equation, taken individually, adds some 2 percentage points to the overall explanatory power of my model. Without the ln(PatApp/Pop), my equation is being tested on n = 2 068 observations and yields a coefficient of determination equal to R2 = 0,492. Then, I drop a pinch of ln(PatApp/Pop) into my soup, it reduces my sample to n = 1 089 observations, but pumps up the coefficient of determination to R2 = 0,515. Yet, the ln(PatApp/Pop) is not really robustly correlated with my ln(%RenEl): the p-value attached to its coefficient is p = 0,503. It means that if ln(PatApp/Pop) had no real connection with ln(%RenEl) at all, I would still have about a 50% chance of seeing a coefficient at least as strong as the one I got, by pure accident. This is one of those cases when I can see a pattern, but I cannot guess what exactly it is that I see.

If I want to remain bluntly quantitative, which sometimes pays off, I take those patent applications out of the equation and park them close by, as a control variable. I make classes in it, or rather it is my software, Wizard for MacOS, that does, and I test my equation without ln(PatApp/Pop)[1] in those various classes, and I look at the values my R2 coefficient takes in each of them. Here are the results:

Class #1: no patent applications observed >> 979 observations yield R2 = 0,490

Class #2: less than 3,527 patent applications per million people >> 108 observations yield R2 = 0,729

Class #3: between 3,527 and 23,519 patent applications per million people >> 198 observations and R2 = 0,427

Class #4: 23,519 < PatApp/Pop < 77,675   >> 267 observations and R2 = 0,625

Class #5: 77,675 < PatApp/Pop < 160,682  >> 166 observations and R2 = 0,697

Class #6: 160,682 < PatApp/Pop < 290,87  >> 204 observations and R2 = 0,508

Class #7: 290,87 < PatApp/Pop < 3 276,584 >> 146 observations and R2 = 0,965

Now, I can see there are two sub-samples in my sample – countries with a really low rate of invention and those with an extremely high one – where the equation works much better than anywhere else (a much higher R2 than in other classes). This is the job a control variable can do: it can serve to define special cases and to refine my hypotheses. Now, I can say, for example, that when the local rate of patentable invention in a society is really high, I can make a very plausible model of them going 100% green in their electricity output.
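The general idea of testing one equation in separate classes of a control variable can be sketched in Python as follows. This is a toy under stated assumptions: the observations are made up, there is only one explanatory variable, and the class boundaries (zero, below 10, 10 and above) are hypothetical, not the ones my software produced.

```python
from statistics import mean

def r_squared(x, y):
    """R2 of a one-variable least-squares line (with a constant) fitted to x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Made-up observations: (control variable, explanatory variable, explained variable).
observations = [
    (0.0, 1.0, 2.0), (0.0, 2.0, 1.0), (0.0, 3.0, 4.0), (0.0, 4.0, 3.0),
    (2.0, 1.0, 1.1), (3.0, 2.0, 2.0), (1.5, 3.0, 2.9), (2.5, 4.0, 4.2),
    (50.0, 1.0, 3.0), (80.0, 2.0, 2.0), (60.0, 3.0, 3.5), (90.0, 4.0, 2.5),
]

def control_class(value):
    """Hypothetical classes: zero, positive but below 10, and 10 or more."""
    if value == 0:
        return "none observed"
    return "low" if value < 10 else "high"

# The control variable stays out of the equation and only sorts observations into classes.
groups = {}
for control, x, y in observations:
    xs, ys = groups.setdefault(control_class(control), ([], []))
    xs.append(x)
    ys.append(y)

# The same equation is then tested separately inside each class.
results = {name: (len(xs), r_squared(xs, ys)) for name, (xs, ys) in groups.items()}
for name, (count, r2) in results.items():
    print(f"{name}: {count} observations, R2 = {r2:.3f}")
```

In this toy sample, the same simple equation fits almost perfectly in one class and not at all in another, which is the kind of contrast a control variable is meant to bring out.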

[1] So it is ln(%RenEl) = a1*ln(GDP per capita) + a2*ln(Pop) + a3*ln(Labsh) + a4*ln(Capital stock per capita) + a5*ln(%Ren) + a6*ln(LPA/GDP) + residual