The difference jumps to my eye, but what does it mean?

My editorial

I hope I am on the right track with that idea that the maturing of markets can be represented as incremental change in the density of population. This is what I came up with yesterday, in my research update in French (see ‘Le mûrissement progressif du marché, ça promet’). I am still trying to sort it out, intellectually. This is one of those things, which just seem to work but you don’t exactly know how they do it. I think I need some time and some writing in order to develop a nice, well-rounded, intellectual crystallization of that concept. It all started, I think, as I multiplied tests on different quantitative models to explain incremental changes in the value of those two variables I am currently interested in: the percentage of renewable energy in the primary production of electricity (https://data.worldbank.org/indicator/EG.ELC.RNEW.ZS ), and the percentage of renewables in the final consumption of energy (https://data.worldbank.org/indicator/EG.FEC.RNEW.ZS ).

With the software I have, that Wizard for MacOS (and this is really not heavy artillery, as statistical software goes), testing models comes down to quick clicking. Setting up and testing a model, or an equation, with that tool is much faster than my writing about it. This is both the blessing and the curse of modern technology: it does things much faster than we can wrap our minds around them. In order to understand fully this idea that I came up with yesterday, I need to reconstruct, more or less, the train of my clicking. That should help me in reconstructing the train of my thinking. So, yesterday, I was trying to develop, once again, on that idea of the Wasun, or virtual currency connected to the market of renewable energies. I assumed that empirical exploration of the question would consist in taking the same equations I have been serving you on my blog for the last few weeks, and inserting the supply of money as one more explanatory variable on the right side in those equations. It kind of worked, but just kind of: adding the supply of money, as a percentage of the GDP, to a model explaining the percentage of renewables in the final consumption of energy, for instance, added some explanatory power to that model, i.e. it pumped the R2 coefficient of determination up. Still, the correlation attached to the supply of money, in that model, did not seem very robust. With a p-value like 0,3 or 0,4, depending on the exact version of the equation I was testing, it turned out that even if the supply of money had no connection whatsoever with the percentage of renewable energies, I would still have like a 30 or 40% chance of getting a coefficient at least as strong as the one I got. That p-value is the probability of seeing such a result under the null hypothesis, i.e. under no correlation whatsoever between variables.

Interestingly, I had the same problem with a structural variable I was using as well: the density of population. I routinely use the density of population as a quantitative estimator of difference between social structures. I have that deeply rooted intuition that societies displaying noticeable differences in their densities of population are very different in other respects as well. Being around in a certain number in a given territory, and thus having, on average, a given surface of that territory per person, is, for me, a fundamental trait of any society. Fundamental or not, it behaved in those equations of mine in the same way the supply of money did: it added to the coefficient of determination R2, but it refused to establish robust correlations. Just for you, my readers, to understand the position I was in, as a researcher: imagine that you discover some kind of super cool spice, which can radically improve the taste of a sauce. You know it does, but you have one tiny little problem: you don’t know how much of that spice, exactly, you should add to the sauce, and you know that if you add too much or too little, the sauce will taste much worse. Imagined that? Good. Now, imagine you have two such spices, in the same recipe. Bit of a cooking challenge, isn’t it?

What you can do, and what great cooks allegedly do, is to prepare a few alternative sauces, each with the same recipe, but with a different, and precisely defined amount of the spice under investigation. As you taste each of those alternative sauces, you can discover the right amount of spice to add. If you are really good at it, you can even discover the gradient of taste, i.e. the incremental change in taste that has been brought by a given incremental change in the quantity of one particular ingredient. In quantitative research, I call it using a ‘control variable’ (statisticians would also call it stratification, or subsample analysis): instead of putting a variable right in the equation, we keep it out, we select different subsets of empirical data, each characterized by a different class of value in this particular variable, and we test the equation, without the variable in question, in those different subsets. The mathematical idea behind this approach is that we never know for sure whether our way of counting and measuring things is accurate and adequate to the changes and differences we can observe in those things. Take distance, for example: sometimes it is better to use kilometres, but sometimes even a centimetre is too much. Sometimes, small incremental changes in a measurable phenomenon induce too much complexity for us to crystallize any intelligible thought about it. In statistics, it manifests as a relatively high p-value, i.e. as a result that could quite easily turn up even if the null hypothesis were true. Taking that complexity out of the equation and simplifying it into a few big chunks of reality can help our understanding.
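The tasting of alternative sauces described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the data is synthetic, the names `r_squared` and `r_squared_by_class` are made up for the occasion, and the classes are cut at plain quantiles, which may or may not be what Wizard does internally.

```python
import numpy as np

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R2."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - residuals.var() / y.var()

def r_squared_by_class(X, y, control, n_classes=6):
    """Cut the sample into equal-sized classes of `control` and fit inside each."""
    edges = np.quantile(control, np.linspace(0, 1, n_classes + 1))
    # Only the interior edges are passed, so labels run from 0 to n_classes - 1.
    labels = np.digitize(control, edges[1:-1])
    results = []
    for k in range(n_classes):
        mask = labels == k
        results.append((k + 1, int(mask.sum()), r_squared(X[mask], y[mask])))
    return results

# Synthetic data (an assumption, not my real database): the link between
# money and renewables is built to get stronger as density grows.
rng = np.random.default_rng(0)
n = 6000
density = rng.lognormal(mean=4.0, sigma=1.5, size=n)
money = rng.normal(size=n)
renewables = (np.log(density) / 4.0) * money + rng.normal(size=n)

results = r_squared_by_class(money.reshape(-1, 1), renewables, density)
for cls, count, r2 in results:
    print(f"class {cls}: n={count}, R2={r2:.3f}")
```

The control variable never enters the equation itself; it only decides which observations go into which sauce. Whatever changes in R2 from class to class is then read as the gradient of taste.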

Anyway, I had two spices: the density of population, and the supply of money. I had to take one of them out of the equation and treat it as a control variable. As I am investigating the role of monetary systems in all that business of renewable energies, it seemed just stupid to take the supply of money out of the equation. Mind you: it seemed, which does not mean it was. There is a huge difference between seeming to be stupid and being really stupid. Anyway, I decided to keep the supply of money in, whilst taking the density of population out and just controlling for it, i.e. testing the equation in different classes of said density. For a reason I do not know, when I ask my statistical software to define classes in a control variable, it makes sextiles (spelled jointly!), i.e. it divides the whole sample into six subsets of roughly the same size, 1577 or 1578 observations each in the case of the actual database I am using in that research. Why six? Dunno… Why not, after all?

So I had those sextiles in the density of population, and I had my equation, regarding the percentage of renewable energies in the final consumption of energy, and I had that velocity of money in it, and I tested inside each sextile. Interesting things happened. In the least dense populations, the equation barely had any explanatory power at all. As my equation was climbing the ladder of density in population, it gained explanatory power as well. Still, there is an interval of density, where that explanatory power fell again, just to soar in the densest populations. Those changes in the coefficient of determination R2 were accompanied by visible changes in the sign and the magnitude of the regression coefficient attached to the velocity of money. The same happened in other explanatory variables as well. My equation, as I was trying to wrap my mind around all that, works differently in different types of populations, regarding their density. It works the most logically, in economic terms, in the densest populations. The percentage of renewable energy in the final basket of consumption depends nicely and positively on the accumulation of production factors and on the supply of money. The more developed the local economic system, the better are the chances of going greener and greener in that energy mix.

In economics, demographic variables tend to be considered as a rich and weird cousin. The cousin is rich, so they cannot be completely ignored, but the cousin is kind of a weirdo as well, not really the kind you would invite risk-free to a wedding, so we don’t really invite them a lot. This nice metaphor comes down to saying that I tried to find a purely economic interpretation for those changes I observed when controlling for the density of population. My roughest guess was that money matters the most when we have really a lot of people around us and a lot of transactions to make (or avoid). With hardly any people around me (around is another simplification here, it can be around via Internet), money tends to have less importance. That’s logical. In other words, the velocity of money depends on the degree of development in the market we consider. The more developed a market is, the more transactions there are to finance, and the more money we need in the system to make that market work. Right, this works for any market, regardless of whether we are talking about long-range missiles, refrigerators or spices. Now, how does it matter for this particular market, the market of energy? Please, notice: I used the ‘how?’ question instead of ‘why?’. Final consumption of energy is a lifestyle and a social structure doing its job. If the factors determining the percentage of renewable energies in said final consumption work differently in different classes of density in the population, those classes probably correspond to different lifestyles and different types of local social structures.

I imagined a local community, where people progressively transition towards the idea of renewable energies. In the beginning, there are just a few enthusiasts, who, with time, turn into a few hundred, then a few thousand and so on. From then on, I unhinged my mind a bit. I likened the local community at the starting point, when nobody gives a s*** about green energy, to a virgin land. As new settlers come, new social relations emerge, and new opportunities to transact and pay turn up. Each person, who starts actively to use renewable energies, is like a pioneering settler coming to that virgin land. The emergence of a new market, like that of renewable energy, in an initially indifferent population, is akin to a growing density in a population of settlers. So, I further speculated, the nascence and development of a new market can be represented as a growing density in the population of customers. I know: at this point, it could be really hard to follow me. I even have trouble following myself. After all, if there are like 150 people per square kilometre in a population, according to my database, there are just them in that square kilometre, and no one else. It is not like they are here, those 150 pioneers, and a few hundred others, who are there, but remain kind of passive. Here, you have an example of the kind of mindfuck a researcher deals with all the time. Data exploration is great, but data tends to have sharp edges. There is a difference, regarding the role of money in going green in our energies, between a population of 100 per km2 and a population of 5000 per km2. The difference is there, it jumps to my eye, but what does it mean? How does it work? My general intuition is that the density of population, as control variable, controls for the intensity of social interactions (i.e. interactions per unit of time). The degree of maturity in a market is the closest economic meaning I can associate with that intensity of interactions, but there could be something else.

If I want to remain bluntly quantitative

I am still mining my database in order to create some kind of theoretical model for explaining the relative importance of renewable energies in a given society. Now, I am operating with two variables for measuring said importance. Firstly, it is the percentage of renewables in the final consumption of energy (https://data.worldbank.org/indicator/EG.FEC.RNEW.ZS). This is renewable energy put into the whole dish of energy that we, humans, use in ways other than feeding ourselves: driving around, air-conditioning, texting to girlfriends and boyfriends, launching satellites, waging war on each other and whatnot. The second estimate is the percentage of renewable energy in the primary production of electricity (https://data.worldbank.org/indicator/EG.ELC.RNEW.ZS). That one is, obviously, much nicer and gentler a variable, less entangled sociologically and biologically. These two are correlated with each other. In my database, they are Pearson-correlated at r = 0,676. This is a lot, for a Pearson product-moment correlation, and still it means that these two mutually explain less than half of their respective variances: R2 = 0,676² = 0,456976. Yes, this is the basic meaning of that R2 coefficient of determination, which kind of comes along whenever I or someone else presents the results of quantitative regression. Here is how it comes about. I take each explanatory variable in my equation, so basically each variable I can find on the right side, and I multiply it, for each empirical observation, by the coefficient of regression attached to this variable.

When I am done with multiplication, I do addition and subtraction: I sum up those partial products, and I subtract this sum, for each particular observation, from the actual value of the explained variable, or the one on the left side of the equation. What I get is a residual constant, so basically the part of the actually observed explained variable, which remains unmatched by this sum of products ‘coefficient of regression times the value of the explanatory variable’. I make an arithmetical average out of those residuals, and I have the general constant in those equations I present to you on this blog whenever I report the results of my quantitative tests. Once I have that general function, I trace it as a line, and I compute the correlation between this line, and the actual distribution of my left-hand variable, the explained one. This correlation tells me how closely my theoretical line follows the really observed variable.
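For those who prefer to see the recipe run rather than read it, here is a minimal sketch in Python. The data is synthetic and the variable names are my own assumptions; the regression itself is plain ordinary least squares with a constant, fitted with `numpy.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (an assumption): two explanatory variables and some noise.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

# Ordinary least squares with a constant term.
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
constant, a1, a2 = coef

# Multiplication, then addition: the sum of partial products.
partial_sum = a1 * x1 + a2 * x2
# Subtraction: what the partial products leave unmatched in each observation.
leftover = y - partial_sum
# The arithmetical average of those leftovers is the general constant.
mean_leftover = leftover.mean()

# The theoretical line, and its correlation with the observed variable.
fitted = mean_leftover + partial_sum
r = np.corrcoef(fitted, y)[0, 1]
r2_from_correlation = r ** 2
r2_from_variance = 1.0 - (y - fitted).var() / y.var()

print(constant, mean_leftover)
print(r2_from_correlation, r2_from_variance)
```

Both pairs of printed numbers should agree up to rounding: the averaged leftover is the constant of the regression, and the squared correlation between the line and the data is the same thing as the share of variance explained.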

Now, why the heck raise this coefficient of correlation to the power two, so why convert the ‘r’ into the capital ‘R2’? Well, correlations, as you probably know, can be positive or negative in their sign. What I want, though, is kind of a universal indicator of how close I landed to real life in my statistical simulations. As saying something like ‘you are minus 30% away from reality’ sounds a bit awkward, and as you cannot really have a negative distance, the good idea is to get rid of the minuses by raising them to the power two. As far as killing the minus goes, any even power would do, by the way: the power 22, for instance. Only the coefficients of correlation are fractions, whose absolute value is never greater than one. If you raise a decimal smaller than one to the power 22, you get something so tiny you even have problems thinking about it without having smoked something interesting beforehand. Besides, the power two is the one that reads directly as the share of variance explained, which is the meaning I gave to R2 a few lines ago. Thus, the power two is simply handier than the power 22, with no prejudice to the latter.

Oh, I started doctoring, kind of just by letting myself be carried away. All right, so this is going to be a didactic one. I don’t mind: when I write as if I were doctoring you, I am doctoring myself, as a matter of fact, and it is always a good thing to learn something valuable from someone interesting, for one. For two, this blog is supposed to have educational value. Now, the good move consists in asking myself what exactly I want to doctor myself about. What kind of surprise in empirical reality made me behave in this squid-like way, i.e. release a cloud of ink? By experience, I know that what makes me start doctoring is cognitive dissonance, which, in turn, pokes its head out of my mind when I experience too much incoherence in the facts of life. When I have smeared the jam of my understanding over too big a toast of reality, I feel like adding more jam on the toast.

As I am wrestling with those shares of renewable energies in the total consumption of energy, and in the primary generation of electricity, what I encounter are very different social environments, with very different shares of renewables in their local cocktails of energy, and those shares seem not to be exactly scalable on kind of big, respectable socio-economic traits, like GDP per capita or capital stock per capita. These idiosyncrasies go as far as looking like paradoxes, in some instances. In Europe, we have practically no off-grid electricity from renewable sources. In Africa or in Asia, they have plenty. Building a power source off-grid means, basically, that the operator of the power grid doesn’t give a s*** about it and you are financially on your own. Hence, what you need is capital. Logically, there should be more off-grid power systems in regions with lots of capital per capita, and with a reasonably high density of population. Lots of capital per capita times lots of capita per square kilometre gives lots of money to finance any project. Besides, lots of capital per capita is usually correlated with better an education in the average capita, so with better an understanding of how important it is to have reliable and clean, local sources of energy. Still, it is exactly the opposite that happens: those off-grid, green power systems tend to pop up where there is much less capital per capita and where the average capita has much poorer an access to education.

At the brutal bottom line, it seems that what drives people to install solar farms or windfarms in their vicinity is the lack of access to electricity from power grids, so the actual lack and need of electricity, much more than the fact of being wealthy and well educated. Let’s name it honestly: poverty makes people figure out, and carry out, many more new things than wealth does. I already have in my database one variable, very closely related to poverty: it is food deficit, at the very core of being poor. Dropping food deficit into a model related to very nearly any socio-economic phenomenon instantaneously makes those R2’s ramp up. Still, a paradox emerges: when I put food deficit, or any other variable reflecting true poverty, into a quantitative model, I can test it only on those cases, where this variable takes a non-null value. Where food deficit is zero, I have a null value associated with non-null values in other variables, and such observations are automatically cut out of my sample. With food deficit in an equation, empirical tests yield their results only regarding those countries and years, where and when local populations actually starved. I can test with Ethiopia, but I cannot test with Belgium. What can I do in such a case? Well, this is where I can use that tool called ‘control variable’. If dropping a variable into an equation proves kind of awkward, I can find a way around it by keeping that variable out of the equation but kind of close by. This is exactly what I did when I tested some of my regressions in various classes of food deficit (see, for example, ‘Cases of moderate deprivation’).
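A tiny sketch of how such a cut can happen, assuming it comes from the variables entering the equation as natural logarithms (the logarithm of zero is not defined, so a zero in one variable takes the whole observation out). The six numbers below are hypothetical, invented just to show the mechanics, and are not taken from my database.

```python
import numpy as np

# Hypothetical values, for illustration only.
food_deficit = np.array([0.0, 0.0, 120.0, 310.0, 0.0, 45.0])
renewables = np.array([12.0, 30.0, 85.0, 90.0, 8.0, 60.0])

# Only observations with a strictly positive food deficit have a logarithm.
usable = food_deficit > 0
log_deficit = np.log(food_deficit[usable])
log_renewables = np.log(renewables[usable])

print(int(usable.sum()), "of", len(food_deficit), "observations survive")
```

The observations with zero deficit are perfectly good data about renewables, and yet they never reach the regression. Parking food deficit outside the equation, as a control variable, keeps them in play as a class of their own.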

Good, so I have that control variable, and different versions of my basic model, according to the interval of values in said control variable. I kind of have two or more special cases inside a general theoretical framework. The theory I can make out of it is basically that there are some irreducible idiosyncrasies in my reality. Going 100% green in a local community in Africa or in Asia is so different from going the same way inside the European Union that it risks being awkward to mix those two in the same cauldron. If I want that SEAP, or Sustainable Energy Action Plan (see the website of the Global Covenant of Mayors for more information), and I want it to be truly good a SEAP, it has to be based on different socio-economic assumptions according to the way local communities work. One SEAP for those, who starve more or less, and have problems with basic access to electricity. Another SEAP for the wealthy and well-educated ones, whose striving for going 100% green is driven by cultural constructs rather than by bare needs.

Right, it is time to be a bit selfish, thus to focus on my social environment, i.e. Poland and Europe in general, where no food deficit is officially reported at the national scale. I take that variable from the World Bank, the percentage of renewable energy in the primary production of electricity (https://data.worldbank.org/indicator/EG.ELC.RNEW.ZS), I name it ‘%RenEl’, and I am building a model of its occurrence. It is quite hard to pin down the correlates of this variable as such. There seems to be a lot of history under the belt of each country as for their power systems and therefore it is hard to capture those big, macroeconomic connections. Interestingly, its strongest correlation is with that other metric of energy structure, namely the percentage of renewables in the final consumption of energy (https://data.worldbank.org/indicator/EG.FEC.RNEW.ZS), or ‘%Ren’ in acronym. This is logical: the ways we produce energy are linked to the ways we consume it. Still, I have a basic intuition that in relatively non-starving societies people have energy to think, so they have labs, and they make inventions in those labs, and they kind of speed up their technological change with those inventions. Another intuition that I have regarding my home continent is that we have big governments, with lots of liquid assets in their governmental accounts. Those liquid public assets are technically the residual difference between gross public debt and net public debt. Hence, I take the same formal hypothesis I made in ‘Those new SUVs are visibly purchased with some capital rent’ and I pepper it with the natural logarithms of, respectively, the number of patent applications per million people (‘PatApp/Pop’), and the share of liquid public assets in the GDP (‘LPA/GDP’). That makes me state formally that

           ln(%RenEl) = a1*ln(GDP per capita) + a2*ln(Pop) + a3*ln(Labsh) + a4*ln(Capital stock per capita) + a5*ln(%Ren) + a6*ln(LPA/GDP) + a7*ln(PatApp/Pop) + residual

When I test this equation in my database, I can see an interesting phenomenon. The fact of adding the ln(PatApp/Pop) factor to my equation, taken individually, adds some 2 percentage points to the overall explanatory power of my model. Without the ln(PatApp/Pop), my equation is being tested on n = 2 068 observations and yields a coefficient of determination equal to R2 = 0,492. Then, I drop a pinch of ln(PatApp/Pop) into my soup, it reduces my sample to n = 1 089 observations, but pumps up the coefficient of determination to R2 = 0,515. Yet, the ln(PatApp/Pop) is not really robustly correlated with my ln(%RenEl): the p-value attached to this correlation is p = 0,503. It means that even with no real connection at all between ln(PatApp/Pop) and ln(%RenEl), a coefficient at least as strong as the one I got would turn up about half of the time, by pure chance. This is one of those cases when I can see a pattern, but I cannot guess what exactly it is that I see.
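One caveat about those two R2’s: they come from two different samples, 2 068 observations against 1 089, so the jump mixes the effect of the new variable with the effect of the smaller sample. A hedged sketch of a like-for-like comparison follows, on synthetic data and with assumed names (`base`, `patents`); it is not what I actually ran in Wizard.

```python
import numpy as np

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R2."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1.0 - (y - A @ coef).var() / y.var()

# Synthetic data (an assumption): two always-observed regressors, plus a
# patent-like variable that is missing in a large part of the sample.
rng = np.random.default_rng(2)
n = 2000
base = rng.normal(size=(n, 2))
patents = rng.normal(size=n)
patents[rng.random(n) < 0.45] = np.nan
y = base @ np.array([0.8, -0.5]) + rng.normal(size=n)

observed = ~np.isnan(patents)

# Whole sample, without the patent variable.
r2_full_without = r_squared(base, y)
# Same equation, restricted to the observations where patents are known.
r2_common_without = r_squared(base[observed], y[observed])
# Patent variable added, on that same restricted sample.
r2_common_with = r_squared(
    np.column_stack([base[observed], patents[observed]]), y[observed]
)

print(r2_full_without, r2_common_without, r2_common_with)
```

Only the gap between the last two numbers can be credited to the added variable; the gap between the first two is pure change of sample.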

If I want to remain bluntly quantitative, which sometimes pays off, I take those patent applications out of the equation and park them close by, as a control variable. I make classes in that variable, or rather it is my software, Wizard for MacOS, that does, and I test my equation without ln(PatApp/Pop)[1] in it in those various classes, and I look at the values my R2 coefficient takes in each of those classes. Here are the results:

Class #1: no patent applications observed >> 979 observations and R2 = 0,490

Class #2: PatApp/Pop < 3,527 >> 108 observations and R2 = 0,729

Class #3: 3,527 < PatApp/Pop < 23,519 >> 198 observations and R2 = 0,427

Class #4: 23,519 < PatApp/Pop < 77,675 >> 267 observations and R2 = 0,625

Class #5: 77,675 < PatApp/Pop < 160,682 >> 166 observations and R2 = 0,697

Class #6: 160,682 < PatApp/Pop < 290,87 >> 204 observations and R2 = 0,508

Class #7: 290,87 < PatApp/Pop < 3 276,584 >> 146 observations and R2 = 0,965

(PatApp/Pop stands for patent applications per million people.)

Now, I can see there are two sub-samples in my sample, countries with a really low rate of invention and those with an extremely high one, where the equation really works much better than anywhere else (much higher an R2 than in other classes). This is the job a control variable can do: it can serve to define special cases and to refine my hypotheses. Now, I can say, for example, that when the local rate of patentable invention in a society is really high, I can make a very plausible model of them going 100% green in their electricity output.
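As a quick sanity check on the class results reported above, the numbers can be put in a small Python list and added up. The figures are exactly those from the list of classes (decimal commas written as points); only the labels are shortened by me.

```python
# (label, observations, R2) for each class, as reported above.
classes = [
    ("#1 no patent applications observed", 979, 0.490),
    ("#2 below 3.527", 108, 0.729),
    ("#3 3.527 to 23.519", 198, 0.427),
    ("#4 23.519 to 77.675", 267, 0.625),
    ("#5 77.675 to 160.682", 166, 0.697),
    ("#6 160.682 to 290.87", 204, 0.508),
    ("#7 290.87 to 3276.584", 146, 0.965),
]

# All classes together, and only those with patent applications observed.
total = sum(n_obs for _, n_obs, _ in classes)
with_patents = sum(n_obs for _, n_obs, _ in classes[1:])

# The two classes where the equation fits best.
ranked = sorted(classes, key=lambda c: c[2], reverse=True)
top_two = [label for label, _, _ in ranked[:2]]

print(total, with_patents, top_two)
```

The seven classes add up to the 2 068 observations of the equation without patents, and classes #2 to #7 add up to the 1 089 observations where patent applications are actually observed, so the classes cover the sample without overlap. The two best-fitting classes are #7 and #2, the extremely high and the really low rate of invention.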

[1] So it is ln(%RenEl) = a1*ln(GDP per capita) + a2*ln(Pop) + a3*ln(Labsh) + a4*ln(Capital stock per capita) + a5*ln(%Ren) + a6*ln(LPA/GDP) + residual