I am developing directly on the mathematical model I started to sketch in my last update, i.e. in Social roles and pathogens: our average civilisation. This is an extension of my earlier research regarding the application of artificial neural networks to simulate collective intelligence in human societies. I am digging down one particular rabbit-hole, namely the interaction between the prevalence of social roles, and that of disturbances to the social structure, such as epidemics, natural disasters, long-term changes in natural environment, radically new technologies etc.

Here comes to my mind, and thence to my writing, a mathematical model that generalizes some of the intuitions, which I already, tentatively, phrased out in my last update. The general idea is that society can be represented as a body of phenomena able to evolve endogenously (i.e. by itself, in plain human lingo), plus an external disturbance. Disturbance is anything that knocks society out of balance: a sudden, massive change in technology, a pandemic, climate change, full legalization of all drugs worldwide, Justin Bieber becoming the next president of the United States etc.

Thus, we have the social structure and a likely disturbance to it. Social structure is a set SR = {sr1, sr2, …, srm} of ‘m’ social roles, defined as combinations of technologies and behavioural patterns. The set SR can be stable or unstable. Some of the social roles can drop out of the game. Just checking: does anybody among my readers know what the craft of a town crier consisted in, back in the day? That guy was a local media industry, basically. You paid him for shouting your message in one or more public places in the town. Some social roles can emerge. Twenty years ago, the social role of an online influencer was associated mostly with black public relations, and today it is a regular occupation.

Disappearance or emergence of social roles is one plane of social change, and mutual cohesion between social roles is another one. In any relatively stable social structure, the existing social roles are culturally linked to each other. The behaviour of a political journalist is somehow coherent with the behaviour of politicians he or she interviews. The behaviour of a technician with a company of fibreoptic connections is somehow coherent with the behaviour of end users of those connections. Yet, social change can loosen the ties between social roles. I remember the early 1990s, in Poland, just after the transition from communism. It was an odd moment, when, for example, many public officers, e.g. mayors or ministers, were constantly experimenting with their respective roles. That very loose coupling of social roles is also frequently observable in start-up businesses. In many innovative start-ups, when you start a new job, you’d better be prepared for its exact essence and form taking shape as you work.

In all that story of social cohesion I essentially tap into swarm theory (see Correlated coupling between living in cities and developing science; Xie, Zhang & Yang 2002[1] ; Poli, Kennedy & Blackwell 2007[2] ; Torres 2012[3]; Stradner et al. 2013[4]). I assume that each given pair of social roles – e.g. the First Secretary of The Communist Party of China and a professional gambler in Las Vegas – can be coupled at three levels: random, fixed, and correlated. A relative loosening of social cohesion means that random coupling grows in relative importance, at the expense of the fixed, strictly ritualized coupling, and of the correlated one.

All in all, I hypothesise four basic types of social change in an established structure, under the impact of an exogenous disturbance. Scenario A assumes the loosening of cohesion between social roles, under the impact of an exogenous disturbance, with a constant catalogue of social roles in place. Scenario B implies that external stressor makes some social roles disappear, whilst scenarios C and D represent the emergence of new social roles, in two different perspectives. In Scenario C, new social roles are not coherent with the established ones, whilst Scenario D assumes such a cohesion.

Mathematically, I represent the whole thing in the form of a simple neural network, a multi-layer perceptron. I have written a lot about using neural networks as representation of collective intelligence, and now, I feel like generalising my theoretical stance and explaining two important points, namely what exactly I mean by a neural network, and why I apply a neural network instead of a stochastic model, such as e.g. an Ito drift.

A neural network is a sequence of equations, which can be executed in a loop, over a finite sequence ER = {er1, er2, …, ern} of ‘n’ experimental rounds, and that recurrent sequence of equations has a scalable capacity to learn. In other words, equation A takes input data, transforms it, feeds the result into equation B, which feeds into equation C etc., and, at some point, the result yielded by the last equation in the sequence gets fed into equation A once again, and the whole sequence runs another round A > B > C > … > A etc. In each consecutive experimental round erj, equation A taps into raw empirical data, and into the result of the previous experimental round erj-1. Another way of defining a neural network is to say that it is a general, logical structure able to learn by producing many specific instances of itself and observing their specific properties. Both definitions meet in the concept of logical structure and learning. It is quite an old observation in our culture that some logical structures, such as sequences of words, have the property of creating much more meaning than others. When I utter a sequence ‘Noun + Verb + Noun’, e.g. ‘I eat breakfast’, it has the capacity to produce more meaning than a sequence of the type ‘Verb + Verb + Verb’, e.g. ‘Eat read walk’. The latter sequence leaves more ambiguity, and the amount of that ambiguity makes that sequence of words virtually useless in daily life, save for online memes.

There are certain peg structures in the sequence of equations that make a neural network, i.e. some equations and sequences thereof which just need to be there, and without which the network cannot produce meaningful results. I am going to present the peg structure of a neural network, and then I will explain its parts one by one.

Thus, the essential structure is the following:
[Equation of random experimentation <=> ε* xi (er1)]
=> [Equation of aggregation <=> h = ∑ ε* xi (er1)]
=> [Equation of neural activation <=> NA = (a*e^(bh) ± 1) / (a*e^(bh) ± 1)]
=> {Equation of error assessment <=> e(er1) = [O(er1) – NA(er1)]*c}
=> {[Equation of backpropagation] <=> [Equation of random experimentation + acknowledgement of error from the previous experimental round] <=> [ε* xi (erj) + e(er1)]}
=> {Equation of aggregation <=> h = ∑ [ε* xi (erj) + e(er1)]} etc.

In that short sequential description, I combined mathematical expressions with formal logic. Brackets of different types – round (), square [] and curly {} – serve to delineate distinct logical categories. The arrowed symbols stand for logical connections, with ‘<=>’ being an equivalence, and ‘=>’ an implication. That being explained, I can start explaining those equations and their sequence. The equation of random experimentation expresses what an infant’s brain does: it learns, by trial and error, i.e. by mixing stimuli in various hierarchies and seeing which hierarchy of importance, attached to individual pieces of sensory data, works better. In an artificial neural network, random experimentation means that each separate piece of data is being associated with a random number ε between 0 and 1, e.g. 0,2 or 0,87 etc. A number between 0 and 1 can be interpreted in two ways: as a probability, or as the fraction of a whole. In the associated pair ε* xi (erj), the random weight 0 < ε < 1 can be seen as the hypothetical probability that the given piece xi of raw data really matters in the experimental round erj. From another angle, we can interpret the same pair ε* xi (erj) as an experiment: what happens when we cut the fraction ε out of the piece of data xi, i.e. when we take just a slice of that piece of data.

Random experimentation in the first experimental round er1 is different from what happens in consecutive rounds erj. In the first round, the equation of random experimentation just takes the data xi. In any following round, the same equation must account for the error of adjustment incurred in previous rounds. The logic is still the same: what happens if we assume a probability of 32% that error from past experiments really matters vs. the probability of 86%?

The equation of aggregation corresponds to the most elementary phase of what we could call making sense of reality, or to language. A live intelligent brain collects separate pieces of data into large semantic chunks, such as ‘the colour red’, ‘the neighbour next door’, ‘that splendid vintage Porsche Carrera’ etc. The summation h = ∑ ε* xi (erj) is such a semantic chunk, i.e. h could be equivalent to ‘the neighbour next door’.

Neural activation is the next step in the neural network making sense of reality. It is the reaction to the neighbour next door. The mathematical expression NA = (a*e^(bh) ± 1) / (a*e^(bh) ± 1) is my own generalisation of two commonly used activation functions: the sigmoid and the hyperbolic tangent. The ‘e’ symbol is the mathematical constant e, and ‘h’ in the expression e^(bh) is the ‘h’ chunk of pre-processed data from the equation of aggregation. The ‘b’ coefficient is usually a small integer, e.g. b = 2 in the hyperbolic tangent, and -1 in the basic version of the sigmoid function.

The logic of neural activation consists in combining a constant component with a variable one, just as a live nervous system has some baseline neural activity, e.g. the residual muscular tonus, which ramps up in the presence of stimulation. In the equation of hyperbolic tangent, namely NA = tanh = (e^(2h) – 1) / (e^(2h) + 1), the constant part is (e^2 – 1) / (e^2 + 1) = 0,761594156. Should my neural activation be the sigmoid, it goes like NA = sig = 1 / (1 + e^(-h)), with the constant root of 1 / (1 + e^(-1)) = 0,731058579.
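The two constants quoted above can be checked with a minimal Python sketch. The function names are illustrative assumptions, not part of the original model; the general form takes the two signs as explicit parameters, and the sigmoid is written out directly in its basic version.

```python
import math

def general_activation(h, a=1.0, b=2.0, sign_top=-1.0, sign_bottom=1.0):
    """Generalised form (a*e^(b*h) + sign_top) / (a*e^(b*h) + sign_bottom)."""
    core = a * math.exp(b * h)
    return (core + sign_top) / (core + sign_bottom)

def tanh_activation(h):
    # Hyperbolic tangent: a = 1, b = 2, minus on top, plus at the bottom.
    return general_activation(h, a=1.0, b=2.0, sign_top=-1.0, sign_bottom=1.0)

def sigmoid_activation(h):
    # Basic sigmoid, written directly as 1 / (1 + e^(-h)).
    return 1.0 / (1.0 + math.exp(-h))

# Evaluate both functions at h = 1 to recover the constants from the text.
print(tanh_activation(1.0))
print(sigmoid_activation(1.0))
```

Evaluated at h = 1, the two functions should return values that agree with 0,761594156 and 0,731058579 to the precision printed in the text.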

Now, let’s suppose that the activating neuron NA gets excited about a stream of sensory experience represented by input data: x1 = 0.19, x2 = 0.86, x3 = 0.36, x4 = 0.18, x5 = 0.93. At the starting point, the artificial mind has no idea how important are particular pieces of data, so it experiments by assigning them a first set of aleatory coefficients – ε1 = 0.85, ε2 = 0.70, ε3 = 0.08, ε4 = 0.71, ε5 = 0.20 – which means that we experiment with what happens if x3 was totally unimportant, x5 was hardly more significant, whilst x1, x2 and x4 are really important. Aggregation yields h = 0,19*0,85 + 0,86*0,70 + 0,36*0,08 + 0,18*0,71 + 0,93*0,20 = 1,10.

An activating neuron based on the hyperbolic tangent gets into a state of NA = tanh = (e^(2*1,10) – 1) / (e^(2*1,10) + 1) = 0.801620, and another activating neuron working with the sigmoid function thinks NA = sig = 1 / (1 + e^(-1,10)) = 0,7508457. Another experiment with the same data consists in changing the aleatory coefficients of importance and seeing what happens, thus in saying ε1 = 0.48, ε2 = 0.44, ε3 = 0.24, ε4 = 0.27, ε5 = 0.80 and aggregating h = 0,19*0,48 + 0,86*0,44 + 0,36*0,24 + 0,18*0,27 + 0,93*0,80 = 1,35. In response to the same raw data aggregated in a different way, the hyperbolic tangent says NA = tanh = (e^(2*1,35) – 1) / (e^(2*1,35) + 1) = 0,873571 and the activating neuron which sees reality as a sigmoid retorts: ‘No sir, absolutely not. I say NA = sig = 1 / (1 + e^(-1,35)) = 0,7937956’. What do you want: equations are like people, they are ready to argue even about 0,25 of difference in aggregate input from reality.
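The two experiments above can be replayed with a short Python sketch. It uses the input data and coefficients exactly as printed, i.e. rounded to two decimals, so it should be expected to match the figures in the text only approximately; the helper names are assumptions made for illustration.

```python
import math

x = [0.19, 0.86, 0.36, 0.18, 0.93]          # input data as printed in the text
weights_1 = [0.85, 0.70, 0.08, 0.71, 0.20]  # first set of random coefficients
weights_2 = [0.48, 0.44, 0.24, 0.27, 0.80]  # second set of random coefficients

def aggregate(data, weights):
    # Equation of aggregation: h = sum of (random weight * piece of data).
    return sum(w * d for w, d in zip(weights, data))

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

for label, weights in (("first", weights_1), ("second", weights_2)):
    h = aggregate(x, weights)
    # Print the aggregate input and both neural activations for this experiment.
    print(label, round(h, 4), round(math.tanh(h), 4), round(sigmoid(h), 4))
```

Small gaps between this output and the values quoted above are to be expected if the original spreadsheet worked on unrounded inputs.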

Those two neural reactions bear a difference, visible as gradients of response, or elasticities of response to a change in aggregate input. The activating neuron based on hyperbolic tangent yields a susceptibility of (0,873571 – 0,801620) / (1,35 – 1,10) = 0.293880075, which the sigmoid sees as an overreaction, with its well-pondered (0,7937956 – 0,7508457) / (1,35 – 1,10) = 0,175427218. Both quotients appear to be computed on the unrounded values of h, which is why they differ slightly from what the two-decimal figures shown here would give. That’s an important thing to know about neural networks: they can be more or less touchy in their reaction. Hyperbolic tangent produces more stir, and the sigmoid is more like ‘calm down’ in its ways.
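That difference in touchiness can be sketched as a finite-difference slope in Python. The sketch takes the rounded aggregate inputs 1.10 and 1.35 as given, so its slopes should land close to, but not exactly on, the two susceptibilities quoted above; `response_slope` is a hypothetical helper name.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def response_slope(activation, h_low, h_high):
    # Finite-difference slope: change in activation per unit change in h.
    return (activation(h_high) - activation(h_low)) / (h_high - h_low)

h_low, h_high = 1.10, 1.35  # rounded aggregate inputs from the text
slope_tanh = response_slope(math.tanh, h_low, h_high)
slope_sigmoid = response_slope(sigmoid, h_low, h_high)

# The hyperbolic tangent should show the steeper response of the two.
print(slope_tanh, slope_sigmoid)
```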

Whatever the neural activation NA produces, gets compared with a pre-set outcome O, or output variable. Error is assessed as e(erj) = [O(erj) – NA(erj)]*c, where ‘c’ is an additional factor, sometimes the local derivative of NA. It is simply useful to have c there: it can amplify (c > 1) or downplay (c < 1) the importance of local errors and therefore make the neural network more or less sensitive to making errors.
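A minimal Python sketch of that error assessment follows. The expected output of 0.5 and the aggregate input of 1.10 are example values chosen for illustration, and the function names are assumptions; the local derivative of the hyperbolic tangent is shown as one possible choice of c.

```python
import math

def assess_error(expected, activation, c=1.0):
    # Equation of error assessment: e = (O - NA) * c.
    return (expected - activation) * c

def tanh_derivative(h):
    # Local derivative of the hyperbolic tangent: 1 - tanh(h)^2.
    return 1.0 - math.tanh(h) ** 2

h = 1.10                      # example aggregate input
na = math.tanh(h)             # neural activation
plain = assess_error(0.5, na)                          # c = 1
scaled = assess_error(0.5, na, c=tanh_derivative(h))   # c = local derivative

print(plain, scaled)
```

Since the derivative of tanh stays below 1 for any non-zero h, using it as c downplays the error in this example.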

Before I pass to discussing the practical application of that whole logical structure to the general problem at hand, i.e. the way that a social structure reacts to exogenous disturbances, one more explanation is due, namely the issue of backpropagation of error, where said error is being fed forward. One could legitimately ask how the hell it is possible to backpropagate something whilst feeding it forward. Let’s have a look at real life. When I learn to play piano, for example, I make mistakes in my play, and I utilise them to learn. I learn by repeating over and over again the same sequence of musical notes. Repetition is an instance of feeding forward. Each consecutive time I play the same sequence, I move forward one more round. However, if I want that move forward to be really productive as regards learning, I need to review, each time, my entire technique. I need to go back to my first equation and run the whole sequence of equations again. I need to backpropagate my mistakes over the whole sequence of behaviour. Backpropagating errors and feeding them forward are two different aspects of the same action. I backpropagate errors across the logical structure of the neural network, and I feed them forward over consecutive rounds of experimentation.
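The whole loop of random experimentation, aggregation, activation, error assessment and feeding the error forward can be sketched in Python as below. This is only one reading of the sequence: it assumes the previous error is added to each weighted piece of data before summation, it uses the hyperbolic tangent for activation, and the function name, seed and expected output of 0.5 are illustrative choices.

```python
import math
import random

def run_rounds(x, expected, n_rounds, c=1.0, seed=42):
    rng = random.Random(seed)
    error = 0.0   # nothing to acknowledge before the first round
    history = []
    for _ in range(n_rounds):
        # Random experimentation and aggregation: each piece of data gets a
        # fresh random weight, and the previous error is added to each term.
        h = sum(rng.random() * xi + error for xi in x)
        na = math.tanh(h)                  # neural activation
        error = (expected - na) * c        # error assessment
        history.append({"h": h, "na": na, "error": error})
    return history

history = run_rounds([0.19, 0.86, 0.36, 0.18, 0.93], expected=0.5, n_rounds=5)
for record in history:
    print(record)
```

The error computed at the end of one pass through the equations is the only thing carried into the next pass, which is the sense in which it is both backpropagated and fed forward.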

Now, it is time to explain how I simulate the whole issue of disturbed social structure, and the four scenarios A, B, C, and D, which I described a few paragraphs earlier. The trick I used consists in creating a baseline neural network, one which sort of does something but not much really, and then making mutants out of it, and comparing the outcomes yielded by mutants with that produced by their baseline ancestor. For the baseline version, I have been looking for a neural network which learns lightning fast in the short run but remains profoundly stupid in the long run. I wanted quick immediate reaction and no capacity whatsoever to narrow down the error and adjust to it.

The input layer of the baseline neural network is made of the set SR = {sr1, sr2, …, srm} of ‘m’ social roles, and one additional variable representing the hypothetical disturbance. Each social role sri corresponds to a single neuron, which can take values between 0 and 1. Those values represent the probability of occurrence in the social role sri. If, for example, in the experimental round e = 100, the input value of the social role sri is sri(e100) = 0.23, it means that 23% of people manifest the distinctive signs of that social role. Of course, consistently with what I perceive as the conceptual acquis of social sciences, I assume that an individual can have multiple, overlapping social roles.

The factor of disturbance RB is an additional variable in the input layer of the network and comes with similar scale and notation. It takes values between 0 and 1, which represent the probability of disturbing occurrence in the social structure. Once again, RB can be anything, disturbing positively, negatively, or kind of we have no idea what it is going to bring about.

Those of you who are familiar with the architecture of neural networks might wonder how I am going to represent the emergence of new social roles without modifying the structure of the network. Here comes a mathematical trick, which, fortunately enough, is well grounded in social sciences. The mathematical part of the trick consists in incorporating dormant social roles in the initial set SR = {sr1, sr2, …, srm}, i.e. social roles assigned with arbitrary 0 value, i.e. zero probability of occurrence. On the historically short run, i.e. at the scale of like one generation, new social roles are largely predictable. As we are now, we can reasonably predict the need for new computer programmers, whilst being able to safely assume a shortage of jobs for cosmic janitors, collecting metal scrap from the terrestrial orbit. In 20 years from now, that perspective can change – and it’d better change, as we have megatons of metal crap on the orbit – yet, for now, it looks pretty robust.

Thus, in the set SR = {sr1, sr2, …, srm}, I reserve k neurons for active social roles, and l neurons for dormant ones, with, of course, k + l = m. All in all, in the actual network I programmed in Excel, I had k = 20 active social roles, l = 19 dormant social roles, and one neuron corresponding to the disturbance factor RB.            

Now, the issue of social cohesion. In this case, we are talking about cohesion inside the set SR = {sr1, sr2, …, srm}. Mathematically, cohesion inside a set of numerical values can be represented as the average numerical distance between them. Therefore, the input layer of 20k + 19l + RB = 40 neurons is coupled with a layer of meta-input, i.e. with a layer of 40 other neurons whose sole function is to inform about the Euclidean distance between the current value of each input neuron, and the values of the other 39 input neurons.

Euclidean distance plays the role of fitness function (see Hamann et al. 2010[1]). Each social role in the set SR = {sr1, sr2, …, srm}, with its specific probability of occurrence, displays a Euclidean distance from the probability of occurrence in other social roles. The general idea behind this specific mathematical turn is that in a stable structure, the Euclidean distance between phenomena stays more or less the same. When, as a society, we take care of being collectively cohesive, we use the observation of cohesion as data, and the very fact of minding our cohesion helps us to maintain cohesion. When, on the other hand, we don’t care about social cohesion, then we stop using (feeding forward) this specific observation, and social cohesion dissolves.

For the purposes of my own scientific writing, I commonly label that Euclidean distance as V, i.e. V(sri; ej) stands for the average Euclidean distance between social role sri, and all the other m – 1 social roles in the set SR = {sr1, sr2, …, srm}, in the experimental round ej. When input variables are being denominated on a scale from 0 to 1, thus typically standardized for a neural network, and the network uses (i.e. feeds forward) the meta input on cohesion between variables, the typical Euclidean distance you can expect is like 0,1 ≤ V(sri; ej) ≤ 0,3. When the social structure loses it, Euclidean distance between phenomena starts swinging, and that interval tends to go into 0,05 ≤ V(sri; ej) ≤ 0,8. This is how the general idea of social cohesion is translated into a mathematical model.

Thus, my neural network uses, as primary data, basic input about the probability of specific social roles being played by a randomly chosen individual, and metadata about cohesion between those probabilities. I start by assuming that all the active k = 20 social roles occur with the same probability of 0,5. In other words, at the starting point, each individual in the society displays a 50% probability of endorsing any of the k = 20 social roles active in this specific society. Reminder: l = 19 dormant social roles stay at 0, i.e. each of them has 0% of happening, and the RB disturbance stays at 0% probability as well. All is calm. This is my experimental round 1, or e1. In the equation of random experimentation, each social role sri gets experimentally weighed with a random coefficient, and with its local Euclidean distance from other social roles. Of course, as all k = 20 social roles have the same probability of 50%, their distance from each other is uniform and always makes V = 0,256097561. All is calm.
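The starting state and its layer of meta-input can be sketched in Python as follows. The sketch assumes that V is the plain arithmetic mean of the distances between one input neuron and the 39 others; the exact averaging used in the original spreadsheet is not spelled out in the text, so the figure it prints need not coincide with the value of V quoted above. All names are illustrative.

```python
def average_distance(values, i):
    # Mean Euclidean distance between input i and every other input.
    # For single numbers, sqrt((a - b)^2) is simply |a - b|.
    others = [v for j, v in enumerate(values) if j != i]
    return sum(abs(values[i] - v) for v in others) / len(others)

K_ACTIVE, L_DORMANT = 20, 19

# 20 active roles at 0.5, 19 dormant roles at 0, and the disturbance RB at 0.
layer = [0.5] * K_ACTIVE + [0.0] * L_DORMANT + [0.0]

# One meta-input neuron per input neuron.
meta_input = [average_distance(layer, i) for i in range(len(layer))]

print(len(layer), meta_input[0])
```

Whatever the averaging convention, the point made in the text holds in this sketch too: all active social roles start with one and the same distance from the rest of the layer.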

As I want my baseline AI to be quick on the uptake and dumb as f**k on the long-haul flight of learning, I use neural activation through hyperbolic tangent. As you could have seen earlier, this function is sort of prone to short term excitement. In order to assess the error, I use both logic and one more mathematical trick. In the input, I made each of k = 20 social roles equiprobable in its happening, i.e. 0,50. I assume that the output of neural activation should also be 0,50. Fifty percent of being anybody’s social role should yield fifty percent: simplistic, but practical. I go e(erj) = O(erj) – NA(erj) = 0,5 – tanh = 0,5 – [(e^(2h) – 1) / (e^(2h) + 1)], and I feed forward that error from round 1 to the next experimental round. This is an important trait of this particular neural network: in each experimental round, it adds up the probability from the previous experimental round and the error made in that same previous round, under the assumption that the expected value of output should be a probability of 50%.
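That update rule can be sketched in Python and checked against the first rows of Table 1. The sketch takes the aggregate input h of rounds 1 and 2 straight from the table rather than recomputing it, since h depends on random weights that are not listed; `baseline_step` is a hypothetical helper name.

```python
import math

EXPECTED_OUTPUT = 0.5  # assumed expected value of the output

def baseline_step(probability, h):
    # Error of the current round, then the input probability of the next one.
    error = EXPECTED_OUTPUT - math.tanh(h)
    return error, probability + error

error_1, p_2 = baseline_step(0.5, 1.62771505)  # h of round 1 in Table 1
error_2, p_3 = baseline_step(p_2, 0.02990319)  # h of round 2 in Table 1

print(error_1, p_2)
print(error_2, p_3)
```

The results should agree with the errors and with the input probabilities of rounds 2 and 3 reported in the table, up to rounding.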

That whole mathematical strategy yields interesting results. Firstly, in each experimental round, each active social role displays rigorously the same probability of happening, and yet that uniformly distributed probability changes from one experimental round to another. We have here a peculiar set of phenomena, which all have the same probability of taking place, which, in turn, makes all those local probabilities equal to the average probability in the given experimental round, i.e. to the expected value. Consequently, the same happens to the internal cohesion of each experimental round: all Euclidean distances between input probabilities are equal to each other, and to their average expected distance. Technically, after having discovered that homogeneity, I could have dropped the whole idea of many social roles sri in the database and reduced the input data just to three variables (columns): one active social role, one dormant, and the disturbance factor RB. Still, I know by experience that even simple neural networks tend to yield surprising results. Thus, I kept the architecture ’20k + 19l + RB’ just for the sake of experimentation.

That whole baseline neural network, in the form of an Excel file, is available under THIS LINK. In Table 1, below, I summarize the essential property of this mathematical structure: short cyclicality. The average probability of happening in each social role swings regularly, yielding, at the end of the day, an overall average probability of 0,33. Interesting. The way this neural network behaves, it represents a recurrent sequence of two very different states of society. In odd experimental rounds (i.e. 1, 3, 5,… etc.) each social role has 50% or more of probability of manifesting itself in an individual, and the relative cohesion inside the set of social roles is quite high. On the other hand, in even experimental rounds (i.e. 2, 4, 6, … etc.), social roles become disparate in their probability of happening in a given time and place of society, and the internal cohesion of the network is low. The sequence of those two states looks like the work of a muscle: contract, relax, contract, relax etc.

Table 1 – Characteristics of the baseline neural network

Experimental round       | Average probability of input | Cohesion – average Euclidean distance V in input | Aggregate input ‘h’ | Error to backpropagate
-------------------------+------------------------------+--------------------------------------------------+---------------------+-----------------------
1                        | 0,5000                       | 0,2501                                           | 1,62771505          | -0,4257355
2                        | 0,0743                       | 0,0372                                           | 0,02990319          | 0,47010572
3                        | 0,5444                       | 0,2723                                           | 1,79626958          | -0,4464183
4                        | 0,0980                       | 0,0490                                           | 0,05191633          | 0,44813027
5                        | 0,5461                       | 0,2732                                           | 1,60393868          | -0,4222593
6                        | 0,1238                       | 0,0619                                           | 0,09320145          | 0,40706748
7                        | 0,5309                       | 0,2656                                           | 1,59030006          | -0,4201953
8                        | 0,1107                       | 0,0554                                           | 0,07157025          | 0,4285517
9                        | 0,5392                       | 0,2698                                           | 1,49009281          | -0,4033418
10                       | 0,1359                       | 0,0680                                           | 0,11301796          | 0,38746079
11                       | 0,5234                       | 0,2618                                           | 1,51642329          | -0,4080723
12                       | 0,1153                       | 0,0577                                           | 0,06208368          | 0,43799596
13                       | 0,5533                       | 0,2768                                           | 1,92399208          | -0,458245
14                       | 0,0950                       | 0,0476                                           | 0,03616495          | 0,46385081
15                       | 0,5589                       | 0,2796                                           | 1,51645936          | -0,4080786
16                       | 0,1508                       | 0,0755                                           | 0,13860251          | 0,36227827
17                       | 0,5131                       | 0,2567                                           | 1,29611259          | -0,3607191
18                       | 0,1524                       | 0,0762                                           | 0,12281062          | 0,37780311
19                       | 0,5302                       | 0,2652                                           | 1,55382594          | -0,4144146
20                       | 0,1158                       | 0,0579                                           | 0,06391662          | 0,43617027
Average over 3000 rounds | 0,3316                       | 0,1659                                           | 0,8113              | 0,0000041
Variability*             | 0,6092                       | 0,6092                                           | 0,9012              | 97 439,507

*Variability is calculated as standard deviation, i.e. square root of variance, divided by the average.
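The measure of variability defined in that note can be sketched in Python as below. Whether the original spreadsheet used the population or the sample standard deviation is not stated, so the population version is an assumption here, and the sample of six values is just the first six input probabilities from Table 1, not the full 3000 rounds.

```python
import statistics

def variability(values):
    # Standard deviation (population version, by assumption) over the mean.
    return statistics.pstdev(values) / statistics.mean(values)

# First six average input probabilities from Table 1, for illustration only.
sample = [0.5000, 0.0743, 0.5444, 0.0980, 0.5461, 0.1238]

print(variability(sample))
```

A mean very close to zero, as in the error column, inflates this ratio enormously, which is consistent with the very large variability reported for the error.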

Now, I go into the scenario A of social change. The factor of disturbance RB gets activated and provokes a loosening of social cohesion. Mathematically, it involves a few modifications to the baseline network. Activation of the disturbance RB involves two steps. Firstly, numerical values of this specific variable in the network need to be non-null: the disturbance is there. I do it by generating random numbers in the RB column of the database. Secondly, there must be a reaction to disturbance, and the reaction consists in disconnecting the layer of neurons, which I labelled meta-data, i.e. the one containing Euclidean distances between the raw data points.

Here comes the overarching issue of sensitivity to disturbance, which goes across all the four scenarios (i.e. A, B, C, and D). As representation of what’s going on in social structure, it is about collective and individual alertness. When a new technology comes out into the market, I don’t necessarily change my job, but when that technology spreads over a certain threshold of popularity, I might be strongly pushed to reconsider my decision. When COVID-19 started hitting the global population, all levels of reaction (i.e. governments, media etc.) were somehow delayed in relation to the actual epidemic spread. This is how social change happens in reaction to a stressor: there is a threshold of sensitivity.

When I throw a handful of random values into the database, as values of disturbance RB, they are likely to be distributed under a bell-curve. I translate mathematically the social concept of sensitivity threshold as a value under that curve, past which the network reacts by cutting ties between errors input as raw data from previous experimental rounds, and the measurement of Euclidean distance between them. Question: how to set this value so that it fits with the general logic of that neural network? I decided to set the threshold at the absolute value of the error recorded in the previous experimental round. Thus, for example, when error generated in round 120 is e120 = -0.08, the threshold of activation for triggering the response to disturbance is ABS(-0,08) = 0,08. The logic behind this condition is that social disturbance becomes significant when it is more prevalent than normal discrepancy between social goals and the actual outcomes.
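The threshold condition, and the reaction it triggers in scenario A, can be sketched in Python as follows. The comparison against the absolute value of the previous error comes from the text; the way the meta-data enters aggregation, i.e. as a multiplier on each weighted input, is an assumption made for the sake of the example, as are the function names.

```python
def disturbance_triggers(rb, previous_error):
    # The threshold is the absolute value of the previous round's error.
    return rb > abs(previous_error)

def aggregate_scenario_a(values, weights, distances, rb, previous_error):
    if disturbance_triggers(rb, previous_error):
        # Disturbance above threshold: meta-data on cohesion is not fed forward.
        return sum(w * x for w, x in zip(weights, values))
    # Otherwise each weighted input is also scaled by its distance V
    # (assumed multiplicative form).
    return sum(w * x * v for w, x, v in zip(weights, values, distances))

print(disturbance_triggers(0.09, -0.08))  # True
print(disturbance_triggers(0.05, -0.08))  # False
```

With the error of -0.08 from the example above, a disturbance of 0.09 passes the threshold and a disturbance of 0.05 does not.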

I come back to the scenario A, thus to the hypothetical situation when the factor of disturbance cuts the ties of cohesion between existing, active social roles. I use the threshold condition ‘if RB(erj) > e(erj-1), then don’t feed forward V(erj-1)’, and this is what happens. First of all, the values of probability assigned to all active social roles remain just as uniform, in every experimental round, as they are in the baseline neural network I described earlier. I know, now, that the neural network, such as I designed it, is not able to discriminate between inputs. It just generates a uniform distribution thereof. That being said, the uniform probability of happening in social roles sri follows, in scenario A, a clearly different trajectory than the monotonous oscillation in the baseline network. The first 134 experimental rounds yield a progressive decrease in probability down to 0. Somewhere in rounds 134 ÷ 136 the network reaches a paradoxical situation, when no active social role in the k = 20 subset has any chance of manifesting itself. It is a society without social roles, and all that because the network stops feeding forward meta-data on its own internal cohesion when the disturbance RB goes over the triggering point. Past that zero point, a strange cycle of learning starts, in irregular leaps: the uniform probability attached to social roles rises up to an upper threshold, and then descends again back to zero. The upper limit of those successive leaps oscillates and then, at an experimental round somewhere between er400 and er1000, probability jumps just below 0,7 and stays this way until the end of the 3000 experimental rounds I ran this neural network through. At this very point, the error recorded by the network gets very close to zero and stays there as well: the network has learnt whatever it was supposed to learn.

Of course, the exact number of experimental rounds in that cycle of learning is irrelevant society-wise. It is not 400 days or 400 weeks; it is the shape of the cycle that really matters. That shape suggests that, when an external disturbance switches off internal cohesion between social roles in a social structure, the so-stimulated society changes in two phases. At first, there are successive, hardly predictable episodes of virtual disappearance of distinct social roles. Professions disappear, family ties distort etc. It is interesting. Social roles get suppressed simply because there is no need for them to stay coherent with other social roles. Then, a hyper-response emerges. Each social role becomes even more prevalent than before the disturbance started happening. It means a growing probability that one and the same individual plays many social roles in parallel.

I pass to scenario B of social change, i.e. the hypothetical situation when the exogenous disturbance straightforwardly triggers the suppression of social roles, and the network keeps feeding forward meta-data on internal cohesion between social roles. Interestingly, suppression of social roles under this logical structure is very short lived, i.e. 1 – 5 experimental rounds, and then the network yields an error which forces social roles to reappear.

One important observation is to be noted as regards scenarios B, C, and D of social change in general. The way the neural network is designed, with the threshold of social disturbance calibrated on the error from the previous experimental round, error keeps oscillating within an apparently constant amplitude over all the 3000 experimental rounds. In other words, there is no visible reduction of magnitude in error. Some sort of social change is occurring in scenarios B, C, and D, yet it looks like a dynamic equilibrium rather than a definitive change of state. That general remark kept in mind, the way that the neural network behaves in scenario B is coherent with the observation made regarding the side effects of its functioning in scenario A: when the factor of disturbance triggers the disappearance of some social roles, they re-emerge spontaneously, shortly after. To the extent that the neural network I use here can be deemed representative for real social change, widely prevalent social roles seem to be a robust part of the social structure.

Now, it is time to screen comparatively the results yielded by the neural network when it is supposed to represent scenarios C and D of social change: I study situations when a factor of social disturbance, calibrated in its significance on the error made by the neural network in previous experimental rounds, triggers the emergence of new social roles. The difference between those two scenarios is in the role of social cohesion. Mathematically, I did it by activating the dormant l = 19 social roles in the network, with a random component. When the random value generated in the column of social disturbance RB is greater than the error observed in the previous experimental round, thus when RB(erj) > e(erj-1), then each of the l = 19 dormant social roles gets a random positive value between 0 and 1. That random positive value gets processed in two alternative ways. In scenario C, it goes directly into aggregation and neural activation, i.e. there is no meta-data on the Euclidean distance between any of those newly emerging social roles and other social roles. Each new social role is considered as a monad, which develops free from constraints of social cohesion. Scenario D establishes such a constraint, thus the randomly triggered probability of a woken up, and previously dormant social role is being aggregated, and fed into neural activation with meta-data as for its Euclidean distance from other social roles.    
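The waking up of dormant social roles, and the difference between scenarios C and D, can be sketched in Python as below. The threshold is taken as the absolute value of the previous error, as defined earlier for the disturbance; the multiplicative use of the Euclidean distance in scenario D is an assumption, and the function names are hypothetical.

```python
import random

def wake_dormant_roles(dormant, rb, previous_error, rng):
    # When RB exceeds the absolute previous error, each dormant role gets
    # a random value in [0, 1); otherwise the roles stay as they are.
    if rb > abs(previous_error):
        return [rng.random() for _ in dormant]
    return list(dormant)

def contribution(value, weight, distance, scenario):
    # Contribution of one woken-up role to the aggregate input h.
    if scenario == "C":
        return weight * value              # no meta-data on cohesion
    if scenario == "D":
        return weight * value * distance   # with meta-data (assumed form)
    raise ValueError("scenario must be 'C' or 'D'")

rng = random.Random(1)
woken = wake_dormant_roles([0.0] * 19, rb=0.5, previous_error=-0.1, rng=rng)
print(woken[:3])
```

In this reading, the same random probability of a new social role weighs less in scenario D whenever its distance from the other roles is below 1, which is one way of expressing the constraint of cohesion.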

Scenarios C and D share one important characteristic: heterogeneity in new social roles. The k = 20 social roles active from the very beginning, thus social roles ‘inherited’ from the baseline social network, share a uniform probability of happening in each experimental round. Still, as probabilities of new social roles, triggered by the factor of disturbance, are random by default, these probabilities are distributed aleatorily. Therefore, scenarios C and D represent a general case of a new, heterogenous social structure emerging in the presence of an incumbent rigid social structure. Given that specific trait, I introduce a new method of comparing those two sets of social roles, namely by the average probability attached to social roles, calculated over the 3000 experimental rounds. I calculate the average probability of active social roles across all the 3000 experimental rounds, and I compare it with individual, average probabilities obtained for each of the new social roles (or woken up and previously dormant social roles) over 3000 experimental rounds. The idea behind this method is that in big sets of observations, arithmetical average represents the expected value, or the expected state of the given variable.

The process of social change observed, respectively, in scenarios C and D, is different. In scenario C, the uniform probability attached to the incumbent k = 20 social roles follows a very calm trend, oscillating slightly between 0.2 and 0.5, whilst the heterogenous probabilities of the newly triggered l = 19 social roles swing quickly and broadly between 0 and 1. In scenario D, when the network starts feeding forward meta-data on Euclidean distance between each new social role and the others, it creates additional oscillation in the uniform probability of incumbent social roles. The latter gets systematically and cyclically pushed into negative values. A negative probability is logically impossible and represents no real phenomenon. Well, I mean… It is possible to assume that the negative probability of one phenomenon represents the probability of the opposite phenomenon taking place, but this is really far-fetched and doesn’t really find grounding in the logical structure of this specific neural network. Still, the cycle of change where the probability of something incumbent and previously existing gets crushed down to zero (and below) represents a state of society in which a new phenomenon aggressively pushes the incumbent phenomena out of the system.

Let’s see how those two processes of social change, observed in scenarios C and D, translate into expected states of social roles, i.e. into average probabilities. The first step in this analysis is to see how heterogeneous those average expected states are across the new social roles, triggered out of dormancy by the intrusion of the disturbance RB. In scenario C, new social roles display average probabilities between 0.32 and 0.35. The average probability of each individual new social role differs from the others by no more than 0.03, thus by a phenomenological fringe to be found in the tails of the normal distribution. By comparison, the average uniform probability attached to the existing social roles is 0.31. Thus, in the absence of constraint regarding social cohesion between new social roles and the incumbent ones, the expected average probability in both categories is very similar.

In scenario D, average probabilities of new social roles oscillate between 0.45 and 0.49, with just as little disparity as in scenario C, but, at the same time, they push the incumbent social roles out of the nest, so to say. The average uniform probability in the latter, after 3000 experimental rounds, is 0.01, which is most of all a result of the ‘positive probability – negative probability’ cycle during experimentation.

It is time to sum up my observations from the entire experiment conducted through and with a neural network. The initial intention was to understand better the mechanism which underlies one of my most fundamental claims regarding the civilizational role of cities, namely that cities, as a social contrivance, serve to accommodate a growing population in the framework of an increasingly complex network of social roles.

I am focusing on the ‘increasingly complex’ part of that claim. I want to understand patterns of change in the network of social roles, i.e. how the complexity of that network can evolve over time. The kind of artificial behaviour I induced in a neural network allows identifying a few recurrent patterns, which I can transform into hypotheses for further research. There is a connection between social cohesion and the emergence/disappearance of new social roles, for one. Social cohesion drags me back into the realm of the swarm theory. As a society, we seem to be evolving by a cycle of loosening and tightening in the way that social roles are coupled with each other.

OK, here is the big picture. The highest demographic growth, in absolute numbers, takes place in Asia and Africa. The biggest migratory flows start from there, as well, and aim at and into regions with much less of human mass in accrual: North America and Europe. Less human accrual, indeed, and yet much better conditions for each new homo sapiens. In some places on the planet, a huge amount of humans is born every year. That huge amount means a huge number of genetic variations around the same genetic tune, namely that of the homo sapiens. Those genetic variations leave their homeland, for a new and better homeland, where they bring their genes into a new social environment, which assures them much more safety, and higher odds of prolonging their genetic line.

What is the point of there being more specimens of any species? I mean, is there a logic to increasing the headcount of any population? When I say ‘any’, it ranges from bacteria to us, humans. After having meddled with the most basic algorithm of a neural network (see « Pardon my French, but the thing is really intelligent » and « Ce petit train-train des petits signaux locaux d’inquiétude »), I have some thoughts about what intelligence is. I think that intelligence is a class, i.e. it is a framework structure able to produce many local, alternative instances of itself.

Being intelligent consists, to start with, in creating alternative versions of itself, and creating them purposefully imperfect so as to generate small local errors, whilst using those errors to create still different versions of itself. The process is tricky. There is some sort of fundamental coherence required between the way of creating those alternative instances of oneself, and the way that resulting errors are being processed. Without such coherence, the allegedly intelligent structure can fall into purposeful ignorance, or into panic.

Purposeful ignorance manifests as the incapacity to signal and process the local imperfections in alternative instances of the intelligent structure, although those imperfections actually stand out and wave at you. This is the ‘everything is just fine and there is no way it could be another way’ behavioural pattern. It happens, for example, when the function of processing local errors is too gross – or not sharp enough, if you want – to actually extract meaning from tiny, still observable local errors. The panic mode of an intelligent structure, on the other hand, is that situation when the error-processing function is too sharp for the actually observable errors. Them errors just knock it out of balance, like completely, and the function signals general ‘Error’, or ‘I can’t stand this cognitive dissonance’.

So, what is the point of there being more specimens of any species? The point might be to generate as many specific instances of an intelligent structure – the specific DNA – as possible, so as to generate purposeful (and still largely unpredictable) errors, just to feed those errors into the future instantiations of that structure. In the process of breeding, some path of evolutionary coherence leads to errors that can be handled, and that path unfolds between a state of evolutionary ‘everything is OK, no need to change anything’ (case mosquito, unchanged for millions of years), and a state of evolutionary ‘what the f**k!?’ (case common fruit fly, which produces insane amount of mutations in response to the slightest environmental stressor).

Essentially, all life could be a framework structure, which, back in the day, made a piece of software in artificial intelligence – the genetic code – and ever since that piece of software has been working on minimizing the MSE (mean square error) in predicting the next best version of life, and it has been working by breeding, in a tree-like method of generating variations, indefinitely many instances of the framework structure of life. Question: what happens when, one day, a perfect form of life emerges? Something like TRex – Megalodon – Angelina Jolie – Albert Einstein – Jeff Bezos – [put whatever or whoever you like in the rest of that string]? On the grounds of what I have already learnt about artificial intelligence, such a state of perfection would mean the end of experimentation, thus the end of multiplying instances of the intelligent structure, thus the end of births and deaths, thus the end of life.

Question: if the above is even remotely true, does that overarching structure of life understand how the software it made – the genetic code – works? Not necessarily. That very basic algorithm of neural network, which I have experimented with a bit, produces local instances of the sigmoid function Ω = 1/(1 + e^(-x)) such that Ω < 1, and that 1 + e^(-x) > 1, which is always true. Still, the thing does it just sometimes. Why? How? Go figure. That thing accomplishes an apparently absurd task, and it does so just by being sufficiently flexible with its random coefficients. If Life In General is God, that God might not have a clue about how the actual life works. God just needs to know how to write an algorithm for making actual life work. I would even say more: if God is any good at being one, he would write an algorithm smarter than himself, just to make things advance.

The hypothesis of life being one, big, intelligent structure gives an interesting insight into what the cost of experimentation is. Each instance of life, i.e. each specimen of each species, needs energy to sustain it. That energy takes many forms: light, warmth, food, Lexus (a form of matter), parties, Armani (another peculiar form of matter) etc. The more instances of life there are, the more energy they need to be there. Even if we take the Armani particle out of the equation, life is still bloody energy-consuming. The available amount of energy puts a limit to the number of experimental instances of the framework, structural life that the platform (Earth) can handle.

Here comes another one about climate change. Climate change means warmer, let’s be honest. Warmer means more energy on the planet. Yes, temperature is our human measurement scale for the average kinetic energy of vibrating particles. More energy is what we need to have more instances of framework life, at the same time. Logically, incremental change in total energy on the planet translates into incremental change in the capacity of framework life to experiment with itself. Still, as framework life could be just the God who made that software for artificial intelligence (yes, I am still in the same metaphor), said framework life might not be quite aware of how bumpy the road could be, towards the desired minimum in the Mean Square Error. If God is an IT engineer, it could very well be the case.

I had that conversation with my son, who is graduating from his IT engineering studies. I told him ‘See, I took that algorithm of neural network, and I just wrote its iterations out into separate tables of values in Excel, just to see what it does, like iteration after iteration. Interesting, isn’t it? I bet you have done such a thing many times, eh?’. I still remember that heavy look in my son’s eyes: ‘Why the hell should I ever do that?’ he went. ‘There is a logical loop in that algorithm, you see? This loop is supposed to do the job, I mean to iterate until it comes up with something really useful. What is the point of doing manually what the loop is supposed to do for you? It is like hiring a gardener and then doing everything in the garden by yourself, just to see how it goes. It doesn’t make sense!’. ‘But it’s interesting to observe, isn’t it?’ I went, and then I realized I was talking to an alien form of intelligence, there.

Anyway, if God is a framework life who created some software to learn in itself, it could not be quite aware of the tiny little difficulties in the unfolding of the Big Plan. I mean acidification of oceans, hurricanes and stuff. The framework life could say: ‘Who cares? I want more learning in my algorithm, and it needs more energy to loop on itself, and so it makes those instances of me, pumping more carbon into the atmosphere, so as to have more energy to sustain more instances of me. Stands to reason, man. It is all working smoothly. I don’t understand what you are moaning about’.

Whatever that godly framework life says, I am still interested in studying particular instances of what happens. One of them is my business concept of EneFin. See « Which salesman am I? » as what I think is the last case of me being like fully verbal about it. Long story short, the idea consists in crowdfunding capital for small, local operators of power systems based on renewable energies, by selling shares in equity, or units of corporate debt, in bundles with tradable claims on either the present output of energy, or the future one. In simple terms, you buy from that supplier of energy tradable claims on, for example, 2 000 kWh, and you pay the regular market price, still, in that price, you buy energy properly spoken with a juicy discount. The rest of the actual amount of money you have paid buys you shares in your supplier’s equity.
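To make the arithmetic of such a bundled contract concrete, here is a minimal sketch in Python. The function name and both prices are hypothetical, made up purely for illustration; only the 2 000 kWh and the logic of the split (pay the market price, get energy at a discount, the difference buys equity) come from the description above.

```python
def split_enefin_payment(kwh, market_price, discounted_price):
    """Split one payment into the part that buys energy at a discount
    and the remainder that buys shares in the supplier's equity."""
    total_paid = kwh * market_price
    energy_part = kwh * discounted_price
    equity_part = total_paid - energy_part
    return total_paid, energy_part, equity_part

# Hypothetical prices per kWh, in whatever currency you like.
total, energy, equity = split_enefin_payment(
    kwh=2000, market_price=0.20, discounted_price=0.15
)
print(total, energy, equity)
```

With these made-up prices, a payment of 400 splits into 300 worth of energy and 100 worth of equity, give or take floating-point rounding.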

The idea in that simplest form is largely based on two simple observations about energy bills we pay. In most countries (at least in Europe), our energy bills are made of two components: the (slightly) variable value of the energy actually supplied, and a fixed part labelled sometimes as ‘maintenance of the grid’ or similar. Besides, small users (e.g. households) usually pay a much higher unitary price per kWh than large, institutional scale buyers (factories, office buildings etc.). In my EneFin concept, a local supplier of renewable energy makes a deal with its local customers to sell them electricity at a fair, market price, with participations in equity on the top of electricity.

That would be a classical crowdfunding scheme, such as you can find with StartEngine, for example. I want to give it some additional, financial spin. Classical crowdfunding has a weakness: low liquidity. The participatory shares you buy via crowdfunding are usually non-tradable, and they create a quasi-cooperative bond between investors and investees. Where I come from, i.e. in Central Europe, we are quite familiar with cooperatives. At first sight, they look like a form of institutional heaven, compared to those big, ugly, capitalistic corporations. Still, after you have waved out that first mist, cooperatives turn out to be very exposed to embezzlement, and to abuse of managerial power. Besides, they are quite weak when competing for capital against corporate structures. I want to create a highly liquid transactional platform, with those investments being as tradable as possible, and use financial liquidity as both a shield against managerial excesses, and a competitive edge for those small ventures.

My idea is to assure liquidity via a FinTech solution similar to that used by Katipult Technology Corp., i.e. to create some kind of virtual currency (note: virtual currency is not absolutely the same as cryptocurrency; cousins, but not twins, so to say). Units of currency would correspond to those complex contracts « energy plus equity ». First, you create an account with EneFin, i.e. you buy a certain amount of the virtual currency used inside the EneFin platform. I call them ‘tokens’ to simplify. Next, you pick your complex contracts, in the basket of those offered by local providers of energy. You buy those contracts with the tokens you have already acquired. Now, you change your mind. You want to withdraw your capital from supplier A, and move it to supplier H, which you haven’t considered so far. You move your tokens from A to H, even with a mobile app. It means that the transactional platform – the EneFin one – buys from you the corresponding amount of equity of A and tries to find for you some available equity in H. You can also move your tokens completely out of investment in those suppliers of energy. You can free your money, so to say. Just as simple: you just move them out, even with a movement of your thumb on the screen. The EneFin platform buys from you the shares you have moved out of.

You have an even different idea. Instead of investing your tokens into the equity of a provider of energy, you want to lend them. You move your tokens to the field ‘lending’, you study the interest rates offered on the transactional platform, and you close the deal. Now, the corresponding number of tokens represents securitized (thus tradable) corporate debt.

Question: why the hell bother with a virtual currency, possibly a cryptocurrency, instead of just using good old fiat money? At this point, I am reaching to the very roots of the Bitcoin, the grandpa of all cryptocurrencies (or so they say). Question: what amount of money do you need to finance 20 transactions of equal unitary value P? Answer: it depends on how frequently you monetize them. Imagine that the EneFin app offers you an option like ‘Monetize vs. Don’t Monetize’. As long as – with each transaction you do on the platform – you stick to the ‘Don’t Monetize’ option, your transactions remain recorded inside the transactional platform, and so there is recorded movement in tokens, but there is no monetary outcome, i.e. your strictly monetary balance, for example that in €, does not change. It is only when you hit the ‘Monetize’ button in the app that the current bottom line of your transactions inside the platform is being converted into « official » money.
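The ‘it depends on how frequently you monetize’ answer can be illustrated with a toy ledger in Python. Everything in this sketch is an assumption of mine: the function name, the value of P, and the pattern of transactions (alternating purchases and sales of equal value). It only shows the principle that the less often the running balance in tokens is converted, the less official money has to move.

```python
def money_moved(transactions, monetize_every):
    """Total fiat money converted when the running token balance is
    settled after every `monetize_every` transactions."""
    moved = 0.0
    pending = 0.0
    for i, amount in enumerate(transactions, start=1):
        pending += amount
        if i % monetize_every == 0:
            moved += abs(pending)   # convert the current bottom line
            pending = 0.0
    return moved + abs(pending)     # settle whatever is left at the end

P = 100.0
# Hypothetical pattern: 20 transactions, alternating buy (+P) and sell (-P).
transactions = [P if i % 2 == 0 else -P for i in range(20)]

every_time = money_moved(transactions, monetize_every=1)    # 20 * P moved
once_at_end = money_moved(transactions, monetize_every=20)  # nets out to 0
```

In this deliberately extreme pattern, monetizing each transaction moves 20 * P of money, whilst monetizing once at the end moves nothing, as purchases and sales cancel out inside the platform.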

The virtual currency in the EneFin scheme would serve to allow a high level of liquidity (more transactions in a unit of time), without provoking the exactly corresponding demand for money. What connection with artificial intelligence? I want to study the possible absorption of such a scheme in the market of energy, and in the related financial markets, as a manifestation of collective intelligence. I imagine two intelligent framework structures: one incumbent (the existing markets) and one emerging (the EneFin platform). Both are intelligent structures to the extent that they technically can produce many alternative instances of themselves, and thus intelligently adapt to their environment by testing those instances and utilising the recorded local errors.

In terms of an algorithm of neural network, that intelligent adaptation can be manifest, for example, as an optimization in two coefficients: the share of energy traded via EneFin in the total energy supplied in the given market, and the capitalization of EneFin as a share in the total capitalization of the corresponding financial markets. Those two coefficients can be equated to weights in a classical MLP (Multilayer Perceptron) network, and the perceptron network could work around them. Of course, the issue can be approached from a classical methodological angle, as a general equilibrium to assess via « normal » econometric modelling. Still, what I want is precisely what I hinted in « Pardon my French, but the thing is really intelligent » and « Ce petit train-train des petits signaux locaux d’inquiétude »: I want to study the very process of adaptation and modification in those intelligent framework structures. I want to know, for example, how much experimentation those structures need to form something really workable, i.e. an EneFin platform with serious business going on, and, at the same time, that business contributing to the development of renewable energies in the given region of the world. Do those framework structures have enough local resources – mostly capital – for sustaining the number of alternative instances needed for effective learning? What kind of factors can block learning, i.e. drive the framework structure either into deliberate ignorance of local errors or into panic?

Here is an example of more exact a theoretical issue. In a typical economic model, things are connected. When I pull on the string ‘capital invested in fixed assets’, I can see a valve open, with ‘Lifecycle in incumbent technologies’, and some steam rushes out. When I push the ‘investment in new production capacity’ button, I can see something happening in the ‘Jobs and employment’ department. In other words, variables present in economic systems mutually constrain each other. Just some combinations work, others just don’t. Now, the thing I have already discovered about them Multilayer Perceptrons is that as soon as I add some constraint on the weights assigned to input data, for example when I swap ‘random’ for ‘erandom’, the scope of possible structural functions leading to effective learning dramatically shrinks, and the likelihood of my neural network falling into deliberate ignorance or into panic just swells like hell. What degree of constraint on those economic variables is tolerable in the economic system conceived as a neural network, thus as a framework intelligent structure?

There are some general guidelines I can see for building a neural network that simulates those things. Creating local power systems, based on microgrids connected to one or more local sources of renewable energies, can be greatly enhanced with efficient financing schemes. The publicly disclosed financial results of companies operating in those segments – such as Tesla[1], Vivint Solar[2], FirstSolar[3], or 8Point3 Energy Partners[4] – suggest that business models in that domain are only emerging, and are far from being battle-tested. There is still a way to pave towards well-rounded business practices as regards such local power systems, both profitable economically and sustainable socially.

The basic assumptions of a neural network in that field are essentially behavioural. Firstly, consumption of energy is greatly predictable at the level of individual users. The size of a market in energy changes, as the number of users changes. The output of energy needed to satisfy those users’ needs, and the corresponding capacity to install, are largely predictable in the long run. Consumers of energy use a basket of energy-consuming technologies. The structure of this basket determines their overall consumption, and is determined, in turn, by long-run social behaviour. Changes over time in that behaviour can be represented as a social game, where consecutive moves consist in purchasing, or disposing of, a given technology. Thus, a game-like process of relatively slow social change generates a relatively predictable output of energy, and a demand thereof. Secondly, the behaviour of investors in any financial market, crowdfunding or other, is comparatively more volatile. Investment decisions are being taken, and modified, at a much faster pace than decisions about the basket of technologies used in everyday life.

The financing of relatively small, local power systems, based on renewable energies and connected by local microgrids, implies an interplay of the two above-mentioned patterns, namely the relatively slower transformation in the technological base, and the quicker, more volatile modification of investors’ behaviour in financial markets.

And so I am meddling with neural networks. It had to come. It just had to. I started with me having many ideas to develop at once. Routine stuff with me. Then, the Editor-in-Chief of the ‘Energy Economics’ journal returned the manuscript of my article on the energy-efficiency of national economies, which I had submitted to them, with a general remark that I should work both on the clarity of my hypotheses, and on the scientific spin of my empirical research. In short, Mr Wasniewski, linear models tested with Ordinary Least Squares is a bit oldie, if you catch my drift. Bloody right, Mr Editor-In-Chief. Basically, I agree with your remarks. I need to move out of my cavern, towards the light of progress, and get acquainted with the latest fashion. The latest fashion we are wearing this season is artificial intelligence, machine learning, and neural networks.

It comes in handy, to the extent that I obsessively meddle with the issue of collective intelligence, and am dreaming about creating a model of human social structure acting as collective intelligence, sort of a beehive. Whilst the casting for a queen in that hive remains open, and is likely to stay this way for a while, I am digging into the very basics of neural networks. I am looking in the Python department, as I have already got a bit familiar with that environment. I found an article by James Loy, entitled “How to build your own Neural Network from scratch in Python”. The article looks a bit like it is sourced from another one, available at the website of ProBytes Software, thus I use both to develop my understanding. I pasted the whole algorithm by James Loy into my Python Shell, made it run with an ‘enter’, and I am waiting for what it is going to produce. In the meantime, I am being verbal about my understanding.

The author declares he wants to do more or less the same thing as I do, namely to understand neural networks. He constructs a simple algorithm for a neural network. It starts with defining the neural network as a class, i.e. as a callable object that acts as a factory for new instances of itself. In the neural network defined as a class, that algorithm starts with calling the constructor function ‘__init__’, which constructs an instance ‘self’ of that class. It goes like ‘def __init__(self, x, y):’. In other words, the class ‘Neural network’ generates instances ‘self’ of itself, and each instance is essentially made of two variables: input x, and output y. The ‘x’ is declared as input variable through the ‘self.input = x’ expression. Then, the output of the network is defined in two steps. Yes, the ‘y’ is generally the output, only in a neural network, we want the network to predict a value of ‘y’, thus some kind of y^. What we have to do is to define ‘self.y = y’, feed the real x-s and the real y-s into the network, and expect the latter to turn out some y^-s.

Logically, we need to prepare a vessel for holding the y^-s. The vessel is defined as ‘self.output = np.zeros(y.shape)’. The ‘shape’ attribute returns a tuple with the dimensions of the array, i.e. the number of rows and columns in the table, for those mildly fond of maths. The dimensions of ‘y’ in that ‘y.shape’ come from the data itself: one row per observation, one column per predicted variable. The weights of the network are defined just before. Right after ‘self.input = x’ has been said, ‘self.weights1 = np.random.rand(self.input.shape[1],4)’ fires off, closely followed by ‘self.weights2 = np.random.rand(4,1)’. All in all, the entire class of ‘Neural network’ is defined in the following form:

import numpy as np

class NeuralNetwork:
    def __init__(self, x, y):
        self.input      = x
        self.weights1   = np.random.rand(self.input.shape[1],4)
        self.weights2   = np.random.rand(4,1)
        self.y          = y
        self.output     = np.zeros(self.y.shape)

The output of each instance in that neural network is a two-dimensional array (table) with the same shape as ‘y’: one row per observation and, here, one column. The number four belongs to the hidden layer: ‘weights1’ connects each input variable to four hidden neurons, and ‘weights2’ connects those four neurons to the single output. Initially, the output is filled with zeros, so as to make room for something more meaningful. The predicted y^-s are supposed to jump into those empty sockets, held ready by the zeros. The ‘random.rand’ expression, associated with ‘weights’, means that the network starts by randomly assigning different levels of importance, between 0 and 1, to the different x-s fed into it.
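A quick way to check what the constructor actually allocates is to build one instance and print the shapes. The toy data below (four observations of three input variables, one target each) is my assumption for the sake of the check; it is not data from the experiment.

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)
        self.weights2 = np.random.rand(4, 1)
        self.y = y
        self.output = np.zeros(self.y.shape)

# Hypothetical toy data: 4 observations, 3 input variables, 1 target each.
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

nn = NeuralNetwork(x, y)
print(nn.weights1.shape)  # (3, 4): 3 inputs feeding 4 hidden neurons
print(nn.weights2.shape)  # (4, 1): 4 hidden neurons feeding 1 output
print(nn.output.shape)    # (4, 1): one prediction slot per observation
```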

Anyway, the next step is to instruct my snake (i.e. Python) what to do next, with that class ‘Neural Network’. It is supposed to do two things: feed data forward, i.e. make those neurons work on predicting the y^-s, and then check itself by an operation called backpropagation of errors. The latter consists in comparing the predicted y^-s with the real y-s, measuring the discrepancy as a loss of information, updating the initial random weights with conclusions from that measurement, and doing it all again, and again, and again, until the error runs down to very low values. The weights applied by the network in order to generate that lowest possible error are the best the network can do in terms of learning.

The feeding forward of predicted y^-s goes on in two steps, or in two layers of neurons, one hidden, and one final. They are defined as:

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

The ‘sigmoid’ part means the sigmoid function, AKA logistic function, expressed as y = 1/(1 + e^(-x)), where, at the end of the day, the y always falls somewhere between 0 and 1, and the ‘x’ is not really the empirical, real ‘x’, but the ‘x’ multiplied by a weight, ranging between 0 and 1 as well. The sigmoid function is good for testing the weights we apply to various types of input x-s. Whatever kind of data you take: populations measured in millions, or consumption of energy per capita, measured in kilograms of oil equivalent, the basic sigmoid function y = 1/(1 + e^(-x)) will always yield a value between 0 and 1. This function essentially normalizes any data.
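A small numerical check of that claim, as a sketch: the input values below are made up. It also shows a practical caveat worth keeping in mind. With raw values of a large magnitude, the sigmoid stays within bounds, but it saturates so close to 1 that different inputs become indistinguishable, which is one reason to rescale data before feeding it in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Moderate inputs spread nicely between 0 and 1.
small = sigmoid(np.array([-2.0, 0.0, 2.0]))

# Hypothetical raw values, e.g. 38 (millions of people) and 1500
# (kg of oil equivalent per capita): both land practically on 1.
large = sigmoid(np.array([38.0, 1500.0]))

print(small)
print(large)
```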

Now, I want to take differentiated data, like population as headcount, energy consumption in them kilograms of whatever oil equals to, and the supply of money in standardized US dollars. Quite a mix of units and scales of measurement. I label those three as, respectively, xa, xb, and xc. I assign them weights ranging between 0 and 1, so that the sum of weights never exceeds 1. In plain language it means that for every vector of observations made of xa, xb, and xc I take a pinchful of xa, then a zest of xb, and a spoon of xc. I make them into x = wa*xa + wb*xb + wc*xc, I give it a minus sign and put it as an exponent for Euler’s number e.

That yields y = 1/(1 + e^(-(wa*xa + wb*xb + wc*xc))). Long, but meaningful to the extent that now, my y is always to be found somewhere between 0 and 1, and I can experiment with various weights for my various shades of x, and look what it gives in terms of y.

In the algorithm above, the ‘np.dot’ function conveys the idea of weighing our x-s. With two dimensions, like the input signal ‘x’ and its weight ‘w’, the ‘np.dot’ function yields a multiplication of those two one-dimensional matrices, exactly in the x = wa*xa + wb*xb + wc*xc drift.
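Here is a small sketch of that weighing, with made-up numbers. The observations and the weights are hypothetical, and I assume the three variables have already been rescaled to comparable magnitudes. It checks that NumPy’s dot product gives the same result as writing wa*xa + wb*xb + wc*xc by hand, and then passes the weighted sum through the sigmoid.

```python
import numpy as np

# Hypothetical observations: rows are cases, columns are xa, xb, xc.
x = np.array([[0.2, 0.5, 0.1],
              [0.9, 0.3, 0.4]])
w = np.array([0.5, 0.3, 0.2])    # wa, wb, wc, summing up to 1

weighted = np.dot(x, w)          # wa*xa + wb*xb + wc*xc, row by row
by_hand = x[:, 0] * w[0] + x[:, 1] * w[1] + x[:, 2] * w[2]

y = 1.0 / (1.0 + np.exp(-weighted))   # sigmoid of the weighted sum

print(weighted, by_hand, y)
```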

Thus, the first really smart layer of the network, the hidden one, takes the empirical x-s, weighs them with random weights, and makes a sigmoid of that. The next layer, the output one, takes the sigmoid-calculated values from the hidden layer, and applies the same operation to them.

One more remark about the sigmoid. You can put something else instead of 1, in the numerator. Then, the sigmoid will yield your data normalized over that something. If you have a process that tends towards a level of saturation, e.g. number of psilocybin parties per month, you can put that level in the numerator. On top of that, you can add parameters to the denominator. In other words, you can replace the 1 + e^(-x) with ‘b + e^(-k*x)’, where b and k can be whatever seems to make sense for you. With that specific spin, the sigmoid is good for simulating anything that tends towards saturation over time. Depending on the parameters in the denominator, the shape of the corresponding curve will change. Usually, ‘b’ works well when taken as a fraction of the numerator (the saturation level), and the ‘k’ seems to be behaving meaningfully when comprised between 0 and 1.
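A minimal sketch of that parametrized sigmoid follows. The function name and all the numbers are my assumptions. One thing the sketch makes visible: for large x, the curve tends towards the numerator divided by b, so b = 1 keeps the ceiling at the saturation level itself, and any other b shifts it.

```python
import numpy as np

def saturating_sigmoid(x, saturation=1.0, b=1.0, k=1.0):
    """saturation / (b + e^(-k*x)); tends to saturation / b for large x."""
    return saturation / (b + np.exp(-k * x))

# Hypothetical process: something saturating at 12 events per month.
t = np.linspace(0, 60, 7)
curve = saturating_sigmoid(t, saturation=12.0, b=1.0, k=0.2)

# With b = 0.5 the ceiling moves to 12 / 0.5 = 24.
ceiling_with_half_b = saturating_sigmoid(1000.0, saturation=12.0, b=0.5, k=0.2)

print(curve)
print(ceiling_with_half_b)
```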

I return to the algorithm. Now, as the network has generated a set of predicted y^-s, it is time to compare them to the actual y-s, and to evaluate how much is there to learn yet. We can use any measure of error, still, most frequently, them algorithms go after the simplest one, namely the Mean Square Error MSE = [(y1 – y^1)^2 + (y2 – y^2)^2 + … + (yn – y^n)^2]/n. Take the sum of squares without dividing by n, raise it to the power 0,5, and yes, you get the Euclidean distance between the set of actual y-s and that of predicted y^-s. Take the square root of the MSE itself and yes, you get something built just like the standard deviation, only measuring how far the predicted y^-s stray from the empirical y-s.
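Those measures of error are close cousins and easy to mix up, so here is a short sketch that computes them side by side. The actual and predicted values are made up.

```python
import math

# Hypothetical actual and predicted values (made-up numbers)
y     = [0.2, 0.5, 0.9, 0.4]
y_hat = [0.3, 0.4, 0.7, 0.4]

squared_errors = [(a - p) ** 2 for a, p in zip(y, y_hat)]

sse    = sum(squared_errors)   # sum of squared errors
mse    = sse / len(y)          # Mean Square Error
rmse   = math.sqrt(mse)        # square root of the MSE
euclid = math.sqrt(sse)        # Euclidean distance between y and y_hat

print(mse, rmse, euclid)
```

The Euclidean distance is simply the root of the MSE multiplied by the square root of n, so all three rank the quality of predictions in the same order.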

In this precise algorithm, the author goes down another avenue: he takes the actual differences between observed y-s and predicted y^-s, and then multiplies them by the sigmoid derivative of predicted y^-s. Then he takes the transpose of the matrix of signals coming into the given layer, and makes a dot product of it with those (y – y^)*(y^)’ terms, with (y^)’ standing for derivative. It goes like:

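    import numpy as np

    # method of the network class in James Loy's article; self.input, self.layer1,
    # self.output, self.y, self.weights1 and self.weights2 are attributes of that class
    def backprop(self):
        # application of the chain rule to find derivative of the loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1)))

        # update the weights with the derivative (slope) of the loss function
        self.weights1 += d_weights1
        self.weights2 += d_weights2

    def sigmoid(x):
        return 1.0 / (1 + np.exp(-x))

    def sigmoid_derivative(x):
        # x is expected to be a sigmoid output already, hence x * (1 - x)
        return x * (1.0 - x)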

I am still trying to wrap my mind around the reasons for taking this specific approach to the backpropagation of errors. The derivative of a sigmoid y = 1/(1 + e^(-x)) is y’ = [1/(1 + e^(-x))]*{1 – [1/(1 + e^(-x))]}, i.e. y*(1 – y), which is why the code computes it as x * (1.0 – x) on values that are already sigmoids. As any derivative, it measures the slope of change in y. When I do (y1 – y^1)*(y^1)’ + (y2 – y^2)*(y^2)’ + … + (yn – y^n)*(y^n)’ it is as if I were taking some kind of weighted sum. That weighted sum can be understood in two alternative ways. Either it is the deviation of y^ from y, weighted with the local slopes, or it is a general slope weighted with local deviations. Now, the transpose. It does not invert anything: it just flips rows and columns, so that the matrix of signals coming into a layer lines up, observation by observation, with the column {(y1 – y^1)*(y^1)’ ; (y2 – y^2)*(y^2)’ ; … (yn – y^n)*(y^n)’}. The ‘.dot’ product then multiplies each incoming signal by the corresponding slope-weighted error, and sums those products over all the observations, separately for each weight. Then, I feed the ‘.dot’ product into the neural network with the ‘+=’ operator. The latter means that each weight gets nudged by its own sum, and the next round of calculations starts from those corrected weights. Hmmweeellyyeess, makes some sense: the weights which carried strong signals into big errors on steep slopes get corrected the most. It has some mathematical charm.
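To see what the transpose and the ‘.dot’ actually do with those terms, here is a small sketch with made-up numbers: three observations and two hidden neurons. The names delta and by_hand are mine. It compares np.dot(layer1.T, delta) with the same thing written as explicit sums of products.

```python
import numpy as np

def sigmoid_derivative(s):
    # s is assumed to be an already computed sigmoid value
    return s * (1.0 - s)

# Hypothetical values: 3 observations, 2 hidden neurons (made-up numbers)
layer1 = np.array([[0.6, 0.3],
                   [0.8, 0.5],
                   [0.4, 0.9]])           # hidden-layer outputs
y      = np.array([[1.0], [0.0], [1.0]])  # actual values
output = np.array([[0.7], [0.4], [0.6]])  # predicted values

# local error weighted with the local slope, one per observation
delta = 2 * (y - output) * sigmoid_derivative(output)

# transpose + dot: for each hidden neuron, the sum over observations of
# (hidden output) * (slope-weighted error)
d_weights2 = np.dot(layer1.T, delta)

# the same thing written as explicit sums of products
by_hand = np.array([[sum(layer1[i, j] * delta[i, 0] for i in range(3))]
                    for j in range(2)])

print(d_weights2.ravel(), by_hand.ravel())
```

Both prints show the same pair of numbers: one correction per weight between the hidden layer and the output.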

Now, I try to apply the same logic to the data I am working with in my research. Just to give you an idea, I show some data for just one country: Australia. Why Australia? Honestly, I don’t see why it shouldn’t be. Quite a respectable place. Anyway, here is that table. GDP per unit of energy consumed can be considered as the target output variable y, and the rest are those x-s.

Table 1 – Selected data regarding Australia

Columns: y = GDP per unit of energy use (constant 2011 PPP $ per kg of oil equivalent); X1 = share of aggregate amortization in the GDP; X2 = supply of broad money, % of GDP; X3 = energy use (tons of oil equivalent per capita); X4 = urban population as % of total population; X5 = GDP per capita, ‘000 USD.
Year y X1 X2 X3 X4 X5
1990 5,662020744 14,46 54,146 5,062 85,4 26,768
1991 5,719765048 14,806 53,369 4,928 85,4 26,496
1992 5,639817305 14,865 56,208 4,959 85,566 27,234
1993 5,597913126 15,277 56,61 5,148 85,748 28,082
1994 5,824685357 15,62 59,227 5,09 85,928 29,295
1995 5,929177604 15,895 60,519 5,129 86,106 30,489
1996 5,780817973 15,431 62,734 5,394 86,283 31,566
1997 5,860645225 15,259 63,981 5,47 86,504 32,709
1998 5,973528571 15,352 65,591 5,554 86,727 33,789
1999 6,139349354 15,086 69,539 5,61 86,947 35,139
2000 6,268129418 14,5 67,72 5,644 87,165 35,35
2001 6,531818805 14,041 70,382 5,447 87,378 36,297
2002 6,563073754 13,609 70,518 5,57 87,541 37,047
2003 6,677186947 13,398 74,818 5,569 87,695 38,302
2004 6,82834791 13,582 77,495 5,598 87,849 39,134
2005 6,99630318 13,737 78,556 5,564 88 39,914
2006 6,908872246 14,116 83,538 5,709 88,15 41,032
2007 6,932137612 14,025 90,679 5,868 88,298 42,022
2008 6,929395465 13,449 97,866 5,965 88,445 42,222
2009 7,039061961 13,698 94,542 5,863 88,59 41,616
2010 7,157467568 12,647 101,042 5,649 88,733 43,155
2011 7,291989544 12,489 100,349 5,638 88,875 43,716
2012 7,671605162 13,071 101,852 5,559 89,015 43,151
2013 7,891026044 13,455 106,347 5,586 89,153 43,238
2014 8,172929207 13,793 109,502 5,485 89,289 43,071

In his article, James Loy reports the cumulative error over 1500 iterations of training, with just four series of x-s, made of four observations. I do something else. I am interested in how the network works, step by step. I do step-by-step calculations with data from that table, following that algorithm I have just discussed. I do it in Excel, and I observe the way that the network behaves. I can see that the hidden layer is really hidden, to the extent that it does not produce much in terms of meaningful information. What really spins is the output layer, thus, in fact, the connection between the hidden layer and the output. In the hidden layer, all the predicted sigmoid y^ are equal to 1, and their derivatives are automatically 0. That is the sigmoid saturating: my raw x-s run into tens, so their weighted sum is big enough for e^(-x) to practically vanish at the precision Excel works with. Still, in the output layer, the second random distribution of weights overlaps with the first one from the hidden layer. Then, for some years, those output sigmoids demonstrate tiny differences from 1, and their derivatives become very small positive numbers. As a result, tiny, local (yi – y^i)*(y^i)’ expressions are being generated in the output layer, and they modify the initial weights in the next round of training.
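The sigmoids stuck at 1 in the hidden layer can be reproduced with one row of Table 1. A minimal sketch, assuming that raw, unscaled values are fed into the network, and using hypothetical weights of 0,5 each in place of random ones:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# X1..X5 for Australia, 1990, taken from Table 1
x_1990 = [14.46, 54.146, 5.062, 85.4, 26.768]

# Hypothetical weights, all equal to 0.5 (stand-ins for random weights in (0, 1))
weights = [0.5] * 5

z = sum(w * x for w, x in zip(weights, x_1990))  # weighted sum of raw inputs
s = sigmoid(z)                                   # hidden-layer sigmoid
slope = s * (1.0 - s)                            # its derivative

print(z, s, slope)
```

The weighted sum lands above 90, e raised to minus that is far below what double-precision arithmetic can tell apart from zero next to 1, and so the sigmoid comes out as exactly 1 and its slope as exactly 0.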

I observe the cumulative error (loss) in the first four iterations. In the first one it is 0,003138796, the second round brings 0,000100228, the third round displays 0,0000143, and the fourth one 0,005997739. Looks like an initial reduction of cumulative error, by roughly one order of magnitude at each iteration, and then, in the fourth round, it jumps up to the highest cumulative error of the four. I extend the number of those hand-driven iterations from four to six, and I keep feeding the network with random weights, again and again. A pattern emerges. The cumulative error oscillates. Sometimes the network drives it down, sometimes it swings it up.
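The same kind of round-by-round observation can be sketched in Python instead of Excel. This is an illustration under my own assumptions, not a replica of my spreadsheet: the data is a toy set of four made-up observations, the hidden layer has four neurons, the cumulative error is taken as the sum of squared differences, and the weights are drawn at random once and then updated by backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(s):
    # s is assumed to be a sigmoid output already
    return s * (1.0 - s)

rng = np.random.default_rng(42)   # arbitrary seed, for repeatability

# Toy data, made up for illustration: 4 observations, 3 inputs, 1 output
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

weights1 = rng.random((3, 4))     # input -> hidden, 4 hidden neurons assumed
weights2 = rng.random((4, 1))     # hidden -> output

losses = []
for step in range(6):
    # feedforward
    layer1 = sigmoid(np.dot(X, weights1))
    output = sigmoid(np.dot(layer1, weights2))
    # cumulative error of this round (assumed: sum of squared differences)
    losses.append(float(np.sum((y - output) ** 2)))
    # backpropagation: slope-weighted errors, pushed back through the weights
    delta2 = 2 * (y - output) * sigmoid_derivative(output)
    delta1 = np.dot(delta2, weights2.T) * sigmoid_derivative(layer1)
    weights2 += np.dot(layer1.T, delta2)
    weights1 += np.dot(X.T, delta1)

for step, loss in enumerate(losses, start=1):
    print(step, round(loss, 6))
```

The list losses holds one cumulative error per round, which is the series to watch for swings up and down.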

F**k! Pardon my French, but just six iterations of that algorithm show me that the thing is really intelligent. It generates an error, it drives it down to a lower value, and then, as if it was somehow dysfunctional to jump to conclusions that quickly, it generates a greater error in consecutive steps, as if it was considering more alternative options. I know that data scientists, should they read this, can slap their thighs at that elderly uncle (i.e. me), fascinated with how a neural network behaves. Still, for me, it is science. I take my data, I feed it into a machine that I see for the first time in my life, and I observe intelligent behaviour in something written on less than one page. It experiments with weights attributed to the stimuli I feed into it, and it evaluates its own error.

Now, I understand why that scientist from MIT, Lex Fridman, says that building artificial intelligence brings insights into how the human brain works.

I am consistently delivering good, almost new science to my readers, and love doing it, and I am working on crowdfunding this activity of mine. As we talk business plans, I remind you that you can download, from the library of my blog, the business plan I prepared for my semi-scientific project Befund (and you can access the French version as well). You can also get a free e-copy of my book ‘Capitalism and Political Power’. You can support my research by donating directly, any amount you consider appropriate, to my PayPal account. You can also consider going to my Patreon page and becoming my patron. If you decide so, I will be grateful if you suggest me two things that Patreon suggests me to suggest you. Firstly, what kind of reward would you expect in exchange for supporting me? Secondly, what kind of phases would you like to see in the development of my research, and of the corresponding educational tools?