Pardon my French, but the thing is really intelligent

My editorial on YouTube

And so I am meddling with neural networks. It had to come. It just had to. I started with me having many ideas to develop at once. Routine stuff with me. Then, the Editor-in-Chief of the ‘Energy Economics’ journal returned my manuscript of an article on the energy efficiency of national economies, which I had submitted to them, with a general remark that I should work both on the clarity of my hypotheses, and on the scientific spin of my empirical research. In short, Mr Wasniewski, linear models tested with Ordinary Least Squares are a bit of an oldie, if you catch my drift. Bloody right, Mr Editor-In-Chief. Basically, I agree with your remarks. I need to move out of my cavern, towards the light of progress, and get acquainted with the latest fashion. The latest fashion we are wearing this season is artificial intelligence, machine learning, and neural networks.

It comes in handy, to the extent that I obsessively meddle with the issue of collective intelligence, and am dreaming about creating a model of human social structure acting as collective intelligence, sort of a beehive. Whilst the casting for a queen in that hive remains open, and is likely to stay this way for a while, I am digging into the very basics of neural networks. I am looking in the Python department, as I have already got a bit familiar with that environment. I found an article by James Loy, entitled “How to build your own Neural Network from scratch in Python”. The article looks as if it sourced from another one, available at the website of ProBytes Software, so I use both to develop my understanding. I pasted the whole algorithm by James Loy into my Python Shell, made it run with an ‘enter’, and I am waiting for what it is going to produce. In the meantime, I am being verbal about my understanding.

The author declares he wants to do more or less the same thing as I do, namely to understand neural networks. He constructs a simple algorithm for a neural network. It starts with defining the neural network as a class, i.e. as a callable object that acts as a factory for new instances of itself. In the neural network defined as a class, the algorithm starts with the constructor function ‘__init__’, which constructs an instance ‘self’ of that class. It goes like ‘def __init__(self, x, y):’. In other words, the class ‘NeuralNetwork’ generates instances ‘self’ of itself, and each instance is essentially made of two variables: input x, and output y. The ‘x’ is declared as the input variable through the ‘self.input = x’ expression. Then, the output of the network is defined in two steps. Yes, the ‘y’ is generally the output, only in a neural network we want the network to predict a value of ‘y’, thus some kind of y^. What we have to do is to define ‘self.y = y’, feed the real x-s and the real y-s into the network, and expect the latter to turn out some y^-s.

Logically, we need to prepare a vessel for holding the y^-s. The vessel is defined as ‘self.output = np.zeros(y.shape)’. The ‘shape’ attribute returns a tuple (a table, for those mildly fond of maths) with the dimensions of an array. What are the dimensions of ‘y’ in that ‘y.shape’? They are simply those of the empirical ‘y’ we feed in: one column of observed values, one row per observation. The weights of the network, in turn, are defined just before. Right after the ‘self.input = x’ has been said, ‘self.weights1 = np.random.rand(self.input.shape[1],4)’ fires off, closely followed by ‘self.weights2 = np.random.rand(4,1)’. All in all, the entire class ‘NeuralNetwork’ is defined in the following form:

import numpy as np

class NeuralNetwork:

    def __init__(self, x, y):
        self.input      = x                                          # empirical input data, one row per observation
        self.weights1   = np.random.rand(self.input.shape[1],4)      # weights between the input and the hidden layer
        self.weights2   = np.random.rand(4,1)                        # weights between the hidden layer and the output
        self.y          = y                                          # empirical output data
        self.output     = np.zeros(self.y.shape)                     # placeholder for the predicted y^-s

The output of each instance of that neural network is a two-dimensional array shaped exactly like the empirical ‘y’: one column of predicted values, one row per observation. Initially, it is filled with zeros, so as to make room for something more meaningful. The predicted y^-s are supposed to jump into those empty sockets, held ready by the zeros. The weights, in turn, come in two tables: ‘weights1’ has one row per input variable and four columns, one for each neuron of the hidden layer, whilst ‘weights2’ has four rows and one column. The ‘np.random.rand’ expression, associated with ‘weights’, means that the network starts by assigning randomly different levels of importance to the different x-s fed into it.
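Just to fix those dimensions in my head, here is a minimal sketch of what the constructor produces; the numbers of observations and of input variables (25 and 5, roughly matching the Australian table further down) are my assumption, not something given in James Loy’s code:

import numpy as np

x = np.random.rand(25, 5)                   # hypothetical: 25 yearly observations, 5 input variables
y = np.random.rand(25, 1)                   # hypothetical: 25 observed values of the output
weights1 = np.random.rand(x.shape[1], 4)    # one row per input variable, one column per hidden neuron
weights2 = np.random.rand(4, 1)             # one weight per hidden neuron
output = np.zeros(y.shape)                  # empty sockets for the predicted y^-s

print(weights1.shape, weights2.shape, output.shape)    # (5, 4) (4, 1) (25, 1)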

Anyway, the next step is to instruct my snake (i.e. Python) what to do next with that class ‘NeuralNetwork’. It is supposed to do two things: feed data forward, i.e. make those neurons work on predicting the y^-s, and then check itself by an operation called backpropagation of errors. The latter consists in comparing the predicted y^-s with the real y-s, measuring the discrepancy as a loss of information, updating the initial random weights with conclusions from that measurement, and doing it all again, and again, and again, until the error runs down to very low values. The weights applied by the network in order to generate that lowest possible error are the best the network can do in terms of learning.

The feeding forward of predicted y^-s goes on in two steps, or in two layers of neurons, one hidden, and one final. They are defined as:

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))      # hidden layer: weighted inputs squashed by the sigmoid
        self.output = sigmoid(np.dot(self.layer1, self.weights2))     # output layer: weighted hidden signals squashed again

The ‘sigmoid’ part means the sigmoid function, AKA logistic function, expressed as y = 1/(1 + e^(-x)), where, at the end of the day, the y always falls somewhere between 0 and 1, and the ‘x’ is not really the empirical, real ‘x’, but the ‘x’ multiplied by a weight, ranging between 0 and 1 as well. The sigmoid function is good for testing the weights we apply to various types of input x-s. Whatever kind of data you take, populations measured in millions, or consumption of energy per capita measured in kilograms of oil equivalent, the basic sigmoid function y = 1/(1 + e^(-x)) will always yield a value between 0 and 1. This function essentially normalizes any data.
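A quick way to see that normalizing effect is to push numbers of very different scales through the function; the example values below are mine, picked only for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

print(sigmoid(0.3))       # a small weighted signal: about 0.574
print(sigmoid(5.062))     # energy use per capita, tons of oil equivalent: about 0.994
print(sigmoid(25.0))      # a population counted in millions: practically 1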

Now, I want to take differentiated data, like population as headcount, energy consumption in them kilograms of whatever oil equals to, and the supply of money in standardized US dollars. Quite a mix of units and scales of measurement. I label those three as, respectively, xa, xb, and xc. I assign them weights ranging between 0 and 1, so that the sum of weights never exceeds 1. In plain language it means that for every vector of observations made of xa, xb, and xc I take a pinchful of xa, then a zest of xb, and a spoon of xc. I make them into x = wa*xa + wb*xb + wc*xc, give it a minus sign, and put it as an exponent of Euler’s number e.

That yields y = 1/(1 + e^(-(wa*xa + wb*xb + wc*xc))). Long, but meaningful to the extent that now, my y always lands somewhere between 0 and 1, and I can experiment with various weights for my various shades of x, and look at what it gives in terms of y.
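Here is a minimal sketch of that mix, with made-up weights (0.5, 0.3, 0.2, summing to 1) and made-up values standing in for a population in millions, energy use per capita, and GDP per capita; none of it comes from James Loy’s code, it is just my illustration:

import numpy as np

xa, xb, xc = 17.0, 5.062, 26.768    # hypothetical values: population in millions, energy use per capita, GDP per capita in '000 USD
wa, wb, wc = 0.5, 0.3, 0.2          # hypothetical weights, chosen so that they sum to 1

x = wa*xa + wb*xb + wc*xc           # the weighted pinch-zest-spoon mix
y = 1.0 / (1 + np.exp(-x))          # the sigmoid of it
print(x, y)                         # x is about 15.37, y is about 0.9999998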

In the algorithm above, the ‘np.dot’ function conveys the idea of weighing our x-s. Fed with the matrix of input signals ‘x’ and the matrix of weights ‘w’, ‘np.dot’ computes their matrix product, i.e. a weighted sum of inputs for every observation and every neuron, exactly in the x = wa*xa + wb*xb + wc*xc drift.
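A small sketch of what ‘np.dot’ does with those matrices; the toy numbers are mine, and the point is only that the matrix product is a stack of such weighted sums, one per observation:

import numpy as np

x = np.array([[17.0, 5.062, 26.768],          # observation 1 (hypothetical values)
              [18.0, 5.644, 35.350]])         # observation 2 (hypothetical values)
w = np.array([[0.5],
              [0.3],
              [0.2]])                         # one column of weights, one row per input variable

print(np.dot(x, w))                           # [[15.3722], [17.7632]]: wa*xa + wb*xb + wc*xc for each observation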

Thus, the first really smart layer of the network, the hidden one, takes the empirical x-s, weighs them with random weights, and makes a sigmoid of that. The next layer, the output one, takes the sigmoid-calculated values from the hidden layer, and applies the same operation to them.

One more remark about the sigmoid. You can put something else instead of 1 in the numerator. Then, the sigmoid will yield your data normalized over that something. If you have a process that tends towards a level of saturation, e.g. the number of psilocybin parties per month, you can put that level in the numerator. On top of that, you can add parameters to the denominator. In other words, you can replace the 1 + e^(-x) with ‘b + e^(-k*x)’, where b and k can be whatever seems to make sense for you. With that specific spin, the sigmoid is good for simulating anything that tends towards saturation over time. Depending on the parameters in the denominator, the shape of the corresponding curve will change. Usually, ‘b’ works well when taken as a fraction of the numerator (the saturation level), and the ‘k’ seems to behave meaningfully when kept between 0 and 1.
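Here is a minimal sketch of that generalized, saturating sigmoid; the saturation level M and the parameters b and k are arbitrary values of mine, picked only to show the shape of the curve:

import numpy as np

def saturating_sigmoid(x, M=100.0, b=2.0, k=0.5):
    # M: the level of saturation put in the numerator (hypothetical)
    # b, k: parameters of the denominator, b + e^(-k*x) (hypothetical)
    return M / (b + np.exp(-k * x))

for t in [0, 2, 5, 10, 20]:
    print(t, round(saturating_sigmoid(t), 2))    # climbs from about 33.3 towards the ceiling M/b = 50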

I return to the algorithm. Now, as the network has generated a set of predicted y^-s, it is time to compare them to the actual y-s, and to evaluate how much there is still to learn. We can use any measure of error, still, most frequently, them algorithms go after the simplest one, namely the Mean Square Error: MSE = [(y1 - y^1)^2 + (y2 - y^2)^2 + … + (yn - y^n)^2]/n. Take its square root, and you get the Euclidean distance between the set of actual y-s and that of predicted y^-s, divided by the square root of n; it also plays the role of a standard deviation of the predicted y^-s from the actual, empirical y-s.
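For the record, a small numpy sketch of those error measures, on made-up vectors of actual and predicted values:

import numpy as np

y_actual    = np.array([5.66, 5.72, 5.64, 5.60])     # hypothetical actual y-s
y_predicted = np.array([5.50, 5.80, 5.70, 5.55])     # hypothetical predicted y^-s

errors = y_actual - y_predicted
mse = np.mean(errors**2)                             # Mean Square Error
euclidean = np.sqrt(np.sum(errors**2))               # Euclidean distance between the two vectors
print(mse, np.sqrt(mse), euclidean)                  # sqrt(mse) equals euclidean / sqrt(n)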

In this precise algorithm, the author goes down another avenue: he takes the actual differences between observed y-s and predicted y^-s, and multiplies them by the sigmoid derivative of the predicted y^-s, with (y^)’ standing for that derivative. Then he multiplies the transpose of the hidden layer’s activations by that column of (y - y^)*(y^)’ terms, which gives the correction to apply to the second set of weights; for the first set, the same error signal is chained back through the transposed ‘weights2’ and the derivative of the hidden layer, and multiplied by the transpose of the inputs. It goes like:

    def backprop(self):
        # application of the chain rule to find the derivative of the loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T,  (np.dot(2*(self.y - self.output) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1)))

        # update the weights with the derivative (slope) of the loss function
        self.weights1 += d_weights1
        self.weights2 += d_weights2

# the two helper functions live outside the class, so that the methods above can call them directly
def sigmoid(x):
    return 1.0/(1 + np.exp(-x))

def sigmoid_derivative(x):
    # the derivative of the sigmoid expressed through its own output: y' = y*(1 - y)
    return x * (1.0 - x)

I am still trying to wrap my mind around the reasons for taking this specific approach to the backpropagation of errors. The derivative of a sigmoid y = 1/(1 + e^(-x)) is y’ = [1/(1 + e^(-x))]*[1 - 1/(1 + e^(-x))], i.e. y’ = y*(1 - y), and, as any derivative, it measures the slope of change in y. When I do (y1 - y^1)*(y^1)’ + (y2 - y^2)*(y^2)’ + … + (yn - y^n)*(y^n)’, it is as if I were taking some kind of weighted sum. That weighted sum can be understood in two alternative ways: either it is the total deviation of y^ from y, weighted with the local slopes, or it is a general slope weighted with the local deviations. The transpose in the code is less mysterious than I first thought: it simply flips the matrix of hidden-layer signals (or of inputs, for the first set of weights) so that the ‘.dot’ product lines up, and that product adds up, for each individual weight, those (y - y^)*(y^)’ terms, each scaled by the signal that actually flowed through that weight. Then, I feed the result back into the network with the ‘+=’ operator, which nudges each weight in the direction that should reduce the error in the next round of calculations. Hmmweeellyyeess, makes some sense. I don’t know yet how much sense exactly, but it has some mathematical charm.
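To check that I have the whole mechanism right, I put the pieces together into one runnable sketch. The constructor, the feedforward, the backprop, and the two helper functions are the ones discussed above; the tiny binary toy dataset, the 1500-round training loop, and the loss printout are mine, added just to watch the error actually go down:

import numpy as np

def sigmoid(x):
    return 1.0/(1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1.0 - x)

class NeuralNetwork:
    def __init__(self, x, y):
        self.input    = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)
        self.weights2 = np.random.rand(4, 1)
        self.y        = y
        self.output   = np.zeros(self.y.shape)

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1)))
        self.weights1 += d_weights1
        self.weights2 += d_weights2

if __name__ == "__main__":
    # a made-up binary toy dataset, in the spirit of the small four-observation example from the article
    X = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1]])
    y = np.array([[0], [1], [1], [0]])

    nn = NeuralNetwork(X, y)
    for i in range(1500):
        nn.feedforward()
        nn.backprop()
        if i % 300 == 0:
            loss = np.mean((y - nn.output)**2)    # the cumulative error at this round of training
            print(i, round(loss, 6))

    print(nn.output)    # the predicted y^-s after training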

Now, I try to apply the same logic to the data I am working with in my research. Just to give you an idea, I show some data for just one country: Australia. Why Australia? Honestly, I don’t see why it shouldn’t be. Quite a respectable place. Anyway, here is that table. GDP per unit of energy consumed can be considered as the target output variable y, and the rest are those x-s.

Table 1 – Selected data regarding Australia

y  = GDP per unit of energy use (constant 2011 PPP $ per kg of oil equivalent)
X1 = Share of aggregate amortization in the GDP
X2 = Supply of broad money, % of GDP
X3 = Energy use (tons of oil equivalent per capita)
X4 = Urban population as % of total population
X5 = GDP per capita, ‘000 USD

Year y X1 X2 X3 X4 X5
1990 5,662020744 14,46 54,146 5,062 85,4 26,768
1991 5,719765048 14,806 53,369 4,928 85,4 26,496
1992 5,639817305 14,865 56,208 4,959 85,566 27,234
1993 5,597913126 15,277 56,61 5,148 85,748 28,082
1994 5,824685357 15,62 59,227 5,09 85,928 29,295
1995 5,929177604 15,895 60,519 5,129 86,106 30,489
1996 5,780817973 15,431 62,734 5,394 86,283 31,566
1997 5,860645225 15,259 63,981 5,47 86,504 32,709
1998 5,973528571 15,352 65,591 5,554 86,727 33,789
1999 6,139349354 15,086 69,539 5,61 86,947 35,139
2000 6,268129418 14,5 67,72 5,644 87,165 35,35
2001 6,531818805 14,041 70,382 5,447 87,378 36,297
2002 6,563073754 13,609 70,518 5,57 87,541 37,047
2003 6,677186947 13,398 74,818 5,569 87,695 38,302
2004 6,82834791 13,582 77,495 5,598 87,849 39,134
2005 6,99630318 13,737 78,556 5,564 88 39,914
2006 6,908872246 14,116 83,538 5,709 88,15 41,032
2007 6,932137612 14,025 90,679 5,868 88,298 42,022
2008 6,929395465 13,449 97,866 5,965 88,445 42,222
2009 7,039061961 13,698 94,542 5,863 88,59 41,616
2010 7,157467568 12,647 101,042 5,649 88,733 43,155
2011 7,291989544 12,489 100,349 5,638 88,875 43,716
2012 7,671605162 13,071 101,852 5,559 89,015 43,151
2013 7,891026044 13,455 106,347 5,586 89,153 43,238
2014 8,172929207 13,793 109,502 5,485 89,289 43,071

In his article, James Loy reports the cumulative error over 1500 iterations of training, with just four series of x-s, made of four observations. I do something else. I am interested in how the network works, step by step. I do step-by-step calculations with data from that table, following the algorithm I have just discussed. I do it in Excel, and I observe the way the network behaves. I can see that the hidden layer is really hidden, to the extent that it does not produce much in terms of meaningful information. What really spins is the output layer, thus, in fact, the connection between the hidden layer and the output. In the hidden layer, all the predicted sigmoid y^ are equal to 1, and their derivatives are automatically 0; most likely because my raw, unscaled inputs produce large weighted sums, which push the sigmoid into saturation. Still, in the output layer, where the second random distribution of weights overlaps with the first one from the hidden layer, something happens. For some years, those output sigmoids demonstrate tiny differences from 1, and their derivatives become very small positive numbers. As a result, tiny, local (yi - y^i)*(y^i)’ expressions are being generated in the output layer, and they modify the initial weights in the next round of training.
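A quick numpy sketch of why that hidden layer goes flat: I take the 1990 row of the table as input (decimal commas turned into points), draw random weights the way the constructor does, and look at what the sigmoid makes of the weighted sums; the exact numbers change from run to run, but the conclusion does not:

import numpy as np

def sigmoid(x):
    return 1.0/(1 + np.exp(-x))

x_1990 = np.array([[14.46, 54.146, 5.062, 85.4, 26.768]])    # X1..X5 for 1990
weights1 = np.random.rand(5, 4)                              # random weights between 0 and 1

z = np.dot(x_1990, weights1)       # weighted sums, almost always far above 10
layer1 = sigmoid(z)                # all four values come out as practically 1.0
print(z)
print(layer1)
print(layer1 * (1.0 - layer1))     # the sigmoid derivatives: practically zero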

I observe the cumulative error (loss) in the first four iterations. In the first one it is 0,003138796, the second round brings 0,000100228, the third round displays 0,0000143, and the fourth one 0,005997739. Looks like an initial reduction of the cumulative error, by one order of magnitude at each iteration, and then, in the fourth round, it jumps up to the highest cumulative error of the four. I extend the number of those hand-driven iterations from four to six, and I keep feeding the network with random weights, again and again. A pattern emerges. The cumulative error oscillates. Sometimes the network drives it down, sometimes it swings it up.

F**k! Pardon my French, but just six iterations of that algorithm show me that the thing is really intelligent. It generates an error, it drives it down to a lower value, and then, as if it was somehow dysfunctional to jump to conclusions that quickly, it generates a greater error in consecutive steps, as if it was considering more alternative options. I know that data scientists, should they read this, can slap their thighs at that elderly uncle (i.e. me), fascinated with how a neural network behaves. Still, for me, it is science. I take my data, I feed it into a machine that I see for the first time in my life, and I observe intelligent behaviour in something written on less than one page. It experiments with weights attributed to the stimuli I feed into it, and it evaluates its own error.

Now, I understand why that scientist from MIT, Lex Fridman, says that building artificial intelligence brings insights into how the human brain works.

I am consistently delivering good, almost new science to my readers, and I love doing it, and I am working on crowdfunding this activity of mine. As we talk business plans, I remind you that you can download, from the library of my blog, the business plan I prepared for my semi-scientific project Befund (and you can access the French version as well). You can also get a free e-copy of my book ‘Capitalism and Political Power’. You can support my research by donating directly, any amount you consider appropriate, to my PayPal account. You can also consider going to my Patreon page and becoming my patron. If you decide so, I will be grateful if you suggest me two things that Patreon suggests I should suggest you. Firstly, what kind of reward would you expect in exchange for supporting me? Secondly, what kind of phases would you like to see in the development of my research, and of the corresponding educational tools?
