Two loops, one inside the other

I am developing my skills in programming by attacking the general construct of Markov chains and state space. My theory on the bridging between collective intelligence in human societies and artificial neural networks as simulators thereof is that both are intelligent structures. I assume that they learn by producing many alternative versions of themselves whilst staying structurally coherent, and they pitch each such version against a desired output, just to see how fit that particular take on existence is, regarding the requirements in place.  

Mathematically, that learning-by-doing is a Markov chain of states, i.e. a sequence of complex states, each described by a handful of variables, such that each consecutive state in the sequence is a modification of the preceding state, through a logically coherent σ-algebra. My findings so far suggest that orienting the intelligent structure on specific outcomes, out of all those available, is crucial for the path of learning that structure takes. In other words, the general hypothesis I am sniffing around and digging into is that the way an intelligent structure learns is principally determined by the desired outcomes which the structure is after, more than by the exact basket of inputs it uses. Stands to reason, for a neural network: the thing optimises inputs so as to fit, as closely as it can, the outcome it is after.
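In the simplest notation, assuming $S_t$ stands for the state of the structure after the t-th experimental round, that claim boils down to the Markov property:

$$P(S_{t+1} \mid S_t, S_{t-1}, \dots, S_0) = P(S_{t+1} \mid S_t)$$

i.e. each new version of the structure is produced out of the current one alone, not out of the whole history of previous versions.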

As I am developing a real taste for stepping out of my cave, I have installed Anaconda on my computer, from https://www.anaconda.com/products/individual/download-success . When I use Anaconda, I use the same JupyterLab functionality which I have been using online so far, with one difference. Anaconda allows me to create a user account with JupyterLab, and to have all my work stored on that account. Probably, there are some storage limits, yet the thing is practical.

Anyway, I want to program in Python, just as I do it in Excel, intelligent structures able to emulate the collective intelligence of human societies. A basic finding of mine, in my research so far, is that intelligent structures alter their behaviour significantly depending on the outcome they pursue. The initial landscape I start operating in is akin to a junkyard of information. I go to the World Bank's website, for example, I mean the one with freely available data, AKA https://data.worldbank.org , and I start rummaging. Quality of life, size of economies, headcount of populations… What else? Oh, yes, there are things about education, energy consumption and whatnot. All that stuff just piled up nicely, each item easy to retrieve, and yet, how does it all make sense together? My take on the thing is that there is stuff going on, like all the time and everywhere. We are part of that ongoing stuff, actually. Out of that stream of happening, we perceptually single out phenomenological cuts, and we isolate those specific cuts because we are able to measure them with some kind of gauge. Data-driven observation of ourselves in the world is closely connected to our science of measuring and counting stuff. Have you noticed that a basic metric, i.e. how many of us are there around, can take a denominator of one, when we count the population of a city, or a denominator of 10 000, when we are interested in the incidence of criminality?

Each quantitative variable whose dataset I can observe and download from https://data.worldbank.org comes out of that complex process of collective cognition, resembling a huge bunch of psychos walking around with rulers and abacuses, trying to measure everything they perceive. I use data as a phenomenological description of both the reality those psychos (me included) live in, and the way they measure that reality. I want to check which among those quantitative variables are particularly suitable to represent the things we are really after, our collectively desired outcomes. The method I use consists in producing as many variations of the original dataset as there are variables in it. Each variation of the original dataset has one variable singled out as output, and the remaining ones serve as input. I run each such variation through a simple neural network (the simpler, the better), where standardised, randomly weighted and neurally activated input gets compared with the pre-set output. I measure the mean expected values of all the variables in such a transformation, i.e. when I run it through 3000 experimental rounds, I measure those means over the same 3000 rounds. I compute the Euclidean distance between each such vector of means and its cousin computed for the original dataset. I assume that, with rigorously the same logical structure of the neural network, those variations differ from each other just by the output variable they are pegged on. When I say 'pegged', by the way, I mean that the output variable is not subject to random weighting, i.e. it is not being experimented with. It comes exogenously, and is taken as it is.
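To be explicit about the metric, assuming $\bar{x}^{source}$ and $\bar{x}^{clone(i)}$ stand for the vectors of mean expected values in the original dataset and in its i-th variation, the distance I compute is:

$$d_i = \left\lVert \bar{x}^{source} - \bar{x}^{clone(i)} \right\rVert_2 = \sqrt{\sum_{k} \left( \bar{x}^{source}_k - \bar{x}^{clone(i)}_k \right)^2}$$

with k running over the variables of the dataset.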

I noticed that each time I do that procedure, with whatever set of variables I take, one or two among them, when taken as output, produce variations much closer to the original dataset than the others, in terms of Euclidean distance. It looks as if the neural network, when pegged on those particular variables, emulated a process of adaptation particularly similar to what is represented by the original empirical data.

Now, I want to learn how to program, in Python, the production of alternative 'input <> output' couplings out of a source dataset. I already know the general drill for producing just one such coupling. Once I have my dataset read out of a CSV file into a DataFrame in Python Pandas, I start with creating a list of all the numerical columns (I name it 'dict_numerical', although, strictly speaking, it is a list, not a dictionary):

>> dict_numerical = ['numerical_column1', 'numerical_column2', …, 'numerical_column_n']

A simple way of doing that, with large data frames, is to type in Python:

>> df.columns

… and it yields an Index of column labels in quotation marks, separated with commas. I just copy that lot, without the non-numerical columns, into the square brackets of dict_numerical = […], and Bob's my uncle.
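By the way, a shorter route I have come across, sketched here under the assumption that df is the source data frame: pandas can pick out the numeric columns by itself, which spares the copy-pasting.

>> dict_numerical = df.select_dtypes(include='number').columns.tolist()  # only the numeric columns, as a plain list of labels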

Then I make a strictly numerical version of my database, by:

>> df_numerical = pd.DataFrame(df[dict_numerical])

By the way, each time I produce a new data frame, I check its structure with the commands 'df.info()' and 'df.describe()'. At my neophytic level of programming, I want to make sure that what I have in a strictly numerical database is strictly numerical data, i.e. the 'float64' type. Here, one hint: when you convert your data from an original Excel file, pay attention to having your decimal separator as a point, i.e. as '0.0', not as a comma. With a comma, the Pandas reader tends to interpret such data by default as 'object'. Annoying.
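A minimal workaround, sketched under the assumption that the file uses commas as decimal marks and semicolons as separators (a frequent pairing in European exports; the file name below is just an example):

>> df = pd.read_csv('my_excel_export.csv', sep=';', decimal=',')  # read_csv then turns '3,14' into the float 3.14
>> df.info()  # check that the columns now show up as float64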

Once I have that numerical data frame in place, I make another list of the type:

>> dict_for_Input_pegged_on_X_as_output = ['numerical_input_column1', 'numerical_input_column2', …, 'numerical_input_column_k']

… where k = n - 1, of course, and the missing variable is X, the one supposed to serve as output.

I use that dictionary to split df_numerical:

>> df_output_X = df_numerical['numerical_column_X']

>> df_input_for_X = df_numerical[dict_for_Input_pegged_on_X_as_output]
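As a side note, a sketch of a shorter way to get the same split, without maintaining the k-element list by hand, could be to drop the output column from the numerical data frame (same placeholder names as above):

>> df_output_X = df_numerical['numerical_column_X']
>> df_input_for_X = df_numerical.drop(columns=['numerical_column_X'])  # everything except the output column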

I would like to automatise the process. It means I need a loop. I am looping over the range of numerical columns in df_numerical. Let's dance. I start routinely, in my Anaconda-JupyterLab-powered notebook. By the way, I noticed an interesting practical feature of JupyterLab. When you start it directly from its website https://jupyter.org , the notebook you can use has somewhat limited functionality as compared to the notebook you can create when accessing JupyterLab from the Anaconda app on your computer. In the latter case you can create an account with JupyterLab, with a very useful functionality of mirroring the content of your cloud account on your hard drive. I know, I know: we use the cloud so as not to collect rubbish on our own disk. Still, Python files are small, they take little space, and I discovered that this mirroring stuff is really useful.

I open up with importing the libraries I think I will need:

>> import numpy as np

>> import pandas as pd

>> import math

>> import os

As I am learning new stuff, I prefer taking known stuff as my data. Once again, I use a dataset which I made out of Penn World Table 9.1, by kicking out all the rows with empty cells [see: Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), "The Next Generation of the Penn World Table", American Economic Review, 105(10), 3150-3182, www.ggdc.net/pwt].

I already have that dataset in my working directory. By the way, when you install Anaconda on a MacBook, its working directory is by default the user's home directory. For the moment, I keep it that way. Anyway, I have that dataset and I read it into a Pandas data frame:

>> PWT=pd.DataFrame(pd.read_csv('PWT 9_1 no empty cells.csv', header=0))

I create my first lists. I type:

>> PWT.columns

… which yields:

Index(['country', 'year', 'rgdpe', 'rgdpo', 'pop', 'emp', 'emp / pop', 'avh',
       'hc', 'ccon', 'cda', 'cgdpe', 'cgdpo', 'cn', 'ck', 'ctfp', 'cwtfp',
       'rgdpna', 'rconna', 'rdana', 'rnna', 'rkna', 'rtfpna', 'rwtfpna',
       'labsh', 'irr', 'delta', 'xr', 'pl_con', 'pl_da', 'pl_gdpo', 'csh_c',
       'csh_i', 'csh_g', 'csh_x', 'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g',
       'pl_x', 'pl_m', 'pl_n', 'pl_k'],
      dtype='object')

…and I create the list of quantitative variables:

>> Variables=['rgdpe', 'rgdpo', 'pop', 'emp', 'emp / pop', 'avh',
              'hc', 'ccon', 'cda', 'cgdpe', 'cgdpo', 'cn', 'ck', 'ctfp', 'cwtfp',
              'rgdpna', 'rconna', 'rdana', 'rnna', 'rkna', 'rtfpna', 'rwtfpna',
              'labsh', 'irr', 'delta', 'xr', 'pl_con', 'pl_da', 'pl_gdpo', 'csh_c',
              'csh_i', 'csh_g', 'csh_x', 'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g',
              'pl_x', 'pl_m', 'pl_n', 'pl_k']

The 'Variables' list serves me to mutate the 'PWT' data frame into its close cousin, obsessed with numbers, namely into 'PWT_Numerical':

>> PWT_Numerical = pd.DataFrame(PWT[Variables])

I quickly check PWT_Numerical's driving licence, by typing 'PWT_Numerical.info()' and 'PWT_Numerical.shape'. All is well, the data is in the 'float64' format, there are 42 columns and 3006 rows, the guy is cleared to go.

Once I have that nailed down, I mess around a bit with creating names for my cloned datasets. I practice with the following loop:

>> for i in range(42):
>>     print("Input_for_" + PWT_Numerical.iloc[:,i].name)

It yields a list of names for input databases in the various 'input <> output' configurations of my experiment with the PWT 9.1 dataset. The 'print' command shows 42 names: Input_for_rgdpe, Input_for_rgdpo, Input_for_pop etc.

In my next step, I want to make that outcome durable. The 'print' command just prints the output of the loop; it does not store it in any logical structure, so the output is gone as soon as it is printed. I create a loop that builds a list, this time with the names of output data frames:

>> Names_Output_Data=[] # Here, I create an empty list

>> for i in range(42): # I design the loop
>>     Name_Output_Data=PWT_Numerical.iloc[:,i].name # I generate the string which will fill the list up
>>     Names_Output_Data.append(Name_Output_Data) # I append the list with the name generated in the previous line

I check the result by typing the name of the list, 'Names_Output_Data', and executing (Shift + Enter in JupyterLab). It yields a full list, filled with column names from PWT_Numerical.
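For the record, the same list can be had in one line, since the column labels are already sitting in the data frame; a minimal equivalent:

>> Names_Output_Data = PWT_Numerical.columns.tolist()  # the same 42 names, no loop needed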

Now, I pass to designing my Markov chain of states, i.e. to making an intelligent structure which produces many alternative versions of itself and tests them for fitness to meet a pre-defined desired outcome. In my neophyte's logic, I see it as two loops, one inside the other.

The big, external loop is the one which clones the initial 'PWT_Numerical' into pairs of data frames of the style: 'input variables' plus 'output variable'. I make as many such cloned pairs as there are numerical variables in PWT_Numerical, i.e. 42. Thus, my loop opens up as 'for i in range(42):'. Inside each iteration of that loop, there is an internal loop of passing the input variables through a very simple perceptron, assessing the error in estimating the output variable, and then feeding the error forward. Now, I will present below the entire code for those two loops, and then discuss what works, what doesn't, and what I have no idea how to check whether it works or not. The code is syntactically correct in Python, i.e. it does not yield any error message when put to execution (Shift + Enter in JupyterLab, by the way). After I present the entire code, I will discuss its particular parts further below. Anyway, here it is:

>> List_of_Output_DB=[]
>> Names_Output_Data=[]
>> MEANS=[]
>> Source_means=np.array(PWT_Numerical.mean())
>> EUC=[]
>> for i in range(42):    # big external loop: one 'input <> output' mutation per variable
>>     Name_Output_Data=PWT_Numerical.iloc[:,i].name
>>     Names_Output_Data.append(Name_Output_Data)
>>     Output=pd.DataFrame(PWT_Numerical.iloc[:,i])
>>     Mean=Output.mean()
>>     MEANS.append(Mean)
>>     Input=pd.DataFrame(PWT_Numerical.drop(columns=[Name_Output_Data]))
>>     Input_STD=pd.DataFrame(Input/Input.max(axis=0))
>>     ER=[]
>>     Transformed=[]
>>     for j in range(30):    # internal loop: one experimental round per row
>>         Input_STD_randomized=Input_STD.iloc[j]*np.random.rand(41)
>>         Input_STD_summed=Input_STD_randomized.sum(axis=0)
>>         T=math.tanh(Input_STD_summed)
>>         D=1-(T**2)
>>         E=(Output.iloc[j]-T)*D
>>         E_vector=np.array(np.repeat(E,41))
>>         Next_row_with_error=Input_STD.iloc[j+1]+E_vector
>>         Next_row_DESTD=Next_row_with_error*Input.max(axis=0)
>>         ER.append(E)
>>         ERROR=pd.DataFrame(ER)
>>         Transformed.append(Next_row_DESTD)
>>     CLONE=pd.DataFrame(Transformed).mean()    # back at the external-loop level
>>     frames=[CLONE,MEANS[i]]
>>     CLONE_Means=np.array(pd.concat(frames))
>>     Euclidean=np.linalg.norm(Source_means-CLONE_Means)
>>     EUC.append(Euclidean)
>> print('Finished')    # outside both loops

Here is a link to my Python notebook with that code inside: http://localhost:8880/lab/tree/Practice%20Dec%208%202020.ipynb . Mind you, it is a localhost address from my own JupyterLab session, so it will most likely open only on my machine. I hope it works.

I start explaining this code casually, from its end. This is a little trick I discovered as regards looping on datasets. Looping takes time and energy. In my struggles to learn Python, I have already managed to make a loop which kept running for what felt like forever. I had called it over the whole index of the data frame, as 'for i in PWT.index:'. Strictly speaking, the index of a data frame is finite, so a for loop over it does stop by itself once the index is exhausted; the trouble is that with thousands of rows, and heavy work inside each pass, it can take ages, and a while loop whose condition never turns False, with no 'break' inside, really will loop over and over again.
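A quick sketch of that difference, just to have it in front of my eyes (assuming PWT is already loaded):

>> for i in PWT.index:            # finite: runs once per row, then stops on its own
>>     pass
>> j = 0
>> while j < len(PWT.index):      # a while loop stops only when its condition turns False (or on a break)
>>     j += 1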

Anyway, the trick. I put the command print('Finished') at the very end of the code, after all the loops. When the thing is done with being an intelligent structure at work, it simply prints 'Finished' in the next line. Among other things, it allows me to gauge the time it needs to deal with a given amount of data. As you might have already noticed, whilst I have a dataset with 3006 rows, I made the internal loop of the code go over just 30 rows: 'for j in range(30)'. The code took some 10 seconds in total to run the 42 rounds of the big loop ('for i in range(42)'), and then to loop over 30 rows of data inside each of them. That gives 42*30 = 1260 experimental rounds in roughly 10 seconds, thus something like 0.0079 seconds per round. If I took the full dataset of some 3000 rows, it would be like 42*3000*0.0079 ≈ 1000 seconds, i.e. roughly 16.7 minutes. Satanic. I like it.
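A minimal sketch of how I could time it more precisely, wrapping the two loops with the standard time module instead of eyeballing when 'Finished' shows up:

>> import time
>> start = time.perf_counter()
>> # ... the two loops from above go here ...
>> elapsed = time.perf_counter() - start
>> print('Finished in', round(elapsed, 2), 'seconds')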

Before opening each level of looping, I create empty lists. You can see:

>> List_of_Output_DB=[]
>> Names_Output_Data=[]
>> MEANS=[]
>> Source_means=np.array(PWT_Numerical.mean())
>> EUC=[]

… before I open the external loop, and…

>> ER=[]
>> Transformed=[]

… before the internal loop.

I noticed that I create those empty lists in a loop of sorts, essentially. This is more than just a play on words. When I code a loop, I have output of the loop. The loop does something, and as it does, I discover I want to store that particular outcome in some kind of repository vessel, so I go back to the lines of code before the loop opens and I add an empty list, just in case. I come up with a smart name for the list, e.g. MEANS, which stands for the mean values of the numerical variables, such as they are after being transformed by the perceptron. Mathematically, it is the most basic representation of the expected state in a particular transformation of the source dataset 'PWT'.

I code it as 'MEANS=[]', and, once I have done that, I add a mechanism of updating the list inside the loop. This, in turn, goes in two steps. First, I code the variable which should be stored in that list. In the case of 'MEANS', as this list is created before I open the big loop of 42 'input <> output' mutations, I append it in that loop. Logically, it must be appended with the mean expected value of the output variable in each instance of the big loop. I code it in the big loop, before opening the internal loop, as:

>> Output=pd.DataFrame(PWT_Numerical.iloc[:,i]) # Here, I define the data frame for the output variable in this specific instance of the big loop

>> Mean=Output.mean() # Now, I compute the value which will later be appended to the 'MEANS' list

>> MEANS.append(Mean) # Finally, I append the 'MEANS' list with the value generated in the previous line of the code.

It is a good thing for me to write about the things I do. I have just noticed that I use two different methods of storing partial outcomes of my loops. The first one is the one I have just presented. The second one is visible in the part of code presented below, included in the internal loop 'for j in range(number of rows experimented with)', which is range(30) in the run tested here.

In this situation, I need to store in some kind of repository the values of input variables transformed by the neural network, i.e. with the local error from each experimental round fed forward to the next experimental round. I need to store the way my data looks under each possible orientation of the intelligent structure I assume it represents. I denote that data under the general name 'Transformed', and, before opening the internal loop, still inside the big external loop, I define an empty list: 'Transformed=[]', which is supposed to contain those values I want.

In other words, when I structure the big external loop, I go like: 

# Step 1: for each variable in the dataset, i.e. 'for i in range(number of variables)', split the overall dataset into this variable as the output, in a separate data frame, and all the other variables grouped separately as input. These are the lines of code:

>> Output=pd.DataFrame(PWT_Numerical.iloc[:,i])  # I define the output variable

[…]

>> Input=pd.DataFrame(PWT_Numerical.drop(columns=[Name_Output_Data])) # I drop the output column from the entire dataset and I group the remaining columns as 'Input'

# Step 2: I standardise the input data by denominating it over the respective maximums for each variable:    

>> Input_STD=pd.DataFrame(Input/Input.max(axis=0))

# Step 3: I define, inside the big external loop and just before the internal one, containers for the data which I want to store from each round of the big loop:

>> ER=[] # This is the list of local errors generated by the perceptron when working with each 'input <> output' configuration

>> Transformed=[] # That's the container for the input data transformed by the perceptron

# Step 4: I open the internal loop, with ‘for j in range(number of rows to experiment with)’, and I start by coding the computational procedure of the perceptron:

>> Input_STD_randomized=Input_STD.iloc[j]*np.random.rand(41) # I weigh each empirical, standardised value in this specific row with a random weight

>> Input_STD_summed=Input_STD_randomized.sum(axis=0) # I sum the randomised values from that specific row of input. This line of code, together with the preceding one, is equivalent to the mathematical structure '∑x*random'.

>> T=math.tanh(Input_STD_summed) # I compute the hyperbolic tangent of the summed, randomised input data

>> D=1-(T**2) # I compute the local first derivative of the hyperbolic tangent

>> E=(Output.iloc[j]-T)*D # I compute the error, as: (expected output minus hyperbolic tangent of randomised input) times the local derivative of the hyperbolic tangent

>> E_vector=np.array(np.repeat(E,41)) # I create a NumPy array, with the error repeated as many times as there are input variables

>> Next_row_with_error=Input_STD.iloc[j+1]+E_vector # I feed the error forward. In the next experimental row 'j+1', the error from row 'j' is added to the value of each standardised input variable. This is probably the most elementary representation of learning: I include into my input for the next go the knowledge about what I f**ked up in the previous go. This line creates the transformed input data I want to store later on.
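Put as one formula, what those few lines compute for row j, with $x_{j,k}$ standing for the standardised inputs, $w_k$ for the random weights and $y_j$ for the pegged output, is:

$$E_j = \left( y_j - \tanh\Big(\sum_{k} w_k x_{j,k}\Big) \right)\left( 1 - \tanh^2\Big(\sum_{k} w_k x_{j,k}\Big) \right)$$

and that $E_j$ gets added to every standardised input in row j+1.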

# Step 5: I collect and store information about the things my perceptron did to input data in the given j-th round of the internal loop:

>> Next_row_DESTD=Next_row_with_error*Input.max(axis=0) # I destandardise the data transformed by the perceptron. It is like translating the work of the perceptron, which operates on standardised values, back into the measurement scale proper to each variable. In a sense, I deneuralise that data. 

>> ER.append(E) # I collect and store the error in the ER list

>> ERROR=pd.DataFrame(ER) # I transform the ER list into a data frame, which I name 'ERROR'. I do it a few times with different data, and, quite honestly, I do it intuitively. I already know that data frames defined in Pandas are somewhat handier to do statistics with than lists defined in basic Python. Just as honestly: I know too little yet about programming to tell whether this turn of code makes sense at all.

>> Transformed.append(Next_row_DESTD) # I collect and store the destandardised, transformed input data in the 'Transformed' list.

# Step 6: I step out of the internal loop, back into the big external loop, and I start putting some order in the data I generated and collected. In my code, the lines presented below move back by one indent: they sit outside the internal loop, yet still inside the big external loop. Only the final print('Finished') stands at the left margin, outside both loops.

>> CLONE=pd.DataFrame(Transformed).mean() # Once the internal loop has run its course, I transform the 'Transformed' list into a data frame and take its means. Same procedure as two lines of code earlier, only now I know why I do it. I intend to put together the mean values of the destandardised input with the mean value of the output, and I am going to do it by concatenation of data frames.

>> frames=[CLONE,MEANS[i]] # I define the data frames for concatenation. I put together the mean values of the input variables, generated in this specific, i-th round of the big external loop, with the mean value of the output variable corresponding to the same i-th round.

>> CLONE_Means=np.array(pd.concat(frames)) # I concatenate the data I defined in the previous line: 41 input means followed by the output mean.

>> Euclidean=np.linalg.norm(Source_means-CLONE_Means) # Something I need for my science. I estimate the mathematical similarity between the source dataset 'PWT_Numerical' and the dataset created by the perceptron in the given i-th round of the big external loop. I do it by computing the Euclidean distance between the respective vectors of mean expected values in this specific pair of datasets, i.e. the pair 'source vs i-th clone'.

>> EUC.append(Euclidean) # I collect and store the information generated in the 'Euclidean' line. I store it in the EUC list, which I opened as empty before starting the big external loop.
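Once the big external loop has finished, a minimal way to see which output variable produces the clone closest to the source, assuming the lists filled up as above, is to pair the names with the distances and sort:

>> Results = pd.DataFrame({'Output_variable': Names_Output_Data, 'Euclidean_distance': EUC})
>> print(Results.sort_values('Euclidean_distance'))  # the smallest distance marks the clone closest to the source dataset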
