Mathematical distance

I keep learning Python for data analysis. I have a few thoughts on what I have learnt so far, and a new challenge, namely to repeat the same exercise with another source of data: the World Economic Outlook database, published by the International Monetary Fund (https://www.imf.org/en/Publications/WEO/weo-database/2020/October). My purpose is to use that data the same way I used Penn Tables 9.1 (see ‘Two loops, one inside the other’, for example), namely to run it through a digital intelligent structure consisting of a double algorithmic loop.

First things first, I need to do what I promised in ‘Two loops, one inside the other’, that is to test the cognitive value of the algorithm I presented there. By the way, as I keep working with algorithms known as ‘artificial intelligence’, I am more and more convinced that the term ‘artificial neural networks’ is not really appropriate. I think that talking about artificial intelligent structures is much closer to reality. I get it: giving the name of ‘neurons’ to particular fragments of the algorithm reflects some of the properties of biological neurons. Yet the actual neurons of a digital technology are the micro-transistors in the CPU or in the GPU. Yes, micro-transistors do what neurons do in our brain: they fire conditionally and thus produce signals. Algorithms of AI can be run on any computer with the proper software. AI is software, not hardware.

Yes, I know I’m ranting. This is how I gather intellectual speed for my writing. Learning to program in Python has led me to a few realizations about the digital intelligent structures I work with as simulators of collective intelligence in human societies. Algorithms are different from equations in the sense that algorithms do things, whilst equations represent things. When I want an algorithm to do the things represented with equations, I need functional coherence between commands. A command needs data to work on, and it is a good thing if the data it puts out can be used further down the chain. A chain of commands is functional when earlier commands give accurate input to later commands, and when the final output of the last command can be properly stored and utilized. Equations, on the other hand, don’t need data to work, because equations don’t work. They just are.

I can say my equations are fine when they make sense logically. With an algorithm, on the other hand, I can be sure it works the way it is supposed to only when I can prove its functionality empirically, by testing it. Hence, I need a method of testing, and I need to be sure that the method itself is robust. Now I understand why, in all the Python tutorials I could find, there is that ‘print(output)’ command at the end of each algorithm. Whatever the output is, printing it, i.e. displaying it on the screen, is the most elementary method of checking whether that output is what I expect it to be. By the way, I have made my own little discovery about the usefulness of the ‘print()’ command. Looping algorithms are, by nature, prone to looping forever if the range of iterations is not properly defined, so I put ‘print('Finished')’ at the very end of the code. When I see ‘Finished’ printed in the line below, I can be sure the thing has done the work it was supposed to do.
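Schematically, that little trick looks more or less like this; the loop and the variable names are purely illustrative, not taken from my actual algorithm >>

for i in range(10): # a loop with an explicitly defined, finite range of iterations
    result = i**2 # whatever work the loop is supposed to do
    print(result) # the elementary check: display the output on the screen

print('Finished') # the sentinel line: if this gets printed, the loop has run to completion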

Good, I was supposed to write about testing my algorithm. How do I test? I start by taking small pieces of the algorithm and checking the kind of output they give. By doing that, I modified the algorithm from ‘Two loops, one inside the other’ into the form you can see below:

That’s the preliminary part: importing libraries and data for analysis >>

In [1]: import numpy as np

   …: import pandas as pd

   …: import os

   …: import math

In [2]: PWT=pd.DataFrame(pd.read_csv('PWT 9_1 no empty cells.csv',header=0)) # PWT 9_1 no empty cells.csv is a CSV version of the database I made with non-empty observations in the Penn Tables 9.1 database

Now, I extract the purely numerical columns into another data frame, which I label ‘PWT_Numerical’.

In [3]: Variables=['rgdpe', 'rgdpo', 'pop', 'emp', 'emp / pop', 'avh',

   …:        'hc', 'ccon', 'cda', 'cgdpe', 'cgdpo', 'cn', 'ck', 'ctfp', 'cwtfp',

   …:        'rgdpna', 'rconna', 'rdana', 'rnna', 'rkna', 'rtfpna', 'rwtfpna',

   …:        'labsh', 'irr', 'delta', 'xr', 'pl_con', 'pl_da', 'pl_gdpo', 'csh_c',

   …:        'csh_i', 'csh_g', 'csh_x', 'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g',

   …:        'pl_x', 'pl_m', 'pl_n', 'pl_k']

In [4]: PWT_Numerical=pd.DataFrame(PWT[Variables])

My next step is to practice with creating collections out of the column names in my data frame; for now, I extract those names into a list:

In [5]: Names_Output_Data=[]

   …: for i in range(42):

   …:     Name_Output_Data=PWT_Numerical.iloc[:,i].name

   …:     Names_Output_Data.append(Name_Output_Data)
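As a side note, I suppose the same list of names could be obtained without the loop, in one line, something like >>

Names_Output_Data = list(PWT_Numerical.columns) # the column names of the data frame, as a plain Python list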

Now, I start coding the intelligent structure, and I begin by defining empty lists, to store the data which the intelligent structure produces.

In [6]: ER=[]

   …: Transformed=[]

   …: MEANS=[]

   …: EUC=[]

I define an important numerical array in NumPy: the vector of mean expected values in the variables of PWT_Numerical.

   …: Source_means=np.array(PWT_Numerical.mean())

I open the big external loop of my intelligent structure. This loop is supposed to produce as many alternative intelligent structures as there are variables in my PWT_Numerical data frame.

   …: for i in range(42):

   …:     Name_Output_Data=PWT_Numerical.iloc[:,i].name

   …:     Names_Output_Data.append(Name_Output_Data)

   …:     Output=pd.DataFrame(PWT_Numerical.iloc[:,i]) # I make an output data frame

   …:     Mean=Output.mean()

   …:     MEANS.append(Mean) # I store the expected mean of each output variable in a separate list.

   …:     Input=pd.DataFrame(PWT_Numerical.drop(Output,axis=1)) # I make an input data frame, coupled with output

   …:     Input_STD=pd.DataFrame(Input/Input.max(axis=0)) # I standardize input data over the respective maximum of each variable

   …:     Input_Means=pd.DataFrame(Input.mean()) # I prepare two data frames sort of for later: one with the vector of means…

   …:     Input_Max=pd.DataFrame(Input.max(axis=0)) #… and another one with the vector of maximums

Now, I put in motion the intelligent structure strictly speaking: a simple perceptron, which…

   …:     for j in range(10): # … is short, for testing purposes, just 10 rows in the source data

   …:         Input_STD_randomized=np.array(Input_STD.iloc[j])*np.random.rand(41) #… sprays the standardized input data with random weights

   …:         Input_STD_summed=Input_STD_randomized.sum(axis=0) # … and then sums up sort of ∑(input variable *random weight).

   …:         T=math.tanh(Input_STD_summed) # …computes the hyperbolic tangent of summed randomized input. This is neural activation.

   …:         D=1-(T**2) # …computes the local first derivative of that hyperbolic tangent

   …:         E=(Output.iloc[j]-T)*D # … computes the local error of estimating the value of output variable, with input data neural-activated with the function of hyperbolic tangent

   …:         E_vector=np.array(np.repeat(E,41)) # I spread the local error into a vector to feed forward

   …:         Next_row_with_error=Input_STD.iloc[j+1]+E_vector # I feed the error forward

   …:         Next_row_DESTD=Next_row_with_error*Input.max(axis=0) # I destandardize

   …:         ER.append(E) # I store local errors in the list ER

   …:         ERROR=pd.DataFrame(ER) # I make a data frame out of the list ER

   …:         Transformed.append(Next_row_with_error) # I store the input values transformed by the perceptron (through the forward feed of error), in the list Transformed

   …:     TR=pd.DataFrame(Transformed) # I turn the Transformed list into a data frame

   …:     MEAN_TR=pd.DataFrame(TR.mean()) # I compute the mean values of transformed input and store them in a data frame. They are still mean values of standardized data.

   …:     MEAN_TR_DESTD=pd.DataFrame(MEAN_TR*Input_Max) # I destandardise

   …: MEANS_DF=pd.DataFrame(MEANS)

   …: print(MEANS)

   …: print('Finished')

The general problem which I encounter with that algorithm is essentially that of reading the output correctly and utilizing it, or, at least, this is how I understand that problem. First, let me restate the general hypothesis which I want to test and explore with the algorithm presented above. Here it comes: for a given set of phenomena, informative about the state of a human social structure, and observable as a dataset of empirical values in numerical variables, there is a subset of variables which informs about the ethical and functional orientation of that human social structure; that orientation manifests itself as the relatively least significant transformation which the original dataset needs to undergo in order to minimize the error of estimating the orientation-informative variable as output, when the remaining variables are used as input.

When the empirical dataset in question is used as a training set for an artificial neural network of the perceptron type, i.e. a network which tests for the values of input variables that minimize the error of estimating the output variable, such neural testing transforms the original dataset into a specific version thereof. I want to know how far away from the original empirical dataset that specific transformation, oriented on a specific output, goes. I measure that mathematical distance as the Euclidean distance between the vector of mean expected values in the transformed dataset and the corresponding vector in the original one.
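Once I have those two vectors of means, the distance itself is a one-liner in NumPy. Here is a minimal sketch, with purely illustrative numbers standing in for the actual mean values >>

import numpy as np

means_original = np.array([2.5, 1.3, 0.8]) # vector of mean expected values in the original input data (illustrative)
means_transformed = np.array([2.9, 1.1, 0.95]) # vector of mean expected values in the transformed data (illustrative)

euclidean_distance = np.linalg.norm(means_transformed - means_original) # the Euclidean distance between the two vectors
print(euclidean_distance)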

Therefore, I need two data frames in Pandas, or two arrays in NumPy: one containing the mean expected values of the original input data, the other storing the mean expected values of the transformed dataset. Here is where my problems start with the algorithm presented above. The ‘TR’ data frame has a different shape and structure than the ‘Input’ data frame from which, technically, it is derived. The ‘Input’ data frame has 41 columns, whilst ‘TR’ has 42. Besides, one column from ‘Input’, namely ‘rgdpe’, AKA expenditure-side real GDP, moves from being the first column in ‘Input’ to being the last column in ‘TR’. For the moment, I have no clue what’s going on at that level. I even checked the algorithm with a debugger, available with the integrated development environment called Spyder (https://www.spyder-ide.org). Technically, as far as the grammar of Python is concerned, the algorithm is OK. Still, it produces vectors of mean expected values in transformed data different from what I expect. I don’t even know where to start looking for a solution.
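For now, all I can do is compare the shape and the column order of both data frames, to see exactly where they diverge. A minimal check, run right after the algorithm above has finished >>

print(Input.shape, TR.shape) # the number of rows and columns in each data frame
print(list(Input.columns)) # the order of variables in the input data
print(list(TR.columns)) # the order of variables in the transformed data; this is where 'rgdpe' shows up last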

There is one more thing I want to include in this algorithm, which I have already been doing in Excel. At each row of transformed data, thus at each ‘Next_row_with_error’, I want to add a mirroring row which contains, for each individual variable, its mean Euclidean distance to all the remaining variables. It is a measure of internal coherence in the process of learning through trial and error, and I already know, having learnt it by trial and error, that including that specific metric, and feeding it forward together with the error, changes a lot in the way a perceptron learns.
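If I translate my Excel routine correctly, within a single row the Euclidean distance between two variables boils down to the absolute difference between their values, and the metric I am after is the average of those differences for each variable. A minimal sketch of that computation, on one illustrative row of standardized data >>

import numpy as np

row = np.array([0.8, 0.5, 0.9, 0.3]) # one row of standardized input data (illustrative values)

# for each variable in the row, the mean Euclidean distance to all the remaining variables
mean_distances = np.array([
    np.mean(np.abs(np.delete(row, i) - row[i])) # distances from variable i to every other variable, averaged
    for i in range(len(row))
])
print(mean_distances)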
