One step out of the cavern

I have made one more step forward in my learning of programming. I have finally learnt at least one method of standardising numerical values in a dataset. In a moment, I will show exactly which method I nailed down. First, I want to share a thought of a more general nature. I learn programming in order to enrich my research on the application of artificial intelligence to simulating collective intelligence in human societies. I have already discovered the importance of libraries, i.e. ready-made pieces of code which can be called with a simple command and which spare me the many lines of code I would otherwise have to write laboriously. I mean libraries such as NumPy, Pandas, Math etc. It is very similar to human consciousness: using pre-constructed cognitive structures, i.e. language and culture, is a turbo boost for whatever we do as a civilisation.

Anyway, I kept working with the dataset which I have already mentioned in my earlier updates, namely a version of Penn World Table 9.1, cleaned of all the rows with empty cells [see: Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), "The Next Generation of the Penn World Table", American Economic Review, 105(10), 3150-3182, www.ggdc.net/pwt]. I started by creating an online notebook at JupyterLab (https://jupyter.org/try), with Python 3 as its kernel. Then I imported what I needed from Python in terms of ready-cooked culture, i.e. I went:

>> import numpy as np

>> import pandas as pd

>> import os

I uploaded the 'PWT 9_1 no empty cells.csv' file from my computer, and, just in case, I checked its presence in the working directory with >> os.listdir(). Then I read the contents of the file into a Pandas data frame, which spells: PWT = pd.DataFrame(pd.read_csv('PWT 9_1 no empty cells.csv')). Worked.
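Put together, and stripped of the word-processor quotation marks that would trip Python up, the loading step looks more or less like this (a minimal sketch; the extra pd.DataFrame() wrapper is optional, since read_csv() already returns a data frame):

>> import pandas as pd
>> import os
>> # check that the uploaded file really sits in the working directory
>> os.listdir()
>> # read_csv() already returns a DataFrame, so the extra wrapper is not strictly needed
>> PWT = pd.read_csv('PWT 9_1 no empty cells.csv')
>> PWT.shape  # a quick look at the number of rows and columns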

In my next step, as I planned to mess about a bit with the columns of that dataset, I typed: PWT.columns. The thing nicely gave me back the column labels, i.e. literally a list of names in quotation marks ['']. I used it to create a list of the columns with numerical values, and therefore the most interesting to me. I went:

>> Variables=['rgdpe', 'rgdpo', 'pop', 'emp', 'emp / pop', 'avh',
       'hc', 'ccon', 'cda', 'cgdpe', 'cgdpo', 'cn', 'ck', 'ctfp', 'cwtfp',
       'rgdpna', 'rconna', 'rdana', 'rnna', 'rkna', 'rtfpna', 'rwtfpna',
       'labsh', 'irr', 'delta', 'xr', 'pl_con', 'pl_da', 'pl_gdpo', 'csh_c',
       'csh_i', 'csh_g', 'csh_x', 'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g',
       'pl_x', 'pl_m', 'pl_n', 'pl_k']

The 'Variables' list served me to make a purely numerical mutation of my dataset, namely: PWTVar=pd.DataFrame(PWT[Variables]).
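As a side note, Pandas can do that filtering by itself. The sketch below is an alternative I did not actually use, and it assumes the cleaned CSV gives those columns proper numeric dtypes; it will also pick up purely technical numeric columns such as 'year', so the result may need pruning:

>> # keep only the columns Pandas recognises as numeric
>> PWTVar_alt = PWT.select_dtypes(include='number')
>> PWTVar_alt.columns  # compare with the hand-made 'Variables' list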

I generated the fixed components of standardisation in my data, i.e. maximums, means, and standard deviations across columns in PWTVar. It looked like this: 

>> Maximums=PWTVar.max(axis=0)

>> Means=PWTVar.mean(axis=0)

>> Deviations=PWTVar.std(axis=0)

The 'axis=0' part means that those statistics are computed down each column, i.e. I get one maximum, one mean and one standard deviation per column, not per row. Once that was done, I made my two standardisations of the data from PWTVar, namely: a) standardisation over maximums, as in s(x) = x / max(x), and b) standardisation by mean-reversion, where s(x) = [x – avg(x)] / std(x). I did it as:

>> Standardized=pd.DataFrame(PWTVar/Maximums)

>> MR=pd.DataFrame((PWTVar-Means)/Deviations)

I relied here on a built-in feature of Pandas, namely that it broadcasts such operations over the whole data frame, treating it like a matrix. When, for example, I subtract 'Means' from 'PWTVar', the one-row vector of 'Means' gets subtracted from each of the 3005 rows of 'PWTVar', and so on. I checked those two data frames with commands such as df.describe(), df.shape, and df.info(), just to make sure they are what I think they are. They are, indeed.
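To make that broadcasting visible, here is a toy sketch of my own, not part of the workflow above: on a tiny data frame, the same mean-reversion gives columns with a mean of roughly 0 and a standard deviation of roughly 1, which is also a handy sanity check for the full 'MR' data frame.

>> toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
>> toy_mr = (toy - toy.mean(axis=0)) / toy.std(axis=0)
>> toy_mr.mean(axis=0)  # both columns: approximately 0
>> toy_mr.std(axis=0)   # both columns: approximately 1
>> # the same check, run on the real thing
>> MR.mean(axis=0)
>> MR.std(axis=0)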

Standardisation allowed me to step out of my cavern, in terms of programming artificial neural networks. The next step I took was to split my numerical dataset PWTVar into one output variable, on the one hand, and all the other variables grouped as input, on the other hand. As output, I took a variable which, as I have already found out in my research, is extremely important in social change seen through the lens of Penn World Table 9.1: 'avh', i.e. the average number of hours worked per person per year. I did:

>> Output_AVH=pd.DataFrame(PWTVar['avh'])

>> Input_dict=['rgdpe', 'rgdpo', 'pop', 'emp', 'emp / pop', 'hc', 'ccon', 'cda',
        'cgdpe', 'cgdpo', 'cn', 'ck', 'ctfp', 'cwtfp', 'rgdpna', 'rconna',
        'rdana', 'rnna', 'rkna', 'rtfpna', 'rwtfpna', 'labsh', 'irr', 'delta',
        'xr', 'pl_con', 'pl_da', 'pl_gdpo', 'csh_c', 'csh_i', 'csh_g', 'csh_x',
        'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g', 'pl_x', 'pl_m', 'pl_n',
        'pl_k']

# As you can see, 'avh' is absent from the 'Input_dict' list

>> Input = pd.DataFrame(PWT[Input_dict])
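A shorter route to the same 'Input' data frame, sketched here as an alternative rather than what I actually typed, is to drop the output column from PWTVar instead of retyping the whole list:

>> # everything in PWTVar except the output variable becomes the input
>> Input_alt = PWTVar.drop(columns=['avh'])
>> Input_alt.shape  # should show 41 columns, i.e. all of PWTVar minus 'avh'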

The last thing that worked, in this episode of my learning, was to multiply the ‘Input’ dataset by a matrix of random float values generated with NumPy:

>> Randomized_input=pd.DataFrame(Input*np.random.rand(3006,41)) 

## Gives an entire Data Frame of randomized values
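Two small touches, which are my own additions and not part of what I ran above: seeding NumPy's random generator makes the randomisation reproducible, and reading the dimensions from 'Input' itself avoids hard-coding the shape of the random matrix:

>> np.random.seed(0)  # fixes the random draw, so the run can be repeated exactly
>> rows, cols = Input.shape  # take the dimensions from the data frame itself
>> Randomized_input = pd.DataFrame(Input * np.random.rand(rows, cols))
>> Randomized_input.describe()  # quick look at the randomized values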
