It works again

I intend to work on iterations. My general purpose in learning to program in Python is to create my own artificial neural network algorithms, in line with what I have already done in that respect using just Excel. Iteration is the essence of artificial intelligence, to the extent that the latter manifests as an intelligent structure producing many alternative versions of itself. ‘Many’ means one at a time, over many repetitions.

When I run my neural networks in Excel, they do a finite number of iterations. That would be definite iteration in Python, thus the structure built around the ‘for’ statement. I am helping myself with the tutorial available at https://realpython.com/python-for-loop/. Still, as programming is supposed to enlarge my Excel-forged intellectual horizons, I also want to understand and practice the ‘while’ loop in Python, thus indefinite iteration (https://realpython.com/python-while-loop/).

Anyway, programming a loop is very different from dragging a formula over multiple rows of an Excel sheet. The latter simply makes a formula repeat over many rows, whilst the former requires defining the exact operation to iterate, the input domain which the iteration takes as data, and the output dataset that stores the outcome of the iteration.
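To make that contrast tangible, here is a minimal, purely illustrative sketch of both kinds of loop in Python: a ‘for’ loop with an explicit input list and an output list, and a ‘while’ loop that repeats until a condition is met (the numbers and the stopping rule are made up):

# definite iteration: the input domain is known in advance
inputs = [2.0, 4.0, 8.0]              # input domain the iteration takes as data
outputs = []                          # output dataset storing the outcome
for x in inputs:
    outputs.append(x / max(inputs))   # the exact operation to iterate

print(outputs)                        # [0.25, 0.5, 1.0]

# indefinite iteration: repeat until a condition is satisfied, not a fixed number of times
error = 1.0
steps = 0
while error > 0.01:                   # keep going until the error is small enough
    error = error / 2                 # stand-in for one training step of a network
    steps += 1

print(steps, error)                   # 7 0.0078125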

It is time, therefore, to describe exactly the iteration I want to program in Python. As a matter of fact, we are talking about a few different iterations. The first one is the standardisation of my source data. I can use two different ways of standardising it, depending on the neural activation function I use. The baseline method is to standardise each variable over its maximum; it fits every activation function I use. The standardised value of x, AKA s(x), is calculated as s(x) = x / max(x).

If I focus just on the hyperbolic tangent as activation function, I can use the first method, or I can standardise by mean-reversion, where s(x) = [x – avg(x)] / std(x). In a first step, I subtract from x the average expected value of x – this is the [x – avg(x)] expression – and then I divide the resulting difference by the standard deviation of x, or std(x).

The essential difference between those two modes of standardisation is the range of standardised values. When denominated in units of max(x), standardised values fall in the range 0 ≤ s(x) ≤ 1. When I standardise by mean-reversion, s(x) is centred on 0: observations below the mean become negative, observations above it become positive, and they are not strictly confined to the [-1, 1] interval.
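Just to fix both formulas in my mind, here is a minimal sketch of the two standardisations on a made-up pandas Series (the numbers are purely illustrative):

import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])      # illustrative data

# standardisation over the maximum: s(x) = x / max(x)
s_max = x / x.max()
print(s_max.min(), s_max.max())                # 0.2 1.0 -> stays within [0, 1]

# standardisation by mean-reversion: s(x) = [x - avg(x)] / std(x)
s_mr = (x - x.mean()) / x.std()
print(round(s_mr.mean(), 10), s_mr.std())      # 0.0 1.0 -> centred on zero, unit spread
print(s_mr.min(), s_mr.max())                  # about -1.26 and 1.26 -> can stray beyond ±1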

The piece of programming I start this specific learning with consists in transforming my source Data Frame ‘df’ into its standardised version ‘s_df’ by dividing the values in each column of df by that column’s maximum. As I think about all that, I recall something I have recently learnt, namely that operations on NumPy arrays are, in Python, much faster than the same operations on data frames built with Pandas. I check whether I can make a Data Frame out of an imported CSV file, and then turn it into a NumPy array.

Let’s waltz. I start by opening JupyterLab at https://hub.gke2.mybinder.org/user/jupyterlab-jupyterlab-demo-nocqldur/lab and creating a notebook with Python 3 as its kernel. Then, I import the libraries which I expect to use one way or another: NumPy, Pandas, Matplotlib, OS, and Math. In other words, I do:

>> import numpy as np

>> import pandas as pd

>> import matplotlib.pyplot as plt

>> import math

>> import os

Then, I upload a CSV file and I import it into a Data Frame. It is a database I used in my research on cities and urbanisation, its name is ‘DU_DG database.csv’, and, as it is transformed from an Excel file, I take care to specify that the separators are semicolons.

>> DU_DG=pd.DataFrame(pd.read_csv('DU_DG database.csv', sep=';'))

The resulting Data Frame is structured as:

Index(['Index', 'Country', 'Year', 'DU/DG', 'Population',
       'GDP (constant 2010 US$)', 'Broad money (% of GDP)',
       'urban population absolute',
       'Energy use (kg of oil equivalent per capita)', 'agricultural land km2',
       'Cereal yield (kg per hectare)'],
      dtype='object')

The import being successful (I just check with the commands ‘DU_DG.shape’ and ‘DU_DG.head()’), I try to create a NumPy array. Of course, there is not much sense in translating names of countries and labels of years into a NumPy array. I try to select the numerical columns ‘DU/DG’, ‘Population’, ‘GDP (constant 2010 US$)’, ‘Broad money (% of GDP)’, ‘urban population absolute’, ‘Energy use (kg of oil equivalent per capita)’, ‘agricultural land km2’, and ‘Cereal yield (kg per hectare)’, by commanding:

>> DU_DGnumeric=np.array(DU_DG['DU/DG','Population','GDP (constant 2010 US$)','Broad money (% of GDP)','urban population absolute','Energy use (kg of oil equivalent per capita)','agricultural land km2','Cereal yield (kg per hectare)'])

The answer I get from Python 3 is a gentle ‘f**k you!’, i.e. an elaborate error message. 

KeyError                                  Traceback (most recent call last)

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)

   2656             try:

-> 2657                 return self._engine.get_loc(key)

   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('DU/DG', 'Population', 'GDP (constant 2010 US$)', 'Broad money (% of GDP)', 'urban population absolute', 'Energy use (kg of oil equivalent per capita)', 'agricultural land km2', 'Cereal yield (kg per hectare)')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)

<ipython-input-18-e438a5ba1aa2> in <module>

----> 1 DU_DGnumeric=np.array(DU_DG['DU/DG','Population','GDP (constant 2010 US$)','Broad money (% of GDP)','urban population absolute','Energy use (kg of oil equivalent per capita)','agricultural land km2','Cereal yield (kg per hectare)'])

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)

   2925             if self.columns.nlevels > 1:

   2926                 return self._getitem_multilevel(key)

-> 2927             indexer = self.columns.get_loc(key)

   2928             if is_integer(indexer):

   2929                 indexer = [indexer]

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)

   2657                 return self._engine.get_loc(key)

   2658             except KeyError:

-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))

   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('DU/DG', 'Population', 'GDP (constant 2010 US$)', 'Broad money (% of GDP)', 'urban population absolute', 'Energy use (kg of oil equivalent per capita)', 'agricultural land km2', 'Cereal yield (kg per hectare)')

Didn’t work, obviously: with single square brackets, pandas treats that whole series of names as one tuple-shaped key and looks for a single column labelled with it, hence the KeyError. I try something else. I proceed in two steps. First, I create a second Data Frame out of the numerical columns of DU_DG. I go:

>> DU_DGNumCol=pd.DataFrame(DU_DG.columns['DU/DG', 'Population','GDP (constant 2010 US$)', 'Broad money (% of GDP)','urban population absolute','Energy use (kg of oil equivalent per capita)', 'agricultural land km2','Cereal yield (kg per hectare)'])

Python seems to have accepted the command without reservation, and yet something strange happens. Informative commands about that second Data Frame, i.e. DU_DGNumCol, such as ‘DU_DGNumCol.head()’, ‘DU_DGNumCol.shape’ or ‘DU_DGNumCol.info()’, don’t work, as if DU_DGNumCol had no structure at all.

Cool. I investigate. I want to check how Python sees the data in my DU_DG data frame. I do ‘DU_DG.describe()’ first, and, to my surprise, I can see descriptive statistics just for the columns ‘Index’ and ‘Year’. The legitimate ‘WTF?’ question pushes me to type ‘DU_DG.info()’ and here is what I get:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 896 entries, 0 to 895

Data columns (total 11 columns):

Index                                           896 non-null int64

Country                                         896 non-null object

Year                                            896 non-null int64

DU/DG                                           896 non-null object

Population                                      896 non-null object

GDP (constant 2010 US$)                         896 non-null object

Broad money (% of GDP)                          896 non-null object

urban population absolute                       896 non-null object

Energy use (kg of oil equivalent per capita)    896 non-null object

agricultural land km2                           896 non-null object

Cereal yield (kg per hectare)                   896 non-null object

dtypes: int64(2), object(9)

memory usage: 77.1+ KB

I think I understand. My numerical data has been imported as the generic ‘object’ type, and I want it to be float values. Once again, I get the same valuable lesson: before I do anything with my data in Python, I need to check and curate it. It is strangely connected to my theory of collective intelligence. Our human perception accepts empirical experience for further processing, especially for collective processing at the level of culture, only if said experience has the right form. We tend to ignore phenomena which manifest in a form we are not used to processing cognitively.
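Before I jump to another dataset, here is a minimal sketch, purely my own assumption at this point, of how those columns could be both selected correctly (a list of labels inside the brackets, rather than a bare tuple) and coerced from ‘object’ to float with pandas’ to_numeric; the replacement of commas with dots is only needed if the Excel export used decimal commas, and DU_DGnum is just a name I am making up for illustration:

num_cols = ['DU/DG', 'Population', 'GDP (constant 2010 US$)',
            'Broad money (% of GDP)', 'urban population absolute',
            'Energy use (kg of oil equivalent per capita)',
            'agricultural land km2', 'Cereal yield (kg per hectare)']

# select the numeric columns with a list of labels, not a bare tuple
DU_DGnum = DU_DG[num_cols].copy()

# coerce each column from 'object' to float; errors='coerce' turns unparsable cells into NaN
for col in num_cols:
    DU_DGnum[col] = pd.to_numeric(DU_DGnum[col].astype(str).str.replace(',', '.'),
                                  errors='coerce')

DU_DGnum.info()   # the columns should now show up as float64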

Out of sheer curiosity, I take another dataset and I repeat the whole sequence of import from CSV and definition of data type. This time, I take a reduced version of Penn World Table 9.1. The full citation due in this case is: Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), “The Next Generation of the Penn World Table”, American Economic Review, 105(10), 3150-3182, available for download at www.ggdc.net/pwt. The ‘reduced’ part means that I took out of the database all the rows (i.e. country-year observations) with at least one empty cell. I go:

>> PWT=pd.DataFrame(pd.read_csv('PWT 9_1 no empty cells.csv'))

…aaaaand it lands. Import successful. I test the properties of the PWT data frame:

>> PWT.info()

yields:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3006 entries, 0 to 3005

Data columns (total 44 columns):

country      3006 non-null object

year         3006 non-null int64

rgdpe        3006 non-null float64

rgdpo        3006 non-null float64

pop          3006 non-null float64

emp          3006 non-null float64

emp / pop    3006 non-null float64

avh          3006 non-null float64

hc           3006 non-null float64

ccon         3006 non-null float64

cda          3006 non-null float64

cgdpe        3006 non-null float64

cgdpo        3006 non-null float64

cn           3006 non-null float64

ck           3006 non-null float64

ctfp         3006 non-null float64

cwtfp        3006 non-null float64

rgdpna       3006 non-null float64

rconna       3006 non-null float64

rdana        3006 non-null float64

rnna         3006 non-null float64

rkna         3006 non-null float64

rtfpna       3006 non-null float64

rwtfpna      3006 non-null float64

labsh        3006 non-null float64

irr          3006 non-null float64

delta        3006 non-null float64

xr           3006 non-null float64

pl_con       3006 non-null float64

pl_da        3006 non-null float64

pl_gdpo      3006 non-null float64

csh_c        3006 non-null float64

csh_i        3006 non-null float64

csh_g        3006 non-null float64

csh_x        3006 non-null float64

csh_m        3006 non-null float64

csh_r        3006 non-null float64

pl_c         3006 non-null float64

pl_i         3006 non-null float64

pl_g         3006 non-null float64

pl_x         3006 non-null float64

pl_m         3006 non-null float64

pl_n         3006 non-null float64

pl_k         3006 non-null float64

dtypes: float64(42), int64(1), object(1)

memory usage: 1.0+ MB

>> PWT.describe()

gives nice descriptive statistics. This dataset has been imported in the format I want. I do the same thing I attempted with the DU_DG dataset: I try to convert it into a NumPy array and to check the shape obtained. I do:

>> PWTNumeric=np.array(PWT)

>> PWTNumeric.shape

I get (3006, 44), i.e. 3006 rows by 44 columns. As the ‘country’ column is text, NumPy stores the whole array with the generic object dtype, which is worth keeping in mind if speed is the point of the conversion.

I try to wrap my mind around standardising values in PWT. I start gently. I slice one column out of PWT, namely the AVH variable, which stands for the average annual number of hours worked per person engaged. I do:

>> AVH=pd.DataFrame(PWT['avh'])

>> stdAVH=pd.DataFrame(AVH/AVH.max())

Apparently, it worked. I check with ‘stdAVH.describe()’ and I get a nice distribution of values between 0 and 1. 

I do the same thing with mean-reversion. I create the ‘mrAVH’ data frame according to the s(x) = [x – avg(x)]/std(x) drill. I do:

 >> mrAVH=pd.DataFrame((AVH-AVH.mean())/AVH.std())

…and I get a nice distribution of mean reverted values. 

Cool. Now, it is time to try and iterate the same standardisation over many columns of the same Data Frame. I have already rummaged a bit and apparently it is not going to be as simple as in Excel. It usually isn’t.
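Still, to have something to aim at, here is a minimal sketch of how that iteration could look, assuming I loop over the numeric columns of PWT with a ‘for’ loop and standardise each one over its maximum; the names numeric_cols and stdPWT are just mine, for illustration:

# keep only numeric columns, so that the 'country' column (text) stays out of the standardisation
numeric_cols = PWT.select_dtypes(include='number').columns

stdPWT = pd.DataFrame()               # output data frame for the standardised values
for col in numeric_cols:              # definite iteration: one column at a time
    stdPWT[col] = PWT[col] / PWT[col].max()

stdPWT.describe()                     # each column should now range between 0 and 1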

That would be all in this update. A short summary is due. It works again. I mean, learning something and keeping a journal of how exactly I learn it, that thing works. I feel that special vibe, like ‘What the hell, even if it sucks, it is interesting’. Besides the technical details of programming, I have already learnt two big things about data analysis in Python. Firstly, however comfortable it is to use libraries such as NumPy or Pandas, being really efficient requires understanding the small details at the very basic level: conversion of data types and, as a matter of fact, the practical workability of different data types, selection of values in a dataset by row and by column, iteration over rows and columns, etc. Secondly, once again, data works well in Python when it has been properly curated prior to analysis. Learning quick, algorithmic ways to curate that data, without having to do it manually in Excel, is certainly an asset I need to acquire.
