I continue the fundamental cleaning in my head, as the year 2020 draws to its end. What do I want? Firstly, I want to exploit and develop my hypothesis of collective intelligence in human societies, and I want to develop my programming skills in Python. Secondly, I want to develop my skills and my position as a facilitator and manager of research projects at the frontier between the academic world and that of business. How will I know I have what I want? If I actually program a workable (and working) intelligent structure, able to uncover and reconstruct the collective intelligence of a social structure out of available empirical data – namely to uncover and reconstruct the chief collective outcomes that structure is after, and its patterns of reaction to random exogenous disturbances – that would be an almost tangible outcome, telling me I have made a significant step. When I see that I have repetitive, predictable patterns of facilitating the start of joint research projects in consortiums of scientific and business entities, then I know I have nailed down something in terms of project management. If I can start something like an investment fund for innovative technologies, then I definitely know I am on the right track.
The intelligent structure I want to program is essentially an artificial neural network, possibly instrumented with additional functions, such as data collection, data cleansing etc. I know I want to understand very specifically what my neural network does. I want to understand every step it takes. To that purpose, I need to figure out a workable algorithm of my own, where I understand every line of code. It can be sub-optimally slow and limited in its real computational power, yet I need it. On the other hand, the Internet offers more and more platforms and libraries in the form of digital clouds, such as IBM Watson or TensorFlow, which provide optimized processes to build complex pieces of AI. I already know that being truly proficient in Data Science entails skills pertinent to using those cloud-based tools. My bottom line is that if I want to program an intelligent structure communicable and appealing to other people, I need to program it at two levels: as my own prototypic code, and as a procedure of using cloud-based platforms to replicate it.
At the juncture of those two how-will-I-know pieces of evidence, an idea emerges, a crazy one. What if I can program an intelligent structure which uncovers and reconstructs one or more alternative business models out of the available empirical data? Interesting. The empirical data I work the most with, as regards business models, is the data provided in the annual reports of publicly listed companies. Secondarily, data about financial markets connects to it. My own experience as a small investor gives me an existential basis to back that external data, and it suggests defining a business model as a portfolio of assets combined with broadly understood behavioural patterns, both in the people active inside the business model, thus running it or employed with it, and in the people connected to that model from outside, as customers, suppliers, investors etc.
How will other people know I have what I want? The intelligent structure I will have programmed has to work across different individual environments, which is an elegant way of saying it should work on different computers. Logically, I can say I have clearly demonstrated to other people that I achieved what I wanted with that thing of collective intelligence when said other people are willing to try my algorithm, and are successful at it. Here comes the point of willingness in other people. I think it is something like an existential thing across the board. When we want other people to try and do something, and they don’t, we are pissed. When other people want us to try and do something, and we don’t, we are pissed, and they are pissed. As regards my hypothesis of collective intelligence, I have already experienced that sort of intellectual barrier when my articles get reviewed. Reviewers write that my hypothesis is interesting, yet not articulated and not grounded enough. Honestly, I can’t blame them. My feeling is that it is even hard to say that I have that hypothesis of collective intelligence. It is rather as if that hypothesis had me as its voice and speech. Crazy, I know, only this is how I feel about the thing, and I know by experience that good science (and good things, in general) turns up when I am honest with myself.
My point is that I feel I need to write a book about that concept of collective intelligence, in order to give a full explanation of my hypothesis. My observations about cities and their role in the human civilization make, for the moment, one of the most tangible topics I can attach the theoretical hypothesis to. Writing that book about cities, together with programming an intelligent structure, takes a different shade now. It becomes a complex account of how we can deconstruct something – our own collective intelligence – which we know is there and yet, as we are inside that thing, we have a hard time describing it.
That book about cities, abundantly referring to my hypothesis of collective intelligence, could be one of the ways to convince other people to at least try what I propose. Thus, once again, I restate what I understand by intelligent structure. It is a structure which learns new patterns by experimenting with many alternative versions of itself, whilst staying internally coherent. I return to my ‘DU_DG’ database about cities (see ‘It is important to re-assume the meaning’) and I am re-assuming the concept of alternative versions, in an intelligent structure.
I have a dataset structured into n variables and m empirical observations. In my DU_DG database, as in many other economic datasets, distinct observations are defined as the state of a country in a given year. As I look at the dataset (metaphorically, it has content and meaning, but it does not have any physical shape save for the one my software supplies it with), and as I look at my thoughts (metaphorically, once again), I realize I have been subconsciously distinguishing two levels of defining an intelligent structure in that dataset, and, correspondingly, two ways of defining alternative versions thereof. At the first level, the entire dataset is supposed to be an intelligent structure and alternative versions thereof consist in alternative dichotomies of the type ‘one variable as output, i.e. as the desired outcome to optimize, and the remaining ones as instrumental input’. At this level of structural intelligence – by which term I understand the way of being in an intelligent structure – alternative versions are alternative orientations, and there are as many of them as there are variables.
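A quick sketch in Python makes that notion of orientations tangible. The column names and numbers below are hypothetical placeholders, not the actual DU_DG fields:

import pandas as pd

# Hypothetical mini-dataset: n variables (columns), m observations (rows).
df = pd.DataFrame({
    'urban_density': [120.3, 125.1, 130.8],
    'agri_land_km2': [45000, 44800, 44500],
    'gdp_per_capita': [15200, 15900, 16400],
})

# As many alternative orientations as there are variables: each variable
# takes a turn as the output to optimize, the rest serve as input.
for output_var in df.columns:
    input_vars = [c for c in df.columns if c != output_var]
    print(output_var, '<-', input_vars)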
Distinction into variables is largely, although not entirely, epistemic rather than ontological. The headcount of urban population is not a fundamentally different phenomenon from the surface of agricultural land. Yes, the units of measurement are different, i.e. people vs. square kilometres, but, ontologically, it is largely the same existential stuff, possible to describe as people living somewhere in large numbers and being successful at it. Historically, though, social scientists and governments alike have come to the conclusion that these two metrics have a different meaning, and thus it comes in handy to distinguish them as semantic vessels to collect and convey information. The distinction of alternative orientations in an intelligent structure, supposedly represented in a dataset, is arbitrary and cognitive more than ontological. It depends on the number of variables we have. If I add variables to the dataset, e.g. by computing coefficients between the incumbent variables, I can create new orientations for the intelligent structure, i.e. new alternative versions to experiment with.
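Continuing the sketch above: one new, derived column immediately creates one more possible orientation. The ratio below is arbitrary, purely for illustration:

# Adding a derived variable extends the set of orientations from 3 to 4.
df['density_per_agri_km2'] = df['urban_density'] / df['agri_land_km2']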
The point which comes to my mind is that the complexity of an intelligent structure, at that first level, depends on the complexity of my observational apparatus. The more distinct variables I can measure as regards a given society, the more complexity I can give to the allegedly existing, collectively intelligent structure of that society.
Whichever combination ‘output variable vs. input variables’ I am experimenting with, there comes the second level of defining intelligent structures, i.e. that of defining them as separate countries. They are sort of local intelligent structures, and, at the same time, they are alternative experimental versions of the overarching intelligent structure to be found in the vector of variables. Each such local intelligent structure, with a flag, a national anthem, and a government, produces many alternative versions of itself in consecutive years covered by the window of observation I have in my dataset.
I can see a subtle distinction here. A country produces alternative versions of itself, in different years of its existence, sort of objectively and without giving a f**k about my epistemological distinctions. It just exists and tries to be good at it. Experimenting comes naturally in the flow of time. This is unguided learning. On the other hand, I produce different orientations of the entire dataset. This is guided learning. Now I understand the importance of the degree of supervision in artificial neural networks.
I can see an important lesson for me, here. If I want to program intelligent structures ‘able to uncover and reconstruct the collective intelligence of a social structure out of available empirical data – namely to uncover and reconstruct the chief collective outcomes that structure is after, and its patterns of reaction to random exogenous disturbances’, I need to distinguish those two levels of learning in the first place, namely the unguided flow of existential states from the guided structuring into variables and orientations. When I have an empirical dataset and I want to program an intelligent structure able to deconstruct the collective intelligence represented in that dataset, I need to define accurately the basic ontological units, i.e. the fundamentally existing things, then I define alternative states of those things, and finally I define alternative orientations.
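To keep that lesson operational, here is a minimal sketch, with made-up numbers, of how those two levels could be encoded in Pandas: countries and years are the ontological units and their alternative states, whilst the choice of an output variable remains my guided, epistemic decision:

import pandas as pd

# Made-up data, just to illustrate the two levels of learning.
raw = pd.DataFrame({
    'country': ['France', 'France', 'Poland', 'Poland'],
    'year': [2018, 2019, 2018, 2019],
    'urban_density': [122.0, 123.5, 124.0, 121.7],
    'gdp_per_capita': [38500, 39200, 15200, 15900],
})

# Unguided level: each (country, year) pair is one experimental state
# the country has produced, regardless of my epistemology.
panel = raw.set_index(['country', 'year'])
print(panel.loc['Poland'])  # all the recorded states of Poland

# Guided level: my choice of output variable defines the orientation.
output_var = 'urban_density'
input_vars = [c for c in panel.columns if c != output_var]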
Now, I am contrasting. I pass from those abstract thoughts on intelligent structures to a quick review of my so-far learning to program those structures in Python. Below, I present that review as a quick list of separate files I created in JupyterLab, together with a quick characteristic of problems I am trying to solve in each of those files, as well as of the solutions found and not found.
>> Practice Dec 11 2020.ipynb.
In this file, I work with the IMF database WEOOct2020 (https://www.imf.org/en/Publications/WEO/weo-database/2020/October). I practiced reading complex datasets with an artificially flattened structure: a table in which index columns are used to add dimensions to an otherwise two-dimensional format. I practiced the ‘read_excel’ and ‘read_csv’ commands. On the whole, it seems that converting an Excel file to CSV and then reading the CSV in Python is a better method than reading the Excel directly. Problems solved (sketched in code below): a) cleansing the dataset of not-a-number components, with a successful conversion of initially ‘object’ columns into the desired ‘float64’ format; b) setting descriptive indexes on the data frame; c) listing unique labels from a descriptive index; d) inserting new columns into the data frame; e) compounding the contents of two existing descriptive index columns into a third index column. Failures: i) reading data from an XML file; ii) reading data from the SDMX format; iii) transposing my data frame so as to put the index values of economic variables as column names, and years as index values in a column.
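For the record, a hedged sketch of points a) through e); the file name, separator and column labels here are my assumptions for illustration, not the actual WEO layout:

import pandas as pd

weo = pd.read_csv('WEOOct2020all.csv', sep='\t', low_memory=False)

# a) cleanse not-a-number placeholders and convert 'object' to 'float64'
weo['2019'] = pd.to_numeric(weo['2019'].replace({'n/a': None, '--': None}),
                            errors='coerce')

# b) set a descriptive index on the data frame
weo = weo.set_index('Country')

# c) list unique labels from a descriptive index column
subjects = weo['Subject Descriptor'].unique()

# d) insert a new column at a chosen position
weo.insert(0, 'Source', 'WEOOct2020')

# e) compound two descriptive columns into a third one
weo['Subject_Units'] = (weo['Subject Descriptor'].astype(str)
                        + '_' + weo['Units'].astype(str))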
>> Practice Dec 8 2020.ipynb.
In this file, I worked with a favourite dataset of mine, the Penn Tables 9.1. (https://www.rug.nl/ggdc/productivity/pwt/?lang=en ). I described my work with it in two earlier updates, namely ‘Two loops, one inside the other’, and ‘Mathematical distance’. I succeeded in creating an intelligent structure from that dataset. I failed at properly formatting the output of that structure and thus at comparing the cognitive value of different orientations I made it simulate.
>> Practice with Mortality.ipynb.
I created this file as a first practice before working with the above-mentioned WEOOct2020 database. I took one dataset from the website of the World Bank, namely that pertinent to the coefficient of adult male mortality (https://data.worldbank.org/indicator/SP.DYN.AMRT.MA ). I practiced reading data from CSV files, and I unsuccessfully tried to stack the dataset, i.e. to transform columns corresponding to different years of observation into rows indexed with labels corresponding to years.
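In retrospect, ‘pd.melt’ looks like the tool for that stacking. A sketch under assumptions about the World Bank file layout (four metadata rows to skip, year columns named '1960', '1961', etc.):

import pandas as pd

# File name and column labels are assumptions, not verified against
# the actual download.
mort = pd.read_csv('API_SP.DYN.AMRT.MA.csv', skiprows=4)

# Turn the year columns into rows, indexed with year labels.
year_cols = [c for c in mort.columns if c.isdigit()]
stacked = mort.melt(id_vars=['Country Name', 'Country Code'],
                    value_vars=year_cols,
                    var_name='Year',
                    value_name='Mortality')
stacked['Year'] = stacked['Year'].astype(int)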
>> Practice DU_DG.ipynb.
In this file, I am practicing with my own dataset, pertinent to the density of urban population and its correlates. The dataset is already structured in Excel. I am practicing the coding of the same intelligent structure I made with Penn Tables, meant to study the orientation of the societies in question. Same problems and same failures as with Penn Tables 9.1: for the moment, I cannot nail down the way to get output data in structures that allow full comparability. My columns tend to wander across the output data frames. In other words, the vectors of mean expected values produced by my code have a slightly (just slightly, and sufficiently to be annoying) different structure from the original dataset. I don’t know why yet, and I don’t know how to fix it.
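My working hypothesis about those wandering columns is that somewhere in my loops the order of variables gets scrambled, e.g. by a set, or by the order in which results are appended. If that is the case, re-aligning every output frame against the source's column list should pin them down. A sketch with hypothetical names:

import pandas as pd

# 'source_df' and 'mean_expected_values' stand in for my DU_DG frame
# and for the output of the learning loop.
source_df = pd.DataFrame(columns=['var_A', 'var_B', 'var_C'])
mean_expected_values = {'var_C': 0.91, 'var_A': 1.02, 'var_B': 0.87}

# reindex() forces the output to follow the source's column order,
# filling anything missing with NaN instead of shifting columns around.
aligned = pd.DataFrame([mean_expected_values]).reindex(columns=source_df.columns)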
On the other hand, in that same file, I have been messing around a bit with algorithms based on the ‘scikit-learn’ library for Python. Nice graphs, and functions which I still need to understand.
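As a note to myself until I understand those functions properly, the basic scikit-learn pattern boils down to something like this (a generic example, not my actual DU_DG code):

from sklearn.linear_model import LinearRegression

# Generic fit-and-inspect pattern; X and y stand for any input/output
# split of the variables.
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]]
y = [1.5, 2.5, 4.0]
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)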
>> Practice SEC Financials.ipynb.
Here, I work with data published by the US Securities and Exchange Commission, regarding the financials of individual companies listed in the US stock market (https://www.sec.gov/dera/data/financial-statement-data-sets.html). The challenge here consists in translating data originally supplied in *.TXT files into numerical data frames in Python. The problem I have managed to solve so far (this is the most recent piece of my programming) is the most elementary translation of TXT data into a Pandas data frame, using the ‘open()’ command and the ‘f.readlines()’ one. Another small victory here was reading data from a sub-directory inside the working directory of JupyterLab, i.e. inside the root directory of my user profile. I used two methods of reading the TXT data. Both sort of worked. First, I used the following sequence:
>> with open('2020q3/num.txt') as f:
       numbers=f.readlines()
>> Numbers=pd.DataFrame(numbers)
… which, when checked with the 'Numbers.info()' command, yields:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2351641 entries, 0 to 2351640
Data columns (total 1 columns):
 #   Column  Dtype
---  ------  -----
 0   0       object
dtypes: object(1)
memory usage: 17.9+ MB
In other words, that sequence did not split the string of column names into separate columns, and the ‘Numbers’ data frame contains one column, in which every row is one long string, structured with what look like ‘\t’ separators. I tried to be smart with it. I did:
>> Numbers.to_csv('Num2') # I converted the Pandas data frame into a CSV file
>> Num3=pd.DataFrame(pd.read_csv('Num2', sep=';')) # … and I tried to read it back from CSV, experimenting with different separators

None of it worked. With the ‘sep=’ argument in the command, I kept getting a parsing error, along the lines of ‘ParserError: Error tokenizing data. C error: Expected 1 fields in line 3952, saw 10’. When I didn’t use the ‘sep=’ argument, the command did not yield an error, yet it yielded the same long column of structured strings instead of many data columns.
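Writing this review, I notice those tab-like separators suggest a much simpler route: if the SEC *.TXT files are indeed tab-delimited (which is my assumption, not something I have verified in the documentation), then passing the separator directly to ‘read_csv’ should split the columns without any detour:

import pandas as pd

# Assumes the SEC *.TXT files are tab-delimited; if that holds, this line
# replaces the whole open() / readlines() / to_csv() sequence above.
num = pd.read_csv('2020q3/num.txt', sep='\t', low_memory=False)
print(num.info())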
So, at the time, I gave up a bit, and I used Excel to open the TXT file and save a copy of it in the CSV format. Then, I just created a data frame from the CSV dataset, through the 'NUM_from_CSV=pd.DataFrame(pd.read_csv('SEC_NUM.csv', sep=';'))' command, which, checked with the 'NUM_from_CSV.info()' command, yields:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 9 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   adsh      1048575 non-null  object
 1   tag       1048575 non-null  object
 2   version   1048575 non-null  object
 3   coreg     30131 non-null    object
 4   ddate     1048575 non-null  int64
 5   qtrs      1048575 non-null  int64
 6   uom       1048575 non-null  object
 7   value     1034174 non-null  float64
 8   footnote  1564 non-null     object
dtypes: float64(1), int64(2), object(6)
memory usage: 72.0+ MB
The ‘tag’ column in this data frame contains the names of financial variables ascribed to companies identified with their ‘adsh’ codes. I experience the same challenge, and, so far, the same failure as with the WEOOct2020 database from the IMF, namely translating the different values of a descriptive index into a dictionary, and then, in the next step, flipping the database so as to make those different index categories into separate columns (variables).
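For when I come back to it: ‘pivot_table’ looks like a candidate tool for that flip, with the ‘tag’ categories becoming columns. A sketch continuing from the ‘NUM_from_CSV’ frame above; the choice of ‘mean’ as aggregation is an arbitrary assumption of mine about how to handle duplicate entries:

# Flip the long table so that each 'tag' becomes its own column,
# indexed by company ('adsh') and reporting date ('ddate').
wide = NUM_from_CSV.pivot_table(index=['adsh', 'ddate'],
                                columns='tag',
                                values='value',
                                aggfunc='mean')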
As I have passed in review that programming of mine, I have become aware that reading and properly structuring different formats of data is the sensory apparatus of the intelligent structure I want to program. Operations of data cleansing and data formatting are the fundamental skills I need to develop in programming. Contrary to what I expected a few weeks ago, when I was taking on programming in Python, elaborate mathematical constructs are simpler to code than I thought they would be. What might be harder, mind you, is to program them so as to optimize computational efficiency with large datasets. Still, the very basic, boots-on-the-ground structuring of data seems to be the name of the game for programming intelligent structures.