Chapter 7. Joining data sources 59
To make explicit what we’re doing, we could accomplish exactly the same using the --tkey option:
join IBM.csv Close --tkey="Date,%Y-%m-%d"
Example: OECD quarterly data
Table 7.3 shows an excerpt from a CSV file provided by the OECD statistical site (stat.oecd.org)
in response to a request for GDP at constant prices for several countries.
4
Frequency,Period,Country,Value,Flags
"Quarterly","Q1-1960","France",463876.148126845,E
"Quarterly","Q1-1960","Germany",768802.119278467,E
"Quarterly","Q1-1960","Italy",414629.791450547,E
"Quarterly","Q1-1960","United Kingdom",578437.090291889,E
"Quarterly","Q2-1960","France",465618.977328614,E
"Quarterly","Q2-1960","Germany",782484.138122549,E
"Quarterly","Q2-1960","Italy",420714.910290157,E
"Quarterly","Q2-1960","United Kingdom",572853.474696578,E
"Quarterly","Q3-1960","France",469104.41925852,E
"Quarterly","Q3-1960","Germany",809532.161494483,E
"Quarterly","Q3-1960","Italy",426893.675840156,E
"Quarterly","Q3-1960","United Kingdom",581252.066618986,E
"Quarterly","Q4-1960","France",474664.327992619,E
"Quarterly","Q4-1960","Germany",817806.132384948,E
"Quarterly","Q4-1960","Italy",427221.338414114,E
...
Table 7.3: Example of CSV file as provided by the OECD statistical website
This is an instance of data in what we call atomic format, that is, a format in which each line of the
outer file contains a single data-point and extracting data mainly requires filtering the appropriate
lines. The outer time key is under the Period heading, and has the format Q<quarter>-<year>.
Assuming that the file in Table 7.3 has the name oecd.csv, the following script reconstructs the
time series of Gross Domestic Product for several countries:
nulldata 220
setobs 4 1960:1
join oecd.csv FRA --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"France\""
join oecd.csv GER --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"Germany\""
join oecd.csv ITA --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"Italy\""
join oecd.csv UK --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"United Kingdom\""
Note the use of the format codes %q for the quarter and %Y for the 4-digit year. A touch of elegance
could have been added by storing the invariant options to join using the setopt command, as in
setopt join persist --tkey="Period,Q%q-%Y" --data=Value
join oecd.csv FRA --filter="Country==\"France\""
join oecd.csv GER --filter="Country==\"Germany\""
join oecd.csv ITA --filter="Country==\"Italy\""
join oecd.csv UK --filter="Country==\"United Kingdom\""
setopt join clear
If one were importing a large number of such series it might be worth rewriting the sequence of
joins as a loop, as in
4
Retrieved 2013-08-05. The OECD files in fact contain two leading columns with very long labels; these are irrelevant
to the present example and can be omitted without altering the sample script.