Overview
Teaching: 5 min Exercises: 10 minQuestions
How can I read tabular data?
Objectives
Import the Pandas library.
Use Pandas to load a simple CSV data set.
Get some basic information about a Pandas Data frame.
import pandas
.pandas.read_csv
.
import pandas
data = pandas.read_csv('data/gapminder_gdp_oceania.csv')
print(data)
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \
0 Australia 10039.59564 10949.64959 12217.22686
1 New Zealand 10556.57566 12247.39532 13175.67800
gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 \
0 14526.12465 16788.62948 18334.19751 19477.00928
1 14463.91893 16046.03728 16233.71770 17632.41040
gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 \
0 21888.88903 23424.76683 26997.93657 30687.75473
1 19007.19129 18363.32494 21050.41377 23189.80135
gdpPercap_2007
0 34435.36744
1 25185.00911
\
to show wrapped lines when output is too wide to fit the screen.File Not Found
Our lessons store their data files in a
data
sub-directory, which is why the path to the file isdata/gapminder_gdp_oceania.csv
. If you forget to includedata/
, or if you include it but your copy of the file is somewhere else, you will get a runtime error that ends with a line like this:OSError: File b'gapminder_gdp_oceania.csv' does not exist
index_col
to specify that a column’s values should be used as row headings.read_csv
as its index_col
parameter to do this.data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 \
country
Australia 10039.59564 10949.64959 12217.22686 14526.12465
New Zealand 10556.57566 12247.39532 13175.67800 14463.91893
gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 \
country
Australia 16788.62948 18334.19751 19477.00928 21888.88903
New Zealand 16046.03728 16233.71770 17632.41040 19007.19129
gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
country
Australia 23424.76683 26997.93657 30687.75473 34435.36744
New Zealand 18363.32494 21050.41377 23189.80135 25185.00911
DataFrame.info
to find out more about a data frame.data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
gdpPercap_1952 2 non-null float64
gdpPercap_1957 2 non-null float64
gdpPercap_1962 2 non-null float64
gdpPercap_1967 2 non-null float64
gdpPercap_1972 2 non-null float64
gdpPercap_1977 2 non-null float64
gdpPercap_1982 2 non-null float64
gdpPercap_1987 2 non-null float64
gdpPercap_1992 2 non-null float64
gdpPercap_1997 2 non-null float64
gdpPercap_2002 2 non-null float64
gdpPercap_2007 2 non-null float64
dtypes: float64(12)
memory usage: 208.0+ bytes
DataFrame
'Australia'
and 'New Zealand'
DataFrame.columns
variable stores information about the data frame’s columns.math.pi
.()
to try to call it.print(data.columns)
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
dtype='object')
DataFrame.T
to transpose a data frame..T
) doesn’t copy the data, just changes the program’s view of it.columns
, it is a member variable.print(data.T)
country Australia New Zealand
gdpPercap_1952 10039.59564 10556.57566
gdpPercap_1957 10949.64959 12247.39532
gdpPercap_1962 12217.22686 13175.67800
gdpPercap_1967 14526.12465 14463.91893
gdpPercap_1972 16788.62948 16046.03728
gdpPercap_1977 18334.19751 16233.71770
gdpPercap_1982 19477.00928 17632.41040
gdpPercap_1987 21888.88903 19007.19129
gdpPercap_1992 23424.76683 18363.32494
gdpPercap_1997 26997.93657 21050.41377
gdpPercap_2002 30687.75473 23189.80135
gdpPercap_2007 34435.36744 25185.00911
DataFrame.describe
to get summary statistics about data.DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored.
print(data.describe())
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 \
count 2.000000 2.000000 2.000000 2.000000
mean 10298.085650 11598.522455 12696.452430 14495.021790
std 365.560078 917.644806 677.727301 43.986086
min 10039.595640 10949.649590 12217.226860 14463.918930
25% 10168.840645 11274.086022 12456.839645 14479.470360
50% 10298.085650 11598.522455 12696.452430 14495.021790
75% 10427.330655 11922.958888 12936.065215 14510.573220
max 10556.575660 12247.395320 13175.678000 14526.124650
gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 \
count 2.00000 2.000000 2.000000 2.000000
mean 16417.33338 17283.957605 18554.709840 20448.040160
std 525.09198 1485.263517 1304.328377 2037.668013
min 16046.03728 16233.717700 17632.410400 19007.191290
25% 16231.68533 16758.837652 18093.560120 19727.615725
50% 16417.33338 17283.957605 18554.709840 20448.040160
75% 16602.98143 17809.077557 19015.859560 21168.464595
max 16788.62948 18334.197510 19477.009280 21888.889030
gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
count 2.000000 2.000000 2.000000 2.000000
mean 20894.045885 24024.175170 26938.778040 29810.188275
std 3578.979883 4205.533703 5301.853680 6540.991104
min 18363.324940 21050.413770 23189.801350 25185.009110
25% 19628.685413 22537.294470 25064.289695 27497.598692
50% 20894.045885 24024.175170 26938.778040 29810.188275
75% 22159.406358 25511.055870 28813.266385 32122.777857
max 23424.766830 26997.936570 30687.754730 34435.367440
Reading Other Data
Read the data in
gapminder_gdp_americas.csv
(which should be in the same directory asgapminder_gdp_oceania.csv
) into a variable calledamericas
and display its summary statistics.
Inspecting Data.
After reading the data for the Americas, use
help(americas.head)
andhelp(americas.tail)
to find out whatDataFrame.head
andDataFrame.tail
do.
- What method call will display the first three rows of this data?
- What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)
Error Messages
The data for your current project is stored in a file called
microbes.csv
, which is located in a folder calledfield_data
. You are doing analysis in a notebook calledanalysis.ipynb
in a sibling folder calledthesis
:your_home_directory +-- field_data/ | +-- microbes.csv +-- thesis/ +-- analysis.ipynb
What value(s) should you pass to
read_csv
to readmicrobes.csv
inanalysis.ipynb
?
Writing Data
As well as the
read_csv
function for reading data from a file, Pandas provides ato_csv
function to write data frames to files. Applying what you’ve learned about reading from files, write one of your data frames to a file calledprocessed.csv
. You can usehelp
to get information on how to useto_csv
.
Key Points
Use the Pandas library to do statistics on tabular data.
Use
index_col
to specify that a column’s values should be used as row headings.Use
DataFrame.info
to find out more about a data frame.The
DataFrame.columns
variable stores information about the data frame’s columns.Use
DataFrame.T
to transpose a data frame.Use
DataFrame.describe
to get summary statistics about data.