Analytics with Pandas: 1. Introduction¶
Download exercises zip¶
Python offers powerful tools for data analysis. One of the main ones is Pandas, which provides fast and flexible data structures, especially suited to interactive data analysis.
Pandas reuses existing libraries we’ve already seen, such as Numpy:
In this tutorial we will see:
data analysis with Pandas library
plotting with MatPlotLib
Examples with AstroPi dataset
Exercises with meteotrentino dataset
1. What to do¶
Unzip the exercises into a folder; you should get something like this:
pandas
pandas1.ipynb
pandas1-sol.ipynb
pandas2.ipynb
pandas2-sol.ipynb
pandas3-chal.ipynb
jupman.py
WARNING 1: to correctly visualize the notebook, it MUST be in an unzipped folder!
Open Jupyter Notebook from that folder. Two things should open: first a console and then a browser.
The browser should show a file list: navigate the list and open the notebook
pandas/pandas1.ipynb
WARNING 2: DO NOT use the Upload button in Jupyter; instead, navigate in the Jupyter browser to the unzipped folder!
Go on reading that notebook, and follow the instructions inside.
Shortcut keys:
to execute Python code inside a Jupyter cell, press
Control + Enter
to execute Python code inside a Jupyter cell AND select next cell, press
Shift + Enter
to execute Python code inside a Jupyter cell AND create a new cell afterwards, press
Alt + Enter
If the notebook looks stuck, try selecting
Kernel -> Restart
Check installation¶
First let’s see whether pandas is already installed on your system; try executing this cell with Ctrl-Enter:
[1]:
import pandas as pd
If you saw no error messages, you can skip installation; otherwise do this:
Anaconda - open Anaconda Prompt and issue this:
conda install pandas
Without Anaconda (--user installs into your home folder):
python3 -m pip install --user pandas
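Either way, you can verify which pandas version you ended up with by printing its version string:

```python
import pandas as pd

# the version string tells you which pandas release is installed
print(pd.__version__)
```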
2. Data analysis of Astro Pi¶
Let’s try analyzing data recorded on a RaspberryPi present on the International Space Station, downloaded from here:
https://projects.raspberrypi.org/en/projects/astro-pi-flight-data-analysis
where it is possible to find the detailed description of the data gathered by the sensors during February 2016 (one record every 10 seconds).
The read_csv method imports data from a CSV file and saves it into a DataFrame structure.
In this exercise we shall use the file Columbus_Ed_astro_pi_datalog.csv
[2]:
import pandas as pd # we import pandas and for ease we rename it to 'pd'
import numpy as np # we import numpy and for ease we rename it to 'np'
# remember the encoding !
df = pd.read_csv('Columbus_Ed_astro_pi_datalog.csv', encoding='UTF-8')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ROW_ID 110869 non-null int64
1 temp_cpu 110869 non-null float64
2 temp_h 110869 non-null float64
3 temp_p 110869 non-null float64
4 humidity 110869 non-null float64
5 pressure 110869 non-null float64
6 pitch 110869 non-null float64
7 roll 110869 non-null float64
8 yaw 110869 non-null float64
9 mag_x 110869 non-null float64
10 mag_y 110869 non-null float64
11 mag_z 110869 non-null float64
12 accel_x 110869 non-null float64
13 accel_y 110869 non-null float64
14 accel_z 110869 non-null float64
15 gyro_x 110869 non-null float64
16 gyro_y 110869 non-null float64
17 gyro_z 110869 non-null float64
18 reset 110869 non-null int64
19 time_stamp 110869 non-null object
dtypes: float64(17), int64(2), object(1)
memory usage: 16.9+ MB
We can quickly see the number of rows and columns of the dataframe with the shape attribute:
NOTE: shape is not followed by round parentheses!
[3]:
df.shape
[3]:
(110869, 20)
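Since shape is a plain tuple, it can also be unpacked directly. A minimal sketch on a tiny made-up dataframe:

```python
import pandas as pd

d = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is a (rows, columns) tuple, so it can be unpacked
num_rows, num_cols = d.shape
print(num_rows, num_cols)   # 3 2
```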
The describe method gives you many summary statistics on the fly:
the row count
the average
the minimum and maximum
[4]:
df.describe()
[4]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 110869.000000 | 110869.000000 | 110869.000000 | 110869.000000 | 110869.000000 | 110869.000000 | 110869.000000 | 110869.000000 | 110869.00000 | 110869.000000 | 110869.000000 | 110869.000000 | 110869.000000 | 110869.000000 | 110869.000000 | 1.108690e+05 | 110869.000000 | 1.108690e+05 | 110869.000000 |
mean | 55435.000000 | 32.236259 | 28.101773 | 25.543272 | 46.252005 | 1008.126788 | 2.770553 | 51.807973 | 200.90126 | -19.465265 | -1.174493 | -6.004529 | -0.000630 | 0.018504 | 0.014512 | -8.959493e-07 | 0.000007 | -9.671594e-07 | 0.000180 |
std | 32005.267835 | 0.360289 | 0.369256 | 0.380877 | 1.907273 | 3.093485 | 21.848940 | 2.085821 | 84.47763 | 28.120202 | 15.655121 | 8.552481 | 0.000224 | 0.000604 | 0.000312 | 2.807614e-03 | 0.002456 | 2.133104e-03 | 0.060065 |
min | 1.000000 | 31.410000 | 27.200000 | 24.530000 | 42.270000 | 1001.560000 | 0.000000 | 30.890000 | 0.01000 | -73.046240 | -43.810030 | -41.163040 | -0.025034 | -0.005903 | -0.022900 | -3.037930e-01 | -0.378412 | -2.970800e-01 | 0.000000 |
25% | 27718.000000 | 31.960000 | 27.840000 | 25.260000 | 45.230000 | 1006.090000 | 1.140000 | 51.180000 | 162.43000 | -41.742792 | -12.982321 | -11.238430 | -0.000697 | 0.018009 | 0.014349 | -2.750000e-04 | -0.000278 | -1.200000e-04 | 0.000000 |
50% | 55435.000000 | 32.280000 | 28.110000 | 25.570000 | 46.130000 | 1007.650000 | 1.450000 | 51.950000 | 190.58000 | -21.339485 | -1.350467 | -5.764400 | -0.000631 | 0.018620 | 0.014510 | -3.000000e-06 | -0.000004 | -1.000000e-06 | 0.000000 |
75% | 83152.000000 | 32.480000 | 28.360000 | 25.790000 | 46.880000 | 1010.270000 | 1.740000 | 52.450000 | 256.34000 | 7.299000 | 11.912456 | -0.653705 | -0.000567 | 0.018940 | 0.014673 | 2.710000e-04 | 0.000271 | 1.190000e-04 | 0.000000 |
max | 110869.000000 | 33.700000 | 29.280000 | 26.810000 | 60.590000 | 1021.780000 | 360.000000 | 359.400000 | 359.98000 | 33.134748 | 37.552135 | 31.003047 | 0.018708 | 0.041012 | 0.029938 | 2.151470e-01 | 0.389499 | 2.698760e-01 | 20.000000 |
QUESTION: is some column missing from the table produced by describe? Why is it not included?
To limit describe to only one column, like humidity, you can write:
[5]:
df['humidity'].describe()
[5]:
count 110869.000000
mean 46.252005
std 1.907273
min 42.270000
25% 45.230000
50% 46.130000
75% 46.880000
max 60.590000
Name: humidity, dtype: float64
The dot notation is even handier:
[6]:
df.humidity.describe()
[6]:
count 110869.000000
mean 46.252005
std 1.907273
min 42.270000
25% 45.230000
50% 46.130000
75% 46.880000
max 60.590000
Name: humidity, dtype: float64
WARNING: Careful about spaces!
If the field name contains spaces (e.g. 'blender rotations'), you cannot use the dot notation; use instead the square bracket notation seen above (i.e. df['blender rotations'].describe())
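For example, on a tiny made-up dataframe with a hypothetical column 'blender rotations':

```python
import pandas as pd

# hypothetical dataframe whose column name contains a space
data = pd.DataFrame({'blender rotations': [10, 20, 30]})

# data.blender rotations would be a syntax error;
# square brackets work with any column name
stats = data['blender rotations'].describe()
print(stats['mean'])   # 20.0
```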
The head() method returns the first rows:
[7]:
df.head()
[7]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 31.88 | 27.57 | 25.01 | 44.94 | 1001.68 | 1.49 | 52.25 | 185.21 | -46.422753 | -8.132907 | -12.129346 | -0.000468 | 0.019439 | 0.014569 | 0.000942 | 0.000492 | -0.000750 | 20 | 2016-02-16 10:44:40 |
1 | 2 | 31.79 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.03 | 53.73 | 186.72 | -48.778951 | -8.304243 | -12.943096 | -0.000614 | 0.019436 | 0.014577 | 0.000218 | -0.000005 | -0.000235 | 0 | 2016-02-16 10:44:50 |
2 | 3 | 31.66 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.24 | 53.57 | 186.21 | -49.161878 | -8.470832 | -12.642772 | -0.000569 | 0.019359 | 0.014357 | 0.000395 | 0.000600 | -0.000003 | 0 | 2016-02-16 10:45:00 |
3 | 4 | 31.69 | 27.52 | 25.01 | 45.32 | 1001.69 | 1.57 | 53.63 | 186.03 | -49.341941 | -8.457380 | -12.615509 | -0.000575 | 0.019383 | 0.014409 | 0.000308 | 0.000577 | -0.000102 | 0 | 2016-02-16 10:45:10 |
4 | 5 | 31.66 | 27.54 | 25.01 | 45.18 | 1001.71 | 0.85 | 53.66 | 186.46 | -50.056683 | -8.122609 | -12.678341 | -0.000548 | 0.019378 | 0.014380 | 0.000321 | 0.000691 | 0.000272 | 0 | 2016-02-16 10:45:20 |
The tail() method returns the last rows:
[8]:
df.tail()
[8]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110864 | 110865 | 31.56 | 27.52 | 24.83 | 42.94 | 1005.83 | 1.58 | 49.93 | 129.60 | -15.169673 | -27.642610 | 1.563183 | -0.000682 | 0.017743 | 0.014646 | -0.000264 | 0.000206 | 0.000196 | 0 | 2016-02-29 09:24:21 |
110865 | 110866 | 31.55 | 27.50 | 24.83 | 42.72 | 1005.85 | 1.89 | 49.92 | 130.51 | -15.832622 | -27.729389 | 1.785682 | -0.000736 | 0.017570 | 0.014855 | 0.000143 | 0.000199 | -0.000024 | 0 | 2016-02-29 09:24:30 |
110866 | 110867 | 31.58 | 27.50 | 24.83 | 42.83 | 1005.85 | 2.09 | 50.00 | 132.04 | -16.646212 | -27.719479 | 1.629533 | -0.000647 | 0.017657 | 0.014799 | 0.000537 | 0.000257 | 0.000057 | 0 | 2016-02-29 09:24:41 |
110867 | 110868 | 31.62 | 27.50 | 24.83 | 42.81 | 1005.88 | 2.88 | 49.69 | 133.00 | -17.270447 | -27.793136 | 1.703806 | -0.000835 | 0.017635 | 0.014877 | 0.000534 | 0.000456 | 0.000195 | 0 | 2016-02-29 09:24:50 |
110868 | 110869 | 31.57 | 27.51 | 24.83 | 42.94 | 1005.86 | 2.17 | 49.77 | 134.18 | -17.885872 | -27.824149 | 1.293345 | -0.000787 | 0.017261 | 0.014380 | 0.000459 | 0.000076 | 0.000030 | 0 | 2016-02-29 09:25:00 |
The columns property gives the column headers:
[9]:
df.columns
[9]:
Index(['ROW_ID', 'temp_cpu', 'temp_h', 'temp_p', 'humidity', 'pressure',
'pitch', 'roll', 'yaw', 'mag_x', 'mag_y', 'mag_z', 'accel_x', 'accel_y',
'accel_z', 'gyro_x', 'gyro_y', 'gyro_z', 'reset', 'time_stamp'],
dtype='object')
Note: as you can see above, the returned object is not a list, but a special container defined by pandas:
[10]:
type(df.columns)
[10]:
pandas.core.indexes.base.Index
Nevertheless, we can access the elements of this container using indexes within square brackets:
[11]:
df.columns[0]
[11]:
'ROW_ID'
[12]:
df.columns[1]
[12]:
'temp_cpu'
With the corr method we can see the correlation between the DataFrame columns.
[13]:
df.corr()
[13]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ROW_ID | 1.000000 | 0.361391 | 0.435289 | 0.339916 | -0.110241 | 0.243337 | -0.007741 | -0.390159 | -0.034877 | -0.035966 | -0.025818 | 0.108138 | -0.132291 | -0.918587 | -0.080899 | -0.001906 | -0.004913 | -0.000211 | -0.005202 |
temp_cpu | 0.361391 | 1.000000 | 0.986872 | 0.991672 | -0.297081 | 0.038065 | 0.008076 | -0.171644 | -0.117972 | 0.005145 | -0.285192 | -0.120838 | -0.023582 | -0.446358 | -0.029155 | 0.002511 | 0.005947 | -0.001250 | -0.002970 |
temp_h | 0.435289 | 0.986872 | 1.000000 | 0.993260 | -0.281422 | 0.070882 | 0.005145 | -0.199628 | -0.117870 | 0.000428 | -0.276276 | -0.098864 | -0.032188 | -0.510126 | -0.043213 | 0.001771 | 0.005020 | -0.001423 | -0.004325 |
temp_p | 0.339916 | 0.991672 | 0.993260 | 1.000000 | -0.288373 | 0.035496 | 0.006750 | -0.163685 | -0.118463 | 0.004338 | -0.283427 | -0.114407 | -0.018047 | -0.428884 | -0.036505 | 0.001829 | 0.006127 | -0.001623 | -0.004205 |
humidity | -0.110241 | -0.297081 | -0.281422 | -0.288373 | 1.000000 | 0.434374 | 0.004050 | 0.101304 | 0.031664 | -0.035146 | 0.077897 | 0.076424 | -0.009741 | 0.226281 | 0.005281 | 0.004345 | 0.003457 | 0.001298 | -0.002066 |
pressure | 0.243337 | 0.038065 | 0.070882 | 0.035496 | 0.434374 | 1.000000 | 0.003018 | 0.011815 | -0.051697 | -0.040183 | -0.074578 | 0.092352 | 0.013556 | -0.115642 | -0.221208 | -0.000611 | -0.002493 | -0.000615 | -0.006259 |
pitch | -0.007741 | 0.008076 | 0.005145 | 0.006750 | 0.004050 | 0.003018 | 1.000000 | 0.087941 | -0.011611 | 0.013331 | 0.006133 | 0.000540 | 0.043285 | 0.009015 | -0.039146 | 0.066618 | -0.015034 | 0.049340 | -0.000176 |
roll | -0.390159 | -0.171644 | -0.199628 | -0.163685 | 0.101304 | 0.011815 | 0.087941 | 1.000000 | 0.095354 | -0.020947 | 0.060297 | -0.080620 | 0.116637 | 0.462630 | -0.167905 | -0.115873 | -0.002509 | -0.214202 | 0.000636 |
yaw | -0.034877 | -0.117972 | -0.117870 | -0.118463 | 0.031664 | -0.051697 | -0.011611 | 0.095354 | 1.000000 | 0.257971 | 0.549394 | -0.328360 | 0.006943 | 0.044157 | -0.013634 | 0.003106 | 0.003665 | 0.004020 | -0.000558 |
mag_x | -0.035966 | 0.005145 | 0.000428 | 0.004338 | -0.035146 | -0.040183 | 0.013331 | -0.020947 | 0.257971 | 1.000000 | 0.001239 | -0.213070 | -0.006629 | 0.027921 | 0.021524 | -0.004954 | -0.004429 | -0.005052 | -0.002879 |
mag_y | -0.025818 | -0.285192 | -0.276276 | -0.283427 | 0.077897 | -0.074578 | 0.006133 | 0.060297 | 0.549394 | 0.001239 | 1.000000 | -0.266351 | 0.014057 | 0.051619 | -0.053016 | 0.001239 | 0.001063 | 0.001530 | -0.001335 |
mag_z | 0.108138 | -0.120838 | -0.098864 | -0.114407 | 0.076424 | 0.092352 | 0.000540 | -0.080620 | -0.328360 | -0.213070 | -0.266351 | 1.000000 | 0.024718 | -0.083914 | -0.061317 | -0.008470 | -0.009557 | -0.008997 | -0.002151 |
accel_x | -0.132291 | -0.023582 | -0.032188 | -0.018047 | -0.009741 | 0.013556 | 0.043285 | 0.116637 | 0.006943 | -0.006629 | 0.014057 | 0.024718 | 1.000000 | 0.095286 | -0.262305 | 0.035314 | 0.103449 | 0.197740 | 0.002173 |
accel_y | -0.918587 | -0.446358 | -0.510126 | -0.428884 | 0.226281 | -0.115642 | 0.009015 | 0.462630 | 0.044157 | 0.027921 | 0.051619 | -0.083914 | 0.095286 | 1.000000 | 0.120215 | 0.043263 | -0.046463 | 0.009541 | 0.004648 |
accel_z | -0.080899 | -0.029155 | -0.043213 | -0.036505 | 0.005281 | -0.221208 | -0.039146 | -0.167905 | -0.013634 | 0.021524 | -0.053016 | -0.061317 | -0.262305 | 0.120215 | 1.000000 | 0.078315 | -0.075625 | 0.057075 | 0.000554 |
gyro_x | -0.001906 | 0.002511 | 0.001771 | 0.001829 | 0.004345 | -0.000611 | 0.066618 | -0.115873 | 0.003106 | -0.004954 | 0.001239 | -0.008470 | 0.035314 | 0.043263 | 0.078315 | 1.000000 | -0.248968 | 0.337553 | 0.001009 |
gyro_y | -0.004913 | 0.005947 | 0.005020 | 0.006127 | 0.003457 | -0.002493 | -0.015034 | -0.002509 | 0.003665 | -0.004429 | 0.001063 | -0.009557 | 0.103449 | -0.046463 | -0.075625 | -0.248968 | 1.000000 | 0.190112 | 0.000593 |
gyro_z | -0.000211 | -0.001250 | -0.001423 | -0.001623 | 0.001298 | -0.000615 | 0.049340 | -0.214202 | 0.004020 | -0.005052 | 0.001530 | -0.008997 | 0.197740 | 0.009541 | 0.057075 | 0.337553 | 0.190112 | 1.000000 | -0.001055 |
reset | -0.005202 | -0.002970 | -0.004325 | -0.004205 | -0.002066 | -0.006259 | -0.000176 | 0.000636 | -0.000558 | -0.002879 | -0.001335 | -0.002151 | 0.002173 | 0.004648 | 0.000554 | 0.001009 | 0.000593 | -0.001055 | 1.000000 |
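If you only need the correlation between two specific columns, you can call corr on one Series and pass the other, without computing the whole matrix. A sketch on made-up values:

```python
import pandas as pd

d = pd.DataFrame({'a': [1, 2, 3, 4],
                  'b': [2, 4, 6, 8]})

# correlation of a single pair of columns
print(d['a'].corr(d['b']))   # 1.0, since b is exactly 2*a
```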
2.1 Exercise - meteo info¶
✪ a) Create a new dataframe called meteo by importing the data from the file meteo.csv, which contains the meteo data of Trento from November 2017 (source: https://www.meteotrentino.it). IMPORTANT: assign the dataframe to a variable called meteo (so we avoid confusion with the AstroPi dataframe).
Visualize info about this dataframe
[14]:
# write here - create dataframe
COLUMNS:
Index(['Date', 'Pressure', 'Rain', 'Temp'], dtype='object')
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2878 entries, 0 to 2877
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 2878 non-null object
1 Pressure 2878 non-null float64
2 Rain 2878 non-null float64
3 Temp 2878 non-null float64
dtypes: float64(3), object(1)
memory usage: 90.1+ KB
None
HEAD():
[14]:
Date | Pressure | Rain | Temp | |
---|---|---|---|---|
0 | 01/11/2017 00:00 | 995.4 | 0.0 | 5.4 |
1 | 01/11/2017 00:15 | 995.5 | 0.0 | 6.0 |
2 | 01/11/2017 00:30 | 995.5 | 0.0 | 5.9 |
3 | 01/11/2017 00:45 | 995.7 | 0.0 | 5.4 |
4 | 01/11/2017 01:00 | 995.7 | 0.0 | 5.3 |
3. MatPlotLib review¶
We’ve already seen Matplotlib in the part about visualization; today we use it to display data.
3.1 An example¶
Let’s revisit an example, using the Matlab-style approach. We will plot a line through two lists of coordinates, one for the xs and one for the ys:
[15]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
[16]:
xs = [1,2,3,4]
ys = [2,4,6,8]
plt.plot(xs, ys) # we can directly pass xs and ys lists
plt.title('Some number')
plt.show()
[plot: line through the given points]
We can also create the series with numpy. Let’s try making a parabola:
[17]:
import numpy as np
xs = np.arange(0.,5.,0.1)
# '**' is the power operator in Python, NOT '^'
ys = xs**2
Let’s use the type function to find out the data types of xs and ys:
[18]:
type(xs)
[18]:
numpy.ndarray
[19]:
type(ys)
[19]:
numpy.ndarray
Hence we have NumPy arrays.
Let’s plot it:
[20]:
plt.title('The parabola')
plt.plot(xs,ys);
[plot: the parabola]
If we want the same scale on both the x and y axes, we can use the gca function. To set the x and y limits, we can use xlim and ylim:
[21]:
plt.xlim([0, 5])
plt.ylim([0,10])
plt.title('The parabola')
plt.gca().set_aspect('equal')
plt.plot(xs,ys);
[plot: the parabola with equal aspect]
3.2 Matplotlib plots from pandas datastructures¶
We can get plots directly from pandas data structures, still using the matlab style. Let’s make a simple example; for more complex cases see the DataFrame.plot documentation.
With a big quantity of data, it may be useful to get a qualitative idea of the data by putting them in a plot:
[22]:
df.humidity.plot(label="Humidity", legend=True)
# with secondary_y=True we display the numbers for this series
# on the y axis of the graph on the right
df.pressure.plot(secondary_y=True, label="Pressure", legend=True);
[plot: humidity and pressure over time]
We can put the pressure values on the horizontal axis, and see which humidity values on the vertical axis correspond to a given pressure:
[23]:
plt.plot(df['pressure'], df['humidity'])
[23]:
[<matplotlib.lines.Line2D at 0x7f45431942d0>]
[plot: humidity vs pressure]
4. Operations on rows¶
If we consider the rows of a dataset, typically we will want to index, filter and order them.
4.1 Indexing with integers¶
We report here the simplest indexing, with row numbers. To obtain the i-th row as a Series you can use the method iloc[i] (here we reuse the AstroPi dataset):
[24]:
df.iloc[6]
[24]:
ROW_ID 7
temp_cpu 31.68
temp_h 27.53
temp_p 25.01
humidity 45.31
pressure 1001.7
pitch 0.63
roll 53.55
yaw 186.1
mag_x -50.447346
mag_y -7.937309
mag_z -12.188574
accel_x -0.00051
accel_y 0.019264
accel_z 0.014528
gyro_x -0.000111
gyro_y 0.00032
gyro_z 0.000222
reset 0
time_stamp 2016-02-16 10:45:41
Name: 6, dtype: object
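The Series returned by iloc can in turn be indexed by column name to get a single field. A sketch on a made-up dataframe:

```python
import pandas as pd

d = pd.DataFrame({'humidity': [44.94, 45.12, 45.31]})

row = d.iloc[1]          # the second row, as a Series
print(row['humidity'])   # 45.12
```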
It is possible to select a dataframe of contiguous positions by using slicing, as we already did for strings and lists. For example, here we select the rows from the 5th (included) to the 7th (excluded):
[25]:
df.iloc[5:7]
[25]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 31.69 | 27.55 | 25.01 | 45.12 | 1001.67 | 0.85 | 53.53 | 185.52 | -50.246476 | -8.343209 | -11.938124 | -0.000536 | 0.019453 | 0.014380 | 0.000273 | 0.000494 | -0.000059 | 0 | 2016-02-16 10:45:30 |
6 | 7 | 31.68 | 27.53 | 25.01 | 45.31 | 1001.70 | 0.63 | 53.55 | 186.10 | -50.447346 | -7.937309 | -12.188574 | -0.000510 | 0.019264 | 0.014528 | -0.000111 | 0.000320 | 0.000222 | 0 | 2016-02-16 10:45:41 |
By slicing the rows we can ‘zoom in’ on the dataset, selecting for example the rows between the 12500th (included) and the 15000th (excluded) into the new dataframe df2:
[26]:
df2=df.iloc[12500:15000]
[27]:
plt.plot(df2['pressure'], df2['humidity'])
[27]:
[<matplotlib.lines.Line2D at 0x7f454322c490>]
[plot: humidity vs pressure, rows 12500–15000]
[28]:
df2.humidity.plot(label="Humidity", legend=True)
df2.pressure.plot(secondary_y=True, label="Pressure", legend=True)
[28]:
<AxesSubplot:>
[plot: humidity and pressure, rows 12500–15000]
4.2 Filtering¶
It’s possible to filter data according to a condition the data should satisfy, which can be expressed by indicating a column and a comparison operator. For example:
[29]:
df.ROW_ID >= 6
[29]:
0 False
1 False
2 False
3 False
4 False
...
110864 True
110865 True
110866 True
110867 True
110868 True
Name: ROW_ID, Length: 110869, dtype: bool
We see it’s a series of True or False values, according to whether ROW_ID is greater than or equal to 6. What’s the type of this result?
[30]:
type(df.ROW_ID >= 6)
[30]:
pandas.core.series.Series
What is contained in this Series object? If we print it we will see again a series of True or False values, according to whether ROW_ID is greater than or equal to 6:
[31]:
df.ROW_ID >= 6
[31]:
0 False
1 False
2 False
3 False
4 False
...
110864 True
110865 True
110866 True
110867 True
110868 True
Name: ROW_ID, Length: 110869, dtype: bool
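Since True counts as 1 in a sum, such a boolean series also lets us count how many rows satisfy the condition. A sketch on made-up values:

```python
import pandas as pd

d = pd.DataFrame({'ROW_ID': [1, 2, 3, 4, 5, 6, 7]})

mask = d.ROW_ID >= 6   # boolean Series

# True counts as 1, so sum() counts the matching rows
print(mask.sum())      # 2
```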
Combining filters¶
It’s possible to combine conditions as we already did in Numpy filtering, for example by using the conjunction operator &.
If we write (df.ROW_ID >= 6) & (df.ROW_ID <= 10) we obtain a series of True or False values, True where ROW_ID is at the same time greater than or equal to 6 and less than or equal to 10.
[32]:
type((df.ROW_ID >= 6) & (df.ROW_ID <= 10))
[32]:
pandas.core.series.Series
If we want the complete rows of the dataframe which satisfy the condition, we can write:
IMPORTANT: we use df on the outside of the expression df[ ], with opening and closing square brackets, to tell Python we want to filter the df dataframe, and we use df again inside the brackets to tell on which columns and rows we want to filter
[33]:
df[ (df.ROW_ID >= 6) & (df.ROW_ID <= 10) ]
[33]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 31.69 | 27.55 | 25.01 | 45.12 | 1001.67 | 0.85 | 53.53 | 185.52 | -50.246476 | -8.343209 | -11.938124 | -0.000536 | 0.019453 | 0.014380 | 0.000273 | 0.000494 | -0.000059 | 0 | 2016-02-16 10:45:30 |
6 | 7 | 31.68 | 27.53 | 25.01 | 45.31 | 1001.70 | 0.63 | 53.55 | 186.10 | -50.447346 | -7.937309 | -12.188574 | -0.000510 | 0.019264 | 0.014528 | -0.000111 | 0.000320 | 0.000222 | 0 | 2016-02-16 10:45:41 |
7 | 8 | 31.66 | 27.55 | 25.01 | 45.34 | 1001.70 | 1.49 | 53.65 | 186.08 | -50.668232 | -7.762600 | -12.284196 | -0.000523 | 0.019473 | 0.014298 | -0.000044 | 0.000436 | 0.000301 | 0 | 2016-02-16 10:45:50 |
8 | 9 | 31.67 | 27.54 | 25.01 | 45.20 | 1001.72 | 1.22 | 53.77 | 186.55 | -50.761529 | -7.262934 | -11.981090 | -0.000522 | 0.019385 | 0.014286 | 0.000358 | 0.000651 | 0.000187 | 0 | 2016-02-16 10:46:01 |
9 | 10 | 31.67 | 27.54 | 25.01 | 45.41 | 1001.75 | 1.63 | 53.46 | 185.94 | -51.243832 | -6.875270 | -11.672494 | -0.000581 | 0.019390 | 0.014441 | 0.000266 | 0.000676 | 0.000356 | 0 | 2016-02-16 10:46:10 |
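Besides &, conditions can be combined with | (or) and negated with ~; the parentheses around each condition are mandatory. A sketch on made-up values:

```python
import pandas as pd

d = pd.DataFrame({'ROW_ID': [1, 5, 6, 9, 11]})

# rows outside the interval [6, 10]: 'or' is written |
outside = d[(d.ROW_ID < 6) | (d.ROW_ID > 10)]
print(list(outside.ROW_ID))   # [1, 5, 11]

# ~ negates a boolean series: rows inside the interval
inside = d[~((d.ROW_ID < 6) | (d.ROW_ID > 10))]
print(list(inside.ROW_ID))    # [6, 9]
```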
So if we want to find the record where the pressure is maximal, we use the values property of the series, on which we calculate the maximum value:
[34]:
df[ (df.pressure == df.pressure.values.max()) ]
[34]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
77602 | 77603 | 32.44 | 28.31 | 25.74 | 47.57 | 1021.78 | 1.1 | 51.82 | 267.39 | -0.797428 | 10.891803 | -15.728202 | -0.000612 | 0.01817 | 0.014295 | -0.000139 | -0.000179 | -0.000298 | 0 | 2016-02-25 12:13:20 |
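An alternative way to reach the row of the maximum is the idxmax method, which returns the index label where the maximum occurs; we sketch it here on made-up values:

```python
import pandas as pd

d = pd.DataFrame({'pressure': [1001.68, 1021.78, 1005.86]})

# idxmax returns the index label of the maximum value
i = d['pressure'].idxmax()
print(i)                     # 1
print(d.loc[i, 'pressure'])  # 1021.78
```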
4.3 Sorting¶
To obtain a NEW dataframe sorted according to one or more columns, we can use the sort_values method:
[35]:
df.sort_values('pressure',ascending=False).head()
[35]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
77602 | 77603 | 32.44 | 28.31 | 25.74 | 47.57 | 1021.78 | 1.10 | 51.82 | 267.39 | -0.797428 | 10.891803 | -15.728202 | -0.000612 | 0.018170 | 0.014295 | -0.000139 | -0.000179 | -0.000298 | 0 | 2016-02-25 12:13:20 |
77601 | 77602 | 32.45 | 28.30 | 25.74 | 47.26 | 1021.75 | 1.53 | 51.76 | 266.12 | -1.266335 | 10.927442 | -15.690558 | -0.000661 | 0.018357 | 0.014533 | 0.000152 | 0.000459 | -0.000298 | 0 | 2016-02-25 12:13:10 |
77603 | 77604 | 32.44 | 28.30 | 25.74 | 47.29 | 1021.75 | 1.86 | 51.83 | 268.83 | -0.320795 | 10.651441 | -15.565123 | -0.000648 | 0.018290 | 0.014372 | 0.000049 | 0.000473 | -0.000029 | 0 | 2016-02-25 12:13:30 |
77604 | 77605 | 32.43 | 28.30 | 25.74 | 47.39 | 1021.75 | 1.78 | 51.54 | 269.41 | -0.130574 | 10.628383 | -15.488983 | -0.000672 | 0.018154 | 0.014602 | 0.000360 | 0.000089 | -0.000002 | 0 | 2016-02-25 12:13:40 |
77608 | 77609 | 32.42 | 28.29 | 25.74 | 47.36 | 1021.73 | 0.86 | 51.89 | 272.77 | 0.952025 | 10.435951 | -16.027235 | -0.000607 | 0.018186 | 0.014232 | -0.000260 | -0.000059 | -0.000187 | 0 | 2016-02-25 12:14:20 |
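sort_values also accepts a list of columns, and a list of booleans in ascending so each column gets its own direction. A sketch on made-up values:

```python
import pandas as pd

d = pd.DataFrame({'x': [2, 1, 2],
                  'y': [5, 7, 3]})

# sort by x ascending, breaking ties with y descending
s = d.sort_values(['x', 'y'], ascending=[True, False])
print(list(s.x))   # [1, 2, 2]
print(list(s.y))   # [7, 5, 3]
```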
4.4 Exercise - Meteo stats¶
✪ Analyze the data from the dataframe meteo to find:
the average, minimum and maximum pressure values
the average temperature
the dates of rainy days
[36]:
# write here
Average pressure : 986.3408269631689
Minimal pressure : 966.3
Maximal pressure : 998.3
Average temperature : 6.410701876302988
[36]:
Date | Pressure | Rain | Temp | |
---|---|---|---|---|
433 | 05/11/2017 12:15 | 979.2 | 0.2 | 8.6 |
435 | 05/11/2017 12:45 | 978.9 | 0.2 | 8.4 |
436 | 05/11/2017 13:00 | 979.0 | 0.2 | 8.4 |
437 | 05/11/2017 13:15 | 979.1 | 0.8 | 8.2 |
438 | 05/11/2017 13:30 | 979.0 | 0.6 | 8.2 |
... | ... | ... | ... | ... |
2754 | 29/11/2017 17:15 | 976.1 | 0.2 | 0.9 |
2755 | 29/11/2017 17:30 | 975.9 | 0.2 | 0.9 |
2802 | 30/11/2017 05:15 | 971.3 | 0.2 | 1.3 |
2803 | 30/11/2017 05:30 | 971.3 | 0.2 | 1.1 |
2804 | 30/11/2017 05:45 | 971.5 | 0.2 | 1.1 |
107 rows × 4 columns
5. Object values and strings¶
In general, when we want to manipulate objects of a known type, say strings which have type str, we can write .str after a series and then treat the result as if it were a single string, using any operator (e.g. slicing) or method that particular class allows us, plus others provided by pandas.
Text in particular can be manipulated in many ways; for more details see the pandas documentation.
5.1 Filter by textual values¶
When we want to filter by text values, we can use .str.contains; here for example we select all the samples from the last days of February (whose timestamp contains 2016-02-2):
[37]:
df[ df['time_stamp'].str.contains('2016-02-2') ]
[37]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30442 | 30443 | 32.30 | 28.12 | 25.59 | 45.05 | 1008.01 | 1.47 | 51.82 | 51.18 | 9.215883 | -12.947023 | 4.066202 | -0.000612 | 0.018792 | 0.014558 | -0.000042 | 0.000275 | 0.000157 | 0 | 2016-02-20 00:00:00 |
30443 | 30444 | 32.25 | 28.13 | 25.59 | 44.82 | 1008.02 | 0.81 | 51.53 | 52.21 | 8.710130 | -13.143595 | 3.499386 | -0.000718 | 0.019290 | 0.014667 | 0.000260 | 0.001011 | 0.000149 | 0 | 2016-02-20 00:00:10 |
30444 | 30445 | 33.07 | 28.13 | 25.59 | 45.08 | 1008.09 | 0.68 | 51.69 | 57.36 | 7.383435 | -13.827667 | 4.438656 | -0.000700 | 0.018714 | 0.014598 | 0.000299 | 0.000343 | -0.000025 | 0 | 2016-02-20 00:00:41 |
30445 | 30446 | 32.63 | 28.10 | 25.60 | 44.87 | 1008.07 | 1.42 | 52.13 | 59.95 | 7.292313 | -13.999682 | 4.517029 | -0.000657 | 0.018857 | 0.014565 | 0.000160 | 0.000349 | -0.000190 | 0 | 2016-02-20 00:00:50 |
30446 | 30447 | 32.55 | 28.11 | 25.60 | 44.94 | 1008.07 | 1.41 | 51.86 | 61.83 | 6.699141 | -14.065591 | 4.448778 | -0.000678 | 0.018871 | 0.014564 | -0.000608 | -0.000381 | -0.000243 | 0 | 2016-02-20 00:01:01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
110864 | 110865 | 31.56 | 27.52 | 24.83 | 42.94 | 1005.83 | 1.58 | 49.93 | 129.60 | -15.169673 | -27.642610 | 1.563183 | -0.000682 | 0.017743 | 0.014646 | -0.000264 | 0.000206 | 0.000196 | 0 | 2016-02-29 09:24:21 |
110865 | 110866 | 31.55 | 27.50 | 24.83 | 42.72 | 1005.85 | 1.89 | 49.92 | 130.51 | -15.832622 | -27.729389 | 1.785682 | -0.000736 | 0.017570 | 0.014855 | 0.000143 | 0.000199 | -0.000024 | 0 | 2016-02-29 09:24:30 |
110866 | 110867 | 31.58 | 27.50 | 24.83 | 42.83 | 1005.85 | 2.09 | 50.00 | 132.04 | -16.646212 | -27.719479 | 1.629533 | -0.000647 | 0.017657 | 0.014799 | 0.000537 | 0.000257 | 0.000057 | 0 | 2016-02-29 09:24:41 |
110867 | 110868 | 31.62 | 27.50 | 24.83 | 42.81 | 1005.88 | 2.88 | 49.69 | 133.00 | -17.270447 | -27.793136 | 1.703806 | -0.000835 | 0.017635 | 0.014877 | 0.000534 | 0.000456 | 0.000195 | 0 | 2016-02-29 09:24:50 |
110868 | 110869 | 31.57 | 27.51 | 24.83 | 42.94 | 1005.86 | 2.17 | 49.77 | 134.18 | -17.885872 | -27.824149 | 1.293345 | -0.000787 | 0.017261 | 0.014380 | 0.000459 | 0.000076 | 0.000030 | 0 | 2016-02-29 09:25:00 |
80427 rows × 20 columns
5.2 Extracting strings¶
To extract only the day from the timestamp column, we can use str and the slice operator with square brackets:
[38]:
df['time_stamp'].str[8:10]
[38]:
0 16
1 16
2 16
3 16
4 16
..
110864 29
110865 29
110866 29
110867 29
110868 29
Name: time_stamp, Length: 110869, dtype: object
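An alternative to slicing substrings by hand is parsing the column into real datetimes with pd.to_datetime, after which the dt accessor exposes the components as numbers. A sketch on two made-up timestamps:

```python
import pandas as pd

s = pd.Series(['2016-02-16 10:44:40', '2016-02-29 09:25:00'])

# parse the strings into datetime64 values
ts = pd.to_datetime(s)

# the dt accessor gives numeric components instead of substrings
print(list(ts.dt.day))   # [16, 29]
```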
6. Operations on columns¶
Let’s see now how to select, add and transform columns.
6.1 - Selecting columns¶
If we want a subset of columns, we can express the names in a list like this:
NOTE: inside the external square brackets there is a simple list, without df!
[39]:
df[ ['temp_h', 'temp_p', 'time_stamp'] ]
[39]:
temp_h | temp_p | time_stamp | |
---|---|---|---|
0 | 27.57 | 25.01 | 2016-02-16 10:44:40 |
1 | 27.53 | 25.01 | 2016-02-16 10:44:50 |
2 | 27.53 | 25.01 | 2016-02-16 10:45:00 |
3 | 27.52 | 25.01 | 2016-02-16 10:45:10 |
4 | 27.54 | 25.01 | 2016-02-16 10:45:20 |
... | ... | ... | ... |
110864 | 27.52 | 24.83 | 2016-02-29 09:24:21 |
110865 | 27.50 | 24.83 | 2016-02-29 09:24:30 |
110866 | 27.50 | 24.83 | 2016-02-29 09:24:41 |
110867 | 27.50 | 24.83 | 2016-02-29 09:24:50 |
110868 | 27.51 | 24.83 | 2016-02-29 09:25:00 |
110869 rows × 3 columns
As always, selecting columns doesn’t change the original dataframe:
[40]:
df.head()
[40]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 31.88 | 27.57 | 25.01 | 44.94 | 1001.68 | 1.49 | 52.25 | 185.21 | -46.422753 | -8.132907 | -12.129346 | -0.000468 | 0.019439 | 0.014569 | 0.000942 | 0.000492 | -0.000750 | 20 | 2016-02-16 10:44:40 |
1 | 2 | 31.79 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.03 | 53.73 | 186.72 | -48.778951 | -8.304243 | -12.943096 | -0.000614 | 0.019436 | 0.014577 | 0.000218 | -0.000005 | -0.000235 | 0 | 2016-02-16 10:44:50 |
2 | 3 | 31.66 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.24 | 53.57 | 186.21 | -49.161878 | -8.470832 | -12.642772 | -0.000569 | 0.019359 | 0.014357 | 0.000395 | 0.000600 | -0.000003 | 0 | 2016-02-16 10:45:00 |
3 | 4 | 31.69 | 27.52 | 25.01 | 45.32 | 1001.69 | 1.57 | 53.63 | 186.03 | -49.341941 | -8.457380 | -12.615509 | -0.000575 | 0.019383 | 0.014409 | 0.000308 | 0.000577 | -0.000102 | 0 | 2016-02-16 10:45:10 |
4 | 5 | 31.66 | 27.54 | 25.01 | 45.18 | 1001.71 | 0.85 | 53.66 | 186.46 | -50.056683 | -8.122609 | -12.678341 | -0.000548 | 0.019378 | 0.014380 | 0.000321 | 0.000691 | 0.000272 | 0 | 2016-02-16 10:45:20 |
6.2 - Adding columns¶
It’s possible to obtain new columns by calculating them from other columns in a very natural way. For example, here we compute the new column mag_tot, the total magnetic field measured on the space station, from mag_x, mag_y and mag_z, and then plot it:
[41]:
df['mag_tot'] = df['mag_x']**2 + df['mag_y']**2 + df['mag_z']**2
df.mag_tot.plot()
[41]:
<AxesSubplot:>

Let’s find when the magnetic field was maximal:
[42]:
df['time_stamp'][(df.mag_tot == df.mag_tot.values.max())]
[42]:
96156 2016-02-27 16:12:31
Name: time_stamp, dtype: object
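An equivalent, slightly more compact way (a sketch on hypothetical toy data, not the real dataset) is idxmax, which returns the index label of the first occurrence of the maximum, so the timestamp can be looked up with a single loc call:

```python
import pandas as pd

# toy frame with a hypothetical mag_tot column
toy = pd.DataFrame({'mag_tot': [10.0, 99.5, 42.0],
                    'time_stamp': ['2016-02-16 10:44:40',
                                   '2016-02-27 16:12:31',
                                   '2016-02-29 09:25:00']})

# index label of the (first) maximum, then look up the timestamp
when = toy.loc[toy['mag_tot'].idxmax(), 'time_stamp']
```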
By looking up the value we found on the website isstracker.com/historical, we can find the position of the station when the magnetic field was at its highest.
Writing only in some rows¶
The loc
property allows us to filter rows according to a condition and select a column, which can be a new one. In this case, for the rows where the cpu temperature is too high, we write the value True
in the fields of the new column with header 'Too hot'
:
[43]:
df.loc[(df.temp_cpu > 31.68),'Too hot'] = True
Let’s see the resulting table (scroll to the end to see the new column). Note that the values from the rows we did not filter are represented as NaN, which literally means not a number:
[44]:
df.head()
[44]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | ... | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | time_stamp | mag_tot | Too hot | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 31.88 | 27.57 | 25.01 | 44.94 | 1001.68 | 1.49 | 52.25 | 185.21 | -46.422753 | ... | -0.000468 | 0.019439 | 0.014569 | 0.000942 | 0.000492 | -0.000750 | 20 | 2016-02-16 10:44:40 | 2368.337207 | True |
1 | 2 | 31.79 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.03 | 53.73 | 186.72 | -48.778951 | ... | -0.000614 | 0.019436 | 0.014577 | 0.000218 | -0.000005 | -0.000235 | 0 | 2016-02-16 10:44:50 | 2615.870247 | True |
2 | 3 | 31.66 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.24 | 53.57 | 186.21 | -49.161878 | ... | -0.000569 | 0.019359 | 0.014357 | 0.000395 | 0.000600 | -0.000003 | 0 | 2016-02-16 10:45:00 | 2648.484927 | NaN |
3 | 4 | 31.69 | 27.52 | 25.01 | 45.32 | 1001.69 | 1.57 | 53.63 | 186.03 | -49.341941 | ... | -0.000575 | 0.019383 | 0.014409 | 0.000308 | 0.000577 | -0.000102 | 0 | 2016-02-16 10:45:10 | 2665.305485 | True |
4 | 5 | 31.66 | 27.54 | 25.01 | 45.18 | 1001.71 | 0.85 | 53.66 | 186.46 | -50.056683 | ... | -0.000548 | 0.019378 | 0.014380 | 0.000321 | 0.000691 | 0.000272 | 0 | 2016-02-16 10:45:20 | 2732.388620 | NaN |
5 rows × 22 columns
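To make the mechanics of the conditional write explicit, here is the same .loc pattern on a tiny toy dataframe with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({'temp_cpu': [31.88, 31.66, 31.79]})

# write True only into the rows matching the condition;
# the column is created on the fly, unmatched rows get NaN
toy.loc[toy.temp_cpu > 31.70, 'Too hot'] = True
```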
Pandas is a very flexible library, and gives several methods to obtain the same results. For example, we can perform an operation like the one above with the command np.where
as shown below: we add a column telling whether the pressure is above or below the average:
[45]:
avg_pressure = df.pressure.values.mean()
df['check_p'] = np.where(df.pressure <= avg_pressure, 'below', 'over')
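The same np.where pattern on a toy dataframe with hypothetical pressures, to show the element-wise condition / value-if-true / value-if-false structure:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'pressure': [1001.0, 1010.0, 1005.0]})

avg = toy.pressure.mean()
# np.where(condition, value_if_true, value_if_false), applied element-wise
toy['check_p'] = np.where(toy.pressure <= avg, 'below', 'over')
```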
6.2.1 Exercise: Meteo temperature in Fahrenheit¶
In meteo
dataframe, create a column Temp (Fahrenheit)
with the temperature measured in Fahrenheit degrees.
Formula to calculate conversion from Celsius degrees (C):
\(Fahrenheit = \frac{9}{5}C + 32\)
[46]:
# write here
[47]:
************** SOLUTION OUTPUT **************
[47]:
Date | Pressure | Rain | Temp | Temp (Fahrenheit) | |
---|---|---|---|---|---|
0 | 01/11/2017 00:00 | 995.4 | 0.0 | 5.4 | 41.72 |
1 | 01/11/2017 00:15 | 995.5 | 0.0 | 6.0 | 42.80 |
2 | 01/11/2017 00:30 | 995.5 | 0.0 | 5.9 | 42.62 |
3 | 01/11/2017 00:45 | 995.7 | 0.0 | 5.4 | 41.72 |
4 | 01/11/2017 01:00 | 995.7 | 0.0 | 5.3 | 41.54 |
6.2.2 Exercise - Pressure vs Temperature¶
Pressure should be directly proportional to temperature in a closed environment, according to Gay-Lussac’s law:
\(\frac{P}{T} = k\)
Does this hold true for the meteo
dataset? Try to find out by direct calculation of the formula and compare with corr()
method results.
[48]:
6.3 Transforming columns¶
Suppose we want to convert all the values of the column humidity (which are floats) into integers.
We know that to convert a float into an integer there is the predefined Python function int
[49]:
int(23.7)
[49]:
23
We would like to apply this function to all the elements of the column humidity
.
To do so, we can call the transform
method and pass to it the function int
as a parameter
NOTE: there are no round parentheses after int
!!!
[50]:
df['humidity'].transform(int)
[50]:
0 44
1 45
2 45
3 45
4 45
..
110864 42
110865 42
110866 42
110867 42
110868 42
Name: humidity, Length: 110869, dtype: int64
Just to be clear what passing a function means, let’s see two other completely equivalent ways we could have used to pass the function.
Defining a function: We could have defined a function myf
like this (notice the function MUST RETURN something !)
[51]:
def myf(x):
return int(x)
df['humidity'].transform(myf)
[51]:
0 44
1 45
2 45
3 45
4 45
..
110864 42
110865 42
110866 42
110867 42
110868 42
Name: humidity, Length: 110869, dtype: int64
lambda function: We could just as well have used a lambda
function, that is, a function without a name which is defined on one line:
[52]:
df['humidity'].transform( lambda x: int(x) )
[52]:
0 44
1 45
2 45
3 45
4 45
..
110864 42
110865 42
110866 42
110867 42
110868 42
Name: humidity, Length: 110869, dtype: int64
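To convince ourselves that the three forms really are equivalent, we can compare them on a small toy series (hypothetical values):

```python
import pandas as pd

s = pd.Series([44.94, 45.12, 42.81])

def myf(x):
    return int(x)

a = s.transform(int)                # built-in passed directly
b = s.transform(myf)                # named function
c = s.transform(lambda x: int(x))   # anonymous lambda

# all three produce the same integer series
```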
Regardless of the way we choose to pass the function, the transform
method does not change the original dataframe:
[53]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ROW_ID 110869 non-null int64
1 temp_cpu 110869 non-null float64
2 temp_h 110869 non-null float64
3 temp_p 110869 non-null float64
4 humidity 110869 non-null float64
5 pressure 110869 non-null float64
6 pitch 110869 non-null float64
7 roll 110869 non-null float64
8 yaw 110869 non-null float64
9 mag_x 110869 non-null float64
10 mag_y 110869 non-null float64
11 mag_z 110869 non-null float64
12 accel_x 110869 non-null float64
13 accel_y 110869 non-null float64
14 accel_z 110869 non-null float64
15 gyro_x 110869 non-null float64
16 gyro_y 110869 non-null float64
17 gyro_z 110869 non-null float64
18 reset 110869 non-null int64
19 time_stamp 110869 non-null object
20 mag_tot 110869 non-null float64
21 Too hot 105315 non-null object
22 check_p 110869 non-null object
dtypes: float64(18), int64(2), object(3)
memory usage: 19.5+ MB
If we want to add a new column, say humidity_int
, we have to explicitly assign the result of transform
to a new series:
[54]:
df['humidity_int'] = df['humidity'].transform( lambda x: int(x) )
Notice how pandas automatically infers type int64
for the newly created column:
[55]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ROW_ID 110869 non-null int64
1 temp_cpu 110869 non-null float64
2 temp_h 110869 non-null float64
3 temp_p 110869 non-null float64
4 humidity 110869 non-null float64
5 pressure 110869 non-null float64
6 pitch 110869 non-null float64
7 roll 110869 non-null float64
8 yaw 110869 non-null float64
9 mag_x 110869 non-null float64
10 mag_y 110869 non-null float64
11 mag_z 110869 non-null float64
12 accel_x 110869 non-null float64
13 accel_y 110869 non-null float64
14 accel_z 110869 non-null float64
15 gyro_x 110869 non-null float64
16 gyro_y 110869 non-null float64
17 gyro_z 110869 non-null float64
18 reset 110869 non-null int64
19 time_stamp 110869 non-null object
20 mag_tot 110869 non-null float64
21 Too hot 105315 non-null object
22 check_p 110869 non-null object
23 humidity_int 110869 non-null int64
dtypes: float64(18), int64(3), object(3)
memory usage: 20.3+ MB
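As an aside on type conversion: for plain casts like this one, pandas also offers the astype method, which is usually the idiomatic choice. A minimal sketch on a toy series (hypothetical values):

```python
import pandas as pd

s = pd.Series([44.94, 45.12, 42.81])

# astype truncates toward zero when casting float -> int,
# just like the built-in int()
converted = s.astype(int)
```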
7. Grouping¶
It’s pretty easy to group items and perform aggregated calculations by using groupby
method. Let’s say we want to count how many humidity
readings were taken for each integer humidity value humidity_int
(here we use pandas groupby
, but for histograms you could also use numpy)
After groupby
we can use the count()
aggregation function (other common ones are sum()
, min()
, max()
, mean()
):
[56]:
df.groupby(['humidity_int'])['humidity'].count()
[56]:
humidity_int
42 2776
43 2479
44 13029
45 32730
46 35775
47 14176
48 7392
49 297
50 155
51 205
52 209
53 128
54 224
55 164
56 139
57 183
58 237
59 271
60 300
Name: humidity, dtype: int64
Notice we’ve got only 19 rows. To have a series that fills the whole table, assigning to each row the count of its own group, we can use transform
like this:
[57]:
df.groupby(['humidity_int'])['humidity'].transform('count')
[57]:
0 13029
1 32730
2 32730
3 32730
4 32730
...
110864 2776
110865 2776
110866 2776
110867 2776
110868 2776
Name: humidity, Length: 110869, dtype: int64
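The difference between the count() aggregation (one row per group) and transform('count') (the group count broadcast back to every original row) can be seen on a toy dataframe (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({'humidity_int': [44, 45, 45, 45],
                    'humidity':     [44.9, 45.1, 45.1, 45.3]})

# one row per group: 44 -> 1 reading, 45 -> 3 readings
agg = toy.groupby('humidity_int')['humidity'].count()

# one row per original row, with the count of its own group
full = toy.groupby('humidity_int')['humidity'].transform('count')
```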
As usual, groupby
does not modify the dataframe, if we want the result stored in the dataframe we need to assign the result to a new column:
[58]:
df['Humidity counts'] = df.groupby(['humidity_int'])['humidity'].transform('count')
[59]:
df
[59]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | ... | gyro_x | gyro_y | gyro_z | reset | time_stamp | mag_tot | Too hot | check_p | humidity_int | Humidity counts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 31.88 | 27.57 | 25.01 | 44.94 | 1001.68 | 1.49 | 52.25 | 185.21 | -46.422753 | ... | 0.000942 | 0.000492 | -0.000750 | 20 | 2016-02-16 10:44:40 | 2368.337207 | True | below | 44 | 13029 |
1 | 2 | 31.79 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.03 | 53.73 | 186.72 | -48.778951 | ... | 0.000218 | -0.000005 | -0.000235 | 0 | 2016-02-16 10:44:50 | 2615.870247 | True | below | 45 | 32730 |
2 | 3 | 31.66 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.24 | 53.57 | 186.21 | -49.161878 | ... | 0.000395 | 0.000600 | -0.000003 | 0 | 2016-02-16 10:45:00 | 2648.484927 | NaN | below | 45 | 32730 |
3 | 4 | 31.69 | 27.52 | 25.01 | 45.32 | 1001.69 | 1.57 | 53.63 | 186.03 | -49.341941 | ... | 0.000308 | 0.000577 | -0.000102 | 0 | 2016-02-16 10:45:10 | 2665.305485 | True | below | 45 | 32730 |
4 | 5 | 31.66 | 27.54 | 25.01 | 45.18 | 1001.71 | 0.85 | 53.66 | 186.46 | -50.056683 | ... | 0.000321 | 0.000691 | 0.000272 | 0 | 2016-02-16 10:45:20 | 2732.388620 | NaN | below | 45 | 32730 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
110864 | 110865 | 31.56 | 27.52 | 24.83 | 42.94 | 1005.83 | 1.58 | 49.93 | 129.60 | -15.169673 | ... | -0.000264 | 0.000206 | 0.000196 | 0 | 2016-02-29 09:24:21 | 996.676408 | NaN | below | 42 | 2776 |
110865 | 110866 | 31.55 | 27.50 | 24.83 | 42.72 | 1005.85 | 1.89 | 49.92 | 130.51 | -15.832622 | ... | 0.000143 | 0.000199 | -0.000024 | 0 | 2016-02-29 09:24:30 | 1022.779594 | NaN | below | 42 | 2776 |
110866 | 110867 | 31.58 | 27.50 | 24.83 | 42.83 | 1005.85 | 2.09 | 50.00 | 132.04 | -16.646212 | ... | 0.000537 | 0.000257 | 0.000057 | 0 | 2016-02-29 09:24:41 | 1048.121268 | NaN | below | 42 | 2776 |
110867 | 110868 | 31.62 | 27.50 | 24.83 | 42.81 | 1005.88 | 2.88 | 49.69 | 133.00 | -17.270447 | ... | 0.000534 | 0.000456 | 0.000195 | 0 | 2016-02-29 09:24:50 | 1073.629703 | NaN | below | 42 | 2776 |
110868 | 110869 | 31.57 | 27.51 | 24.83 | 42.94 | 1005.86 | 2.17 | 49.77 | 134.18 | -17.885872 | ... | 0.000459 | 0.000076 | 0.000030 | 0 | 2016-02-29 09:25:00 | 1095.760426 | NaN | below | 42 | 2776 |
110869 rows × 25 columns
8. Exercise - meteo average temperatures¶
8.1 meteo plot¶
✪ Plot the temperature from the dataframe meteo:
[60]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
# write here
[61]:
[61]:
<AxesSubplot:>

8.2 meteo pressure and rain¶
✪ In the same plot as above, show the pressure and the amount of rain.
[62]:
# write here
[63]:

8.3 meteo average temperature¶
✪✪✪ Calculate the average temperature for each day and show it in the plot, so as to have a couple of new columns like these:
Day Avg_day_temp
01/11/2017 7.983333
01/11/2017 7.983333
01/11/2017 7.983333
. .
. .
02/11/2017 7.384375
02/11/2017 7.384375
02/11/2017 7.384375
. .
. .
HINT 1: add 'Day'
column by extracting only the day from the date. To do it, use the .str
accessor, which applies string operations (like slicing) to the whole column.
HINT 2: There are various ways to solve the exercise:
The most performant and elegant way is with the
groupby
operator, see Pandas transform - more than meets the eye
As an alternative, you may use a
for
loop to cycle through the days. Typically, using a for
loop is not a good idea with Pandas, as on large datasets it can take a long time to perform the updates. Still, since this dataset is small enough, you should get results in a decent amount of time.
[64]:
# write here
[65]:
**************** SOLUTION 1 OUTPUT - recalculate average for every row - slow !
WITH AVERAGE TEMPERATURE
Date Pressure Rain Temp Day Avg_day_temp
0 01/11/2017 00:00 995.4 0.0 5.4 01/11/2017 7.983333
1 01/11/2017 00:15 995.5 0.0 6.0 01/11/2017 7.983333
2 01/11/2017 00:30 995.5 0.0 5.9 01/11/2017 7.983333
3 01/11/2017 00:45 995.7 0.0 5.4 01/11/2017 7.983333
4 01/11/2017 01:00 995.7 0.0 5.3 01/11/2017 7.983333
[65]:
<AxesSubplot:>

[66]:
******** SOLUTION 2 OUTPUT - recalculate average only 30 times
by using a dictionary d_avg, faster but not yet optimal
Date Pressure Rain Temp Day Avg_day_temp
0 01/11/2017 00:00 995.4 0.0 5.4 01/11/2017 7.983333
1 01/11/2017 00:15 995.5 0.0 6.0 01/11/2017 7.983333
2 01/11/2017 00:30 995.5 0.0 5.9 01/11/2017 7.983333
3 01/11/2017 00:45 995.7 0.0 5.4 01/11/2017 7.983333
4 01/11/2017 01:00 995.7 0.0 5.3 01/11/2017 7.983333
[66]:
<AxesSubplot:>

[67]:
**************** SOLUTION 3 - OUTPUT - best solution with groupby and transform
WITH AVERAGE TEMPERATURE
Date Pressure Rain Temp Day Avg_day_temp
0 01/11/2017 00:00 995.4 0.0 5.4 01/11/2017 7.983333
1 01/11/2017 00:15 995.5 0.0 6.0 01/11/2017 7.983333
2 01/11/2017 00:30 995.5 0.0 5.9 01/11/2017 7.983333
3 01/11/2017 00:45 995.7 0.0 5.4 01/11/2017 7.983333
4 01/11/2017 01:00 995.7 0.0 5.3 01/11/2017 7.983333
[67]:
<AxesSubplot:>

9 Exercise - Air pollutants¶
Let’s try analyzing the hourly data from the air quality monitoring stations of the Autonomous Province of Trento.
Source: dati.trentino.it
9.1 - load the file¶
✪ Load the file aria.csv in pandas
IMPORTANT: put the dataframe into the variable aria
, so as not to confuse it with the previous datasets.
IMPORTANT: use encoding 'latin-1'
(otherwise you might get weird load errors according to your operating system)
IMPORTANT: if you also receive other strange errors, try adding the parameter engine='python'
[68]:
# write here
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20693 entries, 0 to 20692
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Stazione 20693 non-null object
1 Inquinante 20693 non-null object
2 Data 20693 non-null object
3 Ora 20693 non-null int64
4 Valore 20693 non-null float64
5 Unità di misura 20693 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 970.1+ KB
9.2 - pollutants average¶
✪ Find the average of the PM10
pollutant at Parco S. Chiara
(average over all days). You should obtain the value 11.385752688172044
[69]:
# write here
[69]:
11.385752688172044
9.3 - PM10 chart¶
✪ Use plt.plot
as seen in a previous example (i.e. by directly passing it the relevant Pandas series), show in a chart the values of PM10
during May 7th, 2019.
[72]:
# write here
[73]:

10. Merging tables¶
Suppose we want to add a column with the geographical position of the ISS. To do so, we need to join our dataset with another one containing such information. Let’s take for example the dataset iss-coords.csv
[74]:
iss_coords = pd.read_csv('iss-coords.csv', encoding='UTF-8')
[75]:
iss_coords
[75]:
timestamp | lat | lon | |
---|---|---|---|
0 | 2016-01-01 05:11:30 | -45.103458 | 14.083858 |
1 | 2016-01-01 06:49:59 | -37.597242 | 28.931170 |
2 | 2016-01-01 11:52:30 | 17.126141 | 77.535602 |
3 | 2016-01-01 11:52:30 | 17.126464 | 77.535861 |
4 | 2016-01-01 14:54:08 | 7.259561 | 70.001561 |
... | ... | ... | ... |
333 | 2016-02-29 13:23:17 | -51.077590 | -31.093987 |
334 | 2016-02-29 13:44:13 | 30.688553 | -135.403820 |
335 | 2016-02-29 13:44:13 | 30.688295 | -135.403533 |
336 | 2016-02-29 18:44:57 | 27.608774 | -130.198781 |
337 | 2016-02-29 21:36:47 | 27.325186 | -129.893278 |
338 rows × 3 columns
We notice there is a timestamp
column, which unfortunately has a slightly different name than the time_stamp
(notice the underscore _
) column in the original astropi dataset:
[76]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ROW_ID 110869 non-null int64
1 temp_cpu 110869 non-null float64
2 temp_h 110869 non-null float64
3 temp_p 110869 non-null float64
4 humidity 110869 non-null float64
5 pressure 110869 non-null float64
6 pitch 110869 non-null float64
7 roll 110869 non-null float64
8 yaw 110869 non-null float64
9 mag_x 110869 non-null float64
10 mag_y 110869 non-null float64
11 mag_z 110869 non-null float64
12 accel_x 110869 non-null float64
13 accel_y 110869 non-null float64
14 accel_z 110869 non-null float64
15 gyro_x 110869 non-null float64
16 gyro_y 110869 non-null float64
17 gyro_z 110869 non-null float64
18 reset 110869 non-null int64
19 time_stamp 110869 non-null object
20 mag_tot 110869 non-null float64
21 Too hot 105315 non-null object
22 check_p 110869 non-null object
23 humidity_int 110869 non-null int64
24 Humidity counts 110869 non-null int64
dtypes: float64(18), int64(4), object(3)
memory usage: 21.1+ MB
To merge the datasets on those columns, we can use the command merge
like this:
[77]:
# remember merge produces a NEW dataframe
geo_astropi = df.merge(iss_coords, left_on='time_stamp', right_on='timestamp')
# merge will add both time_stamp and timestamp columns,
# so we remove the duplicate column `timestamp`
geo_astropi = geo_astropi.drop('timestamp', axis=1)
[78]:
geo_astropi
[78]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | ... | gyro_z | reset | time_stamp | mag_tot | Too hot | check_p | humidity_int | Humidity counts | lat | lon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23231 | 32.53 | 28.37 | 25.89 | 45.31 | 1006.04 | 1.31 | 51.63 | 34.91 | 21.125001 | ... | 0.000046 | 0 | 2016-02-19 03:49:00 | 2345.207992 | True | below | 45 | 32730 | 31.434741 | 52.917464 |
1 | 27052 | 32.30 | 28.12 | 25.62 | 45.57 | 1007.42 | 1.49 | 52.29 | 333.49 | 16.083471 | ... | 0.000034 | 0 | 2016-02-19 14:30:40 | 323.634786 | True | below | 45 | 32730 | -46.620658 | -57.311657 |
2 | 27052 | 32.30 | 28.12 | 25.62 | 45.57 | 1007.42 | 1.49 | 52.29 | 333.49 | 16.083471 | ... | 0.000034 | 0 | 2016-02-19 14:30:40 | 323.634786 | True | below | 45 | 32730 | -46.620477 | -57.311138 |
3 | 46933 | 32.21 | 28.05 | 25.50 | 47.36 | 1012.41 | 0.67 | 52.40 | 27.57 | 15.441683 | ... | 0.000221 | 0 | 2016-02-21 22:14:11 | 342.159257 | True | over | 47 | 14176 | 19.138359 | -140.211489 |
4 | 64572 | 32.32 | 28.18 | 25.61 | 47.45 | 1010.62 | 1.14 | 51.41 | 33.68 | 11.994554 | ... | 0.000030 | 0 | 2016-02-23 23:40:50 | 264.655601 | True | over | 47 | 14176 | 4.713819 | 80.261665 |
5 | 68293 | 32.39 | 28.26 | 25.70 | 46.83 | 1010.51 | 0.61 | 51.91 | 287.86 | 6.554283 | ... | 0.000171 | 0 | 2016-02-24 10:05:51 | 436.876111 | True | over | 46 | 35775 | -46.061583 | 22.246025 |
6 | 73374 | 32.38 | 28.18 | 25.62 | 46.52 | 1008.28 | 0.90 | 51.77 | 30.80 | 9.947132 | ... | -0.000375 | 0 | 2016-02-25 00:23:01 | 226.089258 | True | over | 46 | 35775 | 47.047346 | 137.958918 |
7 | 90986 | 32.42 | 28.34 | 25.76 | 45.72 | 1006.79 | 0.57 | 49.85 | 10.57 | 7.805606 | ... | -0.000047 | 0 | 2016-02-27 01:43:10 | 149.700293 | True | below | 45 | 32730 | -41.049112 | 30.193004 |
8 | 90986 | 32.42 | 28.34 | 25.76 | 45.72 | 1006.79 | 0.57 | 49.85 | 10.57 | 7.805606 | ... | -0.000047 | 0 | 2016-02-27 01:43:10 | 149.700293 | True | below | 45 | 32730 | -8.402991 | -100.981726 |
9 | 102440 | 32.62 | 28.62 | 26.02 | 45.15 | 1006.06 | 1.12 | 50.44 | 301.74 | 10.348327 | ... | -0.000061 | 0 | 2016-02-28 09:48:40 | 381.014223 | True | below | 45 | 32730 | 50.047523 | 175.566751 |
10 rows × 27 columns
Exercise 10.1 - better merge¶
If you notice, the table above does have lat
and lon
columns, but very few rows. Why? Try to merge the tables in some more meaningful way, so as to keep all the original rows and have all the cells of lat
and lon
filled.
For other merging strategies, read about the attribute
how
in Why And How To Use Merge With Pandas in Python
To fill missing values, don't use fancy interpolation techniques: just put the station position in that given day or hour
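As a hint on the how parameter (not the full solution of the exercise), a toy example on two hypothetical tiny frames shows the difference between the default 'inner' merge and a 'left' merge, which keeps all the rows of the left frame:

```python
import pandas as pd

# hypothetical toy frames, mimicking the mismatched column names above
left = pd.DataFrame({'time_stamp': ['t1', 't2', 't3'],
                     'temp': [31.8, 31.7, 31.6]})
right = pd.DataFrame({'timestamp': ['t2'],
                      'lat': [47.0]})

# default how='inner': only rows with a match in both frames survive
inner = left.merge(right, left_on='time_stamp', right_on='timestamp')

# how='left': all rows of `left` are kept, NaN where there is no match
left_m = left.merge(right, left_on='time_stamp', right_on='timestamp',
                    how='left')
```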
[79]:
# write here
[79]:
ROW_ID | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | ... | reset | time_stamp | mag_tot | Too hot | check_p | humidity_int | Humidity counts | timestamp | lat | lon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 31.88 | 27.57 | 25.01 | 44.94 | 1001.68 | 1.49 | 52.25 | 185.21 | -46.422753 | ... | 20 | 2016-02-16 10:44:40 | 2368.337207 | True | below | 44 | 13029 | NaN | NaN | NaN |
1 | 2 | 31.79 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.03 | 53.73 | 186.72 | -48.778951 | ... | 0 | 2016-02-16 10:44:50 | 2615.870247 | True | below | 45 | 32730 | NaN | NaN | NaN |
2 | 3 | 31.66 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.24 | 53.57 | 186.21 | -49.161878 | ... | 0 | 2016-02-16 10:45:00 | 2648.484927 | NaN | below | 45 | 32730 | NaN | NaN | NaN |
3 | 4 | 31.69 | 27.52 | 25.01 | 45.32 | 1001.69 | 1.57 | 53.63 | 186.03 | -49.341941 | ... | 0 | 2016-02-16 10:45:10 | 2665.305485 | True | below | 45 | 32730 | NaN | NaN | NaN |
4 | 5 | 31.66 | 27.54 | 25.01 | 45.18 | 1001.71 | 0.85 | 53.66 | 186.46 | -50.056683 | ... | 0 | 2016-02-16 10:45:20 | 2732.388620 | NaN | below | 45 | 32730 | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
110866 | 110865 | 31.56 | 27.52 | 24.83 | 42.94 | 1005.83 | 1.58 | 49.93 | 129.60 | -15.169673 | ... | 0 | 2016-02-29 09:24:21 | 996.676408 | NaN | below | 42 | 2776 | NaN | NaN | NaN |
110867 | 110866 | 31.55 | 27.50 | 24.83 | 42.72 | 1005.85 | 1.89 | 49.92 | 130.51 | -15.832622 | ... | 0 | 2016-02-29 09:24:30 | 1022.779594 | NaN | below | 42 | 2776 | NaN | NaN | NaN |
110868 | 110867 | 31.58 | 27.50 | 24.83 | 42.83 | 1005.85 | 2.09 | 50.00 | 132.04 | -16.646212 | ... | 0 | 2016-02-29 09:24:41 | 1048.121268 | NaN | below | 42 | 2776 | NaN | NaN | NaN |
110869 | 110868 | 31.62 | 27.50 | 24.83 | 42.81 | 1005.88 | 2.88 | 49.69 | 133.00 | -17.270447 | ... | 0 | 2016-02-29 09:24:50 | 1073.629703 | NaN | below | 42 | 2776 | NaN | NaN | NaN |
110870 | 110869 | 31.57 | 27.51 | 24.83 | 42.94 | 1005.86 | 2.17 | 49.77 | 134.18 | -17.885872 | ... | 0 | 2016-02-29 09:25:00 | 1095.760426 | NaN | below | 42 | 2776 | NaN | NaN | NaN |
110871 rows × 28 columns
11. GeoPandas¶
You can easily manipulate geographical data with the GeoPandas library. For some nice online tutorials, we refer to the Geospatial Analysis and Representation for Data Science course website @ Master in Data Science, University of Trento, by Maurizio Napolitano (FBK)
Continue¶
Go on with the next exercises worksheet