Analytics with Pandas: 1. Introduction

Download exercises zip

Browse files online

Python offers powerful tools for data analysis. Among the main ones is Pandas, which provides fast and flexible data structures, especially for interactive data analysis.

Pandas builds on libraries we’ve already seen, such as NumPy:

[image: the PyData ecosystem]

In this tutorial we will see:

  • data analysis with the Pandas library

  • plotting with Matplotlib

  • examples with the AstroPi dataset

  • exercises with the meteotrentino dataset

1. What to do

  1. Unzip the exercises in a folder; you should get something like this:

pandas
    pandas1.ipynb
    pandas1-sol.ipynb
    pandas2.ipynb
    pandas2-sol.ipynb
    pandas3-chal.ipynb
    jupman.py

WARNING 1: to correctly visualize the notebook, it MUST be in an unzipped folder!

  2. Open Jupyter Notebook from that folder. Two things should open: first a console, then a browser.

  3. The browser should show a file list: navigate the list and open the notebook pandas/pandas1.ipynb

WARNING 2: DO NOT use the Upload button in Jupyter; instead, navigate in the Jupyter file browser to the unzipped folder!

  4. Go on reading that notebook, and follow the instructions inside.

Shortcut keys:

  • to execute Python code inside a Jupyter cell, press Control + Enter

  • to execute Python code inside a Jupyter cell AND select next cell, press Shift + Enter

  • to execute Python code inside a Jupyter cell AND create a new cell afterwards, press Alt + Enter

  • If the notebook looks stuck, try selecting Kernel -> Restart

Check installation

First let’s check whether pandas is already installed on your system: try executing this cell with Ctrl+Enter:

[1]:
import pandas as pd

If you saw no error messages, you can skip installation, otherwise do this:

  • Anaconda - open Anaconda Prompt and issue this:

conda install pandas

  • Without Anaconda (--user installs in your home):

python3 -m pip install --user pandas
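Either way, a quick sanity check after installing is to print the library version (the exact version number will of course depend on your system):

```python
# quick sanity check: import pandas and print its version
import pandas as pd
print(pd.__version__)
```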

2. Data analysis of Astro Pi

Let’s try analyzing data recorded by a Raspberry Pi aboard the International Space Station, downloaded from here:

https://projects.raspberrypi.org/en/projects/astro-pi-flight-data-analysis

where you can find a detailed description of the data gathered by the sensors during February 2016 (one record every 10 seconds).

[image: the International Space Station]

The function read_csv imports data from a CSV file and stores it in a DataFrame structure.

In this exercise we shall use the file Columbus_Ed_astro_pi_datalog.csv

[2]:
import pandas as pd   # we import pandas and for ease we rename it to 'pd'
import numpy as np    # we import numpy and for ease we rename it to 'np'

# remember the encoding !
df = pd.read_csv('Columbus_Ed_astro_pi_datalog.csv', encoding='UTF-8')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 20 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   ROW_ID      110869 non-null  int64
 1   temp_cpu    110869 non-null  float64
 2   temp_h      110869 non-null  float64
 3   temp_p      110869 non-null  float64
 4   humidity    110869 non-null  float64
 5   pressure    110869 non-null  float64
 6   pitch       110869 non-null  float64
 7   roll        110869 non-null  float64
 8   yaw         110869 non-null  float64
 9   mag_x       110869 non-null  float64
 10  mag_y       110869 non-null  float64
 11  mag_z       110869 non-null  float64
 12  accel_x     110869 non-null  float64
 13  accel_y     110869 non-null  float64
 14  accel_z     110869 non-null  float64
 15  gyro_x      110869 non-null  float64
 16  gyro_y      110869 non-null  float64
 17  gyro_z      110869 non-null  float64
 18  reset       110869 non-null  int64
 19  time_stamp  110869 non-null  object
dtypes: float64(17), int64(2), object(1)
memory usage: 16.9+ MB

We can quickly see rows and columns of the dataframe with the attribute shape:

NOTE: shape is an attribute, so it is not followed by round parentheses!

[3]:
df.shape
[3]:
(110869, 20)

The describe method gives you many summary statistics on the fly:

[4]:
df.describe()
[4]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset
count 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.00000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 1.108690e+05 110869.000000 1.108690e+05 110869.000000
mean 55435.000000 32.236259 28.101773 25.543272 46.252005 1008.126788 2.770553 51.807973 200.90126 -19.465265 -1.174493 -6.004529 -0.000630 0.018504 0.014512 -8.959493e-07 0.000007 -9.671594e-07 0.000180
std 32005.267835 0.360289 0.369256 0.380877 1.907273 3.093485 21.848940 2.085821 84.47763 28.120202 15.655121 8.552481 0.000224 0.000604 0.000312 2.807614e-03 0.002456 2.133104e-03 0.060065
min 1.000000 31.410000 27.200000 24.530000 42.270000 1001.560000 0.000000 30.890000 0.01000 -73.046240 -43.810030 -41.163040 -0.025034 -0.005903 -0.022900 -3.037930e-01 -0.378412 -2.970800e-01 0.000000
25% 27718.000000 31.960000 27.840000 25.260000 45.230000 1006.090000 1.140000 51.180000 162.43000 -41.742792 -12.982321 -11.238430 -0.000697 0.018009 0.014349 -2.750000e-04 -0.000278 -1.200000e-04 0.000000
50% 55435.000000 32.280000 28.110000 25.570000 46.130000 1007.650000 1.450000 51.950000 190.58000 -21.339485 -1.350467 -5.764400 -0.000631 0.018620 0.014510 -3.000000e-06 -0.000004 -1.000000e-06 0.000000
75% 83152.000000 32.480000 28.360000 25.790000 46.880000 1010.270000 1.740000 52.450000 256.34000 7.299000 11.912456 -0.653705 -0.000567 0.018940 0.014673 2.710000e-04 0.000271 1.190000e-04 0.000000
max 110869.000000 33.700000 29.280000 26.810000 60.590000 1021.780000 360.000000 359.400000 359.98000 33.134748 37.552135 31.003047 0.018708 0.041012 0.029938 2.151470e-01 0.389499 2.698760e-01 20.000000

QUESTION: is some column missing from the table produced by describe? Why is it not included?
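As a hint for the question above: by default describe() only summarizes numeric columns. A minimal sketch on a toy dataframe (not the AstroPi data) shows the behavior, together with the include='all' option that brings textual columns in:

```python
import pandas as pd

# toy dataframe with one numeric and one textual (object) column
toy = pd.DataFrame({'humidity': [44.9, 45.1, 45.3],
                    'time_stamp': ['10:44:40', '10:44:50', '10:45:00']})

# by default only numeric columns are summarized
print(toy.describe().columns.tolist())               # ['humidity']

# include='all' also summarizes object columns
# (with count / unique / top / freq instead of mean, std, ...)
print(toy.describe(include='all').columns.tolist())  # ['humidity', 'time_stamp']
```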

To limit describe to a single column like humidity, you can write:

[5]:
df['humidity'].describe()
[5]:
count    110869.000000
mean         46.252005
std           1.907273
min          42.270000
25%          45.230000
50%          46.130000
75%          46.880000
max          60.590000
Name: humidity, dtype: float64

The dot notation is even handier:

[6]:
df.humidity.describe()
[6]:
count    110869.000000
mean         46.252005
std           1.907273
min          42.270000
25%          45.230000
50%          46.130000
75%          46.880000
max          60.590000
Name: humidity, dtype: float64

WARNING: careful about spaces!

If the field name contains spaces (e.g. 'blender rotations'), you cannot use the dot notation; instead, use the square bracket notation seen above (i.e. df['blender rotations'].describe())

The head() method returns the first rows:

[7]:
df.head()
[7]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 -8.132907 -12.129346 -0.000468 0.019439 0.014569 0.000942 0.000492 -0.000750 20 2016-02-16 10:44:40
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 -8.304243 -12.943096 -0.000614 0.019436 0.014577 0.000218 -0.000005 -0.000235 0 2016-02-16 10:44:50
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 -8.470832 -12.642772 -0.000569 0.019359 0.014357 0.000395 0.000600 -0.000003 0 2016-02-16 10:45:00
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 -8.457380 -12.615509 -0.000575 0.019383 0.014409 0.000308 0.000577 -0.000102 0 2016-02-16 10:45:10
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 -8.122609 -12.678341 -0.000548 0.019378 0.014380 0.000321 0.000691 0.000272 0 2016-02-16 10:45:20

The tail() method returns the last rows:

[8]:
df.tail()
[8]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
110864 110865 31.56 27.52 24.83 42.94 1005.83 1.58 49.93 129.60 -15.169673 -27.642610 1.563183 -0.000682 0.017743 0.014646 -0.000264 0.000206 0.000196 0 2016-02-29 09:24:21
110865 110866 31.55 27.50 24.83 42.72 1005.85 1.89 49.92 130.51 -15.832622 -27.729389 1.785682 -0.000736 0.017570 0.014855 0.000143 0.000199 -0.000024 0 2016-02-29 09:24:30
110866 110867 31.58 27.50 24.83 42.83 1005.85 2.09 50.00 132.04 -16.646212 -27.719479 1.629533 -0.000647 0.017657 0.014799 0.000537 0.000257 0.000057 0 2016-02-29 09:24:41
110867 110868 31.62 27.50 24.83 42.81 1005.88 2.88 49.69 133.00 -17.270447 -27.793136 1.703806 -0.000835 0.017635 0.014877 0.000534 0.000456 0.000195 0 2016-02-29 09:24:50
110868 110869 31.57 27.51 24.83 42.94 1005.86 2.17 49.77 134.18 -17.885872 -27.824149 1.293345 -0.000787 0.017261 0.014380 0.000459 0.000076 0.000030 0 2016-02-29 09:25:00

The columns property gives the column headers:

[9]:
df.columns
[9]:
Index(['ROW_ID', 'temp_cpu', 'temp_h', 'temp_p', 'humidity', 'pressure',
       'pitch', 'roll', 'yaw', 'mag_x', 'mag_y', 'mag_z', 'accel_x', 'accel_y',
       'accel_z', 'gyro_x', 'gyro_y', 'gyro_z', 'reset', 'time_stamp'],
      dtype='object')

NOTE: as you see above, the type of the resulting object is not a list, but a special container defined by pandas:

[10]:
type(df.columns)
[10]:
pandas.core.indexes.base.Index

Nevertheless, we can access the elements of this container by using indices inside square brackets:

[11]:
df.columns[0]
[11]:
'ROW_ID'
[12]:
df.columns[1]
[12]:
'temp_cpu'
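If you need the headers as a plain Python list, or the position of a given column, the Index supports both; a sketch on a toy dataframe (not the AstroPi data):

```python
import pandas as pd

# toy dataframe with two columns
toy = pd.DataFrame({'ROW_ID': [1, 2], 'temp_cpu': [31.88, 31.79]})

# the Index converts to a plain Python list
print(list(toy.columns))                  # ['ROW_ID', 'temp_cpu']

# get_loc returns the position of a column given its name
print(toy.columns.get_loc('temp_cpu'))    # 1
```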

With the corr method we can see the correlations between the DataFrame columns.

[13]:
df.corr()
[13]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset
ROW_ID 1.000000 0.361391 0.435289 0.339916 -0.110241 0.243337 -0.007741 -0.390159 -0.034877 -0.035966 -0.025818 0.108138 -0.132291 -0.918587 -0.080899 -0.001906 -0.004913 -0.000211 -0.005202
temp_cpu 0.361391 1.000000 0.986872 0.991672 -0.297081 0.038065 0.008076 -0.171644 -0.117972 0.005145 -0.285192 -0.120838 -0.023582 -0.446358 -0.029155 0.002511 0.005947 -0.001250 -0.002970
temp_h 0.435289 0.986872 1.000000 0.993260 -0.281422 0.070882 0.005145 -0.199628 -0.117870 0.000428 -0.276276 -0.098864 -0.032188 -0.510126 -0.043213 0.001771 0.005020 -0.001423 -0.004325
temp_p 0.339916 0.991672 0.993260 1.000000 -0.288373 0.035496 0.006750 -0.163685 -0.118463 0.004338 -0.283427 -0.114407 -0.018047 -0.428884 -0.036505 0.001829 0.006127 -0.001623 -0.004205
humidity -0.110241 -0.297081 -0.281422 -0.288373 1.000000 0.434374 0.004050 0.101304 0.031664 -0.035146 0.077897 0.076424 -0.009741 0.226281 0.005281 0.004345 0.003457 0.001298 -0.002066
pressure 0.243337 0.038065 0.070882 0.035496 0.434374 1.000000 0.003018 0.011815 -0.051697 -0.040183 -0.074578 0.092352 0.013556 -0.115642 -0.221208 -0.000611 -0.002493 -0.000615 -0.006259
pitch -0.007741 0.008076 0.005145 0.006750 0.004050 0.003018 1.000000 0.087941 -0.011611 0.013331 0.006133 0.000540 0.043285 0.009015 -0.039146 0.066618 -0.015034 0.049340 -0.000176
roll -0.390159 -0.171644 -0.199628 -0.163685 0.101304 0.011815 0.087941 1.000000 0.095354 -0.020947 0.060297 -0.080620 0.116637 0.462630 -0.167905 -0.115873 -0.002509 -0.214202 0.000636
yaw -0.034877 -0.117972 -0.117870 -0.118463 0.031664 -0.051697 -0.011611 0.095354 1.000000 0.257971 0.549394 -0.328360 0.006943 0.044157 -0.013634 0.003106 0.003665 0.004020 -0.000558
mag_x -0.035966 0.005145 0.000428 0.004338 -0.035146 -0.040183 0.013331 -0.020947 0.257971 1.000000 0.001239 -0.213070 -0.006629 0.027921 0.021524 -0.004954 -0.004429 -0.005052 -0.002879
mag_y -0.025818 -0.285192 -0.276276 -0.283427 0.077897 -0.074578 0.006133 0.060297 0.549394 0.001239 1.000000 -0.266351 0.014057 0.051619 -0.053016 0.001239 0.001063 0.001530 -0.001335
mag_z 0.108138 -0.120838 -0.098864 -0.114407 0.076424 0.092352 0.000540 -0.080620 -0.328360 -0.213070 -0.266351 1.000000 0.024718 -0.083914 -0.061317 -0.008470 -0.009557 -0.008997 -0.002151
accel_x -0.132291 -0.023582 -0.032188 -0.018047 -0.009741 0.013556 0.043285 0.116637 0.006943 -0.006629 0.014057 0.024718 1.000000 0.095286 -0.262305 0.035314 0.103449 0.197740 0.002173
accel_y -0.918587 -0.446358 -0.510126 -0.428884 0.226281 -0.115642 0.009015 0.462630 0.044157 0.027921 0.051619 -0.083914 0.095286 1.000000 0.120215 0.043263 -0.046463 0.009541 0.004648
accel_z -0.080899 -0.029155 -0.043213 -0.036505 0.005281 -0.221208 -0.039146 -0.167905 -0.013634 0.021524 -0.053016 -0.061317 -0.262305 0.120215 1.000000 0.078315 -0.075625 0.057075 0.000554
gyro_x -0.001906 0.002511 0.001771 0.001829 0.004345 -0.000611 0.066618 -0.115873 0.003106 -0.004954 0.001239 -0.008470 0.035314 0.043263 0.078315 1.000000 -0.248968 0.337553 0.001009
gyro_y -0.004913 0.005947 0.005020 0.006127 0.003457 -0.002493 -0.015034 -0.002509 0.003665 -0.004429 0.001063 -0.009557 0.103449 -0.046463 -0.075625 -0.248968 1.000000 0.190112 0.000593
gyro_z -0.000211 -0.001250 -0.001423 -0.001623 0.001298 -0.000615 0.049340 -0.214202 0.004020 -0.005052 0.001530 -0.008997 0.197740 0.009541 0.057075 0.337553 0.190112 1.000000 -0.001055
reset -0.005202 -0.002970 -0.004325 -0.004205 -0.002066 -0.006259 -0.000176 0.000636 -0.000558 -0.002879 -0.001335 -0.002151 0.002173 0.004648 0.000554 0.001009 0.000593 -0.001055 1.000000
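The full matrix can be overwhelming; to read off a single coefficient you can index it like any dataframe, first by column and then by row label. A sketch on toy data (not the AstroPi file):

```python
import pandas as pd

# toy data: y grows with x, z decreases with x
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                    'y': [2.0, 4.0, 6.0, 8.0],
                    'z': [8.0, 6.0, 4.0, 2.0]})

c = toy.corr()
# index the matrix with the column label first, then the row label
print(c['x']['y'])    # close to 1.0: perfect positive correlation
print(c['x']['z'])    # close to -1.0: perfect negative correlation
```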

2.1 Exercise - meteo info

✪ a) Create a new dataframe called meteo by importing the data from the file meteo.csv, which contains the meteo data of Trento from November 2017 (source: https://www.meteotrentino.it). IMPORTANT: assign the dataframe to a variable called meteo (so we avoid confusion with the AstroPi dataframe)

  b) Visualize info about this dataframe

Show solution
[14]:
# write here - create dataframe


COLUMNS:

Index(['Date', 'Pressure', 'Rain', 'Temp'], dtype='object')

INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2878 entries, 0 to 2877
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Date      2878 non-null   object
 1   Pressure  2878 non-null   float64
 2   Rain      2878 non-null   float64
 3   Temp      2878 non-null   float64
dtypes: float64(3), object(1)
memory usage: 90.1+ KB
None

HEAD():
[14]:
Date Pressure Rain Temp
0 01/11/2017 00:00 995.4 0.0 5.4
1 01/11/2017 00:15 995.5 0.0 6.0
2 01/11/2017 00:30 995.5 0.0 5.9
3 01/11/2017 00:45 995.7 0.0 5.4
4 01/11/2017 01:00 995.7 0.0 5.3

3. Matplotlib review

We’ve already seen Matplotlib in the part about visualization, and today we use it to display data.

3.1 An example

Let’s take an example again, with the MATLAB-style approach. We will plot a line passing through two lists of coordinates, one for the xs and one for the ys:

[15]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
[16]:

xs = [1,2,3,4]
ys = [2,4,6,8]
plt.plot(xs, ys) # we can directly pass xs and ys lists
plt.title('Some number')
plt.show()


We can also create the series with NumPy. Let’s try making a parabola:

[17]:
import numpy as np
xs = np.arange(0.,5.,0.1)
#  '**' is the power operator in  Python, NOT '^'
ys = xs**2

Let’s use the type function to find out the data types of xs and ys:

[18]:
type(xs)
[18]:
numpy.ndarray
[19]:
type(ys)
[19]:
numpy.ndarray

Hence we have NumPy arrays.

Let’s plot it:

[20]:
plt.title('The parabola')
plt.plot(xs,ys);

If we want the same scale on both the x and y axes, we can use the gca function.

To set the x and y limits, we can use xlim and ylim:

[21]:
plt.xlim([0, 5])
plt.ylim([0,10])
plt.title('The parabola')

plt.gca().set_aspect('equal')
plt.plot(xs,ys);

3.2 Matplotlib plots from pandas datastructures

We can get plots directly from pandas data structures, always using the MATLAB style. Let’s make a simple example; for more complex cases we refer to the DataFrame.plot documentation.

With a big quantity of data, it may be useful to get a qualitative idea of the data by putting them in a plot:

[22]:
df.humidity.plot(label="Humidity", legend=True)
# with secondary_y=True the numbers on the y axis
# of this plot are displayed on the right
df.pressure.plot(secondary_y=True, label="Pressure", legend=True);

We can put the pressure values on the horizontal axis and see which humidity values on the vertical axis occur at a given pressure:

[23]:
plt.plot(df['pressure'], df['humidity'])
[23]:
[<matplotlib.lines.Line2D at 0x7f45431942d0>]
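Since a line plot connects the points in file order, pressure vs humidity can look like a tangle; a scatter plot of the same two columns, one dot per row, is often more readable. A minimal sketch on toy values (not the AstroPi file), using a non-interactive backend so it also runs without a display:

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend: works without a display
import matplotlib.pyplot as plt
import pandas as pd

# toy values: one dot per row instead of a zig-zagging line
toy = pd.DataFrame({'pressure': [1001.7, 1005.9, 1008.0, 1021.8],
                    'humidity': [45.1, 42.9, 44.8, 47.6]})

dots = plt.scatter(toy['pressure'], toy['humidity'])
plt.xlabel('pressure')
plt.ylabel('humidity')
```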

4. Operations on rows

If we consider the rows of a dataset, typically we will want to index, filter and sort them.

4.1 Integer indexing

We report here the simplest form of indexing, with row numbers.

To obtain the i-th row as a Series you can use iloc[i] (here we reuse the AstroPi dataset):

[24]:
df.iloc[6]
[24]:
ROW_ID                          7
temp_cpu                    31.68
temp_h                      27.53
temp_p                      25.01
humidity                    45.31
pressure                   1001.7
pitch                        0.63
roll                        53.55
yaw                         186.1
mag_x                  -50.447346
mag_y                   -7.937309
mag_z                  -12.188574
accel_x                  -0.00051
accel_y                  0.019264
accel_z                  0.014528
gyro_x                  -0.000111
gyro_y                    0.00032
gyro_z                   0.000222
reset                           0
time_stamp    2016-02-16 10:45:41
Name: 6, dtype: object
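Like Python lists, iloc also accepts negative positions, counted from the end; a sketch on a toy dataframe:

```python
import pandas as pd

toy = pd.DataFrame({'ROW_ID': [1, 2, 3]})

# -1 selects the last row, exactly as with lists
print(toy.iloc[-1]['ROW_ID'])    # 3
```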

It is possible to select a dataframe of contiguous positions by using slicing, as we already did for strings and lists.

For example, here we select the rows from the 5th (included) to the 7th (excluded):

[25]:
df.iloc[5:7]
[25]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
5 6 31.69 27.55 25.01 45.12 1001.67 0.85 53.53 185.52 -50.246476 -8.343209 -11.938124 -0.000536 0.019453 0.014380 0.000273 0.000494 -0.000059 0 2016-02-16 10:45:30
6 7 31.68 27.53 25.01 45.31 1001.70 0.63 53.55 186.10 -50.447346 -7.937309 -12.188574 -0.000510 0.019264 0.014528 -0.000111 0.000320 0.000222 0 2016-02-16 10:45:41

By filtering the rows we can ‘zoom into’ the dataset, selecting for example the rows between the 12500th (included) and the 15000th (excluded) into a new dataframe df2:

[26]:
df2=df.iloc[12500:15000]
[27]:
plt.plot(df2['pressure'], df2['humidity'])
[27]:
[<matplotlib.lines.Line2D at 0x7f454322c490>]
[28]:
df2.humidity.plot(label="Humidity", legend=True)
df2.pressure.plot(secondary_y=True, label="Pressure", legend=True)
[28]:
<AxesSubplot:>

4.2 Filtering

It’s possible to filter data according to a condition the data should satisfy, which can be expressed by indicating a column and a comparison operator. For example:

[29]:
df.ROW_ID >= 6
[29]:
0         False
1         False
2         False
3         False
4         False
          ...
110864     True
110865     True
110866     True
110867     True
110868     True
Name: ROW_ID, Length: 110869, dtype: bool

We see it’s a series of True/False values, according to whether ROW_ID is greater than or equal to 6. What’s the type of this result?

[30]:
type(df.ROW_ID >= 6)
[30]:
pandas.core.series.Series

What is contained in this Series object? If we try printing it, we will see it is a series of True/False values, according to whether ROW_ID is greater than or equal to 6:

[31]:
df.ROW_ID >= 6
[31]:
0         False
1         False
2         False
3         False
4         False
          ...
110864     True
110865     True
110866     True
110867     True
110868     True
Name: ROW_ID, Length: 110869, dtype: bool

Combining filters

It’s possible to combine conditions as we already did in NumPy filtering, for example by using the special conjunction operator &.

If we write (df.ROW_ID >= 6) & (df.ROW_ID <= 10) we obtain a series of True/False values, telling whether ROW_ID is at the same time greater than or equal to 6 and less than or equal to 10:

[32]:
type((df.ROW_ID >= 6) & (df.ROW_ID <= 10))
[32]:
pandas.core.series.Series

If we want complete rows of the dataframe which satisfy the condition, we can write like this:

IMPORTANT: we use df outside the expression, as df[   ], opening and closing the square brackets to tell Python we want to filter the df dataframe, and we use df again inside the brackets to tell which columns and rows we want to filter on.

[33]:
df[  (df.ROW_ID >= 6) & (df.ROW_ID <= 10)  ]
[33]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
5 6 31.69 27.55 25.01 45.12 1001.67 0.85 53.53 185.52 -50.246476 -8.343209 -11.938124 -0.000536 0.019453 0.014380 0.000273 0.000494 -0.000059 0 2016-02-16 10:45:30
6 7 31.68 27.53 25.01 45.31 1001.70 0.63 53.55 186.10 -50.447346 -7.937309 -12.188574 -0.000510 0.019264 0.014528 -0.000111 0.000320 0.000222 0 2016-02-16 10:45:41
7 8 31.66 27.55 25.01 45.34 1001.70 1.49 53.65 186.08 -50.668232 -7.762600 -12.284196 -0.000523 0.019473 0.014298 -0.000044 0.000436 0.000301 0 2016-02-16 10:45:50
8 9 31.67 27.54 25.01 45.20 1001.72 1.22 53.77 186.55 -50.761529 -7.262934 -11.981090 -0.000522 0.019385 0.014286 0.000358 0.000651 0.000187 0 2016-02-16 10:46:01
9 10 31.67 27.54 25.01 45.41 1001.75 1.63 53.46 185.94 -51.243832 -6.875270 -11.672494 -0.000581 0.019390 0.014441 0.000266 0.000676 0.000356 0 2016-02-16 10:46:10
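Besides the conjunction &, conditions can be combined with the disjunction | (or) and negated with ~ (not); each condition must stay inside its own parentheses. A sketch on toy data (not the AstroPi file):

```python
import pandas as pd

toy = pd.DataFrame({'ROW_ID': [1, 2, 3, 4, 5],
                    'humidity': [44.9, 45.1, 45.1, 45.3, 60.6]})

# rows where ROW_ID is at most 2 OR humidity is above 60
either = toy[(toy.ROW_ID <= 2) | (toy.humidity > 60)]
print(either.ROW_ID.tolist())    # [1, 2, 5]

# ~ negates a condition: rows NOT matching humidity > 60
normal = toy[~(toy.humidity > 60)]
print(normal.ROW_ID.tolist())    # [1, 2, 3, 4]
```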

So if we want to find the record where the pressure is maximal, we use the values property of the series, on which we calculate the maximum value:

[34]:
df[  (df.pressure == df.pressure.values.max())  ]
[34]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
77602 77603 32.44 28.31 25.74 47.57 1021.78 1.1 51.82 267.39 -0.797428 10.891803 -15.728202 -0.000612 0.01817 0.014295 -0.000139 -0.000179 -0.000298 0 2016-02-25 12:13:20
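An equivalent shortcut (assuming a default integer index, as in this dataset) is idxmax, which returns the index label of the row holding the maximum, ready to be used with loc. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'pressure': [1001.7, 1021.78, 1008.0],
                    'humidity': [45.1, 47.6, 44.8]})

# idxmax gives the index label of the row with the maximum pressure
i = toy.pressure.idxmax()
print(i)                         # 1
print(toy.loc[i, 'humidity'])    # 47.6
```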

4.3 Sorting

To obtain a NEW dataframe sorted according to one or more columns, we can use the sort_values method:

[35]:
df.sort_values('pressure',ascending=False).head()
[35]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
77602 77603 32.44 28.31 25.74 47.57 1021.78 1.10 51.82 267.39 -0.797428 10.891803 -15.728202 -0.000612 0.018170 0.014295 -0.000139 -0.000179 -0.000298 0 2016-02-25 12:13:20
77601 77602 32.45 28.30 25.74 47.26 1021.75 1.53 51.76 266.12 -1.266335 10.927442 -15.690558 -0.000661 0.018357 0.014533 0.000152 0.000459 -0.000298 0 2016-02-25 12:13:10
77603 77604 32.44 28.30 25.74 47.29 1021.75 1.86 51.83 268.83 -0.320795 10.651441 -15.565123 -0.000648 0.018290 0.014372 0.000049 0.000473 -0.000029 0 2016-02-25 12:13:30
77604 77605 32.43 28.30 25.74 47.39 1021.75 1.78 51.54 269.41 -0.130574 10.628383 -15.488983 -0.000672 0.018154 0.014602 0.000360 0.000089 -0.000002 0 2016-02-25 12:13:40
77608 77609 32.42 28.29 25.74 47.36 1021.73 0.86 51.89 272.77 0.952025 10.435951 -16.027235 -0.000607 0.018186 0.014232 -0.000260 -0.000059 -0.000187 0 2016-02-25 12:14:20
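sort_values also accepts a list of columns, sorting by the first one and breaking ties with the next ones; ascending can then be a list too. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'reset': [0, 20, 0, 20],
                    'temp_cpu': [31.9, 31.7, 31.6, 31.8]})

# sort by reset descending first, then by temp_cpu ascending within ties
s = toy.sort_values(['reset', 'temp_cpu'], ascending=[False, True])
print(s.temp_cpu.tolist())    # [31.7, 31.8, 31.6, 31.9]
```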

4.4 Exercise - Meteo stats

✪ Analyze the data in the dataframe meteo to find:

  • the average, minimum and maximum pressure values

  • the average temperature

  • the dates of rainy days

Show solution
[36]:
# write here


Average pressure : 986.3408269631689
Minimal pressure : 966.3
Maximal pressure : 998.3
Average temperature : 6.410701876302988
[36]:
Date Pressure Rain Temp
433 05/11/2017 12:15 979.2 0.2 8.6
435 05/11/2017 12:45 978.9 0.2 8.4
436 05/11/2017 13:00 979.0 0.2 8.4
437 05/11/2017 13:15 979.1 0.8 8.2
438 05/11/2017 13:30 979.0 0.6 8.2
... ... ... ... ...
2754 29/11/2017 17:15 976.1 0.2 0.9
2755 29/11/2017 17:30 975.9 0.2 0.9
2802 30/11/2017 05:15 971.3 0.2 1.3
2803 30/11/2017 05:30 971.3 0.2 1.1
2804 30/11/2017 05:45 971.5 0.2 1.1

107 rows × 4 columns

5. Object values and strings

In general, when we want to manipulate objects of a known type, say strings (which have type str), we can write .str after a series and then treat the result as if it were a single string, using any operator (e.g. slicing) or method that class allows, plus others provided by pandas.

Text in particular can be manipulated in many ways; for more details, see the pandas documentation.

5.1 Filter by textual values

When we want to filter by text values, we can use .str.contains; here for example we select all the samples in the last days of February (whose timestamp contains 2016-02-2):

[37]:
df[  df['time_stamp'].str.contains('2016-02-2')  ]
[37]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
30442 30443 32.30 28.12 25.59 45.05 1008.01 1.47 51.82 51.18 9.215883 -12.947023 4.066202 -0.000612 0.018792 0.014558 -0.000042 0.000275 0.000157 0 2016-02-20 00:00:00
30443 30444 32.25 28.13 25.59 44.82 1008.02 0.81 51.53 52.21 8.710130 -13.143595 3.499386 -0.000718 0.019290 0.014667 0.000260 0.001011 0.000149 0 2016-02-20 00:00:10
30444 30445 33.07 28.13 25.59 45.08 1008.09 0.68 51.69 57.36 7.383435 -13.827667 4.438656 -0.000700 0.018714 0.014598 0.000299 0.000343 -0.000025 0 2016-02-20 00:00:41
30445 30446 32.63 28.10 25.60 44.87 1008.07 1.42 52.13 59.95 7.292313 -13.999682 4.517029 -0.000657 0.018857 0.014565 0.000160 0.000349 -0.000190 0 2016-02-20 00:00:50
30446 30447 32.55 28.11 25.60 44.94 1008.07 1.41 51.86 61.83 6.699141 -14.065591 4.448778 -0.000678 0.018871 0.014564 -0.000608 -0.000381 -0.000243 0 2016-02-20 00:01:01
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110864 110865 31.56 27.52 24.83 42.94 1005.83 1.58 49.93 129.60 -15.169673 -27.642610 1.563183 -0.000682 0.017743 0.014646 -0.000264 0.000206 0.000196 0 2016-02-29 09:24:21
110865 110866 31.55 27.50 24.83 42.72 1005.85 1.89 49.92 130.51 -15.832622 -27.729389 1.785682 -0.000736 0.017570 0.014855 0.000143 0.000199 -0.000024 0 2016-02-29 09:24:30
110866 110867 31.58 27.50 24.83 42.83 1005.85 2.09 50.00 132.04 -16.646212 -27.719479 1.629533 -0.000647 0.017657 0.014799 0.000537 0.000257 0.000057 0 2016-02-29 09:24:41
110867 110868 31.62 27.50 24.83 42.81 1005.88 2.88 49.69 133.00 -17.270447 -27.793136 1.703806 -0.000835 0.017635 0.014877 0.000534 0.000456 0.000195 0 2016-02-29 09:24:50
110868 110869 31.57 27.51 24.83 42.94 1005.86 2.17 49.77 134.18 -17.885872 -27.824149 1.293345 -0.000787 0.017261 0.014380 0.000459 0.000076 0.000030 0 2016-02-29 09:25:00

80427 rows × 20 columns

5.2 Extracting strings

To extract only the day from the time_stamp column, we can use str and the slice operator with square brackets:

[38]:
df['time_stamp'].str[8:10]
[38]:
0         16
1         16
2         16
3         16
4         16
          ..
110864    29
110865    29
110866    29
110867    29
110868    29
Name: time_stamp, Length: 110869, dtype: object
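Slicing works here because the timestamps all share the same fixed format. For real date handling, pandas also offers pd.to_datetime, which converts the strings into proper datetimes with a .dt accessor; a sketch on toy values:

```python
import pandas as pd

toy = pd.DataFrame({'time_stamp': ['2016-02-16 10:44:40',
                                   '2016-02-29 09:25:00']})

# parse the strings into real datetime objects
ts = pd.to_datetime(toy['time_stamp'])

# .dt exposes the parts of the date, no string slicing needed
print(ts.dt.day.tolist())      # [16, 29]
print(ts.dt.month.tolist())    # [2, 2]
```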

6. Operations on columns

Let’s see now how to select, add and transform columns.

6.1 - Selecting columns

If we want a subset of columns, we can express the names in a list like this:

NOTE: inside the external square brackets there is a simple list without df!

[39]:
df[  ['temp_h', 'temp_p', 'time_stamp']  ]
[39]:
temp_h temp_p time_stamp
0 27.57 25.01 2016-02-16 10:44:40
1 27.53 25.01 2016-02-16 10:44:50
2 27.53 25.01 2016-02-16 10:45:00
3 27.52 25.01 2016-02-16 10:45:10
4 27.54 25.01 2016-02-16 10:45:20
... ... ... ...
110864 27.52 24.83 2016-02-29 09:24:21
110865 27.50 24.83 2016-02-29 09:24:30
110866 27.50 24.83 2016-02-29 09:24:41
110867 27.50 24.83 2016-02-29 09:24:50
110868 27.51 24.83 2016-02-29 09:25:00

110869 rows × 3 columns

As always, selecting columns doesn’t change the original dataframe:

[40]:
df.head()
[40]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 -8.132907 -12.129346 -0.000468 0.019439 0.014569 0.000942 0.000492 -0.000750 20 2016-02-16 10:44:40
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 -8.304243 -12.943096 -0.000614 0.019436 0.014577 0.000218 -0.000005 -0.000235 0 2016-02-16 10:44:50
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 -8.470832 -12.642772 -0.000569 0.019359 0.014357 0.000395 0.000600 -0.000003 0 2016-02-16 10:45:00
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 -8.457380 -12.615509 -0.000575 0.019383 0.014409 0.000308 0.000577 -0.000102 0 2016-02-16 10:45:10
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 -8.122609 -12.678341 -0.000548 0.019378 0.014380 0.000321 0.000691 0.000272 0 2016-02-16 10:45:20

6.2 - Adding columns

It’s possible to obtain new columns by calculating them from other columns in a very natural way. For example, here we compute a new column mag_tot, the squared magnitude of the magnetic field measured by the space station, obtained from mag_x, mag_y and mag_z, and then plot it:

[41]:
df['mag_tot'] = df['mag_x']**2 + df['mag_y']**2 + df['mag_z']**2
df.mag_tot.plot()
[41]:
<AxesSubplot:>

Let’s find when the magnetic field was maximal:

[42]:
df['time_stamp'][(df.mag_tot == df.mag_tot.values.max())]
[42]:
96156    2016-02-27 16:12:31
Name: time_stamp, dtype: object

By entering the timestamp we found on the website isstracker.com/historical, we can find the position of the station where the magnetic field was at its highest.

Writing only in some rows

The loc property allows us to filter rows according to a condition and to select a column, which can be new. In this case, for the rows where the CPU temperature is too high, we write the value True in the fields of the column with header 'Too hot':

[43]:
df.loc[(df.temp_cpu > 31.68),'Too hot'] = True

Let’s see the resulting table (scroll to the end to see the new column). Note that the values in the rows we did not select are represented with NaN, which literally means not a number:

[44]:
df.head()
[44]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp mag_tot Too hot
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 ... -0.000468 0.019439 0.014569 0.000942 0.000492 -0.000750 20 2016-02-16 10:44:40 2368.337207 True
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 ... -0.000614 0.019436 0.014577 0.000218 -0.000005 -0.000235 0 2016-02-16 10:44:50 2615.870247 True
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 ... -0.000569 0.019359 0.014357 0.000395 0.000600 -0.000003 0 2016-02-16 10:45:00 2648.484927 NaN
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 ... -0.000575 0.019383 0.014409 0.000308 0.000577 -0.000102 0 2016-02-16 10:45:10 2665.305485 True
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 ... -0.000548 0.019378 0.014380 0.000321 0.000691 0.000272 0 2016-02-16 10:45:20 2732.388620 NaN

5 rows × 22 columns
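If you would rather have an explicit False than NaN in the rows that were not selected, one option is to complete the column afterwards with fillna; a sketch on a toy dataframe (not the AstroPi data):

```python
import pandas as pd

toy = pd.DataFrame({'temp_cpu': [31.88, 31.66, 31.79]})

# write True only where the condition holds ...
toy.loc[toy.temp_cpu > 31.68, 'Too hot'] = True
# ... then replace the NaN left in the other rows with False
toy['Too hot'] = toy['Too hot'].fillna(False)
print(toy['Too hot'].tolist())    # [True, False, True]
```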

Pandas is a very flexible library, and provides several ways to obtain the same result. For example, we can try an operation similar to the one above with the np.where command, as shown below: we add a column telling whether the pressure is below or above the average:

[45]:
avg_pressure = df.pressure.values.mean()
df['check_p'] = np.where(df.pressure <= avg_pressure, 'below', 'over')

6.2.1 Exercise: Meteo temperature in Fahrenheit

In meteo dataframe, create a column Temp (Fahrenheit) with the temperature measured in Fahrenheit degrees.

Formula to calculate conversion from Celsius degrees (C):

\(Fahrenheit = \frac{9}{5}C + 32\)

[46]:
# write here


Show solution
[47]:


       **************  SOLUTION OUTPUT  **************
[47]:
Date Pressure Rain Temp Temp (Fahrenheit)
0 01/11/2017 00:00 995.4 0.0 5.4 41.72
1 01/11/2017 00:15 995.5 0.0 6.0 42.80
2 01/11/2017 00:30 995.5 0.0 5.9 42.62
3 01/11/2017 00:45 995.7 0.0 5.4 41.72
4 01/11/2017 01:00 995.7 0.0 5.3 41.54

6.2.2 Exercise - Pressure vs Temperature

Pressure should be directly proportional to temperature in a closed environment, according to Gay-Lussac’s law:

\(\frac{P}{T} = k\)

Does this hold true for the meteo dataset? Try to find out by directly calculating the formula and comparing with the results of the corr() method.
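As a warm-up for corr(), here is a minimal sketch on invented data: corr() returns the pairwise correlation matrix of the numeric columns, with 1.0 meaning perfect linear correlation.

```python
import pandas as pd

# toy frame: T increases perfectly linearly with P
toy = pd.DataFrame({'P': [1000.0, 1002.0, 1004.0, 1006.0],
                    'T': [5.0, 6.0, 7.0, 8.0]})
corr_matrix = toy.corr()
print(corr_matrix.loc['P', 'T'])   # 1.0 : perfectly correlated
```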

Show solution
[48]:

6.3 Transforming columns

Suppose we want to convert all the values of the column humidity, which are floats, to integers.

We know that to convert a float to an integer there is the predefined Python function int:

[49]:
int(23.7)
[49]:
23

We would like to apply this function to all the elements of the column humidity.

To do so, we can call the transform method, passing the function int as a parameter.

NOTE: there are no round parentheses after int !!!

[50]:
df['humidity'].transform(int)
[50]:
0         44
1         45
2         45
3         45
4         45
          ..
110864    42
110865    42
110866    42
110867    42
110868    42
Name: humidity, Length: 110869, dtype: int64

Just to be clear about what passing a function means, let’s see two other completely equivalent ways we could have used to pass it.

Defining a function: We could have defined a function myf like this (note the function MUST RETURN something !)

[51]:
def myf(x):
    return int(x)

df['humidity'].transform(myf)
[51]:
0         44
1         45
2         45
3         45
4         45
          ..
110864    42
110865    42
110866    42
110867    42
110868    42
Name: humidity, Length: 110869, dtype: int64

lambda function: We could as well have used a lambda function, that is, a function without a name which is defined on one line:

[52]:
df['humidity'].transform( lambda x: int(x) )
[52]:
0         44
1         45
2         45
3         45
4         45
          ..
110864    42
110865    42
110866    42
110867    42
110868    42
Name: humidity, Length: 110869, dtype: int64

Regardless of how we choose to pass the function, the transform method does not change the original dataframe:

[53]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 23 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   ROW_ID      110869 non-null  int64
 1   temp_cpu    110869 non-null  float64
 2   temp_h      110869 non-null  float64
 3   temp_p      110869 non-null  float64
 4   humidity    110869 non-null  float64
 5   pressure    110869 non-null  float64
 6   pitch       110869 non-null  float64
 7   roll        110869 non-null  float64
 8   yaw         110869 non-null  float64
 9   mag_x       110869 non-null  float64
 10  mag_y       110869 non-null  float64
 11  mag_z       110869 non-null  float64
 12  accel_x     110869 non-null  float64
 13  accel_y     110869 non-null  float64
 14  accel_z     110869 non-null  float64
 15  gyro_x      110869 non-null  float64
 16  gyro_y      110869 non-null  float64
 17  gyro_z      110869 non-null  float64
 18  reset       110869 non-null  int64
 19  time_stamp  110869 non-null  object
 20  mag_tot     110869 non-null  float64
 21  Too hot     105315 non-null  object
 22  check_p     110869 non-null  object
dtypes: float64(18), int64(2), object(3)
memory usage: 19.5+ MB
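As an aside, for a plain type cast like this one pandas also provides the astype method, which, like transform, returns a new Series and leaves the original untouched. A minimal sketch on invented values:

```python
import pandas as pd

hum = pd.Series([44.94, 45.12, 42.72])
print(hum.astype(int).tolist())   # [44, 45, 42] : truncated toward zero
print(hum.tolist())               # original unchanged: [44.94, 45.12, 42.72]
```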

If we want to add a new column, say humidity_int, we have to explicitly assign the result of transform to the new column:

[54]:
df['humidity_int'] = df['humidity'].transform( lambda x: int(x) )

Notice how pandas automatically infers type int64 for the newly created column:

[55]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 24 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   ROW_ID        110869 non-null  int64
 1   temp_cpu      110869 non-null  float64
 2   temp_h        110869 non-null  float64
 3   temp_p        110869 non-null  float64
 4   humidity      110869 non-null  float64
 5   pressure      110869 non-null  float64
 6   pitch         110869 non-null  float64
 7   roll          110869 non-null  float64
 8   yaw           110869 non-null  float64
 9   mag_x         110869 non-null  float64
 10  mag_y         110869 non-null  float64
 11  mag_z         110869 non-null  float64
 12  accel_x       110869 non-null  float64
 13  accel_y       110869 non-null  float64
 14  accel_z       110869 non-null  float64
 15  gyro_x        110869 non-null  float64
 16  gyro_y        110869 non-null  float64
 17  gyro_z        110869 non-null  float64
 18  reset         110869 non-null  int64
 19  time_stamp    110869 non-null  object
 20  mag_tot       110869 non-null  float64
 21  Too hot       105315 non-null  object
 22  check_p       110869 non-null  object
 23  humidity_int  110869 non-null  int64
dtypes: float64(18), int64(3), object(3)
memory usage: 20.3+ MB

7. Grouping

Reference:

It’s pretty easy to group rows and perform aggregate calculations by using the groupby method. Let’s say we want to count how many humidity readings were taken for each integer humidity value humidity_int (here we use pandas groupby, but for histograms you could also use numpy)

After groupby we can apply the count() aggregation function (other common ones are sum(), min(), max(), mean()):

[56]:
df.groupby(['humidity_int'])['humidity'].count()
[56]:
humidity_int
42     2776
43     2479
44    13029
45    32730
46    35775
47    14176
48     7392
49      297
50      155
51      205
52      209
53      128
54      224
55      164
56      139
57      183
58      237
59      271
60      300
Name: humidity, dtype: int64
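Beyond a single aggregation, groupby can compute several statistics at once via the agg method. A minimal sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                    'val':   [1.0, 3.0, 2.0, 4.0, 6.0]})
# one row per group, one column per aggregation
stats = toy.groupby('group')['val'].agg(['count', 'mean', 'max'])
print(stats)
#        count  mean  max
# group
# a          2   2.0  3.0
# b          3   4.0  6.0
```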

Notice we’ve got only 19 rows. To obtain a series spanning the whole table, assigning to each row the count of its own group, we can use transform like this:

[57]:
df.groupby(['humidity_int'])['humidity'].transform('count')
[57]:
0         13029
1         32730
2         32730
3         32730
4         32730
          ...
110864     2776
110865     2776
110866     2776
110867     2776
110868     2776
Name: humidity, Length: 110869, dtype: int64

As usual, groupby does not modify the dataframe; if we want the result stored in the dataframe we need to assign it to a new column:

[58]:
df['Humidity counts'] = df.groupby(['humidity_int'])['humidity'].transform('count')
[59]:
df
[59]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... gyro_x gyro_y gyro_z reset time_stamp mag_tot Too hot check_p humidity_int Humidity counts
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 ... 0.000942 0.000492 -0.000750 20 2016-02-16 10:44:40 2368.337207 True below 44 13029
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 ... 0.000218 -0.000005 -0.000235 0 2016-02-16 10:44:50 2615.870247 True below 45 32730
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 ... 0.000395 0.000600 -0.000003 0 2016-02-16 10:45:00 2648.484927 NaN below 45 32730
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 ... 0.000308 0.000577 -0.000102 0 2016-02-16 10:45:10 2665.305485 True below 45 32730
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 ... 0.000321 0.000691 0.000272 0 2016-02-16 10:45:20 2732.388620 NaN below 45 32730
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110864 110865 31.56 27.52 24.83 42.94 1005.83 1.58 49.93 129.60 -15.169673 ... -0.000264 0.000206 0.000196 0 2016-02-29 09:24:21 996.676408 NaN below 42 2776
110865 110866 31.55 27.50 24.83 42.72 1005.85 1.89 49.92 130.51 -15.832622 ... 0.000143 0.000199 -0.000024 0 2016-02-29 09:24:30 1022.779594 NaN below 42 2776
110866 110867 31.58 27.50 24.83 42.83 1005.85 2.09 50.00 132.04 -16.646212 ... 0.000537 0.000257 0.000057 0 2016-02-29 09:24:41 1048.121268 NaN below 42 2776
110867 110868 31.62 27.50 24.83 42.81 1005.88 2.88 49.69 133.00 -17.270447 ... 0.000534 0.000456 0.000195 0 2016-02-29 09:24:50 1073.629703 NaN below 42 2776
110868 110869 31.57 27.51 24.83 42.94 1005.86 2.17 49.77 134.18 -17.885872 ... 0.000459 0.000076 0.000030 0 2016-02-29 09:25:00 1095.760426 NaN below 42 2776

110869 rows × 25 columns

8. Exercise - meteo average temperatures

8.1 meteo plot

✪ Put in a plot the temperature from dataframe meteo:

[60]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# write here


Show solution
[61]:

[61]:
<AxesSubplot:>
../_images/pandas_pandas1-sol_140_1.png

8.2 meteo pressure and raining

✪ In the same plot as above show the pressure and amount of raining.

[62]:
# write here

Show solution
[63]:

../_images/pandas_pandas1-sol_146_0.png

8.3 meteo average temperature

✪✪✪ Calculate the average temperature for each day and show it in the plot, so as to obtain a couple of new columns like these:

    Day       Avg_day_temp
01/11/2017      7.983333
01/11/2017      7.983333
01/11/2017      7.983333
    .               .
    .               .
02/11/2017      7.384375
02/11/2017      7.384375
02/11/2017      7.384375
    .               .
    .               .

HINT 1: add a 'Day' column by extracting only the day from the date. To do it, use the .str accessor applied to the whole column.

HINT 2: There are various ways to solve the exercise:

  • The most performant and elegant way is with the groupby operator, see Pandas transform - more than meets the eye

  • As an alternative, you may use a for loop to cycle through days. Typically, using a for loop is not a good idea with Pandas, as on large datasets it can take a long time to perform the updates. Still, since this dataset is small enough, you should get results in a reasonable amount of time.
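The .str accessor mentioned in HINT 1 can be sketched on a toy series of date strings (invented values in the same dd/mm/yyyy hh:mm format): slicing the first 10 characters of each string keeps only the day part.

```python
import pandas as pd

dates = pd.Series(['01/11/2017 00:00', '01/11/2017 00:15', '02/11/2017 08:30'])
days = dates.str[:10]        # slice each string, keeping 'dd/mm/yyyy'
print(days.tolist())         # ['01/11/2017', '01/11/2017', '02/11/2017']
```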

[64]:
# write here

Show solution
[65]:


    ****************    SOLUTION 1 OUTPUT - recalculate average for every row - slow !

WITH AVERAGE TEMPERATURE
               Date  Pressure  Rain  Temp         Day  Avg_day_temp
0  01/11/2017 00:00     995.4   0.0   5.4  01/11/2017      7.983333
1  01/11/2017 00:15     995.5   0.0   6.0  01/11/2017      7.983333
2  01/11/2017 00:30     995.5   0.0   5.9  01/11/2017      7.983333
3  01/11/2017 00:45     995.7   0.0   5.4  01/11/2017      7.983333
4  01/11/2017 01:00     995.7   0.0   5.3  01/11/2017      7.983333
[65]:
<AxesSubplot:>
../_images/pandas_pandas1-sol_152_2.png
Show solution
[66]:



********    SOLUTION 2 OUTPUT - recalculate average only 30 times
                                by using a dictionary d_avg, faster but not yet optimal
               Date  Pressure  Rain  Temp         Day  Avg_day_temp
0  01/11/2017 00:00     995.4   0.0   5.4  01/11/2017      7.983333
1  01/11/2017 00:15     995.5   0.0   6.0  01/11/2017      7.983333
2  01/11/2017 00:30     995.5   0.0   5.9  01/11/2017      7.983333
3  01/11/2017 00:45     995.7   0.0   5.4  01/11/2017      7.983333
4  01/11/2017 01:00     995.7   0.0   5.3  01/11/2017      7.983333
[66]:
<AxesSubplot:>
../_images/pandas_pandas1-sol_156_2.png
Show solution
[67]:


****************  SOLUTION 3 - OUTPUT  -  best solution with groupby and transform

WITH AVERAGE TEMPERATURE
               Date  Pressure  Rain  Temp         Day  Avg_day_temp
0  01/11/2017 00:00     995.4   0.0   5.4  01/11/2017      7.983333
1  01/11/2017 00:15     995.5   0.0   6.0  01/11/2017      7.983333
2  01/11/2017 00:30     995.5   0.0   5.9  01/11/2017      7.983333
3  01/11/2017 00:45     995.7   0.0   5.4  01/11/2017      7.983333
4  01/11/2017 01:00     995.7   0.0   5.3  01/11/2017      7.983333
[67]:
<AxesSubplot:>
../_images/pandas_pandas1-sol_160_2.png

9. Exercise - Air pollutants

Let’s try analyzing the hourly data from air quality monitoring stations of the Autonomous Province of Trento.

Source: dati.trentino.it

9.1 - load the file

✪ Load the file aria.csv in pandas

IMPORTANT: put the dataframe into the variable aria, so as not to confuse it with the previous datasets.

IMPORTANT: use encoding 'latin-1' (otherwise you might get weird load errors according to your operating system)

IMPORTANT: if you also receive other strange errors, try adding the parameter engine='python'

Show solution
[68]:

# write here


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20693 entries, 0 to 20692
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Stazione         20693 non-null  object
 1   Inquinante       20693 non-null  object
 2   Data             20693 non-null  object
 3   Ora              20693 non-null  int64
 4   Valore           20693 non-null  float64
 5   Unità di misura  20693 non-null  object
dtypes: float64(1), int64(1), object(4)
memory usage: 970.1+ KB

9.2 - pollutants average

✪ Find the average of the PM10 pollutant at Parco S. Chiara (averaged over all days). You should obtain the value 11.385752688172044

Show solution
[69]:
# write here


[69]:
11.385752688172044

9.3 - PM10 chart

✪ Using plt.plot as seen in a previous example (i.e. by directly passing the relevant Pandas series), show in a chart the values of PM10 during May 7th, 2019.

[72]:
# write here


Show solution
[73]:

../_images/pandas_pandas1-sol_179_0.png

10. Merging tables

Suppose we want to add a column with the geographical position of the ISS. To do so, we need to join our dataset with another one containing such information. Let’s take for example the dataset iss-coords.csv

[74]:
iss_coords = pd.read_csv('iss-coords.csv', encoding='UTF-8')
[75]:
iss_coords
[75]:
timestamp lat lon
0 2016-01-01 05:11:30 -45.103458 14.083858
1 2016-01-01 06:49:59 -37.597242 28.931170
2 2016-01-01 11:52:30 17.126141 77.535602
3 2016-01-01 11:52:30 17.126464 77.535861
4 2016-01-01 14:54:08 7.259561 70.001561
... ... ... ...
333 2016-02-29 13:23:17 -51.077590 -31.093987
334 2016-02-29 13:44:13 30.688553 -135.403820
335 2016-02-29 13:44:13 30.688295 -135.403533
336 2016-02-29 18:44:57 27.608774 -130.198781
337 2016-02-29 21:36:47 27.325186 -129.893278

338 rows × 3 columns

We notice there is a timestamp column, which unfortunately has a slightly different name than the time_stamp column (notice the underscore _) of the original astropi dataset:

[76]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 25 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   ROW_ID           110869 non-null  int64
 1   temp_cpu         110869 non-null  float64
 2   temp_h           110869 non-null  float64
 3   temp_p           110869 non-null  float64
 4   humidity         110869 non-null  float64
 5   pressure         110869 non-null  float64
 6   pitch            110869 non-null  float64
 7   roll             110869 non-null  float64
 8   yaw              110869 non-null  float64
 9   mag_x            110869 non-null  float64
 10  mag_y            110869 non-null  float64
 11  mag_z            110869 non-null  float64
 12  accel_x          110869 non-null  float64
 13  accel_y          110869 non-null  float64
 14  accel_z          110869 non-null  float64
 15  gyro_x           110869 non-null  float64
 16  gyro_y           110869 non-null  float64
 17  gyro_z           110869 non-null  float64
 18  reset            110869 non-null  int64
 19  time_stamp       110869 non-null  object
 20  mag_tot          110869 non-null  float64
 21  Too hot          105315 non-null  object
 22  check_p          110869 non-null  object
 23  humidity_int     110869 non-null  int64
 24  Humidity counts  110869 non-null  int64
dtypes: float64(18), int64(4), object(3)
memory usage: 21.1+ MB

To merge the datasets on those columns, we can use the merge command like this:

[77]:
# remember merge produces a NEW dataframe

geo_astropi = df.merge(iss_coords, left_on='time_stamp', right_on='timestamp')

# merge will add both time_stamp and timestamp columns,
# so we remove the duplicate column `timestamp`
geo_astropi = geo_astropi.drop('timestamp', axis=1)
[78]:
geo_astropi
[78]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... gyro_z reset time_stamp mag_tot Too hot check_p humidity_int Humidity counts lat lon
0 23231 32.53 28.37 25.89 45.31 1006.04 1.31 51.63 34.91 21.125001 ... 0.000046 0 2016-02-19 03:49:00 2345.207992 True below 45 32730 31.434741 52.917464
1 27052 32.30 28.12 25.62 45.57 1007.42 1.49 52.29 333.49 16.083471 ... 0.000034 0 2016-02-19 14:30:40 323.634786 True below 45 32730 -46.620658 -57.311657
2 27052 32.30 28.12 25.62 45.57 1007.42 1.49 52.29 333.49 16.083471 ... 0.000034 0 2016-02-19 14:30:40 323.634786 True below 45 32730 -46.620477 -57.311138
3 46933 32.21 28.05 25.50 47.36 1012.41 0.67 52.40 27.57 15.441683 ... 0.000221 0 2016-02-21 22:14:11 342.159257 True over 47 14176 19.138359 -140.211489
4 64572 32.32 28.18 25.61 47.45 1010.62 1.14 51.41 33.68 11.994554 ... 0.000030 0 2016-02-23 23:40:50 264.655601 True over 47 14176 4.713819 80.261665
5 68293 32.39 28.26 25.70 46.83 1010.51 0.61 51.91 287.86 6.554283 ... 0.000171 0 2016-02-24 10:05:51 436.876111 True over 46 35775 -46.061583 22.246025
6 73374 32.38 28.18 25.62 46.52 1008.28 0.90 51.77 30.80 9.947132 ... -0.000375 0 2016-02-25 00:23:01 226.089258 True over 46 35775 47.047346 137.958918
7 90986 32.42 28.34 25.76 45.72 1006.79 0.57 49.85 10.57 7.805606 ... -0.000047 0 2016-02-27 01:43:10 149.700293 True below 45 32730 -41.049112 30.193004
8 90986 32.42 28.34 25.76 45.72 1006.79 0.57 49.85 10.57 7.805606 ... -0.000047 0 2016-02-27 01:43:10 149.700293 True below 45 32730 -8.402991 -100.981726
9 102440 32.62 28.62 26.02 45.15 1006.06 1.12 50.44 301.74 10.348327 ... -0.000061 0 2016-02-28 09:48:40 381.014223 True below 45 32730 50.047523 175.566751

10 rows × 27 columns
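By default merge performs an inner join: rows whose key has no match in the other table are dropped. A minimal sketch on toy frames (invented keys) shows how the how parameter changes this:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

inner = left.merge(right, on='key')                    # default how='inner': only matching keys survive
left_join = left.merge(right, on='key', how='left')    # keep every row of left, NaN where no match
print(len(inner), len(left_join))                      # 2 3
```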

Exercise 10.1 - better merge

If you look closely, the table above does have lat and lon columns, but very few rows. Why? Try to merge the tables in some meaningful way, so as to keep all the original rows with all the cells of lat and lon filled.

Show solution
[79]:
# write here


[79]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... reset time_stamp mag_tot Too hot check_p humidity_int Humidity counts timestamp lat lon
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 ... 20 2016-02-16 10:44:40 2368.337207 True below 44 13029 NaN NaN NaN
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 ... 0 2016-02-16 10:44:50 2615.870247 True below 45 32730 NaN NaN NaN
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 ... 0 2016-02-16 10:45:00 2648.484927 NaN below 45 32730 NaN NaN NaN
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 ... 0 2016-02-16 10:45:10 2665.305485 True below 45 32730 NaN NaN NaN
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 ... 0 2016-02-16 10:45:20 2732.388620 NaN below 45 32730 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110866 110865 31.56 27.52 24.83 42.94 1005.83 1.58 49.93 129.60 -15.169673 ... 0 2016-02-29 09:24:21 996.676408 NaN below 42 2776 NaN NaN NaN
110867 110866 31.55 27.50 24.83 42.72 1005.85 1.89 49.92 130.51 -15.832622 ... 0 2016-02-29 09:24:30 1022.779594 NaN below 42 2776 NaN NaN NaN
110868 110867 31.58 27.50 24.83 42.83 1005.85 2.09 50.00 132.04 -16.646212 ... 0 2016-02-29 09:24:41 1048.121268 NaN below 42 2776 NaN NaN NaN
110869 110868 31.62 27.50 24.83 42.81 1005.88 2.88 49.69 133.00 -17.270447 ... 0 2016-02-29 09:24:50 1073.629703 NaN below 42 2776 NaN NaN NaN
110870 110869 31.57 27.51 24.83 42.94 1005.86 2.17 49.77 134.18 -17.885872 ... 0 2016-02-29 09:25:00 1095.760426 NaN below 42 2776 NaN NaN NaN

110871 rows × 28 columns

11. GeoPandas

You can easily manipulate geographical data with the GeoPandas library. For some nice online tutorials, we refer to the Geospatial Analysis and Representation for Data Science course website of the Master in Data Science, University of Trento, by Maurizio Napolitano (FBK)

Continue

Go on with more exercises worksheet