Pandas solutions

1. Introduction

Today we will analyze data with Pandas:

  • data analysis with the Pandas library

  • plotting with Matplotlib

  • examples from the AstroPi dataset

  • exercises with the meteotrentino dataset

Python gives powerful tools for data analysis:


One of these is Pandas, which provides fast and flexible data structures, especially suited to interactive data analysis.

What to do

Unzip the exercises into a folder; you should get something like this:

pandas
    pandas.ipynb
    pandas-sol.ipynb
    eures-jobs-sol.ipynb
    jupman.py

WARNING 1: to correctly visualize the notebook, it MUST be in an unzipped folder!

  1. open Jupyter Notebook from that folder. Two things should open: first a console, and then a browser.

  2. The browser should show a file list: navigate the list and open the notebook pandas/pandas.ipynb

WARNING 2: DO NOT use the Upload button in Jupyter; instead, navigate in the Jupyter browser to the unzipped folder!

  1. Go on reading that notebook, and follow the instructions inside.

Shortcut keys:

  • to execute Python code inside a Jupyter cell, press Control + Enter

  • to execute Python code inside a Jupyter cell AND select next cell, press Shift + Enter

  • to execute Python code inside a Jupyter cell AND create a new cell afterwards, press Alt + Enter

  • If the notebook looks stuck, try selecting Kernel -> Restart

Check installation

First let’s check whether pandas is already installed on your system. Try executing this cell with Ctrl-Enter:

[2]:
import pandas as pd

If you saw no error messages, you can skip installation, otherwise do this:

  • Anaconda - open Anaconda Prompt and issue this:

conda install pandas

  • Without Anaconda (--user installs in your home):

python3 -m pip install --user pandas

2. Data analysis of Astro Pi

Let’s try analyzing data recorded by a Raspberry Pi aboard the International Space Station, downloaded from here:

raspberrypi.org/learning/astro-pi-flight-data-analysis/worksheet

in which you can find a detailed description of the data gathered by the sensors during February 2016 (one record every 10 seconds).


The read_csv function imports data from a CSV file and stores it in a DataFrame structure.

In this exercise we shall use the file Columbus_Ed_astro_pi_datalog.csv

[3]:
import pandas as pd   # we import pandas and for ease we rename it to 'pd'
import numpy as np    # we import numpy and for ease we rename it to 'np'

# remember the encoding !
df = pd.read_csv('Columbus_Ed_astro_pi_datalog.csv', encoding='UTF-8')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 20 columns):
ROW_ID        110869 non-null int64
temp_cpu      110869 non-null float64
temp_h        110869 non-null float64
temp_p        110869 non-null float64
humidity      110869 non-null float64
pressure      110869 non-null float64
pitch         110869 non-null float64
roll          110869 non-null float64
yaw           110869 non-null float64
mag_x         110869 non-null float64
mag_y         110869 non-null float64
mag_z         110869 non-null float64
accel_x       110869 non-null float64
accel_y       110869 non-null float64
accel_z       110869 non-null float64
gyro_x        110869 non-null float64
gyro_y        110869 non-null float64
gyro_z        110869 non-null float64
reset         110869 non-null int64
time_stamp    110869 non-null object
dtypes: float64(17), int64(2), object(1)
memory usage: 16.9+ MB

We can quickly see the number of rows and columns of the dataframe with the shape attribute:

NOTE: shape is an attribute, so it is not followed by round parentheses!

[4]:
df.shape
[4]:
(110869, 20)
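Since shape is an attribute and not a method, it can be read and unpacked directly; a minimal sketch on a toy dataframe (not the AstroPi data):

```python
import pandas as pd

# a tiny toy DataFrame, just to show the attribute
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

print(toy.shape)        # (3, 2): 3 rows, 2 columns - no parentheses after shape
rows, cols = toy.shape  # shape is a plain tuple, so it can be unpacked
```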

The describe method gives you many summary statistics on the fly:

[5]:
df.describe()
[5]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset
count 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.00000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 110869.000000 1.108690e+05 110869.000000 1.108690e+05 110869.000000
mean 55435.000000 32.236259 28.101773 25.543272 46.252005 1008.126788 2.770553 51.807973 200.90126 -19.465265 -1.174493 -6.004529 -0.000630 0.018504 0.014512 -8.959493e-07 0.000007 -9.671594e-07 0.000180
std 32005.267835 0.360289 0.369256 0.380877 1.907273 3.093485 21.848940 2.085821 84.47763 28.120202 15.655121 8.552481 0.000224 0.000604 0.000312 2.807614e-03 0.002456 2.133104e-03 0.060065
min 1.000000 31.410000 27.200000 24.530000 42.270000 1001.560000 0.000000 30.890000 0.01000 -73.046240 -43.810030 -41.163040 -0.025034 -0.005903 -0.022900 -3.037930e-01 -0.378412 -2.970800e-01 0.000000
25% 27718.000000 31.960000 27.840000 25.260000 45.230000 1006.090000 1.140000 51.180000 162.43000 -41.742792 -12.982321 -11.238430 -0.000697 0.018009 0.014349 -2.750000e-04 -0.000278 -1.200000e-04 0.000000
50% 55435.000000 32.280000 28.110000 25.570000 46.130000 1007.650000 1.450000 51.950000 190.58000 -21.339485 -1.350467 -5.764400 -0.000631 0.018620 0.014510 -3.000000e-06 -0.000004 -1.000000e-06 0.000000
75% 83152.000000 32.480000 28.360000 25.790000 46.880000 1010.270000 1.740000 52.450000 256.34000 7.299000 11.912456 -0.653705 -0.000567 0.018940 0.014673 2.710000e-04 0.000271 1.190000e-04 0.000000
max 110869.000000 33.700000 29.280000 26.810000 60.590000 1021.780000 360.000000 359.400000 359.98000 33.134748 37.552135 31.003047 0.018708 0.041012 0.029938 2.151470e-01 0.389499 2.698760e-01 20.000000

QUESTION: is some field missing from the table produced by describe? Why is it not included?

To limit describe to a single column like humidity, you can write this:

[6]:
df['humidity'].describe()
[6]:
count    110869.000000
mean         46.252005
std           1.907273
min          42.270000
25%          45.230000
50%          46.130000
75%          46.880000
max          60.590000
Name: humidity, dtype: float64

Dot notation is even handier:

[7]:
df.humidity.describe()
[7]:
count    110869.000000
mean         46.252005
std           1.907273
min          42.270000
25%          45.230000
50%          46.130000
75%          46.880000
max          60.590000
Name: humidity, dtype: float64

WARNING: careful about spaces!

In case the field name contains spaces (e.g. 'blender rotations'), you cannot use dot notation; use the square bracket notation seen above instead (i.e. df['blender rotations'].describe())
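A minimal sketch with a hypothetical 'blender rotations' column:

```python
import pandas as pd

# hypothetical dataframe whose column name contains a space
d = pd.DataFrame({'blender rotations': [10, 20, 30]})

# d.blender rotations would be a syntax error: use square brackets instead
stats = d['blender rotations'].describe()
print(stats['mean'])
```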

The head method returns the first rows:

[8]:
df.head()
[8]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 -8.132907 -12.129346 -0.000468 0.019439 0.014569 0.000942 0.000492 -0.000750 20 2016-02-16 10:44:40
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 -8.304243 -12.943096 -0.000614 0.019436 0.014577 0.000218 -0.000005 -0.000235 0 2016-02-16 10:44:50
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 -8.470832 -12.642772 -0.000569 0.019359 0.014357 0.000395 0.000600 -0.000003 0 2016-02-16 10:45:00
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 -8.457380 -12.615509 -0.000575 0.019383 0.014409 0.000308 0.000577 -0.000102 0 2016-02-16 10:45:10
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 -8.122609 -12.678341 -0.000548 0.019378 0.014380 0.000321 0.000691 0.000272 0 2016-02-16 10:45:20

The tail method returns the last rows:

[9]:
df.tail()
[9]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
110864 110865 31.56 27.52 24.83 42.94 1005.83 1.58 49.93 129.60 -15.169673 -27.642610 1.563183 -0.000682 0.017743 0.014646 -0.000264 0.000206 0.000196 0 2016-02-29 09:24:21
110865 110866 31.55 27.50 24.83 42.72 1005.85 1.89 49.92 130.51 -15.832622 -27.729389 1.785682 -0.000736 0.017570 0.014855 0.000143 0.000199 -0.000024 0 2016-02-29 09:24:30
110866 110867 31.58 27.50 24.83 42.83 1005.85 2.09 50.00 132.04 -16.646212 -27.719479 1.629533 -0.000647 0.017657 0.014799 0.000537 0.000257 0.000057 0 2016-02-29 09:24:41
110867 110868 31.62 27.50 24.83 42.81 1005.88 2.88 49.69 133.00 -17.270447 -27.793136 1.703806 -0.000835 0.017635 0.014877 0.000534 0.000456 0.000195 0 2016-02-29 09:24:50
110868 110869 31.57 27.51 24.83 42.94 1005.86 2.17 49.77 134.18 -17.885872 -27.824149 1.293345 -0.000787 0.017261 0.014380 0.000459 0.000076 0.000030 0 2016-02-29 09:25:00

The columns attribute gives the column headers:

[10]:
df.columns
[10]:
Index(['ROW_ID', 'temp_cpu', 'temp_h', 'temp_p', 'humidity', 'pressure',
       'pitch', 'roll', 'yaw', 'mag_x', 'mag_y', 'mag_z', 'accel_x', 'accel_y',
       'accel_z', 'gyro_x', 'gyro_y', 'gyro_z', 'reset', 'time_stamp'],
      dtype='object')

Note: as you can see above, the type of the returned object is not a list, but a special container defined by pandas:

[11]:
type(df.columns)
[11]:
pandas.core.indexes.base.Index

Nevertheless, we can access the elements of this container using indices within square brackets:

[12]:
df.columns[0]
[12]:
'ROW_ID'
[13]:
df.columns[1]
[13]:
'temp_cpu'
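If a plain Python list is needed, the Index container can be converted with list(); a minimal sketch:

```python
import pandas as pd

d = pd.DataFrame({'x': [1], 'y': [2]})

cols = list(d.columns)   # convert the pandas Index container to a plain Python list
print(cols)
```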

2.1 Exercise: meteo info

✪ a) Create a new dataframe called meteo by importing the data from file meteo.csv, which contains the meteo data of Trento from November 2017 (source: https://www.meteotrentino.it). IMPORTANT: assign the dataframe to a variable called meteo (so we avoid confusion with the AstroPi dataframe)

  1. Visualize the information about this dataframe.

Show solution
[14]:
# write here - create dataframe


COLUMNS:

Index(['Date', 'Pressure', 'Rain', 'Temp'], dtype='object')

INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2878 entries, 0 to 2877
Data columns (total 4 columns):
Date        2878 non-null object
Pressure    2878 non-null float64
Rain        2878 non-null float64
Temp        2878 non-null float64
dtypes: float64(3), object(1)
memory usage: 90.0+ KB
None

HEAD():
[14]:
Date Pressure Rain Temp
0 01/11/2017 00:00 995.4 0.0 5.4
1 01/11/2017 00:15 995.5 0.0 6.0
2 01/11/2017 00:30 995.5 0.0 5.9
3 01/11/2017 00:45 995.7 0.0 5.4
4 01/11/2017 01:00 995.7 0.0 5.3
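A possible solution follows the same pattern used for the AstroPi file; to keep this sketch self-contained, the file contents are simulated here with an io.StringIO stand-in (the real exercise reads meteo.csv from disk):

```python
import io
import pandas as pd

# stand-in for the first rows of meteo.csv; with the real file you would write:
#   meteo = pd.read_csv('meteo.csv', encoding='UTF-8')
csv_text = """Date,Pressure,Rain,Temp
01/11/2017 00:00,995.4,0.0,5.4
01/11/2017 00:15,995.5,0.0,6.0
01/11/2017 00:30,995.5,0.0,5.9
"""
meteo = pd.read_csv(io.StringIO(csv_text))

print('COLUMNS:')
print(meteo.columns)
print('INFO:')
meteo.info()
print('HEAD():')
print(meteo.head())
```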

3. Indexing, filtering, ordering

To obtain the i-th row as a Series you can use iloc[i] (here we reuse the AstroPi dataset):

[15]:
df.iloc[6]
[15]:
ROW_ID                          7
temp_cpu                    31.68
temp_h                      27.53
temp_p                      25.01
humidity                    45.31
pressure                   1001.7
pitch                        0.63
roll                        53.55
yaw                         186.1
mag_x                    -50.4473
mag_y                    -7.93731
mag_z                    -12.1886
accel_x                  -0.00051
accel_y                  0.019264
accel_z                  0.014528
gyro_x                  -0.000111
gyro_y                    0.00032
gyro_z                   0.000222
reset                           0
time_stamp    2016-02-16 10:45:41
Name: 6, dtype: object

It is possible to select contiguous rows of a dataframe using slicing:

Here for example we select the rows from position 5 included to position 7 excluded:

[16]:
df.iloc[5:7]
[16]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
5 6 31.69 27.55 25.01 45.12 1001.67 0.85 53.53 185.52 -50.246476 -8.343209 -11.938124 -0.000536 0.019453 0.014380 0.000273 0.000494 -0.000059 0 2016-02-16 10:45:30
6 7 31.68 27.53 25.01 45.31 1001.70 0.63 53.55 186.10 -50.447346 -7.937309 -12.188574 -0.000510 0.019264 0.014528 -0.000111 0.000320 0.000222 0 2016-02-16 10:45:41
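The same positional rules can be seen on a toy dataframe; the end of the slice is excluded, as in Python list slicing:

```python
import pandas as pd

toy = pd.DataFrame({'v': [10, 11, 12, 13, 14]})

row = toy.iloc[2]       # the row at position 2, returned as a Series
chunk = toy.iloc[1:3]   # positions 1 and 2: the end position is excluded
print(row.v, len(chunk))
```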

It is possible to filter data according to a condition.

We can discover the data type of an expression, for example df.ROW_ID >= 6:

[17]:
type(df.ROW_ID >= 6)
[17]:
pandas.core.series.Series

What is contained in this Series object? If we try printing it we will see it is a series of True or False values, according to whether ROW_ID is greater than or equal to 6:

[18]:
df.ROW_ID >= 6
[18]:
0         False
1         False
2         False
3         False
4         False
5          True
6          True
7          True
8          True
9          True
10         True
11         True
12         True
13         True
14         True
15         True
16         True
17         True
18         True
19         True
20         True
21         True
22         True
23         True
24         True
25         True
26         True
27         True
28         True
29         True
          ...
110839     True
110840     True
110841     True
110842     True
110843     True
110844     True
110845     True
110846     True
110847     True
110848     True
110849     True
110850     True
110851     True
110852     True
110853     True
110854     True
110855     True
110856     True
110857     True
110858     True
110859     True
110860     True
110861     True
110862     True
110863     True
110864     True
110865     True
110866     True
110867     True
110868     True
Name: ROW_ID, Length: 110869, dtype: bool

Analogously, (df.ROW_ID >= 6) & (df.ROW_ID <= 10) is a series of True or False values, telling whether ROW_ID is at the same time greater than or equal to 6 and less than or equal to 10:

[19]:
type((df.ROW_ID >= 6) & (df.ROW_ID <= 10))
[19]:
pandas.core.series.Series

If we want the complete rows of the dataframe which satisfy the condition, we can write this:

IMPORTANT: we use df outside the square brackets in the expression df[     ] to tell Python we want to filter the df dataframe, and use df again inside the brackets to tell which columns and rows we want to filter on

[20]:
df[  (df.ROW_ID >= 6) & (df.ROW_ID <= 10)  ]
[20]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
5 6 31.69 27.55 25.01 45.12 1001.67 0.85 53.53 185.52 -50.246476 -8.343209 -11.938124 -0.000536 0.019453 0.014380 0.000273 0.000494 -0.000059 0 2016-02-16 10:45:30
6 7 31.68 27.53 25.01 45.31 1001.70 0.63 53.55 186.10 -50.447346 -7.937309 -12.188574 -0.000510 0.019264 0.014528 -0.000111 0.000320 0.000222 0 2016-02-16 10:45:41
7 8 31.66 27.55 25.01 45.34 1001.70 1.49 53.65 186.08 -50.668232 -7.762600 -12.284196 -0.000523 0.019473 0.014298 -0.000044 0.000436 0.000301 0 2016-02-16 10:45:50
8 9 31.67 27.54 25.01 45.20 1001.72 1.22 53.77 186.55 -50.761529 -7.262934 -11.981090 -0.000522 0.019385 0.014286 0.000358 0.000651 0.000187 0 2016-02-16 10:46:01
9 10 31.67 27.54 25.01 45.41 1001.75 1.63 53.46 185.94 -51.243832 -6.875270 -11.672494 -0.000581 0.019390 0.014441 0.000266 0.000676 0.000356 0 2016-02-16 10:46:10
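Note that the parentheses around each condition are mandatory, because & binds more tightly than the comparison operators in Python; a minimal sketch on a toy dataframe:

```python
import pandas as pd

toy = pd.DataFrame({'ROW_ID': range(1, 12)})   # ROW_ID from 1 to 11

# parentheses around each condition are mandatory:
# & binds more tightly than >= in Python
mask = (toy.ROW_ID >= 6) & (toy.ROW_ID <= 10)
filtered = toy[mask]
print(list(filtered.ROW_ID))
```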

So if we want to find the record where pressure is maximal, we use the values property of the series, on which we calculate the maximum value:

[21]:
df[  (df.pressure == df.pressure.values.max())  ]
[21]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
77602 77603 32.44 28.31 25.74 47.57 1021.78 1.1 51.82 267.39 -0.797428 10.891803 -15.728202 -0.000612 0.01817 0.014295 -0.000139 -0.000179 -0.000298 0 2016-02-25 12:13:20
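As an alternative, pandas also offers idxmax, which directly returns the index label of the (first) row holding the maximum value; a minimal sketch:

```python
import pandas as pd

toy = pd.DataFrame({'pressure': [1001.2, 1021.78, 1008.0]})

i = toy.pressure.idxmax()          # index label of the row with maximal pressure
print(i, toy.loc[i, 'pressure'])
```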

The sort_values method returns a dataframe ordered according to one or more columns:

[22]:
df.sort_values('pressure',ascending=False).head()
[22]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp
77602 77603 32.44 28.31 25.74 47.57 1021.78 1.10 51.82 267.39 -0.797428 10.891803 -15.728202 -0.000612 0.018170 0.014295 -0.000139 -0.000179 -0.000298 0 2016-02-25 12:13:20
77601 77602 32.45 28.30 25.74 47.26 1021.75 1.53 51.76 266.12 -1.266335 10.927442 -15.690558 -0.000661 0.018357 0.014533 0.000152 0.000459 -0.000298 0 2016-02-25 12:13:10
77603 77604 32.44 28.30 25.74 47.29 1021.75 1.86 51.83 268.83 -0.320795 10.651441 -15.565123 -0.000648 0.018290 0.014372 0.000049 0.000473 -0.000029 0 2016-02-25 12:13:30
77604 77605 32.43 28.30 25.74 47.39 1021.75 1.78 51.54 269.41 -0.130574 10.628383 -15.488983 -0.000672 0.018154 0.014602 0.000360 0.000089 -0.000002 0 2016-02-25 12:13:40
77608 77609 32.42 28.29 25.74 47.36 1021.73 0.86 51.89 272.77 0.952025 10.435951 -16.027235 -0.000607 0.018186 0.014232 -0.000260 -0.000059 -0.000187 0 2016-02-25 12:14:20

The loc property allows filtering rows according to a condition and selecting a column, which can be new. Here, for rows where the temperature is too high, we write the value True in the fields of the column with header 'Too hot':

[23]:
df.loc[(df.temp_cpu > 31.68),'Too hot'] = True

Let’s see the resulting table (scroll to the end to see the new column). Note that the values in the rows we did not select are represented as NaN, which literally means not a number:

[24]:
df.head()
[24]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp Too hot
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 ... -12.129346 -0.000468 0.019439 0.014569 0.000942 0.000492 -0.000750 20 2016-02-16 10:44:40 True
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 ... -12.943096 -0.000614 0.019436 0.014577 0.000218 -0.000005 -0.000235 0 2016-02-16 10:44:50 True
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 ... -12.642772 -0.000569 0.019359 0.014357 0.000395 0.000600 -0.000003 0 2016-02-16 10:45:00 NaN
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 ... -12.615509 -0.000575 0.019383 0.014409 0.000308 0.000577 -0.000102 0 2016-02-16 10:45:10 True
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 ... -12.678341 -0.000548 0.019378 0.014380 0.000321 0.000691 0.000272 0 2016-02-16 10:45:20 NaN

5 rows × 21 columns

Pandas is a very flexible library, and provides several methods to obtain the same results. For example, we can perform an operation similar to the one above with np.where, as shown below: we add a column telling whether the pressure is below ('sotto') or above ('sopra') the average:

[25]:
avg_pressure = df.pressure.values.mean()
df['check_p'] = np.where(df.pressure <= avg_pressure, 'sotto', 'sopra')  # 'sotto' = below, 'sopra' = above
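The same pattern can be tried on a toy dataframe (here with English labels instead of 'sotto'/'sopra', which mean below/above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'pressure': [1001.0, 1010.0, 1005.0]})
avg = toy.pressure.mean()

# one label per row, chosen element-wise by the condition
toy['check_p'] = np.where(toy.pressure <= avg, 'below', 'above')
print(list(toy.check_p))
```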

3.1 Exercise: Meteo stats

✪ Analyze data from Dataframe meteo and find:

  • the average, minimum and maximum values of pressure

  • average temperature

  • the dates of rainy days

Show solution
[26]:
# write here


Average pressure : 986.3408269631689
Minimal pressure : 966.3
Maximal pressure : 998.3
Average temperature : 6.410701876302988
[26]:
Date Pressure Rain Temp
433 05/11/2017 12:15 979.2 0.2 8.6
435 05/11/2017 12:45 978.9 0.2 8.4
436 05/11/2017 13:00 979.0 0.2 8.4
437 05/11/2017 13:15 979.1 0.8 8.2
438 05/11/2017 13:30 979.0 0.6 8.2
439 05/11/2017 13:45 978.8 0.4 8.2
440 05/11/2017 14:00 978.7 0.8 8.2
441 05/11/2017 14:15 978.4 0.6 8.3
442 05/11/2017 14:30 978.2 0.6 8.2
443 05/11/2017 14:45 978.1 0.6 8.2
444 05/11/2017 15:00 978.1 0.4 8.1
445 05/11/2017 15:15 977.9 0.4 8.1
446 05/11/2017 15:30 977.9 0.4 8.1
448 05/11/2017 16:00 977.4 0.2 8.1
455 05/11/2017 17:45 977.1 0.2 8.1
456 05/11/2017 18:00 977.1 0.2 8.2
457 05/11/2017 18:15 977.1 0.2 8.2
458 05/11/2017 18:30 976.8 0.2 8.3
459 05/11/2017 18:45 976.7 0.4 8.3
460 05/11/2017 19:00 976.5 0.2 8.4
461 05/11/2017 19:15 976.5 0.2 8.5
462 05/11/2017 19:30 976.3 0.2 8.5
463 05/11/2017 19:45 976.1 0.4 8.6
464 05/11/2017 20:00 976.3 0.2 8.7
465 05/11/2017 20:15 976.1 0.4 8.7
466 05/11/2017 20:30 976.1 0.4 8.7
467 05/11/2017 20:45 976.2 0.2 8.7
468 05/11/2017 21:00 976.4 0.6 8.8
469 05/11/2017 21:15 976.4 0.6 8.7
470 05/11/2017 21:30 976.9 1.2 8.7
... ... ... ... ...
1150 12/11/2017 23:45 970.1 0.6 5.3
1151 13/11/2017 00:00 969.9 0.4 5.6
1152 13/11/2017 00:15 970.1 0.6 5.5
1153 13/11/2017 00:30 970.4 0.6 5.1
1154 13/11/2017 00:45 970.4 0.6 5.2
1155 13/11/2017 01:00 970.4 0.2 4.7
1159 13/11/2017 02:00 969.5 0.2 5.4
2338 25/11/2017 09:15 985.9 0.2 5.0
2346 25/11/2017 11:15 984.6 0.2 5.0
2347 25/11/2017 11:30 984.2 0.4 5.0
2348 25/11/2017 11:45 984.1 0.2 4.8
2349 25/11/2017 12:00 983.7 0.2 4.9
2350 25/11/2017 12:15 983.6 0.2 4.9
2352 25/11/2017 12:45 983.2 0.2 4.9
2353 25/11/2017 13:00 983.0 0.2 5.0
2354 25/11/2017 13:15 982.6 0.2 5.0
2355 25/11/2017 13:30 982.5 0.2 4.9
2356 25/11/2017 13:45 982.4 0.2 4.9
2358 25/11/2017 14:15 982.0 0.2 4.8
2359 25/11/2017 14:30 982.1 0.2 4.8
2362 25/11/2017 15:15 981.5 0.2 4.9
2363 25/11/2017 15:30 981.2 0.2 5.0
2364 25/11/2017 15:45 981.1 0.2 5.0
2366 25/11/2017 16:15 981.0 0.2 5.0
2736 29/11/2017 12:45 978.0 0.2 0.9
2754 29/11/2017 17:15 976.1 0.2 0.9
2755 29/11/2017 17:30 975.9 0.2 0.9
2802 30/11/2017 05:15 971.3 0.2 1.3
2803 30/11/2017 05:30 971.3 0.2 1.1
2804 30/11/2017 05:45 971.5 0.2 1.1

107 rows × 4 columns
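A possible solution sketch, on a small stand-in for the meteo data (the real exercise uses the full meteo dataframe):

```python
import io
import pandas as pd

# small stand-in for a few meteo.csv rows
csv_text = """Date,Pressure,Rain,Temp
01/11/2017 00:00,995.4,0.0,5.4
05/11/2017 12:15,979.2,0.2,8.6
05/11/2017 12:45,978.9,0.2,8.4
"""
meteo = pd.read_csv(io.StringIO(csv_text))

print('Average pressure :', meteo.Pressure.mean())
print('Minimal pressure :', meteo.Pressure.min())
print('Maximal pressure :', meteo.Pressure.max())
print('Average temperature :', meteo.Temp.mean())

rainy = meteo[meteo.Rain > 0]     # rows where some rain was recorded
print(rainy)
```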

4. MatPlotLib review

We’ve already seen Matplotlib in the part about visualization, and today we use it to display data.

Let’s revisit an example using the MATLAB-style approach. We will plot a line through points given by two lists of coordinates, one for the xs and one for the ys:

[27]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
[28]:

x = [1,2,3,4]
y = [2,4,6,8]
plt.plot(x, y) # we can directly pass x and y lists
plt.title('Some number')
plt.show()

../_images/pandas_pandas-sol_65_0.png

We can also create the series with numpy. Let’s try making a parabola:

[29]:
x = np.arange(0.,5.,0.1)
# '**' is the power operator in Python, NOT '^'
y = x**2

Let’s use the type function to find out the data types of x and y:

[30]:
type(x)
[30]:
numpy.ndarray
[31]:
type(y)
[31]:
numpy.ndarray

Hence we have NumPy arrays.

[32]:
plt.title('The parabola')
plt.plot(x,y);
../_images/pandas_pandas-sol_72_0.png

If we want the x axis to have the same scale as the y axis, we can use the gca function.

To set the x and y limits, we can use xlim and ylim:

[33]:
plt.xlim([0, 5])
plt.ylim([0,10])
plt.title('The parabola')

plt.gca().set_aspect('equal')
plt.plot(x,y);


../_images/pandas_pandas-sol_74_0.png

Matplotlib plots from pandas data structures

We can get plots directly from pandas data structures, again using the MATLAB style. Here is the documentation of DataFrame.plot. Let’s make an example. With large amounts of data, it may be useful to get a qualitative idea of the data by putting them in a plot:

[34]:
df.humidity.plot(label="Humidity", legend=True)
# with secondary_y=True the scale of the y axis for this series
# is displayed on the right side of the plot
df.pressure.plot(secondary_y=True, label="Pressure", legend=True);
../_images/pandas_pandas-sol_76_0.png

We can put the pressure values on the horizontal axis, and see which humidity values on the vertical axis correspond to a given pressure:

[35]:
plt.plot(df['pressure'], df['humidity'])
[35]:
[<matplotlib.lines.Line2D at 0x7f30d66f1cc0>]
../_images/pandas_pandas-sol_78_1.png

Let’s select in the new dataframe df2 the rows between the 12500th (included) and the 15000th (excluded):

[36]:
df2=df.iloc[12500:15000]
[37]:
plt.plot(df2['pressure'], df2['humidity'])
[37]:
[<matplotlib.lines.Line2D at 0x7f30d665f588>]
../_images/pandas_pandas-sol_81_1.png
[38]:
df2.humidity.plot(label="Humidity", legend=True)
df2.pressure.plot(secondary_y=True, label="Pressure", legend=True)
[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f30d661ebe0>
../_images/pandas_pandas-sol_82_1.png

With the corr method we can see the correlation between the DataFrame columns:

[39]:
df2.corr()
[39]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x mag_y mag_z accel_x accel_y accel_z gyro_x gyro_y gyro_z reset
ROW_ID 1.000000 0.561540 0.636899 0.730764 0.945210 0.760732 0.005633 0.266995 0.172192 -0.108713 0.057601 -0.270656 0.015936 0.121838 0.075160 -0.014346 -0.026012 0.011714 NaN
temp_cpu 0.561540 1.000000 0.591610 0.670043 0.488038 0.484902 0.025618 0.165540 0.056950 -0.019815 -0.028729 -0.193077 -0.021093 0.108878 0.065628 -0.019478 -0.007527 -0.006737 NaN
temp_h 0.636899 0.591610 1.000000 0.890775 0.539603 0.614536 0.022718 0.196767 -0.024700 -0.151336 0.031512 -0.260633 -0.009408 0.173037 0.129074 -0.005255 -0.017054 -0.016113 NaN
temp_p 0.730764 0.670043 0.890775 1.000000 0.620307 0.650015 0.019178 0.192621 0.007474 -0.060122 -0.039648 -0.285640 -0.034348 0.187457 0.144595 -0.010679 -0.016674 -0.017010 NaN
humidity 0.945210 0.488038 0.539603 0.620307 1.000000 0.750000 0.012247 0.231316 0.181905 -0.108781 0.131218 -0.191957 0.040452 0.069717 0.021627 0.005625 -0.001927 0.014431 NaN
pressure 0.760732 0.484902 0.614536 0.650015 0.750000 1.000000 0.037081 0.225112 0.070603 -0.246485 0.194611 -0.173808 0.085183 -0.032049 -0.068296 -0.014838 -0.008821 0.032056 NaN
pitch 0.005633 0.025618 0.022718 0.019178 0.012247 0.037081 1.000000 0.068880 0.030448 -0.008220 -0.002278 -0.019085 0.024460 -0.053634 -0.029345 0.040685 0.041674 -0.024081 NaN
roll 0.266995 0.165540 0.196767 0.192621 0.231316 0.225112 0.068880 1.000000 -0.053750 -0.281035 -0.479779 -0.665041 0.057330 -0.049233 -0.153524 0.139427 0.134319 -0.078113 NaN
yaw 0.172192 0.056950 -0.024700 0.007474 0.181905 0.070603 0.030448 -0.053750 1.000000 0.536693 0.300571 0.394324 -0.028267 0.078585 0.068321 -0.021071 -0.009650 0.064290 NaN
mag_x -0.108713 -0.019815 -0.151336 -0.060122 -0.108781 -0.246485 -0.008220 -0.281035 0.536693 1.000000 0.046591 0.475674 -0.097520 0.168764 0.115423 -0.017739 -0.006722 0.008456 NaN
mag_y 0.057601 -0.028729 0.031512 -0.039648 0.131218 0.194611 -0.002278 -0.479779 0.300571 0.046591 1.000000 0.794756 0.046693 -0.035111 -0.022579 -0.084045 -0.061460 0.115327 NaN
mag_z -0.270656 -0.193077 -0.260633 -0.285640 -0.191957 -0.173808 -0.019085 -0.665041 0.394324 0.475674 0.794756 1.000000 0.001699 -0.020016 -0.006496 -0.092749 -0.060097 0.101276 NaN
accel_x 0.015936 -0.021093 -0.009408 -0.034348 0.040452 0.085183 0.024460 0.057330 -0.028267 -0.097520 0.046693 0.001699 1.000000 -0.197363 -0.174005 -0.016811 -0.013694 -0.017850 NaN
accel_y 0.121838 0.108878 0.173037 0.187457 0.069717 -0.032049 -0.053634 -0.049233 0.078585 0.168764 -0.035111 -0.020016 -0.197363 1.000000 0.424272 -0.023942 -0.054733 0.014870 NaN
accel_z 0.075160 0.065628 0.129074 0.144595 0.021627 -0.068296 -0.029345 -0.153524 0.068321 0.115423 -0.022579 -0.006496 -0.174005 0.424272 1.000000 0.006313 -0.011883 -0.015390 NaN
gyro_x -0.014346 -0.019478 -0.005255 -0.010679 0.005625 -0.014838 0.040685 0.139427 -0.021071 -0.017739 -0.084045 -0.092749 -0.016811 -0.023942 0.006313 1.000000 0.802471 -0.012705 NaN
gyro_y -0.026012 -0.007527 -0.017054 -0.016674 -0.001927 -0.008821 0.041674 0.134319 -0.009650 -0.006722 -0.061460 -0.060097 -0.013694 -0.054733 -0.011883 0.802471 1.000000 -0.043332 NaN
gyro_z 0.011714 -0.006737 -0.016113 -0.017010 0.014431 0.032056 -0.024081 -0.078113 0.064290 0.008456 0.115327 0.101276 -0.017850 0.014870 -0.015390 -0.012705 -0.043332 1.000000 NaN
reset NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5. Calculating new columns

It is possible to obtain new columns by calculating them from other columns. For example, we create the new column mag_tot, the squared magnitude of the magnetic field measured by the space station, computed from mag_x, mag_y, and mag_z, and then plot it:

[40]:
df['mag_tot'] = df['mag_x']**2 + df['mag_y']**2 + df['mag_z']**2
df.mag_tot.plot()
[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f30d66080b8>
../_images/pandas_pandas-sol_86_1.png

Let’s find when the magnetic field was maximal:

[41]:
df['time_stamp'][(df.mag_tot == df.mag_tot.values.max())]
[41]:
96156    2016-02-27 16:12:31
Name: time_stamp, dtype: object

By entering the timestamp we found on the website isstracker.com/historical, we can find the position of the station where the magnetic field was at its highest.

5.1 Exercise: Meteo Fahrenheit temperature

In the meteo dataframe, create a column Temp (Fahrenheit) with the temperature measured in degrees Fahrenheit.

The formula to convert from degrees Celsius (C):

\(Fahrenheit = \frac{9}{5}C + 32\)

[42]:
# write here


Show solution
[43]:


       **************  SOLUTION OUTPUT  **************
[43]:
Date Pressure Rain Temp Temp (Fahrenheit)
0 01/11/2017 00:00 995.4 0.0 5.4 41.72
1 01/11/2017 00:15 995.5 0.0 6.0 42.80
2 01/11/2017 00:30 995.5 0.0 5.9 42.62
3 01/11/2017 00:45 995.7 0.0 5.4 41.72
4 01/11/2017 01:00 995.7 0.0 5.3 41.54
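The formula can be applied column-wise; a possible solution sketch on a small stand-in for the meteo Temp column (the real exercise uses the meteo dataframe):

```python
import pandas as pd

# stand-in for the meteo Temp column
meteo = pd.DataFrame({'Temp': [5.4, 6.0, 5.9]})

# vectorized arithmetic: the formula is applied element by element
meteo['Temp (Fahrenheit)'] = meteo['Temp'] * 9 / 5 + 32
print(meteo)
```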

5.2 Exercise: Pressure vs Temperature

Pressure should be directly proportional to temperature in a closed environment (Gay-Lussac’s law):

\(\frac{P}{T} = k\)

Does this hold true for the meteo dataset? Try to find out by computing the formula directly, and compare with the results of the corr() method.

Show solution
[44]:

[ ]:

6. Object values

In general, when we want to manipulate objects of a known type, say strings (which have type str), we can write .str after a series and then treat the result as if it were a single string, using any operator (e.g. slicing) or method that the particular class allows, plus others provided by pandas. (For text in particular there are various ways to manipulate it; for more details see the pandas documentation.)
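A minimal sketch of the .str accessor on a small series of timestamps:

```python
import pandas as pd

s = pd.Series(['2016-02-16 10:44:40', '2016-02-29 09:25:00'])

dates = s.str[:10]      # slice every string in the series at once: just the date part
lengths = s.str.len()   # one length per element
print(dates.tolist(), lengths.tolist())
```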

Filter by textual values

When we want to filter by text values, we can use .str.contains. Here for example we select all the samples in the last days of February (whose timestamp contains 2016-02-2):

[45]:
df[  df['time_stamp'].str.contains('2016-02-2')  ]
[45]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... accel_y accel_z gyro_x gyro_y gyro_z reset time_stamp Too hot check_p mag_tot
30442 30443 32.30 28.12 25.59 45.05 1008.01 1.47 51.82 51.18 9.215883 ... 0.018792 0.014558 -0.000042 0.000275 0.000157 0 2016-02-20 00:00:00 True sotto 269.091903
30443 30444 32.25 28.13 25.59 44.82 1008.02 0.81 51.53 52.21 8.710130 ... 0.019290 0.014667 0.000260 0.001011 0.000149 0 2016-02-20 00:00:10 True sotto 260.866157
30444 30445 33.07 28.13 25.59 45.08 1008.09 0.68 51.69 57.36 7.383435 ... 0.018714 0.014598 0.000299 0.000343 -0.000025 0 2016-02-20 00:00:41 True sotto 265.421154
30445 30446 32.63 28.10 25.60 44.87 1008.07 1.42 52.13 59.95 7.292313 ... 0.018857 0.014565 0.000160 0.000349 -0.000190 0 2016-02-20 00:00:50 True sotto 269.572476
30446 30447 32.55 28.11 25.60 44.94 1008.07 1.41 51.86 61.83 6.699141 ... 0.018871 0.014564 -0.000608 -0.000381 -0.000243 0 2016-02-20 00:01:01 True sotto 262.510966
30447 30448 32.47 28.12 25.61 44.83 1008.08 1.84 51.75 64.10 6.339477 ... 0.018833 0.014691 -0.000233 -0.000403 -0.000337 0 2016-02-20 00:01:10 True sotto 273.997653
30448 30449 32.41 28.11 25.61 45.00 1008.10 2.35 51.87 66.59 5.861904 ... 0.018828 0.014534 -0.000225 -0.000292 -0.000004 0 2016-02-20 00:01:20 True sotto 272.043915
30449 30450 32.41 28.12 25.61 45.02 1008.10 1.41 51.92 68.70 5.235877 ... 0.018724 0.014255 0.000134 -0.000310 -0.000101 0 2016-02-20 00:01:30 True sotto 268.608057
30450 30451 32.38 28.12 25.61 45.00 1008.12 1.46 52.04 70.98 4.775404 ... 0.018730 0.014372 0.000319 0.000079 -0.000215 0 2016-02-20 00:01:40 True sotto 271.750032
30451 30452 32.36 28.13 25.61 44.97 1008.12 1.18 51.78 73.10 4.300375 ... 0.018814 0.014518 -0.000023 0.000186 -0.000118 0 2016-02-20 00:01:51 True sotto 277.538126
30452 30453 32.38 28.12 25.61 45.10 1008.12 1.08 51.81 74.90 3.763551 ... 0.018526 0.014454 -0.000184 -0.000075 -0.000077 0 2016-02-20 00:02:00 True sotto 268.391448
30453 30454 32.33 28.12 25.61 44.96 1008.14 1.45 51.79 77.31 3.228626 ... 0.018607 0.014330 -0.000269 -0.000547 -0.000262 0 2016-02-20 00:02:11 True sopra 271.942019
30454 30455 32.32 28.14 25.61 44.86 1008.12 1.89 51.95 78.88 2.888813 ... 0.018698 0.014548 -0.000081 -0.000079 -0.000240 0 2016-02-20 00:02:20 True sotto 264.664070
30455 30456 32.39 28.13 25.61 45.01 1008.12 1.49 51.60 80.46 2.447253 ... 0.018427 0.014576 -0.000349 -0.000269 -0.000198 0 2016-02-20 00:02:31 True sotto 267.262186
30456 30457 32.34 28.09 25.61 45.02 1008.14 1.18 51.74 82.41 1.983143 ... 0.018866 0.014438 0.000248 0.000172 -0.000474 0 2016-02-20 00:02:40 True sopra 270.414588
30457 30458 32.34 28.11 25.61 45.02 1008.16 1.92 51.72 84.46 1.623884 ... 0.018729 0.014770 0.000417 0.000231 -0.000171 0 2016-02-20 00:02:50 True sopra 278.210856
30458 30459 32.33 28.10 25.61 44.85 1008.18 1.99 52.06 86.72 1.050999 ... 0.018867 0.014592 0.000377 0.000270 -0.000074 0 2016-02-20 00:03:00 True sopra 288.728974
30459 30460 32.35 28.11 25.61 44.98 1008.15 1.38 51.78 89.42 0.297179 ... 0.018609 0.014593 0.000622 0.000364 -0.000134 0 2016-02-20 00:03:10 True sopra 303.816530
30460 30461 32.34 28.11 25.61 44.93 1008.18 1.41 51.66 91.11 -0.136305 ... 0.018504 0.014502 -0.000049 -0.000104 -0.000286 0 2016-02-20 00:03:21 True sopra 305.475482
30461 30462 32.29 28.11 25.61 44.90 1008.18 1.33 51.99 93.09 -0.659496 ... 0.018584 0.014593 0.000132 -0.000542 -0.000221 0 2016-02-20 00:03:30 True sopra 306.437506
30462 30463 32.32 28.12 25.61 45.04 1008.17 1.30 51.93 94.25 -1.002867 ... 0.018703 0.014584 0.000245 0.000074 -0.000308 0 2016-02-20 00:03:41 True sopra 318.703894
30463 30464 32.30 28.12 25.61 44.86 1008.16 0.98 51.78 96.42 -1.634671 ... 0.018833 0.014771 0.000343 -0.000154 -0.000286 0 2016-02-20 00:03:50 True sopra 324.412585
30464 30465 32.31 28.10 25.60 44.96 1008.18 1.82 51.95 98.65 -2.204607 ... 0.018867 0.014664 -0.000058 -0.000366 -0.000091 0 2016-02-20 00:04:01 True sopra 331.006515
30465 30466 32.34 28.11 25.60 45.07 1008.19 1.14 51.69 101.53 -3.065968 ... 0.018461 0.014735 0.000263 -0.000071 -0.000370 0 2016-02-20 00:04:10 True sopra 332.503688
30466 30467 32.37 28.12 25.61 44.92 1008.19 1.73 51.94 103.40 -3.533967 ... 0.018810 0.014541 0.000442 0.000022 -0.000193 0 2016-02-20 00:04:20 True sopra 330.051496
30467 30468 32.32 28.11 25.60 44.98 1008.18 1.45 51.67 104.59 -4.009444 ... 0.018657 0.014586 -0.000125 0.000013 0.000209 0 2016-02-20 00:04:31 True sopra 340.085476
30468 30469 32.32 28.12 25.60 44.98 1008.20 1.66 51.85 105.99 -4.438902 ... 0.019021 0.014753 -0.000055 0.000126 0.000070 0 2016-02-20 00:04:40 True sopra 354.350961
30469 30470 32.30 28.12 25.60 44.93 1008.20 1.45 51.89 107.38 -4.940700 ... 0.018959 0.014662 0.000046 -0.000504 0.000041 0 2016-02-20 00:04:51 True sopra 364.753950
30470 30471 32.28 28.11 25.60 44.88 1008.21 1.78 51.88 108.78 -5.444541 ... 0.019012 0.014606 -0.000177 -0.000407 -0.000427 0 2016-02-20 00:05:00 True sopra 379.362654
30471 30472 32.33 28.10 25.60 44.96 1008.21 1.76 51.88 110.70 -6.101692 ... 0.018822 0.014834 0.000044 0.000042 -0.000327 0 2016-02-20 00:05:11 True sopra 388.749366
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110839 110840 31.60 27.49 24.82 42.74 1005.83 1.12 49.34 90.42 0.319629 ... 0.017461 0.014988 -0.000209 -0.000005 0.000138 0 2016-02-29 09:20:10 NaN sotto 574.877314
110840 110841 31.59 27.48 24.82 42.75 1005.82 2.04 49.53 92.11 0.015879 ... 0.017413 0.014565 -0.000472 -0.000478 0.000126 0 2016-02-29 09:20:20 NaN sotto 593.855683
110841 110842 31.59 27.51 24.82 42.76 1005.82 1.31 49.19 93.94 -0.658624 ... 0.017516 0.015014 -0.000590 -0.000372 0.000207 0 2016-02-29 09:20:31 NaN sotto 604.215692
110842 110843 31.60 27.50 24.82 42.74 1005.85 1.19 48.91 95.57 -1.117541 ... 0.017400 0.014982 -0.000039 0.000059 0.000149 0 2016-02-29 09:20:40 NaN sotto 606.406098
110843 110844 31.57 27.49 24.82 42.80 1005.83 1.49 49.17 98.11 -1.860475 ... 0.017580 0.014704 0.000223 0.000278 0.000038 0 2016-02-29 09:20:51 NaN sotto 622.733559
110844 110845 31.60 27.50 24.82 42.81 1005.84 1.47 49.46 99.67 -2.286044 ... 0.017428 0.014325 -0.000283 -0.000187 0.000077 0 2016-02-29 09:21:00 NaN sotto 641.480748
110845 110846 31.61 27.50 24.82 42.81 1005.82 2.28 49.27 103.17 -3.182359 ... 0.017537 0.014575 -0.000451 -0.000100 -0.000351 0 2016-02-29 09:21:10 NaN sotto 633.949204
110846 110847 31.61 27.50 24.82 42.75 1005.84 2.18 49.64 105.05 -3.769940 ... 0.017739 0.014926 0.000476 0.000452 -0.000249 0 2016-02-29 09:21:20 NaN sotto 643.508698
110847 110848 31.58 27.50 24.82 43.00 1005.85 2.52 49.31 107.23 -4.431722 ... 0.017588 0.015077 0.000822 0.000739 -0.000012 0 2016-02-29 09:21:30 NaN sotto 658.512439
110848 110849 31.54 27.51 24.82 42.76 1005.84 2.35 49.55 108.68 -4.944477 ... 0.017487 0.014864 0.000613 0.000763 -0.000227 0 2016-02-29 09:21:41 NaN sotto 667.095455
110849 110850 31.60 27.50 24.82 42.79 1005.82 2.33 48.79 109.52 -5.481255 ... 0.017455 0.014638 0.000196 0.000519 -0.000234 0 2016-02-29 09:21:50 NaN sotto 689.714415
110850 110851 31.61 27.50 24.82 42.79 1005.85 2.11 49.66 111.90 -6.263577 ... 0.017489 0.014960 0.000029 -0.000098 -0.000073 0 2016-02-29 09:22:01 NaN sotto 707.304506
110851 110852 31.56 27.50 24.83 42.84 1005.83 1.68 49.91 113.38 -6.844946 ... 0.017778 0.014703 -0.000177 -0.000452 -0.000232 0 2016-02-29 09:22:10 NaN sotto 726.361255
110852 110853 31.59 27.51 24.83 42.76 1005.82 2.26 49.17 114.42 -7.437300 ... 0.017733 0.014838 0.000396 0.000400 -0.000188 0 2016-02-29 09:22:21 NaN sotto 743.185242
110853 110854 31.58 27.50 24.83 42.98 1005.83 1.96 49.41 116.50 -8.271114 ... 0.017490 0.014582 0.000285 0.000312 -0.000058 0 2016-02-29 09:22:30 NaN sotto 767.328522
110854 110855 31.61 27.51 24.83 42.69 1005.84 2.27 49.39 117.61 -8.690470 ... 0.017465 0.014720 -0.000001 0.000371 -0.000274 0 2016-02-29 09:22:40 NaN sotto 791.907055
110855 110856 31.55 27.50 24.83 42.79 1005.83 1.51 48.98 119.13 -9.585351 ... 0.017554 0.014910 -0.000115 0.000029 -0.000223 0 2016-02-29 09:22:50 NaN sotto 802.932850
110856 110857 31.55 27.49 24.83 42.81 1005.82 2.12 49.95 120.81 -10.120745 ... 0.017494 0.014718 -0.000150 0.000147 -0.000320 0 2016-02-29 09:23:00 NaN sotto 820.194642
110857 110858 31.60 27.51 24.83 42.92 1005.82 1.53 49.33 121.74 -10.657858 ... 0.017544 0.014762 0.000161 0.000029 -0.000210 0 2016-02-29 09:23:11 NaN sotto 815.462202
110858 110859 31.58 27.50 24.83 42.81 1005.83 1.60 49.65 123.50 -11.584851 ... 0.017608 0.015093 -0.000073 0.000158 -0.000006 0 2016-02-29 09:23:20 NaN sotto 851.154631
110859 110860 31.61 27.50 24.83 42.82 1005.84 2.65 49.47 124.51 -12.089743 ... 0.017433 0.014930 0.000428 0.000137 0.000201 0 2016-02-29 09:23:31 NaN sotto 879.563826
110860 110861 31.57 27.50 24.83 42.80 1005.84 2.63 50.08 125.85 -12.701497 ... 0.017805 0.014939 0.000263 0.000163 0.000031 0 2016-02-29 09:23:40 NaN sotto 895.543882
110861 110862 31.58 27.51 24.83 42.90 1005.85 1.70 49.81 126.86 -13.393369 ... 0.017577 0.015026 -0.000077 0.000179 0.000148 0 2016-02-29 09:23:50 NaN sotto 928.948693
110862 110863 31.60 27.51 24.83 42.80 1005.85 1.66 49.13 127.35 -13.990712 ... 0.017508 0.014478 0.000119 -0.000204 0.000041 0 2016-02-29 09:24:01 NaN sotto 957.695014
110863 110864 31.64 27.51 24.83 42.80 1005.85 1.91 49.31 128.62 -14.691672 ... 0.017789 0.014891 0.000286 0.000103 0.000221 0 2016-02-29 09:24:10 NaN sotto 971.126355
110864 110865 31.56 27.52 24.83 42.94 1005.83 1.58 49.93 129.60 -15.169673 ... 0.017743 0.014646 -0.000264 0.000206 0.000196 0 2016-02-29 09:24:21 NaN sotto 996.676408
110865 110866 31.55 27.50 24.83 42.72 1005.85 1.89 49.92 130.51 -15.832622 ... 0.017570 0.014855 0.000143 0.000199 -0.000024 0 2016-02-29 09:24:30 NaN sotto 1022.779594
110866 110867 31.58 27.50 24.83 42.83 1005.85 2.09 50.00 132.04 -16.646212 ... 0.017657 0.014799 0.000537 0.000257 0.000057 0 2016-02-29 09:24:41 NaN sotto 1048.121268
110867 110868 31.62 27.50 24.83 42.81 1005.88 2.88 49.69 133.00 -17.270447 ... 0.017635 0.014877 0.000534 0.000456 0.000195 0 2016-02-29 09:24:50 NaN sotto 1073.629703
110868 110869 31.57 27.51 24.83 42.94 1005.86 2.17 49.77 134.18 -17.885872 ... 0.017261 0.014380 0.000459 0.000076 0.000030 0 2016-02-29 09:25:00 NaN sotto 1095.760426

80427 rows × 23 columns

Extracting strings

To extract only the day from the time_stamp column, we can use the str accessor and the slice operator with square brackets:

[46]:
df['time_stamp'].str[8:10]
[46]:
0         16
1         16
2         16
3         16
4         16
5         16
6         16
7         16
8         16
9         16
10        16
11        16
12        16
13        16
14        16
15        16
16        16
17        16
18        16
19        16
20        16
21        16
22        16
23        16
24        16
25        16
26        16
27        16
28        16
29        16
          ..
110839    29
110840    29
110841    29
110842    29
110843    29
110844    29
110845    29
110846    29
110847    29
110848    29
110849    29
110850    29
110851    29
110852    29
110853    29
110854    29
110855    29
110856    29
110857    29
110858    29
110859    29
110860    29
110861    29
110862    29
110863    29
110864    29
110865    29
110866    29
110867    29
110868    29
Name: time_stamp, Length: 110869, dtype: object
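The str slicing above can be tried on a small, self-contained example. This is a sketch on toy timestamps (hypothetical values, not the AstroPi data): characters 8–9 of a `YYYY-MM-DD HH:MM:SS` string hold the day of the month.

```python
import pandas as pd

# toy series of timestamps in the same format as the time_stamp column
s = pd.Series(['2016-02-16 10:44:40', '2016-02-29 09:25:00'])

days = s.str[8:10]        # slice characters 8..9 of every string
print(days.tolist())      # → ['16', '29']
```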
[47]:
count, division = np.histogram(df['temp_h'])
print(count)
print(division)
[ 2242  8186 15692 22738 20114 24683  9371  5856  1131   856]
[27.2   27.408 27.616 27.824 28.032 28.24  28.448 28.656 28.864 29.072
 29.28 ]
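To clarify what np.histogram returns: the first array holds the counts per bin, the second the bin edges (one more edge than there are bins). A minimal sketch on toy values (not the temp_h data):

```python
import numpy as np

values = [1.0, 1.5, 2.0, 2.5, 3.0]
count, division = np.histogram(values, bins=4)
# count[i] is how many values fall in [division[i], division[i+1])
# (the last bin is closed on both sides)
print(count)      # → [1 1 1 2]
print(division)   # → [1.  1.5 2.  2.5 3. ]
```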

7. Transforming

Suppose we want to convert all the values of the humidity column, which are floats, to integers.

We know that to convert a float to an integer there is the predefined Python function int:

[48]:
int(23.7)
[48]:
23

We would like to apply this function to all the elements of the humidity column.

To do so, we can call the transform method and pass the function int to it as a parameter.

NOTE: there are no round parentheses after int !!!

[49]:
df['humidity'].transform(int)
[49]:
0         44
1         45
2         45
3         45
4         45
5         45
6         45
7         45
8         45
9         45
10        45
11        45
12        45
13        45
14        45
15        45
16        45
17        45
18        45
19        45
20        45
21        45
22        45
23        45
24        45
25        45
26        45
27        45
28        45
29        45
          ..
110839    42
110840    42
110841    42
110842    42
110843    42
110844    42
110845    42
110846    42
110847    43
110848    42
110849    42
110850    42
110851    42
110852    42
110853    42
110854    42
110855    42
110856    42
110857    42
110858    42
110859    42
110860    42
110861    42
110862    42
110863    42
110864    42
110865    42
110866    42
110867    42
110868    42
Name: humidity, Length: 110869, dtype: int64

Just to be clear about what passing a function means, let’s look at two other completely equivalent ways we could have passed it:

Defining a function: We could have defined a function myf like this (notice the function MUST RETURN something!)

[50]:
def myf(x):
    return int(x)

df['humidity'].transform(myf)
[50]:
0         44
1         45
2         45
3         45
4         45
5         45
6         45
7         45
8         45
9         45
10        45
11        45
12        45
13        45
14        45
15        45
16        45
17        45
18        45
19        45
20        45
21        45
22        45
23        45
24        45
25        45
26        45
27        45
28        45
29        45
          ..
110839    42
110840    42
110841    42
110842    42
110843    42
110844    42
110845    42
110846    42
110847    43
110848    42
110849    42
110850    42
110851    42
110852    42
110853    42
110854    42
110855    42
110856    42
110857    42
110858    42
110859    42
110860    42
110861    42
110862    42
110863    42
110864    42
110865    42
110866    42
110867    42
110868    42
Name: humidity, Length: 110869, dtype: int64

Lambda function: We could just as well have used a lambda function, that is, a function without a name which is defined on one line:

[51]:
df['humidity'].transform( lambda x: int(x) )
[51]:
0         44
1         45
2         45
3         45
4         45
5         45
6         45
7         45
8         45
9         45
10        45
11        45
12        45
13        45
14        45
15        45
16        45
17        45
18        45
19        45
20        45
21        45
22        45
23        45
24        45
25        45
26        45
27        45
28        45
29        45
          ..
110839    42
110840    42
110841    42
110842    42
110843    42
110844    42
110845    42
110846    42
110847    43
110848    42
110849    42
110850    42
110851    42
110852    42
110853    42
110854    42
110855    42
110856    42
110857    42
110858    42
110859    42
110860    42
110861    42
110862    42
110863    42
110864    42
110865    42
110866    42
110867    42
110868    42
Name: humidity, Length: 110869, dtype: int64
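The equivalence of the three forms can be checked on a small, self-contained example (toy humidity values, not the full dataset):

```python
import pandas as pd

s = pd.Series([44.94, 45.12, 42.74])

def myf(x):
    return int(x)

a = s.transform(int)                  # built-in function
b = s.transform(myf)                  # named function
c = s.transform(lambda x: int(x))     # lambda
print(a.tolist())                     # → [44, 45, 42]
assert a.equals(b) and a.equals(c)    # all three give the same series
```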

Regardless of how we choose to pass the function, the transform method does not change the original dataframe:

[52]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 23 columns):
ROW_ID        110869 non-null int64
temp_cpu      110869 non-null float64
temp_h        110869 non-null float64
temp_p        110869 non-null float64
humidity      110869 non-null float64
pressure      110869 non-null float64
pitch         110869 non-null float64
roll          110869 non-null float64
yaw           110869 non-null float64
mag_x         110869 non-null float64
mag_y         110869 non-null float64
mag_z         110869 non-null float64
accel_x       110869 non-null float64
accel_y       110869 non-null float64
accel_z       110869 non-null float64
gyro_x        110869 non-null float64
gyro_y        110869 non-null float64
gyro_z        110869 non-null float64
reset         110869 non-null int64
time_stamp    110869 non-null object
Too hot       105315 non-null object
check_p       110869 non-null object
mag_tot       110869 non-null float64
dtypes: float64(18), int64(2), object(3)
memory usage: 19.5+ MB

If we want to add a new column, say humidity_int, we have to explicitly assign the result of transform to a new series:

[53]:
df['humidity_int'] = df['humidity'].transform( lambda x: int(x) )

Notice how pandas automatically infers type int64 for the newly created column:

[54]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 24 columns):
ROW_ID          110869 non-null int64
temp_cpu        110869 non-null float64
temp_h          110869 non-null float64
temp_p          110869 non-null float64
humidity        110869 non-null float64
pressure        110869 non-null float64
pitch           110869 non-null float64
roll            110869 non-null float64
yaw             110869 non-null float64
mag_x           110869 non-null float64
mag_y           110869 non-null float64
mag_z           110869 non-null float64
accel_x         110869 non-null float64
accel_y         110869 non-null float64
accel_z         110869 non-null float64
gyro_x          110869 non-null float64
gyro_y          110869 non-null float64
gyro_z          110869 non-null float64
reset           110869 non-null int64
time_stamp      110869 non-null object
Too hot         105315 non-null object
check_p         110869 non-null object
mag_tot         110869 non-null float64
humidity_int    110869 non-null int64
dtypes: float64(18), int64(3), object(3)
memory usage: 20.3+ MB

8. Grouping

Reference:

It is pretty easy to group items and perform aggregated calculations by using the groupby method. Let’s say we want to count how many humidity readings were taken for each integer humidity value (here we use pandas groupby, but for histograms you could also use numpy).

After groupby we can use count() aggregation function (other common ones are sum(), mean(), min(), max()):

[55]:
df.groupby(['humidity_int'])['humidity'].count()
[55]:
humidity_int
42     2776
43     2479
44    13029
45    32730
46    35775
47    14176
48     7392
49      297
50      155
51      205
52      209
53      128
54      224
55      164
56      139
57      183
58      237
59      271
60      300
Name: humidity, dtype: int64
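The groupby-then-count pattern in miniature, on a toy frame (hypothetical values): count() returns one row per group.

```python
import pandas as pd

df_toy = pd.DataFrame({'humidity_int': [44, 45, 45, 42],
                       'humidity':     [44.9, 45.1, 45.3, 42.7]})

# one row per distinct humidity_int, holding that group's row count
counts = df_toy.groupby(['humidity_int'])['humidity'].count()
print(counts.to_dict())   # → {42: 1, 44: 1, 45: 2}
```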

Notice we got only 19 rows. To have a series that fills the whole table, assigning to each row the count of its own group, we can use transform like this:

[56]:
df.groupby(['humidity_int'])['humidity'].transform('count')
[56]:
0         13029
1         32730
2         32730
3         32730
4         32730
5         32730
6         32730
7         32730
8         32730
9         32730
10        32730
11        32730
12        32730
13        32730
14        32730
15        32730
16        32730
17        32730
18        32730
19        32730
20        32730
21        32730
22        32730
23        32730
24        32730
25        32730
26        32730
27        32730
28        32730
29        32730
          ...
110839     2776
110840     2776
110841     2776
110842     2776
110843     2776
110844     2776
110845     2776
110846     2776
110847     2479
110848     2776
110849     2776
110850     2776
110851     2776
110852     2776
110853     2776
110854     2776
110855     2776
110856     2776
110857     2776
110858     2776
110859     2776
110860     2776
110861     2776
110862     2776
110863     2776
110864     2776
110865     2776
110866     2776
110867     2776
110868     2776
Name: humidity, Length: 110869, dtype: int64
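The difference between count() and transform('count') is easiest to see side by side on a toy frame (hypothetical values): transform broadcasts each group's count back to every row of the group, so the result has as many rows as the original frame.

```python
import pandas as pd

df_toy = pd.DataFrame({'humidity_int': [44, 45, 45, 42],
                       'humidity':     [44.9, 45.1, 45.3, 42.7]})

# one value per ORIGINAL row: the count of that row's group
per_row = df_toy.groupby(['humidity_int'])['humidity'].transform('count')
print(per_row.tolist())   # → [1, 2, 2, 1]
```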

As usual, groupby does not modify the dataframe; if we want the result stored in the dataframe we need to assign it to a new column:

[57]:
df['Humidity counts'] = df.groupby(['humidity_int'])['humidity'].transform('count')
[58]:
df
[58]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... gyro_x gyro_y gyro_z reset time_stamp Too hot check_p mag_tot humidity_int Humidity counts
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 ... 0.000942 0.000492 -0.000750 20 2016-02-16 10:44:40 True sotto 2368.337207 44 13029
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 ... 0.000218 -0.000005 -0.000235 0 2016-02-16 10:44:50 True sotto 2615.870247 45 32730
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 ... 0.000395 0.000600 -0.000003 0 2016-02-16 10:45:00 NaN sotto 2648.484927 45 32730
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 ... 0.000308 0.000577 -0.000102 0 2016-02-16 10:45:10 True sotto 2665.305485 45 32730
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 ... 0.000321 0.000691 0.000272 0 2016-02-16 10:45:20 NaN sotto 2732.388620 45 32730
5 6 31.69 27.55 25.01 45.12 1001.67 0.85 53.53 185.52 -50.246476 ... 0.000273 0.000494 -0.000059 0 2016-02-16 10:45:30 True sotto 2736.836291 45 32730
6 7 31.68 27.53 25.01 45.31 1001.70 0.63 53.55 186.10 -50.447346 ... -0.000111 0.000320 0.000222 0 2016-02-16 10:45:41 NaN sotto 2756.496929 45 32730
7 8 31.66 27.55 25.01 45.34 1001.70 1.49 53.65 186.08 -50.668232 ... -0.000044 0.000436 0.000301 0 2016-02-16 10:45:50 NaN sotto 2778.429164 45 32730
8 9 31.67 27.54 25.01 45.20 1001.72 1.22 53.77 186.55 -50.761529 ... 0.000358 0.000651 0.000187 0 2016-02-16 10:46:01 NaN sotto 2773.029554 45 32730
9 10 31.67 27.54 25.01 45.41 1001.75 1.63 53.46 185.94 -51.243832 ... 0.000266 0.000676 0.000356 0 2016-02-16 10:46:10 NaN sotto 2809.446772 45 32730
10 11 31.68 27.53 25.00 45.16 1001.72 1.32 53.52 186.24 -51.616473 ... 0.000268 0.001194 0.000106 0 2016-02-16 10:46:20 NaN sotto 2851.426683 45 32730
11 12 31.67 27.52 25.00 45.48 1001.72 1.51 53.47 186.17 -51.781714 ... 0.000859 0.001221 0.000264 0 2016-02-16 10:46:30 NaN sotto 2864.856376 45 32730
12 13 31.63 27.53 25.00 45.20 1001.72 1.55 53.75 186.38 -51.992696 ... 0.000589 0.001151 0.000002 0 2016-02-16 10:46:40 NaN sotto 2880.392591 45 32730
13 14 31.69 27.53 25.00 45.28 1001.71 1.07 53.63 186.60 -52.409175 ... 0.000497 0.000610 -0.000060 0 2016-02-16 10:46:50 True sotto 2921.288936 45 32730
14 15 31.70 27.52 25.00 45.14 1001.72 0.81 53.40 186.32 -52.648488 ... -0.000053 0.000593 -0.000141 0 2016-02-16 10:47:00 True sotto 2946.615432 45 32730
15 16 31.72 27.53 25.00 45.31 1001.75 1.51 53.34 186.42 -52.850708 ... -0.000238 0.000495 0.000156 0 2016-02-16 10:47:11 True sotto 2967.640766 45 32730
16 17 31.71 27.52 25.00 45.14 1001.72 1.82 53.49 186.39 -53.449140 ... 0.000571 0.000770 0.000331 0 2016-02-16 10:47:20 True sotto 3029.683044 45 32730
17 18 31.67 27.53 25.00 45.23 1001.71 0.46 53.69 186.72 -53.679986 ... -0.000187 0.000159 0.000386 0 2016-02-16 10:47:31 NaN sotto 3052.251538 45 32730
18 19 31.67 27.53 25.00 45.28 1001.71 0.67 53.55 186.61 -54.159015 ... -0.000495 0.000094 0.000084 0 2016-02-16 10:47:40 NaN sotto 3095.501435 45 32730
19 20 31.69 27.53 25.00 45.21 1001.71 1.23 53.43 186.21 -54.400646 ... -0.000338 0.000013 0.000041 0 2016-02-16 10:47:51 True sotto 3110.640598 45 32730
20 21 31.69 27.51 25.00 45.18 1001.71 1.44 53.58 186.40 -54.609398 ... -0.000266 0.000279 -0.000009 0 2016-02-16 10:48:00 True sotto 3140.151110 45 32730
21 22 31.66 27.52 25.00 45.18 1001.73 1.25 53.34 186.50 -54.746114 ... 0.000139 0.000312 0.000050 0 2016-02-16 10:48:10 NaN sotto 3156.665111 45 32730
22 23 31.68 27.54 25.00 45.25 1001.72 1.18 53.49 186.69 -55.091416 ... -0.000489 0.000086 0.000065 0 2016-02-16 10:48:21 NaN sotto 3188.235806 45 32730
23 24 31.67 27.53 24.99 45.30 1001.72 1.34 53.32 186.84 -55.516313 ... 0.000312 0.000175 0.000308 0 2016-02-16 10:48:30 NaN sotto 3238.850567 45 32730
24 25 31.65 27.53 25.00 45.40 1001.71 1.36 53.56 187.02 -55.560991 ... -0.000101 -0.000023 0.000377 0 2016-02-16 10:48:41 NaN sotto 3242.425155 45 32730
25 26 31.67 27.52 25.00 45.33 1001.72 1.17 53.44 186.95 -56.016359 ... 0.000147 0.000054 0.000147 0 2016-02-16 10:48:50 NaN sotto 3288.794716 45 32730
26 27 31.74 27.54 25.00 45.27 1001.71 0.88 53.41 186.57 -56.393694 ... -0.000125 -0.000193 0.000269 0 2016-02-16 10:49:01 True sotto 3320.328854 45 32730
27 28 31.63 27.52 25.00 45.33 1001.75 0.78 53.84 186.85 -56.524545 ... -0.000175 -0.000312 0.000361 0 2016-02-16 10:49:10 NaN sotto 3339.433796 45 32730
28 29 31.68 27.52 25.00 45.33 1001.73 0.88 53.41 186.62 -56.791585 ... -0.000382 -0.000253 0.000132 0 2016-02-16 10:49:20 NaN sotto 3364.310107 45 32730
29 30 31.67 27.51 25.00 45.21 1001.74 0.86 53.29 186.71 -56.915466 ... 0.000031 -0.000260 0.000069 0 2016-02-16 10:49:30 NaN sotto 3377.217368 45 32730
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110839 110840 31.60 27.49 24.82 42.74 1005.83 1.12 49.34 90.42 0.319629 ... -0.000209 -0.000005 0.000138 0 2016-02-29 09:20:10 NaN sotto 574.877314 42 2776
110840 110841 31.59 27.48 24.82 42.75 1005.82 2.04 49.53 92.11 0.015879 ... -0.000472 -0.000478 0.000126 0 2016-02-29 09:20:20 NaN sotto 593.855683 42 2776
110841 110842 31.59 27.51 24.82 42.76 1005.82 1.31 49.19 93.94 -0.658624 ... -0.000590 -0.000372 0.000207 0 2016-02-29 09:20:31 NaN sotto 604.215692 42 2776
110842 110843 31.60 27.50 24.82 42.74 1005.85 1.19 48.91 95.57 -1.117541 ... -0.000039 0.000059 0.000149 0 2016-02-29 09:20:40 NaN sotto 606.406098 42 2776
110843 110844 31.57 27.49 24.82 42.80 1005.83 1.49 49.17 98.11 -1.860475 ... 0.000223 0.000278 0.000038 0 2016-02-29 09:20:51 NaN sotto 622.733559 42 2776
110844 110845 31.60 27.50 24.82 42.81 1005.84 1.47 49.46 99.67 -2.286044 ... -0.000283 -0.000187 0.000077 0 2016-02-29 09:21:00 NaN sotto 641.480748 42 2776
110845 110846 31.61 27.50 24.82 42.81 1005.82 2.28 49.27 103.17 -3.182359 ... -0.000451 -0.000100 -0.000351 0 2016-02-29 09:21:10 NaN sotto 633.949204 42 2776
110846 110847 31.61 27.50 24.82 42.75 1005.84 2.18 49.64 105.05 -3.769940 ... 0.000476 0.000452 -0.000249 0 2016-02-29 09:21:20 NaN sotto 643.508698 42 2776
110847 110848 31.58 27.50 24.82 43.00 1005.85 2.52 49.31 107.23 -4.431722 ... 0.000822 0.000739 -0.000012 0 2016-02-29 09:21:30 NaN sotto 658.512439 43 2479
110848 110849 31.54 27.51 24.82 42.76 1005.84 2.35 49.55 108.68 -4.944477 ... 0.000613 0.000763 -0.000227 0 2016-02-29 09:21:41 NaN sotto 667.095455 42 2776
110849 110850 31.60 27.50 24.82 42.79 1005.82 2.33 48.79 109.52 -5.481255 ... 0.000196 0.000519 -0.000234 0 2016-02-29 09:21:50 NaN sotto 689.714415 42 2776
110850 110851 31.61 27.50 24.82 42.79 1005.85 2.11 49.66 111.90 -6.263577 ... 0.000029 -0.000098 -0.000073 0 2016-02-29 09:22:01 NaN sotto 707.304506 42 2776
110851 110852 31.56 27.50 24.83 42.84 1005.83 1.68 49.91 113.38 -6.844946 ... -0.000177 -0.000452 -0.000232 0 2016-02-29 09:22:10 NaN sotto 726.361255 42 2776
110852 110853 31.59 27.51 24.83 42.76 1005.82 2.26 49.17 114.42 -7.437300 ... 0.000396 0.000400 -0.000188 0 2016-02-29 09:22:21 NaN sotto 743.185242 42 2776
110853 110854 31.58 27.50 24.83 42.98 1005.83 1.96 49.41 116.50 -8.271114 ... 0.000285 0.000312 -0.000058 0 2016-02-29 09:22:30 NaN sotto 767.328522 42 2776
110854 110855 31.61 27.51 24.83 42.69 1005.84 2.27 49.39 117.61 -8.690470 ... -0.000001 0.000371 -0.000274 0 2016-02-29 09:22:40 NaN sotto 791.907055 42 2776
110855 110856 31.55 27.50 24.83 42.79 1005.83 1.51 48.98 119.13 -9.585351 ... -0.000115 0.000029 -0.000223 0 2016-02-29 09:22:50 NaN sotto 802.932850 42 2776
110856 110857 31.55 27.49 24.83 42.81 1005.82 2.12 49.95 120.81 -10.120745 ... -0.000150 0.000147 -0.000320 0 2016-02-29 09:23:00 NaN sotto 820.194642 42 2776
110857 110858 31.60 27.51 24.83 42.92 1005.82 1.53 49.33 121.74 -10.657858 ... 0.000161 0.000029 -0.000210 0 2016-02-29 09:23:11 NaN sotto 815.462202 42 2776
110858 110859 31.58 27.50 24.83 42.81 1005.83 1.60 49.65 123.50 -11.584851 ... -0.000073 0.000158 -0.000006 0 2016-02-29 09:23:20 NaN sotto 851.154631 42 2776
110859 110860 31.61 27.50 24.83 42.82 1005.84 2.65 49.47 124.51 -12.089743 ... 0.000428 0.000137 0.000201 0 2016-02-29 09:23:31 NaN sotto 879.563826 42 2776
110860 110861 31.57 27.50 24.83 42.80 1005.84 2.63 50.08 125.85 -12.701497 ... 0.000263 0.000163 0.000031 0 2016-02-29 09:23:40 NaN sotto 895.543882 42 2776
110861 110862 31.58 27.51 24.83 42.90 1005.85 1.70 49.81 126.86 -13.393369 ... -0.000077 0.000179 0.000148 0 2016-02-29 09:23:50 NaN sotto 928.948693 42 2776
110862 110863 31.60 27.51 24.83 42.80 1005.85 1.66 49.13 127.35 -13.990712 ... 0.000119 -0.000204 0.000041 0 2016-02-29 09:24:01 NaN sotto 957.695014 42 2776
110863 110864 31.64 27.51 24.83 42.80 1005.85 1.91 49.31 128.62 -14.691672 ... 0.000286 0.000103 0.000221 0 2016-02-29 09:24:10 NaN sotto 971.126355 42 2776
110864 110865 31.56 27.52 24.83 42.94 1005.83 1.58 49.93 129.60 -15.169673 ... -0.000264 0.000206 0.000196 0 2016-02-29 09:24:21 NaN sotto 996.676408 42 2776
110865 110866 31.55 27.50 24.83 42.72 1005.85 1.89 49.92 130.51 -15.832622 ... 0.000143 0.000199 -0.000024 0 2016-02-29 09:24:30 NaN sotto 1022.779594 42 2776
110866 110867 31.58 27.50 24.83 42.83 1005.85 2.09 50.00 132.04 -16.646212 ... 0.000537 0.000257 0.000057 0 2016-02-29 09:24:41 NaN sotto 1048.121268 42 2776
110867 110868 31.62 27.50 24.83 42.81 1005.88 2.88 49.69 133.00 -17.270447 ... 0.000534 0.000456 0.000195 0 2016-02-29 09:24:50 NaN sotto 1073.629703 42 2776
110868 110869 31.57 27.51 24.83 42.94 1005.86 2.17 49.77 134.18 -17.885872 ... 0.000459 0.000076 0.000030 0 2016-02-29 09:25:00 NaN sotto 1095.760426 42 2776

110869 rows × 25 columns

9. Exercise: meteo average temperatures

9.1 meteo plot

✪ Plot the temperature from the dataframe meteo:

[59]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# write here


Show solution
[60]:

[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f30d424e240>
../_images/pandas_pandas-sol_135_1.png

9.2 meteo pressure and raining

✪ In the same plot as above, show the pressure and the amount of rain.

[61]:
# write here


Show solution
[62]:

../_images/pandas_pandas-sol_144_0.png

9.3 meteo average temperature

✪✪✪ Calculate the average temperature for each day and show it in the plot, so as to obtain a couple of new columns like these:

    Day       Avg_day_temp
01/11/2017      7.983333
01/11/2017      7.983333
01/11/2017      7.983333
    .               .
    .               .
02/11/2017      7.384375
02/11/2017      7.384375
02/11/2017      7.384375
    .               .
    .               .

HINT 1: add a 'Day' column by extracting only the day from the date. To do it, use the .str accessor applied to the whole column.

HINT 2: There are various ways to solve the exercise:

  • The most performant and elegant way is with the groupby operator, see Pandas transform - more than meets the eye

  • As an alternative, you may use a for loop to cycle through the days. Typically, using a for loop is not a good idea with Pandas, as on large datasets the updates can take a long time. Still, since this dataset is small enough, you should get results in a reasonable amount of time.
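As an illustration of the groupby-plus-transform technique from HINT 2, here is a sketch on a tiny toy frame (hypothetical dates and temperatures, not the meteo dataset):

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['01/11/2017 00:00', '01/11/2017 00:15',
                             '02/11/2017 00:00'],
                    'Temp': [5.0, 6.0, 7.0]})

toy['Day'] = toy['Date'].str[0:10]              # keep only the day part
# broadcast each day's mean temperature back to every row of that day
toy['Avg_day_temp'] = toy.groupby('Day')['Temp'].transform('mean')
print(toy['Avg_day_temp'].tolist())             # → [5.5, 5.5, 7.0]
```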

Show solution
[63]:
# write here


Show solution
[64]:


    ****************    SOLUTION 1 OUTPUT - recalculate average for every row - slow !

WITH AVERAGE TEMPERATURE
               Date  Pressure  Rain  Temp         Day  Avg_day_temp
0  01/11/2017 00:00     995.4   0.0   5.4  01/11/2017      7.983333
1  01/11/2017 00:15     995.5   0.0   6.0  01/11/2017      7.983333
2  01/11/2017 00:30     995.5   0.0   5.9  01/11/2017      7.983333
3  01/11/2017 00:45     995.7   0.0   5.4  01/11/2017      7.983333
4  01/11/2017 01:00     995.7   0.0   5.3  01/11/2017      7.983333
[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f30d426ce10>
../_images/pandas_pandas-sol_153_2.png
Show solution
[65]:



********    SOLUTION 2 OUTPUT - recalculate average only 30 times
                                by using a dictionary d_avg, faster but not yet optimal
               Date  Pressure  Rain  Temp         Day  Avg_day_temp
0  01/11/2017 00:00     995.4   0.0   5.4  01/11/2017      7.983333
1  01/11/2017 00:15     995.5   0.0   6.0  01/11/2017      7.983333
2  01/11/2017 00:30     995.5   0.0   5.9  01/11/2017      7.983333
3  01/11/2017 00:45     995.7   0.0   5.4  01/11/2017      7.983333
4  01/11/2017 01:00     995.7   0.0   5.3  01/11/2017      7.983333
[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f30d42ad908>
../_images/pandas_pandas-sol_157_2.png
Show solution
[66]:


****************  SOLUTION 3 - OUTPUT  -  best solution with groupby and transform

WITH AVERAGE TEMPERATURE
               Date  Pressure  Rain  Temp         Day  Avg_day_temp
0  01/11/2017 00:00     995.4   0.0   5.4  01/11/2017      7.983333
1  01/11/2017 00:15     995.5   0.0   6.0  01/11/2017      7.983333
2  01/11/2017 00:30     995.5   0.0   5.9  01/11/2017      7.983333
3  01/11/2017 00:45     995.7   0.0   5.4  01/11/2017      7.983333
4  01/11/2017 01:00     995.7   0.0   5.3  01/11/2017      7.983333
[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f30d435a9e8>
../_images/pandas_pandas-sol_161_2.png

10. Merging tables

Suppose we want to add a column with the geographical position of the ISS. To do so, we need to join our dataset with another one containing that information. Let’s take for example the dataset iss-coords.csv

[67]:
iss_coords = pd.read_csv('iss-coords.csv', encoding='UTF-8')
[68]:
iss_coords
[68]:
timestamp lat lon
0 2016-01-01 05:11:30 -45.103458 14.083858
1 2016-01-01 06:49:59 -37.597242 28.931170
2 2016-01-01 11:52:30 17.126141 77.535602
3 2016-01-01 11:52:30 17.126464 77.535861
4 2016-01-01 14:54:08 7.259561 70.001561
5 2016-01-01 18:24:00 -15.990725 -106.400927
6 2016-01-01 22:45:51 31.602388 85.647998
7 2016-01-02 07:48:31 -51.578009 -26.736801
8 2016-01-02 10:50:19 -36.512021 14.452174
9 2016-01-02 14:01:27 -27.459029 10.991151
10 2016-01-02 14:01:27 -27.458783 10.991398
11 2016-01-02 20:30:13 29.861877 156.955941
12 2016-01-03 11:43:18 9.065825 -172.436293
13 2016-01-03 14:39:47 15.529901 35.812502
14 2016-01-03 14:39:47 15.530149 35.812698
15 2016-01-03 21:12:17 -44.793666 -28.679197
16 2016-01-03 22:39:52 28.061007 178.935724
17 2016-01-04 13:40:02 -14.153170 -139.759391
18 2016-01-04 13:51:36 9.461309 30.520802
19 2016-01-04 13:51:36 9.461560 30.520986
20 2016-01-04 18:42:18 44.974327 84.801522
21 2016-01-04 21:46:03 -51.551958 -75.103323
22 2016-01-04 21:46:03 -51.551933 -75.102954
23 2016-01-05 12:57:50 -41.439217 3.847215
24 2016-01-05 14:36:00 -13.581246 39.166522
25 2016-01-05 14:36:00 -13.581024 39.166692
26 2016-01-05 17:51:36 26.103252 -151.570312
27 2016-01-05 22:28:56 -26.458448 -108.642807
28 2016-01-06 12:07:09 -51.204236 -19.679525
29 2016-01-06 13:41:23 -51.166546 -19.318519
... ... ... ...
308 2016-02-25 21:19:08 14.195431 -133.777268
309 2016-02-25 21:38:48 -14.698631 -85.875320
310 2016-02-26 00:51:29 -4.376121 -94.773870
311 2016-02-26 00:51:29 -51.097174 -21.117794
312 2016-02-26 13:09:56 -1.811782 -99.010499
313 2016-02-26 14:28:13 -15.363988 -87.986579
314 2016-02-26 14:28:13 -15.364276 -87.986354
315 2016-02-26 17:49:36 -32.517607 47.514800
316 2016-02-26 22:37:28 -41.292043 29.733597
317 2016-02-27 01:43:10 -41.049112 30.193004
318 2016-02-27 01:43:10 -8.402991 -100.981726
319 2016-02-27 13:34:30 18.406130 -126.884570
320 2016-02-27 13:52:46 -22.783724 -90.869452
321 2016-02-27 13:52:46 -22.784018 -90.869189
322 2016-02-27 21:47:45 -7.038283 -106.607037
323 2016-02-28 00:51:03 -31.699384 -84.328371
324 2016-02-28 08:13:04 40.239764 -155.465692
325 2016-02-28 09:48:40 50.047523 175.566751
326 2016-02-28 14:29:36 37.854997 106.124377
327 2016-02-28 14:29:36 37.855237 106.124735
328 2016-02-28 20:56:33 51.729529 163.754128
329 2016-02-29 04:39:20 -10.946978 -100.874429
330 2016-02-29 08:56:28 46.885514 -167.143393
331 2016-02-29 10:32:56 46.773608 -166.800893
332 2016-02-29 11:53:49 46.678097 -166.512208
333 2016-02-29 13:23:17 -51.077590 -31.093987
334 2016-02-29 13:44:13 30.688553 -135.403820
335 2016-02-29 13:44:13 30.688295 -135.403533
336 2016-02-29 18:44:57 27.608774 -130.198781
337 2016-02-29 21:36:47 27.325186 -129.893278

338 rows × 3 columns

We notice there is a timestamp column, which unfortunately has a slightly different name than the time_stamp column (notice the underscore _) of the original astropi dataset:

[69]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 25 columns):
ROW_ID             110869 non-null int64
temp_cpu           110869 non-null float64
temp_h             110869 non-null float64
temp_p             110869 non-null float64
humidity           110869 non-null float64
pressure           110869 non-null float64
pitch              110869 non-null float64
roll               110869 non-null float64
yaw                110869 non-null float64
mag_x              110869 non-null float64
mag_y              110869 non-null float64
mag_z              110869 non-null float64
accel_x            110869 non-null float64
accel_y            110869 non-null float64
accel_z            110869 non-null float64
gyro_x             110869 non-null float64
gyro_y             110869 non-null float64
gyro_z             110869 non-null float64
reset              110869 non-null int64
time_stamp         110869 non-null object
Too hot            105315 non-null object
check_p            110869 non-null object
mag_tot            110869 non-null float64
humidity_int       110869 non-null int64
Humidity counts    110869 non-null int64
dtypes: float64(18), int64(4), object(3)
memory usage: 21.1+ MB
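One way to avoid such a name mismatch is to relabel one of the columns before merging, with the rename method. A minimal sketch on a toy frame standing in for iss_coords (the column values here are made up for illustration):

```python
import pandas as pd

# hypothetical mini-frame standing in for iss_coords
coords = pd.DataFrame({'timestamp': ['2016-02-19 03:49:00'],
                       'lat': [31.43],
                       'lon': [52.92]})

# rename produces a NEW dataframe with the column relabelled
renamed = coords.rename(columns={'timestamp': 'time_stamp'})
print(renamed.columns.tolist())   # ['time_stamp', 'lat', 'lon']
```

Here we will not rename anything: as shown next, merge can also match columns with different names.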

To merge the two datasets on those columns, we can use the merge method like this:
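Since the key columns have different names in the two tables, merge lets us name them separately with the left_on and right_on parameters. A minimal sketch on toy frames (the keys 'a'..'d' are invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({'time_stamp': ['a', 'b', 'c'],
                     'temp': [31.8, 31.7, 31.6]})
right = pd.DataFrame({'timestamp': ['b', 'c', 'd'],
                      'lat': [10.0, 20.0, 30.0]})

# by default merge performs an inner join: only keys
# present in BOTH frames survive in the result
joined = left.merge(right, left_on='time_stamp', right_on='timestamp')
print(joined)   # rows for 'b' and 'c' only
```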

[70]:
# remember merge produces a NEW dataframe

geo_astropi = df.merge(iss_coords, left_on='time_stamp', right_on='timestamp')

# merge will add both time_stamp and timestamp columns,
# so we remove the duplicate column `timestamp`
geo_astropi = geo_astropi.drop('timestamp', axis=1)
[71]:
geo_astropi
[71]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... gyro_z reset time_stamp Too hot check_p mag_tot humidity_int Humidity counts lat lon
0 23231 32.53 28.37 25.89 45.31 1006.04 1.31 51.63 34.91 21.125001 ... 0.000046 0 2016-02-19 03:49:00 True sotto 2345.207992 45 32730 31.434741 52.917464
1 27052 32.30 28.12 25.62 45.57 1007.42 1.49 52.29 333.49 16.083471 ... 0.000034 0 2016-02-19 14:30:40 True sotto 323.634786 45 32730 -46.620658 -57.311657
2 27052 32.30 28.12 25.62 45.57 1007.42 1.49 52.29 333.49 16.083471 ... 0.000034 0 2016-02-19 14:30:40 True sotto 323.634786 45 32730 -46.620477 -57.311138
3 46933 32.21 28.05 25.50 47.36 1012.41 0.67 52.40 27.57 15.441683 ... 0.000221 0 2016-02-21 22:14:11 True sopra 342.159257 47 14176 19.138359 -140.211489
4 64572 32.32 28.18 25.61 47.45 1010.62 1.14 51.41 33.68 11.994554 ... 0.000030 0 2016-02-23 23:40:50 True sopra 264.655601 47 14176 4.713819 80.261665
5 68293 32.39 28.26 25.70 46.83 1010.51 0.61 51.91 287.86 6.554283 ... 0.000171 0 2016-02-24 10:05:51 True sopra 436.876111 46 35775 -46.061583 22.246025
6 73374 32.38 28.18 25.62 46.52 1008.28 0.90 51.77 30.80 9.947132 ... -0.000375 0 2016-02-25 00:23:01 True sopra 226.089258 46 35775 47.047346 137.958918
7 90986 32.42 28.34 25.76 45.72 1006.79 0.57 49.85 10.57 7.805606 ... -0.000047 0 2016-02-27 01:43:10 True sotto 149.700293 45 32730 -41.049112 30.193004
8 90986 32.42 28.34 25.76 45.72 1006.79 0.57 49.85 10.57 7.805606 ... -0.000047 0 2016-02-27 01:43:10 True sotto 149.700293 45 32730 -8.402991 -100.981726
9 102440 32.62 28.62 26.02 45.15 1006.06 1.12 50.44 301.74 10.348327 ... -0.000061 0 2016-02-28 09:48:40 True sotto 381.014223 45 32730 50.047523 175.566751

10 rows × 27 columns

Exercise 10.1 better merge

If you look closely, the table above does have lat and lon columns, but very few rows. Why? Try to merge the tables in some meaningful way, so that all the original rows are kept and all the cells of lat and lon are filled.

Show solution
[72]:
# write here


[72]:
ROW_ID temp_cpu temp_h temp_p humidity pressure pitch roll yaw mag_x ... reset time_stamp Too hot check_p mag_tot humidity_int Humidity counts timestamp lat lon
0 1 31.88 27.57 25.01 44.94 1001.68 1.49 52.25 185.21 -46.422753 ... 20 2016-02-16 10:44:40 True sotto 2368.337207 44 13029 NaN NaN NaN
1 2 31.79 27.53 25.01 45.12 1001.72 1.03 53.73 186.72 -48.778951 ... 0 2016-02-16 10:44:50 True sotto 2615.870247 45 32730 NaN NaN NaN
2 3 31.66 27.53 25.01 45.12 1001.72 1.24 53.57 186.21 -49.161878 ... 0 2016-02-16 10:45:00 NaN sotto 2648.484927 45 32730 NaN NaN NaN
3 4 31.69 27.52 25.01 45.32 1001.69 1.57 53.63 186.03 -49.341941 ... 0 2016-02-16 10:45:10 True sotto 2665.305485 45 32730 NaN NaN NaN
4 5 31.66 27.54 25.01 45.18 1001.71 0.85 53.66 186.46 -50.056683 ... 0 2016-02-16 10:45:20 NaN sotto 2732.388620 45 32730 NaN NaN NaN
5 6 31.69 27.55 25.01 45.12 1001.67 0.85 53.53 185.52 -50.246476 ... 0 2016-02-16 10:45:30 True sotto 2736.836291 45 32730 NaN NaN NaN
6 7 31.68 27.53 25.01 45.31 1001.70 0.63 53.55 186.10 -50.447346 ... 0 2016-02-16 10:45:41 NaN sotto 2756.496929 45 32730 NaN NaN NaN
7 8 31.66 27.55 25.01 45.34 1001.70 1.49 53.65 186.08 -50.668232 ... 0 2016-02-16 10:45:50 NaN sotto 2778.429164 45 32730 NaN NaN NaN
8 9 31.67 27.54 25.01 45.20 1001.72 1.22 53.77 186.55 -50.761529 ... 0 2016-02-16 10:46:01 NaN sotto 2773.029554 45 32730 NaN NaN NaN
9 10 31.67 27.54 25.01 45.41 1001.75 1.63 53.46 185.94 -51.243832 ... 0 2016-02-16 10:46:10 NaN sotto 2809.446772 45 32730 NaN NaN NaN
10 11 31.68 27.53 25.00 45.16 1001.72 1.32 53.52 186.24 -51.616473 ... 0 2016-02-16 10:46:20 NaN sotto 2851.426683 45 32730 NaN NaN NaN
11 12 31.67 27.52 25.00 45.48 1001.72 1.51 53.47 186.17 -51.781714 ... 0 2016-02-16 10:46:30 NaN sotto 2864.856376 45 32730 NaN NaN NaN
12 13 31.63 27.53 25.00 45.20 1001.72 1.55 53.75 186.38 -51.992696 ... 0 2016-02-16 10:46:40 NaN sotto 2880.392591 45 32730 NaN NaN NaN
13 14 31.69 27.53 25.00 45.28 1001.71 1.07 53.63 186.60 -52.409175 ... 0 2016-02-16 10:46:50 True sotto 2921.288936 45 32730 NaN NaN NaN
14 15 31.70 27.52 25.00 45.14 1001.72 0.81 53.40 186.32 -52.648488 ... 0 2016-02-16 10:47:00 True sotto 2946.615432 45 32730 NaN NaN NaN
15 16 31.72 27.53 25.00 45.31 1001.75 1.51 53.34 186.42 -52.850708 ... 0 2016-02-16 10:47:11 True sotto 2967.640766 45 32730 NaN NaN NaN
16 17 31.71 27.52 25.00 45.14 1001.72 1.82 53.49 186.39 -53.449140 ... 0 2016-02-16 10:47:20 True sotto 3029.683044 45 32730 NaN NaN NaN
17 18 31.67 27.53 25.00 45.23 1001.71 0.46 53.69 186.72 -53.679986 ... 0 2016-02-16 10:47:31 NaN sotto 3052.251538 45 32730 NaN NaN NaN
18 19 31.67 27.53 25.00 45.28 1001.71 0.67 53.55 186.61 -54.159015 ... 0 2016-02-16 10:47:40 NaN sotto 3095.501435 45 32730 NaN NaN NaN
19 20 31.69 27.53 25.00 45.21 1001.71 1.23 53.43 186.21 -54.400646 ... 0 2016-02-16 10:47:51 True sotto 3110.640598 45 32730 NaN NaN NaN
20 21 31.69 27.51 25.00 45.18 1001.71 1.44 53.58 186.40 -54.609398 ... 0 2016-02-16 10:48:00 True sotto 3140.151110 45 32730 NaN NaN NaN
21 22 31.66 27.52 25.00 45.18 1001.73 1.25 53.34 186.50 -54.746114 ... 0 2016-02-16 10:48:10 NaN sotto 3156.665111 45 32730 NaN NaN NaN
22 23 31.68 27.54 25.00 45.25 1001.72 1.18 53.49 186.69 -55.091416 ... 0 2016-02-16 10:48:21 NaN sotto 3188.235806 45 32730 NaN NaN NaN
23 24 31.67 27.53 24.99 45.30 1001.72 1.34 53.32 186.84 -55.516313 ... 0 2016-02-16 10:48:30 NaN sotto 3238.850567 45 32730 NaN NaN NaN
24 25 31.65 27.53 25.00 45.40 1001.71 1.36 53.56 187.02 -55.560991 ... 0 2016-02-16 10:48:41 NaN sotto 3242.425155 45 32730 NaN NaN NaN
25 26 31.67 27.52 25.00 45.33 1001.72 1.17 53.44 186.95 -56.016359 ... 0 2016-02-16 10:48:50 NaN sotto 3288.794716 45 32730 NaN NaN NaN
26 27 31.74 27.54 25.00 45.27 1001.71 0.88 53.41 186.57 -56.393694 ... 0 2016-02-16 10:49:01 True sotto 3320.328854 45 32730 NaN NaN NaN
27 28 31.63 27.52 25.00 45.33 1001.75 0.78 53.84 186.85 -56.524545 ... 0 2016-02-16 10:49:10 NaN sotto 3339.433796 45 32730 NaN NaN NaN
28 29 31.68 27.52 25.00 45.33 1001.73 0.88 53.41 186.62 -56.791585 ... 0 2016-02-16 10:49:20 NaN sotto 3364.310107 45 32730 NaN NaN NaN
29 30 31.67 27.51 25.00 45.21 1001.74 0.86 53.29 186.71 -56.915466 ... 0 2016-02-16 10:49:30 NaN sotto 3377.217368 45 32730 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110841 110840 31.60 27.49 24.82 42.74 1005.83 1.12 49.34 90.42 0.319629 ... 0 2016-02-29 09:20:10 NaN sotto 574.877314 42 2776 NaN NaN NaN
110842 110841 31.59 27.48 24.82 42.75 1005.82 2.04 49.53 92.11 0.015879 ... 0 2016-02-29 09:20:20 NaN sotto 593.855683 42 2776 NaN NaN NaN
110843 110842 31.59 27.51 24.82 42.76 1005.82 1.31 49.19 93.94 -0.658624 ... 0 2016-02-29 09:20:31 NaN sotto 604.215692 42 2776 NaN NaN NaN
110844 110843 31.60 27.50 24.82 42.74 1005.85 1.19 48.91 95.57 -1.117541 ... 0 2016-02-29 09:20:40 NaN sotto 606.406098 42 2776 NaN NaN NaN
110845 110844 31.57 27.49 24.82 42.80 1005.83 1.49 49.17 98.11 -1.860475 ... 0 2016-02-29 09:20:51 NaN sotto 622.733559 42 2776 NaN NaN NaN
110846 110845 31.60 27.50 24.82 42.81 1005.84 1.47 49.46 99.67 -2.286044 ... 0 2016-02-29 09:21:00 NaN sotto 641.480748 42 2776 NaN NaN NaN
110847 110846 31.61 27.50 24.82 42.81 1005.82 2.28 49.27 103.17 -3.182359 ... 0 2016-02-29 09:21:10 NaN sotto 633.949204 42 2776 NaN NaN NaN
110848 110847 31.61 27.50 24.82 42.75 1005.84 2.18 49.64 105.05 -3.769940 ... 0 2016-02-29 09:21:20 NaN sotto 643.508698 42 2776 NaN NaN NaN
110849 110848 31.58 27.50 24.82 43.00 1005.85 2.52 49.31 107.23 -4.431722 ... 0 2016-02-29 09:21:30 NaN sotto 658.512439 43 2479 NaN NaN NaN
110850 110849 31.54 27.51 24.82 42.76 1005.84 2.35 49.55 108.68 -4.944477 ... 0 2016-02-29 09:21:41 NaN sotto 667.095455 42 2776 NaN NaN NaN
110851 110850 31.60 27.50 24.82 42.79 1005.82 2.33 48.79 109.52 -5.481255 ... 0 2016-02-29 09:21:50 NaN sotto 689.714415 42 2776 NaN NaN NaN
110852 110851 31.61 27.50 24.82 42.79 1005.85 2.11 49.66 111.90 -6.263577 ... 0 2016-02-29 09:22:01 NaN sotto 707.304506 42 2776 NaN NaN NaN
110853 110852 31.56 27.50 24.83 42.84 1005.83 1.68 49.91 113.38 -6.844946 ... 0 2016-02-29 09:22:10 NaN sotto 726.361255 42 2776 NaN NaN NaN
110854 110853 31.59 27.51 24.83 42.76 1005.82 2.26 49.17 114.42 -7.437300 ... 0 2016-02-29 09:22:21 NaN sotto 743.185242 42 2776 NaN NaN NaN
110855 110854 31.58 27.50 24.83 42.98 1005.83 1.96 49.41 116.50 -8.271114 ... 0 2016-02-29 09:22:30 NaN sotto 767.328522 42 2776 NaN NaN NaN
110856 110855 31.61 27.51 24.83 42.69 1005.84 2.27 49.39 117.61 -8.690470 ... 0 2016-02-29 09:22:40 NaN sotto 791.907055 42 2776 NaN NaN NaN
110857 110856 31.55 27.50 24.83 42.79 1005.83 1.51 48.98 119.13 -9.585351 ... 0 2016-02-29 09:22:50 NaN sotto 802.932850 42 2776 NaN NaN NaN
110858 110857 31.55 27.49 24.83 42.81 1005.82 2.12 49.95 120.81 -10.120745 ... 0 2016-02-29 09:23:00 NaN sotto 820.194642 42 2776 NaN NaN NaN
110859 110858 31.60 27.51 24.83 42.92 1005.82 1.53 49.33 121.74 -10.657858 ... 0 2016-02-29 09:23:11 NaN sotto 815.462202 42 2776 NaN NaN NaN
110860 110859 31.58 27.50 24.83 42.81 1005.83 1.60 49.65 123.50 -11.584851 ... 0 2016-02-29 09:23:20 NaN sotto 851.154631 42 2776 NaN NaN NaN
110861 110860 31.61 27.50 24.83 42.82 1005.84 2.65 49.47 124.51 -12.089743 ... 0 2016-02-29 09:23:31 NaN sotto 879.563826 42 2776 NaN NaN NaN
110862 110861 31.57 27.50 24.83 42.80 1005.84 2.63 50.08 125.85 -12.701497 ... 0 2016-02-29 09:23:40 NaN sotto 895.543882 42 2776 NaN NaN NaN
110863 110862 31.58 27.51 24.83 42.90 1005.85 1.70 49.81 126.86 -13.393369 ... 0 2016-02-29 09:23:50 NaN sotto 928.948693 42 2776 NaN NaN NaN
110864 110863 31.60 27.51 24.83 42.80 1005.85 1.66 49.13 127.35 -13.990712 ... 0 2016-02-29 09:24:01 NaN sotto 957.695014 42 2776 NaN NaN NaN
110865 110864 31.64 27.51 24.83 42.80 1005.85 1.91 49.31 128.62 -14.691672 ... 0 2016-02-29 09:24:10 NaN sotto 971.126355 42 2776 NaN NaN NaN
110866 110865 31.56 27.52 24.83 42.94 1005.83 1.58 49.93 129.60 -15.169673 ... 0 2016-02-29 09:24:21 NaN sotto 996.676408 42 2776 NaN NaN NaN
110867 110866 31.55 27.50 24.83 42.72 1005.85 1.89 49.92 130.51 -15.832622 ... 0 2016-02-29 09:24:30 NaN sotto 1022.779594 42 2776 NaN NaN NaN
110868 110867 31.58 27.50 24.83 42.83 1005.85 2.09 50.00 132.04 -16.646212 ... 0 2016-02-29 09:24:41 NaN sotto 1048.121268 42 2776 NaN NaN NaN
110869 110868 31.62 27.50 24.83 42.81 1005.88 2.88 49.69 133.00 -17.270447 ... 0 2016-02-29 09:24:50 NaN sotto 1073.629703 42 2776 NaN NaN NaN
110870 110869 31.57 27.51 24.83 42.94 1005.86 2.17 49.77 134.18 -17.885872 ... 0 2016-02-29 09:25:00 NaN sotto 1095.760426 42 2776 NaN NaN NaN

110871 rows × 28 columns
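The output above (all 110871 original rows, with NaN where no coordinate was recorded at that instant) can be obtained with a left join, passing how='left' to merge. A sketch of the idea on toy stand-ins for df and iss_coords (the real solution operates on the full dataframes):

```python
import pandas as pd

# toy stand-ins for df and iss_coords (keys 't1'..'t3' are invented)
sensors = pd.DataFrame({'time_stamp': ['t1', 't2', 't3'],
                        'temp_cpu': [31.8, 31.7, 31.6]})
coords = pd.DataFrame({'timestamp': ['t2'],
                       'lat': [31.4], 'lon': [52.9]})

# how='left' keeps EVERY row of the left frame; rows with no
# matching timestamp get NaN in lat and lon, as in the table above
merged = sensors.merge(coords, how='left',
                       left_on='time_stamp', right_on='timestamp')
print(len(merged))   # 3  - all original rows kept
```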

Continue

See EURES job offers worksheet