Data formats 1 - introduction

Introduction

In these tutorials we will see how to load and write tabular data such as CSV, and we will mention tree-like data such as JSON files. We will also say a few words about open data catalogs and licenses (Creative Commons).

In these tutorials we will review the main data formats:

Textual formats

  • Line files

  • CSV (tabular data)

  • JSON (tree-like data, just mention)

Binary formats (just mention)

  • Excel sheets

We will also mention open data catalogs and licenses (Creative Commons)

What to do

  1. unzip exercises in a folder, you should get something like this:

formats
    formats1-lines.ipynb
    formats1-lines-sol.ipynb
    formats2-csv.ipynb
    formats2-csv-sol.ipynb
    formats3-json.ipynb
    formats3-json-sol.ipynb
    formats4-chal.ipynb
    jupman.py

WARNING: to correctly visualize the notebook, it MUST be in an unzipped folder !

  2. open Jupyter Notebook from that folder. Two things should open: first a console and then a browser. The browser should show a file list: navigate the list and open the notebook formats/formats1-lines.ipynb

  3. Go on reading that notebook, and follow the instructions inside.

Shortcut keys:

  • to execute Python code inside a Jupyter cell, press Control + Enter

  • to execute Python code inside a Jupyter cell AND select next cell, press Shift + Enter

  • to execute Python code inside a Jupyter cell AND create a new cell afterwards, press Alt + Enter

  • If the notebooks look stuck, try to select Kernel -> Restart

Line files

Line files are typically text files which contain information grouped by lines. An example using historical characters might be like the following:

Leonardo
da Vinci
Sandro
Botticelli
Niccolò
Macchiavelli

We can immediately see a regularity: the first two lines contain the data of Leonardo da Vinci, first the name and then the surname. The following lines instead hold the data of Sandro Botticelli, again with first the name and then the surname, and so on.

We might want to write a program that reads the lines and prints names and surnames on the terminal like the following:

Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli

To start having an approximation of the final result, we can open the file, read only the first line and print it:

[1]:
with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    print(line)

Leonardo

What happened? Let’s examine the first rows:

open command

The command

open('people-simple.txt', encoding='utf-8')

allows us to open the text file by telling Python the file path 'people-simple.txt' and the encoding in which it was written (encoding='utf-8').

The encoding

The encoding depends on the operating system and on the editor used to write the file. When we open a file, Python cannot guess the encoding, and if we do not specify anything Python might open the file assuming an encoding different from the original. In other words, if we omit the encoding (or put a wrong one) we might end up seeing weird characters (like little squares instead of accented letters).

In general, when you open a file, first try to specify the encoding utf-8, which is the most common one. If it doesn’t work, try others: for example, for files written in Southern Europe with Windows you might try encoding='latin-1'. If you open a file written elsewhere, you might need other encodings. For more in-depth information, you can read Dive into Python - Chapter 4 - Strings, and Dive into Python - Chapter 11 - File, both of which are highly recommended readings.
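To get a feeling for what a wrong encoding does, here is a minimal sketch that needs no file at all. The variable names are ours, chosen only for illustration: we turn a string into bytes with UTF-8, then decode the same bytes once correctly and once as latin-1.

```python
# Illustrative sketch: the variable names are hypothetical, not part of the exercise files
text = 'Niccolò'

# Encoding turns the string into raw bytes, as they would be stored on disk
raw = text.encode('utf-8')

# Decoding with the same encoding gives back the original string
right = raw.decode('utf-8')

# Decoding with a different encoding does not fail here, but garbles the accented letter
wrong = raw.decode('latin-1')

print(right)   # Niccolò
print(wrong)   # NiccolÃ²
```

Notice that Python raised no error in the second case: the only symptom of the wrong encoding is the garbled text.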

with block

The with defines a block with instructions inside:

with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    print(line)

We used with to tell Python that in any case, even if errors occur, once the instructions in the inner block (line=f.readline() and print(line)) have been executed, Python must automatically close the file. Properly closing a file avoids wasting memory resources and creating hard-to-find paranormal errors. If you want to avoid hunting for never-closed zombie files, always remember to open all files in with blocks! Furthermore, at the end of the row, in the part as f: we assigned the file to a variable here called f, but we could have used any other name we liked.
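If you want to check that the file really gets closed, a possible sketch is the following. As an assumption, we create a small throwaway file with the tempfile module (so the example does not depend on people-simple.txt) and we write to it with mode 'w', which this tutorial has not introduced yet. The attribute closed tells us whether a file object is closed.

```python
import os
import tempfile

# Assumption: a throwaway file, so the example does not depend on people-simple.txt
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example.txt')

with open(path, 'w', encoding='utf-8') as out:
    out.write('Leonardo\n')

with open(path, encoding='utf-8') as f:
    line = f.readline()
    inside = f.closed     # False: the file is still open inside the block

outside = f.closed        # True: Python closed the file when the block ended

print(inside, outside)    # False True
```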

WARNING: To indent the code, ALWAYS use sequences of four white spaces. Sequences of only 2 spaces, even if allowed, are not recommended.

WARNING: Depending on the editor you use, by pressing TAB you might get a sequence of white spaces, as happens in Jupyter (4 spaces, which is the recommended length), or a special tabulation character (to avoid)! As annoying as this distinction might appear, remember it because it might generate very hard-to-find errors.

WARNING: In the commands that create blocks, such as with, always remember to put a colon : at the end of the line !

The command

line=f.readline()

puts in the variable line the entire line, as a string. Warning: the string will contain at the end the special newline character !

You might wonder where that readline comes from. Like everything in Python, our variable f which represents the file we just opened is an object, and like any object, depending on its type, it has particular methods we can use on it. In this case the method is readline.

The following command prints the string content:

print(line)
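Since print shows the newline only as an empty row, it can help to look at the string with repr. In this small sketch we assume io.StringIO as a stand-in for the opened file, so it runs without people-simple.txt:

```python
import io

# Assumption: io.StringIO stands in for the opened file
f = io.StringIO('Leonardo\nda Vinci\n')
line = f.readline()

print(repr(line))   # 'Leonardo\n'
print(len(line))    # 9
```

The name has 8 letters, but the string is 9 characters long: the extra one is the newline \n.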

✪ 1.1 EXERCISE: Try to rewrite here the block we’ve just seen, and execute the cell by pressing Control+Enter. Type the code by hand, do not copy-paste ! Pay attention to correct indentation with spaces in the block.

[2]:
# write here


Leonardo

✪ 1.2 EXERCISE: you might be wondering what exactly that f is, and what exactly the method readline should be doing. When you find yourself in these situations, you can help yourself with the functions type and help. This time, directly copy and paste the same code here, but insert inside the with block the commands:

  • print(type(f))

  • help(f)

  • help(f.readline) # Attention: remember the f. before the readline !!

Every time you add something, try to execute with Control+Enter and see what happens

[3]:
# write here the code (copy and paste)


Leonardo

<class '_io.TextIOWrapper'>
Help on built-in function readline:

readline(size=-1, /) method of _io.TextIOWrapper instance
    Read until newline or EOF.

    Returns an empty string if EOF is hit immediately.

Help on TextIOWrapper object:

class TextIOWrapper(_TextIOBase)
 |  TextIOWrapper(buffer, encoding=None, errors=None, newline=None, line_buffering=False, write_through=False)
 |
 |  Character and line based layer over a BufferedIOBase object, buffer.
 |
 |  encoding gives the name of the encoding that the stream will be
 |  decoded or encoded with. It defaults to locale.getpreferredencoding(False).
 |
 |  errors determines the strictness of encoding and decoding (see
 |  help(codecs.Codec) or the documentation for codecs.register) and
 |  defaults to "strict".
 |
 |  newline controls how line endings are handled. It can be None, '',
 |  '\n', '\r', and '\r\n'.  It works as follows:
 |
 |  * On input, if newline is None, universal newlines mode is
 |    enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
 |    these are translated into '\n' before being returned to the
 |    caller. If it is '', universal newline mode is enabled, but line
 |    endings are returned to the caller untranslated. If it has any of
 |    the other legal values, input lines are only terminated by the given
 |    string, and the line ending is returned to the caller untranslated.
 |
 |  * On output, if newline is None, any '\n' characters written are
 |    translated to the system default line separator, os.linesep. If
 |    newline is '' or '\n', no translation takes place. If newline is any
 |    of the other legal values, any '\n' characters written are translated
 |    to the given string.
 |
 |  If line_buffering is True, a call to flush is implied when a call to
 |  write contains a newline character.
 |
 |  Method resolution order:
 |      TextIOWrapper
 |      _TextIOBase
 |      _IOBase
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __getstate__(...)
 |
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  __next__(self, /)
 |      Implement next(self).
 |
 |  __repr__(self, /)
 |      Return repr(self).
 |
 |  close(self, /)
 |      Flush and close the IO object.
 |
 |      This method has no effect if the file is already closed.
 |
 |  detach(self, /)
 |      Separate the underlying buffer from the TextIOBase and return it.
 |
 |      After the underlying buffer has been detached, the TextIO is in an
 |      unusable state.
 |
 |  fileno(self, /)
 |      Returns underlying file descriptor if one exists.
 |
 |      OSError is raised if the IO object does not use a file descriptor.
 |
 |  flush(self, /)
 |      Flush write buffers, if applicable.
 |
 |      This is not implemented for read-only and non-blocking streams.
 |
 |  isatty(self, /)
 |      Return whether this is an 'interactive' stream.
 |
 |      Return False if it can't be determined.
 |
 |  read(self, size=-1, /)
 |      Read at most n characters from stream.
 |
 |      Read from underlying buffer until we have n characters or we hit EOF.
 |      If n is negative or omitted, read until EOF.
 |
 |  readable(self, /)
 |      Return whether object was opened for reading.
 |
 |      If False, read() will raise OSError.
 |
 |  readline(self, size=-1, /)
 |      Read until newline or EOF.
 |
 |      Returns an empty string if EOF is hit immediately.
 |
 |  reconfigure(self, /, *, encoding=None, errors=None, newline=None, line_buffering=None, write_through=None)
 |      Reconfigure the text stream with new parameters.
 |
 |      This also does an implicit stream flush.
 |
 |  seek(self, cookie, whence=0, /)
 |      Change stream position.
 |
 |      Change the stream position to the given byte offset. The offset is
 |      interpreted relative to the position indicated by whence.  Values
 |      for whence are:
 |
 |      * 0 -- start of stream (the default); offset should be zero or positive
 |      * 1 -- current stream position; offset may be negative
 |      * 2 -- end of stream; offset is usually negative
 |
 |      Return the new absolute position.
 |
 |  seekable(self, /)
 |      Return whether object supports random access.
 |
 |      If False, seek(), tell() and truncate() will raise OSError.
 |      This method may need to do a test seek().
 |
 |  tell(self, /)
 |      Return current stream position.
 |
 |  truncate(self, pos=None, /)
 |      Truncate file to size bytes.
 |
 |      File pointer is left unchanged.  Size defaults to the current IO
 |      position as reported by tell().  Returns the new size.
 |
 |  writable(self, /)
 |      Return whether object was opened for writing.
 |
 |      If False, write() will raise OSError.
 |
 |  write(self, text, /)
 |      Write string to stream.
 |      Returns the number of characters written (which is always equal to
 |      the length of the string).
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  buffer
 |
 |  closed
 |
 |  encoding
 |      Encoding of the text stream.
 |
 |      Subclasses should override.
 |
 |  errors
 |      The error setting of the decoder or encoder.
 |
 |      Subclasses should override.
 |
 |  line_buffering
 |
 |  name
 |
 |  newlines
 |      Line endings translated so far.
 |
 |      Only line endings translated during reading are considered.
 |
 |      Subclasses should override.
 |
 |  write_through
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from _IOBase:
 |
 |  __del__(...)
 |
 |  __enter__(...)
 |
 |  __exit__(...)
 |
 |  __iter__(self, /)
 |      Implement iter(self).
 |
 |  readlines(self, hint=-1, /)
 |      Return a list of lines from the stream.
 |
 |      hint can be specified to control the number of lines read: no more
 |      lines will be read if the total size (in bytes/characters) of all
 |      lines so far exceeds hint.
 |
 |  writelines(self, lines, /)
 |      Write a list of lines to stream.
 |
 |      Line separators are not added, so it is usual for each of the
 |      lines provided to have a line separator at the end.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from _IOBase:
 |
 |  __dict__

First we put the content of the first line into the variable line; now we might put it in a variable with a more meaningful name, like name. Also, we can directly read the next row into the variable surname and then print the concatenation of both:

[4]:
with open('people-simple.txt', encoding='utf-8') as f:
    name=f.readline()
    surname=f.readline()
    print(name + ' ' + surname)

Leonardo
 da Vinci

PROBLEM ! The printing puts a weird line break between name and surname. Why is that? If you remember, earlier we said that readline reads the line content into a string which also includes the special newline character at the end. To eliminate it, you can use the method rstrip():

[5]:
with open('people-simple.txt', encoding='utf-8') as f:
    name=f.readline().rstrip()
    surname=f.readline().rstrip()
    print(name + ' ' + surname)

Leonardo da Vinci

✪ 1.3 EXERCISE: Again, rewrite the block above in the cell below, and execute the cell with Control+Enter. Question: what happens if you use strip() instead of rstrip()? What about lstrip()? Can you deduce the meaning of r and l? If you can’t manage it, try to use the Python command help by calling help(str.rstrip)

[6]:
# write here


Leonardo da Vinci

Very good, we have the first line ! Now we can read all the lines in sequence. To this end, we can use a while loop:

[7]:
with open('people-simple.txt', encoding='utf-8') as f:
    line=f.readline()
    while line != "":
        name = line.rstrip()
        surname=f.readline().rstrip()
        print(name + ' ' + surname)
        line=f.readline()

Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli

NOTE: In Python there are shorter ways to read a text file line by line, we used this approach to make explicit all passages.
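As a sketch of one such shorter way, a file object can be used directly in a for loop, which yields one line at a time. Here we assume io.StringIO as a stand-in for open('people-simple.txt', encoding='utf-8'), and we assume the file has an even number of lines; the list people is our own addition, used only to collect the results.

```python
import io

# Assumption: io.StringIO stands in for open('people-simple.txt', encoding='utf-8')
content = 'Leonardo\nda Vinci\nSandro\nBotticelli\n'

people = []
with io.StringIO(content) as f:
    for line in f:                      # each iteration yields the next line, newline included
        name = line.rstrip()
        surname = next(f).rstrip()      # next(f) reads the line following the current one
        people.append(name + ' ' + surname)

for person in people:
    print(person)
```

If a name had no surname after it, next(f) would raise a StopIteration error, so this version relies on the file being well formed.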

What did we do? First, we added a while loop in a new block.

WARNING: In the new block, since it is already within the external with, the instructions are indented by 8 spaces and not 4! If you use the wrong number of spaces, bad things happen !

We first read a line, and two cases are possible:

  1. we are at the end of the file (or the file is empty): in this case the readline() call returns an empty string

  2. we are not at the end of the file: the first line is put as a string inside the variable line. Since Python internally uses a pointer to keep track of the position we have reached while reading the file, after the read this pointer is moved to the beginning of the next line. This way the next call to readline() will read a line from the new position.
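The two cases can be observed in a small sketch. We assume io.StringIO as a stand-in for a text file of only two lines, and call readline() three times:

```python
import io

# Assumption: io.StringIO stands in for a two-line text file
f = io.StringIO('Leonardo\nda Vinci\n')

first = f.readline()    # reads from the start up to and including the first newline
second = f.readline()   # continues from where the previous call stopped
third = f.readline()    # nothing left: an empty string signals the end of the file

print(repr(first))    # 'Leonardo\n'
print(repr(second))   # 'da Vinci\n'
print(repr(third))    # ''
```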

In the while block we tell Python to continue the loop as long as line is not empty. If this is the case, inside the while block we parse the name from the line and put it in the variable name (removing the extra newline character with rstrip() as we did before), then we proceed to read the next line and parse the result into the surname variable. Finally, we read again a line into the line variable so it will be ready for the next round of name extraction. If line is empty the loop will terminate:

while line != "":                   # enter the loop if line contains characters
    name = line.rstrip()            # parses the name
    surname=f.readline().rstrip()   # reads next line and parses surname
    print(name + ' ' + surname)
    line=f.readline()               # reads next line

✪ 1.4 EXERCISE: As before, rewrite in the cell below the code with the while, paying attention to the indentation (for the external with line use copy-and-paste):

[8]:
# write here the code of internal while


Leonardo da Vinci
Sandro Botticelli
Niccolò Macchiavelli

people-complex line file

Look at the file people-complex.txt:

name: Leonardo
surname: da Vinci
birthdate: 1452-04-15
name: Sandro
surname: Botticelli
birthdate: 1445-03-01
name: Niccolò
surname: Macchiavelli
birthdate: 1469-05-03

Supposing you have to read the file and print the following output, how would you do it?

Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03

Hint 1: to obtain from the string 'abcde' the substring 'cde', which starts at index 2, you can use the square brackets operator, with the index followed by a colon :

[9]:
x = 'abcde'
x[2:]
[9]:
'cde'
[10]:
x[3:]
[10]:
'de'

Hint 2: To know the length of a string, use the function len:

[11]:
len('abcde')
[11]:
5
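The two hints can be combined, as in this sketch. The string and the variable names are made up for the example and do not come from people-complex.txt:

```python
# Illustrative sketch: the string and the variable names are made up for this example
row = 'city: Firenze'
prefix = 'city: '

# len(prefix) is the index of the first character after the prefix
value = row[len(prefix):]

print(value)   # Firenze
```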

✪ 1.5 EXERCISE: Write here the solution of the exercise ‘People complex’:

[12]:
# write here


Leonardo da Vinci, 1452-04-15
Sandro Botticelli, 1445-03-01
Niccolò Macchiavelli, 1469-05-03

Exercise - line file immersione-in-python-toc

✪✪✪ This exercise is more challenging, if you are a beginner you might skip it and go on to CSVs

The book Dive into Python is nice, and for the Italian version there is a PDF, which has a problem though: if you try to print it, you will discover that the index is missing. Without despairing, we found a program that extracts the titles into a file like the one below, but as you will see it is not exactly nice to look at. Since we are Python ninjas, we decided to transform the raw titles into a real table of contents. Sure enough there are smarter ways to do this, like loading the PDF in Python with an appropriate module for PDFs; still, this makes for an interesting exercise.

You are given the file immersione-in-python-toc.txt:

BookmarkBegin
BookmarkTitle: Il vostro primo programma Python
BookmarkLevel: 1
BookmarkPageNumber: 38
BookmarkBegin
BookmarkTitle: Immersione!
BookmarkLevel: 2
BookmarkPageNumber: 38
BookmarkBegin
BookmarkTitle: Dichiarare funzioni
BookmarkLevel: 2
BookmarkPageNumber: 41
BookmarkBegin
BookmarkTitle: Argomenti opzionali e con nome
BookmarkLevel: 3
BookmarkPageNumber: 42
BookmarkBegin
BookmarkTitle: Scrivere codice leggibile
BookmarkLevel: 2
BookmarkPageNumber: 44
BookmarkBegin
BookmarkTitle: Stringhe di documentazione
BookmarkLevel: 3
BookmarkPageNumber: 44
BookmarkBegin
BookmarkTitle: Il percorso di ricerca di import
BookmarkLevel: 2
BookmarkPageNumber: 46
BookmarkBegin
BookmarkTitle: Ogni cosa &#232; un oggetto
BookmarkLevel: 2
BookmarkPageNumber: 47

Write a python program to print the following output:

Il vostro primo programma Python  38
   Immersione!  38
   Dichiarare funzioni  41
      Argomenti opzionali e con nome  42
   Scrivere codice leggibile  44
      Stringhe di documentazione  44
   Il percorso di ricerca di import  46
   Ogni cosa è un oggetto  47

For this exercise, you will need to insert artificial spaces in the output, in a quantity determined by the BookmarkLevel rows.

QUESTION: what’s that weird value &#232; at the end of the original file? Should we report it in the output?

HINT 1: To convert a string into an integer number, use the function int:

[13]:
x = '5'
[14]:
x
[14]:
'5'
[15]:
int(x)
[15]:
5

Warning: int(x) returns a value, and never modifies the argument x!
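A short sketch of what this warning means in practice: calling int(x) alone leaves x untouched, so to keep the number we have to store the result in a variable (here called y, a name chosen only for the example).

```python
x = '5'
int(x)          # the converted value is computed and then discarded
print(type(x))  # <class 'str'>

y = int(x)      # store the result in a variable to keep it
print(y + 1)    # 6
```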

HINT 2: To substitute a substring in a string, you can use the method .replace:

[16]:
x = 'abcde'
x.replace('cd', 'HELLO' )
[16]:
'abHELLOe'

HINT 3: while there is only one sequence to substitute, replace is fine, but if we had a million horrible sequences like &gt;, &#62;, &#x3e;, what should we do? As good data cleaners, we recognize these are HTML escape sequences, so we could use a method specific to such sequences, like html.unescape. Try it instead of replace and check if it works!

NOTE: Before using html.unescape, import the module html with the command:

import html
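As a small sketch of how html.unescape behaves, we can feed it three different escape sequences that all stand for the same character. The variable names are ours, for illustration only:

```python
import html

# Three different HTML escape sequences that all stand for the character >
named = html.unescape('&gt;')
decimal = html.unescape('&#62;')
hexadecimal = html.unescape('&#x3e;')

print(named, decimal, hexadecimal)   # > > >

# html.escape goes in the opposite direction
print(html.escape('a > b'))          # a &gt; b
```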

HINT 4: To write n copies of a character, use * like this:

[17]:
"b" * 3
[17]:
'bbb'
[18]:
"b" * 7
[18]:
'bbbbbbb'

IMPLEMENTATION: Write here the solution for the line file immersione-in-python-toc.txt, and try to execute it by pressing Control + Enter:

[19]:
# write here


   Il vostro primo programma Python  38
      Immersione!  38
      Dichiarare funzioni  41
         Argomenti opzionali e con nome  42
      Scrivere codice leggibile  44
         Stringhe di documentazione  44
      Il percorso di ricerca di import  46
      Ogni cosa è un oggetto  47

Continue

Go on with CSV tabular files
