Formats 4 - Challenges

Parsing challenge - Spam killer

Roughly half of all emails sent in the world are spam.

Enraged by the number of pointless messages arriving each day, you decide to develop the definitive spam filter.

Spam killer 1. mail reader

An email is a text file formatted as specified by the RFC 822 standard (you don’t need to read the specs, but keep in mind that RFCs are typically specs!)

An email contains a certain number of fields, then an empty line, and then the mail body:

Received: from forwarder@mailforeverybody.net
Message-Id: <v121c0404ad6a23934739@>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thursday, 4 Jun 2020 09:43:14 -0800
To: noreply@softpython.org
From: Harvey The Salesman <harvey@thegreatvacuum.com>
Subject: DISCOUNTED Vacuum Cleaners
Precedence: bulk

Hi!
Find the best offers on our website: thegreatvacuum.com !!!

Cheers,
Harvey

Each field name is separated from its value by a colon :

For example, in

From: Harvey The Salesman <harvey@thegreatvacuum.com>

From is the field name, and Harvey The Salesman <harvey@thegreatvacuum.com> is the field value.

Implement a function read_mail(filename) which parses a mailn.txt file (download files) and RETURN a dictionary holding all the fields and their values

  • the body has no field name in the file; in the dictionary, use Body as its field name

  • REMEMBER to remove newlines from field values

  • DO NOT remove newlines from the body

  • HINT: Getting the body text right might be tricky, so first try just parsing the fields
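As a starting point, a single field line can be split at the first colon with str.partition. This is only a sketch, under the assumption that the line really contains a colon; parse_field is a hypothetical helper name, not required by the exercise:

```python
def parse_field(line):
    """Split a header line like 'From: Harvey ...' at the FIRST colon,
       RETURN a (name, value) pair with surrounding whitespace stripped."""
    name, sep, value = line.partition(':')   # partition splits only at the first ':'
    return name, value.strip()
```

Because str.partition splits only at the first occurrence, colons inside the value (as in the timestamp of the Date field) are preserved.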

Example:

>>> pprint(read_mail('mail1.txt'))

{'Received': 'from forwarder@mailforeverybody.net',
 'Message-Id': '<v121c0404ad6a23934739@>',
 'Mime-Version': '1.0',
 'Content-Type': 'text/plain; charset="us-ascii"',
 'Date': 'Thursday, 4 Jun 2020 09:43:14 -0800',
 'To': 'noreply@softpython.org',
 'From': 'Harvey The Salesman <harvey@thegreatvacuum.com>',
 'Subject': 'DISCOUNTED Vacuum Cleaners',
 'Precedence': 'bulk',
 'Body': 'Hi!\nFind the best offers on our website: thegreatvacuum.com !!!\nCheers, \nHarvey'}
[1]:

from pprint import pprint

def read_mail(filename):
    """RETURN a NEW dictionary
    """
    raise Exception('TODO IMPLEMENT ME !')

pprint(read_mail('mail1.txt'))

assert read_mail('mail1.txt') == {
    'Body': 'Hi!\n'
    'Find the best offers on our website: thegreatvacuum.com !!!\n'
    'Cheers, \n'
    'Harvey',
    'Content-Type': 'text/plain; charset="us-ascii"',
    'Date': 'Thursday, 4 Jun 2020 09:43:14 -0800',
    'From': 'Harvey The Salesman <harvey@thegreatvacuum.com>',
    'Message-Id': '<v121c0404ad6a23934739@>',
    'Mime-Version': '1.0',
    'Precedence': 'bulk',
    'Received': 'from forwarder@mailforeverybody.net',
    'Subject': 'DISCOUNTED Vacuum Cleaners',
    'To': 'noreply@softpython.org'
}

assert read_mail('mail2.txt') == {'Received': 'from mailman@networked-solutions.net',
 'Message-Id': '<v47gc04e7ad6a249f4539@>',
 'Mime-Version': '1.0',
 'Content-Type': 'text/plain; charset="us-ascii"',
 'Date': 'Tuesday, 7 Jul 2020 16:25:14 -0800',
 'To': 'info@softpython.org',
 'From': 'Mr Boss <head@overpaid-data-scientists.com>',
 'Subject': '20K/month Job offer',
 'Precedence': 'bulk',
 'Body': "Congratulations! You've been crunching so many matrices \nduring the job interview you deserve 20.000€ salary/month + benefits.\nWe will install in your office three pinball machines \nand a dispenser of M&Ms - which colors do you prefer?\nBest,\nYour Next Boss\n"
}

Spam killer 2. running filters

You have defined various filters you want to run on the mails. Each filter is a tuple containing a field name and a string to search for: if the field value contains the string, the mail is marked as spam.
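The per-mail check could be sketched as follows, once a mail is already parsed into a dictionary (is_spam is a hypothetical helper name, not required by the exercise):

```python
def is_spam(filters, mail):
    """RETURN True if ANY filter (field, needle) matches,
       i.e. the needle occurs inside the mail's value for that field."""
    return any(needle in mail.get(field, '')   # .get with default avoids KeyError
               for field, needle in filters)   # when a field is missing
```

Using mail.get(field, '') instead of mail[field] means a filter on a field the mail doesn't have simply never matches.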

Write a function run_filters which takes the filters as a list of tuples and a list of mail files, and RETURN a report as a list of lists. It must have:

  • a header

  • rows

  • columns Subject, From

  • column SPAM? as a boolean: True if any of the filters detected the mail as spam, False otherwise

Example:

>>> report = run_filters([('From', 'secret-encounters-at-night.com'),
                          ('Body','offer') ],
                         ['mail1.txt', 'mail2.txt', 'mail3.txt', 'mail4.txt'])
>>> pprint(report, width=90)
[['Subject', 'From', 'SPAM?'],
 ['DISCOUNTED Vacuum Cleaners', 'Harvey The Salesman <harvey@thegreatvacuum.com>', True],
 ['20K/month Job offer', 'Mr Boss <head@overpaid-data-scientists.com>', False],
 ['I noticed you ...', 'That lady  <lady@secret-encounters-at-night.com>', True],
 ['Some help with your thesis', 'John <john@yourfriends.net>', False]]
[2]:

from pprint import pprint

def run_filters(filters, filenames):
    """RETURN a NEW list of lists
    """
    raise Exception('TODO IMPLEMENT ME !')


report1 = run_filters([('From', 'secret-encounters-at-night.com'), ('Body','offer') ],
                     ['mail1.txt', 'mail2.txt', 'mail3.txt', 'mail4.txt'])
assert report1 == [
     ['Subject', 'From', 'SPAM?'],
     ['DISCOUNTED Vacuum Cleaners', 'Harvey The Salesman <harvey@thegreatvacuum.com>', True],
     ['20K/month Job offer', 'Mr Boss <head@overpaid-data-scientists.com>', False],
     ['I noticed you ...', 'That lady  <lady@secret-encounters-at-night.com>', True],
     ['Some help with your thesis', 'John <john@yourfriends.net>', False]
]

report2 = run_filters([('From','vacuum'), ('From', 'Guru')],
                     ['mail4.txt', 'mail1.txt', 'mail5.txt'])
assert report2 == [
     ['Subject', 'From', 'SPAM?'],
     ['Some help with your thesis', 'John <john@yourfriends.net>', False],
     ['DISCOUNTED Vacuum Cleaners', 'Harvey The Salesman <harvey@thegreatvacuum.com>', True],
     ['Is somebody stealing your domain?', 'Internet Guru  <service@cndomaintrouble.org>', True],
]

Parsing challenge - Markdown

Markdown is a language for writing documents which lets you write plain text with additional syntax to express how it should be formatted. Many editors support Markdown (Jupyter and GitHub included). For example, some Markdown text like this:

# My Heading

some paragraph, so much interesting

another paragraph, with a some bla bla

# Another big heading

There is **something notable** and then regular words.

would be displayed in Jupyter like this:

(image: the text above as rendered by Jupyter)

Try writing some Python code which reads a text file containing a subset of Markdown syntax and translates it into suitable Python data structures. See Markdown basic syntax

  • DO NOT use special purpose libraries!

  • IMPORTANT: Markdown supports arbitrary depth of subparagraphs: to keep things simple, start by supporting one level, then two. Going deeper would require some kind of level tracking, which could be cumbersome to implement.
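For instance, detecting a heading and its level could be sketched like this, assuming headings use only leading # characters (parse_heading is a hypothetical helper name):

```python
def parse_heading(line):
    """If line is a Markdown heading, RETURN (level, text), else None."""
    if not line.startswith('#'):
        return None
    stripped = line.lstrip('#')          # drop all leading '#'
    level = len(line) - len(stripped)    # how many '#' were dropped
    return level, stripped.strip()
```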

Example - a possible model for the above text could be this one:

[3]:

parsed = [
            {'type': 'header',
             'level': 1,
             'text': 'My Heading',
             'subelements': [ { 'type': 'paragraph',
                                'level': 2,
                                'text': 'some paragraph, so much interesting',
                              },
                              { 'type': 'paragraph',
                                'level': 2,
                                'rich_text': [('normal','another paragraph, with a some bla bla')]
                              }
                            ]
            },
            {'type': 'header',
             'level': 1,
             'text': 'Another big heading',
             'subelements': [ { 'type': 'paragraph',
                                'level': 2,
                                'rich_text': [ ('normal','There is'),
                                               ('bold', 'something notable'),
                                               ('normal', 'and then regular words.')],
                              }
                            ]
            }
]
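The rich_text tuples in the model above alternate normal and bold runs. One possible sketch splits the line on the ** marker: pieces at even indices fall outside the markers, pieces at odd indices inside (split_bold is a hypothetical helper, assuming the markers are balanced):

```python
def split_bold(text):
    """RETURN a list of ('normal'|'bold', text) tuples for a line
       using **bold** markers."""
    parts = text.split('**')
    return [('normal' if i % 2 == 0 else 'bold', p.strip())
            for i, p in enumerate(parts)
            if p.strip()]                 # drop empty fragments
```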

Parsing challenge - Other languages

Try developing simple parsers for other languages, like:

DO NOT use special purpose libraries!

IMPORTANT: many of these languages support arbitrary depth of subparagraphs: to keep things simple, start by supporting one level, then two. Going deeper would require some kind of level tracking, which could be cumbersome to implement.

CSV Challenge - Over the top

You’re opening a start-up with your friends for tourists who like mountain hiking.

You decide to focus on the north-east region of Italy and develop an app: one of the first tasks is to collect in a table all the mountain peaks, with their Italian and German names, latitude, longitude and elevation.

You take some data from OpenStreetMap ( openstreetmap.org ), the free world map made by volunteers (OSM for short). As data format, you chose a CSV export generated by the SLIPO Project.

Over the top 1. reading OpenStreetMap data

Have a look at the osm.csv file; try also opening it with LibreOffice or Microsoft Office

Then implement function read_osm which reads a given CSV file with a csv.DictReader and just PRINTS ONLY the peaks (with pprint).

  • At this stage you can just PRINT the whole retrieved dictionary, we will extract stuff later.

You should see something like this

  • NOTE 1: here we show only some printed rows:

  • NOTE 2: depending on the Python version you have, you might see regular dictionaries instead of OrderedDict

OrderedDict([('ID', 'node/26862480'),
             ('NAME', 'Alpe di Succiso'),
             ('CATEGORY', 'TOURISM'),
             ('SUBCATEGORY', 'PEAK'),
             ('LON', '10.1955113'),
             ('LAT', '44.3327854'),
             ('SRID', '4326'),
             ('WKT', 'POINT (10.195511300000001 44.332785400000006)'),
             ('INTERNATIONAL_NAME', ''),
             ('STREET', ''),
             ('WIKIPEDIA', 'it:Alpe di Succiso'),
             ('PHONE', ''),
             ('CITY', ''),
             ('EMAIL', ''),
             ('ALTERNATIVE_NAME', ''),
             ('OPENING_HOURS', ''),
             ('DESCRIPTION', ''),
             ('WEBSITE', ''),
             ('LAST_UPDATE', ''),
             ('OPERATOR', ''),
             ('POSTCODE', ''),
             ('COUNTRY', ''),
             ('FAX', ''),
             ('IMAGE', ''),
             ('HOUSENUMBER', ''),
             ('OTHER_TAGS',
              '{"PDOP":"1.87","natural":"peak","importance":"regional","name":"Alpe '
              'di '
              'Succiso","source":"survey","wikidata":"Q1810954","ele":"2016"}')])
OrderedDict([('ID', 'node/26862538'),
             ('NAME', 'Becco di Filadonna'),
             ('CATEGORY', 'TOURISM'),
             ('SUBCATEGORY', 'PEAK'),
             ('LON', '11.1934654'),
             ('LAT', '45.9636324'),
             ('SRID', '4326'),
             .
             .
             .
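The filtering step of read_osm could be sketched like this; to keep the sketch self-contained it reads a tiny in-memory stand-in for osm.csv instead of the real file:

```python
import csv
import io
from pprint import pprint

# a tiny stand-in for osm.csv, so the sketch runs without the real file
sample = '''ID,NAME,CATEGORY,SUBCATEGORY
node/26862480,Alpe di Succiso,TOURISM,PEAK
node/99,Some Hotel,TOURISM,HOTEL'''

# DictReader yields one dictionary per row, keyed by the header names
peaks = [dict(row) for row in csv.DictReader(io.StringIO(sample))
         if row['SUBCATEGORY'] == 'PEAK']   # keep only the peaks
pprint(peaks)
```

With the real file you would replace io.StringIO(sample) with an open('osm.csv', ...) file handle.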
[4]:

import csv

def read_osm(in_filename):
    raise Exception('TODO IMPLEMENT ME !')

read_osm('osm.csv')

Over the top 2. extract peak

Implement function extract_peak which, given a peak as a raw dictionary, RETURN the list of relevant values in this order: Italian name, German name, latitude, longitude, elevation

Note that the elevation and the Italian and German names are inside the field OTHER_TAGS, as ele, name:it and name:de

  • WARNING 1: name:it is not always present! In such cases use the NAME field from the main dictionary

  • WARNING 2: name:de is not always present! In such cases put an empty string

  • HINT: the field OTHER_TAGS looks very much like embedded JSON. To parse it quickly, use the function json.loads, which takes a string as input and outputs a Python object; in this case you will obtain a dictionary. NOTE THE s at the end of json.loads!!
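For instance, parsing an OTHER_TAGS value with json.loads might look like this (the variable names are just for illustration):

```python
import json

# the OTHER_TAGS value is a JSON string embedded in the CSV cell
other_tags = '{"name:de":"Elferkofel","name:it":"Cima Undici","ele":"3090"}'
tags = json.loads(other_tags)        # string in, dictionary out

name_it = tags.get('name:it', '')    # in the exercise, fall back to NAME instead
name_de = tags.get('name:de', '')    # empty string when name:de is absent
elevation = float(tags['ele'])       # numbers should be numbers!
```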

Example - given:

d = OrderedDict([('ID', 'node/26862713'),
                ('NAME', 'Cima Undici'),
                ('CATEGORY', 'TOURISM'),
                ('SUBCATEGORY', 'PEAK'),
                ('LON', '12.3783333'),
                ('LAT', '46.6363889'),
                ('SRID', '4326'),
                ('WKT', 'POINT (12.378333300000001 46.6363889)'),
                ('INTERNATIONAL_NAME', ''),
                ('STREET', ''),
                ('WIKIPEDIA', 'it:Cima Undici'),
                ('PHONE', ''),
                ('CITY', ''),
                ('EMAIL', ''),
                ('ALTERNATIVE_NAME', ''),
                ('OPENING_HOURS', ''),
                ('DESCRIPTION', ''),
                ('WEBSITE', ''),
                ('LAST_UPDATE', ''),
                ('OPERATOR', ''),
                ('POSTCODE', ''),
                ('COUNTRY', ''),
                ('FAX', ''),
                ('IMAGE', ''),
                ('HOUSENUMBER', ''),
                ('OTHER_TAGS',
                '{"name:de":"Elferkofel","natural":"peak","name":"Cima '
                'Undici","name:it":"Cima '
                'Undici","wikidata":"Q628931","ele":"3090"}')])

You should obtain:

>>> extract_peak(d)
['Cima Undici', 'Elferkofel', 46.6363889, 12.3783333, 3090.0]
  • NOTE: numbers should be numbers, not strings!

[5]:

import json

def extract_peak(rawd):
    """Takes a dictionary and RETURN a list
    """
    raise Exception('TODO IMPLEMENT ME !')

from collections import OrderedDict
d = OrderedDict([('ID', 'node/26862713'),
                ('NAME', 'Cima Undici'),
                ('CATEGORY', 'TOURISM'),
                ('SUBCATEGORY', 'PEAK'),
                ('LON', '12.3783333'),
                ('LAT', '46.6363889'),
                ('SRID', '4326'),
                ('WKT', 'POINT (12.378333300000001 46.6363889)'),
                ('INTERNATIONAL_NAME', ''),
                ('STREET', ''),
                ('WIKIPEDIA', 'it:Cima Undici'),
                ('PHONE', ''),
                ('CITY', ''),
                ('EMAIL', ''),
                ('ALTERNATIVE_NAME', ''),
                ('OPENING_HOURS', ''),
                ('DESCRIPTION', ''),
                ('WEBSITE', ''),
                ('LAST_UPDATE', ''),
                ('OPERATOR', ''),
                ('POSTCODE', ''),
                ('COUNTRY', ''),
                ('FAX', ''),
                ('IMAGE', ''),
                ('HOUSENUMBER', ''),
                ('OTHER_TAGS',
                '{"name:de":"Elferkofel","natural":"peak","name":"Cima '
                'Undici","name:it":"Cima '
                'Undici","wikidata":"Q628931","ele":"3090"}')])

extract_peak(d)

Over the top 3. write file

Implement function write_peaks so that it calls extract_peak and writes the obtained lists into peaks.csv with a csv.writer (so this time we write lists, not dictionaries!)

  • REMEMBER to also write the header
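Writing the rows could be sketched like this; the sketch writes to an in-memory buffer instead of peaks.csv so it is self-contained (with a real file, remember to open it with newline=''):

```python
import csv
import io

buffer = io.StringIO()   # stands in for open('peaks.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['name_it', 'name_de', 'latitude', 'longitude', 'elevation'])  # header first!
writer.writerow(['Cima Undici', 'Elferkofel', 46.6363889, 12.3783333, 3090.0])
print(buffer.getvalue())
```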

The first lines should look like this (for the complete expected file see expected-peaks.csv)

name_it,name_de,latitude,longitude,elevation
Alpe di Succiso,,44.3327854,10.1955113,2016.0
Becco di Filadonna,,45.9636324,11.1934654,2150.0
Bechei di Sopra,,46.6077439,12.0444775,2794.0
Catinaccio d'Antermoia,Kesselkogel,46.4740893,11.6438283,3004.0
Cima Ambrizzola,,46.4791667,12.0980556,2715.0
Cima Bastioni,,46.4851159,12.2678531,2926.0
Cima Brenta,,46.1797021,10.900036,3151.0
Cima Cadin di San Lucano,,46.5776149,12.2882724,2839.0
Cima d'Asta,,46.1766183,11.6052937,2847.0
Cima dei Preti,,46.3423245,12.4210592,2707.0
Cima della Vezzana,,46.2899137,11.8297409,3192.0
Cima Dodici,,45.9976856,11.4680336,2337.0
Cima Mora,,46.240557,12.3431523,1940.0
Cima Palon,,45.7922301,11.1765372,2232.0
Cima Pape,,46.3343734,11.9283766,2503.0
Punta di mezzodì,,45.731185,11.1380772,1858.0
Cima Presanella,Cima Presanella,46.2199321,10.6641189,3556.0
Cima Rolle,Rollspitze,46.9463889,11.5077778,2800.0
Cima Tosa,,46.1565222,10.8711276,3136.0
Cima Undici,Elferkofel,46.6363889,12.3783333,3090.0
  .
  .
[7]:

import csv

def write_peaks(in_filename):
    raise Exception('TODO IMPLEMENT ME !')

write_peaks('osm.csv')