What’s your business?

Download worked project

Browse files online

So you want to be a data scientist. Good, plenty of opportunities ahead!

After graduating, you might discover though that many companies require you to actually work as a freelancer: you will just need to declare to the state which type of economic activity you are going to perform, they say. Seems easy, but you will soon encounter a pretty burocratic problem: do public institutions even know what a data scientist is? If not, what is the closest category they recognize? Is there any specific exclusion that would bar you from entering that category?

If you are in Europe, you will be presented with a catalog of economic activites you can choose from called NACE, which is then further specialized by various states (for example Italy’s catalog is called ATECO)

What to do

  1. Unzip exercises zip in a folder, you should obtain something like this:

nace-prj
    nace.ipynb
    nace-sol.ipynb
    NACE_REV2_20200628_213139.csv
    jupman.py

WARNING: to correctly visualize the notebook, it MUST be in an unzipped folder !

  1. open Jupyter Notebook from that folder. Two things should open, first a console and then a browser. The browser should show a file list: navigate the list and open the notebook nace.ipynb

  2. Go on reading the notebook, and write in the appropriate cells when asked

Shortcut keys:

  • to execute Python code inside a Jupyter cell, press Control + Enter

  • to execute Python code inside a Jupyter cell AND select next cell, press Shift + Enter

  • to execute Python code inside a Jupyter cell AND a create a new cell aftwerwards, press Alt + Enter

  • If the notebooks look stuck, try to select Kernel -> Restart

NACE Sections

A NACE code is subdivided in a hierarchical, four-level structure. The categories at the highest level are called sections, here they are:

nace-sections

Section detail

If you drill down in say, section M, you will find something like this:

The first two digits of the code identify the division, the third digit identifies the group, and the fourth digit identifies the class:

example-nace

Let’s pick for example Advertising agencies , which has code 73.11:

Level

Code

Spec

Description

1

Section

M

a single alphabetic char

PROFESSIONAL, SCIENTIFIC AND TECHNICAL ACTIVITIES

2

Division

73

two-digits

Advertising and market research

3

Group

73.1

three-digits, with dot after first two

Advertising

4

Class

73.12

four-digits, with dot after first two

Advertising agencies

Specifications

WARNING: CODES MAY CONTAIN ZEROES!

IF YOU LOAD THE CSV IN LIBREOFFICE CALC OR EXCEL, MAKE SURE IT IMPORTS EVERYTHING AS STRING!

WATCH OUT FOR CHOPPED ZEROES !

Zero examples:

  • Veterinary activities contains a double zero at the end : 75.00

  • group Manufacture of beverages contains a single zero at the end: 11.0

  • Manufacture of beer contains zero inside : 11.05

  • Support services to forestry contains a zero at the beginning : 02.4 which is different from 02.40 even if they have the same description !

The section level code is not integrated in the NACE code: For example, the activity Manufacture of glues is identified by the code 20.52, where 20 is the code for the division, 20.5 is the code for the group and 20.52 is the code of the class; section C, to which this class belongs, does not appear in the code itself.

There may be gaps (not very important for us): The divisions are coded consecutively. However, some “gaps” have been provided to allow the introduction of additional divisions without a complete change of the NACE coding.

NACE CSV

We provide you with a CSV NACE_REV2_20200628_213139.csv that contains all the codes. Try to explore it with LibreOffice Calc or pandas

Here we show some relevant parts (NOTE: you DON’T need to use pandas)

[1]:
import pandas as pd   # we import pandas and for ease we rename it to 'pd'
import numpy as np    # we import numpy and for ease we rename it to 'np'

pd.set_option('display.max_colwidth', None)
df = pd.read_csv('NACE_REV2_20200628_213139.csv', encoding='UTF-8')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 996 entries, 0 to 995
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Order                     996 non-null    int64
 1   Level                     996 non-null    int64
 2   Code                      996 non-null    object
 3   Parent                    975 non-null    object
 4   Description               996 non-null    object
 5   This item includes        778 non-null    object
 6   This item also includes   202 non-null    object
 7   Rulings                   134 non-null    object
 8   This item excludes        507 non-null    object
 9   Reference to ISIC Rev. 4  996 non-null    object
dtypes: int64(2), object(8)
memory usage: 77.9+ KB
[2]:
df.head(5)
[2]:
Order Level Code Parent Description This item includes This item also includes Rulings This item excludes Reference to ISIC Rev. 4
0 398481 1 A NaN AGRICULTURE, FORESTRY AND FISHING This section includes the exploitation of vegetal and animal natural resources, comprising the activities of growing of crops, raising and breeding of animals, harvesting of timber and other plants, animals or animal products from a farm or their natural habitats. NaN NaN NaN A
1 398482 2 01 A Crop and animal production, hunting and related service activities This division includes two basic activities, namely the production of crop products and production of animal products, covering also the forms of organic agriculture, the growing of genetically modified crops and the raising of genetically modified animals. This division includes growing of crops in open fields as well in greenhouses.\n \nGroup 01.5 (Mixed farming) breaks with the usual principles for identifying main activity. It accepts that many agricultural holdings have reasonably balanced crop and animal production, and that it would be arbitrary to classify them in one category or the other. This division also includes service activities incidental to agriculture, as well as hunting, trapping and related activities. NaN Agricultural activities exclude any subsequent processing of the agricultural products (classified under divisions 10 and 11 (Manufacture of food products and beverages) and division 12 (Manufacture of tobacco products)), beyond that needed to prepare them for the primary markets. The preparation of products for the primary markets is included here.\n\nThe division excludes field construction (e.g. agricultural land terracing, drainage, preparing rice paddies etc.) classified in section F (Construction) and buyers and cooperative associations engaged in the marketing of farm products classified in section G. Also excluded is the landscape care and maintenance, which is classified in class 81.30. 01
2 398483 3 01.1 01 Growing of non-perennial crops This group includes the growing of non-perennial crops, i.e. plants that do not last for more than two growing seasons. Included is the growing of these plants for the purpose of seed production. NaN NaN NaN 011
3 398484 4 01.11 01.1 Growing of cereals (except rice), leguminous crops and oil seeds This class includes all forms of growing of cereals, leguminous crops and oil seeds in open fields. The growing of these crops is often combined within agricultural units.\n\nThis class includes:\n- growing of cereals such as:\n . wheat\n . grain maize\n . sorghum\n . barley\n . rye\n . oats\n . millets\n . other cereals n.e.c.\n- growing of leguminous crops such as:\n . beans\n . broad beans\n . chick peas\n . cow peas\n . lentils\n . lupines\n . peas\n . pigeon peas\n . other leguminous crops\n- growing of oil seeds such as:\n . soya beans\n . groundnuts\n . castor bean\n . linseed\n . mustard seed\n . niger seed\n . rapeseed\n . safflower seed\n . sesame seed\n . sunflower seed\n . other oil seeds NaN NaN This class excludes:\n- growing of rice, see 01.12\n- growing of sweet corn, see 01.13\n- growing of maize for fodder, see 01.19\n- growing of oleaginous fruits, see 01.26 0111
4 398485 4 01.12 01.1 Growing of rice This class includes:\n- growing of rice (including organic farming and the growing of genetically modified rice) NaN NaN NaN 0112

We can focus on just these columns:

[3]:
selection = [398482,398488,398530,398608,398482,398518,398521,398567]

from IPython.display import display

example_df = df[['Order', 'Level','Code','Parent','Description','This item excludes']]
# Assuming the variable df contains the relevant DataFrame
example_df = example_df[example_df['Order'].isin(selection)]
display(example_df.style.set_properties(**{'white-space': 'pre-wrap',}))
  Order Level Code Parent Description This item excludes
1 398482 2 01 A Crop and animal production, hunting and related service activities Agricultural activities exclude any subsequent processing of the agricultural products (classified under divisions 10 and 11 (Manufacture of food products and beverages) and division 12 (Manufacture of tobacco products)), beyond that needed to prepare them for the primary markets. The preparation of products for the primary markets is included here. The division excludes field construction (e.g. agricultural land terracing, drainage, preparing rice paddies etc.) classified in section F (Construction) and buyers and cooperative associations engaged in the marketing of farm products classified in section G. Also excluded is the landscape care and maintenance, which is classified in class 81.30.
7 398488 4 01.15 01.1 Growing of tobacco This class excludes: - manufacture of tobacco products, see 12.00
37 398518 4 01.64 01.6 Seed processing for propagation This class excludes: - growing of seeds, see groups 01.1 and 01.2 - processing of seeds to obtain oil, see 10.41 - research to develop or modify new forms of seeds, see 72.11
40 398521 2 02 A Forestry and logging Excluded is further processing of wood beginning with sawmilling and planing of wood, see division 16.
49 398530 2 03 A Fishing and aquaculture This division does not include building and repairing of ships and boats (30.1, 33.15) and sport or recreational fishing activities (93.19). Processing of fish, crustaceans or molluscs is excluded, whether at land-based plants or on factory ships (10.20).
86 398567 4 09.90 09.9 Support activities for other mining and quarrying This class excludes: - operating mines or quarries on a contract or fee basis, see division 05, 07 or 08 - specialised repair of mining machinery, see 33.12 - geophysical surveying services, on a contract or fee basis, see 71.12
127 398608 4 11.03 11.0 Manufacture of cider and other fruit wines This class excludes: - merely bottling and labelling, see 46.34 (if performed as part of wholesale) and 82.92 (if performed on a fee or contract basis)

1. Extracting codes

Let’s say European Commission wants to review the catalog to simplify it. One way to do it, could be to look for codes that have lots of exclusions, the reasoning being that trying to explain somebody something by stating what it is not often results in confusion.

1.1 is_nace

Implement following function. NOTE: it was not explicitly required in the original exam but could help detecting words.

Show solution
[4]:

def is_nace(word):
    """Given a word, RETURN True if the word is a NACE code, else otherwise"""
    raise Exception('TODO IMPLEMENT ME !')

assert is_nace('0') == False
assert is_nace('01') == True
assert is_nace('A') == True   # this is a Section
assert is_nace('AA') == False
assert is_nace('a') == False
assert is_nace('01.2') == True
assert is_nace('01.20') == True
assert is_nace('03.25') == True
assert is_nace('02.753') == False
assert is_nace('300') == False
assert is_nace('5012') == False

1.2 extract_codes

Implement following function which extracts codes from This item excludes column cells. For examples, see asserts.

Show solution
[5]:
def extract_codes(text):
    """Extracts all the NACE codes from given text (a single string),
       and RETURN a list of the codes

       - also extracts section letters
       - list must have *no* duplicates
    """
    raise Exception('TODO IMPLEMENT ME !')

assert extract_codes('group 02.4') == ['02.4']
assert extract_codes('class 02.40') == ['02.40']
assert extract_codes('.') == []
assert extract_codes('exceeding 300 litres') == []
assert extract_codes('see 46.34') == ['46.34']
assert extract_codes('divisions 10 and 11') == ['10','11']
assert extract_codes('(10.20)') == ['10.20']
assert extract_codes('(30.1, 33.15)') == ['30.1', '33.15']
assert extract_codes('as outlined in groups 85.1-85.4, i.e.') == ['85.1','85.4']
assert extract_codes('see 25.99 see 25.99') == ['25.99']  # no duplicates
assert extract_codes('section A') == ['A']
assert extract_codes('in section G. Also') == ['G']
assert extract_codes('section F (Construction)') == ['F']
assert extract_codes('section A, section A') == ['A']
[6]:
# MORE REALISTIC asserts:

t01 = """Agricultural activities exclude any subsequent processing of the
agricultural products (classified under divisions 10 and 11 (Manufacture of food
products and beverages) and division 12 (Manufacture of tobacco products)), beyond
that needed to prepare them for the primary markets. The preparation of products for
the primary markets is included here.

The division excludes field construction (e.g. agricultural land terracing,
drainage, preparing rice paddies etc.) classified in section F (Construction) and buyers
and cooperative associations engaged in the marketing of farm products classified
in section G. Also excluded is the landscape care and maintenance,
which is classified in class 81.30.
"""
assert extract_codes(t01) == ['10','11','12','F','G','81.30']

t01_15 = """This class excludes:
- manufacture of tobacco products, see 12.00
"""
assert extract_codes(t01_15) == ['12.00']

t03 = """This division does not include building and repairing of ships and
boats (30.1, 33.15) and sport or recreational fishing activities (93.19).
Processing of fish, crustaceans or molluscs is excluded, whether at land-based
plants or on factory ships (10.20).
"""

assert extract_codes(t03) == ['30.1', '33.15','93.19','10.20']

t11_03 = """This class excludes:
- merely bottling and labelling, see 46.34 (if performed as part of wholesale)
and 82.92 (if performed on a fee or contract basis)
"""
assert extract_codes(t11_03) == ['46.34', '82.92']


t01_64 = """This class excludes:
- growing of seeds, see groups 01.1 and 01.2
- processing of seeds to obtain oil, see 10.41
- research to develop or modify new forms of seeds, see 72.11
"""
assert extract_codes(t01_64) == ['01.1','01.2','10.41','72.11']

t02 = """Excluded is further processing of wood beginning with sawmilling and planing of wood,
see division 16.
"""
assert extract_codes(t02) == ['16']

t09_90 = """This class excludes:
- operating mines or quarries on a contract or fee basis, see division 05, 07 or 08
- specialised repair of mining machinery, see 33.12
- geophysical surveying services, on a contract or fee basis, see 71.12
"""
assert extract_codes(t09_90) == ['05','07','08','33.12','71.12']

2. build_db

Given a filepath pointing to a NACE CSV, reads the CSV and RETURN a dictionary mapping codes to dictionaries which hold the code descriptionn and a field with the list of excluded codes, for example:

{'01': {'description': 'Crop and animal production, hunting and related service activities',
  'exclusions': ['10', '11', '12', 'F', 'G', '81.30']},
 '01.1': {'description': 'Growing of non-perennial crops', 'exclusions': []},
 '01.11': {'description': 'Growing of cereals (except rice), leguminous crops and oil seeds',
  'exclusions': ['01.12', '01.13', '01.19', '01.26']},
 '01.12': {'description': 'Growing of rice', 'exclusions': []},
 '01.13': {'description': 'Growing of vegetables and melons, roots and tubers',
  'exclusions': ['01.28', '01.30']},
 ...
 ...
}

The complete desired output is in file expected_db.py

Show solution
[7]:
def build_db(filepath):
    raise Exception('TODO IMPLEMENT ME !')

activities_db = build_db('NACE_REV2_20200628_213139.csv')
#activities_db

3. plot

Implement function plot which given a db as created at previous point and a code level among 1,2,3,4, plots the number of exclusions for all codes of that exact level (so do not include sublevels in the sum), sorted in reversed order.

  • remember to plot title, notice it should shows the type of level (could be Section, Division, Group, or Class)

  • try to display labels nicely as in the example output

(if you look at the graph, apparently European Union has a hard time defining what an artist is :-)

IMPORTANT: IF you couldn’t implement the function build_db , you will still find the complete desired output in expected_db.py, to import it write: from expected_db import activities_db

expected-division-exclusions.png

Show solution
[8]:

%matplotlib inline
def plot(db, level):

    import matplotlib.pyplot as plt
    raise Exception('TODO IMPLEMENT ME !')

#Uncomment *only* if you had problems with build_db
#from expected_db import activities_db

#1 Section
#2 Division
#3 Group
#4 Class
plot(activities_db, 2)
[ ]: