The EU Safety Gate Data - Feature Engineering

Safety Gate Logo

Basics

Duplicate Rows - There are no duplicated Rows

Strip - Strip leading / trailing spaces from df

Data Cleaning

Clean up mistyped accent characters, inconsistent punctuation, rename single use of country names, rename long country names to shorter versions to improve visualisation readability.

The Netherlands Netherlands
Republic of Korea South Korea
United Arab Emirates UAE
Russian Federation Russia
United Kingdom in respect of Northern Ireland UK
United Kingdom UK
United States USA
Türkiye Turkey
Democratic People's Republic of Korea North Korea
People's Republic of China China

Feature Engineer - Date Time

Feature Engineer new columns from datetime index : Year, Week,

Feature Engineer - Economic Area

Bulgaria EU Countries
Romania EU Countries
Czechia EU Countries
Denmark EU Countries
Hungary EU Countries
Poland EU Countries
Sweden EU Countries
Croatia EU Countries
Cyprus EU Countries
Ireland EU Countries
Austria EU Countries
Belgium EU Countries
Estonia EU Countries
Finland EU Countries
France EU Countries
Germany EU Countries
Greece EU Countries
Italy EU Countries
Latvia EU Countries
Lithuania EU Countries
Luxembourg EU Countries
Malta EU Countries
The Netherlands EU Countries
Portugal EU Countries
Slovakia EU Countries
Slovenia EU Countries
Spain EU Countries
China China

Fix Brand name / Country of origin inconsistencies

It's worthwhile to improve the quality of relevant data where possible for larger product categories.

Begin with

After mapping Brands

After mapping Countries

Create consistent risk_category feature from Risk type column

Before

There are 218 variations in the `Risk type` column

Each entry is a comma separated list of risks.

The first entry in a list is chosen as the main category, i.e `Burns, Chemical, Damage to hearing` is mapped to `Burns`

A Future task is to split the comma separated variables into binary columns

After Feature Engineering

After Risk type field is feature engineered to risk_category, the result is 14 risk categories
risk_category perc_of_total
Injuries 27.28
Chemical 26.93
Breathing Obstruction 19.08
Electric Shock 11.74
Burns 5.02
Fire 4.37
Environment 1.81
Eyesight Damage 1.23
Microbiological 0.99
Hearing Damage 0.88
Entrapment 0.32
Drowning 0.31
Health Other 0.03

Feature Engineer Category to product_category

There are two product category fields:

OECD portal category : there are 33 categories, however there are a lot of missing values

Category is complete with no missing values and has 80 categories.

The top 12 entries in category represent > 90% of the total number of alerts issued.
product_category will detail the top12 + others listed as other.

Before Feature Engineering

After Feature Engineering

After mapping:

Toys 8793
Clothing & Fashion 5337
Automotive 5317
Electrical Appliances 3443
other 3092
Cosmetics 2525
Lighting 1831
Childcare Equipment 1295
Chemicals 979
Jewellery 907
Protective equipment 826
Hobby \ Sports 628
Machinery 535

Feature Engineer country_top15

There are 111 countries of origin, the top 15 of these represent > 90% of the total alerts. Countries outside of the top 15 were aggregated under 'other'.
Idx country_top15 count perc_of_total
0 China 13866 39.05
1 Unknown 9173 25.83
2 other 2994 8.43
3 Germany 2333 6.57
4 Italy 1263 3.56
5 France 1014 2.86
6 Turkey 937 2.64
7 United States of America 859 2.42
8 Japan 851 2.40
9 United Kingdom 648 1.82
10 Poland 492 1.39
11 India 329 0.93
12 Spain 315 0.89
13 Taiwan 222 0.63
14 Sweden 212 0.60

Binary Encode Compulsory measures / Voluntary measures columns

The Compulsory measures / Voluntary measures columns have detailed not delimited data, sometimes with multiple entries, indicating an escalation in measures.

To simplify the analysis and encoding, the encoding was matched in order of severity.

(I'd make a recommendation to the EU team to re-analyse the requirements of the database and improve this area of their record keeping.)

e.g.

Retailer: Withdrawal of the product from the market:27/05/2020 Retailer:Recall of the product from end users:27/05/2020 Distributor:Ban on the marketing of the product and any accompanying measures:27/05/2020

This example would be split into {compulsory/voluntary}{responsible_party} {action required}

e.g.

{compulsory/voluntary responsible party} options : unknown, retailer, distributor, importer, manufacturer, other

{compulsory/voluntary action required} options : Ban, recall, withdrawal, destruction, removal, criminal, borderstop, warnings


New columns with binary mappings added:

{compulsory_xxxx} or {voluntary_xxxxx} mapping


compulsory_withdrawal
compulsory_recall
compulsory_ban
compulsory_destruction
compulsory_removal
compulsory_criminal
compulsory_borderstop
compulsory_warnings
compulsory_responsible_unknown
compulsory_responsible_retail
compulsory_responsible_disti
compulsory_responsible_importer
compulsory_responsible_manufacturer
compulsory_responsible_other

Compulsory measures mapping order of priority

Compulsory measures -> Binary columns encoding

Set binary values in column if true else false

Since multiple values can exist in each field, the order is relevant, we want to indicate severity and the simple contains search will match any instance, therefore, most import is the last search since it will overwrite previous matches

Severity
(1 low)

Action

% of Alerts

Qty
8 Criminal 0.01% 2
7 Removal from sale 0.31% 111
6 destruction 1.38% 490
5 ban 11.71% 4,157
4 recall 11.17% 3966
3 withdrawal 29.08% 10,324
2 borderstop 4.32% 1,533
1 warnings 1.41% 502

Result of mappings action required

idx Column Total Percentage of Total
1 voluntary_recall 8475 51.56
0 voluntary_withdrawal 5860 35.65
4 voluntary_removal 791 4.81
3 voluntary_destruction 656 3.99
7 voluntary_warnings 367 2.23
2 voluntary_ban 259 1.58
6 voluntary_borderstop 29 0.18
5 voluntary_criminal 0 0.00

Result of mappings responsible party

idx Column Total Percentage of Total
0 voluntary_responsible_unknown 20178 52.83
5 voluntary_responsible_other 6825 17.87
4 voluntary_responsible_manufacturer 4317 11.30
3 voluntary_responsible_importer 3448 9.03
2 voluntary_responsible_disti 2469 6.46
1 voluntary_responsible_retail 954 2.50

Creative Commons License

The reuse policy of European Commission documents is implemented by the Commission Decision 2011/833/EU of 12 December 2011 on the reuse of Commission documents (OJ L 330, 14.12.2011, p. 39). Except otherwise noted, the reuse of this document is authorised under a Creative Commons Attribution 4.0 International (CC-BY 4.0) licence.

This means that reuse is allowed provided appropriate credit is given and any changes are indicated.

Since this entire analysis is based on data from the EU, and I have used visualisations from the official annual summary as inspiration for the programmatically created version of these visualisations, the creative commons license applies this work hosted on rodders.me, this means that reuse of this analysis and its visualisations is allowed provided appropriate credit is given and any changes are indicated.