
Food Standards Agency - Source Meta Data

Source data from Web Scraping

Data Observations

There are four main fields in the data obtained from web scraping:

Each row of data could be a single product or multiple products; it is not always obvious from the title or results body text which is which.

Clicking into a result presents an opportunity to scrape more product information. However, there is no consistent format in how products are presented: some had product content in paragraphs with <br> breaks, and some had product content in HTML tables. This means alerts with multiple products will have to be edited by hand.

Missing Data

| Dataset | nationalarchives.gov.uk Start Date | nationalarchives.gov.uk End Date | Missing Data From | Missing Data To | food.gov.uk Start Date |
| --- | --- | --- | --- | --- | --- |
| Food Alerts | 02-10-2014 | 05-12-2017 | 06-12-2017 | 17-02-2021 | 18-02-2021 |
| Allergy Alerts | 14-10-2014 | 06-12-2017 | 03-07-2019 | 04-07-2019 | 05-07-2019 |


Observations per year

A lot of data is missing from mid-2017 to 2020: there are no observations for 2018 or 2020, and only one for 2019.

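| Year | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Count | 12 | 67 | 97 | 54 | 0 | 1 | 0 | 154 | 176 | 164 | 98 |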
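
A per-year count like the one above can be reproduced with a few lines of Python. This is only a sketch: the function name, the DD-MM-YYYY date format and the sample dates are assumptions for illustration, not the code actually used for this dataset.

```python
from collections import Counter
from datetime import datetime


def observations_per_year(dates, first_year, last_year, fmt="%d-%m-%Y"):
    """Count alerts per calendar year, keeping years that have no alerts."""
    counts = Counter(datetime.strptime(d, fmt).year for d in dates)
    # Listing every year in the range makes gaps show up as explicit zeros.
    return {year: counts.get(year, 0) for year in range(first_year, last_year + 1)}


# Illustrative dates only (assumed DD-MM-YYYY strings).
sample_dates = ["02-10-2014", "14-10-2014", "05-12-2017", "18-02-2021"]
per_year = observations_per_year(sample_dates, 2014, 2021)
print(per_year)
```

Filling the whole year range, rather than only the years present in the data, is what makes a gap such as 2018 visible as a zero instead of silently disappearing.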


Original Source Data

| Column | Notes |
| --- | --- |
| Source | Where data was sourced, food.gov.uk or nationalarchives.gov.uk |
| Date | Date the notice was issued |
| Alert_Type | Allergen or Safety Alert |
| Title | Alert Title Text |
| BodyText | Alert Body text (usually has more information) |


Data Cleansing

The layouts of the nationalarchives and food.gov websites were different: food.gov has both food and allergy alerts on the same page, whereas the National Archives had a separate page for each category.

The web scraping process produced three main raw, unprocessed files:

These were combined into a single file: fsa_alerts_raw-combined.

Engineered Features

Some automation is applied to alert notices:

A typical Alert Notice is preceded by the front page search results. These often note the retail brand, but not always.

Example:

Search Results Title: Update 11: FGS Ingredients Ltd recalls a number of products containing mustard powder because of undeclared peanuts


Search Results Title sometimes includes the supplier, in this instance FGS Ingredients Ltd
Alert Notice Body Text contains the product info : SPAR Cheese and Onion Sandwich Filler - 220g
Brand is typically contained in either the search results title or the alert notice body text, so both are presented to the user, who enters the brand into the captured data by hand. Supplier and Outlet are not assumed or automatically captured into separate fields; the alert notice is likewise presented to the user so that they can be entered by hand.
| Column | Engineered Feature Notes |
| --- | --- |
| Date | Datetime conversion |
| datetime | None |
| year | Datetime conversion |
| month | Datetime conversion |
| Alert_Type | None |
| Product_category | Edited by hand |
| Product_Type | Edited by hand |
| Title | None |
| BodyText | None |
| Brand | Edited by hand - (if any) as noted on the alert notice |
| Supplier | Edited by hand or automatically captured from alert notice e.g. |
| Supplier_Type | Hand edited - Manufacturer / Importer / Grocer / Wholesale / Unknown. |
| Outlet | Consumer outlet - Edited by hand - The name of the retail outlet (if noted) |
| Outlet_Type | Type of outlet : Grocer / Convenience / Restaurant / Takeaway / Retailer / Unknown |
| Product | Automatically extracted from Alert Notice Body text |
| Risk | Extract from Body text |
| Pathogen | Hand edited |
| Allergen | Hand edited |
| Foreign_Material | Hand edited |
| Other | Hand edited |
| month_num | Datetime conversion |
| month_name | Datetime conversion |
| year_month | Datetime conversion |
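
The "Datetime conversion" columns can be derived from the alert date along the following lines. This is a minimal sketch: the DD-MM-YYYY input format, the `date_features` helper name and the exact contents of each derived column (for example `year_month` as `YYYY-MM`) are assumptions, not a record of how the dataset was actually built.

```python
from datetime import datetime


def date_features(raw_date, fmt="%d-%m-%Y"):
    """Derive calendar features from one alert date string."""
    dt = datetime.strptime(raw_date, fmt)
    return {
        "datetime": dt,
        "year": dt.year,
        "month_num": dt.month,                # 1 to 12
        "month_name": dt.strftime("%B"),      # full month name in the current locale
        "year_month": dt.strftime("%Y-%m"),   # assumed layout, sorts chronologically
    }


features = date_features("02-10-2014")
print(features["year"], features["month_num"], features["year_month"])
```

Passing the format explicitly avoids day and month being swapped silently, which is a common problem when DD-MM and MM-DD dates are parsed by guesswork.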


Manually applying consistency

The initial results of the web scrape were combined into a single Excel file and reviewed by hand.

The EDA and feature engineering revealed that the words "recall" and "because" could be used as delimiters. A title of the form:

“Boots recalls / recall / is recalling product XXXX because of reason YYYY”

could be split into separate fields:

Where this format was not available, the title was edited to ensure consistency.
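
The delimiter idea described above can be sketched with a regular expression. The field names `company`, `product` and `reason` are hypothetical labels chosen for this example, and the pattern is an illustration of the approach rather than the code used on the dataset.

```python
import re

# "recalls" / "recall" / "is recalling" separates the company from the product;
# "because" (optionally followed by "of") separates the product from the reason.
TITLE_PATTERN = re.compile(
    r"^(?P<company>.+?)\s+(?:is\s+recalling|recalls|recall)\s+"
    r"(?P<product>.+?)\s+because\s+(?:of\s+)?(?P<reason>.+)$",
    re.IGNORECASE,
)


def split_title(title):
    """Return the three title parts, or None if the title needs hand editing."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


example = split_title(
    "Update 11: FGS Ingredients Ltd recalls a number of products "
    "containing mustard powder because of undeclared peanuts"
)
print(example)
```

A title that does not follow the pattern returns `None`, which gives a simple way to list the titles that still need editing by hand.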

Breakout Pathogens, Allergens, Foreign Material and Other Reasons

Each entry was reviewed and additional fields added to the dataset:

The Title and Body Text were reviewed to extract the required information.

Duplicate Entries

Reviewing this file, it was noted that many entries in the National Archives dataset were duplicated under both Food Alerts and Allergy Alerts. These were cleaned by hand, removing the duplicate entry that had the incorrect category.

This was hand edited to ensure the consistency required for data cleansing.

Specifically:

191 duplicate entries were removed. All of them were listed under both Food Alert and Allergy Alert, and in each case the Allergy Alert categorisation was the incorrect one.
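
Although the decision about which copy to remove was made by hand, candidate duplicates can be flagged automatically. The sketch below assumes rows are dictionaries keyed by the `Date`, `Alert_Type` and `Title` columns, and that a duplicate shares the same date and title; the helper name and the sample rows are invented for illustration.

```python
from collections import defaultdict


def find_cross_category_duplicates(rows):
    """Return groups of rows that share Date and Title but differ in Alert_Type."""
    groups = defaultdict(list)
    for row in rows:
        # Normalise the title so case and stray spaces do not hide a match.
        key = (row["Date"], row["Title"].strip().lower())
        groups[key].append(row)
    return [
        group
        for group in groups.values()
        if len({r["Alert_Type"] for r in group}) > 1
    ]


# Invented sample rows, not taken from the dataset.
rows = [
    {"Date": "02-10-2014", "Alert_Type": "Food Alert",
     "Title": "Example Ltd recalls Example Pie because of undeclared milk"},
    {"Date": "02-10-2014", "Alert_Type": "Allergy Alert",
     "Title": "Example Ltd recalls Example Pie because of undeclared milk"},
    {"Date": "14-10-2014", "Alert_Type": "Allergy Alert",
     "Title": "Example Ltd recalls Example Cake because of undeclared egg"},
]
duplicates = find_cross_category_duplicates(rows)
print(len(duplicates))
```

The function only reports the groups; choosing which category is correct still needs the kind of manual review described above.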

Product Breakout

Where there were multiple products in the Title or Body Text of the search results, these were broken out into separate entries. For older alert notices, the full list of products was not available in the search results, requiring a click through to get the full detail. However, product names in titles were not applied consistently by the agency when the data was first entered into the website database, so a lot of hand editing was required to get this data into a usable format.

Thankfully, more recent Alert Notices include additional product details within the body text identified by a <caption> tag, and it is possible to automatically capture this along with pack sizes.
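
Capturing product details from a <caption> tag could look something like the sketch below, which uses only the Python standard library. The HTML layout (one table per product, with label and value cells such as "Pack size") is an assumption made for this example, as are the class and function names; the real alert pages may be structured differently.

```python
from html.parser import HTMLParser


class CaptionTableParser(HTMLParser):
    """Collect each table's <caption> text and the text of its cells."""

    def __init__(self):
        super().__init__()
        self.tables = []      # one dict per table: {"caption": str, "cells": [str]}
        self._current = None  # table currently being parsed, if any
        self._target = None   # "caption" or "cell" while inside such a tag

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._current = {"caption": "", "cells": []}
        elif self._current is not None and tag == "caption":
            self._target = "caption"
        elif self._current is not None and tag in ("td", "th"):
            self._target = "cell"
            self._current["cells"].append("")

    def handle_endtag(self, tag):
        if tag in ("caption", "td", "th"):
            self._target = None
        elif tag == "table" and self._current is not None:
            self.tables.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is None or self._target is None:
            return  # text outside a caption or cell is ignored
        if self._target == "caption":
            self._current["caption"] += data
        else:
            self._current["cells"][-1] += data


def extract_products(html):
    """Return one dict per captioned table, pairing cells as label/value."""
    parser = CaptionTableParser()
    parser.feed(html)
    products = []
    for table in parser.tables:
        cells = [cell.strip() for cell in table["cells"]]
        details = dict(zip(cells[0::2], cells[1::2]))
        products.append({"product": table["caption"].strip(), **details})
    return products


# Hypothetical markup, assumed for illustration only.
sample = """
<table>
  <caption>Example Brand Sandwich Filler</caption>
  <tr><th>Pack size</th><td>220g</td></tr>
  <tr><th>Use by</th><td>All dates</td></tr>
</table>
"""
products = extract_products(sample)
print(products)
```

This sketch does not handle nested tables or tables without a caption, so it would only suit the more recent, consistently tagged notices.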

For older Alert Notices, the presentation of the data was not consistent, making subsequent scraping of the data impossible. Some older Alert Notices had product details in the Body Text in HTML tables with no tagging, some within paragraph tags, and some separated by <BR> breaks.

Therefore extracting the full product detail from each page would be a significant endeavour.