Food Standards Agency - Source Meta Data

Source data from Web Scraping

Data Observations

There are four main fields in the data obtained from web scraping :

Date
Type (Food Alert / Allergy Alert)
Title
Description

Each row of data may cover a single product or multiple products. It is not always obvious from the title or the body text of the search result which is the case.

Clicking into a result presents an opportunity to scrape more product information. However, there is no consistent format in how products are presented: some alerts have product content in paragraphs with <br> breaks, while others have product content in HTML tables. This means alerts with multiple products will have to be edited by hand.

Missing Data

Dataset	nationalarchives.gov.uk Start Date	nationalarchives.gov.uk End Date	Missing Data From	Missing Data To	food.gov.uk Start Date
Food Alerts	02-10-2014	05-12-2017	06-12-2017	17-02-2021	18-02-2021
Allergy Alerts	14-10-2014	06-12-2017	03-07-2019	04-07-2019	05-07-2019
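
The length of each gap can be checked directly from the dates in the table. The following is a minimal sketch, assuming the dates are in DD-MM-YYYY format as shown; the helper names parse_dmy and gap_days are illustrative and not part of the original project.

```python
from datetime import date, datetime


def parse_dmy(text: str) -> date:
    """Parse a DD-MM-YYYY string, the format used in the table above."""
    return datetime.strptime(text, "%d-%m-%Y").date()


def gap_days(missing_from: str, missing_to: str) -> int:
    """Number of days in a gap, counting both the first and last day."""
    return (parse_dmy(missing_to) - parse_dmy(missing_from)).days + 1


# Values copied from the Missing Data table.
gaps = {
    "Food Alerts": gap_days("06-12-2017", "17-02-2021"),
    "Allergy Alerts": gap_days("03-07-2019", "04-07-2019"),
}

for dataset, days in gaps.items():
    print(f"{dataset}: {days} days missing")
```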

Observations per year

A large amount of data is missing: the yearly counts drop to almost zero for 2018 to 2020.

Year	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024
count	12	67	97	54	0	1	0	154	176	164	98
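
A yearly count like the one above could be produced along these lines. This is a sketch only: it assumes the Date values are DD-MM-YYYY strings, and the sample dates are taken from the Missing Data table rather than from the full dataset. Years with no observations are reported as zero so that gaps stay visible.

```python
from collections import Counter
from datetime import datetime


def count_per_year(dates, first_year, last_year):
    """Count observations per year, filling empty years with 0."""
    years = Counter(datetime.strptime(d, "%d-%m-%Y").year for d in dates)
    return {year: years.get(year, 0) for year in range(first_year, last_year + 1)}


# Illustrative sample, not the real dataset.
sample_dates = ["02-10-2014", "14-10-2014", "05-12-2017", "18-02-2021"]
counts = count_per_year(sample_dates, 2014, 2021)

for year, count in counts.items():
    print(year, count)
```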

Original Source Data

Column	Notes
Source	Where data was sourced, food.gov.uk or nationalarchives.gov.uk
Date	Date the notice was issued
Alert_Type	Allergen or Safety Alert
Title	Alert Title Text
BodyText	Alert Body text (usually has more information)

Data Cleansing

The layouts of the nationalarchives.gov.uk and food.gov.uk websites were different: food.gov.uk has both food and allergy alerts on the same page, whereas the National Archives had a separate page for each category.

The web scraping process produced three main raw files :

fsa_alerts_raw-food_gov : For Food Safety & Allergy Alerts
fsa_allergy_alerts_raw-national-archive : For Allergy Alerts
fsa_food_safety_alerts_raw-national-archive : For Food Safety Alerts

These were combined into a single file : fsa_alerts_raw-combined.
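
The combination step might look something like the sketch below. It assumes the raw files are CSV with Date, Title and BodyText columns, and that the National Archives files carry no Alert_Type column because each file covers a single category; neither assumption is confirmed by the source. The function name combine is illustrative, and in-memory text stands in for the real files.

```python
import csv
import io


def combine(raw_files):
    """Merge raw files into one list of rows with a Source column.

    raw_files is an iterable of (source, default_alert_type, file_object).
    default_alert_type is used when a file has no Alert_Type column.
    """
    combined = []
    for source, default_type, handle in raw_files:
        for row in csv.DictReader(handle):
            combined.append({
                "Source": source,
                "Date": row["Date"],
                "Alert_Type": row.get("Alert_Type") or default_type,
                "Title": row["Title"],
                "BodyText": row["BodyText"],
            })
    return combined


# Hypothetical file contents standing in for the real raw files.
food_gov = io.StringIO(
    "Date,Alert_Type,Title,BodyText\n"
    "18-02-2021,Food Alert,Example title A,Example body A\n"
)
archive_allergy = io.StringIO(
    "Date,Title,BodyText\n"
    "14-10-2014,Example title B,Example body B\n"
)

combined = combine([
    ("food.gov.uk", None, food_gov),
    ("nationalarchives.gov.uk", "Allergy Alert", archive_allergy),
])
```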

Engineered Features

Some automation is applied to alert notices :

A typical Alert Notice is preceded by the front page search results. These often note the retail brand, but not always.

Example :

Search Results Title : Update 11: FGS Ingredients Ltd recalls a number of products containing mustard powder because of undeclared peanuts

Search Results Title sometimes includes the supplier, in this instance FGS Ingredients Ltd
Alert Notice Body Text contains the product info : SPAR Cheese and Onion Sandwich Filler - 220g
Brand is typically contained in either the search results title or the alert notice body text, so both are presented to the user, who edits the brand into the captured data by hand. Supplier and Outlet are not assumed or automatically captured into separate fields; the alert notice is presented to the user, who edits them into the captured data by hand.

Column	Engineered Feature Notes
Date	Datetime conversion
datetime	None
year	Datetime conversion
month	Datetime conversion
Alert_Type	None
Product_category	Edited by hand
Product_Type	Edited by hand
Title	None
BodyText	None
Brand	Edited by hand - (if any) as noted on the alert notice
Supplier	Edited by hand or automatically captured from alert notice e.g.
Supplier_Type	Hand edited - Manufacturer / Importer / Grocer / Wholesale / Unknown.
Outlet	Consumer outlet - Edited by hand - The name of the retail outlet (if noted)
Outlet_Type	Type of outlet : Grocer / Convenience / Restaurant / Takeaway / Retailer / Unknown
Product	Automatically extracted from Alert Notice Body text
Risk	Extract from Body text
Pathogen	Hand edited
Allergen	Hand edited
Foreign_Material	Hand edited
Other	Hand edited
month_num	Datetime conversion
month_name	Datetime conversion
year_month	Datetime conversion
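
The datetime-derived columns could be generated roughly as follows. This is a sketch under assumptions: the scraped Date is taken to be a DD-MM-YYYY string, and the exact formats of month_name and year_month (full English month name and YYYY-MM here) are guesses, as the source does not state them. The function name date_features is illustrative.

```python
from datetime import datetime


def date_features(date_text, fmt="%d-%m-%Y"):
    """Derive the datetime-based columns from a single Date string."""
    dt = datetime.strptime(date_text, fmt)
    return {
        "datetime": dt,
        "year": dt.year,
        "month_num": dt.month,
        "month_name": dt.strftime("%B"),      # assumed format: full month name
        "year_month": dt.strftime("%Y-%m"),   # assumed format: YYYY-MM
    }


features = date_features("18-02-2021")
print(features["year"], features["month_num"], features["year_month"])
```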

Manually applying consistency

The initial results of the web scrape were combined into a single Excel file and reviewed by hand.

The EDA and feature engineering revealed that the words recall and because could be used as delimiters. A title of the form :

“Boots recalls / recall / is recalling product XXXX because of reason YYYY”

could be split into separate fields :

Supplier : Everything before the recalls/recall/is recalling
Risk : Everything after because

Where this format was not available, the title was edited to ensure consistency.
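
The split described above can be sketched with a regular expression. The pattern and the function name split_title are illustrative rather than the code actually used; titles that do not match return None, which corresponds to the cases that were edited by hand.

```python
import re

# Supplier: text before recalls / recall / is recalling.
# Risk: text after "because".
TITLE_PATTERN = re.compile(
    r"^(?P<supplier>.*?)\s+(?:is\s+recalling|recalls|recall)\b"
    r".*?\bbecause\b(?P<risk>.*)$",
    re.IGNORECASE | re.DOTALL,
)


def split_title(title):
    """Return Supplier and Risk from a title, or None if it does not fit."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        return None
    return {
        "Supplier": match.group("supplier").strip(),
        "Risk": match.group("risk").strip(),
    }


print(split_title("Boots recalls product XXXX because of reason YYYY"))
print(split_title("Request for comment"))
```

Because Risk is taken as everything after because, it keeps the leading "of" in titles worded "because of"; that prefix could be trimmed in a later step if preferred.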

Breakout Pathogens, Allergens, Foreign Material and Other Reasons

Each entry was reviewed and additional fields added to the dataset :

Pathogen
Allergen
Foreign_Material
Other

The Title and Body Text were reviewed to extract the required information.

Duplicate Entries

On reviewing this file, it was noted that many entries in the National Archives dataset were duplicated under both Food Alerts and Allergy Alerts. These were cleaned by hand, removing the duplicate entry that carried the incorrect category.

The file was hand edited to ensure the consistency required for data cleansing.

Specifically:

Incorrect posts were removed; examples include requests for tender and requests for comment.
Update posts were removed.
Incorrect alert type classifications were corrected (Food Alert / Allergy Alert).
Duplicates were removed.

In total, 191 duplicate entries were removed. All of them were listed under both Food Alert and Allergy Alert, and in each case the Allergy Alert categorisation was the incorrect one.
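
Although this clean-up was done by hand, the rule for cross-listed entries can be illustrated in code. The sketch below assumes that a duplicate is an entry with the same Date and Title under both alert types, and that the Allergy Alert copy is the incorrectly categorised one, as the note on the 191 removed entries suggests. The matching key and the function name drop_cross_listed are assumptions.

```python
def drop_cross_listed(rows):
    """Remove Allergy Alert rows that also appear as a Food Alert.

    Two rows are treated as the same notice when Date and Title match.
    Returns (kept_rows, removed_rows) so removals can be reviewed.
    """
    food_keys = {
        (row["Date"], row["Title"])
        for row in rows
        if row["Alert_Type"] == "Food Alert"
    }
    kept, removed = [], []
    for row in rows:
        key = (row["Date"], row["Title"])
        if row["Alert_Type"] == "Allergy Alert" and key in food_keys:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed


# Hypothetical rows for illustration.
rows = [
    {"Date": "14-10-2014", "Alert_Type": "Food Alert", "Title": "Example title"},
    {"Date": "14-10-2014", "Alert_Type": "Allergy Alert", "Title": "Example title"},
    {"Date": "15-10-2014", "Alert_Type": "Allergy Alert", "Title": "Another example"},
]
kept, removed = drop_cross_listed(rows)
print(len(kept), "kept,", len(removed), "removed")
```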

Product Breakout

Where there were multiple products in the Title or Body Text of the search results, these were broken out into separate entries. For older alert notices, the full list of products was not available in the search results, requiring a click through to get the full detail. It should also be noted that product names in titles were not applied consistently by the agency when the data was first entered into the website database, so a lot of hand editing was required to get this data into a usable format.

Thankfully, more recent Alert Notices include additional product details within the body text identified by a <caption> tag, and it is possible to automatically capture this along with pack sizes.
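
Capturing the <caption> text could be done with the standard library HTML parser, as in the sketch below. The sample markup is hypothetical (only the product name comes from the example earlier in this document), and the class and function names are illustrative. Pack size extraction is not covered here because the surrounding table layout is not described in the source.

```python
from html.parser import HTMLParser


class CaptionCollector(HTMLParser):
    """Collect the text content of every <caption> element."""

    def __init__(self):
        super().__init__()
        self._in_caption = False
        self._parts = []
        self.captions = []

    def handle_starttag(self, tag, attrs):
        if tag == "caption":
            self._in_caption = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "caption" and self._in_caption:
            # Collapse runs of whitespace left over from the markup.
            self.captions.append(" ".join("".join(self._parts).split()))
            self._in_caption = False

    def handle_data(self, data):
        if self._in_caption:
            self._parts.append(data)


def extract_captions(html):
    parser = CaptionCollector()
    parser.feed(html)
    parser.close()
    return parser.captions


# Hypothetical markup for illustration.
sample = (
    "<table><caption>SPAR Cheese and Onion Sandwich Filler</caption>"
    "<tr><th>Pack size</th><td>220g</td></tr></table>"
)
print(extract_captions(sample))
```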

For older Alert Notices, the presentation of the data was not consistent, which made automated scraping of the product details impractical. Older Alert Notices had product details in the Body Text in several forms: some were in HTML tables with no tagging, some were within paragraph tags, and some were separated with <BR>.

Therefore extracting the full product detail from each page would be a significant endeavour.
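
One way to size that effort would be to label each notice by the layout it uses before deciding what to edit by hand. The following is a rough sketch, not part of the original workflow; the category labels and the function name classify_layout are assumptions, and the checks are simple substring tests rather than full HTML parsing.

```python
import re


def classify_layout(body_html):
    """Label the product layout of a notice body, most specific first."""
    html = body_html.lower()
    if "<caption" in html:
        return "caption"      # newer notices, can be captured automatically
    if "<table" in html:
        return "table"        # untagged tables in older notices
    if re.search(r"<br\s*/?>", html):
        return "br"           # products separated by line breaks
    if re.search(r"<p[\s>]", html):
        return "paragraph"    # products inside paragraph tags
    return "unknown"


print(classify_layout("<p>Product A<BR>Product B</p>"))
print(classify_layout("<table><tr><td>Product A</td></tr></table>"))
```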