There are four main fields in the data obtained from web scraping :
Each row of data could be a single product or multiple products; it is not always obvious from the title or results body text which is which.
Clicking into a result presents an opportunity to scrape more product information. However, there is no consistent format in how
products are presented: some had product content in paragraphs with `<br>` breaks, and some had product content in HTML tables.
This means alerts with multiple products will have to be edited by hand.
| Dataset | nationalarchives.gov.uk Start Date | nationalarchives.gov.uk End Date | Missing Data From | Missing Data To | food.gov.uk Start Date |
|---|---|---|---|---|---|
| Food Alerts | 02-10-2014 | 05-12-2017 | 06-12-2017 | 17-02-2021 | 18-02-2021 |
| Allergy Alerts | 14-10-2014 | 06-12-2017 | 03-07-2019 | 04-07-2019 | 05-07-2019 |
A lot of data is missing from mid-2017 to 2020.
| Year | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12 | 67 | 97 | 54 | 0 | 1 | 0 | 154 | 176 | 164 | 98 |
| Column | Notes |
|---|---|
| Source | Where data was sourced, food.gov.uk or nationalarchives.gov.uk |
| Date | Date the notice was issued |
| Alert_Type | Allergen or Safety Alert |
| Title | Alert Title Text |
| BodyText | Alert Body text (usually has more information) |
The layouts of the nationalarchives.gov.uk and food.gov.uk websites differed: food.gov.uk has both food and allergy alerts on the same page, whereas the National Archives site had separate pages for each category.
The web scraping process produced three main raw, unprocessed files:
These were combined into a single file: fsa_alerts_raw-combined.
Some automation is applied to alert notices:
A typical Alert Notice is preceded by the front page search results. Often these note the retail brand, but not always.
Example :
FGS Ingredients Ltd SPAR Cheese and Onion Sandwich Filler - 220g

| Column | Engineered Feature Notes |
|---|---|
| Date | Datetime conversion |
| datetime | None |
| year | Datetime conversion |
| month | Datetime conversion |
| Alert_Type | None |
| Product_category | Edited by hand |
| Product_Type | Edited by hand |
| Title | None |
| BodyText | None |
| Brand | Edited by hand - (if any) as noted on the alert notice |
| Supplier | Edited by hand or automatically captured from alert notice e.g. |
| Supplier_Type | Hand edited - Manufacturer / Importer / Grocer / Wholesale / Unknown. |
| Outlet | Consumer outlet - Edited by hand - The name of the retail outlet (if noted) |
| Outlet_Type | Type of outlet : Grocer / Convenience / Restaurant / Takeaway / Retailer / Unknown |
| Product | Automatically extracted from Alert Notice Body text |
| Risk | Extract from Body text |
| Pathogen | Hand edited |
| Allergen | Hand edited |
| Foreign_Material | Hand edited |
| Other | Hand edited |
| month_num | Datetime conversion |
| month_name | Datetime conversion |
| year_month | Datetime conversion |
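
The "Datetime conversion" columns in the table above could be derived along the following lines. This is a minimal sketch, not the project's actual code: it assumes the scraped `Date` is a `DD-MM-YYYY` string (as in the date tables in this document), and the function name `date_features` and the `YYYY-MM` layout of `year_month` are illustrative assumptions.

```python
from datetime import datetime


def date_features(date_text):
    """Derive calendar columns from a DD-MM-YYYY date string (assumed format)."""
    parsed = datetime.strptime(date_text, "%d-%m-%Y")
    return {
        "datetime": parsed,
        "year": parsed.year,
        "month": parsed.month,
        "month_num": parsed.month,
        # Month name follows the process locale; English under the default locale.
        "month_name": parsed.strftime("%B"),
        # "YYYY-MM" is an assumed layout for year_month.
        "year_month": parsed.strftime("%Y-%m"),
    }


# Date taken from the coverage table in this document.
features = date_features("05-12-2017")
```

Keeping both a numeric and a named month makes it easy to sort chronologically while still labelling charts with readable names.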
The initial results of the web scrape were combined into a single Excel file and reviewed by hand.
The EDA and feature engineering revealed that the words recall and because could be used as delimiters. A title of the form:
“Boots recalls / recall / is recalling product XXXX because of reason YYYY”
could be split into separate fields at the delimiters recalls / recall / is recalling and because.
Where this format was not available, the title was edited to ensure consistency.
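
The delimiter-based split of titles can be sketched with a regular expression. This is an illustration under stated assumptions rather than the code actually used: the field names `company`, `product` and `reason` are hypothetical, and the sample titles reuse the placeholder wording from the example above.

```python
import re

# Split on "recalls" / "recall" / "is recalling" and on "because" (optionally "because of").
TITLE_PATTERN = re.compile(
    r"^(?P<company>.+?)\s+(?:is\s+recalling|recalls|recall)\s+"
    r"(?P<product>.+?)\s+because\s+(?:of\s+)?(?P<reason>.+)$",
    re.IGNORECASE,
)


def split_title(title):
    """Return the title's fields as a dict, or None if the title does not fit the pattern."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        return None  # Such titles would need editing by hand first.
    return {name: value.strip() for name, value in match.groupdict().items()}


parsed = split_title("Boots recalls product XXXX because of reason YYYY")
unmatched = split_title("Update on product XXXX")
```

Returning `None` for titles that do not match makes it straightforward to list the entries that still need manual editing.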
Each entry was reviewed and additional fields added to the dataset :
The Title and Body Text were reviewed to extract the required information.
On reviewing this file, it was noted that many entries in the National Archives dataset were duplicated under both Food Alerts and Allergy Alerts. These had to be cleaned by hand, removing the duplicate entry that carried the incorrect category.
This was hand edited to ensure consistency required for data cleansing.
Specifically:
191 duplicate entries were removed. All of them were listed under both Food Alert and Allergy Alert, and in each case the Allergy Alert categorisation was the incorrect one.
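
The duplicates were removed by hand, but candidates for review could be flagged automatically. The sketch below is a hypothetical helper, not part of the original workflow: it assumes each row is a dict with the `Date`, `Title` and `Alert_Type` columns, and that a duplicate is an entry with the same date and title under more than one alert type. The sample rows are made up.

```python
from collections import defaultdict


def find_cross_category_duplicates(rows):
    """Return (date, title) keys that appear under more than one Alert_Type."""
    types_by_key = defaultdict(set)
    for row in rows:
        # Normalise the title so that case and stray whitespace do not hide a duplicate.
        key = (row["Date"], row["Title"].strip().lower())
        types_by_key[key].add(row["Alert_Type"])
    return sorted(key for key, types in types_by_key.items() if len(types) > 1)


# Made-up rows for illustration only.
sample_rows = [
    {"Date": "02-10-2014", "Title": "Example title A", "Alert_Type": "Food Alert"},
    {"Date": "02-10-2014", "Title": "Example title A ", "Alert_Type": "Allergy Alert"},
    {"Date": "14-10-2014", "Title": "Example title B", "Alert_Type": "Allergy Alert"},
]
flagged = find_cross_category_duplicates(sample_rows)
```

The function only flags entries; deciding which copy carries the wrong category remains a manual step.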
Where there were multiple products in the Title or Body Text of the search results, these were broken out into separate entries. For older alert notices, the full list of products was not available in the search results, requiring a click to get the full detail. It should also be noted that product names in titles were not applied consistently by the agency when the data was first entered into the website database, so a lot of hand editing was needed to get this data into a usable format.
Thankfully, more recent Alert Notices include additional product details within the body text, identified by a `<caption>` tag, and it is possible to automatically capture this along with pack sizes.
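
Capturing the `<caption>` text could look roughly like the sketch below, which uses only Python's standard `html.parser` module. The class and function names are hypothetical, and the sample markup is an assumed shape for a recent Alert Notice rather than a copy of a real page.

```python
from html.parser import HTMLParser


class CaptionCollector(HTMLParser):
    """Collect the text content of every <caption> element."""

    def __init__(self):
        super().__init__()
        self.captions = []
        self._buffer = None  # Holds text pieces while inside a <caption>.

    def handle_starttag(self, tag, attrs):
        if tag == "caption":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "caption" and self._buffer is not None:
            # Collapse runs of whitespace before storing the caption text.
            self.captions.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_captions(html_text):
    collector = CaptionCollector()
    collector.feed(html_text)
    collector.close()
    return collector.captions


# Assumed markup; the product name is the example used earlier in this document.
sample_html = (
    "<table><caption>SPAR Cheese and Onion Sandwich Filler</caption>"
    "<tr><th>Pack size</th><td>220g</td></tr></table>"
)
captions = extract_captions(sample_html)
```

A notice listing several products would yield one caption per product table, which maps naturally onto one dataset row per product.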
For older Alert Notices, the presentation of the data was not consistent, making subsequent automated scraping of the data impractical. Older Alert Notices had product details in the Body Text in varying forms: some were in HTML tables with no tagging, some were within paragraph tags, and some were separated with `<br>`.
Therefore extracting the full product detail from each page would be a significant endeavour.