Project Goal

To obtain additional monthly car registration data to add to my DVLA Registration Dashboard in Tableau.

Monthly Vehicle Registration by Fuel Type and Keepership

This is additional data being added to the DVLA Registered Vehicles dashboard I maintain on Tableau. The DVLA data is issued annually and only gives annual totals for registered vehicles. SMMT publish data monthly in an image format and this project, grabs that data from the images and processes it into a CSV file suitable for import into Tableau.

https://public.tableau.com/authoring/VehiclesRegisteredforUKdf_VEH0220/F

Tableau Dashboard frontpage

Intent

Extract data from SMMT Public Vehicle Registration Data published monthly in png and jpg image format, of varying image sizes, and then parse this data into a database to enable querying through a dashboard.

Important Note

The SMMT (Society of Motor Manufacturers and Traders) make their data available to the public through their news publication service. The data published by the SMMT is copyright and for this project I have applied a random percentage to the data to ensure the SMMT copyright is maintained.

If you wish to source data from the SMMT, please contact them directly: https://www.smmt.co.uk/vehicle-data/production-reports/

Preparation

Obtain Images

The images can be downloaded from the SMMT website and Google Images, however the URLs are not consistent :-) and while 20% of these files were downloaded with an automated wget attempt, there were still significant errors from missing files. This meant, hand downloading the missing files that failed the wget attempt using a logical expansion.

Check Images

A sanity check was performed to check for inconsistencies in image sizes which would affect the OCR process.

check\_image\_sizes.py script

This script produces a report summarising the image sizes.

Image Sizer output

Image sizes were not consistent.

From this it's clear the cropping function calls will need to be adapted to the different images sizes and formats.

Image data layout

SMMT produced data in two formats that changed over time:

The OLD format contained both fuel type and Keepership type in one image:

Source Image - Old Format fuel and customer in one Cropped Image
New Format dual New Format dual cropped

I only want the month data, the year data can be calculated, and sticking to monthly data simplifies everything.

The image is cropped before the OCR function to limit the OCR to a area of the image I'm interested in and limit extraneous data being captured.

The newer format separated fuel type and Keepership type into two separate images.

Source Image - New Format Keepership Cropped Image
New Format customer New Format customer cropped
Source Image - New Format Fuel Type Cropped Image
New Format fuel New Format fuel cropped

Data Structures

I'm after a consistent monthly figure from this data, along with additional fields that can be used to validate results and spot any OCR errors not picked up during the OCR conversion process. While it's not necessary to capture a total row, the Total row can be used to to validate accuracy, so while not required for the final data set, it's helpful for validation.

Field Name Contents Type Description
Usage Public
Private
string Public dashboard data is scrambled
Source new
old
String Which format the data came from
old or new
Year The year of the data Int Useful helpers in Tableau sometimes
Month The month of the data String Useful helpers in Tableau sometimes
Topcat Fossil Fuel
Renewable Fuel
String Top Category
Subcat Petrol
Diesel
MHEV Petrol
MHEV Diesel
Plugin Hybrid (PHEV)
Battery Electric (BEV)
Hybrid Electric (HEV)
Fleet
Private
Business
Total
String Sub Category
Quantity Number of vehicles shipped Int
Acquire_Date Date Date the data was acquired and entered into the db
Acquire_By String Who acquired the data
Acquire_Method Tesseract
String Data Acquisition Method

Breaking down the tasks

  1. OCR Image files into a text file for three different layouts
  2. Data Cleanse
  3. Feature Engineering
  4. Reformat csv through pandas into long format, save in a new csv with additional engineered features

Caveats

Order of operations

1. Image to Text conversion

  1. Dual Type - Old Format : Crop and OCR images, for both fuel type and keepership, store results in a text file .TXT
  2. Fuel Type - NEW Format : Crop and OCR images, for Fuel Types, store results in a text file .TXT
  3. Keepership Type - NEW Format : Crop and OCR images, for Fuel Types, store results in a text file .TXT

2. Data Cleansing

  1. Fuel Type - Old Format : data cleanse TXT File, format and save into a CSV
  2. Fuel Type - NEW Format : data cleanse TXT File, format and save into a CSV
  3. Keepership Type - NEW Format : data cleanse TXT File, format and save into a CSV

3. Feature engineering, Feature additions, error checking and correction

The OCR process creates errors.

While steps are taken during the Feature Engineering stage to identify these errors, additional efforts are required:

So we've got nine operations to carry out across 145 image files of different image layouts and image dimensions.

Notebooks and outputs

New Format Fuel Type

SMMT-1a-OCR-fueltype-newformat.ipynb 1a-OCR-fueltype-newformat.ipynb

New Format Fuel

Crop and OCR images

New Format Fuel cropped

Store results in a text file ../ocr/OUT_1a_fueltype_OCR_newformat-2023

Example output stored in the text file:

  MAY

  2023 2022
  Diesel 5758 7614
  Petrol 59766  56767
  MHEVdiesel 5316 5823
  MHEV petrol 23034 16842
  BEV 24513 15448
  PHEV 9025 7339
  HEV 17792 14561
  TOTAL 145204 124394

SMMT-1b-CLEANSE-fueltype-newformat.ipynb

Data cleanse TXT File, format and save into a CSV ../ocr/OUT_1b_fueltype_CLEANSE_newformat-2023.csv

Example output stored in CSV file

  May,,
  ,2023,2022
  Diesel,5758,7614
  Petrol,59766,56767
  MHEV_Diesel,5316,5823
  MHEV_Petrol,23034,16842
  BEV,24513,15448
  PHEV,9025,7339
  HEV,17792,14561
  TOTAL,145204,124394

SMMT-1c-SAVEfueltype-newformat.ipynb

Add top level category for fuel types, convert fuel types into sub categories, validation check / error reporting, shuffle data into the right fields, save to csv ../ocr/OUT_1c_fueltype_newformat_FINISHED-2023.csv

Example output store in CSV file

  0,Private,Primary,2023,May,Fossil,Diesel,5758,06-12-23,Tesseract-new,Manual,May-2023
  1,Private,Secondary,2022,May,Fossil,Diesel,7614,06-12-23,Tesseract-new,Manual,May-2022
  2,Private,Primary,2023,May,Fossil,Petrol,59766,06-12-23,Tesseract-new,Manual,May-2023
  3,Private,Secondary,2022,May,Fossil,Petrol,56767,06-12-23,Tesseract-new,Manual,May-2022
  4,Private,Primary,2023,May,xHEV,MHEV_Diesel,5316,06-12-23,Tesseract-new,Manual,May-2023
  5,Private,Secondary,2022,May,xHEV,MHEV_Diesel,5823,06-12-23,Tesseract-new,Manual,May-2022
  6,Private,Primary,2023,May,xHEV,MHEV_Petrol,23034,06-12-23,Tesseract-new,Manual,May-2023
  7,Private,Secondary,2022,May,xHEV,MHEV_Petrol,16842,06-12-23,Tesseract-new,Manual,May-2022
  8,Private,Primary,2023,May,Plugin,BEV,24513,06-12-23,Tesseract-new,Manual,May-2023
  9,Private,Secondary,2022,May,Plugin,BEV,15448,06-12-23,Tesseract-new,Manual,May-2022
  10,Private,Primary,2023,May,Plugin,PHEV,9025,06-12-23,Tesseract-new,Manual,May-2023
  11,Private,Secondary,2022,May,Plugin,PHEV,7339,06-12-23,Tesseract-new,Manual,May-2022
  12,Private,Primary,2023,May,xHEV,xHEV,17792,06-12-23,Tesseract-new,Manual,May-2023
  13,Private,Secondary,2022,May,xHEV,xHEV,14561,06-12-23,Tesseract-new,Manual,May-2022
  14,Private,Primary,2023,May,Total,Total,145204,06-12-23,Tesseract-new,Manual,May-2023
  15,Private,Secondary,2022,May,Total,Total,124394,06-12-23,Tesseract-new,Manual,May-2022

New Format Keepership Type

SMMT-3a-OCR-custtype-newformat.ipynb

New Format customer

Crop and OCR images

New Format customer cropped

Store results in a text file ../ocr/OUT_3a_custtype_OCR_newformat-2023

Example output stored in the text file:

  MAY
  2023 2022
  Private 65932  66242
  Fleet 76207 55649
  Business 3065 2503
  TOTAL  145204  124394

SMMT-3b-CLEANSE-custtype-newformat.ipynb

data cleanse TXT File, format and save into a CSV

Example output stored in CSV file


      May,,
      ,2023,2022
      Private,65932,66242
      Fleet,76207,55649
      Business,3065,2503
      TOTAL,145204,124394

SMMT-3c-SAVE-custtype-newformat.ipynb

validation check / error reporting, save to csv ../ocr/OUT_3c_custtype_newformat-FINISHED-2023.csv

Example output store in CSV file

  ,Usage,Source,Year,Month,Topcat,Subcat,Quantity,Acquire_Date,Acquire_By,Acquire_ Method,fileDate
  8,Private,Primary,2023,May,Customer,Private,65932,06-12-23,Tesseract-new,Manual,May-2023
  9,Private,Secondary,2022,May,Customer,Private,66242,06-12-23,Tesseract-new,Manual,May-2022
  10,Private,Primary,2023,May,Customer,Fleet,76207,06-12-23,Tesseract-new,Manual,May-2023
  11,Private,Secondary,2022,May,Customer,Fleet,55649,06-12-23,Tesseract-new,Manual,May-2022
  12,Private,Primary,2023,May,Customer,Business,3065,06-12-23,Tesseract-new,Manual,May-2023
  13,Private,Secondary,2022,May,Customer,Business,2503,06-12-23,Tesseract-new,Manual,May-2022
  14,Private,Primary,2023,May,Customer,Total,145204,06-12-23,Tesseract-new,Manual,May-2023
  15,Private,Secondary,2022,May,Customer,Total,124394,06-12-23,Tesseract-new,Manual,May-2022

OLD Format Fuel & Keepership Type

SMMT-2a-tesseract-fueltype-oldformat.ipynb

Dual layout

Crop and OCR images.

Cropped Dual Layout

Save results in a text file ../ocr/OUT_2a_fueltype_OCR_oldformat-2023.txt

Example output stored in the text file:

  October Total Diesel Petrol AFV Private Fleet Business
  2016 180168 89004 85139 6025 77821 96440 5907
  2015 177664 91177 81125 5362 78662 92596 6406
  change 144 244 49 1244 11 42 78
  Mkt share 2016 4944 473 33 432 535 33
  Mkt share 2015 513 457 30 443 521 36

SMMT-2b-CLEANSE-fueltype-oldformat.ipynb

Data cleanse TXT File, format into csv.

Save into a CSV ../ocr/OUT_3a_custtype_CLEANSE_oldformat-2023.csv

Example output stored in CSV file

  October,Total,Diesel,Petrol,AFV,Private,Fleet,Business
  2016,180168,89004,85139,6025,77821,96440,5907
  2015,177664,91177,81125,5362,78662,92596,6406

SMMT-2c-SAVE-fueltype-oldformat.ipynb

Validation check / error reporting, variable shuffling with pandas and recovery of OLD data without a primary source.

Save to csv ../ocr/OUT_2c_fueltype_oldformat-FINISHED-2023.csv

Example output store in CSV file

,Usage,Source,Year,Month,Topcat,Subcat,Quantity,Acquire_Date,Acquire_By,Acquire_Method,fileDate 0,Private,Primary,2016,October,Fuel_TOTAL,Total,180168,03-07-23,tesseract-old,Manual,October-2016 1,Private,Primary,2016,October,Fossil,Diesel,89004,03-07-23,tesseract-old,Manual,October-2016 2,Private,Primary,2016,October,Fossil,Petrol,85139,03-07-23,tesseract-old,Manual,October-2016 3,Private,Primary,2016,October,Xhev,AFV,6025,03-07-23,tesseract-old,Manual,October-2016 4,Private,Primary,2016,October,Customer,Private,77821,03-07-23,tesseract-old,Manual,October-2016 5,Private,Primary,2016,October,Customer,Fleet,96440,03-07-23,tesseract-old,Manual,October-2016 6,Private,Primary,2016,October,Customer,Business,5907,03-07-23,tesseract-old,Manual,October-2016 7,Private,Secondary,2015,October,Fuel_TOTAL,Total,177664,03-07-23,tesseract-old,Manual,October-2015 8,Private,Secondary,2015,OctoberFossil,Diesel,91177,03-07-23,tesseract-old,Manual,October-2015 9,Private,Secondary,2015,OctoberFossil,Petrol,81125,03-07-23,tesseract-old,Manual,October-2015 10,Private,Secondary,2015,OctoberXhev,AFV,5362,03-07-23,tesseract-old,Manual,October-2015 11,Private,Secondary,2015,OctoberCustomer,Private,78662,03-07-23,tesseract-old,Manual,October-2015 12,Private,Secondary,2015,OctoberCustomer,Fleet,92596,03-07-23,tesseract-old,Manual,October-2015 13,Private,Secondary,2015,OctoberCustomer,Business,6406,03-07-23,tesseract-old,Manual,October-2015

Last Steps

The last steps are manual, recovering the missing data from primary source (Oct-2015 to Sep-2016) is achieved by using the data from images from the following year. Since it's simpler to just grab all the data, primary data (the data for the file it's dated for) and secondary data (the previous year is detailed in the file hence why we can recover 2015 data from the 2016 files) is grabbed at this stage but tagged with its source: primary or secondary. While it's simple to filter current primary data using the source field = 'primary' this would miss the missing primary data, so the final manual step is to sort the data file by date and change the source flag for Oct-2015 - Sep-2016 to primary and since we are already manually editing the file, we may as well remove the secondary data while we are at it.

Data is also 'scrambled' a bit to hide the true numbers because the SMMT sell their data as a service and while this is not for commercial use, it is the complete data set in CSV format from 2015 to the current day, so it's quite valuable. The usage column is a flag for me, Private is the real data, Public is the scrambled data you see on the Tableau dashboard.