• Site Navigation
    • Project Home
  • Project Documentation
    • Project Details
    • Tableau Dashboard
  • Customer / Fuel Type (new) Notebooks
    • SMMT-1a-OCR-fueltype-newformat.ipyn
    • SMMT-1b-CLEANSE-fueltype-newformat.ipynb
    • SMMT-1c-SAVE-fueltype-newformat.ipynb
  • Fuel Type (old)
    • SMMT-2a-OCR-fueltype-oldformat.ipynb
    • SMMT-2b-CLEANSE-fueltype-oldformat.ipynb
    • SMMT-2c-SAVE-fueltype-oldformat.ipynb
  • Customer Type (old)
    • SMMT-3a-OCR-custtype-newformat.ipynb
    • SMMT-3b-CLEANSE-custtype-newformat.ipynb
    • SMMT-3c-SAVE-custtype-newformat.ipynb

Data cleanse & feature engineering operations

This notebook carries out datacleanse, error detection and feature engineering on the unclean wide format csv before saving to long format csv.

Notebook NameFunction
SMMT-1a-OCR-fueltype-newformat.ipynb Optical Character recognition of the image files in the import directory. Save to unclean csv
SMMT-1b-CLEANSE-fueltype-newformat.ipynb Datacleanse, error fetection and feature engineering. Saves to cleansed file
SMMT-1b-CLEANSE-fueltype-newformat.ipynb Datacleanse, error detection and feature engineering. Saves to cleansed csv file in long data format.

Import Libraries¶

In [19]:
#!/usr/bin/env python3
from termcolor import colored

Initialise Notebook Options¶

In [20]:
source_path = "/home/hass/Development/smmtdata-evolved/ocr/"
#source_path = "ocr/"
import_name = source_path + "OUT_1a_fueltype_OCR_newformat-2023.txt"
import_file = open(import_name, 'r').readlines()

output_path = "/home/hass/Development/smmtdata-evolved/ocr/"
#output_path = "ocr/"
output_name = output_path + "OUT_1b_fueltype_CLEANSE_newformat-2023.csv"
output_file = open(output_name, 'w')

Replacements & removals lists¶

Replacements list tuple

In [21]:
# Create list of replacements = lookup : Replacement
REPLACEMENTS = [
    ("MHEV diesel", "MHEV_Diesel"),
    ("MHEVdiesel", "MHEV_Diesel"),
    ("MHEV petrol", "MHEV_Petrol"),
    ("MHEVpetrol", "MHEV_Petrol"),
    (" 3 ", " "),
    (" 5 ", " "),
    (" 2 ", " "),
    ('January','January,,'),
    ('February','February,,'),
    ('March','March,,'),
    ('April','April,,'),
    ('May','May,,'),
    ('June','June,,'),
    ('July','July,,'),
    ('August','August,,'),
    ('September','September,,'),
    ('October','October,,'),
    ('November','November,,'),
    ('December','December,,'),
    ('JANUARY','January,,'),
    ('FEBRUARY','February,,'),
    ('MARCH','March,,'),
    ('APRIL','April,,'),
    ('MAY','May,,'),
    ('JUNE','June,,'),
    ('JULY','July,,'),
    ('AUGUST','August,,'),
    ('SEPTEMBER','September,,'),
    ('OCTOBER','October,,'),
    ('NOVEMBER','November,,'),
    ('DECEMBER','December,,'),
    # (',2019,2018','2019,2018'),
    # (',2019,2018','2019,2018'),
    # (',2020,2019','2020,2019'),
    # (',2021,2020','2021,2020'),
    # (',2022,2021','2022,2021'),
    # (',2023,2022','2023,2022'),
    ('2019 2018',',2019,2018'),
    ('2019 2018',',2019,2018'),
    ('2020 2019',',2020,2019'),
    ('2021 2020',',2021,2020'),
    ('2022 2021',',2022,2021'),
    ('2023 2022',',2023,2022'),
    (",,20", ",20"),
    ("   ", " "),
    ("  ", " "),
    (" ", ",")

]

Clean the data¶

  • Remove unwanted change and market share lines
  • Correct Tesseract common errors
  • Remove blank lines
  • Generate a clean CSV for the PANDAS import
  • Track line numbers for error reporting
In [22]:
print("\n")
print("\n")
print("DATA CLEANSING : " + import_name)
uline = colored("================================================================================================================================", "black", "on_white")
print(uline)
print("\n")



line_count = 1 # Keep count of the line count to report errors
for line in import_file:

# Remove lines matching
    if "change" in line == True: 
        pass
        # ignore line
        # print("Change True" + "\n")
        
    elif "mkt" in line == True:
        pass
        # also do nothing
        # print("mkt True" + "\n")
    elif "Mkt" in line == True:
        pass
        # also do nothing
        # print("Mkt True" + "\n")
    elif line == "\n":
        pass
        # also do nothing
        # print("Blank Line" + "\n")   
    elif line == "\n\r":
        pass
        # also do nothing
        # print("Blank Line" + "\n")
    elif line == "\f":
        pass
        # also do nothing
        # print("Blank Line" + "\n")
    elif len(line.strip()) == 0:
        pass
        # also do nothing
        # print("Blank Line" + "\n")
    else:
#################################################################
# Search and replace

        for old, new in REPLACEMENTS:
            line = line.replace(old,new)
        
        linecheck = line.strip().split(",")
        if len(linecheck) != 3:
            print("================================================================================================================================")
            print("Line Error : " + str(line_count) + "  Field Count : " + str(len(linecheck)))
            error_line = colored(line, "black", "on_white").replace("\n","") 
            print(error_line)
        output_line = line.strip()
        output_file.write(output_line)
        output_file.write("\n")
    line_count += 1
output_file.close()

print("FINISHED")
print("\n")
print("\n")
print("The generated file needs to be validated, OCR is not 100% reliable")
print("================================================================================================================================")
print("\n")
print(" Import File : " + import_name)
print("\n")
print(" Output File : " + output_name)
print("\n")
print("================================================================================================================================")



DATA CLEANSING : /home/hass/Development/smmtdata-evolved/ocr/OUT_1a_fueltype_OCR_newformat-2023.txt
================================================================================================================================


FINISHED




The generated file needs to be validated, OCR is not 100% reliable
================================================================================================================================


 Import File : /home/hass/Development/smmtdata-evolved/ocr/OUT_1a_fueltype_OCR_newformat-2023.txt


 Output File : /home/hass/Development/smmtdata-evolved/ocr/OUT_1b_fueltype_CLEANSE_newformat-2023.csv


================================================================================================================================