Data cleanse & feature engineering operations
This notebook carries out datacleanse, error detection and feature engineering on the unclean wide format csv before saving to long format csv.
| Notebook Name | Function |
| SMMT-1a-OCR-fueltype-newformat.ipynb | Optical Character recognition of the image files in the import directory. Save to unclean csv |
| SMMT-1b-CLEANSE-fueltype-newformat.ipynb | Datacleanse, error fetection and feature engineering. Saves to cleansed file |
| SMMT-1b-CLEANSE-fueltype-newformat.ipynb | Datacleanse, error detection and feature engineering. Saves to cleansed csv file in long data format. |
Import Libraries¶
In [19]:
#!/usr/bin/env python3
from termcolor import colored
Initialise Notebook Options¶
In [20]:
source_path = "/home/hass/Development/smmtdata-evolved/ocr/"
#source_path = "ocr/"
import_name = source_path + "OUT_1a_fueltype_OCR_newformat-2023.txt"
import_file = open(import_name, 'r').readlines()
output_path = "/home/hass/Development/smmtdata-evolved/ocr/"
#output_path = "ocr/"
output_name = output_path + "OUT_1b_fueltype_CLEANSE_newformat-2023.csv"
output_file = open(output_name, 'w')
Replacements & removals lists¶
Replacements list tuple
In [21]:
# Create list of replacements = lookup : Replacement
REPLACEMENTS = [
("MHEV diesel", "MHEV_Diesel"),
("MHEVdiesel", "MHEV_Diesel"),
("MHEV petrol", "MHEV_Petrol"),
("MHEVpetrol", "MHEV_Petrol"),
(" 3 ", " "),
(" 5 ", " "),
(" 2 ", " "),
('January','January,,'),
('February','February,,'),
('March','March,,'),
('April','April,,'),
('May','May,,'),
('June','June,,'),
('July','July,,'),
('August','August,,'),
('September','September,,'),
('October','October,,'),
('November','November,,'),
('December','December,,'),
('JANUARY','January,,'),
('FEBRUARY','February,,'),
('MARCH','March,,'),
('APRIL','April,,'),
('MAY','May,,'),
('JUNE','June,,'),
('JULY','July,,'),
('AUGUST','August,,'),
('SEPTEMBER','September,,'),
('OCTOBER','October,,'),
('NOVEMBER','November,,'),
('DECEMBER','December,,'),
# (',2019,2018','2019,2018'),
# (',2019,2018','2019,2018'),
# (',2020,2019','2020,2019'),
# (',2021,2020','2021,2020'),
# (',2022,2021','2022,2021'),
# (',2023,2022','2023,2022'),
('2019 2018',',2019,2018'),
('2019 2018',',2019,2018'),
('2020 2019',',2020,2019'),
('2021 2020',',2021,2020'),
('2022 2021',',2022,2021'),
('2023 2022',',2023,2022'),
(",,20", ",20"),
(" ", " "),
(" ", " "),
(" ", ",")
]
Clean the data¶
- Remove unwanted change and market share lines
- Correct Tesseract common errors
- Remove blank lines
- Generate a clean CSV for the PANDAS import
- Track line numbers for error reporting
In [22]:
print("\n")
print("\n")
print("DATA CLEANSING : " + import_name)
uline = colored("================================================================================================================================", "black", "on_white")
print(uline)
print("\n")
line_count = 1 # Keep count of the line count to report errors
for line in import_file:
# Remove lines matching
if "change" in line == True:
pass
# ignore line
# print("Change True" + "\n")
elif "mkt" in line == True:
pass
# also do nothing
# print("mkt True" + "\n")
elif "Mkt" in line == True:
pass
# also do nothing
# print("Mkt True" + "\n")
elif line == "\n":
pass
# also do nothing
# print("Blank Line" + "\n")
elif line == "\n\r":
pass
# also do nothing
# print("Blank Line" + "\n")
elif line == "\f":
pass
# also do nothing
# print("Blank Line" + "\n")
elif len(line.strip()) == 0:
pass
# also do nothing
# print("Blank Line" + "\n")
else:
#################################################################
# Search and replace
for old, new in REPLACEMENTS:
line = line.replace(old,new)
linecheck = line.strip().split(",")
if len(linecheck) != 3:
print("================================================================================================================================")
print("Line Error : " + str(line_count) + " Field Count : " + str(len(linecheck)))
error_line = colored(line, "black", "on_white").replace("\n","")
print(error_line)
output_line = line.strip()
output_file.write(output_line)
output_file.write("\n")
line_count += 1
output_file.close()
print("FINISHED")
print("\n")
print("\n")
print("The generated file needs to be validated, OCR is not 100% reliable")
print("================================================================================================================================")
print("\n")
print(" Import File : " + import_name)
print("\n")
print(" Output File : " + output_name)
print("\n")
print("================================================================================================================================")
DATA CLEANSING : /home/hass/Development/smmtdata-evolved/ocr/OUT_1a_fueltype_OCR_newformat-2023.txt
================================================================================================================================
FINISHED
The generated file needs to be validated, OCR is not 100% reliable
================================================================================================================================
Import File : /home/hass/Development/smmtdata-evolved/ocr/OUT_1a_fueltype_OCR_newformat-2023.txt
Output File : /home/hass/Development/smmtdata-evolved/ocr/OUT_1b_fueltype_CLEANSE_newformat-2023.csv
================================================================================================================================