Handling BOM of CSV #2

@JasonQiu

Description

CSV files saved by MS Excel (and possibly other tools) may start with a UTF-8 byte order mark (BOM), which Python reads as \ufeff at the start of the first header.

Adding print(headers) after line 133 of UK_LLC_File_1_Checker.py for debugging gives:

Checking filename
Loading data from file
['\ufeffSTUDY_ID', 'ROW_STATUS', 'NHS_NUMBER', 'SURNAME', 'FORENAME', 'MIDDLENAMES', 'ADDRESS_1', 'ADDRESS_2', 'ADDRESS_3', 'ADDRESS_4', 'ADDRESS_5', 'POSTCODE', 'ADDRESS_START_DATE', 'ADDRESS_END_DATE', 'DATE_OF_BIRTH', 'GENDER_CD', 'CREATE_DATE', 'UKLLC_STATUS', 'NHS_E_Linkage_Permission', 'NHS_Digital_Study_Number', 'NHS_S_Linkage_Permission', 'NHS_S_Study_Number', 'NHS_W_Linkage_Permission', 'NHS_NI_Linkage_Permission', 'NHS_NI_Study_Number', 'Geocoding_Permission', 'Small_Area_Permission', 'Environment_Permission', 'Property_Level_Permission', 'Multiple_Birth', 'National_Opt_Out', 'DFE_Linkage_Permission', 'DWP_Linkage_Permission', 'HMRC_Linkage_Permission']
Unrecognised field names
Unrecognised field names
Column field name(s) STUDY_ID are not as expected. Unable to continue.
Line(s) (ignoring header) 0

This confuses users because the BOM is invisible: MS Excel, VS Code, Notepad++ and similar tools do not display the character in front of STUDY_ID. The status bar may show an indicator such as "Encoding: UTF-8 with BOM", but that is easy to miss.
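One way to make the BOM visible is to look at the first bytes of the file directly. The sketch below is only an illustration: `has_utf8_bom` is a hypothetical helper name, not something in the checker, and the demo file is created on the fly.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Demo on a temporary file that mimics a CSV exported with a BOM.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b"STUDY_ID,ROW_STATUS\n")
    demo_path = tmp.name

bom_found = has_utf8_bom(demo_path)
os.remove(demo_path)
print(bom_found)
```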

> sw_vers
ProductName:		macOS
ProductVersion:		26.2
BuildVersion:		25C56
> python -V
Python 3.13.11

Solution

The checker could print the raw value that Python parsed (not only the expected value) so that users can investigate. Alternatively, it could specify another encoding when opening the file, but at the cost of general applicability.
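A minimal sketch of both ideas, assuming the headers are read with the standard csv module (the checker's actual loading code is not shown in this issue). `read_headers` and `report_unexpected` are hypothetical names used for illustration. Python's built-in utf-8-sig codec strips a leading BOM if one is present and otherwise decodes like utf-8, so it only loses generality for files that are not UTF-8 at all.

```python
import csv
import io


def read_headers(raw_bytes, encoding="utf-8-sig"):
    """Decode CSV bytes and return the header row as a list of strings."""
    text = raw_bytes.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


def report_unexpected(headers, expected):
    """Describe unexpected headers with repr() so hidden characters show."""
    return [f"Unexpected field name: {h!r}" for h in headers if h not in expected]


# Bytes of a tiny CSV that starts with the UTF-8 BOM (EF BB BF).
sample = b"\xef\xbb\xbfSTUDY_ID,ROW_STATUS\n"
expected = {"STUDY_ID", "ROW_STATUS"}

# Plain utf-8 keeps the BOM attached to the first header name.
raw_headers = read_headers(sample, encoding="utf-8")
messages = report_unexpected(raw_headers, expected)

# utf-8-sig strips the BOM, so every header matches.
clean_headers = read_headers(sample)

print(messages)
print(clean_headers)
```

When reading from a path instead of bytes, the same codec can be passed to open(), for example open(path, newline="", encoding="utf-8-sig").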
