WEMC Tech Blog 5.5: Metadata Generation for CSV Files
For any data produced with the intention of being downloaded and used by others, it is important to include information about the dataset. For example, details of the data's origins should be provided, such as who produced the dataset, who can be contacted about it, and when and where it was produced. In addition, properties of the dataset itself, such as variable names and units of measurement, help the end user understand the data.
For our current project, C3S Energy (part of the C3S operational services), we are producing CSV files as outputs from computations on climate and energy data. The raw data originates from a variety of sources and covers a range of units and time scales, with resolution varying between datasets. It is therefore critical to produce accurate metadata to represent these differences in our CSV files. There is no ‘standard’ format for including metadata in CSV files, so we followed the common approach (in climate science) of displaying the metadata in the first column of the CSV file, with one row per line of information. This approach makes it very easy for non-technical users to open and understand the CSV data in Excel. Similarly, users taking a more programmatic approach can simply skip these lines when opening the file with languages like Python and R.
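As a rough illustration of this layout, here is a minimal sketch using only Python's standard library. The metadata lines, column names and values are made up for the example and are not taken from the C3S Energy files.

```python
import csv
import io

# Hypothetical metadata lines and data, for illustration only.
metadata_lines = [
    "Title: Air Temperature",
    "Units: K",
    "Contact: someone@example.org",
]
header = ["date", "value"]
rows = [["2019-01-01", "275.3"], ["2019-01-02", "276.1"]]

# Write the metadata first, one line per row in the first column,
# then the column header and the data rows.
buffer = io.StringIO()
writer = csv.writer(buffer)
for line in metadata_lines:
    writer.writerow([line])
writer.writerow(header)
writer.writerows(rows)

# Read the file back, skipping the metadata rows to reach the data table.
buffer.seek(0)
reader = csv.reader(buffer)
for _ in range(len(metadata_lines)):
    next(reader)
data_header = next(reader)
data_rows = list(reader)
```

With pandas, the same skip can be done by passing `skiprows` to `pandas.read_csv`, provided the number of metadata rows is known.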
This blog serves as a ‘follow on’ from Tech Blog #5 and outlines how to construct a Python function that references the same JSON lookup table.
First, define the function and split the filename by its delimiters into a list of items:
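A minimal sketch of this step, assuming the fields in the filename are separated by underscores. The function name `split_filename` and the example filename are hypothetical.

```python
import os


def split_filename(filename):
    """Return the filename (without directory or extension) as a list of items."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    # Assumption: underscore is the delimiter; empty items are dropped.
    return [item for item in stem.split("_") if item]


# Hypothetical filename, for illustration only.
items = split_filename("/data/H_ERA5_TA_Euro_01h.csv")
print(items)
```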
We are only dealing with CSV and NetCDF files in C3S Energy, so detect which one we have and save it as a variable:
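One way to sketch this check is by file extension. This assumes NetCDF files use the conventional `.nc` extension; the function name `detect_file_type` is hypothetical.

```python
import os


def detect_file_type(filename):
    """Return 'CSV' or 'NetCDF' based on the file extension."""
    extension = os.path.splitext(filename)[1].lower()
    if extension == ".csv":
        return "CSV"
    if extension == ".nc":  # assumption: NetCDF files end in .nc
        return "NetCDF"
    raise ValueError(f"Unsupported file type: {extension!r}")


file_type = detect_file_type("H_ERA5_TA_Euro_01h.csv")
```

Raising an error for anything else keeps an unexpected file from silently receiving the wrong metadata.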
Loop through the list of filename items and the JSON file in parallel:
If items in the list match entries in the JSON, add them as variables for use in the metadata:
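The loop and the match can be sketched together as below. The JSON structure shown (categories mapping short codes to readable names) is an assumption made for illustration and may differ from the lookup table used in Tech Blog #5.

```python
import json

# Hypothetical lookup table: category -> {code in filename: readable name}.
LOOKUP_JSON = """
{
  "generation": {"H": "Historical", "P": "Projection"},
  "source": {"ERA5": "ERA5 reanalysis"},
  "variable": {"TA": "Air Temperature", "WS": "Wind Speed"},
  "domain": {"Euro": "Europe"},
  "resolution": {"01h": "Hourly", "01d": "Daily"}
}
"""
lookup = json.loads(LOOKUP_JSON)

# Items as produced by splitting a hypothetical filename on underscores.
items = ["H", "ERA5", "TA", "Euro", "01h"]

# For each filename item, check every category in the lookup table
# and keep the readable name when the code is found.
matched = {}
for item in items:
    for category, codes in lookup.items():
        if item in codes:
            matched[category] = codes[item]

print(matched)
```

This simple version assumes each code appears in only one category. If the same code could appear in several, matching by position in the filename would be safer.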
Check the title and add the appropriate unit of measurement as a variable:
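A small sketch of the unit lookup, assuming the title is the readable variable name. The titles and units in the mapping are placeholders, not the values used in C3S Energy.

```python
# Hypothetical mapping from title to unit of measurement.
UNITS_BY_TITLE = {
    "Air Temperature": "K",
    "Wind Speed": "m s-1",
    "Precipitation": "m",
}


def unit_for_title(title):
    """Return the unit for a title, or 'unknown' if the title is not listed."""
    return UNITS_BY_TITLE.get(title, "unknown")


units = unit_for_title("Air Temperature")
```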
Create a variable with metadata included:
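As a sketch, the metadata can be held in a single string with one `key: value` entry per line, ready to be written at the top of the CSV file. The field names and values here are hypothetical, and the `key: value` layout is one possible choice rather than a fixed convention.

```python
def build_metadata(fields):
    """Join metadata fields into a block of text, one 'key: value' per line."""
    lines = [f"{key}: {value}" for key, value in fields.items()]
    return "\n".join(lines) + "\n"


# Hypothetical fields, for illustration only.
metadata = build_metadata(
    {
        "title": "Air Temperature",
        "units": "K",
        "domain": "Europe",
        "resolution": "Hourly",
    }
)
print(metadata)
```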
Here is the function in full:
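The steps can be combined into one function along the following lines. This is a sketch under the same assumptions as before: underscore-delimited filenames, a `.nc` extension for NetCDF, and a hypothetical lookup table and unit mapping passed in as dictionaries. The name `generate_metadata` is also made up for the example.

```python
import os

# Hypothetical lookup table and unit mapping, for illustration only.
LOOKUP = {
    "generation": {"H": "Historical", "P": "Projection"},
    "source": {"ERA5": "ERA5 reanalysis"},
    "variable": {"TA": "Air Temperature", "WS": "Wind Speed"},
    "domain": {"Euro": "Europe"},
    "resolution": {"01h": "Hourly", "01d": "Daily"},
}
UNITS = {"Air Temperature": "K", "Wind Speed": "m s-1"}
FILE_TYPES = {".csv": "CSV", ".nc": "NetCDF"}


def generate_metadata(filename, lookup=LOOKUP, units=UNITS):
    """Build a metadata string from the codes found in a filename."""
    # Split the filename into its items.
    stem, extension = os.path.splitext(os.path.basename(filename))
    items = stem.split("_")

    # Detect the file type from the extension.
    extension = extension.lower()
    if extension not in FILE_TYPES:
        raise ValueError(f"Unsupported file type: {extension!r}")
    fields = {"file type": FILE_TYPES[extension]}

    # Match each filename item against the lookup table.
    for item in items:
        for category, codes in lookup.items():
            if item in codes:
                fields[category] = codes[item]

    # Add the unit that belongs to the variable title.
    fields["units"] = units.get(fields.get("variable"), "unknown")

    # Return one 'key: value' entry per line.
    return "\n".join(f"{key}: {value}" for key, value in fields.items()) + "\n"


print(generate_metadata("H_ERA5_TA_Euro_01h.csv"))
```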
This code can also be found on my GitHub page.
Luke Sanger – WEMC Data Engineer, June 2019