WEMC Tech Blog 5.5: Metadata Generation for CSV Files

For any data produced with the intention of being downloaded and used by other users, it important to include information on the dataset. For example, details on the data origins should be provided, such as who produced the dataset, who can be contacted about the data and when/where it was produced. In addition, properties on the dataset itself, such as variable names and units of measurement all help the end user in comprehending the data.

For our current project C3S Energy (part of the C3S operational services), we are producing CSV files as various outputs from computations on climate and energy data. The raw data originates from a variety of sources, contains a range of units and time scales and with varying resolution between datasets. It is therefore critical to produce accurate metadata to represent these differences in our CSV files. There is no ‘standard’ format for including metadata in CSV files, so we followed the common approach (in climate science) of displaying the metadata in the first column of the CSV file, with one row per line of information. This approach makes it very easy for non-technical users to open and understand the CSV data in Excel, similarly users utilising a more programmatic approach can simply skip these lines when opening the file with languages like Python and R.

This blog serves as a ‘follow on’ from Tech Blog #5 and outlines constructing a python function that references the same JSON lookup table.

First, define the function and split the filename by it’s delimiters into a list of items:

def mdata(fname):
    flist = fname.split('_')
    mdate = now.strftime("%Y-%m-%d")

We are only dealing with CSV and NetCDF files in C3S Energy, detect which one and save as variable:

    if '.nc' in fname:
        mtype = 'NetCDF'
    if '.csv' in fname:
        mtype = 'CSV'

Loop through filename items list of and JSON file in parallel:

for i, word in enumerate(flist):
        for el_id, el_info in c3s.items():
            for key in el_info:

If items in list match JSON, add those variables for use in metadata:

 if key in flist[i] and el_id == 'variable':
                    mtitle = el_info[key]
                if key in flist[i] and el_id == 'generation':
                    mabstract = el_info[key]
                if el_info['pos'] == 8 and i == 8 and len(key) == el_info['length']                and int(word[1:5]) > 1950 and int(word[1:5]) < 3000:
                    msedit = word.lstrip("S")
                    msdate = msedit[0:4] + '-' + msedit[4:6] + '-' + msedit[6:8] + '-' + msedit[8:]
                if el_info['pos'] == 9 and i == 9 and len(key) == el_info['length']                and int(word[1:5]) > 1950 and int(word[1:5]) < 3000:
                    meedit = word.lstrip("E")
                    medate = meedit[0:4] + '-' + meedit[4:6] + '-' + meedit[6:8] + '-' + meedit[8:]
                if key in flist[i] and el_id == 'temporal_resolution':
                    mtres = el_info[key]
                if key in flist[i] and el_id == 'spacial_resolution':
                    msres = el_info[key]

Check title and add appropriate unit measurement as variable:

                    if mtitle == 'Air Temperature':
                        munit = 'K'
                    elif mtitle == 'Total Precipitation':
                        munit = 'm'
                    elif mtitle == 'Global Horizontal Irradiance':
                        munit = 'J m-2'
                    elif mtitle == 'Mean Sea Level Pressure':
                        munit = 'Pa'
                    elif mtitle == 'Wind Speed':
                        munit = 'm s-1'
                    elif mtitle == 'Evaporation':
                        munit = 'm of water equivalent'
                    elif mtitle == 'Snow Depth':
                        munit = 'm of water equivalent'
                    elif mtitle == 'Electricity Demand':
                        munit = 'W'
                    elif mtitle == 'Hydropower (Reservoir)':
                        munit = 'W'
                    elif mtitle == 'Hydropower (Run Of River)':
                        munit = 'W'
                    elif mtitle == 'Wind Power Onshore':
                        munit = 'W'
                    elif mtitle == 'Wind Power Offshore':
                        munit = 'W'
                    elif mtitle == 'Wind':
                        munit = 'W'
                    elif mtitle == 'Solar PV Power':
                        munit = 'W'

Create a variable with metadata included:

 metadata = '# General' + '\n## Title' + '\n### ' + mtitle + '\n## Abstract' + '\n### ' + mabstract    + '\n## Date' + '\n### ' + mdate + '\n## Date type' + '\n### Publication: Date identifies when the data was issued'    + '\n## Unit' + '\n### ' + munit + '\n## URL' + '\n### https://cds.climate.copernicus.eu/' + '\n## Data format'    + '\n### ' + mtype + '\n## Keywords' + '\n### ERA5, reanalysis, Copernicus, C3S, C3S Energy, WEMC'    + '\n## Point of contact' + '\n### Individual name' + '\n#### Joe Bloggs' + '\n### Electronic mail address'    + '\n#### info@wemcouncil.org' + '\n### Organisation name' + '\n#### World Energy & Meteorology Council' + '\n### Role'    + '\n#### Owner: Party that owns the resource' + '\n# Usage' + '\n## Access constraints'    + '\n### Intellectual property rights: The IP of these data belongs to the EU Copernicus programme' + '\n## Use constraints'    + '\n### Creative Commons' + '\n## Citation(s)' + '\n### NA' + '\n## Temporal extent' + '\n## Begin date' + '\n### ' + msdate    + '\n## End date' + '\n### ' + medate + '\n## Temporal resolution' + '\n### ' + mtres + '\n## Geographic bounding box'    + '\n### westBoundLongitude -22.00' + '\n### eastBoundLongitude 45.00' + '\n### southBoundLatitude 27.00'    + '\n### northBoundLatitude 72.00' + '\n## Spatial resolution' + '\n### ' + msres + '\n# Lineage Statement'    + '\n## Original Data Source' + '\n## Statement'    + '\n### The original data sources are ECMWF ERA5 Reanalysis (available at: https://cds.climate.copernicus.eu)'    + '\n#'    + '\nDate'    # print(metadata)
    return(metadata)

Here is the function in full:

# function to create metadata from filename and lookup table
def mdata(fname):
    flist = fname.split('_')
    x = 0
    mdate = now.strftime("%Y-%m-%d")
    if '.nc' in fname:
        mtype = 'NetCDF'
    if '.csv' in fname:
        mtype = 'CSV'
    for i, word in enumerate(flist):
        for el_id, el_info in c3s.items():
            for key in el_info:
                if key in flist[i] and el_id == 'variable':
                    mtitle = el_info[key]
                if key in flist[i] and el_id == 'generation':
                    mabstract = el_info[key]
                if el_info['pos'] == 8 and i == 8 and len(key) == el_info['length']                and int(word[1:5]) > 1950 and int(word[1:5]) < 3000:
                    msedit = word.lstrip("S")
                    msdate = msedit[0:4] + '-' + msedit[4:6] + '-' + msedit[6:8] + '-' + msedit[8:]
                if el_info['pos'] == 9 and i == 9 and len(key) == el_info['length']                and int(word[1:5]) > 1950 and int(word[1:5]) < 3000:
                    meedit = word.lstrip("E")
                    medate = meedit[0:4] + '-' + meedit[4:6] + '-' + meedit[6:8] + '-' + meedit[8:]
                if key in flist[i] and el_id == 'temporal_resolution':
                    mtres = el_info[key]
                if key in flist[i] and el_id == 'spacial_resolution':
                    msres = el_info[key]
                    if mtitle == 'Air Temperature':
                        munit = 'K'
                    elif mtitle == 'Total Precipitation':
                        munit = 'm'
                    elif mtitle == 'Global Horizontal Irradiance':
                        munit = 'J m-2'
                    elif mtitle == 'Mean Sea Level Pressure':
                        munit = 'Pa'
                    elif mtitle == 'Wind Speed':
                        munit = 'm s-1'
                    elif mtitle == 'Evaporation':
                        munit = 'm of water equivalent'
                    elif mtitle == 'Snow Depth':
                        munit = 'm of water equivalent'
                    elif mtitle == 'Electricity Demand':
                        munit = 'W'
                    elif mtitle == 'Hydropower (Reservoir)':
                        munit = 'W'
                    elif mtitle == 'Hydropower (Run Of River)':
                        munit = 'W'
                    elif mtitle == 'Wind Power Onshore':
                        munit = 'W'
                    elif mtitle == 'Wind Power Offshore':
                        munit = 'W'
                    elif mtitle == 'Wind':
                        munit = 'W'
                    elif mtitle == 'Solar PV Power':
                        munit = 'W'
                x = x + 1
    metadata = '# General' + '\n## Title' + '\n### ' + mtitle + '\n## Abstract' + '\n### ' + mabstract    + '\n## Date' + '\n### ' + mdate + '\n## Date type' + '\n### Publication: Date identifies when the data was issued'    + '\n## Unit' + '\n### ' + munit + '\n## URL' + '\n### https://cds.climate.copernicus.eu/' + '\n## Data format'    + '\n### ' + mtype + '\n## Keywords' + '\n### ERA5, reanalysis, Copernicus, C3S, C3S Energy, WEMC'    + '\n## Point of contact' + '\n### Individual name' + '\n#### Joe Bloggs' + '\n### Electronic mail address'    + '\n#### info@wemcouncil.org' + '\n### Organisation name' + '\n#### World Energy & Meteorology Council' + '\n### Role'    + '\n#### Owner: Party that owns the resource' + '\n# Usage' + '\n## Access constraints'    + '\n### Intellectual property rights: The IP of these data belongs to the EU Copernicus programme' + '\n## Use constraints'    + '\n### Creative Commons' + '\n## Citation(s)' + '\n### NA' + '\n## Temporal extent' + '\n## Begin date' + '\n### ' + msdate    + '\n## End date' + '\n### ' + medate + '\n## Temporal resolution' + '\n### ' + mtres + '\n## Geographic bounding box'    + '\n### westBoundLongitude -22.00' + '\n### eastBoundLongitude 45.00' + '\n### southBoundLatitude 27.00'    + '\n### northBoundLatitude 72.00' + '\n## Spatial resolution' + '\n### ' + msres + '\n# Lineage Statement'    + '\n## Original Data Source' + '\n## Statement'    + '\n### The original data sources are ECMWF ERA5 Reanalysis (available at: https://cds.climate.copernicus.eu)'    + '\n#'    + '\nDate'    # print(metadata)
    return(metadata)

This code can also be found on my github page

Luke Sanger – WEMC Data Engineer, June 2019

Recent Posts

Start typing and press Enter to search