WEMC Tech Blog 5: Filename Verification with Python and JSON

A robust operational system that relies on many datasets needs consistent file naming: the filename itself becomes an important factor in identifying and describing the data contained within.

In the case of climate science, this could involve descriptors such as data origin, bias adjustment, variable type, start/end dates, accumulated or instantaneous measurements, grid resolution and so on.

This example uses a JSON (JavaScript Object Notation) file as a lookup table. JSON files are extremely lightweight and can be imported easily by most modern programming languages. In Python, a JSON object is loaded as a dictionary data type, which maps keys to values. In this example it will technically be a ‘nested’ dictionary, as it has multiple levels.
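To illustrate the nesting, here is a minimal sketch that parses a made-up two-level JSON string (not the project table) with the standard json module:

```python
import json

# a made-up two-level JSON string, parsed with the standard json module
text = '{"category": {"pos": 0, "length": 1, "H": "historical"}}'
table = json.loads(text)

# the result is a plain dictionary, with a second dictionary nested inside
print(type(table).__name__)    # dict
print(table["category"]["H"])  # historical
```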

The JSON lookup table shown below contains all the allowed elements of the project’s filename structure, including the associated ‘long names’ for each item (more on this later). It also records each element’s position in the filename and its character length.

{
	"category" : {
		"pos" : 0,
		"length" : 1,
		"H" : "historical",
		"S" : "seasonal",
		"P" : "projection"
	},
	"generation" : {
		"pos" : 1,
		"length" : 4,
		"ERA5" : "ERA5",
		"SY05" : "System 5",
		"CMI5" : "CMIP5",
		"EUCX" : "Euro Cordex"
	},
	"originator" : {
		"pos" : 2,
		"length" : 4,
		"ECMW" : "ECMWF",
		"MTFR" : "Meteo-France",
		"METO" : "Met Office",
		"DWD-" : "DWD",
		"CMCC" : "CMCC",
		"GFDL" : "GFDL",
		"GFC2" : "GFC2"			
	},
	"model" : {
		"pos" : 3,
		"length" : 4,
		"T639" : "TL 639",
		"CM20" : "CM2.0"
	},
	"variable" : {
		"pos" : 4,
		"length" : 3,
		"TA-" : "Air Temperature",
		"TP-" : "Total Precipitation",
		"GHI" : "Global Horizontal Irradiance", 
		"MSL" : "Mean Sea Level Pressure",
		"WS-" : "Wind Speed",
		"E--" : "Evaporation",
		"SD-" : "Snow Depth",
		"DEM" : "Electricity Demand",
		"HRE" : "Hydropower (Reservoir)",
		"HRO" : "Hydropower (Run Of River)",
		"WON" : "Wind Power Onshore",
		"WOF" : "Wind Power Offshore",
		"WIN" : "Wind",
		"SPV" : "Solar PV Power"
	},
	"level" : {
		"pos" : 5,
		"length" : 5,
		"NA---" : "N/A", 
		"0000m" : "0m",
		"0002m" : "2m",
		"0010m" : "10m",
		"0100m" : "100m",
		"1e3hP" : "1000 hPa",
		"850hP" : "850 hPa"
	},
	"region" : {
		"pos" : 6,
		"length" : 4,
		"Euro" : "Europe"
	},
	"spacial_resolution" : {
		"pos" : 7,
		"length" : 4,
		"025d" : "0.25 deg",
		"nut0" : "NUTS0",
		"nut2" : "NUTS2"
	},
	"start_date" : {
		"pos" : 8,
		"length" : 13,
		"SYYYYMMDDhhmm" : "SYYYYMMDDhhmm"
	},
	"end_date" : {
		"pos" : 9,
		"length" : 13,
		"EYYYYMMDDhhmm" : "EYYYYMMDDhhmm"
	},
	"type" : {
		"pos" : 10,
		"length" : 3,
		"ACC" : "Accumulated",
		"INS" : "Instantaneous",
		"PWR" : "Power",
		"NRG" : "Energy",
		"CFR" : "Capacity factor"
	},
	"view" : {
		"pos" : 11,
		"length" : 3,
		"MAP" : "Map",
		"TIM" : "Time series"
	},
	"temporal_resolution" : {
		"pos" : 12,
		"length" : 3,
		"01h" : "1 hour", 
		"03h" : "3 hours",
		"06h" : "6 hours",
		"01d" : "1 day", 
		"01m" : "1 month", 
		"03m" : "3 month",
		"12m" : "1 year",
		"30y" : "30 years"
	},
	"lead_time" : {
		"pos" : 13,
		"length" : 3,
		"NA-" : "N/A",
		"03m" : "3 Months"
	},
	"bias_adjustment" : {
		"pos" : 14,
		"length" : 3,
		"noc" : "No correction",
		"mbc" : "Mean bias cor",
		"nbc" : "Normal distr adjustment",
		"vbc" : "Variance corrected",
		"msd" : "Mean and std corrected",
		"std" : "Standardized (zero mean and stdev)",
		"wbc" : "Based on Weibull distr.",
		"gbc" : "Based on gamma distr.",
		"qbc" : "Based on quantile distr.n",
		"cdf" : "Cumulative distr. fn"
	},
	"statistics" : {
		"pos" : 15,
		"length" : 3,
		"org" : "Original",
		"avg" : "Mean", 
		"med" : "Median", 
		"min" : "Minimum", 
		"max" : "Maximum",
		"and" : "Anomaly difference", 
		"anr" : "Anomaly ratio",
		"33u" : "Upper tercile",
		"33m" : "Lower tercile",
		"20u" : "Upper quintile",
		"20l" : "Lower quintile",
		"qxx" : "Percentile",
		"bss" : "Brier skill score",
		"rss" : "Roc skill score"
	},
	"ensemble_number" : {
		"pos" : 16,
		"length" : 2,
		"NA" : "N/A", 
		"01" : "1"
	},
	"emission_scenario" : {
		"pos" : 17,
		"length" : 5,
		"NA---" : "N/A",
		"RCP45" : "RCP4.5", 
		"RCP85" : "RCP8.5"
	},
	"energy_scenario" : {
		"pos" : 18,
		"length" : 5,
		"NA---" : "N/A",
		"EHBaM" : "E-H 2050 Big and Market",
		"SHLSR" : "E-H 2050 Large Scale Res",
		"EHFOF" : "E-H 2050 Fossil Fuels",
		"EHRES" : "E-H 2050 100% RES",
		"EHSaL" : "E-H 2050 Small and Local",
		"REF16" : "EU energy ref. scen. 2016"
	},
	"transfer_function" : {
		"pos" : 19,
		"length" : 5,
		"NA---" : "N/A",
		"StGAM" : "Statistical model/GAM",
		"StMLR" : "Statistical model/MLR: StMLR",
		"StSVR" : "Statistical model/SVR: StSVR",
		"StRnF" : "Statistical model/Random Forests",
		"GamWT" : "GAM With Trend",
		"GamAN" : "GAM Anomaly",
		"GamNT" : "GAM No Trend",
		"PhM01" : "Physical Model/method1"
	}
}	
JSON Lookup Table
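To show how the long names pay off, here is a small sketch that parses a one-element excerpt of the table above (inlined as a string purely for illustration) and translates a short code into its long name:

```python
import json

# a one-element excerpt of the lookup table, inlined for illustration
excerpt = json.loads('''{
    "variable": {"pos": 4, "length": 3, "GHI": "Global Horizontal Irradiance"}
}''')

# decoding a short code is a two-level lookup
print(excerpt["variable"]["GHI"])  # Global Horizontal Irradiance

# the element's expected position and width are stored alongside the codes
print(excerpt["variable"]["pos"], excerpt["variable"]["length"])  # 4 3
```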

As you can see, each top-level key corresponds to a filename element and holds a nested dictionary of its allowed codes. Here is a test example filename relevant to this project:

H_ERA5_ECMW_T639_GHI_0000m_Euro_025d_S200001010000_E200001012300_ACC_MAP_01h_NA-_noc_org_NA_NA---_NA---_NA---.nc

Referencing the JSON table, it is possible to identify by eye that the file contains historical ERA5 Global Horizontal Irradiance data originating from ECMWF, amongst other things. That works for one file, but what if you need to check hundreds of similar files automatically? This is where Python can be used to create some functions that read the JSON table and check the integrity of the filename string.

 

Import packages:

# import json library
import json
# import collections - for ordered dictionary
import collections
# import datetime for metadata
import datetime
now = datetime.datetime.now()

Set some font colours for printing to terminal:

# set some colours for printing to terminal
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

Load JSON file as a dictionary:

# load c3s_energy lookup table (json) as a dictionary
with open('/your/file/path/lookup_table.json') as c3s_json:    
    c3s = json.load(c3s_json)
    c3s2 = collections.OrderedDict(c3s)
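Note that OrderedDict(c3s) simply preserves the order in which the elements appear in the JSON file, which here happens to match their ‘pos’ values. A slightly more defensive variant (a suggestion, not part of the original script) sorts explicitly by ‘pos’, so the file order no longer matters:

```python
import collections

# a two-element excerpt standing in for the full table, deliberately out of order
c3s = {
    "generation": {"pos": 1, "length": 4},
    "category": {"pos": 0, "length": 1},
}

# order the filename elements by their declared position in the filename
c3s2 = collections.OrderedDict(
    sorted(c3s.items(), key=lambda item: item[1]["pos"]))

print(list(c3s2.keys()))  # ['category', 'generation']
```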

 

A simple function to print the file name structure in the correct order, for reference purposes:

# function to print c3s_energy filename structure
def print_structure():
    s = ""
    for k in c3s2.keys():
        s += "<" + color.BOLD + k + color.END + ">_"
    print(s[:-1] + ".nc")

Calling this function will output the following:

print_structure()

<category>_<generation>_<originator>_<model>_<variable>_<level>_<region>_<spacial_resolution>_<start_date>_<end_date>_<type>_<view>_<temporal_resolution>_<lead_time>_<bias_adjustment>_<statistics>_<ensemble_number>_<emission_scenario>_<energy_scenario>_<transfer_function>.nc

 

Another simple function prints all the possible filename elements, for reference:

def print_elements():
    for el_id, el_info in c3s.items():
        print(color.CYAN + el_id + color.END + ':')
        for key in el_info:
            print(key + ':', el_info[key])

Calling this function will output the following:

print_elements()

category:
pos: 0
length: 1
H: historical
S: seasonal
P: projection
generation:
pos: 1
length: 4
ERA5: ERA5
SY05: System 5
CMI5: CMIP5
EUCX: Euro Cordex
originator:
pos: 2
length: 4
ECMW: ECMWF
MTFR: Meteo-France
METO: Met Office
DWD-: DWD
CMCC: CMCC
GFDL: GFDL
GFC2: GFC2
model:
pos: 3
length: 4
T639: TL 639
CM20: CM2.0
variable:
pos: 4
length: 3
TA-: Air Temperature
TP-: Total Precipitation
GHI: Global Horizontal Irradiance
MSL: Mean Sea Level Pressure
WS-: Wind Speed
E--: Evaporation
SD-: Snow Depth
DEM: Electricity Demand
HRE: Hydropower (Reservoir)
HRO: Hydropower (Run Of River)
WON: Wind Power Onshore
WOF: Wind Power Offshore
WIN: Wind
SPV: Solar PV Power
level:
pos: 5
length: 5
NA---: N/A
0000m: 0m
0002m: 2m
0010m: 10m
0100m: 100m
1e3hP: 1000 hPa
850hP: 850 hPa
region:
pos: 6
length: 4
Euro: Europe
spacial_resolution:
pos: 7
length: 4
025d: 0.25 deg
nut0: NUTS0
nut2: NUTS2
start_date:
pos: 8
length: 13
SYYYYMMDDhhmm: SYYYYMMDDhhmm
end_date:
pos: 9
length: 13
EYYYYMMDDhhmm: EYYYYMMDDhhmm
type:
pos: 10
length: 3
ACC: Accumulated
INS: Instantaneous
PWR: Power
NRG: Energy
CFR: Capacity factor
view:
pos: 11
length: 3
MAP: Map
TIM: Time series
temporal_resolution:
pos: 12
length: 3
01h: 1 hour
03h: 3 hours
06h: 6 hours
01d: 1 day
01m: 1 month
03m: 3 month
12m: 1 year
30y: 30 years
lead_time:
pos: 13
length: 3
NA-: N/A
03m: 3 Months
bias_adjustment:
pos: 14
length: 3
noc: No correction
mbc: Mean bias cor
nbc: Normal distr adjustment
vbc: Variance corrected
msd: Mean and std corrected
std: Standardized (zero mean and stdev)
wbc: Based on Weibull distr.
gbc: Based on gamma distr.
qbc: Based on quantile distr.n
cdf: Cumulative distr. fn
statistics:
pos: 15
length: 3
org: Original
avg: Mean
med: Median
min: Minimum
max: Maximum
and: Anomaly difference
anr: Anomaly ratio
33u: Upper tercile
33m: Lower tercile
20u: Upper quintile
20l: Lower quintile
qxx: Percentile
bss: Brier skill score
rss: Roc skill score
ensemble_number:
pos: 16
length: 2
NA: N/A
01: 1
emission_scenario:
pos: 17
length: 5
NA---: N/A
RCP45: RCP4.5
RCP85: RCP8.5
energy_scenario:
pos: 18
length: 5
NA---: N/A
EHBaM: E-H 2050 Big and Market
SHLSR: E-H 2050 Large Scale Res
EHFOF: E-H 2050 Fossil Fuels
EHRES: E-H 2050 100% RES
EHSaL: E-H 2050 Small and Local
REF16: EU energy ref. scen. 2016
transfer_function:
pos: 19
length: 5
NA---: N/A
StGAM: Statistical model/GAM
StMLR: Statistical model/MLR: StMLR
StSVR: Statistical model/SVR: StSVR
StRnF: Statistical model/Random Forests
GamWT: GAM With Trend
GamAN: GAM Anomaly
GamNT: GAM No Trend
PhM01: Physical Model/method1

 

Finally, this function checks the filename string against the JSON table and reports whether the correct number of elements is present. It takes one argument: the filename as a string (fname).

# function to check a filename against the JSON lookup table
def check_filename(fname):
    flist = fname.split('_')
    x = 0
    for i, word in enumerate(flist):
        for el_id, el_info in c3s.items():
            for key in el_info:
                if key in ('pos', 'length'):
                    continue  # skip the metadata entries, they are not codes
                # an element matches if its code, length and position all agree;
                # the start/end dates (pos 8 and 9) are checked for a plausible year
                if ((key in flist[i] and len(key) == el_info['length'] and el_info['pos'] == i)
                        or (el_info['pos'] == 8 and i == 8 and len(key) == el_info['length']
                            and 1950 < int(word[1:5]) < 3000)
                        or (el_info['pos'] == 9 and i == 9 and len(key) == el_info['length']
                            and 1950 < int(word[1:5]) < 3000)):
                    print(key + " " + u'\u2713')
                    x = x + 1
    if x == 20:
        print("There are " + str(x) + " of 20 required elements in the filename")
    else:
        print("There are " + color.RED + str(x) + color.END + " of 20 required elements in the filename")
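The date branches above only sanity-check the year. If stricter validation is wanted, one option (a sketch, not part of the original script) is to parse the full timestamp with datetime.strptime, which rejects impossible months, days and times:

```python
import datetime

def valid_timestamp(element):
    """Check an S/EYYYYMMDDhhmm filename element, e.g. 'S200001010000'."""
    if len(element) != 13 or element[0] not in ('S', 'E'):
        return False
    try:
        # strptime raises ValueError for out-of-range date/time fields
        datetime.datetime.strptime(element[1:], '%Y%m%d%H%M')
        return True
    except ValueError:
        return False

print(valid_timestamp('S200001010000'))  # True
print(valid_timestamp('E200013010000'))  # False - month 13 does not exist
```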

Taking the example filename as a string and passing it to the function gives the following output:

fname = 'H_ERA5_ECMW_T639_GHI_0000m_Euro_025d_S200001010000_E200001012300_ACC_MAP_01h_NA-_noc_org_NA_NA---_NA---_NA---.nc'
check_filename(fname)

H ✓
ERA5 ✓
ECMW ✓
T639 ✓
GHI ✓
0000m ✓
Euro ✓
025d ✓
SYYYYMMDDhhmm ✓
EYYYYMMDDhhmm ✓
ACC ✓
MAP ✓
01h ✓
NA- ✓
noc ✓
org ✓
NA ✓
NA--- ✓
NA--- ✓
NA--- ✓
There are 20 of 20 required elements in the filename 

The motive behind creating these functions is that they can then be called from another Python script running on any system, be that a local machine, virtual machine, server, HPC etc.

This can be achieved simply by adding the following to the top of your script (assuming the functions are saved in a Python file called ‘filename_utilities.py’):

# import c3s filename utilities
import os
import sys
sys.path.append(os.path.abspath("/your/path/here/"))
from filename_utilities import print_structure, print_elements, check_filename
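With the functions importable, checking a whole directory of files becomes a short loop. A sketch, assuming the data sits in a single (hypothetical) directory; the checker is passed in as an argument so the helper stays independent of the module above:

```python
import glob
import os

def check_directory(pattern, checker):
    """Apply a filename checker (e.g. check_filename) to every matching file."""
    names = [os.path.basename(p) for p in sorted(glob.glob(pattern))]
    for name in names:
        checker(name)
    return names

# usage: check_directory('/your/data/path/*.nc', check_filename)
```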

 

So what’s the use of having all the long names in the JSON? They come in handy when data needs to be presented in a human-readable format, such as when writing metadata and comments. The next blog will focus on parsing the JSON file to automatically generate metadata as comments in an output CSV file.

by Luke Sanger (WEMC Data Engineer, 2019)
