Hello HYSPLIT developers,
I've made some changes to the `conc2cdf` program which converts the HYSPLIT `cdump` concentration output to the NetCDF format, and I figured I'd share them with you in the case that you'd be interested in incorporating them upstream. The patches are licensed under the MIT license.
Summary of patches
The patch series is divided into three separate patches meant to be applied in order. The forums have prevented me from uploading the actual files (HTTP errors and file extension issues), and I'm getting internal server errors (post too large?) when just trying to insert the code snippet into this post, so I've linked to them instead. They're stored on a password-protected paste site which will expire January 24, and the password for all of them is `HYSPLIT_conc2cdf`. I can find an alternative location for them as needed, but these forums do not seem to be working well for uploading the files or displaying their code. The overview of changes made by the patches are:
- `conc2cdf.f-0-memory-efficiency.patch`: Improves memory efficiency by reading and outputing 3D concentration grid for each time step and species iteratively. Previously, the entire `cdump` file was loaded into memory and written, which for larger simulations could quickly become unwieldy. There are several other fixes as well, including an issue reading more than 3 species, detailed in the patch file itself and Patches section 1.
- `conc2cdf.f-1-deflate.patch`: Adds the option to use NetCDF deflate compression, which can have significant savings in disk space. This also changes the way that the program input arguments are parsed to be somewhat more standard. See the patch and Patches section 2 for details.
- `conc2cdf.f-2-gettime.patch`: Fixes the `gettime` subroutine which was frequently (e.g. about 50% of the time) off by one day. This issue has been reported on these forums previously. See the patch and Patches section 3 for details.
The main focus of this patch is to increase the memory efficiency of the conc2cdf program by reading the 3D concentration array `csum(nlon,nlat,nlvl)` at each time step and for each species separately, iteratively writing to the NetCDF file. The previous version would read the entire concentration data `csum(nlon,nlat,nlvl,ntim,ntyp)` and write it to the NetCDF file at the end of the program. This very easily results in unwieldy memory consumption, especially considering that HYSPLIT itself only needs to store the 3D array for each species and concentration grid at one time step. For example, with a 0.5 degree global grid with 50 height levels and 50 species, each time step requires ~2.6 GB, but with this patch each time step would require ~52 MB.
In addition to this optimization, there are several other small corrections and changes:
- `idconc` array is dynamically allocated based on `ntyp` read from the input file, similar to the `ident` array. The previous version was limited to 3 species because it was statically allocated.
- While reading and outputing the concentration array, the start and stop times of the sampling interval are also read. Previously, the sampling times and concentrations were read separately and required rewinding the input file. This also allowed for the removal of the `stime` and `etime` variables and the multiplication and division necessary to store and extract the individual date-time fields read from the input file. Additionally, the actual minutes read from the input file are used rather than discarded and assumed zero. The `century` variable is considered a parameter. This is a limitation of the data stored in the input file since it stores only the lower two digits of the year.
- longitude variables are always ordered before latitude variables for consistency. Previously, there was no consistency in the order of these variables. Similarly, the heights and times follow the longitude and latitude variables.
- NetCDF function error contexts are reported consistently.
- The NetCDF file is opened as a NetCDF4 format rather than the classic format. This removes some limitations on the size of variables and files.
- The Z dimension and variable 'levels' was renamed to the more standard 'height' (CF Convention[1]).
- The T dimension, 'time', is made a record dimension with unlimited size. This allows for easy addition of subsequent time records to the dataset.
- Some memory leaks of allocated time arrays were fixed.
The purpose of this patch is to allow for compression of the NetCDF variables. It allows the user to specify the deflate level (0-9) used by the NetCDF4/HDF5 zlib compression. To facilitate this additional option, the format of the argument parsing and `conc2cdf` usage was changed to a more standard format. In particular:
- Previously, the user was require to specify at least one argument, despite having default values for all of the "optional" arguments. This completely defeated the purpose of having optional arguments and makes it unclear what arguments are actually required. Instead, I have made it required to specify both the input and output file arguments and made the deflate argument optional.
- The optional arguments were previously parsed by ignoring the first two characters of the argument (the option specifier) and treating the option value as the remaining characters. Instead, option values are treated as individual arguments following the option specifier argument.
- An explicit help option `-h` was added to show the usage and exit.
- A `print_usage` subroutine was added since it is called in multiple places.
This patch fixes the `gettime` subroutine. Previously, there were errors in the calculation of days since 1970-1-1. It uses an algorithm by Howard Hinnant[2,3] which is licensed as MIT, so there should be no issues with including it as adapted for Fortran.
To verify that this new implementation is working as expected, we can write a test program in Python which uses the `numpy` and `datetime` libraries for comparison. First we write both implementations of the Fortran `gettime` subroutines:
Code: Select all
$ cat > test_gettime.f90 <<EOF
SUBROUTINE gettime1(Icentury,Iyear,Imonth,Iday,Ihour,Iminute,IC)
!Subroutine to compute days since 1970-1-1 0:0:0 from date and time in file
Implicit None
Integer, Intent(In):: Iday,Imonth,Iyear,Icentury,Ihour,Iminute
Real*8, Intent(Out):: IC
Integer:: dmonth(12) !days in each month
Integer:: k, i
Integer:: IDY, INY, IYN
dmonth(1) = 31
If (Iyear/4*4 .ne. Iyear) Then
dmonth(2) = 28
Else
If (Iyear/400*400 .eq. Iyear) Then
dmonth(2) = 29
Else If (Iyear/100*100 .eq. Iyear) Then
dmonth(2) = 28
Else
dmonth(2) = 29
End If
End If
dmonth(3) = 31
dmonth(4) = 30
dmonth(5) = 31
dmonth(6) = 30
dmonth(7) = 31
dmonth(8) = 31
dmonth(9) = 30
dmonth(10) = 31
dmonth(11) = 30
dmonth(12) = 31
k = 0
Do i = 1, Imonth
k = k + dmonth(I)
End Do
k = k - dmonth(Imonth)
IDY = k + IDay - 1
INY = Iyear + Icentury*100
IYN = INY - 1970
If (IYN .gt. 0) Then
IC = IDY + IYN*365 + (IYN-1)/4 - (IYN-1)/100 + (IYN+299)/400
Else
IC = IDY + IYN*365 + IYN/4 - IYN/100
End If
! Add hours and minutes
IC = IC + (Ihour +(Iminute)/60.)/24.
END SUBROUTINE gettime1
FUNCTION days_from_civil(Iyear, Imonth, Iday)
! Returns number of days since civil 1970-01-01. Negative values indicate
! days prior to 1970-01-01.
! Preconditions: y-m-d represents a date in the civil (Gregorian) calendar
! Adapted from algorithm by Howard Hinnant[1]
! [1] http://howardhinnant.github.io/date_algorithms.html#days_from_civil
Implicit None
Integer, Intent(In):: Iyear,Imonth,Iday
Integer:: days_from_civil
Integer:: y, era, yoe, doy, doe
IF (Imonth.le.2) THEN
y = Iyear - 1
ELSE
y = Iyear
END IF
IF (y.ge.0) THEN
era = y / 400
ELSE
era = (y-399) / 400
END IF
yoe = y - era * 400 ! [0, 399]
IF (Imonth.gt.2) THEN
doy = (153*(Imonth - 3) + 2)/5 + Iday-1 ! [0, 365]
ELSE
doy = (153*(Imonth + 9) + 2)/5 + Iday-1 ! [0, 365]
END IF
doe = yoe * 365 + yoe/4 - yoe/100 + doy ! [0, 146096]
days_from_civil = era * 146097 + doe - 719468
END FUNCTION days_from_civil
SUBROUTINE gettime2(Icentury,Iyear,Imonth,Iday,Ihour,Iminute,IC)
!Subroutine to compute days since 1970-1-1 0:0:0 from date and time in file
Implicit None
Integer, Intent(In):: Iday,Imonth,Iyear,Icentury,Ihour,Iminute
Real*8, Intent(Out):: IC
Integer:: y, days_from_civil
IF (Iyear.lt.100) THEN
y = Iyear + Icentury*100
ELSE
y = Iyear
END IF
IC = days_from_civil(y, Imonth, Iday) + (Ihour +(Iminute)/60.)/24.
END SUBROUTINE gettime2
EOF
Code: Select all
$ f2py3 -c -m test_gettime test_gettime.f90
Code: Select all
$ cat > test_gettime_runner.py <<EOF
#!/usr/bin/python3
import datetime
import numpy as np
import warnings
import test_gettime
def datetime_to_cymd(t):
# Converts a datetime to century, 2-digit year, month, day tuple
if type(t) is np.datetime64:
y = t.astype('datetime64[Y]').astype(int) + 1970
m = t.astype('datetime64[M]').astype(int) % 12 + 1
d = (t - t.astype('datetime64[M]') + 1).astype('timedelta64[D]').astype(int)
elif type(t) is datetime.date:
y = t.year
m = t.month
d = t.day
# Handle negative values like Fortran does
c = np.sign(y) * (np.abs(y) // 100)
y = np.sign(y) * (np.abs(y) % 100)
return (c, y, m, d)
def gettime(t, epoch):
# Returns the days since epoch
if type(t) is np.datetime64:
return (t-epoch).astype('timedelta64[D]').astype(int)
elif type(t) is datetime.date:
return (t-epoch).days
def run(dt_lib='numpy'):
print("".ljust(80,'-'))
print("Running test_gettime with datetime library '%s'" % dt_lib)
print("".ljust(80,'-'))
if dt_lib == 'numpy':
epoch = np.datetime64('1970-01-01', 'D')
dt = np.timedelta64(1, 'D')
start_datetime = np.datetime64('-5001-12-31', 'D')
end_datetime = np.datetime64('5000-01-01', 'D')
elif dt_lib == 'datetime':
epoch = datetime.date(1970,1,1)
dt = datetime.timedelta(days=1)
start_datetime = datetime.date.min
end_datetime = datetime.date.max-dt
t = start_datetime
while t <= end_datetime:
gt0 = gettime(t, epoch)
gt1 = test_gettime.gettime1(*datetime_to_cymd(t), 0, 0)
gt2 = test_gettime.gettime2(*datetime_to_cymd(t), 0, 0)
if gt1 != gt2:
if gt0 == gt1:
print("gettime1 is correct: t=%s gt0=%d" % (t, gt0))
elif gt0 == gt2:
print("gettime2 is correct: t=%s gt0=%d" % (t, gt0))
else:
warnings.warn("Implementations all differ: t=%s gt0=%d gt1=%.0f gt2=%.0f" % (t, gt0, gt1, gt2))
elif gt0 != gt1:
warnings.warn("Fortran implementations differ with Python implementation '%s': t=%s gt0=%d gt1=%.0f gt2=%.0f" % (dt_lib, t, gt0, gt1, gt2))
t += dt
run(dt_lib='numpy')
run(dt_lib='datetime')
EOF
*Warning: this will generate ~236MB of output.
Code: Select all
$ python3 test_gettime_runner.py &> test_gettime.log
To find the years in which there are differences between the two Fortran implementations, where `gettime2` is correct, we can use:
Code: Select all
$ grep "gettime2 is correct: t=" test_gettime.log | sed 's/.*t=\(-\?[[:digit:]]\{3,4\}\).*/\1/' | sort -n | uniq > test_gettime_diff_years
Conclusions and future work
I think it would be useful for upstream to integrate these patches (or variants thereof, as there are some opinionated choices here, and this could be the first iteration of these patches) so that HYSPLIT users can benefit from these improvements and fixes. There are some additional things in `conc2cdf` that I think could use some work. In particular:
- Additional CF-Convention[5] attributes such as `cell_methods` and `standard_name` can be applied. The `point_spacing` attribute on the longitude and latitude variables does not seem to be specified in the CF-Convention and therefore perhaps should be removed (or is this from another standard?).
- In addition to the time-bounds, it should be possible to include the 3D spatial cell bounds. This is a bit more complicated than the time bounds, however.
- It might also make more sense to let the longitudes and latitudes of the cells be at the center of the cell rather than the lower-left corner, since this is probably what someone unfamiliar with HYSPLIT might assume. At the least, it might make sense to clarify in the `long_name` attribute what part of the cell is specified by the coordinates.
- There are still some inconsistencies in coding style (i.e. capital vs lower case variables, functions, subroutines) and whitespace. This is a minor point, but still somewhat confusing.
Now I'm also curious how the developers feel about simply ditching the proprietary binary `cdump` file in favor of directly outputing the NetCDF format from HYSPLIT. This is controversial I'm sure, but it certainly would make downstream output usage easier. As far as I can tell, the breakdown is as follows:
Advantages of HYSPLIT `cdump` output versus NetCDF:
- Does not rely on external (NetCDF) library
- (Possibly, needs benchmarking) faster output in Fortran-native unformatted binary format rather than indirect calls to NetCDF library
- Proprietary binary format requiring specialized processing and knowledge of the format. In contrast, the NetCDF file is self-contained and can include any additional attributes or simulation context in an easily readable format. Furthermore, the NetCDF format has many existing tools for data manipulation.
- If the user wants to use NetCDF format, this requires additional post-processing step to convert `cdump` to NetCDF.
- With NetCDF deflate compression, output files are typically smaller than even the packed `cdump` formats. As an anecdotal example, on one HYSPLIT `cdump` file with `cpack=1` I have observed a compression ratio of 1.68 and 1.91 with deflate levels 1 and 9, respectively, corresponding to a savings of 40% and 48%, respectively.
- This would also solve the issue of an ambiguous `century` field when converting `cdump` to NetCDF since HYSPLIT internally seems to keep track of the year correctly through `MACC` but does not write the full 4-digit year in the `cdump` file since the `tm2day` subroutine only returns the 2-digit year.
References