[PATCH] conc2cdf Improvements

mkrupcale · Post by **mkrupcale** » January 17th, 2018, 12:19 pm

Introduction

Hello HYSPLIT developers,

I've made some changes to the `conc2cdf` program which converts the HYSPLIT `cdump` concentration output to the NetCDF format, and I figured I'd share them with you in the case that you'd be interested in incorporating them upstream. The patches are licensed under the MIT license.

Summary of patches

The patch series is divided into three separate patches meant to be applied in order. The forums have prevented me from uploading the actual files (HTTP errors and file extension issues), and I'm getting internal server errors (post too large?) when just trying to insert the code snippet into this post, so I've linked to them instead. They're stored on a password-protected paste site which will expire January 24, and the password for all of them is `HYSPLIT_conc2cdf`. I can find an alternative location for them as needed, but these forums do not seem to be working well for uploading the files or displaying their code. The overview of changes made by the patches are:

`conc2cdf.f-0-memory-efficiency.patch`: Improves memory efficiency by reading and outputing 3D concentration grid for each time step and species iteratively. Previously, the entire `cdump` file was loaded into memory and written, which for larger simulations could quickly become unwieldy. There are several other fixes as well, including an issue reading more than 3 species, detailed in the patch file itself and Patches section 1.
`conc2cdf.f-1-deflate.patch`: Adds the option to use NetCDF deflate compression, which can have significant savings in disk space. This also changes the way that the program input arguments are parsed to be somewhat more standard. See the patch and Patches section 2 for details.
`conc2cdf.f-2-gettime.patch`: Fixes the `gettime` subroutine which was frequently (e.g. about 50% of the time) off by one day. This issue has been reported on these forums previously. See the patch and Patches section 3 for details.

Patch 1: `conc2cdf.f-0-memory-efficiency.patch`
The main focus of this patch is to increase the memory efficiency of the conc2cdf program by reading the 3D concentration array `csum(nlon,nlat,nlvl)` at each time step and for each species separately, iteratively writing to the NetCDF file. The previous version would read the entire concentration data `csum(nlon,nlat,nlvl,ntim,ntyp)` and write it to the NetCDF file at the end of the program. This very easily results in unwieldy memory consumption, especially considering that HYSPLIT itself only needs to store the 3D array for each species and concentration grid at one time step. For example, with a 0.5 degree global grid with 50 height levels and 50 species, each time step requires ~2.6 GB, but with this patch each time step would require ~52 MB.

In addition to this optimization, there are several other small corrections and changes:

`idconc` array is dynamically allocated based on `ntyp` read from the input file, similar to the `ident` array. The previous version was limited to 3 species because it was statically allocated.
While reading and outputing the concentration array, the start and stop times of the sampling interval are also read. Previously, the sampling times and concentrations were read separately and required rewinding the input file. This also allowed for the removal of the `stime` and `etime` variables and the multiplication and division necessary to store and extract the individual date-time fields read from the input file. Additionally, the actual minutes read from the input file are used rather than discarded and assumed zero. The `century` variable is considered a parameter. This is a limitation of the data stored in the input file since it stores only the lower two digits of the year.
longitude variables are always ordered before latitude variables for consistency. Previously, there was no consistency in the order of these variables. Similarly, the heights and times follow the longitude and latitude variables.
NetCDF function error contexts are reported consistently.
The NetCDF file is opened as a NetCDF4 format rather than the classic format. This removes some limitations on the size of variables and files.
The Z dimension and variable 'levels' was renamed to the more standard 'height' (CF Convention[1]).
The T dimension, 'time', is made a record dimension with unlimited size. This allows for easy addition of subsequent time records to the dataset.
Some memory leaks of allocated time arrays were fixed.

Patch 2: `conc2cdf.f-1-deflate.patch`
The purpose of this patch is to allow for compression of the NetCDF variables. It allows the user to specify the deflate level (0-9) used by the NetCDF4/HDF5 zlib compression. To facilitate this additional option, the format of the argument parsing and `conc2cdf` usage was changed to a more standard format. In particular:

Previously, the user was require to specify at least one argument, despite having default values for all of the "optional" arguments. This completely defeated the purpose of having optional arguments and makes it unclear what arguments are actually required. Instead, I have made it required to specify both the input and output file arguments and made the deflate argument optional.
The optional arguments were previously parsed by ignoring the first two characters of the argument (the option specifier) and treating the option value as the remaining characters. Instead, option values are treated as individual arguments following the option specifier argument.
An explicit help option `-h` was added to show the usage and exit.
A `print_usage` subroutine was added since it is called in multiple places.

Patch 3: `conc2cdf.f-2-gettime.patch`
This patch fixes the `gettime` subroutine. Previously, there were errors in the calculation of days since 1970-1-1. It uses an algorithm by Howard Hinnant[2,3] which is licensed as MIT, so there should be no issues with including it as adapted for Fortran.

To verify that this new implementation is working as expected, we can write a test program in Python which uses the `numpy` and `datetime` libraries for comparison. First we write both implementations of the Fortran `gettime` subroutines:

Code: Select all

$ cat > test_gettime.f90 <<EOF
SUBROUTINE gettime1(Icentury,Iyear,Imonth,Iday,Ihour,Iminute,IC)
!Subroutine to compute days since 1970-1-1 0:0:0 from date and time in file

  Implicit None
  Integer, Intent(In)::         Iday,Imonth,Iyear,Icentury,Ihour,Iminute
  Real*8, Intent(Out)::          IC
  Integer::                     dmonth(12) !days in each month
  Integer::                     k, i
  Integer::                     IDY, INY, IYN
  dmonth(1) = 31
  If (Iyear/4*4 .ne. Iyear) Then
     dmonth(2) = 28
  Else
     If (Iyear/400*400 .eq. Iyear) Then
        dmonth(2) = 29
     Else If (Iyear/100*100 .eq. Iyear) Then
        dmonth(2) = 28
     Else
        dmonth(2) = 29
     End If
  End If
  dmonth(3) = 31
  dmonth(4) = 30
  dmonth(5) = 31
  dmonth(6) = 30
  dmonth(7) = 31
  dmonth(8) = 31
  dmonth(9) = 30
  dmonth(10) = 31
  dmonth(11) = 30
  dmonth(12) = 31
  k = 0
  Do i = 1, Imonth
     k = k + dmonth(I)
  End Do
  k = k - dmonth(Imonth)
  IDY = k + IDay - 1
  INY = Iyear + Icentury*100
  IYN = INY - 1970
  If (IYN .gt. 0) Then
     IC = IDY + IYN*365 + (IYN-1)/4 - (IYN-1)/100 + (IYN+299)/400
  Else
     IC = IDY + IYN*365 + IYN/4 - IYN/100
  End If

! Add hours and minutes
  IC = IC + (Ihour +(Iminute)/60.)/24.
  
END SUBROUTINE gettime1

FUNCTION days_from_civil(Iyear, Imonth, Iday)
! Returns number of days since civil 1970-01-01.  Negative values indicate
!   days prior to 1970-01-01.
! Preconditions:  y-m-d represents a date in the civil (Gregorian) calendar
! Adapted from algorithm by Howard Hinnant[1]
! [1] http://howardhinnant.github.io/date_algorithms.html#days_from_civil
  Implicit None
  Integer, Intent(In)::         Iyear,Imonth,Iday
  Integer::                     days_from_civil
  Integer::                     y, era, yoe, doy, doe
  IF (Imonth.le.2) THEN
     y = Iyear - 1
  ELSE
     y = Iyear
  END IF
  IF (y.ge.0) THEN
     era = y / 400
  ELSE
     era = (y-399) / 400
  END IF
  yoe = y - era * 400                          ! [0, 399]
  IF (Imonth.gt.2) THEN
     doy = (153*(Imonth - 3) + 2)/5 + Iday-1   ! [0, 365]
  ELSE
     doy = (153*(Imonth + 9) + 2)/5 + Iday-1   ! [0, 365]
  END IF
  doe = yoe * 365 + yoe/4 - yoe/100 + doy      ! [0, 146096]
  days_from_civil = era * 146097 + doe - 719468
END FUNCTION days_from_civil

SUBROUTINE gettime2(Icentury,Iyear,Imonth,Iday,Ihour,Iminute,IC)
!Subroutine to compute days since 1970-1-1 0:0:0 from date and time in file

  Implicit None
  Integer, Intent(In)::         Iday,Imonth,Iyear,Icentury,Ihour,Iminute
  Real*8, Intent(Out)::         IC
  Integer::                     y, days_from_civil
  IF (Iyear.lt.100) THEN
     y = Iyear + Icentury*100
  ELSE
     y = Iyear
  END IF
  IC = days_from_civil(y, Imonth, Iday) + (Ihour +(Iminute)/60.)/24.
END SUBROUTINE gettime2
EOF

Note that `gettime1` is the original implementation, while `gettime2` is the corrected implementation. Now create a compiled extension module for Python[4]:

Code: Select all

$ f2py3 -c -m test_gettime test_gettime.f90

Then write the Python test program:

Code: Select all

$ cat > test_gettime_runner.py <<EOF
#!/usr/bin/python3

import datetime
import numpy as np
import warnings

import test_gettime

def datetime_to_cymd(t):
    # Converts a datetime to century, 2-digit year, month, day tuple
    if type(t) is np.datetime64:
        y = t.astype('datetime64[Y]').astype(int) + 1970
        m = t.astype('datetime64[M]').astype(int) % 12 + 1
        d = (t - t.astype('datetime64[M]') + 1).astype('timedelta64[D]').astype(int)
    elif type(t) is datetime.date:
        y = t.year
        m = t.month
        d = t.day
    # Handle negative values like Fortran does
    c = np.sign(y) * (np.abs(y) // 100)
    y = np.sign(y) * (np.abs(y) % 100)
    return (c, y, m, d)

def gettime(t, epoch):
    # Returns the days since epoch
    if type(t) is np.datetime64:
        return (t-epoch).astype('timedelta64[D]').astype(int)
    elif type(t) is datetime.date:
        return (t-epoch).days

def run(dt_lib='numpy'):
    print("".ljust(80,'-'))
    print("Running test_gettime with datetime library '%s'" % dt_lib)
    print("".ljust(80,'-'))
    if dt_lib == 'numpy':
        epoch = np.datetime64('1970-01-01', 'D')
        dt = np.timedelta64(1, 'D')
        start_datetime = np.datetime64('-5001-12-31', 'D')
        end_datetime = np.datetime64('5000-01-01', 'D')
    elif dt_lib == 'datetime':
        epoch = datetime.date(1970,1,1)
        dt = datetime.timedelta(days=1)
        start_datetime = datetime.date.min
        end_datetime = datetime.date.max-dt
    t = start_datetime
    while t <= end_datetime:
        gt0 = gettime(t, epoch)
        gt1 = test_gettime.gettime1(*datetime_to_cymd(t), 0, 0)
        gt2 = test_gettime.gettime2(*datetime_to_cymd(t), 0, 0)
        if gt1 != gt2:
            if gt0 == gt1:
                print("gettime1 is correct: t=%s gt0=%d" % (t, gt0))
            elif gt0 == gt2:
                print("gettime2 is correct: t=%s gt0=%d" % (t, gt0))
            else:
                warnings.warn("Implementations all differ: t=%s gt0=%d gt1=%.0f gt2=%.0f" % (t, gt0, gt1, gt2))
        elif gt0 != gt1:
            warnings.warn("Fortran implementations differ with Python implementation '%s': t=%s gt0=%d gt1=%.0f gt2=%.0f" % (dt_lib, t, gt0, gt1, gt2))
        t += dt

run(dt_lib='numpy')
run(dt_lib='datetime')
EOF

and finally run the test*:

*Warning: this will generate ~236MB of output.

Code: Select all

$ python3 test_gettime_runner.py &> test_gettime.log

You should find that in the cases where `gettime1` and `gettime2` differ, `gettime2` is correct (i.e. it agrees with the Python implementation). When `gettime1` and `gettime2` are equal, they also agree with the Python implementation (i.e. no warnings are thrown).

To find the years in which there are differences between the two Fortran implementations, where `gettime2` is correct, we can use:

Code: Select all

$ grep "gettime2 is correct: t=" test_gettime.log | sed 's/.*t=\(-\?[[:digit:]]\{3,4\}\).*/\1/' | sort -n | uniq > test_gettime_diff_years

There are 10681 years from -5001 to 9999 in which `gettime1` is incorrect, but `gettime2` is correct. This includes all 6274 years from -5001 to 1272 and about half of the years from 1273 to 9999, including about half of the years from 1900 to 2100. This demonstrates that the implementation of `gettime1` is severely flawed.

Conclusions and future work
I think it would be useful for upstream to integrate these patches (or variants thereof, as there are some opinionated choices here, and this could be the first iteration of these patches) so that HYSPLIT users can benefit from these improvements and fixes. There are some additional things in `conc2cdf` that I think could use some work. In particular:

Additional CF-Convention[5] attributes such as `cell_methods` and `standard_name` can be applied. The `point_spacing` attribute on the longitude and latitude variables does not seem to be specified in the CF-Convention and therefore perhaps should be removed (or is this from another standard?).
In addition to the time-bounds, it should be possible to include the 3D spatial cell bounds. This is a bit more complicated than the time bounds, however.
It might also make more sense to let the longitudes and latitudes of the cells be at the center of the cell rather than the lower-left corner, since this is probably what someone unfamiliar with HYSPLIT might assume. At the least, it might make sense to clarify in the `long_name` attribute what part of the cell is specified by the coordinates.
There are still some inconsistencies in coding style (i.e. capital vs lower case variables, functions, subroutines) and whitespace. This is a minor point, but still somewhat confusing.

In any case, I think this patch set serves as a good starting point for such improvements.

Now I'm also curious how the developers feel about simply ditching the proprietary binary `cdump` file in favor of directly outputing the NetCDF format from HYSPLIT. This is controversial I'm sure, but it certainly would make downstream output usage easier. As far as I can tell, the breakdown is as follows:

Advantages of HYSPLIT `cdump` output versus NetCDF:

Does not rely on external (NetCDF) library
(Possibly, needs benchmarking) faster output in Fortran-native unformatted binary format rather than indirect calls to NetCDF library

Disadvantages of HYSPLIT `cdump` output versus NetCDF:

Proprietary binary format requiring specialized processing and knowledge of the format. In contrast, the NetCDF file is self-contained and can include any additional attributes or simulation context in an easily readable format. Furthermore, the NetCDF format has many existing tools for data manipulation.
If the user wants to use NetCDF format, this requires additional post-processing step to convert `cdump` to NetCDF.
With NetCDF deflate compression, output files are typically smaller than even the packed `cdump` formats. As an anecdotal example, on one HYSPLIT `cdump` file with `cpack=1` I have observed a compression ratio of 1.68 and 1.91 with deflate levels 1 and 9, respectively, corresponding to a savings of 40% and 48%, respectively.
This would also solve the issue of an ambiguous `century` field when converting `cdump` to NetCDF since HYSPLIT internally seems to keep track of the year correctly through `MACC` but does not write the full 4-digit year in the `cdump` file since the `tm2day` subroutine only returns the 2-digit year.

Additionally, it might also be worth considering implementing some of these date algorithms[2] in the HYSPLIT library to replace or at least verify existing time-keeping routines.

References

glenn.rolph · Post by **glenn.rolph** » January 18th, 2018, 11:48 am

Thank you very much for the extensive review and modifications to the con2cdf.f program. We will consider adding it to the repository. Could you forward a copy of the final modified code to arl.webmaster@noaa.gov? Thanks.

January 25th, 2018, 6:05 pm

Thank you! We applied your updates to the code. conc2cdf.f is replaced by con2cdf3.f and con2cdf4.f.

con2cdf3.f contains your patches that improve memory efficiency for IO and a fix to the gettime subroutine, but writes the output file as a NetCDF Version 3 file.

con2cdf4.f contains all of your patches. This program creates NetCDF Version 4 files and utilizes the deflate compression.

In addition, both files include a fix that outputs all emission source origin times and locations (previously only the last emissions source origin was output).

Thank you,

Chris

mkrupcale · Post by **mkrupcale** » February 23rd, 2018, 12:27 pm

You're quite welcome--I'm glad I could make some improvements. The output of all emission source origins is also a good addition.

I've looked through the changes (svn revision 934), and I couldn't help but notice that my name appeared in `cmaq/con2cdf3.f`, `con2cdf4.f`, and `document/updates.txt`, which is fine, but it is misspelled: my last name is written as Kupcale, but it should be Krupcale (missing the "r"). I'd appreciate if you could correct this.

Less important is that the Z axis dimension and variable are called 'levels', whereas according to CF convention it should probably be 'height'.

I'm wondering also how the developers feel about some of the other points in my first post. Particularly about letting the latitude/longitude variables be at the center of the cell, specifying spatial cell bounds, and possibly even using the NetCDF output format directly in HYSPLIT itself.

P.S. I also noticed in revision 933 (2018-02-13 12:11:37 -0500) that `library/hysplit/metinp.f` was changed, but this was recorded as being made on 13 Feb 2017 in the file itself.

Best,
Matthew

mkrupcale · Post by **mkrupcale** » June 15th, 2021, 9:45 am

I'm currently participating in the HYSPLIT 2021 Workshop using the latest version (v5.1.0), and I'd like to just reiterate the request to update my name in the source code:

mkrupcale wrote: ↑
February 23rd, 2018, 12:27 pm
I've looked through the changes (svn revision 934), and I couldn't help but notice that my name appeared in `cmaq/con2cdf3.f`, `con2cdf4.f`, and `document/updates.txt`, which is fine, but it is misspelled: my last name is written as Kupcale, but it should be Krupcale (missing the "r"). I'd appreciate if you could correct this.

I'm also curious if the developers have any thoughts on the other comments I made:

mkrupcale wrote: ↑
February 23rd, 2018, 12:27 pm
Less important is that the Z axis dimension and variable are called 'levels', whereas according to CF convention it should probably be 'height'.

I'm wondering also how the developers feel about some of the other points in my first post. Particularly about letting the latitude/longitude variables be at the center of the cell, specifying spatial cell bounds, and possibly even using the NetCDF output format directly in HYSPLIT itself.

June 15th, 2021, 10:18 am

Sorry, for misspelling your name. I will correct that in future versions of HYSPLIT. I will discuss with the HYSPLIT team to see if we want to incorporate your other suggestions. We try to keep the code backward compatible, so these changes may have negative consequences for other users. There is no plan in the near future for HYSPLIT to read or write netcdf files.

HYSPLIT Forum: hysplitbbs.arl.noaa.gov

[PATCH] conc2cdf Improvements

[PATCH] conc2cdf Improvements

Re: [PATCH] conc2cdf Improvements

Re: [PATCH] conc2cdf Improvements

Re: [PATCH] conc2cdf Improvements

Re: [PATCH] conc2cdf Improvements

Re: [PATCH] conc2cdf Improvements