Diving Deep into netCDF4

Nirmal Mathew Alex
9 min read · Jun 1, 2021
Spatial water occurrence (netCDF data visualized using folium)

In my college days, I misinterpreted netCDF4 as a shared data storage library developed specifically for applications in atmospheric science, and personally I didn’t like it (it did not seem as capable as a relational database). After some exploration in data science, I found out that UCAR (the University Corporation for Atmospheric Research) did not develop the netCDF libraries only for atmospheric research, but as an entirely new storage architecture for scientific data.

Overview

The history of netCDF goes back to the CDF library (Common Data Format) developed by NASA for storing numerical multidimensional data in a self-describing, platform-independent way. The idea behind netCDF4 is not very different from CDF or HDF5 (we will come to that later), but it is more user friendly and thus easier to use for research and advanced applications. I have seen several Unix distros like openSUSE come with the netCDF4 libraries pre-installed, which says something about the significance of the netCDF data format.

Python and netCDF4

It is not a good idea to use Python everywhere, but at the same time it is nice that Python wraps libraries from other languages (C, Java, Fortran, etc.), so it ends up on top wherever machine-level performance is of less importance. All these factors make Python a good tool for scientific data analysis.

Python libraries for netCDF4

  • SciPy (pupynere)
  • PyNIO and PyNGL
  • netCDF4-python
  • xarray
  • pycdf

The above are only Python-based libraries; there are also a lot of other tools and software packages available for netCDF4, and you can find the whole list here.

Here, I am only focused on the spatial NetCDF datasets which are mainly used for atmospheric and ocean research.

  • Creating and reading netcdf file
  • Domain operations
  • Data operations

1. Creating a netCDF file from scratch and reading it

  1. Using scipy (pupynere)

SciPy comes with a netCDF library (which is actually based on the pupynere library, written by Roberto De Almeida).

SCIPY NC creation
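Since the embedded gist is not reproduced here, below is a minimal sketch of what creating and reading such a file with scipy could look like (the file name, dimension and variable are placeholders):

# Minimal sketch: writing and reading a classic (netCDF-3) file with SciPy
from scipy.io import netcdf_file
import numpy as np

f = netcdf_file("simple.nc", "w")            # open a new file for writing
f.history = "created as a scipy example"     # global attribute
f.createDimension("time", 10)                # one dimension of length 10
time = f.createVariable("time", "i", ("time",))
time[:] = np.arange(10)                      # fill the variable
time.units = "days since 2021-01-01"         # variable attribute
f.close()

# read it back
f = netcdf_file("simple.nc", "r")
print(f.variables["time"][:])
f.close()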

2. Using xarray

The scipy netCDF support only handles netCDF-3, not netCDF-4, so there may be problems when dealing with multiple dimensions; I suggest using the xarray module for creating netCDF files.

creating netcdf using xarray
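The gist itself is not embedded here either, so this is a rough sketch of building a small dataset in memory and writing it out with xarray (the ws10 variable, coordinates and values are made up):

# Sketch: creating a netCDF file with xarray (needs netCDF4 or scipy installed as a backend)
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-01-01", periods=3)
lats = np.arange(10.0, 20.0, 1.0)
lons = np.arange(72.0, 84.0, 1.0)
data = np.random.rand(len(times), len(lats), len(lons))

ds = xr.Dataset(
    data_vars={"ws10": (("time", "lat", "lon"), data)},
    coords={"time": times, "lat": lats, "lon": lons},
    attrs={"description": "example 10 m wind speed"},
)
ds.to_netcdf("ncfile.nc")   # writes netCDF-4 by default when the netCDF4 backend is available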

The most recommended way to read netCDF4 files is with xarray (which may not be the most efficient way). The xarray library is built specifically for dealing with multi-dimensional datasets (or multi-index arrays), and using it provides compatibility with pandas and NumPy, which is a great way to implement advanced data operations.

# using xarray
import xarray as xr
dset = xr.open_dataset("ncfile.nc")
vals = dset["variable_name"].values

Here I am demonstrating with xarray. The netCDF reader behind xarray is python-netCDF4 itself (or scipy), so you need the scipy or python-netCDF4 package installed for xarray to open the dataset.

We can also use python-netCDF4 to read the dataset:

# using python-netCDF4
from netCDF4 import Dataset
import numpy as np
dset = Dataset("ncfile.nc")
vals = dset["variable"]
#if you want as numpy array
val_array = np.array(vals)

If you want, you can find a sample netCDF dataset here.

Using xarray will make us lazy (of course, Python makes everyone a lazy coder) when dealing with netCDF datasets, but it is probably the fastest way to do a lot of operations over the data, so in the coming sections I will mostly be using xarray for netCDF operations.

2. Domain operations

What I mean by domain/spatial operations are GIS-based operations that can be performed over spatial netCDF data. For demo purposes, I am taking spatial netCDF data from the ERA5 weather dataset from here.

For a better understanding, we can check which spatial operations (over raster datasets) are available in the usual GIS software (ArcGIS, QGIS, etc.):

  • Clipping ( Extracting the dataset based on defined geometry )
  • Projecting (Assigning or changing projection based on the region)
  • Regridding

1. Subsetting and clipping based on a shapefile

  1. Subsetting a netCDF file

There are several ways to take a subset of spatial data based on lat/lon or any other dimension

Using numpy and python-netcdf4
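The gist for this approach is not embedded, so here is a rough sketch of the same lat/lon box subset done with python-netCDF4 and numpy (the file name, variable and box limits are placeholders):

# Sketch: lat/lon box subset with python-netCDF4 and numpy
from netCDF4 import Dataset
import numpy as np

dset = Dataset("nc_file.nc")
lat = dset["lat"][:]
lon = dset["lon"][:]
# indices falling inside the hypothetical box 10-20 N, 72-84 E
lat_idx = np.where((lat >= 10) & (lat <= 20))[0]
lon_idx = np.where((lon >= 72) & (lon <= 84))[0]
# slice the variable along its spatial axes (assumed to be the last two dimensions)
sub = dset["ws10"][..., lat_idx[0]:lat_idx[-1] + 1, lon_idx[0]:lon_idx[-1] + 1]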
""" Using xarray """import xarray as xrnc = "nc_file.nc"ncdata = xr.open_dataset(nc)w_data = ncdata["ws10"]
sub_lat,sub_lon = [10,20],[72,84]
nw_sub = w_data.sel(lat=slice(sub_lat[0],sub_lat[1]),lon = slice(sub_lon[0],sub_lon[1]))"""The new subset is a data array"""
Original and sub-region

The above steps are the same for the other dimensions. Making a subset of a netCDF file using a lat/lon box is much easier with CDO:

cdo sellonlatbox,-5,12,34.5,44.5 in.nc out.nc

2. Clipping using a shapefile

Shape file

The shapefiles for regions generally contain spatial geometry defined using latitude and longitude, so they carry a spatial reference and also a projection (if the data is projected). One of the most common formats is the ESRI shapefile with the .shp extension, but there are several others like GeoJSON (which is more convenient to visualize).
Here I am using a shapefile to clip the netCDF file to the extent of the shapefile. There are a lot of ways to do it: we can read the shapefile with a reader and use its extents to clip, or we can use libraries like rasterio or rioxarray.

import xarray as xr
import geopandas as gpd
from shapely.geometry import mapping
import rioxarray as rio
nc = "file.nc"era = xr.open_dataset(nc)roi_src = "shapefile.shp"era.rio.set_spatial_dims(x_dim="lon", y_dim="lat", inplace=True)
era.rio.write_crs("EPSG:4326", inplace=True)
geodf = gpd.read_file(roi_src)
era_clip = era.rio.clip(geodf.geometry.apply(mapping), geodf.crs)
data = era_clip['ws10'] # replace with variable name
time = era_clip.time.values

Change the names of the dimensions and the filenames accordingly. The .shp extension refers to an ESRI shapefile, which consists of a main file, an index file, and a dBASE table. You need all three files to properly read the shapefile:

main file ----> name.shp  --- "list of vertices"
index file ---> name.shx --- " offset record of main file"
dBase file ---> name.dbf --- "contains feature attributes"
Clipped nc file using shapefile

The above method is a bit slow and uses multiple libraries; if you find a faster method, let me know. It is useful for area-specific analysis (by projecting both the netCDF data and the shapefile).

I haven’t come across a need to project the netCDF file itself; for any related process I usually just convert the netCDF to GeoTIFF images. If you know how to edit or add projection information in a netCDF file, we can improve this part with netCDF projection.
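One possible route (an assumption on my side, not something covered by the original scripts) is to reproject a variable with rioxarray, along the same lines as the clipping example above:

# Possible sketch (assumption, not part of the original workflow): reprojecting with rioxarray
import xarray as xr
import rioxarray  # registers the .rio accessor

ds = xr.open_dataset("file.nc")
da = ds["ws10"].rio.set_spatial_dims(x_dim="lon", y_dim="lat")
da = da.rio.write_crs("EPSG:4326")          # declare the geographic CRS first
da_utm = da.rio.reproject("EPSG:32643")     # placeholder target CRS (UTM zone 43N)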

3. Regridding

There are several ways to regrid the data from a netCDF file. A regridding library for xarray is available, called xESMF. There is no pip package available, but you can install it via conda or directly from GitHub.

Here I am only showing spatial regridding using the griddata function from the scipy library (the same griddata function is also available in MATLAB).
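The gist is not shown here, so this is a rough sketch of the idea with scipy.interpolate.griddata: treat the original grid points as scattered samples and interpolate the values onto a new, finer target grid (file, variable and resolution are placeholders):

# Sketch: regridding one 2-D field with scipy.interpolate.griddata
import numpy as np
import xarray as xr
from scipy.interpolate import griddata

ds = xr.open_dataset("ncfile.nc")
field = ds["ws10"][0, :, :]                      # one 2-D slice of a hypothetical variable
lon2d, lat2d = np.meshgrid(field.lon.values, field.lat.values)
points = np.column_stack([lon2d.ravel(), lat2d.ravel()])
values = field.values.ravel()

# target grid at (for example) 0.1 degree resolution
new_lon = np.arange(float(field.lon.min()), float(field.lon.max()), 0.1)
new_lat = np.arange(float(field.lat.min()), float(field.lat.max()), 0.1)
new_lon2d, new_lat2d = np.meshgrid(new_lon, new_lat)

regridded = griddata(points, values, (new_lon2d, new_lat2d), method="linear")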

After and before Regridding

I didn’t use the xesmf library here, but it is most probably the fastest method when using large multi-dimensional datasets and interpolating over multiple variables.

Additionally, you can make use of tools like CDO, NCO, QGIS and ArcGIS to do spatial regridding of netCDF data.

3. Data operations

  • data operations
  • data visualization (seaborn,matplotlib,plotly,folium)
  • Extracting the dataset

1. Data operations

For any operation on netcdf multidimensional data I use the xarray and numpy libraries.

import numpy as np
import xarray as xr
data = xr.open_dataset("ncfile.nc")
data_arr = data["variable"] # by treating as xarray
vals = data["variable"].values # can be treated as numpy ndarray

So using basic indexing we can do any operation over the array. I’m not covering any additional analysis or statistical operations over the data; if there are any operations that can’t be done this way, let me know.
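A few illustrative one-liners on the objects opened above (the variable and coordinate names are placeholders):

# Sketch: a few typical operations on the array opened above
import numpy as np
import xarray as xr

data = xr.open_dataset("ncfile.nc")
data_arr = data["variable"]
vals = data_arr.values

time_mean = data_arr.mean(dim="time")                       # temporal mean at every grid point
point_ts = data_arr.sel(lat=10, lon=72, method="nearest")   # time series at the nearest grid point
anomaly = vals - vals.mean(axis=0)                          # numpy-style anomaly along the first axis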

2. Data Visualization

Here I am mainly focusing on visualization of spatial netCDF data

  1. Static plots using matplotlib,seaborn and cartopy

I mainly use matplotlib or seaborn as my visualization libraries for static plots.

"""Simple plot using matplotlib"""import xarray as xr
import matplotlib.pyplot as plt
dt = "ncfile.nc" # use your nc filenc_data = xr.open_dataset(dt)var = nc_data["ws10"][0,0,:,:]# use your variable and index accordingly
x = var.lon.values # longitude as x dimension
y = var.lat.values # latitude as y dimension
z = var.values # 2d array of spatial values
plt.Figure(figsize=(12,4))
plt.contourf(x,y,z) # contour plot
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("10m wind speed over Indian sub-Continent")

With cartopy we can easily add map projections and other earth features (shapefiles of coastlines, rivers, terrain data, etc.).

More advanced plot using matplotlib and cartopy
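The embedded gist is not reproduced here, but a minimal sketch of such a plot could look like this (the file, variable and indexing are assumptions):

# Sketch: a simple cartopy map of one 2-D slice
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

ds = xr.open_dataset("ncfile.nc")
var = ds["ws10"][0, 0, :, :]  # adjust the variable name and indices to your file

ax = plt.axes(projection=ccrs.PlateCarree())
var.plot.contourf(ax=ax, transform=ccrs.PlateCarree(), cmap="viridis")
ax.coastlines()
ax.add_feature(cfeature.BORDERS, linewidth=0.4)
ax.gridlines(draw_labels=True)
plt.title("10 m wind speed")
plt.show()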
Spatial Map using cartopy

Seaborn and matplotlib

Although there are a lot of tools already available in matplotlib, the seaborn library makes data analysis and visualization a bit easier.

import xarray as xr
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
nc = "ncfile.nc"dset = xr.open_dataset(nc)var = dset["ws10"]# use your variable and index accordingly
time = dset.time.values[0] # taking only first time dimension
lev = dset.lev_2.values[0] # takinf only first level dimension
area = var.sel(lat=slice(10,11),lon=slice(70,71),time=time,lev_2=lev) # making a subset of region (using lat/lon)
wind_data = area.values # wind data as an example
lats = area.lat.values # latitude values
lons = area.lon.values # longitude values
f, ax = plt.subplots(figsize=(9, 9))sns.heatmap(area.values, annot=False,xticklabels=lons,yticklabels=lats, fmt="f", linewidths=.5, ax=ax) # seaborn heatmap
plt.savefig("ex.png")
plt.title("netCDF grid based heatmap")
plt.xlabel("Longitude")
plt.ylabels("Latitude")
Heatmap generated using seaborn

I did not go into any advanced analysis; this is just an intro to using the seaborn library. You can find more interesting scientific plots and examples here.

I had earlier planned to add interactive visualization of netCDF data using JS-based libraries and tools such as folium, Earth Engine code and plotly, but that deserves its own section.

3. Extracting the dataset

Similar to the above steps, we can convert and extract the data from netCDF to other common formats.

  1. Converting to CSV
import xarray as xr
import pandas as pd
import numpy as np
nc = "yourncfile.nc"dset = xr.open_dataset(nc)var = dset["ws10"]# use your variable and index accordingly
time = dset.time.values[0]
lev = dset.lev_2.values[0]
area = var.sel(lat=slice(10,11),lon=slice(70,71),time=time,lev_2=lev)
wind_data = area.values
lats = area.lat.values
lons = area.lon.values
"""Flattening the 2d arrays """wind_data = wind_data.flatten()
lats,lons = np.meshgrid(lats,lons)
lats_n,lons_n = lats.flatten(),lons.flatten()

dft = pd.DataFrame({"lat":lats_n,"lon":lons_n,"val":wind_data})
dft.to_csv("out.csv",index=False)

In the above step I converted the netCDF data into a simple readable format (here I’m exporting it as a CSV file).

2. Converting to GeoTIFF

GeoTIFFs are collections of 2D images with additional information like map projection, coordinate system, datum, etc. Even though both netCDF and GeoTIFF are raster formats, there are some underlying differences between them (for example, the spatial reference metadata inside a netCDF file is very limited).

Earlier I was using GDAL (with its Python wrapper) to convert netCDF to GeoTIFF, but later I found this script by Pratiman Patel, which is much better and easier (using rioxarray):

import xarray as xr
import rioxarray as rio
#Open the NetCDF
#Download the sample from https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc
ncfile = xr.open_dataset('sresa1b_ncar_ccsm3-example.nc')
#Extract the variable
pr = ncfile['pr']
#(Optional) convert longitude from (0-360) to (-180 to 180) (if required)
pr.coords['lon'] = (pr.coords['lon'] + 180) % 360 - 180
pr = pr.sortby(pr.lon)
#Define lat/long
pr = pr.rio.set_spatial_dims('lon', 'lat')
#Check for the CRS
pr.rio.crs
#(Optional) If your CRS is not discovered, you should be able to add it like so:
pr.rio.set_crs("epsg:4326")
#Saving the file
pr.rio.to_raster(r"GeoTIFF.tif")

Check this link for a detailed description of the code.

Final notes

Some of the above operations are also readily available in tools like CDO, ArcGIS or QGIS. The point of this article is to understand the basic netCDF operations using Python.
All the above scripts were created by me (except the netCDF to GeoTIFF script), so it would be nice if you could send me suggestions to improve them. If you have a specific data analysis method using netCDF datasets, we can improve this article by adding it.

I have added all the scripts as public gists on my GitHub page.
