Diving Deep into netCDF4
In my college days, I misinterpreted netCDF4 as a data storage library developed specifically for atmospheric science applications, and personally I didn't like it (it did not seem as capable as a relational database). After an exploration into data science, I found out that UCAR (the University Corporation for Atmospheric Research) did not develop the netCDF4 libraries only for atmospheric research, but as an entirely new data storage architecture for scientific data.
Overview
The history of netCDF goes back to the CDF (Common Data Format) library developed by NASA for storing multidimensional numerical data in a self-describing, platform-independent way. The idea behind netCDF4 is not very different from CDF or HDF5 (we will come to that later), but it is more user-friendly and thus easier to use for research and advanced applications. Several Unix distros, such as openSUSE, come with the netCDF4 libraries pre-installed, which tells you something about the significance of the netCDF data format.
Python and netCDF4
Using Python everywhere is not always a good idea, but it is convenient that Python supports bindings for libraries from other languages (C, Java, Fortran, etc.), so Python ultimately comes out on top wherever machine-level performance is of less importance. All these factors make Python a good tool for scientific data analysis.
Python libraries for netCDF4
- SciPy (pupynere)
- PyNIO and PyNGL
- netCDF4-python
- xarray
- pycdf
The above are only Python-based libraries; there are also many other tools and software packages available for netCDF4. You can find the whole list here.
Here, I am focusing only on spatial netCDF datasets, which are mainly used for atmospheric and ocean research:
- Creating and reading netCDF files
- Domain operations
- Data operations
1. Creating a netCDF file from scratch and reading it
- Using SciPy (pupynere)
SciPy comes with a netCDF library (which is actually based on the pupynere library, written by Roberto De Almeida).
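As a minimal sketch (the filename, dimension names, and attributes below are my own choices), creating and reading a classic netCDF-3 file with SciPy looks like this:

```python
# Sketch: writing and reading a classic (netCDF-3) file with SciPy.
# Filenames, dimension names, and attributes here are just examples.
import numpy as np
from scipy.io import netcdf_file

f = netcdf_file("scipy_demo.nc", "w")
f.history = "created for a netCDF demo"          # global attribute

f.createDimension("time", 10)                    # define a dimension
time = f.createVariable("time", "i", ("time",))  # a variable over it
time[:] = np.arange(10)
time.units = "days since 2020-01-01"             # variable attribute
f.close()

# reading it back with the same module
f2 = netcdf_file("scipy_demo.nc", "r")
vals = f2.variables["time"][:].copy()  # copy out of the mmap before closing
f2.close()
print(vals)  # [0 1 2 3 4 5 6 7 8 9]
```

Note that SciPy opens files with memory mapping by default, hence the `.copy()` before closing.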
2. Using xarray
The SciPy netCDF module only supports netCDF-3, not netCDF-4, so there may be problems when dealing with multiple dimensions; I suggest using the xarray module for creating netCDF files.
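For completeness, here is a minimal sketch of creating a netCDF file with xarray; the variable, coordinates, and filename are invented for the demo, and the on-disk format depends on which backend engine (netCDF4, h5netcdf, or scipy) is installed:

```python
# Sketch: creating a small netCDF file with xarray (names/values invented).
import numpy as np
import xarray as xr

temp = 15 + 8 * np.random.randn(2, 3)  # fake 2x3 temperature field
ds = xr.Dataset(
    {"temperature": (("lat", "lon"), temp)},
    coords={"lat": [10.0, 20.0], "lon": [70.0, 75.0, 80.0]},
    attrs={"description": "demo dataset"},
)
ds.to_netcdf("xarray_demo.nc")  # backend engine is picked automatically

# read it back
ds2 = xr.open_dataset("xarray_demo.nc")
print(ds2["temperature"].shape)  # (2, 3)
ds2.close()
```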
The most recommended way to read netCDF4 files is with xarray (though it may not be the most efficient way). The xarray library is built specifically for dealing with multi-dimensional datasets (or multi-index arrays). Using xarray provides compatibility with pandas and NumPy, which is a great way to implement advanced data operations.
# using xarray
import xarray as xr
dset = xr.open_dataset("ncfile.nc")
vals = dset["variable_name"].values
Here I am demonstrating with xarray. The netCDF reader behind xarray is netCDF4-python itself (or SciPy), so you need the scipy or netCDF4 package installed for xarray to open the dataset.
We can also use netCDF4-python to read the dataset directly:
# using python-netCDF4
from netCDF4 import Dataset
import numpy as np

dset = Dataset("ncfile.nc")
vals = dset["variable"]

# if you want it as a numpy array
val_array = np.array(vals)
If you want, you can find a sample netCDF dataset here.
Using xarray may make us lazy when dealing with netCDF datasets (of course, Python makes everyone a lazy coder), but it may also be the fastest way to do a lot of operations over the dataset, so in the coming sections I will mostly use xarray for netCDF operations.
2. Domain operations
What I mean by domain/spatial operations are GIS-based operations that can be performed over spatial netCDF data. For demo purposes, I am taking spatial netCDF data from the ERA5 weather dataset, available here.
For a better understanding, we can check which spatial operations (over raster datasets) are available in the usual GIS software (ArcGIS, QGIS, etc.):
- Clipping ( Extracting the dataset based on defined geometry )
- Projecting (Assigning or changing projection based on the region)
- Regridding
1. Subsetting and clipping based on a shapefile
- Subsetting a netCDF file
There are several ways to take a subset of spatial data based on lat/lon or any other dimension:
""" Using xarray """
import xarray as xr

nc = "nc_file.nc"
ncdata = xr.open_dataset(nc)
w_data = ncdata["ws10"]

sub_lat, sub_lon = [10, 20], [72, 84]
nw_sub = w_data.sel(lat=slice(sub_lat[0], sub_lat[1]),
                    lon=slice(sub_lon[0], sub_lon[1]))
"""The new subset is a DataArray"""
The above steps are the same for the other dimensions. Making a subset of a netCDF file using a lat-lon box is much easier with CDO:
cdo sellonlatbox,-5,12,34.5,44.5 in.nc out.nc
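To illustrate that the same `.sel`/`slice` pattern works for other dimensions, here is a small sketch subsetting along a time axis (the DataArray is synthetic; all names are my own assumptions):

```python
# Sketch: the same .sel/slice pattern along a time dimension (synthetic data).
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2020-01-01", periods=10)
da = xr.DataArray(np.arange(10.0), coords={"time": times}, dims="time")

# pick a date range, just like the lat/lon slices above (both ends inclusive)
sub = da.sel(time=slice("2020-01-03", "2020-01-05"))
print(sub.values)  # [2. 3. 4.]
```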
2. Clipping using a shapefile
Shapefiles for regions generally contain spatial geometry defined using latitude and longitude, so they carry a spatial reference and also a projection (if projected). One of the most common formats is the ESRI shapefile with the .shp extension, but there are several others, like GeoJSON (which is more convenient to visualize).
Here I am using a shapefile to clip the netCDF file to the extent of the shapefile. There are a lot of ways to do this: we can read the shapefile and use its extent to clip, or we can use libraries like rasterio or rioxarray.
import xarray as xr
import geopandas as gpd
from shapely.geometry import mapping
import rioxarray as rio

nc = "file.nc"
era = xr.open_dataset(nc)
roi_src = "shapefile.shp"

era.rio.set_spatial_dims(x_dim="lon", y_dim="lat", inplace=True)
era.rio.write_crs("EPSG:4326", inplace=True)
geodf = gpd.read_file(roi_src)
era_clip = era.rio.clip(geodf.geometry.apply(mapping), geodf.crs)

data = era_clip['ws10']  # replace with your variable name
time = era_clip.time.values
Change the names of the dimensions and the filenames accordingly. The .shp extension refers to an ESRI shapefile, which consists of a main file, an index file, and a dBASE table. You need all three files to properly read the shapefile:
main file ----> name.shp --- "list of vertices"
index file ---> name.shx --- " offset record of main file"
dBase file ---> name.dbf --- "contains feature attributes"
The above method is a bit slow and uses multiple libraries; if you find a faster method, please let me know. This will be useful for area-specific analysis (by projecting both the netCDF file and the shapefile).
I haven't come across projecting the netCDF file itself; for any related process I simply convert netCDF to GeoTIFF images. If you know how to edit or add projection data in a netCDF file, we can improve this part with netCDF projection.
3. Regridding
There are several ways to regrid the data from a netCDF file. A regridding library for xarray is available: xESMF. There is no pip package available, but you can install it via conda or directly from GitHub.
Here I am only showing spatial regridding using the griddata function from the SciPy library (the same griddata function is also available in MATLAB).
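A minimal sketch of that griddata approach, interpolating one 2D slice from a coarse grid onto a finer one; the grids and the field below are synthetic stand-ins for a real netCDF variable:

```python
# Sketch: spatial regridding of one 2D slice with scipy.interpolate.griddata.
# The grids and the field are synthetic stand-ins for a netCDF variable.
import numpy as np
from scipy.interpolate import griddata

# source (coarse) grid
lats = np.linspace(10, 20, 11)
lons = np.linspace(70, 80, 11)
lon2d, lat2d = np.meshgrid(lons, lats)
field = np.sin(np.radians(lat2d)) + np.cos(np.radians(lon2d))

# target (finer) grid
new_lats = np.linspace(10, 20, 21)
new_lons = np.linspace(70, 80, 21)
new_lon2d, new_lat2d = np.meshgrid(new_lons, new_lats)

# griddata works on scattered points, so flatten the source grid
points = np.column_stack([lon2d.ravel(), lat2d.ravel()])
regridded = griddata(points, field.ravel(),
                     (new_lon2d, new_lat2d), method="linear")
print(regridded.shape)  # (21, 21)
```

Because griddata treats the source as scattered points, the same pattern also works for irregular (non-rectilinear) source grids.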
I didn't use the xesmf library, but it is most probably the fastest method when working with large multi-dimensional datasets and interpolating over multiple variables.
Additionally, you can make use of tools like CDO, NCO, QGIS, and ArcGIS to do spatial regridding of netCDF data.
3. Data operations
- Data operations
- Data visualization (seaborn, matplotlib, plotly, folium)
- Extracting the dataset
1. Data operations
For any operation on netCDF multidimensional data, I use the xarray and NumPy libraries.
import numpy as np
import xarray as xr

data = xr.open_dataset("ncfile.nc")
data_arr = data["variable"]     # treat as an xarray DataArray
vals = data["variable"].values  # treat as a numpy ndarray
So using basic indexing we can do any operation over the array. I am not covering any additional analysis or statistical operations over the data here; if there are operations that cannot be done this way, let me know.
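As a quick illustration of what "basic indexing" buys you (the array is synthetic; real variable and dimension names will differ):

```python
# Sketch: simple operations over a (time, lat, lon) array (synthetic data).
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
)

first_step = arr[0]               # numpy-style indexing -> (lat, lon) slice
time_mean = arr.mean(dim="time")  # reduce over a named dimension
overall_max = float(arr.max())    # scalar statistics

print(first_step.shape, time_mean.shape, overall_max)  # (3, 4) (3, 4) 23.0
```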
2. Data Visualization
Here I am mainly focusing on visualization of spatial netCDF data.
- Static plots using matplotlib, seaborn, and cartopy
I mainly use matplotlib or seaborn as the main visualization libraries for static plots:
"""Simple plot using matplotlib"""
import xarray as xr
import matplotlib.pyplot as plt

dt = "ncfile.nc"  # use your nc file
nc_data = xr.open_dataset(dt)
var = nc_data["ws10"][0, 0, :, :]  # use your variable and index accordingly

x = var.lon.values  # longitude as x dimension
y = var.lat.values  # latitude as y dimension
z = var.values      # 2d array of spatial values

plt.figure(figsize=(12, 4))
plt.contourf(x, y, z)  # filled contour plot
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("10m wind speed over the Indian subcontinent")
plt.show()
With cartopy we can easily add a simple projection and other earth features (shapefiles of coastlines, rivers, terrain data, etc.).
Seaborn and matplotlib
Although there are already a lot of tools available in matplotlib, the seaborn library makes data analysis and visualization a bit easier.
import xarray as xr
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

nc = "ncfile.nc"
dset = xr.open_dataset(nc)
var = dset["ws10"]  # use your variable and index accordingly

time = dset.time.values[0]  # taking only the first time step
lev = dset.lev_2.values[0]  # taking only the first level

# making a subset of a region (using lat/lon)
area = var.sel(lat=slice(10, 11), lon=slice(70, 71), time=time, lev_2=lev)
wind_data = area.values  # wind data as an example
lats = area.lat.values   # latitude values
lons = area.lon.values   # longitude values

f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(wind_data, annot=False, xticklabels=lons,
            yticklabels=lats, fmt="f", linewidths=.5, ax=ax)  # seaborn heatmap
plt.title("netCDF grid based heatmap")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.savefig("ex.png")
I did not go into any advanced analysis; this is just an intro to using the seaborn library. You can find more interesting scientific plots and examples here.
I had earlier planned to add interactive visualization of netCDF data using JS-based libraries such as folium, the Earth Engine code editor, and plotly, but that deserves its own section.
3. Extracting dataset
Similar to the above steps, we can convert and extract data from netCDF into other common formats.
1. Converting to CSV
import xarray as xr
import pandas as pd
import numpy as np

nc = "yourncfile.nc"
dset = xr.open_dataset(nc)
var = dset["ws10"]  # use your variable and index accordingly

time = dset.time.values[0]
lev = dset.lev_2.values[0]
area = var.sel(lat=slice(10, 11), lon=slice(70, 71), time=time, lev_2=lev)
wind_data = area.values
lats = area.lat.values
lons = area.lon.values

"""Flattening the 2d arrays"""
wind_data = wind_data.flatten()
lons2d, lats2d = np.meshgrid(lons, lats)  # note the (lon, lat) order, matching the (lat, lon) array shape
lats_n, lons_n = lats2d.flatten(), lons2d.flatten()

dft = pd.DataFrame({"lat": lats_n, "lon": lons_n, "val": wind_data})
dft.to_csv("out.csv", index=False)
In the above step, I converted the netCDF4 data into a simple readable format (here I am exporting it as a CSV file).
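As an aside, xarray can do this flattening itself through pandas; here is a minimal sketch with a synthetic DataArray (real variable and dimension names will differ):

```python
# Sketch: netCDF-style array -> CSV via xarray's to_dataframe (synthetic data).
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    coords={"lat": [10.0, 11.0], "lon": [70.0, 71.0, 72.0]},
    dims=("lat", "lon"),
    name="val",
)

df = da.to_dataframe().reset_index()  # columns: lat, lon, val
df.to_csv("out_xr.csv", index=False)
print(df.shape)  # (6, 3)
```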
2. Converting to GeoTIFF
GeoTIFFs are collections of 2D images with additional information such as map projection, coordinate system, datum, etc. Even though both netCDF and GeoTIFF are raster datasets, there are some underlying differences between them (for example, the spatial reference metadata inside netCDF is very limited).
Earlier I was using GDAL (with its Python wrapper) to convert netCDF to GeoTIFF, but later I found this script by Pratiman Patel, which is much better and easier (using rioxarray):
import xarray as xr
import rioxarray as rio

# Open the NetCDF
# Download the sample from https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc
ncfile = xr.open_dataset('sresa1b_ncar_ccsm3-example.nc')

# Extract the variable
pr = ncfile['pr']

# (Optional) convert longitude from (0-360) to (-180 to 180) (if required)
pr.coords['lon'] = (pr.coords['lon'] + 180) % 360 - 180
pr = pr.sortby(pr.lon)

# Define lat/lon
pr = pr.rio.set_spatial_dims('lon', 'lat')

# Check for the CRS
pr.rio.crs

# (Optional) If your CRS is not discovered, you should be able to add it like so:
pr = pr.rio.set_crs("epsg:4326")

# Saving the file
pr.rio.to_raster(r"GeoTIFF.tif")
Final notes
Some of the above operations are also readily available in several tools like CDO, ArcGIS, or QGIS. The point of this article is to understand the basic operations on netCDF using Python.
All the above scripts were created by me (except the netCDF-to-GeoTIFF script), so it would be nice if you could suggest improvements. If you have a specific data analysis method using netCDF datasets, we can improve this article by adding it.
I have added all the scripts as public gists on my GitHub page.