I have around 4 TB of MERIS time-series data, which comes in netCDF format.
So I have a lot of netCDF files, each containing several 'variables'.
The netCDF format is new to me, and although I've read a lot about netCDF processing I have no idea how to go about it. The question 'Combining a large amount of netCDF files' deals with my problem to some extent, but I didn't get anywhere with it. My approach was to first mosaic, then stack, and finally take the mean of every pixel.
One file contains 32 variables. Here is the ncdump output of one .nc file for one day:
http://www.filedropper.com/ncdumpoutput
I managed to read the files, extract the variable I want (variable #32), and put the arrays into a list using the following code:
import netCDF4 as nc

l = []
for i in files_in:
    # read the netCDF file
    dset = nc.Dataset(i, mode='r')
    # extract the variable of interest
    var = dset.variables['vegetation_index_mean'][:]
    # collect every loop iteration's output in a list
    l.append(var)
    # close the netCDF file
    dset.close()
The list now contains 24 masked arrays covering different locations on the same date.
Every time I print the contents of the list, Spyder freezes, and every command I run afterwards makes Spyder freeze for five seconds before starting.
My goal is a time-series analysis over a specific time frame (every date is stored in a single .nc file). So my plan was to mosaic the variables in the list (is this possible?), treating them as raster bands, then process the remaining dates the same way and take the mean of every pixel (1800 x 1800).
Maybe my whole approach is wrong? Can I treat these 'variables' like raster bands?
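To make the last step concrete, this is roughly what I picture for the per-pixel mean (a sketch only; it assumes every date has already been mosaicked into one 1800 x 1800 masked array, and date_mosaics is a hypothetical list of those arrays):
import numpy as np

# date_mosaics: hypothetical list of 1800 x 1800 masked arrays, one per date
stack = np.ma.stack(date_mosaics)    # shape: (n_dates, 1800, 1800)
mean_per_pixel = stack.mean(axis=0)  # per-pixel mean; masked cells are ignored
The mosaicking itself is the part I'm unsure about.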
I'm not sure whether the following answer meets your needs, as this procedure is designed for processing time series, is pretty manual, and furthermore you have 4 TB of data...
So I apologize if this doesn't help.
This is for Python 2.7:
First, import all the modules needed:
import tkFileDialog
from netCDF4 import Dataset
import matplotlib.pyplot as plt
Second, parse multiple nc files:
filename = tkFileDialog.askopenfilenames()  # returns a tuple of selected paths
filename = list(filename)
n = len(filename)
Third, read the nc files and sort the data and metadata into dictionaries using a loop:
wtr_tem = {}  # empty dict for the sea water temperature variable
fh = {}       # empty dict for the file handles
vars = {}     # empty dict for each file's variable names

for i in range(n):
    # remove unicode so that the following commands work
    filename[i] = filename[i].decode('unicode_escape').encode('ascii', 'ignore')
    filename1 = ''.join(filename[i])  # converts list to string
    fh[i] = Dataset(filename1, mode='r')  # create the file handle
    vars[i] = fh[i].variables.keys()      # list of the variables in the file
    wtr_tem[i] = fh[i].variables['WTR_TEM']
    # plot each variable in a different figure
    plt.plot(wtr_tem[i], 'r-')
    plt.xlabel(fh[i].title)  # add the specific title from each nc file
    plt.show()
I hope it may help somebody.
I have a while loop that collects data from a microphone (replaced here with np.random.random() to make it more reproducible). I do some operations; let's say I take abs().mean() here, because my output will be a one-dimensional array.
This loop is going to run for a LONG time (e.g., once a second for a week), and I am wondering about my options for saving the data. My main concerns are acceptable write performance and keeping the result portable (e.g., .csv beats .npy).
The simple way: just append to a .txt file. Could it be replaced by csv.gz, maybe using np.savetxt()? Would it be worth it? (A rough gzip sketch follows the code below.)
The hdf5 way: this should be the nicer way, but reading the whole dataset back just to append to it doesn't seem like good practice, nor better performing than dumping into a text file. Is there another way to append to hdf5 files?
The npy way (code not shown): I could save this into a .npy file, but I would rather use a format that could be read from any program, to keep it portable.
from collections import deque

import numpy as np
import h5py

# save every n samples (the original values were not shown; 10 is assumed here)
save_interval = save_interval_sec = 10

amplitudes = deque(maxlen=save_interval_sec)
iterations = 0

# Read from the microphone in a continuous stream
while True:
    data = np.random.random(100)
    amplitude = np.abs(data).mean()
    print(amplitude, end="\r")
    amplitudes.append(amplitude)

    # Option 1: append the amplitudes to a text file every n iterations
    if len(amplitudes) == save_interval:
        with open("amplitudes.txt", "a") as f:
            for amp in amplitudes:
                f.write(str(amp) + "\n")
        amplitudes.clear()

    # Option 2 (alternative to option 1): append to an HDF5 file every n iterations
    if len(amplitudes) == save_interval:
        # Convert the deque to a NumPy array
        amplitudes_array = np.array(amplitudes)
        # Open an HDF5 file
        with h5py.File("amplitudes.h5", "a") as f:
            # Get the existing dataset, or create a new one if it doesn't exist
            dset = f.get("amplitudes")
            if dset is None:
                dset = f.create_dataset("amplitudes", data=amplitudes_array, dtype=np.float32,
                                        maxshape=(None,), chunks=True, compression="gzip")
            else:
                # Get the current size of the dataset
                current_size = dset.shape[0]
                # Resize the dataset to make room for the new data
                dset.resize((current_size + save_interval,))
                # Write the new data to the dataset
                dset[current_size:] = amplitudes_array
        # Clear the deque
        amplitudes.clear()

    # For debugging only: stop after a few save cycles
    iterations += 1
    if iterations > 3 * save_interval:
        break
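For the csv.gz option mentioned above, this is roughly what I picture (a sketch only, using gzip from the standard library; whether compression pays off for one float per second is exactly what I am unsure about):
import gzip

# append one batch of amplitudes to a gzip-compressed text file;
# each append adds a new gzip member, which readers handle transparently
with gzip.open("amplitudes.csv.gz", "at") as f:  # "at" = append, text mode
    for amp in amplitudes:
        f.write(str(amp) + "\n")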
Update
I get that the answer might depend a bit on the sampling frequency (once a second might be too slow for some approaches) and on the data dimensions (a single column might be too little). I guess I asked because anything can work, and I always just dump to text; I am not sure where the breaking points are that tip the decision toward one method or the other.
I have multiple .csv files that represent a series of measurements that were made.
I need to plot them in order to compare successive alterations.
I basically want to create a function with which I can read each file into a list, repeat the same data cleaning for each .csv file, and then plot them all together in one happy graph.
This is a task I need to do to analyze some results. I intend to do it in Python/pandas, as I might need to integrate it into a bigger picture in the future, but for now this is it.
I also have one file that represents background noise, and I want to subtract those values from the other .csv files as well.
import os

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

PATH = r'C:\Users\UserName\Documents\FSC\Folder_name'
FileNames = os.listdir(PATH)
for file in FileNames:
    df = pd.read_csv(PATH + file, index_col=0)
I expected to read every file and store it in the list, but instead I got this error:
FileNotFoundError: [Errno 2] File b'C:\Users\UserName\Documents\FSC\FolderNameFileName.csv' does not exist: b'C:\Users\UserName\Documents\FSC\FolderNameFileName.csv'
Have you used pathlib from the standard library? It makes working with the file system a breeze.
I recommend reading: https://realpython.com/python-pathlib/
Try:
from pathlib import Path

import pandas as pd

files = Path('/your/path/here/').glob('*.csv')  # get all the csvs in your dir
for file in files:
    df = pd.read_csv(file, index_col=0)
    # your plots
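As a side note, the FileNotFoundError in your traceback comes from joining the folder and file name without a separator (PATH + file). With pathlib that join becomes the / operator, which inserts the separator for you; a minimal sketch (the file name is hypothetical):
from pathlib import Path

import pandas as pd

PATH = Path(r'C:\Users\UserName\Documents\FSC\Folder_name')
df = pd.read_csv(PATH / 'FileName.csv', index_col=0)  # hypothetical file name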
I am working in Python, trying to take the x, y, z coordinates from multiple LAZ files and put them into one array that can be used for another analysis. I am trying to automate this task, as I have about 2000 files to turn into one or even ten arrays. The example involves two files, but I can't get the loop to work properly. I think I am not naming my variables correctly. Below is an example of the code I have been trying to write (note that I am extremely new to programming, so I apologize if this is horrible code).
Create a list of las files, then turn them into an array (an attempt at better automation):
import glob

import numpy as np
from laspy.file import File

# create list of vegetation files to be opened
VegList = sorted(glob.glob('/Users/sophiathompson/Desktop/copys/Clips/*.las'))
for f in VegList:
    print(f)
    Veg = File(filename=f, mode="r")  # open the file
    points = Veg.get_points()  # grab all of the points from the file
    print(points)  # a check that the number of rows changes at the end
    print("array shape:")
    print(points.shape)
    VegListCoords = np.vstack((Veg.x, Veg.y, Veg.z)).transpose()
print(VegListCoords)
This block reads both files but fills VegListCoords with the results of the second file in the list. I need it to hold the records from both. If this is a horrible way to go about it, I am very open to a new way.
You keep overwriting VegListCoords by assigning it the values from your last opened file.
Instead, initialize it at the beginning:
VegListCoords = []
and inside the loop do:
VegListCoords.append(np.vstack((Veg.x, Veg.y, Veg.z)).transpose())
If you want them in one numpy array at the end, use np.concatenate, as in the sketch below.
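A minimal sketch of the corrected loop (same file pattern as in the question):
import glob

import numpy as np
from laspy.file import File

VegList = sorted(glob.glob('/Users/sophiathompson/Desktop/copys/Clips/*.las'))
VegListCoords = []  # initialize once, before the loop
for f in VegList:
    Veg = File(filename=f, mode="r")
    # append this file's (x, y, z) columns instead of overwriting
    VegListCoords.append(np.vstack((Veg.x, Veg.y, Veg.z)).transpose())

all_coords = np.concatenate(VegListCoords)  # one (n_points, 3) array for all files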
I am working on a script in Python to which I can pipe ls output, open all the files I want to work with using scipy.io, and then assign the imported data into a .mat file (again using scipy.io). I have had success importing the data and assigning it to a dictionary for export, but when I load the output file in MATLAB none of the data looks at all the same.
The data I am importing all has a lat/lon coordinate attached to it, so I will use that as an example. The data come from a netCDF (.nc) file:
#!/usr/bin/python
import sys

import scipy.io as sio
import numpy as np

# the script takes absolute path inputs (or absolute paths are desirable);
# to get absolute paths on a Linux system use the command:
#   ls -d $PWD/*    OR, if the files are not in $PWD:    ls -d /absolute_file_path/*
# and then pipe its standard output to the input of this script

# initialize the dictionary of parameters to export
param = {}

# read each file name from stdin
for filename in sys.stdin:
    filename = filename.strip()  # drop the trailing newline that stdin delivers
    fh = sio.netcdf.netcdf_file(filename, 'r')
    # load in the standard variables (coordinates)
    latitude = fh.variables['LATITUDE'][:]
    longitude = fh.variables['LONGITUDE'][:]
    # close the file
    fh.close()
    # get the file number (cycle number), e.g. '001' from a name like 'file001.nc'
    cycle = filename[-6:-3]
    # add the latest imported coordinates to the dictionary
    latvar = 'lat' + cycle
    lonvar = 'lon' + cycle
    param.update({latvar: latitude})
    param.update({lonvar: longitude})

# export the dictionary to a .mat file
sio.savemat('test.mat', param)
When I print the values to check that they are imported correctly, I get reasonable values, but when I open the exported values in MATLAB, this is an example of what I get:
>> lat001
lat001 =
-3.5746e-133
and the other variables have similarly wild exponents (sometimes very small, as in this example, sometimes extremely large, ~1e100).
I have tried looking for similar problems, but all I have come across is that some people have had issues assigning large amounts of data to a single .mat file (e.g., an array exceeding 2**32-1 bytes).
EDIT: some example outputs (loading the file in Python, and the datatypes):
print latitude
[ 43.091]
print type(latitude)
<type 'numpy.ndarray'>
data = sio.loadmat('test.mat')
print data['lat001']
array([[ -3.57459142e-133]])
I am trying to manipulate some data with Python, but I'm having quite a bit of difficulty (given that I'm still a rookie). I have taken some code from other questions/sites but still can't quite get what I want.
Basically, what I need is to take a set of data files, select the data from one particular row of each of those files, and then put it into a new file so I can plot it.
So, to get the data into Python in the first place I'm trying to use:
import glob
import os

import numpy

data = []
path = 'C:/path/to/file'
for files in glob.glob(os.path.join(path, '*.*')):
    data.append(list(numpy.loadtxt(files, skiprows=34)))  # first 34 rows aren't used
This has worked great for me once before, but for some reason it won't work now. Any possible reasons why that might be the case?
Anyway, carrying on, this should give me a 2D list containing all the data.
Next I want to select a certain row from each data set, and can do so using:
x = list(xrange(30)) #since there are 30 files
Then:
rowdata = list(data[i][some particular row] for i in x)
Which gives me a list containing the value for that particular row from each imported file. This part seems to work quite nicely.
Lastly, I want to write this to a file. I have been trying:
f = open('path/to/file', 'w')
for item in rowdata:
f.write(item)
f.close()
But I keep getting an error. Is there another way to approach this?
You are already using numpy to load the text, so you can use it to manipulate the data as well.
import glob
import os

import numpy as np

path = 'C:/path/to/file'
mydata = np.array([np.loadtxt(f) for f in glob.glob(os.path.join(path, '*.*'))])
This will load all your data into one 3d array:
mydata.ndim
#3
where the first dimension (axis) runs over the files, the second over rows, the third over columns:
mydata.shape
#(number of files, number of rows in each file, number of columns in each file)
So, you can access the first file by
mydata[0,...] # equivalent to: mydata[0,:,:]
or specific parts of all files:
mydata[0,34,:]  # the 35th row of the first file
mydata[:,34,:]  # the 35th row of all files
mydata[:,34,1]  # the second value in the 35th row of all files
To write to file:
Say you want to write a new file with just the 35th row from all files:
np.savetxt(os.path.join(path, 'outfile.txt'), mydata[:,34,:])
If you just have to read from a file and write to a file, you can use open().
For a better solution, you can use linecache, as in the sketch below.
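A possible sketch with linecache (assuming, as in the question, that the interesting row is line 35 of each file, i.e. the first row after the 34 skipped header rows):
import glob
import linecache
import os

path = 'C:/path/to/file'

with open(os.path.join(path, 'outfile.txt'), 'w') as out:
    for fname in glob.glob(os.path.join(path, '*.*')):
        # linecache numbers lines from 1; getline returns the line with its newline
        out.write(linecache.getline(fname, 35))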