Creating NetCDF files in python from csv data

I currently have a number of CSV files, one for each location. Within each file there are two columns: one is the datetime and the other is the hourly maximum wind gust in knots. I also have a separate CSV that contains the coordinates of each of these locations.
Initially I want to create a netCDF file from 12 locations arranged in a 3 x 4 grid with a spacing of 0.25 degrees.
All of the examples I have read online about creating netCDF files from CSV start with files that already contain lat, lon and the variable, whereas I am starting with a time series plus the variable, and the lat/lon for each point stored separately.
As well as this, all the examples I've seen load each timestep in manually, one at a time. Obviously, with hourly data going back to 1979 this is unfeasible, and if possible I would like to load all the data in one go. If that is not possible, it would still be quicker to load in the data for each grid point rather than for each time step. Any help at all with these problems would be much appreciated.
I have been following the example at https://www.esri.com/arcgis-blog/products/arcgis/data-management/creating-netcdf-files-for-analysis-and-visualization-in-arcgis/ in case this is of any use to those providing assistance.
I am also familiar with CDO, but I'm not sure if it has any useful functionality here.
Cheers

There are a variety of ways of doing this. The simplest is possibly to use pandas and xarray. The code below shows how to create a simple dataframe and save it to netCDF using pandas/xarray.
import pandas as pd
import xarray as xr

# Build a toy DataFrame with coordinates and a value column
df = pd.DataFrame({"lon": range(0, 10), "lat": range(0, 10),
                   "value": range(0, 10)})

# Make lat/lon the index so xarray treats them as dimensions
df = df.set_index(["lat", "lon"])
df.to_xarray().to_netcdf("outfile.nc")
You haven't specified how the time is stored etc., so I will leave it up to you to work out how to read the CSVs and get the times into the necessary format.
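For the layout described in the question (one CSV per location plus a separate coordinates CSV), a rough sketch of one possible approach is below. The filenames and column names ('datetime' and 'gust' in each location file; 'station', 'lat' and 'lon' in the coordinates file) are assumptions, so adjust them to match your data:

import glob
import os
import pandas as pd

coords = pd.read_csv("coordinates.csv").set_index("station")  # assumed columns: station, lat, lon

frames = []
for path in glob.glob("locations/*.csv"):
    station = os.path.splitext(os.path.basename(path))[0]  # assumes files are named <station>.csv
    df = pd.read_csv(path, parse_dates=["datetime"])        # assumed columns: datetime, gust
    df["lat"] = coords.loc[station, "lat"]
    df["lon"] = coords.loc[station, "lon"]
    frames.append(df)

# Index on (time, lat, lon) so that xarray turns each level into a dimension
ds = pd.concat(frames).set_index(["datetime", "lat", "lon"]).to_xarray()
ds.to_netcdf("wind_gusts.nc")

Each index level becomes a dimension, so the file ends up with datetime, lat and lon dimensions, with missing values wherever a grid cell has no station.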

Related

Exporting a csv of pandas dataframe with LARGE np.arrays

I'm building a deep learning model for speech emotion recognition in google colab environment.
Extracting the data and features from the audio files takes about 20+ minutes of runtime.
Therefore, I have made a pandas DataFrame containing all of the data, which I want to export to a CSV file so I wouldn't need to wait that long for the data to be extracted every time.
Because the audio files have a sample rate of around 44,100 frames per second (Hz), I get a huge array of values for each file. In df.sample, each 'x' entry (an array of about 170K values) only shows up as a truncated representation.
Unfortunately, df.to_csv copies that truncated representation, and NOT the full arrays.
Is there a way to export the full DataFrame as CSV? (It should be miles and miles of data for each row...)
The problem is that a DataFrame is not really meant to hold np.arrays as cell values, even though NumPy is the framework underlying pandas. In any case, a DataFrame is intended to be a data-processing tool, not a general-purpose container, so I think you are using the wrong tool here.
If you still want to go that way, it is enough to change the np.arrays into lists:
df['x'] = df['x'].apply(list)
But at load time, you will have to declare a converter to change the string representations of lists into plain lists:
import ast
df = pd.read_csv('data.csv', converters={'x': ast.literal_eval})  # one converter per array-valued column
But again, a CSV file is not intended to have fields containing large lists, and performance may not be what you expect.
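For what it's worth, a minimal round-trip sketch of the approach above (the filename and column names are made up):

import ast
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": ["happy", "sad"],
                   "x": [np.random.rand(1000), np.random.rand(1000)]})

# Convert the arrays to lists so that the full values are written out
df["x"] = df["x"].apply(list)
df.to_csv("data.csv", index=False)

# Reload, parsing the stringified lists back into Python lists
df2 = pd.read_csv("data.csv", converters={"x": ast.literal_eval})
df2["x"] = df2["x"].apply(np.array)   # back to np.arrays if needed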

Uploading stock price data in .txt files and analyzing in python

I am new to python and have been searching for this but can't find any questions on this. I have stock price data for hundreds of stocks, all in .txt files. I am trying to upload all of them to jupyter notebook to analyze them, ideally with charts and mathematical analysis (specifically mean reversion analysis).
I am wondering how I can upload so many files at once. I need to be able to analyze each of them to see if they are reverting to their mean price. Then I would like to create a chart of the top 5 biggest differences from the mean.
Also, should I convert them to .csv files and maybe then load them into pandas? What are some good libraries to use? I know pandas, matplotlib, and the math library, as well as probably numpy.
Thank you.
Use glob to read the directory and pandas to read the files, then concat them all:
from glob import glob
import pandas as pd

dir_containing_files = 'path_to_csv_files'
df = pd.concat([pd.read_csv(f) for f in glob(dir_containing_files + '/*.txt')])
I'm guessing your text files contain columns of data separated by some delimiter, in which case you can use pd.read_csv (even without changing the file extension to .csv):
data = pd.read_csv('stock_data.txt', sep=",")
# change `sep` to whatever delimiter is in your files
You could put the line above into a loop to load many files at once. I can't say exactly how to loop through them without knowing the pattern in your file names, but a rough sketch is below.
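For example, a rough sketch assuming each file is named after its ticker (e.g. AAPL.txt) and lives in a stock_data directory:

import os
from glob import glob
import pandas as pd

frames = {}
for path in glob('stock_data/*.txt'):
    ticker = os.path.splitext(os.path.basename(path))[0]
    frames[ticker] = pd.read_csv(path, sep=',')   # adjust sep to match your files

# One DataFrame per ticker, keyed by symbol
print(list(frames))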
In addition to Pandas, libraries that I would reach for to do mean reversion analysis are:
statsmodels for model fitting
matplotlib for drawing graphs
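And a very rough sketch of the "biggest difference from the mean" chart mentioned in the question, assuming the frames dict from the loop above and that each DataFrame has a 'Close' price column (adjust to your data):

import pandas as pd
import matplotlib.pyplot as plt

# Deviation of the latest price from the long-run mean, in standard deviations
deviation = {
    ticker: (df['Close'].iloc[-1] - df['Close'].mean()) / df['Close'].std()
    for ticker, df in frames.items()
}
top5 = pd.Series(deviation).abs().sort_values(ascending=False).head(5)

top5.plot(kind='bar', title='Largest deviation from mean (in std devs)')
plt.show()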

Transforming data in pandas

What would be the best way to approach this problem using python and pandas?
I have an excel file of electricity usage. It comes in an awkward structure and I want to transform it so that I can compare it to weather data based on date and time.
The structure looks like this (foo is a string and xx is a number):
100,foo,foo,foo,foo
200,foo,foo,foo,foo,foo,0000,kWh,15
300,20181101,xx,xx,xx,xx...(96 columns)xx,A
... several hundred more 300 type rows
The 100 and 200 rows identify the meter and provide a partial schema, i.e. data is in kWh and 15-minute intervals. The 300 rows contain a date, 96 columns of 15-minute power consumption (96 = 24 hours * 4 blocks of 15 min) and one column with a data quality flag.
I have previously processed all the data in other tools, but I'm trying to learn how to do it in Python (Jupyter Notebook, to be precise) and tap into the far more advanced analysis, modelling and visualisation tools available.
I think the thing to do is transform the data into a series of datetime and power. From there I can aggregate, filter and compare however I like.
I am at a loss even to know what question to ask or which resource to look up to tackle this problem. I could just import the 300 rows as is and loop through the rows and columns to create a new series in the right structure - easy enough to do. However, I strongly suspect there is an inbuilt method for doing this sort of thing, and I would greatly appreciate any advice on what the best strategies might be. Possibly I don't need to transform the data at all.
You can read the data easily enough into a DataFrame; you just have to step over the metadata rows, e.g.:
df = pd.read_csv(<file>, skiprows=[0,1], index_col=1, parse_dates=True, header=None)
This will read in the CSV, skip the first two lines, make the date column the index, and try to parse it as a date type.
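From there, one possible way to get the "series of datetime and power" mentioned in the question is sketched below; the filename is made up, and it assumes that after the record-type and date columns come the 96 quarter-hour readings followed by the quality flag:

import pandas as pd

df = pd.read_csv('usage.csv', skiprows=[0, 1], index_col=1,
                 parse_dates=True, header=None)

# Drop the record-type column and keep only the 96 quarter-hour readings
readings = df.drop(columns=[0]).iloc[:, :96]
readings.columns = pd.timedelta_range(start='0h', periods=96, freq='15min')

# Stack into a long series indexed by full timestamp (date + time of day)
series = readings.stack()
series.index = series.index.get_level_values(0) + series.index.get_level_values(1)
series.name = 'kWh'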

Standardizing GPX traces

I have two GPX files (from a race I ran twice, obtained via the Strava API) and I would like to be able to compare the effort across both. The sampling frequency is irregular, however (i.e. data is not recorded every second, or every meter), so a straightforward comparison is not possible and I would need to standardize the data first. Preferably, I would resample the data so that I have data points every 10 meters, for example.
I'm using Pandas, so I'm currently standardizing a single file by inserting rows for every 10 meters and interpolating the heartrate, duration, lat/lng, etc from the surrounding data points. This works, but doesn't make the data comparable across files, as the recording does not start at the exact same location.
An alternative is to first standardize the course coordinates using something like geohashing and then try to map both efforts to this standardized course. Since coordinates cannot be easily sorted, I'm not sure how to do that correctly, however.
Any pointers are appreciated, thanks!
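For reference, my current per-file resampling looks roughly like this (simplified; it assumes a DataFrame with a cumulative 'distance' column in meters plus 'lat', 'lon', 'heartrate' and 'duration'):

import numpy as np
import pandas as pd

def resample_by_distance(df, step=10):
    # Interpolate the trace onto a regular distance grid (every `step` meters)
    grid = np.arange(0, df['distance'].iloc[-1], step)
    out = pd.DataFrame({'distance': grid})
    for col in ['lat', 'lon', 'heartrate', 'duration']:
        out[col] = np.interp(grid, df['distance'], df[col])
    return out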

How to convert daily to monthly netcdf files

I have downloaded climate model output in the form of netCDF files with one variable (pr) for the whole world at a daily time step. My final goal is to have monthly data for Europe.
I have never used netCDF files before, and all the specific software for netCDF I could find doesn't seem to work on Windows. Since I program in R, I tried using the ncdf4 package but ran into memory problems (my files are around 2 GB)... I am now trying the netCDF4 module in Python (first time using Python, so go easy on me).
I have managed to install everything and found some code online to import the dataset:
from netCDF4 import Dataset

nc_fid = Dataset(nc_f, 'r')  # nc_f is the path to the netCDF file
# Extract data from the netCDF file
lats = nc_fid.variables['lat'][:]
lons = nc_fid.variables['lon'][:]
time = nc_fid.variables['time'][:]
pp = nc_fid.variables['pr'][:]
However, all the tutorials I found are about how to make a netCDF file... I have no idea how to aggregate this daily rainfall (variable pr) into monthly values. Also, I have different types of calendar in different files, but I don't even know how to access that information:
time.calendar
AttributeError: 'numpy.ndarray' object has no attribute 'calendar'
Please help, I don't want to have to learn Linux just so I can sort out some data :(
Why not avoid programming entirely and use NCO, which supplies the ncrcat command that aggregates data thusly:
ncrcat day*.nc month.nc
Voilà. See more ncrcat examples here.
Added 2016-06-28: if instead of a month-long timeseries you want a monthly average, then use the same command with ncra instead of ncrcat. The manual explains things like this.
If you have a daily timestep and you want to calculate the monthly mean, then you can do:
cdo monmean input_yyyy.nc output_yyyy.nc
It sounds as if you have several of these files, so you will need to merge them with
cdo mergetime file_*.nc timeseries.nc
where the * is a wildcard for the years.
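If you would rather stay in Python, a rough xarray sketch of the same workflow (not the CDO answer above; the variable and coordinate names pr, lat and lon, and the Europe box, are assumptions) would be:

import xarray as xr

# Open all the yearly files as one dataset along the time dimension
ds = xr.open_mfdataset('file_*.nc', combine='by_coords')

# Subset Europe (approximate box; adjust to your grid and longitude convention)
europe = ds.sel(lat=slice(35, 72), lon=slice(-25, 45))

# Aggregate the daily precipitation to monthly means and save
monthly = europe['pr'].resample(time='1MS').mean()
monthly.to_netcdf('pr_monthly_europe.nc')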
