Subsample multi-sensor time series data using Python's panda.Dataframe

Subsample multi-sensor time series data using Python's panda.Dataframe - python

I have a file coming from multiple sensor readings. Each line is of the following format:
timestamp sensor_name sensor_value
e.g.
191.12 temperature -5.19
191.17 pressure 20.05
191.18 pressure 20.04
191.23 pressure 20.07
191.23 temperature -5.17
191.31 temperature -5.09
...
The frequency of the readings is irregular, approximately 10-20Hz. I need do downsample these readings to 1Hz and output the result in the following format
timestamp sensor_1_value sensor_2_value ... sensor_n_value
reflecting the (running?) mean value of the sensor readings in the successive seconds, e.g.
timestamp temperature pressure
191.00 -5.02 21.93
192.00 -5.01 21.92
193.00 -5.01 21.91
...
I loaded each line of the input file into a dictionary as follows:
def add(self, timestamp, sensor_name, sensor_value):
self.timeseries[sensor_name].append([timestamp, sensor_value])
... and created a DataFrame from the dictionary:
df = pd.DataFrame(self.timeseries)
... but I need some guidance how to move forward from here, i.e. what's an elegant way to perform the sampling.

I'm not 100% sure what you're doing but this is what I'd do to solve the problem. It assumes your data file is space-separated with a header row.
import pandas as pd
import numpy as np
# Load the data
data = pd.read_csv(file_name, sep="\s", index_col=None)
# Take the mean of the values within a second
data = np.floor(data["timestamp"])
data = data.groupby(["timestamp", "sensor_name"]).mean()
data = data.reset_index()
# Pivot
data = data.pivot(index="timestamp", columns="sensor_name", values="sensor_value")
If you have some other concept for "downsampling" in mind for this context you should do that instead of mean.

Related

iterate different file rows in a time series data while backtesting a strategy

i have a folder data which contains csv files like a,bn1,bn2,bn3
all files have same time series data as index
In order to backtest strategy i use pandas iterrows()
but the trick is i have to iter the file based on the resulted value from file"a"
example: to iter single file in for loop i use
df = pd.read_csv('data/bn1')
for i in df.iterrows():
Open, High, Low, Close, Volume, sl, RSI, fractal, time = i[1]
but what i want is:
1.calculate value from file "a" which has same time series data with all files
2.if point1 results 1 then read file bn1
3.after exitting the trade we recalculate value(where time = time we exited in bn1) from file "a" after stopping itering rows of bn1
4.if :point1" results 3 start iter rows of file bn3
this goes on until the time series data is completed

Python good practice with NetCDF4 shared dimensions across groups

This question is conceptual in place of a direct error.
I am working with the python netcdf4 api to translate and store binary datagram packets from multiple sensors packaged in a single file. My question is in reference to Scope of dimensions and best use practices.
According to the Netcdf4 convention and metadata docs, dimension scope is such that all child groups have access to a dimension defined in the parent group (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#_scope).
Context:
The multiple sensors datapackets are written to a binary file. Timing adjustments are handled prior to writing the binary file such that we can trust the time stamp of a data packet. Time sampling rates are not synchonious. Sensor 1 samples at say 1Hz. Sensor 2 samples at 100Hz. Sensor 1 and 2 measure a number of different variables.
Questions:
Do I define a single, unlimited time dimension at the root level and create multiple variables using that dimension, or create individual time dimensions at the group level. Psuedo-code below.
In setting up the netcdf I would use the following code:
import netcdf4
data_set = netcdf4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)
grp_1 = data_set.createGroup('grp1')
var_time_1 = grp_1.createVariable(varname='sensor_1_time', datatype='f8', dimensions=(time,))
var_time_1[:] = sensor_1_time # timestamp of data values from sensor_1
var_1 = grp_1.createVariable(varname='sensor_1_data', datatype='f8',
dimensions=(time,))
var_1[:] = sensor_1_data # data values from sensor 1
grp_2 = data_set.createGroup('grp2')
var_time_2 = grp_2.createVariable(varname='sensor_2_time', datatype='f8', dimensions=(time,))
var_time_2[:] = sensor_2_time
var_2 = grp_2.createVariable(varname='sensor_2_data', datatype='f8', dimension=(time,))
var_2[:] = sensor_2_data # data values from sensor 2
The group separation is not necessarily by sensor but by logical data grouping. In the case that data from two sensors falls into multiple groups, is it best to replicate the time array in each group or is it acceptable to reference to other groups using the Scope mechanism.
import netcdf4
data_set = netcdf4.Dataset(file_name, mode='w')
dim_time = data_set.createDimension(dimname='time', size=None)
grp_env = data_set.createGroup('env_data')
sensor_time_1 = grp_env.createVariable(varname='sensor_1_time', datatype='f8', dimensions=(time,))
sensor_time_1[:] = sensor_1_time # timestamp of data values from sensor_1
env_1 = grp_env.createVariable(varname='sensor_1_data', datatype='f8', dimensions=(time,))
env_1[:] = sensor_1_data # data values from sensor 1
env_2 = grp_1.createVariable(varname='sensor_2_data', datatype='f8', dimensions=(time,))
env_2.coordinates = "/grp_platform/sensor_time_1"
grp_platform = data_set.createGroup('platform')
sensor_time_2 = grp_platform.createVariable(varname='sensor_2_time', datatype='f8', dimensions=(time,))
sensor_time_2[:] = sensor_2_time
plt_2 = grp_platform.createVariable(varname='sensor_2_data', datatype='f8', dimension=(time,))
var_2[:] = sensor_2_data # data values from sensor 2
Most examples do not deal with these cross group functionality and I can't seem to find the best practices. I'd love some advice, or even a push in the right direction.

How to extract data from netCDF to CSV using simple python script?

This question is related to data extraction from NetCDF files (*.nc) to CSV, to be further used by researchers for other applications - statistical analysis along time series that are not array-like data. There are two alternatives for beginners in python, namely:
Script 1:
I need your help in extracting climate date from an nc file to csv using python. The data founder, namely Copernicus gives some advice in how to extract data from nc to csv using simple python script. However, I have some issues with OverflowError: int too big to convert.
I will briefly describe the steps and provide all necessary info regarding data shape and content.
#this is for reading the .nc in the working folder
import glob
#this is reaquired ti read the netCDF4 data
from netCDF4 import Dataset
#required to read and write the csv files
import pandas as pd
#required for using the array functions
import numpy as np
from matplotlib.dates import num2date
data = Dataset('prAdjust_tmean.nc')
And data looks like this:
print(data)
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
CDI: Climate Data Interface version 1.8.2 (http://mpimet.mpg.de/cdi)
frequency: year
CDO: Climate Data Operators version 1.8.2 (http://mpimet.mpg.de/cdo)
creation_date: 2020-02-12T15:00:49ZCET+0100
Conventions: CF-1.6
institution_url: www.smhi.se
invar_platform_id: -
invar_rcm_model_driver: MPI-M-MPI-ESM-LR
time_coverage_start: 1971
time_coverage_end: 2000
domain: EUR-11
geospatial_lat_min: 23.942343
geospatial_lat_max: 72.641624
geospatial_lat_resolution: 0.04268074 degree
geospatial_lon_min: -35.034023
geospatial_lon_max: 73.937675
geospatial_lon_resolution: 0.009246826 degree
geospatial_bounds: -
NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
contact: Hydro.fou#smhi.se
keywords: precipitation
license: Copernicus License V1.2
output_frequency: 30 year average value
summary: Calculated as the mean annual values of daily precipitation averaged over a 30 year period.
comment: The Climate Data Operators (CDO) software was used for the calculation of climate impact indicators (https://code.mpimet.mpg.de/projects/cdo/embedded/cdo.pdf, https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_eca.pdf).
history: CDO commands (last cdo command first and separated with ;): timmean; yearmean
invar_bc_institution: Swedish Meteorological and Hydrological Institute
invar_bc_method: TimescaleBC, Description in deliverable C3S_D424.SMHI.1.3b
invar_bc_method_id: TimescaleBC v1.02
invar_bc_observation: EFAS-Meteo, https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/efas-meteo-european-daily-high-resolution-gridded-meteorological-data-set-1990-2011
invar_bc_observation_id: EFAS-Meteo
invar_bc_period: 1990-2018
data_quality: Testing of EURO-CORDEX data performed by ESGF nodes. Additional tests were performed when producing CII and ECVs in C3S_424_SMHI.
institution: SMHI
project_id: C3S_424_SMHI
references:
source: The RCM data originate from EURO-CORDEX (Coordinated Downscaling Experiment - European Domain, EUR-11) https://euro-cordex.net/.
invar_experiment_id: rcp45
invar_realisation_id: r1i1p1
invar_rcm_model_id: MPI-CSC-REMO2009-v1
variable_name: prAdjust_tmean
dimensions(sizes): x(1000), y(950), time(1), bnds(2)
variables(dimensions): float32 lon(y,x), float32 lat(y,x), float64 time(time), float64 time_bnds(time,bnds), float32 prAdjust_tmean(time,y,x)
groups:
Extract the variable:
t2m = data.variables['prAdjust_tmean']
Get dimensions assuming 3D: time, latitude, longitude
time_dim, lat_dim, lon_dim = t2m.get_dims()
time_var = data.variables[time_dim.name]
times = num2date(time_var[:], time_var.units)
latitudes = data.variables[lat_dim.name][:]
longitudes = data.variables[lon_dim.name][:]
output_dir = './'
And the error:
OverflowError Traceback (most recent call last)
<ipython-input-9-69e10e41e621> in <module>
2 time_dim, lat_dim, lon_dim = t2m.get_dims()
3 time_var = data.variables[time_dim.name]
----> 4 times = num2date(time_var[:], time_var.units)
5 latitudes = data.variables[lat_dim.name][:]
6 longitudes = data.variables[lon_dim.name][:]
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
509 if tz is None:
510 tz = _get_rc_timezone()
--> 511 return _from_ordinalf_np_vectorized(x, tz).tolist()
512
513
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
2106 vargs.extend([kwargs[_n] for _n in names])
2107
-> 2108 return self._vectorize_call(func=func, args=vargs)
2109
2110 def _get_ufunc_and_otypes(self, func, args):
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2190 for a in args]
2191
-> 2192 outputs = ufunc(*inputs)
2193
2194 if ufunc.nout == 1:
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
329
330 dt = (np.datetime64(get_epoch()) +
--> 331 np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us'))
332 if dt < np.datetime64('0001-01-01') or dt >= np.datetime64('10000-01-01'):
333 raise ValueError(f'Date ordinal {x} converts to {dt} (using '
OverflowError: int too big to convert
Last part of the script is:
import os
# Path
path = "/home"
# Join various path components
print(os.path.join(path, "User/Desktop", "file.txt"))
# Path
path = "User/Documents"
# Join various path components
print(os.path.join(path, "/home", "file.txt"))
filename = os.path.join(output_dir, 'table.csv')
print(f'Writing data in tabular form to {filename} (this may take some time)...')
times_grid, latitudes_grid, longitudes_grid = [
x.flatten() for x in np.meshgrid(times, latitudes, longitudes, indexing='ij')]
df = pd.DataFrame({
'time': [t.isoformat() for t in times_grid],
'latitude': latitudes_grid,
'longitude': longitudes_grid,
't2m': t2m[:].flatten()})
df.to_csv(filename, index=False)
print('Done')
Script 2:
The second script has a more useful approach, because it gives the user the opportunity to select the exact data from NetCDF file using python, rather than extracting all data. This script is written by GeoDelta Labs on youtube. However, as in most cases the script is "written" for a specific format of nc files, and needs further tweaking to read any nc file.
The code is like this:
import netCDF4
import glob
from netCDF4 import Dataset
import pandas as pd
import numpy as np
# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('name_of_the_file.nc'):
print(file)
data = Dataset(file, 'r')
data.variables.keys()
print(data.__dict__)
for file in glob.glob('name_of_the_file.nc'):
print(file)
data = Dataset(file, 'r')
time = data.variables['time']
year = time.units[14:18]
all_years.append(year)
all_years
# Creating an empty Pandas DataFrame covering the whole range of data
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start = str(year_start) + '-01-01',
end = str(end_year) + '-12-31',
freq = 'Y')
df = pd.DataFrame(0.0, columns = ['Temperature'], index = date_range)
Now the interesting part of this second python script. Usually NetCDf files cover vast geographical areas (for example Nothern Hemisphere or entire continents), which is not so useful for particular statistical analysis. Let say that the user wants to extract climate data from nc files for particular locations - countries, regions, cities, etc. GeoDelta Lab comes with a very useful approach.
It requires a small CSV file, created separately by the user, where three columns are created: Name (name of the region/country/location), Longitude and Latitude.
# Defining the location, lat, lon based on the csv data
cities = pd.read_csv('countries.csv')
cities
all_years
for index, row in cities.iterrows():
location = row['names']
location_latitude = row['lat']
location_longitude = row['long']
# Sorting the all_years python list
all_years.sort()
for yr in all_years:
# Reading-in the data
data = Dataset('prAdjust_tmean_abs_QM-EFAS-Meteo-EUR-11_MPI-M-MPI-ESM-LR_historical_r1i1p1_MPI-CSC-REMO2009-v1_na_1971-2000_grid5km_v1.nc', 'r')
# Storing the lat and lon data of the netCDF file into variables
lat = data.variables['lat'][:]
lon = data.variables['lon'][:]
# Squared difference between the specified lat,lon and the lat,lon of the netCDF
sq_diff_lat = (lat - location_latitude)**2
sq_diff_lon = (lon - location_longitude)**2
# Identify the index of the min value for lat and lon
min_index_lat = sq_diff_lat.argmin()
min_index_lon = sq_diff_lon.argmin()
# Accessing the average temparature data
temp = data.variables['prAdjust_tmean']
# Creating the date range for each year during each iteration
start = str(yr) + '-01-01'
end = str(yr) + '-12-31'
d_range = pd.date_range(start = start,
end = end,
freq = 'D')
for t_index in np.arange(0, len(d_range)):
print('Recording the value for '+ location+': ' + str(d_range[t_index]))
df.loc[d_range[t_index]]['Temparature'] = temp[t_index, min_index_lat, min_index_lon]
df.to_csv(location +'.csv')
The last part combines the data from user-defined CSV (with exact locations) and data extracted from nc file, to create a new CSV with the exact data needed by the user.
However, like always the second script fails to "read" the years from my nc file and finds only the starting year 1970. Can someone explain what I am doing wrong? Could be that the nc file contains only data for 1970? In the description of the data - it is clearly specified that the file contains annual averages from 1970 until 2000.
If someone can help me to solve this issue, I guess this script can be useful for all future users of netCDF files. Many thanks for your time.
Edit: #Robert Davy requested a sample data. I've added the link to Google Drive, where the original nc files are stored and could be downloaded/viewed by anyone that accesses this link: https://drive.google.com/drive/folders/11q9dKCoQWqPE5EMcM5ye1bNPqELGMzOA?usp=sharing

Pandas dataframe CSV reduce disk size

for my university assignment, I have to produce a csv file with all the distances of the airports of the world... the problem is that my csv file weight 151Mb. I want to reduce it as much as i can: This is my csv:
and this is my code:
# drop all features we don't need
for attribute in df:
if attribute not in ('NAME', 'COUNTRY', 'IATA', 'LAT', 'LNG'):
df = df.drop(attribute, axis=1)
# create a dictionary of airports, each airport has the following structure:
# IATA : (NAME, COUNTRY, LAT, LNG)
airport_dict = {}
for airport in df.itertuples():
airport_dict[airport[3]] = (airport[1], airport[2], airport[4], airport[5])
# From tutorial 4 soulution:
airportcodes=list(airport_dict)
airportdists=pd.DataFrame()
for i, airport_code1 in enumerate(airportcodes):
airport1 = airport_dict[airport_code1]
dists=[]
for j, airport_code2 in enumerate(airportcodes):
if j > i:
airport2 = airport_dict[airport_code2]
dists.append(distanceBetweenAirports(airport1[2],airport1[3],airport2[2],airport2[3]))
else:
# little edit: no need to calculate the distance twice, all duplicates are set to 0 distance
dists.append(0)
airportdists[i]=dists
airportdists.columns=airportcodes
airportdists.index=airportcodes
# set all 0 distance values to NaN
airportdists = airportdists.replace(0, np.nan)
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv')
I also tried re-indexing it before saving:
# remove all NaN values
airportdists = airportdists.stack().reset_index()
airportdists.columns = ['airport1','airport2','distance']
but the result is a dataframe with 3 columns and 17 million columns and a disk size of 419Mb... quite not an improvement...
Can you help me shrink the size of my csv? Thank you!

I have done a similar application in the past; here's what I will do:
It is difficult to shrink your file, but if your application needs to have for example a distance between an airport from others, I suggest you to create 9541 files, each file will be the distance of an airport to others and its name will be name of airport.
In this case the loading of file is really fast.

My suggestion will be instead of storing as a CSV try to store in Key Value pair data structure like JSON. It will be very fast on retrieval. Or try parquet file format that will consume 1/4 of the CSV file storage.
import pandas as pd
import numpy as np
from pathlib import Path
from string import ascii_letters
#created a dataframe
df = pd.DataFrame(np.random.randint(0,10000,size=(1000000, 52)),columns=list(ascii_letters))
df.to_csv('csv_store.csv',index=False)
print('CSV Consumend {} MB'.format(Path('csv_store.csv').stat().st_size*0.000001))
#CSV Consumend 255.22423999999998 MB
df.to_parquet('parquate_store',index=False)
print('Parquet Consumed {} MB'.format(Path('parquate_store').stat().st_size*0.000001))
#Parquet Consumed 93.221154 MB

The title of the question, "..reduce disk size" is solved by outputting a compressed version of the csv.
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv', compression='zip')
Or one better with Pandas 0.24.0
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv.zip')
You will find the csv is hugely compressed.
This of course does not solve for optimizing load and save time and does nothing for working memory. But hopefully useful when disk space is at a premium or cloud storage is being paid for.

The best compression would be to instead store the latitude and longitude of each airport, and then compute the distance between any two pairs on demand. Say, two 32-bit floating point values for each airport and the identifier, which would be about 110K bytes. Compressed by a factor of about 1300.

Combining a large amount of netCDF files

I have a large folder of netCDF (.nc) files each one with a similar name. The data files contain variables of time, longitude, latitude, and monthly precipitation. The goal is to get the average monthly precipitation over X amount of years for each month. So in the end I would have 12 values representing the average monthly precipitation over X amount of years for each lat and long. Each file is the same location over many years.
Each file starts with the same name and ends in a “date.sub.nc” for example:
'data1.somthing.somthing1.avg_2d_Ind_Nx.200109.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.200509.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201104.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201004.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201003.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201103.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.201203.SUB.nc'
The ending is YearMonth.SUB.nc
What I have so far is:
array=[]
f = nc.MFDataset('data*.nc')
precp = f.variables['prectot']
time = f.variables['time']
array = f.variables['time','longitude','latitude','prectot']
I get a KeyError: ('time', 'longitude', 'latitude', 'prectot'). Is there a way to combine all this data so I am able to manipulate it?

As #CharlieZender mentioned, ncra is the way to go here and I'll provide some more details on integrating that function into a Python script. (PS - you can install NCO easily with Homebrew, e.g. http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/)
import subprocess
import netCDF4
import glob
import numpy as np
for month in range(1,13):
# Gather all the files for this month
month_files = glob.glob('/path/to/files/*{0:0>2d}.SUB.nc'.format(month))
# Using NCO functions ---------------
avg_file = './precip_avg_{0:0>2d}.nc'.format(month)
# Concatenate the files using ncrcat
subprocess.call(['ncrcat'] + month_files + ['-O', avg_file])
# Take the time (record) average using ncra
subprocess.call(['ncra', avg_file, '-O', avg_file])
# Read in the monthly precip climatology file and do whatever now
ncfile = netCDF4.Dataset(avg_file, 'r')
pr = ncfile.variables['prectot'][:,:,:]
....
# Using only Python -------------
# Initialize an array to store monthly-mean precip for all years
# let's presume we know the lat and lon dimensions (nlat, nlon)
nyears = len(month_files)
pr_arr = np.zeros([nyears,nlat,nlon], dtype='f4')
# Populate pr_arr with each file's monthly-mean precip
for idx, filename in enumerate(month_files):
ncfile = netCDF4.Dataset(filename, 'r')
pr = ncfile.variable['prectot'][:,:,:]
pr_arr[idx,:,:] = np.mean(pr, axis=0)
ncfile.close()
# Take the average along all years for a monthly climatology
pr_clim = np.mean(pr_arr, axis=0) # 2D now [lat,lon]

NCO does this with
ncra *.01.SUB.nc pcp_avg_01.nc
ncra *.02.SUB.nc pcp_avg_02.nc
...
ncra *.12.SUB.nc pcp_avg_12.nc
ncrcat pcp_avg_??.nc pcp_avg.nc
Of course the first twelve commands can be done with a Bash loop, reducing the total number of lines to less than five. If you prefer to script with python, you can check your answers with this. ncra docs here.

The command ymonmean calculates the mean of calendar months in CDO. Thus the task can be accomplished in two lines:
cdo mergetime data*.SUB.nc merged.nc # put files together into one series
cdo ymonmean merged.nc annual_cycle.nc # mean of all Jan,Feb etc.
cdo can also calculate the annual cycle of other statistics, ymonstd, ymonmax etc... and the time units can be days or pentads as well as months. (e.g. ydaymean).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Subsample multi-sensor time series data using Python's panda.Dataframe - python

Related

iterate different file rows in a time series data while backtesting a strategy

Python good practice with NetCDF4 shared dimensions across groups

How to extract data from netCDF to CSV using simple python script?

Pandas dataframe CSV reduce disk size

Combining a large amount of netCDF files

Categories

Resources