Conditional average in Python

I am having a problem manipulating my Excel file in Python.
I have a large Excel file with data arranged by date/time.
I would like to be able to average the data for a specific time of day, over all the different days; i.e., to create an average profile of the gas concentrations over one day.
Here is a sample of my Excel file:
Decimal day of year   Decimal of day   Gas concentration
133.6285              0.6285           46.51230
133.6493              0.6493           47.32553
133.6701              0.6701           49.88705
133.691               0.691            51.88382
133.7118              0.7118           49.524
133.7326              0.7326           50.37112
Basically I need a function, like the AVERAGEIF function in Excel, that will say something like
"Average the gas_concentrations when decimal_of_day = x".
However, I really have no idea how to do this. Currently I have got this far:
import xlrd
import numpy as np

book = xlrd.open_workbook('TEST.xlsx')
level_1 = book.sheet_by_index(0)
time_1 = level_1.col_values(0, start_rowx=1, end_rowx=1088)
dectime_1 = level_1.col_values(8, start_rowx=1, end_rowx=1088)
ozone_1 = level_1.col_values(2, start_rowx=1, end_rowx=1088)
# Replace 'NA' entries with NaN so the column can be treated as floats
ozone_1 = [float(i) if i != 'NA' else float('nan') for i in ozone_1]
Edit
I updated my script to include the following:
ozone = np.array(ozone_1, float)
time = np.array(dectime_1)
a = np.column_stack((ozone, time))
b = np.where(a[:, 0] < 0.0035)
print(b)
EDIT
Currently I have solved the problem by putting both variables into an array, then making a smaller array with just the values I need to average - a bit inefficient, but it works!
ozone = np.array(ozone_1, float)
time = np.array(dectime_1)
a = np.column_stack((ozone, time))
b = a[a[:, 1] < 0.0036]
c = np.nanmean(b[:, 0])

You can use a NumPy masked array.
import numpy as np
data_1 = np.ma.arange(10)
data_1 = np.ma.masked_where(<your condition>, data_1)
data_1_mean = np.mean(data_1)
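For example, applied to the arrays built in the question, a minimal sketch might look like this (the target time and the 0.0036 tolerance are illustrative choices on my part, not values from the original post):
import numpy as np

ozone = np.array(ozone_1, float)
time = np.array(dectime_1, float)

# Mask every measurement that is NOT close to the requested time of day
# (and any NaNs), then average whatever is left.
target = 0.0035
selected = np.ma.masked_where((np.abs(time - target) > 0.0036) | np.isnan(ozone), ozone)
mean_at_target = selected.mean()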
Hope that helps

Related

Key error when looping through netCDF data dates

I often work with satellite/model data, and a common task I need to perform is creating an array where every element is one of the months of the year. This generally works, but when I run the following code on an ECCO .nc file I get a KeyError that looks like a long string of numbers (for example: KeyError: 727185600000000000):
import numpy as np
import xarray as xr
import pandas as pd
import ecco_v4_py as ecco
import os
grid = ecco.load_ecco_grid_nc(grid_dir,'ECCO-GRID.nc')
ecco1992 = xr.merge((grid,xr.open_dataset('oceFWflx_1992.nc'))).load()
ts = []
zero = 0
for time in ecco1992.oceFWflx.time:
    ts.append(ecco1992.oceFWflx.sel(time=ecco1992.variables['time'][zero]))
    zero = zero + 1
However, when I do this manually, it works fine, for example:
jan = ecco1992.oceFWflx.sel(time = '1992-01-16T12:00:00.000000000')
feb = ecco1992.oceFWflx.sel(time = '1992-02-15T12:00:00.000000000')
(rest of months)
ts.append(jan)
ts.append(feb)
ts.append(rest of months)
This yields the desired array, but it isn't practical with large quantities of data.
What could be the cause of this KeyError, and how might I avoid it?
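A minimal sketch of one way around the label lookup (my own suggestion, not from the original post): selecting by position with isel avoids passing a raw time value back into sel:
ts = []
for i in range(ecco1992.oceFWflx.sizes['time']):
    # Positional indexing sidesteps the nanosecond-label KeyError entirely.
    ts.append(ecco1992.oceFWflx.isel(time=i))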

Python: Using function over multiple columns of a dataframe

I am sure this question pops up often. However, I did not manage to derive an answer for this task after going through similar questions.
I have a dataframe of returns of multiple stocks and need to run univariate regressions only, in order to derive rolling beta values.
ols.PandasRollingOLS by Brad Solomon is very convenient because it handles the window/rolling. However, I did not manage to iterate the function over all the stock-return columns.
I want this function to loop/iterate/go through all the columns of different stock returns.
In the following I am using code from the project description at https://pypi.org/project/pyfinance/, as it should point my issue out more clearly than code from my own project.
import numpy as np
import pandas as pd
from pyfinance import ols
from pandas_datareader import DataReader

syms = {
    'TWEXBMTH': 'usd',
    'T10Y2YM': 'term_spread',
    'PCOPPUSDM': 'copper'
}
data = DataReader(syms.keys(), data_source='fred',
                  start='2000-01-01', end='2016-12-31')\
    .pct_change()\
    .dropna()\
    .rename(columns=syms)
y = data.pop('usd')
rolling = ols.PandasRollingOLS(y=y, x=data, window=12)
rolling.beta.head()
#DATE term_spread copper
#2001-01-01 0.000093 0.055448
#2001-02-01 0.000477 0.062622
#2001-03-01 0.001468 0.035703
#2001-04-01 0.001610 0.029522
#2001-05-01 0.001584 -0.044956
Instead of a multivariate regression, I would like the function to take each column separately and iterate through the entire dataframe.
As my dataframe has over 50 columns, I would like to avoid writing the call out that many times, as in the following (which does, however, yield the expected results):
rolling = ols.PandasRollingOLS(y=y, x=data[["term_spread"]], window=12).beta
rolling["copper"]= ols.PandasRollingOLS(y=y, x=data[["copper"]], window=12).beta["copper"]
# term_spread copper
#DATE
#2001-01-01 0.000258 0.055856
#2001-02-01 0.000611 0.064094
#2001-03-01 0.001700 0.047485
#2001-04-01 0.001778 0.040413
#2001-05-01 0.001353 -0.032264
#... ... ...
#2016-08-01 -0.100839 -0.176078
#2016-09-01 -0.058668 -0.189192
#2016-10-01 -0.014763 -0.181441
#2016-11-01 0.046531 0.016082
#2016-12-01 0.060192 0.035062
My attempts often ended with "SyntaxError: positional argument follows keyword argument", as with the following:
def beta(r):
    z = ols.PandasRollingOLS(y=y, x=r, window=12)
    return z
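A minimal sketch of the looping approach (my own illustration, reusing the names from the example above): run one univariate rolling regression per column and collect the beta series into a single dataframe:
# One univariate rolling regression per column, keyed by column name.
betas = pd.concat(
    {col: ols.PandasRollingOLS(y=y, x=data[[col]], window=12).beta[col]
     for col in data.columns},
    axis=1,
)
betas.head()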

Missing_value attribute is lost reading data from a netCDF file?

I'm reading wind component (u and v) data from a netCDF file from the NCEP/NCAR Reanalysis 1 to make some computations. I'm using xarray to read the file.
In one of the computations, I'd like to mask out all data below some threshold, making them equal to the missing_value attribute. I don't want to use NaNs.
However, when reading the data with xarray, the missing_value attribute - present in the variable in the netCDF file - isn't copied to the xarray.DataArray that contains the data.
I couldn't find a way to copy this attribute from the netCDF file variable with xarray.
Here is an example of what I'm trying to do:
import xarray as xr
import numpy as np
DS1 = xr.open_dataset( "u_250_850_2009012600-2900.nc" )
DS2 = xr.open_dataset( "v_250_850_2009012600-2900.nc" )
u850 = DS1.uwnd.sel( time='2009-01-28 00:00', level=850, lat=slice(10,-60), lon=slice(260,340) )
v850 = DS2.vwnd.sel( time='2009-01-28 00:00', level=850, lat=slice(10,-60), lon=slice(260,340) )
vvel850 = np.sqrt( u850*u850 + v850*v850 )
jet850 = vvel850.where( vvel850 >= 12 )
#jet850 = vvel850.where( vvel850 >= 12, vvel850, vvel850.missing_value )
The last, commented line is what I want to do: use the missing_value attribute to fill where vvel850 < 12. The last uncommented line gives me NaNs, which is what I'm trying to avoid.
Is this the default behaviour of xarray when reading data from netCDF? Either way, how could I get this attribute from the file variable?
One additional piece of information: I'm using PyNGL (http://www.pyngl.ucar.edu/) to make contour plots, and it doesn't work with NaNs.
Thanks.
Mateus
The "missing_value" attribute is kept in the encoding dictionary. Other attributes like "units" or "standard_name" are kept in the attrs dictionary. For example:
v850.encoding['missing_value']
You may also be interested in a few other xarray features that may help your use case:
xr.open_dataset has a mask_and_scale keyword argument. Setting it to False turns off converting missing/fill values to NaNs.
DataArray.to_masked_array will convert a DataArray (filled with NaNs) to a numpy.MaskedArray for use in plotting programs like Matplotlib or PyNGL.
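A minimal sketch of how these pieces could fit together (the threshold and the fill handling are assumptions based on the question, not a tested recipe):
import numpy as np
import xarray as xr

DS1 = xr.open_dataset("u_250_850_2009012600-2900.nc")
DS2 = xr.open_dataset("v_250_850_2009012600-2900.nc")

# The original fill value survives in the variable's encoding dictionary.
fill = DS1.uwnd.encoding.get('missing_value')

u850 = DS1.uwnd.sel(time='2009-01-28 00:00', level=850, lat=slice(10, -60), lon=slice(260, 340))
v850 = DS2.vwnd.sel(time='2009-01-28 00:00', level=850, lat=slice(10, -60), lon=slice(260, 340))

# Fill below-threshold points with that value instead of with NaN.
vvel850 = np.sqrt(u850 * u850 + v850 * v850)
jet850 = vvel850.where(vvel850 >= 12, fill)

# Alternatively, hand a masked array straight to the plotting library.
jet850_masked = vvel850.where(vvel850 >= 12).to_masked_array()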

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the number of atoms, and the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same pattern is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom count and the comment. Is this actually possible?
Edit:
Reading this format into a pandas DataFrame is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, here is the full code I used:
from io import BytesIO

import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed

atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: how do I tell Dask to compute only the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (and then up until the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data

Python, parsing data 24 hours at a time out of 263 days

I have an Excel file (to be converted to CSV).
The data has 8 columns. The first two are day of the year and time, respectively, while the two before the last are minimum temperature and maximum temperature. For each day I need to find the day's maximum and minimum, subtract them, and save the value for that day.
Two problems I ran into: how do I parse 24 lines at a time (there are no missing data lines!), and how do I find the maximum and minimum within each batch?
I have 6312 lines = 24 hr * 263 days.
So, to iterate through the lines:
import numpy as np

input_temps = '/L7_HW_SASP_w1112.csv'
up_air_min = np.genfromtxt(input_temps, skip_header=1, dtype=float, delimiter=',', usecols=(5))
up_air_max = np.genfromtxt(input_temps, skip_header=1, dtype=float, delimiter=',', usecols=(6))
day_year = np.genfromtxt(input_temps, skip_header=1, dtype=float, delimiter=',', usecols=(0))

dt_per_all_days = []
for i in range(0, 6312, 1):
    # I get stuck here: how do I limit the iteration to 24 lines at a time?
    # If I can do that I think I can get the rest done.
    min_d = []
    max_d = []
    min_d.append(up_air_min[i])
    max_d.append(up_air_max[i])
    max_per_day = max(max_d)
    min_per_day = min(min_d)
    dt_d = max_per_day - min_per_day
    dt_per_all_days.append(dt_d)
    del min_d
    del max_d
    # ...then move to the next batch of 24 lines...
Use NumPy, Luke; avoid for-loops.
Once you have the up_air_min and up_air_max NumPy arrays, you can easily do what you want by using NumPy element-wise functions.
First, create 2D arrays with 263 rows (one per day) and 24 columns, like this:
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
Then use the np.min and np.max functions along axis 1:
min_temperature = np.min(min_matrix, axis=1)
max_temperature = np.max(max_matrix, axis=1)
And find the difference:
dt = max_temperature - min_temperature
dt is an array with the needed values. Let's save it to foo.csv:
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
And the final code looks like this:
import numpy as np
# This I got from your answer.
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
# Split arrays and create matrix with 263 lines-days and 24 values in every line.
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
# Find min temperature for every day. min_temperature is an array with 263 values.
min_temperature = np.min(min_matrix, axis=1)
# The same for max temperature.
max_temperature = np.max(max_matrix, axis=1)
# Subtract min temperature from max.
dt = max_temperature - min_temperature
# Save result in csv.
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
A reasonably pythonic way to do this would be to have a function that loops over the rows, gathering them up and spitting out the gathered rows using yield when the day changes. This gives you a generator that generates 263 lists each holding 24 values, which is a bit easier to process.
If you've definitely not got any missing values, you could use a trivial doubly-nested loop without batching up the elements first. That's a bit more fragile, but it sounds like you might not be planning to re-use the code anyway.
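A minimal sketch of that generator approach (my own illustration; it batches by count, relying on the guarantee of exactly 24 rows per day):
def daily_batches(rows, batch_size=24):
    """Yield consecutive groups of batch_size rows (one group per day)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # a trailing partial day, if any
        yield batch

# Each batch is a list of 24 (day, min, max) tuples for one day.
for day_rows in daily_batches(zip(day_year, up_air_min, up_air_max)):
    temps_min = [row[1] for row in day_rows]
    temps_max = [row[2] for row in day_rows]
    print(max(temps_max) - min(temps_min))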
Here's a somewhat contrived example of how you could chunk things by 24 lines at a time.
from io import StringIO
from random import random as r

import numpy as np

# Build a fake CSV in memory: 10000 rows of pitch, yaw, thrust.
s = StringIO()
for x in range(0, 10000):
    s.write('%f,%f,%f\n' % (r(), r() * 10, r() * 100))
s.seek(0)

data = np.genfromtxt(s, dtype=None, names=['pitch', 'yaw', 'thrust'], delimiter=',')

# Process the rows 24 at a time.
for x in range(0, len(data), 24):
    print('Acting on hours %d through %d' % (x, x + 24))
    one_day = data[x:x + 24]
    minimum_yaw = min(one_day['yaw'])
    max_yaw = max(one_day['yaw'])
    print('min', minimum_yaw, 'max', max_yaw, 'one_day', one_day['yaw'])
