Python: "Binning" subarrays

I am trying to bin lines of data according to the first element of each line.
My data has this shape:
[[Temperature, value0, value1, ... value249],
 [Temperature, ...],
 ...]
That is: the first element of each line is a temperature value, and the rest of the line is a time trace of a signal.
I would like to make an array of this shape:
[[Temperature-bin, [[values],
                    [values],
                    ...]],
 [Next Temp.-bin,  [[values],
                    [values],
                    ...]],
 ...
]
where the lines from the original data array are sorted into the subarray of their respective temperature bin.
data= np.array([values]) # shape is [temp+250 timesteps,400K]
temp=data[0]
start=23000
end=380000
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])
binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)
binneddata=np.empty([len(bincenters),2])
for i in np.arange(len(temp)):
    binneddata[i] = [bincenters[i], np.array([])]
I was hoping to get a result array as described above, where every line consists of the mean temperature of the bin (bincenters[i]) and an array of time traces. Python gives me an error: "setting an array element with a sequence".
I have created this kind of mixed-type array in another script before, but there I had to define it explicitly, which is not possible here because I'm handling files on the scale of several hundred thousand lines of data. I would also like to use as many built-in functions and as few loops as possible, because my computer already takes a while to process files of that size.
Thank you for your input,
lepakk

First: thanks to kwinkunks for the hint to use a pandas DataFrame.
I found a solution using this feature.
The binning is now done like this:
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])
binsize=1
bincenters=np.array(np.arange(np.round(tempmin),np.round(tempmax)+1,binsize))
lowerbinedges=np.array(bincenters-binsize/2)
higherbinedges=np.array(bincenters+binsize/2)
allbinedges=np.append(lowerbinedges,higherbinedges[-1])
temp_pd=pd.Series(temp[start:end])
traces=pd.Series(list(data[start:end,0:250]))
tempbins=pd.cut(temp_pd,allbinedges,labels=bincenters)
df=pd.concat([temp_pd,tempbins,traces], keys=['Temp','Bincenter','Traces'], axis=1)
by defining bins (in this case evenly sized). The variable "tempbins" has the same shape as temp (the "raw" temperature) and assigns every line of data to a certain bin.
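For illustration, here is a toy example (not the real data) of what pd.cut() returns when the bin centers are used as labels:
import numpy as np
import pandas as pd

temps = pd.Series([139.2, 139.8, 140.4, 141.1, 140.9])
centers = np.arange(139.0, 142.0, 1.0)               # 139, 140, 141
edges = np.append(centers - 0.5, centers[-1] + 0.5)  # 138.5 ... 141.5
print(pd.cut(temps, edges, labels=centers))
# -> 139.0, 140.0, 140.0, 141.0, 141.0 (one bin label per input temperature)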
The actual analysis is then extremely short. Starting with:
rf=pd.DataFrame({'Bincenter': bincenters})
the resultframe ("rf") starts with the bincenters (as the x-axis in a plot later), and simply adds columns for the desired results.
With
df[df.Bincenter==xyz]
I can select only those data lines from df that I want to have in the selected bin.
In my case, I am not interested in the actual time traces but in their sum or average, so I use lambda functions that run through the rows of rf and, for each row, pick every row in df that has the same value in "Bincenter".
rf['Binsize']=rf.apply(lambda row: len(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
rf['Trace_sum']=rf.apply(lambda row: sum(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
With those, two more columns are added to the resultframe rf: the number of lines in each bin and the sum of the traces.
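If the repeated boolean filtering inside apply() becomes slow on several hundred thousand rows, a single groupby over "Bincenter" should give the same two columns in one pass. This is only an untested sketch, building on the df defined above and assuming all traces have the same length:
centers, sizes, sums = [], [], []
for center, traces in df.groupby('Bincenter')['Traces']:
    if len(traces) == 0:  # categorical bins can produce empty groups
        continue
    centers.append(float(center))
    sizes.append(len(traces))
    sums.append(np.vstack(traces.values).sum(axis=0))
rf_alt = pd.DataFrame({'Bincenter': centers, 'Binsize': sizes})
rf_alt['Trace_sum'] = pd.Series(sums)  # one summed trace per bin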
I then performed some fits of the traces in rf.Trace_sum, which I did not do in pandas. Still, the DataFrame was very useful here. I used odr (scipy.odr) for fitting, like this:
for i in binnumber:
    fitdata = odr.Data(time[fitstart:], rf.Trace_sum.values[i][fitstart:])
    # ... some more fit stuff here ...
and saved the fitresults in
lifetimefits=pd.DataFrame({'lifetime': fitresult[:,1], 'sd_lifetime':fitresult[:,4]})
and finally added them in the resultframe with
rf=pd.concat([rf,lifetimefits],axis=1)
rf[['Bincenter','Binsize','lifetime','sd_lifetime']].to_csv('results.csv', header=True, index=False)
which makes an output like
Out[78]:
Bincenter Binsize ... lifetime sd_lifetime
0 139.0 4102 ... 38.492028 2.803211
1 140.0 4252 ... 33.659729 2.534872
2 141.0 3785 ... 31.220312 2.252104
3 142.0 3823 ... 29.391562 1.783890
4 143.0 3808 ... 40.422578 2.849545
I hope this explanation helps others avoid wasting time trying this with numpy alone. Thanks again to kwinkunks for the very helpful advice to use a pandas DataFrame.
Best,
lepakk

Related

Storing and reading multiple histograms in a csv file

I'm working with histograms presented as pandas Series and representing the realizations of random variables from an observation set. I'm looking for an efficient way to store and read them back.
The histogram's bins are the index of the Series. For example :
histogram1 :
(-1.3747106810983318, 3.529160051186781] 0.012520
(3.529160051186781, 8.433030783471894] 0.013830
(8.433030783471894, 13.336901515757006] 0.016495
(13.336901515757006, 18.24077224804212] 0.007194
(18.24077224804212, 23.144642980327234] 0.041667
(23.144642980327234, 28.048513712612344] 0.000000
I would like to store several of these histograms in a single csv file (one file for each set of random variables, one file would store ~100 histograms), and read them back later exactly as they were before storing (each histogram from the file as a single Series, all values as floats).
How can I do this? Since speed matters, is there a more efficient way than CSV files?
Later, when a new realization of a variable comes in, I would retrieve its histogram from the corresponding file and determine the bin that it "falls in". Something like this:
# Not very elegant
for bin in histogram1.index:
    if 1.0232545 in bin:
        print("It's in!")
        print(histogram1.loc[bin])
Thanks!
You are addressing two different topics here:
What is an efficient way to store multiple series?
How to determine the bin for a float from an already formed IntervalIndex?
The first part is straightforward: I would use pandas.concat() to create one big frame before saving it to CSV, or rather to parquet:
pd.concat(histograms, keys=hist_names, names=['hist_name','bin']).rename('random_variable').to_frame().to_parquet()
See .to_parquet(), this answer, and this benchmark for more.
Then when reading back, select a single histogram with
hist1 = df.loc[('hist1', slice(None)), 'random_variable']
or
grouped = df.reset_index('hist_name').groupby('hist_name')
hist1 = grouped.get_group('hist1')
The second part is already answered here.
In short, you need to extract the right edges of the IntervalIndex:
bins = hist1.index.right
Then you can find the bin for your value (or list of values) with numpy.digitize:
i = np.digitize(my_value, bins)
return_value = hist1.iloc[i]
Edit
Just found this answer about Indexing with an IntervalIndex, which also works:
return_value = hist1.loc[my_value]
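Putting both parts together, here is a small in-memory sketch with two toy histograms (names and values are placeholders; the to_parquet()/read_parquet() round trip from above would sit between the concat and the selection):
import numpy as np
import pandas as pd

hist1 = pd.Series([0.2, 0.8], index=pd.IntervalIndex.from_breaks([0.0, 1.0, 2.0]))
hist2 = pd.Series([0.5, 0.5], index=pd.IntervalIndex.from_breaks([0.0, 2.0, 4.0]))

df = (pd.concat([hist1, hist2], keys=['hist1', 'hist2'], names=['hist_name', 'bin'])
        .rename('random_variable')
        .to_frame())

# Select one histogram back out of the combined frame.
hist1_back = df.loc['hist1', 'random_variable']

# Find the bin of a new value via the right interval edges.
bins = hist1_back.index.right
print(hist1_back.iloc[np.digitize(0.7, bins)])  # -> 0.2, i.e. the (0.0, 1.0] bin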

Partition matrix into smaller matrices based on multiple values

So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity, I'll work with a much smaller matrix as an example. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). Using pandas, I've imported my matrix, called A, from an Excel sheet; it looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv(r"C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
df.set_index(["Label 1", "Label 2"], inplace=True)
print(df)
Here, we collected all the unique values in the columns "Label 1" and "Label 2" (we will loop over them later) and turned those two columns into a MultiIndex. In the df.set_index line, those columns are taken out of the data and now act as indices for your other columns. For example, in order to access the DataFrame slice from your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create your sliced DataFrame, you perform the calculations, save the results to an output file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will improve both memory consumption and performance, which may be important considering your large dataset.
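Alternatively, groupby can hand you each (Label 1, Label 2) slice directly, without looping over combinations that may not occur in the data. A sketch under the same assumptions (perform_calculations and save_results remain placeholders):
for (L1, L2), sliced_df in df.groupby(["Label 1", "Label 2"]):
    results = perform_calculations(sliced_df)
    save_results(results)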

How to join data from multiple netCDF files with xarray in Python?

I'm trying to open multiple netCDF files with xarray in Python. The files have data with the same shape, and I want to join them, creating a new dimension.
I tried to use the concat_dim argument of xarray.open_mfdataset(), but it doesn't work as expected. An example is given below, which opens two files with temperature data for 124 times, 241 latitudes and 480 longitudes:
DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases' )
da_t2m = DS.t2m
print( da_t2m )
With this code, I expected the resulting data array to have the shape (cases: 2, time: 124, latitude: 241, longitude: 480). However, its shape is (cases: 2, time: 248, latitude: 241, longitude: 480).
It creates the new dimension, but it also concatenates along 'time', so the length of the 'time' dimension becomes the sum of the two datasets' time lengths.
I was wondering whether this is an error in xarray.open_mfdataset or expected behavior, because the 'time' dimension is UNLIMITED in both datasets.
Is there a way to join data from these files directly using xarray and get the above expected return?
Thank you.
Mateus
Extending from my comment, I would try this:
def preproc(ds):
    ds = ds.assign({'stime': (['time'], ds.time)}).drop('time').rename({'time': 'ntime'})
    # we might need to tweak this a bit further, depending on the actual data layout
    return ds
DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases', preprocess=preproc)
The good thing here is that you keep the original time coordinate in stime while renaming the original dimension (time -> ntime).
If everything works well, you should get resulting dimensions as (cases, ntime, latitude, longitude).
Disclaimer: I do something similar in a loop with a final concat (which works very well), but I did not test the preprocess approach.
Thank you @AdrianTompkins and @jhamman. After your comments I realize that, due to the different time periods, I really can't get what I want with xarray alone.
My main purpose in creating such an array is to get all data for different events, with the same time duration, into one single N-D array. That way I can easily get, for example, composite fields of all events for each time (hour, day, etc.).
I'm trying to do the same as I do with NCL. Below is NCL code that works as expected (for me) on the same data:
f = addfiles( (/"eraINTERIM_t2m_201812.nc", "eraINTERIM_t2m_201901.nc"/), "r" )
ListSetType( f, "join" )
temp = f[:]->t2m
printVarSummary( temp )
The final result is an array with 4 dimensions, with the new one automatically named ncl_join.
However, NCL doesn't respect the time axis: it joins the arrays and gives the resulting time axis the coordinates of the first file, so the time axis becomes useless.
However, as @AdrianTompkins rightly said, the time periods are different and xarray can't join data like this. So, to create such an array in Python with xarray, I think the only way is to delete the time coordinate from the arrays; the time dimension would then have only integer indexes.
The array given by xarray works as @AdrianTompkins said in his small example. Since it keeps the time coordinates for all merged data, I think the xarray solution is the correct one compared with NCL. But now I think that computing composites (taking the same example given above) wouldn't be as easy as it seems with NCL.
In a small test, I print two values from the merged xarray array with
print( da_t2m[ 0, 0, 0, 0 ].values )
print( da_t2m[ 1, 0, 0, 0 ].values )
which results in
252.11412
nan
For the second case there is no data at the first time, as expected.
UPDATE: all the answers helped me understand this problem better, so I am adding an update to also thank @kmuehlbauer for his answer, which shows that his code gives the expected array.
Again, thank you all for help!
Mateus
The result makes sense if the times are different.
To simplify it, forget about the lat-lon dimensions for a moment and imagine you have two files that are simply data at two timeslices. The first has data at timesteps 1 and 2, and the second at timesteps 3 and 4. You can't create a combined dataset with a time dimension that only spans 2 timeslices; the time dimension variable has to contain the times 1,2,3,4. So if you say you want a new dimension "cases", the data is combined as a 2D array and would look like this:
times: 1,2,3,4
cases: 1,2
data:
            time:   1    2    3    4
    cases 1:       x1   x2    -    -
    cases 2:        -    -   x3   x4
Think of the netCDF file that would be the equivalent: the time dimension has to span the range of values present in both files. The only way you could combine two files and get (cases: 2, time: 124, latitude: 241, longitude: 480) would be if both files had the same time, lat AND lon values, i.e. pointed to exactly the same region in time-lat-lon space.
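To make the alignment behaviour concrete, here is a small self-contained sketch with synthetic data (two toy "files" with non-overlapping times); it only illustrates the point above and is not the original data:
import pandas as pd
import xarray as xr

ds1 = xr.Dataset({"t2m": ("time", [1.0, 2.0])},
                 coords={"time": pd.date_range("2018-12-01", periods=2)})
ds2 = xr.Dataset({"t2m": ("time", [3.0, 4.0])},
                 coords={"time": pd.date_range("2019-01-01", periods=2)})

# Concatenating along a new 'cases' dimension aligns on 'time',
# so the result spans all four times and is padded with NaN.
print(xr.concat([ds1, ds2], dim="cases").t2m.shape)  # (2, 4)

# Dropping the 'time' coordinate first makes the join purely positional,
# giving the (cases, time) shape asked for in the question
# (use .drop('time') on older xarray versions).
print(xr.concat([ds.drop_vars("time") for ds in (ds1, ds2)], dim="cases").t2m.shape)  # (2, 2)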
PS: Somewhat off-topic for the question, but if you are just starting a new analysis, why not switch to the new-generation, higher-resolution ERA5 reanalysis, which is now available back to 1979 too (and will eventually be extended further back)? You can download it straight to your desktop with the Python API scripts from here:
https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset

Pandas DataFrames merging through iterations. How to avoid lists and rows of headers

I code just once in a while and am pretty basic at the moment. This might be a silly question, but it has had me stuck for a bit too long now.
Background
I have a function (get_profiles) that plots points every 5m along one transect line (100m long) and extracts elevation (from a geotiff).
The arguments are:
dsm (digital surface model)
transect_file (geopackage, holds many LineStrings with different transect_ID)
transect_id (int, extracted from transect_file)
step (int, number of meters to extract elevation along transect lines)
The output for one transect line is a dataframe like in the picture, which is what I expected, and I like it!
However, the big issue is when I iterate the function over the transect_ids (the transect_file has 10 Shapely LineStrings), like this:
tr_list = np.arange(1, transect_file.shape[0]-1)
geodb_transects = []
for i in tr_list:
    temp = get_profiles(dsm, transect_file, i, 5)
    geodb_transects.append(temp)
I get a list. The error might be here, but I don't know how to do it another way.
type(geodb_transects)
output:list
And, what's worse, I get headers (distance, z, tr_id, date) every time a new iteration starts.
How do I get a clean pandas DataFrame, just like the output of one iteration (20 rows), but with all the tr_id chunks of 20 rows each stacked on top of each other and without repeated headers?
If your output is a DataFrame then you’re simply looking to concatenate each incremental DataFrame onto a growing DataFrame.
It’s not the most efficient approach, but something like:
import pandas

df = pandas.DataFrame()
for i in range(7):
    df = pandas.concat([df, df_ret_func(i)])  # df_ret_func stands in for get_profiles(...)
You may also be interested in the from_records function if you have a list of elements that are all records of the same form and can be converted into the rows of a DataFrame.
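Since the question already collects the per-transect DataFrames in a list, the usual pattern is to concatenate that list once at the end; this avoids both the repeated headers and the cost of growing a DataFrame inside the loop. A sketch reusing the names from the question ("profiles.csv" is just a placeholder output path):
import pandas as pd

frames = [get_profiles(dsm, transect_file, i, 5) for i in tr_list]
geodb_transects = pd.concat(frames, ignore_index=True)  # one DataFrame, one header
geodb_transects.to_csv("profiles.csv", index=False)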

Append multiple columns into two columns python

I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
import numpy as np

with open("original_file.csv") as ofile:
    stack_vec = []
    next(ofile)  # skip the header row
    for line in ofile:
        columns = line.split(',')  # get all the columns
        for i in range(len(columns)):
            stack_vec.append(columns[i])
np.savetxt("converted.csv", stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.
If I got this correctly, you have a CSV with 96 rows and 100 columns and want to stack it day after day into one vector with 9600 entries, right?
An easy approach would be to use numpy:
import numpy as np
x = np.genfromtxt('original_file.csv', delimiter=',')
data = x.ravel(order ='F')
Note that numpy is a third-party library, but it is the go-to library for math.
The genfromtxt line reads the CSV into an ndarray, which is like a matrix (even though it behaves differently in mathematical operations).
Then ravel flattens it into a vector. order='F' makes it stack the columns on top of each other, i.e. day after day. (Leave it at the default if you instead want to go time point after time point across the days.)
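A tiny illustration of the two orders (toy numbers, two days as columns with three readings each):
import numpy as np

x = np.array([[1, 4],
              [2, 5],
              [3, 6]])
print(x.ravel(order='F'))  # [1 2 3 4 5 6] -> day 1 first, then day 2
print(x.ravel())           # [1 4 2 5 3 6] -> reading by reading across the days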
For your date problem, see "How can I make a python numpy arange of datetime"; I guess I can't give a better example than that.
If you have these two arrays, you can ensure the shape with x.reshape(9600, 1) and then stack them with np.concatenate([x, dates], axis=1), with dates being your date vector.
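For completeness, a pandas-based sketch of the whole conversion (assumptions: the header of every column parses as a date and each column holds that day's 96 readings in order):
import pandas as pd

wide = pd.read_csv("original_file.csv")

# One row per reading, stacked column after column, i.e. day after day.
long = wide.melt(var_name="date", value_name="temperature")

# Add the 15-minute offset within each day to build the datetime column.
offset = pd.to_timedelta(long.groupby("date").cumcount() * 15, unit="min")
long["datetime"] = pd.to_datetime(long["date"]) + offset

long[["datetime", "temperature"]].to_csv("converted.csv", index=False)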
