I'm working with histograms stored as pandas Series, each representing the realizations of a random variable from an observation set. I'm looking for an efficient way to store them and read them back.
The histogram's bins are the index of the Series. For example:
histogram1:
(-1.3747106810983318, 3.529160051186781] 0.012520
(3.529160051186781, 8.433030783471894] 0.013830
(8.433030783471894, 13.336901515757006] 0.016495
(13.336901515757006, 18.24077224804212] 0.007194
(18.24077224804212, 23.144642980327234] 0.041667
(23.144642980327234, 28.048513712612344] 0.000000
I would like to store several of these histograms in a single csv file (one file per set of random variables; one file would hold ~100 histograms) and read them back later exactly as they were before storing (each histogram from the file as a single Series, with all values as floats).
How can I do this? Since speed matters, is there a more efficient way than csv files?
Then, when a new realization of a variable comes in, I would retrieve its histogram from the corresponding file and determine which bin it "falls in". Something like this:
# Not very elegant
for bin in histogram1.index:
    if 1.0232545 in bin:
        print("It's in!")
        print(histogram1.loc[bin])
Thanks!
You are addressing two different topics here:
What is an efficient way to store multiple series?
How to determine the bin for a float from an already formed IntervalIndex?
The first part is straightforward. I would use pandas.concat() to build one big frame before saving to csv, or rather:
pd.concat(histograms, keys=hist_names, names=['hist_name','bin']).rename('random_variable').to_frame().to_parquet()
See .to_parquet(), this answer, and this benchmark for more details.
Then when reading back, select a single histogram with
hist1 = df.loc[pd.IndexSlice['hist1', :], 'random_variable']
or
grouped = df.reset_index('hist_name').groupby('hist_name')
hist1 = grouped.get_group('hist1')
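For completeness, here is a minimal sketch of the concat-and-select round trip (histogram1/histogram2 and the names are placeholders, and the on-disk format is left aside here):
import pandas as pd

# build one frame from several histogram Series (each indexed by an IntervalIndex)
df = (pd.concat([histogram1, histogram2],
                keys=['hist1', 'hist2'],
                names=['hist_name', 'bin'])
        .rename('random_variable')
        .to_frame())

# select one histogram and drop the outer level to recover the original Series
hist1 = df.loc[pd.IndexSlice['hist1', :], 'random_variable'].droplevel('hist_name')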
The second part is already answered here.
In short, you need to extract the right bin edges from the IntervalIndex:
bins = hist1.index.right
Then you can find the bin for your value (or a list of values) with numpy.digitize (passing right=True so that a value lying exactly on a right edge falls into its closed-right interval):
i = np.digitize(my_value, bins, right=True)
return_value = hist1.iloc[i]
Edit
Just found this answer about Indexing with an IntervalIndex, which also works:
return_value = hist1.loc[my_value]
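As a quick, self-contained check of both lookups (toy data; the number of bins and the test value are arbitrary):
import numpy as np
import pandas as pd

# build a toy normalized histogram with an IntervalIndex
samples = np.random.default_rng(0).normal(10, 5, 1000)
counts, edges = np.histogram(samples, bins=6)
hist1 = pd.Series(counts / counts.sum(), index=pd.IntervalIndex.from_breaks(edges))

my_value = 1.0232545

# via numpy.digitize on the right bin edges
bins = hist1.index.right
i = np.digitize(my_value, bins, right=True)
print(hist1.iloc[i])

# via direct indexing of the IntervalIndex
print(hist1.loc[my_value])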
Related
I need to use pd.cut on a dask dataframe.
This answer indicates that map_partitions will work by passing pd.cut as the function.
It seems that map_partitions passes only one partition at a time to the function. However, pd.cut will need access to an entire column of my df in order to create the bins. So, my question is: will map_partitions in this case actually operate on the entire dataframe, or am I going to get incorrect results with this approach?
In your question you correctly identify why the bins should be provided explicitly.
By specifying the exact bin cuts (either based on some calculation or external reasoning), you ensure that what dask does is comparable across partitions.
# this does not guarantee comparable cuts
ddf['a'].map_partitions(pd.cut)
# this ensures the cuts are as per the specified bins
ddf['a'].map_partitions(pd.cut, bins)
If you want to generate bins in an automatic way, one way is to get the min/max for the column of interest and generate the bins with np.linspace:
import dask
import numpy as np

# note that computation is needed to give
# actual (not delayed) values to np.linspace
bmin, bmax = dask.compute(ddf['a'].min(), ddf['a'].max())

# specify the number of desired cuts here
bins = np.linspace(bmin, bmax, num=123)
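Putting the two parts together (ddf and the column name 'a' are assumptions carried over from the snippets above):
# cut every partition with the same, explicitly computed bins
binned = ddf['a'].map_partitions(pd.cut, bins)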
I am trying to bin lines of data according to the first element of each line.
My data has this shape:
[[Temperature, value0, value1, ... value249]
[Temperature, ...
]
That is: the first element of each line is a temperature value, and the rest of the line is a time trace of a signal.
I would like to make an array of this shape:
[Temperature-bin,[[values]
[values]
... ]]
Next Temp.-bin, [[values]
[values]
... ]]
...
]
where the lines from the original data-array should be sorted in the subarray of the respective temperature bin.
data= np.array([values]) # shape is [temp+250 timesteps,400K]
temp=data[0]
start=23000
end=380000
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])
binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)
binneddata=np.empty([len(bincenters),2])
for i in np.arange(len(temp)):
    binneddata[i] = [bincenters[i], np.array([])]
I was hoping to get a result array as described above, where every line consists of the mean temperature of the bin (bincenters[i]) and an array of time traces. Python gives me an error about "setting an array element with a sequence".
I could create this kind of array, consisting of different data types, in another script before, but there I had to define it explicitly, which is not possible in this case because I'm handling files on the scale of several 100K lines of data. At the same time, I would like to use as many built-in functions and as few loops as possible, because my computer already takes a while to process files of that size.
Thank you for your input,
lepakk
First: thanks to kwinkunks for the hint to use a pandas DataFrame.
I found a solution using this feature.
The binning is now done like this:
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])
binsize=1
bincenters=np.array(np.arange(np.round(tempmin),np.round(tempmax)+1,binsize))
lowerbinedges=np.array(bincenters-binsize/2)
higherbinedges=np.array(bincenters+binsize/2)
allbinedges=np.append(lowerbinedges,higherbinedges[-1])
temp_pd=pd.Series(temp[start:end])
traces=pd.Series(list(data[start:end,0:250]))
tempbins=pd.cut(temp_pd,allbinedges,labels=bincenters)
df=pd.concat([temp_pd,tempbins,traces], keys=['Temp','Bincenter','Traces'], axis=1)
by defining bins (in this case even-sized). The variable "tempbins" has the same shape as temp (the "raw" temperature) and assigns every line of data to a certain bin.
The actual analysis is then extremely short. Starting with:
rf=pd.DataFrame({'Bincenter': bincenters})
the resultframe ("rf") starts with the bincenters (as the x-axis in a plot later), and simply adds columns for the desired results.
With
df[df.Bincenter==xyz]
I can select only those data lines from df, that I want to have in the selected bin.
In my case, I am not interested in the actual time traces but in their sum or average, so I use lambda functions that run through the rows of rf and, for every row, pick out the rows of df that have the same value in "Bincenter".
rf['Binsize']=rf.apply(lambda row: len(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
rf['Trace_sum']=rf.apply(lambda row: sum(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
With those, two more columns are added to the result frame rf: the number of lines in each bin and the sum of the traces.
I then performed some fits of the traces in rf.Trace_sum, which I did not do in pandas.
Still, the dataframe was very useful here. I used odr for fitting, like this:
for i in binnumber:
    fitdata = odr.Data(time[fitstart:], rf.Trace_sum.values[i][fitstart:])
    # ... some more fit stuff here ...
and saved the fitresults in
lifetimefits=pd.DataFrame({'lifetime': fitresult[:,1], 'sd_lifetime':fitresult[:,4]})
and finally added them in the resultframe with
rf=pd.concat([rf,lifetimefits],axis=1)
rf[['Bincenter','Binsize','lifetime','sd_lifetime']].to_csv('results.csv', header=True, index=False)
which makes an output like
Out[78]:
Bincenter Binsize ... lifetime sd_lifetime
0 139.0 4102 ... 38.492028 2.803211
1 140.0 4252 ... 33.659729 2.534872
2 141.0 3785 ... 31.220312 2.252104
3 142.0 3823 ... 29.391562 1.783890
4 143.0 3808 ... 40.422578 2.849545
I hope this explanation helps others avoid wasting time trying this with numpy. Thanks again to kwinkunks for the very helpful advice to use a pandas DataFrame.
Best,
lepakk
So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity, I'll work with a much smaller matrix as an example. The 7 columns are (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). Using pandas, I've imported my matrix, called A, from an excel sheet, and it looks like this:
What I need to do is partition it based on both labels (i.e. one matrix with all the 13G + Aa rows together, another that is 14G + Aa, and another that is 14G + Ab -- which would leave me with 3 separate 2x7 matrices). The reason is that I need to run a bunch of statistics on the "Marker" column of each individual matrix (e.g. in this example, break the 6 "Marker" numbers into three sets of 2 and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices in the real data set, I was trying to figure out some way to label them something like M1, M2, ..., M500 (or whatever number it ends up being), so that later I can use loops to apply the statistics to each individual matrix without having to write it out 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv(r"C:\path\cancerdata.csv")

labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()

# move the two label columns into the index, forming a MultiIndex
df.set_index(["Label 1", "Label 2"], inplace=True)
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2" (we will use them to loop over the label combinations later), and moved both label columns into the index with df.set_index - now they act as indices for your other columns. For example, in order to access the slice of your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
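For example, if the numeric column you want statistics on is called "Marker" (the column name is taken from the question and is otherwise an assumption):
# basic statistics for one label combination
marker = sliced_df["Marker"]
print(marker.mean())
print(marker.var())
print(marker.describe())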
Lastly, instead of saving each sliced DataFrame into a list or dictionary and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create a sliced DataFrame, you perform the calculations, save the results to an output file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
I code just once in a while and I am still quite basic. This might be a silly question, but it has kept me stuck for a bit too long now.
Background
I have a function (get_profiles) that plots points every 5m along one transect line (100m long) and extracts elevation (from a geotiff).
The arguments are:
dsm (digital surface model)
transect_file (geopackage, holds many LineStrings with different transect_ID)
transect_id (int, extracted from transect_file)
step (int, number of meters to extract elevation along transect lines)
The output for one transect line is a dataframe like in the picture, which is what I expected, and I like it!
However, the big issue is when I iterate the function over the transect_ids (the transect_file has 10 Shapely LineStrings), like this:
tr_list = np.arange(1, transect_file.shape[0]-1)

geodb_transects = []
for i in tr_list:
    temp = get_profiles(dsm, transect_file, i, 5)
    geodb_transects.append(temp)
I get a list. The error might be here, but I don't know how to do it another way.
type(geodb_transects)
output:list
And, what's worse, I get headers (distance, z, tr_id, date) every time a new iteration starts.
How do I get a clean pandas DataFrame, just like the output of one iteration (20 rows), but with all the tr_id chunks of 20 rows each stacked together and without repeated headers?
If your output is a DataFrame then you’re simply looking to concatenate the incremental DataFrame into some growing DataFrame.
It’s not the most efficient, but something like:
import pandas

df = pandas.DataFrame()
for i in range(7):
    df = pandas.concat([df, df_ret_func(i)])
You may also be interested in the from_records function if you have a list of elements that are all records of the same form and can be converted into the rows of a DataFrame.
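Since the question already collects the pieces in a list (geodb_transects), a usually faster variant is to concatenate once at the end instead of growing the DataFrame inside the loop:
import pandas

# one concat over the whole list; ignore_index drops the per-chunk row labels
result = pandas.concat(geodb_transects, ignore_index=True)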
I have many files in a folder that look like this one: a header-less table whose fields include the http address of a site, the plugin used (third field, e.g. adblock), and several numeric metrics (fourth field onward).
I'm trying to build a dictionary for the data. I want it to have 2 keys: the first one is the http address and the second is the third field (the plugin used), like adblock. The values refer to the different metrics, and my intention is to compute, for each site and plugin, the mean, median and variance of each metric once the dictionary has been built. For example, for the mean, my intention is to consider all the 4th-field values in the file, and so on. I tried to write this code, but, first of all, I'm not sure that it is correct.
[screenshot of my dictionary-building code]
I read other posts, but none of them solved my problem, since they either deal with only one key or they don't show how to access the different values inside the dictionary to compute the mean, median and variance.
The problem is simple: assuming the dictionary implementation is OK, how must I access the different values for key1: www.google.it -> key2: adblock?
Any kind of help is appreciated, and I'm happy to provide any other details.
You can do what you want using a dictionary, but you should really consider using the Pandas library. This library is centered around a tabular data structure called "DataFrame" that excels in column-wise and row-wise calculations such as the ones that you seem to need.
To get you started, here is the Pandas code that reads one text file using the read_fwf() method. It also displays the mean and variance for the fourth column:
# import the Pandas library:
import pandas as pd
# Read the file 'table.txt' into a DataFrame object. Assume
# a header-less, fixed-width file like in your example:
df = pd.read_fwf("table.txt", header=None)
# Show the content of the DataFrame object:
print(df)
# Print the fourth column (zero-indexed):
print(df[3])
# Print the mean for the fourth column:
print(df[3].mean())
# Print the variance for the fourth column:
print(df[3].var())
There are different ways of selecting columns and rows from a DataFrame object. The square brackets [ ] in the previous examples selected a column in the data frame by column number. If you want to calculate the mean of the fourth column only from those rows that contain adblock in the third column, you can do it like so:
# Print those rows from the data frame that have the value 'adblock'
# in the third column (zero-indexed):
print(df[df[2] == "adblock"])
# Print only the fourth column (zero-indexed) from that data frame:
print(df[df[2] == "adblock"][3])
# Print the mean of the fourth column from that data frame:
print(df[df[2] == "adblock"][3].mean())
EDIT:
You can also calculate the mean or variance for more than one column at the same time:
# Use a list of column numbers to calculate the mean for all of them
# at the same time:
l = [3, 4, 5]
print(df[l].mean())
END EDIT
If you want to read the data from several files and do the calculations on the concatenated data, you can use the concat() function. It takes a list of DataFrame objects and concatenates them (by default, row-wise). Use the following lines to create a DataFrame from all *.txt files in your directory:
import glob

df = pd.concat([pd.read_fwf(file, header=None) for file in glob.glob("*.txt")],
               ignore_index=True)
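Going one step further toward the two-key structure you describe (site, plugin), here is a hedged sketch that groups on the first and third columns (0 and 2, zero-indexed) and computes the statistics of the fourth column for every combination at once:
# mean, median and variance of column 3 for each (site, plugin) pair
stats = df.groupby([0, 2])[3].agg(['mean', 'median', 'var'])
print(stats)

# the row for www.google.it with adblock (assuming that pair occurs in the data)
print(stats.loc[("www.google.it", "adblock")])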