Accelerate a pandas operation involving several dataframes - python

Hello everyone
For a school project, I am stuck on the duration of an operation on a pandas DataFrame.
I have one DataFrame df whose shape is (250 000 000, 200). It contains the values of variables describing the behaviour of sensors on a machine.
The rows are organized by 'Cycle' (every time the machine begins a new cycle, this variable is incremented by one), and within a cycle, 'CycleTime' gives the position of the row.
In the 'mean' DataFrame, I compute the mean of each variable grouped by 'CycleTime'.
The 'anomaly_matrix' DataFrame represents the global anomaly of each cycle, i.e. the sum of the squared differences between each row belonging to the cycle and the mean of the corresponding CycleTime.
An example of my code is below:
df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2], 'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1], 'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25], 'variable2':[1, 2, 1, 1, 2, 2, 1, 2], 'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})
mean = df.drop(['Cycle'], axis = 1).groupby("CycleTime").agg('mean')
anomali_matrix = df.drop(['CycleTime'], axis = 1).groupby("Cycle").agg('mean')
anomaly_matrix = anomali_matrix - anomali_matrix
for index, row in df.iterrows():
    cycle = row["Cycle"]
    time = row["CycleTime"]
    anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2
>>>anomaly_matrix
variable1 variable2 variable3
Cycle
0 0.047014 0.25 1.116111e+07
1 0.023681 0.25 3.917778e+06
2 0.018889 0.00 2.267778e+06
This calculation takes 6 hours for my (250 000 000, 200) DataFrame, and it is due to the line anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2.
I tried to improve it with an apply function, but I did not manage to use the other DataFrames inside that apply. The same happened when trying to vectorize with pandas.
Do you have any idea how to accelerate this process? Thanks.

You can use:
df1 = df.set_index(['Cycle', 'CycleTime'])
mean = df1.sub(df1.groupby('CycleTime').transform('mean'))**2
df2 = mean.groupby('Cycle').sum()
print (df2)
variable1 variable2 variable3
Cycle
0 0.047014 0.25 1.116111e+07
1 0.023681 0.25 3.917778e+06
2 0.018889 0.00 2.267778e+06
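The speed-up comes from replacing the row-by-row iterrows loop with whole-column operations: groupby('CycleTime').transform('mean') broadcasts each CycleTime mean back onto every row in one vectorized pass, and the squared deviations are then summed per Cycle. As a rough sanity check on the toy data above (a sketch, not a benchmark), the vectorized result matches the original loop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2],
                   'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1],
                   'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25],
                   'variable2': [1, 2, 1, 1, 2, 2, 1, 2],
                   'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})

# vectorized anomaly matrix: squared deviation from the per-CycleTime mean, summed per Cycle
df1 = df.set_index(['Cycle', 'CycleTime'])
vectorized = (df1.sub(df1.groupby('CycleTime').transform('mean')) ** 2).groupby('Cycle').sum()

# the original iterrows loop on the same toy data, for comparison
mean = df.drop(columns='Cycle').groupby('CycleTime').mean()
loop = vectorized * 0.0  # zero-filled frame with the same shape and labels
for _, row in df.iterrows():
    c, t = int(row['Cycle']), int(row['CycleTime'])
    loop.loc[c] += (row.drop(['Cycle', 'CycleTime']) - mean.loc[t]) ** 2

print(np.allclose(vectorized, loop))  # True: same numbers, without per-row Python overhead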

Related

Python - Summing values and number of duplicates

I have a CSV file looking like this: part of the data.
X and Y are my pixel coordinates.
I need to filter the ADC column to keep only the TDC values (the column also contains 0 values), and then sum the Energy value for every unique pixel, so for every x=0 y=0, x=0 y=1, x=0 y=2... up to x=127 y=127. In another column I need the number of duplicates for each pixel coordinate that occurs (i.e. the number of rows that contribute to the sum in the Energy column).
I don't know how to write the appropriate conditions for this kind of task. I would appreciate any kind of help.
The following StackOverflow question and answers might help you out:
Group dataframe and get sum AND count?
But here is some code for your case which might be useful, too:
# import the pandas package, for doing data analysis and manipulation
import pandas as pd
# create a dummy dataframe using data of the type you are using (I hope)
df = pd.DataFrame(
    data = {
        "X": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        "Y": [0, 0, 1, 1, 1, 1, 1, 1, 2, 2],
        "ADC": ["TDC", "TDC", "TDC", "TDC", "TDC", 0, 0, 0, "TDC", "TDC"],
        "Energy": [0, 0, 1, 1, 1, 2, 2, 2, 3, 3],
        "Time": [1.2, 1.2, 2.3, 2.3, 3.6, 3.61, 3.62, 0.66, 0.67, 0.68],
    }
)
# use pandas' groupby method and aggregation methods to get the sum of the energy in every unique combination of X and Y, and the number of times those combinations appear
df[df["ADC"] == "TDC"].groupby(by=["X","Y"]).agg({"Energy": ['sum','count']}).reset_index()
The result I get from this in my dummy example is:
X Y Energy
sum count
0 0 0 0 2
1 0 1 3 3
2 0 2 6 2
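As a small follow-up, the dict passed to agg produces a two-level column header (sum/count under Energy), which can be awkward downstream. A sketch that produces flat column names instead, using named aggregation (the names Energy_sum and Pixel_count are just my choice), continuing from the dummy df above:
# df as constructed above
result = (
    df[df["ADC"] == "TDC"]
    .groupby(["X", "Y"])["Energy"]
    .agg(Energy_sum="sum", Pixel_count="count")
    .reset_index()
)
print(result)
#    X  Y  Energy_sum  Pixel_count
# 0  0  0           0            2
# 1  0  1           3            3
# 2  0  2           6            2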

Why is this dictionary comprehension so slow? Please suggest way to speed it up

Hi, please help me either speed up this dictionary comprehension, offer a better way to do it, or help me gain a higher understanding of why it is so slow internally (for example, does the calculation slow down as the dictionary grows in memory?). I'm sure there must be a quicker way without learning some C!
classes = {i : [1 if x in df['column'].str.split("|")[i] else 0 for x in df['column']] for i in df.index}
with the output:
{1:[0,1,0...0],......, 4000:[0,1,1...0]}
from a df like this:
data_ = {'drugbank_id': ['DB06605', 'DB06606', 'DB06607', 'DB06608', 'DB06609'],
         'drug-interactions': ['DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06606|DB06607|DB06608|DB06609',
                               'DB06606|DB06607',
                               'DB06608']
         }
pd.DataFrame(data=data_, index=range(0, 5))
I am performing it on a df with 4000 rows; the column df['column'] contains a string of IDs separated by |. The number of IDs in each row that needs splitting varies from 1 to 1000, and this is done for all 4000 indexes. I tested it on the head of the df and it seemed quick enough, but now the comprehension has been running for 24 hours. Maybe it is just the sheer size of the job, but I feel like it could be sped up. At this point I want to stop it and re-engineer, but I am scared that will set me back without much gain in speed, so before doing that I wanted to get some thoughts, ideas and suggestions.
Beyond the 4000x4000 size, I suspect that using the Series and Index objects is another problem and that I would be better off using lists, but given the size of the task I am not sure how much speed that would gain, and maybe I am better off using some other method such as pd.apply(df, f(write line by line to json)). I am not sure - any help and education appreciated, thanks.
Here is one approach:
import pandas as pd
# create data frame
df = pd.DataFrame({'idx': [1, 2, 3, 4], 'col': ['1|2', '1|2|3', '2|3', '1|4']})
# split on '|' to convert string to list
df['col'] = df['col'].str.split('|')
# explode to get one row for each list element
df = df.explode('col')
# create dummy ID (this will become True in the final result)
df['dummy'] = 1
# use pivot to create dense matrix
df = (df.pivot(index='idx', columns='col', values='dummy')
        .fillna(0)
        .astype(int))
# convert each row to a list
df['test'] = df.apply(lambda x: x.to_list(), axis=1)
print(df)
col 1 2 3 4 test
idx
1 1 1 0 0 [1, 1, 0, 0]
2 1 1 1 0 [1, 1, 1, 0]
3 0 1 1 0 [0, 1, 1, 0]
4 1 0 0 1 [1, 0, 0, 1]
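As for why the original comprehension is so slow: the inner expression df['column'].str.split("|")[i] re-splits the entire 4000-row column on every one of the 4000 outer iterations, so the splitting work alone is quadratic in the number of rows, and the in test then scans a list. A minimal sketch that keeps the comprehension's logic but hoists the split out of the loop and uses sets for membership (assuming df['column'] is the pipe-separated column from the question):
# split once instead of once per outer iteration, and use sets for O(1) membership tests
split_ids = df['column'].str.split('|')                   # computed a single time
id_sets = {i: set(ids) for i, ids in split_ids.items()}   # one set per row label
values = list(df['column'])                               # the values tested for membership

classes = {i: [1 if x in id_sets[i] else 0 for x in values] for i in df.index}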
The output you want can be achieved using dummies. We split the column, stack, and use max to turn it into dummy indicators based on the original index. Then we use reindex to get it in the order you want based on the 'drugbank_id' column.
Finally, to get the dictionary you want, we transpose and use to_dict:
classes = (pd.get_dummies(df['drug-interactions'].str.split('|', expand=True).stack())
             .max(level=0)
             .reindex(df['drugbank_id'], axis=1)
             .fillna(0, downcast='infer')
             .T.to_dict('list'))
print(classes)
{0: [1, 0, 0, 0, 0], #Has DB06605, No DB06606, No DB06607, No DB06608, No DB06609
1: [1, 0, 0, 0, 0],
2: [0, 1, 1, 1, 1],
3: [0, 1, 1, 0, 0],
4: [0, 0, 0, 1, 0]}
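One aside: on recent pandas versions max(level=0) is deprecated in favour of an explicit groupby, so the same pipeline would swap that one call (a sketch, otherwise unchanged):
classes = (pd.get_dummies(df['drug-interactions'].str.split('|', expand=True).stack())
             .groupby(level=0).max()
             .reindex(df['drugbank_id'], axis=1)
             .fillna(0, downcast='infer')
             .T.to_dict('list'))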

How to iterate through 5 rows, get the max per row of 4 columns, and set the value of a different column accordingly

I have a DataFrame df1. I've made a subset df2 of 4 columns from df1 and created a list of 5 items containing the max value of each row. Now, whichever column the max value for that row is in (i.e. column 1, 2, 3, or 4) determines the int label (i.e. 1, 2, 3, or 4) in the label column of df1.
The df2 exists because some of the other columns, not including those 4, have a higher value than the 4 to compare, which obviously screws that up. I'm starting to think it should be a list or a Series?
code
df1= pd.DataFrame({'x_1': [xvalues[0][0], xvalues[0][1], xvalues[0][2],
xvalues[0][3], xvalues[0][4]],
'x_2': [yvalues[0][0], yvalues[0][1], yvalues[0][2],
yvalues[0][3], yvalues[0][4]],
'True labels': [truelabels[0], truelabels[1],
truelabels[2],truelabels[3], truelabels[4]],
'g11': [classifier1[0][0],classifier1[0][1],
classifier1[0][2],classifier1[0][3],
classifier1[0][4],],
'g12': [classifier1[1][0],classifier1[1][1],
classifier1[1][2],classifier1[1][3],
classifier1[1][4],],
'g13': [classifier1[2][0],classifier1[2][1],
classifier1[2][2],classifier1[2][3],
classifier1[2][4],],
'g14': [classifier1[3][0],classifier1[3][1],
classifier1[3][2],classifier1[3][3],
classifier1[3][4],],
'L1': [2, 5, 6, 7, 8],
'g21': [classifier2[0][0],classifier2[0][1],
classifier2[0][2],classifier2[0][3],
classifier2[0][4],],
'g22': [classifier2[1][0],classifier2[1][1],
classifier2[1][2],classifier2[1][3],
classifier2[1][4],],
'g23': [classifier2[2][0],classifier2[2][1],
classifier2[2][2],classifier2[2][3],
classifier2[2][4],],
'g24': [classifier2[3][0],classifier2[3][1],
classifier2[3][2],classifier2[3][3],
classifier2[3][4],],
'L2': [0, 0, 0, 0, 0],
'g31': [classifier3[0],classifier3[0],
classifier3[0],classifier3[0],
classifier3[0],],
'g32': [classifier3[1][0],classifier3[1][1],
classifier3[1][2],classifier3[1][3],
classifier3[1][4],],
'g33': [classifier3[2][0],classifier3[2][1],
classifier3[2][2],classifier3[2][3],
classifier3[2][4],],
'g34': [classifier3[3][0],classifier3[3][1],
classifier3[3][2],classifier3[3][3],
classifier3[3][4],],
'L3': [0, 0, 0, 0, 0],
'Assigned L':[1, 1, 1, 1,1]}, index =['Datapoint1', 'D2', 'D3',
'D4', 'D5'])
df2= df1[['g11','g12','g13','g14']]
hdf = df2.max(axis = 1)
g11 = df1['g11'].to_list()
g12 = df1['g12'].to_list()
g13 = df1['g13'].to_list()
g14 = df1['g14'].to_list()
for item, label in zip(hdf, table['L1']):
    if hdf[item] in g11:
        df1['L1'][label] = labels[0]
        print(item, label)
    elif hdf[item] in g12:
        df1['L1'][label] = labels[1]
        print(item, label)
    elif hdf[item] in g13:
        df1['L1'][label] = labels[2]
        print(item, label)
    elif hdf[item] in g14:
        df1['L1'][label] = labels[3]
        print(item, label)
I have tried using .loc and .at, but when they didn't work I just scrapped them and tried something else; maybe those approaches would be better? This is where I'm at so far.
The error is coming from the for loop over hdf.
The issue I'm having is "cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.0311272081] of <class 'float'>".
I don't think the other values in the DataFrame are relevant; they are just there so people know I have made one. The 5 relevant columns in the DataFrame are g11, g12, g13, g14 and L1.
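No answer is shown for this question, but a minimal sketch of the described goal, picking whichever of g11..g14 holds each row's maximum and writing an integer label into L1, could look like this (the 1-4 label values and the mapping are assumptions based on the question text):
# hypothetical mapping from classifier column to integer label
cols = ['g11', 'g12', 'g13', 'g14']
label_for = {'g11': 1, 'g12': 2, 'g13': 3, 'g14': 4}

# idxmax(axis=1) returns, for each row, the name of the column holding the row maximum
df1['L1'] = df1[cols].idxmax(axis=1).map(label_for)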

How can I properly use a Pandas Dataframe with a multiindex that includes Intervals?

I'm trying to slice into a DataFrame that has a MultiIndex composed of an IntervalIndex and a regular Index. Example code:
from pandas import Interval as ntv
df = pd.DataFrame.from_records([
{'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
{'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))
Looks like this:
E var1
ntv id
(0, 10] 1 1 0.1
(0, 12] 2 0 0.5
What I would like to do is slice into the DataFrame at a specific value and return all rows whose interval contains that value. Ex:
df.loc[4]
should return (trivially)
E var1
id
1 1 0.1
2 0 0.5
The problem is I keep getting a TypeError about the index, and the docs show a similar operation (but on a single-level index) that does produce what I'm looking for.
TypeError: only integer scalar arrays can be converted to a scalar index
I've tried many things, nothing seems to work normally. I could include the id column inside the dataframe, but I'd rather keep my index unique, and I would constantly be calling set_index('id').
I feel like either a) I'm missing something about MultiIndexes or b) there is a bug / ambiguity with using an IntervalIndex in a MultiIndex.
Since we are speaking of intervals, there is a method called get_loc that finds the rows whose interval contains the value. To show what I mean:
from pandas import Interval as ntv
df = pd.DataFrame.from_records([
{'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
{'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))
df.iloc[(df.index.get_level_values(0).get_loc(4))]
E var1
ntv id
(0, 10] 1 1 0.1
(0, 12] 2 0 0.5
df.iloc[(df.index.get_level_values(0).get_loc(11))]
E var1
ntv id
(0, 12] 2 0 0.5
This also works if you have multiple rows of data for one interval, i.e.
df = pd.DataFrame.from_records([
{'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
{'id': 3, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
{'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))
df.iloc[(df.index.get_level_values(0).get_loc(4))]
E var1
ntv id
(0, 10] 1 1 0.1
3 1 0.1
(0, 12] 2 0 0.5
If you time this against a list comprehension, this approach is way faster for large dataframes, i.e.
ndf = pd.concat([df]*10000)
%%timeit
ndf.iloc[ndf.index.get_level_values(0).get_loc(4)]
10 loops, best of 3: 32.8 ms per loop
%%timeit
intervals = ndf.index.get_level_values(0)
mask = [4 in i for i in intervals]
ndf.loc[mask]
1 loop, best of 3: 193 ms per loop
So I did a bit of digging to try and understand the problem. If I try to run your code the following happens.
You try to index into the index label with
"slice(array([0, 1], dtype=int64), array([1, 2], dtype=int64), None)"
(when I say index_type I mean the Pandas datatype)
An index_type's label is a list of indices that map to the index_type's levels array. Here is an example from the documentation.
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex(levels=[[1, 2], ['blue', 'red']],
labels=[[0, 0, 1, 1], [1, 0, 1, 0]],
names=['number', 'color'])
Notice how the second list in labels maps into the order of levels: levels[1][1] is equal to 'red', and levels[1][0] is equal to 'blue'.
Anyhow, this is all to say that I don't believe IntervalIndex is meant to be used in an overlapping fashion. If you look at the original proposal for it:
https://github.com/pandas-dev/pandas/issues/7640
"A IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals."
My suggestion is to move the interval into a column. You could probably write up a simple function with numba to test if a number is in each interval. Do you mind explaining the way you're benefiting from the interval?
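Following up on that suggestion, a rough sketch of what the numba route could look like, assuming the interval bounds have been moved into two ordinary numeric columns (the names 'left' and 'right' are illustrative):
import numpy as np
from numba import njit

@njit
def in_interval(left, right, value):
    # elementwise test left < value <= right, matching pandas' default closed='right'
    return (left < value) & (value <= right)

# 'left' and 'right' are hypothetical columns holding the interval bounds
mask = in_interval(df['left'].to_numpy(), df['right'].to_numpy(), 4.0)
print(df[mask])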
Piggybacking off of #Dark's solution, Index.get_loc just calls Index.get_indexer under the hood, so it might be more efficient to call the underlying method when you don't have additional parameters and red tape.
idx = df.index.get_level_values(0)
df.iloc[idx.get_indexer([4])]
My originally proposed solution:
intervals = df.index.get_level_values(0)
mask = [4 in i for i in intervals]
df.loc[mask]
Regardless, it's certainly strange that these return two different results, but it does look like it has to do with the index being unique/monotonic/neither of the two:
df.reset_index(level=1, drop=True).loc[4] # good
df.loc[4] # TypeError
This is not really a solution and I don't fully understand it, but I think it may have to do with your IntervalIndex not being monotonic (in that you have overlapping intervals). I guess that could in a sense still be considered monotonic, so perhaps alternatively you could say the overlap means the index is not unique?
Anyway, check out this github issue:
ENH: Implement MultiIndex.is_monotonic_decreasing #17455
And here's an example with your data, but changing the intervals to be non-overlapping (0,6) & (7,12):
df = pd.DataFrame.from_records([
{'id': 1, 'var1': 0.1, 'ntv': ntv(0, 6), 'E': 1},
{'id': 2, 'var1': 0.5, 'ntv': ntv(7,12), 'E': 0}
], index=('ntv', 'id'))
Now, loc works OK:
df.loc[4]
E var1
id
1 1 0.1
def check_value(num):
    return df[[num in i for i in map(lambda x: x[0], df.index)]]
a = check_value(4)
a
>>
E var1
ntv id
(0, 10] 1 1 0.1
(0, 12] 2 0 0.5
If you want to drop the index level, you can add:
a.index = a.index.droplevel(0)

Python Numpy: how to get the fraction of the frequency of a specific number in an array

I'm trying to find the fraction of ones in a specific row or column of an array and make a new array of these fractions.
So far I have:
def calc_frac(a, axis=0):
    """a function that returns the fraction of ones in each column or row"""
    s = np.array(((a == 1).sum()) / len(a))
    return s
All my test values are coming back False when they should be True.
If there are no missing values in the array, you can just call the mean method on the a == 1 logical array, which returns the fraction of 1s:
a = np.array([[1,2,3,1], [1,1,1,1], [1,0,2,2], [2,2,1,1]])
a
#array([[1, 2, 3, 1],
# [1, 1, 1, 1],
# [1, 0, 2, 2],
# [2, 2, 1, 1]])
1) Fraction of 1s per column
(a == 1).mean(0)
# array([ 0.75, 0.25, 0.5 , 0.75])
2) Fraction of 1s per row
(a == 1).mean(1)
# array([ 0.5 , 1. , 0.25, 0.5 ])
If nan counts as an entry, the above method still works; if nan doesn't count as an entry, you could take care of nan as follows:
(a == 1).sum(axis)/(~np.isnan(a)).sum(axis)
Where axis = 0, fraction per column; axis = 1, fraction per row.
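Applying that to the original function, a minimal sketch of calc_frac that honours its axis argument (assuming no missing values, as in the first case above):
import numpy as np

def calc_frac(a, axis=0):
    """Return the fraction of ones in each column (axis=0) or row (axis=1)."""
    return (a == 1).mean(axis)

a = np.array([[1, 2, 3, 1], [1, 1, 1, 1], [1, 0, 2, 2], [2, 2, 1, 1]])
print(calc_frac(a))           # per column: [0.75 0.25 0.5  0.75]
print(calc_frac(a, axis=1))   # per row:    [0.5  1.   0.25 0.5 ]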
