Anything faster than groupby for iterating through groups? - python

So I've narrowed a previous problem down to this: I have a DataFrame that looks like this
      id  temp1  temp2
9   10.0   True  False
10  10.0   True  False
11  10.0  False   True
12  10.0  False   True
17  15.0   True  False
18  15.0   True  False
19  15.0   True  False
20  15.0   True  False
21  15.0  False  False
33  27.0   True  False
34  27.0   True  False
35  27.0  False   True
36  27.0  False  False
40  31.0   True  False
41  31.0  False   True
...
and in reality, it's a few million lines long (and has a few other columns).
What I have it currently doing is
grouped = coinc.groupby('id')
final = grouped.filter(lambda x: ( x['temp2'].any() and x['temp1'].any()))
lanif = final.drop(['temp1','temp2'],axis = 1 )
(coinc is the name of the dataframe)
which only keeps rows (grouped by id) if, among the rows sharing that id, there is a True somewhere in temp1 and a True somewhere in temp2. For example, with the above dataframe it would get rid of the rows with id 15.0, but keep everything else.
This, however, is deathly slow and I was wondering if there was a faster way to do this.

Using filter with a lambda function here is slowing you down a lot. You can speed things up by removing that.
u = coinc.groupby('id')
m = u.temp1.any() & u.temp2.any()
res = coinc.loc[coinc.id.isin(m[m].index), ['id']]
Comparing this to your approach on a larger frame.
a = np.random.randint(1, 1000, 100_000)
b = np.random.randint(0, 2, 100_000, dtype=bool)
c = ~b
coinc = pd.DataFrame({'id': a, 'temp1': b, 'temp2': c})
In [295]: %%timeit
...: u = coinc.groupby('id')
...: m = u.temp1.any() & u.temp2.any()
...: res = coinc.loc[coinc.id.isin(m[m].index), ['id']]
...:
13.5 ms ± 476 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [296]: %%timeit
...: grouped = coinc.groupby('id')
...: final = grouped.filter(lambda x: ( x['temp2'].any() and x['temp1'].any()))
...: lanif = final.drop(['temp1','temp2'],axis = 1 )
...:
527 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.array_equal(res.values, lanif.values)
True

Another option with NumPy, factorizing the ids and accumulating a per-group logical or:
i, u = pd.factorize(coinc.id)
t = np.zeros((len(u), 2), bool)
c = np.column_stack([coinc.temp1.to_numpy(), coinc.temp2.to_numpy()])
np.logical_or.at(t, i, c)
final = coinc.loc[t.all(1)[i], ['id']]
final
id
9 10.0
10 10.0
11 10.0
12 10.0
33 27.0
34 27.0
35 27.0
36 27.0
40 31.0
41 31.0

The problem isn't the groupby, it's the lambda. Lambda operations are not vectorized*. You can get the same result faster using agg. I'd do:
groupdf = coinc.groupby('id').agg('any')
# Select the ids where both columns contain at least one True
mask = groupdf[['temp1','temp2']].all(axis=1)
lanif = groupdf[mask].drop(['temp1','temp2'], axis=1)
*This is a pretty nuanced issue that I'm waaaay oversimplifying, sorry.

Here is another alternative solution
f = coinc.groupby('id').transform('any')
result = coinc.loc[f['temp1'] & f['temp2'], coinc.columns.drop(['temp1', 'temp2'])]

Related

Classify DataFrame rows based on first matching condition

I have a pandas DataFrame, each column represents a quarter, the most recent quarters are placed to the right, not all the information gets at the same time, some columns might be missing information (NaN values)
I would like to create a new column with the first criteria number that the row matches, or zero if it doesn't match any criteria
The criteria gets applied to the 3 most recent columns that have data (an integer, ignoring NaNs) and a match is considered if the value in the list is greater than or equal to its corresponding value in the DataFrame
I tried using apply, but I couldn't make it work and the failed attempts were slow
import pandas as pd
import numpy as np
criteria_dict = {
    1: [10, 0, 10],
    2: [0, 10, 10]
}

list_of_tuples = [
    (78, 7, 11, 15),       # classify as 2 since 7 >= 0, 11 >= 10, 15 >= 10
    (98, -5, np.NaN, 18),  # classify as 0; ignoring the NaN it doesn't match any criteria because of the -5
    (-78, 20, 64, 28),     # classify as 1 since 20 >= 10, 64 >= 0, 28 >= 10
    (35, 63, 27, np.NaN),  # classify as 1; the NaN is ignored, 35 >= 10, 63 >= 0, 27 >= 10
    (-11, 0, 56, 10)       # classify as 2 since 0 >= 0, 56 >= 10, 10 >= 10
]

df = pd.DataFrame(
    list_of_tuples,
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['2021Q2', '2021Q3', '2021Q4', '2022Q1']
)
print(df)
Applying a custom function to each row should work.
def func(x):
    x = x.dropna().to_numpy()[-3:]
    if len(x) < 3:
        return 0
    for k, v in criteria_dict.items():
        if np.all(x >= v):
            return k
    return 0

df.apply(func, axis=1)
Probably using apply is the most straightforward, but I wanted to try a solution with numpy, which should be faster with dataframes with many rows.
import numpy as np

df_arr = df.to_numpy()
# Find the NaNs.
nans = np.nonzero(np.isnan(df_arr))
# Roll the rows so that the latest three columns with valid data are all to the right.
for row, col in zip(*nans):
    df_arr[row, :] = np.roll(df_arr[row, :], shift=4-col)
# Check for matching criteria.
df['criteria'] = np.select([np.all((df_arr[:, 1:] - criteria_dict[crit]) >= 0, axis=1) for crit in criteria_dict],
                           [crit for crit in criteria_dict])
print(df)
2021Q2 2021Q3 2021Q4 2022Q1 criteria
A 78 7 11.0 15.0 2.0
B 98 -5 NaN 18.0 0.0
C -78 20 64.0 28.0 1.0
D 35 63 27.0 NaN 1.0
E -11 0 56.0 10.0 2.0
Some timings on df = pd.concat([df]*10000):
# 103 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit numpy(df)
# 1.32 s ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pandas_apply(df)
So it is ~10x faster.
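The numpy and pandas_apply helpers used in the timings above weren't shown; presumably they just wrap the two approaches from this answer into functions. A minimal sketch, assuming that:
def pandas_apply(df):
    # row-wise application of the custom classifier func() defined above
    return df.apply(func, axis=1)

def numpy(df):
    # the roll-and-select approach from above, returning the criteria values as an array
    df_arr = df.to_numpy()
    nans = np.nonzero(np.isnan(df_arr))
    for row, col in zip(*nans):
        df_arr[row, :] = np.roll(df_arr[row, :], shift=4-col)
    return np.select([np.all((df_arr[:, 1:] - criteria_dict[crit]) >= 0, axis=1) for crit in criteria_dict],
                     [crit for crit in criteria_dict])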
It is possible to achieve a full vectorial comparison. Note that the bottleneck is the broadcasting step, which creates an intermediate array of size K*M*N, where M*N is the size of the subset of the dataframe (here 5*3) and K*N that of the criteria (here 2*3). You need to have enough memory to create this array.
Step by step procedure:
First, get the last 3 non-NaN values of each row as b:
N = 3
a = df.to_numpy()
from scipy.stats import rankdata
b = a[rankdata(~np.isnan(a), method='ordinal', axis=1)>(a.shape[1]-N)].reshape(-1,N)
array([[ 7., 11., 15.],
[98., -5., 18.],
[20., 64., 28.],
[35., 63., 27.],
[ 0., 56., 10.]])
Then craft an array with the criteria as c:
c = np.array(list(criteria_dict.values()))
array([[10, 0, 10],
[ 0, 10, 10]])
Broadcast the comparison of b against c and require all values in a row to be >=:
d = (b>=c[:, None]).all(2)
array([[False, False, True, True, False],
[ True, False, True, True, True]])
Get the key of the first matching criterion from criteria_dict (else 0):
e = np.where(d.any(0), np.array(list(criteria_dict))[np.argmax(d, axis=0)], 0)
array([2, 0, 1, 1, 2])
Assign to DataFrame:
df['criteria'] = e
2021Q2 2021Q3 2021Q4 2022Q1 criteria
A 78 7 11.0 15.0 2
B 98 -5 NaN 18.0 0
C -78 20 64.0 28.0 1
D 35 63 27.0 NaN 1
E -11 0 56.0 10.0 2
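Put together, the steps above can be collected into a single helper (just a sketch bundling the code above, keeping the N=3 assumption; classify is a hypothetical name):
import numpy as np
from scipy.stats import rankdata

def classify(df, criteria_dict, N=3):
    a = df.to_numpy()
    # last N non-NaN values of each row
    b = a[rankdata(~np.isnan(a), method='ordinal', axis=1) > (a.shape[1] - N)].reshape(-1, N)
    c = np.array(list(criteria_dict.values()))
    d = (b >= c[:, None]).all(2)  # criterion x row match matrix
    return np.where(d.any(0), np.array(list(criteria_dict))[np.argmax(d, axis=0)], 0)

df['criteria'] = classify(df, criteria_dict)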

Multiple conditions with if on a dataframe in python, create function

I have a dataframe with four columns.
Normally I should have conso_HC = index_fin_HC - index_debut_HC. But as you can see that's not the case: the subtraction does not actually equal conso_HC. The problem is that to recover conso_HC you sometimes need to add 100000 to either index_fin_HC or index_debut_HC.
x = fichier['index_fin_HC'] - fichier['index_debut_HC']
y = fichier['conso_HC']
def conditions(x, y):
    if x + 100000 == y:
        return x
    elif x == y + 100000:
        return y
fichier['test'] = fichirt
It is easy to build a new Series that tests for the condition:
>>> (df.soustraction + 100000 == df.conso) | (df.soustraction - 100000 == df.conso)
0 False
1 False
2 False
3 False
4 True
5 True
6 False
dtype: bool
You can then:
add it to the dataframe as a new column:
df['condition'] = (df.soustraction + 100000 == df.conso) | (df.soustraction - 100000 == df.conso)
select rows in the dataframe matching the condition
>>> df[(df.soustraction + 100000 == df.conso) | (df.soustraction - 100000 == df.conso)]
debut fin conso soustraction
4 99193.0 526.0 1333.0 -98667.0
5 91833.0 6407.0 14574.0 -85426.0
select rows in the dataframe not matching the condition
>>> df[~ ((df.soustraction + 100000 == df.conso) | (df.soustraction - 100000 == df.conso))]
debut fin conso soustraction
0 34390.0 414.0 452.0 -33976.0
1 18117.0 85.0 216.0 -18032.0
2 37588.0 234.0 8468.0 -37354.0
3 49060.0 53.0 1399.0 -49007.0
6 38398.0 1594.0 1994.0 -36804.0
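If you also want the 'test' column from the question (the corrected value), the same two conditions can feed numpy.select. A sketch, assuming the column names from the question (fichier with index_debut_HC, index_fin_HC and conso_HC):
import numpy as np

x = fichier['index_fin_HC'] - fichier['index_debut_HC']
y = fichier['conso_HC']

# vectorized version of the conditions() function from the question;
# rows matching neither condition are left as NaN
fichier['test'] = np.select([x + 100000 == y, x == y + 100000],
                            [x, y],
                            default=np.nan)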

How to find the maximum consecutive number for multiple columns?

I need to identify the highest number of consecutive values that meet a certain criteria for multiple columns.
If my df is:
A B C D E
26 24 21 23 24
26 23 22 15 23
24 19 17 11 15
27 22 28 24 24
26 27 30 23 11
26 26 29 27 29
I want to know the maximum consecutive times that numbers over 25 occur for each column. So the output would be:
A 3
B 2
C 3
D 1
E 1
Using the following code, I can obtain the outcome for one column at a time; is there a way to create a table as above rather than repeating for each column (I have over 40 columns in total).
df.A.isnull().astype(int).groupby(df.A.notnull().astype(int).cumsum()).sum().max()
Thanks in advance.
Is this what you want? A pandas approach (PS: never thought I could make it one line LOL)
(df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max()
Out[320]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64
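For readability, the one-liner can be unpacked roughly like this (a sketch of the same logic; masking with ~gt instead of df < 25 also covers values exactly equal to 25):
gt = df > 25                                 # where the criterion holds

def run_length(col):
    runs = col.ne(col.shift()).cumsum()      # label each run of identical values
    return col.groupby(runs).cumcount() + 1  # running length within the current run

counts = gt.apply(run_length).mask(~gt)      # keep counts only where the value is > 25
result = counts.max()                        # longest run per column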
One option using numpy to calculate the max consecutive:
def max_consecutive(arr):
    # calculate the indices where the condition changes
    split_indices = np.flatnonzero(np.ediff1d(arr.values, to_begin=1, to_end=1))
    # calculate the chunk lengths of consecutive values and pick every other chunk,
    # offset by the initial value of the column
    try:
        max_size = np.diff(split_indices)[not arr.iat[0]::2].max()
    except ValueError:
        max_size = 0
    return max_size
df.gt(25).apply(max_consecutive)
#A 3
#B 2
#C 3
#D 1
#E 1
#dtype: int64
Timing compared with the other approach:
%timeit df.gt(25).apply(max_consecutive)
# 520 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
# 10.3 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here's one with NumPy -
# mask is 2D boolean array representing islands as True values per col
def max_island_len_cols(mask):
    m, n = mask.shape
    out = np.zeros(n, dtype=int)
    b = np.zeros((m+2, n), dtype=bool)
    b[1:-1] = mask
    for i in range(mask.shape[1]):
        idx = np.flatnonzero(b[1:, i] != b[:-1, i])
        if len(idx) > 0:
            out[i] = (idx[1::2] - idx[::2]).max()
    return out

output = pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Sample run -
In [690]: df
Out[690]:
A B C D E
0 26 24 21 23 24
1 26 23 22 15 23
2 24 19 17 11 15
3 27 22 28 24 24
4 26 27 30 23 11
5 26 26 29 27 29
In [690]:
In [691]: pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Out[691]:
A 3
B 2
C 3
D 1
E 1
dtype: int64
Runtime test
Inspired by the given sample that has numbers in the range (24,28) and 40 cols, let's set up a bigger input dataframe and test out all the solutions -
# Input dataframe
In [692]: df = pd.DataFrame(np.random.randint(24,28,(1000,40)))
# Proposed in this post
In [693]: %timeit pd.Series(max_island_len_cols(df.values>25), index=df.columns)
1000 loops, best of 3: 539 µs per loop
# @Psidom's solution
In [694]: %timeit df.gt(25).apply(max_consecutive)
1000 loops, best of 3: 1.81 ms per loop
# @Wen's solution
In [695]: %timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
10 loops, best of 3: 95.2 ms per loop
An approach using pandas and scipy.ndimage.label, for fun.
import pandas as pd
from scipy.ndimage import label

struct = [[0, 1, 0],  # Structure used for segmentation
          [0, 1, 0],  # Equivalent to axis=0 in `numpy`
          [0, 1, 0]]  # Or 'columns' in `pandas`
labels, nlabels = label(df > 25, structure=struct)
>>> labels # Labels for each column-wise block of consecutive numbers > 25
Out[]:
array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[2, 0, 3, 0, 0],
[2, 4, 3, 0, 0],
[2, 4, 3, 5, 6]])
labels_df = pd.DataFrame(columns=df.columns, data=labels)  # Add original column names
res = (labels_df.apply(lambda x: x.value_counts())  # Execute `value_counts` on each column
       .iloc[1:]                                     # slice results for labels > 0
       .max())                                       # and get max value
>>> res
Out[]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64

Removing usernames from a dataframe that do not appear a certain number of times?

I am trying to understand the code provided below (which I found online, but do not fully understand). I essentially want to remove user names that do not appear in my dataframe at least 4 times (other than removing these names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how the filter combined with the lambda achieves this? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
You can check filtration in the pandas groupby documentation.
A faster solution on a bigger DataFrame is transform with boolean indexing:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timings:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy
def pir(df, k):
    names = df.userName.values
    f, u = pd.factorize(names)  # integer code per row and the unique names
    c = np.bincount(f)          # occurrence count of each unique name
    m = c[f] > k                # row mask: name occurs more than k times
    return df[m]
pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timing
# @jezrael's large data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop

Write 2d dictionary into a dataframe or tab-delimited file using python

I have a 2-d dictionary in the following format:
myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}
How can I write this into a tab-delimited file so that the file contains the following? Filling a tuple (x, y) should fill two locations: (x, y) and (y, x); (x, x) is always 0.
The output would be :
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
PS: If the dictionary can somehow be converted into a dataframe (using pandas), then it can easily be written to a file using pandas functions.
You can do this with the lesser-known align method and a little unstack magic:
In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))
In [123]: df = s.unstack()
In [124]: lhs, rhs = df.align(df.T)
In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)
In [126]: res
Out[126]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Finally, to write this to a CSV file, use the to_csv method:
In [128]: res.to_csv('res.csv', sep='\t')
In [129]: !cat res.csv
a b c d
a 0.0 10.0 20.0 30.0
b 10.0 0.0 40.0 50.0
c 20.0 40.0 0.0 60.0
d 30.0 50.0 60.0 0.0
If you want to keep things as integers, cast using DataFrame.astype(), like so:
In [137]: res.astype(int).to_csv('res.csv', sep='\t')
In [138]: !cat res.csv
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
(It was cast to float because of the intermediate step of filling in nan values where indices from one frame were missing from the other)
@Dan Allan's answer using combine_first is nice:
In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Timings:
In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop
In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop
Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?
In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})
In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})
In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop
In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop
DataFrame.combine_first() is faster for larger frames.
In [49]: data = map(list, zip(*myDict.keys())) + [myDict.values()]
In [50]: df = DataFrame(zip(*data)).set_index([0, 1])[2].unstack()
In [52]: df.combine_first(df.T).fillna(0)
Out[52]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
For posterity: If you are just tuning in, check out Phillip Cloud's answer above for a neater way to construct df.
Not as elegant as I'd like (and not using pandas) but until you find something better:
adj = dict()
for ((u, v), w) in myDict.items():
    if u not in adj: adj[u] = dict()
    if v not in adj: adj[v] = dict()
    adj[u][v] = adj[v][u] = w

keys = adj.keys()
print '\t' + '\t'.join(keys)
for u in keys:
    def f(v):
        try:
            return str(adj[u][v])
        except KeyError:
            return "0"
    print u + '\t' + '\t'.join(f(v) for v in keys)
or equivalently (if you don't want to construct the adjacency matrix):
k = dict()
for ((u, v), w) in myDict.items():
    k[u] = k[v] = True

keys = k.keys()
print '\t' + '\t'.join(keys)
for u in keys:
    def f(v):
        if (u, v) in myDict:
            return str(myDict[(u, v)])
        elif (v, u) in myDict:
            return str(myDict[(v, u)])
        else:
            return "0"
    print u + '\t' + '\t'.join(f(v) for v in keys)
Got it working using the pandas package.
# Find all column names
z = []
[z.extend(x) for x in myDict.keys()]
colnames = sorted(set(z))

# Create an empty DataFrame using pandas
myDF = DataFrame(index=colnames, columns=colnames)
myDF = myDF.fillna(0)  # Initialize with zeros

# Fill each item one by one
for val in myDict:
    myDF[val[0]][val[1]] = myDict[val]
    myDF[val[1]][val[0]] = myDict[val]

# Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label="features")
