Pandas - Iterate over rows and compare previous values - faster - python

I am trying to get my results faster (13 minutes for 800 rows). I asked a similar question here: pandas - iterate over rows and calculate - faster - but I was not able to adapt the good solutions there to my variation. The difference is that if at least 'n=3' of the previous 'col2' values are greater than the current 'col1' value, 'col2' in that row is set to 0, which affects the following iterations.
import pandas as pd

d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)

df["overlap_count"] = ""  #create new column
n = 3  #if x >= n, then value = 0

for row in range(len(df)):
    x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
    df["overlap_count"].loc[row] = x

    if x >= n:
        df["col2"].loc[row] = 0
        df["overlap_count"].loc[row] = 'x'
df
I obtain the following result: 'col2' values are replaced with 0 whenever the overlap count reaches 'n', and the new column overlap_count is filled:
col1 col2 overlap_count
0 20 39 0
1 23 32 1
2 40 42 0
3 41 50 1
4 46 63 1
5 47 67 2
6 48 0 x
7 49 0 x
8 50 68 2
9 50 0 x
10 52 0 x
11 55 0 x
12 56 0 x
13 69 71 0
14 70 66 1
Thank you for your help and time!

I think you can use numba to improve performance. It only works with numeric values, so -1 is stored instead of 'x' and the new column is filled with 0 instead of an empty string:
df["overlap_count"] = 0 #create new column
n = 3 #if x >= n, then value = 0
a = df[['col1','col2','overlap_count']].values
from numba import njit
#njit
def custom_sum(arr, n):
for row in range(arr.shape[0]):
x = (arr[0:row, 1] > arr[row, 0]).sum()
arr[row, 2] = x
if x >= n:
arr[row, 1] = 0
arr[row, 2] = -1
return arr
df1 = pd.DataFrame(custom_sum(a, n), columns=df.columns)
print (df1)
col1 col2 overlap_count
0 20 39 0
1 23 32 1
2 40 42 0
3 41 50 1
4 46 63 1
5 47 67 2
6 48 0 -1
7 49 0 -1
8 50 68 2
9 50 0 -1
10 52 0 -1
11 55 0 -1
12 56 0 -1
13 69 71 0
14 70 66 1
Performance:
d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)

#4500 rows
df = pd.concat([df] * 300, ignore_index=True)
print (df)
In [115]: %%timeit
     ...: pd.DataFrame(custom_sum(a, n), columns=df.columns)
     ...:
8.11 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [116]: %%timeit
     ...: for row in range(len(df)):
     ...:     x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
     ...:     df["overlap_count"].loc[row] = x
     ...:
     ...:     if x >= n:
     ...:         df["col2"].loc[row] = 0
     ...:         df["overlap_count"].loc[row] = 'x'
     ...:
     ...:
7.84 s ± 442 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
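One caveat worth noting (my own observation, not part of the original answer): custom_sum writes into the array in place, so re-running it on the same a (for example inside %%timeit) starts from already-zeroed col2 values. A minimal sketch of rebuilding the input before each run:
df["overlap_count"] = 0
#custom_sum mutates its input array in place, so build a fresh array
#(or pass a copy) before timing or re-running it
a = df[['col1','col2','overlap_count']].values
df1 = pd.DataFrame(custom_sum(a.copy(), n), columns=df.columns)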

Create a function and then just apply it as shown below (fn here is a placeholder for your function):
df['overlap_count'] = [fn(i) for i in df['overlap_count']]

Try this one, maybe it will be faster.
df['overlap_count'] = df.groupby('col1')['col2'].transform(lambda g: len((g >= g.name).index))

Related

Is there a way to dynamically perform this loop?

Do you know if there is a better way to perform this task without using a for loop?
Starting with the following dataset:
import pandas as pd
df = pd.DataFrame({'A': [90, 85, 85, 85, 100, 170, 150, 130, 125, 125],
                   'B': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
df['C'] = 0
df.loc[0, 'C'] = df.loc[0, 'B']
df['D'] = 0
df.loc[0, 'D'] = df.loc[0, 'C'] * 0.95
df['E'] = 0
df.loc[0, 'E'] = df.loc[0, 'C'] * 0.80
Now,
if the value in row 1 column A is greater than the value in row 0 column D:
the value in row 1 column C will be equal to the value in row 1 column A * 2
the value in row 1 column D will be equal to the value in row 1 column C * 0.95
the value in row 1 column E will be equal to the value in row 1 column D * 0.8
elif the value in row 1 column A is less than the value in row 0 column E:
the value in row 1 column C will be equal to the value in row 1 column A
the value in row 1 column D will be equal to the value in row 1 column C * 0.95
the value in row 1 column E will be equal to the value in row 1 column D * 0.8
else:
the value in row 1 column C will be equal to the value in row 0 column C
the value in row 1 column D will be equal to the value in row 1 column C * 0.95
the value in row 1 column E will be equal to the value in row 1 column D * 0.8
As output, I would like to create a df like this:
df_out = pd.DataFrame({'A': [90, 85, 85, 85, 100, 170, 150, 130, 125, 125],
                       'B': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
                       'C': [100, 100, 100, 100, 200, 200, 150, 150, 150, 150],
                       'D': [95, 95, 95, 95, 190, 190, 190, 143, 143, 143],
                       'E': [80, 80, 80, 80, 160, 160, 160, 120, 120, 120]})
Considering that I have to iterate over more than 5000 rows and around 3000 possible scenarios, I'm looking for the fastest way to perform this task, and I've noticed that the for loop is extremely slow.
Thank you guys in advance, and apologies for the trivial question!! I'm new to Python and I'm trying to learn as much as possible!!
Best
Per our discussion in the comments, if you do the loop this way it's reasonably quick:
alist = [90, 85, 85, 85, 100, 170, 150, 130, 125, 125] * 500
a = alist[0]
c = 100
d = 95
e = 80
clist = [c]
dlist = [d]
elist = [e]
for a in alist[1:]:
    if a > d:
        c_new = round(a*1.5)
    elif a < e:
        c_new = a
    else:
        c_new = c
    c = c_new
    d = round(c_new * 0.95)
    e = round(d * 0.8)
    clist.append(c_new)
    dlist.append(d)
    elist.append(e)
df_out = pd.DataFrame({ 'A' : alist, 'C' : clist, 'D' : dlist, 'E' : elist })
print(df_out.head(10))
A C D E
0 90 100 95 80
1 85 100 95 76
2 85 100 95 76
3 85 100 95 76
4 100 150 142 114
5 170 255 242 194
6 150 150 142 114
7 130 150 142 114
8 125 150 142 114
9 125 150 142 114
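If this plain-Python loop is still too slow at your scale, the same sequential recurrence can be JIT-compiled with numba, as in the overlap-count answer above. This is my own sketch, not part of the original answer; it reuses the constants from the loop above (a * 1.5, * 0.95, * 0.8), so adjust them to your actual rules:
import numpy as np
import pandas as pd
from numba import njit

@njit
def fill_cde(a_arr, c0, d0, e0):
    #sequentially fill C, D, E following the if/elif/else rules above
    n = a_arr.shape[0]
    c = np.empty(n)
    d = np.empty(n)
    e = np.empty(n)
    c[0] = c0
    d[0] = d0
    e[0] = e0
    for i in range(1, n):
        if a_arr[i] > d[i - 1]:
            c[i] = round(a_arr[i] * 1.5)
        elif a_arr[i] < e[i - 1]:
            c[i] = a_arr[i]
        else:
            c[i] = c[i - 1]
        d[i] = round(c[i] * 0.95)
        e[i] = round(d[i] * 0.8)
    return c, d, e

a_arr = np.array(alist, dtype=np.float64)
c_arr, d_arr, e_arr = fill_cde(a_arr, 100, 95, 80)
df_out = pd.DataFrame({'A': a_arr, 'C': c_arr, 'D': d_arr, 'E': e_arr})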

Divide a pandas dataframe by the sum of its index column and row

Here is what I currently have:
print(df)
10 25 26
10 530 1 46
25 1 61 61
26 46 61 330
How can I transform this into df1 so that each element is divided by the sum of the diagonal values of its row and its column (and each diagonal element just by itself)? The output of df1 should look like this:
df1:
10 25 26
10 530/(530) 1/(530+61) 46/(530+330)
25 1/(61+530) 61/(61) 61/(61+330)
26 46/(330+530) 61/(330+61) 330/(330)
print(df1)
10 25 26
10 1 0.0016 0.0534
25 0.0016 1 0.1560
26 0.0534 0.1560 1
IIUC, try:
import numpy as np

a = np.diag(df)[None, :]          #diagonal values as a row vector
b = np.diag(df)[:, None]          #diagonal values as a column vector
c = a + b                         #pairwise sums of row/column diagonal values
np.fill_diagonal(c, np.diag(df))  #diagonal entries are divided by themselves only
df_out = df.div(c)
df_out
Output:
10 25 26
10 1.000000 0.001692 0.053488
25 0.001692 1.000000 0.156010
26 0.053488 0.156010 1.000000
I think this is the solution but you have to change your columns and indexes.
import pandas as pd

df = pd.DataFrame({530: [530, 1, 46],
                   61: [1, 61, 61],
                   330: [46, 61, 330]},
                  index=[530, 61, 330])

for i in range(len(df)):
    for j in range(len(df)):
        if i == j:
            df.iloc[i, j] = df.iloc[i, j] / df.index[i]
        else:
            df.iloc[i, j] = df.iloc[i, j] / (df.index[i] + df.columns[j])
df
You can divide the rows by the max in the column to reproduce your example.
df1 = pd.DataFrame(
    {
        "column1": df['10'].divide(df['10'].max()),
        "column2": df['25'].divide(df['25'].max()),
        "column3": df['26'].divide(df['26'].max())
    }
)

Creating a list from a data frame of values greater than a specific value

I have a question on how to create a list of values that are greater than a specific value in a given data frame variable.
a. b. c.
1. 100 57 23
2. 99 56 23
3. 100 56 22
4. 101 57 23
...
300. 99 50 23
301. 99 51 29
302. 101 57 22
Create a list of all values where a > 100.
I am able to build the boolean index, but not a list, since all the resulting values are boolean:
Greater_100 = df['a']>100
How do I turn this into a list?
df = pd.DataFrame(np.random.randint(0, 200, (10, 3)), columns=list('abc'))
#df[df.a > 100] keeps the matching rows as a DataFrame; take column 'a'
#and call .tolist() to get a plain list of the values
list_a_more_than_hundred = df[df.a > 100]['a'].tolist()
Either df[df['a'] > 100].loc[:, 'a'] or df[df['a'] > 100].loc[:, 'a'].tolist() is sufficient.
Selecting the rows from column a where value is > 100.
>>> df[df['a'] > 100].loc[:, 'a']
4 101
302 101
Name: a, dtype: int64
>>>
>>> type(df[df['a'] > 100].loc[:, 'a'])
<class 'pandas.core.series.Series'>
Converting the above Series into list.
>>> l = df[df['a'] > 100].loc[:, 'a'].tolist()
>>> l
[101, 101]
>>>
>>> type(l)
<class 'list'>
>>>
Let's look at the above code in more detail.
>>> import numpy as np
>>> import pandas as pd
>>>
>>> arr = [[100, 57, 23], [99, 56, 23],
... [100, 56, 20], [101, 57, 23], [99, 50, 23],
... [99, 51, 29], [101, 57, 22]]
>>>
>>> columns = [ch for ch in 'abc']
>>> indices = [str(n) for n in [1, 2, 3, 4, 300, 301, 302]]
>>>
>>> df = pd.DataFrame(arr, index=indices, columns=columns)
>>> df
a b c
1 100 57 23
2 99 56 23
3 100 56 20
4 101 57 23
300 99 50 23
301 99 51 29
302 101 57 22
>>>
>>> df['a'] > 100
1 False
2 False
3 False
4 True
300 False
301 False
302 True
Name: a, dtype: bool
>>>
>>> arr2 = df.loc[:,'a']
>>> arr2
1 100
2 99
3 100
4 101
300 99
301 99
302 101
Name: a, dtype: int64
>>>
>>> arr2 = df[df['a'] > 100]
>>> arr2
a b c
4 101 57 23
302 101 57 22
>>>
>>> arr3 = df[df['a'] > 100].loc[:, 'a']
>>> arr3
4 101
302 101
Name: a, dtype: int64
>>>
>>> l = arr3.tolist()
>>> l
[101, 101]
>>>
To filter your dataframe for rows where a > 100, you can use pd.DataFrame.query:
res_df = df.query('a > 100')
This also works for multiple conditions:
res_df = df.query('a > 100 & b < 57')
If you wish to extract a list of values from these rows, you can use NumPy, e.g.
res_lst = df.query('a > 100 & b < 57').values.ravel().tolist()
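Note that .values.ravel() flattens all columns of the matching rows into one list. If you only want the values of column a, a small variation along the same lines (my own addition) is:
res_lst = df.query('a > 100')['a'].tolist()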

selecting indexes with multiple years of observations

I wish to select only the rows that have observations across multiple years. For example, suppose
mlIndx = pd.MultiIndex.from_tuples([('x', 0,),('x',1),('z', 0), ('y', 1),('t', 0),('t', 1)])
df = pd.DataFrame(np.random.randint(0,100,(6,2)), columns = ['a','b'], index=mlIndx)
In [18]: df
Out[18]:
a b
x 0 6 1
1 63 88
z 0 69 54
y 1 27 27
t 0 98 12
1 69 31
My desired output is
Out[19]:
a b
x 0 6 1
1 63 88
t 0 98 12
1 69 31
My current solution is blunt, so something that can scale up more easily would be great. You can assume a sorted index.
df.reset_index(level=0, inplace=True)
df[df.level_0.duplicated() | df.level_0.duplicated(keep='last')]
Out[30]:
level_0 a b
0 x 6 1
1 x 63 88
0 t 98 12
1 t 69 31
You can figure this out with groupby (on the first level of the index) + transform, and then use boolean indexing to filter out those rows:
df[df.groupby(level=0).a.transform('size').gt(1)]
a b
x 0 67 83
1 2 34
t 0 18 87
1 63 20
Details
Output of the groupby -
df.groupby(level=0).a.transform('size')
x 0 2
1 2
z 0 1
y 1 1
t 0 2
1 2
Name: a, dtype: int64
Filtering from here is straightforward, just find those rows with size > 1.
Use the group by filter
You can pass a function that returns a boolean to it:
df.groupby(level=0).filter(lambda x: len(x) > 1)
a b
x 0 7 33
1 31 43
t 0 71 18
1 68 72
I've spent my fair share of time focused on speed. Not all solutions need to be the fastest solutions. However, since the subject has come up, I'll offer what I think should be a fast solution. It is my intent to keep future readers informed.
Results of Time Test
res.plot(loglog=True)
res.div(res.min(1), 0).T
10 30 100 300 1000 3000
cs 4.425970 4.643234 5.422120 3.768960 3.912819 3.937120
wen 2.617455 4.288538 6.694974 18.489803 57.416648 148.860403
jp 6.644870 21.444406 67.315362 208.024627 569.421257 1525.943062
pir 6.043569 10.358355 26.099766 63.531397 165.032540 404.254033
pir_pd_factorize 1.153351 1.132094 1.141539 1.191434 1.000000 1.000000
pir_np_unique 1.058743 1.000000 1.000000 1.000000 1.021489 1.188738
pir_best_of 1.000000 1.006871 1.030610 1.086425 1.068483 1.025837
Simulation Details
from timeit import timeit

def pir_pd_factorize(df):
    f, u = pd.factorize(df.index.get_level_values(0))
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_np_unique(df):
    u, f = np.unique(df.index.get_level_values(0), return_inverse=True)
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_best_of(df):
    if len(df) > 1000:
        return pir_pd_factorize(df)
    else:
        return pir_np_unique(df)

def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000],
    columns='cs wen jp pir pir_pd_factorize pir_np_unique pir_best_of'.split(),
    dtype=float
)

np.random.seed([3, 1415])
for i in res.index:
    d = pd.DataFrame(
        dict(a=range(i)),
        pd.MultiIndex.from_arrays([
            np.random.randint(i // 4 * 3, size=i),
            range(i)
        ])
    )
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)
Just a new way
s=df.a.count(level=0)
df.loc[s[s>1].index.tolist()]
Out[12]:
a b
x 0 1 31
1 70 29
t 0 42 26
1 96 29
And if you want to keep using duplicate
s=df.index.get_level_values(level=0)
df.loc[s[s.duplicated()].tolist()]
Out[18]:
a b
x 0 1 31
1 70 29
t 0 42 26
1 96 29
I'm not convinced groupby is necessary:
df = df.sort_index()
df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]
# a b
# x 0 16 3
# 1 97 36
# t 0 9 18
# 1 37 30
Some benchmarking:
df = pd.concat([df]*10000).sort_index()

def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s>1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]

%timeit cs(df)   # 19.5ms
%timeit pir(df)  # 33.8ms
%timeit wen(df)  # 17.0ms
%timeit jp(df)   # 22.3ms

Merge two dataframes based on interval overlap

I have two dataframes A and B:
For example:
import pandas as pd
import numpy as np
In [37]:
A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
A[["Start","End"]]
Out[37]:
Start End
0 10 11
1 11 11
2 20 35
3 62 70
4 198 200
In [38]:
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75], 'Info': ['some_info0','some_info1','some_info2','some_info3']})
B[["Start","End","Info"]]
Out[38]:
Start End Info
0 8 10 some_info0
1 5 90 some_info1
2 8 13 some_info2
3 60 75 some_info3
I would like to add the Info column to dataframe A based on whether the interval (Start-End) of A overlaps with an interval of B. In case an A interval overlaps with more than one B interval, the info corresponding to the shortest interval should be added.
I have been looking around for how to manage this issue and I have found somewhat similar questions, but most of their answers use iterrows(), which is not viable in my case since I am dealing with huge dataframes.
I would like something like:
A.merge(B,on="overlapping_interval", how="left")
And then drop duplicates keeping the info coming from the shorter interval.
The output should look like this:
In [39]:
C = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200], 'Info': ['some_info0','some_info2','some_info1','some_info3',np.nan]})
C[["Start","End","Info"]]
Out[39]:
Start End Info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 NaN
I have found this question really interesting as it suggests the possibility of solving this issue using the pandas Interval object. But after many attempts I have not managed to solve it.
Any ideas?
I would suggest writing a function and then applying it over the rows.
First I compute the delta (End - Start) in B for sorting purposes:
B['delta'] = B.End - B.Start
Then a function to get information:
def get_info(x):
    #fully included
    c0 = (x.Start >= B.Start) & (x.End <= B.End)
    #starts lower, end included
    c1 = (x.Start <= B.Start) & (x.End >= B.Start)
    #start included, ends higher
    c2 = (x.Start <= B.End) & (x.End >= B.End)
    #filter with the conditions and sort by delta
    _B = B[c0|c1|c2].sort_values('delta', ascending=True)
    return None if len(_B) == 0 else _B.iloc[0].Info  #None if no overlapping interval
Then you can apply this function to A:
A['info'] = A.apply(lambda x : get_info(x), axis='columns')
print(A)
Start End info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 None
Note:
Instead of using pd.Interval, make your own conditions. The cx conditions are your interval-overlap definitions; change them to get the exact expected behaviour.
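Since the question mentions the pandas Interval object: below is a minimal sketch of that route (my own addition, assuming pandas >= 0.24 for IntervalIndex.overlaps). It still evaluates A row by row with apply, so it is a convenience alternative rather than a fundamentally faster one:
import numpy as np
import pandas as pd

#build an IntervalIndex over B and, for each row of A, return the Info of
#the shortest overlapping B interval (NaN if nothing overlaps)
b_intervals = pd.IntervalIndex.from_arrays(B['Start'], B['End'], closed='both')

def shortest_overlap_info(row):
    mask = b_intervals.overlaps(pd.Interval(row.Start, row.End, closed='both'))
    matches = B[mask]
    if matches.empty:
        return np.nan
    return matches.loc[(matches['End'] - matches['Start']).idxmin(), 'Info']

A['Info'] = A.apply(shortest_overlap_info, axis=1)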
