How to speed up this pandas DataFrame combination that runs too slowly? - python

I want to speed up this kind of DataFrame combination that I wrote.
More details:
The scores are not important (just some random numbers), so ignore them. The Index in df1 is a time series with step 5, the Index in df2 is a time series with step 15, and the Index in df3 is a time series with step 30.
Thanks.
import pandas as pd
#initialize dataframes and fill some data
df1 = pd.DataFrame([[6,20],[11,19],[16,18],[21,17],[26,16],[31,15],[36,14]],columns=['Index','Score'])
df1.set_index('Index', inplace=True)
print(df1)
df2 = pd.DataFrame([[6,20],[21,19],[36,18]],columns=['Index','Score'])
df2.set_index('Index', inplace=True)
print(df2)
df3 = pd.DataFrame([[6,20],[36,19]],columns=['Index','Score'])
df3.set_index('Index', inplace=True)
print(df3)
#This code block runs slow and I want to speed up here.
#-----------------------------------------------------
for index1 in df1.index:
    for index2 in df2.index:
        if index2 - index1 <= 10:
            df1.at[index1, 'Score2'] = df2.at[index2, 'Score']
for index1 in df1.index:
    for index2 in df3.index:
        if index2 - index1 <= 25:
            df1.at[index1, 'Score3'] = df3.at[index2, 'Score']
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print(df1)
Score
Index
6 20
11 19
16 18
21 17
26 16
31 15
36 14
Score
Index
6 20
21 19
36 18
Score
Index
6 20
36 19
Score Score2 Score3
Index
6 20 20.0 20.0
11 19 19.0 19.0
16 18 19.0 19.0
21 17 19.0 19.0
26 16 18.0 19.0
31 15 18.0 19.0
36 14 18.0 19.0

If the values do not matter, you just need to do a merge:
df1 = df1.merge(
    right=df2.merge(right=df3, how='left', left_index=True, right_index=True),
    how='left', left_index=True, right_index=True,
)
df1.columns = ['Score', 'Score 2', 'Score 3']
df1
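Beyond an index-equality merge, the loops' effective behaviour (for each df1 index, keep the score of the largest df2/df3 index within the window) can also be reproduced with pd.merge_asof. This is only a sketch of that idea, not part of the original answers; the window_score helper is a hypothetical name:
import pandas as pd

# same sample frames as in the question
df1 = pd.DataFrame([[6, 20], [11, 19], [16, 18], [21, 17], [26, 16], [31, 15], [36, 14]],
                   columns=['Index', 'Score']).set_index('Index')
df2 = pd.DataFrame([[6, 20], [21, 19], [36, 18]], columns=['Index', 'Score']).set_index('Index')
df3 = pd.DataFrame([[6, 20], [36, 19]], columns=['Index', 'Score']).set_index('Index')

def window_score(left, right, threshold, name):
    # the nested loops keep the score of the largest right index satisfying
    # right_index - left_index <= threshold, i.e. a backward as-of match of
    # left_index + threshold against the right index
    left_keyed = left.reset_index().assign(key=lambda d: d['Index'] + threshold)
    right_keyed = right.rename(columns={'Score': name}).reset_index().rename(columns={'Index': 'key'})
    merged = pd.merge_asof(left_keyed, right_keyed, on='key', direction='backward')
    return merged.set_index('Index')[name]

df1['Score2'] = window_score(df1, df2, 10, 'Score2')
df1['Score3'] = window_score(df1, df3, 25, 'Score3')
print(df1)  # same Score2/Score3 values as the loop version, without Python-level loops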

convtools allows you to define data transforms like this (full disclosure: I developed it). Once a conversion is defined and you call gen_converter, it generates and compiles ad hoc Python code under the hood, so you end up with a plain Python function to use. github | docs
from convtools import conversion as c
from convtools.contrib.tables import Table

items_1 = [[6, 20], [11, 19], [16, 18], [21, 17], [26, 16], [31, 15], [36, 14]]
items_2 = [[6, 20], [21, 19], [36, 18]]
items_3 = [[6, 20], [36, 19]]

rows_iter = (
    Table.from_rows(items_1, header=["Index", "Score"])
    .join(
        Table.from_rows(items_2, header=["Index", "Score2"]),
        on=c.RIGHT.col("Index") - c.LEFT.col("Index") <= 10,
        how="left",
    )
    .drop("Index_RIGHT")
    .rename({"Index_LEFT": "Index"})
    .join(
        Table.from_rows(items_3, header=["Index", "Score3"]),
        on=c.RIGHT.col("Index") - c.LEFT.col("Index") <= 25,
        how="left",
    )
    .drop("Index_RIGHT")
    .rename({"Index_LEFT": "Index"})
    .into_iter_rows(dict)
)

# rows_iter contains duplicate rows after the joins, so we need to take only
# the last one per index
deduplicate_converter = (
    c.chunk_by(c.item("Index"))
    .aggregate(c.ReduceFuncs.Last(c.this))
    .gen_converter()
)

# the deduplicate_converter also returns an iterable
data = list(deduplicate_converter(rows_iter))
"""
In [13]: data
Out[13]:
[{'Index': 6, 'Score': 20, 'Score2': 20, 'Score3': 20},
{'Index': 11, 'Score': 19, 'Score2': 19, 'Score3': 19},
{'Index': 16, 'Score': 18, 'Score2': 19, 'Score3': 19},
{'Index': 21, 'Score': 17, 'Score2': 19, 'Score3': 19},
{'Index': 26, 'Score': 16, 'Score2': 18, 'Score3': 19},
{'Index': 31, 'Score': 15, 'Score2': 18, 'Score3': 19},
{'Index': 36, 'Score': 14, 'Score2': 18, 'Score3': 19}]
"""
Should you have any questions, please comment below and I'll adjust the code. If it works for you, please comment on whether there is any speed-up on large data sets.

Related

Subtracting data in a row based on similar values in a different column

I have a sample of a much larger dataframe here:
import pandas as pd
data = {'Name': [27, 27, 30, 30, 43, 43, 50, 62, 62],
        'Time': [10, 30, 23.4, 28.6, 10, 15, 20, 25, 50]}
df = pd.DataFrame(data)
I want to create a new column or a new dataframe that subtracts the Time values for rows sharing the same number in the Name column.
Expected outcome:
Name  Time Bucket
27    20
30    5.2
43    5
50    20
62    25
I am not too sure how to go about this.
Try:
out = df.assign(Time=df.groupby('Name')['Time'].diff().fillna(df['Time'])) \
        .drop_duplicates('Name', keep='last')
print(out)
# Output
   Name  Time
1    27  20.0
3    30   5.2
5    43   5.0
6    50  20.0
8    62  25.0
You can use groupby+apply to get the last item of the diff per group, with fillna for the case of a single element:
df.groupby('Name')['Time'].apply(lambda s: s.diff().fillna(s).iloc[-1])
Output:
Name
27 20.0
30 5.2
43 5.0
50 20.0
62 25.0
Name: Time, dtype: float64
Try using zip and reduce:
import functools

data = {'Name': [27, 27, 30, 30, 43, 43, 50, 62, 62],
        'Time': [10, 30, 23.4, 28.6, 10, 15, 20, 25, 50]}
keys = set(data['Name'])
lst = list(zip(data['Name'], data['Time']))
print(lst)
results = {}
for key in keys:
    value = functools.reduce(lambda x, y: y - x, [x[1] for x in lst if x[0] == key])
    results[key] = value
print(results)
output:
{43: 5, 50: 20, 30: 5.200000000000003, 27: 20, 62: 25}

How to drop duplicate multiindex column in Pandas

I have been searching the net, but I cannot find any resource on deleting a duplicate MultiIndex column name.
Given a MultiIndex as below:
       level1
       level2
            A   B   C   B   A   C
ONE        11  12  13  11  12  13
TWO        21  22  23  21  22  23
THREE      31  32  33  31  32  33
Drop duplicated B and C
Expected output
       level1
       level2
            A   B   C   A
ONE        11  12  13  11
TWO        21  22  23  21
THREE      31  32  33  31
Code
import pandas as pd
df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'TWO', 'THREE'])
df2 = pd.DataFrame({'B': [11, 21, 31],
                    'A': [12, 22, 32],
                    'C': [13, 23, 33]},
                   index=['ONE', 'TWO', 'THREE'])
df.columns = pd.MultiIndex.from_product([['level1'], ['level2'], df.columns])
df2.columns = pd.MultiIndex.from_product([['level1'], ['level2'], df2.columns])
df = pd.concat([df, df2], axis=1)
Dropping by index is not working.
You can try:
mask = (~df.T.duplicated()) | df.columns.get_level_values(2).isin(['A', 'D'])
Finally:
df = df.loc[:, mask]
# OR
# df = df.T.loc[mask].T
An adaptation of the df.T.drop_duplicates().T approach by Anurag Dabas.
Select only the unique columns and their values:
drop_col = ['B', 'C']
drop_single = [df.loc[:, (slice(None), slice(None), DCOL)].T.drop_duplicates().T for DCOL in drop_col]
Drop those columns from df:
df = df.drop(drop_col, axis=1, level=2)
Combine everything to get the intended output:
df = pd.concat([df, *drop_single], axis=1)
Complete solution
import pandas as pd

df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'TWO', 'THREE'])
df2 = pd.DataFrame({'B': [11, 21, 31],
                    'A': [12, 22, 32],
                    'C': [13, 23, 33]},
                   index=['ONE', 'TWO', 'THREE'])
df.columns = pd.MultiIndex.from_product([['level1'], ['level2'], df.columns])
df2.columns = pd.MultiIndex.from_product([['level1'], ['level2'], df2.columns])
df = pd.concat([df, df2], axis=1)

drop_col = ['B', 'C']
drop_single = [df.loc[:, (slice(None), slice(None), DCOL)].iloc[:, 1] for DCOL in drop_col]
df = df.drop(drop_col, axis=1, level=2)
df_unique = pd.concat(drop_single, axis=1)
df = pd.concat([df, df_unique], axis=1)
print(df)
You can try this:
# Drop the last 2 columns of the dataframe
df.drop(columns=df.columns[-2:], inplace=True)

Replacing data in column with mean value of corresponding bin?

I make bins out of my column using pandas' pd.qcut(). I would then like to apply smoothing, replacing each value with its bin's mean value.
I generate my bins with something like
pd.qcut(col, 3)
For example,
Given the column values [4, 8, 15, 21, 21, 24, 25, 28, 34]
and the generated bins
Bin1 [4, 15]: 4, 8, 15
Bin2 [21, 24]: 21, 21, 24
Bin3 [25, 34]: 25, 28, 34
I would like to replace the values with the following means
Mean of Bin1 (4, 8, 15) = 9
Mean of Bin2 (21, 21, 24) = 22
Mean of Bin3 (25, 28, 34) = 29
Therefore:
Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29
making the final dataset: [9, 9, 9, 22, 22, 22, 29, 29, 29]
How can one also add a column with closest bin boundaries?
Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34
making the final dataset: [4, 4, 15, 21, 21, 24, 25, 25, 34]
This is very similar to this question, which is for R.
It's exactly as you laid out. Using this technique to get the nearest boundary:
import pandas as pd

df = pd.DataFrame({"col": [4, 8, 15, 21, 21, 24, 25, 28, 34]})
df2 = df.assign(
    bin=pd.qcut(df.col, 3),
    colbmean=lambda dfa: dfa.groupby("bin")["col"].transform("mean"),
    colbin=lambda dfa: dfa.apply(
        lambda r: min([r.bin.left, r.bin.right], key=lambda x: abs(x - r.col)), axis=1
    ),
)
   col             bin  colbmean  colbin
0    4   (3.999, 19.0]         9   3.999
1    8   (3.999, 19.0]         9   3.999
2   15   (3.999, 19.0]         9      19
3   21  (19.0, 24.333]        22      19
4   21  (19.0, 24.333]        22      19
5   24  (19.0, 24.333]        22  24.333
6   25  (24.333, 34.0]        29  24.333
7   28  (24.333, 34.0]        29  24.333
8   34  (24.333, 34.0]        29      34
You'll find below the solution I came up with for your problem.
There is still a limitation: pandas.qcut does not return closed intervals, so the results are not exactly the ones you described.
import pandas as pd
df = pd.DataFrame({'value': [4, 8, 15, 21, 21, 24, 25, 28, 34]})
df['bin'] = pd.qcut(df['value'], 3)
df = df.join(df.groupby('bin')['value'].mean(), on='bin', rsuffix='_average_in_bin')
df['bin_left'] = df['bin'].apply(lambda x: x.left)
df['bin_right'] = df['bin'].apply(lambda x: x.right)
df['nearest_boundary'] = df.apply(lambda x: x['bin_left'] if abs(x['value'] - x['bin_left']) < abs(x['value'] - x['bin_right']) else x['bin_right'], axis=1)
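For a quick check (a sketch that assumes the df built in the snippet above, not part of the original answer), the smoothed values and nearest boundaries can be pulled out as flat lists; on the sample data the smoothed column is the per-bin means 9, 22 and 29 from the question:
# assuming `df` from the snippet above
print(df['value_average_in_bin'].tolist())  # per-bin means: 9.0, 22.0, 29.0 repeated
print(df['nearest_boundary'].tolist())      # closest (open) bin edge for each value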

Scoring a pandas column vs other columns

I want to count how many of the other columns in df are greater than or equal to a reference column. Given testdf:
testdf = pd.DataFrame({'RefCol': [10, 20, 30, 40],
                       'Col1': [11, 19, 29, 40],
                       'Col2': [12, 21, 28, 39],
                       'Col3': [13, 22, 31, 38]})
I am using the helper function:
def sorter(row):
    sortedrow = row.sort_values()
    return sortedrow.index.get_loc('RefCol')
as:
testdf['Score'] = testdf.apply(sorter, axis=1)
With the actual data this method is very slow; how can I speed it up? Thanks.
It looks like you need to compare against RefCol and count how many columns are less than RefCol; use:
testdf.lt(testdf['RefCol'],axis=0).sum(1)
0 0
1 1
2 2
3 2
For greater than or equal to, use:
testdf.drop(columns='RefCol').ge(testdf['RefCol'], axis=0).sum(axis=1)
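As a quick sanity check (a sketch assuming testdf from the question above, not part of the original answer), the greater-than-or-equal version can be assigned back as the Score column:
# assuming `testdf` from the question above
scores = testdf.drop(columns='RefCol').ge(testdf['RefCol'], axis=0).sum(axis=1)
print(scores.tolist())  # [3, 2, 1, 1] on the sample data
testdf['Score'] = scores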

Merging two pandas dataframes by interval

I have two pandas dataframes with the following format:
df_ts = pd.DataFrame([
    [10, 20, 1, 'id1'],
    [11, 22, 5, 'id1'],
    [20, 54, 5, 'id2'],
    [22, 53, 7, 'id2'],
    [15, 24, 8, 'id1'],
    [16, 25, 10, 'id1'],
], columns=['x', 'y', 'ts', 'id'])

df_statechange = pd.DataFrame([
    ['id1', 2, 'ok'],
    ['id2', 4, 'not ok'],
    ['id1', 9, 'not ok'],
], columns=['id', 'ts', 'state'])
I am trying to get it into a format such as:
df_out = pd.DataFrame([
    [10, 20, 1, 'id1', None],
    [11, 22, 5, 'id1', 'ok'],
    [20, 54, 5, 'id2', 'not ok'],
    [22, 53, 7, 'id2', 'not ok'],
    [15, 24, 8, 'id1', 'ok'],
    [16, 25, 10, 'id1', 'not ok'],
], columns=['x', 'y', 'ts', 'id', 'state'])
I understand how to accomplish this iteratively by grouping by id and then iterating through each row, changing the status when it appears. Is there a built-in, more scalable pandas way of doing this?
Unfortunately, pandas merge supports only equality joins. See more details in the following thread:
merge pandas dataframes where one value is between two others
If you want to merge by interval, you'll need to work around the issue, for example by adding another filter after the merge:
joined = a.merge(b, on='id')
joined = joined[joined.ts.between(joined.ts1, joined.ts2)]
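Another option worth mentioning (a sketch, not part of the original answers): pd.merge_asof expresses this kind of "state as of timestamp" join directly, matching each df_ts row with the latest state change for the same id at or before its ts:
import pandas as pd

# assuming df_ts and df_statechange from the question above;
# merge_asof requires both frames to be sorted by the 'on' key
df_out = pd.merge_asof(
    df_ts.sort_values('ts'),
    df_statechange.sort_values('ts'),
    on='ts',
    by='id',
    direction='backward',  # most recent state change at or before each ts
)
print(df_out)  # 'state' is NaN for rows before the first state change of that id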
You can merge pandas data frames on two columns:
pd.merge(df_ts,df_statechange, how='left',on=['id','ts'])
In the df_statechange that you shared here, there are no common ts values between the two dataframes. Apparently you copied an incomplete dataframe here, so I got this output:
x y ts id state
0 10 20 1 id1 NaN
1 11 22 5 id1 NaN
2 20 54 5 id2 NaN
3 22 53 7 id2 NaN
4 15 24 8 id1 NaN
5 16 25 10 id1 NaN
But indeed, if you have common ts values in the dataframes, it will give your desired output. For example:
df_statechange = pd.DataFrame([
    ['id1', 5, 'ok'],
    ['id1', 8, 'ok'],
    ['id2', 5, 'not ok'],
    ['id2', 7, 'not ok'],
    ['id1', 9, 'not ok'],
], columns=['id', 'ts', 'state'])
The output:
x y ts id state
0 10 20 1 id1 NaN
1 11 22 5 id1 ok
2 20 54 5 id2 not ok
3 22 53 7 id2 not ok
4 15 24 8 id1 ok
5 16 25 10 id1 NaN
