Pandas: update a column based on index in another dataframe - python

I want to update a couple of columns in a dataframe using a multiplying factor from another df (both dfs have a 'KEY' column). Though I was able to achieve this, it takes a lot of processing time since I have a few million records. I am looking for a more optimal solution, if one exists.
Let me explain my scenario using dummy dfs. I have a dataframe df1 as below:
In [8]: df1
Out[8]:
   KEY  col2  col3  col4
0    1     1    10     5
1    2     7    13     8
2    1    12    15    12
3    4     3    23     1
4    3    14     5     6
Now I want to change col2 and col3 by a factor that I fetch from the below df2 dataframe based on the KEY.
In [11]: df2
Out[11]:
     FACTOR
KEY
1       100
2      3000
3      1000
4       200
5        50
I'm using the below for loop to achieve what I need.
In [12]: for index, row in df2.iterrows():
    ...:     df1.loc[(df1['KEY']==index), ['col2', 'col3']] *= df2.loc[index]['FACTOR']
In [13]: df1
Out[13]:
   KEY   col2   col3  col4
0    1    100   1000     5
1    2  21000  39000     8
2    1   1200   1500    12
3    4    600   4600     1
4    3  14000   5000     6
This does the job, but my actual data has a few million records that come in real time, and it takes about 15 seconds to complete for each batch of incoming data. I am looking for a better solution, since iterating over df2 row by row and doing a boolean lookup for each key seems to be the bottleneck.

You should use a merge:
c=df1.merge(df2,on="KEY")
The c dataframe will now contain the "FACTOR" column, which you can then multiply into col2 and col3 to get the result you want.
If one of the fields to merge on is the index (as with df2 here), you can use:
c=df1.merge(df2,left_on="KEY",right_index=True)
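For completeness, a minimal sketch of how the merged frame could then be used to compute the updated columns in one vectorized step (assuming df1 and df2 as above; the FACTOR column is dropped again at the end):
c = df1.merge(df2, left_on="KEY", right_index=True)
c[['col2', 'col3']] = c[['col2', 'col3']].mul(c['FACTOR'], axis=0)   # scale col2/col3 row-wise by FACTOR
c = c.drop(columns='FACTOR')
If you would rather update df1 in place without a merge, a mapping-based sketch works as well: factor = df1['KEY'].map(df2['FACTOR']) followed by df1[['col2', 'col3']] = df1[['col2', 'col3']].mul(factor, axis=0).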

Related

count based on column and add as new column in numpy

It should be straightforward, but I cannot find the right command.
I want to add a new column (Col3) to my NumPy array which counts, for each row, the occurrences of that row's Col2 value. Take this example:
before:
Col1  Col2
1     4
2     4
3     1500
4     60
5     60
6     60
after:
Col1  Col2  Col3
1     4     2
2     4     2
3     1500  1
4     60    3
5     60    3
6     60    3
Any idea?
Using numpy:
Create a frequency dictionary based on the values in Col2:
from collections import Counter
freq = Counter(arr[:, 1])
Generate the values of Col3 by iterating over the elements of Col2:
new_col = np.array([freq[val] if val in freq else 0 for val in arr[:, 1]]).reshape(-1, 1)
Concatenate the new column to the existing array:
new_arr = np.concatenate([arr, new_col], axis=1)
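Put together, a minimal runnable sketch using the sample data above (arr is assumed to be a 2-D integer array whose second column is Col2):
import numpy as np
from collections import Counter

arr = np.array([[1, 4], [2, 4], [3, 1500], [4, 60], [5, 60], [6, 60]])
freq = Counter(arr[:, 1])                                    # value -> number of occurrences in Col2
new_col = np.array([freq[val] for val in arr[:, 1]]).reshape(-1, 1)
new_arr = np.concatenate([arr, new_col], axis=1)             # rows become (1, 4, 2), (2, 4, 2), (3, 1500, 1), ...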

select based on row combinations on different columns pandas

I have the following pandas data frame.
ID col1 col2 value
1 4 New 20
2 4 OLD 30
3 5 OLD 60
4 5 New 50
5 3 New 70
I would like to select only the rows that satisfy the following rules: in col1, values 4 and 3 should have 'New' in col2, and value 5 should have 'OLD' in col2. Drop the other rows otherwise.
ID col1 col2 value
1 4 New 20
3 5 Old 60
5 3 New 70
Can anyone help with this in Python pandas?
Use DataFrame.query, filtering with in, chained by & for bitwise AND, with the second condition chained by | for bitwise OR:
df1 = df.query("(col1 in [4,3] & col2 == 'New') | (col1 == 5 & col2 == 'OLD')")
print (df1)
ID col1 col2 value
0 1 4 New 20
2 3 5 OLD 60
4 5 3 New 70
Or use boolean indexing with Series.isin:
df1 = df[df['col1'].isin([3,4]) & df['col2'].eq('New') |
         df['col1'].eq(5) & df['col2'].eq('OLD')]
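A minimal runnable version of the boolean-indexing variant, assuming the sample frame above is rebuilt by hand (note the mixed 'New'/'OLD' casing in col2):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'col1': [4, 4, 5, 5, 3],
                   'col2': ['New', 'OLD', 'OLD', 'New', 'New'],
                   'value': [20, 30, 60, 50, 70]})

df1 = df[df['col1'].isin([3, 4]) & df['col2'].eq('New') |
         df['col1'].eq(5) & df['col2'].eq('OLD')]
print(df1)   # keeps the rows with ID 1, 3 and 5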

Removing columns in Pandas

I am working with a big pandas dataframe and noticed that some columns have the same values in every row, but the column names are different.
Also, some values are text or time-series data.
Is there an easy way to get rid of these duplicate columns and keep the first one each time?
Many thanks
Let's create a dummy data frame where two columns with different names are duplicates.
import pandas as pd
df = pd.DataFrame({
    'col1': [1, 2, 3, 'b', 5, 6],
    'col2': [11, 'a', 13, 14, 15, 16],
    'col3': [1, 2, 3, 'b', 5, 6],
})
col1 col2 col3
0 1 11 1
1 2 a 2
2 3 13 3
3 b 14 b
4 5 15 5
5 6 16 6
To remove duplicate columns, first take the transpose, then apply drop_duplicates, and take the transpose again:
df.T.drop_duplicates().T
Result:
col1 col2
0 1 11
1 2 a
2 3 13
3 b 14
4 5 15
5 6 16
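An alternative sketch that transposes only to detect the duplicates and then selects columns on the original frame, so the surviving columns keep their original dtypes:
df_dedup = df.loc[:, ~df.T.duplicated()]   # keep the first of each group of identical columns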

subtract one column from multiple columns in the same dataframe using method chaining

I have a dataframe in pandas and I would like to subtract one column (let's say col1) from col2 and col3 (or from more columns if there are more) without writing the below assign statement for each column.
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [2, 5, 6, 8], 'col3': [5, 5, 5, 9]})
df = (df
      ...
      .assign(col2=lambda x: x.col2 - x.col1)
      )
How can I do this? Or would it work with apply? How would you be able to do this with method chaining?
Edit: (using **kwargs with method chaining)
As in your comment, if you want to chain methods on the intermediate (on-the-fly calculated) dataframe, you need to define a custom dictionary that calculates each column and use it with assign, as follows (you can't use a lambda to directly construct the dictionary inside assign).
In this example I add 5 to the dataframe before chaining assign, to show how it works in chained processing as you want.
d = {cl: lambda x, cl=cl: x[cl] - x['col1'] for cl in ['col2','col3']}
df_final = df.add(5).assign(**d)
In [63]: df
Out[63]:
col1 col2 col3
0 1 2 5
1 2 5 5
2 3 6 5
3 4 8 9
In [64]: df_final
Out[64]:
col1 col2 col3
0 6 1 4
1 7 3 3
2 8 3 2
3 9 4 5
Note: df_final.col1 is different from df.col1 because of the add operation before assign. Don't forget cl=cl in the lambda of the dictionary; it is there to avoid Python's late-binding issue.
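An equivalent chain-friendly sketch, assuming the same df as above: building the dictionary inside pipe gives the intermediate frame an explicit name (d here), which sidesteps the late-binding issue without the cl=cl trick:
df_final = (df
            .add(5)
            .pipe(lambda d: d.assign(**{c: d[c] - d['col1'] for c in ['col2', 'col3']})))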
Use df.sub
df_sub = df.assign(**df[['col2','col3']].sub(df.col1, axis=0).add_prefix('sub_'))
Out[22]:
col1 col2 col3 sub_col2 sub_col3
0 1 2 5 1 4
1 2 5 5 3 3
2 3 6 5 3 2
3 4 8 9 4 5
If you want to assign the values back to col2 and col3, additionally use update:
df.update(df[['col2','col3']].sub(df.col1, axis=0))
print(df)
Output:
col1 col2 col3
0 1 1 4
1 2 3 3
2 3 3 2
3 4 4 5

find a value from a column that, within a plus/minus range, occurs for each unique value of another column in a pandas data frame

I have a data frame like this:
df
col1 col2
1 50000
1 2000
2 51000
3 100
3 5000
3 50500
4 200
4 51500
5 49000
I want to identify the col2 values such that rows with col2 within plus/minus 10 percent of that value exist for every unique value of col1.
The final output should look like:
col1 col2
1 50000
2 51000
3 50500
4 51500
5 49000
If values other than the ones around 50000 are present and also satisfy the plus/minus 10 percent condition for every col1 value, include those along with the values around 50000.
How can I do this using pandas/Python in the most efficient way?
Use a list comprehension to loop over all unique values of col2, filter by +-10% with Series.between and boolean indexing, and compare whether all col1 values exist in each group against a set created from col1. Last, filter by Series.isin:
s = set(df['col1'])
print (s)
{1, 2, 3, 4, 5}
a = [x for x in df['col2'].unique()
     if set(df.loc[df['col2'].between(x - x * .1, x + x * .1), 'col1']) == s]
print (a)
[50000, 51000, 50500, 51500, 49000]
df = df[df['col2'].isin(a)]
print (df)
col1 col2
0 1 50000
2 2 51000
5 3 50500
7 4 51500
8 5 49000
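A minimal runnable version with the sample frame built up front (assuming the data above):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 3, 4, 4, 5],
                   'col2': [50000, 2000, 51000, 100, 5000, 50500, 200, 51500, 49000]})

s = set(df['col1'])                      # {1, 2, 3, 4, 5}
a = [x for x in df['col2'].unique()
     if set(df.loc[df['col2'].between(x - x * .1, x + x * .1), 'col1']) == s]
df = df[df['col2'].isin(a)]
print(df)                                # rows whose col2 is in {50000, 51000, 50500, 51500, 49000}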
