Python Pandas Dataframe Groupby Sum question

I'm new to Python and I need to combine two dataframes, with 'id' as the primary key. I need to sum up all the Charges from df1 and df2.
df1:
id Name Charge
1 A 100
1 A 100
2 B 200
2 B 200
5 C 300
6 D 400
df2:
id Name Charge
1 A 100
1 A 100
2 B 200
8 X 200
output:
id Name Charge(TOTAL from df1 & df2)
1 A 400
2 B 600
5 C 300
6 D 400
8 X 200

Try:
pd.concat([df1, df2]).groupby(['id', 'Name'], as_index=False)['Charge'].sum()
Output:
id Name Charge
0 1 A 400
1 2 B 600
2 5 C 300
3 6 D 400
4 8 X 200

ans = pd.concat([df1, df2], axis=0).groupby(["id", "Name"]).sum().reset_index()
print(ans)
id Name Charge
0 1 A 400
1 2 B 600
2 5 C 300
3 6 D 400
4 8 X 200
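For a self-contained run, here is a minimal reproducible sketch; the frames are assumed to be built from the tables above:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2, 5, 6],
                    'Name': ['A', 'A', 'B', 'B', 'C', 'D'],
                    'Charge': [100, 100, 200, 200, 300, 400]})
df2 = pd.DataFrame({'id': [1, 1, 2, 8],
                    'Name': ['A', 'A', 'B', 'X'],
                    'Charge': [100, 100, 200, 200]})

# stack the two frames, then sum Charge per (id, Name) pair
out = pd.concat([df1, df2]).groupby(['id', 'Name'], as_index=False)['Charge'].sum()
print(out)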


Create an incremental serial no for filtered rows in pandas dataframe

Can you please help me change my code from the current output to the expected output? I am using the dataframe's apply() function. It would be great if this could also be done more efficiently with a vectorised operation.
Current output:
col1 col2 serialno
0 100 0 4
1 100 100 0
2 100 100 0
3 200 100 4
4 200 200 0
5 300 200 4
6 300 300 0
Expected output:
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
My current code contains a static value (4). I need to increment it by one based on a condition (col1 != col2). In addition, I need to repeat the serial no for all rows that meet the condition (col1 == col2).
My code:
import pandas as pd
columns = ['col1']
data = ['100','100','100','200','200','300','300']
df = pd.DataFrame(data=data,columns=columns)
df['col2'] = df.col1.shift(1).fillna(0)
print(df)
start = 4
series = (df['col1']!=df['col2']).apply(lambda x: start if x==True else 0)
df['serialno'] = series
print(df)
You can try this:
import itertools
start = 4
counter = itertools.count(start) # to have incremental counter
df["serialno"] = [start if (x["col1"]==x["col2"] or (start:=next(counter))) else start for _, x in df.iterrows()]
The condition is evaluated in two ways: if col1 and col2 have the same value, the or short-circuits and start keeps its current value; if the first comparison is false, the walrus assignment (start := next(counter)) advances the counter by one. (Assignment expressions require Python 3.8+.)
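For the vectorised operation the asker mentioned, a cumulative sum over the inequality mask gives the same result without any Python-level loop. A minimal sketch, assuming the start value 4 from the question:
# each True in the mask starts a new serial number; cumsum turns T,F,F,T,F,T,F into 1,1,1,2,2,3,3
start = 4
df['serialno'] = (df['col1'] != df['col2']).cumsum() + start - 1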
Here is how you can do it with the apply function:
ID = 3

def check_value(A, B):
    global ID
    if A != B:
        ID += 1
    return ID

df['id'] = df.apply(lambda row: check_value(row['col1'], row['col2']), axis=1)
You just need to start from 3 since the first row will increment it.
print(df) will give you this:
col1 col2 id
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
Another way could be using itertools.accumulate as follows (note that the initial keyword requires Python 3.8+):
from itertools import accumulate
import numpy as np

df['serialno'] = list(accumulate(np.arange(1, len(df.index)),
                                 lambda x, y: x + 1 if df.iloc[y, 0] != df.iloc[y, 1] else x,
                                 initial=4))
df
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
First, I have reused the idea posted by @BehRouz above of creating a user-defined function that increments the serial no. Second, I transform the boolean series into a dynamic serial no, as shown below for reference.
STEP 1: Create a dataframe and initialize the incremental counter (serialno)
import pandas as pd
columns = ['col1']
data = ['100','100','100','200','200','300','300']
df = pd.DataFrame(data=data,columns=columns)
df['col2'] = df.col1.shift(1).fillna(0)
df['serialno']=0 #initialize new column
print(df)
col1 col2 serialno
0 100 0 0
1 100 100 0
2 100 100 0
3 200 100 0
4 200 200 0
5 300 200 0
6 300 300 0
STEP 2: Create a boolean series, then use the transform method and the user-defined function posted by @BehRouz. This will create a dynamic serial no each time new rows are added to the dataframe.
start = df['serialno'].max()

def getvalue(x):
    global start
    if x:
        start += 1
    return start

df['serialno'] = (df['col1'] != df['col2']).transform(func=lambda x: getvalue(x))
print(df)
Iteration 1:
col1 col2 serialno
0 100 0 1
1 100 100 1
2 100 100 1
3 200 100 2
4 200 200 2
5 300 200 3
6 300 300 3
Iteration 2:
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
Iteration 3:
col1 col2 serialno
0 100 0 7
1 100 100 7
2 100 100 7
3 200 100 8
4 200 200 8
5 300 200 9
6 300 300 9

Split a data frame into six equal parts based on number of rows without knowing the number of rows

I have a df as shown below.
df:
ID Job Salary
1 A 100
2 B 200
3 B 20
4 C 150
5 A 500
6 A 600
7 A 200
8 B 150
9 C 110
10 B 200
11 B 220
12 A 150
13 C 20
14 B 50
I would like to split the df into 6 equal parts based on the number of rows.
Expected Output
df1:
ID Job Salary
1 A 100
2 B 200
3 B 20
df2:
ID Job Salary
4 C 150
5 A 500
6 A 600
df3:
ID Job Salary
7 A 200
8 B 150
df4:
ID Job Salary
9 C 110
10 B 200
df5:
ID Job Salary
11 B 220
12 A 150
df6:
ID Job Salary
13 C 20
14 B 50
Note: Since there are 14 rows, the first two dfs should have 3 rows and the remaining four dfs should have 2 rows each.
And I would like to save all the dfs as CSV files dynamically.
You can use np.array_split():
import numpy as np

dfs = np.array_split(df, 6)
for index, part in enumerate(dfs):  # loop variable renamed so the original df is not overwritten
    part.to_csv(f'df{index+1}.csv')
>>> print(dfs)
[ ID Job Salary
0 1 A 100
1 2 B 200
2 3 B 20,
ID Job Salary
3 4 C 150
4 5 A 500
5 6 A 600,
ID Job Salary
6 7 A 200
7 8 B 150,
ID Job Salary
8 9 C 110
9 10 B 200,
ID Job Salary
10 11 B 220
11 12 A 150,
ID Job Salary
12 13 C 20
13 14 B 50]
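For reference, np.array_split hands the remainder rows to the leading pieces, which is exactly the 3/3/2/2/2/2 split the note above asks for. A quick sketch of the size calculation (not part of the original answer):
# 14 rows over 6 parts: the first 14 % 6 = 2 parts get one extra row
q, r = divmod(len(df), 6)
sizes = [q + 1] * r + [q] * (6 - r)   # -> [3, 3, 2, 2, 2, 2]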

How to join pandas dataframes on 2 columns?

Assume the following DataFrames
df1:
id data1
1 10
2 200
3 3000
4 40000
df2:
id1 id2 data2
1 2 210
1 3 3010
1 4 40010
2 3 3200
2 4 40200
3 4 43000
I want the new df3:
id1 id2 data2 data11 data12
1 2 210 10 200
1 3 3010 10 3000
1 4 40010 10 40000
2 3 3200 200 3000
2 4 40200 200 40000
3 4 43000 3000 40000
What is the correct way to achieve this in pandas?
Edit: Please note that the specific data can be arbitrary. I chose this particular data just to show where everything comes from, but no data element has any correlation to any other.
Here are other example dataframes, since the first one wasn't clear enough:
df4:
id data1
1 a
2 b
3 c
4 d
df5:
id1 id2 data2
1 2 e
1 3 f
1 4 g
2 3 h
2 4 i
3 4 j
I want the new df6:
id1 id2 data2 data11 data12
1 2 e a b
1 3 f a c
1 4 g a d
2 3 h b c
2 4 i b d
3 4 j c d
Edit2:
data11 and data12 are simply copies of data1, looked up with the corresponding id (id1 or id2).
1. First merge both dataframes on the id1 and id columns.
2. Rename data1 to data11.
3. Drop the id column.
4. Now merge df3 and df1 on id2 and id.
df3 = pd.merge(df2, df1, left_on=['id1'], right_on=['id'], how='left')
df3.rename(columns={'data1': 'data11'}, inplace=True)
df3.drop('id', axis=1, inplace=True)
df3 = pd.merge(df3, df1, left_on=['id2'], right_on=['id'], how='left')
df3.rename(columns={'data1': 'data12'}, inplace=True)
df3.drop('id', axis=1, inplace=True)
I hope this solves your problem.
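A compact variant of the same idea (a sketch, not from the original answer) renames df1's columns before each merge, so no drop/rename is needed afterwards:
tmp = df2.merge(df1.rename(columns={'id': 'id1', 'data1': 'data11'}), on='id1', how='left')
df3 = tmp.merge(df1.rename(columns={'id': 'id2', 'data1': 'data12'}), on='id2', how='left')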
Try this:
# merge dataframes, first on id and id1 then on id2
df3 = pd.merge(df1, df2, left_on="id", right_on="id1", how="inner")
df3 = pd.merge(df1, df3, left_on="id", right_on="id2", how="inner")
# rename and reorder columns
cols = [ 'id1', 'id2', 'data2', 'data1_y', 'data1_x']
df3 = df3[cols]
new_cols = ["id1", "id2", "data2", "data11", "data12"]
df3.columns = new_cols
df3.sort_values("id1", inplace=True)
print(df3)
This prints out:
id1 id2 data2 data11 data12
0 1 2 210 10 200
1 1 3 3010 10 3000
2 1 4 40010 10 40000
3 2 3 3200 200 3000
4 2 4 40200 200 40000
5 3 4 43000 3000 40000
One solution to your problem is:
data1 = {'id': [1, 2, 3, 4],
         'data1': [10, 200, 3000, 40000]}
data2 = {'id1': [1, 1, 1, 2, 2, 3],
         'id2': [2, 3, 4, 3, 4, 4],
         'data2': [210, 3010, 40010, 3200, 40200, 43000]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1:
id data1
1 10
2 200
3 3000
4 40000
df2:
id1 id2 data2
1 2 210
1 3 3010
1 4 40010
2 3 3200
2 4 40200
3 4 43000
df3 = df2.set_index('id1').join(df1.set_index('id'))
df3.index.names = ['id1']
df3.reset_index(inplace=True)
final = df3.set_index('id2').join(df1.set_index('id'), rsuffix='2')
final.index.names = ['id2']
final.reset_index(inplace=True)
final[['id1','id2','data2','data1','data12']].sort_values('id1')
output df:
id1 id2 data2 data1 data12
1 2 210 10 200
1 3 3010 10 3000
1 4 40010 10 40000
2 3 3200 200 3000
2 4 40200 200 40000
3 4 43000 3000 40000
I hope this will help you.
Using merge in a for loop with range and f-string
One way we can generalise this, and make it easier to extend when there are more than two id columns, is to use a list comprehension with range.
After that we drop the duplicate column names:
dfs = [df2.merge(df1,
                 left_on=f'id{x+1}',
                 right_on='id',
                 how='left').rename(columns={'data1': f'data1{x+1}'}) for x in range(2)]

df = pd.concat(dfs, axis=1).drop('id', axis=1)
df = df.loc[:, ~df.columns.duplicated()]
Output
id1 id2 data2 data11 data12
0 1 2 210 10 200
1 1 3 3010 10 3000
2 1 4 40010 10 40000
3 2 3 3200 200 3000
4 2 4 40200 200 40000
5 3 4 43000 3000 40000
As @tawab_shakeel has mentioned earlier, your primary step is to merge the dataframes on a particular column based on certain (SQL) join rules; just so you understand the different approaches to merging on specific column(s), here is a general guide.
Joining Dataframes in Pandas
SQL Join Types
Use two left merges on columns id1 and id2 of dataframe df2:
txt="""id,data1
1,a
2,b
3,c
4,d
"""
import pandas as pd
from io import StringIO

f = StringIO(txt)
df1 = pd.read_table(f, sep=',')
df1['id'] = df1['id'].astype(int)
txt="""id1,id2,data2
1,2,e
1,3,f
1,4,g
2,3,h
2,4,i
3,4,j
"""
f = StringIO(txt)
df2 = pd.read_table(f, sep=',')
df2['id1'] = df2['id1'].astype(int)
df2['id2'] = df2['id2'].astype(int)
left_on = 'id1'
right_on = 'id'
suffix = '_1'
df2 = df2.merge(df1, how='left', left_on=left_on, right_on=right_on,
                suffixes=('', suffix))

left_on = 'id2'
right_on = 'id'
suffix = '_2'
df2 = df2.merge(df1, how='left', left_on=left_on, right_on=right_on,
                suffixes=('', suffix))

print(df2)
output
id1 id2 data2 id data1 id_2 data1_2
0 1 2 e 1 a 2 b
1 1 3 f 1 a 3 c
2 1 4 g 1 a 4 d
3 2 3 h 2 b 3 c
4 2 4 i 2 b 4 d
5 3 4 j 3 c 4 d
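If the helper key columns are unwanted, a final cleanup step (a sketch based on the printed columns above) would be:
out = df2.drop(columns=['id', 'id_2']).rename(columns={'data1': 'data11', 'data1_2': 'data12'})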

How to remove rows of a data frame when specific values are not in specific columns?

I have two data frames with four and two columns. For example:
A B C D
0 4 2 320 700
1 5 7 400 800
2 2 6 300 550
3 4 6 100 300
4 5 2 250 360
and
A B
0 2 4
1 5 7
2 2 5
I need to compare the first data frame with the second: if the values in columns A and B of a row in the first data frame both appear in columns A and B of some row in the second data frame, keep the whole row in the first data frame; otherwise remove the row. Order doesn't matter: in the first data frame's first row A is 4 and B is 2, while in the second data frame A is 2 and B is 4; that's fine, as long as both numbers appear in the columns. So the output will be:
A B C D
0 4 2 320 700
1 5 7 400 800
2 5 2 250 360
How can I get this output? (My actual data frames are huge and I can't iterate through them, so I need a fast, efficient way.)
I would do this by first sorting, then performing a LEFT OUTER JOIN using merge with an indicator to determine which rows to keep. Example:
u = df.loc[:, ['A', 'B']]
u.values.sort() # sort columns of `u`
df2.values.sort() # sort columns of `df2`
df[u.merge(df2, how='left', indicator='ind').eval('ind == "both"').values]
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
More info on joins with indicator can be found in my post: Pandas Merging 101
If you don't care about the final result being sorted or not, you can simplify this to an inner join.
import numpy as np

df[['A', 'B']] = np.sort(df[['A', 'B']])
df2[:] = np.sort(df2)
df.merge(df2, on=['A', 'B'])
A B C D
0 2 4 320 700
1 5 7 400 800
2 2 5 250 360
What I will do, using frozenset + isin (df2 here is the second data frame, as in the answers above):
yourdf = df[df[['A', 'B']].apply(frozenset, 1).isin(df2.apply(frozenset, 1))].copy()
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
Using np.equal.outer
arr = np.equal.outer(df, df2)
df.loc[arr.any(1).all(-1).any(-1)]
Outputs
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
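If mutating df and df2 in place is undesirable (the sort-based answers above overwrite the A/B columns), a non-destructive variant (a sketch, not one of the original answers) builds sorted keys on the side and filters with MultiIndex.isin:
import numpy as np
import pandas as pd

a = np.sort(df[['A', 'B']].to_numpy())   # sort each row so (4, 2) and (2, 4) compare equal
b = np.sort(df2[['A', 'B']].to_numpy())
key1 = pd.MultiIndex.from_arrays([a[:, 0], a[:, 1]])
key2 = pd.MultiIndex.from_arrays([b[:, 0], b[:, 1]])
out = df[key1.isin(key2)]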

Conditional shift in pandas

The following pandas DataFrame is an example that I need to deal with:
Group Amount
1 1 100
2 1 300
3 1 400
4 1 700
5 2 500
6 2 900
Here's the result that I want after calculation:
Group Amount Difference
1 1 100 100
2 1 300 200
3 1 400 100
4 1 700 300
5 2 500 500
6 2 900 400
I knew that df["Difference"] = df["Amount"] - df["Amount"].shift(-1) can produce the difference between all rows, but what can I do for the problem I have like this that needs a group as condition?
Groupby on 'Group' and call transform on the 'Amount' column; additionally, call fillna and pass the 'Amount' column:
In [110]:
df['Difference'] = df.groupby('Group')['Amount'].transform(pd.Series.diff).fillna(df['Amount'])
df
Out[110]:
Group Amount Difference
1 1 100 100
2 1 300 200
3 1 400 100
4 1 700 300
5 2 500 500
6 2 900 400
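A shorter equivalent (a sketch, assuming a reasonably recent pandas) calls diff directly on the groupby object:
df['Difference'] = df.groupby('Group')['Amount'].diff().fillna(df['Amount'])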
