Python Pandas: Conditional subtraction of data between two dataframes?

I'm trying to use conditional subtraction between two dataframes.
Dataframe df1 has the columns name and price; name is not unique.
>>df1
name price
0 mark 50
1 mark 200
2 john 10
3 chris 500
The other dataframe, df2, has two columns, name and paid. Here name is unique.
>>df2
name paid
0 mark 150
1 john 10
How can I conditionally subtract the two dataframes to get the following output?
Expected final output:
name price paid
0 mark 50 50
1 mark 200 100
2 john 10 10
3 chris 500 0

IIUC, you can use:
# mapper for paid values
s = df2.set_index('name')['paid']
df1['paid'] = (df1
    .groupby('name')['price']                                # for each name
    .apply(lambda g: g.cumsum()                              # running total owed
                      .clip(upper=s.get(g.name, default=0))  # capped at the amount paid
                      .pipe(lambda s: s.diff().fillna(s))    # undo the cumsum to get per-row paid
    )
)
output:
name price paid
0 mark 50 50.0
1 mark 200 100.0
2 john 10 10.0
3 chris 500 0.0
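For comparison, here is a minimal self-contained sketch of the same idea without groupby.apply, with the sample frames rebuilt from the question (the intermediate names are just illustrative):
import pandas as pd

df1 = pd.DataFrame({'name': ['mark', 'mark', 'john', 'chris'],
                    'price': [50, 200, 10, 500]})
df2 = pd.DataFrame({'name': ['mark', 'john'], 'paid': [150, 10]})

# total paid per name, 0 when the name never paid anything
paid_total = df1['name'].map(df2.set_index('name')['paid']).fillna(0)

# running total owed per name, capped at what was actually paid
capped = df1.groupby('name')['price'].cumsum().clip(upper=paid_total)

# per-row paid = difference of the capped running total within each name
df1['paid'] = capped.groupby(df1['name']).diff().fillna(capped)
print(df1)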

Related

Pandas merge columns with the same name

I have the following DataFrame:
Timestamp  participant  level  gold  participant  level  gold
1          1            100    6000  2            76     4200
2          1            150    5000  2            120    3700
I am trying to change the DataFrame so that all rows from columns with the same name are moved below each other, while keeping the Timestamp column:
Timestamp  participant  level  gold
1          1            100    6000
2          1            150    5000
1          2            76     4200
2          2            120    3700
To be clear, the example above is a small sample; the actual DataFrame has many columns named the same and many more rows, so the solution needs to take that into account.
Thanks!
The idea is to deduplicate the duplicated column names with GroupBy.cumcount as a counter, and then reshape with DataFrame.stack:
df = df.set_index('Timestamp')
s = df.columns.to_series()
df.columns = [df.columns, s.groupby(s).cumcount()]
df = df.stack().reset_index(level=1, drop=True).reset_index()
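A minimal runnable version of the above, with the sample frame rebuilt from the question:
import pandas as pd

df = pd.DataFrame([[1, 1, 100, 6000, 2, 76, 4200],
                   [2, 1, 150, 5000, 2, 120, 3700]],
                  columns=['Timestamp', 'participant', 'level', 'gold',
                           'participant', 'level', 'gold'])

df = df.set_index('Timestamp')
s = df.columns.to_series()
df.columns = [df.columns, s.groupby(s).cumcount()]  # add a 0,1,... counter level
df = df.stack().reset_index(level=1, drop=True).reset_index()
print(df)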
If the column names are not duplicated but instead suffixed with . and a number:
print (df)
Timestamp participant level gold participant.1 level.1 gold.1
0 1 1 100 6000 2 76 4200
1 2 1 150 5000 2 120 3700
df = df.set_index('Timestamp')
df.columns = pd.MultiIndex.from_frame(df.columns.str.split('.', expand=True)
                                        .to_frame().fillna('0'))
df = df.stack().reset_index(level=1, drop=True).reset_index()
print (df)
0 Timestamp gold level participant
0 1 6000 100 1
1 1 4200 76 2
2 2 5000 150 1
3 2 3700 120 2
Hope this helps
# second occurrence of each duplicated column, together with Timestamp
df1 = pd.concat([df.iloc[:, 0], df.loc[:, df.columns.duplicated()]], axis=1)
# first occurrence of each column (Timestamp included)
df2 = df.loc[:, ~df.columns.duplicated()]
# stack the two blocks row-wise
df = pd.concat([df1, df2], ignore_index=True)

Transpose row values into specific columns in Pandas

I have a df like this:
MemberID FirstName LastName ClaimID Amount
0 1 John Doe 001A 100
1 1 John Doe 001B 150
2 2 Andy Right 004C 170
3 2 Andy Right 005A 200
4 2 Andy Right 002B 100
I need to transpose the values in the 'ClaimID' column for each member into one row, so each member will have each claim as a value in a separate column named 'Claim1' through 'ClaimN' (N = the maximum number of claims), and the same logic goes for the Amount column. The output needs to look like this:
MemberID FirstName LastName Claim1 Claim2 Claim3 Amount1 Amount2 Amount3
0 1 John Doe 001A 001B NaN 100 150 NaN
1 2 Andy Right 004C 005A 002B 170 200 100
I am new to Pandas and got myself stuck on this, any help would be greatly appreciated.
The operation you need is not a transpose; that swaps row and column indexes.
This approach uses groupby() on the identifying columns and, for each of the other columns, constructs a dict whose keys become the new columns 1..n.
Part two expands out these dicts: applying pd.Series turns a column of dicts into separate columns.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(""" MemberID FirstName LastName ClaimID Amount
0 1 John Doe 001A 100
1 1 John Doe 001B 150
2 2 Andy Right 004C 170
3 2 Andy Right 005A 200
4 2 Andy Right 002B 100 """), sep=r"\s+")

cols = ["ClaimID", "Amount"]

# aggregate to the columns that define rows, generating a dict for the other columns
df = df.groupby(
    ["MemberID", "FirstName", "LastName"], as_index=False).agg(
    {c: lambda s: {f"{s.name}{i+1}": v for i, v in enumerate(s)} for c in cols})

# expand out the dicts and drop the now redundant columns
df = df.join(df["ClaimID"].apply(pd.Series)).join(df["Amount"].apply(pd.Series)).drop(columns=cols)
  MemberID FirstName LastName ClaimID1 ClaimID2 ClaimID3 Amount1 Amount2 Amount3
0        1      John      Doe     001A     001B      nan     100     150     nan
1        2      Andy     Right     004C     005A     002B     170     200     100
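A different route to the same wide shape is to number each member's claims with GroupBy.cumcount and spread them into columns with unstack (a hedged sketch, with the sample data rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'MemberID': [1, 1, 2, 2, 2],
                   'FirstName': ['John', 'John', 'Andy', 'Andy', 'Andy'],
                   'LastName': ['Doe', 'Doe', 'Right', 'Right', 'Right'],
                   'ClaimID': ['001A', '001B', '004C', '005A', '002B'],
                   'Amount': [100, 150, 170, 200, 100]})

keys = ['MemberID', 'FirstName', 'LastName']
df['n'] = df.groupby(keys).cumcount() + 1                # 1, 2, 3, ... per member
wide = df.set_index(keys + ['n'])[['ClaimID', 'Amount']].unstack('n')
wide.columns = [f'{col}{n}' for col, n in wide.columns]  # ClaimID1, ..., Amount1, ...
wide = wide.reset_index()
print(wide)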

How to map the values in dataframe in pandas using python

I have a df with values:
df
name marks
mark 10
mark 40
tom 25
tom 20
mark 50
tom 5
tom 50
tom 25
tom 10
tom 15
How do I sum the marks for each name and count how many entries it took?
Expected output:
name total count
mark 100 3
tom 150 7
It is possible to aggregate with named aggregations:
df = df.groupby('name').agg(total=('marks', 'sum'),
                            count=('marks', 'size')).reset_index()
print (df)
name total count
0 mark 100 3
1 tom 150 7
Or specify the column after groupby and pass tuples:
df = df.groupby('name')['marks'].agg([('total', 'sum'),
                                      ('count', 'size')]).reset_index()
print (df)
name total count
0 mark 100 3
1 tom 150 7
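Equivalently, the same named aggregation can be spelled out with pandas.NamedAgg; a small runnable sketch with the sample data rebuilt from the question:
import pandas as pd

df = pd.DataFrame({'name': ['mark', 'mark', 'tom', 'tom', 'mark',
                            'tom', 'tom', 'tom', 'tom', 'tom'],
                   'marks': [10, 40, 25, 20, 50, 5, 50, 25, 10, 15]})

out = (df.groupby('name')
         .agg(total=pd.NamedAgg(column='marks', aggfunc='sum'),
              count=pd.NamedAgg(column='marks', aggfunc='size'))
         .reset_index())
print(out)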
Here's a solution. I'm doing it step-by-step for simplicity:
df["commulative_sum"] = df.groupby("name").cumsum()
df["commulative_sum_50"] = df["commulative_sum"] // 50
df["commulative_count"] = df.assign(one = 1).groupby("name").cumsum()["one"]
res = pd.pivot_table(df, index="name", columns="commulative_sum_50", values="commulative_count", aggfunc=min).drop(0, axis=1)
# the following two lines can be done in a loop if there are a lot of columns. I simplified it here.
res[3] = res[3]-res[2]
res[2] = res[2]-res[1]
res.columns = ["50-" + str(c) for c in res.columns]
The result is:
50-1 50-2 50-3
name
mark 2.0 1.0 NaN
tom 3.0 1.0 3.0
Edit: I see you changed the "to 50" part, so this might not be relevant anymore.
I agree with Jezrael's answer, but in case you only want to count up to 50, maybe something like this:
l = []
for name, group in df.groupby('name')['marks']:
    l.append({'name': name,
              'total': group.sum(),
              'count': group.loc[:group.eq(50).idxmax()].count()})
pd.DataFrame(l)
name total count
0 mark 100 3
1 tom 150 4

Summarizing a dataset and creating new variables

I have a Dataset that lists individual transactions by country, quarter, division, the transaction type and the value. I would like to sum it up based on the first three variables but create new columns for the other two. The dataset looks like this:
Country Quarter Division Type Value
A 1 Sales A 50
A 2 Sales A 150
A 3 Sales B 20
A 1 Sales A 250
A 2 Sales B 50
A 3 Sales B 50
A 2 Marketing A 50
Now I would like to aggregate the data to get the number of transactions by type as a new variable. The overall number of transactions grouped by the first three variables is easy:
df.groupby(['Country', 'Quarter', 'Division'], as_index=False).agg({'Type':'count', 'Value':'sum'})
However, I would like my new dataframe to look as follows:
Country Quarter Division Type_A Type_B Value_A Value_B
A 1 Sales 2 0 300 0
A 2 Sales 1 1 150 50
A 3 Sales 0 2 0 70
A 2 Marketing 1 0 50 0
How do I do that?
Specify the column after groupby, pass tuples to agg for the new column names with their aggregate functions, then reshape with DataFrame.unstack and finally flatten the MultiIndex columns with map:
df1 = (df.groupby(['Country', 'Quarter', 'Division', 'Type'])['Value']
         .agg([('Type', 'count'), ('Value', 'sum')])
         .unstack(fill_value=0))
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index()
print (df1)
Country Quarter Division Type_A Type_B Value_A Value_B
0 A 1 Sales 2 0 300 0
1 A 2 Marketing 1 0 50 0
2 A 2 Sales 1 1 150 50
3 A 3 Sales 0 2 0 70
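The same table can also be produced with pivot_table, passing both aggregations at once; a hedged sketch with the sample data rebuilt from the question:
import pandas as pd

df = pd.DataFrame({'Country': ['A'] * 7,
                   'Quarter': [1, 2, 3, 1, 2, 3, 2],
                   'Division': ['Sales'] * 6 + ['Marketing'],
                   'Type': ['A', 'A', 'B', 'A', 'B', 'B', 'A'],
                   'Value': [50, 150, 20, 250, 50, 50, 50]})

out = df.pivot_table(index=['Country', 'Quarter', 'Division'],
                     columns='Type', values='Value',
                     aggfunc=['count', 'sum'], fill_value=0)
# flatten the (aggfunc, Type) column pairs into Type_A/Type_B and Value_A/Value_B
out.columns = [('Type_' if f == 'count' else 'Value_') + t for f, t in out.columns]
out = out.reset_index()
print(out)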

How to Compare Values of two Dataframes in Pandas?

I have two dataframes, df and df2, like this:
id initials
0 100 J
1 200 S
2 300 Y
name initials
0 John J
1 Smith S
2 Nathan N
I want to compare the values in the initials columns of the two dataframes (df and df2) and copy the name (from df2) whose initial matches the initial in the first dataframe (df).
import pandas as pd

for i in df.initials:
    for j in df2.initials:
        if i == j:
            # copy the name value of this particular initial to df
            pass
The output should be like this:
id name
0 100 John
1 200 Smith
2 300
Any idea how to solve this problem?
How about?:
df3 = df.merge(df2, on='initials',
               how='outer').drop(['initials'], axis=1).dropna(subset=['id'])
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0 NaN
So the 'initials' column is dropped and so is anything with np.nan in the 'id' column.
If you don't want the np.nan in there, tack on a .fillna(''):
df3 = df.merge(df2, on='initials',
               how='outer').drop(['initials'], axis=1).dropna(subset=['id']).fillna('')
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0
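A left merge keeps exactly the rows of df, so the dropna step isn't needed and id stays an integer column (a hedged variant of the same idea):
df3 = df.merge(df2, on='initials', how='left').drop(columns='initials').fillna('')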
df1
id initials
0 100 J
1 200 S
2 300 Y
df2
name initials
0 John J
1 Smith S
2 Nathan N
Use Boolean masks: df2.initials==df1.initials will tell you which values in the two initials columns are the same.
0 True
1 True
2 False
Use this mask to create a new column:
df1['name'] = df2.name[df2.initials==df1.initials]
Remove the initials column in df1:
df1 = df1.drop('initials', axis=1)
Replace the NaN using fillna(''):
df1.fillna('', inplace=True)  # inplace to avoid creating a copy
id name
0 100 John
1 200 Smith
2 300
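Alternatively, since initials is unique in df2, Series.map can do the lookup without a merge (a hedged sketch, with the data rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'id': [100, 200, 300], 'initials': ['J', 'S', 'Y']})
df2 = pd.DataFrame({'name': ['John', 'Smith', 'Nathan'], 'initials': ['J', 'S', 'N']})

df['name'] = df['initials'].map(df2.set_index('initials')['name']).fillna('')
out = df.drop(columns='initials')
print(out)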
