How to map the values in a dataframe in pandas using Python

I have a df with these values:
df
name marks
mark 10
mark 40
tom 25
tom 20
mark 50
tom 5
tom 50
tom 25
tom 10
tom 15
How do I sum the marks for each name and count how many rows it took?
Expected output:
name total count
mark 100 3
tom 150 7
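For reference, the sample frame can be reconstructed like this (a minimal sketch of the data shown above):
import pandas as pd

df = pd.DataFrame({
    'name':  ['mark', 'mark', 'tom', 'tom', 'mark', 'tom', 'tom', 'tom', 'tom', 'tom'],
    'marks': [10, 40, 25, 20, 50, 5, 50, 25, 10, 15],
})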

Here it is possible to aggregate with named aggregations:
df = df.groupby('name').agg(total=('marks', 'sum'),
                            count=('marks', 'size')).reset_index()
print (df)
name total count
0 mark 100 3
1 tom 150 7
Or specify the column after groupby and pass tuples:
df = df.groupby('name')['marks'].agg([('total', 'sum'),
                                      ('count', 'size')]).reset_index()
print (df)
name total count
0 mark 100 3
1 tom 150 7
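If named aggregation is unavailable (pandas older than 0.25), a minimal sketch of the same result with a plain dict of functions and manual column flattening:
out = df.groupby('name').agg({'marks': ['sum', 'size']})
out.columns = ['total', 'count']   # flatten the ('marks', 'sum') / ('marks', 'size') MultiIndex
out = out.reset_index()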

Here's a solution. I'm doing it step-by-step for simplicity:
df["commulative_sum"] = df.groupby("name").cumsum()
df["commulative_sum_50"] = df["commulative_sum"] // 50
df["commulative_count"] = df.assign(one = 1).groupby("name").cumsum()["one"]
res = pd.pivot_table(df, index="name", columns="commulative_sum_50", values="commulative_count", aggfunc=min).drop(0, axis=1)
# the following two lines can be done in a loop if there are a lot of columns. I simplified it here.
res[3] = res[3]-res[2]
res[2] = res[2]-res[1]
res.columns = ["50-" + str(c) for c in res.columns]
The result is:
50-1 50-2 50-3
name
mark 2.0 1.0 NaN
tom 3.0 1.0 3.0
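As noted in the code comment, the two subtraction lines generalize to a loop when there are many bucket columns; a minimal sketch, applied in place of those two lines (before the columns are renamed):
# subtract right-to-left so each bucket keeps only the rows added since the previous 50-point block
bucket_cols = sorted(res.columns)            # e.g. [1, 2, 3] straight from the pivot
for i in range(len(bucket_cols) - 1, 0, -1):
    res[bucket_cols[i]] = res[bucket_cols[i]] - res[bucket_cols[i - 1]]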

Edit: I see you changed the "to 50" part, so this might not be relevant anymore.
I agree with Jezrael's answer, but in case you only want to count to 50, maybe something like this:
l = []
for name, group in df.groupby('name')['marks']:
    l.append({'name': name,
              'total': group.sum(),
              'count': group.loc[:group.eq(50).idxmax()].count()})
pd.DataFrame(l)
name total count
0 mark 100 3
1 tom 150 4
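If a group never contains a 50 this keeps all of its rows, unlike the loop above, which stops at the first row; with that caveat, a vectorized sketch of the same cap-at-50 idea:
hit_50 = df['marks'].eq(50)
after_50 = hit_50.astype(int).groupby(df['name']).cumsum()   # 0 until a 50 has been seen
keep = after_50.eq(0) | (hit_50 & after_50.eq(1))            # rows up to and including the first 50
out = (df.assign(keep=keep)
         .groupby('name')
         .agg(total=('marks', 'sum'), count=('keep', 'sum'))
         .reset_index())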

Related

Python Pandas: Conditional subtraction of data between two dataframes?

I'm trying to use conditional subtraction between two dataframes.
Dataframe df1 has columns name and price; name is not unique.
>>df1
name price
0 mark 50
1 mark 200
2 john 10
3 chris 500
Another dataframe, df2, has two columns, name and paid; here name is unique.
>>df2
name paid
0 mark 150
1 john 10
How can I conditionally subtract the two dataframes to get the following output?
Expected final output:
name price paid
0 mark 50 50
1 mark 200 100
2 john 10 10
3 chris 500 0
IIUC, you can use:
# mapper for paid values
s = df2.set_index('name')['paid']
df1['paid'] = (df1
   .groupby('name')['price']                      # for each name
   .apply(lambda g: g.cumsum()                    # sum the total owed
          .clip(upper=s.get(g.name, default=0))   # capped by the amount paid
          .pipe(lambda x: x.diff().fillna(x))     # compute the reverse cumsum
          )
)
output:
name price paid
0 mark 50 50.0
1 mark 200 100.0
2 john 10 10.0
3 chris 500 0.0
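To see why the cumsum/clip/diff chain works, here is the 'mark' group from the sample frames worked out in isolation:
import pandas as pd

g = pd.Series([50, 200])                           # the two 'mark' prices from df1
paid = 150                                         # df2's paid value for 'mark'
owed_so_far = g.cumsum().clip(upper=paid)          # [50, 150]: running total, capped by what was paid
per_row = owed_so_far.diff().fillna(owed_so_far)   # [50.0, 100.0]: back to per-row paid amounts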

Unable to apply where clause properly in a pandas DataFrame

Hi, I have the following data frame:
S.No Description amount
1 a, b, c 100
2 a, c 50
3 b, c 80
4 b, d 90
5 a 150
I want to extract only the rows containing 'a', for example:
Expected answer:
Description amount
a 100
a 50
a 150
and sum them up as
Description amount
a 300
But I am getting this answer:
Description amount
1 a 100
2 a 50
3 nan nan
4 nan nan
5 a 150
Please guide me on how to properly use a where clause on pandas dataframes.
Code:
filter = new_df ["Description"] =="a"
new_df.where(filter, inplace = True)
print (new_df)
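As an aside on the symptom itself: DataFrame.where keeps the original shape and fills non-matching rows with NaN, which is why rows 3 and 4 come back as nan; plain boolean indexing drops those rows instead. A minimal sketch of the difference:
mask = new_df["Description"] == "a"
masked = new_df.where(mask)    # same shape, non-matching rows become NaN
filtered = new_df[mask]        # non-matching rows are dropped entirely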
Use df.assign, Series.str.split, df.explode, df.query and Groupby.sum:
In [703]: df_a = df.assign(Description=df.Description.str.split(',')).explode('Description').query('Description == "a"')
In [704]: df_a
Out[704]:
S.No Description amount
0 1 a 100
1 2 a 50
4 5 a 150
In [706]: df_a.groupby('Description')['amount'].sum().reset_index()
Out[706]:
Description amount
0 a 300
Or as a one-liner:
df.assign(letters=df['Description'].str.split(r',\s'))\
  .explode('letters')\
  .query('letters == "a"')\
  .groupby('letters', as_index=False)['amount'].sum()
Here you go:
In [3]: df["Description"] = df["Description"].str.split(", ")
In [4]: df.explode("Description").groupby("Description", as_index=False).sum()[["Description", "amount"]]
Out[4]:
Description amount
0 a 300
1 b 270
2 c 230
3 d 90
This allows you to get all the sums for each description, not just the 'a' group.
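If only the 'a' row is wanted from that full result, a hedged follow-up (sums here is just a name for the frame shown in Out[4]):
sums = df.explode("Description").groupby("Description", as_index=False).sum()[["Description", "amount"]]
only_a = sums[sums["Description"] == "a"]   # keep just the 'a' row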

Transpose row values into specific columns in Pandas

I have a df like this:
MemberID FirstName LastName ClaimID Amount
0 1 John Doe 001A 100
1 1 John Doe 001B 150
2 2 Andy Right 004C 170
3 2 Andy Right 005A 200
4 2 Andy Right 002B 100
I need to pivot the values in the 'ClaimID' column for each member into one row, so each member has each claim in a separate column named Claim1..ClaimN (up to the maximum number of claims); the same logic applies to the Amount column. The output needs to look like this:
MemberID FirstName LastName Claim1 Claim2 Claim3 Amount1 Amount2 Amount3
0 1 John Doe 001A 001B NaN 100 150 NaN
1 2 Andy Right 004C 005A 002B 170 200 100
I am new to Pandas and got stuck on this; any help would be greatly appreciated.
The operation you need is not a transpose; transpose swaps the row and column indexes.
This approach groups by the identifying columns and, for each of the other columns, builds a dict of the values you want to become columns 1..n.
Part two expands out these dicts: applying pd.Series expands a series of dicts into columns.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(""" MemberID FirstName LastName ClaimID Amount
0 1 John Doe 001A 100
1 1 John Doe 001B 150
2 2 Andy Right 004C 170
3 2 Andy Right 005A 200
4 2 Andy Right 002B 100 """), sep=r"\s+")
cols = ["ClaimID","Amount"]
# aggregate to columns that define rows, generate a dict for other columns
df = df.groupby(
["MemberID","FirstName","LastName"], as_index=False).agg(
{c:lambda s: {f"{s.name}{i+1}":v for i,v in enumerate(s)} for c in cols})
# expand out the dicts and drop the now redundant columns
df = df.join(df["ClaimID"].apply(pd.Series)).join(df["Amount"].apply(pd.Series)).drop(columns=cols)
   MemberID FirstName LastName ClaimID1 ClaimID2 ClaimID3  Amount1  Amount2  Amount3
0         1      John      Doe     001A     001B      nan      100      150      nan
1         2      Andy    Right     004C     005A     002B      170      200      100
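An alternative sketch, not from the original answer, that avoids the dict aggregation: number each member's claims with cumcount and pivot (this assumes df is the original long frame loaded above, before the groupby step, and a pandas version whose pivot accepts list-valued index/values):
tmp = df.assign(n=df.groupby("MemberID").cumcount() + 1)        # 1, 2, ... per member
wide = tmp.pivot(index=["MemberID", "FirstName", "LastName"], columns="n",
                 values=["ClaimID", "Amount"])
wide.columns = [f"{col}{i}" for col, i in wide.columns]         # ClaimID1, ..., Amount3
wide = wide.reset_index()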

How to count the occurrence of values in one pandas Dataframe if the values to count are in another (in a faster way)?

I have a (really big) pandas Dataframe df:
country age gender
Brazil 10 F
USA 20 F
Brazil 10 F
USA 20 M
Brazil 10 M
USA 20 M
I have another pandas Dataframe freq:
age gender counting
10 F 0
10 M 0
20 F 0
I want to count how often each pair of values in freq occurs in df:
age gender counting
10 F 2
10 M 1
20 F 1
I'm using this code, but it takes too long:
for row in df.itertuples(index=False):
    freq.loc[np.all(freq[['age', 'gender']] == row[1:3], axis=1), 'counting'] += 1
Is there a faster way to do that?
Please note:
I have to use freq because not all combinations (for instance 20 and M) are desired
some columns in df may not be used
counting counts how many times both values appear in each row
freq may have more than 2 values to check for (this is just a small example)
You can do it with an inner merge to filter out the combinations in df you don't want, then group by age and gender and count the counting column. Just reset_index to fit your expected output.
freq = (df.merge(freq, on=['age', 'gender'], how='inner')
          .groupby(['age', 'gender'])['counting'].size()
          .reset_index())
print (freq)
age gender counting
0 10 F 2
1 10 M 1
2 20 F 1
Depending on the number of combinations you don't want, it could be faster to groupby on df before doing the merge like:
freq = (df.groupby(['age', 'gender']).size()
          .rename('counting').reset_index()
          .merge(freq[['age', 'gender']])
       )
Bringing NumPy into the mix for some performance (hopefully!), with the idea of dimensionality reduction to 1D so that we can use the efficient bincount:
import numpy as np
import pandas as pd

agec = np.r_[df.age, freq.age]
genderc = np.r_[df.gender, freq.gender]
aIDs, aU = pd.factorize(agec)
gIDs, gU = pd.factorize(genderc)
cIDs = aIDs * (gIDs.max() + 1) + gIDs
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max() + 1)
freq['counting'] = count[cIDs[-len(freq):]]
Sample run -
In [44]: df
Out[44]:
country age gender
0 Brazil 10 F
1 USA 20 F
2 Brazil 10 F
3 USA 20 M
4 Brazil 10 M
5 USA 20 M
In [45]: freq # introduced a missing element as the second row for variety
Out[45]:
age gender counting
0 10 F 2
1 23 M 0
2 20 F 1
Specific scenario optimization #1
If the age column is known to contain only integers, we can skip one factorize. So, skip aIDs, aU = pd.factorize(agec) and compute cIDs instead with:
cIDs = agec * (gIDs.max() + 1) + gIDs
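Putting that variant together with the lines above (a sketch under the assumption that age holds small non-negative integers):
agec = np.r_[df.age, freq.age]                         # the ages themselves serve as codes
gIDs, gU = pd.factorize(np.r_[df.gender, freq.gender])
cIDs = agec * (gIDs.max() + 1) + gIDs
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max() + 1)
freq['counting'] = count[cIDs[-len(freq):]]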
Another way is to use reindex to filter down to the freq list:
df.groupby(['gender', 'age']).count()\
  .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))
Output:
country
gender age
F 10 2
M 10 1
F 20 1
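To land exactly on the expected freq layout, a hedged follow-up on that result (the count ends up in the leftover country column, and combinations absent from df become NaN):
out = (df.groupby(['gender', 'age']).count()
         .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))
         .rename(columns={'country': 'counting'})
         .fillna(0)
         .reset_index()[['age', 'gender', 'counting']])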

How to Compare Values of two Dataframes in Pandas?

I have two dataframes df and df2 like this
id initials
0 100 J
1 200 S
2 300 Y
name initials
0 John J
1 Smith S
2 Nathan N
I want to compare the values in the initials columns of df and df2, and copy the name from df2 whose initial matches the initial in the first dataframe (df).
import pandas as pd
for i in df.initials:
    for j in df2.initials:
        if i == j:
            pass  # copy the name value of this particular initial to df
The output should be like this:
id name
0 100 John
1 200 Smith
2 300
Any idea how to solve this problem?
How about this:
df3 = df.merge(df2, on='initials',
               how='outer').drop(['initials'], axis=1).dropna(subset=['id'])
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0 NaN
So the 'initials' column is dropped and so is anything with np.nan in the 'id' column.
If you don't want the np.nan in there tack on a .fillna():
df3 = df.merge(df2, on='initials',
               how='outer').drop(['initials'], axis=1).dropna(subset=['id']).fillna('')
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0
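An alternative sketch, not from either answer: Series.map with df2 as a lookup table avoids the merge/drop round trip:
lookup = df2.set_index('initials')['name']      # initials -> name
df3 = df.assign(name=df['initials'].map(lookup).fillna('')).drop(columns='initials')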
df1
id initials
0 100 J
1 200 S
2 300 Y
df2
name initials
0 John J
1 Smith S
2 Nathan N
Use Boolean masks: df2.initials==df1.initials will tell you which values in the two initials columns are the same.
0 True
1 True
2 False
Use this mask to create a new column:
df1['name'] = df2.name[df2.initials==df1.initials]
Remove the initials column from df1:
df1 = df1.drop('initials', axis=1)
Replace the NaN using fillna(''):
df1.fillna('', inplace=True)  # inplace to avoid creating a copy
id name
0 100 John
1 200 Smith
2 300
