I have a df like this:
MemberID FirstName LastName ClaimID Amount
0 1 John Doe 001A 100
1 1 John Doe 001B 150
2 2 Andy Right 004C 170
3 2 Andy Right 005A 200
4 2 Andy Right 002B 100
I need to transpose the values in the 'ClaimID' column for each member into one row, so each member will have each claim as a value in a separate column called 'Claim(1-MaxNumOfClaims)', and the same logic goes for the Amount columns. The output needs to look like this:
MemberID FirstName LastName Claim1 Claim2 Claim3 Amount1 Amount2 Amount3
0 1 John Doe 001A 001B NaN 100 150 NaN
1 2 Andy Right 004C 005A 002B 170 200 100
I am new to Pandas and got myself stuck on this, any help would be greatly appreciated.
The operation you need is not a transpose; a transpose swaps the row and column indexes. This approach instead uses groupby() on the identifying columns and, for each of the other columns, builds a dict mapping new column names 1..n to that member's values. Part two expands those dicts out: applying pd.Series turns a series of dicts into columns.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""MemberID FirstName LastName ClaimID Amount
0 1 John Doe 001A 100
1 1 John Doe 001B 150
2 2 Andy Right 004C 170
3 2 Andy Right 005A 200
4 2 Andy Right 002B 100"""), sep=r"\s+")

cols = ["ClaimID", "Amount"]
# aggregate on the columns that define rows, generating a dict for each of the other columns
df = df.groupby(
    ["MemberID", "FirstName", "LastName"], as_index=False).agg(
    {c: lambda s: {f"{s.name}{i+1}": v for i, v in enumerate(s)} for c in cols})
# expand out the dicts and drop the now redundant columns
df = df.join(df["ClaimID"].apply(pd.Series)).join(df["Amount"].apply(pd.Series)).drop(columns=cols)
   MemberID FirstName LastName ClaimID1 ClaimID2 ClaimID3  Amount1  Amount2  Amount3
0         1      John      Doe     001A     001B      NaN      100      150      NaN
1         2      Andy    Right     004C     005A     002B      170      200      100
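Note the generated names follow the source column names (ClaimID1, ClaimID2, ...), not Claim1 as asked in the question. If you want the exact names from the question, a small rename sketch:

df = df.rename(columns=lambda c: c.replace("ClaimID", "Claim"))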
I'm trying to use conditional subtraction between two dataframes.
Dataframe df1 has columns name and price; name is not unique.
>>df1
name price
0 mark 50
1 mark 200
2 john 10
3 chris 500
Another dataframe, df2, has two columns name and paid; here name is unique.
>>df2
name paid
0 mark 150
1 john 10
How can I conditionally subtract both dataframes to get the following output?
Expected final output:
name price paid
0 mark 50 50
1 mark 200 100
2 john 10 10
3 chris 500 0
IIUC, you can use:
# mapper for paid values
s = df2.set_index('name')['paid']
df1['paid'] = (df1
 .groupby('name')['price']                     # for each name
 .apply(lambda g: g.cumsum()                   # sum the total owed
        .clip(upper=s.get(g.name, default=0))  # in the limit of the paid
        .pipe(lambda s: s.diff().fillna(s))    # compute reverse cumsum
        )
)
output:
name price paid
0 mark 50 50.0
1 mark 200 100.0
2 john 10 10.0
3 chris 500 0.0
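To see how the chain works, here is a hand-worked trace of the intermediate steps for the 'mark' group:

g = df1.loc[df1['name'] == 'mark', 'price']  # 50, 200
owed = g.cumsum()                            # 50, 250: running total owed
capped = owed.clip(upper=150)                # 50, 150: mark only paid 150
capped.diff().fillna(capped)                 # 50, 100: per-row paid amounts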
I did this in R with this question, but switched to Python and still do not see a good answer.
I have a dataframe with 200 columns of different strings and numbers.
Example:
Name Gender Disease1 Disease2 Disease3
Joe Male disease1 NA disease3
Ben Male NA disease2 NA
Chloe Female disease1 disease2 NA
How can I convert the different Disease values in multiple columns into 1, and then mutate a new column counting the total number of 1s in specific columns (say, columns 22:65)?
Desired output
Name Gender Disease1 Disease2 Disease3 Total_diseases
Joe Male disease1 NA disease3 2
Ben Male NA disease2 NA 1
Chloe Female disease1 disease2 NA 2
I want a new column Total_diseases where all the text values (now converted to 1) are summed, so if one person has 10 diseases it will show up in this column. Hope that answers your questions.
You can set the index to id, then use notna() to flag the non-null entries and convert them to 1 with .astype(int). Then select the Disease* columns with .filter() and sum along axis=1 to get the count for each row:
df_out = df.set_index('id').notna().astype(int).reset_index()
df_out['Total_diseases'] = df_out.filter(like='Disease').sum(axis=1)
Result
print(df_out)
id Disease1 Disease2 Disease3 Total_diseases
0 1 1 0 1 2
1 2 0 1 0 1
2 3 1 1 0 2
Edit:
If you want to specify the range of columns by number, you can use .iloc, e.g. df_out.iloc[:, 10:30], and sum those columns with df_out.iloc[:, 10:30].sum(axis=1).
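For instance, assuming the disease columns sit at positions 10-29 (a hypothetical range; adjust to your data):

df_out['Total_diseases'] = df_out.iloc[:, 10:30].sum(axis=1)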
Edit 2:
According to the updated sample input and desired output, and since you want to refer to the range of columns by column numbers rather than by filtering on similar column labels as in the solution above (the real disease names probably share no common pattern), you can use .iloc instead, as follows:
df['Total_diseases'] = df.iloc[:, 2:5].notna().sum(axis=1)
Result
print(df)
Name Gender Disease1 Disease2 Disease3 Total_diseases
0 Joe Male disease1 NaN disease3 2
1 Ben Male NaN disease2 NaN 1
2 Chloe Female disease1 disease2 NaN 2
If you only want the "Total_diseases" column, you don't need to change anything to the original columns:
df = df.set_index('id')
df['Total_diseases'] = df.nunique(axis=1)
If there are initially columns other than 'Disease':
cols = df.filter(like='Disease').columns
df['Total_diseases'] = df[cols].nunique(axis=1)
This part just changes the original columns to 1 for the non-NA values:
cols = df.filter(like='Disease').columns
df[cols] = df[cols].where(df[cols].isna(), 1)
output:
Disease1 Disease2 Disease3 Total_diseases
id
1 1 NaN 1 2
2 NaN 1 NaN 1
3 1 1 NaN 2
I have a df with values:
df
name marks
mark 10
mark 40
tom 25
tom 20
mark 50
tom 5
tom 50
tom 25
tom 10
tom 15
How do I sum the marks per name and count how many entries each name has?
expected_output:
name total count
mark 100 3
tom 150 7
It's possible to aggregate with named aggregations:

df = df.groupby('name').agg(total=('marks', 'sum'),
                            count=('marks', 'size')).reset_index()
print (df)
name total count
0 mark 100 3
1 tom 150 7
Or specify the column after groupby and pass tuples:

df = df.groupby('name')['marks'].agg([('total', 'sum'),
                                      ('count', 'size')]).reset_index()
print (df)
name total count
0 mark 100 3
1 tom 150 7
Here's a solution. I'm doing it step-by-step for simplicity:

df["cumulative_sum"] = df.groupby("name")["marks"].cumsum()
df["cumulative_sum_50"] = df["cumulative_sum"] // 50
df["cumulative_count"] = df.assign(one=1).groupby("name").cumsum()["one"]
res = pd.pivot_table(df, index="name", columns="cumulative_sum_50",
                     values="cumulative_count", aggfunc=min).drop(0, axis=1)
# the following two lines can be done in a loop if there are a lot of columns; simplified here
res[3] = res[3] - res[2]
res[2] = res[2] - res[1]
res.columns = ["50-" + str(c) for c in res.columns]
The result is:
50-1 50-2 50-3
name
mark 2.0 1.0 NaN
tom 3.0 1.0 3.0
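As the comment in the code notes, the two differencing lines generalize to a loop when there are many threshold columns. A minimal sketch (run it before the column rename):

# difference each bucket column from the previous one, highest column first
for c in sorted(res.columns, reverse=True)[:-1]:
    res[c] = res[c] - res[c - 1]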
Edit: I see you changed the "to 50" part, so this might not be relevant anymore.
I agree with Jezrael's answer, but in case you only want to count to 50, maybe something like this:
l = []
for name, group in df.groupby('name')['marks']:
    l.append({'name': name,
              'total': group.sum(),
              'count': group.loc[:group.eq(50).idxmax()].count()})
pd.DataFrame(l)
name total count
0 mark 100 3
1 tom 150 4
When I try to sort my dataframe by the column "Number" I get this error:
1708 # Check for duplicates
KeyError: 'Number'
the dataframe looks something like this
Number Name City Sex
3 Jay A M
1 Marry A F
5 John B M
Number is int64 and the rest are objects
df.sort_values(by=['Number']) --> error
df.sort_values(by=['Name']) --> works
df.sort_values(by=['City']) --> error
df.sort_values(by=['Sex']) --> works
What I am looking for is something like this:
Number Name City Sex
1 Marry A F
3 Jay A M
5 John B M
I tried to make a DataFrame like yours, and sorting by the Number column works:

import pandas as pd

df = pd.DataFrame({'Number': [3, 1, 5],
                   'Name': ['Jay', 'Marry', 'John'],
                   'City': ['A', 'A', 'B'],
                   'Sex': ['M', 'F', 'M']})
print(df)
print(df.Number.dtype)
df=df.sort_values(by=['Number'])
print(df)
Output:
Number Name City Sex
0 3 Jay A M
1 1 Marry A F
2 5 John B M
int64
Number Name City Sex
1 1 Marry A F
0 3 Jay A M
2 5 John B M
Maybe there is whitespace in your column names; try this before sorting:
df.columns=df.columns.str.strip()
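A quick way to check this hypothesis (hypothetical data; note the trailing space in 'Number '):

df = pd.DataFrame({'Number ': [3, 1, 5], 'Name': ['Jay', 'Marry', 'John']})
df.sort_values(by=['Number'])        # KeyError: 'Number'
df.columns = df.columns.str.strip()  # 'Number ' -> 'Number'
df.sort_values(by=['Number'])        # works

This would also explain why Name and Sex sort fine while Number and City raise a KeyError.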
I have two dataframes df and df2 like this
id initials
0 100 J
1 200 S
2 300 Y
name initials
0 John J
1 Smith S
2 Nathan N
I want to compare the values in the initials columns of df and df2 and copy over the name (from df2) whose initial matches the initial in the first dataframe (df).
import pandas as pd

for i in df.initials:
    for j in df2.initials:
        if i == j:
            # copy the name value of this particular initial to df
The output should be like this:
id name
0 100 John
1 200 Smith
2 300
Any idea how to solve this problem?
How about:

df3 = df.merge(df2, on='initials',
               how='outer').drop(['initials'], axis=1).dropna(subset=['id'])
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0 NaN
So the 'initials' column is dropped, and so is any row with np.nan in the 'id' column.
If you don't want the np.nan in there, tack on a .fillna(''):

df3 = df.merge(df2, on='initials',
               how='outer').drop(['initials'], axis=1).dropna(subset=['id']).fillna('')
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0
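A note on the design choice: since every row of df is kept anyway, the same result follows from a left merge, which skips the dropna and keeps id as an integer (a sketch, not from the original answer):

df3 = df.merge(df2, on='initials', how='left').drop(['initials'], axis=1).fillna('')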
df1
id initials
0 100 J
1 200 S
2 300 Y
df2
name initials
0 John J
1 Smith S
2 Nathan N
Use Boolean masks: df2.initials == df1.initials tells you which values in the two initials columns match:
0 True
1 True
2 False
Use this mask to create a new column:
df1['name'] = df2.name[df2.initials==df1.initials]
Remove the initials column from df1 (note that drop returns a copy, so assign it back):

df1 = df1.drop('initials', axis=1)
Replace the NaN using fillna(''):
df1.fillna('', inplace=True) #inplace to avoid creating a copy
id name
0 100 John
1 200 Smith
2 300
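Putting the steps together (a sketch that assumes both frames share the same default RangeIndex, which is what makes the mask and the assignment line up row by row):

mask = df2['initials'] == df1['initials']
df1['name'] = df2['name'][mask]
df1 = df1.drop('initials', axis=1)
df1.fillna('', inplace=True)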