I have two dataframes:
df:
id Name Number Stat
1 co 4
2 ma 98
3 sa 0
df1:
id Name Number Stat
1 co 4
2 ma 98 5%
I want to merge both dataframes into one (dfnew), and I want it as follows:
id Name Number Stat
1 co 4
2 ma 98 5%
3 sa 0
I used
dfnew = pd.concat([df, df1])
dfnew = dfnew.drop_duplicates(keep='last')
I am not getting the result I want: the dataframes are joined, but duplicates are not deleted. I need help please.
It seems you need to check only the first 3 columns for duplicates:
dfnew = pd.concat([df, df1]).drop_duplicates(subset=['id','Name','Number'], keep='last')
print (dfnew)
id Name Number Stat
2 3 sa 0 NaN
0 1 co 4 NaN
1 2 ma 98 5%
Alternatively, try the pd.merge function with an inner or outer join, depending on your requirement.
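For example, a possible merge-based sketch (assuming the frames are named df and df1 as in the question): do an outer join on the first three columns and keep df1's Stat where it exists.
dfnew = df.merge(df1, on=['id', 'Name', 'Number'], how='outer', suffixes=('', '_right'))
dfnew['Stat'] = dfnew['Stat_right'].combine_first(dfnew['Stat'])  # prefer df1's Stat where present
dfnew = dfnew.drop(columns='Stat_right')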
I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val']/df2['val'] for matching ids. I can do that by iterating over all df2 rows; since df2 is a subset of df1, some ids may be missing from it, and I want those values to stay unchanged. This is what I have right now:
for row in df2.iterrows():
    df1.loc[df1['id'] == row[1]['id'], 'val'] /= row[1]['val']
df1:
id val
1 5
2 5
3 78
How can I achieve the same without using for loop to improve speed?
Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
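For illustration, this is the intermediate Series produced by map, using the example frames above; ids missing from df2 become NaN, which div then treats as 1 because of fill_value=1.
# mapped values for df1's ids: 1 -> 5, 2 -> 8, 3 -> NaN (no match in df2)
df1['id'].map(df2.set_index('id')['val'])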
Solution with merge and a left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
I have the following Dataframe:
Timestamp  participant  level  gold  participant  level  gold
        1            1    100  6000            2     76  4200
        2            1    150  5000            2    120  3700
I am trying to change the Dataframe so that all rows from columns with the same name are moved below each other, while keeping the Timestamp column:
Timestamp  participant  level  gold
        1            1    100  6000
        2            1    150  5000
        1            2     76  4200
        2            2    120  3700
To be clear, the example above is a small sample; the actual Dataframe has a lot of columns named the same and a lot more rows. Hence, the solution needs to take that into account.
Thanks!
The idea is to deduplicate the duplicated column names with GroupBy.cumcount as a counter and then reshape with DataFrame.stack:
df = df.set_index('Timestamp')
s = df.columns.to_series()
df.columns = [df.columns, s.groupby(s).cumcount()]
df = df.stack().reset_index(level=1, drop=True).reset_index()
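For illustration, this is roughly what the column index looks like right after the cumcount assignment and before the stack, assuming the example frame above (duplicate names get counters 0, 1, ... in a second level):
# run between the column assignment and the stack
print(df.columns.tolist())
# [('participant', 0), ('level', 0), ('gold', 0), ('participant', 1), ('level', 1), ('gold', 1)]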
If the column names are not duplicated but instead suffixed with . and a number:
print (df)
Timestamp participant level gold participant.1 level.1 gold.1
0 1 1 100 6000 2 76 4200
1 2 1 150 5000 2 120 3700
df = df.set_index('Timestamp')
df.columns = pd.MultiIndex.from_frame(df.columns.str.split('.', expand=True)
.to_frame().fillna('0'))
df = df.stack().reset_index(level=1, drop=True).reset_index()
print (df)
0 Timestamp gold level participant
0 1 6000 100 1
1 1 4200 76 2
2 2 5000 150 1
3 2 3700 120 2
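If you want the columns back in their original order after the reshape, a possible final step (column names taken from the example above):
df.columns.name = None  # drop the leftover '0' columns-level name
df = df[['Timestamp', 'participant', 'level', 'gold']]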
Hope this helps
df1 = pd.concat([df.iloc[:, 0], df.loc[:, df.columns.duplicated()]], axis=1)  # Timestamp + second set of columns
df2 = df.loc[:, ~df.columns.duplicated()]                                     # Timestamp + first set of columns
df = pd.concat([df2, df1], axis=0, ignore_index=True)                         # stack the two sets on top of each other
I have a dataframe
city skills priority acknowledge id_count acknowledge_count
ABC XXX High Yes 11 2
ABC XXX High No 10 3
ABC XXX Med Yes 5 1
ABC YYY Low No 1 5
I want to group by city and skills and get total_id_count from the column id_count, split into three separate columns from priority: High, Med, Low.
Similarly, for total_acknowledge_count, split the acknowledge_count column by acknowledge (Yes, No).
output required:
             total_id_count        total_acknowledge_count
city,skills  High  Med  Low        Yes  No
ABC,XXX        22    5    0          3   3    # 22 = 11+10, 3 = 2+1
ABC,YYY         0    0    1          0   5
I have tried different methods like pivot_table and groupby with stack, but it seems very difficult.
Is there any way to achieve this result?
You'll need to pivot separately for the total_id_count and the total_acknowledge_count here, since you have two separate column/value schemes for the aggregation:
piv1 = df.pivot_table(index=['city', 'skills'], columns='priority',
values='id_count', aggfunc='sum', fill_value=0)
piv2 = df.pivot_table(index=['city', 'skills'], columns='acknowledge',
values='acknowledge_count', aggfunc='sum', fill_value=0)
piv1.columns = pd.MultiIndex.from_product([['id_count'], piv1.columns])
piv2.columns = pd.MultiIndex.from_product([['acknowledge_count'], piv2.columns])
output = pd.concat([piv1, piv2], axis=1)
print(output)
id_count acknowledge_count
High Low Med No Yes
city skills
ABC XXX 21 0 5 3 3
YYY 0 1 0 5 0
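If you need the exact labels and column order from the question (total_id_count with High/Med/Low, total_acknowledge_count with Yes/No), a possible follow-up step on the output frame built above:
output = output.reindex(columns=pd.MultiIndex.from_tuples(
    [('id_count', 'High'), ('id_count', 'Med'), ('id_count', 'Low'),
     ('acknowledge_count', 'Yes'), ('acknowledge_count', 'No')]))
output = output.rename(columns={'id_count': 'total_id_count',
                                'acknowledge_count': 'total_acknowledge_count'}, level=0)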
I have two dataframes df and df2 like this
id initials
0 100 J
1 200 S
2 300 Y
name initials
0 John J
1 Smith S
2 Nathan N
I want to compare the values in the initials columns of df and df2 and copy the name (from df2) whose initial matches the initial in the first dataframe (df):
import pandas as pd

for i in df.initials:
    for j in df2.initials:
        if i == j:
            pass  # copy the name value of this particular initial to df
The output should be like this:
id name
0 100 John
1 200 Smith
2 300
Any idea how to solve this problem?
How about:
df3 = df.merge(df2,on='initials',
how='outer').drop(['initials'],axis=1).dropna(subset=['id'])
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0 NaN
So the 'initials' column is dropped and so is anything with np.nan in the 'id' column.
If you don't want the np.nan in there, tack on a .fillna():
df3 = df.merge(df2,on='initials',
how='outer').drop(['initials'],axis=1).dropna(subset=['id']).fillna('')
>>> df3
id name
0 100.0 John
1 200.0 Smith
2 300.0
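One caveat: the outer join turns id into a float because of the NaN introduced for the unmatched row. If you only need the rows that exist in df, a left join avoids that (a sketch, assuming the same frames):
df3 = df.merge(df2, on='initials', how='left').drop(columns='initials').fillna('')
#     id   name
# 0  100   John
# 1  200  Smith
# 2  300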
df1
id initials
0 100 J
1 200 S
2 300 Y
df2
name initials
0 John J
1 Smith S
2 Nathan N
Use Boolean masks: df2.initials==df1.initials will tell you which values in the two initials columns are the same.
0 True
1 True
2 False
Use this mask to create a new column:
df1['name'] = df2.name[df2.initials==df1.initials]
Remove the initials column in df1:
df1 = df1.drop('initials', axis=1)
Replace the NaN using fillna(''):
df1.fillna('', inplace=True) #inplace to avoid creating a copy
id name
0 100 John
1 200 Smith
2 300
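Note that the Boolean mask only lines up here because the matching initials happen to sit at the same row positions in both frames. A sketch of an order-independent alternative using Series.map (same frames assumed):
df1['name'] = df1['initials'].map(df2.set_index('initials')['name'])
df1 = df1.drop('initials', axis=1).fillna('')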
I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want, producing NaN values where a column is absent from its respective df.
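Note that the columns in this output came back alphabetically sorted; depending on your pandas version you may want to pass sort=False (or reindex the columns afterwards) to keep the original order, for example, using the question's frame names:
pd.concat([df_may, df_jun], axis=0, ignore_index=True, sort=False)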
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
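If you'd rather avoid the internal parser API, a sketch of a small dedup helper using only public APIs; it mimics the name, name.1, name.2 convention shown above and reuses A, B and pd from the snippet:
from collections import Counter

def dedup_columns(cols):
    # append .1, .2, ... to repeated column names
    seen = Counter()
    out = []
    for c in cols:
        n = seen[c]
        out.append(c if n == 0 else f'{c}.{n}')
        seen[c] += 1
    return out

for df in [A, B]:
    df.columns = dedup_columns(df.columns)
pd.concat([A, B], ignore_index=True)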
I had this problem today using any of concat, append or merge; I got around it by adding a sequentially numbered helper column and then doing an outer join:
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2,on='helper',how='outer')
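A sketch of the same idea without the Python loops (assuming df1 and df2 are the frames being combined): give every row a unique helper key so the outer merge never matches anything, which effectively stacks the frames.
import numpy as np

df1['helper'] = np.arange(len(df1))                        # 0 .. len(df1)-1
df2['helper'] = np.arange(len(df1), len(df1) + len(df2))   # continue numbering so keys never collide
result = df1.merge(df2, on='helper', how='outer').drop(columns='helper')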