I have the following dataframe:
import pandas as pd

dframe = pd.DataFrame({'col1': ['A']*3 + ['B']*4 + ['C', 'B', 'A'],
                       'col2': [2, 3, 4, 2, 4, 2, 1, 3, 4, 4]})
I want to remove the duplicates from both columns independently; the final result should look like this:
pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [2, 4, 3]})
I tried the following, but the result was not as expected:
dframe.drop_duplicates(subset=['col1'], keep='first')
Please help.
Thanks
Try, via agg() and dropna():
out = dframe.agg(lambda x: pd.Series(pd.unique(x))).dropna()
or via apply() and dropna():
out = dframe.apply(lambda x: pd.Series(pd.unique(x))).dropna()
Output of out:
col1 col2
0 A 2
1 B 3
2 C 4
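For reference, here is a dict-comprehension sketch that produces the same frame (an equivalent alternative, not the method from the answer above): pandas aligns the per-column unique Series on the index, and dropna() trims the padding rows.

import pandas as pd

dframe = pd.DataFrame({'col1': ['A']*3 + ['B']*4 + ['C', 'B', 'A'],
                       'col2': [2, 3, 4, 2, 4, 2, 1, 3, 4, 4]})

# One Series of uniques per column; the shorter column is padded with NaN,
# and dropna() removes those padded rows.
out = pd.DataFrame({c: pd.Series(dframe[c].unique()) for c in dframe}).dropna()
print(out)
#   col1  col2
# 0    A     2
# 1    B     3
# 2    C     4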
I am working on a dataframe (data shown in the image).
Q. I want the number of shows released per year, but when I apply count(), it gives me 6 instead of 3. Could anyone suggest how to get the correct count?
To get the unique count for a single year, you can use:
count = len(df.loc[df['release_year'] == 1945, 'show_id'].unique())
# or
count = df.loc[df['release_year'] == 1945, 'show_id'].nunique()
To summarize unique shows by year across the dataframe, you can call drop_duplicates() on the show_id column first:
df.drop_duplicates(subset=['show_id']).groupby('release_year').count()
Or use value_counts() on the release_year column after dropping duplicates:
df.drop_duplicates(subset=['show_id'])['release_year'].value_counts()
df.groupby('release_year')['show_id'].nunique()
should do the job.
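As a self-contained check (a minimal sketch with made-up data, since the original frame was only shared as an image), duplicated show_id rows within a year are counted once:

import pandas as pd

# Hypothetical sample: six rows but only three distinct shows in 1945.
df = pd.DataFrame({'show_id': ['s1', 's1', 's2', 's2', 's3', 's3'],
                   'release_year': [1945] * 6})

print(df.groupby('release_year')['show_id'].nunique())
# release_year
# 1945    3
# Name: show_id, dtype: int64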
I have a dataframe that (simplified) looks something like:
col1 col2
1 a
2 b
3 c,ddd,ee,f,5,hfsf,a
In col2, I need to keep at most the first three comma-separated values and drop the rest; if a value has no commas, it stays as is:
col1 col2
1 a
2 b
3 c,ddd,ee
Again, this is simplified; the solution needs to scale to thousands of rows, and the spacing between the commas will not always be the same.
Edit: this got me on the right track:
df.col2 = df.col2.str.split(',').str[:2].str.join(',')
Pandas provides access to many familiar string functions, including slicing and selection, through the .str attribute:
df.col2.str.split(',').str[:3].str.join(',')
#0 a
#1 b
#2 c,ddd,ee
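If you prefer a single pass with a regular expression (an alternative sketch, not the method above), str.extract can capture the first field plus at most two more:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c,ddd,ee,f,5,hfsf,a']})

# Capture one comma-free field, then up to two more ",field" repetitions.
df['col2'] = df['col2'].str.extract(r'^([^,]*(?:,[^,]*){0,2})', expand=False)
print(df)
#    col1      col2
# 0     1         a
# 1     2         b
# 2     3  c,ddd,ee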
My csv file looks like this:
5783,145v
g656,4589,3243,tt56
6579
How do I read this with pandas (or otherwise)?
(the table should contain empty cells)
You could pass a dummy separator, and then use str.split (by ",") with expand=True:
df = pd.read_csv('path/to/file.csv', sep=" ", header=None)
df = df[0].str.split(",", expand=True).fillna("")
print(df)
Output
      0     1     2     3
0  5783  145v
1  g656  4589  3243  tt56
2  6579
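For a quick, self-contained check of the dummy-separator trick (io.StringIO stands in for the file here, purely for reproducibility):

import io
import pandas as pd

raw = '5783,145v\ng656,4589,3243,tt56\n6579\n'

# Each line is read as one field (the data contains no spaces),
# then split on commas into real columns.
df = pd.read_csv(io.StringIO(raw), sep=' ', header=None)
df = df[0].str.split(',', expand=True).fillna('')
print(df)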
I think the solution proposed by @researchnewbie is good. If you need to replace the NaN values with, say, zero, you could add this line after the read:
dataFrame.fillna(0, inplace=True)
Try doing the following:
import pandas as pd

# names= pads the shorter rows, so the ragged lines parse instead of raising
# a tokenizing error (assumes at most 4 fields per line, as in the sample).
dataFrame = pd.read_csv(filename, header=None, names=range(4))
Your empty cells should contain the NaN value, which is essentially null.
I have a dataframe df with 3 columns :
df=pd.DataFrame({
'User':['A','A','B','A','C','B','C'],
'Values':['x','y','z','p','q','r','s'],
'Date':[14,11,14,12,13,10,14]
})
I want to create a new dataframe containing, for each user, the rows with the highest value in the 'Date' column. For the above dataframe, the desired result looks as follows (it was attached as a jpeg image):
Can anyone help me with this problem?
This answer keeps every row whose Date equals the per-user maximum, so ties are preserved:
In [10]: def get_max(group):
    ...:     return group[group.Date == group.Date.max()]
    ...:
In [12]: df.groupby('User').apply(get_max).reset_index(drop=True)
Out[12]:
Date User Values
0 14 A x
1 14 B z
2 14 C s
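Two equivalent one-liners for this data (sketches, not from the answer above) that keep exactly one row per user:

import pandas as pd

df = pd.DataFrame({'User': ['A', 'A', 'B', 'A', 'C', 'B', 'C'],
                   'Values': ['x', 'y', 'z', 'p', 'q', 'r', 's'],
                   'Date': [14, 11, 14, 12, 13, 10, 14]})

# Index of the max Date per user (first occurrence wins on ties).
print(df.loc[df.groupby('User')['Date'].idxmax()])

# Same idea without groupby: sort by Date, then keep the last row per user.
print(df.sort_values('Date').drop_duplicates('User', keep='last'))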
How can I create a date range in Python using pandas, in Y-M-D format?
import pandas as pd
df = pd.DataFrame([['2015-07-07','2016-09-22'],['2012-02-03','2013-02-19'],['2013-02-17','2013-03-22']],columns = ['start','end'])
# change strings to date format
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df['range'] = df['end']-df['start']
df
Output should be:
start end range
0 2015-07-07 2016-09-22 443 days
1 2012-02-03 2013-02-19 382 days
2 2013-02-17 2013-03-22 33 days
In case you want to read from csv, switch the beginning to:
df = pd.read_csv('file_name.csv')
In case you want a concatenated column:
df['details'] = [str(x) + ' - ' + str(y) + ' has ' + str(z)[:-9]
                 for x, y, z in zip(df['start'], df['end'], df['range'])]
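Continuing from the df above, if you want plain integer day counts instead of slicing the Timedelta string with [:-9] (an alternative sketch; the days column name is made up):

# Integer day counts instead of Timedelta objects.
df['days'] = df['range'].dt.days

# The concatenated column without string slicing.
df['details'] = (df['start'].dt.strftime('%Y-%m-%d') + ' - '
                 + df['end'].dt.strftime('%Y-%m-%d') + ' has '
                 + df['days'].astype(str) + ' days')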