How to drop Duplicates for each unique row value in Pandas?

I have the following dataframe:
import pandas as pd

df = pd.DataFrame({
    'ID': [42, 42, 42, 43, 43, 43, 58, 58, 58],
    'Thing': ['cup', 'cup', 'plate', 'plate', 'plate', 'plate', 'cup', 'cup', 'plate']
})
df
ID Thing
0 42 cup
1 42 cup
2 42 plate
3 43 plate
4 43 plate
5 43 plate
6 58 cup
7 58 cup
8 58 plate
I want to drop duplicates from the "Thing" column, but only for each unique ID. I want the result to look like this:
ID Thing
0 42 cup
2 42 plate
6 58 cup
8 58 plate
I tried this:
for id in df['ID'].unique():
    df = df.drop_duplicates(subset=['Thing'], keep='first')
But the result looks like this:
ID Thing
0 42 cup
2 42 plate
Does anyone know what is the best way to accomplish this in Pandas?

Try:
df = df.drop_duplicates(subset=['ID', 'Thing'])
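For reference, a minimal sketch of how this behaves on the frame above (note that ID 43 also keeps one plate row, since one row is kept per distinct (ID, Thing) pair):

import pandas as pd

df = pd.DataFrame({
    'ID': [42, 42, 42, 43, 43, 43, 58, 58, 58],
    'Thing': ['cup', 'cup', 'plate', 'plate', 'plate', 'plate', 'cup', 'cup', 'plate']
})

# keep the first occurrence of every (ID, Thing) combination
print(df.drop_duplicates(subset=['ID', 'Thing'], keep='first'))
#    ID  Thing
# 0  42    cup
# 2  42  plate
# 3  43  plate
# 6  58    cup
# 8  58  plate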

Related

Combine multiple related products into one in pandas dataframe

I have a dataframe with the following sample data
Product quantity sold
a 30
at 20
am 10
b 5
bn 7
bt 90
c 76
c1 67
ct 54
m 12
t 87
n 12
I want to group the products that start with a into a new product name Art, those that start with b under the name Brt, and those that start with c into Crt, while leaving products m, t and n as they are, to get something like the table below:
Product quantity sold
Art 60
Brt 102
Crt 197
m 12
t 87
n 12
Since you have complex conditions, it might be easiest to just rename the ones you want.
import pandas as pd

df = pd.DataFrame({'Product': ['a', 'at', 'am', 'b', 'bn', 'bt', 'c', 'c1', 'ct', 'm', 't', 'n'],
                   'quantity sold': [30, 20, 10, 5, 7, 90, 76, 67, 54, 12, 87, 12]})
df.loc[df['Product'].str.startswith('a'), 'Product'] = 'Art'
df.loc[df['Product'].str.startswith('b'), 'Product'] = 'Brt'
df.loc[df['Product'].str.startswith('c'), 'Product'] = 'Crt'
df.groupby('Product', as_index=False).sum()
Output
Product quantity sold
0 Art 60
1 Brt 102
2 Crt 197
3 m 12
4 n 12
5 t 87
You can do it using str.map and a dictionary:
grp = df['Product'].str[0].map({'a':'Art', 'b':'BRT', 'c':'CRT'}).fillna(df['Product'])
df.groupby(grp)['quantity sold'].sum()
Output:
Product
Art 60
BRT 102
CRT 197
m 12
n 12
t 87
Name: quantity sold, dtype: int64
Here, str[0] is a shortcut for .str.get(0), indexing the first character of each string; map then creates the desired groups, and values not covered by the mapping are filled back with the original values from df['Product']. Lastly, we group by the newly created series.
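To make the mapping step concrete, here is a minimal sketch of the intermediate grp series built from the question's data:

import pandas as pd

df = pd.DataFrame({'Product': ['a', 'at', 'am', 'b', 'bn', 'bt', 'c', 'c1', 'ct', 'm', 't', 'n'],
                   'quantity sold': [30, 20, 10, 5, 7, 90, 76, 67, 54, 12, 87, 12]})

# first character of each product mapped to a group label; products whose
# first character is not in the dict become NaN and are filled back in
grp = df['Product'].str[0].map({'a': 'Art', 'b': 'BRT', 'c': 'CRT'}).fillna(df['Product'])
print(grp.tolist())
# ['Art', 'Art', 'Art', 'BRT', 'BRT', 'BRT', 'CRT', 'CRT', 'CRT', 'm', 't', 'n']

print(df.groupby(grp)['quantity sold'].sum())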

make a change in each group after groupby

Suppose we have a dataframe like
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21 , 34, 37, 40, 55, 65, 67, 84, 88, 90 , 91, 97, 104,105, 108]
df = pd.DataFrame({'name' : names , 'time_of_action' : times})
We would like to group it by name and, within each group, measure times from 0. That is, in each group we want to subtract the minimum time_of_action of that group from all the times in that group. How could we do this systematically with pandas?
If I understand correctly, you want this:
df['new time'] = df['time_of_action']-df.groupby('name')['time_of_action'].transform('min')
df:
name time_of_action new time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
6 bob 67 46
7 ali 84 29
8 moji 88 0
9 ali 90 35
10 moji 91 3
11 ali 97 42
12 bob 104 83
13 bob 105 84
14 bob 108 87
Try this
df['new_time'] = df.groupby('name')['time_of_action'].apply(lambda x: x - x.min())
df
Output:
name time_of_action new_time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
Others have already answered, but here's mine:
import pandas as pd
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21 , 34, 37, 40, 55, 65, 67, 84, 88, 90 , 91, 97, 104,105, 108]
df = pd.DataFrame({'name' : names , 'time_of_action' : times})
def subtract_min(df):
    df['new_time'] = df['time_of_action'] - df['time_of_action'].min()
    return df

df.groupby('name').apply(subtract_min).sort_values('name')
As others have said, I am kind of guessing as well.

Pandas - normalize Json list

I am trying to normalize a column from a Pandas dataframe that contains lists of dictionaries (which can be missing).
Example to reproduce
import pandas as pd
bids = pd.Series([[{'price': 606, 'quantity': 28}, {'price': 588, 'quantity': 29},
                   {'price': 513, 'quantity': 33}],
                  [],
                  [{'price': 7143, 'quantity': 15}, {'price': 68, 'quantity': 91},
                   {'price': 6849, 'quantity': 12}]])
data = pd.DataFrame([1, 2, 3]).rename(columns={0: 'id'})
data['bids'] = bids
Desired output
id price quantity
1 606 28
1 588 29
1 513 33
3 7143 15
3 68 91
3 6849 12
Attempt
I'm trying to resolve this using pandas json_normalize, following the docs here. I'm confused about why none of the below work, and about what kind of record_path will fix my problem. All of the below raise errors.
pd.json_normalize(data['bids'])
pd.json_normalize(data['bids'],['price','quantity'])
pd.json_normalize(data['bids'],[['price','quantity']])
Use DataFrame.explode on the bids column, then create a new dataframe from the dictionaries in the exploded bids column and use DataFrame.join to join it back with df:
df = data.explode('bids').dropna(subset=['bids']).reset_index(drop=True)
df = df.join(pd.DataFrame(df.pop('bids').tolist()))
Result:
print(df)
id price quantity
0 1 606 28
1 1 588 29
2 1 513 33
3 3 7143 15
4 3 68 91
5 3 6849 12
Adding another approach, using np.repeat and np.concatenate together with json_normalize:
import numpy as np

out = pd.io.json.json_normalize(np.concatenate(data['bids']))
out.insert(0, 'id', np.repeat(data['id'], data['bids'].str.len()).to_numpy())
Or you can also use np.hstack as #Shubham mentions instead of np.concatenate:
out = pd.io.json.json_normalize(np.hstack(data['bids']))
print(out)
id price quantity
0 1 606 28
1 1 588 29
2 1 513 33
3 3 7143 15
4 3 68 91
5 3 6849 12
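Since the question asks specifically what kind of record_path would work, here is a hedged sketch (my own variation, not taken from the answers above): json_normalize expects list-of-dict records rather than a Series, so one option is to convert the frame to records first and pass record_path='bids' together with meta=['id']:

import pandas as pd

bids = pd.Series([[{'price': 606, 'quantity': 28}, {'price': 588, 'quantity': 29},
                   {'price': 513, 'quantity': 33}],
                  [],
                  [{'price': 7143, 'quantity': 15}, {'price': 68, 'quantity': 91},
                   {'price': 6849, 'quantity': 12}]])
data = pd.DataFrame([1, 2, 3]).rename(columns={0: 'id'})
data['bids'] = bids

# normalize each row's list of bids, carrying 'id' along as metadata;
# rows with an empty bids list simply contribute no records
out = pd.json_normalize(data.to_dict('records'), record_path='bids', meta=['id'])
print(out[['id', 'price', 'quantity']])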

Pandas: Convert annual data to decade data

Background
I want to determine the global cumulative value of a variable for different decades between 1990 and 2014, i.e. the 1990s, 2000s and 2010s (3 decades separately). I have annual data for different countries. However, data availability is not uniform.
Existing questions
Uses R: 1
Following questions look at date formatting issues: 2, 3
Answers to these questions do not address the current question.
Current question
How to obtain a global sum for the period of different decades using features/tools of Pandas?
Expected outcome
1990-2000 x1
2000-2010 x2
2010-2015 x3
Method used so far
data_binned = data_pivoted.copy()
decade = []

# obtaining decade values for each country
for i in range(1960, 2017):
    if i in list(data_binned):
        # adding the columns into the decade list
        decade.append(i)
    if i % 10 == 0:
        # adding large header so that newly created columns are set at the end of the dataframe
        data_binned[i * 10] = data_binned.apply(lambda x: sum(x[j] for j in decade), axis=1)
        decade = []

for x in list(data_binned):
    if x < 3000:
        # removing non-decade columns
        del data_binned[x]

# renaming the decade columns
new_names = [int(x / 10) for x in list(data_binned)]
data_binned.columns = new_names

# computing global values
global_values = data_binned.sum(axis=0)
This is not an optimal method, owing to my limited experience with Pandas. Kindly suggest a better method that uses Pandas features. Thank you.
If I had a pandas.DataFrame called df looking like this:
>>> df = pd.DataFrame(
... {
... 1990: [1, 12, 45, 67, 78],
... 1999: [1, 12, 45, 67, 78],
... 2000: [34, 6, 67, 21, 65],
... 2009: [34, 6, 67, 21, 65],
... 2010: [3, 6, 6, 2, 6555],
... 2015: [3, 6, 6, 2, 6555],
... }, index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5']
... )
>>> print(df)
1990 1999 2000 2009 2010 2015
country_1 1 1 34 34 3 3
country_2 12 12 6 6 6 6
country_3 45 45 67 67 6 6
country_4 67 67 21 21 2 2
country_5 78 78 65 65 6555 6555
I could make another pandas.DataFrame called df_decades with decade statistics like this:
>>> df_decades = pd.DataFrame()
>>>
>>> for decade in set([(col // 10) * 10 for col in df.columns]):
... cols_in_decade = [col for col in df.columns if (col // 10) * 10 == decade]
... df_decades[f'{decade}-{decade + 9}'] = df[cols_in_decade].sum(axis=1)
>>>
>>> df_decades = df_decades[sorted(df_decades.columns)]
>>> print(df_decades)
1990-1999 2000-2009 2010-2019
country_1 2 68 6
country_2 24 12 12
country_3 90 134 12
country_4 134 42 4
country_5 156 130 13110
The idea behind this is to iterate over all decades represented by the column names in df, filter the columns that belong to each decade, and aggregate them.
Finally, I could merge these data frames together, so that df is enriched with the decade statistics from df_decades.
>>> df = pd.merge(left=df, right=df_decades, left_index=True, right_index=True, how='left')
>>> print(df)
1990 1999 2000 2009 2010 2015 1990-1999 2000-2009 2010-2019
country_1 1 1 34 34 3 3 2 68 6
country_2 12 12 6 6 6 6 24 12 12
country_3 45 45 67 67 6 6 90 134 12
country_4 67 67 21 21 2 2 134 42 4
country_5 78 78 65 65 6555 6555 156 130 13110
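A more compact alternative sketch (my own variation, assuming the column labels are integer years as in the example above) groups the columns by decade directly:

import pandas as pd

df = pd.DataFrame(
    {1990: [1, 12, 45, 67, 78], 1999: [1, 12, 45, 67, 78],
     2000: [34, 6, 67, 21, 65], 2009: [34, 6, 67, 21, 65],
     2010: [3, 6, 6, 2, 6555], 2015: [3, 6, 6, 2, 6555]},
    index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5'])

# map each year column to its decade, then sum the columns within each decade
decades = df.columns // 10 * 10
df_decades = df.T.groupby(decades).sum().T
df_decades.columns = [f'{d}-{d + 9}' for d in df_decades.columns]
print(df_decades)

# global per-decade totals across all countries
print(df_decades.sum(axis=0))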

Python numpy where function behavior

I have a question regarding numpy's where condition. I am able to use a where condition with the == operator, but not with an "is one string a substring of another string?" condition.
CODE:
import pandas as pd
import datetime as dt
import numpy as np
data = {'name': ['Smith, Jason', 'Bush, Molly', 'Smith, Tina',
                 'Clinton, Jake', 'Hamilton, Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore',
                                 'postTestScore'])
print("BEFORE---- ")
print(df)
print("AFTER----- ")
df["Smith Family"] = np.where("Smith" in df['name'], 'Y', 'N')
print(df)
OUTPUT:
BEFORE-----
name age preTestScore postTestScore
0 Smith, Jason 42 4 25
1 Bush, Molly 52 24 94
2 Smith, Tina 36 31 57
3 Clinton, Jake 24 2 62
4 Hamilton, Amy 73 3 70
AFTER-----
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 N
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 N
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
Why does the numpy.where condition not work in the above case?
I had expected Smith Family to have the values
Y
N
Y
N
N
But I did not get that output. The output, as seen above, is all N, N, N, N, N.
Instead of the condition "Smith" in df['name'] I also tried str(df['name']).find("Smith") > -1, but that did not work either.
Any idea what is wrong or what could I have done differently?
I think you need str.contains for a boolean mask:
print (df['name'].str.contains("Smith"))
0 True
1 False
2 True
3 False
4 False
Name: name, dtype: bool
df["Smith Family"]=np.where(df['name'].str.contains("Smith"),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
Or str.startswith:
df["Smith Family"]=np.where(df['name'].str.startswith("Smith"),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
If you want to use in, which works with scalars, you need apply.
This solution is faster, but doesn't work if there are NaN values in the name column.
df["Smith Family"] = np.where(df['name'].apply(lambda x: "Smith" in x), 'Y', 'N')
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
The behavior of np.where("Smith" in df['name'], 'Y', 'N') depends on what df['name'] produces; I assume some sort of numpy array. The rest is numpy:
In [733]: x=np.array(['one','two','three'])
In [734]: 'th' in x
Out[734]: False
In [744]: 'two' in np.array(['one','two','three'])
Out[744]: True
in is a whole string test, both for a list and an array of strings. It's not a substring test.
np.char has a bunch of functions that apply string functions to elements of an array. These are roughly the equivalent of np.array([x.fn() for x in arr]).
In [754]: x=np.array(['one','two','three'])
In [755]: np.char.startswith(x,'t')
Out[755]: array([False, True, True], dtype=bool)
In [756]: np.where(np.char.startswith(x,'t'),'Y','N')
Out[756]:
array(['N', 'Y', 'Y'],
dtype='<U1')
Or with find:
In [760]: np.char.find(x,'wo')
Out[760]: array([-1, 1, -1])
The pandas .str accessor appears to do something similar, applying string methods to the elements of a Series.
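To illustrate that parallel on the question's data, a minimal sketch comparing np.char.find with its pandas Series.str.find counterpart:

import numpy as np
import pandas as pd

names = pd.Series(['Smith, Jason', 'Bush, Molly', 'Smith, Tina',
                   'Clinton, Jake', 'Hamilton, Amy'])

# element-wise substring search: -1 where not found, else the match position
print(np.char.find(names.to_numpy(dtype=str), 'Smith'))  # [ 0 -1  0 -1 -1]
print(names.str.find('Smith'))                           # the pandas equivalent, as a Series

# comparing either result to -1 gives a boolean mask that np.where can use
print(np.where(names.str.find('Smith') > -1, 'Y', 'N'))  # ['Y' 'N' 'Y' 'N' 'N']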
