Python: how to drop duplicates with duplicates?


I have a dataframe like the following
df
Name Y
0 A 1
1 A 0
2 B 0
3 B 0
5 C 1
I want to drop the duplicates of Name and keep the ones that have Y=1 such as:
df
Name Y
0 A 1
1 B 0
2 C 1

Sort so that the rows with Y=1 come first, then use the drop_duplicates method, which keeps the first row for each Name:
df.sort_values('Y', ascending= False).drop_duplicates(subset=['Name'])
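As a minimal sketch, the same idea applied to the question's data might look like the following. The `kind="mergesort"` argument and the trailing `sort_index()` / `reset_index()` calls are additions for illustration (a stable sort, and restoring the original row order); they are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "A", "B", "B", "C"],
                   "Y": [1, 0, 0, 0, 1]})

# Put rows with Y == 1 first, then keep the first row seen for each Name.
# mergesort is a stable sort, so ties keep their original relative order.
res = (df.sort_values("Y", ascending=False, kind="mergesort")
         .drop_duplicates(subset=["Name"])
         .sort_index()               # back to the original row order
         .reset_index(drop=True))    # renumber the rows 0..n-1
print(res)
```

Without the last two calls the result is still correct, but the rows come back in sorted order with their original index labels.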

groupby + max
Assuming your Y series consists only of 0 and 1 values:
res = df.groupby('Name', as_index=False)['Y'].max()
print(res)
Name Y
0 A 1
1 B 0
2 C 1
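If the dataframe has other columns that should be kept alongside the winning row, one possible variation is to look up the row label of the maximum with `idxmax` instead of aggregating. This is only a sketch; the `Z` column is a hypothetical extra column added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "A", "B", "B", "C"],
                   "Y": [1, 0, 0, 0, 1],
                   "Z": ["p", "q", "r", "s", "t"]})  # Z is hypothetical

# For each Name, idxmax gives the index label of the first row with the largest Y.
keep = df.groupby("Name")["Y"].idxmax()

# Select those whole rows, so the other columns come along.
res = df.loc[keep].reset_index(drop=True)
print(res)
```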

Does the 'Y' column contain only 0 and 1? In that case, you can try the following:
df = df.sort_values(['Y'], ascending= False)
df = df.drop_duplicates(['Name'])

Related

How to split comma separated text into columns on pandas dataframe?

I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix whose header contains all the unique values from column Data, meaning [a,b,c,d,e], and whose rows contain a flag indicating whether each value is present in that particular row.
The matrix should look like this:
Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1
To separate column Data what I did is:
df['data'].str.split(',', expand = True)
Then I don't know how to proceed to allocate the flags to each of the columns.

Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep = r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, and then apply value_counts to each row with axis = 1.
Finally, fillna with zero and convert the data to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1

Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
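A short sketch of how this could be combined with the original column, using the question's data. The variable names `flags` and `out` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One 0/1 indicator column per distinct value found between the separators.
flags = df["Data"].str.get_dummies(sep=",")

# Attach the indicator columns to the original column (aligned on the index).
out = df.join(flags)
print(out)
```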

If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1

You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)

how to handle unknown categorical value in one hot encoding in pandas

I have a pandas dataframe on which I do one hot encoding using get_dummies method.
Here is the sample code -
import pandas as pd
X = pd.DataFrame( ['a','a,b','a,c'], columns = ['category'])
X.head()
category
0 a
1 a,b
2 a,c
Here is how I do one hot encoding
X_transformed = pd.concat([X, X['category'].str.get_dummies(sep=',')], axis=1)
X_transformed.head()
category a b c
0 a 1 0 0
1 a,b 1 1 0
2 a,c 1 0 1
The problem is that when I get a record with an unknown categorical value, I don't know how best to handle it -
y = pd.DataFrame(['a','d'], columns = ['category'])
y.head()
category
0 a
1 d
If I run get_dummies again on this new dataframe, then I get something like
y_transformed = pd.concat([y, y['category'].str.get_dummies(sep=',')], axis=1)
y_transformed.head()
category a d
0 a 1 0
1 d 0 1
whereas my expected output is
category a b c
0 a 1 0 0
1 d 0 0 0
because category d was never seen before in the first place, so I want to neglect it by making all flags of columns a,b,c as 0.
How can I achieve this in pandas?

Use DataFrame.reindex on axis=1 with fill_value=0:
y_transformed = y_transformed.reindex(X_transformed.columns, axis=1, fill_value=0)
Result:
category a b c
0 a 1 0 0
1 d 0 0 0
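Put together end to end, a sketch of this approach might look like the following. The names `known` and `y_flags` are introduced here for illustration; the idea is to remember the indicator columns seen in the first dataframe and force the new one onto exactly those columns.

```python
import pandas as pd

X = pd.DataFrame(["a", "a,b", "a,c"], columns=["category"])
y = pd.DataFrame(["a", "d"], columns=["category"])

# Categories seen in the first dataframe: a, b, c.
known = X["category"].str.get_dummies(sep=",").columns

# Encode the new data, then keep only the known columns:
# unseen categories (d) are dropped, missing ones (b, c) are filled with 0.
y_flags = (y["category"].str.get_dummies(sep=",")
             .reindex(columns=known, fill_value=0))

y_transformed = pd.concat([y, y_flags], axis=1)
print(y_transformed)
```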

Count the frequency of list element in a row grouped by Date and tag

I have a dataframe df which looks like this:
ID Date Input
1 1-Nov A,B
1 2-NOV A
2 3-NOV A,B,C
2 4-NOV B,D
I want my output to count the occurrences of each input while they are consecutive, and otherwise reset the count to zero (counting only within the same ID). The output columns should also be named X.A, X.B, X.C and X.D, so my output will look like this:
ID Date Input X.A X.B X.C X.D
1 1-NOV A,B 1 1 0 0
1 2-NOV A 2 0 0 0
2 3-NOV A,B,C 1 1 1 0
2 4-NOV B,D 0 2 0 1
How can I create the output columns (A, B, C and D) that count the input occurrences by date and ID?

Use Series.str.get_dummies for the indicator columns and then count consecutive 1s per group: take GroupBy.cumsum and subtract from it the GroupBy.ffill of the masked cumulative sum, change the column names with DataFrame.add_prefix, and finally use DataFrame.join to attach the result to the original:
a = df['Input'].str.get_dummies(',') == 1
b = a.groupby(df.ID).cumsum().astype(int)
df1 = (b-b.mask(a).groupby(df.ID).ffill().fillna(0).astype(int)).add_prefix('X.')
df = df.join(df1)
print (df)
ID Date Input X.A X.B X.C X.D
0 1 1-Nov A,B 1 1 0 0
1 1 2-NOV A 2 0 0 0
2 2 3-NOV A,B,C 1 1 1 0
3 2 4-NOV B,D 0 2 0 1
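To see why the subtraction produces a count that restarts after a gap, here is a sketch of the same steps with named intermediates, on a small hypothetical dataframe (a single ID where A disappears and then comes back). The variable names are chosen here for illustration.

```python
import pandas as pd

# Hypothetical data: one ID, where tag A is absent in the third row.
df = pd.DataFrame({"ID": [1, 1, 1, 1],
                   "Input": ["A,B", "A", "B", "A"]})

flags = df["Input"].str.get_dummies(",")     # 0/1 indicator per tag
present = flags.eq(1)                        # True where the tag occurs
running = flags.groupby(df["ID"]).cumsum()   # running total per ID

# Running total as of the most recent row where the tag was absent
# (0 if the tag has been present in every row so far).
at_last_gap = (running.mask(present)
                      .groupby(df["ID"]).ffill()
                      .fillna(0).astype(int))

# Occurrences since the last gap = length of the current consecutive run.
streak = (running - at_last_gap).add_prefix("X.")
out = df.join(streak)
print(out)
```

For tag A the running total is 1, 2, 2, 3 and the total at the last gap is 0, 0, 2, 2, so the difference is 1, 2, 0, 1: the count drops to zero in the row where A is missing and starts again from one afterwards.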

First add indicator columns for each input, then use groupby to build a cumulative sum:
import numpy as np
# find which columns to add
cols = set([l for sublist in df['Input'].apply(lambda x: x.split(',')).values for l in sublist])
# add the new columns
for col in cols:
    df['X.' + col] = df['Input'].apply(lambda x: int(col in x))
# cumulative sum per ID, zeroed where the input is absent
# (note: the running total itself does not restart after a gap)
group = df.groupby('ID')
for col in cols:
    df['X.' + col] = group['X.' + col].transform(lambda x: np.cumsum(x) * (x > 0).astype(int))
The result is then:
print(df)
ID Date Input X.C X.D X.A X.B
0 1 1-NOV A,B 0 0 1 1
1 1 2-NOV A 0 0 2 0
2 2 3-NOV A,B,C 1 0 1 1
3 2 4-NOV B,D 0 1 0 2

Pandas Python : how to create multiple columns from a list

I have a list with columns to create :
new_cols = ['new_1', 'new_2', 'new_3']
I want to create these columns in a dataframe and fill them with zero :
df[new_cols] = 0
Get error :
"['new_1', 'new_2', 'new_3'] not in index"
which is true but unfortunate as I want to create them...
EDIT : This is a duplicate of the question "Add multiple empty columns to pandas DataFrame". However, I am keeping this one too, because the accepted answer here was the simple solution I was looking for, and it was not the accepted answer there.
EDIT 2 : While the accepted answer is the most simple, interesting one-liner solutions were posted below

You need to add the columns one by one.
for col in new_cols:
    df[col] = 0
Also see the answers in here for other methods.
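As one such alternative, a sketch using `DataFrame.reindex` is shown below. The small dataframe is hypothetical; only `new_cols` comes from the question.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b"], "B": [0, 1]})  # hypothetical starting data
new_cols = ["new_1", "new_2", "new_3"]

# Ask for the existing columns plus the new ones;
# columns that do not exist yet are created and filled with 0.
df = df.reindex(columns=[*df.columns, *new_cols], fill_value=0)
print(df)
```

Unlike the loop, this returns a new dataframe rather than modifying the existing one in place.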

Use assign by dictionary:
df = pd.DataFrame({
'A': ['a','a','a','a','b','b','b','c','d'],
'B': list(range(9))
})
print (df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 d 8
new_cols = ['new_1', 'new_2', 'new_3']
df = df.assign(**dict.fromkeys(new_cols, 0))
print (df)
A B new_1 new_2 new_3
0 a 0 0 0 0
1 a 1 0 0 0
2 a 2 0 0 0
3 a 3 0 0 0
4 b 4 0 0 0
5 b 5 0 0 0
6 b 6 0 0 0
7 c 7 0 0 0
8 d 8 0 0 0

import pandas as pd
new_cols = ['new_1', 'new_2', 'new_3']
df = pd.DataFrame.from_records([(0, 0, 0)], columns=new_cols)
Is this what you're looking for ?

You can use assign:
new_cols = ['new_1', 'new_2', 'new_3']
values = [0, 0, 0] # could be anything, also pd.Series
df = df.assign(**dict(zip(new_cols, values)))

Try looping through the column names before creating the column:
for col in new_cols:
    df[col] = 0

We can use the apply function to go through the rows of the dataframe and assign each element of a list to a new field.
For instance, take a dataframe with a column named keys that holds a list such as
[10,20,30]
In your case, since the values are all 0, we can assign 0 directly instead of looping. But if we have actual values, we can populate the new columns as below
...
df['new_01']=df['keys'].apply(lambda x: x[0])
df['new_02']=df['keys'].apply(lambda x: x[1])
df['new_03']=df['keys'].apply(lambda x: x[2])

Get substring in one column based on the value in another column

My dataframe looks like this:
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df
Length name
0 2 a12
1 1 b1
2 0 c
I would like to have a result like this:
Length name
0 2 a
1 1 b
2 0 c
With this code (from "Getting substring based on another column in a pandas dataframe"):
test_df.apply(lambda x: x['name'][:-x['Length']],axis = 1)
test_df
I got the same dataframe as before:
Length name
0 2 a12
1 1 b1
2 0 c

Modify your apply a bit, to slice with respect to len(x['name']) -
def f(x):
    return x['name'][:len(x['name']) - x['Length']]
test_df.apply(f, axis=1)
0 a
1 b
2 c
dtype: object

Try this:
import pandas as pd
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df['name']=test_df.apply(lambda x: x['name'][:len(x['name'])-x['Length']],axis = 1)
test_df
This will output as you intended
Length name
0 2 a
1 1 b
2 0 c
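The reason for slicing relative to `len(...)` is that a negative-index slice breaks when the length to drop is 0: `s[:-0]` is the same as `s[:0]`, which is the empty string. A sketch of the same fix without `apply`, using a plain list comprehension over the two columns, could look like this:

```python
import pandas as pd

test_df = pd.DataFrame({"name": ["a12", "b1", "c"], "Length": [2, 1, 0]})

# Pitfall: dropping 0 characters with a negative slice empties the string.
assert "c"[:-0] == ""

# Slice up to len(s) - n instead, which is correct for n == 0 as well.
test_df["name"] = [s[:len(s) - n]
                   for s, n in zip(test_df["name"], test_df["Length"])]
print(test_df)
```

Note that the result has to be assigned back to the column; calling `apply` (or building the list) without assigning leaves `test_df` unchanged.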

One can use list functions for this:
outlist = list(map(lambda x,y: x[0:(len(x)-y)], test_df.name, test_df.Length))
test_df.name = outlist
print(test_df)
Output:
Length name
0 2 a
1 1 b
2 0 c
