Python: how to drop duplicates with duplicates?

I have a dataframe like the following
df
Name Y
0 A 1
1 A 0
2 B 0
3 B 0
5 C 1
I want to drop the duplicate rows for each Name, keeping the row that has Y=1 where one exists, so that the result is:
df
Name Y
0 A 1
1 B 0
2 C 1

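Use the drop_duplicates method after sorting by Y in descending order, so that the row kept for each Name (the first one) is a Y=1 row whenever one exists: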
df.sort_values('Y', ascending= False).drop_duplicates(subset=['Name'])
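A minimal sketch of the same idea on the sample data, for anyone who wants to run it. The `kind="stable"` argument and the final `sort_values`/`reset_index` calls are my additions (they only make the row order predictable) and are not part of the answer above.

```python
import pandas as pd

# Rebuild the sample frame from the question (index 4 is missing there too).
df = pd.DataFrame(
    {"Name": ["A", "A", "B", "B", "C"], "Y": [1, 0, 0, 0, 1]},
    index=[0, 1, 2, 3, 5],
)

deduped = (
    df.sort_values("Y", ascending=False, kind="stable")  # Y=1 rows first
      .drop_duplicates(subset=["Name"])                  # keep first row per Name
      .sort_values("Name")                               # restore alphabetical order
      .reset_index(drop=True)
)
print(deduped)
```

Because drop_duplicates keeps the first occurrence by default, the sort order is what decides which duplicate survives.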

groupby + max
Assuming your Y series consists only of 0 and 1 values:
res = df.groupby('Name', as_index=False)['Y'].max()
print(res)
Name Y
0 A 1
1 B 0
2 C 1
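Note that groupby + max returns only the grouped and aggregated columns. If the real frame has other columns that should be kept, one possible variant is to select whole rows with idxmax instead. This is a sketch under that assumption; the column `Z` is hypothetical and does not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C"],
    "Y": [1, 0, 0, 0, 1],
    "Z": ["p", "q", "r", "s", "t"],  # hypothetical extra column to preserve
})

# idxmax gives, per Name, the index label of the first row with the largest Y;
# .loc then pulls those complete rows.
best_rows = df.loc[df.groupby("Name")["Y"].idxmax()].reset_index(drop=True)
print(best_rows)
```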

Does the 'Y' column contain only 0 and 1? In that case, you can try the following:
df = df.sort_values(['Y'], ascending= False)
df = df.drop_duplicates(['Name'])

Related

How to split comma separated text into columns on pandas dataframe?

I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix whose header contains all the unique values from column Data, meaning [a,b,c,d,e]. Each row should then hold a flag indicating whether that value is present in that particular row.
The matrix should look like this:
Data a b c d e
a,b,c 1 1 1 0 0
a,c,d 1 0 1 1 0
d,e 0 0 0 1 1
a,e 1 0 0 0 1
a,b,c,d,e 1 1 1 1 1
To split column Data, what I did is:
df['Data'].str.split(',', expand = True)
But I don't know how to proceed from there to assign the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep = r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, and then apply value_counts to each row with axis = 1.
Finally, fill the missing values with zero using fillna and convert the data to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
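For completeness, a short runnable sketch of the get_dummies approach on the question's data, including the concat step used in the earlier answer to attach the flags to the original column. The variable names `flags` and `result` are mine.

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One indicator column per distinct token found after splitting on ","
flags = df["Data"].str.get_dummies(sep=",")

# Attach the indicator columns to the original column
result = pd.concat([df, flags], axis=1)
print(result)
```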
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
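A related sketch, swapping pivot_table for pd.crosstab after the same split-and-explode step. This is a different technique from the answer above, shown only as an alternative; the column name `item` is my own choice. Like the pivot output, the rows come back sorted by the Data string rather than in the original order.

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One row per (original string, token) pair; reset the index so it is unique
exploded = (
    df.assign(item=df["Data"].str.split(","))
      .explode("item")
      .reset_index(drop=True)
)

# Count how often each token occurs for each original string
table = pd.crosstab(exploded["Data"], exploded["item"])
print(table)
```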
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)

how to handle unknown categorical value in one hot encoding in pandas

I have a pandas dataframe on which I do one hot encoding using get_dummies method.
Here is the sample code -
import pandas as pd
X = pd.DataFrame( ['a','a,b','a,c'], columns = ['category'])
X.head()
category
0 a
1 a,b
2 a,c
Here is how I do one hot encoding
X_transformed = pd.concat([X, X['category'].str.get_dummies(sep=',')], axis=1)
X_transformed.head()
category a b c
0 a 1 0 0
1 a,b 1 1 0
2 a,c 1 0 1
The problem is that when I get a record with an unknown categorical value, I don't know how best to handle it -
y = pd.DataFrame(['a','d'], columns = ['category'])
y.head()
category
0 a
1 d
If I again do get_dummies on this new dataframe, then I get something like
y_transformed = pd.concat([y, y['category'].str.get_dummies(sep=',')], axis=1)
y_transformed.head()
category a d
0 a 1 0
1 d 0 1
whereas my expected output is
category a b c
0 a 1 0 0
1 d 0 0 0
because category d was not seen in the first dataframe, so I want to ignore it by setting the flags in columns a, b and c to 0.
How can I achieve this in pandas?
Use DataFrame.reindex on axis=1 with fill_value=0:
y_transformed = y_transformed.reindex(X_transformed.columns, axis=1, fill_value=0)
Result:
category a b c
0 a 1 0 0
1 d 0 0 0
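Putting the pieces together, a minimal end-to-end sketch: remember the indicator columns seen on the first frame, then reindex the indicators of the new frame against them. The names `known` and `y_flags` are mine; reindexing only the indicator columns (rather than the concatenated frame) is a small variation on the answer above.

```python
import pandas as pd

X = pd.DataFrame(["a", "a,b", "a,c"], columns=["category"])
y = pd.DataFrame(["a", "d"], columns=["category"])

# Categories seen in the first dataframe
known = X["category"].str.get_dummies(sep=",").columns

# Encode the new data, drop unseen categories (d) and add missing ones (b, c) as 0
y_flags = (
    y["category"].str.get_dummies(sep=",")
     .reindex(columns=known, fill_value=0)
)
y_transformed = pd.concat([y, y_flags], axis=1)
print(y_transformed)
```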

Count the frequency of list element in a row grouped by Date and tag

I have a dataframe df which looks like this:
ID Date Input
1 1-Nov A,B
1 2-NOV A
2 3-NOV A,B,C
2 4-NOV B,D
I want my output to count the occurrences of each input while they are consecutive, and otherwise reset the count to zero (counting only within the same ID). The output columns should also be named X.A, X.B, X.C and X.D, so my output will look like this:
ID Date Input X.A X.B X.C X.D
1 1-NOV A,B 1 1 0 0
1 2-NOV A 2 0 0 0
2 3-NOV A,B,C 1 1 1 0
2 4-NOV B,D 0 2 0 1
How can I create the output columns (A, B, C and D) that count the input occurrences by date and ID?
Use Series.str.get_dummies to build indicator columns, then count consecutive 1s per group: take GroupBy.cumsum and subtract from it the running total carried forward by GroupBy.ffill from the last row where the indicator was 0. Finally, rename the columns with DataFrame.add_prefix and use DataFrame.join to attach them to the original:
a = df['Input'].str.get_dummies(',') == 1
b = a.groupby(df.ID).cumsum().astype(int)
df1 = (b-b.mask(a).groupby(df.ID).ffill().fillna(0).astype(int)).add_prefix('X.')
df = df.join(df1)
print (df)
ID Date Input X.A X.B X.C X.D
0 1 1-Nov A,B 1 1 0 0
1 1 2-NOV A 2 0 0 0
2 2 3-NOV A,B,C 1 1 1 0
3 2 4-NOV B,D 0 2 0 1
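The same steps wrapped in a function, as a sketch for experimenting. The function name `consecutive_counts`, the intermediate names, and the second frame `gap` (a made-up case with a break in the run) are mine; I also cast the indicators to int before the cumulative sum, which should not change the result.

```python
import pandas as pd

def consecutive_counts(df):
    """Per-ID running count of each input that resets when the input is absent."""
    flags = df["Input"].str.get_dummies(",") == 1
    # Running total of occurrences within each ID
    running = flags.astype(int).groupby(df["ID"]).cumsum()
    # Total reached at the most recent row where the input was absent
    baseline = running.mask(flags).groupby(df["ID"]).ffill().fillna(0).astype(int)
    return (running - baseline).add_prefix("X.")

df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Date": ["1-Nov", "2-NOV", "3-NOV", "4-NOV"],
    "Input": ["A,B", "A", "A,B,C", "B,D"],
})
counts = consecutive_counts(df)
print(df.join(counts))

# Hypothetical data with a gap: A appears, disappears, then appears again
gap = pd.DataFrame({"ID": [1, 1, 1], "Input": ["A", "B", "A"]})
gap_counts = consecutive_counts(gap)
print(gap_counts)
```

In the `gap` frame the count for A restarts at 1 on the third row, which is the reset behaviour the question asks for.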
First add indicator columns for each input, then use groupby to compute a cumulative sum within each ID:
import numpy as np
# find which columns to add
cols = set([l for sublist in df['Input'].apply(lambda x: x.split(',')).values for l in sublist])
# add the new indicator columns
for col in cols:
    df['X.' + col] = df['Input'].apply(lambda x: int(col in x.split(',')))
# group by ID and take the cumulative sum, zeroed wherever the indicator is 0
# (the sum is not restarted after a gap, so 1, 0, 1 becomes 1, 0, 2)
group = df.groupby('ID')
for col in cols:
    df['X.' + col] = group['X.' + col].transform(lambda x: np.cumsum(x) * (x > 0).astype(int))
The result is then:
print(df)
ID Date Input X.C X.D X.A X.B
0 1 1-NOV A,B 0 0 1 1
1 1 2-NOV A 0 0 2 0
2 2 3-NOV A,B,C 1 0 1 1
3 2 4-NOV B,D 0 1 0 2

Pandas Python : how to create multiple columns from a list

I have a list with columns to create :
new_cols = ['new_1', 'new_2', 'new_3']
I want to create these columns in a dataframe and fill them with zero :
df[new_cols] = 0
I get this error:
"['new_1', 'new_2', 'new_3'] not in index"
which is true but unfortunate as I want to create them...
EDIT: This is a duplicate of the question "Add multiple empty columns to pandas DataFrame". However, I am keeping this one too, because the accepted answer here was the simple solution I was looking for, and it was not the accepted answer there.
EDIT 2: While the accepted answer is the simplest, some interesting one-liner solutions were posted below.
You need to add the columns one by one.
for col in new_cols:
    df[col] = 0
Also see the answers in here for other methods.
Use assign by dictionary:
df = pd.DataFrame({
'A': ['a','a','a','a','b','b','b','c','d'],
'B': list(range(9))
})
print (df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 d 8
new_cols = ['new_1', 'new_2', 'new_3']
df = df.assign(**dict.fromkeys(new_cols, 0))
print (df)
A B new_1 new_2 new_3
0 a 0 0 0 0
1 a 1 0 0 0
2 a 2 0 0 0
3 a 3 0 0 0
4 b 4 0 0 0
5 b 5 0 0 0
6 b 6 0 0 0
7 c 7 0 0 0
8 d 8 0 0 0
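A compact sketch comparing the assign approach with one other option, DataFrame.reindex with fill_value. The reindex variant is my addition, not something from the answers here, and the two-row frame is just a stand-in for the real data. Both return a new frame and leave the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b"], "B": [0, 1]})
new_cols = ["new_1", "new_2", "new_3"]

# Option 1: assign, building the keyword arguments from the list
with_assign = df.assign(**dict.fromkeys(new_cols, 0))

# Option 2: reindex the columns; labels that do not exist yet are filled with 0
with_reindex = df.reindex(columns=[*df.columns, *new_cols], fill_value=0)

print(with_assign)
print(with_reindex)
```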
import pandas as pd
new_cols = ['new_1', 'new_2', 'new_3']
df = pd.DataFrame.from_records([(0, 0, 0)], columns=new_cols)
Is this what you're looking for ?
You can use assign:
new_cols = ['new_1', 'new_2', 'new_3']
values = [0, 0, 0] # could be anything, also pd.Series
df = df.assign(**dict(zip(new_cols, values)))
Try looping through the column names before creating the column:
for col in new_cols:
    df[col] = 0
We can use apply to pull each element of a list-valued column into its own new column.
For instance, take a dataframe with a column named keys whose entries are lists such as
[10,20,30]
In your case, since the values are all 0, you can assign 0 directly instead of looping. But if the values differ, you can populate the columns as below:
...
df['new_01']=df['keys'].apply(lambda x: x[0])
df['new_02']=df['keys'].apply(lambda x: x[1])
df['new_03']=df['keys'].apply(lambda x: x[2])
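As a sketch of the same expansion without one apply call per column: build a frame from the lists in one step and join it back. This assumes every list in `keys` has exactly three elements; the sample values and the name `expanded` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a column named keys whose entries are 3-element lists
df = pd.DataFrame({"keys": [[10, 20, 30], [40, 50, 60]]})
new_cols = ["new_01", "new_02", "new_03"]

# Each list becomes one row of the new frame, one element per column
expanded = pd.DataFrame(df["keys"].tolist(), index=df.index, columns=new_cols)
df = df.join(expanded)
print(df)
```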

Get substring in one column based on the value in another column

My dataframe looks like this:
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df
Length name
0 2 a12
1 1 b1
2 0 c
I would like to have a result like this:
Length name
0 2 a
1 1 b
2 0 c
Using this code, taken from "Getting substring based on another column in a pandas dataframe":
test_df.apply(lambda x: x['name'][:-x['Length']],axis = 1)
test_df
I get the same dataframe as before:
Length name
0 2 a12
1 1 b1
2 0 c
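Modify your apply a bit to slice with respect to len(x['name']), because x['name'][:-0] evaluates to an empty string when Length is 0. Note also that apply returns a new Series, so assign the result back if you want the dataframe to change -
def f(x):
    return x['name'][:len(x['name']) - x['Length']]
test_df.apply(f, axis=1)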
0 a
1 b
2 c
dtype: object
Try this:
import pandas as pd
test_df = pd.DataFrame({'name':['a12','b1','c'],'Length':[2,1,0]})
test_df['name']=test_df.apply(lambda x: x['name'][:len(x['name'])-x['Length']],axis = 1)
test_df
This will output as you intended
Length name
0 2 a
1 1 b
2 0 c
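A small sketch showing both points in one place: why the negative slice fails for a length of 0, and a list-comprehension version of the fix. The comprehension is my variation on the apply call above, not something from the answers.

```python
import pandas as pd

test_df = pd.DataFrame({"name": ["a12", "b1", "c"], "Length": [2, 1, 0]})

# -0 is just 0, so this slice is "c"[:0], i.e. an empty string
empty = "c"[:-0]

# Slice up to len(name) - Length instead, and assign the result back
test_df["name"] = [
    name[: len(name) - drop]
    for name, drop in zip(test_df["name"], test_df["Length"])
]
print(test_df)
```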
One can use list functions for this:
outlist = list(map(lambda x,y: x[0:(len(x)-y)], test_df.name, test_df.Length))
test_df.name = outlist
print(test_df)
Output:
Length name
0 2 a
1 1 b
2 0 c
