Assign unique id to columns pandas data frame - python

Hello, I have the following dataframe:
df =
A B
John Tom
Homer Bart
Tom Maggie
Lisa John
I would like to assign a unique ID to each name and get back
df =
A B C D
John Tom 0 1
Homer Bart 2 3
Tom Maggie 1 4
Lisa John 5 0
What I have done is the following:
LL1 = pd.concat([df.A, df.B], ignore_index=True)
LL1 = pd.DataFrame(LL1)
LL1.columns = ['a']
nameun = pd.unique(LL1.a.ravel())
NN = list(nameun)
LLout = df.copy()
LLout['c'] = 0
LLout['d'] = 0
for i in range(len(LLout)):
    LLout.c[i] = NN.index(LLout.A[i])
    LLout.d[i] = NN.index(LLout.B[i])
But since I have a very large dataset this process is very slow.

Here's one way. First get the array of unique names:
In [11]: df.values.ravel()
Out[11]: array(['John', 'Tom', 'Homer', 'Bart', 'Tom', 'Maggie', 'Lisa', 'John'], dtype=object)
In [12]: pd.unique(df.values.ravel())
Out[12]: array(['John', 'Tom', 'Homer', 'Bart', 'Maggie', 'Lisa'], dtype=object)
and make this a Series, mapping names to their respective numbers:
In [13]: names = pd.unique(df.values.ravel())
In [14]: names = pd.Series(np.arange(len(names)), names)
In [15]: names
Out[15]:
John 0
Tom 1
Homer 2
Bart 3
Maggie 4
Lisa 5
dtype: int64
Now use applymap and names.get to look up these numbers:
In [16]: df.applymap(names.get)
Out[16]:
A B
0 0 1
1 2 3
2 1 4
3 5 0
and assign it to the correct columns:
In [17]: df[["C", "D"]] = df.applymap(names.get)
In [18]: df
Out[18]:
A B C D
0 John Tom 0 1
1 Homer Bart 2 3
2 Tom Maggie 1 4
3 Lisa John 5 0
Note: This assumes that all the values are names to begin with; you may want to restrict this to certain columns only:
df[['A', 'B']].values.ravel()
...
df[['A', 'B']].applymap(names.get)
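As a side note (an editor's sketch, not part of the original answer): newer pandas also ships pd.factorize, which performs this encode-by-first-appearance in a single vectorized call on the same data:

```python
import pandas as pd

df = pd.DataFrame({"A": ["John", "Homer", "Tom", "Lisa"],
                   "B": ["Tom", "Bart", "Maggie", "John"]})

# factorize assigns integer codes in order of first appearance
# (row-major here, because of ravel)
codes, uniques = pd.factorize(df[["A", "B"]].values.ravel())
df[["C", "D"]] = codes.reshape(-1, 2)
print(df)
```

This reproduces the asker's desired numbering without any Python-level loop.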

(Note: I'm assuming you don't care about the precise details of the mapping -- which number John becomes, for example -- but only that there is one.)
Method #1: you could use a Categorical object as an intermediary:
>>> ranked = pd.Categorical(df.stack()).codes.reshape(df.shape)
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
A B C D
0 John Tom 2 5
1 Homer Bart 1 0
2 Tom Maggie 5 4
3 Lisa John 3 2
It feels like you should be able to treat a Categorical as providing an encoding dictionary somehow (whether directly or by generating a Series) but I can't see a convenient way to do it.
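For what it's worth, here is one sketch (an editor's addition, not code from the answer) of pulling such an encoding dictionary out of a Categorical: its categories are sorted, and each value's code is simply its position in that sorted list:

```python
import pandas as pd

df = pd.DataFrame({"A": ["John", "Homer", "Tom", "Lisa"],
                   "B": ["Tom", "Bart", "Maggie", "John"]})

cat = pd.Categorical(df.stack())
# categories are sorted; a value's code is its position in cat.categories
mapping = dict(zip(cat.categories, range(len(cat.categories))))
print(mapping)
```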
Method #2: you could use rank(method="dense"), which generates an increasing number for each value in order:
>>> ranked = df.stack().rank(method="dense").values.reshape(df.shape).astype(int) - 1
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
A B C D
0 John Tom 2 5
1 Homer Bart 1 0
2 Tom Maggie 5 4
3 Lisa John 3 2

Related

How to auto increment a counter by repeated values in a column

I have a data frame with the column name, and I need to create the column seq, which allows me to identify the different times that a name appears in the data frame; it's important to preserve the order.
import pandas as pd
data = {'name': ['Tom', 'Joseph', 'Joseph', 'Joseph', 'Tom', 'Tom', 'John', 'Tom', 'Tom', 'John', 'Joseph'],
        'seq': ['Tom 0', 'Joseph 0', 'Joseph 0', 'Joseph 0', 'Tom 1', 'Tom 1', 'John 0', 'Tom 2', 'Tom 2', 'John 1', 'Joseph 1']}
df = pd.DataFrame(data)
print(df)
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
Create a boolean mask to know whether the name has changed from the previous row. Then filter out the second, third, ... names of a sequence before grouping by name. cumcount increments the sequence number, and finally name and sequence number are concatenated.
# Boolean mask
m = df['name'].ne(df['name'].shift())
# Create sequence number
seq = df.loc[m].groupby('name').cumcount().astype(str) \
        .reindex(df.index, fill_value=pd.NA).ffill()
# Concatenate name and seq
df['seq'] = df['name'] + ' ' + seq
Output:
>>> df
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
>>> m
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 False
9 True
10 True
Name: name, dtype: bool
You need to check for the start of a new name run and then create a new index for each name using groupby and cumsum; the resulting string Series can be concatenated with str.cat:
df['seq'] = df['name'].str.cat(
    df['name'].ne(df['name'].shift()).groupby(df['name']).cumsum().sub(1).astype(str),
    sep=' '
)
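Broken into steps (an editor's sketch on the same data), the inner expression counts each name's runs: ne/shift flags where a run starts, and the grouped cumsum turns those flags into a per-name run counter:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Joseph", "Joseph", "Joseph", "Tom", "Tom",
                            "John", "Tom", "Tom", "John", "Joseph"]})

new_run = df["name"].ne(df["name"].shift())           # True at the start of each run
run_no = new_run.groupby(df["name"]).cumsum().sub(1)  # 0-based run counter per name
df["seq"] = df["name"].str.cat(run_no.astype(str), sep=" ")
print(df["seq"].tolist())
```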
Assuming your data frame is indexed sequentially (0, 1, 2, 3, ...):
Group the data frame by name
For each group, apply a gap-and-island algorithm: every time the index jumps by more than 1, create a new island
def sequencer(group):
    idx = group.index.to_series()
    # Every time the index has a gap > 1, create a new island
    return idx.diff().ne(1).cumsum().sub(1)
seq = df.groupby('name').apply(sequencer).droplevel(0).rename('seq')
df.merge(seq, left_index=True, right_index=True)

How to create a column that measures the number of items that exist in another string column?

I have the dataframe that has employees, and their level.
import pandas as pd
d = {'employees': ["John", "Jamie", "Ann", "Jane", "Kim", "Steve"], 'Level': ["A/Ba", "C/A", "A", "C", "Ba/C", "D"]}
df = pd.DataFrame(data=d)
How do I add a new column that measures the number of other employees with the same levels? For example, John would have 3, as there are two other A's (Jamie and Ann) and one other Ba (Kim). Note that it does not count the employee's own levels (John's, in this case) toward that count.
My goal is for the end dataframe to be this.
Try this:
df['Number of levels'] = (df['Level'].str.split('/').explode()
                            .map(df['Level'].str.split('/').explode().value_counts())
                            .sub(1).groupby(level=0).sum())
Output:
>>> df
employees Level Number of levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
exploded = df.Level.str.split("/").explode()
counts = exploded.groupby(exploded).transform("count").sub(1)
df["Num Levels"] = counts.groupby(level=0).sum()
We first explode the "Level" column by splitting over "/" so we can reach each level:
>>> exploded = df.Level.str.split("/").explode()
>>> exploded
0 A
0 Ba
1 C
1 A
2 A
3 C
4 Ba
4 C
5 D
Name: Level, dtype: object
We now need counts of each element in this series so we group by itself and transform by counts:
>>> exploded.groupby(exploded).transform("count")
0 3
0 2
1 3
1 3
2 3
3 3
4 2
4 3
5 1
Name: Level, dtype: int64
Since each element is included in its own count, but we only want to look at the others, we subtract 1:
>>> counts = exploded.groupby(exploded).transform("count").sub(1)
>>> counts
0 2
0 1
1 2
1 2
2 2
3 2
4 1
4 2
5 0
Name: Level, dtype: int64
Now we need to "come back" to the original rows, and the index is our helper for that; we group by it (level=0 means exactly that) and sum the counts:
>>> counts.groupby(level=0).sum()
0 3
1 4
2 2
3 2
4 3
5 0
Name: Level, dtype: int64
This is the end result and is assigned to df["Num Levels"].
to get
employees Level Num Levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
This is all writable in "1 line", but it may hinder readability and further debugging!
df["Num Levels"] = (df.Level
                      .str.split("/")
                      .explode()
                      .pipe(lambda ex: ex.groupby(ex))
                      .transform("count")
                      .sub(1)
                      .groupby(level=0)
                      .sum())

Insert a Zero in a Pandas Dataframe pd.count() Result < 1

I'm trying to find a method of inserting a zero into a pandas dataframe where the result of the .count() aggregate function is < 1. I've tried putting in a condition where it looks for null/None values and using a simple < 1 operator. So far I can only count instances where a categorical variable exists. Below is some example code to demonstrate my issue:
data = {'Person': ['Jim', 'Jim', 'Jim', 'Jim', 'Jim', 'Bob','Bob','Bob','Bob','Bob',], 'Result': ['Good', 'Good','Good','Good','Good','Good','Bad','Good','Bad','Bad',]}
dtf = pd.DataFrame.from_dict(data)
names = ['Jim','Bob']
append = []
for i in names:
    good = dtf[dtf['Person'] == i]
    good = good[good['Result'] == 'Good']
    if good['Result'].count() > 0:
        good.insert(2, "Count", good['Result'].count())
    elif good['Result'].count() < 1:
        good.insert(2, "Count", 0)
    bad = dtf[dtf['Person'] == i]
    bad = bad[bad['Result'] == 'Bad']
    if bad['Result'].count() > 0:
        bad.insert(2, "Count", bad['Result'].count())
    elif bad['Result'].count() < 1:
        bad.insert(2, "Count", 0)
    res = [good, bad]
    res = pd.concat(res)
    append.append(res)
    print(res)
The current output is:
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
Person Result Count
5 Bob Good 2
7 Bob Good 2
6 Bob Bad 3
8 Bob Bad 3
9 Bob Bad 3
What I am trying to achieve is a zero count for Jim for the 'Bad' variable in the dtf['Results'] column. Like this:
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0
Person Result Count
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3
I hope this makes sense. Vive la Resistance! └[∵┌]└[ ∵ ]┘[┐∵]┘
First create a MultiIndex mi from the product of Person and Result, so that combinations missing from df are kept. Then count all groups (size) and reindex by the MultiIndex. Finally, merge the two dataframes using the union of keys from both (how="outer").
mi = pd.MultiIndex.from_product([df["Person"].unique(),
                                 df["Result"].unique()],
                                names=["Person", "Result"])
out = df.groupby(["Person", "Result"]) \
        .size() \
        .reindex(mi, fill_value=0) \
        .rename("Count") \
        .reset_index()
out = out.merge(df, on=["Person", "Result"], how="outer")
>>> out
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3
And to recover your names and append variables from out:
names, append = list(zip(*out.groupby("Person")))
>>> names
('Bob', 'Jim')
>>> append
( Person Result Count
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3,
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0)
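An alternative sketch (the editor's, not from the answers above): unstack/stack the group sizes, letting fill_value supply the missing zero counts without building a MultiIndex by hand:

```python
import pandas as pd

dtf = pd.DataFrame({
    "Person": ["Jim"] * 5 + ["Bob"] * 5,
    "Result": ["Good"] * 5 + ["Good", "Bad", "Good", "Bad", "Bad"],
})

# unstack materializes every Person x Result combination; absent ones become 0
counts = (dtf.groupby(["Person", "Result"]).size()
             .unstack(fill_value=0)
             .stack()
             .rename("Count")
             .reset_index())
print(counts)
```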

Check if value from one dataframe exists in another dataframe

I have 2 dataframes.
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
I want to loop over every row in Df1['name'] and check if each name is somewhere in Df2['IDs'].
The result should return 1 if the name is in there, 0 if it is not like so:
Marc 1
Jake 1
Sam 0
Brad 0
Thank you.
Use isin
Df1.name.isin(Df2.IDs).astype(int)
0 1
1 1
2 0
3 0
Name: name, dtype: int32
Show result in data frame
Df1.assign(InDf2=Df1.name.isin(Df2.IDs).astype(int))
name InDf2
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
In a Series object
pd.Series(Df1.name.isin(Df2.IDs).values.astype(int), Df1.name.values)
Marc 1
Jake 1
Sam 0
Brad 0
dtype: int32
This should do it:
Df1 = Df1.assign(result=Df1['name'].isin(Df2['IDs']).astype(int))
By using merge
s=Df1.merge(Df2,left_on='name',right_on='IDs',how='left')
s.IDs=s.IDs.notnull().astype(int)
s
Out[68]:
name IDs
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
This is one way. Convert to set for O(1) lookup and use astype(int) to represent Boolean values as integers.
values = set(Df2['IDs'])
Df1['Match'] = Df1['name'].isin(values).astype(int)
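One more sketch (an editor's addition): merge's indicator flag makes the membership test explicit by recording where each row was found:

```python
import pandas as pd

Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})

# _merge says whether each row matched: 'both' or 'left_only'
s = Df1.merge(Df2.drop_duplicates(), left_on='name', right_on='IDs',
              how='left', indicator=True)
Df1['Match'] = s['_merge'].eq('both').astype(int)
print(Df1)
```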

Create a column with a random number for each exact match of two columns in pandas dataframe

I'm trying to create a NEW_ID column with a unique value for each exact match of FIRST_NM, LAST_NM.
data = np.array([['John', 'Smith', 1], ['John', 'West', 7], ['Eric', 'Adams', 9],
['Jane', 'Doe', 14], ['Jane', 'Doe', 16], ['John', 'Smith', 19]])
df = pd.DataFrame(data, columns=['FIRST_NM', 'LAST_NM', 'PAGE_NUM'])
FIRST_NM LAST_NM PAGE_NUM
0 John Smith 1
1 John West 7
2 Eric Adams 9
3 Jane Doe 14
4 Jane Doe 16
5 John Smith 19
The desired dataframe:
FIRST_NM LAST_NM PAGE_NUM NEW_ID
0 John Smith 1 654
1 John West 7 123
2 Eric Adams 9 78
3 Jane Doe 14 3
4 Jane Doe 16 3
5 John Smith 19 654
I figured I should do something like the code below but I know it's not right ...
import random
df.groupby(['FIRST_NM', 'LAST_NM']).apply(lambda group: random.getrandbits(16))
Your original version would work if you used transform, which broadcasts the result back up to the original indices:
>>> df["NEW_ID"] = df.groupby(['FIRST_NM', 'LAST_NM']).transform(
...     lambda group: random.getrandbits(16))
>>> df
FIRST_NM LAST_NM PAGE_NUM NEW_ID
0 John Smith 1 57757
1 John Smith 7 57757
2 Eric Adams 9 46139
3 Jane Doe 14 55091
4 Jane Doe 16 55091
5 John Smith 19 57757
But I'm not a big fan of just taking random numbers and hoping for the best (i.e. no collisions). If you have a range-like index like your example has, you can use that instead:
>>> df.groupby(['FIRST_NM', 'LAST_NM'])["PAGE_NUM"].transform("idxmin")
0 0
1 0
2 2
3 3
4 3
5 0
dtype: int64
Or the ranked version:
>>> df.groupby(['FIRST_NM', 'LAST_NM'])["PAGE_NUM"].transform("idxmin").rank(method="dense")
0 1
1 1
2 2
3 3
4 3
5 1
dtype: float64
Once you have those you can map them into unique random numbers safely however you like.
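One safe way to do that final mapping (an editor's sketch): factorize the name pairs into group numbers, then draw the random IDs with random.sample, which returns distinct values and so cannot collide:

```python
import random
import pandas as pd

df = pd.DataFrame({"FIRST_NM": ["John", "John", "Eric", "Jane", "Jane", "John"],
                   "LAST_NM":  ["Smith", "West", "Adams", "Doe", "Doe", "Smith"],
                   "PAGE_NUM": [1, 7, 9, 14, 16, 19]})

# factorize the (first, last) pairs into group numbers 0..k-1
keys = pd.Series(list(zip(df["FIRST_NM"], df["LAST_NM"])))
codes, uniques = pd.factorize(keys)
# random.sample draws k *distinct* values, so no two groups share an ID
ids = random.sample(range(1 << 16), k=len(uniques))
df["NEW_ID"] = [ids[c] for c in codes]
```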
Unfortunately, I don't think the one place the group assignments actually live is guaranteed public API, namely:
>>> grouped = df.groupby(["FIRST_NM", "LAST_NM"])
>>> grouped.grouper.group_info[0]
array([2, 2, 0, 1, 1, 2], dtype=int64)
I wouldn't mind a groupcount() method which returned either this or the "rank in order of first occurrence" version.
You could add a column using some good hash function: either a faster but less secure one (like CityHash in the example below), or a crypto-secure hash, or even some AES-based transformation. Obviously, if the name is the same, the ID will be the same. Last name and first name are joined by _; you could use any symbol you want.
import numpy as np
import pandas as pd
import pyhash
data = np.array([['John', 'Smith', 1], ['John', 'Smith', 7], ['Eric', 'Adams', 9],
['Jane', 'Doe', 14], ['Jane', 'Doe', 16], ['John', 'Smith', 19]])
df = pd.DataFrame(data, columns=['FIRST_NM', 'LAST_NM', 'PAGE_NUM'])
print(df)
hasher = pyhash.city_64()
df['FULL_ID'] = df[['FIRST_NM', 'LAST_NM']].apply(lambda x: hasher('_'.join(x)), axis=1)
print(df)
You don't really need to use groupby. You're probably better off making a dict with the mapping and then just using map to assign it:
nameIDs = {name: ix for name, ix in zip(df.FIRST_NM.unique(), range(df.FIRST_NM.nunique()))}
df['NEWID'] = df.FIRST_NM.map(nameIDs)
Then:
>>> df
FIRST_NM LAST_NM PAGE_NUM NEWID
0 John Smith 1 0
1 John Smith 7 0
2 Eric Adams 9 1
3 Jane Doe 14 2
4 Jane Doe 16 2
5 John Smith 19 0
Here I just generated the IDs as sequential integers. You can certainly adapt this to use random numbers if you want, although I don't really see why you would want to.
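If the ID really should depend on both name columns (the asker's actual requirement), groupby(...).ngroup() hands each (first, last) pair its own sequential number; a small sketch by the editor:

```python
import pandas as pd

df = pd.DataFrame({"FIRST_NM": ["John", "John", "Eric", "Jane", "Jane", "John"],
                   "LAST_NM":  ["Smith", "West", "Adams", "Doe", "Doe", "Smith"],
                   "PAGE_NUM": [1, 7, 9, 14, 16, 19]})

# ngroup numbers the groups (in sorted key order by default)
df["NEW_ID"] = df.groupby(["FIRST_NM", "LAST_NM"]).ngroup()
print(df)
```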
I'm sure you've found an answer in the past 5 years, but: just create a key column from the combination of first and last names, and then run the code above based on that key.
