Adding unique identifiers to duplicate values in pandas dataframe - python

I would like to create unique identifiers for values that are duplicates. The only duplicated values are 0's. The idea is to convert each zero to zero plus its position (0+1 for the first such row, 0+2 for the second, etc.). However, the problem is that the column also contains other, non-duplicate values.
I have written this line of code to try to convert the zero values as described, but I am getting this error message:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('
Here is my code:
seller_customer['customer_id'] = np.where(seller_customer['customer_id']==0, seller_customer['customer_id'] + seller_customer.groupby(['customer_id']).cumcount().replace('0',''))
Here is a sample of my data:
{0: '7e468d618e16c6e1373fb2c4a522c969',
1: '1c14a115bead8a332738c5d7675cca8c',
2: '434dee65d973593dbb8461ba38202798',
3: '4bbeac9d9a22f0628ba712b90862df28',
4: '578d5098cbbe40771e1229fea98ccafd',
5: 0,
6: 0,
7: 0}
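For reference, a minimal sketch of loading this sample into a DataFrame (the column name customer_id is assumed from the code above):
import pandas as pd
data = {0: '7e468d618e16c6e1373fb2c4a522c969',
        1: '1c14a115bead8a332738c5d7675cca8c',
        2: '434dee65d973593dbb8461ba38202798',
        3: '4bbeac9d9a22f0628ba712b90862df28',
        4: '578d5098cbbe40771e1229fea98ccafd',
        5: 0,
        6: 0,
        7: 0}
# the dict keys become the index, the values become the customer ids
seller_customer = pd.DataFrame({'customer_id': pd.Series(data)})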

If I understand correctly, you can just assign range values to those ids that are 0:
df.loc[df['id']==0, 'id'] = np.arange((df['id']==0).sum()) + 1
print(df)
Output:
id
0 7e468d618e16c6e1373fb2c4a522c969
1 1c14a115bead8a332738c5d7675cca8c
2 434dee65d973593dbb8461ba38202798
3 4bbeac9d9a22f0628ba712b90862df28
4 578d5098cbbe40771e1229fea98ccafd
5 1
6 2
7 3
Or, shorter but slightly slower:
df.loc[df['id']==0, 'id'] = (df['id']==0).cumsum()
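As for the TypeError in the question: it comes from adding the integer cumcount to the string ids, and the np.where call is also missing its third (else) argument. A hedged repair of that original line, keeping the question's variable names and producing string ids like '01', '02', ... for the zero rows, could look like:
import numpy as np
mask = seller_customer['customer_id'] == 0
# "0" plus the 1-based position among the zero rows, built as strings;
# np.where needs the third argument to keep the other ids unchanged
replacement = '0' + mask.cumsum().astype(str)
seller_customer['customer_id'] = np.where(mask, replacement, seller_customer['customer_id'])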

You can do something like this:
from pandas.util import hash_pandas_object
import numpy as np
df.x = np.where(df.x == 0, hash_pandas_object(df.x), df.x)
df
Output:
x
0 7e468d618e16c6e1373fb2c4a522c969
1 1c14a115bead8a332738c5d7675cca8c
2 434dee65d973593dbb8461ba38202798
3 4bbeac9d9a22f0628ba712b90862df28
4 578d5098cbbe40771e1229fea98ccafd
5 593769213749726025
6 14559158595676751865
7 4575103004772269825
They won't be sequential like the index, but they will be unique (almost certainly, unless you encounter a hash collision).
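If you want to be sure no collision actually occurred, a cheap uniqueness check afterwards is enough (a minimal sketch, using the same column as above):
# verify the resulting ids are all distinct
assert df.x.is_unique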

Related

How can I count and group the elements of an array that are the same?

I have an array in my code that has a lot of repeated values, for example
A=[1,1,2,2,2,4,5,6,6,6]
but my actual array is much longer and more complicated.
I want to group the values that are the same and count how many there are of each value.
Is there any specific way to do that?
Maybe this will help:
from collections import Counter
A=[1,1,2,2,2,4,5,6,6,6]
a = dict(Counter(A))
print(a)
which gives a dictionary mapping each unique value to its count:
{1: 2, 2: 3, 4: 1, 5: 1, 6: 3}
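If you also want the groups ordered by how often they occur, Counter.most_common returns (value, count) pairs sorted by count, for example:
from collections import Counter
A = [1, 1, 2, 2, 2, 4, 5, 6, 6, 6]
print(Counter(A).most_common())
# [(2, 3), (6, 3), (1, 2), (4, 1), (5, 1)]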
You can use the pandas library for this.
import pandas as pd
A=[1,1,2,2,2,4,5,6,6,6]
count = pd.Series(A).value_counts()
print("elements count")
print(count)
The result will come up as below:
elements count
2 3
6 3
1 2
4 1
5 1

iterating large pandas DataFrame too slow

I have a large dataframe where I would like to make a new column based on existing columns.
test = pd.DataFrame({'Test1':["100","4242","3454","2","54"]})
test['Test2'] = ""
for i in range(0, len(test)):
    if len(test.iloc[i, 0]) == 4:
        test.iloc[i, -1] = test.iloc[i, 0][0:1]
    elif len(test.iloc[i, 0]) == 3:
        test.iloc[i, -1] = test.iloc[i, 0][0]
    elif len(test.iloc[i, 0]) < 3:
        test.iloc[i, -1] = 0
    else:
        test.iloc[i, -1] = np.nan
This is working for a small dataframe, but when I have a large data set, (10+ million rows), it is taking way too long. How can I make this process faster?
Use the str.len method to find the lengths of the strings in the 'Test1' column, and then use np.select with this information to assign the relevant parts of the strings in 'Test1' (or default values) to 'Test2'.
import numpy as np
lengths = test['Test1'].str.len()
test['Test2'] = np.select([lengths == 4, lengths == 3, lengths < 3],
                          [test['Test1'].str[0:1], test['Test1'].str[0], 0],
                          np.nan)
Output:
Test1 Test2
0 100 1
1 4242 4
2 3454 3
3 2 0
4 54 0
Note that [0:1] only returns the first character (same as [0]), so maybe you meant [0:2] (or something else); otherwise you can save one condition there.
So, basically you want to extract the first character of the string if it is at least 3 characters long. (NB: for a string, [0] and [0:1] yield exactly the same thing.)
Just use a regex with a lookahead for that.
test['Test2'] = test['Test1'].str.extract('^(.)(?=..)').fillna(0)
output:
Test1 Test2
0 100 1
1 4242 4
2 3454 3
3 2 0
4 54 0
How the regex works:
^ # match beginning of string
(.) # capture one character
(?=..) # only if it is followed by at least two characters

nunique: compare two pandas dataframes with duplicates and pivot them

My input:
df1 = pd.DataFrame({'frame':[ 1,1,1,2,3,0,1,2,2,2,3,4,4,5,5,5,8,9,9,10,],
'label':['GO','PL','ICV','CL','AO','AO','AO','ICV','PL','TI','PL','TI','PL','CL','CL','AO','TI','PL','ICV','ICV'],
'user': ['user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1']})
df2 = pd.DataFrame({'frame':[ 1, 1, 2, 3, 4,0,1,2,2,2,4,4,5,6,6,7,8,9,10,11],
'label':['ICV','GO', 'CL','TI','PI','AO','GO','ICV','TI','PL','ICV','TI','PL','CL','CL','CL','AO','AO','PL','ICV'],
'user': ['user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2']})
df_c = pd.concat([df1,df2])
I am trying to compare the two dataframes frame by frame and check whether a label in df1 exists in the same frame in df2, and then do some calculation with the result (a pivot, for example).
This is my code:
m_df = df1.merge(df2,on=['frame'],how='outer' )
m_df['cross']=m_df.apply(lambda row: 'Matched'
if row['label_x']==row['label_y']
else 'Mismatched', axis='columns')
pv_m_unq= pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.nunique,fill_value=0,margins=True)
pv_mc = pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.count,fill_value=0,margins=True)
but this creates some problems:
First, I can't calculate a "simple" total (the All column) of matched and mismatched as described in the picture; it is either duplicated (as for AO in pv_m) or the wrong number (as for CL in pv_m_unq).
Second, I think the way I use merge is not a clever one, because whenever frame+label is repeated in a dataframe (which happens often), the merged dataframe ends up with (number of matching rows in df1) × (number of matching rows in df2) rows for that specific frame+label.
I think maybe there is a smarter way to compare the dataframes and pivot them?
You got the unexpected result on the margin total because the margin is computed with the same function passed to aggfunc (i.e. pd.Series.nunique in this case), and the Matched and Mismatched values in those rows are both 1 (hence only one unique value of 1). (You are currently getting the unique count of frame ids.)
Probably you can achieve more or less what you want by taking the count (including for the margin, Matched and Mismatched) instead of the unique count of frame ids, by using pd.Series.count in the last lines of code:
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)
Result
cross Matched Mismatched All
label_x
AO 0 1 1
CL 1 0 1
GO 1 1 2
ICV 1 1 2
PL 0 2 2
All 3 5 8
Edit
If all you need is to have the All column being the sum of Matched and Mismatched, you can do it as follows:
Change your code of generating pv_m_unq without building margin:
pv_m_unq= pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.nunique,fill_value=0)
Then, we create the column All as the sum of Matched and Mismatched for each row, as follows:
pv_m_unq['All'] = pv_m_unq['Matched'] + pv_m_unq['Mismatched']
Finally, create the row All as the sum of Matched and Mismatched for each column and append it as the last row, as follows:
row_All = pd.Series({'Matched': pv_m_unq['Matched'].sum(),
'Mismatched': pv_m_unq['Mismatched'].sum(),
'All': pv_m_unq['All'].sum()},
name='All')
pv_m_unq = pd.concat([pv_m_unq, row_All.to_frame().T])  # DataFrame.append was removed in pandas 2.0
Result:
print(pv_m_unq)
Matched Mismatched All
label_x
AO 1 3 4
CL 1 2 3
GO 1 1 2
ICV 2 4 6
PL 1 5 6
TI 2 3 5
All 8 18 26
You can use the isin() function like this:
df3 = df1[df1.label.isin(df2.label)]
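Note that this checks label membership across the whole of df2 rather than per frame. If the frame-by-frame comparison from the question matters, a merge indicator is one possible alternative (a sketch, assuming df1 and df2 as defined in the question):
import numpy as np
# mark each (frame, label) pair of df1 according to whether it also appears in df2;
# drop_duplicates avoids the row multiplication mentioned in the question
m = df1.merge(df2[['frame', 'label']].drop_duplicates(),
              on=['frame', 'label'], how='left', indicator=True)
m['cross'] = np.where(m['_merge'] == 'both', 'Matched', 'Mismatched')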

Find index of cell in dataframe

I would like to modify the cell value based on its size.
If the dataframe is as below:
A B C
25802523 X1 2
M25JK0010 Y1 1
K25JK0010 Y2 1
I would like to modify column 'A' and insert the result into another column.
For example, if the size of the first cell value in column A is 8, I would like to break it up and get the last 5 characters; similarly, the others depend on the size of each cell.
Is there any way I can do this?
You can do this:
t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445'],
                  'size': [2, 1, 8]})
Define a dictionary of your desired final lengths based on the corresponding size. Here, if the size is 8, I will take the last 5 characters:
size_dict = {8: 5, 2: 3, 1: 4}
Then use a simple pandas apply
t['A_bis'] = t.apply(lambda x: x['A'][len(x['A']) - size_dict[x['size']]:], axis=1)
The result is
0 523 >> 3 last characters (key 2)
1 0010 >> 4 last characters (key 1)
2 R4445 >> 5 last characters (key 8)
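If a size can appear that is not a key of size_dict, dict.get with a fallback avoids a KeyError (a sketch building on the snippet above; keeping the whole string as the fallback is an assumption):
# fall back to the full length when the size has no entry in size_dict
t['A_bis'] = t.apply(
    lambda x: x['A'][len(x['A']) - size_dict.get(x['size'], len(x['A'])):],
    axis=1)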
Another approach to do this:
Sample df:
t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445']})
Get the count of each elements of A:
t['Count'] = t['A'].apply(len)
Then write a condition to replace:
t.loc[t.Count == 8, 'Number'] = t['A'].str[-5:]

pandas get position of a given index in DataFrame

Let's say I have a DataFrame like this:
df
A B
5 0 1
18 2 3
125 4 5
where 5, 18 and 125 are the index values.
I'd like to get the line before (or after) a certain index. For instance, I have index 18 (e.g. by doing df[df.A==2].index), and I want to get the line before it, without knowing that this line has 5 as its index.
2 sub-questions:
How can I get the position of index 18? Something like df.loc[18].get_position(), which would return 1, so I could reach the line before with df.iloc[df.loc[18].get_position()-1].
Is there another solution, a bit like the -C, -A or -B options of grep?
For your first question:
base = df.index.get_indexer_for(df[df.A == 2].index)
or alternatively
base = df.index.get_loc(18)
To get the surrounding ones:
mask = pd.Index(base).union(pd.Index(base - 1)).union(pd.Index(base + 1))
I used Indexes and unions to remove duplicates. You may want to keep them, in which case you can use np.concatenate instead.
Be careful with matches on the very first or last rows :)
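One way to handle those edge cases is to drop positions that fall outside the valid range before using them (a small sketch building on the mask above):
# keep only positions that actually exist in the frame
mask = mask[(mask >= 0) & (mask < len(df))]
surrounding_rows = df.iloc[mask]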
If you need to convert more than 1 index, you can use np.where.
Example:
# df
A B
5 0 1
18 2 3
125 4 5
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [0,2,4], "B": [1,3,5]}, index=[5,18,125])
np.where(df.index.isin([18,125]))
Output:
(array([1, 2]),)
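The positions returned by np.where can then be fed straight to iloc, e.g. to grab the row before each match (a sketch; it assumes no match sits in the very first row):
positions = np.where(df.index.isin([18, 125]))[0]
print(df.iloc[positions - 1])   # the rows just before index 18 and index 125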
