Efficiently labelling a column that contains repeated elements [duplicate] - python

This question already has answers here:
How to efficiently assign unique ID to individuals with multiple entries based on name in very large df
(3 answers)
Pandas: convert categories to numbers
(6 answers)
Convert pandas series from string to unique int ids [duplicate]
(2 answers)
Closed 1 year ago.
This post was edited and submitted for review 1 year ago and failed to reopen the post:
Original close reason(s) were not resolved
I have a dataframe with a column consisting of author names, where sometimes the name of an author repeats. My problem is: I want to assign a unique number to each author name in a corresponding parallel column (for simplicity, assume that this numbering follows the progression of whole numbers, starting with 0, then 1, 2, 3, and so on).
I can do this using nested for loops, but with 57,000 records containing roughly 500 unique authors, it takes far too long. Is there a quicker way to do this?
For example,
Original DataFrame contains:
**Author**
Name 1
Name 2
Name 1
Name 3
I want another column added next to it, such that:
**Author**    **AuthorID**
Name 1        1
Name 2        2
Name 1        1
Name 3        3
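One vectorized approach (a sketch, not part of the original post) is to let pandas assign the integer codes itself, either with pd.factorize or with categorical codes; either should handle 57,000 rows almost instantly:
import pandas as pd

df = pd.DataFrame({'Author': ['Name 1', 'Name 2', 'Name 1', 'Name 3']})
# factorize returns an integer code per unique value, starting at 0
df['AuthorID'] = pd.factorize(df['Author'])[0]
# equivalently: df['AuthorID'] = df['Author'].astype('category').cat.codes
# add 1 if the IDs should start at 1 as in the example above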

Related

How to find duplicate rows based on given combination of columns and roll up observations in pandas data frame? [duplicate]

This question already has an answer here:
Pandas | Group by with all the values of the group as comma separated
(1 answer)
Closed 11 months ago.
This post was edited and submitted for review 11 months ago and failed to reopen the post:
Original close reason(s) were not resolved
I have a data frame as below -
df_add = pd.DataFrame({
    'doc_id': [100, 101, 102, 103],
    'last_name': ['Mallesham', 'Mallesham', 'Samba', 'Bhavik'],
    'first_name': ['Yamulla', 'Yamulla', 'Anil', 'Yamulla'],
    'dob': ['06-03-1900', '06-03-1900', '20-09-2020', '09-16-2020']
})
Here doc_id 100 and 101 are duplicate rows when considering last name, first name, and DOB.
My requirement is to roll 101 up into 100 as follows:
doc_id should be filled in as 100;101, with a semicolon separator.
In a second case:
If I consider only the last_name and first_name combination, it should roll up in the same way, since people with the same name might have different DOBs.
You need to cast doc_id to str in order to use the str.cat function:
df_add["doc_id"] = df_add["doc_id"].astype(str)
# concatenate the doc_ids within each group of duplicates
resultant_df = df_add.groupby(["first_name", "last_name", "dob"])["doc_id"].apply(lambda x: x.str.cat(sep=','))
print(resultant_df.reset_index())
  first_name  last_name         dob   doc_id
0       Anil      Samba  20-09-2020      102
1    Yamulla     Bhavik  09-16-2020      103
2    Yamulla  Mallesham  06-03-1900  100,101
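A variation along the same lines (a sketch, not part of the original answer) uses agg with ';'.join, which keeps doc_id as a named column and uses the semicolon separator the question asks for:
df_add["doc_id"] = df_add["doc_id"].astype(str)
# join the doc_ids of each duplicate group with ';'
rolled = df_add.groupby(["first_name", "last_name", "dob"], as_index=False).agg({"doc_id": ";".join})
print(rolled)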

Fill the column of a dataframe with random values chosen from a list in pandas [duplicate]

This question already has answers here:
Pandas: create new column in df with random integers from range
(3 answers)
Closed last year.
This post was edited and submitted for review 6 months ago and failed to reopen the post:
Duplicate: This question has been answered, is not unique, and doesn’t differentiate itself from another question.
I have a data frame with the customers as shown below.
df:
id name
1 john
2 dan
3 sam
also, I have a list as
['www.costco.com', 'www.walmart.com']
I would like to add a column named domain to df by randomly selecting the elements from the list.
Expected output:
id name domain
1 john www.walmart.com
2 dan www.costco.com
3 sam www.costco.com
Note: since it is a random selection output may not be the same as always.
The values are selected randomly from the given list of strings, so the output is not fixed and this is not a duplicate. It is also a specific question, and it received very specific answers.
You can use random.choices:
import random

lst = ['www.costco.com', 'www.walmart.com']  # the candidate domains
df['domain'] = random.choices(lst, k=len(df))  # one random pick per row, with replacement
A sample output:
id name domain
0 1 john www.walmart.com
1 2 dan www.costco.com
2 3 sam www.costco.com
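If numpy is already in use, a vectorized equivalent (a sketch, assuming the list is stored in lst as above) is np.random.choice:
import numpy as np

# draw len(df) values from lst, with replacement
df['domain'] = np.random.choice(lst, size=len(df))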

How to filter pandas dataframe based on length of a list in a column? [duplicate]

This question already has answers here:
How to filter a pandas dataframe based on the length of a entry
(2 answers)
Closed 1 year ago.
I have a pandas DataFrame like this:
id subjects
1 [math, history]
2 [English, Dutch, Physics]
3 [Music]
How do I filter this dataframe based on the length of the lists in the subjects column?
For example, how do I keep only the rows where len(subjects) >= 2?
I tried using
df[len(df["subjects"]) >= 2]
But this gives
KeyError: True
Also, using loc does not help; it gives me the same error.
Thanks in advance!
Use the .str accessor, which also works element-wise on lists:
df[df['subjects'].str.len() >= 2]
Output:
id subjects
0 1 [math, history]
1 2 [English, Dutch, Physics]
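An equivalent that avoids the .str accessor (a sketch of the same filter) is to map the built-in len over the column:
# length of each list, then the usual boolean mask
df[df['subjects'].map(len) >= 2]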

Calculate the number of non empty cells when the name of a column contains 'XXX' [duplicate]

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 3 years ago.
I have 59 columns whose names are in the format nn: xxxxxx (tttttt), where tttttt is a name that is repeated across certain columns. I want to count the non-empty cells in the columns where tttttt = 'XXXXXX'. I know how to count the non-empty cells in a single column, but how do I add the condition that tttttt appears in the column name?
import pandas as pd

df = pd.read_csv("dane.csv", sep=';')
shape = list(df.shape)
# non-empty count per column: total rows minus the number of nulls
nonempty = df.apply(lambda x: shape[0] - x.isnull().sum())
Input:
   1: Brandenburg (Post-Panamax)               2: Acheron (Feeder)                         5: Fenton (Feeder)
0  ES-NL-10633096/1938/X1#hkzydbezon.dk/6749   DE-JP-20438082/2066/A2#qwinfhcaer.cu/68849  NL-LK-02275406/2136/A1#ozmmfdpfts.de/73198
1  BE-BR-61613986/3551/B1#oqk.bf/39927         NL-LK-02275406/2136/A1#ozmmfdpfts.de/73198  NaN
2  PH-SA-39552610/2436/A1#venagi.hr/80578      NaN                                         NaN
3  PA-AE-59814691/4881/X1#zhicvzvksl.cl/25247  OM-PH-31303222/3671/Z1#jtqy.ml/52408        NaN
So for instance, for this input, let's say I want to count the non-empty cells for the columns whose names contain 'Feeder'.
You can use filter:
df.filter(like='(Feeder)').isna().sum()
or a more precise version, which requires (Feeder) to appear at the end of the column name:
df.filter(regex=r'.*(\(Feeder\))$').isna().sum()
Output:
2: Acheron (Feeder) 1
5: Fenton (Feeder) 3
dtype: int64
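Note that isna().sum() counts the missing cells per column; if you literally want the number of non-empty cells, as the question asks, the same filter can be inverted (a minor variation, not part of the original answer):
# non-missing cells per matching column
df.filter(like='(Feeder)').notna().sum()
# or, equivalently
df.filter(like='(Feeder)').count()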

Return rows that match a larger partial string of a string [duplicate]

This question already has answers here:
Python Pandas: Check if string in one column is contained in string of another column in the same row
(3 answers)
Closed 4 years ago.
I have a dataframe df that looks like the following:
Type Size
Biomass 12
Coal 15
Nuclear 23
And I have a string str such as the following: Biomass_wood
I would like to return the following dataframe:
Type Size
Biomass 12
This is because Biomass is partially matched by the first part of Biomass_wood.
This would effectively be the opposite of df[df.Type.str.contains(str)], since here the column value has to be contained in str rather than the other way around.
The following should do it:
df[df['Type'].map(lambda t: t in 'Biomass_wood')]
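If the search string is held in a variable (named s here purely for illustration, since str would shadow the built-in), a list comprehension does the same thing:
s = 'Biomass_wood'  # illustrative variable name for the search string
df[[t in s for t in df['Type']]]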
