Splitting Artist Names from One Column - python

I am pulling data from the Billboard 100 list and am stuck on how to split the artists' names. The data ends up in a CSV file, but I have it in a pandas DataFrame before export, so I would like to do the split with Python/pandas. The artist names are all in the same column, separated by delimiters, and splitting them is complicated. The most common delimiters are " & ", " Featuring ", and " X ", so basically I need help splitting all of these names into different columns.
I was thinking I could use nested for loops so that I could split on a combination of these delimiters. My idea was to split based on a pattern of " (symbol) ", " X ", " x ", and " Featuring ", but I am not sure if this is possible. Is there an easier way to do this without losing data? All help is appreciated.

Consider a sample dataframe df:
import pandas as pd

df = pd.DataFrame({'singers': ['A & B', 'C Featuring D', 'E X F', 'G % H']})
df
         singers
0          A & B
1  C Featuring D
2          E X F
3          G % H
Now, it's up to you which delimiters you want to split the names on. Maybe just X, or just Featuring, or &, or all of them. Use str.split to achieve this, as shown below -
df.singers.str.split('&|X|Featuring|%', expand=True)
   0  1
0  A  B
1  C  D
2  E  F
3  G  H
You can add any other symbol to the split pattern as well.
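If you are worried about splitting inside a name that legitimately contains an "X" or "&", one option (a sketch, not tested against the real Billboard data) is to require a space on both sides of the delimiter and pass the pattern as a regular expression:

import pandas as pd

df = pd.DataFrame({'singers': ['A & B', 'C Featuring D', 'E X F', 'Lil Xan Featuring B']})

# Split only where the delimiter is surrounded by spaces, so "Lil Xan" stays intact.
# regex=True requires pandas >= 1.4; older versions already treat a multi-character
# pattern as a regular expression.
parts = df['singers'].str.split(r' (?:&|X|x|Featuring) ', expand=True, regex=True)
print(parts)

Here "Lil Xan" is just an example name used for illustration; the point is only that a bare X inside a word is not treated as a delimiter.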

Related

Python Dataframe: replace multiple values separated by commas in a single row

I have a problem with this situation:
I have a course table (mapping course ids to course names) and a dataframe with a column of comma-separated course ids. My expected outcome is a column in which each id is replaced by its course name.
I've tried using while, for loops and if/else, but I think my placement or understanding of the code is incorrect.
Thanks for your help!
Based on your description, I have created sample data and tried to do what you are aiming for.
import pandas as pd

df = pd.DataFrame()
df['academic_id'] = ['1', '1,2,3', '3,4', '5,6']

# create a dictionary with the mappings we need
info = {'1': 'course a', '2': 'course b', '3': 'course c',
        '4': 'course d', '5': 'course e', '6': 'course f'}

def mapper(elem):
    # split the comma-separated ids, look each one up, and join the names back together
    return ','.join([info[i] for i in elem.split(',')])

df['academic_id_text'] = df['academic_id'].apply(mapper)
print(df)
Output:
  academic_id            academic_id_text
0           1                    course a
1       1,2,3  course a,course b,course c
2         3,4           course c,course d
3         5,6           course e,course f
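A vectorized alternative (a sketch that assumes pandas >= 0.25 for Series.explode, and reuses the same info mapping) is to split, explode the ids into rows, map them, and join the names back per original row:

# Split into lists, explode to one id per row, map ids to course names,
# then re-join the names for each original row (explode keeps the original index).
df['academic_id_text'] = (
    df['academic_id']
    .str.split(',')
    .explode()
    .map(info)
    .groupby(level=0)
    .agg(','.join)
)
print(df)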

how to remove whitespace from string in pandas column

I need to remove whitespace in a pandas df column. My data looks like this:
industry   magazine
Home       "Goodhousekeeping.com"; "Prevention.com";
Fashion    "Cosmopolitan"; " Elle"; "Vogue"
Fashion    " Vogue"; "Elle"
Below is my code:
# split magazine column values, create a new column in df
df['magazine_list'] = df['magazine'].str.split(';')
# strip the first whitespace from strings
df.magazine_list = df.magazine_list.str.lstrip()
This returns all NaN. I have also tried:
df.magazine = df.magazine.str.lstrip()
This didn't remove the white spaces either.
Use a list comprehension with strip applied to the split values; also strip each value before splitting, to remove trailing ;, spaces and " characters:
f = lambda x: [y.strip('" ') for y in x.strip(';" ').split(';')]
df['magazine_list'] = df['magazine'].apply(f)
print (df)
  industry                                  magazine  \
0     Home  Goodhousekeeping.com; "Prevention.com";
1  Fashion            Cosmopolitan; " Elle"; "Vogue"
2  Fashion                              Vogue; "Elle

                             magazine_list
0   [Goodhousekeeping.com, Prevention.com]
1              [Cosmopolitan, Elle, Vogue]
2                            [Vogue, Elle]
Jezrael provides a good solution. It is useful to know that pandas has string accessors for similar operations without the need for list comprehensions. Normally a list comprehension is faster, but depending on the use case the pandas built-in methods can be more readable or simpler to code.
df['magazine'] = (
    df['magazine']
    .str.replace(' ', '', regex=False)
    .str.replace('"', '', regex=False)
    .str.strip(';')
    .str.split(';')
)
Output
  industry                                 magazine
0     Home  [Goodhousekeeping.com, Prevention.com]
1  Fashion              [Cosmopolitan, Elle, Vogue]
2  Fashion                            [Vogue, Elle]
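If every magazine name is wrapped in double quotes, another option (a sketch relying on that assumption, applied to the original magazine column before any of the above transformations) is to extract the quoted names directly with str.findall instead of splitting and stripping:

# Pull out everything between double quotes, trimming surrounding spaces,
# so '" Elle"' becomes 'Elle'.
df['magazine_list'] = df['magazine'].str.findall(r'"\s*([^"]+?)\s*"')
print(df[['industry', 'magazine_list']])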

Merge Rows of Dataframe based on condition

I have a csv file with only one column, "notes". I want to merge rows of the dataframe based on some condition.
Input_data = {'notes':
              ['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
               '(1)', 'this is', 'temp', 'name',
               '(2)', 'BTW', 'how to', 'solve this',
               '(3)', 'with python', 'I don’t want this to be added ',
               'I don’t want this to be added ']}
df_in = pd.DataFrame(Input_data)
The expected output is:
output_Data={'notes':
['aaa','bbb','*hello','**my name is xyz',
'(1) this is temp name',
'(2) BTW how to solve this',
'(3) with python','I don’t want this to be added ',
'I don’t want this to be added ']}
df_out=pd.DataFrame(output_Data)
I want to merge each row into the preceding row that contains either "*" or "(number)", so the output will look like the df_out above. Other rows which cannot be merged should be left alone.
Also, for the last marker row there is no proper way to know how far the merge should extend, so let's say we just add only the one next row.
I solved this, but my solution is very long. Is there a simpler way?
df = pd.DataFrame(Input_data)
notes = []; temp = []; flag = ''; value = ''; c = 0; chk_star = 'yes'
for i, row in df.iterrows():
    row[0] = str(row[0])
    if '*' in row[0].strip()[:5] and chk_star == 'yes':
        value = row[0].strip()
        temp = temp + [value]
        value = ''
        continue
    if '(' in row[0].strip()[:5]:
        chk_star = 'no'
        temp = temp + [value]
        value = ''; c = 0
        flag = 'continue'
        value = row[0].strip()
    if flag == 'continue' and '(' not in row[0][:5]:
        value = value + row[0]
        c = c + 1
        if c > 4:
            temp = temp + [value]
            print("111", value, temp)
            break
if '' in temp:
    temp.remove('')
df = pd.DataFrame({'notes': temp})
The solution below recognises special markers like *, ** and (number) at the start of a sentence and merges the following rows into that row, except for the last row.
import pandas as pd
import re

df = pd.DataFrame({'row': ['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
                           '(1)', 'this is', 'temp', 'name',
                           '(2)', 'BTW', 'how to', 'solve this',
                           '(3)', 'with python', 'I don’t want this to be added ',
                           'I don’t want this to be added ']})

pattern = r"^\(\d+\)|^\*+"  # pattern to identify strings starting with (number), * or **
# print(df)

# Select the indexes matching the pattern above
selected_index = df[df["row"].str.contains(re.compile(pattern))].index.values

delete_index = []
for index in selected_index:
    i = 1
    # Merge rows until the next selected index is found, remembering merged rows in delete_index
    while (index + i not in selected_index and index + i < len(df) - 1):
        df.at[index, 'row'] += ' ' + df.at[index + i, 'row']
        delete_index.append(index + i)
        i += 1

df.drop(delete_index, inplace=True)
# print(df)
Output:
                                             row
0                                            aaa
1                                            bbb
2                                         *hello
4                                **my nameis xyz
7                             (1)this istempname
11                        (2)BTWhow tosolve this
15  (3)with pythonI don’t want this to be added
18                 I don’t want this to be added
You can reset the index if you want, using df.reset_index().
I think it is easier to design your logic around separating df_in into 3 parts: top, middle and bottom, keeping top and bottom intact while joining the middle part. Finally, concat the 3 parts together into df_out.
First, create the m1 and m2 masks that separate df_in into the 3 parts.
m1 = df_in.notes.str.strip().str.contains(r'^\*+|\(\d+\)$').cummax()
m2 = ~df_in.notes.str.strip().str.contains(r'^I don’t want this to be added$')
top = df_in[~m1].notes
middle = df_in[m1 & m2].notes
bottom = df_in[~m2].notes
Next, create groupby_mask to group the rows, then groupby and join:
groupby_mask = middle.str.strip().str.contains(r'^\*+|\(\d+\)$').cumsum()
middle_join = middle.groupby(groupby_mask).agg(' '.join)
Out[3110]:
1                      * hello
2            ** my name is xyz
3        (1) this is temp name
4    (2) BTW how to solve this
5              (3) with python
Name: notes, dtype: object
Finally, use pd.concat to concat top, middle_join, bottom
df_final = pd.concat([top, middle_join, bottom], ignore_index=True).to_frame()
Out[3114]:
                           notes
0                            aaa
1                            bbb
2                        * hello
3              ** my name is xyz
4          (1) this is temp name
5      (2) BTW how to solve this
6                (3) with python
7  I don’t want this to be added
8  I don’t want this to be added
You can use a mask so you only loop over the rows that actually need merging:
import numpy as np
import pandas as pd

df = pd.DataFrame({'row': ['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
                           '(1)', 'this is ', 'temp ', 'name',
                           '(2)', 'BTW ', 'how to ', 'solve this',
                           '(3)', 'with python ', 'I don’t want this to be added ',
                           'I don’t want this to be added ']})

special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

# We find the indexes where we will have to merge
index_to_merge = df[df['row'].isin(special)].index.values

for idx, val in enumerate(index_to_merge):
    if idx != len(index_to_merge) - 1:
        # Merge everything up to (but not including) the next marker row
        df.loc[val, 'row'] += ' ' + df.loc[val + 1:index_to_merge[idx + 1] - 1, 'row'].values.sum()
    else:
        # Last marker: merge everything that follows
        df.loc[val, 'row'] += ' ' + df.loc[val + 1:, 'row'].values.sum()

# We delete the rows that we just used to merge
df.drop([x for x in np.arange(len(df)) if x not in index_to_merge])
Out:
                                                  row
2                                             * hello
4                                    ** my nameis xyz
7                               (1) this is temp name
11                          (2) BTW how to solve this
15  (3) with python I don’t want this to be added ..
You could also convert your column into a numpy array and use numpy functions to simplify what you did. First, use np.isin and np.where to find the indexes where you will have to merge; that way you don't have to iterate over the whole array with a for loop. Then you can do the merges at the corresponding indexes. Finally, you can delete the values that have been merged. Here is what it could look like:
import numpy as np

list_to_merge = np.array(['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
                          '(1)', 'this is', 'temp', 'name',
                          '(2)', 'BTW', 'how to', 'solve this',
                          '(3)', 'with python', 'I don’t want this to be added ',
                          'I don’t want this to be added '])

special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

# Find the indexes where we will have to merge
ix = np.isin(list_to_merge, special)
rows_to_merge = np.where(ix)[0]

# We merge each marker row with the row that follows it
for index_to_merge in rows_to_merge:
    # Check that we are not trying to merge with an out-of-bounds value
    if index_to_merge != len(list_to_merge) - 1:
        list_to_merge[index_to_merge] = list_to_merge[index_to_merge] + ' ' + list_to_merge[index_to_merge + 1]

# We delete the rows that have just been used to merge
rows_to_delete = rows_to_merge + 1
list_to_merge = np.delete(list_to_merge, rows_to_delete)
Out :
['aaa', 'bbb', '* hello', '** my name', 'is xyz', '(1) this is',
'temp', 'name', '(2) BTW', 'how to', 'solve this',
'(3) with python', 'I don’t want this to be added ',
'I don’t want this to be added ']
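If you need the merged notes back as a dataframe (the question starts from df_in), a one-line sketch:

df_out = pd.DataFrame({'notes': list_to_merge})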

How to calculate mode over two columns in a python dataframe?

There are two columns in my csv: FirstName and LastName. I need to find the most common full name.
Eg:
FirstName LastName
A X
A P
A Y
A Z
B X
B Z
C X
C W
C W
I have tried using the mode function:
df["FirstName"].mode()[0]
df["LastName"].mode()[0]
But it won't work over two columns.
The mode of each column is:
FirstName: A - occurs 4 times
LastName: X - occurs 3 times
But the output should be "C W", as this is the full name that occurs most often.
You can do,
(df['FirstName'] + df['LastName']).mode()[0]
# Output : 'CW'
If you really need a space between the first and last names, you can concatenate ' ' like this:
(df['FirstName'] + ' ' + df['LastName']).mode()[0]
# Output : 'C W'
You can combine the columns into tuples and find the mode:
df.apply(tuple, axis=1).mode()[0]
('C', 'W')
You can concatenate those into a single string with:
full_names = df.FirstName + df.LastName
full_names.mode()[0]
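Another option (a sketch; DataFrame.value_counts requires pandas >= 1.1) is to count each (FirstName, LastName) pair directly and take the most frequent one:

# Count full-name pairs and pick the most common one
most_common = df.value_counts(['FirstName', 'LastName']).idxmax()
print(most_common)             # ('C', 'W')
print(' '.join(most_common))   # 'C W'

On older pandas versions, df.groupby(['FirstName', 'LastName']).size().idxmax() gives the same result.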

how to get just a string from a data frame

I am trying to define a function with two arguments: df (a dataframe) and an integer (an employee ID). This function will return the full name of the employee.
If the given ID does not belong to any employee, I want to return the string "UNKNOWN". If no middle name is given, only return "LAST, FIRST". If only the middle initial is given, return the full name in the format "LAST, FIRST M.", with the middle initial followed by a '.'.
def getFullName(df, int1):
    df = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
    newdf = df[(df['EmployeeID'] == int1)]
    print("'" + newdf['LastName'].item() + "," + " " + newdf['FirstName'].item() + " " + newdf['MiddleName'].item() + "." + "'")

getFullName('df', 110)
I wrote this code but came up with two problems:
1) If I don't put quotation marks around df, it gives me an error message, but I just want to take a dataframe as an argument, not a string.
2) This code can't deal with someone without a middle name.
I am sorry, but I used pd.read_excel to read an Excel file which you cannot access. I know it will be hard for you to test the code without that file; if someone lets me know how to create a random dataframe with these column names, I will go ahead and change it. Thank you.
I created some fake data for this:
   EmployeeID FirstName LastName MiddleName
0           0         a        a          a
1           1         b        b          b
2           2         c        c          c
3           3         d        d          d
4           4         e        e          e
5           5         f        f          f
6           6         g        g          g
7           7         h        h          h
8           8         i        i          i
9           9         j        j       None
EmployeeID 9 has no middle name, but everyone else does. The way I would do it is to break the logic up into two parts. The first handles the case where you cannot find the EmployeeID. The second manages the printing of the employee's name. That second part should itself have two branches: one for when the employee has a middle name, and one for when they don't. You could likely combine a lot of this into single-line statements, but you would sacrifice clarity.
I also removed the pd.read_excel call from the function. If you want to pass the dataframe into the function, then the dataframe should be created outside of it.
def getFullName(df, int1):
    newdf = df[(df['EmployeeID'] == int1)]
    # if the dataframe is empty, then we can't find the given ID
    # otherwise, go ahead and print out the employee's info
    if newdf.empty:
        print("UNKNOWN")
        return "UNKNOWN"
    else:
        # all strings will start with the LastName and FirstName
        # we will then add the MiddleName if it's present
        # and then we can end the string with the final '
        s = "'" + newdf['LastName'].item() + ", " + newdf['FirstName'].item()
        if newdf['MiddleName'].item():
            s = s + " " + newdf['MiddleName'].item() + "."
        s = s + "'"
        print(s)
        return s
I have the function returning values in case you want to manipulate the string further. But that was just me.
If you run getFullName(df, 1) you should get 'b, b b.'. And for getFullName(df, 9) you should get 'j, j'.
So in full, it would be:
df = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
getFullName(df, 1) #outputs 'b, b b.'
getFullName(df, 9) #outputs 'j, j'
getFullName(df, 10) #outputs UNKNOWN
Fake data:
d = {'EmployeeID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
     'FirstName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
     'LastName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
     'MiddleName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', None]}
df = pd.DataFrame(d)
