There are two columns in my csv: FirstName and LastName. I need to find the most common full name.
For example:
FirstName LastName
A X
A P
A Y
A Z
B X
B Z
C X
C W
C W
I have tried using the mode function:
df["FirstName"].mode()[0]
df["LastName"].mode()[0]
But it won't work across two columns.
The mode of each column is:
FirstName : A - occurs 4 times
LastName : X - occurs 3 times
But the output should be "C W", as this is the full name that occurs the most times.
You can do:
(df['FirstName'] + df['LastName']).mode()[0]
# Output : 'CW'
If you really need a space between the first and last names, you can concatenate ' ' like this:
(df['FirstName'] + ' ' + df['LastName']).mode()[0]
# Output : 'C W'
You can combine the columns into tuples and find the mode:
df.apply(tuple, axis=1).mode()[0]
# Output : ('C', 'W')
You can concatenate those into a single string with:
full_names = df.FirstName + df.LastName
full_names.mode()[0]
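If you also want to know how many times the most common full name occurs, here is a small groupby-based sketch (not part of the answers above, just a common pandas pattern):

# Count each (FirstName, LastName) pair and take the most frequent one
counts = df.groupby(['FirstName', 'LastName']).size()
counts.idxmax()   # ('C', 'W')
counts.max()      # 2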
An image was attached so that you can see what my dataframe df2 looks like.
I have written code to check a condition and, if it matches, update the items of my list, but it is not working at all: it returns the same, un-updated list. Is this code wrong?
Please suggest a fix.
emp1 = []
for j in range(8, df.shape[0], 10):
    for i in range(2, len(df.columns)):
        b = df.iloc[j][3]
        # values are appended from the dataframe to the list and look like ['3 : 3', '4 : 4', .....]

ess = []
for i in range(df2.shape[0]):
    a = df2.iloc[i][2]
    ess.append(a)  # values taken from the file (3, 4, 5, 6, 7, 8, .... i.e. unique id numbers)

nm = []
for i in range(df2.shape[0]):
    b = df2.iloc[i][3]
    nm.append(b)  # this list contains the names of the employees

ap = [i.split(' : ', 1)[0] for i in emp1]  # split on ' : '; if '3 : 3', this stores the left 3
bp = [i.split(' : ', 1)[1] for i in emp1]  # if '3 : 3', this stores the right 3
cp = ' : '
# the purpose is to replace the right 3 with the name, i.e. '3 : nameabc', and then join the parts back together
for i in range(len(emp1)):
    for j in range(len(ess)):
        # print(i, j)
        if ap[i] == ess[j]:
            bp[i] = nm[j]
for i in range(df.shape[0]):
    ap[i] = ap[i] + cp  # adding ' : ' after the left integer
emp = [i + j for i, j in zip(ap, bp)]  # joining both values
Expected output:
If emp1 contains '3 : 3', then after processing it should show '3 : nameabc'.
Maybe I missed something, but I don't see you assigning any value to emp1. It is empty, and for ap and bp you are looping over that empty emp1. That may be what is causing the problem.
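For what it's worth, here is a minimal sketch of the replacement step, assuming emp1 has already been filled with strings like '3 : 3' and that df2's third column holds the unique id while its fourth column holds the name (these column positions are assumptions based on the comments in the question):

# Build a lookup from id (as a string) to employee name.
# Column positions (2 = id, 3 = name) are assumed from the question's comments.
id_to_name = dict(zip(df2.iloc[:, 2].astype(str), df2.iloc[:, 3]))

emp = []
for item in emp1:                       # e.g. '3 : 3'
    left, right = item.split(' : ', 1)
    # Replace the right-hand part with the matching name when the id is known
    emp.append(left + ' : ' + str(id_to_name.get(left, right)))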
I have a CSV file with only one column, "notes". I want to merge rows of the dataframe based on some condition.
Input_data={'notes':
['aaa','bbb','*','hello','**','my name','is xyz',
'(1)','this is','temp','name',
'(2)','BTW','how to','solve this',
'(3)','with python','I don’t want this to be added ',
'I don’t want this to be added ']}
df_in = pd.DataFrame(Input_data)
Input looks like this
Output
output_Data={'notes':
['aaa','bbb','*hello','**my name is xyz',
'(1) this is temp name',
'(2) BTW how to solve this',
'(3) with python','I don’t want this to be added ',
'I don’t want this to be added ']}
df_out=pd.DataFrame(output_Data)
I want to merge rows into the row above them when that row contains either "*" or "(number)", so the output will look like the df_out above.
Rows which cannot be merged should be left as they are.
Also, for the last group, since there is no proper way to know how far to merge, let's say we add only the one next row.
I solved this, but my solution is very long. Is there a simpler way?
df = pd.DataFrame(Input_data)
notes = []; temp = []; flag = ''; value = ''; c = 0; chk_star = 'yes'
for i, row in df.iterrows():
    row[0] = str(row[0])
    if '*' in row[0].strip()[:5] and chk_star == 'yes':
        value = row[0].strip()
        temp = temp + [value]
        value = ''
        continue
    if '(' in row[0].strip()[:5]:
        chk_star = 'no'
        temp = temp + [value]
        value = ''; c = 0
        flag = 'continue'
        value = row[0].strip()
    if flag == 'continue' and '(' not in row[0][:5]:
        value = value + row[0]
        c = c + 1
        if c > 4:
            temp = temp + [value]
            print("111", value, temp)
            break
if '' in temp:
    temp.remove('')
df = pd.DataFrame({'notes': temp})
The solution below recognises special markers like *, ** and (number) at the start of a sentence and merges the later rows into them, except for the last row.
import pandas as pd
import re

df = pd.DataFrame({'row': ['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
                           '(1)', 'this is', 'temp', 'name',
                           '(2)', 'BTW', 'how to', 'solve this',
                           '(3)', 'with python', 'I don’t want this to be added ',
                           'I don’t want this to be added ']})

pattern = r"^\(\d+\)|^\*+"  # pattern to identify strings starting with (number), * or **
# print(df)

# Select the indexes of rows matching the pattern
selected_index = df[df["row"].str.contains(re.compile(pattern))].index.values

delete_index = []
for index in selected_index:
    i = 1
    # Merge rows until the next selected index is found, and record merged rows in delete_index
    while index + i not in selected_index and index + i < len(df) - 1:
        df.at[index, 'row'] += ' ' + df.at[index + i, 'row']
        delete_index.append(index + i)
        i += 1
df.drop(delete_index, inplace=True)
# print(df)
Output:
row
0 aaa
1 bbb
2 *hello
4 **my nameis xyz
7 (1)this istempname
11 (2)BTWhow tosolve this
15 (3)with pythonI don’t want this to be added
18 I don’t want this to be added
You can reset the index if you want, using df.reset_index().
I think it is easier to design your logic so that df_in is separated into 3 parts: top, middle, and bottom, keeping top and bottom intact while joining the middle part. Finally, concat the 3 parts back together into df_out.
First, create the m1 and m2 masks to separate df_in into the 3 parts.
m1 = df_in.notes.str.strip().str.contains(r'^\*+|\(\d+\)$').cummax()
m2 = ~df_in.notes.str.strip().str.contains(r'^I don’t want this to be added$')
top = df_in[~m1].notes
middle = df_in[m1 & m2].notes
bottom = df_in[~m2].notes
Next, create groupby_mask to group the rows, then groupby and join:
groupby_mask = middle.str.strip().str.contains(r'^\*+|\(\d+\)$').cumsum()
middle_join = middle.groupby(groupby_mask).agg(' '.join)
Out[3110]:
notes
1 * hello
2 ** my name is xyz
3 (1) this is temp name
4 (2) BTW how to solve this
5 (3) with python
Name: notes, dtype: object
Finally, use pd.concat to concat top, middle_join, bottom
df_final = pd.concat([top, middle_join, bottom], ignore_index=True).to_frame()
Out[3114]:
notes
0 aaa
1 bbb
2 * hello
3 ** my name is xyz
4 (1) this is temp name
5 (2) BTW how to solve this
6 (3) with python
7 I don’t want this to be added
8 I don’t want this to be added
You can use a mask to avoid looping over every row when finding the rows to merge:
import numpy as np
import pandas as pd

df = pd.DataFrame({'row': ['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
                           '(1)', 'this is ', 'temp ', 'name',
                           '(2)', 'BTW ', 'how to ', 'solve this',
                           '(3)', 'with python ', 'I don’t want this to be added ',
                           'I don’t want this to be added ']})

special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

# We find the indexes where we will have to merge
index_to_merge = df[df['row'].isin(special)].index.values

for idx, val in enumerate(index_to_merge):
    if idx != len(index_to_merge) - 1:
        df.loc[val, 'row'] += ' ' + df.loc[val + 1:index_to_merge[idx + 1] - 1, 'row'].values.sum()
    else:
        df.loc[val, 'row'] += ' ' + df.loc[val + 1:, 'row'].values.sum()

# We delete the rows that we just used to merge
df.drop([x for x in np.array(range(len(df))) if x not in index_to_merge])
Out :
row
2 * hello
4 ** my nameis xyz
7 (1) this is temp name
11 (2) BTW how to solve this
15 (3) with python I don’t want this to be added ..
You could also convert your column into a NumPy array and use NumPy functions to simplify what you did. First, you can use np.where and np.isin to find the indexes where you will have to merge. That way you don't have to iterate over your whole array with a for loop.
Then you can do the merges at the corresponding indexes. Finally, you can delete the values that have been merged. Here is what it could look like:
import numpy as np

list_to_merge = np.array(['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
                          '(1)', 'this is', 'temp', 'name',
                          '(2)', 'BTW', 'how to', 'solve this',
                          '(3)', 'with python', 'I don’t want this to be added ',
                          'I don’t want this to be added '])

special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

ix = np.isin(list_to_merge, special)
rows_to_merge = np.where(ix)[0]

# We merge the rows
for index_to_merge in rows_to_merge:
    # Check that we are not trying to merge with an out-of-bounds value
    if index_to_merge != len(list_to_merge) - 1:
        list_to_merge[index_to_merge] = list_to_merge[index_to_merge] + ' ' + list_to_merge[index_to_merge + 1]

# We delete the rows that have just been used to merge
rows_to_delete = rows_to_merge + 1
list_to_merge = np.delete(list_to_merge, rows_to_delete)
Out :
['aaa', 'bbb', '* hello', '** my name', 'is xyz', '(1) this is',
'temp', 'name', '(2) BTW', 'how to', 'solve this',
'(3) with python', 'I don’t want this to be added ',
'I don’t want this to be added ']
I am pulling data from the billboard 100 list and am stuck on how to split the artists names. This is a csv file, but I have the data in a pandas dataframe before export. I would like to split using python/pandas. I have included a picture of the column below. The artist names are all in the same column with delimiters (in red) I would like to split but it is very complicated. The most common delimiters are " & ", " Featuring ", " X ", so basically I need help on splitting all of these names into different columns.
I was thinking I could use nested for loops so that I could split on a combination of these delimiters. My idea was to split based on a pattern of " (symbol) ", " X ", " x ", and " Featuring ", but I am not sure if this is possible. Is there an easier way to do this without losing data? All help is appreciated.
Consider a sample dataframe df
df = pd.DataFrame({'singers': ['A & B', 'C Featuring D', 'E X F', 'G % H']})
df
singers
0 A & B
1 C Featuring D
2 E X F
3 G % H
Now, it's up to you which delimiters you want to split the names on. Maybe just X, or just Featuring, or &, or maybe all of them. Use str.split to achieve this, as shown below:
df.singers.str.split('&|X|Featuring|%', expand=True)
0 1
0 A B
1 C D
2 E F
3 G H
You can add any other symbol to the split pattern as well.
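If you are worried about an "X" or "&" that is part of a name, here is a sketch that only splits on delimiters surrounded by spaces (the column name 'artists' and the sample values below are made up for illustration):

import pandas as pd

# Hypothetical sample mimicking the Billboard artist column described in the question
df = pd.DataFrame({'artists': ['A & B', 'C Featuring D', 'E X F', 'G x H Featuring I']})

# Split only when the delimiter is surrounded by whitespace,
# so an 'X' inside a name is not treated as a separator.
parts = df['artists'].str.split(r'\s+(?:&|[Xx]|Featuring)\s+', expand=True)
print(parts)
#    0  1     2
# 0  A  B  None
# 1  C  D  None
# 2  E  F  None
# 3  G  H     I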
I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five numbers (only) with asterisks, keeping the hyphens as they are?
Similar to Mr. Me's approach, this will instead drop the first 6 characters and replace them with your new format.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with the replace() method.
Example dataframe (borrowed from @AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits.
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]
You can use regex:
import re

df = pd.DataFrame({'ssn': ['111-22-3333', '121-22-1123', '345-87-3425']})

def func(x):
    return re.sub(r'\d{3}-\d{2}', '***-**', x)

df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
I am trying to define a function with two arguments: df (a dataframe) and an integer (an employee ID). The function should return the full name of that employee.
If the given ID does not belong to any employee, I want to return the string "UNKNOWN". If no middle name is given, return only "LAST, FIRST". If only the middle initial is given, then return the full name in the format "LAST, FIRST M.", with the middle initial followed by a '.'.
def getFullName(df, int1):
    df = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
    newdf = df[(df['EmployeeID'] == int1)]
    print("'" + newdf['LastName'].item() + ", " + newdf['FirstName'].item() + " " + newdf['MiddleName'].item() + "." + "'")

getFullName('df', 110)
I wrote this code but ran into two problems:
1) If I don't put quotation marks around df, it gives me an error message, but I just want to take a dataframe as an argument, not a string.
2) This code can't deal with someone without a middle name.
I am sorry, but I used pd.read_excel to read an Excel file which you cannot access. I know it will be hard for you to test the code without the Excel file; if someone lets me know how to create a random dataframe with these column names, I will go ahead and change it. Thank you.
I created some fake data for this:
EmployeeID FirstName LastName MiddleName
0 0 a a a
1 1 b b b
2 2 c c c
3 3 d d d
4 4 e e e
5 5 f f f
6 6 g g g
7 7 h h h
8 8 i i i
9 9 j j None
EmployeeID 9 has no middle name, but everyone else does. The way I would do it is to break up the logic into two parts. The first handles the case where you cannot find the EmployeeID; the second manages the printing of the employee's name. That second part should also have two sets of logic: one for when the employee has a middle name, and one for when they don't. You could likely combine a lot of this into single-line statements, but you would likely sacrifice clarity.
I also removed the pd.read_excel call from the function. If you want to pass the dataframe in to the function, then the dataframe should be created oustide of it.
def getFullName(df, int1):
    newdf = df[(df['EmployeeID'] == int1)]
    # if the dataframe is empty, then we can't find the given ID
    # otherwise, go ahead and print out the employee's info
    if newdf.empty:
        print("UNKNOWN")
        return "UNKNOWN"
    else:
        # all strings will start with the LastName and FirstName
        # we will then add the MiddleName if it's present
        # and then we can end the string with the final '
        s = "'" + newdf['LastName'].item() + ", " + newdf['FirstName'].item()
        if newdf['MiddleName'].item():
            s = s + " " + newdf['MiddleName'].item() + "."
        s = s + "'"
        print(s)
        return s
I have the function returning values in case you want to manipulate the string further. But that was just me.
If you run getFullName(df, 1) you should get 'b, b b.'. And for getFullName(df, 9) you should get 'j, j'.
So in full, it would be:
df = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
getFullName(df, 1) #outputs 'b, b b.'
getFullName(df, 9) #outputs 'j, j'
getFullName(df, 10) #outputs UNKNOWN
Fake data:
d = {'EmployeeID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
     'FirstName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
     'LastName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
     'MiddleName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', None]}
df = pd.DataFrame(d)