Regular Expressions in a dataframe Python - python

I am trying to extract the name from the data frame.
df['target_name'].head()
3 Minnie
4 Albert [unclear]Gles[/unclear]
5 Eliza [unclear]Gles[/unclear]
6 John Slaltery
7 [unclear]P.[/unclear] Slaltery
23 ? Stewart
34 John Maddison
35 Herbert Olney
36 William Iverach
37 [unclear][/unclear]
38 Peter Blacksmith
39 William Oliver
40 Emily
Name: target_name, dtype: object
This is the output. We just want to get rid of the unnecessary characters and fetch the name.
This is what I have done:
import re
df['target_name'] = df['target_name'].astype(str) #converting it into a string.
I tried these two methods, but both gave me the same output, i.e. NaN:
df['target_name'] = df['target_name'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
df['target_name3'] = df['target_name'].str.replace(r'\([^)]*\)', '').str.strip()

This seems to work for me.
import pandas as pd
import re
target_name = ["Minnie", "Albert [unclear]Gles[/unclear]",
"Eliza [unclear]Gles[/unclear]",
"[unclear]P.[/unclear] Slaltery", "? Stewart"]
df = pd.DataFrame(target_name, columns = ['target_name'])
df['target_name'] = df['target_name'].astype('str').str.replace(r'/|\?', '', regex=True).str.replace(r'\[[a-z]+\]', '', regex=True).str.strip()
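A variation on the same idea, as a minimal sketch: one pattern removes both the opening and closing [unclear] tags plus stray question marks, and a second pass collapses leftover whitespace. The sample values are taken from the question; turning an empty result into a missing value is an assumption about what is wanted for rows like "[unclear][/unclear]".

```python
import pandas as pd

names = pd.Series(
    [
        "Minnie",
        "Albert [unclear]Gles[/unclear]",
        "[unclear]P.[/unclear] Slaltery",
        "? Stewart",
        "[unclear][/unclear]",
    ],
    name="target_name",
)

cleaned = (
    names.astype(str)
    # Drop "[unclear]" / "[/unclear]" tags and literal question marks.
    .str.replace(r"\[/?unclear\]|\?", "", regex=True)
    # Collapse any run of whitespace left behind into a single space.
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Assumption: a row that is empty after cleaning should count as missing.
cleaned = cleaned.mask(cleaned == "")

print(cleaned)
```

Passing regex=True explicitly matters here, because recent pandas versions treat the pattern as a literal string by default.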

Related

How to split text using a list of keywords

I am trying to split my data so that I can categorize/classify it. Below is some sample data. Most of it is well labelled, but everything is written as one string under description.
df = pd.DataFrame({"name":['Kelly', 'David', 'Mandy', "John"], "description":["age: 12 gender:female hobbies: loves to read", "age:16, gender:male, hobbies: play soccer", "age: 15, gender:female, hobbies: cooking","18, male, reading"]})
df
   name   description
0  Kelly  age: 12 gender:female hobbies: loves to read
1  David  age:16, gender:male, hobbies: play soccer
2  Mandy  age: 15, gender:female, hobbies: cooking
3  John   18, male, reading
I decided to focus first on the descriptions that have labels, since they are already well labelled. However, splitting on ":" does not give the outcome I want. I tried re.findall(), and although it gives me the key labels, I am still not sure how to split my data according to the labels found. Below is what I hope to achieve first.
   name   description
0  Kelly  age: 12
1  Kelly  gender:female
2  Kelly  hobbies: loves to read
3  David  age:16
4  David  gender:male
5  David  hobbies: play soccer
6  Mandy  age: 15
7  Mandy  gender:female
8  Mandy  hobbies: cooking
9  John   18, male, reading
You could try using regular expressions to help search and parse the strings, like below:
# Import statements
import pandas as pd
import re
# Your initial data frame
df = pd.DataFrame({"name":['Kelly', 'David', 'Mandy', "John"], "description":["age: 12 gender:female hobbies: loves to read", "age:16, gender:male, hobbies: play soccer", "age: 15, gender:female, hobbies: cooking","18, male, reading"]})
# set regex values to search for in your data frame
# (raw strings avoid invalid-escape warnings for \s)
ageRegex = r'(age)?\s?:?\s?[0-9][0-9]?'
genderRegex = r'(gender)?\s?:?\s?(fe)?male'
# create dictionary to store some output
dictOutput = {}
# for each index in our data frame
for index in df.index:
    #create a dictionary of values for each name
    dictBuilder = {}
    nameValue = df.loc[index,'name']
    description_startValue = df.loc[index,'description'].replace(",","")
    #Try to get our age value
    age = re.search(ageRegex, description_startValue).group(0)
    #Try to get our gender value
    gender = re.search(genderRegex, description_startValue).group(0)
    #Per the sample data, the remaining text should be the hobby value
    hobbyValue = description_startValue.split(gender)[1]
    #set those values to the working dictionary created above
    dictBuilder["age"] = age
    dictBuilder["gender"] = gender
    dictBuilder["hobbies"] = hobbyValue
    #Set our working dictionary to the output dictionary created outside the loop
    dictOutput[nameValue] = dictBuilder
#parse our dictionary to create a list of values to use in a data frame
listBuilder = []
for eachDictionary in dictOutput:
    nameVal = eachDictionary
    for eachValue in dictOutput[eachDictionary]:
        listBuilder.append([nameVal, dictOutput[eachDictionary][eachValue]])
#create a new data frame with our list of values
df2 = pd.DataFrame(listBuilder, columns=("Name", "Description"))
#debug print statement
print(df2)
That should yield an output like below:
Name Description
0 Kelly age: 12
1 Kelly gender:female
2 Kelly hobbies: loves to read
3 David age:16
4 David gender:male
5 David hobbies: play soccer
6 Mandy age: 15
7 Mandy gender:female
8 Mandy hobbies: cooking
9 John 18
10 John male
11 John reading
You could consider enhancing the regex search to clean your input further, so that each value (age, gender, hobbies) ends up with the same prefix format, for example with some kind of find and replace.
The following reference on Python's regex functions may help:
https://www.w3schools.com/python/python_regex.asp
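If the goal is the table from the question, where unlabelled rows such as John's stay in one piece, a keyword-driven split may be closer. The following is a minimal sketch, assuming the keyword list is known up front; the names keywords, chunk and split_labelled are illustrative and not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "name": ["Kelly", "David", "Mandy", "John"],
    "description": [
        "age: 12 gender:female hobbies: loves to read",
        "age:16, gender:male, hobbies: play soccer",
        "age: 15, gender:female, hobbies: cooking",
        "18, male, reading",
    ],
})

# Assumed list of labels that can start a chunk.
keywords = ["age", "gender", "hobbies"]
label = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\s*:"

# A chunk starts at a label and runs lazily up to the next label
# (optionally preceded by a comma and spaces) or the end of the string.
chunk = re.compile(label + r".*?(?=,?\s*" + label + r"|$)")

def split_labelled(text):
    """Return the labelled chunks, or the whole text if no label is found."""
    parts = chunk.findall(text)
    return parts if parts else [text]

out = (
    df.assign(description=df["description"].map(split_labelled))
    .explode("description")       # one row per chunk
    .reset_index(drop=True)
)
print(out)
```

The pattern uses only non-capturing groups, so findall returns whole matches rather than group contents.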

Get string instead of list in Pandas DataFrame

I have a column Name of string data type. I want to take every word except the last one and put the result in a new column FName, which I managed to do:
df = pd.DataFrame({'Name': ['John A Sether', 'Robert D Junior', 'Jonny S Rab'],
'Age':[32, 34, 36]})
df['FName'] = df['Name'].str.split(' ').str[0:-1]
Name Age FName
0 John A Sether 32 [John, A]
1 Robert D Junior 34 [Robert, D]
2 Jonny S Rab 36 [Jonny, S]
But the new column FName holds lists, which I don't want. I want plain strings such as John A.
I tried converting the list to a string, but the result does not seem right.
Any suggestions?
You can use .str.rsplit:
df['FName'] = df['Name'].str.rsplit(n=1).str[0]
Or you can use .str.extract:
df['FName'] = df['Name'].str.extract(r'(\S+\s?\S*)', expand=False)
Or, you can chain .str.join after .str.split:
df['FName'] = df['Name'].str.split().str[:-1].str.join(' ')
Name Age FName
0 John A Sether 32 John A
1 Robert D Junior 34 Robert D
2 Jonny S Rab 36 Jonny S
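The three options agree on the sample data but not in general. As a quick sketch, here they are side by side on two made-up names ("Mary Ann B Lee" and "Cher" are hypothetical and not from the question) with more or fewer than three words:

```python
import pandas as pd

names = pd.Series(["John A Sether", "Mary Ann B Lee", "Cher"])

# Split once from the right and keep everything before the last word.
by_rsplit = names.str.rsplit(n=1).str[0]

# The regex keeps at most the first two whitespace-separated tokens.
by_extract = names.str.extract(r"(\S+\s?\S*)", expand=False)

# Drop the last token, then join the rest back into one string.
by_join = names.str.split().str[:-1].str.join(" ")

print(pd.DataFrame({"rsplit": by_rsplit, "extract": by_extract, "join": by_join}))
```

For a four-word name the extract version stops after two words, and for a one-word name the split/join version yields an empty string while the other two keep the name, so the choice depends on what the real data looks like.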

Amend row in a data-frame if it exists in another data-frame

I have two dataframes DfMaster and DfError
DfMaster which looks like:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans A
4 3643 Kevin Franks S
5 244 Stella Howard D
and DfError looks like
Id Name Building
0 4567 John Evans A
1 244 Stella Howard D
In DfMaster I would like to change the Building value for a record to DD if it appears in the DfError data-frame. So my desired output would be:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans DD
4 3643 Kevin Franks S
5 244 Stella Howard DD
I am trying to use the following:
DfMaster.loc[DfError['Id'], 'Building'] = 'DD'
however I get an error:
KeyError: "None of [Int64Index([4567,244], dtype='int64')] are in the [index]"
What have I done wrong?
Try this, using np.where:
import numpy as np
errors = DfError['Id'].unique()
DfMaster['Building'] = np.where(DfMaster['Id'].isin(errors), 'DD', DfMaster['Building'])
DataFrame.loc expects that you input an index or a Boolean series, not a value from a column.
I believe this should do the trick:
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
In other words:
For all rows where Id value is present in DfError['Id'], set the value of 'Building' to 'DD'.
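A self-contained sketch of the Boolean-mask approach, rebuilding both frames from the question (the snake_case variable names are just for this example):

```python
import pandas as pd

df_master = pd.DataFrame({
    "Id": [4653, 3467, 34, 4567, 3643, 244],
    "Name": ["Jane Smith", "Steve Jones", "Kim Lee",
             "John Evans", "Kevin Franks", "Stella Howard"],
    "Building": ["A", "B", "F", "A", "S", "D"],
})
df_error = pd.DataFrame({
    "Id": [4567, 244],
    "Name": ["John Evans", "Stella Howard"],
    "Building": ["A", "D"],
})

# True for every master row whose Id also appears in the error frame.
mask = df_master["Id"].isin(df_error["Id"])

# .loc with a Boolean mask selects rows by condition, not by index label.
df_master.loc[mask, "Building"] = "DD"

print(df_master)
```

The original attempt failed because .loc treated the values 4567 and 244 as index labels, while the index of DfMaster only runs from 0 to 5.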

Parsing data from file

I wanted to parse a given file line by line. The file has a format of
'name age gender hobby1 hobby2...'.
The first thing that came to mind was to use a named tuple of the form namedtuple('info',['name','age', 'gender','hobby']).
How can I save the data in my file to a list of tuples with the corresponding values? I tried using line.split(), but I couldn't figure out how to save the space-separated hobbies to info.hobby.
Input file
If I understand you correctly, you can use pandas and pass a single space (' ') as sep, assuming the data looks like this:
name age gender hobby1 hobby2
steve 12 male xyz abc
bob 29 male swimming golfing
alice 40 female reading cooking
tom 50 male sleeping
and here is syntax for method described above:
import pandas as pd
df = pd.read_csv('file.txt', sep=' ')
df.fillna(' ', inplace=True)
df['hobby'] = df[['hobby1', 'hobby2']].apply(lambda i: ' '.join(i), axis=1)
df.drop(['hobby1', 'hobby2'], axis=1, inplace=True)
print(df)
out:
name age gender hobby
0 steve 12 male xyz abc
1 bob 29 male swimming golfing
2 alice 40 female reading cooking
3 tom 50 male sleeping
EDIT: added your data from comment above
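For the named tuple asked about in the question, a plain-Python sketch is also possible. It assumes the file has no header row and that everything after the third field belongs to the hobbies; io.StringIO stands in for the real file, and parse_line is an illustrative helper name.

```python
import io
from collections import namedtuple

Info = namedtuple("Info", ["name", "age", "gender", "hobby"])

# Stand-in for open('file.txt'); the rows are the sample data from above.
sample = io.StringIO(
    "steve 12 male xyz abc\n"
    "bob 29 male swimming golfing\n"
    "alice 40 female reading cooking\n"
    "tom 50 male sleeping\n"
)

def parse_line(line):
    # Split at most three times, so all hobbies stay together in the last field.
    name, age, gender, *rest = line.strip().split(maxsplit=3)
    hobby = rest[0] if rest else ""   # tolerate a line without hobbies
    return Info(name, int(age), gender, hobby)

records = [parse_line(line) for line in sample if line.strip()]

for record in records:
    print(record)
```

Limiting the split with maxsplit=3 is what keeps "swimming golfing" as a single value for info.hobby.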

Why does this Pandas csv import fail?

I am trying to import the following csv text:
name, favorites, age, other_hobbies
joe, "[madonna, elvis, u2]", 28, "[football, cooking]"
mary, "[lady gaga, adele]", 36, "[]"
With the following pandas command
file_name = "new_data.csv"
df = pd.read_csv(file_name, sep =",")
print(df)
And I get this result:
name favorites age other_hobbies
joe "[madonna elvis u2]" 28 "[football cooking]"
mary "[lady gaga adele]" 36 "[]" NaN NaN
Why is this happening, and how can I get pandas to read this properly?
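The space after each comma comes before the opening quote, so the quote is not recognized as the start of a quoted field and the commas inside it split the row. Pass skipinitialspace along with the sep:
df = pd.read_csv("in.csv", sep=",", skipinitialspace=True)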
print(df)
Output:
name favorites age other_hobbies
0 joe [madonna, elvis, u2] 28 [football, cooking]
1 mary [lady gaga, adele] 36 []
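The same fix can be checked without a file on disk. This sketch feeds the CSV text from the question through io.StringIO, which is only a stand-in for the real file:

```python
import io
import pandas as pd

csv_text = (
    'name, favorites, age, other_hobbies\n'
    'joe, "[madonna, elvis, u2]", 28, "[football, cooking]"\n'
    'mary, "[lady gaga, adele]", 36, "[]"\n'
)

# skipinitialspace=True drops the blank after each delimiter, so the
# double quote becomes the first character of the field again.
df = pd.read_csv(io.StringIO(csv_text), sep=",", skipinitialspace=True)

print(df)
print(df.dtypes)
```

Note that the bracketed values are still plain strings after loading; turning them into Python lists would need a separate parsing step.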
