How to split text using a list of keywords - python

I am trying to split my data so that I can categorize/classify it. In the sample data below, most of the information is well labelled, but it is all written as one string under description.
df = pd.DataFrame({"name":['Kelly', 'David', 'Mandy', "John"], "description":["age: 12 gender:female hobbies: loves to read", "age:16, gender:male, hobbies: play soccer", "age: 15, gender:female, hobbies: cooking","18, male, reading"]})
df
   name   description
0  Kelly  age: 12 gender:female hobbies: loves to read
1  David  age:16, gender:male, hobbies: play soccer
2  Mandy  age: 15, gender:female, hobbies: cooking
3  John   18, male, reading
I decided to focus first on the descriptions that have labels, since they are already well labelled. However, if I split by ":", I don't get the outcome I want. I tried re.findall(), and although it gives me the key labels, I am still not sure how to split my data according to the labels found. Below is what I hope to achieve first.
   name   description
0  Kelly  age: 12
1  Kelly  gender:female
2  Kelly  hobbies: loves to read
3  David  age:16
4  David  gender:male
5  David  hobbies: play soccer
6  Mandy  age: 15
7  Mandy  gender:female
8  Mandy  hobbies: cooking
9  John   18, male, reading

You could try using regular expressions to help search and parse the strings, like below:
# Import statements
import pandas as pd
import re

# Your initial data frame
df = pd.DataFrame({"name":['Kelly', 'David', 'Mandy', "John"], "description":["age: 12 gender:female hobbies: loves to read", "age:16, gender:male, hobbies: play soccer", "age: 15, gender:female, hobbies: cooking","18, male, reading"]})

# Set regex values to search for in your data frame
# (raw strings, so that \s is passed to the regex engine unchanged)
ageRegex = r'(age)?\s?:?\s?[0-9][0-9]?'
genderRegex = r'(gender)?\s?:?\s?(fe)?male'

# Create dictionary to store some output
dictOutput = {}

# For each index in our data frame
for index in df.index:
    # Create a dictionary of values for each name
    dictBuilder = {}
    nameValue = df.loc[index, 'name']
    description_startValue = df.loc[index, 'description'].replace(",", "")
    # Try to get our age value
    age = re.search(ageRegex, description_startValue).group(0)
    # Try to get our gender value (strip the space the pattern may pick up)
    gender = re.search(genderRegex, description_startValue).group(0).strip()
    # Per the sample data, the remaining text should be the hobby value
    hobbyValue = description_startValue.split(gender, 1)[1].strip()
    # Set those values to the working dictionary created above
    dictBuilder["age"] = age
    dictBuilder["gender"] = gender
    dictBuilder["hobbies"] = hobbyValue
    # Set our working dictionary to the output dictionary created outside the loop
    dictOutput[nameValue] = dictBuilder

# Parse our dictionary to create a list of values to use in a data frame
listBuilder = []
for eachDictionary in dictOutput:
    nameVal = eachDictionary
    for eachValue in dictOutput[eachDictionary]:
        listBuilder.append([nameVal, dictOutput[eachDictionary][eachValue]])

# Create a new data frame with our list of values
df2 = pd.DataFrame(listBuilder, columns=("Name", "Description"))

# Debug print statement
print(df2)
That should yield an output like below:
Name Description
0 Kelly age: 12
1 Kelly gender:female
2 Kelly hobbies: loves to read
3 David age:16
4 David gender:male
5 David hobbies: play soccer
6 Mandy age: 15
7 Mandy gender:female
8 Mandy hobbies: cooking
9 John 18
10 John male
11 John reading
You could consider enhancing the regex search to further clean your input, so that each value (age, gender, hobbies) is formatted with the same prefix, for example with some kind of find and replace.
Below is a reference on Python's regex functions that may help a bit:
https://www.w3schools.com/python/python_regex.asp
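Another way to get the layout asked for in the question is to split each description just before a known label and then explode the resulting lists into rows. The sketch below is a minimal illustration, not a drop-in solution: the `keywords` list and the `split_on_keywords` helper are names made up here, and it assumes every label is followed by a colon and that Python 3.7+ and pandas 1.1+ are available. Rows with no labels, like John's, are left as a single piece.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "name": ["Kelly", "David", "Mandy", "John"],
    "description": [
        "age: 12 gender:female hobbies: loves to read",
        "age:16, gender:male, hobbies: play soccer",
        "age: 15, gender:female, hobbies: cooking",
        "18, male, reading",
    ],
})

# Hypothetical keyword list; extend it with whatever labels appear in the data.
keywords = ["age", "gender", "hobbies"]

# Match optional commas/whitespace that sit right before "<keyword>:".
# The lookahead keeps the keyword itself in the piece that follows.
alternation = "|".join(re.escape(k) for k in keywords)
pattern = re.compile(r"[,\s]*(?=\b(?:" + alternation + r")\s*:)")

def split_on_keywords(text):
    # re.split can return empty strings for zero-width matches, so drop them.
    return [piece for piece in pattern.split(text) if piece]

df["description"] = df["description"].apply(split_on_keywords)
result = df.explode("description", ignore_index=True)
print(result)
```

Because the split points come from the keyword list rather than from ":" alone, a hobby that itself contains a colon would not be split unless it also contains one of the keywords followed by a colon.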

Related

how to break up data in column value to multiple rows in pandas dataframe

I have multiple rows in a csv file that need to be converted to a pandas data frame. In some rows, the columns 'name' and 'business' contain multiple names and businesses. These should be split into separate rows, while the data in the other columns stays the same for each row that is split.
Here is the example data:
input:
software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J., Ricky B.          Unspecified, Unspecified, Self-employed
output I need:
software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D           Unspecified
def       Connie J        Unspecified
def       Ricky B         Self-employed
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column; the business column and a few other columns still need to be split up along with it.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name='name2'))
# 'records' is the supported orient name; the variable no longer shadows the built-in dict
records = df.to_dict('records')
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    # df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop('name2', axis=1)
Here's an alternative solution I was trying as well but it gives me an error too:
review_path = r'data/base_data'
review_files = glob.glob(review_path + "/test_data.csv")
review_df_list = []
for review_file in review_files:
    df = pd.read_csv(io.StringIO(review_file), sep = '\t')
    print(df.head())
    df["business"] = (df["business"].str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))").groupby(level=0).agg(list))
    df["name"] = df["name"].str.split(r"\s*,\s*")
    print(df.explode(["name", "business"]))
    outPutPath = Path('data/base_data/test_data.csv')
    df.to_csv(outPutPath, index=False)
Error Message for alternative solution:
Read:data/base_data/review_base.csv
Success!
Empty DataFrame
Columns: [data/base_data/test_data.csv]
Index: []
Try:
df["business"] = (
    df["business"]
    .str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
    .groupby(level=0)
    .agg(list)
)
df["name"] = df["name"].str.split(r"\s*,\s*")
print(df.explode(["name", "business"]))
Prints:
software name business
0 abc Andrew Johnson Outsourcing/Offshoring, 201-500 employees
0 abc Steve Martin Health, Wellness and Fitness, 5001-10,000 employees
1 xyz Jack Jones Banking, 1001-5000 employees
1 xyz Rick Paul Construction, 51-200 employees
1 xyz Johnny Jones Consumer Goods, 10,001+ employees
2 def Tom D. Unspecified
2 def Connie J. Unspecified
2 def Ricky B. Self-employed
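A note on the empty DataFrame in the question's second attempt: the output lists the file path itself as the only column, which suggests the path string was parsed as if it were the file's contents. `io.StringIO(review_file)` wraps the path text, not the file, so `read_csv` sees a one-line "file" with a header and no rows. Passing the path directly to `pd.read_csv` should avoid that. The sketch below reproduces the symptom and then applies the steps above to an in-memory stand-in for the tab-separated file. The sample text is made up for illustration, and exploding two columns at once assumes pandas 1.3 or newer.

```python
import io
import pandas as pd

path = "data/base_data/test_data.csv"

# Wrapping the *path* in StringIO makes pandas parse the path text itself:
# one header line, no data rows.
wrong = pd.read_csv(io.StringIO(path), sep="\t")
print(wrong.columns.tolist())  # the path shows up as the only column name

# In real code, pass the path directly: pd.read_csv(path, sep="\t").
# Here an in-memory string stands in for the file contents (assumed layout).
sample = (
    "software\tname\tbusiness\n"
    "abc\tAndrew Johnson, Steve Martin\t"
    "Outsourcing/Offshoring, 201-500 employees,"
    "Health, Wellness and Fitness, 5001-10,000 employees\n"
    "def\tTom D., Connie J., Ricky B.\tUnspecified, Unspecified, Self-employed\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t")

# Each business entry ends with one of these three markers.
pattern = r"[\s,]*(.*?(?:Unspecified|employees|Self-employed))"
df["business"] = df["business"].str.extractall(pattern)[0].groupby(level=0).agg(list)
df["name"] = df["name"].str.split(r"\s*,\s*")

# Both list columns must have the same length in every row for this to work.
exploded = df.explode(["name", "business"], ignore_index=True)
print(exploded)
```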

Get string instead of list in Pandas DataFrame

I have a column Name of string data type. I want to get all the words except the last one and put them in a new column FName, which I managed to do like this:
df = pd.DataFrame({'Name': ['John A Sether', 'Robert D Junior', 'Jonny S Rab'],
                   'Age': [32, 34, 36]})
df['FName'] = df['Name'].str.split(' ').str[0:-1]
Name Age FName
0 John A Sether 32 [John, A]
1 Robert D Junior 34 [Robert, D]
2 Jonny S Rab 36 [Jonny, S]
But the new column FName contains lists, which I don't want. I want it to look like: John A.
I tried converting the list to a string, but that does not seem to be right.
Any suggestions?
You can use .str.rsplit:
df['FName'] = df['Name'].str.rsplit(n=1).str[0]
Or you can use .str.extract:
df['FName'] = df['Name'].str.extract(r'(\S+\s?\S*)', expand=False)
Or, you can chain .str.join after .str.split:
df['FName'] = df['Name'].str.split().str[:-1].str.join(' ')
Name Age FName
0 John A Sether 32 John A
1 Robert D Junior 34 Robert D
2 Jonny S Rab 36 Jonny S
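The three options agree on the sample names but can differ on others. As a quick check, assuming names may have more than three words, the sketch below compares them on an extra name invented for illustration. The `.str.extract` pattern as written captures at most the first two words, while the other two drop only the last word.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John A Sether", "Mary Ann B Smith"]})

# Split once from the right and keep everything before the last word.
by_rsplit = df["Name"].str.rsplit(n=1).str[0]

# Split into words, drop the last one, and join the rest back together.
by_join = df["Name"].str.split().str[:-1].str.join(" ")

# Regex capture: one word, an optional space, and an optional second word.
by_extract = df["Name"].str.extract(r"(\S+\s?\S*)", expand=False)

print(by_rsplit.tolist())
print(by_join.tolist())
print(by_extract.tolist())
```

If the goal is "everything except the last word", `rsplit` or split-and-join looks like the safer choice for names of varying length.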

Regular Expressions in a dataframe Python

I am trying to extract the name from the data frame.
df['target_name'].head()
3 Minnie
4 Albert [unclear]Gles[/unclear]
5 Eliza [unclear]Gles[/unclear]
6 John Slaltery
7 [unclear]P.[/unclear] Slaltery
23 ? Stewart
34 John Maddison
35 Herbert Olney
36 William Iverach
37 [unclear][/unclear]
38 Peter Blacksmith
39 William Oliver
40 Emily
Name: target_name, dtype: object
This is the output. We just want to get rid of the unnecessary characters and fetch the name.
This is what I have done:
import re
df['target_name'] = df['target_name'].astype(str) #converting it into a string.
I tried using these two methods, but they both gave me the same output, i.e. NaN:
df['target_name'] = df['target_name'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
df['target_name3'] = df['target_name'].str.replace(r'\([^)]*\)', '').str.strip()
This seems to work for me.
import pandas as pd
import re
target_name = ["Minnie", "Albert [unclear]Gles[/unclear]",
               "Eliza [unclear]Gles[/unclear]",
               "[unclear]P.[/unclear] Slaltery", "? Stewart"]
df = pd.DataFrame(target_name, columns = ['target_name'])
# regex=True is needed on recent pandas versions, where str.replace treats the pattern literally by default
df['target_name'] = df['target_name'].astype('str').str.replace(r'\/|\?', '', regex=True).str.replace(r'\[[a-z]+\]', '', regex=True).str.strip()
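The same cleanup can be sketched as a single substitution. This assumes the only markup in the column is bracketed tags such as [unclear] and [/unclear] plus stray question marks. `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

names = pd.Series([
    "Minnie",
    "Albert [unclear]Gles[/unclear]",
    "[unclear]P.[/unclear] Slaltery",
    "? Stewart",
    "[unclear][/unclear]",
])

# Remove opening/closing bracketed tags and question marks in one pass,
# then trim the whitespace left behind.
cleaned = names.str.replace(r"\[/?[a-z]+\]|\?", "", regex=True).str.strip()
print(cleaned.tolist())
```

Rows that contained nothing but tags come out as empty strings, which can then be filtered or replaced as needed.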

Formatting nlargest output pandas

I'm new to pandas and so am a bit unfamiliar with how it works. I have processed some data and obtained the results I want, however, I am having trouble figuring out how to format the output with print. For instance, I only want to display certain rows of data, as well as putting certain values in ().
From doing this:
df = pd.read_csv('data_file.csv')
tallmen = df[df['gender'] == 'M'].nlargest(2, 'height')
This is the output I get by doing print(tallmen):
id name gender state height
6 5 Smith, Bob M New York 73.5
2 7 Wright, Frank M Kentucky 75.2
And this is the output I want:
Smith, Bob (M) 6' 1.5"
Wright, Frank (M) 6' 3.2"
When I tried to use tallmen as a dictionary, it gave me an error, so I'm not quite sure what to do. Additionally, is there a way for me to manipulate the height values so that I can reformat them (i.e. display them in the ft/in format shown above)?
You can create a new column this way:
In [207]: df
Out[207]:
id name gender state height
6 5 Smith, Bob M New York 73.5
2 7 Wright, Frank M Kentucky 75.2
In [208]: df['new'] = (
     ...:     df.name + ' (' + df.gender + ') ' +
     ...:     (df.height // 12).astype(int).astype(str) +
     ...:     "' " + (df.height % 12).round(1).astype(str) + '"')
     ...:
In [209]: df
Out[209]:
id name gender state height new
6 5 Smith, Bob M New York 73.5 Smith, Bob (M) 6' 1.5"
2 7 Wright, Frank M Kentucky 75.2 Wright, Frank (M) 6' 3.2"
My professor helped me figure this out. Really what I needed was to know how to iterate through values in the DataFrame. My solution looks like this:
df = pd.read_csv('data_file.csv')
tallmen = df[df['gender'] == 'M'].nlargest(2, 'height')
for i, val in tallmen.iterrows():
    feet = val['height'] // 12
    inches = val['height'] % 12
    # %.1f keeps the fractional inch; \" prints the inch mark
    print("%s (%s) %i' %.1f\"" % (val['name'], val['gender'],
                                  feet, inches))
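For reference, the feet-and-inches formatting can be pulled into a small helper. `format_height` is a name made up for this sketch. It assumes heights are given in inches, and it rounds to one decimal place first so that floating-point remainders such as 75.2 % 12 do not print as long decimals.

```python
import pandas as pd

def format_height(inches):
    # Round first, then split into whole feet and leftover inches.
    feet, rem = divmod(round(inches, 1), 12)
    return f"{int(feet)}' {rem:.1f}\""

# Stand-in for the filtered frame from the question.
tallmen = pd.DataFrame({
    "name": ["Smith, Bob", "Wright, Frank"],
    "gender": ["M", "M"],
    "height": [73.5, 75.2],
})

lines = [
    f"{row.name} ({row.gender}) {format_height(row.height)}"
    for row in tallmen.itertuples(index=False)
]
print("\n".join(lines))
```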

Parsing data from file

I wanted to parse a given file line by line. The file has a format of
'name age gender hobby1 hobby2...'.
The first thing that came to mind was to use a named tuple of the form namedtuple('info',['name','age', 'gender','hobby']).
How can I save the data in my file to a list of tuples with the corresponding values? I tried using line.split(), but I couldn't figure out how to save the space-separated hobbies to info.hobby.
Input file
If I understand you correctly, you can use pandas and pass a single space (' ') as the sep, if the data looks like this:
name age gender hobby1 hobby2
steve 12 male xyz abc
bob 29 male swimming golfing
alice 40 female reading cooking
tom 50 male sleeping
Here is the code for the approach described above:
import pandas as pd
df = pd.read_csv('file.txt', sep=' ')
df.fillna(' ', inplace=True)
df['hobby'] = df[['hobby1', 'hobby2']].apply(lambda i: ' '.join(i), axis=1)
df.drop(['hobby1', 'hobby2'], axis=1, inplace=True)
print(df)
out:
name age gender hobby
0 steve 12 male xyz abc
1 bob 29 male swimming golfing
2 alice 40 female reading cooking
3 tom 50 male sleeping
EDIT: added your data from comment above
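Since the question asked about a named tuple, here is a minimal non-pandas sketch. It assumes each line has at least a name, an age and a gender, that everything after those is a hobby, and that any header line has already been skipped. `Info` and `parse_line` are illustrative names.

```python
from collections import namedtuple

Info = namedtuple("Info", ["name", "age", "gender", "hobby"])

def parse_line(line):
    # The starred target collects every remaining field as the hobby list.
    name, age, gender, *hobbies = line.split()
    return Info(name, int(age), gender, hobbies)

lines = [
    "steve 12 male xyz abc",
    "tom 50 male sleeping",
]
records = [parse_line(line) for line in lines]
for record in records:
    print(record.name, record.hobby)
```

With a real file, the same `parse_line` could be applied to each non-empty line read from it. Keeping the hobbies as a list avoids having to decide in advance how many hobby columns there are.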
