Parsing data from file

I want to parse a file line by line. Each line has the format
'name age gender hobby1 hobby2...'.
My first idea was to use a named tuple of the form namedtuple('info', ['name', 'age', 'gender', 'hobby']).
How can I save the data from my file to a list of tuples with the corresponding values? I tried line.split(), but I couldn't figure out how to store the space-separated hobbies in info.hobby.

If I understand you correctly, you can use pandas and pass a single space (' ') as sep, assuming the data looks like this:
name age gender hobby1 hobby2
steve 12 male xyz abc
bob 29 male swimming golfing
alice 40 female reading cooking
tom 50 male sleeping
Here is the code for the approach described above:
import pandas as pd
df = pd.read_csv('file.txt', sep=' ')
# Rows with fewer hobbies get NaN; replace it so the join below works
df.fillna('', inplace=True)
df['hobby'] = df[['hobby1', 'hobby2']].apply(lambda row: ' '.join(row).strip(), axis=1)
df.drop(['hobby1', 'hobby2'], axis=1, inplace=True)
print(df)
out:
name age gender hobby
0 steve 12 male xyz abc
1 bob 29 male swimming golfing
2 alice 40 female reading cooking
3 tom 50 male sleeping
EDIT: added your data from comment above
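The named-tuple idea from the question also works without pandas. Below is a minimal sketch, assuming fields are separated by whitespace and every line has at least a name, an age and a gender. The field name hobbies and the helper parse_line are names chosen here for illustration, not part of the original question.

```python
from collections import namedtuple

# 'hobbies' (plural) is a name chosen for this sketch; the question used 'hobby'.
Info = namedtuple('Info', ['name', 'age', 'gender', 'hobbies'])

def parse_line(line):
    """Turn 'name age gender hobby1 hobby2 ...' into an Info tuple."""
    # Star-unpacking collects every remaining field into the hobbies list.
    name, age, gender, *hobbies = line.split()
    return Info(name, int(age), gender, hobbies)

# Sample rows taken from the answer above (without the header line).
sample_lines = [
    "steve 12 male xyz abc",
    "bob 29 male swimming golfing",
    "alice 40 female reading cooking",
    "tom 50 male sleeping",
]

# Skip blank lines so a trailing newline in a file does not break unpacking.
records = [parse_line(line) for line in sample_lines if line.strip()]

for record in records:
    print(record.name, record.age, record.gender, record.hobbies)
```

To read from a real file, the list comprehension can iterate over an open file object instead of sample_lines; a header line, if present, would need to be skipped first.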

Related

How to split text using a list of keywords

I am trying to split my data so that I can categorize/classify it. In the sample data below, most entries are well labelled, but everything is written as one string under description.
df = pd.DataFrame({"name":['Kelly', 'David', 'Mandy', "John"], "description":["age: 12 gender:female hobbies: loves to read", "age:16, gender:male, hobbies: play soccer", "age: 15, gender:female, hobbies: cooking","18, male, reading"]})
df
   name   description
0  Kelly  age: 12 gender:female hobbies: loves to read
1  David  age:16, gender:male, hobbies: play soccer
2  Mandy  age: 15, gender:female, hobbies: cooking
3  John   18, male, reading
I decided to focus first on the descriptions that have labels, since they are already well labelled. However, splitting on ":" does not give the outcome I want. I tried re.findall(), and although it gives me the key labels, I am still not sure how to split my data according to the labels found. Below is what I hope to achieve first.
   name   description
0  Kelly  age: 12
1  Kelly  gender:female
2  Kelly  hobbies: loves to read
3  David  age:16
4  David  gender:male
5  David  hobbies: play soccer
6  Mandy  age: 15
7  Mandy  gender:female
8  Mandy  hobbies: cooking
9  John   18, male, reading

You could try using regular expressions to help search and parse the strings, like below:
# Import statements
import pandas as pd
import re
# Your initial data frame
df = pd.DataFrame({"name":['Kelly', 'David', 'Mandy', "John"], "description":["age: 12 gender:female hobbies: loves to read", "age:16, gender:male, hobbies: play soccer", "age: 15, gender:female, hobbies: cooking","18, male, reading"]})
# set regex values to search for in your data frame (raw strings keep the backslashes intact)
ageRegex = r'(age)?\s?:?\s?[0-9][0-9]?'
genderRegex = r'(gender)?\s?:?\s?(fe)?male'
# create dictionary to store some output
dictOutput = {}
# for each index in our data frame
for index in df.index:
    # create a dictionary of values for each name
    dictBuilder = {}
    nameValue = df.loc[index, 'name']
    description_startValue = df.loc[index, 'description'].replace(",", "")
    # Try to get our age value (strip the whitespace the optional \s? may capture)
    age = re.search(ageRegex, description_startValue).group(0).strip()
    # Try to get our gender value
    gender = re.search(genderRegex, description_startValue).group(0).strip()
    # Per the sample data, the remaining text should be the hobby value
    hobbyValue = description_startValue.split(gender)[1].strip()
    # set those values to the working dictionary created above
    dictBuilder["age"] = age
    dictBuilder["gender"] = gender
    dictBuilder["hobbies"] = hobbyValue
    # Set our working dictionary to the output dictionary created outside the loop
    dictOutput[nameValue] = dictBuilder
# parse our dictionary to create a list of values to use in a data frame
listBuilder = []
for eachDictionary in dictOutput:
    nameVal = eachDictionary
    for eachValue in dictOutput[eachDictionary]:
        listBuilder.append([nameVal, dictOutput[eachDictionary][eachValue]])
# create a new data frame with our list of values
df2 = pd.DataFrame(listBuilder, columns=("Name", "Description"))
# debug print statement
print(df2)
That should yield an output like below:
Name Description
0 Kelly age: 12
1 Kelly gender:female
2 Kelly hobbies: loves to read
3 David age:16
4 David gender:male
5 David hobbies: play soccer
6 Mandy age: 15
7 Mandy gender:female
8 Mandy hobbies: cooking
9 John 18
10 John male
11 John reading
You could consider enhancing the regex search to further clean your input, so that each value (age, gender, hobbies) is formatted with the same prefix, for example with a find-and-replace step.
The tutorial below on Python's regex functions may help:
https://www.w3schools.com/python/python_regex.asp
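An alternative sketch for the labelled rows: split each description just before a known keyword followed by a colon, then give every piece its own row with explode. The keyword list and the helper name split_on_keywords are assumptions made for illustration. Unlabelled rows such as John's are left whole, as in the desired output from the question.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "name": ["Kelly", "David", "Mandy", "John"],
    "description": [
        "age: 12 gender:female hobbies: loves to read",
        "age:16, gender:male, hobbies: play soccer",
        "age: 15, gender:female, hobbies: cooking",
        "18, male, reading",
    ],
})

# Assumed list of labels; extend it to match the real data.
KEYWORDS = ["age", "gender", "hobbies"]

# Match optional commas/whitespace that sit right before "<keyword>:".
# The lookahead keeps the keyword itself in the piece that follows.
BEFORE_KEYWORD = re.compile(r"[,\s]*(?=\b(?:" + "|".join(KEYWORDS) + r")\s*:)")

def split_on_keywords(text):
    """Return the labelled pieces of text, or [text] if no label is found."""
    pieces = (piece.strip() for piece in BEFORE_KEYWORD.split(text))
    # Splitting on zero-width matches produces empty strings; drop them.
    return [piece for piece in pieces if piece]

result = (
    df.assign(description=df["description"].map(split_on_keywords))
      .explode("description")
      .reset_index(drop=True)
)
print(result)
```

Because the split relies on an explicit keyword list, a label missing from KEYWORDS simply stays attached to the preceding piece rather than raising an error.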

Compare two dataframes based on one grain column and print out differences into txt file

I have two dataframes, df1 and df2, which have different numbers of rows but the same columns. The ID column is common to both dataframes. I want to write the differences to a text file. For example:
df1:
ID Name Age Profession sex
1 Tom 20 engineer M
2 nick 21 doctor M
3 krishi 19 lawyer F
4 jacky 18 dentist F
df2:
ID Name Age Profession sex
1 Tom 20 plumber M
2 nick 21 doctor M
3 krishi 23 Analyst F
4 jacky 18 dentist F
The resultant text file should look like:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19 23 lawyer Analyst

You can use compare and a loop:
df3 = df1.set_index('ID').compare(df2.set_index('ID'))
df3.columns = (df3.rename({'self': 'old', 'other': 'new'}, level=1, axis=1)
                  .columns.map('_'.join)
               )
for id, row in df3.iterrows():
    print(f'ID : {id}')
    print(row.dropna().to_frame().T.to_string(index=False))
    print()
output:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19.0 23.0 lawyer Analyst
NB. print is used here for the demo; to write to a file instead, open it in write mode and run the same loop:
with open('file.txt', 'w') as f:
    for id, row in df3.iterrows():
        f.write(f'ID : {id}\n')
        f.write(row.dropna().to_frame().T.to_string(index=False))
        f.write('\n\n')
You could also directly use df3:
Age_old Age_new Profession_old Profession_new
ID
1 NaN NaN engineer plumber
3 19.0 23.0 lawyer Analyst
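One caveat: DataFrame.compare requires both frames to have identical row and column labels, and the question mentions different row counts. Below is a hedged sketch that first restricts both frames to the IDs they share. The extra row with ID 5, the helper write_differences and the use of io.StringIO in place of a real file are assumptions made for the demo.

```python
import io
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["Tom", "nick", "krishi", "jacky"],
    "Age": [20, 21, 19, 18],
    "Profession": ["engineer", "doctor", "lawyer", "dentist"],
    "sex": ["M", "M", "F", "F"],
})
# df2 gets a hypothetical extra row (ID 5) to mimic differing row counts.
df2 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Tom", "nick", "krishi", "jacky", "sam"],
    "Age": [20, 21, 23, 18, 30],
    "Profession": ["plumber", "doctor", "Analyst", "dentist", "chef"],
    "sex": ["M", "M", "F", "F", "M"],
})

old = df1.set_index("ID")
new = df2.set_index("ID")

# compare() needs identically labelled frames, so keep only shared IDs.
common_ids = old.index.intersection(new.index)
diff = old.loc[common_ids].compare(new.loc[common_ids])

# Flatten the (column, self/other) MultiIndex into Age_old, Age_new, ...
diff.columns = [
    f"{column}_{'old' if side == 'self' else 'new'}"
    for column, side in diff.columns
]

def write_differences(diff, handle):
    """Write one block per changed ID to an open text handle."""
    for row_id, row in diff.iterrows():
        handle.write(f"ID : {row_id}\n")
        handle.write(row.dropna().to_frame().T.to_string(index=False))
        handle.write("\n\n")

buffer = io.StringIO()  # stand-in for open('differences.txt', 'w')
write_differences(diff, buffer)
report = buffer.getvalue()
print(report)
```

Passing a handle from open('differences.txt', 'w') instead of the StringIO buffer writes the same text to disk. IDs present in only one frame are ignored here and would need separate handling if they matter.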

Get string instead of list in Pandas DataFrame

I have a column Name of string data type. I want to take every word except the last one and put the result in a new column FName, which I managed to do like this:
df = pd.DataFrame({'Name': ['John A Sether', 'Robert D Junior', 'Jonny S Rab'],
'Age':[32, 34, 36]})
df['FName'] = df['Name'].str.split(' ').str[0:-1]
Name Age FName
0 John A Sether 32 [John, A]
1 Robert D Junior 34 [Robert, D]
2 Jonny S Rab 36 [Jonny, S]
But the new column FName contains lists, which I don't want. I want it to be a string, like: John A.
I tried converting the list to a string, but the result does not seem right.
Any suggestions?

You can use .str.rsplit:
df['FName'] = df['Name'].str.rsplit(n=1).str[0]
Or you can use .str.extract:
df['FName'] = df['Name'].str.extract(r'(\S+\s?\S*)', expand=False)
Or, you can chain .str.join after .str.split:
df['FName'] = df['Name'].str.split().str[:-1].str.join(' ')
Name Age FName
0 John A Sether 32 John A
1 Robert D Junior 34 Robert D
2 Jonny S Rab 36 Jonny S
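The three variants agree on three-word names but behave differently for other lengths. A small sketch comparing them side by side; the four-word and one-word names are made up for the demo:

```python
import pandas as pd

# Names with 3, 4 and 1 words; the last two are invented for this comparison.
names = pd.Series(["John A Sether", "Mary Ann B Smith", "Cher"])

# Split once from the right and keep everything before the last word.
via_rsplit = names.str.rsplit(n=1).str[0]
# Capture the first word plus, optionally, one more word.
via_extract = names.str.extract(r"(\S+\s?\S*)", expand=False)
# Split into words, drop the last one, and join the rest back together.
via_join = names.str.split().str[:-1].str.join(" ")

comparison = pd.DataFrame({
    "Name": names,
    "rsplit": via_rsplit,
    "extract": via_extract,
    "split_join": via_join,
})
print(comparison)
```

With this input, rsplit and split/join keep everything before the last word, extract keeps at most the first two words, and only split/join returns an empty string for a single-word name.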

Regular Expressions in a dataframe Python

I am trying to extract the name from the data frame.
df['target_name'].head()
3 Minnie
4 Albert [unclear]Gles[/unclear]
5 Eliza [unclear]Gles[/unclear]
6 John Slaltery
7 [unclear]P.[/unclear] Slaltery
23 ? Stewart
34 John Maddison
35 Herbert Olney
36 William Iverach
37 [unclear][/unclear]
38 Peter Blacksmith
39 William Oliver
40 Emily
Name: target_name, dtype: object
This is the output. We just want to get rid of the unnecessary characters and fetch the name.
This is what I have done:
import re
df['target_name'] = df['target_name'].astype(str) #converting it into a string.
I tried these two methods, but both gave me the same output, i.e. NaN:
df['target_name'] = df['target_name'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
df['target_name3'] = df['target_name'].str.replace(r'\([^)]*\)', '').str.strip()

This seems to work for me.
import pandas as pd
import re
target_name = ["Minnie", "Albert [unclear]Gles[/unclear]",
"Eliza [unclear]Gles[/unclear]",
"[unclear]P.[/unclear] Slaltery", "? Stewart"]
df = pd.DataFrame(target_name, columns = ['target_name'])
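# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['target_name'] = df['target_name'].astype('str').str.replace(r'/|\?', '', regex=True).str.replace(r'\[[a-z]+\]', '', regex=True).str.strip()
print(df)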
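The same cleanup can also be written as a single pattern. A minimal sketch, assuming the only markup in the column is the [unclear]...[/unclear] tag pair and stray question marks, as in the sample shown in the question:

```python
import pandas as pd

names = pd.Series([
    "Minnie",
    "Albert [unclear]Gles[/unclear]",
    "[unclear]P.[/unclear] Slaltery",
    "? Stewart",
    "[unclear][/unclear]",
])

# Remove opening/closing [unclear] tags and question marks in one pass.
# regex=True is passed explicitly because newer pandas defaults to literal matching.
MARKUP = r"\[/?unclear\]|\?"
cleaned = names.str.replace(MARKUP, "", regex=True).str.strip()
print(cleaned.tolist())
```

Rows that contained only markup end up as empty strings, so they may need to be filtered or replaced afterwards.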

Why does this Pandas csv import fail?

I am trying to import the following csv text:
name, favorites, age, other_hobbies
joe, "[madonna, elvis, u2]", 28, "[football, cooking]"
mary, "[lady gaga, adele]", 36, "[]"
With the following pandas command
file_name = "new_data.csv"
df = pd.read_csv(file_name, sep =",")
print(df)
And I get this result:
name favorites age other_hobbies
joe "[madonna elvis u2]" 28 "[football cooking]"
mary "[lady gaga adele]" 36 "[]" NaN NaN
Why is this happening, and how can I get pandas to read this properly?

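The space after each comma is the problem: a quote character is only treated as a quote when it is the first character of a field, so " preceded by a space is read literally and the commas inside the brackets split the field. Pass skipinitialspace along with the sep:
df = pd.read_csv("new_data.csv", sep=",", skipinitialspace=True)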
print(df)
Output:
name favorites age other_hobbies
0 joe [madonna, elvis, u2] 28 [football, cooking]
1 mary [lady gaga, adele] 36 []
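A self-contained way to try this without a file on disk, using io.StringIO as a stand-in for new_data.csv (the in-memory buffer is an assumption made for the demo):

```python
import io
import pandas as pd

# Same text as the CSV in the question, with a space after every comma.
csv_text = '''name, favorites, age, other_hobbies
joe, "[madonna, elvis, u2]", 28, "[football, cooking]"
mary, "[lady gaga, adele]", 36, "[]"
'''

# skipinitialspace=True drops the space after each delimiter, so the
# quote character starts the field and the quoted commas are preserved.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df)
print(df.columns.tolist())
```

The bracketed values are still plain strings at this point; turning them into Python lists would be a separate step.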
