Grab useful data between strings using python from a file - python

I have the following raw data from an HTML source code file:
{$deletedFields:[courses,projects,description,degreeName,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),school:urn:li:fs_miniSchool:11709,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,schoolName:Charles University in Prague,fieldOfStudy:Economics, Politics,schoolUrn:urn:li:fs_miniSchool:11709,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[courses,projects,description,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),school:urn:li:fs_miniSchool:17888,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,degreeName:BA,schoolName:Occidental College,fieldOfStudy:Economics,schoolUrn:urn:li:fs_miniSchool:17888,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[],profileId:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,elements:[urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717)],paging:urn:li:fs_profileView:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,educationView,paging,$type:com.linkedin.voyager.identity.profile.EducationView,$id:urn:li:fs_profileView:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,educationView},
{$deletedFields:[],start:501,end:1000,$type:com.linkedin.voyager.identity.profile.EmployeeCountRange,$id:urn:li:fs_position:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,323432440),company,employeeCountRange}
{$deletedFields:[month,day],year:2007,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,startDate},
{$deletedFields:[month,day],year:2004,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,startDate},
{$deletedFields:[month,day],year:2008,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,endDate},
{$deletedFields:[month,day],year:2007,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,endDate},
What I need is to extract some data out of it, using:
schoolname = re.findall(r',schoolname:(.*?),' , page_html)
fieldofstudy = skills = re.findall(r'fieldOfStudy:(.*?),s' , page_html)
degreename = re.findall(r'degreeName:(.*?),' , page_html)
Needed Output
schoolName:Charles University in Prague
fieldOfStudy:Economics, Politics
Start : Year 2007
End : 2007
schoolName:Occidental College
fieldOfStudy:Economics
degreeName:BA
start : 2004
End : 2008
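A note on the attempts above: regular expressions are case sensitive, so r',schoolname:(.*?),' never matches schoolName in the data. A minimal corrected sketch, assuming as above that the raw text is held in page_html:

import re

# case must match the data exactly: schoolName, not schoolname
school_names = re.findall(r',schoolName:(.*?),', page_html)
# fieldOfStudy may itself contain a comma ("Economics, Politics"),
# so stop at the next known key instead of at the first comma
fields_of_study = re.findall(r'fieldOfStudy:(.*?),(?:schoolUrn|\$type)', page_html)
degree_names = re.findall(r'degreeName:(.*?),', page_html)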

Define a data container class School:
class School(object):
    def __init__(self, raw_data):
        key = None
        year = '?'
        for kv in raw_data:
            i = kv.find(':')
            if i >= 0:
                # "key:value" item: remember the key, store the interesting values
                key = kv[0:i]
                value = kv[i + 1:]
                if key in ['schoolName', 'fieldOfStudy', 'startDate', 'endDate', 'degreeName']:
                    object.__setattr__(self, key, value)
                if key in ['year']:
                    year = value
            else:
                # continuation item (no colon): interpret it via the last key seen
                if key in ['entityUrn', '$id']:
                    if kv[:-1].isdigit():
                        self.entity = kv[:-1]
                elif key in ['fieldOfStudy']:
                    self.fieldOfStudy += ', ' + kv
                elif kv in ['startDate', 'endDate']:
                    object.__setattr__(self, kv, year)
                key = ''
        if not hasattr(self, 'degreeName'):
            self.degreeName = 'unknown'

    def __repr__(self):
        return "entity:\t\t{s.entity:>28}\n" \
               "schoolName:\t{s.schoolName:>28}\n" \
               "fieldOfStudy:{s.fieldOfStudy:>27}\n" \
               "degreeName:\t{s.degreeName:>28}\n" \
               "startDate:\t{s.startDate:>28}\n" \
               "endDate:\t{s.endDate:>28}\n".format(s=self)
Read the file line by line:
import re

with open('<path to file>') as fh:
    degreeUrn = {}
    for line in fh:
        # split each record into its deletedFields list and the remaining key-value run
        match = re.findall(r'\{(.*?)\:\[(.*?)\],(.*)\}', line)
        m2 = match[0][2].split(',')
        school = School(m2)
        if hasattr(school, 'entity'):
            if hasattr(school, 'startDate'):
                degreeUrn[school.entity].startDate = school.startDate
                del school
            elif hasattr(school, 'endDate'):
                degreeUrn[school.entity].endDate = school.endDate
                del school
            elif hasattr(school, 'schoolName'):
                degreeUrn[school.entity] = school
        else:
            del school

for entity in degreeUrn:
    print(degreeUrn[entity])
Output:
entity: 75863717
schoolName: Charles University in Prague
fieldOfStudy: Economics, Politics
degreeName: unknown
startDate: 2007
endDate: 2007
entity: 26812055
schoolName: Occidental College
fieldOfStudy: Economics
degreeName: BA
startDate: 2004
endDate: 2008
Tested with Python: 3.4.2

Related

Trying to find averages from a .txt but I keep getting ValueError: could not convert string to float: ''

I'm using the txt file: https://drive.google.com/file/d/1-VrWf7aqiqvnshVQ964zYsqaqRkcUoL1/view?usp=sharin
I'm running the script:
data = f.read()
ny_sum = 0
ny_count = 0
sf_sum = 0
sf_count = 0
for line in data.split('\n'):
    print(line)
    parts = line.split('\t')
    city = parts[2]
    amount = float(parts[4])
    if city == 'San Francisco':
        sf_sum = sf_sum + amount
    elif city == 'New York':
        ny_sum = ny_sum + amount
        ny_count = ny_count + 1

ny_avg = ny_sum / ny_count
sf_avg = sf_sum / sf_count
#print(ny_avg, sf_avg)

f = open('result_file.txt', 'w')
f.write('The average transaction amount based on {} transactions in New York is {}\n'.format(ny_count, ny_avg))
f.write('The average transaction amount based on {} transactions in San Francisco is {}\n'.format(sf_count, sf_avg))
if ny_avg > sf_avg:
    f.write('New York has higher average transaction amount than San Francisco\n')
else:
    f.write('San Francisco has higher average transaction amount than New York\n')
f.close()
And I ALWAYS get the error:
ValueError: could not convert string to float: ''
I'm pretty new-ish to Python and I'm really not sure what I'm doing wrong here. I'm trying to get averages for New York and San Francisco, then export the results AND the comparison to a txt results file.
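The message itself narrows it down: float('') means the amount field of some row is an empty string, so blank or malformed lines need to be skipped before converting. The failure reproduces in isolation:

# minimal reproduction of the reported error: converting an empty field
float('')   # ValueError: could not convert string to float: ''

Both answers below therefore guard or skip rows that do not parse cleanly.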
This should give you what you're looking for:
from collections import defaultdict as DD

with open('New Purchases.txt') as pfile:
    sums = DD(lambda: [0.0, 0])
    for line in [line.split('\t') for line in pfile]:
        try:
            k = line[2]
            sums[k][0] += float(line[4])
            sums[k][1] += 1
        except Exception:
            pass

for k in ['San Francisco', 'New York']:
    v = sums.get(k, [0.0, 1])
    print(f'Average for {k} = ${v[0]/v[1]:.2f}')
I have re-arranged the code. I agree with BrutusFocus that the splits make it difficult to read exactly the location on each row, so I have set it up to count a location whenever it appears anywhere in the row.
with open("data.txt", "r") as f:
data = f.read()
ny_sum=0
ny_count=0
sf_sum=0
sf_count=0
for line in data.split('\n'):
parts = line.split('\t')
city = parts[2]
amount = float(parts[4])
print(city, amount)
if "New York" in line:
ny_sum = ny_sum + amount
ny_count = ny_count + 1
elif "San Francisco" in line:
sf_sum = sf_sum + amount
sf_count = sf_count + 1
ny_avg = ny_sum / ny_count
sf_avg = sf_sum / sf_count
#print(ny_avg, sf_avg)
f = open('result_file.txt', 'w')
f.write('The average transaction amount based on {} transactions in New York is
{}\n'.format(ny_count, ny_avg))
f.write('The average transaction amount based on {} transactions in San
Francisco is {}\n'.format(sf_count, sf_avg))
if ny_avg>sf_avg:
f.write('New York has higher average transaction amount than San Francisco\n')
else:
f.write('San Francisco has higher average transaction amount than New York\n')
f.close()

Determining most common name from web scraped birth name data

I have the task of scraping this page: https://www.ssa.gov/cgi-bin/popularnames.cgi, which lists the most common birth names. Now, for a given year, I have to find the most common name shared by both girls and boys (in other words, the exact same name is used for both genders), but I don't know how to do that. With the code below I solved the previous task of printing the list for a given year, but I have no clue how to modify it to get the most common name that both girls and boys have.
import requests
import lxml.html as lh

url = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'
string = input("Year: ")
r = requests.post(url, data=dict(year=string, top="1000", number="n"))
doc = lh.fromstring(r.content)
tr_elements = doc.xpath('//table[2]//td[2]//tr')

cols = []
for col in tr_elements[0]:
    name = col.text_content()
    number = col.text_content()
    cols.append((number, []))

count = 0
for row in tr_elements[1:]:
    i = 0
    for col in row:
        val = col.text_content()
        cols[i][1].append(val)
        i += 1
        if count < 4:
            print(val, end=' ')
            count += 1
        else:
            count = 0
            print(val)
Here's one approach. The first step is to group the data by name and record how many genders have used the name and their aggregate total. After that, we can filter the structure by names with more than one gender using it. Finally, we sort this multi-gender list by counts and take the 0-th element. This is our most popular multi-gender name for the year.
import requests
import lxml.html as lh

url = "https://www.ssa.gov/cgi-bin/popularnames.cgi"
year = input("Year: ")
response = requests.post(url, data=dict(year=year, top="1000", number="n"))
doc = lh.fromstring(response.content)
tr_elements = doc.xpath("//table[2]//td[2]//tr")
column_names = [col.text_content() for col in tr_elements[0]]

names = {}
most_common_shared_names_by_year = {}
for row in tr_elements[1:-1]:
    row = [cell.text_content() for cell in row]
    for i, gender in ((1, "male"), (3, "female")):
        if row[i] not in names:
            names[row[i]] = {"count": 0, "genders": set()}
        names[row[i]]["count"] += int(row[i+1].replace(",", ""))
        names[row[i]]["genders"].add(gender)

shared_names = [
    (name, data) for name, data in names.items() if len(data["genders"]) > 1
]
most_common_shared_names = sorted(shared_names, key=lambda x: -x[1]["count"])
print("%s => %s" % most_common_shared_names[0])
If you're curious, here are the results since 2000:
2000 => Tyler, 22187
2001 => Tyler, 19842
2002 => Tyler, 18788
2003 => Ryan, 20171
2004 => Madison, 20829
2005 => Ryan, 18661
2006 => Ryan, 17116
2007 => Jayden, 17287
2008 => Jayden, 19040
2009 => Jayden, 19053
2010 => Jayden, 18641
2011 => Jayden, 18064
2012 => Jayden, 16952
2013 => Jayden, 15462
2014 => Logan, 14478
2015 => Logan, 13753
2016 => Logan, 12099
2017 => Logan, 15117
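A table like this can be produced by wrapping the scrape-and-aggregate logic above in a function and looping over the years. A sketch, where most_common_shared_name is a hypothetical wrapper around the code above returning the top (name, data) pair:

def most_common_shared_name(year):
    # hypothetical: run the scrape-and-aggregate code above for one year
    # and return most_common_shared_names[0]
    ...

for year in range(2000, 2018):
    name, data = most_common_shared_name(str(year))
    print("%d => %s, %d" % (year, name, data["count"]))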

Error when creating dictionaries from text files

I've been working on a function which will update two dictionaries (similar authors, and awards they've won) from an open text file. The text file looks something like this:
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And so on. The first name is an author's name (last name first, first name last), followed by awards they may have won, and then authors who are similar to them. This is what I've got so far:
def load_author_dicts(text_file, similar_authors, awards_authors):
    name_of_author = True
    awards = False
    similar = False
    for line in text_file:
        if name_of_author:
            author = line.split(', ')
            nameA = author[1].strip() + ' ' + author[0].strip()
            name_of_author = False
            awards = True
            continue
        if awards:
            if ',' in line:
                awards = False
                similar = True
            else:
                if nameA in awards_authors:
                    listawards = awards_authors[nameA]
                    listawards.append(line.strip())
                else:
                    listawards = []
                    listawards.append(line.strip())
                    awards_authors[nameA] = listawards
        if similar:
            if line == '\n':
                similar = False
                name_of_author = True
            else:
                sim_author = line.split(', ')
                nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
                if nameA in similar_authors:
                    similar_list = similar_authors[nameA]
                    similar_list.append(nameS)
                else:
                    similar_list = []
                    similar_list.append(nameS)
                    similar_authors[nameA] = similar_list
            continue
This works great! However, if the text file contains an entry with just a name (i.e. no awards and no similar authors), it screws the whole thing up, generating an IndexError: list index out of range at the line nameS = sim_author[1].strip() + ' ' + sim_author[0].strip().
How can I fix this? Maybe with a try/except block in that area?
Also, I wouldn't mind getting rid of those continue statements; I wasn't sure how else to keep the loop going. I'm still pretty new to this, so any help would be much appreciated! I keep trying stuff and it changes another section I didn't want changed, so I figured I'd ask the experts.
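For the direct question: a try/except would work, but checking the result of the split reads more plainly. A minimal sketch of the guard, reusing the variable names from the code above:

sim_author = line.split(', ')
if len(sim_author) == 2:
    nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
else:
    # the line held a single name with no comma: keep it as-is instead of crashing
    nameS = line.strip()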
How about doing it this way, just to get the data in; then you can manipulate the dictionary any way you want.
test.txt contains your data
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And my code to parse it.
award_parse.py
data = {}
name = ""
awards = []
f = open("test.txt")
for l in f:
    # make sure the line is not blank; don't process blank lines
    if not l.strip() == "":
        # if this is a name and we're not already working on an author, then set the author;
        # otherwise treat this as a new author and move the finished author into the dictionary
        if "," in l and len(name) == 0:
            name = l.strip()
        elif "," in l and len(name) > 0:
            # check to see if the recipient is already in the dict; add to the end of the
            # existing list if he/she already exists
            if not name.strip() in data:
                data[name] = awards
            else:
                data[name].extend(awards)
            name = l.strip()
            awards = []
        # process any lines that are not blank and do not have a comma
        else:
            awards.append(l.strip())
# flush the author still being accumulated when the file ends
if name:
    data.setdefault(name, []).extend(awards)
f.close()

for k, v in data.items():
    print("%s got the following awards: %s" % (k, v))

How to read a particular line of interest from a text file?

Here I have a text file. I want to read Adress, Beneficiary, Beneficiary Bank, Acc Nbr, Total US$, the Date at the top, RUT, and BOX. I tried writing some code myself, but I am not able to correctly get the required information; moreover, if the length of the text changes, I will not get the correct output. How should I do this so that I get every required piece of information as its own string?
The main problem arises when my slicing goes wrong. For example, I am using line[31:] for Acc Nbr, but if the address changes, my slicing will also go wrong.
My Text.txt
2014-11-09 BOX 1531 20140908123456 RUT 21 654321 0123
Girry S.A. CONTADO
G 5 Y Serie A
NO 098765
11 al Rayo 321 - Oqwerty 108 Monteaudio - Gruguay
Pharm Cosco, Inc - Britania PO Box 43215
Dirección Hot Springs AR 71903 - Estados Unidos
Oescripción Importe
US$
DO 7640183 - 50% of the Production Degree 246,123
Beneficiary Bank: Bankue Heritage (Gruguay) S.A Account Nbr: 1234563 Swift: MANIUYMM
Adress: Tencon 108 Monteaudio, Gruguay.
Beneficiary: Girry SA Acc Nbr: 1234567
Servicios prestados en el exterior, exentos de IVA o IRAE
Subtotal US$ 102,500
Iva US$ ---------------
Total US$ 102,500
I.V.A AL DIA Fecha de Vencimiento
IMPRENTA IRIS LTDA. - RUT 210161234015 - 0/40987 17/11/2015
CONSTANCIA N9 1234559842 -04/2013
CONTADO A 000.001/ A 000.050 x 2 VIAS
QWERTYAS ZXCVBIZADA
R. U.T. Bamprador Asdfumldor Final
Fecha 12/12/2014
1º ORIGINAL CLLLTE (Blanco) 2º CASIA AQWERVO (Rosasd)
My Code:
txt = 'Text.txt'
lines = [line.rstrip('\n') for line in open(txt)]
for line in lines:
    if 'BOX' in line:
        Date = line.split("BOX")[0]
        BOX = line.split('BOX ', 1)[-1].split("RUT")[0]
        RUT = line.split('RUT ', 1)[-1]
        print 'Date : ' + Date
        print 'BOX : ' + BOX
        print 'RUT : ' + RUT
    if 'Adress' in line:
        Adress = line[8:]
        print 'Adress : ' + Adress
    if 'NO ' in line:
        Invoice_No = line.split('NO ', 1)[-1]
        print 'Invoice_No : ' + Invoice_No
    if 'Swift:' in line:
        Swift = line.split('Swift: ', 1)[-1]
        print 'Swift : ' + Swift
    if 'Fecha' in line and '/' in line:
        Invoice_Date = line.split('Fecha ', 1)[-1]
        print 'Invoice_Date : ' + Invoice_Date
    if 'Beneficiary Bank' in line:
        Beneficiary_Bank = line[18:]
        Ben_Acc_Nbr = line.split('Nbr: ', 1)[-1]
        print 'Beneficiary_Bank : ' + Beneficiary_Bank.split("Acc")[0]
        print 'Ben_Acc_Nbr : ' + Ben_Acc_Nbr.split("Swift")[0]
    if 'Beneficiary' in line and 'Beneficiary Bank' not in line:
        Beneficiary = line[13:]
        print 'Beneficiary : ' + Beneficiary.split("Acc")[0]
    if 'Acc Nbr' in line:
        Acc_Nbr = line.split('Nbr: ', 1)[-1]
        print 'Acc_Nbr : ' + Acc_Nbr
    if 'Total US$' in line:
        Total_US = line.split('US$ ', 1)[-1]
        print 'Total_US : ' + Total_US
Output:
Date : 2014-11-09
BOX : 1531 20140908123456
RUT : 21 654321 0123
Invoice_No : 098765
Swift : MANIUYMM
Beneficiary_Bank : Bankue Heritage (Gruguay) S.A
Ben_Acc_Nbr : 1234563
Adress : Tencon 108 Monteaudio, Gruguay.
Beneficiary : Girry SA
Acc_Nbr : 1234567
Total_US : 102,500
Invoice_Date : 12/12/2014
Some Code Changes
I have made some changes, but I am still not convinced, as I also need to account for spaces in the splits.
I would recommend you use regular expressions to extract the information you need; it avoids having to count offset characters.
import re

with open('C:\Quad.txt') as f:
    for line in f:
        # capture everything after the label to the end of the line
        # (a lazy (.*?) at the end of a pattern would always match empty)
        match = re.search(r"Acc Nbr: (.*)", line)
        if match is not None:
            Acc_Nbr = match.group(1)
            print Acc_Nbr
            # etc...
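The same idea extends to several fields with one pattern map; a sketch, where the patterns are assumptions based on the sample text above:

import re

# hypothetical field-to-pattern map modelled on the sample lines above
patterns = {
    'Swift': re.compile(r'Swift: (\S+)'),
    'Acc_Nbr': re.compile(r'Acc Nbr: (\S+)'),
    'Total_US': re.compile(r'Total US\$ (\S+)'),
}

with open('Text.txt') as f:
    for line in f:
        for field, pattern in patterns.items():
            match = pattern.search(line)
            if match is not None:
                print('%s : %s' % (field, match.group(1)))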
You can also search for the label to obtain its index. For example:
if 'Acc Nbr' in line:
    # skip past the label and the ': ' separator (len('Acc Nbr: ') == 9)
    Acc_Nbr = line[line.find('Acc Nbr') + 9:]
    print Acc_Nbr
Note that find gives you the index of the first character of the item you searched for.

Organize by Twitter unique identifier using python

I have a CSV file with each line containing information pertaining to a particular tweet (i.e. each line contains Lat, Long, User_ID, tweet and so on). I need to read the file and organize the tweets by the User_ID. I am trying to end up with a given User_ID attached to all of the tweets with that specific ID.
Here is what I want:
user_id: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
user_id2: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
and so on...
This is a snip of my code that reads in the CSV file and creates a list:
UID = []
myID = []
ID = []
f = None
with open(csv_in, 'rU') as f:
    myreader = csv.reader(f, delimiter=',')
    for row in myreader:
        # Assign columns in csv to variables.
        latitude = row[0]
        longitude = row[1]
        user_id = row[2]
        user_name = row[3]
        date = row[4]
        time = row[5]
        tweet = row[6]
        flag = row[7]
        compound = row[8]
        Vote = row[9]
        # Read variables into separate lists.
        UID.append(user_id + ', ' + latitude + ', ' + longitude + ', ' + user_name + ', ' + date + ', ' + time + ', ' + tweet + ', ' + flag + ', ' + compound)
myID = ', '.join(UID)
ID = myID.split(', ')
I'd suggest you use pandas for this. It will allow you not only to list your tweets by user_id, as in your question, but also to do many other manipulations quite easily.
As an example, take a look at this Python notebook from NLTK. At the end of it, you see an operation very close to yours: reading a csv file containing tweets,
In [25]:
import pandas as pd

tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
You can also find a simple operation: looking for the tweets of a certain user,
In [26]:
tweets.loc[tweets['user.id'] == 557422508]['text']
Out[26]:
id
593891099548094465 VIDEO: Sturgeon on post-election deals http://...
593891101766918144 SNP leader faces audience questions http://t.c...
Name: text, dtype: object
For listing the tweets by user_id, you would simply do something like the following (this is not in the original notebook),
In [9]:
tweets.set_index('user.id')[0:4]
Out[9]:
created_at favorite_count in_reply_to_status_id in_reply_to_user_id retweet_count retweeted text truncated
user.id
107794703 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #KirkKus: Indirect cost of the UK being in ... False
557422508 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False VIDEO: Sturgeon on post-election deals http://... False
3006692193 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #LabourEoin: The economy was growing 3 time... False
455154030 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #GregLauder: the UKIP east lothian candidat... False
Hope it helps.
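If pandas is not available, the same grouping works with the standard library alone. A minimal sketch using the column positions from the question's code (csv_in as defined there):

import csv
from collections import defaultdict

tweets_by_user = defaultdict(list)

with open(csv_in) as f:
    for row in csv.reader(f, delimiter=','):
        # columns as in the question: row[0]=lat, row[1]=long, row[2]=user_id, row[6]=tweet
        tweets_by_user[row[2]].append((row[0], row[1], row[6]))

for user_id, tweets in tweets_by_user.items():
    print(user_id)
    for latitude, longitude, tweet in tweets:
        print('\t%s, %s, %s' % (latitude, longitude, tweet))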
