Matching two alternating XPath values into one list - python

I have encountered this problem:
A page has 5 divs of info, with child values inside of each:
Title,
Price,
Date.
Base Lists:
Titles = []
Prices = []
Dates = []
They all share the same leading XPath, so I can grab all of them. The problem is that "Price" sometimes appears in a div and sometimes doesn't.
So what I get is:
Titles = [Title1, Title2, Title3, Title4, Title5]
Prices = [Price1, Price2, Price5]  # missing the non-existent prices of the 3rd and 4th divs
Dates = [Date1, Date2, Date3, Date4, Date5]
Since the "Prices" class is not always present inside a div, the XPath still catches every price that does exist, but the positions no longer line up with the divs they belong to (Price5 ends up aligned with Title3 and Date3).
So when I do
print(len(Titles))
print(len(Prices))
print(len(Dates))
Obviously it prints out
5
3
5
And lists of different lengths cannot be matched up row by row.
The bright side in this case is that every time there is no "Prices" class, a different one shows - "Estimation".
So what I'm trying to do is find both XPaths, "Prices" and "Estimation", and append their values in order of appearance to the same list, to ultimately get:
Titles = [Title1(1), Title2(2), Title3(3), Title4(4), Title5(5)]
Prices = [Price1(1), Price2(2), Estimation1(3), Estimation2(4), Price3(5)]
Dates = [Date1(1), Date2(2), Date3(3), Date4(4), Date5(5)]
The number in parentheses is the position that matches the correct div.
Then I will have my DataFrame.
However, I cannot seem to find a way to locate both classes simultaneously and append them by order of appearance.
Any assistance?
EDIT: Added sample code by request:
title = []
price = []
date = []

def grabber():
    title_ = browser.find_elements_by_xpath('//*[@class="title-class"]')
    for x in title_:
        try:
            title.append(x.text)
        except:
            title.append("N/A")

    price_ = browser.find_elements_by_xpath('//*[@itemprop="price-class"]')
    no_price = browser.find_elements_by_xpath('//*[@class="estimation-class"]')
    for x, y in price_, no_price:
        price.append(x.text, "N/A")
    # ^ Doesn't work, of course ^ (one method I tried among many others).

    date_ = browser.find_elements_by_xpath('//*[@itemprop="date-class"]')
    for x in date_:
        try:
            date.append(x.text)
        except:
            date.append("N/A")

You can do this with the XPath union operator, a pipe (|), which combines your two XPaths into one:
Pseudocode:
//div/price | //price
EDIT from OP's code:
price_ = browser.find_elements_by_xpath('//*[@itemprop="price-class"]')
no_price = browser.find_elements_by_xpath('//*[@class="estimation-class"]')
can be combined into:
price_ = browser.find_elements_by_xpath(
    '//*[@itemprop="price-class"] | //*[@class="estimation-class"]')
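Since a single XPath query returns its matches in document order, the combined query yields exactly one price-or-estimation element per div, so all three lists stay aligned. Below is a minimal sketch of the whole grabber built on that idea; the class names and the browser object are taken from the question, and the pandas step at the end is an assumption about how you want to build the DataFrame:
import pandas as pd

def grabber():
    # Selenium returns XPath matches in document order, so each list lines up by div.
    titles = [el.text for el in browser.find_elements_by_xpath('//*[@class="title-class"]')]
    prices = [el.text for el in browser.find_elements_by_xpath(
        '//*[@itemprop="price-class"] | //*[@class="estimation-class"]')]
    dates = [el.text for el in browser.find_elements_by_xpath('//*[@itemprop="date-class"]')]

    # All three lists now have one entry per div, so they can go straight into a frame.
    return pd.DataFrame({'Title': titles, 'Price': prices, 'Date': dates})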

Related

Problem generating a list with a numeric qualifier

I am working on a course with low code requirements, and have one step where I am stuck.
I have this code that creates a list of restaurants and the number of reviews each has:
# Filter the rated restaurants
df_rated = df[df['rating'] != 'Not given'].copy()
df_rated['rating'] = df_rated['rating'].astype('int')
df_rating_count = df_rated.groupby(['restaurant_name'])['rating'].count().sort_values(ascending = False).reset_index()
df_rating_count.head()
From there I am supposed to create a list limited to those above 50 reviews, starting from this base:
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count['______________']['restaurant_name']
# Filter to get the data of restaurants that have rating count more than 50
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
# Group the restaurant names with their ratings and find the mean rating of each restaurant
df_mean_4.groupby(['_______'])['_______'].mean().sort_values(ascending = False).reset_index().dropna() ## Complete the code to find the mean rating
Where I am stuck is on the first step.
rest_names = df_rating_count['______________']['restaurant_name']
I am pretty confident in the other 2 steps.
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
df_mean_4.groupby(['restaurant_name'])['rating'].mean().sort_values(ascending = False).reset_index().dropna()
I have frankly tried so many different things I don't even know where to start.
Does anyone have any hints to at least point me in the right direction?
You can index and filter using [].
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count[df_rating_count['rating'] > 50]['restaurant_name']
# function to determine the revenue
def compute_rev(x):
    if x > 20:
        return x*0.25
    elif x > 5:
        return x*0.15
    else:
        return x*0

## Write the appropriate column name to compute the revenue
df['Revenue'] = df['________'].apply(compute_rev)
df.head()
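As a side note, the same bracket pattern (a boolean mask inside [], then a column selection) works on any DataFrame. A tiny toy example with made-up data that is not from the course:
import pandas as pd

toy = pd.DataFrame({'restaurant_name': ['A', 'B', 'C'],
                    'rating_count': [60, 12, 75]})

# The boolean mask keeps only the rows where the condition holds;
# the trailing column selection keeps just the names.
names_over_50 = toy[toy['rating_count'] > 50]['restaurant_name']
print(names_over_50.tolist())  # ['A', 'C']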

Unable to get the required key values with respect to different fields in python

I am crawling through a series of APIs and trying to get the price from the keys. Some APIs have a key like "avg_price" or "average_price", and the rest have "price_min" and "price_max".
WANTED_PRICES = ['average_Price', 'average_price', 'avg_price', 'avg_Price']
WANTED_avgPRICES = ['price_min', 'price_max']
Average prices are what is required, so the above two lists of fields are treated differently.
If an API has "price_min" and "price_max", the price is calculated as the average of the two. And if the API has any field from WANTED_PRICES, then the price is just extracted as it is.
def parse_slugs(self, response, slug_id, crawled_url):
    summary = json.loads(response.body)
    results = summary.get('results', [])
    for result in results:
        for result_avg, result_mm in zip(WANTED_PRICES, WANTED_avgPRICES):
            price = None
            if result_avg in result:
                price = Decimal(result[result_avg])
            elif result_mm in result:
                minp = result['price_min']
                maxp = result['price_max']
                price = ((Decimal(minp + maxp)) / 2)
            if 'published_date' in result.keys():
                date = result['published_date'].split(' ')[0]
            else:
                date = result['published_Date'].split(' ')[0]
            print(slug_id, date, price)
But I am running into two problems:
The prices from the APIs that have the fields from WANTED_avgPRICES are not getting extracted. Only the date and the slug_id are shown in the output, for example:
{'1234'} 08/25/2020 None
{'1234'} 08/25/2020 None
{'1234'} 08/25/2020 None
{'1234'} 08/25/2020 None
It is also giving two outputs for each unique value (duplicates):
{'1053'} 07/01/2020 0.9137499999999999511501869165
{'1053'} 07/01/2020 0.9137499999999999511501869165
{'1053'} 06/25/2020 0.9137499999999999511501869165
{'1053'} 06/25/2020 0.9137499999999999511501869165
By default, zip() stops when the shortest iterable is exhausted. It will ignore the remaining items in the longer iterables, cutting off the result to the length of the shortest iterable:
>>> list(zip(range(3), ['fee', 'fi', 'fo', 'fum']))
[(0, 'fee'), (1, 'fi'), (2, 'fo')]
This means that you don't process ALL the items from the WANTED_PRICES list.
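One way around this is a sketch that drops zip() and checks the two key lists independently; it assumes each result carries either one of the average-price keys or both price_min and price_max, and reuses the key lists from the question:
from decimal import Decimal

def extract_price(result):
    # Look for any of the average-price keys first.
    for key in WANTED_PRICES:
        if key in result:
            return Decimal(str(result[key]))
    # Otherwise fall back to the min/max pair and average the two.
    if all(key in result for key in WANTED_avgPRICES):
        minp = Decimal(str(result['price_min']))
        maxp = Decimal(str(result['price_max']))
        return (minp + maxp) / 2
    return None

Calling extract_price(result) once per result also removes the duplicated output, because nothing is looped over twice.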

Scraping OSHA website using BeautifulSoup

I'm looking for help with two main things: (1) scraping a web page and (2) turning the scraped data into a pandas dataframe (mostly so I can output as .csv, but just creating a pandas df is enough for now). Here is what I have done so far for both:
(1) Scraping the web site:
I am trying to scrape this page: https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015&id=1284178.015&id=1283809.015&id=1283549.015&id=1282631.015. My end goal is to create a dataframe that would ideally contain only the information I am looking for (i.e. I'd be able to select only the parts of the site that I am interested in for my df); it's OK if I have to pull in all the data for now.
As you can see from the URL as well as the ID hyperlinks underneath "Quick Link Reference" at the top of the page, there are five distinct records on this page. I would like each of these IDs/records to be treated as an individual row in my pandas df.
EDIT: Thanks to a helpful comment, I'm including an example of what I would ultimately want in the table below. The first row represents column headers/names and the second row represents the first inspection.
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
1285328.015 12/28/2017 referral 12/28/2017 06/21/2018 2
Mostly relying on BeautifulSoup4, I've tried a few different options to get at the page elements I'm interested in:
# This is meant to give you the first instance of Case Status, which in the case of this page is "CLOSED".
case_status_template = html_soup.head.find('div', {"id": "maincontain"},
                                           class_="container").div.find('table', class_="table-bordered").find('strong').text

# I wasn't able to get the remaining Case Statuses with find_next_sibling or find_all, so I used a different method:
for table in html_soup.find_all('table', class_="table-bordered"):
    print(table.text)
# This gave me the output I needed (i.e. the Case Status for all five records on the page),
# but didn't give me the structure I wanted and didn't really allow me to connect to the other data on the page.

# I was also able to get to the same place with another page element, Inspection Details.
# This is the information reflected on the page after "Inspection: ", directly below Case Status.
insp_details_template = html_soup.head.find('div', {"id": "maincontain"},
                                            class_="container").div.find('table', class_="table-unbordered")
for div in html_soup.find_all('table', class_="table-unbordered"):
    print(div.text)
# Unfortunately, although I could get these two pieces of information to print,
# I realized I would have a hard time getting the rest of the information for each record.
# I also knew that it would be hard to connect/roll all of these up at the record level.
So, I tried a slightly different approach. By focusing instead on a version of that page with a single inspection record, I thought maybe I could just hack it by using this bit of code:
url = 'https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

first_table = html_soup.find('table', class_="table-borderedu")
first_table_rows = first_table.find_all('tr')
for tr in first_table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

# Then, actually using pandas to get the data into a df and out as a .csv.
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015', header=1)
for df in dfs_osha:
    print(df)

path = r'~\foo'
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015', header=1)
for df[1,3] in dfs_osha:
    df.to_csv(os.path.join(path, r'osha_output_table1_012320.csv'))
# This worked better, but didn't actually give me all of the data on the page,
# and wouldn't be replicable for the other four inspection records I'm interested in.
So, finally, I found a pretty handy example here: https://levelup.gitconnected.com/quick-web-scraping-with-python-beautiful-soup-4dde18468f1f. I was trying to work through it, and had gotten as far as coming up with this code:
for elem in all_content_raw_lxml:
    wrappers = elem.find_all('div', class_="row-fluid")
    for x in wrappers:
        case_status = x.find('div', class_="text-center")
        print(case_status)
        insp_details = x.find('div', class_="table-responsive")
        for tr in insp_details:
            td = tr.find_all('td')
            td_row = [i.text for i in td]
            print(td_row)
        violation_items = insp_details.find_next_sibling('div', class_="table-responsive")
        for tr in violation_items:
            tr = tr.find_all('tr')
            tr_row = [i.text for i in tr]
            print(tr_row)
        print('---------------')
Unfortunately, I ran into too many bugs with this to be able to use it so I was forced to abandon the project until I got some further guidance. Hopefully the code I've shared so far at least shows the effort I've put in, even if it doesn't do much to get to the final output! Thanks.
For this type of page you don't really need BeautifulSoup; pandas is enough.
import pandas as pd

url = 'your url above'

# Use pandas to read the tables on the page; there are lots of them...
tables = pd.read_html(url)

# Select from this list of tables only those tables you need:
incident = []  # initialize a list of inspections
for i, table in enumerate(tables):  # we need to find the index position of this table in the list; more below
    if table.shape[1] == 5:  # all relevant tables have this shape
        case = []  # initialize a list of the inspection items you are interested in
        case.append(table.iat[1, 0])  # this is the location in the table of this particular item
        case.append(table.iat[1, 2].split(' ')[2])  # the string in the cell needs to be cleaned up a bit...
        case.append(table.iat[9, 1])
        case.append(table.iat[12, 3])
        case.append(table.iat[13, 3])
        case.append(tables[i+2].iat[0, 1])  # this particular item is in the table 2 positions down from the current
                                            # one; this is where the index position of the current table comes in handy
        incident.append(case)

columns = ["inspection_id", "open_date", "inspection_type", "close_conference", "close_case", "violations_serious_initial"]
df2 = pd.DataFrame(incident, columns=columns)
df2
Output (pardon the formatting):
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
0 Nr: 1285328.015 12/28/2017 Referral 12/28/2017 06/21/2018 2
1 Nr: 1283809.015 12/18/2017 Complaint 12/18/2017 05/24/2018 5
2 Nr: 1284178.015 12/18/2017 Accident 05/17/2018 09/17/2018 1
3 Nr: 1283549.015 12/13/2017 Referral 12/13/2017 05/22/2018 3
4 Nr: 1282631.015 12/12/2017 Fat/Cat 12/12/2017 11/16/2018 1
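Since the original goal was a .csv, the resulting frame can then be written straight out; the file name below is just a placeholder:
df2.to_csv('osha_inspections.csv', index=False)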

Iterate through DIV Class using selenium on page

I'm looking to iterate through a set of rows on a page using Selenium to scrape live results from the page quickly. I have code which returns the first row and prints it, but doesn't appear to iterate through the rest of the set.
content = [browser.find_element_by_class_name('event')]
rows = [browser.find_elements_by_class_name('event__match')]
for rows in content:
    goals = {}
    goals['Home'] = rows.find_element_by_class_name("event__participant--home").text.strip()
    goals['Away'] = rows.find_element_by_class_name("event__participant--away").text.strip()
    goals['hScore'] = rows.find_element_by_class_name("event__scores").text.split("-")[1]
    goals['aScore'] = rows.find_element_by_class_name("event__scores").text.split("-")[-1]
    print(goals['Home'], goals['aScore'], goals['aScore'], goals['Away'])
gets me the result:
Team 1
0
0 Team 2
That would be the expected result if there were only one match on the page - but there are 50 at the moment.
I feel like I'm missing something in my method here; it could be pretty simple and staring me in the face, so apologies if that's the case!
Your mistake is in for rows in content:, where content is the parent div and you need the rows. To iterate through all rows, use the code below:
rows = browser.find_elements_by_class_name('event__match')
for row in rows:
    goals = {}
    goals['Home'] = row.find_element_by_class_name("event__participant--home").text.strip()
    goals['Away'] = row.find_element_by_class_name("event__participant--away").text.strip()
    goals['hScore'] = row.find_element_by_class_name("event__scores").text.split("-")[1]
    goals['aScore'] = row.find_element_by_class_name("event__scores").text.split("-")[-1]
    print(goals['Home'], goals['aScore'], goals['aScore'], goals['Away'])
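A side note beyond the original answer: with a score string like "2 - 1", split("-")[1] and split("-")[-1] both return the away score, and the print call repeats goals['aScore']. If the home score is wanted as well, a small tweak inside the same loop keeps the two apart (a sketch, reusing the row and goals variables from the code above):
    score_text = row.find_element_by_class_name("event__scores").text
    goals['hScore'] = score_text.split("-")[0].strip()   # part before the dash = home score
    goals['aScore'] = score_text.split("-")[-1].strip()  # part after the dash = away score
    print(goals['Home'], goals['hScore'], goals['aScore'], goals['Away'])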

How do I get all parsed data for each business listing and keep it aligned in a CSV file?

I am using Ubuntu 14.04.3 with Python 2.7.6 to parse a Yellowpages.com (YP) Apartments page. I am using lxml and xpath. YP pages appear to have the same layout. The center of each page lists the apartments, and each center apartment has an index number. There are 30 indexed apartments per page. Other apartments are listed at the top, bottom and right side of the page; they look to be advertisements and are of no interest for parsing.
I parsed the page, getting a count of each item listed for an apartment. With 30 numbered apartments I get different counts for the items, so it appears something is wrong. For example:
lenIdxBusNames = 30
lenBusinessNames = 32
lenStreets = 30
lenPageHrefs = 15
I wrote the items/elements to rows in a CSV. The BusinessName and pageHref columns are misaligned: the BusinessName column is shifted up by one row, and there are only 15 pageHrefs, which means some are missing. Instead of being kept in the same rows as the other items, they are listed in the first 16 rows of that column. Some of the search paths are:
idxBusNames = tree.xpath('//h3[@class="n"]/text()')
businessNames = tree.xpath('//h3/a[@class="business-name"]/text()')
streets = tree.xpath('//p[@class="adr"]/span[1]/text()')
pageHrefs = tree.xpath('//a[@class="track-visit-website"]/@href')
I found the XPaths using Firefox's Firebug. More detail is in the attachment.
Thanks for your help,
Bob
Building on my comment:
import requests
from lxml import etree

url = """http://www.yellowpages.com/search?search_terms=apartment"""
url += """&insert geo params here"""
r = requests.get(url)

h = etree.HTMLParser()
tree = etree.fromstring(r.text, h)

xp_info_nodes = """//div[@class='info']"""
xp_id = """h3[@class='n']/text()"""
xp_name = """h3[@class='n']/a[@class='business-name']/text()"""
xp_adr = """div[@class='info-section info-primary']/p[@class='adr']/span[1]/text()"""
xp_link = """.//a[@class='track-visit-website']/@href"""

info_nodes = tree.xpath(xp_info_nodes)

all_data = []
for node in info_nodes:
    # mandatory nodes
    data = [
        node.xpath(xp_id),
        node.xpath(xp_name),
        node.xpath(xp_adr),
    ]
    # insert some function to clean up data[0] here; it's returning weird strings.
    ldata = len(data)
    data = [d for d in data if d]
    if len(data) != ldata:
        # a mandatory item is missing, so skip this listing
        continue

    # optional: the website link is not present for every listing
    optional_data = node.xpath(xp_link)
    all_data.append(data + [optional_data])

for row in all_data:
    print(row)  # or write the row out as a line of your CSV

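To get the aligned CSV the question asks for, the collected rows can then be written out with the csv module. This is a sketch, assuming each entry in all_data is a list of small lists as built above; the output file name is a placeholder, and since the question mentions Python 2.7 the file is opened with 'wb' (on Python 3 use open('apartments.csv', 'w', newline='') instead):
import csv

with open('apartments.csv', 'wb') as f:  # hypothetical output file; 'wb' for Python 2
    writer = csv.writer(f)
    writer.writerow(['index', 'business_name', 'street', 'website'])
    for row in all_data:
        # each field is a (possibly empty) list of strings; join it into one cell
        writer.writerow([' '.join(field).strip() if field else 'N/A' for field in row])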