Issue with scraping data using indexing from html structure - python

I am scraping data from the following HTML structure, found on 30-40 webpages like https://www.o2.co.uk/shop/tariffs/sony/xperia-z-purple/ :
<td class="monthlyCost">£13<span>.50</span></td>
<td class="phoneCost">£479.99</td>
<td><span class="lowLight">24 Months</span></td>
<td>50</td>
<td>Unlimited</td>
<td class="dataAllowance">100MB</td>
<td class="extras">
I am using indexing to scrape the data present under td tags that have no class, like 50 and Unlimited, which correspond to the Minutes and Texts columns in the dataset. The code I am using is:
results = tariff_link_soup.findAll('td', {"class": None})
minutes = results[1]
texts = results[2]
print minutes,texts
All these 30-40 web links are present on the https://www.o2.co.uk/shop/phones/ page. I find the device links there, follow them, and then reach the desired tariff page; all of these final device pages follow the same structure.
Problem: I was hoping to get only the minutes and texts values, such as 50 & Unlimited or 200 & Unlimited, which sit at the 2nd and 3rd index on every page. Instead I am also getting other values when I print the data, e.g. 500MB and 100MB, which belong under the dataAllowance class. I am filtering on class None, but I am still not getting the required data. I checked the HTML structure and it was consistent across pages.
Please help me solve this issue, as I am not able to fathom the reason for this anomaly.
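A likely cause, sketched here rather than asserted: findAll searches the entire document, so class-less td elements elsewhere on the page also land in results and shift the indices. Scoping the search to the tariff table first (the id tariffTable is taken from the working code further down) keeps the indexing stable:
table = tariff_link_soup.find('table', {'id': 'tariffTable'})
results = table.findAll('td', {'class': None})
minutes = results[1]
texts = results[2]
print minutes, texts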
Update: Entire Python code which I am using:
urls = ['https://www.o2.co.uk/shop/phones/',
        'https://www.o2.co.uk/shop/phones/?payGo=true']
plans = ['Pay Monthly', 'Pay & Go']

for url, plan in zip(urls, plans):
    if plan == 'Pay Monthly':
        device_links = parse().direct_url(url, 'span', {"class": "model"})
        for device_link in device_links:
            device_link.parent['href'] = urlparse.urljoin(url, device_link.parent['href'])
            device_link_page = urllib2.urlopen(device_link.parent['href'])
            device_link_soup = BeautifulSoup(device_link_page)
            dev_names = device_link_soup.find('h1')
            for devname in dev_names:
                tariff_link = device_link_soup.find('a', text=re.compile('View tariffs'))
                tariff_link['href'] = urlparse.urljoin(url, tariff_link['href'])
                tariff_link_page = urllib2.urlopen(tariff_link['href'])
                tariff_link_soup = BeautifulSoup(tariff_link_page)
                dev_price = tariff_link_soup.findAll('td', {"class": "phoneCost"})
                monthly_price = tariff_link_soup.findAll('td', {"class": "monthlyCost"})
                tariff_length = tariff_link_soup.findAll('span', {"class": "lowLight"})
                data_plan = tariff_link_soup.findAll('td', {"class": "dataAllowance"})
                results = tariff_link_soup.xpath('//td[not(@class)]')
                print results[1].text
                print results[2].text

I finally used the following code to solve my problem:
for row in tariff_link_soup('table', {'id': 'tariffTable'})[0].tbody('tr'):
    tds = row('td')
    #print tds[0].text, tds[1].text, tds[2].text, tds[3].text, tds[4].text, tds[5].text
    monthly_prices = unicode(tds[0].text).encode('utf8').replace("£", "").replace("FREE", "0").replace("Free", "0").strip()
    dev_prices = unicode(tds[1].text).encode('utf8').replace("£", "").replace("FREE", "0").replace("Free", "0").strip()
    tariff_lengths = unicode(tds[2].text).encode('utf8').strip()
    minutes = unicode(tds[3].text).encode('utf8').strip()
    texts = unicode(tds[4].text).encode('utf8').strip()
    data = unicode(tds[5].text).encode('utf8').strip()
    device_names = unicode(dev_names).encode('utf8').strip()
I am accessing the required data row by row here, using the tabular structure the data sits in: for each row I take all of its td cells and assign names to the ones my dataset needs.
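The same row-by-row idea can be wrapped in a small helper that returns one dict per tariff row. This is only a sketch under the same assumptions as above (a table with id tariffTable and six td cells per data row):
def parse_tariff_table(soup):
    rows = []
    for row in soup('table', {'id': 'tariffTable'})[0].tbody('tr'):
        tds = [td.text.strip() for td in row('td')]
        if len(tds) < 6:
            continue  # skip header or malformed rows
        rows.append({'monthly_price': tds[0], 'device_price': tds[1],
                     'tariff_length': tds[2], 'minutes': tds[3],
                     'texts': tds[4], 'data': tds[5]})
    return rows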


How can I plug this section of code into my BeautifulSoup script?

I am new to Python and Beautiful Soup. The project I am working on is a script which scrapes the pages inside of the hyperlinks on this page:
https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
Currently, the script has a filter which will only scrape the pages which have a "Last Out" date which is past a certain date.
I am trying to add an additional filter to the script, which does the following:
Scrape the "Profit from price change:" section on the page inside the hyperlink (example page: https://bitinfocharts.com/dogecoin/address/D8WhgsmFUkf4imvsrwYjdhXL45LPz3bS1S)
Convert the profit into a float
Compare the profit to a variable called "goal" which has a float assigned to it.
If the profit is greater or equal to goal, then scrape the contents of the page. If the profit is NOT greater or equal to the goal, do not scrape the webpage, and continue the script.
Here is the snippet of code I am using to try and do this:
#Get the profit
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    removetext = oldprofit.replace('USD', '')
    removetext = removetext.replace(' ', '')
    removetext = removetext.replace(',', '')
    profit = float(removetext)
# Compare profit to goal
goal = float(50000)
if profit >= goal
Basically, what I am trying to do is run an if statement on a value on the webpage, and if the statement is true, then scrape the webpage. If the if statement is false, then do not scrape the page and continue the code.
Here is the entire script that I am trying to plug this into:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []

# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

        #Get the Doge Address for the filename
        item = soup.find('h1').text
        newitem = item.replace('Dogecoin', '')
        finalitem = newitem.replace('Address', '')

        #Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)

        # Compare profit to goal
        goal = float(50000)
        if profit >= goal

        if table:
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)
I am familiar with if statements; however, I am unsure how to plug this into the existing code and have it accomplish what I am trying to do. If anyone has any advice, I would greatly appreciate it. Thank you.
From my understanding, it seems like all you are asking is how to have the script continue when a page fails that criterion, in which case you just need to do:
if profit < goal:
    continue
Note, though, that the for loop in your snippet only ends up using the final value of profit; if there are other profit values you need to look at, those values are never evaluated.
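Concretely, one way to wire this in (a sketch based on the script above; the profit parsing and the 12th-td cell are the asker's own assumptions):
# inside `for url in address_links:`, after fetching the page
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    profit = float(oldprofit.replace('USD', '').replace(' ', '').replace(',', ''))

goal = 50000.0
if profit < goal:
    continue  # skip this address entirely and move on to the next url

# the existing `if table:` scraping/CSV code follows here unchanged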

How to scrape particular data from Yahoo Finance?

I am new to web scraping and I'm trying to scrape the "statistics" page of yahoo finance for AAPL. Here's the link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
Here is the code I have so far...
from bs4 import BeautifulSoup
from requests import get
url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")
for stock in stock_data:
    print(stock.text)
When I run that, it returns all of the table data on the page. However, I only want specific data from each table (e.g. "Market Cap", "Revenue", "Beta").
I tried messing around with the code by doing print(stock[1].text) to see if I could limit the amount of data returned to just the second value in each table, but that returned an error message. Am I on the right track using BeautifulSoup, or do I need a completely different library? What would I have to do to return only particular data and not all of the table data on the page?
Examining the HTML-code gives you the best idea of how BeautifulSoup will handle what it sees.
The web page seems to contain several tables, which in turn contain the information you are after. The tables follow a certain logic.
First scrape all the tables on the web page, then find all the table rows (<tr>) and the table data (<td>) that those rows contain.
Below is one way of achieving this. I even threw in a function to print only a specific measurement.
from bs4 import BeautifulSoup
from requests import get

url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

stock_data = soup.find_all("table")
# stock_data will contain multiple tables, next we examine each table one by one
for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        # Index 0 of tds will contain the measurement
        print("Measure: {}".format(tds[0].get_text()))
        # Index 1 of tds will contain the value
        print("Value: {}".format(tds[1].get_text()))
        print("")

def get_measurement(table_array, measurement):
    for table in table_array:
        trs = table.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            if measurement.lower() in tds[0].get_text().lower():
                return tds[1].get_text()

# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))
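For the specific figures named in the question, the same helper can be reused; the exact row labels on Yahoo's page are an assumption here:
print(get_measurement(stock_data, "Market Cap"))
print(get_measurement(stock_data, "Revenue"))
print(get_measurement(stock_data, "Beta"))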
Although this isn't Yahoo Finance, you can do something very similar like this...
import requests
from bs4 import BeautifulSoup

base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []

for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

import pandas
pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)
This is a nice substitute in case Yahoo decides to deprecate more of the functionality of their API. I know they cut out a lot of things (mostly historical quotes) a couple of years ago. It was sad to see that go away.

How to get text from inside span or outside at the same time with xpath?

I have a problem using XPath to scrape an inconsistent price list.
Example
<td><span="green">$33.99</span></td>
<td>Out of stock</td>
<td><span="green">$27.99</span></td>
<td><span="green">$35.00</span></td>
How can I get both the price inside the span and the Out of stock text at the same time?
At the moment I only get $33.99 and the other span values; any text that is not inside a span gets skipped, and that ruins the ordering.
The failed attempt I used, updated from @piratefache's solution (Scrapy):
product_prices_tds = response.xpath('//td')
product_prices = []
for td in product_prices_tds:
    if td.xpath('//span'):
        product_prices = td.xpath('//span/text()').extract()
    else:
        product_prices = td.xpath('//text()').extract()

for n in range(len(product_names)):
    items['price'] = product_prices[n]
    yield items
It's not working because product_prices doesn't get the right text; it collects text from all over the page. The absolute //span path searches the whole document rather than staying inside the current td, so the span and non-span cells never line up as I intended.
Update
For those who come later: I fixed my code thanks to @piratefache. Here is the corrected snippet for anyone who wants to use it:
product_prices_tds = response.xpath('//td')
product_prices = []
for td in product_prices_tds:
    if td.xpath('span'):
        product_prices.append(td.xpath('span//text()').extract())
    else:
        product_prices.append(td.xpath('text()').extract())

for n in range(len(product_names)):
    items['price'] = product_prices[n]
    yield items
See edit below with Scrapy
Based on your HTML code, using the BeautifulSoup library, you can get the information this way:
from bs4 import BeautifulSoup

page = """<td><span="green">$33.99</span></td>
<td>Out of stock</td>
<td><span="green">$27.99</span></td>
<td><span="green">$35.00</span></td>"""

soup = BeautifulSoup(page, features="lxml")
tds = soup.body.findAll('td')  # get all tds
for td in tds:
    # if a span exists inside the td
    if td.find('span'):
        print(td.find('span').text)
    # if not, just print the inner text (here it's Out of stock)
    else:
        print(td.text)
output :
$33.99
Out of stock
$27.99
$35.00
With Scrapy:
import scrapy

page = """<td><span="green">$33.99</span></td>
<td>Out of stock</td>
<td><span="green">$27.99</span></td>
<td><span="green">$35.00</span></td>"""

response = scrapy.Selector(text=page, type="html")
tds = response.xpath('//td')
for td in tds:
    # if a span exists inside the td
    if td.xpath('span'):
        print(td.xpath('span//text()')[0].extract())
    # if not, just print the inner text (here it's Out of stock)
    else:
        print(td.xpath('text()')[0].extract())
output :
$33.99
Out of stock
$27.99
$35.00
XPath solution (from 2.0 upwards) (same logic as #piratefache posted before):
for $td in //td
return
if ($td[span])
then
$td/span/data()
else
$td/data()
Applied on
<root>
<td>
<span>$33.99</span>
</td>
<td>Out of stock</td>
<td>
<span>$27.99</span>
</td>
<td>
<span>$35.00</span>
</td>
</root>
returns
$33.99
Out of stock
$27.99
$35.00
BTW: <span="green"> is not valid XML/HTML. Presumably an attribute name is missing, e.g. <span class="green">.

Web crawling <!--suppress HtmlUnknownAttribute -->

I was trying to crawl the link http://codeforces.com/contest/554/standings.
I used the given two lines to read all contestant names :
table1 = soup.find("table", {'class':'standings'})
table2 = table1.find_all("tr")
However, table2 doesn't contain all the table rows.
I found <!--suppress HtmlUnknownAttribute --> written before all the rows I wasn't able to crawl.
Is there any particular reason for this? I am just a beginner at web crawling.
You may need to share the code in entirety. I get the expected 100 contestant names based on your initial "tr" find_all:
import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://codeforces.com/contest/554/standings')
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'standings'})
rows = table.find_all('tr')
for row in rows:
    contestant = row.find_all('td', {'class': 'contestant-cell'})
    if len(contestant) > 0:
        # Quick'n dirty dig. Makes un-safe assumptions about the HTML structure.
        print contestant[0].a.string
You'll note that some additional digging is required after you get the table rows since not every row contains contestant info.
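If you'd rather not rely on the unsafe assumptions flagged in the comment above, a slightly more defensive variant of the loop (same page structure assumed) might look like:
for row in rows:
    cells = row.find_all('td', {'class': 'contestant-cell'})
    if not cells:
        continue  # header rows carry no contestant cell
    link = cells[0].find('a')
    if link and link.string:
        print link.string.strip()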

Extracting data from a web page using BS4 in Python

I am trying to extract data from this site: http://www.afl.com.au/fixture
in a way such that I have a dictionary with the date as the key and the "Preview" links as values in a list, like:
dict = {"Saturday, June 07": ["preview url-1", "preview url-2", "preview url-3", "preview url-4"]}
Please help me get this; I have used the code below:
def extractData():
    lDateInfoMatchCase = False
    # lDateInfoMatchCase = []
    global gDict
    for row in table_for_players.findAll("tr"):
        for lDateRowIndex in row.findAll("th", {"colspan": "4"}):
            ldateList.append(lDateRowIndex.text)
    print ldateList
    for index in ldateList:
        #print index
        lPreviewLinkList = []
        for row in table_for_players.findAll("tr"):
            for lDateRowIndex in row.findAll("th", {"colspan": "4"}):
                if lDateRowIndex.text == index:
                    lDateInfoMatchCase = True
                else:
                    lDateInfoMatchCase = False
            if lDateInfoMatchCase == True:
                for lInfoRowIndex in row.findAll("td", {"class": "info"}):
                    for link in lInfoRowIndex.findAll("a", {"class": "preview"}):
                        lPreviewLinkList.append("http://www.afl.com.au/" + link.get('href'))
        print lPreviewLinkList
        gDict[index] = lPreviewLinkList
My main aim is to get all the player names who are playing in each match, for both the home and the away team, organised by date in a data structure.
I prefer using CSS Selectors. Select the first table, then all rows in the tbody for ease of processing; the rows are 'grouped' by tr th rows. From there you can select all next siblings that don't contain th headers and scan these for preview links:
previews = {}
table = soup.select('table.fixture')[0]
for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            # found the next group, end the scan
            break
        for preview in next_sibling.select('a.preview'):
            previews.setdefault(date, []).append(
                "http://www.afl.com.au" + preview.get('href'))
This builds a dictionary of lists; for the current version of the page this produces:
{u'Monday, June 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
u'Sunday, June 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
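A quick usage sketch for the resulting dictionary (purely illustrative):
for date, links in previews.items():
    print date
    for link in links:
        print '  ' + link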
