Extracting data from a web page using BS4 in Python

I am trying to extract data from this site: http://www.afl.com.au/fixture
so that I end up with a dictionary that has the date as the key and a list of the "Preview" links as the value, like
dict = {"Saturday, June 07": ["preview url-1", "preview url-2", "preview url-3", "preview url-4"]}
Please help me get there. I have used the code below:
def extractData():
    lDateInfoMatchCase = False
    # lDateInfoMatchCase = []
    global gDict
    for row in table_for_players.findAll("tr"):
        for lDateRowIndex in row.findAll("th", {"colspan" : "4"}):
            ldateList.append(lDateRowIndex.text)
    print ldateList
    for index in ldateList:
        #print index
        lPreviewLinkList = []
        for row in table_for_players.findAll("tr"):
            for lDateRowIndex in row.findAll("th", {"colspan" : "4"}):
                if lDateRowIndex.text == index:
                    lDateInfoMatchCase = True
                else:
                    lDateInfoMatchCase = False
            if lDateInfoMatchCase == True:
                for lInfoRowIndex in row.findAll("td", {"class": "info"}):
                    for link in lInfoRowIndex.findAll("a", {"class" : "preview"}):
                        lPreviewLinkList.append("http://www.afl.com.au/" + link.get('href'))
        print lPreviewLinkList
        gDict[index] = lPreviewLinkList
My main aim is to get the names of all players in the home and away teams for each match, organised by date in a data structure.

I prefer using CSS selectors. Select the first table, then all rows in the tbody for ease of processing; the rows are 'grouped' by header rows that contain a th cell. From each header row you can walk the following sibling rows until you reach one that contains a th, and scan those rows for preview links:
previews = {}
table = soup.select('table.fixture')[0]
for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            # found a next group, end scan
            break
        for preview in next_sibling.select('a.preview'):
            previews.setdefault(date, []).append(
                "http://www.afl.com.au" + preview.get('href'))
This builds a dictionary of lists; for the current version of the page this produces:
{u'Monday, June 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
 u'Sunday, June 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
                      'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
                      'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
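To try the grouping logic without depending on the live page, here is a minimal sketch that runs the same header-then-siblings scan against a small inline HTML fragment. The markup, the hrefs, and the helper name group_previews are invented for illustration; the sketch only assumes the structure described above (a table.fixture whose date rows contain a th).

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above;
# the real page may differ.
SAMPLE_HTML = """
<table class="fixture"><tbody>
<tr><th colspan="4">Sunday, June 08</th></tr>
<tr><td class="info"><a class="preview" href="/match-centre/a-v-b">Preview</a></td></tr>
<tr><td class="info"><a class="preview" href="/match-centre/c-v-d">Preview</a></td></tr>
<tr><th colspan="4">Monday, June 09</th></tr>
<tr><td class="info"><a class="preview" href="/match-centre/e-v-f">Preview</a></td></tr>
</tbody></table>
"""

BASE_URL = "http://www.afl.com.au"


def group_previews(html):
    """Map each date header to the preview links in the rows beneath it."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select("table.fixture")[0]
    previews = {}
    for header in table.select("tbody tr th"):
        date = header.get_text(strip=True)
        for row in header.parent.find_next_siblings("tr"):
            if row.th:
                break  # reached the next date group
            for link in row.select("a.preview"):
                previews.setdefault(date, []).append(BASE_URL + link["href"])
    return previews


previews = group_previews(SAMPLE_HTML)
print(previews)
```

Because the scan stops at the next row that contains a th, each link ends up under exactly one date.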

Related

How can I plug this section of code into my BeautifulSoup script?

I am new to Python and Beautiful Soup. The project I am working on is a script that scrapes the pages behind the hyperlinks on this page:
https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
Currently, the script has a filter so that it only scrapes pages whose "Last Out" date is after a certain date.
I am trying to add an additional filter to the script, which does the following:
Scrape the "Profit from price change:" section on the page behind each hyperlink (example page: https://bitinfocharts.com/dogecoin/address/D8WhgsmFUkf4imvsrwYjdhXL45LPz3bS1S)
Convert the profit into a float.
Compare the profit to a variable called "goal", which has a float assigned to it.
If the profit is greater than or equal to goal, scrape the contents of the page. Otherwise, do not scrape the page, and continue the script.
Here is the snippet of code I am using to try and do this:
#Get the profit
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    removetext = oldprofit.replace('USD', '')
    removetext = removetext.replace(' ', '')
    removetext = removetext.replace(',', '')
    profit = float(removetext)
# Compare profit to goal
goal = float(50000)
if profit >= goal:
    pass  # placeholder: the page should only be scraped in this case
Basically, what I am trying to do is run an if statement on a value on the webpage, and if the statement is true, then scrape the webpage. If the if statement is false, then do not scrape the page and continue the code.
Here is the entire script that I am trying to plug this into:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime
headers = []
datarows = []
# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)
with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)
    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")
        #Get the Doge Address for the filename
        item = soup.find('h1').text
        newitem = item.replace('Dogecoin', '')
        finalitem = newitem.replace('Address', '')
        #Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)
        # Compare profit to goal
        goal = float(50000)
        if profit >= goal:
            pass  # placeholder: the scraping below should only run in this case
        if table:
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])
            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)
I am familiar with if statements; however, I am unsure how to plug this into the existing code and have it accomplish what I am trying to do. If anyone has any advice I would greatly appreciate it. Thank you.

From my understanding, all you are asking is how to have the script continue when a page fails that criterion, in which case you just need to do
if profit < goal:
    continue
Note, though, that the for loop in your snippet overwrites profit on every iteration, so only the final value is compared with goal. If there are other profit values that you need to look at, they are not being evaluated.
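As a rough illustration of the continue approach, here is a small self-contained sketch that works on plain strings instead of live pages. The helper names parse_profit and pages_to_scrape, the example URLs, and the profit strings are all made up; only the cleaning steps and the comparison with goal come from the question.

```python
def parse_profit(text):
    """Turn a string such as '1,234.50 USD' into a float."""
    cleaned = text.replace("USD", "").replace(",", "").replace(" ", "")
    return float(cleaned)


def pages_to_scrape(pages, goal):
    """Return the URLs whose profit meets the goal, skipping the rest."""
    selected = []
    for url, profit_text in pages:
        profit = parse_profit(profit_text)
        if profit < goal:
            continue  # skip this page and move on to the next one
        selected.append(url)
    return selected


# Hypothetical (url, profit text) pairs standing in for scraped pages.
sample_pages = [
    ("https://example.com/address/a", "75,000.25 USD"),
    ("https://example.com/address/b", "1,200 USD"),
    ("https://example.com/address/c", "50,000 USD"),
]

print(pages_to_scrape(sample_pages, goal=50000.0))
```

In the real script the same check would sit inside the for url in address_links loop, before the table is written out.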

Python extract and append data into data frame

I've scraped the website for my research, but I couldn't find the right way to get the results into a data frame. I believe my problem is related to the list objects between lines 36 and 38.
The print line works nicely, so I can see the final version of the data in the Python console.
The solution may be really easy, but I couldn't figure it out. Thanks in advance for any help.
from time import sleep
from bs4 import BeautifulSoup, SoupStrainer
import requests
import pandas as pd
# Insert the highest page number for the website
highest_number = 12
def total_page_number(url):
    all_webpage_links = []
    all_webpage_links.insert(0, url)
    pages = [str(each_number) for each_number in range(2, highest_number)]
    for page in pages:
        link = ''.join(url + '&page=' + page)
        all_webpage_links.append(link)
    return all_webpage_links
# Use total_page_number function to create page list for website
All_page = total_page_number(
    'https://www.imdb.com/search/title?countries=tr&languages=tr&locations=Turkey&count=250&view=simple')
def clean_text(text):
    """ Removes white-spaces before, after, and between characters
    :param text: the string to clean
    :return: a "cleaned" string with no more than one white space between
    characters
    """
    return ' '.join(text.split())
# Create list objects for data
# Problem occurs in this line !!!!!!
actor_names = []
titles = []
dates = []
def get_cast_from_link(movie_link):
    """ Go to the IMDb Movie page in link, and find the cast overview list.
    Prints tab-separated movie_title, actor_name, and character_played to
    stdout as a result. Nothing returned
    :param movie_link: string of the link to IMDb movie page (http://imdb.com
    ...)
    :return: void
    """
    movie_page = requests.get(movie_link)
    # Use SoupStrainer to strain the cast_list table from the movie_page
    # This can save some time in bigger scraping projects
    cast_strainer = SoupStrainer('table', class_='cast_list')
    movie_soup = BeautifulSoup(movie_page.content, 'html.parser', parse_only=cast_strainer)
    # Iterate through rows and extract the name and character
    # Remember that some rows might not be a row of interest (e.g., a blank
    # row for spacing the layout). Therefore, we need to use a try-except
    # block to make sure we capture only the rows we want, without python
    # complaining.
    for row in movie_soup.find_all('tr'):
        try:
            actor = clean_text(row.find(itemprop='name').text)
            actor_names.append(actor)
            titles.append(movie_title)
            dates.append(movie_date)
            print('\t'.join([movie_title, actor, movie_date]))
        except AttributeError:
            pass
# Export data frame
# Problem occurs in this line !!!!!!
tsd_df = pd.DataFrame({'Actor_Names': actor_names,
                       'Movie_Title': titles,
                       'Movie_Date': dates})
tsd_df.to_csv('/Users/ea/Desktop/movie_df.tsv', encoding='utf-8')
for each in All_page:
    # Use requests.get('url') to load the page you want
    web_page = requests.get(each)
    # https://www.imdb.com/search/title?countries=tr&languages=tr&count=250&view=simple&page=2
    # Prepare the SoupStrainer to strain just the tbody containing the list of movies
    list_strainer = SoupStrainer('div', class_='lister-list')
    # Parse the html content of the web page with BeautifulSoup
    soup = BeautifulSoup(web_page.content, 'html.parser', parse_only=list_strainer)
    # Generate a list of the "Rank & Title" column of each row and iterate
    movie_list = soup.find_all('span', class_='lister-item-header')
    for movie in movie_list:
        movie_title = movie.a.text
        movie_date = movie.find('span', class_='lister-item-year text-muted unbold').text
        # get the link to the movie's own IMDb page, and jump over
        link = 'http://imdb.com' + movie.a.get('href')
        get_cast_from_link(link)
        # remember to be nice, and sleep a while between requests!
        sleep(15)
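One possible explanation, assuming the export block runs at module level before the loop at the bottom has called get_cast_from_link, is that the three lists are still empty when the DataFrame is built. The sketch below shows the alternative ordering: collect every row first, then create the DataFrame once. The helper collect_rows and the sample records are hypothetical stand-ins for the scraping loop, not part of the original script.

```python
import pandas as pd


def collect_rows(movies):
    """Build one record per (movie, actor) pair.

    `movies` is a hypothetical stand-in for the scraped pages.
    """
    rows = []
    for title, date, actors in movies:
        for actor in actors:
            rows.append({"Actor_Names": actor,
                         "Movie_Title": title,
                         "Movie_Date": date})
    return rows


# Invented sample data; real values would come from the scraping loop.
sample_movies = [
    ("Film A", "(2001)", ["Actor One", "Actor Two"]),
    ("Film B", "(2010)", ["Actor Three"]),
]

# Build the DataFrame only after every row has been collected.
rows = collect_rows(sample_movies)
tsd_df = pd.DataFrame(rows, columns=["Actor_Names", "Movie_Title", "Movie_Date"])
print(tsd_df)

# Writing with sep="\t" would match the .tsv extension, for example:
# tsd_df.to_csv("movie_df.tsv", sep="\t", index=False, encoding="utf-8")
```

In the original script this would amount to moving the pd.DataFrame and to_csv lines below the for each in All_page loop.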

Web table scraping: how do I find the column number of a cell in excel using python

I have an Excel file with many Chinese names in the first row.
I am scraping some more Chinese names from a web table, where the name is always in the 2nd column of each row (tr). I want to check whether each scraped name is already in my Excel file, so I use a boolean, have, to keep track; it should be True if the name is found. I also want to know the exact position (column number) of the found name, so I use name_position to keep track of that.
from lxml import html
from bs4 import BeautifulSoup
import requests
import openpyxl
from openpyxl.workbook import Workbook
wb=openpyxl.load_workbook('hehe.xlsx')
ws1=wb.get_sheet_by_name('Taocan')
page = requests.get(url)
tree = html.fromstring(page.text)
web = page.text
soup = BeautifulSoup(web, 'lxml')
table = soup.find('table', {'class': "tc_table"})
trs = table.find_all('tr')
for tr in trs:
    ls = []
    for td in tr.find_all('td'):
        ls.append(td.text)
    ls = [x.encode('utf-8') for x in ls]
    try:
        name = ls[1]
        have = False
        name_position = 1
        for cell in ws1[1]:
            if name == cell:
                have = True
                break
            else:
                name_position += 1
    except IndexError:
        print("there is an index error")
However, my code doesn't seem to work, and I think the problem comes from the comparison of the names:
if name == cell
I changed it to:
if name == cell.value
but it still doesn't work.
Can anyone help me with this? Thanks.
Just to add: the web page I'm scraping is also in Chinese. So when I
print(ls)
it gives a list like this
['1', '\xe4\xb8\x80\xe8\x88\xac\xe6\xa3\x80\xe6\x9f\xa5', '\xe8\xba\xab\xe9\xab\x98\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe6\x8c\x87\xe6\x95\xb0\xe3\x80\x81\xe8\x85\xb0\xe5\x9b\xb4\xe3\x80\x81\xe8\x88\x92\xe5\xbc\xa0\xe5\x8e\x8b\xe3\x80\x81\xe6\x94\xb6\xe7\xbc\xa9\xe5\x8e\x8b\xe3\x80\x81\xe8\xa1\x80\xe5\x8e\x8b\xe6\x8c\x87\xe6\x95\xb0', '\xe9\x80\x9a\xe8\xbf\x87\xe4\xbb\xaa\xe5\x99\xa8\xe6\xb5\x8b\xe9\x87\x8f\xe4\xba\xba\xe4\xbd\x93\xe8\xba\xab\xe9\xab\x98\xe3\x80\x81\xe4\xbd\x93\xe9\x87\x8d\xe3\x80\x81\xe4\xbd\x93\xe8\x84\x82\xe8\x82\xaa\xe7\x8e\x87\xe5\x8f\x8a\xe8\xa1\x80\xe5\x8e\x8b\xef\xbc\x8c\xe7\xa7\x91\xe5\xad\xa6\xe5\x88\xa4\xe6\x96\xad\xe4\xbd\x93\xe9\x87\x8d\xe6\x98\xaf\xe5\x90\xa6\xe6\xa0\x87\xe5\x87\x86\xe3\x80\x81\xe8\xa1\x80\xe5\x8e\x8b\xe6\x98\xaf\xe5\x90\xa6\xe6\xad\xa3\xe5\xb8\xb8\xe3\x80\x81\xe4\xbd\x93\xe8\x84\x82\xe8\x82\xaa\xe6\x98\xaf\xe5\x90\xa6\xe8\xb6\x85\xe6\xa0\x87\xe3\x80\x82']
but if I
print(ls[1])
it gives a Chinese name like "广州"
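One likely cause, assuming the worksheet cells hold ordinary Unicode text, is that ls contains UTF-8 encoded byte strings (the result of x.encode('utf-8')) while cell.value is text, so the two never compare equal. The sketch below is written for Python 3 and uses a plain list in place of the worksheet row; find_column and the sample header values are hypothetical.

```python
def find_column(name, header_values):
    """Return the 1-based column of `name` in `header_values`, or None."""
    if isinstance(name, bytes):
        name = name.decode("utf-8")  # compare text with text
    for position, value in enumerate(header_values, start=1):
        if value == name:
            return position
    return None


# Hypothetical first-row values, standing in for [c.value for c in ws1[1]].
header_values = ["北京", "广州", "上海"]

scraped = "广州".encode("utf-8")  # what x.encode('utf-8') produces
print(scraped == "广州")  # encoded bytes do not equal the text
print(find_column(scraped, header_values))
```

Dropping the encode step, or decoding before the comparison as above, keeps both sides of the == as text.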

Using Requests and lxml, get href values for rows in a table

Python 3
I am having a hard time iterating through the rows of a table.
How do I iterate the tr[1] component through the number of rows in the table body for teamName, teamState, teamLink xpaths?
import lxml.html
from lxml.etree import XPath
url = "http://www.maxpreps.com/rankings/basketball-winter-15-16/7/national.htm"
rows_xpath = XPath('//*[@id="rankings"]/tbody')
teamName_xpath = XPath('//*[@id="rankings"]/tbody/tr[1]/th/a/text()')
teamState_xpath = XPath('//*[@id="rankings"]/tbody/tr[1]/td[2]/text()')
teamLink_xpath = XPath('//*[@id="rankings"]/tbody/tr[1]/th/a/@href')
html = lxml.html.parse(url)
for row in rows_xpath(html):
    teamName = teamName_xpath(row)
    teamState = teamState_xpath(row)
    teamLink = teamLink_xpath(row)
    print (teamName, teamLink)
I have also attempted this through the following:
from lxml import html
import requests
siteItem = ['http://www.maxpreps.com/rankings/basketball-winter-15-16/7/national.htm'
            ]
def linkScrape():
    page = requests.get(target)
    tree = html.fromstring(page.content)
    #Get team link
    for link in tree.xpath('//*[@id="rankings"]/tbody/tr[1]/th/a/@href'):
        print (link)
    #Get team name
    for name in tree.xpath('//*[@id="rankings"]/tbody/tr[1]/th/a/text()'):
        print (name)
    #Get team state
    for state in tree.xpath('//*[@id="rankings"]/tbody/tr[1]/td[2]/text()'):
        print (state)
for target in siteItem:
    linkScrape()
Thank you for looking :D

If I understand what you're asking, you want to iterate over the rows in the ranking table. So, start with a loop over those rows:
import lxml.html
doc = lxml.html.parse('http://www.maxpreps.com/rankings/basketball-winter-15-16/7/national.htm')
for row in doc.xpath('//table[@id="rankings"]/tbody/tr'):
This will iterate over each row in that document. Now, for each row, extract the data you want:
    team_link = row.xpath('th/a/@href')[0]
    team_name = row.xpath('th/a/text()')[0]
    team_state = row.xpath('td[contains(@class, "state")]/text()')[0]
    print(team_state, team_name, team_link)
Which on my system yields output along the lines of:
CA Manteca /high-schools/manteca-buffaloes-(manteca,ca)/basketball-winter-15-16/rankings.htm
MD Mount St. Joseph (Baltimore) /high-schools/mount-st-joseph-gaels-(baltimore,md)/basketball-winter-15-16/rankings.htm
TX Brandeis (San Antonio) /high-schools/brandeis-broncos-(san-antonio,tx)/basketball-winter-15-16/rankings.htm
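For anyone who wants to try the row-relative XPath idea offline, here is a minimal sketch against an inline fragment. The table markup, team names, and links are invented and only assume the shape used above (a table with id "rankings", a th holding the link, and a td whose class contains "state").

```python
import lxml.html

# Hypothetical fragment that mimics the rankings table; the team
# names and links are invented.
SAMPLE_HTML = """
<table id="rankings"><tbody>
<tr><th><a href="/teams/alpha.htm">Alpha High</a></th><td>1</td><td class="state">CA</td></tr>
<tr><th><a href="/teams/beta.htm">Beta Prep</a></th><td>2</td><td class="state">TX</td></tr>
</tbody></table>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

teams = []
for row in doc.xpath('//table[@id="rankings"]/tbody/tr'):
    # No leading slash: each path is evaluated relative to this row.
    name = row.xpath('th/a/text()')[0]
    link = row.xpath('th/a/@href')[0]
    state = row.xpath('td[contains(@class, "state")]/text()')[0]
    teams.append((state, name, link))

print(teams)
```

The key difference from the question's attempts is that the per-row expressions do not start with //, so they cannot escape the current row and always hit tr[1].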

Issue with scraping data using indexing from html structure

I am scraping data from the following HTML structure, which appears on 30-40 web pages like this one: https://www.o2.co.uk/shop/tariffs/sony/xperia-z-purple/
<td class="monthlyCost">£13<span>.50</span></td>
<td class="phoneCost">£479.99</td>
<td><span class="lowLight">24 Months</span></td>
<td>50</td>
<td>Unlimited</td>
<td class="dataAllowance">100MB</td>
<td class="extras">
I am indexing to scrape the data in the td tags that have no class, such as 50 and Unlimited, which correspond to the Minutes and Texts columns in the dataset. The code I am using is:
results = tariff_link_soup.findAll('td', {"class": None})
minutes = results[1]
texts = results[2]
print minutes,texts
All of these 30-40 links are listed on the https://www.o2.co.uk/shop/phones/ page. I find the links on that page, follow them, and eventually reach the tariff page; all of the final device pages follow the same structure.
Problem: I was hoping to get only the minutes and texts values, such as 50 & Unlimited or 200 & Unlimited, which are at the 2nd and 3rd index on every page. However, when I print the data I also get other values, e.g. 500MB or 100MB, which belong to td tags with the dataAllowance class. I am passing None as the class attribute but still cannot get the required data. I checked the HTML structure and it is consistent across pages.
Please help me solve this issue, as I cannot work out the reason for this anomaly.
Update: the entire Python code I am using:
urls = ['https://www.o2.co.uk/shop/phones/',
        'https://www.o2.co.uk/shop/phones/?payGo=true']
plans = ['Pay Monthly','Pay & Go']
for url,plan in zip(urls,plans):
    if plan == 'Pay Monthly':
        device_links = parse().direct_url(url,'span', {"class": "model"})
        for device_link in device_links:
            device_link.parent['href'] = urlparse.urljoin(url, device_link.parent['href'])
            device_link_page = urllib2.urlopen(device_link.parent['href'])
            device_link_soup = BeautifulSoup(device_link_page)
            dev_names = device_link_soup.find('h1')
            for devname in dev_names:
                tariff_link = device_link_soup.find('a',text = re.compile('View tariffs'))
                tariff_link['href'] = urlparse.urljoin(url, tariff_link['href'])
                tariff_link_page = urllib2.urlopen(tariff_link['href'])
                tariff_link_soup = BeautifulSoup(tariff_link_page)
                dev_price = tariff_link_soup.findAll('td', {"class": "phoneCost"})
                monthly_price = tariff_link_soup.findAll('td', {"class": "monthlyCost"})
                tariff_length = tariff_link_soup.findAll('span', {"class": "lowLight"})
                data_plan = tariff_link_soup.findAll('td', {"class": "dataAllowance"})
                results = tariff_link_soup.xpath('//td[not(@class)]')
                print results[1].text
                print results[2].text

I finally used following code to solve my problem:
for row in tariff_link_soup('table', {'id' : 'tariffTable'})[0].tbody('tr'):
    tds = row('td')
    #print tds[0].text,tds[1].text,tds[2].text,tds[3].text,tds[4].text,tds[5].text
    monthly_prices = unicode(tds[0].text).encode('utf8').replace("£","").replace("FREE","0").replace("Free","0").strip()
    dev_prices = unicode(tds[1].text).encode('utf8').replace("£","").replace("FREE","0").replace("Free","0").strip()
    tariff_lengths = unicode(tds[2].text).encode('utf8').strip()
    minutes = unicode(tds[3].text).encode('utf8').strip()
    texts = unicode(tds[4].text).encode('utf8').strip()
    data = unicode(tds[5].text).encode('utf8').strip()
    device_names = unicode(dev_names).encode('utf8').strip()
I am accessing the required data row by row here, using the tabular structure in which data is present. I am taking all elements present in a row and assigning names to those which are required in my data.
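For reference, the same row-by-row idea can be sketched in Python 3 against an inline table. The markup below copies the cell order from the question; the second row, the helper clean_price, and the dictionary keys are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table following the column order shown in the question.
SAMPLE_HTML = """
<table id="tariffTable"><tbody>
<tr>
<td class="monthlyCost">£13<span>.50</span></td>
<td class="phoneCost">£479.99</td>
<td><span class="lowLight">24 Months</span></td>
<td>50</td>
<td>Unlimited</td>
<td class="dataAllowance">100MB</td>
</tr>
<tr>
<td class="monthlyCost">£38</td>
<td class="phoneCost">FREE</td>
<td><span class="lowLight">24 Months</span></td>
<td>Unlimited</td>
<td>Unlimited</td>
<td class="dataAllowance">1GB</td>
</tr>
</tbody></table>
"""


def clean_price(text):
    """Strip the pound sign and treat FREE/Free as zero."""
    return text.replace("£", "").replace("FREE", "0").replace("Free", "0").strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tariffs = []
for row in soup.find("table", id="tariffTable").tbody.find_all("tr"):
    # Cells are addressed by position within their own row.
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    tariffs.append({
        "monthly_price": clean_price(cells[0]),
        "device_price": clean_price(cells[1]),
        "tariff_length": cells[2],
        "minutes": cells[3],
        "texts": cells[4],
        "data": cells[5],
    })

print(tariffs)
```

Indexing the cells within each row keeps the minutes and texts columns aligned with their own tariff, which a page-wide search for class-less td tags does not guarantee.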
