Web Scrape, Python and BeautifulSoup, how to store output - python

I am trying to use a web-scraping approach to get some temperature and precipitation data from www.wunderground.com (they have an API, but I must use web scraping in my project).
My problem is that I can't figure out how to store my data after the scrape.
Here is my code:
import urllib2
from bs4 import BeautifulSoup

url = "http://www.wunderground.com/history/airport/KBUF/2014/5/25/DailyHistory.html"
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Mean Temperature values
mean_temp_row = soup.findAll('table')[0].findAll('tr')[2]
for tds in mean_temp_row.findAll('td'):
    print tds.text
The output I'm getting is:
Mean Temperature
15 °C
16 °C
I would like to know how I can get something like: station = {"Temp_Mean":[15 , 16]}

Is this output format always the same?
If it is, then the info name is in the first td of the row. After that there is an empty td, then the min, then two more empty tds, and the max at the end.
So you could do something like:
def celcius2float(celcius):
    return float(celcius.split(u'°')[0].strip())

cells = mean_temp_row.findAll('td')
name = cells[0].text
min_temp = celcius2float(cells[2].text)
max_temp = celcius2float(cells[5].text)

# Then you can do whatever you want with this stuff:
station = {name: [min_temp, max_temp]}
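Putting that together with the soup from the question gives the dict format the asker wanted. A minimal sketch, assuming the cell indices from the answer above hold (the asker's follow-up below ends up using index 1 instead, so adjust to the actual page layout):

cells = soup.findAll('table')[0].findAll('tr')[2].findAll('td')
# indices follow the answer above; splitting on whitespace
# ("15 °C".split()[0] -> "15") also avoids mixing byte and unicode
# strings under Python 2
station = {"Temp_Mean": [float(cells[2].text.split()[0]),
                         float(cells[5].text.split()[0])]}
print station  # e.g. {'Temp_Mean': [15.0, 16.0]}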

After considering the answer from TurpIF, here is my code:
def collect_data(url):
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    mean_temp_cells = soup.findAll('table')[0].findAll('tr')[2].findAll('td')
    temp = mean_temp_cells[1].text.split()[0].encode('utf8')
    rows = soup.findAll('table')[0].findAll('tr')
    for num, row in enumerate(rows):
        if "Precipitation" in row.text:
            preci_line = num
    preci_cells = rows[preci_line].findAll('td')
    preci = preci_cells[1].text.split()[0].encode('utf8')
    return temp, preci
So:
url = "http://www.wunderground.com/history/airport/KBUF/2014/5/25/DailyHistory.html"
temp, preci = collect_data(url)
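To get back to the station dict from the original question, collect_data can be looped over several dates. A minimal sketch, assuming the URL pattern holds for every day (the three-day range here is hypothetical):

station = {"Temp_Mean": [], "Precipitation": []}
base = "http://www.wunderground.com/history/airport/KBUF/2014/5/%d/DailyHistory.html"
for day in range(25, 28):  # hypothetical three-day range
    temp, preci = collect_data(base % day)
    station["Temp_Mean"].append(temp)
    station["Precipitation"].append(preci)
print station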

Related

How can I plug this section of code into my BeautifulSoup script?

I am new to Python and Beautiful Soup. The project I am working on is a script that scrapes the pages linked from this page:
https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
Currently, the script has a filter which will only scrape the pages which have a "Last Out" date which is past a certain date.
I am trying to add an additional filter to the script, which does the following:
Scrape the "Profit from price change:" section on the page inside hyperlink (Example page: https://bitinfocharts.com/dogecoin/address/D8WhgsmFUkf4imvsrwYjdhXL45LPz3bS1S
Convert the profit into a float
Compare the profit to a variable called "goal" which has a float assigned to it.
If the profit is greater or equal to goal, then scrape the contents of the page. If the profit is NOT greater or equal to the goal, do not scrape the webpage, and continue the script.
Here is the snippet of code I am using to try and do this:
# Get the profit
sections = soup.find_all(class_='table-striped')
for section in sections:
    oldprofit = section.find_all('td')[11].text
    removetext = oldprofit.replace('USD', '')
    removetext = removetext.replace(' ', '')
    removetext = removetext.replace(',', '')
    profit = float(removetext)
# Compare profit to goal
goal = float(50000)
if profit >= goal
Basically, what I am trying to do is run an if statement on a value on the webpage, and if the statement is true, then scrape the webpage. If the if statement is false, then do not scrape the page and continue the code.
Here is the entire script that I am trying to plug this into:
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []

# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (the second argument defines the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)
    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")
        # Get the Doge Address for the filename
        item = soup.find('h1').text
        newitem = item.replace('Dogecoin', '')
        finalitem = newitem.replace('Address', '')
        # Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)
        # Compare profit to goal
        goal = float(50000)
        if profit >= goal
        if table:
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])
        fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
        fcsv.writerow(headers)
        fcsv.writerows(datarows)
I am familiar with if statements, however I am unsure how to plug this into the existing code and have it accomplish what I am trying to do. If anyone has any advice I would greatly appreciate it. Thank you.
From my understanding, it seems like all that you are asking is how to have the script continue if it fails that criterion, in which case you just need to do:
if profit < goal:
    continue
Note though that the for loop in your snippet only keeps the final value of profit; if there are other profit values you need to look at, they are not being evaluated.
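For concreteness, here is a sketch of where that check could sit inside the address loop. Variable names are taken from the question's script, and it assumes the profit cell is always td[11] as in the snippet:

goal = 50000.0
for url in address_links:
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    sections = soup.find_all(class_='table-striped')
    profit = 0.0  # default in case no profit cell is found on the page
    for section in sections:
        oldprofit = section.find_all('td')[11].text
        profit = float(oldprofit.replace('USD', '').replace(' ', '').replace(',', ''))
    if profit < goal:
        continue  # below goal: skip this address and move on to the next URL
    # ... the existing table-scraping and CSV-writing code from the question goes here ...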

Web scraping with bs4 python: How to display football matchups

I'm a beginner to Python and am trying to create a program that will scrape the football/soccer schedule from skysports.com and send it to my phone by SMS through Twilio. I've excluded the SMS code because I have that figured out, so here's the web scraping code I am getting stuck with so far:
import requests
from bs4 import BeautifulSoup
from collections import defaultdict

URL = "https://www.skysports.com/football-fixtures"
page = requests.get(URL)
results = BeautifulSoup(page.content, "html.parser")
d = defaultdict(list)
comp = results.find('h5', {"class": "fixres__header3"})
team1 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
date = results.find('span', {"class": "matches__date"})
team2 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side2"})
for ind in range(len(d)):
    d['comp'].append(comp[ind].text)
    d['team1'].append(team1[ind].text)
    d['date'].append(date[ind].text)
    d['team2'].append(team2[ind].text)
The code below should do the trick for you:
from bs4 import BeautifulSoup
import requests

a = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(a.text, features="html.parser")
teams = []
for date in soup.find_all(class_="fixres__header2"):  # searching within that date
    for i in soup.find_all(class_="swap-text--bp30")[1:]:  # skips the first one because that's a heading
        teams.append(i.text)
date = soup.find(class_="fixres__header2").text
print(date)
teams = [i.strip('\n') for i in teams]
for x in range(0, len(teams), 2):
    print(teams[x] + " vs " + teams[x+1])
Let me further explain what I have done:
All the football teams have this class name: swap-text--bp30
So we can use find_all to extract all the classes with that name.
Once we have our results we can put them into an array ("teams = []") and append to it in a for loop ("teams.append(i.text)"); ".text" strips out the HTML.
Then we can get rid of the "\n" in the array by stripping it, and print the strings two at a time.
Running this prints the date followed by each matchup as "Team 1 vs Team 2".
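As a side note, the pairwise printing can also be written with zip. A sketch equivalent to the index loop above, using the teams list already built:

# teams[::2] gives every first team, teams[1::2] every second team
for home, away in zip(teams[::2], teams[1::2]):
    print(home + " vs " + away)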
EDIT: To scrape the title of the leagues we will do pretty much the same:
league = []
for date in soup.find_all(class_="fixres__header2"):  # searching within that date
    for i in soup.find_all(class_="fixres__header3"):
        league.append(i.text)
Strip the array and create another one:
league = [i.strip('\n') for i in league]
final = []
Then add this final bit of code which is essentially just printing the league then the two teams over and over:
for x in range(0, len(teams), 5):
    final.append(teams[x] + " vs " + teams[x+1])
for i in league:
    print(i)
for i in final:
    print(i)

Extracting similar items from a website with beautiful soup

I'm trying to scrape a website's ratings. I want to get each individual rating and its date. However, I only get one result in my list, although there should be several.
Am I doing something wrong in the for loop?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

url = "https://www.kununu.com/de/heidelpay/kommentare"
while url != " ":
    print(url)
    time.sleep(15)
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    print(r.status_code)
    soup = BeautifulSoup(r.text, "html.parser")
    #print(soup.prettify())
    # Get overall score of the company
    score_avg = soup.find("span", class_="index__aggregationValue__32exy").text
    print(score_avg)
    # get individual scores and dates of the company
    rating_list = []
    for box in soup.find_all(".index__rating__3nC2L"):
        score_ind = box.select(".index__score__16yy9").text
        date = select(".index__date__eIOxr").text
        rating = [score_ind, date]
    rating_list.append(rating)
    print(rating_list)
3,3
[['5,0', 'Januar 2017']]
Many thanks in advance!
It looks like you aren't appending the rating to the rating_list until the last loop is done. Is the printed rating perchance the very last one?
Add the append to your loop, like so:
for box in soup.find_all(".index__rating__3nC2L"):
    score_ind = box.select(".index__score__16yy9").text
    date = select(".index__date__eIOxr").text
    rating = [score_ind, date]
    rating_list.append(rating)
Well, the problem is that you're only appending the last rating value in rating_list.append(rating) because it's outside of the for loop, so what you have to do is this:
for box in soup.find_all(".index__rating__3nC2L"):
    score_ind = box.select(".index__score__16yy9").text
    date = select(".index__date__eIOxr").text
    rating = [score_ind, date]
    rating_list.append(rating)
This way you append each rating value on each iteration of the for loop. Hope this helps.
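One caveat worth adding (an assumption on my part, since the posted snippet may have been garbled in transcription): find_all(".index__rating__3nC2L") searches for a tag literally named ".index__rating__3nC2L", and select() returns a list, so a runnable version of the loop would look more like this sketch (class names are taken from the question and are likely to change on the live site):

rating_list = []
for box in soup.select(".index__rating__3nC2L"):              # CSS class selector
    score_ind = box.select_one(".index__score__16yy9").text   # first match within the box
    date = box.select_one(".index__date__eIOxr").text
    rating_list.append([score_ind, date])
print(rating_list)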

Extracting Web Data Using Beautiful Soup (Python 2.7)

In the code sample below, 3 of the 5 elements I am attempting to scrape return values as expected. 2 (goals_scored and assists) return no values. I have verified that the data does exist on the web page and that I am using the correct attribute, but I am not sure why no results are returned. Is there something obvious I am overlooking?
import sys
from bs4 import BeautifulSoup as bs
import urllib2
import datetime as dt
import time
import pandas as pd

proxy_support = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy_support)

player_name = []
club = []
position = []
goals_scored = []
assists = []

for p in range(25):
    player_url = 'http://www.mlssoccer.com/stats/season?page={p}&franchise=select&year=2017&season_type=REG&group=goals'.format(p=p)
    page = opener.open(player_url).read()
    player_soup = bs(page, "lxml")
    print >>sys.stderr, '[{time}] Running page {n}...'.format(time=dt.datetime.now(), n=p)
    length = len(player_soup.find('tbody').findAll('tr'))
    for row in range(0, length):
        try:
            name = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Player'})[row].find('a').contents[0]
            player_name.append(name)
            team = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Club'})[row].contents[0]
            club.append(team)
            pos = player_soup.find('tbody').findAll('td', attrs={'data-title': 'POS'})[row].contents[0]
            position.append(pos)
            goals = player_soup.find('tbody').findAll('td', attrs={'data-title': 'G', 'class': 'responsive'})[row].contents[0]
            goals_scored.apppend(goals)  # [sic] this typo raises AttributeError, which the bare except below swallows
            a = player_soup.find('tbody').findAll('td', attrs={'data-title': 'A'})[row].contents[0]
            assists.append(a)
        except:
            pass

player_data = {'player_name': player_name,
               'club': club,
               'position': position,
               'goals_scored': goals_scored,
               'assists': assists,
               }
df = pd.DataFrame.from_dict(player_data, orient='index')
df
The only thing I can figure out is that there is a slight difference in the HTML for the variables not returning data. Do I need to include the class="responsive" in my code? If so, any examples of how that might look?
Position HTML: <td data-title="POS">F</td>
Goals HTML: <td data-title="G" class="responsive">11</td>
Any insight is appreciated
You can try it like this to get your desired data. I've only parsed the portion you needed; the rest you can do for the dataframe. FYI, there are two classes attached to alternating tr tags, odd and even. Don't forget to consider that as well.
from bs4 import BeautifulSoup
import requests

page_url = "https://www.mlssoccer.com/stats/season?page={0}&franchise=select&year=2017&season_type=REG&group=goals"
for url in [page_url.format(p) for p in range(5)]:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.select("table")[0]
    for items in table.select(".odd,.even"):
        player = items.select("td[data-title='Player']")[0].text
        club = items.select("td[data-title='Club']")[0].text
        position = items.select("td[data-title='POS']")[0].text
        goals = items.select("td[data-title='G']")[0].text
        assist = items.select("td[data-title='A']")[0].text
        print(player, club, position, goals, assist)
Partial result looks like:
Nemanja Nikolic CHI F 24 4
Diego Valeri POR M 21 11
Ola Kamara CLB F 18 3
As I've included both classes in my script, you will get all the data from that site.
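Since the question ultimately wants a pandas DataFrame, here is a sketch building one from the same selectors. The list-of-dicts layout and column names are my own choice, not from the original answer:

import pandas as pd
import requests
from bs4 import BeautifulSoup

records = []
page_url = "https://www.mlssoccer.com/stats/season?page={0}&franchise=select&year=2017&season_type=REG&group=goals"
for url in [page_url.format(p) for p in range(5)]:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for items in soup.select("table")[0].select(".odd,.even"):
        records.append({
            'player_name': items.select("td[data-title='Player']")[0].text,
            'club': items.select("td[data-title='Club']")[0].text,
            'position': items.select("td[data-title='POS']")[0].text,
            'goals_scored': items.select("td[data-title='G']")[0].text,
            'assists': items.select("td[data-title='A']")[0].text,
        })
df = pd.DataFrame(records)  # one row per player, columns named as above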

Using Requests and lxml, get href values for rows in a table

Python 3
I am having a hard time iterating through the rows of a table.
How do I iterate the tr[1] component over all the rows in the table body for the teamName, teamState, and teamLink XPaths?
import lxml.html
from lxml.etree import XPath

url = "http://www.maxpreps.com/rankings/basketball-winter-15-16/7/national.htm"
rows_xpath = XPath('//*[@id="rankings"]/tbody')
teamName_xpath = XPath('//*[@id="rankings"]/tbody/tr[1]/th/a/text()')
teamState_xpath = XPath('//*[@id="rankings"]/tbody/tr[1]/td[2]/text()')
teamLink_xpath = XPath('//*[@id="rankings"]/tbody/tr[1]/th/a/@href')

html = lxml.html.parse(url)
for row in rows_xpath(html):
    teamName = teamName_xpath(row)
    teamState = teamState_xpath(row)
    teamLink = teamLink_xpath(row)
    print(teamName, teamLink)
I have also attempted this through the following:
from lxml import html
import requests

siteItem = ['http://www.maxpreps.com/rankings/basketball-winter-15-16/7/national.htm']

def linkScrape():
    page = requests.get(target)
    tree = html.fromstring(page.content)
    # Get team link
    for link in tree.xpath('//*[@id="rankings"]/tbody/tr[1]/th/a/@href'):
        print(link)
    # Get team name
    for name in tree.xpath('//*[@id="rankings"]/tbody/tr[1]/th/a/text()'):
        print(name)
    # Get team state
    for state in tree.xpath('//*[@id="rankings"]/tbody/tr[1]/td[2]/text()'):
        print(state)

for target in siteItem:
    linkScrape()
Thank you for looking :D
If I understand what you're asking, you want to iterate over the rows in the ranking table. So, start with a loop over those rows:
import lxml.html

doc = lxml.html.parse('http://www.maxpreps.com/rankings/basketball-winter-15-16/7/national.htm')
for row in doc.xpath('//table[@id="rankings"]/tbody/tr'):
This will iterate over each row in that document. Now, for each row, extract the data you want:
    team_link = row.xpath('th/a/@href')[0]
    team_name = row.xpath('th/a/text()')[0]
    team_state = row.xpath('td[contains(@class, "state")]/text()')[0]
    print(team_state, team_name, team_link)
Which on my system yields output along the lines of:
CA Manteca /high-schools/manteca-buffaloes-(manteca,ca)/basketball-winter-15-16/rankings.htm
MD Mount St. Joseph (Baltimore) /high-schools/mount-st-joseph-gaels-(baltimore,md)/basketball-winter-15-16/rankings.htm
TX Brandeis (San Antonio) /high-schools/brandeis-broncos-(san-antonio,tx)/basketball-winter-15-16/rankings.htm
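The hrefs that come back are site-relative, as the output above shows. If you need absolute URLs (an extra step beyond the original answer), the standard library's urljoin can resolve them against the page URL:

from urllib.parse import urljoin  # Python 3, matching the question

base = 'http://www.maxpreps.com/rankings/basketball-winter-15-16/7/national.htm'
team_link = '/high-schools/manteca-buffaloes-(manteca,ca)/basketball-winter-15-16/rankings.htm'
print(urljoin(base, team_link))
# -> http://www.maxpreps.com/high-schools/manteca-buffaloes-(manteca,ca)/basketball-winter-15-16/rankings.htm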
