I am currently trying to scrape football match data from the following URL:
https://liveonsat.com/uk-england-all-football.php
I am able to scrape the match names, start times and channel names correctly. Unfortunately, I seem to be having an issue scraping the correct match date. With help from Stack Overflow I previously identified that the element containing the match date can be selected with
parent.find
The issue I am having, though, is that the first date scraped persists throughout all the matches, even if a particular game is not on that date. For instance, if I run the code today it shows the match date for all matches as Saturday 11th July, even though some of the scraped matches are on different dates.
I am unsure what the problem could be and would be extremely grateful if someone could assist me or point me in the right direction. I first thought the problem was with the HTML element selected to grab the match date, but when I changed it to earlier parent elements to test, no date was scraped at all. So the currently selected element appears to be correct; it is possibly just not used correctly by me.
To help I have left a comment beside the match date element which I am having the issue with.
import requests
import time
import csv
import sys
from bs4 import BeautifulSoup
import tkinter as tk
from tkinter import messagebox
from tkinter import *
from PIL import ImageTk, Image

def makesoup(url):
    page = requests.get(url)
    return BeautifulSoup(page.text, "lxml")

def matchscrape(g_data):
    for match in g_data:
        competitors = match.find('div', class_='fix').text
        match_date = match.parent.find('h2', class_='time_head').text  # this is used to scrape the match date as it is not contained within "div", {"class": "blockfix"}
        match_time = match.find('div', class_='fLeft_time_live').text.strip()
        print("Competitors ", competitors)
        print("Match date", match_date)
        print("Match time", match_time)
        # Match channels
        channel = match.find_all("td", {"class": "chan_col"})
        for i in channel:
            print(i.get_text().strip())

def matches():
    soup = makesoup(url="https://liveonsat.com/uk-england-all-football.php")
    matchscrape(g_data=soup.findAll("div", {"class": "blockfix"}))

root = tk.Tk()
root.resizable(False, False)
root.geometry("600x600")
root.wm_title("liveonsat scraper")
Label = tk.Label(root, text='liveonsat scraper', font=('Comic Sans MS', 18))
button = tk.Button(root, text="Scrape Matches", command=matches)
button3 = tk.Button(root, text="Quit Program", command=quit)
Label.pack()
button.pack()
button3.pack()
status_label = tk.Label(text="")
status_label.pack()
root.mainloop()
Below is the relevant example HTML code of the site I am scraping:
<div style="clear:right"> <div class=floatAndClearL><h2 class = sport_head >Football</h2></div> <!-- sport_head -->
<div class=floatAndClearL><h2 class = time_head>Saturday, 11th July</h2></div> <!-- time_head --> <div><span class = comp_head>English Championship - Week 43</span></div>
<div class = blockfix > <!-- block 1-->
<div class=fix> <!-- around fixture and notes 2-->
<div class=fix_text> <!-- around fixture text 3-->
<div class = imgCenter><span><img src="../img/team/england.gif"></span></div>
<div class = fLeft style="width:270px;text-align:center;background-color:#ffd379;color:#800000;font-size:10pt;font-family:Tahoma, Geneva, sans-serif">Derby County v Brentford</div>
<div class = imgCenter><img src="../img/team/england.gif"></div>
</div> <!-- around fixture text 3 ENDS-->
<div class=notes></div>
</div> <!-- around fixture and notes 2 ENDS-->
<div class = fLeft> <!-- around all of channel types 2--> <div> <!-- around channel type group 3-->
<div class=fLeft_icon_live_l> <!-- around icon 4-->
<img src="../img/icon/live3.png"/>
</div>
<div class=fLeft_time_live> <!-- around icon 4-->
ST: 12:30
</div> <!-- around icon 4 ENDS--> <div class = fLeft_live> <!-- around all tables of a channel type 4--> <table border="0" cellspacing="0" cellpadding="0"><tr><td class=chan_col> <a href="https://connect.bein.net/" target="_blank" class = chan_live_iptvcable> beIN Connect MENA 📺</a></td><td width = 0></td>
</tr></table> <table border="0" cellspacing="0" cellpadding="0"><tr><td class=chan_col> <a href="https://tr.beinsports.com/kullanici/giris?ReturnUrl=" target="_blank" class = chan_live_iptvcable> beIN Connect TURKEY 📺</a></td><td width = 0></td>
Instead of .parent.find() use .find_previous(), because the parent is common (and thus the same) for all <div class="blockfix"> elements:
import requests
from bs4 import BeautifulSoup

url = 'https://liveonsat.com/uk-england-all-football.php'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for match in soup.select('div.blockfix'):
    competitors = match.find('div', class_='fix').text.strip()
    match_date = match.find_previous('h2', class_='time_head').text.strip()  # <-- use .find_previous()
    match_time = match.find('div', class_='fLeft_time_live').text.strip()
    channels = match.select('.chan_col')

    print("Competitors ", competitors)
    print("Match date", match_date)
    print("Match time", match_time)
    print('Channels:\n\t' + '\n\t'.join(c.get_text(strip=True) for c in channels))
    print('-' * 80)
Prints:
Competitors Derby County v Brentford
Match date Saturday, 11th July
Match time ST: 13:30
Channels:
beIN Connect MENA 📺
beIN Connect TURKEY 📺
beIN Sports MENA 5 HD
beIN Sports Turkey 4 HD
Eleven Sports 1 Portugal HD
Nova Sport (serbia) HD
Nova Sports 1 HD (Cyprus)
Nova Sports 1 HD (Hellas)
Sky Sports Football UK / HD
Sport 4 Israel / HD
Sportdigital TV HD
SportsMax 2 HD
Stöd 2 Sport 2 / HD
SuperSport 9 RSA
Telekanal Futbol
TV3 Sport HD Sweden
V Sport 1 HD (norge)
V Sport Extra HD (sweden)
ViaPlay (denmark) / HD
ViaPlay (finland) / HD
ViaPlay (norway) / HD
ViaPlay (sweden) / HD
--------------------------------------------------------------------------------
Competitors Watford v Newcastle United
Match date Saturday, 11th July
Match time ST: 13:30
Channels:
Amazon Prime UK Only [$]
beIN Connect MENA 📺
beIN Sports MENA 12 HD
beIN Sports MENA 2 HD
Belarus 5 TV
Canal+ Now HD (poland)
Cosmote Sport 7 HD
Cytavision Sports 1 HD
DAZN Canada [$] (geo/R)
DAZN España [$] (geo/R)
Diema Sport 2 HD
ESPN Brasil HD
EuroSport 1 Romania / HD
Premier Sports 1 HD (ROI only)
QazSport / HD
RMC Sport 2 HD
Setanta Qazaqstan HD
Setanta Sports Ukraine+ HD
Sky Sport 1 / HD Germany
Sky Sport Arena Italia / HD
Sky Sport Austria 1 HD
Sky Sport Football Italia / HD
Sport 2 Israel / HD
Sport TV2 (portugal) / HD
SportKlub 2 (serbia) HD
Spíler 1 TV / HD
SuperSport 4 RSA / HD
TRT Spor / HD 📺
TSN Malta 2 HD
TV2 Sport Premium 2 HD
TV2sumo.no [$] (geo/R)
TV3 MAX (denmark) / HD
V Sport Premium HD
V Sport Urheilu / HD
ViaPlay (denmark) / HD
ViaPlay (finland) / HD
ViaPlay (sweden) / HD
VOOsport World 1 / HD
--------------------------------------------------------------------------------
... and so on.
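For reference, here is why .parent could not work: every <div class="blockfix"> under the same section shares one parent, while .find_previous() walks backwards from each match to the nearest h2 that precedes it. A minimal standalone sketch with made-up dates and match names:

from bs4 import BeautifulSoup

html = '''
<div>
  <h2 class="time_head">Saturday, 11th July</h2>
  <div class="blockfix">Match A</div>
  <h2 class="time_head">Sunday, 12th July</h2>
  <div class="blockfix">Match B</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for match in soup.select('div.blockfix'):
    # .parent is the same outer <div> for both matches, but
    # .find_previous() returns the nearest h2 BEFORE each match
    print(match.text, '->', match.find_previous('h2', class_='time_head').text)

This prints "Match A -> Saturday, 11th July" and "Match B -> Sunday, 12th July".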
Related
I don't know how to scrape this text
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB
Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
<div class="npi_name">
<h2>
<a href="/solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html">
<span style="color:red">Stoc limitat!</span>
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
</a>
</h2>
</div>
What I've tried:
for n in j.find_all("div", "npi_name"):
    n2 = n.find("a", href=True, text=True)
    try:
        n1 = n2['href']
    except:
        n2 = n.find("a")
        n1 = n2['href']
    n3 = n2.string
    print(n3)
Output:
None
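The None comes from .string: a tag's .string is only defined when the tag has exactly one child node, and here the <a> holds a <span> plus a text node, so both the text=True filter and n2.string come up empty. A minimal sketch demonstrating this:

from bs4 import BeautifulSoup

# Toy markup mirroring the structure above
soup = BeautifulSoup('<a href="#"><span>Stoc limitat!</span> iPhone 13</a>', 'html.parser')
print(soup.a.string)        # None -- the <a> has two children, so .string is undefined
print(soup.a.contents[-1])  # ' iPhone 13' -- the trailing text node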
Try:
from bs4 import BeautifulSoup
html_doc = """
<div class="npi_name">
<h2>
<a href="/solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html">
<span style="color:red">Stoc limitat!</span>
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
</a>
</h2>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
t = "".join(soup.select_one(".npi_name a").find_all(text=True, recursive=False))
print(t.strip())
Prints:
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
I've made a few assumptions but something like this should work:
for n in j.find_all("div", {"class": "npi_name"}):
    print(n.find("a").contents[2].strip())
This is how I arrived at my answer (the HTML you provided was entered into a.html):
from bs4 import BeautifulSoup

def main():
    with open("a.html", "r") as file:
        html = file.read()
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", {"class": "npi_name"})
    for div in divs:
        a = div.find("a").contents[2].strip()
        # Testing
        print(a)

if __name__ == "__main__":
    main()
texts = []
for a in soup.select("div.npi_name a[href]"):
    texts.append(a.contents[-1].strip())
Or more explicitly:
texts = []
for a in soup.select("div.npi_name a[href]"):
    if a.span:
        text = a.span.next_sibling
    else:
        text = a.string
    texts.append(text.strip())
Select your elements more specifically, e.g. with CSS selectors, and use stripped_strings to get the text, assuming it is always the last node in your element:
for e in soup.select('div.npi_name a[href]'):
    text = list(e.stripped_strings)[-1]
    print(text)
This way you could also process other information if needed, e.g. the href, the span text, ...
Example
Select multiple items, store information in list of dicts and convert it into a dataframe:
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<div class="npi_name">
<h2>
<a href="/solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html">
<span style="color:red">Stoc limitat!</span>
Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
</a>
</h2>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

data = []
for e in soup.select('div.npi_name a[href]'):
    data.append({
        'url': e['href'],
        'stock': s.text if (s := e.span) else None,  # walrus operator requires Python 3.8+
        'label': list(e.stripped_strings)[-1]
    })

pd.DataFrame(data)
Output (a single-row DataFrame):

url:   /solutii-mobile-telefoane-mobile/apple-telefon-mobil-apple-iphone-13-super-retina-xdr-oled-6.1-256gb-flash-camera-duala-12-12-mp-wi-fi-5g-ios-negru-3824456.html
stock: Stoc limitat!
label: Telefon Mobil Apple iPhone 13, Super Retina XDR OLED 6.1", 256GB Flash, Camera Duala 12 + 12 MP, Wi-Fi, 5G, iOS (Negru)
I'm creating a web scraper to pull company names from a chamber of commerce website directory.
I'm using BeautifulSoup. The page and soup objects appear to be working, but when I scrape the HTML content, an empty list is returned when it should be filled with the directory names on the page.
Web page trying to scrape: https://www.austinchamber.com/directory
Here is the HTML:
<div>
  <ul class="item-list item-list--small">
    <li>
      <div class='item-content'>
        <div class='item-description'>
          <h5 class='h5'>Women Helping Women LLC</h5>
Here is the python code:
def pageRequest(url):
    page = requests.get(url)
    return page

def htmlSoup(page):
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

def getNames(soup):
    name = soup.find_all('h5', class_='h5')
    return name

page = pageRequest("https://www.austinchamber.com/directory")
soup = htmlSoup(page)
name = getNames(soup)
for n in name:
    print(n)
The data is loaded dynamically via Ajax. To get the data, you can use this script:
import json
import requests

url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'

for page in range(1, 10):
    print('Page {}..'.format(page))
    data = requests.get(url.format(page=page)).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data['data']:
        print(d['title'])
Prints:
...
Indeed
Austin Telco Federal Credit Union - Taos
Green Bank
Seton Medical Center Austin
Austin Telco Federal Credit Union - Jollyville
Page 42..
Texas State SBDC - San Marcos Office
PlainsCapital Bank - Motor Bank
University of Texas - Thompson Conference Center
Lamb's Tire & Automotive Centers - #2 Research & Braker
AT&T Labs
Prosperity Bank - Rollingwood
Kerbey Lane Cafe - Central
Lamb's Tire & Automotive Centers - #9 Bee Caves
Seton Medical Center Hays
PlainsCapital Bank - North Austin
Ellis & Salazar Body Shop
aLamb's Tire & Automotive Centers - #6 Lake Creek
Rudy's Country Store and BarBQ
...
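Side note: the hard-coded range(1, 10) is arbitrary. Assuming the endpoint keeps this response shape, you could instead page until the API returns no more rows; a minimal sketch (the empty-'data' stop condition is an assumption about the API, not documented behaviour):

import requests

url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'

titles = []
page = 1
while True:
    data = requests.get(url.format(page=page)).json()
    rows = data.get('data', [])
    if not rows:  # assumed: an exhausted listing returns an empty 'data' list
        break
    titles.extend(d['title'] for d in rows)
    page += 1

print('Found {} companies'.format(len(titles)))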
<div class="book-cover-image">
<img alt="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities" class="img-responsive" src="https://cdn.downtoearth.org.in/library/medium/2016-05-23/0.42611000_1463993925_book-cover.jpg" title="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities"/>
</div>
I need to extract this title value from all such div tags. What would be the best way to perform this operation? Please suggest.
I am trying to fetch the title of all the books mentioned on this page.
I have tried this so far:
import requests
from bs4 import BeautifulSoup as bs
url1 ="https://www.downtoearth.org.in/books"
page1 = requests.get(url1, verify=False)
#print(page1.content)
soup1= bs(page1.content, 'html.parser')
class_names = soup1.find_all('div',{'class':'book-cover-image'} )
for class_name in class_names:
    title_text = class_name.text
    print(class_name)
    print(title_text)
To get all title attributes for the book covers, you can use the CSS selector .book-cover-image img[title] (select all <img> tags with a title attribute that sit under a tag with class book-cover-image):
import requests
from bs4 import BeautifulSoup
url = 'https://www.downtoearth.org.in/books'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for i, img in enumerate(soup.select('.book-cover-image img[title]'), 1):
    print('{:>4}\t{}'.format(i, img['title']))
Prints:
1 State of India’s Environment 2019: In Figures (eBook)
2 Victim Africa (eBook)
3 Frames of change - Heartening tales that define new India
4 STATE OF INDIA’S ENVIRONMENT 2019
5 State of India’s Environment In Figures 2018 (eBook)
6 Getting to know about environment
7 CLIMATE CHANGE NOW - The Story of Carbon Colonisation
8 Climate change - For the young and curious
9 Conflicts of Interest: My Journey through India’s Green Movement
10 Body Burden: Lifestyle Diseases
11 STATE OF INDIA’S ENVIRONMENT 2018
12 DROUGHT BUT WHY? How India can fight the scourge by abandoning drought relief
13 SOE 2017 (Print version) and SOE 2017 in Figures (Digital version) combo offer
14 State of India's Environment 2017 In Figures (eBook)
15 Environment Reader for Universities
16 Not in My Backyard (Book & DVD combo offer)
17 The Crow, Honey Hunter and the Kitchen Garden
18 BIOSCOPE OF PIU & POM
19 SOE 2017 and Food book combo offer
20 FIRST FOOD: Culture of Taste
21 Annual State Of India’s Environment - SOE 2017
22 An 8-million-year-old mysterious date with monsoon (e-book)
23 Why I Should be Tolerant
24 NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities
You can do it with XPath like this.
import requests
from lxml import html
url1 ="https://www.downtoearth.org.in/books"
res = requests.get(url1, verify=False)
tree = html.fromstring(res.text)
d = tree.xpath("//div[@class='book-cover-image']//img/@title")
for title in d:
    print(title)
Output
State of India’s Environment 2019: In Figures (eBook)
Victim Africa (eBook)
Frames of change - Heartening tales that define new India
STATE OF INDIA’S ENVIRONMENT 2019
State of India’s Environment In Figures 2018 (eBook)
Getting to know about environment
CLIMATE CHANGE NOW - The Story of Carbon Colonisation
Climate change - For the young and curious
Conflicts of Interest: My Journey through India’s Green Movement
Body Burden: Lifestyle Diseases
STATE OF INDIA’S ENVIRONMENT 2018
DROUGHT BUT WHY? How India can fight the scourge by abandoning drought relief
SOE 2017 (Print version) and SOE 2017 in Figures (Digital version) combo offer
State of India's Environment 2017 In Figures (eBook)
Environment Reader for Universities
Not in My Backyard (Book & DVD combo offer)
The Crow, Honey Hunter and the Kitchen Garden
BIOSCOPE OF PIU & POM
SOE 2017 and Food book combo offer
FIRST FOOD: Culture of Taste
Annual State Of India’s Environment - SOE 2017
An 8-million-year-old mysterious date with monsoon (e-book)
Why I Should be Tolerant
NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities
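If some cover happens to lack a title attribute, img['title'] raises a KeyError, while Tag.get() returns None instead. A small defensive variant of the selector approach above:

for img in soup.select('.book-cover-image img'):
    title = img.get('title')  # None when the attribute is missing, instead of raising KeyError
    if title:
        print(title)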
I want my output to be like:
count:0 - Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim
Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.
Count:1 - Andre-Pierre Gignac set for Mexico move
Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.The France international is a free agent after leaving Marseille and is set to undergo a medical later today.West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.
My Program:
from bs4 import BeautifulSoup
import urllib2

response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)
count = 0
for tag in soup.find_all("div", {"id": "lc-commentary-posts"}):
    divTaginb = tag.find_all("div", {"class": "lc-title-container"})
    divTaginp = tag.find_all("div", {"class": "lc-post-body"})
    for tag1 in divTaginb:
        h4Tag = tag1.find_all("b")
        for tag2 in h4Tag:
            print "count:%d - " % count,
            print tag2.text
            print '\n'
            tagp = divTaginp[count].find_all('p')
            for p in tagp:
                print p
            print '\n'
            count += 1
My output:
Count:0 - ....
...
count:37 - ICYMI: Hamburg target Celtic star Stefan Johansen as part of summer rebuilding process
<p><strong>STEPHEN MCGOWAN:</strong> Bundesliga giants Hamburg have been linked with a move for Celtic's PFA Scotland player of the year Stefan Johansen.</p>
<p>German newspapers claim the Norwegian features on a three-man shortlist of potential signings for HSV as part of their summer rebuilding process.</p>
<p>Hamburg scouts are reported to have watched Johansen during Friday night's scoreless Euro 2016 qualifier draw with Azerbaijan.</p>
<p><a href="http://www.dailymail.co.uk/sport/football/article-3128854/Hamburg-target-Celtic-star-Stefan-Johansen-summer-rebuilding-process.html"><strong>CLICK HERE for more</strong></a></p>
count:38 - ICYMI: Sevilla agree deal with Chelsea to sign out-of-contract midfielder Gael Kakuta
<p>Sevilla have agreed a deal with Premier League champions Chelsea to sign out-of-contract winger Gael Kakuta.</p>
<p>The French winger, who spent last season on loan in the Primera Division with Rayo Vallecano, will arrive in Seville on Thursday to undergo a medical with the back-to-back Europa League winners.</p>
<p>A statement published on Sevilla's official website confirmed the 23-year-old's transfer would go through if 'everything goes well' in the Andalusian city.</p>
<p><strong><a href="http://www.dailymail.co.uk/sport/football/article-3128756/Sevilla-agree-deal-Chelsea-sign-Gael-Kakuta-contract-winger-aims-resurrect-career-Europa-League-winners.html">CLICK HERE for more</a></strong></p>
count:39 - Good morning everybody!
<p>And welcome to <em>Sportsmail's</em> coverage of all the potential movers and shakers ahead of the forthcoming summer transfer window.</p>
<p>Whatever deals will be rumoured, agreed or confirmed today you can read all about them here.</p>
DailyMail Website looks like this:
<div id="lc-commentary-posts"><div id="lc-id-39" class="lc-commentary-post cleared">
<div class="lc-icons">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_bournemouth.png" class="lc-icon">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_watford.png" class="lc-icon">
<div class="lc-post-time">18:03 </div>
</div>
<div class="lc-title-container">
<h4>
<b>Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim</b>
</h4>
</div>
<div class="lc-post-body">
<p><strong>SAMI MOKBEL: </strong>Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.</p>
<p class="mol-para-with-font">The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.</p>
<p class="mol-para-with-font"><font>Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.</font></p>
</div>
<img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/18/1434647000147_lc_galleryImage_TEL_AVIV_ISRAEL_JUNE_11_A.JPG">
<b class="lc-image-caption">Abdisalam Ibrahim could return to England</b>
<div class="lc-clear"></div>
<ul class="lc-social">
<li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '39', 'facebook')"></span></li>
<li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '39', 'twitter', window.twitterVia)"></span></li>
</ul>
</div>
<div id="lc-id-38" class="lc-commentary-post cleared">
<div class="lc-icons">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_west_brom.png" class="lc-icon">
<img src="http://i.mol.im/i/furniture/live_commentary/flags/60x60_mexico.png" class="lc-icon">
<div class="lc-post-time">16:54 </div>
</div>
<div class="lc-title-container">
<span><b>Andre-Pierre Gignac set for Mexico move</b></span>
</div>
<div class="lc-post-body">
<p>Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.</p>
<p id="ext-gen225">The France international is a free agent after leaving Marseille and is set to undergo a medical later today.</p>
<p>West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.</p>
</div>
<img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/16/1434642784396_lc_galleryImage__FILES_A_file_picture_tak.JPG">
<b class="lc-image-caption">Andre-Pierre Gignac is to complete a move to Mexican side Tigres</b>
<div class="lc-clear"></div>
<ul class="lc-social">
<li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '38', 'facebook')"></span></li>
<li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '38', 'twitter', window.twitterVia)"></span></li>
</ul>
</div>
Now my target is the <b></b> inside <div class="lc-title-container">, which I am getting easily. But when I target all the <p></p> inside <div class="lc-post-body">, I am not able to get only the required text.
I tried p.text and p.strip(), but I still could not solve my problem.
Error while using p.text
count:19 - City's pursuit of Sterling, Wilshere and Fabian Delph show a need for English quality
MIKE KEEGAN: Colonial explorer Cecil Rhodes is famously reported to have once said that to be an Englishman 'is to have won first prize in the lottery of life'. Back in the 19th century, the vicar's son was no doubt preaching about the expanding Empire and his own experiences in Africa.
Traceback (most recent call last):
  File "app.py", line 24, in <module>
    print p.text
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 160: character maps to <undefined>
And while I am using p.strip() I am not getting any output.
Is there any good way to do it? Help me find the best way; I have been trying this since morning and now it's night.
I don't want to use any encoder or decoder if possible.
from bs4 import UnicodeDammit

dammit = UnicodeDammit(html)
print(dammit.unicode_markup)
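Note that the traceback above is really a console-encoding problem: the Windows cp437 console cannot represent u'\u2013' (an en dash). If you would rather not touch the markup at all, re-encoding just for display also avoids the crash; a minimal Python 2 sketch mirroring the question's loop:

for p in tagp:
    # replace anything cp437 cannot show instead of crashing
    print p.text.encode('ascii', 'replace')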
Here's my code. You should go through it. I was too lazy to add specific fields for the dataset and instead just combined everything.
from bs4 import BeautifulSoup, element
import urllib2

response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)

count = 0
article_dataset = {}

# Try to make your variables express what you're trying to do.
# Collect article posts
article_post_tags = soup.find_all("div", {"id": "lc-commentary-posts"})

# Set up the article_dataset with the article name as its key
for article_post_tag in article_post_tags:
    container_tags = article_post_tag.find_all("div", {"class": "lc-title-container"})
    body_tags = article_post_tag.find_all("div", {"class": "lc-post-body"})

    # Find the article name, and initialize an empty dict as the value
    for count, container in enumerate(container_tags):
        # We know there is only 1 <b> tag in our container,
        # so use find() instead of find_all()
        article_name_tag = container.find('b')

        # Our primary key is the article name, the corresponding value is the body_tag.
        article_dataset[article_name_tag.text] = {'body_tag': body_tags[count]}

for article_name, details in article_dataset.items():
    content = []
    content_line_tags = details['body_tag'].find_all('p')

    # Go through each tag and collect the text
    for content_tag in content_line_tags:
        for data in content_tag.contents:  # gather strings in our tags
            if type(data) == element.NavigableString:
                data = unicode(data)
            else:
                data = data.text
            content += [data]

    # Combine the content
    content = '\n'.join(content)

    # Add the content to our dataset
    article_dataset[article_name]['content'] = content

# Remove the body_tag from our article dataset
for name, details in article_dataset.items():
    del details['body_tag']

    print
    print
    print 'Article Name: ' + name
    print 'Player: ' + details['content'].split('\n')[0]
    print 'Article Summary: ' + details['content']
    print
I want to scrape Restaurants from this URL
for rests in dining_soup.select("div.infos-restos"):
    for rest in rests.select("h3"):
        safe_print(" Rest Name: " + rest.text)
        print(rest.next_sibling.next_sibling.next_sibling.next_sibling.contents)
outputs
<div class="descriptif-resto">
<p>
<strong>Type of cuisine</strong>:International</p>
<p>
<strong>Opening hours</strong>:06:00-23:30</p>
<p>The Food Square bar and restaurant offers a varied menu in an elegant and welcoming setting. In fine weather you can also enjoy your meal next to the pool or relax on the garden terrace.</p>
</div>
and
print(rest.next_sibling.next_sibling.next_sibling.next_sibling.text)
always outputs an empty string.
So my question is how do I scrape Type of cuisine and opening hours from that Div?
Opening hours and cuisine are in "descriptif-resto" text:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.accorhotels.com/gb/hotel-5548-mercure-niederbronn-hotel/restaurant.shtml")
soup = BeautifulSoup(r.content)
print(soup.find("div",attrs={"class":"descriptif-resto"}).text)
Type of cuisine:Brasserie
Opening hours:12:00 - 14:00 / 19:00 - 22:00
The name is in the first h3 tag, the type and opening hours are in the two p tags:
name = soup.find("div", attrs={"class":"infos-restos"}).h3.text
det = soup.find("div",attrs={"class":"descriptif-resto"}).p
hours = det.find_next("p").text
tpe = det.text
print(name)
print(hours)
print(tpe)
LA STUB DU CASINO
Opening hours:12:00 - 14:00 / 19:00 - 22:00
Type of cuisine:Brasserie
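Since each <p> in the description follows a "Label:value" pattern, the text splits cleanly into fields. A minimal sketch, assuming that pattern holds for every restaurant description:

for p in soup.find("div", attrs={"class": "descriptif-resto"}).find_all("p"):
    text = p.get_text(strip=True)
    if ':' in text:
        label, value = text.split(':', 1)  # split on the first colon only; times contain more colons
        print(label, '->', value)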
OK, so some parts don't have both opening hours and cuisine, so you will have to fine-tune that, but this gets all the info:
from itertools import chain

all_dets = soup.find_all("div", attrs={"class": "infos-restos"})

# get all names from h3 tags, using chain so we can zip later
names = chain.from_iterable(x.find_all("h3") for x in all_dets)

# get all info to extract cuisine, hours
det = chain.from_iterable(x.find_all("div", attrs={"class": "descriptif-resto"}) for x in all_dets)

# zip appropriate details with each name
zipped = zip(names, det)

for name, det in zipped:
    details = det.p
    name, tpe = name.text, details
    hours = details.find_next("p") if "cuisine" in det.p.text else ""
    if hours:  # empty string means we have a bar
        print(name, tpe.text, hours.text)
    else:
        print(name, tpe.text)
    print("-----------------------------")
LA STUB DU CASINO
Type of cuisine:Brasserie
Opening hours:12:00 - 14:00 / 19:00 - 22:00
-----------------------------
RESTAURANT DU CASINO IVORY
Type of cuisine:French
Opening hours:19:00 - 22:00
-----------------------------
BAR DE L'HOTEL LE DOLLY
Opening hours:10:00-01:00
-----------------------------
BAR DES MACHINES A SOUS
Opening hours:10:30-03:00
-----------------------------