I am writing a script in Python using the modules 'requests' and 'BeautifulSoup' to scrape results from football matches found in the links from the following page:
https://www.premierleague.com/results?co=1&se=363&cl=-1
The task consists of two steps (taking the first match, Arsenal against Brighton, as an example):
1) Extract and navigate to the href "https://www.premierleague.com/match/59266" found in the element:
div data-template-rendered data-href.
2) Navigate to the "Stats" tab and extract the information found in the element:
tbody class="matchCentreStatsContainer".
I have already tried things like
page = requests.get("https://www.premierleague.com/match/59266")
soup = BeautifulSoup(page.text, "html.parser")
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
but I am not able to locate any of the elements in step 1) or 2) (empty list is returned).
Instead of this:
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
use this:
soup.findAll(class_="matchCentreStatsContainer")
Dropping the tag name lets the class filter match the element regardless of what tag it is, so it will work.
In this case the problem is simply that you are looking for the wrong thing. There is no <div class="matchCentreStatsContainer"> on that page; the element with that class is a <tbody>, so your query doesn't match. If you want the div, do:
divs = soup.find_all("div", class_="statsSection")
Otherwise search for the tbodys:
soup.find_all("tbody", class_="matchCentreStatsContainer")
Incidentally the Right Way (TM) to match classes is with class_, which takes either a list or a string (for a single class). This was added to bs4 a while back, but the old syntax is still floating around a lot.
Do note that your first URL, as originally posted, was invalid: it needed an http: or https: in front of it.
Update
Please note I would not parse this particular page like this. It likely already has everything you want as JSON. I would just do:
import json
data = json.loads(soup.find("div", class_="mcTabsContainer")["data-fixture"])
print(json.dumps(data, indent=2))
Note that data is just a dictionary: I'm only using json.dumps at the end to pretty print it.
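To illustrate working with that dictionary, here is a self-contained sketch; the keys below are made up for the example, so inspect the real data-fixture payload for the actual structure:

```python
import json

# Hypothetical sample mimicking the shape of a match JSON payload;
# the real "data-fixture" attribute will have different keys.
raw = '{"teams": [{"team": {"name": "Arsenal"}, "score": 2}, {"team": {"name": "Brighton"}, "score": 1}]}'
data = json.loads(raw)

# data is a plain dict, so normal indexing works.
for side in data["teams"]:
    print(side["team"]["name"], side["score"])
```

Once you have the dict, no further HTML parsing is needed for anything it already contains.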
Related
I am trying to retrieve all strings from a webpage using BeautifulSoup and return a list of all the retrieved strings.
I have 2 approaches in mind:
Find all elements whose text is not empty, append the text to a result list and return it. I am having a hard time implementing this, as I couldn't find any way to do it in BeautifulSoup.
Use BeautifulSoup's "find_all" method to find all the tags that I am looking for, such as "p" for paragraphs, "a" for links, etc. The problem I am facing with this approach is that, for some reason, find_all returns duplicated output. For example, if a website has a link with the text "Get Hired", I receive "Get Hired" more than once in the output.
I am honestly not sure how to proceed from here and I have been stuck for several hours trying to figure out how to get all strings from a webpage.
Would really appreciate your help.
Use .stripped_strings to get all the strings with whitespaces stripped off.
.stripped_strings - Read the Docs.
Here is the code that returns a list of strings present inside the <body> tag.
import requests
from bs4 import BeautifulSoup
url = 'YOUR URL GOES HERE...'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
b = soup.find('body')
list_of_strings = [s for s in b.stripped_strings]
list_of_strings will have a list of all the strings present in the URL.
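A minimal inline example (with made-up markup) showing the technique on a string instead of a live URL, and why a single pass over .stripped_strings avoids the duplicates you can get by combining several find_all() calls:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link's text would appear twice if you collected
# both soup.find_all('p') text and soup.find_all('a') text separately.
html = '<body><p>Hello <a href="#">Get Hired</a></p><p>  World  </p></body>'
soup = BeautifulSoup(html, 'html.parser')

# One pass over the tree: each string appears exactly once, whitespace stripped.
strings = list(soup.body.stripped_strings)
print(strings)  # ['Hello', 'Get Hired', 'World']
```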
Post the code that you've used.
If I remember correctly, something like this should get the complete page into one variable, page, with all the text of the page available as page.text:
import requests
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
print(page.text)
So I'm trying to compare two lists using Python. One contains about 1000 links I fetched from a website; the other contains a few words that might be contained in a link from the first list. If a word does occur in a link, I want to get some output. I printed the first list, and it actually works. For example, if the link is "https://steamcdn-a.swap.gg/apps/730/icons/econ/stickers/eslkatowice2015/counterlogic.f49adabd6052a558bff3fe09f5a09e0675737936.png" and my list contains the word "eslkatowice2015", I want to get output using the print() function. My code looks like this:
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')

Bot_Stickers = soup.find_all('img', class_='csi')

for sticker in Bot_Stickers:
    for i in StickerIDs:
        if i in sticker:
            print("found")

driver.close()
Now the problem is that I don't get any output, which should be impossible: if I compare the lists manually, there are clearly words from the second list contained in the links of the first. While trying to fix it I kept getting a NoneType error. driver.page_source comes from some Selenium code I used to open the site and click through some JavaScript, so that everything is present in the page source. I hope it's more or less clear what I'm trying to achieve.
Edit: the StickerIDs variable is the second list, containing the words I want to check for.
A NoneType error means that you are getting a None somewhere, so it's probably safer to check the results returned by find_all for None.
It's been a while since I used BeautifulSoup, but if I remember correctly, find_all returns a list of BeautifulSoup tags that match the search criteria, not URLs. You need to get the URL out of the tag's attribute (for an <img> that's src, not href) before checking whether it contains a keyword.
Something like this:
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')

Bot_Stickers = soup.find_all('img', class_='csi')

if Bot_Stickers and StickerIDs:
    for sticker in Bot_Stickers:
        for i in StickerIDs:
            if i in sticker.get("src", ""):  # get the src attribute of the img tag
                print("found")
else:
    print("Bot_Stickers:", Bot_Stickers)
    print("StickerIDs:", StickerIDs)

driver.close()
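The matching step itself can be isolated in plain Python, with no scraping involved; the URLs and keywords below are sample data standing in for the real lists:

```python
# Hypothetical sample data: a keyword list and some image URLs.
sticker_ids = ["eslkatowice2015", "dreamhack2014"]
urls = [
    "https://steamcdn-a.swap.gg/apps/730/icons/econ/stickers/eslkatowice2015/counterlogic.png",
    "https://steamcdn-a.swap.gg/apps/730/icons/econ/stickers/other/foo.png",
]

# Keep every URL that contains at least one of the keywords.
matches = [url for url in urls if any(word in url for word in sticker_ids)]
print(len(matches))  # 1
```

Testing the comparison on plain strings first makes it obvious that the original bug was checking `i in sticker` against a Tag object rather than against the URL string.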
I'm working with beautiful soup and am trying to grab the first tag on a page that has the attribute equal to a certain string.
For example:
What I've been trying to do is grab the href of the first <a> that is found whose title is "export".
If I use soup.select("a[title='export']") then I end up finding all tags who satisfy this requirement, not just the first.
If I use find("a", {"title":"export"}) with conditions being set such that the title should equal "export", then it grabs the actual items inside the tag, not the href.
If I write .get("href") after calling find(), I get None back.
I've been searching the documentation and stack overflow for an answer but have yet found one. Does anyone know a solution to this? Thank you!
What I've been trying to do is grab the href of the first <a> that is found whose title is "export".
You're almost there. All you need to do is, once you've obtained the tag, you'll need to just index it to get the href. Here's a slightly more bulletproof version:
try:
    url = soup.find('a', {'title' : 'export'})['href']
    print(url)
except TypeError:
    pass
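A self-contained demonstration on inline markup (the hrefs and link text here are invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: find() returns only the FIRST matching tag,
# and indexing it with ['href'] pulls out the attribute value.
html = '<a title="export" href="/files/a.csv">A</a><a title="export" href="/files/b.csv">B</a>'
soup = BeautifulSoup(html, 'html.parser')

url = soup.find('a', {'title': 'export'})['href']
print(url)  # /files/a.csv
```

The TypeError guard in the answer matters because find() returns None when nothing matches, and None['href'] raises TypeError.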
Following the same topic: in the HTML file I would like to find just the patent number and title of the citations. I tried this, but it prints all the titles in the HTML file, and I specifically want the ones under the citations only.
url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
patent = html_file.read()
soup = BeautifulSoup(patent, 'html.parser')

x = soup.select('tr[itemprop="backwardReferences"]')
y = soup.select('td[itemprop="title"]')
print(y)
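One way to restrict the titles to the citations is to select within each citation row instead of across the whole document. A sketch on a miniature of the table (the markup below is invented; the publicationNumber itemprop is an assumption, so check the real page for the exact attribute names):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the citations table.
html = '''
<tr itemprop="backwardReferences">
  <td itemprop="publicationNumber">EP0001</td><td itemprop="title">First cited patent</td>
</tr>
<tr itemprop="backwardReferences">
  <td itemprop="publicationNumber">EP0002</td><td itemprop="title">Second cited patent</td>
</tr>
'''
soup = BeautifulSoup(html, 'html.parser')

citations = []
# Scoping select_one to each row keeps unrelated titles out of the result.
for row in soup.select('tr[itemprop="backwardReferences"]'):
    number = row.select_one('td[itemprop="publicationNumber"]')
    title = row.select_one('td[itemprop="title"]')
    citations.append((number.get_text(strip=True), title.get_text(strip=True)))
print(citations)
```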
I want to scrape the pricing data from an e-commerce site called Flipkart. I tried using BeautifulSoup with CasperJS (a Node.js utility) and similar libraries, but none of them worked well enough.
Here's the URL and the structure.
https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct?
The problem is the layout... What are some ways to get around this?
P.S.: Is there any way I could apply machine learning to get the pricing data without knowing complex math? Where do I even start?
You should probably construct your XPath in a way that does not rely on the class, but rather on the content (node()) of the element you want to match. Alternatively, you could match the data-reactid, if that doesn't change.
For matching the div by data-reactid:
//div[@data-reactid=220]
Or for matching the div based on its location:
//span[child::img[@src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fa_8b4b59.png"]]/preceding-sibling::div
Assuming the image path doesn't change, you're on the safe side.
Since you can't rely on XPath due to the dynamically changing layout, you could try a regex to find the price in the script tag on the page.
Something like this:
import requests
import re

url = "https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct"
r = requests.get(url)

pattern = re.compile(r'prexoAvailable\":[\w]+,\"price\":(\d+)')
result = pattern.search(r.text)
print(result.group(1))
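The same pattern can be checked offline against an inline sample of the kind of script payload it targets (the snippet and price below are made up; the live markup may differ):

```python
import re

# Hypothetical excerpt of the inline JSON inside a <script> tag.
sample = 'window.__INITIAL_STATE__ = {"prexoAvailable":false,"price":10999,"mrp":12999}'

pattern = re.compile(r'prexoAvailable\":[\w]+,\"price\":(\d+)')
m = pattern.search(sample)
print(m.group(1))  # 10999
```

Regexes over inline JSON are brittle (key order or names can change), but they sidestep the dynamic class-name problem entirely.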
import requests
from bs4 import BeautifulSoup

# url and headers are assumed to be defined earlier
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    print(price.text)
E-commerce sites no longer allow scraping data like before: every piece of product data, such as price, specification, and reviews, is now enclosed in a separate "dynamic" class name.
So to scrape certain data from the webpage you need to target a specific class name, which is dynamic, and a plain requests.get() plus BeautifulSoup won't reliably work.
I'm Trying to parse the following page: http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/
The part I'm interested in is getting the table along with the scores and odds.
The code I have so far:
url = "http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/"
req = requests.get(url, timeout = 9)
soup = BeautifulSoup(req.text)
print soup.find("div", id = "tournamentTable"), soup.find("#tournamentTable")
>>> <div id="tournamentTable"></div> None
Very simple, but I'm weirdly stuck at finding the table in the tree. Although I found already-prepared datasets, I would like to know why the printed results are an empty tag and None.
Any ideas?
Thanks
First, this page uses JavaScript to fetch data. If you disable JS in your browser, you will notice that the div tag exists but has nothing in it, so the first find() prints a single empty tag.
Second, # is a CSS selector; you cannot use it in find(). Per the docs:
Any argument that's not recognized will be turned into a filter on one
of a tag's attributes.
So the second find() will look for a tag named #tournamentTable, match nothing, and return None.
It looks like the table gets populated with an Ajax call back to the server. That is why, when you print soup.find("div", id="tournamentTable"), you get only the empty tag. When you print soup.find("#tournamentTable"), you get None, because that is trying to find an element with the tag name "#tournamentTable". If you want to use CSS selectors, you should use soup.select(), like soup.select('#tournamentTable'), or soup.select('div#tournamentTable') if you want to be even more particular.
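The find()-vs-select() distinction can be shown on a tiny inline document (the markup is invented for the demonstration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the populated table.
html = '<div id="tournamentTable"><span>data</span></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('div', id='tournamentTable'))  # matches: keyword filter on the id attribute
print(soup.find('#tournamentTable'))           # None: the string is treated as a tag NAME
print(soup.select('#tournamentTable'))         # CSS selectors belong in select()
```

Note that none of this fixes the underlying problem on the live page: the table body arrives via Ajax, so requests alone never sees it; you'd need the Ajax endpoint itself or a browser driver such as Selenium.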