Beautiful Soup: find first <a> whose title attribute equals a certain string - python

I'm working with Beautiful Soup and am trying to grab the first <a> tag on a page whose title attribute equals a certain string.
What I've been trying to do is grab the href of the first <a> that is found whose title is "export".
If I use soup.select("a[title='export']") then I end up finding all <a> tags that satisfy this requirement, not just the first.
If I use find("a", {"title": "export"}) then it grabs the actual contents inside the tag, not the href.
If I write .get("href") after calling find(), I get None back.
I've been searching the documentation and Stack Overflow for an answer but have yet to find one. Does anyone know a solution to this? Thank you!

What I've been trying to do is grab the href of the first that is found whose title is "export".
You're almost there. Once you've obtained the tag, you just need to index it to get the href. Here's a slightly more bulletproof version:
try:
    url = soup.find('a', {'title': 'export'})['href']
    print(url)
except TypeError:
    pass
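If you prefer the CSS-selector syntax from the question, select_one returns only the first match (or None when nothing matches), so the same lookup can be sketched like this (the HTML snippet here is a made-up stand-in for the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the real document would come from
# requests or a file.
html = ('<a title="export" href="/files/export.csv">Export</a>'
        '<a title="export" href="/other">Other export link</a>')
soup = BeautifulSoup(html, 'html.parser')

tag = soup.select_one("a[title='export']")  # first match only, or None
url = tag['href'] if tag is not None else None
print(url)  # /files/export.csv
```

The explicit None check avoids the try/except around the indexing, which some people find easier to read.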

Following up on the same topic: from the HTML file I would like to extract just the patent number and the title of the citations. I tried this, but it prints all the titles in the HTML file, while I specifically want the ones under the citations only.
url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
patent = html_file.read()
#print(patent)
soup = BeautifulSoup(patent, 'html.parser')
x = soup.select('tr[itemprop="backwardReferences"]')
y = soup.select('td[itemprop="title"]')
print(y)
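One way to restrict the title lookup to the citations is to run the second select against each matched row rather than against the whole soup. A sketch, assuming the itemprop values from the snippet above (the miniature HTML is hypothetical, not the real patent page):

```python
from bs4 import BeautifulSoup

# Tiny hypothetical document with the same itemprop structure.
html = '''
<tr itemprop="backwardReferences"><td itemprop="title">Cited patent A</td></tr>
<tr itemprop="backwardReferences"><td itemprop="title">Cited patent B</td></tr>
<td itemprop="title">Unrelated title elsewhere on the page</td>
'''
soup = BeautifulSoup(html, 'html.parser')

titles = []
for row in soup.select('tr[itemprop="backwardReferences"]'):
    # Searching inside each row keeps titles outside the citations out.
    for cell in row.select('td[itemprop="title"]'):
        titles.append(cell.get_text(strip=True))
print(titles)  # ['Cited patent A', 'Cited patent B']
```

Calling select on a tag (here, row) scopes the search to that tag's subtree, which is exactly the "only under the citations" behavior being asked for.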

Related

Unable to locate elements using requests and BeautifulSoup

I am writing a script in Python using the modules 'requests' and 'BeautifulSoup' to scrape results from football matches found in the links from the following page:
https://www.premierleague.com/results?co=1&se=363&cl=-1
The task consists of two steps (taking the first match, Arsenal against Brighton, as an example):
Extract and navigate to the href "https://www.premierleague.com/match/59266" found in the element:
div data-template-rendered data-href.
Navigate to the "Stats" tab and extract the information found in the element:
tbody class = "matchCentreStatsContainer".
I have already tried things like
page = requests.get("https://www.premierleague.com/match/59266")
soup = BeautifulSoup(page.text, "html.parser")
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
but I am not able to locate any of the elements in step 1) or 2) (empty list is returned).
Instead of this:
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
Use this
soup.findAll({"class" : "matchCentreStatsContainer"})
It will work.
In this case the problem is simply that you are looking for the wrong thing. There is no <div class="matchCentreStatsContainer"> on that page, that's a <tbody> so it doesn't match. If you want the div, do:
divs = soup.find_all("div", class_="statsSection")
Otherwise search for the tbodys:
soup.find_all("tbody", class_="matchCentreStatsContainer")
Incidentally the Right Way (TM) to match classes is with class_, which takes either a list or a string (for a single class). This was added to bs4 a while back, but the old syntax is still floating around a lot.
Do note your first url as posted here is invalid: it needs an http: or https: before it.
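The class_ behavior described above is quick to demonstrate on a small snippet (the HTML below is a made-up fragment reusing the class names from this answer):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing the class names discussed above.
html = ('<div class="statsSection wide">A</div>'
        '<tbody class="matchCentreStatsContainer">B</tbody>')
soup = BeautifulSoup(html, 'html.parser')

# A string matches a single class name (even on multi-class tags)...
single = soup.find_all('div', class_='statsSection')
# ...while a list matches tags carrying any of the listed classes.
either = soup.find_all(class_=['statsSection', 'matchCentreStatsContainer'])
print(len(single), len(either))  # 1 2
```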
Update
Please note I would not parse this particular file like this. It likely already has everything you want as JSON. I would just do:
import json
data = json.loads(soup.find("div", class_="mcTabsContainer")["data-fixture"])
print(json.dumps(data, indent=2))
Note that data is just a dictionary: I'm only using json.dumps at the end to pretty print it.

Nonetype error/ No elements printed using beautifulsoup for python

So I'm trying to compare 2 lists using Python. One contains about 1000 links I fetched from a website; the other contains a few words that might be contained in a link from the first list. If that's the case, I want to get an output. I printed the first list, and it actually works. For example, if the link is "https://steamcdn-a.swap.gg/apps/730/icons/econ/stickers/eslkatowice2015/counterlogic.f49adabd6052a558bff3fe09f5a09e0675737936.png" and my list contains the word "eslkatowice2015", I want to get an output using the print() function. My code looks like this:
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
Bot_Stickers = soup.find_all('img', class_='csi')
for sticker in Bot_Stickers:
    for i in StickerIDs:
        if i in sticker:
            print("found")
driver.close()
Now the problem is that I don't get any output, which seems impossible because if I manually compare the lists, there are clearly links in the first list that contain words from the second. When trying to fix it I always got a NoneType error. The driver.page_source is defined above by some Selenium code I used to access the site and click some JavaScript elements, to be able to find everything. I hope it's more or less clear what I want to achieve.
Edit: the StickerIDs variable is the second list, containing the words I want to check for.
A NoneType error means that you are probably getting None somewhere, so it's safer to check the results returned by find_all before using them.
It's been a while since I used BeautifulSoup, but if I remember correctly, find_all returns a list of Beautiful Soup tags that match the search criteria, not URLs. You need to get the src attribute from the tag before checking if it contains a keyword (img tags carry the link in src, not href).
Something like this:
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
Bot_Stickers = soup.find_all('img', class_='csi')
if Bot_Stickers and StickerIDs:
    for sticker in Bot_Stickers:
        for i in StickerIDs:
            if i in sticker.get("src", ""):  # get the src attribute of the img tag
                print("found")
else:
    print("Bot_Stickers:", Bot_Stickers)
    print("StickerIDs:", StickerIDs)
driver.close()

why do python and BS4 return only one 'href' when called specifically, but all values when called as text?

Scraping a page and trying to get all the urls from the first column. When I call as text I get everything in the div, which I get. But, when I specifically target the URL, I only get the first one. How do I get all of them - separated for storage?
from bs4 import BeautifulSoup
from urllib import urlopen
base_url = "http://www.heavyliftpfi.com/news/"
html = urlopen(base_url)
soup = BeautifulSoup(html.read().decode('latin-1', 'ignore'),"lxml")
main_div = soup.select_one("div.fullWidth")
div_sub = main_div.select_one("div.leftcol")
print (div_sub).text # I get that this gets everything as .text
print (div_sub).h2.a['href'] # alternate - with only one 'href' return
Since you are navigating the parse tree via tag names, only the first matching tag is returned, even when several match. This is expected behavior. Try using find_all() to search for them instead.
from the BS4 docs:
"Using a tag name as an attribute will give you only the first tag by
that name."
"If you need to get all the tags, or anything more complicated
than the first tag with a certain name, you’ll need to use one of the
methods described in Searching the tree, such as find_all()"
see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
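The first-match behavior the docs describe is easy to see on a small snippet (the HTML below is a minimal stand-in, not the actual page):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page structure.
html = ('<div class="leftcol">'
        '<h2><a href="/one">First</a></h2>'
        '<h2><a href="/two">Second</a></h2>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')
main_div = soup.find('div', class_='leftcol')

# Tag-name navigation stops at the first <a> in the subtree...
print(main_div.a['href'])  # /one
# ...while find_all collects every one of them.
hrefs = [a['href'] for a in main_div.find_all('a', href=True)]
print(hrefs)  # ['/one', '/two']
```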
It was the findAll, but I needed to move up the tree:
for a in main_div.findAll('a', href=True):
    print a['href']

soup.find("div", id = "tournamentTable"), None returned - python 2.7 - BS 4.5.1

I'm Trying to parse the following page: http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/
The part I'm interested in is getting the table along with the scores and odds.
The code I have so far:
url = "http://www.oddsportal.com/soccer/france/ligue-1-2015-2016/results/"
req = requests.get(url, timeout = 9)
soup = BeautifulSoup(req.text)
print soup.find("div", id = "tournamentTable"), soup.find("#tournamentTable")
>>> <div id="tournamentTable"></div> None
Very simple, but I'm weirdly stuck at finding the table in the tree. Although I found already-prepared datasets, I would like to know why the printed results are a tag and None.
Any ideas?
Thanks
First, this page uses JavaScript to fetch data. If you disable JS in your browser, you will notice that the div tag exists but has nothing in it, so the first find() prints a single empty tag.
Second, # is CSS selector syntax; you can not use it in find(). The docs say:
Any argument that’s not recognized will be turned into a filter on one
of a tag’s attributes.
But the first positional argument of find() is the tag name, so the second find() looks for a tag literally named #tournamentTable; nothing matches, so it returns None.
It looks like the table gets populated with an Ajax call back to the server. That is why, when you print soup.find("div", id = "tournamentTable"), you get only the empty tag. When you print soup.find("#tournamentTable"), you get None because that is trying to find an element with the tag name "#tournamentTable". If you want to use CSS selectors, you should use soup.select(), like this: soup.select('#tournamentTable'), or soup.select('div#tournamentTable') if you want to be even more particular.
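Both behaviors can be reproduced on a minimal document (this miniature is a stand-in for the JS-stripped page, where the container div exists but is empty):

```python
from bs4 import BeautifulSoup

# What the server actually sends: an empty container that JS fills in later.
html = '<div id="tournamentTable"></div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('div', id='tournamentTable')   # matches the (empty) div
wrong = soup.find('#tournamentTable')          # searches for a tag *named* '#tournamentTable'
selected = soup.select('#tournamentTable')     # CSS id selector, matches the div
print(tag, wrong, selected)
```

To get the populated table you would need something that executes the JavaScript (e.g. Selenium) or a direct request to the Ajax endpoint; plain requests only sees the empty container.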

Python - Looping through HTML Tags and using IF

I am using Python to extract data from a webpage. The webpage has a recurring html div tag with class = "result", which contains other data (such as location, organisation etc...). I am able to successfully loop through the html using Beautiful Soup, but when I add a condition, such as checking whether a certain word ('NHS' for e.g.) exists in the segment, it doesn't return anything, though I know certain segments contain it. This is the code:
soup = BeautifulSoup(content)
details = soup.findAll('div', {'class': 'result'})
for detail in details:
    if 'NHS' in detail:
        print detail
Hope my question makes sense...
findAll returns a list of tags, not strings. Perhaps convert them to strings?
s = "<p>golly</p><p>NHS</p><p>foo</p>"
soup = BeautifulSoup(s)
details = soup.findAll('p')
type(details[0]) # prints: <class 'BeautifulSoup.Tag'>
You are looking for a string amongst tags. Better to look for a string amongst strings...
for detail in details:
    if 'NHS' in str(detail):
        print detail
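Note that str(detail) also matches text hidden in tag names and attribute values (a class named "NHS-result" would trigger a false positive), so if you only care about visible text, searching get_text() is a safer variant. A sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

# Made-up results markup; the real page's structure may differ.
html = ('<div class="result"><p>NHS Trust, London</p></div>'
        '<div class="result"><p>Some other organisation</p></div>')
soup = BeautifulSoup(html, 'html.parser')

# Keep only the result blocks whose visible text mentions 'NHS'.
matches = [d for d in soup.find_all('div', {'class': 'result'})
           if 'NHS' in d.get_text()]
print(len(matches))  # 1
```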
