I am scraping items from a webpage (there are multiple of these):
<a class="iusc" style="height:160px;width:233px" m="{"cid":"T0QMbGSZ","purl":"http://www.tti.library.tcu.edu.tw/DERMATOLOGY/mm/mmsa04.htm","murl":"http://www.tti.lcu.edu.tw/mm/img0035.jpg","turl":"https://tse2.mm.bing.net/th?id=OIP.T0QMbGSZbOpkyXU4ms5SFwEsDI&pid=15.1","md5":"4f440c6c64996cea64c975389ace5217"}" mad="{"turl":"https://tse3.mm.bing.net/th?id=OIP.T0QMbGSZbOpkyXU4ms5EsDI&w=300&h=200&pid=1.1","maw":"300","mah":"200","mid":"C303D7F4BB661CA67E2CED4DB11E9154A0DD330B"}" href="/images/search?view=detailV2&ccid=T0QMbGSZ&id=C303D7F4BB661E2CED4DB11E9154A0DD330B&thid=OIP.T0QMbGSZbOpkyXU4ms5SFwEsDI&q=searchtearm;amp;simid=6080204499593&selectedIndex=162" h="ID=images.5978_5,5125.1" data-focevt="1"><div class="img_cont hoff"><img class="mimg" style="color: rgb(169, 88, 34);" height="160" width="233" src="https://tse3.mm.bing.net/th?id=OIP.T0QMbGSZ4ms5SFwEsDI&w=233&h=160&c=7&qlt=90&o=4&dpr=2&pid=1.7" alt="Image result" data-bm="169" /></div></a>
What I want to do is download the image and information associated with it in the m attribute.
To accomplish that, I tried something like this to get the attributes:
links = soup.find_all("a", class_="iusc")
And then, to get the m attribute, I tried something like this:
for a in soup.find_all("m"):
    test = a.text.replace("&quot;", '"')
    metadata = json.loads(test)["murl"]
    print(str(metadata))
However, that doesn't quite work as expected, and nothing is printed out (with no errors either).
You are not iterating through the links list. Try this.
links = soup.find_all("a", class_="iusc")
for link in links:
    print(link.get('m'))
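Building on that, the m attribute is a JSON string (BeautifulSoup decodes the &quot; entities for you), so it can be fed straight to json.loads with no .replace() needed. A minimal sketch, with the attribute value shortened to a hypothetical example:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the markup in the question (values shortened)
html = '<a class="iusc" m=\'{"murl": "http://example.com/img0035.jpg"}\'>x</a>'
soup = BeautifulSoup(html, "html.parser")

urls = []
for link in soup.find_all("a", class_="iusc"):
    # link.get("m") is already valid JSON once the entities are decoded
    metadata = json.loads(link.get("m"))
    urls.append(metadata["murl"])

print(urls)  # ['http://example.com/img0035.jpg']
```

From there each URL in urls can be downloaded with requests or urllib.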
>>> soup_brand
<a data-role="BRAND" href="/URL/somename">
Some Name
</a>
>>> type(soup_brand)
<class 'bs4.BeautifulSoup'>
>>> print(soup_brand.get('href'))
None
Documentation followed: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Hi people from all over the world,
does someone know what's going wrong, or am I targeting the object wrong? I need to get the href.
Have you tried:
soup.find_all(name="a")
or
soup.select_one(selector="a")
It should also be possible to catch it with:
all_anchor_tags = soup.find_all(name="a")
for tag in all_anchor_tags:
    print(tag.get("href"))  # prints the href attribute of each <a> tag, i.e. each link
Although find_all looks for multiple elements (the reason why we have a loop here), I have found that bs4 is sometimes better at catching things if you give it a find-all search and then iterate over the elements.
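As a self-contained sketch of the select_one route, using the markup from the question:

```python
from bs4 import BeautifulSoup

# Markup taken from the question
html = '<a data-role="BRAND" href="/URL/somename">Some Name</a>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.select_one("a")  # returns a bs4.element.Tag, not the whole soup
print(tag.get("href"))  # /URL/somename
```

The key difference from the question's situation: get('href') is called on a Tag found inside the soup, not on the BeautifulSoup object itself.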
In order to apply ['href'], the object must be a bs4.element.Tag.
So, try this:
string = \
"""
<a data-role="BRAND" href="/URL/somename">
Some Name
</a>
"""
s = BeautifulSoup(string, "html.parser")
a_tag = s.find('a')
print(a_tag["href"])
out
/URL/somename
or if you have multiple a tags you can try this:
a_tags = s.findAll('a')
for a in a_tags:
print(a.get("href"))
out
/URL/somename
I need to get the content from the following tag with these attributes: <span class="h6 m-0">.
An example of the HTML I'll encounter would be <span class="h6 m-0">Hello world</span>, and it obviously needs to return Hello world.
My current code is as follows:
page = BeautifulSoup(text, 'html.parser')
names = [item["class"] for item in page.find_all('span')]
This works fine, and gets me all the spans in the page, but I don't know how to specify that I only want those with the specific class "h6 m-0" and grab the content inside. How will I go about doing this?
page = BeautifulSoup(text, 'html.parser')
names = page.find_all('span', class_='h6 m-0')
Without knowing your use case I don't know if this will work.
names = [item["class"] for item in page.find_all('span',class_="h6 m-0" )]
Can you please be more specific about what problem you are facing?
But this should work fine for you.
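To then grab the content inside each match, call get_text() on it. A small sketch with hypothetical sample markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample based on the question's example
html = '<span class="h6 m-0">Hello world</span><span class="h6">skip me</span>'
page = BeautifulSoup(html, "html.parser")

# class_="h6 m-0" matches the exact class string; select('span.h6.m-0') also works
names = [item.get_text() for item in page.find_all("span", class_="h6 m-0")]
print(names)  # ['Hello world']
```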
For a while I have been trying to make a Python program which can scrape data from websites. I came across the bs4 library for Python and decided to use it for that job.
The problem is that I always get None as a result, which is something I cannot understand.
I want to get only one word, which is in an href located in a div with a class, and for that I wrote a function like this:
def run(self):
    response = requests.get(self.url)
    soup = BeautifulSoup(response.text, 'html.parser')
    finalW = soup.find('a', attrs={'class': 'target'})
    print(finalW)
With this code, I expect to get a word, but it just returns None.
It is highly possible, too, that I made a mistake with the path to this element, so I post an image of the thing I want to extract from the HTML:
When bs4 is not able to find the query, it returns None.
In your case the html is more or less like this.
...
<div class='target'>
    <a>neededlink</a>
    <a>notneededlink</a>
    ...
</div>
...
soup.find('a', attrs={'class': 'target'}) will thus not be able to match your query, as there are no such attributes on the <a> tags.
If you are certain that your link is the first one, the query below will get it:
soup.find('div', {'class': 'target'}).find('a')['href']
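A runnable sketch of that query against hypothetical markup shaped like the above (hrefs invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class is on the div, the links are inside it
html = """
<div class="target">
  <a href="neededlink">one</a>
  <a href="notneededlink">two</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the div by its class first, then take the first <a> inside it
href = soup.find("div", {"class": "target"}).find("a")["href"]
print(href)  # neededlink
```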
Background: I'm trying to scrape some tables from this pro-football-reference page. I'm a complete newbie to Python, so a lot of the technical jargon ends up lost on me, but in trying to understand how to solve the issue, I can't figure it out.
Specific issue: because there are multiple tables on the page, I can't figure out how to get Python to target the one I want. I'm trying to get the Defense & Fumbles table. The code below is what I've got so far, and it's from this tutorial using a page from the same site, but one that only has a single table.
sample code:
# url we are scraping
url = "https://www.pro-football-reference.com/teams/nwe/2017.htm"
# html from the given url
html = urlopen(url)
# make soup object of html
soup = BeautifulSoup(html)
# we see that soup is a beautifulsoup object
type(soup)
#
column_headers = [th.getText() for th in
                  soup.findAll('table', {"id": "defense").findAll('th')]
column_headers  # our column headers
Attempts made: I realized that the tutorial's method would not work for me, so I attempted to change the soup.findAll portion to target the specific table. But I repeatedly get an error saying:
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
When changing it to find, the error becomes:
AttributeError: 'NoneType' object has no attribute 'find'
I'll be absolutely honest that I have no idea what I'm doing or what these mean. I'd appreciate any help in figuring out how to target that data and then scrape it.
Thank you,
You're missing a "}" in the dict after the word "defense". Try the code below and see if it works.
column_headers = [th.getText() for th in
                  soup.findAll('table', {"id": "defense"}).findAll('th')]
First off, you want to use soup.find('table', {"id": "defense"}).findAll('th') - find one table, then find all of its 'th' tags.
The other problem is that the table with id "defense" is commented out in the html on that page:
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
<table class="sortable stats_table" id="defense" data-cols-to-freeze=2><caption>Defense & Fumbles Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
etc. I assume that javascript is un-hiding it. BeautifulSoup doesn't parse the text of comments, so you'll need to find the text of all the comments on the page as in this answer, look for one with id="defense" in it, and then feed the text of that comment into BeautifulSoup.
Like this:
from bs4 import Comment
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
defenseComment = next(c for c in comments if 'id="defense"' in c)
defenseSoup = BeautifulSoup(str(defenseComment), 'html.parser')
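Put together, a minimal sketch against a hypothetical page with the table commented out (column names invented for illustration):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page mimicking pro-football-reference: table hidden in a comment
html = ('<div><!-- <table id="defense"><thead><tr>'
        '<th>Player</th><th>Sk</th></tr></thead></table> --></div>')
soup = BeautifulSoup(html, "html.parser")

# Collect all comment nodes, find the one containing the defense table
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
defense_comment = next(c for c in comments if 'id="defense"' in c)

# Re-parse the comment's text so the table becomes real, searchable markup
defense_soup = BeautifulSoup(str(defense_comment), "html.parser")
column_headers = [th.getText() for th in
                  defense_soup.find("table", {"id": "defense"}).findAll("th")]
print(column_headers)  # ['Player', 'Sk']
```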
I have the following html part which repeates itself several times with other href links:
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item".
Pretty new to beautifulsoup and nothing that I came up with worked.
Thanks for your ideas.
EDIT: Does not really have to be beautifulsoup; when it can be done with regex and the python html parser this is also ok.
EDIT2: What I tried (I'm pretty new to Python, so what I did might be totally stupid from an advanced viewpoint):
soup = bs4.BeautifulSoup(htmlsource)
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))
This will give me a list of all the "product-list-item" but then I tried something like
print(x[i].get("class").next_element)
Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:
print(x[i][0].get("class").next_element)
Which led to this error: return self.attrs[key] KeyError: 0.
Also tried with .find_all("href") and .get("href") but this all leads to the same errors.
EDIT3: OK, it seems I found out how to solve it. Now I did:
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))
This can also be shortened by using another attribute to the find_all function:
x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))
greetings
I want to get all the href links in this document that are directly after the div tag with the class "product-list-item"
To find the first <a href> element in the <div>:
links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True)  # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])
It assumes that the link is inside <div>. Any elements in <div> before the first <a href> are ignored.
If you'd like; you can be more strict about it e.g., taking the link only if it is the first child in <div>:
a = div.contents[0]  # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
    links.append(a['href'])
Or if <a> is not inside <div>:
a = div.find_next('a', href=True)  # find <a> that appears after <div>
if a is not None:
    links.append(a['href'])
There are many ways to search and navigate in BeautifulSoup.
If you parse with lxml.html, you could also use XPath and CSS expressions if you are familiar with them.
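You don't need lxml for CSS selectors, though: BeautifulSoup's select() accepts them too. A sketch using the child combinator on the question's markup:

```python
from bs4 import BeautifulSoup

# Markup from the question
html = """
<div class="product-list-item margin-bottom">
  <a title="titleexample" href="http://www.urlexample.com/example_1">x</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS child combinator: <a href> elements directly under the div
links = [a["href"] for a in soup.select("div.product-list-item > a[href]")]
print(links)  # ['http://www.urlexample.com/example_1']
```

This sidesteps next_element entirely and won't break if whitespace or other nodes sit between the div and the link.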