I'm working on an automated program to identify website logos using BeautifulSoup and Python 3. As a first step, I look for images that have the term 'logo' in their image name, which works decently. However, I want to extend this to an image whose own name may lack the term but which is contained in a link whose class, id, or other attribute says 'logo', or is buried even deeper, in a link inside a div with a class or id of 'logo'. For example:
<div id="logo">
<a href="http://www.mexgrocer.com/">
<img src="http://ep.yimg.com/ca/I/mex-grocer_2269_22595" width="122" height="72" border="0" hspace="0" vspace="0" alt="Mexican Food">
</a>
</div>
My code right now is:
img = soup.find("img",src=re.compile(r'logo',re.I))
How can I expand this to search through all of the parent tag attributes?
Use find_all to find every matching tag in the whole document. You can try something like this:
# Python 2 code: urllib2 and the print statement do not exist in Python 3
from bs4 import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('your_url').read())
for x in soup.find_all(id='logo'):
    try:
        if x.name == 'img':
            print x['src']
    except: pass
If you want to search on class, use class_='logo' (with a trailing underscore, because class is a reserved word in Python).
For Python 3, the answer to this question needs to be updated to:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def getLogoSrc(url):
    # Fetch the page passed in as `url` and parse it
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    for x in soup.find_all(id='logo'):
        try:
            if x.name == 'img':
                print(x['src'])
        except KeyError:
            # the tag has no src attribute
            pass
You can use find_all(tag, attribute), for example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(f)  # f is an open file or a string of HTML
var = soup.find_all("font", color="#990000")  # all <font color="#990000"></font>
var2 = soup.find_all("a", class_="LinkIndex")  # all <a class="LinkIndex"></a>
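None of the snippets above walks up from the image to its ancestors, which is what the question asks for. Here is a minimal sketch of one way to do it, assuming it is acceptable to look only a few levels up. The helper names attrs_mention_logo and find_logo_images, the max_depth parameter, and the extra banner.png image are made up for illustration; the div is the one from the question.

```python
import re
from bs4 import BeautifulSoup

LOGO = re.compile(r"logo", re.I)

def attrs_mention_logo(tag):
    """Return True if any attribute value of `tag` contains 'logo'."""
    for value in tag.attrs.values():
        # Multi-valued attributes such as class come back as lists
        text = " ".join(value) if isinstance(value, list) else str(value)
        if LOGO.search(text):
            return True
    return False

def find_logo_images(soup, max_depth=3):
    """Collect <img> tags where the tag itself or a near ancestor mentions 'logo'."""
    matches = []
    for img in soup.find_all("img"):
        # Check the image first, then up to max_depth ancestors
        candidates = [img] + list(img.parents)[:max_depth]
        if any(attrs_mention_logo(tag) for tag in candidates):
            matches.append(img)
    return matches

html = """
<div id="logo">
<a href="http://www.mexgrocer.com/">
<img src="http://ep.yimg.com/ca/I/mex-grocer_2269_22595" width="122" height="72" alt="Mexican Food">
</a>
</div>
<img src="banner.png">
"""
soup = BeautifulSoup(html, "html.parser")
logo_srcs = [img["src"] for img in find_logo_images(soup)]
print(logo_srcs)
```

The depth limit is a judgment call: without it, a 'logo' class on some outer wrapper would match every image on the page.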
<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div>
Hello all,
My request is the following.
From this piece of code, I want to create a loop that allows me
to extract TEXT if and only if div class = ELEMENT4 AND svg class = ELEMENT5 (because there are other, different ones).
Thank you for your help,
Eddy
You'll need to import urllib.request (urllib2 on Python 2) or some other library that lets you fetch a URL's HTML. Then import Beautiful Soup as well. Fetch the URL, store the parsed result in a variable, and then reshape the output in whatever way serves your needs.
For example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("the_url")
content = BeautifulSoup(page.read().decode("utf-8"), "html.parser")  # decode the data as UTF-8
divs = content.find_all("div")  # finds all div elements in the document
Then you could use regexp to find the actual text inside the element.
Good luck on your assignment!
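As an alternative to a regexp, a CSS selector can express the "div class ELEMENT4 AND svg class ELEMENT5" condition directly. This is only a sketch: it assumes the tags in the question nest in the order they open (the snippet is cut off before any closing tags), and the second block with class OTHER is invented here just to show that non-matching blocks are skipped.

```python
from bs4 import BeautifulSoup

html = """
<div class="ELEMENT1">
  <div class="ELEMENT2">
    <div class="ELEMENT3">valeur1</div>
    <div class="ELEMENT4">
      <svg class="ELEMENT5 ">
        <a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
          <div>TEXT</div>
        </a>
      </svg>
    </div>
    <div class="OTHER">
      <svg class="ELEMENT5"><a href="#"><div>SKIPPED</div></a></svg>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only inner divs that sit under svg.ELEMENT5 inside div.ELEMENT4
texts = []
for inner in soup.select("div.ELEMENT4 svg.ELEMENT5 a > div"):
    texts.append(inner.get_text(strip=True))
print(texts)
```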
I have
<div style="background:url('link_to_img')"></div>
and I need to extract an image link of this div, does anyone know how to do it?
You can do this with regular expressions.
from bs4 import BeautifulSoup
import re
html = '''<div style="background:url('link_to_img')"></div>'''
soup = BeautifulSoup(html,'lxml')
print(re.search(r'\((.*?)\)',soup.find('div')['style']).group(1))
The result is
'link_to_img'
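Note that the result above still includes the quotes. If you want the bare link, a slightly stricter pattern can anchor on url( and make the quotes optional. This is a sketch under the assumption that the style attribute contains a single url(...) value; the variable names are mine, and html.parser is used only to avoid the lxml dependency.

```python
import re
from bs4 import BeautifulSoup

html = '''<div style="background:url('link_to_img')"></div>'''
soup = BeautifulSoup(html, "html.parser")

style = soup.find("div")["style"]
# Capture whatever sits between url( and ), ignoring optional quotes
match = re.search(r"url\(\s*['\"]?(.*?)['\"]?\s*\)", style)
image_url = match.group(1) if match else None
print(image_url)
```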
I'm trying to learn how to webscrape with beautifulsoup + python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/ but I can't figure out how to isolate the text. The html for what I want is written as below, what I want to output is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
Within my code I've done soup.find(text="Cinematography"), and a mixture of different things like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS attribute substring selector with select_one (find does not accept CSS selectors):
soup.select_one('a[href*="cinematography"]').text
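If you would rather start from the heading text, as the question attempted, you can locate the span and step over to the div that follows its h3. The sketch below runs against the HTML fragment quoted in the question rather than the live page, so it assumes the real page keeps the same h3-then-div layout.

```python
from bs4 import BeautifulSoup

html = """
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the label, climb to its <h3>, then take the next sibling div
label = soup.find("span", string="Cinematography")
sluglist = label.find_parent("h3").find_next_sibling("div", class_="text-sluglist")
name = sluglist.get_text(strip=True)
print(name)
```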
I'm trying to scrape data from an HTML page. My code looks like this:
from bs4 import BeautifulSoup as bs
import urllib
redditPage1 = "http://redditlist.com/sfw"
r=urllib.urlopen(redditPage1).read()
soup = bs(r)
Now I want to get the reddit moderators (or subredditors, as they are called) in a list ordered by their number of subscribers. For that I only need to look at the data that comes after this line of HTML:
<h3 class="listing-header">Subscribers</h3>
Everything before this line is irrelevant and all entries about the subredditors after this line look like this:
<div class="listing-item" data-target-filter="sfw" data-target-subreddit="funny">
<div class="offset-anchor" id="funny-subscribers"></div>
<span class="rank-value">1</span>
<span class="subreddit-info-panel-toggle sfw"> <div>i</div> </span>
<span class="subreddit-url">
<a class="sfw" href="http://reddit.com/r/funny" target="_blank">funny</a>
</span>
<span class="listing-stat">18,197,786</span>
</div>
What should I do to be able to extract the subredditor names that come after this line and not before?
Find the <h3 class="listing-header">Subscribers</h3> element and take its parent div, which limits the scope to the Subscribers section. Then find every div whose class is listing-item and loop over them to get the text (the name) of the <a> element inside each one:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen  # Python 3 location of urlopen
redditPage1 = "http://redditlist.com/sfw"
r = urlopen(redditPage1).read()
soup = bs(r, 'lxml')
for sub_div in soup.find("h3", text="Subscribers").parent.find_all('div', {"class": "listing-item"}):
    print(sub_div.find('a').getText())
To get the desired results while keeping your code more readable, you can also do it like this:
import requests
from lxml.html import fromstring
res = requests.get("http://redditlist.com/sfw").text
root = fromstring(res)
for container in root.cssselect(".listing"):
    if container.cssselect("h3:contains('Subscribers')"):
        for subreddit in container.cssselect(".listing-item"):
            print(subreddit.attrib['data-target-subreddit'])
Or with BeautifulSoup if you like:
import requests
from bs4 import BeautifulSoup
main_link = "http://redditlist.com/all?page={}"
for link in [main_link.format(page) for page in range(1,5)]:
    res = requests.get(link).text
    soup = BeautifulSoup(res,"lxml")
    for container in soup.select(".listing"):
        if container.select("h3")[0].text=="Subscribers":
            for subreddit in container.select(".listing-item"):
                print(subreddit['data-target-subreddit'])
Try this:
for div in soup.select('.span4.listing'):
    if div.h3.text.lower()=='subscribers':
        output = [(ss.select('a.sfw')[0].text, ss.select('.listing-stat')[0].text) for ss in div.select('.listing-item')]
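The scoping idea shared by these answers can be tried offline on a small stand-in page. In this sketch the "Recent Activity" section, the pics and aww entries, and their counts are invented for illustration; only the funny entry and the class names come from the question. It also sorts by subscriber count, which the question asked for.

```python
from bs4 import BeautifulSoup

html = """
<div class="listing">
  <h3 class="listing-header">Recent Activity</h3>
  <div class="listing-item" data-target-subreddit="pics">
    <span class="listing-stat">1,234</span>
  </div>
</div>
<div class="listing">
  <h3 class="listing-header">Subscribers</h3>
  <div class="listing-item" data-target-subreddit="aww">
    <span class="listing-stat">2,000</span>
  </div>
  <div class="listing-item" data-target-subreddit="funny">
    <span class="listing-stat">18,197,786</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Limit the search to the container that holds the Subscribers header
header = soup.find("h3", class_="listing-header", string="Subscribers")
rows = []
for item in header.parent.find_all("div", class_="listing-item"):
    name = item["data-target-subreddit"]
    stat = item.find("span", class_="listing-stat").get_text()
    rows.append((name, int(stat.replace(",", ""))))

# Largest subscriber count first
rows.sort(key=lambda row: row[1], reverse=True)
print(rows)
```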
I'm trying to parse a bit of HTML and I'd like to extract the link that matches a particular pattern. I'm using the find method with a regular expression but it doesn't get me the correct link. Here's my snippet. Could someone tell me what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
import re
html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a> –
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a> –
</div>
"""
soup = BeautifulSoup(html)
print soup.find('a', href = re.compile(r".*title/tt.*"))['href']
I should be getting the second link but BS always returns the first link. The href of the first link doesn't even match my regex so why does it return it?
Thanks.
find only returns the first <a> tag. You want findAll.
Can't answer your question, but anyway your (originally) posted code has an import typo. Change
import BeautifulSoup
to
from BeautifulSoup import BeautifulSoup
Then, your output (using beautifulsoup version 3.1.0.1) will be:
http://www.imdb.com/title/tt1196141/
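For anyone on the newer bs4 package, the same lookup can be sketched as below. It assumes bs4 is installed and uses the built-in html.parser. The leading and trailing .* in the pattern are unnecessary because bs4 searches within the attribute value rather than requiring a full match.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a>
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first <a> whose href matches the pattern
first_match = soup.find("a", href=re.compile(r"title/tt"))["href"]

# find_all() returns every match, here both IMDb links
imdb_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"imdb\.com"))]
print(first_match)
print(imdb_links)
```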