Extract text from a div inside an a href in a loop with BeautifulSoup - python

<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div>
Hello everyone,
From the HTML above, I want to write a loop that extracts TEXT if and only if the div class is ELEMENT4 and the svg class is ELEMENT5 (the page contains other div and svg elements with different classes).
Thank you for your help,
Eddy

You'll need a library that can fetch a URL's HTML, such as urllib.request from the standard library (urllib2 on Python 2), plus Beautiful Soup to parse it. Fetch the page, store the parsed result in a variable, then filter the output in whatever way serves your needs.
For example:
from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup
page = urlopen("the_url")
content = BeautifulSoup(page.read().decode("utf-8"), "html.parser")  # decode the bytes as UTF-8, then parse
divs = content.find_all("div")  # finds all div elements in the document
You could then use a regular expression to pull the actual text out of each element.
Good luck on your assignment!
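For the specific condition in the question, a minimal sketch might look like the following. It assumes that the svg and the a elements both sit inside the ELEMENT4 div and that the unclosed tags in the snippet are closed as shown here. The second ELEMENT4 block, with its OTHER svg class and SKIP text, is made up purely to show a non-matching case.

```python
from bs4 import BeautifulSoup

# Sample markup: the closing tags and the second ELEMENT4 block are assumptions.
html = """
<div class="ELEMENT1">
  <div class="ELEMENT2">
    <div class="ELEMENT3">valeur1</div>
    <div class="ELEMENT4">
      <svg class="ELEMENT5 "></svg>
      <a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
        <div>TEXT</div>
      </a>
    </div>
    <div class="ELEMENT4">
      <svg class="OTHER"></svg>
      <a href="ELEMENT6"><div>SKIP</div></a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

texts = []
for block in soup.find_all("div", class_="ELEMENT4"):
    # Keep only the blocks that also contain an svg with class ELEMENT5.
    if block.find("svg", class_="ELEMENT5") is None:
        continue
    link = block.find("a")
    if link is not None and link.div is not None:
        texts.append(link.div.get_text(strip=True))

print(texts)
```

If the real page nests these elements differently, the two find() calls would need to be adjusted to match.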

Related

Putting Links in Parentheses with BeautifulSoup

BeautifulSoup's get_text() function only returns the textual content of an HTML page. However, I want my program to return the href of an <a> tag in parentheses directly after the tag's text.
In other words, using get_text() will just return "17.602" on the following HTML:
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
However, I want my program to return "17.602 (17.602.html#FAR_17_602)". How would I go about doing this?
EDIT: What if you need to print text from other tags, such as:
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
In other words, how would you compose a program that would print
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
You can format the output using f-strings.
Access the tag's text using .text, and then access the href attribute.
from bs4 import BeautifulSoup
html = """
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
"""
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")
print(f"{a_tag.text} ({a_tag['href']})")
Output:
17.602 (17.602.html#FAR_17_602)
Edit: For the <p> example, parse that snippet instead and use .previous_sibling and .next_sibling to reach the text on either side of the link:
print(f"{a_tag.previous_sibling.strip()} {a_tag.text} ({a_tag['href']}) {a_tag.next_sibling.strip()}")
Output:
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
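The sibling approach handles one link with text on both sides. As a sketch of a more general variation (not part of the answer above), each <a> tag can be replaced in the tree by its text plus its href before calling get_text(), which should also cope with several links in one paragraph. The whitespace normalisation at the end is an added assumption about the desired output.

```python
from bs4 import BeautifulSoup

html = """
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Swap every link for a plain string of the form "text (href)".
for a_tag in soup.find_all("a", href=True):
    a_tag.replace_with(f"{a_tag.get_text()} ({a_tag['href']})")

# Collapse newlines and repeated spaces into single spaces.
result = " ".join(soup.p.get_text().split())
print(result)
```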

Parsing multiple items using BeautifulSoup in Python

I'm trying to parse HTML from a website where multiple elements share the same class. I can't seem to find a solution; I manage to get one item but not all of them.
Here's a bit of the HTML I'm trying to parse:
<h1>Synonymes travail</h1>
<div class="container-bloc1">
<strong> Nom</strong>
<br/>
-
<i><a class="lien2" href="/fr/accouchement.html"> accouchement </a></i>
:
<a class="lien3" href="/fr/gésine.html"> gésine</a>
<br/>
-
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>
In Python, I wrote it like this:
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get("http://www.synonymes.net/fr/travail.html").text
soup = BeautifulSoup(source, "lxml")
for synonyme in soup.find_all("div", class_="container-bloc1"):
    print(synonyme)
    synonymesdumot = synonyme.find("a", class_="lien2").text
    print(synonymesdumot)
    for synonymesautres in synonyme.find_all("a", class_="lien3").text:
        print(synonymesautres)
The first part is working, since there is only one "lien2" in the HTML file. I could do the same for "lien3" but I'd only get one item, and I want all of them.
What am I doing wrong here? Thanks for your help guys!
If you run the code as it appears in your question, you get an AttributeError, because .find_all() returns a collection of tags (a ResultSet, more specifically) that has no text attribute. Each of its elements, which are of type bs4.element.Tag, does have one. So you need to get the text attribute of each tag inside the for loop:
for synonymesautres in synonyme.find_all("a", class_="lien3"):
    print(synonymesautres.text)
Output:
le
travail
manque
de
travail
travail
fatigant
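As a self-contained variation on that fix, the sketch below runs against the HTML fragment from the question instead of the live page, so its results differ from the output shown above. It collects every lien2 and lien3 link text into lists; the names headwords and synonyms are illustrative only.

```python
from bs4 import BeautifulSoup

# The fragment quoted in the question, parsed offline.
html = """
<div class="container-bloc1">
<strong> Nom</strong>
<br/>
-
<i><a class="lien2" href="/fr/accouchement.html"> accouchement </a></i>
:
<a class="lien3" href="/fr/gésine.html"> gésine</a>
<br/>
-
<i> <a class="lien2" href="/fr/action.html"> action </a></i>
:
<a class="lien3" href="/fr/activité.html"> activité</a>
,
<a class="lien3" href="/fr/labeur.html"> labeur</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="container-bloc1")

# find_all() returns a list-like ResultSet, so read the text of each tag in turn.
headwords = [a.get_text(strip=True) for a in block.find_all("a", class_="lien2")]
synonyms = [a.get_text(strip=True) for a in block.find_all("a", class_="lien3")]

print(headwords)
print(synonyms)
```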

How to scrape data in HTML file from a certain line onwards

I'm trying to scrape data from an HTML page. My code so far looks like this:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen  # on Python 2: from urllib import urlopen
redditPage1 = "http://redditlist.com/sfw"
r = urlopen(redditPage1).read()
soup = bs(r, "html.parser")
Now I want to get the reddit moderators (or subredditors, as they are called) in a list, ordered by their number of subscribers. For that I only need to look at the data that comes after this line of HTML:
<h3 class="listing-header">Subscribers</h3>
Everything before this line is irrelevant and all entries about the subredditors after this line look like this:
<div class="listing-item" data-target-filter="sfw" data-target-subreddit="funny">
<div class="offset-anchor" id="funny-subscribers"></div>
<span class="rank-value">1</span>
<span class="subreddit-info-panel-toggle sfw"> <div>i</div> </span>
<span class="subreddit-url">
<a class="sfw" href="http://reddit.com/r/funny" target="_blank">funny</a>
</span>
<span class="listing-stat">18,197,786</span>
</div>
What should I do to be able to extract the subredditor names that come after this line and not before?
Find the <h3 class="listing-header">Subscribers</h3> element, then take its parent div so that the search is limited to the Subscribers section. Then find every div whose class is listing-item and loop over them to get the text (the name) of the <a> element inside each one:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen  # on Python 2: from urllib import urlopen
redditPage1 = "http://redditlist.com/sfw"
r = urlopen(redditPage1).read()
soup = bs(r, 'lxml')
for sub_div in soup.find("h3", text="Subscribers").parent.find_all('div', {"class": "listing-item"}):
    print(sub_div.find('a').getText())
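The same idea can be tried offline. The sketch below uses a small hand-written page in which the first listing, its "Recent Activity" header, and the askreddit and pics entries are invented for illustration; only the funny entry comes from the question. It limits the search to the parent of the Subscribers header, as described above.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two listings; only the second one is wanted.
html = """
<div class="span4 listing">
  <h3 class="listing-header">Recent Activity</h3>
  <div class="listing-item" data-target-subreddit="askreddit">
    <span class="subreddit-url"><a class="sfw" href="http://reddit.com/r/askreddit">askreddit</a></span>
  </div>
</div>
<div class="span4 listing">
  <h3 class="listing-header">Subscribers</h3>
  <div class="listing-item" data-target-subreddit="funny">
    <div class="offset-anchor" id="funny-subscribers"></div>
    <span class="subreddit-url"><a class="sfw" href="http://reddit.com/r/funny">funny</a></span>
  </div>
  <div class="listing-item" data-target-subreddit="pics">
    <span class="subreddit-url"><a class="sfw" href="http://reddit.com/r/pics">pics</a></span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the header by its exact text, then search only inside its parent.
header = soup.find("h3", class_="listing-header", string="Subscribers")
names = []
if header is not None:
    for item in header.parent.find_all("div", class_="listing-item"):
        names.append(item.find("a").get_text(strip=True))

print(names)
```

Checking header against None avoids an AttributeError if the page layout changes and the header is no longer found.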
For more readable code, you could also do it with lxml:
import requests
from lxml.html import fromstring
res = requests.get("http://redditlist.com/sfw").text
root = fromstring(res)
for container in root.cssselect(".listing"):
    if container.cssselect("h3:contains('Subscribers')"):
        for subreddit in container.cssselect(".listing-item"):
            print(subreddit.attrib['data-target-subreddit'])
Or with BeautifulSoup if you like:
import requests
from bs4 import BeautifulSoup
main_link = "http://redditlist.com/all?page={}"
for link in [main_link.format(page) for page in range(1, 5)]:
    res = requests.get(link).text
    soup = BeautifulSoup(res, "lxml")
    for container in soup.select(".listing"):
        if container.select("h3")[0].text == "Subscribers":
            for subreddit in container.select(".listing-item"):
                print(subreddit['data-target-subreddit'])
Try this:
for div in soup.select('.span4.listing'):
    if div.h3.text.lower() == 'subscribers':
        output = [(ss.select('a.sfw')[0].text, ss.select('.listing-stat')[0].text) for ss in div.select('.listing-item')]

How to change the class of an HTML <div> using BeautifulSoup?

I'm modifying an HTML file using Python and BeautifulSoup, and I can change the content of headers, but I couldn't find a way to change the class of a div. My goal is to turn
<div id="div1" class="blue_titles">test</div>
into:
<div id="div1" class="green_titles">test</div>
I looked up and down the docs, but to no avail. It's probably right in front of my face, but I can't find it. Thanks in advance!
You can simply assign the new value to the key class:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<div id="div1" class="blue_titles">test</div>""", "lxml")
soup.find("div")['class'] = "green_titles"
soup
# <html><body><div class="green_titles" id="div1">test</div></body></html>
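Beautiful Soup treats class as a multi-valued attribute and returns it as a list when read back. The following sketch assumes a div that carries a second, hypothetical class named bold, and swaps only blue_titles while leaving the other class in place.

```python
from bs4 import BeautifulSoup

# "bold" is a made-up extra class, added to show that it survives the change.
soup = BeautifulSoup('<div id="div1" class="blue_titles bold">test</div>', "html.parser")
div = soup.find("div", id="div1")

# div["class"] is a list of class names; rebuild it with one name replaced.
div["class"] = ["green_titles" if name == "blue_titles" else name for name in div["class"]]

print(div["class"])
print(div)
```

Assigning a plain string, as in the answer above, also works when the element should end up with exactly one class.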

Beautiful Soup - Cannot find the tags

The page is: http://item.taobao.com/item.htm?id=13015989524
You can view its source code.
The source contains the following tag:
<a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank">
But when I use BeautifulSoup to read the source code and run the following:
soup.findAll('a', href="http://item.taobao.com/item.htm?id=13015989524")
it returns an empty list, []. Why does it return []?
As far as I can see, the <a> tag you are trying to find is inside a <textarea> tag. BS does not parse the contents of <textarea> as HTML, and rightly so since <textarea> should not contain HTML. In short, that page is doing something sketchy.
If you really need to get that, you might "cheat" and parse the contents of <textarea> again and search within them:
# Written for Python 2 and BeautifulSoup 3 (note the import and the print statement)
import urllib
from BeautifulSoup import BeautifulSoup as BS
soup = BS(urllib.urlopen("http://item.taobao.com/item.htm?id=13015989524"))
a = []
for textarea in soup.findAll("textarea"):
    textsoup = BS(textarea.text)  # parse the contents as html
    a.extend(textsoup.findAll("a", attrs={"href": "http://item.taobao.com/item.htm?id=13015989524"}))
for tag in a:
    print tag
# outputs
# <a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank"><img ...
# <a href="http://item.taobao.com/item.htm?id=13015989524" title="901 ...
Use a dictionary to store the attribute:
soup.findAll('a', {
    'href': "http://item.taobao.com/item.htm?id=13015989524"
})
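For comparison, here is a small sketch using bs4 and a hand-written two-link snippet (the example.com link is a placeholder). It suggests that the keyword form and the dictionary form select the same tags. Both can only match tags that the parser actually produced, so neither form would see markup that was kept as plain text.

```python
from bs4 import BeautifulSoup

url = "http://item.taobao.com/item.htm?id=13015989524"
# Hypothetical snippet: one matching link and one unrelated link.
html = f'<p><a href="{url}" target="_blank">item</a> <a href="http://example.com/">other</a></p>'

soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("a", href=url)          # attribute passed as a keyword
by_dict = soup.find_all("a", attrs={"href": url})  # attribute passed in a dictionary

print([tag["href"] for tag in by_keyword])
print([tag["href"] for tag in by_dict])
```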
