How to scrape data in HTML file from a certain line onwards - python

I'm trying to scrape data from an HTML page. My code so far looks like this:
from bs4 import BeautifulSoup as bs
import urllib
redditPage1 = "http://redditlist.com/sfw"
r = urllib.urlopen(redditPage1).read()
soup = bs(r)
Now I want to get the subreddit names in a list, ordered by their number of subscribers. For that I need to only look at the data that comes after this line:
<h3 class="listing-header">Subscribers</h3>
Everything before this line is irrelevant, and every subreddit entry after it looks like this:
<div class="listing-item" data-target-filter="sfw" data-target-subreddit="funny">
<div class="offset-anchor" id="funny-subscribers"></div>
<span class="rank-value">1</span>
<span class="subreddit-info-panel-toggle sfw"> <div>i</div> </span>
<span class="subreddit-url">
<a class="sfw" href="http://reddit.com/r/funny" target="_blank">funny</a>
</span>
<span class="listing-stat">18,197,786</span>
</div>
What should I do to extract only the subreddit names that come after this line and not before?

Find the <h3 class="listing-header">Subscribers</h3> element, then get its parent div so the scope is limited to the Subscribers listing. Then find all divs whose class is listing-item and loop over them to get the text (the names) of the inner <a> element:
from bs4 import BeautifulSoup as bs
import urllib

redditPage1 = "http://redditlist.com/sfw"
r = urllib.urlopen(redditPage1).read()
soup = bs(r, 'lxml')

# Scope the search to the "Subscribers" listing, then walk its items.
for sub_div in soup.find("h3", text="Subscribers").parent.find_all('div', {"class": "listing-item"}):
    print(sub_div.find('a').getText())

To get the desired results with more readable code, you can also go like this:
import requests
from lxml.html import fromstring

res = requests.get("http://redditlist.com/sfw").text
root = fromstring(res)

for container in root.cssselect(".listing"):
    if container.cssselect("h3:contains('Subscribers')"):
        for subreddit in container.cssselect(".listing-item"):
            print(subreddit.attrib['data-target-subreddit'])
Or with BeautifulSoup if you like:
import requests
from bs4 import BeautifulSoup

main_link = "http://redditlist.com/all?page={}"

for link in [main_link.format(page) for page in range(1, 5)]:
    res = requests.get(link).text
    soup = BeautifulSoup(res, "lxml")
    for container in soup.select(".listing"):
        if container.select("h3")[0].text == "Subscribers":
            for subreddit in container.select(".listing-item"):
                print(subreddit['data-target-subreddit'])

Try this:
# soup is the BeautifulSoup object built in the question.
for div in soup.select('.span4.listing'):
    if div.h3.text.lower() == 'subscribers':
        output = [(ss.select('a.sfw')[0].text, ss.select('.listing-stat')[0].text)
                  for ss in div.select('.listing-item')]
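A possible follow-up, not in the original answer: once the loop above has filled output with (name, subscriber count) pairs, you can print them like this:
for name, subscribers in output:
    print(name, subscribers)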

Related

with BeautifulSoup extract text from div in a href in loop

<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6ยป target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div
Hello to all,
My request is the following: from the piece of code above, I want to create a loop that extracts TEXT if and only if the div class is ELEMENT4 and the svg class is ELEMENT5 (because there are other, different ones).
Thank you for your help.
eddy
You'll need to import urllib2, or some other library that lets you fetch a URL's HTML, and import BeautifulSoup as well. Fetch the URL, store the HTML in a variable, then reformat the output in any way that serves your needs.
For example:
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("the_url")
content = BeautifulSoup(page.read().decode("utf-8"))  # decode data (utf-8)
divs = content.find_all("div")  # finds all div elements in the document
Then you could use regexp to find the actual text inside the element.
Good luck on your assignment!
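As a rough sketch aimed at the actual condition in the question (not part of the original answer, and reusing the placeholder class names from the snippet): loop over every div with class ELEMENT4 and keep only those that also contain an svg with class ELEMENT5:
from bs4 import BeautifulSoup

html = """
<div class="ELEMENT1">
  <div class="ELEMENT2">
    <div class="ELEMENT3">valeur1</div>
    <div class="ELEMENT4">
      <svg class="ELEMENT5"></svg>
      <a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
        <div>TEXT</div>
      </a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep a div only if its class is ELEMENT4 *and* it contains an svg with class ELEMENT5.
for div in soup.find_all("div", class_="ELEMENT4"):
    if div.find("svg", class_="ELEMENT5"):
        link = div.find("a")
        if link:
            print(link.get_text(strip=True))  # prints: TEXT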

Scraping div with a data- attribute using Python and BeautifulSoup

I have to scrape a web page using BeautifulSoup in Python, and I want to extract the complete div which has the relevant information and looks like the one below:
<div data-v-24a74549="" class="row row-mg-mod term-row">
I wrote soup.find('div',{'class':'row row-mg-mod term-row'}).
But it is returning nothing. I guess it has something to do with this data-v value.
Can someone tell me the exact syntax for scraping this type of data?
Give this a try:
from bs4 import BeautifulSoup
content = """
<div data-v-24a74549="" class="row row-mg-mod term-row">"""
soup = BeautifulSoup(content, 'html.parser')

for div in soup.find_all("div", {"class": "row"}):
    print(div)
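If the div still doesn't show up when scraping the live page, it may be because the content is rendered client-side (the data-v- prefix is typical of Vue.js scoped attributes), so it never appears in the downloaded HTML at all. For the parsing itself, here is a small sketch, using only the one-tag snippet from the question, of two alternative ways to target that exact div:
from bs4 import BeautifulSoup

content = '<div data-v-24a74549="" class="row row-mg-mod term-row">example</div>'
soup = BeautifulSoup(content, "html.parser")

# Match on the data- attribute itself (True means "the attribute is present").
print(soup.find("div", attrs={"data-v-24a74549": True}))

# Or match all three classes at once with a CSS selector.
print(soup.select_one("div.row.row-mg-mod.term-row"))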

Can't get item with python beautifulsoup

I'm trying to learn how to web scrape with BeautifulSoup + Python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/, but I can't figure out how to isolate the text. The HTML for what I want is written below; what I want to output is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
Within my code I've done soup.find(text="Cinematography"), and a mixture of different things like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup

r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')

# Grab the first link whose href contains "/cinematography/".
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS selector that matches part of the href attribute:
soup.select_one('a[href*="cinematography"]').text

How to get rid of whitespace above text, using bs4

Ok, so I'm using bs4 (BeautifulSoup) to parse through a website and find the specific titles I am looking for. My code looks like this:
import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)
for i in soup.find_all(class_='article-short'):
    if i.a:
        print(i.a.text.replace('\n', '').strip())
    else:
        print(i.contents[0].strip())
This code works, but the output shows about 20 blank lines before the requested titles are printed. Is there something wrong with my code, or is there something I can do to get rid of the whitespace?
Because you have elements like this:
<article class="article-short">
<div class="thumb"><img alt="FILE: Boys who have undergone a circumcision ceremony walk near Qunu in the Eastern Cape in 2013. Picture: AFP." height="147" src="http://ewn.co.za/cdn/-%2fmedia%2f3C37CB28056746CD95FC913757AAD41C.ashx%3fas%3d1%26h%3d147%26w%3d234%26crop%3d1;waeb9b8157b3e310df" width="234"/></div>
<h6 class="h6-mega">Contralesa against scrapping initiation due to cold weather</h6>
</article>
where the first link contains an image and no text.
You should probably look for h6 tags instead. So, something like this works:
import requests
from bs4 import BeautifulSoup

url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)

for i in soup.find_all(class_='article-short'):
    # Prefer the h6 headline; fall back to the element's first child.
    title = (i.h6.text.replace('\n', '') if i.h6 else i.contents[0]).strip()
    if title:
        print(title)

BeautifulSoup - Finding Logos

I'm working on an automated program to identify website logos using BeautifulSoup and Python 3. As a first step I look for images that have the term 'logo' in their image name, which actually works decently. However, I want to expand this to images whose other attributes may contain the term, or that are contained in a link with a class/id/attribute that says logo, or that are buried even deeper, in a link inside a div whose class contains 'logo'. For example:
<div id="logo">
<a href="http://www.mexgrocer.com/">
<img src="http://ep.yimg.com/ca/I/mex-grocer_2269_22595" width="122" height="72" border="0" hspace="0" vspace="0" alt="Mexican Food">
</a>
</div>
My code right now is:
img = soup.find("img",src=re.compile(r'logo',re.I))
How can I expand this to search through all of the parent tag attributes?
Use find_all to find all occurrences of a particular tag in the whole document. You can try something like this:
from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen('your_url').read())

for x in soup.find_all(id='logo'):
    try:
        if x.name == 'img':
            print x['src']
    except: pass
If you want to search on class, just use class_='logo' (BeautifulSoup uses the trailing underscore because class is a Python keyword).
This answer needs to be updated for Python 3:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def getLogoSrc(url):
    soup = BeautifulSoup(urlopen(url).read())
    for x in soup.find_all(id='logo'):
        try:
            if x.name == 'img':
                print(x['src'])
        except:
            pass
You can use find_all(tag, attribute), for example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(f)  # f is the HTML document (file object or string)
var = soup.find_all("font", color="#990000")    # all <font color="#990000"></font>
var2 = soup.find_all("a", class_="LinkIndex")   # all <a class="LinkIndex"></a>
