Retrieving Imgur Image Link via Web Scraping Python - python

I am trying to retrieve the link for an image using imgur.com. It seems that the picture (if .jpg or .png) is usually stored within (div class="image post-image") on their website, like:
<div class='image post-image'>
<img alt="" src="//i.imgur.com/QSGvOm3.jpg" original-title="" style="max-width: 100%; min-height: 666px;">
</div>
so here is my code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://imgur.com/gallery/0PTPt'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
info = soup.find_all('div', {'class':'post-image'})
file = open('imgur-html.txt', 'w')
file.write(str(info))
file.close()
Instead of being able to get everything within these tags, this is my output:
<div class="post-image" style="min-height: 666px">
</div>
What do I need to do in order to access this further so I can get the image link? Or is this simply something where I need to only use the API? Thanks for any help.

The child img appears to be added dynamically, so it is not present in the HTML that the request returns. You can extract the full link from the element whose rel attribute is image_src instead:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://imgur.com/gallery/0PTPt')
soup = bs(r.content, 'lxml')
print(soup.select_one('[rel=image_src]')['href'])
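The same lookup can be tried offline against a small piece of markup. The sketch below is only an illustration: SAMPLE_HTML is a hand-written stand-in for the page's head (the real page may differ), and find_image_src is a hypothetical helper name. It returns None instead of raising when no rel=image_src element exists.

```python
from bs4 import BeautifulSoup

# Illustrative markup only: a hand-written stand-in for the gallery page.
SAMPLE_HTML = """
<html><head>
<link rel="image_src" href="https://i.imgur.com/QSGvOm3.jpg">
</head><body>
<div class="post-image" style="min-height: 666px"></div>
</body></html>
"""


def find_image_src(html):
    """Return the href of the first <link rel="image_src">, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one('link[rel="image_src"]')
    return link.get("href") if link else None


image_url = find_image_src(SAMPLE_HTML)
print(image_url)
```

Checking for None before indexing avoids a TypeError on pages that do not carry this element.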

Related

How to scrape <span > and next <p>?

I am trying to scrape some information from a webpage using Selenium. From <span id='text'>, I want to extract the id value (text), and from the same div I want to extract the <p> element.
Here is what I have tried:
import requests
from bs4 import BeautifulSoup
# Send an HTTP request to the website and retrieve the HTML code of the webpage
response = requests.get('https://www.osha.gov/laws-regs/regulations/standardnumber/1926/1926.451#1926.451(a)(6)')
html = response.text
# Parse the HTML code using Beautiful Soup to extract the desired information
soup = BeautifulSoup(html, 'html.parser')
# find all <a> elements on the page with name attribute
links = soup.find_all('a', attrs={'name': True})
print(links)
linq = []
for link in links:
    # print(link['name'])
    linq.append(link['name'])
information = soup.find_all('p') # find all <p> elements on the page
# This is how I did it
with open('osha.txt', 'w') as f:
    for i in range(len(linq)):
        f.write(linq[i])
        f.write('\n')
        # write the paragraph text, not the Tag object
        f.write(information[i].get_text())
        f.write('\n')
        f.write('-' * 50)
        f.write('\n')
Below is the HTML code.
What I want to save in a separate text file is this information:
1926.451(a)
Capacity
<div class="field--item">
<div class="paragraph paragraph--type--regulations-standard-number paragraph--view-mode--token">
<span id="1926.451(a)">
<a href="/laws-regs/interlinking/standards/1926.451(a)" name="1926.451(a)">
1926.451(a)
</a>
</span>
<div class="field field--name-field-standard-paragraph-body-p">
<p>"Capacity"</p>
</div>
</div>
</div>
Some of the <a> tags and paragraphs might be missing on the page.
Use a try/except block to handle that.
Use a CSS selector to get each parent node, then get its respective child nodes.
Use a DataFrame to store the values and export them to a CSV file.
import pandas as pd
import requests
from bs4 import BeautifulSoup
# Send an HTTP request to the website and retrieve the HTML code of the webpage
response = requests.get('https://www.osha.gov/laws-regs/regulations/standardnumber/1926/1926.451#1926.451(a)(6)')
html = response.text
code = []
para = []
# Parse the HTML code using Beautiful Soup to extract the desired information
soup = BeautifulSoup(html, 'html.parser')
for item in soup.select(".field.field--name-field-reg-standard-number .field--item"):
    # find() returns None when the tag is missing, so .text raises AttributeError
    try:
        code.append(item.find("a").text.strip())
    except AttributeError:
        code.append(item.find("span").text.strip())
    try:
        para.append(item.find("p").text.strip())
    except AttributeError:
        para.append("Nan")
df = pd.DataFrame({"code": code, "paragraph": para})
print(df)
df.to_csv("path/to/filename")
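The parent-then-children idea can also be checked without a network request. The following is a minimal sketch, assuming the markup looks like the snippet in the question; the second item (a span with no <p>) is invented here to exercise the missing-paragraph case, and extract_pairs is a hypothetical helper name. It reads the code from the span's id attribute, as the question asked.

```python
from bs4 import BeautifulSoup

# First item is copied from the question; the second item is hypothetical
# and exists only to show what happens when a <p> is missing.
SAMPLE_HTML = """
<div class="field--item">
  <div class="paragraph paragraph--type--regulations-standard-number">
    <span id="1926.451(a)">
      <a href="/laws-regs/interlinking/standards/1926.451(a)" name="1926.451(a)">
        1926.451(a)
      </a>
    </span>
    <div class="field field--name-field-standard-paragraph-body-p">
      <p>"Capacity"</p>
    </div>
  </div>
</div>
<div class="field--item">
  <div class="paragraph paragraph--type--regulations-standard-number">
    <span id="1926.451(a)(1)">1926.451(a)(1)</span>
  </div>
</div>
"""


def extract_pairs(html):
    """Pair each span id with the text of the <p> in the same item."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for item in soup.select(".field--item"):
        span = item.find("span", id=True)
        if span is None:
            continue  # no code in this item, nothing to pair
        para = item.find("p")
        text = para.get_text(strip=True) if para else ""
        pairs.append((span["id"], text))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
for code, text in pairs:
    print(code)
    print(text)
    print("-" * 50)
```

Because each code and paragraph are looked up inside the same parent, the two lists cannot drift out of step the way two independent find_all calls can.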

Python beautifulsoup: Get a placeholder from img src

With the help of BeautifulSoup, I am trying to read the address of an image from a web page.
In the page source I can see the URL of the image.
But when I try to read the address with BeautifulSoup's find_all, I only get a placeholder for the image URL.
The image tag in the page source looks like this:
<br /><img src="mangas/Young Justice (2019)/Young Justice (2019) Issue 11/cw002.jpg" alt="" width="1200" height="1846" class="picture" />
In BeautifulSoup I get this:
<img 0="" alt="" class="picture" height="" src="/pics/placeholder2.jpg" width=""/>]
I hope somebody can give me a tip or explain why I get a placeholder instead of the original image URL.
My code:
import requests
from bs4 import BeautifulSoup as BS
from requests.exceptions import ConnectionError
def getimageurl(url):
    try:
        response = requests.get(url)
        soup = BS(response.text, 'html.parser')
        data = soup.find_all('a', href=True)
        for a in data:
            t = a.find_all('img', attrs={'class': 'picture'})
            print(t)
    except ConnectionError:
        print('Cant open url: {0}'.format(url))

Scraping div with a data- attribute using Python and BeautifulSoup

I have to scrape a web page using BeautifulSoup in Python. I want to extract the complete div that has the relevant information, which looks like the one below:
<div data-v-24a74549="" class="row row-mg-mod term-row">
I wrote soup.find('div',{'class':'row row-mg-mod term-row'}).
But it returns nothing. I guess it has something to do with this data-v value.
Can someone tell me the exact syntax for scraping this type of data?
Give this a try:
from bs4 import BeautifulSoup
content = """
<div data-v-24a74549="" class="row row-mg-mod term-row">"""
soup = BeautifulSoup(content,'html.parser')
for div in soup.find_all("div", {"class" : "row"}):
    print(div)
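BeautifulSoup treats class as a multi-valued attribute, so there are several ways to match this div. The sketch below runs three of them against the tag from the question (the inner text "term" is added here only for illustration). Note that attributes of the form data-v-... are often generated by a front-end framework; if that is the case for this page, which is an assumption, the div may be rendered by JavaScript and absent from the HTML that requests receives, and no selector will find it.

```python
from bs4 import BeautifulSoup

# The tag from the question; the text content is illustrative only.
content = '<div data-v-24a74549="" class="row row-mg-mod term-row">term</div>'
soup = BeautifulSoup(content, "html.parser")

# CSS selector: the element must carry all three classes, in any order.
by_css = soup.select_one("div.row.row-mg-mod.term-row")

# Single class name: matches any element whose class list contains it.
by_one_class = soup.find("div", class_="term-row")

# Attribute presence: True matches any value, including the empty string.
by_data_attr = soup.find("div", attrs={"data-v-24a74549": True})

for match in (by_css, by_one_class, by_data_attr):
    print(match)
```

The CSS selector form is usually the least fragile of the three, since it does not depend on the order of the class names in the markup.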

How to scrape data in HTML file from a certain line onwards

I'm trying to scrape data from an HTML page. My code looks like this:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen  # Python 3 location of urlopen
redditPage1 = "http://redditlist.com/sfw"
r = urlopen(redditPage1).read()
soup = bs(r, 'html.parser')
Now I want to get the reddit moderators (or subredditors, as they are called) in a list ordered by their number of subscribers. For that I only need to look at the data that comes after this line:
<h3 class="listing-header">Subscribers</h3>
Everything before this line is irrelevant and all entries about the subredditors after this line look like this:
<div class="listing-item" data-target-filter="sfw" data-target-subreddit="funny">
<div class="offset-anchor" id="funny-subscribers"></div>
<span class="rank-value">1</span>
<span class="subreddit-info-panel-toggle sfw"> <div>i</div> </span>
<span class="subreddit-url">
<a class="sfw" href="http://reddit.com/r/funny" target="_blank">funny</a>
</span>
<span class="listing-stat">18,197,786</span>
</div>
What should I do to be able to extract the subredditor names that come after this line and not before?
Find the <h3 class="listing-header">Subscribers</h3> element, then get its parent div; this limits the scope to the Subscribers div. Then find all divs whose class is listing-item and loop over them to get the text (the name) of the inner <a> element:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen  # Python 3 location of urlopen
redditPage1 = "http://redditlist.com/sfw"
r = urlopen(redditPage1).read()
soup = bs(r, 'lxml')
# string= is the current name of the older text= argument
for sub_div in soup.find("h3", string="Subscribers").parent.find_all('div', {"class": "listing-item"}):
    print(sub_div.find('a').getText())
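To see the scoping step in isolation, here is a small offline sketch. SAMPLE_HTML is a hand-written approximation of the page layout (two listing columns, only one headed "Subscribers"); the "Recent Activity" column and the AskReddit entry are made up for the example, and subscriber_names is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written approximation of the page: two columns, one per header.
SAMPLE_HTML = """
<div class="span4 listing">
  <h3 class="listing-header">Recent Activity</h3>
  <div class="listing-item" data-target-subreddit="pics">
    <span class="subreddit-url"><a class="sfw" href="http://reddit.com/r/pics">pics</a></span>
  </div>
</div>
<div class="span4 listing">
  <h3 class="listing-header">Subscribers</h3>
  <div class="listing-item" data-target-subreddit="funny">
    <span class="subreddit-url"><a class="sfw" href="http://reddit.com/r/funny">funny</a></span>
  </div>
  <div class="listing-item" data-target-subreddit="AskReddit">
    <span class="subreddit-url"><a class="sfw" href="http://reddit.com/r/AskReddit">AskReddit</a></span>
  </div>
</div>
"""


def subscriber_names(html):
    """Return subreddit names listed under the 'Subscribers' header only."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("h3", string="Subscribers")
    if header is None:
        return []
    items = header.parent.find_all("div", class_="listing-item")
    return [item.find("a").get_text(strip=True) for item in items]


names = subscriber_names(SAMPLE_HTML)
print(names)
```

The entry under the other header is skipped because the search starts from the parent of the matching h3 rather than from the whole document.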
To get the desired result with more readable code, you can also do it like this:
import requests
from lxml.html import fromstring
res = requests.get("http://redditlist.com/sfw").text
root = fromstring(res)
for container in root.cssselect(".listing"):
    if container.cssselect("h3:contains('Subscribers')"):
        for subreddit in container.cssselect(".listing-item"):
            print(subreddit.attrib['data-target-subreddit'])
Or with BeautifulSoup if you like:
import requests
from bs4 import BeautifulSoup
main_link = "http://redditlist.com/all?page={}"
for link in [main_link.format(page) for page in range(1,5)]:
    res = requests.get(link).text
    soup = BeautifulSoup(res,"lxml")
    for container in soup.select(".listing"):
        if container.select("h3")[0].text=="Subscribers":
            for subreddit in container.select(".listing-item"):
                print(subreddit['data-target-subreddit'])
Try this:
for div in soup.select('.span4.listing'):
    if div.h3.text.lower()=='subscribers':
        output = [(ss.select('a.sfw')[0].text, ss.select('.listing-stat')[0].text) for ss in div.select('.listing-item')]
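Since the question asks for the names ordered by subscriber count, the (name, stat) tuples still need their comma-separated counts converted to integers before sorting. A minimal sketch follows; the list literal stands in for the scraped output, and every value except the "funny" entry from the question is made up for illustration. parse_count is a hypothetical helper name.

```python
# Stand-in for the scraped (name, stat) tuples; only the "funny" row
# comes from the question, the other values are invented.
output = [
    ("AskReddit", "17,654,321"),
    ("funny", "18,197,786"),
    ("pics", "9,999"),
]


def parse_count(stat):
    """Convert a string such as '18,197,786' to an int."""
    return int(stat.replace(",", ""))


# Largest subscriber count first.
ranked = sorted(output, key=lambda pair: parse_count(pair[1]), reverse=True)
ranked_names = [name for name, _ in ranked]
print(ranked_names)
```

Sorting the raw strings would give the wrong order, because "9,999" compares greater than "18,197,786" as text.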

When scraping image url src, get data:image/jpeg;base64

I was trying to scrape the image url from a website using python urllib2.
Here is my code to get the html string:
req = urllib2.Request(url, headers = urllib2Header)
htmlStr = urllib2.urlopen(req, timeout=15).read()
When I view from the browser, the html code of the image looks like this:
<img id="main-image" src="http://abcd.com/images/41Q2VRKA2QL._SY300_.jpg" alt="" rel="" style="display: inline; cursor: pointer;">
However, when I read from the htmlStr I captured, the image was converted to base64 image, which looks like this:
<img id="main-image" src="....">
I am wondering why this happened. Is there a way to get the original image url rather than the base64 image string?
Thanks.
You could use BeautifulSoup.
Example:
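import urllib2
from bs4 import BeautifulSoup
url = "http://www.theurlyouwanttoscrape.com"  # urlopen needs the scheme
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
# the id in the page is "main-image" (hyphen, not underscore)
img_src = soup.find('img', {'id': 'main-image'})['src']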
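Parsing alone will still return the base64 string if that is what the server sent. One possible explanation, not confirmed by the question, is that the raw HTML carries an inline data: placeholder and a script later swaps in the real URL. If so, the real URL sometimes sits in another attribute of the same tag. The sketch below is written for Python 3 and rests on that assumption: the data-src attribute, the FALLBACK_ATTRS tuple, and the real_image_url helper are all hypothetical, so inspect the actual raw HTML to see which attribute, if any, holds the URL.

```python
from bs4 import BeautifulSoup

# Hypothetical raw markup: src holds an inline placeholder and the real
# URL is assumed to live in a data-src attribute.
SAMPLE_HTML = (
    '<img id="main-image" src="data:image/jpeg;base64,AAAA" '
    'data-src="http://abcd.com/images/41Q2VRKA2QL._SY300_.jpg">'
)

# Attribute names to try when src is a data: URI (assumed, not verified).
FALLBACK_ATTRS = ("data-src",)


def real_image_url(html, img_id="main-image"):
    """Return a non-data: URL for the image, or None if none is found."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img", id=img_id)
    if img is None:
        return None
    src = img.get("src", "")
    if src and not src.startswith("data:"):
        return src  # src already holds a normal URL
    for attr in FALLBACK_ATTRS:
        value = img.get(attr)
        if value:
            return value
    return None


url = real_image_url(SAMPLE_HTML)
print(url)
```

If no attribute in the raw HTML contains the URL, the value is probably set entirely by JavaScript, and a plain HTTP fetch will not see it.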
