I'm unable to scrape images from the website www.kissmanga.com. I'm using Python 3 with the Requests and BeautifulSoup libraries. The scraped image tags have a blank "src".
Source:
from bs4 import BeautifulSoup
import requests
import cfscrape  # needed for the create_scraper() call below

scraper = cfscrape.create_scraper()  # Cloudflare workaround attempt; ends up unused here
url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"
response = requests.get(url)
soup2 = BeautifulSoup(response.text, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
for img in divImage.findAll('img'):
    print(img)
response.close()
I think image scraping is prevented because I believe the website uses Cloudflare. On that assumption, I also tried the "cfscrape" library to scrape the content.
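The cfscrape attempt looked roughly like this (a sketch of the library's documented usage, reusing the url above):
import cfscrape

scraper = cfscrape.create_scraper()  # drop-in replacement for a requests.Session
response = scraper.get(url)  # cfscrape solves Cloudflare's JS challenge before returning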
You need to wait for JavaScript to inject the HTML for the images.
Multiple tools are capable of doing this; here are some of them:
Ghost
PhantomJS (Ghost Driver)
Selenium
I was able to get it working with Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()

# it takes forever to load the page, therefore we are setting a threshold
driver.set_page_load_timeout(5)

try:
    driver.get("http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206")
except TimeoutException:
    # never ignore exceptions silently in real-world code
    pass

soup2 = BeautifulSoup(driver.page_source, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})

# close the browser
driver.close()

for img in divImage.findAll('img'):
    print(img.get('src'))
Refer to How to download image using requests if you also want to download these images.
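A minimal download sketch with requests might look like this (the URL is a placeholder for one of the scraped src values):
import requests

img_url = 'http://example.com/page-001.jpg'  # placeholder: substitute a scraped src here
res = requests.get(img_url, stream=True)
res.raise_for_status()
with open('page-001.jpg', 'wb') as f:
    # stream the body in chunks so large images don't sit in memory at once
    for chunk in res.iter_content(chunk_size=8192):
        f.write(chunk)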
Have you tried setting a custom user-agent?
It's typically considered unethical to do so, but so is scraping manga.
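If you want to try it anyway, a minimal sketch with requests (the User-Agent string is just an example browser signature):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get('http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206', headers=headers)
print(response.status_code)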
I know that this code works for other websites that end in .com.
However, I noticed that the code doesn't work if I try to parse websites that end in .kr.
Can somebody help me find out why this is happening, and suggest an alternative way to parse these types of websites?
Following is my code.
import requests
from bs4 import BeautifulSoup
URL = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='container')
print(results)
The URL here is a link to my timetable. I need to parse this website so that I can easily collect the information for the subjects and data relevant to the subject (duration, location, professor's name, etc.).
Thanks
The website serves its content dynamically and you get an empty response back; you may use Selenium.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find(id='container')
print(results)
driver.close()
I have been trying to dynamically scrape a news website using Python and return the text of the live headlines. For now, I have decided to just return the divs. I have had intermittent success: if I run the code at least three times in quick succession, it returns what I am looking for. However, when run once, it returns the text "Loading articles..." instead of the headlines. I have tried adding delays to the code (I thought it might have to do with the connection, or with the articles still loading in the automated browser, but that wasn't the case). Any suggestions?
Here's the code:
import bs4 as bs
import urllib.request
from urllib.request import Request, urlopen
from selenium import webdriver
import time
url = 'https://newsfilter.io/latest/merger-and-acquisitions'
browser = webdriver.Chrome('C:\\Users\\sam\\Documents\\chromedriver_win32\\chromedriver.exe')
browser.get(url)
sauce= browser.execute_script('return document.documentElement.outerHTML')
browser.quit()
soup = bs.BeautifulSoup(sauce, 'lxml')
for i in soup.find_all('div'):
    print(i.text)
The contents are loaded dynamically; I would loop over the website continuously and scrape the data like this:
# SWAMI KARUPPASWAMI THUNNAI
import bs4 as bs
import urllib.request
from urllib.request import Request, urlopen
from selenium import webdriver
import time
url = 'https://newsfilter.io/latest/merger-and-acquisitions'
browser = webdriver.Chrome()
browser.get(url)
unique_div = []
while True:
    soup = bs.BeautifulSoup(browser.page_source, 'html.parser')
    for i in soup.find_all('div'):
        if i.text not in unique_div:
            unique_div.append(i.text)
            print(i.text)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
unique_div will contain unique elements.
Note: the above program does not know when to stop, so you can compare the length of unique_div before and after each scraping pass. If the length remains the same, then no new elements were found on the website. Something like this:
unique_div = []
while True:
    previous_length = len(unique_div)
    time.sleep(3)
    soup = bs.BeautifulSoup(browser.page_source, 'html.parser')
    for i in soup.find_all('div'):
        if i.text not in unique_div:
            unique_div.append(i.text)
            #print(i.text)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    after_scraped = len(unique_div)
    if previous_length == after_scraped:
        print("Scraped Everything")
        break
I would go for WebDriverWait instead of time.sleep(secs) anyway. See Waits in Selenium.
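A sketch of what that could look like here, assuming the headlines render as links whose selector you confirm in the dev tools (the 'a.article-link' selector below is a placeholder):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until a headline element is present,
# then continue immediately instead of always sleeping a fixed interval.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a.article-link'))  # placeholder selector
)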
I have been digging on the site for some time and I'm unable to find a solution to my issue. I'm fairly new to web scraping and am trying to simply extract some links from a web page using Beautiful Soup.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.sofascore.com/pt/futebol/2018-09-18"
page = urlopen(url).read()
soup = BeautifulSoup(page, "lxml")
print(soup)
At the most basic level, all I'm trying to do is access a specific tag within the website. I can work out the rest for myself, but the part I'm struggling with is that a tag I am looking for is not in the output.
For example: using the built-in find() I can grab the following div class:
class="l__grid js-page-layout"
However, what I'm actually looking for are the contents of a tag embedded at a lower level in the tree:
js-event-list-tournament-events
When I perform the same find operation on the lower-level tag, I get no results.
Using an Azure-based Jupyter Notebook, I have tried a number of the solutions to similar problems on Stack Overflow, with no luck.
Thanks!
Kenny
The page uses JS to load the data dynamically, so you have to use Selenium. Check the code below.
Note: you have to install Selenium and chromedriver (unzip the file and copy it into your Python folder).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.sofascore.com/pt/futebol/2018-09-18"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={'class': 'js-event-list-tournament-events'})
print(container)
Or you can use their JSON API:
import requests
url = 'https://www.sofascore.com/football//2018-09-18/json'
r = requests.get(url)
print(r.json())
I had the same problem and the following code worked for me. Chromedriver must be installed!
import time
from bs4 import BeautifulSoup
from selenium import webdriver
chromedriver_path= "/Users/.../chromedriver"
driver = webdriver.Chrome(chromedriver_path)
url = "https://yourURL.com"
driver.get(url)
time.sleep(3) #if you want to wait 3 seconds for the page to load
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
You can then use this soup as usual.
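For instance, to pull links out of it (a generic sketch; swap in whatever tag or class your page needs):
for link in soup.find_all('a'):
    print(link.get('href'))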
I am trying to use Beautiful Soup 4 to help me download an image from Imgur, although I doubt the Imgur part is relevant. As an example, I'm using the webpage here: https://imgur.com/t/lenovo/mLwnorj
My code is as follows:
import webbrowser, time, sys, requests, os, bs4 # Not all libraries are used in this code snippet
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
res = requests.get("https://imgur.com/t/lenovo/mLwnorj")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
imageElement = soup.findAll('img', {'class': 'post-image-placeholder'})
print(imageElement)
The HTML code on the Imgur link contains a part that reads as:
<img alt="" src="//i.imgur.com/JfLsH5y.jpg" class="post-image-placeholder" style="max-width: 100%; min-height: 546px;" original-title="">
which I found by picking the first image element on the page using the point and click tool in Inspect Element.
The problem is that I would expect there to be two items in imageElement, one for each image, however, the print function shows []. I have also tried other forms of soup.findAll('img', {'class': 'post-image-placeholder'}) such as soup.findall("img[class='post-image-placeholder']") but that made no difference.
Furthermore, when I used
imageElement = soup.select("h1[class='post-title']")
just to test, the print function did return a match, which made me wonder if it had something to do with the tag:
[<h1 class="post-title">Cable management increases performance. </h1>]
Thank you for your time and effort
The fundamental problem here seems to be that the actual <img ...> element is not present when the page is first loaded. The best solution to this, in my opinion, would be to take advantage of the selenium webdriver that you already have available to grab the image. Selenium will allow the page to properly render (with JavaScript and all), and then locate whatever elements you care about.
For example:
import webbrowser, time, sys, requests, os, bs4 # Not all libraries are used in this code snippet
from selenium import webdriver
# For pretty debugging output
import pprint
browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
# Give the page up to 10 seconds of a grace period to finish rendering
# before complaining about images not being found.
browser.implicitly_wait(10)
# Find elements via Selenium's search
selenium_image_elements = browser.find_elements_by_css_selector('img.post-image-placeholder')
pprint.pprint(selenium_image_elements)
# Use page source to attempt to find them with BeautifulSoup 4
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
soup_image_elements = soup.findAll('img', {'class': 'post-image-placeholder'})
pprint.pprint(soup_image_elements)
I cannot say that I have tested this code yet on my side, but the general concept should work.
Update:
I went ahead and tested this on my side, fixed some errors in the code, and then got the results I was hoping to see.
If a website inserts objects after page load, you will need to use Selenium instead of requests.
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://imgur.com/t/lenovo/mLwnorj'
browser = webdriver.Firefox()
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
images = soup.find_all('img', {'class': 'post-image-placeholder'})
for image in images:
    print(image['src'])
# //i.imgur.com/JfLsH5yr.jpg
# //i.imgur.com/lLcKMBzr.jpg
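Note that these src values are protocol-relative (they start with //), so prefix a scheme before downloading. A small follow-up sketch, saving each file under its Imgur name:
import requests

for image in images:
    img_url = 'https:' + image['src']
    filename = img_url.rsplit('/', 1)[-1]  # e.g. JfLsH5yr.jpg
    res = requests.get(img_url)
    res.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(res.content)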