I am new to web scraping, and I want to scrape all the text content of just the home page.
This is my code, but it is not working correctly.
from bs4 import BeautifulSoup
import requests
website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")
full_text = soup.find_all()
print(full_text)
When I print full_text it gives me a lot of HTML content, but not all of it: when I Ctrl+F for "traiteurcheminfaisant#hotmail.com", the email address that is on the home page (in the footer)
is not found in full_text.
Thank you for helping!
A quick glance at the website that you're attempting to scrape makes me suspect that not all content is loaded when sending a simple GET request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with JavaScript.
If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load, and then parse the fully loaded source code. For this, the most common tool is Selenium. It can be a bit tricky to set up the first time, since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set it up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Point this at the geckodriver binary you downloaded for Firefox
driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')

# Crude wait to give the JavaScript time to finish rendering
time.sleep(2)

source = driver.page_source
driver.quit()

soup = BeautifulSoup(source, 'html.parser')
full_text = soup.find_all()
print(full_text)
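As a side note: find_all() with no arguments returns a list of every tag object, not plain text. Since you said you want all the text content of the home page, BeautifulSoup's standard get_text() method may be closer to what you're after once the fully rendered source is in soup:

print(soup.get_text(separator=' ', strip=True))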
I haven't used BeautifulSoup before, but try using urlopen instead. This will store the webpage as a string, which you can search to find the email.
from urllib.request import urlopen

try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding="UTF8", errors='ignore')
    # str.find returns the index of the first match, or -1 if it is not found
    print(html.find("traiteurcheminfaisant#hotmail.com"))
except Exception:
    print("Cannot open webpage")
I am trying to make a simple web scraper that takes information from an HTML page. It's simple, but I have a problem I can't seem to solve:
When I download the HTML page myself and parse it using BeautifulSoup, it parses everything and gives me all the data. That works, but it isn't what I need. Instead, I am trying to fetch the page from its link, which doesn't seem to be working. Whenever I open the link using the urlopen function and parse the page using BeautifulSoup, it always seems to completely ignore/exclude some lists and tables from the HTML file. These tables appear when I look at the page online using "Inspect Element", and they also appear when I download the HTML page myself, but they never appear when I use the urlopen function. I even tried encoding POST data and sending it as an argument of the function, but it doesn't seem to work that way either.
import bs4
from urllib.request import urlopen as uReq
from urllib.parse import urlencode as uEnc
from bs4 import BeautifulSoup as soup
my_url = 'https://sp.com.sa/en/tracktrace/?tid=RB961555017SG'
#data = {'tid':'RB961555017SG'}
#sdata = uEnc(data)
#sdata = bytearray(sdata, 'utf-8')
uClient = uReq(my_url, timeout=2)  # open the URL and download the web page
page_html = uClient.read()         # save the HTML in page_html
uClient.close()                    # close the connection
page_soup = soup(page_html, "html.parser")  # parse the HTML
updates = page_soup.findAll("div",{"class":"col-sm-12"})
#updates = page_soup.findAll("ol", {})
print(updates)
These tables contain the information I need. Is there any way I can fix this?
requests works a bit differently than a browser; e.g., it does not actually run JavaScript.
In this case the table with the info is generated by a script rather than hardcoded in the HTML. You can see the actual source code using "view-source:" followed by the URL:
view-source:https://sp.com.sa/en/tracktrace/?tid=RB961555017SG
So we'd want to run that script somehow. The easiest way is to use Selenium, which drives your browser. Then simply take the loaded HTML and run it through BS4. I noticed that there is a more specific tag you can use than "col-sm-12". Hope it helps :)
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time

my_url = 'https://sp.com.sa/en/tracktrace/?tid=RB961555017SG'

chrome_path = "path of the chromedriver that fits your current browser version"
driver = webdriver.Chrome(chrome_path)
driver.get(my_url)
time.sleep(5)  # to make sure the page is fully loaded

page_html = driver.page_source
driver.quit()

page_soup = soup(page_html, "html.parser")  # parse the HTML
# The table is more specific to the data you want than "col-sm-12", so I'd rather use that.
updates = page_soup.findAll("table", {"class": "table-shipment table-striped"})
print(updates)
I am trying to use Beautiful Soup 4 to help me download an image from Imgur, although I doubt the Imgur part is relevant. As an example, I'm using the webpage here: https://imgur.com/t/lenovo/mLwnorj
My code is as follows:
import webbrowser, time, sys, requests, os, bs4 # Not all libraries are used in this code snippet
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
res = requests.get("https://imgur.com/t/lenovo/mLwnorj")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
imageElement = soup.findAll('img', {'class': 'post-image-placeholder'})
print(imageElement)
The HTML code on the Imgur link contains a part that reads as:
<img alt="" src="//i.imgur.com/JfLsH5y.jpg" class="post-image-placeholder" style="max-width: 100%; min-height: 546px;" original-title="">
which I found by picking the first image element on the page using the point and click tool in Inspect Element.
The problem is that I would expect there to be two items in imageElement, one for each image; however, the print function shows []. I have also tried other forms of soup.findAll('img', {'class': 'post-image-placeholder'}), such as soup.findall("img[class='post-image-placeholder']"), but that made no difference.
Furthermore, when I used
imageElement = soup.select("h1[class='post-title']")
just to test, the print function did return a match, which made me wonder if it had something to do with the tag:
[<h1 class="post-title">Cable management increases performance. </h1>]
Thank you for your time and effort
The fundamental problem here seems to be that the actual <img ...> element is not present when the page is first loaded. The best solution to this, in my opinion, would be to take advantage of the selenium webdriver that you already have available to grab the image. Selenium will allow the page to properly render (with JavaScript and all), and then locate whatever elements you care about.
For example:
import webbrowser, time, sys, requests, os, bs4 # Not all libraries are used in this code snippet
from selenium import webdriver
# For pretty debugging output
import pprint
browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
# Give the page up to 10 seconds of a grace period to finish rendering
# before complaining about images not being found.
browser.implicitly_wait(10)
# Find elements via Selenium's search
selenium_image_elements = browser.find_elements_by_css_selector('img.post-image-placeholder')
pprint.pprint(selenium_image_elements)
# Use page source to attempt to find them with BeautifulSoup 4
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
soup_image_elements = soup.findAll('img', {'class': 'post-image-placeholder'})
pprint.pprint(soup_image_elements)
I cannot say that I have tested this code yet on my side, but the general concept should work.
Update:
I went ahead and tested this on my side, fixed some errors in the code, and then got the results I was hoping to see.
If a website inserts objects after page load, you will need to use Selenium instead of requests.
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://imgur.com/t/lenovo/mLwnorj'
browser = webdriver.Firefox()
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
images = soup.find_all('img', {'class': 'post-image-placeholder'})
for image in images:
    print(image['src'])
# //i.imgur.com/JfLsH5yr.jpg
# //i.imgur.com/lLcKMBzr.jpg
I am new to web scraping/coding, and I am trying to use Python requests/BeautifulSoup to parse through the HTML in order to get some physical and chemical properties.
For some reason, although I have used the following script successfully on other websites, BeautifulSoup has only printed a few lines from the header and footer, and then pages of HTML code that doesn't really make sense. This is the code I have been using:
import requests
from bs4 import BeautifulSoup
url='https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties'
response = requests.get(url).text
soup = BeautifulSoup(response, 'lxml')
print(soup.prettify())
When I try to find the table or even a row, it gives no output. Is there something I haven't accounted for? Any help would be greatly appreciated!
It is present in one of the attributes. You can extract it as follows (there is a lot more info there, but I subset to the physical properties):
import requests
from bs4 import BeautifulSoup as bs
import json
url = "https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties"
r = requests.get(url)
soup = bs(r.content, 'lxml')
# The page stores its data as JSON inside a data-result attribute
data = json.loads(soup.select_one('[data-result]')['data-result'])
properties = data['physprop']
print(properties)
It's pretty common that if a page is populated by JavaScript after it loads, requests and BeautifulSoup will not see the page correctly. The best thing to do is likely to switch to the selenium module, which allows your program to dynamically access the page and interact with elements. After loading (and maybe clicking on a couple of elements) you can feed the HTML to BeautifulSoup and process it however you wish. The basic framework I recommend you start with would look like this:
from selenium import webdriver

browser = webdriver.Chrome()  # you'll need to download ChromeDriver and have it on your PATH
browser.implicitly_wait(10)   # probably unnecessary, just makes sure pages you visit fully load
browser.get('https://stips.co.il/explore')

while True:
    input('Press Enter to print HTML')
    HTML = browser.page_source
    print(HTML)
Just click around in the browser, and when you want to see if the HTML is correct, click back to your prompt and press Enter. That covers the manual route; below is how you would locate elements automatically, so you don't have to interact with the page by hand every time.
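Here is a minimal sketch of locating and clicking an element automatically instead of by hand; the CSS selector is hypothetical, so substitute whatever element you find with Inspect Element:

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get('https://stips.co.il/explore')

# Hypothetical selector: replace it with a real one you find via Inspect Element
button = browser.find_element_by_css_selector('button.load-more')
button.click()

# Once the page has reacted, grab the rendered source for BeautifulSoup
HTML = browser.page_source
print(HTML)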
I am using Python 2.7 and version 4.5.1 of Beautiful Soup.
I'm at my wits' end trying to make this very simple script work. My goal is to get the information on the online availability status of the NES console from Best Buy's website by parsing the HTML of the product's page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag which is generated client-side by JavaScript; it shows up when using 'Inspect' on the loaded page, but not when viewing the page source, which is what the call to requests pulls back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup you'll see it probably returns something like Access Denied. This is because Best Buy requires an allowable User-Agent when making the GET request. As you do not have a User-Agent specified in the header, it is not returning anything.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
or you could figure out the User-Agent your own browser sends when you are viewing the webpage:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
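As a rough sketch, reusing the requests/BeautifulSoup setup from the question (the User-Agent string below is just an example; any current browser's UA should do):

import requests
from bs4 import BeautifulSoup

# Example desktop User-Agent; swap in the one from your own browser if you like
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
print(avail)

Keep in mind that, as the other answers note, the availability div may still be rendered client-side, in which case the header alone won't be enough.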
Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson
sku = 10488665 # look at the URL of the web page, it is <blablah>/10488665.aspx
# change locations to get the right store
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus=%s&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'%sku)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']
I am trying to get a set of URLs (which are web pages) from the New York Times, but I get a different answer than I expect. I am sure that I gave the correct class, yet it extracts different classes. My ny_url.txt has:
http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis
http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis/since1851/allresults/2/
Here is my code:
import urllib2
import urllib
from cookielib import CookieJar
from bs4 import BeautifulSoup
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
text_file = open('ny_url.txt', 'r')
for line in text_file:
    print line
    soup = BeautifulSoup(opener.open(line))
    links = soup.find_all('div', attrs={'class': 'element2'})
    for href in links:
        print href
Well, it's not that simple.
The data you are looking for is not in the page source downloaded by urllib2.
Try printing opener.open(line).read() and you will find the data to be missing.
This is because the site makes another GET request to http://query.nytimes.com/svc/cse/v2pp/sitesearch.json?query=isis&page=1
where your query parameters are passed within the URL: query=isis and page=1.
The data fetched is in JSON format; try opening the URL above in your browser manually. You will find your data there.
So a pure pythonic way would be to call this URL and parse the JSON to get what you want.
No rocket science needed - just parse the dict using the proper keys.
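A minimal sketch of that approach (the layout of the JSON response is an assumption here; open the URL in your browser first and adjust the keys to match what you actually see):

import urllib2
import json

query_url = 'http://query.nytimes.com/svc/cse/v2pp/sitesearch.json?query=isis&page=1'
response = urllib2.urlopen(query_url)
data = json.loads(response.read())

# Inspect the structure before picking keys; the exact layout is an assumption
print data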
OR
An easier way would be to use a webdriver like Selenium: navigate to the page and parse the page source using BeautifulSoup. That should easily fetch the entire content.
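A rough sketch of that route, assuming Selenium and a Firefox geckodriver are already set up (the URL is the first one from your ny_url.txt):

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis')
time.sleep(5)  # crude wait for the client-side script to render the results

soup = BeautifulSoup(driver.page_source, 'html.parser')
links = soup.find_all('div', attrs={'class': 'element2'})
for href in links:
    print href

driver.quit()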
Hope that helps. Let me know if you need more insights.