Scraping with Python: can't get the data I want

I am trying to scrape a website, but I've run into a problem: the HTML I get from Python differs from what I see in Chrome's inspector. I'm trying to scrape election results from http://edition.cnn.com/election/results/states/arizona/house/01. I used the script below to check the HTML of the page, and the classes I need, like section-wrapper, are missing.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Does anyone know what the problem is?

http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site uses JavaScript to fetch its data; you can check the URL above.
You can find this URL in Chrome's dev tools; there are many such links, check them out:
Chrome >> F12 >> Network tab >> F5 (refresh page) >> double-click the .json URL >> open in a new tab
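For example, a minimal sketch that fetches that endpoint directly with requests (assuming it returns plain JSON; the exact structure of the payload is not documented here):
import requests

# Fetch the JSON endpoint the page's JavaScript uses, skipping the HTML entirely
url = 'http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json'
results = requests.get(url).json()
print(results)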

import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
# You can try all sorts of tags here; I used class "ad" and class "ec-placeholder"
g_data = soup.find_all("div", {"class": "ec-placeholder"})
h_data = soup.find_all("div", {"class": "ad"})
for item in g_data:
    print(item)
# for item in h_data:
#     print(item)

Related

How to get simple information through a crawler

I am trying to make a simple crawler that scrapes this https://en.wikipedia.org/wiki/Web_scraping page, then extracts the 19 links from the "See also" section. I managed to do this; however, I am also trying to extract the first paragraph from each of those 19 links, and this is where it stops working: I get the same paragraph from the first page, not from each one. This is what I have so far. I know there might be better options for doing this, but I want to stick to BeautifulSoup and simple Python code.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text
soup = BeautifulSoup(data, 'html.parser')

def visit():
    try:
        p = soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')

links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit()
Example of the first print
Now visiting: https://en.wikipedia.org/wiki/OpenSocial
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
The intended behavior is that it prints the first paragraph for every new link visited, not the same paragraph from the first link each time. What do I need to do to fix this? Any tips on what I am missing? I am fairly new to Python, so I am still learning the concepts as I work on things.
At the top of your code you define data and soup. Both are tied to https://en.wikipedia.org/wiki/Web_scraping.
Every time you call visit(), you print from soup, and soup never changes.
You need to pass the url to visit(), e.g. visit(url_to_visit). The visit function should accept the url as an argument, then visit the page using requests, and create a new soup from the returned data, then print the first paragraph.
Edited to add code explaining my original answer:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

# Renamed this to start_url to make it clear that this is the source page
start_url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get(start_url).text
soup = BeautifulSoup(data, 'html.parser')

def visit(new_url):  # the function now accepts a url as an argument
    try:
        new_data = requests.get(new_url).text  # retrieve the text from the url
        new_soup = BeautifulSoup(new_data, 'html.parser')  # parse the retrieved html with Beautiful Soup
        p = new_soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')

links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(start_url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit(url_to_visit)  # here's where we pass each url to the visit() function

Using Beautiful Soup in Python to check availability of a product online

I am using Python 2.7 and version 4.5.1 of Beautiful Soup.
I'm at my wit's end trying to make this very simple script work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the HTML of the product's page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag which is generated client side by JavaScript; it shows up using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests is pulling back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup, you'll see it probably returns something like Access Denied. This is because Best Buy requires an acceptable User-Agent header on the GET request. Since you do not specify a User-Agent in the headers, the site does not return the page.
Here is a link on generating a User-Agent:
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
Or you could look up the User-Agent your own browser sends when viewing the page:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
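As a minimal sketch, you might pass a browser-like User-Agent via the headers argument of requests.get (the string below is just an illustration; substitute whatever your own browser sends):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # illustrative value
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})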
Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson
sku = 10488665 # look at the URL of the web page, it is <blablah>/10488665.aspx
# change locations to get the right store
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus=%s&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'%sku)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']

BeautifulSoup (bs4) does not find all tags

I'm using Python 3.5 and bs4
The following code will not retrieve all the tables from the specified website. The page has 14 tables, but the code finds only 2. I have no idea what's going on. I manually inspected the HTML and can't find a reason why it's not working. There doesn't seem to be anything special about each table.
import bs4
import requests
link = "http://www.pro-football-reference.com/players/B/BradTo00.htm"
htmlPage = requests.get(link)
soup = bs4.BeautifulSoup(htmlPage.content, 'html.parser')
all_tables = soup.findAll('table')
print(len(all_tables))
What's going on?
EDIT: I should clarify. If I inspect the soup variable, it contains all of the tables that I expected to see. Why am I not able to extract those tables from soup with the findAll method?
This page is rendered by JavaScript; if you disable JavaScript in your browser, you will notice that it only has two tables.
I recommend using Selenium for this situation.
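A minimal sketch of that approach, assuming Selenium and a Firefox driver are installed (the browser renders the page first, then the resulting DOM is handed to bs4):
import bs4
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.pro-football-reference.com/players/B/BradTo00.htm")
# page_source holds the DOM after JavaScript has run, unlike requests' raw response
soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
print(len(soup.findAll('table')))
driver.quit()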

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with Python and the beautifulsoup library, but I'm stuck. The link is on the following page, in the sidebar area, directly underneath the h4 subtitle "Original Source":
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full HTML of the last element I've isolated, when I'm trying to simply get the link. Of note, this is the only link on the page I'm trying to get.
You're looking for the link, which is the href HTML attribute. source_url is a bs4.element.Tag, which has a get method:
source_url.get('href')
You almost got it!!
SOLUTION 1:
You just have to use the .text attribute on the tag you've assigned to source_url.
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
You should call source_url.get('href') to get just the href attribute of the tag you found.
print(source_url.get('href'))
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense

How to scrape dynamic webpages by Python

[What I'm trying to do]
Scrape the webpage below for used car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1
[Issue]
I want to scrape all of the pages. At the URL above, only the first 30 items are shown; those can be scraped by the code below, which I wrote. Links to the other pages are displayed like 1 2 3..., but the link addresses seem to be generated by JavaScript. I googled for useful information but couldn't find any.
from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
soup = BeautifulSoup(html, "lxml")

total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    # title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    # price of car itself
    print(soup.find(class_='price1').string)
    # price of car including tax
    print(soup.find(class_='price2').string)
    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)
[What I'd like to know]
How to scrape all of the pages. I prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please show me other ones.
[My environment]
Windows 8.1
Python 3.5
PyDev (Eclipse)
BeautifulSoup4
Any guidance would be appreciated. Thank you.
You can use Selenium, as in the sample below:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://example.com')
element = driver.find_element_by_class_name("yourClassName") #or find by text or etc
element.click()
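Once the page has rendered (or after clicking a pagination link), you can hand the browser's DOM back to BeautifulSoup and reuse your existing parsing code. A minimal sketch, assuming the pagination links can be located by their link text:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1')
driver.find_element_by_link_text('2').click()  # assumes the page numbers are plain link text
soup = BeautifulSoup(driver.page_source, "lxml")  # parse the rendered DOM as before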
The Python module splinter may be a good starting point. It drives an external browser (such as Firefox) and accesses the browser's DOM, rather than dealing with the static HTML only.
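A minimal splinter sketch under the same assumptions (Firefox installed; the pagination links found by their text):
from splinter import Browser

browser = Browser('firefox')
browser.visit('http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1')
browser.find_by_text('2').first.click()  # assumes the page numbers are plain text links
html = browser.html  # the rendered DOM, ready to feed to BeautifulSoup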
