Unable to scrape data on results list using scrapy - python

I am currently trying to scrape the links to the cars on this page.
I have run this XPath expression in the Chrome console, and it returns the links for each car:
$x('//div[@class="vehicle-make-model"]/h3/a/@href')
However, when I try the same XPath in the Scrapy shell, it does not return any of the links. This is what I run in the Scrapy shell:
response.xpath('//div[@class="vehicle-make-model"]/h3/a/@href')
Can somebody point out what I am doing wrong?

The XPaths that work in Chrome are run on top of the DOM that is built with JavaScript.
That's why sometimes one thing works in Chrome but does not work in scrapy shell.
This is the case in the page you linked. If you check out the source of the page (right-click and choose "View Page Source" or hit Ctrl-U), you will see the same data that Scrapy gets.
In this particular case, the data seems to all be in one JSON block, so you can extract the JSON and parse it using Python's json module, with something like:
import json

# Find the <script> block that defines window.jsonData and capture the
# JSON literal assigned to it.
raw_json = response.xpath(
    "//script[contains(., 'window.jsonData')]/text()"
).re(r'window\.jsonData\s*=\s*(.+);$')[0]
json_data = json.loads(raw_json)
Then you can use the data in the json_data to build the next requests or scrape whatever you need.
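For instance, inside the same spider callback (the "vehicles" and "url" keys below are hypothetical, as is parse_car; inspect json_data to find the real structure):

# Hypothetical structure: a list of car records, each carrying a link.
for car in json_data.get("vehicles", []):
    yield response.follow(car["url"], callback=self.parse_car)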
In case there wasn't an easily parseable JSON, another option would be to use the js2xml library to parse the JavaScript into a XML that you would be able to scrape using XPath.
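For example, a minimal sketch of the js2xml approach (the element names in the XPath are based on js2xml's documented XML output and should be verified against what pretty_print shows for this page):

import js2xml

# Raw JavaScript source of the <script> tag that holds the data.
script_text = response.xpath(
    "//script[contains(., 'window.jsonData')]/text()"
).get()

# js2xml turns the JavaScript into an lxml tree you can query with XPath.
parsed = js2xml.parse(script_text)
print(js2xml.pretty_print(parsed))  # inspect the generated XML first

# Illustrative query: find the value assigned to window.jsonData.
data_nodes = parsed.xpath("//assign[left//identifier[@name='jsonData']]/right/*")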

Related

Why do I get None or an empty list when using find or find_all in BeautifulSoup even though the tags do contain other tags and data

I am trying to make a coronavirus tracker using BeautifulSoup, just for some practice.
My code is:
import requests
from bs4 import BeautifulSoup
page=requests.get("https://sample.com")
soup=BeautifulSoup(page.content,'html.parser')
table=soup.find("div",class_="ZDcxi")
print(table)
In the output it's showing None, but the div tag with the class ZDcxi does have content.
Please help.
The data which you see in the browser, including the target div, is dynamic content, generated by scripts included with the page and run in the browser. If you just search for the class name in page.content, you will find it is not there.
What many people do is use selenium to open desired pages through Chrome (or another web browser), and then, after the page finishes loading and generating dynamic content, use BeautifulSoup to harvest the content from the browser, and continue processing from there.
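A minimal sketch of that workflow (assuming Chrome and a matching chromedriver are installed; the URL and class name are the ones from the question):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://sample.com")

# Once the page's scripts have run, the rendered HTML is available as
# page_source; hand that to BeautifulSoup. Slow pages may need an
# explicit wait before reading it.
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("div", class_="ZDcxi")
print(table)

driver.quit()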
Find out more at Requests vs Selenium Python, and also by searching for "selenium vs requests".

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
With the following code, but it returns an empty list:
response.xpath('//*[#id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
Because the site is dynamic: if you use the view(response) command in the Scrapy shell, you can see that the price info doesn't come out in the HTML Scrapy receives.
Solutions:
1. Splash
2. Selenium + PhantomJS
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values
The price is added later by the browser, which renders the page using JavaScript code found in the HTML. If you disable JavaScript in your browser, you will notice that the page looks a bit different. Also, take a look at the page source (usually that's unaltered) to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any JavaScript code. It receives the plain HTML, and that's what you have to work with.
If you want to extract data from pages that look the same as in the browser, I recommend using a headless browser like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
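A minimal sketch using the scrapy-splash plugin (this assumes a Splash instance is running and scrapy-splash is wired into settings.py as described in that project's README):

import scrapy
from scrapy_splash import SplashRequest

class PriceSpider(scrapy.Spider):
    name = "price"

    def start_requests(self):
        # The product page from the question (URL shortened here).
        url = "http://www.asos.com/au/fila/..."
        # Ask Splash to render the page (JavaScript included) before
        # handing the response back to Scrapy.
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # The rendered HTML now contains the price element.
        yield {
            "price": response.xpath(
                '//*[@id="product-price"]/div/span[2]/text()'
            ).get()
        }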
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this url by taking a look at all the XMLHttpRequest (XHR) requests sent in the Network tab found in Developers Tools (on Google Chrome).
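A hedged sketch of that approach (the URL is the one above; the structure of the returned JSON is an assumption based on what the endpoint returned at the time):

import requests

api_url = (
    "http://www.asos.com/api/product/catalogue/v2/stockprice"
    "?productIds=9065343&currency=AUD"
    "&keyStoreDataversion=0ggz8b-4.1&store=AU"
)

# The endpoint returns JSON directly, so there is no HTML to parse.
data = requests.get(api_url).json()

# Print the payload and adjust the key lookups to its actual structure.
print(data)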
You can try to find the JSON inside the HTML (using a regular expression) and parse it:
import json

# The product data is embedded in a view('...') call inside a <script> tag.
json_string = response.xpath(
    '//script[contains(., "function (view) {")]/text()'
).re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]

HTML Parsing with Python (HTML vs. complete website)

I'm trying to parse HTML from a website that contains information about train tickets and their prices (source below). However, I'm having an issue getting back all of the HTML when I use urllib to request it.
What I need is the price per ticket, which doesn't appear in the HTML I get from urllib. After doing some investigative work, I determined that if I save the webpage with Chrome and select "HTML only", I don't get the price, but if I select "Complete Webpage", I do. Is there any way to view the HTML that I get when I download the "Complete Webpage" and use that in Python? Or is there a way to automate downloading the complete webpage and parse the downloaded files in Python?
Thanks,
George
https://www.raileurope.com/en/us/point_to_point/ptp_results.htm?execution=e3s1&resultId=147840746&cobrand=public&saleCountry=us&resultId=147840746&cobrand=public&saleCountry=us&itemId=-1&fn=fsRequest&cobrand=public&c=USD&roundtrip=0&isAtocRequest=0&georequest=1&lang=en&route-type=0&from0=paris&to0=amsterdam&deptDate0=06%2F07%2F2017&time0=8&pass-question-radio=1&nCountries=&selCountry1=&selCountry2=&selCountry3=&selCountry4=&selCountry5=&familyId=&p=0&additionalTraveler0=adult&additionalTravelerAge0=&paxIds=&nA=1&nY=0&nC=0&nS=0
Take a look at Selenium.
Since the website is rendered by JavaScript, you will have to use a webdriver to simulate the click, as sketched below.
You will need a crawler instead of a simple scraper.
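A minimal sketch of that idea (assuming Firefox and geckodriver; the .price selector is hypothetical, so inspect the real results page for the right one, and substitute the full results URL from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
# Full ptp_results.htm URL from the question goes here.
driver.get("https://www.raileurope.com/en/us/point_to_point/ptp_results.htm?...")

# Wait until the page's JavaScript has rendered the results (and their
# prices) before reading the HTML. The selector is a placeholder.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".price"))
)

html = driver.page_source  # now contains the dynamically inserted prices
driver.quit()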

Web scraping for divs inserted by scripts

Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all divs with a particular class. However, the result is always empty, even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (using their class name) with BeautifulSoup? I want to eventually read and follow hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested in getting all the divs with class 'product-list-item'.
Try using selenium to run the JavaScript:
from selenium import webdriver

# Firefox executes the page's scripts, so page_source holds the
# fully rendered HTML.
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
You can get all the info by changing the URL of the request the page itself makes; that request can be found in Chrome dev tools > Network.
The reason you got nothing from that specific URL is simply that the info you need is not there.
So first, let me explain a little about how that page is loaded in a browser. When you request the page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (which is what you got from your urllib2 request). The browser then starts to read/parse that content; it basically tells the browser where to find everything else needed to render the whole page (e.g. CSS to control layout, additional JavaScript/URLs/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded, and the info you want is not at the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all traffic when that page loads (I would recommend Fiddler).
Lots of things happen when you open that page in a browser, and that's only part of the whole page-loading process! By educated guess, the info you need should be in one of the three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module can do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. The requests library is a great tool for this kind of job.
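A hedged sketch of that last step (the endpoint URL below is a placeholder; substitute the actual api.hm.com URL you see in the Network tab):

import json

import requests

# Placeholder URL: copy the real one from Chrome dev tools > Network.
api_url = "https://api.hm.com/..."

response = requests.get(api_url)
data = response.json()  # the endpoint already returns JSON

# Inspect the structure first; the keys depend entirely on the endpoint.
print(json.dumps(data, indent=2))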
Try this one:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')

scrapdiv = open('scrapdiv.txt', 'w')

# findAll returns every matching div in the raw (non-rendered) HTML.
product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()

read webpage search results with Python

I want to be able to generate automatic alerts for certain types of matches to a web search. The first step is to read the URL in Python, so that I can then parse it using BeautifulSoup or other regex-based methods.
For a page like the one in the example below, though, the HTML doesn't capture the results that I see when I open the page with a browser.
Is there a way to actually get the HTML with search results themselves?
import urllib

link = 'http://www.sas.com/jobs/USjobs/search.html'
f = urllib.urlopen(link)
myfile = f.read()  # raw HTML only; the search results are filled in by JavaScript
print myfile
You cannot get data that is generated dynamically by JavaScript using the traditional urllib, urllib2, or requests modules (or even mechanize, for that matter). You'll have to simulate a browser environment by using Selenium with Chrome or Firefox, or PhantomJS, to evaluate the JavaScript in the webpage.
Have a look at the Selenium bindings for Python.
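A minimal sketch of that approach (assuming Firefox and geckodriver are installed; the sleep is a crude stand-in for an explicit wait):

import time

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.sas.com/jobs/USjobs/search.html")

# Give the page's JavaScript time to populate the search results.
# WebDriverWait with an expected condition is more robust than sleeping.
time.sleep(5)

html = driver.page_source  # rendered HTML, including the results
driver.quit()

# Hand `html` to BeautifulSoup or regex-based parsing as before.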
