python/scrapy for dynamic content

I am trying to write a Python/Scrapy script to get a list of ads from https://www.donedeal.ie/search/search?section=cars&adType=forsale&source=&sort=relevance%20desc&max=30&start=0; I'm interested in getting the URLs of individual ads. I found that the page makes an XHR POST request to https://www.donedeal.ie/search/api/v3/find/.
I tried the following in the Scrapy shell to test the idea:
from scrapy.http import FormRequest
url = 'https://www.donedeal.ie/search/api/v3/find/'
payload = {'section': "cars", 'adType': "forsale", 'source': "", 'sort': "relevance desc", 'area': '', 'max': '30', 'start':'0'}
req = FormRequest(url, formdata=payload)
fetch(req)
but I get no response. In Chrome dev tools I saw that this request returns a JSON response containing item ids which I could use to build the URLs myself.
I tried a Selenium approach as well, giving the page time to load the dynamic content, but that didn't seem to work either. Completely lost at this stage :(

The problem is with the call itself; the payload is almost OK.
The site you want to scrape accepts only a JSON payload, so you should change your FormRequest to something like this:
import json
from scrapy import Request

yield Request(url, method='POST',
              body=json.dumps(payload),
              headers={'Content-Type': 'application/json'})
This is because FormRequest is meant for simulating HTML forms (it sets the content type to application/x-www-form-urlencoded), not JSON calls.
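For completeness, a minimal spider along the same lines might look like the sketch below (untested; the 'ads' and 'id' key names in parse are assumptions about what the /find/ endpoint returns, so check the JSON in dev tools and adjust):
import json
import scrapy

class DoneDealAdsSpider(scrapy.Spider):
    name = 'donedeal_ads'

    def start_requests(self):
        payload = {'section': 'cars', 'adType': 'forsale', 'source': '',
                   'sort': 'relevance desc', 'area': [], 'max': 30, 'start': 0}
        # POST the JSON body instead of a form-encoded body.
        yield scrapy.Request(
            'https://www.donedeal.ie/search/api/v3/find/',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
        )

    def parse(self, response):
        data = json.loads(response.text)
        # 'ads' and 'id' are assumed key names; inspect the real response first.
        for ad in data.get('ads', []):
            yield {'ad_id': ad.get('id')}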

I was not able to create a working example with Scrapy.
However, I did come up with two other solutions for you.
In the examples below, response contains JSON data.
Working Example #1 using urllib2 — Tested with Python 2.7.10
import urllib2
url = 'https://www.donedeal.ie/search/api/v3/find/'
payload = '{"section":"cars","adType":"forsale","source":"","sort":"relevance desc","max":30,"start":0,"area":[]}'
req = urllib2.Request(url)
req.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(req, payload).read()
Working Example #2 using requests — Tested with Python 2.7.10 and 3.3.5 and 3.5.0
import requests
url = 'https://www.donedeal.ie/search/api/v3/find/'
payload = '{"section":"cars","adType":"forsale","source":"","sort":"relevance desc","max":30,"start":0,"area":[]}'
response = requests.post(url, data=payload, headers={'Content-Type': 'application/json'}).content
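Judging by the max=30&start=0 parameters in the original search URL, max and start look like a page size and offset, so paging through results should just be a matter of bumping start. A rough, untested sketch:
import json
import requests

url = 'https://www.donedeal.ie/search/api/v3/find/'
base = {"section": "cars", "adType": "forsale", "source": "",
        "sort": "relevance desc", "max": 30, "start": 0, "area": []}

for page in range(3):  # first three pages, as an illustration
    payload = dict(base, start=page * 30)
    r = requests.post(url, data=json.dumps(payload),
                      headers={'Content-Type': 'application/json'})
    print(page, r.status_code, len(r.content))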

Related

Scrape a JavaScript-heavy website with requests+bs4

I'm trying to scrape some data from a stock market site, but I don't like to use Selenium. I always trace the fetch requests and use the requests module.
On this site, I found the URLs which I think are responsible for fetching data for the frontend.
But those links can't be accessed directly (e.g. this link); they throw HTTP 405. Why is that?
HTTP 405 means Method Not Allowed: that endpoint rejects GET, but you can access the API by making a POST request.
Try this:
import json
import requests
response = requests.post('https://www.cse.lk/api/marketSummery')
json.loads(response.content)
Output
{'id': 30071974,
'tradeVolume': 5829973145.4,
'shareVolume': 231803649,
'tradeDate': 1660549800496}
The other APIs work too, e.g. this:
url = "https://www.cse.lk/api/dailyMarketSummery"
response = requests.post(url)
json.loads(response.content)
Output
[[{'id': 12893,
'tradeDate': 1660501800000,
'marketTurnover': 5829973000.0,
'marketTrades': 51086.0,
'marketDomestic': 50376.0,
'marketForeign': 710.0,
'equityTurnover': 5829973000.0,
'equityDomesticPurchase': 5709995000.0,
'equityDomesticSales': 5592159700.0,
'equityForeignPurchase': 119978192.0,
'equityForeignSales': 237813568.0,
'volumeOfTurnOverNumber': 231803648.0,
'volumeOfTurnoverDomestic': 226246512.0,
'volumeOfTurnoverForeign': 5320827,
'tradesNo': 51086,
...
import requests
from pprint import pp

def main(url):
    r = requests.post(url)
    pp(r.json())

main('https://www.cse.lk/api/marketSummery')
{'id': 30071974,
'tradeVolume': 5829973145.4,
'shareVolume': 231803649,
'tradeDate': 1660549800496}

Login to Internal Website and Scrape Data Behind Login Page

I am quite new to Python and have been writing some test scripts to figure things out. I have managed to write test scripts to log in to a webpage and scrape the data, however the code didn't seem to work on my company's internal webpage. First of all, I was getting an SSL error which I managed to fix.
The following code manages to scrape the login page, but I can't seem to pass the payload. I am not sure if there is a better way to do what I'm doing, as once I get past the login page the website is dynamic and mainly written in JavaScript.
import requests
import os.path
import time
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen
from urllib import request, parse
import ssl
url = "https://10.43.52.106/3/Authenticate"
context = ssl._create_unverified_context()
data = {'j_username': 'username',
        'j_password': 'Password',
        'j_newpassword': '',
        'j_newpasswordrepeat': '',
        'continue': '',
        'submit': 'submit form'
        }
data = parse.urlencode(data).encode()
req = request.Request(url, data=data)
response = request.urlopen("https://10.43.52.106/3/Console",
                           context=context)
print (response.read())
I appreciate any help provided and know it's going to be hard to test as it's an internal webpage.
This is also the payload that gets sent when logging in:
[screenshot of the login payload]
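For what it's worth, the usual requests.Session pattern for this kind of form login looks roughly like the sketch below (untested against this internal site; the field names and URLs are copied from the question, and verify=False plays the same role as the unverified SSL context above):
import requests

session = requests.Session()
session.verify = False  # same effect as the unverified SSL context above

login_url = "https://10.43.52.106/3/Authenticate"
data = {'j_username': 'username',
        'j_password': 'Password',
        'j_newpassword': '',
        'j_newpasswordrepeat': '',
        'continue': '',
        'submit': 'submit form'}

# Post the credentials first so the session picks up the auth cookies,
# then request the page behind the login with the same session.
session.post(login_url, data=data)
console = session.get("https://10.43.52.106/3/Console")
print(console.text)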

Python - Requests pulling HTML instead of JSON

I'm building a Python web scraper (personal use) and am running into some trouble retrieving a JSON file. I was able to find the request URL I need, but when I run my script (I'm using Requests) the URL returns HTML instead of the JSON shown in the Chrome Developer Tools console. Here's my current script:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url)
print(r.text)
Completely new to Python, so any push in the right direction is greatly appreciated. Thanks!
It looks like that website varies the response depending on the Accept header provided in the request. So try:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url, headers={'accept': 'application/json'})
print(r.json())
You can have a look at the full API for further reference: http://docs.python-requests.org/en/latest/api/.
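If you want to confirm that content negotiation is what's going on, you can compare the Content-Type of the two responses (a quick sanity check, not required for the scrape):
import requests

url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
html = requests.get(url)
data = requests.get(url, headers={'accept': 'application/json'})
# Expect something like text/html for the first and application/json for the second.
print(html.headers.get('Content-Type'))
print(data.headers.get('Content-Type'))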

Trigger data response from .aspx page

from bs4 import BeautifulSoup
from pprint import pprint
import requests
url = 'http://estadistico.ut.com.sv/OperacionDiaria.aspx'
s = requests.Session()
pagereq = s.get(url)
soup = BeautifulSoup(pagereq.content, 'lxml')
viewstategenerator = soup.find("input", attrs = {'id': '__VIEWSTATEGENERATOR'})['value']
viewstate = soup.find("input", attrs = {'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find("input", attrs = {'id': '__EVENTVALIDATION'})['value']
eventtarget = 'ASPxDashboardViewer1'
DXCss = '1_33,1_4,1_9,1_5,15_2,15_4'
DXScript = '1_232,1_134,1_225,1_169,1_187,15_1,1_183,1_182,1_140,1_147,1_148,1_142,1_141,1_143,1_144,1_145,1_146,15_0,15_6,15_7'
eventargument = {"Task":"Export","ExportInfo":{"Mode":"SingleItem","GroupName":"pivotDashboardItem1","FileName":"Generación+por+tipo+de+tecnología+(MWh)","ClientState":{"clientSize":{"width":509,"height":385},"titleHeight":48,"itemsState":[{"name":"pivotDashboardItem1","headerHeight":34,"position":{"left":11,"top":146},"width":227,"height":108,"virtualSize":'null',"scroll":{"horizontal":'true',"vertical":'true'}}]},"Format":"Excel","DocumentOptions":{"paperKind":"Letter","pageLayout":"Portrait","scaleMode":"AutoFitWithinOnePage","scaleFactor":1,"autoFitPageCount":1,"showTitle":'true',"title":"Operación+Diaria","imageFormatOptions":{"format":"Png","resolution":96},"excelFormatOptions":{"format":"Csv","csvValueSeparator":","},"commonOptions":{"filterStatePresentation":"None","includeCaption":'true',"caption":"Generación+por+tipo+de+tecnología+(MWh)"},"pivotOptions":{"printHeadersOnEveryPage":'true'},"gridOptions":{"fitToPageWidth":'true',"printHeadersOnEveryPage":'true'},"chartOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"pieOptions":{"autoArrangeContent":'true'},"gaugeOptions":{"autoArrangeContent":'true'},"cardOptions":{"autoArrangeContent":'true'},"mapOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"rangeFilterOptions":{"automaticPageLayout":'true',"sizeMode":"Stretch"},"imageOptions":{},"fileName":"Generación+por+tipo+de+tecnología+(MWh)"},"ItemType":"PIVOT"},"Context":"BwAHAAIkY2NkNWRiYzItYzIwNS00MDIyLTkzZjUtYWQ0NzVhYTM5Y2E3Ag9PcGVyYWNpb25EaWFyaWECAAIAAAAAAMByQA==","RequestMarker":1,"ClientState":{}}
postdata = {'__EVENTTARGET': eventtarget,
'__EVENTARGUMENT': eventargument,
'__VIEWSTATE': viewstate,
'__VIEWSTATEGENERATOR': viewstategenerator,
'__EVENTVALIDATION': eventvalidation,
'DXScript': DXScript,
'DXCss': DXCss
}
datareq = s.post(url, data = postdata)
print datareq.text
I'm trying to scrape data from this .aspx webpage. The page loads the data dynamically via javascript so scraping directly with requests/BeautifulSoup won't work.
By looking at the network traffic I can see that when you click the export (Exportar a) button for an element, select a type of export (excel, csv) then confirm a POST request is made to the page. It returns a base64 encoded string of the data I need. As far as I can tell there is no way to make a GET request for the file directly as it is only generated when requested.
What I'm trying to do is copy the POST request which triggers the csv response. So first I scrape for __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION. __EVENTTARGET, DXCss and DXScript look to be fixed. __EVENTARGUMENT is copied directly from the POST request.
My code returns a server application error. I'm thinking the problem is either a) a wrong __EVENTARGUMENT (maybe partly dynamic rather than fixed?), b) not really understanding how .aspx pages work, or c) what I'm trying to do isn't possible with these tools.
I did look at using selenium to trigger the data export but I couldn't see a way to capture the server response.
I was able to get help from someone who knows more about aspx pages than I do.
Here is a link to the GitHub gist that provides the solution:
https://gist.github.com/jarek/d73c672d8dd4ddb48d80bffc4d8038ba
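For reference, once the POST succeeds and you have the base64 encoded string described above, turning it into a CSV file only needs the standard library (a generic illustration, not specific to this site's response format):
import base64

# Placeholder base64 (decodes to a one-line CSV header), not real site data.
encoded = "ZmVjaGEsdGlwbyxNV2gK"
with open('export.csv', 'wb') as f:
    f.write(base64.b64decode(encoded))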

POST request via requests (python) not returning data

I have another question about posts.
This post should be almost identical to the Stack Overflow question 'Using request.post to post multipart form data via python not working', but for some reason I can't get it to work. The website is http://www.camp.bicnirrh.res.in/predict/. I want to post a file that is already in FASTA format to this website and select the 'SVM' option using requests in Python. This is based on what #NorthCat gave me previously, which worked like a charm:
import requests
import urllib
file={'file':(open('Bishop/newdenovo2.txt','r').read())}
url = 'http://www.camp.bicnirrh.res.in/predict/hii.php'
payload = {"algo[]":"svm"}
raw = urllib.urlencode(payload)
response = session.post(url, files=file, data=payload)
print(response.text)
Since it's not working, I assumed the payload was the problem. I've been playing with the payload, but I can't get any of these to work.
payload = {'S1':str(data), 'filename':'', 'algo[]':'svm'} # where I tried just reading the file in, called 'data'
payload = {'svm':'svm'} # not actually in the headers, but I tried this too
payload = {'S1': '', 'algo[]':'svm', 'B1': 'Submit'}
None of these payloads resulted in data.
Any help is appreciated. Thanks so much!
You need to set the file post variable name to "userfile", i.e.
file={'userfile':(open('Bishop/newdenovo2.txt','r').read())}
Note that the read() is unnecessary, but it doesn't prevent the file upload succeeding. Here is some code that should work for you:
import requests

session = requests.session()
response = session.post('http://www.camp.bicnirrh.res.in/predict/hii.php',
                        files={'userfile': ('fasta.txt', open('fasta.txt'), 'text/plain')},
                        data={'algo[]': 'svm'})
response.text contains the HTML results; save it to a file and view it in your browser, or parse it with something like Beautiful Soup and extract the results.
In the request I've specified a mime type of "text/plain" for the file. This is not necessary, but it serves as documentation and might help the receiving server.
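Continuing from the response above, saving and then parsing the returned page could look like this (a sketch; the row-based selector is only a guess at how the results are laid out):
from bs4 import BeautifulSoup

# Save the returned HTML so it can be opened in a browser.
with open('camp_results.html', 'w') as f:
    f.write(response.text)

# Or parse it directly; adjust the selector after inspecting the saved page.
soup = BeautifulSoup(response.text, 'html.parser')
for row in soup.find_all('tr'):
    print(row.get_text(strip=True))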
The content of my fasta.txt file is:
>24.6jsd2.Tut
GGTGTTGATCATGGCTCAGGACAAACGCTGGCGGCGTGCTTAATACATGCAAGTCGAACGGGCTACCTTCGGGTAGCTAGTGGCGGACGGGTGAGTAACACGTAGGTTTTCTGCCCAATAGTGGGGAATAACAGCTCGAAAGAGTTGCTAATACCGCATAAGCTCTCTTGCGTGGGCAGGAGAGGAAACCCCAGGAGCAATTCTGGGGGCTATAGGAGGAGCCTGCGGCGGATTAGCTAGATGGTGGGGTAAAGGCCTACCATGGCGACGATCCGTAGCTGGTCTGAGAGGACGGCCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAAGGAATATTCCACAATGGCCGAAAGCGTGATGGAGCGAAACCGCGTGCGGGAGGAAGCCTTTCGGGGTGTAAACCGCTTTTAGGGGAGATGAAACGCCACCGTAAGGTGGCTAAGACAGTACCCCCTGAATAAGCATCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGATGCAAGCGTTGTCCGGATTTACTGGGCGTAAAGCGCGCGCAGGCGGCAGGTTAAGTAAGGTGTGAAATCTCCCTGCTCAACGGGGAGGGTGCACTCCAGACTGACCAGCTAGAGGACGGTAGAGGGTGGTGGAATTGCTGGTGTAGCGGTGAAATGCGTAGAGATCAGCAGGAACACCCGTGGCGAAGGCGGCCACCTGGGCCGTACCTGACGCTGAGGCGCGAAGGCTAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTAGCAGTAAACGATGTCCACTAGGTGTGGGGGGTTGTTGACCCCTTCCGTGCCGAAGCCAACGCATTAAGTGGACCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGCGGAGCGTGTGGTTTAATTCGATGCGACGCGAAGAACCTTACCTGGGCTTGACATGCTATCGCAACACCCTGAAAGGGGTGCCTCCTTCGGGACGGTAGCACAGATGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGTATATCTAAGGAGACTGCCGGAGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGCATGGCTCTTACGTCCAGGGCTACACATACGCTACAATGGCCGTTACAGTGAGATGCCACACCGCGAGGTGGAGCAGATCTCCAAAGGCGGCCTCAGTTCAGATTGCACTCTGCAACCCGAGTGCATGAAGTCGGAGTTGCTAGTAACCGCGTGTCAGCATAGCGCGGTGAATATGTTCCCGGGTCTTGTACACACCGCCCGTCACGTCATGGGAGCCGGCAACACTTCGAGTCCGTGAGCTAACCCCCCCTTTCGAGGGTGTGGGAGGCAGCGGCCGAGGGTGGGGCTGGTGACTGGGACGAAGTCGTAACAAGGT
