HTTP POST and parsing JSON with Scrapy - python

I have a site that I want to extract data from. The data retrieval is very straight forward.
It takes the parameters using HTTP POST and returns a JSON object. So, I have a list of queries that I want to do and then repeat at certain intervals to update a database. Is scrapy suitable for this or should I be using something else?
I don't actually need to follow links but I do need to send multiple requests at the same time.

What does the POST request look like? There are many variations, like simple query parameters (?a=1&b=2), a form-like payload (the body contains a=1&b=2), or some other kind of payload (the body contains a string in a format such as JSON or XML).
Making POST requests in Scrapy is fairly straightforward; see: http://doc.scrapy.org/en/latest/topics/request-response.html#request-usage-examples
For example, you may need something like this:
# Warning: take care of the undefined variables and modules!
# (on Python 3, use urllib.parse.urlencode instead of urllib.urlencode)
import json
import urllib
from scrapy import Request

def start_requests(self):
    payload = {"a": 1, "b": 2}
    yield Request(url, self.parse_data, method="POST",
                  body=urllib.urlencode(payload))

def parse_data(self, response):
    # do stuff with the data...
    data = json.loads(response.body)
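Which body encoding to use depends on what the endpoint expects. A minimal sketch of the two common options, reusing the example payload above (the endpoint itself is not shown; these are just the body strings you would pass):

```python
import json

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in the snippet above

payload = {"a": 1, "b": 2}

# Form-style body (what urlencode produces), for form-like endpoints:
form_body = urlencode(payload)
print(form_body)  # a=1&b=2

# JSON-style body, for endpoints that expect a JSON payload instead:
json_body = json.dumps(payload)
print(json_body)  # {"a": 1, "b": 2}
```

If you send a JSON body, remember to also set a Content-Type: application/json header on the request.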

Scrapy is more than enough for sending requests and retrieving responses. To parse the JSON, just use the json module from the standard library:
import json
data = ...
json_data = json.loads(data)
Hope this helps!

Based on my understanding of the question, you just want to fetch/scrape data from a web page at certain intervals. Scrapy is generally used for crawling.
If you just want to make HTTP POST requests, you might consider using the Python requests library.
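For the "repeat at certain intervals" part of the question, a hedged sketch of a simple polling loop (the fetch function and queries are placeholders, not from the question; with requests installed, fetch could be something like lambda q: requests.post(url, data=q).json()):

```python
import time

def poll(fetch, queries, interval_seconds, rounds):
    # Run every query each round, sleeping between rounds.
    # `fetch` is injected so this sketch stays self-contained; swap in a
    # real HTTP call when wiring it up.
    results = []
    for _ in range(rounds):
        for query in queries:
            results.append(fetch(query))
        time.sleep(interval_seconds)
    return results

# Demo with a stand-in fetcher that just echoes the query back:
demo = poll(lambda q: {"echo": q}, [{"a": 1}, {"a": 2}],
            interval_seconds=0, rounds=2)
print(len(demo))  # 4
```

For long-running jobs, a scheduler such as cron is usually a better fit than an in-process sleep loop.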

Related

No JSON object could be decoded (Requests + Pandas)

I am learning to work with the requests library and pandas, but I have been struggling to get past the starting point even with a good amount of examples online.
I am trying to extract NBA shot data from the URL below using a GET request, and then turn it into a DataFrame:
import requests
import pandas

def extractData():
    Harden_data_url = "https://stats.nba.com/events/?flag=3&CFID=33&CFPARAMS=2017-18&PlayerID=201935&ContextMeasure=FGA&Season=2017-18&section=player&sct=hex"
    response = requests.get(Harden_data_url)
    data = response.json()

    shots = data['resultSets'][0]['rowSet']
    headers = data['resultSets'][0]['headers']
    df = pandas.DataFrame.from_records(shots, columns=headers)
However, I get this error on the line response = requests.get(Harden_data_url):
ValueError: No JSON object could be decoded
I imagine I am missing something basic, any debugging help is appreciated!
The problem is that you are using the wrong URL for fetching the data.
The URL you used was for the HTML page, which controls the layout of the site. The data comes from a different URL, which returns it in JSON format.
The correct URL for the data you are looking for is this:
https://stats.nba.com/stats/shotchartdetail?CFID=33&CFPARAMS=2017-18&ContextMeasure=FGA&DateFrom=&DateTo=&EndPeriod=10&EndRange=28800&GameID=&GameSegment=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID=201935&PlayerPosition=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&StartPeriod=1&StartRange=0&TeamID=0&VsConference=&VsDivision=
If you open it in a browser, you will see only the raw JSON data, which is exactly what your code will receive, and the snippet above will then work properly.
This blog post explains the method to find the data URL, and although the API has changed a little since the post was written, the method still works:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
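Once the data URL returns JSON, the resultSets parsing from the question works as-is. A sketch against a stubbed payload (the column names and rows below are made up for illustration; they only mimic the shape of the real response):

```python
import pandas

# Hypothetical stub shaped like the stats.nba.com payload (values made up).
data = {
    "resultSets": [
        {
            "headers": ["PLAYER_ID", "LOC_X", "LOC_Y", "SHOT_MADE_FLAG"],
            "rowSet": [
                [201935, -12, 45, 1],
                [201935, 88, 130, 0],
            ],
        }
    ]
}

# Same extraction as in the question: rows plus column names -> DataFrame.
shots = data["resultSets"][0]["rowSet"]
headers = data["resultSets"][0]["headers"]
df = pandas.DataFrame.from_records(shots, columns=headers)
print(df.shape)  # (2, 4)
```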

How to use Scrapy Request and get response at same place?

I am writing a Scrapy crawler to scrape data from an e-commerce website.
The website has color variants, and each variant has its own price, sizes, and stock for those sizes.
To get the price, sizes, and stock for a variant, you need to visit the link of the variant (color).
And all the data needs to be in one record.
I have tried using requests but it is slow and sometimes fails to load the page.
I have written the crawler using requests.get() and use the response in the scrapy.selector.Selector() and parsing data.
My question is: is there any way to use scrapy.Request() and get the response where I use it, not in a callback function? I need the response at the same place, as below (something like this):
response = scrapy.Request(url=variantUrl)
sizes = response.xpath('sizesXpath').extract()
I know scrapy.Request() requires a parameter callback=self.callbackparsefunction
that is called when Scrapy generates the response, to handle that response. I do not want to use callback functions; I want to handle the response in the current function.
Or is there any way to return the response from the callback function to the function where scrapy.Request() is written, as below (something like this):
def parse(self, response):
    variants = response.xpath('variantXpath').extract()
    for variant in variants:
        res = scrapy.Request(url=variant, callback=self.parse_color)
        # use of the res response

def parse_color(self, response):
    return response
Take a look at the scrapy-inline-requests package; I think it's exactly what you are looking for.
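That package works by turning the callback into a generator: you write response = yield scrapy.Request(url) and the decorator feeds the downloaded response back into the generator at that exact line. A toy, Scrapy-free illustration of the mechanism (the fetch function and URL are stand-ins, not real networking, and this is not the package's actual code):

```python
def fetch(url):
    # Stand-in for the engine actually downloading the page.
    return "response for {}".format(url)

def run_inline(gen):
    # Minimal trampoline: feed each yielded request's response back in.
    result = None
    try:
        request = next(gen)
        while True:
            request = gen.send(fetch(request))
    except StopIteration as stop:
        result = stop.value
    return result

def parse():
    # Inside the generator, the response is available "at the same place",
    # which is what the question is asking for.
    response = yield "http://example.com/variant"
    return response

print(run_inline(parse()))  # response for http://example.com/variant
```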

Passing Multiple URLs (strings) Into One request.get statement Python

I am trying to have a requests.get statement with two URLs in it. What I am aiming to do is have requests (the Python module) make two requests based on a list of two strings I provide. How can I pass multiple strings from a list into a requests.get statement, and have requests go to each URL (string) and do something?
Thanks
Typically, if we are talking about the Python requests library, it only runs one URL GET request at a time. If what you are trying to do is perform multiple requests with a list of known URLs, then it's quite easy:
import requests

my_links = ['https://www.google.com', 'https://www.yahoo.com']
my_responses = []
for link in my_links:
    payload = requests.get(link).text
    print('got response from {}'.format(link))
    my_responses.append(payload)
    print(payload)
my_responses now has all the content from the pages.
You don't. The requests.get() method (or any other method, really) takes a single URL and makes a single HTTP request, because that is what most humans want it to do.
If you need to make two requests, you must call that method twice.
requests.get(url)
requests.get(another_url)
Of course, these calls are synchronous: the second will only begin once the first response has been received.
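If the sequential round trips become a bottleneck, one common option (a sketch, not part of the answer above) is to run the blocking calls in a thread pool; pool.map preserves input order:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=4):
    # Run the blocking fetch calls concurrently; results come back in
    # the same order as `urls`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

# With requests installed, `fetch` would be e.g.:
#   fetch = lambda url: requests.get(url).text
# Demo with a stand-in fetcher so the sketch stays self-contained:
results = fetch_all(['http://a.example', 'http://b.example'],
                    fetch=lambda url: 'body of {}'.format(url))
print(results)
```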

python scrape webpage and parse the content

I want to scrape the data on this link
http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json
I am not sure what type of link this is: HTML, JSON, or something else. Sorry for my bad web knowledge. But I tried to use the following code to scrape it:
import requests
url='http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source=requests.get(url).text
The type of source is unicode. I also tried to use urllib2 to scrape it:
source2=urllib2.urlopen(url).read()
The type of source2 is str. I am not sure which method is better, because the link is not like a normal web page containing different tags. If I want to clean the scraped data and form DataFrame data (like a pandas DataFrame), what method or process should I follow?
Thanks.
The returned response is text containing valid JSON data. You can validate it yourself using a service like http://jsonlint.com/ if you want; just copy the code within the brackets of
return_json("JSON code to copy")
In order to make use of that data, you just need to parse it in your program. Here is an example: https://docs.python.org/2/library/json.html
The response is text. It does contain JSON; you just need to extract it:
import json
import requests

# Strip the JSONP wrapper return_json( ... ); around the payload
strip_len = len("return_json(")
source = requests.get(url).text[strip_len:-2]
source = json.loads(source)
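Once the JSONP wrapper is stripped and the JSON parsed, building a DataFrame depends on the payload's shape. A hedged sketch against a made-up structure (the field names below are assumptions for illustration, not taken from the actual RealClearPolitics feed):

```python
import json
import pandas

# Hypothetical JSONP body shaped like a poll feed (field names made up).
body = ('return_json({"rcp_avg": ['
        '{"date": "2016-01-20", "candidate": [{"name": "A", "value": 41.2}]},'
        '{"date": "2016-01-21", "candidate": [{"name": "A", "value": 41.5}]}'
        ']});')

# Same stripping technique as above: drop "return_json(" and ");".
data = json.loads(body[len("return_json("):-2])

# Flatten the nested structure into one row per (date, candidate) pair.
rows = [
    {"date": entry["date"], "name": c["name"], "value": c["value"]}
    for entry in data["rcp_avg"]
    for c in entry["candidate"]
]
df = pandas.DataFrame(rows)
print(df.shape)  # (2, 3)
```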

HTML data grabbing in python?

I'm fairly new to programming and I am trying to take data from a webpage and use it in my python code. Basically, I'm trying to take the price of an item for a game by having python grab the data whenever I run my code, if that makes sense. Here's what I'm struggling with in particular:
The HTML page I'm using is for runescape, namely
http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151
This page provides a bunch of dictionaries from which I am trying to extract the price of the item in question. All I really want to do is get all of this data into Python so I can then manipulate it. My current code is:
import urllib2
response =urllib2.urlopen('http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151')
print response
And it outputs:
<addinfourl at 49631760 whose fp = <socket._fileobject object at 0x02F4B2F0>>
whereas I just want it to display exactly what is on the URL in question.
Any ideas? I'm sorry if my formatting is terrible. And if it sounds like I have no idea what I'm talking about, it's because I don't.
If the web page returns JSON-encoded data, then do something like this:
import urllib2
import json
response = urllib2.urlopen("http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151")
data = json.load(response)
print(data)
Extract the relevant keys in the data variable to get the values you want.
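For instance, the price would be pulled out by indexing into the nested dictionaries. A sketch against a stubbed response (the key names here mirror what the itemdb API appears to return, but treat them as assumptions and check them against the real output of print(data)):

```python
import json

# Hypothetical stub shaped like the itemdb response (keys are assumptions).
raw = ('{"item": {"name": "Abyssal whip",'
       ' "current": {"trend": "neutral", "price": "82.5k"}}}')
data = json.loads(raw)

# Walk the nested dictionaries down to the value you care about.
price = data["item"]["current"]["price"]
print(price)  # 82.5k
```

Note that the API returns some prices as abbreviated strings like "82.5k", so you may need an extra parsing step to turn them into numbers.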