I am trying to have a requests.get statement with two URLs in it. What I am aiming to do is have requests (the Python module) make two requests based on a list or two strings I provide. How can I pass multiple strings from a list into a requests.get statement, and have requests go to each URL (string) and do something?
Thanks
Typically, if we're talking about the Python requests library, it only runs one URL GET request at a time. If what you are trying to do is perform multiple requests with a list of known URLs, then it's quite easy.
import requests

my_links = ['https://www.google.com', 'https://www.yahoo.com']
my_responses = []
for link in my_links:
    # use .json() instead of .text if the endpoint returns JSON
    payload = requests.get(link).text
    print('got response from {}'.format(link))
    my_responses.append(payload)
    print(payload)
my_responses now has all the content from the pages.
You don't. The requests.get() method (or any other method, really) takes a single URL and makes a single HTTP request, because that is what most humans want it to do.
If you need to make two requests, you must call that method twice.
requests.get(url)
requests.get(another_url)
Of course, these calls are synchronous; the second will only begin once the first response is received.
Related
I'm trying to identify if a string like "data=sold" is present in a website.
Now I'm using requests and a while loop but I need it to be faster:
response = requests.get(link)
if ('data=sold' in response.text):
It works well, but it is not fast. Is there a way to "request" only the part of the website I need, to make the search faster?
I think your response.text is HTML, right? To avoid searching the raw string, you can try Beautiful Soup (doc here):
from bs4 import BeautifulSoup

html = response.text
bs = BeautifulSoup(html, 'html.parser')
# collect the data-sold attribute from every <ul> that carries it
[item['data-sold'] for item in bs.find_all('ul', attrs={'data-sold': True})]
You can see other references here.
Or maybe think about a parallel for loop in Python, so we can make many requests at the same time.
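A minimal sketch of that idea with concurrent.futures from the standard library (the URL list is a placeholder; the 'data=sold' check comes from the question):

import requests
from concurrent.futures import ThreadPoolExecutor

links = ['https://example.com/page1', 'https://example.com/page2']  # hypothetical URLs

def contains_sold(link):
    # download the page and report whether the marker string is present
    response = requests.get(link)
    return link, 'data=sold' in response.text

with ThreadPoolExecutor(max_workers=5) as executor:
    for link, found in executor.map(contains_sold, links):
        print(link, found)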
As already commented, it depends on the website/server whether you can request only a part of the page. Since it is a website, I would think it's not possible.
If the website is really, really big, the only way I can currently think of to make the search faster is to process the data just in time. When you call requests.get(link), the site will be downloaded before you can process the data. You could maybe try to call
r = requests.get(link, stream=True)
instead. And then iterate through all the lines:
for line in r.iter_lines():
    # iter_lines() yields bytes, so compare against a bytes literal
    if b'data=sold' in line:
        print("hooray")
Of course, you could also analyze the raw stream and just skip x bytes, use the aiohttp library, and so on. Maybe you need to give some more information about your problem.
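If you try the aiohttp route, a rough sketch of checking several pages concurrently could look like this (the URLs are placeholders, and this is only one possible shape of the code):

import asyncio
import aiohttp

async def check(session, link):
    # download the page and report whether the marker string is present
    async with session.get(link) as resp:
        text = await resp.text()
        return link, 'data=sold' in text

async def main(links):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(check(session, link) for link in links))

links = ['https://example.com/a', 'https://example.com/b']  # hypothetical URLs
print(asyncio.run(main(links)))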
I've written a script in Python using POST requests to fetch the JSON content from a webpage. The script is doing just fine if I only stick to its default page. However, my intention is to create a loop to collect the content from a few different pages. The only problem I'm struggling to solve is how to use the page keyword within the payload in order to loop through three different pages. Consider my faulty approach as a placeholder.
How can I use format within a dict in order to change page numbers?
Working script (if I get rid of the pagination loop):
import requests

link = 'https://nsv3auess7-3.algolianet.com/1/indexes/idealist7-production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=NSV3AUESS7&x-algolia-api-key=c2730ea10ab82787f2f3cc961e8c1e06'
for page in range(0,3):
    payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query="}
    res = requests.post(link,json=payload.format(page)).json()
    for item in res['hits']:
        print(item['name'])
I get an error when I run the script as it is:
res = requests.post(link,json=payload.format(page)).json()
AttributeError: 'dict' object has no attribute 'format'
format is a string method. You should apply it to the string value of your payload instead:
payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query=".format(page)}
res = requests.post(link,json=payload).json()
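Putting that fix back into the loop, a minimal sketch looks like this (the params string is abbreviated here for readability; use the full string from the question):

import requests

link = 'https://nsv3auess7-3.algolianet.com/1/indexes/idealist7-production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=NSV3AUESS7&x-algolia-api-key=c2730ea10ab82787f2f3cc961e8c1e06'
params_template = "getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&query="  # abbreviated

for page in range(0, 3):
    payload = {"params": params_template.format(page)}
    res = requests.post(link, json=payload).json()
    for item in res['hits']:
        print(item['name'])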
I am writing a scrapy crawler to scrape data from an e-commerce website.
The website has color variants, and each variant has its own price, sizes, and stock for those sizes.
To get the price, sizes, and stock for a variant, I need to visit the link of the variant (color).
And all the data needs to be in one record.
I have tried using requests but it is slow and sometimes fails to load the page.
I have written the crawler using requests.get() and used the response in scrapy.selector.Selector() for parsing the data.
My question is: is there any way to use scrapy.Request() to get the response where I use it, not in a callback function? I need the response at the same place, as below (something like below):
response = scrapy.Request(url=variantUrl)
sizes = response.xpath('sizesXpath').extract()
I know scrapy.Request() requires a parameter called callback=self.callbackparsefunction that will be called when scrapy generates the response, in order to handle that generated response. I do not want to use callback functions; I want to handle the response in the current function.
Or is there any way to return the response from the callback function to the function where scrapy.Request() is written, as below (something like below):
def parse(self, response):
    variants = response.xpath('variantXpath').extract()
    for variant in variants:
        res = scrapy.Request(url=variant, callback=self.parse_color)
        # use of the res response

def parse_color(self, response):
    return response
Take a look at scrapy-inline-requests package, I think it's exactly what you are looking for.
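A minimal sketch of how that package is typically used, with the spider name, start URL, and XPaths as placeholders taken from the question:

import scrapy
from inline_requests import inline_requests

class VariantSpider(scrapy.Spider):
    name = 'variants'
    start_urls = ['https://example.com/products']  # hypothetical listing page

    @inline_requests
    def parse(self, response):
        for variant in response.xpath('variantXpath').extract():
            # the decorator lets us yield a Request and receive its response inline
            variant_response = yield scrapy.Request(variant)
            sizes = variant_response.xpath('sizesXpath').extract()
            yield {'variant': variant, 'sizes': sizes}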
I am trying to automate a web page request by using mechanize in python.
When I add custom headers like
X-Session='abc'
and
X-Auth='123'
by using the addheaders attribute:
object=mechanize.Browser()
object.addheaders=[('X-Session','abc'),('X-Auth','123')]
It changes those headers to X-session and X-auth.
I believe that, because of this, the server is not able to authenticate me.
Can anybody help with how to maintain the case?
Thanks.
Mechanize expects a two-item tuple for each header, where the first item is the header name and the second is the value, so you must do:
object.addheaders=[('X-Session','abc'), ('X-Auth','123')]
(Two tuples of two elements instead of one tuple with 4 elements).
To check the headers that Mechanize will send with the request, you can do:
print(request.header_items())
This should print something like:
[('X-Session','abc'), ('X-Auth','123')]
Doc: http://wwwsearch.sourceforge.net/mechanize/doc.html#adding-headers
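For reference, a minimal sketch of the whole flow (the target URL is a hypothetical placeholder):

import mechanize

br = mechanize.Browser()
br.addheaders = [('X-Session', 'abc'), ('X-Auth', '123')]
# these headers are sent with every request this browser makes
response = br.open('https://example.com/protected')  # hypothetical URL
print(response.read()[:200])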
I have a site that I want to extract data from. The data retrieval is very straightforward.
It takes the parameters using HTTP POST and returns a JSON object. So, I have a list of queries that I want to do and then repeat at certain intervals to update a database. Is scrapy suitable for this or should I be using something else?
I don't actually need to follow links but I do need to send multiple requests at the same time.
What does the POST request look like? There are many variations, like simple query parameters (?a=1&b=2), form-like payload (the body contains a=1&b=2), or any other kind of payload (the body contains a string in some format, like JSON or XML).
In scrapy it is fairly straightforward to make POST requests; see: http://doc.scrapy.org/en/latest/topics/request-response.html#request-usage-examples
For example, you may need something like this:
# Warning: take care of the undefined variables and modules!
def start_requests(self):
    payload = {"a": 1, "b": 2}
    yield Request(url, self.parse_data, method="POST", body=urllib.urlencode(payload))

def parse_data(self, response):
    # do stuff with data...
    data = json.loads(response.body)
For handling requests and retrieving responses, scrapy is more than enough. And to parse JSON, just use the json module in the standard library:
import json
data = ...
json_data = json.loads(data)
Hope this helps!
Based on my understanding of the question, you just want to fetch/scrape data from a web page at certain intervals. Scrapy is generally used for crawling.
If you just want to make HTTP POST requests, you might consider using the Python requests library.
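A minimal sketch of that approach, assuming the endpoint, the query payloads, and the interval are placeholders you would replace:

import time
import requests

endpoint = 'https://example.com/api/search'  # hypothetical endpoint
queries = [{'q': 'first'}, {'q': 'second'}]  # hypothetical POST payloads

while True:
    for query in queries:
        # send the POST and decode the JSON response
        data = requests.post(endpoint, data=query).json()
        # ... update the database with `data` here ...
    time.sleep(3600)  # repeat at the desired interval (one hour here)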