I'm making a simple script that sends HTTP requests over and over again to poll a response that can change at any time, and it's a little slow. I've tried to speed it up with asyncio, but it didn't work for me.
import requests

r = requests.Session()
url = 'https://example.com'
headers = {'user-agent': 'example'}  # the header must be a key/value pair, not a single string

def a():
    while True:
        # poll the URL until the response changes
        post = r.get(url, headers=headers).text
        if post == 'something':
            print(5)
        elif post == 'something-else':
            quit()

a()
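For what it's worth, requests is a blocking library, so wrapping this loop in asyncio by itself won't speed it up; an async HTTP client such as aiohttp is usually needed. A minimal sketch of the same polling loop with aiohttp (the URL, header, and expected response strings are the placeholders from the question):

import asyncio
import aiohttp

url = 'https://example.com'
headers = {'user-agent': 'example'}

async def poll():
    # reuse one connection for every request instead of reconnecting each time
    async with aiohttp.ClientSession(headers=headers) as session:
        while True:
            async with session.get(url) as resp:
                post = await resp.text()
            if post == 'something':
                print(5)
            elif post == 'something-else':
                return

asyncio.run(poll())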
Thanks for reading. For a small research project, I'm trying to gather some data from KBB (www.kbb.com). However, I always get "urllib.error.HTTPError: HTTP Error 400: Bad Request". I think I can access different websites with this simple piece of code, so I'm not sure whether the issue is with the code or with this specific website.
Maybe someone can point me in the right direction.
from urllib import request as urlrequest
proxy_host = '23.107.176.36:32180'
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
req = urlrequest.Request(url)
req.set_proxy(proxy_host, 'https')
page = urlrequest.urlopen(req)
print(page)
There are two issues here but one solution, as I found below.
The first is the proxy server, which refuses the request.
The second is that the server seems to need some authentication; in every case it responds with 403 Forbidden.
Using urllib
from urllib import request as urlrequest
proxy_host = '23.107.176.36:32180'
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
req = urlrequest.Request(url)
# req.set_proxy(proxy_host, 'https')
page = urlrequest.urlopen(req)
print(page)
> urllib.error.HTTPError: HTTP Error 403: Forbidden
Using Requests
import requests
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
res = requests.get(url)
print(res)
# >>> <Response [403]>
Using Postman
Edit: Solution
Setting a slightly longer timeout makes it work. However, I had to retry several times, because the proxy sometimes just doesn't respond.
import urllib.request
proxy_host = '23.107.176.36:32180'
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
proxy_support = urllib.request.ProxyHandler({'https' : proxy_host})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
res = urllib.request.urlopen(url, timeout=1000)  # set a long timeout (in seconds) for the slow proxy
print(res.read())
Result
b'<!doctype html><html lang="en"><head><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=5,minimum-scale=1"><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefetch preconnect" href="//securepubads.g.doubleclick.net" crossorigin><link rel="dns-prefetch preconnect" href="//c.amazon-adsystem.com" crossorigin><link .........
Using Requests
import requests
proxy_host = '23.107.176.36:32180'
url = "https://www.kbb.com/gmc/canyon-extended-cab/2018/"
# NOTE: we need a longer timeout for the proxy to respond, and verify=False to get around an SSL error
r = requests.get(url, proxies={"https": proxy_host}, timeout=90, verify=False)  # requests timeouts are in seconds
print(r.text)
Your code appears to work fine without the set_proxy statement, so I think it is most likely that your proxy server is rejecting the request rather than KBB.
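If you want to check that theory, one quick way is to point the proxy at an IP-echo endpoint on its own before involving KBB; httpbin.org/ip is just one convenient choice for this:

import requests

proxy_host = '23.107.176.36:32180'

try:
    # httpbin echoes back the IP it sees; it should be the proxy's address, not yours
    r = requests.get('https://httpbin.org/ip',
                     proxies={'https': proxy_host},
                     timeout=30)
    print(r.json())
except requests.exceptions.RequestException as exc:
    print('proxy did not respond:', exc)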
I'm using a proxy service to cycle requests with different proxy IPs for web scraping. Do I need to build in functionality to pace or end requests so as not to overload the web server I'm scraping?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import concurrent.futures
list_of_urls = ['https://www.example']
NUM_RETRIES = 3
NUM_THREADS = 5
def scrape_url(url):
    params = {'api_key': 'API_KEY', 'url': url}
    response = None
    # send request to scraperapi, and automatically retry failed requests
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## escape for loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = None
    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:
        ## do stuff
        pass

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)
Hi, if you are using the latest version of requests, then it is most probably keeping the TCP connection alive. What you can do is create a session and configure it not to keep connections alive, then proceed normally with your code:
s = requests.session()
s.config['keep_alive'] = False
As discussed here, there really isn't such a thing as an HTTP connection and what httplib refers to as the HTTPConnection is really the underlying TCP connection which doesn't really know much about your requests at all. Requests abstracts that away and you won't ever see it.
The newest version of Requests does in fact keep the TCP connection alive after your request. If you do want your TCP connections to close, you can just configure the requests not to use keep-alive.
Alternatively
s = requests.session(config={'keep_alive': False})
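Note that session.config only exists in very old versions of requests; it was removed around the 1.0 rewrite, so on a current release both snippets above will fail. A rough modern equivalent, assuming the goal is simply to have the connection closed after each request, is to send a Connection: close header:

import requests

s = requests.Session()
# ask the server to close the TCP connection after each response
s.headers.update({'Connection': 'close'})
response = s.get('http://api.scraperapi.com/')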
Updated version of your code
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import concurrent.futures
list_of_urls = ['https://www.example']
NUM_RETRIES = 3
NUM_THREADS = 5
def scrape_url(url):
    params = {'api_key': 'API_KEY', 'url': url}
    s = requests.session()
    s.config['keep_alive'] = False
    response = None
    # send request to scraperapi, and automatically retry failed requests
    for _ in range(NUM_RETRIES):
        try:
            response = s.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## escape for loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = None
    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:
        ## do stuff
        pass

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)
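On the original question about overloading the target server: nothing in this snippet limits the request rate, so if you want to be polite you have to add that yourself. A minimal sketch, assuming a fixed per-thread delay is acceptable (the half-second value and the scrape_url_politely wrapper are just illustrative):

import time

REQUEST_DELAY = 0.5  # seconds between requests per thread; tune to what the site tolerates

def scrape_url_politely(url):
    scrape_url(url)  # the function defined above
    time.sleep(REQUEST_DELAY)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url_politely, list_of_urls)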
I am trying to scrape a website using Python requests. We can only scrape the website using proxies, so I implemented the code for that. However, it is banning all my requests even when I am using proxies, so I used the website https://api.ipify.org/?format=json to check whether the proxies are working properly or not. I found it showing my original IP even while using proxies. The code is below.
from concurrent.futures import ThreadPoolExecutor
import string, random
import requests
import sys
http = []
# load the proxies into the list
with open(sys.argv[1], "r", encoding="utf-8") as data:
    for i in data:
        http.append(i[:-1])

url = "https://api.ipify.org/?format=json"

def fetch(session, url):
    for i in range(5):
        proxy = {'http': 'http://' + random.choice(http)}
        try:
            with session.get(url, proxies=proxy, allow_redirects=False) as response:
                print("Proxy : ", proxy, " | Response : ", response.text)
            break
        except requests.exceptions.RequestException:
            pass

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=1) as executor:
        with requests.Session() as session:
            executor.map(fetch, [session] * 100, [url] * 100)
            executor.shutdown(wait=True)
I tried a lot but don't understand why my own IP address is being shown instead of the proxy's IPv4 address. You will find the output of the code here: https://imgur.com/a/z02uSvi
The problem is that you have set a proxy for http but are sending the request to a website which uses https. The solution is simple:
proxies = dict.fromkeys(('http', 'https', 'ftp'), 'http://' + random.choice(http))
# You can set proxy for session
session.proxies.update(proxies)
response = session.get(url)
# Or you can pass proxy as argument
response = session.get(url, proxies=proxies)
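Folded back into the fetch function from the question, that could look roughly like this (the proxy list contents are placeholders; in the question they come from the file given on the command line):

import random
import requests

http = ['23.107.176.36:32180']  # placeholder; load your proxy list from the file as before

def fetch(session, url):
    for _ in range(5):
        # use the same proxy for both http and https so https URLs are covered too
        proxies = dict.fromkeys(('http', 'https'), 'http://' + random.choice(http))
        try:
            response = session.get(url, proxies=proxies, allow_redirects=False)
            print("Proxy :", proxies['https'], "| Response :", response.text)
            break
        except requests.exceptions.RequestException:
            pass

with requests.Session() as session:
    fetch(session, "https://api.ipify.org/?format=json")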
I am working on data scraping and machine learning. I am new to both Python and Scraping. I am trying to scrape this particular site.
https://www.space-track.org/
From what I have observed, they execute several scripts between the login and the next page, and that is how they get the table data. I am able to log in successfully and then, with the session, get the data from the next page as well; what I am missing is the data they get from executing the scripts in between. I need the data from the table
satcat
and to achieve pagination. Following is my code:
import requests
from bs4 import BeautifulSoup
import urllib
from urllib.request import urlopen
import html2text
import time
from requests_html import HTMLSession
from requests_html import AsyncHTMLSession
with requests.Session() as s:
    #s = requests.Session()
    session = HTMLSession()
    url = 'https://www.space-track.org/'
    headers = {'User-Agent':'Mozilla/5.0(X11; Ubuntu; Linux x86_64; rv:66.0)Gecko/20100101 Firefox/66.0'}
    login_data = { "identity": "",
                   "password": "",
                   "btnLogin": "LOGIN"
                 }
    login_data_extra = {"identity": "", "password": ""}

    preLogin = session.get(url + 'auth/login', headers=headers)
    time.sleep(3)
    print('*******************************')
    print('\n')
    print('data to retrieve csrf cookie')
    #print(preLogin.text)
    #soup = BeautifulSoup(preLogin.content,'html.parser')
    #afterpretty = soup.prettify()
    #login_data['spacetrack_csrf_token'] = soup.find('input',attrs={'name':'spacetrack_csrf_token'})['value']
    csrf = dict(session.cookies)['spacetrack_csrf_cookie']
    #csrf = p.headers['Set-Cookie'].split(";")[0].split("=")[-1]
    login_data['spacetrack_csrf_token'] = csrf
    #print(login_data)
    # html = open(p.content).read()
    # print (html2text.html2text(p.text))
    #login_data['spacetrack_csrf_token'] = soup.find('spacetrack_csrf_token"')
    #print(login_data)
    login = session.post(url+'auth/login', data=login_data, headers=headers, allow_redirects=True)
    time.sleep(1)
    print('****************************************')
    print('\n')
    print('login api status code')
    print(login.url)
    #print(r.url)
    #print(r.content)
    print('******************************')
    print(' ')
    print(' ')
    print('\n')
    print('data post login')
    #async def get_pyclock():
    #    r = await session.get(url)
    #    await r.html.arender()
    #    return r
    #postLogin = session.run(get_pyclock)
    time.sleep(3)
    postLogin = session.get(url)
    postLogin.html.render(sleep=5, keep_page=True)
As you can see, I have used the requests_html library to render the HTML, but I have been unsuccessful in getting the data. This is the URL that is executed internally in JS and that returns my data:
https://www.space-track.org/master/loadSatCatData
Can anyone help me with how to scrape that data or run that JavaScript?
Thank you :)
You can go for Selenium. It has a function browser.execute_script(). This will help you to execute the script. Hope this helps :)
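A rough sketch of that approach, assuming Selenium with a Chrome driver, that the login form fields are named as in the question's login_data (identity, password, btnLogin), and that fetching /master/loadSatCatData from the logged-in page returns the table data; none of this is confirmed against the live site:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('https://www.space-track.org/auth/login')

# fill in and submit the login form
driver.find_element(By.NAME, 'identity').send_keys('your_username')
driver.find_element(By.NAME, 'password').send_keys('your_password')
driver.find_element(By.NAME, 'btnLogin').click()
time.sleep(5)  # crude wait for the post-login scripts to finish

# run JavaScript inside the logged-in page and hand the result back to Python
sat_data = driver.execute_async_script("""
    const done = arguments[arguments.length - 1];
    fetch('/master/loadSatCatData')
        .then(r => r.text())
        .then(done);
""")
print(sat_data)

driver.quit()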
This question has been addressed in various shapes and flavors, but I have not been able to apply any of the solutions I read online.
I would like to use Python to log into the site: https://app.ninchanese.com/login
and then reach the page: https://app.ninchanese.com/leaderboard/global/1
I have tried various things but without success...
Using POST method:
import urllib
import requests
oURL = 'https://app.ninchanese.com/login'
oCredentials = dict(email='myemail#hotmail.com', password='mypassword')
oSession = requests.session()
oResponse = oSession.post(oURL, data=oCredentials)
oResponse2 = oSession.get('https://app.ninchanese.com/leaderboard/global/1')
Using the authentication argument from the requests package:
import requests
oSession = requests.session()
oResponse = oSession.get('https://app.ninchanese.com/login', auth=('myemail#hotmail.com', 'mypassword'))
oResponse2 = oSession.get('https://app.ninchanese.com/leaderboard/global/1')
Whenever I print oResponse2, I can see that I'm always on the login page, so I am guessing the authentication did not work.
Could you please advise how to achieve this?
You have to send the csrf_token along with your login request:
import urllib
import requests
import bs4
URL = 'https://app.ninchanese.com/login'
credentials = dict(email='myemail#hotmail.com', password='mypassword')
session = requests.session()
response = session.get(URL)
html = bs4.BeautifulSoup(response.text, 'html.parser')
credentials['csrf_token'] = html.find('input', {'name':'csrf_token'})['value']
response = session.post(URL, data=credentials)
response2 = session.get('https://app.ninchanese.com/leaderboard/global/1')
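To sanity-check that the login took, you could inspect the second response before parsing it; a small heuristic sketch, based on the observation above that failed logins land back on the login page:

# if we are still on the login page, the POST above most likely failed
if 'login' in response2.url or response2.status_code != 200:
    print('Login appears to have failed:', response2.status_code, response2.url)
else:
    leaderboard = bs4.BeautifulSoup(response2.text, 'html.parser')
    print(leaderboard.title.text if leaderboard.title else 'Leaderboard page fetched')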