I'm having trouble getting requests to use the proxy address when requesting a website. No error is returned, but when I have the script fetch http://ipecho.net/plain, I see my own IP, not that of the proxy.
import random
import requests
import time

def proxy():
    proxy = (random.choice(proxies)).strip()
    print("selected proxy: {0}".format(proxy))
    url = 'http://ipecho.net/plain'
    data = requests.get(url, proxies={"https": proxy})
    print(data)
    print("data returned: {0}".format(data.text))

proxies = []
with open("proxies.txt", "r") as fi:
    for line in fi:
        proxies.append(line)

while True:
    proxy()
    time.sleep(5)
The structure of the proxies.txt file is as follows:
https://95.215.111.184:3128
https://79.137.80.210:3128
Can anyone explain this behaviour?
The URL you are passing is http, but you only provide an https key in your proxies dictionary, so requests falls back to a direct connection for http URLs. You need to create a key in your proxies dictionary for both http and https. These can point to the same value.
proxies = {'http': 'http://proxy.example.com', 'https': 'http://proxy.example.com'}
data = requests.get(url, proxies=proxies)
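Applied to your proxy() function, a minimal sketch (keeping your proxies list and only changing the proxies mapping) would be:

def proxy():
    # strip the trailing newline read from proxies.txt
    selected = random.choice(proxies).strip()
    print("selected proxy: {0}".format(selected))
    url = 'http://ipecho.net/plain'
    # supply the same proxy for both schemes so the http URL is also routed through it
    data = requests.get(url, proxies={"http": selected, "https": selected})
    print("data returned: {0}".format(data.text))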
Related
I'm using Playwright to extract data from a website and I want to use proxies which I get from this website: https://www.proxy-list.download/HTTPS. It doesn't work, and I'm wondering if this is because the proxies are free? If that is the reason, does anyone know where I can find proxies that will work?
This is my code :
from playwright.sync_api import sync_playwright
import time

url = "https://www.momox-shop.fr/livres-romans-et-litterature-C055/"

with sync_playwright() as p:
    browser = p.firefox.launch(
        headless=False,
        proxy={
            'server': '209.166.175.201:3128'
        })
    page = browser.new_page()
    page.goto(url)
    time.sleep(5)
Thank you !
Yes, according to your link, all of those proxies are "dead".
Check proxies before using them; here is one possible solution:
import json
import requests
from pythonping import ping
from concurrent.futures import ThreadPoolExecutor

check_proxies_url = "https://httpbin.org/ip"
good_proxy = set()

# proxy_lst = requests.get("https://www.proxy-list.download/api/v1/get", params={"type": "https"})
# proxies = [proxy for proxy in proxy_lst.text.split('\r\n') if proxy]
proxy_lst = requests.get("http://proxylist.fatezero.org/proxy.list")
proxies = (f"{json.loads(data)['host']}:{json.loads(data)['port']}" for data in proxy_lst.text.split('\n') if data)

def get_proxies(proxy):
    # try the candidate proxy for both schemes against httpbin
    proxies = {
        "https": proxy,
        "http": proxy
    }
    try:
        response = requests.get(url=check_proxies_url, proxies=proxies, timeout=2)
        response.raise_for_status()
        # keep only proxies that respond and answer a ping in under 150 ms
        if ping(target=proxies["https"].split(':')[0], count=1, timeout=2).rtt_avg_ms < 150:
            good_proxy.add(proxies["https"])
            print(f"Good proxy: {proxies['https']}")
    except Exception:
        print(f"Bad proxy: {proxies['https']}")

with ThreadPoolExecutor() as executor:
    executor.map(get_proxies, proxies)

print(good_proxy)
This collects a list of active proxies with a ping of up to 150 ms.
Output:
{'209.166.175.201:8080', '170.39.194.156:3128', '20.111.54.16:80', '20.111.54.16:8123'}
But in any case, these are shared proxies, so their performance is not guaranteed. If you want to be sure that your scraper will work, it is better to buy a proxy.
I ran your code with the proxy '170.39.194.156:3128' obtained above, and for now it works.
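For reference, a minimal sketch of plugging one of the checked proxies into your Playwright launch (the address below is just one returned by the checker and may already be dead by the time you try it):

from playwright.sync_api import sync_playwright
import time

url = "https://www.momox-shop.fr/livres-romans-et-litterature-C055/"

with sync_playwright() as p:
    # pass a verified proxy to the browser; replace with an entry from good_proxy
    browser = p.firefox.launch(
        headless=False,
        proxy={'server': '170.39.194.156:3128'})
    page = browser.new_page()
    page.goto(url)
    time.sleep(5)
    browser.close()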
I am trying to scrape a website using Python requests. We can only scrape the website using proxies, so I implemented the code for that. However, it is banning all my requests even when I am using proxies, so I used https://api.ipify.org/?format=json to check whether the proxies are working properly. I found it showing my original IP even while using proxies. The code is below:
from concurrent.futures import ThreadPoolExecutor
import string, random
import requests
import sys

http = []

# loading http into the list
with open(sys.argv[1], "r", encoding="utf-8") as data:
    for i in data:
        http.append(i[:-1])
    data.close()

url = "https://api.ipify.org/?format=json"

def fetch(session, url):
    for i in range(5):
        proxy = {'http': 'http://' + random.choice(http)}
        try:
            with session.get(url, proxies=proxy, allow_redirects=False) as response:
                print("Proxy : ", proxy, " | Response : ", response.text)
                break
        except:
            pass

# #timer(1, 5)
if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=1) as executor:
        with requests.Session() as session:
            executor.map(fetch, [session] * 100, [url] * 100)
        executor.shutdown(wait=True)
I tried a lot but don't understand why my IP address is being shown instead of the proxy's IPv4. You will find the output of the code here: https://imgur.com/a/z02uSvi
The problem is that you have set a proxy for http only, but you are sending the request to a website that uses https. The solution is simple:
proxies = dict.fromkeys(('http', 'https', 'ftp'), 'http://' + random.choice(http))

# You can set the proxies on the session
session.proxies.update(proxies)
response = session.get(url)

# Or you can pass the proxies as an argument
response = session.get(url, proxies=proxies)
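Applied to your fetch function, a minimal sketch (keeping your retry loop and the proxy list loaded from the file) might look like this:

def fetch(session, url):
    for _ in range(5):
        # use the same proxy for both schemes so the https URL is routed through it too
        proxies = dict.fromkeys(('http', 'https'), 'http://' + random.choice(http))
        try:
            response = session.get(url, proxies=proxies, allow_redirects=False, timeout=5)
            print("Proxy : ", proxies['https'], " | Response : ", response.text)
            break
        except requests.RequestException:
            continue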
We are trying to pass VirusTotal API requests through our proxy, but they are getting denied. HTTP websites are accessible using the code, but HTTPS does not go through. I would request any of you to post a few sample snippets that would help us.
import postfile
host = "www.virustotal.com"
selector = "https://www.virustotal.com/vtapi/v2/file/scan"
fields = [("apikey", "-- YOUR API KEY --")]
file_to_send = open("test.txt", "rb").read()
files = [("file", "test.txt", file_to_send)]
json = postfile.post_multipart(host, selector, fields, files)
print json
Two ideas to try:
1) Without HTTPS, you can post to http://www.virustotal.com/vtapi/v2/file/scan
2) Try the python requests library:
import requests
params = {'apikey': '-YOUR API KEY HERE-'}
files = {'file': ('myfile.exe', open('myfile.exe', 'rb'))}
response = requests.post('https://www.virustotal.com/vtapi/v2/file/scan', files=files, params=params)
json_response = response.json()
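Since the original issue was routing the call through your proxy, a minimal sketch (the proxy address below is hypothetical; replace it with your own, and note that the proxy must allow CONNECT tunnelling for HTTPS) would add a proxies mapping to the same call:

import requests

# hypothetical proxy address; replace with your own
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}

params = {'apikey': '-YOUR API KEY HERE-'}
files = {'file': ('myfile.exe', open('myfile.exe', 'rb'))}
response = requests.post('https://www.virustotal.com/vtapi/v2/file/scan',
                         files=files, params=params, proxies=proxies)
print(response.json())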
I'm using requests to make a simple web crawler. How would I go about directing all of the script's requests through a proxy, so that whatever website I am crawling doesn't know it is me?
To obtain a response from behind a proxy in a script with urllib2, use the following snippet (a requests-based equivalent is sketched after it):
import json
import urllib2

proxy_url = "https://proxy:port"
proxy_support = urllib2.ProxyHandler({'https': proxy_url})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

url1 = "https://api_url"
req1 = urllib2.Request(url1)
print "response from API call is below"
res1 = urllib2.urlopen(req1)
response1 = res1.read()
print response1
jsonobj1 = json.loads(response1)
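Since the question uses the requests library, a minimal equivalent sketch (the proxy address is hypothetical) would be:

import requests

# hypothetical proxy address; replace with a real one
proxies = {
    'http': 'http://proxy.example.com:3128',
    'https': 'http://proxy.example.com:3128',
}

# every request made through this session is routed via the proxy
session = requests.Session()
session.proxies.update(proxies)

response = session.get('http://ipecho.net/plain')
print(response.text)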
Right now this is the script:
import json
import urllib2

with open('urls.txt') as f:
    urls = [line.rstrip() for line in f]

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
        for url in urls:
            data = urllib2.urlopen(url).read()
            print data
This is the urls.txt file:
http://myipaddress.com
and the proxies.txt file:
{"https": "https://87.98.216.22:3128"}
{"https": "http://190.153.7.189:8080"}
{"https": "http://125.39.68.181:80"}
that I got at http://hidemyass.com
I have been trying to test it by going through the terminal output (a bunch of HTML), looking to see if it shows the IP address somewhere, and hoping that it is one of the proxy IPs. But this doesn't seem to work. Depending on the IP recognition site, it either throws a connection error or tells me I have to enter validation letters (though the same site works fine when viewed through the browser).
So am I going about this in the best way? Is there a simpler way to check which IP address the site is seeing?
Edit: I heard elsewhere (on another forum) that one way to check whether the URL is being accessed from a different IP is to look for cross headers (i.e. headers indicating that the request was forwarded), but I can't find any more information.
You can use a simpler site that returns just your IP address as plain text. Example:
Code:
import json
import urllib2

with open('urls.txt') as f:
    urls = [line.rstrip() for line in f]

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
        for url in urls:
            try:
                data = urllib2.urlopen(url).read()
                print proxy, "-", data
            except:
                print proxy, "- not working"
urls.txt:
http://api.exip.org/?call=ip
proxies.txt:
{"http": "http://218.108.114.140:8080"}
{"http": "http://59.47.43.93:8080"}
{"http": "http://218.108.170.172:80"}
Output:
{u'http': u'http://218.108.114.140:8080'} - 218.108.114.140
{u'http': u'http://59.47.43.93:8080'} - 118.207.240.161
{u'http': u'http://218.108.170.172:80'} - not working
[Finished in 25.4s]
Note: none of this is my real IP.
Or, if you want to use http://myipaddress.com, you can do that with BeautifulSoup by extracting the exact HTML element which contains your IP.
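A minimal sketch (the tag and class used in the find() call are hypothetical; inspect the page yourself to locate the element that holds the IP):

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://myipaddress.com').read()
soup = BeautifulSoup(html)
# hypothetical selector; adjust it to the element that actually contains the IP on the page
ip_element = soup.find('span', {'class': 'ip'})
if ip_element:
    print ip_element.get_text()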