Python Package For Multi-Threaded Spider w/ Proxy Support?

Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few such as Twisted, Scrapy, and libcurl, but I don't know enough about them to make a decision, or even whether they can use proxies. Anyone know of the best one for my purposes? Thanks!

It's simple to implement this in Python. The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.
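For example, the proxy can also be set from inside the script by exporting the environment variable before the first request is made; a minimal Python 2 sketch (the proxy address below is a placeholder, not a real endpoint):

import os
from urllib import urlopen

# placeholder proxy; replace with your real proxy host:port
os.environ['http_proxy'] = 'http://proxy.example.com:8080'

# urlopen() now routes plain HTTP requests through that proxy transparently
print urlopen('http://example.com/').read()[:200]

The crawler below relies on that same transparent handling, so it needs no proxy-specific code: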
# -*- coding: utf-8 -*-
import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()    # URLs already seen, shared by all workers
queue = Queue()    # URLs waiting to be fetched


def get_parser(host, root, charset):

    def parse():
        try:
            while True:
                url = queue.get_nowait()
                try:
                    content = urlopen(url).read().decode(charset)
                except UnicodeDecodeError:
                    continue
                # collect every link on the page and keep the on-site ones
                for link in BeautifulSoup(content).findAll('a'):
                    try:
                        href = link['href']
                    except KeyError:
                        continue
                    if not href.startswith('http://'):
                        href = 'http://%s%s' % (host, href)
                    if not href.startswith('http://%s%s' % (host, root)):
                        continue
                    if href not in visited:
                        visited.add(href)
                        queue.put(href)
                        print href
        except Empty:
            pass

    return parse


if __name__ == '__main__':
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    queue.put('http://%s%s' % (host, root))
    workers = []
    for i in range(5):
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
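A typical invocation, assuming the script is saved as crawler.py and the proxy address is again a placeholder (the three positional arguments are host, root path and charset, as read from sys.argv):

$ export http_proxy="http://proxy.example.com:8080"
$ python crawler.py example.com / utf-8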

Usually proxies filter websites categorically, based on how the website is classified. It is difficult to transmit data through a proxy once its category is blocked; for example, YouTube is classified as audio/video streaming, so it is blocked in some places, especially schools.
If you want to get around that, you can pull the data off a website and publish it on your own registered domain, e.g. a .com site registered to you.
When you are creating and registering that website, you can categorise it as anything you want.

Related

How to quickly check if domain exists? [duplicate]

I have a large list of domains and I need to check whether the domains are available now. I do it like this:

import requests

list_domain = ['google.com', 'facebook.com']
for domain in list_domain:
    result = requests.get(f'http://{domain}', timeout=10)
    if result.status_code == 200:
        print(f'Domain {domain} [+++]')
    else:
        print(f'Domain {domain} [---]')

But the check is too slow. Is there a way to make it faster? Maybe someone knows an alternative method for checking whether domains exist?
You can use the socket library to determine if a domain has a DNS entry:
>>> import socket
>>>
>>> addr = socket.gethostbyname('google.com')
>>> addr
'74.125.193.100'
>>> socket.gethostbyname('googl42652267e.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
>>>
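Applied to the list from the question, a minimal sketch that only does DNS resolution (much cheaper than a full HTTP request):

import socket

list_domain = ['google.com', 'facebook.com']
for domain in list_domain:
    try:
        socket.gethostbyname(domain)
    except socket.gaierror:
        # no DNS record resolved; the domain is likely unregistered
        print(f'Domain {domain} [---]')
    else:
        print(f'Domain {domain} [+++]')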
If you want to check which domains are available, the more correct approach is to catch ConnectionError from the requests module, because even if you get a response code that is not 200, the fact that there is a response at all means there is a server associated with that domain; hence, the domain is taken.
This is not foolproof in terms of checking domain availability, because a domain might be taken but not have an appropriate A record associated with it, or the server may simply be down for the time being.
The code below also runs the checks concurrently, using a thread pool.
from concurrent.futures import ThreadPoolExecutor
import requests
from requests.exceptions import ConnectionError


def validate_existence(domain):
    try:
        response = requests.get(f'http://{domain}', timeout=10)
    except ConnectionError:
        print(f'Domain {domain} [---]')
    else:
        print(f'Domain {domain} [+++]')


list_domain = ['google.com', 'facebook.com', 'nonexistent_domain.test']

with ThreadPoolExecutor() as executor:
    executor.map(validate_existence, list_domain)
You can do that via the requests-futures module.
requests-futures runs the requests asynchronously; with an average internet connection it can check 8-10 URLs per second (based on my experience).
What you can also do is run the script multiple times, but add only a limited number of domains to each run to keep it speedy.
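A minimal sketch with requests-futures, reusing the check from the question (the domain list is just the example from the question):

from concurrent.futures import as_completed
from requests_futures.sessions import FuturesSession

list_domain = ['google.com', 'facebook.com']

session = FuturesSession()
# each .get() returns a Future immediately; the requests run in a thread pool
futures = {session.get(f'http://{domain}', timeout=10): domain for domain in list_domain}

for future in as_completed(futures):
    domain = futures[future]
    try:
        future.result()  # raises if the connection failed
        print(f'Domain {domain} [+++]')
    except Exception:
        print(f'Domain {domain} [---]')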
Use Scrapy; it is much faster, and by default it yields only 200 responses until you override that, so in your case follow along:
pip install scrapy
After installing, use the terminal in your project folder to create a project:
scrapy startproject projectname projectdir
It will create a folder named projectdir. Now:
cd projectdir
Inside projectdir enter:
scrapy genspider mydomain mydomain.com
Now navigate to the spiders folder and open mydomain.py.
Now add a few lines of code:
import scrapy


class MydomainSpider(scrapy.Spider):
    name = "mydomain"

    def start_requests(self):
        urls = [
            'http://facebook.com',
            'http://google.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {'Available_Domains': response.url}
Now go back to projectdir and run:
scrapy crawl mydomain -o output.csv
You will have all the working domains that returned status code 200 in the output.csv file.
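If you do want responses other than 200 to reach parse() as well, a spider can opt in via Scrapy's handle_httpstatus_list attribute; a minimal sketch (the listed status codes are just examples):

import scrapy


class MydomainSpider(scrapy.Spider):
    name = "mydomain"
    # let these non-200 responses through to parse() instead of filtering them out
    handle_httpstatus_list = [301, 404, 500]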
For more, see the Scrapy documentation.

concurrent connections in urllib3

Using a loop to make multiple requests to various websites, how is it possible to do this through a proxy in urllib3?
The code reads in a list of URLs and uses a for loop to connect to each site; however, currently it does not connect past the first URL in the list. There is a proxy in place as well.

list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
for i in list:
    http = ProxyManager("PROXY-PROXY")
    http_get = http.request('GET', i, preload_content=False).read().decode()

I have removed the URLs and proxy information from the above code. The first URL in the list runs fine, but after this nothing else occurs, just waiting. I have tried the clear() method to reset the connection each time through the loop.
Unfortunately, urllib3 is synchronous and blocking. You could use it with threads, but that is a hassle and usually leads to more problems. The main approach these days is to use an asynchronous networking library; Twisted and asyncio (with aiohttp, perhaps) are the popular packages.
I'll provide an example using the trio framework and asks:
import asks
import trio

asks.init('trio')

path_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
results = []

async def grabber(path):
    r = await s.get(path)
    results.append(r)

async def main(path_list):
    async with trio.open_nursery() as n:
        for path in path_list:
            n.spawn(grabber(path))

s = asks.Session()
trio.run(main, path_list)
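For comparison, since asyncio and aiohttp were mentioned above, here is a roughly equivalent sketch that also routes each request through an HTTP proxy (the proxy URL is a placeholder):

import asyncio
import aiohttp

path_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']

async def grabber(session, path):
    # aiohttp accepts the proxy per request
    async with session.get(path, proxy='http://PROXY-PROXY:8080') as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(grabber(session, p) for p in path_list))

results = asyncio.run(main())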
Using threads is not really that much of a hassle since Python 3.2, when concurrent.futures was added:

from urllib3 import ProxyManager
from concurrent.futures import ThreadPoolExecutor, wait

url_list: list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
thread_pool: ThreadPoolExecutor = ThreadPoolExecutor(max_workers=min(len(url_list), 20))
tasks = []

for url in url_list:
    # bind the current url as a default argument so each task keeps its own copy
    def send_request(this_url: str = url) -> str:
        # could this assignment be moved out of the loop?
        # I'd have to read the docs for ProxyManager, but probably
        http: ProxyManager = ProxyManager("PROXY-PROXY")
        return http.request('GET', this_url, preload_content=False).read().decode()
    tasks.append(thread_pool.submit(send_request))

wait(tasks)
all_responses: list = [task.result() for task in tasks]

Later versions offer an event loop via asyncio. Issues I've had with asyncio are usually related to the portability of libraries (e.g. aiohttp via pydantic), most of which are not pure Python and have external libc dependencies. This can be an issue if you have to support a lot of Docker apps, which might use musl libc (Alpine) or glibc (everyone else).

I tried this proxy checker, but it shows "urllib2 is not defined" on execution and gets stuck for a while?

It says urllib2 is not defined:
NameError: global name 'urllib2' is not defined
I am using Python 2.7; can anyone help me fix this one? It's not my code, but it still needs some help.
And please, if you can, make some edits so that it uses less CPU and has a good execution time.
I am just a beginner at sending HTTP requests.
from grab import Grab, GrabError
from Tkinter import *
from tkFileDialog import *
import requests
from urllib2 import urlopen

""" A multithreaded proxy checker

Given a file containing proxies, one per line, in the form of ip:port, will
attempt to establish a connection through each proxy to a provided URL.
Duration of connection attempts is governed by a passed in timeout value.
Additionally, spins off a number of daemon threads to speed up processing
using a passed in threads parameter. Proxies that passed the test are written
out to a file called results.txt

Usage:
    goodproxy.py [-h] -file FILE -url URL [-timeout TIMEOUT] [-threads THREADS]

Parameters:
    -file    -- filename containing a list of ip:port per line
    -url     -- URL to test connections against
    -timeout -- attempt time before marking that proxy as bad (default 1.0)
    -threads -- number of threads to spin off (default 16)

Functions:
    get_proxy_list_size -- returns the current size of the proxy holdingQueue
    test_proxy          -- does the actual connecting to the URL via a proxy
    main                -- creates daemon threads, writes results to a file
"""

import argparse
import queue
import socket
import sys
import threading
import time


def get_proxy_list_size(proxy_list):
    """ Return the current Queue size holding a list of proxy ip:ports """
    return proxy_list.qsize()


def test_proxy(url, url_timeout, proxy_list, lock, good_proxies, bad_proxies):
    """ Attempt to establish a connection to a passed in URL through a proxy.

    This function is used in a daemon thread and will loop continuously while
    waiting for available proxies in the proxy_list. Once proxy_list contains
    a proxy, this function will extract that proxy. This action automatically
    locks the queue until this thread is done with it. Builds a urllib.request
    opener and configures it with the proxy. Attempts to open the URL and if
    successful then saves the good proxy into the good_proxies list. If an
    exception is thrown, writes the bad proxy to the bad_proxies list. The call
    to task_done() at the end unlocks the queue for further processing.
    """
    while True:
        # take an item from the proxy list queue; get() auto locks the
        # queue for use by this thread
        proxy_ip = proxy_list.get()

        # configure urllib.request to use proxy
        proxy = urllib.request.ProxyHandler({'http': proxy_ip})
        opener = urllib.request.build_opener(proxy)
        urllib.request.install_opener(opener)

        # some sites block frequent querying from generic headers
        request = urllib.request.Request(
            url, headers={'User-Agent': 'Proxy Tester'})
        try:
            # attempt to establish a connection
            urllib.request.urlopen(request, timeout=float(url_timeout))

            # if all went well save the good proxy to the list
            with lock:
                good_proxies.append(proxy_ip)
        except (urllib.request.URLError,
                urllib.request.HTTPError,
                socket.error):
            # handle any error related to connectivity (timeouts, refused
            # connections, HTTPError, URLError, etc)
            with lock:
                bad_proxies.append(proxy_ip)
        finally:
            proxy_list.task_done()  # release the queue


def main(argv):
    """ Main Function

    Uses argparse to process input parameters. File and URL are required while
    the timeout and thread values are optional. Uses threading to create a
    number of daemon threads each of which monitors a Queue for available
    proxies to test. Once the Queue begins populating, the waiting daemon
    threads will start picking up the proxies and testing them. Successful
    results are written out to a results.txt file.
    """
    proxy_list = queue.Queue()  # Hold a list of proxy ip:ports
    lock = threading.Lock()     # locks good_proxies, bad_proxies lists
    good_proxies = []           # proxies that passed connectivity tests
    bad_proxies = []            # proxies that failed connectivity tests

    # Process input parameters
    parser = argparse.ArgumentParser(description='Proxy Checker')

    parser.add_argument(
        '-file', help='a text file with a list of proxy:port per line',
        required=True)
    parser.add_argument(
        '-url', help='URL for connection attempts', required=True)
    parser.add_argument(
        '-timeout',
        type=float, help='timeout in seconds (defaults to 1)', default=1)
    parser.add_argument(
        '-threads', type=int, help='number of threads (defaults to 16)',
        default=16)

    args = parser.parse_args(argv)

    # setup daemons ^._.^
    for _ in range(args.threads):
        worker = threading.Thread(
            target=test_proxy,
            args=(
                args.url,
                args.timeout,
                proxy_list,
                lock,
                good_proxies,
                bad_proxies))
        worker.setDaemon(True)
        worker.start()

    start = time.time()

    # load a list of proxies from the proxy file
    with open(args.file) as proxyfile:
        for line in proxyfile:
            proxy_list.put(line.strip())

    # block main thread until the proxy list queue becomes empty
    proxy_list.join()

    # save results to file
    with open("result.txt", 'w') as result_file:
        result_file.write('\n'.join(good_proxies))

    # some metrics
    print("Runtime: {0:.2f}s".format(time.time() - start))


if __name__ == "__main__":
    main(sys.argv[1:])
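The script above mixes Python 3 module names (urllib.request, queue) with Python 2-only imports (urllib2, Tkinter), which is the kind of mismatch that produces errors like the one above. A minimal, self-contained sketch of the same proxy test using the Python 2.7 urllib2 API (the proxy address and URL are placeholders; on 2.7 the script's import queue would likewise become import Queue as queue):

import socket
import urllib2

proxy_ip = '127.0.0.1:8080'      # placeholder proxy ip:port
test_url = 'http://example.com'  # placeholder URL to test against

# urllib2 provides the same ProxyHandler/opener API that the script
# calls through urllib.request on Python 3
opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy_ip}))
request = urllib2.Request(test_url, headers={'User-Agent': 'Proxy Tester'})
try:
    opener.open(request, timeout=1.0)
    print 'proxy looks good'
except (urllib2.HTTPError, urllib2.URLError, socket.error):
    print 'proxy failed'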

Python - Flask - open a webpage in default browser

I am working on a small project in Python. It is divided into two parts.
The first part is responsible for crawling the web, extracting some information and inserting it into a database.
The second part is responsible for presenting that information using the database.
Both parts share the database. In the second part I am using the Flask framework to display the information as HTML with some formatting and styling to make it look cleaner.
The source files of both parts are in the same package, but to run this program properly the user has to run the crawler and the results presenter separately, like this:
python crawler.py
and then
python presenter.py
Everything is all right except one thing. What I want the presenter to do is create the results in HTML format and open the page with the results in the user's default browser, but it is always opened twice, probably due to the presence of the run() method, which starts Flask in a new thread, and things get cloudy for me. I don't know what I should do to make my presenter.py open only one tab/window after running it.
Here is the snippet of my code:
from flask import Flask, render_template
import os
import sqlite3
import webbrowser

# configuration
DEBUG = True
DATABASE = os.getcwd() + '/database/database.db'

app = Flask(__name__)
app.config.from_object(__name__)
app.config.from_envvar('CRAWLER_SETTINGS', silent=True)


def connect_db():
    """Returns a new connection to the database."""
    try:
        conn = sqlite3.connect(app.config['DATABASE'])
        return conn
    except sqlite3.Error:
        print 'Unable to connect to the database'
        return False


@app.route('/')
def show_entries():
    u"""Loads pages information and emails from the database and
    inserts results into the show_entries template. If there is a database
    problem returns error page.
    """
    conn = connect_db()
    if conn:
        try:
            cur = connect_db().cursor()
            results = cur.execute('SELECT url, title, doctype, pagesize FROM pages')
            pages = [dict(url=row[0], title=row[1].encode('utf-8'), pageType=row[2], pageSize=row[3]) for row in results.fetchall()]
            results = cur.execute('SELECT url, email from emails')
            emails = {}
            for row in results.fetchall():
                emails.setdefault(row[0], []).append(row[1])
            return render_template('show_entries.html', pages=pages, emails=emails)
        except sqlite3.Error, e:
            print ' Exception message %s ' % e
            print 'Could not load data from the database!'
            return render_template('show_error_page.html')
    else:
        return render_template('show_error_page.html')


if __name__ == '__main__':
    url = 'http://127.0.0.1:5000'
    webbrowser.open_new(url)
    app.run()
I use similar code on Mac OS X (with Safari, Firefox, and Chrome browsers) all the time, and it runs fine. Guessing you may be running into Flask's auto-reload feature. Set debug=False and it will not try to auto-reload.
Other suggestions, based on my experience:
Consider randomizing the port you use, as quick edit-run-test loops sometimes find the OS thinking port 5000 is still in use. (Or, if you run the code several times simultaneously, say by accident, the port truly is still in use.)
Give the app a short while to spin up before you start the browser request. I do that through invoking threading.Timer.
Here's my code:
import random, threading, webbrowser
port = 5000 + random.randint(0, 999)
url = "http://127.0.0.1:{0}".format(port)
threading.Timer(1.25, lambda: webbrowser.open(url) ).start()
app.run(port=port, debug=False)
(This is all under the if __name__ == '__main__':, or in a separate "start app" function if you like.)
So this may or may not help, but my issue was with Flask opening in Microsoft Edge when executing my app.py script. Newbie solution: go to Settings, then Default apps, and change Microsoft Edge to Chrome. Now it opens Flask in Chrome every time. I still have the same issue where things just load, though.

Python - Twisted, Proxy and modifying content

So I've looked around at a few things involving writing an HTTP proxy using Python and the Twisted framework.
Essentially, like some other questions, I'd like to be able to modify the data that will be sent back to the browser. That is, the browser requests a resource and the proxy will fetch it. Before the resource is returned to the browser, I'd like to be able to modify ANY content (HTTP headers AND body).
This (Need help writing a twisted proxy) was what I initially found. I tried it out, but it didn't work for me. I also found this (Python Twisted proxy - how to intercept packets), which I thought would work; however, I can only see the HTTP requests from the browser.
I am looking for any advice. Some thoughts I have are to use the ProxyClient and ProxyRequest classes and override their functions, but I read that the Proxy class itself is a combination of both.
For those who may ask to see some code, it should be noted that I have worked with only the above two examples. Any help is great.
Thanks.
To create a ProxyFactory that can modify server response headers and content, you could override the ProxyClient.handle*() methods:
from twisted.python import log
from twisted.web import http, proxy


class ProxyClient(proxy.ProxyClient):
    """Mangle returned header, content here.

    Use `self.father` methods to modify request directly.
    """
    def handleHeader(self, key, value):
        # change response header here
        log.msg("Header: %s: %s" % (key, value))
        proxy.ProxyClient.handleHeader(self, key, value)

    def handleResponsePart(self, buffer):
        # change response part here
        log.msg("Content: %s" % (buffer[:50],))
        # make all content upper case
        proxy.ProxyClient.handleResponsePart(self, buffer.upper())


class ProxyClientFactory(proxy.ProxyClientFactory):
    protocol = ProxyClient


class ProxyRequest(proxy.ProxyRequest):
    protocols = dict(http=ProxyClientFactory)


class Proxy(proxy.Proxy):
    requestFactory = ProxyRequest


class ProxyFactory(http.HTTPFactory):
    protocol = Proxy
I've got this solution by looking at the source of twisted.web.proxy. I don't know how idiomatic it is.
To run it as a script or via twistd, add at the end:
portstr = "tcp:8080:interface=localhost"  # serve on localhost:8080

if __name__ == '__main__':  # $ python proxy_modify_request.py
    import sys

    from twisted.internet import endpoints, reactor

    def shutdown(reason, reactor, stopping=[]):
        """Stop the reactor."""
        if stopping: return
        stopping.append(True)
        if reason:
            log.msg(reason.value)
        reactor.callWhenRunning(reactor.stop)

    log.startLogging(sys.stdout)
    endpoint = endpoints.serverFromString(reactor, portstr)
    d = endpoint.listen(ProxyFactory())
    d.addErrback(shutdown, reactor)
    reactor.run()
else:  # $ twistd -ny proxy_modify_request.py
    from twisted.application import service, strports

    application = service.Application("proxy_modify_request")
    strports.service(portstr, ProxyFactory()).setServiceParent(application)
Usage
$ twistd -ny proxy_modify_request.py
In another terminal:
$ curl -x localhost:8080 http://example.com
For a two-way proxy using Twisted, see this article:
http://sujitpal.blogspot.com/2010/03/http-debug-proxy-with-twisted.html
