I'm experimenting with proxy servers. I want to create a bot that connects to my web server every few minutes and scrapes a file (namely the index.html) for changes.
I tried to apply what I learned in some multi-hour Python tutorials and, to make it a bit more fun, decided to route the requests through random proxies.
So I wrote this code:
import requests
from bs4 import BeautifulSoup
from random import choice

# here I get a proxy from a proxy list by parsing a table embedded in the HTML with BeautifulSoup
def get_proxy():
    print("inside get_proxy")
    proxyDomain = 'https://free-proxy-list.net/'
    r = requests.get(proxyDomain)
    print("making the soup now")
    soup = BeautifulSoup(r.content, 'html.parser')
    table = soup.find('table', {'id': 'proxylisttable'})
    # this part works
    # print(table.get_text)
    print("time for the list")
    ipAddresses = []
    for row in table.findAll('tr'):
        columns = row.findAll('td')
        try:
            ipAddresses.append("https://" + str(columns[0].get_text()) + ":" + str(columns[1].get_text()))
            # ipList.append(str(columns[0].get_text()) + ":" + str(columns[1].get_text()))
        except IndexError:
            # header rows have no <td> cells, so skip them
            pass
    # here the program returns one random IP address from the list
    return choice(ipAddresses)
    # return 'https://' + choice(ipAddresses)
def proxy_request(request_type, url, **kwargs):
    print("inside proxy_request")
    while 1:
        try:
            proxy = get_proxy()
            print("today we are using {}".format(proxy))
            # so up to this line everything seems to work as I want it to
            # now the next line should do the proxied request and, at the end of the loop, return some HTML text...
            r = requests.request(request_type, url, proxies=proxy, timeout=5, **kwargs)
            break
        except Exception:
            pass
    return r
def launch():
    print("inside launch")
    r = proxy_request('get', 'https://mysliwje.uber.space.')
    # but the text never arrives here - maybe the request is being carried out the wrong way
    # does anybody have an idea how to fix this program so that it works?
    print(r.text)

launch()
As I explained in the code comments, the first part works nicely: it picks a random IP from the proxy list and even prints it to the CLI. The next step, however, suddenly seems to go wrong, because the tool loops back and scrapes a new IP address,
and another,
and another,
and another,
and another...
from a list that seems to be updated every few minutes.
So I ask myself what is happening here, and why I never see the plain HTML of my index page.
Anybody any idea?
Thanxx
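One thing worth checking (this is my assumption about the cause, not something confirmed above): requests expects the proxies argument to be a dict mapping scheme to proxy URL, not a bare string, so passing the string returned by get_proxy() makes every request fail, and the bare except silently swallows the error, which would explain why the loop just keeps fetching new proxies. A minimal sketch of how the call could look, reusing get_proxy() from the code above:

import requests

def proxy_request(request_type, url, **kwargs):
    # keep trying random proxies until one of them answers
    while True:
        proxy = get_proxy()                       # e.g. "https://1.2.3.4:8080"
        # requests wants a dict: scheme -> proxy URL
        proxy_dict = {"http": proxy, "https": proxy}
        try:
            return requests.request(request_type, url, proxies=proxy_dict, timeout=5, **kwargs)
        except requests.exceptions.RequestException as err:
            # log the failure instead of hiding it, then try the next proxy
            print("proxy {} failed: {}".format(proxy, err))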
I am using kafka-python and BeautifulSoup to scrape a website that I visit often, and to send a message to a Kafka broker with a Python producer.
What I want to do is: whenever a new post is uploaded to the website (it is a community site, a bit like reddit, that Korean hip-hop fans use to share information), that post should be sent to the Kafka broker.
However, my problem is that within the while loop, only the latest post keeps being sent to the Kafka broker repeatedly. This is not what I want.
My second problem is that when a new post is loaded, an HTTP Error 502: Bad Gateway occurs on
soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
and no more messages are sent.
This is dataScraping.py:
from bs4 import BeautifulSoup
import re
import urllib.request

pattern = re.compile('[0-9]+')

def parseContent():
    soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
    # drop the pinned notice rows so only regular posts remain
    for div in soup.find_all("tr", class_="notice"):
        div.decompose()

    key_num = pattern.findall(soup.find_all("td", class_="no")[0].text)
    category = soup.find_all("td", class_="categoryTD")[0].find("span").text
    author = soup.find_all("td", class_="author")[0].find("span").text
    title = soup.find_all("td", class_="title")[0].find("a").text
    link = "http://hiphople.com" + soup.find_all("td", class_="title")[0].find("a").attrs["href"]

    soup2 = BeautifulSoup(urllib.request.urlopen(link).read(), "html.parser")
    content = str(soup2.find_all("div", class_="article-content")[0].find_all("p"))
    content = re.sub("<.+?>", "", content, 0).strip()
    content = re.sub("\xa0", "", content, 0).strip()

    result = {"key_num": key_num, "category": category, "title": title, "author": author, "content": content}
    return result

if __name__ == "__main__":
    print("data scraping from website")
And this is PythonWebScraping.py:
import json
from kafka import KafkaProducer
from dataScraping import parseContent

def json_serializer(data):
    return json.dumps(data).encode("utf-8")

producer = KafkaProducer(acks=1, compression_type="gzip", bootstrap_servers=["localhost:9092"],
                         value_serializer=json_serializer)

if __name__ == "__main__":
    while True:
        result = parseContent()
        producer.send("hiphople", result)
Please let me know how to fix my code so that only newly created posts are sent to the Kafka broker, as I expect.
Your function is working, but it is true that it returns only one post. I did not get a 502 Bad Gateway; maybe you are getting it as DDoS protection because you are hitting the URL too often. Try adding delays/sleeps, or your IP may have been banned to stop it from scraping the URL.
As for your second issue: your function returns only the single latest post, and you send that result to Kafka on every iteration, which is why you are seeing the same message over and over again.
You are scraping and taking only the latest post. What did you want your function to do? For example:
prevResult = ""
while True:
    result = parseContent()
    if prevResult != result:
        prevResult = result
        print(result)
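Building on that idea, here is a rough sketch (my own assumption about how the pieces could fit together, reusing parseContent() and the producer settings from the question) that only produces to Kafka when the post key changes and retries with a delay when the site answers with an HTTP error such as 502:

import json
import time
import urllib.error

from kafka import KafkaProducer
from dataScraping import parseContent

producer = KafkaProducer(acks=1, compression_type="gzip",
                         bootstrap_servers=["localhost:9092"],
                         value_serializer=lambda d: json.dumps(d).encode("utf-8"))

def fetch_with_retry(max_retries=5, delay_seconds=30):
    # retry the scrape a few times, waiting between attempts,
    # so a transient HTTP 502 does not kill the loop
    for attempt in range(max_retries):
        try:
            return parseContent()
        except urllib.error.HTTPError as err:
            print("attempt {} failed: {}".format(attempt + 1, err))
            time.sleep(delay_seconds)
    return None

if __name__ == "__main__":
    last_key = None
    while True:
        result = fetch_with_retry()
        # only send when the newest post differs from the one seen last time
        if result is not None and result["key_num"] != last_key:
            last_key = result["key_num"]
            producer.send("hiphople", result)
        time.sleep(60)  # poll once a minute instead of hammering the site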
Obviously I'm still new to Python, as you can tell by looking at my code, but I am failing my way through it.
I am scraping Amazon jobs search results but keep getting a connection reset error 10054 after about 50 requests to the URL. I added the Crawlera proxy network to avoid getting banned, but it is still not working. I know the URL is long, but it seems to work without having to add too many other separate parts to it. The results page has about 12,000 jobs in total with 10 jobs per page, so I don't even know if scraping that much data is the problem to begin with. Amazon shows each page in the URL as 'result_limit=10', so I've been going through the pages in steps of 10 instead of 1 page per request; not sure if that's right. Also, the last page stops at 9,990.
The code works, but I am not sure how to get past the connection error. As you can see, I've added things like a user agent, but I am not sure it even does anything. Any help would be appreciated as I've been stuck on this for countless days and hours. Thanks!
# imports this snippet appears to rely on (they are not shown in the original post)
import csv
import json
from datetime import datetime
from random import randint
from time import sleep, time
from warnings import warn

from fake_useragent import UserAgent
from IPython.display import clear_output
from requests import get


def get_all_jobs(pages):
    requests = 0
    start_time = time()
    total_runtime = datetime.now()
    for page in pages:
        try:
            ua = UserAgent()
            header = {
                'User-Agent': ua.random
            }
            response = get('https://www.amazon.jobs/en/search.json?base_query=&city=&country=USA&county=&'
                           'facets%5B%5D=location&facets%5B%5D=business_category&facets%5B%5D=category&'
                           'facets%5B%5D=schedule_type_id&facets%5B%5D=employee_class&facets%5B%5D=normalized_location'
                           '&facets%5B%5D=job_function_id&job_function_id%5B%5D=job_function_corporate_80rdb4&'
                           'latitude=&loc_group_id=&loc_query=USA&longitude=&'
                           'normalized_location%5B%5D=Seattle%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=San+Francisco'
                           '%2C+California%2C+USA&normalized_location%5B%5D=Sunnyvale%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Bellevue%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=East+Palo+Alto%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Santa+Monica%2C+California%2C+USA&offset={}&query_options=&'
                           'radius=24km&region=&result_limit=10&schedule_type_id%5B%5D=Full-Time&'
                           'sort=relevant'.format(page),
                           headers=header,
                           proxies={
                               "http": "http://1ea01axxxxxxxxxxxxxxxxxxx:#proxy.crawlera.com:8010/"
                           })

            # Monitor the frequency of requests
            requests += 1
            # Pause the loop between 8 and 15 seconds
            sleep(randint(8, 15))
            current_time = time()
            elapsed_time = current_time - start_time
            print("Amazon Request: {}; Frequency: {} requests/s; Total Run Time: {}".format(
                requests, requests / elapsed_time, datetime.now() - total_runtime))
            clear_output(wait=True)

            # Warn on non-200 status codes
            if response.status_code != 200:
                warn("Request: {}; Status code: {}".format(requests, response.status_code))

            # Break the loop if the number of requests is greater than expected
            if requests > 999:
                warn("Number of requests was greater than expected.")
                break

            yield from get_job_infos(response)

        except AttributeError as e:
            print(e)
            continue


def get_job_infos(response):
    amazon_jobs = json.loads(response.text)
    for website in amazon_jobs['jobs']:
        site = website['company_name']
        title = website['title']
        location = website['normalized_location']
        job_link = 'https://www.amazon.jobs' + website['job_path']
        yield site, title, location, job_link


def main():
    # Page offsets start at 0 and increase by 10 per page.
    pages = [str(i) for i in range(0, 9990, 10)]
    with open('amazon_jobs.csv', "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(pages))


if __name__ == "__main__":
    main()
I'm no expert on Amazon's anti-bot policies, but if they have flagged you once, your IP could stay flagged for a while, and they might limit how many similar requests you can make in a certain time frame.
Google for a patch to urllib so you can see the request headers in real time. Apart from rate-limiting an IP/domain per time frame, Amazon will look at your request headers to determine whether you're human, so compare what you're sending with the request headers of a regular browser.
Just standard practice: keep cookies for a normal amount of time, use proper referers and a popular user agent.
All of this can be done with the requests library (pip install requests); see its Session object.
It looks like you're sending a request to an internal Amazon URL without a Referer header, which doesn't happen in a normal browser. Another example: keeping cookies from one user agent and then switching to another is also not what a browser does.
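As a sketch of that advice (the header values and retry settings below are my own assumptions, not taken from the original post): a single requests.Session keeps cookies and a consistent User-Agent across requests, sends a Referer, and can retry transient connection failures with a backoff.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# one consistent browser-like identity for the whole crawl
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.amazon.jobs/en/search",
    "Accept": "application/json",
})

# retry connection resets and 5xx responses with an exponential backoff
retries = Retry(total=5, backoff_factor=2, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

def fetch_page(url):
    # the Session reuses cookies and headers on every call
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.json()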
Edit 2:
Second approach: for now, I gave up on using multiple instances and configured the scrapy settings not to use concurrent requests. It's slow but stable. I opened a bounty. Who can help to make this work concurrently? If I configure scrapy to run concurrently, I get segmentation faults.
class WebkitDownloader( object ):

    def __init__(self):
        os.environ["DISPLAY"] = ":99"
        self.proxyAddress = "a:b#" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)

    def process_response(self, request, response, spider):
        self.request = request
        self.response = response
        if 'cached' not in response.flags:
            webkitBrowser = webkit.WebkitBrowser(proxy=self.proxyAddress, gui=False, timeout=0.5, delay=0.5,
                                                 forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt'])
            # print "added to queue: " + str(self.counter)
            webkitBrowser.get(html=response.body, num_retries=0)
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)
            webkitBrowser.setPage(None)
            del webkitBrowser
        return response
Edit:
I tried to answer my own question in the meantime and implemented a queue, but it does not run asynchronously for some reason. Basically, while webkitBrowser.get(html=response.body, num_retries=0) is busy, scrapy is blocked until the method has finished. New requests are not assigned to the remaining free instances in self.queue.
Can anyone please point me in the right direction to make this work?
class WebkitDownloader( object ):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.queue = list()
        for i in range(8):
            self.queue.append(webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=5.5,
                                                   forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt']))

    def process_response(self, request, response, spider):
        i = 0
        for webkitBrowser in self.queue:
            i += 1
            if webkitBrowser.status == "WAITING":
                break
        webkitBrowser = self.queue[i]

        if webkitBrowser.status == "WAITING":
            # load webpage
            print "added to queue: " + str(i)
            webkitBrowser.get(html=response.body, num_retries=0)
            webkitBrowser.scrapyResponse = response

        while webkitBrowser.status == "PROCESSING":
            print "waiting for queue: " + str(i)

        if webkitBrowser.status == "DONE":
            print "fetched from queue: " + str(i)
            # response = webkitBrowser.scrapyResponse
            html = webkitBrowser.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            # response = response.replace(**kwargs)
            webkitBrowser.status = "WAITING"

        return response
I am using WebKit in a scrapy middleware to render JavaScript. Currently, scrapy is configured to process one request at a time (no concurrency).
I'd like to use concurrency (e.g. 8 requests at a time), but then I need to make sure that the 8 instances of WebkitBrowser() receive requests based on their individual processing state (a fresh request as soon as WebkitBrowser.get() is done and ready to receive the next one).
How would I achieve that with Python? This is my current middleware:
class WebkitDownloader( object ):

    def __init__(self):
        proxyAddress = "http://" + PROXY_DEFAULT_HOST + ":" + str(PROXY_DEFAULT_PORT)
        self.w = webkit.WebkitBrowser(proxy=proxyAddress, gui=True, timeout=0.5, delay=0.5,
                                      forbidden_extensions=['js', 'css', 'swf', 'pdf', 'doc', 'xls', 'ods', 'odt'])

    def process_response(self, request, response, spider):
        if not ".pdf" in response.url:
            # load webpage
            self.w.get(html=response.body, num_retries=0)
            html = self.w.current_html()
            respcls = responsetypes.from_args(headers=response.headers, url=response.url)
            kwargs = dict(cls=respcls, body=killgremlins(html))
            response = response.replace(**kwargs)

        return response
I don't follow everything in your question because I don't know scrapy and I don't understand what would cause the segfault, but I think I can address one question: why is scrapy blocked while webkitBrowser.get is busy?
I don't see anything in your "queue" example that would give you the possibility of parallelism. Normally, one would use either the threading or multiprocessing module so that multiple things can run "in parallel". Instead of simply calling webkitBrowser.get, I suspect that you may want to run it in a thread. Retrieving web pages is a case where Python threading should work reasonably well: Python can't do two CPU-intensive tasks simultaneously (due to the GIL), but it can wait for responses from web servers in parallel.
Here's a recent SO Q/A with example code that might help.
Here's an idea of how to get you started: create a Queue; define a function which takes this queue as an argument, gets the web page and puts the response in the queue. In the main program, enter a while True: loop after spawning all the get threads: check the queue and process the next entry, or time.sleep(.1) if it's empty.
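A minimal sketch of that suggestion, using only the standard library threading and queue modules (the URLs and the fetch function are placeholders of my own, not taken from the question):

import queue
import threading
import time
import urllib.request

urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URLs
results = queue.Queue()

def fetch(url, out_queue):
    # each thread downloads one page and drops the body into the shared queue
    body = urllib.request.urlopen(url, timeout=10).read()
    out_queue.put((url, body))

threads = [threading.Thread(target=fetch, args=(url, results)) for url in urls]
for t in threads:
    t.start()

done = 0
while done < len(urls):
    try:
        url, body = results.get(timeout=0.1)
        done += 1
        print(url, len(body))   # process the next finished page
    except queue.Empty:
        time.sleep(0.1)         # nothing ready yet, check again shortly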
I am aware this is an old question, but I had a similar one and hope the information I stumbled upon helps others with the same problem:
1. If scrapyjs + splash works for you (given you are using a WebKit browser, it likely does, as Splash is WebKit-based), it is probably the easiest solution.
2. If 1 does not work, you may be able to run multiple spiders at the same time with scrapyd, or do multiprocessing with scrapy.
3. Depending on whether your browser render is primarily waiting (for pages to render), IO-intensive or CPU-intensive, you may want to use non-blocking sleep with twisted, multithreading or multiprocessing. For the latter, the value of sticking with scrapy diminishes and you may want to hack a simple scraper (e.g. the web crawler authored by A. Jesse Jiryu Davis and Guido van Rossum: code and document) or create your own.
I have a script that retrieves news results from this newspaper using a query and a time frame (which can be up to a year).
The results are paginated at up to 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then retrieve the title, URL and date of each article. Each cycle (the HTTP request plus the parsing) takes from 30 seconds to a minute, which is extremely slow, and eventually it stops with a response code of 500. I am wondering whether there are ways to speed it up, perhaps by making multiple requests at once. I simply want to retrieve the article details from all the pages.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
import csv

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'

def run(**params):
    countryFile = open("EgyptDaybyDay.csv", "a")
    i = 1
    results = True
    while results:
        params["index"] = str(i)
        response = requests.get(URL.format(**params))
        print response.status_code
        htmlFile = BeautifulSoup(response.content)
        articles = htmlFile.findAll("div", {"class": "newslist"})

        for article in articles:
            url = (article.a['href']).encode('utf-8', 'ignore')
            title = (article.img['alt']).encode('utf-8', 'ignore')
            dateline = article.find("div", {"class": "floatright"})
            m = re.search("([0-9]{2}\-[0-9]{2}\-[0-9]{4})", dateline.string)
            date = m.group(1)
            w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            w.writerow((date, title, url))

        if not articles:
            results = False
        i += 1
    countryFile.close()

run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")
This is a good opportunity to try out gevent.
You should have a separate routine for the requests.get part so that your application doesn't have to wait on blocking IO.
You can then spawn multiple workers and have queues to pass requests and articles around.
Maybe something similar to this:
import gevent.monkey
from gevent.queue import Queue
from gevent import sleep

gevent.monkey.patch_all()

MAX_REQUESTS = 10

requests = Queue(MAX_REQUESTS)
articles = Queue()

mock_responses = range(100)
mock_responses.reverse()

def request():
    print "worker started"
    while True:
        print "request %s" % requests.get()
        sleep(1)

        try:
            articles.put('article response %s' % mock_responses.pop())
        except IndexError:
            articles.put(StopIteration)
            break

def run():
    print "run"
    i = 1
    while True:
        requests.put(i)
        i += 1

if __name__ == '__main__':
    for worker in range(MAX_REQUESTS):
        gevent.spawn(request)

    gevent.spawn(run)

    for article in articles:
        print "Got article: %s" % article
The most probable slowdown is the server, so parallelising the HTTP requests is the best way to make the code run faster, although there's very little you can do to speed up the server response. There's a good tutorial over at IBM for doing exactly this.
It seems to me that you're looking for a feed, which that newspaper doesn't advertise. However, it's a problem that has been solved before: there are many sites that will generate feeds for an arbitrary website, thus solving at least one of your problems. Some of these require some human guidance, and others have less opportunity for tweaking and are more automatic.
If you can avoid doing the pagination and parsing yourself at all, I'd recommend it. If you cannot, I second the use of gevent for simplicity. That said, if they're sending you back 500s, your code is likely less of an issue and added parallelism may not help.
You can try making all the calls asynchronously.
Have a look at this:
http://pythonquirks.blogspot.in/2011/04/twisted-asynchronous-http-request.html
You could use gevent as well rather than twisted; I'm just telling you the options.
This might very well come close to what you're looking for:
Ideal method for sending multiple HTTP requests over Python? [duplicate]
Source code:
https://github.com/kennethreitz/grequests
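For reference, a rough sketch of what grequests usage could look like for the paginated search above (the URL template and parameters are taken from the question's code; the page count and concurrency limit are my own placeholders, and whether grequests fits your setup is an assumption):

import grequests

URL = ('http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}'
       '&keywordtitle={query}&keywordbrief={query}&keywordbody={query}'
       '&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0')

# build one unsent request per page, then send them a few at a time
pending = [grequests.get(URL.format(index=i, query="Egypt",
                                    datefrom="12-01-2010", dateto="12-01-2011"))
           for i in range(1, 51)]                   # first 50 pages as an example

for response in grequests.map(pending, size=10):    # at most 10 requests in flight at once
    if response is not None and response.status_code == 200:
        print(len(response.content))                # hand the page off to the existing parsing code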
I'm writing a script in Python that should determine if it has internet access.
import urllib

CHECK_PAGE = "http://64.37.51.146/check.txt"
CHECK_VALUE = "true\n"
PROXY_VALUE = "Privoxy"
OFFLINE_VALUE = ""

page = urllib.urlopen(CHECK_PAGE)
response = page.read()
page.close()

if response.find(PROXY_VALUE) != -1:
    urllib.getproxies = lambda x = None: {}
    page = urllib.urlopen(CHECK_PAGE)
    response = page.read()
    page.close()

if response != CHECK_VALUE:
    print "'" + response + "' != '" + CHECK_VALUE + "'"
else:
    print "You are online!"
I use a proxy on my computer, so correct proxy handling is important. If it can't connect to the internet through the proxy, it should bypass the proxy and see if it's stuck at a login page (as many public hotspots I use do). With that code, if I am not connected to the internet, the first read() returns the proxy's error page. But when I bypass the proxy after that, I get the same page. If I bypass the proxy BEFORE making any requests, I get an error like I should. I think Python is caching the page from the 1st time around.
How do I force Python to clear its cache (or is this some other problem)?
Calling urllib.urlcleanup() before each call to urllib.urlopen() will solve the problem. Under the hood, urllib.urlopen calls the urlretrieve() function, which creates a cache to hold data, and urlcleanup() removes it.
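A short sketch of that fix applied to the check above (Python 2, matching the question's code; the helper name is mine):

import urllib

CHECK_PAGE = "http://64.37.51.146/check.txt"

def fetch_fresh(url):
    # drop whatever urlretrieve() cached from the previous attempt, then fetch again
    urllib.urlcleanup()
    page = urllib.urlopen(url)
    try:
        return page.read()
    finally:
        page.close()

response = fetch_fresh(CHECK_PAGE)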
You want

    page = urllib.urlopen(CHECK_PAGE, proxies={})

Remove the

    urllib.getproxies = lambda x = None: {}

line.