I'm obviously still new to Python, as you can see from my code, but I'm failing my way through it.
I am scraping Amazon jobs search results but keep getting a connection reset error (10054) after about 50 requests to the URL. I added a Crawlera proxy network to avoid getting banned, but it's still not working. I know the URL is long, but it seems to work without having to break it into too many separate parts. The results page has about 12,000 jobs total with 10 jobs per page, so I don't even know if scraping that much data is the problem to begin with. Amazon shows each page in the URL as 'result_limit=10', so I've been stepping through the pages by increasing the offset in increments of 10 rather than one page per request. Not sure if that's right. Also, the last page stops at an offset of 9,990.
The code works, but I'm not sure how to get past the connection error. As you can see, I've added things like a user agent, but I'm not sure it even does anything. Any help would be appreciated, as I've been stuck on this for countless days and hours. Thanks!
import csv
import json
from datetime import datetime
from random import randint
from time import sleep, time
from warnings import warn

from fake_useragent import UserAgent      # provides ua.random
from IPython.display import clear_output  # used to clear the notebook output
from requests import get


def get_all_jobs(pages):
    requests = 0
    start_time = time()
    total_runtime = datetime.now()
    for page in pages:
        try:
            # Pick a random user agent for each request
            ua = UserAgent()
            header = {
                'User-Agent': ua.random
            }
            response = get('https://www.amazon.jobs/en/search.json?base_query=&city=&country=USA&county=&'
                           'facets%5B%5D=location&facets%5B%5D=business_category&facets%5B%5D=category&'
                           'facets%5B%5D=schedule_type_id&facets%5B%5D=employee_class&facets%5B%5D=normalized_location'
                           '&facets%5B%5D=job_function_id&job_function_id%5B%5D=job_function_corporate_80rdb4&'
                           'latitude=&loc_group_id=&loc_query=USA&longitude=&'
                           'normalized_location%5B%5D=Seattle%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=San+Francisco'
                           '%2C+California%2C+USA&normalized_location%5B%5D=Sunnyvale%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Bellevue%2C+Washington%2C+USA&'
                           'normalized_location%5B%5D=East+Palo+Alto%2C+California%2C+USA&'
                           'normalized_location%5B%5D=Santa+Monica%2C+California%2C+USA&offset={}&query_options=&'
                           'radius=24km&region=&result_limit=10&schedule_type_id%5B%5D=Full-Time&'
                           'sort=relevant'.format(page),
                           headers=header,
                           proxies={
                               "http": "http://1ea01axxxxxxxxxxxxxxxxxxx:#proxy.crawlera.com:8010/"
                           })

            # Monitor the frequency of requests
            requests += 1
            # Pause the loop between 8 and 15 seconds
            sleep(randint(8, 15))
            current_time = time()
            elapsed_time = current_time - start_time
            print("Amazon Request: {}; Frequency: {} requests/s; Total Run Time: {}".format(
                requests, requests / elapsed_time, datetime.now() - total_runtime))
            clear_output(wait=True)

            # Warn for non-200 status codes
            if response.status_code != 200:
                warn("Request: {}; Status code: {}".format(requests, response.status_code))

            # Break the loop if the number of requests is greater than expected
            if requests > 999:
                warn("Number of requests was greater than expected.")
                break

            yield from get_job_infos(response)

        except AttributeError as e:
            print(e)
            continue


def get_job_infos(response):
    amazon_jobs = json.loads(response.text)
    for website in amazon_jobs['jobs']:
        site = website['company_name']
        title = website['title']
        location = website['normalized_location']
        job_link = 'https://www.amazon.jobs' + website['job_path']
        yield site, title, location, job_link


def main():
    # Offsets start at 0 and increase by 10 per page (10 results per page).
    pages = [str(i) for i in range(0, 9990, 10)]
    with open('amazon_jobs.csv', "w", newline='', encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        writer.writerows(get_all_jobs(pages))


if __name__ == "__main__":
    main()
I'm no expert on Amazon's anti-bot policies, but if they have flagged you once, your IP could be flagged for a while, and they might have a limit on how many similar requests you can make in a certain time frame.
Google for a patch to urllib so you can see the request headers in real time. Beyond per-IP/domain rate limits, Amazon will look at your request headers to decide whether you're human, so compare what you're sending with the request headers a regular browser sends.
Just follow standard practice: keep cookies for a normal amount of time, use proper referers, and use a popular user agent.
All of this can be done with the requests library (pip install requests); see its Session object.
It looks like you're sending a request to an internal Amazon URL without a Referer header. That doesn't happen in a normal browser.
Another example: keeping cookies from one user agent and then switching to another user agent is also not what a browser does.
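For illustration, a minimal sketch of that advice using a requests Session (the User-Agent and Referer values are placeholders, the URL is shortened for readability, and the delay range is just an example):

import random
import time

import requests

session = requests.Session()  # reuses cookies and connections across requests
session.headers.update({
    # Placeholder: pick one realistic, popular user agent and keep it fixed
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    # The JSON endpoint is normally called from the search page, so send that as the referer
    'Referer': 'https://www.amazon.jobs/en/search',
    'Accept': 'application/json',
})

def fetch_page(offset):
    # Same search endpoint as in the question, with most query parameters omitted here
    url = 'https://www.amazon.jobs/en/search.json?offset={}&result_limit=10'.format(offset)
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

for offset in range(0, 50, 10):
    data = fetch_page(offset)
    print(offset, len(data.get('jobs', [])))
    time.sleep(random.uniform(8, 15))  # polite delay between requests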
I am using kafka-python and BeautifulSoup to scrape a website I visit often and send a message to a Kafka broker with a Python producer.
What I want to do is: whenever a new post is uploaded on the website (it is a community site, a bit like Reddit, that Korean hip-hop fans use to share information), that post should be sent to the Kafka broker.
However, my problem is that within the while loop, only the latest post keeps being sent to the Kafka broker repeatedly. This is not what I want.
A second problem is that when a new post is loaded, an HTTP Error 502: Bad Gateway occurs on
soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
and no more messages are sent.
This is dataScraping.py:
from bs4 import BeautifulSoup
import re
import urllib.request

pattern = re.compile('[0-9]+')


def parseContent():
    soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")

    # Drop the pinned notice rows so only regular posts remain
    for div in soup.find_all("tr", class_="notice"):
        div.decompose()

    # The first remaining row is the most recent post
    key_num = pattern.findall(soup.find_all("td", class_="no")[0].text)
    category = soup.find_all("td", class_="categoryTD")[0].find("span").text
    author = soup.find_all("td", class_="author")[0].find("span").text
    title = soup.find_all("td", class_="title")[0].find("a").text
    link = "http://hiphople.com" + soup.find_all("td", class_="title")[0].find("a").attrs["href"]

    # Fetch the post itself and strip the HTML tags from its body
    soup2 = BeautifulSoup(urllib.request.urlopen(link).read(), "html.parser")
    content = str(soup2.find_all("div", class_="article-content")[0].find_all("p"))
    content = re.sub("<.+?>", "", content, 0).strip()
    content = re.sub("\xa0", "", content, 0).strip()

    result = {"key_num": key_num, "category": category, "title": title, "author": author, "content": content}
    return result


if __name__ == "__main__":
    print("data scraping from website")
And this is PythonWebScraping.py:
import json
from kafka import KafkaProducer
from dataScraping import parseContent


def json_serializer(data):
    return json.dumps(data).encode("utf-8")


producer = KafkaProducer(acks=1, compression_type="gzip", bootstrap_servers=["localhost:9092"],
                         value_serializer=json_serializer)

if __name__ == "__main__":
    while True:
        result = parseContent()
        producer.send("hiphople", result)
Please let me know how to fix my code so that newly created posts are sent to the Kafka broker as I expect.
Your function is working, but it's true that it returns only one event. I did not get a 502 Bad Gateway; you might be getting it as DDoS protection because you're hitting the URL too often. Try adding delays/sleep, or your IP may have been banned to stop it from scraping the URL.
As for your second problem: your function returns only the latest post, and you send that result to Kafka on every loop iteration, which is why you see the same message over and over again. You are scraping the page and taking the latest post each time. What did you want your function to do? To only act on new posts, keep track of the previous result:
prevResult = ""
while(True):
result = parseContent()
if(prevResult!=result):
prevResult = result
print( result )
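Putting both suggestions together, a rough sketch of the polling loop with the duplicate check before producer.send and a delay between scrapes (the 60-second interval is only an example; the topic name and producer settings are carried over from your own code):

import json
import time

from kafka import KafkaProducer

from dataScraping import parseContent


def json_serializer(data):
    return json.dumps(data).encode("utf-8")


producer = KafkaProducer(acks=1, compression_type="gzip", bootstrap_servers=["localhost:9092"],
                         value_serializer=json_serializer)

if __name__ == "__main__":
    prev_result = None
    while True:
        result = parseContent()
        # Only send when the scraped post differs from the last one we saw
        if result != prev_result:
            prev_result = result
            producer.send("hiphople", result)
        # Wait before polling the site again to avoid hammering it
        time.sleep(60)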
I'm experimenting with proxy servers. I want to create a bot that connects to my web server every few minutes and scrapes a file (namely index.html) for changes.
I tried to apply things I learned in some multi-hour Python tutorials, and to make it a bit more fun I decided to use random proxies.
So I wrote down this method:
import requests
from bs4 import BeautifulSoup
from random import choice


# Here I get the proxy from a proxy list by processing a table embedded in HTML with BeautifulSoup
def get_proxy():
    print("inside get_proxy")
    proxyDomain = 'https://free-proxy-list.net/'
    r = requests.get(proxyDomain)
    print("making the soup now")
    soup = BeautifulSoup(r.content, 'html.parser')
    table = soup.find('table', {'id': 'proxylisttable'})
    # this part works
    # print(table.get_text)
    print("time for the list")
    ipAddresses = []
    for row in table.findAll('tr'):
        columns = row.findAll('td')
        try:
            ipAddresses.append("https://" + str(columns[0].get_text()) + ":" + str(columns[1].get_text()))
            # ipList.append(str(columns[0].get_text()) + ":" + str(columns[1].get_text()))
        except:
            pass
    # here the program returns one random IP address from the list
    return choice(ipAddresses)
    # return 'https://' + choice(iplist)


def proxy_request(request_type, url, **kwargs):
    print("inside proxy_request")
    while 1:
        try:
            proxy = get_proxy()
            print("today we are using {}".format(proxy))
            # so until this line everything seems to work as I want it to
            # now the next line should do the proxied request, and at the end of the loop it should return some HTML text...
            r = requests.request(request_type, url, proxies=proxy, timeout=5, **kwargs)
            break
        except:
            pass
    return r


def launch():
    print("inside launch")
    r = proxy_request('get', 'https://mysliwje.uber.space.')
    ### but this text never arrives here - maybe the request is being carried out the wrong way
    ### does anybody have an idea how to solve this so that it works?
    print(r.text)


launch()
As I explained in the code comments, the code works nicely up to a point: it picks a random IP from the proxy list and even prints it to the CLI. But the next step suddenly seems to go wrong, because the tool just goes back and scrapes a new IP address, and another, and another, and another... from a list that seems to be updated every few minutes.
So I'm asking myself what is happening. Why don't I ever see the simple HTML code of my index page?
Anybody any idea?
Thanks!
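For reference, requests expects the proxies argument as a dict mapping a URL scheme to a proxy URL rather than a bare string (and the bare except around the request would hide any error raised by that mismatch). A minimal, untested sketch of the request line along those lines, with a placeholder proxy address:

import requests

# Sketch: proxies as a scheme -> proxy URL mapping, not a plain string
proxy = 'https://1.2.3.4:8080'  # placeholder; in the code above this would come from get_proxy()
proxies = {'http': proxy, 'https': proxy}
r = requests.request('get', 'https://mysliwje.uber.space.', proxies=proxies, timeout=5)
print(r.status_code)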
I am working on a project that involves downloading historical stock price data from Yahoo Finance. A step in this process involves determining the correct cookie and crumb to use with a URL to download the data. The code I currently have only works sometimes, i.e., the stock data is retrieved without any problems, but it fails at seemingly random iterations. In the full problem, I am downloading data for multiple stocks. The problem I am encountering is a lack of consistency in retrieving the data. I run into an issue where the data that comes back is
b'{\n"finance":{\n"error":{\n"code":"Unauthorized",\n"description":"Invalid cookie"\n}\n}\n}\n
So, I believe the problem lies in the cookie retrieval step.
To test the issue, I wrote a little script that attempts to download data for the same stock over 20 iterations. When running this, I will typically have about 18 or so iterations that work properly, and the others will not work. The iterations in which this happens changes each time I execute the test script.
Here is the test code I have been using thus far:
import requests
import time
import re

for k in range(20):
    symbol = 'AMZN'
    url = "https://finance.yahoo.com/quote/%s/?p=%s" % (symbol, symbol)
    r = requests.get(url, timeout=10)
    cookie = r.cookies
    lines = r.content.decode('latin-1').replace('\\', '')
    lines = lines.replace('}', '\n')
    lines = lines.split('\n')
    for l in lines:
        if re.findall(r'CrumbStore', l):
            crumb = l.split(':')[2].strip('"')
    start_date = int(int(time.time()) - 15 * 86400)
    end_date = int(time.time())
    url = "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%s&period2=%s&interval=1d&events=history&crumb=%s" % (symbol, start_date, end_date, crumb)
    response = requests.get(url, cookies=cookie, timeout=10)
    for block in response.iter_content(1024):
        print(block)
    print(k)
I would expect this to return the stock price data each time, similar to:
b'Date,Open,High,Low,Close,Adj Close,Volume\n2019-06-06,1737.709961,1760.000000,1726.130005,1754.359985,1754.359985,3689300\n2019-06-07,1763.699951,1806.250000,1759.489990,1804.030029,1804.030029,4808200\n2019-06-10,1822.000000,1884.869995,1818.000000,1860.630005,1860.630005,5371000'
However, I get the error sometimes. Is there a more reliable way to ensure the data is downloaded properly? I know that I am able to access and download it, but the code is unreliable.
Note that this is similar to trying to access the data with a bad cookie/crumb directly in a browser, for example via the url:
https://query1.finance.yahoo.com/v7/finance/download/AMZN?period1=1559367165&period2=1560663165&interval=1d&events=history&crumb=ODCkS0u002FOZyL
Thank you for the help.
The problem is not related to your code; it is on the Yahoo server side. The issue appears to be random: once you hit an invalid cookie, you keep getting invalid cookies, so I don't think the loop by itself will get you past this problem.
Maybe you can save a working cookie and crumb pair once, and then load that pair whenever you fetch data.
# Save cookie.
import pickle

with open('cookie', 'wb') as f:
    pickle.dump(cookie, f)

Loading the cookie is:

# Load cookie into a fresh requests session.
import requests

session = requests.Session()
with open('cookie', 'rb') as f:
    session.cookies.update(pickle.load(f))
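Putting it together, a rough sketch of caching a working cookie and crumb pair and reusing it for downloads (the crumb parsing is just borrowed from your own code, and the cache file names are arbitrary):

import os
import pickle
import time

import requests

COOKIE_FILE, CRUMB_FILE = 'cookie.pkl', 'crumb.txt'  # arbitrary cache file names


def get_cookie_and_crumb(symbol='AMZN'):
    """Fetch a cookie jar and crumb once, caching them on disk for reuse."""
    if os.path.exists(COOKIE_FILE) and os.path.exists(CRUMB_FILE):
        with open(COOKIE_FILE, 'rb') as f:
            cookies = pickle.load(f)
        with open(CRUMB_FILE) as f:
            return cookies, f.read().strip()

    r = requests.get("https://finance.yahoo.com/quote/%s/?p=%s" % (symbol, symbol), timeout=10)
    crumb = None
    # Same CrumbStore parsing as in the question's test script
    for line in r.content.decode('latin-1').replace('\\', '').replace('}', '\n').split('\n'):
        if 'CrumbStore' in line:
            crumb = line.split(':')[2].strip('"')
    with open(COOKIE_FILE, 'wb') as f:
        pickle.dump(r.cookies, f)
    with open(CRUMB_FILE, 'w') as f:
        f.write(crumb or '')
    return r.cookies, crumb


def download_history(symbol, cookies, crumb):
    end = int(time.time())
    start = end - 15 * 86400
    url = ("https://query1.finance.yahoo.com/v7/finance/download/%s"
           "?period1=%s&period2=%s&interval=1d&events=history&crumb=%s" % (symbol, start, end, crumb))
    return requests.get(url, cookies=cookies, timeout=10).content


cookies, crumb = get_cookie_and_crumb()
print(download_history('AMZN', cookies, crumb)[:100])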
I encountered exactly this issue, trying to download historical data for a list of stocks. I fixed it using the brute force method, iterating 5 times over each request then breaking out of the loop if the request succeeds. If the request works, the first few characters returned are "Date,Open".
String crumb = getCrumb(securityID); // get the crumb
String yahooQuery = "?period1=" + fromDate + "&period2=" + toDate +
        "&interval=1d&events=history&crumb=" + crumb;
String requestURL =
        "https://query1.finance.yahoo.com/v7/finance/download/" + securityID +
        yahooQuery;
PostRequest stockDetail = new PostRequest(requestURL);
for (int tries = 0; tries < 5; tries++) {
    // Get history CSV file
    stockDetail.send();
    if (stockDetail.getContent().contains("Date,Open")) {
        break;
    }
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
    }
}
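In Python, the same brute-force retry idea might look roughly like this (a sketch only; the URL and cookies come from wherever you build them, as in the question's code):

import time

import requests


def download_with_retries(url, cookies, tries=5, delay=1.0):
    """Retry the download a few times; a good response starts with the CSV header."""
    for attempt in range(tries):
        response = requests.get(url, cookies=cookies, timeout=10)
        if response.text.startswith("Date,Open"):
            return response.text
        time.sleep(delay)  # brief pause before retrying, as in the Java version
    raise RuntimeError("Download kept failing after %d tries" % tries)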
I am using Python and cookielib to talk to an HTTP server that has its date incorrectly set. I have no control over this server, so fixing its time is not a possibility. Unfortunately, the server's incorrect time messes up cookielib because the cookies appear to be expired.
Interestingly, if I go to the same website with any web browser, the browser accepts the cookie and it gets saved. I assume that modern web browsers come across misconfigured web servers all the time, see that the Date header is set incorrectly, and adjust cookie expiration dates accordingly.
Has anyone come across this problem before? Is there any way of handling it within Python?
I hacked together a solution that involves live monkey-patching of the cookielib module. It is definitely not ideal, but if others find a better way, please let me know:
import cookielib
import logging
import threading
import time
import urllib2

cook_proc = urllib2.HTTPCookieProcessor(cookielib.LWPCookieJar())

cookie_processing_lock = threading.Lock()


def _process_cookies(request, response):
    '''Process cookies, but do so in a way that can handle servers with bad
    clocks set.'''
    # We do some real monkey hacking here, so put it in a lock.
    with cookie_processing_lock:
        # Get the server date.
        date_header = cookielib.http2time(
            response.info().getheader('Date') or '')

        # Save the old cookie parsing function.
        orig_parse = cookielib.parse_ns_headers

        # If the server date is off by more than an hour, we'll adjust it.
        if date_header:
            off_by = time.time() - date_header
            if abs(off_by) > 3600:
                logging.warning("Server off %.1f hrs." % (abs(off_by) / 3600))

                # Create our monkey-patched parser that shifts expiration times.
                def hacked_parse(ns_headers):
                    try:
                        results = orig_parse(ns_headers)
                        for r in results:
                            for r_i, (key, val) in enumerate(r):
                                if key == 'expires':
                                    r[r_i] = key, val + off_by
                                    logging.info("Fixing bad cookie "
                                                 "expiration time for: %s" % r[0][0])
                        logging.info("COOKIE RESULTS: %s", results)
                        return results
                    except Exception as e:
                        logging.error("Problem parsing cookie: %s" % e)
                        raise

                cookielib.parse_ns_headers = hacked_parse

        response = cook_proc.http_response(request, response)

        # Make sure we set the cookie parser back.
        cookielib.parse_ns_headers = orig_parse
I have code that retrieves news results from this newspaper, given a query and a time frame (which could be up to a year).
The results are paginated at 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then retrieve the title, URL and date of each article. Each cycle (the HTTP request plus the parsing) takes from 30 seconds to a minute, which is extremely slow, and eventually it stops with a response code of 500. I am wondering if there are ways to speed it up, or maybe a way to make multiple requests at once. I simply want to retrieve the article details from all the pages.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
import csv

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'


def run(**params):
    countryFile = open("EgyptDaybyDay.csv", "a")
    i = 1
    results = True
    while results:
        params["index"] = str(i)
        response = requests.get(URL.format(**params))
        print response.status_code
        htmlFile = BeautifulSoup(response.content)
        articles = htmlFile.findAll("div", {"class": "newslist"})

        for article in articles:
            url = (article.a['href']).encode('utf-8', 'ignore')
            title = (article.img['alt']).encode('utf-8', 'ignore')
            dateline = article.find("div", {"class": "floatright"})
            m = re.search("([0-9]{2}\-[0-9]{2}\-[0-9]{4})", dateline.string)
            date = m.group(1)
            w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            w.writerow((date, title, url))

        if not articles:
            results = False

        i += 1

    countryFile.close()


run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")
This is a good opportunity to try out gevent.
You should have a separate routine for the requests.get part so that your application doesn't have to wait on blocking IO.
You can then spawn multiple workers and use queues to pass requests and articles around.
Maybe something similar to this:
import gevent.monkey
from gevent.queue import Queue
from gevent import sleep
gevent.monkey.patch_all()

MAX_REQUESTS = 10

requests = Queue(MAX_REQUESTS)
articles = Queue()

mock_responses = range(100)
mock_responses.reverse()


def request():
    print "worker started"
    while True:
        print "request %s" % requests.get()
        sleep(1)

        try:
            articles.put('article response %s' % mock_responses.pop())
        except IndexError:
            articles.put(StopIteration)
            break


def run():
    print "run"
    i = 1
    while True:
        requests.put(i)
        i += 1


if __name__ == '__main__':
    for worker in range(MAX_REQUESTS):
        gevent.spawn(request)

    gevent.spawn(run)

    for article in articles:
        print "Got article: %s" % article
The most probable slowdown is the server itself, so parallelising the HTTP requests is the best way to make the code run faster, although there's very little you can do to speed up the server's responses. There's a good tutorial over at IBM for doing exactly this.
It seems to me that you're looking for a feed, which that newspaper doesn't advertise. However, it's a problem that has been solved before -- there are many sites that will generate feeds for you for an arbitrary website thus at least solving one of your problems. Some of these require some human guidance, and others have less opportunity for tweaking and are more automatic.
If you can at all avoid doing the pagination and parsing yourself, I'd recommend it. If you cannot, I second the use of gevent for simplicity. That said, if they're sending you back 500's, your code is likely less of an issue and added parallelism may not help.
You could try making all the calls asynchronously.
Have a look at this: http://pythonquirks.blogspot.in/2011/04/twisted-asynchronous-http-request.html
You could use gevent as well rather than Twisted; I'm just laying out the options.
This might very well come close to what you're looking for: Ideal method for sending multiple HTTP requests over Python? [duplicate]
Source code: https://github.com/kennethreitz/grequests
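For completeness, a minimal grequests sketch (grequests.get and grequests.map are the library's documented entry points; the page count and concurrency level below are just examples):

import grequests

# Build unsent requests for the first 20 result pages, then fire them concurrently.
urls = ['http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={}&keywordtitle=Egypt'
        '&keywordbrief=Egypt&keywordbody=Egypt&category=&timeframe=&datefrom=12-01-2010'
        '&dateTo=12-01-2011&isTimeFrame=0'.format(i) for i in range(1, 21)]

pending = (grequests.get(u, timeout=10) for u in urls)
for response in grequests.map(pending, size=5):  # at most 5 requests in flight at a time
    if response is not None:
        print("%s %s" % (response.status_code, len(response.content)))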