I am trying to implement a simple web crawler and have already written some simple code to start off. There are two modules, fetcher.py and crawler.py. Here are the files:
fetcher.py:
import urllib2
import re


def fetcher(s):
    "fetch a web page from a url"
    try:
        req = urllib2.Request(s)
        urlResponse = urllib2.urlopen(req).read()
    except urllib2.URLError as e:
        print e.reason
        return
    p, q = s.split("//")
    d = q.split("/")
    fdes = open(d[0], "w+")
    fdes.write(str(urlResponse))
    fdes.seek(0)
    return fdes


if __name__ == "__main__":
    defaultSeed = "http://www.python.org"
    print fetcher(defaultSeed)
crawler.py:
from bs4 import BeautifulSoup
import re
from fetchpage import fetcher

usedLinks = open("Used", "a+")
newLinks = open("New", "w+")
newLinks.seek(0)


def parse(fd, var=0):
    soup = BeautifulSoup(fd)
    for li in soup.find_all("a", href=re.compile("http")):
        newLinks.seek(0, 2)
        newLinks.write(str(li.get("href")).strip("/"))
        newLinks.write("\n")
    fd.close()
    newLinks.seek(var)
    link = newLinks.readline().strip("\n")
    return str(link)


def crawler(seed, n):
    if n == 0:
        usedLinks.close()
        newLinks.close()
        return
    else:
        usedLinks.write(seed)
        usedLinks.write("\n")
        fdes = fetcher(seed)
        newSeed = parse(fdes, newLinks.tell())
        crawler(newSeed, n - 1)


if __name__ == "__main__":
    crawler("http://www.python.org/", 7)
The problem is that when I run crawler.py it works fine for the first 4-5 links, then it hangs, and after a minute gives me the following error:
[Errno 110] Connection timed out
Traceback (most recent call last):
File "crawler.py", line 37, in <module>
crawler("http://www.python.org/",7)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 34, in crawler
crawler(newSeed,n-1)
File "crawler.py", line 33, in crawler
newSeed = parse(fdes,newLinks.tell())
File "crawler.py", line 11, in parse
soup = BeautifulSoup(fd)
File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
self.builder.prepare_markup(markup, from_encoding))
File "/usr/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 191, in __init__
self._detectEncoding(markup, is_html)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 362, in _detectEncoding
xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer
Can anyone help me with this? I am very new to Python and I am unable to figure out why it says connection timed out after some time.
A connection timeout is not specific to Python; it just means that you made a request to the server and the server did not respond within the amount of time your application was willing to wait.
One very possible reason this could occur is that python.org may have some mechanism to detect when it is getting multiple requests from a script, and simply stops serving pages after 4-5 requests. There is nothing you can really do to avoid this other than trying your script on a different site.
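If you would rather have the request fail quickly than hang for a minute, note that urllib2.urlopen accepts a timeout argument in seconds. A minimal sketch (not part of the original code):

import urllib2

try:
    # fail after 10 seconds instead of waiting for the OS-level timeout
    response = urllib2.urlopen("http://www.python.org", timeout=10)
    html = response.read()
except urllib2.URLError as e:
    print "request failed:", e.reason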
You could try using proxies to avoid being detected on multiple requests, as stated above. You might want to check out this answer to get an idea of how to send urllib requests with proxies: How to open website with urllib via Proxy - Python
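In urllib2 terms, that boils down to installing a ProxyHandler-based opener. A rough sketch (the proxy address below is a placeholder you would replace with a proxy you actually have access to):

import urllib2

# placeholder proxy address; substitute a real proxy
proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8080"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)  # all later urlopen calls go through the proxy

print urllib2.urlopen("http://www.python.org").read()[:200]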
I'm trying to scrape the HTML code of a new Chrome tab, but I can't find a way that works using Python.
Here's what I've tried:
I've tried the requests module, but this code:
import requests
URL = "chrome://newtab"
page = requests.get(URL)
print(page.text)
Yields this error:
Traceback (most recent call last):
File "c:\Users\Ben Bistline\Code\PythonFiles\PythonFiles\chromescrape.py", line 4, in <module>
page = requests.get(URL)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 649, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\Ben Bistline\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\sessions.py", line 742, in get_adapter
raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for 'chrome://newtab'
I suppose this result makes sense, but I'm not sure how/if I can get around it.
I've also tried using the webbrowser module with this code:
import requests, webbrowser
URL = "chrome://newtab"
chromePath = 'C:/Program Files/Google/Chrome/Application/chrome.exe %s'
webbrowser.get(chromePath).open(URL)
Unfortunately, although successful, this method does not seem to offer a way of gathering the HTML.
Anyone know of any other ways using Python to grab the HTML of a new Chrome tab?
Thanks!
You can use the Selenium WebDriver with Chrome to do that:
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('chrome://newtab')
content = browser.page_source  # the rendered HTML of the tab
browser.close()
Can anyone please tell me how to fix this Traceback (most recent call last): error in Python? I am using Python 2.7.9.
Take a look at the code.
import requests
import optparse

parser = optparse.OptionParser()
parser.add_option("-f", '--filename', action="store", dest="filee")
options, args = parser.parse_args()
file = options.filee
fopen = open(file, 'r')
for x in fopen.readlines():
    print "Checking for Clickjacking vulnerability\n"
    url = x.strip('\n')
    req = requests.get(url)
    try:
        print "[-]Target:" + url + " Not vulnerable\n The targeted victim has %s header\n" % (req.headers['X-Frame-Options'])
    except Exception as e:
        print "[+] Target:" + url + " Vulnerable to clickjacking"
After running the code I got this error at the end:
Traceback (most recent call last):
File "C:\Python27\utkarsh3.py", line 17, in <module>
req = requests.get(url)
File "C:\Python27\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 494, in request
prep = self.prepare_request(req)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 437, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "C:\Python27\lib\site-packages\requests\models.py", line 305, in prepare
self.prepare_url(url, params)
File "C:\Python27\lib\site-packages\requests\models.py", line 379, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?
This really irritates me. I know many people have asked about this before, but I couldn't understand those answers, so I'm asking.
And please tell me: how should we beginners handle these errors?
In ELI5 fashion: a traceback is a log of what the program was trying to do before the actual error happened. Your actual error is requests.exceptions.MissingSchema.
The line that follows, Invalid URL '': No schema supplied. Perhaps you meant http://?, describes the exact problem.
File "C:\Python27\utkarsh3.py", line 17, in <module>
req = requests.get(url)
The lines above describe where the error started.
So if you go to line 17 of your program, you should see this exact same line.
Putting these two things together, I gather that url is a string that is just example.com and not http://example.com, or something along those lines.
I can only speculate so much about what your code might be, but feel free to provide code snippets to explain more.
Either way, I hope this helps you read future tracebacks.
Edit 1: Now that you have added the snippet, try printing url just before requests.get(url) to see exactly what you are trying to reach, and whether you have the right schema prepended.
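For illustration, a minimal sketch of that idea (the file name urls.txt is hypothetical): skip blank lines, which are one way to end up with the '' shown in the error, and prepend a schema when one is missing before calling requests.get:

import requests

with open("urls.txt") as fopen:  # hypothetical input file, one URL per line
    for line in fopen:
        url = line.strip()
        if not url:
            continue  # a blank line would otherwise become the '' in the error
        if not url.startswith(("http://", "https://")):
            url = "http://" + url  # prepend a schema if it is missing
        print url  # verify what you are about to request
        req = requests.get(url)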
I used the following script to scan multiple PDF files contained in a folder with the online scanner https://wepawet.iseclab.org/.
import mechanize
import re
import os


def upload_file(uploaded_file):
    url = "https://wepawet.iseclab.org/"
    br = mechanize.Browser()
    br.set_handle_robots(False)  # ignore robots
    br.open(url)
    br.select_form(nr=0)
    f = os.path.join("200", uploaded_file)
    br.form.add_file(open(f), 'text/plain', f)
    br.form.set_all_readonly(False)
    res = br.submit()
    content = res.read()
    with open("200_clean.html", "a") as f:
        f.write(content)


def main():
    for file in os.listdir("200"):
        upload_file(file)


if __name__ == '__main__':
    main()
But after executing the code I got the following error:
Traceback (most recent call last):
File "test.py", line 56, in <module>
main()
File "test.py", line 50, in main
upload_file(file)
File "test.py", line 40, in upload_file
res = br.submit()
File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 541, in submit
return self.open(self.click(*args, **kwds))
File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error refresh: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
OK
Could anyone help me with this problem?
I think the issue is the MIME type text/plain you set. For PDF, this should be application/pdf. Your code with this change worked for me when I uploaded a sample PDF.
Change the br.form.add_file call to look like this:
br.form.add_file(open(f), 'application/pdf', f)
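As a side note (an assumption on my part, not something the original answer verified), opening the PDF in binary mode is also safer, particularly on Windows:

br.form.add_file(open(f, 'rb'), 'application/pdf', f)  # 'rb' avoids newline translation corrupting the PDF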
I'm making a web page scraper using the BeautifulSoup4 and requests libraries. I had some trouble getting BeautifulSoup working, but got some help and was able to fix that. Now I've run into a new problem and I'm not sure how to fix it. I'm using requests 2.2.1 and trying to run this program on Python 3.1.2, and when I do I get a traceback error.
Here is my code:
from bs4 import BeautifulSoup
import requests

url = input("Enter a URL (start with www): ")
link = "http://" + url
page = requests.get(link).content
soup = BeautifulSoup(page)

for url in soup.find_all('a'):
    print(url.get('href'))
    print()
and the error:
Enter a URL (start with www): www.google.com
Traceback (most recent call last):
File "/Users/user/Desktop/project.py", line 8, in <module>
page = requests.get(link).content
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/sessions.py", line 349, in request
prep = self.prepare_request(req)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/sessions.py", line 287, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/models.py", line 287, in prepare
self.prepare_url(url,params)
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/requests-2.2.1-py3.1.egg/requests/models.py", line 321, in prepare_url
url = str(url)
TypeError: 'tuple' object is not callable
I've done some looking, and when others have gotten this error (in Django mostly) there was a comma missing, but I'm not sure where a comma should go. Any help will be appreciated.
I've got a problem: we're writing a project using Django, and I'm trying to use django.test.client with the nose test framework for tests.
Our code is like this:
from simplejson import loads
from urlparse import urljoin
from django.test.client import Client

TEST_URL = "http://smakly.localhost:9090/"


def test_register():
    cln = Client()
    ref_data = {"email": "unique#mail.com", "name": "Василий", "website": "http://hot.bear.com", "xhr": "true"}
    print urljoin(TEST_URL, "/accounts/register/")
    response = loads(cln.post(urljoin(TEST_URL, "/accounts/register/"), ref_data))
    print response["message"]
And in the nose output I get:
Traceback (most recent call last):
File "/home/psih/work/svn/smakly/eggs/nose-0.11.1-py2.6.egg/nose/case.py", line 183, in runTest
self.test(*self.arg)
File "/home/psih/work/svn/smakly/src/smakly.tests/smakly/tests/frontend/test_profile.py", line 25, in test_register
response = loads(cln.post(urljoin(TEST_URL, "/accounts/register/"), ref_data))
File "/home/psih/work/svn/smakly/parts/django/django/test/client.py", line 313, in post
response = self.request(**r)
File "/home/psih/work/svn/smakly/parts/django/django/test/client.py", line 225, in request
response = self.handler(environ)
File "/home/psih/work/svn/smakly/parts/django/django/test/client.py", line 69, in __call__
response = self.get_response(request)
File "/home/psih/work/svn/smakly/parts/django/django/core/handlers/base.py", line 78, in get_response
urlconf = getattr(request, "urlconf", settings.ROOT_URLCONF)
File "/home/psih/work/svn/smakly/parts/django/django/utils/functional.py", line 273, in __getattr__
return getattr(self._wrapped, name)
AttributeError: 'Settings' object has no attribute 'ROOT_URLCONF'
My settings.py file does have this attribute.
If I get the data from the server with the standard urllib2.urlopen().read() it works properly.
Any ideas how I can solve this?
You probably want django-nose if you want to use nose.
http://github.com/jbalogh/django-nose
I would recommend using the TestCase class:
http://docs.djangoproject.com/en/dev/topics/testing/
http://www.djangoproject.com/documentation/models/test_client/
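If you go the TestCase route, the test client is created for you and the URLconf comes from the settings the test runner loads, so you can post to a relative path. A minimal sketch along the lines of your test (the URL path is taken from your code; the expected status code is an assumption):

from django.test import TestCase


class RegisterTest(TestCase):
    def test_register(self):
        ref_data = {"email": "unique#mail.com", "name": "Василий",
                    "website": "http://hot.bear.com", "xhr": "true"}
        # self.client is provided by TestCase; a relative URL uses ROOT_URLCONF
        # from the test settings, so no absolute TEST_URL is needed
        response = self.client.post("/accounts/register/", ref_data)
        self.assertEqual(response.status_code, 200)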
Shameless self-promotion: exactly for those reasons I made a test library that enables you to test an application with urllib2.
Docs are here: http://readthedocs.org/docs/django-sane-testing/en/latest/
An example of what you might want to do is in django-http-digest-tests.