Get sourcecode for urls [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
I have the following code:
import urllib2
from itertools import product
with open('urllist.txt') as urllist:
    urls=[line.strip() for line in urllist]
for url in product(urls):
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()
    sourcecode=open('./sourcecode', 'w+')
    sourcecode.write(data)
When I ran it, it gave:
Traceback (most recent call last):
  File "12.py", line 8, in <module>
    usock = urllib2.urlopen(url)
  File "/opt/python2.7.1/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/opt/python2.7.1/lib/python2.7/urllib2.py", line 383, in open
    req.timeout = timeout
AttributeError: 'tuple' object has no attribute 'timeout'
Any idea how to fix it? Many thanks!

itertools.product yields tuples, not the items themselves, even when it is given a single iterable:
>>> from itertools import product
>>> lis = ['a','b','c']
>>> for p in product(lis):
...     print p
...
('a',)
('b',)
('c',)
Use a simple loop over urls:
for url in urls:
    usock = urllib2.urlopen(url)
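A Python 3 sketch of the same loop, assuming one URL per line in the input. The names `read_urls` and `fetch_all` and the injectable `fetch` parameter are illustrative and not part of the original code:

```python
from urllib.request import urlopen


def read_urls(lines):
    """Return the stripped, non-empty lines as a list of URLs."""
    return [line.strip() for line in lines if line.strip()]


def fetch_all(urls, fetch=None):
    """Map each URL to its raw response body (bytes)."""
    if fetch is None:
        # Default fetcher: plain urlopen, closed automatically by `with`.
        def fetch(url):
            with urlopen(url) as response:
                return response.read()
    # Iterate over the list itself, so each `url` is a string, not a tuple.
    return {url: fetch(url) for url in urls}
```

Iterating over `urls` directly passes each string to `urlopen`, whereas `product(urls)` would pass one-element tuples. With a real file, `read_urls(open('urllist.txt'))` would supply the list.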

Related

NameError in function to retrieve JSON data

I'm using Python 3.6.1 and have the following code, which successfully retrieves data in JSON format:
import urllib.request,json,pprint
url = "https://someurl"
response = urllib.request.urlopen(url)
data = json.loads(response.read())
pprint.pprint(data)
I want to wrap this in a function so I can reuse it. This is what I have tried in a file called getdata.py:
from urllib.request import urlopen
import json
def get_json_data(url):
    response = urlopen(url)
    return json.loads(response.read())
And this is the error I get after importing the file and attempting to print the response:
>>> import getdata
>>> print(getdata.get_json_data("https://someurl"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Nick\getdata.py", line 6, in get_json_data
    from urllib.request import urlopen
NameError: name 'urllib' is not defined
I also tried this and got the same error:
import urllib.request,json
def get_json_data(url):
    response = urllib.request.urlopen(url)
    return json.loads(response.read())
What do I need to do to get this to work, please?
Cheers
It's working now! I think the problem was the Hydrogen add-on I have for the Atom editor. I uninstalled it, tried again, and it worked. Thanks for looking.
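For reference, a minimal sketch of the reusable function on Python 3. The `opener` parameter is a hypothetical addition so the function can be exercised without a network call, and UTF-8 is an assumed response encoding:

```python
import json
from urllib.request import urlopen


def get_json_data(url, opener=urlopen):
    """Fetch url and return the decoded JSON document."""
    with opener(url) as response:
        # read() returns bytes; decode explicitly (UTF-8 is assumed here).
        return json.loads(response.read().decode("utf-8"))
```

A traceback that quotes a source line which does not match the code being executed, as in the question, often points to a stale import: the interpreter is still running an older version of the module. In an interactive session, `importlib.reload(getdata)` picks up the edited file.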

NameError: name 'tweets_source' not defined | But it has default value [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 6 years ago.
I have this code:
import json
class TweetFormatter():
    def __init__(self, source_file='twitter_data.txt'):
        self.tweets_source = source_file
    def convert2json(self, tweets_source=None):
        tweets_data = []
        if tweets_source is None:
            if self.tweets_source is not None:
                tweets_source = self.tweets_source
            else:
                raise ValueError("You need to specify a file")
        with open(tweets_source, "r") as tweets_file:
            for line in tweets_file:
                try:
                    tweet = json.loads(line)
                    tweets_data.append(tweet)
                except:
                    continue
        return tweets_data
And I'm calling the convert2json method like this:
formatter = TweetFormatter('twitter_data.txt')
tweets = formatter.convert2json()
Python throws an error:
Traceback (most recent call last):
  File "analysis.py", line 12, in <module>
    tweets = formatter.convert2json()
  File "/home/azaroma/analisis-machismo/app/tweet_formatter.py", line 11, in convert2json
    with open(tweets_source, "r") as tweets_file:
NameError: name 'tweets_source' is not defined
I believe that tweets_source should be defined because it has a default value. Also, if I call formatter.convert2json('twitter_data.txt'), Python throws:
Traceback (most recent call last):
  File "analysis.py", line 12, in <module>
    tweets = formatter.convert2json('twitter_data.txt')
TypeError: convert2json() takes 1 positional argument but 2 were given
This has to be something really simple but I can't figure it out.
Thanks to @elethan and @cricket_007, who confirmed the error could not be reproduced. I found out that my Emacs theme does not show the 'file does not exist' character, so I was working on another file. At some point I copied the file and didn't kill the Emacs buffer, which is why my changes weren't reflected when the code ran. Sorry.
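Separately from the stale-file problem, the bare `except:` in the loop above would also hide unrelated failures. A sketch of the same line-by-line parsing that only skips lines that are not valid JSON, assuming Python 3.5+ (where `json.JSONDecodeError` exists); `parse_json_lines` is an illustrative name:

```python
import json


def parse_json_lines(lines):
    """Return the parsed value of every line that holds valid JSON."""
    parsed = []
    for line in lines:
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            # Skip blank or malformed lines, but let other errors surface.
            continue
    return parsed
```

Any iterable of strings works as input, including an open text file.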

Reading the contents of a webpage with Python

I am trying to get the contents of a webpage. For some reason whenever I try urlopen it says there is "no such resource". I also can't use urllib2.
I would simply like to get the contents of a webpage such as http://www.example.com
import urllib
import re
textfile = open('depth_1.txt','w')
print("Enter the URL you wish to crawl..")
print('Usage - "http://phocks.org/stumble/creepy/" <-- With the double quotes')
myurl = input("#> ")
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
    print(i)
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
        print(ee)
        textfile.write(ee+'\n')
textfile.close()
Here is the error:
Traceback (most recent call last):
  File "/Users/austinhitt/Desktop/clases_example.py", line 8, in <module>
    for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
AttributeError: module 'urllib' has no attribute 'urlopen'
If you only need the content, use requests; if you want to do more with the content, look at Scrapy. Example:
import requests
r = requests.get('http://scrapy.org')
r.content
r.headers
r.status_code
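The AttributeError in the question arises because, in Python 3, `urlopen` lives in `urllib.request` rather than at the top level of `urllib`. A standard-library sketch of the link extraction follows; `extract_links` and `fetch_text` are illustrative names, the regular expression is a simplified version of the one in the question, and UTF-8 is an assumed page encoding:

```python
import re
from urllib.request import urlopen

# Simplified pattern: the quoted value of any href attribute, case-insensitive.
HREF_RE = re.compile(r'''href=["']([^"']+)["']''', re.I)


def extract_links(html):
    """Return every href value found in an HTML string."""
    return HREF_RE.findall(html)


def fetch_text(url, encoding="utf-8"):
    """Download url and decode the body to str (the encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")
```

Decoding the response before matching matters on Python 3, because `read()` returns bytes and a str pattern cannot be applied to bytes. Typical use would be `extract_links(fetch_text(myurl))`.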

Not clear on why my function is returning none

I have very limited coding background except for some Ruby, so if there's a better way of doing this, please let me know!
Essentially I have a .txt file full of words. I want to import the .txt file and turn it into a list. Then, I want to take the first item in the list, assign it to a variable, and use that variable in an external request that sends off to get the definition of the word. The definition is returned, and tucked into a different .txt file. Once that's done, I want the code to grab the next item in the list and do it all again until the list is exhausted.
Below is my code in progress to give an idea of where I'm at. I'm still trying to figure out how to iterate through the list correctly, and I'm having a hard time interpreting the documentation.
Sorry in advance if this was already asked! I searched, but couldn't find anything that specifically answered my issue.
from __future__ import print_function
import requests
import urllib
from bs4 import BeautifulSoup
def get_definition(x):
    url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={0}'.format(x)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    return soup.find('pre', text=True)[0]
lines = []
with open('vocab.txt') as f:
    lines = f.readlines()
lines = [line.strip() for line in lines]
definitions = []
for line in lines:
    definitions.append(get_definition(line))
out_str = '\n'.join(definitions)
with open('definitions.txt', 'w') as f:
    f.write(out_str)
The problem I'm having is:
Traceback (most recent call last):
  File "WIP.py", line 20, in <module>
    definitions.append(get_definition(line))
  File "WIP.py", line 11, in get_definition
    return soup.find('pre', text=True)[0]
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 958, in __getitem__
    return self.attrs[key]
KeyError: 0
I understand that soup.find('pre', text=True) is returning None, but not why or how to fix it.
Your problem is that find() returns a single result, not a list. The result is a Tag object with dict-like access to its attributes, so [0] looks up an attribute named 0 and raises KeyError.
Just remove the [0] and you should be fine.
Also, soup.find(...) is not returning None. It is returning a result! If it were returning None, you would get this error instead:
TypeError: 'NoneType' object has no attribute '__getitem__'
See the Beautiful Soup documentation for find().
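To illustrate the "one result or None" contract without depending on Beautiful Soup, here is a standard-library sketch that uses `html.parser` instead (a deliberate swap, not what the question uses). `FirstPreText` and `first_pre_text` are hypothetical names:

```python
from html.parser import HTMLParser


class FirstPreText(HTMLParser):
    """Collect the text inside the first <pre> element only."""

    def __init__(self):
        super().__init__()
        self.found = False      # True once a <pre> has been opened
        self._inside = False    # True while inside that first <pre>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self.found:
            self.found = True
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.parts.append(data)


def first_pre_text(html):
    """Return the first <pre> block's text, or None if there is none."""
    parser = FirstPreText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.found else None
```

With Beautiful Soup itself, the corresponding pattern would be to call `soup.find('pre')`, check the result for None, and then call `.get_text()` on it, since `'\n'.join(...)` in the question expects strings rather than Tag objects.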

Reading the content of robots.txt in Python and printing it [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I want to check whether a given website has a robots.txt, read the whole content of that file, and print it. Adding the content to a dictionary would also be very useful.
I've tried playing with the robotparser module but can't figure out how to do it.
I would like to use only modules that come with the standard Python 2.7 package.
I did as @Stefano Sanfilippo suggested:
from urllib.request import urlopen
which returned:
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    from urllib.request import urlopen
ImportError: No module named request
So I tried:
import urllib2
from urllib2 import Request
from urllib2 import urlopen
with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))
but got:
Traceback (most recent call last):
  File "", line 1, in
    with urlopen("https://www.google.com/robots.txt") as stream:
AttributeError: addinfourl instance has no attribute '__exit__'
According to bugs.python.org, it seems this is not supported in Python 2.7.
In fact, the code works fine with Python 3.
Any idea how to work around this?
Yes, robots.txt is just a file: download it and print it!
Python 3:
from urllib.request import urlopen
with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))
Python 2:
from urllib import urlopen
from contextlib import closing
with closing(urlopen("https://www.google.com/robots.txt")) as stream:
    print stream.read()
Note that the path is always /robots.txt.
If you need to put the content in a dictionary, .split(":") and .strip() are your friends.
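One possible reading of that hint, as a Python 3 sketch: split each line once on the first colon and group the values by field name. `parse_robots` is an illustrative name, and the flat grouping is an assumption about the desired dictionary shape (it does not keep track of which rules belong to which User-agent block):

```python
from collections import defaultdict


def parse_robots(text):
    """Group robots.txt directives by lower-cased field name."""
    rules = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank line or something that is not a directive
        # Split only once: values such as sitemap URLs contain ':' themselves.
        field, value = line.split(":", 1)
        rules[field.strip().lower()].append(value.strip())
    return dict(rules)
```

The `text` argument would be the decoded string read from the `/robots.txt` URL, as in the download snippets above.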
