Reading the contents of a webpage with Python - python

I am trying to get the contents of a webpage. For some reason whenever I try urlopen it says there is "no such resource". I also can't use urllib2.
I would simply like to get the contents of a webpage such as http://www.example.com
import urllib
import re
textfile = open('depth_1.txt','w')
print("Enter the URL you wish to crawl..")
print('Usage - "http://phocks.org/stumble/creepy/" <-- With the double quotes')
myurl = input("#> ")
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
    print(i)
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
        print(ee)
        textfile.write(ee+'\n')
textfile.close()
Here is the error:
Traceback (most recent call last):
File "/Users/austinhitt/Desktop/clases_example.py", line 8, in <module>
for i in re.findall('''href=["'](.[^"']+)["']''',
urllib.urlopen(myurl).read(), re.I):
AttributeError: module 'urllib' has no attribute 'urlopen'

If you only need the content, use requests; if you want to work with the content further (for example, to crawl the links), consider Scrapy. Example:
import requests
r = requests.get('http://scrapy.org')
r.content
r.headers
r.status_code
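As for the AttributeError itself: in Python 3, urlopen lives in urllib.request rather than at the top level of urllib. Below is a minimal standard-library sketch of the link extraction from the question. The helper names fetch_html and extract_links are made up for illustration, the regular expression is a simplified variant of the one in the question, and decoding as UTF-8 is an assumption about the page's encoding.

```python
import re
from urllib.request import urlopen  # Python 3 location of urlopen

# Simplified href matcher; a real HTML parser is more robust than a regex.
HREF_RE = re.compile(r'''href=["']([^"']+)["']''', re.I)


def fetch_html(url):
    """Download a page and return its body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html_text):
    """Return every href value found in the given HTML text."""
    return HREF_RE.findall(html_text)
```

With these two helpers, extract_links(fetch_html(myurl)) would give the list of links that the outer loop in the question iterates over.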

Related

Why doesn't bs4 find the href attribute?

I'm learning with atbwp, and I'm now writing a program that opens the top 5 search results on a website.
It all works up until I have to get the href for each of the top results and open it. I get this error:
Traceback (most recent call last):
File "C:\Users\Asus\Desktop\pyhton\projects\emagSEARCH.py", line 33, in <module>
webbrowser.open(url)
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 86, in open
if browser.open(url, new, autoraise):
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 603, in open
os.startfile(url)
TypeError: startfile: filepath should be string, bytes or os.PathLike, not NoneType
This is how the html looks:
The No-Brainer Set The Ordinary, Deciem
And this is the part of my code which won't work for some reason..:
Soup=bs4.BeautifulSoup(res.text,'html.parser')
results= Soup.select('.item-title')
numberTabs=min(5,len(results))
print('Opening top '+str(numberTabs)+' top results...')
for i in range(numberTabs):
    url=results[i].get('href')
    webbrowser.open(url)
It does what it should until the for loop. It looks pretty much exactly like the example program in the book, so I don't understand why it doesn't work. What am I doing wrong?
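The traceback shows that url is None, which means the elements matched by .item-title have no href attribute of their own. To extract the href from the a tag, use this: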
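html = '<a href="https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003">The No-Brainer Set The Ordinary, Deciem</a>'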
Soup=bs4.BeautifulSoup(html,'html.parser')
url = Soup.find('a')['href']
print(url)
webbrowser.open(url)
Output:
https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003
You can do the same for every a tag to collect all the hrefs.

NameError in function to retrieve JSON data

I'm using python 3.6.1 and have the following code which successfully retrieves data in JSON format:
import urllib.request,json,pprint
url = "https://someurl"
response = urllib.request.urlopen(url)
data = json.loads(response.read())
pprint.pprint(data)
I want to wrap this in a function so I can reuse it. This is what I have tried, in a file called getdata.py:
from urllib.request import urlopen
import json
def get_json_data(url):
    response = urlopen(url)
    return json.loads(response.read())
And this is the error I get after importing the file and attempting to print out the response:
>>> import getdata
>>> print(getdata.get_json_data("https://someurl"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Nick\getdata.py", line 6, in get_json_data
from urllib.request import urlopen
NameError: name 'urllib' is not defined
I also tried this and got the same error:
import urllib.request,json
def get_json_data(url):
    response = urllib.request.urlopen(url)
    return json.loads(response.read())
What do I need to do to get this to work, please?
Cheers
It's working now! I think the problem was the Hydrogen add-on I have for the Atom editor. I uninstalled it, tried again, and it worked. Thanks for looking.
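For reference, here is a minimal sketch of the same reusable function written a little more defensively. The encoding parameter is an addition for illustration and defaults to UTF-8 as an assumption about the server's response. json.loads accepts bytes directly since Python 3.6, so the explicit decode is only there to make the step visible.

```python
import json
from urllib.request import urlopen


def get_json_data(url, encoding="utf-8"):
    """Fetch a URL and parse the response body as JSON."""
    # The with-block closes the connection even if parsing fails.
    with urlopen(url) as response:
        return json.loads(response.read().decode(encoding))
```

After saving this as getdata.py, a fresh interpreter session can call getdata.get_json_data("https://someurl") exactly as in the question.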

lxml.etree.XPathEvalError: Invalid expression

I am getting an error with Python that I am not able to understand. I have simplified my code to the very bare minimum:
response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/@href')
print(r)
and still get the error
Traceback (most recent call last):
File "ultimate-1.py", line 17, in <module>
    r = tree.xpath('//divass="campaign"]/a/@href')
File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression
Would anyone have an idea of where the issue is coming from? Might it be a dependencies problem? Thanks.
The expression '//divass="campaign"]/a/@href' is not syntactically correct and does not make much sense. Instead you meant to check the class attribute:
//div[@class="campaign"]/a/@href
Now, that would help to avoid the Invalid Expression error, but you would get nothing found by the expression. This is because the data is not there in the response that requests receives. You would need to mimic what the browser does to get the desired data and make an additional request to get the javascript file containing the campaigns.
Here is what works for me:
import ast
import re
import requests
from lxml import html
with requests.Session() as session:
    # extract script url
    response = session.get('http://pycoders.com/archive')
    tree = html.fromstring(response.text)
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0]
    # get the script
    response = session.get(script_url)
    # use response.text (str) so the str pattern also matches on Python 3
    data = ast.literal_eval(re.match(r'document\.write\((.*?)\);$', response.text).group(1))
    # extract the desired data
    tree = html.fromstring(data)
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')]
    print(campaigns)
Prints:
['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140',
...
'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481'
]
Your XPath expression was malformed.
If you want to collect all the hrefs, your XPath should look like this:
hrefs = tree.xpath('//div[@class="campaign"]/a')
for href in hrefs:
    print(href.get('href'))
or in one line:
hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]
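To see the attribute predicate in isolation, here is a small self-contained sketch. It swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra packages. ElementTree supports only a subset of XPath: it needs well-formed markup and cannot return attribute values through a trailing /@href step, so the href is read with .get() instead. The sample markup and its example.com URLs are hypothetical stand-ins for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for the real page.
SAMPLE = """
<html><body>
  <div class="campaign"><a href="http://example.com/issue-1">Issue 1</a></div>
  <div class="other"><a href="http://example.com/skip">Skip</a></div>
  <div class="campaign"><a href="http://example.com/issue-2">Issue 2</a></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# [@class='campaign'] keeps only divs whose class attribute equals 'campaign';
# the '@' prefix is what marks 'class' as an attribute name.
hrefs = [a.get("href") for a in root.findall(".//div[@class='campaign']/a")]
print(hrefs)
```

The link inside the div with class "other" is filtered out by the predicate, which is the behavior the corrected lxml expression relies on as well.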

urllib3.urlencode googlescholar url from string

I am trying to encode a string as a URL to search Google Scholar, only to find that urlencode is not provided by urllib3.
>>> import urllib3
>>> string = "https://scholar.google.com/scholar?" + urllib3.urlencode( {"q":"rudra banerjee"} )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'urlencode'
So I checked the urllib3 docs and found that I possibly need request_encode_url. But I have no experience using it, and my attempt failed.
>>> string = "https://scholar.google.com/scholar?" +"rudra banerjee"
>>> url = urllib3.request_encode_url('POST',string)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'request_encode_url'
So, how can I encode a string as a URL?
NB: I have no particular attachment to urllib3, so any other module will also do.
To simply encode fields in a URL, you can use urllib.urlencode.
In Python 2, this should do the trick:
import urllib
s = "https://scholar.google.com/scholar?" + urllib.urlencode({"q":"rudra banerjee"})
print(s)
# Prints: https://scholar.google.com/scholar?q=rudra+banerjee
In Python 3, it lives under urllib.parse.urlencode instead.
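A minimal Python 3 sketch of the same call, using the query string from the question:

```python
from urllib.parse import urlencode  # Python 3 location of urlencode

base = "https://scholar.google.com/scholar?"
# urlencode turns a dict into a query string and escapes each value,
# encoding the space in the search term as '+'.
url = base + urlencode({"q": "rudra banerjee"})
print(url)
# Prints: https://scholar.google.com/scholar?q=rudra+banerjee
```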
(Edit: I assumed you wanted to download the URL, not simply encode it. My mistake. I'll leave this answer as a reference for others, but see the other answer for encoding a URL.)
If you pass a dictionary into fields, urllib3 will take care of encoding it for you. First, you'll need to instantiate a pool for your connections. Here's a full example:
import urllib3
http = urllib3.PoolManager()
r = http.request('POST', 'https://scholar.google.com/scholar', fields={"q":"rudra banerjee"})
print(r.data)
Calling .request(...) will take care of figuring out the encoding for you based on the method.
Getting started examples are here: https://urllib3.readthedocs.org/en/latest/index.html#usage

renderContents in beautifulsoup (python)

The code I'm trying to get working is:
h = str(heading)
# '<h1>Heading</h1>'
heading.renderContents()
I get this error:
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
print h.renderContents()
AttributeError: 'str' object has no attribute 'renderContents'
Any ideas?
I have a string with HTML tags and I need to clean it. If there is a different way of doing that, please suggest it.
Your error message and your code sample don't line up. You say you're calling:
heading.renderContents()
But your error message says you're calling:
print h.renderContents()
Which suggests that perhaps you have a bug in your code, trying to call renderContents() on a string object that doesn't define that method.
In any case, it would help if you checked what type of object heading is to make sure it's really a BeautifulSoup instance. This works for me with BeautifulSoup 3.2.0:
from BeautifulSoup import BeautifulSoup
heading = BeautifulSoup('<h1>heading</h1>')
repr(heading)
# '<h1>heading</h1>'
print heading.renderContents()
# <h1>heading</h1>
print str(heading)
# <h1>heading</h1>
h = str(heading)
print h
# <h1>heading</h1>
