regex and urllib.request to scrape links from HTML - python

I am trying to parse an HTML page to extract all values matching this regex construction:
href="http://.+?"
This is the code:
import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall('href="(http://.*?)"', html)
for link in links:
    print(link)
But I am getting an error saying:
TypeError: cannot use a string pattern on a bytes-like object

urlopen(url).read() returns a bytes object, so your html variable contains bytes rather than a string. You can decode it using something like this:
htmlobject = urllib.request.urlopen(url)
html = htmlobject.read().decode('utf-8')
Then you can use html, which is now a string, in your regex.
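Putting it together with the code from the question, a minimal sketch might look like this (it assumes the page is served as UTF-8):
import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read().decode('utf-8')  # bytes -> str
links = re.findall('href="(http://.*?)"', html)
for link in links:
    print(link)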

Your html is a byte string; one quick fix is str(html):
re.findall(r'href="(http://.*?)"', str(html))
(Note that str() gives you the repr of the bytes, b'...' prefix and all, so decoding as shown above is cleaner.)
Or, use a byte pattern:
re.findall(rb'href="(http://.*?)"', html)
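With a bytes pattern the matches come back as bytes too, so you may still want to decode each one for display:
for link in re.findall(rb'href="(http://.*?)"', html):
    print(link.decode('utf-8'))  # each match is a bytes object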

If you would like all links on a page, you don't even have to use regex, because you can just use bs4 to get what you need :-)
import requests
import bs4
soup = bs4.BeautifulSoup(requests.get('https://dr.dk/').text, 'lxml')
links = soup.find_all('a', href=True)
for i in links:
    print(i['href'])
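If lxml isn't installed, the built-in parser works as a drop-in fallback:
soup = bs4.BeautifulSoup(requests.get('https://dr.dk/').text, 'html.parser')  # no lxml needed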
Hope it was helpful. Good luck with the project ;-)

Related

How to stop BeautifulSoup from decoding HTML entities into symbols

I am trying to get all the links on a given website but am stuck with a problem involving HTML entities. Here's my code that crawls websites using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
.
.
baseRequest = requests.get("https://www.example.com", SOME_HEADER_SETTINGS)
soup = BeautifulSoup(baseRequest.content, "html.parser")
pageLinks = []
for anchor in soup.findAll("a"):
    pageLinks.append(anchor["href"])
.
.
print(pageLinks)
The code becomes problematic when it sees this kind of element:
<a href="./page?id=123&sect=2">Link</a>
Instead of printing ["./page?id=123&sect=2"], it treats the &sect part as an HTML entity and shows this in the console:
["./page?id=123§=2"]
Is there a solution to prevent this?
Here is one:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="./page?id=123&sect=2">Link</a>', "html.parser")
pageLinks = []
for anchor in soup.findAll("a"):
    pageLinks.append(anchor["href"])
uncoded = ''.join(i for i in pageLinks).encode('utf-8')  # encode the href to UTF-8 bytes
decoded = ''.join(map(lambda x: chr(ord(x)), ''.join(i for i in pageLinks)))  # rebuild the string character by character
print('uncoded =',uncoded)
print('decoded =',decoded)
output
uncoded = b'./page?id=123\xc2\xa7=2'
decoded = ./page?id=123§=2
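If you need the href exactly as it appears in the page source, another option (a rough sketch, bypassing BeautifulSoup entirely) is to pull the attribute out of the raw markup with a regex, so no entity decoding ever happens:
import re

raw = '<a href="./page?id=123&sect=2">Link</a>'  # the raw, undecoded markup
pageLinks = re.findall(r'href="([^"]*)"', raw)
print(pageLinks)  # ['./page?id=123&sect=2'] -- &sect survives intact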

Opening webpage and returning a dict of all the links and their text

I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people have mostly suggested this BeautifulSoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're attempting to compare a bytes object to a regular string. If you add print(line) as the first command in your for loop, you'll see that it prints a line of HTML, but with a b' at the beginning, indicating it's a bytes object rather than a str; that's why the in check against a string raises the TypeError. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will have the s variable hold the entire text of the page. You can then use a regular expression to parse out the links and the link target. Here is an example which will pull the link targets without the HTML.
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
        r = re.compile('(?<=href=").*?(?=")')
        print(r.findall(s))

url_dict()
Using regex to get both the html and the link itself in a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
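For later reference, a rough sketch of that regex approach (it assumes simple <a href="...">text</a> markup and will stumble on nested tags):
r = re.compile(r'<a href="(.*?)"[^>]*>(.*?)</a>')
url_list = dict(r.findall(s))  # {url: link text}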
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup

def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a')}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)

Python: parsing UNICODE characters using bs4

I am building a python3 web crawler/scraper using bs4. The program crashes whenever it meets a Unicode character such as a Chinese symbol. How do I modify my scraper so that it supports Unicode?
Here's the code:
import urllib.request
from bs4 import BeautifulSoup

def crawlForData(url):
    r = urllib.request.urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
    for p in result:
        print(p)

url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)
You can try the unicode() built-in if you're on Python 2; it decodes byte strings. On Python 3 the way to go is
content.decode('utf-8', 'ignore')
where content is your raw bytes.
The complete solution may be:
html = urllib.request.urlopen("your url")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content, 'html.parser')

Getting incorrect match while using regular expressions

I am trying to find if a link contains ".pdf" at its end.
I am skipping all the characters before ".pdf" using [\w/.-]+ in the regular expression and then seeing if it contains ".pdf". I am new to regular expressions.
The code is:
import urllib2
import json
import re
from bs4 import BeautifulSoup

url = "http://codex.cs.yale.edu/avi/os-book/OS8/os8c/slide-dir/"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
links = soup.find_all('a')
for link in links:
    name = link.get("href")
    if re.match(r'[\w/.-]+.pdf', name):
        print name
I want to match name with following type of links:
PDF-dir/ch1.pdf
You don't need regular expressions. Use a CSS selector to check that an href ends with pdf:
for link in soup.select("a[href$=pdf]"):
    print(link["href"])
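For comparison, the same check without a CSS selector can be a plain string test (a sketch reusing the variables from the question):
for link in soup.find_all('a'):
    name = link.get("href")
    if name and name.endswith('.pdf'):  # plain suffix check, no regex
        print(name)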
I made a small change in your code:
for link in links:
    name = link.get("href")
    if re.search(r'\.pdf$', name):
        print name
The output is like:
PDF-dir/ch1.pdf
PDF-dir/ch2.pdf
PDF-dir/ch3.pdf
PDF-dir/ch4.pdf
PDF-dir/ch5.pdf
PDF-dir/ch6.pdf
PDF-dir/ch7.pdf
PDF-dir/ch8.pdf
PDF-dir/ch9.pdf
PDF-dir/ch10.pdf
PDF-dir/ch11.pdf
PDF-dir/ch12.pdf
PDF-dir/ch13.pdf
PDF-dir/ch14.pdf
PDF-dir/ch15.pdf
PDF-dir/ch16.pdf
PDF-dir/ch17.pdf
PDF-dir/ch18.pdf
PDF-dir/ch19.pdf
PDF-dir/ch20.pdf
PDF-dir/ch21.pdf
PDF-dir/ch22.pdf
PDF-dir/appA.pdf
PDF-dir/appC.pdf

Extracting URL from source code with Python 3

My question is with reference to the following one:
How to extract URL from HTML anchor element using Python3?
What if I do not know the exact URL and just have a keyword which should be present in the URL? How then do I extract the URL from the page source?
Use an HTML parser.
In the case of BeautifulSoup, you can pass a function as a keyword argument value:
from bs4 import BeautifulSoup

word = "test"
data = "your HTML here"
soup = BeautifulSoup(data, "html.parser")
for a in soup.find_all('a', href=lambda x: x and word in x):
    print(a['href'])
Or, a regular expression:
import re
for a in soup.find_all('a', href=re.compile(word)):
    print(a['href'])
Or, using a CSS selector (note *= for "contains", since the keyword can appear anywhere in the URL):
for a in soup.select('a[href*="{word}"]'.format(word=word)):
    print(a['href'])
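A quick way to try any of these approaches is against a small inline document (the HTML below is made up for illustration):
data = '<a href="/test/page1">one</a> <a href="/other/page2">two</a>'
soup = BeautifulSoup(data, "html.parser")
for a in soup.find_all('a', href=lambda x: x and word in x):
    print(a['href'])  # prints /test/page1 only, since word = "test"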
Try using a regular expression:
import re
re.findall(r'(?i)href=["\']([^\s"\'<>]+)', content)
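For instance, applied to a fetched page (a minimal sketch; it assumes the response decodes as UTF-8):
import re
import urllib.request

content = urllib.request.urlopen('https://www.example.com').read().decode('utf-8')
for href in re.findall(r'(?i)href=["\']([^\s"\'<>]+)', content):
    print(href)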
