Getting incorrect match while using regular expressions - python

I am trying to find out whether a link ends with ".pdf".
I am skipping all the characters before ".pdf" using [\w/.-]+ in the regular expression and then seeing if it contains ".pdf". I am new to regular expressions.
The code is:
import urllib2
import json
import re
from bs4 import BeautifulSoup
url = "http://codex.cs.yale.edu/avi/os-book/OS8/os8c/slide-dir/"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
links = soup.find_all('a')
for link in links:
    name = link.get("href")
    if(re.match(r'[\w/.-]+.pdf',name)):
        print name
I want to match name with following type of links:
PDF-dir/ch1.pdf

You don't need regular expressions. Use a CSS selector to check that an href ends with pdf:
for link in soup.select("a[href$=pdf]"):
    print(link["href"])

I made a small change to your code:
for link in links:
    name = link.get("href")
    if re.search(r'\.pdf$', name):
        print name
The output is like:
PDF-dir/ch1.pdf
PDF-dir/ch2.pdf
PDF-dir/ch3.pdf
PDF-dir/ch4.pdf
PDF-dir/ch5.pdf
PDF-dir/ch6.pdf
PDF-dir/ch7.pdf
PDF-dir/ch8.pdf
PDF-dir/ch9.pdf
PDF-dir/ch10.pdf
PDF-dir/ch11.pdf
PDF-dir/ch12.pdf
PDF-dir/ch13.pdf
PDF-dir/ch14.pdf
PDF-dir/ch15.pdf
PDF-dir/ch16.pdf
PDF-dir/ch17.pdf
PDF-dir/ch18.pdf
PDF-dir/ch19.pdf
PDF-dir/ch20.pdf
PDF-dir/ch21.pdf
PDF-dir/ch22.pdf
PDF-dir/appA.pdf
PDF-dir/appC.pdf
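To see why the original pattern over-matched, here is a small stdlib-only sketch (the extra filenames are invented for illustration) comparing the unanchored re.match pattern with the anchored re.search fix:

```python
import re

names = ['PDF-dir/ch1.pdf', 'tools/xpdf.html', 'notes.pdf.bak']

# Original pattern: the unescaped '.' matches any character, and re.match
# does not require the match to reach the end of the string.
loose = [n for n in names if re.match(r'[\w/.-]+.pdf', n)]

# Fixed pattern: an escaped dot plus '$' anchors on a true ".pdf" ending.
strict = [n for n in names if re.search(r'\.pdf$', n)]

print(loose)
print(strict)
```

The loose pattern accepts all three names, while the anchored one keeps only the real PDF link.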

Related

regex and urllib.request to scrape links from HTML

I am trying to parse an HTML page to extract all values matching this regex construction:
href="http://.+?"
This is the code:
import urllib.request
import re
url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall('href="(http://.*?)"',html)
for link in links:
    print(link)
But I am getting an error saying :
TypeError: cannot use a string pattern on a bytes-like object
urlopen(url).read() returns a bytes object, so your html variable contains bytes as well. You can decode it using something like this:
htmlobject = urllib.request.urlopen(url)
html = htmlobject.read().decode('utf-8')
Then you can use html, which is now a string, in your regex.
Your html is a byte string, use str(html):
re.findall(r'href="(http://.*?)"',str(html))
Or, use a byte pattern:
re.findall(rb'href="(http://.*?)"',html)
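Both fixes side by side in a self-contained sketch (the HTML bytes are invented for illustration):

```python
import re

# bytes, as returned by urllib.request.urlopen(url).read()
html = b'<a href="http://example.com/a">a</a> <a href="http://example.com/b">b</a>'

# Option 1: decode to str, then use a str pattern
str_links = re.findall(r'href="(http://.*?)"', html.decode('utf-8'))

# Option 2: stay in bytes and use a bytes pattern (rb'...')
byte_links = re.findall(rb'href="(http://.*?)"', html)

print(str_links)   # results are str
print(byte_links)  # results are bytes
```

Either works; just don't mix a str pattern with a bytes subject.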
If you would like all the links on a page, you don't even have to use a regex, because bs4 can get you what you need :-)
import requests
import bs4
soup = bs4.BeautifulSoup(requests.get('https://dr.dk/').text, 'lxml')
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])
Hope it was helpful. Good luck with the project ;-)

Regex to search specific text structure

I want to find all results of a certain structure in a string, preferably using regex.
To find all urls, one can use
re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', decode)
and it returns
'https://en.wikipedia.org'
I would like a regex string, which finds:
href="/wiki/*anything*"
The OP clarified: the beginning must be href="/wiki/, the middle can be anything, and the end must be ".
st = "since-OP-did-not-provide-a-sample-string-34278234$'blahhh-okay-enough.href='/wiki/anything/everything/nothing'okay-bye"
print(st[st.find('href'):st.rfind("'")+1])
OUTPUT:
href='/wiki/anything/everything/nothing'
EDIT:
I would go with BeautifulSoup if we are parsing what is probably HTML.
from bs4 import BeautifulSoup
text = '''<a href='/wiki/anything/everything/nothing'><img src="/hp_imgjhg/411/1/f_1hj11_100u.jpg" alt="dyufg" />well wait now <a href='/wiki/hello/how-about-now/nothing'>'''
soup = BeautifulSoup(text, features="lxml")
for line in soup.find_all('a'):
    print("href =", line.attrs['href'])
OUTPUT:
href = /wiki/anything/everything/nothing
href = /wiki/hello/how-about-now/nothing
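If a regex is still preferred over BeautifulSoup, a pattern that matches the requested href="/wiki/..." structure directly (handling either quote style via a backreference) might look like this, reusing the sample string from above:

```python
import re

st = "since-OP-did-not-provide-a-sample-string-34278234$'blahhh-okay-enough.href='/wiki/anything/everything/nothing'okay-bye"

# group 1 captures the opening quote; \1 requires the same quote to close it
matches = re.findall(r"""href=(['"])(/wiki/.*?)\1""", st)
paths = [path for quote, path in matches]
print(paths)
```

Unlike the find/rfind trick, this keeps working when the string contains other stray quotes.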

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query; the results page contains the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and use it to build the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809 . But I am currently stuck a few steps before that because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2
import re

url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
y = str(soup)
x = re.findall("gid=[0-9]+", y)
print x
z = re.sub("gid=", "", x(1))  # At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to index into it to extract elements. By the way, you don't need BeautifulSoup here; use urllib2.urlopen(url).read() to get the page content as a string. The re.sub is also not needed: a single regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
You don't need a regex here at all. A CSS selector along with some string manipulation will lead you in the right direction. Try the below script:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
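As a stdlib-only alternative, once you have the href string, the gid parameter can also be pulled out with urllib.parse (Python 3) instead of a regex or split; a minimal sketch using the href from the question:

```python
from urllib.parse import urlparse, parse_qs

href = 'http://www.chessgames.com/perl/chessgame?gid=1012809'

query = urlparse(href).query     # 'gid=1012809'
gid = parse_qs(query)['gid'][0]  # parse_qs returns {'gid': ['1012809']}
print(gid)
```

This stays correct even if the URL later gains extra query parameters.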

How to check whether the url is of format "/askwiki/questions/<any number>" in Python

I am trying to learn web scraping using BeautifulSoup and Python.
I scraped a list of urls from a website, and I want to display the text of all the links that are in the format "/askwiki/questions/<number>", like
"/askwiki/questions/4" or "/askwiki/questions/123".
import requests
from bs4 import BeautifulSoup
url = 'http://unistd.herokuapp.com/askrec';
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml");
links = soup.find_all("a")
for link in links:
    if ...:  # url is of my desired format
        print link.text
What should I write in the if statement.
I am new to python as well as web scraping. It may be a really stupid question but I am not getting what to write there.
I tried things like
if "/askwiki/questions/[0-9]+" in link.get("href"):
if "/askwiki/questions/[0-9]?" in link.get("href"):
but neither works.
P.S. - There are other links too, like '/askwiki/questions/tags' and '/askwiki/questions/users'.
Edit: Using a regex to identify only those links with numbers at the end.
import re
for link in links:
    url = str(link.get('href'))
    if re.findall(r'/askwiki/questions/\d+', url):
        print(link)
You're on the right track! The missing component is the re module.
I think what you want is something like this:
import re
matcher = re.compile(r"/askwiki/questions/[0-9]+")
if matcher.search(link.get("href")):
    print(link.text)
Alternatively, you can just drop the number component, if you're only really looking for links with "/askwiki/questions" in:
if "/askwiki/questions" in link.get("href"):
    print(link.text)
Try something like:
for link in links:
    link = link.get("href")
    if link.startswith("/askwiki/questions/"):
        print(link)
If you want to use a regex (i.e. the [0-9]+ pattern you already have), you have to import the re library. Check out the re module's documentation on finding patterns!
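Since the page also has links like /askwiki/questions/tags and /askwiki/questions/users, anchoring the pattern on both ends keeps only the numeric ones; a small stdlib-only sketch (the hrefs are made up for illustration):

```python
import re

hrefs = ['/askwiki/questions/4', '/askwiki/questions/123',
         '/askwiki/questions/tags', '/askwiki/questions/users']

# ^ and $ force the whole href to be /askwiki/questions/<digits>
pattern = re.compile(r'^/askwiki/questions/[0-9]+$')
matching = [h for h in hrefs if pattern.match(h)]
print(matching)
```

Without the $ anchor, /askwiki/questions/123abc would also slip through.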

Extracting URL from source code with Python 3

My question is with reference to the following one:
How to extract URL from HTML anchor element using Python3?
What if I do not know the exact URL and just have a keyword which should be present in the URL? How then do I extract the url from the page source?
Use an HTML parser.
In case of BeautifulSoup, you can pass a function as a keyword argument value:
from bs4 import BeautifulSoup
word = "test"
data = "your HTML here"
soup = BeautifulSoup(data, "html.parser")
for a in soup.find_all('a', href=lambda x: x and word in x):
    print(a['href'])
Or, a regular expression:
import re
for a in soup.find_all('a', href=re.compile(word)):
    print(a['href'])
Or, using a CSS selector (*= matches hrefs that contain the word):
for a in soup.select('a[href*="{word}"]'.format(word=word)):
    print(a['href'])
Or try a regular expression:
import re
re.findall(r'(?i)href=["\']([^\s"\'<>]+)', content)
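A self-contained sketch of that pattern in action, filtered by a keyword (the HTML snippet and keyword are invented for illustration):

```python
import re

word = "wiki"
content = '<a HREF="/wiki/Python">Python</a> <a href="/about">about</a>'

# (?i) makes href= case-insensitive; the group captures the URL up to the
# next quote, whitespace, or angle bracket
urls = re.findall(r'(?i)href=["\']([^\s"\'<>]+)', content)
keyword_urls = [u for u in urls if word in u]
print(urls)
print(keyword_urls)
```

The keyword filter is a plain substring check on the extracted URLs, matching what the question asked for.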
