My question is with reference to the following one:
How to extract URL from HTML anchor element using Python3?
What if I do not know the exact URL and just have a keyword that should be present in it? How do I then extract the URL from the page source?
Use an HTML parser.
In the case of BeautifulSoup, you can pass a function as the keyword argument value:
from bs4 import BeautifulSoup
word = "test"
data = "your HTML here"
soup = BeautifulSoup(data, "html.parser")
for a in soup.find_all('a', href=lambda x: x and word in x):
    print(a['href'])
Or, a regular expression:
import re
for a in soup.find_all('a', href=re.compile(word)):
    print(a['href'])
Or, using a CSS selector (*= matches a substring anywhere in the attribute value; ^= would only match a prefix):
for a in soup.select('a[href*="{word}"]'.format(word=word)):
    print(a['href'])
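As a concrete, self-contained illustration (the sample HTML and URLs below are made up), all three approaches pick out the same link:

```python
import re
from bs4 import BeautifulSoup

word = "test"
# Made-up sample document for illustration
data = '<a href="/test/page">one</a><a href="/other">two</a>'
soup = BeautifulSoup(data, "html.parser")

# 1. Function filter: keep only hrefs containing the keyword
print([a['href'] for a in soup.find_all('a', href=lambda x: x and word in x)])

# 2. Compiled regular expression (matched with search, so anywhere in the value)
print([a['href'] for a in soup.find_all('a', href=re.compile(word))])

# 3. CSS substring selector
print([a['href'] for a in soup.select('a[href*="{word}"]'.format(word=word))])
```

All three print ['/test/page'].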
Try a regular expression:
import re
re.findall(r'(?i)href=["\']([^\s"\'<>]+)', content)
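A quick sketch of how this pattern behaves (the sample content is made up): the (?i) flag also catches upper-case HREF=, and the pattern accepts either quote style:

```python
import re

# Made-up sample; note the upper-case HREF and the single-quoted attribute
content = '<a HREF="http://example.com/a">x</a> <a href=\'/relative/b\'>y</a>'
print(re.findall(r'(?i)href=["\']([^\s"\'<>]+)', content))
# ['http://example.com/a', '/relative/b']
```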
I am trying to parse an HTML page to extract all values matching this regex construction:
href="http://.+?"
This is the code:
import urllib.request
import re
url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall('href="(http://.*?)"',html)
for link in links:
    print(link)
But I am getting an error saying:
TypeError: cannot use a string pattern on a bytes-like object
urlopen(url) returns a bytes object. So your html variable contains bytes as well. You can decode it using something like this:
htmlobject = urllib.request.urlopen(url)
html = htmlobject.read().decode('utf-8')
Then you can use html which is now a string, in your regex.
Your html is a byte string; a quick fix is str(html) (though note that str() keeps the b'...' prefix and escape sequences in the result, so decoding is cleaner):
re.findall(r'href="(http://.*?)"',str(html))
Or, use a byte pattern:
re.findall(rb'href="(http://.*?)"',html)
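A minimal demonstration of the three options side by side (the sample bytes here are made up rather than fetched):

```python
import re

html = b'<a href="http://example.com/">x</a>'  # what urlopen(url).read() returns

# A str pattern on bytes raises TypeError:
# re.findall(r'href="(http://.*?)"', html)

# Option 1: decode to str first
print(re.findall(r'href="(http://.*?)"', html.decode('utf-8')))  # ['http://example.com/']

# Option 2: wrap in str() (works, but matches inside the b'...' repr)
print(re.findall(r'href="(http://.*?)"', str(html)))             # ['http://example.com/']

# Option 3: keep everything as bytes with a byte pattern
print(re.findall(rb'href="(http://.*?)"', html))                 # [b'http://example.com/']
```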
If you would like all links on a page, you don't even need a regex, because bs4 can get you what you need :-)
import requests
import bs4
soup = bs4.BeautifulSoup(requests.get('https://dr.dk/').text, 'lxml')
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])
Hope it was helpful. Good luck with the project ;-)
I am using BeautifulSoup and trying to scrape only the first definition ("very cold") of a word from Merriam-Webster, but it scrapes the second line (a sentence) as well. This is my code.
P.S.: I want only the "very cold" part; "put on your jacket..." should not be included in the output. Please, someone help.
import requests
from bs4 import BeautifulSoup
url = "https://www.merriam-webster.com/dictionary/freezing"
r = requests.get(url)
soup = BeautifulSoup(r.content,"lxml")
definition = soup.find("span", {"class" : "dt"})
tag = definition.findChild()
print(tag.text)
Selecting by class is the second fastest method of CSS selector matching. select_one returns only the first match, and next_sibling takes you to the node you want:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.merriam-webster.com/dictionary/freezing')
soup = bs(r.content, 'lxml')
print(soup.select_one('.mw_t_bc').next_sibling.strip())
The way that Merriam-Webster structures their page is a little strange, but you can find the <strong> tag that precedes the definition, grab the next sibling and strip out all whitespace like this:
>>> definition.find('strong').next_sibling.strip()
'very cold'
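The same idea can be sketched against a static snippet that imitates (but is not) the real Merriam-Webster markup, so it runs without a network request:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page structure; the real markup may differ
html = ('<span class="dt"><strong class="mw_t_bc">: </strong>very cold'
        '<span class="ex-sent">put on your jacket</span></span>')
soup = BeautifulSoup(html, "html.parser")
definition = soup.find("span", {"class": "dt"})
# The definition text is the sibling immediately after the <strong> tag
print(definition.find("strong").next_sibling.strip())  # very cold
```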
I want to find all results of a certain structure in a string, preferably using regex.
To find all urls, one can use
re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', decode)
and it returns
'https://en.wikipedia.org'
I would like a regex string, which finds:
href="/wiki/*anything*"
OP's clarification: the match must begin with href="/wiki/, the middle can be anything, and the end must be ".
st = "since-OP-did-not-provide-a-sample-string-34278234$'blahhh-okay-enough.href='/wiki/anything/everything/nothing'okay-bye"
print(st[st.find('href'):st.rfind("'")+1])
OUTPUT:
href='/wiki/anything/everything/nothing'
EDIT:
I would go with BeautifulSoup if we are to parse probably an html.
from bs4 import BeautifulSoup
text = '''<a href='/wiki/anything/everything/nothing'><img src="/hp_imgjhg/411/1/f_1hj11_100u.jpg" alt="dyufg" />well wait now <a href='/wiki/hello/how-about-now/nothing'>'''
soup = BeautifulSoup(text, features="lxml")
for line in soup.find_all('a'):
    print("href =", line.attrs['href'])
OUTPUT:
href = /wiki/anything/everything/nothing
href = /wiki/hello/how-about-now/nothing
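Since the question explicitly asked for a regex, a direct pattern works too. A sketch (reusing URLs from the sample above) that allows either quote style and captures everything up to the closing quote:

```python
import re

text = ("<a href='/wiki/anything/everything/nothing'>x</a>"
        "<a href='/wiki/hello/how-about-now/nothing'>y</a>")
# Match href= with either quote, capture the /wiki/ path up to the closing quote
print(re.findall(r'''href=["'](/wiki/[^"']*)["']''', text))
# ['/wiki/anything/everything/nothing', '/wiki/hello/how-about-now/nothing']
```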
I am trying to find whether a link ends with ".pdf".
I am skipping all the characters before ".pdf" using [\w/.-]+ in the regular expression and then checking whether it contains ".pdf". I am new to regular expressions.
The code is:
import urllib2
import json
import re
from bs4 import BeautifulSoup
url = "http://codex.cs.yale.edu/avi/os-book/OS8/os8c/slide-dir/"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
links = soup.find_all('a')
for link in links:
    name = link.get("href")
    if(re.match(r'[\w/.-]+.pdf',name)):
        print name
I want to match name with following type of links:
PDF-dir/ch1.pdf
You don't need regular expressions. Use a CSS selector to check that an href ends with pdf:
for link in soup.select("a[href$=pdf]"):
    print(link["href"])
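A self-contained sketch of the $= ("ends with") attribute selector, using made-up HTML instead of the live page:

```python
from bs4 import BeautifulSoup

# Made-up sample page mixing PDF and non-PDF links
html = '<a href="PDF-dir/ch1.pdf">ch1</a><a href="index.html">home</a>'
soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href$=".pdf"]'):
    print(link["href"])  # PDF-dir/ch1.pdf
```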
I made a small change to your code:
for link in links:
    name = link.get("href")
    if(re.search(r'\.pdf$',name)):
        print name
The output is like:
PDF-dir/ch1.pdf
PDF-dir/ch2.pdf
PDF-dir/ch3.pdf
PDF-dir/ch4.pdf
PDF-dir/ch5.pdf
PDF-dir/ch6.pdf
PDF-dir/ch7.pdf
PDF-dir/ch8.pdf
PDF-dir/ch9.pdf
PDF-dir/ch10.pdf
PDF-dir/ch11.pdf
PDF-dir/ch12.pdf
PDF-dir/ch13.pdf
PDF-dir/ch14.pdf
PDF-dir/ch15.pdf
PDF-dir/ch16.pdf
PDF-dir/ch17.pdf
PDF-dir/ch18.pdf
PDF-dir/ch19.pdf
PDF-dir/ch20.pdf
PDF-dir/ch21.pdf
PDF-dir/ch22.pdf
PDF-dir/appA.pdf
PDF-dir/appC.pdf
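The reason re.search with an escaped dot and a $ anchor is the right tool here can be sketched in isolation:

```python
import re

# re.match anchors at the start, and an unescaped '.' matches any character,
# so the original pattern can over-match; r'\.pdf$' checks the suffix exactly.
print(bool(re.search(r'\.pdf$', 'PDF-dir/ch1.pdf')))     # True
print(bool(re.search(r'\.pdf$', 'slide-dir/notes.txt'))) # False
```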
I want to extract all the URLs from this webpage.
The Python code I am using is this:
htmlfile=urllib.urlopen("http://dubai.dubizzle.com/property-for-rent/residential/apartmentflat/").read()
soup=BeautifulSoup(htmlfile)
link=soup.find_all('a', xtclib="listing_list_1_title_link", href=True)
for a in link:
    print a['href'],'\b'
But it extracts only URLs with xtclib="listing_list_1_title_link". How can I make the expression match xtclib="listing_list_(any number here)_title_link"?
You can pass a compiled regular expression object:
import re
...
link = soup.find_all(
    'a',
    xtclib=re.compile(r"listing_list_\d+_title_link"),
    href=True)
See Beautiful Soup Documentation - Regular Expression.
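A minimal, self-contained sketch of matching an attribute against a compiled pattern (the attribute values below are made up to imitate the numbered xtclib listing):

```python
import re
from bs4 import BeautifulSoup

# Made-up listing markup imitating the numbered xtclib attributes
html = ('<a xtclib="listing_list_1_title_link" href="/one">1</a>'
        '<a xtclib="listing_list_23_title_link" href="/two">2</a>'
        '<a xtclib="other" href="/three">3</a>')
soup = BeautifulSoup(html, "html.parser")
# The compiled pattern matches any listing number; the third link is filtered out
for a in soup.find_all('a', xtclib=re.compile(r"listing_list_\d+_title_link"), href=True):
    print(a['href'])
```

This prints /one and /two; the link with xtclib="other" is skipped.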