Regex to search specific text structure - python

I want to find all results of a certain structure in a string, preferably using regex.
To find all urls, one can use
re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', decode)
and it returns
'https://en.wikipedia.org'
I would like a regex string, which finds:
href="/wiki/*anything*"

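As the OP describes it: the match must begin with href="/wiki/, the middle can be anything, and it must end with ". Note that the sample string below uses single quotes around the attribute value.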
st = "since-OP-did-not-provide-a-sample-string-34278234$'blahhh-okay-enough.href='/wiki/anything/everything/nothing'okay-bye"
print(st[st.find('href'):st.rfind("'")+1])
OUTPUT:
href='/wiki/anything/everything/nothing'
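Since the question asks for a regex, here is a minimal sketch of one. The sample string is hypothetical (the question did not include one), and the pattern assumes the attribute value contains no quote characters. It accepts either quote style and, unlike the find/rfind slice above, returns every match rather than a single span.

```python
import re

# Hypothetical sample; the question did not include one.
sample = (
    '<a href="/wiki/Python">Python</a> '
    "<a href='/wiki/Regular_expression'>regex</a> "
    '<a href="/about">about</a>'
)

# (["']) captures the opening quote, [^"']* consumes the path,
# and \1 requires the same kind of quote to close the attribute.
pattern = re.compile(r'''href=(["'])/wiki/[^"']*\1''')

# finditer + group(0) yields the whole match, not just the captured quote.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)
```

re.findall would return only the captured quote characters here because the pattern contains a group, which is why the sketch uses finditer.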
EDIT:
If the input is HTML, I would parse it with BeautifulSoup instead.
from bs4 import BeautifulSoup
text = '''<a href='/wiki/anything/everything/nothing'><img src="/hp_imgjhg/411/1/f_1hj11_100u.jpg" alt="dyufg" />well wait now <a href='/wiki/hello/how-about-now/nothing'>'''
soup = BeautifulSoup(text, features="lxml")
for line in soup.find_all('a'):
    print("href =", line.attrs['href'])
OUTPUT:
href = /wiki/anything/everything/nothing
href = /wiki/hello/how-about-now/nothing
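To keep only the links under /wiki/, as the question asks, find_all accepts a compiled regex for an attribute value. The following is a sketch under assumptions: the extra /help/contents link is made up for illustration, and html.parser is used only so that the example does not need lxml installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two /wiki/ links and one unrelated link.
text = '''<a href='/wiki/anything/everything/nothing'>one</a>
<a href="/help/contents">two</a>
<a href='/wiki/hello/how-about-now/nothing'>three</a>'''

soup = BeautifulSoup(text, "html.parser")

# The regex is matched against each href value; ^ anchors it to the start.
wiki_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/wiki/"))]
print(wiki_links)
```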

Related

Convert HTML String Tag into String Python

I am trying to convert the HTML String Tag into String using Python.
Here is the content I'm trying to convert:
htmltxt = "<b>Hello World</b>"
The result should appear as Hello World in bold, but instead I'm getting
<html><body><b>Hello World</b></body></html>
with the below snippet of code
from bs4 import BeautifulSoup
htmltxt = "<b>Hello World</b>"
soup = BeautifulSoup(htmltxt, 'lxml')
Can anyone suggest how to convert it?
In this situation you're trying to find a tag from within your soup object. Given this is the only one and there is no id or class name you can use:
hello_world_tag = soup.find("b")
hello_world_tag_text = hello_world_tag.text
print(hello_world_tag_text) # Output: 'Hello World'
The key here is '.text'. Using Beautiful Soup to find a specific tag returns that entire tag, but the .text attribute returns just the text inside that tag.
Edit following comment:
I would still recommend using bs4 to parse the HTML. Once you have your text, if you'd like it in bold in a terminal, you can print it with:
print('\033[1m' + text + '\033[0m')  # \033[0m resets the style afterwards
Note: you won't get a bold string per se. A string carries no styling, so bold always has to be produced by whatever interprets or formats the output.
To extract text from an HTML string with BeautifulSoup, you can use the text attribute or the get_text() method:
from bs4 import BeautifulSoup
htmltxt = "<b>Hello World</b>"
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text
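The pieces above can be combined into one short sketch. It assumes the output goes to a terminal that understands ANSI escape codes, and it swaps in html.parser, which does not wrap the fragment in html and body tags the way lxml does.

```python
from bs4 import BeautifulSoup

htmltxt = "<b>Hello World</b>"

# html.parser leaves the fragment as-is instead of adding <html><body>.
soup = BeautifulSoup(htmltxt, "html.parser")

# get_text() returns only the text nodes, with all tags removed.
plain = soup.get_text()
print(plain)

# ANSI escape codes: 1m turns bold on, 0m resets the style.
BOLD, RESET = "\033[1m", "\033[0m"
print(BOLD + plain + RESET)
```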

Beautifulsoup issue in finding anchor tag

I am trying to capture a link in my python script.
I have a variable holding the regex pattern.
I want to capture below link from the page HTML.
<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>
The code is:
parser = "lxml"
next_regex = r'(.*?)NEXT(.*?)'
html_bodySoup = BeautifulSoup(urllib.request.urlopen(url), parser)
links = html_bodySoup.find_all('a', href = re.compile(nextpg_regex))
I can't find the problem, but it does not give me the link I want. I tried other, more precise regex patterns as well.
You do not need the regex here. You can simply check if the NEXT is in the node text.
You can use
links = html_bodySoup.find_all(lambda x: x.name=='a' and 'NEXT' in x.text)
Here, we search for any tag whose name is a and whose node text contains NEXT.
A Python test:
from bs4 import BeautifulSoup
html = '<p><a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a></p>'
parser = "lxml"
html_bodySoup = BeautifulSoup(html, parser)
html_bodySoup.find_all(lambda x: x.name=='a' and 'NEXT' in x.text)
# => [<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>]
If you want to search for an exact word NEXT, then you can use a regex like this:
html_bodySoup.find_all(lambda x: x.name=='a' and re.search(r'\bNEXT\b', x.text))
# => [<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>]
where re.search searches for a match anywhere inside a string and \bNEXT\b pattern makes sure the NEXT it finds is a whole word (thanks to word boundaries).
You can also use :-soup-contains to target that text. It looks like you could probably use just the class, however (it is one of several values in the class attribute). Some options are shown below, with the most descriptive one not commented out:
from bs4 import BeautifulSoup as bs
html = '''<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'''
soup = bs(html, 'lxml')
# soup.select_one('.pg-bton[href*=department]:-soup-contains("NEXT")')
# soup.select_one('.pg-bton')
soup.select_one('.pg-bton[href*=department]:-soup-contains("NEXT")')
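For context on why the original attempt returned nothing: a regex passed as href= is matched against the value of the href attribute, and /department/office/pg2 does not contain NEXT. A minimal sketch of the difference follows; the PREV link is made up for illustration, and html.parser is used only to avoid requiring lxml.

```python
import re
from bs4 import BeautifulSoup

# The NEXT link is from the question; the PREV link is hypothetical.
html = (
    '<p><a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'
    '<a class="pg-normal" href="/department/office/pg0"> PREV </a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Matched against the href value, which never contains "NEXT".
by_href = soup.find_all("a", href=re.compile(r"NEXT"))

# Matched against the tag's text instead.
by_text = soup.find_all("a", string=re.compile(r"NEXT"))

next_urls = [a["href"] for a in by_text]
print(by_href, next_urls)
```

The string= filter only works when the tag has a single text child, as it does here; the lambda versions above also handle links with nested markup.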

How to get correctly separated words from an HTML string parsed with lxml using BeautifulSoup

Hi, I'm fetching a page and getting one string that contains a lot of HTML data:
from bs4 import BeautifulSoup
import requests
import bs4
url = "any random url"
html = requests.get(url).text
soup = BeautifulSoup(html,'lxml')
web_page=soup.get_text().strip()
print(web_page.lower())
Some of the words run together in the output, for example:
conditionstravel instead of conditions & travel
vaccinationstreatment instead of vaccination & treatment
The page is scraped correctly, but this output is not what I expect. It happens because one tag ends with the text conditions and the next tag starts with the text travel, so the two are joined as conditionstravel.
I would like to scrape the page tag by tag and collect the texts into a web_page_data_list.
Is there a way to keep the text of each tag separate, as in the examples above? I can't rely on a dictionary of known words for this.
Is this possible with Beautiful Soup, or is there another package that extracts the text properly?
Pass separator=' ' to the .get_text() method. You can also pass strip=True to strip leading and trailing whitespace from each piece of text automatically.
For example:
import bs4
from bs4 import BeautifulSoup
txt = '''<div>Hello<span>World</span></div>'''
soup = BeautifulSoup(txt, 'html.parser')
web_page=soup.get_text(strip=True, separator=' ')
print(web_page.lower())
print(bs4.__version__)
Prints:
hello world
4.9.1
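If a list of per-tag texts is wanted (the web_page_data_list from the question), the stripped_strings generator gives one entry per text node. This is a sketch on made-up markup that reproduces the joined-words problem, not on the asker's actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: adjacent tags with no whitespace between them.
txt = "<div><p>Conditions</p><p>Travel</p><p> Vaccinations </p><p>Treatment</p></div>"
soup = BeautifulSoup(txt, "html.parser")

# Without a separator, neighbouring text nodes are concatenated directly.
joined = soup.get_text()
print(joined)

# stripped_strings yields each text node separately, already stripped.
web_page_data_list = [s.lower() for s in soup.stripped_strings]
print(web_page_data_list)
```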

Webscraping merriam-webster using beautifulsoup

I am using BeautifulSoup and trying to scrape only the first definition (very cold) of a word from Merriam-Webster, but it scrapes the second line (a sentence) as well. This is my code.
P.S.: I want only the "very cold" part; "put on your jacket...." should not be included in the output. Can someone please help?
import requests
from bs4 import BeautifulSoup
url = "https://www.merriam-webster.com/dictionary/freezing"
r = requests.get(url)
soup = BeautifulSoup(r.content,"lxml")
definition = soup.find("span", {"class" : "dt"})
tag = definition.findChild()
print(tag.text)
Selecting by class is the second-fastest way to match with CSS selectors. select_one returns only the first match, and next_sibling then takes you to the node you want:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.merriam-webster.com/dictionary/freezing')
soup = bs(r.content, 'lxml')
print(soup.select_one('.mw_t_bc').next_sibling.strip())
The way that Merriam-Webster structures their page is a little strange, but you can find the <strong> tag that precedes the definition, grab the next sibling and strip out all whitespace like this:
>>> tag.find('strong').next_sibling.strip()
u'very cold'
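Both answers rely on the definition being a bare text node that directly follows a strong tag. The sketch below shows that mechanism on a hand-written fragment; the markup is an assumption modelled on the class names mentioned above (dt, mw_t_bc), the ex-sent class is hypothetical, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Assumed structure: <strong> marker, then the definition as plain text,
# then the example sentence in its own (hypothetically named) span.
html = (
    '<span class="dt"><strong class="mw_t_bc">: </strong>very cold '
    '<span class="ex-sent">put on your jacket</span></span>'
)
soup = BeautifulSoup(html, "html.parser")

marker = soup.find("strong", class_="mw_t_bc")

# next_sibling is the text node right after the marker; the example
# sentence is a separate sibling element and is not included.
definition = marker.next_sibling.strip()
print(definition)
```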

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809 . But I am currently stuck a few steps behind this because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+",y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here: use urllib2.urlopen(url).read() to get the content as a string. re.sub is not needed either; a single regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
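The snippet above is Python 2 (urllib2 does not exist in Python 3). In Python 3, urllib.request.urlopen(url).read() returns bytes, so the content has to be decoded before a str pattern can be applied. The sketch below skips the network call and runs the same capture-group idea on a made-up fragment of the results page, then builds the game URL from the question.

```python
import re

# Hypothetical fragment standing in for the downloaded page content.
page = '<a href="/perl/chessgame?gid=1012809">Alekhine vs Naegeli, 1932</a>'

# With one capture group, findall returns only the captured digits.
gids = re.findall(r"gid=([0-9]+)", page)

# findall returns a list, so index it rather than calling it like x(1).
game_id = gids[0]
game_url = "http://www.chessgames.com/perl/chessgame?gid=" + game_id
print(game_id, game_url)
```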
You don't need a regex here at all. A CSS selector combined with string manipulation will get you there. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
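As an alternative to split("gid="), the standard library can parse the query string once the href is in hand, which keeps working if more parameters follow gid. This is a sketch using a hard-coded href; the extra foo=bar parameter is made up to show that case.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href; foo=bar is added only to show extra parameters.
href = "/perl/chessgame?gid=1012809&foo=bar"

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(urlparse(href).query)
item_num = params["gid"][0]
print(item_num)
```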
