Adding chevrons and underscores with BeautifulSoup - python

I want BeautifulSoup to add the strings like this into my HTML pages:
{{< Transfer/component_short_name >}}
(If you are interested why, this is a Hugo shortcode, a kind of variable for markdown)
when I build it programmatically in python and add it using tag.insert_after(), what ends up in the document looks like this:
{{< Transfer/component\_short\_name >}}
which of course does not work the same.
I managed a workaround for the chevrons > < using string replaces, but the underscores '_' would require going into regex, leaving complicated code for a simple operation, so I'm wondering whether there's an option in BeautifulSoup.
I tried various approaches, such as var_name = var_name.replace("\\_", "_") , but that does not work.

I don't see a way to avoid the < and > conversion using BeautifulSoup, but as you say they could be converted afterwards. In the following example there is no underscore escaping:
from bs4 import BeautifulSoup
import re
shortcode = "{{< Transfer/component_short_name >}}"
html = "<html><body><h1>hello world</h1></body>"
soup = BeautifulSoup(html, "html.parser")
soup.h1.insert_after(shortcode)
fixed = re.sub('\{\{<|>\}\}|\\\_', lambda x: {'{{<' : '{{<', '>}}' : '>}}', '\\_' : '_'}[x.group(0)], str(soup))
print(fixed)
Giving the HTML as:
<html><body><h1>hello world</h1>{{< Transfer/component_short_name >}}</body></html>
Here, the \_ replacement does not appear to be needed but I have included it for completeness.

Related

BeautifulSoup find partial string in section

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, and so the string argument can't find it?? An example of a section on the page that I AM able to find has html like the one below, and I'm easily able to find it using the string argument and re.compile using "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking. Just remove the regular expression part, take the text and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want. Like so:
s.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
Using this regular expression you will match any text that has sentinel in it, be careful that you will have to match some characters like spaces, that's why there is a . at beginning of the regex, you might want a more robust regex which you can test here:
https://regex101.com/
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class':'onecol habonecol'})
for s in sections:
text = s.text
if 'CIcyano' in text:
print(s)
break
links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.

Python Re.Search: How to find a substring between two strings, that must also contain a specific substring

I am writing a little script to get my F#H user data from a basic HTML page.
I want to locate my username on that page and the numbers before and after it.
All the data I want is between two HTML <tr> and </tr> tags.
I am currently using this:
re.search(r'<tr>(.*?)</tr>', htmlstring)
I know this works for any substring, as all google results for my question show. The difference here is i need it only when that substring also contains a specific word
However that only returns the first string between those two delimiters, not even all of them.
This pattern occurs hundreds of times on the page. I suspect it doesn't get them all because I'm not handling all the newline characters correctly but I'm not sure.
If it would return all of them, I could at least then sort them out to find one that contains my username going through each result.group(), but I can't even do that.
I have been fiddling with different regex expressions for ages now but can't figure what one I need to much frustration.
TL;DR -
I need a re.search() pattern that finds a substring between two words, that also contains a specific word.
If I understand correctly something like this might work
<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>
<tr> find "<tr>"
(?:(?:(?!<\/tr>).)*?) Find anything except "</tr>" as few times as possible
\bWORD\b find WORD
(?:.*?)) find anything as few times as possible
<\/tr> find "</tr>"
Sample
There are a few ways to do it but I prefer the pandas way:
from urllib import request
import pandas as pd # you need to install pandas
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
web_df: pd.DataFrame = pd.read_html(web_request, attrs={'class': 'members'})
web_df = web_df[0].set_index(keys=['Name'])
# print(web_df)
user_name_to_find_in_table = 'SteveMoody'
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)
Then there are plenty of ways to do this. Using just Beautifulsoup find or css selectors, or maybe re as Peter suggest?
Using beautifulsoup and "find" method, and re, you can do it the following way:
import re
from bs4 import BeautifulSoup as bs # you need to install beautifullsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.find(
lambda t: t.name == "td"
and re.findall(user_name_to_find_in_table, t.text, flags=re.I)
).find_parent(name="tr")
print(row_tag.get_text().strip('tr'))
Using Beautifulsoup and CSS Selectors(no re but Beautifulsoup):
from bs4 import BeautifulSoup as bs # you need to install beautifulsoup
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml') # need to install lxml and bs4(beautifulsoup for Python 3+)
user_name_to_find_in_table = 'SteveMoody'
row_tag = page_soup.select_one(f'tr:has(> td:contains({user_name_to_find_in_table})) ')
print(row_tag.get_text().strip('tr'))
In your case I would favor the pandas example as you keep headers and can easily get other stats, and it runs very quickly.
Using Re:
So fa, best input is Peters' commentLink, so I just adapted it to Python code (happy to get edited), as this solution doesn't need any extra libraries installation.
import re
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
user_name_to_find_in_table = 'SteveMoody'
re_patern = rf'<tr>(?:(?:(?:(?!<\/tr>).)*?)\{user_name_to_find_in_table}\b(?:.*?))<\/tr>'
res = re.search(pattern=re_patern, string= str(web_request))
print(res.group(0))
Helpful lin to use variables in regex: stackflow

Extract java script from html document using regular expression

I am trying to extract java script from google.com using regular expression.
Program
import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
print scriptlis
Output:
['']
Can any one tell me how to extract java script from html doc by using regular expression only.
This works:
import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis
The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes Regex match over newlines. That was actually the root of your problem. The scripts on google.com span multiple lines, so Regex can't match them unless you tell it to include newlines in (.*?).
The "i" sets the "ignorcase" flag (same as re.IGNORECASE), which allows it to match anything that can be JavaScript. Now, this isn't entirely necessary because Google codes pretty well. But, if you had poor code that did stuff similar to <SCRIPT>...</SCRIPT>, you will need this flag.
If you don't have an issue with third party libraries, requests combined with BeautifulSoup makes for a great combination:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.google.com')
p = bs(r.content)
p.find_all('script')
I think the problem is that the text between <script> and </script> is several lines, so you could try something like this:
rg = re.compile('<script>(.*)</script>', re.DOTALL)
result = re.findall(rg, gdoc)
What you probably could try to do is
scriptlis = re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S)
Because most script tags are of type:
<script language="javascript" src="foo"></script>
or
<script language="javascript">alert("foo")</script>
and some even are <SCRIPT></SCRIPT>
Neither of which match your regex. My regex would grab attributes in group 1, and the possible inline code in group 2. And also all tags within HTML comments. But it is about the best possible without BeautifulSoup et al

Extract information from a webpage in a particular format

I am trying to make a simple python script to extract certain links from a webpage. I am able to extract link successfully but now I want to extract some more information like bitrate,size,duration given on that webpage.
I am using the below xpath to extract the above mentioned info
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[#id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is that for a particular link the info I require is generated in a form of tuple like (bitrate,size,duration).
The xpath I mentioned above generates the required info but it is ill-formatted that is it is not possible to achieve my required format with any logic at least I am not able to that.
So, is there any way to achieve the output in my format.?
I think BeautifulSoup will do the job, it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
parsing is quite easy with BeautifulSoup - for example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
and have quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can actually strip everything out with XPath:
translate(.//*[#id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0, len(info)]
for altogether, or:
info.rfind(" ")
Since the translate leaves a space character, but you could replace that with whatever you wanted.
Addl info found here
How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the array, re.match(regex,info[n]) should suffice, as far as the triple tuple goes, the python tuple syntax takes care of it. Simply match from members of your info array with re.match.
import re
matching_re = '.*' # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re,info[1])
# etc.
var truple = (incoming_value_1, incoming_value_2, incoming_value_2

using output from beautifulsoup in python

Hey all, I am using beautifulsoup (after unsuccessfully struggling for two days with scrapy) to scrape starcraft 2 league data however I am encountering a problem.
I have this table with the result of which I want the string content of all tags which i do like this:
from BeautifulSoup import *
from urllib import urlopen
def parseWithSoup(url):
print "Reading:" , url
html = urlopen(url).read().lower()
bs = BeautifulSoup(html)
table = bs.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="tblt_table")
rows = table.findAll(lambda tag: tag.name=='tr')
rows.pop(0) #first row is header
for row in rows:
tags = row.findAll(lambda tag: tag.name=='a')
content = []
for tagcontent in tags:
content.append(tagcontent.string)
print content
if __name__ == '__main__':
content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
metSoup = parseWithSoup(content)
however the output is as follows:
[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...
My question is: where does the u'' come from (is it from unicode?) and how can I remove this? I just need the strings that are in u''...
The u means Unicode string. It doesn't change anything for you as a programmer and you should just disregard it. Treat them like normal strings. You actually want this u there.
Be aware that all Beautiful Soup output is unicode. That's a good thing, because if you run across any Unicode characters in your scraping, you won't have any problems. If you really want to get rid of the u, (I don't recommend it), you can use the unicode string's decode() method.
What you see are Python unicode strings.
Check the Python documentation
http://docs.python.org/howto/unicode.html
in order to deal correctly with unicode strings.

Categories