I have code like this:
<td class="check ABCD" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}
<td class="check" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}}>
I want to extract only the class and ID names from the above code. I have very little knowledge of regular expressions in Python.
How can I extract only the class name and ID name (the ones between the double quotes) using a regular expression? Or is there a better way to do this?
If so, please help me find it :)
Thanks in advance.
Since you asked for a Regex solution in Python, you'll get one:
import re
# Group 1 captures the class attribute value, group 2 the id attribute value
p = re.compile(r'^.+?class="([^"]+)".+id="([^"]+)".+?$', re.MULTILINE)
test_str = u"<td class=\"check ABCD\" rowspan=\"2\"><center><div class=\"checkbox {{#if checked}}select{{else}}deselect{{/if}}\" id=\"{{id}}\" {{data \"tool\"}\n<td class=\"check\" rowspan=\"2\"><center><div class=\"checkbox {{#if checked}}select{{else}}deselect{{/if}}\" id=\"{{id}}\" {{data \"tool\"}}>"
print(re.findall(p, test_str))
See a live example here: https://regex101.com/r/cG8dC5/1
Nevertheless, as other users have already noted, regex isn't ideal for parsing (X)HTML. Better to have a look at https://pypi.python.org/pypi/beautifulsoup4 instead.
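For comparison, here is a minimal BeautifulSoup sketch (a hedged illustration, not your exact data: it assumes the Handlebars placeholders have already been rendered, since raw {{...}} fragments inside attributes can trip up a parser, and the id value is just illustrative):
from bs4 import BeautifulSoup

snippet = '<td class="check ABCD" rowspan="2"><center><div class="checkbox select" id="row-1">'
soup = BeautifulSoup(snippet, "html.parser")

# Print the class list and id of every tag that carries a class attribute
for tag in soup.find_all(attrs={"class": True}):
    print(tag.name, tag.get("class"), tag.get("id"))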
I have an HTML file and want to change "Test-Dateien" into "other-dir".
Since it is nested in some weird Outlook VML code, I tried using the following regex to access it:
pattern = re.compile(r"\<\!\-\-\[if gte vml 1\]\>\<v\:shape ((.|\n)*?)\-\-\>")
However, while the regex101 online tester returns a match, soup.find_all(text=pattern) returns nothing.
Example text copied from soup below:
<p class="MsoNormal"><span style="mso-bookmark:_MailAutoSig"></span><a href="tel:+491624154900"><span style="mso-bookmark:_MailAutoSig"><span style='font-size:10.0pt;font-family:"Arial",sans-serif;mso-fareast-font-family:
Calibri;color:#646464;mso-fareast-language:EN-US;mso-no-proof:yes;text-decoration:
none;text-underline:none'><!--[if gte vml 1]><v:shape id="Bild_x0020_10"
o:spid="_x0000_i1039" type="#_x0000_t75" href="tel:+491624154900" style='width:13.5pt;
height:13.5pt;visibility:visible;mso-wrap-style:square' o:button="t">
<v:imagedata src="Test-Dateien/image004_1.png" o:title=""/>
</v:shape><![endif]--><?if !vml?><span style="mso-ignore:vglayout"><img border="0" height="18" src="Test-Dateien/image004_1.png" v:shapes="Bild_x0020_10" width="18"/></span><?endif?></span></span><span style="mso-bookmark:_MailAutoSig"></span></a><span style="mso-bookmark:_MailAutoSig"><span style='font-size:10.0pt;font-family:"Arial",sans-serif;mso-fareast-font-family:
Calibri;mso-fareast-language:EN-US;mso-no-proof:yes'><o:p></o:p></span></span></p>
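One thing worth checking (a guess based on the snippet, not a verified fix): BeautifulSoup hands the <!--[if gte vml 1]> ... <![endif]--> block back as a Comment node and strips the <!-- and --> delimiters from its text, so a pattern that begins with \<\!\-\- can never match what find_all sees. A minimal sketch along those lines, assuming the markup lives in a string called html:
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(html, "html.parser")
# Search comment nodes directly and do the substitution on their inner text
for c in soup.find_all(string=lambda t: isinstance(t, Comment) and "Test-Dateien" in t):
    c.replace_with(Comment(c.replace("Test-Dateien", "other-dir")))
The plain <img src="Test-Dateien/..."> outside the comment would still need an ordinary attribute replacement (or a simple str.replace on the raw markup).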
I'm trying to scrape the email address from a website where the email is nested within a script, and a simple "find/findAll + .text" isn't doing the trick.
source html:
<script>EMLink('com','aol','mikemhnam','<div class="emailgraphic"><img style="position: relative; top: 3px;" src="https://www.naylornetwork.com/EMailProtector/text-gif.aspx?sx=com&nx=mikemhnam&dx=aol&size=9&color=034af3&underline=yes" border=0></div>','pcoc.officialbuyersguide.net Inquiry','onClick=\'$.get("TrackLinkClick", { LinkType: "Email", LinkValue: "mikemhnam#aol.com", MDSID: "CPC-1210", AdListingID: "" });\'')</script>
<br/>
My current approach was to try findAll plus a regex expression, like so:
for email in soup.findAll(class_='ListingPageNameAddress NONE'):
    print(email.findAll("([\w\._]+\#([\w_]+\\.)+[a-zA-Z]+)"))
but in Jupyter this only returns [] :/
Is there an issue with the regex expression or a simpler way to try to tease out the email here?
Although a regex may be more robust over time, in my experience these parts of script tags remain pretty constant, so consider a plan B of using split:
html ='''
<script>EMLink('com','aol','mikemhnam','<div class="emailgraphic"><img style="position: relative; top: 3px;" src="https://www.naylornetwork.com/EMailProtector/text-gif.aspx?sx=com&nx=mikemhnam&dx=aol&size=9&color=034af3&underline=yes" border=0></div>','pcoc.officialbuyersguide.net Inquiry','onClick=\'$.get("TrackLinkClick", { LinkType: "Email", LinkValue: "mikemhnam#aol.com", MDSID: "CPC-1210", AdListingID: "" });\'')</script>
<br/>
'''
print(html.split('LinkValue: "')[1].split('"')[0])
It appears that you aren't using the right findall. You need to import re and then use re.findall(), not BeautifulSoup's findAll() method (note the case difference of the letter "A"). The function's interface is:
re.findall(pattern, string, flags=0)
For details, see this section of the re doc on finding all adverbs.
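As a hedged sketch of that idea (the LinkValue field name comes from the snippet above; how you locate the script tag on the real page may differ):
import re

script_text = soup.find("script").string or ""  # assumes soup was built from the page as in the question
# The address sits in the LinkValue: "..." argument of the onClick handler
print(re.findall(r'LinkValue:\s*"([^"]+)"', script_text))
# ['mikemhnam#aol.com'] for the snippet shown above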
I am doing some CTFs and I made this script:
import requests
page = requests.get("http://ctf.slothparadise.com/about.php").text
p_split = page.split("<p>")
p2_split = p_split[3].split("</p>")
print(p2_split)
My output from this is:
['You are the 135181th visitor to this page.\n Every thousandth visitor gets a prize.', '\n </div> <!-- /container -->\n </body>\n</html>\n']
How can I extract the value 135181 out of this list?
You can try a regex; this is especially easy since the site doesn't seem to change the 'th' suffix even when the number ends in 1 or 2:
import re
import requests
page = requests.get("http://ctf.slothparadise.com/about.php").text
re.findall("\d+(?=th)", page)
output:
['135335']
To get this working for any value, adapt this:
my_split = ['You are the 135181th visitor to this page.\n Every thousandth visitor gets a prize.', '\n </div> <!-- /container -->\n </body>\n</html>\n']
visitor_num = my_split[0].split('You are the ', 1)[1].split('th')[0]
print(visitor_num)
Actually, I can see similar solutions in the comments as well... hope this works for you!
For future reference, look up the split function and indexing; it's something you will definitely use again.
I'm trying to find a nice way of wrapping HTML tags in HTML comments without writing 5 functions and 50 lines of code. Using this example code:
<section class="left span9">
### Test
</section>
I need to transform it to:
<!--<section class="left span9">-->
### Test
<!--</section>-->
I have a regex to find the tags
re.findall('<.*?>', str)
but I haven't used lambdas much in the last few years, so I'm having a hard time getting it to work.
BTW, any ideas for the reverse of this process, i.e. uncommenting the tags?
You can comment/uncomment using a simple replace like this:
myString = '<section class="left span9">'
print myString.replace("<", "<!--<").replace(">", ">-->")
print myString.replace("<!--", "").replace("-->", "")
Output:
<!--<section class="left span9">-->
<section class="left span9">
Note: this works because a valid HTML document should have < and > only in the HTML tags. If they need to appear literally in the output, they have to be properly HTML-escaped as &lt; and &gt;.
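A tiny sketch of that escaping step with the standard library, just to illustrate the note above:
import html

text = "tags use < and >"
print(html.escape(text))  # tags use &lt; and &gt;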
OK, so for now I've ended up using two functions and re.sub for that:
import re

def comment(match):
    return '<!--' + match.group(0) + '-->'

def uncomment(html):
    return html.replace('<!--', '').replace('-->', '')

commented_html = re.sub('<.*?>', comment, html_string)
uncommented_html = uncomment(commented_html)
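For what it's worth, a quick demonstration of those two helpers on the example markup from the question (expected output shown in the comments):
sample = '<section class="left span9">\n### Test\n</section>'
wrapped = re.sub('<.*?>', comment, sample)
print(wrapped)             # <!--<section class="left span9">-->
                           # ### Test
                           # <!--</section>-->
print(uncomment(wrapped))  # restores the original markup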
I have written the following regex, but it's not working. Can you please help me? Thank you :-)
track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>'''
rx = "<p><\/p><p>Artist\(s\): (.*?)<br\/>Music: (.*?)<br\/><\/p>"
m = re.search(rx, track_desc)
Output should be:
Artist(s) David
Music: Ramana Gogula
You were ignoring the whitespace:
<p>[\s\n\r]*Artist\(s\)[\s\n\r]*(.*?)[\s\n\r]*:[\s\n\r]*<br/>[\s\n\r]*Music:[\s\n\r]*(.*?)<br/>[\s\n\r]*</p>
Output is:
[1] => "David"
[2] => "Ramana Gogula"
(note that your regex didn't match the Artist(s) and Music: prefixes either)
However, for production code I would not rely on such a clumsy regex (or on such clumsily formatted HTML source).
Seriously though, ditch the idea of using regex for this if you aren't the slightest bit familiar with regex (which it looks like). You're using the wrong tool and a badly formatted data source. Parsing HTML with regex is wrong in 9 out of 10 cases (see @bgporter's comment link) and doomed to fail. Apart from that, HTML is hardly ever an appropriate data source (unless there really, really is no alternative).
import lxml.html as lh
import re
track_desc = '''
<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>
'''
tree = lh.fromstring(track_desc)
print(re.findall(r'Artist\(s\) (.+):\s*\nMusic: (.*\w)', tree.text_content()))
I see a few errors:
the regex does not span lines as written: the line breaks in the source need to be handled (e.g. with \s+ in the pattern, or with re.DOTALL so that . also matches newlines)
whitespace is not taken into account
Artist(s) is not followed by a :
As the web page is rather strangely formatted, relying on a regex here may be error-prone and I wouldn't advise using it extensively.
Note, the following seems to work:
rx = r'Artist(?:\(s\))?\s+(.*?)<br/>\s+Music:\s*(.*?)<br'
print("Art... : %s && Mus... : %s" % re.search(rx, track_desc, flags=re.MULTILINE).groups())