Extract artist and music from text (regex) - Python

I have written the following regex, but it's not working. Can you please help me? Thank you :-)
track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>'''
rx = "<p><\/p><p>Artist\(s\): (.*?)<br\/>Music: (.*?)<br\/><\/p>"
m = re.search(rx, track_desc)
Output Should be:
Artist(s) David
Music: Ramana Gogula

You were ignoring the whitespace:
<p>[\s\n\r]*Artist\(s\)[\s\n\r]*(.*?)[\s\n\r]*:[\s\n\r]*<br/>[\s\n\r]*Music:[\s\n\r]*(.*?)<br/>[\s\n\r]*</p>
Output is:
[1] => "David"
[2] => "Ramana Gogula"
(Note that your regex didn't capture the Artist(s) and Music: prefixes either.)
However, for production code I would not rely on such a clumsy regex (or on equally clumsily formatted HTML source).
Seriously though, ditch the idea of using regex for this if you aren't at all familiar with regex (which appears to be the case). You're using the wrong tool and a badly formatted data source. Parsing HTML with regex is wrong in 9 out of 10 cases (see @bgporter's comment link) and doomed to fail. Apart from that, HTML is hardly ever an appropriate data source (unless there really is no alternative).
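A minimal sketch of the whitespace-tolerant pattern above as runnable Python, assuming the markup always follows the layout of the sample. The names PATTERN and extract_credits are made up for illustration and are not part of the original answer:

```python
import re

track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>'''

# \s* absorbs the spaces and newlines between the pieces of markup.
# (Hypothetical helper names; only the sample layout is assumed.)
PATTERN = re.compile(
    r'Artist\(s\)\s*(.*?)\s*:\s*<br\s*/?>\s*Music:\s*(.*?)\s*<br\s*/?>'
)

def extract_credits(html):
    """Return (artist, music), or None when the layout does not match."""
    match = PATTERN.search(html)
    return match.groups() if match else None

credits = extract_credits(track_desc)
if credits:
    print("Artist(s) %s" % credits[0])
    print("Music: %s" % credits[1])
```

Returning None instead of raising keeps the caller in charge of what to do when the page layout changes, which is the likely failure mode for this kind of scraping.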

import lxml.html as lh
import re
track_desc = '''
<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>
'''
tree = lh.fromstring(track_desc)
print(re.findall(r'Artist\(s\) (.+):\s*\nMusic: (.*\w)', tree.text_content()))

I see a few errors:
the match has to span several lines: \s already matches newlines, but . does not unless re.DOTALL is set (re.MULTILINE only changes how ^ and $ behave, so it is not needed here)
spaces are not taken into account
Artist(s) is not followed by : in the source text
Because the web page is formatted rather strangely, relying on a regex is error-prone and I wouldn't advise using it extensively.
Note that the following seems to work:
rx = r'Artist(?:\(s\))?\s+(.*?)<br/>\s+Music:\s*(.*?)<br'
print("Art... : %s && Mus... : %s" % re.search(rx, track_desc).groups())

Related

soup.find_all doesn't find regex match - regex101 does

I have an HTML file and want to change "Test-Dateien" into "other-dir".
Since it is nested in some weird Outlook VML code, I tried using the following regex to access it:
pattern = re.compile(r"\<\!\-\-\[if gte vml 1\]\>\<v\:shape ((.|\n)*?)\-\-\>")
However, while the regex101 online tester reports a match, soup.find_all(text=pattern) returns no results.
Example text copied from soup below:
<p class="MsoNormal"><span style="mso-bookmark:_MailAutoSig"></span><a href="tel:+491624154900"><span style="mso-bookmark:_MailAutoSig"><span style='font-size:10.0pt;font-family:"Arial",sans-serif;mso-fareast-font-family:
Calibri;color:#646464;mso-fareast-language:EN-US;mso-no-proof:yes;text-decoration:
none;text-underline:none'><!--[if gte vml 1]><v:shape id="Bild_x0020_10"
o:spid="_x0000_i1039" type="#_x0000_t75" href="tel:+491624154900" style='width:13.5pt;
height:13.5pt;visibility:visible;mso-wrap-style:square' o:button="t">
<v:imagedata src="Test-Dateien/image004_1.png" o:title=""/>
</v:shape><![endif]--><?if !vml?><span style="mso-ignore:vglayout"><img border="0" height="18" src="Test-Dateien/image004_1.png" v:shapes="Bild_x0020_10" width="18"/></span><?endif?></span></span><span style="mso-bookmark:_MailAutoSig"></span></a><span style="mso-bookmark:_MailAutoSig"><span style='font-size:10.0pt;font-family:"Arial",sans-serif;mso-fareast-font-family:
Calibri;mso-fareast-language:EN-US;mso-no-proof:yes'><o:p></o:p></span></span></p>

Python3.7 how to extract numerical value from list

I am doing some CTFs and I made this script:
import requests
page = requests.get("http://ctf.slothparadise.com/about.php").text
p_split = page.split("<p>")
p2_split = p_split[3].split("</p>")
print(p2_split)
My output from this is:
['You are the 135181th visitor to this page.\n Every thousandth visitor gets a prize.', '\n </div> <!-- /container -->\n </body>\n</html>\n']
How can I extract the value 135181 out of this list?
You can use a regex. This is especially easy because the page apparently always uses the suffix 'th', even when the number ends in 1 or 2:
import re
import requests
page = requests.get("http://ctf.slothparadise.com/about.php").text
re.findall(r"\d+(?=th)", page)
output:
['135335']
To get this working for any value, adapt this:
my_split = ['You are the 135181th visitor to this page.\n Every thousandth visitor gets a prize.', '\n </div> <!-- /container -->\n </body>\n</html>\n']
visitor_num = my_split[0].split('You are the ', 1)[1].split('th')[0]
print(visitor_num)
Actually, I can see similar solutions in the comments as well. Hope this works for you!
For future reference, look up the split function and indexing; you will definitely use them again.
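The two approaches above can be combined into one small helper. This is only a sketch: the function name visitor_number is made up here, and accepting the suffixes st, nd and rd is an assumption in case the site ever fixes its grammar:

```python
import re

p2_split = ['You are the 135181th visitor to this page.\n Every thousandth visitor gets a prize.',
            '\n </div> <!-- /container -->\n </body>\n</html>\n']

def visitor_number(text):
    """Return the visitor count as an int, or None if the sentence is absent."""
    # Anchoring on the surrounding words avoids picking up unrelated digits.
    match = re.search(r'You are the (\d+)(?:st|nd|rd|th) visitor', text)
    return int(match.group(1)) if match else None

print(visitor_number(p2_split[0]))
```

Converting to int right away makes later arithmetic easier, for example checking whether the count is a multiple of 1000.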

Implementing Regular expressions in Python

I have code like this:
<td class="check ABCD" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}
<td class="check" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}}>
I want to extract only the class and id names from the above code. I have very little knowledge of regular expressions in Python.
How can I extract only the class name and id name (the values between the "") using a regular expression? Or is there a better way to do this?
If yes, please help me finding it :)
Thanks in advance.
Since you asked for a Regex solution in Python, you'll get one:
import re
p = re.compile(r'^.+?class="([^"]+)".+id="([^"]+)".+?$', re.MULTILINE)  # the ur'' prefix is Python 2 only
test_str = u"<td class=\"check ABCD\" rowspan=\"2\"><center><div class=\"checkbox {{#if checked}}select{{else}}deselect{{/if}}\" id=\"{{id}}\" {{data \"tool\"}\n<td class=\"check\" rowspan=\"2\"><center><div class=\"checkbox {{#if checked}}select{{else}}deselect{{/if}}\" id=\"{{id}}\" {{data \"tool\"}}>"
re.findall(p, test_str)
See live example over here: https://regex101.com/r/cG8dC5/1
Nevertheless, as some other users have already noted, regex isn't ideal for parsing (X)HTML. Better have a look at: https://pypi.python.org/pypi/beautifulsoup4
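The pattern above returns only the first class and the id on each line. If every class value is wanted, a simpler pattern that collects attribute pairs may be enough. This is a sketch that assumes the values never contain a double quote; ATTR and class_and_id_attrs are illustrative names, not an existing API:

```python
import re

lines = [
    '<td class="check ABCD" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}',
    '<td class="check" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}}>',
]

# Capture the attribute name and everything between its double quotes.
ATTR = re.compile(r'\b(class|id)="([^"]*)"')

def class_and_id_attrs(line):
    """Return a list of (attribute, value) tuples in order of appearance."""
    return ATTR.findall(line)

for line in lines:
    print(class_and_id_attrs(line))
```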

Programmatically delete everything before a HTML node?

I am trying to create a corpus of data from a set of .html pages I have stored in a directory.
These HTML pages have lots of info I don't need.
This info is all stored before the line
<div class="channel">
How can I programmatically remove all of the text before
<div class="channel">
in every HTML file in a folder?
Bonus question for a 50point bounty :
How do I programmatically remove everything AFTER, for example,
<div class="footer">
?
So if my index.html was previously :
<head>
<title>This is bad HTML</title>
</head>
<body>
<h1> Remove me</h1>
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
<div class="footer">
<h1> Remove me, I am pointless</h1>
</div>
</body>
After my script runs, I want it to be :
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
This is a bit heavy on memory usage, but it works. Basically you open up the directory, get all ".html" files, read them into a variable, find the split point, store the before or after in a variable, and then overwrite the file.
There are probably better ways to do this, nonetheless, but it works.
import os
marker = '<div class="channel">'
# Collect every .html file in the current directory
files = [name for name in os.listdir(".") if name.endswith(".html")]
for fileName in files:
    with open(fileName) as f:
        content = f.read()
    loc = content.find(marker)
    if loc == -1:
        continue  # marker not found: leave this file untouched
    newContent = content[loc:]  # keep everything from the marker onward
    with open(fileName, 'w') as f:
        f.write(newContent)
If you instead wanted to keep everything up to a marker (with loc being that marker's position):
newContent = content[:loc]  # the slice end is exclusive, so no -1 is needed
Note that the things you're searching should be kept in a variable, and not hardcoded.
Also, this won't work recursively for file/folder structures, but you can find out how to modify it to do that very easily.
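One way to handle nested folders is pathlib's rglob. The following is a sketch rather than a drop-in replacement: the function name trim_before_marker is made up, and it assumes the files are UTF-8 encoded:

```python
from pathlib import Path

def trim_before_marker(root, marker='<div class="channel">'):
    """Drop everything before `marker` in each .html file under `root`.

    Returns the list of files that were rewritten.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.html")):
        content = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        loc = content.find(marker)
        if loc <= 0:
            continue  # marker missing (-1) or already at the start (0)
        path.write_text(content[loc:], encoding="utf-8")
        changed.append(path)
    return changed
```

Calling trim_before_marker(".") would process the current directory and all of its subdirectories. Since the files are overwritten in place, it is worth trying it on a copy first.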
Removing everything above and everything below
means the only thing left should be this section:
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
Rather than thinking about removing the unwanted parts, it is easier to just extract the wanted part.
You can easily extract the channel div using a DOM-based XML/HTML parser.
You've not mentioned a language in the question. The post is tagged with python, so this answer might be out of context, but I'll give a PHP solution that can likely be rewritten easily in another language.
$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
$result = $search.$components[1];
return $result;
To do the reverse is fairly easy too; simply take the value of $components[0] after altering $search to your <div class="footer"> value.
If you happen to have the $search string cropping up multiple times:
$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
unset($components[0]);
$result = $search.implode($search,$components);
return $result;
If someone knows Python better than I do, feel free to rewrite this and take the answer!
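A loose Python port of the PHP idea, covering both the "before" and the "after" case in one step. It is a sketch only: keep_section is a made-up name, and it uses str.partition, which splits on the first occurrence of each marker:

```python
def keep_section(html, start='<div class="channel">', end='<div class="footer">'):
    """Keep the text from `start` up to, but not including, `end`."""
    _, found_start, rest = html.partition(start)
    if not found_start:
        return html  # start marker absent: return the input unchanged
    # When `end` is absent, partition leaves all of `rest` in `body`.
    body, _, _ = rest.partition(end)
    return start + body

page = """<head>
<title>This is bad HTML</title>
</head>
<body>
<h1> Remove me</h1>
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
<div class="footer">
<h1> Remove me, I am pointless</h1>
</div>
</body>
"""

print(keep_section(page))
```

Plain string splitting like this depends on the markers appearing exactly as written, so a change in attribute order or quoting in the source files would need a real HTML parser instead.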

Sleek way of un/commenting out html tags in markdown

I'm trying to find a nice way of wrapping HTML tags in HTML comments without writing 5 functions and 50 lines of code. Take this example:
<section class="left span9">
### Test
</section>
I need to transform it to :
<!--<section class="left span9">-->
### Test
<!--</section>-->
I have a regex to find the tags
re.findall('<.*?>', str)
but I haven't used lambdas much in recent years, so I'm having a hard time getting it to work.
By the way, any ideas for the reverse of this process, i.e. uncommenting the tags?
You can comment/uncomment using a simple replace like this:
myString = '<section class="left span9">'
commented = myString.replace("<", "<!--<").replace(">", ">-->")
print(commented)
print(commented.replace("<!--", "").replace("-->", ""))
Output:
<!--<section class="left span9">-->
<section class="left span9">
Note: This works because a valid HTML document should have < and > only in the HTML tags. If they should appear literally in the output, they have to be properly HTML-escaped as &lt; and &gt;.
Ok, so temporarily I've ended up using two functions and re.sub for that :
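import re
def comment(match):
    # Wrap the matched tag in an HTML comment
    return '<!--' + match.group(0) + '-->'
def uncomment(html):
    # Strip every comment delimiter, leaving the tags themselves
    return html.replace('<!--', '').replace('-->', '')
html_string = '<section class="left span9">\n### Test\n</section>'  # sample input from the question
commented_html = re.sub('<.*?>', comment, html_string)
uncommented_html = uncomment(commented_html)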
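A possible variant that does both directions with re.sub and leaves genuine comments (ones that do not wrap a tag) alone when uncommenting. It is a sketch under two assumptions: the input contains no comments when comment_tags runs, and no tag contains a literal > inside an attribute value. The names are illustrative:

```python
import re

TAG = re.compile(r'<[^>]*>')                     # any tag-like <...> run
COMMENTED_TAG = re.compile(r'<!--(<[^>]*>)-->')  # a tag wrapped in a comment

def comment_tags(text):
    """Wrap every tag in an HTML comment."""
    return TAG.sub(lambda m: '<!--' + m.group(0) + '-->', text)

def uncomment_tags(text):
    """Unwrap only comments that directly enclose a tag."""
    return COMMENTED_TAG.sub(r'\1', text)

source = '<section class="left span9">\n### Test\n</section>'
commented = comment_tags(source)
print(commented)
print(uncomment_tags(commented))
```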
