I need help extracting the src values (e.g. LOC/IMG.png) from the text below. Is there an optimal approach, given that I have over 10^5 files to process?
I have JSON as follows:
{"Items":[{src=\"LOC/IMG.png\"}]}
You have JSON that contains some values that are HTML. If at all possible, therefore, you should parse the JSON as JSON, then parse the HTML values as HTML. This requires you to understand a tiny bit about the structure of the data—but that's a good thing to understand anyway.
For example:
import json
import bs4

def extract_srcs(s):
    j = json.loads(s)
    for item in j['Items']:
        # each item's value is an HTML fragment; parse it as HTML
        soup = bs4.BeautifulSoup(item['Item'], 'html.parser')
        for img in soup.find_all('img'):
            yield img['src']
This may be too slow, but it only takes a couple of minutes to write the correct code, run it on 1000 random representative files, and then figure out whether it will be fast enough when extrapolated to your 10^5 files. If it is fast enough, do it this way: all else being equal, it's always better to be correct and simple than kludgy or complicated, and you'll save time if unexpected data show up as errors right off the bat rather than as incorrect results you don't notice until a week later…
If your files are about 2K, like your example, my laptop can json.loads 2K of random JSON and BeautifulSoup 2K of random HTML in less time than it takes to read 2K off a hard drive, so at worst this will take only twice as long as reading the data and doing nothing. If you have a slow CPU and a fast SSD, or if your data are very unusual, etc., that may not be true (that's why you test instead of guessing), but I think you'll be fine.
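As a rough sketch of that test (the file paths, the 'Item' key, and the sample size are assumptions; adjust them to your data):

import json
import time
import bs4

sample_files = ['data/file%04d.json' % i for i in range(1000)]  # hypothetical paths

start = time.perf_counter()
for path in sample_files:
    with open(path, encoding='utf-8') as f:
        doc = json.load(f)
    for item in doc['Items']:
        # parse each HTML value the same way the real code would
        bs4.BeautifulSoup(item['Item'], 'html.parser')
elapsed = time.perf_counter() - start
print('%.2f s for %d files' % (elapsed, len(sample_files)))
print('projected for 10**5 files: %.1f s' % (elapsed / len(sample_files) * 10**5))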
Let me put in a disclaimer for parser fans: I do not claim regexes are the coolest, and I myself use XML/JSON parsers wherever I can. However, when it comes to malformed text, parsers usually cannot handle those cases the way I want, and I have to add regex-ish code to deal with those situations.
So, in case a regex is absolutely necessary, use the regex (?<=src=\\").*?(?=\\"). The look-behind (?<=src=\\") and look-ahead (?=\\") act as boundaries for the value inside the src attribute.
Here is sample code:
import re

# the look-behind/look-ahead anchor on the escaped quotes (\") in the raw JSON text
p = re.compile(r'(?<=src=\\").*?(?=\\")')
test_str = r'{"Items":[{src=\"LOC/IMG.png\"}]}'
print(re.findall(p, test_str))  # ['LOC/IMG.png']
So I have an HTML file that consists of 4,574 words and 57,718 characters.
But recently, when I read it using the .read() command, it seems to hit a limit: only 3,004 words and 39,248 characters show up when I export it.
How can I read and export it fully, without any limitation?
This is my python script:
from IPython.display import FileLink, HTML
title = "Download HTML file"
filename = "data.html"
payload = open("./dendo_plot(2).html").read()
payload = payload.replace('"', "&quot;")
html = f'<a download="{filename}" href="data:text/html;charset=utf-8,{payload}" target="_blank">{title}</a>'
print(payload)
HTML(html)
This is what I mean: left (source file), right (exported file); you can see there is a gap between the two.
I don't think there's a problem here, I think you are simply misinterpreting a variation in a metric between your input and output.
When you call read() on an opened file with no arguments, it reads the whole content of the file (until EOF) and puts it in memory:
To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string [...]. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.
From the official Python tutorial
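For instance, a minimal illustration of the size argument, using the file name from the question:

# no argument: read everything until EOF; an argument caps the read
with open('./dendo_plot(2).html', encoding='utf-8') as f:
    whole = f.read()
with open('./dendo_plot(2).html', encoding='utf-8') as f:
    first_kb = f.read(1024)  # at most 1024 characters in text mode
print(len(whole), len(first_kb))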
So technically Python might be unable to read the whole file because it is too big to fit in your memory, but I strongly doubt that's what happening here.
I believe the difference in the number of characters and words between your input and output is because your data is changed when it is processed.
Look at: payload = payload.replace('"', "&quot;"). From an HTML validation point of view, " and &quot; are the same and are displayed the same (which is why you can switch them), but from a Python point of view they are different strings with different lengths:
>>> len('"')
1
>>> len("&quot;")
6
So just with this line you get a variation in your input and output.
That being said, I don't think it is very relevant to use the number of characters and words to check if two pieces of HTML are the same. Take the following example:
>>> first_html = """<div>
... <p>Hello there</p>
... </div>"""
>>> len(first_html)
32
>>> second_html = "<div><p>Hello there</p></div>"
>>> len(second_html)
29
You would agree that both HTML will display the same thing, but they don't have the same number of characters. The HTML specification is quite tolerant in the usage of spaces, tabulation and new lines, that's why both previous examples are treated as equal by an HTML parser.
About the number of words, one simple question (well, not that simple to answer ^^'): what qualifies as a word in HTML? Is it only the text displayed? Do the HTML tags count as well? If so, what about their attributes?
So to sum up, I don't think you have a real problem here, only a difference that is a problem from a certain point of view but not from another.
I currently have binary data that looks like this:
test = b'Got [01]:\n{\'test\': [{\'message\': \'foo bar baz \'\n "\'secured\', current \'proposal\'.",\n \'name\': \'this is a very great name \'\n \'approves something of great order \'\n \'has no one else associated\',\n \'status\': \'very good\'}],\n \'log-url\': \'https://localhost/we/are/the/champions\',\n \'status\': \'rockingandrolling\'}\n'
As you can see this is basically JSON.
So what I did was the following:
test.decode('utf8').replace("Got [01]:\n{", '{').replace("\n", "").replace("'", '"')
This basically turned it into a string and got it as close to valid JSON as possible. Unfortunately, it doesn't fully get there, because when I convert it to a string it keeps all these stupid spaces and line breaks, which are hard to parse out with all the .replace()s I keep using.
Is there any way to make the binary data that is being output and decoded produce a single line, so that I can parse the string and turn it into JSON format?
I have also tried a regex for this specific case, and it works, but because this binary data is generated dynamically every time, the line breaks and spaces may be in slightly different places, so a regex is too brittle to catch every case.
Any thoughts?
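For what it's worth, since the payload looks like a pretty-printed Python dict rather than JSON, one possible approach is ast.literal_eval. This is a sketch that assumes the data is always a valid Python literal after the Got [01]: prefix:

import ast

raw = test.decode('utf-8')
# drop the "Got [01]:" prefix; keep everything from the first '{'
literal = raw[raw.index('{'):]
# literal_eval copes with the line breaks and with the adjacent
# string fragments ('foo ' 'bar') that pretty-printing produces
data = ast.literal_eval(literal)
print(data['status'])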
I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1"
Essentially, I need it to pull whatever sentence is between <br> tags.
I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences or single words.
This, however, needs to pull whatever arbitrary string is between <br> tags.
Can anyone help me out? Thanks.
Best I could come up with:
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)
EDIT: Ended up going with a different approach altogether: simply splitting the HTML into a list separated by <br> and pulling [3] (see the sketch below), which made for cleaner code and fewer string operations. Keeping this question up for future reference and for other people with similar questions.
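A sketch of that splitting approach (the [3] index simply matches this particular page's layout):

from urllib.request import urlopen

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read().decode('utf-8')
# split on <br> and take the fourth chunk, per the edit above
quote = html.split('<br>')[3].strip()
print(quote)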
You need to use the DOTALL flag as there are newlines in the expression that you need to match. I would use
re.findall('<br>(.*?)<br>', html, re.S)
However, this will return multiple results, as there are a bunch of <br><br> pairs on that page. You may want to use the more specific:
re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
from urllib.request import urlopen
import re

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read().decode('utf-8')
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)
if len(output) > 0:
    print(output)
    output = re.sub('\n', ' ', output[0])
    output = re.sub('\t', '', output)
    print(output)
Terminal
imac2011:Desktop allendar$ python test.py
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
You could also strip off the final \n's and replace the ones inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so that you preserve the original line breaks visually.
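A sketch of that idea, applied to the raw matched quote from the terminal output above:

raw_quote = 'A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n'
# drop the trailing newlines, keep the inner ones as visual breaks
html_quote = raw_quote.rstrip('\n').replace('\n', '<br />').replace('\t', '')
print(html_quote)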
All the jokes on that page follow the same pattern, with nothing ambiguous, so you can use this:
output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)
No need to use the DOTALL flag, because there's no dot in the pattern.
This is uh, 7 years later, but for future reference:
Use the BeautifulSoup library for this kind of purpose, as suggested by Floris in the comments.
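A sketch of what that looks like here (the page is plain enough that the body text is the quote; that selector is an assumption):

from urllib.request import urlopen
import bs4

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
soup = bs4.BeautifulSoup(html, 'html.parser')
# assumes the quote is the only text in the body
print(soup.body.get_text().strip())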
I have a string
<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />
What is the regex to find ABCDXYZ in Python?
Don't use regex to parse HTML. Use BeautifulSoup.
from bs4 import BeautifulSoup as BS
text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
soup = BS(text, 'html.parser')
print(soup.find('img').attrs['alt'])
If you're looking for the value of that alt attribute, you can do this:
>>> r = r'alt="(.*?)"'
Then:
>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'
And you can use re.findall if you want to find more than one.
However, this code will be easily fooled by something like this:
<span>Here's some text explaining how to do alt="foo" in an img tag.</span>
On the other hand, it'll also fail to pick up something like this:
<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.
It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine, and on top of that it's embedded in a Turing-complete programming language, so obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in Perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist; and if you did, you shouldn't force yourself to use regexes even when there's an easier way to do what you want. For quick-and-dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.
One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
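If you want a feel for that failure mode, here is a minimal demonstration of exponential backtracking (a deliberately pathological pattern, not tied to HTML):

import re
import time

pattern = re.compile(r'(a+)+b')  # nested quantifiers: a classic backtracking trap
s = 'a' * 25 + 'c'  # almost matches, but the final 'c' forces full backtracking

start = time.perf_counter()
pattern.search(s)  # tries exponentially many ways to split the a's before failing
print('%.2f s' % (time.perf_counter() - start))  # takes seconds; roughly doubles per extra 'a'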
Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.
If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.
First, a disclaimer: you shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.
Next, if you are actually serious about using regular expressions and the above is the exact case you want then you could do something like:
<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.-]+" alt="([a-zA-Z0-9/]+)" />
and you could access the text via the match object's groups() method.
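For example, with the string from the question (a quick sketch):

import re

text = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'
pattern = re.compile(r'<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.-]+" alt="([a-zA-Z0-9/]+)" />')
m = pattern.search(text)
if m:
    print(m.groups())  # ('ABCDXYZ',)
    print(m.group(1))  # 'ABCDXYZ'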
I have a file I need to parse. The parsing is built incrementally, such that on each iteration the expressions become more case-specific.
The code segment which overloads the system looks roughly like this:
import re

for item in ret:
    pat = r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<' % item[1]
    r = re.compile(pat, re.DOTALL)
    match = r.findall(f)
The file is a rather large HTML file (parsed from bing maps), and each answer must match its exact id.
Before applying this change, the workflow was very good. Is there anything I can do to avoid this, or to optimize the code?
My only guess is that you are getting too many matches and running out of memory. Though this doesn't seem very reasonable, it might be the case. Try using finditer instead of findall to get one match at a time without creating a monster list of matches. If that doesn't fix your problem, you might have stumbled on a more serious bug in the re module.
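A sketch of the finditer version (names taken from the question; the per-match handling is a placeholder):

import re

for item in ret:
    pat = r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<' % item[1]
    r = re.compile(pat, re.DOTALL)
    for m in r.finditer(f):
        handle(m.group(1))  # hypothetical handler; no giant result list is built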