Remove all white spaces inside specific delimiters - python

I'm trying to process an XML file containing wrongly formed elements.
A wrongly formed element is one which doesn't respect the following pattern: <name attribute1=value1 attribute2=value2 ... attributeN=valueN>
There can be 0 to n attributes.
As a consequence, <my element number> is invalid, while <my element=number> is not.
Here is a sample of my text :
<product_name>
A high wind in Jamaica <The innocent voyage> The modern library of the world s best books Books Richard Arthur Warren Hughes
</product_name>
Here, <product_name> is a good element, while <The innocent voyage> is not.
When an incorrect element is spotted, I would like to have the <> replaced with neutral characters, such as +.
Since the file containing these tags is pretty big (1.5 GB), I would rather not use a brute force approach.
Would you guys see a fast (and if possible, elegant) way to solve this problem?

As you state that you would rather stay away from regex, I was able to create the following code that doesn't use regex (although I'm sure regex would be quite useful)
def valid_tag(tag):
    temp = tag.split()
    for word in temp[1:]:
        if "=" not in word:
            return False
    return True
Here you pass in a tag as a string as the parameter. For example: "<hello test=test>"
You can run this test on each tag by creating another method for getting a tag by finding a "<" and then the first ">" that follows and creating a substring from that which will be the tag that you pass into this method.
NOTE: This assumes that your tags are written as follows: <hello test=test> and not < hello test = test >
This method is still very primitive and makes a few assumptions as I stated above but hopefully it will give you the start you need.
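The scanning step described above (find a "<", then the first ">" after it, and test the substring) could be sketched roughly as follows. The names neutralize_bad_tags and filler are made up for illustration, and valid_tag is a slightly condensed variant of the function above. It assumes tags never contain a nested "<" and, for a 1.5 GB file, that it is called on one line or chunk at a time with no tag spanning a chunk boundary.

```python
def valid_tag(tag):
    """Return True if every word after the element name contains '='."""
    words = tag.strip("<>").split()
    return bool(words) and all("=" in word for word in words[1:])


def neutralize_bad_tags(text, filler="+"):
    """Replace the < and > of invalid tags with `filler` (hypothetical helper)."""
    out = []
    pos = 0
    while True:
        start = text.find("<", pos)
        end = text.find(">", start) if start != -1 else -1
        if start == -1 or end == -1:
            # No complete tag left: copy the remainder unchanged.
            out.append(text[pos:])
            break
        tag = text[start:end + 1]
        out.append(text[pos:start])
        if valid_tag(tag):
            out.append(tag)
        else:
            out.append(filler + tag[1:-1] + filler)
        pos = end + 1
    return "".join(out)


sample = "<product_name>A high wind <The innocent voyage> Books</product_name>"
cleaned = neutralize_bad_tags(sample)
```

Each character is visited once and the output is built with a single join, so the work grows linearly with the size of the input.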


How to perform a tag-agnostic text string search in an html file?

I'm using LanguageTool (LT) with the --xmlfilter option enabled to spell-check HTML files. This forces LanguageTool to strip all tags before running the spell check.
This also means that all reported character positions are off because LT doesn't "see" the tags.
For example, if I check the following HTML fragment:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
LanguageTool will treat it as a plain text sentence:
This is kind of a stupid question.
and returns the following message:
<error category="Grammar" categoryid="GRAMMAR" context=" This is kind of a stupid question. " contextoffset="24" errorlength="9" fromx="8" fromy="8" locqualityissuetype="grammar" msg="Don't include 'a' after a classification term. Use simply 'kind of'." offset="24" replacements="kind of" ruleId="KIND_OF_A" shortmsg="Grammatical problem" subId="1" tox="17" toy="8"/>
(In this particular example, LT has flagged "kind of a.")
Since the search string might be wrapped in tags and might occur multiple times I can't do a simple index search.
What would be the most efficient Python solution to reliably locate any given text string in an HTML file? (LT returns an approximate character position, which might be off by 10-30% depending on the number of tags, as well as the words before and after the flagged word(s).)
I.e. I'd need to do a search that ignores all tags, but includes them in the character position count.
In this particular example, I'd have to locate "kind of a" and find the location of the letter k in:
kin<b>d</b> o<i>f</i> a
This may not be the speediest way to go, but pyparsing will recognize HTML tags in most forms. The following code inverts the typical scan, creating a scanner that will match any single character, and then configuring the scanner to skip over HTML open and close tags, and also common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens, the starting, and the ending location of each match, so it is easy to build a list that maps every character outside of a tag to its original location. From there, the rest is pretty much just ''.join and indexing into the list. See the comments in the code below:
test = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity
non_tag_text = Word(printables+' ', exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)
# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]
# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)
# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
orig_loc = char_locs[untagged.find(search_str)][1]
# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')
"""
Should look like this:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
           ^
"""
The --xmlfilter option is deprecated because of issues like this. The proper solution is to remove the tags yourself but keep the positions so you have a mapping to correct the results that come back from LT. When using LT from Java, this is supported by AnnotatedText, but the algorithm should be simple enough to port it. (full disclosure: I'm the maintainer of LT)
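For illustration only, here is a minimal sketch of the "strip the tags yourself but keep the positions" idea using just the standard library. It is not a port of LT's AnnotatedText. The helper name strip_tags_with_map is made up, and the tag regex is deliberately naive: it assumes no ">" inside attribute values, and it does not handle comments or decode entities.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # naive tag matcher (assumption, see above)


def strip_tags_with_map(html):
    """Return (plain, positions); positions[i] is the index in html of plain[i]."""
    plain = []
    positions = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Copy the characters between the previous tag and this one.
        for i in range(pos, match.start()):
            plain.append(html[i])
            positions.append(i)
        pos = match.end()
    for i in range(pos, len(html)):
        plain.append(html[i])
        positions.append(i)
    return "".join(plain), positions


html = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
plain, positions = strip_tags_with_map(html)

search_str = "kind of a"
start = plain.find(search_str)
orig_start = positions[start]
orig_end = positions[start + len(search_str) - 1] + 1
```

The plain text is what you would send to the checker. Any offset it reports can then be translated back through positions, and html[orig_start:orig_end] gives the flagged span including the tags inside it.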

Why is this Python script with regular expressions that slow?

The job is to read in a very large XML file line by line and store what has been already read in a string. When the string contains a full record between tags 'player' and '/player', all the values of xml tags within this record should be written to a text file as a tab separated line and the record removed from the already read chunk.
At the end of the process the unremoved part ( remainder ) should be printed, to check if all records have been properly processed and nothing remained unprocessed.
I already have this code in Perl and it runs swiftly, but I want to switch to Python.
The Python script I currently have is extremely slow.
Is Python that slow, or am I doing something wrong with the regular expressions?
import re
fh=open("players_list_xml.xml")
outf=open("players.txt","w")
x=""
cnt=0
while(cnt<10000):
    line=fh.readline().rstrip()
    x+=line
    mo=re.search(r"<player>(.*)</player>",x)
    while(mo):
        cnt=cnt+1
        if((cnt%1000)==0):
            print("processing",cnt)
        x=re.sub(re.escape(mo.group()),"",x)
        print("\t".join(re.findall(r"<[a-z]+>([^<]+)<[^>]+>",mo.group(1))),file=outf)
        mo=re.search(r"<player>(.*)</player>",x)
print("remainder",x)
outf.close()
fh.close()
Your regex is slow because of "backtracking" as you are using a "greedy" expression (this answer provides a simple Python example). Also, as mentioned in a comment, you should be using an XML parser to parse XML. Regex has never been very good for XML (or HTML).
In an attempt to explain why your specific expression is slow...
Let's assume you have three <player>...</player> elements in your XML. Your regex would start by matching the first opening <player> tag (that part is fine). Then (because you are using a greedy match) it would skip to the end of the document and start working backwards (backtracking) until it matched the last closing </player> tag. With a poorly written regex, it would stop there (all three elements would be in one match with all non player elements between them as well). However, that match would obviously be wrong so you make a few changes. Then the new regex would continue where the previous one left off by continuing to backtrack until it found the first closing </player> tag. Then it would continue to backtrack until it determined there were no additional </player> tags between the opening tag and the most recently found closing tag. Then it would repeat that process for the second set of tags and again for the third. All that backtracking takes a lot of time. And that is for a relatively small file. In a comment you mention your files contain "more than half a million records". Ouch! I can't imagine how long that would take. And you're actually matching all elements, not just "player" elements. Then you are running a second regex against each element to check whether they are player elements. I would never expect this to be fast.
To avoid all that backtracking, you can use a "nongreedy" or "lazy" regex. For example (greatly simplified from your code):
r"<player>(.*?)</player>"
Note that the ? indicates that the previous pattern (.*) is nongreedy. In this instance, after finding the first opening <player> tag, it would then continue to move forward through the document (not jumping to the end) until it found the first closing </player> tag and then it would be satisfied that the pattern had matched and move on to find the second occurrence (but only by searching within the document after the end of the first occurrence).
Naturally, the nongreedy expression will be much faster. In my experience, nongreedy is almost always what you want when doing * or + matches (except for the rare cases when you don't).
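A small, made-up sample shows the difference in what the two patterns match. The element names other than player are invented for the illustration.

```python
import re

xml = (
    "<player><name>Ann</name></player>"
    "<team>x</team>"
    "<player><name>Bob</name></player>"
)

# Greedy: .* runs to the end of the string and backtracks to the LAST </player>.
greedy = re.findall(r"<player>(.*)</player>", xml)

# Lazy: .*? stops at the FIRST </player> after each opening tag.
lazy = re.findall(r"<player>(.*?)</player>", xml)
```

Here greedy holds a single match that swallows everything between the first opening tag and the last closing tag, while lazy holds one entry per record.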
That said, as stated previously, an XML parser is much more suited to parsing XML. In fact, many XML parsers offer some sort of streaming API which allows you to feed the document in pieces in order to avoid loading the entire document into memory at once (regex does not offer this advantage). I'd start with lxml and then move to some of the builtin parsers if the C dependency doesn't work for you.
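As a rough sketch of the streaming idea using only the standard library (xml.etree.ElementTree.iterparse rather than lxml), something like the following emits one tab-separated line per record. The function name player_rows and the sample element names (players, name, rating) are assumptions for the example, since the real file layout isn't shown.

```python
import io
import xml.etree.ElementTree as ET


def player_rows(source):
    """Yield one tab-separated line per <player> element found in `source`."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "player":
            yield "\t".join((child.text or "").strip() for child in elem)
            # Drop the record's children once emitted to keep memory flat.
            elem.clear()


# In-memory stand-in for the real file; open("players_list_xml.xml", "rb")
# could be passed instead.
sample = io.BytesIO(
    b"<players>"
    b"<player><name>Ann</name><rating>2400</rating></player>"
    b"<player><name>Bob</name><rating>2350</rating></player>"
    b"</players>"
)
rows = list(player_rows(sample))
```

Note that the emptied player elements still hang off the root element, so for a very large file you may also want to clear the root periodically.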
With XML parser:
import xml.parsers.expat
cnt=0
state="idle"
current_key=""
current_value=""
fields=[]
def start_element(name, attrs):
    global state
    global current_key
    global current_value
    global fields
    if name=="player":
        state="player"
    elif state=="player":
        current_key=name
def end_element(name):
    global state
    global current_key
    global current_value
    global fields
    global cnt
    if state=="player":
        if name=="player":
            state="idle"
            line="\t".join(fields)
            print(line,file=outf)
            fields=[]
            cnt+=1
            if((cnt%10000)==0):
                print(cnt,"players processed")
        else:
            fields.append(current_value)
        current_key=""
        current_value=""
def char_data(data):
    global state
    global current_key
    global current_value
    if state=="player" and not current_key=="":
        # expat may deliver one text node in several chunks, so accumulate
        current_value+=data
p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
fh=open("players_list_xml.xml")
outf=open("players.txt","w")
line=True
while((cnt<1000000) and line):
    line=fh.readline().rstrip()
    p.Parse(line)
outf.close()
fh.close()
This is quite a lot of code.
At least it produces a 29 MB text file from the original XML, a size that seems right.
The speed is reasonable, though this is a simplistic version; more processing is needed on the records.
At the end of the day, it seems that a Perl script using only regexes works at the speed of a dedicated XML parser, which is remarkable.
The correct answer as everyone else has said is to use an XML parser to parse XML.
The answer to your question about why it's so much slower than your Perl version is that for some reason Python's regular expressions are just slow, much slower than Perl's at handling the same expression. I often find that code that uses regexps is more than twice as fast in Perl.

Searching a string for an exact match from a list in Python

I'm working on a project that searches specific users' Twitter streams from my followers list and retweets them. The code below works fine, except when the string appears inside a word (for instance, if the desired string was only "man" but they wrote "manager", it'd get retweeted). I'm still pretty new to Python, and my hunch is that regex will be the way to go, but my attempts have proved useless thus far.
if tweet["user"]["screen_name"] in friends:
    for phrase in list:
        if phrase in tweet["text"].lower():
            print tweet
            api.retweet(tweet["id"])
            return True
Since you only want to match whole words the easiest way to get Python to do this is to split the tweet text into a list of words and then test for the presence of each of your words using in.
There's an optimization you can use because position isn't important: by building a set from the word list you make searching much faster (technically, O(1) rather than O(n)) because of the fast hashed access used by sets and dicts (thank you Tim Peters, also author of The Zen of Python).
The full solution is:
if tweet["user"]["screen_name"] in friends:
    tweet_words = set(tweet["text"].lower().split())
    for phrase in list:
        if phrase in tweet_words:
            print tweet
            api.retweet(tweet["id"])
            return True
This is not a complete solution. Really you should be taking care of things like purging leading and trailing punctuation. You could write a function to do that, and call it with the tweet text as an argument instead of using a .split() method call.
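One possible shape for such a function, as a sketch only: the name tweet_word_set is made up, and it assumes that stripping ASCII punctuation from both ends of each whitespace-separated word is good enough. Note that this also turns "@user" and "#tag" into "user" and "tag", which may or may not be what you want.

```python
import string


def tweet_word_set(text):
    """Lower-case `text`, split on whitespace and strip surrounding punctuation."""
    stripped = (word.strip(string.punctuation) for word in text.lower().split())
    # Discard tokens that were nothing but punctuation.
    return {word for word in stripped if word}


words = tweet_word_set("Is the MAN, or the manager, in charge?")
```

With this in place, "man" in words is true because of the separate word "MAN,", while a text that only contained "manager" would not match.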
Given that optimization it occurred to me that iteration in Python could be avoided altogether if the phrases were a set also (the iteration will still happen, but at C speeds rather than Python speeds). So in the code that follows let's suppose that you have during initialization executed the code
tweet_words = set(l.lower() for l in list)
By the way, list is a terrible name for a variable, since by using it you make the Python list type unavailable under its usual name (though you can still get at it with tricks like type([])). Perhaps better to call it word_list or something else both more meaningful and not an existing name. You will have to adapt this code to your needs, it's just to give you the idea. Note that tweet_words only has to be set once.
list = ['Python', 'Perl', 'COBOL']
tweets = [
    "This vacation just isn't worth the bother",
    "Goodness me she's a great Perl programmer",
    "This one slides by under the radar",
    "I used to program COBOL but I'm all right now",
    "A visit to the doctor is not reported"
]
tweet_words = set(w.lower() for w in list)
for tweet in tweets:
    if set(tweet.lower().split()) & tweet_words:
        print(tweet)
If you want to use regexes to do this, look for a pattern that is of the form \b<string>\b. In your case this would be:
import re
pattern = re.compile(r"\bman\b")
if re.search(pattern, tweet["text"].lower()):
    pass  # do your thing
\b matches a word boundary in regex, so prefixing and suffixing your pattern with it will match the pattern only as a whole word. Hope it helps.
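If there are several phrases, one option is to combine them into a single compiled pattern. This is a sketch with a made-up helper name, build_matcher. It assumes every phrase begins and ends with a word character, since \b only behaves as expected in that case.

```python
import re


def build_matcher(phrases):
    """Compile one case-insensitive pattern matching any phrase as a whole word."""
    alternation = "|".join(re.escape(phrase) for phrase in phrases)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


matcher = build_matcher(["man", "perl"])

hit = matcher.search("I like Perl.")
miss = matcher.search("He is a manager")
```

re.escape keeps any regex metacharacters in a phrase from being interpreted, and re.IGNORECASE removes the need to lower-case the tweet text first.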

Get address out of a paragraph with regex

Alright, this one's a bit of a pain. I'm doing some scraping with Python, trying to get an address out of a few lines of poorly tagged HTML. Here's a sample of the format:
256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
I'd like to retrieve only 1234 Fake Ave S, Gotham. Any ideas? I've been doing regex's all night and now my brain is mush...
Edit:
More detail about the possible scenarios of how the data will arrive. Sometimes the first line will be there, sometimes not. All of the addresses I have seen have Ave, Way, or St in them, although I would prefer not to use that as a factor in the selection as I am not certain they will always be that way. The second and third lines are always present.
What I had in mind was something that
Selects everything on 2nd to last line (so, second line if there are three lines, first line if just two when there isn't a phone number).
Selects everything on last line that isn't in parentheses.
Combine the 2nd to last line and last line, adding a ", " in between the two.
I'm using Scrapy to acquire the HTML code. The address is all in the same div, I want to use regex to further break the data up into appropriate sections. Now how to do that is what I'm unable to figure out.
Edit2:
As per Ofir's comment, I should mention that I have already made expressions to isolate the phone number and parentheses section.
Phone (or possible email or website):
((1[-. ])?[0-9]{3}[-. ])?\(?([0-9]{3}[-. ][A?([0-9]{4})|([\w\.-]+@[\w\.-]+)|(www.+)|([\w\.-]*(?:com|net|org|us))
parentheses:
\((.*?)\)
I'm not sure how to use those to construct an everything-but-these statement.
It is possible that in your case it is easier to focus on what you don't want:
html tags (<br>)
phone numbers
everything in parentheses
Each of which can be matched easily with simple regular expressions, making it easy to construct one to match the rest (presumably - the address)
This attempts to isolate the last two lines out of the string:
>>> import re
>>> s="""256-555-5555<br/>
... 1234 Fake Ave S<br/>
... Gotham (Lower Ward)<br/>
... """
>>> m = re.search(r'((?!</br>).*)<br/>\n((?!</br>).*)<br/>$', s)
>>> print(m.group(1))
1234 Fake Ave S
Trimming the parentheses is probably best left to a separate line of code, rather than complicating the regular expression further.
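Putting the two steps together (take the last two lines, then trim the parentheses separately) might look roughly like this. The function name extract_address is made up, and the sketch assumes the lines are always separated by <br/> or <br> and that anything in parentheses can be discarded.

```python
import re


def extract_address(fragment):
    """Join the last two non-empty <br/>-separated lines, minus parenthesised text."""
    lines = re.split(r"<br\s*/?>", fragment)
    # Trim the parenthesised part in a separate step, as suggested above.
    lines = [re.sub(r"\([^)]*\)", "", line).strip() for line in lines]
    lines = [line for line in lines if line]
    return ", ".join(lines[-2:])


with_phone = "256-555-5555<br/>\n1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"
without_phone = "1234 Fake Ave S<br/>\nGotham (Lower Ward)<br/>\n"

address = extract_address(with_phone)
```

Because it counts lines from the end, the result is the same whether or not the optional first line is present.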
As far as I understand your problem, I think you are going about it the wrong way.
Regexes are not a magical tool that can extract pertinent data from a jumble of undifferentiated text. A regex can only extract data from text that has variable parts but also a minimum of stable structure, which acts as anchors relative to which the variable parts can be located.
In your processing, it seems to me that you first isolated the part containing a possible phone number followed by an address on one or two lines. But in doing so, you lost information: what comes before and after is anchoring information, and you shouldn't try to find something in the section that remains after eliminating it.
Moreover, I presume that you don't only want to catch a phone number and an address: you may want to extract other pieces of information lying before and after this section. With a well-shaped regex, you could capture all the pieces in one shot.
So, please, give more of the text, with enough characters before and after the limited section to allow writing a correct and easier regex strategy to catch all the data you want. triplee has already asked you for that, and you didn't provide it. Why?

regex regarding symbols in urls

I want to collapse consecutive repeated symbols into just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
However, I notice that this might also replace symbols in URLs that occur in my text,
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
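One way around the unmatched-group complaint is to pass a replacement function instead of a template string, so that only the group that actually took part in the match is used. The following is a sketch along those lines, not a tested drop-in: collapse_repeats is a made-up name, the URL pattern is the same heuristic as above, and the {0,100} workaround is kept as is.

```python
import re

PATTERN = re.compile(
    r"""(?ix)                      # case-insensitive, verbose regex
    # Either match a URL (kept unchanged)...
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
        [-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])
    |
    # ...or a run of the same non-word character (collapsed to one)
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+
    """
)


def collapse_repeats(text):
    """Collapse repeated symbols outside URLs (hypothetical helper)."""
    def replacement(match):
        url = match.group("URL")
        # Exactly one of the two named groups takes part in any match.
        return url if url is not None else match.group("rpt")
    return PATTERN.sub(replacement, text)


result = collapse_repeats("see http://example.com/this--is-a-page.html !!")
```

The URL alternative comes first, so the "--" inside the address is consumed as part of the URL before the repeated-symbol alternative gets a chance to see it.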
