Python Regular Expression Clobbering Text Between Multiple Latex Expression Matches - python

I am trying to clean conversational text from a StackExchange corpus. The sentences may contain LaTeX expressions, which are delimited by the $ sign, for instance $y = ax + b$.
Here is a line of example text from the data containing multiple Latex expressions:
#Gruber - this is another example, when applied like so: $\mathrm{Var} \left(X^2\right) = 4 X^2 \mathrm{Var} (X)$ doesn't make any sense, on the left side you have a constant and on the right a random variable. Did you mean $4E(X)^2 Var(X)$ bless those that take the road less travelled. Another exception in your theory is $4E(X)^2 Var(X)$. What were you thinking? :)
Here is what I have so far: It seems to clobber text between each Latex Expression match and gives one huge match which is incorrect.
([\$](.*)[\$]){1,3}?

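I don't understand why you put {1,3} at the end; what were you trying to achieve? Anyway, the real problem is the greedy (.*): it runs from the first $ in the line to the last one, so the text between the expressions is swallowed into one huge match. (The [\$] itself is harmless: inside a character class it matches just a dollar sign, although a plain \$ is simpler.) I suggest you use
\$([^$]*)\$
and replace each match with an empty string.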
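To see the difference between the greedy pattern and the negated character class, here is a minimal sketch using Python's re module. The sample sentence is shortened from the one in the question, and the variable names are only illustrative.

```python
import re

text = ("Did you mean $4E(X)^2 Var(X)$ or not? "
        "Another exception in your theory is $y = ax + b$. What were you thinking?")

# Greedy .* runs from the first $ to the last $, swallowing the prose in between.
greedy = re.findall(r"\$.*\$", text)

# [^$]* cannot cross a dollar sign, so each expression is matched on its own.
LATEX_INLINE = re.compile(r"\$[^$]*\$")
separate = LATEX_INLINE.findall(text)

# Remove every LaTeX expression from the text.
cleaned = LATEX_INLINE.sub("", text)

print(len(greedy), separate)
print(cleaned)
```

Removing a match leaves a double space behind, so you may want to collapse whitespace afterwards.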

Related

Modify the data found between two recurring patterns in a multi-line string

I have a multi-line string of around 10,000-40,000 characters (the length changes with the data returned by an API). In this string, there are a number of tables (they are a part of the string, but formatted in a way that makes them look like a table). The tables are always in a repeating pattern. The pattern looks like this:
==============================================================================
*THE HEADINGS/COLUMN NAMES IN THE TABLE*
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
I'm trying to display the contents in html on a locally hosted webpage, and I want to have the heading of the tables displayed in a specific way (think color, font size). For that, I'm using the python regex module to identify the pattern, but I'm failing to do so due to inexperience in using the re module. To modify the part that I need modified, I'm using the below piece of code:
re.sub(r'\={78}.*\-{78}',some_replacement_string, complete_multi_line_string)
But the above piece of code is not giving me the output I require, since it is not matching the pattern properly(I'm sure the mistake is in the pattern I'm asking re.sub to match)
However:
re.sub(r'\-{78}',some_replacement_string, complete_multi_line_string)
is working as it's returning the string with the replacement, but the slight problem here is that there are multiple ------------------------------------------------------------------------------s in the code that I do not want modified. Please help me out here. If it is helpful, the output that I'm wanting is something like:
==============================================================================
<span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
Also, please note that there are newlines (\n) after each ============================================================================== line, each <span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span> line and each ------------------------------------------------------------------------------ line, if that is helpful in getting to the solution.
The code snippet I'm currently trying to debug, if helpful:
result = re.sub(r'\={78}.*\-{78}', replacement, multi_line_string)
l = result.count('</span>')
print(l)
PS: There are 78 = and 78 - in all the occurrences.
You should try using the following version:
re.sub(r'(={78})\n(.*?)\n(-{78})', r'\1\n<span>\2</span>\n\3', complete_multi_line_string, flags=re.S)
The changes I made here include:
Match on lazy dot .*? instead of greedy dot .*, to ensure that we don't match across header sections
Match with the re.S flag, so that .*? will match across newlines
Write the newlines back in the replacement, since the pattern consumes the \n on either side of the heading
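As a quick check of this approach, here is a small self-contained sketch. The table contents are made up for illustration, and the replacement writes the two newlines back explicitly because the pattern consumes them.

```python
import re

RULE_EQ = "=" * 78
RULE_DASH = "-" * 78

# Hypothetical sample: one table with a heading, plus a second dashed rule
# further down that must stay untouched.
report = "\n".join([
    "intro text",
    RULE_EQ,
    "NAME   AGE",
    RULE_DASH,
    "alice  30",
    RULE_DASH,
    "footer",
])

# Lazy .*? stops at the first dashed rule after each line of equals signs.
heading = re.compile(r"(={78})\n(.*?)\n(-{78})", flags=re.S)
result = heading.sub(r"\1\n<span>\2</span>\n\3", report)

print(result)
print(result.count("</span>"))
```

Only the heading between the = rule and the first - rule is wrapped; the later dashed rule is left as it was.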

Regular Expression to replace dot with space before parentheses

I am working on customer comments, some of which do not follow grammatical rules. For example, in the following text, (Such as s and b.), which provides more explanation for the previous sentence, is surrounded by two dots.
text = "I was initially scared of ANY drug after my experience. But about a year later I tried. (Such as s and b.). I had a very bad reaction to this."
First, I want to find . (Such as s and b.). and then replace the dot before (Such as s and b.) with a space. This is my code, but it does not work.
text = re.sub (r'(\.)(\s+?\(.+\)\s*\.)', r' \2 ', text )
Output should be:
"I was initially scared of ANY drug after my experience. But about a year later I tried (Such as s and b.). I had a very bad reaction to this."
I am using python.
The sample provided does not make much sense, because the only change is that the dot before the opening parenthesis is removed rather than replaced by a space.
However, this might do the trick (to keep the dot inside the parenthesis):
text = re.sub(r'\.\s*\)\s*\.', '.)', text)
Or this to have it outside:
text = re.sub(r'\.\s*\)\s*\.', ').', text)
Edit: Or maybe you're looking for this to remove the dot before the opening parenthesis?
text = re.sub(r'\.(?=\s*\(.*?\)\.)', '', text)
I would suggest this to remove a dot before parentheses when there is another one following them:
text = re.sub(r'\.(\s*?\([^)]*\)\s*\.)', r'\1', text)
See it run on repl.it
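For reference, here is a minimal runnable sketch of that last substitution applied to the text from the question. The variable names fixed and expected are only illustrative.

```python
import re

text = ("I was initially scared of ANY drug after my experience. "
        "But about a year later I tried. (Such as s and b.). "
        "I had a very bad reaction to this.")

# Drop a dot only when it is followed by a parenthesised group that is
# itself followed by another dot; the group is kept via \1.
fixed = re.sub(r"\.(\s*\([^)]*\)\s*\.)", r"\1", text)

expected = ("I was initially scared of ANY drug after my experience. "
            "But about a year later I tried (Such as s and b.). "
            "I had a very bad reaction to this.")
print(fixed == expected)
```

The other dots in the text are left alone because none of them is followed by an opening parenthesis.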

Python Regex Problems with Whitespace

I'm trying to do a python regular expression that looks for lines formatted as such ([edit:] without new lines; the original is all on one line):
<MediaLine Label="main-video" xmlns="ms-rtcp-metrics">
<OtherTags...></OtherTags>
</MediaLine>
I wish to create a capture group of the body of this XML element (so the OtherTags...) for later processing.
Now the problem lies in the first line, where Label="main-video", and I would like to not capture Label="main-audio"
My initial solution is as such:
m = re.search(r'<MediaLine(.*?)</MediaLine>', line)
This works, in that it filters out all other non-MediaLine elements, but doesn't account for video vs audio. So to build on it, I try simply adding
m = re.search(r'<MediaLine Label(.*?)</MediaLine>', line)
but this won't create a single match, let alone being specific enough to filter audio/video. My problem seems to come down to the space between line and Label. The two variations I can think of trying both fail:
m = re.search(r'<MediaLine L(.*?)</MediaLine>', line)
m = re.search(r'<MediaLine\sL(.*?)</MediaLine>', line)
However, the following works, without being able to distinguish audio/video:
m = re.search(r'<MediaLine\s(.*?)</MediaLine>', line)
Why is the 'L' the point of failure? Where am I going wrong? Thanks for any help.
And to add to this preemptively, my goal is an expression like this:
m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*?)</MediaLine>", line)
result = m.group('payload')
By default, . doesn’t match a newline, so your initial solution didn't work either. To make . match a newline, you need to use the re.DOTALL flag (aka re.S):
>>> m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*?)</MediaLine>", line, re.DOTALL)
>>> m.group('payload')
'\n <OtherTags...></OtherTags>\n'
Notice there’s also an extra ? in the first group, so that it’s not greedy.
As another comment observes, the best thing to parse XML is an XML parser. But if your particular XML is sufficiently strict in the tags and attributes that it has, then a regular expression can get the job done. It will just be messier.
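If a parser is an option, a minimal sketch with the standard library's xml.etree.ElementTree could look like this. The wrapping Report element and the child contents are made up for illustration; only the MediaLine element, its Label attribute and its xmlns value come from the question.

```python
import xml.etree.ElementTree as ET

NS = "ms-rtcp-metrics"  # default namespace declared on each MediaLine

# Hypothetical document: one audio and one video MediaLine inside a root.
doc = (
    '<Report>'
    '<MediaLine Label="main-audio" xmlns="ms-rtcp-metrics"><Codec>x</Codec></MediaLine>'
    '<MediaLine Label="main-video" xmlns="ms-rtcp-metrics"><OtherTags>payload</OtherTags></MediaLine>'
    '</Report>'
)

root = ET.fromstring(doc)

# ElementTree spells namespaced tags as "{namespace}tag".
video_lines = [
    el for el in root.iter("{" + NS + "}MediaLine")
    if el.get("Label") == "main-video"
]

# The children of the matching element are the body to process later.
child_tags = [child.tag for child in video_lines[0]]
print(child_tags)
```

Filtering on the attribute value avoids the audio/video ambiguity without any care about whitespace or attribute order.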

Pyparsing delimited list only returns first element

Here is my code :
l = "1.3E-2 2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser,delim='\t ')
print(grammar.parseString(l))
It returns :
['1.3E-2']
Obviously, I want both values, not a single one. Any idea what is going on?
As @dawg explains, delimitedList is intended for cases where you have an expression with separating non-whitespace delimiters, typically commas. Pyparsing implicitly skips over whitespace, so in the pyparsing world, what you are really seeing is not a delimitedList, but OneOrMore(realnumber). Also, parseString internally calls str.expandtabs on the provided input string, unless you have called parseWithTabs() on the parser beforehand. Expanding tabs to spaces helps preserve columnar alignment of data when it is in tabular form, and when I originally wrote pyparsing, this was a prevalent use case.
If you have control over this data, then you might want to use a different delimiter than <TAB>, perhaps commas or semicolons. If you are stuck with this format, but determined to use pyparsing, then use OneOrMore.
As you move forward, you will also want to be more precise about the expressions you define and the variable names that you use. The name "parser" is not very informative, and the pattern of Word(alphanums+'+-.') will match a lot of things besides valid real values in scientific notation. I understand if you are just trying to get anything working, this is a reasonable first cut, and you can come back and tune it once you get something going. If in fact you are going to be parsing real numbers, here is an expression that might be useful:
realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
Then you can define your grammar as "OneOrMore(realnum)", which is also a lot more self-explanatory. And the parse action will convert your strings to floats at parse time, which will save you a step later when actually working with the parsed values.
Good luck!
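For a quick check of the real-number pattern without pyparsing, the same regular expression can be exercised with the standard re module. This is only a sketch of the tokenizing idea, not a substitute for the OneOrMore(realnum) grammar, and parse_reals is a hypothetical helper name.

```python
import re

# Same pattern as in the Regex(...) expression above.
REALNUM = re.compile(r"[+-]?\d+\.\d*([eE][+-]?\d+)?")

def parse_reals(s):
    """Return every real number found in s, converted to float."""
    # group(0) is the whole match; the exponent group is optional.
    return [float(m.group(0)) for m in REALNUM.finditer(s)]

values = parse_reals("1.3E-2\t2.5E+1")
print(values)
```

Note that the pattern requires a decimal point, so a bare integer such as 3 is not picked up.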
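It works if you switch to raw strings, but note that the input then contains a literal backslash followed by t rather than a tab character: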
l = r"1.3E-2\t2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim=r'\t')
print(grammar.parseString(l))
Prints:
['1.3E-2', '2.5E+1']
In general, delimitedList works with something like PDPDP where P is the parse target and D is the delimiter or delimiting sequence.
You have delim='\t '. That specifically is a delimiter of 1 tab followed by 1 space; it is not either tab or space.

Regexp to match chords, issue with national accents

I am dealing with the following problem. I have a *.txt file containing dozens of songs. Each song might consist of
name
lines with chords
lines with lyrics
blank lines
I'm writing a Python script that reads the file line by line. I need to recognise the lines with chords. For that purpose I have decided to use regular expressions, since they look like a playful but powerful tool for such tasks. I am new to regexp; I've done this tutorial (which I am rather fond of). I have written something like this
\b ?\(?([AC-Hac-h]{1})(#|##|b|bb)?(is|mi|maj|sus)?\d?[ \n(/\(\))?]
I am not very happy with that, since it does not do the job properly. One of the problems is that the language of the songs uses a lot of accents. The second one: the chords might come in pairs - e.g. C(D), h/e. You can see my approach here.
Note
For better readability in final script I would split the regexp into more variables and those then add together.
Edit
After rereading my question I thought that my goal might not be clear enough. I would like to match different types of chords, for instance:
C, C#, Cis, c#, Cmaj, Cmi, Csus, C7, C#7, Db, Dbsus
Also, sometimes there might be (no more than two) chords next to each other, such as C7/D7 or Cmi(a). The best solution would be to catch such "pairs" together in one match, that is, to match C7/D7 rather than C7 and D7. I think this additional condition would make the solution a bit more robust, but if it turns out to be unnecessarily difficult, I might go with the (I assume) easier version (matching C7 and D7 instead of C7/D7) and deal with the pairs separately later.
Your Python script reads the text file line by line and you want to find out with a regular expression if the current line is a line with chords or with other information.
Perhaps it is enough to apply the regular expression ^[\t #()/\dAC-Ha-jmsu]+$ on each line. If the regular expression does not return a match, the line contains characters that are not allowed in a line with chords. Perhaps this simple regular expression using only a single character class definition is enough.
But it could be that a line with a name or lyrics also matches the expression above. For your example this is not the case, but it could be. In such a case I would suggest first calling strip() on every line to remove spaces and tabs from the beginning and end of the line, and then applying the following regular expression
^(?:[#()/\dAC-Ha-jmsu]{1,6}[\t ]+?)*[#()/\dAC-Ha-jmsu]{1,6}$
The difference is that now each string not containing a space or tab character must have a length between 1 and 6. Longer strings are not allowed. With this additional rule there may be no more false positives when detecting lines with chords.
The weak point of the chord-line rule is definitely the letters, since a name or a lyric line consisting only of letters allowed in chords would match too. A solution would be to create a list of the strings that are valid chords and to combine them in an OR expression. That would most likely avoid false positives from names or lyrics. With the complete list of chord strings at hand, it is most likely also possible to write the rule more compactly, without listing every chord string in the OR expression.
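One way to sketch that stricter rule in Python is to describe a single chord structurally and build the line pattern from it. The chord vocabulary here (root, optional #, ##, b, bb or is, optional mi, maj or sus, optional digit) is an assumption taken from the examples in the question, and is_chord_line is a hypothetical helper name.

```python
import re

# One chord: root, optional accidental, optional quality, optional digit.
CHORD = r"[AC-Hac-h](?:##?|bb?|is)?(?:mi|maj|sus)?\d?"

# A token is one chord, optionally followed by a second one as /X or (X),
# so that pairs such as C7/D7 or Cmi(a) are matched as a single unit.
TOKEN = CHORD + r"(?:/" + CHORD + r"|\(" + CHORD + r"\))?"

# A chord line is one or more tokens separated by spaces or tabs.
CHORD_LINE = re.compile(r"(?:" + TOKEN + r"[\t ]+)*" + TOKEN)

def is_chord_line(line):
    """Return True if the stripped line consists only of chord tokens."""
    stripped = line.strip()
    return bool(stripped) and CHORD_LINE.fullmatch(stripped) is not None

print(is_chord_line("C  C#7  Dbsus  C7/D7  Cmi(a)"))
print(is_chord_line("Ahoj, to jsem ja"))
```

A lyric line that happens to be a single valid chord name (for example a lone "a") would still pass, so very short lines may need an extra check.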
