I am working on some customer comments that some of them did not follow grammatical rules. For Example (Such as s and b.) in the following text that provides more explanation for previous sentence is surrounded by two dots.
text = "I was initially scared of ANY drug after my experience. But about a year later I tried. (Such as s and b.). I had a very bad reaction to this."
First, I want to find . (Such as s and b.). and then replace the dot before (Such as s and b.) to space. This is my code, but it does not work.
text = re.sub (r'(\.)(\s+?\(.+\)\s*\.)', r' \2 ', text )
Output should be:
"I was initially scared of ANY drug after my experience. But about a year later I tried (Such as s and b.). I had a very bad reaction to this."
I am using python.
The sample provided does not make much sense because the only change is that the ` character is moved one position to the left.
However, this might do the trick (to keep the dot inside the paranthesis):
text = re.sub(r'\.\s*\)\s*\.', '.)', text)
Or this to have it outside:
text = re.sub(r'\.\s*\)\s*\.', ').', text)
Edit: Or maybe you're looking for this to replace the dot before the opening paranthesis?
text = re.sub(r'\.(?=\s*\(.*?\)\.)', ').', text)
I would suggest this to remove a dot before parentheses when there is another one following them:
text = re.sub(r'\.(\s*?\([^)]*\)\s*\.)', r'\1', text)
See it run on repl.it
Related
I'm a begginer in Python and one of the first codes I've made it's an RPG, so there's a lot of texts in strings being printed. Before I learned how to "word wrap", I used to test every string and put an "\n" in the right places, so it could be better to read the history in the console.
But now I don't need those "\n" anymore, and it's been really laborious to replace each one of them using the Replace Dialog of Python IDLE. One of the problems is that I want to ignore double new lines ("\n\n"), because they do make the texts more presentable.
So if I just search "\n" he finds it, but I want to ignore all the "\n\n".
I tried using the "Regular expression" option and did a research with regex but with no success, since I'm completly new in this area. Tried some things like "^\n$" because, if I understood it right, the ^ and the $ delimit the search to what's between them.
I think it's clear what I need, but will write an example anyways:
print("Here's the narrator telling some things to the player. Of course I could do some things but\nnow it's time to ask for help!\n\nProbably it's a simple thing, but it's been lots of time in research and no\nsuccess...")
I want to find and replace those two "\n" with one empty space (" ") and totally ignore the "\n\n".
Can you guys help? Thanks in advance.
You need
re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
See the regex demo.
Details
(?<!\n) - no newline allowed immediately on the left
\n - a newline
(?!\n) - no newline allowed immediately on the right
See Python demo:
import re
text = "Here's the narrator telling some things to the player. Of course I could do some things but\nnow it's time to ask for help!\n\nProbably it's a simple thing, but it's been lots of time in research and no\nsuccess..."
print(re.sub(r'(?<!\n)\n(?!\n)', ' ', text))
Output:
Here's the narrator telling some things to the player. Of course I could do some things but now it's time to ask for help!
Probably it's a simple thing, but it's been lots of time in research and no success...
I am trying to clean conversational text from a StackExchange corpus which contains sentences which may have Latex expressions inside. Latex expressions are delimited by the $ sign: For instance $y = ax + b$
Here is a line of example text from the data containing multiple Latex expressions:
#Gruber - this is another example, when applied like so: $\mathrm{Var} \left(X^2\right) = 4 X^2 \mathrm{Var} (X)$ doesn't make any sense, on the left side you have a constant and on the right a random variable. Did you mean $4E(X)^2 Var(X)$ bless those that take the road less travelled. Another exception in your theory is $4E(X)^2 Var(X)$. What were you thinking? :)
Here is what I have so far: It seems to clobber text between each Latex Expression match and gives one huge match which is incorrect.
([\$](.*)[\$]){1,3}?
I don't understand why you put {1,3} at the end, what goal did you try to achieve. Anyway, your mistake is that you use [\$], which gives you a set of two characters - a backslash and a dollar. I suggest you use
\$([^$]*)\$
and replace it with an empty string: demo here
Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase) #output is ####
j = input('> ') ###whatever##
if keyphrase in j:
print('yay')
else:
print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k) #whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
whatever
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) We state the string constant head and tail for the capture group 1 between the brackets (). Great, almost there! (2) We then match any character .*? with greedy search enforced so that we capture the whole string.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
keyphrase_content = keyphrase_match.group(1)
print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
# Bonus tip: Use double quotes to make a string containing apostrophe
# without using a backslash escape
print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginners guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time
The regex ##(.*)## basically means "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character. If you find another '#', stop capturing characters. If that '#' is followed by another one, stop looking at the string, return successfully, and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them.
EDIT: Props to #ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(input):
return keyphrase_regex.find_all(input)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_at_sign = re.compile(r'##(.*?)##')
def parse_keyphrases(input, keyphrase_regex=keyphrase_double_at_sign):
return keyphrase_regex.find_all(input)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that RE will never match.
What you want to do instead is something like
re.match('##\W+##', j)
which will look for 2 leading ##s, then any number greater than 1 alphanumeric characters (\W+), then 2 trailing ##s. From there, your strip code looks fine and you should be able to grab it.
I'm trying to do a python regular expression that looks for lines formatted as such ([edit:] without new lines; the original is all on one line):
<MediaLine Label="main-video" xmlns="ms-rtcp-metrics">
<OtherTags...></OtherTags>
</MediaLine>
I wish to create a capture group of the body of this XML element (so the OtherTags...) for later processing.
Now the problem lies in the first line, where Label="main-video", and I would like to not capture Label="main-audio"
My initial solution is as such:
m = re.search(r'<MediaLine(.*?)</MediaLine>', line)
This works, in that it filters out all other non-MediaLine elements, but doesn't account for video vs audio. So to build on it, I try simply adding
m = re.search(r'<MediaLine Label(.*?)</MediaLine>', line)
but this won't create a single match, let alone being specific enough to filter audio/video. My problem seems to come down to the space between line and Label. The two variations I can think of trying both fail:
m = re.search(r'<MediaLine L(.*?)</MediaLine>', line)
m = re.search(r'<MediaLine\sL(.*?)</MediaLine>', line)
However, the following works, without being able to distinguish audio/video:
m = re.search(r'<MediaLine\s(.*?)</MediaLine>', line)
Why is the 'L' the point of failure? Where am I going wrong? Thanks for any help.
And to add to this preemptively, my goal is an expression like this:
m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*?)</MediaLine>", line)
result = m.group('payload')
By default, . doesn’t match a newline, so your initial solution didn't work either. To make . match a newline, you need to use the re.DOTALL flag (aka re.S):
>>> m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*)</MediaLine>", line, re.DOTALL)
>>> m.group('payload')
'\n <OtherTags...></OtherTags>\n'
Notice there’s also an extra ? in the first group, so that it’s not greedy.
As another comment observes, the best thing to parse XML is an XML parser. But if your particular XML is sufficiently strict in the tags and attributes that it has, then a regular expression can get the job done. It will just be messier.
I'm trying to write code that parses a large text file. However, in order to get said text file, I run the original PDF file through pdfminer. While this works, it also returns my text file with many random spaces (see below)
SM ITH , JO HN , PHD
1234 S N O RT H AV E
Is there any easy way in Python to remove only certain spaces so words aren't separated? For the sample above, I want it to look like
SMITH, JOHN, PHD
1234 S NORTH AVE
Thanks.
Most likely what you're trying to do is impossible to do perfectly, and very hard to do well enough to satisfy you. I'll explain below.
But there's a good chance you shouldn't be doing it in the first place. pdfminer is highly configurable, and something like just specifying a smaller -M value will give you the text you wanted in the first place. You'll need to do a bit of trial and error, but if this works, it'll be far easier than trying to post-process things after the fact.
If you want to do this, you need to come up with a rule that determines which spaces are "random extra spaces" and which are real spaces before you can code that in Python. And I don't know that there is any such rule.
In your example, you can handle most of them by just turning multiple spaces into single spaces, and single spaces into nothing. It should be obvious how to do that. Even if you can't think of a clever solution, a triple replace works fine:
s = re.sub(r'\s\s+', r'<space>', s)
s = re.sub(r'\s', r'', s)
s = re.sub(r'<space>', r' ', s)
However, this rule isn't quite right, because in JO HN , PHD, the space after the comma isn't a random extra space, but it's not showing up as two or more spaces. And the same for the space in "1234 S". And, most likely, the same thing is true in lots of other cases for your real data.
A different somewhat close rule is that you only remove single spaces between letters. Again, if that works, it's easy to code. For example:
s = re.sub(r'(\w)\s(\w)', r'\1\2', s)
s = re.sub(r'\s+', r' ', s)
But now that leaves a space before the comma after SMITH and JOHN.
Maybe you need to put in a little information about English punctuation—strip the spaces around punctuation, then add back in the spaces after a comma or period, around quotes, etc.
Or… well, nobody but you can know what your data look like and figure it out.
If you can't come up with a good rule, the only option is to build some complicated heuristics around looking up possible words in a dictionary and guessing which one is more likely—which still won't get everything right (e.g., how do you know whether "B OO K M AR K" is "BOOK MARK" or "BOOKMARK"?), but it's the best you could possibly do.
What you are trying to do is impossible, e.g., should "DESK TOP" be "DESK TOP" or "DESKTOP"?