Replacing strings in a text and ignoring certain parts

Replacing strings in a text and ignoring certain parts - python

I found many programs online to replace text in a string or file with words prescribed in a dictionary. For example, https://www.daniweb.com/programming/software-development/code/216636/multiple-word-replace-in-text-python
But I was wondering how to get the program to ignore certain parts of the text. For instance, I would like it to ignore parts that are ensconced within say % signs (%Please ignore this%). Better still, how do I get it to ignore the text within but remove the % sign at the end of the run.
Thank you.

This could very easily be done with regular expressions, although they may not be supported by any online programs you find. You will probably need to write something yourself and then use regex as your dict's search key's.
Good place to start playing around with regex is: http://regexr.com

Well in the replacing dictionary just have any word you want to be ignored such as teh be replaced with the but %teh% be replaced with teh. For the program in the link you could have
wordDic = {
'booster': 'rooster',
'%booster%': 'booster'
}

Related

How to use re.search to find multiple strings?

I'm trying to check if a certain line in many different text documents equals one of many different strings. To goal here is to classify those documents and then parse them according to that classification.
In my text editor I can use regex to search for:
(?:^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)
But if I try to use this in an if statement in python I keep getting syntax errors for any way I tried it.
My current version is this:
if re.search('(^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)', article_lines[5].lower()replace('´','')):
no_author = True
I saw that a possible solution is using a for loop and putting the different strings into a list, but as this would require some extra steps I'd prefer to do it as I tried if possible.

You should include what the error is. Your problem is probably just a typo:
if re.search('(^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)', article_lines[5].lower().replace('´','')):
no_author = True

How to select an entire entity around a regex without splitting the string first?

My project (unrelated to this question, just context) is a ML classifier, I'm trying to improve it and have found that when I stripped URLS from the text given to it, some of the URLS have been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with thinks like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )','',tweet);
I've put a space after every one of them because this happens after the regular strip and text processing (so only working with parts of a URL separated by spaces) and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect but it works, however I just thought that I might try to preemptively remove URLS from twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier accuracy? This will get rid of the string of characters that occurs after a link... specifically pictures, which is a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.

Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I had to escape slashes / using backslashes, as required by Regex specifications; I also used the Kleene star operator, i.e. the asterisk, to match any occurrence of . (any character), just like you did.
What a positive lookbehind combination does is specifying some mandatory text before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if the said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. This might be an additional requirement, since lookbehinds do not work with variable lengths.
So you can just skip www.twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?

Python - Injecting html tags into strings based on regex match

I wrote a script in Python for custom HTML page that finds a word within a string/line and highlights just that word with use of following tags where instance is the word that is searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
Problem with my approach, is that I am attempting to itterate over each of instances found with findall (exact match),
while multiple same matches can also be found within the string.
for instance in find_all_instances:
second_pattern = re.compile(instance)
string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone have a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches(case insensitive), then I would be grateful.

Maybe I'm misinterpretting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs

Okay so two ways I did quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). It replaces with the lowercase search term bear in mind.
import re
FILE = open("testing.txt","r")
word="port"
#THIS LOOP IS CASE SENSITIVE
for line in FILE:
newline=line.replace(word,"<b><font color=\"red\">"+word+"</font></b>")
print newline
#THIS LOOP IS INCASESENSITIVE
for line in FILE:
pattern=re.compile(word,re.IGNORECASE)
newline = pattern.sub("<b><font color=\"red\">"+word+"</font></b>",line)
print newline

Regular expression how to get middle strings

I want to search for string that occurs between a certain string. For example,
\start
\problem{number}
\subproblem{number}
/* strings that I want to get */
\subproblem{number}
/* strings that I want to get */
\problem{number}
\subproblem{number}
...
...
\end
More specifically, I want to get problem number and subproblem number and strings between which is answer.
I somewhat came up with expression like
'(\\problem{(.*?)}\n)? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
but it seems like it doesn't work as I expect. What is wrong with this expression?

This one:
(?:\\problem\{(.*?)\}\n)?\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
returns three matches for me:
Match 1:
group 1: "number"
group 2: "number"
group 3: "/* strings that I want to get */"
Match 2:
group 1: null
group 2: "number"
group 3: "/* strings that I want to get */"
Match 3:
group 1: "number"
group 2: "number"
group 3: " ...\n ..."
However I'd rather parse it in two steps.
First find the problem's number (group 1) and content (group 2) using:
\\problem\{(.*?)\}\n(.+?)\\end
Then find the subproblem's numbers (group 1) and contents (group 2) inside that content using:
\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)

TeX is pretty complicated and I'm not sure how I feel about parsing it using regular expressions.
That said, your regular expression has two issues:
You're using a space character where you should just consume all whitespace
You need to use a lookahead assertion for your final group so that it doesn't get eaten up (because you need to match it at the beginning of the regex the next time around)
Give this a try:
>>> v
'\\start\n\n\\problem{number}\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\problem{number}\n\\subproblem{number}\n ...\n ...\n\\end\n'
>>> re.findall(r'(?:\\problem{(.*?)})?\s*\\subproblem{(.*?)}\s*(.*?)\s*(?=\\problem{|\\subproblem{|\\end)', v, re.DOTALL)
[('number', 'number', '/* strings that I want to get */'), ('', 'number', '/* strings that I want to get */'), ('number', 'number', '...\n ...')]

If the question really is "What is wrong with this expression?", here's the answer:
You're trying to match newlines with a .*?. You need (?s) for that to work.
You have explicit spaces and newlines in the middle of the regex that don't have any corresponding characters in the source text. You need (?x) for that to work.
That may not be all that's wrong with the expression. But just adding (?sx), turning it into a raw string (because I don't trust myself to mix Python quoting and regex quoting properly), and removing the \n gives me this:
r'(?sx)(\\problem{(.*?)}? \\subproblem{(.*?)} (.*?)) (\\problem|\\subproblem|\\end)'
That returns 2 matches instead of 0, and it's probably the smallest change to your regex that works.
However, if the question is "How can I parse this?", rather than "What's wrong with my existing attempt?", I think impl's solution makes more sense (and I also agree with the point about using regex to parse TeX being usually a bad idea)—-or, even better, doing it in two steps as Regexident does.
if using regex to parse TeX is not good idea, then what method would you suggest to parse TeX?
First of all, as a general rule of thumb, if I can't write the regex to solve a problem by myself, I don't want to solve it with a regex, because I'll have a hard time figuring it out a few months from now. Sometimes I break it down into subexpressions, or use (?x) and load it up with comments, but usually I look for another way.
More importantly, if you have a real parser that can consume your language and give you a tree (or whatever's appropriate) that you can walk and search—as with, e.g. etree for XML—then you've got 90% of a solution for every problem you're going to come up with in dealing with that language. A quick&dirty regex (especially one you can't write on your own) only gets you 10% of the way to solving the next problem. And more often than not, if I've got a problem today, I'm going to have more of them in the next few months.
So, what's a good parser for TeX in Python? Honestly, I don't know. I know scipy/matplotlib has something that does it, so I'd probably look there first. Beyond that, check Google, PyPI, and maybe tex.stackexchange.com. The first things that turn up in a search are Texcaller and plasTeX. I have no idea how good they are, or if they're appropriate for your use case, but it shouldn't take long to skim the tutorials and find out.
If it turns out that there's nothing out there, and it comes down to writing something myself with, e.g., pyparsing vs. regexes, then it's a tougher choice. Some languages, it's very easy to define just the subset you care about and leave the rest as giant uninterpreted tokens, in which case a real parser will be just as easy as a regex, so you might as well go that way. Other languages, you have to handle half the syntax before you can do anything useful, so I wouldn't even try. I'd have to put a bit of time into thinking about it and experimenting both ways before deciding which way to go.

regex regarding symbols in urls

I want to replace consecutive symbols just one such as;
this is a dog???
to
this is a dog?
I'm using
str = re.sub("([^\s\w])(\s*\1)+", "\\1",str)
however I notice that this might replace symbols in urls that might happen in my text.
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?

So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind.
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
r"""(?ix) # case-insensitive, verbose regex
# Either match a URL
# (protocol optional (if so, URL needs to start with www or ftp))
(?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
# or
|
# match repeated non-word characters
(?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
# and replace with both captured groups (one will always be empty)
r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.