Regex Python adding characters after a certain word - python

I have a text file, and every time the word "get" occurs I need to insert a # sign after it.
In Python, how do I add a character after a specific word using regex? Right now I am parsing the line word by word, and I don't understand regex well enough to write the code.

Use re.sub() to provide replacements, using a backreference to re-use matched text:
import re
text = re.sub(r'(get)', r'\1#', text)
The (...) parentheses mark a group, which \1 refers to when specifying a replacement. So get is replaced by get#.
Demo:
>>> import re
>>> text = 'Do you get it yet?'
>>> re.sub(r'(get)', r'\1#', text)
'Do you get# it yet?'
The pattern will match get anywhere in the string; if you need to limit it to whole words, add \b anchors:
text = re.sub(r'(\bget\b)', r'\1#', text)
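To make the difference between the two patterns concrete, here is a minimal sketch. The helper name mark_word and the sample sentence are made up for illustration; they are not part of the original answer. The helper uses a function as the replacement so the marker is inserted literally, whatever characters it contains.

```python
import re

def mark_word(text, word="get", marker="#"):
    """Append `marker` after every whole-word occurrence of `word`."""
    # re.escape keeps the word literal; \b restricts matches to whole words.
    pattern = r"\b(" + re.escape(word) + r")\b"
    # A function replacement inserts the marker as-is (no backreference parsing).
    return re.sub(pattern, lambda m: m.group(1) + marker, text)

sample = "Do you get it? Don't forget to get together."

# Without \b the pattern also fires inside "forget" and "together".
print(re.sub(r"(get)", r"\1#", sample))
# Do you get# it? Don't forget# to get# toget#her.

# With \b only the standalone word is marked.
print(mark_word(sample))
# Do you get# it? Don't forget to get# together.
```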

Related

Regexp, find in a procedure last "end" to replace with another word

I have been fixing some mistakes across all procedures. Now I need to find the last "end;" in a procedure and replace it with other text.
I wrote: (\s.*)(end|END)(.*(;).*)
But it does not work correctly: it also replaces some words in the middle of the text. I am using the re library from Python.
You can use
result = re.sub(r'(?si)(.*)\bend\b', r'\g<1>some other word', text)
The regex matches
(?si) - an inline re.DOTALL (s) and re.IGNORECASE (i) modifier
(.*) - Group 1: any zero or more chars as many as possible
\bend\b - a whole word end.
The \g<1>some other word replacement is the Group 1 value (I used \g<1> since it will be helpful if your some other word starts with a digit) plus your word.
NOTE: if your some other word can contain literal backslashes, do not forget to double them.
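As a rough sketch of the same idea, the replacement can also be supplied as a function, which sidesteps the backslash and leading-digit concerns because the text is inserted literally. The helper name replace_last_end and the sample procedure text are assumptions made up for this example.

```python
import re

def replace_last_end(text, replacement):
    """Replace the last whole-word 'end' (any case) with `replacement`."""
    # Greedy (.*) with DOTALL swallows everything up to the LAST "end".
    # The function replacement inserts `replacement` literally.
    return re.sub(r"(?si)(.*)\bend\b",
                  lambda m: m.group(1) + replacement,
                  text, count=1)

proc = "begin\n  if x then begin y; end;\n  append(z);\nEND;"
print(replace_last_end(proc, "finish"))
# begin
#   if x then begin y; end;
#   append(z);
# finish;
```

Note that the "end" inside append is left alone because of the \b word boundaries, and the earlier end; is untouched because only the last occurrence is matched.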

splitting a text file into words using regex in python

Brand new to Python! I'm given a text file (https://en.wikipedia.org/wiki/Character_mask) and I need to split the file into single words (more than a single letter, separated by one or more of any other character). I've tried using regex but can't seem to split it right without an error. Here is the code I have so far; can anyone help me fix this regex?
import re
file = open("charactermask.txt", "r")
text = file.read()
message = print(re.split(',.-\d\c\s',text))
print (message)
file.close()
You can use re.findall with the following regex pattern instead to find all words that are more than 1 character long.
Change:
message = print(re.split(',.-\d\c\s',text))
to:
message = re.findall(r'[A-Za-z]{2,}', text)
If you are looking for simple word tokens from a text string, you can use
.split(); it will work like a charm!
For example
mystring = "My favorite color is blue"
mystring.split()
['My', 'favorite', 'color', 'is', 'blue']
If you're just trying to split the text, then SmashGuy's answer should get the job done. Using regex seems like overkill. Additionally, your regex pattern doesn't quite do what you described as your intention. You might want to test your pattern until you get it right before plugging it into your Python script. Try https://regex101.com/
Here's what your pattern does right now:
, matches the character , literally (case sensitive)
. matches any character (except for line terminators)
- matches the character - literally (case sensitive)
\d matches a digit (equal to [0-9])
\c matches the character c literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
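I'm not sure whether you actually meant the character class [,.-], which matches any one of these characters. You might also have had the wrong impression of the \c token: it has no special meaning in Python's flavor of regex (current Python versions reject an unknown escape like \c in a pattern with an error instead of matching a literal c).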
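For comparison, here is a small sketch of both approaches on an inline string. The sample sentence is made up, and the character class [,.\-\d\s] is only a guess at what the original pattern was meant to express.

```python
import re

text = "A mask is a bit-pattern, e.g. 255 or 15."

# Approach 1: collect runs of two or more letters directly.
words = re.findall(r"[A-Za-z]{2,}", text)
print(words)
# ['mask', 'is', 'bit', 'pattern', 'or']

# Approach 2: split on runs of separator characters (a character class,
# not a sequence), then drop empty and single-character pieces.
pieces = re.split(r"[,.\-\d\s]+", text)
long_pieces = [p for p in pieces if len(p) > 1]
print(long_pieces)
# ['mask', 'is', 'bit', 'pattern', 'or']
```

The two agree on this input, but the split version keeps any character that is not listed in the class (apostrophes, underscores, and so on), so findall is usually the easier one to reason about.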

Regex for removing parts of the string

How can I solve this problem with regex in Python?
I want to filter out the words regular and text from:
"A regular expression is a special text string for describing a search pattern."
I want a result like this:
"A expression is a special string for describing a search pattern."
Please help me solve this problem with regex syntax.
import re
txt = "A regular expression is a special text string for describing a search pattern."
pattern = "(.*) regular(.*) text(.*)"
result = re.sub(pattern, r"\1\2\3", txt)
print(result) # for testing only
The explanation:
As you can see, your regular expression is
(.*) regular(.*) text(.*)
Expressions in parentheses are so-called capture groups. All 3 have the same form:
.*
which means that they will match everything - . means any character, * means arbitrary number of them, including zero (empty string).
Now we may use the captured texts as \1, \2, \3, respectively, so your original text is in this notation the same as
\1 regular\2 text\3
So in the re.sub() call we use only
\1\2\3
as the replacement string, which effectively strips out the parts " regular" and " text".
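The pattern above relies on both words appearing once and in that order. As a hedged alternative, the words can be removed one by one with an alternation; the helper name remove_words is made up for this sketch and is not from the original answer.

```python
import re

txt = "A regular expression is a special text string for describing a search pattern."

def remove_words(text, words):
    """Remove each whole word in `words`, plus the whitespace before it."""
    alternation = "|".join(re.escape(w) for w in words)
    # \b keeps "text" from matching inside longer words such as "context".
    return re.sub(r"\s*\b(?:" + alternation + r")\b", "", text)

print(remove_words(txt, ["regular", "text"]))
# A expression is a special string for describing a search pattern.
```

This works regardless of order or number of occurrences. One caveat: if a removed word is the very first word of the string, the space after it is left behind.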

Strip punctuation with regular expression - python

I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.
For instance for an original string:
##%%.Hol$a.A.$%
I would like to get .Hol$a.A., with the punctuation removed from the beginning and end of the word but not from the middle.
Another example could be for the string:
##%%...&Hol$a.A....$%
In this case the returned string should be ..&Hol$a.A.... because we do not care if the allowed characters are repeated.
The idea is to remove all punctuation (except the dot) just at the beginning and end of the word. A word is defined as \w and/or a .
A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the ' characters.
How to accomplish the goal using Regex?
Use this simple and easily adaptable regex:
[\w.].*[\w.]
It will match exactly your desired result, nothing more.
[\w.] matches any word character (letter, digit, or underscore) or a dot
.* matches any character (except newline normally)
[\w.] matches any word character (letter, digit, or underscore) or a dot
To change the delimiters, simply change the set of allowed characters inside the [] brackets.
Check this regex out on regex101.com
import re
data = '##%%.Hol$a.A.$%'
pattern = r'[\w.].*[\w.]'
print(re.search(pattern, data).group(0))
# Output: .Hol$a.A.
Depending on what you mean by stripping the punctuation, you can adapt the following code:
import re
res = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", "##%%.Hol$a.A.$%")
mystr = res.group(1)
This strips everything before the first dot and after the last dot in the string.
Warning: re.search returns None if the string doesn't match, so check the result before calling group().
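A small sketch that wraps the first pattern together with the None check might look like this. The function name strip_outer_punctuation is hypothetical, and the optional group is an addition so that a one-character word still matches.

```python
import re

def strip_outer_punctuation(word):
    """Return `word` without leading/trailing characters other than \\w or '.'."""
    # First and last kept characters must be a word character or a dot;
    # the optional group lets single-character words match too.
    match = re.search(r"[\w.](?:.*[\w.])?", word)
    # re.search returns None when nothing qualifies (e.g. "$%#").
    return match.group(0) if match else ""

print(strip_outer_punctuation("##%%.Hol$a.A.$%"))
# .Hol$a.A.
print(strip_outer_punctuation("'Barnes&Nobles'"))
# Barnes&Nobles
```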

How to get only searched word as a result python regex

How can I get only the words that match my regex in python? Because everything I tried also prints the full line where the string was found.
The regex is the following:
\b([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2})\b
It matches an IP address with a CIDR suffix (e.g. 12.0.0.0/8).
The text I am searching is as follows:
04/30","172.18.186.0/24","172.18.185.0/24","172.18.177.16/28","dwefwf-1.RI-nc_wefwfwefwefpat_intweb_fe","172.18.176.16/28","edefwfwf
t_pat_infwef_fe","172.18.178.16/28","dwefwefwef-wefwffwefwefwef_dr_efwefeb_fe","172.18.176.80/28","DSwefwfH2.
RI-nc_rat_dr_fweweb_fe","172.18.178.48/28","172.18.177.208/28","wefwef
wefwtfweapp_fe","172.18.176.208/28","wfwfwefwefwefH2.RI-nwefwefdr_app_fe","172.18.177.192/28","de1dfwwf-1.wefewf","172.18.176.1
92/28","
You should modify your regex as follows:
\b(([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2}))\b
and then extract the first matched group: \1
Demo: http://repl.it/R0W/1 (It takes a while to run)
I think your regexp works correctly. If you want to get the matched string, use the group() method, like this:
import re
regexp = r'\b([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2})\b'
text = '''04/30","172.18.186.0/24","172.18.185.0/24","172.18.177.16/28","dwefwf-1.RI-nc_wefwfwefwefpat_intweb_fe","172.18.176.16/28","edefwfwf
t_pat_infwef_fe","172.18.178.16/28","dwefwefwef-wefwffwefwefwef_dr_efwefeb_fe","172.18.176.80/28","DSwefwfH2.
RI-nc_rat_dr_fweweb_fe","172.18.178.48/28","172.18.177.208/28","wefwef
wefwtfweapp_fe","172.18.176.208/28","wfwfwefwefwefH2.RI-nwefwefdr_app_fe","172.18.177.192/28","de1dfwwf-1.wefewf","172.18.176.1
92/28","'''
for i in re.finditer(regexp, text):
    print(i.group(0))
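One thing that may explain unexpected output here: when a pattern contains capture groups, re.findall returns tuples of the groups rather than the whole match. The sketch below contrasts the original pattern with a variant using a non-capturing group; the names GROUPED and UNGROUPED and the shortened sample string are made up for illustration.

```python
import re

# The pattern from the question, with a capture group around each part.
GROUPED = r'\b([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2})\b'
# Same idea with a non-capturing repetition, so findall yields whole matches.
UNGROUPED = r'\b[1-9][0-9]{1,2}(?:\.[0-9]{1,3}){3}/[0-9]{1,2}\b'

sample = '04/30","172.18.186.0/24","172.18.185.0/24","dwefwf-1.RI-nc'

print(re.findall(GROUPED, sample))
# [('172', '18', '186', '0', '24'), ('172', '18', '185', '0', '24')]

print(re.findall(UNGROUPED, sample))
# ['172.18.186.0/24', '172.18.185.0/24']

# finditer plus group(0) also gives whole matches, even with capture groups.
print([m.group(0) for m in re.finditer(GROUPED, sample)])
# ['172.18.186.0/24', '172.18.185.0/24']
```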
