How to match and remove occurrences from a file using regex - python

I am new to Python and I am trying to extract some content from a file using regex. I upload a file, load it into memory, and then run this regular expression. I want to take the names from the file, but it also needs to work with names that contain spaces, like "Marie Anne". So imagine that the array of names has these values:
all_names = [{"name": "Marie Anne", "id": 1}, {"name": "Johnathan", "id": 2}, {"name": "Marie", "id": 3}, {"name": "Anne", "id": 4}, {"name": "John", "id": 5}]
And the string that I am searching is multiline and might contain multiple occurrences.
print all_names  # list of dicts with id and name, ordered by name length, longest first
textToStrip = stdout.decode('ascii', 'ignore').lower()
for i in range(len(all_names)):
    print all_names[i]
    m = re.search(r'\W' + re.escape(unicode(all_names[i]['name'].lower())) + r'\W', textToStrip)
    if m:
        textToStrip = re.sub(r'\W' + re.escape(unicode(all_names[i]['name'].lower())) + r'\W', "", textToStrip, 100)
        print "found " + all_names[i]['name']
print textToStrip
The script finds the names, and the re.sub line removes each match from the text so that "Marie Anne" and "Marie" are not both taken from the same occurrence. The problem is that it also removes the surrounding characters, such as "," or ".", before or after the name.
Any help would be much appreciated... or if you have a better solution for this problem, even better.

The characters on both sides are deleted because you included \W in the regexp passed to re.sub(). Called with a replacement string, re.sub replaces everything the regexp matches, including the characters matched by \W.
There's an alternate way to do this. If you wrap the parts of the match that you want to keep in grouping parens, and call re.sub with a callable (a function) instead of the replacement string, that function can extract the group values from the match object passed to it and assemble a return value that preserves them.
Read documentation for re.sub for details.
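A minimal sketch of the callable approach described above, written for Python 3. The helper name remove_name and the sample text are assumptions made for illustration; they do not come from the question.

```python
import re

def remove_name(text, name):
    # Groups 1 and 2 capture the non-word characters on either side of the name.
    pattern = re.compile(r'(\W)' + re.escape(name.lower()) + r'(\W)')
    # The callable rebuilds the replacement from the two delimiters,
    # so only the name itself is dropped.
    return pattern.sub(lambda m: m.group(1) + m.group(2), text)

sample = "hello, marie anne. and marie, too"
stripped = remove_name(sample, "Marie Anne")
print(stripped)
```

Because the longest names are processed first, "marie anne" is removed before "marie" is tried, while the comma and full stop around it stay in place.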

Related

Is it possible to search and replace a string with "any" characters?

There are probably several ways to solve this problem, so I'm open to any ideas.
I have a file, within that file is the string "D133330593" Note: I do have the exact position within the file this string exists, but I don't know if that helps.
Following this string, there are 6 digits, I need to replace these 6 digits with 6 other digits.
This is what I have so far:
def editfile():
    f = open(filein,'r')
    filedata = f.read()
    f.close()
    #This is the line that needs help
    newdata = filedata.replace( -TOREPLACE- ,-REPLACER-)
    #Basically what I need is something that lets me say "D133330593******"
    #->"D133330593123456" Note: The following 6 digits don't need to be
    #anything specific, just different from the original 6
    f = open(filein,'w')
    f.write(newdata)
    f.close()
Use the re module to define your pattern and then use the sub() function to substitute occurrences of that pattern with your own string.
import re
...
pat = re.compile(r"D133330593\d{6}")
newdata = pat.sub("D133330593abcdef", filedata)  # sub() returns the new string
The above defines a pattern as your string ("D133330593") followed by six decimal digits. The next line then replaces ALL occurrences of this pattern with your replacement string (the prefix followed by "abcdef" in this case), if that is what you want.
If you want a unique replacement string for each occurrence of the pattern, you could use the count keyword argument of sub(), which sets the maximum number of replacements made per call.
Check out this library for more info - https://docs.python.org/3.6/library/re.html
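As a rough sketch of both options mentioned above (Python 3): count=1 limits the substitution to the first occurrence, and a function passed as the replacement can hand out a different value for each match. The sample filedata string and the replacement digits are made up for illustration.

```python
import re

# Capture the fixed prefix so it can be reused in the replacement.
pat = re.compile(r"(D133330593)\d{6}")
filedata = "xxD133330593111111yyD133330593222222zz"

# \g<1> is used instead of \1 so the digits that follow are not read
# as part of the group number. count=1 replaces only the first match.
first_only = pat.sub(r"\g<1>123456", filedata, count=1)

# A callable receives each match object and returns its replacement,
# which allows a different value per occurrence.
new_digits = iter(["000001", "000002"])
unique = pat.sub(lambda m: m.group(1) + next(new_digits), filedata)

print(first_only)
print(unique)
```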
Let's simplify your problem to you having a string:
s = "zshisjD133330593090909fdjgsl"
and you want to replace the 6 characters after "D133330593" with "123456" to produce:
"zshisjD133330593123456fdjgsl"
To achieve this, we first need to find the index of "D133330593". This is done by just using str.index:
i = s.index("D133330593")
Then replace the next 6 characters, but for this, we should first calculate the length of our string that we want to replace:
l = len("D133330593")
then do the replace:
s[:i+l] + "123456" + s[i+l+6:]
which gives us the desired result of:
'zshisjD133330593123456fdjgsl'
I am sure that you can now integrate this into your code to work with a file, but this is the heart of your problem.
Note that storing the index and length in variables as above is preferable, since it avoids recomputing them. Nevertheless, if your file isn't too long (i.e. efficiency isn't a big deal), you can do the whole process outlined above in one line:
s[:s.index("D133330593")+len("D133330593")] + "123456" + s[s.index("D133330593")+len("D133330593")+6:]
which gives the same result.
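The slicing steps above could be wrapped in a small helper, sketched here for Python 3. The function name replace_after is hypothetical, and it only handles the first occurrence of the marker.

```python
def replace_after(s, marker, new_digits):
    # str.index raises ValueError if the marker is not present.
    start = s.index(marker) + len(marker)
    # Keep everything up to the end of the marker, insert the new digits,
    # then skip the same number of old characters.
    return s[:start] + new_digits + s[start + len(new_digits):]

s = "zshisjD133330593090909fdjgsl"
result = replace_after(s, "D133330593", "123456")
print(result)
```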

Incorrect output due to regular expression

I have a PDF in which names are written after a '/'.
Eg: /John Adam Will Newman
I want to extract the names starting with '/'.
The code which I wrote is:
names=re.compile(r'((/)((\w)+(\s)))+')
However, it produces just the first name of the string, "John", and that twice, not the rest of the name.
Your + is at the wrong position; your regexp, as it stands, would demand /John /Adam /Will /Newman, with a trailing space.
r'((/)((\w)+(\s))+)' is a little better; it will accept /John Adam Will, with a trailing space; won't take Newman, because there is nothing to match \s.
r'((/)(\w+(\s\w+)*))' matches what you posted. Note that it is necessary to repeat one of the sequences that match a name, because we want N-1 spaces if there are N words.
(As Ondřej Grover says in comments, you likely have too many unneeded capturing brackets, but I left that alone as it hurts nothing but performance.)
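A short sketch of the same idea with the extra capturing groups removed (Python 3). The sample line is taken from the question; the variable names are illustrative only.

```python
import re

# One word, then zero or more "whitespace + word" repetitions, so N words
# need only N-1 separators. (?:...) groups without capturing.
name_re = re.compile(r'/(\w+(?:\s\w+)*)')

line = 'Eg: /John Adam Will Newman'
match = name_re.search(line)
full_name = match.group(1) if match else None
print(full_name)
```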
I think you define way too many unnamed regexp groups. I would do something like this
import re
s = '/John Adam Will Newman'
name_regexp = re.compile(r'/(?P<name>(\w+\s*)+)')
match_obj = name_regexp.match(s) # match object
group_dict = match_obj.groupdict() # dict mapping {group name: value}
name = group_dict['name']
(?P<name>...) starts a named group
(\w+\s*) is a group matching one or more alphanum characters, possibly followed by some whitespace
the match object returned by the .match(s) method has a method groupdict() which returns a dict which is mapping from group names to their contents

replace multiple words - python

There can be an input "some word".
I want to replace this input with "<strong>some</strong> <strong>word</strong>" in some other text which contains this input
I am trying with this code:
input = "some word".split()
pattern = re.compile('(%s)' % input, re.IGNORECASE)
result = pattern.sub(r'<strong>\1</strong>',text)
but it is failing, and I know why: I am wondering how to pass all elements of the list input to compile() so that (%s) can catch each of them.
I'd appreciate any help.
The right approach, since you're already splitting the list, is to surround each item of the list directly (never using a regex at all):
sterm = "some word".split()
result = " ".join("<strong>%s</strong>" % w for w in sterm)
In case you're wondering, the pattern you were looking for was:
pattern = re.compile('(%s)' % '|'.join(sterm), re.IGNORECASE)
This works on your string because the regular expression would become
(some|word)
which means "matches some or matches word".
However, this is not a good approach as it does not work for all strings. For example, consider cases where one word contains another, such as
a banana and an apple
which becomes:
<strong>a</strong> <strong>banana</strong> <strong>a</strong>nd <strong>a</strong>n <strong>a</strong>pple
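One possible way around the partial-match problem is to anchor the alternation with \b word boundaries and escape each term. This is only a sketch for Python 3, assuming the search input was "a banana"; it is not part of the original answer.

```python
import re

terms = "a banana".split()
text = "a banana and an apple"

# Escape each term, try longer terms first, and require word boundaries
# on both sides so "a" no longer matches inside "and", "an" or "apple".
alternation = '|'.join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
pattern = re.compile(r'\b(%s)\b' % alternation, re.IGNORECASE)

result = pattern.sub(r'<strong>\1</strong>', text)
print(result)
```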
It looks like you're wanting to search for multiple words - this word or that word. Which means you need to separate your searches by |, like the script below:
import re
text = "some word many other words"
input = '|'.join('some word'.split())
pattern = re.compile('(%s)' % input, flags=0)
print pattern.sub(r'<strong>\1</strong>',text)
I'm not completely sure I know what you're asking, but if you want to pass all the elements of input as parameters to the compile function call, you can use *input instead of input; * unpacks the list into its elements. Note, however, that re.compile takes a single pattern string (plus optional flags), so unpacking will not build an alternation by itself. As an alternative, couldn't you just join the list with | and add ( at the beginning and ) at the end?
Alternatively, you can use the join operator with a list comprehension to create the intended result.
text = "some word many other words".split()
result = ' '.join(['<strong>'+i+'</strong>' for i in text])

python re.sub with a list of words to find

I am not too familiar with RE but I am trying to iterate over a list and use re.sub to take out multiple items from a large block of text that is held in the variable first_word.
I use re.sub to remove tags first and this works fine, but I next want to remove all the strings in the exclusionList variable and I am not sure how to do this.
Thanks for the help, here is the code that raises the exception.
exclusionList = ['+','of','<ET>f.','to','the','<L>L.</L>']
for a in range(0, len(exclusionList)):
    first_word = re.sub(exclusionList[a], '',first_word)
And the exception:
    first_word = re.sub(exclusionList[a], '',first_word)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 245, in _compile
    raise error, v # invalid expression
error: nothing to repeat
The plus symbol is an operator in regex meaning 'one or more repetitions of the preceding element'. E.g., x+ means one or more repetitions of x. If you want to find and replace actual + signs, you need to escape the plus like this: re.sub(r'\+', '', string). So change the first entry in your exclusionList.
You can also eliminate the for loop, like this:
exclusions = '|'.join(exclusionList)
first_word = re.sub(exclusions, '', first_word)
The pipe symbol | indicates a disjunction in regex, so x|y|z matches x or y or z.
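Since the list also contains other metacharacters (the "." in '<ET>f.' matches any character), one option is to let re.escape handle every entry before joining. A minimal sketch for Python 3; the sample text is invented for illustration.

```python
import re

exclusion_list = ['+', 'of', '<ET>f.', 'to', 'the', '<L>L.</L>']

# re.escape turns each entry into a pattern that matches it literally,
# so '+' and '.' lose their special meaning.
pattern = re.compile('|'.join(re.escape(item) for item in exclusion_list))

sample = "1 + 2 <L>L.</L>"
cleaned = pattern.sub('', sample)
print(repr(cleaned))
```

Note that plain entries such as 'to' and 'the' will still match inside longer words with this pattern.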
The basic form of your program is correct, so I suspect any problem you are having has to do with the regexes you are using. '+' by itself is an invalid regex, you'll need to escape it using '\'.
From a usage point, Python allows you to specify that a string should not do any backslash escaping, so that you don't have to litter your regexen with '\\' when you just mean '\'. The syntax for this is a leading "r", as in r'\+', which is what you should replace the first item in your exclusionList with.
If you are looking to extract the words "to", "the", etc. then you also want to make sure you are extracting whole words, and don't accidentally extract the "to" in "tooth", or the "the" in "other". Add "\b" to specify a word boundary to prevent this: r'\bto\b' and r'\bthe\b'.
Lastly, for a in range(0, len(exclusionList)): is more simply written by just iterating over the list itself: for exclusion in exclusionList:.
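The word-boundary advice above can be sketched as follows (Python 3). The sample sentence is hypothetical, and the final whitespace cleanup is an extra step not mentioned in the answer.

```python
import re

text = "the tooth belongs to the other one"

# \b on both sides means only the standalone words "to" and "the" match,
# not the "to" in "tooth" or the "the" in "other".
without_words = re.sub(r'\b(?:to|the)\b', '', text)

# Collapse the leftover runs of spaces.
cleaned = ' '.join(without_words.split())
print(cleaned)
```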

Can you place an instance of a member of a list within a regex to match in python?

So essentially I am trying to read lines from multiple files in a directory and use a regex to find the beginning of a sort of timestamp. I also want to place an instance of a list of months within the regex and then create a counter for each month based on how many times it appears. I have some code below, but it is still a work in progress. I know I closed off date_parse, but that's why I'm asking. And please leave another suggestion if you can think of a more efficient method. Thanks.
months = ['Jan','Feb','Mar','Apr','May','Jun',
          'Jul','Aug','Sep','Oct','Nov','Dec']
date_parse = re.compile('[Date:\s]+[[A-Za-z]{3},]+[[0-9]{1,2}\s]')
counter=0
for line in sys.stdin:
    if date_parse.match(line):
        for month in months in line:
            print '%s %d' % (month, counter)
In a regular expression, you can have a list of alternative patterns, separated using vertical bars.
http://docs.python.org/library/re.html
import re
import sys
from collections import defaultdict

date_parse = re.compile(r'Date:\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)')
c = defaultdict(int)  # maps month abbreviation -> number of lines seen
for line in sys.stdin:
    m = date_parse.match(line)
    if m is None:
        # pattern did not match
        # could handle error or log it here if desired
        continue  # skip to handling next input line
    month = m.group(1)
    c[month] += 1
Some notes:
I recommend you use a raw string (with r'' or r"") for a pattern, so that backslashes will not become string escapes. For example, inside a normal string, \s is not an escape and you will get a backslash followed by an 's', but \n is an escape and you will get a single character (a newline).
In a regular expression, when you enclose a series of characters in square brackets, you get a "character class" that matches any of the characters. So when you put [Date:\s]+ you would match Date: but you would also match taD:e or any other combination of those characters. It's perfectly okay to just put in a string that should match itself, like Date:.
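To build the alternation from the existing months list rather than typing it out, something like the following sketch could work (Python 3). The sample lines stand in for sys.stdin and are assumptions, as is the use of collections.Counter in place of defaultdict.

```python
import re
from collections import Counter

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# '|'.join(months) produces "Jan|Feb|...|Dec" for the capturing group.
date_parse = re.compile(r'Date:\s+(%s)' % '|'.join(months))

# Hypothetical input lines, used here instead of reading sys.stdin.
lines = ["Date: Jan 5 2012", "Date: Feb 1 2012", "Subject: hi", "Date: Jan 9 2012"]

counts = Counter()
for line in lines:
    m = date_parse.match(line)
    if m:
        counts[m.group(1)] += 1

for month, n in counts.items():
    print('%s %d' % (month, n))
```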
