Incorrect output due to regular expression - python

I have a PDF in which names are written after a '/'.
Eg: /John Adam Will Newman
I want to extract the names that follow the '/'.
The code I wrote is:
names=re.compile(r'((/)((\w)+(\s)))+')
However, it returns only the first name, "John", and returns it twice, instead of the whole name.

Your + is at the wrong position; your regexp, as it stands, would demand /John /Adam /Will /Newman, with a trailing space.
r'((/)((\w)+(\s))+)' is a little better; it will accept /John Adam Will, with a trailing space; won't take Newman, because there is nothing to match \s.
r'((/)(\w+(\s\w+)*))' matches what you posted. Note that it is necessary to repeat one of the sequences that match a name, because we want N-1 spaces if there are N words.
(As Ondřej Grover says in comments, you likely have too many unneeded capturing brackets, but I left that alone as it hurts nothing but performance.)
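To see the difference between these patterns in practice, here is a minimal sketch. The helper name extract_names is made up for illustration, and the pattern it uses is the last one above with the extra capturing groups reduced to one capturing group plus a non-capturing one.

```python
import re

# The pattern from the question: the whole "/word " unit is repeated,
# so the match stops right after the first word and its trailing space.
original = re.compile(r'((/)((\w)+(\s)))+')

# Same idea as r'((/)(\w+(\s\w+)*))', but with a single capturing group
# for the name and a non-capturing group (?:...) for each " word" repeat.
fixed = re.compile(r'/(\w+(?:\s\w+)*)')


def extract_names(text):
    """Return every run of whitespace-separated words that follows a '/'."""
    return fixed.findall(text)


s = '/John Adam Will Newman'
print(repr(original.search(s).group(0)))  # '/John '
print(extract_names(s))                   # ['John Adam Will Newman']
```

Note that \s also matches a newline, so a name at the end of a line would absorb the words on the next line; a literal space in place of \s avoids that if it matters for your PDF text.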

I think you define way too many unnamed regexp groups. I would do something like this
import re
s = '/John Adam Will Newman'
name_regexp = re.compile(r'/(?P<name>(\w+\s*)+)')
match_obj = name_regexp.match(s) # match object
group_dict = match_obj.groupdict() # dict mapping {group name: value}
name = group_dict['name']
(?P<name>...) starts a named group
(\w+\s*) is a group matching one or more word characters (letters, digits or underscore), possibly followed by some whitespace
the match object returned by .match(s) has a groupdict() method, which returns a dict mapping group names to their contents

Related

Replace a substring with defined region and follow up variable region in Python

I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant AATCGA and will be used to locate the region to replace. This is then followed by a "spacer", in this case a single character but could be more than one and needs to be specified, and finally the last bit that will follow the "tail", in this case four characters, but could also be more or less. A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!
I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be an f-string. Making it a raw string means we
# don't need to escape the backslashes.
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
"Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start."
How do you know when to replace it in reverse instead of forward? After all, all you're doing is matching a short string followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: longer input with the expected output.
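If the direction is known in advance, both cases can be handled by one function. This is only a sketch under that assumption: the function name highlight, its reverse flag, and the named groups are made up for illustration, and it does not answer the question of how to choose a direction when both could match.

```python
import re


def highlight(sequence, to_find, spacer_len, tail_len, reverse=False):
    """Wrap to_find, the spacer and the tail in <span> tags.

    reverse=False: to_find, then spacer, then tail (left to right).
    reverse=True:  tail, then spacer, then to_find (left to right).
    """
    anchor = f'(?P<anchor>{re.escape(to_find)})'
    spacer = f'(?P<spacer>.{{{spacer_len}}})'
    tail = f'(?P<tail>.{{{tail_len}}})'

    blue = r'<span color="blue">\g<anchor></span>'
    plain = r'<span>\g<spacer></span>'
    green = r'<span color="green">\g<tail></span>'

    if reverse:
        pattern, repl = tail + spacer + anchor, green + plain + blue
    else:
        pattern, repl = anchor + spacer + tail, blue + plain + green
    return re.sub(pattern, repl, sequence)


forward = highlight('AATCGATCGTATATCTGCGTAGACTCTGTGCATGC', 'AATCGA', 1, 4)
backward = highlight('CGTATAATCGA', 'AATCGA', 1, 4, reverse=True)
print(forward)
print(backward)
```

Passing lengths rather than the spacer and tail strings themselves reflects the point that only their sizes matter; re.escape keeps to_find literal even if it ever contains regex metacharacters.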

Operate on part of sequence while returning whole sequence

I want to shorten a Python class name by truncating all but the last part, i.e. module.path.to.Class => mo.pa.to.Class.
This could be accomplished by splitting the string, storing the list in a variable, operating on all but the last part, and joining the parts back together.
I would like to know if there is a way to do this in one step, i.e.:
split to parts
create two copies of sequence (tee ?)
apply truncation to one sequence and not the other
join selected parts of sequence
Something like:
'.'.join( [chain(map(lambda x: x[:2], foo[:-1]), bar[-1]) for foo, bar in tee(name.split('.'))] )
But I'm unable to figure out working with ...foo, bar in tee(...
If you want to do it by splitting, you can split once on the last dot first, and then process only the first part by splitting it again to get the package indices, then shorten each to its first two characters, and finally join everything back together in the end. If you insist on doing it inline:
name = "module.path.to.Class"
short = ".".join([[x[:2] for x in p.split(".")] + [n] for p, n in [name.rsplit(".", 1)]][0])
print(short) # mo.pa.to.Class
This creates unnecessary lists just so that everything fits inside a single list comprehension. In reality it probably ends up slower than doing it in a normal, procedural fashion:
def shorten_path(source):
    indices = source.split(".")
    # Shorten every part except the last; a name without dots is returned unchanged.
    return ".".join([x[:2] for x in indices[:-1]] + indices[-1:])

name = "module.path.to.Class"
print(shorten_path(name))  # mo.pa.to.Class
You could do this in one line with a regular expression:
>>> re.sub(r'(\b\w{2})\w*(\.)', r'\1\2', 'module.path.to.Class')
'mo.pa.to.Class'
The pattern r'(\b\w{2})\w*(\.)' captures two groups: the first two letters of a word, and the dot at the end of the word.
The substitution pattern r'\1\2' concatenates the two captured groups, i.e. the first two letters of the word and the dot.
No count parameter is passed to re.sub, so all occurrences of the pattern are substituted.
The final word (the class name) is not truncated because it isn't followed by a dot, so it doesn't match the pattern.
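A variant of the same idea, sketched here with a lookahead so that the dot is checked but not consumed and only one group is needed. The function name shorten is made up for illustration.

```python
import re


def shorten(name):
    """Truncate every dot-terminated part of a dotted name to two characters."""
    # \b(\w{2})  the first two characters of a word (captured)
    # \w*        the rest of the word (dropped)
    # (?=\.)     only if a dot follows; the dot itself is not part of the match
    return re.sub(r'\b(\w{2})\w*(?=\.)', r'\1', name)


print(shorten('module.path.to.Class'))  # mo.pa.to.Class
print(shorten('Class'))                 # Class
```

Parts shorter than two characters do not match \w{2} and are left as they are.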

How to match and remove occurrences from a file using regex

I am new to Python and I am trying to get some contents from a file using regex. I upload a file, load it in memory and then run this regular expression. I want to take the names from the file, but it also needs to work with names that have spaces, like "Marie Anne". So imagine that the array of names has these values:
all_names = [{"name": "Marie Anne", "id": 1}, {"name": "Johnathan", "id": 2}, {"name": "Marie", "id": 3}, {"name": "Anne", "id": 4}, {"name": "John", "id": 5}]
And the string that I am searching might have multiple occurrences, and it's multiline.
print all_names  # list of dicts with id and name, sorted by name length, descending
textToStrip = stdout.decode('ascii', 'ignore').lower()
for i in range(len(all_names)):
    print all_names[i]
    m = re.search(r'\W' + re.escape(unicode(all_names[i]['name'].lower())) + r'\W', textToStrip)
    if m:
        textToStrip = re.sub(r'\W' + re.escape(unicode(all_names[i]['name'].lower())) + r'\W', "", textToStrip, 100)
        print "found " + all_names[i]['name']
print textToStrip
The script finds the names. The re.sub line removes each matched name from the text, so that "Marie Anne" and "Marie" are not both taken from the same occurrence, but it also removes the extra characters, such as "," or ".", before or after the name.
Any help would be much appreciated... or if you have a better solution for this problem, even better.
The characters on both sides are deleted because the \W at each end is part of the pattern you pass to re.sub(), and re.sub() replaces everything the pattern matches.
There's an alternate way to do this. If you wrap the parts of the match that you want to keep in grouping parens, and call re.sub with a callable (a function) instead of the new string, that function can extract the group values from the match object passed to it and assemble a return value that preserves them.
Read the documentation for re.sub for details.
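A minimal sketch of that approach, written for Python 3 (the question's code is Python 2). The function names remove_name and keep_delimiters and the sample text are made up for illustration.

```python
import re


def remove_name(text, name):
    """Remove every occurrence of name from text, keeping the
    non-word characters on either side of it.

    Returns (new_text, number_of_removals).
    """
    # The two \W are now in their own groups so they can be put back.
    pattern = re.compile(r'(\W)' + re.escape(name.lower()) + r'(\W)')

    def keep_delimiters(match):
        # Rebuild the replacement from the two delimiters only.
        return match.group(1) + match.group(2)

    return pattern.subn(keep_delimiters, text)


new_text, count = remove_name("hi, marie anne. and marie, too", "Marie Anne")
print(new_text)  # hi, . and marie, too
print(count)     # 1
```

As in the original code, a name at the very start or end of the text is not matched, because \W needs an actual character on each side.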

Extracting a number from a string using regular expressions

I have the following string:
fname="VDSKBLAG00120C02 (10).gif"
How can I extract the value 10 from the string fname (using re)?
A simpler regex is \((\d+)\):
regex = re.compile(r'\((\d+)\)')
value = int(re.search(regex, fname).group(1))
regex = re.compile(r"(?<=\()\d+(?=\))")
value = int(re.search(regex, fname).group(0))
Explanation:
(?<=\() # Assert that the previous character is a (
\d+ # Match one or more digits
(?=\)) # Assert that the next character is a )
Personally, I'd use this regex:
^.*\((\d+)\)(?:\.[^().]+)?$
With this, I can pick the last number in parentheses, just before the extension (if any); the number is captured in group 1. It won't pick up a number in parentheses from the middle of the file name. For example, it correctly picks out 2 from SomeFilmTitle.(2012).RippedByGroup (2).avi. The only drawback is that it can't tell a counter from any other number that sits right before the extension: in SomeFilmTitle (2012).avi it picks 2012.
I am assuming that the file extension, if any, does not contain parentheses.
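As a rough illustration of the anchored pattern, here is a sketch that wraps it in a function; the name trailing_number is made up, and the digits are captured in group 1.

```python
import re

# Last "(digits)" in the name, optionally followed by an extension
# that contains no parentheses or further dots.
TRAILING_NUMBER = re.compile(r'^.*\((\d+)\)(?:\.[^().]+)?$')


def trailing_number(fname):
    """Return the parenthesised number just before the extension, or None."""
    match = TRAILING_NUMBER.match(fname)
    return int(match.group(1)) if match else None


print(trailing_number('VDSKBLAG00120C02 (10).gif'))                   # 10
print(trailing_number('SomeFilmTitle.(2012).RippedByGroup (2).avi'))  # 2
print(trailing_number('SomeFilmTitle (2012).avi'))                    # 2012
print(trailing_number('NoNumberHere.gif'))                            # None
```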

python: regular expressions, how to match a string of undefined length which has a structure and finishes with a specific group

I need to create a regexp to match strings like this 999-123-222-...-22
The string can be finished by &Ns=(any number) or without this... So valid strings for me are
999-123-222-...-22
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but have not managed to solve it, so I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
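The two patterns above are written in /.../ notation. In Python the slashes are dropped, and re.fullmatch can stand in for the ^ and $ anchors. A minimal sketch; the names ANY_DIGITS, LITERAL and is_valid are made up for illustration.

```python
import re

# Any run of digits and dashes, optionally followed by &Ns=<digits>.
ANY_DIGITS = re.compile(r'[\d-]+(&Ns=\d+)?')
# The literal string 999-123-222-...-22, optionally followed by &Ns=<digits>.
LITERAL = re.compile(r'999-123-222-\.\.\.-22(&Ns=\d+)?')


def is_valid(s, pattern=ANY_DIGITS):
    """True if the whole string matches the pattern."""
    return pattern.fullmatch(s) is not None


print(is_valid('999-123-222-22'))                     # True
print(is_valid('999-123-222-22&Ns=12'))               # True
print(is_valid('999-123-222-22&N=1'))                 # False
print(is_valid('999-123-222-...-22&Ns=12', LITERAL))  # True
```

Note that the first pattern does not accept literal dots, so it only fits if the "..." in the question stands for further groups of digits.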
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you are better off using a string function.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number with extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=') # splits on '&Ns=' and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion after the optional '&Ns=' can be checked with the string method isdigit, which returns True if the string is non-empty and all of its characters are digits, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()
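Putting the splitting steps together, a validation function might look roughly like this. It is a sketch under assumptions that are not in the question: the name parse_code is made up, and a literal '...' is accepted as a field alongside runs of digits.

```python
def parse_code(s):
    """Split 'a-b-c[&Ns=n]' into (fields, ns); return None if invalid.

    ns is the digit string after '&Ns=' or None when that part is absent.
    """
    head, sep, tail = s.partition('&Ns=')
    # If '&Ns=' is present, it must be followed by digits only.
    if sep and not tail.isdigit():
        return None
    fields = head.split('-')
    # Assumption: each field is either digits or the literal '...'.
    if not all(f.isdigit() or f == '...' for f in fields):
        return None
    return fields, (tail if sep else None)


print(parse_code('999-123-222-...-22&Ns=12'))
print(parse_code('999-123-222-...-22'))
print(parse_code('999-123-222-...-22&N=1'))  # None
```

The third call fails because, without a '&Ns=' separator, the last field is '22&N=1', which is neither digits nor '...'.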
