Python Regular Expression with special characters

Having trouble writing a robust regular expression to grab information out of a string.
$ string1 = 'A_XYZ_THESE_WORDS'
$ string2 = 'A_ABC_THOSE_WORDS'
I would like a robust solution that pulls 'THESE_WORDS' out of string1 and 'THOSE_WORDS' out of string2, respectively.
Basically, I need something that removes everything before the first two underscores (_), but the text before them will vary.
$ get_text = re.search('(?<=A_)\w+(_)',string1)
$ print get_text.group()
$ 'XYZ_THESE_'

Based on your problem statement:
I need something that removes everything before the first two underscores
you don't necessarily need a regular expression:
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string1.split("_", 2)[2]
'THESE_WORDS'
The second argument to str.split is the maximum number of times to split. This will split on the first two '_'s, then take the third item (the rest of the string) from the resulting list.
This will throw an IndexError if there are fewer than two underscores in the string - this lets you know that the string is not in a format you expect, but if this behaviour is not desirable, consider:
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string1.split("_", 2)[-1]
'THESE_WORDS'
Which takes the last item in the list from str.split, rather than assuming that there will be three. Comparison:
>>> "JUST_ONE".split("_", 2)[2]
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
"JUST_ONE".split("_", 2)[2]
IndexError: list index out of range
>>> "JUST_ONE".split("_", 2)[-1]
'ONE'
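If neither behaviour is quite what you want, the length check can be made explicit. A minimal sketch, assuming a hypothetical helper name (after_second_underscore) and a default return value for strings with fewer than two underscores; neither comes from the question:

```python
def after_second_underscore(text, default=None):
    """Return everything after the second underscore, or `default` if absent."""
    # maxsplit=2 yields at most three parts: prefix, middle, remainder.
    parts = text.split("_", 2)
    if len(parts) < 3:
        # Fewer than two underscores: the string is not in the expected format.
        return default
    return parts[2]
```

This way a malformed string produces a value the caller chooses, rather than an IndexError or a silently wrong fragment such as 'ONE'.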

The regex below captures the text that follows the second underscore (_):
>>> import re
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string2 = 'A_ABC_THOSE_WORDS'
>>> m = re.search(r'^[^_]*_[^_]*_(.*)$', string1)
>>> m.group(1)
'THESE_WORDS'
>>> m = re.search(r'^[^_]*_[^_]*_(.*)$', string2)
>>> m.group(1)
'THOSE_WORDS'

In [21]: regex = re.compile(r'^([a-zA-Z]+_){2}(.*)$')
In [22]: m = regex.search(string1)
In [23]: m.groups()
Out[23]: ('XYZ_', 'THESE_WORDS')
In [24]: m = regex.search(string2)
In [25]: m.groups()
Out[25]: ('ABC_', 'THOSE_WORDS')
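Note that a repeated capturing group keeps only its last repetition, which is why the first element above is 'XYZ_' rather than 'A_'. If only the tail is needed, a non-capturing group avoids the extra element. A small sketch of that variation (the names TAIL_RE and tail are made up for illustration):

```python
import re

# (?:...) groups without capturing, so group 1 is the remainder of the string.
TAIL_RE = re.compile(r'^(?:[a-zA-Z]+_){2}(.*)$')

def tail(text):
    """Return the text after the first two letter-only prefixes, or None."""
    m = TAIL_RE.match(text)
    return m.group(1) if m else None
```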

Related

use .format() in a string in two steps

I have a string in which I want to replace some variables, but in different steps, something like:
my_string = 'text_with_{var_1}_to_variables_{var_2}'
my_string.format(var_1='10')
### make process 1
my_string.format(var_2='22')
But when I try to replace the first variable I get an Error:
KeyError: 'var_2'
How can I accomplish this?
Edit:
I want to create a new list:
name = 'Luis'
ids = ['12344','553454','dadada']
def create_list(name, ids):
    my_string = 'text_with_{var_1}_to_variables_{var_2}'.replace('{var_1}', name)
    return [my_string.replace('{var_2}', _id) for _id in ids]
this is the desired output:
['text_with_Luis_to_variables_12344',
'text_with_Luis_to_variables_553454',
'text_with_Luis_to_variables_dadada']
But using .format instead of .replace.

Put simply, str.format cannot fill in only some of the named fields in a string: every field must be supplied in the same call, otherwise you get a KeyError. I am not sure why you want to format the string in stages, but there are a few workarounds:
Approach 1: Escape the field you want to fill in at the second step by writing {{}} instead of {}. For example, replace {var_2} with {{var_2}}:
>>> my_string = 'text_with_{var_1}_to_variables_{{var_2}}'
>>> my_string = my_string.format(var_1='VAR_1')
>>> my_string
'text_with_VAR_1_to_variables_{var_2}'
>>> my_string = my_string.format(var_2='VAR_2')
>>> my_string
'text_with_VAR_1_to_variables_VAR_2'
Approach 2: Fill in one field with format and the other with %.
>>> my_string = 'text_with_{var_1}_to_variables_%(var_2)s'
# Replace first variable
>>> my_string = my_string.format(var_1='VAR_1')
>>> my_string
'text_with_VAR_1_to_variables_%(var_2)s'
# Replace second variable
>>> my_string = my_string % {'var_2': 'VAR_2'}
>>> my_string
'text_with_VAR_1_to_variables_VAR_2'
Approach 3: Collect the arguments in a dict and unpack it once all values are known.
>>> my_string = 'text_with_{var_1}_to_variables_{var_2}'
>>> my_args = {}
# Assign value of `var_1`
>>> my_args['var_1'] = 'VAR_1'
# Assign value of `var_2`
>>> my_args['var_2'] = 'VAR_2'
>>> my_string.format(**my_args)
'text_with_VAR_1_to_variables_VAR_2'
Use the one which satisfies your requirement. :)
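For the create_list case in the question, another option is to let missing fields format back to themselves. This is only a sketch, assuming Python 3.2+ (for str.format_map) and a hypothetical dict subclass named KeepMissing. It ignores format specs and conversions such as {var_2:>5} or {var_2!r} on the fields it leaves in place.

```python
class KeepMissing(dict):
    """Mapping that re-emits the placeholder for any key it does not hold."""
    def __missing__(self, key):
        return "{" + key + "}"

def create_list(name, ids):
    template = 'text_with_{var_1}_to_variables_{var_2}'
    # Step 1: fill var_1 only; {var_2} survives unchanged.
    partial = template.format_map(KeepMissing(var_1=name))
    # Step 2: fill var_2 for each id with ordinary str.format.
    return [partial.format(var_2=_id) for _id in ids]
```

Called as create_list('Luis', ['12344', '553454', 'dadada']), this should produce the list shown in the question.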

Do you have to use format? If not, you could just use str.replace, like this:
my_string = 'text_with_#var_1#_to_variables_#var2#'
my_string = my_string.replace("#var_1#", '10')
###
my_string = my_string.replace("#var2#", '22')

The following seems to work now:
s = 'a {} {{}}'.format('b')
print(s) # prints a b {}
print(s.format('c')) # prints a b c

Capitalize text without capitalizing links in python

I need to capitalize a line of input, but if I just use the upper() function, link addresses get capitalized, thus making them unusable.
For example: "Cool Video www.youtube.com/watch?v=dQw4w9WgXcQ"
will turn to: "COOL VIDEO WWW.YOUTUBE.COM/WATCH?V=DQW4W9WGXCQ"
The link address has changed and won't work anymore. Is there any way to ignore links?

If I understand your goal correctly, you should first find the part of the string to upper-case and then join it back with the rest of the original string, like this:
>>> import re
>>> s = "Cool Video -> www.youtube.com/watch?v=dQw4w9WgXcQ"
>>> #Look for the part of string you want to upper case
>>> m = re.search(r'^.*(?=\s+->)', s)
>>> m
<_sre.SRE_Match object; span=(0, 10), match='Cool Video'>
>>> #m.start() and m.end() give you the start and end positions of the matched string.
>>> new_s = s[m.start():m.end()].upper() + s[m.end():]
>>> #remember that strings are immutable, so make new one
>>> new_s
'COOL VIDEO -> www.youtube.com/watch?v=dQw4w9WgXcQ'
>>> #OR
>>> new_s = m.group().upper() + s[m.end():]
>>> new_s
'COOL VIDEO -> www.youtube.com/watch?v=dQw4w9WgXcQ'
EDIT:
Another way is to look for the text preceding the link and then apply the upper method to it:
>>> s = "Cool Video www.youtube.com/watch?v=dQw4w9WgXcQ"
>>> m = re.search(r'(.*)(?=www.*)',s)
>>> s = m.group().upper() + s[m.end():]
>>> s
'COOL VIDEO www.youtube.com/watch?v=dQw4w9WgXcQ'
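Both snippets above assume a single link at the end of the line. A rough sketch that upper-cases everything except link-like tokens, wherever they appear, is given below. The pattern LINK_RE is an assumption about what counts as a link (anything starting with http://, https:// or www.), not a full URL matcher.

```python
import re

# Assumed definition of a "link": a scheme or www. prefix followed by non-space characters.
LINK_RE = re.compile(r'(?:https?://|www\.)\S+', re.IGNORECASE)

def upper_except_links(text):
    """Upper-case `text`, leaving every substring matched by LINK_RE untouched."""
    parts = []
    last = 0
    for m in LINK_RE.finditer(text):
        parts.append(text[last:m.start()].upper())  # text before the link
        parts.append(m.group())                     # the link itself, unchanged
        last = m.end()
    parts.append(text[last:].upper())               # text after the last link
    return ''.join(parts)
```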

Reg ex for multiple characters

I am trying to write a regex that captures dates like:
14-July-2012-11_31_59
I do:
\d{2}-\w{4}-\d{4}-\d{2}_\d{2}_\d{2}$
But the month part here is fixed at 4 characters, while it could be longer, e.g. September.
That is the only variable. The length of the digit parts is ok.
How do I write the word part of the regex to say "at least 3 letters"?

In general, X{n,} means "X at least n times". But \w matches digits and underscores as well, you probably want to use [a-zA-Z]{3,} instead, since month-names shouldn't contain digits or underscores.
\d{2}-[a-zA-Z]{3,}-\d{4}-\d{2}_\d{2}_\d{2}$
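A quick way to check this pattern from Python is sketched below. The leading ^ anchor and the helper name looks_like_timestamp are additions for illustration and are not part of the pattern above.

```python
import re

# Same pattern as above, with ^ added so the whole string must match.
DATE_RE = re.compile(r'^\d{2}-[a-zA-Z]{3,}-\d{4}-\d{2}_\d{2}_\d{2}$')

def looks_like_timestamp(text):
    """True if `text` has the form DD-Month-YYYY-HH_MM_SS with a 3+ letter month."""
    return DATE_RE.match(text) is not None
```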

Try this:
\d{2}-\w{3,}-\d{4}-\d{2}_\d{2}_\d{2}$

Is this something you're looking for...
>>> import re
>>> a = '14-July-2012-11_31_59'
>>>
>>> pat = r'\b\d{2}\-\w{3,}\-\d{2,4}\-\d{2}\_\d{2}\_\d{2}\b'
>>> regexp = re.compile(pat)
>>> m = regexp.match(a)
>>> m
<_sre.SRE_Match object at 0xa54c870>
>>> m.group()
'14-July-2012-11_31_59'
>>> m = regexp.match('14-September-2012-11_31_59')
>>> m.group()
'14-September-2012-11_31_59'
>>> m = regexp.match('14-September-12-11_31_59')
>>> m.group()
'14-September-12-11_31_59'
>>> m = regexp.match('14-Sep-12-11_31_59')
>>> m.group()
'14-Sep-12-11_31_59'
>>> m = regexp.match('14-Se-12-11_31_59')
>>> m.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>>

Python error: could not convert string to float

I have some Python code that pulls strings out of a text file:
[2.467188005806714e-05, 0.18664554919828535, 0.5026880460053854, ....]
Python code:
v = string[string.index('['):].split(',')
for elem in v:
    new_list.append(float(elem))
This gives an error:
ValueError: could not convert string to float: [2.974717463860223e-06
Why can't [2.974717463860223e-06 be converted to a float?

You've still got the [ in front of your "float" which prevents parsing.
Why not use a proper module for that? For example:
>>> a = "[2.467188005806714e-05, 0.18664554919828535, 0.5026880460053854]"
>>> import json
>>> b = json.loads(a)
>>> b
[2.467188005806714e-05, 0.18664554919828535, 0.5026880460053854]
or
>>> import ast
>>> b = ast.literal_eval(a)
>>> b
[2.467188005806714e-05, 0.18664554919828535, 0.5026880460053854]
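If the bracketed list is embedded in a longer line of the file, as the string.index('[') in the question suggests, the slice can be narrowed to the brackets before parsing. A minimal sketch, assuming exactly one [...] list per line and a hypothetical helper name extract_floats:

```python
import json

def extract_floats(line):
    """Parse the first [...] list found in `line` into a list of floats."""
    start = line.index('[')        # raises ValueError if there is no '['
    end = line.index(']', start)   # first ']' after the opening bracket
    # Slice includes both brackets so the fragment is a valid JSON array.
    return [float(x) for x in json.loads(line[start:end + 1])]
```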

You can do the following to convert the string you read from your file into a list of floats:
>>> instr="[2.467188005806714e-05, 0.18664554919828535, 0.5026880460053854]"
>>> [float(e) for e in instr.strip("[] \n").split(",")]
[2.467188005806714e-05, 0.18664554919828535, 0.5026880460053854]
Your code fails because you are not stripping off the '[' from the string.

You are capturing the first bracket; change string.index("[") to string.index("[") + 1.

This will give you a list of floats without the need for extra imports etc.
s = '[2.467188005806714e-05, 0.18664554919828535, 0.5026880460053854]'
s = s[1:-1]
float_list = [float(n) for n in s.split(',')]
[2.467188005806714e-05, 0.18664554919828535, 0.5026880460053854]

v = string[string.index('[') + 1:].split(',')
index() returns the index of the given character, so without the + 1 the '[' is included in the sequence returned by the slice.

Slice a string after a certain phrase?

I've got a batch of strings that I need to cut down. They're basically a descriptor followed by codes. I only want to keep the descriptor.
'a descriptor dps 23 fd'
'another 23 fd'
'and another fd'
'and one without a code'
The codes above are dps, 23 and fd. They can come in any order, are unrelated to each other and might not exist at all (as in the last case).
The list of codes is fixed (or can be predicted, at least), so assuming a code is never used within a legitimate descriptor, how can I strip off everything after the first instance of a code?
I'm using Python.

The short answer, as @THC4K points out in a comment:
string.split(pattern, 1)[0]
where string is your original string, pattern is your "break" pattern, 1 indicates to split no more than 1 time, and [0] means take the first element returned by split.
In action:
>>> s = "a descriptor 23 fd"
>>> s.split("23", 1)[0]
'a descriptor '
>>> s.split("fdasfdsafdsa", 1)[0]
'a descriptor 23 fd'
This is a much shorter way of expressing what I had written earlier, which I will keep here anyway.
And if you need to remove multiple patterns, this is a great candidate for the reduce builtin (functools.reduce in Python 3):
>>> string = "a descriptor dps foo 23 bar fd quux"
>>> patterns = ["dps", "23", "fd"]
>>> reduce(lambda s, pat: s.split(pat, 1)[0], patterns, string)
'a descriptor '
>>> reduce(lambda s, pat: s.split(pat, 1)[0], patterns, "uiopuiopuiopuipouiop")
'uiopuiopuiopuipouiop'
This basically says: for each pat in patterns: take string and repeatedly apply string.split(pat, 1)[0] (like explained above), operating on the result of the previously returned value each time. As you can see, if none of the patterns are in the string, the original string is still returned.
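On Python 3, reduce has to be imported from functools. The same idea wrapped in a function might look like the sketch below; the name cut_at_first_code is made up for illustration.

```python
from functools import reduce

def cut_at_first_code(text, patterns):
    """Apply text.split(pat, 1)[0] once per pattern, feeding each result forward."""
    return reduce(lambda s, pat: s.split(pat, 1)[0], patterns, text)
```

Because each step only ever shortens the string, the final result ends at whichever pattern occurs earliest in the original text.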
The simplest answer is a list/string slice combined with a string.find:
>>> s = "a descriptor 23 fd"
>>> s[:s.find("fd")]
'a descriptor 23 '
>>> s[:s.find("23")]
'a descriptor '
>>> s[:s.find("gggfdf")] # <-- look out! last character got cut off
'a descriptor 23 f'
A better approach (to avoid cutting off the last character in a missing pattern when s.find returns -1) might be to wrap in a simple function:
>>> def cutoff(string, pattern):
...     idx = string.find(pattern)
...     return string[:idx if idx != -1 else len(string)]
...
>>> cutoff(s, "23")
'a descriptor '
>>> cutoff(s, "asdfdsafdsa")
'a descriptor 23 fd'
The [:s.find(x)] syntax means take the part of the string from index 0 until the right-hand side of the colon; and in this case, the RHS is the result of s.find, which returns the index of the string you passed.
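As an aside, str.partition sidesteps the -1 problem entirely: when the separator is missing it returns the whole string as its first element. A short sketch of cutoff written that way:

```python
def cutoff(text, pattern):
    """Return the part of `text` before the first `pattern`, or all of it if absent."""
    # partition returns (head, separator, tail); head is the full string when
    # the separator is not found.
    head, _sep, _tail = text.partition(pattern)
    return head
```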

You seem to be describing something like this:
def get_descriptor(text):
    codes = ('12', 'dps', '23')
    for c in codes:
        try:
            return text[:text.index(c)].rstrip()
        except ValueError:
            continue
    raise ValueError("No descriptor found in `%s'" % (text))
E.g.,
>>> get_descriptor('a descriptor dps 23 fd')
'a descriptor'

codes = ('12', 'dps', '23')
def get_descriptor(text):
    words = text.split()
    for c in codes:
        if c in words:
            i = words.index(c)
            return " ".join(words[:i])
    raise ValueError("No code found in `%s'" % (text))

I'd probably use a regular expression to do this:
>>> import re
>>> descriptors = ('foo x', 'foo y', 'bar $', 'baz', 'bat')
>>> data = ['foo x 123', 'foo y 123', 'bar $123', 'baz 123', 'bat 123', 'nothing']
>>> p = re.compile("(" + "|".join(map(re.escape, descriptors)) + ")")
>>> for s in data:
...     m = re.match(p, s)
...     if m: print(m.groups()[0])
...
foo x
foo y
bar $
baz
bat
It wasn't entirely clear to me whether you want what you're extracting to include text that precedes the descriptors, or if you expect each line of text to start with a descriptor; the above deals with the latter. For the former, just change the pattern slightly to make it capture all characters before the first occurrence of a descriptor:
>>> p = re.compile("(.*(" + "|".join(map(re.escape, descriptors)) + "))")

Here's an answer that works for all codes rather than forcing you to call the function for each code, and is a bit simpler than some of the answers above. It also works for all of your examples.
strings = ('a descriptor dps 23 fd', 'another 23 fd', 'and another fd',
           'and one without a code')
codes = ('dps', '23', 'fd')
def strip(s):
    try:
        return s[:min(s.find(c) for c in codes if c in s)]
    except ValueError:
        return s
print(list(map(strip, strings)))
Output:
['a descriptor ', 'another ', 'and another ', 'and one without a code']
I believe this satisfies all of your criteria.
Edit: I quickly realized you could remove the try/except if you don't like relying on the exception:
def strip(s):
    if not any(c in s for c in codes):
        return s
    return s[:min(s.find(c) for c in codes if c in s)]
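The substring checks above would also cut at a code that happens to sit inside a longer word. If codes only count as whole words, a regex with word boundaries is one way to express that. A sketch under that assumption follows; the names CODES, CODE_RE and strip_codes are illustrative, and the trailing whitespace is trimmed, unlike in the output above.

```python
import re

CODES = ('dps', '23', 'fd')
# \b...\b makes each code match only as a whole word; re.escape guards
# against codes that contain regex metacharacters.
CODE_RE = re.compile(r'\b(?:' + '|'.join(map(re.escape, CODES)) + r')\b')

def strip_codes(text):
    """Keep the text before the first whole-word code and trim trailing spaces."""
    return CODE_RE.split(text, maxsplit=1)[0].rstrip()
```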

def crop_string(string, pattern):
    del_items = []
    for indx, val in enumerate(pattern):
        # Split once on each delimiter and remember the part outside it.
        a = string.split(val, 1)
        del_items.append(a[indx])
    for del_item in del_items:
        string = string.replace(del_item, "")
    return string
Example:
I want to crop the string and get only the array out of it.
strin = "crop the array [1,2,3,4,5]"
pattern = ["[", "]"]
Usage:
a = crop_string(strin, pattern)
print(a)
# --- Prints "[1,2,3,4,5]"
