Extracting characters from a string using regex and a Pythonic way - python

I have a string like this: "32H74312"
I want to extract some parts and put them in different variables.
first_part = "32" # always 2 digits
second_part = "H" # always 1 character
third_part = "743" # always 3 digits
fourth_part = "12" # always 2 digits
Is there a Pythonic way to do this?

There's no reason to use a regex for such a simple task.
The pythonic way could be something like:
string = "32H74312"
part1 = string[:2]
part2 = string[2:3]
part3 = string[3:6]
part4 = string[6:]

If the string always has the same length, you can also slice from the end with negative indices:
string = "32H74312"
first_part = string[:2] # always 2 digits
second_part = string[2:-5] # always 1 character
third_part = string[3:-2] # always 3 digits
fourth_part = string[-2:] # always 2 digits

Since you have a fixed number of characters to capture, you can use this pattern:
(\d\d)(\w)(\d{3})(\d\d)
You can then utilize re.match.
import re
pattern = r"(\d\d)(\w)(\d{3})(\d\d)"
string = "32H74312"
first_part, second_part, third_part, fourth_part = re.match(pattern, string).groups()
print(first_part, second_part, third_part, fourth_part)
Which outputs:
32 H 743 12
That said, unless you want an easy way to enforce that each part consists of digits or word characters, this isn't really something you need a regex for.
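If validation is the reason for reaching for a regex, a minimal sketch might look like the following. The function name `split_code` and the choice of `[A-Za-z]` for the middle character are assumptions made for illustration; `re.fullmatch` is used so that strings with extra trailing characters are rejected rather than silently truncated.

```python
import re

# Two digits, one ASCII letter, three digits, two digits.
# [A-Za-z] is an assumption; the original answer uses the looser \w.
CODE_PATTERN = re.compile(r"(\d{2})([A-Za-z])(\d{3})(\d{2})")

def split_code(code):
    """Return the four parts of `code`, or raise ValueError if it is malformed."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"unexpected format: {code!r}")
    return match.groups()

first_part, second_part, third_part, fourth_part = split_code("32H74312")
print(first_part, second_part, third_part, fourth_part)
```

Calling `re.match(...).groups()` directly raises an `AttributeError` on a bad input, because `re.match` returns `None` when nothing matches; checking for `None` first gives a clearer error.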

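This is also quite 'pythonic' (it relies on dicts preserving insertion order, which is guaranteed from Python 3.7):
string = "32H74312"
parts = {0: 2, 2: 3, 3: 6, 6: 8}  # start index -> end index
string_parts = [string[start:end] for start, end in parts.items()]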
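The same idea can be expressed with field widths instead of start/end pairs, which may be easier to keep in sync with a format description. This is only a sketch; `split_by_widths` is a hypothetical helper name, not something from the answers above.

```python
from itertools import accumulate

def split_by_widths(text, widths):
    """Cut `text` into consecutive pieces with the given widths."""
    ends = list(accumulate(widths))   # running totals: 2, 3, 6, 8
    starts = [0] + ends[:-1]          # each piece starts where the previous ended
    return [text[a:b] for a, b in zip(starts, ends)]

print(split_by_widths("32H74312", [2, 1, 3, 2]))
```

Any characters beyond the sum of the widths are ignored, so a length check may be worth adding if the input is not trusted.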

Expanding on Pedro's excellent answer, string slicing syntax is the best way to go.
However, having variables like first_part, second_part, . . . nth_part is typically considered an anti-pattern; you are probably looking for a tuple instead:
s = "32H74312"  # avoid naming a variable str, which shadows the built-in type
parts = (s[:2], s[2], s[3:6], s[6:])
print(parts)
print(parts[0], parts[1], parts[2], parts[3])
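If the positions have a meaning, a named tuple keeps the tuple behaviour while giving each part a readable name. The sketch below assumes the field names `first`, `second`, `third` and `fourth` (mirroring the question) and a hypothetical `parse` helper.

```python
from typing import NamedTuple

class Parts(NamedTuple):
    first: str   # 2 digits
    second: str  # 1 character
    third: str   # 3 digits
    fourth: str  # 2 digits

def parse(code: str) -> Parts:
    """Slice a fixed-width code into its four named parts."""
    return Parts(code[:2], code[2], code[3:6], code[6:])

parts = parse("32H74312")
print(parts.third)   # access by name
print(parts[2])      # or by position, like a plain tuple
```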

You can use this method:
import re
line = '32H74312'
d2p = r'(\d\d)' # two digits pattern
ocp = r'(\w)' # one char pattern
d3p = r'(\d{3})' # three digits pattern
lst = re.match(d2p + ocp + d3p + d2p, line).groups()
for item in lst:
    print(item)
Parentheses are necessary to create capturing groups. Also, to make testing your regexps more comfortable, you can use dedicated tools such as regex101.

Related

How to get integer for two characters in python

a = "a26lsdm3684"
How can I get an integer with the value 26 (from a[1] and a[2])? If I write int(a[1]) or int(a[2]), it just gives me the integer of one character. What should I write when I want an integer with the value 26 stored in a variable b?
Slice out the two characters, then convert:
b = int(a[1:3]) # Slices are exclusive on the end index, so you need to go to 3 to get 1 and 2
You can get a substring out of the string and convert it to int, as long as you know the exact indexes:
a = "a26lsdm3684"
substring_of_a = a[1:3]
number = int(substring_of_a)
print(number, type(number))
There is more than one way to do it.
Use Slicing, as pointed out by jasonharper and ShadowRanger.
Or use re.findall to find the first stretch of digits.
Or use re.split to split on non-digits and find the 2nd element (the first one is an empty string at the beginning).
import re
a = "a26lsdm3684"
print(int(a[1:3]))
print(int((re.findall(r'\d+', a))[0]))
print(int((re.split(r'\D+', a))[1]))
# 26
A little more sustainable if you want multiple numbers from the same string:
def get_numbers(input_string):
    i = 0
    buffer = ""
    out_list = []
    while i < len(input_string):
        if input_string[i].isdigit():
            buffer = buffer + input_string[i]
        else:
            if buffer:
                out_list.append(int(buffer))
            buffer = ""
        i = i + 1
    if buffer:
        out_list.append(int(buffer))
    return out_list
a = "a26lsdm3684"
print(get_numbers(a))
output:
[26, 3684]
If you want to convert all the numeric parts in your string, and say put them in a list, you may do something like:
from re import finditer
a = "a26lsdm3684"
s = [int(m.group(0)) for m in finditer(r'\d+', a)]  # [26, 3684]

How to extract a substring from a string in Python 3

I am trying to pull a substring out of a function result, but I'm having trouble figuring out the best way to strip the necessary string out using Python.
Output Example:
[<THIS STRING-STRING-STRING THAT THESE THOSE>]
In this example, I would like to grab "STRING-STRING-STRING" and throw away all the rest of the output. In this example, "[<THIS " &" THAT THESE THOSE>]" are static.
Many many ways to solve this. Here are two examples:
First one is a simple replacement of your unwanted characters.
targetstring = '[<THIS STRING-STRING-STRING THAT THESE THOSE>]'
#ALTERNATIVE 1
newstring = targetstring.replace(r" THAT THESE THOSE>]", '').replace(r"[<THIS ", '')
print(newstring)
and this drops everything except your target pattern:
#ALTERNATIVE 2
match = "STRING-STRING-STRING"
start = targetstring.find(match)
stop = len(match)
targetstring[start:start+stop]
These can be shortened but thought it might be useful for OP to have them written out.
I found this extremely useful, might be of help to you as well: https://www.computerhope.com/issues/ch001721.htm
If by '"[<THIS " &" THAT THESE THOSE>]" are static' you mean that they are always the exact same string, then:
s = "[<THIS STRING-STRING-STRING THAT THESE THOSE>]"
before = len("[<THIS ")
after = len(" THAT THESE THOSE>]")
s[before:-after]
# 'STRING-STRING-STRING'
Like so (as long as the position of the characters in the string doesn't change):
myString = "[<THIS STRING-STRING-STRING THAT THESE THOSE>]"
myString = myString[7:27]
Another alternative method:
import re
my_str = "[<THIS STRING-STRING-STRING THAT THESE THOSE>]"
string_pos = [(s.start(), s.end()) for s in re.finditer('STRING-STRING-STRING', my_str)]
start, end = string_pos[0]
print(my_str[start:end])  # end() is already exclusive, so no + 1 is needed
STRING-STRING-STRING
If the STRING-STRING-STRING occurs multiple times in the string, start and end indexes of the each occurrences will be given as tuples in string_pos.
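When the surrounding text is static but the wanted part is not known in advance, searching for the delimiters rather than the target may be more robust. A minimal sketch, assuming a hypothetical helper named `between`:

```python
def between(text, prefix, suffix):
    """Return the text between the first `prefix` and the next `suffix`, or None."""
    start = text.find(prefix)
    if start == -1:
        return None
    start += len(prefix)              # skip past the prefix itself
    end = text.find(suffix, start)    # look for the suffix only after the prefix
    if end == -1:
        return None
    return text[start:end]

s = "[<THIS STRING-STRING-STRING THAT THESE THOSE>]"
print(between(s, "[<THIS ", " THAT THESE THOSE>]"))
```

Unlike the fixed-index slice, this keeps working if the middle part changes length.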

String.split() after n characters

I can split a string like this:
string = 'ABC_elTE00001'
string = string.split('_elTE')[1]
print(string)
How do I automate this, so I don't have to pass '_elTE' to the function? Something like this:
string = 'ABC_elTE00001'
string = string.split('_' + 4 characters)[1]
print(string)
Use a regex. The re module has re.split, which works like str.split except that it splits on a regex pattern; the docs are worth a look:
>>> import re
>>> string = 'ABC_elTE00001'
>>> re.split(r'_\w{4}', string)
['ABC', '00001']
>>>
The above example is using a regex pattern as you see.
split() on _ and take everything after the first four characters.
s = 'ABC_elTE00001'
# s.split('_')[1] gives elTE00001
# To get the string after 4 chars, we'd slice it [4:]
print(s.split('_')[1][4:])
OUTPUT:
00001
You can use a regular expression to automate the extraction. This one captures the run of digits at the end of the string:
import re
string = 'ABC_elTE00001'
data = re.findall(r'.([0-9]*$)', string)
print(data)  # ['00001']
This is a rather horrible version that exactly "translates" string.split('_' + 4 characters)[1]:
s = 'ABC_elTE00001'
s.split(s[s.find("_"):(s.find("_")+1)+4])[1]
>>> '00001'
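A somewhat more readable take on the same "marker plus n characters" idea uses `str.partition` and a slice. This is a sketch only; the helper name `after_marker` and its default arguments are assumptions chosen to match the example.

```python
def after_marker(text, marker="_", skip=4):
    """Return what follows `marker` plus the next `skip` characters."""
    head, sep, tail = text.partition(marker)
    if not sep:
        raise ValueError(f"{marker!r} not found in {text!r}")
    return tail[skip:]

print(after_marker("ABC_elTE00001"))
```

`partition` splits only on the first occurrence of the marker, so later underscores in the string are left in the result.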

Python string regular expression

I need to do a string compare to see if 2 strings are equal, like:
>>> x = 'a1h3c'
>>> x == 'a__c'
>>> True
independent of the 3 characters in middle of the string.
You need to use anchors.
>>> import re
>>> x = 'a1h3c'
>>> pattern = re.compile(r'^a.*c$')
>>> pattern.match(x) is not None
True
This checks that the first and last characters are a and c, and it doesn't care which characters appear in the middle.
If you want to check that exactly three characters appear in the middle, you could use this:
>>> pattern = re.compile(r'^a...c$')
>>> pattern.match(x) is not None
True
Note that the end-of-line anchor $ is important: without it, a...c would also match afoocbarbuz.
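To get closer to the `x == 'a___c'` notation from the question, a template with a wildcard character can be turned into a pattern. The sketch below assumes `_` is the wildcard and uses hypothetical helper names; `re.fullmatch` plays the role of the `^` and `$` anchors.

```python
import re

def template_to_regex(template, wildcard="_"):
    """Build a regex where each wildcard matches exactly one character."""
    pieces = ["." if ch == wildcard else re.escape(ch) for ch in template]
    return re.compile("".join(pieces))

def matches(template, text):
    # fullmatch requires the whole string to fit, like anchoring with ^ and $
    return template_to_regex(template).fullmatch(text) is not None

print(matches("a___c", "a1h3c"))
print(matches("a___c", "afoocbarbuz"))
```

Literal characters are passed through `re.escape`, so a template containing `.` or `*` is matched literally rather than as regex syntax.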
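Your problem could be solved with string indexing, but if you want an intro to regex, here you go. The general form is:
import re
your_match_object = re.match(pattern, string)  # None if there is no match
The pattern in your case would be
pattern = re.compile("a...c") # the dot denotes any character except a newline
From here, you can check whether your string fits this pattern with the line below. Note that match only anchors at the start of the string, so add $ or use fullmatch if longer strings should be rejected.
print(pattern.match("a1h3c") is not None)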
https://docs.python.org/2/howto/regex.html
https://docs.python.org/2/library/re.html#search-vs-match
if str1[0] == str2[0]:
    pass  # do something
You can repeat this statement as many times as you like.
This is indexing: we're getting the first character. To get the last character, use [-1].
I'll also mention that with indexing, the string can be any length, as long as you know the position relative to the beginning or the end of the string.
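Put together, the regex-free check for the question's example could be sketched as follows. The helper name `same_ends` is hypothetical, and the length check is an assumption based on the question asking for exactly three characters in the middle.

```python
def same_ends(text, first, last, length):
    """True if `text` has the given length and the given first/last characters."""
    return len(text) == length and text.startswith(first) and text.endswith(last)

x = "a1h3c"
print(same_ends(x, "a", "c", 5))
```

Using `startswith` and `endswith` instead of `x[0]` and `x[-1]` avoids an `IndexError` on an empty string.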

replacing all regex matches in single line

I have a dynamic regexp, and I don't know in advance how many groups it has.
I would like to wrap every matched group in XML tags.
Example:
re.sub("(this).*(string)", r'<markup>\anygroup</markup>', "this is my string")
>> "<markup>this</markup> is my <markup>string</markup>"
Is that even possible in a single line?
For a constant regexp like the one in your example, do
re.sub("(this)(.*)(string)",
       r'<markup>\1</markup>\2<markup>\3</markup>',
       text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
                         else s for n, s in enumerate(m.groups())),
       text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
                         else s for n, s in enumerate(m.groups())),
       text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
pattern = "(this).*(string)"
def replacement(m):
    s = m.group()
    offset = m.start()  # m.span(i) is relative to the whole text, not to the match
    n_groups = len(m.groups())
    # assume groups do not overlap and are listed left-to-right
    for i in range(n_groups, 0, -1):
        lo, hi = m.span(i)
        lo, hi = lo - offset, hi - offset
        s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
    return s
re.sub(pattern, replacement, text)
If you need to handle overlapping groups, you're on your own, but it should be doable.
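As a self-contained illustration of the span-based approach, here is a sketch that also copes with a match that does not start at the beginning of the text and with optional groups that did not participate. The name `wrap_groups` and the sample text are assumptions; overlapping or nested groups are still not handled.

```python
import re

def wrap_groups(match, tag="markup"):
    """Return the matched text with every participating group wrapped in <tag>."""
    whole = match.group(0)
    base = match.start()  # group spans are offsets into the full text
    spans = [match.span(i) for i in range(1, match.re.groups + 1)]
    # Work right-to-left so inserting tags does not shift the remaining offsets.
    for lo, hi in sorted((s for s in spans if s != (-1, -1)), reverse=True):
        lo, hi = lo - base, hi - base
        whole = whole[:lo] + f"<{tag}>" + whole[lo:hi] + f"</{tag}>" + whole[hi:]
    return whole

text = "so this is my string, ok"
result = re.sub(r"(this).*(string)", wrap_groups, text)
print(result)
```

Because `re.sub` accepts a function as the replacement, the pattern can change freely without the replacement having to know how many groups it contains.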
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
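If the word list comes from user input, the words may contain characters that have a special meaning in a regex. A hedged variant of the same idea escapes each word first; `mark_words` is a hypothetical helper name, and sorting by length is an extra precaution so that longer alternatives are tried before shorter ones sharing a prefix.

```python
import re

def mark_words(text, words, tag="markup"):
    """Wrap every whole-word occurrence of any of `words` in <tag>...</tag>."""
    # Escape each word so characters like '.' or '+' are matched literally.
    alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternation})\b")
    return pattern.sub(rf"<{tag}>\1</{tag}>", text)

print(mark_words("this is my string with many words!", ["this", "string", "words"]))
```

Note that `\b` only behaves as a word boundary next to word characters, so entries that start or end with punctuation would need a different boundary check.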
