I want to slice out substrings using regex - python

import re
str_ = "8983605653Sudanshu452365423256Shinde"
print(re.findall(r"\d{10}\B|[A-Za-z]{8}|\d{12}|[A-Za-z]{6}",str_))
Current output
['8983605653', 'Sudanshu', '4523654232', 'Shinde']
Desired output
['8983605653', 'Sudanshu', '452365423256', 'Shinde']

Using re.findall with the pattern \d+|\D+ should work here:
import re
str_ = "8983605653Sudanshu452365423256Shinde"
matches = re.findall(r'\d+|\D+', str_)
print(matches) # ['8983605653', 'Sudanshu', '452365423256', 'Shinde']
This pattern matches, at each position, either a run of digits (\d+) or a run of non-digits (\D+), so the string is split wherever it switches between the two.
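As for why the pattern in the question fell short: alternatives are tried left to right, so \d{10}\B succeeds on the 12-digit run before \d{12} is ever attempted. A minimal sketch, assuming the fixed field widths from the question are still wanted, lists the longer digit alternative first:

```python
import re

str_ = "8983605653Sudanshu452365423256Shinde"

# Alternation is ordered: the first alternative that matches at a position wins.
original = re.findall(r"\d{10}\B|[A-Za-z]{8}|\d{12}|[A-Za-z]{6}", str_)

# Listing \d{12} before \d{10} lets the 12-digit run be taken whole.
reordered = re.findall(r"\d{12}|\d{10}|[A-Za-z]{8}|[A-Za-z]{6}", str_)

print(original)
print(reordered)
```

The \d+|\D+ pattern avoids the ordering problem entirely because it does not depend on field widths.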

Instead of using an alternation |, you can match the whole string with capture groups and then print the group values.
import re
str_ = "8983605653Sudanshu452365423256Shinde"
m = re.match(r"(\d{10})([A-Za-z]{8})(\d{12})([A-Za-z]{6})",str_)
if m:
    print(list(m.groups()))
Output
['8983605653', 'Sudanshu', '452365423256', 'Shinde']
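If the four fields need to be referred to by name rather than by position, named groups may help. This is only a sketch: the group names are hypothetical labels that do not come from the question, and fullmatch is used on the assumption that the whole string should fit the layout.

```python
import re

str_ = "8983605653Sudanshu452365423256Shinde"

# The group names below are hypothetical labels, not taken from the question.
pattern = re.compile(
    r"(?P<first_number>\d{10})(?P<first_word>[A-Za-z]{8})"
    r"(?P<second_number>\d{12})(?P<second_word>[A-Za-z]{6})"
)

m = pattern.fullmatch(str_)  # fullmatch rejects leading or trailing extras
fields = m.groupdict() if m else {}
print(fields)
```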

Related

Python replace between two chars (no split function)

I currently investigate a problem that I want to replace something in a string.
For example. I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers like '123.49, 19.30'. The split function is not possible, because I have a lot of data, some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working fine. Can someone help me?
Thanks in advance
You could do something like this:
reg = r'(\d+\.\d+), (\d+\.\d+).*'
match = re.search(reg, your_text)  # search once and reuse the result
if match:
    first_num = match.group(1)
    second_num = match.group(2)
Alternatively, add the ^ anchor at the beginning so that only the first two numbers on the line are taken.
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile(r'^(\d+\.?\d*), (\d+\.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In the code you are using import re as regex. If you do that, you would have to use regex.search instead of re.search.
But in this case you can just use import re.
If you use , (.*) you capture everything after the first occurrence of , and you are not taking digits into account.
If you want the first 2 numbers as stated in the question, '123.49, 19.30', separated by commas, you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
    print(match.group())
Output
123.49, 19.30
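Since the question mentions rows with and without the trailing number, here is a small sketch that applies the two-number pattern above to both kinds of row. The helper name first_two_numbers and the sample rows are made up for illustration.

```python
import re

# Exactly two decimal numbers separated by a comma; anything after is ignored.
FIRST_TWO = re.compile(r"\b\d+\.\d+,\s*\d+\.\d+\b")

def first_two_numbers(line):
    """Return the first two comma-separated decimals in line, or None."""
    match = FIRST_TWO.search(line)
    return match.group() if match else None

# Hypothetical rows: with a third value, without one, and with no numbers at all.
rows = ["123.49, 19.30, 02\n", "123.49, 19.30\n", "no numbers here\n"]
results = [first_two_numbers(row) for row in rows]
print(results)
```

Returning None for non-matching rows keeps the caller in charge of how to handle bad data.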

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend "\x" to every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2})
Substitution: \\x$1 (written \\x\1 in Python)
Details:
(?:) Non-capturing group
? Matches zero or one time, so an existing \x is consumed if present
() Capturing group
[] Matches a single character from the listed ranges, here 0-9 and A-Z
{n} Matches exactly n times
\\x The literal string \x
$1 Contents of capture group 1
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add it back before every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)//2)])  # // keeps the count an int in Python 3
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You can get a list of the values to manipulate as you wish, with an even simpler re pattern:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
    print('\n\nNew string:')
    print('\\x' + '\\x'.join(match))
    #for elem in match: # match gives you a list of strings with the hex values
    #    print('\\x{}'.format(elem), end='')
    print('\n\nOriginal string:')
    print(mystr)
This can be done without replacing the existing \x by using a negative lookahead that wraps two positive lookbehinds.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x\1"  # Python writes the group reference as \1, not $1
result = re.sub(regex, subst, test_str, flags=re.IGNORECASE)
if result:
    print(result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring that neither of the following matches at the current position:
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexadecimal digit.
([a-f\d]{2}) Capture any two hexadecimal digits into capture group 1.
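To see that the lookaround substitution and the strip-and-rejoin idea agree, here is a small sketch run on a shortened, made-up input in the same format as the question. It assumes every existing \x is followed by exactly two hex digits.

```python
import re

sample = r"3033\x90\x0A6F"  # shortened, made-up input in the question's format

# Approach 1: only touch hex pairs that are not already preceded by \x.
lookaround = re.compile(r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})", re.IGNORECASE)
via_regex = lookaround.sub(r"\\x\1", sample)

# Approach 2: drop every existing \x, then prefix each two-character chunk.
stripped = sample.replace("\\x", "")
via_chunks = "".join("\\x" + stripped[i:i + 2] for i in range(0, len(stripped), 2))

print(via_regex)
print(via_chunks)
```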

Regex pattern to extract substring

mystring = "q1)whatq2)whenq3)where"
I want something like ["q1)what", "q2)when", "q3)where"]
My approach is to find the q\d+\) pattern then move till I find this pattern again and stop. But I'm not able to stop.
I did req_list = re.compile("q\d+\)[*]\q\d+\)").split(mystring)
But this gives the whole string.
How can I do it?
You could try the code below, which uses the re.findall function:
>>> import re
>>> s = "q1)whatq2)whenq3)where"
>>> m = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)
>>> m
['q1)what', 'q2)when', 'q3)where']
Explanation:
q\d+\) Matches q followed by one or more digits and a closing ) symbol.
(?:(?!q\d+).)* Matches any character, zero or more times, as long as that character does not start another q\d+. The negative look-ahead is checked before each character is consumed.
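A different way to express the same "stop at the next marker" idea is a lazy quantifier followed by a lookahead. This is a sketch of an alternative, not the pattern from the answer above; it assumes the text contains no newlines and that a marker always ends with a closing parenthesis.

```python
import re

s = "q1)whatq2)whenq3)where"

# Lazily consume text until the next marker (q, digits, closing paren)
# or the end of the string, whichever comes first.
parts = re.findall(r"q\d+\).*?(?=q\d+\)|$)", s)
print(parts)
```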

Using regex assertion in python

I am experimenting with regex. I have read up on assertions a bit and seen examples, but for some reason I cannot get this to work. I am trying to get the word after the following pattern using look-behind.
import re
s = '123abc456someword 0001abde19999anotherword'
re.findall(r'(?<=\d+[a-z]+\d+)[a-z]+', s, re.I)
The results should be someword and anotherword.
But I get error: look-behind requires fixed-width pattern
Any help appreciated.
Python's re module only allows fixed-width patterns in look-behinds. If you want to experiment with variable-length look-behinds, use the alternative regex module:
>>> import regex
>>> s = '123abc456someword 0001abde19999anotherword'
>>> regex.findall(r'(?i)(?<=\d+[a-z]+\d+)[a-z]+', s)
['someword', 'anotherword']
Or simply avoid using look-behind in general and use a capturing group ( ):
>>> import re
>>> s = '123abc456someword 0001abde19999anotherword'
>>> re.findall(r'\d+[a-z]+\d+([a-z]+)', s, re.I)
['someword', 'anotherword']
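The two points above can be checked with the standard library alone. The sketch below confirms that re rejects the variable-width look-behind at compile time and then uses the capturing-group form instead; the variable names are illustrative only.

```python
import re

s = '123abc456someword 0001abde19999anotherword'

# The standard re module refuses variable-width look-behinds when compiling.
try:
    re.compile(r'(?<=\d+[a-z]+\d+)[a-z]+', re.I)
    lookbehind_accepted = True
except re.error:
    lookbehind_accepted = False

# Matching the prefix normally and capturing only the word avoids the limit.
words = [m.group(1) for m in re.finditer(r'\d+[a-z]+\d+([a-z]+)', s, re.I)]

print(lookbehind_accepted)
print(words)
```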
Convert the prefix to a non-capturing group and read the match from group 1:
(?:\d+\w+\d+)(\w+\b)
If you are interested in [a-z] only, change \w to [a-z] in the regex pattern above. Here \b is added to assert position at a word boundary.
Sample code:
import re
p = re.compile(r'(?:\d+\w+\d+)(\w+\b)', re.IGNORECASE)  # the ur'' prefix is not valid in Python 3
test_str = "123abc456someword 0001abde19999anotherword"
print(re.findall(p, test_str))
Another easy method uses a lookahead:
>>> import re
>>> s = '123abc456someword 0001abde19999anotherword'
>>> m = re.findall(r'[a-z]+(?= |$)', s, re.I)
>>> m
['someword', 'anotherword']
It matches one or more letters that must be followed by a space or the end of the string.

Python regular expression search vs match

I'm trying to use a Python regular expression to match 'BrahuiHan' or 'BrahuiYoruba'
>>> re.search(r'((Brahui|Han|Yoruba)+\d+)', '10xBrahuiHan50_10xBrahuiYoruba50n4').groups()
('BrahuiHan50', 'Han')
This only returns one match, the first one. I thought it should return the second one too, i.e. BrahuiYoruba.
If you want to capture all occurrences of a pattern, you need to use re.findall:
>>> import re
>>> re.findall(r'((Brahui|Han|Yoruba)+\d+)', '10xBrahuiHan50_10xBrahuiYoruba50n4')
[('BrahuiHan50', 'Han'), ('BrahuiYoruba50', 'Yoruba')]
re.search only returns the first match.
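The second element of each tuple ('Han', 'Yoruba') appears because a repeated capturing group keeps only its last repetition. If only the combined names are wanted, one option is to make the inner group non-capturing, sketched here with illustrative variable names:

```python
import re

text = '10xBrahuiHan50_10xBrahuiYoruba50n4'

# A repeated capturing group keeps only its last repetition ('Han', 'Yoruba').
with_inner_capture = re.findall(r'((Brahui|Han|Yoruba)+\d+)', text)

# Making the inner group non-capturing leaves one group, so findall returns strings.
names_only = re.findall(r'((?:Brahui|Han|Yoruba)+)\d+', text)

print(with_inner_capture)
print(names_only)
```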
Try:
import re
regex = re.compile(r"((Brahui|Han|Yoruba)+\d{1,})")  # + lets names such as BrahuiHan match as one unit
testString = "" # fill this in
matchArray = regex.findall(testString)
# the matchArray variable contains the list of matches
