findall not retruning all the results in Python 3.7

findall not retruning all the results in Python 3.7 - python

I am trying to create list of tuples with the data after strings string1 and string3. But not getting expected result.
s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'
re.findall('string1:(\d+)[\s,\S]+string3:([\s\S]+',s)
Actual result:
[('1234', 'b5c6d7)']
Expected result:
[('1234', 'a1b2c3'), ('2345', 'b5c6d7')]

You current regex uses [\s,\S]+ which is greedy and matches all characters until the end of the line.
You could make it non greedy and use a positive lookahead (?=string|$) for the last match that assert what follows is either string or the end of the line $.
string1:(\d+).*?string3:(.*?)(?=string|$)
import re
s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'
print(re.findall('string1:(\d+).*?string3:(.*?)(?=string|$)',s))
Demo

The problem is that [\s,\S]+ is greedy and therefore consuming everything between the first string1 and the last string3.
You can fix that by using positive lookaheads and making the regex non greedy like this:
string1:(\d+)[^\d][\s,\S]+?string3:([\s\S]+?(?=string|$))

Related

Extracting float or int number and substring from a string

I've just learned regex in python3 and was trying to solve a problem.
The problem is something like this:
You have given a string where the first part is a float or integer number and the next part is a substring. You must split the number and the substring and return it as a list. The substring will only contain the alphabet from a-z and A-Z. The values of numbers can be negative.
For example:
Input: 2.5ax
Output:['2.5','ax']
Input: -5bcf
Output:['-5','bcf']
Input:-69.67Gh
Output:['-69.67','Gh']
and so on.
I did several attempts with regex to solve the problem.
1st attempt:
import re
i=input()
print(re.findall(r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$',i))
For the input -2.55xy, the expected output was ['-2.55','xy']
But the output came:
[('-2.55', '.55'), ('', '')]
2nd attempt:
My second attempt was similar to my first attempt just a little different:
import re
i=input()
print(re.findall(r'^(-?(\d+\.\d+)|\d+)|[a-zA-Z]+$',i))
For the same input -2.55xy, the output came as:
[('-2.55', '2.55'), ('', '')]
3rd attempt:
My next attempt was like that:
import re
i=input()
print(re.findall(r'^-?[1-9.]+|[a-z|A-Z]+$',i))
which matched the expected output for -2.55xy and also with the sample examples. But when the input is 2..5 or something like that, it considers that also as a float.
4th attempt:
import re
i=input()
value=re.findall(r"[a-zA-Z]+",i)
print([i.replace(value[0],""),value[0]])
which also matches the expected output but has the same problem as 3rd one that goes with it. Also, it doesn't look like an effective way to do it.
Conclusion:
So I don't know why my 1st and 2nd attempt isn't working. The output comes with a list of tuples which is maybe because of the groups but I don't know the exact reason and don't know how to solve them. Maybe I didn't understand the way the pattern works. Also why the substring didn't show in the output?
In the end, I want to know what's the mistake in my code and how can I write better and more efficient code to solve the problem. Thank you and sorry for my bad English.

The alternation | matches either the left part or the right part.
If the chars a-zA-Z are after the digit, you don't need the alternation | and you can use 2 capture groups to get the matches in that order.
Then using re.findall will return a list of tuples for the capture group values.
(-?\d+(?:\.\d+)?)([a-zA-Z]+)
Explanation
( Capture group 1
-?\d+ Match an optional -
(?:\.\d+)? Optionally match . and 1+ digits using a non capture group (so it is not outputted separately by re.findall)
) Close group 1
( Capture group 2
[a-zA-Z]+ Match 1+ times a char a-z or A-Z
) Close group 2
regex demo
import re
strings = [
"2.5ax",
"-5bcf",
"-69.67Gh",
]
pattern = r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)"
for s in strings:
print(re.findall(pattern, s))
Output
[('2.5', 'ax')]
[('-5', 'bcf')]
[('-69.67', 'Gh')]

lookahead and lookbehind in re.sub simplify things sometimes.
(?<=\d) look behind
(?=[a-zA-Z]) look ahead
that is split between the digit and the letter.
strings = [
"2.5ax",
"-5bcf",
"-69.67Gh",
]
for s in strings:
print(re.split(r'(?<=\d)(?=[a-zA-Z])', s))
['2.5', 'ax']
['-5', 'bcf']
['-69.67', 'Gh']

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend"\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.

Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
Code demo

You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add add it back at every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)/2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00

You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
print('\n\nNew string:')
print('\\x' + '\\x'.join(match))
#for elem in match: # match gives you a list of strings with the hex values
# print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)

This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
See code in use here
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x$1"
result = re.sub(regex, subst, test_str, 0, re.IGNORECASE)
if result:
print (result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexidecimal digit.
([a-f\d]{2}) Capture any two hexidecimal digits into capture group 1.

Regex pattern to match substring

Would like to find the following pattern in a string:
word-word-word++ or -word-word-word++
So that it iterates the -word or word- pattern until the end of the substring.
the string is quite large and contains many words with those^ patterns.
The following has been tried:
p = re.compile('(?:\w+\-)*\w+\s+=', re.IGNORECASE)
result = p.match(data)
but it returns NONE. Does anyone know the answer?

Your regex will only match the first pattern, match() will only find one occurrence, and that only if it is immediately followed by some whitespace and an equals sign.
Also, in your example you implied you wanted three or more words, so here's a version that was changed in the following ways:
match both patterns (note the leading -?)
match only if there are at least three words to the pattern ({2,} instead of +)
match even if there's nothing after the pattern (the \b matches a word boundary. It is not really necessary here, since the preceding \w+ guarantees we are at a word boundary anyway)
returns all matches instead of only the first one.
Here's the code:
#!/usr/bin/python
import re
data=r"foo-bar-baz not-this -this-neither nope double-dash--so-nope -yeah-this-even-at-end-of-string"
p = re.compile(r'-?(?:\w+-){2,}\w+\b', re.IGNORECASE)
print p.findall(data)
# prints ['foo-bar-baz', '-yeah-this-even-at-end-of-string']

Regular expression to find string with iterating letters on the end

Can someone help me with this kind of regular expression matching?
For example, I'm searching through list containing different strings with a letter iterating at the end of the string:
MonsterA
MonsterB
MonsterC
HeroA
HeroB
HeroC
...
What I need this script to return is only the preceding part of the string, in this example Monster and Hero.

If you absolutely need a regex:
re.match(r"(.*)[A-Z]", word).group(1)
But it is not the most efficient if you just want to remove the last character.

You could use a positive lookahead assertion (?=...) to check the words ends in a single uppercase character and then use word boudaries \b...\b to ensure it does not match patterns that arent whole words:
>>> text = "This re will match MonsterA and HeroB but not heroC or MonsterCC"
>>> re.findall(r"\b[A-Z][a-z]+(?=[A-Z]\b)", text)
['Monster', 'Hero']
re.findall returns all such matches in a list.

python regex: get end digits from a string

I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.

You can use re.match to find only the characters:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes only as much as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of capturing digits. Note that the latter also matches digits in other writing schemes, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means multiple entries (at least one number at the end).
$ matches only the end of the input.

Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print re.findall('^.*-([0-9]+)$',s)
>>> ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anthing
- # Upto the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Upto the end of the string
Or more simply just match the digits followed at the end of the string '([0-9]+)$'

Your Regex should be (\d+)$.
\d+ is used to match digit (one or more)
$ is used to match at the end of string.
So, your code should be: -
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.

Use the below regex
\d+$
$ depicts the end of string..
\d is a digit
+ matches the preceding character 1 to many times

Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'

I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
if re.match('.*?([0-9]+)$', W)== None:
last_digits = "None"
else:
last_digits = re.match('.*?([0-9]+)$', W).group(1)
print("Last digits of "+W+" are "+last_digits)

Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

findall not retruning all the results in Python 3.7 - python

The problem is that [\s,\S]+ is greedy and therefore consuming everything between the first string1 and the last string3. You can fix that by using positive lookaheads and making the regex non greedy like this: string1:(\d+)[^\d][\s,\S]+?string3:([\s\S]+?(?=string|$))

Related

Extracting float or int number and substring from a string

Match everything except a pattern and replace matched with string

Regex pattern to match substring

Regular expression to find string with iterating letters on the end

python regex: get end digits from a string

Categories

Resources