ignore first occurrence of letter regex - python

Using the following strings:
9989S90K72MF-1
9989S90S-1
9989S75K60MF-1
9989S75S-1
I would like to extract the following from those strings:
9989S90
9989S90
9989S75
9989S75
So far I have:
(^.*?(?=K|-))
Which gives me:
9989S90
9989S90S
9989S75
9989S75S
Here's a link https://regex101.com/r/d1nQj0/1
I've tried a few different regexes but can't seem to nail it. Is there a way to ignore the first occurrence of a digit/letter, which in my case would be S?

The following regex matches, from the beginning of a line, everything through the first S and then up to but not including the next occurrence of K or S:
^(.*?S.*?)(?=K|S)
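A minimal sketch of how this pattern could be applied to the sample strings from the question. The variable names are illustrative, and re.MULTILINE is assumed so that ^ anchors at the start of each line:

```python
import re

# Sample strings from the question, one per line.
data = "9989S90K72MF-1\n9989S90S-1\n9989S75K60MF-1\n9989S75S-1"

# Lazily match through the first S, then lazily up to the next K or S.
pattern = re.compile(r"^(.*?S.*?)(?=K|S)", re.MULTILINE)

# findall returns the contents of capture group 1 for every line that matches.
result = pattern.findall(data)
print(result)
```

Because both quantifiers are lazy, the match stops at the earliest K or S that follows the first S.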

For the example data, you could also match 1+ digits, then S followed by 1+ digits.
^\d+S\d+
If there has to be an S, K or - on the right:
^\d+S\d+(?=[KS-])
Example
import re
regex = r"^\d+S\d+(?=[KS-])"
s = ("9989S90K72MF-1\n"
"9989S90S-1\n"
"9989S75K60MF-1\n"
"9989S75S-1")
print(re.findall(regex, s, re.MULTILINE))
Output
['9989S90', '9989S90', '9989S75', '9989S75']

Related

Extracting float or int number and substring from a string

I've just learned regex in python3 and was trying to solve a problem.
The problem is something like this:
You are given a string where the first part is a float or integer number and the rest is a substring. You must split the number from the substring and return both as a list. The substring contains only the letters a-z and A-Z. The number can be negative.
For example:
Input: 2.5ax
Output:['2.5','ax']
Input: -5bcf
Output:['-5','bcf']
Input:-69.67Gh
Output:['-69.67','Gh']
and so on.
I did several attempts with regex to solve the problem.
1st attempt:
import re
i=input()
print(re.findall(r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$',i))
For the input -2.55xy, the expected output was ['-2.55','xy']
But the output came:
[('-2.55', '.55'), ('', '')]
2nd attempt:
My second attempt was similar to my first attempt just a little different:
import re
i=input()
print(re.findall(r'^(-?(\d+\.\d+)|\d+)|[a-zA-Z]+$',i))
For the same input -2.55xy, the output came as:
[('-2.55', '2.55'), ('', '')]
3rd attempt:
My next attempt was like that:
import re
i=input()
print(re.findall(r'^-?[1-9.]+|[a-z|A-Z]+$',i))
which matched the expected output for -2.55xy and for the sample examples. But when the input is 2..5 or something like that, it is also accepted as a float.
4th attempt:
import re
i=input()
value=re.findall(r"[a-zA-Z]+",i)
print([i.replace(value[0],""),value[0]])
which also matches the expected output but has the same problem as the 3rd one. Also, it doesn't look like an effective way to do it.
Conclusion:
So I don't know why my 1st and 2nd attempts aren't working. The output is a list of tuples, which may be because of the groups, but I don't know the exact reason or how to fix it. Maybe I didn't understand how the pattern works. Also, why didn't the substring show up in the output?
In the end, I want to know what the mistake in my code is and how I can write better and more efficient code to solve the problem. Thank you and sorry for my bad English.
The alternation | matches either the left part or the right part.
If the chars a-zA-Z come after the number, you don't need the alternation |; you can use 2 capture groups to get the matches in that order.
re.findall will then return a list of tuples with the capture group values.
(-?\d+(?:\.\d+)?)([a-zA-Z]+)
Explanation
( Capture group 1
-?\d+ Match an optional - and 1+ digits
(?:\.\d+)? Optionally match . and 1+ digits using a non capture group (so it is not outputted separately by re.findall)
) Close group 1
( Capture group 2
[a-zA-Z]+ Match 1+ times a char a-z or A-Z
) Close group 2
import re
strings = [
"2.5ax",
"-5bcf",
"-69.67Gh",
]
pattern = r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)"
for s in strings:
    print(re.findall(pattern, s))
Output
[('2.5', 'ax')]
[('-5', 'bcf')]
[('-69.67', 'Gh')]
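If a flat list such as ['2.5', 'ax'] is wanted, as in the question, one possible sketch is to use re.fullmatch with the same pattern, which also rejects inputs such as 2..5ax. The helper name split_number_and_letters is hypothetical and not part of the original answer:

```python
import re

PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)")

def split_number_and_letters(text):
    """Return [number, letters] if the whole string fits the pattern, else None."""
    match = PATTERN.fullmatch(text)
    return list(match.groups()) if match else None

for s in ["2.5ax", "-5bcf", "-69.67Gh", "2..5ax"]:
    print(s, split_number_and_letters(s))
```

fullmatch requires the entire input to match, so malformed numbers are not silently accepted as a prefix.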
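Lookahead and lookbehind with re.split can simplify things sometimes.
(?<=\d) lookbehind: preceded by a digit
(?=[a-zA-Z]) lookahead: followed by a letter
That is, split at the position between a digit and a letter (splitting on an empty match requires Python 3.7 or later).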
import re
strings = [
    "2.5ax",
    "-5bcf",
    "-69.67Gh",
]
for s in strings:
    print(re.split(r'(?<=\d)(?=[a-zA-Z])', s))
['2.5', 'ax']
['-5', 'bcf']
['-69.67', 'Gh']

Extract date from inside a string with Python

I have the following string, in which the leading letters can differ; there can be two, three or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date. I tried:
filename_without_ending.split(".")[0][-6:]
Sometimes it looks like this:
PZA191030_392001_USB
so this solution is not valid, since the format might differ from time to time. The only real pattern is the first six digits.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group:
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
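The question also asks to turn the six digits into a date. A possible sketch, assuming the digits are in YYMMDD order (the question does not state the format), uses datetime.strptime:

```python
import re
from datetime import datetime

regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"

match = re.search(regex, test_str)
parsed = None
if match:
    # Assumption: the six digits are year, month, day (YYMMDD).
    parsed = datetime.strptime(match.group(1), "%y%m%d").date()
print(parsed)
```

strptime raises ValueError for digit groups that are not a valid date under the assumed format, which can serve as a validity check.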
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done with:
filename_without_ending.split(".")[0][2:]
This slices the first dot-separated part from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the leading letters can differ, we have to ignore the letters and extract the digits.
Using the re module (for regular expressions), apply a regex pattern to the string; it returns the part of the string that matches.
\d matches a digit [0-9], and the + quantifier matches one or more of them.
findall() finds all occurrences of the pattern in the given string, while search() finds only the first occurrence.
import re
s = "PR191030.213101.ABD"  # renamed from str to avoid shadowing the built-in
print(re.findall(r"\d+", s)[0])
print(re.search(r"\d+", s).group())

regex capture info in text file after multiple blank lines

I open a complex text file in python, match everything else I need with regex but am stuck with one search.
I want to capture the numbers after the 'start after here' line. The space between the two rows is important, and I plan to split on it later.
start after here: test
5.7,-9.0,6.2
1.6,3.79,3.3
Code:
text = open(r"file.txt","r")
for line in text:
    find = re.findall(r"start after here:[\s]\D+.+", line)
I tried this here https://regexr.com/ and it seems to work but it is for Java.
It doesn't find anything. I assume this is because I need to incorporate multiline matching, but I am unsure how to read the file in differently or how to incorporate it. I have tried many adjustments to the regex without success.
import re
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
m = re.search(r'start after here:([^\n])+\n+(.*)', test_str)
new_str = m[2]
m = re.search(r'(-?\d*\.\d*,?\s*)+', new_str)
print(m[0])
The pattern start after here:[\s]\D+.+ matches the literal words and then a whitespace char using [\s] (you can omit the brackets).
Then 1+ times not a digit is matched, which matches up to just before 5.7. Then 1+ times any character except a newline is matched, which matches 5.7,-9.0,6.2. It will not match the following empty line or the next line.
One option could be to match your string and then capture, in a capturing group, all the following lines that start with a decimal.
\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*
The values including the empty line are in the first capturing group.
For example
import re
regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
matches = re.findall(regex, test_str)
print(matches)
Result
['5.7,-9.0,6.2\n\n1.6,3.79,3.3']
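Since the question mentions splitting the rows later, here is a rough sketch of one way to turn the captured group into rows of floats. The names rows and text are illustrative only:

```python
import re

regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
text = "start after here: test\n\n\n5.7,-9.0,6.2\n\n1.6,3.79,3.3\n"

match = re.search(regex, text)
rows = []
if match:
    # Split the captured block on runs of newlines, then parse each value.
    for line in re.split(r"[\r\n]+", match.group(1)):
        rows.append([float(value) for value in line.split(",")])
print(rows)
```

float() ignores surrounding whitespace, so rows with leading spaces or tabs are still parsed.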
If you want to match the individual numbers (decimals or plain integers) that are followed by a comma or the end of the line, you might split on 1 or more newlines and use:
[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)
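As a small illustration of that pattern, assuming the block of rows has already been extracted into a string (the name block is hypothetical), each line can be scanned with findall:

```python
import re

number = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)")
block = "5.7,-9.0,6.2\n\n1.6,3.79,3.3"

# Split on one or more newlines, then collect the numbers found on each line.
values = [number.findall(line) for line in re.split(r"\n+", block)]
print(values)
```

The values stay strings here; converting them with float() is a separate step.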

python regex combine patterns with AND and group

I am trying to use regex to match something that meets the following conditions:
do not contain a "//" string
contain Chinese characters
pick up those Chinese characters
I read line by line from a file:
f = open("test.js", 'r')
lines = f.readlines()
for line in lines:
    matches = regex.findall(line)
    if matches:
        print(matches)
First I tried to match Chinese characters using following pattern:
re.compile(r"[\u4e00-\u9fff]+")
It works and gives me the output:
['下载失成功']
['下载失败']
['绑定监听']
['该功能暂未开放']
Then I tried to exclude the "//" with the following pattern and combine it to the above pattern:
re.compile(r"^(?=^(?:(?!//).)*$)(?=.*[\u4e00-\u9fff]+).*$")
it gives me the output:
[' showToastByText("该功能暂未开放");']
which is almost right but what I want is only the Chinese characters part.
I tried to add "()" but just can not pick up the part that I want.
Any advice will be appreciated, thanks :)
You don't need such a complex regex just to reject lines containing // and capture a run of consecutive Chinese characters. To discard lines containing //, the negative lookahead (?!.*//) is enough, and to capture the Chinese text you can use [^\u4e00-\u9fff]*([\u4e00-\u9fff]+), so the overall regex becomes:
^(?!.*//)[^\u4e00-\u9fff]*([\u4e00-\u9fff]+)
You can then extract the Chinese characters from the first capture group.
Explanation of above regex:
^ - Start of string
(?!.*//) - Negative lookahead that rejects the match if // is present anywhere in the line
[^\u4e00-\u9fff]* - Matches zero or more non-Chinese characters
([\u4e00-\u9fff]+) - Captures one or more Chinese characters and puts them in the first capture group.
Edit: Here is sample code showing how to capture the text from group 1
import re
s = ' showToastByText("该功能暂未开放");'
m = re.search(r'^(?!.*//)[^\u4e00-\u9fff]*([\u4e00-\u9fff]+)',s)
if m:
    print(m.group(1))
Prints,
该功能暂未开放
Edit: Extracting multiple occurrences of Chinese characters, as mentioned in the comments
To extract multiple occurrences of Chinese characters, you can first check that the string does not contain // and then use findall to extract all the Chinese text. Here is sample code demonstrating this:
import re
arr = ['showToastByText("该功能暂未开放");','//showToastByText("该功能暂未开放");','showToastByText("未开放");','showToastByText("该功能暂xxxxxx未开放");']
for s in arr:
    # re.search finds // anywhere in the string; re.match would only check the start
    if re.search(r'//', s):
        print(s, '--> contains // hence not finding')
    else:
        print(s, '-->', re.findall(r'[\u4e00-\u9fff]+', s))
Prints,
showToastByText("该功能暂未开放"); --> ['该功能暂未开放']
//showToastByText("该功能暂未开放"); --> contains // hence not finding
showToastByText("未开放"); --> ['未开放']
showToastByText("该功能暂xxxxxx未开放"); --> ['该功能暂', '未开放']
You don't need a positive lookahead to get the Chinese characters (a lookahead is zero-width, so it does not capture anything). Instead, that part can be rewritten as a lazy .*? that runs until it finds the desired characters.
As such, using:
^(?=^(?:(?!//).)*$).*?([\u4e00-\u9fff]+).*$
Your first capture group will be the chinese characters
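A short sketch of how this pattern might be used; the second sample line is made up for illustration and is not from the question:

```python
import re

pattern = re.compile(r"^(?=^(?:(?!//).)*$).*?([\u4e00-\u9fff]+).*$")

lines = [
    ' showToastByText("该功能暂未开放");',
    '// showToastByText("下载失败");',  # hypothetical commented-out line
]

results = []
for line in lines:
    m = pattern.search(line)
    # group(1) holds the first run of Chinese characters; None if the line has //.
    results.append(m.group(1) if m else None)
print(results)
```

Note that this captures only the first run of Chinese characters on each line.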

Regex pattern to match substring

Would like to find the following pattern in a string:
word-word-word++ or -word-word-word++
So that it iterates the -word or word- pattern until the end of the substring.
The string is quite large and contains many words with those patterns.
The following has been tried:
p = re.compile('(?:\w+\-)*\w+\s+=', re.IGNORECASE)
result = p.match(data)
but it returns None. Does anyone know the answer?
Your regex will only match the first pattern; match() only matches at the start of the string and finds at most one occurrence, and then only if the pattern is immediately followed by some whitespace and an equals sign.
Also, in your example you implied you wanted three or more words, so here's a version that was changed in the following ways:
match both patterns (note the leading -?)
match only if there are at least three words to the pattern ({2,} instead of +)
match even if there's nothing after the pattern (the \b matches a word boundary. It is not really necessary here, since the preceding \w+ guarantees we are at a word boundary anyway)
returns all matches instead of only the first one.
Here's the code:
#!/usr/bin/env python3
import re
data = r"foo-bar-baz not-this -this-neither nope double-dash--so-nope -yeah-this-even-at-end-of-string"
p = re.compile(r'-?(?:\w+-){2,}\w+\b', re.IGNORECASE)
print(p.findall(data))
# prints ['foo-bar-baz', '-yeah-this-even-at-end-of-string']
