Python get N characters from string with specific substring - python

I have a very long string that I extracted from an image file.
The string can look like this
...\n\nDate: 01.01.2022\n\nArticle-no: 123456789\n\nArticle description: asdfqwer 1234...\n...
How do I extract just the 10 characters after the substring "Article-no:"?
I tried solving it with a different approach using rfind, like this, but it fails every now and then if the start and end strings are not accurate.
s = "... string shown above ..."
start = "Article-no: "
end = "Article description: "
print(s[s.find(start)+len(start):s.rfind(end)])

You can use split (s is the string from the question):
s.split("Article-no: ", 1)[1][0:10]
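The one-liner above raises an IndexError when the marker is missing from the string. A minimal sketch of a more defensive variant, assuming str.partition is acceptable here; the helper name chars_after and its default width are made up for illustration:

```python
def chars_after(text, marker, n=10):
    """Return up to n characters following the first occurrence of marker,
    or None if marker does not occur in text."""
    _, found, rest = text.partition(marker)
    if not found:
        # partition returns an empty separator when marker is absent
        return None
    return rest[:n]


s = "...\n\nDate: 01.01.2022\n\nArticle-no: 123456789\n\nArticle description: asdfqwer 1234...\n..."
print(repr(chars_after(s, "Article-no: ")))
```

Note that the sample article number has only 9 digits, so a fixed 10-character slice also picks up the following newline; calling .strip() on the result removes it.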

For this, a regular expression might come in very handy.
import re
# Create a pattern which matches "Article-no: " literally,
# and then grabs the digits that follow.
pattern = re.compile(r"Article-no: (\d+)")
s = "...\n\nDate: 01.01.2022\n\nArticle-no: 123456789\n\nArticle description: asdfqwer 1234...\n..."
match = pattern.search(s)
if match:
    print(match.group(1))
This outputs:
123456789
The regular expression used is Article-no: (\d+), which has the following parts:
Article-no: # Match this text literally
( # Open a new group (i.e. group 1)
\d+ # Match 1 or more occurrences of a digit
) # Close group 1
The re module will search the string for places where this matches, and then you can extract the digits from the match.
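If several labelled fields need to be read from the same text, the pattern can be built from the label. This is only a sketch: the helper name field_digits is hypothetical, and it assumes every field has the form "label: digits".

```python
import re


def field_digits(text, label):
    """Return the run of digits that follows 'label:' in text, or None.

    re.escape makes sure characters such as '-' or '.' in the label
    are matched literally rather than as regex syntax.
    """
    match = re.search(re.escape(label) + r":\s*(\d+)", text)
    return match.group(1) if match else None


s = "...\n\nDate: 01.01.2022\n\nArticle-no: 123456789\n\nArticle description: asdfqwer 1234...\n..."
print(field_digits(s, "Article-no"))
```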

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop to match the string with a specific number, which is the number between the second occurrence of "_" and ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matched string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use regex expression. Based on this answer, I have tested this regex expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and it also doesn't make sure that it takes the exact number, so if the number is 1, it might also select items with number 10.
My end goal is to match the items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regular expression. I am looking for help with the regex.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _
What you could do to get the path(s) is to make the number a variable for the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 times (non consecutive) _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif followed by any char except /
$ End of string
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
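numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
rsplit("_", 1) splits once at the last underscore, so the last element is e.g. "1.tif.tif" (splitting from the left with split("_", 3) breaks on "imgs_foldeer/...", which has an extra underscore in the directory name)
extract the last element
split again, by the dot this time, and take the first element (that's your number as a string), then cast it to int
Please keep in mind it only works for the format you provided; there is no flexibility at all in this solution (on the other hand, the regex doesn't give any either).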
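Since the question asks for the whole path for a given number, the split-based idea can be turned into a lookup table. A rough sketch, assuming the number always sits between the last "_" and the first "." after it; trailing_number and index_by_number are illustrative names, not part of any library:

```python
from collections import defaultdict


def trailing_number(path):
    """Number between the last '_' and the first '.' that follows it."""
    tail = path.rsplit("_", 1)[-1]       # e.g. "10.tif.tif"
    return int(tail.split(".", 1)[0])    # e.g. 10


def index_by_number(paths):
    """Map each trailing number to the list of paths that carry it."""
    by_number = defaultdict(list)
    for p in paths:
        by_number[trailing_number(p)].append(p)
    return dict(by_number)


data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
        "imgs/foldeer/img_ABC_15431_10.tif.tif",
        "imgs/foldeer/img_GHC_561321_2.tif.tif",
        "imgs_foldeer/img_BCL_871125_21.tif.tif"]
index = index_by_number(data)
print(index.get(1, []))
```

Looking up index.get(number, []) then returns only exact matches, so 1 does not also select 10 or 21.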
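Mostly without regex: cut the string after the last underscore with rindex, then use a small regex only to pull out the digits.
import re
s = 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last = s[s.rindex('_') + 1:]
print(re.findall(r'\d+', last)[0])
Output
10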
[0-9]+(?=\.tif\.tif)
This regular expression uses a lookahead to match the digits directly before .tif.tif (the number you're looking for) without consuming the extension. Using + instead of * avoids empty matches.
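A short sketch of how such a lookahead could be used from Python. The names NUMBER_BEFORE_EXT and number_of are invented for this example, and the pattern is anchored with $ on the assumption that every path ends in .tif.tif:

```python
import re

# Digits that are immediately followed by ".tif.tif" at the end of the string.
# The lookahead checks the extension but does not include it in the match.
NUMBER_BEFORE_EXT = re.compile(r"\d+(?=\.tif\.tif$)")


def number_of(path):
    """Return the number before '.tif.tif' as an int, or None if absent."""
    m = NUMBER_BEFORE_EXT.search(path)
    return int(m.group()) if m else None


paths = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
         "imgs/foldeer/img_ABC_15431_10.tif.tif",
         "imgs_foldeer/img_BCL_871125_21.png.tif"]
wanted = 1
print([p for p in paths if number_of(p) == wanted])
```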
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)  # raw f-string so \. is not treated as a string escape
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

how to check input string with list of pattern sequentially in python?

I have specific patterns composed of strings, numbers, and special characters in a specific order. I would like to check whether an input string matches the patterns I created and print an error on incorrect input. I tried using regex, but my code is not neat enough. Could someone help me with this?
use case
I have input att2_epic_app_clm1_sub_valid, where I split them by _; here is list of pattern I am expecting to check and print error if not match.
Rule:
input should start with att and some number like [att][0-6]*, or [ptt][0-6]; after that it should be continued at either epic or semi, then it should be continued with [app][0-6] or [app][0-6_][clm][0-9_]+[sub|sup]; then it should end with [valid|Invalid]
so I composed this pattern with re but when I passed invalid input, it is not detected and I expect error instead.
import re
acceptable_pattern=re.compile(r'([att]+[0-6_])(epic|semi_)([app]+[0-6_]+[clm]+[0-6_])([sub|sup_])([valid|invalid])')
input='att1_epic_app2_clm1_sub_valid' # this is valid string
wlist=input.split('_')
for each in wlist:
    if any(ext in each for ext in acceptable_pattern):
        print("valid")
    else:
        print("invalid")
This is not quite working, because I have to check the string from beginning to end: after splitting the string by _, each piece must match one of the predefined rules:
the input string should start with att|ptt followed by a digit between 1-6; the next word is either epic or semi; then it should be app, or app1~app6, or app{1_6}clm{1~6}{sub|sup_}; then the string ends with {valid|invalid}.
How should I specify those rules with re.compile to check the input string and raise an error if the pieces are not in sequence? Is there a quick way of doing this in Python?
Instead of using split, you could consider writing a pattern that validates the whole string.
If I am reading the requirements correctly, you might use:
^[ap]tt[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$
^ Start of string
[ap]tt[0-6] match att or ptt and a digit 0-6
_(?:epic|semi) Match _epic or _semi
_app Match literally
(?: Non capture group for the alternation
[1-6] Match a digit 1-6
| Or
[1-6_]clm[0-9]*_su[bp] Match a digit 1-6 or _, then clm followed by optional digit 0-9 and then _sub or _sup
)? Close the non capture group and make it optional
_valid Match literally
$ End of string
If the string can also start with dev then you can use an alternation:
^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$
Then you can check if there was a match:
import re
pattern = r"^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$"
strings = [
"att2_epic_app_clm1_sub_valid",
"att12_epic_app_clm1_sub_valid",
"att2_epic_app_valid",
"att2_epic_app_clm1_sub_valid"
]
for s in strings:
    m = re.match(pattern, s, re.M)
    if m:
        print("Valid: " + m.group())
    else:
        print("Invalid: " + s)
Output
Valid: att2_epic_app_clm1_sub_valid
Invalid: att12_epic_app_clm1_sub_valid
Valid: att2_epic_app_valid
Valid: att2_epic_app_clm1_sub_valid
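The question also asks for an error on bad input. One possible way to wrap the pattern above is sketched here; the names NAME_RE and validate are hypothetical, and raising ValueError is just one reasonable choice. re.fullmatch is used so the ^ and $ anchors are not needed:

```python
import re

# Same rules as the pattern above, without anchors because fullmatch
# already requires the whole string to match.
NAME_RE = re.compile(
    r"(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app"
    r"(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid"
)


def validate(name):
    """Return name unchanged if it follows the rules, else raise ValueError."""
    if NAME_RE.fullmatch(name) is None:
        raise ValueError(f"invalid input: {name!r}")
    return name


for candidate in ["att2_epic_app_clm1_sub_valid", "att12_epic_app_clm1_sub_valid"]:
    try:
        print("Valid:", validate(candidate))
    except ValueError as exc:
        print("Error:", exc)
```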

Extract date from inside a string with Python

I have the following string; the leading letters can differ, and there can be two, three, or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
I tried:
filename_without_ending.split(".")[0][-6:]
This solution is not valid, since the rest of the name might also differ from time to time. Sometimes it looks like this:
PZA191030_392001_USB
The only REAL pattern is really the first six numbers.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
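The question also wants the six digits converted to a date. A minimal sketch, assuming the digits are in YYMMDD order (the question does not state the format); leading_date is a made-up helper name. The trailing \. is dropped from the pattern here so that names such as PZA191030_392001_USB are handled too:

```python
import re
from datetime import datetime


def leading_date(name):
    """Parse the six digits after a 2-4 letter prefix as a YYMMDD date.

    Returns a datetime.date, or None if the name does not start with
    2-4 uppercase letters followed by six digits. strptime raises
    ValueError if the digits do not form a real date.
    """
    m = re.match(r"[A-Z]{2,4}(\d{6})", name)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%y%m%d").date()


print(leading_date("PR191030.213101.ABD"))
print(leading_date("PZA191030_392001_USB"))
```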
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2:]
This takes the part before the first dot and slices it from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the first letters can differ, we have to ignore the letters and extract the digits.
Using the re module (for regular expressions), apply a regex pattern to the string; it returns the matching part of the string.
\d matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in a given string, while search() finds only the first occurrence.
import re
s = "PR191030.213101.ABD"
print(re.findall(r"\d+", s)[0])
print(re.search(r"\d+", s).group())

to find the pattern using regex?

curP = "https://programmers.co.kr/learn/courses/4673'>#!Muzi#Muzi!)jayg07con&&"
I want to find the Muzi from this string with regex
for example
MuziMuzi : count 0 because it considers as one word
Muzi&Muzi: count 2 because it has & between so it separate the word
7Muzi7Muzi : count 2
I try to use the regex to find all matched
curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
pattern = re.compile('[^a-zA-Z]muzi[^a-zA-Z]')
print(pattern.findall(curP))
I expected the ['!muzi#','#Muzi!']
but the result is
['!muzi#']
You need to use this as your regex:
pattern = re.compile('[^a-zA-Z]muzi(?=[^a-zA-Z])', flags=re.IGNORECASE)
(?=[^a-zA-Z]) says that muzi must be followed by a character matching [^a-zA-Z], but the lookahead does not consume any characters. So the first match only covers !Muzi, leaving the following # available to start the next match.
Your original regex was consuming !Muzi# leaving Muzi!, which would not match the regex.
Your matches will now be:
['!Muzi', '#Muzi']
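If only the count is needed, lookarounds on both sides avoid consuming the neighbouring characters at all, and they also handle a keyword at the very start or end of the string. This is a sketch rather than the answer's own code; WORD and count_word are illustrative names:

```python
import re

# "muzi" not preceded and not followed by an ASCII letter.
# Lookbehind/lookahead are zero-width, so a separator such as '#'
# can serve as the boundary for two adjacent matches.
WORD = re.compile(r"(?<![a-zA-Z])muzi(?![a-zA-Z])", flags=re.IGNORECASE)


def count_word(text):
    """Count standalone occurrences of 'muzi', ignoring case."""
    return len(WORD.findall(text))


curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
for sample in ["MuziMuzi", "Muzi&Muzi", "7Muzi7Muzi", curP]:
    print(count_word(sample))
```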
As I understand it you want to get any value that may appear on both sides of your keyword Muzi.
That means that the #, in this case, has to be shared by both output values.
One way to do it using regex is to manipulate the string as you find patterns.
Here is my solution:
import re
# Define the function to find the pattern
def find_pattern(curP):
    pattern = re.compile('([^a-zA-Z]muzi[^a-zA-Z])', flags=re.IGNORECASE)
    return pattern.findall(curP)[0]
curP = "<a href='https://programmers.co.kr/learn/courses/4673'></a>#!Muzi#Muzi!)jayg07con&&"
pattern_array = []
# Find the first appearance of the pattern in the string
pattern_array.append(find_pattern(curP))
# Remove the pattern found from the string
curP = curP.replace('Muzi','',1)
# Find the second appearance of the pattern in the string
pattern_array.append(find_pattern(curP))
print(pattern_array)
Output:
['!Muzi#', '#Muzi!']

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend "\x" to every hex byte, except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add it back at every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)//2)])  # // keeps the range argument an int in Python 3
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
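The same strip-and-reinsert idea can be wrapped in a function that also rejects malformed input. This is only a sketch under the assumption that the text contains nothing but hex digits and \x markers; prefix_hex_bytes is a hypothetical name:

```python
import string


def prefix_hex_bytes(text):
    """Return text with a literal backslash-x in front of every hex byte.

    Existing backslash-x markers are dropped first, then re-inserted
    before each pair of hex digits.
    """
    digits = text.replace("\\x", "")
    if len(digits) % 2:
        raise ValueError("odd number of hex digits")
    if not all(c in string.hexdigits for c in digits):
        raise ValueError("non-hex character in input")
    # Step through the digits two at a time.
    return "".join("\\x" + digits[i:i + 2] for i in range(0, len(digits), 2))


print(prefix_hex_bytes(r"3033\x90\x0A6F"))
```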
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
    print('\n\nNew string:')
    print('\\x' + '\\x'.join(match))
    #for elem in match: # match gives you a list of strings with the hex values
    #    print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x\1"  # Python's re uses \1 for group 1; "$1" would be inserted literally
result = re.sub(regex, subst, test_str, flags=re.IGNORECASE)
if result:
    print(result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexadecimal digit.
([a-f\d]{2}) Capture any two hexadecimal digits into capture group 1.
