Python recognize part of string (position and length)


I have a file (.VAR) which gives me positions and lengths within strings, one per row; see the example below.
*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE
How do I retrieve the " xxLxxx:" value, which is always preceded by a space and always ends with a colon, but is never at the same location within the string?
Preferably I would like to get the number before the L as the position and the number after the L as the length, but searching only for "L" would also match other values within the string. I therefore think I have to match the whole space-number-L-number-colon sequence to recognize this part, but I don't know how.
Any thoughts? TIA

You can use a regex here.
Example:
s='''*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE'''
import re
out = re.findall(r'\s(\d+)L(\d+):', s)
output: [('1', '8'), ('29', '4'), ('33', '2')]
As integers:
out = [tuple(map(int, x)) for x in re.findall(r'\s(\d+)L(\d+):', s)]
output: [(1, 8), (29, 4), (33, 2)]
regex:
\s # whitespace character (here, the space)
(\d+) # capture one or more digits
L # literal L
(\d+) # capture one or more digits
: # literal :
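If the file is processed one row at a time, the same pattern can be used with named groups so that the two numbers are self-describing. This is only a minimal sketch: the names FIELD and parse_field are made up for illustration, and the rows are the three sample lines from the question.

```python
import re

# Same pattern as above, with named groups for readability.
FIELD = re.compile(r'\s(?P<position>\d+)L(?P<length>\d+):')

def parse_field(line):
    """Return (position, length) for one row, or None if no field is found."""
    match = FIELD.search(line)
    if match is None:
        return None
    return int(match.group('position')), int(match.group('length'))

lines = [
    '*STRING1 1L8:StringONE',
    '*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO',
    '*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE',
]
parsed = [parse_field(line) for line in lines]
print(parsed)
```

Returning None for rows without a field lets the caller decide whether to skip or report them.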

Related

Regular expression to extract number with hyphen

The text is like "1-2years. 3years. 10years."
I want to get the result [(1,2),(3),(10)].
I use Python.
I first tried r"([0-9]?)[-]?([0-9])years". It works well except for the case of 10. I also tried r"([0-9]?)[-]?([0-9]|10)years" but the result is still [(1,2),(3),(1,0)].

Your attempt r"([0-9]?)[-]?([0-9])years" doesn't work for the case of 10 because you ask it to match one (or zero) digit per group.
You also don't need the hyphen in brackets.
This should work:
(\d+)(?:-(\d+))?years
Explanation:
(\d+): Capturing group for one or more digits
(?: ): Non-capturing group
- : hyphen
(\d+): Capturing group for one or more digits
(?: )?: Make the previous non-capturing group optional
In python:
import re
result = re.findall(r"(\d+)(?:-(\d+))?years", "1-2years. 3years. 10years.")
# Gives: [('1', '2'), ('3', ''), ('10', '')]
Each tuple in the list contains two elements: The number on the left side of the hyphen, and the number on the right side of the hyphen. Removing the blank elements is quite easy: you loop over each item in result, then you loop over each match in this item and only select it (and convert it to int) if it is not empty.
final_result = [tuple(int(match) for match in item if match) for item in result]
# gives: [(1, 2), (3,), (10,)]
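To see why the single-digit character class fails on 10, it can help to run the original attempt and the \d+ version side by side. A small sketch using the text from the question; the variable names are only for illustration.

```python
import re

text = "1-2years. 3years. 10years."

# Original attempt: each group can hold at most one digit,
# so "10" is split across the two groups.
single_digit = re.findall(r"([0-9]?)[-]?([0-9])years", text)

# Fixed pattern: \d+ lets each group hold a whole number.
multi_digit = re.findall(r"(\d+)(?:-(\d+))?years", text)

print(single_digit)
print(multi_digit)
```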

This should work:
import re
st = '1-2years. 3years. 10years.'
result = [tuple(e for e in tup if e)
          for tup in re.findall(r'(?:(\d+)-(\d+)|(\d+))years', st)]
# [('1', '2'), ('3',), ('10',)]
The regex will look for either one number, or two separated by a hyphen, immediately prior to the word years. If we give this to re.findall(), it will give us the output [('1', '2', ''), ('', '', '3'), ('', '', '10')], so we also use a quick list comprehension to filter out the empty strings.
Alternately we could use r'(\d+)(?:-(\d+))?years' to basically the same effect, which is closer to what you've already tried.

You can use this pattern: (?:(\d+)-)?(\d+)years
Code:
import re
pattern = r"(?:(\d+)-)?(\d+)years"
text = "1-2years. 3years. 10years."
print([tuple(int(z) for z in x if z) for x in re.findall(pattern, text)])
Output:
[(1, 2), (3,), (10,)]

You only match a single digit as the character class [0-9] is not repeated.
Another option is to match the first digits with an optional part for - and digits.
Then you can split the matches on -
\b(\d+(?:-\d+)?)years\.
\b A word boundary
( Capture group 1 (which will be returned by re.findall)
\d+(?:-\d+)? Match 1+ digits and optionally match - and again 1+ digits
) Close group 1
years\. Match literally with the escaped .
Example
import re
pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."
res = [tuple(v.split('-')) for v in re.findall(pattern, s)]
print(res)
Output
[('1', '2'), ('3',), ('10',)]
Or, if a list of lists is acceptable instead of tuples:
res = [v.split('-') for v in re.findall(pattern, s)]
Output
[['1', '2'], ['3'], ['10']]
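If integers are wanted, as in the question, the split pieces can be converted in the same comprehension. A short sketch along the same lines as the examples above:

```python
import re

pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."

# Split each match on the hyphen and convert every piece to int.
res = [tuple(int(n) for n in v.split('-')) for v in re.findall(pattern, s)]
print(res)
```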

Why is my Regular Expression not matching?

I have the following pattern:
find_pattern = re.compile(r'(ga:country:\s)([a-zA-Z()\s]*)(.*users:\s)(\d+),')
This is what the input that should be matched looks like:
ga:country: (not set),Date range:0,ga:users:60,
ga:country: Albania,Date range:0,ga:users:7,
ga:country: Algeria,Date range:0,ga:users:10,
...
ga:country: Argentina,Date range:0,ga:users:61,
ga:country: Armenia,Date range:0,ga:users:2,
And this is how the output is going to be formatted (in case it adds any value to the question):
['(not set)', 60],
['Albania', 7],
And when I run a test:
matches = find_pattern.finditer(self.data)
print('matches:', matches)
for match in matches:
    print(match)
No matches are found.
Hope someone is able to help.

Against the sample input, your pattern finds nothing because users:\s requires a whitespace character after users:, and the input has none. I would suggest using 2 capturing groups instead of 4, allowing optional whitespace chars after ga: and dropping the required whitespace after users:.
The .* could also be non-greedy (.*?) to get the first one in case there are more users: parts.
To prevent users: being part of a larger word, you could make it more specific by matching :users:
\bga:\s*country:\s*([a-zA-Z()\s]*),.*?:users:(\d+)
Example with re.findall that returns the values of the capturing groups:
import re
regex = r"\bga:\s*country:\s*([a-zA-Z()\s]*),.*?:users:(\d+)"
s = ("ga:country: (not set),Date range:0,ga:users:60,\n"
     "ga:country: Albania,Date range:0,ga:users:7,\n"
     "ga:country: Algeria,Date range:0,ga:users:10,\n"
     "ga:country: Argentina,Date range:0,ga:users:61,\n"
     "ga:country: Armenia,Date range:0,ga:users:2,")
print(re.findall(regex, s))
Output
[('(not set)', '60'), ('Albania', '7'), ('Algeria', '10'), ('Argentina', '61'), ('Armenia', '2')]
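To get the format shown in the question (a country name paired with an integer user count), the groups can be converted while iterating over the matches. A minimal sketch, assuming the same pattern and a shortened two-line sample; the name rows is only illustrative.

```python
import re

regex = re.compile(r"\bga:\s*country:\s*([a-zA-Z()\s]*),.*?:users:(\d+)")
s = ("ga:country: (not set),Date range:0,ga:users:60,\n"
     "ga:country: Albania,Date range:0,ga:users:7,")

# One [country, users] list per match, with the count converted to int.
rows = [[m.group(1), int(m.group(2))] for m in regex.finditer(s)]
print(rows)
```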

Regex: Matching individual characters without matching characters inbetween

I have a simple regex query.
Here is the input:
DLWLALDYVASQASV
The desired output is the positions of the bolded characters (the D, Y and S of the DYVAS run): DLWLALDYVASQASV
So it would be D:6, Y:7, S:10.
I am using Python, so I know I can use span() or start() to obtain the start positions of a match. But if I use something like DY.{2}S, it matches the characters in between as well and only gives me the position of the first (and, in the case of span(), the last) character of the match.
Is there a function or a way to retrieve the position of each specified character, not including the characters in-between?

import re
match = re.search(r'(D)(Y)..(S)', 'DLWLALDYVASQASV')
# Group 0 is the whole match; groups 1 to 3 are the captured characters.
print([match.group(i) for i in range(4)])
# ['DYVAS', 'D', 'Y', 'S']
print([match.span(i) for i in range(4)])
# [(6, 11), (6, 7), (7, 8), (10, 11)]
print([match.start(i) for i in range(4)])
# [6, 6, 7, 10]
You can wrap subexpressions of the regular expression in parentheses (capturing groups) and then access the corresponding substrings and their positions via the match object. See the documentation of the Match object for more details.
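To produce exactly the D:6, Y:7, S:10 output from the question, group 0 (the whole match) can be skipped and each remaining group paired with its start index. A small sketch; the variable names are illustrative, and it handles the no-match case instead of assuming one is found.

```python
import re

sequence = 'DLWLALDYVASQASV'
match = re.search(r'(D)(Y)..(S)', sequence)

positions = []
if match:
    # Groups are numbered from 1; match.re.groups is the number of capturing groups.
    positions = [(match.group(i), match.start(i))
                 for i in range(1, match.re.groups + 1)]

print(', '.join(f'{char}:{pos}' for char, pos in positions))
```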

How can I get the number of groups to vary depending on the number of lines?

I have this regex: ^:([^:]+):([^:]*) which works as in this regex101 link.
Now, in Python, I have this:
def get_data():
    data = read_mt_file()
    match_fields = re.compile('^:([^:]+):([^:]*)', re.MULTILINE)
    fields = re.findall(match_fields, data)
    return fields
Which, for a file containing the data from regex101, returns:
[('1', 'text\ntext\n\n'), ('20', 'text\n\n'), ('21', 'text\ntext\ntext\n\n'), ('22', ' \n\n'), ('25', 'aa\naa\naaaaa')]
Now, this is ok, but I want to change the regex, so that I can get the number of groups to vary depending on the number of lines. Meaning:
for the first line, now, I get two groups:
1
text\ntext\n\n
I'd like to get instead:
1
((text\n), (text\n\n)) <-- those should somehow be in the same group but separated, each in its own subgroup. Somehow I need to know they both belong to one field, but are separate lines.
So, In python, the desired result for that file would be:
[('1', '(text\n), (text\n\n)'), ('20', 'text\n\n'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', ' \n\n'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Is this possible with regex? Could this be achieved with some nice string manipulation instead ?

To do what you want, you'd need another regex.
This is because a repeated capturing group only keeps the last item it matched:
>>> re.match(r'(\d)+', '12345').groups()
('5',)
Instead of using one regex you'll need to use two.
The one that you are using at the moment, and then one to match all the 'sub-groups', using say re.findall.
You can get these sub-groups by simply matching anything that isn't a \n and then any amount of \n.
So you could use a regex such as [^\n]+\n*:
>>> re.findall(r'[^\n]+\n*', 'text\ntext')
['text\n', 'text']
>>> re.findall(r'[^\n]+\n*', 'text\ntext\n\n')
['text\n', 'text\n\n']
>>> re.findall(r'[^\n]+\n*', '')
[]
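Putting the two steps together could look roughly like this. It is only a sketch: FIELD, LINE and split_fields are illustrative names, and the sample string is assumed to be representative of the file contents.

```python
import re

# Step 1: the original regex, one match per field.
FIELD = re.compile(r'^:([^:]+):([^:]*)', re.MULTILINE)
# Step 2: one match per line inside a field, keeping trailing newlines.
LINE = re.compile(r'[^\n]+\n*')

def split_fields(data):
    """Return (tag, lines) pairs, where lines is a tuple of the field's lines."""
    return [(tag, tuple(LINE.findall(body))) for tag, body in FIELD.findall(data)]

data = ":1:text\ntext\n\n:20:text\n\n:21:text\ntext\ntext\n\n:22: \n\n:25:aa\naa\naaaaa"
fields = split_fields(data)
print(fields)
```

Each field keeps its tag, and the number of entries in the tuple varies with the number of lines in that field.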

You may use a simple trick: after getting the matches with your regex, run a .+\n* regex over the Group 2 value:
import re
p = re.compile(r'^:([^:]+):([^:]+)', re.MULTILINE)
s = ":1:text\ntext\n\n:20:text\n\n:21:text\ntext\ntext\n\n:22: \n\n:25:aa\naa\naaaaa"
print([[x.group(1)] + re.findall(r".+\n*", x.group(2)) for x in p.finditer(s)])
Here,
p.finditer(s) finds all matches in the string using your regex
[x.group(1)] - a list created from the first group contents
re.findall(r".+\n*", x.group(2)) - fetches individual lines from Group 2 contents (with trailing newlines, 0 or more)
[] + re.findall - combining the lists into 1.
Result is
[['1', 'text\n', 'text\n\n'], ['20', 'text\n\n'], ['21', 'text\n', 'text\n', 'text\n\n'], ['22', ' \n\n'], ['25', 'aa\n', 'aa\n', 'aaaaa']]
Another approach: match all the substrings with your pattern and then use a re.sub to add ), ( between the lines ending with optional newlines:
[(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) for x, y in p.findall(s)]
Result:
[('1', '(text\n), (text\n\n)'), ('20', '(text\n\n)'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', '( \n\n)'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Here:
p.findall(s) - grabs all the matches in the form of a list of tuples containing your capture group contents using your regex
(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) - creates a tuple from Group 1 contents and Group 2 contents that are a bit modified with the re.sub the way described below
.+(?!\n*$)\n+ - pattern that matches 1+ characters other than newline and then 1+ newline symbols if they are not at the end of the string. If they are at the end of the string, there will be no replacement made (to avoid , () at the end). The \g<0> in the replacement string is re-inserting the whole match back into the resulting string and appends ), ( to it.

Regex to parse SDDL

I'm using python to parse out an SDDL using regex. The SDDL is always in the form of 'type:some text' repeated up to 4 times. The types can be either 'O', 'G', 'D', or 'S' followed by a colon. The 'some text' will be variable in length.
Here is a sample SDDL:
O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL
Here is what I have so far. Two of the tuples are returned just fine, but the other two - ('G','S-1-5-21-2021943911-1813009066-4215039422-1735') and ('S','NO_ACCESS_CONTROL') are not.
import re
sddl="O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL"
matches = re.findall('(.):(.*?).:',sddl)
print(matches)
[('O', 'DA'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)')]
What I'd like to have returned is:
[('O', 'DA'), ('G','S-1-5-21-2021943911-1813009066-4215039422-1735'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)'),('S','NO_ACCESS_CONTROL')]

Try the following:
(.):(.*?)(?=.:|$)
Example:
>>> re.findall(r'(.):(.*?)(?=.:|$)', sddl)
[('O', 'DA'), ('G', 'S-1-5-21-2021943911-1813009066-4215039422-1735'), ('D', '(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)'), ('S', 'NO_ACCESS_CONTROL')]
This regex starts out the same way as yours, but instead of including the .: at the end as a part of the match, a lookahead is used. This is necessary because re.findall() will not return overlapping matches, so you need each match to stop before the next match begins.
The lookahead (?=.:|$) essentially means "match only if the next characters are anything followed by a colon, or we are at the end of the string".
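Since the question says the type is always one of O, G, D or S, the dot can be narrowed to a character class, and the pairs collected into a dict. This is a sketch under that assumption, not a full SDDL parser: it would split in the wrong place if a value itself contained one of those letters followed by a colon.

```python
import re

sddl = ("O:DAG:S-1-5-21-2021943911-1813009066-4215039422-1735"
        "D:(D;;0xf0007;;;AN)(D;;0xf0007;;;BG)S:NO_ACCESS_CONTROL")

# Assumption: section types are limited to O, G, D and S, as stated in the question.
SECTION = re.compile(r'([OGDS]):(.*?)(?=[OGDS]:|$)')

sections = dict(SECTION.findall(sddl))
print(sections)
```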

It seems like using regex isn't the best solution to this problem. Really, all you want to do is split across the colons and then do some transformations on the resulting list.
chunks = sddl.split(':')
pairs = [(chunks[i][-1],
          chunks[i+1][:-1] if i < len(chunks) - 2 else chunks[i+1])
         for i in range(len(chunks) - 1)]
