So I want to capture the indices in a string like this:
"Something bad happened! # data[u'string_1'][u'string_2']['u2'][0]"
I want to capture the strings string_1, string_2, u2, and 0.
I was able to do this using the following regex:
re.findall("("
"((?<=\[u')|(?<=\['))" # Begins with [u' or ['
"[a-zA-Z0-9_\-]+" # Followed by any letters, numbers, _'s, or -'s
"(?='\])" # Ending with ']
")"
"|" # OR
"("
"(?<=\[)" # Begins with [
"[0-9]+" # Followed by any numbers
"(?=\])" # Ending with ]
")", message)
Problem is the result will include tuples with empty strings, as such:
[('string_1', '', ''), ('string_2', '', ''), ('u2', '', ''), ('', '', '0')]
Now I can easily filter out the empty strings from the result, but I would like to prevent them from appearing in the first place.
I believe that this is caused by my capture groups. I tried to use ?: in those groups, but then my results were completely gone.
This is how I had attempted to do it:
re.findall("(?:"
"((?<=\[u')|(?<=\['))" # Begins with [u' or ['
"[a-zA-Z0-9_\-]+" # Followed by any letters, numbers, _'s, or -'s
"(?='\])" # Ending with ']
")"
"|" # OR
"(?:"
"(?<=\[)" # Begins with [
"[0-9]+" # Followed by any numbers
"(?=\])" # Ending with ]
")", message)
That results in the following output:
['', '', '', '']
I'm assuming the issue is due to me using lookbehinds along with the non-capturing groups. Any ideas on whether this is possible to do in Python?
Thanks
You can simplify your regex.
(?<=\[)u?'?([a-zA-Z0-9_\-]+)(?='?\])
See demo.
https://regex101.com/r/SA6shx/1
Regex: (?<=\[)(?:[^'\]]*')?([^'\]]+) or \[(?:[^'\]]*')?([^'\]]+)
Python code:
import re

def Years(text):
    # The single capturing group holds the index without brackets, quotes, or the u prefix
    return re.findall(r'(?<=\[)(?:[^\'\]]*\')?([^\'\]]+)', text)

print(Years('Something bad happened! # data[u\'string_1\'][u\'string_2\'][\'u2\'][0]'))
Output:
['string_1', 'string_2', 'u2', '0']
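As for why the second attempt in the question printed only empty strings: re.findall returns the contents of the capturing groups whenever the pattern contains any, and that pattern still had one left, the inner ((?<=\[u')|(?<=\[')). A lookbehind is zero-width, so that group can only ever capture an empty string. The following is a minimal sketch, assuming the same message as in the question, that makes the inner group non-capturing as well so that findall returns the whole match:

```python
import re

message = "Something bad happened! # data[u'string_1'][u'string_2']['u2'][0]"

# No capturing groups anywhere, so findall returns the full match text.
pattern = re.compile(
    r"(?:(?<=\[u')|(?<=\['))"   # preceded by [u' or ['
    r"[a-zA-Z0-9_-]+"           # letters, digits, underscores or hyphens
    r"(?='\])"                  # followed by ']
    r"|"                        # OR
    r"(?<=\[)[0-9]+(?=\])"      # digits enclosed in [ ]
)

indices = pattern.findall(message)
print(indices)
# ['string_1', 'string_2', 'u2', '0']
```

So lookbehinds and non-capturing groups do work together in Python; the empty strings came from the one group that was still capturing.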
I've found some solutions, but the results I am getting don't match what I'm expecting.
I want to take a string, and split it at commas, except when the commas are contained within double quotation marks. I would like to ignore whitespace. I can live with losing the double quotes in the process, but it's not necessary.
Is csv the best way to do this? Would a regex solution be better?
#!/usr/local/bin/python2.7
import csv
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
result = csv.reader(s, delimiter=',', quotechar='"')
for r in result:
    print r
# Should display:
# abc
# def
# ghi
# jkl, mno, pqr
# stu
#
# But I get:
# ['a']
# ['b']
# ['c']
# ['', '']
# ['d']
# ['e']
# ['f']
# ['', '']
# [' ']
# ['g']
# ['h']
# ['i']
# ['', '']
# [' ']
# ['jkl, mno, pqr']
# ['', '']
# ['stu']
print r[1] # Should be "def" but I get a "list index out of range" error.
You can use the regular expression
".+?"|[\w-]+
This will match double-quotes, followed by any characters, until the next double-quote is found - OR, it will match word characters (no commas nor quotes).
https://regex101.com/r/IThYf7/1
import re
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
for r in re.findall(r'".+?"|[\w-]+', s):
    print(r)
If you want to get rid of the "s around the quoted sections, the best I could figure out by using the regex module (so that \K was usable) was:
(?:^"?|, ?"?)\K(?:(?<=").+?(?=")|[\w-]+)
https://regex101.com/r/IThYf7/3
Besides using csv you could have another nice approach which is supported by the newer regex module (i.e. pip install regex):
"[^"]*"(*SKIP)(*FAIL)|,\s*
This reads as follows:
"[^"]*"(*SKIP)(*FAIL) # match everything between two double quotes and "forget" about them
| # or
,\s* # match a comma and 0+ whitespaces
In Python:
import regex as re
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,\s*')
string = 'abc,def, ghi, "jkl, mno, pqr","stu"'
parts = rx.split(string)
print(parts)
This yields
['abc', 'def', 'ghi', '"jkl, mno, pqr"', '"stu"']
See a demo on regex101.com.
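For comparison, the csv approach from the question can also be made to work. The character-by-character output appears because csv.reader expects an iterable of lines, and iterating over a bare string yields single characters. A minimal sketch, assuming Python 3 and the same sample string, that wraps the string in a list and uses the skipinitialspace option to ignore the whitespace after each comma:

```python
import csv

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

# csv.reader iterates over its first argument line by line,
# so the single line is wrapped in a list.
# skipinitialspace=True drops the whitespace that follows a delimiter,
# which also lets the quote after ", " be recognised as a quote.
reader = csv.reader([s], delimiter=',', quotechar='"', skipinitialspace=True)
fields = next(reader)

print(fields)
# ['abc', 'def', 'ghi', 'jkl, mno, pqr', 'stu']
print(fields[1])
# def
```

This variant drops the double quotes, which the question says is acceptable.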
While trying to process some code, I needed to find instances in which variables from a certain list were used. Problem is, the code is obfuscated and those variable names could also appear in a string, for example, which I didn't want to match.
However, I haven't been able to find a regex to match only non-quoted words that works in python...
"[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\w+)"
This should match any non-quoted word into the last group (the 6th group, index 5 with 0-based indexing). Minor modifications are required to avoid matching strings which begin with quotes.
Explanation:
[^\\\\] Match any character but an escape character. Escaped quotes do not start a string.
((\")|(')) Immediately after the non-escaped character, match either " or ', which starts a string. This is group 1, which contains groups 2 (\") and 3 (')
(?(2) if we matched group 2 (a double-quote)
([^\"]|\\\")*| match anything but double quotes, or match escaped double quotes. Otherwise:
([^']|\\')*) match anything but a single quote or match an escaped single quote.
If you wish to retrieve the string inside the quotes, you will have to add another group: (([^\"]|\\\")*) will allow you to retrieve the whole consumed string, rather than just the last matched character.
Note that the last character of a quoted string will actually be consumed by the last [^\\\\]. To retrieve it, you have to turn it into a group: ([^\\\\]). Additionally, the first character before the quote will also be consumed by [^\\\\], which might be meaningful in cases such as r"Raw\text".
[^\\\\]\\1 will match any non-escape character followed by whatever the first group matched. That is, if ((\")|(')) matched a double quote, we require a double quote to end the string. Otherwise, it matched a single quote, which is what we require to end the string.
|(\w+) will match any word. This will only match non-quoted words, as quoted strings will be consumed by the previous part of the regex.
For example:
import re
non_quoted_words = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\w+)"
quote = "This \"is an example ' \\\" of \" some 'text \\\" like wtf' \\\" is what I said."
print(quote)
print(re.findall(non_quoted_words,quote))
will return:
This "is an example ' \" of " some 'text \" like wtf' \" is what I said.
[('', '', '', '', '', 'This'), ('"', '"', '', 'f', '', ''), ('', '', '', '', '', 'some'), ("'", '', "'", '', 't', ''), ('', '', '', '', '', 'is'), ('', '', '', '', '', 'what'), ('', '', '', '', '', 'I'), ('', '', '', '', '', 'said')]
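A different technique from the conditional pattern above, shown only as a sketch: put the string literals first in an alternation so they are consumed whole, capture bare words in a single group, and keep only the matches where that group took part. The names token and non_quoted_words and the sample text are made up for illustration, and the sketch assumes every quote in the input opens or closes a string literal.

```python
import re

# Alternation order matters: string literals are tried first, bare words last.
token = re.compile(r'''
    "(?:\\.|[^"\\])*"     # a double-quoted string, allowing escaped characters
  | '(?:\\.|[^'\\])*'     # a single-quoted string, allowing escaped characters
  | (\w+)                 # group 1: a word outside any string
''', re.VERBOSE)

def non_quoted_words(text):
    """Return the words in text that are not inside a quoted string."""
    return [m.group(1) for m in token.finditer(text) if m.group(1) is not None]

sample = 'foo = "bar \\" baz" + qux + \'foo\''
words = non_quoted_words(sample)
print(words)
# ['foo', 'qux']
```

Using finditer and checking group(1) avoids the tuples of empty strings that findall produces when a pattern has several groups.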
I have a string like the one below:
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
I want to get the invoice numbers IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967 into a list using a regex with this pattern:
result = re.findall(r'INV[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}',string)
but the result is
[('XVII', '', '','', '', '', '', '', 'X', 'VII', '', '', '', 'V','','','', '', '', '', '', '', '', '', '', 'V')]
I tried this pattern on http://regexr.com/ and the result was correct, but in Python it is not.
You should modify your pattern. To get the full invoice text, either add parentheses around the whole regular expression and read it back as group 1, or simply read the whole match with group(0). You can read more about back-references here.
import re
invoices = []
# Your pattern was slightly incorrect (it started with INV, but the invoices start with IVR)
pattern = re.compile(r'IVR[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})|(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}')
# For each invoice found in the string, append the whole match to the list
for invoice in pattern.finditer(string):
    # group(0) is the entire match; group(1) would be only the first Roman numeral
    invoices.append(invoice.group(0))
Note:
You should also use pattern.finditer(), because that way you can iterate through all matches of the pattern in the text you called string. From the re.finditer documentation:
re.finditer(pattern, string, flags=0)
Return an iterator yielding
MatchObject instances over all non-overlapping matches for the RE
pattern in string. The string is scanned left-to-right, and matches
are returned in the order found. Empty matches are included in the
result unless they touch the beginning of another match.
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
results = []
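# regexpattern stands for your invoice pattern
matches = re.finditer(regexpattern, string)
for matchNum, match in enumerate(matches):
    # match.group() with no argument returns the whole match
    results.append(match.group())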
You need to add ?: at the start of every group to make it non-capturing. When a pattern contains capturing groups, re.findall returns the groups instead of the whole match.
Try with this regex:
IVR[/]\d{8}[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/]\d{8}
Basically you need to add ?: for each group.
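To illustrate the effect, here is a small sketch using the sample string from the question. The helper name roman is introduced here for readability only; it holds a Roman numeral sub-pattern built entirely from non-capturing groups. With no capturing groups left, findall returns the complete invoice numbers.

```python
import re

string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"

# One Roman numeral, written once, using only non-capturing groups.
# Note that this sub-pattern can also match an empty string.
roman = r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"

pattern = re.compile(r"IVR/\d{8}/" + roman + "/" + roman + r"/\d{7,9}")

invoices = pattern.findall(string)
print(invoices)
# ['IVR/20170531/XVII/V/12652967', 'IVR/20170531/XVII/V/13652967']
```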
You can try this one to retrieve number, roman, roman and number values:
IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})
Snippet
import re
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
pattern = r"IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})"
for match in re.findall(pattern, string):
    print(match)
I want to separate groups of punctuation from the text with spaces.
my_text = "!where??and!!or$$then:)"
I want to have a ! where ?? and !! or $$ then :) as a result.
I wanted something like in Javascript, where you can use $1 to get your matching string. What I have tried so far:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=##?\[\\\]^_`{|}~]*', my_text)
Here my_matches is empty so I had to delete \\\ from the expression:
my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=##?\^_`{|}~]*', my_text)
I have this result:
['!', '', '', '', '', '', '??', '', '', '', '!!', '', '', '$$', '', '', '', '',
':)', '']
So I delete all the redundant entries like this:
my_matches_distinct = list(set(my_matches))
And I have a better result:
['', '??', ':)', '$$', '!', '!!']
Then I replace every match with itself surrounded by spaces:
for match in my_matches:
    if match != '':
        my_text = re.sub(match, ' ' + match + ' ', my_text)
And of course it's not working! I tried to cast the match to a string, but that's not working either... When I put the string to replace in directly, it works, though.
But I think I'm not doing it right, because I will have problems with '!' and '!!', right?
Thanks :)
It is recommended to use raw string literals when defining a regex pattern. Besides, do not escape arbitrary symbols inside a character class: only \ must always be escaped, and the others can be placed so that they do not need escaping. Also, your regex can match an empty string (and it does) because of the * quantifier; replace it with +. Finally, if you want to surround these symbols with spaces in your string, use re.sub directly.
import re
my_text = "!where??and!!or$$then:)"
print(re.sub(r'[]!"$%&\'()*+,./:;=##?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
See the Python demo
Details: The []!"$%&'()*+,./:;=##?[\^_`{|}~-]+ matches any 1+ symbols from the set (note that only \ is escaped here since - is used at the end, and ] at the start of the class), and the replacement inserts a space + the whole match (the \g<0> is the backreference to the whole match) and a space. And .strip() will remove leading/trailing whitespace after the regex finishes processing the string.
string.punctuation NOTE
Those who think that they can use f"[{string.punctuation}]+" make a mistake because this won't match \. Why? Because the resulting pattern looks like [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]+ and the \] part does not match a backslash or ], it only matches a ] since the \ escapes the ] char.
If you plan to use string.punctuation, you need to escape ] and \ (it would be also correct to escape - and ^ as these are the only special chars inside square brackets, but in this case, it would be redundant):
import re
from string import punctuation
my_text = "!where??and!!or$$then:)"
pattern = "[" + punctuation.replace('\\','\\\\').replace(']', r'\]') + "]+"
print(re.sub(pattern, r' \g<0> ', my_text).strip())
# => ! where ?? and !! or $$ then :)
See this Python demo.
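Another way to get the same escaping, sketched here as an alternative rather than as part of the answer above, is to let re.escape do it. re.escape puts a backslash in front of every character that could be special, including \, ], - and ^, so its output can be placed directly between the square brackets. The name spaced is made up for this example.

```python
import re
from string import punctuation

my_text = "!where??and!!or$$then:)"

# re.escape handles the backslash and the closing bracket,
# so the character class is well-formed and matches every punctuation char.
pattern = "[" + re.escape(punctuation) + "]+"

spaced = re.sub(pattern, r" \g<0> ", my_text).strip()
print(spaced)
# ! where ?? and !! or $$ then :)
```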
Use the sub() function in the re library. You can do this as follows:
import re
text = '!where??and!!or$$then:)'
print(re.sub(r'([!@#$%\^&\*\(\):;"\',\./\\?]+)', r' \1 ', text).strip())
# => ! where ?? and !! or $$ then :)
I hope this code solves your problem. If you are familiar with regex, then the regex part is not a big deal; it is just a matter of using the right function.
Hope this helps! Please comment if you have any queries. :)
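On why the loop in the question fails even before the '!' versus '!!' problem comes up: each match is passed to re.sub as a pattern, and strings such as '??' or ':)' are not valid regular expressions, while '$$' is valid but means "end of string" rather than two dollar signs. A short sketch of the failure, and of re.escape as the way to treat a match as literal text (the names error and fixed are made up for this example):

```python
import re

my_text = "!where??and!!or$$then:)"

# '??' is a quantifier with nothing in front of it, so compiling it fails.
try:
    re.sub("??", " ?? ", my_text)
    error = None
except re.error as exc:
    error = exc
    print("invalid pattern:", exc)

# re.escape turns the text into a pattern that matches it literally.
fixed = re.sub(re.escape("??"), " ?? ", my_text)
print(fixed)
# !where ?? and!!or$$then:)
```

Even with escaping, replacing '!' and then '!!' in separate passes would pad the same characters more than once, which is why a single re.sub over the whole character class is the simpler route.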
I am trying to write a program that reads a paragraph and counts the special characters and words.
My input:
list_words = "'He came,"
list_1 = []
words = list_words.partition("'")
for i in words:
    list_1.extend(i.split())
print(list_1)
my output looks like this:
["'", 'He', 'came,']
but I want
["'", 'He', 'came', ',']
Can any one help me how to do this?
I am trying to write a program which reads a paragraph which counts the special characters and words
Let's focus on the goal then, rather than your approach. Your approach is probably possible, but it may take a bunch of splits, so let's just ignore it for now. Using re.findall with a suitable regex should work much better.
lst = re.findall(r"\w+|[^\w\s]", some_sentence)
Would make sense. Broken down it does:
pat = re.compile(r"""
\w+ # one or more word characters
| # OR
[^\w\s] # exactly one character that's neither a word character nor whitespace
""", re.X)
results = pat.findall('"Why, hello there, Martha!"')
# ['"', 'Why', ',', 'hello', 'there', ',', 'Martha', '!', '"']
However, then you have to go through another iteration of your list to count the special characters! Let's separate them, then. Luckily this is easy: just add capturing parentheses.
new_pat = re.compile(r"""
( # begin capture group
\w+ # one or more word characters
) # end capturing group
| # OR
( # begin capture group
[^\w\s] # exactly one character that's neither a word character nor whitespace
) # end capturing group
""", re.X)
results = new_pat.findall('"Why, hello there, Martha!"')
# [('', '"'), ('Why', ''), ('', ','), ('hello', ''), ('there', ''), ('', ','), ('Martha', ''), ('', '!'), ('', '"')]
grouped_results = {"words":[], "punctuations":[]}
for word,punctuation in results:
    if word:
        grouped_results['words'].append(word)
    if punctuation:
        grouped_results['punctuations'].append(punctuation)
# grouped_results = {'punctuations': ['"', ',', ',', '!', '"'],
# 'words': ['Why', 'hello', 'there', 'Martha']}
Then just count the items under each dict key.
>>> for key in grouped_results:
...     print("There are {} items in {}".format(
...         len(grouped_results[key]),
...         key))
...
There are 5 items in punctuations
There are 4 items in words
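Applied to the input from the question, the same one-line pattern gives exactly the list that was asked for. The sketch below also tallies the two kinds of token with collections.Counter, which is a swapped-in shortcut rather than the dict-based grouping above; the labels "word" and "special" are made up for this example.

```python
import re
from collections import Counter

text = "'He came,"

# A run of word characters, or a single character that is
# neither a word character nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ["'", 'He', 'came', ',']

# Classify each token by its first character and count both kinds.
counts = Counter("word" if re.match(r"\w", tok) else "special" for tok in tokens)
print(counts["word"], counts["special"])
# 2 2
```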