How to extract the value between the key using RegEx? - python

I have text like:
"abababba"
I want to extract the characters as a list between a.
For the above text, I am expecting output like:
['b', 'b', 'bb']
I have used:
re.split(r'^a(.*?)a$', data)
But it doesn't work.

You could use re.findall to return the capture group values with the pattern:
a([^\sa]+)(?=a)
a Match an a char
([^\sa]+) Capture group 1, match one or more chars other than a (the \s also excludes whitespace chars, in case you don't want to match spaces)
(?=a) Positive lookahead, assert a to the right
import re
pattern = r"a([^\sa]+)(?=a)"
s = "abababba"
print(re.findall(pattern, s))
Output
['b', 'b', 'bb']
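To see why the lookahead matters, here is a small sketch comparing the pattern above with a variant that consumes the closing a. The variable names are only illustrative.

```python
import re

s = "abababba"

# Variant that consumes the closing "a": once consumed, that "a" cannot
# open the next match, so a segment that shares it is missed.
consuming = re.findall(r"a([^\sa]+)a", s)

# Pattern from the answer: the lookahead only asserts the closing "a",
# leaving it available as the opening "a" of the next match.
with_lookahead = re.findall(r"a([^\sa]+)(?=a)", s)

print(consuming)       # ['b', 'bb']
print(with_lookahead)  # ['b', 'b', 'bb']
```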

You could use a list comprehension to achieve this:
s = "abababba"
l = [x for x in s.split("a") if not x == ""]
print(l)
Output:
['b', 'b', 'bb']

The ^ and $ anchor the pattern to the beginning and end of the string, so your pattern can match at most once, spanning the whole text.
Without the anchors, you will get the desired list by using the line:
re.split(r'a(.*?)a', data)[1:-1]

Why not use a normal split:
"abababba".split("a") --> ['', 'b', 'b', 'bb', '']
And remove the empty parts as needed:
# remove all empties:
[*filter(None,"abababba".split("a"))] -> ['b', 'b', 'bb']
or
# only leading/trailing empties (if any)
"abababba".strip("a").split("a") --> ['b', 'b', 'bb']
or
# only leading/trailing empties (assuming always enclosed in 'a')
"abababba".split("a")[1:-1] --> ['b', 'b', 'bb']
If you must use a regular expression, perhaps findall() will let you use a simpler pattern while covering all edge cases (ignoring all empties):
re.findall(r"[^a]+","abababba") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","abababb") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","bababb") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","babaabb") --> ['b', 'b', 'bb']
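The difference between "remove all empties" and "only leading/trailing empties" shows up only when two delimiters are adjacent. A short sketch, using a made-up input with a doubled a:

```python
import re

s = "abaab"  # hypothetical input with two adjacent delimiters

# Dropping every empty string also hides the empty segment between "aa".
drop_all = [part for part in s.split("a") if part]

# Stripping first removes only the outer delimiters; the inner empty
# segment is kept.
strip_only = s.strip("a").split("a")

# findall on [^a]+ behaves like drop_all, since it never matches "".
via_findall = re.findall(r"[^a]+", s)

print(drop_all)     # ['b', 'b']
print(strip_only)   # ['b', '', 'b']
print(via_findall)  # ['b', 'b']
```

Which of the two is right depends on whether an empty segment is meaningful in your data.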

Related

split string into list by regex

I need a regex that splits an input string into a list by the following rules:
1) By dot;
2) Do not split inside an expression that is in quotes.
Examples:
'a.b.c' -> ['a', 'b', 'c'];
'a."b.c".d' -> ['a', 'b.c', 'd'];
'a.'b.c'.d' -> ['a', 'b.c', 'd'];
'a.'b c'.d' -> ['a', 'b c', 'd'];
You could leverage the newer regex module with the following expression:
(["']).*?\1(*SKIP)(*FAIL)|\.
This captures an opening quote, matches up to the next quote of the same kind, and makes that matched part fail so that it is skipped. The other alternative is the dot to split on.
In Python:
import regex as re
data = """
a.b.c
a."b.c".d
a.'b.c'.d
a.'b c'.d
"""
rx = re.compile(r"""(["']).*?\1(*SKIP)(*FAIL)|\.""")
for line in data.split("\n"):
    if line:
        parts = [part.strip("'").strip('"') for part in rx.split(line) if part]
        print(parts)
Which yields
['a', 'b', 'c']
['a', 'b.c', 'd']
['a', 'b.c', 'd']
['a', 'b c', 'd']
See a demo on regex101.com.
If you want to stick with the re module, you could replace the dot in question before and split by the replacement afterwards.
import re
data = """
a.b.c
a."b.c".d
a.'b.c'.d
a.'b c'.d
"""
rx = re.compile(r"""(["']).*?\1|(?P<dot>\.)""")
needle = "SUPERMAN"
def replacer(match):
    if match.group('dot') is not None:
        return needle
    else:
        return match.group(0)
for line in data.split("\n"):
    if line:
        line = rx.sub(replacer, line)
        parts = [part.strip("'").strip('"') for part in line.split(needle) if part]
        print(parts)
This yields the exact same output as above. Please note that both approaches won't work for escaped quotes.
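Another option that needs only the standard re module is to match the parts themselves instead of the separators. This is a sketch of a different technique, not of either approach above. The names TOKEN and split_dotted are made up for the example, and it does not handle escaped quotes either.

```python
import re

# Three alternatives, tried in order at each position:
#   1. a double-quoted part, 2. a single-quoted part,
#   3. a run of characters that are neither dots nor quotes.
TOKEN = re.compile(r'"([^"]*)"' r"|'([^']*)'" r"|([^.'\"]+)")

def split_dotted(text):
    parts = []
    for m in TOKEN.finditer(text):
        # Exactly one group participates in each match; lastindex names it.
        parts.append(m.group(m.lastindex))
    return parts

for line in ['a.b.c', 'a."b.c".d', "a.'b.c'.d", "a.'b c'.d"]:
    print(split_dotted(line))
```

Because the quotes sit outside the capture groups, no stripping step is needed afterwards.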
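You can also do it without a regex, with some extra effort.
First split on '.' and then rejoin the pieces that belong to a quoted part.
string_data = 'a."b.c".d'
data = string_data.split('.')
result = []  # named result to avoid shadowing the built-in list
value = None
for i in range(len(data)):
    if value:
        # this piece was already merged into the previous quoted value
        value = None
    else:
        if '"' in data[i]:
            # rejoin the two halves of a quoted part that contained a dot
            value = data[i] + '.' + data[i+1]
        if value:
            result.append(value.strip('"'))
        else:
            result.append(data[i])
print(result)
This prints ['a', 'b.c', 'd'], the same output as in your question. Note that it only handles double quotes with exactly one dot inside them.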
As an alternative, you could use an alternation (|) with a positive lookbehind (?<=...) and a positive lookahead (?=...) for the single and double quotes:
(?<=").*?(?=")|(?<=').*?(?=')|[a-z]+
import re
regex = r"(?<=\").*?(?=\")|(?<=').*?(?=')|[a-z]+"
line = "a.\"b.t\".qq.d.d.'d'.'d.g.r'.d.d"
print(re.findall(regex, line))
['a', 'b.t', 'qq', 'd', 'd', 'd', '.', 'd.g.r', 'd', 'd']
Note the stray '.' in the output: the dot between the two adjacent quoted parts 'd' and 'd.g.r' also sits between two single quotes, so it is matched as well.
Here is a regex for you:
\.?([^\"\'\.]+)|\"(.+)\"|\'(.+)\'\.?
Implementation:
import re
regex = re.compile( r"""\.?([^\"\'\.]+)|\"(.+)\"|\'(.+)\'\.?""")
def str2list(string):
    b = regex.findall(string)
    l = []
    # findall returns one tuple per match; keep the non-empty group of each
    for i in list(b):
        for j in list(i):
            if j:
                l.append(j)
    return l
print(str2list('a.b.c'))
print(str2list('a."b.c".d'))
print(str2list("a.'b.c'.d"))
Output:
['a', 'b', 'c']
['a', 'b.c', 'd']
['a', 'b.c', 'd']

Python-Getting contents between current and next occurrence of pattern in a string

I want to implement the following in python
(1)Search pattern in a string
(2)Get content till next occurrence of the same pattern in the same string
Till end of the string do (1) and (2)
Searched all available answers but of no use.
Thanks in advance.
As mentioned by Blckknght in the comment, you can achieve this with re.split. re.split retains all empty strings between a) the beginning of the string and the first match, b) the last match and the end of the string and c) between different matches:
>>> re.split('abc', 'abcabcabcabc')
['', '', '', '', '']
>>> re.split('bca', 'abcabcabcabc')
['a', '', '', 'bc']
>>> re.split('c', 'abcabcabcabc')
['ab', 'ab', 'ab', 'ab', '']
>>> re.split('a', 'abcabcabcabc')
['', 'bc', 'bc', 'bc', 'bc']
If you want to retain only c) the strings between 2 matches of the pattern, just slice the resulting array with [1:-1].
Do note that there are two caveats with this method:
Before Python 3.7, re.split doesn't split on an empty-string match (from Python 3.7 on, it does).
>>> re.split('', 'abcabc')  # Python 3.6 and earlier
['abcabc']
Content in capturing groups will be included in the resulting array.
>>> re.split(r'(.)(?!\1)', 'aaaaaakkkkkkbbbbbsssss')
['aaaaa', 'a', 'kkkkk', 'k', 'bbbb', 'b', 'ssss', 's', '']
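If the group is only there for alternation or grouping, a non-capturing group (?:...) keeps its content out of the result. A minimal sketch with a made-up input:

```python
import re

text = "a,b;c"  # hypothetical input with two kinds of separator

# Capturing group: the separators themselves appear in the result.
with_group = re.split(r"(,|;)", text)

# Non-capturing group: only the pieces between separators are returned.
without_group = re.split(r"(?:,|;)", text)

print(with_group)     # ['a', ',', 'b', ';', 'c']
print(without_group)  # ['a', 'b', 'c']
```

This does not help when the pattern needs a backreference such as \1, since that requires a real capturing group.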
You have to write your own function with finditer if you need to handle those use cases.
This is the variant where only case c) is matched.
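def findbetween(pattern, input):
    out = []
    start = None  # end of the previous match; None until the first match
    for m in re.finditer(pattern, input):
        if start is not None:
            # keep only the text between two consecutive matches
            out.append(input[start:m.start()])
        start = m.end()
    return out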
Sample run:
>>> findbetween('abc', 'abcabcabcabc')
['', '', '']
>>> findbetween(r'', 'abcdef')
['a', 'b', 'c', 'd', 'e', 'f']
>>> findbetween(r'ab', 'abcabcabc')
['c', 'c']
>>> findbetween(r'b', 'abcabcabc')
['ca', 'ca']
>>> findbetween(r'(?<=(.))(?!\1)', 'aaaaaaaaaaaabbbbbbbbbbbbkkkkkkk')
['bbbbbbbbbbbb', 'kkkkkkk']
(In the last example, (?<=(.))(?!\1) matches the empty string at the end of the string, so 'kkkkkkk' is included in the list of results)
You can use something like this
re.findall(r"pattern.*?(?=pattern|$)",test_Str)
Here we search for the pattern and, with a lookahead, make sure the match extends up to the next occurrence of the pattern or the end of the string.
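As a concrete sketch of that idea, assuming the literal pattern "ab" and a made-up test string:

```python
import re

text = "abXabYYabZ"  # hypothetical input; "ab" plays the role of the pattern

# Each match starts at "ab" and lazily extends until the next "ab"
# or the end of the string is about to follow.
chunks = re.findall(r"ab.*?(?=ab|$)", text)

# With a capture group, only the content after the pattern is returned.
contents = re.findall(r"ab(.*?)(?=ab|$)", text)

print(chunks)    # ['abX', 'abYY', 'abZ']
print(contents)  # ['X', 'YY', 'Z']
```

If the content can contain newlines, pass flags=re.DOTALL so that . matches them too.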

Python: how to separate a list to several list based on empty string?

I'm working on a list like this, a = ['a','b','','','c','d'], and the real list includes thousands of data entries. Is there a fancy way to turn the list a into [['a','b'],['c','d']], given that the data is really huge?
You can use itertools.groupby for this. You basically group by consecutive empty strings, or consecutive non-empty strings. Then keep all groups that were grouped by True from the lambda in a list comprehension.
>>> from itertools import groupby
>>> [list(i[1]) for i in groupby(a, lambda i: i != '') if i[0]]
[['a', 'b'], ['c', 'd']]
For another example
>>> b = ['a','b','','','c','d', '', 'e', 'f', 'g', '', '', 'h']
>>> [list(i[1]) for i in groupby(b, lambda i: i != '') if i[0]]
[['a', 'b'], ['c', 'd'], ['e', 'f', 'g'], ['h']]
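Since the real list is large, the same groupby idea can be wrapped in a generator so that groups are produced one at a time. This is only a sketch; the name split_on_empty is made up, and key=bool relies on the empty string being falsy.

```python
from itertools import groupby

def split_on_empty(items):
    """Yield each run of consecutive non-empty strings as a list."""
    for is_data, group in groupby(items, key=bool):
        if is_data:  # skip the runs of empty strings
            yield list(group)

a = ['a', 'b', '', '', 'c', 'd']
print(list(split_on_empty(a)))  # [['a', 'b'], ['c', 'd']]
```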

How to print each letter in a string only once

Hello everyone I have a python question.
I'm trying to print each letter in the given string only once.
How do I do this using a for loop and sort the letters from a to z?
Here's what I have:
import string
sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")
letter_str = sentence_str
letter_str = letter_str.lower()
badchar_str = string.punctuation + string.whitespace
Alist = []
for i in badchar_str:
    letter_str = letter_str.replace(i,'')
letter_str = list(letter_str)
letter_str.sort()
for i in letter_str:
    Alist.append(i)
    print(Alist)
Answer I get:
['a']
['a', 'a']
['a', 'a', 'a']
['a', 'a', 'a', 'a']
['a', 'a', 'a', 'a', 'a']
['a', 'a', 'a', 'a', 'a', 'b']
['a', 'a', 'a', 'a', 'a', 'b', 'b']
['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c']....
I need:
['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'l', 'n', 'o', 'p', 'r', 's', 't', 'u', 'w', 'y']
no errors...
Just check that the letter is not already in your list before appending it:
for i in letter_str:
    if i not in Alist:
        Alist.append(i)
print(Alist)
or alternatively use the Set data structure that Python provides instead of an array. Sets do not allow duplicates.
aSet = set(letter_str)
Using itertools.ifilter (Python 2 only; in Python 3 the built-in filter behaves the same way), which you can say has an implicit for-loop:
In [20]: a=[i for i in itertools.ifilter(lambda x: x.isalpha(), sentence_str.lower())]
In [21]: set(a)
Out[21]: set(['a', 'c', 'b', 'e', 'd', 'g', 'i', 'h', 'l', 'o', 'n', 'p', 's', 'r', 'u', 't', 'w', 'y'])
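In Python 3, where itertools.ifilter no longer exists, the same idea can be sketched with the built-in filter. The variable name letters is only illustrative.

```python
sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")

# filter() keeps only alphabetic characters, set() removes duplicates,
# and sorted() returns the unique letters as a list in a-z order.
letters = sorted(set(filter(str.isalpha, sentence_str.lower())))
print(letters)
```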
Malvolio correctly states that the answer should be as simple as possible. For that we use Python's set type, which takes care of uniqueness in a simple and efficient way.
However, his answer does not deal with removing punctuation and spacing. Furthermore, all answers, as well as the code in the question, do that pretty inefficiently (they loop through badchar_str and replace in the original string).
The best (i.e., simplest and most efficient, as well as idiomatic Python) way to find all unique letters in the sentence is this:
import string
sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")
bad_chars = set(string.punctuation + string.whitespace)
unique_letters = set(sentence_str.lower()) - bad_chars
If you want them to be sorted, simply replace the last line with:
unique_letters = sorted(set(sentence_str.lower()) - bad_chars)
If the order in which you want to print doesn't matter you can use:
import string
sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")
letter_str = sentence_str.lower()
badchar_str = string.punctuation + string.whitespace
for i in badchar_str:
    letter_str = letter_str.replace(i,'')
print(set(letter_str))
Or, if you want to print in sorted order, you could convert it back to a list, call sort() on it, and then print.
First principles, Clarice. Simplicity.
list(set(sentence_str))
You can use set() to remove duplicate characters and sorted() to order them:
import string
sentence_str = "No punctuation should be attached to a word in your list, e.g., end. Not a correct word, but end is."
letter_str = sentence_str
letter_str = letter_str.lower()
badchar_str = string.punctuation + string.whitespace
for i in badchar_str:
    letter_str = letter_str.replace(i,'')
characters = list(letter_str)
print(sorted(set(characters)))

Expand alphabetical range to list of characters in Python

I have strings describing a range of characters alphabetically, made up of two characters separated by a hyphen. I'd like to expand them out into a list of the individual characters like this:
'a-d' -> ['a','b','c','d']
'B-F' -> ['B','C','D','E','F']
What would be the best way to do this in Python?
In [19]: s = 'B-F'
In [20]: list(map(chr, range(ord(s[0]), ord(s[-1]) + 1)))
Out[20]: ['B', 'C', 'D', 'E', 'F']
The trick is to convert both characters to their integer character codes with ord(), and then use range().
P.S. Since you require a list, the list(map(...)) construct can be replaced with a list comprehension.
Along with aix's excellent answer using map(), you could do this with a list comprehension:
>>> s = "A-F"
>>> [chr(item) for item in range(ord(s[0]), ord(s[-1])+1)]
['A', 'B', 'C', 'D', 'E', 'F']
import string
def lis(strs):
    upper=string.ascii_uppercase
    lower=string.ascii_lowercase
    if strs[0] in upper:
        return list(upper[upper.index(strs[0]): upper.index(strs[-1])+1])
    if strs[0] in lower:
        return list(lower[lower.index(strs[0]): lower.index(strs[-1])+1])
print(lis('a-d'))
print(lis('B-F'))
output:
['a', 'b', 'c', 'd']
['B', 'C', 'D', 'E', 'F']
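The answers above index s[0] and s[-1] directly and assume well-formed input. A slightly more defensive sketch of the same ord()/range() technique follows; the function name expand_range and the choice to raise ValueError are assumptions of this example.

```python
def expand_range(spec):
    """Expand a spec such as 'a-d' into ['a', 'b', 'c', 'd']."""
    start, sep, end = spec.partition("-")
    # Require exactly one character on each side of the hyphen.
    if sep != "-" or len(start) != 1 or len(end) != 1:
        raise ValueError("expected a range like 'a-d', got %r" % (spec,))
    return [chr(code) for code in range(ord(start), ord(end) + 1)]

print(expand_range('a-d'))  # ['a', 'b', 'c', 'd']
print(expand_range('B-F'))  # ['B', 'C', 'D', 'E', 'F']
```

A reversed range such as 'd-a' yields an empty list here, because range() with start greater than stop is empty.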
