How to split a string using closing brackets as the separator - python

I have a messy string like '[Carrots] [Broccoli] (cucumber)-(tomato) irrelevant [spinach]' and I want to split it into a list so that each part inside any pair of brackets becomes an item, like ['Carrots', 'Broccoli', 'cucumber', 'tomato', 'spinach']. How would I do this? I can't figure out a way to make the .split() method work.

You can use regex
import re
s = '[Carrots] [Broccoli] (cucumber)-(tomato) irrelevant [spinach]'
lst = [x[0] or x[1] for x in re.findall(r'\[(.*?)\]|\((.*?)\)', s)]
print(lst)
Output
['Carrots', 'Broccoli', 'cucumber', 'tomato', 'spinach']
Explanation
Regex pattern to match
r'\[(.*?)\]|\((.*?)\)'
Subpattern 1: To match items in square brackets i.e. [...]
\[(.*?)\] # Use \[ and \] since [, ] are special characters
# we have to escape so they will be literal
(.*?) # Is a Lazy match of all characters
Subpattern 2: To match in parentheses i.e. (..)
\((.*?)\) # Use \( and \) since (, ) are special characters
# we have to escape so they will be literal
Since we are looking for either of the two patterns, we use:
'|' # which is "or" between the two subpatterns
# to match Subpattern 1 or Subpattern 2
The expression
re.findall(r'\[(.*?)\]|\((.*?)\)', s)
returns
[('Carrots', ''), ('Broccoli', ''), ('', 'cucumber'), ('', 'tomato'), ('spinach', '')]
The result is in the first or second element of each tuple, so we use:
[x[0] or x[1] for x in re.findall(r'\[(.*?)\]|\((.*?)\)', s)]
to extract the non-empty element of each tuple and place it into a list.
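The same idea can be wrapped in a small helper. The sketch below is one possible variant, not the answer's exact code: the name extract_bracketed is hypothetical, and it swaps the lazy .*? for negated character classes, which assumes a bracketed item never contains a closing bracket of its own kind.

```python
import re

# One alternative per bracket type; each captures the text between the pair.
BRACKETED = re.compile(r'\[([^\]]*)\]|\(([^)]*)\)')

def extract_bracketed(text):
    """Return the contents of every [...] or (...) group, in order."""
    # finditer yields match objects; only one of the two groups takes part
    # in each match, and the other one is None.
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in BRACKETED.finditer(text)]

s = '[Carrots] [Broccoli] (cucumber)-(tomato) irrelevant [spinach]'
print(extract_bracketed(s))
```

Checking for None rather than using "or" keeps the intent explicit when a pair of brackets is empty.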

Without any error handling whatsoever (like checking for nested or unbalanced brackets):
def parse(expr):
    opening = "(["
    closing = ")]"
    result = []
    current_item = ""
    for char in expr:
        if char in opening:
            current_item = ""
            continue
        if char in closing:
            result.append(current_item)
            continue
        current_item += char
    return result

print(parse("(a)(b) stuff (c) [d] more stuff - (xxx)."))
# prints ['a', 'b', 'c', 'd', 'xxx']
Depending on your needs, this might already be good enough...
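If the missing error handling matters, a stricter variant could look like the sketch below. The name parse_strict and the choice to raise ValueError are assumptions made for illustration; it rejects nested brackets, mismatched pairs, and brackets that are never closed.

```python
def parse_strict(expr):
    """Like parse(), but raise ValueError on nested or unbalanced brackets."""
    pairs = {"(": ")", "[": "]"}
    result = []
    expected_close = None   # closing bracket we are waiting for, if any
    current_item = ""
    for char in expr:
        if char in pairs:
            if expected_close is not None:
                raise ValueError("nested bracket: %r" % char)
            expected_close = pairs[char]
            current_item = ""
        elif char in pairs.values():
            if char != expected_close:
                raise ValueError("unexpected closing bracket: %r" % char)
            result.append(current_item)
            expected_close = None
        elif expected_close is not None:
            current_item += char   # only collect text inside brackets
    if expected_close is not None:
        raise ValueError("missing closing bracket: %r" % expected_close)
    return result

print(parse_strict("(a)(b) stuff (c) [d] more stuff - (xxx)."))
```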

Assuming no other brackets or operators (e.g. '-') than the ones present in your example string are used, try
s = '[Carrots] [Broccoli] (cucumber)-(tomato) irrelevant [spinach]'
words = []
for elem in s.replace('-', ' ').split():
    if '[' in elem or '(' in elem:
        words.append(elem.strip('[]()'))
Or with list comprehension
words = [elem.strip('[]()') for elem in s.replace('-', ' ').split() if '[' in elem or '(' in elem]

Related

Creating a list given an equation with no spaces

I want to create a list given a string such as 'b123+xyz=1+z1$' so that the list equals ['b123', '+', 'xyz', '=', '1', '+', 'z1', '$']
Without spaces or a single repeating pattern, I do not know how to split the string into a list.
I tried creating if statements in a for loop to append the string when it reaches a character that is not a digit or letter through isdigit and isalpha but could not differentiate between variables and digits.
You can use a regular expression to split your string. This works by using positive lookaheads and lookbehinds for non-word characters (splitting on empty matches like these requires Python 3.7 or later).
import re
sample = "b123+xyz=1+z1$"
split_sample = re.split(r"(?=\W)|(?:(?<=\W)(?!$))", sample)
print(split_sample)
OUTPUT
['b123', '+', 'xyz', '=', '1', '+', 'z1', '$']
Another regex approach giving the same result is:
split_sample = re.split(r"(\+|=|\$)", sample)[:-1]
The [:-1] is to remove the final empty string.
"""
Given the equation b123+xyz=1+z1$, break it down
into a list of variables and operators
"""
operators = ['+', '-', '/', '*', '=']
equation = 'b123+xyz=1+z1$'
equation_by_variable_and_operator = []
text = ''
for character in equation:
    if character not in operators:
        text = text + character
    elif character in operators and len(text):
        equation_by_variable_and_operator.append(text)
        equation_by_variable_and_operator.append(character)
        text = ''
# For the final variable
equation_by_variable_and_operator.append(text)
print(equation_by_variable_and_operator)
Output
['b123', '+', 'xyz', '=', '1', '+', 'z1$']
A straightforward regex solution is:
import re
equation = "b123+xyz=1+z1$"
equation_list = re.findall(r'\W+|\w+', equation)
print(equation_list)
This would also work with strings such as -b**10.
Using re.split() returns empty strings at the start and end of the string from the delimiters at the start and end of the string (see this question). To remove them, they can be filtered out, or otherwise look-behind or look-ahead conditions can be used which add to the pattern's complexity, as earlier answers to this question demonstrate.
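As a rough sketch of the filtering option: the helper name tokenize is hypothetical, and the pattern treats any run of non-word characters as one delimiter token, which is an assumption about what counts as an operator.

```python
import re

def tokenize(equation):
    """Split on runs of non-word characters, keeping them as tokens."""
    # The capturing group makes re.split return the delimiters as well.
    parts = re.split(r'(\W+)', equation)
    # A delimiter at the very start or end leaves an empty string there.
    return [part for part in parts if part]

print(tokenize('b123+xyz=1+z1$'))
print(tokenize('-b**10'))
```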
Well, my answer does not seem to be the easiest of them all, but I hope it helps you.
data: str = "b123+xyz=1+z1$"
symbols: str = "+=$"
merge_text: str = ""
for char in data:
    if char not in symbols:
        merge_text += char
    else:
        # insert a unique character for splitting
        merge_text += ","
        merge_text += char
        merge_text += ","
# drop the empty strings left where a symbol starts or ends the data
final_result: list = [part for part in merge_text.split(",") if part]

How can I extract hashtags from string?

I need to write a function that receives a string and extracts the words marked with "#" from it.
Here's what I've done:
def hashtag(str):
    lst = []
    for i in str.split():
        if i[0] == "#":
            lst.append(i[1:])
    return lst
My code does work, but it splits words. So for the example string: "Python is #great #Computer#Science" it'll return the list: ['great', 'Computer#Science'] instead of ['great', 'Computer', 'Science'].
Without using RegEx please.
You can first find the first index where # occurs and then split the remaining slice on #
text = 'Python is #great #Computer#Science'
text[text.find('#')+1:].split('#')
Out[214]: ['great ', 'Computer', 'Science']
You can even use strip at last to remove unnecessary white space.
[tag.strip() for tag in text[text.find('#')+1:].split('#')]
Out[215]: ['great', 'Computer', 'Science']
Split into words, and then filter for the ones beginning with an octothorpe (hash).
[word for word in str.replace("#", " #").split()
if word.startswith('#')
]
The steps are
Insert a space in front of each hash, to make sure we separate on them
Split the string at spaces
Keep the words that start with a hash.
Result:
['#great', '#Computer', '#Science']
split by #
take all tokens except the first one
strip spaces
s = "Python is #great #Computer#Science"
out = [w.split()[0] for w in s.split('#')[1:]]
out
['great', 'Computer', 'Science']
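One caveat with w.split()[0]: it raises IndexError when a '#' is the last character of the string. A more defensive sketch of the same three steps follows; the function name hashtags is hypothetical, and it assumes that a '#' followed by whitespace (or nothing) should not count as a hashtag.

```python
def hashtags(text):
    """Return the word following each '#', skipping hashes with no word."""
    tags = []
    for chunk in text.split('#')[1:]:   # drop everything before the first '#'
        words = chunk.split()
        # Keep the tag only if it directly follows the '#': a chunk that is
        # empty or starts with whitespace has no tag attached to the hash.
        if words and not chunk[0].isspace():
            tags.append(words[0])
    return tags

print(hashtags("Python is #great #Computer#Science"))
```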
When you split the string using default separator (space), you get the following result:
['Python', 'is', '#great', '#Computer#Science']
You can make a replace (adding a space before a hashtag) before splitting
def hashtag(str):
    lst = []
    str = str.replace('#', ' #')
    for i in str.split():
        if i[0] == "#":
            lst.append(i[1:])
    return lst

How to modify existing Regex expression to ignore words in brackets

I have the following code
listnew = ['E-Textbooks','Dynamic', 'Case', 'Management', '(', 'DCM', ')']
nounbreak = list(itertools.chain(*[re.findall(r"\b\w+\b(?![\(\w+\)])", i) for i in listnew]))
While the above code successfully removes '-' and even '/', it is somehow not able to ignore the words in the brackets.
The ideal output required is
['E', 'Textbooks','Dynamic', 'Case', 'Management']
How do I tweak the above regex expression itself to render the above desired output?
Your problem is that your regex looks at each list element separately - it cannot "see" that there are "(" and ")" elements before/after the element it is currently looking at.
I propose cleaning your list beforehand:
import re
from itertools import chain
listnew = ['E-Textbooks','Dynamic', 'Case', 'Management', '(', 'DCM', ')']
# collect indexes of elements that are ( or ) or things between them
# does not work for ((())) - you might need to do something more elaborate
# if that can happen
remove = []
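for i,k in enumerate(listnew):
    if k == "(":
        remove.append(i)
    # remove[-1] is an index, so look up the element to see whether the
    # previously removed item was a closing parenthesis
    elif k != ")" and remove and i == remove[-1]+1 and listnew[remove[-1]] != ")":
        remove.append(i)
    elif k == ")":
        remove.append(i)
data = [k for i,k in enumerate(listnew) if i not in frozenset(remove)]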
# did not touch your regex per se - you might want to simplify it using regex101.com
nounbreak = list(chain(*[re.findall(r"\b\w+\b(?![\(\w+\)])", i) for i in data]))
print(nounbreak)
Output:
['E', 'Textbooks', 'Dynamic', 'Case', 'Management']
If you only have short lists, you could also ' '.join(..) them and clean the string of anything inside parentheses - see, for example, Regular expression to return text between parenthesis on how to find that text and remove it from the string.
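A minimal sketch of that join-and-clean idea, assuming the parentheses are balanced and never nested. It uses a plain \w+ pattern for the final split rather than the regex from the question.

```python
import re

listnew = ['E-Textbooks', 'Dynamic', 'Case', 'Management', '(', 'DCM', ')']

joined = ' '.join(listnew)
# Drop every "( ... )" group; assumes parentheses are balanced and not nested.
without_parens = re.sub(r'\([^)]*\)', '', joined)
# Split what is left into runs of word characters.
nounbreak = re.findall(r'\w+', without_parens)
print(nounbreak)
```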
This is a sparse solution just demonstrating the regex.
It joins the list on a non-word character (a comma in this case), then
runs a regex on it using findall.
The parenthesized elements come back as empty strings that can be filtered
out via a list comprehension.
The regex :
\( .*? \)
| \b
( \w+ ) # (1)
\b
Python code :
>>> import re
>>> list_orig = ['E-Textbooks','Dynamic', 'Case', 'Management', '(', 'DCM', ')']
>>> str = ','.join( list_orig )
>>> list_new = re.findall( r"\(.*?\)|\b(\w+)\b", str )
>>> list_new = [i for i in list_new if i]
>>> print( list_new )
['E', 'Textbooks', 'Dynamic', 'Case', 'Management']

Add content to the end of each (non-whitespace) line in string in python 3

Suppose I have the following string:
s = 'some text\n\nsome other text'
I now want to add the letter 'X' to the end of each line containing text so that the output is 'some textX\n\nsome other textX'. I tried
re.sub('((?!\S)$)', 'X', s, re.M)
but that only adds 'X' at the end of the string even though it is in multiline mode, i.e., the output is 'some text\n\nsome other textX'. How can I solve this problem?
Do you really need regex? You could just split on newlines, add X accordingly, and re-join. Here's one way of doing it, using yield -
In [504]: def f(s):
     ...:     for l in s.splitlines():
     ...:         yield l + ('X' if l else '')
     ...:
In [505]: '\n'.join(list(f(s)))
Out[505]: 'some textX\n\nsome other textX'
Here's an alternative using a list comprehension -
In [506]: '\n'.join([x + 'X' if x else '' for x in s.splitlines()])
Out[506]: 'some textX\n\nsome other textX'
For reference, this is how you'd do this with regex -
In [507]: re.sub(r'(?<=\S)(?=\n|$)', r'X', s, flags=re.M)
Out[507]: 'some textX\n\nsome other textX'
You need to use a look-ahead as well as a look-behind. Here's a breakdown of the expression -
(?<= # lookbehind
\S # anything that is not a whitespace character, alt - `[^\n]`
)
(?= # lookahead
\n # newline
| # regex OR
$ # end of line
)
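One detail worth knowing when trying these calls: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is taken as count. The sketch below passes the flag by keyword; with re.MULTILINE, $ already matches just before each newline, so this variant (an illustration, not the answer's pattern) needs no explicit \n alternative.

```python
import re

s = 'some text\n\nsome other text'

# flags must be passed by keyword: the fourth positional parameter is count.
result = re.sub(r'(?<=\S)$', 'X', s, flags=re.MULTILINE)
print(repr(result))
```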

How to remove non-alphanumeric characters at the beginning or end of a string

I have a list with elements that have unnecessary (non-alphanumeric) characters at the beginning or end of each string.
Ex.
'cats--'
I want to get rid of the --
I tried:
for i in thelist:
    newlist.append(i.strip('\W'))
That didn't work. Any suggestions?
def strip_nonalnum(word):
    if not word:
        return word  # nothing to strip
    for start, c in enumerate(word):
        if c.isalnum():
            break
    else:
        return ''  # no alphanumeric character at all
    for end, c in enumerate(word[::-1]):
        if c.isalnum():
            break
    return word[start:len(word) - end]

print([strip_nonalnum(s) for s in thelist])
Or
import re
def strip_nonalnum_re(word):
return re.sub(r"^\W+|\W+$", "", word)
To remove one or more chars other than letters, digits and _ from both ends you may use
re.sub(r'^\W+|\W+$', '', '??cats--') # => cats
Or, if _ is to be removed, too, wrap \W into a character class and add _ there:
re.sub(r'^[\W_]+|[\W_]+$', '', '_??cats--_')
See the Python demo:
import re
print( re.sub(r'^\W+|\W+$', '', '??cats--') ) # => cats
print( re.sub(r'^[\W_]+|[\W_]+$', '', '_??cats--_') ) # => cats
You can use a regular expression. The function re.sub() takes three parameters:
The regex pattern
The replacement
The string
Code:
import re
s = 'cats--'
output = re.sub("[^\\w]", "", s)
print(output)
Explanation:
The part "\\w" matches any alphanumeric character or underscore.
[^x] will match any character that is not x
Note that this removes non-word characters anywhere in the string, not only at the beginning or end.
I believe that this is the shortest non-regex solution:
text = "`23`12foo--=+"
while len(text) > 0 and not text[0].isalnum():
    text = text[1:]
while len(text) > 0 and not text[-1].isalnum():
    text = text[:-1]
print(text)
With strip you have to know in advance which characters should be stripped.
>>> 'cats--'.strip('-')
'cats'
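If the unwanted characters are ordinary ASCII punctuation, the standard library's string.punctuation constant can supply the set to strip. This is only a sketch: the helper name strip_punctuation is hypothetical, and it will not touch non-ASCII symbols.

```python
import string

def strip_punctuation(word):
    """Strip ASCII punctuation and whitespace from both ends of word."""
    return word.strip(string.punctuation + string.whitespace)

thelist = ['cats--', '??cats--', '--the#!cats-%', '_cats_']
print([strip_punctuation(w) for w in thelist])
```

Characters in the middle of a word, such as the "#!" in the third item, are left alone because strip only works on the ends.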
You could use re to get rid of the non-alphanumeric characters, but IMO that would be using a cannon to shoot a mouse. With str.isalpha() you can test whether a character is alphabetic, so you only need to keep those:
>>> ''.join(char for char in '#!cats-%' if char.isalpha())
'cats'
>>> thelist = ['cats5--', '#!cats-%', '--the#!cats-%', '--5cats-%', '--5!cats-%']
>>> [''.join(c for c in e if c.isalpha()) for e in thelist]
['cats', 'cats', 'thecats', 'cats', 'cats']
You want to get rid of non-alphanumeric so we can make this better:
>>> [''.join(c for c in e if c.isalnum()) for e in thelist]
['cats5', 'cats', 'thecats', '5cats', '5cats']
For this list, that is the same result you would get with re (as in Christian's answer); note that \w also matches the underscore, which isalnum() rejects:
>>> import re
>>> [re.sub("[^\\w]", "", e) for e in thelist]
['cats5', 'cats', 'thecats', '5cats', '5cats']
However, if you want to strip non-alphanumeric characters only from the beginning and end of the strings, you should use another pattern like this one (check the re documentation):
>>> [''.join(re.search(r'^\W*(.+)(?!\W*$)(.)', e).groups()) for e in thelist]
['cats5', 'cats', 'the#!cats', '5cats', '5!cats']
