Regex matching optional pattern/string - python

I have a list of strings, and I'm trying to write regex that captures groups of strings that may or may not contain a certain pattern.
any ascii character string
another string = other stuff
string = another string = string
I'm trying to capture the part of the string before the first occurrence of the pattern (" = ") and the part after it. I've tried this:
\s*?([\x00-\x7F]+)( - )?(.*)?
but then it just captures the entire string as one group.
How would I do this?

You can solve this with regular expressions:
>>> import re
>>> text = '''any ascii character string
... another string = other stuff
... string = another string = string'''
>>> re.findall('^([^=]+?)(?: = (.*?))?$', text, re.M)
[('any ascii character string', ''),
 ('another string', 'other stuff'),
 ('string', 'another string = string')]
But in this case, it would be a lot easier to just split by line and then split each line at the first equals character:
>>> [line.split('=', 1) for line in text.splitlines()]
[['any ascii character string'],
['another string ', ' other stuff'],
['string ', ' another string = string']]
If you don’t like that whitespace, just strip it away:
>>> [list(map(str.strip, line.split('=', 1))) for line in text.splitlines()]
[['any ascii character string'],
['another string', 'other stuff'],
['string', 'another string = string']]
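The same split-at-the-first-separator idea can also be sketched with str.partition, which always returns three parts, so lines without the separator need no special casing. The helper name split_on_first and the choice of " = " (including the surrounding spaces) as the separator are assumptions made for illustration, not part of the answer above.

```python
def split_on_first(line, sep=" = "):
    """Return (before, after) around the first occurrence of sep.

    `after` is None when sep does not occur in the line.
    """
    before, found, after = line.partition(sep)
    return (before, after if found else None)


text = """any ascii character string
another string = other stuff
string = another string = string"""

pairs = [split_on_first(line) for line in text.splitlines()]
for pair in pairs:
    print(pair)
```

Because the separator includes the spaces, no stripping is needed afterwards; pass sep="=" and strip the parts if the spacing around the equals sign may vary.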


How do I ignore the spaces in a string inputted by the user? [duplicate]

I want to eliminate all the whitespace from a string, on both ends, and in between words.
I have this Python code:
def my_handle(self):
    sentence = ' hello apple '
    sentence.strip()
But that only eliminates the whitespace on both sides of the string. How do I remove all whitespace?
If you want to remove leading and ending spaces, use str.strip():
>>> " hello apple ".strip()
'hello apple'
If you want to remove all space characters, use str.replace() (NB this only removes the “normal” ASCII space character ' ' U+0020 but not any other whitespace):
>>> " hello apple ".replace(" ", "")
'helloapple'
If you want to remove duplicated spaces, use str.split() followed by str.join():
>>> " ".join(" hello apple ".split())
'hello apple'
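To illustrate the note about str.replace() only touching the ASCII space, here is a small sketch assuming Python 3, where str.split() with no arguments splits on any Unicode whitespace. The sample string and variable names are made up for the example.

```python
# U+00A0 is a no-break space and \t is a tab; both count as whitespace,
# but neither is the ASCII space character U+0020.
s = " hello\u00a0apple\t"

only_ascii_spaces_removed = s.replace(" ", "")
all_whitespace_removed = "".join(s.split())

print(repr(only_ascii_spaces_removed))
print(repr(all_whitespace_removed))
```

The first result still contains the no-break space and the tab; the second contains only the letters.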
To remove only spaces use str.replace:
sentence = sentence.replace(' ', '')
To remove all whitespace characters (space, tab, newline, and so on) you can use split then join:
sentence = ''.join(sentence.split())
or a regular expression:
import re
pattern = re.compile(r'\s+')
sentence = re.sub(pattern, '', sentence)
If you want to only remove whitespace from the beginning and end you can use strip:
sentence = sentence.strip()
You can also use lstrip to remove whitespace only from the beginning of the string, and rstrip to remove whitespace from the end of the string.
An alternative is to use regular expressions and match these strange white-space characters too. Here are some examples:
Remove ALL spaces in a string, even between words:
import re
sentence = re.sub(r"\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the BEGINNING of a string:
import re
sentence = re.sub(r"^\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the END of a string:
import re
sentence = re.sub(r"\s+$", "", sentence, flags=re.UNICODE)
Remove spaces both in the BEGINNING and in the END of a string:
import re
sentence = re.sub(r"^\s+|\s+$", "", sentence, flags=re.UNICODE)
Remove ONLY DUPLICATE spaces:
import re
sentence = " ".join(re.split(r"\s+", sentence, flags=re.UNICODE))
(All examples work in both Python 2 and Python 3)
"Whitespace" includes spaces, tabs, and CRLF. An elegant one-liner string function we can use is str.translate:
Python 3
' hello apple '.translate(str.maketrans('', '', ' \n\t\r'))
OR if you want to be thorough:
import string
' hello apple'.translate(str.maketrans('', '', string.whitespace))
Python 2
' hello apple'.translate(None, ' \n\t\r')
OR if you want to be thorough:
import string
' hello apple'.translate(None, string.whitespace)
For removing whitespace from beginning and end, use strip.
>>> " foo bar ".strip()
'foo bar'
' hello \n\tapple'.translate({ord(c):None for c in ' \n\t\r'})
MaK already pointed out the "translate" method above. And this variation works with Python 3 (see this Q&A).
In addition, strip has some variations:
Remove spaces in the BEGINNING and END of a string:
sentence = sentence.strip()
Remove spaces in the BEGINNING of a string:
sentence = sentence.lstrip()
Remove spaces in the END of a string:
sentence = sentence.rstrip()
All three string functions (strip, lstrip, and rstrip) can take a parameter naming the characters to strip, with the default being all whitespace. This can be helpful when you are working with something particular. For example, you could remove only spaces but not newlines:
" 1. Step 1\n".strip(" ")
Or you could remove extra commas when reading in a string list:
"1,2,3,".strip(",")
Be careful:
strip does a rstrip and lstrip (removes leading and trailing spaces, tabs, returns and form feeds, but it does not remove them in the middle of the string).
If you only replace spaces and tabs you can end up with hidden CRLFs that appear to match what you are looking for, but are not the same.
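A minimal sketch of that pitfall, using a made-up line with a Windows-style line ending: after removing only spaces and tabs the string looks right when printed, yet it does not compare equal to the expected value.

```python
line = "apple \r\n"
expected = "apple"

# Remove only spaces and tabs: the carriage return and newline survive.
partially_cleaned = line.replace(" ", "").replace("\t", "")
print(repr(partially_cleaned))
print(partially_cleaned == expected)

# Remove every whitespace character, including \r and \n.
fully_cleaned = "".join(line.split())
print(fully_cleaned == expected)
```

Printing with repr() is what exposes the hidden characters; a plain print() would show what looks like a clean "apple".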
eliminate all the whitespace from a string, on both ends, and in between words.
>>> import re
>>> re.sub(r"\s+",  # one or more repetitions of whitespace
...        '',      # replace with empty string (i.e. remove)
...        ''' hello
... apple
... ''')
'helloapple'
https://en.wikipedia.org/wiki/Whitespace_character
Python docs:
https://docs.python.org/library/stdtypes.html#textseq
https://docs.python.org/library/stdtypes.html#str.replace
https://docs.python.org/library/string.html#string.replace
https://docs.python.org/library/re.html#re.sub
https://docs.python.org/library/re.html#regular-expression-syntax
I use split() to drop all whitespace and join() to concatenate the remaining strings.
sentence = ''.join(' hello apple '.split())
print(sentence) #=> 'helloapple'
I prefer this approach because it is a single expression (not a statement).
It is easy to use, and it can be used without binding the result to a variable.
print(''.join(' hello apple '.split())) # no need to bind to a variable
import re
sentence = ' hello  apple'
re.sub(' ', '', sentence)    # 'helloapple' (remove all spaces)
re.sub('  ', ' ', sentence)  # ' hello apple' (collapse double spaces)
In the following script we import the regular expression module which we use to substitute one space or more with a single space. This ensures that the inner extra spaces are removed. Then we use strip() function to remove leading and trailing spaces.
# Import regular expression module
import re
# Initialize string
a = " foo bar "
# First replace any number of spaces with a single space
a = re.sub(' +', ' ', a)
# Then strip any leading and trailing spaces.
a = a.strip()
# Show results
print(a)
I found that this works the best for me:
test_string = ' test a s test '
string_list = [s.strip() for s in str(test_string).split()]
final_string = ' '.join(string_list)
# final_string: 'test a s test'
It removes any whitespaces, tabs, etc.
Try this. Instead of using re, I think using split with strip is much better:
def my_handle(self):
    sentence = ' hello apple '
    ' '.join(x.strip() for x in sentence.split())
    # hello apple
    ''.join(x.strip() for x in sentence.split())
    # helloapple

Regex: Use \b (word boundary) separator but ignore some characters

Given this example:
s = "Hi, domain: (foo.bar.com) bye"
I'd like to create a regex that matches both word and non-word strings, separately, i.e:
re.findall(regex, s)
# Returns: ["Hi", ", ", "domain", ": (", "foo.bar.com", ") ", "bye"]
My approach was to use the word boundary separator \b to catch any string that is bound by two word-to-non-word switches. From the re module docs:
\b is defined as the boundary between a \w and a \W character (or vice versa)
Therefore I tried as a first step:
regex = r'(?:^|\b).*?(?=\b|$)'
re.findall(regex, s)
# Returns: ["Hi", ",", "domain", ": (", "foo", ".", "bar", ".", "com", ") ", "bye"]
The problem is that I don't want the dot (.) character to be a separator too, I'd like the regex to see foo.bar.com as a whole word and not as three words separated by dots.
I tried to find a way to use a negative lookahead on dot but did not manage to make it work.
Is there any way to achieve that?
I don't mind that the dot won't be a separator at all in the regex, it doesn't have to be specific to domain names.
I looked at Regex word boundary alternative, Capture using word boundaries without stopping at "dot" and/or other characters and Regex word boundary excluding the hyphen but it does not fit my case as I cannot use the space as a separator condition.
Exclude some characters from word boundary is the only one that got me close, but I didn't manage to reach it.
You may use this regex in findall:
\w+(?:\.\w+)*|\W+
Which finds a word followed by 0 or more repeats of dot separated words or 1+ of non-word characters.
Code:
import re
s = "Hi, domain: (foo.bar.com) bye"
print (re.findall(r'\w+(?:\.\w+)*|\W+', s))
Output:
['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
For your example, you could just split on [^\w.]+, using a capturing group around it to keep those values in the output:
import re
s = "Hi, domain: (foo.bar.com) bye"
re.split(r'([^\w.]+)', s)
# ['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
If your string might start or end with non-word/space characters, you can filter out the resulting empty strings in the list with a comprehension:
s = "!! Hello foo.bar.com, your domain ##"
re.split(r'([^\w.]+)', s)
# ['', '!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##', '']
[w for w in re.split(r'([^\w.]+)', s) if len(w)]
# ['!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##']
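The findall and split approaches above agree on the example from the question, but they treat a trailing dot differently. The sketch below wraps each one in a hypothetical helper (tokens_findall and tokens_split are names invented here) and runs both on a made-up string ending in a dot.

```python
import re


def tokens_findall(s):
    # A dot is part of a word only when it sits between word characters.
    return re.findall(r'\w+(?:\.\w+)*|\W+', s)


def tokens_split(s):
    # Dots are never separators; drop the empty strings split() can produce.
    return [t for t in re.split(r'([^\w.]+)', s) if t]


s = "see foo.bar.com."
print(tokens_findall(s))
print(tokens_split(s))
```

The findall version emits the final dot as its own token, while the split version keeps it attached to the domain. Which behavior is preferable depends on the requirements.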
Lookarounds let you easily say "dot, except if it's surrounded by alphabetics on both sides" if that's what you mean;
re.findall(r'(?:^|\b)(\w+(?:\.\w+)*|\W+)(?!\.\w)(?=\b|$)', s)
or simply "word boundary, unless it's a dot":
re.findall(r'(?:^|(?<!\.)\b(?!\.)).+?(?=(?<!\.)\b(?!\.)|$)', s)
Notice that the latter will join text across a word boundary if it's a dot; so, for example, 'bye. ' would be extracted as one string.
(Perhaps try to be more precise about your requirements?)
Demo: https://ideone.com/dvQhFO

Remove items containing only special characters from a list

From a PDF file I extract all the text as a string and convert it into a list by splitting on newlines (one or more), on runs of two or more whitespace characters, and on every dot (.).
Now, if an item in my list consists only of special characters, that item should be excluded.
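import re
import PyPDF2

pdfFileObj = open('Python String.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
text = pageObj.extractText()
# Split on newlines, literal dots, and runs of two or more whitespace characters
z = re.split(r"\n+|[.]|\s{2,}", text)
# Drop the empty strings left behind by adjacent separators
while "" in z:
    z.remove("")
print(z)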
My output is
['split()', 'method in Python split a string into a list of strings after breaking the', 'given string by the specified separator', 'Syntax', ':', 'str', 'split(separator, maxsplit)', 'Parameters', ':', 'separator', ':', 'This is a delimiter', ' The string splits at this specified separator', ' If is', 'no', 't provided then any white space is a separator', 'maxsplit', ':', 'It is a number, which tells us to split the string into maximum of provi', 'ded number of times', ' If it is not provided then the default is', '-', '1 that means there', 'is no limit', 'Returns', ':', 'Returns a list of s', 'trings after breaking the given string by the specifie', 'd separator']
Here are some values that contain only special characters and I want to remove those. Thanks
Remove those special characters before converting the text to a list.
Remove the while("" in z) : z.remove("") loop and add the following line after the text variable is read:
text = re.sub('(a|b|c)', '', text)
In this example, my special characters are a, b and c.
Use a regular expression that tests if a string contains any letters or numbers.
import re
z = [x for x in z if re.search(r'[a-z\d]', x, flags=re.I)]
In the regexp, a-z matches letters, \d matches digits, so [a-z\d] matches any letter or digit (and the re.I flag makes it case-insensitive). So the list comprehension includes any elements of z that contain a letter or digit.
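As a quick illustration, here is the filter applied to a few items taken from the output in the question. The second variant, based on str.isalnum, is an alternative not mentioned in the answer; it is shown as an assumption-laden sketch for text that may contain letters outside a-z.

```python
import re

# A handful of items from the question's output
z = ['split()', 'Syntax', ':', 'str', '-', '1 that means there']

# Keep items containing at least one ASCII letter or digit
has_alnum = re.compile(r'[a-z\d]', flags=re.I)
filtered = [x for x in z if has_alnum.search(x)]
print(filtered)

# Alternative: keep items containing any alphanumeric character
filtered_unicode = [x for x in z if any(c.isalnum() for c in x)]
print(filtered_unicode)
```

On this sample both variants drop ':' and '-' and keep everything else.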

Replace symbol before match using regex in Python

I have strings such as:
text1 = ('SOME STRING,99,1234 FIRST STREET,9998887777,ABC')
text2 = ('SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF')
text3 = ('ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI')
Desired output:
SOME STRING 99,1234 FIRST STREET,9998887777,ABC
SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF
ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI
My idea: Use regex to find occurrences of 1-5 digits, possibly preceded by a symbol, that are between two commas and not followed by a space and letters, then replace by this match without the preceding comma.
Something like:
text.replace(r'(,\d{0,5},)','.........')
If you would use regex module instead of re then possibly:
import regex
str = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(regex.sub(r'(?<!^.*,.*),(?=#?\d+,\d+)', ' ', str))
You might be able to use re if you are sure no other substring matches the pattern in the lookahead.
import re
str = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(re.sub(r',(?=#?\d+,\d+)', ' ', str))
Easier to read alternative if SOME STRING, SOME OTHER STRING, and ANOTHER STRING never contain commas:
text1.replace(",", " ", 1)
which just replaces the first comma with a space. Note that this would also rewrite text2, whose desired output keeps its first comma.
Simple, yet effective:
import re
my_pattern = r"(,)(\W?\d{0,5},)"
p = re.compile(my_pattern)
p.sub(r" \2", text1) # 'SOME STRING 99,1234 FIRST STREET,9998887777,ABC'
p.sub(r" \2", text2) # 'SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF'
p.sub(r" \2", text3) # 'ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI'
Secondary pattern with non-capturing group and verbose compilation:
my_pattern = r"""
    (?:,)          # Non-capturing group for a single comma.
    (\W?\d{0,5},)  # Capture zero or one non-word character, zero to five digits, and a comma.
"""
# re.X (verbose mode) ignores whitespace in the pattern and allows inline comments
p = re.compile(my_pattern, flags=re.X)
# This time we use \1 to refer to the first (and only) captured group
p.sub(r" \1", text1)
p.sub(r" \1", text2)
p.sub(r" \1", text3)
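One way to check the pattern is to run it over the three sample strings from the question and compare against the desired output. The dictionary layout and the names samples and results are choices made for this sketch only.

```python
import re

# A comma, then an optional non-word character, zero to five digits,
# and a closing comma. Only the part after the first comma is captured.
pattern = re.compile(r",(\W?\d{0,5},)")

# Input string -> desired output, both taken from the question
samples = {
    'SOME STRING,99,1234 FIRST STREET,9998887777,ABC':
        'SOME STRING 99,1234 FIRST STREET,9998887777,ABC',
    'SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF':
        'SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF',
    'ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI':
        'ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI',
}

results = {src: pattern.sub(r" \1", src) for src in samples}
for src, out in results.items():
    print(out, out == samples[src])
```

Because \d{0,5} also allows zero digits, an empty field such as ",," would be rewritten as well; tightening it to \d{1,5} is worth considering if empty fields can occur.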

Replace substrings with items from list

Basically, I have a string that has multiple double-whitespaces like this:
"Some text\s\sWhy is there no punctuation\s\s"
I also have a list of punctuation marks that should replace the double-whitespaces, so that the output would be this:
puncts = ['.', '?']
# applying some function
# output:
>>> "Some text. Why is there no punctuation?"
I have tried re.sub(' +', puncts[i], text) but my problem here is that I don't know how to properly iterate through the list and replace the 1st double-whitespace with the 1st element in puncts, the 2nd double-whitespace with the 2nd element in puncts and so on.
If we're still using re.sub(), here's one possible solution that follows this basic pattern:
Get the next punctuation character.
Replace only the first remaining double whitespace in text with that character.
import re

puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s(?=\s)', i, text, count=1)
The call to re.sub() returns a string, and basically says "find all series of two whitespace characters, but only replace the first whitespace character with a punctuation character." The final argument "1" makes it so that we only replace the first instance of the double whitespace, and not all of them (default behavior).
If the positive lookahead (the part of the regex that we want to match but not replace) confuses you, you can also do without it:
puncts = ['.', '?']
text = "Some text  Why is there no punctuation  "
for i in puncts:
    text = re.sub(r'\s\s', i + " ", text, count=1)
This yields the same output.
There will be a leftover whitespace at the end of the sentence, but if you're picky about that, a simple text.rstrip() should take care of it.
Further explanation
Your first attempt with the regex ' +' doesn't work because it matches every run of one or more spaces, including the single spaces between words, and replaces each of them with a punctuation character. The solutions above match exactly two whitespace characters instead.
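As a variation on the loop above, re.sub() also accepts a function as the replacement, which allows a single pass over the text. This is a different technique from the one in this answer, and the helper name fill_double_spaces is hypothetical.

```python
import re


def fill_double_spaces(text, puncts):
    """Replace each pair of whitespace characters with the next mark plus a space.

    Pairs found after `puncts` is exhausted are left unchanged.
    """
    marks = iter(puncts)

    def replacement(match):
        mark = next(marks, None)
        # Keep the original whitespace once the marks are used up.
        return match.group(0) if mark is None else mark + " "

    return re.sub(r"\s\s", replacement, text)


result = fill_double_spaces("Some text  Why is there no punctuation  ", [".", "?"])
print(repr(result.rstrip()))
```

The replacement function is called once per match, in order, so each double whitespace consumes exactly one item from the list.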
You can do it simply using the replace method!
text = "Some text  Why is there no punctuation  "
puncts = ['.', '?']
for i in puncts:
    text = text.replace("  ", i, 1)  # notice the 1 here
print(text)
Output : Some text.Why is there no punctuation?
You can use re.split() to break the string into substrings between the double spaces and intersperse the punctuation marks using join:
import re
string = "Some text  Why is there no punctuation  "
iPunct = iter([". ","? "])
result = "".join(x+next(iPunct,"") for x in re.split(r"\s\s",string))
print(result)
# Some text. Why is there no punctuation?
