Python Regular expression must strip whitespace except between quotes - python

I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This matches anything between quotes, but I need it to leave those matches alone and remove the whitespace everywhere else.

I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
    lst = text.split('"')
    for i, item in enumerate(lst):
        # even-indexed items lie outside the quotes
        if not i % 2:
            lst[i] = re.sub(r"\s+", "", item)
    return '"'.join(lst)
print(stripwhite('This is a string with some "text in quotes."'))

Here is a one-liner version, based on @kindall's idea, that does not use regex at all. First split on ", then split() every other item and re-join the pieces, which takes care of the whitespace:
stripWS = lambda txt: '"'.join(it if i % 2 else ''.join(it.split())
                               for i, it in enumerate(txt.split('"')))
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'

You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
import shlex
print(" ".join(shlex.split('Hello "world this is" a test')))
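One thing worth checking with this approach: in its default POSIX mode shlex.split drops the quote characters themselves, so the joined result no longer contains them. A minimal sketch of the difference, assuming the input has balanced double quotes that start and end whole words (posix=False keeps the quotes on the token):

```python
import shlex

text = 'Hello "world this is" a test'

# Default (POSIX) mode: a quoted run becomes one token, quotes removed.
tokens = shlex.split(text)
print(tokens)

# Non-POSIX mode: the quoted run is still one token, but keeps its quotes.
tokens_quoted = shlex.split(text, posix=False)
print(tokens_quoted)

# Joining with "" removes only the whitespace outside the quotes.
stripped = "".join(tokens_quoted)
print(stripped)
```

With posix=False the last line gives the string the question asks for; with the default mode the quotes would be lost.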

Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code (and an online demo):
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

Here is a somewhat longer version that checks for a quote without its pair. It only deals with one style of start and end delimiter (adaptable, for example, with start, end = '()').
start, end = '"', '"'
for test in ('Hello "world this is" atest',
             'This is a string with some " text inside in quotes."',
             'This is without quote.',
             'This is sentence with bad "quote'):
    result = ''
    while start in test:
        clean, _, test = test.partition(start)
        clean = clean.replace(' ', '') + start
        inside, tag, test = test.partition(end)
        if not tag:
            raise SyntaxError('Missing end quote %s' % end)
        else:
            clean += inside + tag  # whitespace inside the quotes is kept
        result += clean
    result += test.replace(' ', '')
    print(result)

Related

Python Regex to find String between two strings

I am trying to use Regex to look through a specific part of a string and take what is in between, but I can't get the right Regex pattern for this.
My biggest issue is with trying to form a Regex pattern for this. I've tried a bunch of variations close to the example listed. It should be close.
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# Regex to search between parameters and make result lowercase if there are any uppercase Chars
result = (re.search("(?<=Link Created)(.+?)(?=Research Done)", text).lower())
# Gets rid of whitespace in case they move the []/[x] around
result = result.replace(" ", "")
if any(x in result for x in toFind):
    print("Exists")
else:
    print("Doesn't Exist")
Happy Path:
I take string (text) and use Regex expression to get the substring between Link Created and Research Done.
Then make the result lowercase and get rid of whitespace just in case they move the []/[x]s. Then it looks at the string (result) for '[]' or '[x]' and print.
Actual Output:
At the moment all I keep getting is None because the Regex syntax is off...
If you want . to match newlines, you have to use the re.S option.
Also, it would seem a better idea to check if the regex matched before proceeding with further calls. Your call to lower() gave me an error because the regex didn't match, so calling result.group(0).lower() only when result evaluates as true is safer.
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# Regex to search between parameters and make result lowercase if there are any uppercase Chars
result = re.search("(?<=Link Created)(.+?)(?=Research Done)", text, re.S)
if result:
    # Gets rid of whitespace in case they move the []/[x] around
    result = result.group(0).lower().replace(" ", "")
    if any(x in result for x in toFind):
        print("Exists")
    else:
        print("Doesn't Exist")
else:
    print("re did not match")
PS: all the re options are documented in the re module documentation. Search for re.DOTALL for the details on re.S (they're synonyms). If you want to combine options, use bitwise OR. E.g., re.S|re.I will have . match newline and do case-insensitive matching.
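To make the flag combination concrete, here is a small sketch using the same text as above. The lowercase pattern and the variable names are only for illustration; with re.I the pattern matches regardless of case, and without re.S the search returns None because a newline sits between the two markers.

```python
import re

text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "

# Without re.S, "." stops at the newline, so nothing matches.
no_flags = re.search(r"(?<=Link Created)(.+?)(?=Research Done)", text)

# re.S lets "." cross newlines; re.I makes the literal text case-insensitive.
pattern = re.compile(r"(?<=link created)(.+?)(?=research done)", re.S | re.I)
match = pattern.search(text)
between = match.group(0) if match else None

print(no_flags)
print(repr(between))
```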
I believe it's the \n newline characters giving issues. You can get around this using [\s\S]+ as such:
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# New regex to match text between
# Remove all newlines, tabs, whitespace and column separators
result = re.search(r"Link Created([\s\S]+)Research Done", text).group(1)
result = re.sub(r"[\n\t\s\|]*", "", result)
if any(x in result for x in toFind):
    print("Exists")
else:
    print("Doesn't Exist")
Seems like regex is overkill for this particular job unless I am missing something (also not clear to me why you need the step that removes the whitespace from the substring). You could just split on "Link Created" and then split the following string on "Research Done".
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
s = text.split("Link Created")[1].split("Research Done")[0].lower()
if "[]" in s or "[x]" in s:
    print("Exists")
else:
    print("Doesn't Exist")
# Exists

Add automatically the quotation marks inside a string in python

I am in python and I want to add the quotation marks inside a string. Concretely, I have the following string:
'{name:robert,surname:paul}'
And I want to programmatically get the following, operating on the first
'{name:"robert",surname:"paul"}'
Is there any efficient way to perform this?
Use a regex to match the word (\w*) after : and replace it using the backreference \1.
Prefix your regex string with r (raw string) so that backslashes are kept literally and need no extra escaping.
https://repl.it/Nh29/1
import re
input_str='{name:robert,surname:paul}'
output_str=re.sub(r':(\w*)', r':"\1"', input_str )
print(output_str)
will produce
{name:"robert",surname:"paul"}
def literalize(string):
    string = string[1:-1].split(',')
    string = map(lambda s: str.split(s, ':'), string)
    return_string = ''
    for item in string:
        return_string += '%s: "%s", ' % tuple(item)
    return "{%s}" % return_string
I wouldn't ever consider this a masterpiece, but I tried not to use RegEx for this; it ended up being messy and bodgy, and it could obviously be refactored with list comprehensions and whatnot.
Two caveats: it won't work very well when a value has a comma or colon inside, and tuple(item) can be replaced by (*item,) if you prefer a more Python 3 way.
>>> literalize(a)
'{name: "robert", surname: "paul", }'
Note: the redundant , at the end does matter if you parse the result with something like json.loads(...), which rejects trailing commas (and unquoted keys), so strip it first in that case.
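For what it's worth, the loop can be condensed with str.join, which only puts the separator between items and so leaves no trailing comma. This is just a sketch under the same assumption as above (no value contains a comma); literalize_joined is a made-up name, not part of the original answer.

```python
def literalize_joined(string):
    # Drop the surrounding braces, then split into [key, value] pairs.
    pairs = (pair.split(":", 1) for pair in string[1:-1].split(","))
    # join() inserts ", " only between items, so nothing trails at the end.
    return "{%s}" % ", ".join('%s: "%s"' % (key, value) for key, value in pairs)

quoted = literalize_joined("{name:robert,surname:paul}")
print(quoted)
```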

How to remove words after certain character in a line in python [duplicate]

I have a string. How do I remove all text after a certain character? (In this case ...)
The text after the ... will change, so that's why I want to remove all characters after a certain one.
Split on your separator at most once, and take the first piece:
sep = '...'
stripped = text.split(sep, 1)[0]
You didn't say what should happen if the separator isn't present. Both this and Alex's solution will return the entire string in that case.
Assuming your separator is '...', but it can be any string.
text = 'some string... this part will be removed.'
head, sep, tail = text.partition('...')
>>> print(head)
some string
If the separator is not found, head will contain all of the original string.
The partition function was added in Python 2.5.
S.partition(sep) -> (head, sep, tail)
Searches for the separator sep in S, and returns the part before it,
the separator itself, and the part after it. If the separator is not
found, returns S and two empty strings.
If you want to remove everything after the last occurrence of separator in a string I find this works well:
<separator>.join(string_to_split.split(<separator>)[:-1])
For example, if string_to_split is a path like root/location/child/too_far.exe and you only want the folder path, you can use "/".join(string_to_split.split("/")[:-1]) and you'll get
root/location/child
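A possibly simpler variant of the same idea is str.rsplit with maxsplit=1, which splits once starting from the right. The sketch below also shows where the two forms differ: when the separator is absent, rsplit returns the whole string, while the join-based form returns an empty string.

```python
path = "root/location/child/too_far.exe"

# Split once from the right and keep the left-hand piece.
folder = path.rsplit("/", 1)[0]
print(folder)

# Behaviour when the separator does not occur at all.
with_rsplit = "no_separator".rsplit("/", 1)[0]
with_join = "/".join("no_separator".split("/")[:-1])
print(repr(with_rsplit), repr(with_join))
```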
Without a regular expression (which I assume is what you want):
def remafterellipsis(text):
    where_ellipsis = text.find('...')
    if where_ellipsis == -1:
        return text
    # the + 3 keeps the ellipsis itself and drops what follows it
    return text[:where_ellipsis + 3]
or, with a regular expression:
import re
def remwithre(text, there=re.compile(re.escape('...')+'.*')):
    return there.sub('', text)
import re
test = "This is a test...we should not be able to see this"
res = re.sub(r'\.\.\..*',"",test)
print(res)
Output: "This is a test"
The index method returns the position of a character in a string (and raises ValueError if the character is absent). Then, if you want to remove everything from that character onward, do this:
mystring = "123⋯567"
mystring[ 0 : mystring.index("⋯")]
>> '123'
If you want to keep the character, add 1 to the character position.
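As a sketch of both variants in one place, the helper below uses find instead of index, so a missing character returns the string unchanged rather than raising. The name cut_at and its keep parameter are made up for illustration.

```python
def cut_at(text, char, keep=False):
    pos = text.find(char)          # -1 when char does not occur
    if pos == -1:
        return text
    # Adding len(char) keeps the character itself in the result.
    return text[:pos + len(char)] if keep else text[:pos]

print(cut_at("123⋯567", "⋯"))
print(cut_at("123⋯567", "⋯", keep=True))
print(cut_at("1234567", "⋯"))
```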
From a file:
sep = '...'
with open("requirements.txt") as file_in:
    for line in file_in:
        res = line.split(sep, 1)[0]
        print(res)
This works for me in Python 3.7.
In my case I need to remove everything after the dot in my string variable fees:
fees = "45.05"
split_string = fees.split(".", 1)
substring = split_string[0]
print(substring)
Yet another way to remove all characters after the last occurrence of a character in a string (assume that you want to remove all characters after the final '/').
path = 'I/only/want/the/containing/directory/not/the/file.txt'
# "path and" stops the loop if the string contains no '/' at all
while path and path[-1] != '/':
    path = path[:-1]
Another easy way, using re:
import re
text = 'some string... this part will be removed.'
text = re.search(r'(\A.*)\.\.\..+', text, re.DOTALL | re.IGNORECASE).group(1)
# text == 'some string'

replacing all regex matches in single line

I have a dynamic regexp, and I don't know in advance how many groups it has.
I would like to replace all matches with XML tags.
Example:
re.sub("(this).*(string)", '<markup>\anygroup</markup>', "this is my string")
>> "<markup>this</markup> is my <markup>string</markup>"
Is that even possible in a single line?
For a constant regexp like in your example, do
re.sub("(this)(.*)(string)",
       r'<markup>\1</markup>\2<markup>\3</markup>',
       text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
                         else s for n, s in enumerate(m.groups())),
       text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
       lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
                         else s for n, s in enumerate(m.groups())),
       text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
pattern = "(this).*(string)"
def replacement(m):
    s = m.group()
    n_groups = len(m.groups())
    # assume groups do not overlap and are listed left-to-right
    for i in range(n_groups, 0, -1):
        lo, hi = m.span(i)
        s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
    return s
re.sub(pattern, replacement, text)
If you need to handle overlapping groups, you're on your own, but it should be doable.
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'
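If the word list may contain regex metacharacters, it is probably safer to pass each word through re.escape before joining. A minimal sketch, with a made-up word list in which "v1.0" contains a dot; without escaping, the dot would also match "v1x0".

```python
import re

mywords = ["this", "v1.0"]          # hypothetical list; "." is a metacharacter
sample = "this is v1.0 not v1x0"

# Escape each word so it is matched literally, then build the alternation.
escaped_re = r"\b(" + "|".join(re.escape(w) for w in mywords) + r")\b"
marked = re.sub(escaped_re, r"<markup>\1</markup>", sample)
print(marked)

# For comparison: the unescaped pattern also marks up "v1x0".
unescaped_re = r"\b(" + "|".join(mywords) + r")\b"
marked_loose = re.sub(unescaped_re, r"<markup>\1</markup>", sample)
print(marked_loose)
```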

