I am using re.sub in order to forcibly convert a "bad" string into a "valid" string via regex. I am struggling with creating the right regex that will parse a string and "remove the bad parts". Specifically, I would like to force a string to be all alphabetical, and allow for a single space between words. Any values that disagree with this rule I would like to substitute with ''. This includes multiple spaces. Any help would be appreciated!
import re
list_of_strings = ["3He2l2lo Wo45rld!", "Hello World- -number two-", "Hello World number .. three"]
for str in list_of_strings:
    print(re.sub(r'[^A-Za-z]+([^\s][A-Za-z])*', '' , str))
I would like the output to be:
Hello World
Hello World number two
Hello World number three
Try the following. It matches every run of characters that should be removed; a run that contains at least one space is replaced with a single space, and any other run is replaced with the empty string.
import re
list_of_strings = ["3He2l2lo Wo45rld!", "Hello World- -number two-", "Hello World number .. three"]
for str in list_of_strings:
    print(re.sub(r'((?:[^A-Za-z\s]|\s)+)', lambda x: ' ' if ' ' in x.group(0) else '', str))
It yields:
Hello World
Hello World number two
Hello World number three
I would prefer two passes to simplify the regex. The first pass removes non-alphabetic characters, the second collapses runs of whitespace.
pass1 = re.sub(r'[^A-Za-z\s]', '', str)  # remove non-alpha
pass2 = re.sub(r'\s+', ' ', pass1)  # collapse whitespace runs to one space
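Wrapped in a function, the two passes might look like the sketch below. The name clean and the final strip() are assumptions added for illustration; the strip only matters when removed characters leave a space at either end.

```python
import re

def clean(text):
    """Keep only letters, with single spaces between words."""
    # Pass 1: drop everything that is neither a letter nor whitespace.
    letters_and_spaces = re.sub(r'[^A-Za-z\s]', '', text)
    # Pass 2: collapse each run of whitespace into one space.
    collapsed = re.sub(r'\s+', ' ', letters_and_spaces)
    # Assumption: leading and trailing spaces are unwanted.
    return collapsed.strip()

for s in ["3He2l2lo Wo45rld!", "Hello World- -number two-", "Hello World number .. three"]:
    print(clean(s))
```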
I am trying to split a string, replace the separator with another string, and get the result back as a list. It's hard to explain, so here is an example:
I have string in variable a:
a = "Hello World!"
I want a list such that:
a.split("Hello").replace("Hey") == ["Hey"," World!"]
That is, I want to split a string and put another string in place of the separator in the resulting list. So if a is
a = "Hello World! Hello Everybody"
and I use something like a.split("Hello").replace("Hey") , then the output should be:
a = ["Hey"," World! ","Hey"," Everybody"]
How can I achieve this?
From your examples it sounds a lot like you want to replace all occurrences of Hello with Hey and then split on spaces.
What you are currently doing can't work, because replace needs two arguments and it's a method of strings, not lists. When you split your string, you get a list.
>>> a = "Hello World!"
>>> a = a.replace("Hello", "Hey")
>>> a
'Hey World!'
>>> a.split(" ")
['Hey', 'World!']
x = "HelloWorldHelloYou!"
y = x.replace("Hello", "\nHey\n").lstrip("\n").split("\n")
print(y) # ['Hey', 'World', 'Hey', 'You!']
This is a rather brute-force approach. You can replace \n with any character you're not expecting to find in your string (or even something like XXXXX). The lstrip removes the leading \n if your string starts with Hello.
Alternatively, there's regex :)
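One way the regex route might look is sketched below. It relies on re.split keeping the separators when the pattern is wrapped in a capture group. The function name split_and_replace is made up for this example, and dropping empty pieces is an assumption based on the output the question asks for.

```python
import re

def split_and_replace(s, old, new):
    # The capture group makes re.split return the separators as well.
    parts = re.split('(' + re.escape(old) + ')', s)
    # Separators land at odd indices: swap them for `new`.
    # Empty pieces (e.g. when `s` starts with `old`) are dropped.
    return [new if i % 2 else part
            for i, part in enumerate(parts)
            if i % 2 or part]

print(split_and_replace("Hello World! Hello Everybody", "Hello", "Hey"))
# ['Hey', ' World! ', 'Hey', ' Everybody']
```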
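This function can do it:
def replace_split(s, old, new):
    # Pair every piece with `new`, flatten the pairs, and drop the trailing `new`.
    # An empty string is kept wherever `s` starts or ends with `old`.
    return sum([[blk, new] for blk in s.split(old)], [])[:-1]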
It wasn't clear whether you wanted to split on spaces or on uppercase letters.
import re
#Replace all 'Hello' with 'Hey'
a = 'HelloWorldHelloEverybody'
a = a.replace('Hello', 'Hey')
#This will separate the string by uppercase character
re.findall('[A-Z][^A-Z]*', a)  # ['Hey', 'World', 'Hey', 'Everybody']
You can do this with iteration:
a = a.split(' ')
for i, word in enumerate(a):
    if word == 'Hello':
        a[i] = 'Hey'
(Before I start: I'm doing this in python)
So basically I need my single regex to match all quotation marks immediately before and after my html QUOT tags: If a quotation mark exists in those spaces, I need it to match.
Example:
<QUOT.START> Hello, this doesn't match! <\QUOT.END>
"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "
I have 4 different regexes for this purpose:
1. \"+(?=<QUOT\.START>)
2. (?<=<QUOT\.START>)\"+
3. \"+(?=<\\QUOT\.END>)
4. (?<=<\\QUOT\.END>)\"+
Can I merge these 4 into basically one?
If you're able to use the newer regex module (which supports variable-width lookbehind), you can condense your expression to
(?<=<\\?QUOT\.(?:START|END)>[\t ]*)"   # a quote after <QUOT.START> or <\QUOT.END>,
                                       # optionally preceded by whitespace
|
"(?=[\t ]*<\\?QUOT\.(?:START|END)>)    # a quote before <QUOT.START> or <\QUOT.END>,
                                       # optionally followed by whitespace
Without verbose mode:
(?<=<\\?QUOT\.(?:START|END)>[\t ]*)"|"(?=[\t ]*<\\?QUOT\.(?:START|END)>)
Generally speaking this is:
(?<=<tag><optional whitespace>)quote|quote(?=<optional whitespace><tag>)
In Python:
import regex as re
string = r"""
<QUOT.START> Hello, this doesn't match! <\QUOT.END>
"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "
"""
rx = re.compile(r'''(?<=<\\?QUOT\.(?:START|END)>[\t ]*)"|"(?=[\t ]*<\\?QUOT\.(?:START|END)>)''')
for m in rx.finditer(string):
    print(m.group(0))
    print(m.span())
This brings up four quotes and their positions.
@ctwheels helped me figure out this (super simple) solution. Being a total newbie at regexes, I didn't know about the | (pipe) syntax. So here is the final regex I wanted (and it works!):
\"+(?=<QUOT\.START>)|(?<=<QUOT\.START>)\"+|\"+(?=<\\QUOT\.END>)|(?<=<\\QUOT\.END>)\"+
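As a quick check with the standard re module, the merged pattern can be tried on a small sample. The sample string and variable names below are assumptions made for this sketch. The quotes sit directly against the tags, because none of the four alternatives allows whitespace in between.

```python
import re

# The four original patterns joined with | (alternation).
pattern = re.compile(
    r'"+(?=<QUOT\.START>)'      # quotes right before the opening tag
    r'|(?<=<QUOT\.START>)"+'    # quotes right after the opening tag
    r'|"+(?=<\\QUOT\.END>)'     # quotes right before the closing tag
    r'|(?<=<\\QUOT\.END>)"+'    # quotes right after the closing tag
)

# Hypothetical sample: quotes are adjacent to the tags, with no spaces.
sample = r'"<QUOT.START>"Hello"<\QUOT.END>"'

spans = [m.span() for m in pattern.finditer(sample)]
cleaned = pattern.sub('', sample)
print(spans)
print(cleaned)
```

If whitespace may appear between a quote and a tag, as in the question's example, these fixed-width lookbehinds will not match those quotes.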
You can try this:
import re
s = r'<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> '
strings = re.findall(r'"(.*?)"', s)
Output:
['Hello, this will call 4 matches! ']
I am trying to remove all the single characters in a string
input: "This is a big car and it has a spacious seats"
my output should be:
output: "This is big car and it has spacious seats"
Here I am using the expression
import re
re.compile('\b(?<=)[a-z](?=)\b')
This matches with first single character in the string ...
Any help would be appreciated ...thanks in Advance
Edit: I have just seen that this was suggested in the comments first by Wiktor Stribiżew. Credit to him - I had not seen when this was posted.
You can also use re.sub() to automatically remove single characters (assuming you only want to remove alphabetical characters). The following will replace any occurrences of a single alphabetical character:
import re
input = "This is a big car and it has a spacious seats"
output = re.sub(r"\b[a-zA-Z]\b", "", input)
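>>> output
'This is  big car and it has  spacious seats'
Note that only the letter itself is removed, so the spaces on both sides of it remain and leave a double space.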
You can learn more about inputting regex expression when replacing strings here: How to input a regex in string.replace?
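A small sketch that combines the substitution with whitespace normalization, so that no double spaces are left behind. The function name remove_single_letters is an assumption for this example.

```python
import re

def remove_single_letters(text):
    # Delete every letter that stands alone between word boundaries...
    without_singles = re.sub(r'\b[a-zA-Z]\b', '', text)
    # ...then rejoin the remaining words with single spaces.
    return ' '.join(without_singles.split())

print(remove_single_letters("This is a big car and it has a spacious seats"))
```

Keep in mind that \b also treats an apostrophe as a boundary, so the "t" in a word like "doesn't" would be removed as well.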
Here's one way to do it by splitting the string and filtering out single length letters using len and str.isalpha:
>>> s = "1 . This is a big car and it has a spacious seats"
>>> ' '.join(i for i in s.split() if not (i.isalpha() and len(i)==1))
'1 . This is big car and it has spacious seats'
re.sub(r' \w{1} |^\w{1} | \w{1}$', ' ', input)
EDIT:
You can use:
import re
input_string = "This is a big car and it has a spacious seats"
str_without_single_chars = re.sub(r'(?:^| )\w(?:$| )', ' ', input_string).strip()
or (which, as was brought to my attention, doesn't meet the specification, because it also drops every word of three letters or fewer):
input_string = "This is a big car and it has a spacious seats"
' '.join(w for w in input_string.split() if len(w)>3)
A quick way to remove words, characters, or anything else between two known tags or two known characters in a string is to use re.sub to turn both delimiters into comment markers and then delete the comment, as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything between the tags. For simple, well-delimited input like this, it can be faster and simpler than using Beautiful Soup. Note that . does not match newlines unless re.DOTALL is passed, so a block that spans several lines needs that flag.
The comment markers are only a convenient pair of delimiters. Python's regular expressions follow the same conventions as most other regex engines, so there is no need to iterate many times when a single substitution can remove each span as one chunk. The same approach works with single characters:
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
And finally
var = re.sub('<!--.*?-->', '', var)  # removes the markers along with everything between them
And you do not need Beautiful Soup. You can also scrape data this way if you understand how it works.
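The same effect can be sketched in a single substitution, without the intermediate comment markers. The helper name remove_between is hypothetical. re.escape guards against delimiters that contain regex metacharacters, and re.DOTALL lets a span cross line breaks.

```python
import re

def remove_between(text, start, end):
    # Non-greedy match from each `start` to the nearest following `end`.
    pattern = re.escape(start) + r'.*?' + re.escape(end)
    return re.sub(pattern, '', text, flags=re.DOTALL)

print(remove_between("keep[drop]keep[drop\nmore]end", "[", "]"))
print(remove_between("a<script>x = 1;</script>b", "<script>", "</script>"))
```

This is only a sketch for well-delimited input. Nested or unbalanced delimiters are not handled, and for real HTML a parser is usually the safer choice.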
I want to copy whitespace like spaces and tabs from string to string in Python 2.
For example, if I have a string with 3 spaces at the start and one tab at the end, like "   Hi\t", I want to copy that whitespace to another string, so that a string like "hello" would become "   hello\t".
Is that possible to do easily?
Yes, this is of course possible. I would use regex for that.
import re
hi = "   Hi\t"
hello = "hello"
spaces = re.match(r"^(\s*).+?(\s*)$", hi)
if spaces:
    left, right = spaces.groups()
    string = "{}{}{}".format(left, hello, right)
    print(string)
    # Out: "   hello\t"
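For comparison, the same result can be had without a regex by measuring what lstrip and rstrip remove. This is an alternative sketch, not the approach above. The name copy_whitespace and the handling of an all-whitespace template are assumptions.

```python
def copy_whitespace(template, text):
    """Wrap `text` in the leading and trailing whitespace of `template`."""
    if not template.strip():
        # Assumption: an all-whitespace template counts as leading whitespace only.
        return template + text
    lead = template[:len(template) - len(template.lstrip())]
    trail = template[len(template.rstrip()):]
    return lead + text + trail

print(repr(copy_whitespace("   Hi\t", "hello")))
```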
I am trying to break a string into smaller segments using Python.
The various cases can be:
str1 = "Hello world. This is an ideal example string."
Result:
Hello world.
This is an ideal example string.
str2 = "<H1>Hello world.</H1><P>This is an HTML example string.<P>"
Result:
<H1>Hello world.</H1>
<P>This is an HTML example string.<P>
str3 = "1. Hello World. 2. This is a string."
Result:
1. Hello World.
2. This is a string.
Here is my code. But I cannot seem to achieve the 2nd case:
import re
string = """<h1>This is a string.</h1><a href="www.abc.com"> This is another part. <P/>"""
segment_regex = re.compile(r"""
(
\r\n|
\\r\\n|
\n|
\\n|
\r|
\\r|
\t|
\\t|
(?:
(?<=[^\d][\.|\!|\?])
\s+
(?=[A-Z0-9])
)|
(?:
(?<=[\.|\!|\?])\s*(?=<.*?>)
)
)
""", re.VERBOSE)
seg = segment_regex.split(string)
segments = seg[::2]
separator = seg[1::2]
print("Segments are ---->>")
for s in segments:
    print(s)
print("Separators are ---->>")
for p in separator:
    print(p)
The regex may be trying to do too many things at once. A simpler and more manageable way would be to first detect the type of string (HTML, ideal, or list) and then invoke an appropriate processor for each. Something like:
import re
string = """<h1>This is a string.</h1><a href="www.abc.com"> This is another part. <P/>"""
if re.search('<.*?>', string):
    split_html(string)
elif re.search('\\d\\.', string):
    split_list(string)
else:
    split_ideal(string)
Also, while this may work for the cases mentioned, a generic "splitter" will be far more complex, and I don't claim that this approach will work for all inputs.
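To make the dispatch concrete, here is a minimal sketch with one possible implementation of each processor. The three split rules and the segment wrapper are assumptions chosen to reproduce the three examples in the question; they are not a general-purpose splitter.

```python
import re

def split_html(text):
    # Split at the zero-width boundary between a closing '>' and the next '<'.
    # Splitting on an empty match needs Python 3.7 or newer.
    return re.split(r'(?<=>)\s*(?=<)', text)

def split_list(text):
    # Split on the whitespace that precedes each "N. " item marker.
    return re.split(r'\s+(?=\d+\.\s)', text)

def split_ideal(text):
    # Split on whitespace after ., ! or ? when a capital letter or digit follows.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z0-9])', text)

def segment(text):
    # Detect the string type first, then hand off to the matching processor.
    if re.search(r'<[^>]+>', text):
        return split_html(text)
    if re.search(r'\d+\.\s', text):
        return split_list(text)
    return split_ideal(text)

for example in [
    "Hello world. This is an ideal example string.",
    "<H1>Hello world.</H1><P>This is an HTML example string.<P>",
    "1. Hello World. 2. This is a string.",
]:
    print(segment(example))
```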