How can I merge these regex into one? - python

(Before I start: I'm doing this in python)
So basically I need my single regex to match all quotation marks immediately before and after my html QUOT tags: If a quotation mark exists in those spaces, I need it to match.
Example:
<QUOT.START> Hello, this doesn't match! <\QUOT.END>
"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "
I have 4 different regexes for this purpose:
1. \"+(?=<QUOT\.START>)
2. (?<=<QUOT\.START>)\"+
3. \"+(?=<\\QUOT\.END>)
4. (?<=<\\QUOT\.END>)\"+
Can I merge these 4 into basically one?

If you're able to use the newer regex module (which supports variable-length lookbehind), you can condense your expression somewhat to
(?<=<\\?QUOT\.(?:START|END)>[\t ]*)" # matches quotes after <quot.start> or <quot.end>,
# with optional whitespace in between
|
"(?=[\t ]*<\\?QUOT\.(?:START|END)>) # matches quotes before <quot.start> or <quot.end>,
# with optional whitespace in between
Without verbose mode:
(?<=<\\?QUOT\.(?:START|END)>[\t ]*)"|"(?=[\t ]*<\\?QUOT\.(?:START|END)>)
Generally speaking this is:
(?<=<tag><optional whitespace>)quote|quote(?=<optional whitespace><tag>)
In Python:
import regex as re
# raw string, so that the backslash in <\QUOT.END> is kept literally
string = r"""
<QUOT.START> Hello, this doesn't match! <\QUOT.END>
"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "
"""
rx = re.compile(r'''(?<=<\\?QUOT\.(?:START|END)>[\t ]*)"|"(?=[\t ]*<\\?QUOT\.(?:START|END)>)''')
for m in rx.finditer(string):
    print(m.group(0))
    print(m.span())
This brings up four quotes and their positions.
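If installing the third-party regex module is not an option, the standard-library re module can still strip these quotes, although it rejects the variable-length lookbehind used above. The sketch below is one possible workaround and not part of the original answer: it captures the tag plus optional blanks in a group and puts that group back during substitution. The names TAG, QUOTES_AROUND_TAGS and strip_tag_quotes are made up for illustration.

```python
import re

# Tag pattern shared by both alternatives; like the expression above, it
# accepts an optional backslash and either START or END.
TAG = r'<\\?QUOT\.(?:START|END)>'

# Alternative 1 consumes the tag and any blanks after it (group 1), then the quotes.
# Alternative 2 matches quotes that are followed by optional blanks and a tag.
QUOTES_AROUND_TAGS = re.compile(
    r'(' + TAG + r'[\t ]*)"+' + r'|"+(?=[\t ]*' + TAG + r')'
)

def strip_tag_quotes(text):
    """Remove double quotes directly before or after a QUOT tag."""
    # Group 1 is None when alternative 2 matched, so fall back to ''.
    return QUOTES_AROUND_TAGS.sub(lambda m: m.group(1) or '', text)

sample = r'"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "'
cleaned = strip_tag_quotes(sample)
print(cleaned)
```

Because the tag is consumed and re-inserted rather than looked behind, this only works for substitution; if you need the positions of the quotes themselves, the regex module version above is the simpler route.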

@ctwheels helped me figure out this (super simple) solution. Being a total newbie at regexes, I didn't know about the | (pipe) syntax. So here is the final regex I wanted (and it works!):
\"+(?=<QUOT\.START>)|(?<=<QUOT\.START>)\"+|\"+(?=<\\QUOT\.END>)|(?<=<\\QUOT\.END>)\"+
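As a quick check, the merged expression can be run with the standard-library re module, since each lookbehind in it has a fixed width. This is a minimal sketch; the sample strings are made up so that the quotes touch the tags, which is what the four original patterns require.

```python
import re

merged = re.compile(
    r'"+(?=<QUOT\.START>)'      # quotes right before the start tag
    r'|(?<=<QUOT\.START>)"+'    # quotes right after the start tag
    r'|"+(?=<\\QUOT\.END>)'     # quotes right before the end tag
    r'|(?<=<\\QUOT\.END>)"+'    # quotes right after the end tag
)

# Quotes touch the tags: all four positions match.
adjacent = r'"<QUOT.START>"Hello"<\QUOT.END>"'
adjacent_matches = merged.findall(adjacent)
print(len(adjacent_matches))

# With a space between quote and tag, only the leading quote still matches.
spaced = r'"<QUOT.START> "Hello" <\QUOT.END> "'
spaced_matches = merged.findall(spaced)
print(len(spaced_matches))
```

The second string shows why the other answer adds [\t ]* next to the tags: without it, a single space is enough to stop a quote from matching.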

You can try this:
import re
s = r'<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> '
strings = re.findall(r'"(.*?)"', s)
Output:
['Hello, this will call 4 matches! ']

Related

python regex to replace all single word characters in string

I am trying to remove all the single characters in a string
input: "This is a big car and it has a spacious seats"
my output should be:
output: "This is big car and it has spacious seats"
Here I am using the expression
import re
re.compile('\b(?<=)[a-z](?=)\b')
This only matches the first single character in the string.
Any help would be appreciated. Thanks in advance.
Edit: I have just seen that this was suggested in the comments first by Wiktor Stribiżew. Credit to him - I had not seen when this was posted.
You can also use re.sub() to automatically remove single characters (assuming you only want to remove alphabetical characters). The following will replace any occurrences of a single alphabetical character:
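import re
input_string = "This is a big car and it has a spacious seats"
# \s* also consumes the whitespace after the removed letter, so no double spaces remain
output = re.sub(r"\b[a-zA-Z]\b\s*", "", input_string)
print(output)
# This is big car and it has spacious seats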
You can learn more about inputting regex expression when replacing strings here: How to input a regex in string.replace?
Here's one way to do it by splitting the string and filtering out single length letters using len and str.isalpha:
>>> s = "1 . This is a big car and it has a spacious seats"
>>> ' '.join(i for i in s.split() if not (i.isalpha() and len(i)==1))
'1 . This is big car and it has spacious seats'
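One difference between the word-boundary pattern and the split-and-filter approach shows up with apostrophes. The sketch below uses a made-up input to illustrate it: \b treats the apostrophe as a boundary, so a letter after it counts as a one-letter word, whereas str.split only looks at whitespace-separated tokens.

```python
import re

text = "It's a big car"

# Word-boundary approach: \b also sits next to the apostrophe, so the "s"
# of "It's" is seen as a single-letter word and removed as well.
# The join/split afterwards collapses the leftover double spaces.
by_regex = " ".join(re.sub(r"\b[a-zA-Z]\b", "", text).split())

# Token approach: only whitespace-separated tokens that are one letter long go.
by_tokens = " ".join(w for w in text.split() if not (len(w) == 1 and w.isalpha()))

print(by_regex)   # It' big car
print(by_tokens)  # It's big car
```

Which behaviour is right depends on the data; for ordinary English text with contractions, the token filter is probably the safer default.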
re.sub(r' \w{1} |^\w{1} | \w{1}$', ' ', input)
EDIT:
You can use:
import re
input_string = "This is a big car and it has a spacious seats"
str_without_single_chars = re.sub(r'(?:^| )\w(?:$| )', ' ', input_string).strip()
or (which, as was brought to my attention, doesn't meet the specification):
input_string = "This is a big car and it has a spacious seats"
' '.join(w for w in input_string.split() if len(w)>3)
The fastest way to remove words, characters, strings, or anything else between two known tags or two known characters in a string is a direct, native-C approach: use re to turn both delimiters into comment markers and then delete the comment, as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything and works faster, better and cleaner than Beautiful Soup.
Batch files are where the "" got their beginnings, and they were only borrowed for use with batch and HTML from native C. When using Pythonic methods with regular expressions, you have to realize that Python has not changed much from the regular expressions used elsewhere, so why iterate many times when a single pass can find it all as one chunk? Do the same with individual characters as well.
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
And finally
var = re.sub('<!--.*?-->', '', var) # wipes out the markers and everything between them
And you do not need Beautiful Soup. You can also scrape data this way if you understand how it works.
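Note that the three-step version above only matches a bare <script> tag, and that . does not cross line breaks by default, so a script block with attributes or several lines would be left in place. The following is a minimal single-pass sketch that addresses both points; the sample HTML is made up, and a regex like this is still not a general HTML parser.

```python
import re

html = '<p>a</p><script type="text/javascript">\nvar x = 1;\n</script><p>b</p>'

# re.DOTALL lets "." cross newlines; re.IGNORECASE also catches <SCRIPT>.
# [^>]* tolerates attributes inside the opening tag.
script_block = re.compile(
    r'<script\b[^>]*>.*?</script\s*>',
    re.DOTALL | re.IGNORECASE,
)

cleaned = script_block.sub('', html)
print(cleaned)  # <p>a</p><p>b</p>
```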

Copy whitespace from string to string in Python

I want to copy whitespace like spaces and tabs from string to string in Python 2.
For example, if I have a string with 3 spaces at the start and one tab at the end, like "   Hi\t", I want to copy that whitespace to another string, so that "hello" would become "   hello\t".
Is that possible to do easily?
Yes, this is of course possible. I would use regex for that.
import re
hi = "   Hi\t"
hello = "hello"
spaces = re.match(r"^(\s*).+?(\s*)$", hi)
if spaces:
    left, right = spaces.groups()
    string = "{}{}{}".format(left, hello, right)
    print(string)
# Out: "   hello\t"
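The same idea can be sketched without a regex by measuring how much lstrip and rstrip remove. This is an alternative to the answer above, not a replacement for it; the function name copy_whitespace is made up for illustration.

```python
def copy_whitespace(source, target):
    """Wrap target in the leading and trailing whitespace of source."""
    # Everything lstrip() removes is the leading whitespace.
    lead_len = len(source) - len(source.lstrip())
    leading = source[:lead_len]
    # Look for trailing whitespace only in the remainder, so a source made
    # entirely of whitespace is not counted twice.
    rest = source[lead_len:]
    trailing = rest[len(rest.rstrip()):]
    return leading + target + trailing

result = copy_whitespace("   Hi\t", "hello")
print(repr(result))  # '   hello\t'
```

Unlike the regex version, this also behaves predictably when the source string contains newlines or consists only of whitespace.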

Python Regex for Words & single space

I am using re.sub in order to forcibly convert a "bad" string into a "valid" string via regex. I am struggling with creating the right regex that will parse a string and "remove the bad parts". Specifically, I would like to force a string to be all alphabetical, and allow for a single space between words. Any values that disagree with this rule I would like to substitute with ''. This includes multiple spaces. Any help would be appreciated!
import re
list_of_strings = ["3He2l2lo Wo45rld!", "Hello World- -number two-", "Hello World number .. three"]
for str in list_of_strings:
    print(re.sub(r'[^A-Za-z]+([^\s][A-Za-z])*', '', str))
I would like the output to be:
Hello World
Hello World number two
Hello World number three
See if the following works. It matches every run of characters to remove, but substitutes a space for the run only when the run contains at least one space; otherwise the run is dropped.
import re
list_of_strings = ["3He2l2lo Wo45rld!", "Hello World- -number two-", "Hello World number .. three"]
for str in list_of_strings:
    print(re.sub(r'((?:[^A-Za-z\s]|\s)+)', lambda x: ' ' if ' ' in x.group(0) else '', str))
It yields:
Hello World
Hello World number two
Hello World number three
I would prefer two passes to simplify the regex: the first pass removes non-alphabetic characters, the second collapses runs of whitespace.
pass1 = re.sub(r'[^A-Za-z\s]', '', str) # remove non-alpha
pass2 = re.sub(r'\s+', ' ', pass1) # collapse whitespace runs to a single space

String segmentation with html tags in python

I am trying to break a string into smaller segments using Python.
The various cases can be:
str1 = "Hello world. This is an ideal example string."
Result:
Hello world.
This is an ideal example string.
str2 = "<H1>Hello world.</H1><P>This is an HTML example string.<P>"
Result:
<H1>Hello world.</H1>
<P>This is an HTML example string.<P>
str3 = "1. Hello World. 2. This is a string."
Result:
1. Hello World.
2. This is a string.
Here is my code. But I cannot seem to achieve the 2nd case:
import re
string = """<h1>This is a string.</h1><a href="www.abc.com"> This is another part. <P/>"""
segment_regex = re.compile(r"""
(
\r\n|
\\r\\n|
\n|
\\n|
\r|
\\r|
\t|
\\t|
(?:
(?<=[^\d][\.|\!|\?])
\s+
(?=[A-Z0-9])
)|
(?:
(?<=[\.|\!|\?])\s*(?=<.*?>)
)
)
""", re.VERBOSE)
seg = segment_regex.split(string)
segments = seg[::2]
separator = seg[1::2]
print("Segments are ---->>")
for s in segments:
    print(s)
print("Separators are ---->>")
for p in separator:
    print(p)
The regex may be trying to do too many things at once. A simpler and more manageable way would be to first detect the string type (html, ideal, or list) and then invoke the appropriate processor for each. Something like:
import re
string = """<h1>This is a string.</h1><a href="www.abc.com"> This is another part. <P/>"""
# split_html, split_list and split_ideal are placeholders for the per-type processors
if re.search(r'<.*?>', string):
    split_html(string)
elif re.search(r'\d\.', string):
    split_list(string)
else:
    split_ideal(string)
Also, while this may work for the cases mentioned, a generic "splitter" will be far more complex, and I don't claim that this approach will work for all inputs.
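To make the dispatch idea concrete, here is a rough sketch of what the three processors could look like for the sample strings in the question. The bodies of split_html, split_list and split_ideal, and the wrapper segment, are assumptions written for illustration; they are deliberately naive and are not meant as a general splitter. Splitting on an empty match requires Python 3.7 or later.

```python
import re

def split_html(text):
    # Naive: cut wherever one tag ends and a non-closing tag starts right after it.
    return [s for s in re.split(r'(?<=>)(?=<[^/])', text) if s]

def split_list(text):
    # Cut at whitespace that is followed by an item number such as "2. ".
    return re.split(r'\s+(?=\d+\.\s)', text.strip())

def split_ideal(text):
    # Cut at whitespace after sentence punctuation and before a capital or digit.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z0-9])', text.strip())

def segment(text):
    """Detect the string type first, then hand over to the matching splitter."""
    if re.search(r'<[^>]+>', text):
        return split_html(text)
    if re.search(r'\d+\.\s', text):
        return split_list(text)
    return split_ideal(text)

str1 = "Hello world. This is an ideal example string."
str2 = "<H1>Hello world.</H1><P>This is an HTML example string.<P>"
str3 = "1. Hello World. 2. This is a string."
for sample in (str1, str2, str3):
    print(segment(sample))
```

Keeping each splitter separate means a failing case can be fixed in one small pattern instead of in a single large alternation.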

finding and returning a string with a specified prefix

I am close, but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print(p.group(1))
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print(p.group(1)) # first thing matched inside of ( .. )
But as usual with regex, there are tons of examples that break this. For example, if the text is s = "Hi there #guy, what's with the comma?", the result would be guy, (including the comma).
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
p.group(0) returns the entire matched text; with a capturing pattern such as #(\w+), p.group(1) returns guy. If you want to find out what an object can do, call the built-in dir(p). This returns a list of the attributes and methods available on that object instance.
As is evident from the answers so far, regex is the most efficient solution for your problem. The answers differ slightly in which characters they allow after the #:
[^ ] anything but space
\w in Python 2.x is equivalent to [A-Za-z0-9_] by default; in Python 3 it is Unicode-aware for str patterns (it also matches non-ASCII letters and digits) unless the re.ASCII flag is given
If you have better idea what characters might be included in the user name you might adjust your regex to reflect that, e.g., only lower case ascii letters, would be:
[a-z]
NB: I skipped quantifiers for simplicity.
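The difference between the character sets is easiest to see on a name with a non-ASCII letter. This is a small sketch for Python 3 with a made-up input string; re.ASCII restricts \w to [a-zA-Z0-9_].

```python
import re

s = "Hi there #caf\u00e9"  # "\u00e9" is the letter e with an acute accent

# For str patterns, \w is Unicode-aware by default in Python 3.
unicode_name = re.search(r'#(\w+)', s).group(1)

# With re.ASCII the match stops at the first non-ASCII letter.
ascii_name = re.search(r'#(\w+)', s, re.ASCII).group(1)

print(unicode_name)
print(ascii_name)
```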
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: "If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space." But this is incorrect: that pattern is a character class which will match ONE character from the set #, /, ., * and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.
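Putting the answers together, a small helper can return the captured name directly and cope with the case where nothing matches (re.search returns None then). This is a minimal sketch; the function name first_tag and the sample strings other than "Hi there #guy" are made up for illustration.

```python
import re

def first_tag(text, pattern=r'#(\w+)'):
    """Return the first group captured by pattern, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(first_tag("Hi there #guy"))                                       # guy
print(first_tag("Hi there #guy, what's with the comma?", r'#([^ ]+)'))  # guy,
print(first_tag("no tag here"))                                         # None

# findall returns every captured name instead of only the first one.
all_tags = re.findall(r'#(\w+)', "#one and #two")
print(all_tags)                                                         # ['one', 'two']
```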
