In a file I can have either of the following two string formats:
::WORD1::WORD2= ANYTHING
::WORD3::WORD4::WORD5= ANYTHING2
This is the regex I came up with:
::(\w+)(?:::(\w+))?::(\w+)=(.*)
regex.findall(..)
[(u'WORD1', u'', u'WORD2', u' ANYTHING'),
(u'WORD3', u'WORD4', u'WORD5', u' ANYTHING2')]
My first question is: why do I get this empty u'' when matching the first string?
My second question is: is there an easier way to write this regex? The two strings are very similar, except that sometimes I have this extra ::WORD5.
My last question is: most of the time I have only a word between the ::, so \w+ is enough, but sometimes I get stuff like 2-WORD2 or 3-2-WORD2, etc.; there is this - that appears. How can I add it to the \w+?
For the last question, use:
[\w\-]+
Explanation: \w matches any word character; putting - inside the character class lets it match hyphens as well.
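For example, plugging that class into the original pattern also handles hyphenated values (a quick sketch; the sample line is made up):
>>> import re
>>> line = '::2-WORD2::3-2-WORD2= ANYTHING'
>>> re.findall(r'::([\w\-]+)(?:::([\w\-]+))?::([\w\-]+)=(.*)', line)
[('2-WORD2', '', '3-2-WORD2', ' ANYTHING')]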
Captured groups are always included in re.findall results, even if they don't match anything. That's why you get an empty string. If you just want to get what's between the delimiters, try split instead of findall:
a = '::WORD1::WORD2= ANYTHING'
b = '::WORD3::WORD4::WORD5= ANYTHING2'
print re.split(r'::|= ', a)[1:] # ['WORD1', 'WORD2', 'ANYTHING']
print re.split(r'::|= ', b)[1:] # ['WORD3', 'WORD4', 'WORD5', 'ANYTHING2']
In response to the comments, if "ANYTHING" could be, well, anything, it's easier to use string functions rather than regexes:
x, y = a.split('= ', 1)
results = x.split('::')[1:] + [y]
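For example, applied to the two sample strings this gives:
>>> a = '::WORD1::WORD2= ANYTHING'
>>> x, y = a.split('= ', 1)
>>> x.split('::')[1:] + [y]
['WORD1', 'WORD2', 'ANYTHING']
>>> b = '::WORD3::WORD4::WORD5= ANYTHING2'
>>> x, y = b.split('= ', 1)
>>> x.split('::')[1:] + [y]
['WORD3', 'WORD4', 'WORD5', 'ANYTHING2']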
Based on thg435's answer, you can just split on the "=" and then do exactly the same, something like:
left,right = a.split('=', 1)
answer = left.split('::')[1:] + [right]
For your last question you can use something like this (it accepts letters, numbers, and "-"):
[a-zA-Z0-9\-]+
I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]; the date has no separators and is digits only, in YYYYMMDD order (don't think that matters).
I am trying to use re. I have a working version by writing:
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse re.split to parse this data:
test[:8], test[8:]
will split your strings just fine.
What you are actually doing by calling re.split('(^\d+)', test) is splitting your test string on any occurrence of a number made of one or more digits.
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty result at the beginning because your input string starts with digits and you're splitting it by digits only. Hence you get an empty string, which is whatever comes before the first set of digits.
To avoid that you can use filter:
>>> print filter(None, re.split('(\d+)',test))
['20170125', 'NBCNightlyNews']
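Note that in Python 3 filter returns an iterator rather than a list, so you would wrap it in list() (a minimal sketch of the same idea):
>>> import re
>>> test = '20170125NBCNightlyNews'
>>> list(filter(None, re.split(r'(\d+)', test)))
['20170125', 'NBCNightlyNews']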
Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile('(\d+)(\w+)')
result = re.match(pattern, test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others: you shouldn't use a regex for this simple task when you already know the format won't change, the date is fixed size, and it will always be first. Just use string indexing.
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>>re.split('(\d+)',test)
['test', '20170125', 'NBCNightlyNews']
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']
I can't really find a way to title this, but I can explain what I am going for in code.
So I am able to take a user's comment and check whether it has these "[[Keyword]]" modifiers. Now I want to expand it further to allow more than one.
This is what happens with the current code if a user inputs more than one modifier in a row.
#comment in this case is "I want to [[find]] [[this]] [[Special]] word."
# c is the comment.
body = c.body
# Finds the hot word
result = re.search("\[\[(.*)\]\]", body, re.IGNORECASE)
print(result.group(1))
Expected result:
>>>find this Special
Returned result:
>>>find]] [[this]] [[Special
Is there any way I can take every result and put it into some sort of array, so I can measure how long the array is and each result corresponds to an index?
How I want it to work:
print(result[0] +'\n')
print(result[1] +'\n')
print(result[2] +'\n')
>>>find
>>>this
>>>Special
The .* is greedy by default. You want it to match in non-greedy mode so it matches as little as possible. You can do that by using .*? instead of .*. You should also use re.findall to get all the matches instead of re.search, which will only return the first match.
>>> re.findall(r"\[\[(.*?)\]\]", body, re.IGNORECASE)
['find', 'this', 'Special']
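Since findall returns a plain list, you can index it and take its length directly, which is what the question asks for (a small sketch, assuming body holds the sample comment from the question):
>>> matches = re.findall(r"\[\[(.*?)\]\]", body)
>>> len(matches)
3
>>> matches[0], matches[1], matches[2]
('find', 'this', 'Special')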
import re
text = 'comment in this case is "I want to [[find]] [[this]] [[Special]] word."'
result = re.findall(r"(?:\[\[(.*?)\]\])", text)
for term in result:
    print(term + "\n")
I have patterns like this:
" 1+2;\r\n\r(%o2) 3\r\n(%i3) "
I'd like to split them up into:
[" 1+2;","(%o2) 3","(%i3)"]
The regex for the first part is hard to construct since it could be anything a user asks of an algebra system; the second could be:
'\(%o\d+\).'
and the last something like this:
'\(%i\d+\)'
I'm not strictly stumped by the regex part, but by how to actually split once I know the correct pattern.
How would I split this?
How about splitting on (\r|\n)+?
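A minimal sketch of that idea, using a character class so the separators are not captured into the result, plus strip() for the stray spaces:
>>> import re
>>> s = " 1+2;\r\n\r(%o2) 3\r\n(%i3) "
>>> [p.strip() for p in re.split(r'[\r\n]+', s)]
['1+2;', '(%o2) 3', '(%i3)']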
Will this code work for you?
patterns = [p.strip() for p in " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".split("\r\n")]
To clarify:
>>> patterns = " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".split("\r\n")
>>> patterns
[' 1+2;', '\r(%o2) 3', '(%i3) ']
>>> patterns = [p.strip() for p in patterns]
>>> patterns
['1+2;', '(%o2) 3', '(%i3)']
This way you split the lines and get rid of the unnecessary whitespace characters.
EDIT: also, Python strings have a splitlines() method:
splitlines(...)
S.splitlines([keepends]) -> list of strings
Return a list of the lines in S, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends
is given and true.
So this code may be changed to:
patterns = [p.strip() for p in " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".splitlines()]
This also handles the case of newlines without carriage returns, and all the different combinations.
I'm interested in removing extra symbols from strings in Python.
What would be the most efficient and Pythonic way to do that? Is there some grammar module?
My first idea would be to locate the most deeply nested text and scan outward to the left and to the right, counting the opening and closing symbols. Then I would remove the extra symbols on whichever side has too many.
An example would be this string
text = "(This (is an example)"
You can clearly see that the first parenthesis is not balanced by another one. So I want to delete it.
text = "This (is an example)"
The solution has to be independent of the position of the parentheses.
Another example could be:
text = "(This (is another example) )) (to) explain) the question"
That would become:
text = "(This (is another example) ) (to) explain the question"
Had to break this into an answer for formatting. Check out Python's regular expression module.
If I'm understanding what you are asking, look at re.sub. You can use a regular expression to find the characters you'd like to remove and replace them with an empty string.
Suppose we want to remove all instances of '.', '&', and '*'.
>>> import re
>>> s = "abc&def.ghi**jkl&"
>>> re.sub('[\.\&\*]', '', s)
'abcdefghijkl'
If the pattern to be matched is larger, you can use re.compile and pass that as the first argument to sub.
>>> r = re.compile('[\.\&\*]')
>>> re.sub(r, '', s)
'abcdefghijkl'
Hope this helps.
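Note that re.sub as shown removes every occurrence of the listed symbols. If the goal is to drop only the unmatched parentheses, as in the examples in the question, one possible sketch is the counting idea the question describes, implemented with a stack of indices (the helper name remove_unbalanced is made up for illustration):
def remove_unbalanced(text, open_sym='(', close_sym=')'):
    # Collect the indices of parentheses that never find a partner.
    drop = set()
    stack = []  # indices of openers not yet matched
    for i, ch in enumerate(text):
        if ch == open_sym:
            stack.append(i)
        elif ch == close_sym:
            if stack:
                stack.pop()   # this closer matches the most recent opener
            else:
                drop.add(i)   # closer with no opener: remove it
    drop.update(stack)        # openers that were never closed: remove them too
    return ''.join(ch for i, ch in enumerate(text) if i not in drop)

print(remove_unbalanced("(This (is an example)"))
# This (is an example)
print(remove_unbalanced("(This (is another example) )) (to) explain) the question"))
# (This (is another example) ) (to) explain the question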
I am close, but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end up with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print p.group(1)
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print p.group(1) # first thing matched inside of ( .. )
But as usual with regexes, there are tons of examples that break this; for example, if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about everything you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point; it literally only matches ASCII letters (no digits, no Unicode, etc.).
With a pattern like r'#(\w+)', p.group(1) returns guy (group(0) would include the #). If you want to find out what an object can do, you can use the built-in dir(p) to find out. This will return a list of attributes and methods that are available for that object instance.
As is evident from the answers so far, a regex is the most efficient solution for your problem. The answers differ slightly regarding what you allow to follow the #:
[^ ] anything but space
\w in Python 2.x is equivalent to [A-Za-z0-9_]; in Python 3 it also matches Unicode word characters by default
If you have a better idea of what characters might appear in the user name, you can adjust your regex to reflect that, e.g., only lowercase ASCII letters would be:
[a-z]
NB: I skipped quantifiers for simplicity; see the sketch below.
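For example, with a quantifier added, the three choices behave differently on a name followed by digits and punctuation (a quick sketch; the sample string is made up):
>>> import re
>>> s = "Hi there #guy22, hello"
>>> re.search(r'#([^ ]+)', s).group(1)
'guy22,'
>>> re.search(r'#(\w+)', s).group(1)
'guy22'
>>> re.search(r'#([a-z]+)', s).group(1)
'guy'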
(?<=#)\w+
will match a word if it's preceded by a # (without adding the # to the match, a so-called positive lookbehind). This will match "words" composed of letters, digits, and/or underscores; if you want letters only, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: """If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space."" But this is incorrect -- that pattern is a character class which will match ONE character in the set #/.* and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.