Checking and removing extra symbols

Checking and removing extra symbols - python

I'm interested by removing extra symbols from strings in python.
What could by the more efficient and pythonic way to do that ? Is there some grammar module ?
My first idea would be to locate the more nested text and go through the left and the right, counting the opening and closing symbols. Then i remove the last one of the symbol counter that contain too much symbol.
An example would be this string
text = "(This (is an example)"
You can clearly see that the first parenthesis is not balanced by another one. So i want to delete it.
text = "This (is and example)"
The solution has to be independant of the position of the parentheses.
Others example could be :
text = "(This (is another example) )) (to) explain) the question"
That would become :
text = "(This (is another example) ) (to) explain the question"

Had to break this into an answer for formatting. Check the Python's regular expression module.
If I'm understanding what you are asking, look at re.sub. You can use a regular expression to find the character you'd like to remove, and replace them with an empty string.
Suppose we want to remove all instances of '.', '&', and '*'.
>>> import re
>>> s = "abc&def.ghi**jkl&"
>>> re.sub('[\.\&\*]', '', s)
'abcdefghijkl'
If the pattern to be matched is larger, you can use re.compile and pass that as the first argument to sub.
>>> r = re.compile('[\.\&\*]')
>>> re.sub(r, '', s)
'abcdefghijkl'
Hope this helps.

Related

Parse a string using regex to obtain matches beginning with a certain word

I tried to search but the information that I am getting seems to be kinda overwhelming and far from what I need. I can't seem to get it to work.
The requirement is to get the function that starts with "meta" and its parentheses.
input:
one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)
output:
[ metaOmph , (uno) ]
[ metaAsdf, (dos) ]
[ metaPoil, (tres)]
The one that I currently have just gets the entire line if it starts with "meta". so I have the entire "one meta<>" if it's a match, would it be possible do what I'm aiming for?
Edit: It's one input/line at a time.
I'd love to post what I did earlier but I closed repl.it due to my frustration. I'll keep it in mind on my next post. (quite new here)

import re
s = """one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)"""
print(re.findall(".+(meta\w+)(\(\w+\))", s))
Outputs:
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]

re.findall() approach with valid regex pattern:
import re
s = '''
one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)
'''
result = re.findall(r'\b(meta\w+)(\([^()]+\))', s)
print(result)
The output:
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]

If you are going to pass a multiline string, it would seem simple to use the module level re.findall function.
text = '''one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)'''
r = re.findall(r'\b(meta.*?)(\(.*?\))', text, re.M)
print(r)
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
If you are going to be passing 1-line strings as input to a loop, it might make more sense to compile the pattern beforehand, using re.compile and re.search inside a function:
pat = re.compile(r'\b(meta.*?)(\(.*?\))')
def find(text):
return pat.search(text)
for text in list_of_texts: # assuming you're passing in your strings from a list, or elsewhere
m = find(text)
if m:
print(list(m.groups()))
['metaOmph', '(uno)']
['metaAsdf', '(dos)']
['metaPoil', '(tres)']
Note that m might return a match object or None depending on whether a search was found. You'll want to query the return value, otherwise you'll receive an AttributeError: 'NoneType' object has no attribute 'groups', or something along those lines.
Alternatively, if you want to append the result to a list, you might instead use:
r_list = []
for text in list_of_texts:
m = find(text)
if m:
r_list.append(list(m.groups()))
print(r_list)
[['metaOmph', '(uno)'], ['metaAsdf', '(dos)'], ['metaPoil', '(tres)']]
Regex Details
\b # word boundary (thought to add this in thanks to Roman's answer)
(
meta # literal 'meta'
.*? # non-greedy matchall
)
(
\( # literal opening brace (escaped)
.*?
\) # literal closing brace (escaped)
)

Python Regex Simple Split - Empty at first index

I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts, the digits, and the name. The format will always be [date][show] the date is stripped of format and is digit only in the direction of YYYYMMDD (dont think that matters)
I am trying to use re. I have a working version by writing.
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the begininning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?

Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.

What you are actually doing by entering re.split('(^\d+)', test) is, that your test string is splitted on any occurence of a number with at least one character.
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']

You're getting an empty result in the beginning because your input string starts with digits and you're splitting it by digits only. Hence you get an empty string which is before first set of digits.
To avoid that you can use filter:
>>> print filter(None, re.split('(\d+)',test))
['20170125', 'NBCNightlyNews']

Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile('(\d+)(\w+)')
result = re.match(pattern, test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.

From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>>re.split('(\d+)',test)
['test', '20170125', 'NBCNightlyNews']

If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']

Python replace regex

I have a string in which there are some attributes that may be empty:
[attribute1=value1, attribute2=, attribute3=value3, attribute4=]
With python I need to sobstitute the empty values with the value 'None'. I know I can use the string.replace('=,','=None,').replace('=]','=None]') for the string but I'm wondering if there is a way to do it using a regex, maybe with the ?P<name> option.

You can use
import re
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(,|])', r'=None\1', s)
\1 is the match in parenthesis.

With python's re module, you can do something like this:
# import it first
import re
# your code
re.sub(r'=([,\]])', '=None\1', your_string)

You can use
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(?!\w)', r'=None', s)
This works because the negative lookahead (?!\w) checks if the = character is not followed by a 'word' character. The definition of "word character", in regular expressions, is usually something like "a to z, 0 to 9, plus underscore" (case insensitive).
From your example data it seems all attribute values match this. It will not work if the values may start with something like a comma (unlikely), may be quoted, or may start with anything else. If so, you need a more fool proof setup, such as parse from the start: skipping the attribute name by locating the first = character.

Be specific and use a character class:
import re
string = "[attribute1=value1, attribute2=, attribute3=value3, attribute4=]"
rx = r'\w+=(?=[,\]])'
string = re.sub(rx, '\g<0>None', string)
print string
# [attribute1=value1, attribute2=None, attribute3=value3, attribute4=None]

How do I strip patterns or words from the end of the string backwards?

I have a string like this:
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar><foo>
I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.
I can strip the first 3 strings with re.sub(r'<[^<>]+>', '', in_str, 3)). How do I strip the closing tags? What should remain is:
<v1>aaa<b>bbb</b>ccc</v1>
I know I could maybe 'do it right', but I actually do not wish to do xml nor html parsing for my purpose, which is to aid myself visualizing the xml representation of some classes.
Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with regex, ie. right to left. because that seems unsupported:
If you mean, find the right-most match of several (similar to the
rfind method of a string) then no, it is not directly supported. You
could use re.findall() and chose the last match but if the matches can
overlap this may not give the correct result.
But .rstrip is not good with words, and won't do patterns either.
I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.
What approach could be used here? Should I reverse the string (ugly in itself and due to the '<>'s). Do tokenization (why not parse, then?)? Or create static closing tags based on the left-to-right match?
Which strategy to follow to strip the patterns from the end of the string?

The simplest would be to use old-fashing string splitting and limiting the split:
in_str.split('>', 3)[-1].rsplit('<', 3)[0]
Demo:
>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar><foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'
str.split() and str.rsplit() with a limit will split the string from the start or the end up to the limit times, letting you select the remainder unsplit.

You've already got practically all the solution. re can't do backwards, but you can:
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
in_str = re.sub(r'<[^<>]+>', '', in_str, 3)
in_str = in_str[::-1]
print in_str
in_str = re.sub(r'>[^<>]+/<', '', in_str, 3)
in_str = in_str[::-1]
print in_str
<v1>aaa<b>bbb</b>ccc</v1>
Note the reversed regex for the reversed string, but then it goes back-to-front.
Of course, as mentioned, this is way easier with a proper parser:
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
from lxml.html import etree
ix = etree.fromstring(in_str)
print etree.tostring(ix[0][0][0])
<v1>aaa<b>bbb</b>ccc</v1>

I would look into regular expressions and use one such pattern to use a split
http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split

Sorry, can't comment, but will give it as an answer.
in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>.
You just should be aware of this.
To solve the counter example provided by me, you will have to track state (or count) of tags and evaluate that you match the correct pairs.

finding and returning a string with a specified prefix

I am close but I am not sure what to do with the restuling match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end up with a space. This is what I want. However this returns a Match object that I dont' know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy

The following regular expression do what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print p.group(1)
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers

That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print p.group(1) # first thing matched inside of ( .. )
But as usually with regex, there are tons of examples that break this, for example if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).

p.group(0) should return guy. If you want to find out what function an object has, you can use the dir(p) method to find out. This will return a list of attributes and methods that are available for that object instance.

As it's evident from the answers so far regex is the most efficient solution for your problem. Answers differ slightly regarding what you allow to be followed by the #:
[^ ] anything but space
\w in python-2.x is equivalent to [A-Za-z0-9_], in py3k is locale dependent
If you have better idea what characters might be included in the user name you might adjust your regex to reflect that, e.g., only lower case ascii letters, would be:
[a-z]
NB: I skipped quantifiers for simplicity.

(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'

You say: """If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space."" But this is incorrect -- that pattern is a character class which will match ONE character in the set #/.* and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Checking and removing extra symbols - python

Related

Parse a string using regex to obtain matches beginning with a certain word

Python Regex Simple Split - Empty at first index

Python replace regex

How do I strip patterns or words from the end of the string backwards?

finding and returning a string with a specified prefix

Categories

Resources