Reference token value at parse time - python

I am trying to parse the following:
<delimiter><text><delimiter><text><delimiter>
Where delimiter can be any single literal character that is repeated three times, and text can be any printable characters besides the delimiter character (the first and second occurrences of text do not have to match and can be blank).
This is what I have come up with; however, text consumes everything from the first delimiter to the end of the string.
from pyparsing import Word, printables
delimiter = Word(printables, exact=1)
text = (Word(printables) + ~delimiter)
parser = delimiter + text # + delimiter + text + delimiter
tests = [
('_abc_123_', ['_', 'abc', '_', '123', '_']),
('-abc-123-', ['-', 'abc', '-', '123', '-']),
('___', ['_', '', '_', '', '_']),
]
for test, expected in tests:
    print parser.parseString(test), '<=>', expected
Script output:
['_', 'abc_123_'] <=> ['_', 'abc', '_', '123', '_']
['-', 'abc-123-'] <=> ['-', 'abc', '-', '123', '-']
['_', '__'] <=> ['_', '', '_', '', '_']
I think I need to make use of Future, but I can't get my head around excluding the value of the delimiter, which is only known at parse time, from the text token.

Your intuition was correct: you need to use a Forward (not Future) to capture the definition of text, since this is not fully knowable until parse time. Also, your use of Word has to exclude the delimiter character using the excludeChars argument; just using Word(printables) + ~delimiter is not sufficient.
Here is your code, marked up with the necessary changes, and hopefully some helpful comments:
from pyparsing import Word, printables, Forward, Optional
delimiter = Word(printables, exact=1)
text = Forward() #(Word(printables) + ~delimiter)
def setTextExcludingDelimiter(s,l,t):
    # define Word as all printable characters, excluding the delimiter character
    # the excludeChars argument for Word is how this is done
    text_word = Word(printables, excludeChars=t[0]).setName("text")
    # use '<<' operator to assign the text_word definition to the
    # previously defined text expression
    text << text_word
# attach parse action to delimiter, so that once it is matched,
# it will define the correct expression for text
delimiter.setParseAction(setTextExcludingDelimiter)
# make the text expressions Optional with default value of '' to satisfy 3rd test case
parser = delimiter + Optional(text,'') + delimiter + Optional(text,'') + delimiter
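As a cross-check that needs only the standard library, the same "delimiter is known only once it has been read" idea can be sketched with a regex backreference instead of pyparsing. This is a different technique from the Forward approach above, not a replacement for it. The name split_delimited is hypothetical, and the character range [!-~] is an assumption meant to mirror pyparsing's printables (non-whitespace printable ASCII).

```python
import re

# Assumption: "printable" means the non-whitespace ASCII range '!'..'~'.
# A text run is any number of printable characters that are NOT the
# delimiter captured in group 1; (?!\1) enforces that per character.
TEXT = r"((?:(?!\1)[!-~])*)"
PATTERN = re.compile(r"([!-~])" + TEXT + r"\1" + TEXT + r"\1")


def split_delimited(s):
    """Return [delim, text, delim, text, delim], or None if s does not fit."""
    match = PATTERN.fullmatch(s)
    if match is None:
        return None
    delim, first, second = match.groups()
    return [delim, first, delim, second, delim]


for sample in ("_abc_123_", "-abc-123-", "___"):
    print(split_delimited(sample))
```

The backreference \1 plays the role of the parse action: it ties the later delimiters and the excluded character to whatever the first character turned out to be.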

Related

Split string with regex by new lines, symbols and whitespaces in python

I'm new to the regex library, and I'm trying to turn a text like this
"""constructor SquareGame new(){
let square=square;
}"""
into a list like this:
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '}']
I need to create a list of tokens separated by whitespace, new lines and these symbols: {}()[].;,+-*/&|<>=~.
I used re.findall('[,;.()={}]+|\S+|\n', text), but it seems to separate tokens by whitespace and new lines only.
You may use
re.findall(r'\w+|[^\w \t]', text)
To avoid matching any Unicode horizontal whitespace use
re.findall(r'\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]', text)
See the regex demo. Details:
\w+ - 1 or more word chars
| - or
[^\w \t] - a single non-word char that is neither a space nor a tab (so all vertical whitespace is matched).
You may add more horizontal whitespace chars to exclude to the [^\w \t] character class; see their list at Match whitespace but not newlines. The regex will then look like \w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000].
See the Python demo:
import re
pattern = r"\w+|[^\w \t]"
text = "constructor SquareGame new(){\nlet square=square;\n}"
print(re.findall(pattern, text))
# => ['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
This regex only matches the characters that you listed, which I think is a safer method.
>>> re.findall(r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]", text)
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
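A minimal sketch that wraps the explicit-symbol approach in a reusable function. The names SYMBOLS, TOKEN_RE and tokenize, and the keep_newlines parameter, are hypothetical and introduced only for illustration. Note that re.findall silently skips any character that matches none of the alternatives (for example '!'), so this is only suitable when the symbol list is known to be complete.

```python
import re

# The symbols listed in the question; re.escape makes them safe inside [...].
SYMBOLS = "{}()[].;,+-*/&|<>=~"
TOKEN_RE = re.compile(r"\w+|[" + re.escape(SYMBOLS) + r"]|\n")


def tokenize(text, keep_newlines=True):
    """Split text into word tokens, single-symbol tokens and newlines."""
    tokens = TOKEN_RE.findall(text)
    if not keep_newlines:
        # Drop the newline tokens when only words and symbols are wanted.
        tokens = [tok for tok in tokens if tok != "\n"]
    return tokens


source = "constructor SquareGame new(){\nlet square=square;\n}"
print(tokenize(source))
print(tokenize(source, keep_newlines=False))
```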

How to remove punctuation from a string [duplicate]

One of the projects that I've been working on is a word counter, and to do that, I have to effectively remove all punctuation from a string.
I have tried using the split method to split at punctuation; however, this makes the resulting list very weird (instead of separating at each word, I get a list that has 5 items). I then tried keeping a list or a string full of punctuation and using a for loop to eliminate all punctuation, but neither attempt was successful.
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
for i in content_string.lower():
    if i in punctuation:
        i = i.replace[i," "]
    else:
        i = i
It says that
"TypeError: 'type' object is not subscriptable"
This message appears both when using a string or using a list.
You have mixed up parentheses and square brackets.
list and replace are callables, so their arguments are passed in parentheses.
Also, try to describe your algorithm in words, for example:
For each forbidden character, I want to remove it from my content (replace it with a space).
Here is an implementation you can start with:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']
for i in punctuation:
    content_string = content_string.replace(i, " ")
To create a list, use l = [...], not l = list[...]. Functions and methods (such as str.replace) are called with parentheses, not square brackets. You can also use re.sub to do this more simply:
content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')'] # '(', ')' not `()`
import re
new_string = re.sub('|'.join(map(re.escape, punctuation)), '', content_string)
print(new_string)
Output:
This is a test to see whether or not the code can eliminate punctuation
Your error
"TypeError: 'type' object is not subscriptable"
comes from the line
punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
To define a list, either use square brackets [ ] without list, or, if you call list, use parentheses (although in this case converting a list into a list is redundant).
# both options work, but the second one is redundant
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '(', ')']
punctuation = list(["'", '"', ',', '.', '?', '!', ':', ';', '(', ')'])
Notice that the last element '()' must be split into two elements, '(' and ')'.
Now, to achieve what you want concisely, use a list comprehension with a conditional expression:
''.join([i if i not in punctuation else ' ' for i in content_string])
result:
'This  is a test  to see  whether  or not  the code can eliminate punctuation'
Notice that, as in your code, this does not remove the punctuation symbols but replaces them with spaces, so double spaces remain wherever a punctuation mark was followed by a space.
There are multiple bugs in the code.
First one:
The list call is unnecessary (list is a built-in type, not a keyword).
If you wanted to use it, you would need to call it with parentheses () around the already defined list.
BAD punctuation = list["'", '"', ',', '.', '?', '!', ':', ';', '()']
BETTER punctuation = list(["'", '"', ',', '.', '?', '!', ':', ';', '()'])
But simply defining the list with regular [] syntax would be enough, and also more efficient than a list() call.
Second one:
You will not be able to replace parentheses with the if i in punctuation: check.
This is because they are a two character long string, and you are iterating over single characters of your string. So you will always compare '(' or ')' with '()'.
A possible fix - add parentheses separately to the punctuation list as single characters.
Third bug, or rather a redundant else branch:
else:
    i = i
This serves no purpose whatsoever; you should drop the else branch.
Fourth, the most apparent bug:
In your for loop you are rebinding the variable i, which is a copy of a single character from the string you are iterating over, so the original string never changes. Strings are immutable, so to change characters by position you first have to turn the string into a list, modify the list via enumerate, and join it back afterwards.
chars = list(content_string.lower())
for i, char in enumerate(chars):
    if char in punctuation:
        chars[i] = ' '
content_string = ''.join(chars)
Anyway, the goal you are trying to achieve can come down to a one-liner, using a list comprehension and a string join on the resulting list afterwards:
content_string = ''.join([char if char not in punctuation else ' ' for char in content_string.lower()])
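Since the stated goal is a word counter, here is a minimal sketch of how the punctuation removal could feed into one. It swaps in str.translate and collections.Counter rather than the loop discussed above, and it assumes that "punctuation" means the ASCII set in string.punctuation; the function name count_words is hypothetical. One consequence of this assumption is that apostrophes are deleted too, so "don't" is counted as "dont".

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
# Assumption: string.punctuation covers the punctuation you care about.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def count_words(text):
    """Lowercase the text, strip punctuation, and count the words."""
    cleaned = text.lower().translate(REMOVE_PUNCTUATION)
    # split() without arguments collapses any run of whitespace.
    return Counter(cleaned.split())


content_string = "This, is a test! to see: whether? or not. the code can eliminate punctuation"
print(count_words(content_string).most_common(3))
```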

How to write in the correct way the following array with special characters?

I want to clean the name of a file but ONLY for the special characters not allowed:
char_not_supported_by_file_name = ['\', '/', ':', '*', '?', '"', '<', '>', '|']
tmp_file_name= file
for c in char_not_supported_by_file_name:
    if c in tmp_file_name:
        tmp_file_name = tmp_file_name.replace(c, '_')
I try to write this list, check if the file's name I want to clean up has one of the 9 special characters I don't want and replace it with an underscore, but my IDE says the array is written wrong. How can I write it in the correct way?
If you precede a quote with a backslash, the quote is escaped. In other words, it becomes a character in the string instead of marking the end of the string. You must escape the first backslash with another backslash:
char_not_supported_by_file_name = ['\\', '/', ':', '*', '?', '"', '<', '>', '|']
Also, replace will do nothing if it can't find any instances of the character that needs to be replaced, so you can omit the if check:
for c in char_not_supported_by_file_name:
    tmp_file_name = tmp_file_name.replace(c, '_')
If you are willing to import modules, this could be done without the loop, using re.sub:
import re
file_name = "this/is:a*very?bad\\example>of<a|filename"
res = re.sub("[\\\/:*?\"<>|]", "_", file_name)
print(res)
# this_is_a_very_bad_example_of_a_filename
Note that the \ backslashes need to be tripled or even quadrupled depending on the exact location. Read this question and its duplicates for more information. The reason is that those backslashes are escaped twice: once by the Python interpreter and then again by re.
Something that will make your code more concise, if you're comfortable with regex, would be using regular expressions instead of an array:
import re
tmp_file_name = file
tmp_file_name = re.sub(r'[\\/:*?\"<>|]', '_', tmp_file_name)
This solves your original problem as well, which is that the backslash in the first element of your array, '\', is escaping the end quote and turning it into a ' literal instead of closing the quotations around your backslash.
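For comparison, a minimal sketch of the same clean-up using str.translate, which avoids regex escaping entirely so the backslash only has to be escaped once, for the Python string literal. The names FORBIDDEN, FORBIDDEN_TABLE and clean_file_name are hypothetical.

```python
# The nine characters from the question. The backslash is written as '\\'
# because a lone backslash would escape the closing quote.
FORBIDDEN = '\\/:*?"<>|'

# Map every forbidden character to an underscore.
FORBIDDEN_TABLE = str.maketrans({char: "_" for char in FORBIDDEN})


def clean_file_name(name):
    """Replace characters not allowed in a file name with underscores."""
    return name.translate(FORBIDDEN_TABLE)


print(clean_file_name("this/is:a*very?bad\\example>of<a|filename"))
```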

Replacing multiple chars in string with another character in Python

I have a list of strings. I want to check whether each string contains certain characters and, if it does, replace those characters with another character.
I have something like below:
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\\n', '\\t', '=']
word = 'Ad{min > HR'
for c in list(word):
    if c in invalid_chars:
        word = word.replace(c, '_')
print (word)
>>> Ad_min_>_HR
I am trying to convert this into a function using a list comprehension, but I am getting strange characters...
def replace_chars(word, checklist, char_replace = '_'):
    return ''.join([word.replace(ch, char_replace) for ch in list(word) if ch in checklist])
print(replace_chars(word, invalid_chars))
>>> Ad_min > HRAd{min_>_HRAd{min_>_HR
Try this general pattern:
''.join([ch if ch not in invalid_chars else '_' for ch in word])
For the complete function:
def replace_chars(word, checklist, char_replace = '_'):
    return ''.join([ch if ch not in checklist else char_replace for ch in word])
Note: no need to wrap string word in a list(), it's already iterable.
This might be a good use for str.translate(). You can turn your invalid_chars into a translation table with str.maketrans() and apply wherever you need it:
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
invalid_table = str.maketrans({k:'_' for k in invalid_chars})
word = 'Ad{min > HR'
word.translate(invalid_table)
Result:
'Ad_min_>_HR'
This is especially nice if you need to apply the translation to several strings. It is also more efficient, since you don't loop through the entire invalid_chars list for every character, which you do if you use if x in invalid_chars inside a loop.
This is easier with regex. You can search for a whole group of characters with a single substitution call. It should perform better too.
>>> import re
>>> re.sub(f"[{re.escape(''.join(invalid_chars))}]", "_", word)
'Ad_min_>_HR'
The code in the f-string builds a regex pattern that looks like this
>>> pattern = f"[{re.escape(''.join(invalid_chars))}]"
>>> print(repr(pattern))
'[\\ ,;\\{\\}\\(\\)\\\n\\\t=]'
>>> print(pattern)
[\ ,;\{\}\(\)\
\ =]
That is, a regex character set containing each of your invalid chars. (The backslash escaping ensures that none of them are interpreted as a regex control character, regardless of which characters you put in invalid_chars.) If you had specified them as a string in the first place, the ''.join() would not be required.
You can also compile the pattern (using re.compile()) if you need to re-use it on multiple words.
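A minimal sketch of the compiled-pattern variant, assuming every entry of invalid_chars is a single character and the list is not empty (an empty list would produce the invalid pattern "[]"). The helper name make_replacer is hypothetical.

```python
import re


def make_replacer(invalid_chars, replacement="_"):
    """Compile the character class once and return a function that applies it."""
    # re.escape keeps characters such as ']' or '^' from being read as syntax.
    pattern = re.compile("[" + re.escape("".join(invalid_chars)) + "]")

    def replace(word):
        return pattern.sub(replacement, word)

    return replace


invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
replace_invalid = make_replacer(invalid_chars)

for word in ['Ad{min > HR', 'a=b\tc', 'clean']:
    print(replace_invalid(word))
```

Compiling once and reusing the returned function avoids rebuilding the pattern string for every word.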

Escaping regex unicode string in Python

I have a user defined string.
I want to use it in a regex with one small improvement: match any of three apostrophe characters instead of just one.
For example,
APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])
It works well for Latin text, but for a Unicode word the list comprehension gives the following string:
"[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"
Looks like it finds backslashes in both strings and then substitutes APOSTROPHES
Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].
How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"
As I understand it, you want to create a regular expression that matches a given word written with any kind of apostrophe.
A regex matching any apostrophe can be defined as a character class:
APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'
For instance, you have this (Ukrainian?) word which contains a single quote:
word = "п'ять"
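EDIT: If your word contains another kind of apostrophe, you can normalize it to a plain single quote first, like this (the replacement is a bare "'", because an unknown escape such as \' in a replacement string is left as a literal backslash plus quote):
word = re.sub(APOSTROPHES_REGEX, "'", word, flags=re.UNICODE)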
To create a regex, you escape this string (because it may contain special characters such as punctuation). On Python versions before 3.7, escaping replaces the single quote "'" with an escaped single quote, r"\'". On Python 3.7 and later, re.escape leaves the apostrophe unescaped, so the replacement below must target "'" instead.
You can replace this r"\'" with your apostrophe regex:
import re
word_regex = re.escape(word)
word_regex = word_regex.replace(r'\'', APOSTROPHES_REGEX)
The new RegEx can then be used to match the same word with any apostrophe:
assert re.match(word_regex, "п'ять") # '
assert re.match(word_regex, "п’ять") # \u2019
assert re.match(word_regex, "пʼять") # \u02bc
Note: on Python 2, don't forget to use the re.UNICODE flag; it matters for regex character classes like r"\w". On Python 3, str patterns are Unicode-aware by default.
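Because the output of re.escape for an apostrophe differs between Python versions (since Python 3.7 it no longer escapes "'"), a version-independent sketch is to build the pattern character by character instead of post-processing the escaped string. This assumes Python 3; the names APOSTROPHES, APOSTROPHE_CLASS and apostrophe_insensitive_regex are hypothetical.

```python
import re

# The three apostrophe characters from the question: ' U+2019 U+02BC.
APOSTROPHES = "'\u2019\u02bc"
APOSTROPHE_CLASS = "[" + APOSTROPHES + "]"


def apostrophe_insensitive_regex(word):
    """Escape each character, but turn any apostrophe into the character class."""
    return "".join(
        APOSTROPHE_CLASS if char in APOSTROPHES else re.escape(char)
        for char in word
    )


pattern = apostrophe_insensitive_regex("п'ять")
for candidate in ("п'ять", "п\u2019ять", "п\u02bcять"):
    print(bool(re.fullmatch(pattern, candidate)))
```

Escaping one character at a time also avoids the original problem, where the comprehension iterated over the already escaped strings and therefore matched on the backslashes.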
