Replicating behavior of the Python string.split() function in Qt - python

I'm currently trying to exactly replicate the behavior of the Python split() function (the default version, without any arguments) in Qt.
I have been told that the default delimiter is any number of CR/LF/TAB symbols, therefore I tried using the following:
s_body.split(QRegExp("[\r\n\t ]+"), QString::SkipEmptyParts);
However, this does not replicate its behavior precisely.
If I run this on approximately 4 megabytes of text and count the number of unique words, I get 133293. However, if I do the same using the Python function, the result is 133367, so something is still amiss.
Any feedback on how to fix this would be greatly welcome.

My guess is that Python is not skipping empty strings, and they are accounting for the difference. If you want your function to mimic Python's functionality, you can choose to include empty strings. If you want the behavior you've implemented, you can write s_body.split() in Python: with no arguments, it strips all whitespace between non-whitespace characters, which means you get no empty strings back.
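The difference between the two ways of calling split() can be checked directly. This is a minimal sketch with a made-up sample string, not the asker's data:

```python
sample = "  a  b "

# No arguments: a run of whitespace counts as one separator, and leading or
# trailing whitespace is dropped, so no empty strings are produced.
no_args = sample.split()

# Explicit separator: every single space splits, so empty strings appear.
explicit = sample.split(" ")

print(no_args)
print(explicit)
```

So the argument-less form already behaves like a "skip empty parts" split.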

With a Unicode string, Python's split() will, quite naturally, split on the set of all Unicode whitespace characters, not just the feeble ASCII set:
>>> s = '\t_\n_\x0b_\x0c_\r_ _\x85_\xa0_\u1680_\u2000_\u2001_\u2002_\u2003_\u2004_\u2005_\u2006_\u2007_\u2008_\u2009_\u200a_\u2028_\u2029_\u202f_\u205f_\u3000_'
>>> len(s)
50
>>> len(s.split())
25
>>> ''.join(s.split())
'_________________________'
Now let's see what Qt does (using PyQt4):
>>> qs = QString(s)
>>> r = qs.split(QRegExp('\\s+'), QString.SkipEmptyParts)
>>> r.count()
24
>>> str(r.join(''))
'______\x85___________________'
So, almost there, but for some reason U+0085 NEL (Next Line) is not recognized as whitespace in Qt4. That's easily remedied:
>>> r = qs.split(QRegExp('[\\s\x85]+'), QString.SkipEmptyParts)
>>> r.count()
25
>>> str(r.join(''))
'_________________________'
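To see how an ASCII-only delimiter class can change a unique-word count, here is a small Python-only sketch. The sample text is hypothetical, and the regular expression is the character class from the question applied with Python's re module rather than QRegExp:

```python
import re

# Hypothetical sample: words separated by NBSP, EM SPACE, NEL, a tab and a space.
text = "alpha\u00a0beta\u2003gamma\x85delta\tepsilon alpha"

# str.split() with no arguments splits on every Unicode whitespace character.
unicode_words = set(text.split())

# The ASCII-only class leaves NBSP, EM SPACE and NEL inside the "words".
ascii_words = {word for word in re.split(r"[\r\n\t ]+", text) if word}

print(len(unicode_words), len(ascii_words))
```

On this sample the two approaches disagree on the number of unique words, which is the same kind of mismatch described in the question.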

Related

What does = (equal) do in f-strings inside the expression curly brackets?

It is well known that {} in Python f-strings evaluates an expression and inserts the result into the string (some tutorials here). However, what does the '=' at the end of the expression mean?
log_file = open("log_aug_19.txt", "w")
console_error = '...stuff...' # the real code generates it with regex
log_file.write(f'{console_error=}')
This is actually a brand-new feature as of Python 3.8.
Added an = specifier to f-strings. An f-string such as f'{expr=}'
will expand to the text of the expression, an equal sign, then the
representation of the evaluated expression.
Essentially, it facilitates the frequent use-case of print-debugging, so, whereas we would normally have to write:
f"some_var={some_var}"
we can now write:
f"{some_var=}"
So, as a demonstration, using a shiny-new Python 3.8.0 REPL:
>>> foo = 42
>>> print(f"{foo=}")
foo=42
>>>
From Python 3.8, f-strings support "self-documenting expressions", mostly for print de-bugging. From the docs:
Added an = specifier to f-strings. An f-string such as f'{expr=}' will
expand to the text of the expression, an equal sign, then the
representation of the evaluated expression. For example:
>>> user = 'eric_idle'
>>> member_since = date(1975, 7, 31)
>>> f'{user=} {member_since=}'
"user='eric_idle' member_since=datetime.date(1975, 7, 31)"
The usual f-string format specifiers allow more control over how the
result of the expression is displayed:
>>> delta = date.today() - member_since
>>> f'{user=!s} {delta.days=:,d}'
'user=eric_idle delta.days=16,075'
The = specifier will display the whole expression so that calculations
can be shown:
>>> print(f'{theta=} {cos(radians(theta))=:.3f}')
theta=30 cos(radians(theta))=0.866
This was introduced in Python 3.8. It saves writing a lot of f'expr = {expr}' boilerplate. You can check the docs at What's new in Python 3.8.
A nice example was shown by Raymond Hettinger in his tweet:
>>> from math import radians, sin
>>> for angle in range(360):
...     print(f'{angle=}\N{degree sign} {(theta:=radians(angle))=:.3f}')
angle=0° (theta:=radians(angle))=0.000
angle=1° (theta:=radians(angle))=0.017
angle=2° (theta:=radians(angle))=0.035
angle=3° (theta:=radians(angle))=0.052
angle=4° (theta:=radians(angle))=0.070
angle=5° (theta:=radians(angle))=0.087
angle=6° (theta:=radians(angle))=0.105
angle=7° (theta:=radians(angle))=0.122
angle=8° (theta:=radians(angle))=0.140
angle=9° (theta:=radians(angle))=0.157
angle=10° (theta:=radians(angle))=0.175
...
You can also check out this to get the underlying idea on why this was proposed in the first place.
As mentioned here:
Equals signs are now allowed inside f-strings starting with Python 3.8. This lets you quickly evaluate an expression while outputting the expression that was evaluated. It's very handy for debugging.
It means the expression inside the f-string braces is evaluated, and the output is the expression text, an equals sign, and then the result.
So it virtually means:
"something={executed something}"
f'{a_string=}' is not exactly the same as f'a_string={a_string}'
The former escapes special characters while the latter does not.
e.g:
a_string = 'word 1 tab \t double quote \\" last words'
print(f'a_string={a_string}')
print(f'{a_string=}')
gets:
a_string=word 1 tab double quote \" last words
a_string='word 1 tab \t double quote \\" last words'
I just realised that the difference is that the latter is printing the repr while the former is just printing the value. So, it would be more accurate to say:
f'{a_string=}' is the same as f'a_string={a_string!r}'
and allows formatting specifications.
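The interaction between =, conversions, and format specifications can be sketched as follows (Python 3.8 or later; the variable name is only illustrative):

```python
name = "ab"

plain = f"{name=}"       # repr() is applied by default
as_str = f"{name=!s}"    # explicit !s conversion uses str()
padded = f"{name=:>6}"   # with a format spec, str() is used unless !r is given
spaced = f"{name = }"    # whitespace around '=' is kept in the output

print(plain)
print(as_str)
print(padded)
print(spaced)
```

Note that adding a format specification silently switches from repr() to str(), so f'{name=:>6}' is not simply a padded version of f'{name=}'.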

Indexing the wrong character for an expression

My program seems to be indexing the wrong character or not at all.
I wrote a basic calculator that allows expressions to be used. It works by having the user enter the expression, then turning it into a list, and indexing the first number at position 0 and then using try/except statements to index number2 and the operator. All this is in a while loop that is finished when the user enters done at the prompt.
The program seems to work fine if I type the expression like this "1+1" but if I add spaces "1 + 1" it cannot index it or it ends up indexing the operator if I do "1+1" followed by "1 + 1".
I have asked in a group chat before and someone told me to use tokenization instead of my method, but I want to understand why my program is not running properly before moving on to something else.
Here is my code:
https://hastebin.com/umabukotab.py
Thank you!
Strings are basically lists of characters. 1+1 contains three characters, whereas 1 + 1 contains five, because of the two added spaces. Thus, when you access the third character (index 2) of this longer string, you get the operator + in the middle rather than the second number.
Parsing input is often not easy, and certainly parsing arithmetic expressions can get tricky quite quickly. Removing spaces from the input, as suggested by @Sethroph, is a viable solution, but it will only get you so far. If you all of a sudden need to support stuff like 1+2+3, it will still break.
Another solution would be to split your input on the operator. For example:
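expression = '1 + 2'  # avoid naming this 'input', which shadows the built-in
terms = expression.split('+') # ['1 ', ' 2'] note the spaces
terms = [int(term) for term in terms] # [1, 2] since int() can handle leading/trailing whitespace
output = sum(terms) # also works for more than two terms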
Still, although this can handle situations like 1 + 2 + 3, it will break when several different operators are involved, or when there are parentheses (but that might be something you need not worry about, depending on how complex you want your calculator to be).
IMO, a better approach would indeed be to use tokenization. Personally, I'd use parser combinators, but that may be a bit overkill. For reference, here's an example calculator whose input is parsed using parsy, a parser combinator library for Python.
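As a rough idea of what tokenization could look like without any third-party library, here is a minimal sketch. The names tokenize and evaluate are made up for illustration, and the evaluator deliberately works strictly left to right, with no operator precedence and no parentheses:

```python
import operator
import re

# One token is either a number (optionally with a decimal part) or an operator.
TOKEN_RE = re.compile(r"\s*(\d+(?:\.\d+)?|[+\-*/])")
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def tokenize(expression):
    """Split an expression into number and operator tokens, ignoring spaces."""
    expression = expression.rstrip()
    tokens = []
    pos = 0
    while pos < len(expression):
        match = TOKEN_RE.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expression):
    """Evaluate 'number (operator number)*' strictly left to right."""
    tokens = tokenize(expression)
    if len(tokens) % 2 == 0:
        raise ValueError("expected: number, operator, number, ...")
    result = float(tokens[0])
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        result = OPS[op](result, float(operand))
    return result


print(tokenize("1 + 1"))
print(evaluate("1+2+3"))
```

Because whitespace is skipped during tokenization, "1+1" and "1 + 1" produce the same token list, which removes the indexing problem from the question.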
You could remove the spaces before processing the string by using replace().
Try adding in:
clean_input = hold_input.replace(" ", "")
just after you create hold_input.

Pyparsing delimited list only returns first element

Here is my code :
l = "1.3E-2 2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser,delim='\t ')
print(grammar.parseString(l))
It returns :
['1.3E-2']
Obviously, I want both values, not a single one. Any idea what is going on?
As @dawg explains, delimitedList is intended for cases where you have an expression with separating non-whitespace delimiters, typically commas. Pyparsing implicitly skips over whitespace, so in the pyparsing world, what you are really seeing is not a delimitedList, but OneOrMore(realnumber). Also, parseString internally calls str.expandtabs on the provided input string, unless you first call parseWithTabs() on the parser. Expanding tabs to spaces helps preserve columnar alignment of data when it is in tabular form, and when I originally wrote pyparsing, this was a prevalent use case.
If you have control over this data, then you might want to use a different delimiter than <TAB>, perhaps commas or semicolons. If you are stuck with this format, but determined to use pyparsing, then use OneOrMore.
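The effect of tab expansion on the sample input can be seen with plain str.expandtabs. This sketch only illustrates the string method itself, not pyparsing's internals:

```python
line = "1.3E-2\t2.5E+1"

# With the default tab size of 8, the tab is replaced by enough spaces
# to reach the next multiple of 8 columns.
expanded = line.expandtabs()

print(repr(expanded))
print("\t" in expanded)
print(expanded.split())
```

Once the tab has become spaces, a delimiter that contains a literal tab has nothing left to match, while whitespace-based splitting still finds both values.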
As you move forward, you will also want to be more precise about the expressions you define and the variable names that you use. The name "parser" is not very informative, and the pattern of Word(alphanums+'+-.') will match a lot of things besides valid real values in scientific notation. I understand if you are just trying to get anything working, this is a reasonable first cut, and you can come back and tune it once you get something going. If in fact you are going to be parsing real numbers, here is an expression that might be useful:
realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
Then you can define your grammar as "OneOrMore(realnum)", which is also a lot more self-explanatory. And the parse action will convert your strings to floats at parse time, which will save you a step later when actually working with the parsed values.
Good luck!
Works if you switch to raw strings:
l = r"1.3E-2\t2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim=r'\t')
print(grammar.parseString(l))
Prints:
['1.3E-2', '2.5E+1']
In general, delimitedList works with something like PDPDP where P is the parse target and D is the delimiter or delimiting sequence.
You have delim='\t '. That specifically is a delimiter of 1 tab followed by 1 space; it is not either tab or space.

get escaped unicode code from string

I seem to be having the opposite issue as everyone else in the development world. I need to generate escaped characters from strings. For instance, say I have the word MESSAGE:, I need to generate:
\\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A\\u0053\\u0069\\u006D
The closest thing I could get using Python was:
u'MESSAGE:'.encode('utf16')
# output = '\xff\xfeM\x00E\x00S\x00S\x00A\x00G\x00E\x00:\x00'
My first thought was that I could replace \x with \u00 (or something to that effect), but I quickly realized that wouldn't work. What can I do to output the escaped (unescaped?) string in Python (preferably)?
Before everyone starts "answering" and down voting, the escaped \u00... string is what my app is getting from another 3rd party app which I have no control over. I'm trying to generate my own test data so I don't have to rely on that 3rd party app.
Pierre's answer is nearly right, but the for x in u'MESSAGE:' bit would fail for characters above U+FFFF, except for ‘narrow builds’ (primarily Python 1.6–3.2 on Windows) which use UTF-16 for Unicode strings.
On ‘wide builds’ (and in 3.3+ where the distinction no longer exists), len(unichr(0x10000)) is 1 not 2. When this code point is UTF-16BE-encoded you get two surrogates taking up four bytes, so the output is '\\uD800DC00' instead of what you probably wanted, u'\\uD800\\uDC00'.
To cover it on both variants of Python you can do:
>>> h = u'MESSAGE:\U00010000'.encode('utf-16be').encode('hex')
# '004d004500530053004100470045003ad800dc00'
>>> ''.join(r'\u' + h[i:i+4] for i in range(0, len(h), 4))
'\\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a\\ud800\\udc00'
I think this (quick & dirty) code does what you want:
''.join('\\u' + x.encode('utf_16_be').encode('hex') for x in u'MESSAGE:')
# output: '\\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a'
Or if you want more '\':
''.join('\\\\u' + x.encode('utf_16_be').encode('hex') for x in u'MESSAGE:')
# output: '\\\\u004d\\\\u0045\\\\u0053\\\\u0053\\\\u0041\\\\u0047\\\\u0045\\\\u003a'
print _
# output: \\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a
If you absolutely need upper-case for hexadecimal codes:
''.join('\\u' + x.encode('utf_16_be').encode('hex').upper() for x in u'MESSAGE:')
# output: '\\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A'
There's no need to go through the .encode() step if you don't have characters outside the BMP (>0xFFFF):
>>> ''.join('\\u{:04x}'.format(ord(a)) for a in u'Message')
'\\u004d\\u0065\\u0073\\u0073\\u0061\\u0067\\u0065'
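The snippets above use Python 2 idioms (the 'hex' codec, unichr, print statements). For Python 3, a rough equivalent of the UTF-16 approach might look like the sketch below. The function name escape_utf16 is made up for illustration, and upper-case hex digits are an assumption based on the sample in the question:

```python
def escape_utf16(text):
    """Return text as a run of \\uXXXX escapes, one per UTF-16 code unit."""
    data = text.encode("utf-16-be")
    # Each UTF-16 code unit is two bytes, most significant byte first.
    units = (int.from_bytes(data[i:i + 2], "big")
             for i in range(0, len(data), 2))
    return "".join("\\u{:04X}".format(unit) for unit in units)


print(escape_utf16("MESSAGE:"))
print(escape_utf16("\U00010000"))
```

Going through UTF-16 means a character outside the BMP comes out as two surrogate escapes rather than one malformed eight-digit escape.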

regarding backslash from postgresql

I have a noob question.
I have a record in a table that looks like '\1abc'
I then use this string as a regex replacement in re.sub("([0-9])",thereplacement,"2")
I'm a little confused with the backslashes. The string I got back was "\\1abc"
Are you using Python interactively?
In a regular string literal you need to escape backslashes in your code, or use r"..." (Link to docs). If you are running Python interactively and don't assign the results from your database to a variable, the value is printed using its __repr__() method.
>>> s = "\\1abc"
>>> s
'\\1abc' # <-- How it's represented in Python code
>>> print s
\1abc # <-- The actual string
Also, your re.sub is a bit weird. Maybe you meant [0-9] as the pattern (matching a single digit)? The arguments are probably switched too, if thereplacement is your input. This is the syntax:
re.sub(pattern, repl, string, count=0)
So my guess is you expect something like this:
>>> s_in = yourDbMagic() # Which returns \1abc
>>> s_out = re.sub("[0-9]", "2", s_in)
>>> print s_in, s_out
\1abc \2abc
Edit: Tried to better explain escaping/representation.
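If the value really is meant to be used as the replacement argument, it is worth knowing that re.sub treats \1 in a replacement string as a backreference. A small Python 3 sketch, where the variable replacement stands in for the value fetched from the database:

```python
import re

replacement = "\\1abc"   # the five characters \1abc

print(repr(replacement))   # quoted, escaped representation
print(replacement)         # the actual characters

# As a replacement template, \1 refers to capture group 1.
as_template = re.sub(r"([0-9])", replacement, "2")

# The return value of a function is inserted literally, with no
# backreference processing.
as_literal = re.sub(r"([0-9])", lambda match: replacement, "2")

print(as_template)
print(as_literal)
```

In both cases the string holds a single backslash; the doubled backslash only appears in the repr.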
Note that you can make \ stop being an escape character by setting standard_conforming_strings to on.
