Can I override u-strings (u'example') in Python 2? - python

While debugging an upgrade to Python 3, it would be useful to be able to override the u'' string prefix so that it calls my own function or is replaced with a non-u string.
I've tried things like unichr = chr, which is useful for my debugging but doesn't accomplish the above.
Something like module.uprefix = str is the type of solution I'm looking for.

You basically can't; as others have noted in the comments, the u-prefix is handled very early, well before anything where an in-code assignment would take effect.
About the best you could do is use ast.parse to read a module on disk (without importing it) and find all the u'' strings; it distinguishes the prefixes. That would help you find them in a Python-aware way, more reliably than just searching for u' and u", but the difference probably wouldn't be large, especially if you search with word boundaries (regex \bu['"]). Unless you somehow have a lot of u' and u" in your program that aren't the prefixes?
>>> ast.dump(ast.parse('"abc"', mode='eval'))
"Expression(body=Constant(value='abc', kind=None))"
>>> ast.dump(ast.parse('u"abc"', mode='eval'))
"Expression(body=Constant(value='abc', kind='u'))"
Per the comments, what are you trying to do? I've migrated a lot of code from Python 2 to Python 3 and never needed this... There may be a different way to achieve the same goal?
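The ast-based search described above could be sketched roughly as follows. This is only a sketch under a few assumptions: it runs on Python 3.8 or later (where Constant nodes carry the kind attribute shown in the dumps above), the source being scanned parses under Python 3, and the helper name find_u_strings is made up for illustration.

```python
import ast


def find_u_strings(source):
    """Return sorted (line, column, value) tuples for every u-prefixed string literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Constant.kind is 'u' only for literals written with the u prefix.
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, str)
            and node.kind == "u"
        ):
            hits.append((node.lineno, node.col_offset, node.value))
    return sorted(hits)


if __name__ == "__main__":
    sample = 'a = u"abc"\nb = "def"\n'
    for line, col, value in find_u_strings(sample):
        print(f"line {line}, column {col}: {value!r}")
```

Code that still contains Python 2-only syntax (print statements, for example) will not parse this way, so in that situation the regex search is probably the more practical route.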


Python f-string: replacing newline/linebreak [duplicate]

First of all, sorry: I'm quite certain this might be a "duplicate", but I didn't succeed in finding the right solution.
I simply want to replace all linebreaks within my sql-code for logging it to one line, but Python's f-string doesn't support backslashes, so:
# Works fine (but is useless ;))
self.logger.debug(f"Executing: {sql.replace( 'C','XXX')}")
# Results in SyntaxError:
# f-string expression part cannot include a backslash
self.logger.debug(f"Executing: {sql.replace( '\n',' ')}")
Of course there are several ways to accomplish that before the f-string, but I'd really like to keep my "log the line"-code in one line and without additional helper variables.
(Besides, I think it's quite stupid behavior: either you can execute code within the curly brackets or you can't... not "you can, but only without backslashes"...)
This one isn't a desired solution because of additional variables:
How to use newline '\n' in f-string to format output in Python 3.6?
General Update
The suggestion in mkrieger1s comment:
self.logger.debug("Executing %s", sql.replace('\n',' '))
Works fine for me, but as it doesn't use f-strings at all (be that good or bad ;)), I think I can leave this question open.
I found possible solutions
from os import linesep
print(f'{string_with_multiple_lines.replace(linesep, " ")}')
You can do this
newline = '\n'
self.logger.debug(f"Executing: {sql.replace( newline,' ')}")
Your options are:
Don't use f-strings, especially for logging.
Assign the newline to a constant and use that, which you apparently don't want to do.
Use another way of expressing a newline, chr(10) for instance.
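A minimal sketch of the backslash-free variants mentioned here, written for Python 3.6+. The NEWLINE constant and the sample query are made up for illustration; the splitlines() variant is an extra alternative not mentioned in the thread.

```python
NEWLINE = "\n"  # hypothetical module-level constant

sql = "SELECT *\nFROM users\nWHERE id = 1"  # made-up example query

# Variant 1: refer to the newline through a constant.
via_constant = f"Executing: {sql.replace(NEWLINE, ' ')}"

# Variant 2: spell the newline as chr(10), so no backslash appears in the expression.
via_chr = f"Executing: {sql.replace(chr(10), ' ')}"

# Variant 3: split on line boundaries and re-join with spaces.
via_split = f"Executing: {' '.join(sql.splitlines())}"

print(via_constant)
```

As far as I know, Python 3.12 and later accept backslashes inside the expression part of an f-string, so the restriction only matters on older interpreters.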
(Besides, I think it's quite stupid behavior: either you can execute code within the curly brackets or you can't... not "you can, but only without backslashes"...)
Feel free to take a shot at fixing it, I'm pretty sure this restriction was not added because the PEP authors and feature developers wanted it to be a pain in the ass.

How to apply string method on regular expression in Python

I have a Markdown file which is a little bit broken: the links and images which are too long have line breaks in them. I would like to remove the line breaks from them.
Example:
from:
See for example the
[installation process for Ubuntu
Trusty](https://wiki.diasporafoundation.org/Installation/Ubuntu/Trusty). The
project offers a Vagrant installation too, but the documentation only admits
that you know what you do, that you are a developer. If it is difficult to
![https://diasporafoundation.org/assets/pages/about/network-
distributed-e941dd3e345d022ceae909beccccbacd.png](data/images/network-
distributed-e941dd3e345d022ceae909beccccbacd.png)
_A pretty decentralized network (Source: <https://diasporafoundation.org/>)_
to:
See for example the
[installation process for Ubuntu Trusty](https://wiki.diasporafoundation.org/Installation/Ubuntu/Trusty). The
project offers a Vagrant installation too, but the documentation only admits
that you know what you do, that you are a developer. If it is difficult to
![https://diasporafoundation.org/assets/pages/about/network-distributed-e941dd3e345d022ceae909beccccbacd.png](data/images/network-distributed-e941dd3e345d022ceae909beccccbacd.png)
_A pretty decentralized network (Source: <https://diasporafoundation.org/>)_
As you can see in this snippet, I managed to match all the links and images with the right pattern: https://regex101.com/r/uL8pO4/2
But now, what is the syntax in Python to use a string method like string.trim() on what I have captured with regular expression?
For the moment, I'm stuck with this:
fix_newlines = re.compile(r'\[([\w\s*:/]*)\]\(([^()]+)\)')
# Capture the links and remove line-breaks from their urls
# Something like r'[\1](\2)'.trim() ??
post['content'] = fix_newlines.sub(r'[\1](\2)', post['content'])
Edit: I updated the example to be more explicit about my problem.
Thank you for your answer
strip works similarly to trim. Since you need to trim the newlines, use strip('\n'):
fin.readline().strip('\n')
This will work also:
>>> s = """
... ![https://diasporafoundation.org/assets/pages/about/network-
... distributed-e941dd3e345d022ceae909beccccbacd.png](data/images/network-
... distributed-e941dd3e345d022ceae909beccccbacd.png)
... """
>>> new_s = "".join(s.strip().split('\n'))
>>> new_s
'![https://diasporafoundation.org/assets/pages/about/network-distributed-e941dd3e345d022ceae909beccccbacd.png](data/images/network-distributed-e941dd3e345d022ceae909beccccbacd.png)'
>>>
Often times built-in string functions will do, and are easier to read than figuring out regexes. In this case strip removes leading and trailing space, then split returns a list of items between newlines, and join puts them back together in a single string.
Alright, I finally found what I was searching. With the snippet below, I could capture a string with a regex and then apply the treatment on each of them.
import re
def remove_newlines(match):
    # Join the lines of the matched link or image back into a single line
    return "".join(match.group().strip().split('\n'))
links_pattern = re.compile(r'\[([\w\s*:/\-\.]*)\]\(([^()]+)\)')
post['content'] = links_pattern.sub(remove_newlines, post['content'])
Thank you for your answers and sorry if my question wasn't explicit enough.
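For reference, the general technique here is passing a function instead of a replacement string to sub(). Below is a self-contained sketch of that idea. The pattern is the one from the post; the helper names and the sample text are made up. It also makes one assumption that differs from the snippet above: a line break after a hyphen is joined without a space, while any other line break becomes a space, so that "Ubuntu\nTrusty" turns into "Ubuntu Trusty" as in the desired output.

```python
import re

# Pattern taken from the post: a [text](target) pair whose parts may span lines.
LINK_PATTERN = re.compile(r'\[([\w\s*:/\-\.]*)\]\(([^()]+)\)')


def join_wrapped(match):
    """Called by sub() once per match; its return value replaces the match."""
    text = match.group()
    # Assumption: a break right after a hyphen splits a URL or file name.
    text = text.replace("-\n", "-")
    # Assumption: any other break separates words, so it becomes a space.
    return text.replace("\n", " ")


def fix_links(content):
    return LINK_PATTERN.sub(join_wrapped, content)


if __name__ == "__main__":
    sample = "[installation process for Ubuntu\nTrusty](https://example.org/a)"
    print(fix_links(sample))
```

Text outside the matched links and images is left untouched, including its line breaks.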

How to recognize special eol character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see if I print it, or call unicode(special_eol) is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.
To determine what specific character that is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
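On Python 3, a similar inspection could look like the sketch below, which lists the code point and Unicode name of each character in a string. The helper name describe_chars and the "<unnamed>" placeholder are made up for illustration; unicodedata.name() has no name for control characters such as form feed, hence the default.

```python
import unicodedata


def describe_chars(text):
    """Return a (code point, Unicode name) pair for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    mystery = "\u2028"  # stand-in for the unknown line separator
    print(describe_chars(mystery))
    print(ascii(mystery))  # ASCII-only repr, similar to repr() in Python 2
```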

Decoding RFC 2231 headers

While trying to address this issue, I'm trying to wrap my head around the various functions in the Python standard library aimed at supporting RFC 2231. The main aim of that RFC appears to be threefold: allowing non-ASCII encoding in header parameters, noting the language of a given value, and allowing header parameters to span multiple lines. The email.utils module provides several functions to deal with various aspects of this. As far as I can tell, they work as follows:
decode_rfc2231 only splits the value of such a parameter into its parts, like this:
>>> email.utils.decode_rfc2231("utf-8''T%C3%A4st.txt")
['utf-8', '', 'T%C3%A4st.txt']
decode_params takes care of detecting RFC2231-encoded parameters. It collects parts which belong together, and also decodes the url-encoded string to a byte sequence. This byte sequence, however, is then encoded as latin1. And all values are enclosed in quotation marks. Furthermore, there is some special handling for the first argument, which still has to be a tuple of two elements, but those two get passed to the result without modification.
>>> email.utils.decode_params([
... (1,2),
... ("foo","bar"),
... ("name*","utf-8''T%C3%A4st.txt"),
... ("baz*0","two"),("baz*1","-part")])
[(1, 2), ('foo', '"bar"'), ('baz', '"two-part"'), ('name', ('utf-8', '', '"Täst.txt"'))]
collapse_rfc2231_value can be used to convert this triple of encoding, language and byte sequence into a proper unicode string. What has me confused, though, is the fact that if the input was such a triple, then the quotes will be carried over to the output. If, on the other hand, the input was a single quoted string, then these quotes will be removed.
>>> [(k, email.utils.collapse_rfc2231_value(v)) for k, v in
... email.utils.decode_params([
... (1,2),
... ("foo","bar"),
... ("name*","utf-8''T%C3%A4st.txt"),
... ("baz*0","two"),("baz*1","-part")])[1:]]
[('foo', 'bar'), ('baz', 'two-part'), ('name', '"Täst.txt"')]
So it seems that in order to use all this machinery, I'd have to add yet another step to unquote the third element of any tuple I'd encounter. Is this true, or am I missing some point here? I had to figure out a lot of the above with help from the source code, since the docs are a bit vague on the details. I cannot imagine what could be the point behind this selective unquoting. Is there a point to it?
What is the best reference on how to use these functions?
The best I found so far is the email.message.Message implementation. There, the process seems to be roughly the one outlined above, but every field gets unquoted via _unquotevalue after the decode_params, and only get_filename and get_boundary collapse their values, all others return a tuple instead. I hope there is something more useful.
Currently the functions from email.utils are rarely used outside of email.message. Most users seem to prefer using email.message.Message directly. There's even a somewhat old issue report on adding unit tests (that would certainly be usable as examples) to Python, even if I'm not sure how it relates to email.utils.
A short example I found is this blogpost which, however, doesn't contain more than one sentence and a few SLOCs of information about RFC 2231 parsing. The author notes, however, that many MTAs use RFC 2047 instead. Depending on your use case, that might also be an issue.
Judging from the few examples I could find, I assume your way of parsing using email.utils is the only way to go, even if the long list comprehension is somewhat ugly.
Given the lack of examples, it could be wise to write a new RFC 2231 parser (if you really need a better, maybe faster or more beautiful codebase). A new implementation could be based on an existing one, like the Dovecot RFC 2231 parser, for compatibility reasons (you could even use the Dovecot unit test). However, the C code seems quite complex to me, and I can't find any Python implementation besides email.utils and Python 2 backports of email.utils, so the task of porting it to Python won't be easy (note that Dovecot is LGPL-licensed, which might be an issue in your project).
I think the email.utils RFC 2231 API has not been designed for easy standalone usage but more as a pile of utility functions for use in email.message.Message.
Old question, but I could not find a complete answer that works on this. So this is what I ended up doing (on Python 2.7):
import email.utils
import urllib
def decode_rfc2231_header(header):
    """Decode a RFC 2231 header"""
    # Remove any quotes
    header = email.utils.unquote(header)
    encoding, language, value = email.utils.decode_rfc2231(header)
    value = urllib.unquote(value)
    return email.utils.collapse_rfc2231_value((encoding, language, value))
For example:
>>> name = u'èéêëēėęûüùúūàáâäæãåāāîïíīįì test ôöòóœøōõssśšłžźżçćčñń'
>>> encoded_header = email.utils.encode_rfc2231(name.encode("utf8"), 'utf8', 'en')
>>> print encoded_header
utf8'en'%C3%A8%C3%A9%C3%AA%C3%AB%C4%93%C4%97%C4%99%C3%BB%C3%BC%C3%B9%C3%BA%C5%AB%C3%A0%C3%A1%C3%A2%C3%A4%C3%A6%C3%A3%C3%A5%C4%81%C4%81%C3%AE%C3%AF%C3%AD%C4%AB%C4%AF%C3%AC%20test%20%C3%B4%C3%B6%C3%B2%C3%B3%C5%93%C3%B8%C5%8D%C3%B5ss%C5%9B%C5%A1%C5%82%C5%BE%C5%BA%C5%BC%C3%A7%C4%87%C4%8D%C3%B1%C5%84
>>> print decode_rfc2231_header(encoded_header)
èéêëēėęûüùúūàáâäæãåāāîïíīįì test ôöòóœøōõssśšłžźżçćčñń
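For Python 3, the same idea might be ported roughly as below. This is a sketch, not a tested drop-in replacement: it assumes that percent-decoding with latin-1 is the right way to produce the "one character per byte" string that collapse_rfc2231_value then re-interprets in the declared charset.

```python
import email.utils
import urllib.parse


def decode_rfc2231_header(header):
    """Decode a single RFC 2231 encoded parameter value (Python 3 sketch)."""
    # Remove any surrounding quotes.
    header = email.utils.unquote(header)
    charset, language, value = email.utils.decode_rfc2231(header)
    # Percent-decode to a str in which every character stands for one raw byte.
    raw = urllib.parse.unquote(value, encoding="latin-1")
    # Let the standard library interpret those bytes in the declared charset.
    return email.utils.collapse_rfc2231_value((charset, language, raw))


if __name__ == "__main__":
    print(decode_rfc2231_header("utf-8''T%C3%A4st.txt"))
```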

Python: Latex symbols to unicode?

I have found several answers hinting at how to solve the conversion from unicode to latex symbols, for example turning u'á' into \'{a}.
Well, I need it the other way around! So I did some research and found this dictionary. Since the mapping seems to be bijective, I thought of turning the dictionary the other way around. But I can't figure out how to "use" the keys in this dictionary:
u"\u0020": "\\space ",
u"\u0023": "\\#",
u"\u0024": "\\textdollar ",
u"\u0025": "\\%",
How can I turn them inside python to "human readable characters"?
Is there perhaps a better and more complete way to achieve my goal?
The notation u'\u0020' is just an escape sequence that specifies the space character, only it does so by specifying it by character code. The author of the dictionary probably did it this way so that it would be obvious if something was missing, but you don't need to perform any special conversion to use the dictionary, since u'\u0020' == ' '.
... What?
>>> print {u'\u0020': '\\space'}[u' ']
\space
(That is, they're already characters; you need do nothing to them.)
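Turning the dictionary around, as the question suggests, could be sketched like this in Python 3. Only the four entries quoted in the question are used; the names UNICODE_TO_LATEX, LATEX_TO_UNICODE and latex_to_unicode are made up, and the sketch assumes the mapping really is one-to-one (if two characters shared a LaTeX command, the later entry would silently win).

```python
# The four entries quoted in the question; the real dictionary is much longer.
UNICODE_TO_LATEX = {
    "\u0020": "\\space ",
    "\u0023": "\\#",
    "\u0024": "\\textdollar ",
    "\u0025": "\\%",
}

# Swap keys and values to get the LaTeX -> character direction.
LATEX_TO_UNICODE = {latex: char for char, latex in UNICODE_TO_LATEX.items()}


def latex_to_unicode(text):
    """Replace every known LaTeX command in text with its character."""
    # Longest commands first, so a short command cannot clobber a longer one.
    for latex in sorted(LATEX_TO_UNICODE, key=len, reverse=True):
        text = text.replace(latex, LATEX_TO_UNICODE[latex])
    return text


if __name__ == "__main__":
    print(latex_to_unicode("50\\% of \\#1"))
```

Plain repeated str.replace is a simplification; a real converter would probably need a proper tokenizer to deal with braces and command arguments.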
