Python: Latex symbols to unicode?


I have found several answers hinting at how to convert unicode to latex symbols. For example, turning u'á' into \'{a}.
Well, I need it the other way around! So I did some research and found this dictionary. Since the mapping seems to be bijective, I thought of turning the dictionary the other way around. But I can't figure out how to "use" the keys in this dictionary:
u"\u0020": "\\space ",
u"\u0023": "\\#",
u"\u0024": "\\textdollar ",
u"\u0025": "\\%",
How can I turn them inside python to "human readable characters"?
Is there perhaps a better and more complete way to achieve my goal?

The notation u'\u0020' is just an escape sequence that specifies the space character, only it does so by specifying it by character code. The author of the dictionary probably did it this way so that it would be obvious if something was missing, but you don't need to perform any special conversion to use the dictionary, since u'\u0020' == ' '.

... What?
>>> print {u'\u0020': '\\space'}[u' ']
\space
(That is, they're already characters; you need do nothing to them.)
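
Since the keys are ordinary characters, turning the dictionary around is just a dict comprehension. A minimal sketch in Python 3, assuming only the four-entry excerpt shown in the question; the names UNICODE_TO_LATEX, LATEX_TO_UNICODE, and latex_to_unicode are made up for illustration, and the plain longest-first string replacement is an assumption that will not cope with every LaTeX construct:

```python
# Excerpt of the unicode -> LaTeX table from the question.
UNICODE_TO_LATEX = {
    "\u0020": "\\space ",
    "\u0023": "\\#",
    "\u0024": "\\textdollar ",
    "\u0025": "\\%",
}

# Invert it; this is only safe if no two characters share a LaTeX command.
LATEX_TO_UNICODE = {latex: char for char, latex in UNICODE_TO_LATEX.items()}


def latex_to_unicode(text):
    """Replace each known LaTeX command in text with its character."""
    # Longest commands first, so a short command never clobbers a longer one.
    for latex in sorted(LATEX_TO_UNICODE, key=len, reverse=True):
        text = text.replace(latex, LATEX_TO_UNICODE[latex])
    return text


print(latex_to_unicode("50\\% off, only 5\\textdollar "))
```

If the full dictionary turns out not to be bijective, the comprehension silently keeps the last character seen for a given command, so it is worth comparing the lengths of the two dictionaries after inverting.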

Related

Can I override u-strings (u'example') in Python 2?

In debugging upgrading to Python 3, it would be useful to be able to override the u'' string prefix to call my own function or replace with a non-u string.
I've tried things like unichr = chr which is useful for my debugging but doesn't accomplish the above.
module.uprefix = str is the type of solution I'm looking for.

You basically can't; as others have noted in the comments, the u-prefix is handled very early, well before anything where an in-code assignment would take effect.
About the best you could do is use ast.parse to read a module on disk (without importing it) and find all the u'' strings; it distinguishes the prefixes. That would help you find them in a Python-aware way, more reliably than just searching for u' and u", but the difference probably wouldn't be large, especially if you search with word boundaries (regex \bu['"]). Unless you somehow have a lot of u' and u" in your program that aren't the prefixes?
>>> ast.dump(ast.parse('"abc"', mode='eval'))
"Expression(body=Constant(value='abc', kind=None))"
>>> ast.dump(ast.parse('u"abc"', mode='eval'))
"Expression(body=Constant(value='abc', kind='u'))"
Per the comments, what are you trying to do? I've migrated a lot of code from Python 2 to Python 3 and never needed this... There may be a different way to achieve the same goal?
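
The ast approach described above could look roughly like this. It is a sketch that assumes Python 3.8 or later (where Constant nodes carry the kind attribute shown in the dump); find_u_prefixed and the sample source are hypothetical names used only for illustration. For a module on disk you would pass in the file's contents instead of the sample string.

```python
import ast


def find_u_prefixed(source):
    """Return (line, column, value) for every u-prefixed string literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Constant.kind is 'u' for u"..." literals and None otherwise.
        if isinstance(node, ast.Constant) and node.kind == "u":
            hits.append((node.lineno, node.col_offset, node.value))
    return sorted(hits)


sample = 'a = u"abc"\nb = "plain"\nc = [u\'x\', b"bytes"]\n'
for line, col, value in find_u_prefixed(sample):
    print(line, col, repr(value))
```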

How to recognize special eol character when I see it, using Python?

I'm scraping a set of originally pdf files, using Python. Having gotten them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see if I print it, or call unicode(special_eol) is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.

To determine what specific character that is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
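
In Python 3 the same idea works on plain str objects, and unicodedata can also give you the character's name. A small sketch; describe is a made-up helper, and U+2028 is only a guess at the kind of separator that might turn up in text extracted from a PDF, not a claim about what the asker actually had:

```python
import unicodedata


def describe(char):
    """Return the code point, escape form, and Unicode name of one character."""
    return (
        "U+%04X" % ord(char),
        char.encode("unicode_escape").decode("ascii"),
        # Control characters have no name, so supply a default.
        unicodedata.name(char, "<unnamed>"),
    )


print(describe("\u2603"))
print(describe("\u2028"))
```

Once you know the code point, you can refer to the character in source code with the escape form, for example my_str.replace('\u2028', '').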

Python URL Characters

I am really new to Python and coding in general, but I have been making some good strides.
I am able to pull some data off of the web through an API, and the result should be a string. What I am seeing though, are some instances such as "& amp;"" and " &quot". (I modified the character sets so it would print properly to the screen)
I figure there is a way to clean this string and remove the characters such that it looks like it does on a computer screen. I tried searching for urldecoding, but admittedly I don't even know if that is the solution.
Any help on how to remove these "extra" characters and produce a readable string will be greatly appreciated!
Many thanks in advance,
Brock

xml.sax.saxutils.unescape(data[, entities]): Unescape '&amp;', '&lt;', and '&gt;' in a string of data.
You can unescape other strings of data by passing a dictionary as the optional entities parameter. The keys and values must all be strings; each key will be replaced with its corresponding value. '&amp;', '&lt;', and '&gt;' are always unescaped, even if entities is provided.
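
A short sketch of how this plays out, with a made-up input string. It also shows html.unescape from the Python 3 standard library, which is not part of the answer above but handles all named HTML entities without an explicit mapping:

```python
import html
from xml.sax.saxutils import unescape

raw = "Tom &amp; Jerry say &quot;hi&quot; &lt;3"

# saxutils.unescape handles &amp;, &lt; and &gt; by default ...
print(unescape(raw))
# ... and anything else you list in the optional entities mapping.
print(unescape(raw, {"&quot;": '"'}))
# html.unescape (Python 3.4+) knows the named HTML entities already.
print(html.unescape(raw))
```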

What's a good way to replace international characters with their base Latin counterparts using Python?

Say I have the string "blöt träbåt" which has a few a and o with umlaut and ring above. I want it to become "blot trabat" as simply as possible. I've done some digging and found the following method:
import unicodedata
unicode_string = unicodedata.normalize('NFKD', unicode(string))
This will give me the string in unicode format with the international characters split into base letter and combining character (\u0308 for umlauts.) Now to get this back to an ASCII string I could do ascii_string = unicode_string.encode('ASCII', 'ignore') and it'll just ignore the combining characters, resulting in the string "blot trabat".
The question here is: is there a better way to do this? It feels like a roundabout way, and I was thinking there might be something I don't know about. I could of course wrap it up in a helper function, but I'd rather check if this doesn't exist in Python already.

It would be better if you created an explicit table, and then used the unicode.translate method. The advantage would be that transliteration is more precise, e.g. transliterating "ö" to "oe" and "ß" to "ss", as should be done in German.
There are several transliteration packages on PyPI: translitcodec, Unidecode, and trans.
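
To make the two approaches concrete, here is a sketch in Python 3, where unicode.translate is simply str.translate. The GERMAN table is a hypothetical four-entry example, not a complete transliteration scheme, and strip_accents is a made-up helper that wraps the NFKD method from the question:

```python
import unicodedata

# Hypothetical German-style table; extend it for the languages you care about.
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_accents("blöt träbåt"))
print("Größe".translate(GERMAN))
```

Note that "ß" has no decomposition, so the NFKD route leaves it untouched (and encoding to ASCII with 'ignore' would drop it entirely), which is one reason an explicit table can be more precise.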

Single quotes vs. double quotes in Python [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
According to the documentation, they're pretty much interchangeable. Is there a stylistic reason to use one over the other?

I like to use double quotes around strings that are used for interpolation or that are natural language messages, and single quotes for small symbol-like strings, but will break the rules if the strings contain quotes, or if I forget. I use triple double quotes for docstrings and raw string literals for regular expressions even if they aren't needed.
For example:
import re

LIGHT_MESSAGES = {
    'English': "There are %(number_of_lights)s lights.",
    'Pirate': "Arr! Thar be %(number_of_lights)s lights."
}

def lights_message(language, number_of_lights):
    """Return a language-appropriate string reporting the light count."""
    return LIGHT_MESSAGES[language] % locals()

def is_pirate(message):
    """Return True if the given message sounds piratical."""
    return re.search(r"(?i)(arr|avast|yohoho)!", message) is not None

Quoting the official docs at https://docs.python.org/2.0/ref/strings.html:
In plain English: String literals can be enclosed in matching single quotes (') or double quotes (").
So there is no difference. Instead, people will tell you to choose whichever style that matches the context, and to be consistent. And I would agree - adding that it is pointless to try to come up with "conventions" for this sort of thing because you'll only end up confusing any newcomers.

I used to prefer ', especially for '''docstrings''', as I find """this creates some fluff""". Also, ' can be typed without the Shift key on my Swiss German keyboard.
I have since changed to using triple quotes for """docstrings""", to conform to PEP 257.

I'm with Will:
Double quotes for text
Single quotes for anything that behaves like an identifier
Double quoted raw string literals for regexps
Tripled double quotes for docstrings
I'll stick with that even if it means a lot of escaping.
I get the most value out of single quoted identifiers standing out because of the quotes. The rest of the practices are there just to give those single quoted identifiers some standing room.

If the string you have contains one, then you should use the other. For example, "You're able to do this", or 'He said "Hi!"'. Other than that, you should simply be as consistent as you can (within a module, within a package, within a project, within an organisation).
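
A quick sketch of that rule of thumb, using the two example strings from this answer; the variable names are invented for illustration:

```python
# Both quote styles produce equal strings.
assert 'abc' == "abc"

# Pick the delimiter the text does not contain, and no escaping is needed.
contraction = "You're able to do this"
quotation = 'He said "Hi!"'

# The escaped spellings are equal, just noisier to read.
assert contraction == 'You\'re able to do this'
assert quotation == "He said \"Hi!\""

# Triple quotes allow both kinds at once.
both = """She said "it's fine"."""
print(contraction, quotation, both, sep="\n")
```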
If your code is going to be read by people who work with C/C++ (or if you switch between those languages and Python), then using '' for single-character strings, and "" for longer strings might help ease the transition. (Likewise for following other languages where they are not interchangeable).
The Python code I've seen in the wild tends to favour " over ', but only slightly. The one exception is that """these""" are much more common than '''these''', from what I have seen.

Triple quoted comments are an interesting subtopic of this question. PEP 257 specifies triple quotes for doc strings. I did a quick check using Google Code Search and found that triple double quotes in Python are about 10x as popular as triple single quotes -- 1.3M vs 131K occurrences in the code Google indexes. So in the multi line case your code is probably going to be more familiar to people if it uses triple double quotes.

"If you're going to use apostrophes,
       ^
you'll definitely want to use double quotes".
   ^
For that simple reason, I always use double quotes on the outside. Always
Speaking of fluff, what good is streamlining your string literals with ' if you're going to have to use escape characters to represent apostrophes? Does it offend coders to read novels? I can't imagine how painful high school English class was for you!

Python uses quotes something like this:
mystringliteral1="this is a string with 'quotes'"
mystringliteral2='this is a string with "quotes"'
mystringliteral3="""this is a string with "quotes" and more 'quotes'"""
mystringliteral4='''this is a string with 'quotes' and more "quotes"'''
mystringliteral5='this is a string with \"quotes\"'
mystringliteral6='this is a string with \042quotes\042'
mystringliteral7='this is a string with \047quotes\047'
print mystringliteral1
print mystringliteral2
print mystringliteral3
print mystringliteral4
print mystringliteral5
print mystringliteral6
print mystringliteral7
Which gives the following output:
this is a string with 'quotes'
this is a string with "quotes"
this is a string with "quotes" and more 'quotes'
this is a string with 'quotes' and more "quotes"
this is a string with "quotes"
this is a string with "quotes"
this is a string with 'quotes'

I use double quotes in general, but not for any specific reason - Probably just out of habit from Java.
I guess you're also more likely to want apostrophes in an inline literal string than you are to want double quotes.

Personally I stick with one or the other. It doesn't matter. And providing your own meaning to either quote is just to confuse other people when you collaborate.

It's probably a stylistic preference more than anything. I just checked PEP 8 and didn't see any mention of single versus double quotes.
I prefer single quotes because it's only one keystroke instead of two. That is, I don't have to mash the shift key to type a single quote.

In Perl you want to use single quotes when you have a string which doesn't need to interpolate variables or escaped characters like \n, \t, \r, etc.
PHP makes the same distinction as Perl: content in single quotes will not be interpreted (not even \n will be converted), as opposed to double quotes which can contain variables to have their value printed out.
Python does not, I'm afraid. Technically seen, there is no $ token (or the like) to separate a name/text from a variable in Python. Both features make Python more readable, less confusing, after all. Single and double quotes can be used interchangeably in Python.

I chose to use double quotes because they are easier to see.

I just use whatever strikes my fancy at the time; it's convenient to be able to switch between the two at a whim!
Of course, when quoting quote characters, switching between the two might not be so whimsical after all...

Your team's taste or your project's coding guidelines.
If you are in a multilanguage environment, you might wish to encourage the use of the same type of quotes for strings that the other language uses, for instance. Else, I personally like best the look of '

None as far as I know. Although if you look at some code, " " is commonly used for strings of text (I guess ' is more common inside text than "), and ' ' appears in hashkeys and things like that.

I aim to minimize both pixels and surprise. I typically prefer ' in order to minimize pixels, but " instead if the string has an apostrophe, again to minimize pixels. For a docstring, however, I prefer """ over ''' because the latter is non-standard, uncommon, and therefore surprising. If now I have a bunch of strings where I used " per the above logic, but also one that can get away with a ', I may still use " in it to preserve consistency, only to minimize surprise.
Perhaps it helps to think of the pixel minimization philosophy in the following way. Would you rather that English characters looked like A B C or AA BB CC? The latter choice wastes 50% of the non-empty pixels.

I use double quotes because I have been doing so for years in most languages (C++, Java, VB…) except Bash, because I also use double quotes in normal text and because I'm using a (modified) non-English keyboard where both characters require the shift key.

' = "
/ = \ = \\
example :
f = open('c:\word.txt', 'r')
f = open("c:\word.txt", "r")
f = open("c:/word.txt", "r")
f = open("c:\\\word.txt", "r")
Results are the same
=>> no, they're not the same.
A single backslash will escape characters. You just happen to luck out in that example because \k and \w aren't valid escapes like \t or \n or \\ or \"
If you want to use single backslashes (and have them interpreted as such), then you need to use a "raw" string. You can do this by putting an 'r' in front of the string
im_raw = r'c:\temp.txt'
non_raw = 'c:\\temp.txt'
another_way = 'c:/temp.txt'
As far as paths in Windows are concerned, forward slashes are interpreted the same way. Clearly the string itself is different though. I wouldn't guarantee that they're handled this way on an external device though.
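
To see the difference concretely, here is a small sketch using the temp.txt path from this answer (the variable names are illustrative). The \t in the plain literal is a valid escape, so it silently becomes a TAB:

```python
plain = 'c:\temp.txt'     # \t is an escape: this holds a TAB, no backslash
raw = r'c:\temp.txt'      # raw string: the backslash is kept as written
doubled = 'c:\\temp.txt'  # escaped backslash: same value as the raw string
forward = 'c:/temp.txt'   # a different string, though Windows accepts it as a path

print('\t' in plain, '\\' in plain)
print(raw == doubled)
print(len(plain), len(raw))
```

An unrecognized escape such as \w is left in the string as a backslash followed by w, which is why the c:\word.txt example happened to work; recent Python versions warn about such escapes, so a raw string or a doubled backslash is the safer choice.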
