Search and replace characters in a file with Python - python

I am trying to do transliteration where I need to replace every source character in English from a file with its equivalent from a dictionary I am using in the source code corresponding to another language in Unicode format. I am now able to read character by character from a file in English how do I search for its equivalent map in the dictionary I have defined in the source code and make sure that is printed in a new transliterated output file. Thank you:).

The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings which would make it impossible to have characters such as 'पत्र'!).
All you have to do is layout your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:
each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).
the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).
Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically doing one text file as a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation -- available from this site).

Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.
Something like this:
for syllable in input_text.split_into_syllables():
output_file.write(d[syllable])
Here output_file is a file object, open for writing. d is a dictionary where the indexes are your source characters and the values are the output characters. You can also try to read your file line-by-line instead of reading it all in at once.

Related

Decoding a byte with latin-1 characters to string with decimal representation

I am working on a migration project to upgrade a layer of web server from python 2.7.8 to python 3.6.3 and I have hit a roadblock for some special cases.
When a request is received from a client, payload is transmitted locally using pyzmq which now interacts in bytes in python3 instead of str (as it is in python2).
Now, the payload which I am receiving is encoded using iso-8859-1 (latin-1) scheme and I can easily convert it into string as payload.decode('latin-1') and pass it to next service (svc-save-entity) which expects string argument.
However, the subsequent service 'svc-save-entity' expects latin-1 chars (if present) to be represented in ASCII Character Reference (such as é for é) rather than in Hex (such as \xe9 for é).
I am struggling to find an efficient way to achieve this conversion. Can any python expert guide me here? Essentially I need the definition of a function say decode_tostring():
payload = b'Banco Santander (M\xe9xico)' #payload is in bytes
payload_str = decode_tostring(payload) #function to convert into string
payload_str == 'Banco Santander (México)' #payload_str is a string in ASCII Character Reference
Definition of decode_tostring() please. :)
The encode() and decode() methods accept a parameter called errors which allows you to specify how characters which are not representable in the specified encoding should be handled. The one you're looking for is XML numeric character reference replacement, which is fortunately one of the standard handlers provided in the codecs module.
Now, it's a little complex to actually do the replacement the way you want it, because the operation of replacing non-ASCII characters with their corresponding XML numeric character references happens during encoding, not decoding. After all, encoding is the process that takes in characters and emits bytes, so it's only during encoding that you can tell whether you have a character that is not part of ASCII. The cleanest way I can think of at the moment to get the transformation you want is to decode, re-encode, and re-decode, applying the XML entity reference replacement during the encoding step.
def decode_tostring(payload):
return payload.decode('latin-1').encode('ascii', errors='xmlcharrefreplace').decode('ascii')
I wouldn't be surprised if there is a method somewhere out there that will replace all non-ASCII characters in a string with their XML numeric character refs and give you back a string, and if so, you could use it to replace the encoding and the second decoding. But I don't know of one. The closest I found at the moment was xml.sax.saxutils.escape(), but that only acts on certain specific characters.
This isn't really relevant to your main question, but I did want to clarify one thing: the numeric entities like é are a feature of SGML, HTML, and XML, which are markup languages - a way to represent structured data as text. They have nothing to do with ASCII. A character encoding like ASCII is nothing more than a table of some characters and some byte sequences such that each character in the table is mapped to one byte sequence in the table and vice versa, with a few constraints to make the mapping unambiguous.
If you have a string with characters that are not in a particular encoding's table, you can't encode the string using that encoding. But what you can do is convert the string into a new string by replacing the characters which aren't in the table with sequences of characters that are in the table, and then encode the new string. There are many ways to do the replacement, of which XML numeric entity references are one example. Some of the other error handlers in Python's codecs module represent other approaches to this replacement.

fluphenazine read as \xef\xac\x82uphenazine

When I write
>>> st = "Piperazine (perphenazine, fluphenazine)"
>>> st
'Piperazine (perphenazine, \xef\xac\x82uphenazine)'
What is happening? why doesn't it do this for any fl? How do I avoid this?
It looks \xef\xac\x82 is not, in fact, fl. Is there any way to 'translate' this character into fl (as the author intended it), without just excluding it via something like
unicode(st, errors='ignore').encode('ascii')
This is what is called a "ligature".
In printing, the f and l characters were typeset with a different amount of space between them from what normal pairs of sequential letters used - in fact, the f and l would merge into one character. Other ligatures include "th", "oe", and "st".
That's what you're getting in your input - the "fl" ligature character, UTF-8 encoded. It's a three-byte sequence. I would take minor issue with your assertion that it's "not, in fact fl" - it really is, but your input is UTF-8 and not ASCII :-). I'm guessing you pasted from a Word document or an ebook or something that's designed for presentation instead of data fidelity (or perhaps, from the content, it was a LaTeX-generated PDF?).
If you want to handle this particular case, you could replace that byte sequence with the ASCII letters "fl". If you want to handle all such cases, you will have to use the Unicode Consortium's "UNIDATA" file at: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt . In that file, there is a column for the "decomposition" of a character. The f-l ligature has the identifier "LATIN SMALL LIGATURE FL". There is, by the way, a Python module for this data file at https://docs.python.org/2/library/unicodedata.html . You want the "decomposition" function:
>>> import unicodedata
>>> foo = u"fluphenazine"
>>> unicodedata.decomposition(foo[0])
'<compat> 0066 006C'
0066 006C is, of course, ASCII 'f' and 'l'.
Be aware that if you're trying to downcast UTF-8 data to ASCII, you're eventually going to have a bad day. There are only 127 ASCII characters, and UTF-8 has millions upon millions of code points. There are many codepoints in UTF-8 that cannot be readily represented as ASCII in a nonconvoluted way - who wants to have some text end up saying "<TREBLE CLEF> <SNOWMAN> <AIRPLANE> <YELLOW SMILEY FACE>"?

Mapping Unicode to ASCII in Python

I receive strings after querying via urlopen in JSON format:
def get_clean_text(text):
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
for track in json["tracks"]:
print track["name"].lower()
get_clean_text(track["name"].lower())
For the string "türlich, türlich (sicher, dicker)" I then get
File "main.py", line 23, in get_clean_text
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
TypeError: character mapping must return integer, None or unicode
I want to format the string to be "türlich türlich sicher dicker".
The question is not a complete self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self contained, correct example. (That, and the fact that various other people—some of them probably smarter than me—likely ignored your question because it was ambiguous.)
Assuming you're using 2.x, and you've done a from string import * to get maketrans, and json["name"] is unicode rather than str/bytes, here's your problem:
There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).
The string.maketrans function makes old-style 8-bit translation tables. So you can't use it with unicode.translate.
You can always write your own "makeunitrans" function as a drop-in replacement, something like this:
def makeunitrans(frm, to):
return {ord(f):ord(t) for (f,t) in zip(frm, to)}
But if you just want to map out certain characters, you could do something a bit more special purpose:
def makeunitrans(frm):
return {ord(f):ord(' ') for f in frm}
However, from your final comment, I'm not sure translate is even what you want:
I want to format the string to be "türlich türlich sicher dicker"
If you get this right, you're going to format the string to be "türlich türlich sicher dicker ", because you're mapping all those punctuation characters to spaces, not nothing.
With new-style translation tables you can map anything you want to None, which solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table in-line every time through if that were an issue) or using a trivial regular expression.

How to change a strings encoding as utf 8 in C

How can i change character encoding of a string to UTF-8? I am making some execv calls to a python program but python returns the strings with the some characters cut of. I don't know if this a python issue or c issue but i thought if i can change the strings encoding in c and then pass it to python, it should do the trick. So how can i do that?
Thanks.
C as a language does not facilitate string encoding. A C string is simply a null-terminated sequence of characters (8-bit signed integers, on most systems).
A wide string (with characters of type wchar_t, typically 16-bit integers) can also be used to hold larger character values; however, again, C standard library functions and data types are in no way aware of any concept of string encoding.
The answer to your question is to ensure that the strings you're passing into Python are encoded as UTF-8.
In order to help you accomplish that in any detailed capacity, however, you will have to provide more information about how your strings are currently formed, what they contain, and how you're constructing your argument list for exec.
There is no such thing as character encoding in C.
A char* can hold any data, how you interpret the characters is up to you. For instance, printf will typically dump the characters as they are to the standard output, and if your console interprets those characters as UFT8, they'll appear as such.
If you want to convert between different encodings in the C side, you can have a look at ICU.
If you want to convert between encodings in the Python side, look at http://docs.python.org/howto/unicode.html.

Python: getting \\u00bd correctly in editor

I would like to do the following:
1) Serialize my class
2) Also manually edit the serialization dump file to remove certain objects of my class which I find unnecessary.
I am currently using python with simplejson. As you know, simplejson converts all characters to unicde. As a result, when I dump a particular object with simplejson, the unicode characters becomes something like that "\u00bd" for 好.
I am interested to manually edit the simplejson file for convenience. Anyone here know a work around for me to do this?
My requirements for this serialization format:
1) Easy to use (just dump and load - done)
2) Allows me to edit them manually without much hassle.
3) Able to display chinese character
I use vim. Does anyone know a way to conver "\u00bd" to 好 in vim?
I don't know anything about simplejson or the Serialisation part of the question, but you asked about converting "\u00bd" to 好 in Vim. Here are some vim tips for working with unicode:
You'll need the correct encoding set up in vim, see:
:help 'encoding'
:help 'fileencoding'
Entering unicode characters by number is simply a case of going into insert mode, pressing Ctrl-V and then typing u followed by the four digit number (or U followed by an 8-digit number). See:
:help i_CTRL-V_digit
Also bear in mind that in order for the character to display correctly in Vim, you'll need a fixed-width font containing that character. It appears as a wide space in Envy Code R and as various boxes in Lucida Console, Consolas and Courier New.
To replace \uXXXX with unicode character XXXX (where X is any hexadecimal digit), type this when in normal mode (where <ENTER> means press the ENTER key, don't type it literally):
:%s/\\u\x\{4\}/\=eval('"' . submatch(0) . '"')/g<ENTER>
Note however that u00bd appears to be unicode character ½ (1/2 in case that character doesn't display correctly on your screen), not the 好 character you mentioned (which is u597D I think). See this unicode table. Start vim and type these characters (where <Ctrl-V> is produced by holding CTRL, pressing V, releasing V and then releasing CTRL):
i<Ctrl-V>u00bd
You should see a small character looking like 1/2, assuming your font supports that character.
If you want json/simplejson to produce unicode output instead of str output with Unicode escapes then you need to pass ensure_ascii=False to dump()/dumps(), then either encode before saving or use a file-like from codecs.

Categories