I’ve got files from a system restore which have odd bits of data padded out to the front of the file which makes it gobbledegook when opening it. I’ve got a text file of file signatures which I’ve collected, and which contain information represented like this at the moment:
Sig_MicrosoftOffice_before2007= \xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1
What I am planning on is reading the text file and using the data to identify the correct header in the data of the corrupt file, and strip everything off before it – hopefully leaving a readable file after. I’m stuck on how best to get this data into python in a readable format though.
My first try was simply reading the values from the file, but as python does, it’s representing the backslashes as the escape character. Is this the best method to achieve what I need? Do I need to think about representing the data in the text file some other way? Or maybe in a dictionary? Any help you could provide would be really appreciated.
You can decode the \xhh escapes by using the string_escape codec (Python 2) or the unicode_escape codec (Python 3, or when you have to use Unicode in Python 2):
>>> r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
'\\xD0\\xCF\\x11\\xE0\\xA1\\xB1\\x1A\\xE1'
>>> r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'.decode('string_escape')
'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
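For Python 3, where the string_escape codec no longer exists, the same idea might be sketched as follows. The helper names parse_signature_line and strip_before_signature are hypothetical, and the sketch assumes each line of the signature file has the form name= \xhh\xhh... and contains only ASCII characters.

```python
def parse_signature_line(line):
    # Hypothetical helper: split "name= <escapes>" into (name, raw bytes).
    name, _, escaped = line.partition("=")
    # unicode_escape turns each \xhh into the code point U+00hh;
    # latin-1 then maps code points 0-255 back to single bytes.
    raw = escaped.strip().encode("ascii").decode("unicode_escape").encode("latin-1")
    return name.strip(), raw


def strip_before_signature(data, signature):
    # Hypothetical helper: drop everything before the first occurrence
    # of the signature; leave the data untouched if it is not found.
    index = data.find(signature)
    return data if index == -1 else data[index:]


name, signature = parse_signature_line(
    r"Sig_MicrosoftOffice_before2007= \xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
)
corrupt = b"padding-from-restore" + signature + b"rest of file"
repaired = strip_before_signature(corrupt, signature)
```

Storing the parsed signatures in a dictionary keyed by name would be a natural next step, but that choice is not something the answer above prescribes.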
I'm working with a JSON file that contains some strings in an unknown encoding, such as the example below:
"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
I have loaded this text by using json.load() function in Python 3.7 environment and tried to encode/decode it with some methods I found around the Internet but I still cannot get the proper string as I expected. (In this case, it has to be Lê Nguyễn Phú).
My question is: which encoding did they use, and how do I parse this text properly in Python?
The JSON file comes from an external source that I don't control, so I cannot find out or change how the text was encoded.
[Updated] More details:
The JSON file looks like this:
{
"content":"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}
Firstly, I loaded the JSON file:
with open(json_path, 'r') as f:
    data = json.load(f)
But when I extract the content, it's not what I expected:
string = data.get('content', '')
print(string)
'Lê Nguyá»\x85n Phú'
Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like
json.loads(in_string).encode("latin_1").decode("utf_8")
Which decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in a 1-to-1 correspondence with the first 256 Unicode codepoints), and then re-decodes those bytes as UTF-8.
The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion... there's no completely reliable way to look at an input and decide whether it should have this broken decoding applied to it. If you try to apply it to a validly encoded string containing codepoints above U+00FF, it will crash. But if you try to apply it to a validly encoded string containing only codepoints up to U+00FF, it will turn your perfectly good string into a different kind of garbage.
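A minimal sketch of that reversal, wrapped in a hypothetical helper named repair_mojibake that returns the input unchanged when either step fails. The fallback only avoids the crash; it does not resolve the ambiguity described above, because a valid string whose Latin-1 bytes happen to form valid UTF-8 would still be altered.

```python
import json


def repair_mojibake(text):
    # Hypothetical helper: undo "UTF-8 bytes mistaken for Latin-1 characters".
    try:
        return text.encode("latin_1").decode("utf_8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above U+00FF, or its bytes are
        # not valid UTF-8; assume it was not garbled this way.
        return text


garbled = json.loads(r'"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"')
fixed = repair_mojibake(garbled)
print(fixed)
```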
I am trying to write a python script to convert a hex string into ASCII and save the result into a file in .der cert format. I can do this in Notepad++ using the conversion plugin, but I would like to find a way to do this conversion in a python script from command line, either by invoking the notepad++ NppConverter plugin or using python modules.
I am part way there, but my conversion is not identical to the ASCII output seen in Notepad++. Below is a snippet of the output in Notepad++:
But my Python conversion displays a slightly different output, below:
As you can see, my script causes missing characters in the output, and if I'm honest I don't know why certain blocks are outlined in black. But these missing blocks are needed in the same format as in the first picture.
Here's my basic code. I am working in Python 3, and I am using the backslashreplace error handler, as this is the only way I can get the problematic hex to appear in the output file:
result = bytearray.fromhex('380c2fd6172cd06d1f30').decode('ascii', 'backslashreplace')
text_file = open("C:\Output.der", "w")
text_file.write(result)
text_file.close()
Any guidance would be greatly appreciated.
MikG, I would say that python did exactly what you requested.
You told it to convert the bytes to a string and to replace every byte that has its most significant bit set with an escape sequence.
Characters \x17 (ETB) and \x1F (US) are perfectly legal ASCII chars (though non-printable), and they are encoded using their literal value.
Characters \xd6 and \xd0 are illegal in ASCII: they need 8 bits, and ASCII is a 7-bit code. They are encoded using a 4-character escape sequence, as you asked: "\" (the backslash character) followed by "xd6" or "xd0".
I'm not good with DER, but I suppose that you expect to have raw 8-bit sequences. Here is how this could be accomplished:
result = bytearray.fromhex('380c2fd6172cd06d1f30')
with open("Output.der", "wb") as text_file:
    text_file.write(result)
Please note the "wb" mode passed to open: it tells Python to do binary I/O.
I also used a with statement to ensure that text_file is closed whatever happens during the write.
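To make the difference concrete, here is a small sketch comparing the two approaches on the same hex string. The temporary directory is an assumption made only so the example can run anywhere; it is not part of the original question.

```python
import os
import tempfile

raw = bytes.fromhex("380c2fd6172cd06d1f30")

# Text route: each non-ASCII byte becomes a 4-character escape like \xd6,
# so the result no longer has one character per original byte.
as_text = raw.decode("ascii", "backslashreplace")

# Binary route: the bytes are written and read back unchanged.
out_path = os.path.join(tempfile.mkdtemp(), "Output.der")
with open(out_path, "wb") as der_file:
    der_file.write(raw)
with open(out_path, "rb") as der_file:
    round_tripped = der_file.read()

print(len(raw), len(as_text), round_tripped == raw)
```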
I have a strange text file in which I am required to replace any social security number with XXX-XX-XXXX. Great! Simply suck the file in, regex that junk out, and write the file out. Loving life, this will be easy as pie. My acceptance criterion is that I can only change the SSNs; the rest of the file must stay exactly the same, since it has fixed-width columns, and even strange characters must be kept for debugging other processes. OK, cool, I got this.
I read the file in:
filehandle = open("text.txt", "r", encoding="UTF-8")
And it gives me some encoding errors like this:
'utf-8' codec can't decode byte 0xd1 in position 6919: invalid continuation byte
I can't figure out the encoding. I've tried chardet and it thinks it's ASCII but I just get a different encoding error. I just need a way to suck this file in, do a simple regex and put it back out. I can put in:
errors="ignore"
And it won't crash but ends up stripping out some of the strange characters, which then throws the spacing of the columns off. Here is an example of one of the characters I'm talking about with its hex (need to use images since I can't copy/paste it here):
The 4E is the 'N' in CHILDREN
The EF BF BD make up the .. stuff
The 53 is the S in CHILDREN
I'm sure this is part of the problem. So, what should I do to simply:
Take the file in, use a regex to simply change \d{3}-\d{2}-\d{4} to XXX-XX-XXXX where the file has some weird characters in it without changing anything else in the file? Thank you all!
You should open your file in binary mode so that Python never attempts to decode the bytes as UTF-8.
Then use a bytes regular expression to find the social security numbers and replace each match with the corresponding bytes.
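A minimal sketch of that approach, using the pattern from the question. The function name mask_ssns and the sample bytes are hypothetical. Because the replacement has the same length as every match, the fixed-width columns should be preserved.

```python
import re

# A bytes pattern: \d only matches the ASCII digits 0-9 here.
SSN_PATTERN = re.compile(rb"\d{3}-\d{2}-\d{4}")


def mask_ssns(data):
    # Hypothetical helper: replace every SSN-shaped run of bytes and
    # leave all other bytes, valid text or not, exactly as they were.
    return SSN_PATTERN.sub(b"XXX-XX-XXXX", data)


sample = b"CHILDREN\xd1 123-45-6789 \xef\xbf\xbd 987-65-4321\r\n"
masked = mask_ssns(sample)
print(masked)
```

For a real file, the bytes would come from open(path, "rb").read() and be written back with open(path, "wb"), so no decoding step is involved at any point.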
I have some CSV data containing users' tweets.
In excel it is displayed like this:
‰ÛÏIt felt like they were my friends and I was living the story with them‰Û #retired #IAN1
I had imported this csv file into python and in python the same tweet appears like this (I am using putty to connect to a server and I copied this from putty's screen)
▒▒▒It felt like they were my friends and I was living the story with them▒ #retired #IAN1
I am wondering how to display these emoji characters properly. I am trying to separate all the words in this tweet but I am not sure how I can separate those emoji unicode characters.
In fact, you certainly have a loss of data…
I don’t know how you got your CSV file from the users' tweets (you may want to explain that). But generally, CSV files are encoded in "cp1252" (or "windows-1252"), sometimes in "iso-8859-1". Nowadays, we can also find CSV files encoded in "utf-8".
If your tweets are encoded in "cp1252" or any other 8-bit single-byte character set, the emojis are lost (replaced by "?") or badly converted.
Then, if you open your CSV file in Excel, it will use its default encoding ("cp1252") and load the file with corrupted characters. You can try LibreOffice: it has a dialog box that lets you choose the encoding more easily.
Copying and pasting from PuTTY will also convert your characters depending on your console encoding, which is even worse!
If your CSV file uses the "utf-8" encoding (or "utf-16", "utf-32"), you have a better chance of preserving the emojis. But there is still a problem: most emojis have a code point greater than U+FFFF (65535 in decimal). For instance, Grinning Face "😀" has the code point U+1F600.
Characters of this kind are badly handled in Python 2; try this:
# coding: utf8
from __future__ import unicode_literals
emoji = u"😀"
print(u"emoji: " + emoji)
print(u"repr: " + repr(emoji))
print(u"len: {}".format(len(emoji)))
You’ll get (if your console allows it):
emoji: 😀
repr: u'\U0001f600'
len: 2
The first line won’t print if your console doesn’t allow Unicode,
The \U escape sequence is similar to \u, but expects 8 hex digits, not 4.
Yes, this character has a length of 2 (on a narrow Python 2 build, where it is stored as a surrogate pair)!
EDIT: With Python 3, you get:
emoji: 😀
repr: '😀'
len: 1
No escape sequence for repr(),
the length is 1!
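The same point can be checked in Python 3 with a short sketch. The variable names are illustrative only. It shows that the emoji is one code point above U+FFFF, which takes two 16-bit units in UTF-16 and four bytes in UTF-8.

```python
emoji = "\U0001F600"  # Grinning Face

code_point = ord(emoji)
# UTF-16 needs a surrogate pair (two 16-bit units) for this code point.
utf16_units = len(emoji.encode("utf-16-le")) // 2
utf8_bytes = len(emoji.encode("utf-8"))

print(len(emoji), hex(code_point), utf16_units, utf8_bytes)
# prints: 1 0x1f600 2 4
```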
What you can do is post a fragment of your CSV file as an attachment, so that someone can analyse it…
See also Unicode Literals in Python Source Code in the Python 2.7 documentation.
First of all, you shouldn't work with text copied from a console (let alone from a remote connection) because of formatting differences and how unreliable clipboards are. I'd suggest exporting your CSV and reading it directly.
I'm not quite sure what you are trying to do, but Twitter emojis cannot be displayed in a console due to them being basically compressed images. Would you mind explaining your issue further?
I would personally treat the whole string as Unicode, separate each character into a list, then rebuild words based on spaces.
Why am I getting this issue, and how do I resolve it?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 24: unexpected code byte
Thank you
Somewhere, perhaps subtly, you are asking Python to turn a stream of bytes into a "string" of characters.
Don't think of a string as "bytes". A string is a list of numbers, each number having an agreed meaning in Unicode (#65 = Latin Capital A; #19968 = Chinese character "One"/"First").
There are many methods of encoding a list of Unicode entities into a stream of bytes. Python is assuming your stream of bytes is the result of a particular such method, called "UTF-8".
However, your stream of bytes has data that does not correspond to that method. Thus the error is raised.
You need to figure out the encoding of the stream of bytes, and tell Python that encoding.
It's important to know if you're using Python 2 or 3, and the code leading up to this exception to see where your bytes came from and what the appropriate way to deal with them is.
If it's from reading a file, you can explicitly deal with the bytes read. But you must be sure of the file encoding.
If it's from a string that is part of your source code, then Python is assuming the "wrong thing" about your source files... perhaps $LC_ALL or $LANG needs to be set. This is a good time to firmly understand the concept of encoding, and how text editors choose an encoding to write, and what is standard for your language and operating system.
In addition to what Joe said, chardet is a useful tool to detect encoding of the source data.
Somewhere you have a plain string encoded as "Windows-1252" (or "cp1252") containing a "RIGHT SINGLE QUOTATION MARK" (’) instead of an APOSTROPHE ('). This could come from a file you read, or even in a Python source file of yours; you could be running Python 2.x and have a # -*- coding: utf8 -*- line somewhere near the script's beginning, or you could be running Python 3.x.
You don't give enough data; however, somewhere you have a cp1252-encoded string, which you try (explicitly or implicitly) to decode to unicode as utf-8. This won't work.
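The situation described above can be reproduced with a short sketch. The sample bytes are made up for illustration; the only thing taken from the question is the offending byte 0x92.

```python
# Hypothetical input: "It’s a test" as written by a cp1252 editor.
data = b"It\x92s a test"

try:
    data.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    # 0x92 is a continuation byte in UTF-8 and cannot start a character.
    utf8_failed = True

# Decoding with the encoding the bytes were actually written in works.
text = data.decode("cp1252")
print(utf8_failed, text)
```

In cp1252, byte 0x92 maps to U+2019 RIGHT SINGLE QUOTATION MARK, which is why decoding with that codec succeeds where UTF-8 fails.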
Give us more info, and we'll try again to help you.
Joe Koberg's answer reminded me of an older answer of mine, which some people have found helpful: Python UnicodeDecodeError - Am I misunderstanding encode?