0D0D turns into 0D on binascii.hexlify(file.read()) - python

I'm trying to read a file's hex code using file.read() & binascii.hexlify(), and in place of 0D 0D in the original file python reads/prints only one 0D.
ex:
original file: 6D 6F 64 65 2E 0D 0D 0A 24 00 00 00 00 00 00 00
python: print(binascii.hexlify(f.read(16))) output: 6d6f64652e0d0a24000000000000001c
Any ideas as to why this is happening?

Related

python os.read() not reading correct number of bytes

I'm trying to read blocks from a binary file (oracle redo log) but I'm having a issue where, when I try to read a 512 byte block using os.read(fd,512) I am returned less than 512 bytes. (the amount differs depending on the block)
the documentation states that "at most n Bytes" so this makes sense that I'm getting less than expected. How can I force it to keep reading until I get the correct amount of bytes back?
I've attempted to adapt the method described here Python f.read not reading the correct number of bytes But I still have the problem
def read_exactly(fd, size):
data = b''
remaining = size
while remaining: # or simply "while remaining", if you'd like
newdata = read(fd, remaining)
if len(newdata) == 0: # problem
raise IOError("Failed to read enough data")
data += newdata
remaining -= len(newdata)
return data
def get_one_block(fd, start, blocksize):
lseek(fd, start, 0)
blocksize = blocksize
print('Blocksize: ' + str(blocksize))
block = read_exactly(fd, blocksize)
print('Actual Blocksize: ' + str(block.__sizeof__()))
return block
which then returns the error: OSError: Failed to read enough data
My code:
from os import open, close, O_RDONLY, lseek, read, write, O_BINARY, O_CREAT, O_RDWR
def get_one_block(fd, start, blocksize):
lseek(fd, start, 0)
blocksize = blocksize
print('Blocksize: ' + str(blocksize))
block = read(fd, blocksize)
print('Actual Blocksize: ' + str(block.__sizeof__()))
return block
def main():
filename = "redo_logs/redo03.log"
fd = open(filename, O_RDONLY, O_BINARY)
b = get_one_block(fd, 512, 512)
Output
Blocksize: 512
Actual Blocksize: 502
in this instance the last byte read is 0xB3 which is followed by 0x1A which i believe is the problem.
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EF 42 B8 5A DC D1 63 1B A3 31 C7 5E 9F 4A B7 F4
4E 04 6B E8 B3<<-- stops here -->>1A 4F 3C BF C9 3C F6 9F C3 08 02
05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Any help would be greatly appreciated :)
You need to read inside a while loop and check the true number of bytes you've got.
If you got less you read again with the left delta.
the while exits when you got what you expected or reached EOF.

When I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object

Summary: when I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object.
A minimal example to reproduce to the problem is in Github repo https://github.com/brunorijsman/reproduce-thrift-crash
Clone this repo on Ubuntu (tested on 16.04) and follow the instructions at the top of the file reproduce.sh
I have the following Thrift model file, which (as you can see) contains a map indexed by a struct:
struct Coordinate {
1: required i32 x;
2: required i32 y;
}
struct Terrain {
1: required map<Coordinate, i32> altitude_samples;
}
I use the following C++ code to create an object with 3 coordinates in the map (see the repo for complete code for all snippets below):
Terrain terrain;
add_sample_to_terrain(terrain, 10, 10, 100);
add_sample_to_terrain(terrain, 20, 20, 200);
add_sample_to_terrain(terrain, 30, 30, 300);
where:
void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t altitude)
{
Coordinate coordinate;
coordinate.x = x;
coordinate.y = y;
std::pair<Coordinate, int32_t> sample(coordinate, altitude);
terrain.altitude_samples.insert(sample);
}
I use the following C++ code to serialize an object to disk:
shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
terrain.write(protocol.get());
Important note: for this to work correctly, I had to implement the function Coordinate::operator<. Thrift does generate the declaration for the Coordinate::operator< but does not generate the implementation of Coordinate::operator<. The reason for this is that Thrift does not understand the semantics of the struct and hence cannot guess the correct implementation of the comparison operator. This is discussed at http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%3C4C4E08BD.8030407#facebook.com%3E
// Thrift generates the declaration but not the implementation of operator< because it has no way
// of knowning what the criteria for the comparison are. So, provide the implementation here.
bool Coordinate::operator<(const Coordinate& other) const
{
if (x < other.x) {
return true;
} else if (x > other.x) {
return false;
} else if (y < other.y) {
return true;
} else {
return false;
}
}
Then, finally, I use the following Python code to de-serialize the same object from disk:
file = open("terrain.dat", "rb")
transport = thrift.transport.TTransport.TFileObjectTransport(file)
protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
terrain = Terrain()
terrain.read(protocol)
print(terrain)
This Python program outputs:
Terrain(altitude_samples=None)
In other words, the de-serialized Terrain contains no terrain_samples, instead of the expected dictionary with 3 coordinates.
I am 100% sure that the file terrain.dat contains valid data: I also de-serialized the same data using C++ and in that case, I do get the expected results (see repo for details)
I suspect that this has something to do with the comparison operator.
I gut feeling is that I should have done something similar in Python with respect to the comparison operator as I did in C++. But I don't know what that missing something would be.
Additional information added on 19-Sep-2018:
Here is a hexdump of the encoding produced by the C++ encoding program:
Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....
The first 4 bytes are 01 00 00 00
Using a debugger to step through the Python decoding function reveals that:
This is being decoded as a struct (which is expected)
The first byte 01 is interpreted as the field type. 01 means field type VOID.
The next two bytes are interpreted as the field id. 00 00 means field ID 0.
For field type VOID, nothing else is read and we continue to the next field.
The next byte is interpreted as the field type. 00 means STOP.
We top reading data for the struct.
The final result is an empty struct.
All off the above is consistent with the information at https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md which describes the Thrift binary encoding format
My conclusion thus far is that the C++ encoder appears to produce an "incorrect" binary encoding (I put incorrect in quotes because certainly something as blatant as that would have been discovered by lots of other people, so I am sure that I am still missing something).
Additional information added on 19-Sep-2018:
It appears that the C++ implementation of TFileTransport has the concept of "events" when writing to disk.
The output which is written to disk is divided into a sequence of "events" where each "event" is preceded by a 4-byte length field of the event, followed by the contents of the event.
Looking at the hexdump above, the first couple of events are:
0100 0000 0d : Event length 1, event value 0d
02 0000 0000 01 : Event length 2, event value 00 01
Etc.
The Python implementation of TFileTransport does not understand this concept of events when parsing the file.
It appears that the problem is one of the following two:
1) Either the C++ code should not be inserting these event lengths into the encoded file,
2) Or the Python code should understand these event lengths when decoding the file.
Note that all these event lengths make the C++ encode file much larger than the Python encoded file.
Sadly C++ TFileTransport is not totally portable and will not work with Python's TFileObjectTransport. If you switch to C++ TSimpleFileTransport it will work as expected, with Python TFileObjectTransport and with Java TSimpleFileTransport.
Take a look at the examples here:
https://github.com/RandyAbernethy/ThriftBook/tree/master/part2/types/complex
They do pretty much exactly what you are attempting in Java and Python and you can find examples with C++, Java and Python here (though they add a zip compression layer):
https://github.com/RandyAbernethy/ThriftBook/tree/master/part2/types/zip
Another caution however would be against the use of complex key types. Complex key types require (as you discovered) comparators but will flat out not work with some languages. I might suggest, for example:
map<x,map<y,alt>>
giving the same utility but eliminating a whole class of possible problems (and no need for comparators).

Chrome corrupting binary file download by converting to UTF-8

I've currently been assigned to maintain an application written with Flask. Currently I'm trying to add a feature that allows users to download a pre-generated excel file, however, whenever I try to send it, my browser appears to re-encode the file in UTF-8 which causes multibyte characters to be added, which corrupts the file.
File downloaded with wget:
(venv) luke#ubuntu:~$ hexdump -C wget.xlsx | head -n 2
00000000 50 4b 03 04 14 00 00 00 08 00 06 06 fb 4a 1f 23 |PK...........J.#|
00000010 cf 03 c0 00 00 00 13 02 00 00 0b 00 00 00 5f 72 |.............._r|
The file downloaded with Chrome (notice the EF BF BD sequences?)
(venv) luke#ubuntu:~$ hexdump -C chrome.xlsx | head -n 2
00000000 50 4b 03 04 14 00 00 00 08 00 ef bf bd 03 ef bf |PK..............|
00000010 bd 4a 1f 23 ef bf bd 03 ef bf bd 00 00 00 13 02 |.J.#............|
Does anyone know how I could fix this? This is the code I'm using:
data = b'PK\x03\x04\x14\x00\x00\x00\x08\x00}\x0c\xfbJ\x1f#\xcf\x03\xc0\x00\x00\x00\x13\x02\x00\x00\x0b\x00\x00\x00'
send_file(BytesIO(data), attachment_filename="x.xlsx", as_attachment=True)
Related issue: Encoding problems trying to send a binary file using flask_restful
Chrome was expecting to receive utf-8 encoded text, and found some bytes that couldn't be interpreted as valid utf-8 encoding of a char - which is normal, because your file is binary. So it replaced these invalid bytes with EF BF BD, the utf-8 encoding of the Unicode replacement character. The content-type header you send is probably text/..... Maybe try something like Content-Type:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Python write to text file only if ASCII value

I am trying to write a program which will allow me to compare SQL files to each other and have started by writing the full SQL file to to a text file. The text file generates successfully, but with blocks at the end as in the below example:
SET ANSI_NULLS ON਍ഀ
GO਍ഀ
SET QUOTED_IDENTIFIER ON਍ഀ
GO਍ഀ
CREATE TABLE [dbo].[CDR](਍ഀ
Below this is the code that generates the text file
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from _ast import Num
#imports packages
r= open('master_lines.txt', 'w')
directory= "E:\\" #file directory, anonymous omission
master= directory + "master"
databases= ["\\1", "\\2", "\\3", "\\4"]
file_types= ["\\StoredProcedure", "\\Table", "\\UserDefinedFunction", "\\View"]
servers= []
server_number= []
master_lines= []
for file in os.listdir("E:\\"): #adds server paths to an array
servers.append(file)
for num in range(0, len(servers)):
for file in os.listdir(directory + servers[num]): #adds all the servers and paths to an array
server_number.append(servers[num] + "\\" + file)
master= directory + server_number[server_number.index("master")]
master_var= master + databases[0]
tmp= master_var + file_types[1]
for file in os.listdir(tmp):
with open(file) as tmp_file:
line= tmp_file.readlines()
for num in range(0, len(line)):
r.write(line[num])
r.close
I have already tried changing the encoding to both latin1 and utf-8; the current text file is the most successful as ascii and latin1 produced chinese and arabic characters respectively.
Below is the SQL file in text format:
/****** Object: Table [dbo].[CDR] Script Date: 2017-01-12 02:30:49 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[CDR](
[calldate] [datetime] NOT NULL,
[clid] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[src] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[dst] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[dcontext] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[channel] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[dstchannel] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[lastapp] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[lastdata] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[duration] [int] NOT NULL,
[billsec] [int] NOT NULL,
[disposition] [varchar](45) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[amaflags] [int] NOT NULL,
[accountcode] [varchar](20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[userfield] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[uniqueid] [varchar](64) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[cdr_id] [int] NOT NULL,
[cost] [real] NOT NULL,
[cdr_tag] [varchar](10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[importID] [bigint] IDENTITY(-9223372036854775807,1) NOT NULL,
CONSTRAINT [PK_CDR_1] PRIMARY KEY CLUSTERED
(
[uniqueid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [ReadPartition]
) ON [ReadPartition]
GO
SET ANSI_PADDING ON
GO
/****** Object: Index [Idx_Dst_incl_uniqueId] Script Date: 2017-01-12 02:30:50 PM ******/
CREATE NONCLUSTERED INDEX [Idx_Dst_incl_uniqueId] ON [dbo].[CDR]
(
[dst] ASC
)
INCLUDE ( [uniqueid]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [ReadPartition]
GO
Hex dump to understand what happens, not part of above question:
ff fe 2f 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00
20 00 4f 00 62 00 6a 00 65 00 63 00 74 00 3a 00
20 00 20 00 54 00 61 00 62 00 6c 00 65 00 20 00
5b 00 64 00 62 00 6f 00 5d 00 2e 00 5b 00 43 00
44 00 52 00 5d 00 20 00 20 00 20 00 20 00 53 00
63 00 72 00 69 00 70 00 74 00 20 00 44 00 61 00
74 00 65 00 3a 00 20 00 32 00 30 00 31 00 37 00
2d 00 30 00 31 00 2d 00 31 00 32 00 20 00 30 00
32 00 3a 00 33 00 30 00 3a 00 34 00 39 00 20 00
50 00 4d 00 20 00 2a 00 2a 00 2a 00 2a 00 2a 00
2a 00 2f 00 0d 00 0a 00 53 00 45 00 54 00 20 00
41 00 4e 00 53 00 49 00 5f 00 4e 00 55 00 4c 00
4c 00 53 00 20 00 4f 00 4e 00 0d 00 0a 00 47 00
4f 00 0d 00 0a 00 53 00 45 00 54 00 20 00 51 00
55 00 4f 00 54 00 45 00 44 00 5f 00 49 00 44 00
Result of hexdump:
../.*.*.*.*.*.*.
.O.b.j.e.c.t.:.
. .T.a.b.l.e. .
[.d.b.o.]...[.C.
D.R.]. . . . .S.
c.r.i.p.t. .D.a.
t.e.:. .2.0.1.7.
-.0.1.-.1.2. .0.
2.:.3.0.:.4.9. .
P.M. .*.*.*.*.*.
*./.....S.E.T. .
A.N.S.I._.N.U.L.
L.S. .O.N.....G.
O.....S.E.T. .Q.
U.O.T.E.D._.I.D.
Your problem is that the original files are encoded in UTF-16 with an initial Byte Order Mark. It is normally transparent on Windows because almost all file editors automatically read it thanks to the initial BOM.
But the conversion is not automatic for Python scripts! That means that every character is read as the character itself followed by a null. It is almost transparent except for end of lines, because the nulls are simply written back again to form normal UTF16 characters. But the \n is no longer preceded by a raw \r but with a null, as as you write in text mode, Python replaces it with a pair \r\n which is no longer a valid UTF16 character and this causes the bloc display.
This is trivial to fix, just declare the UTF16 encoding when reading files:
for file in os.listdir(tmp):
with open(file, encoding='utf_16_le') as tmp_file:
Optionally, if you want to preserve the UTF16 encoding, you could also open the master file with it. By default, Python will encode it as utf8. But my advice would be to revert to 8bit encoding files to avoid further problem if you later wanted to process the output file.

numpy.genfromtxt csv file with null characters

I'm working on a scientific graphing script, designed to create graphs from csv files output by Agilent's Chemstation software.
I got the script working perfectly when the files come from one version of Chemstation (The version for liquid chromatography).
Now i'm trying to port it to work on our GC (Gas Chromatography). For some reason, this version of chemstation inserts nulls in between each character in any text file it outputs.
I'm trying to use numpy.genfromtxt to get the x,y data into python in order to create the graphs (using matplotlib).
I originally used:
data = genfromtxt(directory+signal, delimiter = ',')
to load the data in. When I do this with a csv file generated by our GC, I get an array of all 'nan' values. If I set the dtype to none, I get 'byte strings' that look like this:
b'\x00 \x008\x008\x005\x00.\x002\x005\x002\x001\x007\x001\x00\r'
What I need is a float, for the above string it would be 885.252171.
Anyone have any idea how I can get where I need to go?
And just to be clear, I couldn't find any setting on Chemstation that would affect it's output to just not create files with nulls.
Thanks
Jeff
Given that your file is encoded as utf-16-le with a BOM, and all the actual unicode codepoints (except the BOM) are less than 128, you should be able to use an instance of codecs.EncodedFile to transcode the file from utf-16 to ascii. The following example works for me.
Here's my test file:
$ cat utf_16_le_with_bom.csv
??2.0,19
1.5,17
2.5,23
1.0,10
3.0,5
The first two bytes, ff and fe are the BOM U+FEFF:
$ hexdump utf_16_le_with_bom.csv
0000000 ff fe 32 00 2e 00 30 00 2c 00 31 00 39 00 0a 00
0000010 31 00 2e 00 35 00 2c 00 31 00 37 00 0a 00 32 00
0000020 2e 00 35 00 2c 00 32 00 33 00 0a 00 31 00 2e 00
0000030 30 00 2c 00 31 00 30 00 0a 00 33 00 2e 00 30 00
0000040 2c 00 35 00 0a 00
0000046
Here's the python script genfromtxt_utf16.py (updated for Python 3):
import codecs
import numpy as np
fh = open('utf_16_le_with_bom.csv', 'rb')
efh = codecs.EncodedFile(fh, data_encoding='ascii', file_encoding='utf-16')
a = np.genfromtxt(efh, delimiter=',')
fh.close()
print("a:")
print(a)
With python 3.4.1 and numpy 1.8.1, the script works:
$ python3.4 genfromtxt_utf16.py
a:
[[ 2. 19. ]
[ 1.5 17. ]
[ 2.5 23. ]
[ 1. 10. ]
[ 3. 5. ]]
Be sure that you don't specify the encoding as file_encoding='utf-16-le'. If the endian suffix is included, the BOM is not stripped, and it can't be transcoded to ascii.

Categories