numpy.genfromtxt csv file with null characters

I'm working on a scientific graphing script that creates graphs from CSV files output by Agilent's ChemStation software.
The script works perfectly when the files come from one version of ChemStation (the version for liquid chromatography).
Now I'm trying to port it to work on our GC (gas chromatography). For some reason, this version of ChemStation inserts nulls between each character in any text file it outputs.
I'm trying to use numpy.genfromtxt to get the x,y data into Python in order to create the graphs (using matplotlib).
I originally used:
data = genfromtxt(directory+signal, delimiter = ',')
to load the data. When I do this with a CSV file generated by our GC, I get an array of all 'nan' values. If I set dtype to None, I get byte strings that look like this:
b'\x00 \x008\x008\x005\x00.\x002\x005\x002\x001\x007\x001\x00\r'
What I need is a float; for the above string it would be 885.252171.
Does anyone have any idea how I can get there?
And just to be clear, I couldn't find any setting in ChemStation that would make it stop writing files with nulls.
Thanks
Jeff

Given that your file is encoded as utf-16-le with a BOM, and all the actual unicode codepoints (except the BOM) are less than 128, you should be able to use an instance of codecs.EncodedFile to transcode the file from utf-16 to ascii. The following example works for me.
Here's my test file:
$ cat utf_16_le_with_bom.csv
??2.0,19
1.5,17
2.5,23
1.0,10
3.0,5
The first two bytes, ff and fe are the BOM U+FEFF:
$ hexdump utf_16_le_with_bom.csv
0000000 ff fe 32 00 2e 00 30 00 2c 00 31 00 39 00 0a 00
0000010 31 00 2e 00 35 00 2c 00 31 00 37 00 0a 00 32 00
0000020 2e 00 35 00 2c 00 32 00 33 00 0a 00 31 00 2e 00
0000030 30 00 2c 00 31 00 30 00 0a 00 33 00 2e 00 30 00
0000040 2c 00 35 00 0a 00
0000046
Here's the python script genfromtxt_utf16.py (updated for Python 3):
import codecs
import numpy as np
fh = open('utf_16_le_with_bom.csv', 'rb')
efh = codecs.EncodedFile(fh, data_encoding='ascii', file_encoding='utf-16')
a = np.genfromtxt(efh, delimiter=',')
fh.close()
print("a:")
print(a)
With python 3.4.1 and numpy 1.8.1, the script works:
$ python3.4 genfromtxt_utf16.py
a:
[[ 2. 19. ]
[ 1.5 17. ]
[ 2.5 23. ]
[ 1. 10. ]
[ 3. 5. ]]
Be sure that you don't specify the encoding as file_encoding='utf-16-le'. If the endian suffix is included, the BOM is not stripped, and it can't be transcoded to ascii.
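The BOM behaviour described above can be checked with the standard library alone. The following is a minimal sketch, assuming a small in-memory sample that mimics the test file (the variable names are illustrative, not part of the original script). It shows that decoding with 'utf-16' consumes the BOM, while 'utf-16-le' leaves U+FEFF in the text, which then cannot be encoded as ASCII:

```python
import codecs

# Bytes of a tiny UTF-16-LE file with a BOM, similar to the test file above.
raw = codecs.BOM_UTF16_LE + "2.0,19\n1.5,17\n".encode("utf-16-le")

# 'utf-16' reads the BOM to pick the byte order and then drops it.
bom_stripped = raw.decode("utf-16")
# 'utf-16-le' fixes the byte order up front, so the BOM stays as U+FEFF.
bom_kept = raw.decode("utf-16-le")

# Only the BOM-free text can be transcoded to ASCII.
try:
    bom_kept.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The BOM-free text parses as plain CSV numbers.
rows = [[float(v) for v in line.split(",")] for line in bom_stripped.splitlines()]
print(rows)
```

The same idea applies to a file on disk: opening it in text mode with encoding='utf-16' yields BOM-free text that a CSV parser can consume.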

0D0D turns into 0D on binascii.hexlify(file.read())

I'm trying to read a file's bytes as hex using file.read() and binascii.hexlify(), but where the original file contains 0D 0D, Python reads/prints only one 0D.
Example:
original file: 6D 6F 64 65 2E 0D 0D 0A 24 00 00 00 00 00 00 00
python: print(binascii.hexlify(f.read(16))) output: 6d6f64652e0d0a24000000000000001c
Any ideas as to why this is happening?

When I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object

Summary: when I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object.
A minimal example to reproduce to the problem is in Github repo https://github.com/brunorijsman/reproduce-thrift-crash
Clone this repo on Ubuntu (tested on 16.04) and follow the instructions at the top of the file reproduce.sh
I have the following Thrift model file, which (as you can see) contains a map indexed by a struct:
struct Coordinate {
    1: required i32 x;
    2: required i32 y;
}
struct Terrain {
    1: required map<Coordinate, i32> altitude_samples;
}
I use the following C++ code to create an object with 3 coordinates in the map (see the repo for complete code for all snippets below):
Terrain terrain;
add_sample_to_terrain(terrain, 10, 10, 100);
add_sample_to_terrain(terrain, 20, 20, 200);
add_sample_to_terrain(terrain, 30, 30, 300);
where:
void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t altitude)
{
    Coordinate coordinate;
    coordinate.x = x;
    coordinate.y = y;
    std::pair<Coordinate, int32_t> sample(coordinate, altitude);
    terrain.altitude_samples.insert(sample);
}
I use the following C++ code to serialize an object to disk:
shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
terrain.write(protocol.get());
Important note: for this to work correctly, I had to implement the function Coordinate::operator<. Thrift generates the declaration of Coordinate::operator< but not its implementation. The reason is that Thrift does not understand the semantics of the struct and hence cannot guess the correct implementation of the comparison operator. This is discussed at http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%3C4C4E08BD.8030407@facebook.com%3E
// Thrift generates the declaration but not the implementation of operator< because it has no way
// of knowing what the criteria for the comparison are. So, provide the implementation here.
bool Coordinate::operator<(const Coordinate& other) const
{
    if (x < other.x) {
        return true;
    } else if (x > other.x) {
        return false;
    } else if (y < other.y) {
        return true;
    } else {
        return false;
    }
}
Then, finally, I use the following Python code to de-serialize the same object from disk:
file = open("terrain.dat", "rb")
transport = thrift.transport.TTransport.TFileObjectTransport(file)
protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
terrain = Terrain()
terrain.read(protocol)
print(terrain)
This Python program outputs:
Terrain(altitude_samples=None)
In other words, the de-serialized Terrain contains no altitude_samples, instead of the expected dictionary with 3 coordinates.
I am 100% sure that the file terrain.dat contains valid data: I also de-serialized the same data using C++ and in that case, I do get the expected results (see repo for details)
I suspect that this has something to do with the comparison operator.
My gut feeling is that I should have done something similar in Python with respect to the comparison operator as I did in C++. But I don't know what that missing something would be.
Additional information added on 19-Sep-2018:
Here is a hexdump of the encoding produced by the C++ encoding program:
Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....
The first 4 bytes are 01 00 00 00
Using a debugger to step through the Python decoding function reveals that:
This is being decoded as a struct (which is expected)
The first byte 01 is interpreted as the field type. 01 means field type VOID.
The next two bytes are interpreted as the field id. 00 00 means field ID 0.
For field type VOID, nothing else is read and we continue to the next field.
The next byte is interpreted as the field type. 00 means STOP.
We stop reading data for the struct.
The final result is an empty struct.
All of the above is consistent with the information at https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md which describes the Thrift binary encoding format.
My conclusion thus far is that the C++ encoder appears to produce an "incorrect" binary encoding (I put incorrect in quotes because certainly something as blatant as that would have been discovered by lots of other people, so I am sure that I am still missing something).
Additional information added on 19-Sep-2018:
It appears that the C++ implementation of TFileTransport has the concept of "events" when writing to disk.
The output which is written to disk is divided into a sequence of "events" where each "event" is preceded by a 4-byte length field of the event, followed by the contents of the event.
Looking at the hexdump above, the first couple of events are:
0100 0000 0d : Event length 1, event value 0d
02 0000 0000 01 : Event length 2, event value 00 01
Etc.
The Python implementation of TFileTransport does not understand this concept of events when parsing the file.
It appears that the problem is one of the following two:
1) Either the C++ code should not be inserting these event lengths into the encoded file,
2) Or the Python code should understand these event lengths when decoding the file.
Note that all these event lengths make the C++-encoded file much larger than the Python-encoded file.
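To illustrate the event framing observed in the hexdump, here is a minimal Python sketch that strips the 4-byte length prefixes and concatenates the event payloads. It assumes only what the hexdump shows (a little-endian 4-byte length followed by that many bytes); the function name strip_event_framing is hypothetical, and the real TFileTransport format may include details this sketch does not handle:

```python
import struct

def strip_event_framing(data):
    """Concatenate the payloads of length-prefixed events (assumed layout)."""
    payload = bytearray()
    offset = 0
    while offset + 4 <= len(data):
        # Each event starts with a 4-byte little-endian length field.
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        payload += data[offset:offset + length]
        offset += length
    return bytes(payload)

# The first five events from the hexdump above (29 bytes).
framed = bytes.fromhex(
    "01000000 0d"        # event length 1, value 0d
    "02000000 0001"      # event length 2, value 00 01
    "01000000 0c"        # event length 1, value 0c
    "01000000 08"        # event length 1, value 08
    "04000000 00000003"  # event length 4, value 00 00 00 03
)
unframed = strip_event_framing(framed)
print(unframed.hex())
```

With the framing removed, the bytes read as a plausible start of a binary-protocol field header: type 0d, field id 00 01, then 0c, 08 and a 4-byte count of 3, matching the three samples written by the C++ program.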

Sadly C++ TFileTransport is not totally portable and will not work with Python's TFileObjectTransport. If you switch to C++ TSimpleFileTransport it will work as expected, with Python TFileObjectTransport and with Java TSimpleFileTransport.
Take a look at the examples here:
https://github.com/RandyAbernethy/ThriftBook/tree/master/part2/types/complex
They do pretty much exactly what you are attempting in Java and Python and you can find examples with C++, Java and Python here (though they add a zip compression layer):
https://github.com/RandyAbernethy/ThriftBook/tree/master/part2/types/zip
Another caution however would be against the use of complex key types. Complex key types require (as you discovered) comparators but will flat out not work with some languages. I might suggest, for example:
map<x,map<y,alt>>
giving the same utility but eliminating a whole class of possible problems (and no need for comparators).
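In Python terms, the nested-map shape looks like the sketch below. It is illustrative only: the helper name to_nested is hypothetical, and the sample values are the three from the question. A dictionary keyed by (x, y) pairs is regrouped into a map from x to a map from y to altitude:

```python
def to_nested(samples):
    """Regroup {(x, y): altitude} into {x: {y: altitude}}."""
    nested = {}
    for (x, y), altitude in samples.items():
        nested.setdefault(x, {})[y] = altitude
    return nested

flat = {(10, 10): 100, (20, 20): 200, (30, 30): 300}
nested = to_nested(flat)
print(nested[20][20])
```

Every key in the nested form is a plain integer, so no struct comparison or hashing is needed in any language.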

ADODB unable to store DATETIME value with sub-second precision

According to the Microsoft documentation for the DATETIME column type, values of that type can store "accuracy rounded to increments of .000, .003, or .007 seconds." According to their documentation for the data types used by ADODB, the adDBTimeStamp (code 135), which ADODB uses for DATETIME column parameters, "indicates a date/time stamp (yyyymmddhhmmss plus a fraction in billionths)." However, all attempts (tested using multiple versions of SQL Server, and both the SQLOLEDB provider and the newer SQLNCLI11 provider) fail when a parameter is passed with sub-second precision. Here's a repro case demonstrating the failure:
import win32com.client
# Connect to the database
conn_string = "Provider=...." # sensitive information redacted
conn = win32com.client.Dispatch("ADODB.Connection")
conn.Open(conn_string)
# Create the temporary test table
cmd = win32com.client.Dispatch("ADODB.Command")
cmd.ActiveConnection = conn
cmd.CommandText = "CREATE TABLE #t (dt DATETIME NOT NULL)"
cmd.CommandType = 1 # adCmdText
cmd.Execute()
# Insert a row into the table (with whole second precision)
cmd = win32com.client.Dispatch("ADODB.Command")
cmd.ActiveConnection = conn
cmd.CommandText = "INSERT INTO #t VALUES (?)"
cmd.CommandType = 1 # adCmdText
params = cmd.Parameters
param = params.Item(0)
print("param type is {:d}".format(param.Type)) # 135 (adDBTimeStamp)
param.Value = "2018-01-01 12:34:56"
cmd.Execute() # this invocation succeeds
# Show the result
cmd = win32com.client.Dispatch("ADODB.Command")
cmd.ActiveConnection = conn
cmd.CommandText = "SELECT * FROM #t"
cmd.CommandType = 1 # adCmdText
rs, rowcount = cmd.Execute()
data = rs.GetRows(1)
print(data[0][0]) # displays the datetime value stored above
# Insert a second row into the table (with sub-second precision)
cmd = win32com.client.Dispatch("ADODB.Command")
cmd.ActiveConnection = conn
cmd.CommandText = "INSERT INTO #t VALUES (?)"
cmd.CommandType = 1 # adCmdText
params = cmd.Parameters
param = params.Item(0)
print("param type is {:d}".format(param.Type)) # 135 (adDBTimeStamp)
param.Value = "2018-01-01 12:34:56.003" # <- blows up here
cmd.Execute()
# Show the result
cmd = win32com.client.Dispatch("ADODB.Command")
cmd.ActiveConnection = conn
cmd.CommandText = "SELECT * FROM #t"
cmd.CommandType = 1 # adCmdText
rs, rowcount = cmd.Execute()
data = rs.GetRows(2)
print(data[0][1])
This code throws an exception on the line indicated above, with the error message "Application uses a value of the wrong type for the current operation." Is this a known bug in ADODB? If so, I haven't found any discussion of it. (Perhaps there was discussion earlier which disappeared when Microsoft killed the KB pages.) How can the value be of the wrong type if it matches the documentation?

This is a well-known bug in the SQL Server OLEDB drivers going back more than 20 years, which means it is never going to be fixed.
It's also not a bug in ADO. The ActiveX Data Objects (ADO) API is a thin wrapper around the underlying OLEDB API. The bug is in Microsoft's SQL Server OLEDB driver itself (all of them). And they will never fix it now, because they are unwilling to touch existing code in case it breaks existing applications.
So the bug has been carried forward for decades:
SQLOLEDB (1999) → SQLNCLI (2005) → SQLNCLI10 (2008) → SQLNCLI11 (2010) → MSOLEDB (2012)
The only solution is, rather than parameterizing your datetime as a timestamp:
adDBTimeStamp (aka DBTYPE_DBTIMESTAMP, 135)
to parameterize it as an "ODBC 24-hour format" yyyy-mm-dd hh:mm:ss.zzz string:
adChar (aka DBTYPE_STR, 129): 2021-03-21 17:51:22.619
or even with the ADO-specific string type:
adVarChar (200): 2021-03-21 17:51:22.619
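On the Python side, producing that string is straightforward. The sketch below assumes the value starts out as a datetime object; the helper name to_odbc_string is hypothetical. It only builds the text and leaves the ADO parameter setup to the calling code:

```python
from datetime import datetime

def to_odbc_string(dt):
    """Format as 'yyyy-mm-dd hh:mm:ss.zzz' (milliseconds are truncated, not rounded)."""
    return dt.isoformat(sep=" ", timespec="milliseconds")

value = to_odbc_string(datetime(2021, 3, 21, 17, 51, 22, 619000))
print(value)
# The resulting string would be passed as a character parameter
# (adChar or adVarChar) instead of an adDBTimeStamp parameter.
```

Because the server parses the string itself, the fractional seconds never pass through the driver's timestamp conversion.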
What about other DBTYPE_xxx's?
You might think that the adDate (aka DBTYPE_DATE, 7) looks promising:
Indicates a date value (DBTYPE_DATE). A date is stored as a double, the whole part of which is the number of days since December 30, 1899, and the fractional part of which is the fraction of a day.
But unfortunately not, as it also parameterizes the value to the server without milliseconds:
exec sp_executesql N'SELECT @P1 AS Sample',N'@P1 datetime','2021-03-21 06:40:24'
You also cannot use adFileTime, which also looks promising:
Indicates a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601 (DBTYPE_FILETIME).
Meaning it could support a resolution of 0.0000001 seconds.
Unfortunately by the rules of VARIANTs, you are not allowed to store a FILETIME in a VARIANT. And since ADO uses variants for all values, it throws up when it encounters variant type 64 (VT_FILETIME).
Decoding TDS to confirm our suspicions
We can confirm that the SQL Server OLEDB driver is not supplying a datetime with the available precision by decoding the packet sent to the server.
We can issue the batch:
SELECT ? AS Sample
And specify parameter 1: adDBTimeStamp - 3/21/2021 6:40:23.693
Now we can capture that packet:
0000 03 01 00 7b 00 00 01 00 ff ff 0a 00 00 00 00 00 ...{............
0010 63 28 00 00 00 09 04 00 01 32 28 00 00 00 53 00 c(.......2(...S.
0020 45 00 4c 00 45 00 43 00 54 00 20 00 40 00 50 00 E.L.E.C.T. .@.P.
0030 31 00 20 00 41 00 53 00 20 00 53 00 61 00 6d 00 1. .A.S. .S.a.m.
0040 70 00 6c 00 65 00 00 00 63 18 00 00 00 09 04 00 p.l.e...c.......
0050 01 32 18 00 00 00 40 00 50 00 31 00 20 00 64 00 .2....@.P.1. .d.
0060 61 00 74 00 65 00 74 00 69 00 6d 00 65 00 00 00 a.t.e.t.i.m.e...
0070 6f 08 08 f2 ac 00 00 20 f9 6d 00 o...... .m.
And decode it:
03 ; Packet type. 0x03 = 3 ==> RPC
01 ; Status
00 7b ; Length. 0x007B ==> 123 bytes
00 00 ; SPID
01 ; Packet ID
00 ; Window
ff ff ; ProcName 0xFFFF => Stored procedure number. UInt16 number to follow
0a 00 ; PROCID 0x000A ==> stored procedure ID 10 (10=sp_executesql)
00 00 ; Option flags (16 bits)
00 00 63 28 00 00 00 09 ; blah blah blah
04 00 01 32 28 00 00 00 ;
53 00 45 00 4c 00 45 00 ; \
43 00 54 00 20 00 40 00 ; |
50 00 31 00 20 00 41 00 ; |- "SELECT @P1 AS Sample"
53 00 20 00 53 00 61 00 ; |
6d 00 70 00 6c 00 65 00 ; /
00 00 63 18 00 00 00 09 ; blah blah blah
04 00 01 32 18 00 00 00 ;
40 00 50 00 31 00 20 00 ; \
64 00 61 00 74 00 65 00 ; |- "@P1 datetime"
74 00 69 00 6d 00 65 00 ; /
00 00 6f 08 08 ; blah blah blah
f2 ac 00 00 ; 0x0000ACF2 = 44,274 ==> 1/1/1900 + 44,274 days = 3/21/2021
20 f9 6d 00 ; 0x006DF920 = 7,207,200 ==> 7,207,200 / 300 seconds after midnight = 24,024.000 seconds = 6h 40m 24.000s = 6:40:24.000 AM
The short version is that a datetime is specified on-the-wire as:
datetime is represented in the following sequence:
One 4-byte signed integer that represents the number of days since January 1, 1900. Negative numbers are allowed to represent dates since January 1, 1753.
One 4-byte unsigned integer that represents the number of one three-hundredths of a second (300 counts per second) elapsed since 12 AM that day.
Which means we can read the datetime supplied by the driver as:
Date portion: 0x0000acf2 = 44,274 = January 1, 1900 + 44,274 days = 3/21/2021
Time portion: 0x006df920 = 7,207,200 = 7,207,200 / 300 seconds = 6:40:24 AM
So the driver cut off the precision of our datetime:
Supplied date: 2021-03-21 06:40:23.693
Date in TDS: 2021-03-21 06:40:24
In other words:
OLE Automation uses Double to represent datetime.
The Double has a resolution to ~0.0000003 seconds.
The driver has the option to encode the time down to 1/300th of a second:
6:40:23.693 → 7,207,108 → 0x006DF8C4
But it chose not to. Bug: Driver.
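The wire format quoted above is easy to check in Python. This is a sketch under the stated layout (4-byte signed day count since January 1, 1900, then 4-byte unsigned count of 1/300-second ticks since midnight, both little-endian); the function names are hypothetical. It decodes the eight bytes from the captured packet and also encodes the timestamp that was originally supplied, to show that the format itself can carry the fraction:

```python
import struct
from datetime import datetime, timedelta

BASE_DATE = datetime(1900, 1, 1)
TICKS_PER_SECOND = 300

def decode_tds_datetime(raw):
    """Decode 8 bytes: days since 1900-01-01, then 1/300 s ticks since midnight."""
    days, ticks = struct.unpack("<iI", raw)
    return BASE_DATE + timedelta(days=days, seconds=ticks / TICKS_PER_SECOND)

def encode_tds_datetime(dt):
    """Encode a datetime, rounding the time of day to the nearest tick.

    Rollover to the next day at the very end of a day is not handled.
    """
    delta = dt - BASE_DATE
    micros = delta.seconds * 1_000_000 + delta.microseconds
    ticks = (micros * TICKS_PER_SECOND + 500_000) // 1_000_000
    return struct.pack("<iI", delta.days, ticks)

# The last eight bytes of the captured packet.
captured = bytes.fromhex("f2ac0000 20f96d00")
decoded = decode_tds_datetime(captured)
print(decoded)

# The value that was actually supplied as the parameter.
supplied = datetime(2021, 3, 21, 6, 40, 23, 693000)
full_precision = encode_tds_datetime(supplied)
print(full_precision.hex())
```

Comparing the two byte strings shows that only the tick count differs: the date part is identical, and the driver sent a whole-second tick value where a fractional one was representable.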
Resources to help decoding TDS
2.2.6.6 RPC Request
4.8 RPC Client Request (actual hex example)
2.2.5.5.1.8 Date/Times

Chrome corrupting binary file download by converting to UTF-8

I've been assigned to maintain an application written with Flask. I'm trying to add a feature that lets users download a pre-generated Excel file. However, whenever I send it, my browser appears to re-encode the file as UTF-8, which adds multibyte characters and corrupts the file.
File downloaded with wget:
(venv) luke@ubuntu:~$ hexdump -C wget.xlsx | head -n 2
00000000 50 4b 03 04 14 00 00 00 08 00 06 06 fb 4a 1f 23 |PK...........J.#|
00000010 cf 03 c0 00 00 00 13 02 00 00 0b 00 00 00 5f 72 |.............._r|
The file downloaded with Chrome (notice the EF BF BD sequences?)
(venv) luke@ubuntu:~$ hexdump -C chrome.xlsx | head -n 2
00000000 50 4b 03 04 14 00 00 00 08 00 ef bf bd 03 ef bf |PK..............|
00000010 bd 4a 1f 23 ef bf bd 03 ef bf bd 00 00 00 13 02 |.J.#............|
Does anyone know how I could fix this? This is the code I'm using:
data = b'PK\x03\x04\x14\x00\x00\x00\x08\x00}\x0c\xfbJ\x1f#\xcf\x03\xc0\x00\x00\x00\x13\x02\x00\x00\x0b\x00\x00\x00'
send_file(BytesIO(data), attachment_filename="x.xlsx", as_attachment=True)
Related issue: Encoding problems trying to send a binary file using flask_restful

Chrome was expecting UTF-8 encoded text and found bytes that couldn't be interpreted as a valid UTF-8 encoding of a character, which is normal because your file is binary. So it replaced these invalid bytes with EF BF BD, the UTF-8 encoding of the Unicode replacement character (U+FFFD). The Content-Type header you send is probably text/..... Try something like Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
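The corruption pattern can be reproduced in a few lines of Python. This sketch uses a short sample of bytes taken from the start of the question's data (not the full file): it decodes the binary data as UTF-8 with replacement of invalid bytes and encodes it again, which is roughly what happens when binary content is treated as UTF-8 text:

```python
# A few bytes from the beginning of the .xlsx (zip) data in the question.
original = b"PK\x03\x04\xfbJ\x1f#\xcf\x03\xc0\x00"

# Treat the bytes as UTF-8 text: each invalid byte becomes U+FFFD.
as_text = original.decode("utf-8", errors="replace")

# Writing that text back out as UTF-8 turns each U+FFFD into EF BF BD.
corrupted = as_text.encode("utf-8")

print(original.hex(" "))
print(corrupted.hex(" "))
```

The bytes fb, cf and c0 each come back as ef bf bd, the same pattern as in the Chrome download, and the original values cannot be recovered from the result.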

Reading fortran binary (streaming access) with np.fromfile or open & struct

The following Fortran code:
INTEGER*2 :: i, Array_A(32)
Array_A(:) = (/ (i, i=0, 31) /)
OPEN (unit=11, file = 'binary2.dat', form='unformatted', access='stream')
Do i=1,32
WRITE(11) Array_A(i)
End Do
CLOSE (11)
produces streaming binary output with the numbers 0 to 31 as 16-bit integers. Each record takes up 2 bytes, so they are written at bytes 1, 3, 5, 7 and so on. The access='stream' suppresses Fortran's standard record header (I need that to keep the files as small as possible).
Looking at it with a Hex-Editor, I get:
00 00 01 00 02 00 03 00 04 00 05 00 06 00 07 00
08 00 09 00 0A 00 0B 00 0C 00 0D 00 0E 00 0F 00
10 00 11 00 12 00 13 00 14 00 15 00 16 00 17 00
18 00 19 00 1A 00 1B 00 1C 00 1D 00 1E 00 1F 00
which is completely fine (even though the second byte is never used, because the values in my example are too small).
Now I need to import these binary files into Python 2.7, but I can't. I tried many different routines, but I always fail in doing so.
1. attempt: "np.fromfile"
with open("binary2.dat", 'r') as f:
    content = np.fromfile(f, dtype=np.int16)
returns
[ 0 1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22 23
24 25 0 0 26104 1242 0 0]
2. attempt: "struct"
import struct
with open("binary2.dat", 'r') as f:
    content = f.readlines()
struct.unpack('h' * 32, content)
delivers
struct.error: unpack requires a string argument of length 64
because
print content
['\x00\x00\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06\x00\x07\x00\x08\x00\t\x00\n', '\x00\x0b\x00\x0c\x00\r\x00\x0e\x00\x0f\x00\x10\x00\x11\x00\x12\x00\x13\x00\x14\x00\x15\x00\x16\x00\x17\x00\x18\x00\x19\x00']
(note the delimiter, and the \t and \n, which shouldn't be there according to what Fortran's "streaming" access does)
3. attempt: "FortranFile"
f = FortranFile("D:/Fortran/Sandbox/binary2.dat", 'r')
print(f.read_ints(dtype=np.int16))
With the error:
TypeError: only length-1 arrays can be converted to Python scalars
(remember how it detected a delimiter in the middle of the file, but it would also crash for shorter files without line break (e.g. decimals from 0 to 8))
Some additional thoughts:
Python seems to have troubles with reading parts of the binary file. For np.fromfile it reads Hex 19 (dec: 25), but crashes for Hex 1A (dec: 26). It seems to be confused with the letters, although 0A, 0B ... work just fine.
For attempt 2 the content-result is weird. Decimals 0 to 8 work fine, but then there is this strange \t\x00\n thing. What is it with hex 09 then?
I've been spending hours trying to find the logic, but I'm stuck and really need some help. Any ideas?

The problem is the file open mode. By default it is text mode. Change it to binary:
with open("binary2.dat", 'rb') as f:
    content = np.fromfile(f, dtype=np.int16)
and all the numbers will be read successfully. See the Binary Files chapter of Dive Into Python for more details.
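The struct attempt from the question also works once the file is opened in binary mode and read as a single byte string. The following is a minimal sketch written for Python 3 (the question uses 2.7). It creates its own sample file in a temporary directory instead of relying on the Fortran output, and assumes little-endian 16-bit values, as the hex dump suggests:

```python
import os
import struct
import tempfile

# Write 32 little-endian 16-bit integers, mimicking the Fortran stream file.
path = os.path.join(tempfile.mkdtemp(), "binary2.dat")
with open(path, "wb") as f:
    f.write(struct.pack("<32h", *range(32)))

# Read everything back in binary mode as one byte string (not readlines()).
with open(path, "rb") as f:
    data = f.read()

count = len(data) // 2  # two bytes per value
values = list(struct.unpack("<%dh" % count, data))
print(values)
```

In text mode on Windows, bytes such as 0A (newline) and 1A (Ctrl-Z, treated as end-of-file) are likely given special meaning, which would be consistent with the question's reads breaking off right after the value 25.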
