StringIO and pandas read_csv

StringIO and pandas read_csv - python

I'm trying to mix StringIO and BytesIO with pandas and struggling with some basic stuff. For example, I can't get "output" below to work, whereas "output2" below does work. But "output" is closer to the real world example I'm trying to do. The way in "output2" is from an old pandas example but not really a useful way for me to do it.
import io # note for python 3 only
# in python2 need to import StringIO
output = io.StringIO()
output.write('x,y\n')
output.write('1,2\n')
output2 = io.StringIO("""x,y
1,2
""")
They seem to be the same in terms of type and contents:
type(output) == type(output2)
Out[159]: True
output.getvalue() == output2.getvalue()
Out[160]: True
But no, not the same:
output == output2
Out[161]: False
More to the point of the problem I'm trying to solve:
pd.read_csv(output) # ValueError: No columns to parse from file
pd.read_csv(output2) # works fine, same as reading from a file

io.StringIO here is behaving just like a file -- you wrote to it, and now the file pointer is pointing at the end. When you try to read from it after that, there's nothing after the point you wrote, so: no columns to parse.
Instead, just like you would with an ordinary file, seek to the start, and then read:
>>> output = io.StringIO()
>>> output.write('x,y\n')
4
>>> output.write('1,2\n')
4
>>> output.seek(0)
0
>>> pd.read_csv(output)
x y
0 1 2

Related

Writing array as string (python 3)

I am trying to write entire array as text or csv file.
from array import array as pyarray
import csv
tmp1 = (x for x in range(10))
tmp2 = (x+10 for x in range(10))
arr1 = pyarray('l')
with open ('fileoutput','wb') as fil1:
for i in range(10):
val = next(tmp1) - next(tmp2)
arr1.append(val)
arr1.tofile(fil1)
The problem with this code is it writes as binary file. I want to write as string, so that it would be readable. It is possible to create a loop and write file line by line, however real problem has millions of line in arr1. What is optimized way to write in human readable form?
Edit:
After changing above code line to with open ('fileoutput','w') as fil1: i.e. 'wb' to 'w', there is error:
write() argument must be str, not bytes. So this is not solved the problem. Any suggestions?

You opened the file in wb mode. This writes in binary. Write to the file in w mode to write it as a string.
with open ('fileoutput','w') as fil1:

You can try appending the results to a string then save it into a file, as following:
from array import array as pyarray
tmp1 = (x for x in range(10))
tmp2 = (x+10 for x in range(10))
arr1 = pyarray('l')
fileoutput_str = str(arr1)+'\n'
for i in range(10):
val = next(tmp1) - next(tmp2)
fileoutput_str += str(val)+'\n'
fileoutput_fn = 'fileoutput'
fileoutput_fo = open(fileoutput_fn, 'w')
fileoutput_fo.write(fileoutput_str)
fileoutput_fo.close()
You will have to remove the binary option b in order to write string into the file.

Python - Efficient way to flip bytes in a file?

I've got a folder full of very large files that need to be byte flipped by a power of 4. So essentially, I need to read the files as a binary, adjust the sequence of bits, and then write a new binary file with the bits adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two characters is a byte, and I need to flip the bytes by a power of 4)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii
with open(myFile, 'rb') as f:
content = f.read()
hexString = str(binascii.hexlify(content))
flippedBytes = ""
inc = 0
while inc < len(hexString):
flippedBytes += file[inc + 6:inc + 8]
flippedBytes += file[inc + 4:inc + 6]
flippedBytes += file[inc + 2:inc + 4]
flippedBytes += file[inc:inc + 2]
inc += 8
..... write the flippedBytes to file, etc
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of: "hexString.replace()" to remove unnecessary hex characters - but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code with larger files. Some of my files I need to flip are almost 2gb in size, and the code was going to take almost half a day to complete one single file. I've got dozens of files I need to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the HEX values in a file by a power of 4?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time

You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct
with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
while True:
d = infile.read(4)
if not d:
break
le = struct.unpack('<I', d)
be = struct.pack('>I', *le)
of.write(be)

Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack

"String" and "int" type dictionary compare with "Unicode" type dictionary

I have three text files.
The one (zoo.txt) looks like this:
{'cow':'113', 'cat':'50', 'dog':'100', 'IDnumber':'113.1.22', 'type':'3'}
And it reads by json function:
file_open = open('zoo.txt', 'r')
zoo_animal = file_open.read()
zoo_animal = json.loads(zoo_animal)
And after the function, the output like this:
{u'cow':u'113', u'cat':u'50', u'dog':u'100', u'IDnumber':u'113.1.22', u'type':u'3'}
The other one is in_range.txt, it means the value of key in zoo.txt must match in this standard range.
The in_range.txt looks like:
cow 1 150
cat 0 25
dog 0 50
And it reads by with function:
with open('in_range.txt', 'r') as g:
for line in g:
spliteLineR=line.split()
in_range[str(spliteLineR[0])]=[int(spliteLineR[1]),int(spliteLineR[2])]
The output is:
{'cow':[1,150], 'cat':[0,25], 'dog':[0,50]}
The third text file is single_value.txt, it means the value of key in zoo.txt must equal to standard value.
The single_value.txt looks like:
IDnumber 1.8.70
type 1
And also it reads by with function:
with open('single_value.txt', 'r') as f:
for line in f:
spliteLineS=line.split()
single_value[str(spliteLineS[0])]=str(spliteLineS[1])
The output is:
{'IDnumber':'1.8.70', 'type':'1'}
My question is:
Do I need to transfer all type(str, int and unicode) to unicode or str then compare? because I use mathematical operators(<, ==, >) to compare directly, it can not get right answer.
If I need to transfer the type, how to do it?
Please give me a hand~ thank you very much~

No, you don't need to convert the ASCII strings to Unicode or vice versa, because ASCII is a subset of Unicode, so they behave sensibly when you do equality tests, eg
print('cow' == u'cow')
output
True
That code will work correctly in Python 2 or Python 3.
However, you do have to convert those numeric strings to a numeric type to perform numeric comparisons. Here's a short demo.
from __future__ import print_function
zoo_animal = {
u'cow':u'113', u'cat':u'50', u'dog':u'100',
u'IDnumber':u'113.1.22', u'type':u'3',
}
in_range = {'cow':[1, 150], 'cat':[0, 25], 'dog':[0, 50]}
for key in zoo_animal:
if key in in_range:
lo, hi = in_range[key]
val = int(zoo_animal[key])
print(key, val, lo <= val <= hi)
output
cow 113 True
cat 50 False
dog 100 False
Once again, that code works on both Python 2 and Python 3.

byte reverse AB CD to CD AB with python

I have a .bin file, and I want to simply byte reverse the hex data. Say for instance # 0x10 it reads AD DE DE C0, want it to read DE AD C0 DE.
I know there is a simple way to do this, but I am am beginner and just learning python and am trying to make a few simple programs to help me through my daily tasks. I would like to convert the whole file this way, not just 0x10.
I will be converting at start offset 0x000000 and blocksize/length is 1000000.
here is my code, maybe you can tell me what to do. i am sure i am just not getting it, and i am new to programming and python. if you could help me i would very much appreciate it.
def main():
infile = open("file.bin", "rb")
new_pos = int("0x000000", 16)
chunk = int("1000000", 16)
data = infile.read(chunk)
reverse(data)
def reverse(data):
output(data)
def output(data):
with open("reversed", "wb") as outfile:
outfile.write(data)
main()
and you can see the module for reversing, i have tried many different suggestions and it will either pass the file through untouched, or it will throw errors. i know module reverse is empty now, but i have tried all kinds of things. i just need module reverse to convert AB CD to CD AB.
thanks for any input
EDIT: the file is 16 MB and i want to reverse the byte order of the whole file.

In Python 3.4 you can use this:
>>> data = b'\xAD\xDE\xDE\xC0'
>>> swap_data = bytearray(data)
>>> swap_data.reverse()
the result is
bytearray(b'\xc0\xde\xde\xad')

In Python 2, the binary file gets read as a string, so string slicing should easily handle the swapping of adjacent bytes:
>>> original = '\xAD\xDE\xDE\xC0'
>>> ''.join([c for t in zip(original[1::2], original[::2]) for c in t])
'\xde\xad\xc0\xde'
In Python 3, the binary file gets read as bytes. Only a small modification is need to build another array of bytes:
>>> original = b'\xAD\xDE\xDE\xC0'
>>> bytes([c for t in zip(original[1::2], original[::2]) for c in t])
b'\xde\xad\xc0\xde'
You could also use the < and > endianess format codes in the struct module to achieve the same result:
>>> struct.pack('<2h', *struct.unpack('>2h', original))
'\xde\xad\xc0\xde'
Happy byte swapping :-)

data = b'\xAD\xDE\xDE\xC0'
reversed_data = data[::-1]
print(reversed_data)
# b'\xc0\xde\xde\xad'

Python3
bytes(reversed(b'\xAD\xDE\xDE\xC0'))
# b'\xc0\xde\xde\xad'

Python has a list operator to reverse the values of a list --> nameOfList[::-1]
So, I might store the hex values as string and put them into a list then try something like:
def reverseList(aList):
rev = aList[::-1]
outString = ""
for el in rev:
outString += el + " "
return outString

TEXT compression in python

I have this text :
2,3,5,1,13,7,17,11,89,1,233,29,61,47,1597,19,37,41,421,199,28657,23,3001,521,53,281,514229,31,557,2207,19801,3571,141961,107,73,9349,135721,2161,2789,211,433494437,43,109441,139,2971215073,1103,97,101,6376021,90481,953,5779,661,14503,797,59,353,2521,4513,3010349,35239681,1087,14736206161,9901,269,67,137,71,6673,103681,9375829,54018521,230686501,29134601,988681,79,157,1601,2269,370248451,99194853094755497,83,9521,6709,173,263,1069,181,741469,4969,4531100550901,6643838879,761,769,193,599786069,197,401,743519377,919,519121,103,8288823481,119218851371,1247833,11128427,827728777,331,1459000305513721,10745088481,677,229,1381,347,29717,709,159512939815855788121,
This are numbers generated from my generator program,now the problem has a source code limit so I can't use the above texts in my solution so I want to compress this and put it into a data-structure in python so that I can print them by indexing like:
F = [`compressed data`]
and F[0] would give 2 F[5] would give 7 like this ... Please suggest me a suitable compression technique.
PS: I am a very newbie to python so please explain your method.

Sure you can do this:
import base64
import zlib
compressed = 'eJwdktkNgDAMQxfqR+5j/8V4QUJQUttx3Nrzl0+f+uunPPpm+Tf3Z/tKX1DM5bXP+wUFA777bCob4HMRfUk14QwfDYPrrA5gcuQB49lQQxdZpdr+1oN2bEA3pW5Nf8NGOFsR19NBszyX7G2raQpkVUEBdbTLuwSRlcDCYiW7GeBaRYJrgImrM3lmI/WsIxFXNd+aszXoRXuZ1PnZRdwKJeqYYYKq6y1++PXOYdgM0TlZcymCOdKqR7HYmYPiRslDr2Sn6C0Wgw+a6MakM2VnBk6HwU6uWqDRz+p6wtKTCg2WsfdKJwfJlHNaFT4+Q7PGfR9hyWK3p3464nhFwpOd7kdvjmz1jpWcxmbG/FJUXdMZgrpzs+jxC11twrBo3TaNgvsf8oqIYwT4r9XkPnNC1XcP7qD5cW7UHSJZ3my5qba+ozncl5kz8gGEEYOQ'
data = zlib.decompress(base64.b64decode(compressed))
Note that this is only 139 characters shorter.
But it works:
>>> data
'2,3,5,1,13,7,17,11,89,1,233,29,61,47,1597,19,37,41,421,199,28657,23,3001,521,53,281,514229,31,557,2207,19801,3571,141961,107,73,9349,135721,2161,2789,211,433494437,43,109441,139,2971215073,1103,97,101,6376021,90481,953,5779,661,14503,797,59,353,2521,4513,3010349,35239681,1087,14736206161,9901,269,67,137,71,6673,103681,9375829,54018521,230686501,29134601,988681,79,157,1601,2269,370248451,99194853094755497,83,9521,6709,173,263,1069,181,741469,4969,4531100550901,6643838879,761,769,193,599786069,197,401,743519377,919,519121,103,8288823481,119218851371,1247833,11128427,827728777,331,1459000305513721,10745088481,677,229,1381,347,29717,709,159512939815855788121,'
If your code limit really is so short, maybe you are supposed to calculate this data or something? What is it?

zlib would get the job done, if you indeed want compression. If you don't want compression, then I'm afraid that my mind-reading skills are on the wane.

On Python 2.4-2.7, pypy, jython:
>>> enc = sdata.encode('zlib').encode('base64')
>>> print enc
eJwdktkNgDAMQxfqR+5j/8V4QUJQUttx3Nrzl0+f+uunPPpm+Tf3Z/tKX1DM5bXP+wUFA777bCob
4HMRfUk14QwfDYPrrA5gcuQB49lQQxdZpdr+1oN2bEA3pW5Nf8NGOFsR19NBszyX7G2raQpkVUEB
dbTLuwSRlcDCYiW7GeBaRYJrgImrM3lmI/WsIxFXNd+aszXoRXuZ1PnZRdwKJeqYYYKq6y1++PXO
YdgM0TlZcymCOdKqR7HYmYPiRslDr2Sn6C0Wgw+a6MakM2VnBk6HwU6uWqDRz+p6wtKTCg2WsfdK
JwfJlHNaFT4+Q7PGfR9hyWK3p3464nhFwpOd7kdvjmz1jpWcxmbG/FJUXdMZgrpzs+jxC11twrBo
3TaNgvsf8oqIYwT4r9XkPnNC1XcP7qD5cW7UHSJZ3my5qba+ozncl5kz8gGEEYOQ
>>> print enc.decode('base64').decode('zlib')[:79]
2,3,5,1,13,7,17,11,89,1,233,29,61,47,1597,19,37,41,421,199,28657,23,3001,521,53
>>> sdata == enc.decode('base64').decode('zlib')
True
>>> F = [int(s) for s in sdata.split(',') if s.strip()]
>>> F[0], F[5]
(2, 7)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

StringIO and pandas read_csv - python

Related

Writing array as string (python 3)

Python - Efficient way to flip bytes in a file?

"String" and "int" type dictionary compare with "Unicode" type dictionary

byte reverse AB CD to CD AB with python

TEXT compression in python

Categories

Resources