I'm trying to run the following code, but for some reason I get this error: "TypeError: limit must be an integer".
Reading csv data file
import sys
import csv

maxInt = sys.maxsize
decrement = True

while decrement:
    decrement = False
    try:
        **csv.field_size_limit(maxInt)**
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True

with open("Data.csv", 'rb') as textfile:
    text = csv.reader(textfile, delimiter=" ", quotechar='|')
    for line in text:
        print ' '.join(line)
The error occurs on the starred line. I only added the extra bit above the csv read statement because the file was too large to read normally. Alternatively, I could convert the file from csv to a plain text file, but I'm not sure whether that would corrupt the data further. I can't actually see any of the data, as the file is >2GB and hence costly to open.
Any ideas? I'm fairly new to Python but I'd really like to learn a lot more.
I'm not sure whether this qualifies as an answer or not, but here are a few things:
First, the csv reader automatically buffers per line of the CSV, so the file size shouldn't matter too much, 2KB or 2GB, whatever.
What might matter is the number of columns or amount of data inside the fields themselves. If this CSV contains War and Peace in each column, then yeah, you're going to have an issue reading it.
Some ways to potentially debug are to run print sys.maxsize, and to just open up a python interpreter, import sys, csv and then run csv.field_size_limit(sys.maxsize). If you are getting some terribly small number or an exception, you may have a bad install of Python. Otherwise, try to take a simpler version of your file. Maybe the first line, or the first several lines and just 1 column. See if you can reproduce the smallest possible case and remove the variability of your system and the file size.
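For example, a quick check in the interpreter could look like this (a sketch; the exact number printed depends on your platform and Python build):

import sys
import csv

# On a 64-bit build this is typically 9223372036854775807;
# on a 32-bit build it is usually 2147483647.
print sys.maxsize

# If this raises OverflowError (or TypeError on some Windows builds),
# sys.maxsize does not fit in the C long the csv module expects.
csv.field_size_limit(sys.maxsize)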
On Windows 7 64 bit with Python 2.6, maxInt = sys.maxsize returns 9223372036854775807L, which consequently results in a TypeError: limit must be an integer when calling csv.field_size_limit(maxInt). Interestingly, using maxInt = int(sys.maxsize) does not change this. A crude workaround is to simply use csv.field_size_limit(2147483647), which of course causes issues on other platforms. In my case this was adequate to identify the broken value in the CSV, fix the export options in the other application and remove the need for csv.field_size_limit().
-- originally posted by user roskakori on this related question
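Putting those observations together, a hedged sketch of your decrement loop that also survives the Windows behaviour might look like this (same Python 2 setup assumed; treat it as a sketch rather than a guaranteed fix):

import sys
import csv

maxInt = sys.maxsize

while True:
    # Keep shrinking the limit until the C layer accepts it. On 64-bit
    # Windows builds of Python 2 the rejection surfaces as TypeError
    # rather than OverflowError, so catch both.
    try:
        csv.field_size_limit(maxInt)
        break
    except (OverflowError, TypeError):
        maxInt = int(maxInt / 10)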
So I'm thinking this is one of those problems where I can't see the forest for the tree. Here is the assignment:
Using the file object input, write code that reads an integer from a file called
rawdata into a variable datum (make sure you assign an integer value to datum).
Open the file at the beginning of your code, and close it at the end.
Okay, so first thing: I thought the input function was for assigning data to an object such as a variable, not for reading data from an object. Wouldn't that be read.file_name?
But I gave it shot:
infile = open('rawdata','r')
datum = int(input.infile())
infile.close()
Now the first problem... MyProgrammingLab doesn't want to grade it. By that I mean I type in the code, click 'Submit', and I get the "Checking" screen. And that's it. At the time of writing this, my latest attempt to submit has been 'checking' for 11 minutes. It's not giving me an error; it's just stuck 'checking', I guess.
Now, at the moment I can't use Python to try the program because it's looking for a file and I'm on a school computer that is write locked, so even if I have the code right (I doubt it), the program will fail to run because it can neither find the file rawdata nor create it.
So... what's the deal? Am I reading the instructions wrong, or is it telling me to use input in some other way than I'm trying to use it? Or am I supposed to be using a different method?
You are so close. You're just using the file object slightly incorrectly. Once it's open, you can just .read() it, and get the value.
It would probably look something like this
infile = open('rawdata','r')
datum = int(infile.read())
infile.close()
I feel like your confusion is based purely on the wording of the question - the term "file object input" can certainly be confusing if you haven't worked with Python I/O before. In this case, the "file object" is infile and the "input" would be the rawdata file, I suppose.
Currently taking this class and figured this out. This is my contribution to all of us college peeps just trying to make it through, lol. MPL accepts this answer.
input = open('rawdata','r')
datum = int(input.readline())
input.close()
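For reference, the same thing can be written with a with block, which closes the file automatically and avoids reusing the name of the built-in input (just a sketch; whether MPL accepts this form is another matter):

# The with statement closes the file for you when the block ends.
with open('rawdata', 'r') as infile:
    datum = int(infile.readline())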
I'm parsing a 20Gb file and outputting lines that meet a certain condition to another file, however occasionally python will read in 2 lines at once and concatenate them.
inputFileHandle = open(inputFileName, 'r')
row = 0

for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
I've checked the line endings in the source file and they check out as line feeds (ascii char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some python limitation here? The position in the file of the first anomaly is around the 4GB mark.
A quick Google search for "python reading files larger than 4gb" yielded many results. See here for one such example, and another one which takes over from the first.
It's a bug in Python.
Now, the explanation of the bug; it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread().
In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF."
Oddly, there is an almost exact copy of this function in Perl source code:
http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668
The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32bit DWORD. [The fix is easy; do you see it?]
At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.
And the work-around: open the file in binary mode ('rb'), so that the C runtime's CRLF handling never runs. Note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may also use io.open().
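A sketch of both workarounds, reusing the names from the question (your line endings are already LF, so binary mode won't change the data you see):

import io

# Workaround 1: binary mode bypasses the CRT's CRLF lookahead entirely.
inputFileHandle = open(inputFileName, 'rb')

# Workaround 2 (Python 2.6+/2.7): io.open does the newline handling in
# Python rather than in the C runtime, so the bug is not triggered.
inputFileHandle = io.open(inputFileName, 'r')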
The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
The code you've posted looks fine by itself, so I would suspect a bug in your Python build.
FWIW, the snippet would be a little cleaner if it used enumerate:
inputFileHandle = open(inputFileName, 'r')

for row, line in enumerate(inputFileHandle):
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
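One caveat: enumerate counts from 0 here, while the original row counter started at 1. If you want to keep the 1-based numbering and have the file closed automatically, a variant might be (a sketch):

with open(inputFileName, 'r') as inputFileHandle:
    # start=1 matches the 1-based row numbers of the original snippet
    for row, line in enumerate(inputFileHandle, 1):
        if line_meets_condition:
            outputFileHandle.write(line)
        else:
            lstIgnoredRows.append(row)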
I'm trying to read a huge amount of lines from standard input with python.
more hugefile.txt | python readstdin.py
The problem is that the program freezes as soon as I've read just a single line.
print sys.stdin.read(8)
exit(1)
This prints the first 8 bytes, but then I expect it to terminate and it never does. I think it's not really reading just the first bytes, but trying to read the whole file into memory.
Same problem with sys.stdin.readline()
What I really want to do, of course, is to read all the lines, but with a buffer so I don't run out of memory.
I'm using python 2.6
This should work efficiently in a modern Python:
import sys

for line in sys.stdin:
    # do something...
    print line,
You can then run the script like this:
python readstdin.py < hugefile.txt
Back in the day, you had to use xreadlines to get efficient huge line-at-a-time IO -- and the docs now ask that you use for line in file.
Of course, this is of assistance only if you're actually working on the lines one at a time. If you're just reading big binary blobs to pass onto something else, then your other mechanism might be as efficient.
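If you are in that second situation, a minimal sketch of buffered, chunked reading from stdin would be (the 8192-byte chunk size is arbitrary):

import sys

while True:
    chunk = sys.stdin.read(8192)   # read() returns '' at EOF
    if not chunk:
        break
    # hand the chunk to whatever consumes it here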
I need to import a binary file from Python -- the contents are signed 16-bit integers, big endian.
The following Stack Overflow questions suggest how to pull in several bytes at a time, but is this the way to scale up to read in a whole file?
Reading some binary file in Python
Receiving 16-bit integers in Python
I thought to create a function like:
from numpy import *
import os
import struct   # needed for struct.unpack below

def readmyfile(filename, bytes=2, endian='>h'):
    totalBytes = os.path.getsize(filename)
    values = empty(totalBytes/bytes)
    with open(filename, 'rb') as f:
        for i in range(len(values)):
            values[i] = struct.unpack(endian, f.read(bytes))[0]
    return values

filecontents = readmyfile('filename')
But this is quite slow (the file is 165924350 bytes). Is there a better way?
Use numpy.fromfile.
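For example (a sketch; '>i2' is a big-endian signed 16-bit dtype, matching the '>h' struct format in the question):

import numpy

# One call reads the whole file into an array of big-endian int16 values.
filecontents = numpy.fromfile('filename', dtype='>i2')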
I would read directly until EOF (which means checking for read() returning an empty string), removing the need for range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at the same time would improve performance quite a lot.
That said, I would not expect miracles, also because I am not sure a list is the most efficient way to store that amount of data.
What about using NumPy's Array, and its facilities to read/write binary files? In this link there is a section about reading raw binary files, using numpyio.fread. I believe this should be exactly what you need.
Note: personally, I have never used NumPy; however, its main raison d'etre is exactly handling of big sets of data - and this is what you are doing in your question.
You're reading and unpacking 2 bytes at a time:
values[i] = struct.unpack(endian,f.read(bytes))[0]
Why don't you read like, 1024 bytes at a time?
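A sketch of that idea, unpacking every value in a chunk with a single struct call (1024 bytes per read is just an example; the '>h' format is taken from the question):

import struct

values = []
with open('filename', 'rb') as f:
    while True:
        chunk = f.read(1024)
        if not chunk:
            break
        # e.g. '>512h' unpacks 512 big-endian shorts in one go
        values.extend(struct.unpack('>%dh' % (len(chunk) // 2), chunk))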
I have had the same kind of problem, although in my particular case I had to convert a very strange binary format: a 500 MB file with interlaced blocks of 166 elements that were 3-byte signed integers, so I also had the problem of converting from 24-bit to 32-bit signed integers, which slows things down a little.
I resolved it using NumPy's memmap (which is just a handy way of using Python's memmap) and struct.unpack on large chunks of the file.
With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).
I could upload part of the code.
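For the simpler layout in this question (plain big-endian 16-bit values), the memmap route might look roughly like this (a sketch; 'filename' and the dtype are assumptions based on the question):

import numpy

# The file is mapped lazily: values are pulled from disk as you index
# them, so the whole file does not have to be loaded up front.
values = numpy.memmap('filename', dtype='>i2', mode='r')

print values[:10]   # first ten shorts, read on demand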
I think the bottleneck you have here is twofold.
Depending on your OS and disc controller, the calls to f.read(2), with f being a biggish file, are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (disc sectors are usually several KB) off disc into memory, because this is not a lot more expensive than reading 2 bytes from that file. The extra bytes are cached efficiently in memory, ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.
I am more concerned about the conversions to a short done one value at a time and the single-element assignments into numpy. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call to struct.unpack_from to convert everything in a buffer vs one short at a time.
Consider:
#!/usr/bin/python

import random
import os
import struct
import numpy
import ctypes

def read_wopper(filename, bytes=2, endian='>h'):
    buf_size = 1024*2
    buf = ctypes.create_string_buffer(buf_size)
    new_buf = []

    with open(filename, 'rb') as f:
        while True:
            st = f.read(buf_size)
            l = len(st)
            if l == 0:
                break
            fmt = endian[0]+str(l/bytes)+endian[1]
            new_buf += (struct.unpack_from(fmt, st))

    na = numpy.array(new_buf)
    return na
fn='bigintfile'
def createmyfile(filename):
    bytes = 165924350
    endian = '>h'
    f = open(filename, "wb")
    count = 0
    try:
        for int in range(0, bytes/2):
            # The first 32,767 values are [0,1,2..0x7FFF]
            # to allow testing the read values with new_buf[value<0x7FFF]
            value = count if count < 0x7FFF else random.randint(-32767, 32767)
            count += 1
            f.write(struct.pack(endian, value & 0x7FFF))
    except IOError:
        print "file error"
    finally:
        f.close()
if not os.path.exists(fn):
    print "creating file, don't count this..."
    createmyfile(fn)
else:
    read_wopper(fn)
    print "Done!"
I created a file of random signed shorts, 165,924,350 bytes (158.24 MB), which corresponds to 82,962,175 signed 2-byte shorts. With this file, I ran the read_wopper function above and it ran in:
real 0m15.846s
user 0m12.416s
sys 0m3.426s
If you don't need the shorts to be numpy, this function runs in 6 seconds. All this on OS X, Python 2.6.1 64 bit, 2.93 GHz Core i7, 8 GB RAM. If you change buf_size=1024*2 in read_wopper to buf_size=2**16, the run time is:
real 0m10.810s
user 0m10.156s
sys 0m0.651s
So your main bottleneck, I think, is the one-short-at-a-time calls to unpack -- not your 2 byte reads from disc. You might want to make sure that your data files are not fragmented and, if you are using OS X, that your free disc space (and here) is not fragmented.
Edit: I posted the full code to create and then read a binary file of ints. On my iMac, I consistently get < 15 secs to read the file of random ints. It takes about 1:23 to create, since the creation is done one short at a time.
I have code to read in the version number from a make file.
VERSION_ID=map(int,re.match("VERSION_ID\s*=\s*(\S+)",open("version.mk").read()).group(1).split("."))
This code takes VERSION_ID=0.0.2 and stores it as [0, 0, 2].
Is there any way I can increment this number by one and write the new version number into the version.mk file with the variable VERSION_ID?
Thanks
I have tried the same statement with write() instead of read() but I am getting an error saying that I can not write a list. I have also tried to write it as a string but am getting a bad file descriptor message.
s = str(VERSION_ID)
VERSION_ID=map(int,re.search("VERSION_ID\s*=\s*(\S+)",open("version.mk").write(s)).group(1).split("."))
I know this is rubbish, I just can't seem to find what to do here in the online docs.
I have also tried the pickle module to no avail. Maybe I'd be able to write a pickled list instead and then unpickle it. Or I was thinking I could just write over the whole line altogether.
I've also tried another approach: getting the list to be entered as a string. I have tried this, but I am not sure if it will work.
for x in VERSION_ID:
    "VERSION_ID={0}.{1}.{2}".format(x)
Perhaps something like this (you should also check for errors and such):
#! /usr/bin/python

import re

fn = "version.mk"

omk = open(fn).readlines()
nmk = open(fn, "w")

r = re.compile(r'(VERSION_ID\s*=\s*)(\S+)')

for l in omk:
    m1 = r.match(l)
    if m1:
        VERSION_ID = map(int, m1.group(2).split("."))
        VERSION_ID[2] += 1  # increment version
        l = r.sub(r'\g<1>' + '.'.join(['%s' % (v) for v in VERSION_ID]), l)
    nmk.write(l)

nmk.close()
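For example, given a line VERSION_ID=0.0.2 in version.mk, the script rewrites that line in place as VERSION_ID=0.0.3 and copies every other line through unchanged.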