I am facing a 'list index out of range' error when iterating with a for loop over a table I've created from a CSV extract, and I cannot figure out why, even after trying many different approaches.
Here is a step-by-step description of how the error happens:
I'm removing the first line of an imported CSV file, as this line contains the column names but no data. The CSV has the following structure.
columnName1, columnName2, columnName3, columnName4
This, is, some, data
I, have, in, this
very, interesting, CSV, file
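For reference, here is a minimal sketch of how such a file might end up in oldArray; the csv module and the file name data.csv are assumptions, since the actual import code isn't shown:
import csv

# Hypothetical import: read every CSV row (header included) into a list of lists.
with open('data.csv') as f:
    oldArray = list(csv.reader(f))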
After storing the CSV in a first array called oldArray, I want to populate a newArray that gets all the values from oldArray except the first line, which, as mentioned, is the column-name line. My newArray should then look like this.
This, is, some, data
I, have, in, this
very, interesting, CSV, file
To create this newArray, I'm using the following code with the append() function.
tempList = []
newArray = []
for i in range(len(oldArray)):
    if i > 0:  # my ugly way of skipping line 0...
        for j in range(len(oldArray[0])):
            tempList.append(oldArray[i][j])
        newArray.append(tempList)
        tempList = []
I also stored the columns in their own separate list.
i = 0
for i in range(len(oldArray[0])):
    my_columnList[i] = oldArray[0][i]
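Just as an illustration (assuming my_columnList can simply be rebuilt from the header row rather than assigned slot by slot), the same copy can be written in one line:
# Hypothetical shortcut: copy the header row into its own list.
my_columnList = list(oldArray[0])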
And then the error comes up: I now want to populate a treeview table from this newArray, using a for loop and insert() (inside a function), but I always get the 'list index out of range' error and cannot figure out why.
def populateTable(my_tree, newArray, my_columnList):
    i = 0
    for i in range(len(newArray)):
        my_tree.insert('', 'end', text=newArray[i][0], values=(newArray[i][1:len(newArray[0])]))
        # (I'm using the text option to bypass treeview's column 0 problem)
    return my_tree
Error message --> " File "(...my working directory...)", line 301, in populateTable
my_tree.insert(parent='', index='end', text=data[i][0], values=(data[i][1:len(data[0])]))
IndexError: list index out of range "
Using that same function with different datasets and columns worked fine, but not for this newArray.
I'm fairly certain that the error comes strictly from this newArray and is not linked to another parameter.
I've tested the validity of the columns list and of the CSV import into oldArray with some print() calls, and everything seems normal: values, row dimension, column dimension.
This is a great mystery to me...
Thank you all very much for your help and time.
You can find the problem in your error message:
File "(...my working directory...)", line 301, in populateTable
my_tree.insert(parent='', index='end', text=data[i][0], values=(data[i][1:len(data[0])]))
IndexError: list index out of range
It means an index is out of range at line 301, in data[i][0] or data[i][1:len(data[0])]:
either i is beyond len(data), or data[i] has no item at index 0.
My guess is that there is some empty list in data (maybe data[-1]?).
If data[i] is [], then data[i][0] tries to access an item that does not exist and raises the IndexError; a slice like data[i][1:len(data[0])] would not raise, it just returns fewer items.
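A minimal sketch of that failure mode, with a hypothetical empty row standing in for whatever is actually in data:
data = [['This', 'is', 'some', 'data'],
        []]                       # e.g. a blank line read from the CSV

print(data[1][1:4])               # -> []  (slices never raise IndexError)
print(data[1][0])                 # -> IndexError: list index out of range
Filtering the rows first, for example newArray = [row for row in newArray if row], would avoid it.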
There is no problem with your "ugly" way of skipping line 0, but I recommend having a look at this approach:
new_array = old_array.copy()
new_array.remove(new_array[0])
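An equivalent, slightly shorter sketch is to slice the header off while copying:
# Same result: a copy of old_array without its first (header) row.
new_array = old_array[1:]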
Now, for fixing your issue: it looks like you have a problem with the indexing. When you use a for loop over the range of the length of an array, the loop runs through every index that range produces, no matter what you set your i variable to beforehand. To make it simple,
len(oldArray[0])
is equal to 4, so using it in the for loop is just like saying
for i in range(4):
You can either subtract 1 from the length of the old array or set the i variable to 1 first:
i = 1
for i in range(len(oldArray[0])):
    my_columnList[i] = oldArray[0][i]
or
i = 0
for i in range(len(oldArray[0])-1):
    my_columnList[i] = oldArray[0][i]
This mistake is also repeated in your populateTable function, so, fixed in the same way, your code would be:
def populateTable(my_tree, newArray, my_columnList):
    i = 0
    for i in range(len(newArray)-1):
        my_tree.insert('', 'end', text=newArray[i][0], values=(newArray[i][1:len(newArray[0])]))
        # (I'm using the text option to bypass treeview's column 0 problem)
    return my_tree
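For completeness, a minimal usage sketch; Python 3's tkinter.ttk is assumed, and my_columnList / newArray are the lists built above:
import tkinter as tk
from tkinter import ttk

root = tk.Tk()
# The first column is shown through the tree's text option, the rest as real columns.
tree = ttk.Treeview(root, columns=my_columnList[1:], show='tree headings')
for col in my_columnList[1:]:
    tree.heading(col, text=col)
populateTable(tree, newArray, my_columnList)
tree.pack(fill='both', expand=True)
root.mainloop()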
I am using Python to generate an ASCII file composed of very long lines. This is one example line (let's say line 100 in the file, '[...]' are added by me to shorten the line):
{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}
If I open the ASCII file that I generated, using IPython:
f = open('myfile','r')
print repr(f.readlines()[99])
I do obtain the expected line printed correctly ('[...]' are added by me to shorten the line):
'{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}\n'
On the contrary, if I open this file with the program that is supposed to read it, it generates an exception, complaining about an unexpected pair after 478 1.
So I tried to open the file with vim. Vim shows no problem either, but if I copy the line as printed by vim and paste it into another text editor (in my case TextMate), this is the line that I obtain ('[...]' are added by me to shorten the line):
{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,4 79 8,485 1,[...]}
This line indeed has a problem after the pair 478 1.
I tried to generate my lines in different ways (concatenation, cStringIO, ...), but I always obtain this result. When using cStringIO, for example, the lines are generated as follows (even though I tried to change this as well, with no luck):
def _construct_arff(self,attributes,header,data_rows):
    """Create the string representation of a Weka ARFF file.
    *attributes* is a dictionary with attribute_name:attribute_type
    (e.g., 'num_of_days':'NUMERIC')
    *header* is a list of the attributes sorted
    (e.g., ['age','name','num_of_days'])
    *data_rows* is a list of lists with the values, sorted as in the header
    (e.g., [ [88,'John',465],[77,'Bob',223]]"""
    arff_str = cStringIO.StringIO()
    arff_str.write('#relation %s\n' % self.relation_name)
    for idx,att_name in enumerate(header):
        try:
            name = att_name.replace("\\","\\\\").replace("'","\\'")
            arff_str.write("#attribute '%s' %s\n" % (name,attributes[att_name]))
        except UnicodeEncodeError:
            arff_str.write('#attribute unicode_err_%s %s\n'
                           % (idx,attributes[att_name]))
    arff_str.write('#data\n')
    for data_row in data_rows:
        row = []
        for att_idx,att_name in enumerate(header):
            att_type = attributes[att_name]
            value = data_row[att_idx]
            # numeric attributes can be sparse: None and zeros are not written
            if ((not att_type == constants.ARRF_NUMERIC)
                    or not ((value == None) or value == 0)):
                row.append('%s %s' % (att_idx,value))
        arff_str.write('{' + (','.join(row)) + '}\n')
    return arff_str.getvalue()
UPDATE: As you can see from the code above, the function transforms a given set of data to a special arff file format. I noticed that one of the attributes I was creating contained numbers as strings (e.g., '1', instead of 1). By forcing these numbers into integers:
features[name] = int(value)
I recreated the arff file successfully. However, I don't see how this, which is a value, can have an impact on the formatting of *att_idx*, which is always an integer, as also pointed out by @JohnMachin and @gnibbler (thanks for your answers, btw). So, even if my code runs now, I still don't see why this happens. How can the value, if not properly transformed into an int, influence the formatting of something else?
This file contains the wrongly formatted version.
The built-in function repr is your friend. It will show you unambiguously what you have in your file.
Do this:
f = open('myfile','r')
print repr(f.readlines()[99])
and edit your question to show the result.
Update: As to how it got there, it is impossible to tell, because it cannot have been generated by the code that you showed. The value 37 should be a value of att_idx, which comes from enumerate() and so must be an int. You are formatting this int with %s ... 37 can't become 3rubbish7. Also, that should generate att_idx in order 0, 1, etc., but you are missing many values and there is nothing conditional inside your loop.
Please show us the code that you actually ran.
Update:
And again, this code won't run:
for idx,att_name in enumerate(header):
    arff_str.write("#attribute '%s' %s\n" % (name,attributes[att_name]))
because name is not defined; you probably mean att_name.
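Presumably the intended line is the following (a guess, assuming att_name is indeed what was meant):
for idx, att_name in enumerate(header):
    arff_str.write("#attribute '%s' %s\n" % (att_name, attributes[att_name]))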
Perhaps we can short-circuit all this stuffing about: post a copy of your output file (zipped if it's huge) on the web somewhere so that we can see for ourselves what might be disturbing its consumers. Please do edit your question to say which line(s) exhibit(s) the problem.
By the way, you say some of the data is string rather than integer, and the problem goes away if you coerce the data to int by doing features[name] = int(value) ... what is 'features'?? What is 'name'??
Are any of those strings unicode instead of str?
Update 2 (after bad file posted on net)
No info was supplied on which line(s) exhibit(s) the problem. As it turned out, no lines exhibited the described problem with attribute 479. I wrote this checking script:
import re, sys
# sample data line:
# {40 1,101 3,319 2,375 2,525 2,530 bug}
# Looks like all data lines end in ",530 bug}" or ",530 other}"
pattern1 = r"\{(?:\d+ \d+,)*\d+ \w+\}$"
matcher1 = re.compile(pattern1).match
pattern2 = r"\{(?:\d+ \d+,)*"
matcher2 = re.compile(pattern2).match
bad_atts = re.compile(r"\D\d+\s+\W").findall
got_data = False
for lino, line in enumerate(open(sys.argv[1], "r"), 1):
    if not got_data:
        got_data = line.startswith('#data')
        continue
    if not matcher1(line):
        print
        print lino, repr(line)
        m = matcher2(line)
        if m:
            print "OK up to offset", m.end()
        print bad_atts(line)
Sample output (wrapped at column 80):
581 '{2 1,7 1,9 1,12 1,13 1,14 1,15 1,16 1,17 1,18 1,21 1,22 1,24 1,25 1,26 1,27
1,29 1,32 1,33 1,36 1,39 1,40 1,44 1,48 1,49 1,50 1,54 1,57 1,58 1,60 1,67 1,68
1,69 1,71 1,74 1,75 1,76 1,77 1,80 1,88 1,93 1,101 ,103 6,104 2,109 20,110 3,11
2 2,114 1,119 17,120 4,124 39,128 5,137 1,138 1,139 1,162 1,168 1,172 18,175 1,1
76 6,179 1,180 1,181 2,185 2,187 9,188 8,190 1,193 1,195 2,196 4,197 1,199 3,201
3,202 4,203 5,206 1,207 2,208 1,210 2,211 1,212 5,213 1,215 2,216 3,218 2,220 2
,221 3,225 8,226 1,233 1,241 4,242 1,248 5,254 2,255 1,257 4,258 4,260 1,266 1,2
68 1,269 3,270 2,271 5,273 1,276 1,277 1,280 1,282 1,283 11,285 1,288 1,289 1,29
6 8,298 1,299 1,303 1,304 11,306 5,308 1,309 8,310 1,315 3,316 1,319 11,320 5,32
1 11,322 2,329 1,342 2,345 1,349 1,353 2,355 2,358 3,359 1,362 1,367 2,368 1,369
1,373 2,375 9,377 1,381 4,382 1,383 3,387 1,388 5,395 2,397 2,400 1,401 7,407 2
,412 1,416 1,419 2,421 2,422 1,425 2,427 1,431 1,433 7,434 1,435 1,436 2,440 1,4
49 1,454 2,455 1,460 3,461 1,463 1,467 1,470 1,471 2,472 7,477 2,478 11,479 31,4
82 6,485 7,487 1,490 2,492 16,494 2,495 1,497 1,499 1,501 1,502 1,503 1,504 11,5
06 3,510 2,515 1,516 2,517 3,518 1,522 4,523 2,524 1,525 4,527 2,528 7,529 3,530
bug}\n'
OK up to offset 203
[',101 ,']
709 '{101 ,124 2,184 1,188 1,333 1,492 3,500 4,530 bug}\n'
OK up to offset 1
['{101 ,']
So it looks like the attribute with att_idx == 101 can sometimes contain the empty string ''. You need to sort out how this attribute is to be treated. It would help your thinking if you unwound this Byzantine code:
if ((not att_type == constants.ARRF_NUMERIC)
        or not ((value == None) or value == 0)):
Aside: that "expletive deleted" code won't run; it should be ARFF, not ARRF
into:
if value or att_type != constants.ARFF_NUMERIC:
or maybe just if value: which will filter out all of None, 0, and "". Note that att_idx == 101 corresponds to the attribute "priority" which is given a STRING type in the ARFF file header:
[line 103] #attribute 'priority' STRING
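A tiny sketch of what the if value: form above filters; this is plain Python truthiness, nothing ARFF-specific is assumed:
for value in (None, 0, 0.0, '', 1, '1', 'bug'):
    print value, '->', ('written' if value else 'skipped')
# None, 0, 0.0 and '' are skipped; everything else is written.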
By the way, your statement about features[name] = int(value) "fixing" the problem is very suspicious; int("") raises an exception.
It may help you to read the warning at the end of this wiki section about sparse ARFF files.