IndexError: invalid index - python

I try to read from a dataset and I want all elements except the last one in train. I get the last element as target. I can print it and all good but when the code reaches train = ... then I get this error: IndexError: invalid index
dataset = np.genfromtxt(open(train_file,'r'), delimiter=',',dtype=None)[1:]
target = [x[401] for x in dataset]
train = [x[0:400] for x in dataset]
I also tried: [x[:-1] for x in dataset] but I get the same error.
Data set is big but this is a sample:
xxx,-0.011451,-0.070532,...,-0.011451,-0.070532,O

Your issue appears to be with understanding how list comprehensions work, and when you might want to use one.
A list comprehension goes through every item in a list, applies a function to it, and may or may not filter out other elements. For instance, if I had the following list:
digits = [1, 2, 3, 4, 5, 6, 7]
And I used the following list comprehension:
squares = [i * i for i in digits]
I would get: [1, 4, 9, 16, 25, 36, 49]
I could also do something like this:
even_squares = [i * i for i in digits if i % 2 == 0]
Which would give me: [4, 16, 36]
Now let's talk about your list comprehensions in particular. You wrote [x[401] for x in dataset], which, in English, reads as "a list containing the element at index 401 (the 402nd element) of each item in the list called dataset".
Now, in all likelihood, at least one line of your dataset has fewer than 402 items, meaning that, when you try to access index 401 of that line, you get an error.
It sounds like you're just trying to get all the elements in dataset excluding the last one. To do that, you can use python's slice notation. If you write dataset[:-1], you'll get all items in the dataset other than the last one. Similarly, if you wrote dataset[:-2], you'd get all items except for the last two, and so on. The same works if you want to cut off the front of the list: dataset[1:-1] will give you all items in the list excluding the 0th and last items.
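A short illustration of the slice notation described above. The list named row is made up for demonstration and is not taken from the question's file:

```python
# Hypothetical sample data, only for illustrating slices.
row = ["id", 0.1, 0.2, 0.3, "label"]

all_but_last = row[:-1]      # everything except the last item
all_but_last_two = row[:-2]  # everything except the last two items
middle = row[1:-1]           # drops the first and the last item
last = row[-1]               # the last item on its own

print(all_but_last)
print(middle)
print(last)
```

Negative indexes count from the end, so these slices work without knowing the length of the list in advance.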
Edit:
Now that I see the new comments on your post, it's clear that you are trying to get the first 401 elements of each item in the dataset. Unfortunately, because we don't know anything about your dataset, it's impossible to say what exactly the issue is.

I just tested this with the following toy code. Your syntax is actually correct. Something is wrong with your input file, not with the way you are selecting elements from your list of arrays.
from numpy import *
a = array(range(1,403))
dataset = []
for i in range(5):
    dataset.append(a)
target = [x[401] for x in dataset]
train = [x[0:400] for x in dataset]
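If the input file is the suspect, one way to check it is to read it with the standard-library csv module and look at the row lengths before slicing. This is only a sketch under assumptions: the sample text is invented to mimic the line shown in the question (an id, some numbers, then a label), it swaps numpy for csv, and dropping the id column from train is my choice, not something the question states.

```python
import csv
import io

# Invented stand-in for the real file: a header line, then id, features, label.
sample = io.StringIO(
    "id,f1,f2,f3,label\n"
    "xxx,-0.011451,-0.070532,-0.011451,O\n"
    "yyy,0.25,0.5,0.75,B\n"
)

rows = list(csv.reader(sample))[1:]  # skip the header, like [1:] in the question

# Every row should have the same number of columns; a mismatch points at the file.
lengths = {len(r) for r in rows}
print(lengths)

target = [r[-1] for r in rows]                       # last column
train = [[float(v) for v in r[1:-1]] for r in rows]  # numeric columns only (assumption)
print(target)
print(train[0])
```

If lengths contains more than one value, some line in the file is shorter or longer than the others, and that is where a fixed index such as 401 would fail.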

Related

I can't print a specific set in a list in python

coordinates=[(4,5),(5,6,(6,6))]
print(coordinates[3])
result:
Traceback (most recent call last):
  File "C:\Users\PycharmProjects\pythonProject1\App.py", line 2, in <module>
    print(coordinates[3])
IndexError: list index out of range
I want the result to be [6,6] instead of the error message. What do you mean, I am trying to access a fourth one? I am trying to access the 3rd one in the list. When I use coordinates[1][2] it gives me a syntax error.
You want to double check the way you formatted your list.
From the looks of it, your list has index 0 (4,5) and index 1 (5,6,(6,6)). Your list simply has nothing at index 3 because there are only 2 entries. That's what the error message means.
You might want to change your list into coordinates=[(4,5),5,6,(6,6)], then you have index 0-3.
Try this:
print(coordinates[1][-1])
or
print(coordinates[1][2])
There are 2 problems:
(1) Items in a list are zero-indexed, meaning the first element of L is L[0], second is L[1], etc. So if you want the third item:
print(coordinates[2])
(2) If you just stop there, you'll still get an index out of range exception, because your list has only 2 elements. I think you've misplaced a ):
coordinates=[(4,5),(5,6),(6,6)]
In your question, your list coordinates has 2 elements: (4, 5) and (5, 6, (6, 6)) - notice the nested parentheses. The second item is a tuple containing 3 elements: 5, 6, and another tuple (6,6).
Tip: Sometimes I like to put lists on multiple lines, especially if they are nested or contain sets/tuples/other complicated structures. One widely accepted style for this is shown below:
coordinates = [
    (4,5),
    (5,6),
    (6,6),
]
EDIT: Thanks to Ignatius Reilly for catching my missing commas!
Note: you are allowed to have a comma after the last element, which I usually do. It makes it easier to add more elements or rearrange them later using copy/paste.
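To make the difference between the two lists concrete, here is a small comparison. The names as_written and fixed are mine; the values are the ones from the question and from the corrected list above.

```python
# The list exactly as written in the question: two items, the second one nested.
as_written = [(4, 5), (5, 6, (6, 6))]
print(len(as_written))   # 2
print(as_written[1][2])  # (6, 6)

# The list with the parenthesis moved: three separate pairs.
fixed = [
    (4, 5),
    (5, 6),
    (6, 6),
]
print(len(fixed))        # 3
print(fixed[2])          # (6, 6)
```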

How exactly does Python check through a list?

I was doing one of the course exercises on Codecademy for Python and I had a few questions I couldn't seem to find an answer to:
For this block of code, how exactly does python check whether something is "in" or "not in" a list? Does it run through each item in the list to check or does it use a quicker process?
Also, how would this code be affected if it were running with a massive list of numbers (thousands or millions)? Would it slow down as the list size increases, and are there better alternatives?
numbers = [1, 1, 2, 3, 5, 8, 13]
def remove_duplicates(list):
    new_list = []
    for i in list:
        if i not in new_list:
            new_list.append(i)
    return new_list
remove_duplicates(numbers)
Thanks!
P.S. Why does this code not function the same?
numbers = [1, 1, 2, 3, 5, 8, 13]
def remove_duplicates(list):
    new_list = []
    new_list.append(i for i in list if i not in new_list)
    return new_list
In order to execute i not in new_list Python has to do a linear scan of the list. The scanning loop breaks as soon as the result of the test is known, but if i is actually not in the list the whole list must be scanned to determine that. It does that at C speed, so it's faster than doing a Python loop to explicitly check each item. Doing the occasional in some_list test is ok, but if you need to do a lot of such membership tests it's much better to use a set.
On average, with random data, testing membership has to scan through half the list items, and in general the time taken to perform the scan is proportional to the length of the list. In the usual notation the size of the list is denoted by n, and the time complexity of this task is written as O(n).
In contrast, determining membership of a set (or a dict) can be done (on average) in constant time, so its time complexity is O(1). Please see TimeComplexity in the Python Wiki for further details on this topic. Thanks, Serge, for that link.
Of course, if you're using a set then you get de-duplication for free, since it's impossible to add duplicate items to a set.
One problem with sets is that they generally don't preserve order. But you can use a set as an auxiliary collection to speed up de-duping. Here is an illustration of one common technique to de-dupe a list, or other ordered collection, which does preserve order. I'll use a string as the data source because I'm too lazy to type out a list. ;)
new_list = []
seen = set()
for c in "this is a test":
    if c not in seen:
        new_list.append(c)
        seen.add(c)
print(new_list)
output
['t', 'h', 'i', 's', ' ', 'a', 'e']
Please see How do you remove duplicates from a list whilst preserving order? for more examples. Thanks, Jean-François Fabre, for the link.
As for your PS, that code appends a single generator object to new_list; it doesn't append what the generator would produce.
I assume you already tried to do it with a list comprehension:
new_list = [i for i in list if i not in new_list]
That doesn't work, because the new_list doesn't exist until the list comp finishes running, so doing in new_list would raise a NameError. And even if you did new_list = [] before the list comp, it won't be modified by the list comp, and the result of the list comp would simply replace that empty list object with a new one.
BTW, please don't use list as a variable name (even in example code) since that shadows the built-in list type, which can lead to mysterious error messages.
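The point about the generator object can be checked directly. The sketch below also shows dict.fromkeys as one order-preserving alternative; that alternative is my addition rather than something from the answer above, and it relies on dicts keeping insertion order (guaranteed since Python 3.7).

```python
import types

numbers = [1, 1, 2, 3, 5, 8, 13]

# append() adds exactly one object, here the unevaluated generator itself.
appended = []
appended.append(n for n in numbers)
print(len(appended))                                 # 1
print(isinstance(appended[0], types.GeneratorType))  # True

# One order-preserving alternative: dict keys are unique and keep insertion order.
deduped = list(dict.fromkeys(numbers))
print(deduped)                                       # [1, 2, 3, 5, 8, 13]
```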
You are asking multiple questions and one of them asking if you can do this more efficiently. I'll answer that.
OK, let's say you have thousands or millions of numbers. Where do they come from? If they are stored in some kind of text file, you would probably want to use numpy (if you are sticking with Python, that is). Example:
import numpy as np
numbers = np.array([1, 1, 2, 3, 5, 8, 13], dtype=np.int32)
numbers = np.unique(numbers).tolist()
This will be more efficient (above all in memory) than reading the numbers with plain Python and calling list(set(...)). Note that np.unique returns the unique values in sorted order.
numbers = [1, 1, 2, 3, 5, 8, 13]
numbers = list(set(numbers))
You are asking for the algorithmic complexity of this function. To find that you need to see what is happening at each step.
You walk through the input list one item at a time, and each retrieval takes 1 unit of work, because fetching an item from a list by index is O(1).
The list of unique items grows by at most 1 per step, so at any point in time it holds at most n items.
Deciding whether to add the item you picked takes up to n units of work in the worst case, because every item already in the unique items list may have to be checked.
So if you sum up the total work in each step, it is at most 1 + 2 + 3 + 4 + 5 + ... + n, which is n (n + 1) / 2 and therefore grows as O(n^2). So if you have a million items, you can estimate the work by putting n = 1,000,000 into the formula.
This is not entirely accurate because of how lists are implemented, but it is a useful way to visualize the cost.
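The quadratic growth can be observed by counting comparisons. The sketch below is my own illustration: Counted is a hypothetical wrapper class that counts calls to ==, and remove_duplicates is the function from the question with the parameter renamed. Counting only the comparisons (not the retrievals) gives n (n - 1) / 2 for an input without duplicates, which is slightly smaller than the sum above but has the same quadratic growth.

```python
class Counted:
    """Hypothetical wrapper that counts how many equality comparisons are made."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.comparisons += 1
        return self.value == other.value


def remove_duplicates(items):
    new_list = []
    for item in items:
        if item not in new_list:  # linear scan of new_list
            new_list.append(item)
    return new_list


n = 100
unique_items = [Counted(i) for i in range(n)]  # worst case: no duplicates at all
result = remove_duplicates(unique_items)

print(Counted.comparisons)  # 4950
print(n * (n - 1) // 2)     # 4950
```

Doubling n roughly quadruples the count, which is why a set is the better tool for large inputs.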
To answer the question in the title: Python has more efficient data types for this. A list is just a plain array, so a membership test has to scan it. If you want a faster way to look up values you can use a dict (or a set), which uses a hash of the stored object to place it in a hash table; that is probably the "quicker process" you were thinking of.
as to the second code snippet:
list.append() adds whatever single value you give it to the end of the list. The expression (i for i in list if i not in new_list) is a generator, so append inserts that generator object itself into the list. list.extend() is closer to what you want: it takes an iterable and appends each of its elements to the list.

Iterate over values in nested list

I'm working on scientific data and using a module called pysam in order to get reference position for each unique "object" in my file.
In the end, I obtain a "list of lists" that looks like that (here I provide an example with only two objects in the file):
pos = [[1,2,3,6,7,8,15,16,17,20],[1,5,6,7,8,20]]
and, for each list in pos, I would like to iterate over the values and compare value[i] with value[i+1]. When the difference is greater than 2 (for example) I want to store both values (value[i] and value[i+1]) into a new list.
If we call it final_pos then I would like to obtain:
final_pos = [[3,6,8,15,17,20],[1,5,8,20]]
It seemed rather easy to do at first, but I must be lacking some basic knowledge of how lists work, and I can't manage to iterate over each value of each list and then compare consecutive values.
If anyone has an idea, I'm more than willing to hear about it !
Thanks in advance for your time !
EDIT: Here's what I tried:
pos = [[1,2,3,6,7,8,15,16,17,20],[1,5,6,7,8,20]]
final_pos = []
for list in pos:
    for value in list:
        for i in range(len(list)-1):
            if value[i+1]-value[i] > 2:
                final_pos.append(value[i])
                final_pos.append(value[i+1])
You can iterate over each individual list in pos and then compare the consecutive values. When you need to insert the values, you can use a temporary set, because you wouldn't want to insert the same element twice in your final list. Then you can convert the temporary set to a list and append it to your final list (after sorting it, to preserve order). Note that the sorting only gives the right result if the elements in the original list are actually sorted.
pos = [[1,2,3,6,7,8,15,16,17,20],[1,5,6,7,8,20]]
final_pos = []
for l in pos:
    temp_set = set()
    for i in range(len(l)-1):
        if l[i+1] - l[i] > 2:
            temp_set.add(l[i])
            temp_set.add(l[i+1])
    final_pos.append(sorted(list(temp_set)))
print(final_pos)
Output
[[3, 6, 8, 15, 17, 20], [1, 5, 8, 20]]
Edit: About what you tried:
for list in pos:
This line will give us list = [1,2,3,6,7,8,15,16,17,20] (in the first iteration)
for value in list:
This line will give us value = 1 (in the first iteration)
Now, value is just a number, not a list, and hence value[i] and value[i+1] don't make sense.
Your code has an obvious "too many loops" issue. It also stores the result as a flat list, whereas you need a list of lists.
It also has a more subtle bug: the same index can be added more than once if 2 intervals match in a row. I've registered the added indices in a set to avoid this.
The bug doesn't show with your original data (which tripped up a lot of experienced users, including me), so I've changed it:
pos = [[1,2,3,6,7,8,11,15,16,17,20],[1,5,6,7,8,20]]
final_pos = []
for value in pos:
    sublist = []
    added_indexes = set()
    for i in range(len(value)-1):
        if value[i+1]-value[i] > 2:
            if i not in added_indexes:
                sublist.append(value[i])
                ## added_indexes.add(i) # we don't need to add it, we won't go back
            # no need to test for i+1, it's new
            sublist.append(value[i+1])
            # registering it for later
            added_indexes.add(i+1)
    final_pos.append(sublist)
print(final_pos)
result:
[[3, 6, 8, 11, 15, 17, 20], [1, 5, 8, 20]]
Storing the indexes in a set rather than the values (storing the values would also work here, with a sort afterwards, see this answer) also works when the objects aren't hashable (for example custom objects with their own distance function) or when the data is only partially sorted (waves), if that is of interest (ex: pos = [[1,2,3,6,15,16,17,20,1,6,10,11],[1,5,6,7,8,20,1,5,6,7,8,20]])
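Since indexes are visited in increasing order, remembering only the most recently added index is enough to avoid the double insertion. The function below is a sketch of that variant, not the code from either answer; the name gap_boundaries and the threshold parameter are my own.

```python
def gap_boundaries(values, threshold=2):
    """Return both ends of every gap larger than threshold, without repeats."""
    out = []
    last_added = -1  # index of the most recently appended element
    for i in range(len(values) - 1):
        if values[i + 1] - values[i] > threshold:
            if last_added != i:          # left end not already stored
                out.append(values[i])
            out.append(values[i + 1])    # right end is always new
            last_added = i + 1
    return out


pos = [[1, 2, 3, 6, 7, 8, 11, 15, 16, 17, 20], [1, 5, 6, 7, 8, 20]]
final_pos = [gap_boundaries(sub) for sub in pos]
print(final_pos)  # [[3, 6, 8, 11, 15, 17, 20], [1, 5, 8, 20]]
```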

Python Pop Loop

Running into something which seems strange. I use a set of lists to hold some data and if a condition is met in them I want to remove that data from each list.
This is what I have currently. It works and removes everything from the first result but when there's more than one meeting the criteria it leaves them.
agecounter = 0
for matches in List1:
    if Condition Met:
        List1.pop(agecounter)
        List2.pop(agecounter)
        List3.pop(agecounter)
    agecounter = agecounter + 1
If I have 10 items in those lists and three meet the criteria it'll remove the first one. I can even print the data from the other results meeting the condition. It prints it to the console just fine, doesn't throw an exception but seems to just ignore the pop.
I might be missing something really obvious here but there's no reason for that not to work, is there?
Traverse your list in reverse order
agecounter = len(List1)-1
for matches in reversed(List1):
    if Condition Met:
        List1.pop(agecounter)
        List2.pop(agecounter)
        List3.pop(agecounter)
    agecounter = agecounter - 1
It's not a good idea to remove elements from a list while iterating over it. You should probably iterate over a copy of your list instead
A full example that shows the problem would help get better answers.
That being said, pop is probably not working as you expect. It's not removing agecounter from the list. It's removing the item at position agecounter.
>>> a = [1,2,3, 4]
>>> a.pop(1)
2
>>> a
[1, 3, 4]
>>> a.pop(2)
4
And when you get higher you're more likely to go off the end, throwing an exception.
>>> a.pop(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: pop index out of range
Adding to @StephenTG's answer, you probably want to copy the data rather than modify it during iteration. You can "filter" it using a list comprehension:
>>> a = [1,2,3,4]
>>> b = [2,3]
>>> [x for x in a if x not in b]
[1, 4]
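The same filtering idea can be applied to several parallel lists at once by first collecting the indexes to keep. This is a sketch with invented data; the list names and the age condition stand in for List1, List2, List3 and "Condition Met" from the question.

```python
# Invented sample data: three parallel lists that must stay aligned.
ages = [25, 70, 31, 82, 90, 40]
names = ["Ann", "Bob", "Cy", "Dee", "Eve", "Fay"]
cities = ["Oslo", "Rome", "Lima", "Bonn", "Kyiv", "Nice"]

# Indexes of the entries to keep (here: drop anyone older than 65).
keep = [i for i, age in enumerate(ages) if age <= 65]

# Build new lists instead of popping from the ones being iterated.
ages = [ages[i] for i in keep]
names = [names[i] for i in keep]
cities = [cities[i] for i in keep]

print(ages)    # [25, 31, 40]
print(names)   # ['Ann', 'Cy', 'Fay']
print(cities)  # ['Oslo', 'Lima', 'Nice']
```

Because nothing is removed during the loop, no item shifts position and no match is skipped.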

Deduping a complex list using a simplified copy of itself

I have two lists of strings that are passed into a function. They are more or less the same, except that one has been run through a regex filter to remove certain boilerplate substrings (e.g. removing 'LLC' from 'Blues Brothers LLC').
This function is meant to internally deduplicate the modified list and remove the associated item in the non-modified list. You can assume that these lists were sorted alphabetically before being run through the regex filter, and remain in the same order (i.e. original[x] and modified[x] refer to the same entity, even if original[x] != modified[x]). Relative order must be maintained between the two lists in the output.
This is what I have so far. It works 99% of the time, except for very rare combinations of inputs and boilerplate strings (1 in 1000s) where some output strings will be mismatched by a single list position. Input lists are 'original' and 'modified'.
# record positions of duplicates so we're not trying to modify the same lists we're iterating
dellist_modified = []
dellist_original = []
# probably not necessary, extra precaution against modifying lists being iterated.
# fwiw the problem still exists if I remove these and change their references in the last two lines directly to the input lists
modified_copy = modified
original_copy = original
for i in range(0, len(modified)-1):
    if modified[i] == modified[i+1]:
        dellist_modified.append(modified[i+1])
        dellist_original.append(original[i+1])
for j in dellist_modified:
    if j in modified:
        del modified_copy[agg_match.index(j)]
        del original_copy[agg_match.index(j)]
# return modified_copy and original_copy
It's ugly, but it's all I got. My testing indicates the problem is created by the last chunk of code.
Modifications or entirely new approaches would be greatly appreciated. My next step is to try using dictionaries.
Here is a clean way of doing this:
original = list(range(10))
modified = list(original)
modified[5] = "a"
modified[6] = "a"
def without_repeated(original, modified):
    seen = set()
    for (o, m) in zip(original, modified):
        if m not in seen:
            seen.add(m)
            yield o, m
original, modified = zip(*without_repeated(original, modified))
print(original)
print(modified)
Giving us:
(0, 1, 2, 3, 4, 5, 7, 8, 9)
(0, 1, 2, 3, 4, 'a', 7, 8, 9)
We iterate through both lists at the same time. We keep a set of items we have seen (sets have very fast membership checks) and then yield any pairs whose modified value we haven't already seen.
We can then use zip again to give us two lists back.
Note we could actually do this like so:
seen = set()
original, modified = zip(*((o, m) for (o, m) in zip(original, modified) if m not in seen and not seen.add(m)))
This works the same way, except using a single generator expression, with adding the item to the set hacked in using the conditional statement (add always returns None, which is falsy, so not seen.add(m) is always true). However, this method is considerably harder to read and so I'd advise against it; it's just an example for the sake of it.
A set in python is a collection of distinct elements. Is the order of these elements critical? Something like this may work:
distinct = list(set(original))
Why use parallel lists? Why not a single list of class instances? That keeps things grouped easily, and reduces your list lookups.
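The suggestion to use a single list of class instances could look roughly like this. It is only a sketch: the Company class, its field names, and the sample names are hypothetical and do not come from the question.

```python
from dataclasses import dataclass


@dataclass
class Company:       # hypothetical record type
    original: str    # name as received
    modified: str    # name after the boilerplate filter


def dedupe(companies):
    """Keep the first record for each modified name, preserving order."""
    seen = set()
    kept = []
    for company in companies:
        if company.modified not in seen:
            seen.add(company.modified)
            kept.append(company)
    return kept


companies = [
    Company("Blues Brothers", "Blues Brothers"),
    Company("Blues Brothers LLC", "Blues Brothers"),
    Company("Cab Calloway Inc", "Cab Calloway"),
]
for c in dedupe(companies):
    print(c.original, "->", c.modified)
```

Because each original name travels with its modified name, the two can no longer drift apart by one position.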
