ValueError: y contains new labels: ['#'] - python

I have a list of lists with every list containing 1 up to 5 tags. I have constructed a list containing the top 50 tags. My goal is to construct a new list of lists where every list contains only the top 50 tags. My approach went like this:
First I constructed a new list of lists with only the top 50 tags:
top_50 = list(np.array(pd.read_csv(os.path.join(dir,"Tags.csv")))[:,1])
train = pd.read_csv(os.path.join(dir,"Train.csv"),iterator = True)
top_50 = top_50[:51]
tags = list(np.array(train.get_chunk(50000))[:,3])
top_50_tags = [[tag for tag in list if tag in top_50] for list in tags]
Then I tried to encode the tags:
coder = preprocessing.LabelEncoder()
coder = coder.fit(top_50)
tags = [coder.transform(tag) for tag in list for list in top_50_tags]
This however gave me this error:
Traceback (most recent call last):
File "C:\Users\Ano\workspace\final_submission\src\rf_test.py", line 69, in <module>
main()
File "C:\Users\Ano\workspace\final_submission\src\rf_test.py", line 33, in main
labels = [coder.transform(tag) for tag in list for list in top_50_tags]
File "C:\Python27\lib\site-packages\sklearn\preprocessing\label.py", line 120, in transform
raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['#']
I think this error arises because some of my lists are empty, since there were no top 50 tags in them. But the error specifically states that ['#'] is the newly seen label. Am I right with my hypothesis? And what should I do about the error message?
Edit:
For the people wondering why I am using list as a variable in list comprehension, I actually use a different word as a variable in my real program.
Update
I checked for differences in my top_50 and the tags:
print(len(top_50.difference(tags)))
which gave me a length of 0. This should mean that my empty lists are the problem?

Maybe you can check this issue: https://github.com/scikit-learn/scikit-learn/issues/3123
In scikit-learn 0.17 version, this bug has been solved.
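Independent of the linked issue, one thing worth ruling out is the shape of the data. The sketch below assumes (the question does not show the CSV) that each row of the tags column is a space-separated string rather than a list. Iterating over such a string yields single characters, so a tag like c# would produce '#' as an unseen label. The data is hypothetical and a plain dict stands in for the label encoder:

```python
# Hypothetical data: the real Tags.csv / Train.csv contents are not shown in the question.
top_tags = ["c#", "java", "python"]
rows = ["c# asp.net", "python numpy", "haskell"]

# Iterating over a string yields characters, not tags.
chars = [ch for ch in rows[0]]
print("#" in chars)  # prints True

# Split each row into tags first, then keep only the top tags.
filtered = [[tag for tag in row.split() if tag in top_tags] for row in rows]

# Stand-in for a label encoder: map each known tag to an integer.
code_of = {tag: i for i, tag in enumerate(sorted(top_tags))}

# Nested comprehension: the outer loop runs over rows, the inner one over tags.
# Rows with no top tags simply encode to empty lists.
encoded = [[code_of[tag] for tag in row_tags] for row_tags in filtered]
print(encoded)  # prints [[0], [2], []]
```

In this sketch the empty lists cause no error by themselves; the unseen label only shows up when characters are encoded instead of whole tags.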

Related

Indexing error after removing line from 2D array

I am facing a 'list index out of range' error when trying to iterate a for-loop over a table I've created from a CSV extract, but cannot figure out why, even after trying many different methods.
Here is a step-by-step description of how the error happens:
I'm removing the first line of an imported CSV file, as this line contains the column names but no data. The CSV has the following structure.
columnName1, columnName2, columnName3, columnName4
This, is, some, data
I, have, in, this
very, interesting, CSV, file
After storing the CSV in a first array called oldArray, I want to populate a newArray that will get all values from oldArray but not the first line, which is the column name line, as previously mentioned. My newArray should then look like this.
This, is, some, data
I, have, in, this
very, interesting, CSV, file
To create this newArray, I'm using the following code with the append() function.
tempList = []
newArray = []
for i in range(len(oldArray)):
    if i > 0: #my ugly way of skipping line 0...
        for j in range(len(oldArray[0])):
            tempList.append(oldArray[i][j])
        newArray.append(tempList)
        tempList = []
I also stored the columns in their own separate list.
i = 0
for i in range(len(oldArray[0])):
    my_columnList[i] = oldArray[0][i]
And the error comes up next: I now want to populate a treeview table from this newArray, using a for-loop and insert (in a function). But I always get the 'list index out of range' error and I cannot figure out why.
def populateTable(my_tree, newArray, my_columnList):
    i = 0
    for i in range(len(newArray)):
        my_tree.insert('','end', text=newArray[i][0], values = (newArray[i][1:len(newArray[0])]))
        #(im using the text option to bypass treeview's column 0 problem)
    return my_tree
Error message --> " File "(...my working directory...)", line 301, in populateTable
my_tree.insert(parent='', index='end', text=data[i][0], values=(data[i][1:len(data[0])]))
IndexError: list index out of range "
Using that same function with different datasets and columns worked fine, but not with this newArray.
I'm fairly certain that the error comes strictly from this newArray and is not linked to another parameter.
I've tested the validity of the columns list, of the CSV import in oldArray through some print() functions, and everything seems normal - values, row dimension, column dimension.
This is a great mystery to me...
Thank you all very much for your help and time.
You can locate the problem from your error message:
File "(...my working directory...)", line 301, in populateTable
my_tree.insert(parent='', index='end', text=data[i][0], values=(data[i][1:len(data[0])]))
IndexError: list index out of range
It means an index is out of range on line 301, in one of data[i], data[0] or data[i][0].
Assuming the loop is for i in range(len(data)), as in the question, data[i] and data[0] are always valid inside the loop, which leaves data[i][0].
My guess is that there is an empty list somewhere in data (maybe data[-1]?).
If data[i] is [], then data[i][0] tries to access a first item that does not exist. The slice data[i][1:len(data[0])] cannot raise this error: slicing past the end just returns a shorter (possibly empty) list.
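To illustrate the point about empty rows, here is a minimal sketch. The CSV text is made up, and it assumes the file was read with the csv module, which yields an empty list for a blank line:

```python
import csv
import io

# Hypothetical CSV content that ends with a blank line.
text = "columnName1,columnName2\nThis,is\nI,have\n\n"
data = list(csv.reader(io.StringIO(text)))[1:]  # drop the header row

# Find the positions of empty rows.
empty_rows = [i for i, row in enumerate(data) if not row]
print(empty_rows)  # prints [2]

# Slicing an empty list is safe; indexing it is not.
print(data[-1][1:])  # prints []
try:
    data[-1][0]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)  # prints list index out of range
```

Printing the positions of empty rows like this is a quick way to check whether the guess applies to the real data.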
There is nothing wrong with your "ugly" way of skipping line 0, but I recommend having a look at this approach:
new_array = old_array.copy()
new_array.remove(new_array[0])
Now, for your issue.
The loop bounds themselves are not the problem. range(n) yields 0 through n-1, which are exactly the valid indices of a list of length n. To make it simple,
len(oldArray[0])
is equal to 4, so using it in the for loop is just like saying
for i in range(4):
which visits indices 0, 1, 2 and 3. Assigning i = 0 (or i = 1) before the loop has no effect, because the for statement rebinds i on every iteration, so that line can simply be dropped:
for i in range(len(oldArray[0])):
    my_columnList[i] = oldArray[0][i]
Subtracting 1 from the length, as in
for i in range(len(oldArray[0])-1):
    my_columnList[i] = oldArray[0][i]
only skips the last element. In populateTable that would make the error disappear if the last row of newArray happens to be empty, but it would also drop the last row when it is valid. It is safer to skip empty rows explicitly:
def populateTable(my_tree, newArray, my_columnList):
    for i in range(len(newArray)):
        if not newArray[i]:  # skip empty rows, e.g. from a trailing blank line in the CSV
            continue
        my_tree.insert('','end', text=newArray[i][0], values = (newArray[i][1:len(newArray[0])]))
        #(the text option is used to bypass treeview's column 0 problem)
    return my_tree
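As an alternative to the index-based loops, the header can be split off with slicing and empty rows filtered out in one pass. This is only a sketch: the data mirrors the CSV from the question, and the empty row is a hypothetical addition to show the filtering.

```python
# Rows as they might come out of the CSV import; the empty row is hypothetical.
old_array = [
    ["columnName1", "columnName2", "columnName3", "columnName4"],
    ["This", "is", "some", "data"],
    ["I", "have", "in", "this"],
    [],
    ["very", "interesting", "CSV", "file"],
]

# The first row holds the column names.
column_list = list(old_array[0])

# Everything after the header, skipping rows that are empty.
new_array = [list(row) for row in old_array[1:] if row]

for row in new_array:
    print(row[0], row[1:])
```

Because every row in new_array is non-empty, row[0] is always safe to access.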

Insert dictionary item into list of dictionaries

I have a list adImageList of dictionary items in the following form:
[{'Image_thumb_100x75': 'https://cache.domain.com/mmo/7/295/170/227_174707044_thumb.jpg',
'Image_hoved_400x300': 'https://cache.domain.com/mmo/7/295/170/227_174707044_hoved.jpg',
'Image_full_800x600': 'https://cache.domain.com/mmo/7/295/170/227_174707044.jpg'},
{'Image_thumb_100x75': 'https://cache.domain.com/mmo/7/295/170/227_1136648194_thumb.jpg',
'Image_hoved_400x300': 'https://cache.domain.com/mmo/7/295/170/227_1136648194_hoved.jpg',
'Image_full_800x600': 'https://cache.domain.com/mmo/7/295/170/227_1136648194.jpg'},
{'Image_thumb_100x75': 'https://cache.domain.com/mmo/7/295/170/227_400613427_thumb.jpg',
'Image_hoved_400x300': 'https://cache.domain.com/mmo/7/295/170/227_400613427_hoved.jpg',
'Image_full_800x600': 'https://cache.domain.com/mmo/7/295/170/227_400613427.jpg'}]
I have an iterator that is supposed to add a local URL under each image record after fetching it from the web (the fetching part works OK). So I'm using the following code to append the local URL to the existing dictionary items:
for i, d in enumerate(adImageList):
    file_name_thumb = '0{}_{}_{}'.format(i, page_title,'_thumb_100x75.jpg')
    urllib.request.urlretrieve(d['Image_thumb_100x75'], file_name_thumb)
    local_path_thumb = dir_path+file_name_thumb
    adImageList.insert[i](1,{'Image_thumb_100x75_local_path_thumb':local_path_thumb}) # not working
    file_name_hoved = '0{}_{}_{}'.format(i, page_title,'_hoved_400x300.jpg')
    urllib.request.urlretrieve(d['Image_hoved_400x300'], file_name_hoved)
    local_path_hoved = dir_path+file_name_hoved
    adImageList.insert[i](3,{'Image_hoved_400x300_local_path_hoved':local_path_hoved}) # not working
    file_name_full = '0{}_{}_{}'.format(i, page_title,'_full_800x600.jpg')
    urllib.request.urlretrieve(d['Image_full_800x600'], file_name_full)
    local_path_full = dir_path+file_name_full
    adImageList.insert[i](5,{'Image_full_800x600_local_path_full':local_path_full}) # not working
The idea is to extend the dict items in the following manner, which also explains the numbers 1, 3 and 5 in my code:
{'Image_thumb_100x75': 'https://cache.domain.com/mmo/7/295/170/227_174707044_thumb.jpg',
'Image_thumb_100x75_local_path_thumb': local_path_thumb,  #1
'Image_hoved_400x300': 'https://cache.domain.com/mmo/7/295/170/227_174707044_hoved.jpg',
'Image_hoved_400x300_local_path_hoved': local_path_hoved,  #3
'Image_full_800x600': 'https://cache.domain.com/mmo/7/295/170/227_174707044.jpg',
'Image_full_800x600_local_path_full': local_path_full}  #5
But it's giving me error:
TypeError: 'builtin_function_or_method' object is not subscriptable
Most likely here's what you had in mind:
adImageList[i]['Image_thumb_100x75_local_path_thumb']=local_path_thumb
This adds key 'Image_thumb_100x75_local_path_thumb' to the ith dictionary on the list and sets its value to local_path_thumb. The purpose of 1,3,5 is still unclear.
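Python stack traces give line numbers for a reason, but my guess is that it is this line:
adImageList.insert[i]
insert is a method, so it has to be called with parentheses, as in adImageList.insert(index, item). Subscripting it with [i] raises the TypeError you see.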
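If the position of the new keys matters (the 1, 3 and 5 in the question), one option is to rebuild each dictionary, since dicts keep insertion order in Python 3.7 and later. The sketch below leaves out the download step; dir_path, page_title, the shortened URLs and the simplified "_local_path" key suffix are all hypothetical stand-ins, not the names from the question.

```python
# Hypothetical stand-ins for the values used in the question.
dir_path = "/tmp/images/"
page_title = "ad"
ad_image_list = [
    {
        "Image_thumb_100x75": "https://cache.domain.com/a_thumb.jpg",
        "Image_full_800x600": "https://cache.domain.com/a.jpg",
    },
]

for i, d in enumerate(ad_image_list):
    updated = {}
    for key, url in d.items():
        suffix = key[len("Image_"):]  # e.g. "thumb_100x75"
        file_name = "0{}_{}_{}.jpg".format(i, page_title, suffix)
        updated[key] = url
        # Place the local path directly after the URL it belongs to.
        updated[key + "_local_path"] = dir_path + file_name
    # Replacing an element does not change the list length, so this is safe
    # while enumerating.
    ad_image_list[i] = updated

print(list(ad_image_list[0]))
```

If the key order does not matter, assigning adImageList[i][new_key] = value inside the loop is enough.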

Python3: Split multiple variables dynamically

I'm trying to split multiple variables that were dynamically created off a for loop and then delete everything after the first space.
Minor back story: I'm using paramiko to SSH to a network switch to pull VLAN information. Trying to create a new variable for each VLAN name and then present all variables back into a list for the user to select from.
#VLANLine## were split from VLANList on \r\n. Variables created from a for loop
VLANLine1 = 'GGGGGGGGG 5 5/7'
VLANLine2 = 'HHHH 66 22/23'
VLANLine3 = 'SSSSSSS 33 3/4'
#HHHH and SSSSSS are random names I put in place for this question. This is the data I need to keep.
#Length of VLANList = 14 in this demo
i = 0
while i < len(VLANList):
    VLANLine[i].split(" ")
    del VLAN[i][1:]
Error below
Traceback (most recent call last):
File "<pyshell#16>", line 2, in <module>
VLANLine[i].split(" ")
IndexError: string index out of range
How can I dynamically split 'VLANLine##' and then delete out everything after the space? I may be going at this all wrong too. I just started working with python a few weeks ago.
This may work for you.
VLAN_clean = [v[0:v.find(' ')] for v in VLANList if v.find(' ') != -1]
str.split does what you need cleanly:
VLANList = [
'GGGGGGGGG 5 5/7',
'HHHH 66 22/23',
'SSSSSSS 33 3/4',
]
VLAN_Clean = [v.split()[0] for v in VLANList]
print(VLAN_Clean)
Output:
['GGGGGGGGG', 'HHHH', 'SSSSSSS']
split() with no arguments splits each string on runs of whitespace and returns a list of the pieces. If the string contains no whitespace, it simply returns a one-element list containing the entire string. So running split on each item and then selecting the first element of the resulting list gives you the right thing. Note that an empty string splits into an empty list, so [0] would raise an IndexError for blank lines.
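Rather than creating one variable per line, the names can be kept in a single list built straight from the raw output. A minimal sketch, assuming the switch output is one string with \r\n line endings (the sample text, including the blank line, is made up):

```python
# Hypothetical raw output from the switch, including a blank line.
raw_output = "GGGGGGGGG 5 5/7\r\nHHHH 66 22/23\r\n\r\nSSSSSSS 33 3/4\r\n"

# splitlines() handles \r\n and does not leave a trailing empty entry.
vlan_lines = raw_output.splitlines()

# Keep the text before the first run of whitespace; skip blank lines.
vlan_names = [line.split(None, 1)[0] for line in vlan_lines if line.strip()]
print(vlan_names)  # prints ['GGGGGGGGG', 'HHHH', 'SSSSSSS']
```

Passing a maxsplit of 1 stops after the first split, which is all that is needed when only the name is kept.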

Copy portion of list

How do I copy only a portion of a list to another list? For example, the list has 105 elements, but only 30 randomly selected elements need to be copied to a new list.
This is the code that I have written
for x in range (104):
    if len(trainingSet1)>30:
        break
    trainingSet1[x]= (trainingSet[random.randint(1,103)])
But it keeps giving this error:
Traceback (most recent call last):
File "Q1_2.py", line 82, in <module>
main()
File "Q1_2.py", line 72, in main
trainingSet1[x]= (trainingSet[random.randint(1,103)])
IndexError: list index out of range
The bug is probably here:
trainingSet1[x] = ...
Unless you have already populated trainingSet1, you're trying to assign to an element that doesn't exist yet. Use trainingSet1.append(...) instead.
Initialize trainingSet1 as trainingSet1 = [] and then append values to it instead of using trainingSet1[x] = value. If you really want to assign by index as in your code, first initialize the list as trainingSet1 = [0] * 30. This creates a list of 30 zeros, which are then replaced by your randomly selected values. In that case loop over range(30), because the length check in your loop will no longer stop it.
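If the 30 elements should be distinct, random.sample from the standard library does the selection in one call. Note that random.randint(1, 103) can return the same index more than once and never returns index 0 or 104. A minimal sketch, with a list of integers standing in for the real trainingSet:

```python
import random

# Hypothetical stand-in for the real data: 105 elements.
training_set = list(range(105))

# A seeded generator, only to make the sketch repeatable.
rng = random.Random(0)

# Pick 30 distinct elements without modifying the original list.
training_set1 = rng.sample(training_set, 30)
print(len(training_set1))  # prints 30
```

If repeated picks are acceptable, random.choices(training_set, k=30) samples with replacement instead.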

select and make new list with specific information

EDIT2: Nevermind this, someone pointed my error. Thanks
First of all, this is an example of the results I have:
(172, 'Nucleus')
(172, 'Nucleus')
(472, 'Cytoplasm')
(472, 'Cytoplasm')
(472, 'Nucleus')
What I'm trying to do is match on the first number (position 0) and then check whether the word contains part of "nucleus" (here, that would be "nuc"). It can happen that a given number only has entries containing "nucleus".
I'm trying to make two lists: the first would hold only the numbers whose entries all contain "nuc"; the second would hold the numbers that have "nuc" together with other things (like Cytoplasm in my example).
That is only a small part of my results.
I didn't have example code at first, because I had no clue how to include each value from my query only once in the list (in the example, I would otherwise enter the number 172 twice).
EDIT: Oops, I wrote that before adding the code I tried.
Right now, my code looks like this (number1 is how I got the example shown above):
def number1(self, position):
self.position = position
List = [self.name()]
for item in List:
for i in range(position, self.c.rowcount):
self.number(i)
def separate_list(self, list_signal):
nuc_list = []
not_nuc_list = []
for i in list_signal:
print(list_signal(i))
if list_signal(i)(0) == list_signal(i+1)(0):
if list_signal(i)(1) and list_signal(i+1)(1) == re.search("nuc"):
nuc_list.append(list_signal(i))
else:not_nuc_list.append(list_signal(i))
return nuc_list and not_nuc_list
dc = connection()
dc.separate_list(dc.number1(0))
error:
Traceback (most recent call last):
File "class vincent.py", line 91, in <module>
dc.separate_list(dc.number1(0))
File "class vincent.py", line 61, in separate_list
for i in list_signal:
TypeError: 'NoneType' object is not iterable
I know this is not pretty; I tried to do it the best way I can (I'm new to Python and to programming in general).
A few things. If you are trying to get the item at position 0 of a list, as you say, use list_name[0]; if you are using position for sorting, use a different method.
Are (172, 'Nucleus') ... (172, 'Nucleus') tuples or lists of their own? Either can be indexed with [0], and either can be unpacked into two variables to work with the data, as in number, cell_type = (172, 'Nucleus').
Also, at the moment dc.number1 doesn't return anything, so its result is None and can't be used as input to another function; that is where the TypeError comes from. Add a return of some sort, or change the input to whatever self.number is modifying.
You may want to make a list of all your results, e.g. [(172, 'Nucleus'), ..., (172, 'Nucleus')]; then you can iterate through it with
for number, cell_type in results_list:
    print(number, cell_type)
    #Should give you "172 Nucleus"
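For the two lists the question asks for, one possible approach is to group the cell types by number first and then classify each number. This is a sketch under the assumption that the results are available as a list of (number, cell_type) tuples like the sample in the question; the variable names are made up.

```python
from collections import defaultdict

# Sample results copied from the question.
results = [
    (172, 'Nucleus'),
    (172, 'Nucleus'),
    (472, 'Cytoplasm'),
    (472, 'Cytoplasm'),
    (472, 'Nucleus'),
]

# Collect the distinct cell types seen for each number.
types_by_number = defaultdict(set)
for number, cell_type in results:
    types_by_number[number].add(cell_type.lower())

# Numbers whose entries all contain "nuc".
nuc_only = [n for n, types in types_by_number.items()
            if all("nuc" in t for t in types)]

# Numbers that have "nuc" together with something else.
nuc_and_other = [n for n, types in types_by_number.items()
                 if any("nuc" in t for t in types)
                 and not all("nuc" in t for t in types)]

print(nuc_only)       # prints [172]
print(nuc_and_other)  # prints [472]
```

Using a set per number means each number appears once in the output even when the same row occurs twice in the results.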
