I'm using Python 2.7.5 and this expression is not working. As far as I remember, it works on my other computer, which also has Python 2.7:
train_data.ix[:,1:-1]
The error I get is:
AttributeError: 'list' object has no attribute 'ix'
If I use train_data[:,1:-1] then the error is:
TypeError: list indices must be integers, not tuple
How can I solve this?
Thanks!
Lists and NumPy arrays do not have an ix method:
In [8]: import numpy as np
In [10]: x = np.array([])
In [11]: x.ix
AttributeError: 'numpy.ndarray' object has no attribute 'ix'
But Pandas Series and DataFrames do have an ix method:
In [16]: import pandas as pd
In [17]: y = pd.Series([1,2])
In [18]: y.ix[0]
Out[18]: 1
In [19]: y.ix[1]
Out[19]: 2
If train_data is a Pandas DataFrame, then train_data.ix[:,1:-1]
selects all rows of the second through next-to-last columns: the : selects every row, and 1:-1 selects the columns from the second up to (but not including) the last.
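Note that .ix has since been deprecated (and later removed) in modern pandas; the positional equivalent is .iloc. A minimal sketch, with a made-up frame standing in for train_data:
import pandas as pd

# Hypothetical frame standing in for train_data
train_data = pd.DataFrame([[1, 10, 20, 99],
                           [2, 30, 40, 98]],
                          columns=['id', 'f1', 'f2', 'label'])

# All rows, second through next-to-last columns
print(train_data.iloc[:, 1:-1])  # columns f1 and f2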
The syntax for Python slice notation on a list is:
list[start:end:step]
A comma inside the square brackets creates a tuple index, which plain lists do not support; that is what produces the tuple error.
For example, if you use:
>>> list1 = [1, 2, 3, 4, 5, 6, 7]
>>> list1[:1:-1]
[7, 6, 5, 4, 3]
And if you want to reverse the whole list, you can use:
>>> list1[::-1]
[7, 6, 5, 4, 3, 2, 1]
The first error says that a list object doesn't have an attribute ix (and indeed it doesn't):
>>> [].ix
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'ix'
In the second piece of code you are trying to use a comma in the slice. Plain Python lists do not support that kind of multi-dimensional indexing. NumPy arrays (and Pandas objects) do, so you need to convert your list to one of them.
The direct answer to your question is that it is not working because your train_data is a list. Judging by the .ix property, the train_data in the code you are trying to understand was a Pandas DataFrame.
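So if your train_data really is a plain list, converting it is enough. A minimal sketch with made-up data:
import numpy as np
import pandas as pd

train_data = [[1, 10, 20, 99],
              [2, 30, 40, 98]]  # made-up rows for illustration

arr = np.array(train_data)
print(arr[:, 1:-1])        # comma slicing works on an ndarray

df = pd.DataFrame(train_data)
print(df.iloc[:, 1:-1])    # positional slicing on a DataFrame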
Related
I have a .csv file with a column whose cells contain lists of values. What I want to do is create a nested list with all the values in the following format, so I can iterate through them with another method:
[[1009, 1310], [9420, 9699], [11590, 12009], [12290, 12499], [14460, 14809]]
I tried to read it by simply converting the cell to a list:
df = pd.read_csv('example.csv', usecols=['anomaly_sequences'])
a = df.iloc[0]['anomaly_sequences']
print(a[0])
But the output I got is: [
If I check its type with print(a.dtype) I get:
AttributeError: 'str' object has no attribute 'dtype'
How can I read it directly as a list instead of a string?
You can use literal_eval from the standard ast library to take a string and evaluate it as Python code.
import ast
df['anomaly_sequences'] = df['anomaly_sequences'].apply(ast.literal_eval)
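The same conversion can also be applied while reading the file, via read_csv's converters argument; a minimal sketch using the file and column name from the question:
import ast
import pandas as pd

# converters applies the function to each cell of that column during parsing
df = pd.read_csv('example.csv',
                 converters={'anomaly_sequences': ast.literal_eval})

a = df.iloc[0]['anomaly_sequences']
print(a[0])  # now e.g. [1009, 1310] instead of the character '['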
Let's start off with a random (reproducible) data array -
# Setup
In [11]: np.random.seed(0)
...: a = np.random.randint(0,9,(7,2))
...: a[2] = a[0]
...: a[4] = a[1]
...: a[6] = a[1]
# Check values
In [12]: a
Out[12]:
array([[5, 0],
[3, 3],
[5, 0],
[5, 2],
[3, 3],
[6, 8],
[3, 3]])
# Check its itemsize
In [13]: a.dtype.itemsize
Out[13]: 8
Let's view each row as a scalar using a custom datatype that covers the two elements. We will use the void dtype for this purpose. As mentioned in the docs (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html#specifying-and-constructing-data-types and https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#arrays-interface) and in Stack Overflow Q&As, it seems that would be -
In [23]: np.dtype((np.void, 16)) # 8 is the itemsize, so 8x2=16
Out[23]: dtype('V16')
# Create new view of the input
In [14]: b = a.view('V16').ravel()
# Check new view array
In [15]: b
Out[15]:
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
dtype='|V16')
# Use pandas.factorize on the new view
In [16]: pd.factorize(b)
Out[16]:
(array([0, 1, 0, 0, 1, 2, 1]),
array(['\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
dtype=object))
Two things about factorize's output that I could not understand, hence the follow-up questions -
1. The fourth element of the first output (=0) looks wrong, because it has the same ID as the third element, yet in b the fourth and third elements are different. Why is that?
2. Why does the second output have an object dtype, while the dtype of b was V16? Is this also causing the wrong value mentioned in 1.?
A bigger question could be: does pandas.factorize cover custom datatypes? From the docs, I see:
values : sequence
    A 1-D sequence. Sequences that aren't pandas objects are coerced to ndarrays before factorization.
In the provided sample case, we have a NumPy array, so one would assume no issues with the input, unless the docs simply don't cover the custom datatype case?
System setup: Ubuntu 16.04, Python 2.7.12, NumPy 1.16.2, Pandas 0.24.2.
On Python 3.x
System setup: Ubuntu 16.04, Python 3.5.2, NumPy 1.16.2, Pandas 0.24.2.
Running the same setup, I get -
In [18]: b
Out[18]:
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'],
dtype='|V16')
In [19]: pd.factorize(b)
Out[19]:
(array([0, 1, 0, 2, 1, 3, 1]),
array([b'\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
b'\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
b'\x05\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00',
b'\x06\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00'],
dtype=object))
So, the first output of factorize looks alright here. But the second output has object dtype again, different from the input. So, the same question: why this dtype change?
Compiling the questions / tl;dr
With such a custom datatype:
Why the wrong labels and uniques, and a different uniques dtype, on Python 2.x?
Why a different uniques dtype on Python 3.x?
As for why V16 is coerced to object: many functions in pandas convert the data to one of the data types their internal functions can handle (there is a list of them in the pandas source). If the data type is not in that list, it becomes object, and it appears pandas does not convert the result back into the original dtype.
Regarding the discrepancy between Python 2 and Python 3: There's only one pandas codebase for both, so why do they give different results?
It turns out that Python 2 uses the string type (which is just an array of bytes) to represent your data¹, while Python 3 uses the bytes type. The effect is that Python 2 uses a StringHashTable for the factorization and Python 3 uses a PyObjectHashTable, and the StringHashTable gives incorrect results in your case. I believe this is because the strings in the StringHashTable are assumed to be zero-terminated, which is not the case for your strings; indeed, if you only compare the rows up to the first zero byte, the first and fourth rows look identical.
Conclusion: It's a bug, and we should probably file an issue for it.
¹ More detail: This call to ensure_object returns an array of strings in Python 2, but an array of bytes in Python 3 (as can be seen by the b prefix). Correspondingly, the hashtable chosen here is different.
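Until that is fixed, one workaround is to let NumPy do the factorizing: np.unique compares the raw bytes of the void view rather than hashing zero-terminated strings, so it is not affected. A minimal sketch reusing the setup from above (note the labels come out in sorted order, not order of first appearance as with factorize):
import numpy as np

np.random.seed(0)
a = np.random.randint(0, 9, (7, 2))
a[2] = a[0]; a[4] = a[1]; a[6] = a[1]

# Same void view of the rows as in the question
b = a.view('V16').ravel()

# return_inverse gives one integer ID per row
uniques, labels = np.unique(b, return_inverse=True)
print(labels)

# On NumPy >= 1.13 you can skip the view entirely:
# uniques, labels = np.unique(a, axis=0, return_inverse=True)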
OK, here are the preconditions I cannot change:
I have a dataframe with a single column
it has to be converted and summed in NumPy
It looks like this, and it starts from an arbitrary index (I don't think I need to re-index it, to save on computational overhead):
3 1.32745e+06
4 0
5 6.07657e+08
6 NaN
The following does not sum it but returns nan. What am I doing wrong?
np_value = np_value.values
print(np.nansum(np_value))
Please provide more information on what your np_value is, because I believe that is where you are going wrong. I tried this and got the correct answer of 5.
import numpy as np
import pandas as pd
#create numpy array of values
np_values = np.array([1,0,4,np.nan])
#put those values in a dataframe to test
np_values = pd.DataFrame(data=np_values)
#Take just the values of that data
np_value = np_values.values
print(np.nansum(np_value))
np.nansum can't operate on object arrays or string arrays:
>>> import numpy as np
>>> arr = np.array([1.32745e+06, 0, 6.07657e+08, 'NaN'], dtype=object)
>>> np.nansum(arr)
TypeError: unsupported operand type(s) for +: 'float' and 'str'
>>> arr = np.array([1.32745e+06, 0, 6.07657e+08, 'NaN'])
>>> np.nansum(arr)
TypeError: cannot perform reduce with flexible type
You need to cast it to a numeric type (e.g. float) to make it work:
>>> np.nansum(arr.astype(float))
608984450.0
Note: It's pretty obvious in this case that it's an object or string array, because the 0 would display as 0.0 in a float array. Be careful with object arrays; they are slow and often unsupported.
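Since the values come from a single-column DataFrame, another option is to coerce the column to numeric on the pandas side before handing it to NumPy; a minimal sketch with made-up values matching the question's display:
import numpy as np
import pandas as pd

df = pd.DataFrame({'v': ['1.32745e+06', '0', '6.07657e+08', 'NaN']})

# errors='coerce' turns anything unparseable into NaN
np_value = pd.to_numeric(df['v'], errors='coerce').values
print(np.nansum(np_value))  # 608984450.0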
I am currently learning Python. I am using Python 3.5.
Basically I have an Excel sheet with a fixed number of columns and rows (that contain data), and I want to save those values in a 2D list using append.
So far I have only saved the data in a 1D list. This is my code:
import openpyxl
Values = [[]]
MaxColumn = sheet.max_column
MaxRow = sheet.max_row
for y in range(10, MaxRow):  # Iterate over each row.
    for x in range(1, MaxColumn):  # Iterate over each column.
        Values.append(sheet.cell(row=y, column=x).value)
        # I have also tried with the following:
        Values[y].append(sheet.cell(row=y, column=x).value)
Traceback (most recent call last):
  File "<pyshell#83>", line 4, in <module>
    Values[y].append(sheet.cell(row=y,column=x).value)
AttributeError: 'int' object has no attribute 'append'
for x in range(1, MaxColumn):
    # print(sheet.cell(row=y,column=x).value)
    Values.append(sheet.cell(row=y, column=x).value)
You must have some code that redefines the Values object, but in any case you can just do list(sheet.values).
Try the following:
# Define a list
list_2d = []
# Loop over all rows in the sheet
for row in ws.rows:
    # Append a list of this row's cell values to list_2d
    list_2d.append([cell.value for cell in row])
print(list_2d)
Your traceback error:
Values[y].append(
AttributeError: 'int' object has no attribute 'append'
Values[y] is not a list object; it is the int value at index y in your list, because your earlier appends put plain cell values (not lists) into Values.
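If you only want the block starting at row 10, as in the question, iter_rows can restrict the range directly; a minimal sketch, assuming a hypothetical data.xlsx and openpyxl 2.6+ for values_only:
import openpyxl

wb = openpyxl.load_workbook('data.xlsx')  # hypothetical file name
sheet = wb.active

# values_only=True yields the cell values directly
values = [list(row)
          for row in sheet.iter_rows(min_row=10, max_row=sheet.max_row,
                                     min_col=1, max_col=sheet.max_column,
                                     values_only=True)]
print(values)  # one inner list per row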
I have a numpy.ndarray whose columns I'd like to access. I will be taking all columns after the 8th and testing them for variance, removing a column if the variance/average is low. To do this, I need access to the columns, preferably with NumPy. With my current methods, I encounter errors or fail to transpose.
To mine these arrays, I am using the IOPro adapter, which gives a regular numpy.ndarray.
import iopro
import sys
adapter = iopro.text_adapter(sys.argv[1], parser='csv')
all_data = adapter[:]
z_matrix = adapter[range(8,len(all_data[0]))][1:3]
print type(z_matrix) #check type
print z_matrix # print array
print z_matrix.transpose() # attempt transpose (fails)
print z_matrix[:,0] # attempt access by column (fails)
Can someone explain what is happening?
The output is this:
<type 'numpy.ndarray'>
[ (18.712, 64.903, -10.205, -1.346, 0.319, -0.654, 1.52398, 114.495, -75.2488, 1.52184, 111.31, 175.408, 1.52256, 111.699, -128.141, 1.49227, 111.985, -138.173)
 (17.679, 48.015, -3.152, 0.848, 1.239, -0.3, 1.52975, 113.963, -50.0622, 1.52708, 112.335, -57.4621, 1.52603, 111.685, -161.098, 1.49204, 113.406, -66.5854)]
[ (18.712, 64.903, -10.205, -1.346, 0.319, -0.654, 1.52398, 114.495, -75.2488, 1.52184, 111.31, 175.408, 1.52256, 111.699, -128.141, 1.49227, 111.985, -138.173)
 (17.679, 48.015, -3.152, 0.848, 1.239, -0.3, 1.52975, 113.963, -50.0622, 1.52708, 112.335, -57.4621, 1.52603, 111.685, -161.098, 1.49204, 113.406, -66.5854)]
Traceback (most recent call last):
  File "z-matrix-filtering.py", line 11, in <module>
    print z_matrix[:,0]
IndexError: too many indices
What is going wrong? Is there a better way to access the columns? I will be reading all lines of a file, testing all columns from the 8th for significant variance, removing any columns that don't vary significantly, and then reprinting the result as a new CSV.
EDIT:
Based on responses, I have created the following very ugly and I think inane approach.
all_data = adapter[:]
z_matrix = []
for line in all_data:
    to_append = []
    for column in range(8, len(all_data.dtype)):
        to_append.append(line[column].astype(np.float16))
    z_matrix.append(to_append)
z_matrix = np.array(z_matrix)
The reason the columns must be accessed directly is that there is a string inside the data. If this string is not circumvented in some way, an error is thrown about setting a void-array with object members using a buffer.
Is there a better solution? This seems terrible, and it seems it will be inefficient for several gigabytes of data.
Notice that the output of print z_matrix has the form
[ (18.712, 64.903, ..., -138.173)
(17.679, 48.015, ..., -66.5854)]
That is, it is printed as a list of tuples. That is the output you get when the array is a "structured array". It is a one-dimensional array of structures. Each "element" in the array has 18 fields. The error occurs because you are trying to index a 1-D array as if it were 2-D; z_matrix[:,0] won't work.
Print the data type of the array to see the details. E.g.
print z_matrix.dtype
That should show the names of the fields and their individual data types.
You can get one of the elements as, for example, z_matrix[k] (where k is an integer), or you can access a "column" (really a field of the structured array) as z_matrix['name'] (change 'name' to one of the fields in the dtype).
If the fields all have the same data type (which looks like the case here; each field has type np.float64), you can create a 2-D view of the data by reshaping the result of the view method. For example:
z_2d = z_matrix.view(np.float64).reshape(-1, len(z_matrix.dtype.names))
Another way to get the data by column number rather than name is:
col = 8 # The column number (zero-based).
col_data = z_matrix[z_matrix.dtype.names[col]]
For more about structured arrays, see http://docs.scipy.org/doc/numpy/user/basics.rec.html.
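Putting this together with the original goal (testing the later columns for variance and dropping the flat ones), here is a minimal sketch on a made-up structured array; the field names, data, and threshold are all assumptions:
import numpy as np

# Made-up structured array standing in for adapter[:]: one string field,
# then float fields
data = np.array([(b'r1', 1.0, 5.0, 3.0),
                 (b'r2', 1.0, 7.0, 9.0)],
                dtype=[('label', 'S2'), ('c0', 'f8'), ('c1', 'f8'), ('c2', 'f8')])

names = data.dtype.names[1:]                  # skip the string field
cols = np.vstack([data[n] for n in names]).T  # rows x columns float array

ratio = cols.var(axis=0) / cols.mean(axis=0)  # variance/average per column
keep = [n for n, r in zip(names, ratio) if abs(r) > 0.01]  # assumed threshold
print(keep)  # e.g. ['c1', 'c2']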
The display of z_matrix is consistent with it being shape (2,), a 1d array of tuples.
np.array([np.array(a) for a in z_matrix])
produces a (2,18) 2d array. You should be able to do your column tests on that.
It is very easy to access the columns of a plain NumPy array. Here's a simple example which may be helpful:
>>> import numpy as np
>>> A = np.array([[1, 2, 3], [4, 5, 6]])
>>> A
array([[1, 2, 3],
       [4, 5, 6]])
>>> A.T  # the transpose
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.mean(A.T, axis=1)  # column-wise mean of A
array([ 2.5,  3.5,  4.5])
I hope this helps you perform your transpose and column-wise operations.