I have a pandas DataFrame with very sparse columns. I would like to iterate over the DataFrame's values but without the missing ones, to save time.
I can't find how to access the indexes of the non-empty cells.
For example:
a = pd.Series([2, 3, 0, 0, 4], dtype='Sparse[int]')
print(a.sparse.sp_values) # --> [2,3,4]
print(a.sparse.sp_index) # --> AttributeError
print(a.sparse.to_coo()) # --> ValueError
I got the non-empty values, but where is the index? In the above example I am looking for [0,1,4].
I looked at the documentation which doesn't seem to mention it. I found information only for SparseArray but not for a Series/DataFrame of sparse type.
Printing dir(a.sparse) (without those starting with '_'):
['density', 'fill_value', 'from_coo', 'npoints', 'sp_values', 'to_coo', 'to_dense']
IIUC, use flatnonzero from numpy:
import numpy as np

idx = np.flatnonzero(a).tolist()
print(idx)
# [0, 1, 4]
Or use loc from pandas with boolean indexing:
idx = a.ne(0).loc[lambda s: s].index.tolist() # or list(a[a.ne(0)].index)
print(idx)
#[0, 1, 4]
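Both of the above densify the data. If you want to stay within the sparse representation, the backing SparseArray (reached via `.array`) carries the positions of the stored values on its `sp_index` attribute. Note this is an assumption about SparseArray internals in recent pandas; it is not part of the documented `.sparse` accessor, so treat it as a sketch:

```python
import pandas as pd

a = pd.Series([2, 3, 0, 0, 4], dtype='Sparse[int]')
# a.array is the underlying SparseArray; sp_index holds the
# positions of the non-fill values without densifying anything
idx = a.array.sp_index.indices
print(list(idx))  # [0, 1, 4]
```

This pairs naturally with `a.array.sp_values` if you want (position, value) pairs without ever materializing the dense array.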
Related
My data frame has six columns of float data and around 80,000 rows. One of the column is "Current" and sometimes it is negative. I wanted to find index locations when the "Current" value is negative. My code is given below:
current_index = my_df[my_df["Current"] < 0].index.tolist()
print(current_index[:5])
This gives output as I wanted:
[0, 124, 251, 381, 512]
This is fine. Is it possible to write this code using the iloc method? I tried the following code, but it gives an error. I am also wondering which of these is the best and fastest method?
current_index = my_df.iloc[(my_df["Current"]<0)]
The output is:
NotImplementedError: iLocation based boolean indexing on an integer type is not available
With iloc you need to use a Boolean array rather than a Boolean series. For this, you can use pd.Series.values. Here's a demo:
df = pd.DataFrame({'Current': [1, 3, -4, 9, -3, 1, -2]})
res = df.iloc[df['Current'].lt(0).values].index
# Int64Index([2, 4, 6], dtype='int64')
Incidentally, loc works with either an array or a series.
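To make the loc/iloc distinction concrete, here is a minimal sketch: `loc` accepts the Boolean Series directly, while `iloc` needs the underlying Boolean array (`.values`):

```python
import pandas as pd

df = pd.DataFrame({'Current': [1, 3, -4, 9, -3, 1, -2]})
mask = df['Current'].lt(0)              # Boolean Series

rows_loc = df.loc[mask].index           # loc accepts the Series as-is
rows_iloc = df.iloc[mask.values].index  # iloc needs the raw NumPy Boolean array

print(rows_loc.tolist())   # [2, 4, 6]
print(rows_iloc.tolist())  # [2, 4, 6]
```

Both give the same rows; the only difference is the type of indexer each accessor accepts.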
You can simply use the following (note that .ix has been deprecated and removed in modern pandas, so use .loc instead):
my_df.loc[my_df['Current'] < 0].index.values
I have data sorted the way I want.
I'm about to put in something like:
series_data = []
for count, x in enumerate(df.index):
    series_data.append(list(range(count + 1)))
df['up_to_row'] = series_data
so the column would be:
df['up_to_row'] = Series([0], [0,1], [0,1,2], [0,1,2,3]...)
Then I need to translate these iloc locations to indexes.
Is there a more pandas-specific way to do this?
I would do this with pandas operators, but I'm not sure how to get the current iloc (which is why the enumerate was needed).
Edit*
Found the final answer using some of the tools from @wen. Thank you.
data['ind']=data.index.astype(str)+','
data['cumsum_indexes']=data['ind'].cumsum().str[:-1].str.split(',')
Using cumsum. Notice it will convert the numbers in each list to str, not int anymore.
df['up_to_row']=np.arange(len(df))
(df['up_to_row'].astype(str)+',').cumsum().str[:-1].str.split(',')
Out[211]:
0 [0]
1 [0, 1]
2 [0, 1, 2]
3 [0, 1, 2, 3]
Name: up_to_row, dtype: object
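Since the cumsum trick leaves string elements, a plain list comprehension is a simple alternative that keeps the values as real ints (a sketch; the column name matches the question, the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30, 40]})
# one list per row, holding positional indices 0..i as actual ints
df['up_to_row'] = [list(range(i + 1)) for i in range(len(df))]
print(df['up_to_row'].tolist())
# [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
```

This builds the lists in pure Python, so it avoids the str round-trip entirely at the cost of a Python-level loop.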
I'm reading in a column from a dataframe named 'OneHot'. Each row of this column has a value of either [1,0] or [0,1]. I am trying to store these values into a variable so I can use it in a neural network.
Problem:
When I read the values into a variable, it stores as (792824, 1) instead of (792824, 2). 792824 is the number of rows in the dataframe. I have tried reshape and that did not work.
Here is the code I have:
input_matrix = np.matrix(df['VectorTweet'].values.tolist())
In [157]:
input_matrix = np.transpose(input_matrix)
x_inputs = input_matrix.shape
print x_inputs
(792824, 1)
In [160]:
output_matrix = np.matrix(df['OneHot'].values.tolist())
y_outputs = np.transpose(output_matrix)
print y_outputs.shape
(792824, 1)
print y_outputs[1]
[['[1, 0]']]
Attached is a snippet of my dataframe.
Looks like each entry in OneHot is a string representation of a list. That's why you're only getting one column in your transpose - you've made a single-element list of a string of a list of integers. You can convert strings of lists to actual lists with ast.literal_eval():
# OneHot as string of list of ints
strOneHot = pd.Series(['[0,1]','[1,0]'])
print(strOneHot.values)
# ['[0,1]' '[1,0]']
import ast
print(strOneHot.apply(ast.literal_eval).values)
# [[0, 1] [1, 0]]
FWIW, you can take the transpose of a Pandas series with .T, if that's useful here:
strOneHot.apply(ast.literal_eval).T
Output:
0 [0, 1]
1 [1, 0]
dtype: object
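Putting it together, the full round-trip from the string column to a proper (n, 2) array could look like this (a sketch; the sample strings stand in for the question's OneHot column):

```python
import ast
import numpy as np
import pandas as pd

# strings that look like lists, as stored in the question's OneHot column
s = pd.Series(['[1, 0]', '[0, 1]', '[1, 0]'])

# parse each string into a real list, then stack into a 2-D array
y = np.array(s.apply(ast.literal_eval).tolist())
print(y.shape)  # (3, 2)
```

With the real column this yields shape (792824, 2), which is what the neural network expects.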
I know this is a relatively common topic on stackoverflow but I couldn't find the answer I was looking for. Basically, I am trying to make very efficient code (I have rather large data sets) to get certain columns of data from a matrix. Below is what I have so far. It gives me this error: could not broadcast input array from shape (2947,1) into shape (2947)
def get_data(self, colHeaders):
    temp = np.zeros((self.matrix_data.shape[0], len(colHeaders)))
    for col in colHeaders:
        index = self.header2matrix[col]
        temp[:, index:] = self.matrix_data[:, index]
    data = np.matrix(temp)
    return temp
Maybe this simple example will help:
In [70]: data=np.arange(12).reshape(3,4)
In [71]: header={'a':0,'b':1,'c':2}
In [72]: col=['c','a']
In [73]: index=[header[i] for i in col]
In [74]: index
Out[74]: [2, 0]
In [75]: data[:,index]
Out[75]:
array([[ 2, 0],
[ 6, 4],
[10, 8]])
data is some sort of 2D array, header is a dictionary mapping names to column numbers. Using the input col, I construct a column index list. You can select all columns at once, rather than one by one.
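Applied to the original method, the loop could collapse to a single fancy-index operation. A standalone sketch (function and argument names are illustrative, not from the original class):

```python
import numpy as np

def get_columns(matrix_data, header2matrix, colHeaders):
    # build the whole column-index list first, then select all
    # requested columns in one fancy-indexing step
    index = [header2matrix[c] for c in colHeaders]
    return matrix_data[:, index]

data = np.arange(12).reshape(3, 4)
header = {'a': 0, 'b': 1, 'c': 2}
out = get_columns(data, header, ['c', 'a'])
print(out)  # [[ 2  0] [ 6  4] [10  8]]
```

Fancy indexing returns a copy with the columns in the requested order, which also sidesteps the broadcast error from the per-column assignment.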
I'd like to filter a NumPy 2-d array by checking whether another array contains a column value. How can I do that?
import numpy as np
ar = np.array([[1,2],[3,-5],[6,-15],[10,7]])
another_ar = np.array([1,6])
new_ar = ar[ar[:,0] in another_ar]
print new_ar
I hope to get [[1,2],[6,-15]] but above code prints just [1,2].
You can use np.where, but note that since ar[:,0] is an array of the first elements of ar, you need to loop over it and check for membership:
>>> ar[np.where([i in another_ar for i in ar[:,0]])]
array([[ 1, 2],
[ 6, -15]])
Instead of using in, you can use np.in1d to check which values in the first column of ar are also in another_ar and then use the boolean index returned to fetch the rows of ar:
>>> ar[np.in1d(ar[:,0], another_ar)]
array([[ 1, 2],
[ 6, -15]])
This is likely to be much faster than using any kind of for loop and testing membership with in.
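In recent NumPy (1.13+), np.isin is the preferred spelling of the same membership test and works identically here:

```python
import numpy as np

ar = np.array([[1, 2], [3, -5], [6, -15], [10, 7]])
another_ar = np.array([1, 6])

# np.isin returns a Boolean mask over ar[:, 0]; use it to select rows
new_ar = ar[np.isin(ar[:, 0], another_ar)]
print(new_ar)  # [[  1   2] [  6 -15]]
```

np.isin also generalizes to multi-dimensional test arrays, whereas np.in1d always flattens its input.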