pandas way to get list of indexes using iloc? - python

I have data sorted the way I want.
I'm about to put in something like:
series_data = []
for count, x in enumerate(df):
    series_data.append(list(range(count)))
    df['up_to_row'].iloc[count] = series_data
so the column would be:
df['up_to_row'] = Series([0], [0,1], [0,1,2], [0,1,2,3]...)
Then I need to translate this iloc locations to indexes.
Is there a more pandas specific way to do this?
I would do this with pandas operators, but I'm not sure how to get the current iloc position (which is why the enumerate was needed).
Edit*
Found the final answer using some of the tools from @Wen. Thank you.
data['ind'] = data.index.astype(str) + ','
data['cumsum_indexes'] = data['ind'].cumsum().str[:-1].str.split(',')

Using cumsum. Notice it will convert the numbers in the lists to str, not int anymore.
df['up_to_row'] = np.arange(len(df))
(df['up_to_row'].astype(str) + ',').cumsum().str[:-1].str.split(',')
Out[211]:
0 [0]
1 [0, 1]
2 [0, 1, 2]
3 [0, 1, 2, 3]
Name: up_to_row, dtype: object
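If you want the entries to stay integers (the cumsum-on-strings trick above turns them into strings), a simple alternative is a plain list comprehension over positions; this is a sketch on a small hypothetical frame, not the original data:

```python
import pandas as pd

# Hypothetical small frame standing in for the sorted data
df = pd.DataFrame({'val': [10, 20, 30, 40]})

# Build [0], [0, 1], [0, 1, 2], ... as lists of ints, one per row
df['up_to_row'] = [list(range(i + 1)) for i in range(len(df))]

# Translate those positional locations into actual index labels
df['up_to_idx'] = [list(df.index[:i + 1]) for i in range(len(df))]

print(df['up_to_row'].tolist())  # [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
```

The second column answers the follow-up question directly: slicing `df.index` by position translates iloc locations into index labels.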

Related

Iterating over a pandas sparse series, without the missing values

I have a pandas DataFrame with very sparse columns. I would like to iterate over the DataFrame's values but without the missing ones, to save time.
I can't find how to access the indexes of the non-empty cells.
For example:
a = pd.Series([2, 3, 0, 0, 4], dtype='Sparse[int]')
print(a.sparse.sp_values) # --> [2,3,4]
print(a.sparse.sp_index) # --> AttributeError
print(a.sparse.to_coo()) # --> ValueError
I got the non-empty values, but where is the index? In the above example I am looking for [0,1,4].
I looked at the documentation which doesn't seem to mention it. I found information only for SparseArray but not for a Series/DataFrame of sparse type.
Printing dir(a.sparse) (without those starting with '_'):
['density', 'fill_value', 'from_coo', 'npoints', 'sp_values', 'to_coo', 'to_dense']
IIUC, use flatnonzero from numpy:
idx = np.flatnonzero(a).tolist()
print(idx)
# [0, 1, 4]
Or use loc from pandas with boolean indexing:
idx = a.ne(0).loc[lambda s: s].index.tolist() # or list(a[a.ne(0)].index)
print(idx)
#[0, 1, 4]
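One caveat with `flatnonzero(a)`: it assumes the "empty" value is 0. A slightly more general sketch compares against the series' own sparse fill value (the `fill_value` attribute appears in the `dir()` listing above):

```python
import numpy as np
import pandas as pd

a = pd.Series([2, 3, 0, 0, 4], dtype='Sparse[int]')

# Compare against the sparse fill value rather than hard-coding 0,
# so this also works when the "empty" value isn't zero
idx = np.flatnonzero(a.ne(a.sparse.fill_value)).tolist()
print(idx)  # [0, 1, 4]
```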

Adding ndarray into dataframe and then back to ndarray

I have an ndarray x.
I wanted to add this into an existing dataframe so that I could export it as a CSV, then use that CSV in a separate Python script, pull out the ndarray, and carry out some analysis, mainly so that I don't have one really long Python script.
To add it to a dataframe I've done the following:
data["StandardisedFeatures"] = x.tolist()
This looks OK to me. However, in my next script, when I try to pull the data out and put it back into an array, it doesn't appear the same: each entry is wrapped in single quotes and treated as a string:
data['StandardisedFeatures'].to_numpy()
I've tried astype(float) but it doesn't seem to work. Can anyone suggest a way to fix this?
Thanks.
If your list objects in a DataFrame have become strings while processing (this happens sometimes), you can use the eval or ast.literal_eval functions to convert back from string to list, and use map to apply it to every element.
Here is an example which will give you an idea of how to deal with this:
import pandas as pd
import numpy as np
dic = {"a": [1,2,3], "b":[4,5,6], "c": [[1,2,3], [4,5,6], [1,2,3]]}
df = pd.DataFrame(dic)
print("DataFrame:", df, sep="\n", end="\n\n")
print("Column of list to numpy:", df.c.to_numpy(), sep="\n", end="\n\n")
temp = df.c.astype(str).to_numpy()
print("Since your list objects have somehow become str objects while working with df:", temp, sep="\n", end="\n\n")
print("Magic for what you want:", np.array(list(map(eval, temp))), sep="\n", end="\n\n")
Output:
DataFrame:
a b c
0 1 4 [1, 2, 3]
1 2 5 [4, 5, 6]
2 3 6 [1, 2, 3]
Column of list to numpy:
[list([1, 2, 3]) list([4, 5, 6]) list([1, 2, 3])]
Since your list objects have somehow become str objects while working with df:
['[1, 2, 3]' '[4, 5, 6]' '[1, 2, 3]']
Magic for what you want:
[[1 2 3]
[4 5 6]
[1 2 3]]
Note: I have used eval in the example only because more people are familiar with it. You should prefer using ast.literal_eval instead whenever you need eval. This SO post nicely explains why you should do this.
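As a minimal sketch of the safer variant, here is the same string-to-list conversion done with ast.literal_eval instead of eval (the sample strings mirror the output above):

```python
import ast
import numpy as np

# Strings that used to be lists (as in the example above)
temp = ['[1, 2, 3]', '[4, 5, 6]', '[1, 2, 3]']

# literal_eval only parses Python literals, so unlike eval it
# cannot execute arbitrary code hidden in the strings
restored = np.array([ast.literal_eval(s) for s in temp])
print(restored.tolist())  # [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
```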
Perhaps an alternative and simpler way of solving this issue is to use numpy.save and numpy.load functions. Then you can save the array as a numpy array object and load it again in the next script directly as a numpy array:
import numpy as np
x = np.array([[1, 2], [3, 4]])
# Save the array in the working directory as "x.npy" (extension is automatically inserted)
np.save("x", x)
# Load "x.npy" as a numpy array
x_loaded = np.load("x.npy")
You can save objects of any type in a DataFrame.
You retain their type, but they will be classified as "object" in the pandas.DataFrame.info().
Example: save lists
df = pd.DataFrame(dict(my_list=[[1, 2, 3, 4], [1, 2, 3, 4]]))
print(type(df.loc[0, 'my_list']))
# Prints: <class 'list'>
This is useful if you use your objects directly with pandas.DataFrame.apply().

Is it possible to find index positions of certain rows in a big Dataframe (80000 rows, 6 columns) using iloc method?

My data frame has six columns of float data and around 80,000 rows. One of the column is "Current" and sometimes it is negative. I wanted to find index locations when the "Current" value is negative. My code is given below:
current_index = my_df[my_df["Current"] < 0].index.tolist()
print(current_index[:5])
This gives output as I wanted:
[0, 124, 251, 381, 512]
This is fine. Is it possible to write this code using the iloc method? I tried the following code but it gives an error. I am also wondering which of them is the best and fastest method?
current_index = my_df.iloc[(my_df["Current"]<0)]
The output is:
NotImplementedError: iLocation based boolean indexing on an integer type is not available
With iloc you need to use a Boolean array rather than a Boolean series. For this, you can use pd.Series.values. Here's a demo:
df = pd.DataFrame({'Current': [1, 3, -4, 9, -3, 1, -2]})
res = df.iloc[df['Current'].lt(0).values].index
# Int64Index([2, 4, 6], dtype='int64')
Incidentally, loc works with either an array or a series.
You can also use the following (note that .ix has been deprecated and removed from modern pandas; use .loc instead):
my_df.loc[my_df['Current'] < 0].index.values
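If what you actually need are the integer positions (the kind of locations iloc consumes), a small sketch using numpy's flatnonzero avoids boolean indexing entirely:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Current': [1, 3, -4, 9, -3, 1, -2]})

# flatnonzero returns the integer *positions* of the True entries,
# which is exactly what iloc-style indexing expects
pos = np.flatnonzero(df['Current'].lt(0).to_numpy()).tolist()
print(pos)  # [2, 4, 6]
```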

Matrix is printing wrong dimensions

I'm reading in a column from a dataframe named 'OneHot'. Each row of this column has a value of either [1,0] or [0,1]. I am trying to store these values into a variable so I can use it in a neural network.
Problem:
When I read in the values into a variable it stores as (792824, 1) instead of (792824, 2). 792824 is the amount of rows in the dataframe. I have tried reshape and that did not work.
Here is the code I have:
input_matrix = np.matrix(df['VectorTweet'].values.tolist())
input_matrix = np.transpose(input_matrix)
x_inputs = input_matrix.shape
print x_inputs
(792824, 1)
output_matrix = np.matrix(df['OneHot'].values.tolist())
y_outputs = np.transpose(output_matrix)
print y_outputs.shape
(792824, 1)
print y_outputs[1]
[['[1, 0]']]
attached is a snippet of my dataframe Example of my dataframe.
Looks like each entry in OneHot is a string representation of a list. That's why you're only getting one column in your transpose - you've made a single-element list of a string of a list of integers. You can convert strings of lists to actual lists with ast.literal_eval():
# OneHot as string of list of ints
strOneHot = pd.Series(['[0,1]','[1,0]'])
print(strOneHot.values)
# ['[0,1]' '[1,0]']
import ast
print(strOneHot.apply(ast.literal_eval).values)
# [[0, 1] [1, 0]]
FWIW, you can take the transpose of a Pandas series with .T, if that's useful here:
strOneHot.apply(ast.literal_eval).T
Output:
0 [0, 1]
1 [1, 0]
dtype: object
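Putting it together for the question's shape problem, a sketch of the full round trip (a small hypothetical Series standing in for the OneHot column) parses the strings and stacks them into a proper (n, 2) array:

```python
import ast
import numpy as np
import pandas as pd

# Stand-in for the OneHot column: strings that look like lists
one_hot = pd.Series(['[1, 0]', '[0, 1]', '[1, 0]'])

# Parse each string into a real list, then stack into an (n, 2) array
y = np.array(one_hot.apply(ast.literal_eval).tolist())
print(y.shape)  # (3, 2)
```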

finding the max of a column in an array

def maxvalues():
    for n in range(1, 15):
        dummy = []
        for k in range(len(MotionsAndMoorings)):
            dummy.append(MotionsAndMoorings[k][n])
        max(dummy)
        L = [x + [max(dummy)]]  # to be corrected (adding columns with value max(dummy))
        # suggest code to add new row to L and for next function call, it should save values here.
I have an array of size (k x n) and I need to pick the max values of the first column in that array. Please suggest if there is a simpler way other than what I tried. My main aim is to append it to L in columns rather than rows. If I just append, it adds values at the end. I would like this to be done in columns for row 0 in L, because I'll call this function again, add a new row to L, and do the same. Please suggest.
General suggestions for your code
First of all it's not very handy to access globals in a function. It works but it's not considered good style. So instead of using:
def maxvalues():
    do_something_with(MotionsAndMoorings)
you should do it with an argument:
def maxvalues(array):
    do_something_with(array)

MotionsAndMoorings = something
maxvalues(MotionsAndMoorings)  # pass it to the function
The next strange thing is that you seem to exclude the first column of your array:
for n in range(1,15):
I think that's unintended. The first element of a list has the index 0, not 1. So I guess you wanted to write:
for n in range(0,15):
or even better, for arbitrary lengths:
for n in range(len(array[0])):  # length of the first row, i.e. the number of columns
Alternatives to your iterations
But this would not be very intuitive, because the max function already accepts a very useful keyword argument (key), so you don't need to iterate over the whole array:
import operator
column = 2
max(array, key=operator.itemgetter(column))[column]
this will return the row whose i-th element is maximal (you just define your wanted column as this element). But max will return the whole row, so you need to extract just the i-th element.
So to get a list of all your maximums for each column you could do:
[max(array, key=operator.itemgetter(column))[column] for column in range(len(array[0]))]
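As a quick runnable sketch of that key-based approach on a small made-up array:

```python
import operator

array = [[1, 2, 3], [4, 5, 6], [0, 2, 10]]

# max() returns the whole row whose c-th element is largest,
# so the trailing [c] extracts just that element
col_maxes = [max(array, key=operator.itemgetter(c))[c]
             for c in range(len(array[0]))]
print(col_maxes)  # [4, 5, 10]
```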
For your L I'm not sure what it is, but you should probably also pass it as an argument to the function:
def maxvalues(array, L):  # another argument here
but since I don't know what x and L are supposed to be, I'll not go further into that. It looks like you want to turn the columns of MotionsAndMoorings into rows and the rows into columns. If so, you can do it with:
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))]
         for i in range(len(MotionsAndMoorings[0]))]
that's a list comprehension that converts a list like:
[[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
to an "inverted" column/row list:
[[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
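Spelled out as a runnable snippet with the same sample data:

```python
MotionsAndMoorings = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]

# Swap rows and columns with a nested list comprehension
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))]
         for i in range(len(MotionsAndMoorings[0]))]
print(dummy)  # [[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
```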
Alternative packages
But as roadrunner66 already said, sometimes it's easiest to use a library like numpy or pandas that already has very advanced and fast functions that do exactly what you want and are easy to use.
For example, you convert a Python list to a numpy array simply by:
import numpy as np
Motions_numpy = np.array(MotionsAndMoorings)
you get the maximum of the columns by using:
maximums_columns = np.max(Motions_numpy, axis=0)
you don't even need to convert it to a np.array to use np.max or to transpose it (turn rows into columns and columns into rows):
transposed = np.transpose(MotionsAndMoorings)
I hope this answer is not too unstructured. Some parts are suggestions for your function and some are alternatives. You should pick the parts that you need, and if you have any trouble with it, just leave a comment or ask another question. :-)
An example with a random input array, showing that you can take the max in either axis easily with one command.
import numpy as np

aa = np.random.random([4, 3])
print(aa)
print()
print(np.max(aa, axis=0))
print()
print(np.max(aa, axis=1))
Output:
[[ 0.51972266 0.35930957 0.60381998]
[ 0.34577217 0.27908173 0.52146593]
[ 0.12101346 0.52268843 0.41704152]
[ 0.24181773 0.40747905 0.14980534]]
[ 0.51972266 0.52268843 0.60381998]
[ 0.60381998 0.52146593 0.52268843 0.40747905]
