Function to reindex a pandas dataframe - python

I want to reindex a couple of pandas dataframes. I want the last index to move up to the first, and I aim to build a general function for it.
I can find the last index using len. But how do I then get the rest of the index, range(0, len(a)-1), printed out like 0, 1, 2, 3, etc.?
What would be a smart solution to this problem?
import pandas as pd

a = pd.DataFrame({'geo': ['Nation', 'geo1', 'geo2', 'geo3', 'geo4', 'geo5', 'geo6', 'county']})
print(a.geo)
# This is what I want to achieve in a general function
print(a.geo.reindex(index=[7, 0, 1, 2, 3, 4, 5, 6]))
# The last index can be located using len. But what about the rest?
print(a.geo.reindex(index=[len(a) - 1, 0, 1, 2, 3, 4, 5, 6]))
b = pd.DataFrame({'geo': ['Nation', 'geo1', 'geo2', 'geo3', 'geo4', 'county']})
print(b.geo)
print(b.geo.reindex(index=[5, 0, 1, 2, 3, 4]))

You can use np.roll to shift every element down by one, wrapping the last one around to the front:
import numpy as np

print(pd.DataFrame({'geo': np.roll(a['geo'], shift=1)}))
An alternative way is to do:
print(a.iloc[np.arange(-1, len(a)-1)])
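Putting the pieces together, a general function might look like this sketch (the name move_last_to_front is just illustrative):
import numpy as np
import pandas as pd

def move_last_to_front(df):
    # Take the last row first, then rows 0 .. n-2, and renumber the index
    return df.iloc[np.arange(-1, len(df) - 1)].reset_index(drop=True)

a = pd.DataFrame({'geo': ['Nation', 'geo1', 'geo2', 'geo3', 'geo4', 'geo5', 'geo6', 'county']})
print(move_last_to_front(a))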

For loop iteration missing values out?

I'm currently trying to create a column in a pandas dataframe holding a counter that runs up to the number of rows in the dataframe divided by 2. Here is my code so far:
# Fill the cycles column with however many rows exist / 2
for x in ((jac_output.index) / 2):
    jac_output.loc[x, 'Cycles'] = x + 1
However, I've noticed that it misses out values every so often, leaving gaps in the Cycles column.
Why would my counter miss a value every so often as it gets higher? And is there another way of optimizing this, as it seems to be quite slow?
You may have removed some data from the dataframe, so some indices are missing; therefore you should use reset_index to renumber them, or you can just use
for x in np.arange(0, len(jac_output.index), 1) / 2:
You can view jac_output.index as a list like [0, 1, 2, ...]. When you divide it by 2, it results in [0, 0.5, 1, ...]. 0.5 is surely not in your original index.
To slice the index into half, you can try:
jac_output.index[:len(jac_output.index)//2]
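If the goal is a cycle counter that increments once every two rows, the loop can be dropped entirely in favour of a vectorized assignment. A minimal sketch, assuming each cycle spans two consecutive rows (the sample frame is made up):
import numpy as np
import pandas as pd

jac_output = pd.DataFrame({'reading': [5, 7, 3, 8, 2, 9]})  # hypothetical stand-in data
# One cycle number per pair of rows: 1, 1, 2, 2, 3, 3
jac_output['Cycles'] = np.arange(len(jac_output)) // 2 + 1
print(jac_output)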

Selecting by index -1 in a df column / time series throws error

Let's assume we have a simple dataframe like this:
df = pd.DataFrame({'col1':[1,2,3], 'col2':[10,20,30]})
Then I can select elements like this
df.col2[0] or df.col2[1]
But if I want to select the last element with df.col2[-1] it results in the error message:
KeyError: -1
I know that there are workarounds to that. I could do for example df.col2[len(df)-1] or df.iloc[-1,1]. But why wouldn't the much simpler version of indexing directly by -1 be allowed? Am I maybe missing another simple selection way for -1? Thanks
The index labels of your DataFrame are [0, 1, 2]. Your code df.col2[1] is equivalent to using the loc indexer, as df['col2'].loc[1] (or df.col2.loc[1]). You can see that your index does not contain the label -1, which is why you get the KeyError.
For positional indexing you need to use the iloc indexer (which works on a pandas Series as well as a DataFrame), so you could do df['col2'].iloc[-1] (or df.col2.iloc[-1]).
As you can see, you can mix label-based ('col2') and position-based (-1) indexing; you don't need to choose one or the other, as in df.iloc[-1, 1] or df.col2[len(df)-1] (which would be equivalent to df.loc[len(df)-1, 'col2']).
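A quick runnable recap of the options, using the df from the question:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

print(df.col2.iloc[-1])          # position-based: last element of col2 -> 30
print(df.iloc[-1, 1])            # position-based row and column -> 30
print(df.col2.loc[len(df) - 1])  # label-based; works only because the labels run 0..n-1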

Select all rows in Python pandas

I have a function that aims at printing the sum along a column of a pandas DataFrame after filtering on some rows to be defined, and the percentage this quantity makes up of the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that doesn't actually filter anything (i.e. keeps all rows), so I can keep using my function (which is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 such that df[filter_f1] gives back df unchanged, and that could be combined with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object spanning the whole range, i.e. slice(None), acts as an object that selects all indexes in an indexable object; it is exactly what the bare : shorthand creates. So df[slice(None)] would select all rows in the DataFrame. You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None)  # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
This is a way to select all rows:
df.iloc[range(0, len(df))]
So is this:
df[:]
But I haven't figured out a way to pass : as an argument.
There's an indexer called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
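If the "no-op" filter also has to combine with other filters via &, a slice won't do, but an all-True boolean mask will. A small sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'price': [100, 600, 700], 'name': ['Ann', 'Brian', 'Cara']})

filter_f1 = pd.Series(True, index=df.index)  # keeps every row
filter_f2 = df['price'] > 500

print(df[filter_f1])              # all rows
print(df[filter_f1 & filter_f2])  # combines cleanly with a real filter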

Vectorizing function on arrays in DataFrame column?

I want to use a function (scipy.signal.savgol_filter) on every element in a DataFrame column (every element of the column is an array). While looping seems a little unnecessary, I can't wrap my head around a vectorized solution.
I tried the obvious .apply method as well as just using the function on the column. Both show an error like "setting an array element with a sequence".
Example code with lists instead of arrays (but same results):
import pandas as pd
from scipy import signal
df = pd.DataFrame(data={'A': [[1,3,9], [7,2,3], [3,2,6,3], [2,3,4]]})
df['smooth'] = df.apply(signal.savgol_filter, args=(3, 0))
Alternatively:
df['smooth'] = signal.savgol_filter(df['A'], 3, 0)
Or:
df['smooth'] = signal.savgol_filter(df['A'].values, 3, 0)
None of those work, I think because the whole column is given to the function.
Is there a way to use the function on all the elements (= arrays) in the column at the same time, or do I have to loop over every row?
The problem is that your elements aren't all the same shape, so the column can't be treated as a single multidimensional array.
If you just want to apply that function to each row you need to select the column explicitly:
df['smooth'] = df['A'].apply(signal.savgol_filter, args=(3, 0))
This is not really a vectorized solution, though.
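Since .apply on an object column is itself a loop under the hood, an explicit list comprehension is an equally valid spelling; a runnable sketch:
import pandas as pd
from scipy import signal

df = pd.DataFrame(data={'A': [[1, 3, 9], [7, 2, 3], [3, 2, 6, 3], [2, 3, 4]]})
# Filter each ragged array on its own; window length 3, polynomial order 0
df['smooth'] = [signal.savgol_filter(a, 3, 0) for a in df['A']]
print(df)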
Edit:
It's worth adding that there is discussion over on the numpy issue tracker about the ambiguity of this error message.

When to reset index? loc vs iloc for gaps in index? Best practices?

I discovered a very subtle bug in my code. I frequently delete rows from a dataframe in my analysis. Because this will leave gaps in the index, I try to end all functions by resetting the index at the end with
df0 = df0.reset_index(drop=True)
Then I continue in the next function with
for row in xrange(df0.shape[0]):
    print df0.loc[row]
    print df0.iloc[row]
However, if I don't reset the index correctly, the first row might have an index label of 192, and the label 192 is not the same as row number 0. This leads to the problem that df0.loc[row] accesses the row with label 0, while df0.iloc[row] accesses the row at position 0, whose label is 192. This has caused a very strange bug, in that I try to update row 0 but the row labeled 192 gets updated instead, or vice versa.
But in reality I don't use the df0.loc[] or df0.iloc[] indexers at all, because they are too slow. My code is riddled with df0.get_value(...) and df0.set_value(...) calls, because they are the fastest way to access values.
And it seems that some functions access values by index label, while others use row numbers? I am confused. Can someone explain this to me? What are the best practices? Have I misunderstood something? Should I always reset_index() as often as I can? Or never do that?
EDIT: To recap: I manually merge some rows in functions, so there will be gaps in the indices. In other functions I iterate over each row and do calculations. However, if I have reset the index I get different calculation results than if I don't reset the index. Why? That is my problem.
.loc[] looks at index labels, which may or may not be integer-valued.
If your index is [0, 1, 3] (a non-sequential integer index), .loc[2] won't find anything, because there is no index label 2.
Similarly, if your index is ['a', 'b', 'c'] (a non-integer index), .loc[2] will come up empty.
.iloc[] looks at index positions, which will always be integer-valued.
If your index is [0, 1, 3], .iloc[2] will return the row corresponding to 3.
If your index is ['a', 'b', 'c'], .iloc[2] will return the row corresponding to 'c'.
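A tiny sketch making the distinction concrete, with a deliberately gapped, out-of-order integer index (made-up data):
import pandas as pd

df0 = pd.DataFrame({'val': [10, 20, 30]}, index=[192, 0, 1])

print(df0.loc[0])   # label lookup: the row labeled 0 (val 20)
print(df0.iloc[0])  # position lookup: the first row, labeled 192 (val 10)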
That's not a bug, that's just how those indexers are designed. Whether one fits your purpose depends on the structure of your data and what you're trying to accomplish. It's hard to make a recommendation without knowing more.
That said, it does sound like your code is getting kind of thorny. Having to perform reset_index() in a bunch of different places and keep constant track of which row you're trying to update suggest that you may not be taking advantage of Pandas' ability to perform vector-based calculations across many rows and columns at once. Maybe the task you want to accomplish makes this inevitable. But it's worth taking some time to consider whether you can't vectorize some of what you're doing, so that you can apply it to the whole dataframe or a subset of the dataframe, rather than operating on individual cells one at a time.
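For instance, instead of filling a column cell by cell with set_value, a single vectorized expression covers every row at once, gaps in the index or not. A sketch with hypothetical columns a and b:
import pandas as pd

df0 = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]}, index=[192, 0, 7])

# Whole-column arithmetic aligns on the index; no row loop needed
df0['total'] = df0['a'] + df0['b']
# Conditional update of a subset of rows in one statement
df0.loc[df0['a'] > 1, 'flag'] = True
print(df0)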
