Shrink pandas Df by deleting rows through modulo - python

I need to reduce the data, for example by selecting only every multiple of 4 of the index.
I have a ~2M-row dataframe and I want less data for a future plot, so the idea is to work with 1/4 of the data, leaving only the rows with index 4, 8, 12, 16, 20, ... 4*n (or maybe the same but with 5*n).
If someone has any idea I will be grateful.

You can use the iloc indexer, which takes a row/column slice.
From the docs
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the
axis), but may also be used with a boolean array.
So you could write df.iloc[::4, :]
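For example, on a made-up single-column dataframe (the real data isn't shown, so the column name and size here are just placeholders), keeping every 4th row looks like this:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the large dataframe from the question.
df = pd.DataFrame({"signal": np.random.rand(2_000_000)})

# Keep every 4th row (positions 0, 4, 8, ...), i.e. roughly 1/4 of the data.
reduced = df.iloc[::4]
# Use a step of 5 instead to keep every 5th row.
reduced_5 = df.iloc[::5]

print(len(df), len(reduced), len(reduced_5))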

Related

Repeat calculations for every row of dataframe

The following calculations were for the 1st row, i.e., train_df.y1[0].
I want to repeat this operation for all 400 rows of train_df
squared_deviations_y1_0_train = ((ideal_df.loc[:0,"y1":"y50"] - train_df.y1[0]) ** 2).sum(axis=1)
The result is correct, just need to repeat it.
Since your end result seems to be a scalar, you can convert both of these dataframes to Numpy and take advantage of broadcasting.
Something like this,
squared_deviations = ((ideal_df.to_numpy() - train_df.y1.to_numpy().reshape(-1,1)) ** 2).sum(axis=1)
would do pretty nicely. If you MUST stay within pandas, you could use the subtract() method to get the same outcome.
(train_df.y1.subtract(ideal_df.T) ** 2).sum(axis=0)
Note that train_df.y1 becomes a vector of size (400,), so you need to make the row dimension 400 to do this subtraction (hence the transpose of ideal_df).
You can also use the apply() method as Barmar suggested. This will require you to define a function that calculates the row index so that you can subtract the appropriate value of train_df for every cell before you perform the square and sum operations. Something like this,
(ideal_df.apply(lambda cell: cell - train_df.y1[cell.index]) ** 2).sum(axis=1)
would also work. I highly recommend using Numpy for these tasks because Numpy was designed with broadcasting in mind, but as shown you can get away with doing it in Pandas.
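A minimal self-contained sketch of the broadcasting approach, assuming (as the question suggests) that ideal_df has 400 rows and 50 columns y1..y50 and that train_df.y1 has 400 values:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ideal_df = pd.DataFrame(rng.random((400, 50)), columns=[f"y{i}" for i in range(1, 51)])
train_df = pd.DataFrame({"y1": rng.random(400)})

# (400, 50) minus (400, 1): each train value is broadcast across its row.
diff = ideal_df.to_numpy() - train_df.y1.to_numpy().reshape(-1, 1)
squared_deviations = (diff ** 2).sum(axis=1)  # one value per row, shape (400,)
print(squared_deviations.shape)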

What is quicker, (=.at), (=.loc), (.drop), or (.append) to filter a large dataframe?

I want to sort through a Dataframe of about 400k rows, with 4 columns, taking out roughly half of them with an if statement:
for a in range (0, howmanytimestorunthrough):
if ('Primary' not in DataFrameexample[a]):
#take out row
So far I've been testing either one of the 4 below:
newdf.append(emptyline,)
nefdf.at[b,'column1'] = DataFrameexample.at[a,'column1']
nefdf.at[b,'column2'] = DataFrameexample.at[a,'column2']
nefdf.at[b,'column3'] = DataFrameexample.at[a,'column3']
nefdf.at[b,'column4'] = DataFrameexample.at[a,'column4']
b = b + 1
or the same with .loc
newdf.append(emptyline,)
nefdf.loc[b,:] = DataFrameexample.loc[a,:]
b = b + 1
or changing the if (not in) to an if (in) and using:
DataFrameexample = DataFrameexample.drop([k])
or trying to set emptyline to have values, and then append it:
notemptyline = pd.Series(DataFrameexample.loc[a,:].values, index = ['column1', 'column2', ...)
newdf.append(notemptyline, ignore_index=True)
So from what I've managed to test so far, they all seem to work OK on a small number of rows (2000), but once I start getting a lot more rows they take exponentially longer. .at seems slightly faster than .loc even if I need to run it 4 times, but it still gets slow (10 times the rows takes more than 10 times as long). .drop I think tries to copy the dataframe each time, so it really doesn't work? I can't seem to get .append(notemptyline) to work properly; it just replaces index 0 over and over again.
I know there must be an efficient way of doing this, I just can't seem to quite get there. Any help?
Your speed problem has nothing to do with .loc vs .at vs ... (for a comparison between .loc and .at have a look at this question) but comes from explicitly looping over every row of your dataframe. Pandas is all about vectorising your operations.
You want to filter your dataframe based on a comparison. You can transform that to a boolean indexer.
indexer = df!='Primary'
This will give you an n-rows-by-4-columns dataframe of boolean values. Now you want to reduce it to a single boolean per row, which is True only if all values in the row (axis 1) are True.
indexer = indexer.all(axis=1)
Now we can use .loc to get only the rows where indexer is True:
df = df.loc[indexer]
This will be much faster than iterating over the rows.
EDIT:
To check whether a df entry contains a substring, you can replace the first line with:
indexer = df.apply(lambda x: x.str.contains('Primary'))
Note that you normally don't want to use an apply statement (internally it uses a for loop for custom functions) to iterate over a lot of elements. In this case we are looping over the columns which is fine if you just have a couple of those.
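Putting the pieces together on a toy dataframe (the column names here are made up, and whether you keep or drop the matching rows depends on which half you want):
import pandas as pd

df = pd.DataFrame({
    "column1": ["Primary school", "Secondary", "Primary care", "Other"],
    "column2": ["a", "b", "c", "d"],
    "column3": ["x", "y", "z", "w"],
    "column4": [1, 2, 3, 4],
})

# True where a cell contains the substring 'Primary'.
contains = df.astype(str).apply(lambda col: col.str.contains("Primary"))

# Keep rows where at least one column matches ...
kept = df.loc[contains.any(axis=1)]
# ... or drop them instead by negating the mask.
dropped = df.loc[~contains.any(axis=1)]
print(kept)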

What is Pandas doing here that my indexes [0] and [1] refer to the same value?

I have a dataframe with these indices and values:
df[df.columns[0]]
1 example
2 example1
3 example2
When I access df[df.columns[0]][2], I get "example1". Makes sense. That's how indices work.
When I access df[df.columns[0]][0], however, I get "example", and I get "example" when I access df[df.columns[0]][1] as well. So for both
df[df.columns[0]][0]
df[df.columns[0]][1]
I get "example".
Strangely, I can delete "row" 0, and the result is that 1 is deleted:
gf = df.drop(df.index[[0]])
gf
exampleDF
2 example1
3 example2
But when I delete row 1, then
2 example1
is deleted, as opposed to example.
This is a bit confusing to me; are there inconsistent standards in Pandas regarding row indices, or am I missing something / made an error?
You are probably causing pandas to switch between .iloc (integer-position based) and .loc (label based) indexing.
All arrays in Python are 0-indexed, and I notice that the indexes in your DataFrame start from 1. So when you run df[df.columns[0]][0], pandas realizes that there is no index label 0 and falls back to positional (.iloc-style) indexing. Therefore it returns what it finds at the first position of the array, which is 'example'.
When you run df[df.columns[0]][1], however, pandas realizes that there is an index label 1 and uses .loc, which returns what it finds at that label, which again happens to be 'example'.
When you delete the first row, your DataFrame does not have index labels 0 and 1. So when you go to locate elements at those places in the way you are, it does not return None to you, but instead falls back on array based indexing and returns elements from the 0th and 1st places in the array.
To force pandas to use one of the two indexing techniques, use .iloc or .loc explicitly. .loc is label based and will raise a KeyError if you try df[df.columns[0]].loc[0]. .iloc is position based and will return 'example' when you try df[df.columns[0]].iloc[0].
Additional note
These commands are bad practice: df[col_label].iloc[row_index]; df[col_label].loc[row_label].
Please use df.loc[row_label, col_label]; or df.iloc[row_index, col_index]; or df.ix[row_label_or_index, col_label_or_index]
See Different Choices for Indexing for more information.
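A small sketch that reconstructs the dataframe from the question (one column, named exampleDF as in the output above) and shows the two behaviours side by side:
import pandas as pd

df = pd.DataFrame({"exampleDF": ["example", "example1", "example2"]}, index=[1, 2, 3])

print(df.iloc[0, 0])             # 'example'  -> position 0, the first row
print(df.loc[1, "exampleDF"])    # 'example'  -> label 1, which is also the first row
# df.loc[0, "exampleDF"]         # raises KeyError: there is no label 0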

Unexpected difference between loc and ix

I've noticed a strange difference between loc and ix when subsetting a DataFrame in Pandas.
import pandas as pd
# Create a dataframe
df = pd.DataFrame({'id':[10,9,5,6,8], 'x1':[10.0,12.3,13.4,11.9,7.6], 'x2':['a','a','b','c','c']})
df.set_index('id', inplace=True)
df
x1 x2
id
10 10.0 a
9 12.3 a
5 13.4 b
6 11.9 c
8 7.6 c
df.loc[[10, 9, 7]] # 7 does not exist in the index so a NaN row is returned
df.loc[[7]] # KeyError: 'None of [[7]] are in the [index]'
df.ix[[7]] # 7 does not exist in the index so a NaN row is returned
Why does df.loc[[7]] throw an error while df.ix[[7]] returns a row with NaN? Is this a bug? If not, why are loc and ix designed this way?
(Note I'm using Pandas 0.17.1 on Python 3.5.1)
As #shanmuga says, this is (at least for loc) the intended and documented behaviour, and not a bug.
The documentation on loc/selection by label, gives the rules on this (http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label ):
At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!
This means using loc with a single label (eg df.loc[[7]]) will raise an error if this label is not in the index, but when using it with a list of labels (eg df.loc[[7,8,9]]) will not raise an error if at least one of those labels is in the index.
For ix I am less sure, and this is not clearly documented I think. But in any case, ix is much more permissive and has a lot of edge cases (fallback to integer position etc), and is rather a rabbit hole. But in general, ix will always return a result indexed with the provided labels (so does not check if the labels are in the index as loc does), unless it falls back to integer position indexing.
In most cases it is advised to use loc/iloc
I think this behavior is intended, not a bug.
Although I couldn't find any official documentation, I found a comment by jreback on 21 Mar 2014 on an issue on GitHub indicating this.
ix can very subtly give wrong results (use an index of say even numbers)
you can use whatever function you want; ix is still there, but it doesn't provide the guarantees that loc provides, namely that it won't interpret a number as a location
As for why it is designed so
As mentioned in docs
.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
In my opinion, raising a KeyError would be ambiguous as to whether it came from the index or from the integer position. Instead, ix returns NaN when given a list.
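Note that this describes older pandas (the question uses 0.17.1). In current versions .ix has been removed and .loc raises a KeyError whenever any requested label is missing; if you want the "NaN row for missing labels" behaviour, .reindex is the usual way to get it:
import pandas as pd

df = pd.DataFrame({'id': [10, 9, 5, 6, 8],
                   'x1': [10.0, 12.3, 13.4, 11.9, 7.6],
                   'x2': ['a', 'a', 'b', 'c', 'c']}).set_index('id')

# Rows for labels 10 and 9, plus an all-NaN row for the missing label 7.
print(df.reindex([10, 9, 7]))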

Doing column math with numpy in python

I am looking for coding examples to learn Numpy.
Usage would be with dtype='object'.
To construct the array, the code used would be
a= np.asarray(d, dtype ='object')
rather than np.asarray(d) or np.asarray(d, dtype='float32').
Is sorting any different than with float32/64?
Coming from Excel "cell" equations, I'm wrapping my head around row/column math.
Ex:
A = array([['a',2,3,4],['b',5,6,2],['c',5,1,5]], dtype ='object')
[['a',2,3,4],
['b',5,6,2],
['c',5,1,5]])
Create a new array with:
How would I sort high to low by column [3]?
How do I calculate this for an entire column, e.g. (1,1) - (1,0)? Example (without sorting A):
['b',3],
['c',0]
How do I calculate this for the entire array, e.g. (1,1) - (2,0)? Example (without sorting A):
['b',2],
['c',-1]
Despite the fact that I still cannot understand exactly what you are asking, here is my best guess. Let's say you want to sort A by the values in the 3rd column:
A = np.array([['a',2,3,4],['b',5,6,2],['c',5,1,5]], dtype='object')
ii = np.argsort(A[:,2])
print(A[ii,:])
Here the rows have been sorted according to the 3rd column, but each row is left unsorted.
Subtracting all of the columns is a problem due to the string objects; however, if you exclude them, you can for example subtract the 3rd row from the 1st by:
A[0,1:] - A[2,1:]
If I didn't understand the basic point of your question, then please revise it. I highly recommend you take a look at the numpy tutorial and documentation if you have not done so already:
http://docs.scipy.org/doc/numpy/reference/
http://docs.scipy.org/doc/numpy/user/
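As a sketch building on that answer, here is how the "high to low by column [3]" sort and the column differences could look (interpreting the differences as row-wise differences down a column, which matches the expected output in the question):
import numpy as np

A = np.array([['a', 2, 3, 4], ['b', 5, 6, 2], ['c', 5, 1, 5]], dtype=object)

# Sort rows high-to-low by the last column; [::-1] reverses the ascending argsort.
order = np.argsort(A[:, 3])[::-1]
print(A[order, :])

# Differences down one numeric column: row 1 minus row 0, row 2 minus row 1, ...
print(np.diff(A[:, 1].astype(float)))   # -> [3. 0.], matching ['b', 3] and ['c', 0]

# The same for all numeric columns at once, skipping the string column.
print(np.diff(A[:, 1:].astype(float), axis=0))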
