I have a pandas DataFrame object that has 1000 rows and 10 columns. I would simply like to slice the DataFrame and take the first 10 rows. How can I do this? I've been trying to use this:
>>> df.shape
(1000,10)
>>> my_slice = df.ix[10,:]
>>> my_slice.shape
(10,)
Shouldn't my_slice be the first ten rows, i.e. a 10 x 10 DataFrame? How can I get the first ten rows, such that my_slice is a 10x10 DataFrame object? Thanks.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head
df2 = df.head(10)
should do the trick
As a convenience, you can also do:
df[:10]
There are various ways to do that. Below we will go through three options.
In order to keep the original dataframe df, we will assign the sliced dataframe to df_new.
At the end, in the section Time Comparison, we will show, using a random dataframe, the various execution times.
Option 1
df_new = df[:10] # Option 1.1
# or
df_new = df[0:10] # Option 1.2
Option 2
Using head
df_new = df.head(10)
For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n] [Source].
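For instance, a small illustration (using the same df as above):
df_new = df.head(-2)   # all rows except the last 2, i.e. equivalent to df[:-2]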
Option 3
Using iloc
df_new = df.iloc[:10] # Option 3.1
# or
df_new = df.iloc[0:10] # Option 3.2
Time Comparison
For this specific case, time.perf_counter() was used to measure the execution time (in seconds).
   method        time (s)
0  Option 1.1    0.00000120000913739204
1  Option 1.2    0.00000149995321407914
2  Option 2      0.00000170001294463873
3  Option 3.1    0.00000120000913739204
4  Option 3.2    0.00000350002665072680
As various variables might affect the execution time, these numbers may change depending on the dataframe used, the hardware, and more.
Notes:
One can replace the 10 in the previous operations with whatever number of rows is wanted. For example,
df_new = df[:5]
will return a dataframe with the first 5 rows.
There are additional ways to measure the time of execution; for those, read this: How do I get time of a Python program's execution? A minimal timing sketch is included after these notes.
One can also express the previous options as a lambda function passed to apply, such as the following
df_new = df.apply(lambda x: x[:10])
# or
df_new = df.apply(lambda x: x.head(10))
Note, however, that there are strong opinions on the usage of .apply() and, for this case, it is far from being a required method.
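For reference, this is roughly how such a measurement could be taken with time.perf_counter() (a minimal sketch only; the random dataframe and option label are stand-ins, not the exact benchmark code used above):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))   # random dataframe, similar to the one benchmarked

start = time.perf_counter()
df_new = df[:10]                               # Option 1.1
elapsed = time.perf_counter() - start
print(f'Option 1.1 took {elapsed:.8f} s')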
df.ix[10,:] gives you all the columns from the 10th row. In your case you want everything up to the 10th row which is df.ix[:9,:]. Note that the right end of the slice range is inclusive: http://pandas.sourceforge.net/gotchas.html#endpoints-are-inclusive
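For reference, a small sketch of the same endpoint difference with the newer .loc/.iloc indexers (assuming a dataframe with a default integer index; the variable names are just examples):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))
first_ten_by_label = df.loc[:9]       # label-based slice: endpoint 9 is included, so 10 rows
first_ten_by_position = df.iloc[:10]  # position-based slice: endpoint excluded, also 10 rows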
DataFrame[:n] will return the first n rows.
Related
I'm trying to split my data frame into chunks of 10 rows, find the aggregate functions (mean, SD, etc.) for each chunk of 10 rows, and then merge them into one data frame again. Previously I had grouped the data using the .groupby function, but I'm having trouble splitting the data into chunks of 10 rows.
This is what I have done:
def sorting(df):
    grouped = df.groupby(['Activity'])
    l_grouped = list(grouped)
    return l_grouped
I turned the grouped result into a list (l_grouped), but I don't know how I could separate the rows from each tuple/list?
The result was identical to the original data frame, but separated by 'Activity'. For example, the rows that have 'Standing' as the targeted value ('Activity') would be accessible by calling l_grouped[1][0] (type list/tuple), while l_grouped[1][1] would return only the word 'Standing'.
I could access the grouped result using:
for i in range(len(df_sort)):
    print(df_sort[i][1])
where df_sort refers to the result of calling sorting(df).
Is there any way I could split/divide the tuple/list per row, and then compute the aggregate functions from that?
I would suggest using a window function + a small trick for the stride:
import pandas as pd

step = 10
window = 10
df = pd.DataFrame({'a': range(100)})
df['a'].rolling(window).sum()[::step]
If this is not the exact result you are looking for, check the documentation for more details regarding the bounds of the window, etc.
You can do:
import numpy as np
step = 10
df["id"] = np.arange(len(df))//step
gr = df.groupby("id")
...
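For example, assuming the aggregates wanted are the mean and standard deviation of a numeric column (here a hypothetical column 'vals'), the follow-up could look like this:
import numpy as np
import pandas as pd

step = 10
df = pd.DataFrame({'vals': range(100)})                 # stand-in for the real data
df['id'] = np.arange(len(df)) // step                   # label every block of 10 rows
stats = df.groupby('id')['vals'].agg(['mean', 'std'])   # one row of aggregates per block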
I've been struggling the past week trying to use apply to run functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large number of questions on SO about this topic and many old & outdated answers. So I started to create a notebook of every possible combination of x inputs & outputs, rolling, rolling & groupby combined, and I focused on performance as well. Since I'm not the only one struggling with these questions, I thought I'd provide my solutions here with working examples, hoping they help any existing/future pandas users.
Important notes
The combination of apply & rolling in pandas has a very strong output requirement: you have to return one single value. You cannot return a pd.Series, not a list, not an array, not secretly an array within an array, but just one value, e.g. one integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don't understand why 'apply & rolling' has this requirement, because without rolling, 'apply' doesn't have it. It must be due to some internal pandas functions.
Combining 'apply & rolling' with multiple input columns simply does not work! Imagine a dataframe with 2 columns and 6 rows, and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values - 2 values of each column for 2 rows. But it seems pandas can't handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working, but:
axis=0 will call your function per column. In the dataframe described above, it will call your function 10 times (not 12, because rolling=2), and since it's per column, it only provides the 2 rolling values of that column…
axis=1 will call your function per row. This is what you probably want, but pandas will not provide a 2x2 input. It actually completely ignores the rolling and only provides one row with the values of the 2 columns...
When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It's False by default, which means the input will be a pd.Series and thus includes the index next to the values. If you don't need the index, you can set raw to True to get a NumPy array instead, which often achieves much better performance (a small illustration follows these notes).
When combining 'rolling & groupby', it returns a multi-indexed series which can't easily serve as an input for a new column. The easiest solution is to append a reset_index(drop=True), as answered & commented here (Python - rolling functions for GroupBy object).
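A minimal self-contained illustration of the raw parameter mentioned above (df_demo is just a made-up example frame):
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

def row_sum(x):
    # with raw=False (default), x is a pd.Series per row (values plus index labels)
    # with raw=True, x is a bare NumPy array, which is usually faster
    return x.sum()

df_demo.apply(row_sum, axis=1, raw=False)
df_demo.apply(row_sum, axis=1, raw=True)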
You might ask me, when would you ever want to use a rolling, groupby custom function with multiple outputs!? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance is important) with different batches within the dataset (groupby). And I needed to save both the power & phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that especially in the Machine Learning/Data-science sectors the more complex examples can be useful.
Please let me know if you have even better, clearer or faster ways to perform any of the solutions below. I'll update my answer and we can all benefit!
Code examples
Let’s create a dataframe first that will be used in all the examples below, including a group-column for the groupby examples.
For the rolling window and multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]
It will look like this:
group a b
0 0 2 2
1 0 4 1
2 0 0 4
3 1 0 2
4 1 3 2
5 1 3 0
Input 1 column, output 1 column
Basic
def func_i1_o1(x):
    return x + 1

df['c'] = df['b'].apply(func_i1_o1)
Rolling
def func_i1_o1_rolling(x):
    return (x[0] + x[1])

df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
Input 2 columns, output 1 column
Basic
def func_i2_o1(x):
    return np.sum(x)

df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)
Rolling
As explained in point 2 of the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses raw=False to ensure the input is a pd.Series, which means we also get the index next to the values. This enables us to get values from other columns at the correct indexes.
def func_i2_o1_rolling(x):
    values_b = x
    values_c = df.loc[x.index, 'c'].to_numpy()
    return np.sum(values_b) + np.sum(values_c)

df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)
Input 1 column, output 2 columns
Basic
You could use a 'normal' solution by returning a pd.Series:
def func_i1_o2(x):
    return pd.Series((x + 1, x + 2))

df[['i', 'j']] = df['b'].apply(func_i1_o2)
Or you could use the zip/tuple combination which is about 8 times faster!
def func_i1_o2_fast(x):
    return x + 1, x + 2

df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
Rolling
As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.
1
def func_i1_o2_rolling_solution1(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['m', 'n']] = output_1, output_2
    return 0

df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)
Pros: Everything is done within 1 function.
Cons: You have to create the columns first and it is slower since it doesn't use the raw input.
2
rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()

def func_i1_o2_rolling_solution2(x):
    output_list_1.append(np.max(x))
    output_list_2.append(np.min(x))
    return 0

df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2
Pros: It uses the raw input, which makes it about twice as fast. And since it doesn't use indexes to set the output values, the code looks a bit clearer (to me at least).
Cons: You have to create the NaN prefix yourself, and it takes a few more lines of code.
Rolling & Groupby
Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling, you'd have to manually set NaNs/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier and handles the NaNs/grouping automatically. Once again, I use the reset_index solution at the end.
def func_i1_o2_rolling_groupby(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['q', 'r']] = output_1, output_2
    return 0

df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
Input 2 columns, output 2 columns
Basic
I suggest using the same 'fast' way as for i1_o2, the only difference being that you get 2 input values to use.
def func_i2_o2(x):
    return np.mean(x), np.median(x)

df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))
Rolling
As I use a workaround for applying rolling with multiple inputs and I use another workaround for rolling with multiple outputs, you can guess I need to combine them for this one.
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)
def func_i2_o2_rolling(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['u', 'v']] = output_1, output_2
    return 0

df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
def func_i2_o2_rolling_groupby(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    df.at[x.index[-1], ['w', 'x']] = output_1, output_2
    return 0

df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)
I want to sort through a DataFrame of about 400k rows with 4 columns, taking out roughly half of them with an if statement:
for a in range(0, howmanytimestorunthrough):
    if 'Primary' not in DataFrameexample[a]:
        # take out row
So far I've been testing one of the 4 approaches below:
newdf.append(emptyline,)
nefdf.at[b,'column1'] = DataFrameexample.at[a,'column1']
nefdf.at[b,'column2'] = DataFrameexample.at[a,'column2']
nefdf.at[b,'column3'] = DataFrameexample.at[a,'column3']
nefdf.at[b,'column4'] = DataFrameexample.at[a,'column4']
b = b + 1
or the same with .loc
newdf.append(emptyline,)
nefdf.loc[b,:] = DataFrameexample.loc[a,:]
b = b + 1
or changing the if (not in) to an if (in) and using:
DataFrameexample = DataFrameexample.drop([k])
or trying to set emptyline to have values, and then append it:
notemptyline = pd.Series(DataFrameexample.loc[a,:].values, index=['column1', 'column2', ...])
newdf.append(notemptyline, ignore_index=True)
So from what I've managed to test so far, they all seem to work OK on a small number of rows (2000), but once I start getting a lot more rows they take exponentially longer. .at seems slightly faster than .loc even if I need it to run 4 times, but it still gets slow (10 times the rows takes longer than 10 times as long). .drop, I think, tries to copy the dataframe each time, so it really doesn't work? I can't seem to get .append(notemptyline) to work properly; it just replaces index 0 over and over again.
I know there must be an efficient way of doing this, I just can't seem to quite get there. Any help?
Your speed problem has nothing to do with .loc vs .at vs ... (for a comparison between .loc and .at have a look at this question) but comes from explicitly looping over every row of your dataframe. Pandas is all about vectorising your operations.
You want to filter your dataframe based on a comparison. You can transform that to a boolean indexer.
indexer = df!='Primary'
This will give you an n-rows-by-4-columns dataframe of boolean values. Now you want to reduce it to one value per row (axis 1), such that the value is True if all values in the row are True.
indexer = indexer.all(axis=1)
Now we can use .loc to get only the rows where indexer is True:
df = df.loc[indexer]
This will be much faster than iterating over the rows.
EDIT:
To check whether a df entry contains a string, you can replace the first line with:
indexer = df.apply(lambda x: x.str.contains('Primary'))
Note that you normally don't want to use an apply statement (internally it uses a for loop for custom functions) to iterate over a lot of elements. In this case we are looping over the columns, which is fine if you just have a couple of those.
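Putting both pieces together, a sketch only (it assumes the goal is to keep rows where at least one of the 4 string columns contains the substring 'Primary', and that DataFrameexample holds string data):
# one boolean mask per column, then keep rows where any column matches
indexer = DataFrameexample.apply(lambda col: col.str.contains('Primary', na=False))
newdf = DataFrameexample.loc[indexer.any(axis=1)]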
Coming from R, the code would be
x <- data.frame(vals = c(100,100,100,100,100,100,200,200,200,200,200,200,200,300,300,300,300,300))
x$state <- cumsum(c(1, diff(x$vals) != 0))
Which marks every time the difference between rows is non-zero, so that I can use it to spot transitions in data, like so:
vals state
1 100 1
...
7 200 2
...
14 300 3
What would be a clean equivalent in Python?
Additional question
The answer to the original question is posted below, but won't work properly for a grouped dataframe with pandas.
Data here: https://pastebin.com/gEmPHAb7. Notice that there are 2 different filenames.
When imported as df_all, I group it with the following, and then apply the solution posted below.
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
Using diff and cumsum, as in your R example:
df['state'] = (df['vals'].diff()!= 0).cumsum()
This uses the fact that True has integer value 1.
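A tiny worked example (with made-up values) showing the intermediate boolean series:
import pandas as pd

df = pd.DataFrame({'vals': [100, 100, 200, 200, 300]})
transitions = df['vals'].diff() != 0   # True at the first row and at every value change
df['state'] = transitions.cumsum()     # booleans sum as 0/1 -> 1, 1, 2, 2, 3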
Bonus question
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
I think you misunderstand what groupby does. All groupby does is create groups based on the criterion (filename in this instance). You then need to add another operation that tells pandas what needs to happen with each group.
Common operations are mean and sum, or more advanced ones such as apply and transform.
You can find more information here or here
If you can explain more in detail what you want to achieve with the groupby I can help you find the correct method. If you want to perform the above operation per filename, you probably need something like this:
def get_state(group):
    return (group.diff() != 0).cumsum()

df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I'd like to know how many male trips took place. The following does the job, but takes a long time:
mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]
how should I go about this instead?
Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:
from timeit import Timer
setup = "import pandas; male_trips=pandas.load('maletrips')"
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)
and here is the result:
In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079
In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098
Note that, at this speed, for exploring data, value_counts is marginally quicker to type and easier to remember!
I'd do it like Vishal, but instead of using sum(), use size() to get a count of the number of rows allocated to each group of 'start_station_id'. So:
df = male_trips.groupby('start_station_id').size()
My answer below works in Pandas 0.7.3. Not sure about the new releases.
This is what the pandas.Series.value_counts method is for:
count_series = male_trips.start_station_id.value_counts()
It should be straightforward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:
count_series = (
    male_trips[male_trips.start_station_id.isin(stations.id.values)]
    .start_station_id
    .value_counts()
)
and this will only give counts for station IDs actually found in stations.id.
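If you also want every station to appear, including those with zero trips, one possible follow-up (a sketch, assuming stations['id'] holds the station ids) is:
# align the counts with the full list of station ids, filling missing ones with 0
counts_per_station = count_series.reindex(stations['id'].values, fill_value=0)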
male_trips.count()
doesn't work?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html
how long would this take:
df = male_trips.groupby('start_station_id').sum()
Edit: after seeing in the answer above that isin and value_counts exist (and that value_counts even comes with its own entry in pandas.core.algorithm, and isin isn't simply np.in1d), I updated the three methods below.
male_trips.start_station_id[male_trips.start_station_id.isin(station.id)].value_counts()
You could also do an inner join on stations.id:
pd.merge(male_trips, station, left_on='start_station_id', right_on='id') followed by value_counts.
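Spelled out, that might look like (a sketch, keeping the answer's variable names):
merged = pd.merge(male_trips, station, left_on='start_station_id', right_on='id')
counts = merged['start_station_id'].value_counts()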
Or:
male_trips.set_index('start_station_id', inplace=True)
station.set_index('id', inplace=True)
male_trips.ix[male_trips.index.intersection(station.index)].reset_index().start_station_id.value_counts()
If you have the time I'd be interested how this performs differently with a huge DataFrame.