I'm trying to split my data frame into chunks of 10 rows, compute aggregates (mean, SD, etc.) for each chunk, and then merge the results back into one data frame. Previously I grouped the data using the .groupby function, but I'm having trouble splitting the data into 10-row chunks.
This is what I have done :
def sorting(df):
    grouped = df.groupby(['Activity'])
    l_grouped = list(grouped)
    return l_grouped
I turned the grouped result into a list (l_grouped), but I don't know how to separate the rows within each tuple.
The result was identical to the original data frame, but split by 'Activity'. For example, the rows that have 'Standing' as the target value ('Activity') are accessible through l_grouped[1][1] (a DataFrame), while l_grouped[1][0] returns only the word 'Standing'.
I could access the grouped result using :
for i in range(len(df_sort)):
    print(df_sort[i][1])
df_sort refers to the result of calling sorting(df).
Is there any way I could split the tuple/list into chunks of rows, and then compute the aggregates from that?
I would suggest using a window function + a small trick for the stride:
import pandas as pd

step = 10
window = 10
df = pd.DataFrame({'a': range(100)})
df['a'].rolling(window).sum()[window - 1::step]  # first complete window ends at row 9
If this is not the exact result you are looking for, check the documentation for more details on the bounds of the window, min_periods, and so on.
You can do:
import numpy as np
step = 10
df["id"] = np.arange(len(df))//step
gr = df.groupby("id")
...
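From there, you can aggregate each chunk and get everything back in a single frame. A minimal self-contained sketch (assuming a numeric column 'a', as in the dummy data above):

import numpy as np
import pandas as pd

step = 10
df = pd.DataFrame({'a': range(100)})     # dummy numeric data
df["id"] = np.arange(len(df)) // step    # chunk label: 0 for rows 0-9, 1 for rows 10-19, ...

# one row of aggregates per 10-row chunk, already merged into a single frame
agg = df.groupby("id")["a"].agg(["mean", "std"])
print(agg)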
I am a new coder using Jupyter Notebook. I have a dataframe that contains 23 columns with different numbers of values (at most 23 and at least 2). I have created a function, shown below, that normalizes the contents of one column.
def normalize(column):
    y = DFref[column].values
    y = y.astype(int)
    KGF = list()
    for element in y:
        element_norm = element / y.sum()
        KGF.append(element_norm)
    return KGF
I am now trying to create a function that loops through all columns in the data frame. Right now, if I plug in the name of one column, it works as intended. What would I need to do to write a function that loops through each column, normalizes its values, and adds the result to a new dataframe?
It's not clear whether all 23 columns are numeric, but I will assume they are. There are a number of ways to solve this; the method below probably isn't the best, but it might be a quick fix for you:
colnames = DFref.columns.tolist()
normalised_data = {}
for colname in colnames:
    normalised_data[colname] = normalize(colname)
df2 = pd.DataFrame(normalised_data)
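As a side note, if all columns really are numeric, the whole helper can likely be replaced by a single vectorized expression; a minimal sketch, with dummy data standing in for your DFref:

import pandas as pd

DFref = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})  # dummy stand-in for your data
# divide each column by its own column sum
df2 = DFref / DFref.sum()
print(df2)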
I start by creating a data frame from my input CSV file and I get the right format, but now I need to perform calculations on the V, I, and P columns. I want to split the data using the timestamps,
i.e. get the mean of V, I, and P for all the values between Test loop 0 and Test loop 1. I know I can do this using iloc, but I am trying to write a script that will work for different log files that might have a different number of entries.
Data frame output
Please let me know if you need any more information; any help/input is appreciated.
I think you need to extract the first 19 characters from column Time and aggregate with mean:
df = df.groupby(df['Time'].str[:19]).mean()
If you need to remove rows with NaNs first:
df = df.dropna()
df = df.groupby(df['Time'].str[:19]).mean()
If you want to have the mean for the lines between your 'Test loop' lines:
First, you need to extract the limits of your time windows:
# start times of each test loop (reset the index so it can be iterated positionally)
serie_time_limits = df[df['Time'].str.contains('Test loop')]['Time'].str[:19].reset_index(drop=True)
df_data = df[~df['Time'].str.contains('Test loop')].copy()
df_data['Time'] = df_data['Time'].str[:19]
Then, you can get the mean for each test loop:
means = []
for i in range(len(serie_time_limits)):
    if i == len(serie_time_limits) - 1:
        # last window: everything from the final 'Test loop' onwards
        df_window = df_data[df_data['Time'] >= serie_time_limits[i]]
    else:
        df_window = df_data[(df_data['Time'] >= serie_time_limits[i]) & (df_data['Time'] < serie_time_limits[i+1])]
    means.append(df_window[['V', 'I', 'P']].mean())
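If you then want the per-loop means back in a single frame, one possible follow-up (a sketch reusing means and serie_time_limits from above) is:

import pandas as pd

# each element of means is a Series of V/I/P means; stack them into rows
df_means = pd.DataFrame(means)
df_means.index = serie_time_limits.values  # label each row with the loop's start time
print(df_means)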
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group only the first group should be left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I built a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to work
                                                     # with Sample[1] in the for doesn't work
        if (df['Value'].max() > my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
Group Group_Value Bad_Row
0 1 57 Good Row
1 2 63 Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
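For what it's worth, the row-wise apply can also be avoided entirely with boolean indexing. A sketch of the same filtering logic (one possible alternative, not the only way):

import pandas as pd

my_max = 16
my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

# groups containing at least one Value above the threshold
bad_groups = my_df1.loc[my_df1['Value'] > my_max, 'Group'].unique()

# keep only the groups that never exceed the threshold
my_df1_Group = my_df1_Group[~my_df1_Group['Group'].isin(bad_groups)]
print(my_df1_Group)  # only Group 1 remains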
I would like to know how I could iterate through each column of a dataframe to perform some calculations and store the results in another dataframe.
df_empty = []
m = daily.ix[:, -1]        # columns = stocks & rows = daily returns
stocks = daily.ix[:, :-1]
for col in range(len(stocks.columns)):
    s = daily.ix[:, col]
    covmat = np.cov(s, m)
    beta = covmat[0, 1] / covmat[1, 1]
    return (beta)
print(beta)
In the above example, I first want to calculate a covariance matrix between "s" (the columns representing the stocks' daily returns, which I want to iterate through one by one) and "m" (the market daily return, which is my reference column/the last column of my dataframe). Then I want to calculate the beta for each covariance pair stock/market.
I'm not sure why return(beta) gives me a single numerical result for one stock while print(beta) prints the betas for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code but it returns 'None', as if it could not append the outcome.
Thank you for your help
The return statement within your for-loop ends the loop itself the first time the return is encountered. Moreover, you are not saving the beta value anywhere, because the for-loop itself does not return a value in Python (it only has side effects).
Apart from that, you may choose a more pandas-like approach using apply on the data frame which basically iterates over the columns of the data frame and passes each column to a supplied function as the first parameter while returning the result of the function call. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np
# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))
# define reference column
cov_column = daily.iloc[:, -1]
# setup computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0, 1] / covmat[1, 1]
# use apply to iterate over columns
result = daily.iloc[:, :-1].apply(compute)
# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64
I have a Pandas Data Frame object that has 1000 rows and 10 columns. I would simply like to slice the Data Frame and take the first 10 rows. How can I do this? I've been trying to use this:
>>> df.shape
(1000,10)
>>> my_slice = df.ix[10,:]
>>> my_slice.shape
(10,)
Shouldn't my_slice be the first ten rows, i.e. a 10 x 10 Data Frame? How can I get the first ten rows, such that my_slice is a 10 x 10 Data Frame object? Thanks.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head
df2 = df.head(10)
should do the trick
You can also do as a convenience:
df[:10]
There are various ways to do that. Below we will go through at least three options.
In order to keep the original dataframe df, we will be assigning the sliced dataframe to df_new.
At the end, in section Time Comparison we will show, using a random dataframe, the various times of execution.
Option 1
df_new = df[:10] # Option 1.1
# or
df_new = df[0:10] # Option 1.2
Option 2
Using head
df_new = df.head(10)
For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n] [Source].
Option 3
Using iloc
df_new = df.iloc[:10] # Option 3.1
# or
df_new = df.iloc[0:10] # Option 3.2
Time Comparison
For this specific case, time.perf_counter() was used to measure the execution time.
       method                    time
0  Option 1.1  0.00000120000913739204
1  Option 1.2  0.00000149995321407914
2    Option 2  0.00000170001294463873
3  Option 3.1  0.00000120000913739204
4  Option 3.2  0.00000350002665072680
As various factors can affect execution time, these numbers may change depending on the dataframe used, the machine, and more.
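For reference, a minimal sketch of how such a timing can be taken (the absolute numbers will differ per machine and run):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 10))  # dummy 1000 x 10 frame

start = time.perf_counter()
df_new = df.iloc[:10]
print('Option 3.1:', time.perf_counter() - start, 's')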
Notes:
Instead of 10, one can use any desired number of rows in the operations above. For example,
df_new = df[:5]
will return a dataframe with the first 5 rows.
There are additional ways to measure the time of execution. For additional ways, read this: How do I get time of a Python program's execution?
One can also rewrite the previous options using a lambda function, such as the following:
df_new = df.apply(lambda x: x[:10])
# or
df_new = df.apply(lambda x: x.head(10))
Note, however, that there are strong opinions on the usage of .apply() and, for this case, it is far from being a required method.
df.ix[10,:] gives you all the columns from the 10th row. In your case you want everything up to the 10th row, which is df.ix[:9,:]. Note that the right end of the slice range is inclusive: http://pandas.sourceforge.net/gotchas.html#endpoints-are-inclusive
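Note that .ix has since been deprecated and was removed in pandas 1.0; the modern positional equivalent, which excludes the right endpoint, would be:

my_slice = df.iloc[:10]  # rows 0 through 9, a 10 x 10 frame here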
DataFrame[:n] will return the first n rows.