I have a data frame like this:
        pk_dcdata  threshold  last_ep  diff
window
1        11075761    0.00001        4     3
1        11075768    0.00001        7     6
2        11075769    0.00001        1    -1
2        11075770    0.00001        1    -1
3        11075771    0.00001        1     0
3        11075768    0.00001        7     6
I want to calculate the mean of the column 'diff' grouped by the index 'window', and save the means into a new list. E.g. for window = 1 the mean is (3+6)/2, for window = 2 it is (-1-1)/2, and so on.
Expected outcome: list = [4.5, -1, 3]
I tried to use 'rolling_mean' but don't know how to set the moving length. Because the dataset is big, I hope there is a fast way to get the result.
Don't use list as a variable name, because it shadows the Python built-in.
You need to aggregate by mean per index level and then convert the Series to a list:
L = df.groupby(level=0)['diff'].mean().tolist()
#alternative
#L = df.groupby('window')['diff'].mean().tolist()
print (L)
[4.5, -1.0, 3.0]
The alternative (grouping by the index name) works in pandas 0.20.0+; check the docs.
You can use groupby(). Let's say your dataframe is called df:
avg_diff = df['diff'].groupby(level=0).mean()
This gives you a Series with the mean per window.
If you then want to put it in a list you can do:
my_list = avg_diff.tolist()
I have a dataframe with the following columns: X and Y are Cartesian coordinates and Value is the value of the element at those coordinates. What I want to achieve is to keep only one point out of any group of points that are close to each other; let's say two points are close if their distance is lower than some value m. The initial DF looks like this (example):
import pandas as pd

data = {'X': [0, 0, 0, 1, 1, 5, 6, 7, 8], 'Y': [0, 1, 4, 2, 6, 5, 6, 4, 8], 'Value': [6, 7, 4, 5, 6, 5, 6, 4, 8]}
df = pd.DataFrame(data)
   X  Y  Value
0  0  0      6
1  0  1      7
2  0  4      4
3  1  2      5
4  1  6      6
5  5  5      5
6  6  6      6
7  7  4      4
8  8  8      8
The distance is computed with the following function:
from numpy import sqrt  # NumPy's sqrt, so the function also works element-wise on Series (as used below)

def countDistance(lat1, lon1, lat2, lon2):
    # use basic knowledge about triangles - values are in meters
    distance = sqrt(pow(lat1 - lat2, 2) + pow(lon1 - lon2, 2))
    return distance
Let's say we want m <= 3; the output dataframe would look like this:
   X  Y  Value
1  0  1      7
4  1  6      6
8  8  8      8
What is to be done:
rows 0, 1 and 3 are close; the highest value is in row 1, so keep row 1 and continue
rows 2 and 4 (from the original df) are close; keep row 4
rows 5, 6 and 7 are close; keep row 6
the left-over row 6 is close to row 8; keep row 8, which has the higher value
So I need to go through the dataframe row by row, check the rest, select the best match and then continue. I can't think of a simple way to achieve this. It can't be a use case for drop_duplicates, since the rows are not duplicates, but looping over the whole DF will be very inefficient. One method I could think of is to loop just once: for each row find the close ones (probably by applying countDistance()), select the best-fitting row and replace the rest with its values, and at the end use drop_duplicates. The other idea was a recursive function that creates a new DF: while the original df still has rows, take the first one, find the close ones, append the best match to the new DF, remove the first row and all close rows from the original DF, and continue until it is empty; then call the same function on the new DF to remove any uncaught close points.
These ideas all feel inefficient; is there a nice and efficient pythonic way to achieve this?
For now, I have written simple code with recursion; the code works but is most likely not optimal.
def recModif(self, df):
    # columns = ['', 'X', 'Y', 'Value']
    new_df = df.copy()
    new_df = new_df[new_df['Value'] < 0]  # create an empty copy to work with
    changed = False
    while not df.empty:  # for all the data
        df = df.reset_index(drop=True)  # need to reset so 0 is always accessible
        x = df.loc[0, 'X']  # first row x and y
        y = df.loc[0, 'Y']
        df['dist'] = self.countDistance(x, y, df['X'], df['Y'])  # add column with distances
        select = df[df['dist'] < 10]  # number of meters two elements can't be within of each other
        if len(select.index) > 1:  # if there is more than one element close
            changed = True
            # print(select, select['Value'].idxmax())
            select = select.loc[[select['Value'].idxmax()]]  # get the highest one
        # DataFrame.append is deprecated in newer pandas; pd.concat is the modern equivalent
        new_df = new_df.append(pd.DataFrame(select.iloc[:, :3]), ignore_index=True)  # add it to the new df
        df = df[df['dist'] >= 10]  # drop the close elements now
    if changed:
        return self.recModif(new_df)  # recurse in case of possible overlaps
    else:
        return new_df  # return the new df if all was OK
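A faster alternative is a sketch of my own (not from the question) that keeps the same pass-until-stable idea but looks up neighbours with scipy.spatial.cKDTree instead of recomputing a full distance column on every iteration; note that query_ball_point includes points at exactly distance m, whereas the code above uses a strict comparison:

import pandas as pd
from scipy.spatial import cKDTree

def reduce_close_points(df, m):
    # repeat the sweep until no two kept points are within distance m of each other
    df = df.reset_index(drop=True)
    while True:
        tree = cKDTree(df[['X', 'Y']].values)
        keep, dropped, changed = [], set(), False
        for i in range(len(df)):
            if i in dropped:
                continue
            # indices of every point within distance m of point i (including i itself)
            close = [j for j in tree.query_ball_point(df.loc[i, ['X', 'Y']].values, m)
                     if j not in dropped]
            if len(close) > 1:
                changed = True
            keep.append(df.loc[close, 'Value'].idxmax())  # keep the highest-valued one
            dropped.update(close)
        df = df.loc[sorted(set(keep))].reset_index(drop=True)
        if not changed:  # nothing was merged in this sweep, so we are done
            return df

data = {'X': [0, 0, 0, 1, 1, 5, 6, 7, 8],
        'Y': [0, 1, 4, 2, 6, 5, 6, 4, 8],
        'Value': [6, 7, 4, 5, 6, 5, 6, 4, 8]}
print(reduce_close_points(pd.DataFrame(data), 3))  # keeps the rows with Value 7, 6 and 8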
In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row, df.sum(axis=0) is called.
But it returns the sum of values in each column, and vice versa. Why?
import pandas as pd
df=pd.DataFrame({"x":[1,2,3,4,5],"y":[2,4,6,8,10]})
df.sum(axis=0)
Dataframe:
   x   y
0  1   2
1  2   4
2  3   6
3  4   8
4  5  10
Output:
x 15
y 30
Expected Output:
0 3
1 6
2 9
3 12
4 15
I think the right way to interpret the axis parameter is what axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis = 0 computes the sum over the rows, giving you a total for each column; axis = 1 computes the sum across the columns, giving you a total for each row.
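Concretely, with the dataframe from the question, summing across the columns gives the per-row totals that were expected:
df.sum(axis=1)
0     3
1     6
2     9
3    12
4    15
dtype: int64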
I was reading the source code of the pandas project, and I think this convention comes from NumPy: that library uses it the same way (0 sums vertically and 1 horizontally), and pandas uses NumPy under the hood to perform the sum.
In this link you can see that pandas uses the numpy.cumsum function to perform the sum.
And this link is the NumPy documentation.
If you are looking for a way to remember how to use the axis parameter, the answer by 'anant' is a good approach: interpret the sum as being over the axis rather than across it. So when 0 is specified you are computing the sum over the rows (iterating over the index, to be more pandas-doc compliant). When axis is 1 you are iterating over the columns.
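For reference, a quick check of the NumPy convention the answer above refers to (the array mirrors the question's dataframe):
import numpy as np
a = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])
a.sum(axis=0)  # array([15, 30]) - collapses the rows, one total per column
a.sum(axis=1)  # array([ 3,  6,  9, 12, 15]) - collapses the columns, one total per row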
I have a data frame with 5 fields. I want to copy 2 fields from it into a new data frame. This works fine:
df1 = df[['task_id', 'duration']]
Now in df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now):
After (what I'm trying to achieve):
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in one column grouped by the unique IDs in another. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration per task_id, and calculate each value's percentage of that sum in a new column named pct. This seems like a very simple thing, but I can't get anything working. How can I get this straightened out?
The following retains the columns while aggregating (note that .size() gives a count per group rather than a sum):
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
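If the goal is just to keep task_id next to the summed duration and a pct column, another route (a sketch of my own, not from the question) is groupby().transform, which broadcasts the group sum back onto every row, so nothing is dropped; taking an explicit copy also avoids the SettingWithCopyWarning:
df1 = df[['task_id', 'duration']].copy()  # explicit copy, so later assignments are safe
df1['total'] = df1.groupby('task_id')['duration'].transform('sum')
df1['pct'] = df1['duration'] / df1['total'] * 100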
Setup:
import numpy as np
import pandas as pd

X = np.random.choice([0, 1, 2], 20)
Y = np.random.uniform(2, 10, 20)
df = pd.DataFrame({'task_id': X, 'duration': Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y']) * 100
df = df.drop('duration_y', axis=1)  # drops the summed duration; remove this line if you want to keep it
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
disclaimer: All data is randomly generated in the setup; however, the calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# find each task_id as relative percentage of whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)
If I have a pandas dataframe such as:
timestamp  label  value  new
etc.       a      1      3.5
           b      2      5
           a      5      ...
           b      6      ...
           a      2      ...
           b      4      ...
I want the new column to be the average of the last two a's and the last two b's... so for the first row it would be the average of 5 and 2, giving 3.5. The data will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get the average of just the last two. I'm fairly new to python and coding, so this might not even be possible.
Edit: I should also mention this is not for a class or anything; it's something I'm doing on my own, and it will be run on a very large dataset; I'm just using this as an example. Also, I would want each a and each b row to have its own value for the last-two average, so the new column has the same length as the others. So for the third line it would be the average of 2 and whatever the next a value is in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question: you want the last two (not the ones following the first), and the same dimensionality with the values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)

grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    subset_mean = grp['value'].tail(2).mean()  # mean of this group's last two values
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of just the results, if you need it, plus a column with it merged back into the main dataframe. Someone else posted a more succinct way to get the results dataframe; there is probably no reason to do it the longer way shown here unless you also need to perform more operations like this inside the same loop.
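A combined route (my own sketch, not from either answer) is to take the succinct groupby from the first answer and merge its result back onto every row, as the loop above does:
tail_means = (df.groupby('label').tail(2)            # last two rows per label
                .groupby('label')['value'].mean()    # their mean
                .reset_index(name='tail_mean'))
df = df.merge(tail_means, on='label', how='left')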
I have a dataframe with a number of columns, two of which are grouping variables.
>>> df2
   Groupvar1  Groupvar2         x         y         z
0          A          1  0.726317  0.574514  0.700475
1          A          2  0.422089  0.798931  0.191157
2          A          3  0.888318  0.658061  0.686496
....
13         B          2  0.978920  0.764266  0.673941
14         B          3  0.759589  0.162488  0.698958
and I want to make a new dataframe which holds the difference between each data point in the original df and the mean of the subgroup it belongs to.
So to begin with, I make the new df with the grouped averages:
>>> grp_vars = ['Groupvar1','Groupvar2']
>>> df2_grp = df2.groupby(grp_vars)
>>> df2_grp_avg = df2_grp.mean()
>>> df2_grp_avg
                            x         y         z
Groupvar1 Groupvar2
A         1          0.364533  0.645237  0.886286
          2          0.325533  0.500077  0.246287
          3          0.796326  0.496950  0.510085
          4          0.774854  0.688732  0.487547
B         1          0.743783  0.452482  0.612006
          2          0.575687  0.396902  0.446126
          3          0.473152  0.476379  0.508060
          4          0.434320  0.406458  0.382187
and in the new dataframe I want to keep the deltas, defined as:
delta = individual value - average value of the subgroup this individual is a member of
Now, it's clear to me how to do this the hard way (a for loop), but I suppose there must be a more elegant solution. I'd appreciate any advice on finding that more elegant solution. TIA.
Use .groupby(...).transform function:
>>> demean = lambda df: df - df.mean()
>>> df.groupby(['Groupvar1', 'Groupvar2']).transform(demean)
and then pd.concat the result with the original dataframe.
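Putting the two steps together, a minimal sketch using the variable names from the question:
demean = lambda g: g - g.mean()
deltas = df2.groupby(['Groupvar1', 'Groupvar2'])[['x', 'y', 'z']].transform(demean)
result = pd.concat([df2[['Groupvar1', 'Groupvar2']], deltas], axis=1)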