I have some earthquake data with Magnitude, Distance, and Percent columns that I care about. I want to group the rows by magnitude and sum the distances and percents for each magnitude. Here is a part of my data:
import pandas as pd
data = {'Distance': [1, 5, 9, 3, 5, 4, 2, 3.1],
        'Magnitude': [7.3, 7.3, 7.3, 6.0, 8.2, 6.0, 8.2, 5.7],
        'Percent': [0.1, 0.05, 0.07, 0.11, 0.2, 0.07, 0.08, 0.11]}
df = pd.DataFrame(data)
print(df)
Distance Magnitude Percent
0 1.0 7.3 0.10
1 5.0 7.3 0.05
2 9.0 7.3 0.07
3 3.0 6.0 0.11
4 5.0 8.2 0.20
5 4.0 6.0 0.07
6 2.0 8.2 0.08
7 3.1 5.7 0.11
My idea was to group by and sum:
df2 = df.groupby(['Distance','Magnitude','Percent'],as_index=False).agg({'Percent': 'sum'},{'Distance': 'sum'})
When I run my code I get back the same dataframe, just sorted ascending by distance (which is fine), but nothing is grouped together or summed.
I want it to look like this:
   Distance  Magnitude  Percent
0       3.1        5.7     0.11
1       7.0        6.0     0.18
2      15.0        7.3     0.22
3       7.0        8.2     0.28
There is only 1 value for each magnitude and the distances and percents have been summed for each magnitude.
This will do the task; you just need to group by Magnitude only:
df.groupby(by=["Magnitude"]).sum()
Output
Distance Percent
Magnitude
5.7 3.1 0.11
6.0 7.0 0.18
7.3 15.0 0.22
8.2 7.0 0.28
Or, to prevent Magnitude from becoming the index (as noted by #lsr729), you can use this as well:
df.groupby(by=["Magnitude"], as_index=False).sum()
Output2
Magnitude Distance Percent
0 5.7 3.1 0.11
1 6.0 7.0 0.18
2 7.3 15.0 0.22
3 8.2 7.0 0.28
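If your real frame has extra columns you don't want aggregated, a minimal variant (a sketch, assuming only Distance and Percent should be summed) is to select those columns explicitly before summing:
df.groupby('Magnitude', as_index=False)[['Distance', 'Percent']].sum()
This gives the same four-row result as above while ignoring any other columns.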
I have a dataframe built from a list of values:
In [24]: data
Out[24]:
[{'value': 1.2},
{'value': 2.2},
{'value': 1.8},
{'value': 2.0},
{'value': 1.1},
{'value': 3.9},
{'value': 0.0},
{'value': 1.5},
{'value': 2.5},
{'value': 1.6},
{'value': 2.3},
{'value': 3.0},
{'value': 3.3},
{'value': 0.5},
{'value': 4.0},
{'value': 3.4},
{'value': 0.8},
{'value': 2.5},
{'value': 2.1},
{'value': 3.0}]
In [25]: df = pd.DataFrame(data=data)
In [26]: df
Out[26]:
value
0 1.2
1 2.2
2 1.8
3 2.0
4 1.1
5 3.9
6 0.0
7 1.5
8 2.5
9 1.6
10 2.3
11 3.0
12 3.3
13 0.5
14 4.0
15 3.4
16 0.8
17 2.5
18 2.1
19 3.0
Now I want to pick a subset of this dataframe in the following way:
Pick max value - easy enough - df['value'].max()
Find the next N rows whose values are closest to the last value minus 0.2.
i.e. - max value is 4.0, so I want to find the row with value closest to 4.0 - 0.2 = 3.8, i.e. row 5.
Next, I want to find a row with value 4.0 - (0.2 * 2) = 3.6, so this would be row 15 (with 3.4), and so on (up to N times)
Is there a quick way to do that?
expected output:
value
0 4.0
1 3.9
2 3.4
The actual data I'll run with should be more evenly distributed, so around each expected value (i.e. around 3.8, 3.6, 3.4) there will be a number of close values (e.g. 3.44, 3.38, 3.41).
Assuming the resolution (0.2) is expected to be much larger than the distance to the closest value, I believe you can use merge_asof:
import numpy as np

step, N = 0.2, 3
maxval = df['value'].max()

(pd.merge_asof(df.sort_values('value'),
               pd.DataFrame({'ref': maxval - np.arange(N) * step}).sort_values('ref'),
               left_on='value',
               right_on='ref',
               direction='nearest')
   .assign(dist=lambda x: x['ref'].sub(x['value']).abs())
   .sort_values('dist')
   .drop_duplicates('ref')
)
Output:
value ref dist
19 4.0 4.0 0.0
18 3.9 3.8 0.1
17 3.4 3.6 0.2
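If only the matched values are needed, in target order (4.0, 3.9, 3.4), a possible follow-up (assuming the chained expression above has been assigned to a variable, say res) is:
# res is a hypothetical name for the result above
res.sort_values('ref', ascending=False)['value'].reset_index(drop=True)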
Based on what I understood, you might want:
arr = df['value'].max() - 0.2 * np.arange(1, len(df) + 1)
out = pd.merge_asof(pd.Series(arr, name='Derived_value').sort_values(),
                    df['value'].sort_values(),
                    left_on='Derived_value', right_on='value')
The above should do, unless you want the result sorted exactly in the order of the calculated values:
out_one = (out.assign(Derived_value=pd.Categorical(out['Derived_value'],
                                                   categories=arr, ordered=True))
              .sort_values('Derived_value'))
print(out_one)
Derived_value value
19 3.8 3.4
18 3.6 3.4
17 3.4 3.4
16 3.2 3.0
15 3.0 3.0
14 2.8 2.5
13 2.6 2.5
12 2.4 2.3
11 2.2 2.2
10 2.0 2.0
9 1.8 1.6
8 1.6 1.5
7 1.4 1.2
6 1.2 1.1
5 1.0 0.8
4 0.8 0.5
3 0.6 0.5
2 0.4 0.0
1 0.2 0.0
0 0.0 0.0
print(arr)
[3.8 3.6 3.4 3.2 3. 2.8 2.6 2.4 2.2 2. 1.8 1.6 1.4 1.2 1. 0.8 0.6 0.4
0.2 0. ]
Here is one way of doing it with numpy broadcasting:
N = 5
# target values: max, max - 0.2, max - 0.4, ...
v = df['value'].max() - np.arange(N) * 0.2
# pairwise absolute distances between each target and every value
dist = np.abs(v[:, None] - df['value'].values)
# for each target, pick the row whose value is nearest
df['value'].iloc[np.argmin(dist, axis=1)]
14 4.0
5 3.9
15 3.4
15 3.4
12 3.3
Name: value, dtype: float64
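Note that 3.4 appears twice because two of the targets (3.6 and 3.4) map to the same nearest row; if you want N distinct rows, one option (a sketch reusing dist from above) is to drop the duplicates afterwards:
df['value'].iloc[np.argmin(dist, axis=1)].drop_duplicates()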
I have a data set containing some outliers that I'd like to remove.
I want to remove the 0 value in the data frame shown below:
df = pd.DataFrame({'Time': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'data': [1.1, 1.05, 1.01, 1.05, 0, 1.2, 1.1, 1.08, 1.07, 1.1]})
I can do something like this in order to remove values below a certain threshold:
df.loc[df['data'] < 0.5, 'data'] = np.NaN
This yields me a frame without the '0' value:
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 1.01
3 0.3 1.05
4 0.4 NaN
5 0.5 1.20
6 0.6 1.10
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
However, I am also suspicious of the data surrounding invalid values, and would like to remove values within 0.2 units of Time of the outliers, like the following:
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 NaN
3 0.3 NaN
4 0.4 NaN
5 0.5 NaN
6 0.6 NaN
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
You can get a list of all points in time in which you have bad measurements and filter for all nearby time values:
bad_times = df.Time[df['data'] < 0.5]
for t in bad_times:
    df.loc[(df['Time'] - t).abs() <= 0.2, 'data'] = np.NaN
result:
>>> print(df)
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 NaN
3 0.3 NaN
4 0.4 NaN
5 0.5 NaN
6 0.6 NaN
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
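For larger frames, a vectorized sketch of the same idea (applied to the original frame, assuming the 0.5 threshold and 0.2 window from above) avoids the Python loop; note that exact comparisons against 0.2 on floats can be sensitive to rounding, so a tiny tolerance may be safer:
import numpy as np
times = df['Time'].to_numpy()
bad = times[(df['data'] < 0.5).to_numpy()]
# mark every row whose time lies within 0.2 of any bad measurement
near_bad = (np.abs(times[:, None] - bad[None, :]) <= 0.2).any(axis=1)
df.loc[near_bad, 'data'] = np.nan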
Alternatively, you can get a list of Time values to be blanked out, and then apply NaN to those rows.
df.loc[df['data'] < 0.5, 'data'] = np.NaN
l = df[df['data'].isna()]['Time'].values
l2 = []
for i in l:
    l2 = l2 + [round(i - 0.1, 1), round(i - 0.2, 1), round(i + 0.1, 1), round(i + 0.2, 1)]
df.loc[df['Time'].isin(l2), 'data'] = np.nan
This is my first question on stackoverflow. Go easy on me!
I have two data sets acquired simultaneously by different acquisition systems with different sampling rates. One is very regular, and the other is not. I would like to create a single dataframe containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated on the regularly spaced timestamps.
Here's some toy data demonstrating what I'm trying to do:
import pandas as pd
import numpy as np
# evenly spaced times
t1 = np.array([0,0.5,1.0,1.5,2.0])
y1 = t1
# unevenly spaced times
t2 = np.array([0,0.34,1.01,1.4,1.6,1.7,2.01])
y2 = 3*t2
df1 = pd.DataFrame(data={'y1':y1,'t':t1})
df2 = pd.DataFrame(data={'y2':y2,'t':t2})
df1 and df2 look like this:
df1:
t y1
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
3 1.5 1.5
4 2.0 2.0
df2:
t y2
0 0.00 0.00
1 0.34 1.02
2 1.01 3.03
3 1.40 4.20
4 1.60 4.80
5 1.70 5.10
6 2.01 6.03
I'm trying to merge df1 and df2, interpolating y2 on df1.t. The desired result is:
df_combined:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
I've been reading documentation for pandas.resample, as well as searching previous stackoverflow questions, but haven't been able to find a solution to my particular problem. Any ideas? Seems like it should be easy.
UPDATE:
I figured out one possible solution: interpolate the second series first, then append to the first data frame:
from scipy.interpolate import interp1d

# build an interpolator over the irregular samples; return NaN outside their range
f2 = interp1d(t2, y2, bounds_error=False)
df1['y2'] = f2(df1.t)
which gives:
df1:
t y1 y2
0 0.0 0.0 0.0
1 0.5 0.5 1.5
2 1.0 1.0 3.0
3 1.5 1.5 4.5
4 2.0 2.0 6.0
That works, but I'm still open to other solutions if there's a better way.
If you construct a single DataFrame from Series, using time values as index, like this:
>>> t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
>>> y1 = pd.Series(t1, index=t1)
>>> t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])
>>> y2 = pd.Series(3*t2, index=t2)
>>> df = pd.DataFrame({'y1': y1, 'y2': y2})
>>> df
y1 y2
0.00 0.0 0.00
0.34 NaN 1.02
0.50 0.5 NaN
1.00 1.0 NaN
1.01 NaN 3.03
1.40 NaN 4.20
1.50 1.5 NaN
1.60 NaN 4.80
1.70 NaN 5.10
2.00 2.0 NaN
2.01 NaN 6.03
You can simply interpolate it, and select only the part where y1 is defined:
>>> df.interpolate('index').reindex(y1)
y1 y2
0.0 0.0 0.0
0.5 0.5 1.5
1.0 1.0 3.0
1.5 1.5 4.5
2.0 2.0 6.0
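The same idea can also be written without building the combined frame by hand; a sketch (assuming df1 and df2 as defined in the question):
s2 = df2.set_index('t')['y2']
# reindex onto the union of both time grids, interpolate on the index values,
# then keep only df1's times
df1['y2'] = (s2.reindex(s2.index.union(df1['t']))
               .interpolate(method='index')
               .reindex(df1['t'])
               .to_numpy())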
It's not exactly clear to me how you're getting rid of some of the values in y2, but it seems like if there is more than one for a given timepoint, you only want the first one. Also, it seems like your time values should be in the index. I also added column labels. It looks like this:
import pandas as pd
# evenly spaced times
t1 = [0,0.5,1.0,1.5,2.0]
y1 = t1
# unevenly spaced times
t2 = [0,0.34,1.01,1.4,1.6,1.7,2.01]
# round t2 values to the nearest half
new_t2 = [round(num * 2)/2 for num in t2]
# set y2 values
y2 = [3*z for z in new_t2]
# eliminate entries that have the same index value (keeping the first of each)
for x in range(len(new_t2) - 1, 0, -1):
    if new_t2[x] == new_t2[x - 1]:
        del new_t2[x]
        del y2[x]
ser1 = pd.Series(y1, index=t1)
ser2 = pd.Series(y2, index=new_t2)
df = pd.concat((ser1, ser2), axis=1)
df.columns = ('Y1', 'Y2')
print(df)
This prints:
      Y1   Y2
0.0  0.0  0.0
0.5  0.5  1.5
1.0  1.0  3.0
1.5  1.5  4.5
2.0  2.0  6.0
I'm using Pandas 0.13.0 and I try to do a sliding average based on the value of the index.
The index values are not equally distributed.
The index is sorted with increasing and unique values.
import pandas as pd
import quantities as pq
f = {
'A': [ 0.0, 0.1, 0.2, 0.5, 1.0, 1.4, 1.5] * pq.m,
'B': [10.0, 11.0, 12.0, 15.0, 20.0, 30.0, 50.0] * pq.kPa
}
df = pd.DataFrame(f)
df.set_index(df['A'], inplace=True)
The DataFrame gives:
in: print df
out:
A B
A
0.00 0.00 m 10.0 kPa
0.10 0.10 m 11.0 kPa
0.20 0.20 m 12.0 kPa
0.50 0.50 m 15.0 kPa
1.00 1.00 m 20.0 kPa
1.40 1.40 m 30.0 kPa
1.50 1.50 m 50.0 kPa
Now I would like to compute the average of column B, for each value x of the index, over the window between x and x+c, with c being a user-defined criterion.
For the sake of this example, c = 0.40.
The averaging process would give:
A B C
A
0.00 0.00 m 10.0 kPa 11.0 kPa = (10.0 + 11.0 + 12.0) / 3
0.10 0.10 m 11.0 kPa 12.7 kPa = (11.0 + 12.0 + 15.0) / 3
0.20 0.20 m 12.0 kPa 13.5 kPa = (12.0 + 15.0) / 2
0.50 0.50 m 15.0 kPa 15.0 kPa = (15.0) / 1
1.00 1.00 m 20.0 kPa 25.0 kPa = (20.0 + 30.0) / 2
1.40 1.40 m 30.0 kPa 40.0 kPa = (30.0 + 50.0) / 2
1.50 1.50 m 50.0 kPa 50.0 kPa = (50.0) / 1
Note that because the index values are not evenly spaced, sometimes x+c won't be found exactly. That is ok for now, though I will definitely add a way to take the average value at x+c between the value just before and the value just after x+c, so I get a more accurate average.
I tried the solution found here from Zelazny7:
pandas rolling computation with window based on values instead of counts
But I can't make it work for my case, where the search is made on the index.
I also looked at:
Pandas Rolling Computations on Sliding Windows (Unevenly spaced)
But I don't understand how to apply it to my case.
Any idea how to solve this problem with an efficient pandas approach (using apply, map or rolling)?
Thanks.
What you need to do, following the answer you linked to, is turn the index into a Series so you can then call apply on it. The other key thing is that you also have to give the constructed Series the same index as your df, because the default is to create a fresh index from scratch like 0, 1, 2, 3...
In [26]:
def f(x, c):
    ser = df.loc[(df.index >= x) & (df.index <= x + c), 'B']
    return ser.mean()

df['C'] = pd.Series(data=df.index, index=df.index).apply(lambda x: f(x, c=0.4))
df
Out[26]:
A B C
A
0.0 0.0 10 11.000000
0.1 0.1 11 12.666667
0.2 0.2 12 13.500000
0.5 0.5 15 15.000000
1.0 1.0 20 25.000000
1.4 1.4 30 40.000000
1.5 1.5 50 50.000000
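If performance matters and the index and B can be treated as plain floats (i.e. with the quantities units stripped, which is an assumption), a vectorized sketch using numpy searchsorted plus a cumulative sum gives the same windowed means without apply:
import numpy as np

idx = df.index.to_numpy(dtype=float)
b = df['B'].to_numpy(dtype=float)
c = 0.4
# prefix sums with a leading zero so a window sum is csum[end] - csum[start]
csum = np.concatenate(([0.0], np.cumsum(b)))
start = np.arange(len(idx))
# right edge of each window [x, x + c] in the sorted index
end = np.searchsorted(idx, idx + c, side='right')
df['C'] = (csum[end] - csum[start]) / (end - start)
As with the apply version, exact floating-point comparisons at the window edge may need a small tolerance.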
I have the following table:
Site Peril ReturnPeriod Min Max Mean
0 one river 20 0.0 0.1 0.05
1 one river 100 0.0 0.1 0.05
2 one coast 20 2.0 5.3 4.00
3 one coast 100 2.0 5.3 4.00
4 two river 20 0.1 0.5 0.90
5 two coast 20 0.3 0.5 0.80
I'm trying to reshape it to get to this:
Peril: river coast
Site ReturnPeriod Min Max Mean Min Max Mean
0 one 20 0.0 0.1 0.05 2.0 5.3 4.00
1 one 100 0.0 0.1 0.05 2.0 5.3 4.00
2 two 20 0.1 0.5 0.90 0.3 0.5 0.80
I think melt can take me halfway there but I'm having trouble getting the final output. Any ideas?
I think this may actually be possible with just a call to pivot_table (in current pandas the arguments are index= and columns=; very old versions used rows= and cols=):
df.pivot_table(values=['Min', 'Mean', 'Max'], index=['Site', 'ReturnPeriod'], columns='Peril')
I need to check it more thoroughly though.
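To get closer to the desired layout, with Peril as the outer column level and Site/ReturnPeriod back as columns, one possible follow-up (a sketch, not verified against the original data) is:
out = df.pivot_table(values=['Min', 'Mean', 'Max'],
                     index=['Site', 'ReturnPeriod'],
                     columns='Peril')
out = out.swaplevel(axis=1).sort_index(axis=1).reset_index()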