I'm trying to update column entries by counting the frequency of row entries in different columns. Here is a sample of my data. Actual data consistes 10k samples each having length of 220. (220 seconds).
d = {'ID':['a12', 'a12','a12','a12','a12', 'a12', 'a12','a12','v55','v55','v55','v55','v55','v55','v55', 'v55'],
'Exp_A':[0.012,0.154,0.257,0.665,1.072,1.514,1.871,2.144, 0.467, 0.812,1.59,2.151,2.68,3.013,3.514,4.015],
'freq':['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04',
'00:00:05', '00:00:06', '00:00:07','00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04',
'00:00:05', '00:00:06', '00:00:07'],
'A_Bullseye':[0,0,0,0,1,0,1,0, 0,0,1,0,0,0,1,0], 'A_Bull_Total':[0,0,0,0,0,1,1,2,0,0,0,1,1,1,1,2], 'A_Shot':[0,1,1,1,0,1,0,0, 1,1,0,1,0,1,0,0]}
df = pd.DataFrame(data=d)
Per each second, only Bullseye or Shot can be registered.
Count1: Number of df.A_Shot == 1 before the first df.A_Bullseye == 1 for each ID is 3 & 2 for ID=a12 and ID=v55 resp.
Count2: Number of df.A_Shot == 1 from the end of count1 to the second df.A_Bullseye == 1, 1 for df[df.ID=='a12'] and 2 for df[df.ID=='v55']
Where i in count(i) is df.groupby(by='ID')[A_Bull_Total].max(). Here i is 2.
So, if I can compute the average count for each i, then I will be able to adjust the values of df.Exp_A using the average of the above counts.
mask_A_Shot= df.A_Shot == 1
mask_A_Bullseye= df.A_Bulleseye == 0
mask = mask_A_Shot & mask_A_Bulleseye
df[mask.groupby(df['ID'])].mean()
Ideally I would like to have something like for each i (Bullseye), how many Shots are needed and how many seconds it took.
Create a grouping key of Bullseye within each ID using .cumsum and then you can find how many shots, and how much time elapsed between the bullseyes.
import pandas as pd
df['freq'] = pd.to_timedelta(df.freq, unit='s')
df['Bullseye'] = df.groupby('ID').A_Bullseye.cumsum()+1
# Chop off any shots after the final bullseye
m = df.Bullseye <= df.groupby('ID').A_Bullseye.transform(lambda x: x.cumsum().max())
df[m].groupby(['ID', 'Bullseye']).agg({'A_Shot': 'sum',
'freq': lambda x: x.max()-x.min()})
Output:
A_Shot freq
ID Bullseye
a12 1 3 00:00:03
2 1 00:00:01
v55 1 2 00:00:01
2 2 00:00:03
Edit:
Given your comment, here is how I would proceed. We're going to .shift the Bullseye column so instead of incrementing the counter at the Bullseye, we increment the counter the row after the bullseye. We'll modify A_Shot so bullseyes are also considered to be a shot.
df['freq'] = pd.to_timedelta(df.freq, unit='s')
df['Bullseye'] = df.groupby('ID').A_Bullseye.apply(lambda x: x.shift().cumsum().fillna(0)+1)
# Also consider Bullseye's as a shot:
df.loc[df.A_Bullseye == 1, 'A_Shot'] = 1
# Chop off any shots after the final bullseye
m = df.Bullseye <= df.groupby('ID').A_Bullseye.transform(lambda x: x.cumsum().max())
df1 = (df[m].groupby(['ID', 'Bullseye'])
.agg({'A_Shot': 'sum',
'freq': lambda x: (x.max()-x.min()).total_seconds()}))
Output: df1
A_Shot freq
ID Bullseye
a12 1.0 4 4.0
2.0 2 1.0
v55 1.0 3 2.0
2.0 3 3.0
And now since freq is an integer number of seconds, you can do divisions easily:
df1.A_Shot / df1.freq
#ID Bullseye
#a12 1.0 1.0
# 2.0 2.0
#v55 1.0 1.5
# 2.0 1.0
#dtype: float64
Related
I am creating a variable 'spike' as an indicator variable that is 1 for the date corresponding to old column, Cost, n smallest values and a 0 otherwise. The code illustrated below is apart of a larger for loop.
I can only get results using the idxmin() function. I would like help in getting the index for the n smallest values.
import pandas as pd
import numpy as np
df3 = pd.DataFrame({'Dept':['A', 'A', 'B', 'B'],
'Benefit':[2000,25,55,400],
'Cost':[1000, 500, 1500, 2000]})
# Let's create an index using Timestamps
index_ = [pd.Timestamp('01-06-2018'), pd.Timestamp('04-06-2018'),
pd.Timestamp('07-06-2018'), pd.Timestamp('10-06-2018')]
df3.index = index_
print(df3)
df3.index = index_
print(df3)
df3['spike'] = np.where(df3.index.isin(lookup), 1, 0)
If you sort, then you can get the top-3 with standard Python / numpy array slicing.
low_cost = df3.sort_values('Cost')[:3]
low_cost
# Dept Benefit Cost
# 2018-04-06 A 25 500
# 2018-01-06 A 2000 1000
# 2018-07-06 B 55 1500
To get the spike column, for efficiency I would recommend a join.
spikes = low_cost.assign(spike=1)[['spike']]
spikes
# spike
# 2018-04-06 1
# 2018-01-06 1
# 2018-07-06 1
df3.join(spikes, how='left').fillna(0)
# Dept Benefit Cost spike
# 2018-01-06 A 2000 1000 1.0
# 2018-04-06 A 25 500 1.0
# 2018-07-06 B 55 1500 1.0
# 2018-10-06 B 400 2000 0.0
I have some log data that represents an item (id) and a timestamp that an action was a started and I want to determine the time between actions on each item.
for example, I have some data that looks like this:
data = [{"timestamp":"2019-05-21T14:17:29.265Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T14:21:49.722Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.695Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.696Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-22T07:51:17.49Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T08:11:13.948Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:52:59.897Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.406Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.481Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-21T14:23:08.147Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:29:18.228Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.831Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.834Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:02:19.072Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:02:34.867Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:12:28.877Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.567Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.582Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T09:58:02.185Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"},{"timestamp":"2019-05-21T10:07:24.044Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"}]
stack = pd.DataFrame(data)
stack.head()
I have tried getting all the unique ids to split the data frame and then getting the time taken with the index to recombine with the original set like, but the function is extremely slow on large data-sets and messes up both the index
and timestamp order resulting in results getting miss matched.
import ciso8601 as time
records = []
for i in list(stack.id.unique()):
dff = stack[stack.id == i]
time_taken = []
times = []
i = 0
for _, row in dff.iterrows():
if bool(times):
print(_)
current_time = time.parse_datetime(row.timestamp)
prev_time = times[i]
time_taken = current_time - prev_time
times.append(current_time)
i+=1
records.append(dict(index = _, time_taken = time_taken.seconds))
else:
records.append(dict(index = _, time_taken = 0))
times.append(time.parse_datetime(row.timestamp))
x = pd.DataFrame(records).set_index('index')
stack.merge(x, left_index=True, right_index=True, how='inner')
Is there a neat pandas groupby and apply way of doing this so that I don't have to split the frame and store it in memory so that can reference the previous row in the subset?
Thanks
You can use GroupBy.diff:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack['timestamp']= (stack.sort_values(['id','timestamp'])
.groupby('id')
.diff()['timestamp']
.dt.total_seconds()
.round().fillna(0))
print(stack['time_taken'])
0 0.0
1 260.0
2 3276.0
3 0.0
4 0.0
5 1196.0
6 13306.0
7 4.0
8 0.0
9 0.0
10 370.0
11 2872.0
...
If you want the resulting dataframe to be ordered by date, instead do:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack = stack.sort_values(['id','timestamp'])
stack['time_taken'] = (stack.groupby('id')
.diff()['timestamp']
.dt.total_seconds()
.round()
.fillna(0))
If dont need replace timestamp to datetimes create Series filled by datetimes by to_datetime and pass to DataFrameGroupBy.diff, then convert to seconds by Series.dt.total_seconds, if necessary round by Series.round and replace missing values by 0:
t = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = t.groupby(stack['id']).diff().dt.total_seconds().round().fillna(0)
print (stack)
id timestamp time_taken
0 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:17:29.265Z 0.0
1 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:21:49.722Z 260.0
2 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.695Z 3276.0
3 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.696Z 0.0
4 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T07:51:17.49Z 0.0
5 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T08:11:13.948Z 1196.0
6 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:52:59.897Z 13306.0
7 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.406Z 4.0
8 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.481Z 0.0
9 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:23:08.147Z 0.0
10 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:29:18.228Z 370.0
11 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.831Z 2872.0
12 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.834Z 0.0
13 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:19.072Z 0.0
14 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:34.867Z 16.0
15 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:12:28.877Z 594.0
16 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.567Z 4011.0
17 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.582Z 0.0
18 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T09:58:02.185Z 0.0
19 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T10:07:24.044Z 562.0
Or if need replace timestamp to datetimes use #yatu answer.
df1:
GAME PLAY BET
0 (SWE, FIN) DRAW 10
1 (DEN, GER) WIN 20
2 (RUS, CZE) LOSS 30
df2:
GAME WIN DRAW LOSS
0 (SWE, FIN) 1.50 2.0 3.25
1 (DEN, GER) 2.00 2.5 2.10
2 (RUS, CZE) 1.05 2.1 10.00
I'd like to create a column "PAYOFF" in df1, for each game. Payoff is computed by fetching the actual odds (WIN/DRAW/LOSS) from df2, by multiplying that value with the "BET" from df1. For instance, for row 1, (SWE,FIN), the PLAY was a DRAW, and I need to use that value to fetch from the DRAW col in df2.
I can manage that by joining the 2 df's, and then some ugly del/rename of columns in a number of steps, but surely I'm missing some more elegant way to do that ? TIA, --Tommy
I think need lookup
df1['New']=df2.set_index('GAME').lookup(df1.GAME,df1.PLAY)
df1
Out[26]:
GAME PLAY BET New
0 (SWE,FIN) DRAW 10 2.0
1 (DEN,GER) WIN 20 2.0
2 (RUS,CZE) LOSS 30 10.0
I like Wen's solution better, but you can use
merged = pd.merge(
pd.concat([df1, pd.get_dummies(df1.PLAY)], axis=1),
df2,
on='GAME')
>>> merged.BET * (merged.DRAW_x * merged.DRAW_y + merged.WIN_x * merged.WIN_y + merged.LOSS_x * merged.LOSS_y)
0 20.0
1 40.0
2 300.0
dtype: float64
Suppoose df.bun (df is a Pandas dataframe)is a multi-index(date and name) with variable being category values written in string,
date name values
20170331 A122630 stock-a
A123320 stock-a
A152500 stock-b
A167860 bond
A196030 stock-a
A196220 stock-a
A204420 stock-a
A204450 curncy-US
A204480 raw-material
A219900 stock-a
How can I make this to represent total counts in the same date and its percentage to make table like below with each of its date,
date variable counts Percentage
20170331 stock 7 70%
bond 1 10%
raw-material 1 10%
curncy 1 10%
I have done print(df.groupby('bun').count()) as a resort to this question but it lacks..
cf) Before getting df.bun I used the following code to import nested dictionary to Pandas dataframe.
import numpy as np
import pandas as pd
result = pd.DataFrame()
origDict = np.load("Hannah Lee.npy")
for item in range(len(origDict)):
newdict = {(k1, k2):v2 for k1,v1 in origDict[item].items() for k2,v2 in origDict[item][k1].items()}
df = pd.DataFrame([newdict[i] for i in sorted(newdict)],
index=pd.MultiIndex.from_tuples([i for i in sorted(newdict.keys())]))
print(df.bun)
I believe need SeriesGroupBy.value_counts:
g = df.groupby('date')['values']
df = pd.concat([g.value_counts(),
g.value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))
print (df)
counts percentage
date values
20170331 stock-a 6 60.0
bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-b 1 10.0
Another solution with size for counts and then divide by new Series created by transform and sum:
df2 = df.reset_index().groupby(['date', 'values']).size().to_frame('count')
df2['percentage'] = df2['count'].div(df2.groupby('date')['count'].transform('sum')).mul(100)
print (df2)
count percentage
date values
20170331 bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-a 6 60.0
stock-b 1 10.0
Difference between solutions is first sort by values per groups and second sort MultiIndex.
I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as example, the stock sh600870 is in sector ’家用电器视听器材白色家电‘. So in 2016/2/29, I wanna get the sector price in 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it maybe more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
sector, stock = col
mask = pd.notnull(index[col])
idx = index.loc[mask, col].astype(int)
result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so that now you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)