Could someone tell me how to add rows to this dataframe automatically?
I have a data frame df:
frequency
enrollment_id event days
1 access 2 3
7 8
9 4
10 3
12 2
15 21
18 4
19 8
20 20
22 16
23 2
28 2
29 14
navigate 2 1
7 4
9 1
10 3
11 1
12 1
15 5
18 1
19 1
22 3
23 1
28 1
29 2
page_close 2 1
7 6
9 2
10 3
... ...
200881 navigate 28 1
200882 discussion 28 4
navigate 28 4
200883 access 28 2
navigate 28 2
page_close 28 1
200885 navigate 21 1
200887 access 21 3
navigate 21 2
page_close 21 1
video 21 1
200888 access 21 2
discussion 21 1
navigate 21 5
page_close 21 1
video 21 1
wiki 21 1
200889 navigate 21 1
200893 navigate 21 2
200895 navigate 21 1
200896 navigate 21 1
200897 navigate 21 1
200898 navigate 21 1
200900 navigate 21 1
200901 access 21 3
navigate 21 2
page_close 21 2
video 21 1
200904 navigate 21 1
200905 navigate 21 1
This df has a 3-level MultiIndex (enrollment_id, event, days)
and only one column, frequency.
event has 7 different values, such as access, remove, etc.
days has 30 different values, 0 - 29 (not every event has all of 0 - 29; some events only have, for example, 0, 1, 4).
enrollment_id has a lot of different values (maybe 100,000). Likewise, not every day has every enrollment_id.
My question is: how can I add all the missing rows?
For example : If I have this
frequency
enrollment_id event days
1 access 2 3
7 8
I need to add rows for
frequency
enrollment_id event days
1 access 0 0
1 0
3 0
4 0
5 0
6 0
... ...
29 0
and I need to add rows for day 0 with all other enrollment_id values and frequency 0,
and all rows for access with days 0 - 29 and enrollment_id from 1 to the maximum.
I would really appreciate your help!!
EDIT:
If you only need to add missing days to the last level (days), use reindex with unstack + stack:
df = (df['frequency'].unstack(fill_value=0)
        .reindex(columns=list(range(30)), fill_value=0)
        .stack()
        .to_frame('frequency'))
If you need to add all combinations of all levels:
Reindex by a new MultiIndex created with from_product:
#get all unique values of all levels
a = df.index.get_level_values('enrollment_id').unique()
b = df.index.get_level_values('event').unique()
c = df.index.get_level_values('days').unique()
Or you can supply your own values in lists or ranges, like:
a = range(1, df.index.get_level_values('enrollment_id').max() + 1)
b = ['access', 'remove']
c = range(30)
mux = pd.MultiIndex.from_product([a,b,c], names=df.index.names)
#for missing values add 0
df = df.reindex(mux, fill_value=0)
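As a sanity check, here is a minimal self-contained sketch of the from_product approach on a toy frame (the data and the df_full name are made up for illustration):
import pandas as pd

# Hypothetical small version of the problem: two ids, two events, days 0-2
df = pd.DataFrame({
    'enrollment_id': [1, 1, 2],
    'event': ['access', 'navigate', 'access'],
    'days': [2, 0, 1],
    'frequency': [3, 1, 5],
}).set_index(['enrollment_id', 'event', 'days'])

# Full cartesian product of the three levels, in the same order as df.index.names
mux = pd.MultiIndex.from_product(
    [[1, 2], ['access', 'navigate'], range(3)], names=df.index.names)

# Existing rows keep their frequency; missing combinations are filled with 0
df_full = df.reindex(mux, fill_value=0)
print(df_full)  # 2 * 2 * 3 = 12 rows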
Related
I am having trouble applying some logic across my entire dataset. I am able to apply the logic to a small "group" but not to all of the groups (note: the groups are defined by primaryFilter and secondaryFilter). Do you all mind pointing me in the right direction?
Entire Data
import pandas as pd
import numpy as np
myInput = {
    'primaryFilter': [100,100,100,100,100,100,100,100,100,100,200,200,200,200,200,200,200,200,200,200],
    'secondaryFilter': [1,1,1,1,2,2,2,3,3,3,1,1,2,2,2,2,3,3,3,3],
    'constantValuePerGroup': [15,15,15,15,20,20,20,17,17,17,10,10,30,30,30,30,22,22,22,22],
    'someValue': [3,1,4,7,9,9,2,7,3,7,6,4,7,10,10,3,4,6,7,5]
}
df_input = pd.DataFrame(data=myInput)
df_input
Test Data (First Group)
df_test = df_input[df_input.primaryFilter.isin([100])]
df_test = df_test[df_test.secondaryFilter == 1.0]
df_test['newColumn'] = np.nan
for index, row in df_test.iterrows():
    if index == 0:
        print("start")
        df_test.loc[0, 'newColumn'] = 0
    elif index == df_test.shape[0] - 1:
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
        print("end")
    else:
        print("inter")
        df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
df_test["delta"] = df_test["constantValuePerGroup"] - df_test['newColumn']
df_test.head()
Running this on the test group produces the expected newColumn and delta values.
I now would like to apply the above logic to the remaining groups: (100, 2), (100, 3), (200, 1), and so forth.
No need to use iterrows here. You can group the dataframe on the primaryFilter and secondaryFilter columns, then for each group take the cumulative sum of the values in column someValue and shift the resulting cumulative sum one position downwards to obtain newColumn. Finally, subtract newColumn from constantValuePerGroup to get the delta.
df_input['newColumn'] = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue'].apply(lambda s: s.cumsum().shift(fill_value=0))
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']
>>> df_input
primaryFilter secondaryFilter constantValuePerGroup someValue newColumn delta
0 100 1 15 3 0 15
1 100 1 15 1 3 12
2 100 1 15 4 4 11
3 100 1 15 7 8 7
4 100 2 20 9 0 20
5 100 2 20 9 9 11
6 100 2 20 2 18 2
7 100 3 17 7 0 17
8 100 3 17 3 7 10
9 100 3 17 7 10 7
10 200 1 10 6 0 10
11 200 1 10 4 6 4
12 200 2 30 7 0 30
13 200 2 30 10 7 23
14 200 2 30 10 17 13
15 200 2 30 3 27 3
16 200 3 22 4 0 22
17 200 3 22 6 4 18
18 200 3 22 7 10 12
19 200 3 22 5 17 5
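One note on the apply call above: on recent pandas versions, groupby.apply can attach the group keys to the result index, while transform always keeps the original index and is the idiomatic choice for a same-shape operation. An equivalent sketch using transform:
df_input['newColumn'] = (df_input
    .groupby(['primaryFilter', 'secondaryFilter'])['someValue']
    .transform(lambda s: s.cumsum().shift(fill_value=0)))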
I have a pretty similar question to another question on here.
Let's assume I have two dataframes:
df
volume
11
24
30
df2
range_low range_high price
10 20 1
21 30 2
How can I filter the second dataframe based on one row of the first dataframe, keeping the rows whose range contains the value?
So, for example, value 11 from df leads to:
df3
range_low range_high price
10 20 1
whereas value 30 from df leads to an empty df3.
I am looking for a way to check whether a specific value is in a range of another dataframe, and to filter that dataframe based on this condition. In pseudocode:
Find 11 in
(10, 20), if True: df3 = filter on this row
(21, 30), if True: df3 = filter on this row
if not
return empty frame
For a loop solution, use:
for v in df['volume']:
    df3 = df2[(df2['range_low'] < v) & (df2['range_high'] > v)]
    print(df3)
For a non-loop solution it is possible to use a cross join, but with large DataFrames there may be memory problems:
df = df.assign(a=1).merge(df2.assign(a=1), on='a', how='outer')
print (df)
volume a range_low range_high price
0 11 1 10 20 1
1 11 1 21 30 2
2 24 1 10 20 1
3 24 1 21 30 2
4 30 1 10 20 1
5 30 1 21 30 2
df3 = df[(df['range_low'] < df['volume']) & (df['range_high'] > df['volume'])]
print (df3)
volume a range_low range_high price
0 11 1 10 20 1
3 24 1 21 30 2
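On pandas 1.2 or newer, the dummy a column is not needed, since merge supports a cross join directly:
df = df.merge(df2, how='cross')
df3 = df[(df['range_low'] < df['volume']) & (df['range_high'] > df['volume'])]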
I have a similar problem (but with date ranges), and if df2 is too large, it will take forever.
If the volumes are always integers, a faster solution is to create an intermediate dataframe where you associate each possible volume with a price (in one pass) and then merge.
price_list = []
for index, row in df2.iterrows():
    x = pd.DataFrame(range(row['range_low'], row['range_high'] + 1), columns=['volume'])
    x['price'] = row['price']
    price_list.append(x)
df_prices = pd.concat(price_list)
You will get something like this:
volume price
0 10 1
1 11 1
2 12 1
3 13 1
4 14 1
5 15 1
6 16 1
7 17 1
8 18 1
9 19 1
10 20 1
0 21 2
1 22 2
2 23 2
3 24 2
4 25 2
5 26 2
6 27 2
7 28 2
8 29 2
9 30 2
Then you can quickly associate a price with each volume in df:
df.merge(df_prices, on='volume')
volume price
0 11 1
1 24 2
2 30 2
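For non-integer values, or the date ranges mentioned above, a sketch based on an IntervalIndex avoids materializing every possible value; it assumes the ranges in df2 do not overlap:
# One interval per row of df2; closed='both' includes both endpoints
iv = pd.IntervalIndex.from_arrays(df2['range_low'], df2['range_high'], closed='both')

# Position of the interval containing each volume (-1 when none contains it)
pos = iv.get_indexer(df['volume'])
df3 = df2.iloc[pos[pos >= 0]]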
So the problem I seem to have is that I want to access the data in a dataframe, but only the last twelve numbers in every column. So I have a data frame:
index A B C
20 1 2 3
21 2 5 6
22 7 8 9
23 10 1 2
24 3 1 2
25 4 9 0
26 10 11 12
27 1 2 3
28 2 1 5
29 6 7 8
30 8 4 5
31 1 3 4
32 1 2 3
33 5 6 7
34 1 3 4
The values inside A, B, C are not important; they are just to show an example.
currently I am using
df1=df2.iloc[23:35]
Perhaps there is an easier way to do this, because I have to do this for around 20 different dataframes of different sizes. I know that if I use
df1=df2.iloc[-1]
it will return the last row, but I don't know how to incorporate it for the last twelve rows. Any help would be appreciated.
You can get the last n rows of a DataFrame with:
df.tail(n)
or
df.iloc[-n:]
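Since you mention around 20 dataframes of different sizes, a small sketch, assuming they are collected in a list called dfs (a hypothetical name):
# tail(12) works regardless of each frame's length or index labels
last_parts = [d.tail(12) for d in dfs]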
Currently I'm working with weekly data for different subjects, but it might have some long streaks without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got a bit close by trying to mark with a 1 when week == week.shift() + 1. The problem is that this approach doesn't mark the first occurrence in a streak, and I also can't filter the longest one:
df.loc[(df['id'] == df['id'].shift()) & (df['week'] == df['week'].shift() + 1), 'streak'] = 1
According to my example, this would give:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id', df['week'].diff(-1).ne(-1).shift().bfill().cumsum()])['week'].transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
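Because the grouping key in that one-liner is dense, here is the same expression unpacked into steps, purely as an illustration (the helper names are made up):
wk = df['week']
is_break = wk.diff(-1).ne(-1)          # True on the last row of each consecutive run
run_start = is_break.shift().bfill()   # shifted down: True on the first row of each run
streak_id = run_start.cumsum()         # increments at every run start -> one id per streak
df['consec'] = df.groupby(['id', streak_id])['week'].transform('count')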
Not as concise as @ScottBoston's, but I like this approach:
import numpy as np
import pandas as pd

def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
I have a dataframe where the left column is the left-most location of an object, and the right column is the right-most location. I need to group the objects if they overlap, or if they overlap objects that overlap (recursively).
So, for example, if this is my dataframe:
left right
0 0 4
1 5 8
2 10 13
3 3 7
4 12 19
5 18 23
6 31 35
so lines 0 and 3 overlap, thus they should be in the same group, and line 1 also overlaps line 3, so it joins the group.
So, for this example, the output should be something like this:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 18 23 1
6 31 35 2
I thought of various directions, but didn't figure it out (without an ugly for loop).
Any help will be appreciated!
I found the accepted solution (update: now deleted) to be misleading because it fails to generalize to similar cases, e.g. for the following example:
df = pd.DataFrame({'left': [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})
df
The suggested aggregate function produced a grouping in which 18-23 got its own group, whereas it should be in group 1, along with 12-19.
One solution is using the following approach (based on a method for combining intervals posted by #CentAu):
# Union intervals by @CentAu
from sympy import Interval, Union

def union(data):
    """Union of a list of intervals, e.g. [(1,2),(3,4)]"""
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column
df['group'] = df['left'].apply(
    lambda x: [g for g, l in enumerate(intervals) if l.contains(x)][0])
...which outputs the correct grouping, with 18-23 in group 1 along with 12-19.
Can you try this? Use rolling max and rolling min to find the intersection of the ranges:
df = df.sort_values(['left', 'right'])
df['Group'] = ((df.right.rolling(window=2, min_periods=1).min() - df.left.rolling(window=2, min_periods=1).max()) < 0).cumsum()
df.sort_index()
Out[331]:
left right Group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 18 23 1
6 31 35 2
For example, take (1,3) and (2,4). To find the intersection:
min(3,4) - max(1,2) = 1; 1 is more than 0, so the two intervals have an intersection.
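The same pairwise test written out in plain Python, for intuition:
a, b = (1, 3), (2, 4)                        # two sample intervals
overlap = min(a[1], b[1]) - max(a[0], b[0])  # min(3, 4) - max(1, 2) = 1
print(overlap > 0)                           # True: the intervals intersect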
You can sort the samples and use the cumulative functions cummax and cumsum. Let's take your example:
left right
0 0 4
3 3 7
1 5 8
2 10 13
4 12 19
5 13 16
6 18 23
7 31 35
First you need to sort values so that longer ranges come first:
df = df.sort_values(['left', 'right'], ascending=[True, False])
Result:
left right
0 0 4
3 3 7
1 5 8
2 10 13
4 12 19
5 13 16
6 18 23
7 31 35
Then you can find overlapping groups by comparing 'left' with the previous running maximum of 'right':
df['group'] = (df['right'].cummax().shift() <= df['left']).cumsum()
df.sort_index(inplace=True)
Result:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 13 16 1
6 18 23 1
7 31 35 2
In one line:
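A sketch of such a one-liner, assuming the same df as above (pandas index alignment restores the original row order on assignment):
df['group'] = (df.sort_values(['left', 'right'], ascending=[True, False])
                 .pipe(lambda d: (d['right'].cummax().shift() <= d['left']).cumsum()))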