Summing values across given range of days difference backwards - Pandas - python

I have created a days difference column in a pandas dataframe, and I'm looking to add a column that has the sum of a specific value over a given days window backwards
Notice that I can supply a date column for each row if it is needed, but the diff was created as days difference from the first day of the data.
Example
df = pd.DataFrame.from_dict({'diff': [0,0,1,2,2,2,2,10,11,15,18],
'value': [10,11,15,2,5,7,8,9,23,14,15]})
df
Out[12]:
diff value
0 0 10
1 0 11
2 1 15
3 2 2
4 2 5
5 2 7
6 2 8
7 10 9
8 11 23
9 15 14
10 18 15
I want to add 5_days_back_sum column that will sum the past 5 days, including same day so the result would be like this
Out[15]:
5_days_back_sum diff value
0 21 0 10
1 21 0 11
2 36 1 15
3 58 2 2
4 58 2 5
5 58 2 7
6 58 2 8
7 9 10 9
8 32 11 23
9 46 15 14
10 29 18 15
How can I achieve that? Originally I have a date column to create the diff column, if that helps its available

Use custom function with boolean indexing for filtering range with sum:
def f(x):
return df.loc[(df['diff'] >= x - 5) & (df['diff'] <= x), 'value'].sum()
df['5_days_back_sum'] = df['diff'].apply(f)
print (df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
Similar solution with between:
def f(x):
return df.loc[df['diff'].between(x - 5, x), 'value'].sum()
df['5_days_back_sum'] = df['diff'].apply(f)
print (df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29

Related

Hash table mapping in Pandas

I have a large dataset with millions of rows of data. One of the data columns is ID.
I also have another (hash)table that maps the range of indices to a specific group that meets a certain criteria.
What is an efficient way to map the range of indices to include them as an additional column on my dataset in pandas?
As an example, lets say that the dataset looks like this:
In [18]:
print(df_test)
Out [19]:
ID
0 13
1 14
2 15
3 16
4 17
5 18
6 19
7 20
8 21
9 22
10 23
11 24
12 25
13 26
14 27
15 28
16 29
17 30
18 31
19 32
Now the hash table with the range of indices looks like this:
In [20]:
print(df_hash)
Out [21]:
ID_first
0 0
1 2
2 10
where the index specifies the group number that I need.
I tried doing something like this:
for index in range(df_hash.size):
try:
df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
except:
df_test.loc[df_hash.ID_first[index]:, 'Group'] = index
Which works well, except that it is really slow as it loops over the length of the hash table dataframe (hundreds of thousands of rows). It produces the following answer (which I want):
In [23]:
print(df_test)
Out [24]:
ID Group
0 13 0
1 14 0
2 15 1
3 16 1
4 17 1
5 18 1
6 19 1
7 20 1
8 21 1
9 22 1
10 23 2
11 24 2
12 25 2
13 26 2
14 27 2
15 28 2
16 29 2
17 30 2
18 31 2
19 32 2
Is there a way to do this more efficiently?
You can map the index of df_test using ID_first to the index of df_hash, and then ffill. Need to construct a Series as the pd.Index class doesn't have a ffill method.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
index=df_test.index)
.ffill(downcast='infer'))
# ID group
#0 13 0
#1 14 0
#2 15 1
#...
#9 22 1
#10 23 2
#...
#17 30 2
#18 31 2
#19 32 2
you can do series.isin with series.cumsum
df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum() #.sub(1)
print(df_test)
ID group
0 0 1
1 1 1
2 2 2
3 3 2
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
12 12 3
13 13 3
14 14 3
15 15 3
16 16 3
17 17 3
18 18 3
19 19 3

How to look for same columns from one dataframe in other dataframe pandas python?

I have one dataframe like this,
tabla_aciertos= {'Numeros_acertados' : [5,5,5,4,4,3,4,2,3,3,1,2,2],'Estrellas_acertadas': [2,1,0,2,1,2,0,2,1,0,2,1,0]}
categorias = [1,2,3,4,5,6,7,8,9,10,11,12,13]
categoria_de_premios = pd.DataFrame (tabla_aciertos,index = [categorias] )
categoria_de_premios
Numeros_acertados Estrellas_acertadas
1 5 2
2 5 1
3 5 0
4 4 2
5 4 1
6 3 2
7 4 0
8 2 2
9 3 1
10 3 0
11 1 2
12 2 1
13 2 0
and another df :
sorteos_anteriores.iloc[:,:]
uno dos tres cuatro cinco Estrella1 Estrella2 bolas_Acertadas estrellas_Acertadas
Fecha
2020-10-13 5 14 38 41 46 1 10 0 1
2020-09-10 11 15 35 41 50 5 8 1 0
2020-06-10 4 21 36 41 47 9 11 0 0
2020-02-10 6 12 15 40 45 3 9 0 0
2020-09-29 4 14 16 41 44 11 12 0 1
... ... ... ... ... ... ... ... ... ...
2004-12-03 15 24 28 44 47 4 5 0 0
2004-05-03 4 7 33 37 39 1 5 0 1
2004-02-27 14 18 19 31 37 4 5 0 0
2004-02-20 7 13 39 47 50 2 5 1 0
2004-02-13 16 29 32 36 41 7 9 0 0
1363 rows × 9 columns
Now I need to see in each and every row of the df "sorteos_anteriores" is in one of the all row from the first df, "tabla_aciertos" .
Let me give you one example,
Inmagine in "sorteos_anteriores" you have in:
2019-11-2 in the column "bolas_Acertadas"= 5 and "estrellas_Acertadas= 1". Now you go to fist table, "tabla_aciertos" and you find that in (index 2 = "Numeros_acertados" = 5 and Estrellas_acertadas=1) . You have won a second (index=2) class prize. You should create a new column "Prize" in "sorteos_anteriores" and in each row write a number from 1 to 13 if you have some kind of prize of 0 or Nan if you not.
I have try :
sorteos_anteriores ['categorias'] = sorteos_anteriores(sorteos_anteriores.loc[:,'bolas_Acertadas':'estrellas_Acertadas'] == tabla_premios.iloc[ : ,0:2])
Also with where and merge, but nothing works.
Thanks for your help.
Thanks to Cuina Max I could do it.
answer here
# supposing that the indexes, starting from one, correspond to the the premiums
categoria_de_premios['Categoria'] = df.index
# Merge using pd.merge and the appropriate arguments
sorteos_anteriores = (sorteos_anteriores.merge(
categoria_de_premios,
how='outer',
left_on=['bolas_Acertadas','estrellas_Acertadas'],
right_on=['Numeros_acertados', 'Estrellas_acertadas']
)).drop(columns=['Numeros_acertados', 'Estrellas_acertadas'])

Pandas dataframe problem. Create column where a row cell gets the value of another row cell

I have this pandas dataframe. It is sorted by the "h" column. What I want is to add two new columns where:
The items of each zone, will have a max boundary and a min boundary. (They will be the same for every item in the zone). The max boundary will be the minimum "h" value of the previous zone, and the min boundary will be the maximum "h" value of the next zone
name h w set row zone
ZZON5 40 36 A 0 0
DWOPN 38 44 A 1 0
5SWYZ 37 22 B 2 0
TFQEP 32 55 B 3 0
OQ33H 26 41 A 4 1
FTJVQ 24 25 B 5 1
F1RK2 20 15 B 6 1
266LT 18 19 A 7 1
HSJ3X 16 24 A 8 2
L754O 12 86 B 9 2
LWHDX 11 68 A 10 2
ZKB2F 9 47 A 11 2
5KJ5L 7 72 B 12 3
CZ7ET 6 23 B 13 3
SDZ1B 2 10 A 14 3
5KWRU 1 59 B 15 3
what i hope for:
name h w set row zone maxB minB
ZZON5 40 36 A 0 0 26
DWOPN 38 44 A 1 0 26
5SWYZ 37 22 B 2 0 26
TFQEP 32 55 B 3 0 26
OQ33H 26 41 A 4 1 32 16
FTJVQ 24 25 B 5 1 32 16
F1RK2 20 15 B 6 1 32 16
266LT 18 19 A 7 1 32 16
HSJ3X 16 24 A 8 2 18 7
L754O 12 86 B 9 2 18 7
LWHDX 11 68 A 10 2 18 7
ZKB2F 9 47 A 11 2 18 7
5KJ5L 7 72 B 12 3 9
CZ7ET 6 23 B 13 3 9
SDZ1B 2 10 A 14 3 9
5KWRU 1 59 B 15 3 9
Any ideas?
First group-by zone and find the minimum and maximum of them
min_max_zone = df.groupby('zone').agg(min=('h', 'min'), max=('h', 'max'))
Now you can use apply:
df['maxB'] = df['zone'].apply(lambda x: min_max_zone.loc[x-1, 'min']
if x-1 in min_max_zone.index else np.nan)
df['minB'] = df['zone'].apply(lambda x: min_max_zone.loc[x+1, 'max']
if x+1 in min_max_zone.index else np.nan)

Incremental assignment in pandas dataframe to determine month from week number without date element

I'm having week numbers in the dataframe from 1 to 52 e.g. [1,2,3,4,5,6,7,8,..52]
I'm trying to create a new column for month but it would mean an incremental assignment like [1,2,3,4] = 1, [5,6,7,8] = 2, .. [49,50,51,52] = 12
I tried getting the records by multiple of 4 using df[df["week"]%4==0] and then ffill it but seems like we can only assign it all to the same number which is not what I want. Instead I want to assign [1..12] accordingly. Is there another way to do this?
Subtract 1 first and then use integer division by 4:
df = pd.DataFrame({'week':range(1,53)})
df['new'] = (df["week"] - 1)//4
print (df.head(10))
week new
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 6 1
6 7 1
7 8 1
8 9 2
9 10 2
print (df.tail(10))
week new
42 43 10
43 44 10
44 45 11
45 46 11
46 47 11
47 48 11
48 49 12
49 50 12
50 51 12
51 52 12
If want starting by 1 it is possible, but last value is 13:
df['new'] = ((df["week"] - 1)//4) + 1
print (df.head(10))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 2
5 6 2
6 7 2
7 8 2
8 9 3
9 10 3
print (df.tail(10))
week new
42 43 11
43 44 11
44 45 12
45 46 12
46 47 12
47 48 12
48 49 13
49 50 13
50 51 13
51 52 13
If want values between 1 and 12 (but some groups has more like 4 values) use, solution by #Aryerez, thank you:
df['new'] = ((df["week"] - 1) // (52 / 12)).astype(int) + 1
print (df.head(10))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 2
6 7 2
7 8 2
8 9 2
9 10 3
print (df.tail(10))
week new
42 43 10
43 44 10
44 45 11
45 46 11
46 47 11
47 48 11
48 49 12
49 50 12
50 51 12
51 52 12
EDIT: For 5 values in each 3rd group use:
df['new'] = ((df["week"] + 4) // (52 / 12)).astype(int)
print (df.head(15))
week new
0 1 1
1 2 1
2 3 1
3 4 1
4 5 2
5 6 2
6 7 2
7 8 2
8 9 3
9 10 3
10 11 3
11 12 3
12 13 3
13 14 4
14 15 4
print (df.tail(15))
week new
37 38 9
38 39 9
39 40 10
40 41 10
41 42 10
42 43 10
43 44 11
44 45 11
45 46 11
46 47 11
47 48 12
48 49 12
49 50 12
50 51 12
51 52 12

grouping by id and a condition

I have a dataframe df
df=DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
'min':[10,17,21,22,22,7,58,15,17,19,19,19,19,19,25,26,26],
'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
that looks like
id min day
0 a 10 15
1 a 17 15
2 a 21 15
3 a 30 15
4 a 50 15
5 a 57 17
6 a 58 17
7 b 15 41
8 b 17 41
9 b 19 41
10 b 19 41
11 b 19 41
12 b 19 41
13 b 19 41
14 b 25 57
15 b 26 57
16 b 26 57
I want a new column that categorizes the data in a certain format based on the id and the relationship between the rows as follows, if min value difference for consecutive rows is less than 8 and the day value is the same I want to assign them to the same group, so my output would look like.
id min day category
0 a 10 15 1
1 a 17 15 1
2 a 21 15 1
3 a 30 15 2
4 a 50 15 3
5 a 57 17 4
6 a 58 17 4
7 b 15 41 5
8 b 17 41 5
9 b 19 41 5
10 b 19 41 5
11 b 19 41 5
12 b 19 41 5
13 b 19 41 5
14 b 25 57 6
15 b 26 57 6
16 b 26 57 6
hope this helps. let me know your views.
All the best.
import pandas as pd
df=pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
'min':[10,17,21,22,22,7,58,15,17,19,19,19,19,19,25,26,26],
'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
# initialize the catagory to 1 for counter increament
cat =1
# for the first row the catagory will be 1
new_series = [cat]
# loop will start from 1 and not from 0 because we cannot perform operation on iloc -1
for i in range(1,len(df)):
if df.iloc[i]['day'] == df.iloc[i-1]['day']:
if df.iloc[i]['min'] - df.iloc[i-1]['min'] > 8:
cat+=1
else:
cat+=1
new_series.append(cat)
df['catagory']= new_series
print(df)

Categories