Pandas groupby 7 days - python

How can I sum my data counts by week, and if the last week is not yet complete, calculate a normalized value from the average?
Let's say these are my lists:
days = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
Thanks

Here is a solution with Pandas:
1) Create the dataframe:
import pandas as pd

df = pd.DataFrame({'days': days, 'counts': counts})
df['week'] = df.days.sub(1) // 7  # days 1-7 -> week 0, days 8-14 -> week 1, ...
2) Calculate the sum and mean by week, then produce the normalized sum (weekly mean * 7):
d2 = df.groupby('week').agg({'counts': ['sum', 'mean']})
d2['norm_sum'] = d2[('counts', 'mean')] * 7
3) Output:
print(d2)

     counts               norm_sum
        sum         mean
week
0     11716  1673.714286  11716.000000
1     10194  1456.285714  10194.000000
2      7563  1080.428571   7563.000000
3      4671   667.285714   4671.000000
4      3412  1137.333333   7961.333333
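As a quick sanity check on the normalization: the incomplete week 4 covers only days 29-31, so its mean is 3412 / 3 ≈ 1137.33 and norm_sum ≈ 1137.33 * 7 ≈ 7961.33, while for complete weeks norm_sum simply equals sum.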

I do not know how to use pandas for this, but here is how I would do it using only built-in Python modules:
from collections import defaultdict
from statistics import mean
days = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
weeks = [(d - 1) // 7 for d in days]  # same grouping as above: days 1-7 -> week 0
avg_count = int(mean(counts))
weeks = weeks + [weeks[-1]] * (-len(weeks) % 7)     # pad weeks to a multiple of 7
counts = counts + [avg_count] * (-len(counts) % 7)  # pad counts to a multiple of 7
count_per_week = defaultdict(int)
for w, c in zip(weeks, counts):
    count_per_week[w] += c
print(dict(count_per_week))
Output:
{0: 11716, 1: 10194, 2: 7563, 3: 4671, 4: 8256}
Note that I assume the average is a reasonable filler value, which does not always have to hold true. When asked for a non-existing key, defaultdict(int) sets that key's value to int(), i.e. 0.
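To illustrate that defaultdict behaviour with a standalone snippet (not part of the solution above):
from collections import defaultdict

d = defaultdict(int)
d['a'] += 5     # the missing key 'a' starts at int() == 0, then 5 is added
print(d['b'])   # 0 -- merely reading a missing key also inserts int() == 0
print(dict(d))  # {'a': 5, 'b': 0}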

This is my approach:
The data:
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
To an array:
import numpy as np

counts = np.array(counts)
Reshape (thanks to https://stackoverflow.com/users/4427777/daniel-f):
def shapeshifter(num_col, my_array):
    # zero-pad so the length is a multiple of num_col, then reshape to rows of num_col
    return np.pad(my_array, (0, -len(my_array) % num_col),
                  'constant', constant_values=0).reshape(-1, num_col)

data = shapeshifter(7, counts)
array([[1839, 1334, 2241, 2063, 1216, 1409, 1614],
       [1860, 1298, 1140, 1122, 2153,  971, 1650],
       [1835,  889,  653,  484, 2078, 1198,  426],
       [ 684,  910,  701,  851,  360,  763,  402],
       [1853,  400, 1159,    0,    0,    0,    0]])
To a dataframe, with zeros converted to NaN:
df = pd.DataFrame(data)
df[df == 0] = np.nan
Fill the missing values with the mean value of the month (note the assignment back to df):
df = df.fillna(counts.mean())
      0     1     2            3            4            5            6
0  1839  1334  2241  2063.000000  1216.000000  1409.000000  1614.000000
1  1860  1298  1140  1122.000000  2153.000000   971.000000  1650.000000
2  1835   889   653   484.000000  2078.000000  1198.000000   426.000000
3   684   910   701   851.000000   360.000000   763.000000   402.000000
4  1853   400  1159  1211.483871  1211.483871  1211.483871  1211.483871
Get the sum by row, i.e. by week:
df.sum(axis=1)

0    11716.000000
1    10194.000000
2     7563.000000
3     4671.000000
4     8257.935484
dtype: float64
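Note the last row: the three real values (3412 in total) were padded with four copies of the month mean, giving 3412 + 4 * 1211.483871 ≈ 8257.94 as the normalized total for the incomplete week.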

Related

Assign new column in DataFrame based on if value is in a certain value range

I have two DataFrames as follows:
import numpy as np
import pandas as pd

df_discount = pd.DataFrame(data={'Graduation': np.arange(0, 1000, 100), 'Discount %': np.arange(0, 50, 5)})
df_values = pd.DataFrame(data={'Sum': [20, 801, 972, 1061, 1251]})
Now my goal is to get a new column df_values['New Sum'] for my df_values that applies the corresponding discount to df_values['Sum'], based on the value of df_discount['Graduation']. If the Sum is >= the Graduation, the corresponding discount is applied.
Examples: Sum 801 should get a discount of 40%, resulting in 480.6; Sum 1061 gets 45%, resulting in 583.55.
I know I could write a function with if/else conditions that returns the values. However, is there a better way to do this when you have very many different conditions?
You could try if pd.merge_asof() works for you:
df_discount = pd.DataFrame({
    'Graduation': np.arange(0, 1000, 100), 'Discount %': np.arange(0, 50, 5)
})
df_values = pd.DataFrame({'Sum': [20, 100, 101, 350, 801, 972, 1061, 1251]})
df_values = (
    pd.merge_asof(
        df_values, df_discount,
        left_on="Sum", right_on="Graduation",
        direction="backward"
    )
    .assign(New_Sum=lambda df: df["Sum"] * (1 - df["Discount %"] / 100))
    .drop(columns=["Graduation", "Discount %"])
)
Result (without the final .drop(columns=...), to see what's happening):

    Sum  Graduation  Discount %  New_Sum
0    20           0           0    20.00
1   100         100           5    95.00
2   101         100           5    95.95
3   350         300          15   297.50
4   801         800          40   480.60
5   972         900          45   534.60
6  1061         900          45   583.55
7  1251         900          45   688.05
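Note that pd.merge_asof() requires both key columns to be sorted in ascending order; the Sum and Graduation columns here already are, otherwise you would have to sort first.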
pandas.cut() is made for problems like this where you need to segment your data into bins (i.e. discount % based on value range).
First define the column, the ranges, and the corresponding bins.
# The column we need to segment
col = df_values['Sum']
# The bin edges: [0, 100, 200, ..., 900, np.inf] gives the intervals (0, 100], (100, 200], ..., (900, inf)
graduation = np.append(df_discount['Graduation'], np.inf)
# For each range what is the corresponding bin (i.e. discount)
discount = df_discount['Discount %']
Now call pandas.cut() and do the discount calculation.
df_values['Discount %'] = pd.cut(col,
                                 graduation,
                                 labels=discount)
# Convert the categorical label to an int for the calculation
df_values['Discount %'] = df_values['Discount %'].astype(int)
df_values['New Sum'] = df_values['Sum'] * (1 - df_values['Discount %'] / 100)
    Sum  Discount %  New Sum
0    20           0    20.00
1   801          40   480.60
2   972          45   534.60
3  1061          45   583.55
4  1251          45   688.05
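One caveat: pd.cut() uses half-open intervals like (0, 100] by default (right=True), so a Sum of exactly 100 would land in the lower bin (0% here), and a Sum of exactly 0 would come out as NaN; pass right=False or include_lowest=True if those edge cases matter.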
You can use pandas.DataFrame.mask: wherever the condition is true, the value is replaced. But for that, your Sum column has to be inside the first dataframe.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html
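No example was given, but a minimal sketch of how mask could be applied here (the loop over the discount tiers is my own construction, not from the original answer):
import numpy as np
import pandas as pd

df_discount = pd.DataFrame({'Graduation': np.arange(0, 1000, 100),
                            'Discount %': np.arange(0, 50, 5)})
df_values = pd.DataFrame({'Sum': [20, 801, 972, 1061, 1251]})

# Start from the undiscounted values; each mask() call overwrites the rows
# that reach the next graduation step, so the highest matching tier wins.
new_sum = df_values['Sum'].astype(float)
for grad, disc in zip(df_discount['Graduation'], df_discount['Discount %']):
    new_sum = new_sum.mask(df_values['Sum'] >= grad,
                           df_values['Sum'] * (1 - disc / 100))
df_values['New Sum'] = new_sum  # 20.0, 480.6, 534.6, 583.55, 688.05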

How do I create a new column based on existing columns in pandas?

I am new to Python/pandas. I want to compute continuous returns based on the "GOOG" price. If the price is in column (a), how should I calculate the return in column (b) according to the following formula?
continuous return = ln(P_t / P_{t-1})
I want to do the equivalent of the Excel calculation of continuous returns in a Pandas DataFrame.
import numpy as np
import pandas as pd

x = pd.DataFrame([2340, 2304, 2238, 2260, 2315, 2318, 2300, 2310, 2353, 2350],
                 columns=['a'])
Try:
x['b'] = np.log(x['a'] / x['a'].shift())  # shift() aligns each price with the previous row's
Output:

      a         b
0  2340       NaN
1  2304 -0.015504
2  2238 -0.029064
3  2260  0.009782
4  2315  0.024045
5  2318  0.001295
6  2300 -0.007796
7  2310  0.004338
8  2353  0.018444
9  2350 -0.001276
You can use a generator function with .apply:
import numpy as np
import pandas as pd

x = pd.DataFrame(
    [2340, 2304, 2238, 2260, 2315, 2318, 2300, 2310, 2353, 2350], columns=["a"]
)

def fn():
    # remember the previous value between calls; the first call yields NaN
    old_a = np.nan
    a = yield
    while True:
        new_a = yield np.log(a / old_a)
        a, old_a = new_a, a

s = fn()
next(s)  # prime the generator
x["b"] = x["a"].apply(lambda v: s.send(v))
print(x)
Prints:

      a         b
0  2340       NaN
1  2304 -0.015504
2  2238 -0.029064
3  2260  0.009782
4  2315  0.024045
5  2318  0.001295
6  2300 -0.007796
7  2310  0.004338
8  2353  0.018444
9  2350 -0.001276

iterating through rows and columns of stock price python

I have the code below:
result, diff = [], []
for index, row in final.iterrows():
    for column in final.columns:
        if ((final['close'] - final['open']) > 20):
            diff = final['close'] - final['open']
            result = 1
        elif ((final['close'] - final['open']) < -20):
            diff = final['close'] - final['open']
            result = -1
        elif (-20 < (final['close'] - final['open']) < 20):
            diff = final['close'] - final['open']
            result = 0
        else:
            continue
The intention is, for every timestamp, to check whether close - open is greater than 20 pips and, if so, assign a buy value; if it's less than -20, assign a sell value; if in between, assign 0.
I am getting this error
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
[Finished in 35.418s]
Someone more experienced with pandas would give a better answer, but since no one is answering, here's mine. You generally don't want to iterate over a pandas.DataFrame directly, as that defeats the purpose. A pandas solution would look more like:
import pandas as pd

data = {
    'symbol': ['WZO', 'FDL', 'KXW', 'GYU', 'MIR', 'YAC', 'CDE', 'DSD', 'PAM', 'BQE'],
    'open': [356, 467, 462, 289, 507, 654, 568, 646, 440, 625],
    'close': [399, 497, 434, 345, 503, 665, 559, 702, 488, 608]
}
df = pd.DataFrame.from_dict(data)
df['diff'] = df['close'] - df['open']
df.loc[(df['diff'] < 20) & (df['diff'] > -20), 'result'] = 0
df.loc[df['diff'] >= 20, 'result'] = 1
df.loc[df['diff'] <= -20, 'result'] = -1
df now contains:

  symbol  open  close  diff  result
0    WZO   356    399    43     1.0
1    FDL   467    497    30     1.0
2    KXW   462    434   -28    -1.0
3    GYU   289    345    56     1.0
4    MIR   507    503    -4     0.0
5    YAC   654    665    11     0.0
6    CDE   568    559    -9     0.0
7    DSD   646    702    56     1.0
8    PAM   440    488    48     1.0
9    BQE   625    608   -17     0.0
Regarding your code, I'll repeat my comment from above: you are iterating by row, but then using the whole DataFrame final in your conditions; I think you meant to use row there. You don't need to iterate over columns and grab your values by index. Your conditions also miss the case where final['close'] - final['open'] is exactly 20. Finally, result, diff = [], [] are lists at the top, but are then assigned integers in the loop; perhaps you want result.append()? A row-wise fix along those lines is sketched below.
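For reference, a minimal sketch of that row-wise fix (still much slower than the vectorized version):
result, diff = [], []
for index, row in final.iterrows():
    d = row['close'] - row['open']  # a scalar, so the comparisons are unambiguous
    diff.append(d)
    if d >= 20:
        result.append(1)
    elif d <= -20:
        result.append(-1)
    else:
        result.append(0)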

Subtract from each cell in pandas dataframe based on value

I have a df like this (it's a dataframe and all values are floats):
import numpy as np
import pandas as pd

data = np.random.randint(3000, size=(10, 1))
data = pd.DataFrame(data)
For each value, if it's between 570 and 1140, I want to subtract 570.
If it's over 1140, I want to subtract 1140 from the value. I wrote this function to do that.
def AdjustTimes(val):
    if val > 570 and val < 1140:
        val = val - 570
    elif val > 1140:
        val = val - 1140
Based on another question I tried to apply it using data.applymap(AdjustTimes). I got no error but the function does not seem to have been applied.
Setup
data

      0
0  1863
1  2490
2  2650
3  2321
4   822
5    82
6  2192
7   722
8  2537
9   874
First, let's create masks for each of your conditions. One idiomatic pandas approach is using between to retrieve a mask for the first condition -
m1 = data.loc[:, 0].between(570, 1140, inclusive='both')
Or, you can do this with a couple of logical operators -
m1 = data.loc[:, 0].ge(570) & data.loc[:, 0].le(1140)
And,
m2 = data.loc[:, 0].gt(1140)
Now, to perform replacement, you have a couple of options.
Option 1
Use loc to index and subtract -
data.loc[m1, 0] -= 570
data.loc[m2, 0] -= 1140
data

      0
0   723
1  1350
2  1510
3  1181
4   252
5    82
6  1052
7   152
8  1397
9   304
Equivalent version for a pd.Series -
m1 = data.ge(570) & data.le(1140)
m2 = data.gt(1140)
data.loc[m1] -= 570
data.loc[m2] -= 1140
Option 2
You can also do this with np.where (though it'd be a bit less efficient).
v = data.loc[:, 0]
data.loc[:, 0] = np.where(m1, v - 570, np.where(m2, v - 1140, v))
Here, m1 and m2 are the masks computed from before.
data

      0
0   723
1  1350
2  1510
3  1181
4   252
5    82
6  1052
7   152
8  1397
9   304
Equivalent pd.Series code -
data[:] = np.where(m1, data - 570, np.where(m2, data - 1140, data))
Could you try something like:
data = np.random.randint(3000, size=(10, 1))
data = pd.DataFrame(data)
data = data - 570*((data > 570) & (data < 1140)) - 1140*(data > 1140)
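This works because the boolean masks are coerced to 0 and 1 in the arithmetic, so each cell has 570, 1140, or nothing subtracted in a single vectorized expression.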
The applymap method is designed to generate a new dataframe rather than modify an existing one (and the function it calls should return a value for the new cell rather than modify its argument). You don't show the line where you actually call applymap, but I suspect it's just data.applymap(AdjustTimes) on its own. If you change your code to the following, it should work fine:
def AdjustTimes(val):
    if val >= 1140:
        return val - 1140
    elif val >= 570:
        return val - 570
    return val  # values below 570 come back unchanged

data = data.applymap(AdjustTimes)
(I've also cleaned up the if statements to be a little faster, to handle the case where val = 1140, which your original code wouldn't adjust, and to return values below 570 unchanged.)

How to (efficiently) check if any two elements differ by 10

Suppose I have the following column in a pandas dataframe:

      x
1   589
2   354
3   692
4   474
5   739
6   731
7   259
8   723
9   497
10   48
Note: I've changed the indexing to start at 1 (see test data).
I simply wish to test whether the difference between any two of the items in this column is less than 10.
Final result: no two elements should have an absolute difference of less than 10.
Goal:
      x
1   589
2   354
3   692
4   474
5   749  #
6   731
7   259
8   713  #
9   497
10   48
Perhaps this could be done using:
for index, row in df.iterrows():
However, that has not been successful thus far...
Given I'm looking to perform element-wise comparisons, I don't expect blazing speed...
Test Data:
import pandas as pd

stim_numb = 10
df = pd.DataFrame(index=range(1, stim_numb + 1), columns=['x'])
df['x'] = [589, 354, 692, 474, 739, 731, 259, 723, 497, 48]
One solution might be to sort the list, then compare consecutive items, adding 10 whenever the difference is too small, and then sorting the list back to the original order (if necessary).
from operator import itemgetter

lst = [589, 354, 692, 474, 739, 731, 259, 723, 497, 48]
# temp is the list as [original index, value] pairs, sorted by value
temp = [[i, e] for i, e in sorted(enumerate(lst), key=itemgetter(1))]
last = None
for item in temp:
    while last is not None and item[1] < last + 10:
        item[1] += 10
    last = item[1]
# sort the list back to the original order using the stored index
lst_new = [e for i, e in sorted(temp, key=itemgetter(0))]
Result is [589, 354, 692, 474, 759, 741, 259, 723, 497, 48]
This is using plain Python lists; maybe it can be done more elegantly in Pandas or Numpy.
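For instance, a rough NumPy sketch of the same idea (my own translation, using the same bump-by-10 rule as above):
import numpy as np

x = np.array([589, 354, 692, 474, 739, 731, 259, 723, 497, 48])
order = np.argsort(x)                   # positions that sort the values
vals = x[order].copy()
for i in range(1, len(vals)):
    while vals[i] < vals[i - 1] + 10:   # same bump-by-10 rule as above
        vals[i] += 10
result = np.empty_like(vals)
result[order] = vals                    # undo the sort
print(result.tolist())  # [589, 354, 692, 474, 759, 741, 259, 723, 497, 48]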
