How to calculate percentage change on this simple data frame? - python

I have data that looks like this:
+------+---------+------+-------+
| Year | Cluster | AREA | COUNT |
+------+---------+------+-------+
| 2016 |       0 |   10 |  2952 |
| 2016 |       1 |   10 |  2556 |
| 2016 |       2 |   10 |  8867 |
| 2016 |       3 |   10 |  9786 |
| 2017 |       0 |   10 |  2470 |
| 2017 |       1 |   10 |  3729 |
| 2017 |       2 |   10 |  8825 |
| 2017 |       3 |   10 |  9114 |
| 2018 |       0 |   10 |  1313 |
| 2018 |       1 |   10 |  3564 |
| 2018 |       2 |   10 |  7245 |
| 2018 |       3 |   10 |  6990 |
+------+---------+------+-------+
I have to get the percentage changes for each cluster compared to the previous year, e.g.
+------+---------+------+-------+----------------+
| Year | Cluster | AREA | COUNT | Percent Change |
+------+---------+------+-------+----------------+
| 2016 |       0 |   10 |  2952 |            NaN |
| 2017 |       0 |   10 |  2470 |        -16.33% |
| 2018 |       0 |   10 |  1313 |        -46.84% |
| 2016 |       1 |   10 |  2556 |            NaN |
| 2017 |       1 |   10 |  3729 |         45.89% |
| 2018 |       1 |   10 |  3564 |         -4.42% |
| 2016 |       2 |   10 |  8867 |            NaN |
| 2017 |       2 |   10 |  8825 |         -0.47% |
| 2018 |       2 |   10 |  7245 |        -17.90% |
| 2016 |       3 |   10 |  9786 |            NaN |
| 2017 |       3 |   10 |  9114 |         -6.87% |
| 2018 |       3 |   10 |  6990 |        -23.30% |
+------+---------+------+-------+----------------+
Is there an easy way to do this?
I've tried a few things; the attempt below seemed to make the most sense, but it returns NaN for every pct_change (grouping by both Cluster and Year leaves only one row per group, so there is no previous row to compare against):
df['pct_change'] = df.groupby(['Cluster','Year'])['COUNT'].pct_change()
+------+---------+------+------------+------------+
| Year | Cluster | AREA | Count      | pct_change |
+------+---------+------+------------+------------+
| 2016 |       0 |   10 | 295200.00% |        NaN |
| 2016 |       1 |   10 | 255600.00% |        NaN |
| 2016 |       2 |   10 | 886700.00% |        NaN |
| 2016 |       3 |   10 | 978600.00% |        NaN |
| 2017 |       0 |   10 | 247000.00% |        NaN |
| 2017 |       1 |   10 | 372900.00% |        NaN |
| 2017 |       2 |   10 | 882500.00% |        NaN |
| 2017 |       3 |   10 | 911400.00% |        NaN |
| 2018 |       0 |   10 | 131300.00% |        NaN |
| 2018 |       1 |   10 | 356400.00% |        NaN |
| 2018 |       2 |   10 | 724500.00% |        NaN |
| 2018 |       3 |   10 | 699000.00% |        NaN |
+------+---------+------+------------+------------+
Basically, I simply want the function to compare the year-on-year change for each cluster.

df['pct_change'] = df.groupby('Cluster')['COUNT'].pct_change()
df = df.sort_values('Cluster', ascending=True)
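A minimal, self-contained sketch of this groupby approach, reconstructing the sample data from the question (column names as shown there):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Year": [2016] * 4 + [2017] * 4 + [2018] * 4,
    "Cluster": [0, 1, 2, 3] * 3,
    "AREA": [10] * 12,
    "COUNT": [2952, 2556, 8867, 9786,
              2470, 3729, 8825, 9114,
              1313, 3564, 7245, 6990],
})

# Sort so that within each cluster the rows are in chronological order,
# then compare each year's COUNT to the previous year's within that cluster.
df = df.sort_values(["Cluster", "Year"]).reset_index(drop=True)
df["pct_change"] = df.groupby("Cluster")["COUNT"].pct_change() * 100
```

The first year of each cluster has no predecessor, so its pct_change is NaN, matching the expected output above.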

Another method, going old school with transform:
df['p'] = df.groupby('cluster')['count'].transform(lambda x: (x-x.shift())/x.shift())
df = df.sort_values(by='cluster')
print(df)
year cluster area count p
0 2016 0 10 2952 NaN
4 2017 0 10 2470 -0.163279
8 2018 0 10 1313 -0.468421
1 2016 1 10 2556 NaN
5 2017 1 10 3729 0.458920
9 2018 1 10 3564 -0.044248
2 2016 2 10 8867 NaN
6 2017 2 10 8825 -0.004737
10 2018 2 10 7245 -0.179037
3 2016 3 10 9786 NaN
7 2017 3 10 9114 -0.068670
11 2018 3 10 6990 -0.233048

cumsum bounded within a range (python, pandas)

I have a df where I'd like the cumsum to be bounded within a range of 0 to 6, where a sum over 6 rolls over to 0. The adj_cumsum column is what I'm trying to get. I've searched and found a couple of posts using loops; however, since mine is more straightforward, I'm wondering whether there is a less complicated or more up-to-date approach.
+----+-------+------+----------+----------------+--------+------------+
|    | month | days | adj_days | adj_days_shift | cumsum | adj_cumsum |
+----+-------+------+----------+----------------+--------+------------+
|  0 | jan   |   31 |        3 |              0 |      0 |          0 |
|  1 | feb   |   28 |        0 |              3 |      3 |          3 |
|  2 | mar   |   31 |        3 |              0 |      3 |          3 |
|  3 | apr   |   30 |        2 |              3 |      6 |          6 |
|  4 | may   |   31 |        3 |              2 |      8 |          1 |
|  5 | jun   |   30 |        2 |              3 |     11 |          4 |
|  6 | jul   |   31 |        3 |              2 |     13 |          6 |
|  7 | aug   |   31 |        3 |              3 |     16 |          2 |
|  8 | sep   |   30 |        2 |              3 |     19 |          5 |
|  9 | oct   |   31 |        3 |              2 |     21 |          0 |
| 10 | nov   |   30 |        2 |              3 |     24 |          3 |
| 11 | dec   |   31 |        3 |              2 |     26 |          5 |
+----+-------+------+----------+----------------+--------+------------+
data = {"month": ['jan','feb','mar','apr',
                  'may','jun','jul','aug',
                  'sep','oct','nov','dec'],
        "days": [31,28,31,30,31,30,31,31,30,31,30,31]}
df = pd.DataFrame(data)
df['adj_days'] = df['days'] - 28
df['adj_days_shift'] = df['adj_days'].shift(1)
df['cumsum'] = df.adj_days_shift.cumsum()
df.fillna(0, inplace=True)
Kindly advise
What you are looking for is called a modulo operation.
Use df['adj_cumsum'] = df['cumsum'].mod(7).
Intuition:
df["adj_cumsum"] = df["cumsum"].apply(lambda x: x % 7)
Am I right?
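Putting the asker's setup and the modulo answer together, a self-contained sketch (data taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"],
    "days": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
})

# Days beyond a 4-week month, shifted so each month carries over the
# previous month's excess, then accumulated.
df["adj_days"] = df["days"] - 28
df["adj_days_shift"] = df["adj_days"].shift(1)
df["cumsum"] = df["adj_days_shift"].cumsum()
df = df.fillna(0)

# Bounding the running total to the range 0..6 is a modulo-7 operation.
df["adj_cumsum"] = df["cumsum"].mod(7).astype(int)
```

This reproduces the adj_cumsum column from the question's table without any explicit loop.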

Running Hyperopt in Freqtrade and getting crazy results

I ran hyperopt for 5000 iterations and got the following results:
2022-01-10 19:38:31,370 - freqtrade.optimize.hyperopt - INFO - Best result:
1101 trades. Avg profit 0.23%. Total profit 25.48064438 BTC (254.5519Σ%). Avg duration 888.1 mins.
with values:
{ 'roi_p1': 0.011364434095803464,
'roi_p2': 0.04123147845715937,
'roi_p3': 0.10554480985209454,
'roi_t1': 105,
'roi_t2': 47,
'roi_t3': 30,
'rsi-enabled': True,
'rsi-value': 9,
'sell-rsi-enabled': True,
'sell-rsi-value': 94,
'sell-trigger': 'sell-bb_middle1',
'stoploss': -0.42267640639979365,
'trigger': 'bb_lower2'}
2022-01-10 19:38:31,371 - freqtrade.optimize.hyperopt - INFO - ROI table:
{ 0: 0.15814072240505736,
30: 0.05259591255296283,
77: 0.011364434095803464,
182: 0}
Result for strategy BBRSI
================================================== BACKTESTING REPORT =================================================
| pair | buy count | avg profit % | cum profit % | total profit BTC | avg duration | profit | loss |
|:----------|------------:|---------------:|---------------:|-------------------:|:----------------|---------:|-------:|
| ETH/BTC | 11 | -1.30 | -14.26 | -1.42732928 | 3 days, 4:55:00 | 0 | 1 |
| LUNA/BTC | 17 | 0.60 | 10.22 | 1.02279906 | 15:46:00 | 9 | 0 |
| SAND/BTC | 37 | 0.30 | 11.24 | 1.12513532 | 6:16:00 | 14 | 1 |
| MATIC/BTC | 24 | 0.47 | 11.35 | 1.13644340 | 12:20:00 | 10 | 0 |
| ADA/BTC | 24 | 0.24 | 5.68 | 0.56822170 | 21:05:00 | 5 | 0 |
| BNB/BTC | 11 | -1.09 | -11.96 | -1.19716109 | 3 days, 0:44:00 | 2 | 1 |
| XRP/BTC | 20 | -0.39 | -7.71 | -0.77191523 | 1 day, 5:48:00 | 1 | 1 |
| DOT/BTC | 9 | 0.50 | 4.54 | 0.45457736 | 4 days, 1:13:00 | 4 | 0 |
| SOL/BTC | 19 | -0.38 | -7.16 | -0.71688463 | 22:47:00 | 3 | 1 |
| MANA/BTC | 29 | 0.38 | 11.16 | 1.11753320 | 10:25:00 | 9 | 1 |
| AVAX/BTC | 27 | 0.30 | 8.15 | 0.81561432 | 16:36:00 | 11 | 1 |
| GALA/BTC | 26 | -0.52 | -13.45 | -1.34594702 | 15:48:00 | 9 | 1 |
| LINK/BTC | 21 | 0.27 | 5.68 | 0.56822170 | 1 day, 0:06:00 | 5 | 0 |
| TOTAL | 275 | 0.05 | 13.48 | 1.34930881 | 23:42:00 | 82 | 8 |
================================================== SELL REASON STATS ==================================================
| Sell Reason | Count |
|:--------------|--------:|
| roi | 267 |
| force_sell | 8 |
=============================================== LEFT OPEN TRADES REPORT ===============================================
| pair | buy count | avg profit % | cum profit % | total profit BTC | avg duration | profit | loss |
|:---------|------------:|---------------:|---------------:|-------------------:|:------------------|---------:|-------:|
| ETH/BTC | 1 | -14.26 | -14.26 | -1.42732928 | 32 days, 4:00:00 | 0 | 1 |
| SAND/BTC | 1 | -4.65 | -4.65 | -0.46588544 | 17:00:00 | 0 | 1 |
| BNB/BTC | 1 | -14.23 | -14.23 | -1.42444977 | 31 days, 13:00:00 | 0 | 1 |
| XRP/BTC | 1 | -8.85 | -8.85 | -0.88555957 | 18 days, 4:00:00 | 0 | 1 |
| SOL/BTC | 1 | -10.57 | -10.57 | -1.05781765 | 5 days, 14:00:00 | 0 | 1 |
| MANA/BTC | 1 | -3.17 | -3.17 | -0.31758065 | 17:00:00 | 0 | 1 |
| AVAX/BTC | 1 | -12.58 | -12.58 | -1.25910300 | 7 days, 9:00:00 | 0 | 1 |
| GALA/BTC | 1 | -23.66 | -23.66 | -2.36874608 | 7 days, 12:00:00 | 0 | 1 |
| TOTAL | 8 | -11.50 | -91.97 | -9.20647144 | 12 days, 23:15:00 | 0 | 8 |
I have followed the tutorial accurately and don't know what I am doing wrong here.

Filter rows based on condition in Pandas

I have a dataframe df_groups that contains the sample number, group number, and accuracy.
Table 1: Samples with their groups
+----+----------+------------+------------+
|    |   sample |      group |   Accuracy |
|----+----------+------------+------------|
|  0 |        0 |          6 |       91.6 |
|  1 |        1 |          4 |    92.9333 |
|  2 |        2 |          2 |         91 |
|  3 |        3 |          2 |    90.0667 |
|  4 |        4 |          4 |       91.8 |
|  5 |        5 |          5 |    92.5667 |
|  6 |        6 |          6 |       91.1 |
|  7 |        7 |          5 |    92.3333 |
|  8 |        8 |          2 |    92.7667 |
|  9 |        9 |          0 |    91.1333 |
| 10 |       10 |          4 |       92.5 |
| 11 |       11 |          5 |       92.4 |
| 12 |       12 |          7 |    93.1333 |
| 13 |       13 |          7 |    93.5333 |
| 14 |       14 |          2 |       92.1 |
| 15 |       15 |          6 |       93.2 |
| 16 |       16 |          8 |    92.7333 |
| 17 |       17 |          8 |       90.8 |
| 18 |       18 |          3 |       91.9 |
| 19 |       19 |          3 |       93.3 |
| 20 |       20 |          5 |    90.6333 |
| 21 |       21 |          9 |    92.9333 |
| 22 |       22 |          4 |    93.3333 |
| 23 |       23 |          9 |    91.5333 |
| 24 |       24 |          9 |    92.9333 |
| 25 |       25 |          1 |       92.3 |
| 26 |       26 |          9 |    92.2333 |
| 27 |       27 |          6 |    91.9333 |
| 28 |       28 |          5 |       92.1 |
| 29 |       29 |          8 |       84.8 |
+----+----------+------------+------------+
I want to return a dataframe containing only the rows whose accuracy is above a threshold (e.g. 92),
so the results will look like this:
Table 2: Samples with their groups when accuracy is above 92.
+----+----------+------------+------------+
|    |   sample |      group |   Accuracy |
|----+----------+------------+------------|
|  1 |        1 |          4 |    92.9333 |
|  2 |        5 |          5 |    92.5667 |
|  3 |        7 |          5 |    92.3333 |
|  4 |        8 |          2 |    92.7667 |
|  5 |       10 |          4 |       92.5 |
|  6 |       11 |          5 |       92.4 |
|  7 |       12 |          7 |    93.1333 |
|  8 |       13 |          7 |    93.5333 |
|  9 |       14 |          2 |       92.1 |
| 10 |       15 |          6 |       93.2 |
| 11 |       16 |          8 |    92.7333 |
| 12 |       19 |          3 |       93.3 |
| 13 |       21 |          9 |    92.9333 |
| 14 |       22 |          4 |    93.3333 |
| 15 |       24 |          9 |    92.9333 |
| 16 |       25 |          1 |       92.3 |
| 17 |       26 |          9 |    92.2333 |
| 18 |       28 |          5 |       92.1 |
+----+----------+------------+------------+
So, the result should be returned based on the condition of being greater than or equal to a predefined accuracy (e.g. 92, 90, 85, etc.).
You can use df.loc[df['Accuracy'] >= predefined_accuracy].
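As a quick sketch of that boolean-indexing answer (using only the first few rows of the question's data):

```python
import pandas as pd

df_groups = pd.DataFrame({
    "sample": [0, 1, 2, 3],
    "group": [6, 4, 2, 2],
    "Accuracy": [91.6, 92.9333, 91.0, 90.0667],
})

predefined_accuracy = 92
# Boolean indexing with .loc keeps only the rows meeting the threshold;
# the original row labels are preserved, as in the expected output.
filtered = df_groups.loc[df_groups["Accuracy"] >= predefined_accuracy]
```

Changing predefined_accuracy to 90 or 85 filters against those thresholds instead.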

Split a column and combine rows where there are multiple data measures

I'm trying to use python to solve my data analysis problem.
I have a table like this:
+----+-----+------+--------+-------------+--------------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | Value_column |
+----+-----+------+--------+-------------+--------------+
| 11 |   1 | 2020 | Name1  | QTRAVG      |            5 |
| 11 |   2 | 2020 | Name1  | QTRAVG      |            8 |
| 11 |   3 | 2020 | Name1  | QTRAVG      |            6 |
| 11 |   4 | 2020 | Name1  | QTRAVG      |            9 |
| 15 |   1 | 2020 | Name2  | QTRAVG      |           67 |
| 15 |   2 | 2020 | Name2  | QTRAVG      |           89 |
| 15 |   3 | 2020 | Name2  | QTRAVG      |          100 |
| 15 |   4 | 2020 | Name2  | QTRAVG      |          121 |
| 11 |   1 | 2020 | Name1  | QTRMAX      |            6 |
| 11 |   2 | 2020 | Name1  | QTRMAX      |            9 |
| 11 |   3 | 2020 | Name1  | QTRMAX      |            7 |
| 11 |   4 | 2020 | Name1  | QTRMAX      |           10 |
+----+-----+------+--------+-------------+--------------+
I want to rearrange Value_column so that, when there are multiple Qtr_Measures for a unique ID and MEF_ID, each measure type gets its own column. This reduces the overall size of the table; I would like columns named after the measure types, as below:
+----+-----+------+--------+-------------+--------+--------+
| ID | QTR | Year | MEF_ID | Qtr_Measure | QTRAVG | QTRMAX |
+----+-----+------+--------+-------------+--------+--------+
| 11 |   1 | 2020 | Name1  | QTRAVG      |      5 |      6 |
| 11 |   2 | 2020 | Name1  | QTRAVG      |      8 |      9 |
| 11 |   3 | 2020 | Name1  | QTRAVG      |      6 |      7 |
| 11 |   4 | 2020 | Name1  | QTRAVG      |      9 |     10 |
| 15 |   1 | 2020 | Name2  | QTRAVG      |     67 |        |
| 15 |   2 | 2020 | Name2  | QTRAVG      |     89 |        |
| 15 |   3 | 2020 | Name2  | QTRAVG      |    100 |        |
| 15 |   4 | 2020 | Name2  | QTRAVG      |    121 |        |
+----+-----+------+--------+-------------+--------+--------+
How can I do this with python?
Thank you
Use pivot_table with reset_index and rename_axis:
piv = (df.pivot_table(index=['ID', 'QTR', 'Year', 'MEF_ID'],
                      values='Value_column',
                      columns='Qtr_Measure')
         .reset_index()
         .rename_axis(None, axis=1))
print(piv)
ID QTR Year MEF_ID QTRAVG QTRMAX
0 11 1 2020 Name1 5.0 6.0
1 11 2 2020 Name1 8.0 9.0
2 11 3 2020 Name1 6.0 7.0
3 11 4 2020 Name1 9.0 10.0
4 15 1 2020 Name2 67.0 NaN
5 15 2 2020 Name2 89.0 NaN
6 15 3 2020 Name2 100.0 NaN
7 15 4 2020 Name2 121.0 NaN
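For completeness, a self-contained version of the same pivot, with the input frame reconstructed from the question's table (only the first two quarters, to keep it short):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [11, 11, 15, 15, 11, 11],
    "QTR": [1, 2, 1, 2, 1, 2],
    "Year": [2020] * 6,
    "MEF_ID": ["Name1", "Name1", "Name2", "Name2", "Name1", "Name1"],
    "Qtr_Measure": ["QTRAVG", "QTRAVG", "QTRAVG", "QTRAVG", "QTRMAX", "QTRMAX"],
    "Value_column": [5, 8, 67, 89, 6, 9],
})

# Each distinct Qtr_Measure becomes its own column; (ID, QTR, Year, MEF_ID)
# combinations without a matching measure (Name2 has no QTRMAX) come back
# as NaN, which matches the blanks in the desired table.
piv = (df.pivot_table(index=["ID", "QTR", "Year", "MEF_ID"],
                      values="Value_column",
                      columns="Qtr_Measure")
         .reset_index()
         .rename_axis(None, axis=1))
```

Note that pivot_table aggregates (mean by default) if an index/column combination appears more than once, which is harmless here since each combination is unique.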

How to put value 1 when another column is not zero in python?

This is the table I have
| Company | Counts | Date | mean  |
| A       |    100 | 2019 | nan   |
| B       |    200 | 2019 | nan   |
| C       |    300 | 2019 | nan   |
| D       |    400 | 2019 | 1.02  |
| E       |      0 | 2020 | 10.08 |
| F       |      0 | 2020 | 11.11 |
I am trying to get this, by replacing 'mean' with 1 when 'Counts' is not 0.
| Company | Counts | Date | mean  |
| A       |    100 | 2019 | 1     |
| B       |    200 | 2019 | 1     |
| C       |    300 | 2019 | 1     |
| D       |    400 | 2019 | 1     |
| E       |      0 | 2020 | 10.08 |
| F       |      0 | 2020 | 11.11 |
If I am not misunderstanding, you can use loc to change the value of the mean column where the condition you stated matches:
df.loc[df.Counts != 0, 'mean'] = 1
Company Counts Date mean
0 A 100 2019 1
1 B 200 2019 1
2 C 300 2019 1
3 D 400 2019 1
4 E 0 2020 10.08
5 F 0 2020 11.11
You can also use np.where, passing in the condition, the value if true, and the value if false:
df['mean'] = np.where(df['Counts'] != 0, 1, df['mean'])
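A self-contained sketch of the np.where answer, rebuilding the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E", "F"],
    "Counts": [100, 200, 300, 400, 0, 0],
    "Date": [2019, 2019, 2019, 2019, 2020, 2020],
    "mean": [np.nan, np.nan, np.nan, 1.02, 10.08, 11.11],
})

# Where Counts is non-zero, overwrite mean with 1; otherwise keep the
# existing mean value.
df["mean"] = np.where(df["Counts"] != 0, 1, df["mean"])
```

The equivalent .loc form (df.loc[df.Counts != 0, 'mean'] = 1) mutates in place rather than building a new array, but both give the same result here.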
