sumif and countif in Python for multiple columns, on row level and not column level - python

I'm trying to figure out a way to do the equivalent of:
COUNTIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
SUMIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
My attempt:
import pandas as pd
df=pd.read_excel(r'C:\\Users\\Downloads\\Prepped.xls') #Please use: https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float) #changing dtype to float
#unconditional sum
df['sum']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum(axis=1)
Whatever goes below won't work:
#sum if
df['greater-than-0.05']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum([c for c in col if c >= 0.05])
| | # | word | B64684807 | B64684807Measure | B649845471 | B649845471Measure | B83344143 | B83344143Measure | B67400624 | B67400624Measure | B85229235 | B85229235Measure | B85630406 | B85630406Measure | B82615898 | B82615898Measure | B87558236 | B87558236Measure | B00000009 | B00000009Measure | 有效竞品数 | 关键词抓取时间 | 搜索量排名 | 月搜索量 | 在售商品数 | 竞争度 |
|---:|----:|:--------|------------:|:-------------------|-------------:|:-------------------------|------------:|:-------------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|-------------------:|------------:|:-------------------|-------------:|:--------------------|-------------:|-----------:|-------------:|---------:|
| 0 | 1 | word 1 | 0.055639 | [主要流量词] | 0.049416 | nan | 0.072298 | [精准流量词, 主要流量词] | 0.00211 | nan | 0.004251 | nan | 0.007254 | nan | 0.074409 | [主要流量词] | 0.033597 | nan | 0.000892 | nan | 9 | 2022-10-06 00:53:56 | 5726 | 326188 | 3810 | 0.01 |
| 1 | 2 | word 2 | 0.045098 | nan | 0.005472 | nan | 0.010791 | nan | 0.072859 | [主要流量词] | 0.003423 | nan | 0.012464 | nan | 0.027396 | nan | 0.002825 | nan | 0.060989 | [主要流量词] | 9 | 2022-10-07 01:16:21 | 9280 | 213477 | 40187 | 0.19 |
| 2 | 3 | word 3 | 0.02186 | nan | 0.05039 | [主要流量词] | 0.007842 | nan | 0.028832 | nan | 0.044385 | [精准流量词] | 0.001135 | nan | 0.003866 | nan | 0.021035 | nan | 0.017202 | nan | 9 | 2022-10-07 00:28:31 | 24024 | 81991 | 2275 | 0.03 |
| 3 | 4 | word 4 | 0.000699 | nan | 0.01038 | nan | 0.001536 | nan | 0.021512 | nan | 0.007658 | nan | 5e-05 | nan | 0.048682 | nan | 0.001524 | nan | 0.000118 | nan | 9 | 2022-10-07 00:52:12 | 34975 | 53291 | 30970 | 0.58 |
| 4 | 5 | word 5 | 0.00984 | nan | 0.030248 | nan | 0.003006 | nan | 0.014027 | nan | 0.00904 | [精准流量词] | 0.000348 | nan | 0.000414 | nan | 0.006721 | nan | 0.00153 | nan | 9 | 2022-10-07 02:36:05 | 43075 | 41336 | 2230 | 0.05 |
| 5 | 6 | word 6 | 0.010029 | [精准流量词] | 0.120739 | [精准流量词, 主要流量词] | 0.014359 | nan | 0.002796 | nan | 0.002883 | nan | 0.028747 | [精准流量词] | 0.007022 | nan | 0.017803 | nan | 0.001998 | nan | 9 | 2022-10-07 00:44:51 | 49361 | 34791 | 517 | 0.01 |
| 6 | 7 | word 7 | 0.002735 | nan | 0.002005 | nan | 0.005355 | nan | 6.3e-05 | nan | 0.000772 | nan | 0.000237 | nan | 0.015149 | nan | 2.1e-05 | nan | 2.3e-05 | nan | 9 | 2022-10-07 09:48:20 | 53703 | 31188 | 511 | 0.02 |
| 7 | 8 | word 8 | 0.003286 | [精准流量词] | 0.058161 | [主要流量词] | 0.013681 | [精准流量词] | 0.000748 | [精准流量词] | 0.002684 | [精准流量词] | 0.013916 | [精准流量词] | 0.029376 | nan | 0.019792 | nan | 0.005602 | nan | 9 | 2022-10-06 01:51:53 | 58664 | 27751 | 625 | 0.02 |
| 8 | 9 | word 9 | 0.004273 | [精准流量词] | 0.025581 | [精准流量词] | 0.014784 | [精准流量词] | 0.00321 | [精准流量词] | 0.000892 | nan | 0.00223 | nan | 0.005315 | nan | 0.02211 | nan | 0.027008 | [精准流量词] | 9 | 2022-10-07 01:34:28 | 73640 | 20326 | 279 | 0.01 |
| 9 | 10 | word 10 | 0.002341 | [精准流量词] | 0.029604 | nan | 0.007817 | [精准流量词] | 0.000515 | [精准流量词] | 0.001865 | [精准流量词] | 0.010128 | [精准流量词] | 0.015378 | nan | 0.019677 | nan | 0.003673 | nan | 9 | 2022-10-07 01:17:44 | 80919 | 17779 | 207 | 0.01 |
So my question is:
How can I do the SUMIF and COUNTIF on that exact table? (It should use col2, col4, ... etc., because every file will have the same format but different headers, so using df['B64684807'] isn't helpful.)
Sample file can be found at:
https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls

IIUC, you can use a boolean mask:
df2 = df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float)  # numeric columns only
m = df2.ge(0.05)                        # boolean mask: True where value >= 0.05
df['countif'] = m.sum(axis=1)           # COUNTIF: count of True per row
df['sumif'] = df2.where(m).sum(axis=1)  # SUMIF: sum only the masked values per row
output (last 3 columns only):
sum countif sumif
0 0.299866 3 0.202346
1 0.241317 2 0.133848
2 0.196547 1 0.050390
3 0.092159 0 0.000000
4 0.075174 0 0.000000
5 0.206376 1 0.120739
6 0.026360 0 0.000000
7 0.147246 1 0.058161
8 0.105403 0 0.000000
9 0.090998 0 0.000000
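The same pattern can be checked on a tiny, self-contained frame (the column names here are invented for illustration):

```python
import pandas as pd

# small stand-in frame; column names are made up for this illustration
df = pd.DataFrame({
    "a": [0.06, 0.01],
    "b": [0.10, 0.02],
    "c": [0.03, 0.07],
})

m = df.ge(0.05)                   # True where value >= 0.05
countif = m.sum(axis=1)           # per-row COUNTIF(..., ">=0.05")
sumif = df.where(m).sum(axis=1)   # per-row SUMIF(..., ">=0.05")

print(countif.tolist())           # [2, 1]
print(sumif.round(2).tolist())    # [0.16, 0.07]
```

The mask approach scales to any threshold or comparison (`ge`, `le`, `between`, ...) without hard-coding column names.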

Related

Order a matrix with numpy.triu and non-NaN values

I have a dataset where I need to do a transformation to get an upper triangular matrix. My matrix has this format:
|            |      1 |       2 |      3 |
|:-----------|-------:|--------:|-------:|
| 01/01/1999 |    nan |  582.96 |    nan |
| 02/01/1999 |    nan |  589.78 |  78.47 |
| 03/01/1999 |    nan |  588.74 |  79.41 |
| …          |        |         |        |
| 01/01/2022 | 752.14 | 1005.78 | 193.47 |
| 02/01/2022 | 754.14 |  997.57 | 192.99 |
I use dataframe.T to get my dates as columns, but I also need my rows to be ordered by their non-NaN values.
|   | 01/01/1999 | 02/01/1999 | 03/01/1999 | … | 01/01/2022 | 02/01/2022 |
|:--|-----------:|-----------:|-----------:|:--|-----------:|-----------:|
| 2 |     582.96 |     589.78 |     588.74 | … |    1005.78 |     997.57 |
| 3 |        nan |      78.47 |      79.41 | … |     193.47 |     192.99 |
| 1 |        nan |        nan |        nan | … |     752.14 |     754.14 |
I tried different combinations of numpy.triu, sort_values and dataframe.T, but without success.
My main goal is to get this format, but getting it with good performance would also be nice, since my data is big.
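No accepted answer appears here, but one plausible approach is to transpose and then sort the rows by the position of their first non-NaN value; a minimal sketch on toy data shaped like the question's matrix:

```python
import numpy as np
import pandas as pd

# toy frame shaped like the question: dates as rows, series 1-3 as columns
df = pd.DataFrame(
    {1: [np.nan, np.nan, 752.14],
     2: [582.96, 589.78, 1005.78],
     3: [np.nan, 78.47, 193.47]},
    index=["01/01/1999", "02/01/1999", "01/01/2022"],
)

dft = df.T  # dates become columns
# position of the first non-NaN value in each row
first_valid = dft.notna().values.argmax(axis=1)
ordered = dft.iloc[np.argsort(first_valid, kind="stable")]
print(ordered.index.tolist())  # [2, 3, 1]
```

This is vectorized (no Python loop over rows), which should matter for large data; note that `argmax` on an all-NaN row returns 0, so such rows would need separate handling.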

Python Pandas - Merge specific cells

I have this massive dataframe which has 3 different columns of values under each heading.
As an example, it first looked something like this:
| | 0 | 1 | 2 | 3 | ..
| 0 | a | 7.3 | 9.1 | NaN | ..
| 1 | b | 2.51 | 4.8 | 6.33 | ..
| 2 | c | NaN | NaN | NaN | ..
| 3 | d | NaN | 3.73 | NaN | ..
1, 2 and 3 all belong together. For simplicity of the program I used integers for the dataframe index and columns.
But now that it finished calculating stuff, I changed the columns to the appropriate string.
| | 0 | Heading 1 | Heading 1 | Heading 1 | ..
| 0 | a | 7.3 | 9.1 | NaN | ..
| 1 | b | 2.51 | 4.8 | 6.33 | ..
| 2 | c | NaN | NaN | NaN | ..
| 3 | d | NaN | 3.73 | NaN | ..
Everything runs perfectly smooth up until this point, but here's where I'm stuck.
All I wanna do is merge the 3 "Heading 1" into one giant cell, so that it looks something like this:
| | 0 | Heading 1 | ..
| 0 | a | 7.3 | 9.1 | NaN | ..
| 1 | b | 2.51 | 4.8 | 6.33 | ..
| 2 | c | NaN | NaN | NaN | ..
| 3 | d | NaN | 3.73 | NaN | ..
But everything I find online merges the entire column, values included.
I'd really appreciate if someone could help me out here!
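pandas has no merged cells as such, but a hierarchical column index (MultiIndex) gives the grouped-header effect the question describes: one "Heading 1" label spanning three value columns. A minimal sketch, with data mirroring the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["a", 7.3, 9.1, np.nan],
     ["b", 2.51, 4.8, 6.33]],
    columns=[0, 1, 2, 3],
)

# group columns 1-3 under a single "Heading 1" label via a MultiIndex
df.columns = pd.MultiIndex.from_tuples(
    [("", 0), ("Heading 1", 1), ("Heading 1", 2), ("Heading 1", 3)]
)

# selecting the top-level label returns all three sub-columns at once
print(df["Heading 1"].shape)  # (2, 3)
```

When such a frame is written to Excel with `to_excel`, the shared top-level label is rendered once across the sub-columns, which is likely the "giant cell" being asked for.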

Running Hyperopt in Freqtrade and getting crazy results

I ran hyperopt for 5000 iterations and got the following results:
2022-01-10 19:38:31,370 - freqtrade.optimize.hyperopt - INFO - Best result:
1101 trades. Avg profit 0.23%. Total profit 25.48064438 BTC (254.5519Σ%). Avg duration 888.1 mins.
with values:
{ 'roi_p1': 0.011364434095803464,
'roi_p2': 0.04123147845715937,
'roi_p3': 0.10554480985209454,
'roi_t1': 105,
'roi_t2': 47,
'roi_t3': 30,
'rsi-enabled': True,
'rsi-value': 9,
'sell-rsi-enabled': True,
'sell-rsi-value': 94,
'sell-trigger': 'sell-bb_middle1',
'stoploss': -0.42267640639979365,
'trigger': 'bb_lower2'}
2022-01-10 19:38:31,371 - freqtrade.optimize.hyperopt - INFO - ROI table:
{ 0: 0.15814072240505736,
30: 0.05259591255296283,
77: 0.011364434095803464,
182: 0}
Result for strategy BBRSI
================================================== BACKTESTING REPORT =================================================
| pair | buy count | avg profit % | cum profit % | total profit BTC | avg duration | profit | loss |
|:----------|------------:|---------------:|---------------:|-------------------:|:----------------|---------:|-------:|
| ETH/BTC | 11 | -1.30 | -14.26 | -1.42732928 | 3 days, 4:55:00 | 0 | 1 |
| LUNA/BTC | 17 | 0.60 | 10.22 | 1.02279906 | 15:46:00 | 9 | 0 |
| SAND/BTC | 37 | 0.30 | 11.24 | 1.12513532 | 6:16:00 | 14 | 1 |
| MATIC/BTC | 24 | 0.47 | 11.35 | 1.13644340 | 12:20:00 | 10 | 0 |
| ADA/BTC | 24 | 0.24 | 5.68 | 0.56822170 | 21:05:00 | 5 | 0 |
| BNB/BTC | 11 | -1.09 | -11.96 | -1.19716109 | 3 days, 0:44:00 | 2 | 1 |
| XRP/BTC | 20 | -0.39 | -7.71 | -0.77191523 | 1 day, 5:48:00 | 1 | 1 |
| DOT/BTC | 9 | 0.50 | 4.54 | 0.45457736 | 4 days, 1:13:00 | 4 | 0 |
| SOL/BTC | 19 | -0.38 | -7.16 | -0.71688463 | 22:47:00 | 3 | 1 |
| MANA/BTC | 29 | 0.38 | 11.16 | 1.11753320 | 10:25:00 | 9 | 1 |
| AVAX/BTC | 27 | 0.30 | 8.15 | 0.81561432 | 16:36:00 | 11 | 1 |
| GALA/BTC | 26 | -0.52 | -13.45 | -1.34594702 | 15:48:00 | 9 | 1 |
| LINK/BTC | 21 | 0.27 | 5.68 | 0.56822170 | 1 day, 0:06:00 | 5 | 0 |
| TOTAL | 275 | 0.05 | 13.48 | 1.34930881 | 23:42:00 | 82 | 8 |
================================================== SELL REASON STATS ==================================================
| Sell Reason | Count |
|:--------------|--------:|
| roi | 267 |
| force_sell | 8 |
=============================================== LEFT OPEN TRADES REPORT ===============================================
| pair | buy count | avg profit % | cum profit % | total profit BTC | avg duration | profit | loss |
|:---------|------------:|---------------:|---------------:|-------------------:|:------------------|---------:|-------:|
| ETH/BTC | 1 | -14.26 | -14.26 | -1.42732928 | 32 days, 4:00:00 | 0 | 1 |
| SAND/BTC | 1 | -4.65 | -4.65 | -0.46588544 | 17:00:00 | 0 | 1 |
| BNB/BTC | 1 | -14.23 | -14.23 | -1.42444977 | 31 days, 13:00:00 | 0 | 1 |
| XRP/BTC | 1 | -8.85 | -8.85 | -0.88555957 | 18 days, 4:00:00 | 0 | 1 |
| SOL/BTC | 1 | -10.57 | -10.57 | -1.05781765 | 5 days, 14:00:00 | 0 | 1 |
| MANA/BTC | 1 | -3.17 | -3.17 | -0.31758065 | 17:00:00 | 0 | 1 |
| AVAX/BTC | 1 | -12.58 | -12.58 | -1.25910300 | 7 days, 9:00:00 | 0 | 1 |
| GALA/BTC | 1 | -23.66 | -23.66 | -2.36874608 | 7 days, 12:00:00 | 0 | 1 |
| TOTAL | 8 | -11.50 | -91.97 | -9.20647144 | 12 days, 23:15:00 | 0 | 8 |
I have followed the tutorial accurately and don't know what I am doing wrong here.

combine data frames of different sizes and replacing values

I have 2 dataframes of different sizes. I want to join the dataframes and then replace the NaN values, after combining both, with the values from the smaller dataframe.
dataframe1:-
| symbol| value1 | value2 | Occurance |
|=======|========|========|===========|
2020-07-31 | A | 193.5 | 186.05 | 3 |
2020-07-17 | A | 372.5 | 359.55 | 2 |
2020-07-21 | A | 387.8 | 382.00 | 1 |
dataframe2:-
| x | y | z | symbol|
|=====|=====|=====|=======|
2020-10-01 |448.5|453.0|443.8| A |
I tried concatenating and replacing the NaN values with the values from dataframe2.
I tried df1 = pd.concat([dataframe2, dataframe1], axis=1). The result is given below, but I am looking for the result shown under "Result Desired". How can I achieve that?
Result given:-
| X | Y | Z | symbol|symbol| value1| value2 | Occurance|
|====== | ====|=====|=======|======|=======| =======| =========|
2020-07-31|NaN |NaN | NaN | NaN | A |193.5 | 186.05 | 3 |
2021-05-17| NaN | NaN | NaN | NaN | A |372.5 | 359.55 | 2 |
2021-05-21| NaN | NaN | NaN | NaN | A |387.8 | 382.00 | 1 |
2020-10-01| 448.5 |453.0|443.8| A |NaN | NaN | NaN | NaN |
Result Desired:-
| X | Y | Z | symbol|symbol| value1| value2 | Occurance|
| ===== | ======| ====| ======| =====|=======|========|==========|
2020-10-01| 448.5 |453.0 |443.8| A | A |193.5 | 186.05 | 3 |
2020-10-01| 448.5 |453.0 |443.8| A | A |372.5 | 359.55 | 2 |
2020-10-01| 448.5 |453.0 |443.8| A | A |387.8 | 382.00 | 1 |
2020-10-01| 448.5 |453.0 |443.8| A |NaN | NaN | NaN | NaN |
Please note the datetime needs to be the same as in the desired result. In short, I want to replicate the single line of dataframe2 across the NaN values of dataframe1. A solution avoiding a for loop would be great.
Could you try to sort your dataframe by the index to check how the output would look?
df1.sort_index()
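No accepted answer appears here, but one way to broadcast dataframe2's single row across every row of dataframe1, without a for loop, is an outer merge on symbol; a sketch with toy data mirroring the question (note that a plain merge discards the datetime indexes, and the extra dataframe2-only row in the desired output would need a separate concat):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"symbol": ["A", "A", "A"],
     "value1": [193.5, 372.5, 387.8],
     "value2": [186.05, 359.55, 382.00],
     "Occurance": [3, 2, 1]},
    index=pd.to_datetime(["2020-07-31", "2020-07-17", "2020-07-21"]),
)
df2 = pd.DataFrame(
    {"x": [448.5], "y": [453.0], "z": [443.8], "symbol": ["A"]},
    index=pd.to_datetime(["2020-10-01"]),
)

# outer merge on symbol repeats df2's single row for every matching df1 row
merged = df2.merge(df1, on="symbol", how="outer")
print(len(merged))  # 3
```

Here every output row carries df2's x/y/z values alongside one df1 row, which is the "replicate the single line" behaviour asked for.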

How to fill missing GPS Data in pandas?

I have a data frame that looks something like this
+-----+------------+-------------+-------------------------+----+----------+----------+
| | Actual_Lat | Actual_Long | Time | ID | Cal_long | Cal_lat |
+-----+------------+-------------+-------------------------+----+----------+----------+
| 0 | 63.433376 | 10.397068 | 2019-09-30 04:48:13.540 | 11 | 10.39729 | 63.43338 |
| 1 | 63.433301 | 10.395846 | 2019-09-30 04:48:18.470 | 11 | 10.39731 | 63.43326 |
| 2 | 63.433259 | 10.394543 | 2019-09-30 04:48:23.450 | 11 | 10.39576 | 63.43323 |
| 3 | 63.433258 | 10.394244 | 2019-09-30 04:48:29.500 | 11 | 10.39555 | 63.43436 |
| 4 | 63.433258 | 10.394215 | 2019-09-30 04:48:35.683 | 11 | 10.39505 | 63.43427 |
| ... | ... | ... | ... | ...| ... | ... |
| 70 | NaN | NaN | NaT | NaN| 10.35826 | 63.43149 |
| 71 | NaN | NaN | NaT | NaN| 10.35809 | 63.43155 |
| 72 | NaN | NaN | NaT | NaN| 10.35772 | 63.43163 |
| 73 | NaN | NaN | NaT | NaN| 10.35646 | 63.43182 |
| 74 | NaN | NaN | NaT | NaN| 10.35536 | 63.43196 |
+-----+------------+-------------+-------------------------+----+----------+----------+
Actual_Lat and Actual_Long contain GPS coordinates obtained from a GPS device. Cal_long and Cal_lat are GPS coordinates obtained from OSRM's API. As you can see, a lot of data is missing in the actual coordinates. I am looking to get a data set such that when I take the difference of Actual_Lat vs Cal_lat, it is zero or at least close to zero. I tried to fill these missing values with the destination lat and long, but that resulted in a huge difference. My question is: how can I fill these values using Python/pandas so that, where the vehicle followed the OSRM-estimated path, the difference between the actual lat/long and the estimated lat/long is zero or close to zero? I am new to GIS data sets and have no idea how to deal with them.
EDIT: I am looking for something like this.
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
| | Actual_Lat | Actual_Long | Time | Tour ID | Cal_long | Cal_lat | coordinates_diff_Lat | coordinates_diff_Lon |
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
| 0 | 63.433376 | 10.397068 | 2019-09-30 04:48:13.540 | 11 | 10.39729 | 63.43338 | -0.000 | -0.000 |
| 1 | 63.433301 | 10.395846 | 2019-09-30 04:48:18.470 | 11 | 10.39731 | 63.43326 | 0.000 | -0.001 |
| 2 | 63.433259 | 10.394543 | 2019-09-30 04:48:23.450 | 11 | 10.39576 | 63.43323 | 0.000 | -0.001 |
| 3 | 63.433258 | 10.394244 | 2019-09-30 04:48:29.500 | 11 | 10.39555 | 63.43436 | -0.001 | -0.001 |
| 4 | 63.433258 | 10.394215 | 2019-09-30 04:48:35.683 | 11 | 10.39505 | 63.43427 | -0.001 | -0.001 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70 | 63.43000 | 10.35800 | NaT | 115268.0 | 10.35826 | 63.43149 | 0.000 | -0.003 |
| 71 | 63.43025 | 10.35888 | NaT | 115268.0 | 10.35809 | 63.43155 | 0.000 | -0.003 |
| 72 | 63.43052 | 10.35713 | NaT | 115268.0 | 10.35772 | 63.43163 | 0.000 | -0.002 |
| 73 | 63.43159 | 10.35633 | NaT | 115268.0 | 10.35646 | 63.43182 | 0.000 | -0.001 |
| 74 | 63.43197 | 10.35537 | NaT | 115268.0 | 10.35536 | 63.43196 | 0.000 | 0.000 |
+-----+------------+-------------+-------------------------+----------+----------+----------+----------------------+----------------------+
Note that 63.43197,10.35537 is destination and 63.433376,10.397068 is starting position. All these points represent road coordinates.
IIUC, you need something like this:
I am taking the columns out of df as lists first.
actual_lat = df['Actual_Lat'].dropna().tolist()  # known GPS points
cal_lat = df['Cal_lat'].tolist()                 # OSRM estimates (longer list)

div = float(len(cal_lat)) / float(len(actual_lat))
new_l = []
for i in range(len(cal_lat)):
    # stretch the shorter actual_lat list over the length of cal_lat
    new_l.append(actual_lat[int(i / div)])
print(new_l)
Do the same with the longitude columns.
Since these are GPS points, you can tweak your model to an accuracy of up to 3 decimal places when taking the difference. Keeping this in mind, starting from Actual_Lat and Actual_Long, if the next value is the same as the first, the difference won't be much greater.
Hopefully this makes sense and helps you reach your solution.
You need pandas.DataFrame.where.
Let's say your dataframe is df, then you can do:
df.Actual_Lat = df.Actual_Lat.where(~df.Actual_Lat.isna(), df.Cal_lat)
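The where call keeps Actual_Lat wherever it is present and substitutes Cal_lat where it is missing; fillna is an equivalent, arguably more direct spelling. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Actual_Lat": [63.433376, np.nan],
    "Cal_lat": [63.43338, 63.43149],
})

# keep Actual_Lat where present, fall back to Cal_lat where missing
df["Actual_Lat"] = df["Actual_Lat"].fillna(df["Cal_lat"])
print(df["Actual_Lat"].tolist())  # [63.433376, 63.43149]
```

Both forms fill only the missing entries, so rows with a real GPS fix are left untouched.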
