I have the code below:
result, diff = [], []
for index, row in final.iterrows():
    for column in final.columns:
        if ((final['close'] - final['open']) > 20):
            diff = final['close'] - final['open']
            result = 1
        elif ((final['close'] - final['open']) < -20):
            diff = final['close'] - final['open']
            result = -1
        elif (-20 < (final['close'] - final['open']) < 20):
            diff = final['close'] - final['open']
            result = 0
        else:
            continue
The intention is, for every timestamp, to check whether close - open is greater than 20 pips and assign a buy value to it; if it's less than -20, assign a sell value; and if it's in between, assign a 0.
I am getting this error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
[Finished in 35.418s]
Someone more experienced with pandas would give a better answer, but since no one is answering, here's mine. You generally don't want to iterate over a pandas.DataFrame directly, as that defeats the purpose. A pandas solution would look more like:
import pandas as pd
data = {
    'symbol': ['WZO', 'FDL', 'KXW', 'GYU', 'MIR', 'YAC', 'CDE', 'DSD', 'PAM', 'BQE'],
    'open': [356, 467, 462, 289, 507, 654, 568, 646, 440, 625],
    'close': [399, 497, 434, 345, 503, 665, 559, 702, 488, 608]
}
df = pd.DataFrame.from_dict(data)
df['diff'] = df['close'] - df['open']
df.loc[(df['diff'] < 20) & (df['diff'] > -20), 'result'] = 0
df.loc[df['diff'] >= 20, 'result'] = 1
df.loc[df['diff'] <= -20, 'result'] = -1
df now contains:
symbol open close diff result
0 WZO 356 399 43 1.0
1 FDL 467 497 30 1.0
2 KXW 462 434 -28 -1.0
3 GYU 289 345 56 1.0
4 MIR 507 503 -4 0.0
5 YAC 654 665 11 0.0
6 CDE 568 559 -9 0.0
7 DSD 646 702 56 1.0
8 PAM 440 488 48 1.0
9 BQE 625 608 -17 0.0
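As a side note, result comes out as float (1.0, 0.0, -1.0) because the first .loc assignment creates the column with NaN for the not-yet-matched rows, which forces a float dtype. If you want integer labels, a hedged alternative is np.select (assuming numpy is available as np):
import numpy as np
conditions = [df['diff'] >= 20, df['diff'] <= -20]
df['result'] = np.select(conditions, [1, -1], default=0)  # int column: 1 buy, -1 sell, 0 otherwise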
Regarding your code, I'll repeat my comment from above: you are iterating by row, but then using the whole DataFrame final in your conditions; comparing a whole Series in an if is exactly what raises the "truth value of a Series is ambiguous" error, and I think you meant to use row there. You don't need to iterate over columns, grabbing your values by index. Your conditions miss the case where final['close'] - final['open'] is exactly 20 (or -20). Finally, result, diff = [], [] are lists at the top, but are then reassigned as integers in the loop; perhaps you want result.append()? A sketch of the fixed loop is below.
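For completeness, here is a minimal sketch of the loop-based fix along those lines; it is slower than the vectorized version above, and assumes your final DataFrame has open and close columns:
result, diff = [], []
for index, row in final.iterrows():
    d = row['close'] - row['open']  # a scalar, so the if comparisons are unambiguous
    diff.append(d)
    if d >= 20:
        result.append(1)
    elif d <= -20:
        result.append(-1)
    else:
        result.append(0)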
I have this problem which I've been trying to solve:
I want the code to take this DataFrame, group by whichever column contains the most frequent number, and sum the values in the last column. For example:
import pandas as pd

df = pd.DataFrame({'A': [1000, 1000, 1000, 1000, 1000, 200, 200, 500, 500],
                   'B': [380, 380, 270, 270, 270, 45, 45, 45, 55],
                   'C': [380, 380, 270, 270, 270, 88, 88, 88, 88],
                   'D': [45, 32, 67, 89, 51, 90, 90, 90, 90]})
df
A B C D
0 1000 380 380 45
1 1000 380 380 32
2 1000 270 270 67
3 1000 270 270 89
4 1000 270 270 51
5 200 45 88 90
6 200 45 88 90
7 500 45 88 90
8 500 55 88 90
I would like the code to show the result below:
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Notice that the most frequent value in the first five rows is 1000, so I group on column 'A' and get the sum 284 in column 'D'. However, in the last four rows the most frequent number, which is 88, is not in column 'A' but in column 'C', so I sum the values in column 'D' by grouping on column 'C' and get 360. I am not sure if I made myself clear.
I tried df['D'] = df.groupby(['A', 'B', 'C'])['D'].transform('sum'), but it does not produce the desired result shown above.
Is there any pandas-style way of resolving this? Thanks in advance!
Code
import numpy as np

def get_count_sum(col, func):
    return df.groupby(col).D.transform(func)
ga = get_count_sum('A', 'count')
gb = get_count_sum('B', 'count')
gc = get_count_sum('C', 'count')
conditions = [
    (ga > gb) & (ga > gc),
    (gb > ga) & (gb > gc),
    (gc > ga) & (gc > gb),
]
choices = [get_count_sum('A', 'sum'),
           get_count_sum('B', 'sum'),
           get_count_sum('C', 'sum')]
df['D'] = np.select(conditions, choices)
df
Output
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Explanation
Since we need to group by whichever of the columns 'A', 'B', or 'C' holds the most frequently repeated value, we first compute the per-column group counts and store the groupby output in ga, gb and gc for columns A, B and C respectively.
conditions then checks which column has the most frequent value.
choices supplies the grouped sum to use for each corresponding condition.
np.select works like an if-elif-else chain: for each row, it picks the entry from choices whose condition is true.
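For illustration, here is a tiny standalone np.select example (with made-up values) showing that if-elif-else behaviour:
import numpy as np
x = np.array([5, 15, 25])
print(np.select([x < 10, x < 20], ['low', 'mid'], default='high'))
# ['low' 'mid' 'high']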
How do I sum my data counts by week, and if the last week is not yet complete, normalize it using the average?
Let's say these are my lists:
days = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
Thanks
Here is a solution with Pandas:
1) Create dataframe:
import pandas as pd

df = pd.DataFrame({'days': days, 'counts': counts})
df['week'] = df.days // 7  # add a week column; day // 7 matches the grouping in the output below
2) Calculate sum and mean by week, then produce the normalized sum:
d2 = df.groupby('week').agg({'counts': ['sum', 'mean']})  # weekly sum and mean
d2['norm_sum'] = d2[('counts', 'mean')] * 7
3) Output:
print(d2)
      counts               norm_sum
         sum         mean
week
0      10102  1683.666667  11785.666667
1      10158  1451.142857  10158.000000
2       8787  1255.285714   8787.000000
3       4695   670.714286   4695.000000
4       3814   953.500000   6674.500000
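Because of the dict-of-lists aggregation, d2 ends up with MultiIndex columns; if flat column names are easier to work with, one option is to rename them afterwards:
d2.columns = ['sum', 'mean', 'norm_sum']  # flatten the MultiIndex for easier access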
I do not know how to use pandas in this case, but I would do it using built-in Python modules in the following way:
from collections import defaultdict
from statistics import mean
days = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
weeks = [d//7 for d in days]
avg_count = int(mean(counts))
weeks = weeks + [weeks[-1]]*(len(weeks)%7)    # pad the incomplete last week out to 7 days
counts = counts + [avg_count]*(len(counts)%7)  # fill the padded days with the average
count_per_week = defaultdict(int)
for w, c in zip(weeks, counts):
    count_per_week[w] += c
print(dict(count_per_week))
Output:
{0: 10102, 1: 10158, 2: 8787, 3: 4695, 4: 7447}
Note that I assume the average is a reasonable filler value, which does not always have to hold true. defaultdict(int), when asked for a non-existing key, will set that key's value to int(), which is 0.
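A quick illustration of that defaultdict behaviour:
from collections import defaultdict
tally = defaultdict(int)
tally['new_key'] += 5  # the missing key is created as int(), i.e. 0, then incremented
print(tally['new_key'])  # 5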
This is my approach:
The data:
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
To array:
import numpy as np
import pandas as pd

counts = np.array(counts)
Reshape: (thanks to https://stackoverflow.com/users/4427777/daniel-f)
def shapeshifter(num_col, my_array):
    # pad with zeros up to a multiple of num_col, then reshape into rows of num_col
    padded = np.lib.pad(my_array, (0, num_col - len(my_array) % num_col),
                        'constant', constant_values=0)
    return padded.reshape(-1, num_col)

data = shapeshifter(7, counts)
array([[1839, 1334, 2241, 2063, 1216, 1409, 1614],
[1860, 1298, 1140, 1122, 2153, 971, 1650],
[1835, 889, 653, 484, 2078, 1198, 426],
[ 684, 910, 701, 851, 360, 763, 402],
[1853, 400, 1159, 0, 0, 0, 0]])
To dataframe with zeros converted to NaN:
df = pd.DataFrame(data)
df[df == 0] = np.nan
Fill missing values with the mean value of the month:
df = df.fillna(counts.mean())
df
0 1 2 3 4 5 6
0 1839 1334 2241 2063.000000 1216.000000 1409.000000 1614.000000
1 1860 1298 1140 1122.000000 2153.000000 971.000000 1650.000000
2 1835 889 653 484.000000 2078.000000 1198.000000 426.000000
3 684 910 701 851.000000 360.000000 763.000000 402.000000
4 1853 400 1159 1211.483871 1211.483871 1211.483871 1211.483871
Get the sum by row or week:
df.sum(axis=1)
0    11716.000000
1    10194.000000
2     7563.000000
3     4671.000000
4     8257.935484
dtype: float64
I have 2 dataframes:
print(d)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 53 53
1 2020 3443 455 455 455
2 2021 6777 123 123 123
3 2019 5466 313 313 313
4 2020 4656 545 545 545
5 2021 4565 775 775 775
6 2019 4654 567 567 567
7 2020 7867 657 657 657
8 2021 6766 567 567 567
print(d1)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 73 63
import pandas as pd

d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount1': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount2': [53, 455, 123, 313, 545, 775, 567, 657, 567]
})
d1 = pd.DataFrame({
    'Year': [2019],
    'Salary': [1200],
    'Amount': [53],
    'Amount1': [73],
    'Amount2': [63]
})
I want to compare the 'Salary' value of dataframe d1 (i.e. 1200) with all the values of 'Salary' in dataframe d and count whether each is >= or < (a Boolean comparison). This is to be done for all the columns (Amount, Amount1, Amount2, etc.); if the value in any column of d1 is NaN/None, no comparison needs to be done. The column names will always be the same, so it is basically a one-to-one column comparison.
My approach and thoughts -
I can get the values of d1 in a list by doing -
l = []
for i in range(len(d1.columns.values)):
    if i == 0:
        continue
    else:
        num = d1.iloc[0, i]
        l.append(num)
print(l)

# list comprehension equivalent
lst = [d1.iloc[0, i] for i in range(len(d1.columns.values)) if i != 0]
[1200, 53, 73, 63]
and then use iterrows to iterate over all the columns and rows in dataframe d, or
I can iterate over d and then perform a similar comparison by looping over d1; but either way would be time-consuming for a high-dimensional DataFrame (d in this case).
What would be the more efficient or pythonic way of doing it?
IIUC, you can do:
(d >= d1.values).sum()
Output:
Year 9
Salary 9
Amount 9
Amount1 8
Amount2 8
dtype: int64
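One caveat worth noting: comparisons against NaN are always False, so any NaN in d1 would simply contribute a count of 0 rather than being skipped. If you need to exclude NaN columns explicitly, as the question asks, a minimal sketch (assuming the frames d and d1 from above):
valid = d1.columns[d1.iloc[0].notna()]   # keep only columns where d1 has a real value
(d[valid] >= d1[valid].values).sum()     # same broadcast comparison, restricted to valid columns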
I have a df like this; it's a DataFrame and all values are floats:
import numpy as np
import pandas as pd

data = np.random.randint(3000, size=(10, 1))
data = pd.DataFrame(data)
For each value, if it's between 570 and 1140, I want to subtract 570.
If it's over 1140, I want to subtract 1140 from the value. I wrote this function to do that.
def AdjustTimes(val):
    if val > 570 and val < 1140:
        val = val - 570
    elif val > 1140:
        val = val - 1140
Based on another question, I tried to apply it using data.applymap(AdjustTimes). I got no error, but the function does not seem to have been applied.
Setup
data
0
0 1863
1 2490
2 2650
3 2321
4 822
5 82
6 2192
7 722
8 2537
9 874
First, let's create masks for each of your conditions. One idiomatic pandas approach is using between to retrieve a mask for the first condition -
m1 = data.loc[:, 0].between(570, 1140, inclusive=True)  # on pandas >= 1.3, use inclusive='both'
Or, you can do this with a couple of logical operators -
m1 = data.loc[:, 0].ge(570) & data.loc[:, 0].le(1140)
And,
m2 = data.loc[:, 0].gt(1140)
Now, to perform replacement, you have a couple of options.
Option 1
Use loc to index and subtract -
data.loc[m1, 0] -= 570
data.loc[m2, 0] -= 1140
data
0
0 723
1 1350
2 1510
3 1181
4 252
5 82
6 1052
7 152
8 1397
9 304
Equivalent version for a pd.Series -
m1 = data.ge(570) & data.le(1140)
m2 = data.gt(1140)
data.loc[m1] -= 570
data.loc[m2] -= 1140
Option 2
You can also do this with np.where (but it'd be a bit less efficient).
v = data.loc[:, 0]
data.loc[:, 0] = np.where(m1, v - 570, np.where(m2, v - 1140, v))
Here, m1 and m2 are the masks computed from before.
data
0
0 723
1 1350
2 1510
3 1181
4 252
5 82
6 1052
7 152
8 1397
9 304
Equivalent pd.Series code -
data[:] = np.where(m1, data - 570, np.where(m2, data - 1140, data))
Could you try something like:
data = np.random.randint(3000, size=(10, 1))
data = pd.DataFrame(data)
data = data - 570*((data > 570) & (data < 1140)) - 1140*(data > 1140)
This works because the boolean masks behave as 0/1 in arithmetic, so each value has 570, 1140, or nothing subtracted from it.
The applymap method is designed to generate a new dataframe, not to modify an existing one (and the function it calls should return a value for the new cell rather than modifying its argument). You don't show the line where you actually use applymap, but I suspect it's just data.applymap(AdjustTimes) on its own. If you change your code to the following it should work fine:
def AdjustTimes(val):
    if val >= 1140:
        return val - 1140
    elif val >= 570:
        return val - 570
    return val  # values below 570 pass through unchanged

data = data.applymap(AdjustTimes)
(I've also cleaned up the if statements to be a little faster, to return a value in every branch, and to handle the case where val equals 1140, which your original code wouldn't adjust.)
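One more caveat if you are on a recent pandas (2.1 or newer): applymap has been deprecated and renamed to DataFrame.map, so there you would write:
data = data.map(AdjustTimes)  # elementwise, same behaviour as applymap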
I have been working on a programming challenge, problem here, which basically states:
Given integer array, you are to iterate through all pairs of neighbor
elements, starting from beginning - and swap members of each pair
where first element is greater than second.
And then return the number of swaps made and the checksum of the final answer. My program seems to perform both the swapping and the checksum, but my final answer is off for everything except the test input they gave.
So: 1 4 3 2 6 5 -1
Results in the correct output: 3 5242536 with my program.
But something like:
2 96 7439 92999 240 70748 3 842 74 706 4 86 7 463 1871 7963 904 327 6268 20955 92662 278 57 8 5912 724 70916 13 388 1 697 99666 6924 2 100 186 37504 1 27631 59556 33041 87 9 45276 -1
Results in: 39 1291223 when the correct answer is 39 3485793.
Here's what I have at the moment:
# Python 2.7
def check_sum(data):
    data = [str(x) for x in str(data)[::]]
    numbers = len(data)
    result = 0
    for number in range(numbers):
        result += int(data[number])
        result *= 113
        result %= 10000007
    return(str(result))

def bubble_in_array(data):
    numbers = data[:-1]
    numbers = [int(x) for x in numbers]
    swap_count = 0
    for x in range(len(numbers)-1):
        if numbers[x] > numbers[x+1]:
            temp = numbers[x+1]
            numbers[x+1] = numbers[x]
            numbers[x] = temp
            swap_count += 1
    raw_number = int(''.join([str(x) for x in numbers]))
    print('%s %s') % (str(swap_count), check_sum(raw_number))

bubble_in_array(raw_input().split())
Does anyone have any idea where I am going wrong?
The issue is with your way of calculating the checksum: it fails when the array has numbers with more than one digit. For example:
2 96 7439 92999 240 70748 3 842 74 706 4 86 7 463 1871 7963 904 327 6268 20955 92662 278 57 8 5912 724 70916 13 388 1 697 99666 6924 2 100 186 37504 1 27631 59556 33041 87 9 45276 -1
You are calculating the checksum of 2967439240707483842747064867463187179639043276268209559266227857859127247091613388169792999692421001863750412763159556330418794527699666 digit by digit, while you should calculate the checksum of the list [2, 96, 7439, 240, 70748, 3, 842, 74, 706, 4, 86, 7, 463, 1871, 7963, 904, 327, 6268, 20955, 92662, 278, 57, 8, 5912, 724, 70916, 13, 388, 1, 697, 92999, 6924, 2, 100, 186, 37504, 1, 27631, 59556, 33041, 87, 9, 45276, 99666]
The fix:
# Python 2.7
def check_sum(data):
    result = 0
    for number in data:
        result += number
        result *= 113
        result %= 10000007
    return result

def bubble_in_array(data):
    numbers = [int(x) for x in data[:-1]]
    swap_count = 0
    for x in xrange(len(numbers)-1):
        if numbers[x] > numbers[x+1]:
            numbers[x+1], numbers[x] = numbers[x], numbers[x+1]
            swap_count += 1
    print('%d %d') % (swap_count, check_sum(numbers))

bubble_in_array(raw_input().split())
More notes:
To swap two variables in Python, you don't need a temp variable; just use a, b = b, a.
In Python 2.x, use xrange instead of range.
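And if you are on Python 3, a minimal port of the fixed version (print becomes a function, raw_input becomes input, and xrange becomes range):
def check_sum(numbers):
    result = 0
    for number in numbers:
        result = (result + number) * 113 % 10000007  # same running checksum as above
    return result

def bubble_in_array(data):
    numbers = [int(x) for x in data[:-1]]  # drop the trailing -1 sentinel
    swap_count = 0
    for x in range(len(numbers) - 1):
        if numbers[x] > numbers[x + 1]:
            numbers[x + 1], numbers[x] = numbers[x], numbers[x + 1]
            swap_count += 1
    print('%d %d' % (swap_count, check_sum(numbers)))

bubble_in_array(input().split())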