I have a dataframe and specifically need to find the min, max, and average of the age column using loops. However, try as I might. It just did not work. I hope somebody could help me see where the problem is. Thank you.
Here is my code
total_age = 0
i=0
max_amount = adult_data["age"][i]
min_amount = adult_data["age"][i]
for i in range(len(adult_data["age"])):
total_age = adult_data["age"][i] + total_age,
i = i + 1,
if adult_data["age"][i] > max_amount:
max_amount = adult_data["age"][i],
if adult_data["age"][i] < min_amount:
min_amount = adult_data["age"][i],
print(total_age)
The error I am currently getting is
ValueError: key of type tuple not found and not a MultiIndex
The commas at the end of statements indicate tuples. For example, i = i + 1,is the same as i = (i + 1,), where (i + 1,) is a tuple with one element.
So, your code is essentially the same as:
for i in range(len(adult_data["age"])):
total_age = (adult_data["age"][i] + total_age,)
i = (i + 1,)
if adult_data["age"][i] > max_amount:
max_amount = (adult_data["age"][i],)
if adult_data["age"][i] < min_amount:
min_amount = (adult_data["age"][i],)
That's a lot of tuples!
In other words, you don't need the commas. You also don't need i = i + 1 because the range automatically increments i.
Try:
adult_data["age"].agg(['min','max','mean'])
Related
I currently use something like the similar bit of code to determine comparison
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10
counter = 1
for number in list_of_numbers:
if (high >= number) \
and (counter < lookback):
counter += 1
else:
break
The resulted counter magnitude will be 7. However, it is very taxing on large data arrays. So, I have looked for a solution and came up with np.argmax(), but there seems to be an issue. For example the following:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
np_list = np.array(list_of_numbers)
high = 29980.0
print(np.argmax(np_list > high) + 1)
this will get output 1, just like argmax is suppose to .. but I want it to get output 7. Is there another method to do this that will give me similar output for the if statement ?
You can get a boolean array for where high >= number using NumPy:
list_of_numbers = [29800.0, 29795.0, 29795.0, 29740.0, 29755.0, 29745.0]
high = 29980.0
lookback = 10
boolean_arr = np.less_equal(np.array(list_of_numbers), high)
Then finding where is the first False argument in that to satisfy break condition in your code. Furthermore, to consider countering, you can use np.cumsum on the boolean array and find the first argument that satisfying specified lookback magnitude. So, the result will be the smaller value between break_arr and lookback_lim:
break_arr = np.where(boolean_arr == False)[0][0] + 1
lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
result = min(break_arr, lookback_lim)
If your list_of_numbers have not any bigger value than your specified high limit for break_arr or the specified lookback exceeds values in np.cumsum(boolean_arr) for lookback_lim, the aforementioned code will get stuck with an error like the following, relating to np.where:
IndexError: index 0 is out of bounds for axis 0 with size 0
Which can be handled by try-except or if statements e.g.:
try:
break_arr = np.where(boolean_arr == False)[0][0] + 1
except:
break_arr = len(boolean_arr) + 1
try:
lookback_lim = np.where(np.cumsum(boolean_arr) == lookback)[0][0] + 1
except:
lookback_lim = len(boolean_arr) + 1
You have you less than sign backwards, no? The following should work as the for-loop:
print(np.min([np.sum(np.array(list_of_numbers) < high) + 1, lookback]))
A look back can be accomplished using shift. A cumcount can be used to get a running total. A query can be used as a filter
Why am I getting this error?
ticker = 'NFLX'
price = get_data(ticker, start_date='2020-01-01', end_date=None, index_as_date=bool, interval ='1d')
price.to_csv(r'D:\Python Stuff\pythonProject\NFLX.csv')
df = pd.read_csv('NFLX.csv')
price_list = df['adjclose']
change = price_list.diff(1)
def RSI():
change.dropna(inplace=True)
positive = change.copy()
negative = change.copy()
positive[positive < 0] = 0
negative[negative > 0] = 0
RSI_days = 5
average_gain = positive.rolling(window=RSI_days).mean()
average_loss = abs(negative.rolling(window=RSI_days).mean())
relative_strength = average_gain/average_loss
rsi = 100.0-(100.0/(1.0+relative_strength))
return rsi
relative_strength_index = RSI()
print(relative_strength_index[-6:])
lowest_rsi = min(relative_strength_index[-6:])
print(lowest_rsi)
index_number = relative_strength_index[-6:].index(lowest_rsi)
print('index of lowest rsi = ' + index_number)
When I run the code, my error is "TypeError: 'Int64Index' object is not callable". How can I improve this code to resolve this error?
Replace the parenthesis in index(lowest_rsi) by brackets
index_number = relative_strength_index[-6:].index[lowest_rsi.astype(int)]
print('index of lowest rsi = ' + index_number)
Edit: Your code should probably look something like this:
relative_strength_index = RSI()
print(relative_strength_index[-6:])
index_number = relative_strength_index[-6:].idxmin()
print('index of lowest rsi =', index_number)
relative_strength_index is not a list but is of type pandas.Series.
Therefore, index is a property and not a method like with list, which is where your original error came from and you cannot call that property.
You should instead cast relative_strength_index to a list and the you can proceed to call the index method as shown below.
Also, you should cast index_number to a string before concatenating it.
index_number = list(relative_strength_index[ -6: ]).index(lowest_rsi)
print('index of lowest rsi = ' + str(index_number))
I have a dataframe dframe and i want to iterate over lines of my column number if number <5 then my column STATE takes the value 'vv'
if number is between [5..17] so my value of my column STATE takes 'xx'.
ELSE STATE takes 'yy'.
SO I wrote this code but it doesn't work..
ANy helps please.
thank you
`
for it in dframe['number']:
if (it < 5):
dframe['STATE'] = 'vv'
elif (it >= 5 & it < 17):
dframe['STATE'] = 'xx'
else:
dframe['STATE'] = 'yy'`
You can do this in one line with a list comprehension:
dframe['STATE'] = ['vv' if (i < 5) else 'xx' if (i >= 5) & (i < 17) else 'yy' for i in dframe['number']]
Below code should work for you.
dframe['state']=''
dframe.loc[dframe['number'] <5, 'state'] = 'vv'
dframe.loc[(dframe['number'] >5) & (dframe['number']<17), 'state'] = 'xx'
dframe.loc[dframe['state'] =='', 'state'] = 'yy'
I'm trading daily on Cryptocurrencies and would like to find which are the most desirable Cryptos for trading.
I have CSV file for every Crypto with the following fields:
Date Sell Buy
43051.23918 1925.16 1929.83
43051.23919 1925.12 1929.79
43051.23922 1925.12 1929.79
43051.23924 1926.16 1930.83
43051.23925 1926.12 1930.79
43051.23926 1926.12 1930.79
43051.23927 1950.96 1987.56
43051.23928 1190.90 1911.56
43051.23929 1926.12 1930.79
I would like to check:
How many quotes will end with profit:
for Buy positions - if one of the following Sells > current Buy.
for Sell positions - if one of the following Buys < current Sell.
How much time it would take to a theoretical position to become profitable.
What can be the profit potential.
I'm using the following code:
#converting from OLE to datetime
OLE_TIME_ZERO = dt.datetime(1899, 12, 30, 0, 0, 0)
def ole(oledt):
return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))
#variables initialization
buy_time = ole(43031.57567) - ole(43031.57567)
sell_time = ole(43031.57567) - ole(43031.57567)
profit_buy_counter = 0
no_profit_buy_counter = 0
profit_sell_counter = 0
no_profit_sell_counter = 0
max_profit_buy_positions = 0
max_profit_buy_counter = 0
max_profit_sell_positions = 0
max_profit_sell_counter = 0
df = pd.read_csv("C:/P/Crypto/bitcoin_test_normal_276k.csv")
#comparing to max
for index, row in df.iterrows():
a = index + 1
df_slice = df[a:]
if df_slice["Sell"].max() - row["Buy"] > 0:
max_profit_buy_positions += df_slice["Sell"].max() - row["Buy"]
max_profit_buy_counter += 1
for index1, row1 in df_slice.iterrows():
if row["Buy"] < row1["Sell"] :
buy_time += ole(row1["Date"])- ole(row["Date"])
profit_buy_counter += 1
break
else:
no_profit_buy_counter += 1
#comparing to sell
for index, row in df.iterrows():
a = index + 1
df_slice = df[a:]
if row["Sell"] - df_slice["Buy"].min() > 0:
max_profit_sell_positions += row["Sell"] - df_slice["Buy"].min()
max_profit_sell_counter += 1
for index2, row2 in df_slice.iterrows():
if row["Sell"] > row2["Buy"] :
sell_time += ole(row2["Date"])- ole(row["Date"])
profit_sell_counter += 1
break
else:
no_profit_sell_counter += 1
num_rows = len(df.index)
buy_avg_time = buy_time/num_rows
sell_avg_time = sell_time/num_rows
if max_profit_buy_counter == 0:
avg_max_profit_buy = "There is no profitable buy positions"
else:
avg_max_profit_buy = max_profit_buy_positions/max_profit_buy_counter
if max_profit_sell_counter == 0:
avg_max_profit_sell = "There is no profitable sell positions"
else:
avg_max_profit_sell = max_profit_sell_positions/max_profit_sell_counter
The code works fine for 10K-20K lines but for a larger amount (276K) it take a long time (more than 10 hrs)
What can I do in order to improve it?
Is there any "Pythonic" way to compare each value in a data frame to all following values?
note - the dates in the CSV are in OLE so I need to convert it to Datetime.
File for testing:
Thanks for your comment.
Here you can find the file that I used:
First, I'd want to create the cumulative maximum/minimum values for Sell and Buy per row, so it's easy to compare to. pandas has cummax and cummin, but they go the wrong way. So we'll do:
df['Max Sell'] = df[::-1]['Sell'].cummax()[::-1]
df['Min Buy'] = df[::-1]['Buy'].cummin()[::-1]
Now, we can just compare each row:
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']
I'm positive this isn't exactly what you want as I don't perfectly understand what you're trying to do, but hopefully it leads you in the right direction.
After comparing your function and mine, there is a slight difference, as your a is offset one off the index. Removing that offset, you'll see that my method produces the same results as yours, only in vastly shorter time:
for index, row in df.iterrows():
a = index
df_slice = df[a:]
assert (df_slice["Sell"].max() - row["Buy"]) == df['Max Sell'][a] - df['Buy'][a]
else:
print("All assertions passed!")
Note this will still take the very long time required by your function. Note that this can be fixed with shift, but I don't want to run your function for long enough to figure out what way to shift it.
I started learning Python < 2 weeks ago.
I'm trying to make a function to compute a 7 day moving average for data. Something wasn't going right so I tried it without the function.
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
sum_7 = np.array([])
avg_7 = 0
missing = 0
total = 7
j = 0
for j in range(i,i+7):
if pd.isnull(temp[j]):
total -= 1
missing += 1
if missing == 7:
moving_average = np.append(moving_average, np.nan)
break
if not pd.isnull(temp[j]):
sum_7 = np.append(sum_7, temp[j])
if j == (i+6):
avg_7 = sum(sum_7)/total
moving_average = np.append(moving_average, avg_7)
If I run this and look at the value of sum_7, it's just a single value in the numpy array which made all the moving_average values wrong. But if I remove the first for loop with the variable i and manually set i = 0 or any number in the range of the data set and run the exact same code from the inner for loop, sum_7 comes out as a length 7 numpy array. Originally, I just did sum += temp[j] but the same problem occurred, the total sum ended up as just the single value.
I've been staring at this trying to fix it for 3 hours and I'm clueless what's wrong. Originally I wrote the function in R so all I had to do was convert to python language and I don't know why sum_7 is coming up as a single value when there are two for loops. I tried to manually add an index variable to act as i to use it in the range(i, i+7) but got some weird error instead. I also don't know why that is.
https://gyazo.com/d900d1d7917074f336567b971c8a5cee
https://gyazo.com/132733df8bbdaf2847944d1be02e57d2
Hey you can using rolling() function and mean() function from pandas.
Link to the documentation :
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html
df['moving_avg'] = df['your_column'].rolling(7).mean()
This would give you some NaN values also, but that is a part of rolling mean because you don't have all past 7 data points for first 6 values.
Seems like you misindented the important line:
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
sum_7 = np.array([])
avg_7 = 0
missing = 0
total = 7
j = 0
for j in range(i,i+7):
if pd.isnull(temp[j]):
total -= 1
missing += 1
if missing == 7:
moving_average = np.append(moving_average, np.nan)
break
# The following condition should be indented one more level
if not pd.isnull(temp[j]):
sum_7 = np.append(sum_7, temp[j])
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
if j == (i+6):
# this ^ condition does not do what you meant
# you should use a flag instead
avg_7 = sum(sum_7)/total
moving_average = np.append(moving_average, avg_7)
Instead of a flag you can use a for-else construct, but this is not readable. Here's the relevant documentation.
Shorter way to do this:
moving_average = np.array([])
for i in range(len(temp)-6):
ngram_7 = [t for t in temp[i:i+7] if not pd.isnull(t)]
average = (sum(ngram_7) / len(ngram_7)) if ngram_7 else np.nan
moving_average = np.append(moving_average, average)
This could be refactored further:
def average(ngram):
valid = [t for t in temp[i:i+7] if not pd.isnull(t)]
if not valid:
return np.nan
return sum(valid) / len(valid)
def ngrams(seq, n):
for i in range(len(seq) - n):
yield seq[i:i+n]
moving_average = [average(k) for k in ngrams(temp, 7)]