Pandas NaN with log - python

I'm getting the error "setting an array element with a sequence". Can anyone help me solve it? I used the code below to turn unwanted values into NaN so that I can calculate the log and then plot the result.
import numpy as np
import pandas as pd

d = np.array(Hnew)
df = pd.DataFrame(data=d)
df = df.mask(df < 62.5)
h = np.zeros(np.size(df))
for i in range(0, np.size(df)):
    h[i] = 5 - np.log((df[i] - 62.5) / 0.915)

This should work:
h = 5 - np.log((df.mask(df['val'] <= 62.5)['val'] - 62.5) / 0.915)
You tried to assign the Series that np.log returns to a single element of a float64 array, which isn't possible (that's the reason for the message). But np.log already returns the Series you probably want.
Please also note that I changed < 62.5 to <= 62.5, because a value of exactly 62.5 would otherwise lead to the log of 0, which gives -inf (or a warning).
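For completeness, here is a minimal end-to-end sketch of the masking/log approach, assuming the data sits in a single column called val (the sample values below are made up):
import numpy as np
import pandas as pd

# hypothetical sample data standing in for Hnew
df = pd.DataFrame({'val': [60.0, 63.0, 70.0, 62.5, 80.0]})

# values <= 62.5 become NaN, so the log is only taken where it is defined
masked = df['val'].mask(df['val'] <= 62.5)
h = 5 - np.log((masked - 62.5) / 0.915)
print(h)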

Related

Attempting to find location of values in a pandas Dataframe if certain conditions are met

I have this DataFrame.
                         High      Close
Close Time
2022-10-23 21:41:59.999  19466.02  19461.29
2022-10-23 21:42:59.999  19462.48  19457.83
2022-10-23 21:43:59.999  19463.13  19460.09
2022-10-23 21:44:59.999  19465.15  19463.76
I'm attempting to check if Close at a later date (up to 600 rows later but no more) goes above the close of an earlier date & High is lower than the High of the same earlier date then I want to get the location of both the earlier and later date and make new columns in the Dataframe with those locations.
Expected output:
                         High      Close     LC        HC        HH        LH
Close Time
2022-10-23 21:41:59.999  19466.02  19461.29  19461.29  NaN       19466.02  NaN
2022-10-23 21:42:59.999  19462.48  19457.83  NaN       NaN       NaN       NaN
2022-10-23 21:43:59.999  19463.13  19460.09  NaN       NaN       NaN       NaN
2022-10-23 21:44:59.999  19465.15  19463.76  NaN       19463.76  NaN       19465.15
This is the code I have tried
# Checking if conditions are met
for i in range(len(df)):
    for a in range(i, 600):
        if (df.iloc[i:, 1] < df.iloc[a, 1]) & (df.iloc[i:, 0] > df.iloc[a, 0]):
            # Creating new DataFrame columns
            df['LC'] = df.iloc[i, 1]
            df['HC'] = df.get_loc[i, 1]
            df['HH'] = df.get_loc[a, 0]
            df['LH'] = df.get_loc[a, 0]
        else:
            continue
This line: if (df.iloc[i:, 1] < df.iloc[a, 1]) & (df.iloc[i:, 0] > df.iloc[a, 0]):
is causing the error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I believe I should be using any() instead of a plain if statement, but I am unsure how to apply it. I also think there may be an issue with the way I am using df.get_loc[], but I am not sure. I'm a pandas beginner, so if it is obvious I apologize.
Here is an image to help visualise what I am attempting to do, using a candlestick chart.
What I want to do is check if HC is higher than LC and LH is lower than HH, then add that data to new columns in the DataFrame.
Here is an additional way I tried to achieve the desired output
idx_close, idx_high = map(df.columns.get_loc, ["Close", "High"])
# Check Conditions
for i in range(len(df)):
    bool_l = [((df.iloc[i, idx_close] < df.iloc[a, idx_close]) &
               (df.iloc[i, idx_high] > df.iloc[a, idx_high])
              ).any() for a in range(i, 600)]
    # Creating new DataFrame columns
    df.loc[i, 'LC'] = df.iloc[i, 1]
    df.loc[bool_l, 'HC'] = df.iloc[bool_l, 1]
    # Creating new DataFrame columns
    df.loc[i, 'HH'] = df.iloc[i, 0]
    df.loc[bool_l, 'LH'] = df.iloc[bool_l, 0]
And I get the error IndexError: Boolean index has wrong length: 600 instead of 2867
on the line df.loc[bool_l, 'HC'] = df.iloc[bool_l, 1].
I assume the error comes from range(i, 600), but I don't know how to get around it.
As mentioned by @Jimpsoni, the question is a little unclear in defining what you mean by LC, HC, HH and LH. I will use the definitions below to answer your question:
C_t is the close price on a given day, t
H_t is the high price on a given day, t
Then if I understand correctly, you want to check the following two conditions:
close condition: is there a future close price in the next 600 days which is higher than the current close price?
high condition: is there a future high price in the next 600 days which is lower than the high price from today?
And then for each day t, you want to find the first day in the future where both the conditions above are satisfied simultaneously.
With this understanding, I sketch a solution below.
Part 1: Setting up sample data (do this in your question next time)
import pandas as pd
import numpy as np
np.random.seed(2022)
# make example data
close = np.sin(range(610)) + 10
high = close + np.random.rand(*close.shape)
close[2] += 100 # to ensure the close condition cannot be met
dates = pd.date_range(end='2022-06-30', periods=len(close))
# insert into pd.dataframe
df = pd.DataFrame(index=dates, data=np.array([close, high]).T, columns=['close', 'high'])
The sample dataframe looks like below:
Part 2: Understand how to use rolling functions
Since at each point in time you want to subset your dataframe to a particular view in a rolling manner (always look 600 forward), it makes sense to use the built in .rolling method of pandas dataframes here. The .rolling() method is similar to a groupby in the sense that it allows you to iterate over different subsets of the dataframe, without explicitly writing the loop. rolling by default looks backwards, so we need to import a forward looking indexer to achieve the forward window. Note that you can also achieve the same forward window with some shifting, but it is less intuitive. The below chunk demonstrates how both methods give you the same result:
# create the rolling window object
# the simplest solution is to use the fixed size forward window as below
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=600)
method_1 = df.rolling(indexer, min_periods=600).sum()
# can also achieve the same thing by doing the rolling computation and then shifting backward
method_2 = df.rolling(600, min_periods=600).sum().shift(-599)
# confirm they are the same
(method_1.dropna() == method_2.dropna()).mean().mean() # returns 1.0
Part 3: Write the logic that will be applied to each rolling view
Once we have the rolling window object, we can simply apply a function to each rolling view of the dataframe. Here is the function; some comments follow below:
def check_conditions(ser, df_full):
    df_chunk = df_full.loc[ser.index, :]
    today_close, today_high = df_chunk.iloc[0, :2]
    future_close_series, future_high_series = df_chunk.iloc[1:, 0], df_chunk.iloc[1:, 1]
    close_condition = today_close < future_close_series
    high_condition = today_high > future_high_series
    both_condition = close_condition & high_condition
    if both_condition.sum() == 0:
        return 1
    first_date_satisfied = future_close_series.index[np.where(both_condition)][0]
    future_vals = df_chunk.loc[first_date_satisfied, ['close', 'high']].values
    df_full.loc[ser.index[0], ['Future_Close', 'Future_High', 'Future_Date']] = np.append(future_vals, first_date_satisfied)
    return 1
Some comments: First, notice that the function takes two arguments: the first is a series, and the second is the full dataframe. Unfortunately, .rolling() only acts on a single column / row at a time. This is in contrast to .groupby(), which allows access to the full dataframe produced by each group. To get around this, I use a pattern proposed in this answer, where we use the index of the series (the series is the rolling view on a single column of the dataframe) to subset the full dataframe appropriately. We then check the conditions on that subset, with all columns available, and write the results back to the full dataframe when the conditions match. This may not be particularly memory efficient, as pointed out in the comments to the answer linked above.
Final part: run the function
Here we set up the df_full dataframe, and then use rolling and apply the function to generate the output:
df_out = df.copy()
df_out[['Future_Close','Future_High','Future_Date']] = np.nan
_ = df.rolling(indexer, min_periods=600).apply(lambda x: check_conditions(x, df_out))
df_out
In df_out, the extra columns will be NaN if no day in the next 600 days meets the criteria. If a day does meet the criteria, the close and high prices on that day, as well as the date, are attached.
Output looks like:
Hope it helps.
Using a regular for loop to iterate over a dataframe is slow; instead you should use pandas' built-in dataframe methods, because they are much faster.
Series has a built-in method diff() that goes over the series and compares every value to the previous one. So you could build two new Series that each carry information about how the value changed relative to the previous row:
# You need numpy for this, if already imported, ignore this line
import numpy as np
# Make series for close and high
close = df["Close"].diff()
high = df["High"].diff()
I think you want a lower "high" value and a higher "close" value, but this logic does not generate the desired output.
# Then create new columns
df['LC'] = np.where((close > 0) & (high < 0), np.NaN, df["Close"])
df['HC'] = np.where((close > 0) & (high < 0), np.NaN, df["High"])
df['HH'] = np.where((close > 0) & (high < 0), np.NaN, df["Close"])
df['LH'] = np.where((close > 0) & (high < 0), np.NaN, df["High"])
If you provide more information about what LC, HC, HH, LH are supposed to be, or provide more examples, I can help you get the correct results.

How to use 2 methods of filling NA in 1 column in Python

I have a data frame with 1 column.
- There are many NA values at the beginning and at the end that I would like to eliminate completely.
- At the same time, there are some NA values between 2 available values that I would like to fill with the mean of the 2 closest available values.
For illustration, I have attached an image here.
I cannot think of any solution. I just wonder if anyone can please help me with that.
Thank you for your help.
Try this; I have reproduced an example using random numbers:
import pandas as pd
import numpy as np

random_index = np.random.randint(0, 100, size=(5, 1))
random_range = np.arange(10, 15)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), columns=list('A'))
df.loc[10:15, 'A'] = "#N/A"
for c in random_index:
    df.loc[c, "A"] = "#N/A"

# replacing starts from here
df[df == "#N/A"] = np.nan
index = list(np.where(df['A'].isna()))[0]
drops = []
for i in index:
    if pd.isnull(df.loc[(i - 1), "A"]) is False and pd.isnull(df.loc[(i + 1), "A"]) is False:
        df.loc[i, "A"] = (df.loc[(i - 1), "A"] + df.loc[(i + 1), "A"]) / 2
    else:
        drops.append(i)
df = df.drop(df.index[drops]).reset_index(drop=True)
First, if each N/A is in string format, replace it with np.nan. The most straightforward way is to use isna on the given column, then extract the True indices (for example by applying the result to a np.arange array). From there you can either use a for loop over the indices to check whether they are sequential or not, or calculate the distance between consecutive indices to find the ones not equal to 1.
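If you prefer to stay inside pandas, here is a shorter sketch of the same idea (assuming the column is named A, as above): interpolate only the interior gaps, then drop whatever is left at the edges. For a single missing value between two known ones, linear interpolation is exactly the mean of the two neighbours.
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, 3.0, 4.0, np.nan], name='A')

# fill only NaNs that sit between valid values, leave the edges alone
filled = s.interpolate(method='linear', limit_area='inside')

# drop the remaining (leading/trailing) NaNs
cleaned = filled.dropna().reset_index(drop=True)
print(cleaned)  # 1.0, 2.0, 3.0, 4.0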

How can I set a new value for a cell as a float in pandas dataframe (Python) - The DataFrame rounds to integer when in nested for loop

FOUND SOLUTION: I needed to change the datatype of the DataFrame columns. Instead of initialising them as integers:
for p in periods:
    df['Probability{}'.format(p)] = 0
I now initialise them as floats:
for p in periods:
    df['Probability{}'.format(p)] = float(0)
Alternatively, do as in the approved answer below.
I am assigning new float values to cells, but they are stored as integers and I don't get why.
It is a part of a data mining project, which contains nested loops.
I am using Python 3.
I tried different ways of writing into a cell with pandas:
df.at[index, col] = float(val)
df.set_value(index, col, float(val))
df[col][index] = float(val)
but none of them delivered a solution. The output I got was:
In: print(df[index][col])
Out: 0
In: print(val)
Out: 0.4774410939826658
Here is a simplified version of the loop
periods = [7, 30, 90, 180]
for p in periods:
    df['Probability{}'.format(p)] = 0
for i in range(len(df.index)):
    for p in periods:
        if i >= p - 1:
            # Getting relevant data and computing value
            vals = [df['Close'][j] for j in range(i - p, i)]
            probability = len([j for j in vals if j > 0]) / len(vals)
            # Asserting value to cell in pd.dataframe
            df.at[df.index[i], 'Probability{}'.format(p)] = float(probability)
I don't get why pandas.DataFrame changes the float to an integer and rounds it up or down. When I assigned values to cells directly in the console I did not experience any problems.
Are there any workarounds or solutions to this problem?
I had no problem before nesting a for loop over periods to avoid hard-coding a lot of trivial code.
NB: It also seems that if I scale the value, e.g. with new_val = 100 * val, only the already-rounded number gets scaled. So 100 * val gives new_val = 0, because the number was rounded down to 0 first.
I also tried to change datatype for the dataframe:
df = df.apply(pd.to_numeric)
All the best.
Seems like a problem with incorrect data types in your dataframe. Your last attempt at converting the whole df was probably very close. Try and use
df['Close'] = pd.to_numeric(df['Close'], downcast="float")
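For context, a small sketch of why the column dtype matters (the exact casting behaviour depends on the pandas version; older releases silently truncated a float assigned into an int64 cell, newer ones warn or upcast):
import pandas as pd

df = pd.DataFrame({'Close': [1, -2, 3]})

df['P_int'] = 0        # creates an int64 column
df['P_float'] = 0.0    # creates a float64 column
print(df.dtypes)

# assigning a float into the float64 column keeps the decimals;
# assigning into an int64 column is where the truncation to 0 came from
df.at[0, 'P_float'] = 0.4774410939826658
print(df)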

sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I have run
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
I tried using
mat[np.isfinite(mat) == True] = 0
to remove the infinite values but this did not work either.
What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?
I am using anaconda and python 2.7.9.
This might happen inside scikit, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one which depends e.g. on your matrix being positive definite and not fulfilling that criteria.
EDIT: How could I miss that:
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
is obviously wrong. Right would be:
np.any(np.isnan(mat))
and
np.all(np.isfinite(mat))
You want to check whether any of the elements are NaN, and not whether the return value of the any function is a number...
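A tiny sketch of the difference, using a made-up matrix:
import numpy as np

mat = np.array([[1.0, np.nan], [3.0, 4.0]])

# wrong: mat.any() collapses the matrix to a single boolean first,
# so np.isnan() is applied to True/False, never to the data itself
print(np.isnan(mat.any()))       # False, even though a NaN is present

# right: test every element, then aggregate
print(np.any(np.isnan(mat)))     # True
print(np.all(np.isfinite(mat)))  # False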
I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:
df = df.reset_index()
I encountered this issue many times when I removed some entries in my df, such as
df = df[df.label=='desired_one']
This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):
import pandas as pd
import numpy as np

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)
In most cases getting rid of infinite and null values solves this problem.
Get rid of infinite values:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
Get rid of null values however you like: a specific value such as 999, the mean, or your own function to impute missing values:
df.fillna(999, inplace=True)
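For example, per-column mean imputation instead of a fixed sentinel (a sketch, assuming purely numeric columns):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 4.0]})

# replace remaining NaNs with each column's mean instead of a fixed value
df = df.fillna(df.mean())
print(df)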
This is the check on which it fails:
https://github.com/scikit-learn/scikit-learn/blob/0.17.X/sklearn/utils/validation.py#L51
Which says
def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)
So make sure that you have no NaN values in your input, that all of those values are actually float values, and that none of the values is Inf either.
The dimensions of my input array were skewed, as my input CSV had empty spaces.
With this version of python 3:
/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)
Looking at the details of the error, I found the lines of code causing the failure:
/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
56 and not np.isfinite(X).all()):
57 raise ValueError("Input contains NaN, infinity"
---> 58 " or a value too large for %r." % X.dtype)
59
60
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
From this, I was able to extract the correct way to test what was going on with my data, using the same test that fails according to the error message: np.isfinite(X)
Then with a quick and dirty loop, I was able to find that my data indeed contains nans:
print(p[:, 0].shape)
index = 0
for i in p[:, 0]:
    if not np.isfinite(i):
        print(index, i)
    index += 1
(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...
Now all I have to do is remove the values at these indexes.
None of the answers here worked for me. This was what worked.
Test_y = np.nan_to_num(Test_y)
It replaces infinite values with very large finite numbers and NaN values with zero (by default).
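If you want control over the replacement values, recent NumPy versions let you pass them explicitly (a small sketch):
import numpy as np

x = np.array([1.0, np.nan, np.inf, -np.inf])

# pick the replacement values explicitly instead of the defaults
cleaned = np.nan_to_num(x, nan=0.0, posinf=1e9, neginf=-1e9)
print(cleaned)  # [ 1.e+00  0.e+00  1.e+09 -1.e+09]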
I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:
X = X.values.astype(float)
y = y.values.astype(float)
Edit: the originally suggested X.as_matrix() is deprecated.
Problem seems to occur in DecisionTreeClassifier input check, Try
X_train = X_train.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)
I had the error after trying to select a subset of rows:
df = df.reindex(index=my_index)
Turns out that my_index contained values that were not contained in df.index, so the reindex function inserted some new rows and filled them with nan.
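One way to guard against that (a sketch): only reindex with labels that actually exist in the frame, e.g. via Index.intersection.
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
my_index = pd.Index(['a', 'c', 'z'])  # 'z' is not in df.index

# keep only labels that exist, so reindex cannot introduce NaN rows
df = df.reindex(index=df.index.intersection(my_index))
print(df)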
Remove all infinite values:
(and replace with min or max for that column)
import numpy as np
# generate example matrix
matrix = np.random.rand(5,5)
matrix[0,:] = np.inf
matrix[2,:] = -np.inf
>>> matrix
array([[ inf, inf, inf, inf, inf],
[0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
[ -inf, -inf, -inf, -inf, -inf],
[0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
[0.90272002, 0.37357483, 0.92952479, 0.072105 , 0.20837798]])
# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]
# go through matrix one column at a time and replace + and -infinity
# with the max or min for that column
for i in range(matrix.shape[1]):
    matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
    matrix[:, i][matrix[:, i] == np.inf] = maxs[i]
>>> matrix
array([[0.90272002, 0.37357483, 0.95222639, 0.37570528, 0.68779902],
[0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
[0.72877665, 0.06580068, 0.7427659 , 0.00833664, 0.20837798],
[0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
[0.90272002, 0.37357483, 0.92952479, 0.072105 , 0.20837798]])
I found that after calling pct_change on a new column, NaN existed in one of the rows. I removed the NaN rows with the following code:
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()
df = df.reset_index()
I got the same error. It worked with df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.
I would like to propose a solution for numpy that worked well for me. The line
from numpy import inf
inputArray[inputArray == inf] = np.finfo(np.float64).max
substitutes all infinite values in a numpy array with the maximum float64 number.
Phew! In my case the problem was about NaN values.
You can list the columns that have NaN with this function:
your_data.isnull().sum()
and then you can fill these NaN values in your dataset.
Here is the code for how to "Replace NaN with zero and infinity with large finite numbers."
your_data[:] = np.nan_to_num(your_data)
(see the numpy.nan_to_num documentation)
In my case the problem was that many scikit functions return numpy arrays, which are devoid of pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then I tried to mix them with the original data.
dataset = dataset.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
This worked for me
I had the same issue, in my case the answer was simply that I had a cell in my CSV with no value ("x,y,z,,"). Putting a default value in fixed it for me.
Using isneginf may help.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.isneginf.html#numpy.isneginf
x[numpy.isneginf(x)] = 0 #0 is the value you want to replace with
Note: This solution only applies if you consciously want to keep NaN entries in your dataset.
This error happened to me when I was using some of the scikit-learn functionality (in my case: GridSearchCV). Under the hood I was using an xgboost XGBClassifier, which handles NaN data gracefully. However, GridSearchCV was using the sklearn.utils.validation module, which enforced the absence of missing data in the input by calling the _assert_all_finite function. This was ultimately causing the error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
Sidenote: _assert_all_finite accepts an allow_nan argument which, if set to True, would avoid the issue. However, the scikit-learn API does not give us control over this argument.
Solution
My solution was to use mock.patch to silence the _assert_all_finite function so that it does not raise a ValueError. Here is a snippet:
from unittest import mock
import sklearn

with mock.patch("sklearn.utils.validation._assert_all_finite"):
    ...  # your code that raises ValueError
This will replace _assert_all_finite with a dummy mock function so it won't get executed.
Please note that patching is not a recommended practice and might result in unpredictable behaviour!
EDIT:
This Pull Request should resolve the issue (though the fix has not been released as of Jan 2022)
If you're running an estimator, it could be that your learning rate is too high. I passed in the wrong array to a grid search by accident and ended up training with a learning rate of 500, which I could see causing issues with the training process.
Basically it's not necessarily only your inputs that have to all be valid, but the intermediate data as well.
After a long time of dealing with this problem, I realized that it happens because, in the splits into training and testing sets, there are columns whose values are the same for all rows. Some calculations in some algorithms may then lead to infinite results. If your data is such that nearby rows are more likely to be similar, then shuffling the data can help. This is a bug with scikit-learn; I'm using version 0.23.2.
If you happen to use the "kc_house_data.csv" dataset (which some commenters and many data-science newcomers seem to use, because it's presented in lots of popular course material), the data is faulty and the true source for the error.
To fix it, as of 2022:
Delete the last (empty) line in the csv file
There are two lines that contain one empty data value "x,x,,x,x" - to fix it, don't delete the comma, instead add a random integer value like 2000, so it looks like this "x,x,2000,x,x"
Don't forget to save and reload in your project.
All the other answers are helpful and correct, but not in this case:
If you use kc_house_data.csv you need to fix the data in the file, nothing else will help, the empty data field will shift the other data around randomly and generate weird bugs that are hard to track to the source!
In my case the algorithm required data to be between (0, 1), non-inclusive. My quite brutal solution was to add a small random number to all desired values:
y_train = pd.DataFrame(y_train).applymap(lambda x: x + np.random.rand()/100000.0)["col_name"]
y_train[y_train >= 1] = 0.999999
given that y_train is in the range [0, 1].
This is definitely not suitable for all cases, as you are messing with your input data, but it can be a solution if you have sparse data and only need a quick forecast.
Try
mat.sum()
If the sum of your data is infinity (greater than the maximum float value, which is about 3.4e+38 for float32 and 1.8e+308 for float64), you will get that error.
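A quick sketch of that kind of overflow, purely as an illustration: two perfectly finite values can still sum to infinity in float64.
import numpy as np

mat = np.array([1e308, 1e308])  # both values are finite on their own
print(np.isfinite(mat).all())   # True
print(mat.sum())                # inf -- the sum overflows float64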
see the _assert_all_finite function in validation.py from the scikit source code:
if is_float and np.isfinite(X.sum()):
    pass
elif is_float:
    msg_err = "Input contains {} or a value too large for {!r}."
    if (allow_nan and np.isinf(X).any() or
            not allow_nan and not np.isfinite(X).all()):
        type_err = 'infinity' if allow_nan else 'NaN, infinity'
        # print(X.sum())
        raise ValueError(msg_err.format(type_err, X.dtype))

Transition Matrix in Dataframe Not Passing the Value

I am trying to implement transition matrix.
Both the data and the transition matrix are in pandas DataFrames.
states_mat = pd.DataFrame(None, index=range(0, 24), columns=range(0, 24))

def states_update(data):
    states_vec = data['hr']
    # Do nothing if there is no sequence
    if len(states_vec) < 2:
        return
    for i in xrange(1, len(states_vec)):
        prev = states_vec[i - 1]
        curr = states_vec[i]
        states_mat[curr][prev] += 1
The data is of type int64.
It is not updating the +1 count as I wanted. I believe it is some kind of type issue, but I'm not sure how to force the type. I am using a DataFrame for my data because I want to use the groupby function to split the data and apply the above function. Any suggestions?
OK, so the first problem, and the one that resolves your issue, is that you created your states_mat dataframe with a default value of None, which becomes numpy.NaN.
You cannot add an integer to a NaN:
In [24]:
NaN + 1
Out[24]:
nan
So change the DataFrame construction to:
states_mat = pd.DataFrame(0, index=range(0,24), columns=range(0,24))
Chained subindexing happens to work in this case, but you could also have used loc, which works as well:
states_mat.loc[curr, prev] += 1
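Putting both fixes together, a minimal sketch (the hr sequence below is made up):
import pandas as pd

# a 24x24 matrix of integer counts, initialised to 0 rather than None/NaN
states_mat = pd.DataFrame(0, index=range(24), columns=range(24))

states_vec = [3, 5, 5, 7, 3]  # hypothetical 'hr' sequence
for i in range(1, len(states_vec)):
    prev, curr = states_vec[i - 1], states_vec[i]
    states_mat.loc[curr, prev] += 1

print(states_mat.loc[5, 3])  # 1 transition recorded from state 3 to state 5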
