Pandas dataframe update keys - python

I can't update a pandas DataFrame using DataFrame.update(); the call always returns None.
The DataFrame has keys because it is the result of joining 2 DataFrames.
I calculate the z-score (z1) for the float32 columns only, and then update the DataFrame with the new values for those columns.
class MySimpleScaler(object):
    def __init__(self):
        self._means = None
        self._stds = None

    def preprocess(self, data):
        """Calculate z-score for dataframe"""
        if self._means is None:  # During training only
            self._means = data.select_dtypes('float32').mean()
        if self._stds is None:  # During training only
            self._stds = data.select_dtypes('float32').std()
            if not self._stds.all():
                raise ValueError('At least one column has standard deviation of 0.')
        z1 = (data.select_dtypes('float32') - self._means) / self._stds
        return data.update(z1)

all_x = pd.concat([train_x, eval_x], keys=['train', 'eval'])
scaler = MySimpleScaler()
all_x = scaler.preprocess(all_x)
train_x, eval_x = all_x.xs('train'), all_x.xs('eval')
When I run data.update(z1), it always returns None.
I need to reuse the scaler object later to calculate the z-score for new DataFrames.

Just as adding to a set modifies the set in place and returns None, update() modifies the DataFrame in place. The DataFrame will be updated, but the value returned by the call will be None.

DataFrame update is an in-place operation. It will always return None, but the dataframe will be modified.
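
A minimal sketch of how the scaler from the question could be adjusted so that preprocess() returns the frame instead of the None that update() gives back. The sample data and the decision to work on a copy are assumptions made for illustration; they are not part of the original question.

```python
import numpy as np
import pandas as pd


class MySimpleScaler:
    def __init__(self):
        self._means = None
        self._stds = None

    def preprocess(self, data):
        """Return a z-scored copy of data (float32 columns only)."""
        floats = data.select_dtypes('float32')
        if self._means is None:  # first call: remember the training statistics
            self._means = floats.mean()
        if self._stds is None:
            self._stds = floats.std()
            if not self._stds.all():
                raise ValueError('At least one column has standard deviation of 0.')
        z1 = (floats - self._means) / self._stds
        data = data.copy()  # assumption: leave the caller's frame untouched
        data.update(z1)     # modifies data in place and returns None
        return data         # so return the frame itself


# Hypothetical sample data: one float32 column and one integer column.
train_x = pd.DataFrame({'a': np.array([1, 2], dtype='float32'), 'b': [10, 20]})
eval_x = pd.DataFrame({'a': np.array([3, 4], dtype='float32'), 'b': [30, 40]})

all_x = pd.concat([train_x, eval_x], keys=['train', 'eval'])
scaler = MySimpleScaler()
all_x = scaler.preprocess(all_x)
train_x, eval_x = all_x.xs('train'), all_x.xs('eval')
```

Because the means and standard deviations are stored on the first call, the same scaler object can later be applied to new DataFrames with the training statistics.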

Related

How do I solve the "AttributeError: 'Series' object has no attribute '_check_fillna'" error when using the technical-analysis-library-in-python

I am using the following library to fill a Dataframe with modified data (indicator of financial data) https://technical-analysis-library-in-python.readthedocs.io/en/latest/
However, the library has a number of classes that seem to miss certain attributes; or lack the inheritance from another class.
I have created a pandas.Series filled with ones to demonstrate. I call the method aroon_up() from class AroonIndicator with the aforementioned series as input, but I get a "'Series' object has no attribute '_check_fillna'" error. I see that there is no attribute _check_fillna in the AroonIndicator class, but there is in the IndicatorMixin. I have tried to run the Series through the IndicatorMixin class, but it states that this class takes no arguments.
Can someone explain to me what I am doing wrong?
Library
class IndicatorMixin:
    """Util mixin indicator class"""

    _fillna = False

    def _check_fillna(self, series: pd.Series, value: int = 0) -> pd.Series:
        """Check if fillna flag is True.
        Args:
            series(pandas.Series): calculated indicator series.
            value(int): value to fill gaps; if -1 fill values using 'backfill' mode.
        Returns:
            pandas.Series: New feature generated.
        """
        if self._fillna:
            series_output = series.copy(deep=False)
            series_output = series_output.replace([np.inf, -np.inf], np.nan)
            if isinstance(value, int) and value == -1:
                series = series_output.fillna(method="ffill").fillna(method='bfill')
            else:
                series = series_output.fillna(method="ffill").fillna(value)
        return series

    @staticmethod
    def _true_range(
        high: pd.Series, low: pd.Series, prev_close: pd.Series
    ) -> pd.Series:
        tr1 = high - low
        tr2 = (high - prev_close).abs()
        tr3 = (low - prev_close).abs()
        true_range = pd.DataFrame(data={"tr1": tr1, "tr2": tr2, "tr3": tr3}).max(axis=1)
        return true_range


class AroonIndicator(IndicatorMixin):
    """Aroon Indicator
    Identify when trends are likely to change direction.
    Aroon Up = ((N - Days Since N-day High) / N) x 100
    Aroon Down = ((N - Days Since N-day Low) / N) x 100
    Aroon Indicator = Aroon Up - Aroon Down
    https://www.investopedia.com/terms/a/aroon.asp
    Args:
        close(pandas.Series): dataset 'Close' column.
        window(int): n period.
        fillna(bool): if True, fill nan values.
    """

    def __init__(self, close: pd.Series, window: int = 25, fillna: bool = False):
        self._close = close
        self._window = window
        self._fillna = fillna
        # self._check_fillna = checkfillna
        self._run()
        self._check_fillna(IndicatorMixin._check_fillna())

    def _run(self):
        min_periods = 0 if self._fillna else self._window
        rolling_close = self._close.rolling(
            self._window, min_periods=min_periods)
        self._aroon_up = rolling_close.apply(
            lambda x: float(np.argmax(x) + 1) / self._window * 100, raw=True
        )

    def aroon_up(self) -> pd.Series:
        """Aroon Up Channel
        Returns:
            pandas.Series: New feature generated.
        """
        aroon_up_series = self._check_fillna(self._aroon_up, value=0)
        return pd.Series(aroon_up_series, name=f"aroon_up_{self._window}")
My program
# Create an empty DataFrame
table = pd.DataFrame()
# Create a series of ones
list = np.ones((100))
sr = pd.Series(list)
# fill the empty Dataframe with the indicator of the Series
'try 1:'
table['numbers'] = AroonIndicator.aroon_up(sr)
'try 2:'
table['numbers'] = AroonIndicator.aroon_up(IndicatorMixin(sr))
# print the table
print(table)
The Aroon functions return values as pandas Series, whereas you are trying to assign the results to the 'table' variable, which you have initialized as a DataFrame.
Also, when the only parameter a method takes is 'self', you do not pass an argument when you call it; you call it on an instance instead. AroonIndicator.aroon_up(sr) passes the Series in as 'self', which is why the error says that a 'Series' object has no attribute '_check_fillna'.
Lastly, don't shadow built-in names like 'list' with your own variable names.
Try:
import pandas as pd
import numpy as np
list_values = pd.Series(np.ones(100))
sr = AroonIndicator(list_values)
sr = sr.aroon_up()
print(sr)
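
The error message can be reproduced without the library. The sketch below uses hypothetical stand-in classes (FillMixin and Indicator are illustrative names, not part of the technical-analysis library) to show that calling a method on the class passes the Series in as self, while calling it on an instance works.

```python
import pandas as pd


class FillMixin:
    """Hypothetical stand-in for a mixin that provides a helper method."""

    def _check_fillna(self, series):
        return series


class Indicator(FillMixin):
    """Hypothetical stand-in for an indicator class."""

    def __init__(self, close):
        self._close = close

    def value(self):
        return self._check_fillna(self._close)


sr = pd.Series([1.0, 2.0, 3.0])

# Calling the method on the class: sr is bound to self, and a Series
# has no _check_fillna attribute, so an AttributeError is raised.
error = None
try:
    Indicator.value(sr)
except AttributeError as exc:
    error = str(exc)

# Calling the method on an instance: self is the Indicator object.
result = Indicator(sr).value()
```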

Python: How to parse variables from several pandas dataframes?

I want to extract the x and y variables from several pandas dataframes (before proceeding to next steps). I initialize the tab-delimited .txt file, before I extract the information.
The error raised is ValueError: too many values to unpack (expected 2).
import pandas as pd
class DataProcessing:
    def __init__(self, data):
        self.df = pd.read_csv(data, sep="\t")
        X, y = self.df.iloc[1:, 1:]
        return X, y

dp_linear_cna = DataProcessing("data_linear_cna.txt")
dp_mrna_seq_v2_rsem = DataProcessing("data_mrna_seq_v2_rsem.txt")
dp_linear_cna.extract_info()
dp_mrna_seq_v2_rsem.extract_info()
Traceback:
ValueError: too many values to unpack (expected 2)
The sep="/t" is supposed to be sep="\t".
Never iterate over rows/columns, select data using index.
e.g. selecting a column: df['some_column_name']
Your coding style is quite bad. First of all, don't return anything in __init__. It's a constructor and must return None; returning anything else raises a TypeError. Make another function instead.
class DataProcessing:
    def __init__(self, data):
        self.df = pd.read_csv(data, sep="\t")

    def split_data(self):
        X = self.df.iloc[:, :-1]
        y = self.df.iloc[:, -1]
        return X, y

Calling your DataProcessing like this:
def main():
    dp = DataProcessing('data_linear_cna.txt')
    X, y = dp.split_data()
    print(X)
    print()
    print(y)
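The main point here is selection by position via df.iloc[rows, columns].
X, y = self.df.iloc[1:, 1:]
This is not a valid statement. pandas.DataFrame.iloc returns another pandas.DataFrame, not a tuple, so you can't unpack it into X and y. Unpacking a DataFrame iterates over its column labels, so any frame with more than two columns raises "too many values to unpack".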
Indexing both axes
You can mix the indexer types for the index and columns. Use : to select the entire axis.
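
As a small illustration of the points above, the sketch below reproduces the unpacking error and then splits features from target by position. The column names and values are made up for the example, and treating the last column as the target is an assumption carried over from the answer's split_data method.

```python
import pandas as pd

# Hypothetical frame: two feature columns and one target column.
df = pd.DataFrame({'f1': [1, 2, 3], 'f2': [4, 5, 6], 'target': [0, 1, 0]})

# Unpacking the result of iloc iterates over the three column labels,
# so assigning it to two names fails.
message = None
try:
    X, y = df.iloc[1:, 0:]
except ValueError as exc:
    message = str(exc)

# Select by position instead: all rows, every column but the last for X,
# and the last column for y.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
```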

Python, pandas exclude outliers function

I tried to exclude a few outliers from a pandas DataFrame, but the function just returns the same table without any difference. I can't figure out why.
# excluding outliers
def exclude_outliers(DataFrame, col_name):
    interval = 2.5*DataFrame[col_name].std()
    mean = DataFrame[col_name].mean()
    m_i = mean + interval
    DataFrame = DataFrame[DataFrame[col_name] < m_i]

outlier_column = ['util_linhas_inseguras', 'idade', 'vezes_passou_de_30_59_dias', 'razao_debito', 'salario_mensal', 'numero_linhas_crdto_aberto',
                  'numero_vezes_passou_90_dias', 'numero_emprestimos_imobiliarios', 'numero_de_vezes_que_passou_60_89_dias', 'numero_de_dependentes']
for col in outlier_column:
    exclude_outliers(df_train, col)
df_train.describe()
As written, your function doesn't return anything, so your for loop is not making any changes to the DataFrame. The assignment inside the function only rebinds the local name DataFrame; df_train itself is never modified.
At the end of your function, add the following line:
def exclude_outliers(DataFrame, col_name):
    ...  # Function filters the DataFrame
    # Add this line to return the filtered DataFrame
    return DataFrame
And then modify your for loop to update df_train:
for col in outlier_column:
    # Now we update the DataFrame on each iteration
    df_train = exclude_outliers(df_train, col)
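
Put together, the fix might look like the sketch below. The two columns reuse names from the question, but the values are invented for illustration, and the n_std parameter is an addition that defaults to the 2.5 used in the question.

```python
import pandas as pd


def exclude_outliers(frame, col_name, n_std=2.5):
    """Return the rows whose value in col_name is below mean + n_std * std."""
    upper = frame[col_name].mean() + n_std * frame[col_name].std()
    return frame[frame[col_name] < upper]


# Hypothetical data: 'idade' contains one obvious outlier (400).
df_train = pd.DataFrame({
    'idade': [30, 32, 35, 31, 29, 33, 34, 30, 31, 400],
    'salario_mensal': [1000, 1100, 1050, 990, 1020, 1010, 1080, 1030, 1060, 1040],
})

# Without reassignment the caller's frame is unchanged.
exclude_outliers(df_train, 'idade')
rows_without_reassignment = len(df_train)

# Reassigning on each iteration keeps the filtered result.
for col in ['idade', 'salario_mensal']:
    df_train = exclude_outliers(df_train, col)
```

Note that each column is filtered against the mean and standard deviation of the rows that survived the previous columns, so the order of the columns can affect the result.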

Drop bad data from dataset Tensorflow

I have a training pipeline using tf.data. The dataset contains some bad elements, in my case values of 0. How do I drop these bad elements based on their value? I want to be able to remove them within the pipeline while training, since the dataset is large.
Assume from the following pseudo code:
def parse_function(element):
    height = element['height']
    if height <= 0: skip()  # How to skip this value
    labels = element['label']
    features['height'] = height
    return features, labels

ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.map(parse_function)
A suggestion would be using ds.skip(1) based on the feature value, or provide some sort of neutral weight/loss?
You can use tf.data.Dataset.filter:
def filter_func(elem):
    """ return True if the element is to be kept """
    return tf.math.greater(elem['height'], 0)

ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.filter(filter_func)
Assuming that element is a data frame in your code, then it would be:
def parse_function(element):
    element = element.query('height>0')
    labels = element['label']
    features['height'] = element['height']
    return features, labels

ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.map(parse_function)

label-encoder encoding a dataframe without encoding NaN missing values

I have a dataframe that contains Numerical, categorical and NaN values.
  customer_class  B    C
0            OM1  1  2.0
1            NaN  6  1.0
2            OM1  9  NaN
....
I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards.
I would like to use the following code to encode my dataframe while keeping the NaN values.
Here is the code:
class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            #Store an ndarray of the current column
            b = x[el].get_values()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed D

col = data1['customer_class']
LabelEncoderByCol(col)
LabelEncoderByCol.fit(x=col,y=None)
But I got this error :
846 if mask.any():
--> 847 raise ValueError('%s not contained in the index' % str(key[mask]))
848 self._set_values(indexer, value)
849
ValueError: ['OM1' 'OM1' 'OM1' ... 'other' 'EU' 'EUB'] not contained in the index
Any idea please to resolve this error?
thanks
Two things jumped out to me when I tried to reproduce:
Your code seems to expect a dataframe will be passed to your class. But in your example you passed a series. I fixed this by wrapping the series as a dataframe before passing it to your class: col = pd.DataFrame(data1['customer_class']).
In your class' __init__ method it seemed like you had intended to iterate through a list of column names, but instead were actually iterating through all of your columns, series by series. I fixed this by changing the appropriate line to: self.col = col.columns.values.
Below, I've pasted in my modifications to your class' __init__ and fit methods (my only modification to the transform method was to have it return the modified dataframe):
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1,6,9],
                      'C': [2.0, 1.0, np.nan]})

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col.columns.values
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x = x.fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only transform the values that are not 'NaN'
            a = x[el][x[el]!='NaN']
            #Store an ndarray copy of the current column
            #(get_values() was removed in pandas 1.0; to_numpy() replaces it)
            b = x[el].to_numpy(copy=True)
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        return x
I am able to run the following lines (also slightly modified from your initial implementation) with no error:
col = pd.DataFrame(data1['customer_class'])
lenc = LabelEncoderByCol(col)
lenc.fit(x=col,y=None)
I can then access the classes for the customer_class column from your example:
lenc.fit(x=col,y=None).le_dic['customer_class'].classes_
Which outputs:
array(['OM1'], dtype=object)
Finally, I can transform the column using your class' transform method:
lenc.transform(x=col,y=None)
Which outputs the following:
  customer_class
0              0
1            NaN
2              0
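
For comparison, here is a sketch of a different technique that also leaves missing values untouched: pandas categorical codes instead of scikit-learn's LabelEncoder. This is an alternative, not the approach used in the answer above, and the sample values are made up. Categories created by astype('category') are sorted, so the integer codes follow the sorted order of the labels.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
s = pd.Series(['OM1', np.nan, 'EU', 'OM1'], name='customer_class')

cat = s.astype('category')

# cat.codes uses -1 for missing values; convert to float and put NaN back
# wherever the original value was missing.
encoded = cat.cat.codes.astype('float64').where(s.notna())

# The mapping from code to label, in code order.
classes = list(cat.cat.categories)
```

The encoded column can then be passed to an imputer, since the missing entries are still NaN rather than an integer code.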
