I do not understand why "u" has NaN values. What am I doing wrong here?
>>> z=pd.DataFrame([['abcb','asasa'],['sdsd','aeio']])
>>> z
0 1
0 abcb asasa
1 sdsd aeio
>>> u=pd.DataFrame(z,columns=['hello','ajsajs'])
>>> u
hello ajsajs
0 NaN NaN
1 NaN NaN
Alternate construction calls
You can use the underlying NumPy array:
u = pd.DataFrame(z.values, columns=['hello','ajsajs'])
hello ajsajs
0 abcb asasa
1 sdsd aeio
Alternately, you could use:
u = z.rename(columns={0: 'hello',1: 'ajsajs'})
And lastly, as suggested by @Dark:
u = z.set_axis(['hello','ajsajs'], axis=1, inplace=False)
A small note on inplace in set_axis -
WARNING: inplace=None currently falls back to True, but in a future
version, will default to False. Use inplace=True explicitly rather
than relying on the default.
In pandas 0.20.3 the syntax would be just:
u = z.set_axis(axis=1, labels=['hello','ajsajs'])
@Dark's solution appears fastest here.
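If you want to verify that claim on your own machine, a rough IPython comparison could look like this (illustrative only; exact timings depend on your pandas version and data):
%timeit pd.DataFrame(z.values, columns=['hello', 'ajsajs'])
%timeit z.rename(columns={0: 'hello', 1: 'ajsajs'})
%timeit z.set_axis(['hello', 'ajsajs'], axis=1, inplace=False)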
Why the current method doesn't work
I believe the issue here is that there's a .reindex being called when the DataFrame is constructed in this way. Here's some source code where ellipses denote irrelevant stuff I'm leaving out:
from pandas.core.internals import BlockManager

# pandas.core.frame.DataFrame
class DataFrame(NDFrame):
    def __init__(self, data=None, index=None, columns=None, dtype=None,
                 copy=False):
        # ...
        if isinstance(data, DataFrame):
            data = data._data

        if isinstance(data, BlockManager):
            mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
                                 dtype=dtype, copy=copy)
        # ... a bunch of other if statements irrelevant to your case
        NDFrame.__init__(self, mgr, fastpath=True)
        # ...
What's happening here:
DataFrame inherits from a more generic base class which in turn has multiple inheritance. (Pandas is great, but its source can be like trying to backtrack through a spider's web.)
In u = pd.DataFrame(z, columns=['hello','ajsajs']), z is a DataFrame. Therefore, the first if statement above is True and data = data._data. What's _data? It's a BlockManager.* (To be continued below...)
Because we just converted what you passed to its BlockManager, the next if statement also evaluates to True. Then mgr gets assigned to the result of the _init_mgr method, and the parent class's __init__ gets called, passing mgr.
* confirm with isinstance(z._data, BlockManager).
Now on to part 2...
# pandas.core.generic.NDFrame
class NDFrame(PandasObject, SelectionMixin):
    def __init__(self, data, axes=None, copy=False, dtype=None,
                 fastpath=False):
        # ...

    def _init_mgr(self, mgr, axes=None, dtype=None, copy=False):
        """ passed a manager and a axes dict """
        for a, axe in axes.items():
            if axe is not None:
                mgr = mgr.reindex_axis(axe,
                                       axis=self._get_block_manager_axis(a),
                                       copy=False)
        # ...
        return mgr
Here is where _init_mgr is defined, which gets called above. Essentially in your case you have:
columns=['hello','ajsajs']
axes=dict(index=None, columns=columns)
# ...
When you reindex an axis and specify a new axis where none of the new labels are present in the old object, you get all NaNs. This seems like a deliberate design decision. Consider this related example to prove the point, where one new column is present in the old object and one is not:
pd.DataFrame(z, columns=[0, 'ajsajs'])
0 ajsajs
0 abcb NaN
1 sdsd NaN
Related
I want to make a function that does an ETL on a list/Series and returns a Series with additional attributes and methods specific to that function. I can achieve this by creating a class that extends Series, and it works, but when I reassign the function's output back to the DataFrame, the added class attributes and methods are stripped away. How can I extend Series with custom attributes and methods that survive reassignment back to the DataFrame?
Custom function that does ETL, returns a Series with an extended class
import pandas as pd

def normalize_x(x: list, new_attribute: None):
    normalized = pd.Series(['normalized_' + i if i != 4 else None for i in x])
    return NormalizeX(normalized=normalized, original=x, new_attribute=new_attribute)

class NormalizeX(pd.Series):
    def __init__(self, normalized, original, new_attribute, *args, **kwargs):
        super().__init__(data=normalized, *args, **kwargs)
        self.original = original
        self.normalized = normalized
        self.new_attribute = new_attribute

    def conversion_errors(self):
        return [o != n for o, n in zip(pd.isnull(self.original), pd.isnull(self.normalized))]
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ['dog', 'cat', 4]})
Assign to a new object (new attributes and methods work)
out = normalize_x(df.C, new_attribute = 'CoolAttribute')
out
## 0 normalized_dog
## 1 normalized_cat
## 2 None
## dtype: object
## Can still use Series methods
out.to_list()
## ['normalized_dog', 'normalized_cat', None]
## Can use the new methods and access attributes
out.conversion_errors()
## [False, False, True]
out.original
##0 dog
##1 cat
##2 4
##Name: C, dtype: object
Assign to a Pandas DataFrame (new attributes and methods break)
df['new'] = normalize_x(df.C, new_attribute = 'CoolAttribute')
df['new']
## 0 normalized_dog
## 1 normalized_cat
## 2 None
## dtype: object
## Can't use the new methods or access attributes
df['new'].conversion_errors()
## AttributeError: 'Series' object has no attribute 'conversion_errors'
df['new'].original
## AttributeError: 'Series' object has no attribute 'original'
Pandas allows you to extend its classes (Series, DataFrame). In your case the solution is quite verbose, but I think it's the only way to reach your goal.
I'll try to get to the point without analysing complex cases, so the complete implementation of the interface is up to you, but I can give you an idea of what you can use.
I don't understand the utility of new_attribute, so for the moment it is not taken into account. As far as I know, pandas extensions only let you extend a 1-D array. Since you have multiple arrays (both normalized and original), you have to create another data type to get around the problem.
class NormX(object):
    def __init__(self, normalized, original):
        self.normalized = normalized
        self.original = original

    def __repr__(self):
        if self.normalized is None:
            return 'Nan'
        return self.normalized
This allows you to create a simple base object like the following:
norm_obj = NormX('normalized_dog', 'dog')
This object will be the basic block of your custom array. To be able to exploit this kind of class, you have to register a new type in Pandas:
from pandas.api.extensions import ExtensionDtype, register_extension_dtype
import numpy as np

@pd.api.extensions.register_extension_dtype
class NormXType(ExtensionDtype):
    name = 'normX'
    type = NormX
    kind = 'O'
    na_value = np.nan
Now you have all the elements to build a custom array based on the Pandas framework. To do that, you have to extend its interface class named ExtensionArray. Here you can find the abstract methods that must be implemented by subclasses. I'll give you a very basic implementation, but it should be declared in a proper way:
import sys
from pandas.api.extensions import ExtensionArray

class NormalizeX(ExtensionArray):
    def __init__(self, values):
        self.data = values

    def __repr__(self):
        return "NormalizeX({!r})".format([(t.normalized, t.original) for t in self.data])

    def _from_sequence(self):
        pass

    def _from_factorized(self):
        pass

    def __getitem__(self, key):
        return self.data[key]

    # def __setitem__(self, key, value):
    #     self.normalized[key] = value
    #     return self

    def __len__(self):
        return len(self.data)

    def __eq__(self, other):
        return False

    def dtype(self):
        # return self._dtype
        return object

    def nbytes(self):
        return sys.getsizeof(self.data)

    def isna(self):
        return False

    def take(self):
        pass

    def copy(self):
        return type(self)(self.data)

    def _concat_same_type(self):
        pass
Moreover, to define a custom method on that class, you have to define a custom Series accessor, as follows:
@pd.api.extensions.register_series_accessor("normx_accsr")
class NormalizeXAccessor:
    def __init__(self, obj):
        self.normalized = [o.normalized for o in obj]
        self.original = [o.original for o in obj]

    @property
    def conversion_errors(self):
        return [o != n for o, n in zip(pd.isnull(self.original), pd.isnull(self.normalized))]
In this way, the NormalizeX custom array implements all the requested methods to be successfully integrated into both Series and DataFrame. So, your example simply reduces to:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ['dog', 'cat', 4]})
norm_list = [['normalized_'+ i, i] if i != 4 else (None, i) for i in df.C]
normx_objs = [NormX(*t) for t in norm_list]
normalizedX = NormalizeX(normx_objs)
df['new'] = normalizedX
# Use the previously defined attribute_accessor normx_accsr
df['new'].normx_accsr.conversion_errors # [False, False, True]
It's too difficult for me to implement your desired functionality, so I'll only share what I found in my investigation, in the hope it might be useful to other answerers.
The cause of the issue:
The reason you got those attribute errors is that sanitization processes were carried out on the Series you passed to the DataFrame.
A brief check of ids:
You can quickly confirm the difference between what out and df['new'] refer to by the following code:
out = normalize_x(df.C, new_attribute = 'CoolAttribute')
df['new'] = out
print(id(out))
print(id(df['new']))
1861777917792
1861770685504
You can see that out and df['new'] refer to different objects because of this id difference.
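A type check makes the same point (illustrative):
type(out)        # <class '__main__.NormalizeX'>
type(df['new'])  # <class 'pandas.core.series.Series'>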
Let's dive into the pandas source code to see what goes on here.
DataFrame._set_item method:
In the definition of the DataFrame class, the _set_item method is called when you try to add a Series to a DataFrame in a specified column.
def _set_item(self, key, value) -> None:
    """
    Add series to DataFrame in specified column.

    If series is a numpy-array (not a Series/TimeSeries), it must be the
    same length as the DataFrames index or an error will be thrown.

    Series/TimeSeries will be conformed to the DataFrames index to
    ensure homogeneity.
    """
    value = self._sanitize_column(value)
In this method, value = self._sanitize_column(value) is the first line after the docstring. This _sanitize_column method is what actually destroyed your original Series functionality. If you dig deeper into this method, you'll finally reach the following lines:
def _reindex_for_setitem(value: FrameOrSeriesUnion, index: Index) -> ArrayLike:
    # reindex if necessary
    if value.index.equals(index) or not len(index):
        return value._values.copy()
value._values.copy() is the direct cause of the disappearance of the NormalizeX attributes. It just copies the values from the given Series.
Therefore, the _set_item method should be modified in order to protect the NormalizeX attributes.
Conclusion:
You have to override the DataFrame class to set your NormalizeX in a specified column while keeping its attributes.
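If overriding DataFrame is more than you need, a minimal workaround sketch (my own illustration, not part of the answer above; assumes pandas >= 1.0 for DataFrame.attrs) is to put a plain Series in the column and keep the rich NormalizeX object beside the frame:
out = normalize_x(df.C, new_attribute='CoolAttribute')
df['new'] = pd.Series(out)                      # only the values go into the frame
df.attrs['new_normalized'] = out                # the NormalizeX object survives here
df.attrs['new_normalized'].conversion_errors()  # [False, False, True]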
I am cleaning my data for a machine learning project by replacing the missing values with the zeros and the mean for the 'Age' and 'Fare' columns respectively. The code for which is given below:
train_data['Age'] = train_data['Age'].fillna(0)
mean = train_data['Fare'].mean()
train_data['Fare'] = train_data['Fare'].fillna(mean)
Since I would have to do this multiple times for other sets of data, I want to automate this process by creating a generic function that takes the DataFrame as input, performs the operations that modify it, and returns the modified DataFrame. The code for that is given below:
def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df
However when I pass the training data DataFrame:
train_data = data_cleaning(train_data)
I get the following error:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-
docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
3 cross_val_data = data_cleaning(cross_val_data)
/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
2 df['Age'] = df['Age'].fillna(0)
3 fare_mean = df['Fare'].mean()
----> 4 df['Fare'] = df['Fare'].fillna()
5 return df
/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args,
**kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value,
method, axis, inplace, limit, downcast)
4820 inplace=inplace,
4821 limit=limit,
-> 4822 downcast=downcast,
4823 )
4824
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value,
method, axis, inplace, limit, downcast)
6311 """
6312 inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313 value, method = validate_fillna_kwargs(value, method)
6314
6315 self._consolidate_inplace()
/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in
validate_fillna_kwargs(value, method, validate_scalar_dict_value)
368
369 if value is None and method is None:
--> 370 raise ValueError("Must specify a fill 'value' or 'method'.")
371 elif value is None and method is not None:
372 method = clean_fill_method(method)
ValueError: Must specify a fill 'value' or 'method'.
On some research, I found that I would have to use the apply() and map() functions instead, but I am not sure how to pass in the mean value of the column. Furthermore, this does not scale well, as I would have to calculate all the fillna values before passing them into the function, which is cumbersome. Therefore I want to ask: is there a better way to automate data cleaning?
In the line df['Fare'] = df['Fare'].fillna() in your function, you did not pass anything to fill the NaNs with, so it raises an error. You should change it to df['Fare'] = df['Fare'].fillna(fare_mean).
If you intend to make this usable for another file in the same directory, you can just call it in another file by:
from file_that_contain_function import function_name
And if you intend to make it reusable for your workspace/virtual environment, you may need to create your own python package.
So yes, the other answer explains where the error is coming from.
However, the warning at the beginning has nothing to do with filling NaNs. The warning is telling you that you are modifying a slice of a copy of your dataframe. Change your code to
def data_cleaning(df):
    df['Age'] = df.loc[:, 'Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df.loc[:, 'Fare'].fillna(fare_mean)  # <- and also fix this error
    return df
I suggest also searching that specific warning here, as there are hundreds of posts detailing this warning and how to deal with it. Here's a good one.
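For the broader question about automating the cleaning, a generic sketch (my own illustration; fill_spec is a hypothetical parameter) could map each column to a fill strategy:
def data_cleaning(df, fill_spec):
    # fill_spec maps column name -> a literal fill value or the string 'mean'
    df = df.copy()  # work on a copy so slices of other frames are not touched
    for col, how in fill_spec.items():
        value = df[col].mean() if how == 'mean' else how
        df[col] = df[col].fillna(value)
    return df

train_data = data_cleaning(train_data, {'Age': 0, 'Fare': 'mean'})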
I am trying to create a Pandas pipeline that creates dummy variables and appends the columns to the existing dataframe.
Unfortunately I can't get the appended columns to stick when the pipeline is finished.
Example:
def function(df):
    pass

def create_dummy(df):
    a = pd.get_dummies(df['col'])
    b = df.append(a)
    return b

def mah_pipe(df):
    (df.pipe(function)
       .pipe(create_dummy)
       .pipe(print))
    return df

print(mah_pipe(df))
First - I have no idea if this is good practice.
What's weird is that the .pipe(print) prints the dataframe with appended columns. Yay.
But the statement print(mah_pipe(df)) does not. I though they would behave the same way.
I have tried to read the documentation about pd.pipe but I couldn't figure it out.
Hoping someone could help shed some light on what's going on.
This is because print in Python returns None. Since you are not making a copy of df on your pipes, your df dies after print.
pipes in Pandas
In Pandas we expect each pipe to receive a DataFrame and pass one along: (df) -> [pipe1] -> (df_1) -> [pipe2] -> (df_2) -> ... -> [pipeN] -> (df_N). By having print as the last pipe, the output is None.
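A tiny illustration of that point (not from the question's code):
result = df.pipe(lambda d: d.assign(x=1)).pipe(print)  # prints the frame...
print(result)                                          # ...but result is None, because print returns None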
Solution
...
def start_pipe(dataf):
    # make a copy to avoid modifying the original
    dataf = dataf.copy()
    return dataf

def create_dummies(dataf, column_name):
    dummies = pd.get_dummies(dataf[column_name])
    dataf[dummies.columns] = dummies
    return dataf

def print_dataf(dataf, n_rows=5):
    print(dataf.head(n_rows))
    return dataf  # this is important

# usage
...
dt = (df
      .pipe(start_pipe)
      .pipe(create_dummies, column_name='a')
      .pipe(print_dataf, n_rows=10)
      )
def mah_pipe(df):
    df = (df
          .pipe(start_pipe)
          .pipe(create_dummies, column_name='a')
          .pipe(print_dataf, n_rows=10)
          )
    return df

print(mah_pipe(df))
I have a pandas.DatetimeIndex for an interval ['2018-01-01', '2018-01-04') (start included, end excluded) and freq=1D:
>>> index = pd.DatetimeIndex(start='2018-01-01',
...                          end='2018-01-04',
...                          freq='1D',
...                          closed='left')
>>> index
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'],
              dtype='datetime64[ns]', freq='D')
How can I obtain the correct open end='2018-01-04' attribute again? I need it for a DB query with timestamp ranges.
There is no index.end
index[-1] returns '2018-01-03'
index[-1] + index.freq works in this case but is wrong for freq='2D'
There's no way because this information is lost after constructing the object. At creation time, the interval is unfolded into the resulting sequence:
pandas/core/indexes/datetimes.py:

class DatetimeIndex(<...>):
    <...>
    @classmethod
    def _generate(cls, start, end, periods, name, freq,
                  tz=None, normalize=False, ambiguous='raise', closed=None):
        <...>
        index = tools.to_datetime(np.linspace(start.value,
                                              end.value, periods),
                                  utc=True)
        <...>
        if not left_closed and len(index) and index[0] == start:
            index = index[1:]
        if not right_closed and len(index) and index[-1] == end:
            index = index[:-1]

        index = cls._simple_new(index, name=name, freq=freq, tz=tz)
        return index
Nor is the closed information saved anywhere, so you can't even infer it from the first/last point and step.
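A quick demonstration of the information loss (illustrative, using the same pandas 0.23-era constructor as the question): two different end/closed combinations can produce exactly the same index, so the original end cannot be recovered from the data alone:
a = pd.DatetimeIndex(start='2018-01-01', end='2018-01-04', freq='2D', closed='left')
b = pd.DatetimeIndex(start='2018-01-01', end='2018-01-03', freq='2D')
a.equals(b)  # True, yet the open ends '2018-01-04' and '2018-01-03' differ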
You can subclass DatetimeIndex and save this information. Note that it's an immutable type, so you need to override __new__ instead of __init__:
import inspect, collections

class SiDatetimeIndex(pd.DatetimeIndex):
    _Interval = collections.namedtuple('Interval',
                                       ('start', 'end', 'freq', 'closed'))

    # add 'interval' to dir(): DatetimeIndex inherits pandas.core.accessor.DirNamesMixin
    _accessors = pd.DatetimeIndex._accessors | frozenset(('interval',))

    def __new__(cls, *args, **kwargs):
        base_new = super(SiDatetimeIndex, cls).__new__
        callargs = inspect.getcallargs(base_new, cls, *args, **kwargs)
        result = base_new(**callargs)
        result.interval = cls._Interval._make(callargs[a] for a in cls._Interval._fields)
        return result
In [31]: index = SiDatetimeIndex(start='2018-01-01',
...: end='2018-01-04',
...: freq='1D',
...: closed='left')
In [38]: index.interval
Out[38]: Interval(start='2018-01-01', end='2018-01-04', freq='1D', closed='left')
Don't expect though that all the pandas methods (including the inherited ones in your class) will now magically start creating your overridden class.
For that, you'll need to replace live references to the base class in loaded pandas modules that those methods use.
Alternatively, you can replace just the original's __new__ -- then no need to replace references.
Can something like this work for you?
index = pd.DatetimeIndex(start='2018-01-01', end='2018-01-04', freq='1D', closed='left')

def get_end(index, freq):
    if freq == '1D':
        return index.max() + 1

get_end(index, '1D')
You can write logic for 1D/2D/1M. Also, you could make the column name of the date index carry the freq parameter as a suffix/prefix, e.g. 'purchase_date_1D', and parse it if you don't want to pass the freq as a separate input.
sencap.csv is a file that has a lot of columns I don't need, so I want to keep just some of them in order to start filtering the data, analyze it, and make some graphs; in this case, a pie chart that aggregates energy quantities by energy source. Everything works fine except the condition that asks to sum() only those rows which are less than 9.0 MW.
import pandas as pd
import matplotlib.pyplot as plt
aux = pd.read_csv('sencap.csv')
keep_col = ['subsistema','propietario','razon_social', 'estado',
'fecha_servicio_central', 'region_nombre', 'comuna_nombre',
'tipo_final', 'clasificacion', 'tipo_energia', 'potencia_neta_mw',
'ley_ernc', 'medio_generacion', 'distribuidora', 'punto_conexion',
]
c1 = aux['medio_generacion'] == 'PMGD'
c2 = aux['medio_generacion'] == 'PMG'
aux2 = aux[keep_col]
aux3 = aux2[c1 | c2]
for col in ['potencia_neta_mw']:
    aux3[col] = pd.to_numeric(aux3[col].str.replace(',','.'))
c3 = aux3['potencia_neta_mw'] <= 9.0
aux4 = aux3[c3]
df = aux4.groupby(['tipo_final']).sum()
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
aux3[col] = pd.to_numeric(aux3[col].str.replace(',','.'))
aux3[col] = pd.to_numeric(aux3[col].str.replace(',','.'))
This line is the reason you are getting the warning.
Accessing "col" through chained indexing may result in unpredictable behavior, since it may return either a view or a copy of the original data.
It depends on the memory layout of the array, about which pandas makes no guarantees.
The pandas documentation advises users to use .loc instead.
Example:
In: dfmi
Out:
     one          two
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p
dfmi.loc[:,('one','second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
In the second case __getitem__ is unpredictable. It may return a view or a copy of the data, and modifying a view works differently from modifying a copy.
A change made on a copy is not reflected in the original data, whereas a change made on a view is.
Note: the warning is there to alert users that, even if you get the expected output, there is a chance the code causes some unpredictable behavior.
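Applied to the original snippet, one way to avoid the warning (a sketch, following the documentation's advice) is to take an explicit copy when slicing and assign through .loc:
aux3 = aux2[c1 | c2].copy()
for col in ['potencia_neta_mw']:
    aux3.loc[:, col] = pd.to_numeric(aux3[col].str.replace(',', '.'))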