I have a pandas.DatetimeIndex for an interval ['2018-01-01', '2018-01-04') (start included, end excluded) and freq=1D:
>>> index = pd.DatetimeIndex(start='2018-01-01',
...                          end='2018-01-04',
...                          freq='1D',
...                          closed='left')
>>> index
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'],
              dtype='datetime64[ns]', freq='D')
How can I obtain the correct open end='2018-01-04' attribute again? I need it for a DB query with timestamp ranges.
There is no index.end
index[-1] returns '2018-01-03'
index[-1] + index.freq works in this case but is wrong for freq='2D'
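For example, with freq='2D' (sketched here with pd.date_range, which takes the same start/end/freq/closed arguments; newer pandas spells closed as inclusive):

import pandas as pd

idx2 = pd.date_range(start='2018-01-01', end='2018-01-04', freq='2D', closed='left')
print(idx2)                  # DatetimeIndex(['2018-01-01', '2018-01-03'], dtype='datetime64[ns]', freq='2D')
print(idx2[-1] + idx2.freq)  # 2018-01-05 00:00:00, not the original end 2018-01-04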
There's no way because this information is lost after constructing the object. At creation time, the interval is unfolded into the resulting sequence:
pandas/core/indexes/datetimes.py:

class DatetimeIndex(<...>):
    <...>
    @classmethod
    def _generate(cls, start, end, periods, name, freq,
                  tz=None, normalize=False, ambiguous='raise', closed=None):
        <...>
        index = tools.to_datetime(np.linspace(start.value,
                                              end.value, periods),
                                  utc=True)
        <...>
        if not left_closed and len(index) and index[0] == start:
            index = index[1:]
        if not right_closed and len(index) and index[-1] == end:
            index = index[:-1]

        index = cls._simple_new(index, name=name, freq=freq, tz=tz)
        return index
Nor is the closed information saved anywhere, so you can't even infer it from the first/last point and the step.
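For example, two different open intervals can produce exactly the same index (again sketched with pd.date_range, which takes the same arguments):

import pandas as pd

a = pd.date_range(start='2018-01-01', end='2018-01-04', freq='2D', closed='left')
b = pd.date_range(start='2018-01-01', end='2018-01-03', freq='2D')
print(a.equals(b))  # True -- the original end cannot be recovered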
You can subclass DatetimeIndex and save this information. Note that it's an immutable type, so you need to override __new__ instead of __init__:
import inspect, collections

class SiDatetimeIndex(pd.DatetimeIndex):
    _Interval = collections.namedtuple('Interval',
                                       ('start', 'end', 'freq', 'closed'))

    # add 'interval' to dir(): DatetimeIndex inherits pandas.core.accessor.DirNamesMixin
    _accessors = pd.DatetimeIndex._accessors | frozenset(('interval',))

    def __new__(cls, *args, **kwargs):
        base_new = super(SiDatetimeIndex, cls).__new__
        callargs = inspect.getcallargs(base_new, cls, *args, **kwargs)
        result = base_new(**callargs)
        result.interval = cls._Interval._make(callargs[a] for a in cls._Interval._fields)
        return result
In [31]: index = SiDatetimeIndex(start='2018-01-01',
    ...:                         end='2018-01-04',
    ...:                         freq='1D',
    ...:                         closed='left')

In [38]: index.interval
Out[38]: Interval(start='2018-01-01', end='2018-01-04', freq='1D', closed='left')
Don't expect though that all the pandas methods (including the inherited ones in your class) will now magically start creating your overridden class.
For that, you'll need to replace live references to the base class in loaded pandas modules that those methods use.
Alternatively, you can replace just the original's __new__ -- then no need to replace references.
Can something like this work for you?
index = pd.DatetimeIndex(start='2018-01-01', end='2018-01-04', freq='1D', closed='left')

def get_end(index, freq):
    if freq == '1D':
        return index.max() + pd.Timedelta(freq)

get_end(index, '1D')
You can write similar logic for 1D/2D/1M. Also, you could make the column name of the date index carry the freq parameter as a suffix/prefix, e.g. 'purchase_date_1D', and parse it back if you don't want to pass the freq as a separate input.
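A minimal sketch of that column-name idea (the name purchase_date_1D is just an example):

col_name = 'purchase_date_1D'          # freq encoded as a suffix
base_name, freq = col_name.rsplit('_', 1)
print(base_name, freq)                 # purchase_date 1D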
Related
I want to make a function that does an ETL on a list/Series and returns a Series with additional attributes and methods specific to that function. I can achieve this by creating a class that extends Series, and it works, but when I reassign the output from the function back to the dataframe, the added class attributes and methods are stripped away. How can I extend the Series with custom attributes and methods that survive reassignment back to the dataframe?
Custom function that does ETL, returns a Series with an extended class
import pandas as pd

def normalize_x(x: list, new_attribute=None):
    normalized = pd.Series(['normalized_' + i if i != 4 else None for i in x])
    return NormalizeX(normalized=normalized, original=x, new_attribute=new_attribute)

class NormalizeX(pd.Series):
    def __init__(self, normalized, original, new_attribute, *args, **kwargs):
        super().__init__(data=normalized, *args, **kwargs)
        self.original = original
        self.normalized = normalized
        self.new_attribute = new_attribute

    def conversion_errors(self):
        return [o != n for o, n in zip(pd.isnull(self.original), pd.isnull(self.normalized))]
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ['dog', 'cat', 4]})
Assign to a new object (new attributes and methods work)
out = normalize_x(df.C, new_attribute = 'CoolAttribute')
out
## 0 normalized_dog
## 1 normalized_cat
## 2 None
## dtype: object
## Can still use Series methods
out.to_list()
## ['normalized_dog', 'normalized_cat', None]
## Can use the new methods and access attributes
out.conversion_errors()
## [False, False, True]
out.original
##0 dog
##1 cat
##2 4
##Name: C, dtype: object
Assign to a Pandas DataFrame (new attributes and methods break)
df['new'] = normalize_x(df.C, new_attribute = 'CoolAttribute')
df['new']
## 0 normalized_dog
## 1 normalized_cat
## 2 None
## dtype: object
## Can't use the new methods or access attributes
df['new'].conversion_errors()
## AttributeError: 'Series' object has no attribute 'conversion_errors'
df['new'].original
## AttributeError: 'Series' object has no attribute 'original'
Pandas allows you to extend its classes (Series, DataFrame). In your case the solution is quite verbose, but I think it's the only way to reach your goal.
I'll try to get to the point without analysing complex cases, so the complete implementation of the interface is up to you, but I can give you an idea of what you can use.
I don't understand the purpose of new_attribute, so for the moment it is not taken into account. Basically, as far as I know, Pandas extensions only let you extend a 1-D array. Since you have multiple arrays (both normalized and original), you have to create another data type to get around the problem.
class NormX(object):
    def __init__(self, normalized, original):
        self.normalized = normalized
        self.original = original

    def __repr__(self):
        if self.normalized is None:
            return 'NaN'
        return self.normalized
This allows you to create a simple base object like the following:
norm_obj = NormX('normalized_dog', 'dog')
This object will be the basic building block of your custom array. To be able to exploit this kind of class, you have to register a new type in Pandas:
from pandas.api.extensions import ExtensionDtype, register_extension_dtype
import numpy as np

@register_extension_dtype
class NormXType(ExtensionDtype):
    name = 'normX'
    type = NormX
    kind = 'O'
    na_value = np.nan
Now you have all the elements to build a custom array based on the Pandas framework. To do that, you have to extend its interface class named ExtensionArray. Here you can find the abstract methods that must be implemented by subclasses. I've given a very basic implementation, but it should be fleshed out properly:
import sys
from pandas.api.extensions import ExtensionArray

class NormalizeX(ExtensionArray):
    def __init__(self, values):
        self.data = values

    def __repr__(self):
        return "NormalizeX({!r})".format([(t.normalized, t.original) for t in self.data])

    def _from_sequence(self):
        pass

    def _from_factorized(self):
        pass

    def __getitem__(self, key):
        return self.data[key]

    # def __setitem__(self, key, value):
    #     self.normalized[key] = value
    #     return self

    def __len__(self):
        return len(self.data)

    def __eq__(self, other):
        return False

    def dtype(self):
        # return self._dtype
        return object

    def nbytes(self):
        return sys.getsizeof(self.data)

    def isna(self):
        return False

    def take(self):
        pass

    def copy(self):
        return type(self)(self.data)

    def _concat_same_type(self):
        pass
Moreover, to define a custom method on that class, you have to define a custom Series accessor, as follows:
@pd.api.extensions.register_series_accessor("normx_accsr")
class NormalizeXAccessor:
    def __init__(self, obj):
        self.normalized = [o.normalized for o in obj]
        self.original = [o.original for o in obj]

    @property
    def conversion_errors(self):
        return [o != n for o, n in zip(pd.isnull(self.original), pd.isnull(self.normalized))]
In this way, the NormalizeX custom array implements all the methods required to be integrated into both Series and DataFrame. So your example simply reduces to:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ['dog', 'cat', 4]})
norm_list = [['normalized_'+ i, i] if i != 4 else (None, i) for i in df.C]
normx_objs = [NormX(*t) for t in norm_list]
normalizedX = NormalizeX(normx_objs)
df['new'] = normalizedX
# Use the previously defined attribute_accessor normx_accsr
df['new'].normx_accsr.conversion_errors # [False, False, True]
It's too difficult for me to implement the functionality you want, so I'll just share what I found in my investigation, in the hope that it's useful to other answerers.
The cause of the issue:
The reason you got those attribute errors is that a sanitization process is carried out on the Series you pass to the DataFrame.
A brief check of ids:
You can quickly confirm the difference between what out and df['new'] refer to by the following code:
out = normalize_x(df.C, new_attribute = 'CoolAttribute')
df['new'] = out
print(id(out))
print(id(df['new']))
1861777917792
1861770685504
You can see that out and df['new'] are different objects because their ids differ.
Let's dive into the pandas source code to see what goes on here.
DataFrame._set_item method:
In the definition of the DataFrame class, the _set_item method is what runs when you try to add a Series to a DataFrame in a specified column.
def _set_item(self, key, value) -> None:
    """
    Add series to DataFrame in specified column.

    If series is a numpy-array (not a Series/TimeSeries), it must be the
    same length as the DataFrames index or an error will be thrown.

    Series/TimeSeries will be conformed to the DataFrames index to
    ensure homogeneity.
    """
    value = self._sanitize_column(value)
In this method, value = self._sanitize_column(value) is the first line after the docstring. This _sanitize_column call is what destroys your original Series functionality. If you dig into this method deeper, you'll finally reach the following lines:
def _reindex_for_setitem(value: FrameOrSeriesUnion, index: Index) -> ArrayLike:
    # reindex if necessary
    if value.index.equals(index) or not len(index):
        return value._values.copy()
value._values.copy() is the direct cause of the disappearance of the NormalizeX attributes. It just copies the values from the given Series.
Therefore, the _set_item method should be modified in order to protect the NormalizeX attributes.
Conclusion:
You have to override the DataFrame class to set your NormalizeX in a specified column keeping with its attributes.
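As a starting point, pandas documents subclassing hooks (_metadata, _constructor, _constructor_sliced) for propagating custom attributes; a minimal sketch follows (the Subclassed* names are hypothetical, and note that the copy in _set_item shown above still drops the attribute values on column assignment, so you would additionally have to override that behaviour):

import pandas as pd

class SubclassedSeries(pd.Series):
    # extra attributes pandas should try to carry through operations
    _metadata = ['original', 'new_attribute']

    @property
    def _constructor(self):
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        return SubclassedDataFrame

class SubclassedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        # single columns selected from this frame come back as SubclassedSeries
        return SubclassedSeries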
Given a list of values or strings, how can I detect whether these are either dates, date and times, or neither?
I have used the pandas api to infer data types but it doesn't work well with dates. See example:
import pandas as pd

def get_redshift_dtype(values):
    dtype = pd.api.types.infer_dtype(values)
    return dtype
This is the result that I'm looking for. Any suggestions on better methods?
# Should return "date"
values_1 = ['2018-10-01', '2018-02-14', '2017-08-01']
# Should return "date"
values_2 = ['2018-10-01 00:00:00', '2018-02-14 00:00:00', '2017-08-01 00:00:00']
# Should return "datetime"
values_3 = ['2018-10-01 02:13:00', '2018-02-14 11:45:00', '2017-08-01 00:00:00']
# Should return "None"
values_4 = ['123098', '213408', '801231']
You can write a function to return values dependent on conditions you specify:
def return_date_type(s):
    s_dt = pd.to_datetime(s, errors='coerce')

    if s_dt.isnull().any():
        return 'None'
    elif s_dt.normalize().equals(s_dt):
        return 'date'
    return 'datetime'
return_date_type(values_1) # 'date'
return_date_type(values_2) # 'date'
return_date_type(values_3) # 'datetime'
return_date_type(values_4) # 'None'
You should be aware that Pandas datetime series always include time. Internally, they are stored as integers, and if a time is not specified it will be set to 00:00:00.
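For example, a quick check of the underlying representation:

import pandas as pd

s = pd.to_datetime(['2018-10-01'])
print(s)           # DatetimeIndex(['2018-10-01'], dtype='datetime64[ns]', freq=None)
print(s[0].value)  # 1538352000000000000 -- int64 nanoseconds since the epoch; midnight is implied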
Here's something that'll give you exactly what you asked for using re
import re

classify_dict = {
    'date': r'^\d{4}(-\d{2}){2}$',
    'date_again': r'^\d{4}(-\d{2}){2} 00:00:00$',
    'datetime': r'^\d{4}(-\d{2}){2} \d{2}(:\d{2}){2}$',
}

def classify(mylist):
    key = 'None'
    for k, v in classify_dict.items():
        if all([bool(re.match(v, e)) for e in mylist]):
            key = k
            break
    if key == 'date_again':
        key = 'date'
    return key
classify(values_2)
>>> 'date'
The checking is done iteratively with regex, trying to match every item of the list. Only if all items match is the corresponding key returned. This works for all of the example lists you've given.
For now, the regex does not reject numbers outside a valid range, e.g. 25:00:00, but that would be relatively straightforward to add.
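For example, a stricter pattern for the 'datetime' entry might look like this (a sketch, not tested against every edge case):

# Restrict hours to 00-23 and minutes/seconds to 00-59:
classify_dict['datetime'] = r'^\d{4}(-\d{2}){2} ([01]\d|2[0-3])(:[0-5]\d){2}$'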
I do not understand why "u" has NaN values. What am I doing wrong here?
>>> z=pd.DataFrame([['abcb','asasa'],['sdsd','aeio']])
>>> z
0 1
0 abcb asasa
1 sdsd aeio
>>> u=pd.DataFrame(z,columns=['hello','ajsajs'])
>>> u
hello ajsajs
0 NaN NaN
1 NaN NaN
Alternate construction calls
You can use the underlying NumPy array:
u = pd.DataFrame(z.values, columns=['hello','ajsajs'])
hello ajsajs
0 abcb asasa
1 sdsd aeio
Alternately, you could use:
u = z.rename(columns={0: 'hello',1: 'ajsajs'})
And lastly, as suggested by @Dark:
u = z.set_axis(['hello','ajsajs'], axis=1, inplace=False)
A small note on inplace in set_axis -
WARNING: inplace=None currently falls back to True, but in a future
version, will default to False. Use inplace=True explicitly rather
than relying on the default.
In pandas 0.20.3 the syntax would be just:
u = z.set_axis(axis=1, labels=['hello','ajsajs'])
@Dark's solution appears fastest here.
Why current method doesn't work
I believe the issue here is that there's a .reindex being called when the DataFrame is constructed in this way. Here's some source code where ellipses denote irrelevant stuff I'm leaving out:
from pandas.core.internals import BlockManager

# pandas.core.frame.DataFrame
class DataFrame(NDFrame):
    def __init__(self, data=None, index=None, columns=None, dtype=None,
                 copy=False):
        # ...
        if isinstance(data, DataFrame):
            data = data._data

        if isinstance(data, BlockManager):
            mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
                                 dtype=dtype, copy=copy)
        # ... a bunch of other if statements irrelevant to your case

        NDFrame.__init__(self, mgr, fastpath=True)
        # ...
What's happening here:
DataFrame inherits from a more generic base class which in turn has multiple inheritance. (Pandas is great, but its source can be like trying to backtrack through a spider's web.)
In u = pd.DataFrame(z, columns=['hello','ajsajs']), z is a DataFrame. Therefore, the first if statement below is True and data = data._data. What's _data? It's a BlockManager.* (To be continued below...)
Because we just converted what you passed to its BlockManager, the next if statement also evaluates True. Then mgr gets assigned to the result of the _init_mgr method and the parent class's __init__ gets called, passing mgr.
* confirm with isinstance(z._data, BlockManager).
Now on to part 2...
# pandas.core.generic.NDFrame
class NDFrame(PandasObject, SelectionMixin):
    def __init__(self, data, axes=None, copy=False, dtype=None,
                 fastpath=False):
        # ...

    def _init_mgr(self, mgr, axes=None, dtype=None, copy=False):
        """ passed a manager and a axes dict """
        for a, axe in axes.items():
            if axe is not None:
                mgr = mgr.reindex_axis(axe,
                                       axis=self._get_block_manager_axis(a),
                                       copy=False)
        # ...
        return mgr
Here is where _init_mgr is defined, which gets called above. Essentially in your case you have:
columns=['hello','ajsajs']
axes=dict(index=None, columns=columns)
# ...
When you reindex an axis and none of the new labels are present in the old object, you get all NaNs. This seems like a deliberate design decision. Consider this related example to prove the point, where one new column is present and one is not:
pd.DataFrame(z, columns=[0, 'ajsajs'])
0 ajsajs
0 abcb NaN
1 sdsd NaN
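You can reproduce the same effect directly with reindex, which is essentially what _init_mgr ends up doing here via the block manager:

z.reindex(columns=['hello', 'ajsajs'])

#   hello  ajsajs
# 0   NaN     NaN
# 1   NaN     NaN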
I can access elements of a named tuple by name as follows(*):
from collections import namedtuple
Car = namedtuple('Car', 'color mileage')
my_car = Car('red', 100)
print my_car.color
But how can I use a variable to specify the name of the field I want to access? E.g.
field = 'color'
my_car[field] # doesn't work
my_car.field # doesn't work
My actual use case is that I'm iterating through a pandas dataframe with for row in data.itertuples(). I am doing an operation on the value from a particular column, and I want to be able to specify the column to use by name as a parameter to the method containing this loop.
(*) example taken from here. I am using Python 2.7.
You can use getattr
getattr(my_car, field)
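For the itertuples use case mentioned in the question, that looks roughly like this (column_sum and the column names below are just illustrative):

import pandas as pd

def column_sum(data, column):
    # the column to operate on is passed by name and looked up on each row with getattr
    return sum(getattr(row, column) for row in data.itertuples())

df = pd.DataFrame({'color': ['red', 'blue'], 'mileage': [100, 200]})
print(column_sum(df, 'mileage'))  # 300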
The 'getattr' answer works, but there is another option which is slightly faster.
idx = {name: i for i, name in enumerate(list(df), start=1)}

for row in df.itertuples(name=None):
    example_value = row[idx['product_price']]
Explanation
Make a dictionary mapping the column names to their position in the row tuple. Call itertuples with name=None. Then access the desired value in each tuple by the index obtained from the dictionary for that column name.
Make a dictionary to find the indexes.
idx = {name: i for i, name in enumerate(list(df), start=1)}
Use the dictionary to access the desired values by name in the row tuples
for row in df.itertuples(name=None):
    example_value = row[idx['product_price']]
Note: Use start=0 in enumerate if you call itertuples with index=False
Here is a working example showing both methods and the timing of both methods.
import numpy as np
import pandas as pd
import timeit

data_length = 3 * 10**5

fake_data = {
    "id_code": list(range(data_length)),
    "letter_code": np.random.choice(list('abcdefgz'), size=data_length),
    "pine_cones": np.random.randint(low=1, high=100, size=data_length),
    "area": np.random.randint(low=1, high=100, size=data_length),
    "temperature": np.random.randint(low=1, high=100, size=data_length),
    "elevation": np.random.randint(low=1, high=100, size=data_length),
}

df = pd.DataFrame(fake_data)

def iter_with_idx():
    result_data = []
    idx = {name: i for i, name in enumerate(list(df), start=1)}
    for row in df.itertuples(name=None):
        row_calc = row[idx['pine_cones']] / row[idx['area']]
        result_data.append(row_calc)
    return result_data

def iter_with_getattr():
    result_data = []
    for row in df.itertuples():
        row_calc = getattr(row, 'pine_cones') / getattr(row, 'area')
        result_data.append(row_calc)
    return result_data

dict_idx_method = timeit.timeit(iter_with_idx, number=100)
get_attr_method = timeit.timeit(iter_with_getattr, number=100)

print(f'Dictionary index Method {dict_idx_method:0.4f} seconds')
print(f'Get attribute method {get_attr_method:0.4f} seconds')
Result:
Dictionary index Method 49.1814 seconds
Get attribute method 80.1912 seconds
I assume the difference is due to the lower overhead of creating a plain tuple vs a namedtuple, and of accessing items by index rather than via getattr, but both of those are just guesses. If anyone knows better, please comment.
I have not explored how the number of columns vs the number of rows affects the timing results.
Since Python 3.6 you can inherit from typing.NamedTuple:
import typing as tp

class HistoryItem(tp.NamedTuple):
    inp: str
    tsb: float
    rtn: int
    frequency: int = None

    def __getitem__(self, item):
        if isinstance(item, int):
            item = self._fields[item]
        return getattr(self, item)

    def get(self, item, default=None):
        try:
            return self[item]
        except (KeyError, AttributeError, IndexError):
            return default

item = HistoryItem("inp", 10, 10, 10)
print(item[0])      # 'inp'
print(item["inp"])  # 'inp'
Another way of accessing them can be:
field_idx = my_car._fields.index(field)
my_car[field_idx]
Extract index of the field and then use it to index the namedtuple.
Use the following code
for i, x in enumerate(my_car._fields):
    print(x, my_car[i])
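With the my_car from the question, in Python 3 this prints:

color red
mileage 100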
I am working on parsing a *.csv file. To do that, I am trying to create a class that helps me simplify some operations on a DataFrame.
I've created two methods in order to parse a column 'z' that contains the values for the 'Price' column.
def subr(self):
    isone = self.df.z == 1.0
    if isone.any():
        atone = self.df.Price[isone].iloc[0]
        self.df.loc[self.df.z.between(0.8, 2.5), 'Benchmark'] = atone
        # df.loc[(df.r >= .8) & (df.r <= 1.4), 'value'] = atone
    return self.df

def obtain_z(self):
    "Return a column with z for E_ref"
    self.z_col = self.subr()
    self.dfnew = self.df.groupby((self.df.z < self.df.z.shift()).cumsum()).apply(self.z_col)
    return self.dfnew

def main():
    x = ParseDataBase('data.csv')
    file_content = x.read_file()
    new_df = x.obtain_z()
I'm getting the following error:
'DataFrame' objects are mutable, thus they cannot be hashed
'DataFrame' objects are mutable means that we can change the elements of that frame, but I'm not sure where I'm hashing anything.
I noticed that the use of apply(self.z_col) is what goes wrong, but I have no clue how to fix it.
You are passing the DataFrame self.df returned by self.subr() to apply, but actually apply only takes functions as parameters (see examples here).
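A minimal sketch of the distinction (the lambda is only a placeholder for whatever per-group logic you actually need):

import pandas as pd

df = pd.DataFrame({'z': [1.0, 0.9, 1.2], 'Price': [10, 11, 12]})

# groupby.apply expects a callable; each group is passed to it in turn
grouped = df.groupby((df.z < df.z.shift()).cumsum())
print(grouped.apply(lambda g: g.Price.mean()))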