Hello SO and community!
I think my question is somewhat related to this one.
However, I believe the task below is a little different from the one referenced above: it is to extract, transform, and load data using pandas.DataFrame, and I am stuck implementing a Protocol for that purpose.
The code is below:
import io
import os
import pandas as pd
import re
import requests
from functools import cache
from typing import Protocol
from zipfile import ZipFile
from pandas import DataFrame


@cache
def extract_can_from_url(url: str, **kwargs) -> DataFrame:
    '''
    Returns DataFrame from downloaded zip file from url
    Parameters
    ----------
    url : str
        url to download from.
    **kwargs : TYPE
        additional arguments to pass to pd.read_csv().
    Returns
    -------
    DataFrame
    '''
    name = url.split('/')[-1]
    if os.path.exists(name):
        with ZipFile(name, 'r').open(name.replace('-eng.zip', '.csv')) as f:
            return pd.read_csv(f, **kwargs)
    else:
        r = requests.get(url)
        with ZipFile(io.BytesIO(r.content)).open(name.replace('-eng.zip', '.csv')) as f:
            return pd.read_csv(f, **kwargs)
class ETL(Protocol):
    # =============================================================================
    # Maybe Using these items for dataclass:
    # url: str
    # meta: kwargs(default_factory=dict)
    # =============================================================================
    def __init__(self, url: str, **kwargs) -> None:
        return None

    def download(self) -> DataFrame:
        return DataFrame

    def retrieve_series_ids(self) -> list[str]:
        return list[str]

    def transform(self) -> DataFrame:
        return DataFrame

    def sum_up_series_ids(self) -> DataFrame:
        return DataFrame
class ETLCanadaFixedAssets(ETL):
    def __init__(self, url: str, **kwargs) -> None:
        self.url = url
        self.kwargs = kwargs

    @cache
    def download(self) -> DataFrame:
        self.df = extract_can_from_url(URL, index_col=0, usecols=range(14))
        return self.df

    def retrieve_series_ids(self) -> list[str]:
        # =========================================================================
        # Columns Specific to URL below, might be altered
        # =========================================================================
        self._columns = {
            "Prices": 0,
            "Industry": 1,
            "Flows and stocks": 2,
            "VECTOR": 3,
        }
        self.df_cut = self.df.loc[:, tuple(self._columns)]
        _q = (self.df_cut.iloc[:, 0].str.contains('2012 constant prices')) & \
            (self.df_cut.iloc[:, 1].str.contains('manufacturing', flags=re.IGNORECASE)) & \
            (self.df_cut.iloc[:, 2] == 'Linear end-year net stock')
        self.df_cut = self.df_cut[_q]
        self.series_ids = sorted(set(self.df_cut.iloc[:, -1]))
        return self.series_ids

    def transform(self) -> DataFrame:
        # =========================================================================
        # Columns Specific to URL below, might be altered
        # =========================================================================
        self._columns = {
            "VECTOR": 0,
            "VALUE": 1,
        }
        self.df = self.df.loc[:, tuple(self._columns)]
        self.df = self.df[self.df.iloc[:, 0].isin(self.series_ids)]
        return self.df

    def sum_up_series_ids(self) -> DataFrame:
        self.df = pd.concat(
            [
                self.df[self.df.iloc[:, 0] == series_id].iloc[:, [1]]
                for series_id in self.series_ids
            ],
            axis=1
        )
        self.df.columns = self.series_ids
        self.df['sum'] = self.df.sum(axis=1)
        return self.df.iloc[:, [-1]]
UPD
Instantiating the class ETLCanadaFixedAssets and chaining the calls
df = ETLCanadaFixedAssets(URL, index_col=0, usecols=range(14)).download().retrieve_series_ids().transform().sum_up_series_ids()
raises an error, which is, however, expected:
AttributeError: 'DataFrame' object has no attribute 'retrieve_series_ids'
Could anyone provide guidance on how to put these pieces together, namely how to obtain the DataFrame that the procedural approach would produce by calling the methods of the last class in the order in which they appear, and point out the mistakes made above?
Perhaps there is another way to do this elegantly using injection.
Thank you very much in advance!
All the methods of the ETLCanadaFixedAssets and ETL classes should return self. This allows you to call further methods of the class on each return value, so you can chain them together. You could add one more member that retrieves the encapsulated DataFrame, but it must always be called last, because once you call it you can no longer chain the other methods. What you are trying to build is called a fluent API; you may read more about it here.
For example:
class ETL(Protocol):
    def download(self) -> ETL:
        ...

    def retrieve_series_ids(self) -> ETL:
        ...

    def transform(self) -> ETL:
        ...

    def sum_up_series_ids(self) -> ETL:
        ...

    @property
    def dataframe(self) -> DataFrame:
        ...
Note that you will need the following import line to be able to use the class annotation inside the class definition:
from __future__ import annotations
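To make the chaining idea concrete, here is a minimal sketch of a fluent ETL class. It is not the original ETLCanadaFixedAssets: InMemoryETL, its records argument, and the "period" column are hypothetical stand-ins so that the example runs without downloading anything. The annotations are quoted strings, which is an alternative to the `from __future__ import annotations` line.

```python
from typing import Protocol

import pandas as pd
from pandas import DataFrame


class ETL(Protocol):
    def download(self) -> "ETL": ...
    def transform(self) -> "ETL": ...
    def sum_up(self) -> "ETL": ...
    @property
    def dataframe(self) -> DataFrame: ...


class InMemoryETL:
    """Hypothetical stand-in for ETLCanadaFixedAssets; no network access."""

    def __init__(self, records: list[dict], series_ids: list[str]) -> None:
        self._records = records
        self._series_ids = series_ids
        self._df = DataFrame()

    def download(self) -> "InMemoryETL":
        # A real implementation would fetch and parse the remote file here.
        self._df = DataFrame(self._records)
        return self  # returning self is what makes chaining possible

    def transform(self) -> "InMemoryETL":
        # Keep only the rows that belong to the requested series.
        self._df = self._df[self._df["VECTOR"].isin(self._series_ids)]
        return self

    def sum_up(self) -> "InMemoryETL":
        # Sum the selected series per period.
        summed = self._df.groupby("period")[["VALUE"]].sum()
        self._df = summed.rename(columns={"VALUE": "sum"})
        return self

    @property
    def dataframe(self) -> DataFrame:
        # Terminal accessor: call it last, it returns data instead of self.
        return self._df


records = [
    {"period": 2000, "VECTOR": "v1", "VALUE": 1.0},
    {"period": 2000, "VECTOR": "v2", "VALUE": 2.0},
    {"period": 2000, "VECTOR": "v3", "VALUE": 100.0},
    {"period": 2001, "VECTOR": "v1", "VALUE": 3.0},
    {"period": 2001, "VECTOR": "v2", "VALUE": 4.0},
]
result = InMemoryETL(records, ["v1", "v2"]).download().transform().sum_up().dataframe
```

Because every step returns the object itself, the steps can be chained in any order that makes sense, and only the final dataframe access leaves the chain.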
I am trying to figure out how to structure my program, which is a simple trading bot.
class "Exchange"
should store instances of the class "TradingPair"
class "Symbol"
stores all symbol-related data
class "Websocket"
gets the ws stream and stores the ticks in a dataframe in the TradingPair instance
located in Exchange.symbols[symbol_name].history
class "Indicators"
calculates, for example, a moving average and stores the values in
Exchange.symbols[symbol_name].history
Here are my questions:
To access Exchange.symbols I need a class variable so that I can read/edit it from within other class instances. In Websocket / handle_symbol_ticker I have to write Exchange.symbols[self.symbol_name].history. Could this be done in a shorter manner? I did try history_path = Exchange.symbols[self.symbol_name].history, but this generates a new object...
In Indicators / calc_ma I could not use loc[-1, column_name] but had to use .index[-1]. What would be the best way to index the last row?
Here is the code:
import pandas as pd


class Exchange:
    symbols = {}


class Symbol:
    def __init__(self, base_asset, quote_asset):
        self.base_asset = base_asset
        self.quote_asset = quote_asset
        self.symbol_name = self.base_asset + self.quote_asset
        self.history = pd.DataFrame()


class Websocket(Exchange):
    def __init__(self, symbol_name):
        self.symbol_name = symbol_name
        history_path = Exchange.symbols[self.symbol_name].history  # doesn't work

    def handle_symbol_ticker(self, msg: dict):
        Exchange.symbols[self.symbol_name].history = pd.concat([
            Exchange.symbols[self.symbol_name].history,
            pd.DataFrame([msg])
        ]).set_index("event_time")

    # def handle_symbol_ticker(self, msg: dict):
    #     history_path = pd.concat([  # <- doesn't work
    #         history_path,
    #         pd.DataFrame([msg])
    #     ]).set_index("event_time")


class Indicators(Exchange):
    def __init__(self, symbol_name):
        self.symbol_name = symbol_name

    def calc_ma(self, timespan):
        timespan_name = "ma_" + str(timespan)
        Exchange.symbols[self.symbol_name].history.loc[
            Exchange.symbols[self.symbol_name].history.index[-1],
            timespan_name] \
            = Exchange.symbols[self.symbol_name].history["close"].tail(timespan).mean()


if __name__ == "__main__":
    bnc_exchange = Exchange()
    bnc_exchange.symbols["axsbusd"] = Symbol("axs", "busd")
    bnc_websocket = Websocket("axsbusd")
    bnc_indicators = Indicators("axsbusd")
    bnc_exchange.symbols["axsbusd"].history = pd.DataFrame({
        "event_time": [101, 102, 103, 104, 105],
        "close": [50, 51, 56, 54, 53],
    })
    bnc_websocket.handle_symbol_ticker({
        "event_time": 106,
        "close": 54
    })
    bnc_indicators.calc_ma(3)
    print(bnc_exchange.symbols["axsbusd"].history)
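One possible way to shorten the access path, sketched under assumptions: the shortcut fails because pd.concat returns a new DataFrame, and assigning it to a local name only rebinds that name. Holding a reference to the Symbol object and assigning to its history attribute avoids this. The constructor signatures below (passing a Symbol instead of a symbol name) are a hypothetical restructuring, not the original design, and the index is set on the new row only so that repeated ticks keep working.

```python
import pandas as pd


class Symbol:
    def __init__(self, base_asset: str, quote_asset: str) -> None:
        self.symbol_name = base_asset + quote_asset
        self.history = pd.DataFrame()


class Websocket:
    def __init__(self, symbol: Symbol) -> None:
        # Keep a reference to the Symbol itself, not to symbol.history.
        self.symbol = symbol

    def handle_symbol_ticker(self, msg: dict) -> None:
        row = pd.DataFrame([msg]).set_index("event_time")
        # Assigning to the attribute updates the shared Symbol object.
        self.symbol.history = pd.concat([self.symbol.history, row])


class Indicators:
    def __init__(self, symbol: Symbol) -> None:
        self.symbol = symbol

    def calc_ma(self, timespan: int) -> None:
        history = self.symbol.history  # same object, only a shorter name
        column_name = f"ma_{timespan}"
        # .loc is label-based, so the last row is addressed by its index label.
        history.loc[history.index[-1], column_name] = (
            history["close"].tail(timespan).mean()
        )


symbol = Symbol("axs", "busd")
symbol.history = pd.DataFrame({
    "event_time": [101, 102, 103, 104, 105],
    "close": [50, 51, 56, 54, 53],
}).set_index("event_time")

websocket = Websocket(symbol)
indicators = Indicators(symbol)
websocket.handle_symbol_ticker({"event_time": 106, "close": 54})
indicators.calc_ma(3)
```

The in-place write through history.loc works with the short name because no new object is created, whereas pd.concat always creates one and therefore has to be assigned back to the attribute.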
I'm currently attempting to define a custom dataset to read/write .fits files to/from S3 as SunPy Maps.
The closest thing to this already in the data catalog is pillow.ImageDataSet, which supports passing a file object when loading:
https://pillow.readthedocs.io/en/stable/reference/Image.html.
I'm unsure if Maps are flexible enough with inputs to justify a similar approach. My attempt so far at modifying the pillow.ImageDataSet _load method to include
smap = Map(fs_file)
return smap
results in the following error:
DataSetError: Failed while loading data from data set SunPyMapDataSet(filepath=sunspots/data/01_raw/map_sample.fits, protocol=s3, save_args={'overwrite': True}).
Invalid input: <File-like object S3FileSystem, sunspots/data/01_raw/map_sample.fits>
How might I get things working here?
I am unfamiliar with the SunPy library, but I think your approach is correct so far.
fs_file is a file handle, and you need the correct way to open this file. You are probably getting this error because Map(fs_file) isn't the correct way to load a file.
You should look for functions that load a Map object from a file.
Months ago I wrote a Kedro custom dataset for SunPy using Astropy as an intermediary and forgot to answer this question. It may be worth opening a PR to the new kedro-datasets package for SunPy users.
import warnings
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
from kedro.io.core import (
    AbstractVersionedDataSet,
    DataSetError,
    Version,
    get_filepath_str,
    get_protocol_and_path,
)
import numpy as np
from astropy.io import fits
from sunpy.map import Map


class SunPyMapDataSet(AbstractVersionedDataSet):
    DEFAULT_SAVE_ARGS = {"overwrite": False}

    def __init__(
        self,
        filepath: str,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
    ) -> None:
        _fs_args = deepcopy(fs_args) or {}
        _fs_open_args_load = _fs_args.pop("open_args_load", {})
        _fs_open_args_save = _fs_args.pop("open_args_save", {})
        _credentials = deepcopy(credentials) or {}
        protocol, path = get_protocol_and_path(filepath, version)
        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)
        self._protocol = protocol
        self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)
        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)
        _fs_open_args_save.setdefault("mode", "wb")
        self._fs_open_args_load = _fs_open_args_load
        self._fs_open_args_save = _fs_open_args_save

    def _describe(self) -> Dict[str, Any]:
        return dict(
            filepath=self._filepath,
            protocol=self._protocol,
            save_args=self._save_args,
            version=self._version,
        )

    def _load(self) -> Map:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
            file = fits.open(fs_file).copy()
            image_hdu = file[1]
            image_hdu.verify("fix")
            smap = Map((image_hdu.data, image_hdu.header))
            return smap

    def _save(self, data: Map) -> None:
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
            hdu = fits.ImageHDU()
            hdu.header = data.fits_header
            hdu.data = data.data
            hdu.writeto(fs_file, **self._save_args)
        self._invalidate_cache()

    def _exists(self) -> bool:
        try:
            load_path = get_filepath_str(self._get_load_path(), self._protocol)
        except DataSetError:
            return False
        return self._fs.exists(load_path)

    def _release(self) -> None:
        super()._release()
        self._invalidate_cache()

    def _invalidate_cache(self) -> None:
        """Invalidate underlying filesystem caches."""
        filepath = get_filepath_str(self._filepath, self._protocol)
        self._fs.invalidate_cache(filepath)
I'm trying to create a class that takes the path and name of a CSV file, converts it to a dataframe, deletes some columns, and converts another one to datetime, as in the code below:
import os
from pathlib import Path
import pandas as pd
import datetime


class Plans:
    def __init__(self, file , path):
        self.file = file
        self.path = path
        self.df = pd.Dataframe()

    def get_dataframe(self):
        os.chdir(self.path)
        self.df = pd.read_csv(self.file, encoding="latin-1", low_memory=False, sep=';')
        if 'data' in df.columns:
            self.tipo = 'sales'
            self.df['data'] = pd.to_datetime(df['data'])
        return clean_unused_data()

    def clean_unused_data(self):
        columns = ['id', 'docs', 'sequence','data_in','received', 'banc', 'return', 'status', 'return_cod',
                   'bank_account_return', 'id_transcript', 'id_tx','type_order']
        for item in columns:
            del self.df[item]
        del columns[:]
        return self.df
When I use an object of the class, it gives an error in the clean_unused_data function.
It returns the following error:
__getattr__ raise AttributeError(f"module 'pandas' has no attribute '{name}'")
Also, I would like to do more dataframe transformations in the Plans class, but since this first one failed, I was a little lost.
Thanks for the help, and I apologize for my lack of familiarity with Python.
I think the error refers to accessing an attribute that does not exist in pandas. From what I can see, you wrote pd.Dataframe instead of pd.DataFrame. Notice the capitalization.
Try the following:
def __init__(self, file, path):
    self.file = file
    self.path = path
    self.df = pd.DataFrame()
Probably one of the columns you are trying to delete is not actually in your file. You can handle the exception or remove that column label from your list.
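As a sketch of the second suggestion, DataFrame.drop accepts errors="ignore", which skips labels that are missing from the frame instead of raising. The sample data and the shortened UNUSED_COLUMNS list below are made up for illustration.

```python
import pandas as pd

# Shortened, illustrative version of the column list from the question.
UNUSED_COLUMNS = ["id", "docs", "sequence"]


def clean_unused_data(df: pd.DataFrame) -> pd.DataFrame:
    # errors="ignore" skips labels that are not present instead of raising KeyError.
    return df.drop(columns=UNUSED_COLUMNS, errors="ignore")


raw = pd.DataFrame({
    "id": [1, 2],
    "data": ["2021-01-05", "2021-02-10"],
    "amount": [10.0, 20.0],
})
cleaned = clean_unused_data(raw)  # "docs" and "sequence" are absent, no error
cleaned["data"] = pd.to_datetime(cleaned["data"])
```

Dropping all labels in one call also removes the need for the explicit loop with del.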
I want to use df["ID"] in the dataset_csv method and then call that method using dataset = RawToCSV.dataset_csv(input_path). df["ID"] is defined in the raw_file_processing method.
My code raised the error TypeError: __init__() missing 1 required positional argument: 'df'.
import re
import pandas as pd
import os
import numpy as np

input_path = "../input_data"


class RawToCSV:
    def __init__(self, path_, df):
        self.measurement_df = None
        self.cls = None
        self.path_ = path_
        self.df = df

    def raw_file_processing(self, path_):
        # Open all the subfolders within path
        for root, dirs, files in os.walk(path_):
            for file in files:
                with open(os.path.join(root, file), "r") as data:
                    self.df = pd.read_csv(data)
                    # 'Class' refers to the independent variable
                    cls_info = self.df.iloc[2]
                    # Dummy-code the classes
                    cls = pd.get_dummies(cls_info)
                    # Create the ID series by concatenating columns 1-3
                    self.df = self.df.assign(
                        ID=self.df[['cell_id:cell_id', 'region:region', 'tile_num:tile_num']].apply(
                            lambda row: '_'.join([str(each) for each in row]), axis=1))
                    self.df = self.df.drop(columns=['cell_id:cell_id', 'region:region', 'tile_num:tile_num'])
                    # Obtain measurement info
                    # Normalize data against blank/empty columns
                    # log-transform the data
                    for col in self.df[9:]:
                        if re.findall(r"Blank|Empty", col):
                            background = col
                        else:
                            line = col.readline()
                            for dat in line:
                                norm_data = dat / background
                                self.measurement_df = np.log2(norm_data)
        return self.df["ID"], cls, self.measurement_df

    def dataset_csv(self):
        """Col 1: ID
        Col 2: class
        Col 3-n: measurements"""
        ids = self.df["ID"]
        id_col = ids.to_frame()
        cls_col = self.cls.to_frame()
        frames = [id_col, cls_col, self.measurement_df]
        dataset_df = pd.concat(frames)
        data_csv = dataset_df.to_csv("../input_data/dataset.csv")
        return data_csv


raw = RawToCSV(input_path)
three_tuple = raw.raw_file_processing(input_path)
dataset = raw.data_csv()
Traceback:
> --------------------------------------------------------------------------- TypeError Traceback (most recent call
> last) /tmp/ipykernel_136/323215226.py in <module>
> ----> 1 raw = RawToCSV(input_path)
> 2 three_tuple = raw.raw_file_processing(input_path)
>
> TypeError: __init__() missing 1 required positional argument: 'df'
In this part of the code:
dataset = RawToCSV.dataset_csv(input_path)
you are using the class itself; however, you should first create an instance of the class RawToCSV, like this:
rawToCSV = RawToCSV(input_path)
dataset = rawToCSV.dataset_csv()
But you still have another mistake, too. In the constructor of the class, __init__, you have initialized self.df with self.df, which has not been defined yet.
Therefore, in this part of the code you will get another error (AttributeError: 'RawToCSV' object has no attribute 'df'):
def __init__(self, path_):
    self.measurement_df = None
    self.cls = None
    self.path_ = path_
    self.df = self.df # <-----
On this line:
dataset = RawToCSV.dataset_csv(input_path)
you're calling dataset_csv as if it were a static method (calling it on the class not an instance). You are passing in input_path, which I assume is a string. Since you're calling the method as if it were static, it is not invisibly adding the actual self value into the call (you have to have an object to even be sent as self).
This means that your one parameter of dataset_csv, which you named self, is receiving the (string) value of input_path.
The error message is telling you that the string input_path has no member .df because it doesn't.
With the way your class and its methods are currently set up, you'll need your entry point code at the bottom to be something like this:
raw = RawToCSV(input_path)
three_tuple = raw.raw_file_processing(input_path)
dataset = raw.dataset_csv()
Though, you may want to restructure your class and its methods.
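The difference between calling a method on the class and calling it on an instance can be reproduced with a stripped-down sketch. This is not the original class: the constructor here takes only the path (so that RawToCSV(input_path) works without a df argument), and raw_file_processing fills in hard-coded placeholder data instead of reading files.

```python
class RawToCSV:
    def __init__(self, path_):
        self.path_ = path_
        self.df = None  # populated later by raw_file_processing

    def raw_file_processing(self):
        # Placeholder data; the real method would read files under self.path_.
        self.df = {"ID": ["1_A_0", "2_B_1"]}
        return self.df["ID"]

    def dataset_csv(self):
        return ",".join(self.df["ID"])


# Calling on an instance: Python passes the instance as self automatically.
raw = RawToCSV("../input_data")
raw.raw_file_processing()
dataset = raw.dataset_csv()

# Calling on the class: the string is bound to self, and str has no .df.
try:
    RawToCSV.dataset_csv("../input_data")
except AttributeError as exc:
    class_call_error = str(exc)
```

Initializing self.df to None in the constructor and filling it in a later method is one way to avoid requiring a df argument at construction time.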
I'm currently creating a class that inherits from the pandas DataFrame. I'm interested in developing a method called 'new_filter' that is a fancier way of executing a DataFrame command:
import pandas as pd
from ipywidgets import widgets
from IPython.display import display
import numpy as np


class Result(pd.DataFrame):
    @property
    def _constructor(self):
        return Result

    def _filter_done(self, c):
        self._column_name = self._filter_dd.value
        self._expression = self._filter_txt.value
        return self[eval('self.'+ self._column_name +' '+self._expression)]

    def new_filter(self):
        self._filter_dd = widgets.Dropdown(options=list(self.columns),
                                           description='Column:')
        self._filter_txt = widgets.Text(description='Expr:')
        self._filter_button = widgets.Button(description = 'Done')
        self._filter_box = widgets.VBox([self._filter_dd, self._filter_txt, self._filter_button])
        display(self._filter_box)
        self._filter_button.on_click(self._filter_done)
After creating an object like:
test = Result(np.random.randn(3,4), columns=['A','B','C','D']) #just an example
test_2 = test.new_filter()
Then, for example:
[image: Widget Output]
What I want is for 'test_2' to be an object of the 'Result' class. Is there any solution to this?
First, you will have to return something from the new_filter function. Second, modifying the same object in place is a bit hard, I think. One thing you can do is return an object with a trait that is updated in _filter_done.
Here is a small example of how you can do it:
import pandas as pd
from ipywidgets import widgets
from IPython.display import display
import numpy as np


class Result(pd.DataFrame):
    @property
    def _constructor(self):
        return Result

    def _filter_done(self, obj, c):
        ## obj is the object to be modified.
        ## Updating its data attribute to have the filtered data.
        self._column_name = self._filter_dd.value
        self._expression = self._filter_txt.value
        obj.data = self[eval('self.'+ self._column_name +' '+self._expression)]

    def new_filter(self):
        self._filter_dd = widgets.Dropdown(options=list(self.columns),
                                           description='Column:')
        self._filter_txt = widgets.Text(description='Expr:')
        self._filter_button = widgets.Button(description = 'Done')
        self._filter_box = widgets.VBox([self._filter_dd, self._filter_txt, self._filter_button])
        display(self._filter_box)
        result_obj = FilterResult()
        self._filter_button.on_click(lambda arg: self._filter_done(result_obj, arg))
        return result_obj


from traitlets import HasTraits
from traittypes import DataFrame


class FilterResult(HasTraits):
    data = DataFrame()
With the same example code as in your question, i.e.,
test = Result(np.random.randn(3,4), columns=['A', 'B', 'C','D']) #just an example
test_2 = test.new_filter()
You can see that whenever you click Done, the updated dataframe is available in test_2.data.
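Leaving the widgets aside, the filtering step itself can be sketched without eval. The example below assumes the expression is a comparison such as "> 0" and uses DataFrame.query instead; filter_by is a hypothetical helper name, not part of pandas or of the code above. Because _constructor returns Result, the filtered frame is a Result as well.

```python
import pandas as pd


class Result(pd.DataFrame):
    @property
    def _constructor(self):
        # Tells pandas to build Result objects when it creates derived frames.
        return Result

    def filter_by(self, column_name: str, expression: str) -> "Result":
        # Backticks allow column names that are not valid Python identifiers.
        return self.query(f"`{column_name}` {expression}")


test = Result({"A": [-1.0, 0.5, 2.0], "B": [10, 20, 30]})
test_2 = test.filter_by("A", "> 0")
```

In the widget version, the callback could call such a helper and store its return value in the trait, which keeps the string handling in one place.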