I have upwards of 4000 lines of code that analyze, manipulate, compare and plot two huge .csv files. For readability and future publication, I'd like to convert it to object-oriented classes. I convert the files to pd.DataFrames:
my_data1 = pd.DataFrame(np.random.randn(100, 9), columns=list('123456789'))
my_data2 = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
I have functions that compare various aspects of each of the datasets and functions that only use the datasets individually. I want to convert this structure into a dataclass with methods for each dataframe.
I can't manipulate these dataframes through my class functions. I keep getting NameError: name 'self' is not defined. Here's my dataclass structure:
@dataclass
class Data:
    ser = pd.DataFrame

    # def __post_init__(self):
    #     self.ser = self.clean()

    def clean(self, ser):
        acceptcols = np.where(ser.loc[0, :] == '2')[0]
        data = ser.iloc[:, np.insert(acceptcols, 0, 0)]
        data = ser.drop(0)
        data = ser.rename(columns={'': 'Time(s)'})
        data = ser.astype(float)
        data = ser.reset_index(drop=True)
        data.columns = [column.replace('1', '')
                        for column in ser.columns]
        return data
my_data1 = pd.DataFrame(np.random.randn(100, 9), columns=list('123456789'))
my_data2 = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# Attempt 1
new_data1 = Data.clean(my_data1) # Parameter "ser" unfilled
# Attempt 2
new_data1 = Data.clean(ser=my_data1) # Parameter "self" unfilled
# Attempt 3
new_data1 = Data.clean(self, my_data1) # Unresolved reference "self"
I have tried various forms of defining def clean(self, ...) with different argument combinations, but I think I just don't understand classes or class structure well enough. Documentation on classes and dataclasses always uses very rudimentary examples, and I've tried cutting and pasting a template to no avail. What am I missing?
You can first get an instance x of the class Data:
x = Data()
# Attempt 1
new_data1 = x.clean(my_data1)
# Attempt 2
new_data1 = x.clean(ser=my_data1)
If I were you, I would not use a class this way; I would instead just define the following function
def clean(ser):
    acceptcols = np.where(ser.loc[0, :] == '2')[0]
    data = ser.iloc[:, np.insert(acceptcols, 0, 0)]
    data = ser.drop(0)
    data = ser.rename(columns={'': 'Time(s)'})
    data = ser.astype(float)
    data = ser.reset_index(drop=True)
    data.columns = [column.replace('1', '')
                    for column in ser.columns]
    return data
and call it directly.
Also, in your clean(), each modification is based on ser, which is the original input, not on the result of the previous modification. That is a problem, isn't it?
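If you do want to keep a class, here is a minimal sketch of what the commented-out __post_init__ seems to be aiming for (my assumption, not necessarily the original design): the DataFrame is declared as a real dataclass field, cleaned once at construction, and each cleaning step chains on data rather than on the original ser:

from dataclasses import dataclass
import numpy as np
import pandas as pd

@dataclass
class Data:
    ser: pd.DataFrame  # declared as a proper dataclass field

    def __post_init__(self):
        # clean the stored DataFrame once, right after construction
        self.ser = self.clean(self.ser)

    def clean(self, ser):
        acceptcols = np.where(ser.loc[0, :] == '2')[0]
        data = ser.iloc[:, np.insert(acceptcols, 0, 0)]
        data = data.drop(0)                      # chain on data, not on ser
        data = data.rename(columns={'': 'Time(s)'})
        data = data.astype(float)
        data = data.reset_index(drop=True)
        data.columns = [column.replace('1', '') for column in data.columns]
        return data

my_data1 = Data(pd.DataFrame(np.random.randn(100, 9), columns=list('123456789')))
# my_data1.ser now holds the cleaned DataFrame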
I want to make a function that does an ETL on a list/Series and returns a Series with additional attributes and methods specific to that function. I can achieve this by creating a class that extends Series, and it works, but when I reassign the function's output back into a DataFrame, the new class's attributes and methods are stripped away. How can I extend the Series with custom attributes and methods that are not stripped away when reassigning back to the dataframe?
Custom function that does ETL, returns a Series with an extended class
import pandas as pd

def normalize_x(x: list, new_attribute: None):
    normalized = pd.Series(['normalized_' + i if i != 4 else None for i in x])
    return NormalizeX(normalized=normalized, original=x, new_attribute=new_attribute)

class NormalizeX(pd.Series):
    def __init__(self, normalized, original, new_attribute, *args, **kwargs):
        super().__init__(data=normalized, *args, **kwargs)
        self.original = original
        self.normalized = normalized
        self.new_attribute = new_attribute

    def conversion_errors(self):
        return [o != n for o, n in zip(pd.isnull(self.original), pd.isnull(self.normalized))]
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ['dog', 'cat', 4]})
Assign to a new object (new attributes and methods work)
out = normalize_x(df.C, new_attribute = 'CoolAttribute')
out
## 0 normalized_dog
## 1 normalized_cat
## 2 None
## dtype: object
## Can still use Series methods
out.to_list()
## ['normalized_dog', 'normalized_cat', None]
## Can use the new methods and access attributes
out.conversion_errors()
## [False, False, True]
out.original
##0 dog
##1 cat
##2 4
##Name: C, dtype: object
Assign to a Pandas DataFrame (new attributes and methods break)
df['new'] = normalize_x(df.C, new_attribute = 'CoolAttribute')
df['new']
## 0 normalized_dog
## 1 normalized_cat
## 2 None
## dtype: object
## Can't use the new methods or access attributes
df['new'].conversion_errors()
## AttributeError: 'Series' object has no attribute 'conversion_errors'
df['new'].original
## AttributeError: 'Series' object has no attribute 'original'
Pandas allows you to extend its classes (Series, DataFrame). In your case the solution is quite verbose, but I think it's the only way to reach your goal.
I will try to get to the point without analysing complex cases, so the complete implementation of the interface is up to you, but I can give you an idea of what you can use.
I don't understand the utility of new_attribute, so for the moment it is not taken into account. Basically, as far as I know, Pandas extensions only let you extend a 1-D array. Since you have multiple arrays (both normalized and original), you have to create another data type to get around the problem.
class NormX(object):
    def __init__(self, normalized, original):
        self.normalized = normalized
        self.original = original

    def __repr__(self):
        if self.normalized is None:
            return 'Nan'
        return self.normalized
This allows you to create a simple base object like the following:
norm_obj = NormX('normalized_dog', 'dog')
This object will be the basic block of your custom array. To be able to exploit this kind of class, you have to register a new type in Pandas:
from pandas.api.extensions import ExtensionDtype, register_extension_dtype
import numpy as np

@pd.api.extensions.register_extension_dtype
class NormXType(ExtensionDtype):
    name = 'normX'
    type = NormX
    kind = 'O'
    na_value = np.nan
Now you have all the elements to build a custom array on top of the Pandas framework. To do that, you have to extend its interface class named ExtensionArray. Here you can find the abstract methods that must be implemented by subclasses. I give you a very basic implementation, but it should be declared in a proper way:
import sys
from pandas.api.extensions import ExtensionArray

class NormalizeX(ExtensionArray):
    def __init__(self, values):
        self.data = values

    def __repr__(self):
        return "NormalizeX({!r})".format([(t.normalized, t.original) for t in self.data])

    def _from_sequence(self):
        pass

    def _from_factorized(self):
        pass

    def __getitem__(self, key):
        return self.data[key]

    # def __setitem__(self, key, value):
    #     self.normalized[key] = value
    #     return self

    def __len__(self):
        return len(self.data)

    def __eq__(self, other):
        return False

    def dtype(self):
        # return self._dtype
        return object

    def nbytes(self):
        return sys.getsizeof(self.data)

    def isna(self):
        return False

    def take(self):
        pass

    def copy(self):
        return type(self)(self.data)

    def _concat_same_type(self):
        pass
Moreover, to define a custom method on that class, you have to define a custom Series accessor, as follows:
@pd.api.extensions.register_series_accessor("normx_accsr")
class NormalizeXAccessor:
    def __init__(self, obj):
        self.normalized = [o.normalized for o in obj]
        self.original = [o.original for o in obj]

    @property
    def conversion_errors(self):
        return [o != n for o, n in zip(pd.isnull(self.original), pd.isnull(self.normalized))]
In this way, the NormalizeX custom array implements all the required methods to be successfully integrated into both Series and DataFrame. So, your example simply reduces to:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": ['dog', 'cat', 4]})
norm_list = [['normalized_'+ i, i] if i != 4 else (None, i) for i in df.C]
normx_objs = [NormX(*t) for t in norm_list]
normalizedX = NormalizeX(normx_objs)
df['new'] = normalizedX
# Use the previously defined attribute_accessor normx_accsr
df['new'].normx_accsr.conversion_errors # [False, False, True]
It's too difficult for me to implement your desired functionality, so I will only share what I found in my investigation, hoping it might be useful for other answerers.
The cause of the issue:
The reason you got those attribute errors is that sanitization processes are carried out on the Series you pass to the DataFrame.
A brief check of ids:
You can quickly confirm the difference between what out and df['new'] refer to by the following code:
out = normalize_x(df.C, new_attribute = 'CoolAttribute')
df['new'] = out
print(id(out))
print(id(df['new']))
1861777917792
1861770685504
You can see that out and df['new'] refer to different objects because their ids differ.
Let's dive into the pandas source code to see what goes on here.
DataFrame._set_item method:
In the definition of the DataFrame class, the _set_item method is invoked when you try to add a Series to a DataFrame in a specified column.
def _set_item(self, key, value) -> None:
    """
    Add series to DataFrame in specified column.

    If series is a numpy-array (not a Series/TimeSeries), it must be the
    same length as the DataFrames index or an error will be thrown.

    Series/TimeSeries will be conformed to the DataFrames index to
    ensure homogeneity.
    """
    value = self._sanitize_column(value)
In this method, value = self._sanitize_column(value) is the first line after the docstring. This _sanitize_column method is what actually destroys your original Series functionality. If you dig deeper into this method, you'll eventually reach the following lines:
def _reindex_for_setitem(value: FrameOrSeriesUnion, index: Index) -> ArrayLike:
    # reindex if necessary
    if value.index.equals(index) or not len(index):
        return value._values.copy()
value._values.copy() is the direct cause of the disappearance of the NormalizeX attributes. It just copies the values from the given Series.
Therefore, the _set_item method should be modified in order to protect the NormalizeX attributes.
Conclusion:
You would have to override the DataFrame class to set your NormalizeX in a specified column while keeping its attributes.
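As a quick illustration of this effect, using the names from the question, checking the types before and after assignment shows that the subclass (and therefore its extra attributes) is gone once the column is stored:

out = normalize_x(df.C, new_attribute='CoolAttribute')
df['new'] = out
print(type(out).__name__)        # NormalizeX -- the subclass with the extra attributes
print(type(df['new']).__name__)  # Series -- the sanitized copy, attributes stripped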
Python Panel: passing a dataframe to a param.Parameterized class
I can build a dashboard using Panel. I now want to include the code in a class, including the data manipulation.
df = ydata.load_web(rebase=True)

class Plot(param.Parameterized):
    df = df
    col = list(df.columns)
    Index1 = param.ListSelector(default=col, objects=col)
    Index2 = param.ListSelector(default=col[1:2], objects=col)

    def dashboard(self, **kwargs):
        unds = list(set(self.Index1 + self.Index2))
        return self.df[unds].hvplot()

b = Plot(name="Index Selector")
pn.Row(b.param, b.dashboard)
I would like to call
b = Plot(name="Index Selector", df=ydata.load_web(rebase=True))
Using a parameterized DataFrame and two methods for
- setting the ListSelector according to the available columns in the data frame and
- creating a plot with hv.Overlay (containing single plots for each chosen column),
the code could look like this:
import numpy as np
import pandas as pd
import param
import panel as pn
import holoviews as hv

# Test data frame with two columns
df = pd.DataFrame(np.random.randint(90, 100, size=(100, 1)), columns=['1'])
df['2'] = np.random.randint(70, 80, size=(100, 1))

class Plot(param.Parameterized):
    df = param.DataFrame(precedence=-1)  # negative precedence, will not be shown as a widget
    df_columns = param.ListSelector(default=[], objects=[], label='DataFrame columns')

    def __init__(self, **params):
        super(Plot, self).__init__(**params)
        # set the column selector with the data frame provided at initialization
        self.set_df_columns_selector()

    # method is triggered each time the data frame changes
    @param.depends('df', watch=True)
    def set_df_columns_selector(self):
        col = list(self.df.columns)
        print('Set the df index selector when the column list changes: {}'.format(col))
        self.param.df_columns.objects = list(col)  # set choosable columns according to the current df
        self.df_columns = [self.param.df_columns.objects[0]]  # set column 1 as default

    # method is triggered each time the chosen columns change
    @param.depends('df_columns', watch=True)
    def set_plots(self):
        print('Plot the columns chosen by the df column selector: {}'.format(self.df_columns))
        plotlist = []  # start with an empty list
        for i in self.df_columns:
            # append a plot for each chosen column
            plotlist.append(hv.Curve({'x': self.df.index, 'y': self.df[i]}))
        self.plot = hv.Overlay(plotlist)

    def dashboard(self):
        return self.plot

b = Plot(name="Plot", df=df)
layout = pn.Row(b.param, b.dashboard)
layout.app()
# or:
# pn.Row(b.param, b.dashboard)
This way the parameterized variables take care of updating the plots.
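As a small usage note (the new data frame below is hypothetical), assigning a new DataFrame to the df parameter re-runs set_df_columns_selector via @param.depends('df', watch=True), which in turn updates df_columns and the plot:

# hypothetical new data; assigning it re-triggers the watching methods
df2 = pd.DataFrame(np.random.randint(50, 60, size=(100, 2)), columns=['3', '4'])
b.df = df2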
I'm looking for the name of the pattern in which the output of one function is handled by several others (I'm struggling to find better words for my problem). Some pseudo/actual code would be really helpful.
I have written the following code:
def read_data():
    # read data from a file
    # create df
    return df

def parse_data():
    sorted_df = read_data()
    # count lines
    # sort by date
    return sorted_df

def add_new_column():
    new_column_df = parse_data()
    # add new column
    return new_column_df

def create_plot():
    plot_data = add_new_column()
    # create a plot
    # display chart
What I'm trying to understand is how to skip a function, e.g. to create the following chain: read_data() -> parse_data() -> create_plot().
As the code is written right now (due to all the return values and how they are passed between functions), it requires me to change the input data in the last function, create_plot().
I suspect that I'm creating logically incorrect code.
Any thoughts?
Original code:
import pandas as pd
import matplotlib.pyplot as plt

# Read csv files into a data frame
def read_data():
    raw_data = pd.read_csv('C:/testdata.csv', sep=',', engine='python', encoding='utf-8-sig').replace({'{': '', '}': '', '"': '', ',': ' '}, regex=True)
    return raw_data

def parse_data(raw_data):
    ...
    # Convert CreationDate column into datetime
    raw_data['CreationDate'] = pd.to_datetime(raw_data['CreationDate'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
    raw_data.sort_values(by=['CreationDate'], inplace=True, ascending=True)
    parsed_data = raw_data
    return parsed_data

raw_data = read_data()
parsed = parse_data(raw_data)
Pass the data in instead of just effectively "nesting" everything. Any data that a function requires should ideally be passed in to the function as a parameter:
def read_data():
    # read data from a file
    # create df
    return df

def parse_data(sorted_df):
    # count lines
    # sort by date
    return sorted_df

def add_new_column(new_column_df):
    # add new column
    return new_column_df

def create_plot(plot_data):
    # create a plot
    # display chart

df = read_data()
parsed = parse_data(df)
added = add_new_column(parsed)
create_plot(added)
Try to make sure functions are only handling what they're directly responsible for. It isn't parse_data's job to know where the data is coming from or to produce the data, so it shouldn't be worrying about that. Let the caller handle that.
The way I have things set up here is often referred to as "piping" or "threading". Information "flows" from one function into the next. In a language like Clojure, this could be written as:
(-> (read-data)
    (parse-data)
    (add-new-column)
    (create-plot))
This uses the threading macro ->, which frees you from manually having to handle the data passing. Unfortunately, Python doesn't have anything built in to do this, although it can be achieved using external modules.
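As a rough illustration (a hand-rolled helper rather than any particular external module), a small pipe function built on functools.reduce can reproduce the same left-to-right flow in Python:

from functools import reduce

def pipe(value, *functions):
    # feed value through each function in turn, left to right
    return reduce(lambda acc, fn: fn(acc), functions, value)

# equivalent to create_plot(add_new_column(parse_data(read_data())))
pipe(read_data(), parse_data, add_new_column, create_plot)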
Also note that since dataframes seem to be mutable, you don't actually need to return the altered ones from the functions. If you're just mutating the argument directly, you could just pass the same data frame to each of the functions in order instead of placing it in intermediate variables like parsed and added. The way I'm showing here is a general way to set things up, but it can be altered depending on your exact use case.
Use a class to contain your code:
class DataManipulation:
    def __init__(self, path):
        self.df = pd.DataFrame()
        self.read_data(path)

    @staticmethod
    def new(file_path):
        return DataManipulation(file_path)

    def read_data(self, path):
        # read data from a file
        # self.df = <created df>
        ...

    def parse_data(self):
        # use self.df
        # count lines
        # sort by date
        return self

    def add_new_column(self):
        # use self.df
        # add new column
        return self

    def create_plot(self):
        # use self.df
        # create a plot
        # display chart
        return self
And then,
d = DataManipulation.new(filepath).parse_data().add_new_column().create_plot()
I am trying to read a csv file using pandas, parse it, and then upload the results to my Django database. For now, I am converting each dataframe column to a list and then iterating over the lists to save the values in the DB. But my solution is inefficient when the lists are really big. How can I make it better?
fileinfo = pd.read_csv(csv_file, sep=',',
                       names=['Series_reference', 'Period', 'Data_value', 'STATUS',
                              'UNITS', 'Subject', 'Group', 'Series_title_1', 'Series_title_2',
                              'Series_title_3', 'Series_tile_4', 'Series_tile_5'],
                       skiprows=1)
# serie = fileinfo[fileinfo['Series_reference']]
s = fileinfo['Series_reference'].values.tolist()
p = fileinfo['Period'].values.tolist()
d = fileinfo['Data_value'].values.tolist()
st = fileinfo['STATUS'].values.tolist()
u = fileinfo['UNITS'].values.tolist()
sub = fileinfo['Subject'].values.tolist()
gr = fileinfo['Group'].values.tolist()
stt = fileinfo['Series_title_1'].values.tolist()

count = 0
while count < len(s):
    b = Testdata(
        Series_reference=s[count],
        Period=p[count],
        Data_value=d[count],
        STATUS=st[count],
        UNITS=u[count],
        Subject=sub[count],
        Group=gr[count],
        Series_title_1=stt[count]
    )
    b.save()
    count = count + 1
You can use the pandas apply function and pass axis=1 to apply a given function to every row:
df.apply(
    creational_function,  # Method that creates your structure
    axis=1,               # Apply to every row
    args=(arg1, arg2)     # Additional args to creational_function
)
In creational_function, the first argument received is the row, from which you can access specific columns just like in the original dataframe:
def creational_function(row, arg1, arg2):
    s = row['Series_reference']
    # For brevity I skip the other arguments...
    # Create TestData
    # Save
Note that arg1 and arg2 are the same for every row.
If you want to do something more with your created TestData objects, you can change creational_function to return a value; df.apply will then return a Series containing all the values returned by the passed function.
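Put together with the model and column names from the question (and with no extra arguments, so args can be omitted), a sketch of that creational function might look like this:

def create_testdata(row):
    # build and save one Testdata record from a single CSV row
    Testdata(
        Series_reference=row['Series_reference'],
        Period=row['Period'],
        Data_value=row['Data_value'],
        STATUS=row['STATUS'],
        UNITS=row['UNITS'],
        Subject=row['Subject'],
        Group=row['Group'],
        Series_title_1=row['Series_title_1']
    ).save()

fileinfo.apply(create_testdata, axis=1)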
I am working on parsing a *.csv file. To that end, I am trying to create a class which helps me simplify some operations on a DataFrame.
I've created two methods in order to parse a column 'z' that contains values for the 'Price' column.
def subr(self):
    isone = self.df.z == 1.0
    if isone.any():
        atone = self.df.Price[isone].iloc[0]
        self.df.loc[self.df.z.between(0.8, 2.5), 'Benchmark'] = atone
        # df.loc[(df.r >= .8) & (df.r <= 1.4), 'value'] = atone
    return self.df

def obtain_z(self):
    "Return a column with z for E_ref"
    self.z_col = self.subr()
    self.dfnew = self.df.groupby((self.df.z < self.df.z.shift()).cumsum()).apply(self.z_col)
    return self.dfnew

def main():
    x = ParseDataBase('data.csv')
    file_content = x.read_file()
    new_df = x.obtain_z()
I'm getting the following error:
'DataFrame' objects are mutable, thus they cannot be hashed
'DataFrame' objects are mutable means that we can change the elements of the frame, but I'm not sure where I'm hashing anything.
I noticed that the use of apply(self.z_col) is what goes wrong.
I also have no clue how to fix it.
You are passing the DataFrame self.df, returned by self.subr(), to apply, but apply only takes a function (callable) as its parameter (see the examples in the pandas documentation).
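A minimal sketch of the fix, assuming the intent is to run some per-group computation after subr() has filled the Benchmark column (the lambda below is only a placeholder for the real per-group logic):

def obtain_z(self):
    "Return a column with z for E_ref"
    self.z_col = self.subr()  # fills the 'Benchmark' column and returns self.df
    groups = (self.df.z < self.df.z.shift()).cumsum()
    # apply expects a callable, not a DataFrame
    self.dfnew = self.df.groupby(groups).apply(lambda g: g)  # placeholder per-group logic
    return self.dfnew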