how to manipulate a DataFrame in which each value is an object? - python

I have a DataFrame in which each value is an object of a custom class, say:
dc = {"c1":{"a1":CAppState(1,1), "a2":CAppState(2,4) }, "c2":{"a2":CAppState(2,5), "a3":CAppState(3,32)} }
df = pd.DataFrame(dc)
where CAppState is a class:
class CAppState(object):
    def __init__(self, nID, nValue):
        self.m_nID = nID
        self.m_nValue = nValue
I was wondering how I could conduct some common operations on this dataframe, like cumsum(), or sorting according to CAppState.m_nValue?
Any suggestion would be appreciated.

There is no built-in way to do this. You must create a Series from your objects and cumsum that. This can be done fairly easily with map. For instance:
df.c1.map(lambda x: x.m_nValue).cumsum()
You could also use operator.attrgetter (remember to import operator first):
import operator
df.c1.map(operator.attrgetter('m_nValue')).cumsum()
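Putting the pieces together, here is a runnable sketch (using Series.map per column, which works across pandas versions) that extracts m_nValue into a plain numeric frame and then cumsums and sorts it:

```python
import pandas as pd

class CAppState(object):
    def __init__(self, nID, nValue):
        self.m_nID = nID
        self.m_nValue = nValue

dc = {"c1": {"a1": CAppState(1, 1), "a2": CAppState(2, 4)},
      "c2": {"a2": CAppState(2, 5), "a3": CAppState(3, 32)}}
df = pd.DataFrame(dc)

# Extract m_nValue from every cell; cells missing in one column
# (e.g. c2 has no "a1") come through as NaN and are left alone.
values = df.apply(lambda col: col.map(
    lambda x: x.m_nValue if pd.notna(x) else x))

print(values.cumsum())            # column-wise cumulative sum (NaNs skipped)
print(values.sort_values("c2"))   # sort rows by c2's m_nValue
```

Once the values are in a plain numeric frame, any ordinary DataFrame operation applies.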

Related

Use ast.literal_eval on all columns of a Pandas Dataframe

I have a data frame that looks like the following:
Category Class
==========================
['org1', 'org2'] A
['org2', 'org3'] B
org1 C
['org3', 'org4'] A
org2 A
['org2', 'org4'] B
...
When I read in this data using Pandas, the lists are read in as strings (e.g., dat['Category'][0][0] returns [ rather than org1). I have several columns like this. I want every categorical column that already contains at least one list to have all of its records be lists. For example, the above data frame should look like the following:
Category Class
==========================
['org1', 'org2'] A
['org2', 'org3'] B
['org1'] C
['org3', 'org4'] A
['org2'] A
['org2', 'org4'] B
...
Notice how the singular values in the Category column are now contained in lists. When I reference dat['Category'][0][0], I'd like org1 to be returned.
What is the best way to accomplish this? I was thinking of using ast.literal_eval with an apply and lambda function, but I'd like to try and use best-practices if possible. Thanks in advance!
You could create a boolean mask of the values that need to be changed. If there are no lists, no change is needed. If there are lists, you can apply literal_eval or a list-creation lambda to the matching subsets of the data.
import ast
import pandas as pd

def normalize_category(df):
    is_list = df['Category'].str.startswith('[')
    if is_list.any():
        df.loc[is_list, 'Category'] = df.loc[is_list, 'Category'].apply(ast.literal_eval)
        df.loc[~is_list, 'Category'] = df.loc[~is_list, 'Category'].apply(lambda val: [val])

df = pd.DataFrame({"Category": ["['org1', 'org2']", "org1"], "Class": ["A", "B"]})
normalize_category(df)
print(df)

df = pd.DataFrame({"Category": ["org2", "org1"], "Class": ["A", "B"]})
normalize_category(df)
print(df)
You can do it like this:
from ast import literal_eval
df['Category'] = df['Category'].apply(lambda x: literal_eval(x) if x.startswith('[') else [x])

How to retain attributes of inherited Spark DataFrame Class following a Spark operation on that class

I create a new class called NewDataFrame with attribute a_string:
import numpy as np
import pandas as pd
from pyspark.sql import DataFrame
class NewDataFrame(DataFrame):
    def __init__(self, df):
        super().__init__(df._jdf, df.sql_ctx)
        self.a_string = "Hello, World."
I use the class on some data and am able to print out a_string:
data = {
    'a': ['yellow', 'red'],
    'b': [1, 2]
}
df = pd.DataFrame(data)
sdf = spark.createDataFrame(df)
temp = NewDataFrame(sdf)
temp.a_string
Out[]: Hello, World.
Now, I filter temp to a subset and try to output a_string and receive an error because the filter method returns a DataFrame, not NewDataFrame.
temp = temp.filter("a='yellow'")
temp.a_string
Out[]: 'DataFrame' object has no attribute 'a_string'
To keep the attribute in the result of a filter, I have tried creating a new method on the NewDataFrame class which performs the filter and then feeds the result back into a NewDataFrame class, which works, but I do not want to rewrite all the Spark functions in this manner.
Is there a way for the class to have access to the full range of DataFrame methods while still retaining the attributes I define in NewDataFrame?
I have tried the same kind of inheritance in plain Python with no success. The PySpark DataFrame is implemented in such a way that its operations return a plain DataFrame object. You can look at the source code and see how it is done:
https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py
As mentioned in the question, it will be difficult to build a properly functional subclass without overriding each method. Another approach is composition: hold the parent DataFrame object inside your class instead of inheriting from it.
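A sketch of that composition approach, illustrated here with a plain pandas DataFrame so it runs without a Spark session (the same wrapper pattern applies to a PySpark DataFrame): unknown attribute lookups are delegated to the wrapped frame, and any method result that is itself a DataFrame is re-wrapped so the custom attribute survives chained operations.

```python
import pandas as pd

class NewDataFrame:
    """Wraps a DataFrame (composition) instead of subclassing it."""
    def __init__(self, df, a_string="Hello, World."):
        self._df = df
        self.a_string = a_string

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for DataFrame attributes.
        attr = getattr(self._df, name)
        if callable(attr):
            def wrapped(*args, **kwargs):
                result = attr(*args, **kwargs)
                # Re-wrap DataFrame results so a_string is retained.
                if isinstance(result, type(self._df)):
                    return NewDataFrame(result, self.a_string)
                return result
            return wrapped
        return attr

df = pd.DataFrame({"a": ["yellow", "red"], "b": [1, 2]})
temp = NewDataFrame(df)
subset = temp.query("a == 'yellow'")  # delegated, returns a NewDataFrame
print(subset.a_string)                # the attribute survives the operation
```

The trade-off is that isinstance checks against the real DataFrame class no longer pass, but you get the full method surface without overriding anything.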

Dynamic column assignment in Python Pandas

I have a pandas dataframe from which I'd like to create some text-related feature columns. I also have a class that calculates those features. Here's my code:
r = ReadabilityMetrics()
text_features = [
    ['sentence_count', r.sentence_count], ['word_count', r.word_count],
    ['syllable_count', r.syllable_count], ['unique_words', r.unique_words],
    ['reading_time', r.reading_time], ['speaking_time', r.speaking_time],
    ['flesch_reading_ease', r.flesch_reading_ease],
    ['flesch_kincaid_grade', r.flesch_kincaid_grade],
    ['char_count', r.char_count],
]
(df
 .assign(**{t: df['description'].apply(f) for t, f in text_features})
)
I iterate over text_features to dynamically create the columns.
My question: how can I remove the references to the methods and make text_features more concise?
For example, I'd like to have text_features = ['sentence_count', 'word_count', 'syllable_count', ...], and, since the column names are the same as the method names, reference the methods dynamically. Having a nested list doesn't seem DRY, so I'm looking for a more efficient implementation.
I think you're looking for this:
text_features = ['sentence_count', 'word_count', 'syllable_count', 'unique_words',
                 'reading_time', 'speaking_time', 'flesch_reading_ease',
                 'flesch_kincaid_grade', 'char_count']
df.assign(**{func_name: df['description'].apply(getattr(r, func_name))
             for func_name in text_features})
I think this is fine. I would probably define text_features as a list of tuples rather than a list of lists:
for column_name, function in text_features:
    df[column_name] = df['description'].apply(function)
If you're sure that it has to be more concise, define text_features as a list of strings:
for column_name in text_features:
    df[column_name] = df['description'].apply(getattr(r, column_name))
I would not try to make it any more concise than this (such as using ** with a dict), so as to keep the solution less esoteric, but this is just a matter of opinion.
In your case, try getattr. Note that you look up the bound method and pass it to apply without calling it:
(df
 .assign(**{t: df['description'].apply(getattr(r, t)) for t in text_features})
)
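A self-contained sketch of the getattr pattern, with a minimal stand-in for ReadabilityMetrics (the two method names and bodies below are hypothetical, just enough to make the pattern runnable):

```python
import pandas as pd

# Hypothetical stand-in for ReadabilityMetrics; the real class
# has many more methods, but the lookup pattern is identical.
class ReadabilityMetrics:
    def word_count(self, text):
        return len(text.split())

    def char_count(self, text):
        return len(text)

r = ReadabilityMetrics()
text_features = ["word_count", "char_count"]

df = pd.DataFrame({"description": ["hello world", "pandas is neat"]})
# Column name doubles as the method name, so getattr resolves each one.
df = df.assign(**{name: df["description"].apply(getattr(r, name))
                  for name in text_features})
print(df)
```

Because the column names and method names coincide, adding a new feature is just appending one string to text_features.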

Python: Apply .apply() with a self-defined function to a Data Frame- why doesn't it work?

I am trying to apply a self-defined function to a data frame using apply(). The goal is to calculate the mean of each row / column with a self-defined function. But it doesn't work; probably I still don't fully understand the logic of .apply(). Can someone help me? Thanks in advance:
d = pd.DataFrame({"A": [50, 60, 70], "B": [80, 90, 100]})

def m(x):
    x.sum()/len(x)
    return x

d.apply(m(), axis=0)
If possible the best way is a vectorized solution:
df = d.sum() / len(d)
Your solution is possible too, but you need to change the function to return the computed value, and in apply pass the function without calling it (remove the ()); finally, axis=0 is the default value for that parameter, so it can be removed as well:
def m(x):
    return x.sum()/len(x)

df = d.apply(m)

Better way to structure a series of df manipulations in your class

How do you structure the code in your class so that the class returns the df you want, without a main method that just calls a long sequence of other methods? I find that I arrive at this structure in a lot of situations, and it seems bad. I have a df that I repeatedly overwrite with the result of other base functions (which I unit test) until I get what I want.
class A:
    def main(self):
        df = self.load_file_into_df()
        df = self.add_x_columns(df)
        df = self.calculate_y(df)
        df = self.calculate_consequence(df)
        ...
        return df

    def add_x_columns(df): ...
    def calculate_y(df): ...
    def calculate_consequence(df): ...

# now use it somewhere else
df = A().main()
pipe
One feature you may wish to utilize is pd.DataFrame.pipe. This is considered "pandorable" because it facilitates operator chaining.
In my opinion, you should separate reading data into a dataframe from manipulating the dataframe. For example:
class A:
    def main(self):
        df = self.load_file_into_df()
        df = df.pipe(self.add_x_columns)\
               .pipe(self.calculate_y)\
               .pipe(self.calculate_consequence)
        return df
compose
Function composition is not native to Python, but the third-party toolz library does offer this feature. This lets you lazily define chained functions. Note the reversed order of operations: the last argument of compose is applied first.
from toolz import compose

class A:
    def main(self):
        df = self.load_file_into_df()
        transformer = compose(self.calculate_consequence,
                              self.calculate_y,
                              self.add_x_columns)
        df = df.pipe(transformer)
        return df
In my opinion, compose offers a flexible and adaptable solution. You can, for example, define any number of compositions and apply them selectively or repeatedly at various points in your workflow.
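If pulling in toolz just for this feels heavy, the same right-to-left composition can be sketched with the standard library's functools.reduce (the pipeline steps below are toy stand-ins for the class methods above):

```python
from functools import reduce
import pandas as pd

def compose(*funcs):
    """Right-to-left function composition, like toolz.compose,
    built from the standard library only."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(funcs), x)

# Toy pipeline steps standing in for add_x_columns / calculate_y.
add_x_columns = lambda df: df.assign(x=df["a"] * 2)
calculate_y = lambda df: df.assign(y=df["x"] + 1)

transformer = compose(calculate_y, add_x_columns)  # add_x_columns runs first
df = pd.DataFrame({"a": [1, 2]}).pipe(transformer)
print(df)
```

Because the composed transformer is just a function, it slots straight into DataFrame.pipe like any other step.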
