Let data be a giant pandas DataFrame. It has many methods, and these methods do not modify in place but return a new DataFrame. How then am I supposed to perform multiple operations while maximizing performance?
For example, say I want to do
data = data.method1().method2().method3()
where method1 could be set_index, and so on.
Is this the way you are supposed to do it? My worry is that pandas creates a copy every time I call a method, so that three copies of my data are made in the above, when in reality all I want is to modify the original one.
So is it faster to say
data = data.method1(inplace=True)
data = data.method2(inplace=True)
data = data.method3(inplace=True)
This just seems way too verbose to me.
Yes, you can do that, i.e. apply methods one after the other. Since you overwrite data, you are not holding on to 3 copies; only the final result is kept:
data = data.method1().method2().method3()
However, regarding your second example, the general rule is to either write over the existing dataframe or do it "inplace", but not both at the same time:
data = data.method1()
or
data.method1(inplace=True)
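To make the two styles concrete, here is a small sketch with real methods (the frame and column names are made up for illustration):

import pandas as pd

data = pd.DataFrame({"id": [2, 1, 3], "value": [20.0, 10.0, 30.0]})

# Chained style: each call returns a new frame, but because `data` is rebound
# to the final result, the intermediates can be garbage-collected.
data = data.set_index("id").rename(columns={"value": "score"}).sort_index()

# Equivalent "inplace" style: no assignment, each call mutates the frame.
# data.set_index("id", inplace=True)
# data.rename(columns={"value": "score"}, inplace=True)
# data.sort_index(inplace=True)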
Related
I'm new to python and I'm using pandas to perform some basic manipulation before running a model.
Here is my problem: I have got dataset A and I want to create another dataset (B) equal to A, except for one variable X.
In order to reach this outcome I'm doing this:
A = B
columns_drop = ['X']
B.drop(columns_drop, axis=1)
But the result was that both A and B are missing the variable X.
Use:
A = B.drop(columns_drop, axis=1)
This will make a copy of the data.
Your approach failed because both A and B point to the same object. The proper way to make an independent copy is A = B.copy(), although mutable objects such as lists stored inside the dataframe won't be deep-copied!
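A minimal illustration of the difference (the frame below is made up, and the drop is done in place so the effect of sharing one object is visible):

import pandas as pd

B = pd.DataFrame({"X": [1, 2], "Y": [3, 4]})

A = B            # A and B are now two names for the same object
C = B.copy()     # an independent copy

A.drop(["X"], axis=1, inplace=True)   # drops X through the shared object
print(list(B.columns))                # ['Y'] -- B lost the column too
print(list(C.columns))                # ['X', 'Y'] -- the copy is unaffected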
I have dataframes that follow a name syntax of 'df#' and I would like to be able to loop through these dataframes in a function. In the code below, if the function "testing" is removed, the loop works as expected. When I add the function, it fails on the "test" variable with KeyError: 'iris1'.
import statistics
import seaborn as sns

iris1 = sns.load_dataset('iris')
iris2 = sns.load_dataset('iris')

def testing():
    rows = []
    for i in range(2):
        test = vars()['iris' + str(i+1)]   # raises KeyError: 'iris1' inside the function
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])

testing()
The reason this would be valuable is that I am subsetting my dataframe df multiple times to create quick visualizations. So in Jupyter, I have one cell where I create visualizations off of df1, df2, df3. In the next cell, I overwrite df1, df2, df3 based on different subsetting rules. This is advantageous because I can do it quickly by calling a function each time, so the code stays quite uniform.
Store the datasets in a dictionary and pass that to the function.
import statistics
import seaborn as sns

datasets = {'iris1': sns.load_dataset('iris'), 'iris2': sns.load_dataset('iris')}

def testing(data):
    rows = []
    for i in range(1, 3):
        test = data[f'iris{i}']
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
    return rows

testing(datasets)
No...
You should NEVER find yourself writing a sentence like "I have dataframes that follow the name syntax 'df#'".
Then you have a list of dataframes, or a dict of dataframes, depending on how you want to index them...
Here I would say a list
Then you can forget about vars(), trust me you don't need it... :)
EDIT:
And use list comprehensions; your code can fit in three lines:
import statistics
import seaborn as sns

list_iris = [sns.load_dataset('iris'), sns.load_dataset('iris')]
rows = [
    (statistics.mean(test['sepal_length']), statistics.mean(test['sepal_width']))
    for test in list_iris
]
Storing them in a list or dictionary allowed me to create the function. There is still the problem that the number of dataframes in the list varies. It would be nice to be able to just pass an n argument specifying how many objects are in the list (I guess I could just add a bunch of if statements to define the list based on such an argument). EDIT: I'm changing my code so that I don't use the df# syntax and instead put the dataframes directly into a list.
The problem I was experiencing is still perplexing. I can't for the life of me figure out why the "test" variable performs as expected outside of a function, but inside of a function it fails. I'm going to go the route of creating a list of dataframes, but am still curious to understand why it fails inside of the function.
I agree with @Icarwiz that it might not be the best way to go about it, but you can make it work with:
test=eval('iris'+str(i+1))
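As far as I can tell, the KeyError itself comes from scoping: inside a function, vars() with no argument returns that function's local namespace (the same as locals()), so module-level names like iris1 simply aren't in it, whereas eval works because name resolution in the evaluated expression falls back to the module's globals. A tiny sketch to illustrate:

import seaborn as sns

iris1 = sns.load_dataset('iris')

def lookup():
    print('iris1' in vars())      # False -- vars() here is the function's locals
    print('iris1' in globals())   # True  -- iris1 lives in the module namespace
    return globals()['iris1']     # works, though a list or dict is still cleaner

lookup()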
I was wondering what the most Pythonic way to return a pandas dataframe from a class would be. I am curious whether I need to include .copy() when returning, and generally would like to understand the pitfalls of not including it. I am using a helper function because the dataframe is called multiple times, and I don't want to return it from the manipulate_dataframe method.
My questions are:
Do I need to place .copy() after df when assigning it to a new object?
Do I need to place .copy() after self.final_df when returning it with the get_df helper function?
class df:
    def __init__(self):
        pass

    def manipulate_dataframe(self, df):
        """ DATAFRAME MANIPULATION """
        self.final_df = df

    def get_df(self):
        return self.final_df
Question: do you want the thing referred to as self.final_df to directly alter the original data frame, or do you want it to live its own life? Recall Python treats many variables as "views", in the sense that
y = x
creates two links to the same information, and if you alter y, it also alters x. Example:
>>> x = [3, 4, 6]
>>> y = x
>>> y[0] = 77
>>> x
[77, 4, 6]
The pandas df is just another example of the same Python fact.
Making a copy can be time consuming. It will also disconnect your self.final_df from the input data frame. People often do that because pandas issues warnings about user efforts to assign values into a view of a data frame. The correct place to start reading on this is the pandas documentation: https://pandas.pydata.org/docs/user_guide/indexing.html?highlight=copy%20versus%20view. However, there seem to be 100 blog posts about it; I prefer the RealPython site https://realpython.com/pandas-settingwithcopywarning, but you will find others, e.g. https://www.dataquest.io/blog/settingwithcopywarning. A common step careless users take to address that problem is to copy the data frame, which effectively "disconnects" the new copy from the old data frame.
I don't know why you would want to create a class that simply holds a pandas df. Probably you could just create a function that does all of those data manipulations. If those commands are supposed to alter the original df, don't make a copy.
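To make the trade-off concrete, here is a small sketch; the DataHolder class and the copy flag are illustrative, not part of the code above:

import pandas as pd

class DataHolder:
    def manipulate_dataframe(self, df):
        self.final_df = df                 # keeps a reference to the caller's frame

    def get_df(self, copy=False):
        # copy=True disconnects the returned frame from self.final_df,
        # at the cost of duplicating the data
        return self.final_df.copy() if copy else self.final_df

holder = DataHolder()
original = pd.DataFrame({"a": [1, 2, 3]})
holder.manipulate_dataframe(original)

view = holder.get_df()
view.loc[0, "a"] = 99            # also visible through `original`
safe = holder.get_df(copy=True)
safe.loc[1, "a"] = -1            # does not touch `original` or self.final_df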
If I have a "reference" to a dataframe, there appears to be no way to append to it in pandas because neither append nor concat support the inplace=True parameter.
An (overly) simple example:
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df = chosen_df.append(chosen_row)
Now, because Python does something akin to copying references by value, chosen_df will initially be a reference to whichever candidate dataframe was selected by some_test.
But the update semantics of pandas mean that the referenced dataframe is not updated by append; instead, the name is rebound to a new dataframe. I believe that if it were possible to use inplace=True this would work, but it looks like that isn't likely to happen, given the discussion here: https://github.com/pandas-dev/pandas/issues/14796
It's worth noting that a simpler example using lists rather than dataframes does work, because the contents of lists are directly mutated by append().
So my question is --- How could an updatable abstraction over N dataframes be achieved in Python?
The idiom is commonplace, useful and trivial in languages that allow references, so I'm guessing I'm missing a Pythonic trick, or thinking about the whole problem with the wrong hat on!
Obviously the purely illustrative example can be resolved by duplicating the append in the body of an if...else and concretely referencing each underlying dataframe in turn. But this doesn't scale to more complex examples, and it's a generic solution akin to references that I'm looking for.
Any ideas?
There is a simple way to do this specifically for pandas dataframes - so I'll answer my own question.
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df.loc[max_idx+1] = chosen_row
The calculation of max_idx very much depends on the structure of chosen_df. In the simplest case, when it is a dataframe with a sequential index starting at 0, you can simply use the length of the index to calculate it.
If chosen_df has a non-sequential index, you'll need to call max() on the index rather than rely on its length.
If chosen_df is a slice or a groupby result, you'll need to calculate the index from the parent dataframe to ensure it's truly the max across all rows.
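A small end-to-end version of the .loc trick, with made-up frames and a simple 0-based sequential index (as a side note, DataFrame.append was later deprecated and removed in pandas 2.0, so this approach also sidesteps that):

import pandas as pd

candidate_a_df = pd.DataFrame({"x": [1, 2]})
candidate_b_df = pd.DataFrame({"x": [10]})
candidate_a_row = [3]     # one value per column, in column order
candidate_b_row = [20]
some_test = True

chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)

# Sequential 0-based index, so the next label is the current max + 1.
# For a slice or groupby result, compute the max on the parent dataframe instead.
max_idx = chosen_df.index.max()
chosen_df.loc[max_idx + 1] = chosen_row

print(candidate_a_df)     # the new row is visible through the original name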
I'm trying to optimize a piece of software written in Python using pandas DataFrames. The algorithm takes a pandas DF as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DF for each client.
This works fine, BUT a very few clients have a really HUGE amount of data. So I need to save memory when creating their DFs.
In order to do this I'm performing a groupBy() (actually a combineByKey, but logically it's a groupBy) and then, for each group (which is now a single row of an RDD), I build a list and, from it, a pandas DF.
However, this makes many copies of the data (RDD rows, list and pandas DF...) in a single task/node and crashes, and I would like to avoid having that many copies on a single node.
I was thinking of a "special" combineByKey with the following pseudo-code:
import pandas as pd

def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking if y is a pandas DF already, but we could do that too
    if isinstance(x, list):
        pandasDF = pd.DataFrame(data=x, columns=myCols)   # myCols defined elsewhere
        return pandasDF.append(y)   # append returns a new DF, so return its result
    else:
        return x.append(y)
My question here: the docs say nothing, but does anyone know if it's safe to assume that this will work? (The return data type of merging two combiners is not the same as the combiner's.) I could handle the data type in mergeCombinerVal too if the number of "bad" calls is marginal, but it would be very inefficient to append to a pandas DF row by row.
Any better idea to accomplish what I want to do?
Thanks!
PS: Right now I'm packing Spark Rows; would switching from Spark Rows to Python lists without column names help reduce memory usage?
Just writing my comment as an answer.
In the end I've used a regular combineByKey. It's faster than groupByKey (I don't know the exact reason; I guess it helps to pack the rows, because my rows are small in size but there are many of them), and it also allows me to group them into a "real" Python list (groupByKey groups into some kind of iterable which pandas doesn't accept, forcing me to create another copy of the structure, which doubles memory usage and crashes). That helps me with memory management when packing the rows into pandas/C datatypes.
Now I can use those lists to build a dataframe directly without any extra transformation (I don't know what structure Spark's groupByKey "list" is, but pandas won't accept it in the constructor).
Still, my original idea should have given a little less memory usage (at most 1x DF + 0.5x list, while now I have 1x DF + 1x list), but as user8371915 said, it's not guaranteed by the API/docs... better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I'm processing most of my clients in a very parallel low-memory-per-executor job and the biggest ones in a not-so-parallel high-memory-per-executor job. I decide based on a pre-count I perform.
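A rough sketch of the combineByKey-into-lists approach described above; the RDD contents, column names, and compute_metric function are placeholders rather than the original code:

import pandas as pd
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder data: (client_id, row-as-tuple) pairs.
rdd = sc.parallelize([
    ("client_a", (1.0, 2.0)),
    ("client_a", (3.0, 4.0)),
    ("client_b", (5.0, 6.0)),
])

def create_combiner(row):
    return [row]                 # start a plain Python list

def merge_value(acc, row):
    acc.append(row)              # grow the list in place, no extra copies
    return acc

def merge_combiners(a, b):
    a.extend(b)
    return a

grouped = rdd.combineByKey(create_combiner, merge_value, merge_combiners)

def compute_metric(rows):
    # Build the per-client DataFrame straight from the plain list, then run
    # the non-distributable algorithm on it (a simple mean stands in here).
    df = pd.DataFrame(rows, columns=["x", "y"])
    return df["x"].mean()

metrics = grouped.mapValues(compute_metric).collect()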