pandas apply with inputs from multiple rows - python

I need to do an apply on a dataframe using inputs from multiple rows. As a simple example, I can do the following if all the inputs are from a single row:
df['c'] = df[['a','b']].apply(lambda x: awesome stuff, axis=1)
# or
df['d'] = df[['b','c']].shift(1).apply(...) # to get the values from the previous row
However, if I need 'a' from the current row, and 'b' from the previous row, is there a way to do that with apply? I could add a new 'bshift' column and then just use df[['a','bshift']] but it seems there must be a more direct way.
Related but separate, when accessing a specific value in the df, is there a way to combine labeled indexing with integer-offset? E.g. I know the label of the current row but need the row before. Something like df.at['labelIknow'-1, 'a'] (which of course doesn't work). This is for when I'm forced to iterate through rows. Thanks in advance.
Edit: Some info on what I'm doing etc. I have a pandas store containing tables of OHLC bars (one table per security). When doing backtesting, currently I pull the full date range I need for a security into memory, and then resample it into a frequency that makes sense for the test at hand. Then I do some vectorized operations for things like trade entry signals etc. Finally I loop over the data from start to finish doing the actual backtest, e.g. checking for trade entry exit, drawdown etc - this looping part is the part I'm trying to speed up.

This should directly answer your question and let you use apply, although I'm not sure it's ultimately any better than a two-line solution. It does avoid creating extra variables at least.
df['c'] = pd.concat([ df['a'], df['a'].shift() ], axis=1).apply(np.mean,axis=1)
That will put the mean of 'a' values from the current and previous rows into 'c', for example.
This isn't as general, but for simpler cases you can do something like this (continuing the mean example):
df['c'] = ( df['a'] + df['a'].shift() ) / 2
That is about 10x faster than the concat() method on my tiny example dataset. I imagine that's as fast as you could do it, if you can code it in that style.
You could also look into reshaping the data with stack() and hierarchical indexing. That would be a way to get all your variables into the same row but I think it will likely be more complicated than the concat method or just creating intermediate variables via shift().

For the first part, I don't think such a thing is possible. If you update on what you actually want to achieve, I can update this answer.
Also looking at the second part, your data structure seems to be relying an awfully lot on the order of rows. This is typically not how you want to manage your databases. Again, if you tell us what your overall goal is, we may hint you towards a solution (and potentially a better way to structure the data base).
Anyhow, one way to get the row before, if you know a given index label, is to do:
df.ix[:'labelYouKnow'].iloc[-2]
Note that this is not the optimal thing to do efficiency-wise, so you may want to improve your your db structure in order to prevent the need for doing such things.

Related

Replacing values in entire dataframe based on dictionary

I have a large dataframe in which I want to change every value based on a dictionary. Currently I am using a loop:
for column in df.columns:
df = df.replace({column: dictionary})
This works but it is pretty slow. This single for loop takes around 90% of the time that it takes my entire code to run. Is there a faster method using loc, map or replace?
There are lots of other questions like this already but it seems pretty much everyone else is just trying to change individual columns rather than the entire dataframe.
I'm using Spyder with Python 3.9 on Windows. Thanks!
Edit: I found a way to make it faster by switching rows and columns. Previously I had lots of columns and few rows, now that it's the other way around the code is a lot faster. Still, is there a way to replace values in the entire dataframe rather than just individual columns?
Is there a way to replace values in the entire dataframe rather than just individual columns?
Use:
df = df.replace(dictionary)

Panda crosstab function getting number for conditions

I´m not sure if the title was well picked, sorry for that. If this was already covered please let me know where I couldn´t find it.
For an analysis that I am doing, I am working in JupyterLab mainly scanpy. I want to see the number of cells that are coexpressing certain genes in a leiden clustering. So far I was trying with pandas crosstab function and I get the number for each cluster.
However, I have two conditions and there I´m struggling to separate the samples to get the cell counts separately.
The code I am using to get the total cell number which works fine.
pd.crosstab(adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx'])
The code where I am struggling to get the numbers for the samples. I know that the aggfunc = ','.join is not the correct way but this is to explain what the problem is.
pd.crosstab(adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx'], adata_proc.obs['sample'], aggfunc = ','.join)
I can get the name of the conditions out in the table but I don´t want this. I want the numbers for the 2 conditions. How is this possible? Maybe there is a way to do this in a separate function?
Edit:
Using crosstab, you'll need to add the 'CoEx' column to the index, and use the 'sample' as the column of interest:
pd.crosstab(index=[adata_proc.obs['leiden_r05'],adata_proc.obs['CoEx']], columns=[adata_proc.obs['sample']])
I suggest using the .groupby function:
adata_proc.obs.groupby(['leiden_r05','CoEx'])["sample"].value_counts()
Another option (a bit of an abuse) is the pivot_table interface. In your case it be:
pd.pivot_table(adata_proc.obs, index=["leiden_r05"], columns=["CoEx","sample"],values='barcode', aggfunc=len, fill_value=0)
*The 'values' argument is here only to reduce the amounts of columns, an artifact of using an unfit method

Using .loc in pandas slows down calculation

I have the following dataframe where I want to assign the bottom 1% value to a new column. When I do this calculation with using the ".loc" notification, it takes around 10 seconds for using .loc assignment, where the alternative solution is only 2 seconds.
df_temp = pd.DataFrame(np.random.randn(100000000,1),columns=list('A'))
%time df_temp["q"] = df_temp["A"].quantile(0.01)
%time df_temp.loc[:, "q1_loc"] = df_temp["A"].quantile(0.01)
Why is the .loc solution slower? I understand using the .loc solution is safer, but if I want to assign data to all indices in the column, what can go wrong with the direct assignment?
.loc is searching along the entirety of indices and columns (in this case, only 1 column) in your df along the whole axes, which is time consuming and perhaps redundant, in addition to figuring out the quantiles of df_temp['A'] (which is negligible as far as calculation time). Your direct assignment method, on the other hand, is just parsing df_temp['A'].quantile(0.01), and assigning df_temp['q']. It doesn't need to exhaustively search the indices/columns of your df.
See this answer for a similar description of the .loc method.
As far as safety is concerned, you are not using chained indexing, so you're probably safe (you're not trying to set anything on a copy of your data, it's being set directly on the data itself). It's good to be aware of the potential issues with not using .loc (see this post for a nice overview of SettingWithCopy warnings), but I think that you're OK as far as that goes.
If you want to be more explicit about your column creation, you could do something along the lines of df = df.assign(q=df_temp["A"].quantile(0.01)). It won't really change performance (I don't think), nor the result, but it allows you to see that you're explicitly assigning a new column to your existing dataframe (and thus not setting anything on a copy of said dataframe).

Pandas DataFrame - Test for change/modification

Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row by row basis, and the simple numbers of rows makes this process a couple hours in length. Hence, I save the dataframe, which also can take a bit of time - 30 seconds at least.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am wasting a lot of time waiting for the save to finish at the termination of the script needlessly.
Is there a DataFrame attribute I can test to see if the DataFrame has been modified? Essentially, if this is False I can avoid saving at the end of the script, but if it is True then a save is necessary. This simple one line if will save me a lot of time and a lost of SSD writes!
You can use:
df.equals(old_df)
You can read the it's functionality in pandas' documentation. It basically does what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do it since it's an implementation of pandas itself.
Notice you need to use .copy() when assigning old_df before changes in your current df, otherwise you might pass the dataframe by reference and not by value.

Nested data in Pandas

First of all: I know this is a dangerous question. There are a lot of similar questions about storing and accessing nested data in pandas but I think my question is different (more general) so hang on. :)
I have medium sized dataset of workouts for 1 athlete. Each workout has a date and time, ~200 properties (e.g. average speed and heart rate) and some raw data (3-10 lists of e.g. speed and heart rate values per second). I have about 300 workouts and each workouts contains on average ~4000 seconds.
So far I tried 3 solutions to store this data with pandas to be able to analyze it:
I could use MultiIndex and store all data in 1 DataFrame but this
DataFrame would get quite large (which doesn't have to be a problem
but visually inspecting it will be hard) and slicing the data is cumbersome.
Another way would be to store the date and properties
in a DataFrame df_1 and to store the raw data in a separate
DataFrame df_2 that I would store in a separate column raw_data
in df_1.
...Or (similar to (2)) I could store the raw data in separate DataFrames
that I store in a dict with keys identical to the index of the
DataFrame df_1.
Either of these solutions work and for this use case there are no major performance benefits to either of them. To me (1) feels the most 'Pandorable' (really like that word :) ) but slicing the data is difficult and visual inspection of the DataFrame (printing it) is of no use. (2) feels a bit 'hackish' and in-place modifications can be unreliable but this solution is very nice to work with. And (3) is ugly and a bit difficult to work with, but also the most Pythonic in my opinion.
Question: What would be the benefits of each method and what is the most Pandorable solution in your opinion?
By the way: Of course I am open to alternative solutions.

Categories