I would like to know if there's a technique to simply undo a change that was done using Pandas.
For example, I did a string replacement on a few thousand rows of a Pandas DataFrame, replacing every occurrence of "&" in its strings with "and". After performing the replacement, I found out that I had made a mistake and would like to revert the DataFrame to the state it was in just before the replacement.
Is there a way to do this?
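Yes, there is a way to do this. With a recent version of Python and pandas, you could reverse the replacement like this:
df.replace(to_replace='and', value='&', regex=True, inplace=True)
This is the way I learned it! Note that regex=True is needed to replace substrings rather than whole cell values, and that this also turns every "and" that was already in the original data into "&" (including inside words such as "band"), so it restores the data exactly only if the original strings contained no "and".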
If your notebook cells are organized as sequential steps, and the mess comes from running a couple of cells that modified the dataset, you can restart the kernel and run all the cells from the beginning.
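pandas does not keep a history of changes, so a common way to make a change reversible is to keep a copy of the DataFrame before modifying it. A minimal sketch, assuming a hypothetical column named text (the question does not name its column):

```python
import pandas as pd

# Hypothetical data; the column name "text" is an assumption for illustration.
df = pd.DataFrame({"text": ["salt & pepper", "rock and roll", "R&D"]})

# Keep an independent copy before making the change.
backup = df.copy()

# The replacement from the question: every "&" becomes "and".
df["text"] = df["text"].str.replace("&", "and", regex=False)

# Undo the change by going back to the saved copy.
df = backup.copy()
```

Restoring from a copy leaves "rock and roll" untouched, whereas replacing "and" with "&" afterwards would alter it even though it never contained "&".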
Related
I have a large dataframe in which I want to change every value based on a dictionary. Currently I am using a loop:
for column in df.columns:
    df = df.replace({column: dictionary})
This works but it is pretty slow. This single for loop takes around 90% of the time that it takes my entire code to run. Is there a faster method using loc, map or replace?
There are lots of other questions like this already but it seems pretty much everyone else is just trying to change individual columns rather than the entire dataframe.
I'm using Spyder with Python 3.9 on Windows. Thanks!
Edit: I found a way to make it faster by switching rows and columns. Previously I had lots of columns and few rows, now that it's the other way around the code is a lot faster. Still, is there a way to replace values in the entire dataframe rather than just individual columns?
Is there a way to replace values in the entire dataframe rather than just individual columns?
Use:
df = df.replace(dictionary)
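As a small illustration of the whole-frame call, here is a sketch with made-up data; the column names and the mapping are assumptions, not taken from the question:

```python
import pandas as pd

# Hypothetical mapping and data, for illustration only.
dictionary = {"Y": "yes", "N": "no"}
df = pd.DataFrame({"q1": ["Y", "N", "Y"], "q2": ["N", "N", "Y"]})

# A flat dict is applied to every column at once, so no loop is needed.
replaced = df.replace(dictionary)
```

Values that do not appear as keys in the dictionary are left unchanged.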
I am trying to format a pandas DataFrame value representation.
Basically, all I want is to get the "Thousand" separator on my values.
I managed to do it using the df.style.format function. It does the job, but it also "breaks" the original design of my table.
Here is an example of what is going on:
Is there anything I can do to avoid doing it? I want to keep the original table format, only changing the format of the value.
PS: Don't know if it makes any difference, but I am using Google Colab.
In case anyone is having the same problem as I was using Colab, I have found a solution:
.set_table_attributes('class="dataframe"') seems to solve the problem
More information can be found here: https://github.com/googlecolab/colabtools/issues/1687
For this case you could do:
pdf.assign(a=pdf['a'].map("{:,.0f}".format))
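A short sketch of what that line does, using made-up numbers (the values are assumptions; pdf and the column a follow the snippet above):

```python
import pandas as pd

# Hypothetical data; "pdf" and column "a" mirror the names used above.
pdf = pd.DataFrame({"a": [1234567.0, 9876.4, 12.0]})

# assign() returns a new frame; the original numeric column is untouched.
formatted = pdf.assign(a=pdf["a"].map("{:,.0f}".format))

print(formatted)
```

The formatted column holds strings, so this is best done only on a copy meant for display, keeping the numeric frame for calculations.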
I am fairly new to Python. I wrote a script and I am surprised by how long it takes to go through one particular loop compared to the rest of my code.
Can someone tell me what is inefficient in the code I wrote, and maybe how to improve the speed?
Here is the loop in question (BT_Histos and Histos_Last_Rebal are dataframes with dates as the index and columns of floats. Portfolio and Portfolio_Last_Rebal are dataframes with the same index as the previous two, which I am filling through the loop. weights is just a list):
Udl_Perf=BT_Histos/Histos_Last_Rebal-1
for i in range(1,len(BT_Histos.index)):
    """tricky because isin doesn't work with timestamp"""
    test_date=pd.Series(Portfolio.index[i-1])
    if test_date.isin(Rebalancing_Dates)[0]:
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i],'PortSeries']=Portfolio.loc[Portfolio.index[i-1],'PortSeries']
    else:
        Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i],'PortSeries']=Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i-1],'PortSeries']
    Portfolio.loc[Portfolio.index[i],'PortSeries']=Portfolio_Last_Rebal.loc[Portfolio_Last_Rebal.index[i],'PortSeries']*(1+sum(Udl_Perf.iloc[i]*weights))
Thanks!
If you really want it to be fast, first try implementing it as a while loop.
Second, declare the type of the length variable in advance using the Mypy library; this requires Python 3.5+ to be installed.
Also, if every iteration is independent of the others, you can use multithreading with the threading library. See the example in this git repo.
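For comparison, here is a different approach from the one suggested above: since each step depends on the previous one, the loop itself stays, but the per-row .loc lookups and the per-row isin call can be moved out of it. This is only a sketch under stated assumptions: the function name run_portfolio, the start_value parameter, and the assumption that both series start at the same value are mine, not the asker's.

```python
import numpy as np
import pandas as pd

def run_portfolio(udl_perf, weights, rebalancing_dates, start_value=100.0):
    """Hypothetical helper: same recursion as the loop in the question,
    but working on plain NumPy arrays instead of .loc lookups."""
    dates = udl_perf.index
    # One vectorized membership test instead of one isin() call per row.
    is_rebal = dates.isin(rebalancing_dates)
    # Weighted performance of every row, computed in a single matrix product.
    basket_perf = udl_perf.to_numpy() @ np.asarray(weights, dtype=float)

    n = len(dates)
    port = np.empty(n)
    last = np.empty(n)
    # Assumption: both series start at start_value on the first date.
    port[0] = last[0] = start_value
    for i in range(1, n):
        # Reset the reference level if the previous date was a rebalancing date.
        last[i] = port[i - 1] if is_rebal[i - 1] else last[i - 1]
        port[i] = last[i] * (1.0 + basket_perf[i])

    return pd.DataFrame({"Portfolio": port, "Last_Rebal": last}, index=dates)
```

Called as run_portfolio(Udl_Perf, weights, Rebalancing_Dates), it returns both series in one frame; most of the expected saving comes from avoiding label-based assignment inside the loop.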
Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row by row basis, and the sheer number of rows makes this process take a couple of hours. Hence, I save the dataframe, which can also take a bit of time: 30 seconds at least.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am wasting a lot of time waiting for the save to finish at the termination of the script needlessly.
Is there a DataFrame attribute I can test to see if the DataFrame has been modified? Essentially, if this is False I can avoid saving at the end of the script, but if it is True then a save is necessary. This simple one-line if will save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about how it works in the pandas documentation. It basically does what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do it since it is implemented by pandas itself.
Note that you need to use .copy() when assigning old_df, before making changes to your current df. Otherwise old_df is just another reference to the same DataFrame, not an independent copy, and the two will always compare equal.
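Put together, the check could look like the sketch below. The data and the needs_save flag are made up for illustration:

```python
import pandas as pd

# Hypothetical data standing in for the large frame read from disk.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Snapshot taken right after loading; .copy() makes it independent of df.
old_df = df.copy()

# Nothing has changed yet, so no save is needed.
needs_save_before = not df.equals(old_df)

# Adding a column makes the two frames differ.
df["y"] = df["x"] * 2
needs_save_after = not df.equals(old_df)

if needs_save_after:
    pass  # the (slow) save of df would go here
```

The snapshot doubles the memory used by the frame, which is the price of this approach for a large dataset.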
I need to do an apply on a dataframe using inputs from multiple rows. As a simple example, I can do the following if all the inputs are from a single row:
df['c'] = df[['a','b']].apply(lambda x: awesome stuff, axis=1)
# or
df['d'] = df[['b','c']].shift(1).apply(...) # to get the values from the previous row
However, if I need 'a' from the current row, and 'b' from the previous row, is there a way to do that with apply? I could add a new 'bshift' column and then just use df[['a','bshift']] but it seems there must be a more direct way.
Related but separate, when accessing a specific value in the df, is there a way to combine labeled indexing with integer-offset? E.g. I know the label of the current row but need the row before. Something like df.at['labelIknow'-1, 'a'] (which of course doesn't work). This is for when I'm forced to iterate through rows. Thanks in advance.
Edit: Some info on what I'm doing etc. I have a pandas store containing tables of OHLC bars (one table per security). When doing backtesting, currently I pull the full date range I need for a security into memory, and then resample it into a frequency that makes sense for the test at hand. Then I do some vectorized operations for things like trade entry signals etc. Finally I loop over the data from start to finish doing the actual backtest, e.g. checking for trade entry exit, drawdown etc - this looping part is the part I'm trying to speed up.
This should directly answer your question and let you use apply, although I'm not sure it's ultimately any better than a two-line solution. It does avoid creating extra variables at least.
df['c'] = pd.concat([ df['a'], df['a'].shift() ], axis=1).apply(np.mean,axis=1)
That will put the mean of 'a' values from the current and previous rows into 'c', for example.
This isn't as general, but for simpler cases you can do something like this (continuing the mean example):
df['c'] = ( df['a'] + df['a'].shift() ) / 2
That is about 10x faster than the concat() method on my tiny example dataset. I imagine that's as fast as you could do it, if you can code it in that style.
You could also look into reshaping the data with stack() and hierarchical indexing. That would be a way to get all your variables into the same row but I think it will likely be more complicated than the concat method or just creating intermediate variables via shift().
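Applied to the case in the question ('a' from the current row, 'b' from the previous row), both styles could look like this sketch. The data, the b_prev name, and the subtraction are placeholders for whatever the real computation is:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# apply() version: pair 'a' with the shifted 'b' without adding a column to df.
pairs = pd.concat([df["a"], df["b"].shift().rename("b_prev")], axis=1)
via_apply = pairs.apply(lambda row: row["a"] - row["b_prev"], axis=1)

# Vectorized version of the same computation.
vectorized = df["a"] - df["b"].shift()
```

In both versions the first row is NaN, because there is no previous row to take 'b' from.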
For the first part, I don't think such a thing is possible. If you update on what you actually want to achieve, I can update this answer.
Also, looking at the second part, your data structure seems to rely an awful lot on the order of rows. This is typically not how you want to manage your databases. Again, if you tell us what your overall goal is, we may be able to point you towards a solution (and potentially a better way to structure the database).
Anyhow, one way to get the row before, if you know a given index label, is to do:
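df.loc[:'labelYouKnow'].iloc[-2]
(Older pandas versions used df.ix here; .ix has since been removed from pandas, and .loc does the same label-based slice.) Note that this is not the optimal thing to do efficiency-wise, so you may want to improve your db structure in order to prevent the need for doing such things.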
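Another way to combine a known label with an integer offset is to translate the label into a position first. A minimal sketch, assuming a unique index and a label that is not the first row; the data and labels are made up:

```python
import pandas as pd

# Hypothetical frame with a unique label index.
df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])

# Translate the known label into its integer position...
pos = df.index.get_loc("y")

# ...then step back one position to reach the previous row.
prev_value = df["a"].iloc[pos - 1]
```

With a non-unique index, get_loc can return a slice or a boolean mask instead of an integer, so this sketch would need adjusting in that case.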