I have a dataframe ("ndate") with dates as rows and one column per stock, holding the $ amount invested in that stock on that date. I also have a Series ("portT") containing the total investment across all stocks for each date (its length equals the number of rows of ndate). Here is the code that calculates the weight of each stock on each date by dividing each element of a row of ndate by that day's total:
(l, w) = port1.shape
for i in range(0, l):
    port1.iloc[i] = np.divide(ndate.iloc[i], portT.iloc[i])
The code runs very slowly. Could you please let me know how I can modify it to speed it up? I tried to vectorise it, but did not succeed.
As this is just a simple division of two dataframes of the same shape (or you can formulate it as such), you can use the plain /-operator; pandas will execute it element-wise (possibly with broadcasting if the shapes don't match, so be sure about that):
import pandas as pd
df1 = pd.DataFrame([[1,2], [3,4]])
df2 = pd.DataFrame([[2,2], [3,3]])
df_new = df1 / df2
#>>> pd.DataFrame([[0.5, 1.],[1., 1.333333]])
This most likely performs the same operations internally that you spelled out in your loop, but it dispatches to vectorized NumPy code, so the per-row indexing, checks and assignments are bypassed, which should give you a noticeable speedup.
EDIT:
I was mistaken about the outline of your problem; maybe include a minimal self-contained code example next time. Still, the /-operator also works for a DataFrame and a Series in combination:
import pandas as pd
df = pd.DataFrame([[1,2], [3,4]])
s = pd.Series([1,2])
new_df = df / s
#>>> pd.DataFrame([[1., 1.],[3., 2.]])
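Note, however, that df / s aligns the Series with the DataFrame's columns. In the question the divisor is one total per row (per date), so you would divide along the index with DataFrame.div(..., axis=0). A minimal sketch reusing the names ndate, portT and port1 from the question, with made-up data:
import pandas as pd
# Toy stand-ins: one column per stock, one row per date
ndate = pd.DataFrame({'AAA': [10.0, 20.0], 'BBB': [30.0, 20.0]},
                     index=pd.to_datetime(['2017-05-01', '2017-05-02']))
portT = ndate.sum(axis=1)           # total investment per date
# Divide every row by that row's total; axis=0 aligns portT with the row index
port1 = ndate.div(portT, axis=0)
#>>>             AAA   BBB
#>>> 2017-05-01  0.25  0.75
#>>> 2017-05-02  0.50  0.50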
I have two dataframes (a and b) whose columns have different lengths. I want to subtract the VIEWS columns from each other where the URL fields match.
This is my code, which gives me completely wrong results: almost exclusively NaN values and floats, neither of which makes sense to me. Is there a better solution for this, or an obvious mistake in my code?
a = a.loc[:, ['VIEWS', 'URL']]
b = b.loc[:, ['VIEWS', 'URL']]
df = pd.concat([a,b], ignore_index=True)
df['VIEWS'] = pd.to_numeric(df['VIEWS'], errors='coerce').fillna(0).astype(int)
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].diff().abs()
Great question!
Let's start with a possible solution.
I assume you want the difference between the two totals per URL. Taking your cleaning as the basis, here's a small, (hopefully) complete example, which uses .sum() and multiplies the views from b by -1 prior to grouping:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[
    [100, 'x.me'], [200, 'y.me'], [50, 'x.me'], [np.nan, 'y.me']
], columns=['VIEWS', 'URL'])
b = pd.DataFrame(data=[
    [90, 'x.me'], [200, 'z.me'],
], columns=['VIEWS', 'URL'])

# Clean the VIEWS columns of both frames
for x in [a, b]:
    x['VIEWS'] = pd.to_numeric(x['VIEWS'], errors='coerce').fillna(0).astype(int)

# Sum VIEWS per URL in each frame, flipping the sign for b (cnt == 1), ...
df = pd.concat(
    [x.groupby(['URL'])['VIEWS']
      .apply(lambda y: y.sum() * (1 - 2 * cnt))
      .reset_index(drop=False)
     for (cnt, x) in enumerate([a, b])],
    ignore_index=True)
# ... then sum per URL across both frames and take the absolute difference
df = df.groupby(['URL'])['VIEWS'].sum().abs().reset_index()
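For the small example above this should leave df looking roughly like this (x.me: 150 - 90; y.me occurs only in a; z.me only in b):
    URL  VIEWS
0  x.me     60
1  y.me    200
2  z.me    200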
A few words on why your approach is currently not working:
diff(): there is a diff() function for the SeriesGroupBy class, but it takes the difference between a row and the previous row within the same group. Check this out for a valid usage of diff() in this context: Pandas groupby multiple fields then diff
NaNs appear in your last operation because you are trying to assign a Series indexed by the URLs onto a column whose index is completely different, so nothing aligns.
So if anything, an operation such as the following could work:
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].sum().reset_index(drop=True)
although this still assumes that df does not change in size and that the indices on the left-hand side match the ones after the reset on the right-hand side.
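Here is a tiny illustration of that alignment point (toy data, not yours): the group sums are indexed by URL, so assigning them onto the integer-indexed column aligns on nothing and produces only NaNs.
import pandas as pd
df = pd.DataFrame({'URL': ['x.me', 'y.me', 'x.me'], 'VIEWS': [1, 2, 3]})
sums = df.groupby('URL')['VIEWS'].sum()   # index: 'x.me', 'y.me'
df['SUM'] = sums                          # aligns on the row index 0, 1, 2
#>>> df['SUM'] is NaN for every row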
I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns holds a list of lists, each with a sequence of 7 integers. Example:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns=['A','B','C','D','E','F','G'],
                   data=np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns=['Name','Location','Sequence'],
                   data=[['Alfred','Chicago',
                          np.random.randint(1,5,(100,7))],
                         ['Nicola','New York',
                          np.random.randint(1,5,(100,7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c/len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine used in pandas is Cython, but you can also change the engine to numba or use the njit decorator to speed things up. Look up the "Enhancing performance" section of the pandas docs.
Numba converts Python code to optimized machine code; pandas is tightly integrated with NumPy, and hence works well with numba too. You can experiment with the parallel, nogil, cache and fastmath options for additional speedup. This method shines for huge inputs where speed is needed.
With numba you can either do eager compilation, or let the first execution take a little extra time for compilation so that subsequent calls are fast.
import pandas as pd
import numpy as np
import numba as nb

df1 = pd.DataFrame(columns=['A','B','C','D','E','F','G'],
                   data=np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns=['Name','Location','Sequence'],
                   data=[['Alfred','Chicago',
                          np.random.randint(1,5,(100,7))],
                         ['Nicola','New York',
                          np.random.randint(1,5,(100,7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just illustration, not correct logic. Change the logic according to needs
# @nb.njit((nb.int64[:, :],))
# def f(x):
#     sum = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             sum += (x[i] == a[j]).sum()
#     return sum
# Experiment with engine
print(df2['Sequence'].apply(f))
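Since eager compilation is mentioned above, here is a minimal sketch of what that could look like for this kind of comparison. It assumes df1's values are int64 (adjust the signature otherwise) and uses an explicit loop so it stays within plain numba-supported operations:
import numba as nb
import numpy as np
a = np.random.randint(1, 5, (100, 7))   # stand-in for df1.values
# An explicit signature makes numba compile at decoration time (eager),
# so the first call no longer pays the compilation cost.
@nb.njit(nb.float64(nb.int64[:, :]))
def overlap(x):
    hits = 0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] == a[i, j]:
                hits += 1
    return hits / x.size
print(overlap(np.random.randint(1, 5, (100, 7))))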
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
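If every Sequence array has the same shape as df1 (as in the example above), you could also drop apply entirely and let NumPy broadcast the comparison over all rows at once; a minimal sketch under that assumption:
import numpy as np
# Stack the Sequence arrays into one (len(df2), 100, 7) block, compare it
# against df1's (100, 7) values in a single broadcast, then average per row
seqs = np.stack(df2['Sequence'].tolist())
df2['Overlap'] = (seqs == df1.values).mean(axis=(1, 2))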
I am sorting every column of a very large pandas dataframe using a for loop. However, this process takes a very long time because the dataframe has more than 1 million columns. I want this process to run much faster than it is running right now.
This is the code I have at the moment:
top25s = []
for i in range(1, len(mylist)):
    topchoices = df.sort_values(i, ascending=False).iloc[0:25, 0].values
    top25s.append(topchoices)
Here len(mylist) is 14256 but can easily go up to more than 1000000 in the future. df has a dimension of 343 rows × 14256 columns.
Thanks for all of your inputs!
You can use nlargest:
df.apply(lambda x: x.nlargest(25).reset_index(drop=True))
But I doubt this will gain you much time honestly. As commented, you just have a lot of data to go through.
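For illustration, the same pattern on a tiny frame, taking the top 2 per column (made-up data):
import pandas as pd
df = pd.DataFrame({'a': [5, 1, 3, 4], 'b': [2, 8, 6, 7]})
print(df.apply(lambda x: x.nlargest(2).reset_index(drop=True)))
#>>>    a  b
#>>> 0  5  8
#>>> 1  4  7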
I'd propose using a bit of help from numpy, which should speed things up significantly.
The following code will return a 2D numpy array with the top25 elements in each column.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(50,100)) # Generate random data

# Rank each value within its column (1 = largest)
rank = df.rank(axis=0, ascending=False)
# Extract column by column so each output column holds that column's top 25
top25s = np.extract((rank <= 25).T.values, df.T.values).reshape(100, 25).T
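A quick sanity check of the per-column property (illustrative only; it assumes no exact ties around the cut-off):
import numpy as np
# Column 0 of the result should hold exactly the 25 largest values of df[0]
print(np.allclose(np.sort(top25s[:, 0]), np.sort(df[0].values)[-25:]))
#>>> True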
I have a Pandas DataFrame of subscriptions, each with a start datetime (timestamp) and an optional end datetime (if they were canceled).
For simplicity, I have created string columns for the date (e.g. "20170901") based on start and end datetimes (timestamps). It looks like this:
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None), ...], columns=["sd", "ed"])
The end result should be a time series of how many subscriptions were active on any given date in a range.
To that end, I created an Index for all the days within a range:
days = df.groupby(["sd"])["sd"].count()
I am able to create what I am interested in with a loop, each iteration executing a query over the entire DataFrame df.
count_by_day = pd.DataFrame([
    len(df.loc[(df.sd <= i) & (df.ed.isnull() | (df.ed > i))])
    for i in days.index], index=days.index)
Note that I have values for each day in the original dataset, so there are no gaps. I'm sure getting the date range can be improved.
The actual question is: is there an efficient way to compute this for a large initial dataset df, with multiple thousands of rows? It seems the method I used is quadratic in complexity. I've also tried df.query() but it's 66% slower than the Pythonic filter and does not change the complexity.
I tried to search the Pandas docs for examples but I seem to be using the wrong keywords. Any ideas?
It's an interesting problem; here's how I would do it. Not sure about performance.
EDIT: My first answer was incorrect; I hadn't read the question fully.
import pandas as pd

# Initial data; convert the string columns to Timestamps
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'), ('20170901', None)], columns=["sd", "ed"])
df['sd'] = pd.DatetimeIndex(df.sd)
df['ed'] = pd.DatetimeIndex(df.ed)

# Range input and related index
beg = pd.Timestamp('2017-05-15')
end = pd.Timestamp('2017-09-15')
idx = pd.date_range(start=beg, end=end, freq='D')

# Keep only the records that overlap the range, then clip the
# subscription start/end dates to the range bounds
fdf = df[(df.sd <= end) & ((df.ed >= beg) | (pd.isnull(df.ed)))].copy()
fdf['ed'] = fdf['ed'].fillna(end)
fdf['ps'] = fdf.sd.apply(lambda x: max(x, beg))
fdf['pe'] = fdf.ed.apply(lambda x: min(x, end))

# Run a conditional count for every day in the range
idx.to_series().apply(lambda x: len(fdf[(fdf.ps <= x) & (fdf.pe >= x)]))
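For the three example records the resulting series should be 2 for every day from 2017-05-15 through 2017-06-13, 1 from 2017-06-14 through 2017-08-31, and 2 again from 2017-09-01 through 2017-09-15.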
Ok, I'm answering my own question after quite a bit of research, fiddling and trying things out. I may still be missing an obvious solution but maybe it helps.
The fastest solution I could find to date is (thanks Alex for some nice code patterns):
# Start with test data from question
df = pd.DataFrame([('20170511', None), ('20170514', '20170613'),
('20170901', None), ...], columns=['sd', 'ed'])
# Convert to datetime columns
df['sd'] = pd.DatetimeIndex(df['sd'])
df['ed'] = pd.DatetimeIndex(df['ed'])
df.ed.fillna(df.sd.max(), inplace=True)
# Note: In my real data I have timestamps - I convert them like this:
#df['sd'] = pd.to_datetime(df['start_date'], unit='s').apply(lambda x: x.date())
# Set and sort multi-index to enable slices
df = df.set_index(['sd', 'ed'], drop=False)
df.sort_index(inplace=True)
# Compute the active counts by day in range
di = pd.date_range(start=df.sd.min(), end=df.sd.max(), freq='D')
count_by_day = di.to_series().apply(lambda i: len(df.loc[
(slice(None, i.date()), slice(i.date(), None)), :]))
In my real dataset (with >10K rows for df and a date range of about a year), this was twice as fast as the code in the question, about 1.5s.
Here are some lessons I learned:
Creating a Series with counters for the date range and iterating through the dataset df with df.apply or df.itertuples while incrementing counters was much slower. Curiously, apply was slower than itertuples. Don't even think of iterrows.
My dataset had a product_id in each row, so filtering the dataset per product and running the calculation on each filtered result was twice as fast as adding the product_id to the multi-index and slicing on that level as well.
Building an intermediate Series of active days (from iterating through each row in df and adding each date in the active range to the Series) and then grouping by date was much slower.
Running the code in the question on a df with a multi-index did not change the performance.
Running the code in the question on a df with a limited set of columns (my real dataset has 22 columns) did not change the performance.
I looked at pd.crosstab and pd.Period but I was not able to get anything to work.
Pandas is pretty awesome, and trying to outsmart it is really hard (especially with non-vectorized Python code).
I have the following code and would like to create a new column per Transaction Number and Description that represents the 99th percentile of each row.
I am really struggling to achieve this - it seems that most posts cover calculating the percentile on the column.
Is there a way to achieve this? I would expect a new column to be created with two rows.
df_baseScenario = pd.DataFrame({
    'Transaction Number': [1, 10],
    'Description': ['asf', 'def'],
    'Calc_PV_CF_2479.0': [4418494.085, -3706270.679],
    'Calc_PV_CF_2480.0': [4415476.321, -3688327.494],
    'Calc_PV_CF_2481.0': [4421698.198, -3712887.034],
    'Calc_PV_CF_2482.0': [4420541.944, -3706402.147],
    'Calc_PV_CF_2483.0': [4396063.863, -3717554.946],
    'Calc_PV_CF_2484.0': [4397897.082, -3695272.043],
    'Calc_PV_CF_2485.0': [4394773.762, -3724893.702],
    'Calc_PV_CF_2486.0': [4384868.476, -3741759.048],
    'Calc_PV_CF_2487.0': [4379614.337, -3717010.873],
    'Calc_PV_CF_2488.0': [4389307.584, -3754514.639],
    'Calc_PV_CF_2489.0': [4400699.929, -3741759.048],
    'Calc_PV_CF_2490.0': [4379651.262, -3714723.435]})
The following should work:
df['99th_percentile'] = df[cols].apply(lambda x: numpy.percentile(x, 99), axis=1)
I'm assuming here that the variable 'cols' contains a list of the columns you want to include in the percentile (you obviously can't use Description in your calculation, for example).
What this code does is loop over the rows of the dataframe and, for each row, compute numpy.percentile to get the 99th percentile. You'll need to import numpy.
If you need maximum speed, you can use numpy.vectorize with a '(n)->()' signature so that the percentile is computed per row rather than per element, removing the explicit pandas apply at the expense of readability (untested):
perc99 = numpy.vectorize(lambda x: numpy.percentile(x, 99), signature='(n)->()')
df['99th_percentile'] = perc99(df[cols].values)
Slightly modified from @mxbi:
import numpy as np
df = df_baseScenario.drop(['Transaction Number','Description'], axis=1)
df_baseScenario['99th_percentile'] = df.apply(lambda x: np.percentile(x, 99), axis=1)