I have a dataframe which has columns with different length. I want to subtract columns VIEWS from each other if the fields URL match.
This is my code which gives me completely false results and almost exclusively NAN values and floats which both doesn´t make sense to me. Is there a better solution for this or an obvious mistake in my code?
a = a.loc[:, ['VIEWS', 'URL']]
b = b.loc[:, ['VIEWS', 'URL']]
df = pd.concat([a,b], ignore_index=True)
df['VIEWS'] = pd.to_numeric(df['VIEWS'], errors='coerce').fillna(0).astype(int)
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].diff().abs()
Great question!
Let's start with a possible solution
I assume you want to deduct the total of the first from the total of the second per group. Taking your cleaning as the basis, here's a small, (hopefully) complete example, which uses .sum() and multiplies the views from b by -1 prior to grouping:
import pandas as pd
import numpy as np
a = pd.DataFrame(data = [
[100, 'x.me'], [200, 'y.me'], [50, 'x.me'], [np.nan, 'y.me']
], columns=['VIEWS', 'URL'])
b = pd.DataFrame(data = [
[90, 'x.me'], [200, 'z.me'],
], columns=['VIEWS', 'URL'])
for x in [a, b]:
x['VIEWS'] = pd.to_numeric(x['VIEWS'], errors='coerce').fillna(0).astype(int)
df = pd.concat([x.groupby(['URL'])['VIEWS'].apply(lambda y: y.sum() * (1 - 2 * cnt)).reset_index(drop = False) for (cnt, x) in enumerate([a, b])], ignore_index=True)
df = df.groupby(['URL'])['VIEWS'].sum().abs().reset_index()
A few words on why your approach is currently not working
diff() There is a diff function for the SeriesGroupBy class. It takes the difference of some row to the previous row in the group. Check this out for a valid usage of diff() in this context: Pandas groupby multiple fields then diff
nan's appear in your last operation since you're trying to set a series object with the indices being the urls onto a series with completely different indices.
So if anything, an operation such as the following could work
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].sum().reset_index(drop=True)
although this still assumes, that df does not change in size and that the indices on the left side accord the ones after the reset on the right side.
Related
I basically have a dataframe (df1) with columns 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import pandas as pd
df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
data = [['Alfred','Chicago',
np.random.randint(1,5,(100,7))],
['Nicola','New York',
np.random.randint(1,5,(100,7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
df2.at[i, 'Overlap'] = c/len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default engine used in pandas cython, but you can also change engine to numba or use njit decorator to speed up. Look up enhancingperf.
Numba converts python code to optimized machine codee, pandas is highly integrated with numpy and hence numba also. You can experiment with parallel, nogil, cache, fastmath option for speedup. This method shines for huge inputs where speed is needed.
Numba you can do eager compilation or first time execution take little time for compilation and subsequent usage will be fast
import pandas as pd
df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
data = [['Alfred','Chicago',
np.random.randint(1,5,(100,7))],
['Nicola','New York',
np.random.randint(1,5,(100,7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just illustration, not correct logic. Change the logic according to needs
# nb.njit((nb.int64,))
# def f(x):
# sum = 0
# for i in nb.prange(x.shape[0]):
# for j in range(a.shape[0]):
# sum += (x[i] == a[j]).sum()
# return sum
# Experiment with engine
print(df2['Sequence'].apply(f))
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
I have two dataframes,
a = {'SEX':[...], 'ENT':[...], 'XY':[...], 'RZD':[...], 'TOT':[...]} with shape 769, 5
and
b = {'K':[...], 'NOM':[...], 'M':[...], SEX':[...], 'ENT':[...], 'POB':[...], 'RZD':[...], '%A':[...], '%B':[...]} with shape 34398, 9.
I need to merge these dataframes based on 'SEX', 'ENT', 'RZD'. Once merged, I fill with zeros wherever values do not match. Finally, I calculate a new column FINAL that equals a['%A'] * b['TOT'] like the code below:
local = b.merge(a, on=['ENT', 'RZD', 'SEX'], how='left')
local.fillna(0, inplace=True)
local['TOT'] = local['%A'].mul(local['TOT']).round(0)
The problem I am encountering is that
x1 = a['TOT'].sum()
should be equal to
x2 = local['TOT'].sum()
However, I am getting differences of nearly 6 million. This means that x2 >> x1
Do you recommend any way of merging these dataframes and keep consistency?
You can find the raw files here.
Try df.join, and see if the how argument helps you out. The how argument will alert you to where the match was made, and instead of merging by conditions you might be able to do a vector operation over the column using
if df["column"] == "left_only":
df["column"] = df["column"].str.replace("left_only", 0)
# repeat
I have a dataframe containing dates as rows and columns as $investment in each stock on a particular day ("ndate"). Also, I have a Series ("portT") containing the sum of the total investments in all stocks each date (series size: len(ndate)*1). Here is the code that calculates the weight of each stock/each date by dividing each element of each row of ndate by sum of that day:
(l,w)=port1.shape
for i in range(0,l):
port1.iloc[i]=np.divide(ndate.iloc[i],portT.iloc[i])
The code works very slowly, could you please let me know how I can modify and speed it up? I tried to do this by vectorising, but did not succeed.
as this is justa simple divison of two dataframes of the same shape (or you can formulate it as such) you can use the simple /-operator, pandas will execute it element-wise (possibly with replication if shapes don't match, so be sure about that):
import pandas as pd
df1 = pd.DataFrame([[1,2], [3,4]])
df2 = pd.DataFrame([[2,2], [3,3]])
df_new = df1 / df2
#>>> pd.DataFrame([[0.5, 1.],[1., 1.3]])
this is most likely internally doing the same operations that you have specified in your example, however, internal assignments and checks are by-passed, which should give you some speed
EDIT:
I was mistaken on the outline of your problem; maybe include a minimal self-contained code example next time. Still the /-operator also works for Dataframes and Series in combination:
import pandas as pd
df = pd.DataFrame([[1,2], [3,4]])
s = pd.Series([1,2])
new_df = df / s
#>>> pd.DataFrame([[1., 3.],[1., 2]])
I would like to efficiently slice a DataFrame with a DatetimeIndex (similar to a resample or groupby operation), but the desired time slices are different lengths.
This is relatively easy to do by looping (see code below), but with large timeseries the multiple slices quickly becomes slow. Any suggestions on vectorising this/improving speed?
import pandas as pd, datetime as dt, numpy as np
#Example DataFrame with a DatetimeIndex
idx = pd.DatetimeIndex(start=dt.datetime(2017,1,1), end=dt.datetime(2017,1,31), freq='h')
df = pd.Series(index = idx, data = np.random.rand(len(idx)))
#The slicer dataframe contains a series of start and end windows
slicer_df = pd.DataFrame(index = [1,2])
slicer_df['start_window'] = [dt.datetime(2017,1,2,2), dt.datetime(2017,1,6,12)]
slicer_df['end_window'] = [dt.datetime(2017,1,6,12), dt.datetime(2017,1,15,2)]
#The results should be stored to a dataframe, indexed by the index of the slicer dataframe
#This is the loop that I would like to vectorise
slice_results = pd.DataFrame()
slice_results['total'] = None
for index, row in slicer_df.iterrows():
slice_results.loc[index,'total'] = df[(df.index >= row.start_window) &
(df.index <= row.end_window)].sum()
NB. I've just realised that my particular data set has adjacent windows (ie. the start of one window corresponds to the end of the one before it), but the windows are of different lengths. It feels like there should be a way to perform a groupby or similar with only one pass over df...
You can do this as an apply, which will concat the results rather than iteratively update the DataFrame:
In [11]: slicer_df.apply((lambda row: \
df[(df.index >= row.start_window)
& (df.index <= row.end_window)].sum()), axis=1)
Out[11]:
1 36.381155
2 111.521803
dtype: float64
You can vectorize this with searchsorted (assuming the datetime index is sorted, otherwise first sort):
In [11]: inds = np.searchsorted(df.index.values, slicer_df.values)
In [12]: s = df.cumsum() # only sum once!
In [13]: pd.Series([s[end] - s[start-1] if start else s[end] for start, end in inds], slicer_df.index)
Out[13]:
1 36.381155
2 111.521803
dtype: float64
There's still a loop in there, but it's now a lot cheaper!
That leads us to a completely vectorized solution (it's a little more cryptic):
In [21]: inds2 = np.maximum(1, inds) # see note
In [22]: inds2[:, 0] -= 1
In [23]: inds2
Out[23]:
array([[ 23, 96],
[119, 336]])
In [24]: x = s[inds2]
In [25]: x
Out[25]:
array([[ 11.4596498 , 47.84080472],
[ 55.94941276, 167.47121538]])
In [26]: x[:, 1] - x[:, 0]
Out[26]: array([ 36.38115493, 111.52180263])
Note: the when the start date is before the first date we want to avoid the start index rolling back from 0 to -1 (which would mean the end of the array i.e. underflow).
I have come up with a vectorised method which relies on the varying length "windows" being always adjacent to one another, ie. that the start of a window is the same as the end of the window before it.
# Ensure that the join will be successful by rounding to a specific frequency
round_freq = '1h'
df.index = df.index.round(round_freq)
slicer_df.start_window= slicer_df.start_window.dt.round(round_freq)
# Give the index of the slicer a useful name
slicer_df.index.name = 'event_number'
#Perform a join to the start of the window, forward fill to the next window, then groupby to get the totals for each time window
df = df.to_frame('orig_data').join(slicer_df.reset_index().set_index('start_window')[['event_number']])
df.event_number = df.event_number.ffill()
df.groupby('event_number').sum()
Of course this only works when the windows are adjacent, ie. they can't overlap or have any gaps. If anyone has a more general method that works for the above, I'd love to see it!
I have two large dataframes I want to compare. I want a comparison result capable of a column and / or row wise comparison of similarities by percent. This part is simple. However, I want to be able to make the comparison ignore differences based upon value criteria. A small example is below.
d1 = {'Sample':pd.Series([101,102,103]),
'Col1':pd.Series(['AA','--','BB']),
'Col2':pd.Series(['AB','AA','BB'])}
d2 = {'Sample':pd.Series([101,102,103]),
'Col1':pd.Series(['BB','AB','--']),
'Col2':pd.Series(['AB','AA','AB'])}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1 = df1.set_index('Sample')
df2 = df2.set_index('Sample')
comparison = df1.eq(df2)
# for column stats
comparison.sum(axis=0) / float(len(df1.index))
# for row stats
comparison.sum(axis=1) / float(len(df1.columns))
My problem is that for when value1='AA' and value2 = '--' I want them to be viewed as equal (so when one is '--' basically always be true) but, otherwise perform a normal Boolean comparison. I need an efficient way to do this that doesn't include excessive looping as the datasets are quite large.
Below, I'm interpreting "when one is '--' basically always be true" to mean that any comparison against '--' (no matter what the other value is) should return True. In that case, you could use
mask = (df1=='--') | (df2=='--')
to find every location where either df1 or df2 is equal to '--' and then use
comparison |= mask
to update comparison. For example,
import itertools as IT
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 10000
df1, df2 = [pd.DataFrame(
np.random.choice(map(''.join, IT.product(list('ABC'), repeat=2))+['--'],
size=(N, 2)),
columns=['Col1', 'Col2']) for i in range(2)]
comparison = df1.eq(df2)
mask = (df1=='--') | (df2=='--')
comparison |= mask
# for column stats
column_stats = comparison.sum(axis=0) / float(len(df1.index))
# for row stats
row_stats = comparison.sum(axis=1) / float(len(df1.columns))
I think loop comprehension should be quite fast:
new_columns = []
for col in df1.columns:
new_columns.append([True if (x==y or x=='--' or y=='--') else False for x,y in zip(df1[col],df2[col])])
results = pd.DataFrame(new_columns).T
results.index = df1.index
This outputs the full true/false df.