Efficiently taking time slices of variable length in a dataframe - python

I would like to efficiently slice a DataFrame with a DatetimeIndex (similar to a resample or groupby operation), but the desired time slices are different lengths.
This is relatively easy to do by looping (see code below), but with large timeseries the repeated slicing quickly becomes slow. Any suggestions on vectorising this or otherwise improving speed?
import pandas as pd, datetime as dt, numpy as np
#Example DataFrame with a DatetimeIndex
idx = pd.date_range(start=dt.datetime(2017,1,1), end=dt.datetime(2017,1,31), freq='h')
df = pd.Series(index = idx, data = np.random.rand(len(idx)))
#The slicer dataframe contains a series of start and end windows
slicer_df = pd.DataFrame(index = [1,2])
slicer_df['start_window'] = [dt.datetime(2017,1,2,2), dt.datetime(2017,1,6,12)]
slicer_df['end_window'] = [dt.datetime(2017,1,6,12), dt.datetime(2017,1,15,2)]
#The results should be stored to a dataframe, indexed by the index of the slicer dataframe
#This is the loop that I would like to vectorise
slice_results = pd.DataFrame()
slice_results['total'] = None
for index, row in slicer_df.iterrows():
    slice_results.loc[index,'total'] = df[(df.index >= row.start_window) &
                                          (df.index <= row.end_window)].sum()
NB. I've just realised that my particular data set has adjacent windows (ie. the start of one window corresponds to the end of the one before it), but the windows are of different lengths. It feels like there should be a way to perform a groupby or similar with only one pass over df...

You can do this as an apply, which will concat the results rather than iteratively update the DataFrame:
In [11]: slicer_df.apply((lambda row: \
df[(df.index >= row.start_window)
& (df.index <= row.end_window)].sum()), axis=1)
Out[11]:
1 36.381155
2 111.521803
dtype: float64

You can vectorize this with searchsorted (assuming the datetime index is sorted, otherwise first sort):
In [11]: inds = np.searchsorted(df.index.values, slicer_df.values)
In [12]: s = df.cumsum() # only sum once!
In [13]: pd.Series([s.iloc[end] - s.iloc[start-1] if start else s.iloc[end] for start, end in inds], slicer_df.index)
Out[13]:
1 36.381155
2 111.521803
dtype: float64
There's still a loop in there, but it's now a lot cheaper!
That leads us to a completely vectorized solution (it's a little more cryptic):
In [21]: inds2 = np.maximum(1, inds) # see note
In [22]: inds2[:, 0] -= 1
In [23]: inds2
Out[23]:
array([[ 23, 96],
[119, 336]])
In [24]: x = s.values[inds2]
In [25]: x
Out[25]:
array([[ 11.4596498 , 47.84080472],
[ 55.94941276, 167.47121538]])
In [26]: x[:, 1] - x[:, 0]
Out[26]: array([ 36.38115493, 111.52180263])
Note: when the start date is before the first date in the index, we want to avoid the start index rolling back from 0 to -1 (which would mean the end of the array, i.e. underflow).
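The same idea can be packaged as a small, self-contained function. This is a minimal sketch rather than part of the original answer: the name window_sums and the arange test data are made up for illustration, and it assumes closed windows [start, end] as in the question's loop. Padding the cumulative sum with a leading zero removes the need for the special case at index 0.

```python
import numpy as np
import pandas as pd

def window_sums(series, starts, ends):
    """Sum `series` over the closed windows [start, end] using one cumulative sum."""
    series = series.sort_index()  # searchsorted requires a sorted index
    times = series.index.to_numpy()
    # csum[k] holds the sum of the first k values, so csum[0] == 0.
    csum = np.concatenate(([0.0], series.to_numpy().cumsum()))
    lo = np.searchsorted(times, pd.DatetimeIndex(starts).to_numpy(), side="left")
    hi = np.searchsorted(times, pd.DatetimeIndex(ends).to_numpy(), side="right")
    return csum[hi] - csum[lo]

# Illustrative data: hourly values 0, 1, 2, ... so the sums are easy to check.
idx = pd.date_range("2017-01-01", "2017-01-31", freq="h")
data = pd.Series(np.arange(len(idx), dtype=float), index=idx)
slicer_df = pd.DataFrame({
    "start_window": pd.to_datetime(["2017-01-02 02:00", "2017-01-06 12:00"]),
    "end_window": pd.to_datetime(["2017-01-06 12:00", "2017-01-15 02:00"]),
}, index=[1, 2])

totals = pd.Series(
    window_sums(data, slicer_df["start_window"], slicer_df["end_window"]),
    index=slicer_df.index, name="total",
)
```

Because both window ends are inclusive here, a timestamp that sits on a shared boundary is counted in both neighbouring windows, exactly as in the loop.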

I have come up with a vectorised method which relies on the varying length "windows" being always adjacent to one another, ie. that the start of a window is the same as the end of the window before it.
# Ensure that the join will be successful by rounding to a specific frequency
round_freq = '1h'
df.index = df.index.round(round_freq)
slicer_df.start_window = slicer_df.start_window.dt.round(round_freq)
# Give the index of the slicer a useful name
slicer_df.index.name = 'event_number'
#Perform a join to the start of the window, forward fill to the next window, then groupby to get the totals for each time window
df = df.to_frame('orig_data').join(slicer_df.reset_index().set_index('start_window')[['event_number']])
df.event_number = df.event_number.ffill()
df.groupby('event_number').sum()
Of course this only works when the windows are adjacent, ie. they can't overlap or have any gaps. If anyone has a more general method that works for the above, I'd love to see it!
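For adjacent windows, one alternative to the join and forward fill is to give every timestamp a window number with np.searchsorted on the window edges and then group by that number. This is a sketch under assumptions: the edges, labels and arange data are illustrative, and windows are treated as half-open [start, end), so a timestamp on a shared boundary is counted only in the later window (the loop in the question counts it in both).

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-01-01", "2017-01-31", freq="h")
data = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# k adjacent windows are described by k + 1 sorted edges.
edges = pd.to_datetime(
    ["2017-01-02 02:00", "2017-01-06 12:00", "2017-01-15 02:00"]
).to_numpy()
labels = [1, 2]  # one label per window, e.g. the index of slicer_df

# Window number of every timestamp; -1 or len(labels) means "outside all windows".
window_id = np.searchsorted(edges, data.index.to_numpy(), side="right") - 1
inside = (window_id >= 0) & (window_id < len(labels))

# A single groupby pass over the data that falls inside some window.
totals = data[inside].groupby(window_id[inside]).sum()
totals = totals.reindex(range(len(labels)), fill_value=0.0)  # keep empty windows
totals.index = pd.Index(labels, name="event_number")
```

No rounding of the index is needed with this variant, since timestamps are compared against the edges rather than joined on them.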

Related

Subtract 2 dataframes with different column length if key matches

I have a dataframe which has columns with different length. I want to subtract columns VIEWS from each other if the fields URL match.
This is my code, which gives me completely wrong results: almost exclusively NaN values and floats, neither of which makes sense to me. Is there a better solution for this, or an obvious mistake in my code?
a = a.loc[:, ['VIEWS', 'URL']]
b = b.loc[:, ['VIEWS', 'URL']]
df = pd.concat([a,b], ignore_index=True)
df['VIEWS'] = pd.to_numeric(df['VIEWS'], errors='coerce').fillna(0).astype(int)
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].diff().abs()
Great question!
Let's start with a possible solution
I assume you want to deduct the total of the first from the total of the second per group. Taking your cleaning as the basis, here's a small, (hopefully) complete example, which uses .sum() and multiplies the views from b by -1 prior to grouping:
import pandas as pd
import numpy as np
a = pd.DataFrame(data = [
[100, 'x.me'], [200, 'y.me'], [50, 'x.me'], [np.nan, 'y.me']
], columns=['VIEWS', 'URL'])
b = pd.DataFrame(data = [
[90, 'x.me'], [200, 'z.me'],
], columns=['VIEWS', 'URL'])
for x in [a, b]:
    x['VIEWS'] = pd.to_numeric(x['VIEWS'], errors='coerce').fillna(0).astype(int)
df = pd.concat([x.groupby(['URL'])['VIEWS'].apply(lambda y: y.sum() * (1 - 2 * cnt)).reset_index(drop = False) for (cnt, x) in enumerate([a, b])], ignore_index=True)
df = df.groupby(['URL'])['VIEWS'].sum().abs().reset_index()
A few words on why your approach is currently not working
diff(): SeriesGroupBy has a diff method. It takes the difference between a row and the previous row in the same group, so the first row of every group has no predecessor and becomes NaN. Check this out for a valid usage of diff() in this context: Pandas groupby multiple fields then diff
NaNs also appear when you assign a Series indexed by the URLs (which is what an aggregation such as sum() returns) to a column with a completely different index.
So if anything, an operation such as the following could work
df['VIEWS'] = df.groupby(['URL'])['VIEWS'].sum().reset_index(drop=True)
although this still assumes that df does not change in size and that the indices on the left side match the ones after the reset on the right side.
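If the goal is the per-URL difference of total views, index alignment can do the matching directly. A minimal sketch, assuming the same sample frames as above; the helper name total_views is made up for illustration, and URLs missing from one frame are treated as having 0 views there:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"VIEWS": [100, 200, 50, np.nan],
                  "URL": ["x.me", "y.me", "x.me", "y.me"]})
b = pd.DataFrame({"VIEWS": [90, 200],
                  "URL": ["x.me", "z.me"]})

def total_views(frame):
    """Total views per URL, with unparsable or missing values counted as 0."""
    views = pd.to_numeric(frame["VIEWS"], errors="coerce").fillna(0)
    return views.groupby(frame["URL"]).sum()

# Series.sub aligns on the URL index; fill_value=0 covers URLs present in only one frame.
diff = total_views(a).sub(total_views(b), fill_value=0).abs()
result = diff.astype(int).rename("VIEWS").reset_index()
```

Here result has one row per URL found in either frame, so no assumption about matching row counts is needed.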

How to add multiple columns to a dataframe based on calculations

I have a csv dataset (with > 8m rows) that I load into a dataframe. The csv has columns like:
...,started_at,ended_at,...
2022-04-01 18:23:32,2022-04-01 22:18:15
2022-04-02 01:16:34,2022-04-02 02:18:32
...
I am able to load the dataset into my dataframe, but then I need to add multiple calculated columns to the dataframe for each row. In other words, unlike this SO question, I do not want the rows of the new columns to have the same initial value (col 1 all NAN, col 2, all "dogs", etc.).
Right now, I can add my columns by doing something like:
df['start_time'] = df.apply(lambda row: add_start_time(row['started_at']), axis = 1)
df['start_cat'] = df.apply(lambda row: add_start_cat(row['start_time']), axis = 1)
df['is_dark'] = df.apply(lambda row: add_is_dark(row['started_at']), axis = 1)
df['duration'] = df.apply(lambda row: calc_dur(row['started_at'], row['ended_at']), axis = 1)
But it seems inefficient since the entire dataset is processed N times (once for each call).
It seems that I should be able to calculate all of the new columns in a single go, but I am missing some conceptual approach.
Examples:
def calc_dur(started_at, ended_at):
    # started_at, ended_at are datetime64[ns]; converted at csv load
    diff = ended_at - started_at
    return diff.total_seconds() / 60
def add_start_time(started_at):
    # started_at is datetime64[ns]; converted at csv load
    return started_at.time()
def add_is_dark(started_at):
    # TZ is pytz.timezone('US/Central')
    # chi_town is the astral lookup for Chicago
    st = started_at.replace(tzinfo=TZ)
    chk = sun(chi_town.observer, date=st, tzinfo=chi_town.timezone)
    return st >= chk['dusk'] or st <= chk['dawn']
Update 1
Following the information from MoRe, I was able to get the essentials working. I needed to augment it by adding the column names and then specifying the index in the merge.
data = pd.Series(df.apply(lambda x: [
add_start_time(x['started_at']),
add_is_dark(x['started_at']),
yrmo(x['year'], x['month']),
calc_duration_in_minutes(x['started_at'], x['ended_at']),
add_start_cat(x['started_at'])
], axis = 1))
new_df = pd.DataFrame(data.tolist(),
data.index,
columns=['start_time','is_dark','yrmo',
'duration','start_cat'])
df = df.merge(new_df, left_index=True, right_index=True)
import pandas as pd
data = pd.Series(dataframe.apply(lambda x: [function1(x[column_name]), function2(x[column_name]), function3(x[column_name])], axis = 1))
pd.DataFrame(data.tolist(),data.index)
If I understood you correctly, this is your answer. But before anything else, please install swifter with pip :)
First create a Series of lists, then convert it to columns...
swifter is a simple library (at least I think it is simple) that has only one useful method: apply
import swifter
data.swifter.apply(lambda x: x+1)
It uses parallel processing to improve speed on large datasets... on small ones it doesn't help and can even be slower.
https://pypi.org/project/swifter/
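For the columns that depend only on datetime arithmetic, a row-wise apply may not be needed at all. The following is a sketch, assuming started_at and ended_at are already datetime64 columns as stated in the question; it covers only start_time and duration, since add_start_cat is not shown and add_is_dark depends on an external library:

```python
import pandas as pd

# Two sample rows taken from the question's CSV excerpt.
df = pd.DataFrame({
    "started_at": pd.to_datetime(["2022-04-01 18:23:32", "2022-04-02 01:16:34"]),
    "ended_at": pd.to_datetime(["2022-04-01 22:18:15", "2022-04-02 02:18:32"]),
})

# Both columns are computed for the whole frame at once via the .dt accessor.
df = df.assign(
    start_time=df["started_at"].dt.time,
    duration=(df["ended_at"] - df["started_at"]).dt.total_seconds() / 60,
)
```

Each expression still makes its own pass over the data, but the pass happens in compiled code instead of calling a Python function once per row.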

np.where() computes np.random.choice() only once - pandas

I have this dataframe:
np.random.seed(0)
N = 10000
N_Seg = 100
df = pd.DataFrame({"Rut_Num": range(1,N+1),
"Segmento": np.random.choice(
["Afluente", "Afluente","Premium", "Preferente", "Preferente", "Preferente", "Preferente", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico"], N),
"If_Seguro": np.random.choice([0,1,1], N)})
df.head()
Rut_Num Segmento If_Seguro
0 1 Clásico 1
1 2 Preferente 0
2 3 Afluente 0
3 4 Preferente 0
4 5 Clásico 1
When the column If_Seguro is 1, I need a random number between 1 and N_Seg+1; if it's 0, I need a 0:
np.random.seed()
df.loc[:,"id_Seguro"] = np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0)
df["id_Seguro"].value_counts()
You can see that the true branch of np.where() gives the same number for all the ones, when I need a different random number for each 1 in If_Seguro.
Besides, why does np.where() compute np.random.choice() only once for the whole column instead of computing it for each validation (each row) in the column?
The expression np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0) shows what is in my opinion a frequently encountered, but generally undesirable use of where. The solution will also answer your question as to why only one value is being generated.
np.where does not compute much. It just selects values based on a mask from a pair of existing arrays. Normal python semantics don't change here. You are passing in the result of a function call, not the function itself, so it's the value that is used. This means that you need to compute np.random.choice(...) for all of the rows of df, not just the ones where df["If_Seguro"] == 1.
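To make the point concrete, here is a minimal sketch that keeps np.where but hands it one random draw per row. It uses NumPy's Generator API (np.random.default_rng) instead of the legacy functions from the question, and the seed and the cut-down frame are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N, N_Seg = 10_000, 100
df = pd.DataFrame({"If_Seguro": rng.choice([0, 1, 1], N)})

# One draw per row, computed before np.where is called.
draws = rng.integers(1, N_Seg + 1, size=N)  # upper bound is exclusive
# np.where only selects, element by element, between draws and 0.
df["id_Seguro"] = np.where(df["If_Seguro"] == 1, draws, 0)
```

The draws for rows where If_Seguro is 0 are simply thrown away, which is the waste that indexing with the mask avoids.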
df["If_Seguro"] is a mask, and numpy provides you with some tools for working with masks. For example, the actual number of elements you want to generate is
np.count_nonzero(df["If_Seguro"])
The row locations where you want to insert those values is given by the mask itself. Both numpy and pandas allow you to index with a boolean mask directly. np.where is just an extra layer of inefficiency in many cases.
Finally, to generate N samples from an existing sequence, do either:
np.random.choice(range(1, N_Seg + 1), size=N, replace=True)
replace=True allows the samples to repeat, as your original call to np.where likely intended. A much better way to do the same thing does not involve an explicit sequence object:
np.random.randint(1, N_Seg + 1, N)
In the proposed solution, the number of samples will be the number of masked elements, whereas in your original code it should have been N.
So finally we have:
mask = df["If_Seguro"] == 1  # boolean mask; the raw 0/1 integers would be treated as row labels
df.loc[mask, "id_Seguro"] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
If id_Seguro is not already zeroed out to start with, you can do one of a couple of things. Adding on to the previous:
df.loc[~mask, "id_Seguro"] = 0
Or generating a new array from scratch:
mask = (df["If_Seguro"] == 1).to_numpy()  # boolean array; integer 0/1 values would index positions 0 and 1
result = np.zeros(N)
result[mask] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
df["id_Seguro"] = result

Why isn't there order invariance between the two sets of operations?

I'm handling a CSV file/pandas dataframe, where the first column contains the date.
I want to do some conversion here to datetime, some filtering, sorting and reindexing.
What I experience is that if I change the order of the sets of operations, I get different results (the result of the first configuration is bigger than the other one). Probably the first one is the "good" one.
Can anyone tell me, which sub-operations cause the difference between the results?
Which of those is the "bad" and which is the "good" solution?
Is it possible to secure order independence, where the user can call those two methods in any order and still get the good results? (Is it possible to get the good results by implementing interchangeable sets of operations?)
jdf1 = x.copy(deep=True)
jdf2 = x.copy(deep=True)
interval = [DATE_START, DATE_END]
dateColName = "Date"
Configuration 1:
# Operation set 1: dropping duplicates, sorting and reindexing the table
jdf1.drop_duplicates(subset=dateColName, inplace=True)
jdf1.sort_values(dateColName, inplace=True)
jdf1.reset_index(drop=True, inplace=True)
# Operation set 2: converting column type and filtering the rows in case the CSV's contents cover a wider interval
jdf1[dateColName] = pd.to_datetime(jdf1[jdf1.columns[0]], format="%Y-%m-%d")
maskL = jdf1[dateColName] < interval[0]
maskR = jdf1[dateColName] > interval[1]
mask = maskL | maskR
jdf1.drop(jdf1[mask].index, inplace=True)
vs.
Configuration 2:
# Operation set 2: converting column type and filtering the rows in case the CSV's contents cover a wider interval
jdf2[dateColName] = pd.to_datetime(jdf2[jdf2.columns[0]], format="%Y-%m-%d")
maskL = jdf2[dateColName] < interval[0]
maskR = jdf2[dateColName] > interval[1]
mask = maskL | maskR
jdf2.drop(jdf2[mask].index, inplace=True)
# Operation set 1: dropping duplicates, sorting and reindexing the table
jdf2.drop_duplicates(subset=dateColName, inplace=True)
jdf2.sort_values(dateColName, inplace=True)
jdf2.reset_index(drop=True, inplace=True)
Results:
val1 = set(jdf1["Date"].values)
val2 = set(jdf2["Date"].values)
# bigger:
val1 - val2
# empty:
val2 - val1
Thank you for your help!
At first glance the two look the same, but they are not.
There are two different filtering steps, and they affect each other:
drop_duplicates() -> removes M rows, leaving ALL rows - M
boolean indexing with mask -> removes N rows, leaving ALL - M - N
--
boolean indexing with mask -> removes K rows, leaving ALL rows - K
drop_duplicates() -> removes L rows, leaving ALL - K - L
K != M
L != N
If you swap these operations, the result can differ, because both of them remove rows. The order in which they are called matters, because some rows are removed only by drop_duplicates and some rows only by boolean indexing.
In my opinion both methods are right; it depends on what you need.
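One concrete mechanism that produces exactly this asymmetry is a non-unique index combined with drop(...index), which removes every row that shares a label with a masked row. The question does not show x, so this is only a guess at the cause; the data, the interval and the helper names below are all hypothetical. The sketch contrasts label-based dropping with boolean selection, which does not depend on the index:

```python
import pandas as pd

# Hypothetical data with a non-unique index, e.g. two CSV reads concatenated.
x = pd.DataFrame(
    {"Date": ["2020-01-01", "2020-01-05", "2020-01-02", "2020-01-09"]},
    index=[0, 1, 0, 1],
)
start, end = pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-05")

def dedupe(df):
    return df.drop_duplicates(subset="Date").sort_values("Date").reset_index(drop=True)

def convert(df):
    return df.assign(Date=pd.to_datetime(df["Date"], format="%Y-%m-%d"))

def filter_by_drop(df):
    df = convert(df)
    mask = (df["Date"] < start) | (df["Date"] > end)
    return df.drop(df[mask].index)  # drops every row that shares a label

def filter_by_mask(df):
    df = convert(df)
    return df[df["Date"].between(start, end)]  # selects rows, ignores labels

drop_1 = filter_by_drop(dedupe(x))  # configuration 1: index is unique before drop
drop_2 = dedupe(filter_by_drop(x))  # configuration 2: 2020-01-05 is lost as well
mask_1 = filter_by_mask(dedupe(x))
mask_2 = dedupe(filter_by_mask(x))
```

With the label-based drop, configuration 1 keeps three dates and configuration 2 only two; with boolean selection both orders keep the same three dates.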

Python Pandas Compare 2 Large DataFrames of Text for Similarity

I have two large dataframes I want to compare. I want a comparison result capable of a column and / or row wise comparison of similarities by percent. This part is simple. However, I want to be able to make the comparison ignore differences based upon value criteria. A small example is below.
d1 = {'Sample':pd.Series([101,102,103]),
'Col1':pd.Series(['AA','--','BB']),
'Col2':pd.Series(['AB','AA','BB'])}
d2 = {'Sample':pd.Series([101,102,103]),
'Col1':pd.Series(['BB','AB','--']),
'Col2':pd.Series(['AB','AA','AB'])}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1 = df1.set_index('Sample')
df2 = df2.set_index('Sample')
comparison = df1.eq(df2)
# for column stats
comparison.sum(axis=0) / float(len(df1.index))
# for row stats
comparison.sum(axis=1) / float(len(df1.columns))
My problem is that when value1='AA' and value2 = '--' I want them to be viewed as equal (so when one is '--' basically always be true) but otherwise perform a normal Boolean comparison. I need an efficient way to do this that doesn't include excessive looping, as the datasets are quite large.
Below, I'm interpreting "when one is '--' basically always be true" to mean that any comparison against '--' (no matter what the other value is) should return True. In that case, you could use
mask = (df1=='--') | (df2=='--')
to find every location where either df1 or df2 is equal to '--' and then use
comparison |= mask
to update comparison. For example,
import itertools as IT
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 10000
df1, df2 = [pd.DataFrame(
np.random.choice(list(map(''.join, IT.product(list('ABC'), repeat=2)))+['--'],
size=(N, 2)),
columns=['Col1', 'Col2']) for i in range(2)]
comparison = df1.eq(df2)
mask = (df1=='--') | (df2=='--')
comparison |= mask
# for column stats
column_stats = comparison.sum(axis=0) / float(len(df1.index))
# for row stats
row_stats = comparison.sum(axis=1) / float(len(df1.columns))
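The wildcard rule is easier to check by hand on the three-row frames from the question. A minimal sketch, assuming the same interpretation as above (any comparison involving '--' counts as a match) and using mean() as a shorthand for the sum divided by the length:

```python
import pandas as pd

samples = pd.Index([101, 102, 103], name="Sample")
df1 = pd.DataFrame({"Col1": ["AA", "--", "BB"],
                    "Col2": ["AB", "AA", "BB"]}, index=samples)
df2 = pd.DataFrame({"Col1": ["BB", "AB", "--"],
                    "Col2": ["AB", "AA", "AB"]}, index=samples)

# Equal values match, and so does any cell where either side is the wildcard.
comparison = df1.eq(df2) | df1.eq("--") | df2.eq("--")

column_stats = comparison.mean(axis=0)  # fraction of matching rows per column
row_stats = comparison.mean(axis=1)     # fraction of matching columns per row
```

Without the wildcard, Col1 would have no matches at all; with it, samples 102 and 103 match in Col1 because one side is '--'.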
I think a list comprehension should be quite fast:
new_columns = []
for col in df1.columns:
    new_columns.append([True if (x==y or x=='--' or y=='--') else False for x,y in zip(df1[col],df2[col])])
results = pd.DataFrame(new_columns).T
results.index = df1.index
This outputs the full true/false df.
