I have a csv dataset (with > 8m rows) that I load into a dataframe. The csv has columns like:
...,started_at,ended_at,...
2022-04-01 18:23:32,2022-04-01 22:18:15
2022-04-02 01:16:34,2022-04-02 02:18:32
...
I am able to load the dataset into my dataframe, but then I need to add multiple calculated columns to the dataframe for each row. In other words, unlike this SO question, I do not want the rows of the new columns to all have the same initial value (col 1 all NaN, col 2 all "dogs", etc.).
Right now, I can add my columns by doing something like:
df['start_time'] = df.apply(lambda row: add_start_time(row['started_at']), axis = 1)
df['start_cat'] = df.apply(lambda row: add_start_cat(row['start_time']), axis = 1)
df['is_dark'] = df.apply(lambda row: add_is_dark(row['started_at']), axis = 1)
df['duration'] = df.apply(lambda row: calc_dur(row['started_at'], row['ended_at']), axis = 1)
But it seems inefficient since the entire dataset is processed N times (once for each call).
It seems that I should be able to calculate all of the new columns in a single go, but I am missing some conceptual approach.
Examples:
def calc_dur(started_at, ended_at):
    # started_at, ended_at are datetime64[ns]; converted at csv load
    diff = ended_at - started_at
    return diff.total_seconds() / 60

def add_start_time(started_at):
    # started_at is datetime64[ns]; converted at csv load
    return started_at.time()

def add_is_dark(started_at):
    # TZ is pytz.timezone('US/Central')
    # chi_town is the astral lookup for Chicago
    st = started_at.replace(tzinfo=TZ)
    chk = sun(chi_town.observer, date=st, tzinfo=chi_town.timezone)
    return st >= chk['dusk'] or st <= chk['dawn']
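(As an aside: a couple of these can also be computed column-wise without any per-row apply at all. A minimal sketch, assuming started_at and ended_at are already datetime64[ns] as the comments above state:)

df['duration'] = (df['ended_at'] - df['started_at']).dt.total_seconds() / 60  # vectorized calc_dur
df['start_time'] = df['started_at'].dt.time                                   # vectorized add_start_time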
Update 1
Following on the information from MoRe, I was able to get the essentials working. I needed to augment it by adding the column names, and then specify the index in the merge.
data = pd.Series(df.apply(lambda x: [
add_start_time(x['started_at']),
add_is_dark(x['started_at']),
yrmo(x['year'], x['month']),
calc_duration_in_minutes(x['started_at'], x['ended_at']),
add_start_cat(x['started_at'])
], axis = 1))
new_df = pd.DataFrame(data.tolist(),
data.index,
columns=['start_time','is_dark','yrmo',
'duration','start_cat'])
df = df.merge(new_df, left_index=True, right_index=True)
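A slightly more compact variant of the same single-pass idea (my sketch, not from the answer below): the lambda can return a pandas Series keyed by column name, so the expansion, naming, and index alignment happen in one step.

new_cols = df.apply(lambda x: pd.Series({
    'start_time': add_start_time(x['started_at']),
    'is_dark': add_is_dark(x['started_at']),
    'yrmo': yrmo(x['year'], x['month']),
    'duration': calc_duration_in_minutes(x['started_at'], x['ended_at']),
    'start_cat': add_start_cat(x['started_at'])
}), axis = 1)
df = df.join(new_cols)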
import pandas as pd
data = pd.Series(dataframe.apply(lambda x: [function1(x[column_name]),
                                            function2(x[column_name]),
                                            function3(x[column_name])], axis = 1))
pd.DataFrame(data.tolist(),data.index)
If I understood you correctly, this is your answer. But before anything else, please try the swifter package :)
First create a Series of lists, then convert it to columns...
swifter is a simple library (at least I think it is simple) that has only one useful method: apply
import swifter
data.swifter.apply(lambda x: x+1)
It uses parallelism to improve speed on large datasets... on small ones it doesn't help and can even be slower.
https://pypi.org/project/swifter/
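Applied to the row-wise calls from the question, that would look roughly like this (a sketch; it assumes the swifter accessor is used on the DataFrame and reuses the question's calc_dur function):

import swifter  # registers the .swifter accessor on DataFrames/Series

df['duration'] = df.swifter.apply(
    lambda row: calc_dur(row['started_at'], row['ended_at']), axis=1)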
Related
I have a dataframe as shown below
data = {
'key':['k1','k2'],
'name_M1':['name', 'name'],'area_M1':[1,2],'length_M1':[11,21],'breadth_M1':[12,22],
'name_M2':['name', 'name'],'area_M2':[1,2],'length_M2':[11,21],'breadth_M2':[12,22],
'name_M3':['name', 'name'],'area_M3':[1,2],'length_M3':[11,21],'breadth_M3':[12,22],
'name_M4':['name', 'name'],'area_M4':[1,2],'length_M4':[11,21],'breadth_M4':[12,22],
'name_M5':['name', 'name'],'area_M5':[1,2],'length_M5':[11,21],'breadth_M5':[12,22],
'name_M6':['name', 'name'],'area_M6':[1,2],'length_M6':[11,21],'breadth_M6':[12,22],
}
df = pd.DataFrame(data)
The input data is in wide format, as shown below.
I would like to convert it into a time-based long format like below. We call it time-based because each row has 3 months of data, and each subsequent row is shifted by 1 month.
ex: sample shape of data looks like below (with only one column for each month)
k1,Area_M1,Area_M2,Area_M3,Area_M4,Area_M5,Area_M6
I would like to convert it like below (subsequent rows are shifted by one month)
k1,Area_M1,Area_M2,Area_M3
K1,Area_M2,Area_M3,Area_M4
K1,Area_M3,Area_M4,Area_M5
K1,Area_M4,Area_M5,Area_M6
But in the real data, instead of one column for each month, I have multiple columns for each month, so all of those columns need to be converted/transformed. I tried something like below, but it doesn't work:
pd.wide_to_long(df, stubnames=["name_1st","area_1st","length_first","breadth_first",
"name_2nd","area_2nd","length_2nd","breadth_2nd",
"name_3rd","area_3rd","length_3rd","breadth_3rd"],
i="key", j="name",
sep="_", suffix=r"(?:\d+|n)").reset_index()
But I expect my output to be as below.
[Updated error screenshot]
This is pretty ugly, but I'm not exactly sure of an easier way to do this. Perhaps you could melt everything and do a rolling pivot, but it's not really much different.
This approach just slices columns 0:12, 4:16, etc. until the end, renaming and concatenating the slices together.
import pandas as pd
import numpy as np
data = {
'key':['k1','k2'],
'name_M1':['name', 'name'],'area_M1':[1,2],'length_M1':[11,21],'breadth_M1':[12,22],
'name_M2':['name', 'name'],'area_M2':[1,2],'length_M2':[11,21],'breadth_M2':[12,22],
'name_M3':['name', 'name'],'area_M3':[1,2],'length_M3':[11,21],'breadth_M3':[12,22],
'name_M4':['name', 'name'],'area_M4':[1,2],'length_M4':[11,21],'breadth_M4':[12,22],
'name_M5':['name', 'name'],'area_M5':[1,2],'length_M5':[11,21],'breadth_M5':[12,22],
'name_M6':['name', 'name'],'area_M6':[1,2],'length_M6':[11,21],'breadth_M6':[12,22],
}
df = pd.DataFrame(data)
df = df.set_index('key')
s = 4  # columns per month (name, area, length, breadth), i.e. the one-month step
n = 3  # months per window
cols = [
'name_1st','area_1st','length_1st','breadth_1st',
'name_2nd','area_2nd','length_2nd','breadth_2nd',
'name_3rd','area_3rd','length_3rd','breadth_3rd'
]
output = pd.concat(
    (df.iloc[:, 0 + i*s : 12 + i*s].set_axis(cols, axis=1)
     for i in range(int((df.shape[1] - (s*n)) / n))),
    ignore_index=True, axis=0
).set_index(np.tile(df.index, 4))
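A quick sanity check on the sample data (my own addition, not part of the original answer): there should be 4 windows of 2 keys each, i.e. 8 rows and 12 columns.

print(output.shape)           # expected (8, 12)
print(output.index.tolist())  # ['k1', 'k2', 'k1', 'k2', 'k1', 'k2', 'k1', 'k2']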
I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns contains a list of lists, each a sequence of 7 integers. Example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
data = [['Alfred','Chicago',
np.random.randint(1,5,(100,7))],
['Nicola','New York',
np.random.randint(1,5,(100,7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c/len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine used in pandas is Cython, but you can also change the engine to Numba or use the njit decorator to speed things up. Look up the pandas "Enhancing performance" docs.
Numba converts Python code to optimized machine code; pandas is highly integrated with NumPy, and hence with Numba as well. You can experiment with the parallel, nogil, cache, and fastmath options for extra speedup. This method shines for huge inputs where speed is needed.
With Numba you can do eager compilation, or let the first execution take a little time for compilation so that subsequent calls are fast.
import numba as nb
import numpy as np
import pandas as pd
df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
data = [['Alfred','Chicago',
np.random.randint(1,5,(100,7))],
['Nicola','New York',
np.random.randint(1,5,(100,7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just illustration, not correct logic. Change the logic according to needs
# nb.njit((nb.int64,))
# def f(x):
# sum = 0
# for i in nb.prange(x.shape[0]):
# for j in range(a.shape[0]):
# sum += (x[i] == a[j]).sum()
# return sum
# Experiment with engine
print(df2['Sequence'].apply(f))
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
I am trying to find a more pandorable way to get all rows of a DataFrame past a certain value in a certain column (the Quarter column in this case).
I want to slice a DataFrame of GDP statistics to get all rows past the first quarter of 2000 (2000q1). Currently, I'm doing this by getting the index number of the value in the GDP_df["Quarter"] column that equals 2000q1 (see below). This seems way too convoluted and there must be an easier, simpler, more idiomatic way to achieve this. Any ideas?
Current Method:
def get_GDP_df():
    GDP_df = pd.read_excel(
        "gdplev.xls",
        names=["Quarter", "GDP in 2009 dollars"],
        parse_cols = "E,G", skiprows = 7)
    year_2000 = GDP_df.index[GDP_df["Quarter"] == '2000q1'].tolist()[0]
    GDP_df["Growth"] = (GDP_df["GDP in 2009 dollars"]
                        .pct_change()
                        .apply(lambda x: f"{round((x * 100), 2)}%"))
    GDP_df = GDP_df[year_2000:]
    return GDP_df
Output:
Also, after the DataFrame has been sliced, the indices now start at 212. Is there a method to renumber the indices so they start at 0 or 1?
The following is equivalent:
year_2000 = (GDP_df["Quarter"] == '2000q1').idxmax()
GDP_df["Growth"] = (GDP_df["GDP in 2009 dollars"]
.pct_change()
.mul(100)
.round(2)
.apply(lambda x: f"{x}%"))
return GDP_df.loc[year_2000:]
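For the second part of the question (renumbering the index after the slice), a reset_index with drop=True relabels the rows from 0; a small sketch, not part of the answer above:

GDP_df = GDP_df.loc[year_2000:].reset_index(drop=True)  # index restarts at 0 instead of 212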
As pointed out in the comments, you can use the awesome query() method, which queries the columns of a DataFrame with a boolean expression. Under the hood it relies on the top-level pandas.eval() function, which evaluates a Python expression passed as a string using various backends.
import pandas as pd
raw_data = {'ID':['101','101','101','102','102','102','102','103','103','103','103'],
'Week':['08-02-2000','09-02-2000','11-02-2000','10-02-2000','09-02-2000','08-02-2000','07-02-2000','01-02-2000',
'02-02-2000','03-02-2000','04-02-2000'],
'Quarter':['2000q1','2000q2','2000q3','2000q4','2000q1','2000q2','2000q3','2000q4','2000q1','2000q2','2000q3'],
'GDP in 2000 dollars':[15,15,10,15,15,5,10,10,15,20,11]}
def get_GDP_df():
    GDP_df = pd.DataFrame(raw_data).set_index('ID')
    print(GDP_df)  # for reference, to see how the data is indexed
    GDP_df = GDP_df.query("Quarter >= '2000q1'").reset_index(drop=True)  # performing the query() + reindexing the dataframe
    GDP_df["Growth"] = (GDP_df["GDP in 2000 dollars"]
                        .pct_change()
                        .apply(lambda x: f"{round((x * 100), 2)}%"))
    return GDP_df
get_GDP_df()
I hope it is OK to ask questions of this type.
I have a get_lags function that takes a data frame, and for each column, shifts the column by each n in the list n_lags. So, if n_lags = [1, 2], the function shifts each column once by 1 and once by 2 positions, creating new lagged columns in this way.
def get_lags(df, n_lags):
    data = df.copy()
    data_with_lags = pd.DataFrame()
    for column in data.columns:
        for i in range(n_lags[0], n_lags[-1]+1):
            new_column_name = str(column) + '_Lag' + str(i)
            data_with_lags[new_column_name] = data[column].shift(-i)
    data_with_lags.fillna(method = 'ffill', limit = max(n_lags), inplace = True)
    return data_with_lags
So, if:
df.columns
ColumnA
ColumnB
Then, get_lags(df, [1 , 2]).columns will be:
ColumnA_Lag1
ColumnA_Lag2
ColumnB_Lag1
ColumnB_Lag2
Issue: working with data frames that have about 100,000 rows and 20,000 columns, this takes forever to run. On a 16-GB RAM, Core i7 Windows machine, I once waited 15 minutes for the code to run before stopping it. Is there any way I can tweak this function to make it faster?
You'll need shift + concat. Here's the concise version -
def get_lags(df, n_lags):
    return pd.concat(
        [df] + [df.shift(i).add_suffix('_Lag{}'.format(i)) for i in n_lags],
        axis=1
    )
And here's a more memory-friendly version, using a for loop -
def get_lags(df, n_lags):
    df_list = [df]
    for i in n_lags:
        v = df.shift(i)
        v.columns = v.columns + '_Lag{}'.format(i)
        df_list.append(v)
    return pd.concat(df_list, axis=1)
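Usage is the same as before; note that, unlike the original function, these versions keep the unshifted columns and shift by +i rather than -i. A quick sketch with a toy frame:

df = pd.DataFrame({'ColumnA': range(5), 'ColumnB': range(5, 10)})
lagged = get_lags(df, [1, 2])
print(lagged.columns.tolist())
# ['ColumnA', 'ColumnB', 'ColumnA_Lag1', 'ColumnB_Lag1', 'ColumnA_Lag2', 'ColumnB_Lag2']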
This may not apply to your case (I hope I understand what you're trying to do correctly), but you can speed it up massively by not doing it in the first place. Can you treat your columns like a ring buffer?
Instead of changing the columns afterwards, keep track of:
how many columns can you use (how many lag items for each entry)
what was the last lag column used
(optionally) how many times you "rotated"
So instead of moving the data, you do something like:
current_column = (current_column + 1) % total_columns
and write to that column next.
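A minimal sketch of that bookkeeping (the names are hypothetical, just to illustrate the rotating-pointer idea, not code from the question):

import numpy as np

class LagRing:
    # Keeps the last `total_columns` snapshots of a value column without ever
    # moving data: each new snapshot overwrites the oldest slot in the ring.
    def __init__(self, n_rows, total_columns):
        self.buf = np.full((n_rows, total_columns), np.nan)
        self.total_columns = total_columns
        self.current_column = -1

    def write(self, values):
        # advance the pointer and overwrite the oldest column in place
        self.current_column = (self.current_column + 1) % self.total_columns
        self.buf[:, self.current_column] = values

    def lag(self, k):
        # read the snapshot written k steps ago (k=0 is the latest)
        return self.buf[:, (self.current_column - k) % self.total_columns]

ring = LagRing(n_rows=5, total_columns=3)
ring.write(np.arange(5))        # snapshot at time t
ring.write(np.arange(5) + 10)   # snapshot at time t+1
print(ring.lag(1))              # -> [0. 1. 2. 3. 4.]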
I would like to efficiently slice a DataFrame with a DatetimeIndex (similar to a resample or groupby operation), but the desired time slices are different lengths.
This is relatively easy to do by looping (see code below), but with large timeseries the multiple slices quickly become slow. Any suggestions on vectorising this / improving speed?
import pandas as pd, datetime as dt, numpy as np
#Example DataFrame with a DatetimeIndex
idx = pd.DatetimeIndex(start=dt.datetime(2017,1,1), end=dt.datetime(2017,1,31), freq='h')
df = pd.Series(index = idx, data = np.random.rand(len(idx)))
#The slicer dataframe contains a series of start and end windows
slicer_df = pd.DataFrame(index = [1,2])
slicer_df['start_window'] = [dt.datetime(2017,1,2,2), dt.datetime(2017,1,6,12)]
slicer_df['end_window'] = [dt.datetime(2017,1,6,12), dt.datetime(2017,1,15,2)]
#The results should be stored to a dataframe, indexed by the index of the slicer dataframe
#This is the loop that I would like to vectorise
slice_results = pd.DataFrame()
slice_results['total'] = None
for index, row in slicer_df.iterrows():
    slice_results.loc[index, 'total'] = df[(df.index >= row.start_window) &
                                           (df.index <= row.end_window)].sum()
NB. I've just realised that my particular data set has adjacent windows (ie. the start of one window corresponds to the end of the one before it), but the windows are of different lengths. It feels like there should be a way to perform a groupby or similar with only one pass over df...
You can do this as an apply, which will concat the results rather than iteratively update the DataFrame:
In [11]: slicer_df.apply((lambda row: \
df[(df.index >= row.start_window)
& (df.index <= row.end_window)].sum()), axis=1)
Out[11]:
1 36.381155
2 111.521803
dtype: float64
You can vectorize this with searchsorted (assuming the datetime index is sorted, otherwise first sort):
In [11]: inds = np.searchsorted(df.index.values, slicer_df.values)
In [12]: s = df.cumsum() # only sum once!
In [13]: pd.Series([s[end] - s[start-1] if start else s[end] for start, end in inds], slicer_df.index)
Out[13]:
1 36.381155
2 111.521803
dtype: float64
There's still a loop in there, but it's now a lot cheaper!
That leads us to a completely vectorized solution (it's a little more cryptic):
In [21]: inds2 = np.maximum(1, inds) # see note
In [22]: inds2[:, 0] -= 1
In [23]: inds2
Out[23]:
array([[ 23, 96],
[119, 336]])
In [24]: x = s[inds2]
In [25]: x
Out[25]:
array([[ 11.4596498 , 47.84080472],
[ 55.94941276, 167.47121538]])
In [26]: x[:, 1] - x[:, 0]
Out[26]: array([ 36.38115493, 111.52180263])
Note: when the start date is before the first date, we want to avoid the start index rolling back from 0 to -1 (which would mean the end of the array, i.e. underflow).
I have come up with a vectorised method which relies on the varying length "windows" being always adjacent to one another, ie. that the start of a window is the same as the end of the window before it.
# Ensure that the join will be successful by rounding to a specific frequency
round_freq = '1h'
df.index = df.index.round(round_freq)
slicer_df.start_window= slicer_df.start_window.dt.round(round_freq)
# Give the index of the slicer a useful name
slicer_df.index.name = 'event_number'
#Perform a join to the start of the window, forward fill to the next window, then groupby to get the totals for each time window
df = df.to_frame('orig_data').join(slicer_df.reset_index().set_index('start_window')[['event_number']])
df.event_number = df.event_number.ffill()
df.groupby('event_number').sum()
Of course this only works when the windows are adjacent, ie. they can't overlap or have any gaps. If anyone has a more general method that works for the above, I'd love to see it!