How to pivot a Dask DataFrame - python

I am using the code below but get an error after pivoting the DataFrame:
dataframe:
name day value time
0 MAC000002 2012-12-16 0.147 09:30:00
1 MAC000002 2012-12-16 0.110 10:00:00
2 MAC000002 2012-12-16 0.736 10:30:00
3 MAC000002 2012-12-16 0.404 11:00:00
4 MAC000003 2012-12-16 0.845 00:30:00
Read in data, and pivot
import dask.dataframe as dd

ddf = dd.read_csv('data.csv')
# I added this but it didn't fix the error below
ddf.index.name = 'index'
# dask requires the string column to be a known category type
ddf['name'] = ddf['name'].astype('category')
ddf['name'] = ddf['name'].cat.as_known()
# pivot the table
df = ddf.pivot_table(columns='name', values='value', index='index')
df.head()
# KeyError: 'index'
Expected result (with or without index) - pivot rows to columns without any value modification:
MAC000002 MAC000003 ...
0.147 0.845
0.110 ...
0.736 ...
0.404 ...
Any idea why I am getting a KeyError 'index' and how I can overcome this?

According to the docs for pivot_table, the index kwarg should refer to an existing column, so instead of assigning a name to the index, create a column holding the desired index values:
# ddf.index.name = 'index'
ddf['index'] = ddf.index
Note that this assumes that the index is what you are really pivoting by (see the sketch after the snippet below for pivoting by an existing column instead).
Below is a reproducible snippet:
data = """
| name | day | value | time
0 | MAC000002 | 2012-12-16| 0.147| 09:30:00
1 | MAC000002 | 2012-12-16| 0.110| 10:00:00
2 | MAC000002 | 2012-12-16| 0.736| 10:30:00
3 | MAC000002 | 2012-12-16| 0.404| 11:00:00
4 | MAC000003 | 2012-12-16| 0.845| 00:30:00
"""
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(data), sep='|')
df.columns = [c.strip() for c in df.columns]
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)
ddf['index'] = ddf.index
# dask requires the string column to be a known category type
ddf['name'] = ddf['name'].astype('category')
ddf['name'] = ddf['name'].cat.as_known()
ddf.pivot_table(columns='name', values='value', index='index').compute()
# name MAC000002 MAC000003
# index
# 0 0.147 NaN
# 1 0.110 NaN
# 2 0.736 NaN
# 3 0.404 NaN
# 4 NaN 0.845
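As a follow-up to the note above: if the real pivot key is an existing column rather than the row index, that column can be passed to index directly. A minimal sketch (assuming the same data.csv with its time column; not part of the original answer):
import dask.dataframe as dd

ddf = dd.read_csv('data.csv')
# the pivoted column still has to be a known categorical
ddf['name'] = ddf['name'].astype('category').cat.as_known()
# pivot by the existing 'time' column instead of a synthetic index column;
# note that pivot_table aggregates duplicate (time, name) pairs (mean by default)
ddf.pivot_table(columns='name', values='value', index='time').compute()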

Related

Is there a way to optimize this date range transformation? Conditional merge in pandas?

I have Sales data like this as a DataFrame; the date columns have the pandas datetime64 dtype:
Shop ID  Special Offer Start  Special Offer End
A        '2022-01-01'         '2022-01-03'
B        '2022-01-09'         '2022-01-11'
etc.
I want to transform the data into a binary format that shows the date in one column and the special offer information as 0 and 1.
The resulting table should look like this:
Shop ID  Date          Special Offer?
A        '2022-01-01'  1
A        '2022-01-02'  1
A        '2022-01-03'  1
B        '2022-01-09'  1
B        '2022-01-10'  1
B        '2022-01-11'  1
I wrote a function that iterates over every row and creates a DataFrame containing a pandas date_range and the Special Offer information. These DataFrames are then concatenated. As you can imagine, the code runs very slowly.
I was thinking of appending a Special Offer? column to the Sales DataFrame and then joining it to a DataFrame containing all dates. Afterwards I could just handle the NaNs with dropna or fillna. But I couldn't find a function that lets me join on conditions in pandas.
See example below:
Shop ID  Special Offer Start  Special Offer End  Special Offer?
A        '2022-01-01'         '2022-01-03'       1
B        '2022-01-09'         '2022-01-11'       1
join with (the join condition being: if Date between Special Offer Start and Special Offer End):
Date
'2022-01-01'
'2022-01-02'
'2022-01-03'
'2022-01-04'
'2022-01-05'
'2022-01-06'
'2022-01-07'
'2022-01-08'
'2022-01-09'
'2022-01-10'
'2022-01-11'
creates:
Shop ID  Date          Special Offer?
A        '2022-01-01'  1
A        '2022-01-02'  1
A        '2022-01-03'  1
A        '2022-01-04'  NaN
A        '2022-01-05'  NaN
A        '2022-01-06'  NaN
A        '2022-01-07'  NaN
A        '2022-01-08'  NaN
A        '2022-01-09'  NaN
A        '2022-01-10'  NaN
A        '2022-01-11'  NaN
B        '2022-01-01'  NaN
B        '2022-01-02'  NaN
B        '2022-01-03'  NaN
B        '2022-01-04'  NaN
B        '2022-01-05'  NaN
B        '2022-01-06'  NaN
B        '2022-01-07'  NaN
B        '2022-01-08'  NaN
B        '2022-01-09'  1
B        '2022-01-10'  1
B        '2022-01-11'  1
EDIT:
here is the code I've written:
new_list = []
for i, row in sales_df.iterrows():
    df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"], end=row["Special Offer End"]), columns=['Date'])
    df['Shop ID'] = row['Shop ID']
    df["Special Offer?"] = 1
    new_list.append(df)
result = pd.concat(new_list).reset_index(drop=True)
Update: The Shop ID column is missing.
You can use date_range to expand the dates:
# Setup minimal reproducible example
import pandas as pd

data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
        {'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
df = pd.DataFrame(data)

# Not needed if the columns already have datetime64 dtype
df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])

# create the full date range
start = df['Special Offer Start'].min()
end = df['Special Offer End'].max()
dti = pd.date_range(start, end, freq='D', name='Date')

# expand each offer window into individual dates, pivot to a date-by-shop grid,
# then unstack back into long format
date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])
out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1).explode('Offer')
         .pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
         .reindex(dti, fill_value=0).unstack().rename('Special Offer?').reset_index())
>>> out
Shop ID Date Special Offer?
0 A 2022-01-01 1
1 A 2022-01-02 1
2 A 2022-01-03 1
3 A 2022-01-04 0
4 A 2022-01-05 0
5 A 2022-01-06 0
6 A 2022-01-07 0
7 A 2022-01-08 0
8 A 2022-01-09 0
9 A 2022-01-10 0
10 A 2022-01-11 0
11 B 2022-01-01 0
12 B 2022-01-02 0
13 B 2022-01-03 0
14 B 2022-01-04 0
15 B 2022-01-05 0
16 B 2022-01-06 0
17 B 2022-01-07 0
18 B 2022-01-08 0
19 B 2022-01-09 1
20 B 2022-01-10 1
21 B 2022-01-11 1
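If you specifically want the conditional-join formulation from the question, a cross merge plus between gets there too. A sketch reusing df and dti from above (not part of the original answer):
# cross-join every shop with every date in the full range, then flag the rows
# whose Date falls inside that shop's offer window
dates = dti.to_frame(index=False)   # single 'Date' column covering the full range
out2 = df.merge(dates, how='cross')
out2['Special Offer?'] = (out2['Date']
                          .between(out2['Special Offer Start'], out2['Special Offer End'])
                          .astype(int))
out2 = out2[['Shop ID', 'Date', 'Special Offer?']]
This is simple, but it materialises len(df) * len(dti) rows, so it only suits moderately sized inputs.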

Pandas append() will be deprecated and I can't convert a specific df.append() to pd.concat()

I currently have a loop that fills new rows into a dataframe from newly created Series:
for i in range(x):
    nextMonth = df.index[-1] + DateOffset(months=1)
    newRow = pd.Series({'col_1': None, 'col_2': 1}, name=nextMonth)
    df = df.append(newRow)
This works fine: new rows are created with the correct df columns (col_1 and col_2), and the df gets the correct nextMonth index label (2022-02-01 on date).
col_1 col_2
date
1994-07-01 0.0684 7.177511
1994-08-01 0.0186 6.718000
1994-09-01 0.0153 6.595327
1994-10-01 0.0262 6.495939
1994-11-01 0.0281 6.330091
... ... ...
2021-10-01 0.0125 1.035140
2021-11-01 0.0095 1.022360
2021-12-01 0.0073 1.012739
2022-01-01 0.0054 1.005400
2022-02-01 NaN 1.000000 -----> series added
Note that I'm using the Series index labels to match the df columns, and the Series name to set the new row's named index on the df (nextMonth).
Since df.append() is being deprecated, I'm struggling to perform the same instructions using pd.concat().
By slightly reworking your loop, you could make it a dict comprehension to build a dictionary; construct a DataFrame with it; then use pd.concat to concatenate it to df. For example, if x=3:
x = 3
df = (pd.concat((df, pd.DataFrame.from_dict(
          {df.index[-1] + DateOffset(months=i+1): {'col_1': np.nan, 'col_2': 1}
           for i in range(x)}, orient='index')))
      .rename_axis(index=df.index.name))
Output:
col_1 col_2
date
1994-07-01 0.0684 7.177511
1994-08-01 0.0186 6.718000
1994-09-01 0.0153 6.595327
1994-10-01 0.0262 6.495939
1994-11-01 0.0281 6.330091
2021-10-01 0.0125 1.035140
2021-11-01 0.0095 1.022360
2021-12-01 0.0073 1.012739
2022-01-01 0.0054 1.005400
2022-02-01 NaN 1.000000
2022-03-01 NaN 1.000000
2022-04-01 NaN 1.000000

Efficient way to get row with closest timestamp to a given datetime in pandas

I have a big dataframe that contains around 7,000,000 rows of time series data that looks like this
timestamp | values
2019-08-01 14:53:01 | 20.0
2019-08-01 14:53:55 | 29.0
2019-08-01 14:53:58 | 22.4
...
2019-08-02 14:53:25 | 27.9
I want to create a column that is a 1-day lagged version of the values. Since my timestamps don't match up perfectly, I can't use the normal shift() method.
The result will be something like this:
timestamp | values | lag
2019-08-01 14:53:01 | 20.0 | Nan
2019-08-01 14:53:55 | 29.0 | Nan
2019-08-01 14:53:58 | 22.4 | Nan
...
2019-08-02 14:53:25 | 27.9 | 20.0
I found some posts about getting the closest timestamp to a given time (Find closest row of DataFrame to given time in Pandas) and tried the methods; they do the job but take too long to run. Here's what I have:
def get_nearest(data, timestamp):
    index = data.index.get_loc(timestamp, method="nearest")
    return data.iloc[index, 0]

df['lag'] = [get_nearest(df, dt) for dt in df.index]
Any efficient ways to solve the problem?
Hmmm, not sure if this will work out to be more efficient, but merge_asof is an approach worth looking at, as it won't require a UDF.
df['date'] = df.timestamp.dt.date
df2 = df.copy()
df2['date'] = df2['date'] + pd.to_timedelta(1, unit='D')
df2['timestamp'] = df2['timestamp'] + pd.to_timedelta(1, unit='D')
pd.merge_asof(df, df2, on='timestamp', by='date', direction='nearest')
The approach essentially merges the previous day value to the next day and then matches to the nearest timestamp.
Assuming your dates are sorted, one way to do this quickly would be to use pd.DatetimeIndex.searchsorted to find all the matching dates in O(N log N) time.
Creating some test data, it might look something like this:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    {'values': np.random.rand(10)},
    index=sorted(np.random.choice(pd.date_range('2019-08-01', freq='T', periods=10000), 10, replace=False))
)

def add_lag(df):
    ind = df.index.searchsorted(df.index - pd.DateOffset(1))
    out_of_range = (ind <= 0) | (ind >= df.shape[0])
    ind[out_of_range] = 0
    lag = df['values'].values[ind]
    lag[out_of_range] = np.nan
    df['lag'] = lag
    return df

add_lag(df)
values lag
2019-08-01 06:17:00 0.548814 NaN
2019-08-01 10:51:00 0.715189 NaN
2019-08-01 13:56:00 0.602763 NaN
2019-08-02 09:50:00 0.544883 0.715189
2019-08-03 14:06:00 0.423655 0.423655
2019-08-04 03:00:00 0.645894 0.423655
2019-08-05 07:40:00 0.437587 0.437587
2019-08-07 00:41:00 0.891773 0.891773
2019-08-07 07:05:00 0.963663 0.891773
2019-08-07 15:55:00 0.383442 0.891773
With this approach, a dataframe with 1 million rows can be computed in tens of milliseconds:
df = pd.DataFrame(
    {'values': np.random.rand(1000000)},
    index=sorted(np.random.choice(pd.date_range('2019-08-01', freq='T', periods=10000000), 1000000, replace=False))
)
%timeit add_lag(df)
# 10 loops, best of 3: 71.5 ms per loop
Note however that this doesn't find the nearest value to a lag of one day, but the nearest value after a lag of one day. If you want the nearest value in either direction, you'll need to modify this approach.
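One way to do that, sketched here as an illustration (assuming the same sorted df; not part of the original answer): look at the candidates on both sides of each insertion point and keep whichever timestamp is closer to the one-day-earlier target.
def add_nearest_lag(df):
    # target timestamps: exactly one day before each row
    target = df.index - pd.DateOffset(1)
    # insertion points: first position whose timestamp is >= the target
    right = df.index.searchsorted(target)
    left = np.clip(right - 1, 0, df.shape[0] - 1)
    right = np.clip(right, 0, df.shape[0] - 1)
    # keep whichever neighbour is closer to the target in absolute time
    use_left = abs(df.index[left] - target) <= abs(df.index[right] - target)
    ind = np.where(use_left, left, right)
    lag = df['values'].values[ind]
    # rows whose target precedes all available data get no lag value
    lag[target < df.index[0]] = np.nan
    df['lag'] = lag
    return df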

How to fill periods in columns?

There is a dataframe. The period column contains lists. These lists contain time spans.
import numpy as np
import pandas as pd

# load data
df = pd.DataFrame(data, columns=['task_id', 'target_start_date', 'target_end_date'])
df['target_start_date'] = pd.to_datetime(df.target_start_date)
df['target_end_date'] = pd.to_datetime(df.target_end_date)
df['period'] = np.nan

# create period column
z = dict()
freq = 'M'
for i in range(0, len(df)):
    l = pd.period_range(df.target_start_date[i], df.target_end_date[i], freq=freq)
    l = l.to_native_types()
    z[i] = l
df.period = z.values()
Output
task_id target_start_date target_end_date period
0 35851 2019-04-01 07:00:00 2019-04-01 07:00:00 [2019-04]
1 35852 2020-02-26 11:30:00 2020-02-26 11:30:00 [2020-02]
2 35854 2019-05-17 07:00:00 2019-06-01 17:30:00 [2019-05, 2019-06]
3 35855 2019-03-20 11:30:00 2019-04-07 15:00:00 [2019-03, 2019-04]
4 35856 2019-04-06 08:00:00 2019-04-26 19:00:00 [2019-04]
Then I add columns which are called time slices.
# create slices
date_min = df.target_start_date.min()
date_max = df.target_end_date.max()
period = pd.period_range(date_min, date_max, freq=freq)
# add columns
for i in period:
    df[str(i)] = np.nan
Result: the dataframe now has one extra all-NaN column per period in the range.
How can I replace those NaN values with True where the corresponding period appears in the list in the period column?
Apply a function across the dataframe rows:
def fillit(row):
    for i in row.period:
        row[i] = True
    return row

df = df.apply(fillit, axis=1)
My approach was to iterate over rows and column names and compare values:
import numpy as np
import pandas as pd

# handle assignment error
pd.options.mode.chained_assignment = None

# setup test data
data = {'time': [['2019-04'], ['2019-01'], ['2019-03'], ['2019-06', '2019-05']]}
data = pd.DataFrame(data=data)

# create periods
date_min = data.time.min()[0]
date_max = data.time.max()[0]
period = pd.period_range(date_min, date_max, freq='M')
for i in period:
    data[str(i)] = np.nan

# compare and fill data
for index, row in data.iterrows():
    for column in data:
        if data[column].name in row['time']:
            data[column][index] = 'True'
Output:
time 2019-01 2019-02 2019-03 2019-04 2019-05 2019-06
0 [2019-04] NaN NaN NaN True NaN NaN
1 [2019-01] True NaN NaN NaN NaN NaN
2 [2019-03] NaN NaN True NaN NaN NaN
3 [2019-06, 2019-05] NaN NaN NaN NaN True True
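A vectorized alternative, sketched under the assumption that you start from the same data['time'] lists and period range as above (not part of the original answer): explode the lists and one-hot encode them instead of looping over rows and columns.
# one row per (task, period) pair, then one boolean column per period
flags = (pd.get_dummies(data['time'].explode())
           .groupby(level=0).max()
           .astype(bool))
# make sure every period in the full range shows up as a column
flags = flags.reindex(columns=period.astype(str), fill_value=False)
out = data[['time']].join(flags)
This yields True/False instead of True/NaN; chain .replace(False, np.nan) if the NaN style of the loop version is needed.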

How to access last element of a multi-index dataframe

I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
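Another option that avoids both reset_index and a groupby, sketched on the assumption (stated in the question) that the index is already sorted by IDs and timestamp: keep only the last occurrence of each IDs value.
# boolean mask: True for the last row of each IDs level value
mask = ~df.index.get_level_values('IDs').duplicated(keep='last')
print(df[mask])
Like tail(1), this keeps the full MultiIndex and does not skip NaN values.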
