Reshape long form panel data to wide stacked time series - python

I have panel data of the form:
+--------+----------+------------+----------+
| | user_id | order_date | values |
+--------+----------+------------+----------+
| 0 | 11039591 | 2017-01-01 | 3277.466 |
| 1 | 25717549 | 2017-01-01 | 587.553 |
| 2 | 13629086 | 2017-01-01 | 501.882 |
| 3 | 3022981 | 2017-01-01 | 1352.546 |
| 4 | 6084613 | 2017-01-01 | 441.151 |
| ... | ... | ... | ... |
| 186415 | 17955698 | 2020-05-01 | 146.868 |
| 186416 | 17384133 | 2020-05-01 | 191.461 |
| 186417 | 28593228 | 2020-05-01 | 207.201 |
| 186418 | 29065953 | 2020-05-01 | 430.401 |
| 186419 | 4470378 | 2020-05-01 | 87.086 |
+--------+----------+------------+----------+
as a Pandas DataFrame in Python.
The data is basically stacked time series data; the table contains numerous time series corresponding to observations for unique users within a certain period (2017/01 - 2020/05 above). The level of coverage for the period is likely to be very low amongst individual users, meaning that if you isolate the individual time series they're all of varying lengths.
I want to take this long-format panel data and convert it to wide format, such that each column is a day and each row corresponds to a unique user:
+----------+------------+------------+------------+------------+------------+
| | 2017-01-01 | 2017-01-02 | 2017-01-03 | 2017-01-04 | 2017-01-05 |
+----------+------------+------------+------------+------------+------------+
| 11039591 | 3277.466 | 6482.722 | NaN | NaN | NaN |
| 25717549 | 587.553 | NaN | NaN | NaN | NaN |
| 13629086 | 501.882 | NaN | NaN | NaN | NaN |
| 3022981 | 1352.546 | NaN | NaN | 557.728 | NaN |
| 6084613 | 441.151 | NaN | NaN | NaN | NaN |
+----------+------------+------------+------------+------------+------------+
I'm struggling to get this using unstack/pivot or other Pandas built-ins as I keep running into:
ValueError: Index contains duplicate entries, cannot reshape
due to the repeated user IDs.
My current solution uses a loop to index the individual time series and concatenates them together, so it's not scalable - it's already really slow with just 180k rows:
def time_series_stacker(df):
    ts = list()
    for user in df['user_id'].unique():
        # Transpose one user's rows: row 0 holds dates, row 1 holds values
        values = df.loc[df['user_id'] == user].drop('user_id', axis=1).T.values
        instance = pd.DataFrame(
            values[1, :].reshape(1, -1),
            index=[user],
            columns=values[0, :].astype('datetime64[ns]')
        )
        ts.append(instance)
    return pd.concat(ts, axis=0)
Can anyone help out with reshaping this more efficiently please?

This is a perfect time to try out pivot_table
user_id order_date values
0 11039591 2017-01-01 3277.466
1 11039591 2017-01-02 587.553
2 13629086 2017-01-03 501.882
3 13629086 2017-01-02 1352.546
4 6084613 2017-01-01 441.151
df.pivot_table(index='user_id', columns='order_date', values='values')
Output
order_date 2017-01-01 2017-01-02 2017-01-03
user_id
6084613 441.151 NaN NaN
11039591 3277.466 587.553 NaN
13629086 NaN 1352.546 501.882
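For context: pivot and unstack raise the ValueError above precisely because some (user_id, order_date) pairs occur more than once, whereas pivot_table aggregates those duplicates instead of raising (aggfunc='mean' by default). If averaging duplicates isn't what you want, pass an explicit aggfunc; a small sketch:
# Duplicated (user_id, order_date) pairs are aggregated here;
# 'mean' is the default, 'sum' shown as an explicit alternative.
wide = df.pivot_table(index='user_id', columns='order_date',
                      values='values', aggfunc='sum')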

Related

Order matrix with numpy.triu and non-NaN values

I have a dataset where I need to do a transformation to get an upper triangular matrix. So my matrix has this format:
|            |      1 |       2 |      3 |
| 01/01/1999 |    nan |  582.96 |    nan |
| 02/01/1999 |    nan |  589.78 |  78.47 |
| 03/01/1999 |    nan |  588.74 |  79.41 |
| …          |      … |       … |      … |
| 01/01/2022 | 752.14 | 1005.78 | 193.47 |
| 02/01/2022 | 754.14 |  997.57 | 192.99 |
I use dataframe.T to get my dates as columns, but I also need my rows to be ordered by non-NaNs.
|   | 01/01/1999 | 02/01/1999 | 03/01/1999 | … | 01/01/2022 | 02/01/2022 |
| 2 |     582.96 |     589.78 |     588.74 | … |    1005.78 |     997.57 |
| 3 |        nan |      78.47 |      79.41 | … |     193.47 |     192.99 |
| 1 |        nan |        nan |        nan | … |     752.14 |     754.14 |
I tried different combinations of numpy.triu, sort_by and dataframe.T, but without success.
My main goal is to get this format, but getting it with good performance would also be nice, because my data is big.
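A minimal sketch of one approach, assuming the frame is called df with dates as the index, and that "ordered by non-NaNs" means sorting the transposed rows by the position of their first non-NaN value:
import numpy as np

wide = df.T  # dates become columns, one row per series
# Position of the first non-NaN value in each row; series whose data
# starts earlier sort first. Note argmax returns 0 for all-NaN rows,
# so those would need special handling if they can occur.
first_valid = wide.notna().to_numpy().argmax(axis=1)
wide_sorted = wide.iloc[np.argsort(first_valid, kind='stable')]
This is vectorized, so it should hold up on large data better than a Python-level loop over rows.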

Combine dataframes of different sizes and replace values

I have 2 dataframes of different sizes. I want to join them and, after combining both, replace the NaN values with the values from the smaller dataframe.
dataframe1:-
| symbol| value1 | value2 | Occurance |
|=======|========|========|===========|
2020-07-31 | A | 193.5 | 186.05 | 3 |
2020-07-17 | A | 372.5 | 359.55 | 2 |
2020-07-21 | A | 387.8 | 382.00 | 1 |
dataframe2:-
| x | y | z | symbol|
|=====|=====|=====|=======|
2020-10-01 |448.5|453.0|443.8| A |
I tried concatenating and replacing the NaN values with the values from dataframe2.
I tried df1 = pd.concat([dataframe2, dataframe1], axis=1). The result is given below, but I am looking for the output shown under Result desired. How can I achieve that?
Result given:-
| X | Y | Z | symbol|symbol| value1| value2 | Occurance|
|====== | ====|=====|=======|======|=======| =======| =========|
2020-07-31|NaN |NaN | NaN | NaN | A |193.5 | 186.05 | 3 |
2020-07-17| NaN | NaN | NaN | NaN | A |372.5 | 359.55 | 2 |
2020-07-21| NaN | NaN | NaN | NaN | A |387.8 | 382.00 | 1 |
2020-10-01| 448.5 |453.0|443.8| A |NaN | NaN | NaN | NaN |
Result Desired:-
| X | Y | Z | symbol|symbol| value1| value2 | Occurance|
| ===== | ======| ====| ======| =====|=======|========|==========|
2020-10-01| 448.5 |453.0 |443.8| A | A |193.5 | 186.05 | 3 |
2020-10-01| 448.5 |453.0 |443.8| A | A |372.5 | 359.55 | 2 |
2020-10-01| 448.5 |453.0 |443.8| A | A |387.8 | 382.00 | 1 |
2020-10-01| 448.5 |453.0 |443.8| A |NaN | NaN | NaN | NaN |
Please note the datetime needs to be the same in the Result desired. In short, I want to replicate the single line of dataframe2 into the NaN values of dataframe1. A solution avoiding a for loop would be great.
Could you try sorting your dataframe by the index to check what the output would look like?
df1.sort_index()
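Sorting alone won't produce the desired result, though. A minimal sketch of one way to broadcast dataframe2's single row across every row of dataframe1 (frame and column names as in the question, and an unnamed date index assumed):
import pandas as pd

# Repeat dataframe2's single row once per row of dataframe1, then lay the
# two frames side by side positionally; this also keeps the duplicated
# 'symbol' column shown in the desired result.
rep = dataframe2.loc[dataframe2.index.repeat(len(dataframe1))].reset_index()
d1 = dataframe1.reset_index(drop=True)
result = pd.concat([rep, d1], axis=1).set_index('index')
That reproduces the first three desired rows; the trailing all-NaN line from the original concat could be appended with another concat if it's actually wanted.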

Imputing NaN values of a column with values in another column based on counts

I have 2 datasets as follows.
Dataset 1
| impute | city1 |
|-------- |------------ |
| 1875.0 | Medan |
| 274.0 | Yogyakarta |
| 257.0 | Jakarta |
| 71.0 | Bekasi |
| 68.0 | Bandung |
| 41.0 | London |
| 41.0 | Purwokerto |
| 36.0 | Malang |
| 33.0 | Manchester |
| 29.0 | Denpasar |
| 27.0 | Surabaya |
| 26.0 | Bogor |
| 24.0 | Semarang |
| 22.0 | Surakarta |
Dimensions = 248 x 2
Dataset 2
| city |
|------------ |
| NaN |
| Yogyakarta |
| Medan |
| NaN |
| Medan |
| Medan |
| NaN |
| Tangerang |
| NaN |
| NaN |
| Tangerang |
| NaN |
| Medan |
| NaN |
| NaN |
| NaN |
| NaN |
| NaN |
| Medan |
Dimensions 13866 x 1
I want to impute the NaN values in city (dataset 2) with the values in city1 (dataset 1).
Dataset 2 has 3563 NaN values, so I want to impute 1874 of them with Medan, 273 with Yogyakarta, 256 with Jakarta and so on, randomly (any 1874 NaNs out of the 3563). The impute column in dataset 1 sums to 3563 (equal to the number of NaN values in dataset 2).
In short, the number of NaN values replaced by each city should equal that city's value in the impute column.
Can somebody please help me with this.
You can use
df1['city1'].repeat(df1['impute'].astype(int)).sample(frac=1)
to repeat the values in the city1 column as many times as the number in the impute column (cast to int, since repeat needs integer counts), and shuffle the result. Then use
df2['city'].isna()
to find the NaN cities, and use that mask to assign the imputed values.
Put together, you end up with
df2.loc[df2['city'].isna(), 'city'] = df1['city1'].repeat(df1['impute'].astype(int)).sample(frac=1).values
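Two caveats worth flagging (my additions, not in the original answer): the aligned assignment only works if the shuffled pool is exactly as long as the number of NaNs, and random_state makes the shuffle reproducible:
# Sanity check: the pool must exactly cover the NaNs.
pool = df1['city1'].repeat(df1['impute'].astype(int)).sample(frac=1, random_state=0)
assert len(pool) == df2['city'].isna().sum()  # 3563 == 3563
df2.loc[df2['city'].isna(), 'city'] = pool.values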

Modify a column according to another dataframe's column - python

I have two dataframes. One is the master dataframe and the other df is used to fill my master dataframe.
What I want is to fill one column according to another column, without altering the other columns.
This is example of master df
| id | Purch. order | cost | size | code |
| 1 | G918282 | 8283 | large| hchs |
| 2 | EE18282 | 1283 | small| ueus |
| 3 | DD08282 | 5583 | large| kdks |
| 4 | GU88912 | 8232 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
This is example of the another df
| id | Purch. order | cost |
| 1 | G918282 | 7728 |
| 2 | EE18282 | 2211 |
| 3 | DD08282 | 5321 |
| 4 | GU88912 | 4778 |
| 5 | NaN | 4283 |
| 6 | Nan | 9993 |
| 7 | Nan | 3442 |
This is the result I'd like
| id | Purch. order | cost | size | code |
| 1 | G918282 | 7728 | large| hchs |
| 2 | EE18282 | 2211 | small| ueus |
| 3 | DD08282 | 5321 | large| kdks |
| 4 | GU88912 | 4778 | large| jdhd |
| 5 | NaN | 1283 | large| jdjd |
| 6 | Nan | 5583 | large| qqas |
| 7 | Nan | 8232 | large| djjs |
Only the cost column should be modified, and only where the secondary df matches on Purch. order and it's not NaN.
I hope you can help me... and I'm sorry if my English is basic, it's not my mother language. Thanks a lot.
Let's try update, which works along indexes. By default overwrite is set to True, which will overwrite overlapping values in your target dataframe; use overwrite=False if you only want to change NA values. Dropping the rows whose Purch. order is NaN from the secondary frame first is what enforces the "only if it's not NaN" condition.
master_df = master_df.set_index(['id','Purch. order'])
another_df = another_df.dropna(subset=['Purch. order']).set_index(['id','Purch. order'])
master_df.update(another_df)
print(master_df)
cost size code
id Purch. order
1 G918282 7728.0 large hchs
2 EE18282 2211.0 small ueus
3 DD08282 5321.0 large kdks
4 GU88912 4778.0 large jdhd
5 NaN 1283.0 large jdjd
6 Nan 5583.0 large qqas
7 Nan 8232.0 large djjs
You can do it with merge followed by updating the cost column based on where the NaNs are:
final_df = df1.merge(df2[~df2["Purch. order"].isna()], on='Purch. order', how="left")
final_df.loc[~final_df['Purch. order'].isnull(), "cost"] = final_df['cost_y']  # matched rows take the new cost
final_df.loc[final_df['Purch. order'].isnull(), "cost"] = final_df['cost_x']  # NaN orders keep the original cost
final_df = final_df.drop(['id_y', 'cost_x', 'cost_y'], axis=1)
Output:
id_x Purch. order size code cost
0 1 G918282 large hchs 7728.0
1 2 EE18282 small ueus 2211.0
2 3 DD08282 large kdks 5321.0
3 4 GU88912 large jdhd 4778.0
4 5 NaN large jdjd 1283.0
5 6 NaN large qqas 5583.0
6 7 NaN large djjs 8232.0
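A shorter variant along the same lines (a sketch, not from the answer above): build a Purch. order -> cost mapping from the secondary frame and overwrite only where a key matches:
# Map each non-NaN Purch. order to its new cost; rows with no match
# (including NaN orders) keep their original cost via fillna.
mapping = df2.dropna(subset=['Purch. order']).set_index('Purch. order')['cost']
df1['cost'] = df1['Purch. order'].map(mapping).fillna(df1['cost'])
This avoids the merge suffixes and column cleanup entirely.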

Carry/Copy NaN value to another column DF Pandas

I have a df which looks like this:
| id | qty | item |
+-----+------+------+
| 001 | 700 | CB04 |
| 002 | 500 | NaN |
| 003 | 1500 | AB01 |
I want to copy the NaN values from df['item'] to df['qty'], so that it looks like this:
| id | qty | item |
+-----+------+----------+
| 001 | 700 | CB04 box |
| 002 | NaN | NaN |
| 003 | 1500 | AB01 box |
I did the following:
df['qty'] = df.loc[df['item'].isnull(),'item']
but my df turned out like this:
| id | qty | item |
+-----+-----+----------+
| 001 | NaN | CB04 box |
| 002 | NaN | NaN |
| 003 | NaN | AB01 box |
Your approach isn't working because you are selecting the column item where it is null and setting qty equal to that result, which is always NaN, so it fills qty with NaN.
Use loc with boolean indexing and set your desired column. You were close, just not assigning quite right.
import numpy as np

df.loc[df.item.isnull(), 'qty'] = np.nan
id qty item
0 1 700.0 CB04
1 2 NaN NaN
2 3 1500.0 AB01
Also using np.where (slightly faster when I tested on a 300k row dataframe)
df.qty = np.where(df.item.isnull(), np.nan, df.qty)
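A third equivalent spelling (my sketch) uses Series.mask, which writes NaN wherever the condition holds without reaching for numpy directly:
# Blank out qty wherever item is null; mask fills with NaN by default.
df['qty'] = df['qty'].mask(df['item'].isnull())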
