Merging 2 dataframes by date column (no look-ahead bias) - python

I am trying to create a Python function that takes two dataframes (dfA, dfB) and merges them on their date column. When merging, each row of dfB looks for the nearest date in dfA that is equal to or earlier than its own date. This is to prevent the data in dfAB from looking into the future (which is why dfAB.iloc[3]['date'] = 1/4/21 and not 1/9/21).
dfA
date i
0 1/1/21 0
1 1/3/21 0
2 1/4/21 0
3 1/10/21 0
dfB
date j k
0 1/1/21 0 0
1 1/2/21 0 0
2 1/3/21 0 0
3 1/9/21 0 0
4 1/12/21 0 0
dfAB (note that for each row of dfB, there is a row of dfAB)
date j k i
0 1/1/21 0 0 0
1 1/1/21 0 0 0
2 1/3/21 0 0 0
3 1/4/21 0 0 0
4 1/10/21 0 0 0
The values in columns i, j, k are just arbitrary values

To do this we can use pd.merge_asof and a bit of trickery to push the date column from dfB back to the date column from dfA:
# a.csv
date i
1/1/21 0
1/3/21 0
1/4/21 0
1/10/21 0
# b.csv
date j k
1/1/21 0 0
1/2/21 0 0
1/3/21 0 0
1/9/21 0 0
1/12/21 0 0
# merge_ab.py
import pandas as pd
dfA = pd.read_csv(
'a.csv',
delim_whitespace=True,
parse_dates=['date'],
dayfirst=True,
)
dfB = pd.read_csv(
'b.csv',
delim_whitespace=True,
parse_dates=['date'],
dayfirst=True,
)
dfA['new_date'] = dfA['date']
dfAB = pd.merge_asof(dfB, dfA, on='date', direction='backward')
dfAB['date'] = dfAB['new_date']
dfAB = dfAB.drop(columns=['new_date'])
print(dfAB)
# date j k i
# 0 2021-01-01 0 0 0
# 1 2021-01-01 0 0 0
# 2 2021-03-01 0 0 0
# 3 2021-04-01 0 0 0
# 4 2021-10-01 0 0 0
Here pd.merge_asof is doing the heavy lifting. We merge the rows of dfB backward against the rows of dfA, which ensures that any row of dfAB only contains data from dates equal to or before the corresponding row in dfB. We do a little song and dance to copy the date column in dfA into new_date and then copy that over to the date column in dfAB to get the desired output.
It's not 100% clear to me that you want direction='backward' since all your sample data is 0, but if it doesn't look right you can always switch to direction='forward'.
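For reference, here is a minimal self-contained sketch of the same merge without the csv round-trip, building the frames inline. It assumes the dates are month/day/year; the code above parses them day-first, which still pairs the rows correctly because the ordering is unchanged either way.
import pandas as pd
# build the example frames directly (month/day/year assumed)
dfA = pd.DataFrame({'date': pd.to_datetime(['1/1/21', '1/3/21', '1/4/21', '1/10/21']), 'i': 0})
dfB = pd.DataFrame({'date': pd.to_datetime(['1/1/21', '1/2/21', '1/3/21', '1/9/21', '1/12/21']), 'j': 0, 'k': 0})
# keep dfA's own date so it survives the merge, then use it as the output date
dfA['new_date'] = dfA['date']
dfAB = pd.merge_asof(dfB, dfA, on='date', direction='backward')
dfAB['date'] = dfAB.pop('new_date')
print(dfAB)  # dates come out as 1/1, 1/1, 1/3, 1/4, 1/10 -- no look-ahead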

Related

Appending 2 dataframes with having duplicates without removing the duplicates

I'm trying to append prediction to my original data which is:
product_id date views wishlists cartadds orders order_units gmv score
mp000000000001321 01-09-2022 0 0 0 0 0 0 0
mp000000000001321 02-09-2022 0 0 0 0 0 0 0
mp000000000001321 03-09-2022 0 0 0 0 0 0 0
mp000000000001321 04-09-2022 0 0 0 0 0 0 0
I have sequence lengths of [1, 3], and for each sequence length I have a prediction. I want to add those predictions to my original data so that my output looks like this:
product_id date views wishlists cartadds orders order_units gmv score prediction sequence_length
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.75 1
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 02-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 03-09-2022 0 0 0 0 0 0 0 5.88 3
I have tried the following:
df1 = df_batch.head(sequence_length)
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
merged_df.to_csv('data_prediction'+str(sequence_length)+'.csv', index_label='product_id')
but this only saves the data for the last product_id that was sent, and it saves each sequence length to a different csv. I want everything to be in one csv instead. How do I do that?
Edit: sample predictions_dict:
{'mp000000000001321': {'sequence_length': 1, 'prediction': 5.75}}
{'mp000000000001321': {'sequence_length': 3, 'prediction': 5.88}}
So, I found a fix:
new_df = pd.DataFrame()  # initialized once, before looping over products / sequence lengths
df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
new_df = pd.concat([new_df, merged_df], ignore_index=True)
This way I'm able to get the desired output for unique product ids; a fuller sketch of the whole loop is below.
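For completeness, here is a hedged, self-contained sketch of how those pieces can fit together so that every product_id / sequence_length pair ends up in a single csv. The toy df_batch and the loop structure are assumptions based on the description above, not code from the original post; the prediction values come from the sample predictions_dict.
import pandas as pd

# toy stand-in for df_batch, mirroring the sample rows in the question
# (the extra metric columns are omitted for brevity)
df_batch = pd.DataFrame({
    'product_id': ['mp000000000001321'] * 4,
    'date': ['01-09-2022', '02-09-2022', '03-09-2022', '04-09-2022'],
    'score': [0, 0, 0, 0],
})

# one predictions_dict per (product_id, sequence_length), as in the samples above
all_predictions = [
    {'mp000000000001321': {'sequence_length': 1, 'prediction': 5.75}},
    {'mp000000000001321': {'sequence_length': 3, 'prediction': 5.88}},
]

new_df = pd.DataFrame()  # accumulate every merged chunk here

for predictions_dict in all_predictions:
    for product_id, info in predictions_dict.items():
        sequence_length = info['sequence_length']
        df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
        dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
        dfff.index.names = ['product_id']
        merged_df = df1.merge(dfff, on='product_id')
        new_df = pd.concat([new_df, merged_df], ignore_index=True)

# a single csv at the end instead of one file per sequence length
new_df.to_csv('data_prediction.csv', index=False)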

How to reduce the size of my dataframe in Python?

Working on an NLP problem, I ended up with a big features dataset:
dfMethod
Out[2]:
c0000167 c0000294 c0000545 ... c4721555 c4759703 c4759772
0 0 0 0 ... 0 0 0
1 0 0 0 ... 0 0 0
2 0 0 0 ... 0 0 0
3 0 0 0 ... 0 0 0
4 0 0 0 ... 0 0 0
... ... ... ... ... ... ...
3995 0 0 0 ... 0 0 0
3996 0 0 0 ... 0 0 0
3997 0 0 0 ... 0 0 0
3998 0 0 0 ... 0 0 0
3999 0 0 0 ... 0 0 0
[4000 rows x 14317 columns]
I want to remove the columns with the smallest repetition (i.e. the columns with the smallest sum over all records).
So if my column sums looked like this:
Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628
In the end, I want to keep only the top 5000 columns based on the sum of each column. How can I do that?
Let's say you have a big dataframe big_df; you can get the top columns with the following:
N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]
Breaking this down:
big_df.sum() # Gives the sums you mentioned
.sort_values(ascending=False) # Sort the sums in descending order
.index # because .sum() defaults to axis=0, the index is your columns
[:N] # grab first N items
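If you prefer, the same selection can be written with nlargest instead of a full sort (a small equivalent sketch, assuming all columns are numeric):
N = 5000
top_cols = big_df.sum().nlargest(N).index  # the N columns with the largest sums
big_df_top = big_df[top_cols]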
Edited after the author's comment.
Let's consider df a pandas DataFrame. Prepare the filter by selecting the 5000 columns with the largest sums:
df_sum = df.sum()  # avoid repeating df.sum() on the next line
# sort (column, sum) pairs by sum, descending, and keep only the top 5000 columns
co = sorted(zip(df_sum.keys(), df_sum.values), key=lambda row: row[1], reverse=True)[:5000]
# convert to a list of the column names of interest
co = [row[0] for row in co]
Then filter the dataframe to the columns in co:
df = df.filter(items = co)
df

Nested lists in DataFrame column: how to carry out calculations on individual values?

I am trying to carry out calculations on individual values that are stored in a nested list in a pandas DataFrame. My issue is how to access these individual values.
I am working from a data set available here: https://datadryad.org/stash/dataset/doi:10.5061/dryad.h505v
I have imported the .json file in a pandas DataFrame and the elastic constants are stored in the column 'elastic_tensor'.
import pandas as pd
df = pd.read_json(workdir+"ec.json")
df['elastic_tensor'].head()
Out:
0 [[311.33514638650246, 144.45092552856926, 126....
1 [[306.93357350984974, 88.02634955100905, 105.6...
2 [[569.5291276937579, 157.8517489654999, 157.85...
3 [[69.28798774976904, 34.7875015216915, 37.3877...
4 [[349.3767766177825, 186.67131003104407, 176.4...
Name: elastic_tensor, dtype: object
In order to access the individual values, what I have done is expand the nested lists once (as I could not find a way to use .extend() to flatten the nested list):
df1 = pd.DataFrame(df["elastic_tensor"].to_list() , columns=['c'+str(j) for j in range(1,7)])
Note: I have named the columns c1..c6 as the elastic constants in the
end shall be called cij with i and j from 1 to 6.
Then I expanded each of these columns in turn (as I could not find a way to do this in a loop):
dfc1 = pd.DataFrame(df1["c1"].to_list() , columns=['c1'+str(j) for j in range(1,7)])
dfc2 = pd.DataFrame(df1["c2"].to_list() , columns=['c2'+str(j) for j in range(1,7)])
dfc3 = pd.DataFrame(df1["c3"].to_list() , columns=['c3'+str(j) for j in range(1,7)])
dfc4 = pd.DataFrame(df1["c4"].to_list() , columns=['c4'+str(j) for j in range(1,7)])
dfc5 = pd.DataFrame(df1["c5"].to_list() , columns=['c5'+str(j) for j in range(1,7)])
dfc6 = pd.DataFrame(df1["c6"].to_list() , columns=['c6'+str(j) for j in range(1,7)])
before merging them
data_frames = [dfc1, dfc2, dfc3, dfc4, dfc5, dfc6]
df_merged = pd.DataFrame().join(data_frames, how="outer")
which gives me a DataFrame with columns containing the individual cij values:
https://i.stack.imgur.com/odraQ.png
I can now carry out arithmetic operations on these individual values and add a column in the initial "df" dataframe with the results, but there must be a better way of doing it (especially if the matrices are large). Any idea?
The approach: use apply(pd.Series) to expand each list into columns, use stack() and unstack() to generate multi-index columns whose labels are zero-indexed positions into the 2D list, then flatten the multi-index to match your stated requirement (one-indexed instead of zero-indexed).
import json
from pathlib import Path
import pandas as pd

# file downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.h505v
with open(Path.cwd().joinpath("ec.json")) as f:
    js = json.load(f)
df = pd.json_normalize(js)
# expand first dimension, put it into row index, expand second dimension, make multi-index columns
dfet = df["elastic_tensor"].apply(pd.Series).stack().apply(pd.Series).unstack()
# flatten multi-index columns, index from 1, instead of standard 0
dfet.columns = [f"c{i+1}{j+1}" for i,j in dfet.columns.to_flat_index()]
dfet.head(5)

   c11      c12      c13      c14       c15        c16        c21      c22      c23      c24       c25        c26
0  311.335  144.451  126.176  0         -0.110347  0          144.451  311.32   126.169  0         -0.112161  0
1  306.934  88.0263  105.696  2.53622   -0.568262  -0.188934  88.0263  298.869  101.79   -1.43474  -0.608261  -0.226253
2  569.529  157.852  157.851  0         0          0          157.852  569.53   157.852  0         0          0
3  69.288   34.7875  37.3877  0         0          0          34.7875  78.1379  40.6047  0         0          0
4  349.377  186.671  176.476  0         0          0          186.671  415.51   213.834  0         0          0

   c31      c32      c33      c34       c35        c36        c41      c42       c43       c44      c45  c46
0  126.176  126.169  332.185  0         -0.107541  0          0        0         0         98.9182  0    0
1  105.696  101.79   398.441  0.350166  -0.577829  -0.232358  2.53622  -1.43474  0.350166  75.3104  0    0
2  157.851  157.852  569.53   0         0          0          0        0         0         94.8801  0    0
3  37.3877  40.6047  70.1326  0         0          0          0        0         0         19.8954  0    0
4  176.476  213.834  407.479  0         0          0          0        0         0         120.112  0    0

   c51        c52        c53        c54  c55      c56      c61        c62        c63        c64  c65      c66
0  -0.110347  -0.112161  -0.107541  0    98.921   0        0          0          0          0    0        103.339
1  -0.568262  -0.608261  -0.577829  0    75.5826  1.92806  -0.188934  -0.226253  -0.232358  0    1.92806  105.685
2  0          0          0          0    94.88    0        0          0          0          0    0        94.8801
3  0          0          0          0    4.75803  0        0          0          0          0    0        30.4095
4  0          0          0          0    125.443  0        0          0          0          0    0        74.9078
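Once the cij columns exist, the arithmetic asked about in the question is plain column arithmetic. As one hedged example (the formula is the standard Voigt-average bulk modulus, not something from the original post), a derived column can be added back to the initial df like this:
# Voigt bulk modulus estimate: K_V = ((c11 + c22 + c33) + 2*(c12 + c23 + c13)) / 9
df["K_V"] = (dfet["c11"] + dfet["c22"] + dfet["c33"]
             + 2 * (dfet["c12"] + dfet["c23"] + dfet["c13"])) / 9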
numpy approach:
import numpy as np

# stack the 6x6 tensors along a third axis, then reshape so each record becomes one row
a = np.dstack(df["elastic_tensor"])
pd.DataFrame(a.reshape((a.shape[0] * a.shape[1], a.shape[2])).T,
             columns=[f"c{i+1}{j+1}" for i in range(a.shape[0]) for j in range(a.shape[1])])

Not able to store the multi-index csv file using pandas

I have a dataframe which looks like this:
JAPE_feature
100 200 2200 2600 4600
did offset word
0 0 aa 0 1 0 0 0
0 11 bf 0 1 0 0 0
0 12 vf 0 1 0 0 0
0 13 rw 1 0 0 0 0
0 14 asd 1 0 0 0 0
0 16 dsdd 0 0 1 0 0
0 18 wd 0 0 0 1 0
0 20 wsw 0 0 0 1 0
0 21 sd 0 0 0 0 1
Now I am trying to save this dataframe in csv format:
df.to_csv('data.csv')
When stored this way, the JAPE_feature label gets repeated across every sub-column. I am trying to save it without creating new top-level columns for JAPE_feature, so that the 5 sub-features sit under one column group only:
JAPE_FEATURES
100 | 200 | 2200 | 2600 | 4600
The sub-columns should be grouped like this; it should not create separate top-level columns.
I think the best here is to convert the DataFrame to Excel if you need the first level of the MultiIndex in the columns merged:
df.to_excel('data.xlsx')
If you want csv then it is a problem; it is necessary to change the MultiIndex and replace duplicated values with empty strings:
print (df.columns)
MultiIndex([('JAPE_feature', 100),
('JAPE_feature', 200),
('JAPE_feature', 2200),
('JAPE_feature', 2600),
('JAPE_feature', 4600)],
)
cols = df.columns.to_frame()
cols[0] = cols[0].mask(cols[0].duplicated(), '')
df.columns = pd.MultiIndex.from_arrays([cols[0], cols[1]])
print (df.columns)
MultiIndex([('JAPE_feature', 100),
( '', 200),
( '', 2200),
( '', 2600),
( '', 4600)],
names=[0, 1])
df.to_csv('data.csv')
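A minimal self-contained sketch of the same trick on a tiny stand-in frame with the question's column headers (the row values are made up and the did/offset/word row index is omitted here):
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 0, 0], [1, 0, 0, 0, 0]],  # made-up rows, just to have data
    columns=pd.MultiIndex.from_product([['JAPE_feature'], [100, 200, 2200, 2600, 4600]]),
)

cols = df.columns.to_frame()
cols[0] = cols[0].mask(cols[0].duplicated(), '')  # keep 'JAPE_feature' only on the first column
df.columns = pd.MultiIndex.from_arrays([cols[0], cols[1]])
df.to_csv('data.csv')  # header row now shows JAPE_feature once, then blanks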

Pandas - get_dummies with value from another column

I have a dataframe like below. The column Mfr Number is a categorical data type. I'd like to perform get_dummies or one-hot encoding on it, but instead of filling the new column with a 1 for that row, I want it to fill in the value from the quantity column. All the other new 'dummies' should remain 0 on that row. Is this possible?
Datetime Mfr Number quantity
0 2016-03-15 07:02:00 MWS0460MB 1
1 2016-03-15 07:03:00 TM-120-6X 3
2 2016-03-15 08:33:00 40.50699.0095 5
3 2016-03-15 08:42:00 40.50699.0100 1
4 2016-03-15 08:46:00 CXS-04T098-00-0703R-1025 10
Do it in two steps:
dummies = pd.get_dummies(df['Mfr Number'])
dummies.values[dummies != 0] = df['quantity']
Check with str.get_dummies and mul:
df['Mfr Number'].str.get_dummies().mul(df['quantity'], axis=0)
40.50699.0095 40.50699.0100 ... MWS0460MB TM-120-6X
0 0 0 ... 1 0
1 0 0 ... 0 3
2 5 0 ... 0 0
3 0 1 ... 0 0
4 0 0 ... 0 0
[5 rows x 5 columns]
df = pd.get_dummies(df, columns = ['Mfr Number'])
for col in df.columns[2:]:
    df[col] = df[col] * df['quantity']
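Another common pattern for the same result (a sketch, not taken from the answers above, and starting again from the original df before any reassignment) is to build the dummies separately and multiply them row-wise by quantity:
dummies = pd.get_dummies(df['Mfr Number'], dtype=int).mul(df['quantity'], axis=0)
out = pd.concat([df[['Datetime', 'quantity']], dummies], axis=1)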
