Appending 2 dataframes with duplicates without removing the duplicates - python

I'm trying to append predictions to my original data, which is:
product_id date views wishlists cartadds orders order_units gmv score
mp000000000001321 01-09-2022 0 0 0 0 0 0 0
mp000000000001321 02-09-2022 0 0 0 0 0 0 0
mp000000000001321 03-09-2022 0 0 0 0 0 0 0
mp000000000001321 04-09-2022 0 0 0 0 0 0 0
I have sequence lengths of [1, 3] and for each sequence length I have a prediction. I want to add those predictions to my original data so that my output looks like this:
product_id date views wishlists cartadds orders order_units gmv score prediction sequence_length
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.75 1
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 02-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 03-09-2022 0 0 0 0 0 0 0 5.88 3
I have tried the following:
df1 = df_batch.head(sequence_length)
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
merged_df.to_csv('data_prediction'+str(sequence_length)+'.csv', index_label='product_id')
but this only saves the data for the last product_id that was sent, and it saves each sequence length to a different csv. I want everything in 1 csv instead. How do I do that?
Edit: sample predictions_dict:
{'mp000000000001321': {'sequence_length': 1, 'prediction': 5.75}}
{'mp000000000001321': {'sequence_length': 3, 'prediction': 5.88}}

So, I found a fix:
# new_df is an empty DataFrame created once before the loop
df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
new_df = pd.concat([new_df, merged_df], ignore_index=True)
This way I'm able to get the desired output for unique product ids.
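For reference, a minimal end-to-end sketch of that pattern, with everything landing in a single csv (the iterable all_predictions and its fields are hypothetical stand-ins for wherever the model's predictions come from):

import pandas as pd

new_df = pd.DataFrame()  # accumulator, created once before the loop

# all_predictions is a hypothetical iterable of (product_id, sequence_length, prediction)
for product_id, sequence_length, prediction in all_predictions:
    predictions_dict = {product_id: {'sequence_length': sequence_length,
                                     'prediction': prediction}}
    # first sequence_length rows of this product's history
    df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
    dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
    dfff.index.names = ['product_id']
    merged_df = df1.merge(dfff, on='product_id')
    new_df = pd.concat([new_df, merged_df], ignore_index=True)

new_df.to_csv('data_prediction.csv', index=False)  # one csv for everything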

Related

Pandas: Dot product shape mismatch

I am trying to take the dot product of some columns in my dataset:
df_disorders_3col = df.iloc[:,disorders_indexes]
df_disorders_3col.drop([5278, 10122, 10124, 10125, 10126], axis=0, inplace=True)
df_disorders_3col = df_disorders_3col.astype(int)
df_disorders_3col['Disorders'] = df_disorders_3col.dot(df.columns + ',').str.rstrip(',')
df_disorders_3col.head()
but I get this error when running this block of code:
ValueError: Dot product shape mismatch, (10133, 38) vs (498,)
this is a sample of my data:
>>>df_disorders_3col.sample(5)
HasDiabetes HasHypertension HasCardiacDisease ... HasMS HasPregnancyHypertension HasPregnancyDiabetes
752 0 0 0 ... 1 0 0
6312 0 0 0 ... 0 0 0
6984 1 0 0 ... 0 0 0
9016 0 0 0 ... 0 0 1
8923 0 0 0 ... 0 0 0
5 rows × 38 columns
also this is the shape of df_disorders_3col:
>>>df_disorders_3col.shape
(10133, 38)
and df:
>>>df.shape
(10138, 498)
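No answer is shown here, but the error message itself points at the likely culprit: the dot product is taken against df.columns (498 labels) while df_disorders_3col has only 38 columns, hence the (10133, 38) vs (498,) mismatch. A sketch of the probable fix, dotting the subset with its own column labels:

# dot the 0/1 matrix with its *own* 38 column labels, not the full df's 498
df_disorders_3col['Disorders'] = (
    df_disorders_3col
    .dot(df_disorders_3col.columns + ',')
    .str.rstrip(',')
)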

Merging 2 dataframes by date column (no look-ahead bias)

I am trying to create a python function that takes in 2 dataframes (dfA, dfB) and merges them based on their date column. When merging, dfB looks for the nearest date in dfA that is either equal to or comes before the given date. This is to prevent the data in dfAB from looking into the future (which is why dfAB.iloc[3]['date'] = 1/4/21 and not 1/9/21).
dfA
date i
0 1/1/21 0
1 1/3/21 0
2 1/4/21 0
3 1/10/21 0
dfB
date j k
0 1/1/21 0 0
1 1/2/21 0 0
2 1/3/21 0 0
3 1/9/21 0 0
4 1/12/21 0 0
dfAB (note that for each row of dfB, there is a row of dfAB)
date j k i
0 1/1/21 0 0 0
1 1/1/21 0 0 0
2 1/3/21 0 0 0
3 1/4/21 0 0 0
4 1/10/21 0 0 0
The values in columns i, j, k are just arbitrary values
So to do this we can use pd.merge_asof and a bit of trickery to push the date column from dfB back to the date column from dfA.
# a.csv
date i
1/1/21 0
1/3/21 0
1/4/21 0
1/10/21 0
# b.csv
date j k
1/1/21 0 0
1/2/21 0 0
1/3/21 0 0
1/9/21 0 0
1/12/21 0 0
# merge_ab.py
import pandas as pd

dfA = pd.read_csv(
    'a.csv',
    delim_whitespace=True,
    parse_dates=['date'],  # dates are month-first (1/9/21 = Jan 9), so the default dayfirst=False is what we want
)
dfB = pd.read_csv(
    'b.csv',
    delim_whitespace=True,
    parse_dates=['date'],
)
# keep a copy of dfA's dates so they survive the merge
dfA['new_date'] = dfA['date']
dfAB = pd.merge_asof(dfB, dfA, on='date', direction='backward')
# overwrite dfB's dates with the matched (equal-or-earlier) dfA dates
dfAB['date'] = dfAB['new_date']
dfAB = dfAB.drop(columns=['new_date'])
print(dfAB)
#         date  j  k  i
# 0 2021-01-01  0  0  0
# 1 2021-01-01  0  0  0
# 2 2021-01-03  0  0  0
# 3 2021-01-04  0  0  0
# 4 2021-01-10  0  0  0
Here pd.merge_asof is doing the heavy lifting. We merge the rows of dfB backwards with the rows of dfA. This should make it so that any row of dfAB only has data from dates equal to or before the corresponding row in dfB. We do a little song and dance to copy the date column in dfA and then copy that over to the date column in dfAB to get the desired output.
It's not 100% clear to me that you want direction='backward' since all your sample data is 0, but if it doesn't look right you can always switch to direction='forward'.
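One more caveat: pd.merge_asof requires both inputs to be sorted by the on key and raises a ValueError otherwise, so if your real data isn't already in date order, sort first:

# merge_asof needs both frames sorted by the merge key
dfA = dfA.sort_values('date')
dfB = dfB.sort_values('date')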

Calculate count of a numeric column into new columns Pandas DataFrame

I have a pandas DataFrame like this:
Movie Rate
0 5821 4
1 2124 2
2 7582 1
3 3029 5
4 17479 1
Both the movie and the rating can be repeated. I need to transform this DataFrame into something like this:
Movie Rate_1_Count Rate_2_Count ... Rate_5_Count
0 5821 20 1 5
1 2124 2 0 99
2 7582 50 22 22
...
where the movie ids are unique and Rate_{Number}_Count is the count of ratings for that movie equal to {Number}.
I already accomplished this task using the code below which I believe is very messy. I guess there must be a neater way to do that. Can anyone help me with it?
self.movie_df_tmp = self.rating_df[['MovieId', 'Rate']]
self.movie_df_tmp['RaCount'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('count')
self.movie_df_tmp['Sum'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('sum')
self.movie_df_tmp['NORC'] = self.movie_df_tmp.groupby(['MovieId', 'Rate'])['Rate'].transform('count')
self.movie_df_tmp = self.movie_df_tmp.drop_duplicates()
self.movie_df_tmp['Rate1C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 1]['NORC']
self.movie_df_tmp['Rate2C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 2]['NORC']
self.movie_df_tmp['Rate3C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 3]['NORC']
self.movie_df_tmp['Rate4C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 4]['NORC']
self.movie_df_tmp['Rate5C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 5]['NORC']
self.movie_df_tmp = self.movie_df_tmp.replace(np.nan, 0)
self.movie_df = self.movie_df_tmp[['MovieId', 'RaCount', 'Sum']].drop_duplicates()
self.movie_df_tmp = self.movie_df_tmp.drop(columns=['Rate', 'NORC', 'Sum', 'RaCount'])
self.movie_df_tmp = self.movie_df_tmp.groupby(['MovieId'])[["Rate1C", "Rate2C", "Rate3C", "Rate4C", "Rate5C"]].apply(
    lambda x: x.astype(int).sum())
self.movie_df = self.movie_df.merge(self.movie_df_tmp, left_on='MovieId', right_on='MovieId')
self.movie_df = pd.DataFrame(self.movie_df.values,
                             columns=['MovieId', 'Rate1C', 'Rate2C',
                                      'Rate3C', 'Rate4C', 'Rate5C'])
Try with pd.crosstab:
pd.crosstab(df['Movie'], df['Rate'])
Rate 1 2 4 5
Movie
2124 0 1 0 0
3029 0 0 0 1
5821 0 0 1 0
7582 1 0 0 0
17479 1 0 0 0
Fix the axis name and column names with rename + reset_index + rename_axis:
new_df = (
    pd.crosstab(df['Movie'], df['Rate'])
    .rename(columns=lambda c: f'Rate_{c}_Count')
    .reset_index()
    .rename_axis(columns=None)
)
Movie Rate_1_Count Rate_2_Count Rate_4_Count Rate_5_Count
0 2124 0 1 0 0
1 3029 0 0 0 1
2 5821 0 0 1 0
3 7582 1 0 0 0
4 17479 1 0 0 0
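Note that the sample contains no rating of 3, so crosstab produces no Rate_3_Count column. If all five columns are needed regardless of which ratings actually occur, a reindex before the rename covers it:

new_df = (
    pd.crosstab(df['Movie'], df['Rate'])
    .reindex(columns=[1, 2, 3, 4, 5], fill_value=0)  # force Rate 1-5 columns
    .rename(columns=lambda c: f'Rate_{c}_Count')
    .reset_index()
    .rename_axis(columns=None)
)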
This should give you the desired output:
grouper = df.groupby(['Movie', 'Rate']).size()
dg = pd.DataFrame()
dg['Movie'] = df['Movie'].unique()
for i in [1, 2, 3, 4, 5]:
    dg['Rate_' + str(i) + '_Count'] = dg['Movie'].apply(
        lambda x: grouper[x, i] if (x, i) in grouper.index else 0)
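For what it's worth, the same grouper idea can build the whole table without the Python-level loop by unstacking the ratings into columns, equivalent to the crosstab approach above:

dg = (
    df.groupby(['Movie', 'Rate']).size()
    .unstack(fill_value=0)                           # ratings become columns
    .reindex(columns=[1, 2, 3, 4, 5], fill_value=0)  # keep all five ratings
    .rename(columns=lambda c: f'Rate_{c}_Count')
    .reset_index()
    .rename_axis(columns=None)
)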

Not able to store the multi-index csv file using pandas

I have a dataframe which looks like:
JAPE_feature
100 200 2200 2600 4600
did offset word
0 0 aa 0 1 0 0 0
0 11 bf 0 1 0 0 0
0 12 vf 0 1 0 0 0
0 13 rw 1 0 0 0 0
0 14 asd 1 0 0 0 0
0 16 dsdd 0 0 1 0 0
0 18 wd 0 0 0 1 0
0 20 wsw 0 0 0 1 0
0 21 sd 0 0 0 0 1
Now, here I am trying to save this dataframe in csv format:
df.to_csv('data.csv')
So it gets stored with the top-level name JAPE_feature repeated in the first header row, once above each sub-column.
Now, here I am trying to save it without repeating the top-level name across new columns; JAPE_feature should appear only once, with the 5 sub-features under it:
JAPE_feature
100 | 200 | 2200 | 2600 | 4600
The sub-columns should stay like this; it should not create separate top-level columns.
I think the best here is to convert the DataFrame to Excel, if you need the first level of the column MultiIndex merged:
df.to_excel('data.xlsx')
If you want csv then it is a problem; it is necessary to change the MultiIndex to replace duplicated values with empty strings:
print (df.columns)
MultiIndex([('JAPE_feature', 100),
('JAPE_feature', 200),
('JAPE_feature', 2200),
('JAPE_feature', 2600),
('JAPE_feature', 4600)],
)
cols = df.columns.to_frame()
cols[0] = cols[0].mask(cols[0].duplicated(), '')
df.columns = pd.MultiIndex.from_arrays([cols[0], cols[1]])
print (df.columns)
MultiIndex([('JAPE_feature', 100),
( '', 200),
( '', 2200),
( '', 2600),
( '', 4600)],
names=[0, 1])
df.to_csv('data.csv')

Convert Dictionary to Pandas in Python

I have a dict as follows:
data_dict = {'1.160.139.117': ['712907','742068'],
'1.161.135.205': ['667386','742068'],
'1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a dataframe that has columns representing the values of the dictionary and rows representing the keys of the dictionary, as below?
The best option is #4:
pd.get_dummies(df.stack()).sum(level=0)
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
  .set_index(0, append=True)['level_1']\
  .unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a little reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
  .pivot(index='level_0', columns='Values')['level_1']\
  .notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out a short solution, the fastest so far)
pd.get_dummies(df.stack()).sum(level=0)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
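A note for newer pandas: sum(level=0) was deprecated in the 1.x line and removed in pandas 2.0, so on current versions the same idea is spelled with an explicit groupby:

# modern equivalent of pd.get_dummies(df.stack()).sum(level=0)
pd.get_dummies(df.stack()).groupby(level=0).sum()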
