Not able to store the multi-index csv file using pandas - python

I have a dataframe which looks like:
JAPE_feature
100 200 2200 2600 4600
did offset word
0 0 aa 0 1 0 0 0
0 11 bf 0 1 0 0 0
0 12 vf 0 1 0 0 0
0 13 rw 1 0 0 0 0
0 14 asd 1 0 0 0 0
0 16 dsdd 0 0 1 0 0
0 18 wd 0 0 0 1 0
0 20 wsw 0 0 0 1 0
0 21 sd 0 0 0 0 1
Now I am trying to save this dataframe in csv format:
df.to_csv('data.csv')
It gets stored with the JAPE_feature label repeated over every one of the five sub-columns. I am trying to save it without creating those new columns in the JAPE_feature header row, so it would have the 5 sub-features under one column only:
JAPE_FEATURES
100 | 200 | 2200 | 2600 | 4600
The sub-columns should look like this; it should not create a separate JAPE_feature column for each one.

I think the best option here is to convert the DataFrame to Excel, if you need the first level of the MultiIndex columns merged:
df.to_excel('data.xlsx')
If you want csv then it is a problem; it is necessary to change the MultiIndex and replace the duplicated values with empty strings:
print(df.columns)

MultiIndex([('JAPE_feature', 100),
            ('JAPE_feature', 200),
            ('JAPE_feature', 2200),
            ('JAPE_feature', 2600),
            ('JAPE_feature', 4600)],
           )

# blank out the repeated labels in the first level, keep the second level
cols = df.columns.to_frame()
cols[0] = cols[0].mask(cols[0].duplicated(), '')
df.columns = pd.MultiIndex.from_arrays([cols[0], cols[1]])

print(df.columns)

MultiIndex([('JAPE_feature', 100),
            ('', 200),
            ('', 2200),
            ('', 2600),
            ('', 4600)],
           names=[0, 1])

df.to_csv('data.csv')
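For reference, a runnable end-to-end version of the same fix on a two-row cut of the example data (the inline reconstruction of the frame is my addition, not part of the question):

import pandas as pd

# rebuild a small version of the example frame
columns = pd.MultiIndex.from_product([['JAPE_feature'], [100, 200, 2200, 2600, 4600]])
index = pd.MultiIndex.from_tuples([(0, 0, 'aa'), (0, 11, 'bf')],
                                  names=['did', 'offset', 'word'])
df = pd.DataFrame([[0, 1, 0, 0, 0], [0, 1, 0, 0, 0]], index=index, columns=columns)

# blank out the repeated first-level labels, then write the csv
cols = df.columns.to_frame()
cols[0] = cols[0].mask(cols[0].duplicated(), '')
df.columns = pd.MultiIndex.from_arrays([cols[0], cols[1]])
df.to_csv('data.csv')

The first header row of data.csv should now show JAPE_feature once, followed by empty cells, with the numeric sub-features on the second header row.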

Related

Appending 2 dataframes having duplicates without removing the duplicates

I'm trying to append predictions to my original data, which is:
product_id date views wishlists cartadds orders order_units gmv score
mp000000000001321 01-09-2022 0 0 0 0 0 0 0
mp000000000001321 02-09-2022 0 0 0 0 0 0 0
mp000000000001321 03-09-2022 0 0 0 0 0 0 0
mp000000000001321 04-09-2022 0 0 0 0 0 0 0
I have sequence lengths of [1, 3], and for each sequence length I have a prediction. I want to add those predictions to my original data so that my output is like this:
product_id date views wishlists cartadds orders order_units gmv score prediction sequence_length
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.75 1
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 02-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 03-09-2022 0 0 0 0 0 0 0 5.88 3
I have tried the following:
df1 = df_batch.head(sequence_length)
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
merged_df.to_csv('data_prediction'+str(sequence_length)+'.csv', index_label='product_id')
but this only saves the data of the last product_id which was sent, and it saves each sequence length in a different csv. I want everything to be in 1 csv instead. How do I do that?
Edit: sample predictions_dict:
{'mp000000000001321': {'sequence_length': 1, 'prediction': 5.75}}
{'mp000000000001321': {'sequence_length': 3, 'prediction': 5.88}}
So, I found a fix:
df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
new_df = pd.concat([new_df, merged_df], ignore_index=True)
This way, with new_df initialised to an empty DataFrame before the loop, I'm able to get the desired output for unique product ids.
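For completeness, a minimal sketch of the surrounding loop; here `results` is a hypothetical iterable of (product_id, sequence_length, predictions_dict) triples, one per prediction, as in the question:

import pandas as pd

new_df = pd.DataFrame()  # accumulator; must exist before the loop

# `results` is assumed, not from the question: one triple per prediction
for product_id, sequence_length, predictions_dict in results:
    df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
    dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
    dfff.index.names = ['product_id']
    merged_df = df1.merge(dfff, on='product_id')
    new_df = pd.concat([new_df, merged_df], ignore_index=True)

# one csv for everything, instead of one per sequence length
new_df.to_csv('data_prediction.csv', index=False)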

Merging 2 dataframes by date column (no look-ahead bias)

I am trying to create a python function that takes in 2 dataframes (dfA, dfB) and merges them based on their date column. When merging, B looks for the nearest date in A that is either equal to or comes before the given date. This is to prevent the data in dfAB from looking into the future (which is why dfAB.iloc[3]['date'] = 1/4/21 and not 1/9/21).
dfA
date i
0 1/1/21 0
1 1/3/21 0
2 1/4/21 0
3 1/10/21 0
dfB
date j k
0 1/1/21 0 0
1 1/2/21 0 0
2 1/3/21 0 0
3 1/9/21 0 0
4 1/12/21 0 0
dfAB (note that for each row of dfB, there is a row of dfAB)
date j k i
0 1/1/21 0 0 0
1 1/1/21 0 0 0
2 1/3/21 0 0 0
3 1/4/21 0 0 0
4 1/10/21 0 0 0
The values in columns i, j, k are just arbitrary values
To do this we can use pd.merge_asof and a bit of trickery to replace each date from dfB with the matched (equal-or-earlier) date from dfA:
# a.csv
date i
1/1/21 0
1/3/21 0
1/4/21 0
1/10/21 0
# b.csv
date j k
1/1/21 0 0
1/2/21 0 0
1/3/21 0 0
1/9/21 0 0
1/12/21 0 0
# merge_ab.py
import pandas as pd

dfA = pd.read_csv(
    'a.csv',
    delim_whitespace=True,
    parse_dates=['date'],
    dayfirst=True,
)
dfB = pd.read_csv(
    'b.csv',
    delim_whitespace=True,
    parse_dates=['date'],
    dayfirst=True,
)

# keep a copy of dfA's dates so they survive the merge
dfA['new_date'] = dfA['date']
dfAB = pd.merge_asof(dfB, dfA, on='date', direction='backward')
# overwrite the dfB dates with the matched dfA dates
dfAB['date'] = dfAB['new_date']
dfAB = dfAB.drop(columns=['new_date'])
print(dfAB)
# date j k i
# 0 2021-01-01 0 0 0
# 1 2021-01-01 0 0 0
# 2 2021-03-01 0 0 0
# 3 2021-04-01 0 0 0
# 4 2021-10-01 0 0 0
Here pd.merge_asof is doing the heavy lifting. We merge the rows of dfB backwards with the rows of dfA, which ensures that any row of dfAB only contains data from on or before the corresponding row in dfB. We do a little song and dance to copy the date column in dfA to new_date, and then copy that over to the date column in dfAB to get the desired output.
It's not 100% clear to me that you want direction='backward' since all your sample data is 0, but if it doesn't look right you can always switch to direction='forward'. Note also that dayfirst=True parses 1/3/21 as 1 March 2021 (hence the dates in the printed output); if your dates are month-first, drop it and the merge logic is unchanged.
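A file-free variant of the same merge, with the dates parsed month-first so the output matches the dfAB table in the question (a minimal sketch):

import pandas as pd

# the example frames built inline; ambiguous dates parse month-first by default
dfA = pd.DataFrame({'date': pd.to_datetime(['1/1/21', '1/3/21', '1/4/21', '1/10/21']),
                    'i': [0, 0, 0, 0]})
dfB = pd.DataFrame({'date': pd.to_datetime(['1/1/21', '1/2/21', '1/3/21', '1/9/21', '1/12/21']),
                    'j': [0, 0, 0, 0, 0],
                    'k': [0, 0, 0, 0, 0]})

dfA['new_date'] = dfA['date']
dfAB = pd.merge_asof(dfB, dfA, on='date', direction='backward')
dfAB['date'] = dfAB['new_date']
dfAB = dfAB.drop(columns=['new_date'])
print(dfAB)
#         date  j  k  i
# 0 2021-01-01  0  0  0
# 1 2021-01-01  0  0  0
# 2 2021-01-03  0  0  0
# 3 2021-01-04  0  0  0
# 4 2021-01-10  0  0  0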

Calculate count of a numeric column into new columns Pandas DataFrame

I have a pandas DataFrame like this:
Movie Rate
0 5821 4
1 2124 2
2 7582 1
3 3029 5
4 17479 1
Both the movie and the rating can be repeated. I need to transform this DataFrame into something like this:
Movie Rate_1_Count Rate_2_Count ... Rate_5_Count
0 5821 20 1 5
1 2124 2 0 99
2 7582 50 22 22
...
in which the movie ids are unique and Rate_{Number}_Count is the count of the ratings for that movie that are equal to {Number}.
I already accomplished this task using the code below which I believe is very messy. I guess there must be a neater way to do that. Can anyone help me with it?
self.movie_df_tmp = self.rating_df[['MovieId', 'Rate']]
self.movie_df_tmp['RaCount'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('count')
self.movie_df_tmp['Sum'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('sum')
self.movie_df_tmp['NORC'] = self.movie_df_tmp.groupby(['MovieId', 'Rate'])['Rate'].transform('count')
self.movie_df_tmp = self.movie_df_tmp.drop_duplicates()
self.movie_df_tmp['Rate1C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 1]['NORC']
self.movie_df_tmp['Rate2C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 2]['NORC']
self.movie_df_tmp['Rate3C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 3]['NORC']
self.movie_df_tmp['Rate4C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 4]['NORC']
self.movie_df_tmp['Rate5C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 5]['NORC']
self.movie_df_tmp = self.movie_df_tmp.replace(np.nan, 0)
self.movie_df = self.movie_df_tmp[['MovieId', 'RaCount', 'Sum']].drop_duplicates()
self.movie_df_tmp = self.movie_df_tmp.drop(columns=['Rate', 'NORC', 'Sum', 'RaCount'])
self.movie_df_tmp = self.movie_df_tmp.groupby(['MovieId'])["Rate1C", "Rate2C", "Rate3C", "Rate4C", "Rate5C"].apply(
    lambda x: x.astype(int).sum())
self.movie_df = self.movie_df.merge(self.movie_df_tmp, left_on='MovieId', right_on='MovieId')
self.movie_df = pd.DataFrame(self.movie_df.values,
                             columns=['MovieId', 'Rate1C', 'Rate2C', 'Rate3C', 'Rate4C', 'Rate5C'])
Try with pd.crosstab:
pd.crosstab(df['Movie'], df['Rate'])
Rate 1 2 4 5
Movie
2124 0 1 0 0
3029 0 0 0 1
5821 0 0 1 0
7582 1 0 0 0
17479 1 0 0 0
Fix the axis and column names with rename + reset_index + rename_axis:
new_df = (
    pd.crosstab(df['Movie'], df['Rate'])
      .rename(columns=lambda c: f'Rate_{c}_Count')
      .reset_index()
      .rename_axis(columns=None)
)
Movie Rate_1_Count Rate_2_Count Rate_4_Count Rate_5_Count
0 2124 0 1 0 0
1 3029 0 0 0 1
2 5821 0 0 1 0
3 7582 1 0 0 0
4 17479 1 0 0 0
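One caveat: pd.crosstab only emits columns for ratings that actually occur (the sample above has no rating 3). To guarantee all five Rate_{n}_Count columns, reindex the crosstab first; a small sketch:

# force columns for all five ratings, filling absent ones with 0
counts = pd.crosstab(df['Movie'], df['Rate']).reindex(columns=[1, 2, 3, 4, 5], fill_value=0)
new_df = (
    counts.rename(columns=lambda c: f'Rate_{c}_Count')
          .reset_index()
          .rename_axis(columns=None)
)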
This should give you the desired output:
grouper = df.groupby(['Movie', 'Rate']).size()
dg = pd.DataFrame()
dg['Movie'] = df['Movie'].unique()
for i in [1, 2, 3, 4, 5]:
    # 0 when a (movie, rating) pair never occurs in the data
    dg['Rate_' + str(i) + '_Count'] = dg['Movie'].apply(
        lambda x: grouper[x, i] if (x, i) in grouper.index else 0)

Python Dataframe Updating Flags

I am creating four columns labeled flagMin, flagMax, flagLow, flagUp. I update these dataframe columns on each pass through the loop, but my original data is being overridden. I would like to keep the previous data in the 4 columns, since they contain 1s where true.
import pandas as pd
import numpy as np

df = pd.read_excel('help test 1.xlsx')

# groupby separates the different Name parameters within the Name column,
# finding the lowest of the "Min"/"Lower" columns and the highest of the
# "Max"/"Upper" columns
flagMin = df.groupby(['Name'], as_index=False)['Min'].min()
flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
flagLow = df.groupby(['Name'], as_index=False)['Lower'].min()
flagUp = df.groupby(['Name'], as_index=False)['Upper'].max()
print(flagMin)
print(flagMax)
print(flagLow)
print(flagUp)

num = len(flagMin)  # size of 2, works for all flags in this case
for i in range(num):
    # iterate through each row of parameters, column 1 (Min/Max/Lower/Upper)
    colMin = flagMin.iloc[i, 1]
    colMax = flagMax.iloc[i, 1]
    colLow = flagLow.iloc[i, 1]
    colUp = flagUp.iloc[i, 1]
    # set flags: '1' if the row's value matches the group extreme, else '0'
    df['flagMin'] = np.where(df['Min'] == colMin, '1', '0')
    df['flagMax'] = np.where(df['Max'] == colMax, '1', '0')
    df['flagLow'] = np.where(df['Lower'] == colLow, '1', '0')
    df['flagUp'] = np.where(df['Upper'] == colUp, '1', '0')
print(df)
The 4 dataframes for each flag, printed above:
Name Min
0 Vo 12.8
1 Vi -51.3
Name Max
0 Vo 39.9
1 Vi -25.7
Name Low
0 Vo -46.0
1 Vi -66.1
Name Up
0 Vo 94.3
1 Vi -14.1
Output 1st iteration
flagMax flagLow flagUp
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 0
4 0 0 0
5 0 0 0
6 0 0 1
7 0 1 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
Output 2nd Iteration
flagMax flagLow flagUp
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 1 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 1 0
16 0 0 0
17 0 0 0
I lose the 1s from rows 3, 6, 7. I would like to keep the 1s from both sets of data. Thank you.
Just set to '1' only the elements you want to update, instead of reassigning the whole column.
import pandas as pd
import numpy as np

df = pd.read_excel('help test 1.xlsx')

flagMin = df.groupby(['Name'], as_index=False)['Min'].min()
flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
flagLow = df.groupby(['Name'], as_index=False)['Lower'].min()
flagUp = df.groupby(['Name'], as_index=False)['Upper'].max()
print(flagMin)
print(flagMax)
print(flagLow)
print(flagUp)

num = len(flagMin)  # size of 2, works for all flags in this case

# initialise the flag columns once, outside the loop
df['flagMin'] = '0'
df['flagMax'] = '0'
df['flagLow'] = '0'
df['flagUp'] = '0'

for i in range(num):
    colMin = flagMin.iloc[i, 1]
    colMax = flagMax.iloc[i, 1]
    colLow = flagLow.iloc[i, 1]
    colUp = flagUp.iloc[i, 1]
    # flip only the matching rows to '1'; .loc avoids chained-assignment warnings
    df.loc[df['Min'] == colMin, 'flagMin'] = '1'
    df.loc[df['Max'] == colMax, 'flagMax'] = '1'
    df.loc[df['Lower'] == colLow, 'flagLow'] = '1'
    df.loc[df['Upper'] == colUp, 'flagUp'] = '1'
print(df)
P.S. I don't know why you are using strings of '0' and '1' instead of just using 0 and 1 but that's up to you.
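As an aside, the loop can be avoided entirely with groupby().transform, which compares every row against its own group's extreme in one pass, so nothing gets overwritten between iterations. A sketch, assuming the same column layout as above (and producing integer flags instead of strings):

# vectorised alternative: no loop, one comparison per flag column
g = df.groupby('Name')
df['flagMin'] = (df['Min'] == g['Min'].transform('min')).astype(int)
df['flagMax'] = (df['Max'] == g['Max'].transform('max')).astype(int)
df['flagLow'] = (df['Lower'] == g['Lower'].transform('min')).astype(int)
df['flagUp'] = (df['Upper'] == g['Upper'].transform('max')).astype(int)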

How can I add new columns using another dataframe (related to string columns) in Pandas

Confusing title, let me explain. I have 2 dataframes like this:
Dataframe named df1 looks like this (with millions of rows in the original):
id text c1
1 Hello world how are you people 1
2 Hello people I am fine people 1
3 Good Morning people -1
4 Good Evening -1
Dataframe named df2 looks like this:
Word count Points Percentage
hello 2 2 100
world 1 1 100
how 1 1 100
are 1 1 100
you 1 1 100
people 3 1 33.33
I 1 1 100
am 1 1 100
fine 1 1 100
Good 2 -2 -100
Morning 1 -1 -100
Evening 1 -1 -100
df2 columns explanation:
count means the total number of times that word appeared in df1
points is points given to each word by some kind of algorithm
percentage = points/count*100
Now, I want to add 40 new columns to df1, according to the points & percentage. They will look like this:
perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2
perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_1 perc_70_1 perc_80_1 perc_90_1
Let me break it down. The column name contains 3 parts:
1.) perc is just a string, means nothing
2.) Numbers in the range -90 to +90. Here -90 means the percentage is -90 in df2. Now for example, if a word has a percentage value in the range 81-90, then there will be a value of 1 in that row, in the column named perc_-80_xx. The xx is the third part.
3.) The third part is the count. Here I want two types of counts: 1 and 2. As in the example in point 2, if the word count is in the range 0 to 1, then the value will be 1 in the perc_-80_1 column. If the word count is 2 or more, then the value will be 1 in the perc_-80_2 column.
I hope it is not too confusing.
Use:
# changed from the previous answer: add id for matching
df2 = (df.drop_duplicates(['id','Word'])
         .groupby('Word', sort=False)
         .agg({'c1':['sum','size'], 'id':'first'}))
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum':'Points','c1size':'Totalcount','idfirst':'id'})
df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100

# bucket the percentage to the nearest 10, truncated toward zero
s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
# '1' when the word occurs once, '2' when it occurs 2 or more times
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
#s2 = np.where(df2['Totalcount'].isin([0,1]), '1', '2')

# create the column name by joining the parts
df2['new'] = 'perc_' + s1 + '_' + s2

# create an indicator DataFrame, one row per id
df3 = pd.get_dummies(df2[['id','new']].drop_duplicates().set_index('id'),
                     prefix='',
                     prefix_sep='').max(level=0)
print(df3)

# reindex to add the missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
cols = (c + '1').append(c + '2')

# join back to the original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')
print(df)
id text c1 perc_-100_1 perc_-90_1 \
0 1 Hello world how are you people 1 0 0
1 2 Hello people I am fine people 1 0 0
2 3 Good Morning people -1 1 0
3 4 Good Evening -1 1 0
perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 ... perc_10_2 \
0 0 0 0 0 0 ... 0
1 0 0 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 \
0 0 1 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
perc_80_2 perc_90_2 perc_100_2
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
[4 rows x 45 columns]
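For intuition, here is how the bucket label is built for three of the words above (hello: count 2, 100%; people: count 3, 33.33%; Morning: count 1, -100%); a standalone snippet:

import numpy as np
import pandas as pd

pct = pd.Series([100.0, 33.33, -100.0])
count = pd.Series([2, 3, 1])
s1 = pct.div(10).astype(int).mul(10).astype(str)  # '100', '30', '-100' (astype(int) truncates toward zero)
s2 = np.where(count == 1, '1', '2')               # '2', '2', '1'
print(('perc_' + s1 + '_' + s2).tolist())
# ['perc_100_2', 'perc_30_2', 'perc_-100_1']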
