I have a pandas DataFrame like this:
Movie Rate
0 5821 4
1 2124 2
2 7582 1
3 3029 5
4 17479 1
Both the movie and the rating can be repeated. I need to transform this DataFrame into something like this:
Movie Rate_1_Count Rate_2_Count ... Rate_5_Count
0 5821 20 1 5
1 2124 2 0 99
2 7582 50 22 22
...
where the movie ids are unique and Rate_{Number}_Count is the number of ratings for that movie that are equal to {Number}.
I already accomplished this with the code below, which I believe is very messy. There must be a neater way to do it. Can anyone help me with it?
self.movie_df_tmp = self.rating_df[['MovieId', 'Rate']]
self.movie_df_tmp['RaCount'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('count')
self.movie_df_tmp['Sum'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('sum')
self.movie_df_tmp['NORC'] = self.movie_df_tmp.groupby(['MovieId', 'Rate'])['Rate'].transform('count')
self.movie_df_tmp = self.movie_df_tmp.drop_duplicates()
self.movie_df_tmp['Rate1C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 1]['NORC']
self.movie_df_tmp['Rate2C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 2]['NORC']
self.movie_df_tmp['Rate3C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 3]['NORC']
self.movie_df_tmp['Rate4C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 4]['NORC']
self.movie_df_tmp['Rate5C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 5]['NORC']
self.movie_df_tmp = self.movie_df_tmp.replace(np.nan, 0)
self.movie_df = self.movie_df_tmp[['MovieId', 'RaCount', 'Sum']].drop_duplicates()
self.movie_df_tmp = self.movie_df_tmp.drop(columns=['Rate', 'NORC', 'Sum', 'RaCount'])
self.movie_df_tmp = self.movie_df_tmp.groupby(['MovieId'])[['Rate1C', 'Rate2C', 'Rate3C', 'Rate4C', 'Rate5C']].apply(
    lambda x: x.astype(int).sum())
self.movie_df = self.movie_df.merge(self.movie_df_tmp, left_on='MovieId', right_on='MovieId')
self.movie_df = pd.DataFrame(self.movie_df.values,
columns=['MovieId', 'Rate1C', 'Rate2C', 'Rate3C', 'Rate4C',
'Rate5C'])
Try with pd.crosstab:
pd.crosstab(df['Movie'], df['Rate'])
Rate 1 2 4 5
Movie
2124 0 1 0 0
3029 0 0 0 1
5821 0 0 1 0
7582 1 0 0 0
17479 1 0 0 0
Fix the axis and column names with rename + reset_index + rename_axis:
new_df = (
pd.crosstab(df['Movie'], df['Rate'])
.rename(columns=lambda c: f'Rate_{c}_Count')
.reset_index()
.rename_axis(columns=None)
)
Movie Rate_1_Count Rate_2_Count Rate_4_Count Rate_5_Count
0 2124 0 1 0 0
1 3029 0 0 0 1
2 5821 0 0 1 0
3 7582 1 0 0 0
4 17479 1 0 0 0
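Note that there is no Rate_3_Count column above because rating 3 never occurs in the sample. If you always want all five count columns, one hedge (assuming ratings are always 1-5) is to reindex before renaming:
new_df = (
    pd.crosstab(df['Movie'], df['Rate'])
      .reindex(columns=range(1, 6), fill_value=0)  # force columns for ratings 1..5
      .rename(columns=lambda c: f'Rate_{c}_Count')
      .reset_index()
      .rename_axis(columns=None)
)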
This should give you the desired output:
grouper = df.groupby(['Movie', 'Rate']).size()
dg = pd.DataFrame()
dg['Movie'] = df['Movie'].unique()
for i in [1, 2, 3, 4, 5]:
    dg['Rate_' + str(i) + '_Count'] = dg['Movie'].apply(
        lambda x: grouper[x, i] if (x, i) in grouper.index else 0)
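A sketch of the same idea without the Python-level apply: unstack the group sizes into columns (again assuming ratings are always 1-5):
counts = (df.groupby(['Movie', 'Rate']).size()
            .unstack(fill_value=0)                        # one column per rating
            .reindex(columns=[1, 2, 3, 4, 5], fill_value=0))
counts.columns = [f'Rate_{i}_Count' for i in counts.columns]
dg = counts.reset_index()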
I'm trying to append predictions to my original data, which is:
product_id date views wishlists cartadds orders order_units gmv score
mp000000000001321 01-09-2022 0 0 0 0 0 0 0
mp000000000001321 02-09-2022 0 0 0 0 0 0 0
mp000000000001321 03-09-2022 0 0 0 0 0 0 0
mp000000000001321 04-09-2022 0 0 0 0 0 0 0
I have sequence lengths of [1, 3], and for each sequence length I have a prediction. I want to add those predictions to my original data so that my output looks like this:
product_id date views wishlists cartadds orders order_units gmv score prediction sequence_length
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.75 1
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 02-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 03-09-2022 0 0 0 0 0 0 0 5.88 3
I have tried the following:
df1 = df_batch.head(sequence_length)
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
merged_df.to_csv('data_prediction'+str(sequence_length)+'.csv', index_label='product_id')
but this only saves the data for the last product_id that was sent, and it saves each sequence length to a different CSV. I want everything in one CSV instead. How do I do that?
Edit: sample predictions_dict:
{'mp000000000001321': {'sequence_length': 1, 'prediction': 5.75}}
{'mp000000000001321': {'sequence_length': 3, 'prediction': 5.88}}
So, I found a fix:
df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
new_df = pd.concat([new_df, merged_df], ignore_index=True)
This way I'm able to get the desired output for unique product ids.
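For completeness, a sketch of the full loop this fix lives in; all_predictions (a mapping from each product_id to its predictions_dict) is a hypothetical name for however the predictions are collected:
import pandas as pd

new_df = pd.DataFrame()  # accumulate every product / sequence length here
for product_id, predictions_dict in all_predictions.items():  # hypothetical container
    sequence_length = predictions_dict[product_id]['sequence_length']
    df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
    dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
    dfff.index.names = ['product_id']
    new_df = pd.concat([new_df, df1.merge(dfff, on='product_id')],
                       ignore_index=True)
new_df.to_csv('data_prediction.csv', index=False)  # a single CSV for everything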
#df =
order_number Product1 ... Product16 Product17
0 4329374937 1 ... 0 0
1 3483872349 1 ... 0 0
2 2394287383 1 ... 0 0
3 3423984902 1 ... 1 0
4 9378374873 0 ... 0 0
Batch1 = ["Product1", "Product2", "Product 6"]
for indices in df.index:
    for column in columns:
        if df[column] > 0 and in Batch1 df[B1] = True
        else df[B1] = False
print(df.head))
I am trying to look through each order number and see whether the products ordered (values greater than 0) are within my listed batch. I want to create a new boolean column for each row. I am getting a syntax error.
From what I understand, you want to take the columns in your batch, add up the quantities, and see where the total is > 0. So:
batch_1 = ["Product1", "Product2", "Product6"]
df['B1'] = df[batch_1].sum(axis=1)>0
df
output:
order_number Product1 Product2 Product6 Product16 Product17 B1
-- -------------- ---------- ---------- ---------- ----------- ----------- -----
0 4329374937 1 1 3 0 0 True
1 3483872349 1 0 1 0 0 True
2 2394287383 1 0 2 0 0 True
3 3423984902 1 0 1 1 0 True
4 9378374873 0 0 0 0 0 False
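The same pattern scales to several batches with a small loop; a sketch, where the B2 list is hypothetical:
batches = {'B1': ["Product1", "Product2", "Product6"],
           'B2': ["Product16", "Product17"]}  # B2 is made up for illustration
for name, cols in batches.items():
    df[name] = df[cols].sum(axis=1) > 0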
Confusing title, let me explain. I have 2 dataframes like this:
dataframe named df1 looks like this (with millions of rows in the original):
id text c1
1 Hello world how are you people 1
2 Hello people I am fine people 1
3 Good Morning people -1
4 Good Evening -1
Dataframe named df2 looks like this:
Word count Points Percentage
hello 2 2 100
world 1 1 100
how 1 1 100
are 1 1 100
you 1 1 100
people 3 1 33.33
I 1 1 100
am 1 1 100
fine 1 1 100
Good 2 -2 -100
Morning 1 -1 -100
Evening 1 -1 -100
df2 columns explanation:
count means the total number of times that word appeared in df1
points is points given to each word by some kind of algorithm
percentage = points/count*100
Now, I want to add 40 new columns to df1, according to the points & percentage. They will look like this:
perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-30_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2
perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-30_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_1 perc_70_1 perc_80_1 perc_90_1
Let me break it down. The column name contains 3 parts:
1.) perc is just a string; it means nothing.
2.) A number from the range -90 to +90. For example, -90 means the percentage in df2 is -90. So if a word has a percentage value in the range 81-90, there will be a value of 1 in that row, in the column named perc_-80_xx. The xx is the third part.
3.) The third part is the count. Here I want two types of counts: 1 and 2. Following the example in point 2, if the word count is in the range 0 to 1, the value will be 1 in the perc_-80_1 column. If the word count is 2 or more, the value will be 1 in the perc_-80_2 column.
I hope it is not too confusing.
Use:
# change to the previous answer: also aggregate id for matching
df2 = (df.drop_duplicates(['id', 'Word'])
         .groupby('Word', sort=False)
         .agg({'c1': ['sum', 'size'], 'id': 'first'}))
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum': 'Points', 'c1size': 'Totalcount', 'idfirst': 'id'})
df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100

s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
#s2 = np.where(df2['Totalcount'].isin([0, 1]), '1', '2')
# create the column name by joining the parts
df2['new'] = 'perc_' + s1 + '_' + s2

# create an indicator DataFrame
df3 = pd.get_dummies(df2[['id', 'new']].drop_duplicates().set_index('id'),
                     prefix='',
                     prefix_sep='').max(level=0)
print (df3)

# reindex to add the missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
cols = (c + '1').append(c + '2')
# join back to the original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')
print (df)
id text c1 perc_-100_1 perc_-90_1 \
0 1 Hello world how are you people 1 0 0
1 2 Hello people I am fine people 1 0 0
2 3 Good Morning people -1 1 0
3 4 Good Evening -1 1 0
perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 ... perc_10_2 \
0 0 0 0 0 0 ... 0
1 0 0 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 \
0 0 1 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
perc_80_2 perc_90_2 perc_100_2
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
[4 rows x 45 columns]
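One assumption worth spelling out: the df this answer starts from is the word-level expansion of df1, with one row per (id, Word) pair and the c1 value repeated. A sketch of building it (explode needs pandas 0.25+):
df = df1.assign(Word=df1['text'].str.split()).explode('Word')
Also note that on pandas 2.0+ both Series.append and max(level=0) are gone: build cols with pd.concat([c + '1', c + '2']) and replace .max(level=0) with .groupby(level=0).max().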
I have a dataframe in Pandas (subset below).
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1 IN & 200D_MA both = 1, TEST = 1
11/4/2013 0 1 1 previous TEST row = 1 & 200D_MA = 1, TEST = 1
11/5/2013 0 1 1 previous TEST row = 1 & 200D_MA = 1, TEST = 1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0 previous TEST row = 1 & 200D_MA = 0, TEST = 0
This is easy to do in Excel, so I thought it would be easy to do in Python. I have this code using nested np.where formulas:
df3['TEST'] = np.where((df3['IN'] == 1) & (df3['200D_MA'] == 1), 1,
                       np.where((df3['TEST'].shift(-1) == 1)
                                & (df3['200D_MA'] == 1), 1, 0))
but it throws KeyError: 'IN', presumably because I am using a condition on a column that has not been created yet. Can anyone help me figure out how to do this?
Seems like you need a conditional ffill:
# seed TEST with 1 where IN == 1 (NaN elsewhere)
df['TEST'] = df.loc[df.IN == 1, 'IN']
# forward-fill those 1s, but only through rows where 200D_MA == 1
df.loc[df['200D_MA'] == 1, 'TEST'] = df.loc[df['200D_MA'] == 1, 'TEST'].ffill()
df.fillna(0, inplace=True)
df.TEST = df.TEST.astype(int)
df
Out[349]:
DATE IN 200D_MA TEST
0 10/30/2013 0 1 0
1 10/31/2013 0 1 0
2 11/1/2013 1 1 1
3 11/4/2013 0 1 1
4 11/5/2013 0 1 1
5 11/6/2013 0 1 1
6 11/7/2013 0 1 1
7 11/8/2013 0 1 1
8 11/11/2013 0 0 0
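If the recurrence ever grows beyond what a conditional ffill can express, a plain loop is a safe fallback; a sketch of the same rule:
prev = 0
out = []
for in_, ma in zip(df['IN'], df['200D_MA']):
    # TEST is 1 only while 200D_MA is 1, starting from a row where IN is 1
    prev = 1 if ma == 1 and (in_ == 1 or prev == 1) else 0
    out.append(prev)
df['TEST'] = out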
I think you can use rolling to look at the previous TEST row:
s = (df['IN'] == 1) | (df['200D_MA'] == 1)
prev = s.astype(int).rolling(2).min().shift(1).fillna(0)
df['TEST'] = (s & prev.astype(bool)).astype(int)
Output:
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1
11/4/2013 0 1 1
11/5/2013 0 1 1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0
I have a dict as follows:
data_dict = {'1.160.139.117': ['712907','742068'],
'1.161.135.205': ['667386','742068'],
'1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a dataframe whose columns represent the values of the dictionary and whose rows represent the keys, as below?
 326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
The best option is #4:
pd.get_dummies(df.stack()).sum(level=0)
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
.set_index(0,append=True)['level_1']\
.unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a little reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
.pivot(index='level_0',columns='Values')['level_1']\
.notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out a short solution, the fastest so far):
pd.get_dummies(df.stack()).sum(level=0)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
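One caveat: sum(level=0) was removed in pandas 2.0, so on recent versions the same idea is spelled with a groupby:
pd.get_dummies(df.stack()).groupby(level=0).sum()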