I have a pandas DataFrame like this:
Movie Rate
0 5821 4
1 2124 2
2 7582 1
3 3029 5
4 17479 1
Both the movie and the rating can be repeated. I need to transform this DataFrame into something like this:
Movie Rate_1_Count Rate_2_Count ... Rate_5_Count
0 5821 20 1 5
1 2124 2 0 99
2 7582 50 22 22
...
where the movie ids are unique and Rate_{Number}_Count is the number of ratings for that movie that are equal to {Number}.
I already accomplished this with the code below, which I believe is very messy. There must be a neater way to do it. Can anyone help me with it?
self.movie_df_tmp = self.rating_df[['MovieId', 'Rate']]
self.movie_df_tmp['RaCount'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('count')
self.movie_df_tmp['Sum'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('sum')
self.movie_df_tmp['NORC'] = self.movie_df_tmp.groupby(['MovieId', 'Rate'])['Rate'].transform('count')
self.movie_df_tmp = self.movie_df_tmp.drop_duplicates()
self.movie_df_tmp['Rate1C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 1]['NORC']
self.movie_df_tmp['Rate2C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 2]['NORC']
self.movie_df_tmp['Rate3C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 3]['NORC']
self.movie_df_tmp['Rate4C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 4]['NORC']
self.movie_df_tmp['Rate5C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 5]['NORC']
self.movie_df_tmp = self.movie_df_tmp.replace(np.nan, 0)
self.movie_df = self.movie_df_tmp[['MovieId', 'RaCount', 'Sum']].drop_duplicates()
self.movie_df_tmp = self.movie_df_tmp.drop(columns=['Rate', 'NORC', 'Sum', 'RaCount'])
self.movie_df_tmp = self.movie_df_tmp.groupby(['MovieId'])[['Rate1C', 'Rate2C', 'Rate3C', 'Rate4C', 'Rate5C']].apply(
    lambda x: x.astype(int).sum())
self.movie_df = self.movie_df.merge(self.movie_df_tmp, left_on='MovieId', right_on='MovieId')
self.movie_df = pd.DataFrame(self.movie_df.values,
columns=['MovieId', 'Rate1C', 'Rate2C', 'Rate3C', 'Rate4C',
'Rate5C'])
Try with pd.crosstab:
pd.crosstab(df['Movie'], df['Rate'])
Rate 1 2 4 5
Movie
2124 0 1 0 0
3029 0 0 0 1
5821 0 0 1 0
7582 1 0 0 0
17479 1 0 0 0
Fix the axis and column names with rename + reset_index + rename_axis:
new_df = (
pd.crosstab(df['Movie'], df['Rate'])
.rename(columns=lambda c: f'Rate_{c}_Count')
.reset_index()
.rename_axis(columns=None)
)
Movie Rate_1_Count Rate_2_Count Rate_4_Count Rate_5_Count
0 2124 0 1 0 0
1 3029 0 0 0 1
2 5821 0 0 1 0
3 7582 1 0 0 0
4 17479 1 0 0 0
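Note that there is no Rate_3_Count column above because rating 3 never occurs in the sample. If you always want all five count columns, one hedge (assuming ratings are always 1-5) is to reindex before renaming:
new_df = (
    pd.crosstab(df['Movie'], df['Rate'])
      .reindex(columns=range(1, 6), fill_value=0)  # force columns for ratings 1..5
      .rename(columns=lambda c: f'Rate_{c}_Count')
      .reset_index()
      .rename_axis(columns=None)
)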
This should give you the desired output:
grouper = df.groupby(['Movie', 'Rate']).size()
dg = pd.DataFrame()
dg['Movie'] = df['Movie'].unique()
for i in [1, 2, 3, 4, 5]:
    dg['Rate_' + str(i) + '_Count'] = dg['Movie'].apply(
        lambda x: grouper[x, i] if (x, i) in grouper.index else 0)
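A sketch of the same idea without the Python-level apply: unstack the group sizes into columns (again assuming ratings are always 1-5):
counts = (df.groupby(['Movie', 'Rate']).size()
            .unstack(fill_value=0)                        # one column per rating
            .reindex(columns=[1, 2, 3, 4, 5], fill_value=0))
counts.columns = [f'Rate_{i}_Count' for i in counts.columns]
dg = counts.reset_index()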
I'm trying to append predictions to my original data, which is:
product_id date views wishlists cartadds orders order_units gmv score
mp000000000001321 01-09-2022 0 0 0 0 0 0 0
mp000000000001321 02-09-2022 0 0 0 0 0 0 0
mp000000000001321 03-09-2022 0 0 0 0 0 0 0
mp000000000001321 04-09-2022 0 0 0 0 0 0 0
I have sequence lengths of [1, 3], and for each sequence length I have a prediction. I want to add those predictions to my original data so that my output looks like this:
product_id date views wishlists cartadds orders order_units gmv score prediction sequence_length
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.75 1
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 02-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 03-09-2022 0 0 0 0 0 0 0 5.88 3
I have tried the following:
df1 = df_batch.head(sequence_length)
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
merged_df.to_csv('data_prediction'+str(sequence_length)+'.csv', index_label='product_id')
but this only saves the data for the last product_id that was sent, and it saves each sequence length to a different CSV. I want everything in one CSV instead. How do I do that?
Edit: sample predictions_dict:
{'mp000000000001321': {'sequence_length': 1, 'prediction': 5.75}}
{'mp000000000001321': {'sequence_length': 3, 'prediction': 5.88}}
So, I found a fix:
df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
new_df = pd.concat([new_df, merged_df], ignore_index=True)
This way I'm able to get the desired output for unique product ids.
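For completeness, a sketch of the full loop this fix lives in; all_predictions (a mapping from each product_id to its predictions_dict) is a hypothetical name for however the predictions are collected:
import pandas as pd

new_df = pd.DataFrame()  # accumulate every product / sequence length here
for product_id, predictions_dict in all_predictions.items():  # hypothetical container
    sequence_length = predictions_dict[product_id]['sequence_length']
    df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
    dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
    dfff.index.names = ['product_id']
    new_df = pd.concat([new_df, df1.merge(dfff, on='product_id')],
                       ignore_index=True)
new_df.to_csv('data_prediction.csv', index=False)  # a single CSV for everything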
#df =
order_number Product1 ... Product16 Product17
0 4329374937 1 ... 0 0
1 3483872349 1 ... 0 0
2 2394287383 1 ... 0 0
3 3423984902 1 ... 1 0
4 9378374873 0 ... 0 0
Batch1 = ["Product1", "Product2", "Product 6"]
for indices in df.index:
    for column in columns:
        if df[column] > 0 and in Batch1 df[B1] = True
        else df[B1] = False
print(df.head))
I am trying to look through each order number and see whether the products ordered (values greater than 0) are within my listed batch. I want to create a new boolean column for each row. I am getting a syntax error.
From what I understand, you want to take the columns in your batch, add up the quantities, and see where the total is > 0. So:
batch_1 = ["Product1", "Product2", "Product6"]
df['B1'] = df[batch_1].sum(axis=1)>0
df
output:
order_number Product1 Product2 Product6 Product16 Product17 B1
-- -------------- ---------- ---------- ---------- ----------- ----------- -----
0 4329374937 1 1 3 0 0 True
1 3483872349 1 0 1 0 0 True
2 2394287383 1 0 2 0 0 True
3 3423984902 1 0 1 1 0 True
4 9378374873 0 0 0 0 0 False
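The same pattern scales to several batches with a small loop; a sketch, where the B2 list is hypothetical:
batches = {'B1': ["Product1", "Product2", "Product6"],
           'B2': ["Product16", "Product17"]}  # B2 is made up for illustration
for name, cols in batches.items():
    df[name] = df[cols].sum(axis=1) > 0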
Confusing title, let me explain. I have 2 dataframes like this:
dataframe named df1 looks like this (with millions of rows in the original):
id text c1
1 Hello world how are you people 1
2 Hello people I am fine people 1
3 Good Morning people -1
4 Good Evening -1
Dataframe named df2 looks like this:
Word count Points Percentage
hello 2 2 100
world 1 1 100
how 1 1 100
are 1 1 100
you 1 1 100
people 3 1 33.33
I 1 1 100
am 1 1 100
fine 1 1 100
Good 2 -2 -100
Morning 1 -1 -100
Evening 1 -1 -100
df2 columns explanation:
count means the total number of times that word appeared in df1
points is points given to each word by some kind of algorithm
percentage = points/count*100
Now, I want to add 40 new columns to df1, according to the points & percentage. They will look like this:
perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-30_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2
perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-30_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_1 perc_70_1 perc_80_1 perc_90_1
Let me break it down. The column name contains 3 parts:
1.) perc is just a string; it means nothing.
2.) A number from the range -90 to +90. For example, -90 means the percentage in df2 is -90. So if a word has a percentage value in the range 81-90, there will be a value of 1 in that row, in the column named perc_-80_xx. The xx is the third part.
3.) The third part is the count. Here I want two types of counts: 1 and 2. Following the example in point 2, if the word count is in the range 0 to 1, the value will be 1 in the perc_-80_1 column. If the word count is 2 or more, the value will be 1 in the perc_-80_2 column.
I hope it is not too confusing.
Use:
# change to the previous answer: also aggregate id for matching
df2 = (df.drop_duplicates(['id', 'Word'])
         .groupby('Word', sort=False)
         .agg({'c1': ['sum', 'size'], 'id': 'first'}))
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum': 'Points', 'c1size': 'Totalcount', 'idfirst': 'id'})
df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100

s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
#s2 = np.where(df2['Totalcount'].isin([0, 1]), '1', '2')
# create the column name by joining the parts
df2['new'] = 'perc_' + s1 + '_' + s2

# create an indicator DataFrame
df3 = pd.get_dummies(df2[['id', 'new']].drop_duplicates().set_index('id'),
                     prefix='',
                     prefix_sep='').max(level=0)
print (df3)

# reindex to add the missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
cols = (c + '1').append(c + '2')
# join back to the original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')
print (df)
id text c1 perc_-100_1 perc_-90_1 \
0 1 Hello world how are you people 1 0 0
1 2 Hello people I am fine people 1 0 0
2 3 Good Morning people -1 1 0
3 4 Good Evening -1 1 0
perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 ... perc_10_2 \
0 0 0 0 0 0 ... 0
1 0 0 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 \
0 0 1 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
perc_80_2 perc_90_2 perc_100_2
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
[4 rows x 45 columns]
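One assumption worth spelling out: the df this answer starts from is the word-level expansion of df1, with one row per (id, Word) pair and the c1 value repeated. A sketch of building it (explode needs pandas 0.25+):
df = df1.assign(Word=df1['text'].str.split()).explode('Word')
Also note that on pandas 2.0+ both Series.append and max(level=0) are gone: build cols with pd.concat([c + '1', c + '2']) and replace .max(level=0) with .groupby(level=0).max().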
I have a dataframe in Pandas (subset below).
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1 IN & 200D_MA both = 1, TEST = 1
11/4/2013 0 1 1 previous TEST row = 1 & 200D_MA = 1, TEST = 1
11/5/2013 0 1 1 previous TEST row = 1 & 200D_MA = 1, TEST = 1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0 previous TEST row = 1 & 200D_MA = 0, TEST = 0
This is easy to do in Excel, so I thought it would be easy to do in Python. I have this code using nested np.where formulas:
df3['TEST'] = np.where((df3['IN'] == 1) & (df3['200D_MA'] == 1), 1,
                       np.where((df3['TEST'].shift(-1) == 1)
                                & (df3['200D_MA'] == 1), 1, 0))
but it throws KeyError: 'IN', presumably because I am using a condition on a column that has not been created yet. Can anyone help me figure out how to do this?
Seems like you need a conditional ffill:
# seed TEST with 1 where IN == 1 (NaN elsewhere)
df['TEST'] = df.loc[df.IN == 1, 'IN']
# forward-fill those 1s, but only through rows where 200D_MA == 1
df.loc[df['200D_MA'] == 1, 'TEST'] = df.loc[df['200D_MA'] == 1, 'TEST'].ffill()
df.fillna(0, inplace=True)
df.TEST = df.TEST.astype(int)
df
Out[349]:
DATE IN 200D_MA TEST
0 10/30/2013 0 1 0
1 10/31/2013 0 1 0
2 11/1/2013 1 1 1
3 11/4/2013 0 1 1
4 11/5/2013 0 1 1
5 11/6/2013 0 1 1
6 11/7/2013 0 1 1
7 11/8/2013 0 1 1
8 11/11/2013 0 0 0
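If the recurrence ever grows beyond what a conditional ffill can express, a plain loop is a safe fallback; a sketch of the same rule:
prev = 0
out = []
for in_, ma in zip(df['IN'], df['200D_MA']):
    # TEST is 1 only while 200D_MA is 1, starting from a row where IN is 1
    prev = 1 if ma == 1 and (in_ == 1 or prev == 1) else 0
    out.append(prev)
df['TEST'] = out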
I think you can use rolling to look at the previous TEST row:
s = (df['IN'] == 1) | (df['200D_MA'] == 1)
prev = s.astype(int).rolling(2).min().shift(1).fillna(0)
df['TEST'] = (s & prev.astype(bool)).astype(int)
Output:
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1
11/4/2013 0 1 1
11/5/2013 0 1 1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0
I have a dict as follows:
data_dict = {'1.160.139.117': ['712907','742068'],
'1.161.135.205': ['667386','742068'],
'1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a dataframe whose columns represent the values of the dictionary and whose rows represent the keys, as below?
 326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
The best option is #4:
pd.get_dummies(df.stack()).sum(level=0)
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
.set_index(0,append=True)['level_1']\
.unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a little reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
.pivot(index='level_0',columns='Values')['level_1']\
.notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out a short solution, the fastest so far):
pd.get_dummies(df.stack()).sum(level=0)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
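One caveat: sum(level=0) was removed in pandas 2.0, so on recent versions the same idea is spelled with a groupby:
pd.get_dummies(df.stack()).groupby(level=0).sum()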