Data reshape in Python

I have sample data such as:
User_ID bucket brand_name
0 100 3_6months A
1 7543 6_9months A
2 100 9_12months A
3 7543 3_6months B
4 7542 first_3months C
Now I want to reshape this data to one row per User_ID such that my output data looks like:
User_ID A_first_3months A_3_6months A_6_9months A_9_12months B_3_6months B_6_9months (and so on)
100 0 1 2 1
7543 2 0 1 1
7542 0 0 1 0
So here I basically want to pivot on two columns, bucket and brand_name, and aggregate to one row per user. I know about the pandas crosstab, pivot, and stack functions, but I am not able to judge the right way to use them since we have three columns. Any help would be highly appreciated. Note that the entries can be more than one, since we are looking at the total count of brands in a particular bucket for each user.

You could combine the brand and the bucket into a new column, and then apply crosstab to it:
df['brand_bucket'] = df['brand_name'] + '_' + df['bucket']
pd.crosstab(index=[df['User_ID']], columns=[df['brand_bucket']])
yields
brand_bucket  A_3_6months  A_6_9months  A_9_12months  B_3_6months  C_first_3months
User_ID
100                     1            0             1            0                0
7542                    0            0             0            0                1
7543                    0            1             0            1                0
Or, you could pass two columns to crosstab and obtain a DataFrame with a MultiIndex:
pd.crosstab(index=[df['User_ID']], columns=[df['bucket'], df['brand_name']])
yields
bucket     3_6months    6_9months 9_12months first_3months
brand_name         A  B         A          A             C
User_ID
100                1  0         0          1             0
7542               0  0         0          0             1
7543               0  1         1          0             0
I like the latter better because it preserves more of the structure of the data.
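If you need the flat, single-level column names the question asked for (e.g. A_3_6months), one possible follow-up is to join the levels of the MultiIndex yourself; a minimal sketch, building on the crosstab above:
ct = pd.crosstab(index=df['User_ID'], columns=[df['brand_name'], df['bucket']])
# join the two levels, e.g. ('A', '3_6months') -> 'A_3_6months'
ct.columns = ['_'.join(col) for col in ct.columns]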

Related

Create new columns based on distinct row values and calculate frequency of every value

I would like to extract all distinct values from specific columns, create new columns for them, and calculate each value's frequency in every row.
My Input Dataframe is:
import pandas as pd

data = {'user_id': ['abc', 'def', 'ghi'],
        'alpha': ['A', 'B,C,D,A', 'B,C,A'],
        'beta': ['1|20|30', '350', '376']}
df = pd.DataFrame(data=data, columns=['user_id', 'alpha', 'beta'])
print(df)
Looks like this,
user_id alpha beta
0 abc A 1|20|30
1 def B,C,D,A 350
2 ghi B,C,A 376
I want something like this,
user_id alpha beta A B C D 1 20 30 350 376
0 abc A 1|20|30 1 0 0 0 1 1 1 0 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
My original data contains 11K rows, and there are around 550 distinct values across alpha & beta.
I created a list of all the values in the alpha & beta columns and applied pd.get_dummies, but it results in a lot of rows. I would like all the rows to be rolled up based on user_id.
A similar idea is used by CountVectorizer on documents, where it creates columns based on all the words in the sentences and counts the frequency of each word. However, I am guessing pandas has a better and more efficient way to do that.
Grateful for all your assistance. :)
You can use Series.str.get_dummies to create a dummy-indicator dataframe for each of the columns alpha and beta, then use pd.concat to concatenate these dataframes along axis=1:
cs = (('alpha', ','), ('beta', '|'))
df1 = pd.concat([df] + [df[c].str.get_dummies(sep=s) for c, s in cs], axis=1)
Result:
print(df1)
user_id alpha beta A B C D 1 20 30 350 376
0 abc A 1|20|30 1 0 0 0 1 1 1 0 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
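For reference, this is what Series.str.get_dummies returns for a single delimited column on its own (same data as above):
print(df['alpha'].str.get_dummies(sep=','))
# each comma-separated token becomes an indicator column:
#    A  B  C  D
# 0  1  0  0  0
# 1  1  1  1  1
# 2  1  1  1  0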

Extracting rows in R based on an ID number

I have a data frame in R which, when I run data$data['rs146217251',1:10,], looks something like:
g=0 g=1 g=2
1389117_1389117 0 1 NA
2912943_2912943 0 0 1
3094358_3094358 0 0 1
5502557_5502557 0 0 1
2758547_2758547 0 0 1
3527892_3527892 0 1 NA
3490518_3490518 0 0 1
1569224_1569224 0 0 1
4247075_4247075 0 1 NA
4428814_4428814 0 0 1
The leftmost column are participant identifiers. There are roughly 500,000 participants listed in this data frame, but I have a subset of them (about 5,000) listed by their identifiers. What is the best way to go about extracting only these rows that I care about according to their identifier in R or python (or some other way)?
Assuming that the participant identifiers are row names, you can filter by the vector of the identifiers as below:
df <- data$data['rs146217251',1:10,]
#Assuming the vector of identifiers
id <- c("4428814_4428814", "3490518_3490518", "3094358_3094358")
filtered <- df[id,]
Output:
> filtered
g.0 g.1 g.2
4428814_4428814 0 0 1
3490518_3490518 0 0 1
3094358_3094358 0 0 1
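The question also mentions python; a rough pandas equivalent (a sketch, assuming the table has been loaded into a DataFrame df with the participant identifiers as the index) could look like:
import pandas as pd

ids = ["4428814_4428814", "3490518_3490518", "3094358_3094358"]
# .intersection() guards against identifiers that are absent from the data
filtered = df.loc[df.index.intersection(ids)]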

How to create two columns by default for every feature (One Hot Encoding)?

My feature engineering runs for different documents. For some documents, certain features do not exist, and consequently the sublist consists of only one repeated value, such as the third sublist [0,0,0,0,0]. One-hot encoding of such a sublist leads to only one column, while the feature lists of the other documents are transformed into two columns. Is there any possibility to tell one-hot encoding to also create two columns when a sublist consists of only one and the same value, and to insert the column in the right spot? The main problem is that the feature dataframes of different documents end up with different numbers of columns, which makes them not comparable.
import pandas as pd

feature = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1], [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]
df = pd.DataFrame(feature[0])
df_features_final = pd.get_dummies(df[0])
for feature in feature[1:]:
    df = pd.DataFrame(feature)
    df_enc = pd.get_dummies(df[0])
    print(df_enc)
    df_features_final = pd.concat([df_features_final, df_enc], axis=1, join='inner')
print(df_features_final)
The result is the following dataframe. As you can see from the column titles, the fifth column (the lone 0 from the all-zero sublist) is not followed by a 1 column:
0 1 0 1 0 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 1 0 1 0 1 1 0
1 1 0 0 1 1 1 0 0 1 1 0 0 1
2 0 1 0 1 1 0 1 1 0 0 1 1 0
3 1 0 1 0 1 0 1 0 1 0 1 1 0
4 1 0 0 1 1 0 1 0 1 0 1 1 0
I don't see the functionality you want in pandas, at least. But in TensorFlow, we do have
tf.one_hot(
indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.
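A minimal sketch of what that looks like for the all-zero sublist (assuming TensorFlow 2.x):
import tensorflow as tf

feature = [0, 0, 0, 0, 0]  # a sublist with only one distinct value
# depth=2 forces two output columns even though only the value 0 occurs
encoded = tf.one_hot(feature, depth=2, dtype=tf.int32)
print(encoded.numpy())
# [[1 0]
#  [1 0]
#  [1 0]
#  [1 0]
#  [1 0]]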

Pandas - scoring column

I have data about product sales (1 column per product) at the customer level (1 row per customer).
I'm assessing which customers are more likely to be interested in a specific product. I have a list of the 10 most correlated products (and I have this for multiple products, so I'm trying to build a scalable approach).
I'm trying to score all customers based on how many of those 10 products they buy.
Let's say my list is:
prod_x_corr_prod
How can I create a scoring column (say prox_x_propensity) which goes through the 10 relevant columns, for every row, and for each column with a value > 0 adds 1?
For instance, if customer Y bought 3 of the products correlated with product X, he would have a score of 3 in the "prox_x_score" column.
EDIT: thanks to all of you for the feedback.
For customer 5 I would get a 2, while for customers 1, 2, and 3 I would get 1. For customer 4, 0.
You can do:
df['prox_x_score'] = (df[prod_x_corr_prod] > 0).sum(axis=1)
Example with dummy data:
import numpy as np
import pandas as pd
prod_x_corr_prod = ["prod{}".format(i) for i in range(1, 11)]
df = pd.DataFrame({col:np.random.choice([0,1], size=5) for col in prod_x_corr_prod})
df['prox_x_score'] = (df[prod_x_corr_prod] > 0).sum(axis=1)
print(df)
Output:
prod1 prod10 prod2 prod3 prod4 prod5 prod6 prod7 prod8 prod9 \
0 1 1 1 0 0 1 1 1 1 0
1 1 1 1 0 1 0 0 1 1 0
2 1 1 1 1 0 1 0 0 1 0
3 0 0 0 0 0 0 1 0 1 0
4 0 0 0 0 0 0 0 1 1 0
prox_x_score
0 7
1 6
2 6
3 2
4 2
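Since the question asks for a scalable approach across multiple products, one possible extension (a sketch; corr_prods and prod_y_corr_prod are hypothetical names, not from the original) is to keep a dict mapping each product to its list of correlated-product columns and loop over it:
# hypothetical mapping: product name -> its 10 most correlated product columns
corr_prods = {'prod_x': prod_x_corr_prod, 'prod_y': prod_y_corr_prod}
for prod, cols in corr_prods.items():
    # count, per customer, how many of the correlated products were bought
    df['{}_score'.format(prod)] = (df[cols] > 0).sum(axis=1)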

Using two different data frames to compute new variable

I have two dataframes of the same dimensions that look like:
df1
ID flag
0 1
1 0
2 1
df2
ID flag
0 0
1 1
2 0
In both dataframes I want to create a new variable that denotes an additive flag. So the new variable will look like this:
df1
ID flag new_flag
0 1 1
1 0 1
2 1 1
df2
ID flag new_flag
0 0 1
1 1 1
2 0 1
So if either flag column is a 1, the new flag will be a 1.
I tried this code:
df1['new_flag']= 1
df2['new_flag']= 1
df1['new_flag'][(df1['flag']==0)&(df1['flag']==0)]=0
df2['new_flag'][(df2['flag']==0)&(df2['flag']==0)]=0
I would expect the same number of 1s in both new_flag columns, but they differ. Is this because I'm not going row by row? Like this question?
pandas create new column based on values from other columns
If so, how do I include criteria from both dataframes?
You can use np.logical_or to achieve this. (For the demo below, df1's flag column is set to all 0's except the last row, so we don't just get a column of 1's.) We can cast the result of np.logical_or using astype(int) to convert the boolean array to 1 and 0:
In [108]:
import numpy as np
df1['new_flag'] = np.logical_or(df1['flag'], df2['flag']).astype(int)
df2['new_flag'] = np.logical_or(df1['flag'], df2['flag']).astype(int)
df1
Out[108]:
ID flag new_flag
0 0 0 0
1 1 0 1
2 2 1 1
In [109]:
df2
Out[109]:
ID flag new_flag
0 0 0 0
1 1 1 1
2 2 0 1
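An equivalent pandas-only variant (a sketch, assuming the flag columns hold 0/1 integers) uses the element-wise | operator instead of np.logical_or:
df1['new_flag'] = (df1['flag'] | df2['flag']).astype(int)
df2['new_flag'] = df1['new_flag']  # same values; the indexes align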
