Pivoting via Python and Pandas - python

I have a table like this:
ID Word
1 take
2 the
3 long
4 long
5 road
6 and
7 walk
8 it
9 walk
10 it
I want to use a pandas pivot table to get the distinct words as columns and 1s and 0s as values. Something like this matrix:
ID Take The Long Road And Walk It
1 1 0 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0
5 0 0 0 1 0 0 0
and so on
I'm trying to use pivot_table, but I'm not familiar with the pandas syntax yet:
import pandas as pd
data = pd.read_csv('dataset.txt', sep='|', encoding='latin1')
table = pd.pivot_table(data,index=["ID"],columns=pd.unique(data["Word"].values),fill_value=0)
How can I rewrite the pivot_table call to handle this?

You can use concat with str.get_dummies:
print(pd.concat([df['ID'], df['Word'].str.get_dummies()], axis=1))
ID and it long road take the walk
0 1 0 0 0 0 1 0 0
1 2 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 5 0 0 0 1 0 0 0
5 6 1 0 0 0 0 0 0
6 7 0 0 0 0 0 0 1
7 8 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 1
9 10 0 1 0 0 0 0 0
Or, as EdChum mentioned in the comments, pd.get_dummies:
print(pd.concat([df['ID'], pd.get_dummies(df['Word'])], axis=1))
ID and it long road take the walk
0 1 0 0 0 0 1 0 0
1 2 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 5 0 0 0 1 0 0 0
5 6 1 0 0 0 0 0 0
6 7 0 0 0 0 0 0 1
7 8 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 1
9 10 0 1 0 0 0 0 0
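For the pivot-style approach the asker started from, pd.crosstab gives the same 0/1 matrix directly. A minimal sketch, assuming the ID/Word data from the question (dataset.txt itself isn't available here, so the frame is built inline):

```python
import pandas as pd

# Inline stand-in for the question's data (hypothetically this would come
# from pd.read_csv('dataset.txt', sep='|', encoding='latin1'))
df = pd.DataFrame({
    "ID": range(1, 11),
    "Word": ["take", "the", "long", "long", "road",
             "and", "walk", "it", "walk", "it"],
})

# crosstab counts (ID, Word) pairs; each ID carries exactly one word,
# so every cell is already 0 or 1
out = pd.crosstab(df["ID"], df["Word"])
print(out)
```

Unlike pivot_table, crosstab needs no values/aggfunc arguments for a plain co-occurrence count.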

Related

How to identify cells representing contour of cluster in matrix (Python)

I have a binary matrix [0/1] where 1 marks a cluster. I would like to identify the cells (in terms of their positions) that make up the contour of this cluster.
> test
X0 X0.1 X0.2 X0.3 X0.4 X0.5 X0.6
1 0 0 0 0 0 0 0
2 0 0 1 1 1 0 0
3 0 0 1 1 1 1 0
4 0 0 0 1 1 1 0
5 0 0 0 0 1 1 0
6 0 0 0 0 1 0 0
7 0 0 0 0 1 0 0
8 1 1 0 0 0 0 0
9 1 1 1 0 0 0 0
10 1 1 1 0 0 0 0
11 0 1 0 0 0 0 0
Thanks
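No answer is shown above, but one common approach is a neighbour test: a cell belongs to the contour if it is 1 and at least one of its four neighbours is 0. A sketch on a made-up matrix (pure NumPy, with zero padding so edge cells are handled):

```python
import numpy as np

# Toy 0/1 cluster matrix (illustrative only, not the asker's data)
m = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

p = np.pad(m, 1)  # zero border so every cell has four defined neighbours
up, down = p[:-2, 1:-1], p[2:, 1:-1]
left, right = p[1:-1, :-2], p[1:-1, 2:]

# Interior cells are 1 with all four neighbours equal to 1;
# contour cells are the remaining 1s
interior = m & up & down & left & right
contour = m.astype(bool) & ~interior.astype(bool)

rows, cols = np.nonzero(contour)  # positions of the contour cells
print(list(zip(rows.tolist(), cols.tolist())))
```

scipy.ndimage.binary_erosion expresses the same idea as m & ~erosion(m).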

Extract all row values and create new columns similar to Count Vectorizer

My apologies SO community, I am a newbie on the platform and in the pursuit of making this question precise and straight to the point, I didn't give relevant info.
My Input Dataframe is:
import pandas as pd
data = {'user_id': ['abc','def','ghi'],
'alpha': ['A','B,C,D,A','B,C,A'],
'beta': ['1|20|30','350','376|98']}
df = pd.DataFrame(data = data, columns = ['user_id','alpha','beta'])
print(df)
Looks like this,
user_id alpha beta
0 abc A 1|20|30
1 def B,C,D,A 350
2 ghi B,C,A 376|98
I want something like this,
user_id alpha beta a_A a_B a_C a_D b_1 b_20 b_30 b_350 b_376
0 abc A 1|20|30 1 0 0 0 1 1 1 1 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
My original data contains 11K rows. And these distinct values in alpha & beta are around 550.
I created a list from all the values in the alpha & beta columns and applied pd.get_dummies, but it results in a lot of rows, like the output displayed by @wwwnde. I would like all the rows to be rolled up based on user_id.
A similar idea is used by CountVectorizer on documents, where it creates columns based on all the words in the sentence and checks the frequency of a word. However, I am guessing Pandas has a better and efficient way to do that.
Grateful for all your assistance. :)
You will have to achieve that in a series of steps.
Sample Data
id ALPHA BETA
0 1 A 1|20|30
1 2 B,C,D,A 350
2 3 B,C,A 395|45|90
Create Lists for values in ALPHA and BETA
df.BETA = df.BETA.apply(lambda x: x.split('|'))
df.ALPHA = df.ALPHA.apply(lambda x: x.split(','))
Disintegrate the list elements into individuals
df=df.explode('ALPHA')
df=df.explode('BETA')
Extract the indicator variables using get_dummies (assigning the result back so the column rename below works).
df = pd.get_dummies(df)
Strip the columns of the prefix
df.columns = df.columns.str.replace('ALPHA_|BETA_', '', regex=True)
id A B C D 1 20 30 350 395 45 90
0 1 1 0 0 0 1 0 0 0 0 0 0
0 1 1 0 0 0 0 1 0 0 0 0 0
0 1 1 0 0 0 0 0 1 0 0 0 0
1 2 0 1 0 0 0 0 0 1 0 0 0
1 2 0 0 1 0 0 0 0 1 0 0 0
1 2 0 0 0 1 0 0 0 1 0 0 0
1 2 1 0 0 0 0 0 0 1 0 0 0
2 3 0 1 0 0 0 0 0 0 1 0 0
2 3 0 1 0 0 0 0 0 0 0 1 0
2 3 0 1 0 0 0 0 0 0 0 0 1
2 3 0 0 1 0 0 0 0 0 1 0 0
2 3 0 0 1 0 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0 0 0 0 1
2 3 1 0 0 0 0 0 0 0 1 0 0
2 3 1 0 0 0 0 0 0 0 0 1 0
2 3 1 0 0 0 0 0 0 0 0 0 1
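The exploded dummies above still have one row per list element; to reach the asker's one-row-per-id shape they need to be collapsed with a groupby/max. A sketch over the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "ALPHA": ["A", "B,C,D,A", "B,C,A"],
    "BETA": ["1|20|30", "350", "395|45|90"],
})

# Split the delimited strings into lists, then one row per element
df["ALPHA"] = df["ALPHA"].str.split(",")
df["BETA"] = df["BETA"].str.split("|")
exploded = df.explode("ALPHA").explode("BETA")

# Dummy-encode both columns, then roll the exploded rows back up per id
dummies = pd.get_dummies(exploded[["id", "ALPHA", "BETA"]],
                         columns=["ALPHA", "BETA"])
rolled = dummies.groupby("id").max().reset_index()
print(rolled)
```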

How do you remove values not in a cluster using a pandas data frame?

If I have a pandas data frame like this made up of 0 and 1s:
1 1 1 0 0 0 0 1 0
1 1 1 1 1 0 0 0 0
1 1 1 0 0 0 0 1 0
1 0 0 0 0 1 0 0 0
How do I filter out outliers such that I get something like this:
1 1 1 0 0 0 0 0 0
1 1 1 1 1 0 0 0 0
1 1 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
That is, everything after the first 0 in each row should become 0.
We can do this with a cumulative product over the second axis using pandas.cumprod:
>>> df.cumprod(axis=1)
0 1 2 3 4 5 6 7 8
0 1 1 1 0 0 0 0 0 0
1 1 1 1 1 1 0 0 0 0
2 1 1 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0
The same result can be obtained with pandas.cummin:
>>> df.cummin(axis=1)
0 1 2 3 4 5 6 7 8
0 1 1 1 0 0 0 0 0 0
1 1 1 1 1 1 0 0 0 0
2 1 1 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0
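On a 0/1 frame the two calls agree because a single 0 forces both the running product and the running minimum to stay 0 for the rest of the row. A quick sketch reproducing the frame above:

```python
import pandas as pd

df = pd.DataFrame([
    [1, 1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
])

kept = df.cumprod(axis=1)              # zeros everything after the first 0
assert kept.equals(df.cummin(axis=1))  # identical on 0/1 data
print(kept)
```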

Create dummies when categories are single characters in multi character strings

Consider my data in a Pandas Series
s = pd.Series('1az wb58 jsui ne3'.split())
s
0 1az
1 wb58
2 jsui
3 ne3
dtype: object
I need it to look like:
1 3 5 8 a b e i j n s u w z
0 1 0 0 0 1 0 0 0 0 0 0 0 0 1
1 0 0 1 1 0 1 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 1 1 0 1 1 0 0
3 0 1 0 0 0 0 1 0 0 1 0 0 0 0
However when I try:
pd.get_dummies(s)
1az jsui ne3 wb58
0 1 0 0 0
1 0 0 0 1
2 0 1 0 0
3 0 0 1 0
What is the most concise way to do this?
Maybe apply list:
pd.get_dummies(s.apply(list).apply(pd.Series).stack()).groupby(level=0).sum()
Out[222]:
1 3 5 8 a b e i j n s u w z
0 1 0 0 0 1 0 0 0 0 0 0 0 0 1
1 0 0 1 1 0 1 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 1 1 0 1 1 0 0
3 0 1 0 0 0 0 1 0 0 1 0 0 0 0
Or
s.apply(list).str.join(',').str.get_dummies(',')
Out[224]:
1 3 5 8 a b e i j n s u w z
0 1 0 0 0 1 0 0 0 0 0 0 0 0 1
1 0 0 1 1 0 1 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 1 1 0 1 1 0 0
3 0 1 0 0 0 0 1 0 0 1 0 0 0 0
Solution with MultiLabelBinarizer and DataFrame constructor:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_)
print (df)
1 3 5 8 a b e i j n s u w z
0 1 0 0 0 1 0 0 0 0 0 0 0 0 1
1 0 0 1 1 0 1 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 1 1 0 1 1 0 0
3 0 1 0 0 0 0 1 0 0 1 0 0 0 0
Another solution: DataFrame.from_records + get_dummies, but afterwards it is necessary to aggregate the columns by max:
df = pd.get_dummies(pd.DataFrame.from_records(s),prefix_sep='',prefix='').max(level=0, axis=1)
print (df)
1 3 5 8 a b e i j n s u w z
0 1 0 0 0 1 0 0 0 0 0 0 0 0 1
1 0 0 1 1 0 1 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 1 1 0 1 1 0 0
3 0 1 0 0 0 0 1 0 0 1 0 0 0 0
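The level= argument to aggregations (as in .max(level=0, axis=1) above) was removed in pandas 2.0; an equivalent modern spelling (a sketch) explodes the character lists and collapses with a level-0 groupby:

```python
import pandas as pd

s = pd.Series('1az wb58 jsui ne3'.split())

# One row per character (the index repeats the original position),
# dummy-encode, then take the max back per original row
out = pd.get_dummies(s.apply(list).explode()).groupby(level=0).max()
print(out)
```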

How do I open a binary matrix and convert it into a 2D array or a dataframe?

I have a binary matrix in a txt file that looks as follows:
0011011000
1011011000
0011011000
0011011010
1011011000
1011011000
0011011000
1011011000
0100100101
1011011000
I want to make this into a 2D array or a dataframe where there is one number per column and the rows are as shown. I've tried using numpy and pandas, but the output has only one column that contains the whole number. I want to be able to call an entire column as a number.
One of the codes I've tried is:
with open("a1data1.txt") as myfile:
    dat1 = myfile.read().split('\n')
dat1 = pd.DataFrame(dat1)
Use read_fwf with the widths parameter:
df = pd.read_fwf("a1data1.txt", header=None, widths=[1]*10)
print (df)
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1 0 0 0
2 0 0 1 1 0 1 1 0 0 0
3 0 0 1 1 0 1 1 0 1 0
4 1 0 1 1 0 1 1 0 0 0
5 1 0 1 1 0 1 1 0 0 0
6 0 0 1 1 0 1 1 0 0 0
7 1 0 1 1 0 1 1 0 0 0
8 0 1 0 0 1 0 0 1 0 1
9 1 0 1 1 0 1 1 0 0 0
After you read your txt file, you can fix it with the following code:
pd.DataFrame(df[0].apply(list).values.tolist())
Out[846]:
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1 0 0 0
2 0 0 1 1 0 1 1 0 0 0
3 0 0 1 1 0 1 1 0 1 0
4 1 0 1 1 0 1 1 0 0 0
5 1 0 1 1 0 1 1 0 0 0
6 0 0 1 1 0 1 1 0 0 0
7 1 0 1 1 0 1 1 0 0 0
8 0 1 0 0 1 0 0 1 0 1
9 1 0 1 1 0 1 1 0 0 0
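A NumPy route also works: genfromtxt accepts an integer delimiter, which splits each line into fixed one-character fields. A sketch with a few sample lines fed in via StringIO instead of the asker's file:

```python
import numpy as np
from io import StringIO

# Stand-in for part of a1data1.txt (three of the lines shown above)
raw = "0011011000\n1011011000\n0100100101\n"

# delimiter=1 means fixed-width fields of one character each
arr = np.genfromtxt(StringIO(raw), delimiter=1, dtype=int)
print(arr)
```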
