Crosstab in Pandas?

I have a DataFrame in Pandas like this:
df = pd.DataFrame({"price_range": [0,1,2,3,0,2], "blue":[0,0,1,0,1,1], "four_g":[0,0,0,1,0,1]})
I have a line like this: pd.crosstab(df['price_range'], df["blue"])
Right now I only see, for example, how many "blue" 0s and 1s there are for each "price_range", but I want to expand this code to also show how many "four_g" 0s and 1s there are for each "price_range". How can I do that? Please help.

One way is to use 'melt':
df_out = df.melt('price_range')
pd.crosstab(df_out['price_range'], df_out['variable'], df_out['value'], aggfunc='sum')
Output:
variable blue four_g
price_range
0 1 0
1 0 0
2 2 1
3 0 1
Another way is to use groupby:
df.groupby('price_range')[['blue','four_g']].sum()
Output:
blue four_g
price_range
0 1 0
1 0 0
2 2 1
3 0 1
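As a quick check of the two approaches above, here is a minimal sketch that runs both on the question's data and compares them. The variable names (melted, via_crosstab, via_groupby) are made up for illustration. Note that both approaches sum the values, so they count only the 1s in each column, not the 0s.

```python
import pandas as pd

df = pd.DataFrame({"price_range": [0, 1, 2, 3, 0, 2],
                   "blue": [0, 0, 1, 0, 1, 1],
                   "four_g": [0, 0, 0, 1, 0, 1]})

# Long format: one row per (price_range, variable, value)
melted = df.melt("price_range")

# Sum of the 0/1 values per price_range and variable, via crosstab
via_crosstab = pd.crosstab(melted["price_range"], melted["variable"],
                           melted["value"], aggfunc="sum")

# The same sums, via groupby on the original wide frame
via_groupby = df.groupby("price_range")[["blue", "four_g"]].sum()

print(via_crosstab)
print(via_groupby)
```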

The simplest way is to build one crosstab per column in a list comprehension and join them with concat:
cols = ['blue', 'four_g']
df_out = pd.concat([pd.crosstab(df['price_range'], df[col])
                    for col in cols], keys=cols, axis=1)
Out[1116]:
blue four_g
blue 0 1 0 1
price_range
0 1 1 2 0
1 1 0 1 0
2 0 2 1 1
3 1 0 0 1
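A possible alternative, sketched here under the assumption that the data is first melted to long format: crosstab accepts a list of arrays for its columns argument, so the same two-level table of 0 and 1 counts can be produced in a single call. The names melted and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"price_range": [0, 1, 2, 3, 0, 2],
                   "blue": [0, 0, 1, 0, 1, 1],
                   "four_g": [0, 0, 0, 1, 0, 1]})

# Long format: one row per (price_range, variable, value)
melted = df.melt("price_range")

# Passing two arrays as columns gives a (variable, value) MultiIndex,
# so each cell counts how often that value occurs for that price_range
counts = pd.crosstab(melted["price_range"],
                     [melted["variable"], melted["value"]])
print(counts)
```

This counts occurrences rather than summing values, so both the 0s and the 1s are reported for every column.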

Related

pandas: convert a list in a column into separate columns

dataframe df has a column
id data_words
1 [salt,major,lab,water]
2 [lab,plays,critical,salt]
3 [water,success,major]
I want to one-hot encode the column:
id critical lab major plays salt success water
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
columns=mlb.classes_,
index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
Attempt 2:
I converted each list into a simple comma-separated string with the following code:
df['data_words_Joined'] = df.data_words.apply(','.join)
which makes the dataframe look like this:
id data_words
1 salt,major,lab,water
2 lab,plays,critical,salt
3 water,success,major
Then I tried
pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)
But it turns each joined string into a single column name instead of giving each word its own column:
id salt,major,lab,water lab,plays,critical,salt water,success,major
1 1 0 0
2 0 1 0
3 0 0 1

You can try with explode followed by pivot_table
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'],columns=df_e['data_words'],values='id',aggfunc='count',fill_value=0))
Returning the following output:
data_words critical lab major plays salt success water
id
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id':[1,2,3],
'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})
Which looks like:
id data_words
0 1 [salt, major, lab, water]
1 2 [lab, plays, critical, salt]
2 3 [water, success, major]
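Since exploding the lists gives one row per (id, word) pair, a plain crosstab is another way to count them. This is a minimal sketch on the replication data above, not the answer's own code; long_df and table are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'data_words': [['salt', 'major', 'lab', 'water'],
                                  ['lab', 'plays', 'critical', 'salt'],
                                  ['water', 'success', 'major']]})

# One row per (id, word) pair
long_df = df.explode('data_words')

# Count how often each word occurs for each id (0 or 1 here,
# because no word is repeated within a list)
table = pd.crosstab(long_df['id'], long_df['data_words'])
print(table)
```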

One possible approach could be to use get_dummies with your apply function:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
critical lab major plays salt success water
0 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0
2 0 0 1 0 0 1 1
Tested with pandas version 1.1.2 and borrowed input data from Celius Stingher's Answer.
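To keep the id column next to the indicator columns, the dummies can be joined back onto it. The sketch below assumes that no word contains the '|' character, which is the default separator of str.get_dummies; dummies and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'data_words': [['salt', 'major', 'lab', 'water'],
                                  ['lab', 'plays', 'critical', 'salt'],
                                  ['water', 'success', 'major']]})

# Join each list with '|' and split it back into indicator columns
# (assumption: '|' never appears inside a word)
dummies = df['data_words'].str.join('|').str.get_dummies()

# Put the id column back in front of the indicator columns
result = df[['id']].join(dummies)
print(result)
```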

How can I add new columns using another dataframe (related to string columns) in Pandas

Confusing title, let me explain. I have 2 dataframes like this:
The dataframe named df1 looks like this (with millions of rows in the original):
id text c1
1 Hello world how are you people 1
2 Hello people I am fine people 1
3 Good Morning people -1
4 Good Evening -1
Dataframe named df2 looks like this:
Word count Points Percentage
hello 2 2 100
world 1 1 100
how 1 1 100
are 1 1 100
you 1 1 100
people 3 1 33.33
I 1 1 100
am 1 1 100
fine 1 1 100
Good 2 -2 -100
Morning 1 -1 -100
Evening 1 -1 -100
-1
df2 column explanation:
count is the total number of times the word appeared in df1
points is the score given to each word by some kind of algorithm
percentage = points/count*100
Now, I want to add 40 new columns in df1, according to the point & percentage. They will look like this:
perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2
perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_ perc_70_1 perc_80_1 perc_90_1
Let me break it down. The column name contains 3 parts:
1.) perc is just a string and means nothing.
2.) A number in the range -90 to +90. For example, -90 refers to a percentage of -90 in df2. If a word has a percentage value in the range 81-90, then there will be a value of 1 in that row, in the column named perc_-80_xx. The xx is the third part.
3.) The third part is the count. I want two types of counts, 1 and 2. Following the example in point 2, if the word count is in the range 0 to 1, the value will be 1 in the perc_-80_1 column. If the word count is 2 or more, the value will be 1 in the perc_-80_2 column.
I hope this is not too confusing.

Use:
#adapted from the previous answer, adding id for matching
df2 = (df.drop_duplicates(['id','Word'])
.groupby('Word', sort=False)
.agg({'c1':['sum','size'], 'id':'first'})
)
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum':'Points','c1size':'Totalcount','idfirst':'id'})
df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100
s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
#s2= np.where(df1['Totalcount'].isin([0,1]), '1', '2')
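#create column by joining the parts
df2['new'] = 'perc_' + s1 + '_' +s2
#create indicator DataFrame
#(groupby(level=0).max() replaces .max(level=0), which was removed in pandas 2.0)
df3 = pd.get_dummies(df2[['id','new']].drop_duplicates().set_index('id'),
                     prefix='',
                     prefix_sep='').groupby(level=0).max()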
print (df3)
#reindex to add the missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
#(pd.concat replaces Series.append, which was removed in pandas 2.0)
cols = pd.concat([c + '1', c + '2'])
#join to original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')
print (df)
id text c1 perc_-100_1 perc_-90_1 \
0 1 Hello world how are you people 1 0 0
1 2 Hello people I am fine people 1 0 0
2 3 Good Morning people -1 1 0
3 4 Good Evening -1 1 0
perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 ... perc_10_2 \
0 0 0 0 0 0 ... 0
1 0 0 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 \
0 0 1 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
perc_80_2 perc_90_2 perc_100_2
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
[4 rows x 45 columns]
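The column-naming step in the answer can be checked in isolation. The sketch below applies it to three words from the question's df2 table; the small words frame is a hand-built stand-in, not the answer's actual df2. Note that astype(int) truncates toward zero, so 33.33 lands in the 30 bucket.

```python
import numpy as np
import pandas as pd

# Hand-built sample with three rows taken from the question's df2 table
words = pd.DataFrame({"Word": ["hello", "people", "Good"],
                      "Totalcount": [2, 3, 2],
                      "Points": [2, 1, -2]})
words["Percentage"] = words["Points"] / words["Totalcount"] * 100

# Bucket the percentage to a multiple of 10 (truncating toward zero)
bucket = words["Percentage"].div(10).astype(int).mul(10).astype(str)

# '1' when the word appears once, '2' when it appears more often
count_tag = pd.Series(np.where(words["Totalcount"] == 1, "1", "2"),
                      index=words.index)

words["new"] = "perc_" + bucket + "_" + count_tag
print(words[["Word", "Percentage", "new"]])
```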

creating conditions on np.where in Pandas based on value in current column

I have a dataframe in Pandas (subset below).
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1 IN & 200D_MA both =1, results 1
11/4/2013 0 1 1 PREVIOUS TEST ROW =1 & 200D_MA = 1, TEST ans=1
11/5/2013 0 1 1 PREVIOUS TEST ROW =1 & 200D_MA = 1, TEST ans=1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0 PREVIOUS TEST ROW =1 & 200D_MA = 0, TEST ans=0
This is easy to do in Excel, so I thought it would be easy to do in Python. I have this code using nested np.where calls:
df3['TEST'] = np.where( (df3['IN'] == 1) & (df3['200D_MA'] == 1),1,\
np.where( (df3['TEST'].shift(-1) == 1)\
& (df3['200D_MA'] == 1),1,0))
but it throws a KeyError: 'IN', presumably because I am using a condition on a column that has not been created yet. Can anyone help me figure out how to do this?

It seems you need a conditional ffill:
df['TEST']=df.loc[df.IN==1,'IN']
df.loc[df['200D_MA']==1,'TEST']=df.loc[df['200D_MA']==1,'TEST'].ffill()
df.fillna(0,inplace=True)
df.TEST=df.TEST.astype(int)
df
Out[349]:
DATE IN 200D_MA TEST
0 10/30/2013 0 1 0
1 10/31/2013 0 1 0
2 11/1/2013 1 1 1
3 11/4/2013 0 1 1
4 11/5/2013 0 1 1
5 11/6/2013 0 1 1
6 11/7/2013 0 1 1
7 11/8/2013 0 1 1
8 11/11/2013 0 0 0
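For comparison, here is a sketch of the rule as I read it from the annotations in the question: TEST is 1 when 200D_MA is 1 and either IN is 1 or the previous TEST was 1. It is written once as an explicit loop and once vectorized. The names latch_loop, ma_on, signal and block are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["10/30/2013", "10/31/2013", "11/1/2013", "11/4/2013", "11/5/2013",
             "11/6/2013", "11/7/2013", "11/8/2013", "11/11/2013"],
    "IN": [0, 0, 1, 0, 0, 0, 0, 0, 0],
    "200D_MA": [1, 1, 1, 1, 1, 1, 1, 1, 0],
})

def latch_loop(in_flags, ma_flags):
    """Reference version: apply the rule row by row."""
    out, prev = [], 0
    for in_flag, ma_flag in zip(in_flags, ma_flags):
        prev = int(ma_flag == 1 and (in_flag == 1 or prev == 1))
        out.append(prev)
    return out

# Vectorized version: every row where 200D_MA is not 1 starts a new block,
# and inside a block TEST switches on at the first row with IN == 1
ma_on = df["200D_MA"].eq(1)
signal = (df["IN"].eq(1) & ma_on).astype(int)
block = (~ma_on).cumsum()
df["TEST"] = signal.groupby(block).cummax()
print(df)
```

Under this reading, a row with 200D_MA equal to 0 resets the latch, so a later run of 1s in 200D_MA starts again from 0 until IN fires. A forward fill over only the rows where 200D_MA is 1 would carry the value across such a gap.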

I think you can use rolling to calculate previous TEST row.
df['TEST'] = (df['IN 200D_MA'] & df['IN 200D_MA'].rolling(2).min().shift(1)).astype(int)
Output:
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1
11/4/2013 0 1 1
11/5/2013 0 1 1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0

Convert Dictionary to Pandas in Python

I have a dict as follows:
data_dict = {'1.160.139.117': ['712907','742068'],
'1.161.135.205': ['667386','742068'],
'1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a dataframe whose columns are the values of the dictionary and whose rows are its keys?

The best option is #4:
# groupby(level=0).sum() replaces .sum(level=0), which was removed in pandas 2.0
pd.get_dummies(df.stack()).groupby(level=0).sum()
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
.set_index(0,append=True)['level_1']\
.unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a little reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
.pivot(index='level_0',columns='Values')['level_1']\
.notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out this short solution, the fastest so far)
# groupby(level=0).sum() replaces .sum(level=0), which was removed in pandas 2.0
pd.get_dummies(df.stack()).groupby(level=0).sum()
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
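The same indicator table can also be built straight from the dictionary, without the intermediate wide dataframe. This is a minimal sketch of one way to do it; exploded and indicators are illustrative names.

```python
import pandas as pd

data_dict = {'1.160.139.117': ['712907', '742068'],
             '1.161.135.205': ['667386', '742068'],
             '1.162.51.21': ['326136', '663056', '742068']}

# A Series of lists keyed by address, exploded to one row per (key, value)
exploded = pd.Series(data_dict).explode()

# One indicator column per distinct value, collapsed back to one row per key
indicators = pd.get_dummies(exploded).groupby(level=0).sum().astype(int)
print(indicators)
```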

Append count of rows meeting a condition within a group to Pandas dataframe

I know how to append a column counting the number of elements in a group, but I need to do so just for the number within that group that meets a certain condition.
For example, if I have the following data:
import numpy as np
import pandas as pd
columns=['group1', 'value1']
data = np.array([np.arange(5)]*2).T
mydf = pd.DataFrame(data, columns=columns)
mydf.group1 = [0,0,1,1,2]
mydf.value1 = ['P','F',100,10,0]
valueslist={'50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S'}
and my dataframe therefore looks like this:
mydf
group1 value1
0 0 P
1 0 F
2 1 100
3 1 10
4 2 0
I then want to count the number of rows within each group1 value where value1 is in valueslist.
My desired output is:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0

After changing the type of the value1 column to match your valueslist (or the other way around), you can use isin to get a True/False column, and convert that to 1s and 0s with astype(int). Then we can apply an ordinary groupby transform:
In [13]: mydf["value1"] = mydf["value1"].astype(str)
In [14]: mydf["count"] = (mydf["value1"].isin(valueslist).astype(int)
.groupby(mydf["group1"]).transform(sum))
In [15]: mydf
Out[15]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
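A self-contained version of the steps above, for anyone who wants to run it as a script. It builds the same data directly and constructs valueslist with a set expression instead of the long literal; in_list is an illustrative name.

```python
import pandas as pd

mydf = pd.DataFrame({"group1": [0, 0, 1, 1, 2],
                     "value1": ["P", "F", 100, 10, 0]})

# Same members as the question's valueslist: '50'..'100' plus six letters
valueslist = {str(n) for n in range(50, 101)} | set("ABCDPS")

# Compare as strings so that 100 matches '100'
in_list = mydf["value1"].astype(str).isin(valueslist)

# Sum the True/False flags per group and broadcast back to every row
mydf["count"] = in_list.astype(int).groupby(mydf["group1"]).transform("sum")
print(mydf)
```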

mydf.value1=mydf.value1.astype(str)
mydf['count']=mydf.group1.map(mydf.groupby('group1').apply(lambda x : sum(x.value1.isin(valueslist))))
mydf
Out[412]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Data input :
valueslist=['50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S']

You can group by group1 and then use transform to count how many of your values are in the list.
mydf['count'] = mydf.groupby('group1')['value1'].transform(lambda x: x.astype(str).isin(valueslist).sum())
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0

Here is one way to do it, albeit a one-liner:
mydf.merge(mydf.groupby('group1').apply(lambda x: len(set(x['value1'].values).intersection(valueslist))).reset_index().rename(columns={0: 'count'}), how='inner', on='group1')
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0

