How sum columns with same name [duplicate] - python

I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to leave only global categories -- English, History, Literature -- and write the sum of the value of their components, respectively, in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
df['History'] = df[filter_col].sum(axes=1)
print df['History', df[filter_col]]
However, both gives the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be
hashed
My question is either: how can I debug this error or is there another solution for my problem. Notice that I have a rather large dataframe with about 100 columns and 400000 rows, so I'm looking for an optimized solution, like using loc in pandas.

I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
'a_a': [1, 2, 3, 4],
'a_b': [2, 3, 4, 5],
'b_a': [1, 2, 3, 4],
'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
In your case, make sure to split using the '-' character.

You can use these to create sum of columns starting with specific name,
df['Economics']= df[list(df.filter(regex='Economics'))].sum(axis=1)

Using brilliant DSM's idea:
from __future__ import print_function
import pandas as pd
categories = set(['Economics', 'English', 'Histo', 'Literature'])
def correct_categories(cols):
return [cat for col in cols for cat in categories if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns),axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of "Histo/History" problematic..
from __future__ import print_function
import pandas as pd
#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern: desired name
#
categories = {
'Histo': 'History',
'Economics': 'Economics',
'English': 'English',
'Literature': 'Literature'
}
def correct_categories(cols):
return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())
rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
Economics English History Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
History
Id
56 2
11 0
6 0
43 1
14 1
Name: History, dtype: int64
PS You may want to add missing categories to categories map/dictionary

Related

Replace a particular column value with 1 and the rest with 0

I have a DataFrame which has a column containing these values with % occurrence
I want to convert the value with highest occurrence as 1 and the rest as 0.
How can I do the same using Pandas?
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'availability': np.random.randint(0, 100, 10), 'some_col': np.random.randn(10)})
print(df)
"""
availability some_col
0 9 -0.332662
1 35 0.193257
2 1 2.042402
3 50 -0.298372
4 52 -0.669655
5 3 -1.031884
6 44 -0.763867
7 28 1.093086
8 67 0.723319
9 87 -1.439568
"""
df['availability'] = np.where(df['availability'] == df['availability'].max(), 1, 0)
print(df)
"""
availability some_col
0 0 -0.332662
1 0 0.193257
2 0 2.042402
3 0 -0.298372
4 0 -0.669655
5 0 -1.031884
6 0 -0.763867
7 0 1.093086
8 0 0.723319
9 1 -1.439568
"""
Edit
If you are trying to mask the rows with the values that occur most often instead, try this:
df = pd.DataFrame(
{
'availability': [10, 10, 20, 30, 40, 40, 50, 50, 50, 50],
'some_col': np.random.randn(10)
}
)
print(df)
"""
availability some_col
0 10 0.954199
1 10 0.779256
2 20 -0.438860
3 30 -2.547989
4 40 0.587108
5 40 0.398858
6 50 0.776177 # <--- Most Frequent is 50
7 50 -0.391724 # <--- Most Frequent is 50
8 50 -0.886805 # <--- Most Frequent is 50
9 50 1.989000 # <--- Most Frequent is 50
"""
df['availability'] = np.where(df['availability'].isin(df['availability'].mode()), 1, 0)
print(df)
"""
availability some_col
0 0 0.954199
1 0 0.779256
2 0 -0.438860
3 0 -2.547989
4 0 0.587108
5 0 0.398858
6 1 0.776177
7 1 -0.391724
8 1 -0.886805
9 1 1.989000
"""
Try:
df.availability.apply(lambda x: 1 if x == df.availability.value_counts().idxmax() else 0)
You can use Series.mode() to get the most often value and use isin to check if value in column in list
df['col'] = df['availability'].isin(df['availability'].mode()).astype(int)
You can compare to the mode with isin, then convert the boolean to integer (True -> 1, False -> 0):
df['col2'] = df['col'].isin(df['col'].mode()).astype(int)
example (here, 2 and 4 are tied as most frequent value), as new column "col2" for clarity:
col col2
0 0 0
1 2 1
2 2 1
3 2 1
4 4 1
5 4 1
6 4 1
7 1 0

pandas: convert a list in a column into separate columns

dataframe df has a column
id data_words
1 [salt,major,lab,water]
2 [lab,plays,critical,salt]
3 [water,success,major]
I want to make one-hot-code of the column
id critical lab major plays salt success water
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 1 0 1 0
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
columns=mlb.classes_,
index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
Attempt 2:
I converted the list into simple comma separated string with the following code
df['data_words_Joined'] = df.data_words.apply(','.join)
it makes the dataframe as following
id data_words
1 salt,major,lab,water
2 lab,plays,critical,salt
3 water,success,major
Then I tried
pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)
But It makes all the words into one column name instead of separate words as separate columns
id salt,major,lab,water lab,plays,critical,salt water,success,major
1 1 0 0
2 0 1 0
3 0 0 1
You can try with explode followed by pivot_table
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'],columns=df_e['data_words'],values='id',aggfunc='count',fill_value=0))
Returning the following output:
data_words critical lab major plays salt success water
id
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id':[1,2,3],
'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})
Which looks like:
id data_words
0 1 [salt, major, lab, water]
1 2 [lab, plays, critical, salt]
2 3 [water, success, major]
One possible approach could be to use get_dummies with your apply function:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
critical lab major plays salt success water
0 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0
2 0 0 1 0 0 1 1
Tested with pandas version 1.1.2 and borrowed input data from Celius Stingher's Answer.

How to encode dummy variables in Python for sequential data such that the same order is maintained always?

A simple issue really, I have a dataset that is too large to hold in to memory and thus must load it then perform machine learning on it sequentially. One of my features is categorical and I would like to do convert it to a dummy variable, but I have two issues:
1) Not all of the categories are present during a splice. So I would like to add the extra categories even if they are not presented in the current slice
2) The columns would have to maintain the same order as they were before.
This is an example of the problem:
In[1]: import pandas as pd
splice1 = pd.Series(list('bdcccb'))
Out[1]: 0 b
1 d
2 c
3 c
4 c
5 b
dtype: object
In[2]: splice2 = pd.Series(list('accd'))
Out[2]: 0 a
1 c
2 c
3 d
dtype: object
In[3]: splice1_dummy = pd.get_dummies(splice1)
Out[3]: b c d
0 1 0 0
1 0 0 1
2 0 1 0
3 0 1 0
4 0 1 0
5 1 0 0
In[4]: splice2_dummy = pd.get_dummies(splice2)
Out[4]: a c d
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Edit: How to deal with the N-1 rule. A dummy variable has to be dropped, but which one to drop? Every new splice would hold different categorical variables.
So if you pass the categories in the exact order that you want, get_dummies will maintain it regardless. The code shows how its done.
In[1]: from pandas.api.types import CategoricalDtype
splice1 = pd.Series(list('bdcccb'))
splice1 = splice1.astype(CategoricalDtype(categories=['a','c','b','d']))
splice2 = pd.Series(list('accd'))
splice2 = splice2.astype(CategoricalDtype(categories=['a','c','b','d']))
In[2]: splice1_dummy = pd.get_dummies(splice1)
Out[2]: a c b d
0 0 0 1 0
1 0 0 0 1
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 0 1 0
In[3]: splice2_dummy = pd.get_dummies(splice2)
Out[3]: a c b d
0 1 0 0 0
1 0 1 0 0
2 0 1 0 0
3 0 0 0 1
Although, I still haven't solved the issue of which variable to drop.

Append count of rows meeting a condition within a group to Pandas dataframe

I know how to append a column counting the number of elements in a group, but I need to do so just for the number within that group that meets a certain condition.
For example, if I have the following data:
import numpy as np
import pandas as pd
columns=['group1', 'value1']
data = np.array([np.arange(5)]*2).T
mydf = pd.DataFrame(data, columns=columns)
mydf.group1 = [0,0,1,1,2]
mydf.value1 = ['P','F',100,10,0]
valueslist={'50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S'}
and my dataframe therefore looks like this:
mydf
group1 value1
0 0 P
1 0 F
2 1 100
3 1 10
4 2 0
I would then want to count the number of rows within each group1 value where value1 is in valuelist.
My desired output is:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
After changing the type of the value1 column to match your valueslist (or the other way around), you can use isin to get a True/False column, and convert that to 1s and 0s with astype(int). Then we can apply an ordinary groupby transform:
In [13]: mydf["value1"] = mydf["value1"].astype(str)
In [14]: mydf["count"] = (mydf["value1"].isin(valueslist).astype(int)
.groupby(mydf["group1"]).transform(sum))
In [15]: mydf
Out[15]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
mydf.value1=mydf.value1.astype(str)
mydf['count']=mydf.group1.map(mydf.groupby('group1').apply(lambda x : sum(x.value1.isin(valueslist))))
mydf
Out[412]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Data input :
valueslist=['50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S']
You can groupby each group1 and then use transform to find the max of whether your values are in the list.
mydf['count'] = mydf.groupby('group1').transform(lambda x: x.astype(str).isin(valueslist).sum())
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Here is one way to do it, albeit a one-liner:
mydf.merge(mydf.groupby('group1').apply(lambda x: len(set(x['value1'].values).intersection(valueslist))).reset_index().rename(columns={0: 'count'}), how='inner', on='group1')
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0

Keeping subset of row labels in pandas DataFrame based on second index

Given a DataFrame with a hierarchical index containing three levels (experiment, trial, slot) and a second DataFrame with a hierarchical index containing two levels (experiment, trial), how do I drop all the rows in the first DataFrame whose (experiment, trial) are not contained in the second dataframe?
Example data:
from io import StringIO
import pandas as pd
df1_data = StringIO(u',experiment,trial,slot,token\n0,btn144a10_p_RDT,0,0,4.0\n1,btn144a10_p_RDT,0,1,14.0\n2,btn144a10_p_RDT,1,0,12.0\n3,btn144a10_p_RDT,1,1,14.0\n4,btn145a07_p_RDT,0,0,6.0\n5,btn145a07_p_RDT,0,1,19.0\n6,btn145a07_p_RDT,1,0,17.0\n7,btn145a07_p_RDT,1,1,13.0\n8,chn004b06_p_RDT,0,0,6.0\n9,chn004b06_p_RDT,0,1,8.0\n10,chn004b06_p_RDT,1,0,2.0\n11,chn004b06_p_RDT,1,1,5.0\n12,chn008a06_p_RDT,0,0,12.0\n13,chn008a06_p_RDT,0,1,14.0\n14,chn008a06_p_RDT,1,0,6.0\n15,chn008a06_p_RDT,1,1,4.0\n16,chn008b06_p_RDT,0,0,3.0\n17,chn008b06_p_RDT,0,1,13.0\n18,chn008b06_p_RDT,1,0,12.0\n19,chn008b06_p_RDT,1,1,19.0\n20,chn008c04_p_RDT,0,0,17.0\n21,chn008c04_p_RDT,0,1,2.0\n22,chn008c04_p_RDT,1,0,1.0\n23,chn008c04_p_RDT,1,1,6.0\n')
df1 = pd.DataFrame.from_csv(df1_data).set_index(['experiment', 'trial', 'slot'])
df2_data = StringIO(u',experiment,trial,target\n0,btn145a07_p_RDT,1,13\n1,chn004b06_p_RDT,1,9\n2,chn008a06_p_RDT,0,15\n3,chn008a06_p_RDT,1,15\n4,chn008b06_p_RDT,1,1\n5,chn008c04_p_RDT,1,12\n')
df2 = pd.DataFrame.from_csv(df2_data).set_index(['experiment', 'trial'])
The first dataframe looks like:
token
experiment trial slot
btn144a10_p_RDT 0 0 4
1 14
1 0 12
1 14
btn145a07_p_RDT 0 0 6
1 19
1 0 17
1 13
chn004b06_p_RDT 0 0 6
1 8
1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 0 0 3
1 13
1 0 12
1 19
chn008c04_p_RDT 0 0 17
1 2
1 0 1
1 6
The second dataframe looks like:
target
experiment trial
btn145a07_p_RDT 1 13
chn004b06_p_RDT 1 9
chn008a06_p_RDT 0 15
1 15
chn008b06_p_RDT 1 1
chn008c04_p_RDT 1 12
The result I want:
token
experiment trial slot
btn145a07_p_RDT 1 0 17
1 13
chn004b06_p_RDT 1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 1 0 12
1 19
chn008c04_p_RDT 1 0 1
1 6
One way to do it would by using merge
merged = pd.merge(
df2.reset_index(),
df1.reset_index(),
left_on=['experiment', 'trial'],
right_on=['experiment', 'trial'],
how='left')
You just need to reindex merged to whatever you like (I could not tell exactly from the question).
What should work is
df1.loc[df2.index]
but multi indexing still has some problems. What does work is
df1.reset_index(2).loc[df2.index].set_index('slot', append=True)
which is a bit of a hack around this problem. Note that
df1.loc[df2.index[:1]]
gives garbage while
df.loc[df2.index[0]]
gives what you would expect. So passing multiple values from a m-level index to an n-level index where n > m > 2 doesn't work, though it should.

Categories