I would like to give a name to groups of columns and rows in my Pandas DataFrame to achieve the same result as a merged Excel table:
However, I can't find any way to give an overarching name to groups of columns/rows like what is shown.
I tried wrapping the tables in an array, but the dataframes don't display:
labels = ['a', 'b', 'c']
df = pd.DataFrame(np.ones((3,3)), index=labels, columns=labels)
labeledRowsCols = pd.DataFrame([df, df])
labeledRowsCols = pd.DataFrame(labeledRowsCols.T, index=['actual'], columns=['predicted 1', 'predicted 2'])
print(labeledRowsCols)
predicted 1 predicted 2
actual NaN NaN
You can set hierarchical indices for both the rows and columns.
import pandas as pd
df = pd.DataFrame([[3,1,0,3,1,0],[0,3,0,0,3,0],[2,1,3,2,1,3]])
col_ix = pd.MultiIndex.from_product([['Predicted: Set 1', 'Predicted: Set 2'], list('abc')])
row_ix = pd.MultiIndex.from_product([['True label'], list('abc')])
df = df.set_index(row_ix)
df.columns = col_ix
df
# returns:
Predicted: Set 1 Predicted: Set 2
a b c a b c
True label a 3 1 0 3 1 0
b 0 3 0 0 3 0
c 2 1 3 2 1 3
Exporting this to Excel should have the merged cells as in your example.
Related
I am trying to split my dataframe based on a partial match of the column name, using a group level stored in a separate dataframe. The dataframes are here, and the expected output is below
df = pd.DataFrame(data={'a19-76': [0,1,2],
'a23pz': [0,1,2],
'a23pze': [0,1,2],
'b887': [0,1,2],
'b59lp':[0,1,2],
'c56-6u': [0,1,2],
'c56-6uY': [np.nan, np.nan, np.nan]})
ids = pd.DataFrame(data={'id': ['a19', 'a23', 'b8', 'b59', 'c56'],
'group': ['test', 'sub', 'test', 'pass', 'fail']})
desired output
test_ids = 'a19-76', 'b887'
sub_ids = 'a23pz', 'a23pze', 'c56-6u'
pass_ids = 'b59lp'
fail_ids = 'c56-6u', 'c56-6uY'
I have written thise onliner, which assigned the group to each column name, but doesnt create two seperate lists as required above
gb = ids.groupby([[col for col in df.columns if col.startswith(tuple(i for i in ids.id))], 'group']).agg(lambda x: list(x)).reset_index()
gb.groupby('group').agg({'level_0':lambda x: list(x)})
thanks for reading
May be not what you are looking for, but anyway.
A pending question is what to do with not matched columns, the answer obviously depends on what you will do after matching.
Plain python solution
Simple collections wrangling, but there may be a simpler way.
from collections import defaultdict
groups = defaultdict(list)
idsr = ids.to_records(index=False)
for col in df.columns:
for id, group in idsr:
if col.startswith(id):
groups[group].append(col)
break
# the following 'else' clause is optional, it creates a group for not matched columns
else: # for ... else ...
groups['UNGROUPED'].append(col)
Groups =
{'sub': ['a23pz', 'c56-6u'], 'test': ['a19-76', 'b887', 'b59lp']}
Then after
df.columns = pd.MultiIndex.from_tuples(sorted([(k, col) for k,id in groups.items() for col in id]))
df =
sub test
a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
pandas solution
Columns to dataframe
product of dataframes (join )
filtering of the resulting dataframe
There is surely a better way
df1 = ids.copy()
df2 = df.columns.to_frame(index=False)
df2.columns = ['col']
# Not tested enhancement:
# with pandas version >= 1.2, the four following lines may be replaced by a single one :
# dfm = df1.merge(df2, how='cross')
df1['join'] = 1
df2['join'] = 1
dfm = df1.merge(df2, on='join').drop('join', axis=1)
df1.drop('join', axis=1, inplace = True)
dfm['match'] = dfm.apply(lambda x: x.col.find(x.id), axis=1).ge(0)
dfm = dfm[dfm.match][['group', 'col']].sort_values(by=['group', 'col'], axis=0)
dfm =
group col
6 sub a23pz
24 sub c56-6u
0 test a19-76
18 test b59lp
12 test b887
# Note 1: The index can be removed
# note 2: Unmatched columns are not taken in account
then after
df.columns = pd.MultiIndex.from_frame(dfm)
df =
group sub test
col a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
You can use a regex generated from the values in iidf and filter:
Example with "test":
s = iddf.set_index('group')['id']
regex_test = '^(%s)' % '|'.join(s.loc['test'])
# the generated regex is: '^(a19|b8|b59)'
df.filter(regex=regex_test)
output:
a19-76 b887 b59lp
0 0 0 0
1 1 1 1
2 2 2 2
To get a list of columns for each unique group in iidf, apply the same process in a dictionary comprehension:
{x: list(df.filter(regex='^(%s)' % '|'.join(s.loc[x])).columns)
for x in s.index.unique()}
output:
{'test': ['a19-76', 'b887', 'b59lp'],
'sub': ['a23pz', 'c56-6u']}
NB. this should generalize to any number of groups, however, if really there are many groups, it will be preferable to loop on the columns names rather than using filter repeatedly
A straightforward groupby(...).apply(...) can achieve this result:
def id_match(group, to_match):
regex = "[{}]".format("|".join(group))
matches = to_match.str.match(regex)
return pd.Series(to_match[matches])
matched_groups = ids.groupby("group")["id"].apply(id_match, df.columns)
print(matched_groups)
group
fail 0 c56-6u
1 c56-6uY
pass 0 b887
1 b59lp
sub 0 a19-76
1 a23pz
2 a23pze
test 0 a19-76
1 a23pz
2 a23pze
3 b887
4 b59lp
You can treat this Series as a dictionary-like entity to access each of the groups independently:
print(matched_ids["fail"])
0 c56-6u
1 c56-6uY
Name: id, dtype: object
print(matched_ids["pass"])
0 b887
1 b59lp
Name: id, dtype: object
Then you can take it a step further to can subset your original DataFrame with this new Series like so:
print(df[matched_ids["fail"]])
c56-6u c56-6uY
0 0 NaN
1 1 NaN
2 2 NaN
print(df[matched_ids["pass"]])
b887 b59lp
0 0 0
1 1 1
2 2 2
I'm in a trouble with adding a new column to a pandas dataframe when the length of new column value is bigger than length of index.
Data may like this :
import pandas as pd
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
So, you see, length of this df's index is 3.
And next I wanna add a new column , code may like this two ways below:
df["new_col"] = [1,2,3,4]
It'll raise an error : Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
I will just get values[1,2,3] in my data frame df.
(The count of new column values can't out of the max index).
Now, what I want just like :
Is there a better way ?
Looking forward to your answer,thanks!
Use DataFrame.join with change Series name and right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is add reindex by new s.index:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col = s)
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
new_col = pd.Series([1,2,3,4])
df = pd.concat([df,new_col],axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
I have a df where I have several columns, that, based on the value (1-6) in these columns, I want to assign a value (0-1) to its corresponding column. I can do it on a column by column basis but would like to make it a single function. Below is some example code:
import pandas as pd
df = pd.DataFrame({'col1': [1,3,6,3,5,2], 'col2': [4,5,6,6,1,3], 'col3': [3,6,5,1,1,6],
'colA': [0,0,0,0,0,0], 'colB': [0,0,0,0,0,0], 'colC': [0,0,0,0,0,0]})
(col1 corresponds with colA, col2 with colB, col3 with colC)
This code works on a column by column basis:
df.loc[(df.col1 != 1) & (df.col1 < 6), 'colA'] = (df['colA']+ 1)
But I would like to be able to have a list of columns, so to speak, and have it correspond with another. Something like this, (but that actually works):
m = df['col1' : 'col3'] != 1 & df['col1' : 'col3'] < 6
df.loc[m, 'colA' : 'colC'] += 1
Thank You!
Idea is filter both DataFrames by DataFrame.loc, then filter columns by mask and rename columns by another df2 and last use DataFrame.add only for df.columns:
df1 = df.loc[:, 'col1' : 'col3']
df2 = df.loc[:, 'colA' : 'colC']
d = dict(zip(df1.columns,df2.columns))
df1 = ((df1 != 1) & (df1 < 6)).rename(columns=d)
df[df2.columns] = df[df2.columns].add(df1)
print (df)
col1 col2 col3 colA colB colC
0 1 4 3 0 1 1
1 3 5 6 1 1 0
2 6 6 5 0 0 1
3 3 6 1 1 0 0
4 5 1 1 1 0 0
5 2 3 6 1 1 0
Here's what I would do:
# split up dataframe
sub_df = df.iloc[:,:3]
abc = df.iloc[:,3:]
# make numpy array truth table
truth_table = (sub_df.to_numpy() > 1) & (sub_df.to_numpy() < 6)
# redefine abc based on numpy truth table
new_abc = pd.DataFrame(truth_table.astype(int), columns=['colA', 'colB', 'colC'])
# join the updated dataframe subgroups
new_df = pd.concat([sub_df, new_abc], axis=1)
I have a dataframe like this:
data = {'id': [1,1,1,2,2,3],
'value': ['a','a','a','b','b','c'],
'obj_id': [1,2,3,3,3,4]
}
df = pd.DataFrame (data, columns = ['id','value','obj_id'])
I would like to get the unique counts of obj_id groupby id and value:
1 a 3
2 b 1
3 c 1
But when I do:
result=df.groupby(['id','value'])['obj_id'].nunique().reset_index(name='obj_counts')
the result I got was:
1 a 2
1 a 1
2 b 1
3 c 1
so the first two rows with same id and value don't group together.
How can I fix this? Many thanks!
For me your solution working nice with sample data.
Like mentioned #YOBEN_S in comments is possible problem traling whitespeces, then solution is add Series.str.strip:
data = {'id': [1,1,1,2,2,3],
'value': ['a ','a','a','b','b','c'],
'obj_id': [1,2,3,3,3,4]
}
df = pd.DataFrame (data, columns = ['id','value','obj_id'])
df['value'] = df['value'].str.strip()
df = df.groupby(['id','value'])['obj_id'].nunique().reset_index(name='obj_counts')
print (df)
id value obj_counts
0 1 a 3
1 2 b 1
2 3 c 1
I have two data frame, one is user-item-rating and the other is side information of the items:
#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0
#df2
B000NWJTKW ....
BDASK99000 ....
Now I w'd like to map the name of item and user to integer ID. I know there is a way of factorize:
df.apply(lambda x: pd.factorize(x)[0] + 1)
But I 'd like to ensure that the integer of the items in two data frame are consistent. So the resulting data frames is:
#df1
1 1 5.0
2 1 4.0
3 2 1.0
#df2
1 ...
2 ...
Do you know how to ensure that? Thanks in advance!
Concatenate the common column(s), and apply pd.factorize (or pd.Categorical) on that:
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
For example,
import pandas as pd
df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
[('B000NWJTKW', 10),
('BDASK99000', 20)], columns=['item', 'extra'])
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1
print(df1)
print(df2)
yields
# df1
user item rating
0 1 1 5
1 2 1 4
2 3 2 1
# df2
item extra
0 1 10
1 2 20
Another way to work-around the problem (if you have enough memory) would be to merge the two DataFrames: df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:
df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)
yields
user item rating extra
0 1 1 5 10
1 2 1 4 10
2 3 2 1 20
Another option could be to apply factorize on the first dataframe, and then apply the resulting mapping to the second dataframe:
# create factorization:
idx, levels = pd.factorize(df1['item'])
# replace the item codes in the first dataframe with the new index value
df1['item'] = idx
# create a dictionary mapping the original code to the new index value
d = {code: i for i, code in enumerate(codes)}
# apply this mapping to the second dataframe
df2['item'] = df2.item.apply(lambda code: d[code])
This approach will only work if every level is present in both dataframes.