Python: combine columns then rows by group

I have this dataframe:
In [182]: data_set
Out[182]:
  name parent  distance  rank
0    x    aaa        10     1
1    x    bbb         5     1
2    x    fff         3     2
3    y    aaa         2     2
4    y    bbb        10     1
5    z    ccc         8     2
I want to reshape it to:
name  Combined
x     ('aaa',10,1),('bbb',5,1),('fff',3,2)
y     ('aaa',2,2),('bbb',10,1)
z     ('ccc',8,2)
Then I want to convert it into a 3x2 dataframe with the two columns name and combined.
I was thinking of using zip or groupby, but those return different outputs.

First combine the columns into tuples, then group by name and aggregate to lists.
df['combined'] = df[['parent', 'distance', 'rank']].apply(tuple, axis=1)
res = df.groupby('name')['combined'].apply(list).reset_index()
print(res)
  name                                   combined
0    x  [(aaa, 10, 1), (bbb, 5, 1), (fff, 3, 2)]
1    y                [(aaa, 2, 2), (bbb, 10, 1)]
2    z                              [(ccc, 8, 2)]

Using groupby and apply:
df.groupby('name')[['parent','distance','rank']].apply(lambda x : x.values.tolist())
Out[14]:
name
x    [[aaa, 10, 1], [bbb, 5, 1], [fff, 3, 2]]
y                  [[aaa, 2, 2], [bbb, 10, 1]]
z                                [[ccc, 8, 2]]
dtype: object
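This returns a Series indexed by name. If the final goal is the two-column dataframe described in the question, a minimal follow-up (a sketch, assuming the result above is stored in s) is:
s = df.groupby('name')[['parent', 'distance', 'rank']].apply(lambda x: x.values.tolist())
res = s.rename('combined').reset_index()  # columns: name, combined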

Related

Conflate sets of pandas columns into a single column

I have a dataframe that looks like this:
n objects  id  x  y  Vx  Vy  id.1  x.1 ...  Vx.40  Vy.40 ...
0      41   1  2  3   4   5    17    3 ...      5      6 ...
1      21   1  2  3   4   5    17    3 ...      0      0 ...
2      36   1  2  3   4   5    17    3 ...      0      0 ...
My goal is to conflate the contents of every set of id, x, y, Vx, and Vy columns into a single column.
I.e. the end result should look like this:
n objects     object_0    object_1      object_40 ...
0      41  [1,2,3,4,5]  [17,3,...] ...   [...5,6] ...
1      21  [1,2,3,4,5]  [17,3,...] ...   [...0,0] ...
2      36  [1,2,3,4,5]  [17,3,...] ...   [...0,0] ...
I am kind of at a loss as to how to achieve that. My only idea was hardcoding it like
df['object_0'] = df[['id', 'x', 'y', 'Vx', 'Vy']].values.tolist()
df.drop(['id', 'x', 'y', 'Vx', 'Vy'], axis=1, inplace=True)
for i in range(1, 41):
    df[f'object_{i}'] = df[[f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}']].values.tolist()
    df.drop([f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}'], axis=1, inplace=True)
but that is not a good option, as the number (and names) of repeating columns varies between dataframes. What is consistent is that the number of objects per row is listed, and every object has the same number of elements (i.e. there are no cases of columns going like id.26, y.26, Vx.26, id.27, Vy.27, id.28, ...).
I suppose I could find the number of objects via something like
last_obj = max(int(col.split('.')[-1]) for col in df.columns if '.' in col)
and then dig out the number and names of cols per object by
[col.split('.')[0] for col in df.columns if col.split('.')[-1] == str(last_obj)]
but at that point this all starts seeming a bit too cluttered and hacky.
Is there a cleaner way to do that, one that works irrespective of the number of objects, of columns per object, and (ideally) of column names? Any help would be appreciated!
EDIT:
This does work, but is there a more elegant way of doing it?
last_obj = max([int(col.split('.')[-1]) for col in df.columns if '.' in col])
obj_col_names = [col.split('.')[0] for col in df.columns if col.split('.')[-1] == str(last_obj)]
df['object_0'] = df[obj_col_names].values.tolist()
df.drop(obj_col_names, axis=1, inplace=True)
for i in range(1, last_obj + 1):
    current_col_set = ["".join([col, f'.{i}']) for col in obj_col_names]
    df[f'object_{i}'] = df[current_col_set].values.tolist()
    df.drop(current_col_set, axis=1, inplace=True)
This solution relabels the columns into same-numbered groups, then groups by those labels along the column axis and converts each group into lists.
Starting with
   n  objects  id  x  y  Vx  Vy  id.1  x.1  y.1  Vx.1  Vy.1
0  0       41   1  2  3   4   5    17    3    3     4     5
1  1       21   1  2  3   4   5    17    3    3     4     5
2  2       36   1  2  3   4   5    17    3    3     4     5
Then
import numpy as np

nb_cols = df.shape[1] - 2
nb_groups = int(df.columns[-1].split('.')[1]) + 1
cols_per_group = nb_cols // nb_groups
group_cols = np.arange(nb_cols) // cols_per_group
explode_cols = list(np.arange(nb_groups))
pd.concat([df.loc[:, :'objects'].reset_index(drop=True),
           df.loc[:, 'id':].set_axis(group_cols, axis=1)
             .groupby(level=0, axis=1)
             .apply(lambda x: x.values)
             .to_frame().T
             .explode(explode_cols)
             .reset_index(drop=True)
             .rename(columns=lambda x: 'object_' + str(x))],
          axis=1)
Result
   n  objects         object_0          object_1
0  0       41  [1, 2, 3, 4, 5]  [17, 3, 3, 4, 5]
1  1       21  [1, 2, 3, 4, 5]  [17, 3, 3, 4, 5]
2  2       36  [1, 2, 3, 4, 5]  [17, 3, 3, 4, 5]
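A name-agnostic alternative (a sketch under the question's stated assumptions: numeric suffixes mark the repeated column groups, unsuffixed columns belong to object 0, and n and objects are the only non-object columns) is to bucket columns by their suffix:
from collections import defaultdict
import pandas as pd

def conflate(df, keep=('n', 'objects')):
    # bucket columns by numeric suffix: 'id' -> group 0, 'id.7' -> group 7
    groups = defaultdict(list)
    for col in df.columns.drop(list(keep)):
        _, _, suffix = col.partition('.')
        groups[int(suffix) if suffix else 0].append(col)
    out = df[list(keep)].copy()
    for i in sorted(groups):
        out[f'object_{i}'] = df[groups[i]].values.tolist()
    return out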

How do I reverse all lists in a pandas dataframe column?

I have a dataframe with a column that is filled with different lists. I want to reverse every list in the column.
Example of what I want:
I want this df:
index x_val
1 [1,2,3,4,5]
2 [2,3,4,5,6]
to become this:
index x_val
1 [5,4,3,2,1]
2 [6,5,4,3,2]
I tried the following code:
df['x_val'] = df['x_val'].apply(lambda x: x.reverse())
and I got this:
index x_val
1 nan
2 nan
What am I doing wrong?
list.reverse() reverses the list in place and returns None, which is why apply fills the column with missing values. Two cases:
Case 1
If the x_val column holds actual lists of ints (an object-dtype column):
df = pd.DataFrame({
    'index': [1, 2],
    'x_val': [[1,2,3,4,5], [2,3,4,5,6]]
})
Code
df['x_val'] = df['x_val'].apply(lambda x: list(reversed(x)))
df
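An equivalent one-liner (a sketch; slicing returns a reversed copy instead of mutating in place):
df['x_val'] = df['x_val'].apply(lambda x: x[::-1])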
Case 2
If the x_val column holds strings. (This works with single-character items; if in practice the items are multi-character, consider converting to lists of ints and using the Case 1 code.)
d="""index|x_val
1|[1,2,3,4,5]
2|[2,3,4,5,6]"""
df=pd.read_csv(StringIO(d), sep='|', engine='python')
Code
df['x_val'] = df['x_val'].apply(lambda x: '['+x[-2:0:-1]+']')
df
Output
index x_val
0 1 [5,4,3,2,1]
1 2 [6,5,4,3,2]
#Uts is exactly right for Case 1. For Case 2, any 2-digit (or longer) number will have its digits reversed, because the whole value is treated as a single string instead of comma-delimited integers represented as strings.
Please see Case 2b below for splitting and reversing the list of integers represented as strings.
import pandas as pd

print('Case 1 - x_val = lists of int')
df = pd.DataFrame({
    'index': [1, 2, 3],
    'x_val': [[1,2,3,4,5], [2,3,4,5,6], [6,7,8,9,10]]
})
print(df)
df['x_val'] = df['x_val'].apply(lambda x: list(reversed(x)))
print(df)

print('\nCase 2 - x_val = strings, string values will be reversed')
df = pd.DataFrame({
    'index': [1, 2, 3],
    'x_val': ['[1,2,3,4,5]', '[2,3,4,5,6]', '[6,7,8,9,10]']
})
print(df)
df['x_val'] = df['x_val'].apply(lambda x: '[' + x[-2:0:-1] + ']')
print(df)

print('\nCase 2b - x_val = strings')
df = pd.DataFrame({
    'index': [1, 2, 3],
    'x_val': ['[1,2,3,4,5]', '[2,3,4,5,6]', '[6,7,8,9,10]']
})
print(df)
df['x_val'] = df['x_val'].apply(
    lambda x: '[' + ','.join(reversed(x[1:-1].split(','))) + ']')
print(df)
Output:
Case 1 - x_val = lists of int
index x_val
0 1 [1, 2, 3, 4, 5]
1 2 [2, 3, 4, 5, 6]
2 3 [6, 7, 8, 9, 10]
index x_val
0 1 [5, 4, 3, 2, 1]
1 2 [6, 5, 4, 3, 2]
2 3 [10, 9, 8, 7, 6]
Case 2 - x_val = strings, string values will be reversed
index x_val
0 1 [1,2,3,4,5]
1 2 [2,3,4,5,6]
2 3 [6,7,8,9,10]
index x_val
0 1 [5,4,3,2,1]
1 2 [6,5,4,3,2]
2 3 [01,9,8,7,6]
Case 2b - x_val = strings
index x_val
0 1 [1,2,3,4,5]
1 2 [2,3,4,5,6]
2 3 [6,7,8,9,10]
index x_val
0 1 [5,4,3,2,1]
1 2 [6,5,4,3,2]
2 3 [10,9,8,7,6]
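If the strings are valid list literals, another option (a sketch using the standard library's ast module; note it produces real lists rather than strings) is to parse them first and reuse the Case 1 approach:
import ast
df['x_val'] = df['x_val'].apply(ast.literal_eval).apply(lambda x: x[::-1])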

How to compute a similarity metric between two Series containing lists?

Having the following Series:
a = pd.Series([[1,2,34], [2,3], [2,3,4,5,1]], index = [1,2,3])
1 [1, 2, 34]
2 [2, 3]
3 [2, 3, 4, 5, 1]
and the following metric:
import numpy as np

def metric(x, y):
    return len(np.intersect1d(x, y))
I want to compute a pairwise similarity matrix over the Series; the result should be:
   1  2  3
1  3  1  2
2  1  2  2
3  2  2  5
So far I used this:
sims = a.map(lambda x: a.map(lambda y: metric(x, y)))
pd.DataFrame({k: v for k,v in sims.items()})
I want to know if there is a more elegant method to achieve this.
You might use pd.concat to join pd.Series objects together; it's more efficient.
pd.concat([a.apply(metric, args=(a.loc[y],)) for y in a.index], axis=1)
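Another straightforward option (a sketch that builds the full matrix with a nested comprehension) is:
sim = pd.DataFrame([[metric(x, y) for y in a] for x in a],
                   index=a.index, columns=a.index)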

Create columns for every category in Pandas DataFrame

I have a data frame with many binary columns that indicate the presence of a category in the observation. Each observation has exactly 3 categories with a value of 1, and the rest are 0. I want to create 3 new columns, one for each present category, whose value is instead the name of the category (i.e. the name of the binary column equal to 1). To make it clearer:
I have:
x|y|z|k|w
---------
0|1|1|0|1
To be:
cat1|cat2|cat3
--------------
y |z |w
Can I do this?
For better performance use a numpy solution:
print (df)
x y z k w
0 0 1 1 0 1
1 1 1 0 0 1
import numpy as np

c = df.columns.values
df = pd.DataFrame(c[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat')
print (df)
cat0 cat1 cat2
0 y z w
1 x y w
Details:
#get indices of 1s
print (np.where(df))
(array([0, 0, 0, 1, 1, 1], dtype=int64), array([1, 2, 4, 0, 1, 4], dtype=int64))
#select second array
print (np.where(df)[1])
[1 2 4 0 1 4]
#reshape to 3 columns
print (np.where(df)[1].reshape(-1, 3))
[[1 2 4]
[0 1 4]]
#indexing
print (c[np.where(df)[1].reshape(-1, 3)])
[['y' 'z' 'w']
['x' 'y' 'w']]
Timings:
df = pd.concat([df] * 1000, ignore_index=True)
#jezrael solution
In [390]: %timeit (pd.DataFrame(df.columns.values[np.where(df)[1].reshape(-1, 3)]).add_prefix('cat'))
The slowest run took 4.62 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 503 µs per loop
#jpp solution
In [391]: %timeit (pd.DataFrame(df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()))
10 loops, best of 3: 111 ms per loop
#Zero solution working only with one row DataFrame, so not included
Here is one way:
import pandas as pd
df = pd.DataFrame({'x': [0, 1], 'y': [1, 1], 'z': [1, 0], 'k': [0, 1], 'w': [1, 1]})
split = df.apply(lambda row: [x for x in df.columns if row[x]], axis=1).values.tolist()
df2 = pd.DataFrame(split)
# 0 1 2 3
# 0 w y z None
# 1 k w x y
You could
In [13]: pd.DataFrame([df.columns[df.astype(bool).values[0]]]).add_prefix('cat')
Out[13]:
cat0 cat1 cat2
0 y z w
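The same idea extended to every row (a sketch, assuming exactly 3 ones per row as the question states, so the rows line up):
pd.DataFrame([df.columns[m] for m in df.astype(bool).values]).add_prefix('cat')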

Pandas: How to group by and sum MultiIndex

I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
print(y)
print(y.groupby(level=[0,1]).sum())
output
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
The groupby sum output is identical to the plain stacked Series; nothing was aggregated.
However, what I expect is:
11 x 2
11 y 6
12 x 6
12 y 10
EDIT 2:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
output:
11 x 1
y 3
x 1
y 3
12 x 3
y 5
x 3
y 5
dtype: int64
EDIT 3:
An issue has been logged
https://github.com/pydata/pandas/issues/10417
With pandas 0.16.2 and Python 3, I was able to get the correct result via:
x.stack().reset_index().groupby(['level_0','level_1']).sum()
Which produces:
                   0
level_0 level_1
11      x          2
        y          6
12      x          6
        y         10
You can then change the index and column names to more desirable ones using rename() and the columns attribute.
Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is on Series, which is what x.stack() produces. My workaround is to turn the Series into a DataFrame via reset_index(). In this case the DataFrame does not have a MultiIndex anymore - I'm just grouping on labeled columns.
To make sure that grouping and summing works on a DataFrame with a MultiIndex, you can try this to get the same correct output:
(x.stack().reset_index()
  .set_index(['level_0', 'level_1'], drop=True)
  .groupby(level=[0, 1]).sum())
Either of these workarounds should take care of things until the bug is resolved.
I wonder if the bug has something to do with the MultiIndex instances that are created on a Series vs. a DataFrame. For example:
In[1]: obj = x.stack()
type(obj)
Out[1]: pandas.core.series.Series
In[2]: obj.index
Out[2]: MultiIndex(levels=[[11, 11, 12, 12], ['x', 'y']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
vs.
In[3]: obj = x.stack().reset_index().set_index(['level_0','level_1'],drop=True)
type(obj)
Out[3]: pandas.core.frame.DataFrame
In[4]: obj.index
Out[4]: MultiIndex(levels=[[11, 12], ['x', 'y']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['level_0', 'level_1'])
Notice how the MultiIndex on the DataFrame describes the levels more correctly.
sum allows you to specify the levels to sum over in a MultiIndex data frame.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
y.sum(level=[0,1])
11 x 2
y 6
12 x 6
y 10
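In recent pandas versions sum(level=...) has been deprecated in favour of an explicit groupby (and the bug discussed above has long since been fixed), so the equivalent spelling would be:
y.groupby(level=[0, 1]).sum()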
Using Pandas 0.15.2, you just need one more iteration of groupby
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
prints
11 x 2
y 6
12 x 6
y 10
