pandas DataFrame: split the column and extend the rows - Python

I have a DataFrame like:
A B C D
1 1 2 3 ['a','b']
2 4 6 7 ['b','c']
3 1 0 1 ['a']
4 2 1 1 ['b']
5 1 2 3 []
and want to transform it to:
A B C D
1 1 2 3 ['a']
2 1 2 3 ['b']
3 4 6 7 ['b']
4 4 6 7 ['c']
5 1 0 1 ['a']
6 2 1 1 ['b']
7 1 2 3 []
That is, I want to split the list in column "D" and expand each element into its own row, using pandas.

One way would be to use a list comprehension with a doubly-nested for-loop:
>>> [key + (item,)
     for key, val in df.set_index(['A', 'B', 'C'])['D'].items()
     for item in ([[i] for i in val] or [[]])]
# [(1, 2, 3, ['a']),
# (1, 2, 3, ['b']),
# (4, 6, 7, ['b']),
# (4, 6, 7, ['c']),
# (1, 0, 1, ['a']),
# (2, 1, 1, ['b']),
# (1, 2, 3, [])]
Passing the data in this form to pd.DataFrame produces the desired result:
import pandas as pd
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
                   'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
                   'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
                   'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
result = pd.DataFrame(
    [key + (item,)
     for key, val in df.set_index(['A', 'B', 'C'])['D'].items()
     for item in ([[i] for i in val] or [[]])])
yields
0 1 2 3
0 1 2 3 [a]
1 1 2 3 [b]
2 4 6 7 [b]
3 4 6 7 [c]
4 1 0 1 [a]
5 2 1 1 [b]
6 1 2 3 []
Another option is to use df['D'].apply to expand the items in the list into different columns, and then use stack to expand the rows:
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
                   'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
                   'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
                   'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
df = df.set_index(['A', 'B', 'C'])
result = df['D'].apply(lambda x: pd.Series([[i] for i in x] if x else [[]]))
# 0 1
# A B C
# 1 2 3 [a] [b]
# 4 6 7 [b] [c]
# 1 0 1 [a] NaN
# 2 1 1 [b] NaN
# 1 2 3 [] NaN
result = result.stack()
# A B C
# 1 2 3 0 [a]
# 1 [b]
# 4 6 7 0 [b]
# 1 [c]
# 1 0 1 0 [a]
# 2 1 1 0 [b]
# 1 2 3 0 []
# dtype: object
result.index = result.index.droplevel(-1)
result = result.reset_index()
# A B C 0
# 0 1 2 3 [a]
# 1 1 2 3 [b]
# 2 4 6 7 [b]
# 3 4 6 7 [c]
# 4 1 0 1 [a]
# 5 2 1 1 [b]
# 6 1 2 3 []
Although this does not use explicit for-loops or a list comprehension, there is an implicit for-loop hidden in the call to apply. In fact, it is much slower than using a list comprehension:
In [170]: df = pd.concat([df]*10)
In [171]: %%timeit
     ...: result = df['D'].apply(lambda x: pd.Series([[i] for i in x] if x else [[]]))
     ...: result = result.stack()
     ...: result.index = result.index.droplevel(-1)
     ...: result = result.reset_index()
100 loops, best of 3: 11.5 ms per loop
In [172]: %%timeit
     ...: result = pd.DataFrame(
     ...:     [key + (item,)
     ...:      for key, val in df['D'].items()
     ...:      for item in ([[i] for i in val] or [[]])])
1000 loops, best of 3: 618 µs per loop
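On modern pandas (0.25+), DataFrame.explode handles this kind of row expansion directly. A minimal sketch (not part of the original answers): note that explode yields scalars rather than one-element lists, and empty lists become NaN; `ignore_index` requires pandas 1.1+.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 2, 1],
                   'B': [2, 6, 0, 1, 2],
                   'C': [3, 7, 1, 1, 3],
                   'D': [['a', 'b'], ['b', 'c'], ['a'], ['b'], []]})

# Each list element becomes its own row; an empty list turns into NaN
result = df.explode('D', ignore_index=True)
print(result)
```

If the single-element-list form of the desired output matters, the exploded scalars can be re-wrapped afterwards with a `map`.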

Assuming your column D content is of type string:
print(type(df.loc[0, 'D']))
<class 'str'>
df = df.set_index(['A', 'B', 'C']).sort_index()
df.loc[:, 'D'] = df.loc[:, 'D'].str.strip('[').str.strip(']')
df = df.loc[:, 'D'].str.split(',', expand=True).stack()
df = (df.str.strip()
        .apply(lambda x: '[{}]'.format(x))
        .reset_index()
        .drop('level_3', axis=1)
        .rename(columns={0: 'D'}))
A B C D
0 1 0 1 ['a']
1 1 2 3 ['a']
2 1 2 3 ['b']
3 1 2 3 []
4 2 1 1 ['b']
5 4 6 7 ['b']
6 4 6 7 ['c']
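If the column really does hold string representations of lists, a more robust route than stripping brackets and quotes by hand is to parse each cell with ast.literal_eval first. A sketch with hypothetical data, not part of the original answer:

```python
import ast
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1], 'B': [2, 6, 0], 'C': [3, 7, 1],
                   'D': ["['a', 'b']", "['b', 'c']", "[]"]})

# literal_eval safely turns the string "['a', 'b']" into a real Python list,
# after which any of the list-splitting approaches above apply
df['D'] = df['D'].map(ast.literal_eval)
```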


Assign unique id inside of groups (with duplicated records) [duplicate]

I have a DataFrame that looks like this:
df = pd.DataFrame({'type': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
                   'value': [1, 1, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8],
                   })
I would like to create a unique id based on the type and value columns, the output will look like this:
df = pd.DataFrame({'type': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
                   'value': [1, 1, 2, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8],
                   'id': [1, 1, 2, 1, 2, 3, 3, 3, 1, 1, 2, 1, 2],
                   })
Use DataFrameGroupBy.rank:
df['id'] = df.groupby('type')['value'].rank(method='dense').astype(int)
print (df)
type value id
0 A 1 1
1 A 1 1
2 A 2 2
3 B 3 1
4 B 4 2
5 B 5 3
6 B 5 3
7 B 5 3
8 C 6 1
9 C 6 1
10 C 7 2
11 D 7 1
12 D 8 2
Or GroupBy.transform with factorize (this numbers values by first appearance, so it matches the dense rank here because values are already sorted within each group):
f = lambda x: pd.factorize(x)[0]
df['id'] = df.groupby('type')['value'].transform(f).add(1)
Use:
t = df.groupby(['type']).transform(lambda x: x.iloc[0])
df['id'] = (df.groupby(['type', 'value'])[['type', 'value']]
              .apply(lambda x: x.name[1])
              .reset_index()
              .merge(df, on=['type', 'value'])[0]
            - t['value'] + 1)
Note that this subtracts each group's first value, so it assumes the values within each type are consecutive.
Output:
type value id
0 A 1 1
1 A 1 1
2 A 2 2
3 B 3 1
4 B 4 2
5 B 5 3
6 B 5 3
7 B 5 3
8 C 6 1
9 C 6 1
10 C 7 2
11 D 7 1
12 D 8 2

Python Pandas: How can I make labels for dropped data?

I used drop_duplicates() on the original data (subset = A and B), and made labels for the refined data.
Now I have to make labels for the original data, but my approach takes too much time and is not efficient.
For example,
My original dataframe is as follows:
A B
1 1
1 1
2 2
2 3
5 3
6 4
5 4
5 4
after drop_duplicates():
A B
1 1
2 2
2 3
5 3
6 4
5 4
after labeling:
A B label
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
Following is my expected output:
A B label
1 1 1
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
5 4 1
My current code for achieving the above result is as follows:
for i in range(len(origin_data)):
    check = False
    j = 0
    while not check:
        if (origin_data['A'].iloc[i] == dropped_data['A'].iloc[j]
                and origin_data['B'].iloc[i] == dropped_data['B'].iloc[j]):
            origin_data['label'].iloc[i] = dropped_data['label'].iloc[j]
            check = True
        j += 1
As my code takes too much time, is there any way I can perform this more efficiently?
You can merge the labeled dataset with the original one:
original.merge(labeled, how="left", on=["A", "B"])
result:
A B label
0 1 1 1
1 1 1 1
2 1 2 0
3 1 3 0
4 1 4 1
5 1 4 1
Full code:
import pandas as pd
original = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1},
     'B': {0: 1, 1: 1, 2: 2, 3: 3, 4: 4, 5: 4}}
)
labeled = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 1},
     'B': {0: 1, 1: 2, 2: 3, 3: 4},
     'label': {0: 1, 1: 0, 2: 0, 3: 1}}
)
print(original.merge(labeled, how="left", on=["A", "B"]))
If the problem is just mapping the 'B' labels to the original dataframe (this requires 'B' to be unique in dropped_data), you can use map:
origin_data.B.map(dropped_data.set_index('B').label)
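For example, with hypothetical data where each distinct 'B' has one label (map raises on a non-unique index, so the labeled frame must be deduplicated on 'B' first):

```python
import pandas as pd

origin_data = pd.DataFrame({'A': [1, 1, 2, 2],
                            'B': [1, 1, 2, 3]})
# hypothetical labeled data: exactly one label per distinct B
dropped_data = pd.DataFrame({'B': [1, 2, 3],
                             'label': [1, 0, 1]})

# Build a B -> label lookup Series and map each row's B through it
origin_data['label'] = origin_data.B.map(dropped_data.set_index('B').label)
```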

Pandas - aggregate over inconsistent values types (string vs list)

Given the following DataFrame, I try to aggregate over columns 'A' and 'C': for 'A', count unique appearances of the strings, and for 'C', sum the values.
The problem arises when some of the samples in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
ID A C
0 1 a 1
1 1 a 2
2 1 a 15
3 1 b 5
4 1 [b, c, d] 13
5 2 a 6
6 2 a 7
7 2 [a, b, c] 1
aggs = {'A': lambda x: x.nunique(dropna=True),
        'C': 'sum'}
# This will result an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
A C
ID
1 4 36
2 3 14
This is because for 'ID' = 1 we had 'a', 'b', 'c' and 'd', and for 'ID' = 2 we had 'a', 'b' and 'c'.
One solution is to split your problem into 2 parts. First flatten your dataframe to ensure df['A'] consists only of strings. Then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
from itertools import chain
import numpy as np

A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)
# A C ID
# 0 a 1 1
# 1 a 2 1
# 2 a 15 1
# 3 b 5 1
# 4 b 13 1
# 4 c 13 1
# 4 d 13 1
# 5 a 6 2
# 6 a 7 2
# 7 a 1 2
# 7 b 1 2
# 7 c 1 2
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
# A C
# ID
# 1 4 36
# 2 3 14
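On pandas 0.25+ the flattening step can also be done with explode. A sketch, not part of the original answer: wrap the scalar cells into one-element lists so every cell is a list, then explode.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})

# Make every cell of 'A' a list, then explode lists into rows
flat = df.assign(A=df['A'].map(lambda x: x if isinstance(x, list) else [x])).explode('A')

# nunique on the flattened frame, sum on the original (to avoid double-counting C)
agg_df = pd.concat([flat.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
```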

Different results from pandas groupby and pivot_table when dtype is categorical

I ran into this earlier today when creating pivot tables after categorizing a column of values using pd.cut. When creating the pivot tables I found that the resulting index was incorrect. This was not an issue when using groupby instead, or after converting the category column to a different dtype.
Simplified example:
df = pd.DataFrame({'l1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
                   'g1': [1, 1, 2, 2, 1, 1, 1, 2, 2, 2],
                   'vals': [3, 1, 3, 1, 3, 2, 2, 3, 2, 2]})
df['l2'] = pd.cut(df.vals, bins=[0, 2, 4], labels=['l', 'h'])
df = df[['l1', 'l2', 'g1', 'vals']]
Using groupby:
df.groupby(['l1', 'l2', 'g1']).vals.agg(('sum', 'count')).unstack()[['count', 'sum']]
count sum
g1 1 2 1 2
l1 l2
a l 1 1 1 1
h 1 1 3 3
b l 2 2 4 4
h 1 1 3 3
Using pd.pivot_table:
pd.pivot_table(df, index=['l1', 'l2'], columns='g1', aggfunc=('sum', 'count'))
vals
count sum
g1 1 2 1 2
l1 l2
a h 1 1 1 1
l 1 1 3 3
b h 2 2 4 4
l 1 1 3 3
Using pd.pivot_table after converting the l2 column to str dtype:
df2 = df.copy()
df2['l2'] = df2.l2.astype(str)
pd.pivot_table(df2, index=['l1', 'l2'], columns='g1', aggfunc=('sum', 'count'))
vals
count sum
g1 1 2 1 2
l1 l2
a h 1 1 3 3
l 1 1 1 1
b h 1 1 3 3
l 2 2 4 4
In the last example the h/l order is reversed relative to the groupby output, but the values are correct; in the middle example the order is also changed and the values are actually wrong.
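As a workaround (a sketch, not from the original post), grouping directly keeps each aggregate attached to the right key regardless of the categorical dtype; `observed=True` drops the unobserved category combinations:

```python
import pandas as pd

df = pd.DataFrame({'l1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
                   'g1': [1, 1, 2, 2, 1, 1, 1, 2, 2, 2],
                   'vals': [3, 1, 3, 1, 3, 2, 2, 3, 2, 2]})
df['l2'] = pd.cut(df.vals, bins=[0, 2, 4], labels=['l', 'h'])

# groupby keeps each sum aligned with its (l1, l2, g1) key
sums = df.groupby(['l1', 'l2', 'g1'], observed=True)['vals'].sum()
```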

What does "col_level" do in the melt function?

From the documentation:
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
What does col_level do?
Examples with different values of col_level would be great.
My current dataframe is created by the following:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
df.columns = [list('ABC'), list('DEF'), list('GHI')]
Thanks.
You can check the description of col_level in the melt documentation:
col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.
And examples:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
# use MultiIndex.from_arrays to set the level names
df.columns = pd.MultiIndex.from_arrays([list('ABC'), list('DEF'), list('GHI')],
                                       names=list('abc'))
print (df)
a  A  B  C
b  D  E  F
c  G  H  I
0  a  1  2
1  b  3  4
2  c  5  6
#melt by first level of MultiIndex
print (df.melt(col_level=0))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level a of MultiIndex
print (df.melt(col_level='a'))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level c of MultiIndex
print (df.melt(col_level='c'))
c value
0 G a
1 G b
2 G c
3 H 1
4 H 3
5 H 5
6 I 2
7 I 4
8 I 6
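col_level also accepts an integer position. A small sketch: level 1 carries the names D, E, F in this setup, and the variable column takes that level's name, 'b'.

```python
import pandas as pd

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df.columns = pd.MultiIndex.from_arrays([list('ABC'), list('DEF'), list('GHI')],
                                       names=list('abc'))

# melt by the second column level (position 1, level name 'b')
m = df.melt(col_level=1)
print(m)
```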
