I have the following DataFrame, where one of the columns is an object (a list-type cell):
I don't want to use explode (I'm on an older version of pandas). How can I do the same for a DataFrame with three columns?
df
A B C
0 1 [1, 2] 3
1 1 [1, 2] 4
2 2 [3, 4] 5
My expected output is:
A B C
0 1 1 3
1 1 2 3
2 1 1 4
3 1 2 4
4 2 3 5
5 2 4 5
I found these two methods useful.
How do I add the third column to this code?
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
or
df = pd.DataFrame({'A': df.A.repeat(df.B.str.len()),
                   'B': np.concatenate(df.B.values)})
You set the index to be all of the columns you want to keep tied to the list you explode:
(df.set_index(['A', 'C'])['B']
   .apply(pd.Series).stack()
   .reset_index()
   .drop(columns='level_2')
   .rename(columns={0: 'B'}))
A C B
0 1 3 1
1 1 3 2
2 1 4 1
3 1 4 2
4 2 5 3
5 2 5 4
Or, for the second method, repeat 'C' as well:
pd.DataFrame({'A': df.A.repeat(df.B.str.len()),
              'C': df.C.repeat(df.B.str.len()),
              'B': np.concatenate(df.B.to_numpy())})
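For reference, if you are ever able to upgrade to pandas 0.25 or later, DataFrame.explode does this in one step (a minimal sketch of the newer API):
# Requires pandas >= 0.25, where explode was introduced;
# A and C are repeated automatically for each element of B
df.explode('B').reset_index(drop=True)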
You can use itertools to reshape your data:
from itertools import product, chain

pd.DataFrame(chain.from_iterable(product([a], b, [c])
                                 for a, b, c in df.to_numpy()),
             columns=df.columns)
A B C
0 1 1 3
1 1 2 3
2 1 1 4
3 1 2 4
4 2 3 5
5 2 4 5
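To see why this works, note that product([a], b, [c]) pairs the scalars a and c with each element of the list b:
from itertools import product
list(product([1], [1, 2], [3]))
# [(1, 1, 3), (1, 2, 3)]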
I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to make an incrementing ID for each unique item, grouped by columns a and b. So the result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately this works only on the grouped dataset and not on the original one. As you can see, the original table has 7 rows while the groupby returns only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!
groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
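Note that ngroup numbers groups in sorted key order by default. If you want IDs assigned in order of first appearance instead, pass sort=False (with this data the two orders happen to coincide):
df['c'] = df.groupby(['a', 'b'], sort=False).ngroup() + 1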
Use pd.factorize after creating a tuple from the (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
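Unlike ngroup with its default sorting, pd.factorize always labels values in order of first appearance, which is why adding 1 gives IDs starting at 1. A tiny illustration:
codes, uniques = pd.factorize(['b', 'a', 'b', 'c'])
# codes -> array([0, 1, 0, 2]); uniques -> ['b', 'a', 'c']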
I have a dataframe with column names 1, 2, 3, 4, ..., 10, and a subset of those columns:
sub_cols = ['1','2','3']
I want to loop through these sub_cols
for col in sub_cols:
    print('column: ' + str(col))
    data[col]
    len(data[col])
I get an output
column: 1
column: 2
column: 3
but the column data and lengths are not printed. I don't see any error either. Where am I going wrong?
Your code, corrected. In a script, a bare expression like data[col] is evaluated and then discarded; only an interactive shell echoes it, so you need print:
for col in sub_cols:
    print('column: ' + str(col))
    print(data[col])
    print(len(data[col]))
Looks like your fundamental issue is that your list of column names contains strings while the columns themselves are labeled with integers.
The following code shows how to select the columns whose name is an integer,
with a simple list comprehension using isinstance(c, int):
cols = [i for i in range(5)] + list("abcd")
df = pd.DataFrame(np.random.randint(1, 5, 5 * len(cols)).reshape(5, len(cols)), columns=cols)

df
   0  1  2  3  4  a  b  c  d
0  3  4  1  2  4  4  4  1  4
1  1  4  3  2  1  2  4  4  1
2  2  3  4  1  1  4  4  4  2
3  3  1  3  1  2  4  2  4  3
4  3  4  1  4  1  4  1  1  1

df.loc[:, [c for c in df.columns if isinstance(c, int)]]
   0  1  2  3  4
0  3  4  1  2  4
1  1  4  3  2  1
2  2  3  4  1  1
3  3  1  3  1  2
4  3  4  1  4  1
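Applied to the question: if the DataFrame's columns are integer labels but sub_cols holds strings, casting the names before indexing is the direct fix (a sketch, assuming the columns really are integers):
sub_cols_int = [int(c) for c in sub_cols]  # '1' -> 1, etc.
for col in sub_cols_int:
    print('column:', col)
    print(data[col])
    print(len(data[col]))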
I have a dataframe that looks as follows:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe on columns b and c. I know that if we only use one column we can do
df.explode('column_name')
However, I can't find a way to use it with two columns. So here is the desired output.
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
Let us try:
df = (pd.concat([df[x].explode() for x in ['b', 'c']], axis=1)
        .join(df[['a']])
        .reindex(columns=df.columns))
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
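For completeness, pandas 1.3 and later accept a list of columns in explode, so (starting from the original df) this becomes a one-liner if you can upgrade:
# Requires pandas >= 1.3, which allows exploding several columns at once
df.explode(['b', 'c']).reset_index(drop=True)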
You can use itertools chain, along with zip, to get your result:
from itertools import chain

pd.DataFrame(chain.from_iterable(zip([a] * len(b), b, c)  # repeat the scalar a once per element of b
                                 for a, b, c in df.to_numpy()))
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
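To keep the original column names, pass columns=df.columns to the DataFrame constructor, as in the product-based answer earlier.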
The list comprehension from #Ben is the fastest. However, if you aren't too concerned about speed, you may use apply with pd.Series.explode:
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply apply it to the whole frame; on non-list columns it will return the original values:
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
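Note that both apply variants rely on b and c holding lists of equal length within each row; otherwise the exploded columns generally cannot be aligned.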
I have a data frame, a:
a=pd.DataFrame({'ID': [1,1,2,2,3,4], 'B': [1,5,3,2,4,1], 'C': [1,4,3,6,1,1]})
ID B C
0 1 1 1
1 1 5 4
2 2 3 3
3 2 2 6
4 3 4 1
5 4 1 1
And I want to aggregate it so that the resulting data frame is grouped by ID and keeps the row corresponding to the minimum of B (so apply min() to B and carry C along as is).
So the resulting data frame should be:
ID B C
0 1 1 1
1 2 2 6
2 3 4 1
3 4 1 1
How can I do this programmatically using pandas.groupby(), or is there another way to do it?
You can use groupby and transform to filter rows:
a.loc[a['B'] == a.groupby('ID').B.transform('min')]
B C ID
0 1 1 1
3 2 6 2
4 4 1 3
5 1 1 4
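Note that this keeps every row tying the group minimum, whereas the sort-based approaches below keep exactly one row per ID.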
Try sorting before your groupby, then taking first:
a.sort_values('B').groupby('ID',as_index=False).first()
ID B C
0 1 1 1
1 2 2 6
2 3 4 1
3 4 1 1
Or, probably a faster way to do it is to sort by ID and B and then drop duplicate IDs, keeping the first (which is the default behavior of drop_duplicates):
a.sort_values(['ID','B']).drop_duplicates('ID')
ID B C
0 1 1 1
1 2 2 6
2 3 4 1
3 4 1 1
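Another common idiom, since only B drives the selection, is to index with the labels returned by idxmin (a sketch; like drop_duplicates it keeps the first occurrence on ties):
# rows at the index labels where B is minimal within each ID
a.loc[a.groupby('ID')['B'].idxmin()].reset_index(drop=True)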
When there is sorting involved, and the grouping doesn't involve any calculations, I prefer to work on the underlying numpy arrays for performance.
Using argsort and numpy.unique:
arr = a.values
out = arr[np.argsort(arr[:, 1])]                   # sort rows by B
_, idx = np.unique(out[:, 0], return_index=True)   # first occurrence of each ID
out[idx]
array([[1, 1, 1],
[2, 2, 6],
[3, 4, 1],
[4, 1, 1]], dtype=int64)
To reassign the values to your DataFrame:
pd.DataFrame(out[idx], columns=a.columns)
ID B C
0 1 1 1
1 2 2 6
2 3 4 1
3 4 1 1
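One caveat: a.values collapses everything to a single dtype, so this works cleanly only for all-numeric frames, and np.unique returns the IDs in sorted order.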
I have a DataFrame with two columns and I would like to stack the second column below the first and the first below the second.
df = pd.DataFrame({'A':[1,2,3], 'B': [4,5,6]})
A B
0 1 4
1 2 5
2 3 6
Desired output:
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
So far I have tried:
pd.concat([df, df[['B','A']].rename(columns={'A':'B', 'B':'A'})], ignore_index=True)
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
Is this the cleanest way?
Concat is better if you ask me. But if you have 100 columns, renaming is a pain. As a generalized approach, here's one with numpy fliplr and vstack:
import numpy as np

v = df.values
pd.DataFrame(np.vstack((v, np.fliplr(v))), columns=df.columns)
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
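One caveat on generalizing: fliplr reverses the column order, which coincides with a swap only for two columns. For many columns, a cyclic shift may be what you want instead (a sketch using np.roll):
# shift columns right by one instead of fully reversing them
pd.DataFrame(np.vstack((v, np.roll(v, 1, axis=1))), columns=df.columns)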