I have a Pandas DataFrame with orders, where the different types of packaging for each order are stored as comma-separated values inside a cell:

order  container  box
a      c1,c2,c3   b1,b2,b3
b      c4,c5,c6   b4,b5,b6
I need to get a table with two columns, "order" and "content", containing all values from both container and box.
I could only merge container and box, but I do not know how to list them row by row.
The needed table is:
order  content
a      c1
a      c2
a      c3
a      b1
a      b2
a      b3
b      c4
b      c5
b      c6
b      b4
b      b5
b      b6
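For reference, the input frame can be reconstructed like this (my construction; the question did not include code):

import pandas as pd

df = pd.DataFrame({
    "order": ["a", "b"],
    "container": ["c1,c2,c3", "c4,c5,c6"],
    "box": ["b1,b2,b3", "b4,b5,b6"],
})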
You can stack, split+explode and convert to DataFrame:
out = (df.set_index('order').stack()    # set other columns aside and stack
         .str.split(',').explode()      # expand values to multiple rows
         # cleanup
         .reset_index('order', name='content').reset_index(drop=True)
      )
print(out)
Output:
order content
0 a c1
1 a c2
2 a c3
3 a b1
4 a b2
5 a b3
6 b c4
7 b c5
8 b c6
9 b b4
10 b b5
11 b b6
Alternative with melt:
(df.melt('order', value_name='content')
   .assign(content=lambda d: d['content'].str.split(','))
   .explode('content').drop(columns='variable')
)
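One subtlety with the melt variant: it stacks all container values first and then all box values, so the rows are not grouped by order as in the desired table. If that grouping matters, a stable sort can be appended (my addition, not part of the original answer):

out2 = (df.melt('order', value_name='content')
          .assign(content=lambda d: d['content'].str.split(','))
          .explode('content')
          .drop(columns='variable')
          .sort_values('order', kind='stable')   # group rows by order, keeping container before box
          .reset_index(drop=True))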
You could also use pd.DataFrame.melt, pd.DataFrame.set_index, pd.DataFrame.pipe, pd.DataFrame.sort_index:
(df
 .melt(id_vars='order', value_vars=['container', 'box'])
 .set_index('order')
 .pipe(lambda x: x['value'].str.split(',').explode())
 .sort_index()
)
order
a c1
a c2
a c3
a b1
a b2
a b3
b c4
b c5
b c6
b b4
b b5
b b6
Or, a more concise approach:
(df
 .melt(id_vars='order', value_vars=['container', 'box'])
 .set_index('order')['value'].str.split(',').explode()
 .sort_index())
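Both snippets above return a Series named value with order as the index; to get the exact order/content columns from the question, a rename and reset_index can be chained (my addition):

content = (df
    .melt(id_vars='order', value_vars=['container', 'box'])
    .set_index('order')['value'].str.split(',').explode()
    .sort_index()
    .rename('content')
    .reset_index())   # -> columns: order, content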
I would do it the following way:
import pandas as pd
df = pd.DataFrame({"order":["a","b"],"container":["c1,c2,c3","c4,c5,c6"],"box":["b1,b2,b3","b4,b5,b6"]})
df2 = pd.concat([df.order,df.container.str.split(",",expand=True),df.box.str.split(",",expand=True)],axis=1)
df2 = df2.melt("order")[["order","value"]]
print(df2)
Output:
order value
0 a c1
1 b c4
2 a c2
3 b c5
4 a c3
5 b c6
6 a b1
7 b b4
8 a b2
9 b b5
10 a b3
11 b b6
Explanation: use .str.split with expand=True to split the comma-separated values into separate columns, then use pd.concat to build a DataFrame from them, then .melt on "order" to get the desired result, and finally select the required columns.
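Note that the rows in this output alternate between orders a and b because melt stacks column by column; if the result should be grouped by order like the requested table, a sort can be appended (my addition):

df2 = df2.sort_values("order", kind="stable").reset_index(drop=True)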
Another option is to join all the packaging columns into one comma-separated string per order, then split and explode:

df.set_index("order").agg(','.join, axis=1).map(lambda x: x.split(",")).explode().rename("content")
Output:
order
a c1
a c2
a c3
a b1
a b2
a b3
b c4
b c5
b c6
b b4
b b5
b b6
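To end up with the exact order/content columns the question asks for, the index can be turned back into a column (my addition to the answer above):

out = (df.set_index("order")
         .agg(','.join, axis=1)             # "c1,c2,c3,b1,b2,b3" per order
         .map(lambda x: x.split(","))       # back to a list per order
         .explode()
         .rename("content")
         .reset_index())                    # -> columns: order, content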
Consider the following data frames:
import pandas as pd
df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A' + str(i) for i in range(5)]})
A id
0 A0 f
1 A1 g
2 A2 h
3 A3 i
4 A4 j
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B' + str(i) for i in range(2)]})
B id
0 B0 f
1 B1 g
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B' + str(i) for i in range(3, 5)]})
B id
0 B3 i
1 B4 j
I want to merge them to get
A id B
0 A0 f B0
1 A1 g B1
2 A2 h NaN
3 A3 i B3
4 A4 j B4
Inspired by this answer, I tried

from functools import reduce
final = reduce(lambda l, r: pd.merge(l, r, how='outer', on='id'), [df1, df2, df3])
but unfortunately it yields
A id B_x B_y
0 A0 f B0 NaN
1 A1 g B1 NaN
2 A2 h NaN NaN
3 A3 i NaN B3
4 A4 j NaN B4
Additionally, I checked out this question but I can't adapt the solution to my problem. Also, I didn't find any options in the docs for pandas.merge to make this happen.
In my real world problem the list of data frames might be much longer and the size of the data frames might be much larger.
Is there any "pythonic" way to do this directly and without "postprocessing"? It would be perfect to have a solution that raises an exception if column B of df2 and df3 overlap (i.e., if there are multiple candidates for some value in column B of the final data frame).
Consider pd.concat + groupby?
pd.concat([df1, df2, df3], axis=0).groupby('id').first().reset_index()
id A B
0 f A0 B0
1 g A1 B1
2 h A2 NaN
3 i A3 B3
4 j A4 B4
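The question also asked for an exception when df2 and df3 both supply a B value for the same id. One way to add that check before taking first() (a sketch, under the assumption that only column B can collide):

combined = pd.concat([df1, df2, df3], ignore_index=True)
# count the non-null B values each id receives; more than one means two
# frames provided a candidate for the same id
dupes = combined.groupby('id')['B'].count()
if (dupes > 1).any():
    raise ValueError(f"conflicting B values for id(s): {list(dupes[dupes > 1].index)}")
out = combined.groupby('id').first().reset_index()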
I think this is a simple one, but I am not able to figure it out today and need some help.
I have a pandas dataframe:
df = pd.DataFrame({
    'id': [0, 0, 1, 1, 2],
    'q.name': ['A'] * 3 + ['B'] * 2,
    'q.value': ['A1', 'A2', 'A3', 'B1', 'B2'],
    'w.name': ['Q', 'W', 'E', 'R', 'Q'],
    'w.value': ['B1', 'B2', 'C3', 'C1', 'D2']
})
that looks like this
id q.name q.value w.name w.value
0 0 A A1 Q B1
1 0 A A2 W B2
2 1 A A3 E C3
3 1 B B1 R C1
4 2 B B2 Q D2
I am looking to convert it to
id q.name q.value w.name w.value
0 0 A A A1 A2 Q W B1 B2
1 1 A B A3 B1 E R C3 C1
2 2 B B2 Q D2
I tried pd.DataFrame(df.apply(lambda s: s.str.cat(sep=" "))) but that did not give me the result I wanted. I have done this before but I'm struggling to recall or find any post on SO to help me.
Update:
I should have mentioned this before: Is there a way of doing this without specifying which column? The DataFrame changes based on context.
I have also updated the dataframe and shown an id field, as I just realised that this was possible. I think now a groupby on the id field should solve this.
UPDATE:
In [117]: df.groupby('id', as_index=False).agg(' '.join)
Out[117]:
id q.name q.value w.name w.value
0 0 A A A1 A2 Q W B1 B2
1 1 A B A3 B1 E R C3 C1
2 2 B B2 Q D2
Old answer:
In [106]: df.groupby('category', as_index=False).agg(' '.join)
Out[106]:
category name
0 A A1 A2 A3
1 B B1 B2
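One hedged caveat: ' '.join only works when every aggregated column holds strings (as in the sample frame). If some columns are numeric in the real data, casting inside the aggregation avoids a TypeError:

df.groupby('id', as_index=False).agg(lambda s: ' '.join(map(str, s)))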
I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. Also, the data is very big, so I cannot use the available solutions on the internet; they are very slow and memory-inefficient.
Here is what my data looks like:
df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1','v2'], ['v3','v4'], ['v5','v6'], ['v7','v8']],
    'C': [['c1','c2'], ['c3','c4'], ['c5','c6'], ['c7','c8']],
    'D': [['d1','d2'], ['d3','d4'], ['d5','d6'], ['d7','d8']],
    'E': [['e1','e2'], ['e3','e4'], ['e5','e6'], ['e7','e8']],
})
A B C D E
0 x1 [v1, v2] [c1, c2] [d1, d2] [e1, e2]
1 x2 [v3, v4] [c3, c4] [d3, d4] [e3, e4]
2 x3 [v5, v6] [c5, c6] [d5, d6] [e5, e6]
3 x4 [v7, v8] [c7, c8] [d7, d8] [e7, e8]
And this is the shape of my data: (441079, 12)
My desired output is:
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
.....
EDIT: After being marked as duplicate, I would like to stress that in this question I was looking for an efficient method of exploding multiple columns. The approved answer is therefore able to explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (and that was the reason I asked this question after testing those solutions).
pandas >= 0.25
Assuming all list columns have the same number of elements in each row, you can call Series.explode on each column.
df.set_index(['A']).apply(pd.Series.explode).reset_index()
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
It's also faster.
%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', axis=1))
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Use set_index on A, then apply and stack the values of the remaining columns. All of this condensed into a one-liner.
In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', axis=1))
Out[1253]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
A more generic NumPy-based helper that repeats the non-list columns and concatenates the list columns, with a fill_value for rows whose lists are empty:

import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty
        # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
        return pd.DataFrame({
            col: np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .append(df.loc[lens == 0, idx_cols]).fillna(fill_value) \
          .loc[:, df.columns]
Usage:
In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
Building on @cs95's answer, we can use an if clause in the lambda function, instead of setting all the other columns as the index. This has the following advantages:
Preserves column order
Lets you easily specify which columns to modify (x.name in [...]) or leave untouched (x.name not in [...])
df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
As of pandas 1.3.0 (What’s new in 1.3.0 (July 2, 2021)):
DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)
So now this operation is as simple as:
df.explode(['B', 'C', 'D', 'E'])
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
Or, if a unique index is wanted:
df.explode(['B', 'C', 'D', 'E'], ignore_index=True)
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
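With multi-column explode, pandas raises a ValueError when the per-row list lengths differ between the exploded columns. A quick sanity check before exploding (my sketch, assuming all four columns hold lists):

cols = ['B', 'C', 'D', 'E']
lengths = df[cols].apply(lambda col: col.str.len())      # element count per cell
assert lengths.nunique(axis=1).eq(1).all(), "some rows have mismatched list lengths"
df.explode(cols, ignore_index=True)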
Gathering all of the responses on this and other threads, here is how I do it for comma-delimited cells:
from collections.abc import Sequence
import pandas as pd
import numpy as np

def explode_by_delimiter(
    df: pd.DataFrame,
    columns: str | Sequence[str],   # the "str | ..." hint needs Python 3.10+
    delimiter: str = ",",
    reindex: bool = True
) -> pd.DataFrame:
    """Convert dataframe with columns separated by a delimiter into an
    ordinary dataframe. Requires pandas 1.3.0+."""
    if isinstance(columns, str):
        columns = [columns]
    col_dict = {
        col: df[col]
        .str.split(delimiter)
        # Without .fillna(), .explode() will fail on empty values
        .fillna({i: [np.nan] for i in df.index})
        for col in columns
    }
    df = df.assign(**col_dict).explode(columns)
    return df.reset_index(drop=True) if reindex else df
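A hypothetical usage example (the frame below is mine, not from the thread); note that multi-column explode needs matching element counts per row, which holds here:

orders = pd.DataFrame({
    "order": ["a", "b", "c"],
    "container": ["c1,c2,c3", "c4,c5", None],   # the None row exercises the fillna path
    "box": ["b1,b2,b3", "b4,b5", None],
})
# one row per element; row "c" survives as a single all-NaN row instead of breaking explode()
print(explode_by_delimiter(orders, ["container", "box"]))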
Here is my solution using the apply function. Main features/differences:
offers the option to explode selected columns or all columns
offers options for what to fill into the 'missing' positions, via the parameter fill_mode = 'external', 'internal' or 'trim' (a full explanation would be long; see the examples below and try changing the option to compare the results)
Note: the 'trim' option was developed for my own needs and is out of scope for this question.
def lenx(x):
    return len(x) if isinstance(x, (list, tuple, np.ndarray, pd.Series)) else 1

def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
    jcols = [j for j, v in enumerate(row.index) if v in cols]
    if len(jcols) < 1:
        jcols = range(len(row.index))
    Ls = [lenx(x) for x in row.values]
    if not Ls[:-1] == Ls[1:]:
        vals = [v if isinstance(v, list) else [v] for v in row.values]
        if fill_mode == 'external':
            vals = [[e] + [fill_value]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j], list))
                    else e + [fill_value]*(max(Ls)-lenx(e))
                    for j, e in enumerate(vals)]
        elif fill_mode == 'internal':
            vals = [[e] + [e]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j], list))
                    else e + [e[-1]]*(max(Ls)-lenx(e))
                    for j, e in enumerate(vals)]
        else:
            vals = [e[0:min(Ls)] for e in vals]
        row = pd.Series(vals, index=row.index.tolist())
    return row
Examples:
df = pd.DataFrame({
    'a': [[1], 2, 3],
    'b': [[4, 5, 7], [5, 4], 4],
    'c': [[4, 5], 5, [6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)
Output:
a b c
0 [1] [4, 5, 7] [4, 5]
1 2 [5, 4] 5
2 3 4 [6]
fill_mode='external', all columns, fill_value = 'OK'
a b c
0 1 4 4
0 OK 5 5
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
a b c
0 1 4 [4, 5]
0 OK 5 OK
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='internal', cols = ['a', 'b']
a b c
0 1 4 [4, 5]
0 1 5 [4, 5]
0 1 7 [4, 5]
1 2 5 5
1 2 4 5
2 3 4 6
fill_mode='trim', all columns
a b c
0 1 4 4
1 2 5 5
2 3 4 6
I have a Pandas DataFrame, say df, which is 1099 rows by 33 columns. I need the original file to be processed by another software package, but it is not in the proper format. This is why I'm trying to get the right format with pandas.
The problem is very simple: df consists of identifier columns (7 in the real case, only 3 in the following example), followed by the corresponding results, one column per month. To be clear, it looks like
A   B   C   date1result  date2result  date3result
a1  b1  c1  12           15           17
a2  b2  c3  5            8            3
But to be processed, I would need it to have one line per result, adding a column for the date. In the given example, it would be
A B C result date
a1 b1 c1 12 date1
a1 b1 c1 15 date2
a1 b1 c1 17 date3
a2 b2 c3 5 date1
a2 b2 c3 8 date2
a2 b2 c3 3 date3
To be more precise: I manually edited all the column names containing a date (after read_excel they looked like '01/01/2015 0:00:00' or something like that, and I was unable to access them. As a secondary question, does anyone know how to access columns imported from a date field in an .xlsx?). The date column names are now 2015_01, 2015_02, ..., 2015_12, 2016_01, ..., 2016_12, and the first 5 columns are 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep'. So I tried the following code:
core = df.loc[:, ('Account', 'Customer Name', 'Postcode', 'segment', 'Rep')]
df_final = pd.Series([])
for year in [2015, 2016]:
    for month in range(1, 13):
        label = "%i_%02i" % (year, month)
        date = []
        for i in range(core.shape[0]):
            date.append("01/%02i/%i" % (month, year))
        df_date = pd.Series(date)  # I don't know how to create this 1xn df
        df_final = df_final.append(pd.concat([core, df[label], df_date], axis=1))
That works roughly, but it is very unclean: I get a (26376, 30)-shaped df_final, with the first column being the dates, then the results (but of course under the column name '2015_01', with all the '2015_02' through '2016_12' columns filled with NaN), and finally my 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep' columns. Does anyone know how I could do such a "slicing + stacking" in a clean way?
Thank you very much.
Edit: it is roughly the reverse of this question: Stacking and shaping slices of DataFrame (pandas) without looping
I think you need melt:
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
print (df)
A B C date result
0 a1 b1 c1 date1result 12
1 a2 b2 c3 date1result 5
2 a1 b1 c1 date2result 15
3 a2 b2 c3 date2result 8
4 a1 b1 c1 date3result 17
5 a2 b2 c3 date3result 3
If the date columns are already renamed like 2015_01, you can then convert them with to_datetime:
print (df)
A B C 2015_01 2016_10 2016_12
0 a1 b1 c1 12 15 17
1 a2 b2 c3 5 8 3
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
df.date = pd.to_datetime(df.date, format='%Y_%m')
print (df)
A B C date result
0 a1 b1 c1 2015-01-01 12
1 a2 b2 c3 2015-01-01 5
2 a1 b1 c1 2016-10-01 15
3 a2 b2 c3 2016-10-01 8
4 a1 b1 c1 2016-12-01 17
5 a2 b2 c3 2016-12-01 3
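Applied to the real column names from the question (my adaptation, assuming the five identifier columns listed there):

id_cols = ['Account', 'Customer Name', 'Postcode', 'segment', 'Rep']
out = df.melt(id_vars=id_cols, value_name='result', var_name='date')
out['date'] = pd.to_datetime(out['date'], format='%Y_%m')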