Interweave two dataframes - python

Suppose I have two dataframes d1 and d2
d1 = pd.DataFrame(np.ones((3, 3), dtype=int), list('abc'), [0, 1, 2])
d2 = pd.DataFrame(np.zeros((3, 2), dtype=int), list('abc'), [3, 4])
d1
0 1 2
a 1 1 1
b 1 1 1
c 1 1 1
d2
3 4
a 0 0
b 0 0
c 0 0
What is an easy, generalized way to interweave the columns of two dataframes? We can assume that the number of columns in d2 is always one less than the number of columns in d1, and that the indices are the same.
I want this:
pd.concat([d1[0], d2[3], d1[1], d2[4], d1[2]], axis=1)
0 3 1 4 2
a 1 0 1 0 1
b 1 0 1 0 1
c 1 0 1 0 1

Use pd.concat to combine the DataFrames and toolz.interleave to reorder the columns:
from toolz import interleave
pd.concat([d1, d2], axis=1)[list(interleave([d1, d2]))]
The resulting output is as expected:
0 3 1 4 2
a 1 0 1 0 1
b 1 0 1 0 1
c 1 0 1 0 1
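Since iterating a DataFrame yields its column labels, the interleave call just produces the alternating label order used to reindex the concatenated frame. A minimal check, assuming the d1/d2 from the question (toolz.interleave keeps drawing from the longer sequence after the shorter one is exhausted, which is why the unequal column counts are fine):
import numpy as np
import pandas as pd
from toolz import interleave

d1 = pd.DataFrame(np.ones((3, 3), dtype=int), list('abc'), [0, 1, 2])
d2 = pd.DataFrame(np.zeros((3, 2), dtype=int), list('abc'), [3, 4])

# iterating a DataFrame yields its column labels, so this is just
# the target column order
print(list(interleave([d1, d2])))   # [0, 3, 1, 4, 2]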

Here's one NumPy approach -
def numpy_interweave(d1, d2):
    c1 = list(d1.columns)
    c2 = list(d2.columns)
    N = len(c1) + len(c2)
    # interleave the column labels
    cols = [None]*N
    cols[::2] = c1
    cols[1::2] = c2
    # allocate output with a dtype that can hold both inputs
    out_dtype = np.result_type(d1.values.dtype, d2.values.dtype)
    out = np.empty((d1.shape[0], N), dtype=out_dtype)
    out[:, ::2] = d1.values
    out[:, 1::2] = d2.values
    df_out = pd.DataFrame(out, columns=cols, index=d1.index)
    return df_out
Sample run -
In [346]: d1
Out[346]:
x y z
a 6 7 4
b 3 5 6
c 4 6 2
In [347]: d2
Out[347]:
p q
a 4 2
b 7 7
c 7 2
In [348]: numpy_interweave(d1, d2)
Out[348]:
x p y q z
a 6 4 7 2 4
b 3 7 5 7 6
c 4 7 6 2 2

Interweave the columns:
c = np.empty((d1.columns.size + d2.columns.size,), dtype=object)
c[0::2], c[1::2] = d1.columns, d2.columns
Now, do a join and re-order the columns by indexing with c:
d1.join(d2)[c]
0 3 1 4 2
a 1 0 1 0 1
b 1 0 1 0 1
c 1 0 1 0 1
You may prefer pd.concat when dealing with multiple dataframes.
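A minimal sketch of that pd.concat variant, reusing the interleaved label array c built above (assuming the d1/d2 from the question):
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.ones((3, 3), dtype=int), list('abc'), [0, 1, 2])
d2 = pd.DataFrame(np.zeros((3, 2), dtype=int), list('abc'), [3, 4])

# interleave the column labels as above
c = np.empty((d1.columns.size + d2.columns.size,), dtype=object)
c[0::2], c[1::2] = d1.columns, d2.columns

# same reordering, but built on pd.concat instead of join,
# so it extends naturally to more than two frames
print(pd.concat([d1, d2], axis=1)[list(c)])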

Write a function to abstract away the generic merge-and-reorder:
from itertools import zip_longest
def weave(df1, df2):
    col1 = df1.columns
    col2 = df2.columns
    # alternate labels from both frames, dropping the None padding
    # that zip_longest adds for the shorter frame
    weaved = [col for zipped in zip_longest(col1, col2)
                  for col in zipped
                  if col is not None]
    return pd.concat([df1, df2], axis=1)[weaved]
weave(d1, d2)
# Output:
0 3 1 4 2
a 1 0 1 0 1
b 1 0 1 0 1
c 1 0 1 0 1

We can use itertools.zip_longest:
In [75]: from itertools import zip_longest
In [76]: cols = pd.Series(np.concatenate(list(zip_longest(d1.columns, d2.columns)))).dropna()
In [77]: cols
Out[77]:
0 0
1 3
2 1
3 4
4 2
dtype: object
In [78]: df = pd.concat([d1, d2], axis=1)[cols]
In [79]: df
Out[79]:
0 3 1 4 2
a 1 0 1 0 1
b 1 0 1 0 1
c 1 0 1 0 1

My solution was to use pd.DataFrame.insert, making sure to insert from the back first so that earlier insertions do not shift the positions of later ones:
df = d1.copy()
for i in range(d2.shape[1], 0, -1):
    df.insert(i, d2.columns[i - 1], d2.iloc[:, i - 1])
df
0 3 1 4 2
a 1 0 1 0 1
b 1 0 1 0 1
c 1 0 1 0 1

The roundrobin itertools recipe has an interleaving characteristic. This option offers a choice between directly implementing the recipe from the Python docs (see the sketch after the output below) or importing a third-party package such as more_itertools that implements it for you:
from more_itertools import roundrobin
pd.concat([d1, d2], axis=1)[list(roundrobin(d1, d2))]
# Output
0 3 1 4 2
a 1 0 1 0 1
b 1 0 1 0 1
c 1 0 1 0 1
Inspired by @root's answer, the column labels are interleaved and used to select from a concatenated DataFrame.
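If you would rather not add the more_itertools dependency, here is a sketch of the roundrobin recipe adapted from the itertools documentation, used the same way:
from itertools import cycle, islice

def roundrobin(*iterables):
    # roundrobin('ABC', 'D', 'EF') --> A D E B F C
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            # remove the exhausted iterator from the cycle
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

pd.concat([d1, d2], axis=1)[list(roundrobin(d1, d2))]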

Related

Is there a pythonic way to add an enumerating column while exploding a list column in pandas?

Consider the following DataFrame:
>>> df = pd.DataFrame({'A': [1,2,3], 'B':['abc', 'def', 'ghi']}).apply({'A':int, 'B':list})
>>> df
A B
0 1 [a, b, c]
1 2 [d, e, f]
2 3 [g, h, i]
This is one way to get the desired result:
>>> df['B'] = df['B'].apply(enumerate).apply(list)
>>> df = df.explode('B', ignore_index=True)
>>> df[['B1', 'B2']] = pd.DataFrame(df['B'].tolist(), index=df.index)
>>> df.drop(columns='B')
A B1 B2
0 1 0 a
1 1 1 b
2 1 2 c
3 2 0 d
4 2 1 e
5 2 2 f
6 3 0 g
7 3 1 h
8 3 2 i
Is there a neater way?
A groupby on the index is an option:
df.explode('B').assign(
    B1 = lambda df: df.groupby(level=0).cumcount())
A B B1
0 1 a 0
0 1 b 1
0 1 c 2
1 2 d 0
1 2 e 1
1 2 f 2
2 3 g 0
2 3 h 1
2 3 i 2
You can always reset the index if you have no use for it:
df.explode('B').assign(
    B1 = lambda df: df.groupby(level=0).cumcount()).reset_index(drop=True)
A B B1
0 1 a 0
1 1 b 1
2 1 c 2
3 2 d 0
4 2 e 1
5 2 f 2
6 3 g 0
7 3 h 1
8 3 i 2
Since pandas version 1.3.0 you can use multiple columns with explode out of the box:
df.assign(
    B1 = df.B.apply(len).apply(range)).explode(['B', 'B1'], ignore_index=True)
A B B1
0 1 a 0
1 1 b 1
2 1 c 2
3 2 d 0
4 2 e 1
5 2 f 2
6 3 g 0
7 3 h 1
8 3 i 2
I think a faster option would be to run the reshaping outside pandas and then join back to the dataframe (of course, only testing can confirm this):
from itertools import chain
# you can use np.concatenate instead
# np.concatenate(df.B)
flattened = chain.from_iterable(df.B)
index = df.index.repeat([*map(len, df.B)])
flattened = pd.Series(flattened, index, name = 'B1')
(pd.concat([df.A, flattened], axis=1)
   .assign(B2 = lambda df: df.groupby(level=0).cumcount())
)
A B1 B2
0 1 a 0
0 1 b 1
0 1 c 2
1 2 d 0
1 2 e 1
1 2 f 2
2 3 g 0
2 3 h 1
2 3 i 2

Cross addition in pandas

How can I apply cross addition (OR) to my pandas DataFrame, as below?
Input:
A B C D
0 0 1 0 1
Output:
A B C D
0 0 1 0 1
1 1 1 1 1
2 0 1 0 1
3 1 1 1 1
So far I can achieve it using this:
cols = df.columns
n = len(cols)
df1 = pd.concat([df]*n, ignore_index=True).eq(1)
df2 = pd.concat([df.T]*n, axis=1, ignore_index=True).eq(1)
df2.columns = cols
df2 = df2.reset_index(drop=True)
print((df1 | df2).astype(int))
I think there is a much simpler way to handle this case.
You can use NumPy's | operator with broadcasting:
data = df.values
df = pd.DataFrame((data.T | data), columns=df.columns)
Or use np.logical_or:
df = pd.DataFrame(np.logical_or(data,data.T).astype(int), columns=df.columns)
print(df)
A B C D
0 0 1 0 1
1 1 1 1 1
2 0 1 0 1
3 1 1 1 1
NumPy solution:
First extract the first row into a 1d array with iloc, then broadcast against a[:, None], which reshapes it to Mx1:
a = df.iloc[0].values
df = pd.DataFrame(a | a[:, None], columns=df.columns)
print (df)
A B C D
0 0 1 0 1
1 1 1 1 1
2 0 1 0 1
3 1 1 1 1
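For completeness, the same broadcasted OR can be spelled with the ufunc's outer method, which makes the "cross" nature explicit; a sketch under the same single-row assumption:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 1]], columns=list('ABCD'))

a = df.iloc[0].to_numpy()
# outer OR of the row with itself: out[i, j] = a[i] | a[j]
out = pd.DataFrame(np.bitwise_or.outer(a, a), columns=df.columns)
print(out)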

Concat() alternate group by python3.0

My goal here is to concat() alternating groups from two dataframes.
Desired result:
group ordercode quantity
0 A 1
B 1
C 1
D 1
0 A 1
B 3
1 A 1
B 2
C 1
1 A 1
B 1
C 2
My dataframe:
import pandas as pd
df1=pd.DataFrame([[0,"A",1],[0,"B",1],[0,"C",1],[0,"D",1],[1,"A",1],[1,"B",2],[1,"C",1]],columns=["group","ordercode","quantity"])
df2=pd.DataFrame([[0,"A",1],[0,"B",3],[1,"A",1],[1,"B",1],[1,"C",2]],columns=["group","ordercode","quantity"])
print(df1)
print(df2)
I have used dfff = pd.concat([df1, df2]).sort_index(kind="merge")
but I got the result below:
group ordercode quantity
0 0 A 1
0 0 A 1
1 B 1
1 B 3
2 C 1
3 D 1
4 1 A 1
4 1 A 1
5 B 2
5 B 1
6 C 1
6 C 2
You can see that the concatenation here interleaves individual rows rather than whole groups.
It should print like this:
group 0 of df1
group 0 of df2
group 1 of df1
group 1 of df2
and so on.
Note:
I created these DataFrames using the groupby() function:
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).as_matrix()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df)//3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
Question:
Where did I go wrong? It is sorting by the index.
I have used .set_index("group") but it didn't work either.
Use cumcount to create a helper column, then sort with sort_values:
df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1,df2]).sort_values(['group','g']).reset_index(drop=True)
print (dfff)
group ordercode quantity g
0 0 A 1 0
1 0 B 1 0
2 0 C 1 0
3 0 D 1 0
4 0 A 1 0
5 0 B 3 0
6 1 C 2 0
7 1 A 1 1
8 1 B 2 1
9 1 C 1 1
10 1 A 1 1
11 1 B 1 1
And finally remove the helper column:
dfff = dfff.drop('g', axis=1)
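A variant of the same idea that avoids the temporary column: tag each frame with a key in pd.concat and use that key as the tiebreaker in a stable sort. A sketch using the df1/df2 from the question:
import pandas as pd

df1 = pd.DataFrame([[0, "A", 1], [0, "B", 1], [0, "C", 1], [0, "D", 1],
                    [1, "A", 1], [1, "B", 2], [1, "C", 1]],
                   columns=["group", "ordercode", "quantity"])
df2 = pd.DataFrame([[0, "A", 1], [0, "B", 3],
                    [1, "A", 1], [1, "B", 1], [1, "C", 2]],
                   columns=["group", "ordercode", "quantity"])

# keys= tags each row with the frame it came from (0 for df1, 1 for df2);
# a stable sort on (group, src) then interleaves whole groups
dfff = (pd.concat([df1, df2], keys=[0, 1], names=["src", None])
          .reset_index(level="src")
          .sort_values(["group", "src"], kind="mergesort")
          .drop(columns="src")
          .reset_index(drop=True))
print(dfff)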

How to categorize two categories in one dataframe in Pandas

I have one DataFrame with two categorical columns, each with 150 categories. A value in column A may not appear in column B. For example:
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
a['A'] = a['A'].astype('category')
a['B'] = a['B'].astype('category')
The output is
Out[217]:
A B
0 b c
1 b c
2 a c
3 b a
4 a a
And also
cat_columns = a.select_dtypes(['category']).columns
a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
a
The output is
Out[220]:
A B
0 1 1
1 1 1
2 0 1
3 1 0
4 0 0
My problem is that in column A, b is coded as 1, but in column B, c is coded as 1. Instead, I want something like this:
Out[220]:
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
in which 2 corresponds to c in both columns.
Please note that I have 150 different labels.
Using pd.Categorical() you can specify a list of categories:
In [44]: cats = a[['A','B']].stack().sort_values().unique()
In [45]: cats
Out[45]: array(['a', 'b', 'c'], dtype=object)
In [46]: a['A'] = pd.Categorical(a['A'], categories=cats)
In [47]: a['B'] = pd.Categorical(a['B'], categories=cats)
In [48]: a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
In [49]: a
Out[49]:
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
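With 150 categories spread across more columns, the same idea can be applied in one pass; a small sketch (the shared category list comes from stacking all columns, as above):
import pandas as pd

a = pd.DataFrame({'A': list('bbaba'), 'B': list('cccaa')})

# one shared, sorted category list for every column
cats = a.stack().sort_values().unique()

# encode each column against the same categories, then take the codes
codes = a.apply(lambda s: pd.Categorical(s, categories=cats).codes)
print(codes)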
We can use pd.factorize all at once.
pd.DataFrame(
    pd.factorize(a.values.ravel())[0].reshape(a.shape),
    a.index, a.columns
)
A B
0 0 1
1 0 1
2 2 1
3 0 2
4 2 2
Or if you want to factorize by sorted category value, use the sort=True argument:
pd.DataFrame(
    pd.factorize(a.values.ravel(), True)[0].reshape(a.shape),
    a.index, a.columns
)
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
Or equivalently with np.unique
pd.DataFrame(
    np.unique(a.values.ravel(), return_inverse=True)[1].reshape(a.shape),
    a.index, a.columns
)
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
If you are only interested in converting to categorical codes and being able to access the mapping via a dictionary, pd.factorize may be more convenient.
The approach for getting unique values across the columns is via @AlexRiley.
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
fact = dict(zip(*pd.factorize(pd.unique(a[['A', 'B']].values.ravel('K')))[::-1]))
b = a.applymap(fact.get)
Result:
A B
0 0 2
1 0 2
2 1 2
3 0 1
4 1 1

Make a table from 2 columns

I'm fairly new to Python.
I have 2 columns in a dataframe that look something like this:
db = pd.read_excel(path_to_file/file.xlsx)
db = db.loc[:,['col1','col2']]
col1 col2
C 4
C 5
A 1
B 6
B 1
A 2
C 4
I need them to be like this:
1 2 3 4 5 6
A 1 1 0 0 0 0
B 1 0 0 0 0 1
C 0 0 0 2 1 0
so that the values of col1 and col2 act as row and column labels, and the table values count the number of times each combination occurs.
Say your columns are called cat and val:
In [26]: df = pd.DataFrame({'cat': ['C', 'C', 'A', 'B', 'B', 'A', 'C'], 'val': [4, 5, 1, 6, 1, 2, 4]})
In [27]: df
Out[27]:
cat val
0 C 4
1 C 5
2 A 1
3 B 6
4 B 1
5 A 2
6 C 4
Then you can group the table hierarchically and unstack it:
In [28]: df.val.groupby([df.cat, df.val]).sum().unstack().fillna(0).astype(int)
Out[28]:
val 1 2 4 5 6
cat
A 1 2 0 0 0
B 1 0 0 0 6
C 0 0 8 5 0
Edit
As IanS pointed out, 3 is missing here (thanks!). If there's a range of columns you must have, then you can use:
r = df.val.groupby([df.cat, df.val]).sum().unstack().fillna(0).astype(int)
for c in set(range(1, 7)) - set(df.val.unique()):
    r[c] = 0
I think you need to aggregate by size and add the missing columns with reindex:
print (df)
a b
0 C 4
1 C 5
2 A 1
3 B 6
4 B 1
5 A 2
6 C 4
df1 = (df.b.groupby([df.a, df.b])
           .size()
           .unstack()
           .reindex(columns=range(1, df.b.max() + 1))
           .fillna(0)
           .astype(int))
df1.index.name = None
df1.columns.name = None
print (df1)
1 2 3 4 5 6
A 1 1 0 0 0 0
B 1 0 0 0 0 1
C 0 0 0 2 1 0
Instead of size you can use count; size counts NaN values, count does not.
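For what it's worth, this kind of pair count is exactly what pd.crosstab computes; a sketch with the same sample data (the reindex fills in the missing column 3):
import pandas as pd

df = pd.DataFrame({'cat': ['C', 'C', 'A', 'B', 'B', 'A', 'C'],
                   'val': [4, 5, 1, 6, 1, 2, 4]})

# count occurrences of each (cat, val) pair, then add any missing columns
out = pd.crosstab(df.cat, df.val).reindex(columns=range(1, 7), fill_value=0)
out.index.name = None
out.columns.name = None
print(out)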
