How Best to Unpack a Pandas Dataframe of Tuples? - python

Probably really straightforward but I'm having no luck with Google. I have a 2 column dataframe of tuples, and I'm looking to unpack each tuple then pair up the contents from the same position in each column. For example:
Col1 Col2
(a,b,c) (d,e,f)
my desired output is
a d
b e
c f
I have a solution using loops but I would like to know a better way to do it - firstly because I am trying to eradicate loops from my life and secondly because it's potentially not as flexible as I may need it to be.
l1=[('a','b'),('c','d'),('e','f','g'),('h','i')]
l2=[('j','k'),('l','m'),('n','o','p'),('q','r')]
df = pd.DataFrame(list(zip(l1,l2)),columns=['Col1','Col2'])
df
Out[547]:
Col1 Col2
0 (a, b) (j, k)
1 (c, d) (l, m)
2 (e, f, g) (n, o, p)
3 (h, i) (q, r)
for i in range(len(df)):
    for j in range(len(df.iloc[i, 1])):
        print(df.iloc[i, 0][j], df.iloc[i, 1][j])
a j
b k
c l
d m
e n
f o
g p
h q
i r
All pythonic suggestions and guidance hugely appreciated. Many thanks.
Addition: an example including a row with differing length tuples, per Ch3steR's request below - my loop would not work in this instance ('d2' would not be included, where I would want it to be outputted paired with a null).
l1=[('a','b'),('c','d','d2'),('e','f','g'),('h','i')]
l2=[('j','k'),('l','m'),('n','o','p'),('q','r')]
df = pd.DataFrame(list(zip(l1,l2)),columns=['Col1','Col2'])

Convert each Series to a list with tolist, rebuild a DataFrame from it, and stack; then concat the stacked columns back together. This leaves you with a MultiIndex whose first level is the original DataFrame index and whose second level is the position in the tuple.
This works on older versions of pandas (before 1.3.0) and in cases where the tuples in a row have an unequal number of elements (where explode will fail).
import pandas as pd
df1 = pd.concat([pd.DataFrame(df[col].tolist()).stack().rename(col)
                 for col in df.columns], axis=1)
Col1 Col2
0 0 a j
1 b k
1 0 c l
1 d m
2 0 e n
1 f o
2 g p
3 0 h q
1 i r
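For the unequal-length case from the question's addition (the row containing 'd2'), a minimal sketch of the same idea could look like the following. The trailing dropna(how='all') and sort_index() calls are additions of mine, assumed here so that the result does not depend on whether the installed pandas version keeps or drops missing values in stack.

```python
import pandas as pd

l1 = [('a', 'b'), ('c', 'd', 'd2'), ('e', 'f', 'g'), ('h', 'i')]
l2 = [('j', 'k'), ('l', 'm'), ('n', 'o', 'p'), ('q', 'r')]
df = pd.DataFrame(list(zip(l1, l2)), columns=['Col1', 'Col2'])

# Expand each column of tuples into a wide frame, then stack it into one
# Series indexed by (original row, position in tuple).
stacked = [pd.DataFrame(df[col].tolist(), index=df.index).stack().rename(col)
           for col in df.columns]

# Align the Series on their MultiIndex; a position missing on one side
# becomes NaN. Rows where both sides are missing are dropped.
df1 = pd.concat(stacked, axis=1).dropna(how='all').sort_index()
print(df1)
```

Here 'd2' should end up in its own row, paired with a missing value in Col2.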

If the tuple lengths always match and you don't have a newer version of pandas that lets you pass a list of columns to explode, do something like this:
import pandas as pd
pd.concat([df.Col1.explode(), df.Col2.explode()], axis=1).reset_index(drop=True)
Col1 Col2
0 a j
1 b k
2 c l
3 d m
4 e n
5 f o
6 g p
7 h q
8 i r
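For comparison, on pandas 1.3.0 or later the list-of-columns form of explode mentioned above can be sketched as follows. It assumes, as stated, that the tuples in each row have matching lengths; otherwise explode raises an error.

```python
import pandas as pd

l1 = [('a', 'b'), ('c', 'd'), ('e', 'f', 'g'), ('h', 'i')]
l2 = [('j', 'k'), ('l', 'm'), ('n', 'o', 'p'), ('q', 'r')]
df = pd.DataFrame(list(zip(l1, l2)), columns=['Col1', 'Col2'])

# Explode both columns in lockstep; ignore_index=True gives a fresh 0..n-1 index.
out = df.explode(['Col1', 'Col2'], ignore_index=True)
print(out)
```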

Related

Make a new dataframe from multiple dataframes

Suppose I have 3 dataframes that are wrapped in a list. The dataframes are:
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
The list of the dfs is:
df_list = [df_1, df_2, df_3]
Now I want to write a for loop that goes over df_list and, for each df, takes the text column and merges it into a new dataframe under a new column header called topic. Since the text column differs between dataframes, I want to name the headers topic_1, topic_2, etc. The desired outcome is as follows:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
I can easily extract the text columns as:
lst = []
for i in range(len(df_list)):
    lst.append(df_list[i]['text'].tolist())
It is just that I am stuck on the last part, namely bringing the columns into 1 df without using brute force.
You can extract the wanted columns with a list comprehension and concat them:
pd.concat([d['text'].rename(f'topic_{i}')
           for i, d in enumerate(df_list, start=1)],
          axis=1)
output:
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
Generally speaking you want to avoid looping anything on a pandas DataFrame. However, in this solution I do use a loop to rename your columns. This should work assuming you just have these 3 dataframes:
import pandas as pd
df_1 = pd.DataFrame({'text':['a','b','c','d','e'],'num':[2,1,3,4,3]})
df_2 = pd.DataFrame({'text':['f','g','h','i','j'],'num':[1,2,3,4,3]})
df_3 = pd.DataFrame({'text':['k','l','m','n','o'],'num':[6,5,3,1,2]})
df_list = [df_1.text, df_2.text, df_3.text]
df_combined = pd.concat(df_list,axis=1)
df_combined.columns = [f"topic_{i+1}" for i in range(len(df_combined.columns))]
>>> df_combined
topic_1 topic_2 topic_3
0 a f k
1 b g l
2 c h m
3 d i n
4 e j o
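As a variation on the two answers above, concat also accepts a mapping, in which case the keys become the column names when concatenating Series along axis=1. A minimal sketch, assuming the same three dataframes:

```python
import pandas as pd

df_1 = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e'], 'num': [2, 1, 3, 4, 3]})
df_2 = pd.DataFrame({'text': ['f', 'g', 'h', 'i', 'j'], 'num': [1, 2, 3, 4, 3]})
df_3 = pd.DataFrame({'text': ['k', 'l', 'm', 'n', 'o'], 'num': [6, 5, 3, 1, 2]})
df_list = [df_1, df_2, df_3]

# The dict keys are used as the resulting column labels.
topics = pd.concat(
    {f"topic_{i}": d["text"] for i, d in enumerate(df_list, start=1)},
    axis=1,
)
print(topics)
```

This scales to any number of dataframes in df_list without a separate renaming step.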

Pandas implode Dataframe with values separated by char

I was just wondering what the best approach is to implode a DataFrame into a single row, with the values of each column separated by a given char.
For example, imagine this dataframe:
A B C D E
1 z a q p
2 x s w l
3 c d e k
4 v f r m
5 b g t n
And we want to implode by #
A B C D E
1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
Maybe to create a copy from the original dataframe and process column by column with Pandas str.concat?
Thanks in advance!
Use DataFrame.agg with join, then convert Series to one row DataFrame with Series.to_frame and transpose by DataFrame.T:
df = df.astype(str).agg('#'.join).to_frame().T
print (df)
A B C D E
0 1#2#3#4#5 z#x#c#v#b a#s#d#f#g q#w#e#r#t p#l#k#m#n
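To make the separator configurable, the one-liner above can be wrapped in a small function. This is only a sketch: the helper name implode and its sep parameter are hypothetical, not part of pandas.

```python
import pandas as pd


def implode(frame: pd.DataFrame, sep: str = "#") -> pd.DataFrame:
    """Collapse every column into one sep-joined string (hypothetical helper)."""
    # astype(str) is needed so that numeric columns can be joined as text.
    return frame.astype(str).agg(sep.join).to_frame().T


df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": list("zxcvb"),
    "C": list("asdfg"),
    "D": list("qwert"),
    "E": list("plkmn"),
})
imploded = implode(df)
print(imploded)
```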

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe but only a part of the columns.
It doesn't work when trying to replace column names on a slice of the dataframe somehow, why is that?
Let's say we have the following dataframe:
Note, on the bottom is copy-able code to reproduce the data:
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
'ColAfjkj':['a', 'b', 'c', 'd'],
'ColBhuqwa':['e', 'f', 'g', 'h'],
'ColCouiqw':['i', 'j', 'k', 'l']})
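This is because df.iloc[:, 1:] returns a new DataFrame, so assigning to its columns attribute renames that temporary object and leaves df unchanged. In addition, a pandas index is immutable, so it cannot be edited in place. If you check the documentation for class pandas.Index, you'll see that it is defined as: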
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace (rename returns a new DataFrame, so assign the result back):
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
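A small self-contained sketch of both points, using the dataframe from the question: assigning new names to the columns of a slice leaves the original untouched, while rename with a mapping gives the expected result. The variable names sliced, mapping and renamed are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

# The slice is a separate DataFrame object; renaming its columns
# does not propagate back to df.
sliced = df.iloc[:, 1:]
sliced.columns = sliced.columns.str[:4]
print(list(df.columns))      # still the original, long names

# Build an old-name -> new-name mapping for every column but the first.
mapping = dict(zip(df.columns[1:], df.columns[1:].str[:4]))
renamed = df.rename(columns=mapping)
print(list(renamed.columns))
```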
To overwrite column names you can use the .rename() method.
It will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use a list comprehension with a conditional to rename just the columns you want
df.columns = [x if x not in mask else x[:4] for x in df.columns]

Pandas: Get unique items from a groupby into separate rows instead of arrays

When using the unique() method on a Series you get a numpy array as a result, this also happens when doing it on a groupby. Consider this example:
import pandas as pd
L0 = ['G','i','G','h','j','h','G','j']
L1 = ['A','A','B','B','B','B','B','B']
df = pd.DataFrame({"A":L0,"B":L1})
dg = df.groupby('B').A.unique()
Resulting in this:
Out[56]:
B
A [G, i]
B [G, h, j]
Name: A, dtype: object
I want each unique element in its own row though:
A
B
A G
A i
B G
B h
B j
I can achieve this by hand like this (I'm deliberately omitting any iteration over DataFrames and only use the underlying numpy arrays):
de = pd.DataFrame(columns=["A","B"])
for i in range(dg.index.nunique()):
    ds = pd.Series(dg.values[i]).to_frame()
    ds.columns = ["A"]
    ds["B"] = dg.index.values[i]
    de = pd.concat([de, ds])  # DataFrame.append was removed in pandas 2.0
de = de.set_index('B')
But I'm wondering if there is a shorter (and fast) way that doesn't need loops, creating new Series or DataFrames, or messing around with the numpy arrays.
If not, I might propose it as a feature.
You can use apply with Series:
dg = (df.groupby('B').A
        .apply(lambda x: pd.Series(x.unique()))
        .reset_index(level=1, drop=True)
        .to_frame())
print (dg)
A
B
A G
A i
B G
B h
B j
Another possible solution is drop_duplicates:
df = df.drop_duplicates(['A','B']).set_index('B')
print (df)
A
B
A G
A i
B G
B h
B j
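On pandas 0.25 or later, another possible route is Series.explode, which turns each array produced by unique() into separate rows while repeating the group label. A minimal sketch on the question's data:

```python
import pandas as pd

L0 = ['G', 'i', 'G', 'h', 'j', 'h', 'G', 'j']
L1 = ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B']
df = pd.DataFrame({"A": L0, "B": L1})

# unique() yields one array per group; explode() gives each element its own row.
dg = df.groupby('B')['A'].unique().explode().to_frame()
print(dg)
```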

Summing rows from a MultiIndex pandas df based on index label

If I have a pandas.DataFrame which looks like this:
Probability
0 1 2 3
C H C C 0.058619
H H C H 0.073274
C H C H 0.195398
C H H H 0.113982
C H H C 0.009770
C C C C 0.087929
H C H H 0.005234
H H H C 0.003664
H H C C 0.021982
C C H C 0.004187
H C H C 0.000449
C C H H 0.048849
H C C C 0.009421
H H H H 0.042743
C C C H 0.293096
H C C H 0.031403
The index is a tuple of length 4, and its entries correspond to all sequences of length four over the 2-letter alphabet ['H','C']. What is the best way to sum the rows with an H at position i, for each position i in the index?
df.ix['H'].sum() is functional but I can't see how to generalize it to any case. For example, I need to be able to perform the same calculation regardless of how long the sequence is or with more than a 2-letter alphabet. Moreover, df.ix['H'] is difficult because it doesn't seem to accept wildcards, i.e. df.ix['*','H'] for all sequences with H in index position 1. Does anybody have any suggestions? Thanks.
Setup
To create a dummy data frame which corresponds to your provided example, I used the following:
import pandas as pd
import numpy as np
# define sequence and target
sequence = ["H", "C"]
target = "H"
# define shapes
size_col = 4
size_row = 100
# create dummy data and dummy columns
array_indices = np.random.choice(sequence, size=(size_row, size_col))
array_value = np.random.random(size=(size_row, 1))
array = np.concatenate([array_indices, array_value], axis=1)
col_indices = ["Idx {}".format(x) for x in range(size_col)]
col_values = ["Probability"]
columns = col_indices + col_values
# create pandas data frame
df = pd.DataFrame(array, columns=columns)
df[col_values] = df[col_values].astype(float)
The resulting pandas.DataFrame looks like this:
>>> print(df.head())
Idx 0 Idx 1 Idx 2 Idx 3 Probability
C C C H 0.892125
C H C H 0.633699
C C C C 0.228546
H C H C 0.766639
C H C C 0.379930
The only difference to your data frame is the reset index (you get the same when using df.reset_index()).
Solution
Now, to get the sums of the rows with a target value for all indices, you may use the following:
bool_indices = df[col_indices] == target
result = bool_indices.apply(lambda x: df.loc[x, col_values].sum())
First, you create a new data frame with boolean values which correspond to each index column containing the target value for each row.
Second, you use these boolean series as index columns to define a subset of your actual value column and finally apply an arbitrary method like sum() on it.
The result is the following:
>>> print(result)
Idx 0 Idx 1 Idx 2 Idx 3
Probability 23.246007 23.072544 24.775996 24.683079
This solution is flexible in regard to your input sequence, the target and the shape of your data.
In addition, if you want to use slicing with wildcards, you can use pandas.IndexSlice on your original data frame (slicing a MultiIndex this way requires it to be sorted, for example with df.sort_index()):
idx = pd.IndexSlice
# to get all rows which have the "H" at second index
df.loc[idx[:, "H"], :]
# to get all rows which have the "H" at third index
df.loc[idx[:, :, "H"], :]
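To apply this directly to a MultiIndex frame like the one in the question, one option is a cross-section per index level with DataFrame.xs. The following is a sketch only: the probabilities are random placeholders rather than the question's values, and the names alphabet, length and sums are chosen for illustration.

```python
import numpy as np
import pandas as pd

alphabet = ["H", "C"]
length = 4
target = "H"

# All sequences of the given length over the alphabet, as a MultiIndex.
index = pd.MultiIndex.from_product([alphabet] * length)

# Placeholder probabilities (random, normalised to sum to 1).
rng = np.random.default_rng(0)
probs = rng.random(len(index))
df = pd.DataFrame({"Probability": probs / probs.sum()}, index=index)

# For every index position, sum the rows whose label there equals target.
sums = pd.Series({
    level: df.xs(target, level=level)["Probability"].sum()
    for level in range(df.index.nlevels)
})
print(sums)
```

Because the loop runs over df.index.nlevels and the index is built from alphabet, the same code should carry over to longer sequences or larger alphabets.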
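An alternative to the solution pansen suggested is to use groupby on each level of the index:
import pandas as pd
df_l = []
for i in range(df.index.nlevels):
    # one group per label ('H' or 'C') found at index level i
    for label, group in df.groupby(level=i):
        mi = pd.MultiIndex.from_tuples([(i, label)], names=['level', 'label'])
        val = float(group['Probability'].sum())
        df_l.append(pd.DataFrame({'Probability': [val]}, index=mi))
result = pd.concat(df_l)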
