Repeating other column values when using pandas.Series.explode() - python

I have a pandas dataframe of the form
           a  b
0     [a, b]  0
1  [c, d, e]  1
I have written a function to create a list of partial lists:
def partials(l):
    result = []
    for i, elem in enumerate(l):
        result.append(l[:i+1])
    return result
which, when applied to the series df['a'] and exploded, using df['a'].apply(partials).explode(), correctly gives:
0          [a]
0       [a, b]
1          [c]
1       [c, d]
1    [c, d, e]
However, this series is necessarily longer than the original. How can I apply this function in place to column a of my dataframe, so that column b repeats its value wherever the corresponding row from the original dataframe is 'exploded', like this:
           a  b
0        [a]  0
0     [a, b]  0
1        [c]  1
1     [c, d]  1
1  [c, d, e]  1

You can join back:
(df['a'].apply(partials)
        .explode().to_frame()
        .join(df.drop('a', axis=1))
)
Output:
           a  b
0        [a]  0
0     [a, b]  0
1        [c]  1
1     [c, d]  1
1  [c, d, e]  1
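The whole round trip can be sketched end to end (a minimal reproduction of the setup above; explode duplicates the index, which is exactly what join aligns on):

```python
import pandas as pd

def partials(l):
    # All prefixes of l: l[:1], l[:2], ..., l
    return [l[:i + 1] for i in range(len(l))]

# Rebuild the sample frame from the question
df = pd.DataFrame({'a': [['a', 'b'], ['c', 'd', 'e']], 'b': [0, 1]})

out = (df['a'].apply(partials)
             .explode()
             .to_frame()
             .join(df.drop('a', axis=1)))
print(out)
```

Because explode keeps the original index values (0, 0, 1, 1, 1), the join simply repeats each row's b.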

Related

change value in pandas dataframe using iteration

I have a training data set of the following format:
print(data.head(5))
#Output
           0  1
0  a b c d e  1
1  a b c d e  1
2  a b c d e  1
3  a b c d e  1
4  a b c d e  1
It is a text classification task and I am trying to split the text "a b c d e" in to a Python list. I tried iteration:
data  # the dataset
len_data = len(data)
for row_num in range(len_data):
    data.loc[row_num, 0] = data.loc[row_num, 0].split(" ")
However, this doesn't work and returns the error Must have equal len keys and value when setting with an iterable. Could someone help me with this problem? Many thanks!
Use str.split:
df[0] = df[0].str.split()
print(df)
# Output
                 0  1
0  [a, b, c, d, e]  1
1  [a, b, c, d, e]  1
2  [a, b, c, d, e]  1
3  [a, b, c, d, e]  1
4  [a, b, c, d, e]  1
Setup:
data = {0: {0: 'a b c d e', 1: 'a b c d e'}, 1: {0: 1, 1: 1}}
df = pd.DataFrame(data)
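As an aside, the original loop fails because .loc with a list value tries to set element-wise; scalar access with .at writes a single cell and accepts a list. A sketch of that fix (the vectorised str.split above is still the better route):

```python
import pandas as pd

# Same shape as the question's data
data = pd.DataFrame({0: ['a b c d e', 'a b c d e'], 1: [1, 1]})

for row_num in range(len(data)):
    # .at addresses exactly one cell, so the list is stored as-is
    data.at[row_num, 0] = data.at[row_num, 0].split(' ')

print(data)
```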

Pandas: modify columns indicated in another column of type list

I have a dataframe like the following:
A  B  C  my_list
0  0  0  [A, B]
0  0  0  [B, C]
0  0  0  [A, C]
0  0  0  [A]
0  0  0  [A, C]
I need a dataframe like:
A  B  C  my_list
1  1  0  [A, B]
0  1  1  [B, C]
1  0  1  [A, C]
1  0  0  [A]
1  0  1  [A, C]
That is, filling with 1 values the columns indicated on the 'my_list' column.
How can I do that?
My real dataframe is huge, so performance is important here.
We can re-create the indicator columns with explode and str.get_dummies, then write them back with update:
df.update(df['my_list'].explode().str.get_dummies().sum(level=0))
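A runnable sketch of the same idea; note that Series.sum(level=0) was removed in pandas 2.0, so the per-row collapse is spelled groupby(level=0).sum() here:

```python
import pandas as pd

df = pd.DataFrame({
    'A': 0, 'B': 0, 'C': 0,
    'my_list': [['A', 'B'], ['B', 'C'], ['A', 'C'], ['A'], ['A', 'C']],
})

# One row per list element, dummy-encode, then sum back per original row
dummies = df['my_list'].explode().str.get_dummies().groupby(level=0).sum()
df.update(dummies)
print(df)
```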

Putting rows of pandas dataframe into list form

I have a pandas dataframe of the form
  T1 T2
0  A  B
1  C  D
2  B  C
3  D  E
4  F  A
I would like to generate another pandas dataframe in which each of the unique items in T1 and T2 has its own row, with one column holding the name of that unique item and another holding a list of the items it shared a row with in the original dataframe. For example, in this case I would be looking for something of the form:
  Name    List
0    A  [B, F]
1    B  [A, C]
2    C  [D, B]
3    D  [C, E]
4    E     [D]
5    F     [A]
Can someone suggest a proper pandonic (like pythonic but for pandas :)) way to do this? Thanks in advance!
IIUC, concatenate the frame with a copy whose column labels are swapped, then group:
df2 = df.copy()
df2.columns = df.columns[::-1]
new_df = pd.concat([df, df2])
new_df.groupby("T1")["T2"].apply(list).reset_index()
Output:
  T1      T2
0  A  [B, F]
1  B  [C, A]
2  C  [D, B]
3  D  [E, C]
4  E     [D]
5  F     [A]
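Put together with the sample data (the renamed copy makes every pair appear in both directions before grouping):

```python
import pandas as pd

df = pd.DataFrame({'T1': ['A', 'C', 'B', 'D', 'F'],
                   'T2': ['B', 'D', 'C', 'E', 'A']})

# A copy with the column labels swapped, so each pair is seen both ways
df2 = df.copy()
df2.columns = df.columns[::-1]

new_df = (pd.concat([df, df2])
            .groupby('T1')['T2'].apply(list)
            .reset_index())
print(new_df)
```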

Counting each unique array of an array in each row of a column in a data frame

I am practicing pandas and Python and I am not so good at for loops. I have a data frame as below; let's say this is df:
Name  Value
A     [[A, B], [C, D]]
B     [[A, B], [D, E]]
C     [[D, E], [K, L], [M, L]]
D     [[K, L]]
I want to go through each row and find unique arrays and count them.
I have tried np.unique(a, return_index=True), but it returns two different lists, and my problem is that I don't know how to go through each array.
Expected result would be:
Value   Counts
[A, B]  2
[D, E]  2
[K, L]  2
[C, D]  1
[M, L]  1
Thank you very much.
Use DataFrame.explode, available in pandas 0.25+:
df.explode('Value')['Value'].value_counts()
Output:
[K, L]    2
[A, B]    2
[D, E]    2
[C, D]    1
[M, L]    1
Name: Value, dtype: int64
Use Series.explode with Series.value_counts:
df = df['Value'].explode().value_counts().rename_axis('Value').reset_index(name='Counts')
print(df)
    Value  Counts
0  [D, E]       2
1  [A, B]       2
2  [K, L]       2
3  [C, D]       1
4  [M, L]       1
Numpy solution:
a, v = np.unique(np.concatenate(df['Value']),axis=0, return_counts=True)
df = pd.DataFrame({'Value':a.tolist(), 'Counts':v})
print(df)
    Value  Counts
0  [A, B]       2
1  [C, D]       1
2  [D, E]       2
3  [K, L]       2
4  [M, L]       1
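One caveat worth hedging: value_counts hashes each value, and Python lists are unhashable, so if the cells really hold lists of lists the inner lists may need converting to tuples first. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Value': [[['A', 'B'], ['C', 'D']],
              [['A', 'B'], ['D', 'E']],
              [['D', 'E'], ['K', 'L'], ['M', 'L']],
              [['K', 'L']]],
})

# Explode to one inner list per row, make each hashable, then count
counts = df['Value'].explode().apply(tuple).value_counts()
print(counts)
```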

How to combine multiple pandas data frames that are the same size into one cohesive database?

I am trying to build a database from various pandas data frames that are all the same size. I want to pair each point in one data frame with the corresponding points in the others, such that I get something like this:
Data frame 1:
   1  2  3  4
1  A  B  C  D
2  A  B  C  D
3  A  B  C  D
Data frame 2:
   1  2  3  4
1  A  D  G  J
2  B  E  H  K
3  C  F  I  L
Final Data Frame (something like this)
   1      2      3      4
1  [A,A]  [B,D]  [C,G]  [D,J]
2  [A,B]  [B,E]  [C,H]  [D,K]
3  [A,C]  [B,F]  [C,I]  [D,L]
Check with:
pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).agg(lambda x: x.values.tolist())
Out[529]:
        1       2       3       4
1  [A, A]  [B, D]  [C, G]  [D, J]
2  [A, B]  [B, E]  [C, H]  [D, K]
3  [A, C]  [B, F]  [C, I]  [D, L]
If I were you, I would look at a MultiIndex rather than list-type columns:
pd.concat([df1, df2], axis=1, keys=[1, 2])
Out[530]:
   1           2
   1  2  3  4  1  2  3  4
1  A  B  C  D  A  D  G  J
2  A  B  C  D  B  E  H  K
3  A  B  C  D  C  F  I  L
You could add the values from both dataframes, plus a separator in between, and iterate over the columns using str.split to create lists. Finally use pd.concat to turn the result into a dataframe:
df = df1 + ',' + df2
pd.concat([df[col].str.split(',') for col in df.columns], axis=1)
        1       2       3       4
1  [A, A]  [B, D]  [C, G]  [D, J]
2  [A, B]  [B, E]  [C, H]  [D, K]
3  [A, C]  [B, F]  [C, I]  [D, L]
Or using DataFrame.combine:
df1.combine(df2, func=lambda a,b: (a + ',' + b).str.split(','))
        1       2       3       4
1  [A, A]  [B, D]  [C, G]  [D, J]
2  [A, B]  [B, E]  [C, H]  [D, K]
3  [A, C]  [B, F]  [C, I]  [D, L]
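With a concrete setup, the string-join variants can be checked like so (both assume every cell is a string and that ',' never occurs in the data):

```python
import pandas as pd

df1 = pd.DataFrame([list('ABCD')] * 3, index=[1, 2, 3], columns=[1, 2, 3, 4])
df2 = pd.DataFrame([list('ADGJ'), list('BEHK'), list('CFIL')],
                   index=[1, 2, 3], columns=[1, 2, 3, 4])

# Join the aligned cells with a separator, then split back into lists
out = df1.combine(df2, func=lambda a, b: (a + ',' + b).str.split(','))
print(out)
```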
Not very efficient, but you can convert all cells to singleton lists and then simply add the dataframes, like this:
import pandas as pd
import numpy as np
np.random.seed(42)
df1 = pd.DataFrame(
    [list('ABCD') for _ in range(5)]
)
df2 = pd.DataFrame(
    [np.random.choice(list('ABCD'), size=4).tolist() for _ in range(5)]
)
df1 = df1.apply(lambda x: [[v] for v in x])
df2 = df2.apply(lambda x: [[v] for v in x])
df1 + df2
which outputs
        0       1       2       3
0  [A, C]  [B, D]  [C, A]  [D, C]
1  [A, C]  [B, D]  [C, A]  [D, A]
2  [A, C]  [B, B]  [C, C]  [D, C]
3  [A, C]  [B, C]  [C, D]  [D, A]
4  [A, D]  [B, D]  [C, D]  [D, C]
