I have a training data set of the following format:
print(data.head(5))
#Output
0 1
0 a b c d e 1
1 a b c d e 1
2 a b c d e 1
3 a b c d e 1
4 a b c d e 1
It is a text classification task and I am trying to split the text "a b c d e" in to a Python list. I tried iteration:
data #the dataset
len_data = len(data)
for row_num in range(len_data):
    data.loc[row_num, 0] = data.loc[row_num, 0].split(" ")
However, this doesn't work and raises the error "Must have equal len keys and value when setting with an iterable". Could someone help me with this problem? Many thanks!
Use str.split:
df[0] = df[0].str.split()
print(df)
# Output
0 1
0 [a, b, c, d, e] 1
1 [a, b, c, d, e] 1
2 [a, b, c, d, e] 1
3 [a, b, c, d, e] 1
4 [a, b, c, d, e] 1
Setup:
data = {0: {0: 'a b c d e', 1: 'a b c d e'}, 1: {0: 1, 1: 1}}
df = pd.DataFrame(data)
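As a self-contained sketch of both approaches (variable names here are my own): the vectorized str.split above, and a fixed version of the original loop. The loop fails because .loc treats a list value as multiple values to assign; DataFrame.at sets exactly one cell and therefore accepts a list.

```python
import pandas as pd

# Reproduce the setup from the answer above.
data = pd.DataFrame({0: ['a b c d e'] * 5, 1: [1] * 5})

# Vectorized: Series.str.split splits every row at once.
df = data.copy()
df[0] = df[0].str.split()

# The original loop also works if .loc is replaced with .at,
# which assigns to a single cell and so accepts a list value.
looped = data.copy()
for row_num in range(len(looped)):
    looped.at[row_num, 0] = looped.at[row_num, 0].split(" ")

print(df[0].tolist() == looped[0].tolist())  # True
```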
I have a pandas dataframe of the form
a b
0 [a, b] 0
1 [c, d, e] 1
I have written a function to create a list of partial lists:
def partials(l):
    result = []
    for i, elem in enumerate(l):
        result.append(l[:i + 1])
    return result
which, when applied to the series df['a'] and exploded, using df['a'].apply(partials).explode(), correctly gives:
0 [a]
0 [a, b]
1 [c]
1 [c, d]
1 [c, d, e]
However, this series is necessarily longer than the original. How can I apply this function in place to column a of my dataframe, so that column b repeats its value wherever the corresponding line of the original dataframe is exploded, like this:
a b
0 [a] 0
0 [a, b] 0
1 [c] 1
1 [c, d] 1
1 [c, d, e] 1
?
You can join back:
(df['a'].apply(partials)
    .explode().to_frame()
    .join(df.drop('a', axis=1))
)
Output:
a b
0 [a] 0
0 [a, b] 0
1 [c] 1
1 [c, d] 1
1 [c, d, e] 1
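An alternative sketch (my own, not from the answer above) uses DataFrame.explode instead of joining back: exploding the whole frame repeats the other columns automatically.

```python
import pandas as pd

df = pd.DataFrame({'a': [['a', 'b'], ['c', 'd', 'e']], 'b': [0, 1]})

def partials(l):
    # Prefixes of l: [l[:1], l[:2], ...]
    return [l[:i + 1] for i in range(len(l))]

# Replace 'a' with the list of partial lists, then explode the whole
# frame: DataFrame.explode repeats column b for every emitted row.
out = df.assign(a=df['a'].apply(partials)).explode('a')
print(out)
```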
I have a pandas dataframe of the form
T1 T2
0 A B
1 C D
2 B C
3 D E
4 F A
I would like to generate another pandas dataframe in which each unique item from T1 and T2 has its own row, with one column holding that item's name and another holding the list of items it shared a row with in the original dataframe. For example, in this case I would be looking for something of the form:
Name List
0 A [B, F]
1 B [A, C]
2 C [D, B]
3 D [C, E]
4 E [D]
5 F [A]
Can someone suggest a proper pandonic (like pythonic but for pandas :)) way to do this? Thanks in advance!
IIUC, duplicate the frame with its columns swapped (by reversing pandas.DataFrame.columns), concatenate, and group:
df2 = df.copy()
df2.columns = df.columns[::-1]
new_df = pd.concat([df, df2])
new_df.groupby("T1")["T2"].apply(list).reset_index()
Output:
T1 T2
0 A [B, F]
1 B [C, A]
2 C [D, B]
3 D [E, C]
4 E [D]
5 F [A]
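A self-contained sketch of the same idea, with a final rename to the Name/List headers from the question (the rename step is my addition):

```python
import pandas as pd

df = pd.DataFrame({'T1': list('ACBDF'), 'T2': list('BDCEA')})

# Stack the frame with its columns swapped so every pair appears in
# both directions, then collect the partners of each item.
swapped = df.rename(columns={'T1': 'T2', 'T2': 'T1'})
both = pd.concat([df, swapped])
out = (both.groupby('T1')['T2'].apply(list)
           .reset_index()
           .rename(columns={'T1': 'Name', 'T2': 'List'}))
print(out)
```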
Is there a function in Python that does what the R fct_lump function does (i.e. to group all groups that are too small into one 'OTHER' group)?
Example below:
library(dplyr)
library(forcats)
> x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
> x
[1] A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B
[49] B B C C C C C D D D D D D D D D D D D D D D D D D D D D D D D D D D E F G H I
Levels: A B C D E F G H I
> x %>% fct_lump_n(3)
[1] A A A A A A A A A A A A A A A A
[17] A A A A A A A A A A A A A A A A
[33] A A A A A A A A B B B B B B B B
[49] B B Other Other Other Other Other D D D D D D D D D
[65] D D D D D D D D D D D D D D D D
[81] D D Other Other Other Other Other
Levels: A B D Other
pip install siuba
# (run in a Python or Anaconda prompt/shell)
# then use the library:
from siuba.dply.forcats import fct_lump, fct_reorder
# just like R's fct_lump:
df['Your_column'] = fct_lump(df['Your_column'], n=10)
df['Your_column'].value_counts()  # check your levels
# it reduces the levels to 10 and lumps all the others into 'Other'
You may also want to try datar:
>>> from datar.all import factor, rep, LETTERS, c, fct_lump_n, fct_count
>>>
>>> x = factor(rep(LETTERS[:9], times=c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
>>> x >> fct_count()
f n
<category> <int64>
0 A 40
1 B 10
2 C 5
3 D 27
4 E 1
5 F 1
6 G 1
7 H 1
8 I 1
>>> x >> fct_lump_n(3) >> fct_count()
f n
<category> <int64>
0 A 40
1 B 10
2 D 27
3 Other 10
Disclaimer: I am the author of the datar package.
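If you would rather avoid an extra dependency, the core of fct_lump_n can be approximated in plain pandas (this is my own sketch, not how siuba or datar implement it): keep the n most frequent levels and replace everything else with 'Other'.

```python
import pandas as pd

# Same data as the R example: 40 A's, 10 B's, 5 C's, 27 D's, one each of E-I.
x = pd.Series(list('A' * 40 + 'B' * 10 + 'C' * 5 + 'D' * 27 + 'EFGHI'))

# Keep the 3 most frequent levels, lump the rest into 'Other'.
top = x.value_counts().nlargest(3).index
lumped = x.where(x.isin(top), 'Other')

print(lumped.value_counts())
```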
I need to apply a function to all rows of a dataframe.
I have written this function, which returns a list of column names where the value is 1:
def find_column(x):
    a = []
    for column in df.columns:
        if (df.loc[x, column] == 1):
            a = a + [column]
    return a
It works if I just pass in an index, for example:
print(find_column(1))
but:
df['new_col'] = df.apply(find_column,axis=1)
does not work.
Any idea?
Thanks!
With apply and axis=1 each row is passed to the function as x, a Series whose index is the column names, so you can filter the index by the matching values and convert it to a list:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 1, 4, 5, 5, 1],
    'C': [7, 1, 9, 4, 2, 3],
    'D': [1, 1, 5, 7, 1, 1],
    'E': [5, 1, 6, 9, 1, 4],
    'F': list('aaabbb')
})
def find_column(x):
    return x.index[x == 1].tolist()

df['new'] = df.apply(find_column, axis=1)
print(df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Another idea is to use DataFrame.dot with a mask from DataFrame.eq. Note that stripping the trailing separator and using Series.str.split would leave [''] (not []) for rows with no match, so Series.str.findall is used to extract the names instead:
df['new'] = df.eq(1).dot(df.columns + ',').str.findall('[^,]+')
print(df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
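A further sketch of mine (not from the answers above) avoids apply over rows entirely by stacking the boolean mask into long form and grouping the matching column labels per row:

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 1, 4, 5, 5, 1],
    'C': [7, 1, 9, 4, 2, 3],
    'D': [1, 1, 5, 7, 1, 1],
    'E': [5, 1, 6, 9, 1, 4],
    'F': list('aaabbb'),
})

# Boolean mask of cells equal to 1, stacked to long form; keep the
# True cells and collect their column labels per original row.
mask = df.eq(1).stack()
matches = mask[mask].reset_index().groupby('level_0')['level_1'].agg(list)

# Rows with no 1s drop out of the groupby, so reindex and fill with [].
df['new'] = matches.reindex(df.index).apply(
    lambda v: v if isinstance(v, list) else [])
print(df)
```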
I am trying to build a database with various pandas data frames that are the same size. I want to assign each point in the data frame with all of the other points.
Such that I get something like this:
Data frame 1:
1 2 3 4
1 A B C D
2 A B C D
3 A B C D
Data frame 2:
1 2 3 4
1 A D G J
2 B E H K
3 C F I L
Final Data Frame (something like this)
1 2 3 4
1 [A,A] [B,D] [C,G] [D,J]
2 [A,B] [B,E] [C,H] [D,K]
3 [A,C] [B,F] [C,I] [D,L]
Check with
pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).agg(lambda x: x.values.tolist())
Out[529]:
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
If I were you, I would look at a MultiIndex rather than list-type columns:
pd.concat([df1, df2], axis=1, keys=[1, 2])
Out[530]:
1 2
1 2 3 4 1 2 3 4
1 A B C D A D G J
2 A B C D B E H K
3 A B C D C F I L
You could add the values from both dataframes, plus a separator in between, and iterate over the columns using str.split to create lists. Finally use pd.concat to turn the result into a dataframe:
df = df1 + ',' + df2
pd.concat([df[col].str.split(',') for col in df.columns], axis=1)
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
Or using DataFrame.combine:
df1.combine(df2, func=lambda a,b: (a + ',' + b).str.split(','))
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
Not very efficient, but you can convert all cells to single-element lists and then simply add the dataframes, like this:
import pandas as pd
import numpy as np
np.random.seed(42)
df1 = pd.DataFrame(
    [list('ABCD') for _ in range(5)]
)
df2 = pd.DataFrame(
    [np.random.choice(list('ABCD'), size=4).tolist() for _ in range(5)]
)
df1 = df1.apply(lambda x: [[v] for v in x])
df2 = df2.apply(lambda x: [[v] for v in x])
df1 + df2
which outputs
0 1 2 3
0 [A, C] [B, D] [C, A] [D, C]
1 [A, C] [B, D] [C, A] [D, A]
2 [A, C] [B, B] [C, C] [D, C]
3 [A, C] [B, C] [C, D] [D, A]
4 [A, D] [B, D] [C, D] [D, C]
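One more sketch (my own, on the frames from the original question) goes through numpy: stack both value arrays along a new last axis so each cell becomes the pair [df1_cell, df2_cell], then rebuild the frame. This assumes the two frames share the same shape, index, and columns, as stated in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([list('ABCD')] * 3, index=[1, 2, 3], columns=[1, 2, 3, 4])
df2 = pd.DataFrame([list('ADGJ'), list('BEHK'), list('CFIL')],
                   index=[1, 2, 3], columns=[1, 2, 3, 4])

# Stack along a new trailing axis: shape (3, 4) + (3, 4) -> (3, 4, 2),
# so every cell of the result holds the two-element pair as a list.
paired = np.stack([df1.to_numpy(), df2.to_numpy()], axis=-1)
out = pd.DataFrame(paired.tolist(), index=df1.index, columns=df1.columns)
print(out)
```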