I have a training data set of the following format:
print(data.head(5))
#Output
0 1
0 a b c d e 1
1 a b c d e 1
2 a b c d e 1
3 a b c d e 1
4 a b c d e 1
It is a text classification task and I am trying to split the text "a b c d e" in to a Python list. I tried iteration:
data #the dataset
len_data = len(data)
for row_num in range(len_data):
    data.loc[row_num, 0] = data.loc[row_num, 0].split(" ")
However, this doesn't work and raises the error "Must have equal len keys and value when setting with an iterable". Could someone help me with this problem? Many thanks!
Use str.split:
df[0] = df[0].str.split()
print(df)
# Output
0 1
0 [a, b, c, d, e] 1
1 [a, b, c, d, e] 1
2 [a, b, c, d, e] 1
3 [a, b, c, d, e] 1
4 [a, b, c, d, e] 1
Setup:
data = {0: {0: 'a b c d e', 1: 'a b c d e'}, 1: {0: 1, 1: 1}}
df = pd.DataFrame(data)
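As a self-contained sketch of both approaches (variable names here are my own): the vectorized str.split above, and a fixed version of the original loop. The loop fails because .loc treats a list value as multiple values to assign; DataFrame.at sets exactly one cell and therefore accepts a list.

```python
import pandas as pd

# Reproduce the setup from the answer above.
data = pd.DataFrame({0: ['a b c d e'] * 5, 1: [1] * 5})

# Vectorized: Series.str.split splits every row at once.
df = data.copy()
df[0] = df[0].str.split()

# The original loop also works if .loc is replaced with .at,
# which assigns to a single cell and so accepts a list value.
looped = data.copy()
for row_num in range(len(looped)):
    looped.at[row_num, 0] = looped.at[row_num, 0].split(" ")

print(df[0].tolist() == looped[0].tolist())  # True
```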
I have a pandas dataframe of the form
a b
0 [a, b] 0
1 [c, d, e] 1
I have written a function to create a list of partial lists:
def partials(l):
    result = []
    for i, elem in enumerate(l):
        result.append(l[:i + 1])
    return result
which, when applied to the series df['a'] and exploded, using df['a'].apply(partials).explode(), correctly gives:
0 [a]
0 [a, b]
1 [c]
1 [c, d]
1 [c, d, e]
However, this series is necessarily longer than the original. How can I apply this function in place to column a of my dataframe, so that column b repeats its value wherever the corresponding line of the original dataframe is exploded, like this:
a b
0 [a] 0
0 [a, b] 0
1 [c] 1
1 [c, d] 1
1 [c, d, e] 1
?
You can join back:
(df['a'].apply(partials)
    .explode().to_frame()
    .join(df.drop('a', axis=1))
)
Output:
a b
0 [a] 0
0 [a, b] 0
1 [c] 1
1 [c, d] 1
1 [c, d, e] 1
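An alternative sketch (my own, not from the answer above) uses DataFrame.explode instead of joining back: exploding the whole frame repeats the other columns automatically.

```python
import pandas as pd

df = pd.DataFrame({'a': [['a', 'b'], ['c', 'd', 'e']], 'b': [0, 1]})

def partials(l):
    # Prefixes of l: [l[:1], l[:2], ...]
    return [l[:i + 1] for i in range(len(l))]

# Replace 'a' with the list of partial lists, then explode the whole
# frame: DataFrame.explode repeats column b for every emitted row.
out = df.assign(a=df['a'].apply(partials)).explode('a')
print(out)
```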
I have a pandas dataframe of the form
T1 T2
0 A B
1 C D
2 B C
3 D E
4 F A
I would like to generate another pandas dataframe in which each unique item from T1 and T2 has its own row, with one column holding that item's name and another holding the list of items it shared a row with in the original dataframe. For example, in this case I would be looking for something of the form:
Name List
0 A [B, F]
1 B [A, C]
2 C [D, B]
3 D [C, E]
4 E [D]
5 F [A]
Can someone suggest a proper pandonic (like pythonic but for pandas :)) way to do this? Thanks in advance!
IIUC, duplicate the frame with its columns swapped (by reversing pandas.DataFrame.columns), concatenate, and group:
df2 = df.copy()
df2.columns = df.columns[::-1]
new_df = pd.concat([df, df2])
new_df.groupby("T1")["T2"].apply(list).reset_index()
Output:
T1 T2
0 A [B, F]
1 B [C, A]
2 C [D, B]
3 D [E, C]
4 E [D]
5 F [A]
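A self-contained sketch of the same idea, with a final rename to the Name/List headers from the question (the rename step is my addition):

```python
import pandas as pd

df = pd.DataFrame({'T1': list('ACBDF'), 'T2': list('BDCEA')})

# Stack the frame with its columns swapped so every pair appears in
# both directions, then collect the partners of each item.
swapped = df.rename(columns={'T1': 'T2', 'T2': 'T1'})
both = pd.concat([df, swapped])
out = (both.groupby('T1')['T2'].apply(list)
           .reset_index()
           .rename(columns={'T1': 'Name', 'T2': 'List'}))
print(out)
```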
Is there a function in Python that does what the R fct_lump function does (i.e. to group all groups that are too small into one 'OTHER' group)?
Example below:
library(dplyr)
library(forcats)
> x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
> x
[1] A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B
[49] B B C C C C C D D D D D D D D D D D D D D D D D D D D D D D D D D D E F G H I
Levels: A B C D E F G H I
> x %>% fct_lump_n(3)
[1] A A A A A A A A A A A A A A A A
[17] A A A A A A A A A A A A A A A A
[33] A A A A A A A A B B B B B B B B
[49] B B Other Other Other Other Other D D D D D D D D D
[65] D D D D D D D D D D D D D D D D
[81] D D Other Other Other Other Other
Levels: A B D Other
pip install siuba
# (run in a Python or Anaconda prompt/shell)
# then use the library:
from siuba.dply.forcats import fct_lump, fct_reorder
# just like R's fct_lump:
df['Your_column'] = fct_lump(df['Your_column'], n=10)
df['Your_column'].value_counts()  # check your levels
# it reduces the levels to 10 and lumps all the others into 'Other'
You may also want to try datar:
>>> from datar.all import factor, rep, LETTERS, c, fct_lump_n, fct_count
>>>
>>> x = factor(rep(LETTERS[:9], times=c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
>>> x >> fct_count()
f n
<category> <int64>
0 A 40
1 B 10
2 C 5
3 D 27
4 E 1
5 F 1
6 G 1
7 H 1
8 I 1
>>> x >> fct_lump_n(3) >> fct_count()
f n
<category> <int64>
0 A 40
1 B 10
2 D 27
3 Other 10
Disclaimer: I am the author of the datar package.
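If you would rather avoid an extra dependency, the core of fct_lump_n can be approximated in plain pandas (this is my own sketch, not how siuba or datar implement it): keep the n most frequent levels and replace everything else with 'Other'.

```python
import pandas as pd

# Same data as the R example: 40 A's, 10 B's, 5 C's, 27 D's, one each of E-I.
x = pd.Series(list('A' * 40 + 'B' * 10 + 'C' * 5 + 'D' * 27 + 'EFGHI'))

# Keep the 3 most frequent levels, lump the rest into 'Other'.
top = x.value_counts().nlargest(3).index
lumped = x.where(x.isin(top), 'Other')

print(lumped.value_counts())
```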
I need to apply a function to all rows of a dataframe.
I have written this function, which returns a list of column names where the value is 1:
def find_column(x):
    a = []
    for column in df.columns:
        if (df.loc[x, column] == 1):
            a = a + [column]
    return a
It works if I just pass in an index, for example:
print(find_column(1))
but:
df['new_col'] = df.apply(find_column,axis=1)
does not work.
Any idea?
Thanks!
With apply and axis=1 each row is passed to the function as x, a Series whose index is the column names, so you can filter the index by the matching values and convert it to a list:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 1, 4, 5, 5, 1],
    'C': [7, 1, 9, 4, 2, 3],
    'D': [1, 1, 5, 7, 1, 1],
    'E': [5, 1, 6, 9, 1, 4],
    'F': list('aaabbb')
})
def find_column(x):
    return x.index[x == 1].tolist()

df['new'] = df.apply(find_column, axis=1)
print(df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Another idea is to use DataFrame.dot with a mask from DataFrame.eq. Note that stripping the trailing separator and using Series.str.split would leave [''] (not []) for rows with no match, so Series.str.findall is used to extract the names instead:
df['new'] = df.eq(1).dot(df.columns + ',').str.findall('[^,]+')
print(df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
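A further sketch of mine (not from the answers above) avoids apply over rows entirely by stacking the boolean mask into long form and grouping the matching column labels per row:

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 1, 4, 5, 5, 1],
    'C': [7, 1, 9, 4, 2, 3],
    'D': [1, 1, 5, 7, 1, 1],
    'E': [5, 1, 6, 9, 1, 4],
    'F': list('aaabbb'),
})

# Boolean mask of cells equal to 1, stacked to long form; keep the
# True cells and collect their column labels per original row.
mask = df.eq(1).stack()
matches = mask[mask].reset_index().groupby('level_0')['level_1'].agg(list)

# Rows with no 1s drop out of the groupby, so reindex and fill with [].
df['new'] = matches.reindex(df.index).apply(
    lambda v: v if isinstance(v, list) else [])
print(df)
```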
I am trying to build a database with various pandas data frames that are the same size. I want to assign each point in the data frame with all of the other points.
Such that I get something like this:
Data frame 1:
1 2 3 4
1 A B C D
2 A B C D
3 A B C D
Data frame 2:
1 2 3 4
1 A D G J
2 B E H K
3 C F I L
Final Data Frame (something like this)
1 2 3 4
1 [A,A] [B,D] [C,G] [D,J]
2 [A,B] [B,E] [C,H] [D,K]
3 [A,C] [B,F] [C,I] [D,L]
Check with
pd.concat([df1, df2], axis=1).groupby(level=0, axis=1).agg(lambda x: x.values.tolist())
Out[529]:
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
If I were you, I would look at a MultiIndex rather than list-type columns:
pd.concat([df1, df2], axis=1, keys=[1, 2])
Out[530]:
1 2
1 2 3 4 1 2 3 4
1 A B C D A D G J
2 A B C D B E H K
3 A B C D C F I L
You could add the values from both dataframes, plus a separator in between, and iterate over the columns using str.split to create lists. Finally use pd.concat to turn the result into a dataframe:
df = df1 + ',' + df2
pd.concat([df[col].str.split(',') for col in df.columns], axis=1)
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
Or using DataFrame.combine:
df1.combine(df2, func=lambda a,b: (a + ',' + b).str.split(','))
1 2 3 4
1 [A, A] [B, D] [C, G] [D, J]
2 [A, B] [B, E] [C, H] [D, K]
3 [A, C] [B, F] [C, I] [D, L]
Not very efficient, but you can convert all cells to single-element lists and then simply add the dataframes, like this:
import pandas as pd
import numpy as np
np.random.seed(42)
df1 = pd.DataFrame(
    [list('ABCD') for _ in range(5)]
)
df2 = pd.DataFrame(
    [np.random.choice(list('ABCD'), size=4).tolist() for _ in range(5)]
)
df1 = df1.apply(lambda x: [[v] for v in x])
df2 = df2.apply(lambda x: [[v] for v in x])
df1 + df2
which outputs
0 1 2 3
0 [A, C] [B, D] [C, A] [D, C]
1 [A, C] [B, D] [C, A] [D, A]
2 [A, C] [B, B] [C, C] [D, C]
3 [A, C] [B, C] [C, D] [D, A]
4 [A, D] [B, D] [C, D] [D, C]
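One more sketch (my own, on the frames from the original question) goes through numpy: stack both value arrays along a new last axis so each cell becomes the pair [df1_cell, df2_cell], then rebuild the frame. This assumes the two frames share the same shape, index, and columns, as stated in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([list('ABCD')] * 3, index=[1, 2, 3], columns=[1, 2, 3, 4])
df2 = pd.DataFrame([list('ADGJ'), list('BEHK'), list('CFIL')],
                   index=[1, 2, 3], columns=[1, 2, 3, 4])

# Stack along a new trailing axis: shape (3, 4) + (3, 4) -> (3, 4, 2),
# so every cell of the result holds the two-element pair as a list.
paired = np.stack([df1.to_numpy(), df2.to_numpy()], axis=-1)
out = pd.DataFrame(paired.tolist(), index=df1.index, columns=df1.columns)
print(out)
```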