split each cell in dataframe (pandas/python)

I have a large pandas DataFrame with many rows and columns. Each cell holds a pair of binary values such as '0|1', '0|0', '1|1' or '1|0'. I would like to either split it into two DataFrames or expand each row into two rows (both results are useful to me), so that this:
     a   b   c   d
rowa 1|0 0|1 0|1 1|0
rowb 0|1 0|0 0|0 0|1
rowc 0|1 1|0 1|0 0|1
becomes
      a b c d
rowa1 1 0 0 1
rowa2 0 1 1 0
rowb1 0 0 0 0
rowb2 1 0 0 1
rowc1 0 1 1 0
rowc2 1 0 0 1
and/or
df1:
     a b c d
rowa 1 0 0 1
rowb 0 0 0 0
rowc 0 1 1 0
df2:
     a b c d
rowa 0 1 1 0
rowb 1 0 0 1
rowc 1 0 0 1
Currently I'm trying something like the following, but I believe it is not very efficient. Any guidance would be helpful.
from collections import defaultdict

Atmp_dict = defaultdict(list)
Btmp_dict = defaultdict(list)
for index, row in df.iterrows():
    for columnname in list(df.columns.values):
        # left of the pipe goes to A, right of the pipe goes to B
        Atmp_dict[columnname].append(row[columnname].split('|')[0])
        Btmp_dict[columnname].append(row[columnname].split('|')[1])

user2734178 is close, but his or her answer has some issues. Here is a slight variation that works
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame()
# df is your original DataFrame
for col in df.columns:
    df1[col] = df[col].apply(lambda x: x.split('|')[0])
    df2[col] = df[col].apply(lambda x: x.split('|')[1])
Here is another option that is slightly more elegant. Replace the loop with:
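for col in df.columns:
    # raw strings keep the regex backslashes intact; expand=False returns a Series
    df1[col] = df[col].str.extract(r"(\d)\|", expand=False)
    df2[col] = df[col].str.extract(r"\|(\d)", expand=False)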

This is pretty compact, but it seems like there should be an even easier and more compact way.
# DataFrame.applymap is deprecated as of pandas 2.1; DataFrame.map is its replacement there
df1 = df.applymap(lambda x: str(x)[0])
df2 = df.applymap(lambda x: str(x)[2])
Or loop over the columns as in the other answers. I don't think it matters. Note that because the question specified binary data, it is OK (and simpler) to just do str[0] and str[2] rather than using split or extract.
Or you could do this, which seems almost silly, but there's nothing actually wrong with it and it is fairly compact.
df1 = df.stack().str[0].unstack()
df2 = df.stack().str[2].unstack()
stack just converts it to a series so you can use str and then unstack converts it back to a dataframe.
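To make the stack/unstack round trip concrete, here is a small sketch. The two-row frame is made up for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical miniature version of the frame from the question.
df = pd.DataFrame(
    {"a": ["1|0", "0|1"], "b": ["0|1", "0|0"]},
    index=["rowa", "rowb"],
)

# stack() turns the frame into one Series indexed by (row label, column label).
stacked = df.stack()

# The .str accessor works on that Series; unstack() restores the 2-D shape.
first = stacked.str[0].unstack()
second = stacked.str[2].unstack()
```

The intermediate Series carries a two-level index, which is why unstack can put every value back in its original row and column.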

Since it looks like all of your values are strings, you can use the .str accessor to pick out the character on each side of the pipe, like this:
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame()
# df is defined as in your first example
for col in df.columns:
    df1[col] = df[col].str[0]
    df2[col] = df[col].str[-1]
You'll then probably want to recast your df1 and df2 as int columns using astype(int).
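None of the answers above shows the expanded form (rowa1, rowa2, ...) that the question also asks for. One possible sketch, assuming the row labels are unique strings so that sorting the suffixed labels interleaves the two halves:

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame(
    {
        "a": ["1|0", "0|1", "0|1"],
        "b": ["0|1", "0|0", "1|0"],
        "c": ["0|1", "0|0", "1|0"],
        "d": ["1|0", "0|1", "0|1"],
    },
    index=["rowa", "rowb", "rowc"],
)

# Character before and after the pipe, cast to int.
df1 = df.apply(lambda s: s.str[0]).astype(int)
df2 = df.apply(lambda s: s.str[-1]).astype(int)

# Suffix the row labels, stack the two halves, then sort so that
# rowa1, rowa2, rowb1, rowb2, ... alternate.
expanded = pd.concat(
    [
        df1.rename(index=lambda label: f"{label}1"),
        df2.rename(index=lambda label: f"{label}2"),
    ]
).sort_index()
```

If the row labels do not sort the way you want, this label-based interleaving will not give the right order and you would need a different key.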

Related

Creating a new map from existing maps in python

This question might be common but I am new to python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in python? A generic approach would be very helpful as I need to apply the same logic on different files and different columns
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.

First create DataFrames:
import pandas as pd

# Map1 and Map2 are the paths to the two map files
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then index df2 by its first column, take its second column as a lookup Series, and pass that to Series.map on the second column of df1. Finally, replace the missing values left by unmatched keys with 0:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0).astype(int)
print(df1)
   0    1
0  1   25
1  2    0
2  3  300
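For anyone who wants to try this without creating files, here is a self-contained sketch of the same lookup. It feeds the sample data from the question through io.StringIO, which is a stand-in for the real map files.

```python
import io

import pandas as pd

# In-memory stand-ins for the two map files from the question.
map1 = io.StringIO("1,100\n2,453\n3,200\n")
map2 = io.StringIO("100,25,30,\n200,300,,\n250,190,20,1\n")

df1 = pd.read_csv(map1, header=None)
df2 = pd.read_csv(map2, header=None)

# Lookup Series: key is the first column of map2, value is its second column.
lookup = df2.set_index(0)[1]

# Keys of map1 that are missing from map2 become NaN, then 0.
df1[1] = df1[1].map(lookup).fillna(0).astype(int)
```

Key 453 has no entry in map2, so it ends up as 0, as requested in the question.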
EDIT: To map several columns at once, use a left join. Drop the columns that are entirely missing with DataFrame.dropna, drop the join columns b and c, and finally replace the remaining missing values:
df1.columns = ['a', 'b']
df2.columns = ['c', 'd', 'e', 'f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
         .dropna(how='all', axis=1)
         .drop(['b', 'c'], axis=1)
         .fillna(0)
         .convert_dtypes())
print(df)
   a    d   e
0  1   25  30
1  2    0   0
2  3  300   0

Shuffle Columns in Dataframe

I want to shuffle the columns of a DataFrame pseudo-randomly, in one line of code.
Before:
   A  B
0  1  2
1  1  2
After:
   B  A
0  2  1
1  2  1
My attempts so far:
df = df.reindex(columns=columns)
df.sample(frac=1, axis=1)
df.apply(np.random.shuffle, axis=1)

You can use np.random.default_rng()'s permutation with a seed to make it reproducible.
df = df[np.random.default_rng(seed=42).permutation(df.columns.values)]

Use DataFrame.sample with the axis argument set to columns (1):
df = df.sample(frac=1, axis=1)
print(df)
   B  A
0  2  1
1  2  1
Or convert the column labels to a Series, shuffle it with Series.sample, and use the result to select the columns in the new order:
df = df[df.columns.to_series().sample(frac=1)]
print(df)
   B  A
0  2  1
1  2  1

Use numpy.random.permutation with the list of column names.
df = df[np.random.permutation(df.columns)]
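Here is a small runnable sketch that puts two of the approaches above side by side. The three-column frame and the seed values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; any DataFrame works the same way.
df = pd.DataFrame({"A": [1, 1], "B": [2, 2], "C": [3, 3]})

# Seeded generator: the same seed gives the same column order on every run.
rng = np.random.default_rng(seed=0)
shuffled = df[rng.permutation(df.columns.to_numpy())]

# Same idea with DataFrame.sample; random_state plays the role of the seed.
sampled = df.sample(frac=1, axis=1, random_state=0)
```

Both results hold the same columns and values as df and differ only in column order. Note that a shuffle can happen to return the original order, especially with only two columns.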

Drop rows and sort one dataframe according to another

I have two pandas data frames (df1 and df2):
# df1
ID COL
1 A
2 F
2 A
3 A
3 S
3 D
4 D
# df2
ID VAL
1 1
2 0
3 0
3 1
4 0
My goal is to append the corresponding val from df2 to each ID in df1. However, the relationship is not one-to-one (this is my client's fault and there's nothing I can do about this). To solve this problem, I want to sort df1 by df2['ID'] such that df1['ID'] is identical to df2['ID'].
So basically, for any row i in 0 to len(df2):
if df1.loc[i, 'ID'] == df2.loc[i, 'ID'] then keep row i in df1.
if df1.loc[i, 'ID'] != df2.loc[i, 'ID'] then drop row i from df1 and repeat.
The desired result is:
ID COL
1 A
2 F
3 A
3 S
4 D
This way, I can use pandas.concat([df1, df2['ID']], axis=0) to assign df2[VAL] to df1.
Is there a standardized way to do this? Does pandas.merge() have a method for doing this?
Before this gets voted as a duplicate, please realize that len(df1) != len(df2), so threads like this are not quite what I'm looking for.

This can be done with merge on both ID and the order within each ID:
(df1.assign(idx=df1.groupby('ID').cumcount())
    .merge(df2.assign(idx=df2.groupby('ID').cumcount()),
           on=['ID', 'idx'],
           suffixes=['', '_drop'])
    [df1.columns]
)
Output:
   ID COL
0   1   A
1   2   F
2   3   A
3   3   S
4   4   D

The simplest way I can see of getting the result you want is:
# Add a count for each repetition of the ids to temporary frames
x = df1.assign(id_counter=df1.groupby('ID').cumcount())
y = df2.assign(id_counter=df2.groupby('ID').cumcount())
# Merge using the ID and the repetition counter
df1 = pd.merge(x, y, how='right', on=['ID', 'id_counter']).drop('id_counter', axis=1)
Which would produce this output:
   ID COL  VAL
0   1   A    1
1   2   F    0
2   3   A    0
3   3   S    1
4   4   D    0
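Both answers rely on the same trick: number the repeats of each ID with groupby().cumcount() and merge on the pair (ID, repeat number). A self-contained sketch using the data from the question (the helper column name occurrence is my own choice):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 2, 3, 3, 3, 4], "COL": list("AFAASDD")})
df2 = pd.DataFrame({"ID": [1, 2, 3, 3, 4], "VAL": [1, 0, 0, 1, 0]})

# 0 for the first row of each ID, 1 for the second, and so on.
left = df1.assign(occurrence=df1.groupby("ID").cumcount())
right = df2.assign(occurrence=df2.groupby("ID").cumcount())

# An inner merge keeps only the (ID, occurrence) pairs present in both frames.
merged = left.merge(right, on=["ID", "occurrence"], how="inner").drop(
    columns="occurrence"
)
```

The surplus rows of df1 (the second ID 2 and the third ID 3) have no partner in df2 and drop out, which is the filtering described in the question.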

Concatenate dataframes alternating rows with Pandas

I have two dataframes df1 and df2 that are defined like so:
df1              df2
Out[69]:         Out[70]:
   A  B             A  B
0  2  a          0  5  q
1  1  s          1  6  w
2  3  d          2  3  e
3  4  f          3  1  r
My goal is to concatenate the dataframes by alternating the rows so that the resulting dataframe is like this:
dff
Out[71]:
   A  B
0  2  a    <--- belongs to df1
0  5  q    <--- belongs to df2
1  1  s    <--- belongs to df1
1  6  w    <--- belongs to df2
2  3  d    <--- belongs to df1
2  3  e    <--- belongs to df2
3  4  f    <--- belongs to df1
3  1  r    <--- belongs to df2
As you can see the first row of dff corresponds to the first row of df1 and the second row of dff is the first row of df2. The pattern repeats until the end.
I tried to reach my goal by using the following lines of code:
import pandas as pd
df1 = pd.DataFrame({'A':[2,1,3,4], 'B':['a','s','d','f']})
df2 = pd.DataFrame({'A':[5,6,3,1], 'B':['q','w','e','r']})
dfff = pd.DataFrame()
for i in range(0,4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
However this approach doesn't work because df1.iloc[i] and df2.iloc[i] are automatically reshaped into columns instead of rows and I cannot revert the process (even by using .T).
Question: Can you please suggest a nice and elegant way to reach my goal?
Optional: Can you also provide an explanation about how to convert a column back to row?

I'm unable to comment on the accepted answer, but note that the default sort is not stable, so you must request a stable sorting algorithm to keep each df1 row ahead of its df2 counterpart:
pd.concat([df1, df2]).sort_index(kind='mergesort')

If I understand correctly:
In [64]: pd.concat([df1, df2]).sort_index()
Out[64]:
   A  B
0  2  a
0  5  q
1  1  s
1  6  w
2  3  d
2  3  e
3  4  f
3  1  r
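The optional part of the question (how to turn a column back into a row) is not covered above. Here is a sketch that combines the stable-sort interleave with two ways of keeping a single row as a one-row DataFrame; the variable names are mine.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 1, 3, 4], 'B': ['a', 's', 'd', 'f']})
df2 = pd.DataFrame({'A': [5, 6, 3, 1], 'B': ['q', 'w', 'e', 'r']})

# Stack the frames, then sort by index label with a stable algorithm so that,
# for equal labels, the df1 row stays ahead of the df2 row.
dff = pd.concat([df1, df2]).sort_index(kind='mergesort')

# df1.iloc[0] is a Series, which prints as a column.
# Selecting with a list keeps the row as a one-row DataFrame:
row_as_frame = df1.iloc[[0]]

# Alternatively, turn the Series into a one-column frame and transpose it:
row_from_series = df1.iloc[0].to_frame().T
```

Transposing a Series with .T has no effect because a Series is one-dimensional, which is why the loop in the question never produced rows.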

Pandas python - matching values

I currently have two DataFrames that share a matching column. For example:
Data frame 1 with columns: A, B, C
Data frame 2 with column: A
I want to keep every row of the first DataFrame whose value in column A also appears in column A of the second DataFrame. For example, if df1 and df2 are:
df1
A B C
0 1 3
4 2 5
6 3 1
8 0 0
2 1 1
df2
A
4
6
1
So in this case, I want to only keep the second and third line of df1.
I tried doing it like this, but it didn't work since both DataFrames are pretty big:
for index, row in df1.iterrows():
    counter = 0
    for index2, row2 in df2.iterrows():
        if row["A"] == row2["A"]:
            counter = counter + 1
    if counter == 0:
        df2.drop(index, inplace=True)

Use isin to test for membership:
In [176]:
df1[df1['A'].isin(df2['A'])]
Out[176]:
   A  B  C
1  4  2  5
2  6  3  1

Or use the merge method:
import pandas

df1 = pandas.DataFrame([[0,1,3],[4,2,5],[6,3,1],[8,0,0],[2,1,1]], columns=['A', 'B', 'C'])
df2 = pandas.DataFrame([4,6,1], columns=['A'])
df2.merge(df1, on='A')
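To compare the two answers on the sample data, here is a short sketch. The names mask, kept, dropped and merged are mine, not part of either answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[0, 1, 3], [4, 2, 5], [6, 3, 1], [8, 0, 0], [2, 1, 1]],
    columns=["A", "B", "C"],
)
df2 = pd.DataFrame({"A": [4, 6, 1]})

# Boolean mask: True where df1's A value occurs anywhere in df2's A column.
mask = df1["A"].isin(df2["A"])
kept = df1[mask]       # rows to keep, original index preserved
dropped = df1[~mask]   # the complement, in case you need it

# The merge variant returns the same rows but with a fresh 0-based index.
merged = df2.merge(df1, on="A")
```

isin keeps df1's original index and row order, while the merge result follows the order of df2 and would repeat rows if df2 contained duplicate values in A.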
