How to split a column into rows if the values are separated with a comma? I am stuck here. I have used the following code:
xd = df.assign(var1=df['var1'].str.split(',')).explode('var1')
xd = xd.assign(var2=xd['var2'].str.split(',')).explode('var2')
xd
But the above code generates multiple irrelevant rows. Please suggest an answer.
DataFrame.explode
For multiple columns, specify a non-empty list where each element is a str or tuple; the list-like data in all specified columns must have matching lengths on the same row of the frame.
From the docs:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
'B': 1,
'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
df
A B C
0 [0, 1, 2] 1 [a, b, c]
1 foo 1 NaN
2 [] 1 []
3 [3, 4] 1 [d, e]
Multi-column explode.
df.explode(list('AC'))
A B C
0 0 1 a
0 1 1 b
0 2 1 c
1 foo 1 NaN
2 NaN 1 NaN
3 3 1 d
3 4 1 e
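If the per-row lists do not line up, a multi-column explode raises an error rather than silently mis-pairing values. A small sketch (not from the docs; the frame here is made up for illustration):
bad = pd.DataFrame({'A': [[0, 1]], 'C': [['a', 'b', 'c']]})
bad.explode(list('AC'))  # ValueError: columns must have matching element counts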
For your specific question:
xd = df.assign(
var1=df['var1'].str.split(','),
var2=df['var2'].str.split(',')
).explode(['var1', 'var2'])
xd
var1 var2 var3
0 a e 1
0 b f 1
0 c g 1
0 d h 1
1 p s 2
1 q t 2
1 r u 2
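Why the original attempt produced irrelevant rows: exploding var1 and then var2 in two separate calls pairs every var1 piece with every var2 piece of the same row, a per-row cross product. A minimal sketch reconstructing the question's frame from the output above (assumed data; exploding a list of columns requires pandas 1.3+):
import pandas as pd

df = pd.DataFrame({'var1': ['a,b,c,d', 'p,q,r'],
                   'var2': ['e,f,g,h', 's,t,u'],
                   'var3': [1, 2]})

# Two chained single-column explodes: 4*4 + 3*3 = 25 rows
bad = (df.assign(var1=df['var1'].str.split(',')).explode('var1')
         .assign(var2=lambda d: d['var2'].str.split(',')).explode('var2'))

# One multi-column explode keeps positions aligned: 4 + 3 = 7 rows
good = (df.assign(var1=df['var1'].str.split(','),
                  var2=df['var2'].str.split(','))
          .explode(['var1', 'var2']))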
Hi there, I would like to join all strings within a group using Python datatable, in order to avoid pandas. Below is the code I am currently using, which I would like to replicate in datatable. Does anyone know how to do it? Thank you very much!
from datatable import dt, f, by
df = dt.Frame(group1=[1, 1, 1, 2, 2, 2], group2=[1, 1, 2, 2, 2, 3], text=['a', 'b', 'c', 'd', 'e', 'f'])
df = df.to_pandas()
df2 = df.groupby(['group1', 'group2'])['text'].apply(' '.join).reset_index() # replicate this with datatable
df:
group1 group2 text
0 1 1 a
1 1 1 b
2 1 2 c
3 2 2 d
4 2 2 e
5 2 3 f
df2:
group1 group2 text
0 1 1 a b
1 1 2 c
2 2 2 d e
3 2 3 f
Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs match but the second row's data is not filled. However, I want to leave column 'A' intact. I am looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as a solution?
You can try replacing '0' with NaN, then applying ffill()+bfill() within each ID group using groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use the transform() method in place of apply(); a sketch follows below.
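A minimal sketch of the transform() variant (same frame, after the '0' to NaN replacement above); transform returns output aligned to the original index, so the assignment lines up without resetting anything:
df[['B', 'C']] = (df.groupby('ID')[['B', 'C']]
                  .transform(lambda x: x.ffill().bfill())
                  .fillna('0'))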
You can use combine_first: take the rows where B and C are filled, index them by ID, and combine_first will align both frames on that index, so the filled values propagate to every row sharing the same ID:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]  # keep 'A' from row 2
y = df[columns_to_replace].loc[index_to_use_to_replace]  # take ID, B, C from row 3
df.loc[index_to_be_replaced] = pd.concat([x, y])  # stitch the row back together
print(df)
Does this solve your problem? I would also look at other pandas functions, such as join and merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
I have dataframe A and dataframe B, and I want to join B onto A, but only for a certain column of B. Like this:
import pandas as pd

dataA = ['a', 'c', 'd', 'e']
A = pd.DataFrame(dataA, columns=['testA'])
dataB = [['a', 1, 'asdf'],
['b', 2, 'asdf'],
['c', 3, 'asdf'],
['d', 4, 'asdf'],
['e', 5, 'asdf']]
B = pd.DataFrame(dataB, columns=['testB', 'num', 'asdf'])
Out[1]: A
testA
0 a
1 c
2 d
3 e
Out[2]: B
testB num asdf
0 a 1 asdf
1 b 2 asdf
2 c 3 asdf
3 d 4 asdf
4 e 5 asdf
My current code is:
Out[3]: A.join(B.set_index('testB'), on='testA')
testA num asdf
0 a 1 asdf
1 c 3 asdf
2 d 4 asdf
3 e 5 asdf
My desired output is to join only the 'num' column, as below, ignoring the 'asdf' column (and any other columns if there were more):
Out[4]: A
testA num
0 a 1
1 c 3
2 d 4
3 e 5
One way may be to use merge:
new_df= A.merge(B, how='left', left_on='testA', right_on='testB')[['testA', 'num']]
Result:
testA num
0 a 1
1 c 3
2 d 4
3 e 5
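A variant of the same idea (a sketch, not part of the answer above): trim B down to the key plus the wanted column before merging, then drop the duplicate key. This avoids listing every kept column when B is wide:
new_df = (A.merge(B[['testB', 'num']], how='left',
                  left_on='testA', right_on='testB')
           .drop(columns='testB'))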
Use map: first create a pd.Series with the column you are bringing over as the values and the "mapping" column as the index. This does no work at all on the columns that are not needed in the desired result:
A['num'] = A['testA'].map(B.set_index('testB')['num'])
A
Output:
testA num
0 a 1
1 c 3
2 d 4
3 e 5
Using what you already have, and keeping only the columns you want:
z = A.join(B.set_index('testB'), on='testA')[['testA', 'num']]
Output:
testA num
0 a 1
1 c 3
2 d 4
3 e 5
I have been trying to rearrange my dataframe to use it as input for a factorplot. The raw data would look like this:
A B C D
1 0 1 2 "T"
2 1 2 3 "F"
3 2 1 0 "F"
4 1 0 2 "T"
...
My question is how can I rearrange it into this form:
col val val2
1 A 0 "T"
1 B 1 "T"
1 C 2 "T"
2 A 1 "F"
...
I was trying:
df = DF.cumsum(axis=0).stack().reset_index(name="val")
However, this produces only one value column, not two. Thanks for your support.
I would use melt, and you can sort it however you like:
pd.melt(df.reset_index(),id_vars=['index','D'], value_vars=['A','B','C']).sort_values(by='index')
Out[40]:
index D variable value
0 1 T A 0
4 1 T B 1
8 1 T C 2
1 2 F A 1
5 2 F B 2
9 2 F C 3
2 3 F A 2
6 3 F B 1
10 3 F C 0
3 4 T A 1
7 4 T B 0
11 4 T C 2
Then you can rename the columns as you like:
df.set_index('index').rename(columns={'D': 'col', 'variable': 'val2', 'value': 'val'})
Consider your dataframe df:
df = pd.DataFrame([
[0, 1, 2, 'T'],
[1, 2, 3, 'F'],
[2, 1, 3, 'F'],
[1, 0, 2, 'T'],
], [1, 2, 3, 4], list('ABCD'))
Solution:
df.set_index('D', append=True) \
  .rename_axis(index=[None, 'val2'], columns='col') \
  .stack().to_frame('val') \
  .reset_index(['col', 'val2']) \
  [['col', 'val', 'val2']]
I want to keep columns that have 'n' or more values.
For example:
> df = pd.DataFrame({'a': [1,2,3], 'b': [1,None,4]})
a b
0 1 1
1 2 NaN
2 3 4
3 rows × 2 columns
> df[df.count()==3]
IndexingError: Unalignable boolean Series key provided
> df[:,df.count()==3]
TypeError: unhashable type: 'slice'
> df[[k for (k,v) in (df.count()==3).items() if v]]
a
0 1
1 2
2 3
Is that the best way to do this? It seems ridiculous.
You can use a conditional list comprehension to collect the columns that exceed your threshold (e.g. 3), then select those columns from the data frame:
# Create sample DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
'b': [1, None, 4, None, 2],
'c': [5, 4, 3, 2, None]})
>>> df_new = df[[col for col in df if df[col].count() > 3]]
>>> df_new
a c
0 1 5
1 2 4
2 3 3
3 4 2
4 5 NaN
Use count to produce a boolean index and use this as a mask for the columns:
In [10]:
df[df.columns[df.count() > 2]]
Out[10]:
a
0 1
1 2
2 3
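If the requirement is literally "at least n non-null values", dropna's thresh parameter expresses it directly (a small sketch using the question's frame, with n = 3):
n = 3
df.dropna(axis=1, thresh=n)  # keep columns with at least n non-NaN values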
If you want to keep columns that have 'n' or more non-null values: this example keeps columns with more than 4 non-null values, i.e. n = 5.
df = pd.DataFrame({'a': [1,2,3,4,6], 'b': [1,None,4,5,7],'c': [1,2,3,5,8]})
print(df)
a b c
0 1 1 1
1 2 NaN 2
2 3 4 3
3 4 5 5
4 6 7 8
print(df[[df.columns[i] for i in range(len(df.columns)) if len(df.iloc[:, i]) - df.isnull().sum().iloc[i] > 4]])
a c
0 1 1
1 2 2
2 3 3
3 4 5
4 6 8