Pandas dataframe rearrangement stack to two value columns (for factorplots) - python

I have been trying to rearrange my dataframe to use it as input for a factorplot. The raw data would look like this:
A B C D
1 0 1 2 "T"
2 1 2 3 "F"
3 2 1 0 "F"
4 1 0 2 "T"
...
My question is how can I rearrange it into this form:
col val val2
1 A 0 "T"
1 B 1 "T"
1 C 2 "T"
2 A 1 "F"
...
I was trying:
df = DF.cumsum(axis=0).stack().reset_index(name="val")
However this produces only one value column not two.. thanks for your support

I would use melt, and you can sort it how ever you like
pd.melt(df.reset_index(),id_vars=['index','D'], value_vars=['A','B','C']).sort_values(by='index')
Out[40]:
index D variable value
0 1 T A 0
4 1 T B 1
8 1 T C 2
1 2 F A 1
5 2 F B 2
9 2 F C 3
2 3 F A 2
6 3 F B 1
10 3 F C 0
3 4 T A 1
7 4 T B 0
11 4 T C 2
then obviously you can name column as you like
df.set_index('index').rename(columns={'D': 'col', 'variable': 'val2', 'value': 'val'})

consider your dataframe df
df = pd.DataFrame([
[0, 1, 2, 'T'],
[1, 2, 3, 'F'],
[2, 1, 3, 'F'],
[1, 0, 2, 'T'],
], [1, 2, 3, 4], list('ABCD'))
solution
df.set_index('D', append=True) \
.rename_axis(['col'], 1) \
.rename_axis([None, 'val2']) \
.stack().to_frame('val') \
.reset_index(['col', 'val2']) \
[['col', 'val', 'val2']]

Related

Splitting comma separated values into rows - Data Cleaning

How to split a column into rows if values are separated with a comma? I am stuck in here. I have used the following code
xd = df.assign(var1=df['var1'].str.split(',')).explode('var1')
xd = xd.assign(var2=xd['var2'].str.split(',')).explode('var2')
xd
But the above code generate multiple irrelevant rows. I am stuck here. Please suggest answers
DataFrame.explode
For multiple columns, specify a non-empty list with each element be str or tuple, and all specified columns their list-like data on same row of the frame must have matching length.
From docs:
df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
'B': 1,
'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
df
A B C
0 [0, 1, 2] 1 [a, b, c]
1 foo 1 NaN
2 [] 1 []
3 [3, 4] 1 [d, e]
Multi-column explode.
df.explode(list('AC'))
A B C
0 0 1 a
0 1 1 b
0 2 1 c
1 foo 1 NaN
2 NaN 1 NaN
3 3 1 d
3 4 1 e
For your specific question:
xd = df.assign(
var1=df['var1'].str.split(','),
var2=df['var2'].str.split(',')
).explode(['var1', 'var2'])
xd
var1 var2 var3
0 a e 1
0 b f 1
0 c g 1
0 d h 1
1 p s 2
1 q t 2
1 r u 2

drop rows using pandas groupby and filter

I'm trying to drop rows from a df where certain conditions are met. Using below, I'm grouping values using column C. For each unique group, I want to drop ALL rows where A is less than 1 AND B is greater than 100. This has to occur on the same row though. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
'A' : [1,0,1,0,1,0,0,1,0,1],
'B' : [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
'C' : ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f',],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performnce get all C values match condition and then filter original column C by Series.isin in boolean indexing with inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
Another idea is use GroupBy.transform with GroupBy.any for test if match at least one value:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
Your solution is possible with any and not for scalars, if large DataFrame it should be slow:
df1 = df.groupby(['C']).filter(lambda g:not ( g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print (df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f

Replicate pandas grouped string join

Hi there I would like to join all strings within a group with Python datatable in order to avoid pandas. Below is the code I am currently using and which I would like to replicate in datatable.
Does anyone know how to do it? Thank you very much!
from datatable import dt, f, by
df = dt.Frame(group1=[1, 1, 1, 2, 2, 2], group2=[1, 1, 2, 2, 2, 3], text=['a', 'b', 'c', 'd', 'e', 'f'])
df = df.to_pandas()
df2 = df.groupby(['group1', 'group2'])['text'].apply(' '.join).reset_index() # replicate this with datatable
df:
group1 group2 text
0 1 1 a
1 1 1 b
2 1 2 c
3 2 2 d
4 2 2 e
5 2 3 f
df2
group1 group2 text
0 1 1 a b
1 1 2 c
2 2 2 d e
3 2 3 f

Copy data from a row to another row in Pandas dataframe based on condition

Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs are matching, but the data is not filled. However, I want to leave column 'A' intact. Looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN then ffill()+bfill() using groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use transform() method in place of apply() method
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does it solve your problem? I would check on other pandas functions, as well. Like join, merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0

Pandas index column by boolean

I want to keep columns that have 'n' or more values.
For example:
> df = pd.DataFrame({'a': [1,2,3], 'b': [1,None,4]})
a b
0 1 1
1 2 NaN
2 3 4
3 rows × 2 columns
> df[df.count()==3]
IndexingError: Unalignable boolean Series key provided
> df[:,df.count()==3]
TypeError: unhashable type: 'slice'
> df[[k for (k,v) in (df.count()==3).items() if v]]
a
0 1
1 2
2 3
Is that the best way to do this? It seems ridiculous.
You can use conditional list comprehension to generate the columns that exceed your threshold (e.g. 3). Then just select those columns from the data frame:
# Create sample DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
'b': [1, None, 4, None, 2],
'c': [5, 4, 3, 2, None]})
>>> df_new = df[[col for col in df if df[col].count() > 3]]
Out[82]:
a c
0 1 5
1 2 4
2 3 3
3 4 2
4 5 NaN
Use count to produce a boolean index and use this as a mask for the columns:
In [10]:
df[df.columns[df.count() > 2]]
Out[10]:
a
0 1
1 2
2 3
if you want to keep columns that have 'n' or more values. for my example i am considering n value as 4
df = pd.DataFrame({'a': [1,2,3,4,6], 'b': [1,None,4,5,7],'c': [1,2,3,5,8]})
print df
a b c
0 1 1 1
1 2 NaN 2
2 3 4 3
3 4 5 5
4 6 7 8
print df[[i for i in xrange(0,len(df.columns)) if len(df.iloc[:,i]) - df.isnull().sum()[i] >4]]
a c
0 1 1
1 2 2
2 3 3
3 4 5
4 6 8

Categories