Isolating Adjacent columns based on str.contains - python

Hi all, my dataframe looks like this:
A | B                        | C   | D   | E
  | 'USD'                    |     |     |
  | 'trading expenses-total' |     |     |
  | 8.10                     | 2.3 | 5.5 |
  | 9.1                      | 1.4 | 6.1 |
  | 5.4                      | 5.1 | 7.8 |
I haven't found anything quite like this, so apologies if this is a duplicate. Essentially I am trying to locate the column that contains the string 'total' (column B) and its adjacent columns (C and D), and turn them into a dataframe. I feel like I am close with the following code:
test.loc[:,test.columns.str.contains('total')]
which isolates the correct column, but I can't quite figure out how to grab the adjacent two columns. My desired output is:
B                        | C   | D
'USD'                    |     |
'trading expenses-total' |     |
8.10                     | 2.3 | 5.5
9.1                      | 1.4 | 6.1
5.4                      | 5.1 | 7.8

OLD answer:
Pandas approach:
In [36]: df = pd.DataFrame(np.random.rand(3,5), columns=['A','total','C','D','E'])
In [37]: df
Out[37]:
A total C D E
0 0.789482 0.427260 0.169065 0.112993 0.142648
1 0.303391 0.484157 0.454579 0.410785 0.827571
2 0.984273 0.001532 0.676777 0.026324 0.094534
In [38]: idx = np.argmax(df.columns.str.contains('total'))
In [39]: df.iloc[:, idx:idx+3]
Out[39]:
total C D
0 0.427260 0.169065 0.112993
1 0.484157 0.454579 0.410785
2 0.001532 0.676777 0.026324
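One caveat, as a small defensive sketch (not part of the original answer): np.argmax returns 0 when no column matches, which would silently slice the first three columns, so it is safer to check the mask first -
mask = df.columns.str.contains('total')
if mask.any():
    idx = mask.argmax()           # position of the first matching column
    out = df.iloc[:, idx:idx+3]   # that column plus the two to its right
else:
    out = df.iloc[:, :0]          # no match: keep an empty selection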
UPDATE:
In [118]: df
Out[118]:
A B C D E
0 NaN USD NaN NaN NaN
1 NaN trading expenses-total NaN NaN NaN
2 A 8.10 2.3 5.5 10.0
3 B 9.1 1.4 6.1 11.0
4 C 5.4 5.1 7.8 12.0
In [119]: col = df.select_dtypes(['object']).apply(lambda x: x.str.contains('total').any()).idxmax()
In [120]: cols = df.columns.to_series().loc[col:].head(3).tolist()
In [121]: col
Out[121]: 'B'
In [122]: cols
Out[122]: ['B', 'C', 'D']
In [123]: df[cols]
Out[123]:
B C D
0 USD NaN NaN
1 trading expenses-total NaN NaN
2 8.10 2.3 5.5
3 9.1 1.4 6.1
4 5.4 5.1 7.8

Here's one approach -
from scipy.ndimage import binary_dilation as bind
mask = test.columns.str.contains('total')
test_out = test.iloc[:,bind(mask,[1,1,1],origin=-1)]
If you don't have access to SciPy, you can also use np.convolve, like so -
test_out = test.iloc[:,np.convolve(mask,[1,1,1])[:-2]>0]
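To see what the convolution does, here is a minimal sketch on a toy mask (a single match at position 1, not the question's data): the full convolution with [1,1,1] marks each matching position plus the two positions to its right once the trailing two entries are trimmed -
import numpy as np

mask = np.array([False, True, False, False, False])   # 'total' found in column 1
conv = np.convolve(mask, [1, 1, 1])                   # full mode, length len(mask) + 2
print(conv)             # [0 1 1 1 0 0 0]
print(conv[:-2] > 0)    # [False  True  True  True  False] -> match plus two columns to its right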
Sample runs
Case #1 :
In [390]: np.random.seed(1234)
In [391]: test = pd.DataFrame(np.random.randint(0,9,(3,5)))
In [392]: test.columns = ['P','total001','g','r','t']
In [393]: test
Out[393]:
P total001 g r t
0 3 6 5 4 8
1 1 7 6 8 0
2 5 0 6 2 0
In [394]: mask = test.columns.str.contains('total')
In [395]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[395]:
total001 g r
0 6 5 4
1 7 6 8
2 0 6 2
Case #2 :
This also works if you have multiple matching columns, and also when a match is near the right edge and doesn't have two columns to its right -
In [401]: np.random.seed(1234)
In [402]: test = pd.DataFrame(np.random.randint(0,9,(3,7)))
In [403]: test.columns = ['P','total001','g','r','t','total002','k']
In [406]: test
Out[406]:
P total001 g r t total002 k
0 3 6 5 4 8 1 7
1 6 8 0 5 0 6 2
2 0 5 2 6 3 7 0
In [407]: mask = test.columns.str.contains('total')
In [408]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[408]:
total001 g r total002 k
0 6 5 4 1 7
1 8 0 5 6 2
2 5 2 6 7 0

Related

Python: how to apply a function to same ids in a pandas dataframe without a loop?

I have two dataframes with the same column id, and for each id I need to apply the following function:
def findConstant(df1, df2):
    c = df1.iloc[[0], df1.eq(df1.iloc[0]).all().to_numpy()].squeeze()
    return pd.concat([df1, df2]).assign(**c).reset_index(drop=True)
What I am doing is the following:
df3 = pd.DataFrame()
for idx in df1['id'].unique():
    tmp1 = df1[df1['id'] == idx]
    tmp2 = df2[df2['id'] == idx]
    tmp3 = findConstant(tmp1, tmp2)
    df3 = pd.concat([df3, tmp3], ignore_index=True)
I would like to know how to avoid a loop like that.
Use:
print (df1)
A B C id val
0 ar 2 8 1 3.2
1 ar 3 7 1 5.6
3 ar1 0 3 2 7.8
4 ar1 4 3 2 9.2
5 ar1 5 3 2 3.4
print (df2)
id val
0 1 3.3
1 2 6.4
#get the number of unique values and the first value per id group
df3 = df1.groupby('id').agg(['nunique','first'])
#mark columns whose values are constant within the group (nunique == 1)
m = df3.xs('nunique', axis=1, level=1).eq(1)
#keep the constant values and fill the non-constant ones from the original df2
df = df3.xs('first', axis=1, level=1).where(m).combine_first(df2.set_index('id'))
print (df)
A B C val
id
1 ar NaN NaN 3.3
2 ar1 NaN 3.0 6.4
#join together
df = pd.concat([df1, df.reset_index()], ignore_index=True)
print (df)
A B C id val
0 ar 2.0 8.0 1 3.2
1 ar 3.0 7.0 1 5.6
2 ar1 0.0 3.0 2 7.8
3 ar1 4.0 3.0 2 9.2
4 ar1 5.0 3.0 2 3.4
5 ar NaN NaN 1 3.3
6 ar1 NaN 3.0 2 6.4

Filling missing data in pandas dataframe on the basis of a value in another column [duplicate]

I have a dataframe having 4 columns (A, B, C, D). D has some NaN entries. I want to fill the NaN values with the average of D over the rows that have the same values of A, B and C.
For example, if A, B, C, D are x, y, z and NaN respectively, then I want that NaN to be replaced by the average of D over the rows where A, B, C are x, y, z.
df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean')) would be faster than apply
In [2400]: df
Out[2400]:
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0 1.0
1 2.0
2 3.0
3 5.0
Name: D, dtype: float64
In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
In [2403]: df
Out[2403]:
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Details
In [2396]: df.shape
Out[2396]: (10000, 4)
In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop
In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
I think you need:
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
Sample:
df = pd.DataFrame({'A':[1,1,1,3],
                   'B':[1,1,1,3],
                   'C':[1,1,1,3],
                   'D':[1,np.nan,3,5]})
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Link to duplicate of this question for further information:
Pandas Dataframe: Replacing NaN with row average
Another way suggested in the link is a simple fillna on the transpose:
df.T.fillna(df.mean(axis=1)).T
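For completeness, a minimal sketch of that transpose trick on toy data (not the question's frame): fillna with a Series fills per column, so transposing first makes each original row's NaNs pick up that row's mean -
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 4.0], 'B': [np.nan, 5.0], 'C': [3.0, np.nan]})
print(df.T.fillna(df.mean(axis=1)).T)
#      A    B    C
# 0  1.0  2.0  3.0   <- row mean of [1, 3]
# 1  4.0  4.5  4.5   <- row mean of [4, 5]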

Partial sums over series in pandas

I have a DataFrame that looks like
A B
0 1.2 1
1 1.2 6
2 1.2 4
3 2.3 2
4 2.3 5
5 1.2 7
and I would like to obtain the partial sums over groups of rows that share the same value of A, but only if they are next to each other. For this case, I would expect another DataFrame as in
0 1.2 11
3 2.3 7
5 1.2 7
I have a feeling that I can use .groupby, but I can only get it to work disregarding whether the groups of A are next to each other.
Use groupby with a helper Series, aggregating first on A and sum on B:
df = df.groupby(df.A.ne(df.A.shift()).cumsum(), as_index=False).agg({'A':'first','B':'sum'})
print (df)
A B
0 1.2 11
1 2.3 7
2 1.2 7
Detail:
Compare the column with its shifted version using ne (!=), then take the cumsum to label consecutive groups:
print (df.A.ne(df.A.shift()).cumsum())
0 1
1 1
2 1
3 2
4 2
5 3
Name: A, dtype: int32
Thanks to @user2285236 for the comment:
Checking for equality may lead to unwanted results when the dtype is float. np.isclose might be a better option here
df = df.groupby(np.cumsum(~np.isclose(df.A, df.A.shift())), as_index=False).agg({'A':'first','B':'sum'})
print (df)
A B
0 1.2 11
1 2.3 7
2 1.2 7
print (np.cumsum(~np.isclose(df.A, df.A.shift())))
[1 1 1 2 2 3]
itertools.groupby
This suffers from the same problem highlighted by @user2285236:
from itertools import groupby

g = groupby(df.itertuples(index=False), key=lambda x: x.A)
pd.DataFrame(
    [[a, sum(t.B for t in b)] for a, b in g],
    columns=df.columns
)
A B
0 1.2 11
1 2.3 7
2 1.2 7

Assign values from pandas.quantile

I am just trying to get the quantiles of a dataframe assigned onto another dataframe, like:
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7])
the result is
0 NaN
...
5758 NaN
Name: pc, Length: 5759, dtype: float64
Any idea why dataframe['pc'] ends up full of NaN values?
It is expected, because the Series created by quantile has different indices, so it does not align with the original DataFrame's index and you get NaNs:
#indices 0,1,2...6
dataframe = pd.DataFrame({'row':[2,0,8,1,7,4,5]})
print (dataframe)
row
0 2
1 0
2 8
3 1
4 7
5 4
6 5
#indices 0.1, 0.5, 0.7
print (dataframe['row'].quantile([.1,.5,.7]))
0.1 0.6
0.5 4.0
0.7 5.4
Name: row, dtype: float64
#not align
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7])
print (dataframe)
row pc
0 2 NaN
1 0 NaN
2 8 NaN
3 1 NaN
4 7 NaN
5 4 NaN
6 5 NaN
If you want to create a DataFrame from the quantiles, add rename_axis + reset_index:
df = dataframe['row'].quantile([.1,.5,.7]).rename_axis('a').reset_index(name='b')
print (df)
a b
0 0.1 0.6
1 0.5 4.0
2 0.7 5.4
But if some indices are the same (I don't think this is what you want; it is only for better explanation):
Add reset_index for default indices 0,1,2:
print (dataframe['row'].quantile([.1,.5,.7]).reset_index(drop=True))
0 0.6
1 4.0
2 5.4
Name: row, dtype: float64
The first 3 rows are aligned, because the Series and the DataFrame share the indices 0, 1, 2:
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7]).reset_index(drop=True)
print (dataframe)
row pc
0 2 0.6
1 0 4.0
2 8 5.4
3 1 NaN
4 7 NaN
5 4 NaN
6 5 NaN
EDIT:
For multiple columns use DataFrame.quantile, which also excludes non-numeric columns:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df1 = df.quantile([.1,.2,.3,.4])
print (df1)
B C D E
0.1 4.0 2.5 0.5 2.5
0.2 4.0 3.0 1.0 3.0
0.3 4.0 3.5 1.0 3.5
0.4 4.0 4.0 1.0 4.0

pandas.DataFrame set all string values to nan

I have a pandas.DataFrame that contains string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, whichever is sooner. So don't make this a long-term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)
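The referenced comment isn't shown here, but a common vectorized variant of the same idea (a sketch, not necessarily what NinjaPuppy suggested) is to apply pd.to_numeric across all columns at once:
df = df.apply(pd.to_numeric, errors='coerce')   # every cell that cannot be converted becomes NaN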
