Pandas - Edit Index using pattern / regex - python

Given a data frame like:
>>> df
ix val1 val2 val3 val4
1.31 2 3 4 5
8.22 2 3 4 5
5.39 2 3 4 5
7.34 2 3 4 5
Is it possible to edit index using something like replace?
Pseudo-code (since the df index doesn't have a str attribute):
df.index=df.index.str.replace("\\.[0-9]*","")
I need something like:
>>> df
ix val1 val2 val3 val4
1 2 3 4 5
8 2 3 4 5
5 2 3 4 5
7 2 3 4 5
The problem is that my dataframe is huge.
Thanks in advance

You can do:
df.index = df.index.to_series().astype(str).str.replace(r'\.[0-9]*', '', regex=True).astype(int)
you may also use .extract:
df.index.to_series().astype(str).str.extract(r'(\d+)', expand=False).astype(int)
alternatively, you may just map the index to int:
pd.Index(map(int, df.index))
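Since the index here is numeric rather than string, a simpler route may be to cast it directly. A minimal sketch, assuming a float64 index and that truncating the fractional part is what you want (the frame below is rebuilt from the question just for illustration):
import pandas as pd

# rebuild the question's frame (same shape, hypothetical values)
df = pd.DataFrame({'val1': [2, 2, 2, 2], 'val2': [3, 3, 3, 3],
                   'val3': [4, 4, 4, 4], 'val4': [5, 5, 5, 5]},
                  index=[1.31, 8.22, 5.39, 7.34])

# astype(int) on a float index truncates the fractional part: 1.31 -> 1, 8.22 -> 8, ...
df.index = df.index.astype(int)
print(df)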

Replace NaN values with values from other table

Please help.
My first table looks like:
id val1 val2
0 4 30
1 5 NaN
2 3 10
3 2 8
4 3 NaN
My second table looks like:
id val1 val2_estimate
0 1 8
1 2 12
2 3 13
3 4 16
4 5 22
I want to replace NaN in the 1st table with estimated values from the column val2_estimate in the 2nd table where the val1 values match. val1 in the 2nd table is unique. The end result needs to look like this:
id val1 val2
0 4 30
1 5 22
2 3 10
3 2 8
4 3 13
I want to replace NaN values only.
Use merge to get the corresponding df2's estimate for df1, then use fillna:
df['val2'] = df['val2'].fillna(
df.merge(df2, on=['val1'], how='left')['val2_estimate'])
df
id val1 val2
0 0 4 30.0
1 1 5 22.0
2 2 3 10.0
3 3 2 8.0
4 4 3 13.0
Many ways to skin a cat, this is one of them.
Use fillna with map from a pd.Series created using set_index:
df['val2'] = df['val2'].fillna(df['val1'].map(df2.set_index('val1')['val2_estimate']))
df
Output:
val1 val2
id
0 4 30.0
1 5 22.0
2 3 10.0
3 2 8.0
4 3 13.0
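Putting the map/fillna approach together as a runnable sketch (the frames below are rebuilt from the question, with the id column omitted for brevity):
import numpy as np
import pandas as pd

df = pd.DataFrame({'val1': [4, 5, 3, 2, 3],
                   'val2': [30, np.nan, 10, 8, np.nan]})
df2 = pd.DataFrame({'val1': [1, 2, 3, 4, 5],
                    'val2_estimate': [8, 12, 13, 16, 22]})

# map looks up each val1 in df2 (val1 is unique there); fillna only touches
# the rows where val2 is NaN, so existing values are left untouched
df['val2'] = df['val2'].fillna(df['val1'].map(df2.set_index('val1')['val2_estimate']))
print(df)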

Use regex to remove/exclude columns from dataframe - Python

I have a dataframe which can be generated from the code below
df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],'date3derived':[0,0,0],'val3':[7,9,11]})
The dataframe looks like as shown below
I would like to remove columns that contain "derived" in their name. I tried different regex but couldn't get the expected output.
df = df.filter(regex='[^H\dDerived]+', axis=1)
df = df.filter(regex='[^Derived]',axis=1)
Can you let me know the right regex to do this?
You can use a zero-width negative lookahead to make sure the string derived does not appear anywhere in the label:
^(?!.*?derived)
^ matches the start of the string
(?!.*?derived) is the negative lookahead that makes sure derived does not appear anywhere in the string
Your pattern [^Derived] will match any single character that is not one of D, e, r, i, v, or d.
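Applied with df.filter, which matches column labels using re.search, the lookahead keeps only the columns whose names do not contain derived. A sketch, assuming df built as in the question:
# keep only columns whose label does not contain 'derived'
df = df.filter(regex=r'^(?!.*derived)', axis=1)
print(df.columns.tolist())
# ['person_id', 'date1', 'val1', 'date2', 'val2', 'date3', 'val3']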
IIUC, you want to drop the columns that have derived in their name. This should do:
df.drop(df.filter(like='derived').columns, axis=1)
Out[455]:
person_id date1 val1 date2 val2 date3 val3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
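Equivalently, newer pandas versions accept a columns keyword, which avoids passing an axis at all (a small variation, not from the original answer):
df = df.drop(columns=df.filter(like='derived').columns)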
pd.Index.difference() with df.filter()
df[df.columns.difference(df.filter(like='derived').columns,sort=False)]
person_id date1 val1 date2 val2 date3 val3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
df[[c for c in df.columns if 'derived' not in c ]]
Output
person_id date1 val1 date2 val2 date3 val3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
In recent versions of pandas, you can use string methods on the index and columns. Here, str.endswith seems like a good fit.
import pandas as pd
df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],
'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],
'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],
'date3derived':[0,0,0],'val3':[7,9,11]})
df = df.loc[:,~df.columns.str.endswith('derived')]
print(df)
Output:
person_id date1 val1 date2 val2 date3 val3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11

Pandas conditionally copying of cell value

Working with a Pandas DataFrame, I am trying to copy data from one cell into another cell only if the recipient cell contains a specific value. The transfer should go from:
Col1 Col2
0 4 X
1 2 5
2 1 X
3 7 8
4 12 20
5 3 X
And the result should be
Col1 Col2
0 4 4
1 2 5
2 1 1
3 7 8
4 12 20
5 3 3
Is there an elegant or simple solution I am missing?
df.Col2 = df.Col1.where(df.Col2 == 'X', df.Col2)
import pandas as pd
import numpy as np
df.Col2 = np.where(df.Col2 == 'specific value', df.Col1, df.Col2)
Using pandas.DataFrame.ffill:
>>> df.replace('X', np.nan, inplace=True)
>>> df.ffill(axis=1)
Col1 Col2
0 4 4
1 2 5
2 1 1
3 7 8
4 12 20
5 3 3
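For reference, the where answer above as a self-contained sketch; the marker is assumed to be the literal string 'X', matching the question's table, and the frame is rebuilt here for illustration:
import pandas as pd

df = pd.DataFrame({'Col1': [4, 2, 1, 7, 12, 3],
                   'Col2': ['X', 5, 'X', 8, 20, 'X']})

# wherever Col2 holds the marker 'X', take the value from Col1, else keep Col2
df['Col2'] = df['Col1'].where(df['Col2'] == 'X', df['Col2'])
print(df)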

What is the difference between using `[data2]` and `[[data2]]` with `groupby`

I am working through a Python for data analysis tutorial and want some clarification on the output I get from using [data2] and [[data2]] when using groupby.
If you use:
[data2]
you get a Series with a MultiIndex.
If you use the subset:
[[data2]]
you get a DataFrame with a MultiIndex.
And if you use:
df.groupby(['key1','key2'], as_index=False)['data2'].mean()
you get a DataFrame with 3 columns and no MultiIndex.
Maybe it is clearer with another form:
import pandas as pd
df = pd.DataFrame({'key1':[1,2,2,1,2,2],
'key2':[4,4,4,4,5,5],
'data2':[7,8,9,1,3,5],
'D':[1,3,5,7,9,5]})
print (df)
D data2 key1 key2
0 1 7 1 4
1 3 8 2 4
2 5 9 2 4
3 7 1 1 4
4 9 3 2 5
5 5 5 2 5
print (df['data2'].groupby([df.key1,df.key2]).mean())
key1  key2
1     4       4.0
2     4       8.5
      5       4.0
Name: data2, dtype: float64
print (df[['data2']].groupby([df.key1,df.key2]).mean())
           data2
key1 key2
1    4       4.0
2    4       8.5
     5       4.0
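For completeness, the as_index=False form mentioned earlier, applied to the same sample frame, returns a flat DataFrame with key1, key2 and the group means as ordinary columns:
print (df.groupby(['key1','key2'], as_index=False)['data2'].mean())
   key1  key2  data2
0     1     4    4.0
1     2     4    8.5
2     2     5    4.0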

Pandas - Delete cells based on ranking within column

I want to delete values based on their relative rank within their column. Specifically, I want to isolate the X highest and X lowest values within several columns. So if X=2 and my dataframe looks like this:
ID Val1 Val2 Val3
001 2 8 14
002 10 15 8
003 3 1 20
004 11 11 7
005 14 4 19
The output should look like this:
ID Val1 Val2 Val3
001 2 NaN NaN
002 NaN 15 8
003 3 1 20
004 11 11 7
005 14 4 19
I know that I can make a sub-table to isolate the high and low rank using:
df = df.sort_values('Column Name')
df2 = df.head(X) # OR: df.tail(X)
And I figure I can clear these sub-tables of the values from the other columns using:
df2['Other Column'] = np.NaN
df2['Other Column B'] = np.NaN
Then merge the sub-tables back together in a way that replaces NaN values when there is data in one of the tables. I tried:
df2.update(df3) # df3 is a sub-table made the same way as df2 using a different column
Which only updated rows already present in df2.
I tried:
out = pd.merge(df2, df3, how='outer')
which gave me separate rows when a row appeared in both df2 and df3.
I tried:
out = df2.combine_first(df3)
which overwrote numerical values with NaN values in some cases, making it unsuitable.
There must be a way to do this: I want the original dataframe with NaN values plugged in whenever a value is not among the X highest or X lowest values in that column.
Interesting question. You can get the index of each column's values within that column's sorted values (the mask DataFrame here), and then keep the values whose index falls within your defined boundary.
In [98]:
print(df)
Val1 Val2 Val3
ID
1 2 8 14
2 10 15 8
3 3 1 20
4 11 11 7
5 14 4 19
In [99]:
mask = df.apply(lambda x: np.searchsorted(sorted(x),x))
print(mask)
Val1 Val2 Val3
ID
1 0 2 2
2 2 4 1
3 1 0 4
4 3 3 0
5 4 1 3
In [100]:
print((mask<=1)|(mask>=(len(mask)-2)))
Val1 Val2 Val3
ID
1 True False False
2 False True True
3 True True True
4 True True True
5 True True True
In [101]:
print(df.where((mask<=1)|(mask>=(len(mask)-2))))
Val1 Val2 Val3
ID
1 2 NaN NaN
2 NaN 15 8
3 3 1 20
4 11 11 7
5 14 4 19
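An alternative sketch, not from the answer above, that leans on DataFrame.rank instead of searchsorted; rank(method='first') numbers each column's values from 1 and breaks ties by position, so keeping ranks in the bottom X or top X reproduces the expected output:
import pandas as pd

X = 2
df = pd.DataFrame({'Val1': [2, 10, 3, 11, 14],
                   'Val2': [8, 15, 1, 11, 4],
                   'Val3': [14, 8, 20, 7, 19]},
                  index=pd.Index(['001', '002', '003', '004', '005'], name='ID'))

# per-column ranks (1 = smallest); keep the X lowest and X highest, NaN the rest
ranks = df.rank(method='first')
print(df.where((ranks <= X) | (ranks > len(df) - X)))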
