How to use boolean indexing with Pandas - python

I have a dataframe:
df =
time time b
0 0.0 1.1 21
1 0.1 2.2 22
2 0.2 3.3 23
3 0.3 4.4 24
4 0.4 5.5 24
I also have a series for my units, defined as
su =
time sal
time zulu
b m/s
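For reference, a minimal sketch of how a frame and units series like this can be built (duplicate column labels are allowed as long as the data is not passed as a dict):
import pandas as pd

# Rebuild the example data with two columns both labelled 'time'
df = pd.DataFrame([[0.0, 1.1, 21], [0.1, 2.2, 22], [0.2, 3.3, 23],
                   [0.3, 4.4, 24], [0.4, 5.5, 24]],
                  columns=['time', 'time', 'b'])
su = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])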
Now, I want to set df.index equal to the "time (sal)" values. Those values can be in any column and I will need to check.
I can do this as:
df.index = df.values[:,(df.columns == 'time') & (su.values == 'sal')]
But, my index looks like:
array([[0.0],
[0.1],
[0.2],
[0.3],
[0.4]])
However, this is a 2-D array (an array of one-element arrays), and on bigger datasets plotting against it seems to take longer. If I hardcode the column position instead, I get a plain 1-D array:
df.index = df.values[:, 0]
array([0.0, 0.1, 0.2, 0.3, 0.4])
I can also do the following:
inx = ((df.columns == 'time') & (su.values == 'sal')).tolist().index(True)
This sets "inx" to 0, and then I get a single 1-D array:
df.index = df.values[:, inx]
However, I shouldn't have to do this. Am I using pandas and boolean indexing incorrectly?
I want:
df =
time time b
0.0 0.0 1.1 21
0.1 0.1 2.2 22
0.2 0.2 3.3 23
0.3 0.3 4.4 24
0.4 0.4 5.5 24

If I understood correctly, this is what you expected. However, I renamed the time columns to time1 and time2, because a Python dict cannot hold two entries with the same key.
import pandas as pd

df = {'time1': [0.0, 0.1, 0.2, 0.3, 0.4], 'time2': [1.1, 2.2, 3.3, 4.4, 5.5], 'b': [21, 22, 23, 24, 24]}
su = {'time1': 'sal', 'time2': 'zulu', 'b': 'm/s'}
indexes = df[list(su.keys())[list(su.values()).index('sal')]]
df = pd.DataFrame(df, index=indexes, columns=['time1', 'time2', 'b'])
print(df)

Your original DataFrame has duplicate column names, which adds complexity.
Try renaming the columns first.
Sample Code
unit = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])
>>> df
time time b
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
new_col = ['{}({})'.format(df.columns[i], unit.iloc[i]) for i in range(len(df.columns))]
>>> new_col
['time(sal)', 'time(zulu)', 'b(m/s)']
>>> df.columns = new_col
>>> df
time(sal) time(zulu) b(m/s)
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
>>> df.index = df['time(sal)'].values
>>> df
time(sal) time(zulu) b(m/s)
0.0 0.0 1.1 21.0
0.1 0.1 2.2 22.0
0.2 0.2 3.3 23.0
0.3 0.3 4.4 24.0
0.4 0.4 5.5 25.0
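Alternatively, if the duplicate 'time' labels must stay as they are, the boolean-mask approach from the question also works; the 2-D slice it returns just needs flattening, for example with ravel. A minimal sketch, assuming the df and su defined in the question:
# Columns matching both the name 'time' and the unit 'sal'; the slice is 2-D,
# so flatten it to get a 1-D index.
mask = (df.columns == 'time') & (su.values == 'sal')
df.index = df.values[:, mask].ravel()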

Related

Fill the dataframe values from other dataframe in pandas python

I have two dataframes, df1 and df2. df1 is filled with values and df2 is empty.
As can be seen below, both dataframes always have the same index and column labels; the only difference is that df1 has no duplicate column or index labels, while df2 does.
How can I fill df2 with values from df1, so that each cell is matched on its combination of index and column label?
df1 = pd.DataFrame({'Ind': pd.Series([1, 2, 3, 4]),
                    1: pd.Series([1, 0.2, 0.2, 0.8]),
                    2: pd.Series([0.2, 1, 0.2, 0.8]),
                    3: pd.Series([0.2, 0.2, 1, 0.8]),
                    4: pd.Series([0.8, 0.8, 0.8, 1])})
df1 = df1.set_index(['Ind'])
df2 = pd.DataFrame(columns=[1, 1, 2, 2, 3, 4], index=[1, 1, 2, 2, 3, 4])
IIUC, you want to update:
df2.update(df1)
print(df2)
1 1 2 2 3 4
1 1.0 1.0 0.2 0.2 0.2 0.8
1 1.0 1.0 0.2 0.2 0.2 0.8
2 0.2 0.2 1.0 1.0 0.2 0.8
2 0.2 0.2 1.0 1.0 0.2 0.8
3 0.2 0.2 0.2 0.2 1.0 0.8
4 0.8 0.8 0.8 0.8 0.8 1.0
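For comparison, a reindex-based sketch reaches the same fill without update, by aligning df1 onto df2's duplicated labels; df1 and df2's label lists are rebuilt here directly from the question:
import pandas as pd

df1 = pd.DataFrame({1: [1, 0.2, 0.2, 0.8], 2: [0.2, 1, 0.2, 0.8],
                    3: [0.2, 0.2, 1, 0.8], 4: [0.8, 0.8, 0.8, 1]},
                   index=[1, 2, 3, 4])
# Duplicate labels in the target are fine as long as df1's own labels are unique:
# each df2 cell is filled from the df1 cell with the same (index, column) pair.
df2 = df1.reindex(index=[1, 1, 2, 2, 3, 4], columns=[1, 1, 2, 2, 3, 4])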

Removing outliers and surrounding data from dataframe

I have a data set containing some outliers that I'd like to remove.
I want to remove the 0 value in the data frame shown below:
df = pd.DataFrame({'Time': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'data': [1.1, 1.05, 1.01, 1.05, 0, 1.2, 1.1, 1.08, 1.07, 1.1]})
I can do something like this in order to remove values below a certain threshold:
df.loc[df['data'] < 0.5, 'data'] = np.NaN
This yields the frame with the 0 value replaced by NaN:
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 1.01
3 0.3 1.05
4 0.4 NaN
5 0.5 1.20
6 0.6 1.10
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
However, I am also suspicious of the data surrounding invalid values, and would like to remove values within 0.2 units of Time of each outlier, like the following:
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 NaN
3 0.3 NaN
4 0.4 NaN
5 0.5 NaN
6 0.6 NaN
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
You can get a list of all points in time in which you have bad measurements and filter for all nearby time values:
bad_times = df.Time[df['data'] < 0.5]
for t in bad_times:
    df.loc[(df['Time'] - t).abs() <= 0.2, 'data'] = np.NaN
result:
>>> print(df)
Time data
0 0.0 1.10
1 0.1 1.05
2 0.2 NaN
3 0.3 NaN
4 0.4 NaN
5 0.5 NaN
6 0.6 NaN
7 0.7 1.08
8 0.8 1.07
9 0.9 1.10
You can build a list of the Time values to be masked, and then apply NaN to those rows.
df.loc[df['data'] < 0.5, 'data'] = np.NaN
l = df[df['data'].isna()]['Time'].values
l2 = []
for i in l:
    l2 = l2 + [round(i - 0.1, 1), round(i - 0.2, 1), round(i + 0.1, 1), round(i + 0.2, 1)]
df.loc[df['Time'].isin(l2), 'data'] = np.nan
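If there are many outliers, a vectorized sketch of the same idea replaces the Python loop with a broadcasted distance check (threshold 0.5 and window 0.2 taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Time': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                   'data': [1.1, 1.05, 1.01, 1.05, 0, 1.2, 1.1, 1.08, 1.07, 1.1]})

bad_times = df.loc[df['data'] < 0.5, 'Time'].to_numpy()   # times of the outliers
# For every row, check whether any outlier lies within 0.2 time units of it.
near_outlier = (np.abs(df['Time'].to_numpy()[:, None] - bad_times) <= 0.2).any(axis=1)
df.loc[near_outlier, 'data'] = np.nan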

Create a new column based on conditions from other columns in Python

I want to do in Python something very similar to this question from R users: create a new column whose values are derived from conditions on other columns.
For example:
d = {'year': [2010, 2011,2013, 2014], 'PD': [0.5, 0.8, 0.9, np.nan], 'PD_thresh': [0.7, 0.8, 0.9, 0.7]}
df_temp = pd.DataFrame(data=d)
Now I want to create a condition that says:
pseudo-code:
if for year X the value of PD is greater or equal to the value of PD_thresh
then set 0 in a new column y_pseudo
otherwise set 1
My expected outcome is this:
df_temp
Out[57]:
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.7 0.0
2 2013 0.9 0.8 1.0
3 2014 NaN 0.7 NaN
Use numpy.select with isna and ge:
m1 = df_temp['PD'].isna()
m2 = df_temp['PD'].ge(df_temp['PD_thresh'])
df_temp['y_pseudo'] = np.select([m1, m2], [np.nan, 1], default=0)
print (df_temp)
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.8 0.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
Another solution is to convert the mask to integers (True/False to 1/0) and assign only the non-missing rows, selected with notna:
m2 = df_temp['PD'].ge(df_temp['PD_thresh'])
m3 = df_temp['PD'].notna()
df_temp.loc[m3, 'y_pseudo'] = m2[m3].astype(int)
print (df_temp)
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.8 0.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
Your data d differs from your expected outcome, and I think you meant 1 if PD is greater than or equal to the threshold, not the other way around, so I have this:
y = [a if np.isnan(a) else 1 if a>=b else 0 for a,b in zip(df_temp.PD,df_temp.PD_thresh)]
df_temp['y_pseudo'] = y
Output:
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.8 0.8 1.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
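Yet another sketch of the same logic keeps NaN where PD is missing by using where; it follows the answers' convention (1 when PD >= PD_thresh) and the d defined in the question:
import numpy as np
import pandas as pd

d = {'year': [2010, 2011, 2013, 2014],
     'PD': [0.5, 0.8, 0.9, np.nan],
     'PD_thresh': [0.7, 0.8, 0.9, 0.7]}
df_temp = pd.DataFrame(data=d)

# Compare, cast the boolean result to float, then blank out rows where PD is NaN.
df_temp['y_pseudo'] = (df_temp['PD'] >= df_temp['PD_thresh']).astype(float).where(df_temp['PD'].notna())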

Organizing results of experiments with pandas

I have the following input data. Each line is the result of one experiment:
instance algo profit time
x A 10 0.5
y A 20 0.1
z A 13 0.7
x B 39 0.9
y B 12 1.2
z B 14 0.6
And I would like to generate the following table:
A B
instance profit time profit time
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
I have tried using pivot and pivot_table with no success. Is there any way to achieve this result with pandas?
First melt to get 'profit' and 'time' into a single column, and then use a pivot table with multiple column levels:
(df.melt(id_vars=['instance', 'algo'])
.pivot_table(index='instance', columns=['algo', 'variable'], values='value'))
#algo A B
#variable profit time profit time
#instance
#x 10.0 0.5 39.0 0.9
#y 20.0 0.1 12.0 1.2
#z 13.0 0.7 14.0 0.6
set_index and unstack:
df.set_index(['instance', 'algo']).unstack()
profit time
algo A B A B
instance
x 10 39 0.5 0.9
y 20 12 0.1 1.2
z 13 14 0.7 0.6
Then swap and sort the column levels to get algo on the outer level:
(df.set_index(['instance', 'algo'])
.unstack()
.swaplevel(1, 0, axis=1)
.sort_index(axis=1))
algo A B
profit time profit time
instance
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
Another option is using pivot and swaplevel:
(df.pivot(index='instance', columns='algo', values=['profit', 'time'])
.swaplevel(1, 0, axis=1)
.sort_index(axis=1))
algo A B
profit time profit time
instance
x 10.0 0.5 39.0 0.9
y 20.0 0.1 12.0 1.2
z 13.0 0.7 14.0 0.6
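Essentially the same reshape also works without the melt step, by passing both value columns to pivot_table directly and reordering the column levels; a sketch (the default mean aggregation is harmless here, since each instance/algo pair occurs once):
import pandas as pd

df = pd.DataFrame({'instance': ['x', 'y', 'z', 'x', 'y', 'z'],
                   'algo':     ['A', 'A', 'A', 'B', 'B', 'B'],
                   'profit':   [10, 20, 13, 39, 12, 14],
                   'time':     [0.5, 0.1, 0.7, 0.9, 1.2, 0.6]})

# pivot_table with both value columns, then put algo on the outer column level
out = (df.pivot_table(index='instance', columns='algo', values=['profit', 'time'])
         .swaplevel(0, 1, axis=1)
         .sort_index(axis=1))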

Subtracting columns based on key column in pandas dataframe

I have two dataframes looking like
df1:
ID A B C D
0 'ID1' 0.5 2.1 3.5 6.6
1 'ID2' 1.2 5.5 4.3 2.2
2 'ID1' 0.7 1.2 5.6 6.0
3 'ID3' 1.1 7.2 10. 3.2
df2:
ID A B C D
0 'ID1' 1.0 2.0 3.3 4.4
1 'ID2' 1.5 5.0 4.0 2.2
2 'ID3' 0.6 1.2 5.9 6.2
3 'ID4' 1.1 7.2 8.5 3.0
df1 can have multiple entries with the same ID, whereas each ID occurs only once in df2. Also, not all IDs in df2 are necessarily present in df1. I can't solve this simply with set_index(), since multiple rows in df1 can have the same ID and the IDs in df1 and df2 are not aligned.
I want to create a new dataframe where I subtract the values in df2[['A','B','C','D']] from df1[['A','B','C','D']] based on matching the ID.
The resulting dataframe would look like:
df_new:
ID A B C D
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
I know how to do this with a loop, but since I'm dealing with huge data quantities this is not practical at all. What is the best way of approaching this with Pandas?
You just need set_index and subtract
(df1.set_index('ID')-df2.set_index('ID')).dropna(axis=0)
Out[174]:
A B C D
ID
'ID1' -0.5 0.1 0.2 2.2
'ID1' -0.3 -0.8 2.3 1.6
'ID2' -0.3 0.5 0.3 0.0
'ID3' 0.5 6.0 4.1 -3.0
If the order matters, add reindex for df2:
(df1.set_index('ID')-df2.set_index('ID').reindex(df1.ID)).dropna(axis=0).reset_index()
Out[211]:
ID A B C D
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
Similarly to what Wen (who beat me to it) proposed, you can use pd.DataFrame.subtract:
df1.set_index('ID').subtract(df2.set_index('ID')).reset_index()
A B C D
ID
'ID1' -0.5 0.1 0.2 2.2
'ID1' -0.3 -0.8 2.3 1.6
'ID2' -0.3 0.5 0.3 0.0
'ID3' 0.5 6.0 4.1 -3.0
One method is to use numpy. We can extract the ordered indices required from df2 using numpy.searchsorted.
Then feed this into the construction of a new dataframe (note that this relies on df2['ID'] being sorted).
idx = np.searchsorted(df2['ID'], df1['ID'])
res = pd.DataFrame(df1.iloc[:, 1:].values - df2.iloc[:, 1:].values[idx],
                   index=df1['ID']).reset_index()
print(res)
ID 0 1 2 3
0 'ID1' -0.5 0.1 0.2 2.2
1 'ID2' -0.3 0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3' 0.5 6.0 4.1 -3.0
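A merge-based sketch is one more option; it keeps df1's row order and duplicate IDs and, unlike searchsorted, does not require df2['ID'] to be sorted (the frames are rebuilt here from the question's data):
import pandas as pd

df1 = pd.DataFrame({'ID': ['ID1', 'ID2', 'ID1', 'ID3'],
                    'A': [0.5, 1.2, 0.7, 1.1], 'B': [2.1, 5.5, 1.2, 7.2],
                    'C': [3.5, 4.3, 5.6, 10.0], 'D': [6.6, 2.2, 6.0, 3.2]})
df2 = pd.DataFrame({'ID': ['ID1', 'ID2', 'ID3', 'ID4'],
                    'A': [1.0, 1.5, 0.6, 1.1], 'B': [2.0, 5.0, 1.2, 7.2],
                    'C': [3.3, 4.0, 5.9, 8.5], 'D': [4.4, 2.2, 6.2, 3.0]})

cols = ['A', 'B', 'C', 'D']
# Left-join df2 onto df1 by ID, then subtract the matched reference columns.
merged = df1.merge(df2, on='ID', how='left', suffixes=('', '_ref'))
df_new = df1.copy()
df_new[cols] = merged[cols].values - merged[[c + '_ref' for c in cols]].values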
