Create a new column based on conditions from other columns in Python - python

I want to do something in Python very similar to this question from R users. My intention is to create a new column whose values are based on conditions from other columns.
For example:
import numpy as np
import pandas as pd
d = {'year': [2010, 2011, 2013, 2014], 'PD': [0.5, 0.8, 0.9, np.nan], 'PD_thresh': [0.7, 0.8, 0.9, 0.7]}
df_temp = pd.DataFrame(data=d)
Now I want to create a condition that says:
pseudo-code:
if for year X the value of PD is greater or equal to the value of PD_thresh
then set 0 in a new column y_pseudo
otherwise set 1
My expected outcome is this:
df_temp
Out[57]:
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.7 0.0
2 2013 0.9 0.8 1.0
3 2014 NaN 0.7 NaN

Use numpy.select with isna and ge:
m1 = df_temp['PD'].isna()
m2 = df_temp['PD'].ge(df_temp['PD_thresh'])
df_temp['y_pseudo'] = np.select([m1, m2], [np.nan, 1], default=0)
print (df_temp)
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.8 0.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
Another solution is to convert the mask to integers, which maps True/False to 1/0, and assign only to the rows that are not missing, selected with notna:
m2 = df_temp['PD'].ge(df_temp['PD_thresh'])
m3 = df_temp['PD'].notna()
df_temp.loc[m3, 'y_pseudo'] = m2[m3].astype(int)
print (df_temp)
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.6 0.8 0.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
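For reference, here is a self-contained sketch of the numpy.select approach, run against the dictionary d exactly as defined in the question (the printed tables above show slightly different PD values). The order of the conditions matters: the NaN check comes first so that missing rows are not caught by the default.

```python
import numpy as np
import pandas as pd

d = {'year': [2010, 2011, 2013, 2014],
     'PD': [0.5, 0.8, 0.9, np.nan],
     'PD_thresh': [0.7, 0.8, 0.9, 0.7]}
df_temp = pd.DataFrame(data=d)

# Conditions are evaluated in order; the first one that is True wins.
is_missing = df_temp['PD'].isna()
meets_thresh = df_temp['PD'].ge(df_temp['PD_thresh'])

# NaN where PD is missing, 1 where PD >= PD_thresh, 0 otherwise.
df_temp['y_pseudo'] = np.select([is_missing, meets_thresh], [np.nan, 1], default=0)
print(df_temp)
```

With this input the new column holds 0.0, 1.0, 1.0 and NaN.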

Your data d is different from your outcome, and I think you meant 1 if greater than the threshold, not the other way around, so I have this:
y = [a if np.isnan(a) else 1 if a>=b else 0 for a,b in zip(df_temp.PD,df_temp.PD_thresh)]
df_temp['y_pseudo'] = y
Output:
year PD PD_thresh y_pseudo
0 2010 0.5 0.7 0.0
1 2011 0.8 0.8 1.0
2 2013 0.9 0.9 1.0
3 2014 NaN 0.7 NaN
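If you prefer to avoid the Python-level loop, the same result can probably be had with a vectorised comparison followed by Series.where. This is only a sketch of an alternative, not what the answer above uses.

```python
import numpy as np
import pandas as pd

df_temp = pd.DataFrame({'year': [2010, 2011, 2013, 2014],
                        'PD': [0.5, 0.8, 0.9, np.nan],
                        'PD_thresh': [0.7, 0.8, 0.9, 0.7]})

# Comparisons with NaN are False, so the missing row becomes 0.0 here...
meets_thresh = df_temp['PD'].ge(df_temp['PD_thresh']).astype(float)

# ...and where() restores NaN for every row whose PD is missing.
df_temp['y_pseudo'] = meets_thresh.where(df_temp['PD'].notna())
print(df_temp)
```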

Related

Pandas Is it possible to add new time values with empty values in columns in a csv with a time sequence?

I have a csv file that looks something like this
Time      OI    V
10:00:23  5.4   27
10:00:24  -0.7  1
10:00:28  -0.5  4
10:00:29  0.2   12
Can I somehow add the missing time values using Pandas, filling the new rows with zeros or NaN, for the entire csv file?
The result would look something like this:
Time      OI    V
10:00:23  5.4   27
10:00:24  -0.7  1
10:00:25  0     NaN
10:00:26  0     NaN
10:00:27  0     NaN
10:00:28  -0.5  4
10:00:29  0.2   12
Convert the column to datetimes, set it as a DatetimeIndex, add the missing timestamps with DataFrame.asfreq, and finally replace the NaNs in the OI column:
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index('Time').asfreq('S').fillna({'OI':0})
df.index = df.index.time
print (df)
OI V
10:00:23 5.4 27.0
10:00:24 -0.7 1.0
10:00:25 0.0 NaN
10:00:26 0.0 NaN
10:00:27 0.0 NaN
10:00:28 -0.5 4.0
10:00:29 0.2 12.0
To keep Time as a regular column instead of the index, reset the index at the end:
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index('Time').asfreq('S').fillna({'OI':0}).reset_index()
df['Time'] = df['Time'].dt.time
print (df)
Time OI V
0 10:00:23 5.4 27.0
1 10:00:24 -0.7 1.0
2 10:00:25 0.0 NaN
3 10:00:26 0.0 NaN
4 10:00:27 0.0 NaN
5 10:00:28 -0.5 4.0
6 10:00:29 0.2 12.0
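The same gap filling can also be sketched with an explicit pd.date_range and reindex, which makes the one-second step visible and avoids relying on a frequency alias string. The sample frame below is typed in by hand from the question rather than read from the csv, and the explicit time format is an assumption about how the file stores its times.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['10:00:23', '10:00:24', '10:00:28', '10:00:29'],
    'OI': [5.4, -0.7, -0.5, 0.2],
    'V': [27, 1, 4, 12],
})

# Parse the times and use them as the index.
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
df = df.set_index('Time')

# Build every second between the first and last timestamp, then reindex.
full_index = pd.date_range(df.index.min(), df.index.max(),
                           freq=pd.Timedelta(seconds=1))
filled = df.reindex(full_index)

# Only OI is filled with zeros; V keeps NaN for the inserted rows.
filled['OI'] = filled['OI'].fillna(0)
filled.index = filled.index.time
print(filled)
```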

Pandas Reshape DataFrame in a loop

I am new to Pandas in Python. I have a dataframe with 2 keys, 15 rows each, and 1 column, like below
1
key1/1 0.5
key1/2 0.5
key1/3 0
key1/4 0
key1/5 0.6
key1/6 0.7
key1/7 0
key1/8 0
key1/9 0
key1/10 0.5
key1/11 0.5
key1/12 0.5
key1/13 0
key1/14 0.5
key1/15 0.5
key2/1 0.4
key2/2 0.2
key2/3 0
key2/4 0
key2/5 0.1
key2/6 0.2
key2/7 0
key2/8 0
key2/9 0.3
key2/10 0.2
key2/11 0
key2/12 0.5
key2/13 0
key2/14 0
key2/15 0.5
I want to iterate the rows of the dataframe so each time it meets a 'zero' it creates a new column like below
1 2 3 4
key1/1 0.5 0.6 0.5 0.5
key1/2 0.5 0.7 0.5 0.5
key1/3 nan nan 0.5 nan
key1/4 nan nan nan nan
1 2 3 4 5
key2/1 0.4 0.1 0.3 0.5 0.5
key2/2 0.2 0.2 0.2 nan nan
key2/3 nan nan nan nan nan
key2/4 nan nan nan nan nan
I have tried the following code trying to iterate 'key1' only
df2=pd.Dataframe[]
for row in df['key1'].index:
new_df['keyl'][row] == df['keyl'][row]
if df['keyl'][row] == 0:
new_df['key1'].append(df2,ignore_index=True)
Obviously it is not working, please send some help. Ideally I would like to modify the same dataframe instead of creating a new one. Thanks
EDIT
[Two images omitted: a drawing of the current data layout and a drawing of the desired result.]
You can mask the zeros and assign a group key to each run of consecutive non-zero values. Based on that key you can group the values and turn each group into a column.
All credit goes to this answer. You will find a great explanation there.
df2 = df.mask((df['1'] == 0) )
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()
df2.pivot(columns='group')
1
group 1 2 3 4
key1/1 0.5 NaN NaN NaN
key1/10 NaN NaN 0.5 NaN
key1/11 NaN NaN 0.5 NaN
key1/12 NaN NaN 0.5 NaN
key1/14 NaN NaN NaN 0.5
key1/15 NaN NaN NaN 0.5
key1/2 0.5 NaN NaN NaN
key1/5 NaN 0.6 NaN NaN
key1/6 NaN 0.7 NaN NaN
Your group key will look like this:
1 group
key1/1 0.5 1
key1/2 0.5 1
key1/3 NaN 1
key1/4 NaN 1
key1/5 0.6 2
key1/6 0.7 2
key1/7 NaN 2
key1/8 NaN 2
key1/9 NaN 2
key1/10 0.5 3
key1/11 0.5 3
key1/12 0.5 3
key1/13 NaN 3
key1/14 0.5 4
key1/15 0.5 4
You can then translate this data into column format.
Complete solution:
df2 = df.mask((df['1'] == 0) )
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()
x = df2.groupby('group')['1'].apply(list)
df3 = pd.DataFrame(x.values.tolist()).T
df3.index = [f"key1/{i}" for i in range(1,len(df3)+1)]
0 1 2 3
key1/1 0.5 0.6 0.5 0.5
key1/2 0.5 0.7 0.5 0.5
key1/3 NaN NaN 0.5 NaN
If you want something in that format you need to have data like this:
group
1 [0.5, 0.5]
2 [0.6, 0.7]
3 [0.5, 0.5, 0.5]
4 [0.5, 0.5]
Name: 1, dtype: object
Update1:
Assuming data to be:
def func(r):
    df2 = r.mask((r['1'] == 0) )
    df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
    df2 = df2.dropna()
    x = df2.groupby('group')['1'].apply(list)
    df3 = pd.DataFrame(x.values.tolist()).T
    # df3.index = [r.name]*len(df3)
    return (df3)
df.groupby(df.index).apply(func)
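Putting the pieces together for both keys, a minimal sketch could look like the following. The helper name split_runs and the idea of grouping on the part of the index label before the slash are assumptions made for this illustration; the data is copied from the question.

```python
import pandas as pd

def split_runs(values):
    """Return a DataFrame whose columns are the runs of non-zero values."""
    s = pd.Series(values).reset_index(drop=True)
    s = s.mask(s == 0)                       # zeros become NaN
    starts = s.notna() & s.shift(1).isna()   # True at the start of each run
    groups = starts.cumsum()[s.notna()]      # run number for each kept value
    runs = s.dropna().groupby(groups).apply(list)
    out = pd.DataFrame(runs.tolist()).T      # shorter runs are padded with NaN
    out.columns = range(1, len(runs) + 1)
    return out

key1 = [0.5, 0.5, 0, 0, 0.6, 0.7, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0.5]
key2 = [0.4, 0.2, 0, 0, 0.1, 0.2, 0, 0, 0.3, 0.2, 0, 0.5, 0, 0, 0.5]
labels = [f"key1/{i}" for i in range(1, 16)] + [f"key2/{i}" for i in range(1, 16)]
df = pd.DataFrame({'1': key1 + key2}, index=labels)

# Group on the part of the label before the slash ("key1", "key2").
prefix = df.index.str.split('/').str[0]
tables = {key: split_runs(group['1']) for key, group in df.groupby(prefix)}

for key, table in tables.items():
    print(key)
    print(table)
```

Each key gets its own table because the number of runs differs between keys (four for key1, five for key2).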

Organizing results of experiments with pandas

I have the following input data. Each line is the result of one experiment:
instance algo profit time
x A 10 0.5
y A 20 0.1
z A 13 0.7
x B 39 0.9
y B 12 1.2
z B 14 0.6
And I would like to generate the following table:
A B
instance profit time profit time
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
I have tried using pivot and pivot_table with no success. Is there any way to achieve this result with pandas?
First melt to get 'profit' and 'time' in the same column, and then use a pivot table with multiple column levels:
(df.melt(id_vars=['instance', 'algo'])
.pivot_table(index='instance', columns=['algo', 'variable'], values='value'))
#algo A B
#variable profit time profit time
#instance
#x 10.0 0.5 39.0 0.9
#y 20.0 0.1 12.0 1.2
#z 13.0 0.7 14.0 0.6
set_index and unstack:
df.set_index(['instance', 'algo']).unstack()
profit time
algo A B A B
instance
x 10 39 0.5 0.9
y 20 12 0.1 1.2
z 13 14 0.7 0.6
To put algo on the top level and order the columns, swap the column levels and sort:
(df.set_index(['instance', 'algo'])
.unstack()
.swaplevel(1, 0, axis=1)
.sort_index(axis=1))
algo A B
profit time profit time
instance
x 10 0.5 39 0.9
y 20 0.1 12 1.2
z 13 0.7 14 0.6
Another option is using pivot and swaplevel:
(df.pivot(index='instance', columns='algo', values=['profit', 'time'])
.swaplevel(1, 0, axis=1)
.sort_index(axis=1))
algo A B
profit time profit time
instance
x 10.0 0.5 39.0 0.9
y 20.0 0.1 12.0 1.2
z 13.0 0.7 14.0 0.6
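For anyone who wants to try this out, here is a runnable sketch of the pivot variant with the question's data typed in by hand. Recent pandas versions require keyword arguments for pivot, so they are spelled out here.

```python
import pandas as pd

df = pd.DataFrame({
    'instance': ['x', 'y', 'z', 'x', 'y', 'z'],
    'algo': ['A', 'A', 'A', 'B', 'B', 'B'],
    'profit': [10, 20, 13, 39, 12, 14],
    'time': [0.5, 0.1, 0.7, 0.9, 1.2, 0.6],
})

wide = (df.pivot(index='instance', columns='algo', values=['profit', 'time'])
          .swaplevel(0, 1, axis=1)   # put algo above profit/time
          .sort_index(axis=1))       # group the columns of each algo together
print(wide)

# A single algorithm's results can then be selected by the top level.
print(wide['A'])
```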

How to use boolean indexing with Pandas

I have a dataframe:
df =
time time b
0 0.0 1.1 21
1 0.1 2.2 22
2 0.2 3.3 23
3 0.3 4.4 24
4 0.4 5.5 24
I also have a series for my units, defined as
su =
time sal
time zulu
b m/s
Now, I want to set df.index equal to the "time (sal)" values. Those values can be in any column and I will need to check.
I can do this as:
df.index = df.values[:,(df.columns == 'time') & (su.values == 'sal')]
But, my index looks like:
array([[0.0],
[0.1],
[0.2],
[0.3],
[0.4]])
However, this is an array of arrays. In bigger datasets, plot seems to take longer. If I hardcode the value, I get just an array:
df.index = df[0,0]
array([0.0, 0.1, 0.2, 0.3, 0.4])
I can also do the following:
inx = ((df.columns == 'time') & (su.values == 'sal')).tolist().index(True)
This sets "inx" to 0 and then gets a single array
df.index=df.values[:,inx]
However, I shouldn't have to do this. Am I using pandas and boolean indexing incorrectly?
I want:
df =
time time b
0.0 0.0 1.1 21
0.1 0.1 2.2 22
0.2 0.2 3.3 23
0.3 0.3 4.4 24
0.4 0.4 5.5 24
As I understand it, this is what you expected. However, I renamed the two time columns to time1 and time2, because a dictionary cannot hold two entries with the same key.
import pandas as pd
df = {'time1': [0.0,0.1,0.2,0.3,0.4], 'time2': [1.1,2.2,3.3,4.4,5.5],'b':[21,22,23,24,24]}
su = {'time1':'sal', 'time2':'zulu', 'b':'m/s'}
# In Python 3, dict views are not indexable, so convert them to lists first
indexes = df[list(su.keys())[list(su.values()).index('sal')]]
df = pd.DataFrame(df, index=indexes, columns=['time1', 'time2', 'b'])
print(df)
Your original DataFrame has duplicate column names, which complicates things.
Try renaming the columns.
Sample Code
unit = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])
>>> df
time time b
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
new_col = ['{}({})'.format(col, u) for col, u in zip(df.columns, unit)]
>>> new_col
['time(sal)', 'time(zulu)', 'b(m/s)']
>>> df.columns = new_col
>>> df
time(sal) time(zulu) b(m/s)
0 0.0 1.1 21.0
1 0.1 2.2 22.0
2 0.2 3.3 23.0
3 0.3 4.4 24.0
4 0.4 5.5 25.0
>>> df.index = df['time(sal)'].values
>>> df
time(sal) time(zulu) b(m/s)
0.0 0.0 1.1 21.0
0.1 0.1 2.2 22.0
0.2 0.2 3.3 23.0
0.3 0.3 4.4 24.0
0.4 0.4 5.5 25.0
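If renaming is not an option, the boolean mask from the question can still be used, as long as it is turned into a single column position first. Indexing with the mask itself keeps the result two-dimensional, which is where the "array of arrays" comes from. A minimal sketch, assuming exactly one column matches:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0.0, 1.1, 21], [0.1, 2.2, 22], [0.2, 3.3, 23], [0.3, 4.4, 24], [0.4, 5.5, 24]],
    columns=['time', 'time', 'b'],
)
su = pd.Series(['sal', 'zulu', 'm/s'], index=['time', 'time', 'b'])

# One boolean per column: name is 'time' and unit is 'sal'.
mask = (df.columns == 'time') & (su.values == 'sal')

# Convert the mask to the integer position of the first match.
pos = np.flatnonzero(mask)[0]

# Selecting by position returns a 1-D Series even with duplicate names.
df.index = df.iloc[:, pos].to_numpy()
print(df)
```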

How to subtract values in DataFrames omitting some columns

I created two DataFrames and I want to subtract their values, omitting the first two columns of the first DataFrame.
df = pd.DataFrame({'sys':[23,24,27,30],'dis': [0.8, 0.8, 1.0,1.0], 'Country':['US', 'England', 'US', 'Germany'], 'Price':[500, 1000, 1500, 2000]})
print(df)
index = {'sys':[23,24,27,30]}
df2 = pd.DataFrame({ 'sys':[23,24,27,30],'dis': [0.8, 0.8, 1.0,1.0],'Price2':[300, 600, 4000, 1000], 'Price3': [2000, 1000, 600, 2000]})
df = df.set_index(['sys','dis', 'Country']).unstack().fillna(0)
df = df.reset_index()
df.columns.names =[None, None]
df.columns = df.columns.droplevel(0)
infile = pd.read_csv('benchmark_data.csv')
infile_slice = infile[(infile.System_num==26)]['Benchmark']
infile_slice = infile_slice.reset_index()
infile_slice = infile_slice.drop(infile_slice.index[4])
del infile_slice['index']
print(infile_slice)
dfNew = df.sub(infile_slice['Benchmark'].values, axis=0)
This way the values are subtracted from all columns. How can I skip the first two columns of df?
I've tried: dfNew = df.iloc[3:].sub(infile_slice['Benchmark'].values,axis=0), but it does not work.
DataFrames look like:
df:
England Germany US
0 23 0.8 0.0 0.0 500.0
1 24 0.8 1000.0 0.0 0.0
2 27 1.0 0.0 0.0 1500.0
3 30 1.0 0.0 2000.0 0.0
infile_slice:
Benchmark
0 3.3199
1 -4.0135
2 -4.9794
3 -3.1766
Maybe, this is what you are looking for?
>>> df
England Germany US
0 23 0.8 0.0 0.0 500.0
1 24 0.8 1000.0 0.0 0.0
2 27 1.0 0.0 0.0 1500.0
3 30 1.0 0.0 2000.0 0.0
>>> infile_slice
Benchmark
0 3.3199
1 -4.0135
2 -4.9794
3 -3.1766
>>> df.iloc[:, 4:] = df.iloc[:, 4:].sub(infile_slice['Benchmark'].values,axis=0)
>>> df
England Germany US
0 23 0.8 0.0 0.0 496.6801
1 24 0.8 1000.0 0.0 4.0135
2 27 1.0 0.0 0.0 1504.9794
3 30 1.0 0.0 2000.0 3.1766
>>>
You could use iloc as follows:
df_0_2 = df.iloc[:,0:2] # First 2 columns
df_2_end = df.iloc[:,2:].sub(infile_slice['Benchmark'].values, axis=0)
pd.concat([df_0_2, df_2_end], axis=1)
England Germany US
0 23 0.8 -3.3199 -3.3199 496.6801
1 24 0.8 1004.0135 4.0135 4.0135
2 27 1.0 4.9794 4.9794 1504.9794
3 30 1.0 3.1766 2003.1766 3.1766
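A compact variant of the same idea is to select the columns after the first two and assign the result back, which avoids the concat. The sketch below rebuilds the frames by hand; the column names sys and dis for the first two columns are an assumption, since they are unnamed in the printed output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sys': [23, 24, 27, 30],      # assumed name for the first column
    'dis': [0.8, 0.8, 1.0, 1.0],  # assumed name for the second column
    'England': [0.0, 1000.0, 0.0, 0.0],
    'Germany': [0.0, 0.0, 0.0, 2000.0],
    'US': [500.0, 0.0, 1500.0, 0.0],
})
benchmark = np.array([3.3199, -4.0135, -4.9794, -3.1766])

result = df.copy()
value_cols = result.columns[2:]   # everything except the first two columns

# axis=0 subtracts benchmark[i] from every selected value in row i.
result[value_cols] = result[value_cols].sub(benchmark, axis=0)
print(result)
```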
