Pandas - add column with row value applying conditions relative to current row - python

I’m trying to add a new column to my dataframe that contains the time value of the first instance where the tick is equal to the current tick plus 1.
df2 is something like this:
             Time     Tick  Desired col
Count
0      1594994400  3212.25   1594994405
1      1594994401  3212.00   1594994404
2      1594994402  3212.25   1594994405
3      1594994402  3212.50          NaN
4      1594994403  3212.75          NaN
5      1594994404  3212.75          NaN
6      1594994404  3213.00          NaN
7      1594994405  3213.25          NaN
8      1594994405  3213.25          NaN
9      1594994405  3213.25          NaN
I'm hoping to do something like:
df2['Desired col'] = df2['Tick'].loc[(df2['Tick'(other rows)]==df2['Tick'current row] +1)&(df2['Time'(other rows)]>=df2['Time'](current row)].idxmax()
Hope that makes sense. I'm new to pandas and python, this is my first posted question. Many thanks to the stackoverflow community for all the excellent reference material available!

If you want a one-liner, this should do it:
df['Desired'] = df.apply(lambda x: df[df['Tick'] == x['Tick'] + 1].reset_index().iloc[0]['Time'], axis=1)
The problem is that it will throw a KeyError because your 'Tick' column doesn't always contain tick + 1, so what I suggest is this:
def desired_generator(row, df):
    try:
        return df[df['Tick'] == row['Tick'] + 1].reset_index().iloc[0]['Time']
    except (KeyError, IndexError):
        return None

df['Desired'] = df.apply(lambda x: desired_generator(x, df), axis=1)
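For example, applied to a small frame built from the sample data in the question (a sketch; only the Time and Tick columns are needed, and desired_generator is the function above):
import pandas as pd

df = pd.DataFrame({
    'Time': [1594994400, 1594994401, 1594994402, 1594994402, 1594994403,
             1594994404, 1594994404, 1594994405, 1594994405, 1594994405],
    'Tick': [3212.25, 3212.00, 3212.25, 3212.50, 3212.75,
             3212.75, 3213.00, 3213.25, 3213.25, 3213.25],
})

def desired_generator(row, df):
    # time of the first row whose Tick equals the current Tick + 1, else None
    try:
        return df[df['Tick'] == row['Tick'] + 1].reset_index().iloc[0]['Time']
    except (KeyError, IndexError):
        return None

df['Desired'] = df.apply(lambda x: desired_generator(x, df), axis=1)
print(df)  # rows 0-2 get 1594994405 / 1594994404 / 1594994405, the rest NaN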

Related

How to add logic to list comprehension when renaming columns?

I have dictionary that I am using to rename columns in a dataframe like so:
column_names = {name1:rename1, name2:rename2}
new_df = df[[k for k in column_names.keys()]]
new_df.rename(columns=column_names, inplace=True)
When a new field comes into the dictionary, I am getting this error at this line in the code:
code: new_df = df[[k for k in column_names.keys()]]
issue: KeyError: "['new_col'] not in index"
How do I make the list comprehension flexible when renaming, so that if a new value is present in the dictionary but not in the df, that column is still included and assigned a value of zero?
I tried creating a try/except but I am not sure how to proceed past this point:
try:
    new_df = df[[k for k in column_names.keys()]]
    new_df.rename(columns=column_names, inplace=True)
except:
    # assign the failed column back to the original df (named df) and give it a value
    # of zero
    # rerun all steps in the try block
Try changing this line of code to:
new_df = df[[k for k in column_names.keys() if k in df.columns]]
IIUC, you can divide the column_names into 2 parts: entries that are also in the columns of df and others. Then index & rename with the first part and assign with the second part:
# get the non-existent ones
to_assign = column_names.keys() - df.columns
# drop them from the mapping
for key in to_assign:
    column_names.pop(key)
# subset the `df` with what remained
ndf = df[column_names.keys()]
# rename them
ndf = ndf.rename(columns=column_names)
# assign the other ones (with zeros)
ndf = ndf.assign(**dict.fromkeys(to_assign, 0))
E.g.,
In []: df
Out[]:
   L_1  D_1  L_2   D_2
0  1.0    7  NaN   NaN
1  1.0   12  1-1  play
2  NaN   -1  1-1  play
3  1.0    9  1-1  play
In []: column_names = {"L_1": "M_1", "L_4": "Z_5", "D_1": "E_1"}
In []: ndf = above_operations...
In []: ndf
Out[]:
   M_1  E_1  L_4
0  1.0    7    0
1  1.0   12    0
2  NaN   -1    0
3  1.0    9    0
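For reference, a minimal end-to-end sketch of those steps, assuming the example frame above (the values are typed in from the sample):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'L_1': [1.0, 1.0, np.nan, 1.0],
    'D_1': [7, 12, -1, 9],
    'L_2': [np.nan, '1-1', '1-1', '1-1'],
    'D_2': [np.nan, 'play', 'play', 'play'],
})
column_names = {"L_1": "M_1", "L_4": "Z_5", "D_1": "E_1"}

# split the mapping into keys that exist in df and keys that don't
to_assign = column_names.keys() - df.columns
for key in to_assign:
    column_names.pop(key)

# subset, rename, then add the missing columns filled with zeros
ndf = df[list(column_names.keys())].rename(columns=column_names)
ndf = ndf.assign(**dict.fromkeys(to_assign, 0))
print(ndf)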

Moving pandas series value by switching column name?

I have a DF; however, the last value of some series should be placed in a different one. This happened due to column names not being standardized - i.e., some are "Wx_y_x_PRED" and some are "Wx_x_y_PRED". I'm having difficulty writing a function that will simply find the columns with >= 225 NaN's and change the column the value is assigned to.
I've written a function that for some reason will sometimes work and sometimes won't. When it does, it further creates approx 850 columns in its wake (the OG dataframe is around 420 with the duplicate columns). I'm hoping to have something that just reassigns the value. If it automatically deletes the incorrect column, that's awesome too, but I just used .dropna(thresh = 2) when my function worked originally.
Here's what it looks like originally:
in: df = pd.DataFrame(data={'W10_IND_JAC_PRED': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 2],
                            'W10_JAC_IND_PRED': [1, 2, 1, 2, 1, 'NAN']})
out: df
   W10_IND_JAC_PRED  W10_JAC_IND_PRED
0               NaN                 1
1               NaN                 2
2               NaN                 1
3               NaN                 2
4               NaN                 1
5                 2               NAN
I wrote this, which occasionally works but most of the time doesn't, and I'm not sure why.
def switch_cols(x):
    """Takes mismatched columns (where only the last value != NaN) and changes order of team column names"""
    if x.isna().sum() == 5:
        col_string = x.name.split('_')
        col_to_switch = ('_').join([col_string[0], col_string[2], col_string[1], 'PRED'])
        df[col_to_switch]['row_name'] = x[-1]
    else:
        pass
    return x
Most of the time it just returns the exact same DF to me, but this is the desired outcome:
   W10_IND_JAC_PRED  W10_JAC_IND_PRED
0               NaN                 1
1               NaN                 2
2               NaN                 1
3               NaN                 2
4               NaN                 1
5                 2                 2
Anyone have any tips or could share why my function works maybe 10% of the time?
Edit:
So this is an ugly "for" loop I wrote that works. I know there has to be a much more pythonic way of doing this while preserving the original column names, though.
for i in range(df.shape[1]):
    if df.iloc[:, i].isna().sum() == 5:
        split_nan_col = df.columns[i].split('_')
        correct_col_name = ('_').join([split_nan_col[0], split_nan_col[2], split_nan_col[1], split_nan_col[3]])
        df.loc[5, correct_col_name] = df.loc[5, df.columns[i]]
    else:
        pass
Split the column names on '_' and map them through frozenset, so the order of the name parts no longer matters, then join them back; the mismatched columns collapse into one label. Notice this solution can be applied to more columns:
df.columns = df.columns.str.split('_').map(frozenset).map('_'.join)
df.mask(df == 'NaN').groupby(level=0, axis=1).first()  # groupby first will return the first non-null value
   PRED_JAC_W10_IND
0                 1
1                 2
2                 1
3                 2
4                 1
5                 2
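Put together as a runnable sketch on the sample frame from the question (the 'NaN'/'NAN' strings are the literal placeholders the question uses, and the part order of the merged column label comes from the frozenset, so it may vary between runs):
import pandas as pd

df = pd.DataFrame(data={'W10_IND_JAC_PRED': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 2],
                        'W10_JAC_IND_PRED': [1, 2, 1, 2, 1, 'NAN']})

# collapse the mismatched names into one label by ignoring the order of the parts
df.columns = df.columns.str.split('_').map(frozenset).map('_'.join)

# mask the 'NaN' placeholders, then keep the first non-null value per merged column
out = df.mask(df == 'NaN').groupby(level=0, axis=1).first()
print(out)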

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried a for-loop on each value of the dataframe, which was taking too much time.
Then I used data_new = data.subtract(data), which was meant to subtract all the values of the dataframe from itself so that I could make all the non-null values 0.
But an error occurred because the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull with casting boolean to int by astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
     a    b
0  NaN  1.0
1  4.0  NaN
2  NaN  3.0
print (df.notnull())
       a      b
0  False   True
1   True  False
2  False   True
print ((df.notnull()).astype('int'))
   a  b
0  0  1
1  1  0
2  0  1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else 0)
where col2 is the new column. Should also work if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt

# create dataframe with randomly placed NaN's
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)

trials = np.arange(100)
d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print((dt.datetime.now() - d1).total_seconds() / trials.size)

# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()
d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
print((dt.datetime.now() - d1).total_seconds() / trials.size)
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1  # not nan
df.loc[df.isnull()] = 0   # nan
The code above does not work for me; the code below does.
df[~df.isnull()] = 1  # not nan
df[df.isnull()] = 0   # nan
With pandas 0.25.3.
And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it back to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col]
tmp[tmp.isnull()] = 'xxx'
df[change_col] = tmp
Try this one:
df.notnull().mul(1)
Here is a suggestion for a particular column: if a row in that column is NaN, replace it with 0; if a value is there, replace it with 1.
The line below will fill the NaNs in your column with 0:
df.YourColumnName.fillna(0, inplace=True)
Now the rest (the non-NaN part) will be replaced with 1 by the code below:
df["YourColumnName"] = df["YourColumnName"].apply(lambda x: 1 if x != 0 else 0)
The same can be applied to the whole dataframe by not specifying a column name.
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps - substitute all non-NaN values and then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line will replace all non-NaN values with 1.
dataframe.fillna(0) - this line will replace all NaNs with 0.
Side note: if you take a look at the pandas documentation, .where replaces all values where the condition is False - this is the important thing. That is why we use the inverted mask ~dataframe.notna(), so that .where() replaces the non-NaN values.
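For example, chaining the two steps on the small sample frame from the earlier answer (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1, np.nan, 3]})

# non-NaN -> 1 (values are replaced where the condition is False), then NaN -> 0
result = df.where(~df.notna(), 1).fillna(0)
print(result)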

Using an if statement in a dataframe with lambda functions

I am trying to add a new column to a dataframe based on an if statement depending on the values of two columns. i.e. if column x == None then column y else column x
Below is the script I have written, but it doesn't work. Any ideas?
dfCurrentReportResults['Retention'] = dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x)
Also I got this error message:
AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index BUSINESSUNIT_NAME')
fyi: BUSINESSUNIT_NAME is the first column name
Additional Info:
My data printed out looks like this, and I want to add a third column that takes a value if there is one, else keeps NaN.
   Retention_x  Retention_y
0            1          NaN
1          NaN     0.672183
2          NaN     1.035613
3          NaN     0.771469
4          NaN     0.916667
5          NaN          NaN
6          NaN          NaN
7          NaN          NaN
8          NaN          NaN
9          NaN          NaN
UPDATE:
In the end I was having issues referencing Null / is Null in my dataframe; the final line of code I used, which also includes axis=1, answered my question.
dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1)
Thanks #EdChum, #strim099 and #aus_lacy for all your input. As my data set gets larger I may switch to the np.where option if I notice performance issues.
Your lambda is operating on axis 0, which is column-wise. Simply add axis=1 to the apply arg list. This is clearly documented.
In [1]: import pandas
In [2]: dfCurrentReportResults = pandas.DataFrame([['a','b'],['c','d'],['e','f'],['g','h'],['i','j']], columns=['Retention_y', 'Retention_x'])
In [3]: dfCurrentReportResults['Retention_x'][1] = None
In [4]: dfCurrentReportResults['Retention_x'][3] = None
In [5]: dfCurrentReportResults
Out[5]:
  Retention_y Retention_x
0           a           b
1           c        None
2           e           f
3           g        None
4           i           j
In [6]: dfCurrentReportResults['Retention'] = dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x, axis=1)
In [7]: dfCurrentReportResults
Out[7]:
  Retention_y Retention_x Retention
0           a           b         b
1           c        None         c
2           e           f         f
3           g        None         g
4           i           j         j
Just use np.where:
dfCurrentReportResults['Retention'] = np.where(df.Retention_x == None, df.Retention_y, df.Retention_x)
This evaluates the test condition (the first parameter) and sets the value to df.Retention_y where it is True, else df.Retention_x.
Also avoid using apply where possible, as it just loops over the values; np.where is a vectorised method and will scale much better.
UPDATE
OK, no need to use np.where; just use the following simpler syntax:
dfCurrentReportResults['Retention'] = df.Retention_y.where(df.Retention_x == None, df.Retention_x)
Further update
dfCurrentReportResults['Retention'] = df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)
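A quick sketch of that last form on numeric data shaped like the sample in the question (the values are just the first few rows of the sample):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Retention_x': [1, np.nan, np.nan, np.nan],
    'Retention_y': [np.nan, 0.672183, 1.035613, 0.771469],
})

# keep Retention_y where Retention_x is null, otherwise take Retention_x
df['Retention'] = df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)
print(df)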

Pandas. Selection by label. One-row output

I'm trying to select every entry in a pandas DataFrame D corresponding to some certain userid, filling missing etime values with zeros, as follows:
user_entries = D.loc[userid]
user_entries.index = user_entries.etime
user_entries = user_entries.reindex(range(distinct_time_entries_num))
user_entries = user_entries.fillna(0)
The problem is, for some ids there exists exactly one entry, and thus .loc returns a Series object with an unexpected index:
(Pdb) user_entries.index = user_entries.etime
*** TypeError: Index(...) must be called with a collection of some kind, 388 was passed
(Pdb) user_entries
etime 388
requested 1
rejected 0
Name: 351, dtype: int64
(Pdb) user_entries.index
Index([u'etime', u'requested', u'rejected'], dtype='object')
which is painful to handle. I'd seriously prefer a DataFrame object with one row. Is there any way around it? Thanks.
UPD: I have to apologize for the unintelligible formulation, this is my first post here. I'll try again.
So the deal is: there is a dataframe, indexed by userid. Every userid can have up to some number N of corresponding dataframe rows (columns are: 'etime', 'requested', 'rejected'), for which 'etime' is basically the key. For some 'userid' values all N corresponding entries exist, but for most of them, entries are missing for some 'etime'.
My intentions are: for every 'userid', construct an explicit DataFrame object containing all N entries indexed by 'etime', filled with zeros for the missing entries. That's why I'm changing the index to 'etime' and then reindexing the selected row subset with the full 'etime' range.
The problem is: for some 'userid' there is exactly one corresponding 'etime', for which .loc subsetting returns not a dataframe with one row indexed by 'userid' but a Series object indexed by the array:
Index([u'etime', u'requested', u'rejected'], dtype='object')
And that's why changing index fails. Checking dimensions and index every time I select some dataframe subset looks pretty ugly. What else can I do about it?
UPD2: here is the script demonstrating the case
from pandas import DataFrame

full_etime_range = range(10)
df = DataFrame(index=[0, 0, 1],
               columns=['etime', 'requested'],
               data=[[0, 1], [1, 1], [1, 1]])
for i in df.index:
    tmp = df.loc[i]
    tmp.index = tmp['etime']
    tmp = tmp.reindex(full_etime_range, fill_value=0)
    print tmp
So, starting with df being your dataframe, we can do the following safely:
In[215]: df.set_index([df.index, 'etime'], inplace=True)
In[216]: df
Out[216]:
         requested
  etime
0 0              1
  1              1
1 1              1
DF = pd.DataFrame(index=full_etime_range, columns=[])
df0 = DF.copy()
In[225]: df0.join(df.loc[0])
Out[225]:
requested
0 1
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
In[230]: df1 = DF.copy()
In[231]: df1.join(df.loc[1])
Out[231]:
requested
0 NaN
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
which is technically what you want. But behold, we can do this nicer:
listOfDf = [DF.copy().join(df.loc[i]) for i in df.index.get_level_values(1).unique()]
I wanted to do it even one level nicer, but the following did not work - maybe someone can chip in why.
df.groupby(level=0).apply(lambda x: DF.copy().join(x))
Are you just trying to fill nas? Why are you reindexing the dataframe?
Just
user_entries = D.loc[userid]
user_entries = user_entries.fillna(0)
should do the trick. But if you are willing to fillna just for the etime field, what you should do is:
user_entries = D.loc[userid]
temp = user_entries["etime"].fillna(0)
user_entries["etime"] = temp
Hope it helps. If not, clarify what you're trying to achieve.
