I have gone through all posts on the website and am not able to find solution to my problem.
I have a dataframe with 15 columns. Some of them come with None or NaN values. I need help in writing the if-else condition.
If the value in the startTime column is neither None nor NaN, I need to format it as a datetime. My current code is below:
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index, 'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index, 'startTime']):
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The error that I am getting is
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
Also, it seems isna is not invoked with any argument and returns a dataframe of boolean values (see link). You can iterate through both dataframes to determine whether the value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where that value is invalid. You tried to specify the individual value you're checking as a second input argument. isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()
for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))
The lookup into the map has to use the row index together with the column label ('startTime'), not the row object itself. Also, you should be able to apply an any operation to the entire row at once.
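For a single value, the module-level pd.isna function does accept the value as its argument, which is probably what the original call was reaching for. Below is a minimal sketch under that assumption; the helper name parse_start_time and the sample timestamps are made up for illustration and are not from the question.

```python
import re
from datetime import datetime

import pandas as pd


def parse_start_time(value):
    """Return a datetime for a timestamp string, or None if the value is missing."""
    # pd.isna works on scalars and is True for both None and NaN
    if pd.isna(value):
        return None
    # Split on every non-digit; the trailing empty piece (after e.g. 'Z') is dropped
    parts = re.split(r"\D", value)[:-1]
    return datetime(*map(int, parts))


# Hypothetical sample data with one valid and two missing values
df_with_job_name = pd.DataFrame(
    {"startTime": ["2019-03-01T10:15:30Z", None, float("nan")]}
)
df_with_job_name["start_time_formatted"] = df_with_job_name["startTime"].map(
    parse_start_time
)
```

Using Series.map keeps the per-value check in one place and avoids the explicit iterrows loop.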
I have a function that receives a whole entry of a MultiIndex dataframe and returns True or False for the entire entry.
I feed several columns of the entry to it as key-value pairs, e.g.:
temp = cells.loc[0]
x = temp.set_index(['eta','phi'])['e'].to_dict()
filter_frame(x,20000) # drop event if this function returns false
So far I have only found examples where people want to remove single rows, but I am talking about an entire entry with several hundred subentries, as all subentries are used to compute the boolean.
How can I drop entries that don't fulfill this condition?
Edit:
Data sample
The filter_frame() function would just produce a true or false for this entry 0, which contains 780 rows.
The function also works fine; I just don't know how to apply it without doing slow for loops.
What I am looking for is something like this
cells = cells[apply the filter function somehow for all entries]
and have a significantly smaller dataframe
Edit2:
print(mask) of jezraels solution:
First, call the function per first level of the MultiIndex with GroupBy.apply to get one boolean per group. To filter the original DataFrame, remove the second level with MultiIndex.droplevel and map the per-group result onto it with Index.map, which gives a mask usable in boolean indexing:
def f(temp):
    x = temp.set_index(['eta','phi'])['e'].to_dict()
    return filter_frame(x,20000)
mask = cells.index.droplevel(1).map(cells.groupby(level=0).apply(f))
out = cells[mask]
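As a possible alternative, GroupBy.filter keeps or drops whole groups based on a function that returns a single boolean per group. The sketch below assumes that behaviour fits the use case; the filter_frame body is a hypothetical stand-in (the real function is not shown in the question), and the tiny cells frame is invented sample data.

```python
import pandas as pd


def filter_frame(x, threshold):
    # Hypothetical stand-in: keep the entry if the summed energy exceeds the threshold
    return bool(sum(x.values()) > threshold)


# Invented sample: two entries (0 and 1) with two subentries each
index = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=["entry", "subentry"])
cells = pd.DataFrame(
    {
        "eta": [0.1, 0.2, 0.1, 0.2],
        "phi": [1.0, 2.0, 1.0, 2.0],
        "e": [15000, 9000, 100, 200],
    },
    index=index,
)


def f(temp):
    x = temp.set_index(["eta", "phi"])["e"].to_dict()
    return filter_frame(x, 20000)


# Keep only the entries (first index level) for which f returns True
out = cells.groupby(level=0).filter(f)
```

The function is still called once per entry, so the cost of filter_frame itself is unchanged; only the Python-level bookkeeping around it goes away.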
---
Hello, everyone! New student of Python's pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
import pandas as pd

df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0:'index0',
    1:'index1',
    2:'index2',
    3:'index3',
    4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). For example, if I want the locations of 55, the code should return header0, index0, header2, index2 in some format. They could be a list, a tuple, printed output, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack().idxmax(), which returns a tuple of the column and index, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
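The np.where approach above can be wrapped in a small helper. This is only a sketch: the name find_value is made up for illustration, and the sample frame from the question is rebuilt so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "header0": [55, 12, 13, 14, 15],
        "header1": [21, 22, 23, 24, 25],
        "header2": [31, 32, 55, 34, 35],
        "header3": [41, 42, 43, 44, 45],
        "header4": [51, 52, 53, 54, 33],
    },
    index=["index0", "index1", "index2", "index3", "index4"],
)


def find_value(frame, value):
    """Return (index label, column label) pairs for every cell equal to value."""
    # Positions of all matching cells, in row-major order
    rows, cols = np.where(frame.eq(value).to_numpy())
    return list(zip(frame.index[rows], frame.columns[cols]))


matches = find_value(df, 55)       # two hits in the sample frame
no_matches = find_value(df, 999)   # value not present: empty list
```

Returning an empty list when nothing matches means the caller does not need a separate existence check.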
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack for Series with MultiIndex and then filter duplicates by Series.duplicated with keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicated values:
df1 = (s[s.duplicated(keep=False)]
       .sort_values()
       .rename_axis(['cols', 'idx'])
       .reset_index(name='val'))
If you need to test for a specific value, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
The code below does iterate, but only over the columns rather than over the whole DataFrame. For each column it uses .any() to check whether the desired value is present, then uses .loc to locate the matching rows, and finally prints their index labels.
wanted_value = 55
for col in df.columns:
    if df[col].eq(wanted_value).any():
        print("row:", *df.loc[df[col].eq(wanted_value)].index, ' col', col)
I'm very new to Python and am having a problem trying to execute a very basic task in pandas. I am trying to create a new column (variable) called RACE which is based off of the values in RAC1P_RC1. I have tried every way to recode RACE (loc, apply, lambda), but it will not update its values at all, even when the argument is true. For example, I tried to use the code
def f(x):
    if x['RAC1P_RC1'] == 1:
        return 1
    else:
        return 0

acs['RACE'] = acs.apply(f, axis=1)
And when I look at the dataframe, all cases in RACE have a value of 0, even in cases where RAC1P_RC1 equals 1. There seems to be something very basic I'm missing here, since this is one of the simplest tasks in pandas, and I'm not able to do it. Any help would be appreciated.
Check the data type of the 'RAC1P_RC1' column. If it is an object column holding strings such as '1', the condition x['RAC1P_RC1'] == 1 compares a string with an integer and is always False.
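To illustrate the data type point, here is a hedged sketch that assumes the column was read in as strings. The sample values are invented, and pd.to_numeric with errors="coerce" is just one way to convert them; unparseable entries become NaN and are recoded as 0.

```python
import pandas as pd

# Invented sample: the codes were read in as strings rather than integers
acs = pd.DataFrame({"RAC1P_RC1": ["1", "0", "1", "unknown"]})

# Inspect the dtype first; a string/object column will not compare equal to the integer 1
print(acs["RAC1P_RC1"].dtype)

# Convert to numbers, turning anything unparseable into NaN
codes = pd.to_numeric(acs["RAC1P_RC1"], errors="coerce")

# Recode: 1 where the numeric code equals 1, otherwise 0
acs["RACE"] = codes.eq(1).astype(int)
```

If the column turns out to be numeric already, the conversion step is harmless and the recode line works unchanged.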
Also, you could use .loc to make the code faster as follows:
mask = (acs['RAC1P_RC1'] == 1)
acs.loc[mask,'RACE'] = 1
acs.loc[~mask,'RACE'] = 0
You can check your condition directly, which gives you a Series of True/False values, and then cast that Series to int with the astype() method to get the corresponding binary values:
acs['RACE'] = acs['RAC1P_RC1'].eq(1).astype(int)
OR
you can also use the view() method in place of astype() to achieve the same, although Series.view is deprecated in recent pandas releases, so astype() is the safer choice:
acs['RACE'] = acs['RAC1P_RC1'].eq(1).view('i1')
I have a dataframe of corporate actions for a specific equity. it looks something like this:
0           Declared Date  Ex-Date     Record Date
BAR_DATE
2018-01-17  2017-02-21     2017-08-09  2017-08-11
2018-01-16  2017-02-21     2017-05-10  2017-06-05
except that it has hundreds of rows, but that is unimportant. I created the index "BAR_DATE" from one of the columns which is where the 0 comes from above BAR_DATE.
What I want to do is to be able to reference a specific element of the dataframe and return the index value, or BAR_DATE, I think it would go something like this:
index_value = cacs.iloc[5, :].index.get_values()
except index_value becomes the column names, not the index. Now, this may stem from a poor understanding of indexing in pandas dataframes, so this may or may not be really easy to solve for someone else.
I have looked at a number of other questions including this one, but it returns column values as well.
Your code is really close, but you took it just one step further than you needed to.
# creates a slice of the dataframe where the row is at iloc 5 (row number 5) and where the slice includes all columns
slice_of_df = cacs.iloc[5, :]
# returns the index of the slice
# this will be an Index() object
index_of_slice = slice_of_df.index
From here we can use the documentation on the Index object: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
# turns the index into a list of values in the index
index_list = index_of_slice.to_list()
# gets the first index value
first_value = index_list[0]
The most important thing to remember about the Index is that it is an object of its own, and thus we need to change it to the type we expect to work with if we want something other than an index. This is where documentation can be a huge help.
EDIT: It turns out that the iloc in this case is returning a Series object which is why the solution is returning the wrong value. Knowing this, the new solution would be:
# creates a Series object from row 5 (technically the 6th row)
row_as_series = cacs.iloc[5, :]
# the name of a row Series is that row's index label
index_of_series = row_as_series.name
This would be the approach for single-row indexing. You would use the former approach with multi-row indexing where the return value is a DataFrame and not a Series.
Unfortunately, I don't know how to coerce the Series into a DataFrame for single-row slicing beyond explicit conversion:
# .T turns the single column produced by the conversion back into a single row
row_as_df = pd.DataFrame(cacs.iloc[5, :]).T
While this will work, and the first approach will happily take this and return the index, there is likely a reason why Pandas doesn't return a DataFrame for single-row slicing so I am hesitant to offer this as a solution.
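The options discussed above can be compared side by side. The sketch below is not part of the original answer: it uses an invented two-row version of the question's frame, and it adds one more possibility, passing a list to iloc, which keeps the result a DataFrame for a single row.

```python
import pandas as pd

# Invented two-row sample shaped like the frame in the question
cacs = pd.DataFrame(
    {
        "Declared Date": ["2017-02-21", "2017-02-21"],
        "Ex-Date": ["2017-08-09", "2017-05-10"],
        "Record Date": ["2017-08-11", "2017-06-05"],
    },
    index=pd.Index(["2018-01-17", "2018-01-16"], name="BAR_DATE"),
)

position = 1

# Simplest: index the Index object by position
label_direct = cacs.index[position]

# A single-row iloc returns a Series whose name is the row's index label
label_from_series = cacs.iloc[position, :].name

# A list of positions keeps the result a one-row DataFrame
row_as_df = cacs.iloc[[position], :]
label_from_df = row_as_df.index[0]
```

All three expressions give the same BAR_DATE label; cacs.index[position] avoids building an intermediate row at all.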
In my data frame DF I have a column called 'A' and its values are integer values. However when I want to retrieve the value of A for a specific row using
DF[DF.someCondition == condition].A
it returns an object of shape (1,) that is not int because int does not have a shape. I want int because I want to use this value as an index entry to another numpy array. How can I retrieve the value of A so that it's an int value?
In general, a condition of the form
DF.someCondition == condition
may be True more than once. That is why
DF[DF.someCondition == condition].A
returns a Series (here of shape (1,)) rather than a scalar value.
If you are certain that the condition is True only once, then you can extract the scalar value using item:
DF[DF.someCondition == condition].A.item()
However, as MaxU suggested, it is better to use .loc to avoid chained indexing:
DF.loc[DF.someCondition == condition, 'A'].item()
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(6).reshape(3,2), columns=list('AB'))
df[df['B']==3].A
# 1 2
# Name: A, dtype: int64
df.loc[df['B']==3, 'A'].item()
# 2
For future reference:
Using item as suggested by unutbu definitely gets the job done; however, it must be noted that the function has been deprecated as of 2018.
Using the item function as suggested in unutbu's answer will result in a warning that looks as below:
FutureWarning: item has been deprecated and will be removed in a
future version
As is evident from the warning, this functionality will soon be removed. This information can also be found in the source code.
As posted here, the following is the workaround using iter with next if the first matched value is required:
x = df.loc[df['B']==3, 'A']
next(iter(x), 'no match')
The advantage is that if no value is matched the default value can be returned.
In case you are further interested, this issue is discussed here and here.
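The next/iter workaround can be packaged as a small function and then used for the original goal of indexing a NumPy array. This is a sketch only: the helper name first_match and the array lookup are illustrative and not from the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))


def first_match(series, default=None):
    """Return the first value of the Series, or default if it is empty."""
    return next(iter(series), default)


# Exactly one row has B == 3, so this yields that row's value of A
value = first_match(df.loc[df["B"] == 3, "A"])

# No row has B == 99, so the default is returned instead of raising
missing = first_match(df.loc[df["B"] == 99, "A"], "no match")

# The scalar can then be used to index another NumPy array
lookup = np.array([10, 20, 30])
picked = lookup[int(value)]
```

Unlike item(), this does not complain when several rows match; it silently takes the first one, so it only suits cases where that is acceptable.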