I have a dataframe with 2 columns and I want to create a 3rd column that returns True or False for each row, according to whether the value in column A is contained in the value in column B.
Here's my code:
C = []
for index, row in df.iterrows():
    if row['A'][index] in row['B'][index]:
        C[index] = True
    else:
        C[index] = False
I get the following errors:
1) TypeError: 'float' object is not subscriptable
2) IndexError: list assignment index out of range
How can I solve these errors?
I think the problem is that some values of row['A'] or row['B'] are floats. A float cannot be subscripted, so the expression effectively becomes some_float[index], which is what raises the TypeError. Are you expecting a string value there? It is possible that not all values in the dataframe have the same data type.
Secondly, index here is the row label, so I don't see why you are using it to subscript the cell values. To say more I would need to look at the data, but even if row['A'] is a string or an array that can be indexed, the index may simply be too large. For example:
row['A'] = "hello"
a = row['A'][10]
will give you the index error.
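Two small changes fix both errors: append to the list instead of assigning by index, and cast the cell values to str so float/NaN cells don't break the membership test. A minimal sketch (assuming a plain substring check per row is what you want):
C = []
for index, row in df.iterrows():
    C.append(str(row['A']) in str(row['B']))   # str() guards against float/NaN cells
df['C'] = C

# or, without an explicit loop:
df['C'] = df.apply(lambda r: str(r['A']) in str(r['B']), axis=1)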
I have created a function for which the input is a pandas dataframe.
It should return the row-indices of the rows with a missing value.
It works for all the defined missingness values except when the cell is entirely empty, even though I tried to specify this in the missing_values list as [...,""].
What could be the issue here? Or is there even a more intuitive way to solve this in general?
def missing_values(x):
    df = x
    missing_values = ["NaN","NAN","NA","Na","n/a", "na", "--","-"," ","","None","0","-inf"]  # common ways to indicate missingness
    observations = df.shape[0]  # gives number of observations (rows)
    variables = df.shape[1]  # gives number of variables (columns)
    row_index_list = []
    # this goes through each observation in the first row
    for n in range(0, variables):  # this iterates over all variables
        column_list = []  # creates a list for each value per variable
        for i in range(0, observations):  # now this iterates over every observation per variable
            column_list.append(df.iloc[i, n])  # and adds the values to the list
        for i in range(0, len(column_list)):  # now for every value
            if column_list[i] in missing_values:  # it is checked whether the value is a missing one
                row_index_list.append(column_list.index(column_list[i]))  # and if yes, the row index is appended
    finished = list(set(row_index_list))  # set makes sure the index only appears once if there are multiple occurrences in one row
    return finished
There might be spurious whitespace, so try adding strip() on this line:
            if column_list[i].strip() in missing_values:  # it is checked whether the value is a missing one
Also a simpler way to get the indexes of rows containing missing_values is with isin() and any(axis=1):
x = x.replace('\s+', '', regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
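For example (a toy frame, with column names and values made up just to show the pattern):
import pandas as pd

missing_values = ["NaN", "NA", "n/a", "--", "-", " ", "", "None", "0", "-inf"]
x = pd.DataFrame({'a': ['1', '  ', '3'], 'b': ['ok', 'ok', 'NA']})

x = x.replace(r'\s+', '', regex=True)    # '  ' collapses to ''
row_index_list = x[x.isin(missing_values).any(axis=1)].index
print(list(row_index_list))              # [1, 2]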
When you import a file into Pandas with, for example, read_csv or read_excel, a value that is literally missing can only be matched with np.nan or another null type from the numpy library.
(Sorry my bad right here, I was really silly when doing np.nan == np.nan)
You can replace the np.nan value first with:
df = df.replace(np.nan, 'NaN')
then your function can catch it.
Another way is to use isna() in pandas,
df.isna()
This returns a DataFrame of the same shape, but with boolean values: True for each cell that is np.nan.
If you do df.isna().any(), you get a Series with True for any column that contains a null value.
If you want to retrieve the row IDs instead, simply add the parameter axis=1 to any():
df.isna().any(axis=1)
This returns a Series flagging all the rows that contain an np.nan value.
Now you have the boolean values that indicate which rows contain nulls. If you convert them to a list and apply it to df.index, you get the index values of the rows containing nulls.
booleanlist = df.isna().any(axis =1).tolist()
null_row_id = df.index[booleanlist]
So I have a
df = read_excel(...)
The following loop does work:
for i, row in df.iterrows():  # loop through rows
    a = df[df.columns].SignalName[i]  # column "SignalName" of row i is read
    b = (row[7])  # column "Bus-Signalname" of row i, taken primitively / hardcoded
Access to a is OK, but how do I replace the hardcoded b = (row[7]) with a dynamically located "Bus-Signalname" element from the Excel table? What are the ways to do this?
b = df[df.columns].Bus-Signalname[i]
does not work.
To access the whole column, run: df['Bus-Signalname'].
So-called attribute notation (df.Bus-Signalname) will not work here, since "-" is not allowed as part of an attribute name. It is treated as the minus operator, so:
- the expression before it is df.Bus, but df probably has no column with this name, so an exception is thrown;
- what comes after it (Signalname) is expected to be e.g. a variable, but you probably have no such variable, which is another reason an exception could be raised.
Note also that you then wrote [i].
As I understand it, i is an integer and you want to access element number i of this column.
The column you retrieved is a Series whose index is the same as that of your whole DataFrame.
If the index is the default one (consecutive numbers starting from 0), this will succeed; otherwise (if the index does not contain the value i), it will fail.
A more pandasonic syntax to access an element in a DataFrame is:
df.loc[i, 'Bus-Signalname']
where i is the index of the row in question and Bus-Signalname is the column name.
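Putting this together, the loop from the question could be written without the hardcoded position (a minimal sketch, assuming the column names from the question):
for i, row in df.iterrows():
    a = row['SignalName']          # the row Series already holds the cell values
    b = row['Bus-Signalname']      # bracket notation works even though "-" is in the name
    # or, equivalently, straight from the DataFrame:
    # b = df.loc[i, 'Bus-Signalname']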
@Valdi_Bo thank you. In the loop, both
df.loc[i, 'Bus-Signalname']
and
df['Bus-Signalname'][i]
work.
I have a DataFrame that I need to modify based on one of column values. In particular, when the value in column a is above 110, I want the column b to be assigned value of -99. The only issue is that first 3 rows of the dataframe contain a mix of string and numerical data types so when I try:
df.loc[df['a'] >= 110, 'b'] = -99
I get a TypeError because comparison between str and int is not allowed.
So my question is: how do I do this assignment while ignoring the first 3 rows of the dataframe?
So far I came up with this rather dodgy way:
try:
    df.loc[df['a'] >= 110, 'b'] = -99
except TypeError:
    pass
This does seem to work, but it obviously doesn't seem like the proper way to do it.
EDIT: Also, this method just skips the first 3 rows, but I really need to keep them as they are.
Try:
df.loc[df['a'].apply(pd.to_numeric, errors='coerce').ge(110), 'b'] = -99
or use errors='ignore' (though note that 'ignore' leaves unparseable values unchanged, so 'coerce' is usually the safer choice for this comparison).
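For example, on a made-up frame where the first rows hold non-numeric cells (the values here are only illustrative):
import pandas as pd

df = pd.DataFrame({'a': ['name', 'unit', 'id', 100, 120, 90],
                   'b': [1, 2, 3, 4, 5, 6]})

# non-numeric cells become NaN, and NaN >= 110 evaluates to False
mask = pd.to_numeric(df['a'], errors='coerce').ge(110)
df.loc[mask, 'b'] = -99   # only the row with a == 120 is changed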
I have found an inconsistency (at least to me) in the following two approaches:
For a dataframe defined as:
df=pd.DataFrame([[1,2,3,4,np.NaN],[8,2,0,4,5]])
I would like to access the element in the 1st row, 4th column (counting from 0). I either do this:
df[4][1]
Out[94]: 5.0
Or this:
df.iloc[1,4]
Out[95]: 5.0
Am I correctly understanding that in the first approach I need to use the column first and then the rows, and vice versa when using iloc? I just want to make sure that I use both approaches correctly going forward.
EDIT: Some of the answers below have pointed out that the first approach is not as reliable, and I see now that this is why:
df.index = ['7','88']
df[4][1]
Out[101]: 5.0
I still get the correct result. But using ints instead will raise an exception if the corresponding number is not there anymore:
df.index = [7,88]
df[4][1]
KeyError: 1
Also, changing the column names:
df.columns = ['4','5','6','1','5']
df['4'][1]
Out[108]: 8
Gives me a different result. So overall, I should stick to iloc or loc to avoid these issues.
You should think of DataFrames as a collection of columns. Therefore when you do df[4] you get the 4th column of df, which is of type Pandas Series. After this, when you do df[4][1] you get the 1st element of this Series, which corresponds to the 1st row and 4th column entry of the DataFrame, which is exactly what df.iloc[1,4] does.
Therefore, no inconsistency at all, but beware: This will work only if you don't have any column names, or if your column names are [0,1,2,3,4]. Else, it will either fail or give you a wrong result. Hence, for positional indexing you must stick with iloc, or loc for name indexing.
Unfortunately, you are not using them correctly. It's just a coincidence that you get the same result.
df.loc[i, j] means the element in df with the row named i and the column named j
Besides many other differences, df[j] means the column named j, and df[j][i] means the element (i.e. the row) named i within the column named j.
df.iloc[i, j] means the element in the i-th row and the j-th column, counting from 0.
So df.loc selects data by label (string, int, or any other type; int in this case), while df.iloc selects data by position. It's just a coincidence that in your example the i-th row is also named i.
For more details you should read the docs.
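For instance, with a non-default index the difference becomes visible (a small illustration built from the frame in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])
df.index = [7, 88]

df.loc[88, 4]   # 5.0 -> by label: row named 88, column named 4
df.iloc[1, 4]   # 5.0 -> by position: 2nd row, 5th column
df.loc[1, 4]    # KeyError: there is no row labelled 1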
Update:
Think of df[4][1] as a convenience. There is some fallback logic behind it, so under most circumstances you'll get what you want.
In fact
df.index = ['7', '88']
df[4][1]
works because the dtype of the index is str, and you give an int 1, so it falls back to positional indexing. If you run:
df.index = [7, 88]
df[4][1]
it will raise an error. And
df.index = [1, 0]
df[4][1]
still won't be the element you expect, because it's not the row at position 1 (counting from 0); it will be the row with the name 1.
I am trying to iterate through a dataframe that has null values for the column = [myCol]. I am able to iterate through the dataframe fine, however when I specify I only want to see null values I get an error.
The end goal is to force a value into the fields that are null, which is why I am first iterating to identify them.
for index,row in df.iterrows():
    if(row['myCol'].isnull()):
        print('true')
AttributeError: 'str' object has no attribute 'isnull'
I tried comparing the column to 'None', since that is the value I see when I print the iteration of the dataframe. Still no luck:
for index,row in df.iterrows():
    if(row['myCol'] == 'None'):
        print('true')
No returned rows
Any help greatly appreciated!
You can use pd.isnull() to check if a value is null or not:
for index, row in df.iterrows():
    if(pd.isnull(row['myCol'])):
        print('true')
But it seems like what you really need is df.fillna(myValue), where myValue is the value you want to force into the fields that are null. Also, to check the null fields in a dataframe you can call df.myCol.isnull() instead of looping through the rows and checking them individually.
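For instance, to fill every null in that column in one step (a sketch; myValue is a placeholder for whatever you want to write into those cells):
myValue = 0   # placeholder value, substitute your own
df['myCol'] = df['myCol'].fillna(myValue)

# or inspect the null rows first, without a loop:
null_rows = df[df['myCol'].isnull()]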
If the column is of string type, you might also want to check whether it is an empty string:
for index, row in df.iterrows():
if(row['myCol'] == ""):
print('true')