Issues while using .loc with a Pandas dataframe - python

raw=pd.read_csv('raw_6_12_8_30.csv')
raw2=raw.loc[raw['spices'].isnull()==False] # drops the 10 rows where spices is null
b=[]
for i in range(len(raw2)):
    if raw2['Status'][i]==0: # this line raises the error
        print(i)
But when I use this code without line 2, it works fine.
raw=pd.read_csv('raw_6_12_8_30.csv')
b=[]
for i in range(len(raw)):
    if raw['Status'][i]==0:
        print(i)
I checked that there are no errors in raw2['Status'] and raw['Status'].
But whenever I use .loc, there is an error.
I bet that line 2 causes the error, but I don't know why.
The error is KeyError: 11. What does it mean?

There are three ways to get values from a DataFrame by indexing:
loc gets rows (or columns) with particular labels from the index.
iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index. (ix was deprecated in pandas 0.20 and removed in pandas 1.0.)
If you want to take values by position, use iloc, as in the code below:
raw=pd.read_csv('raw_6_12_8_30.csv')
b=[]
for i in range(len(raw)):
    if raw['Status'].iloc[i]==0:
        print(i)
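A minimal sketch of why the KeyError appears, using made-up data in place of raw_6_12_8_30.csv (the values below are assumptions, not the asker's file). Filtering with .loc keeps the original index labels, so the labels of the dropped rows are simply missing, and raw2['Status'][i] fails as soon as i hits one of them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV file in the question.
raw = pd.DataFrame({
    "spices": ["a", np.nan, "c", "d"],
    "Status": [0, 0, 1, 0],
})

# Filtering keeps the original labels: label 1 is gone; 0, 2 and 3 remain.
raw2 = raw.loc[raw["spices"].notna()]
print(list(raw2.index))

# Positional access works regardless of which labels survived.
positions = [i for i in range(len(raw2)) if raw2["Status"].iloc[i] == 0]

# Alternatively, renumber the rows so labels and positions agree again.
raw3 = raw2.reset_index(drop=True)
labels = [i for i in range(len(raw3)) if raw3["Status"][i] == 0]
print(positions, labels)
```

Either variant avoids looking up a label that no longer exists; which one fits depends on whether the original row labels still matter later on.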

Can you try with:
for i in range(0,len(raw)-1):
I guess KeyError 11 occurs because of the index range; the key 11 may be out of range.

Are you trying to drop all rows where spices is null?
raw.dropna(subset=["spices"], inplace=True)
To print where the status is 0:
raw_subset = raw[raw["Status"]==0]
print(raw_subset)
# To get the specific indices
print(raw_subset.index)

Related

PYTHON: Using if statement to find where two variable columns are same and storing index

I'm trying to store the indices where two DataFrames have the same values. For example...
current_gages = 95x4 DataFrame with the following headers: site_no, sitename, Lat, Lon
ucol_gages = 253 x 2 Dataframe with the following headers: site_no, area
I want to find where the indices for these two dataframes have the same site number (site_no).
I am working with the following:
count1=0
gages_idx=[]
for j in range(len(current_gages)):
    if ucol_gages['site_no'][count1] == current_gages['site_no'][j]:
        gages_idx.append([count1, j])
        count1+=1
    elif ucol_gages['site_no'][count1] != current_gages['site_no'][j]:
        count1+=0
The loop stops at [4,4] with the first if statement and then to make sure there aren't any other rows in current gages, I added the elif statement to run through the remainder of the rows.
I know the next indices where ucol_gages and current_gages match is:
ucol_gages['site_no'][5] and current_gages['site_no'][4].
Is there an easier way to do this? I need to make sure I run through all variations of the row indices within the DFs to determine any possible matches.
To get the matching indices of the ucol_gages dataframe you can do the following
ucol_gages[ucol_gages['site_no'].isin(pd.merge(ucol_gages, current_gages, on=['site_no'], how='inner')['site_no'])].index.values.tolist()
To get the matching indices of the current_gages dataframe you can do the following
current_gages[current_gages['site_no'].isin(pd.merge(current_gages, ucol_gages, on=['site_no'], how='inner')['site_no'])].index.values.tolist()
Hope this helps ;)
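As a rough sketch of the same idea without the intermediate merge inside isin, the snippet below uses two tiny made-up frames (the site numbers are assumptions for illustration). It collects the matching index labels of each frame and, separately, the (ucol_gages index, current_gages index) pairs the question's loop was trying to build.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question.
ucol_gages = pd.DataFrame({"site_no": [101, 102, 103, 104],
                           "area": [1.0, 2.0, 3.0, 4.0]})
current_gages = pd.DataFrame({"site_no": [104, 999, 102]})

# Index labels of rows whose site_no also occurs in the other frame.
ucol_idx = ucol_gages.index[
    ucol_gages["site_no"].isin(current_gages["site_no"])
].tolist()
current_idx = current_gages.index[
    current_gages["site_no"].isin(ucol_gages["site_no"])
].tolist()

# Pairs of [ucol index, current index] for every matching site number.
# reset_index() turns each frame's index into a column named "index",
# and the suffixes keep the two apart after the merge.
pairs = (
    ucol_gages.reset_index()
    .merge(current_gages.reset_index(), on="site_no",
           suffixes=("_ucol", "_current"))
    [["index_ucol", "index_current"]]
    .values.tolist()
)
print(ucol_idx, current_idx, pairs)
```

The merge checks every combination of rows, so no manual counter is needed.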

How do I pull the index(es) and column(s) of a specific value from a dataframe?

Hello, everyone! New student of Python's Pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0:'index0',
    1:'index1',
    2:'index2',
    3:'index3',
    4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). For example, if I want the locations of 55, the code should return header0, index0, header2, index2 in some format: a list, a tuple, printed output, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
The other approach that got me close is df.unstack().idxmax(), which returns a tuple of the column and index, but it has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
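For reference, here is a self-contained version of the np.where approach run against the dataframe from the question. The helper name locate is made up for this sketch; it is not a pandas or NumPy function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "header0": [55, 12, 13, 14, 15],
        "header1": [21, 22, 23, 24, 25],
        "header2": [31, 32, 55, 34, 35],
        "header3": [41, 42, 43, 44, 45],
        "header4": [51, 52, 53, 54, 33],
    },
    index=["index0", "index1", "index2", "index3", "index4"],
)


def locate(frame, value):
    """Return (row label, column label) pairs for every cell equal to value."""
    # Boolean mask of matching cells, then the positions of the True entries.
    rows, cols = np.where((frame == value).to_numpy())
    return list(zip(frame.index[rows], frame.columns[cols]))


matches = locate(df, 55)
print(matches)
```

Because the mask covers the whole frame, duplicates are all reported, and a value that does not occur yields an empty list.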
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack to get a Series with a MultiIndex, then filter the duplicates with Series.duplicated and keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicates together with their values:
df1 = (s[s.duplicated(keep=False)]
       .sort_values()
       .rename_axis(['cols', 'idx'])
       .reset_index(name='val'))
If you need a specific value, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
The code below does iterate, but only over the columns rather than over the whole DataFrame. For each column it uses .any() to check whether the desired value occurs at all, then uses .loc to select the matching rows and prints their index labels.
wanted_value = 55
for col in list(df.columns):
    if df[col].eq(wanted_value).any():
        print("row:", *list(df.loc[df[col].eq(wanted_value)].index), ' col', col)

Selecting by index -1 in a df column / time series throws error

Let's assume we have a simple dataframe like this:
df = pd.DataFrame({'col1':[1,2,3], 'col2':[10,20,30]})
Then I can select elements like this
df.col2[0] or df.col2[1]
But if I want to select the last element with df.col2[-1] it results in the error message:
KeyError: -1
I know that there are workarounds for that. I could do, for example, df.col2[len(df)-1] or df.iloc[-1,1]. But why isn't the much simpler version of indexing directly by -1 allowed? Am I maybe missing another simple way to select the last element? Tnx
The index labels of your DataFrame are [0, 1, 2]. Your code df.col2[1] is equivalent to using loc, as in df['col2'].loc[1] (or df.col2.loc[1]). Your index does not contain the label -1, which is why you get the KeyError.
For positional indexing you need iloc (which works on a pandas Series as well as on a DataFrame), so you could do df['col2'].iloc[-1] (or df.col2.iloc[-1]).
As you can see, you can combine label-based ('col2') and position-based (-1) indexing; you don't have to choose one or the other, as in df.iloc[-1,1] or df.col2[len(df)-1] (which is equivalent to df.loc[len(df)-1,'col2']).
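A short sketch of the difference, using the dataframe from the question. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

# Position-based: -1 means "last element", as with a Python list.
last_by_position = df["col2"].iloc[-1]

# Label-based: look up the label of the last row explicitly (here, 2).
last_by_label = df["col2"].loc[df.index[-1]]

# The label -1 does not exist in the index, so .loc raises KeyError.
try:
    df["col2"].loc[-1]
    missing_label = False
except KeyError:
    missing_label = True

print(last_by_position, last_by_label, missing_label)
```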

How to return the index value of an element in a pandas dataframe

I have a dataframe of corporate actions for a specific equity. it looks something like this:
0           Declared Date  Ex-Date     Record Date
BAR_DATE
2018-01-17  2017-02-21     2017-08-09  2017-08-11
2018-01-16  2017-02-21     2017-05-10  2017-06-05
except that it has hundreds of rows, but that is unimportant. I created the index "BAR_DATE" from one of the columns which is where the 0 comes from above BAR_DATE.
What I want to do is to be able to reference a specific element of the dataframe and return the index value, or BAR_DATE, I think it would go something like this:
index_value = cacs.iloc[5, :].index.get_values()
except index_value becomes the column names, not the index. Now, this may stem from a poor understanding of indexing in pandas dataframes, so this may or may not be really easy to solve for someone else.
I have looked at a number of other questions including this one, but it returns column values as well.
Your code is really close, but you took it just one step further than you needed to.
# creates a slice of the dataframe where the row is at iloc 5 (row number 5) and where the slice includes all columns
slice_of_df = cacs.iloc[5, :]
# returns the index of the slice
# this will be an Index() object
index_of_slice = slice_of_df.index
From here we can use the documentation on the Index object: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
# turns the index into a list of values in the index
index_list = index_of_slice.to_list()
# gets the first index value
first_value = index_list[0]
The most important thing to remember about the Index is that it is an object of its own, and thus we need to change it to the type we expect to work with if we want something other than an index. This is where documentation can be a huge help.
EDIT: It turns out that the iloc in this case is returning a Series object which is why the solution is returning the wrong value. Knowing this, the new solution would be:
# creates a Series object from row 5 (technically the 6th row)
row_as_series = cacs.iloc[5, :]
# the name of a Series is its index label in the original DataFrame
index_of_series = row_as_series.name
This would be the approach for single-row indexing. You would use the former approach with multi-row indexing where the return value is a DataFrame and not a Series.
Unfortunately, I don't know how to coerce the Series into a DataFrame for single-row slicing beyond explicit conversion:
row_as_df = pd.DataFrame(cacs.iloc[5, :])
While this will work, and the first approach will happily take this and return the index, there is likely a reason why Pandas doesn't return a DataFrame for single-row slicing so I am hesitant to offer this as a solution.
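As a hedged sketch of some alternatives: the frame below is a made-up two-row stand-in for cacs (so position 1 is used instead of 5). Passing a list of positions to iloc returns a one-row DataFrame rather than a Series, and the index can also be read by position directly.

```python
import pandas as pd

# Hypothetical stand-in for the cacs dataframe in the question.
cacs = pd.DataFrame(
    {
        "Declared Date": ["2017-02-21", "2017-02-21"],
        "Ex-Date": ["2017-08-09", "2017-05-10"],
        "Record Date": ["2017-08-11", "2017-06-05"],
    },
    index=pd.Index(["2018-01-17", "2018-01-16"], name="BAR_DATE"),
)

# 1) Read the index by position, without selecting the row at all.
label_direct = cacs.index[1]

# 2) A single row comes back as a Series whose name is the index label.
label_from_name = cacs.iloc[1, :].name

# 3) A list of positions keeps the result a DataFrame with one row.
row_as_df = cacs.iloc[[1], :]
label_from_frame = row_as_df.index[0]

print(label_direct, label_from_name, label_from_frame)
```

All three give the BAR_DATE value of the chosen row; the first is the shortest when only the label is needed.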

What is Pandas doing here that my indexes [0] and [1] refer to the same value?

I have a dataframe with these indices and values:
df[df.columns[0]]
1 example
2 example1
3 example2
When I access df[df.columns[0]][2], I get "example1". Makes sense. That's how indices work.
When I access df[df.columns[0]][0], however, I get "example", and I get "example" when I access df[df.columns[0]][1] as well. So for
df[df.columns[0]][0]
df[df.columns[0]][1]
I get "example".
Strangely, I can delete "row" 0, and the result is that 1 is deleted:
gf = df.drop(df.index[[0]])
gf
exampleDF
2 example1
3 example2
But when I delete row 1, then
2 example1
is deleted, as opposed to example.
This is a bit confusing to me; are there inconsistent standards in Pandas regarding row indices, or am I missing something / made an error?
You are probably causing pandas to switch between .iloc (position based) and .loc (label based) indexing.
All arrays in Python are 0-indexed, and I notice that the index labels in your DataFrame start from 1. So when you run df[df.columns[0]][0], pandas realizes that there is no index label 0 and falls back to .iloc, which locates things by array position. It therefore returns what it finds at the first position of the array, which is 'example'.
When you run df[df.columns[0]][1], however, pandas realizes that there is an index label 1 and uses .loc, which returns what it finds at that label, which again happens to be 'example'.
When you delete the first row, your DataFrame has neither index label 0 nor index label 1. So when you try to locate elements at those places this way, it does not return None but instead falls back on position-based indexing and returns the elements at positions 0 and 1 of the array.
To force pandas to use one of the two indexing techniques, use .iloc or .loc. .loc is label based and will raise KeyError if you try df[df.columns[0]].loc[0]. .iloc is position based and will return 'example' when you try df[df.columns[0]].iloc[0].
Additional note
These commands are bad practice: df[col_label].iloc[row_index]; df[col_label].loc[row_label].
Please use df.loc[row_label, col_label] or df.iloc[row_index, col_index] (or, in pandas versions before 1.0 only, df.ix[row_label_or_index, col_label_or_index]).
See Different Choices for Indexing for more information.
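A small sketch of the explicit forms, assuming a frame shaped like the one in the question (one column named exampleDF, index labels 1 to 3):

```python
import pandas as pd

df = pd.DataFrame(
    {"exampleDF": ["example", "example1", "example2"]},
    index=[1, 2, 3],
)

# Position 0 is the first row, whatever its label is.
first_by_position = df.iloc[0, 0]

# Label 1 happens to be the first row as well.
first_by_label = df.loc[1, "exampleDF"]

# Label 0 does not exist, so .loc raises instead of silently falling back.
try:
    df.loc[0, "exampleDF"]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(first_by_position, first_by_label, label_zero_exists)
```

Spelling out .loc or .iloc removes the ambiguity described above, because each accessor has only one interpretation of an integer.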
