python pandas.loc not finding row name: KeyError

This is driving me crazy because it should be so simple and yet it's not working. It's a duplicate question and yet the answers from previous questions don't work.
My csv looks similar to this:
name,val1,val2,val3
ted,1,2,
bob,1,,
joe,,,4
I want to print the contents of row 'joe'. I use the line below and PyCharm gives me a KeyError.
print(df.loc['joe'])

The problem with your logic is that you have not let pandas know which column it should search for 'joe' in.
print(df.loc[df['name'] == 'joe'])
or
print(df[df['name'] == 'joe'])

Using .loc directly works only on the index.
If you just used pd.read_csv without specifying an index, pandas will use the row number as the index by default. You can set name as the index if it is unique. Then .loc will work:
df = df.set_index("name")
print(df.loc['joe'])
Another option, and the way .loc is usually used, is to state explicitly which column you are referring to:
print(df.loc[df["name"]=="joe"])
Note that the condition df["name"]=="joe" returns a series with True/False for each row. df.loc[...] on that series returns only the rows where the value is True, and therefore only the rows where name is "joe". Keep that in mind when you build more complex conditions on your dataframe with .loc in the future.
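Putting both options together, here is a minimal runnable sketch that rebuilds the sample CSV from the question in memory (the io.StringIO wrapper is just a stand-in for the real file):
import io
import pandas as pd

csv_text = "name,val1,val2,val3\nted,1,2,\nbob,1,,\njoe,,,4\n"
df = pd.read_csv(io.StringIO(csv_text))  # stand-in for pd.read_csv('file.csv')

# Option 1: make 'name' the index, then label-based .loc works
df_by_name = df.set_index("name")  # set_index returns a new frame
print(df_by_name.loc['joe'])

# Option 2: boolean mask on the 'name' column
print(df.loc[df["name"] == 'joe'])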

Related

How to modify a Pandas dataframe while iterating over it

So I have a dataframe that I am iterating over, and about halfway through the df I want to modify a column name but continue my iteration. I have code like this:
for index, row in df.iterrows():
    # do something with row
    if certain_condition_is_met:
        df.rename(columns={'old_name':'new_name'}, inplace=True)
After I do the rename, the column name is changed in the 'df' variable for subsequent iterations, but the value of 'row' still contains the old column name. How can I fix this? I know I have encountered similar situations in pandas before. Maybe the iterator doesn't get updated even if the dataframe itself is modified?
Changing the source of something you're iterating over is not a good practice.
You could set a flag if the condition is met, and then after the iteration, make any necessary changes to the dataframe.
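A minimal sketch of that flag approach, with made-up data and a stand-in condition:
import pandas as pd

df = pd.DataFrame({'old_name': [5, 20, 3]})  # made-up data

rename_needed = False
for index, row in df.iterrows():
    # ... process the row ...
    if row['old_name'] > 10:  # stand-in for the real condition
        rename_needed = True

# rename once, after the iteration has finished
if rename_needed:
    df.rename(columns={'old_name': 'new_name'}, inplace=True)
print(df.columns.tolist())  # ['new_name']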
Edited to add: I have a large dataset that needs "line by line" parsing, but that instruction was given to me by a non-programmer. Here's what I did: I added a boolean condition to the dataframe, split the dataframe into two separate dataframes based on that condition, stored one for later integration, and moved on with the other dataframe. At the end I used pd.concat to put everything back together. But if you change a column name along the way, pd.concat will create extra columns in the result.

Weird inconsistency between df.drop() and df.idxmin()

I am encountering a weird issue with pandas. After some careful debugging I have found the problem, but I would like a fix, and an explanation as to why this is happening.
I have a dataframe which consists of a list of cities with some distances. I have to iteratively find a city which is closest to some "Seed" city (details are not too important here).
To locate the "closest" city to my seed city, I use:
id_new_point = df["Time from seed"].idxmin(skipna=True)
Then, I want to remove the city I just found from the dataframe, for which I use:
df.drop(index=df.index[id_new_point], inplace=True)
Now comes the strange part; I've connected the PyCharm debugger and closely observed my dataframe as I stepped through my loops.
When you delete a row in pandas with df.drop, it seems to delete the entire row by row number. This means, for example, that if I delete row #126, the index column of the dataframe doesn't adapt accordingly.
idxmin seems to return the actual index of the row, i.e. the index associated with the row in the pandas dataframe.
Now here comes the issue: say I want to remove row 130 (EMTE WIJCHEN). I would use df.idxmin, which would give me the index, namely 130. Then I call df.drop(index), but since df.drop uses row numbers to delete instead of dataframe indices, and I just deleted a row (#126), it now (wrongly) removes row #131, which slid up into place 130.
Why is this happening, and how can I improve my code so I can drop the right rows in my dataframe? Any suggestions are greatly appreciated (also, if there's a better practice for approaching these kinds of problems, maybe a solution which doesn't modify the dataframe in place?).
Pandas doesn't update the index column because it is not always a range of numbers; it can be, for example, names (John, Maria, etc.), and thus it wouldn't make sense to "renumber" the index after dropping a specific row. You can reset_index after each iteration like this:
df.drop(index=df.index[id_new_point], inplace=True)
df = df.reset_index(drop=True)
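A runnable sketch of that pattern with made-up data (note that idxmin already returns an index label, so it can also be passed straight to drop):
import pandas as pd

df = pd.DataFrame({'City': ['A', 'B', 'C'],
                   'Time from seed': [12.0, 5.0, 9.0]})  # made-up data

while not df.empty:
    id_new_point = df['Time from seed'].idxmin(skipna=True)
    print('closest:', df.at[id_new_point, 'City'])
    # idxmin returns a label, so drop by label, then renumber
    df = df.drop(index=id_new_point).reset_index(drop=True)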

How to search in a pandas dataframe column with the space in the column name

If I need to check whether a value exists in a pandas dataframe column which has a name without any spaces, then I simply do something like this:
if value in df.Timestamp.values
This will work if the column name is Timestamp. However, I have plenty of data with column names like 'Date Time'. How do I use the 'if ... in' statement in that case?
If there is no easy way to check for this using the if in statement, can I search for the existence of the value in some other way? Note that I just need to search for the existence of the value. Also, this is not an index column.
Thank you for any inputs
It's better practice to use the square bracket notation:
df["Date Time"].values
which does exactly the same thing.
There are two ways of indexing columns in pandas. One is the dot notation, which you are using, and the other is square brackets. Both work the same way.
if value in df["Date Time"].values
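A small self-contained example of that membership test (the data is made up):
import pandas as pd

df = pd.DataFrame({'Date Time': ['2021-01-01 10:00', '2021-01-02 11:30']})  # made-up data

value = '2021-01-02 11:30'
if value in df['Date Time'].values:
    print('value exists')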
In the case where you want to work with a column whose header name contains spaces, but you don't want it changed permanently because you may have to forward the file, one way is to just rename it, do whatever you want with the new space-free name, then rename it back. For example, to drop the rows with the value "DUMMY" in the column 'Recipient Fullname':
df.rename(columns={'Recipient Fullname':'Recipient_Fullname'}, inplace=True)
df = df[(df.Recipient_Fullname != "DUMMY")]
df.rename(columns={'Recipient_Fullname':'Recipient Fullname'}, inplace=True)

Python Pandas Dataframe Pulling cell value of Column B based on Column A

Struggling here. Probably missing something incredibly easy, but beating my head on my desk while trying to learn Python and realizing that's probably not going to solve this for me.
I have a dataframe df and need to pull the value of column B based on the value of column A.
Here's what I can tell you about my dataset that should make this easier. Column A (FiscalYear) is unique, but despite being a year it was converted with to_numeric. Column B (Sales) is not necessarily unique and, like Column A, was converted with to_numeric. This is what I have been trying, since I was able to do something similar when finding the sales value using idxmax. However, for a specific value, it returns an error:
v = df.at[df.FiscalYear == 2007.0, 'Sales']
I am getting: ValueError: At based indexing on an integer index can only have integer indexers. I am certain that I am doing something wrong, but I can't quite put my finger on it.
And here's the code that is working for me.
v = df.at[df.FiscalYear.idxmax(), 'Sales']
No issues there, returning the proper value, etc.
Any help is appreciated. I saw a bunch of similar threads, but for some reason searching and blindly writing lines of code is failing me tonight.
You can use the .loc method:
df.Sales.loc[df.FiscalYear==2007.0]
This will be a pandas Series object.
If you want it as a list you can do:
df.Sales.loc[df.FiscalYear==2007.0].tolist()
Can you try this:
v = df.at[df.index[df.FiscalYear.eq(2007.0)][0], 'Sales']
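Both answers can be checked against a small made-up frame:
import pandas as pd

df = pd.DataFrame({'FiscalYear': [2006.0, 2007.0, 2008.0],
                   'Sales': [100.0, 250.0, 175.0]})  # made-up data

print(df.Sales.loc[df.FiscalYear == 2007.0].tolist())  # [250.0]

# .at needs a single label, so take the first matching index label
v = df.at[df.index[df.FiscalYear.eq(2007.0)][0], 'Sales']
print(v)  # 250.0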

What is the difference between these two Python pandas dataframe commands?

Let's say I have an empty pandas dataframe.
import pandas as pd
m = pd.DataFrame(index=range(1,100), columns=range(1,100))
m = m.fillna(0)
What is the difference between the following two commands?
m[2][1]
m[2].ix[1] # This code actually allows you to edit the dataframe
Feel free to provide further reading if it would be helpful for future reference.
The short answer is that you probably shouldn't do either of these (see @EdChum's link for the reason):
m[2][1]
m[2].ix[1]
You should generally use a single ix, iloc, or loc command on the entire dataframe anytime you want access by both row and column -- not a sequential column access followed by row access as you did here. For example,
m.loc[1, 2]
Note that the 1 and 2 are reversed compared to your example because ix/iloc/loc all use the standard syntax of row then column. (It's loc rather than iloc here because your frame's labels start at 1 rather than 0, so positional iloc[1, 2] would point at a different cell.) Your syntax is reversed because you are chaining: you first select a column from the dataframe (which is a Series) and then select a row from that Series.
In simple cases like yours it often won't matter too much, but the usefulness of ix/iloc/loc is that they are designed to let you "edit" the dataframe in complex ways (i.e. set or assign values).
There is a really good explanation of ix/iloc/loc here:
pandas iloc vs ix vs loc explanation?
and also in standard pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
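Note that .ix has since been removed from pandas entirely. A minimal sketch of single-indexer access and assignment with .loc, using the frame from the question:
import pandas as pd

m = pd.DataFrame(index=range(1, 100), columns=range(1, 100))
m = m.fillna(0)

print(m.loc[1, 2])  # same cell as the chained m[2][1]

m.loc[1, 2] = 42  # assignment through a single .loc is reliable
# chained assignment like m[2][1] = 42 may silently operate on a copy
# (pandas warns with SettingWithCopyWarning in such cases)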
