Python Pandas indexing - python

Sorry if this is a simple question, I've tried to look for a solution but can't find anything.
My code goes like this:
given zip1, create an index to select observations (other zipcodes) where some calculation has not been done yet (666)
I = (df['zip1'] == zip1) & (df['Distances'] == 666)
perform some calculation
distances = calc(zip1,df['zip2'][I])
So far so good, I've checked the distances variable, correct values, correct sized array.
put the distance variable in the right place
df['Distances'][I] = distances
but this last part updates all the df['Distances'] variables to nonsense values FOR ALL observations with df['zip1']=zip1 instead of the ones selected by I.
I've checked the boolean array I before the df['Distances'][I] = distances command and it looks fine. Any ideas would be greatly appreciated.

What you are attempting is called chained assignment and does not work the way you think as it returns a copy rather than a view hence the error you see.
There is more information about it here and related issues, this and this.
So you should either use .loc or .ix like so:
df.loc[I,'Distances']=distances

Related

Pandas - value changes when adding a new column to a dataframe

I'm trying to add a new column to a dataframe using the following code.
labels = df1['labels']
df2['labels'] = labels
However, in the later part of my program, I found that there might be something wrong with the assignment. So, I checked it using
labels.equals(other=df2['labels'])
and I got a False. (I added this line instantly after assignment)
I also tried to
print out part of labels and df2, and it turns out that there are indeed some lines that are different.
check max and min values of both series, and they are different
check number of unique values in both series using len(set(labels)) and len(set(df2['labels'])), and they differs a lot
test with a smaller amount of data, but this works totally fine.
My dataframe is rather large (40 million+ lines), so I cannot print them all out and check the values. Does anyone have any idea about what might lead to this kind of problem? Or is there any suggestions for further tests?

Weird error when accessing pandas column value by attribute (dot) vs bracket

I am having trouble with a really strange error in Python when it comes to accessing a value in a pandas dataframe.
For a given row and a specific column, the two code lines below return different values, when I expected them to be the same:
>> df[df.obsId == 107099]['length'].values[0]
101.720001220703
>> df[df.obsId == 107099].length.values[0]
101.64261358425581
I really don't understand why the length values returned are different. Aren't bracket access and attribute access supposed to be equivalent ? I thought it could be a float imprecision reason but the difference is actually big.
Also it might be useful to mention that when I display the dataframe, the corresponding value is 101.720001, which seems to indicate that the display accesses to the data with the first method rather than the second one:
Any clue of what could be the reason of such an important difference, how to avoid it and which of the two methods to trust?
Thanks a lot!
Thanks to the hint of the comments, I finally understood the problem.
My data type was actually a geodataframe, and geodataframes appear to have a .length attribute. So the attribute notation referenced to this .length attribute instead of referencing to the identically named column, which happen to have different values!

not able to change object to float in pandas dataframe

just started learning python. trying to change a columns data type from object to float to take out the mean. I have tried to change [] to () and even the "". I dont know whether it makes a difference or not. Please help me figure out what the issue is. thanks!!
My code:
df["normalized-losses"]=df["normalized-losses"].astype(float)
error which i see: attached as imageenter image description here
Use:
df['normalized-losses'] = df['normalized-losses'][~(df['normalized-losses'] == '?' )].astype(float)
Using df.normalized-losses leads to interpreter evaluating df.normalized which doesn't exist. The statement you have written executes (df.normalized) - (losses.astype(float)).There appears to be a question mark in your data which can't be converted to float.The above statement converts to float only those rows which don't contain a question mark and drops the rest.If you don't want to drop the columns you can replace them with 0 using:
df['normalized-losses'] = df['normalized-losses'].replace('?', 0.0)
df['normalized-losses'] = df['normalized-losses'].astype(float)
Welcome to Stack Overflow, and good luck on your Python journey! An important part of coding is learning how to interpret error messages. In this case, the traceback is quite helpful - it is telling you that you cannot call normalized after df, since a dataframe does not have a method of this name.
Of course you weren't trying to call something called normalized, but rather the normalized-losses column. The way to do this is as you already did once - df["normalized-losses"].
As to your main problem - if even one of your values can't be converted to a float, the columnwide operation will fail. This is very common. You need to first eliminate all of the non-numerical items in the column, one way to find them is with df[~df['normalized_losses'].str.isnumeric()].
The "df.normalized-losses" does not signify anything to python in this case. you can replace it with df["normalized-losses"]. Usually, if you try
df["normalized-losses"]=df["normalized-losses"].astype(float)
This should work. What this does is, it takes normalized-losses column from dataframe, converts it to float, and reassigns it to normalized column in the same dataframe. But sometimes it might need some data processing before you try the above statement.
You can't use - in an attribute or variable name. Perhaps you mean normalized_losses?

Pandas -- String of boolean conditions -- SettingWithCopyWarning

I am wondering if anyone can assist me with this warning I get in my code. The code DOES score items correctly, but this warning is bugging me and I can't seem to find a good fix, given that I need to string a few boolean conditions together.
Background: Imagine that I have a magical fruit identifier and I have a csv file that lists what fruit was identified and in which area (1, 2, etc.). I read in the csv file with columns of "FruitID" and "Area." An identification of "APPLE" or "apple" in Zone 1 is scored as correct/true (other identified fruits are incorrect/false). I apply similar logic for other areas, but I won't get into that.
Any ideas for how to correct this? Should I use .loc, although I'm not sure that this will work with multiple booleans. Thanks!
My code snippet that initiates the CopyWarning:
Area1_ID_df['Area 1, Score']=(Area1_ID_df['FruitID']=='APPLE')|(Area1_ID_df['FruitID']=='apple')
Stacktrace:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Pandas finds it ambiguous what you are trying to do. Certain operations return a view of the dataset, whereas other operations make a copy of the dataset. The confusion is whether you want to modify a copy of the dataset or whether you want to modify the original dataset or are trying to create something new.
https://www.dataquest.io/blog/settingwithcopywarning/ is a great link to learn more about the problem you are having.
If the line that's causing this error is truly: s = t | u, where t and u are Boolean series indexed consistently, you should not worry about SettingWithCopyWarning.
This is a warning rather than an error. The latter indicates there is a problem, the former indicates there may be a problem. In this case, Pandas guesses you may be working with a copy rather than a view.
If the result is as you expect, you can safely ignore the warning.

Minimizing an array and value in Python

I have a vector of floats (coming from an operation on an array) and a float value (which is actually an element of the array, but that's unimportant), and I need to find the smallest float out of them all.
I'd love to be able to find the minimum between them in one line in a 'Pythony' way.
MinVec = N[i,:] + N[:,j]
Answer = min(min(MinVec),N[i,j])
Clearly I'm performing two minimisation calls, and I'd love to be able to replace this with one call. Perhaps I could eliminate the vector MinVec as well.
As an aside, this is for a short program in Dynamic Programming.
TIA.
EDIT: My apologies, I didn't specify I was using numpy. The variable N is an array.
You can append the value, then minimize. I'm not sure what the relative time considerations of the two approaches are, though - I wouldn't necessarily assume this is faster:
Answer = min(np.append(MinVec, N[i, j]))
This is the same thing as the answer above but without using numpy.
Answer = min(MinVec.append(N[i, j]))

Categories