not able to change object to float in pandas dataframe - python

Just started learning Python. I am trying to change a column's data type from object to float so that I can take the mean. I have tried changing [] to () and even the "". I don't know whether it makes a difference or not. Please help me figure out what the issue is. Thanks!!
My code:
df["normalized-losses"]=df["normalized-losses"].astype(float)
Error which I see: attached as image

Use:
df['normalized-losses'] = df['normalized-losses'][~(df['normalized-losses'] == '?' )].astype(float)
Using df.normalized-losses leads the interpreter to evaluate df.normalized, which doesn't exist. The statement you have written executes (df.normalized) - (losses.astype(float)).
There also appears to be a question mark in your data, which can't be converted to float. The statement above converts only the rows that don't contain a question mark. Because the result is assigned back to the column, the remaining rows are not dropped; they end up as NaN, which mean() skips by default.
If you don't want missing values in the column, you can replace the question marks with 0 instead (keep in mind that the zeros will then count towards the mean):
df['normalized-losses'] = df['normalized-losses'].replace('?', 0.0)
df['normalized-losses'] = df['normalized-losses'].astype(float)

Welcome to Stack Overflow, and good luck on your Python journey! An important part of coding is learning how to interpret error messages. In this case, the traceback is quite helpful: it is telling you that you cannot access normalized on df, since a dataframe does not have an attribute of this name.
Of course you weren't trying to call something called normalized, but rather the normalized-losses column. The way to do this is as you already did once - df["normalized-losses"].
As to your main problem: if even one of your values can't be converted to a float, the column-wide operation will fail. This is very common. You need to first eliminate all of the non-numerical items in the column. One way to find them is with df[~df['normalized-losses'].str.isnumeric()], although note that str.isnumeric() also flags values that contain a decimal point or a minus sign.
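As an alternative to str.isnumeric(), here is a minimal sketch using pd.to_numeric with errors="coerce", which turns anything unparseable into NaN. The toy values below are made up for illustration; only the column name comes from the question.

```python
import pandas as pd

# Toy data standing in for the real column; '?' marks a missing value.
df = pd.DataFrame({"normalized-losses": ["164", "?", "158.5", "?"]})

# Parse what can be parsed; everything else becomes NaN.
as_number = pd.to_numeric(df["normalized-losses"], errors="coerce")

# Rows whose value could not be read as a number.
bad_rows = df[as_number.isna()]
print(bad_rows)

# Store the converted column (dtype float64).
df["normalized-losses"] = as_number

# mean() skips NaN by default, so the '?' rows do not distort it.
mean_loss = df["normalized-losses"].mean()
print(mean_loss)
```

Inspecting bad_rows before overwriting the column is a cheap way to see what the non-numeric entries actually are.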

The "df.normalized-losses" does not signify anything to Python in this case. You can replace it with df["normalized-losses"]. Usually, if you try
df["normalized-losses"]=df["normalized-losses"].astype(float)
this should work. It takes the normalized-losses column from the dataframe, converts it to float, and reassigns it to the normalized-losses column in the same dataframe. But sometimes the data needs some processing before you try the above statement.

You can't use - in an attribute or variable name. Perhaps you mean normalized_losses?
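If you would like attribute-style access, one option is to rename the columns so that they are valid Python identifiers. This is only a sketch with made-up data; the rename rule (hyphen to underscore) is an assumption about what you want the new names to be.

```python
import pandas as pd

df = pd.DataFrame({"normalized-losses": ["164", "158"]})

# Bracket access works for any column name, including ones with a hyphen.
col = df["normalized-losses"]

# Attribute access cannot express this name: Python reads
# df.normalized-losses as (df.normalized) - (losses).
try:
    df.normalized
    attribute_found = True
except AttributeError:
    attribute_found = False
print(attribute_found)

# Rename hyphens to underscores to get a valid identifier.
renamed = df.rename(columns=lambda name: name.replace("-", "_"))
values = renamed.normalized_losses.astype(float)
print(values.mean())
```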

Related

How to update dataframe.ix code when referring to dates

I'm trying to run a code which was written a while ago and the problem is they used pandas.dataframe.ix which apparently no longer exists (I have the version 1.2.4)
So I'm trying to make it work using iloc and loc but there are parts where it refers to dates and I don't know what I could do as it's neither a label nor an integer
Here is an example of my problem:
def ExtractPeriode(self, dtStartPeriode, dtEndPeriode):
    start=self.Energie.index.searchsorted(dtStartPeriode)
    end=self.Energie.index.searchsorted(dtEndPeriode)
    self.Energie = self.Energie.ix[start:end]
Does someone know what I could use to replace .ix in this situation? Also you might have noticed I'm not very experienced with coding so please try to keep your answers simple if you can
.loc is used for any label, so if you have dates as your index, you can pass in the date. But .searchsorted returns an integer - the row number where you should insert, so you should use .iloc.
self.Energie = self.Energie.iloc[start:end]
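To make the difference concrete, here is a small sketch comparing the two approaches. The frame energie and its kwh column are hypothetical stand-ins for self.Energie, and a sorted DatetimeIndex is assumed.

```python
import pandas as pd

# Hypothetical stand-in for self.Energie: a frame with a sorted DatetimeIndex.
energie = pd.DataFrame(
    {"kwh": list(range(10))},
    index=pd.date_range("2021-01-01", periods=10, freq="D"),
)

start_date = pd.Timestamp("2021-01-03")
end_date = pd.Timestamp("2021-01-06")

# Positional slicing: searchsorted returns integer positions, so use .iloc.
start = energie.index.searchsorted(start_date)
end = energie.index.searchsorted(end_date)
by_position = energie.iloc[start:end]

# Label slicing: .loc accepts the dates directly and includes the end label.
by_label = energie.loc[start_date:end_date]

print(len(by_position), len(by_label))
```

The two results differ by one row here: the .iloc slice stops before the end position, while the .loc slice includes the end date.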

Weird error when accessing pandas column value by attribute (dot) vs bracket

I am having trouble with a really strange error in Python when it comes to accessing a value in a pandas dataframe.
For a given row and a specific column, the two code lines below return different values, when I expected them to be the same:
>> df[df.obsId == 107099]['length'].values[0]
101.720001220703
>> df[df.obsId == 107099].length.values[0]
101.64261358425581
I really don't understand why the length values returned are different. Aren't bracket access and attribute access supposed to be equivalent ? I thought it could be a float imprecision reason but the difference is actually big.
Also it might be useful to mention that when I display the dataframe, the corresponding value is 101.720001, which seems to indicate that the display accesses the data with the first method rather than the second one.
Any clue of what could be the reason of such an important difference, how to avoid it and which of the two methods to trust?
Thanks a lot!
Thanks to the hint of the comments, I finally understood the problem.
My data type was actually a geodataframe, and geodataframes appear to have a .length attribute. So the attribute notation referred to this .length attribute instead of the identically named column, and the two happen to have different values!
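The same kind of collision can be reproduced with plain pandas, without geopandas. In this sketch the column name size is chosen only because DataFrame already has a built-in attribute of that name; it plays the role that .length plays on a geodataframe.

```python
import pandas as pd

# 'size' is both a column name here and a built-in DataFrame attribute.
df = pd.DataFrame({"obsId": [1, 2], "size": [10.5, 20.5]})

from_brackets = df["size"]   # the column, as a Series
from_attribute = df.size     # the DataFrame attribute: number of cells

print(type(from_brackets).__name__)
print(from_attribute)
```

Bracket access always returns the column, so it is the safer choice whenever a column name might clash with an existing attribute or method.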

Pandas -- String of boolean conditions -- SettingWithCopyWarning

I am wondering if anyone can assist me with this warning I get in my code. The code DOES score items correctly, but this warning is bugging me and I can't seem to find a good fix, given that I need to string a few boolean conditions together.
Background: Imagine that I have a magical fruit identifier and I have a csv file that lists what fruit was identified and in which area (1, 2, etc.). I read in the csv file with columns of "FruitID" and "Area." An identification of "APPLE" or "apple" in Zone 1 is scored as correct/true (other identified fruits are incorrect/false). I apply similar logic for other areas, but I won't get into that.
Any ideas for how to correct this? Should I use .loc, although I'm not sure that this will work with multiple booleans. Thanks!
My code snippet that initiates the CopyWarning:
Area1_ID_df['Area 1, Score']=(Area1_ID_df['FruitID']=='APPLE')|(Area1_ID_df['FruitID']=='apple')
Stacktrace:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Pandas finds it ambiguous what you are trying to do. Certain operations return a view of the dataset, whereas other operations make a copy of the dataset. The confusion is whether you want to modify a copy of the dataset or whether you want to modify the original dataset or are trying to create something new.
https://www.dataquest.io/blog/settingwithcopywarning/ is a great link to learn more about the problem you are having.
If the line that's causing this error is truly: s = t | u, where t and u are Boolean series indexed consistently, you should not worry about SettingWithCopyWarning.
This is a warning rather than an error. The latter indicates there is a problem, the former indicates there may be a problem. In this case, Pandas guesses you may be working with a copy rather than a view.
If the result is as you expect, you can safely ignore the warning.
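If you would rather make the warning go away than ignore it, a common approach is to take an explicit copy when creating the subset. The sketch below assumes that Area1_ID_df was produced by filtering a larger dataframe (the question does not show that step); fruit_df and its rows are made up.

```python
import pandas as pd

# Hypothetical full dataset read from the csv file.
fruit_df = pd.DataFrame(
    {"FruitID": ["APPLE", "apple", "pear", "apple"], "Area": [1, 1, 1, 2]}
)

# Take an explicit copy so the subset is an independent DataFrame.
area1_df = fruit_df[fruit_df["Area"] == 1].copy()

# Several boolean conditions can be chained with |, or expressed with isin.
area1_df["Area 1, Score"] = area1_df["FruitID"].isin(["APPLE", "apple"])

print(area1_df)
```

With the copy in place, the new column is added to area1_df only, and fruit_df is left untouched.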

Dataframe not showing in Pycharm

I am using PyCharm 2016.2.1 . When I try to view a Pandas dataframe through the newly added feature 'View as DataFrame' in the debugger, this works as expected for a small (e.g. 4x4) DataFrame.
However when I try to view a DataFrame (generated by custom script) of ~10,000 rows x ~50 columns, I get the message: "Nothing to show".
When I run the same script (that generates the DataFrame) in Spyder, I am able to view it, so I am pretty sure it's not an error in my script.
Does anyone know if there is a maximum size to the DataFrames that can be viewed in PyCharm, and if there is a way to change this?
EDIT:
It seems that the maximum size allowed is 1000 x 15 , as in some cases it gets truncated to this size (when the number of rows is too large, but when there are too many columns pycharm just says 'nothing to show').
Still, I would like to know if there is a way to increase the maximum allowed rows and columns viewable through the DataFrame viewer.
I have faced the same problem with PyCharm 2018.2.2. The reason was a special character in a column's name, as mentioned by Yunzhao.
If you have a column name like 'R&D', changing it to 'RnD' will fix the problem. It's really strange that JetBrains hasn't solved this problem for over 2 years.
As you said in your edit, there's a limit on the number of columns (on my PC though it's far less than 15). However, you can see the whole thing by typing:
df.values
It will show you the whole dataframe, but without the names of the columns.
Edit:
To show the column names as well:
np.vstack([df.columns, df.values])
I ran into the same problem.
In my case, I figured out it was because of special characters in the column names.
I had "%" in a column name, and the data was not shown in the View as DataFrame function. After I removed it, everything was shown correctly.
Please double check if you also have some special characters in the column names.
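If many columns are affected, the names can be cleaned in one go. This is a sketch only: the rule used here (replace everything except letters, digits and underscores with an underscore) is an assumption, and the column names are made up.

```python
import pandas as pd

df = pd.DataFrame({"R&D": [1, 2], "growth %": [0.1, 0.2]})

# Replace every character that is not a letter, digit or underscore.
df.columns = df.columns.str.replace(r"[^0-9A-Za-z_]", "_", regex=True)

print(list(df.columns))
```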
This might be helpful for some people experiencing similar problem:
As of August 2019, SciView in PyCharm struggles with displaying DataFrames that contain a nullable integer type; see the issue on JetBrains.
I use PyCharm 2019.1.1 (Community Edition). When I right-clicked and chose "View as DataFrame", I got the message: "Nothing to show".
But when I clicked the "...View as DataFrame" button at the end of the object's row, it worked.
I found out that my problem was that my DataFrame object is an attribute (param) of another object. Right-clicking "View as DataFrame" doesn't pass the class name along; the user needs to enter the class's name and the param's name.
Hope this can help somebody.
In the case that you don't strictly need to use the functionalities given by the DataFrame viewer, you can print the whole DataFrame in the output window, using:
import pandas as pd

def print_full(x):
    # Temporarily lift the row limit so every row is printed.
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
I use PyCharm 2019.1.1 (Community Edition) and I run Python 3.7.
When I first click on "View as DataFrame" there seems to be the same issue, but if I wait a few second the content pops up. For me it is a matter of loading.
For the sake of completeness: I faced the same problem because some elements in the index of the dataframe contained a question mark '?'. Avoid that too if you still want to use the data viewer. The data viewer still worked when the index strings contained hashes or less-than/greater-than signs, though.
In my situation, the problem was caused by two columns with the same name in my dataframe.
Check for this with the following expression (False means there are duplicate names):
len(list(df)) == len(set(df))
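To see which names are repeated, and not just whether any are, something like the following sketch may help. The three-column frame is made up; Index.duplicated() is used here in place of the length comparison.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# Boolean mask marking the second and later occurrences of each name.
dupe_mask = df.columns.duplicated()
print(list(df.columns[dupe_mask]))

# Keep only the first column for each name.
deduped = df.loc[:, ~dupe_mask]
print(list(deduped.columns))
```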
As of 2020-10-02, using PyCharm 2020.1.4, I found that this issue also occurs if the DataFrame contains a column containing a tuple.
The same thing is happening to me in version 2021.3.3
In my case, it seems to have something to do with the column dtype being Int64, but then completely full of <NA> values. As soon as I change even a single row's value in the offending column to an actual integer, it renders again.
So, if feasible, you can fix it by dropping the column, or setting all (or at least one) of the values to some meaningful replacement value, like -1 or -99999 or something:
df["col"] = df["col"].fillna(value=-1)

pandas crashes on series with multiple data types

I have a simple Excel file with two columns, one categorical and one numerical, that I read into pandas with the read_excel function as below:
df= pd.read_excel('pandas_crasher.xlsx')
The first column is of type object and holds multiple types. Since the Excel file was badly formatted, the column contains a combination of timestamps, floats and text, but it is normally supposed to be just a simple text column.
from datetime import datetime
from collections import Counter
df['random_names'].dtype
dtype('O')
print Counter([type(i) for i in df['random_names']])
Counter({<type 'unicode'>: 15427, <type 'datetime.datetime'>: 18, <type 'float'>: 2})
When I do a simple groupby on it, it crashes the Python kernel without any error message or notification. I tried doing it from both Jupyter and a small custom Flask app without any luck.
df.groupby('random_names')['random_values'].sum() << crashes
It's a relatively small file of 700 KB (15k rows and 2 columns), so it's definitely not a memory issue.
I tried debugging with pdb to trace the point at which it crashes, but couldn't get past the Cython function in the pandas/core/groupby.py module:
def _cython_operation(self, kind, values, how, axis)
Is this a possible bug in pandas? Instead of crashing directly, shouldn't it throw an exception and quit gracefully?
I then converted the various datatypes into text with the following function:
def custom_converter(x):
    if isinstance(x,datetime) or isinstance( x, ( int, long, float ) ):
        return str(x)
    else:
        return x.encode('utf-8')
df['new_random_names'] = df['random_names'].apply(custom_converter)
df.groupby('new_random_names')['random_values'].sum() << does not crash
Applying a custom function is probably the slowest way of doing this. Is there a better/faster way?
Excel file here: https://drive.google.com/file/d/0B1ZLijGO6gbLelBXMjJWRFV3a2c/view?usp=sharing
For me, the crash seems to happen when pandas tries to sort the group keys. If I pass the sort=False argument to .groupby() then the operation succeeds. This may work for you as well. The sort appears to be a numpy operation that doesn't actually involve pandas objects, so it may ultimately be a numpy issue. (For instance, df.random_names.values.argsort() also crashes for me.)
After some more playing around, I'm guessing the problem has to do with some sort of obscure condition that arises due to the particular comparisons that are made during numpy's sort operation. For me, this crashes:
df.random_names.values[14005:15447]
but leaving one item off either end of the slice doesn't crash anymore. Making a copy of this data and then tweaking it by taking out individual elements, the crash will occur or not depending on whether certain seemingly random elements are removed from the data. Also, under certain circumstances it will fail with an exception of "TypeError: can't compare datetime.datetime to unicode" (or "datetime to float").
This section of the data contains one datetime and one float value, which happens to be a nan. It looks like there is some weird edge case in the numpy code that causes failed comparisons to crash under certain circumstances rather than raise the right exception.
To answer the question at the end of your post, you may have an easier time using the various arguments to read_excel (such as the converters argument) to read all the data in as textual values right from the start.
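If the file has already been loaded, a vectorised conversion may also be worth trying in place of the apply-based converter. The sketch below assumes Python 3 and a recent pandas (the question uses Python 2), and the four-row frame is made up. How missing values come out of astype(str) has varied between pandas versions, so check any NaN entries separately.

```python
import datetime as dt

import pandas as pd

# Made-up frame with a mixed-type object column, as in the question.
df = pd.DataFrame(
    {
        "random_names": ["alice", dt.datetime(2015, 1, 1), 3.5, "alice"],
        "random_values": [1, 2, 3, 4],
    }
)

# Convert every entry to its string form in one call.
df["new_random_names"] = df["random_names"].astype(str)

# Group on the cleaned column.
totals = df.groupby("new_random_names")["random_values"].sum()
print(totals)
```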
