Truncate decimal numbers in string - python

A weird thing: I have a dataframe, let's call it ID.
While importing the xlsx source file, I do .astype({"ID_1": str, "ID_2": str}).
Yet, for example, instead of 10300 I get 10300.0.
Moreover, I then get the string "nan" as well.
In order to fix both issues I did this rubbish:
my_df['ID_1'].replace(['None', 'nan'], np.nan, inplace=True)
my_df[my_df['ID_1'].notnull()].ID_1.astype(float).astype(int).astype(str)
As a result I still have these 10300.0 values.
Any thoughts on how to fix this? I could keep the columns as float while importing the data, instead of using .astype, but that does not change anything.

The issue is that int cannot represent a NaN value, so pandas converts the column to float before your str cast is applied, which is why you get "10300.0" and "nan".
This is a common pitfall, because the presence of other rows with missing data can change the result for a given row.
You can, however, pick a specific pandas type that represents integers with missing values; see Convert Pandas column containing NaNs to dtype `int`, especially the link https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
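A minimal sketch of that idea, assuming the column has already been read as strings such as "10300.0" and "nan" (the sample values below are made up for illustration). It goes back through float, then the nullable Int64 dtype, and only then to text, so missing entries stay missing instead of becoming the string "nan":

```python
import pandas as pd

# Hypothetical stand-in for the imported column, already cast to str
ids = pd.Series(["10300.0", "nan", "42.0"], name="ID_1")

# Parse back to numbers; anything unparseable becomes NaN
as_float = pd.to_numeric(ids, errors="coerce")

# Nullable integer dtype: holds whole numbers and missing values together
as_int = as_float.astype("Int64")

# Nullable string dtype: "10300" rather than "10300.0", missing stays <NA>
as_text = as_int.astype("string")
print(as_text)
```

This only works when every non-missing value is a whole number; a value like 10300.5 would make the cast to Int64 fail rather than silently truncate.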

Related

Trying to convert a column with strings to float via Pandas

Hi, I have looked on Stack Overflow but have not found a solution for my problem. Any help is highly appreciated.
After importing a csv I noticed that all the columns have type object and not float.
My goal is to convert all the columns except the YEAR column to float. I have read that you first have to strip the columns to remove blanks, then convert NaNs to 0, and then try to convert the strings to floats. But with the code below I'm getting an error.
My code in my Jupyter notebook is:
And I get the following error.
How do I have to change the code?
All the columns except the YEAR column have to be set to float.
If you can help me set the YEAR column to datetime, that would also be very nice. But my main problem is getting the data right so I can start making calculations.
Thanks
Runy
Easiest would be
df = df.astype(float)
df['YEAR'] = df['YEAR'].astype(int)
Also, your code fails because you have two columns with the same name BBPWN, so when you do df['BBPWN'], you will get a dataframe with those two columns. Then, df['BBPWN'].str will fail.
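To illustrate the duplicate-column point, here is a small sketch with made-up data. The column names mirror the question; the values, and the choice to keep only the first of the duplicated columns, are assumptions for the example rather than something the question states.

```python
import pandas as pd

# Hypothetical frame with the column name BBPWN appearing twice
df = pd.DataFrame([["1.5", "2.5", "2001"]], columns=["BBPWN", "BBPWN", "YEAR"])

# Selecting a duplicated name returns a DataFrame, not a Series,
# which is why df['BBPWN'].str raises an error
selected = df["BBPWN"]
print(type(selected).__name__)

# List the names that occur more than once
duplicated_names = df.columns[df.columns.duplicated()].tolist()
print(duplicated_names)

# One possible fix (an assumption): keep the first occurrence of each name
deduped = df.loc[:, ~df.columns.duplicated()]

# Now the conversion from the answer works column by column
converted = deduped.astype(float)
converted["YEAR"] = converted["YEAR"].astype(int)
print(converted.dtypes)
```

If both BBPWN columns carry real data, renaming one of them is the safer choice than dropping it.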

Reading Date times from Excel to Python using Pandas

I'm trying to read from an Excel file that gets converted to python and then gets split into numbers (Integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (such as 12:30:00) they expect it to be recognized as a time. However, pandas currently treats it as dtype object.
If I specify the column with parse_dates then it works. However, since I don't know what the data is in advance, I ideally want this to be done automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, I would want this to be done without having to specify the column (so that anything that can be converted is converted).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string) you can do the following:
1) filter the column with dtype object
import pandas as pd
datetime_col = df.select_dtypes(object).iloc[:, 0]  # the single object column, as a Series
2) convert it to seconds
datetime_col_in_seconds = pd.to_timedelta(datetime_col).dt.total_seconds()
Then you can re-append the converted column to your original data and/or do whatever processing you want.
Eventually, you can convert it back to datetime.
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object you might have to do some more pre-processing, but I guess this is a good way to start tackling your particular case.
This does what I need
for column_name in df.columns:
    try:
        # assign the whole column so that it takes the timedelta dtype
        df[column_name] = pd.to_timedelta(df[column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column into a timedelta format. If a column cannot be converted, pd.to_timedelta raises a ValueError and the loop moves on to the next column.
After it has run, any columns that could be recognized as timedeltas have been converted.
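A self-contained variant of the loop above, on a made-up frame. Skipping numeric columns is an addition of this sketch, not part of the original answer; it is there so that plain numbers are never considered for conversion.

```python
import pandas as pd

# Hypothetical upload: a text column, a time-like column and a numeric column
df = pd.DataFrame(
    {
        "name": ["a", "b"],
        "start": ["12:30:00", "08:15:30"],
        "n": [1, 2],
    }
)

for column_name in df.columns:
    # Leave numeric columns alone; only text-like columns are candidates
    if pd.api.types.is_numeric_dtype(df[column_name]):
        continue
    try:
        df[column_name] = pd.to_timedelta(df[column_name].astype(str))
    except (ValueError, TypeError):
        # Not parseable as a duration: keep the column unchanged
        pass

print(df.dtypes)
```

Note that the conversion is all-or-nothing per column: a single unparseable cell leaves the whole column as it was.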

Pandas Dtypewarning: How do I find the dtype of different cells in a column?

When I import a csv file in pandas, I get a DtypeWarning:
Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
How do I find out what the dtype of each cell is? I think there might be some issue with the data, which is why the warning appears, but the file has ~5 million rows, so it is hard to identify the culprit.
Is it good practice to specify dtype on import? And if that is done, will it not result in "loss" of data?
I agree with piRSquared. Just adding to his comments, I had a similar problem. My column was supposed to have string values, but one value was a float value (with a NaN value).
There are some things you can do to help you with your analysis. Suppose your dataframe is df. You can check each column's type with:
df.dtypes
For each column of type 'object', you can inspect further by creating a column that holds each cell's Python type:
df['type'] = df['mycolumn'].apply(lambda x: type(x).__name__)
If your column is supposed to be string valued, you can check which cells are not string with:
df[df.type != 'str']
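On a file with millions of rows it can help to count the types first and only then look at the offending rows. A small sketch, using a made-up column that mixes strings and floats (the name mycolumn follows the answer above; the data is hypothetical):

```python
import pandas as pd

# Hypothetical mixed-type column: two strings, a NaN and a float
df = pd.DataFrame({"mycolumn": ["a", "b", float("nan"), 3.0]})

# Python type name of every cell
type_names = df["mycolumn"].apply(lambda x: type(x).__name__)

# How many cells of each type there are
counts = type_names.value_counts()
print(counts)

# Rows whose cell is not a string, i.e. the likely culprits
non_str = df[type_names != "str"]
print(non_str)
```

The index of non_str gives the row positions to inspect in the original file.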

How to replace all non-numeric entries with NaN in a pandas dataframe?

I have various csv files and I import them as a DataFrame. The problem is that many files use different symbols for missing values. Some use nan, others NaN, ND, None, missing etc., or just leave the entry empty. Is there a way to replace all these values with np.nan? In other words, any non-numeric value in the dataframe becomes np.nan. Thank you for the help.
I found what I think is a relatively elegant but also robust method:
def isnumber(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False
df[df.applymap(isnumber)]  # in pandas >= 2.1, use df.map(isnumber) instead
In case it's not clear: You define a function that returns True only if whatever input you have can be converted to a float. You then filter df with that boolean dataframe, which automatically assigns NaN to the cells you didn't filter for.
Another solution I tried was to define isnumber as
import numbers
def isnumber(x):
    return isinstance(x, numbers.Number)
but what I liked less about that approach is that you can accidentally have a number as a string, so you would mistakenly filter those out. This is also a sneaky error, seeing that the dataframe displays the string "99" the same as the number 99.
EDIT:
In your case you probably still need to run df = df.applymap(float) after filtering, because float accepts all the different capitalizations of 'nan', but until you explicitly convert them they will still be stored as strings in the dataframe.
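Put together, the filter-then-convert approach might look like the sketch below. The sample frame is made up, and the `hasattr` check is only there because newer pandas versions provide `DataFrame.map` as the replacement for `applymap`.

```python
import pandas as pd


def isnumber(x):
    """Return True only if x can be converted to a float."""
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical data with several different missing-value markers
df = pd.DataFrame({"a": ["1", "ND", "3.5"], "b": ["missing", "2e3", None]})

# Element-wise application: DataFrame.map where available, else applymap
elementwise = df.map if hasattr(df, "map") else df.applymap
mask = elementwise(isnumber)

# Cells where the mask is False become NaN; the rest are cast to float
cleaned = df[mask].astype(float)
print(cleaned)
```

Numeric strings such as "2e3" survive this filter, which is the advantage over the `isinstance` check.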
Replacing non-numeric entries on read, the easier (and safer) way
TL;DR: Set a datatype for the column(s) that aren't casting properly, and supply a list of na_values
# Create a custom list of values I want to cast to NaN, and explicitly
# define the data types of columns:
na_values = ['None', '(S)', 'S']
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctapi': np.float64}, na_values=na_values)
Longer Explanation
I believe best practice when working with messy data is to:
Provide datatypes to pandas for columns whose datatypes are not inferred properly.
Explicitly define a list of values that should be cast to NaN.
This is quite easy to do.
Pandas read_csv has a list of values that it looks for and automatically casts to NaN when parsing the data (see the documentation of read_csv for the list). You can extend this list using the na_values parameter, and you can tell pandas how to cast particular columns using the dtype parameter.
In the example above, pctapi is the name of a column that was being read as object type instead of float64, because it contained missing-value markers that pandas did not recognize. So, I force pandas to cast to float64 and provide the read_csv function with a list of values to cast to NaN.
Process I follow
Since data science is often all about process, I thought I'd describe the steps I use to create an na_values list and debug this issue with a dataset.
Step 1: Try to import the data and let pandas infer data types. Check if the data types are as expected. If they are, move on.
In the example above, Pandas was right on about half the columns. However, I expected all columns listed below the 'count' field to be of type float64. We'll need to fix this.
Step 2: If data types are not as expected, explicitly set the data types on read using the dtype parameter. By default this will throw an error on values that cannot be cast.
# note: the dtype dictionary specifies types. pandas will attempt to infer
# the type of any column name that's not listed
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64})
Here's the error message I receive when running the code above:
Step 3: Create an explicit list of values pandas cannot convert and cast them to NaN on read.
From the error message, I can see that pandas was unable to cast the value of (S). I add this to my list of na_values:
# note the new na_values argument provided to read_csv
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64}, na_values=['(S)'])
Finally, I repeat steps 2 & 3 until I have a comprehensive list of dtype mappings and na_values.
If you're working on a hobbyist project this method may be more than you need, and you may want to use u/instant's answer instead. However, if you're working in production systems or on a team, it's well worth the 10 minutes it takes to correctly cast your columns.
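The steps above can be tried end to end on a small in-memory file. This is only a sketch: the CSV text is made up and read through io.StringIO instead of the census file, and the column names merely echo the ones used above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the census file, with two odd missing markers
csv_text = "name,count,pctwhite\nSMITH,100,70.9\nLEE,50,(S)\nKIM,20,None\n"

# Step 2: forcing the dtype without na_values fails on the value "(S)"
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"pctwhite": np.float64})
    failed = False
except (ValueError, TypeError) as err:
    failed = True
    print("cast failed:", err)

# Step 3: list the offending markers so they are read as NaN
clean = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"pctwhite": np.float64},
    na_values=["(S)", "None"],
)
print(clean.dtypes)
print(clean)
```

The error raised in step 2 names the value that could not be converted, which is what feeds the na_values list in step 3.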

Pandas read scientific notation and change

I have a dataframe in pandas that I'm reading in from a csv.
One of my columns has values that include NaN, floats, and scientific notation, i.e. 5.3e-23
My trouble is that as I read in the csv, pandas views these data as an object dtype, not the float32 that it should be. I guess because it thinks the scientific notation entries are strings.
I've tried to convert the dtype using df['speed'].astype(float) after it's been read in, and tried to specify the dtype as it's being read in using df = pd.read_csv('path/test.csv', dtype={'speed': np.float64}, na_values=['n/a']). This throws the error ValueError: cannot safely convert passed user dtype of <f4 for object dtyped data in column ...
So far neither of these methods has worked. Am I missing something that is an incredibly easy fix?
This question seems to suggest I can specify known values that might throw an error, but I'd prefer to convert the scientific notation back to a float if possible.
EDITED TO SHOW DATA FROM CSV AS REQUESTED IN COMMENTS
7425616,12375,28,2015-08-09 11:07:56,0,-8.18644,118.21463,2,0,2
7425615,12375,28,2015-08-09 11:04:15,0,-8.18644,118.21463,2,NaN,2
7425617,12375,28,2015-08-09 11:09:38,0,-8.18644,118.2145,2,0.14,2
7425592,12375,28,2015-08-09 10:36:34,0,-8.18663,118.2157,2,0.05,2
65999,1021,29,2015-01-30 21:43:26,0,-8.36728,118.29235,1,0.206836151554794,2
204958,1160,30,2015-02-03 17:53:37,2,-8.36247,118.28664,1,9.49242000872744e-05,7
384739,,32,2015-01-14 16:07:02,1,-8.36778,118.29206,2,Infinity,4
275929,1160,30,2015-02-17 03:13:51,1,-8.36248,118.28656,1,113.318511172611,5
It's hard to say without seeing your data, but it seems the problem is that some of your rows contain something other than numbers and 'n/a' values. You could load your dataframe and then convert it to numeric as shown in the answers to that question. If you have pandas version >= 0.17.0 then you could use the following:
df1 = df.apply(pd.to_numeric, errors='coerce')
Then you could drop rows with NA values with dropna, or fill them with zeros with fillna.
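A short sketch of that conversion on a single column, with values modelled on the speed column in the question ("n/a" is added here as an example of an unparseable entry):

```python
import pandas as pd

# Hypothetical column read as strings, including scientific notation
speed = pd.Series(["0", "NaN", "0.14", "9.49242000872744e-05", "n/a"])

# Scientific notation parses as a normal float; bad entries become NaN
numeric = pd.to_numeric(speed, errors="coerce")
print(numeric)

# Either drop the missing rows...
cleaned = numeric.dropna()

# ...or replace them with zeros
filled = numeric.fillna(0)
```

With errors="coerce" nothing is raised, so it is worth checking how many NaN values appear after the conversion.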
I realised it was the infinity statement causing the issue in my data. Removing this with a find and replace worked.
@Anton Protopopov's answer also works, as did @DSM's comment pointing out that I had not assigned the result back: df['speed'] = df['speed'].astype(float).
Thanks for the help.
In my case, using the pandas round() method on the column worked.
df['column'] = df['column'].round(2)
