I have a dataset in an Excel file, which I read into a dataframe df using read_excel.
During this process, I noticed that col1 in df comes back with decimals, when it should only contain whole 4-digit numbers.
So, I have two questions here:
Why is it returning decimals when the source data does not have any?
How can I remove the decimals from the result column?
I have tried astype(int) and astype(float).
I assumed the decimals might be caused by a few empty values, so I used fillna(0):
df_A = pd.read_excel(r"path\filename.xls")  # raw string so the backslash is not treated as an escape
Data = {
    "A": df_A['col1'].fillna(0)
    # also tried "A": df_A['col1'].astype(int)
}
df_B = pd.DataFrame(Data)
Expected column values: 5124, 5487, 9487, 3598
Actual column values: 5124.0, 5487.0, 9487.0, 3598.0
Since df_A is a dataframe, you can fillna and then convert the column to int. (As to the "why": pandas represents missing values with NaN, which is a float, so an integer column containing any blanks gets upcast to float64; that is where the decimals come from.)
df_A['col1'] = df_A['col1'].fillna(0).astype(int)
Since you are getting the error invalid literal for int() with base 10 with the above code, it means there are some non-numeric values in your data that cannot be converted to int. Use pd.to_numeric to coerce those values to NaN, then apply the code above:
df_A['col1'] = pd.to_numeric(df_A['col1'], errors='coerce')
df_A['col1'] = df_A['col1'].fillna(0).astype(int)
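If you would rather keep the blanks as missing instead of filling them with 0, pandas also has a nullable integer dtype. A minimal sketch, assuming the column's only non-integer artifacts are the NaNs:
import pandas as pd

df_A = pd.read_excel(r"path\filename.xls")  # same hypothetical file as above
# 'Int64' (capital I) is pandas' nullable integer dtype: missing values
# stay as <NA> instead of forcing the whole column to float
df_A['col1'] = df_A['col1'].astype('Int64')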
This is the original data, and I need the mean of all the variables for each year.
But when I use the groupby('year') command, it drops all variables except 'lnmcap' and 'epu'.
Why is this happening, and what needs to be done?
Probably the other columns hold object (string) data instead of numeric data, which is why only 'lnmcap' and 'epu' got an average column. Use ds.dtypes or simply ds.info() to check the data types of the columns.
If they come out as object/string type, then use:
ds = ds.drop('company', axis=1)  # drop the non-numeric text column first
column_names = ds.columns
for i in column_names:
    ds[i] = ds[i].astype(str).astype(float)
This could work
You might want to convert all numerical columns to float before taking their mean, for example:
cols = list(ds.columns)
# remove irrelevant columns
cols.pop(cols.index('company'))
cols.pop(cols.index('year'))
# convert remaining relevant columns to float
for col in cols:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')
# after that you can apply the aggregation
ds.groupby('year').mean()
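Note that in recent pandas versions (2.0 and later), GroupBy.mean raises a TypeError if non-numeric columns are still present; if you keep such columns around, you can restrict the aggregation instead:
ds.groupby('year').mean(numeric_only=True)  # average only the numeric columns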
You will need to convert the numeric columns to float types. Use ds.info() to check the various data types.
for col in ds.select_dtypes(['object']).columns:
    try:
        ds[col] = ds[col].astype('float')
    except (ValueError, TypeError):
        # leave columns that genuinely are not numeric untouched
        continue
After this, use ds.info() to check again. Columns holding strings like '1.604809' will have been converted to the float 1.604809.
Sometimes a column may contain "dirty" data that cannot be converted to float. In that case, you can use the code below; errors='coerce' means non-numeric data becomes NaN:
column_names = list(ds.columns)
column_names.remove('company')
column_names.remove('year')
for col in column_names:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')  # converts to numeric; non-numeric values become NaN
I read my csv in Python, reading col1 as str dtype.
Nevertheless, the column ends up holding different kinds of values (float-like and string).
What can I do to drop observations where col1 has values ending with .0, or simply drop the float-like values from this column? I am stuck.
I do not think col1 is an actual value in your dataset; perhaps you did not set it as the title of your column. As for having different types such as integer and float in the column, you might look at this thread to see how to convert all of them to integers: How to remove decimal points in pandas.
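For the specific ask of dropping the rows whose col1 entries end in .0, a minimal sketch (the sample values are made up, assuming col1 was read with dtype str):
import pandas as pd

# hypothetical data: col1 read as str, mixing plain and float-like values
df = pd.DataFrame({'col1': ['5124', '5487.0', '9487', '3598.0']})

# keep only the rows whose col1 value does not end with '.0'
df = df[~df['col1'].str.endswith('.0')]
print(df)  # rows with '5124' and '9487' remain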
I have a pandas DataFrame with data that looks like this:
With the data extending beyond what you can see here. I can't tell whether the blue cells hold numeric or string data, but they should be numeric, since I produced those values by multiplication. But I don't know pandas well enough to be sure.
Anyway, I call .max(axis=1) on this dataframe, and every value comes back NaN.
As far as I know, there are no empty cells or cells with weird data. So why am I getting all NaN?
An all-NaN result usually means the cells are stored as strings rather than numbers. First convert all values to numeric with DataFrame.astype:
df = df.astype(float)
If that does not work, use to_numeric with errors='coerce', which turns non-numeric values into NaN:
df = df.apply(pd.to_numeric, errors='coerce')
And then compute the max:
print(df.max(axis=1))
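A tiny reproduction of the fix, with made-up values where one cell is dirty:
import pandas as pd

# hypothetical frame whose cells were read in as strings
df = pd.DataFrame({'a': ['1.5', '2.5'], 'b': ['3.5', 'x']})

df = df.apply(pd.to_numeric, errors='coerce')  # 'x' becomes NaN
print(df.max(axis=1))  # row 0 -> 3.5, row 1 -> 2.5 (NaN is skipped)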
I have a column that has the values 1, 2, 3, ...
I need to change these values to Cluster_1, Cluster_2, Cluster_3, ... dynamically. My original table looks like below, where cluster_predicted is a column containing integer values, and I need to convert these numbers to cluster_0, cluster_1, ...
I have tried the code below:
clustersDf['clusterDfCategorical'] = "Cluster_" + str(clustersDf['clusterDfCategorical'])
But this is giving me a very weird output, as shown below.
import pandas as pd

df = pd.DataFrame()
df['cols'] = [1, 2, 3, 4, 5]
df['vals'] = ['one', 'two', 'three', 'four', 'five']
df['cols'] = df['cols'].astype(str)
df['cols'] = 'confuse_' + df['cols']
print(df)
Try this; the string conversion is what is causing the issue for you: str(series) returns the printed representation of the entire Series (index and all), and that whole string gets broadcast into every row.
One way to convert each element to a string is to use astype.
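Applied to the column from your post, that would be:
clustersDf['clusterDfCategorical'] = 'Cluster_' + clustersDf['clusterDfCategorical'].astype(str)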
I do df = pandas.io.json.json_normalize(data), and I want to specify the datatype of an integer column that has missing data. Since pandas doesn't have an integer NaN, I want to use dtype object, i.e. string. As far as I can tell, json_normalize doesn't have any dtype parameter. If I try to .astype(object) the column afterwards, I end up with decimal points in it.
How can I get the string format, with no decimal point, into this column?
There should be a more elegant solution, but this one should work for your specific task:
df['col'] = df['col'].astype(object).apply(lambda x: '%.f' % x)  # '%.f' formats a float with no decimal places
Hope it helps.
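One caveat: with missing data, the lambda above renders NaN as the literal string 'nan'. If you would rather leave those cells empty, a guarded variant (my tweak, not part of the original answer):
import pandas as pd

# format whole-number floats as strings, leaving missing values blank
df['col'] = df['col'].apply(lambda x: '' if pd.isna(x) else '%.f' % x)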