How to convert varchar to int/float in pandas - python

My data comes from a MySQL table.
id, revenue, cost and state are all varchar columns.
I need to run get_dummies (one-hot encoding) on my only categorical variable, state.
If I read the same data directly from a CSV (pd.read_csv), I get int/float dtypes for id, revenue and cost, and object for state.
My question is how to convert the object columns to int64/float when they are numeric, while keeping object for the categorical variable.
Some strange characters such as ? or - might appear in revenue, but I still want this column to be numeric.
What I have done:
For now I changed the varchar columns to int directly in the database, and that fixed the issue.
But I need to do this in pandas.
I tried df.apply(pd.to_numeric, errors='coerce').fillna(df), but the dtype of my int/float columns (id, revenue, cost) still does not change.

I think it is first necessary to check the dtypes after pd.read_csv:
print(df.dtypes)
Then convert the columns to numeric. You cannot fill the missing values back with the originals (as .fillna(df) does), because that gives mixed values, numbers together with strings, and the dtype stays object:
cols = ['id','revenue','cost']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
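To make the effect concrete, here is a minimal sketch on made-up data (the sample values are assumptions, not the asker's table). It coerces the three numeric columns, leaves state as object, and then one-hot encodes state:

```python
import pandas as pd

# Hypothetical sample mimicking varchar columns read from the database
df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "revenue": ["100.5", "?", "300"],
    "cost": ["10", "-", "30"],
    "state": ["CA", "NY", "CA"],
})

num_cols = ["id", "revenue", "cost"]
# Unparseable entries such as "?" or "-" become NaN instead of raising
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)

# state is still object, so it can be one-hot encoded
encoded = pd.get_dummies(df, columns=["state"])
print(encoded.columns.tolist())
```

Columns that contain a coerced NaN come out as float64, since NaN is a float; a column with no bad values can stay int64.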

Related

Do not convert numerical column names to float in pandas read_excel

I have an Excel file where column name might be a number, i.e. 2839238. I am reading it using pd.read_excel(bytes(filedata), engine='openpyxl') and, for some reason, this column name gets converted to a float 2839238.0. How to disable this conversion?
This is an issue for me because I then operate on column names using string-only methods like df = df.loc[:, ~df.columns.str.contains('^Unnamed')], and it gives me the following error:
TypeError: bad operand type for unary ~: 'float'
Column names are arbitrary.
Try changing the type of the column:
df['col'] = df['col'].astype(int)
The number in your example suggests you may have large values. Python's own int has no size limit, but a NumPy int64 column only holds values up to 2**63 - 1 (int32 up to 2**31 - 1), while float64 covers a much wider range at reduced precision. Check the ranges of these dtypes against your data and see which one you can use.
Verify that you don't have any duplicate column names. Pandas appends a suffix such as .1 if there is another instance of 2839238 as a header name.
See the description of mangle_dupe_cols (bool),
which says:
Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
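If the goal is only to make the string methods work on the column names, one option is to turn every label into a string after reading. This is a sketch under assumptions: the frame below stands in for the result of pd.read_excel, and label_to_str is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Stand-in for a frame whose header contains a float label
df = pd.DataFrame([[1, 2, 3]], columns=["name", 2839238.0, "Unnamed: 2"])

def label_to_str(label):
    """Hypothetical helper: render whole-number floats without the '.0'."""
    if isinstance(label, float) and label.is_integer():
        return str(int(label))
    return str(label)

df.columns = [label_to_str(c) for c in df.columns]

# All labels are strings now, so the boolean mask can be negated
df = df.loc[:, ~df.columns.str.contains("^Unnamed")]
print(list(df.columns))
```

With mixed label types, str.contains returns missing values for the non-string labels, which is why the unary ~ fails in the original code.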

Groupby year dropping some variables

This is the original data, and I need the yearly mean of all the variables.
But when I use groupby('year'), it drops all variables except 'lnmcap' and 'epu'.
Why is this happening, and what needs to be done?
Probably the other columns hold object (string) data instead of numbers, which is why only 'lnmcap' and 'epu' get an average. Use ds.dtypes or simply ds.info() to check the data types of the columns.
If they turn out to be object/string type, then use
ds = ds.drop('company', axis=1)
column_names = ds.columns
for i in column_names:
    ds[i] = ds[i].astype(str).astype(float)
This could work.
You might want to convert all numerical columns to float before taking their mean, for example:
cols = list(ds.columns)
# remove irrelevant columns
cols.pop(cols.index('company'))
cols.pop(cols.index('year'))
# convert remaining relevant columns to float
for col in cols:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')
# after that you can apply the aggregation to the converted columns
ds.groupby('year')[cols].mean()
You will need to convert the numeric columns to float types. Use ds.info() to check the various data types.
for col in ds.select_dtypes(['object']).columns:
    try:
        ds[col] = ds[col].astype('float')
    except (ValueError, TypeError):
        # leave columns that cannot be parsed as float unchanged
        continue
After this, use ds.info() to check again. Columns holding strings like '1.604809' will be converted to the float 1.604809.
Sometimes a column may contain some "dirty" data that cannot be converted to float. In that case you could use the code below, where errors='coerce' means that non-numeric data becomes NaN:
column_names = list(ds.columns)
column_names.remove('company')
column_names.remove('year')
for col in column_names:
    # convert to numeric; non-numeric values become NaN
    ds[col] = pd.to_numeric(ds[col], errors='coerce')
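Putting the pieces together, here is a small self-contained sketch of the convert-then-aggregate approach. The data is invented, and the column roa is a hypothetical stand-in for one of the variables that was stored as strings:

```python
import pandas as pd

# Invented data: "roa" is numeric in meaning but stored as strings
ds = pd.DataFrame({
    "company": ["A", "B", "A", "B"],
    "year": [2019, 2019, 2020, 2020],
    "lnmcap": [1.0, 3.0, 5.0, 7.0],
    "roa": ["0.5", "1.5", "n/a", "2.0"],
})

value_cols = [c for c in ds.columns if c not in ("company", "year")]
# Strings that cannot be parsed ("n/a") become NaN
ds[value_cols] = ds[value_cols].apply(pd.to_numeric, errors="coerce")

# Aggregate only the converted columns; NaN is skipped by mean()
yearly = ds.groupby("year")[value_cols].mean()
print(yearly)
```

Selecting the value columns explicitly keeps the string column company out of the aggregation, so the result does not depend on how a given pandas version treats non-numeric columns in mean().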

How to drop values ending with .0 from column with two different dtypes in Python Pandas?

I read my CSV in Python, reading "col1" as str dtype.
Nevertheless, this column holds a mix of float-like values and strings.
How can I drop the observations where col1 has a value ending with .0, or simply drop the float values from this column?
I do not think that col1 is an actual value in your dataset; perhaps you did not set it as the title of your column. As for having different types such as integer and float in your dataset, you might look at the thread "How to remove decimal points in pandas" to see how you can convert all of them to integers.
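For the filtering that was actually asked about, a minimal sketch could look like this. It assumes col1 was read as strings, as described in the question, and the sample values are made up:

```python
import pandas as pd

# Made-up sample: float-like strings mixed with ordinary strings
df = pd.DataFrame({"col1": ["abc", "12.0", "x1", "7.0"]})

# True for rows whose text ends with ".0"
ends_with_zero = df["col1"].astype(str).str.endswith(".0")

# Keep only the rows that do not match
cleaned = df[~ends_with_zero]
print(cleaned)
```

The astype(str) call is there only as a safeguard in case some values are real floats rather than strings.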

Pandas adding NA changes column from float to object

I have a dataframe with a column that is of type float64. Then when I run the following it converts to object, which ends up messing with a downstream calculation where this is the denominator and there are some 0s:
df.loc[some criteria,'my_column'] = pd.NA
After this statement, my column is now of type object. Can someone explain why that is, and how to correct this? I'd rather do it right the first time rather than adding this to the end:
df['my_column'] = df['my_column'].astype(float)
Thanks.
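In pandas, you can use pd.NA or np.nan to represent missing values. np.nan is itself a float, so assigning it keeps the column as float64, which is what you want here. pd.NA is the missing-value marker meant for the nullable extension dtypes (for example string or Int64); assigned into a plain float64 column it can turn the column into object, as you saw.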
df.loc[some criteria,'my_column'] = np.nan
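A quick check of this on invented data, where the condition my_column == 0 stands in for "some criteria":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"my_column": [0.0, 1.5, 2.0]})

# Mask the zeros with np.nan; the mask here is only an example criterion
df.loc[df["my_column"] == 0, "my_column"] = np.nan
print(df["my_column"].dtype)

# The column is still float64, so it can be used as a denominator
ratio = 3.0 / df["my_column"]
print(ratio)
```

Rows with a NaN denominator simply give NaN in the result instead of raising or changing the dtype.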

Pandas read_sql_query converting integer column to float

I have the following line
df = pandas.read_sql_query(sql = sql_script, con=conn, coerce_float = False)
that pulls data from Postgres using a SQL script. Pandas keeps setting some of the columns to type float64. They should just be int. These columns contain some null values. Is there a way to pull the data without pandas setting them to float64?
Thanks!
As per the documentation, NumPy has no NA representation for integers, so an integer column with missing values cannot be kept as int; pandas promotes such columns to float.
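One possible workaround, not mentioned in the answer above, is to convert the affected columns to the pandas nullable integer dtype "Int64" (capital I) after loading. The sketch below uses a hand-built frame in place of the read_sql_query result, and the column name qty is hypothetical:

```python
import numpy as np
import pandas as pd

# Stand-in for what read_sql_query returns for an int column with NULLs
df = pd.DataFrame({"qty": [1.0, np.nan, 3.0]})
print(df["qty"].dtype)

# Nullable integer dtype: keeps whole numbers and marks NULLs as <NA>
df["qty"] = df["qty"].astype("Int64")
print(df["qty"])
```

This conversion expects the non-missing values to be whole numbers; a fractional value would not fit an integer dtype.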
