Trying to convert a column with strings to float via Pandas - python

Hi, I have looked on Stack Overflow but have not found a solution for my problem. Any help is highly appreciated.
After importing a CSV I noticed that all the columns have dtype object and not float.
My goal is to convert every column except the YEAR column to float. I have read that you first have to strip the columns to take blanks out, then convert NaNs to 0, and then try to convert the strings to floats. But with the code below I'm getting an error.
My code in my Jupyter notebook is:
And I get the following error.
How do I have to change the code?
All the columns but the YEAR column have to be set to float.
If you can also help me set the YEAR column to datetime, that would be very nice. But my main problem is getting the data right so I can start making calculations.
Thanks
Runy

Easiest would be
df = df.astype(float)
df['YEAR'] = df['YEAR'].astype(int)
Also, your code fails because you have two columns with the same name, BBPWN: df['BBPWN'] returns a dataframe with both of those columns, and df['BBPWN'].str will then fail.
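For the cleaning steps the question describes, something along these lines might work (a minimal sketch; the .str.strip() call assumes the columns really contain strings, and the duplicate BBPWN column would need to be renamed first):

import pandas as pd

value_cols = df.columns.drop('YEAR')
df[value_cols] = (
    df[value_cols]
    .apply(lambda s: s.str.strip())         # strip stray blanks from each column
    .replace('', pd.NA)                     # empty strings become missing values
    .apply(pd.to_numeric, errors='coerce')  # strings -> numbers, unparseable -> NaN
    .fillna(0)
    .astype(float)
)
df['YEAR'] = pd.to_datetime(df['YEAR'], format='%Y')  # assumes YEAR holds 4-digit years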

Related

Change in data type from bigint to double while renaming in pandas

I am new to Python pandas. I have input data in the form below.
Name,Empid,2020_salary,2021_salary,2022_salary
Adam,21223,58000,60000,62000
John,21534,67465,70985,90857
Terry,35463,76000,86765,97886
As you can see, each year column carries an underscore and the word salary along with the year (e.g. 2020_salary). But I want the data in the form below:
Name,Empid,2020,2021,2022
Adam,21223,58000,60000,62000
John,21534,67465,70985,90857
Terry,35463,76000,86765,97886
Since the table structure is not consistent (new year columns will be added in the future), I cannot manually rename each column. I used pandas to rename all columns having the salary text in their name:
df.rename(columns=lambda x: x.replace('_salary', ''), inplace=True)
Now I have the column structure the way I wanted, but the data type of some columns has changed from int to double.
Name,Empid,2020,2021,2022
Adam,21223,58000.0,60000,62000
John,21534,67465.0,70985,90857
Terry,35463,76000.0,86765,97886
I am getting output similar to the above. Can you please help me convert the double columns back to int without changing anything manually, or suggest a better way to rename the year columns? Thanks in advance.
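One possible approach (a sketch, not tested against your real data): rename first, then cast the year columns back to an integer dtype. The nullable Int64 dtype also survives missing values, which is the usual reason such columns silently become float:

import pandas as pd

df = df.rename(columns=lambda x: x.replace('_salary', ''))

# after renaming, the year columns are exactly those whose names are all digits
year_cols = [c for c in df.columns if c.isdigit()]
df[year_cols] = df[year_cols].astype('Int64')  # nullable integer, tolerates NaN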

Truncate decimal numbers in string

A weird thing: I have a dataframe, let's call it ID.
While importing the xlsx source file, I do .astype({"ID_1": str, "ID_2": str}).
Yet, for example, instead of 10300 I get 10300.0.
Moreover, I then get the string "nan" as well.
In order to fix both issues I did this rubbish:
my_df['ID_1'].replace(['None', 'nan'], np.nan, inplace=True)
my_df[my_df['ID_1'].notnull()].ID_1.astype(float).astype(int).astype(str)
As a result I still have these 10300.0. Any thoughts on how to fix this? I could keep it as float while importing the data, instead of using .astype, but that does not change anything.
The issue is that int cannot represent a NaN value, so pandas converts the column to float.
It is a common pitfall, as the presence of additional rows with missing data can change the resulting dtype, and with it the value of a given row.
You can, however, pick a specific pandas type to indicate an integer column with missing values; see Convert Pandas column containing NaNs to dtype `int`, especially the link https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
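A sketch of what that could look like for ID_1 (assuming the column currently holds floats and/or 'nan'/'None' strings):

import pandas as pd

ids = pd.to_numeric(my_df['ID_1'], errors='coerce')   # 'nan'/'None' strings -> real NaN
my_df['ID_1'] = ids.astype('Int64').astype('string')  # '10300' instead of '10300.0', <NA> when missing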

Interpolating data for missing values pandas python

I am having trouble interpolating my missing values. I am using the following code to interpolate:
df = pd.read_csv(filename, delimiter=',')
# Interpolating the NaN values
df.set_index(df['Date'], inplace=True)
df2 = df.interpolate(method='time')
Water = df2['Water']
Oil = df2['Oil']
Gas = df2['Gas']
Whenever I run my code I get the following message: "time-weighted interpolation only works on Series or DataFrames with a DatetimeIndex"
My data consists of several columns with a header. The first column is named Date, and all its rows look similar to this: 12/31/2009. I am new to Python and time series in general. Any tips will help.
Sample of CSV file
Try this, assuming the first column of your csv is the one with date strings:
df = pd.read_csv(filename, index_col=0, parse_dates=[0], infer_datetime_format=True)
df2 = df.interpolate(method='time', limit_direction='both')
It theoretically should 1) convert your first column into actual datetime objects, and 2) set the index of the dataframe to that datetime column, all in one step. The infer_datetime_format=True argument is optional, but if your datetime format is a standard format, it can help speed up parsing by quite a bit.
The limit_direction='both' should back fill any NaNs in the first row, but because you haven't provided a copy-paste-able sample of your data, I cannot confirm on my end.
Reading the documentation can be incredibly helpful and can usually answer questions faster than you'll get answers from Stack Overflow!
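If the automatic parsing ever guesses wrong, an equivalent, more explicit variant (a sketch, using the 12/31/2009 format mentioned in the question) would be:

df = pd.read_csv(filename)
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')  # explicit format, no guessing
df = df.set_index('Date')
df2 = df.interpolate(method='time', limit_direction='both')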

pd.merge throwing error while executing through .bat file

The Python script does not run when executed from a .bat file, but runs seamlessly in the editor.
The error is related to a datatype difference in the pd.merge call, although the datatype given to the columns is the same in both dataframes:
df2a["supply"] = df2a["supply"].astype(str)
df2["supply_typ"] = df2["supply_typ"].astype(str)
df2a["supply_typ"] = df2a["supply_typ"].astype(str)
df = pd.merge(df2, df2a, how=join,
              on=['entity_id', 'pare', 'grome', 'buame', 'tame', 'prd', 'gsn',
                  'supply', 'supply_typ'],
              suffixes=['gs2', 'gs2x'])
While running the bat file, I am getting the following error from pd.merge:
You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat
Not a direct answer, but contains code that cannot be formatted in a comment, and should be enough to solve the problem.
When pandas says that you are trying to merge on float64 and object columns, it is certainly right. It may not be evident because pandas relies on numpy, and a numpy object column can store any data.
I ended up with a simple function to diagnose all those data type problems:
def show_types(df):
    for i, c in enumerate(df.columns):
        print(df[c].dtype, type(df.iat[0, i]))
It shows both the pandas dtype of each column of a dataframe and the actual type of the first element of the column. It can help to see the difference between columns containing str elements and others containing datetime.datetime ones, while the dtype of both is just object.
Use that on both of your dataframes, and the problem should become evident...
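Once the mismatched column is identified, one way out (a sketch, extending the astype(str) calls the question already uses) is to coerce every merge key to the same type on both sides before merging:

merge_keys = ['entity_id', 'pare', 'grome', 'buame', 'tame', 'prd', 'gsn',
              'supply', 'supply_typ']
for key in merge_keys:
    df2[key] = df2[key].astype(str)
    df2a[key] = df2a[key].astype(str)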

Pandas Dataframe column will set to string but not integer

I am trying to set the values in a dataframe to the values from a separate dataframe. This works just fine when the source column is a string, but the integer columns are not being copied, nor is any error thrown.
RentryDf = pd.DataFrame(index=tportDf.index.values, columns=tradesDf.columns)
RentryDf.loc[:, 'TRADER'] = tportDf.loc[:, 'TRADER']
RentryDf.loc[:, 'CONTRACT_VOL'] = tportDf.loc[:, 'DELIVERY VOLUME']
The second line has no problem setting the string names of the traders, but the third line stays NaN. I tried the following two lines of code just to see if they would work, and even these don't work:
RentryDf.loc[:, 'CONTRACT_VOL'] = 11
RentryDf.loc[:, 'CONTRACT_VOL'].apply(lambda x: 11)
I solved my question while trying to recreate it (I guess I learned a good strategy!).
The problem was in my declaration of the dataframe: I was passing columns=tradesDf.columns rather than columns=tradesDf.columns.values.
I am pleased to have it fixed, but does anyone know why this would cause the DF not to set the integer values while it would set the string values?
I can't reproduce the bug; it works for both float64 and int64.
I guess the problem could be wrong indexing, since the first line will create a DF with all values set to NaN.
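For illustration, a minimal, hypothetical example of how such index misalignment plays out: assignment through .loc aligns a Series on index labels, so when the two indexes share no labels, nothing is copied:

import pandas as pd

src = pd.DataFrame({'DELIVERY VOLUME': [10, 20]}, index=['a', 'b'])
dst = pd.DataFrame(index=['x', 'y'], columns=['CONTRACT_VOL'])

dst.loc[:, 'CONTRACT_VOL'] = src.loc[:, 'DELIVERY VOLUME']  # stays NaN: no common labels
dst.loc[:, 'CONTRACT_VOL'] = src['DELIVERY VOLUME'].values  # copies positionally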
