Interpolating data for missing values pandas python - python

I am having trouble interpolating my missing values. I am using the following code to interpolate:
df=pd.read_csv(filename, delimiter=',')
#Interpolating the nan values
df.set_index(df['Date'],inplace=True)
df2=df.interpolate(method='time')
Water=(df2['Water'])
Oil=(df2['Oil'])
Gas=(df2['Gas'])
Whenever I run my code I get the following message: "time-weighted interpolation only works on Series or DataFrames with a DatetimeIndex"
My data consists of several columns with a header row. The first column is named Date, and all of its rows look similar to 12/31/2009. I am new to Python and time series in general. Any tips will help.
[Sample of CSV file]

Try this, assuming the first column of your csv is the one with date strings:
df = pd.read_csv(filename, index_col=0, parse_dates=[0], infer_datetime_format=True)
df2 = df.interpolate(method='time', limit_direction='both')
This should 1) convert your first column into actual datetime objects, and 2) set the index of the dataframe to that datetime column, all in one step. The infer_datetime_format=True argument is optional; if your datetime format is consistent, it can speed up parsing by quite a bit.
The limit_direction='both' should back fill any NaNs in the first row, but because you haven't provided a copy-paste-able sample of your data, I cannot confirm on my end.
Reading the documentation can be incredibly helpful and can usually answer questions faster than you'll get answers from Stack Overflow!
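A minimal, self-contained sketch of the approach above, using a toy DataFrame in place of the asker's CSV (the column name and dates here are stand-ins):

```python
import numpy as np
import pandas as pd

# Toy frame with a DatetimeIndex and one missing value
data = pd.DataFrame(
    {"Water": [1.0, np.nan, 3.0]},
    index=pd.to_datetime(["2009-12-29", "2009-12-30", "2009-12-31"]),
)

# method='time' weights the interpolation by the actual gaps between
# index timestamps; limit_direction='both' also fills leading NaNs
filled = data.interpolate(method="time", limit_direction="both")
print(filled["Water"].tolist())  # [1.0, 2.0, 3.0]
```

With evenly spaced dates the result matches plain linear interpolation; the difference shows up when the timestamps are irregular.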


ValueError: time-weighted interpolation only works on Series or DataFrames with a DatetimeIndex When using Pandas Interpolate using Time method

I am using a dataset found on the Kaggle website (https://www.kaggle.com/claytonmiller/lbnl-automated-fault-detection-for-buildings-data) specifically the 'RTU.CSV'.
I have converted the timestamp to DateTime using following code:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
yet when I try to use the Pandas Interpolate using Time method
df.interpolate(method = "time")
The Error I get is
ValueError: time-weighted interpolation only works on Series or
DataFrames with a DatetimeIndex
Can anyone explain what this means?
You are calling interpolate on the whole dataframe, which has columns other than time data. Time-weighted interpolation only works when it is called on a Series whose values are time data, or on a dataframe (or Series) whose index is a DatetimeIndex.
I imagine this is what you intended to do:
df['Timestamp'].interpolate(method = "time")
If you wish to turn your timestamp column into the index:
df.set_index(df['Timestamp'], inplace=True)
Edit after seeing the dataset: my guess is that you need something more powerful than interpolate if you want to, essentially, predict all column values from the timestamp and historical data. interpolate is more for filling gaps within a column. Since your timestamps are fairly regular, you could also assume the rest of the data is partially independent of them and call interpolate on each column one by one (the method may need to change). But since large chunks of data are missing at the start, I am not sure how good interpolate's guesses would be.
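A hedged sketch reproducing both the error and the fix on toy data (the column names and values here are illustrative, not taken from the RTU.csv file):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-04"]),
    "temp": [10.0, np.nan, 16.0],
})

# Calling on a column with the default RangeIndex raises the error
try:
    df["temp"].interpolate(method="time")
except ValueError as err:
    print(err)  # time-weighted interpolation only works on ... DatetimeIndex

# After moving the timestamps into the index, it works and is
# weighted by the uneven 1-day / 2-day gaps
fixed = df.set_index("Timestamp")["temp"].interpolate(method="time")
print(fixed.tolist())  # [10.0, 12.0, 16.0]
```

Note the interpolated value is 12.0, not the linear-in-row-order 13.0, because 2019-01-02 sits one third of the way through the three-day gap.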

Trying to convert a column with strings to float via Pandas

Hi, I have looked on Stack Overflow but have not found a solution to my problem. Any help highly appreciated.
After importing a csv I noticed that all the types of the columns are object and not float.
My goal is to convert all the columns except the YEAR column to float. I have read that you first have to strip the columns to remove blanks, then convert NaNs to 0, and then convert the strings to floats. But with the code below I'm getting an error.
My code in my Jupyter notebook is:
And I get the following error.
How do I have to change the code?
All the columns except the YEAR column have to be set to float.
If you could also help me set the YEAR column to datetime, that would be very nice. But my main problem is getting the data right so I can start making calculations.
Thanks
Runy
Easiest would be
df = df.astype(float)
df['YEAR'] = df['YEAR'].astype(int)
Also, your code fails because you have two columns with the same name BBPWN, so when you do df['BBPWN'], you will get a dataframe with those two columns. Then, df['BBPWN'].str will fail.
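A small sketch of the clean-up steps above on made-up data ('BBPWN' is the asker's column name; the values here are stand-ins):

```python
import pandas as pd

df = pd.DataFrame({"YEAR": ["2019", "2020"], "BBPWN": [" 1.5", "2.0 "]})

# Strip stray whitespace first, then convert everything to float,
# then restore YEAR to integers
df["BBPWN"] = df["BBPWN"].str.strip()
df = df.astype(float)
df["YEAR"] = df["YEAR"].astype(int)
print(df.dtypes.tolist())  # [dtype('int64'), dtype('float64')]
```

If the CSV really does contain two columns both named BBPWN, rename one first (e.g. with df.columns = [...]), since df['BBPWN'].str fails when the selection returns a two-column dataframe.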

How can i select multiple date columns in a dataframe in pandas, then format them all ? (python)

I have a large dataset with multiple date columns that I need to clean up, mostly by removing the timestamp since it is all 00:00:00. I want to write a function that collects all columns of datetime type and formats all of them, instead of having to handle each one individually.
I figured it out. This is what I came up with and it works for me:
def tidy_dates(df):
    for col in df.select_dtypes(include="datetime64[ns, UTC]"):
        df[col] = df[col].dt.strftime("%Y-%m-%d")
    return df
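A quick check of the same idea on toy data. One caveat: the function above only matches tz-aware "datetime64[ns, UTC]" columns, so for tz-naive datetimes the broader include="datetime" selector is needed, as in this sketch (the column name is made up):

```python
import pandas as pd

def tidy_dates(df):
    # "datetime" matches tz-naive datetime64 columns
    for col in df.select_dtypes(include="datetime"):
        df[col] = df[col].dt.strftime("%Y-%m-%d")
    return df

df = pd.DataFrame({"created": pd.to_datetime(["2021-01-05 00:00:00"])})
out = tidy_dates(df)
print(out["created"].tolist())  # ['2021-01-05']
```

Note that strftime converts the columns to plain strings, so this is for display; date arithmetic on them is no longer possible afterwards.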

Reading Date times from Excel to Python using Pandas

I'm trying to read from an Excel file that gets converted to python and then gets split into numbers (Integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (so 12:30:00) they expect for it to be recognized as a time. However python (currently) treats it as dtype object.
If I specify the column with parse_dates then it works; however, since I don't know what the data is in advance, I ideally want this to be done automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, I would want this to be done without having to specify the column (so anything that can be converted is).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string) you can do the following:
1) filter the column with dtype object
import pandas as pd
datetime_col = df.select_dtypes(object)
2) convert it to seconds
datetime_col_in_seconds = pd.to_timedelta(datetime_col.loc[0]).dt.total_seconds()
Then you can re-append the converted column to your original data and/or do whatever processing you want.
Eventually, you can convert it back to datetime.
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object, you might have to do some more pre-processing, but I guess this is a good way to start tackling your particular case.
This does what I need
for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column into timedelta format. If a column can't be converted, a ValueError is raised and the loop moves on to the next column.
After it runs, every column that could be recognized as a timedelta has been converted.
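A self-contained sketch of that try/except loop on made-up columns, showing that time-like strings are converted while other columns are left untouched:

```python
import pandas as pd

df = pd.DataFrame({
    "duration": ["12:30:00", "01:00:00"],  # parseable as timedeltas
    "name": ["a", "b"],                    # not parseable -> ValueError
})

for column_name in df.columns:
    try:
        df[column_name] = pd.to_timedelta(df[column_name].astype(str))
    except ValueError:
        pass  # leave non-time columns as they are

print(df.dtypes.tolist())  # [dtype('<m8[ns]'), dtype('O')]
```

One thing to watch for: numeric columns may also be accepted by pd.to_timedelta (interpreted as nanoseconds), so in a mixed dataset you may want to restrict the loop to object-dtype columns first.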

Pandas DatetimeIndex string format conversion from American to European

Ok I have read some data from a CSV file using:
df=pd.read_csv(path,index_col='Date',parse_dates=True,dayfirst=True)
The data are in the European date convention format dd/mm/yyyy, which is why I am using dayfirst=True.
However, what I want to do is change the string format of my dataframe index from the American format (yyyy/mm/dd) to the European format (dd/mm/yyyy), just to be visually consistent with how I read dates.
I couldn't find any relevant argument in the pd.read_csv method.
In the output I want a dataframe whose index is visually consistent with the European date format.
Could anyone propose a solution? It should be straightforward, since I guess there is a pandas method to handle this, but I am currently stuck.
Try something like the following once it's loaded from the CSV. I don't believe it's possible to perform the conversion as part of the reading process.
import pandas as pd
df = pd.DataFrame({'date': pd.date_range(start='11/24/2016', periods=4)})
df['date_eu'] = df['date'].dt.strftime('%d/%m/%Y')
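Applying the same idea to a DatetimeIndex, which is the question's actual case (toy data here). Note that strftime turns the index into plain strings, so this is display-only and time-based operations on the index stop working afterwards:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2]},
    index=pd.to_datetime(["2016-11-24", "2016-11-25"]),
)

# Reformat the DatetimeIndex as dd/mm/yyyy strings
df.index = df.index.strftime("%d/%m/%Y")
print(df.index.tolist())  # ['24/11/2016', '25/11/2016']
```

If you still need a real DatetimeIndex for resampling or slicing, keep the original index and format only when printing or exporting.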
