Here is my dataframe:
import pandas as pd
dates = ('2020-09-24','2020-10-19','2020-12-17','2021-03-17','2021-06-17','2021-09-17','2022-03-17','2022-09-20','2023-09-19','2024-09-17','2025-09-17','2026-09-17','2027-09-17','2028-09-19','2029-09-18','2030-09-17','2031-09-17','2032-09-17','2035-09-18','2040-09-18','2045-09-19')
factors = (1, 0.999994, 0.999875, 1.000166, 1.000303, 1.000438, 1.00056, 1.000817, 1.001046, 1.001412, 1.001525, 1.001334, 1.000685, 0.999376, 0.997456, 0.994626, 0.991244, 0.986754, 0.982072, 0.962028, 0.925136)
df = pd.DataFrame()
df['dates'] = dates
df['factors'] = factors
df['dates'] = pd.to_datetime(df['dates'])
df.set_index('dates', inplace=True)
df
Here is another dataframe with a time series at a fixed interval:
interpolated = pd.DataFrame(0, index=pd.date_range('2020-09-24', '2045-09-19', freq='3M'),columns=['result'])
The goal is to populate the second dataframe with the cubic spline interpolated values from the first table.
Thanks for all the ideas
Attempt
interpolated['result'] = df['factors'].interpolate(method='cubic')
However, it gives only NaN values in the interpolated dataframe. I'm not sure how to correctly refer to the first table.
First things first, the shapes don't match. Since none of the dates in the index of df match the dates in interpolated, the assignment aligns on the index and you just end up with NaN filled in on every date. I think you want something more like merge or join, as described in this post: Merging time series data by timestamp using numpy/pandas. merge and join will also be helpful; a sketch of one way to do it is below.
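A minimal sketch of that idea, reusing df and interpolated from the question (method='cubic' requires scipy): put both sets of dates into one index, interpolate across the union, then keep only the target dates.
# union of both date indexes, so the spline has anchor points
combined_index = df.index.union(interpolated.index)
interpolated['result'] = (
    df['factors']
    .reindex(combined_index)      # dates only in the 3-month grid become NaN
    .interpolate(method='cubic')  # fill those NaNs by cubic interpolation
    .reindex(interpolated.index)  # keep only the 3-month grid dates
)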
Related
I have a list of dictionaries with values that I convert to a dataframe in pandas. I also convert the date information (here measured in ms) to datetime and set a new index. The values are recorded every 5 seconds. When I want to resample them to 1 minute, I get the error message "DataError: No numeric types to aggregate" and I do not understand why. Here is my code:
import pandas as pd
#convert list of dictionaries into a dataframe
dataframe = pd.DataFrame(data)
#convert the date-information (here measured in ms) to datetime
dataframe['timestamp'] = pd.to_datetime(dataframe['timestamp'], unit='ms')
#rearrange the columns
dataframe = dataframe.reindex(['timestamp','value'], axis=1)
#set the index to the column 'timestamp'
dataframe.set_index('timestamp', inplace=True)
#resample to a resolution of 1 minute
dataframe_minutes = dataframe.resample('1M').mean()
The problematic part is the last line, dataframe_minutes = dataframe.resample('1M').mean().
Do you have an idea why I get this error? As far as I can see, both columns have numeric values. I'd appreciate any comments.
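For what it's worth, the usual cause of "No numeric types to aggregate" is that the value column has object dtype (e.g. the values arrived as strings), so mean() finds nothing numeric to aggregate. Also note that '1M' is pandas' month-end frequency; one minute is '1min' (or 'T'). A minimal sketch of the fix, assuming the column is really named value:
# coerce the value column to numeric; unparsable entries become NaN
dataframe['value'] = pd.to_numeric(dataframe['value'], errors='coerce')
# resample to one-minute bins ('1min'); '1M' would be month-end
dataframe_minutes = dataframe.resample('1min').mean()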
I have two data frames that I imported from spreadsheets into Pandas and cleaned up. They share a key column called 'PurchaseOrders' that I am using to match product numbers to shipment numbers. When I attempt to merge them, I only end up with a df of 34 rows, but I have over 400 pairs of matching product and shipment numbers.
This is the closest I've gotten, but I have also tried using join()
ShipSheet = pd.merge(new_df, orders, how='inner')
ShipSheet.shape
Here is my order df
orders df
and here is my new_df that I want to add to my orders df using the 'PurchaseOrders' key
new_df
In the end, I want them to look like this
end goal df
I am not sure if I'm using the merge function improperly, but my end product should have around 300+ rows. I will note that the new_df data frame's 'PurchaseOrders' values had to be delimited from a single column and split into rows, so I guess this could have something to do with it.
Use the merge method on the dataframe and specify the key:
merged_inner = pd.merge(left=new_df, right=orders, on='PurchaseOrders')
Use the concat method on pandas and specify the axis.
final_df = pd.concat([new_df, orders], axis=1)
Be careful when you specify the axis: with axis=0 the second data frame is placed under the first one, and with axis=1 it is placed to the right of the first data frame.
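If an inner merge still returns too few rows, a common culprit is keys that do not compare equal: for example, one side holding strings with stray whitespace left over from the split and the other holding integers. A quick diagnostic sketch, assuming both frames have a 'PurchaseOrders' column:
# normalize both key columns to stripped strings before merging
new_df['PurchaseOrders'] = new_df['PurchaseOrders'].astype(str).str.strip()
orders['PurchaseOrders'] = orders['PurchaseOrders'].astype(str).str.strip()
# keys in new_df with no partner in orders; these rows vanish in an inner merge
unmatched = set(new_df['PurchaseOrders']) - set(orders['PurchaseOrders'])
print(len(unmatched), 'unmatched keys')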
My pandas dataframe contains a column "timeStamp". I'm trying to obtain the difference between the values of two data frames. My question: how can I keep the date the same and only subtract the values?
merge is a nice approach as SwaggaTing suggested. Alternatively, you can set your date as the index:
a.set_index('date')['values_TProducing'] - b.set_index('date')['values_AProducing']
Assuming the dates are unique, you can join the dataframes on the date column and then subtract:
merged = a.merge(b, on='date')
merged['diff'] = merged['values_AProducing'] - merged['values_TProducing']
This assumes that the dates line up as they do in your example:
x = a.copy().drop(columns='values_TProducing')
x['values'] = a['values_TProducing'] - b['values_AProducing']
I have created a DataFrame in order to process some data, and I want to find the difference in time between each pair of data in the DataFrame. Prior to using pandas, I was using two numpy arrays, one describing the data and the other describing time (an array of datetime.datetimes). With the data in arrays, I could do timearray[1:] - timearray[:-1] which resulted in an array (of n-1 elements) describing the gap in time between each pair of data.
In pandas, doing DataFrame.index[1] - DataFrame.index[0] gives me the result I want – the difference in time between the two indices I've picked out. However, doing DataFrame.index[1:] - DataFrame.index[:-1] does not yield an array of similar results, instead simply being equal to DataFrame.index[-1]. Why is this, and how can I replicate the numpy behaviour in pandas?
Alternatively, what is the best way to find datagaps in a DataFrame in pandas?
You can use shift to offset the date and use it to calculate the difference between rows.
# create dummy data
import pandas as pd
rng = pd.date_range('1/1/2011', periods=90, freq='h')
df = pd.DataFrame({'value': range(1, 91), 'date': rng})
# shift a copy of the date column and subtract it from the original date
df['time_gap'] = df['date'] - df['date'].shift(1)
To use this when the date is in the index, temporarily move it into a column with .reset_index(), and use .set_index('date') to return the date column to the index afterwards if required.
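Alternatively, a small sketch that works on the index directly and replicates the numpy timearray[1:] - timearray[:-1] behaviour, using to_series() and diff():
import pandas as pd
idx = pd.date_range('1/1/2011', periods=90, freq='h')
df = pd.DataFrame({'value': range(1, 91)}, index=idx)
# difference between consecutive index entries; the first entry is NaT
gaps = df.index.to_series().diff()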
Let's say I've pulled csv data from two separate files, each containing a date index that pandas automatically parsed from one of the original columns.
import pandas as pd
df1 = pd.io.parsers.read_csv(data1, parse_dates = True, infer_datetime_format=True, index_col=0, names=['A'])
df2 = pd.io.parsers.read_csv(data2, parse_dates = True, infer_datetime_format=True, index_col=0, names=['A'])
Now the dates in one csv file are different from the other, but when loaded with read_csv, the dates are well defined. I've tried the join command, but it doesn't seem to preserve the dates.
df1 = df1.join(df2)
I get a valid data frame, but the range of the dates is fixed to some smaller subset of what the original range should be, given the disparity between the dates in the two csv files. What I would like is a way to create a single dataframe with 2 columns (both 'A' columns) that contains NaN or zero values for the non-overlapping dates, filled in automatically. Is there a simple solution for this, or is there something that I might be missing here? Thanks so much.
By default, the pandas DataFrame method join combines two dataframes using a 'left' join, which keeps only the dates already in df1. You want an 'outer' join, which keeps the union of both indexes. Your join line should read:
df1 = df1.join(df2, how='outer')
See http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.join.html
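One caveat, assuming both frames really do name their single column 'A' as in the question: join refuses to combine overlapping column names without suffixes, and fillna can supply the zeros asked about:
combined = df1.join(df2, how='outer', lsuffix='_data1', rsuffix='_data2')
combined = combined.fillna(0)  # non-overlapping dates get 0 instead of NaN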