pandas combining 2 dataframes with different date indices - python

Let's say I've pulled CSV data from two separate files, each containing a date column that pandas automatically parsed and used as the index.
import pandas as pd
df1 = pd.read_csv(data1, parse_dates=True, infer_datetime_format=True, index_col=0, names=['A'])
df2 = pd.read_csv(data2, parse_dates=True, infer_datetime_format=True, index_col=0, names=['A'])
Now the dates in one CSV file are different from those in the other, but when loaded with read_csv, the dates are well defined. I've tried the join command, but it doesn't seem to preserve the full range of dates.
df1 = df1.join(df2)
I get a valid dataframe, but its date range is fixed to some smaller subset of what the original range should be, given the disparity between the dates in the two CSV files. What I would like is a way to create a single dataframe with two columns (both 'A' columns) in which NaN or zero values are filled in automatically for the non-overlapping dates. Is there a simple solution for this, or is there something I might be missing here? Thanks so much.

By default, the pandas DataFrame method 'join' combines two dataframes using 'left' merging, which keeps only the dates in the calling frame's index. You want to use 'outer' merging. Your join line should read:
df1 = df1.join(df2, how='outer')
See http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.join.html
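For illustration, here is a minimal sketch with made-up dates showing how an outer join keeps the union of both indices (lsuffix/rsuffix are needed here because both columns are named 'A'):
import pandas as pd

# two single-column frames with partially overlapping date indices (made-up data)
idx1 = pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-03'])
idx2 = pd.to_datetime(['2014-01-03', '2014-01-04', '2014-01-05'])
df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=idx1)
df2 = pd.DataFrame({'A': [30.0, 40.0, 50.0]}, index=idx2)

# the outer join spans all five dates; non-overlapping dates hold NaN
joined = df1.join(df2, how='outer', lsuffix='_1', rsuffix='_2')
# use joined.fillna(0) if you prefer zeros to NaN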

Related

How do you reorganize a dataframe of dataframes so that there's only 1 final dataframe?

If I have a list of df lists, how would I go about concatenating all of the dataframes in each nested list and then finally concatenating all of those results together?
(Looking at the output again, I think this is actually a dataframe of multiple dataframes?)
Here is the input data (the tickers).
Here is the code I'm using:
import pandas as pd

df = pd.read_excel('C:/Users/Jacob/Downloads/CEF Tickers.xlsx', sheet_name='Sheet1')
tickers_list = df['Ticker'].tolist()
data = pd.DataFrame(columns=tickers_list)
for ticker in tickers_list:
    # read_html returns a *list* of DataFrames, so each cell ends up holding a DataFrame
    data[ticker] = pd.read_html(f'https://www.cefconnect.com/fund/{ticker}', header=0)
export_excel = data.to_excel(r'C:/Users/Jacob/Downloads/ceftest.xlsx', sheet_name='Sheet1', index=True)
This is the output I get (link)
I was hoping to use the tickers as the index, with each column header being a data point. I also notice that some of the dataframes in the list will not have matching values. For example, the 14th, 15th, and 16th dfs will not have values matching the column headers. Would it be possible to make all of them column headers and put "0" in the missing cells? Or how do you think I should deal with those dfs? Drop them out completely? If so, how would I remove those final three frames in the process of concatenating everything?
Thank you for your time!
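One possible approach, sketched with hypothetical frames in place of the read_html results: flatten the nested list with pd.concat, which fills non-matching columns with NaN, and then replace those with 0 as asked:
import pandas as pd

# hypothetical nested list standing in for the per-ticker read_html results
nested = [
    [pd.DataFrame({'x': [1]}), pd.DataFrame({'x': [2], 'y': [3]})],
    [pd.DataFrame({'y': [4]})],
]

# concatenate each inner list, then concatenate those results together
per_list = [pd.concat(dfs, ignore_index=True) for dfs in nested]
combined = pd.concat(per_list, ignore_index=True)

# columns missing from a frame come back as NaN; put "0" in those cells
combined = combined.fillna(0)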

Dataframe - interpolate values based on inputs from another dataframe

Here is my dataframe:
import pandas as pd
dates = ('2020-09-24','2020-10-19','2020-12-17','2021-03-17','2021-06-17','2021-09-17','2022-03-17','2022-09-20','2023-09-19','2024-09-17','2025-09-17','2026-09-17','2027-09-17','2028-09-19','2029-09-18','2030-09-17','2031-09-17','2032-09-17','2035-09-18','2040-09-18','2045-09-19')
factors = ('1','0.999994','0.999875','1.000166','1.000303','1.000438','1.00056','1.000817','1.001046','1.001412','1.001525','1.001334','1.000685','0.999376','0.997456','0.994626','0.991244','0.986754','0.982072','0.962028','0.925136')
df = pd.DataFrame()
df['dates']=dates
df['factors']=factors
df['dates'] = pd.to_datetime(df['dates'])
df.set_index(['dates'],inplace=True)
df
Here is another dataframe with a timeseries with fixed interval
interpolated = pd.DataFrame(0, index=pd.date_range('2020-09-24', '2045-09-19', freq='3M'),columns=['result'])
The goal is to populate the second dataframe with the cubic spline interpolated values from the first table.
Thanks for all the ideas.
Attempt
interpolated['result'] = df['factors'].interpolate(method='cubic')
However, it gives only NaN values in the interpolated dataframe. I'm not sure how to correctly refer to the first table.
First things first, the shapes don't match. Since it seems none of the dates in the index of df match the dates in interpolated, you just end up with NaN being filled in on those dates. I think you want something more like merge or join, as described in this post: Merging time series data by timestamp using numpy/pandas. The documentation for merge and join will also be helpful.
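As a sketch of one way to get there with the df from the question: cast the string factors to float, reindex onto the union of the known dates and the target dates so everything sits on one axis, interpolate, then keep only the target dates (method='cubic' requires scipy to be installed):
import pandas as pd

df['factors'] = df['factors'].astype(float)  # the tuple literals above hold strings

target = pd.date_range('2020-09-24', '2045-09-19', freq='3M')
# union of known and target dates, cubic interpolation, then select the targets
union = df.index.union(target)
interpolated = (df['factors'].reindex(union)
                             .interpolate(method='cubic')
                             .loc[target]
                             .to_frame('result'))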

How can these two dataframes be merged on a specific key?

I have two dataframes, both with a column 'hotelCode' of type string. I made sure to convert both columns to string beforehand.
The first dataframe, we'll call old_DF looks like so:
and the second dataframe new_DF looks like:
I have been trying to merge these unsuccessfully. I've tried
final_DF = new_DF.join(old_DF, on = 'hotelCode')
and get this error:
I've tried a variety of things: changing the index name, various merge/join/concat and just haven't been successful.
Ideally, I will have a new dataframe where you have columns [[hotelCode, oldDate, newDate]] under one roof.
DataFrame.join aligns the calling frame's 'on' column against the other frame's index (and needs suffixes when column names overlap), which is likely why the join attempt above failed. pd.merge matches the shared column directly:
import pandas as pd
final_DF = pd.merge(old_DF, new_DF, on='hotelCode', how='outer')
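A small self-contained sketch with made-up rows (the oldDate/newDate column names follow the question):
import pandas as pd

old_DF = pd.DataFrame({'hotelCode': ['H1', 'H2'], 'oldDate': ['2020-01-01', '2020-02-01']})
new_DF = pd.DataFrame({'hotelCode': ['H2', 'H3'], 'newDate': ['2021-01-01', '2021-02-01']})

# the outer merge keeps hotels present in only one frame, with NaN elsewhere
final_DF = pd.merge(old_DF, new_DF, on='hotelCode', how='outer')
# columns: hotelCode, oldDate, newDate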

Use multiple rows as column header for pandas

I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer. But it only applies if all sub columns are going to be named the same way, which is not necessarily the case.
Any recommendations would be appreciated.
df = pd.read_excel(
"./Data.xlsx",
sheet_name="Customer Care",
header=[0,1,2]
)
This tells pandas to read the first three rows of the Excel file as MultiIndex column labels.
If you have already loaded the data (e.g. with header=None) and want to promote the first three rows afterwards, set them as the columns and then drop them:
# set the first three rows as MultiIndex column labels
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
# drop the first three rows (they now live in the column labels)
df = df.iloc[3:]
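Either way you end up with MultiIndex columns, which are addressed by tuples; a quick sketch with hypothetical header labels:
# select one column by its three header values (labels are made up)
series = df[('Level0', 'Level1', 'Level2')]
# or take every column under one top-level header
block = df['Level0']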

pandas.concat of multiple data frames using only common columns

I have multiple pandas data frame objects cost1, cost2, cost3 ....
They have different column names (and number of columns) but have some in common.
Number of columns is fairly large in each data frame, hence handpicking the common columns manually will be painful.
How can I append rows from all of these data frames into one single data frame
while retaining elements from only the common column names?
As of now I have
frames=[cost1,cost2,cost3]
new_combined = pd.concat(frames, ignore_index=True)
This obviously contains columns which are not common across all data frames.
For future readers: the above functionality is built into pandas itself.
pd.concat can keep only the common columns if you pass the join='inner' argument, e.g.
pd.concat(frames, join='inner', ignore_index=True)
You can find the common columns with Python's set.intersection:
common_cols = list(set.intersection(*(set(df.columns) for df in frames)))
To concatenate using only the common columns, you can use
pd.concat([df[common_cols] for df in frames], ignore_index=True)
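A tiny sketch with made-up frames showing that both approaches keep the same columns:
import pandas as pd

cost1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
cost2 = pd.DataFrame({'b': [4], 'c': [5], 'd': [6]})
frames = [cost1, cost2]

# join='inner' keeps only the columns present in every frame ('b' and 'c')
inner = pd.concat(frames, join='inner', ignore_index=True)

# the explicit intersection yields the same column set
common_cols = list(set.intersection(*(set(df.columns) for df in frames)))
manual = pd.concat([df[common_cols] for df in frames], ignore_index=True)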
