Join dataframe with different indices - python

please consider the following dataframe with daily dates as its index
date_range = pd.date_range(start_date, end_date)
df1 = pd.DataFrame(index=date_range, columns=['A', 'B'])
now I have a second dataframe df2 whose index is a subset of df1.index
I want to join the data from df2 into df1, and for the missing indices I want to have NaN.
In a second step I want to replace each NaN with the last available value, like this:
2004-03-28 5
2004-03-30 NaN
2004-03-31 NaN
2004-04-01 7
should become
2004-03-28 5
2004-03-30 5
2004-03-31 5
2004-04-01 7
many thanks for your help

Assuming that you have a common index and just a single column that is named the same in both dataframes:
First merge on the index:
df1 = df1.merge(df2, left_index=True, right_index=True, how='left')
Now fill the missing values using ffill, which means forward fill:
df1 = df1.ffill()
In the situation where the columns are not named the same you can either rename the columns:
df2.rename(columns={'old_name': 'new_name'}, inplace=True)
or specify the columns from both the left and right hand side to merge on:
df1.merge(df2, left_on='left_col', right_on='right_col', how='left')
Note that left_index and right_index default to False; set them to True only when you want to merge on the indexes rather than on columns.
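Putting it together, here is a minimal runnable sketch of the merge-then-ffill approach, using made-up dates and values to stand in for the question's data:
import pandas as pd

# Stand-in data: df1 carries the full daily index, df2 only some dates
date_range = pd.date_range('2004-03-28', '2004-04-01')
df1 = pd.DataFrame(index=date_range)
df2 = pd.DataFrame({'A': [5, 7]},
                   index=pd.to_datetime(['2004-03-28', '2004-04-01']))

# Left join on the index: dates missing from df2 get NaN in column 'A'
out = df1.join(df2, how='left')

# Forward fill: each NaN takes the last available value above it
out = out.ffill()
print(out)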

Related

How to get rows from a dataframe that are not joined when merging with other dataframe in pandas

I am trying to make 2 new dataframes by using 2 given dataframe objects:
DF1:
id   feature_text      length
1    "example text"    12
2    "example text2"   13
....

DF2:
id   case_num
3    0
....
As you can see, both df1 and df2 have a column called "id". However, df1 has all id values while df2 only has some of them: df1 has 3200 rows, each with a unique id value (1~3200), whereas df2 only covers a subset (i.e. id=[3,7,20,...]).
What I want to do is 1) get a merged dataframe which contains all rows that have the id values which are included in both df1 and df2, and 2) get a dataframe, which contains the rows in the df1, which have id values that are not included in the df2.
I was able to find a solution for 1), but I have no idea how to do 2).
Thanks.
For the first case, you could use inner merge:
out = df1.merge(df2, on='id')
For the second case, you could use isin with the negation operator ~, so that we filter out the rows in df1 whose ids also exist in df2:
out = df1[~df1['id'].isin(df2['id'])]
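A minimal runnable sketch of both steps, with small stand-in frames in place of the question's 3200-row data:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'feature_text': ['t1', 't2', 't3', 't4'],
                    'length': [2, 2, 2, 2]})
df2 = pd.DataFrame({'id': [3], 'case_num': [0]})

# 1) rows whose id appears in both frames (inner merge, the default)
both = df1.merge(df2, on='id')

# 2) rows of df1 whose id does NOT appear in df2 (anti-join via isin)
only_df1 = df1[~df1['id'].isin(df2['id'])]

print(both)
print(only_df1)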

Why does my merged data frame have multiple columns that are populated with "none" values?

I'm merging 2 pretty large data frames, the shape of RD_4ML is (97058, 24) while the shape of NewDF is (104047, 3). They share a common column called 'personUID', below is the merge code I used.
Final_DF = RD_4ML.merge(NewDF, how='left', on='personUID')
Final_DF.fillna('none', inplace=True)
Final_DF.sample()
DF sample output:
personUID   code    Death   diagnosis_code_type   lr
abc123      ICD10   1       none                  none
Essentially the columns from RD_4ML populate while the 2 columns from NewDF return "none" values. Does anyone know how to solve an error like this?
I think the 'personUID' column does not match between the two dataframes.
Ensure that it has the same data type in both.
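For example, if one frame stores personUID as strings and the other as integers, no keys will match and every right-hand column comes back NaN (later filled with 'none'). A hedged sketch of aligning the dtypes before merging, using the column names from the question:
# Assumption: the key values are textually identical once both are strings
RD_4ML['personUID'] = RD_4ML['personUID'].astype(str)
NewDF['personUID'] = NewDF['personUID'].astype(str)
Final_DF = RD_4ML.merge(NewDF, how='left', on='personUID')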
Merge with how='left' takes every entry from the left dataframe and tries to find a corresponding matching id in the right dataframe. For all non-matching ones, it fills in NaNs for the columns coming from the right frame; in SQL this is called a left join. As an example, have a look at this:
df1 = pd.DataFrame({"uid":range(4), "values": range(4)})
df2 = pd.DataFrame({"uid":range(5, 9), "values2": range(4)})
df1.merge(df2, how="left", on='uid')
# OUTPUT
uid values values2
0 0 0 NaN
1 1 1 NaN
2 2 2 NaN
3 3 3 NaN
Here you see that all uids from the left dataframe end up in the merged dataframe, and as no matching entry was found, the column from the right dataframe is set to NaN.
If your goal is to end up with only the rows that have a match, change "left" to "inner". For more information, have a look at the great pandas docs.
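For contrast, the same toy frames merged with how='inner' keep only the matching uids, which here is none:
df1.merge(df2, how='inner', on='uid')
# OUTPUT: an empty DataFrame, because df1 and df2 share no uid values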

How can I replace missing values in a dataframe with values from another dataframe?

I have two dataframes of different shapes. I want to fill in missing data in my df1 from data that exists in df2.
How do I join these two datasets while keeping the original shape and columns of df1?
I have tried using pd.merge, but I don't think I am getting the syntax right. I have created new columns in the dataframe, but I'm not able to only add data to the NaN values.
I have also tried using combine first, but I don't think I'm doing that right either.
df1 = pd.DataFrame({'a': ["dogs","cats","birds","turtles"], 'b': [1,5,"NA",10]})
print(df1)
df2 = pd.DataFrame({'a': ["birds"],'b': [6]})
print(df2)
df_Final = pd.DataFrame({'a': ["dogs","cats","birds","turtles"], 'b': [1,5,6,10]})
print(df_Final)
I expect the output to be the df_Final dataframe shown here, where the "birds" value, is populated with df2.
How about this?
df1['b'] = df1['b'].where(df1['b'] != 'NA', df1['a'].map(df2.set_index('a')['b']))
Out[166]:
a b
0 dogs 1
1 cats 5
2 birds 6
3 turtles 10
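If the missing entries are real NaN rather than the string "NA", a fillna plus map combination does the same job; a minimal sketch under that assumption:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': ['dogs', 'cats', 'birds', 'turtles'],
                    'b': [1, 5, np.nan, 10]})
df2 = pd.DataFrame({'a': ['birds'], 'b': [6]})

# Look up each key of df1 in df2 and use that value only where 'b' is NaN
df1['b'] = df1['b'].fillna(df1['a'].map(df2.set_index('a')['b']))
print(df1)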

Filter pandas dataframe columns based on other dataframe

I have two dataframes df1 and df2. df1 gives some numerical data on some elements (A,B,C ...) while df2 is a dataframe acting like a classification table with its index being the column names of df1. I would like to filter df1 by only keeping columns that are matching a certain classification in df2.
For instance, let's assume the following two dataframes and that I only want to keep elements (i.e. columns of df1) that belong to class 'C1':
df1 = pd.DataFrame({'A': [1,2],'B': [3,4],'C': [5,6]},index=[0, 1])
df2 = pd.DataFrame({'Name': ['A','B','C'],'Class': ['C1','C1','C2'],'Subclass': ['C11','C12','C21']},index=[0, 1, 2])
df2 = df2.set_index('Name')
The expected result should be the dataframe df1 with only columns A and B, because in df2 we can see that A and B are in class C1. I am not sure how to do that. I was thinking about first filtering df2 by the 'C1' values in its 'Class' column and then checking whether df1.columns are in df2.index, but I suppose there is a more efficient way to do that. Thanks for your help.
Here is one way using index slice
df1.loc[:,df2.index[df2.Class=='C1']]
Out[578]:
Name A B
0 1 3
1 2 4
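As a self-contained version of the same idea (frames recreated from the question, with the Subclass strings quoted):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Class': ['C1', 'C1', 'C2'],
                    'Subclass': ['C11', 'C12', 'C21']}).set_index('Name')

# Boolean mask on df2, then use the surviving index labels as df1's columns
cols = df2.index[df2['Class'] == 'C1']
print(df1.loc[:, cols])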

Pandas Merge a Grouped-by dataframe with another dataframe for each group

I have a dataframe like:
id   date         temperature
1    2011-09-12   12
     2011-09-15   12
     2011-10-13   12
2    2011-12-12   14
     2011-12-24   15
I want to make sure that each device id has a temperature recording for each day; if a value exists it will be copied from above, and if it doesn't I will put 0.
so, I prepare another dataframe which has dates for the entire year:
using pd.DataFrame(0, index=pd.date_range('2011-01-01', '2011-12-12'), columns=['temperature'])
date temperature
2011-01-01 0
.
.
.
2011-12-12 0
Now, for each id I want to merge this dataframe so that I have entire year's entry for each of the id.
I am stuck at the merge step, just merging on the date column does not work, i.e.
pd.merge(df1, df2, on=['date'])
gives a blank dataframe.
As an alternative to jezrael's answer, you could also do the following iteration, especially if you want to keep your device id intact:
data={"date":[pd.Timestamp('2011-09-12'), pd.Timestamp('2011-09-15'), pd.Timestamp('2011-10-13'),pd.Timestamp('2011-12-12'),pd.Timestamp('2011-12-24')],"temperature":[12,12,12,14,15],"sensor_id":[1,1,1,2,2]}
df1=pd.DataFrame(data,index=data["sensor_id"])
df2=pd.DataFrame(0, index=pd.date_range('2011-01-01', '2011-12-12'), columns=['temperature','sensor_id'])
for i,row in df1.iterrows():
df2.loc[df2.index==row["date"], ['temperature']] = row['temperature']
df2.loc[df2.index==row["date"], ['sensor_id']] = row['sensor_id']
for t in data["date"]:
print(df2[df2.index==t])
Note that df2 in your question only goes to 2011-12-12, hence the last print() will return an empty DataFrame. I wasn't sure whether you did this on purpose.
Also, depending on the variability and density in your actual data, it might make sense to use:
for s in [1, 2]:  # iterate over device ids
    ma = (df['sensor_id'] == s)
    df.loc[ma] = df.loc[ma].ffill()  # fill forward
hence an incomplete time series would be filled (forward) by the last measured temperature value. Depends on the quality of your data, of course, and df.resample() might make more sense.
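A hedged sketch of that resample route, assuming the flat df1 above with date, temperature and sensor_id columns: group per device, upsample to daily frequency, and forward-fill within each group (note this only fills between each device's first and last reading, not the whole year):
filled = (df1.set_index('date')
             .groupby('sensor_id')['temperature']
             .resample('D')
             .ffill()
             .reset_index())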
Create a MultiIndex with MultiIndex.from_product and merge on both MultiIndexes:
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  pd.date_range('2011-01-01', '2011-12-12')],
                                 names=['id', 'date'])
df1 = pd.DataFrame(0, index=mux, columns=['temperature'])
df = pd.merge(df1, df, left_index=True, right_index=True, how='left')
If you want only one temperature column:
df = pd.merge(df1, df, left_index=True, right_index=True, how='left', suffixes=('','_'))
df['temperature'] = df.pop('temperature_').fillna(df['temperature'])
Another idea is to use itertools.product to build a two-column DataFrame:
from itertools import product
data = list(product(df.index.levels[0], pd.date_range('2011-01-01', '2011-12-12')))
df1 = pd.DataFrame(data, columns=['id','date'])
df = pd.merge(df1, df, left_on=['id','date'], right_index=True, how='left')
Another idea is to use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  pd.date_range('2011-01-01', '2011-12-12')],
                                 names=['id', 'date'])
df = df.reindex(mux, fill_value=0)
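A self-contained demo of the reindex idea, with the question's data rebuilt as an (id, date) MultiIndex (the frame construction here is an assumption about the original layout):
import pandas as pd

df = pd.DataFrame(
    {'temperature': [12, 12, 12, 14, 15]},
    index=pd.MultiIndex.from_arrays(
        [[1, 1, 1, 2, 2],
         pd.to_datetime(['2011-09-12', '2011-09-15', '2011-10-13',
                         '2011-12-12', '2011-12-24'])],
        names=['id', 'date']))

mux = pd.MultiIndex.from_product(
    [df.index.levels[0], pd.date_range('2011-01-01', '2011-12-12')],
    names=['id', 'date'])

# Every (id, day) pair for the range; missing readings become 0
print(df.reindex(mux, fill_value=0))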
