pandas how to outer join without creating new columns - python

I have 2 pandas dataframes like this:
df1:
date      value
20100101  100
20100102  150
df2:
date      value
20100102  150.01
20100103  180
The expected output should be:
date value
20100101 100
20100102 150
20100103 180
The 2nd dataframe always contains the newest values that I'd like to add into the 1st dataframe. However, the value on the same day may differ slightly between the two dataframes. I would like to ignore duplicate dates and just add the new dates and values to the 1st dataframe.
I've tried an outer join in pandas, but it gives me two columns, value_x and value_y, because the values on matching dates are not exactly the same. Any solution to this?

I believe you need concat with drop_duplicates:
df = pd.concat([df1,df2]).drop_duplicates('date', keep='last')
print (df)
date value
0 20100101 100.00
0 20100102 150.01
1 20100103 180.00
df = pd.concat([df1,df2]).drop_duplicates('date', keep='first')
print (df)
date value
0 20100101 100.0
1 20100102 150.0
1 20100103 180.0
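For reference, a minimal runnable sketch of the keep='first' variant (which matches the expected output above); the two frames are simply reconstructed from the question, and sort_values only guards against out-of-order dates in the general case:
import pandas as pd

# frames as shown in the question
df1 = pd.DataFrame({'date': [20100101, 20100102], 'value': [100, 150]})
df2 = pd.DataFrame({'date': [20100102, 20100103], 'value': [150.01, 180]})

# stack both frames, keep the first row per date (i.e. prefer df1's value),
# then restore chronological order
df = (pd.concat([df1, df2])
        .drop_duplicates('date', keep='first')
        .sort_values('date')
        .reset_index(drop=True))
print(df)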

Related

How does (DataFrame - Groupby) match rows?

I can't figure out how (DataFrame - Groupby) works.
Specifically, given the following dataframe:
df = pd.DataFrame([['usera',1,100],['usera',5,130],['userc',1,100],['userd',5,100]])
df.columns = ['id','date','sum']
id date sum
0 usera 1 100
1 usera 5 130
2 userc 1 100
3 userd 5 100
Running the code below returns:
df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1)
id date sum shift
0 usera 1 100 NaN
1 usera 5 130 4.0
2 userc 1 100 NaN
3 userd 5 100 NaN
How did Python know that I meant for it to match by the id column?
It doesn't even appear in df['date'].
Let us dissect the command df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1).
df['shift'] appends a new column "shift" to the dataframe.
df['date'] returns Series using date column from the dataframe.
0 1
1 5
2 1
3 5
Name: date, dtype: int64
In df.groupby(['id'])['date'].shift(1), groupby(['id']) creates a groupby object.
From that groupby object we select the date column and shift it by one row within each group (i.e. take the previous value) using shift(1). By the way, this is also a Series.
df.groupby(['id'])['date'].shift(1)
0 NaN
1 1.0
2 NaN
3 NaN
Name: date, dtype: float64
The Series obtained in step 3 is subtracted (element-wise, aligned on the index) from the Series obtained in step 2. The result is assigned to the df['shift'] column.
df['date']-df.groupby(['id'])['date'].shift(1)
0 NaN
1 4.0
2 NaN
3 NaN
Name: date, dtype: float64
I am not exactly sure what you are trying to do, but the groupby() method is useful if you have several identical values in a column (like your usera) and you want to calculate, for example, the sum(), mean(), max(), etc. of all columns or just one specific column.
e.g. df.groupby(['id'])['sum'].sum() groups your usera rows, selects just the sum column and builds the sum over all usera rows, so it is 230. If you used .mean() it would output 115, etc. It also does this for every other unique id in your id column. In the example above it outputs one column with just three rows (usera, userc and userd).
Greetz, miGa
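For reference, a small runnable sketch of the aggregations mentioned above, using the df defined in the question:
import pandas as pd

df = pd.DataFrame([['usera', 1, 100], ['usera', 5, 130],
                   ['userc', 1, 100], ['userd', 5, 100]],
                  columns=['id', 'date', 'sum'])

# total of the 'sum' column per id: usera -> 230, userc -> 100, userd -> 100
print(df.groupby(['id'])['sum'].sum())

# mean of the 'sum' column per id: usera -> 115.0, userc -> 100.0, userd -> 100.0
print(df.groupby(['id'])['sum'].mean())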

Python: compare dataframes based on two conditions

I have the following two dataframes:
df1:
date id
2000 1
2001 1
2002 2
df2:
date id
2000 1
2002 2
I now want to extract a list of observations that are in df1 but not in df2 based on date AND id.
The result should look like this:
date id
2001 1
I know how to compare a column to a list with isin, like this:
result = df1[~df1["id"].isin(df2["id"].tolist())]
However, this would only compare the two dataframes based on the id column. Because an id can appear in both df1 and df2 but on different dates, it is important that the comparison matches on both id and date together. Does somebody know how to do that?
Using merge
In [795]: (df1.merge(df2, how='left', indicator='_a')
               .query('_a == "left_only"')
               .drop('_a', axis=1))
Out[795]:
date id
1 2001 1
Details
In [796]: df1.merge(df2, how='left', indicator='_a')
Out[796]:
date id _a
0 2000 1 both
1 2001 1 left_only
2 2002 2 both
In [797]: df1.merge(df2, how='left', indicator='_a').query('_a == "left_only"')
Out[797]:
date id _a
1 2001 1 left_only
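For reference, a self-contained sketch of this merge/indicator anti-join, reconstructing the two small frames from the question; merging without an explicit on= uses all shared columns, here date and id:
import pandas as pd

df1 = pd.DataFrame({'date': [2000, 2001, 2002], 'id': [1, 1, 2]})
df2 = pd.DataFrame({'date': [2000, 2002], 'id': [1, 2]})

# left merge on the shared columns; the indicator column records whether
# each row was found in both frames or only in df1
only_in_df1 = (df1.merge(df2, how='left', indicator='_a')
                  .query('_a == "left_only"')
                  .drop('_a', axis=1))
print(only_in_df1)   # -> the single row 2001 / 1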

merging two dataframes on column and index

Hi, so I have two dataframes. The first one was created by grouping another df by id (which is now the index) and then sorting by the 'due' column.
df1:
paid due
id
3 13.000000 5.000000
2 437.000000 5.000000
5 90.000000 5.000000
1 60.000000 5.000000
4 675.000000 5.000000
The other one is a normal dataframe which has 3 columns: 'id', 'name' and 'country'.
df2:
id name country
1 'AB' 'DE'
2 'CD' 'DE'
3 'EF' 'NL'
4 'HAH' 'SG'
5 'NOP' 'NOR'
So what I was trying to do is add the 'name' column to the 1st dataframe based on the id number (which is the index in the first df and a column in the second one).
So I thought this code would work:
pd.merge(df1, df2['name'], left_index=True, right_on='id')
But I get this error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
You can use rename to map the index by a dict-like Series:
df1['name'] = df1.rename(index=df2.set_index('id')['name']).index
print (df1)
paid due name
id
3 13.0 5.0 'EF'
2 437.0 5.0 'CD'
5 90.0 5.0 'NOP'
1 60.0 5.0 'AB'
4 675.0 5.0 'HAH'
You might find that pd.concat is a better option here because it can accept a mix of dataframe and series: http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-with-mixed-ndims.
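For illustration, a hedged sketch of what that pd.concat approach could look like here, with the frames reconstructed from the question; setting id as the index of the name Series lets concat align it with df1's index, though concat may reorder the rows to the union of the two indexes:
import pandas as pd

df1 = pd.DataFrame({'paid': [13.0, 437.0, 90.0, 60.0, 675.0],
                    'due': [5.0, 5.0, 5.0, 5.0, 5.0]},
                   index=pd.Index([3, 2, 5, 1, 4], name='id'))
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'name': ['AB', 'CD', 'EF', 'HAH', 'NOP'],
                    'country': ['DE', 'DE', 'NL', 'SG', 'NOR']})

# give the name Series the same index (id) as df1, then let concat align on it
result = pd.concat([df1, df2.set_index('id')['name']], axis=1)
print(result)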
Okay, so I figured out that I can't really pass a single column of a dataframe that way, but I can remake df2 so that it contains only the needed columns:
df2=df2[['id', 'name']]
pd.merge(df1, df2, left_index=True, right_on='id')
And there is no error anymore.

Taking Min and Max of Contiguous Rows in a Pandas Dataframe

I have some data that looks something like this:
ID Value Starts Ends
0 A 1 2000-01-01 2000-06-01
1 A 2 2000-06-02 2000-12-31
2 A 1 2001-01-01 2001-06-01
3 A 1 2001-06-02 2001-12-31
What I want to do is collapse consecutive rows where the ID and Value are the same. So ideally the output would be:
ID Value Starts Ends
0 A 1 2000-01-01 2000-06-01
1 A 2 2000-06-02 2000-12-31
2 A 1 2001-01-01 2001-12-31
However, if you naively take np.min(Starts) and np.max(Ends), it appears that (A, 1) spans the period covered by (A, 2).
gb = df.groupby(['ID', 'Value'], as_index=False)
df = gb.agg({'Starts': np.min, 'Ends': np.max})
ID Value Starts Ends
0 A 1 2000-01-01 2001-12-31
1 A 2 2000-06-02 2000-12-31
Is there an efficient way to get Pandas to do what I want?
If you add a column (let's call it "extra") that increments each time the groupby category changes, you can groupby that instead. The challenge is then to make the addition of the new column efficient, and this is the most vectorized way I can think of to make it work.
increment = pd.Series((df.Value[:-1].values != df.Value[1:].values) |
                      (df.ID[:-1].values != df.ID[1:].values)).cumsum()
df["extra"] = pd.concat((pd.Series([0]), increment), ignore_index=True)
The first line takes the cumulative sum of a boolean array showing differing lines, then the second tacks on a zero at the front and adds it to the dataframe.
Then you can do
gb = df.groupby(['extra'], as_index=False)
df = gb.agg({'Starts': np.min, 'Ends': np.max})
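For reference, a sketch of an alternative, shift-based way to build the same run id, together with an agg that also keeps the ID and Value columns (an editorial variant, not part of the answer above):
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'A'],
                   'Value': [1, 2, 1, 1],
                   'Starts': ['2000-01-01', '2000-06-02', '2001-01-01', '2001-06-02'],
                   'Ends': ['2000-06-01', '2000-12-31', '2001-06-01', '2001-12-31']})

# a new run starts whenever ID or Value differs from the previous row
run_id = ((df['ID'] != df['ID'].shift()) | (df['Value'] != df['Value'].shift())).cumsum()

collapsed = (df.groupby(run_id)
               .agg({'ID': 'first', 'Value': 'first',
                     'Starts': 'min', 'Ends': 'max'})
               .reset_index(drop=True))
print(collapsed)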
Just do df.drop_duplicates(subset=['ID', 'Value'], inplace=True)
This will drop the rows where you have duplicate ID and Value values.

Merge Only When Value is Empty/Null in Pandas

I have two dataframes in Pandas which are being merged together, df.A and df.B; df.A is the original, and df.B has the new data I want to bring over. The merge works fine and, as expected, I get two columns, col_x and col_y, in the merged df.
However, in some rows the original df.A has values where df.B does not. My question is, how can I selectively take the values from col_x and col_y and place them into a new column such as col_z?
Here's what I mean, how can I merge df.A:
date impressions spend col
1/1/15 100000 3.00 ABC123456
1/2/15 145000 5.00 ABCD00000
1/3/15 300000 15.00 (null)
with df.B
date col
1/1/15 (null)
1/2/15 (null)
1/3/15 DEF123456
To get:
date impressions spend col_z
1/1/15 100000 3.00 ABC123456
1/2/15 145000 5.00 ABCD00000
1/3/15 300000 15.00 DEF123456
Any help or point in the right direction would be really appreciated!
Thanks
OK, assuming that your (null) values are in fact NaN values and not that string, then the following works:
In [10]:
# create the merged df
merged = dfA.merge(dfB, on='date')
merged
Out[10]:
date impressions spend col_x col_y
0 2015-01-01 100000 3 ABC123456 NaN
1 2015-01-02 145000 5 ABCD00000 NaN
2 2015-01-03 300000 15 NaN DEF123456
You can use where to conditionally assign a value from the _x and _y columns:
In [11]:
# now create col_z using where
merged['col_z'] = merged['col_x'].where(merged['col_x'].notnull(), merged['col_y'])
merged
Out[11]:
date impressions spend col_x col_y col_z
0 2015-01-01 100000 3 ABC123456 NaN ABC123456
1 2015-01-02 145000 5 ABCD00000 NaN ABCD00000
2 2015-01-03 300000 15 NaN DEF123456 DEF123456
You can then drop the extraneous columns:
In [13]:
merged = merged.drop(['col_x','col_y'],axis=1)
merged
Out[13]:
date impressions spend col_z
0 2015-01-01 100000 3 ABC123456
1 2015-01-02 145000 5 ABCD00000
2 2015-01-03 300000 15 DEF123456
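As a side note, Series.fillna on the merged frame from above produces the same col_z in a single step; this is just an equivalent variant of the where call, not a different technique:
# take col_x where it is present, otherwise fall back to col_y
merged['col_z'] = merged['col_x'].fillna(merged['col_y'])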
IMO the shortest and yet readable solution is something like this:
df.A.loc[df.A['col'].isna(), 'col'] = df.A.merge(df.B, how='left', on='date')['col_y']
What it basically does is assign values from the merged table's col_y column to the primary df.A table, for those rows where the col column is empty (the .isna() condition).
If you have data that contains NaNs and you want to fill the NaNs from another dataframe
(that matches on index and column names), you can do the following:
df_A : the target DataFrame that contains NaN elements
df_B : the source DataFrame that completes the missing elements
df_A = df_A.where(df_A.notnull(),df_B)
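For reference, DataFrame.combine_first expresses the same index-aligned fill directly; a minimal sketch with two small hypothetical frames:
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'col': ['ABC123456', 'ABCD00000', np.nan]})
df_B = pd.DataFrame({'col': [np.nan, np.nan, 'DEF123456']})

# keep df_A's values and fill its NaNs from df_B, matching on index and column names
filled = df_A.combine_first(df_B)
print(filled)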
