Merge two pandas DataFrames and skip common columns of the right one - Python

I am using a pandas DataFrame as a lightweight dataset to maintain some state, and I need to dynamically/continuously merge new DataFrames into the existing table. Say I have two datasets as below:
df1:
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df2:
b c
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
I want to merge df2 into df1 (on the index), and for columns in common (in this case 'b'), simply discard the common column of df2.
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
My current code finds the common columns of df1 and df2 with set(), then manually drops them from df2. Is there a more efficient way to do this?
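For reference, the set-based approach described in the question might look like this (frames reconstructed from the example data above):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 2, 4, 6, 8], 'b': [1, 3, 5, 7, 9]})
df2 = pd.DataFrame({'b': [10, 12, 14, 16, 18], 'c': [11, 13, 15, 17, 19]})

# Manual approach: find the overlapping columns with set(),
# drop them from df2, then join on the index
common = set(df1.columns) & set(df2.columns)
merged = df1.join(df2.drop(columns=list(common)))
```

This works, but the answers below avoid building an intermediate set by hand.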

First, identify the columns of df2 that are not in df1:
cols = df2.columns.difference(df1.columns)
Then use pd.DataFrame.join:
df1.join(df2[cols])
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
Or pd.concat will also work
pd.concat([df1, df2[cols]], axis=1)
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19

pandas' merge can also be used here, but with a caveat: when the on argument is omitted, merge joins on every column the two frames have in common. In this example that is 'b', and since the b values of df1 and df2 never match, pd.merge(df1, df2, how='inner') returns an empty frame rather than the result above. To merge on the index while discarding the overlapping column, combine merge with the column difference:
pd.merge(df1, df2[df2.columns.difference(df1.columns)], left_index=True, right_index=True)
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19


Create columns from index values

Let's say I have data shaped as in this example:
import pandas as pd

idx = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['a', 'b', 'c']],
                                 names=['numbers', 'letters'])
col = ['Value']
df = pd.DataFrame(list(range(18)), idx, col)
print(df.unstack())
print(df.unstack())
The output will be
Value
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
letters and numbers are indexes and Value is the only column
The question is how can I replace Value column with columns named as values of index letters?
So I would like to get such output
numbers a b c
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
where a, b and c are columns and numbers is the only index.
Appreciate your help.
The problem is that you are calling unstack on the DataFrame rather than on the Series. Select the column first, then unstack and drop the name of the columns axis:
df.Value.unstack().rename_axis(None, axis=1)
Out[151]:
a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
Wen-Ben's answer prevents you from running into a data frame with multiple column levels in the first place.
If you happened to be stuck with a multi-index column anyway, you can get rid of it by using .droplevel():
df = df.unstack()
df.columns = df.columns.droplevel()
df
Out[7]:
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
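In recent pandas versions the same cleanup can be written in one step, since DataFrame.droplevel accepts an axis argument (a small self-contained sketch of the example data):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b', 'c']],
                                 names=['numbers', 'letters'])
df = pd.DataFrame({'Value': range(9)}, index=idx)

# unstack() on the full frame leaves a two-level column index ('Value', letter);
# droplevel(0, axis=1) removes the top 'Value' level in one call
out = df.unstack().droplevel(0, axis=1)
```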

Python Pandas dataframe is not including all duplicates

I'm basically trying to create a Pandas dataframe (CQUAD_mech_loads) that is a subset of a larger dataframe (CQUAD_Mech). This subset dataframe is essentially created by filtering based on two conditions. There are NO duplicates in the larger dataframe (CQUAD_Mech).
The problem is that my subset dataframe doesn't include the duplicate IDs in the ELM column. It does, however, include duplicates in the LC column.
CQUAD_ELM is a list containing four IDs ([387522, 387522, 387506, 387507]), with the ID 387522 duplicated. Right now, CQUAD_mech_loads is a dataframe with only three rows, one per unique ID. I want that fourth row for the duplicate ID in there as well.
The code:
def get_df(df, col1, cond1, col2='', cond2=()):
    return df[(df[col1] == cond1) & (df[col2].isin(cond2))].reset_index(drop=True)

CQUAD_mech_loads = get_df(CQUAD_Mech, 'LC', LC, 'ELM', CQUAD_ELM)
The output (where is the other line for 387522?):
LC ELM FX FY FXY
0 3113 387506 0 0 0
1 3113 387507 0 0 0
2 3113 387522 0 0 0
Since you're dropping the index anyway, you can just set the index to be the column you're interested in and use .loc indexing (the old .ix indexer has been removed from modern pandas):
In [28]: df = pd.DataFrame(np.arange(25).reshape(5,5))
In [29]: df
Out[29]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
In [30]: df.set_index(4, drop=False).loc[[4,4,19,4,24]].reset_index(drop=True)
Out[30]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 15 16 17 18 19
3 0 1 2 3 4
4 20 21 22 23 24
EDIT: Your current method returns each matching row only once, no matter how many times a value appears in the lookup list. If you want to filter on multiple columns, just do it twice, once for each column:
In [98]: df.set_index(1, drop=False).loc[[1, 6, 16]].set_index(4, drop=False).loc[[4,4,4,4,4,4,4,4,19,9]].reset_index(drop=True)
Out[98]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
5 0 1 2 3 4
6 0 1 2 3 4
7 0 1 2 3 4
8 15 16 17 18 19
9 5 6 7 8 9
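An alternative that keeps duplicate IDs without any index tricks is to turn the lookup list into its own frame and merge, since a merge produces one output row per row of the lookup frame. The data below is a hypothetical reconstruction of the frames in the question:

```python
import pandas as pd

# Hypothetical reconstruction of the question's data
CQUAD_Mech = pd.DataFrame({
    'LC': [3113, 3113, 3113],
    'ELM': [387506, 387507, 387522],
    'FX': [0, 0, 0], 'FY': [0, 0, 0], 'FXY': [0, 0, 0],
})
CQUAD_ELM = [387522, 387522, 387506, 387507]

# One output row per row of the lookup frame, so duplicated IDs
# in CQUAD_ELM yield duplicated rows in the result
lookup = pd.DataFrame({'ELM': CQUAD_ELM})
CQUAD_mech_loads = lookup.merge(CQUAD_Mech[CQUAD_Mech['LC'] == 3113], on='ELM')
```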

Zip pandas dataframes into a new dataframe

I have 2 dataframes:
df_A
country_codes
0 4
1 8
2 12
3 16
4 24
and df_B
continent_codes
0 4
1 3
2 5
3 6
4 5
Both dataframes have the same length but no common column. I want to concatenate the two, but since not all index values are common I get lots of NaNs. How do I concatenate or zip them up into a combined dataframe?
EDIT: the desired output is this:
country_codes continent_codes
0 4 4
1 8 3
2 12 5
3 16 6
4 24 5
The following does what you want:
pd.concat([df_A, df_B], axis=1)
Output:
country_codes continent_codes
0 4 4
1 8 3
2 12 5
3 16 6
4 24 5
From the comments:
I feel like this is too simple, but may I suggest:
df_A['continent_codes'] = df_B['continent_codes']
print(df_A)
Output:
country_codes continent_codes
0 4 4
1 8 3
2 12 5
3 16 6
4 24 5
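The NaNs mentioned in the question come from index alignment: concat matches rows by index label, not by position. If the two frames carry different indexes, resetting both first forces a purely positional pairing (a sketch with a deliberately shifted index on df_B):

```python
import pandas as pd

df_A = pd.DataFrame({'country_codes': [4, 8, 12, 16, 24]})
df_B = pd.DataFrame({'continent_codes': [4, 3, 5, 6, 5]},
                    index=[10, 11, 12, 13, 14])

# With mismatched index labels a plain concat would create NaN-padded rows;
# resetting both indexes pairs the rows by position instead
combined = pd.concat([df_A.reset_index(drop=True),
                      df_B.reset_index(drop=True)], axis=1)
```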

How do I find the minimum of two DataFrame columns with multi-indices in Python pandas?

I have two pandas DataFrames, df1 and df2. df1 has a MultiIndex:
A
instance index
a 0 10
1 11
2 7
b 0 8
1 9
2 13
The frame df2 has the same first-level index as df1:
B
instance
a 5
b 12
I want to do two things:
1) Assign the values in df2 to all the rows of df1:
A B
instance index
a 0 10 5
1 11 5
2 7 5
b 0 8 12
1 9 12
2 13 12
2) Create a dataframe object that represents the minimum of values in A and B without concatenating the two dataframes like above:
min(df1,df2):
min
instance index
a 0 5
1 5
2 5
b 0 8
1 9
2 12
For your first request, you can use DataFrame.join:
>>> df1.join(df2)
A B
instance index
a 0 10 5
1 11 5
2 7 5
b 0 8 12
1 9 12
2 13 12
For your second, you can simply call min(axis=1) on that object:
>>> df1.join(df2).min(axis=1).to_frame("min")
min
instance index
a 0 5
1 5
2 5
b 0 8
1 9
2 12
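If you would rather not build the joined frame at all, one option (a sketch, relying on reindex's level broadcasting) is to align B to df1's MultiIndex and take an elementwise minimum:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]],
                                 names=['instance', 'index'])
df1 = pd.DataFrame({'A': [10, 11, 7, 8, 9, 13]}, index=idx)
df2 = pd.DataFrame({'B': [5, 12]},
                   index=pd.Index(['a', 'b'], name='instance'))

# Broadcast B across the 'instance' level of df1's MultiIndex,
# then take the elementwise minimum of the two aligned Series
b_aligned = df2['B'].reindex(df1.index, level='instance')
result = np.minimum(df1['A'], b_aligned).to_frame('min')
```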

Summing over a DataFrame with two conditions and multiple values

I have a DataFrame x with three columns:
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y with two values:
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns:
t sum_c
1 3 18
2 7 13
... where t comes from y, and sum_c is the sum of c over all rows of x for which t is larger than a and smaller than b.
Would anybody be able to help me with this?
Here is a possible solution based on the given condition (note that the expected results listed in your question don't quite line up with that condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
Then write a function to be used by pandas' apply() later:
In[101]: def cond_sum(x):
    ...:     return df1.loc[(df1['a'] < x['t']) & (df1['b'] > x['t']), 'c'].sum()
finally:
In[102]: df3 = df2.apply(cond_sum,axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
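Since df2 only has a handful of rows, a plain list comprehension with boolean masks is an equally readable alternative to apply (same reconstructed data as in the answer):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5, 4, 2, 1],
                    'b': [10, 6, 6, 11, 2],
                    'c': [4, 5, 5, 9, 10]})
df2 = pd.DataFrame({'t': [3, 5]})

# For each t, sum c over the rows of df1 where a < t < b
df2['sum_c'] = [df1.loc[(df1['a'] < t) & (df1['b'] > t), 'c'].sum()
                for t in df2['t']]
```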
