I would like to merge two DataFrames while creating a multilevel column naming scheme denoting which DataFrame each column came from. For example:
In [98]: A=pd.DataFrame(np.arange(9.).reshape(3,3),columns=list('abc'))
In [99]: A
Out[99]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
In [100]: B=A.copy()
If I use pd.merge(), then I get
In [104]: pd.merge(A,B,left_index=True,right_index=True)
Out[104]:
a_x b_x c_x a_y b_y c_y
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
That is what I expect from that statement. What I would like (but don't know how to get!) is:
In [104]: <<one or more statements>>
Out[104]:
     A        B
   a  b  c  a  b  c
0  0  1  2  0  1  2
1  3  4  5  3  4  5
2  6  7  8  6  7  8
Can this be done without changing the original pd.DataFrame calls? I am reading the data into the DataFrames from .csv files, and that might be my problem.
In the first case below, the A and B groups may come out in arbitrary order (the columns within each group keep their order; only the order of A and B can change).
The second case preserves the order given.
IMHO this is pandonic!
In [5]: pd.concat(dict(A = A, B = B),axis=1)
Out[5]:
     A        B
   a  b  c  a  b  c
0  0  1  2  0  1  2
1  3  4  5  3  4  5
2  6  7  8  6  7  8
In [6]: pd.concat([ A, B ], keys=['A','B'],axis=1)
Out[6]:
     A        B
   a  b  c  a  b  c
0  0  1  2  0  1  2
1  3  4  5  3  4  5
2  6  7  8  6  7  8
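Since the question mentions reading the data from .csv files, here is a minimal sketch of the same keys= approach starting from CSV input. The in-memory strings csv_a and csv_b are hypothetical stand-ins for the real files; with actual files you would pass their paths to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the two .csv files.
csv_a = "a,b,c\n0,1,2\n3,4,5\n6,7,8\n"
csv_b = "a,b,c\n0,1,2\n3,4,5\n6,7,8\n"

A = pd.read_csv(io.StringIO(csv_a))
B = pd.read_csv(io.StringIO(csv_b))

# keys= supplies the outer column level; A and B themselves are not modified.
combined = pd.concat([A, B], keys=['A', 'B'], axis=1)

# Selecting an outer key returns the columns that came from that frame.
print(combined['B'])

# Selecting an (outer, inner) tuple returns a single column.
print(combined[('A', 'c')])
```

Because the source frames are left untouched, this should work regardless of how they were constructed.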
Here's one way, which does change A and B:
In [10]: from itertools import cycle
In [11]: A.columns = pd.MultiIndex.from_tuples(zip(cycle('A'), A.columns))
In [12]: A
Out[12]:
A
a b c
0 0 1 2
1 3 4 5
2 6 7 8
In [13]: B.columns = pd.MultiIndex.from_tuples(zip(cycle('B'), B.columns))
In [14]: A.join(B)
Out[14]:
     A        B
   a  b  c  a  b  c
0  0  1  2  0  1  2
1  3  4  5  3  4  5
2  6  7  8  6  7  8
I actually think this would be a good alternative behaviour, rather than suffixes...
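If changing A and B in place is undesirable, the same idea can be sketched without mutation. The helper name with_outer_level is hypothetical (not a pandas function); it relies on set_axis returning a new frame rather than modifying the original.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=list('abc'))
B = A.copy()


def with_outer_level(frame, label):
    """Return a new frame whose columns gain `label` as an outer level."""
    new_columns = pd.MultiIndex.from_product([[label], frame.columns])
    # set_axis returns a new object, so the caller's frame keeps its flat columns.
    return frame.set_axis(new_columns, axis=1)


joined = with_outer_level(A, 'A').join(with_outer_level(B, 'B'))
print(joined)
```

After this runs, A.columns and B.columns are still the flat 'a', 'b', 'c'.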
Related
I would like to obtain the 'Value' column below, from the original df:
A B C Column_To_Use
0 2 3 4 A
1 5 6 7 C
2 8 0 9 B
A B C Column_To_Use Value
0 2 3 4 A 2
1 5 6 7 C 7
2 8 0 9 B 0
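Use DataFrame.lookup (note that it was deprecated in pandas 1.2.0 and removed in 2.0, so this only works on older versions):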
df['Value'] = df.lookup(df.index, df['Column_To_Use'])
print (df)
A B C Column_To_Use Value
0 2 3 4 A 2
1 5 6 7 C 7
2 8 0 9 B 0
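On pandas versions where DataFrame.lookup is no longer available, one possible replacement is NumPy integer indexing. This is a sketch, not the only option; the name value_cols is an assumption introduced here to keep the lookup restricted to the numeric columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [2, 5, 8],
    'B': [3, 6, 0],
    'C': [4, 7, 9],
    'Column_To_Use': ['A', 'C', 'B'],
})

# Hypothetical helper name: the columns that may be looked up.
value_cols = ['A', 'B', 'C']

# Position of each row's chosen column within value_cols.
col_pos = df[value_cols].columns.get_indexer(df['Column_To_Use'])
row_pos = np.arange(len(df))

# Pick one element per row: (row 0, col A), (row 1, col C), (row 2, col B).
df['Value'] = df[value_cols].to_numpy()[row_pos, col_pos]
print(df)
```

get_indexer returns -1 for a label that is not among value_cols, which would silently select the last column, so it is worth checking for -1 if the labels are not guaranteed to be valid.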
Let's say I have the following series:
0 A
1 B
2 C
dtype: object
0 1
1 2
2 3
3 4
dtype: int64
How can I merge them to create an empty dataframe with every possible combination of values, like this:
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
Assuming the two series are s and s1, use itertools.product(), which gives the Cartesian product of the input iterables:
import itertools
df = pd.DataFrame(list(itertools.product(s,s1)),columns=['letter','number'])
print(df)
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As of Pandas 1.2.0, there is a how='cross' option in pandas.merge() that produces the Cartesian product of the columns.
import pandas as pd
letters = pd.DataFrame({'letter': ['A','B','C']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As an additional bonus, this function makes it easy to do so with more than one column.
letters = pd.DataFrame({'letterA': ['A','B','C'],
'letterB': ['D','D','E']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letterA letterB number
0 A D 1
1 A D 2
2 A D 3
3 A D 4
4 B D 1
5 B D 2
6 B D 3
7 B D 4
8 C E 1
9 C E 2
10 C E 3
11 C E 4
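For pandas versions older than 1.2.0, where how='cross' is not available, a commonly used workaround is to merge on a temporary constant key. The sketch below assumes the column name _key is not already in use in either frame.

```python
import pandas as pd

letters = pd.DataFrame({'letter': ['A', 'B', 'C']})
numbers = pd.DataFrame({'number': [1, 2, 3, 4]})

# Give every row the same key so each left row matches every right row,
# then drop the helper column again.
together = (
    letters.assign(_key=1)
    .merge(numbers.assign(_key=1), on='_key')
    .drop(columns='_key')
)
print(together)
```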
If you have two Series s1 and s2 (named "s1" and "s2", since the final selection is by those names), you can do this:
pd.DataFrame(index=s1,columns=s2).unstack().reset_index()[["s1","s2"]]
It will give you the following:
s1 s2
0 A 1
1 B 1
2 C 1
3 A 2
4 B 2
5 C 2
6 A 3
7 B 3
8 C 3
9 A 4
10 B 4
11 C 4
You can use pandas.MultiIndex.from_product():
import pandas as pd
pd.DataFrame(
index = pd.MultiIndex
.from_product(
[
['A', 'B', 'C'],
[1, 2, 3, 4]
],
names = ['letters', 'numbers']
)
)
which results in a hierarchical structure:
letters numbers
A 1
2
3
4
B 1
2
3
4
C 1
2
3
4
and you can further call .reset_index() to get ungrouped results:
letters numbers
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
(However, I find @NickCHK's answer to be the best)
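For the MultiIndex.from_product() approach, a slightly more direct route to the flat table is MultiIndex.to_frame(index=False), which skips the intermediate empty DataFrame. A minimal sketch:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], [1, 2, 3, 4]],
    names=['letters', 'numbers'],
)

# index=False yields a default 0..n-1 index instead of reusing the MultiIndex.
df = index.to_frame(index=False)
print(df)
```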
Given a sample MultiIndex:
idx = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b', 'c', 'd']])
df = pd.DataFrame({'value' : np.arange(12)}, index=idx)
df
value
0 a 0
b 1
c 2
d 3
1 a 4
b 5
c 6
d 7
2 a 8
b 9
c 10
d 11
How can I efficiently convert this to a tabular format like so?
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Furthermore, given the dataframe above, how can I bring it back to its original multi-indexed state?
What I've tried:
pd.DataFrame(df.values.reshape(-1, df.index.levels[1].size),
index=df.index.levels[0], columns=df.index.levels[1])
This works for the first problem, but I'm not sure how to bring it back to its original form from there.
Using unstack and stack
In [5359]: dff = df['value'].unstack()
In [5360]: dff
Out[5360]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [5361]: dff.stack().to_frame('name')
Out[5361]:
name
0 a 0
b 1
c 2
d 3
1 a 4
b 5
c 6
d 7
2 a 8
b 9
c 10
d 11
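As a self-contained sketch of the round trip, the snippet below rebuilds the sample frame, unstacks it, and stacks it back. Passing 'value' to to_frame restores the original column name, so the result can be compared with df directly.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b', 'c', 'd']])
df = pd.DataFrame({'value': np.arange(12)}, index=idx)

# Inner index level becomes the columns.
wide = df['value'].unstack()

# Columns go back into the inner index level; name the column as before.
restored = wide.stack().to_frame('value')

print(wide)
print(restored.equals(df))
```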
By using get_level_values
pd.crosstab(df.index.get_level_values(0),df.index.get_level_values(1),values=df.value,aggfunc=np.sum)
Out[477]:
col_0 a b c d
row_0
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Another alternative, which you should think of when using stack/unstack (though unstack is clearly better in this case!) is pivot_table:
In [11]: df.pivot_table(values="value", index=df.index.get_level_values(0), columns=df.index.get_level_values(1))
Out[11]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Suppose I have a data frame with 3 columns: A, B, C. I want to group by column A, and find the row (for each unique A) with the maximum entry in C, so that I can store that row.A, row.B, row.C into a dictionary elsewhere.
What's the best way to do this without using iterrows?
# generate sample data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,(10,3)))
df.columns = ['A','B','C']
# sort by C, group by A, take last row of each group
df.sort_values('C').groupby('A').nth(-1)
Here's another method. If df is the DataFrame, you can write df.groupby('A').apply(lambda d: d.loc[d['C'].idxmax()]).
For example,
In [96]: df
Out[96]:
A B C
0 1 0 3
1 3 0 4
2 0 4 5
3 2 4 0
4 3 1 1
5 1 6 2
6 3 6 0
7 4 0 1
8 2 3 4
9 0 5 0
10 7 6 5
11 3 1 2
In [97]: g = df.groupby('A').apply(lambda d: d['C'].idxmax())
In [98]: g
Out[98]:
A
0 2
1 0
2 8
3 1
4 7
7 10
dtype: int64
In [99]: df.loc[g.values]
Out[99]:
A B C
2 0 4 5
0 1 0 3
8 2 3 4
1 3 0 4
7 4 0 1
10 7 6 5
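The same result can be sketched more compactly by calling idxmax on the grouped column directly, and the selected rows can then be turned into the dictionary the question asks for. The names best_rows and best_by_a are hypothetical, and keying the dictionary by A is one possible layout among several.

```python
import pandas as pd

# Sample data copied from the example above.
df = pd.DataFrame({
    'A': [1, 3, 0, 2, 3, 1, 3, 4, 2, 0, 7, 3],
    'B': [0, 0, 4, 4, 1, 6, 6, 0, 3, 5, 6, 1],
    'C': [3, 4, 5, 0, 1, 2, 0, 1, 4, 0, 5, 2],
})

# idxmax returns, per group, the index label of the row with the largest C.
best_rows = df.loc[df.groupby('A')['C'].idxmax()]

# One dictionary entry per unique A, holding that row's B and C.
best_by_a = best_rows.set_index('A').to_dict(orient='index')

print(best_rows)
print(best_by_a)
```

When several rows in a group tie for the maximum, idxmax picks the first one it encounters.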
I have two df,
First df
A B C
1 1 3
1 1 2
1 2 5
2 2 7
2 3 7
Second df
B D
1 5
2 6
3 4
Column B has the same meaning in both dfs. What is the easiest way to add column D to the corresponding values in the first df? The output should be:
A B C D
1 1 3 5
1 1 2 5
1 2 5 6
2 2 7 6
2 3 7 4
Perform a 'left' merge in your case on column 'B':
In [206]:
df.merge(df1, how='left', on='B')
Out[206]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
Another method would be to set 'B' on your second df as the index and then call map:
In [215]:
df1 = df1.set_index('B')
df['D'] = df['B'].map(df1['D'])
df
Out[215]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
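Both approaches assume each value of B appears only once in the second df. As a sketch, the merge can state that assumption explicitly with validate='many_to_one', which raises an error if the second frame has duplicate keys. The frames below are rebuilt from the question's data, and the map variant uses assign so that neither input is modified.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [1, 1, 2, 2, 3],
    'C': [3, 2, 5, 7, 7],
})
df1 = pd.DataFrame({'B': [1, 2, 3], 'D': [5, 6, 4]})

# Left merge; validate checks that B is unique in the right-hand frame.
merged = df.merge(df1, how='left', on='B', validate='many_to_one')

# Equivalent lookup via map, without reassigning df1 or mutating df.
mapped = df.assign(D=df['B'].map(df1.set_index('B')['D']))

print(merged)
print(mapped)
```

One difference worth knowing: merge returns a new default index, while the map variant keeps the index of the first df.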