I'm looking to combine dataframes df1 and df2 to get df3 in Python, most preferably in a one-liner (that is, no "for all x in df1.LETS...").
I'm at a current loss for words to use with my Google-fu, so here I am at StackExchange, hoping another programmer can help fill in my mental blank with this predicament.
Thank you!
df1      df2      df3
LETS     NUMS     LETS  NUMS
A        1        A     1
B        2        A     2
         3        A     3
         4        A     4
                  B     1
                  B     2
                  B     3
                  B     4
You can use:
df1 = pd.DataFrame({'LETS':list('AB')})
df2 = pd.DataFrame({'NUMS':range(1,5)})
Cross join solution with merge: assign a helper column A holding a constant to both frames, merge on it, then drop it:
df = pd.merge(df1.assign(A=1), df2.assign(A=1), on='A').drop('A', axis=1)
print (df)
LETS NUMS
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
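On newer pandas the helper column may not be needed. A minimal sketch, assuming pandas 1.2 or later, where merge accepts how='cross' (the name df3 is just taken from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'LETS': list('AB')})
df2 = pd.DataFrame({'NUMS': range(1, 5)})

# how='cross' builds the Cartesian product of the rows directly
# (assumption: pandas >= 1.2; older versions need the helper-column trick)
df3 = df1.merge(df2, how='cross')
print(df3)
```

This should give the same eight rows as the helper-column version, with a fresh 0..7 index.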
Another solution uses MultiIndex.from_product and MultiIndex.to_frame (a new function in pandas 0.20.1):
df = pd.MultiIndex.from_product([df1['LETS'], df2['NUMS']]).to_frame()
df.columns = ['LETS','NUMS']
print (df)
LETS NUMS
A 1 A 1
2 A 2
3 A 3
4 A 4
B 1 B 1
2 B 2
3 B 3
4 B 4
print (df.reset_index(drop=True))
LETS NUMS
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
pd.DataFrame(index=pd.MultiIndex.from_product([df1.LETS, df2.NUMS],
names=("LETS", "NUMS"))).reset_index()
# LETS NUMS
#0 A 1
#1 A 2
#2 A 3
#3 A 4
#4 B 1
#5 B 2
#6 B 3
#7 B 4
I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to assign an incrementing ID to each unique combination of columns a and b. The result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately this only works on the grouped dataset and not on the original one. As you can see, the original table has 7 rows and the grouped result returns only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!
groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Use pd.factorize after creating a tuple from the (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
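The two answers agree on the sample data because the pairs happen to first appear in sorted order. A small sketch of where they can differ, using made-up data (the frame and the variable names below are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: the first pair seen, (2, 'B'), is not the smallest key
df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': ['B', 'A', 'B', 'B']})

# Default groupby sorts the keys, so IDs follow sorted (a, b) order
by_sorted_key = df.groupby(['a', 'b']).ngroup() + 1

# sort=False numbers the groups in order of first appearance
by_first_seen = df.groupby(['a', 'b'], sort=False).ngroup() + 1

# factorize on tuples also numbers in order of first appearance
by_factorize = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1

print(by_sorted_key.tolist())
print(by_first_seen.tolist())
print(by_factorize.tolist())
```

If the IDs should follow the order in which pairs show up in the data, sort=False or factorize is probably the one to reach for. If they should follow sorted key order, the default ngroup does that.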
I have two dataframes df1 and df2
df1
A B
0 4 2
1 3 3
2 1 2
df2
B AB C
0 4 8 3
1 3 9 2
2 1 2 4
I would like to join them, adding only the columns of df2 that are not already in df1:
df3
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4
Use Index.isin with inverse mask or Index.difference:
df22 = df2.loc[:, ~df2.columns.isin(df1.columns)]
df = df1.join(df22)
Or:
df22 = df2[df2.columns.difference(df1.columns)]
df = df1.join(df22)
print (df)
A B AB C
0 4 2 8 3
1 3 3 9 2
2 1 2 2 4
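One difference between the two variants is column order. As far as I know, Index.difference sorts its result by default, while the isin mask keeps the original order. A quick sketch, with df2's columns deliberately reordered (that reordering is my assumption, not the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [4, 3, 1], 'B': [2, 3, 2]})
# Hypothetical column order for df2: C comes before AB
df2 = pd.DataFrame({'B': [4, 3, 1], 'C': [3, 2, 4], 'AB': [8, 9, 2]})

# The boolean mask keeps df2's own column order
kept_by_mask = df2.columns[~df2.columns.isin(df1.columns)]

# difference() returns the remaining labels sorted by default
kept_by_difference = df2.columns.difference(df1.columns)

print(list(kept_by_mask))
print(list(kept_by_difference))
```

With the question's data both give AB, C, since that order is already sorted.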
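You can also use merge as an alternative. Note that this matches the values of df1.A against df2.B rather than aligning on the index, so it only reproduces df3 when those values line up, as they do here: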
df3=pd.merge(df1,df2, left_on='A', right_on='B', how ='left', suffixes=('','_')).drop('B_',axis=1)
I have a data frame like this:
df1 = pd.DataFrame({'a': [1,2],
'b': [3,4],
'c': [6,5]})
df1
Out[150]:
a b c
0 1 3 6
1 2 4 5
Now I want to create a df that repeats each row based on the difference between columns b and c, plus 1. The difference for the first row is 6-3 = 3, so I want to repeat that row 3+1 = 4 times. Similarly, the difference for the second row is 5-4 = 1, so I want to repeat it 1+1 = 2 times. A new column d should run from b up to c within each repeated block (for the first row, 3 -> 6). So I want to get this df:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
Do it with reindex + repeat, then use groupby + cumcount to assign the new column d:
df1.reindex(df1.index.repeat(df1.eval('c-b').add(1))).\
assign(d=lambda x : x.c-x.groupby('a').cumcount(ascending=False))
Out[572]:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
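Another way to get the same shape, sketched under the assumption that DataFrame.explode is available (pandas 0.25 or later): build the list of d values per row first, then explode it. The name out is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [6, 5]})

# One list per row holding b, b+1, ..., c (inclusive)
d_lists = [list(range(b, c + 1)) for b, c in zip(df1['b'], df1['c'])]

out = (
    df1.assign(d=d_lists)        # each cell of d holds a list
       .explode('d')             # one output row per list element
       .astype({'d': 'int64'})   # explode leaves d as object dtype
)
print(out)
```

This version does not rely on column a being unique per row, which the groupby('a') step above implicitly does.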
I get a dataframe
df
A B
0 1 4
1 2 5
2 3 6
For further processing, it would be more convenient to have the df restructured
as follows:
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
How can I achieve that?
Use unstack with reset_index :
df = df.unstack().reset_index(level=1, drop=True).reset_index()
df.columns = ['letters','numbers']
print (df)
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
Or numpy.concatenate + numpy.repeat + DataFrame (transpose first so the values are read column by column):
a = np.concatenate(df.values.T)
b = np.repeat(df.columns,len(df.index))
df = pd.DataFrame({'letters':b, 'numbers':a})
print (df)
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
Probably simplest to melt:
In [36]: pd.melt(df, var_name="letters", value_name="numbers")
Out[36]:
letters numbers
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
I have three columns, A, B and C. I want to create a fourth column D that contains values of A or B, based on the value of C. For example:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
In the above example, column D takes the value of column A if the value of C is 1 and the value of column B if the value of C is 0. Is there an elegant way to do it in Pandas? Thank you for your help.
Use numpy.where:
In [20]: df
Out[20]:
A B C
0 1 2 1
1 2 3 0
2 3 4 0
3 4 5 1
In [21]: df['D'] = np.where(df.C, df.A, df.B)
In [22]: df
Out[22]:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
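np.where(df.C, ...) leans on 0 being falsy and 1 being truthy. If C could ever hold other values, spelling out the condition may be safer. A minimal sketch with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5], 'C': [1, 0, 0, 1]})

# Explicit comparison: take A where C equals 1, otherwise B
df['D'] = np.where(df['C'] == 1, df['A'], df['B'])
print(df)
```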
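pandas
In consideration of the OP's request ("Is there an elegant way to do it in Pandas?"), here is my opinion of an elegant and idiomatic pure-pandas approach:
assign + pd.Series.where
Note that where expects a boolean condition, so the 0/1 column is cast with astype(bool) first:
df.assign(D=df.A.where(df.C.astype(bool), df.B))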
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
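Response to comment:
"How would you modify the pandas answer if instead of 0, 1 in column C you had A, B?"
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this line only runs on older versions:
df.assign(D=df.lookup(df.index, df.C))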
A B C D
0 1 2 A 1
1 2 3 B 3
2 3 4 B 4
3 4 5 A 4
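For pandas versions without DataFrame.lookup, the same per-row pick can be done with NumPy indexing. This is only a sketch of one possible replacement, not the original answer's method; the names sub, col_pos, and out are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5], 'C': ['A', 'B', 'B', 'A']})

# Only the candidate value columns, so the array keeps an integer dtype
sub = df[['A', 'B']]

# Translate each label in C into a column position within sub
col_pos = sub.columns.get_indexer(df['C'])

# Pick one value per row: row i, column col_pos[i]
values = sub.to_numpy()[np.arange(len(df)), col_pos]

out = df.assign(D=values)
print(out)
```

get_indexer returns -1 for labels it cannot find, so if C might contain something other than A or B it is worth checking for that before indexing.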