I have a dataset:
name val
a a1
a a2
b b1
b b2
b b3
c c1
I want to generate all possible pairings of rows whose "name" values differ. The desired result is:
name1 val1 name2 val2
a a1 b b1
a a1 b b2
a a1 b b3
a a2 b b1
a a2 b b2
a a2 b b3
a a1 c c1
a a2 c c1
b b1 c c1
b b2 c c1
b b3 c c1
How can I do that? I'd like to write a function that performs the same operation on a bigger table with the same structure.
I would like it to be efficient, since the original data has several thousand rows.
The easiest way is to cross merge and then filter with query, provided you have enough memory for a few million intermediate rows, which is not too bad:
df.merge(df, how='cross', suffixes=['1','2']).query('name1 < name2')
Output:
name1 val1 name2 val2
2 a a1 b b1
3 a a1 b b2
4 a a1 b b3
5 a a1 c c1
8 a a2 b b1
9 a a2 b b2
10 a a2 b b3
11 a a2 c c1
17 b b1 c c1
23 b b2 c c1
29 b b3 c c1
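Since the question asks for a reusable function, the one-liner above could be wrapped roughly as follows. This is only a sketch: the function name `pair_rows` and its `key` parameter are made up for illustration, and it assumes pandas 1.2 or later (for `how="cross"`) and key values that can be compared with `<`.

```python
import pandas as pd


def pair_rows(df, key="name"):
    """Hypothetical helper: pair each row with every row whose key is strictly larger."""
    # Cartesian product of the frame with itself; columns get suffixes 1 and 2
    crossed = df.merge(df, how="cross", suffixes=("1", "2"))
    # Keeping key1 < key2 drops same-key pairs and mirrored duplicates
    out = crossed[crossed[f"{key}1"] < crossed[f"{key}2"]]
    return out.reset_index(drop=True)


df = pd.DataFrame({
    "name": ["a", "a", "b", "b", "b", "c"],
    "val": ["a1", "a2", "b1", "b2", "b3", "c1"],
})
pairs = pair_rows(df)
print(pairs)
```

Keep in mind that the cross merge materialises n*n rows before filtering, so a few thousand input rows means a few million intermediate rows.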
Df1
A B C1 C2 D E
a1 b1 2 4 d1 e1
a2 b2 1 2 d2 e2
Df2
A B C D E
a1 b1 2 d1 e1
a1 b1 3 d1 e1
a1 b1 4 d1 e1
a2 b2 1 d2 e2
a2 b2 2 d2 e2
How can I make Df2 from Df1 in the fastest possible way?
I tried using groupby, then used np.arange inside a for loop to fill Df2.C, and finally used pd.concat to build Df2. But this approach is very slow and does not seem very elegant or pythonic either. Can somebody please help with this problem?
Try this:
df1.assign(C = [np.arange(s, e+1) for s, e in zip(df1['C1'], df1['C2'])])\
.explode('C')
Output:
A B C1 C2 D E C
0 a1 b1 2 4 d1 e1 2
0 a1 b1 2 4 d1 e1 3
0 a1 b1 2 4 d1 e1 4
1 a2 b2 1 2 d2 e2 1
1 a2 b2 1 2 d2 e2 2
One way is to melt df1, use groupby.apply to add ranges; then explode for the final output:
cols = ['A','B','D','E']
out = (df1.melt(cols, value_name='C').groupby(cols)['C']
.apply(lambda x: range(x.min(), x.max()+1))
.explode().reset_index(name='C'))
Output:
A B D E C
0 a1 b1 d1 e1 2
1 a1 b1 d1 e1 3
2 a1 b1 d1 e1 4
3 a2 b2 d2 e2 1
4 a2 b2 d2 e2 2
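For anyone who wants to run the explode approach end to end, here is a self-contained sketch that also drops C1 and C2 so the result has the same columns as Df2 in the question. The construction of `df1` and the final column order are assumptions taken from the tables shown above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": ["a1", "a2"], "B": ["b1", "b2"],
    "C1": [2, 1], "C2": [4, 2],
    "D": ["d1", "d2"], "E": ["e1", "e2"],
})

# One array of consecutive integers per row, from C1 to C2 inclusive
ranges = [np.arange(s, e + 1) for s, e in zip(df1["C1"], df1["C2"])]

df2 = (
    df1.assign(C=ranges)
       .explode("C")                  # one output row per element of each array
       .drop(columns=["C1", "C2"])
       .astype({"C": int})            # explode leaves C as object dtype
       .reset_index(drop=True)
       [["A", "B", "C", "D", "E"]]
)
print(df2)
```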
I have two dfs that I want to concat
(sorry I don't know how to properly recreate a df here)
A B
a1 b1
a2 b2
a3 b3
A C
a1 c1
a4 c4
Result:
A B C
a1 b1 c1
a2 b2 NaN
a3 b3 NaN
a4 NaN c4
I have tried:
merge = pd.concat([df1,df2],axis = 0,ignore_index= True)
but this seems to just append the second df to the first df
Thank you!
I believe you need an outer join:
>>> pd.merge(df1,df2,how='outer')
A B C
0 a1 b1 c1
1 a2 b2 NaN
2 a3 b3 NaN
3 a4 NaN c4
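A runnable sketch of the same outer join, with the two frames rebuilt from the question (the frame construction is my assumption about the data). Passing `on="A"` explicitly is optional here, since A is the only shared column, but it makes the join key visible.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]})
df2 = pd.DataFrame({"A": ["a1", "a4"], "C": ["c1", "c4"]})

# Outer join: keep every key from both frames, fill the gaps with NaN
result = pd.merge(df1, df2, on="A", how="outer")
print(result)
```

By contrast, `pd.concat(..., axis=0)` stacks the rows of the second frame under the first, which is why the attempt in the question looked like a plain append.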
Let's say I have the dataframe below:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
I am trying to write something that would essentially say: if column A contains A1, A2, or A4, then add a column E populated with 'xx' in the rows where any of those three values appears.
Then create a df2 which contains only the flagged rows, and a df3 which has both the flagged rows and column E removed. Resulting in df2:
A B C D E
0 A1 B1 C1 D1 xx
1 A2 B2 C2 D2 xx
2 A4 B4 C4 D4 xx
and df3:
A B C D
0 A0 B0 C0 D0
1 A3 B3 C3 D3
Python/pandas beginner here, so any and all help is much appreciated!
You can use boolean indexing:
mask = df["A"].isin(["A1", "A2", "A4"])
df_a = df[mask].copy()
df_a["E"] = "xx"
df_b = df[~mask]  # add .copy() if you plan to modify df_b afterwards
print(df_a)
print(df_b)
Prints:
A B C D E
1 A1 B1 C1 D1 xx
2 A2 B2 C2 D2 xx
4 A4 B4 C4 D4 xx
A B C D
0 A0 B0 C0 D0
3 A3 B3 C3 D3
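The printed frames keep their original row labels (1, 2, 4 and 0, 3), whereas the desired output in the question is renumbered from 0. A small sketch of that variant, assuming the frame from the question and using the question's names `df2` and `df3`:

```python
import pandas as pd

# Rebuild the example frame: A0..A4, B0..B4, C0..C4, D0..D4
df = pd.DataFrame({c: [f"{c}{i}" for i in range(5)] for c in "ABCD"})

mask = df["A"].isin(["A1", "A2", "A4"])

# Flagged rows with the new column E, renumbered from 0
df2 = df[mask].assign(E="xx").reset_index(drop=True)
# Remaining rows, without column E, renumbered from 0
df3 = df[~mask].reset_index(drop=True)

print(df2)
print(df3)
```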
I have two dataframes:
df1 :
A B C
0 a0 b0 c0
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
df2 :
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a70 b2 c20
3 a3 b9 c9
In df1, for every row where the column A and column B values both match a row in df2, column C should be updated with the value from df2.
Output:
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
I tried the following, but it did not work.
df1.set_index(['A', 'B'])
df2.set_index(['A', 'B'])
df1.update(df2)
df1.reset_index()
df2.reset_index()
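# Positional comparison of rows 0-3 of df1 with df2; .loc avoids chained assignment
df1.loc[:3, "C"] = np.where((df1.loc[:3, "A"]==df2["A"])&(df1.loc[:3, "B"]==df2["B"]),df2["C"],df1.loc[:3, "C"])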
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
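The `np.where` line compares the frames row by row, so it relies on matching rows sitting at the same positions. One thing that stands out in the attempt from the question is that `set_index` returns a new DataFrame instead of modifying the original, so the `update` call still aligned on the default integer index. A key-based sketch along those lines follows, assuming the (A, B) pairs in df2 are unique; the name `keyed` is just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["a0", "a1", "a2", "a3", "a4"],
    "B": ["b0", "b1", "b2", "b3", "b4"],
    "C": ["c0", "c1", "c2", "c3", "c4"],
})
df2 = pd.DataFrame({
    "A": ["a0", "a1", "a70", "a3"],
    "B": ["b0", "b1", "b2", "b9"],
    "C": ["c11", "c5", "c20", "c9"],
})

# Keep the re-indexed frame: set_index does not work in place by default
keyed = df1.set_index(["A", "B"])
# update aligns on the (A, B) index; keys present only in df2 are ignored
keyed.update(df2.set_index(["A", "B"]))
result = keyed.reset_index()
print(result)
```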
Let's say I have a pandas dataframe as follows:
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
I would like to know how I can convert it to this.
A B
0 C c0 a0 b0
D d0 a0 b0
1 C c1 a1 b1
D d1 a1 b1
2 C c2 a2 b2
D d2 a2 b2
3 C c3 a3 b3
D d3 a3 b3
Basically, I want to turn a few columns into rows and create a MultiIndex.
Well, melt will pretty much get it in the form you want and then you can set the index as desired:
print(df)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
Now use melt to stack (note, I reset the index and use that column as an id_var because it looks like you want the [0,1,2,3] index included in the stacking):
new = pd.melt(df.reset_index(),value_vars=['C','D'],id_vars=['index','A','B'])
print(new)
index A B variable value
0 0 a0 b0 C c0
1 1 a1 b1 C c1
2 2 a2 b2 C c2
3 3 a3 b3 C c3
4 0 a0 b0 D d0
5 1 a1 b1 D d1
6 2 a2 b2 D d2
7 3 a3 b3 D d3
Now just set the index (well, sort it first and then set the index, to make it look like your desired output):
new = new.sort_values(['index','variable']).set_index(['index','variable','value'])
print(new)
A B
index variable value
0 C c0 a0 b0
D d0 a0 b0
1 C c1 a1 b1
D d1 a1 b1
2 C c2 a2 b2
D d2 a2 b2
3 C c3 a3 b3
D d3 a3 b3
If you don't need the [0,1,2,3] as part of the stack, the melt command is a bit cleaner:
print(pd.melt(df,value_vars=['C','D'],id_vars=['A','B']))
A B variable value
0 a0 b0 C c0
1 a1 b1 C c1
2 a2 b2 C c2
3 a3 b3 C c3
4 a0 b0 D d0
5 a1 b1 D d1
6 a2 b2 D d2
7 a3 b3 D d3
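For reference, the steps above can be put together as a self-contained Python 3 sketch. The way `df` is built here is my assumption based on the table in the question, and the name `long` is just for illustration. The sort uses `sort_values`, since the older `DataFrame.sort` method is no longer available in current pandas releases.

```python
import pandas as pd

# Rebuild the example frame: a0..a3, b0..b3, c0..c3, d0..d3
df = pd.DataFrame({c: [f"{c.lower()}{i}" for i in range(4)] for c in "ABCD"})

# Stack C and D into rows, carrying the original row number along as "index"
long = pd.melt(df.reset_index(), id_vars=["index", "A", "B"], value_vars=["C", "D"])

# Order by original row, then by variable, and build the three-level index
new = (
    long.sort_values(["index", "variable"])
        .set_index(["index", "variable", "value"])
)
print(new)
```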