How to reshape my dataset in specific way? - python

I have a dataset:
name val
a a1
a a2
b b1
b b2
b b3
c c1
I want to build all possible pairs of rows whose "name" values are not the same. So the desired result is:
name1 val1 name2 val2
a a1 b b1
a a1 b b2
a a1 b b3
a a2 b b1
a a2 b b2
a a2 b b3
a a1 c c1
a a2 c c1
b b1 c c1
b b2 c c1
b b3 c c1
How can I do that? I'd like to write a function that performs the same operation on a bigger table with the same structure.
It should be efficient, since the original data has several thousand rows.

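The easiest approach is a cross merge followed by a query. The cross merge builds len(df)**2 intermediate rows, so for several thousand input rows you need enough memory for a few million rows, which is not too bad. Filtering on name1 < name2 drops same-name pairs and keeps each pair of names only once: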
df.merge(df, how='cross', suffixes=['1','2']).query('name1 < name2')
Output:
name1 val1 name2 val2
2 a a1 b b1
3 a a1 b b2
4 a a1 b b3
5 a a1 c c1
8 a a2 b b1
9 a a2 b b2
10 a a2 b b3
11 a a2 c c1
17 b b1 c c1
23 b b2 c c1
29 b b3 c c1
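Since the question asks for a reusable function, the one-liner above could be wrapped roughly as follows. This is a minimal sketch: the function name pair_distinct_names is made up for illustration, a boolean mask stands in for .query('name1 < name2') (it should select the same rows), and how="cross" assumes pandas 1.2 or newer.

```python
import pandas as pd


def pair_distinct_names(df: pd.DataFrame) -> pd.DataFrame:
    """Pair each row with every row whose name sorts strictly later."""
    # Cross merge: every row against every row (len(df) ** 2 rows).
    pairs = df.merge(df, how="cross", suffixes=("1", "2"))
    # Strict "<" removes same-name pairs and mirrored duplicates.
    keep = pairs["name1"] < pairs["name2"]
    return pairs[keep].reset_index(drop=True)


df = pd.DataFrame({
    "name": ["a", "a", "b", "b", "b", "c"],
    "val": ["a1", "a2", "b1", "b2", "b3", "c1"],
})
result = pair_distinct_names(df)
print(result)
```

The reset_index(drop=True) call is optional; it only replaces the leftover row labels from the cross merge with a clean 0..n-1 index.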

Related

How to fill a df column with range of values of 2 columns from another df?

Df1
A B C1 C2 D E
a1 b1 2 4 d1 e1
a2 b2 1 2 d2 e2
Df2
A B C D E
a1 b1 2 d1 e1
a1 b1 3 d1 e1
a1 b1 4 d1 e1
a2 b2 1 d2 e2
a2 b2 2 d2 e2
How to make Df2 from Df1 in the fastest possible way?
I tried using groupby, then np.arange inside a for loop to fill Df2.C, and finally pd.concat to build Df2. But this approach is very slow and is neither elegant nor Pythonic. Can somebody please help with this problem?
Try this:
df1.assign(C = [np.arange(s, e+1) for s, e in zip(df1['C1'], df1['C2'])])\
.explode('C')
Output:
A B C1 C2 D E C
0 a1 b1 2 4 d1 e1 2
0 a1 b1 2 4 d1 e1 3
0 a1 b1 2 4 d1 e1 4
1 a2 b2 1 2 d2 e2 1
1 a2 b2 1 2 d2 e2 2
One way is to melt df1, use groupby.apply to add ranges; then explode for the final output:
cols = ['A','B','D','E']
out = (df1.melt(cols, value_name='C').groupby(cols)['C']
.apply(lambda x: range(x.min(), x.max()+1))
.explode().reset_index(name='C'))
Output:
A B D E C
0 a1 b1 d1 e1 2
1 a1 b1 d1 e1 3
2 a1 b1 d1 e1 4
3 a2 b2 d2 e2 1
4 a2 b2 d2 e2 2
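If speed matters, a variant that avoids building Python lists per row may be worth trying. The sketch below is not taken from the answers above: it repeats each row with Index.repeat and computes the offsets with NumPy, and it assumes C1 <= C2 in every row and integer bounds. Whether it is actually faster on your data is something to measure.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": ["a1", "a2"], "B": ["b1", "b2"],
    "C1": [2, 1], "C2": [4, 2],
    "D": ["d1", "d2"], "E": ["e1", "e2"],
})

# Number of output rows per input row (both bounds are inclusive).
lengths = (df1["C2"] - df1["C1"] + 1).to_numpy()

# Repeat each input row as many times as its range is long.
df2 = df1.loc[df1.index.repeat(lengths)].copy()

# 0-based position of each repeated row inside its own group.
starts = np.repeat(lengths.cumsum() - lengths, lengths)
offsets = np.arange(lengths.sum()) - starts

df2["C"] = df2["C1"].to_numpy() + offsets
df2 = df2[["A", "B", "C", "D", "E"]].reset_index(drop=True)
print(df2)
```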

Pandas concat with different columns

I have two dfs that I want to concat
(sorry I don't know how to properly recreate a df here)
A B
a1 b1
a2 b2
a3 b3
A C
a1 c1
a4 c4
Result:
A B C
a1 b1 c1
a2 b2 NaN
a3 b3 NaN
a4 NaN c4
I have tried:
merge = pd.concat([df1,df2],axis = 0,ignore_index= True)
but this seems to just append the second df to the first df
Thank you!
I believe you need an outer join:
>>> pd.merge(df1, df2, how='outer')
A B C
0 a1 b1 c1
1 a2 b2 NaN
2 a3 b3 NaN
3 a4 NaN c4
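To reproduce this end to end, the two frames from the question can be built by hand. The sketch below contrasts the concat attempt with the outer merge; passing on="A" explicitly is an addition here (without it, merge joins on all shared column names, which is also just A in this case).

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]})
df2 = pd.DataFrame({"A": ["a1", "a4"], "C": ["c1", "c4"]})

# concat with axis=0 only stacks rows; it never matches rows on column A,
# so a1 appears twice.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# An outer merge matches rows on A and keeps keys found in either frame.
merged = pd.merge(df1, df2, on="A", how="outer")
print(merged)
```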

Adding and subtracting dataframe rows conditionally

Let's say I have the dataframe below:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
I am trying to write something that would essentially say: if column A contains A1, A2, or A4, then add a column 'E' populated by 'xx' in the rows where any of those three values appears.
Then create a df2 which contains only the flagged rows, and a df3 with the flagged rows and column E removed. Resulting in df2:
A B C D E
0 A1 B1 C1 D1 xx
1 A2 B2 C2 D2 xx
2 A4 B4 C4 D4 xx
and df3:
A B C D
0 A0 B0 C0 D0
1 A3 B3 C3 D3
Python/pandas beginner here, so any and all help is much appreciated!
You can use boolean indexing:
mask = df["A"].isin(["A1", "A2", "A4"])
df_a = df[mask].copy()
df_a["E"] = "xx"
df_b = df[~mask]  # add .copy() if you plan to modify df_b later
print(df_a)
print(df_b)
Prints:
A B C D E
1 A1 B1 C1 D1 xx
2 A2 B2 C2 D2 xx
4 A4 B4 C4 D4 xx
A B C D
0 A0 B0 C0 D0
3 A3 B3 C3 D3
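The printed frames keep the original row labels (1, 2, 4 and 0, 3), while the question shows them renumbered from 0. A possible way to get that, sketched as a small helper (the name split_flagged and its parameters are made up for illustration):

```python
import pandas as pd


def split_flagged(df, values, flag="xx"):
    """Return (flagged rows with column E, remaining rows), both renumbered."""
    mask = df["A"].isin(values)
    # assign() returns a new frame, so the input df is left untouched.
    flagged = df[mask].assign(E=flag).reset_index(drop=True)
    rest = df[~mask].reset_index(drop=True)
    return flagged, rest


df = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3", "A4"],
    "B": ["B0", "B1", "B2", "B3", "B4"],
    "C": ["C0", "C1", "C2", "C3", "C4"],
    "D": ["D0", "D1", "D2", "D3", "D4"],
})
df2, df3 = split_flagged(df, ["A1", "A2", "A4"])
print(df2)
print(df3)
```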

Pandas combine two dataframes to update values of a particular column in 1st dataframe

I have two dataframes:
df1 :
A B C
0 a0 b0 c0
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
df2 :
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a70 b2 c20
3 a3 b9 c9
In df1, for every row where the values in Column A and Column B are equal to the values in df2, Column C should be updated with the value from df2.
Output:
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
I tried the following, but it did not work.
df1.set_index(['A', 'B'])
df2.set_index(['A', 'B'])
df1.update(df2)
df1.reset_index()
df2.reset_index()
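A positional comparison with np.where works when the rows of df2 line up with the first rows of df1:
df1.loc[:3, "C"] = np.where((df1.loc[:3, "A"] == df2["A"]) & (df1.loc[:3, "B"] == df2["B"]), df2["C"], df1.loc[:3, "C"])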
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
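One reason the set_index/update attempt in the question has no useful effect is that set_index and reset_index return new frames rather than modifying df1 and df2 in place, so update still aligns on the default integer index. If the rows of df2 are not guaranteed to line up positionally with df1, matching on the key columns may be more robust. The following is a sketch of that idea using a left merge; the C_new column name is an arbitrary choice, and it assumes each (A, B) pair occurs at most once in df2.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["a0", "a1", "a2", "a3", "a4"],
    "B": ["b0", "b1", "b2", "b3", "b4"],
    "C": ["c0", "c1", "c2", "c3", "c4"],
})
df2 = pd.DataFrame({
    "A": ["a0", "a1", "a70", "a3"],
    "B": ["b0", "b1", "b2", "b9"],
    "C": ["c11", "c5", "c20", "c9"],
})

# Left merge on the key columns; df2's C arrives as C_new (NaN if no match).
merged = df1.merge(df2, on=["A", "B"], how="left", suffixes=("", "_new"))

# Prefer the value from df2 where a match exists, else keep df1's value.
merged["C"] = merged["C_new"].fillna(merged["C"])
result = merged.drop(columns="C_new")
print(result)
```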

convert some rows in rows of a multiindex in pandas dataframe

Let's say I have a pandas dataframe as follows:
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
I would like to know how I can convert it to this.
A B
0 C c0 a0 b0
D d0 a0 b0
1 C c1 a1 b1
D d1 a1 b1
2 C c2 a2 b2
D d2 a2 b2
3 C c3 a3 b3
D d3 a3 b3
Basically, I want to turn a few columns into rows and create a multi-index.
Well, melt will pretty much get it in the form you want and then you can set the index as desired:
print(df)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
Now use melt to stack (note, I reset the index and use that column as an id_var because it looks like you want the [0,1,2,3] index included in the stacking):
new = pd.melt(df.reset_index(),value_vars=['C','D'],id_vars=['index','A','B'])
print(new)
index A B variable value
0 0 a0 b0 C c0
1 1 a1 b1 C c1
2 2 a2 b2 C c2
3 3 a3 b3 C c3
4 0 a0 b0 D d0
5 1 a1 b1 D d1
6 2 a2 b2 D d2
7 3 a3 b3 D d3
Now just set the index (well sort it and then set the index to make it look like your desired output):
new = new.sort_values(['index','variable']).set_index(['index','variable','value'])
print(new)
A B
index variable value
0 C c0 a0 b0
D d0 a0 b0
1 C c1 a1 b1
D d1 a1 b1
2 C c2 a2 b2
D d2 a2 b2
3 C c3 a3 b3
D d3 a3 b3
If you don't need the [0,1,2,3] as part of the stack, the melt command is a bit cleaner:
print(pd.melt(df,value_vars=['C','D'],id_vars=['A','B']))
A B variable value
0 a0 b0 C c0
1 a1 b1 C c1
2 a2 b2 C c2
3 a3 b3 C c3
4 a0 b0 D d0
5 a1 b1 D d1
6 a2 b2 D d2
7 a3 b3 D d3
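For reference, the first variant (with the original [0,1,2,3] index kept in the stack) could look like this as one self-contained script on a current pandas version. It is a sketch under two assumptions: DataFrame.sort_values replaces the long-removed DataFrame.sort, and 'variable' is added as a second sort key so that C always comes before D within each index value.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a0", "a1", "a2", "a3"],
    "B": ["b0", "b1", "b2", "b3"],
    "C": ["c0", "c1", "c2", "c3"],
    "D": ["d0", "d1", "d2", "d3"],
})

# reset_index() turns the default index into a column named "index",
# so it can be carried through melt as an id variable.
long = df.reset_index().melt(id_vars=["index", "A", "B"],
                             value_vars=["C", "D"])

# Sort so rows for the same original index sit together, then build
# the three-level MultiIndex.
result = (long.sort_values(["index", "variable"])
              .set_index(["index", "variable", "value"]))
print(result)
```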
