Let's say I have a pandas DataFrame as follows:
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
I would like to know how I can convert it to this:
A B
0 C c0 a0 b0
D d0 a0 b0
1 C c1 a1 b1
D d1 a1 b1
2 C c2 a2 b2
D d2 a2 b2
3 C c3 a3 b3
D d3 a3 b3
Basically, I want to turn a few of the columns into rows and create a MultiIndex.
Well, melt will pretty much get it into the form you want, and then you can set the index as desired:
print(df)
A B C D
0 a0 b0 c0 d0
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
Now use melt to stack (note that I reset the index and use that column as an id_var, because it looks like you want the [0,1,2,3] index included in the stacking):
new = pd.melt(df.reset_index(), value_vars=['C', 'D'], id_vars=['index', 'A', 'B'])
print(new)
index A B variable value
0 0 a0 b0 C c0
1 1 a1 b1 C c1
2 2 a2 b2 C c2
3 3 a3 b3 C c3
4 0 a0 b0 D d0
5 1 a1 b1 D d1
6 2 a2 b2 D d2
7 3 a3 b3 D d3
Now just set the index (well, sort it and then set the index to make it look like your desired output):
new = new.sort_values('index').set_index(['index', 'variable', 'value'])
print(new)
A B
index variable value
0 C c0 a0 b0
D d0 a0 b0
1 C c1 a1 b1
D d1 a1 b1
2 C c2 a2 b2
D d2 a2 b2
3 C c3 a3 b3
D d3 a3 b3
If you don't need the [0,1,2,3] as part of the stack, the melt command is a bit cleaner:
print(pd.melt(df, value_vars=['C', 'D'], id_vars=['A', 'B']))
A B variable value
0 a0 b0 C c0
1 a1 b1 C c1
2 a2 b2 C c2
3 a3 b3 C c3
4 a0 b0 D d0
5 a1 b1 D d1
6 a2 b2 D d2
7 a3 b3 D d3
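On newer pandas (1.1+), melt's ignore_index=False keeps the original index, which avoids the reset_index/sort/set_index dance; a minimal sketch of the same idea:
import pandas as pd

df = pd.DataFrame({'A': ['a0', 'a1', 'a2', 'a3'],
                   'B': ['b0', 'b1', 'b2', 'b3'],
                   'C': ['c0', 'c1', 'c2', 'c3'],
                   'D': ['d0', 'd1', 'd2', 'd3']})

# Keep the original index, then append 'variable' and 'value' to it
new = (df.melt(id_vars=['A', 'B'], value_vars=['C', 'D'], ignore_index=False)
         .set_index(['variable', 'value'], append=True)
         .sort_index(level=0))
print(new)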
I have a dataset:
name val
a a1
a a2
b b1
b b2
b b3
c c1
I want to make all possible pairings of rows whose "name" values are not the same. So the desired result is:
name1 val1 name2 val2
a a1 b b1
a a1 b b2
a a1 b b3
a a2 b b1
a a2 b b2
a a2 b b3
a a1 c c1
a a2 c c1
b b1 c c1
b b2 c c1
b b3 c c1
How can I do that? I'd like to write a function that would perform the same operation on a bigger table with the same structure.
I would also like it to be efficient, since the original data has several thousand rows.
The easiest approach is a cross merge followed by query, if you have enough memory for a few million intermediate rows, which is not too bad:
df.merge(df, how='cross', suffixes=['1','2']).query('name1 < name2')
Output:
name1 val1 name2 val2
2 a a1 b b1
3 a a1 b b2
4 a a1 b b3
5 a a1 c c1
8 a a2 b b1
9 a a2 b b2
10 a a2 b b3
11 a a2 c c1
17 b b1 c c1
23 b b2 c c1
29 b b3 c c1
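Since you asked for a reusable function: a minimal sketch wrapping the same idea (how='cross' needs pandas 1.2+; the function name is just for illustration):
def pair_distinct_names(df):
    # Cross-join every row with every other row, then keep each
    # unordered pair of distinct names exactly once.
    out = df.merge(df, how='cross', suffixes=['1', '2'])
    return out.query('name1 < name2').reset_index(drop=True)

result = pair_distinct_names(df)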
Let's say I have the dataframe below:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
I am trying to write something that essentially says: if column A contains A1, A2, or A4, then add a column E populated with 'xx' in the rows where any of those three values appear.
Then create a df2 which contains only the flagged rows, and a df3 which has the flagged rows and column E removed. Resulting in df2:
A B C D E
0 A1 B1 C1 D1 xx
1 A2 B2 C2 D2 xx
2 A4 B4 C4 D4 xx
and df3:
A B C D
0 A0 B0 C0 D0
1 A3 B3 C3 D3
Python/pandas beginner here, so any and all help is much appreciated!
You can use boolean indexing:
mask = df["A"].isin(["A1", "A2", "A4"])
df_a = df[mask].copy()
df_a["E"] = "xx"
df_b = df[~mask]  # add .copy() if you plan to modify df_b later
print(df_a)
print(df_b)
Prints:
A B C D E
1 A1 B1 C1 D1 xx
2 A2 B2 C2 D2 xx
4 A4 B4 C4 D4 xx
A B C D
0 A0 B0 C0 D0
3 A3 B3 C3 D3
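The desired df2/df3 in the question show a fresh 0-based index; if you want that, drop the original index after filtering:
df_a = df_a.reset_index(drop=True)
df_b = df_b.reset_index(drop=True)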
I have two dataframes:
df1 :
A B C
0 a0 b0 c0
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
df2 :
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a70 b2 c20
3 a3 b9 c9
In df1, for every row where the column A and column B values match a row in df2, column C should be updated with the value from df2.
Output:
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
I tried the following, but it did not work.
df1.set_index(['A', 'B'])
df2.set_index(['A', 'B'])
df1.update(df2)
df1.reset_index()
df2.reset_index()
df1["C"][:4] = np.where((df1["A"][:4]==df2["A"])&(df1["B"][:4]==df2["B"]),df2["C"],df1["C"][:4])
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
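For what it's worth, the update approach from the question is nearly right: it fails only because set_index and reset_index return new DataFrames rather than modifying in place. Assigning the results back makes it work, and it generalizes beyond the first four rows:
df1 = df1.set_index(['A', 'B'])
df2 = df2.set_index(['A', 'B'])
df1.update(df2)  # update() aligns on the index and modifies df1 in place
df1 = df1.reset_index()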
I have a df like this:
A B C D E F
2 a1 a2 a3 a4 100
2 a1 b2 c3 a4 100 # note
2 b1 b2 b3 b4 100
2 c1 c2 c3 c4 100
1 a1 a2 a3 a4 120
2 a1 b2 c3 a4 150 # note
1 b1 b2 b3 b4 130
1 c1 c2 c3 c4 110
0 a1 a2 a3 a4 80
I want to compare the values of column F between rows where columns B-E match, across the values of column A, like so:
A B C D E F diff
2 a1 a2 a3 a4 100 120/100
2 a1 b2 c3 a4 100 # note 150/100
2 b1 b2 b3 b4 100 130/100
2 c1 c2 c3 c4 100 110/100
1 a1 a2 a3 a4 120 80/120
1 a1 b2 c3 a4 150 # note
1 b1 b2 b3 b4 130
1 c1 c2 c3 c4 110
0 a1 a2 a3 a4 80
Since the row where A is 1 has the same B-E values as the first row (where A is 2), I compute 120/100.
What I've tried:
df.groupby(['B', 'C', 'D', 'E']) - this groups the data, but I don't know how I could apply the logic of comparing with the value from the previous A group. Or maybe there is a simpler way of achieving it.
Use DataFrameGroupBy.shift with Series.div - within each B-E group, shift(-1) pulls F from the group's next row, so each row gets the following occurrence's F divided by its own F:
df['d'] = df.groupby(['B', 'C', 'D', 'E'])['F'].shift(-1).div(df['F'])
print(df)
A B C D E F d
0 2 a1 a2 a3 a4 100 1.200000
1 2 a1 b2 c3 a4 100 1.500000
2 2 b1 b2 b3 b4 100 1.300000
3 2 c1 c2 c3 c4 100 1.100000
4 1 a1 a2 a3 a4 120 0.666667
5 2 a1 b2 c3 a4 150 NaN
6 1 b1 b2 b3 b4 130 NaN
7 1 c1 c2 c3 c4 110 NaN
8 0 a1 a2 a3 a4 80 NaN
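If you literally want the "120/100" strings from the question rather than the float ratio, a sketch building on the same grouped shift (nxt and mask are just illustration helpers):
nxt = df.groupby(['B', 'C', 'D', 'E'])['F'].shift(-1)
mask = nxt.notna()  # only rows that have a later matching row get a value
df['diff'] = ''
df.loc[mask, 'diff'] = nxt[mask].astype(int).astype(str) + '/' + df.loc[mask, 'F'].astype(str)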
Suppose I have a dataframe as below:
df1=
name street city coordinates
0 A0 B0 C0 1,1
1 A1 B0 C0 NaN
2 A2 B0 C0 NaN
3 A3 B2 C2 NaN
4 A4 B2 C2 2,3
5 A5 B3 C3 NaN
6 A6 B3 C3 NaN
I want the result to be
df1=
name street city coordinates
0 A0 B0 C0 1,1
1 A1 B0 C0 1,1
2 A2 B0 C0 1,1
3 A3 B2 C2 2,3
4 A4 B2 C2 2,3
5 A5 B3 C3 NaN
6 A6 B3 C3 NaN
I want to update coordinates with the same street and city.
In the above example (B0,C0) at index 0 has coordinates (1,1). So I need to update coordinates at indices 1 and 2 to (1,1) since they have same street and city(B0,C0).
What is the best way to achieve this?
Also, how do I update all the dataframes in a similar fashion if we are given a list of dataframes, i.e.
df_list = [df1,df2,..]
Is it a good idea to first generate a dataframe with unique rows from all the dataframes and then use this dataframe for look-up and update each dataframe?
If at most one non-NaN value per group is possible, use sort_values with ffill to forward fill (note: the outputs below use sample data where rows 5 and 6, A5 and A6, fall in the B2/C2 group):
df = df.sort_values(['street','city','coordinates'])
df['coordinates'] = df['coordinates'].ffill()
print(df)
name street city coordinates
0 A0 B0 C0 1,1
1 A1 B0 C0 1,1
2 A2 B0 C0 1,1
4 A4 B2 C2 2,3
3 A3 B2 C2 2,3
5 A5 B2 C2 2,3
6 A6 B2 C2 2,3
A solution with GroupBy.transform and dropna:
df['coordinates'] = (df.groupby(['street', 'city'])['coordinates']
                       .transform(lambda x: x.dropna()))
print(df)
name street city coordinates
0 A0 B0 C0 1,1
1 A1 B0 C0 1,1
2 A2 B0 C0 1,1
3 A3 B2 C2 2,3
4 A4 B2 C2 2,3
5 A5 B2 C2 2,3
6 A6 B2 C2 2,3
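A terser equivalent is transform('first'), which broadcasts each group's first non-NaN value and also tolerates groups that are entirely NaN:
df['coordinates'] = df.groupby(['street', 'city'])['coordinates'].transform('first')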
Or ffill with bfill:
df['coordinates'] = (df.groupby(['street', 'city'])['coordinates']
                       .transform(lambda x: x.ffill().bfill()))
print(df)
name street city coordinates
0 A0 B0 C0 1,1
1 A1 B0 C0 1,1
2 A2 B0 C0 1,1
3 A3 B2 C2 2,3
4 A4 B2 C2 2,3
5 A5 B2 C2 2,3
6 A6 B2 C2 2,3
The second solution also works when groups contain multiple values - ffill first forward fills within each group (values before the first non-NaN stay NaN), and bfill then fills those leading NaNs by backfilling:
print(df)
name street city coordinates
0 A0 B0 C0 1,1
1 A1 B0 C0 NaN
2 A2 B0 C0 NaN
3 A3 B2 C2 NaN
4 A4 B2 C2 2,3
5 A5 B2 C2 4,7
6 A6 B2 C2 NaN
df['coordinates'] = (df.groupby(['street', 'city'])['coordinates']
                       .transform(lambda x: x.ffill().bfill()))
print(df)
name street city coordinates
0 A0 B0 C0 1,1
1 A1 B0 C0 1,1
2 A2 B0 C0 1,1
3 A3 B2 C2 2,3
4 A4 B2 C2 2,3
5 A5 B2 C2 4,7
6 A6 B2 C2 4,7
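The question also asked about a list of dataframes; one way is to wrap the transform in a small helper (the function name is just for illustration) and map it over the list:
def fill_coordinates(df):
    # Fill missing coordinates from other rows with the same street and city
    out = df.copy()
    out['coordinates'] = (out.groupby(['street', 'city'])['coordinates']
                             .transform(lambda x: x.ffill().bfill()))
    return out

df_list = [fill_coordinates(df) for df in df_list]
Building one lookup table of unique (street, city) -> coordinates pairs across all dataframes and merging it back into each one is also reasonable, provided the coordinates are consistent between dataframes.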