python pandas, certain columns to rows [duplicate]

This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 4 years ago.
I have a pandas dataframe with 4 rows and 4 columns - here is a simple version:
import pandas as pd
import numpy as np
rows = np.arange(1, 5, 1)  # four index labels for four rows
values = np.arange(1, 17).reshape(4,4)
df = pd.DataFrame(values, index=rows, columns=['A', 'B', 'C', 'D'])
what I am trying to do is to convert this to a 12-row, 2-column dataframe, with the B, C and D values each aligned against their row's A value - so it would look like this:
1 2
1 3
1 4
5 6
5 7
5 8
9 10
9 11
9 12
13 14
13 15
13 16
reading on pandas documentation I tried this:
df1 = pd.pivot_table(df, rows = ['B', 'C', 'D'], cols = 'A')
but it gives me an error whose source I cannot identify (the traceback ends with
DataError: No numeric types to aggregate
)
following that I want to split the dataframe based on A values, but I think the .groupby command will probably take care of that

What you are looking for is the melt function:
pd.melt(df, id_vars=['A'])
A variable value
0 1 B 2
1 5 B 6
2 9 B 10
3 13 B 14
4 1 C 3
5 5 C 7
6 9 C 11
7 13 C 15
8 1 D 4
9 5 D 8
10 9 D 12
11 13 D 16
    
A final sort on A is then necessary:
pd.melt(df,id_vars=['A']).sort('A')
A variable value
0 1 B 2
4 1 C 3
8 1 D 4
1 5 B 6
5 5 C 7
9 5 D 8
2 9 B 10
6 9 C 11
10 9 D 12
3 13 B 14
7 13 C 15
11 13 D 16
Note: pd.DataFrame.sort has been deprecated (and later removed) in favour of pd.DataFrame.sort_values, so on current pandas use .sort_values('A') instead.
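As an alternative sketch, set_index plus stack produces the same long format already grouped by A, with no separate sort step (assuming a 4×4 df like the one in the question, rebuilt here to be self-contained):

```python
import numpy as np
import pandas as pd

rows = np.arange(1, 5)
values = np.arange(1, 17).reshape(4, 4)
df = pd.DataFrame(values, index=rows, columns=['A', 'B', 'C', 'D'])

# set_index('A') keeps A as the index; stack() moves B/C/D into rows,
# already grouped under each A value
long = df.set_index('A').stack().reset_index()
long.columns = ['A', 'variable', 'value']
print(long)
```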

Related

Pandas, filter dataframe based on unique values in one column and groupby in another

I have a dataframe like this:
ID Packet Type
1 1 A
2 1 B
3 2 A
4 2 C
5 2 B
6 3 A
7 3 C
8 4 C
9 4 B
10 5 B
11 6 C
12 6 B
13 6 A
14 7 A
I want to filter the dataframe so that I keep only the entries that belong to a packet of size n whose Types are all different; there are exactly n types.
For this example let's use n=3 and the types A, B, C.
In the end I want this:
ID Packet Type
3 2 A
4 2 C
5 2 B
11 6 C
12 6 B
13 6 A
How do I do this with pandas?
Another solution, using .groupby + .filter:
df = df.groupby("Packet").filter(lambda x: len(x) == x["Type"].nunique() == 3)
print(df)
Prints:
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
You can use transform with nunique:
out = df[df.groupby('Packet')['Type'].transform('nunique')==3]
Out[46]:
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
I'd loop over the groupby object, filter and concatenate:
>>> pd.concat(frame for _,frame in df.groupby("Packet") if len(frame) == 3 and frame.Type.is_unique)
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A

how to export values generated from agg groupby function

I have a large dataset based on servers at target locations. I used the following code to calculate the mean of a set of values for each server grouped by Site.
df4 = df4.merge(df4.groupby('SITE',as_index=False).agg({'DSKPERCENT':'mean'})[['SITE','DSKPERCENT']],on='SITE',how='left')
Sample Resulting DF
Site Server DSKPERCENT DSKPERCENT_MEAN
A 1 12 11
A 2 10 11
A 3 11 11
B 1 9 9
B 2 12 9
B 3 7 9
C 1 12 13
C 2 12 13
C 3 16 13
What I need now is to print/export the newly calculated mean per site. How can I print/export just the single calculated mean value per site (i.e. Site A has a calculated mean of 11, Site B of 9, etc.)?
IIUC, you're looking for a groupby -> transform type of operation. Essentially using transform is similar to agg except that the results are broadcasted back to the same shape of the original group.
Sample Data
df = pd.DataFrame({
"groups": list("aaabbbcddddd"),
"values": [1,2,3,4,5,6,7,8,9,10,11,12]
})
df
groups values
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
6 c 7
7 d 8
8 d 9
9 d 10
10 d 11
11 d 12
Method
df["group_mean"] = df.groupby("groups")["values"].transform("mean")
print(df)
groups values group_mean
0 a 1 2
1 a 2 2
2 a 3 2
3 b 4 5
4 b 5 5
5 b 6 5
6 c 7 7
7 d 8 10
8 d 9 10
9 d 10 10
10 d 11 10
11 d 12 10
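If the goal is a single row per site rather than the broadcast column, aggregating with groupby().mean() gives exactly that. A minimal sketch on the same sample data (the output file name means.csv is just an illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "groups": list("aaabbbcddddd"),
    "values": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
})

# one row per group: the aggregated (not broadcast) mean
site_means = df.groupby("groups", as_index=False)["values"].mean()
print(site_means)
# site_means.to_csv("means.csv", index=False)  # hypothetical export target
```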

Create columns from index values

Let say I have my data shaped as in this example
idx = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['a', 'b', 'c']],
names=['numbers', 'letters'])
col = ['Value']
df = pd.DataFrame(list(range(18)), idx, col)
print(df.unstack())
The output will be
Value
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
letters and numbers are indexes and Value is the only column
The question is how can I replace Value column with columns named as values of index letters?
So I would like to get such output
numbers a b c
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
where a, b and c are columns and numbers is the only index.
Appreciate your help.
The problem is caused by calling unstack on the DataFrame rather than on the Series:
df.Value.unstack().rename_axis(None, axis=1)
Out[151]:
a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
Wen-Ben's answer prevents you from running into a data frame with multiple column levels in the first place.
If you happened to be stuck with a multi-index column anyway, you can get rid of it by using .droplevel():
df = df.unstack()
df.columns = df.columns.droplevel()
df
Out[7]:
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
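On recent pandas versions the same cleanup can be written in one chain with keyword arguments; a sketch assuming pandas >= 0.24, where rename_axis accepts a columns keyword:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['a', 'b', 'c']],
                                 names=['numbers', 'letters'])
df = pd.DataFrame(list(range(18)), idx, ['Value'])

# unstack the Series (not the DataFrame) so only one column level is created,
# then clear the 'letters' name on the column axis
out = df['Value'].unstack().rename_axis(columns=None)
print(out)
```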

Multiply multiple columns conditionally by one column (easy in Excel)

I have a dataframe like this:
a b c d m1 m2
3 2 2 2 5 4
1 4 1 1 5 4
3 2 2 3 5 4
I would like to multiply a and b for m1 and c and d for m2:
a b c d m1 m2
15 10 8 8 5 4
5 20 4 4 5 4
15 10 8 12 5 4
I also want to retain the original dataframe structure. This is fairly simple in Excel, but pandas is proving complicated: if I do the first multiplication (m1) on its own, the resulting DF drops the unused columns.
Cheers!
Use mul on a subset of columns selected by a list of column names:
df[['a', 'b']] = df[['a', 'b']].mul(df['m1'], axis=0)
df[['c', 'd']] = df[['c', 'd']].mul(df['m2'], axis=0)
print (df)
a b c d m1 m2
0 15 10 8 8 5 4
1 5 20 4 4 5 4
2 15 10 8 12 5 4
Here's one way using np.repeat:
df.iloc[:, :4] *= np.repeat(df.iloc[:, 4:].values, 2, axis=1)
print(df)
a b c d m1 m2
0 15 10 8 8 5 4
1 5 20 4 4 5 4
2 15 10 8 12 5 4
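If more multiplier-to-column pairs are added later, a small loop over a mapping dict keeps the Excel-style logic in one place. A sketch using the question's data (the mapping dict itself is an illustration, not part of the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 3], 'b': [2, 4, 2],
                   'c': [2, 1, 2], 'd': [2, 1, 3],
                   'm1': [5, 5, 5], 'm2': [4, 4, 4]})

# which multiplier column applies to which data columns
mapping = {'m1': ['a', 'b'], 'm2': ['c', 'd']}
for mult, cols in mapping.items():
    df[cols] = df[cols].mul(df[mult], axis=0)
print(df)
```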

Pandas read_csv usecols same index

Consider the following code:
import pandas as pd
from io import StringIO
x='''
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11,12
13,14,15,16
17,18,19,20
'''
df = pd.read_csv(StringIO(x), skipinitialspace=True, usecols=[2,3,2])
print(df)
Output:
c d
0 3 4
1 7 8
2 11 12
3 15 16
4 19 20
Is there any way I can get
c d c
0 3 4 3
1 7 8 7
2 11 12 11
3 15 16 15
4 19 20 19
You can use the iloc[] indexer:
In [67]: pd.read_csv(StringIO(x), skipinitialspace=True).iloc[:, [2,3,2]]
Out[67]:
c d c
0 3 4 3
1 7 8 7
2 11 12 11
3 15 16 15
4 19 20 19
But as #Boud has already mentioned in the comments, it is much more efficient to use the usecols parameter (columns we don't need are then never parsed and take no memory). That works if you know the names of the columns in the CSV file:
In [6]: pd.read_csv(StringIO(x), skipinitialspace=True, usecols=[2,3,2]).loc[:, ['c','d','c']]
Out[6]:
c d c
0 3 4 3
1 7 8 7
2 11 12 11
3 15 16 15
4 19 20 19
or if you know their new indexes (in the new DataFrame):
In [7]: pd.read_csv(StringIO(x), skipinitialspace=True, usecols=[2,3,2]).iloc[:, [0,1,0]]
Out[7]:
c d c
0 3 4 3
1 7 8 7
2 11 12 11
3 15 16 15
4 19 20 19
PS: you may also want to read about pandas boolean indexing
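For reference, boolean indexing means filtering rows with a boolean mask. A minimal sketch on the same data (using Python 3's io.StringIO):

```python
import pandas as pd
from io import StringIO

x = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11,12\n13,14,15,16\n17,18,19,20\n"
df = pd.read_csv(StringIO(x))

# df['c'] > 10 builds a boolean Series; df[...] keeps only the True rows
print(df[df['c'] > 10])
```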
