Let's say I have my data shaped as in this example:
idx = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['a', 'b', 'c']],
names=['numbers', 'letters'])
col = ['Value']
df = pd.DataFrame(list(range(18)), idx, col)
print(df.unstack())
The output will be
Value
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
letters and numbers are index levels, and Value is the only column.
The question is: how can I replace the Value column with columns named after the values of the letters index level?
So I would like to get output like this:
numbers a b c
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
where a, b and c are columns and numbers is the only index.
Appreciate your help.
The problem is that you are calling unstack on the DataFrame rather than on a pd.Series. Select the Value column first:
df.Value.unstack().rename_axis(None, axis=1)
Out[151]:
a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
Wen-Ben's answer prevents you from running into a data frame with multiple column levels in the first place.
If you happen to be stuck with multi-index columns anyway, you can get rid of the extra level with .droplevel():
df = df.unstack()
df.columns = df.columns.droplevel()
df
Out[7]:
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
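For comparison, here is a minimal, self-contained sketch of both approaches on the example data from the question. The names wide_from_series and wide_from_frame are only illustrative, and a reasonably recent pandas version is assumed.

```python
import pandas as pd

# Rebuild the example frame from the question.
idx = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['a', 'b', 'c']],
                                 names=['numbers', 'letters'])
df = pd.DataFrame({'Value': range(18)}, index=idx)

# Approach 1: unstack the Series, so no extra column level is created,
# then clear the 'letters' name from the column axis.
wide_from_series = df['Value'].unstack().rename_axis(None, axis=1)

# Approach 2: unstack the DataFrame, then drop the outer 'Value' level.
wide_from_frame = df.unstack()
wide_from_frame.columns = wide_from_frame.columns.droplevel(0)

print(wide_from_series)
print(wide_from_frame)
```

The only visible difference is that the second result keeps letters as the name of the column axis; the data is the same.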
I have a large dataset based on servers at target locations. I used the following code to calculate the mean of a set of values for each server grouped by Site.
df4 = df4.merge(df4.groupby('SITE',as_index=False).agg({'DSKPERCENT':'mean'})[['SITE','DSKPERCENT']],on='SITE',how='left')
Sample Resulting DF
Site Server DSKPERCENT DSKPERCENT_MEAN
A 1 12 11
A 2 10 11
A 3 11 11
B 1 9 9
B 2 12 9
B 3 7 9
C 1 12 13
C 2 12 13
C 3 16 13
What I need now is to print/export the newly calculated mean per site. How can I print/export just the single unique calculated mean value per site (i.e. Site A has a calculated mean of 11, Site B of 9, etc.)?
IIUC, you're looking for a groupby -> transform type of operation. Essentially, transform is similar to agg except that the results are broadcast back to the shape of the original group.
Sample Data
df = pd.DataFrame({
"groups": list("aaabbbcddddd"),
"values": [1,2,3,4,5,6,7,8,9,10,11,12]
})
df
groups values
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
6 c 7
7 d 8
8 d 9
9 d 10
10 d 11
11 d 12
Method
df["group_mean"] = df.groupby("groups")["values"].transform("mean")
print(df)
groups values group_mean
0 a 1 2.0
1 a 2 2.0
2 a 3 2.0
3 b 4 5.0
4 b 5 5.0
5 b 6 5.0
6 c 7 7.0
7 d 8 10.0
8 d 9 10.0
9 d 10 10.0
10 d 11 10.0
11 d 12 10.0
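Since the question asks for one value per site, it may help to see transform next to a plain aggregation. The sketch below uses a small made-up frame modelled on the question's columns (the values are illustrative, not the asker's data); site_means is a hypothetical name.

```python
import pandas as pd

# Small frame modelled on the question's sample; values are illustrative.
df4 = pd.DataFrame({
    "SITE": ["A", "A", "A", "B", "B", "B"],
    "SERVER": [1, 2, 3, 1, 2, 3],
    "DSKPERCENT": [12, 10, 11, 9, 12, 6],
})

# transform: one value per original row (the group mean repeated).
df4["DSKPERCENT_MEAN"] = df4.groupby("SITE")["DSKPERCENT"].transform("mean")

# Plain aggregation: one row per site, convenient for printing or export.
site_means = df4.groupby("SITE", as_index=False)["DSKPERCENT"].mean()

print(df4)
print(site_means)
```

If a file is wanted, site_means can then be written out with DataFrame.to_csv.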
I have a df1 and df2 as follows:
df1:
a b c
0 1 2 4
1 6 12 24
2 7 14 28
3 4 8 16
4 3 6 12
df2:
a b c
0 7 8 9
1 10 11 12
How can I insert df2 into df1 after the second row? My desired output looks like this:
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
Thank you.
Use concat with the first DataFrame split by DataFrame.iloc, and pass ignore_index=True so the result gets a fresh 0..n-1 index:
df = pd.concat([df1.iloc[:2], df2, df1.iloc[2:]], ignore_index=True)
print (df)
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
Here is another way using np.r_:
df2.index=range(len(df1),len(df1)+len(df2)) #change index where df1 ends
final=pd.concat((df1,df2)) #concat
final.iloc[np.r_[0,1,df2.index,2:len(df1)]] #select ordering with iloc
#final.iloc[np.r_[0:2,df2.index,2:len(df1)]]
a b c
0 1 2 4
1 6 12 24
5 7 8 9
6 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
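The slicing idea can be wrapped in a small helper. This is only a sketch: insert_rows is a hypothetical name, not a pandas function, and it assumes both frames share the same columns.

```python
import pandas as pd

def insert_rows(df, other, pos):
    """Return a new frame with `other` inserted before row position `pos`."""
    # Split `df` at `pos`, put `other` in between, and renumber the index.
    return pd.concat([df.iloc[:pos], other, df.iloc[pos:]], ignore_index=True)

df1 = pd.DataFrame({"a": [1, 6, 7, 4, 3],
                    "b": [2, 12, 14, 8, 6],
                    "c": [4, 24, 28, 16, 12]})
df2 = pd.DataFrame({"a": [7, 10], "b": [8, 11], "c": [9, 12]})

result = insert_rows(df1, df2, 2)
print(result)
```

Because pd.concat returns a new object, df1 and df2 are left unchanged.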
I have a dataframe like below:
A B C
1 8 23
2 8 22
3 9 45
4 9 45
5 6 12
6 4 10
7 11 12
I want to drop duplicates, keeping the first row of each consecutive run of equal values in B if the value in C is also the same.
E.g. here the value 9 in column B repeats, and the corresponding values in column C (45) repeat as well. In this case I want to retain the first occurrence.
Expected Output:
A B C
1 8 23
2 8 22
3 9 45
5 6 12
6 4 10
7 11 12
I tried a groupby, but did not know how to drop the rows.
code:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
test=df.groupby('consecutive',as_index=False).apply(lambda x: (x['B'].head(1),x.shape[0],
x['C'].iloc[-1] - x['C'].iloc[0]))
This groupby returns a Series, but I want to drop rows.
Add DataFrame.drop_duplicates by 2 columns:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
df = df.drop_duplicates(['consecutive','C'])
print (df)
A B C consecutive
0 1 8 23 1
1 2 8 22 1
2 3 9 45 2
4 5 6 12 3
5 6 4 10 4
6 7 11 12 5
Or chain both conditions with | for bitwise OR:
df = df[(df['B'] != df['B'].shift()) | (df['C'] != df['C'].shift())]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
An easy way is to check the difference between consecutive rows of B and C, then drop a row if both differences are 0 (duplicate values). The code is:
df[ ~((df.B.diff()==0) & (df.C.diff()==0)) ]
A oneliner to filter out such records is:
df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
Here we check whether the values in columns ['B', 'C'] are the same as in the previous (shifted) row; if they are not, we retain the row:
>>> df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
This is quite scalable, since we can define a function that will easily operate on an arbitrary number of values:
def drop_consecutive_duplicates(df, *colnames):
    dff = df[list(colnames)]
    return df[(dff.shift() != dff).any(axis=1)]
So you can then filter with:
drop_consecutive_duplicates(df, 'B', 'C')
Using diff, ne and any over axis=1:
Note: this method only works for numeric columns
m = df[['B', 'C']].diff().ne(0).any(axis=1)
print(df[m])
Output
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Details
df[['B', 'C']].diff()
B C
0 NaN NaN
1 0.0 -1.0
2 1.0 23.0
3 0.0 0.0
4 -3.0 -33.0
5 -2.0 -2.0
6 7.0 2.0
Then we check if any of the values in a row are not equal (ne) to 0:
df[['B', 'C']].diff().ne(0).any(axis=1)
0 True
1 True
2 True
3 False
4 True
5 True
6 True
dtype: bool
You can compute a series of the rows to drop, and then drop them:
to_drop = (df['B'] == df['B'].shift())&(df['C']==df['C'].shift())
df = df[~to_drop]
It gives as expected:
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Code
df1 = df.drop_duplicates(subset=['B', 'C'])
Result
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
If I understand your question correctly, given the following dataframe:
df = pd.DataFrame({'B': [8, 8, 9, 9, 6, 4, 11], 'C': [22, 23, 45, 45, 12, 10, 12],})
This one-liner solves the problem using the drop_duplicates method (note that it drops every repeated (B, C) pair, not only consecutive ones):
df.drop_duplicates(['B', 'C'])
It gives as expected results:
B C
0 8 22
1 8 23
2 9 45
4 6 12
5 4 10
6 11 12
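The answers above fall into two groups that happen to agree on the sample data: those that compare each row with the previous one, and those that call drop_duplicates on ['B', 'C'] across the whole frame. The following sketch uses made-up data, in which one pair reappears on non-adjacent rows, to show where they differ; the variable names are illustrative.

```python
import pandas as pd

# The pair (8, 23) appears twice, but not on neighbouring rows.
df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [8, 9, 9, 8, 6],
                   "C": [23, 45, 45, 23, 12]})

# Consecutive-only: keep a row if B or C differs from the previous row.
keys = df[["B", "C"]]
consecutive_only = df[(keys != keys.shift()).any(axis=1)]

# Global: drop every repeated (B, C) pair, wherever it occurs.
global_dedup = df.drop_duplicates(subset=["B", "C"])

print(consecutive_only["A"].tolist())  # [1, 2, 4, 5]
print(global_dedup["A"].tolist())      # [1, 2, 5]
```

Row A=4 survives the consecutive check because its neighbour differs, while drop_duplicates removes it as a repeat of row A=1.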
What is the best way to fill missing values in dataframe with items from list?
For example:
pd.DataFrame([[1,2,3],[4,5],[7,8],[10,11,12],[13,14]])
0 1 2
0 1 2 3
1 4 5 NaN
2 7 8 NaN
3 10 11 12
4 13 14 NaN
list = [6, 9, 150]
to get something like this:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 150
This is actually a little tricky and a bit of a hack. If you know which column you want to fill the NaN values for, you can construct a df for that column, indexed by the rows with missing values, and pass that df to fillna:
In [33]:
fill = pd.DataFrame(index =df.index[df.isnull().any(axis=1)], data= [6, 9, 150],columns=[2])
df.fillna(fill)
Out[33]:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 150
You can't pass a dict (my original answer): the dict keys are matched against column labels, and each scalar value is then used for all NaN values in that column, which is not what you want:
In [40]:
l=[6, 9, 150]
df.fillna(dict(zip(df.index[df.isnull().any(axis=1)],l)))
Out[40]:
0 1 2
0 1 2 3
1 4 5 9
2 7 8 9
3 10 11 12
4 13 14 9
You can see that it has replaced all NaNs with 9 as it matched the missing NaN index value of 2 with column 2.
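A variation on the same index-alignment idea works on the single column directly, using a Series instead of a DataFrame. This is a sketch that assumes the list has exactly one entry per missing value in column 2; fill_values, missing and filled are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5], [7, 8], [10, 11, 12], [13, 14]])
fill_values = [6, 9, 150]

# Boolean mask of the rows where column 2 is missing.
missing = df[2].isna()

# Build a Series indexed by those row labels; fillna matches on the index,
# so each missing row receives its own value from the list.
filled = df.copy()
filled[2] = filled[2].fillna(pd.Series(fill_values, index=df.index[missing]))

print(filled)
```

If the list length does not match the number of missing rows, the pd.Series constructor raises an error rather than filling silently.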
I have a pandas dataframe with 4 rows and 4 columns. Here is a simple version:
import pandas as pd
import numpy as np
rows = np.arange(1, 5, 1)  # four row labels (1..4) to match the 4x4 values
values = np.arange(1, 17).reshape(4,4)
df = pd.DataFrame(values, index=rows, columns=['A', 'B', 'C', 'D'])
What I am trying to do is convert this to a two-column dataframe, with B, C and D aligned against each value of A, so it would look like this:
1 2
1 3
1 4
5 6
5 7
5 8
9 10
9 11
9 12
13 14
13 15
13 16
Reading the pandas documentation, I tried this:
df1 = pd.pivot_table(df, rows = ['B', 'C', 'D'], cols = 'A')
but it gives me an error whose source I cannot identify (it ends with DataError: No numeric types to aggregate).
After that I want to split the dataframe based on the values in A, but I think .groupby will probably take care of that.
What you are looking for is the melt function
pd.melt(df,id_vars=['A'])
A variable value
0 1 B 2
1 5 B 6
2 9 B 10
3 13 B 14
4 1 C 3
5 5 C 7
6 9 C 11
7 13 C 15
8 1 D 4
9 5 D 8
10 9 D 12
11 13 D 16
A final sorting according to A is then necessary
pd.melt(df,id_vars=['A']).sort_values('A')
A variable value
0 1 B 2
4 1 C 3
8 1 D 4
1 5 B 6
5 5 C 7
9 5 D 8
2 9 B 10
6 9 C 11
10 9 D 12
3 13 B 14
7 13 C 15
11 13 D 16
Note: pd.DataFrame.sort was deprecated and later removed in favour of pd.DataFrame.sort_values.
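Putting the pieces together, here is a runnable sketch of the melt approach on the question's data with current pandas. It assumes the row labels should be 1 to 4, and it uses a stable sort (kind="mergesort") so that B, C and D stay in order within each value of A; long_df and pairs are illustrative names.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 17).reshape(4, 4)
df = pd.DataFrame(values, index=np.arange(1, 5), columns=["A", "B", "C", "D"])

# Melt B, C and D into rows, keeping A as the identifier column.
long_df = pd.melt(df, id_vars=["A"])

# Stable sort on A, then renumber the rows.
long_df = long_df.sort_values("A", kind="mergesort").reset_index(drop=True)

# Two-column result matching the layout asked for in the question.
pairs = long_df[["A", "value"]]
print(pairs)
```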