How to Multi-Index an existing DataFrame - python

"Multi-Index" might be the wrong term for what I'm looking to do, but below is an example of what I'm trying to accomplish.
Original DF:
HOD site1_units site1_orders site2_units site2_orders
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Desired DF
site1 site2
HOD units orders units orders
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
Is there an efficient way to construct/format the dataframe like this? Thank you for the help!

Try this:
df = df.set_index('HOD')
df = df.set_axis(df.columns.map(lambda x: tuple(x.split('_'))),axis=1)
Output:
site1 site2
units orders units orders
HOD
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90

Here is one way you can do this:
df = df.set_index("HOD")
index = pd.MultiIndex.from_tuples(zip(["site1","site1", "site2", "site2"],["units", "orders", "units", "orders"]))
df.columns = index
Result:
site1 site2
units orders units orders
HOD
hour1 6 3 20 16
hour2 25 10 16 3
hour3 500 50 50 25
hour4 125 65 59 14
hour5 16 1 158 6
hour6 0 0 15 15
hour7 180 18 99 90
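Both answers above build the two column levels by hand or with a lambda. As a minimal sketch of a shorter variant, assuming every column name other than HOD contains exactly one underscore, the string accessor on the column index can do the split directly (only the first three rows of the question's data are used here):

```python
import pandas as pd

# First three rows of the question's data, for illustration only
df = pd.DataFrame({
    "HOD": ["hour1", "hour2", "hour3"],
    "site1_units": [6, 25, 500],
    "site1_orders": [3, 10, 50],
    "site2_units": [20, 16, 50],
    "site2_orders": [16, 3, 25],
})

df = df.set_index("HOD")
# expand=True turns the split pieces into the levels of a MultiIndex
df.columns = df.columns.str.split("_", expand=True)

# Selecting on the top level returns that site's sub-frame
site1 = df["site1"]
print(df)
```

Selecting df["site1"] afterwards gives a frame with just the units and orders columns for that site.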

Related

Add/Update/Merge original DataFrame into a grouped DataFrame

How can I merge, update, join, concat, or filter the original DF correctly so that I can have the complete 78 columns?
I have a DataFrame with 22 rows and 78 columns. An internet-friendly version of the file can be found here. This is a sample:
item_no code group gross_weight net_weight value ... ... +70 columns more
1 7417.85.24.25 0 18 17 13018.74
2 1414.19.00.62 1 35 33 0.11
3 7815.80.99.96 0 49 48 1.86
4 1414.19.00.62 1 30 27 2.7
5 5867.21.36.92 1 31 24 94
6 9227.71.84.12 1 24 17 56.4
7 1414.19.00.62 0 42 35 0.56
8 4465.58.84.31 0 50 42 0.94
9 1596.09.32.64 1 20 13 0.75
10 2194.64.27.41 1 38 33 1.13
11 1596.09.32.64 1 53 46 1.9
12 1596.09.32.64 1 18 15 10.44
13 1596.09.32.64 1 35 33 15.36
14 4835.09.81.44 1 55 47 10.44
15 5698.44.72.13 1 51 49 15.36
16 5698.44.72.13 1 49 45 2.15
17 5698.44.72.13 0 41 33 16
18 3815.79.80.69 1 25 21 4
19 3815.79.80.69 1 35 30 2.4
20 4853.40.53.94 1 53 46 3.12
21 4853.40.53.94 1 50 47 3.98
22 4853.40.53.94 1 16 13 6.53
The column group gives me the instruction that I should group all similar values in the code column and add the values in the columns: 'gross_weight', 'net_weight', 'value', and 'item_quantity'. Additionally, I have to modify 2 additional columns as shown below:
#Group DF
grouped_df = df.groupby(['group', 'code'], as_index=False).agg({'item_quantity':'sum', 'gross_weight':'sum','net_weight':'sum', 'value':'sum'}).copy()
#Total items should be equal to the length of the DF
grouped_df['total_items'] = len(grouped_df)
#Item No.
grouped_df['item_no'] = [x+1 for x in range(len(grouped_df))]
This is the result:
group code item_quantity gross_weight net_weight value total_items item_no
0 0 1414.19.00.62 75.0 42 35 0.56 14 1
1 0 4465.58.84.31 125.0 50 42 0.94 14 2
2 0 5698.44.72.13 200.0 41 33 16.0 14 3
3 0 7417.85.24.25 1940.2 18 17 13018.74 14 4
4 0 7815.80.99.96 200.0 49 48 1.86 14 5
5 1 1414.19.00.62 275.0 65 60 2.81 14 6
6 1 1596.09.32.64 515.0 126 107 28.45 14 7
7 1 2194.64.27.41 151.0 38 33 1.13 14 8
8 1 3815.79.80.69 400.0 60 51 6.4 14 9
9 1 4835.09.81.44 87.0 55 47 10.44 14 10
10 1 4853.40.53.94 406.0 119 106 13.63 14 11
11 1 5698.44.72.13 328.0 100 94 17.51 14 12
12 1 5867.21.36.92 1000.0 31 24 94.0 14 13
13 1 9227.71.84.12 600.0 24 17 56.4 14 14
All of the columns in the grouped DF exist in the original DF but some have different values.
How can I merge, update, join, concat, or filter the original DF correctly so that I can have the complete 78 columns?
The objective DataFrame is the grouped DF.
The columns in the original DF that already exists in the Grouped DF should be omitted.
I should be able to take the first value of the columns in the original DF that aren't in the Grouped DF.
The column code does not have unique values.
The column part_number in the complete file does not have unique values.
I tried:
pd.merge(how='left') after creating a unique ID: it duplicates existing columns instead of updating or overwriting values.
join, concat, update: does not yield the expected results.
.agg({lambda x: x.iloc[0]}) adds all the columns but I don't know how to add it to the current .agg({'item_quantity':'sum', 'gross_weight':'sum','net_weight':'sum', 'value':'sum'})
I know that .agg({'column_name':'first'}) returns the first value, but I don't know how to make it work for over 70 columns automatically.
You can achieve this by building the dictionary dynamically with a dict comprehension, like this:
df.groupby(['group', 'code'], as_index=False).agg({col : 'sum' for col in df.columns[3:]})
If item_no is your index, then change df.columns[3:] to df.columns[2:]
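The comprehension above sums every column. Since the question asks for sums on a few columns and the first value on the rest, here is a minimal sketch of a mixed mapping. The small frame and the description column are hypothetical stand-ins for the real 78-column file:

```python
import pandas as pd

# Hypothetical miniature of the original frame; "description" stands in
# for the 70+ columns that should just keep their first value.
df = pd.DataFrame({
    "item_no": [1, 2, 3, 4],
    "code": ["A", "B", "B", "A"],
    "group": [0, 1, 1, 0],
    "item_quantity": [10.0, 5.0, 7.0, 1.0],
    "gross_weight": [18, 35, 30, 42],
    "description": ["x", "y", "z", "w"],
})

keys = ["group", "code"]
sum_cols = ["item_quantity", "gross_weight"]

# 'sum' for the listed columns, 'first' for every other non-key column
agg_map = {
    col: ("sum" if col in sum_cols else "first")
    for col in df.columns
    if col not in keys
}

grouped_df = df.groupby(keys, as_index=False).agg(agg_map)
grouped_df["total_items"] = len(grouped_df)
grouped_df["item_no"] = range(1, len(grouped_df) + 1)
print(grouped_df)
```

This keeps everything in one groupby call, so no merge back onto the original frame is needed.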

Create groups based on column values

I am attempting to create user groups based on a particular DataFrame column value. I would like to create 10 user groups of the entire DataFrame's population, based on the total_usage metric. An example DataFrame df is shown below.
user_id total_usage
1 10
2 10
3 20
4 20
5 30
6 30
7 40
8 40
9 50
10 50
11 60
12 60
13 70
14 70
15 80
16 80
17 90
18 90
19 100
20 100
The df is just a snippet of the entire DataFrame, which is over 6000 records long; however, I would like to have only 10 user groups.
An example of my desired output is shown below.
user_id total_usage user_group
1 10 10th_group
2 10 10th_group
3 20 9th_group
4 20 9th_group
5 30 8th_group
6 30 8th_group
7 40 7th_group
8 40 7th_group
9 50 6th_group
10 50 6th_group
11 60 5th_group
12 60 5th_group
13 70 4th_group
14 70 4th_group
15 80 3th_group
16 80 3th_group
17 90 2nd_group
18 90 2nd_group
19 100 1st_group
20 100 1st_group
Any assistance that anyone could provide would be greatly appreciated.
Looks like you are looking for qcut, but in reverse order
df['user_group'] = 10 - pd.qcut(df['total_usage'], np.arange(0,1.1, 0.1)).cat.codes
Output, it's not ordinal, but I hope it will do:
0 10
1 10
2 9
3 9
4 8
5 8
6 7
7 7
8 6
9 6
10 5
11 5
12 4
13 4
14 3
15 3
16 2
17 2
18 1
19 1
dtype: int8
Use qcut on the negated values to reverse the order, and Series.map to add the 'st' and 'nd' suffixes for 1 and 2:
s = pd.qcut(-df['total_usage'], np.arange(0,1.1, 0.1), labels=False) + 1
d = {1:'st', 2:'nd'}
df['user_group'] = s.astype(str) + s.map(d).fillna('th') + '_group'
print (df)
user_id total_usage user_group
0 1 10 10th_group
1 2 10 10th_group
2 3 20 9th_group
3 4 20 9th_group
4 5 30 8th_group
5 6 30 8th_group
6 7 40 7th_group
7 8 40 7th_group
8 9 50 6th_group
9 10 50 6th_group
10 11 60 5th_group
11 12 60 5th_group
12 13 70 4th_group
13 14 70 4th_group
14 15 80 3th_group
15 16 80 3th_group
16 17 90 2nd_group
17 18 90 2nd_group
18 19 100 1st_group
19 20 100 1st_group
Try using pd.Series with np.repeat, np.arange, pd.DataFrame.groupby, pd.Series.astype, pd.Series.map and pd.Series.fillna:
x = df.groupby('total_usage')
s = pd.Series(np.repeat(np.arange(x.ngroups), [len(i) for i in x.groups.values()]) + 1)
df['user_group'] = (s.astype(str) + s.map({1: 'st', 2: 'nd'}).fillna('th') + '_Group').values[::-1]
And now:
print(df)
Is:
user_id total_usage user_group
0 1 10 10th_Group
1 2 10 10th_Group
2 3 20 9th_Group
3 4 20 9th_Group
4 5 30 8th_Group
5 6 30 8th_Group
6 7 40 7th_Group
7 8 40 7th_Group
8 9 50 6th_Group
9 10 50 6th_Group
10 11 60 5th_Group
11 12 60 5th_Group
12 13 70 4th_Group
13 14 70 4th_Group
14 15 80 3th_Group
15 16 80 3th_Group
16 17 90 2nd_Group
17 18 90 2nd_Group
18 19 100 1st_Group
19 20 100 1st_Group
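The answers above reproduce the "3th_group" label from the question because only 1 and 2 are mapped to a suffix. If proper English ordinals are wanted instead, a minimal sketch could look like this. The ordinal helper is a hypothetical name, not part of pandas:

```python
import numpy as np
import pandas as pd


def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Same 20-row sample as in the question
df = pd.DataFrame({
    "user_id": range(1, 21),
    "total_usage": np.repeat(np.arange(10, 101, 10), 2),
})

# Negate so the highest usage lands in group 1; labels=False gives bin numbers
codes = pd.qcut(-df["total_usage"], 10, labels=False) + 1
df["user_group"] = codes.map(lambda n: ordinal(int(n)) + "_group")
print(df)
```

On the full 6000-row frame, qcut may raise an error about duplicate bin edges if many users share the same total_usage; that case is not handled here.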

Pandas DataFrame RangeIndex

I have created a Pandas DataFrame. I need to create a RangeIndex for the DataFrame that corresponds to the frame -
RangeIndex(start=0, stop=x, step=y) - where x and y relate to my DataFrame.
I've not seen an example of how to do this - is there a method or syntax specific to this?
thanks
It seems you need the RangeIndex constructor:
df = pd.DataFrame({'A' : range(1, 21)})
print (df)
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
print (df.index)
RangeIndex(start=0, stop=20, step=1)
df.index = pd.RangeIndex(start=0, stop=99, step=5)
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
print (df.index)
RangeIndex(start=0, stop=99, step=5)
More dynamic solution:
step = 10
df.index = pd.RangeIndex(start=0, stop=len(df.index) * step - 1, step=step)
print (df)
A
0 1
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
140 15
150 16
160 17
170 18
180 19
190 20
print (df.index)
RangeIndex(start=0, stop=199, step=10)
EDIT:
As @ZakS pointed out in the comments, it is better to pass the index directly to the DataFrame constructor:
df = pd.DataFrame({'A' : range(1, 21)}, index=pd.RangeIndex(start=0, stop=99, step=5))
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
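In the examples above the stop value is either hard-coded or computed as len(df.index) * step - 1. As a small sketch, the stop can also be derived as len(df) * step, which yields the same number of labels because stop is exclusive:

```python
import pandas as pd

df = pd.DataFrame({"A": range(1, 21)})

step = 5
# stop is exclusive, so len(df) * step gives exactly len(df) labels
df.index = pd.RangeIndex(start=0, stop=len(df) * step, step=step)

print(df.index)
```

Calling df.reset_index(drop=True) afterwards would bring back the default index starting at 0 with step 1.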

grouping by id and a condition

I have a dataframe df
df=DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
'min':[10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
that looks like
id min day
0 a 10 15
1 a 17 15
2 a 21 15
3 a 30 15
4 a 50 15
5 a 57 17
6 a 58 17
7 b 15 41
8 b 17 41
9 b 19 41
10 b 19 41
11 b 19 41
12 b 19 41
13 b 19 41
14 b 25 57
15 b 26 57
16 b 26 57
I want a new column that categorizes the rows based on the id and the relationship between consecutive rows: if the difference in min between consecutive rows is less than 8 and the day value is the same, the rows should be assigned to the same group. My output would look like this:
id min day category
0 a 10 15 1
1 a 17 15 1
2 a 21 15 1
3 a 30 15 2
4 a 50 15 3
5 a 57 17 4
6 a 58 17 4
7 b 15 41 5
8 b 17 41 5
9 b 19 41 5
10 b 19 41 5
11 b 19 41 5
12 b 19 41 5
13 b 19 41 5
14 b 25 57 6
15 b 26 57 6
16 b 26 57 6
Hope this helps. Let me know your views.
All the best.
import pandas as pd

df = pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
                   'min': [10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
                   'day': [15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})

# initialize the category counter to 1
cat = 1
# the first row always gets category 1
new_series = [cat]
# start from 1, not 0, because row 0 has no previous row to compare with
for i in range(1, len(df)):
    if df.iloc[i]['day'] == df.iloc[i-1]['day']:
        # same day: start a new category only when the gap in 'min' is 8 or more
        if df.iloc[i]['min'] - df.iloc[i-1]['min'] >= 8:
            cat += 1
    else:
        # different day: always start a new category
        cat += 1
    new_series.append(cat)
df['category'] = new_series
print(df)
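The loop above can also be written without iterating row by row. This is only a sketch, using the values from the question's table and assuming that a change of id, a change of day, or a jump of 8 or more in min should each start a new category:

```python
import pandas as pd

# Values taken from the table shown in the question
df = pd.DataFrame({
    "id": ["a"] * 7 + ["b"] * 10,
    "min": [10, 17, 21, 30, 50, 57, 58, 15, 17, 19, 19, 19, 19, 19, 25, 26, 26],
    "day": [15, 15, 15, 15, 15, 17, 17, 41, 41, 41, 41, 41, 41, 41, 57, 57, 57],
})

# True wherever a row starts a new category
new_group = (
    df["id"].ne(df["id"].shift())      # id changed (also True for the first row)
    | df["day"].ne(df["day"].shift())  # day changed
    | df["min"].diff().ge(8)           # min jumped by 8 or more
)

# The running count of group starts is the category number
df["category"] = new_group.cumsum()
print(df)
```

The signed difference is used here, as in the loop, so a drop in min within the same day does not open a new category; wrapping the diff in abs() would change that.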

Pandas dataframe from nested dictionary to melted data frame

I converted a nested dictionary to a Pandas DataFrame, which I want to use to create a heatmap.
The DataFrame is simple to create from the nested dictionary:
>>>df = pandas.DataFrame.from_dict(my_nested_dict)
>>>df
93 94 95 96 97 98 99 100 100A 100B ... 100M 100N 100O 100P 100Q 100R 100S 101 102 103
A 465 5 36 36 28 24 25 30 28 32 ... 28 19 16 15 4 4 185 2 7 3
C 0 1 2 0 6 10 8 16 23 17 ... 9 5 6 3 4 2 3 3 0 1
D 1 0 132 6 17 22 17 25 21 25 ... 12 16 21 7 5 18 2 1 296 0
E 4 0 45 10 16 12 10 15 17 18 ... 4 9 7 10 5 6 4 3 129 0
F 1 0 4 17 14 11 8 11 24 9 ... 17 8 8 12 7 3 1 98 0 1
G 2 10 77 55 71 52 65 39 37 45 ... 46 65 23 9 18 171 141 2 31 0
H 0 5 25 12 18 8 12 7 10 6 ... 8 11 6 4 4 5 2 2 1 8
I 1 8 7 23 26 35 36 34 31 38 ... 19 7 2 37 7 3 0 3 2 26
K 0 42 3 24 5 15 17 11 6 8 ... 9 10 9 8 9 2 1 28 0 0
L 3 0 19 50 32 33 21 26 26 18 ... 19 44 122 11 10 7 5 17 2 5
M 0 1 1 3 1 13 9 12 12 8 ... 20 3 1 1 0 1 0 191 0 0
N 0 5 3 12 8 15 12 13 21 9 ... 18 10 10 11 12 26 3 0 5 1
P 1 1 19 50 39 47 42 43 39 33 ... 48 35 15 16 59 2 13 6 0 160
Q 0 2 16 15 12 13 10 13 16 5 ... 11 6 3 11 4 1 0 1 6 28
R 0 380 17 66 54 41 51 32 24 29 ... 43 44 16 17 14 6 2 126 4 5
S 14 18 27 42 55 37 41 42 45 70 ... 47 31 64 14 42 18 8 3 1 5
T 4 13 17 32 29 37 33 32 30 38 ... 87 79 19 125 96 11 11 7 7 3
V 4 9 36 24 39 40 35 45 42 52 ... 20 12 12 9 8 5 0 6 7 209
W 0 0 1 6 6 8 4 7 7 9 ... 6 6 1 1 1 1 27 1 0 0
X 0 0 0 0 0 0 0 0 0 0 ... 0 4 0 0 0 0 0 0 0 0
Y 0 0 13 17 24 27 44 47 41 31 ... 29 76 139 179 191 208 92 0 2 45
I like to use ggplot to make heat maps which would just be this data frame. However, the dataframes needed for ggplot are a little different. I can use the pandas.melt function to get close, but I'm missing the row titles.
>>>mdf = pandas.melt(df)
>>>mdf
variable value
0 93 465
1 93 0
2 93 1
3 93 4
4 93 1
5 93 2
6 93 0
7 93 1
8 93 0
...
624 103 5
625 103 3
626 103 209
627 103 0
628 103 0
629 103 45
The easiest way to make this dataframe usable would be to add the amino acid (the row label) as a column, so the DataFrame looks like:
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
That way I can take that dataframe and put it right into ggplot:
>>> from ggplot import *
>>> ggplot(new_df,aes("variable","rowvalue")) + geom_tile(fill="value")
would produce a beautiful heatmap. How do I manipulate the nested dictionary dataframe in order to get the dataframe at the end? If there is a more efficient way to do this, I'm open to suggestions, but I still want to use ggplot2.
Edit -
I found a solution but it seems to be way too convoluted. Basically I make the index into a column, then melt the data frame.
>>>df.reset_index(level=0,inplace=True)
>>>pandas.melt(df,id_vars=['index'])
index variable value
0 A 93 465
1 C 93 0
2 D 93 1
3 E 93 4
4 F 93 1
5 G 93 2
6 H 93 0
7 I 93 1
8 K 93 0
9 L 93 3
10 M 93 0
11 N 93 0
12 P 93 1
13 Q 93 0
14 R 93 0
15 S 93 14
16 T 93 4
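If I understand your question properly, I think you can simply do the following (melt stacks the columns one after another, so the row labels have to be repeated once per column):
mdf = pandas.melt(df)
mdf['rowvalue'] = list(df.index) * len(df.columns)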
mdf
variable value rowvalue
0 93 465 A
1 93 0 C
2 93 1 D
3 93 4 E
4 93 1 F
5 93 2 G
6 93 0 H
7 93 1 I
8 93 0 K
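The two steps from the question's edit (move the index into a column, then melt) can be chained into one expression. A minimal sketch, using a hypothetical three-row slice of the nested dictionary and naming the index rowvalue up front:

```python
import pandas as pd

# Hypothetical slice of the nested dictionary: outer keys become columns,
# inner keys become the row labels (amino acids).
nested = {
    "93": {"A": 465, "C": 0, "D": 1},
    "94": {"A": 5, "C": 1, "D": 0},
}
df = pd.DataFrame.from_dict(nested)

# Name the index, turn it into a column, then melt around that column
long_df = (
    df.rename_axis("rowvalue")
      .reset_index()
      .melt(id_vars="rowvalue")
)
print(long_df)
```

The result has the columns rowvalue, variable and value, which is the shape the ggplot call in the question expects.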
