Create groups based on column values - python

I am attempting to create user groups based on a particular DataFrame column value. I would like to split the entire DataFrame's population into 10 user groups based on the total_usage metric. An example DataFrame df is shown below.
user_id total_usage
1 10
2 10
3 20
4 20
5 30
6 30
7 40
8 40
9 50
10 50
11 60
12 60
13 70
14 70
15 80
16 80
17 90
18 90
19 100
20 100
The df is just a snippet of the entire DataFrame, which is over 6000 records long; however, I would like to have only 10 user groups.
An example of my desired output is shown below.
user_id total_usage user_group
1 10 10th_group
2 10 10th_group
3 20 9th_group
4 20 9th_group
5 30 8th_group
6 30 8th_group
7 40 7th_group
8 40 7th_group
9 50 6th_group
10 50 6th_group
11 60 5th_group
12 60 5th_group
13 70 4th_group
14 70 4th_group
15 80 3rd_group
16 80 3rd_group
17 90 2nd_group
18 90 2nd_group
19 100 1st_group
20 100 1st_group
Any assistance that anyone could provide would be greatly appreciated.

Looks like you are looking for qcut, but in reverse order:
import numpy as np
import pandas as pd

df['user_group'] = 10 - pd.qcut(df['total_usage'], np.arange(0, 1.1, 0.1)).cat.codes
Output (plain group numbers rather than ordinal labels, but I hope it will do):
0 10
1 10
2 9
3 9
4 8
5 8
6 7
7 7
8 6
9 6
10 5
11 5
12 4
13 4
14 3
15 3
16 2
17 2
18 1
19 1
dtype: int8

Use qcut on the negated values to reverse the order, and Series.map for the 1st, 2nd and 3rd suffixes:
s = pd.qcut(-df['total_usage'], np.arange(0, 1.1, 0.1), labels=False) + 1
d = {1: 'st', 2: 'nd', 3: 'rd'}
df['user_group'] = s.astype(str) + s.map(d).fillna('th') + '_group'
print (df)
user_id total_usage user_group
0 1 10 10th_group
1 2 10 10th_group
2 3 20 9th_group
3 4 20 9th_group
4 5 30 8th_group
5 6 30 8th_group
6 7 40 7th_group
7 8 40 7th_group
8 9 50 6th_group
9 10 50 6th_group
10 11 60 5th_group
11 12 60 5th_group
12 13 70 4th_group
13 14 70 4th_group
14 15 80 3rd_group
15 16 80 3rd_group
16 17 90 2nd_group
17 18 90 2nd_group
18 19 100 1st_group
19 20 100 1st_group

Try using pd.Series with np.repeat, np.arange, pd.DataFrame.groupby, pd.Series.astype, pd.Series.map and pd.Series.fillna:
x = df.groupby('total_usage')
s = pd.Series(np.repeat(np.arange(x.ngroups) + 1, [len(v) for v in x.groups.values()]))
df['user_group'] = (s.astype(str) + s.map({1: 'st', 2: 'nd', 3: 'rd'}).fillna('th') + '_Group').values[::-1]
And now:
print(df)
Is:
user_id total_usage user_group
0 1 10 10th_Group
1 2 10 10th_Group
2 3 20 9th_Group
3 4 20 9th_Group
4 5 30 8th_Group
5 6 30 8th_Group
6 7 40 7th_Group
7 8 40 7th_Group
8 9 50 6th_Group
9 10 50 6th_Group
10 11 60 5th_Group
11 12 60 5th_Group
12 13 70 4th_Group
13 14 70 4th_Group
14 15 80 3rd_Group
15 16 80 3rd_Group
16 17 90 2nd_Group
17 18 90 2nd_Group
18 19 100 1st_Group
19 20 100 1st_Group

Related

Pandas: Find closest group from another dataframe

Below, I have two dataframes. I need to update df_mapped using df_original.
In df_mapped, for each x_time I need to find the 3 closest rows (closest defined by the difference in x_price) and add those to the df_mapped dataframe.
import io
import pandas as pd
d = """
x_time expiration x_price p_price
60 4 10 20
60 5 11 30
60 6 12 40
60 7 13 50
60 8 14 60
70 5 10 20
70 6 11 30
70 7 12 40
70 8 13 50
70 9 14 60
80 1 10 20
80 2 11 30
80 3 12 40
80 4 13 50
80 5 14 60
"""
df_original = pd.read_csv(io.StringIO(d), delim_whitespace=True)
to_mapped = """
x_time expiration x_price
50 4 15
60 5 15
70 6 13
80 7 20
90 8 20
"""
df_mapped = pd.read_csv(io.StringIO(to_mapped), delim_whitespace=True)
df_mapped = df_mapped.merge(df_original, on='x_time', how='left')
df_mapped['x_price_delta'] = abs(df_mapped['x_price_x'] - df_mapped['x_price_y'])
Intermediate output: from this, I need to select the 3 rows with the smallest x_price_delta for each x_time:
int_out = """
x_time expiration_x x_price_x expiration_y x_price_y p_price x_price_delta
50 4 15
60 5 15 6 12 40 3
60 5 15 7 13 50 2
60 5 15 8 14 60 1
70 6 13 7 12 40 1
70 6 13 8 13 50 0
70 6 13 9 14 60 1
80 7 20 3 12 40 8
80 7 20 4 13 50 7
80 7 20 5 14 60 6
90 8 20
"""
df_int_out = pd.read_csv(io.StringIO(int_out), delim_whitespace=True)
Final step: keeping x_time fixed, I need to flatten the dataframe so the 3 closest rows end up in one row:
final_out = """
x_time expiration_original x_price_original expiration_1 x_price_1 p_price_1 expiration_2 x_price_2 p_price_2 expiration_3 x_price_3 p_price_3
50 4 15
60 5 15 6 12 40 7 13 50 8 14 60
70 6 13 7 12 40 8 13 50 9 14 60
80 7 20 3 12 40 4 13 50 5 14 60
90 8 20
"""
df_out = pd.read_csv(io.StringIO(final_out), delim_whitespace=True)
I am stuck between the intermediate and last steps. I can't think of a way out; what could be done to massage the dataframe?
This is not a complete solution, but it might help you get unstuck. At the end we get the correct data.
In [1]: df = (df_int_out.groupby("x_time")
   ...:       .apply(lambda x: x.sort_values(ascending=False, by="x_price_delta"))
   ...:       .set_index(["x_time", "expiration_x"])
   ...:       .drop(["x_price_delta", "x_price_x"], axis=1))
In [2]: df1 = df.iloc[1:-1]
In [3]: df1.groupby(df1.index).apply(lambda x: pd.concat([pd.DataFrame(d) for d in x.values], axis=1).unstack())
Out[3]:
0
0 1 2 0 1 2 0 1 2
(60, 5) 6.0 12.0 40.0 7.0 13.0 50.0 8.0 14.0 60.0
(70, 6) 7.0 12.0 40.0 9.0 14.0 60.0 8.0 13.0 50.0
(80, 7) 3.0 12.0 40.0 4.0 13.0 50.0 5.0 14.0 60.0
I am sure there are much better ways of handling this case.
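One way to finish the flattening step is to rank the three best matches per x_time and pivot them into columns. The sketch below runs on a toy subset of the question's merged frame (the values and the `_y` column suffixes follow the question's merge output, but this is an illustration, not the asker's exact data):

```python
import pandas as pd

# A toy version of the merged intermediate frame from the question
df_int = pd.DataFrame({
    'x_time':        [60, 60, 60, 60, 60, 70, 70, 70],
    'expiration_y':  [4, 5, 6, 7, 8, 7, 8, 9],
    'x_price_y':     [10, 11, 12, 13, 14, 12, 13, 14],
    'p_price':       [20, 30, 40, 50, 60, 40, 50, 60],
    'x_price_delta': [5, 4, 3, 2, 1, 1, 0, 1],
})

# Keep the 3 rows with the smallest delta per x_time, ordered best-first
top3 = (df_int.sort_values(['x_time', 'x_price_delta'])
              .groupby('x_time')
              .head(3)
              .copy())

# Number the matches 1..3 within each x_time, then pivot them into columns
top3['rank'] = top3.groupby('x_time').cumcount() + 1
wide = top3.pivot(index='x_time', columns='rank',
                  values=['expiration_y', 'x_price_y', 'p_price'])
wide.columns = [f'{col}_{rank}' for col, rank in wide.columns]
print(wide)
```

The x_time values with no match at all (50 and 90 in the question) would simply be missing from `wide` and can be brought back with a left merge from the original mapping frame.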

Adding a new column that is filled with 1 to 6 for the first 6 rows, then 1 to 6 again for the next 6 rows, with python

I am trying to add a new column in which every 6 rows of the dataframe are filled with the numbers 1 to 6, repeating for all the rows in the dataframe. The illustration below shows how the output should look.
input
ID
0 20
1 20
2 20
3 20
4 20
5 20
6 34
7 34
8 34
9 34
10 34
11 34
12 67
13 67
14 67
15 67
16 67
17 67
output
ID 6_months
0 20 1
1 20 2
2 20 3
3 20 4
4 20 5
5 20 6
6 34 1
7 34 2
8 34 3
9 34 4
10 34 5
11 34 6
12 67 1
13 67 2
14 67 3
15 67 4
16 67 5
17 67 6
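A minimal sketch of one way to produce this pattern, assuming the row count is a multiple of 6 as in the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [20] * 6 + [34] * 6 + [67] * 6})
# Cycle the numbers 1..6 down the frame, restarting every 6 rows
df['6_months'] = np.arange(len(df)) % 6 + 1
print(df)
```

If the blocks of 6 are really defined by the ID value rather than strictly by position, `df.groupby('ID').cumcount() + 1` gives the same result for this data and is robust to uneven block sizes.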

Pandas DataFrame RangeIndex

I have created a Pandas DataFrame. I need to create a RangeIndex for the DataFrame that corresponds to the frame -
RangeIndex(start=0, stop=x, step=y) - where x and y relate to my DataFrame.
I've not seen an example of how to do this - is there a method or syntax specific to this?
thanks
It seems you need RangeIndex constructor:
df = pd.DataFrame({'A' : range(1, 21)})
print (df)
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
print (df.index)
RangeIndex(start=0, stop=20, step=1)
df.index = pd.RangeIndex(start=0, stop=99, step=5)
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
print (df.index)
RangeIndex(start=0, stop=99, step=5)
More dynamic solution:
step = 10
df.index = pd.RangeIndex(start=0, stop=len(df.index) * step - 1, step=step)
print (df)
A
0 1
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
140 15
150 16
160 17
170 18
180 19
190 20
print (df.index)
RangeIndex(start=0, stop=199, step=10)
EDIT:
As @ZakS pointed out in the comments, it is better to pass the index directly to the DataFrame constructor:
df = pd.DataFrame({'A' : range(1, 21)}, index=pd.RangeIndex(start=0, stop=99, step=5))
print (df)
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
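Going the other way, if a DataFrame has ended up with a custom index and you want the default RangeIndex back, reset_index(drop=True) restores it (a small sketch):

```python
import pandas as pd

# A frame with a custom stepped RangeIndex, as in the examples above
df = pd.DataFrame({'A': range(1, 6)}, index=pd.RangeIndex(start=0, stop=25, step=5))

# Dropping the custom index restores RangeIndex(start=0, stop=len(df), step=1)
df = df.reset_index(drop=True)
print(df.index)
```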

How to plot multiple lines as histograms per group from a pandas Date Frame

I am trying to look at 'time of day' effects on my users on a week over week basis to get a quick visual take on how consistent time of day trends are. So as a first start I've used this:
df[df['week'] < 10][['realLocalTime', 'week']].hist(by = 'week', bins = 24, figsize = (15, 15))
To produce the following:
This is a nice easy start, but what I would really like is to represent the histogram as a line plot, and overlay all the lines, one for each week on the same plot. Is there a way to do this?
I have a bit more experience with ggplot, where I would just do this by adding a factor level dependency on color and by. Is there a similarly easy way to do this with pandas and or matplotlib?
Here's what my data looks like:
realLocalTime week
1 12 10
2 12 10
3 12 10
4 12 10
5 13 5
6 17 5
7 17 5
8 6 6
9 17 5
10 20 6
11 18 5
12 18 5
13 19 6
14 21 6
15 21 6
16 14 6
17 6 6
18 0 6
19 21 5
20 17 6
21 23 6
22 22 6
23 22 6
24 17 6
25 22 5
26 13 6
27 23 6
28 22 5
29 21 6
30 17 6
... ... ...
70 14 5
71 9 5
72 19 6
73 19 6
74 21 6
75 20 5
76 20 5
77 21 5
78 15 6
79 22 6
80 23 6
81 15 6
82 12 6
83 7 6
84 9 6
85 8 6
86 22 6
87 22 6
88 22 6
89 8 5
90 8 5
91 8 5
92 9 5
93 7 5
94 22 5
95 8 6
96 10 6
97 0 6
98 22 5
99 14 6
Maybe you can simply use crosstab to compute the number of elements by week and plot it.
# Test data
d = {'realLocalTime': ['12','14','14','12','13','17','14', '17'],
'week': ['10','10','10','10','5','5','6', '6']}
df = pd.DataFrame(d)
ax = pd.crosstab(df['realLocalTime'], df['week']).plot()
Use groupby and value_counts
df.groupby('week').realLocalTime.value_counts().unstack(0).fillna(0).plot()
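If some hours never occur in a given week, the lines drawn by the one-liner above will have gaps. One variant (a sketch on toy data) reindexes the counts to the full 0-23 hour range before plotting:

```python
import pandas as pd

# Toy data standing in for the question's realLocalTime/week frame
df = pd.DataFrame({'realLocalTime': [12, 12, 13, 17, 17, 6, 17, 20],
                   'week':          [10, 10,  5,  5,  5, 6,  5,  6]})

# Count occurrences of each hour per week, then fill in hours that never occur
counts = (df.groupby('week')['realLocalTime']
            .value_counts()
            .unstack(0)
            .reindex(range(24))
            .fillna(0))
# counts.plot() draws one line per week over the full 0-23 hour axis
print(counts)
```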

grouping by id and a condition

I have a dataframe df
df = pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
                   'min': [10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
                   'day': [15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
that looks like
id min day
0 a 10 15
1 a 17 15
2 a 21 15
3 a 30 15
4 a 50 15
5 a 57 17
6 a 58 17
7 b 15 41
8 b 17 41
9 b 19 41
10 b 19 41
11 b 19 41
12 b 19 41
13 b 19 41
14 b 25 57
15 b 26 57
16 b 26 57
I want a new column that categorizes the data based on the id and the relationship between the rows: if the min value difference between consecutive rows is less than 8 and the day value is the same, I want to assign them to the same group. My desired output would look like:
id min day category
0 a 10 15 1
1 a 17 15 1
2 a 21 15 1
3 a 30 15 2
4 a 50 15 3
5 a 57 17 4
6 a 58 17 4
7 b 15 41 5
8 b 17 41 5
9 b 19 41 5
10 b 19 41 5
11 b 19 41 5
12 b 19 41 5
13 b 19 41 5
14 b 25 57 6
15 b 26 57 6
16 b 26 57 6
Hope this helps. Let me know your views.
All the best.
import pandas as pd

df = pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
                   'min': [10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
                   'day': [15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
# initialize the category counter to 1
cat = 1
# the first row always belongs to category 1
new_series = [cat]
# the loop starts at 1, not 0, because there is no previous row for index 0
for i in range(1, len(df)):
    if df.iloc[i]['day'] == df.iloc[i-1]['day']:
        if df.iloc[i]['min'] - df.iloc[i-1]['min'] > 8:
            cat += 1
    else:
        cat += 1
    new_series.append(cat)
df['category'] = new_series
print(df)
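The same grouping can also be done without an explicit loop. The sketch below (shown on a subset of the question's data) starts a new category wherever the day changes or the min gap exceeds 8, then numbers the groups with a cumulative sum of those flags:

```python
import pandas as pd

df = pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b'],
                   'min': [10, 17, 21, 30, 50, 57, 58, 15, 17, 19],
                   'day': [15, 15, 15, 15, 15, 17, 17, 41, 41, 41]})

# True wherever a new group starts; cumsum turns the flags into 1, 2, 3, ...
new_group = (df['day'] != df['day'].shift()) | (df['min'].diff() > 8)
df['category'] = new_group.cumsum()
print(df)
```

The first row compares against a shifted NaN, which makes the flag True there, so numbering starts at 1 as desired.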
