Pandas DataFrame RangeIndex - python

I have created a Pandas DataFrame. I need to create a RangeIndex for the DataFrame that corresponds to the frame -
RangeIndex(start=0, stop=x, step=y) - where x and y relate to my DataFrame.
I've not seen an example of how to do this - is there a method or syntax specific to this?
thanks

It seems you need RangeIndex constructor:
df = pd.DataFrame({'A' : range(1, 21)})
print (df)
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
print (df.index)
RangeIndex(start=0, stop=20, step=1)
df.index = pd.RangeIndex(start=0, stop=99, step=5)
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
print (df.index)
RangeIndex(start=0, stop=99, step=5)
More dynamic solution:
step = 10
df.index = pd.RangeIndex(start=0, stop=len(df.index) * step - 1, step=step)
print (df)
A
0 1
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
140 15
150 16
160 17
170 18
180 19
190 20
print (df.index)
RangeIndex(start=0, stop=199, step=10)
EDIT:
As #ZakS pointed in comments better is use only DataFrame constructor:
df = pd.DataFrame({'A' : range(1, 21)}, index=pd.RangeIndex(start=0, stop=99, step=5))
print (df)
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20

Related

combining specific row conditionally and add output to existing row in pandas

suppose I have following data frame :
data = {'age' :[10,11,12,11,11,10,11,13,13,13,14,14,15,15,15],
'num1':[10,11,12,13,14,15,16,17,18,19,20,21,22,23,24],
'num2':[20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]}
df = pd.DataFrame(data)
I want to sum rows for age 14 and 15 and keep those new values as age 14. my expected output would be like this:
age time1 time2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
in the code below, I have tried to group.by age but it does not work for me:
df1 =df.groupby(age[age >=14])['num1', 'num2'].apply(', '.join).reset_index(drop=True).to_frame()
limit_age = 14
new = df.query("age < #limit_age").copy()
new.loc[len(new)] = [limit_age,
*df.query("age >= #limit_age").drop(columns="age").sum()]
first get the "before 14" dataframe
then assign it to a new row where
age is 14
other values are the row-wise sums of "after 14" dataframe
to get
>>> new
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
(new.index += 1 can be used for a 1-based index at the end.)
I would use a mask and concat:
m = df['age'].isin([14, 15])
out = pd.concat([df[~m],
df[m].agg({'age': 'min', 'num1': 'sum', 'num2': 'sum'})
.to_frame().T
], ignore_index=True)
Output:
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160

Pandas dataframe problem. Create column where a row cell gets the value of another row cell

I have this pandas dataframe. It is sorted by the "h" column. What I want is to add two new columns where:
The items of each zone, will have a max boundary and a min boundary. (They will be the same for every item in the zone). The max boundary will be the minimum "h" value of the previous zone, and the min boundary will be the maximum "h" value of the next zone
name h w set row zone
ZZON5 40 36 A 0 0
DWOPN 38 44 A 1 0
5SWYZ 37 22 B 2 0
TFQEP 32 55 B 3 0
OQ33H 26 41 A 4 1
FTJVQ 24 25 B 5 1
F1RK2 20 15 B 6 1
266LT 18 19 A 7 1
HSJ3X 16 24 A 8 2
L754O 12 86 B 9 2
LWHDX 11 68 A 10 2
ZKB2F 9 47 A 11 2
5KJ5L 7 72 B 12 3
CZ7ET 6 23 B 13 3
SDZ1B 2 10 A 14 3
5KWRU 1 59 B 15 3
what i hope for:
name h w set row zone maxB minB
ZZON5 40 36 A 0 0 26
DWOPN 38 44 A 1 0 26
5SWYZ 37 22 B 2 0 26
TFQEP 32 55 B 3 0 26
OQ33H 26 41 A 4 1 32 16
FTJVQ 24 25 B 5 1 32 16
F1RK2 20 15 B 6 1 32 16
266LT 18 19 A 7 1 32 16
HSJ3X 16 24 A 8 2 18 7
L754O 12 86 B 9 2 18 7
LWHDX 11 68 A 10 2 18 7
ZKB2F 9 47 A 11 2 18 7
5KJ5L 7 72 B 12 3 9
CZ7ET 6 23 B 13 3 9
SDZ1B 2 10 A 14 3 9
5KWRU 1 59 B 15 3 9
Any ideas?
First group-by zone and find the minimum and maximum of them
min_max_zone = df.groupby('zone').agg(min=('h', 'min'), max=('h', 'max'))
Now you can use apply:
df['maxB'] = df['zone'].apply(lambda x: min_max_zone.loc[x-1, 'min']
if x-1 in min_max_zone.index else np.nan)
df['minB'] = df['zone'].apply(lambda x: min_max_zone.loc[x+1, 'max']
if x+1 in min_max_zone.index else np.nan)

Create groups based on column values

I am attempting to create user groups based on a particluar DataFrame column value. I would like to create 10 user groups of the entire DataFrame's population, based on the total_usage metric. An example DataFrame df is shown below.
user_id total_usage
1 10
2 10
3 20
4 20
5 30
6 30
7 40
8 40
9 50
10 50
11 60
12 60
13 70
14 70
15 80
16 80
17 90
18 90
19 100
20 100
The df is just a snippet of the entire DataFrame which is over 6000 records long, however I would like like to only have 10 user groups.
An example of my desired output is shown below.
user_id total_usage user_group
1 10 10th_group
2 10 10th_group
3 20 9th_group
4 20 9th_group
5 30 8th_group
6 30 8th_group
7 40 7th_group
8 40 7th_group
9 50 6th_group
10 50 6th_group
11 60 5th_group
12 60 5th_group
13 70 4th_group
14 70 4th_group
15 80 3th_group
16 80 3th_group
17 90 2nd_group
18 90 2nd_group
19 100 1st_group
20 100 1st_group
Any assistance that anyone could provide would be greatly appreciated.
Looks like you are looking for qcut, but in reverse order
df['user_group'] = 10 - pd.qcut(df['total_usage'], np.arange(0,1.1, 0.1)).cat.codes
Output, it's not ordinal, but I hope it will do:
0 10
1 10
2 9
3 9
4 8
5 8
6 7
7 7
8 6
9 6
10 5
11 5
12 4
13 4
14 3
15 3
16 2
17 2
18 1
19 1
dtype: int8
Use qcut with changed order by negatives and Series.map for 1.st and 2.nd values:
s = pd.qcut(-df['total_usage'], np.arange(0,1.1, 0.1), labels=False) + 1
d = {1:'st', 2:'nd'}
df['user_group'] = s.astype(str) + s.map(d).fillna('th') + '_group'
print (df)
user_id total_usage user_group
0 1 10 10th_group
1 2 10 10th_group
2 3 20 9th_group
3 4 20 9th_group
4 5 30 8th_group
5 6 30 8th_group
6 7 40 7th_group
7 8 40 7th_group
8 9 50 6th_group
9 10 50 6th_group
10 11 60 5th_group
11 12 60 5th_group
12 13 70 4th_group
13 14 70 4th_group
14 15 80 3th_group
15 16 80 3th_group
16 17 90 2nd_group
17 18 90 2nd_group
18 19 100 1st_group
19 20 100 1st_group
Try using pd.Series with np.repeat, np.arange, pd.DataFrame.groupby, pd.Series.astype, pd.Series.map and pd.Series.fillna:
x = df.groupby('total_usage')
s = pd.Series(np.repeat(np.arange(len(x.ngroups), [len(i) for i in x.groups.values()]) + 1)
df['user_group'] = (s.astype(str) + s.map({1: 'st', 2: 'nd'}).fillna('th') + '_Group').values[::-1]
And now:
print(df)
Is:
user_id total_usage user_group
0 1 10 10th_Group
1 2 10 10th_Group
2 3 20 9th_Group
3 4 20 9th_Group
4 5 30 8th_Group
5 6 30 8th_Group
6 7 40 7th_Group
7 8 40 7th_Group
8 9 50 6th_Group
9 10 50 6th_Group
10 11 60 5th_Group
11 12 60 5th_Group
12 13 70 4th_Group
13 14 70 4th_Group
14 15 80 3th_Group
15 16 80 3th_Group
16 17 90 2nd_Group
17 18 90 2nd_Group
18 19 100 1st_Group
19 20 100 1st_Group

How to plot multiple lines as histograms per group from a pandas Date Frame

I am trying to look at 'time of day' effects on my users on a week over week basis to get a quick visual take on how consistent time of day trends are. So as a first start I've used this:
df[df['week'] < 10][['realLocalTime', 'week']].hist(by = 'week', bins = 24, figsize = (15, 15))
To produce the following:
This is a nice easy start, but what I would really like is to represent the histogram as a line plot, and overlay all the lines, one for each week on the same plot. Is there a way to do this?
I have a bit more experience with ggplot, where I would just do this by adding a factor level dependency on color and by. Is there a similarly easy way to do this with pandas and or matplotlib?
Here's what my data looks like:
realLocalTime week
1 12 10
2 12 10
3 12 10
4 12 10
5 13 5
6 17 5
7 17 5
8 6 6
9 17 5
10 20 6
11 18 5
12 18 5
13 19 6
14 21 6
15 21 6
16 14 6
17 6 6
18 0 6
19 21 5
20 17 6
21 23 6
22 22 6
23 22 6
24 17 6
25 22 5
26 13 6
27 23 6
28 22 5
29 21 6
30 17 6
... ... ...
70 14 5
71 9 5
72 19 6
73 19 6
74 21 6
75 20 5
76 20 5
77 21 5
78 15 6
79 22 6
80 23 6
81 15 6
82 12 6
83 7 6
84 9 6
85 8 6
86 22 6
87 22 6
88 22 6
89 8 5
90 8 5
91 8 5
92 9 5
93 7 5
94 22 5
95 8 6
96 10 6
97 0 6
98 22 5
99 14 6
Maybe you can simply use crosstab to compute the number of element by week and plot it.
# Test data
d = {'realLocalTime': ['12','14','14','12','13','17','14', '17'],
'week': ['10','10','10','10','5','5','6', '6']}
df = DataFrame(d)
ax = pd.crosstab(df['realLocalTime'], df['week']).plot()
Use groupby and value_counts
df.groupby('week').realLocalTime.value_counts().unstack(0).fillna(0).plot()

grouping by id and a condition

I have a dataframe df
df=DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
'min':[10,17,21,22,22,7,58,15,17,19,19,19,19,19,25,26,26],
'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
that looks like
id min day
0 a 10 15
1 a 17 15
2 a 21 15
3 a 30 15
4 a 50 15
5 a 57 17
6 a 58 17
7 b 15 41
8 b 17 41
9 b 19 41
10 b 19 41
11 b 19 41
12 b 19 41
13 b 19 41
14 b 25 57
15 b 26 57
16 b 26 57
I want a new column that categorizes the data in a certain format based on the id and the relationship between the rows as follows, if min value difference for consecutive rows is less than 8 and the day value is the same I want to assign them to the same group, so my output would look like.
id min day category
0 a 10 15 1
1 a 17 15 1
2 a 21 15 1
3 a 30 15 2
4 a 50 15 3
5 a 57 17 4
6 a 58 17 4
7 b 15 41 5
8 b 17 41 5
9 b 19 41 5
10 b 19 41 5
11 b 19 41 5
12 b 19 41 5
13 b 19 41 5
14 b 25 57 6
15 b 26 57 6
16 b 26 57 6
hope this helps. let me know your views.
All the best.
import pandas as pd
df=pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
'min':[10,17,21,22,22,7,58,15,17,19,19,19,19,19,25,26,26],
'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
# initialize the catagory to 1 for counter increament
cat =1
# for the first row the catagory will be 1
new_series = [cat]
# loop will start from 1 and not from 0 because we cannot perform operation on iloc -1
for i in range(1,len(df)):
if df.iloc[i]['day'] == df.iloc[i-1]['day']:
if df.iloc[i]['min'] - df.iloc[i-1]['min'] > 8:
cat+=1
else:
cat+=1
new_series.append(cat)
df['catagory']= new_series
print(df)

Categories