combining specific row conditionally and add output to existing row in pandas - python

Suppose I have the following data frame:
import pandas as pd

data = {'age':  [10,11,12,11,11,10,11,13,13,13,14,14,15,15,15],
        'num1': [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24],
        'num2': [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]}
df = pd.DataFrame(data)
I want to sum the rows for ages 14 and 15 and keep the new values under age 14. My expected output looks like this:
age num1 num2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
In the code below I tried to group by age, but it does not work for me:
df1 =df.groupby(age[age >=14])['num1', 'num2'].apply(', '.join).reset_index(drop=True).to_frame()

limit_age = 14
new = df.query("age < @limit_age").copy()
new.loc[len(new)] = [limit_age,
                     *df.query("age >= @limit_age").drop(columns="age").sum()]
First get the "before 14" dataframe, then assign a new row to it where
    age is 14
    the other values are the column sums of the "14 and over" dataframe
to get
>>> new
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
(new.index += 1 can be used for a 1-based index at the end.)

I would use a mask and concat:
m = df['age'].isin([14, 15])
out = pd.concat([df[~m],
                 df[m].agg({'age': 'min', 'num1': 'sum', 'num2': 'sum'})
                      .to_frame().T
                ], ignore_index=True)
Output:
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
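The mask-and-concat idea can also be wrapped in a small reusable function. The sketch below is only an illustration: collapse_ages is a hypothetical helper name, not a pandas function, and it assumes at least one row matches the given ages.

```python
import pandas as pd

def collapse_ages(df, ages, key="age"):
    """Hypothetical helper: replace all rows whose `key` is in `ages`
    with a single row holding the smallest matching key and the column
    sums of the remaining columns. Assumes at least one row matches."""
    mask = df[key].isin(ages)
    value_cols = [c for c in df.columns if c != key]
    merged = df.loc[mask, value_cols].sum()       # column sums of the matching rows
    merged[key] = df.loc[mask, key].min()         # keep the smallest age as the label
    merged_row = merged.to_frame().T[df.columns]  # one-row frame in the original column order
    return pd.concat([df[~mask], merged_row], ignore_index=True)

data = {'age':  [10, 11, 12, 11, 11, 10, 11, 13, 13, 13, 14, 14, 15, 15, 15],
        'num1': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
        'num2': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]}
df = pd.DataFrame(data)

out = collapse_ages(df, [14, 15])
print(out.tail(3))
```

The last row of out should then carry age 14 with the summed num1 and num2 values, while the rows below 14 stay untouched.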

Related

Why are my matrix values being reset back to 1?

I can't seem to figure out why the values of the first column after the 3 are being reset to 1 and then 1 again. I think it has to do with the "not 999" if statement, but I'm not sure what to add or change. Any help is appreciated, thank you!
n = 15 #the number the matrices will be an n*n of
a = [[ 0 for x in range(n)] for x in range(n)]
a[-1][-1] =111
a[0][3] =999
def Manhattan(x):
    k = 1
    m = 1
    for i in range(n):
        for j in range(n):
            if (x[i][j] < 1):
                if (x[j][i] < 1):
                    if (x[i][j-1] != 999 or x[j-1][i] != 999):
                        k = x[i][j-1]
                        x[i][j] = k + 1
                        m = x[j-1][i]
                        x[j][i] = m + 1
                else:
                    k = x[i][j]
                    x[i][j] = k + 1
    for row in x:
        print()
        for val in row:
            print('%4d' % val, end=" ")
Manhattan(a)
I'm not completely sure what your program is meant to do, but I think this is the output you're trying to achieve:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
code:
n = 15 #the number the matrices will be an n*n of
a = [[ 0 for x in range(n)] for x in range(n)]
a[-1][-1] = 111
a[0][3] = 999
def Manhattan(x1):
    start_value = 0
    for x in range(n):
        start_value += 1
        second_val = start_value
        for y in range(n):
            x1[y][x] = second_val
            second_val += 1
    for line in x1:
        for char in line:
            print('%4d' % char, end=" ")
        print()
Manhattan(a)
Your code is fairly unorganized, which makes it hard to understand. If this was your goal, then you need a value that starts at 1 and increases by 1 on each pass of the outer loop; on every iteration of the inner loop you write the current value into the corresponding index and then increase it by 1.
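For reference, the same grid can be written without running counters, since each cell in the output above equals its row index plus its column index plus 1. This is only a sketch of that observation; it ignores the 111 and 999 marker cells from the question, and the names grid, row and col are illustrative.

```python
n = 15  # the grid is n x n

# Cell (row, col) holds row + col + 1, so values grow by one
# for every step to the right or downwards.
grid = [[row + col + 1 for col in range(n)] for row in range(n)]

for line in grid:
    print(" ".join("%4d" % value for value in line))
```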
I think your question is why x[3][0] equals 1 after calling the function Manhattan. That is because your else statement is paired with the if (x[j][i] < 1) statement.
So when the code reaches x[3][0], the check if (x[0][3] < 1) evaluates to False,
and execution goes to the else branch, which means k = x[3][0] = 0, and then x[3][0] = 0 + 1 = 1.
I'm not sure if this output is your expectation:
1 2 3 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
15 16 17 18 19 20 21 22 23 24 25 26 27 28 111
Code:
def Manhattan(x):
    for i in range(n):
        for j in range(n):
            if x[i][j] == 111:
                break
            up = 0
            left = 0
            if x[i-1][j] > 0 and bool(i):
                up = x[i-1][j]
            if x[i][j-1] > 0 and bool(j):
                left = x[i][j-1]
            if bool(up) and bool(left):
                x[i][j] = x[i][j] + min(up, left) + 1
            elif bool(up):
                x[i][j] = x[i][j] + up + 1
            elif bool(left):
                x[i][j] = x[i][j] + left + 1
            else:
                x[i][j] = x[i][j] + 1
    for row in x:
        print()
        for val in row:
            print('%4d' % val, end=" ")
Figured it out myself! The issue was that when checking j-1 != 999, it was equal to 999 for x[0][3], so it would never fill out those elements correctly. I fixed it by changing the else statement and adding a second if statement for j-1 == 999:
n = 10 #the number the matrices will be an n*n of
a = [[ 0 for x in range(n)] for x in range(n)]
a[-1][-1] =111
a[0][6] =999
def Manhattan(x):
    k = 1
    m = 1
    for i in range(n):
        for j in range(n):
            if (x[i][j] < 1):
                if (x[j][i] < 1):
                    if (x[i][j-1] != 999 and x[j-1][i] != 999):
                        k = x[i][j-1]
                        x[i][j] = k + 1
                        m = x[j-1][i]
                        x[j][i] = m + 1
                    if (x[i][j-1] == 999 or x[j-1][i] == 999):
                        k = x[i][j-2]
                        x[i][j] = k + 4
                        m = x[j-2][i]
                        x[j][i] = m + 2
                else:
                    k = x[i][j]
                    x[i][j] = k + 4
    for row in x:
        print()
        for val in row:
            print('%4d' % val, end=" ")
Manhattan(a)

Get Max Value of a Row in subset of Column respecting a condition

I have a dataframe that looks like this:
FakeDist  -5  -4  -3  -2  -1   0   1   2   3   4   5
       1  37  14  17  29  31  34  32  31  21  17  18
       2  12  13  12  16  30  33  37  32  32  15  42
       3  40  16  29  31  36  32  30  19  16  15  12
       4  12  14  12  28  28  30  29  27  16  18  33
       5  12  13  16  17  28  32  33  30  29  17  35
I want to add a column holding the name of the column that has the maximum value in each row.
I did that with:
df['MaxVal_Dist'] = df.idxmax(axis=1)
Which gives me this df:
FakeDist  -5  -4  ...  MaxVal_Dist
       1  37  14  ...           -5
       2  12  13  ...            5
       3  40  16  ...           -5
       4  12  14  ...            5
       5  12  13  ...            5
But my real goal is to add a condition: I want the maximum value only over the columns where 'FakeDist' is between -2 and 2, to get the following result:
FakeDist  -5  -4  ...  MaxVal_Dist
       1  37  14  ...            0
       2  12  13  ...            1
       3  40  16  ...           -1
       4  12  14  ...            0
       5  12  13  ...            1
I tried to work out how to do this with df.apply but couldn't make it work.
I have a workaround idea: store a subset of the columns (from -2 to 2) in a new dataframe, compute the max column there, and then add that result column to my initial dataframe. That seems like a very inelegant solution to me, and I am sure there is a much better way.
I would be really glad to learn the elegant way to do this from you!
You can use boolean indexing with loc to filter the columns in the range -2 to 2, then use idxmax along axis=1:
c = df.columns.astype(int)
df['MaxVal_Dist'] = df.loc[:, (c >= -2) & (c <= 2)].idxmax(1)
Result:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5 MaxVal_Dist
1 37 14 17 29 31 34 32 31 21 17 18 0
2 12 13 12 16 30 33 37 32 32 15 42 1
3 40 16 29 31 36 32 30 19 16 15 12 -1
4 12 14 12 28 28 30 29 27 16 18 33 0
5 12 13 16 17 28 32 33 30 29 17 35 1
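A self-contained version of the column filter, for anyone who wants to run it. It assumes FakeDist is the index and that the column labels are already integers (so no astype(int) is needed); the names values and in_range are illustrative. The mask is built before the new column is added, because afterwards the column labels are no longer all numeric.

```python
import pandas as pd

values = [
    [37, 14, 17, 29, 31, 34, 32, 31, 21, 17, 18],
    [12, 13, 12, 16, 30, 33, 37, 32, 32, 15, 42],
    [40, 16, 29, 31, 36, 32, 30, 19, 16, 15, 12],
    [12, 14, 12, 28, 28, 30, 29, 27, 16, 18, 33],
    [12, 13, 16, 17, 28, 32, 33, 30, 29, 17, 35],
]
df = pd.DataFrame(values,
                  index=pd.Index(range(1, 6), name="FakeDist"),
                  columns=range(-5, 6))

# Boolean mask over the column labels: True only for -2..2
in_range = (df.columns >= -2) & (df.columns <= 2)

# idxmax(axis=1) returns, per row, the label of the column with the largest value
df["MaxVal_Dist"] = df.loc[:, in_range].idxmax(axis=1)
print(df)
```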
You can try a list comprehension:
In [1159]: cols = [i for i in df.columns[1:] if -2 <= int(i) <= 2]
In [1161]: df['MaxVal_Dist'] = df[cols].idxmax(axis=1)
In [1162]: df
Out[1162]:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5 MaxVal_Dist
0 1 37 14 17 29 31 34 32 31 21 17 18 0
1 2 12 13 12 16 30 33 37 32 32 15 42 1
2 3 40 16 29 31 36 32 30 19 16 15 12 -1
3 4 12 14 12 28 28 30 29 27 16 18 33 0
4 5 12 13 16 17 28 32 33 30 29 17 35 1

How to add values by column into a Dataframe

I have a Dataframe with three columns: store, hour, count. The problem I'm facing is that some hours are missing for some stores, and I want their count to be 0.
This is what the dataframe looks like:
# store_id hour count
# 0 13 0 56
# 1 13 1 78
# 2 13 2 53
# 3 23 13 14
# 4 23 14 13
As you can see, the store with id 13 doesn't have values for hours 3-23; similarly, store 23 doesn't have values for many other hours.
I tried to solve this by creating a temporary dataframe with two columns, id and count, and performing a right outer join, but it didn't work.
If that is a typo and there are no duplicate hours within each store, a solution is to reindex with MultiIndex.from_product:
df = df.set_index(['store_id','hour'])
mux = pd.MultiIndex.from_product([df.index.levels[0], range(23)], names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
store_id hour count
0 13 0 56
1 13 1 78
2 13 2 53
3 13 3 0
4 13 4 0
5 13 5 0
6 13 6 0
7 13 7 0
8 13 8 0
9 13 9 0
10 13 10 0
11 13 11 0
12 13 12 0
13 13 13 0
14 13 14 0
15 13 15 0
16 13 16 0
17 13 17 0
18 13 18 0
19 13 19 0
20 13 20 0
21 13 21 0
22 13 22 0
23 23 0 0
24 23 1 0
25 23 2 0
26 23 3 0
27 23 4 0
28 23 5 0
29 23 6 0
30 23 7 0
31 23 8 0
32 23 9 0
33 23 10 0
34 23 11 0
35 23 12 0
36 23 13 14
37 23 14 13
38 23 15 0
39 23 16 0
40 23 17 0
41 23 18 0
42 23 19 0
43 23 20 0
44 23 21 0
45 23 22 0
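Since the question mentions hours up to 23, here is a sketch of the same reindex idea over a full 24-hour day, using range(24) (an assumption about the intended range). The sample data is copied from the question; the names stores, full_index and filled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"store_id": [13, 13, 13, 23, 23],
                   "hour":     [0, 1, 2, 13, 14],
                   "count":    [56, 78, 53, 14, 13]})

# Every (store, hour) combination for hours 0..23
stores = df["store_id"].unique()
full_index = pd.MultiIndex.from_product([stores, range(24)],
                                        names=["store_id", "hour"])

# Existing rows keep their count; missing combinations are filled with 0
filled = (df.set_index(["store_id", "hour"])
            .reindex(full_index, fill_value=0)
            .reset_index())
print(filled)
```

As in the answer above, this relies on each (store_id, hour) pair appearing at most once, because reindex cannot be used on an index with duplicate entries.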
Try this:
import pandas as pd

all_hours = set(range(24))
for sid in set(df['store_id']):
    # hours for which this store has no row yet
    misshours = list(all_hours - set(df['hour'][df['store_id'] == sid]))
    nmiss = len(misshours)
    # append one zero-count row per missing hour
    df = pd.concat([df, pd.DataFrame({'store_id': nmiss * [sid], 'hour': misshours, 'count': nmiss * [0]})])

Pandas DataFrame RangeIndex

I have created a Pandas DataFrame. I need to create a RangeIndex for the DataFrame that corresponds to the frame -
RangeIndex(start=0, stop=x, step=y) - where x and y relate to my DataFrame.
I've not seen an example of how to do this - is there a method or syntax specific to this?
thanks
It seems you need the RangeIndex constructor:
df = pd.DataFrame({'A' : range(1, 21)})
print (df)
A
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
print (df.index)
RangeIndex(start=0, stop=20, step=1)
df.index = pd.RangeIndex(start=0, stop=99, step=5)
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20
print (df.index)
RangeIndex(start=0, stop=99, step=5)
More dynamic solution:
step = 10
df.index = pd.RangeIndex(start=0, stop=len(df.index) * step - 1, step=step)
print (df)
A
0 1
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
140 15
150 16
160 17
170 18
180 19
190 20
print (df.index)
RangeIndex(start=0, stop=199, step=10)
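A small check of the dynamic variant, as a sketch. Because stop is exclusive, any stop between 191 and 200 yields the same 20 labels here; this example uses len(df) * step, which is an alternative choice to the len(df.index) * step - 1 used above.

```python
import pandas as pd

df = pd.DataFrame({"A": range(1, 21)})

step = 10
# stop is exclusive, so len(df) * step produces exactly len(df) labels
df.index = pd.RangeIndex(start=0, stop=len(df) * step, step=step)

print(df.index)
print(len(df.index) == len(df))
```

If the index length does not match the number of rows, the assignment raises a ValueError, so deriving stop from len(df) avoids hard-coding it.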
EDIT:
As @ZakS pointed out in the comments, it is better to use only the DataFrame constructor:
df = pd.DataFrame({'A' : range(1, 21)}, index=pd.RangeIndex(start=0, stop=99, step=5))
print (df)
A
0 1
5 2
10 3
15 4
20 5
25 6
30 7
35 8
40 9
45 10
50 11
55 12
60 13
65 14
70 15
75 16
80 17
85 18
90 19
95 20

grouping by id and a condition

I have a dataframe df
df=DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
              'min':[10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
              'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
that looks like
id min day
0 a 10 15
1 a 17 15
2 a 21 15
3 a 30 15
4 a 50 15
5 a 57 17
6 a 58 17
7 b 15 41
8 b 17 41
9 b 19 41
10 b 19 41
11 b 19 41
12 b 19 41
13 b 19 41
14 b 25 57
15 b 26 57
16 b 26 57
I want a new column that categorizes the data based on the id and the relationship between consecutive rows: if the difference in min between consecutive rows is less than 8 and the day value is the same, the rows should be assigned to the same group. My output would look like this:
id min day category
0 a 10 15 1
1 a 17 15 1
2 a 21 15 1
3 a 30 15 2
4 a 50 15 3
5 a 57 17 4
6 a 58 17 4
7 b 15 41 5
8 b 17 41 5
9 b 19 41 5
10 b 19 41 5
11 b 19 41 5
12 b 19 41 5
13 b 19 41 5
14 b 25 57 6
15 b 26 57 6
16 b 26 57 6
hope this helps. let me know your views.
All the best.
import pandas as pd

df = pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
                   'min':[10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
                   'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
# initialize the category counter to 1
cat = 1
# the first row always gets category 1
new_series = [cat]
# start the loop at 1, not 0, because row 0 has no previous row to compare with
for i in range(1, len(df)):
    if df.iloc[i]['day'] == df.iloc[i-1]['day']:
        # same day: start a new category only when the gap in 'min' is 8 or more
        if df.iloc[i]['min'] - df.iloc[i-1]['min'] >= 8:
            cat += 1
    else:
        # a different day always starts a new category
        cat += 1
    new_series.append(cat)
df['category'] = new_series
print(df)
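The same grouping can also be sketched without an explicit loop, by flagging every row that starts a new group and taking a cumulative sum of the flags. This version uses the min values from the table printed in the question and additionally starts a new group when the id changes, which is an assumption based on the question's wording; prev and starts_group are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'id': ['a'] * 7 + ['b'] * 10,
                   'min': [10, 17, 21, 30, 50, 57, 58, 15, 17, 19, 19, 19, 19, 19, 25, 26, 26],
                   'day': [15, 15, 15, 15, 15, 17, 17, 41, 41, 41, 41, 41, 41, 41, 57, 57, 57]})

prev = df.shift()  # each row paired with the row before it (first row gets NaN)

# A row starts a new group if the id changed, the day changed,
# or 'min' grew by 8 or more compared with the previous row.
starts_group = (
    (df['id'] != prev['id'])
    | (df['day'] != prev['day'])
    | (df['min'] - prev['min'] >= 8)
)

# Cumulative sum of the flags numbers the groups 1, 2, 3, ...
df['category'] = starts_group.cumsum()
print(df)
```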
