How to add values by column into a Dataframe - python

I have a DataFrame with three columns: store_id, hour, and count. The problem is that some hours are missing for some stores, and I want those hours to appear with a count of 0.
This is what the DataFrame looks like:
# store_id hour count
# 0 13 0 56
# 1 13 1 78
# 2 13 2 53
# 3 23 13 14
# 4 23 14 13
As you can see, store 13 has no values for hours 3-23, and store 23 is likewise missing many hours.
I tried to solve this by creating a temporary DataFrame with two columns, id and count, and performing a right outer join, but it didn't work.

If each (store_id, hour) pair appears at most once, one solution is to reindex with a MultiIndex built by MultiIndex.from_product:
df = df.set_index(['store_id','hour'])
# range(24) covers hours 0-23
mux = pd.MultiIndex.from_product([df.index.levels[0], range(24)], names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
store_id hour count
0 13 0 56
1 13 1 78
2 13 2 53
3 13 3 0
4 13 4 0
5 13 5 0
6 13 6 0
7 13 7 0
8 13 8 0
9 13 9 0
10 13 10 0
11 13 11 0
12 13 12 0
13 13 13 0
14 13 14 0
15 13 15 0
16 13 16 0
17 13 17 0
18 13 18 0
19 13 19 0
20 13 20 0
21 13 21 0
22 13 22 0
23 13 23 0
24 23 0 0
25 23 1 0
26 23 2 0
27 23 3 0
28 23 4 0
29 23 5 0
30 23 6 0
31 23 7 0
32 23 8 0
33 23 9 0
34 23 10 0
35 23 11 0
36 23 12 0
37 23 13 14
38 23 14 13
39 23 15 0
40 23 16 0
41 23 17 0
42 23 18 0
43 23 19 0
44 23 20 0
45 23 21 0
46 23 22 0
47 23 23 0

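Try this:
import pandas as pd
all_hours = set(range(24))
for sid in set(df['store_id']):
    # hours that this store has no row for
    misshours = sorted(all_hours - set(df['hour'][df['store_id'] == sid]))
    nmiss = len(misshours)
    df = pd.concat([df, pd.DataFrame({'store_id': nmiss * [sid], 'hour': misshours, 'count': nmiss * [0]})])
# restore store/hour order and a clean index
df = df.sort_values(['store_id', 'hour']).reset_index(drop=True)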
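Since the question mentions an outer join, the same gap filling can also be sketched as a merge against a complete grid of (store_id, hour) pairs. This is only a minimal sketch, not the method of either answer above: the names grid and full are illustrative, and it assumes hours run from 0 to 23 and that each (store_id, hour) pair appears at most once.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'store_id': [13, 13, 13, 23, 23],
                   'hour': [0, 1, 2, 13, 14],
                   'count': [56, 78, 53, 14, 13]})

# Every (store_id, hour) combination, as an ordinary two-column frame
grid = pd.MultiIndex.from_product(
    [df['store_id'].unique(), range(24)], names=['store_id', 'hour']
).to_frame(index=False)

# Left join keeps every grid row; pairs absent from df get NaN, then 0
full = grid.merge(df, on=['store_id', 'hour'], how='left')
full['count'] = full['count'].fillna(0).astype(int)
```

The left join is driven by the grid rather than by the original data, which is what makes the missing hours show up as rows at all.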

Related

combining specific row conditionally and add output to existing row in pandas

suppose I have following data frame :
data = {'age' :[10,11,12,11,11,10,11,13,13,13,14,14,15,15,15],
'num1':[10,11,12,13,14,15,16,17,18,19,20,21,22,23,24],
'num2':[20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]}
df = pd.DataFrame(data)
I want to sum the rows for ages 14 and 15 and keep the result as a single row with age 14. My expected output would look like this:
age num1 num2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
In the code below, I tried to group by age, but it does not work for me:
df1 =df.groupby(age[age >=14])['num1', 'num2'].apply(', '.join).reset_index(drop=True).to_frame()
limit_age = 14
new = df.query("age < @limit_age").copy()
new.loc[len(new)] = [limit_age,
                     *df.query("age >= @limit_age").drop(columns="age").sum()]
This first gets the "before 14" dataframe, then appends a new row to it in which age is 14 and the other values are the column sums of the "14 and after" dataframe, to get
>>> new
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
(new.index += 1 can be used for a 1-based index at the end.)
I would use a mask and concat:
m = df['age'].isin([14, 15])
out = pd.concat([df[~m],
                 df[m].agg({'age': 'min', 'num1': 'sum', 'num2': 'sum'})
                      .to_frame().T
                 ], ignore_index=True)
Output:
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
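If the cutoff age needs to vary, the two answers above can be folded into one small helper. This is a hedged sketch: the function name collapse_from_age is hypothetical, and it assumes every column other than age is numeric and should be summed.

```python
import pandas as pd

def collapse_from_age(df, limit_age):
    """Keep rows below limit_age and replace the rest with one summed row."""
    keep = df[df['age'] < limit_age]
    # Column sums of every non-age column for the rows being collapsed
    totals = df.loc[df['age'] >= limit_age].drop(columns='age').sum()
    summed = pd.DataFrame([{'age': limit_age, **totals.to_dict()}])
    return pd.concat([keep, summed], ignore_index=True)

data = {'age':  [10, 11, 12, 11, 11, 10, 11, 13, 13, 13, 14, 14, 15, 15, 15],
        'num1': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
        'num2': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]}
out = collapse_from_age(pd.DataFrame(data), 14)
```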

Get Max Value of a Row in subset of Column respecting a condition

I have a dataframe that looks like this:
FakeDist  -5  -4  -3  -2  -1   0   1   2   3   4   5
1         37  14  17  29  31  34  32  31  21  17  18
2         12  13  12  16  30  33  37  32  32  15  42
3         40  16  29  31  36  32  30  19  16  15  12
4         12  14  12  28  28  30  29  27  16  18  33
5         12  13  16  17  28  32  33  30  29  17  35
I want to add a column that will be the Column_Name of the Maximum Value per Row.
I did that with:
df['MaxVal_Dist'] = df.idxmax(axis=1)
Which gives me this df:
FakeDist  -5  -4  ...  MaxVal_Dist
1         37  14  ...  -5
2         12  13  ...  5
3         40  16  ...  -5
4         12  14  ...  5
5         12  13  ...  5
But my real goal is to add a condition: I want the column name of the maximum value only among the columns where 'FakeDist' is between -2 and 2, to get the following result:
FakeDist  -5  -4  ...  MaxVal_Dist
1         37  14  ...  0
2         12  13  ...  1
3         40  16  ...  -1
4         12  14  ...  0
5         12  13  ...  1
I tried to work out how to use df.apply for this but couldn't make it work.
One workaround would be to store the subset of columns (from -2 to 2) in a new dataframe, compute the new column there, and then add that column back to my initial dataframe, but that seems inelegant and I am sure there is a better way.
I would be really glad to learn the elegant way to do this!
You can use boolean indexing with loc to filter the columns in the range -2 to 2, then use idxmax along axis=1:
c = df.columns.astype(int)
df['MaxVal_Dist'] = df.loc[:, (c >= -2) & (c <= 2)].idxmax(1)
Result:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5 MaxVal_Dist
1 37 14 17 29 31 34 32 31 21 17 18 0
2 12 13 12 16 30 33 37 32 32 15 42 1
3 40 16 29 31 36 32 30 19 16 15 12 -1
4 12 14 12 28 28 30 29 27 16 18 33 0
5 12 13 16 17 28 32 33 30 29 17 35 1
You can also try a list comprehension:
In [1159]: cols = [i for i in df.columns[1:] if -2 <= int(i) <= 2]
In [1161]: df['MaxVal_Dist'] = df[cols].idxmax(axis=1)
In [1162]: df
Out[1162]:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5 MaxVal_Dist
0 1 37 14 17 29 31 34 32 31 21 17 18 0
1 2 12 13 12 16 30 33 37 32 32 15 42 1
2 3 40 16 29 31 36 32 30 19 16 15 12 -1
3 4 12 14 12 28 28 30 29 27 16 18 33 0
4 5 12 13 16 17 28 32 33 30 29 17 35 1
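When the column labels are real integers in sorted order, the column filter can also be written as a label slice with loc, which includes both endpoints. The sketch below rebuilds the sample frame under that assumption (integer labels, FakeDist as the index); the names window and MaxVal are illustrative and do not appear in the question.

```python
import pandas as pd

data = [[37, 14, 17, 29, 31, 34, 32, 31, 21, 17, 18],
        [12, 13, 12, 16, 30, 33, 37, 32, 32, 15, 42],
        [40, 16, 29, 31, 36, 32, 30, 19, 16, 15, 12],
        [12, 14, 12, 28, 28, 30, 29, 27, 16, 18, 33],
        [12, 13, 16, 17, 28, 32, 33, 30, 29, 17, 35]]
df = pd.DataFrame(data, columns=range(-5, 6),
                  index=pd.Index([1, 2, 3, 4, 5], name='FakeDist'))

# Label-based slice: keeps the columns labelled -2, -1, 0, 1, 2
window = df.loc[:, -2:2]

# Take both results from the window before adding new columns to df
df['MaxVal_Dist'] = window.idxmax(axis=1)   # label of the max column
df['MaxVal'] = window.max(axis=1)           # the max value itself
```

Computing window once before assigning any new column keeps the added columns out of the comparison.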

Pandas dataframe problem. Create column where a row cell gets the value of another row cell

I have this pandas dataframe, sorted by the "h" column. I want to add two new columns:
Every item in a zone gets a max boundary and a min boundary (the same for every item in the zone). The max boundary is the minimum "h" value of the previous zone, and the min boundary is the maximum "h" value of the next zone.
name h w set row zone
ZZON5 40 36 A 0 0
DWOPN 38 44 A 1 0
5SWYZ 37 22 B 2 0
TFQEP 32 55 B 3 0
OQ33H 26 41 A 4 1
FTJVQ 24 25 B 5 1
F1RK2 20 15 B 6 1
266LT 18 19 A 7 1
HSJ3X 16 24 A 8 2
L754O 12 86 B 9 2
LWHDX 11 68 A 10 2
ZKB2F 9 47 A 11 2
5KJ5L 7 72 B 12 3
CZ7ET 6 23 B 13 3
SDZ1B 2 10 A 14 3
5KWRU 1 59 B 15 3
What I hope for:
name    h   w set row zone maxB minB
ZZON5  40  36   A   0    0        26
DWOPN  38  44   A   1    0        26
5SWYZ  37  22   B   2    0        26
TFQEP  32  55   B   3    0        26
OQ33H  26  41   A   4    1   32   16
FTJVQ  24  25   B   5    1   32   16
F1RK2  20  15   B   6    1   32   16
266LT  18  19   A   7    1   32   16
HSJ3X  16  24   A   8    2   18    7
L754O  12  86   B   9    2   18    7
LWHDX  11  68   A  10    2   18    7
ZKB2F   9  47   A  11    2   18    7
5KJ5L   7  72   B  12    3    9
CZ7ET   6  23   B  13    3    9
SDZ1B   2  10   A  14    3    9
5KWRU   1  59   B  15    3    9
Any ideas?
First group by zone and find the minimum and maximum "h" of each zone:
import numpy as np
min_max_zone = df.groupby('zone').agg(min=('h', 'min'), max=('h', 'max'))
Now you can use apply, falling back to NaN when there is no previous or next zone:
df['maxB'] = df['zone'].apply(lambda x: min_max_zone.loc[x-1, 'min']
                              if x-1 in min_max_zone.index else np.nan)
df['minB'] = df['zone'].apply(lambda x: min_max_zone.loc[x+1, 'max']
                              if x+1 in min_max_zone.index else np.nan)
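The per-row lambda above can probably be avoided by shifting the per-zone statistics and mapping them back onto the zone column. This is a sketch of that alternative rather than the answer's own method, and it assumes the zones are consecutive integers, so that "previous" and "next" are simply the neighbouring rows of the grouped result. The name stats is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'h': [40, 38, 37, 32, 26, 24, 20, 18, 16, 12, 11, 9, 7, 6, 2, 1],
    'zone': [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4,
})

# One row per zone with the min and max of h
stats = df.groupby('zone')['h'].agg(['min', 'max'])

# shift(1) moves each zone's min down to the next zone (previous zone's min);
# shift(-1) moves each zone's max up to the previous zone (next zone's max)
df['maxB'] = df['zone'].map(stats['min'].shift(1))
df['minB'] = df['zone'].map(stats['max'].shift(-1))
```

The first zone gets NaN for maxB and the last zone gets NaN for minB, matching the blanks in the expected table.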

Split several columns by "space" pandas

I want to split every column of my data frame on spaces. I can do it for one column. How do I apply it to the whole data frame (maybe with a loop)?
df =
    0      1      2      4
11 22  12 22  13 22  14 22
15 16  17 18  33 44  22 55
19 20  21 22  66 55  33 66
23 24  25 26  22 44  66 44
I am splitting it like this:
df[0].str.split(' ', n=1, expand=True)
Output is:
0 1
11 22
15 16
19 20
23 24
You can stack and unstack:
df.stack().str.split(' ', expand=True).unstack()
Output:
0 1
0 1 2 4 0 1 2 4
0 11 12 13 14 22 22 22 22
1 15 17 33 22 16 18 44 55
2 19 21 66 33 20 22 55 66
3 23 25 22 66 24 26 44 44
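Since the question asks about a loop, here is a hedged sketch of a column-by-column version: split each column separately and concatenate the pieces. The names wide and by_part are illustrative, and the sample frame is rebuilt on the assumption that every cell is a single string of two space-separated tokens.

```python
import pandas as pd

df = pd.DataFrame({0: ['11 22', '15 16', '19 20', '23 24'],
                   1: ['12 22', '17 18', '21 22', '25 26'],
                   2: ['13 22', '33 44', '66 55', '22 44'],
                   4: ['14 22', '22 55', '33 66', '66 44']})

# Split each column on its own; the dict keys become the outer column level
wide = pd.concat(
    {col: df[col].str.split(' ', expand=True) for col in df.columns},
    axis=1,
)

# Optional: put the split position on the outer level, as stack/unstack does
by_part = wide.swaplevel(axis=1).sort_index(axis=1)
```

In wide the columns are grouped by original column, for example (0, 0), (0, 1), (1, 0), while by_part groups them by split position.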

grouping by id and a condition

I have a dataframe df
df=DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
              'min':[10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
              'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
that looks like
id min day
0 a 10 15
1 a 17 15
2 a 21 15
3 a 30 15
4 a 50 15
5 a 57 17
6 a 58 17
7 b 15 41
8 b 17 41
9 b 19 41
10 b 19 41
11 b 19 41
12 b 19 41
13 b 19 41
14 b 25 57
15 b 26 57
16 b 26 57
I want a new column that categorizes the rows based on the id and the relationship between consecutive rows: if the difference in min between consecutive rows is less than 8 and the day value is the same, I want to assign them to the same group. My output would look like this:
id min day category
0 a 10 15 1
1 a 17 15 1
2 a 21 15 1
3 a 30 15 2
4 a 50 15 3
5 a 57 17 4
6 a 58 17 4
7 b 15 41 5
8 b 17 41 5
9 b 19 41 5
10 b 19 41 5
11 b 19 41 5
12 b 19 41 5
13 b 19 41 5
14 b 25 57 6
15 b 26 57 6
16 b 26 57 6
hope this helps. let me know your views.
All the best.
import pandas as pd
df=pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
                 'min':[10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
                 'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
# initialize the category to 1 for counter increment
cat = 1
# for the first row the category will be 1
new_series = [cat]
# loop starts from 1 and not from 0 because row 0 has no previous row to compare with
for i in range(1, len(df)):
    if df.iloc[i]['day'] == df.iloc[i-1]['day']:
        # same day: start a new category only when the gap in 'min' is 8 or more
        if df.iloc[i]['min'] - df.iloc[i-1]['min'] >= 8:
            cat += 1
    else:
        # different day: always start a new category
        cat += 1
    new_series.append(cat)
df['category'] = new_series
print(df)
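The same grouping can likely be done without an explicit loop by flagging every row that starts a new group and taking a cumulative sum of the flags. This is a sketch of an alternative technique, not the answer's method. It uses the min values shown in the question's table, treats a change of id as a group boundary as well, and the name new_group is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a'] * 7 + ['b'] * 10,
    'min': [10, 17, 21, 30, 50, 57, 58, 15, 17, 19, 19, 19, 19, 19, 25, 26, 26],
    'day': [15, 15, 15, 15, 15, 17, 17, 41, 41, 41, 41, 41, 41, 41, 57, 57, 57],
})

# A row starts a new group if the id changes, the day changes,
# or 'min' jumps by 8 or more relative to the previous row
new_group = (
    (df['id'] != df['id'].shift())
    | (df['day'] != df['day'].shift())
    | (df['min'].diff() >= 8)
)

# The first row is always flagged, so the running count starts at 1
df['category'] = new_group.cumsum()
```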
