I have the following DataFrame:
import pandas as pd
data = {"hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
"values": [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)
I have been trying to add an extra column to df by grouping the values according to the following list:
[2, 4, 6, 8, 10, 16, 18, 21, 23]
This list represents the hour boundaries at which the grouping should be conducted. E.g. the new category column should contain 1 where hours is between 2 and 4 and values is 1 (and 0 elsewhere), 2 where hours is between 6 and 8 and values is 1 (and 0 elsewhere), and so on.
I tried the following:
df.groupby(["values", "hours"])
but I could not get any further with it.
The expected result would contain a new category column like the one shown in the answer output below.
Updated to answer the question. You'd have to create individual queries (as below). This should work for the specific ranges:
df['category'] = 0
df.loc[(df['hours'] >= 2) & (df['hours'] <= 4), 'category'] = df['values']
df.loc[(df['hours'] >= 6) & (df['hours'] <= 8), 'category'] = df['values'] * 2
df.loc[df['hours'] == 10, 'category'] = df['values'] * 3
df.loc[(df['hours'] >= 16) & (df['hours'] <= 18), 'category'] = df['values'] * 4
df.loc[(df['hours'] >= 21) & (df['hours'] <= 23), 'category'] = df['values'] * 5
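For these specific ranges, a more compact variant (just a sketch, not part of the original answer) labels each hour window with np.select and then zeroes out the rows where values is 0:
import numpy as np
# label the five hour windows 1..5, everything else 0, then multiply by
# df['values'] so hours with values == 0 keep category 0
conditions = [df['hours'].between(2, 4),
              df['hours'].between(6, 8),
              df['hours'].eq(10),
              df['hours'].between(16, 18),
              df['hours'].between(21, 23)]
df['category'] = np.select(conditions, [1, 2, 3, 4, 5], default=0) * df['values']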
There is something unclear in your question, so I will assume what Epsi95 has commented. You can try something like this.
This will work when the list has an even number of elements; you can adapt it for your case as well.
bins = [2, 4, 6, 8, 10, 16, 18, 21, 23]   # the list from the question
df['category'] = 0
# pair consecutive boundaries into ranges (assumes an even number of boundaries)
x = list(zip(bins[::2], bins[1::2]))
rng = {range(i[0], i[1] + 1): idx + 1 for idx, i in enumerate(x)}
df.loc[df['values'].eq(1), 'category'] = df.loc[df['values'].eq(1), 'hours'].apply(
    lambda h: next((v for k, v in rng.items() if h in k), 0))
Edit:
df['category'] = 0
bins = [(2, 4), (6, 8), (10), (16, 18), (21, 23)]   # (10) is just the int 10, a single-hour bin
rng = {}
for idx, i in enumerate(bins, start=1):
    if not isinstance(i, int):
        rng[range(i[0], i[1] + 1)] = idx   # map the hour range to its category number
    else:
        rng[i] = idx                       # a single hour maps directly

def func(val):
    for k, v in rng.items():
        if isinstance(k, int):
            if val == k:
                return v
        else:
            if val in k:
                return v
    return 0   # hours outside every bin keep category 0

df.loc[df['values'].eq(1), 'category'] = df.loc[df['values'].eq(1), 'hours'].apply(func)
df:
hours values category
0 1 0 0
1 2 1 1
2 3 1 1
3 4 1 1
4 5 0 0
5 6 1 2
6 7 1 2
7 8 1 2
8 9 0 0
9 10 1 3
10 11 0 0
11 12 0 0
12 13 0 0
13 14 0 0
14 15 0 0
15 16 1 4
16 17 1 4
17 18 1 4
18 19 0 0
19 20 0 0
20 21 1 5
21 22 1 5
22 23 1 5
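Using the same bins as the edit above, another option (a sketch, assuming a reasonably recent pandas) is to encode them as an IntervalIndex and let pd.cut look up the category for each hour:
import pandas as pd
# bins closed on both ends; (10, 10) covers the single hour 10
iv = pd.IntervalIndex.from_tuples([(2, 4), (6, 8), (10, 10), (16, 18), (21, 23)], closed='both')
codes = pd.cut(df['hours'], iv).cat.codes     # -1 where the hour falls in no bin
df['category'] = (codes + 1) * df['values']   # bins become 1..5, everything else stays 0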
I want to add a new column and fill values based on condition.
df:
indicator, value, a, b
1, 20, 5, 3
0, 30, 6, 8
0, 70, 2, 2
1, 10, 3, 7
I want to add a new column (value_new) based on indicator. If indicator == 1, then value_new = a*b; otherwise value_new = value.
df:
indicator, value, a, b, value_new
1, 20, 5, 3, 15
0, 30, 6, 8, 30
0, 70, 2, 2, 70
1, 10, 3, 7, 21
I have tried the following:
value_new = []
for i in range(1, len(df)):
    if indicator[i] == 1:
        value_new.append(df['a'][i]*df['b'][i])
    else:
        value_new.append(df['value'][i])
df['value_new'] = value_new
Error: 'Length of values does not match length of index'
And I have also tried:
for i in range(1, len(df)):
    if indicator[i] == 1:
        df['value_new'][i] = df['a'][i]*df['b'][i]
    else:
        df['value_new'][i] = df['value'][i]
KeyError: 'value_new'
You can use np.where:
import numpy as np

df['value_new'] = np.where(df['indicator'], df['a']*df['b'], df['value'])
print(df)
Prints:
indicator value a b value_new
0 1 20 5 3 15
1 0 30 6 8 30
2 0 70 2 2 70
3 1 10 3 7 21
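If you prefer to stay within pandas, an equivalent sketch (not part of the original answer) uses Series.mask to swap in a*b only where the indicator is set:
df['value_new'] = df['value'].mask(df['indicator'].eq(1), df['a'] * df['b'])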
I have a Pandas Dataframe that stores a food item on each row in the following format -
Id Calories Protein IsBreakfast IsLunch IsDinner
1 300 6 0 1 0
2 400 12 1 1 0
.
.
.
100 700 25 0 1 1
I want to print all three-row combinations with the following conditions -
Each combination should contain at least one breakfast item, one lunch item, and one dinner item.
The sum of calories should lie within a certain range (say minCal < sum of calories over the three rows < maxCal).
A similar condition applies to protein.
Right now I am iterating over all breakfast items, then over all lunch items, then over all dinner items. After selecting a combination, I add up the relevant columns and check whether the values are within the desired range, roughly as in the sketch below.
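A rough sketch of that approach (minCal, maxCal, minProt and maxProt are placeholder bounds; the column names are those of the frame above):
breakfasts = df[df['IsBreakfast'] == 1]
lunches = df[df['IsLunch'] == 1]
dinners = df[df['IsDinner'] == 1]
for _, b in breakfasts.iterrows():
    for _, l in lunches.iterrows():
        for _, d in dinners.iterrows():
            ids = {b['Id'], l['Id'], d['Id']}
            if len(ids) < 3:
                continue                      # don't pick the same item twice
            cals = b['Calories'] + l['Calories'] + d['Calories']
            prot = b['Protein'] + l['Protein'] + d['Protein']
            if minCal < cals < maxCal and minProt < prot < maxProt:
                print(sorted(ids), cals, prot)
Note that this enumerates some combinations more than once when an item belongs to several meals.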
You can use the approach described in this answer to generate a new DataFrame containing all the combinations of three rows from your original data:
from itertools import combinations
import pandas as pd
# Using skbrhmn's df
df = pd.DataFrame({"Calories": [100, 200, 300, 400, 500],
"Protein": [10, 20, 30, 40, 50],
"IsBreakfast": [1, 1, 0, 0, 0],
"IsLunch": [1, 0, 0, 0, 1],
"IsDinner": [1, 1, 1, 0, 1]})
comb_rows = list(combinations(df.index, 3))
comb_rows
Output:
[(0, 1, 2),
(0, 1, 3),
(0, 1, 4),
(0, 2, 3),
(0, 2, 4),
(0, 3, 4),
(1, 2, 3),
(1, 2, 4),
(1, 3, 4),
(2, 3, 4)]
Then create a new DataFrame containing the sum of all numeric fields in your original frame, over all the possible combinations of three rows:
combinations = pd.DataFrame([df.loc[list(c), :].sum() for c in comb_rows], index=comb_rows)
print(combinations)
Calories Protein IsBreakfast IsLunch IsDinner
(0, 1, 2) 600 60 2 1 3
(0, 1, 3) 700 70 2 1 2
(0, 1, 4) 800 80 2 2 3
(0, 2, 3) 800 80 1 1 2
(0, 2, 4) 900 90 1 2 3
(0, 3, 4) 1000 100 1 2 2
(1, 2, 3) 900 90 1 0 2
(1, 2, 4) 1000 100 1 1 3
(1, 3, 4) 1100 110 1 1 2
(2, 3, 4) 1200 120 0 1 2
Finally you can apply any filters you need:
filtered = combinations[
(combinations.IsBreakfast>0) &
(combinations.IsLunch>0) &
(combinations.IsDinner>0) &
(combinations.Calories>600) &
(combinations.Calories<1000) &
(combinations.Protein>=80) &
(combinations.Protein<120)
]
print(filtered)
Calories Protein IsBreakfast IsLunch IsDinner
(0, 1, 4) 800 80 2 2 3
(0, 2, 3) 800 80 1 1 2
(0, 2, 4) 900 90 1 2 3
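If the original rows behind a qualifying combination are needed, the tuple index can be turned back into row labels (a small follow-up, not part of the original answer):
first = list(filtered.index[0])   # e.g. [0, 1, 4]
print(df.loc[first])              # the three original items in that combination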
You can combine multiple filters on a DataFrame using the | and & operators.
Creating a dummy dataframe for example:
df1 = pd.DataFrame({"Calories": [100, 200, 300, 400, 500],
"Protein": [10, 20, 30, 40, 50],
"IsBreakfast": [1, 1, 0, 0, 0],
"IsLunch": [1, 0, 0, 0, 1],
"IsDinner": [1, 1, 1, 0, 1]})
print(df1)
Output:
Calories Protein IsBreakfast IsLunch IsDinner
0 100 10 1 1 1
1 200 20 1 0 1
2 300 30 0 0 1
3 400 40 0 0 0
4 500 50 0 1 1
Now add all the conditions:
min_cal = 100
max_cal = 600
min_prot = 10
max_prot = 40
df_filtered = df1[
((df1['IsBreakfast']==1) | (df1['IsLunch']==1) | (df1['IsDinner']==1)) &
((df1['Calories'] > min_cal) & (df1['Calories'] < max_cal)) &
((df1['Protein'] > min_prot) & (df1['Protein'] < max_prot))
]
print(df_filtered)
Output:
Calories Protein IsBreakfast IsLunch IsDinner
1 200 20 1 0 1
2 300 30 0 0 1
Data frame has w (week) and y (year) columns.
d = {
'y': [11,11,13,15,15],
'w': [5, 4, 7, 7, 8],
'z': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(d)
In [61]: df
Out[61]:
w y z
0 5 11 1
1 4 11 2
2 7 13 3
3 7 15 4
4 8 15 5
Two questions:
1) How to get from this data frame min/max date as two numbers w and y in a list [w,y] ?
2) How to subset both columns and rows, so all w and y in the resulting data frame are constrained by conditions:
11 <= y <= 15
4 <= w <= 7
To get min/max pairs I need functions:
min_pair() --> [11,4]
max_pair() --> [15,8]
and these to get a data frame subset:
from_to(y1,w1,y2,w2)
from_to(11,4,15,7) -->
should return rf data frame like this:
r = {
'y': [11,13,15],
'w': [4, 7, 7 ],
'z': [2, 3, 4 ]
}
rf = pd.DataFrame(r)
In [62]: rf
Out[62]:
w y z
0 4 11 2
1 7 13 3
2 7 15 4
Are there any standard functions for this?
Update
For subsetting the following worked for me:
df[(df.y <= 15 ) & (df.y >= 11) & (df.w >= 4) & (df.w <= 7)]
a lot of typing though ...
Here are a couple of methods:
In [176]: df[['w', 'y']].min().tolist()
Out[176]: [4, 11]
In [177]: df[['w', 'y']].max().tolist()
Out[177]: [8, 15]
In [178]: df.query('11 <= y <= 15 and 4 <= w <= 7')
Out[178]:
w y z
0 5 11 1
1 4 11 2
2 7 13 3
3 7 15 4
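Wrapping these one-liners into the helper functions the question asks for could look roughly like this (a sketch; the frame is passed explicitly, and from_to applies independent bounds on y and w, exactly like the query above):
def min_pair(df):
    return [df['y'].min(), df['w'].min()]    # -> [11, 4]
def max_pair(df):
    return [df['y'].max(), df['w'].max()]    # -> [15, 8]
def from_to(df, y1, w1, y2, w2):
    return df.query('@y1 <= y <= @y2 and @w1 <= w <= @w2')
rf = from_to(df, 11, 4, 15, 7)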
I would like to bin a dataframe in pandas based on the sum of another column.
I have the following dataframe:
time variable frequency
2 7 7
3 12 2
4 13 3
6 15 4
6 18 4
6 3 1
10 21 2
11 4 5
13 6 5
15 17 6
17 5 4
I would like to bin the data so that each group contains a minimum total frequency of 10 and output the average time and the total variable and total frequency.
avg time total variable total frequency
3 32 12
7 57 11
12 10 10
16 22 10
Any help would be greatly appreciated
A little brute force would get you a long way.
data = ((2, 7, 7),
        (3, 12, 2),
        (4, 13, 3),
        (6, 15, 4),
        (6, 18, 4),
        (6, 3, 1),
        (10, 21, 2),
        (11, 4, 5),
        (13, 6, 5),
        (15, 17, 6),
        (17, 5, 4))
freq = [row[2] for row in data]
variable = [row[1] for row in data]
time = [row[0] for row in data]

freqcounter = 0
timecounter = 0
variablecounter = 0
counter = 0
freqlist = []
timelist = []
variablelist = []

for k in range(len(data)):
    freqcounter += freq[k]
    timecounter += time[k]
    variablecounter += variable[k]
    counter += 1
    if freqcounter >= 10:                        # close the group once it holds at least 10
        freqlist.append(freqcounter)             # total frequency of the group
        timelist.append(timecounter / counter)   # average time of the group
        variablelist.append(variablecounter)     # total variable of the group
        freqcounter = 0
        timecounter = 0
        variablecounter = 0
        counter = 0

print(timelist)
print(variablelist)
print(freqlist)
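A pandas-flavoured variant (a sketch assuming the same greedy rule, i.e. a group closes as soon as its running frequency reaches 10, and pandas >= 0.25 for named aggregation) assigns a group id first and then aggregates. Unlike the loop above, it would also emit a trailing group that never reaches 10:
import pandas as pd
df = pd.DataFrame(data, columns=['time', 'variable', 'frequency'])
group_ids, gid, running = [], 0, 0
for f in df['frequency']:
    group_ids.append(gid)
    running += f
    if running >= 10:          # close the current group
        gid += 1
        running = 0
df['group'] = group_ids
out = df.groupby('group').agg(avg_time=('time', 'mean'),
                              total_variable=('variable', 'sum'),
                              total_frequency=('frequency', 'sum'))
print(out)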
If I have a data frame df (indexed by integer)
BBG.KABN.S BBG.TKA.S BBG.CON.S BBG.ISAT.S
index
0 -0.004881 0.008011 0.007047 -0.000307
1 -0.004881 0.008011 0.007047 -0.000307
2 -0.005821 -0.016792 -0.016111 0.001028
3 0.000588 0.019169 -0.000307 -0.001832
4 0.007468 -0.011277 -0.003273 0.004355
and I want to iterate through each element individually (by row and column), I know I need to use .iloc(row, column), but do I need to create two for loops (one for rows and one for columns), and how would I do that?
I guess it would be something like:
for col in rollReturnRandomDf.keys():
    for row in rollReturnRandomDf.iterrows():
        item = df.iloc(col, row)
But I am unsure of the exact syntax.
Maybe try using df.values.ravel().
import pandas as pd
import numpy as np
# data
# =================
df = pd.DataFrame(np.arange(25).reshape(5,5), columns='A B C D E'.split())
Out[72]:
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
# np.ravel
# =================
df.values.ravel()
Out[74]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24])
for item in df.values.ravel():
    # do something with each item, e.g.
    print(item)
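If the row and column labels are needed alongside each value (the question asks about iterating by row and column), one alternative sketch is to stack the frame into a Series keyed by (row, column) pairs:
for (row, col), item in df.stack().items():
    print(row, col, item)   # index label, column label, value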