Irregular binning with regard to the sum of a column - python

I would like to bin a dataframe in pandas based on the sum of another column.
I have the following dataframe:
time  variable  frequency
   2         7          7
   3        12          2
   4        13          3
   6        15          4
   6        18          4
   6         3          1
  10        21          2
  11         4          5
  13         6          5
  15        17          6
  17         5          4
I would like to bin consecutive rows so that each group has a total frequency of at least 10, and then output the average time, the total variable, and the total frequency of each group:
avg time  total variable  total frequency
       3              32               12
       7              57               11
      12              10               10
      16              22               10
Any help would be greatly appreciated

A little brute force would get you a long way.
data = ((2, 7, 7),
        (3, 12, 2),
        (4, 13, 3),
        (6, 15, 4),
        (6, 18, 4),
        (6, 3, 1),
        (10, 21, 2),
        (11, 4, 5),
        (13, 6, 5),
        (15, 17, 6),
        (17, 5, 4))
# Split the (time, variable, frequency) rows into one list per column.
freq = [row[2] for row in data]
variable = [row[1] for row in data]
time = [row[0] for row in data]
freqcounter = 0
timecounter = 0
variablecounter = 0
counter = 0
freqlist = []
timelist = []
variablelist = []
for k in range(len(data)):
    freqcounter += freq[k]
    timecounter += time[k]
    variablecounter += variable[k]
    counter += 1
    # Close the current bin once its total frequency reaches 10.
    if freqcounter >= 10:
        freqlist.append(freqcounter)
        timelist.append(timecounter/counter)
        variablelist.append(variablecounter)
        freqcounter = 0
        timecounter = 0
        variablecounter = 0
        counter = 0
# Rows left over after the last complete bin are not reported.
print(timelist)
print(variablelist)
print(freqlist)
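Since the question asks for pandas, the same idea can be sketched with a DataFrame: label each row with a bin number using the running-total rule from the loop above, then aggregate with groupby. The helper name assign_bins and the output column names are made up for this illustration, and unlike the loop above a trailing incomplete bin would still appear in the result.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [2, 3, 4, 6, 6, 6, 10, 11, 13, 15, 17],
    "variable": [7, 12, 13, 15, 18, 3, 21, 4, 6, 17, 5],
    "frequency": [7, 2, 3, 4, 4, 1, 2, 5, 5, 6, 4],
})


def assign_bins(freqs, minimum=10):
    """Hypothetical helper: start a new bin once the running total reaches `minimum`."""
    labels = []
    running = 0
    bin_id = 0
    for f in freqs:
        labels.append(bin_id)
        running += f
        if running >= minimum:
            bin_id += 1
            running = 0
    return labels


df["bin"] = assign_bins(df["frequency"])

# Named aggregation: one output column per (source column, function) pair.
result = df.groupby("bin").agg(
    avg_time=("time", "mean"),
    total_variable=("variable", "sum"),
    total_frequency=("frequency", "sum"),
)
print(result)
```

On the sample data this reproduces the four groups from the question (average times 3, 7, 12 and 16).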

Related

Groupby and lists category

I have the following DataFrame:
import pandas as pd
data = {"hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
"values": [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)
I have been trying to add an extra column to df, built by grouping on values and the following list:
[2, 4, 6, 8, 10, 16, 18, 21, 23]
This list gives the hours that delimit the groups. In the new column category, rows with hours between 2 and 4 should get 1 where values is 1 and 0 elsewhere, rows with hours between 6 and 8 should get 2 where values is 1 and 0 elsewhere, and so on.
I tried the following:
df.groupby(["values", "hours"])
but I could not get any further with it.
The expected result looks like:
Updated to answer the question. You would have to write an individual query for each range, as below. This should work for the specific ranges:
df['category'] = 0
df.loc[(df['hours'] >= 2) & (df['hours'] <= 4), 'category'] = df['values']
df.loc[(df['hours'] >= 6) & (df['hours'] <= 8), 'category'] = df['values'] * 2
df.loc[df['hours'] == 10, 'category'] = df['values'] * 3
df.loc[(df['hours'] >= 16) & (df['hours'] <= 18), 'category'] = df['values'] * 4
df.loc[(df['hours'] >= 21) & (df['hours'] <= 23), 'category'] = df['values'] * 5
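The individual queries above can also be generated in a loop. This is only a sketch: the ranges list is my reading of the boundaries used in the queries above (with 10 treated as the one-hour range 10 to 10), and it relies on Series.between being inclusive on both ends by default.

```python
import pandas as pd

df = pd.DataFrame({
    "hours": list(range(1, 24)),
    "values": [0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0,
               0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1],
})

# Assumed (start, end) pairs, taken from the hard-coded queries above.
ranges = [(2, 4), (6, 8), (10, 10), (16, 18), (21, 23)]

df["category"] = 0
for label, (start, end) in enumerate(ranges, start=1):
    in_range = df["hours"].between(start, end)  # inclusive on both ends
    # Only rows whose value is 1 get the range's label; the rest stay 0.
    df.loc[in_range & df["values"].eq(1), "category"] = label

print(df)
```

Adding or changing a range then only means editing the ranges list.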
Something in your question is unclear, so I will assume what Epsi95 suggested in the comments. You can try something like this.
It works when the list of boundaries (called bins here) has an even number of entries, so that it can be split into (start, end) pairs. You can modify it for your case.
df['category']=0
x = list(zip(bins[::2], bins[1::2]))
rng = { range(i[0], i[1]+1):idx+1 for idx,i in enumerate(x)}
df.loc[df['values'].eq(1), 'category'] = df.loc[df['values'].eq(1), 'hours'].apply(lambda x: next((v for k, v in rng.items() if x in k), 0))
Edit:
df['category']=0
bins = [(2, 4), (6, 8), (10), (16, 18), (21, 23)]
rng = {}
for idx, i in enumerate(bins, start=1):
    if not isinstance(i, int):
        rng[range(i[0], i[1]+1)] = idx
    else:
        rng[i] = idx
def func(val):
    for k, v in rng.items():
        if isinstance(k, int):
            if val == k:
                return v
        else:
            if val in k:
                return v
    return 0  # hour not covered by any bin
df.loc[df['values'].eq(1), 'category'] = df.loc[df['values'].eq(1), 'hours'].apply(func)
df:
hours values category
0 1 0 0
1 2 1 1
2 3 1 1
3 4 1 1
4 5 0 0
5 6 1 2
6 7 0 0
7 8 1 2
8 9 0 0
9 10 1 3
10 11 0 0
11 12 0 0
12 13 0 0
13 14 0 0
14 15 0 0
15 16 1 4
16 17 1 4
17 18 1 4
18 19 0 0
19 20 0 0
20 21 1 5
21 22 0 0
22 23 1 5

Getting all row combinations from a pandas dataframe based on certain column conditions?

I have a Pandas Dataframe that stores a food item on each row in the following format -
Id Calories Protein IsBreakfast IsLunch IsDinner
1 300 6 0 1 0
2 400 12 1 1 0
.
.
.
100 700 25 0 1 1
I want to print all three-row combinations that meet the following conditions -
Each combination should contain at least one breakfast, one lunch, and one dinner item.
The sum of calories should be within a certain range (say minCal < sum of calories in the three rows < maxCal).
A similar condition applies to protein.
Right now, I first iterate over all breakfast items, then over all lunch items, and then over all dinner items. After selecting a combination, I add up the relevant columns and check whether the sums are within the desired range.
You can use the approach described in this answer to generate a new DataFrame containing all the combinations of three rows from your original data:
from itertools import combinations
import pandas as pd
# Using skbrhmn's df
df = pd.DataFrame({"Calories": [100, 200, 300, 400, 500],
                   "Protein": [10, 20, 30, 40, 50],
                   "IsBreakfast": [1, 1, 0, 0, 0],
                   "IsLunch": [1, 0, 0, 0, 1],
                   "IsDinner": [1, 1, 1, 0, 1]})
comb_rows = list(combinations(df.index, 3))
comb_rows
Output:
[(0, 1, 2),
(0, 1, 3),
(0, 1, 4),
(0, 2, 3),
(0, 2, 4),
(0, 3, 4),
(1, 2, 3),
(1, 2, 4),
(1, 3, 4),
(2, 3, 4)]
Then create a new DataFrame containing the sum of all numeric fields in your original frame, over all the possible combinations of three rows:
combinations = pd.DataFrame([df.loc[c,:].sum() for c in comb_rows], index=comb_rows)
print(combinations)
Calories Protein IsBreakfast IsLunch IsDinner
(0, 1, 2) 600 60 2 1 3
(0, 1, 3) 700 70 2 1 2
(0, 1, 4) 800 80 2 2 3
(0, 2, 3) 800 80 1 1 2
(0, 2, 4) 900 90 1 2 3
(0, 3, 4) 1000 100 1 2 2
(1, 2, 3) 900 90 1 0 2
(1, 2, 4) 1000 100 1 1 3
(1, 3, 4) 1100 110 1 1 2
(2, 3, 4) 1200 120 0 1 2
Finally you can apply any filters you need:
filtered = combinations[
    (combinations.IsBreakfast>0) &
    (combinations.IsLunch>0) &
    (combinations.IsDinner>0) &
    (combinations.Calories>600) &
    (combinations.Calories<1000) &
    (combinations.Protein>=80) &
    (combinations.Protein<120)
]
print(filtered)
Calories Protein IsBreakfast IsLunch IsDinner
(0, 1, 4) 800 80 2 2 3
(0, 2, 3) 800 80 1 1 2
(0, 2, 4) 900 90 1 2 3
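With around 100 rows there are 161,700 three-row combinations, so building one Series per combination with df.loc can get slow. A possible variation, sketched here with NumPy fancy indexing, computes all the sums in one step. The names idx, combo_sums and selected are only for this illustration, and the thresholds are the ones used in the filter above.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({"Calories": [100, 200, 300, 400, 500],
                   "Protein": [10, 20, 30, 40, 50],
                   "IsBreakfast": [1, 1, 0, 0, 0],
                   "IsLunch": [1, 0, 0, 0, 1],
                   "IsDinner": [1, 1, 1, 0, 1]})

# Every 3-row combination as row positions, shape (n_combos, 3).
idx = np.array(list(combinations(range(len(df)), 3)))

# df.to_numpy()[idx] has shape (n_combos, 3, n_columns); summing over
# axis 1 gives the column totals of each combination.
sums = df.to_numpy()[idx].sum(axis=1)
combo_sums = pd.DataFrame(sums, columns=df.columns)

mask = (
    (combo_sums["IsBreakfast"] > 0)
    & (combo_sums["IsLunch"] > 0)
    & (combo_sums["IsDinner"] > 0)
    & (combo_sums["Calories"] > 600)
    & (combo_sums["Calories"] < 1000)
    & (combo_sums["Protein"] >= 80)
    & (combo_sums["Protein"] < 120)
).to_numpy()

# Row positions of the combinations that pass every filter.
selected = [tuple(int(i) for i in row) for row in idx[mask]]
print(selected)  # [(0, 1, 4), (0, 2, 3), (0, 2, 4)]
```

This still enumerates every combination, so it only speeds up the summing step and does not reduce the number of candidates.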
You can add combinations of filters to a dataframe using the | and & operators.
Creating a dummy dataframe for example:
df1 = pd.DataFrame({"Calories": [100, 200, 300, 400, 500],
                    "Protein": [10, 20, 30, 40, 50],
                    "IsBreakfast": [1, 1, 0, 0, 0],
                    "IsLunch": [1, 0, 0, 0, 1],
                    "IsDinner": [1, 1, 1, 0, 1]})
print(df1)
Output:
Calories Protein IsBreakfast IsLunch IsDinner
0 100 10 1 1 1
1 200 20 1 0 1
2 300 30 0 0 1
3 400 40 0 0 0
4 500 50 0 1 1
Now add all the conditions:
min_cal = 100
max_cal = 600
min_prot = 10
max_prot = 40
df_filtered = df1[
    ((df1['IsBreakfast']==1) | (df1['IsLunch']==1) | (df1['IsDinner']==1)) &
    ((df1['Calories'] > min_cal) & (df1['Calories'] < max_cal)) &
    ((df1['Protein'] > min_prot) & (df1['Protein'] < max_prot))
]
print(df_filtered)
Output:
Calories Protein IsBreakfast IsLunch IsDinner
1 200 20 1 0 1
2 300 30 0 0 1

Count the frequency that a combination occurs in a Dataframe column - Apriori algorithm

I am having trouble counting the frequency of each combination correctly.
This is my code:
import pandas as pd
import itertools
items = [1,20,1,50]  # renamed from "list" to avoid shadowing the built-in
combinations = []
for i in itertools.combinations(items, 2):
    combinations.append(i)
data = pd.DataFrame({'products':combinations})
data['frequency'] = data.groupby('products')['products'].transform('count')
print(data)
The output is:
products frequency
0 (1, 20) 1
1 (1, 1) 1
2 (1, 50) 2
3 (20, 1) 1
4 (20, 50) 1
5 (1, 50) 2
The problem is (1, 20) and (20, 1): each gets a frequency of 1, but they are the same combination, so the frequency should be 2. Is there a way to get the correct result?
You can group by a modified version of the column, built with apply and a lambda that sorts each tuple:
import pandas as pd
import itertools
items = [1,20,1,50]  # renamed from "list" to avoid shadowing the built-in
combinations = []
for i in itertools.combinations(items, 2):
    combinations.append(i)
data = pd.DataFrame({'products':combinations})
# Sorting each tuple makes (1, 20) and (20, 1) fall into the same group.
data['frequency'] = data.groupby(data['products'].apply(
    lambda i: tuple(sorted(i))))['products'].transform('count')
print(data)
The output will be
products frequency
0 (1, 20) 2
1 (1, 1) 1
2 (1, 50) 2
3 (20, 1) 2
4 (20, 50) 1
5 (1, 50) 2
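The same order-insensitive count can be sketched without pandas, using collections.Counter from the standard library. The names products, pairs, counts and frequency are just for this example.

```python
from collections import Counter
from itertools import combinations

products = [1, 20, 1, 50]
pairs = list(combinations(products, 2))

# Sort each pair so that (20, 1) is counted together with (1, 20).
counts = Counter(tuple(sorted(pair)) for pair in pairs)

# Frequency of each original pair, in the same order as `pairs`.
frequency = [counts[tuple(sorted(pair))] for pair in pairs]

for pair, freq in zip(pairs, frequency):
    print(pair, freq)
```

This gives the same frequencies as the groupby version above: 2, 1, 2, 2, 1, 2.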

while loop using python list

my_list=[1,2,3,4,5]
i = 10
while i < 10:
    print i ,my_list
    i = i +1
My desired output:
1,1
2,2
3,3
4,4
5,5
6,1
7,2
8,3
9,4
10,5
How can I achieve this?
my_list=[1,2,3,4,5]
for index, item in enumerate(my_list*2, start=1):
    print(index, item)
Your task is what itertools.cycle is built for (from Python's standard library):
In [5]: from itertools import cycle
In [6]: for i, j in zip(xrange(1, 11), cycle(my_list)):
   ...:     print i, j
   ...:
1 1
2 2
3 3
4 4
5 5
6 1
7 2
8 3
9 4
10 5
In [7]: for i, j in zip(xrange(12), cycle(my_list)):
   ...:     print i, j
   ...:
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
9 5
10 1
11 2
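The session above is Python 2 (xrange and the print statement). A Python 3 sketch of the same itertools.cycle approach, printing in the comma-separated format from the question, might look like this (the name pairs is just for illustration):

```python
from itertools import cycle

my_list = [1, 2, 3, 4, 5]

# zip stops when range(1, 11) is exhausted, so cycle() never runs forever.
pairs = list(zip(range(1, 11), cycle(my_list)))

for i, item in pairs:
    print(f"{i},{item}")
```

The first line printed is 1,1 and the last is 10,5.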
for x in range(10):
    print(x+1, my_list[x % len(my_list)])
This code is unchecked and you may need to modify it a bit.
You can try this simpler way:
my_list = [1,2,3,4,5]
newList = enumerate(my_list*2)
for num in newList:
    print(num)
Output:
(0, 1)
(1, 2)
(2, 3)
(3, 4)
(4, 5)
(5, 1)
(6, 2)
(7, 3)
(8, 4)
(9, 5)

Convert pandas cut operation to a regular string

I get the following output from a pandas cut operation:
0 (0, 20]
1 (0, 20]
2 (0, 20]
3 (0, 20]
4 (0, 20]
5 (0, 20]
6 (0, 20]
7 (0, 20]
8 (0, 20]
9 (0, 20]
How can I convert the (0, 20] to 0 - 20?
I am doing this:
.str.replace('(', '').str.replace(']', '').str.replace(',', ' -')
Any better approach?
Use the labels parameter of pd.cut:
pd.cut(df['some_col'], bins=[0,20,40,60], labels=['0-20', '20-40', '40-60'])
I don't know what your exact pd.cut command looks like, but the code above should give you a good idea of what to do.
Example usage:
df = pd.DataFrame({'some_col': range(5, 56, 5)})
df['cut'] = pd.cut(df['some_col'], bins=[0,20,40,60], labels=['0-20','20-40','40-60'])
Example output:
some_col cut
0 5 0-20
1 10 0-20
2 15 0-20
3 20 0-20
4 25 20-40
5 30 20-40
6 35 20-40
7 40 20-40
8 45 40-60
9 50 40-60
10 55 40-60
Assuming the output was assigned to a variable cut, convert it to strings:
cut.astype(str)
To remove the brackets (this leaves 0, 20 rather than 0 - 20):
cut.astype(str).str.strip('()[]')
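If the exact 0 - 20 format from the question is wanted, one option is to build the labels from the bin edges and pass them to pd.cut. This is a sketch that assumes the bin edges [0, 20, 40, 60] and the sample values from the example earlier in this thread. The names bins, labels and as_text are illustrative.

```python
import pandas as pd

bins = [0, 20, 40, 60]

# One "lower - upper" label per pair of neighbouring bin edges.
labels = [f"{lower} - {upper}" for lower, upper in zip(bins[:-1], bins[1:])]

values = pd.Series(range(5, 56, 5))
cut = pd.cut(values, bins=bins, labels=labels)

# The result is categorical; astype(str) turns it into plain strings.
as_text = cut.astype(str)
print(as_text.tolist())
```

Building the labels from bins keeps the two in sync if the edges change later.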
