Looping through multiple columns in a table in Python

I'm trying to loop through a table that contains COVID-19 data. My table has 4 columns: month, day, location, and cases. The values of each column are stored in their own list, so each list has the same length (i.e. there is a month list, a day list, a location list, and a cases list). There are 12 months, with up to 31 days in a month. Cases are recorded for many locations around the world. I would like to figure out which day of the year had the highest total combined global cases, but I'm not sure how to structure my loops appropriately. An oversimplified sample version of the table represented by the lists is shown below.
In this small example, the result would be month 1, day 3 with 709 cases (257 + 452).
Month  Day  Location  Cases
1      1    CAN       124
1      1    USA       563
1      2    CAN       242
1      2    USA       156
1      3    CAN       257
1      3    USA       452
...    ...  ...       ...
12     31   ...       ...

I assume that you've put all the data in the same data frame, df.
import pandas

df = pandas.DataFrame()
df['Month'] = name_of_your_month_list
df['Day'] = name_of_your_day_list
df['Location'] = name_of_your_location_list
df['Cases'] = name_of_your_cases_list
df.Cases.max() gives you the biggest number of cases. I assume that there is only one year in the dataset, so df[df.Cases==df.Cases.max()].index gives you the index you are looking for.
For the day, just filter:
df[df.index==df[df.Cases==df.Cases.max()].index].Day
For the month:
df[df.index==df[df.Cases==df.Cases.max()].index].Month
For the number of cases:
df[df.index==df[df.Cases==df.Cases.max()].index].Cases
For the country:
df[df.index==df[df.Cases==df.Cases.max()].index].Location
Reading the comments, it is not clear whether you want the biggest number of cases for a single location or for a whole day. If it's per day, you'll have to aggregate first with a groupby('Day'), using it as groupby('Day').max().
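For completeness, here is a minimal sketch of that grouping idea applied to the sample data. Note it sums the cases of all locations per month/day (which is what the question asks for) rather than taking the single largest row:
import pandas as pd

# Build the frame from the sample lists shown in the question
df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1],
                   'Day': [1, 1, 2, 2, 3, 3],
                   'Location': ['CAN', 'USA', 'CAN', 'USA', 'CAN', 'USA'],
                   'Cases': [124, 563, 242, 156, 257, 452]})

# Total global cases per (month, day), then the day with the largest total
daily_totals = df.groupby(['Month', 'Day'])['Cases'].sum()
month, day = daily_totals.idxmax()
print(month, day, daily_totals.max())   # 1 3 709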

You can group your dataframe by month and day, then iterate through the groups to find the group in which the sum of cases across all locations is the maximum, as shown below:
import pandas as pd

df = pd.DataFrame({'Month': [1, 1, 1, 1, 1, 1], 'Day': [1, 1, 2, 2, 3, 3],
                   'Location': ['CAN', 'USA', 'CAN', 'USA', 'CAN', 'USA'],
                   'Cases': [124, 563, 242, 156, 257, 452]})

grouped = df.groupby(['Month', 'Day'])
max_sum = 0
max_day = None
for idx, group in grouped:
    if group['Cases'].sum() > max_sum:
        max_sum = group['Cases'].sum()
        max_day = group

# Every row of the winning group shares the same month/day, so take the first one
month = max_day['Month'].iloc[0]
day = max_day['Day'].iloc[0]
print(f'Maximum cases of {max_sum} occurred on {month}/{day}.')
# prints: Maximum cases of 709 occurred on 1/3.
If you don't want to use Pandas, this is how you do it:
months = [1, 1, 1, 1, 1, 1]
days = [1, 1, 2, 2, 3, 3]
locations = ['CAN', 'USA', 'CAN', 'USA', 'CAN', 'USA']
cases = [124, 563, 242, 156, 257, 452]

dic = {}
target_day = 0
count = 0
for i in range(len(days)):
    # Start a new running total whenever the day changes, otherwise keep adding
    if days[i] != target_day:
        target_day = days[i]
        count = cases[i]
    else:
        count += cases[i]
    dic[f'{months[i]}/{days[i]}'] = count

max_cases = max(dic.values())
worst_day = list(dic.keys())[list(dic.values()).index(max_cases)]
print(f'Maximum cases of {max_cases} occurred on {worst_day}.')
# Prints: Maximum cases of 709 occurred on 1/3.

You can find the max value in your cases list first, then use the index of that max value with the other three lists to obtain their values.
For example: caseList = [1,2,3,52,1,0]
The maximum is 52 and its index is 3, so you can get monthList[3], dayList[3] and locationList[3] respectively. That gives you the day, month and country with the most cases.
Check whether this will help in your scenario.

You may use this strategy to get the required result.
daylist, monthlist, location, Cases = [1, 2, 3, 4], [1, 1, 1, 1], ['CAN', 'USA', 'CAN', 'USA'], [124, 563, 242, 999]

max_index = Cases.index(max(Cases))   # index of the largest case count
print("Max Case:", Cases[max_index])
print("Location:", location[max_index])
print("Month:", monthlist[max_index])
print("Day:", daylist[max_index])

Related

Pandas - How can I iterate through a column to put respondents into appropriate bins?

I have a DataFrame called df3 with 2 columns - 'fan' and 'Household Income' as seen below. I'm trying to iterate through the 'Household Income' column and if the value of the column is '$0 - $24,999', add it to bin 'low_inc'. If the value of the column is '$25,000 - $49,999', add it to bin 'lowmid_inc', etc. But I'm getting an error saying 'int' object is not iterable.
df3 = df_hif.dropna(subset=['Household Income', 'fan'], how='any')

low_inc = []
lowmid_inc = []
mid_inc = []
midhigh_inc = []
high_inc = []

for inc in df3['Household Income']:
    if inc == '$0 - $24,999':
        low_inc += 1
    elif inc == '$25,000 - $49,999':
        lowmid_inc += 1
    elif inc == '$50,000 - $99,999':
        mid_inc += 1
    elif inc == '$100,000 - $149,999':
        midhigh_inc += 1
    else:
        high_inc += 1

#print(low_inc)
Here is a sample of 5 rows of the df used:
      Household Income   fan
774   $25,000 - $49,999  Yes
290   $50,000 - $99,999  No
795   $50,000 - $99,999  Yes
926   $150,000+          No
1017  $150,000+          Yes
The left column (774, 290, etc.) is the index, showing the respondent's ID. The 5 ranges of the 'Household Income' column are the ones listed in my if/else statement, but I'm receiving an error when I try to print out the bins.
For each respondent, I'm trying to add 1 to the buckets 'low_inc', 'lowmid_inc', etc., so that I end up with a count of the respondents whose household income is between 0-24,999, 25,000-49,999, and so on. How can I iterate through a column to count the respondents into the appropriate bins?
Iterating in Pandas is generally not preferable.
You can separate them into different dataframes:
low_inc = df3[df3['Household Income'] == '$0 - $24,999']
lowmid_inc = df3[df3['Household Income'] == '$25,000 - $49,999']
etc...
Then len(low_inc), for example, will give you the number of rows in each dataframe.
Alternatively, try groupby:
df3.groupby('Household Income').count()
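As a rough sketch of how those counts come out (the sample frame here is made up for illustration, not the asker's data):
import pandas as pd

# Illustrative data only; the real df3 comes from df_hif.dropna(...) above
df3 = pd.DataFrame({'Household Income': ['$0 - $24,999', '$25,000 - $49,999',
                                         '$25,000 - $49,999', '$150,000+'],
                    'fan': ['Yes', 'No', 'Yes', 'Yes']})

counts = df3.groupby('Household Income')['fan'].count()
print(counts)
# Household Income
# $0 - $24,999         1
# $150,000+            1
# $25,000 - $49,999    2
# Name: fan, dtype: int64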
I would simply use a histogram (this assumes the 'Household Income' values have been converted to numbers first):
df3 = df3['Household Income']
bins = int((max(df3) - min(df3)) / 25000)
out = df3.hist(bins=bins)
Finally, take the sum of the results in the related bins, e.g. 25,000-50,000 corresponds to one bin whereas 50,000-100,000 spans two bins.

How to iterate over column values for each group and track sum

I have 4 dataframes as given below
import pandas as pd

df_raw = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 16],
     'total_qty': [100, 1000, 80],
     'ques_date': ['13/11/2020', '10/1/2018', '11/11/2017']})

df_accu = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 16],
     'accu_qty': [10, 500, 10],
     'accu_date': ['13/08/2021', '02/11/2019', '17/12/2018']})

df_inv = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 18],
     'inv_qty': [5, 100, 15],
     'inv_date': ['16/02/2022', '22/11/2020', '19/10/2019']})

df_bkl = pd.DataFrame(
    {'stud_id': [101, 101, 101, 101],
     'prod_id': [12, 12, 12, 17],
     'bkl_qty': [15, 40, 2, 10],
     'bkl_date': ['16/01/2022', '22/10/2021', '09/10/2020', '25/06/2020']})
My objective is to find out the below
a) Get the date when the threshold exceeds 50%.
The threshold is given by the formula below:
threshold = (((df_inv['inv_qty'] + df_bkl['bkl_qty'] + df_accu['accu_qty']) / df_raw['total_qty']) * 100)
We have to add in this order: first inv_qty, then bkl_qty, and finally accu_qty. We do it this way in order to identify the correct date when the total exceeded 50% of total_qty. Additionally, this has to be computed for each stud_id and prod_id.
The problem is that df_bkl has multiple records for the same stud_id and prod_id, and that is by design; real data also looks like this. df_accu and df_inv, on the other hand, have only one row for each stud_id and prod_id.
In the above formula, for df_bkl['bkl_qty'] we have to use each value one by one to compute the sum.
For example, let's take stud_id = 101 and prod_id = 12.
His total_qty = 100, inv_qty = 5 and accu_qty = 10, but he has three bkl_qty values: 15, 40 and 2. So the threshold has to be computed in a fashion like below:
5 (inv_qty) + 15 (1st bkl_qty) + 40 (2nd bkl_qty) + 2 (3rd bkl_qty) + 10 (accu_qty)
With the above, we know his threshold exceeded 50% when his bkl_qty value was 40, since 5 + 15 + 40 = 60, which is greater than 50% of total_qty (100).
I was trying something like below
df_stage_1 = df_raw.merge(df_inv,on=['stud_id','prod_id'], how='left').fillna(0)
df_stage_2 = df_stage_1.merge(df_bkl,on=['stud_id','prod_id'])
df_stage_3 = df_stage_2.merge(df_accu,on=['stud_id','prod_id'])
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['bkl_qty'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
But this is incorrect, because it does not walk through bkl_qty value by value for each stud_id and prod_id.
In this post I have shown only sample data with one stud_id = 101, but in real time I have thousands of stud_id and prod_id values.
Therefore, any elegant and efficient approach would be useful. We have to apply this logic on million-record datasets.
I expect my output to be as shown below; whenever the running sum exceeds 50% of total_qty, we need to get the corresponding date.

stud_id  prod_id  total_qty  threshold  threshold_date
101      12       100        72         22/10/2021
It can be achieved using groupby and cumsum which does cumulative summation.
# add cumulative sum column to df_bkl
df_bkl['csum'] = df_bkl.groupby(['stud_id','prod_id'])['bkl_qty'].cumsum()
# use df_bkl['csum'] to compute threshold instead of bkl_qty
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['csum'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
# check if inv_qty already exceeds threshold
df_stage_3.loc[df_stage_3.inv_qty > df_stage_3.total_qty/2, 'bkl_date'] = df_stage_3['inv_date']
# next doing some filter and merge to arrive at the desired df
gt_thres = df_stage_3[df_stage_3['threshold'] > 50]  # 'threshold' is already a percentage, so compare to 50
df_f1 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].min().to_frame(name='threshold').reset_index()
df_f2 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].max().to_frame(name='threshold_max').reset_index()
df = pd.merge(df_f1, df_stage_3, on=['stud_id','prod_id','total_qty','threshold'], how='inner')
df2 = pd.merge(df,df_f2, on=['stud_id','prod_id','total_qty'], how='inner')
df2 = df2[['stud_id','prod_id','total_qty','threshold','bkl_date']].rename(columns={'threshold_max':'threshold', 'bkl_date':'threshold_date'})
print(df2)
provides the output as:
stud_id prod_id total_qty threshold threshold_date
0 101 12 100 72.0 22/10/2021
Does this work?
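To make the cumsum step concrete, this is what df_bkl looks like after adding 'csum' for the sample data (a quick illustration, not part of the original answer):
print(df_bkl)
#    stud_id  prod_id  bkl_qty    bkl_date  csum
# 0      101       12       15  16/01/2022    15
# 1      101       12       40  22/10/2021    55
# 2      101       12        2  09/10/2020    57
# 3      101       17       10  25/06/2020    10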

What is the best way to compute a rolling (lag and lead) difference in sales?

I'm looking to add a field or two to my data set that represent the difference in sales from the last week to the current week and from the current week to the next week.
My dataset is about 4.5 million rows, so I'm looking for an efficient way of doing this. Currently I'm getting into a lot of iteration and for loops, and I'm quite sure I'm going about this the wrong way. I'm trying to write code that will be reusable on other datasets, and there are situations where you might have nulls or no change in sales week to week (and therefore no record).
The dataset looks like the following:
Store  Item  WeekID  WeeklySales
1      1567  34      100.00
2      2765  34      86.00
3      1163  34      200.00
1      1567  35      160.00
...
I have each week as its own dictionary and then each store sales for that week in a dictionary within. So I can use the week as a key and then within the week I access the store's dictionary of item sales.
weekly_sales_dict = {}
for i in df['WeekID'].unique():
    store_items_dict = {}
    subset = df[df['WeekID'] == i]
    subset = subset.groupby(['Store', 'Item']).agg({'WeeklySales': 'sum'}).reset_index()
    for j in subset['Store'].unique():
        storeset = subset[subset['Store'] == j]
        store_items_dict.update({str(j): storeset})
    weekly_sales_dict.update({str(i): store_items_dict})
Then I iterate through each week in weekly_sales_dict and compare each store/item within it to the week behind it (I planned to do the same for the next week as well). The lag_list I create can be indexed by week, store, and item, so I was going to iterate through it and add the values to my df as a new lag column, but I feel I am way overthinking this.
count = 0
key_list = list(df['WeekID'].unique())
lag_list = []
for k, v in weekly_sales_dict.items():
    if count != 0 and count != len(df['WeekID'].unique()) - 1:
        prev_wk = weekly_sales_dict[str(key_list[(count - 1)])]
        current_wk = weekly_sales_dict[str(key_list[count])]
        for i in df['Store'].unique():
            prev_df = prev_wk[str(i)]
            current_df = current_wk[str(i)]
            for j in df['Item'].unique():
                print('in j')
                if j in list(current_df['Item'].unique()) and j in list(prev_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values - prev_df[prev_df['Item'] == int(j)]['WeeklySales'].values
                    df[df['Item'] == j][df['Store'] == i][df['WeekID'] == key_list[count]]['lag'] = item_lag[0]
                    lag_list.append((str(i), str(j), item_lag[0]))
                elif j in list(current_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values
                    lag_list.append((str(i), str(j), item_lag[0]))
                else:
                    pass
        count += 1
    else:
        count += 1
Using pd.diff() the problem was solved. I sorted all rows by week, then created a subset with a multi-index by grouping on store, item, and week. Finally I used pd.diff() with a period of 1 and ended up with the sales difference from the current week to the week prior.
df = df.sort_values(by='WeekID')
subset = df.groupby(['Store', 'Item', 'WeekID']).agg({'WeeklySales': 'sum'})
subset['lag'] = subset['WeeklySales'].diff(1)
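Note that a plain .diff(1) on the aggregated frame also computes differences across store/item boundaries. Here is a hedged sketch of a group-aware version that produces both the lag and the lead difference (illustrative data; column names follow the question, with one row per Store/Item/WeekID assumed):
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'Store': [1, 1, 1, 2, 2],
                   'Item': [1567, 1567, 1567, 2765, 2765],
                   'WeekID': [34, 35, 36, 34, 35],
                   'WeeklySales': [100.0, 160.0, 120.0, 86.0, 90.0]})

df = df.sort_values(['Store', 'Item', 'WeekID'])
grp = df.groupby(['Store', 'Item'])['WeeklySales']
df['lag_diff'] = grp.diff(1)     # current week minus previous week (NaN for the first week)
df['lead_diff'] = grp.diff(-1)   # current week minus next week (NaN for the last week)
print(df)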

New column based off certain input parameter to select what columns to use - Python

I have a pandas dataframe that includes multiple columns of monthly finance data. There is a period input that is specified by the person running the program; it's currently just saved as period within the code, as shown below.
#coded into python
period = ?? (user adds this in from input screen)
I need to create another column of data that uses the input period number to perform a calculation over the other columns.
So, in the above table I'd like to create a new column 'calculation' that depends on the period input. For example, if a period of 1 is used, calc1 would be performed (with the math actually done); period = 2 means calc2, period = 3 means calc3. I only need one column calculated depending on the period number, but I added three examples in the picture below to show how it would work.
I can do this in SQL using CASE WHEN, using the input period to decide which columns to sum:
select Account #,
'&Period' AS Period,
'&Year' AS YR,
case
When '&Period' = '1' then sum(d_cf+d_1)
when '&Period' = '2' then sum(d_cf+d_1+d_2)
when '&Period' = '3' then sum(d_cf+d_1+d_2+d_3)
I am unsure how to do this easily in Python (I'm a newer learner). Yes, I could create a column for every possible period (1-12) and then only select the one I need, but I'd like to learn to do it in a more efficient way.
Can you help, or point me in a better direction?
You could certainly do something like
df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
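A quick usage sketch of that one-liner (the frame here is made up for illustration; column names follow the question):
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'d_cf': [1, 2], 'd_1': [10, 20], 'd_2': [100, 200], 'd_3': [1000, 2000]})

period = 2  # would come from the user's input screen
df['calculation'] = df[['d_cf'] + [f'd_{i}' for i in range(1, period + 1)]].sum(axis=1)
print(df)
#    d_cf  d_1  d_2   d_3  calculation
# 0     1   10  100  1000          111
# 1     2   20  200  2000          222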
You can do this using a simple function in python:
def get_calculation(df, period=None):
    '''
    df = pandas data frame
    period = integer type
    '''
    if period == 1:
        return df.apply(lambda x: x['d_0'] + x['d_1'], axis=1)
    if period == 2:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'], axis=1)
    if period == 3:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'] + x['d_3'], axis=1)

new_df = get_calculation(df, period=1)
Setup:
df = pd.DataFrame({'d_0': list(range(1, 7)),
                   'd_1': list(range(10, 70, 10)),
                   'd_2': list(range(100, 700, 100)),
                   'd_3': list(range(1000, 7000, 1000))})
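For reference, calling the function on that Setup frame (a quick check, not part of the original answer) gives:
new_df = get_calculation(df, period=1)   # d_0 + d_1 for each row
print(new_df.tolist())
# [11, 22, 33, 44, 55, 66]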
Setup:
import pandas as pd

ddict = {
    'Year': ['2018', '2018', '2018', '2018', '2018'],
    'Account_Num': ['1111', '1122', '1133', '1144', '1155'],
    'd_cf': ['1', '2', '3', '4', '5'],
}
data = pd.DataFrame(ddict)
Create value calculator:
def get_calcs(period):
    # Convert the period to a string
    s = str(period)
    # Convert the period to an integer and add 1
    n = int(period) + 1
    # This repeats each digit of the period (period + 1) times
    return ''.join([i * n for i in s])
Main function copies data frame, iterates through period values, and sets calculated values to the correct spot index-wise for each relevant column:
def process_data(data_frame=data, period_column='d_cf'):
    # Copy data_frame argument
    df = data_frame.copy(deep=True)
    # Run through each value in our period column
    for i in df[period_column].values.tolist():
        # Create a temporary column
        new_column = 'd_{}'.format(i)
        # Pass the period into our calculator; capture the result
        calculated_value = get_calcs(i)
        # Create a new column based on our period number
        df[new_column] = ''
        # Use indexing to place the calculated value into our desired location
        df.loc[df[period_column] == i, new_column] = calculated_value
    # Return the result
    return df
Start:
Year Account_Num d_cf
0 2018 1111 1
1 2018 1122 2
2 2018 1133 3
3 2018 1144 4
4 2018 1155 5
Result:
process_data(data)
   Year Account_Num d_cf d_1  d_2   d_3    d_4     d_5
0  2018        1111    1  11
1  2018        1122    2       222
2  2018        1133    3             3333
3  2018        1144    4                   44444
4  2018        1155    5                           555555

Create a new column in a dataframe with increment number based on another column

Consider the below pandas DataFrame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({
    'day': [Timestamp('2017-03-27'),
            Timestamp('2017-03-27'),
            Timestamp('2017-04-01'),
            Timestamp('2017-04-03'),
            Timestamp('2017-04-06'),
            Timestamp('2017-04-07'),
            Timestamp('2017-04-11'),
            Timestamp('2017-05-01'),
            Timestamp('2017-05-01')],
    'act_id': ['916298883',
               '916806776',
               '923496071',
               '926539428',
               '930641527',
               '931935227',
               '937765185',
               '966163233',
               '966417205']
})
As you may see, there are 9 unique ids distributed over 7 days.
I am looking for a way to add two new columns.
The first column:
An increment number for each new day. For example, 1 for '2017-03-27' (the same number for the same day), 2 for '2017-04-01', 3 for '2017-04-03', etc.
The second column:
An increment number for each new act_id per day. For example, 1 for '916298883', 2 for '916806776' (which is linked to the same day, '2017-03-27'), 1 for '923496071', 1 for '926539428', etc.
The final table should look like this:
I have already tried to build the first column with apply and a function, but it doesn't work as it should.
# Create helper function to give an index number to a new column
counter = 1
def giveFlag(x):
    global counter
    index = counter
    counter += 1
    return index
And then:
# Create day flagger column
df_helper['day_no'] = df_helper['day'].apply(lambda x: giveFlag(x))
try this:
days = list(set(df['day']))
days.sort()

day_no = list()
iter_no = list()
for index, day in enumerate(days):
    counter = 1
    for dfday in df['day']:
        if dfday == day:
            iter_no.append(counter)
            day_no.append(index + 1)
            counter += 1

# Note: this assumes df is already sorted by 'day' (as in the example);
# otherwise the appended values would not line up with the original rows.
df['day_no'] = pd.Series(day_no).values
df['iter_no'] = pd.Series(iter_no).values
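A more vectorized alternative sketch (not from the answer above) using the standard groupby helpers ngroup and cumcount, which produce the same two counters directly:
# Hypothetical alternative using pandas groupby helpers
df['day_no'] = df.groupby('day').ngroup() + 1      # same number for all rows on the same day
df['iter_no'] = df.groupby('day').cumcount() + 1   # running count of act_ids within each day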
