Calculate previous occurrence - python

df month order customer
0 Jan yes 020
1 Feb yes 041
2 April no 020
3 May no 020
Is there a way to find the last month in which a customer ordered, for the rows where order = no?
Expected output:
df month order customer last_order
0 Jan yes 020
1 Feb yes 041
2 April no 020 Jan
3 May no 020 Jan

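You can use df.groupby with pd.Series.eq to flag the rows where order is yes, keep only those months with pd.Series.where, carry them forward with pd.Series.ffill, and then blank out the yes rows themselves with pd.Series.mask:
def func(s):
    m = s['order'].eq('yes')           # True where the customer ordered
    f = s['month'].where(m).ffill()    # month of the most recent order so far
    return f.mask(m)                   # no last_order on the order rows themselves
df['last_order'] = df.groupby('customer', group_keys=False).apply(func)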
 month order customer last_order
0 Jan yes 020 NaN
1 Feb yes 041 NaN
2 April no 020 Jan
3 May no 020 Jan
Explanation
What happens inside each group after the groupby is shown below. For example, consider a group (say, customer 020) that looks like this:
month order
0 jan yes
1 apr no
2 may no
3 jun yes
4 jul no
m = df['order'].eq('yes') # True where `order` is 'yes'
f = df['month'].where(m)#.ffill()
f
0 jan # ---> \
1 NaN \ #`jan` and `jun` are visible as
2 NaN / # they were the months with `order` 'yes'
3 jun # ---> /
4 NaN
Name: month, dtype: object
# If you chain the above with `ffill`, it fills the NaN values.
f = df['month'].where(m).ffill()
f
0 jan
1 jan # filled with valid above value i.e Jan
2 jan # filled with valid above value i.e Jan
3 jun
4 jun # filled with valid above value i.e Jun
Name: month, dtype: object
f.mask(m) # works opposite of `pd.Series.where`
0 NaN # --->\
1 jan \ # Marked values `NaN` where order was `yes`.
2 jan /
3 NaN # --->/
4 jun
Name: month, dtype: object
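The same where / ffill / mask chain can also be written without apply, by forward-filling inside a grouped Series. This is only a sketch on the four-row sample from the question; the variable names `ordered` and `last_seen` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'April', 'May'],
    'order': ['yes', 'yes', 'no', 'no'],
    'customer': ['020', '041', '020', '020'],
})

# True on the rows where the customer actually ordered
ordered = df['order'].eq('yes')

# Keep the month only on order rows, then forward-fill within each customer
last_seen = df['month'].where(ordered).groupby(df['customer']).ffill()

# Hide the value on the order rows themselves, as in the answer above
df['last_order'] = last_seen.mask(ordered)
print(df)
```

Because the forward fill happens per customer, a month from one customer never leaks into another customer's rows.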

You could also do it with df.iterrows:
import pandas as pd
df = pd.DataFrame([{'month': 'Jan', 'order': 'yes', 'customer': '020', 'month_2': 1, 'last_order': None},
                   {'month': 'Feb', 'order': 'yes', 'customer': '041', 'month_2': 2, 'last_order': None},
                   {'month': 'April', 'order': 'no', 'customer': '020', 'month_2': 4, 'last_order': 'Jan'},
                   {'month': 'May', 'order': 'no', 'customer': '020', 'month_2': 5, 'last_order': 'Jan'}])
# Convert the month names to numeric values so they can be compared
dict_months = dict(Jan=1, Feb=2, March=3, April=4, May=5, June=6, Jul=7, Aug=8, Sep=9, Oct=10, Nov=11, Dec=12)
df['month_2'] = df.month.map(dict_months)
# Insert a blank column for last_order
df['last_order'] = None
# Iterate through the rows
for idx, row in df.iterrows():
    if row['order'] == "yes":
        continue
    # For each row, take the customer and the current month and search for orders in previous months
    df_temp = df[(df.customer == row['customer']) & (df.month_2 < row['month_2']) & (df.order == "yes")]
    # If any result is found, pick the last known order and update the DataFrame accordingly
    if df_temp.shape[0] > 0:
        df.loc[idx, 'last_order'] = df_temp['month'].iloc[-1]
# Remove the helper column
del df['month_2']
Output
| month | order | customer | last_order |
|:--------|:--------|-----------:|:-------------|
| Jan | yes | 020 | |
| Feb | yes | 041 | |
| April | no | 020 | Jan |
| May | no | 020 | Jan |

Related

How to merge two unequal rows of a pandas dataframe, where one column value is to match and another column is to be added?

I have given the following pandas dataframe:
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(data=d)
print(df)
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 4321 2013 6
6 9567 2002 150
7 1169 2012 47
I now want to merge two rows of the DataFrame that have two different IDs, so that ultimately only one of them remains. The merge should only take place if the values in the "YEAR" column match. The values in the "VALUE" column should be added together.
The output should look like this:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 1169 2012 47
Line 1 and line 5 have been merged. Line 5 is removed and line 1 remains with the previous ID, but the VALUEs of line 1 and line 5 have been added.
I would like to specify later which two lines or which two IDs should be merged. One of the two should always remain. The two IDs to be merged come from another function.
I experimented with the groupby() function, but I don't know how to merge two different IDs there. I managed it only with identical values of the "ID" column. This then looked like this:
df.groupby(['ID', 'YEAR'])['VALUE'].sum().reset_index(name ='VALUE')
Unfortunately, even after extensive searching, I have not found anything suitable. I would be very happy if someone can help me! I would like to apply the whole thing later to a much larger DataFrame with more rows. Thanks in advance and best regards!
Try this, just group on 'ID' and take the max YEAR and sum VALUE:
df.groupby('ID', as_index=False).agg({'YEAR':'max', 'VALUE':'sum'})
Output:
ID YEAR VALUE
0 1234 2013 27
1 4321 2013 6
Or group on year and take first ID:
df.groupby('YEAR', as_index=False).agg({'ID':'first', 'VALUE':'sum'})
Output:
YEAR ID VALUE
0 2012 1234 3
1 2013 1234 30
Based on all the comments and the update to the question, it sounds like the following logic (maybe not this exact code) is what is required.
Try:
import pandas as pd
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)
df['ID'] = df['ID'].astype(int)
def correctRows(l, i):
    # Among the candidate row labels in l, return the one whose YEAR matches row i
    for x in l:
        if df.loc[x, 'YEAR'] == df.loc[i, 'YEAR']:
            return x
    raise ValueError('no candidate row shares the YEAR of row %s' % i)
def mergeRows(a, b):
    rowa = list(df[df['ID'] == a].index)
    rowb = list(df[df['ID'] == b].index)
    # If an ID occurs in several rows, pick the row whose YEAR matches the other ID
    if len(rowa) > 1:
        if isinstance(rowb, list):
            rowa = correctRows(rowa, rowb[0])
        else:
            rowa = correctRows(rowa, rowb)
    else:
        rowa = rowa[0]
    if len(rowb) > 1:
        if isinstance(rowa, list):
            rowb = correctRows(rowb, rowa[0])
        else:
            rowb = correctRows(rowb, rowa)
    else:
        rowb = rowb[0]
    print('Keeping:', ', '.join(f'{k} {v}' for k, v in df.loc[rowa].items()))
    print('Dropping:', ', '.join(f'{k} {v}' for k, v in df.loc[rowb].items()))
    # Add the VALUE of the dropped row to the kept row, then drop it and renumber
    df.loc[rowa, 'VALUE'] = df.loc[rowa, 'VALUE'] + df.loc[rowb, 'VALUE']
    df.drop(df.index[rowb], inplace=True)
    df.reset_index(drop=True, inplace=True)
    return None
# Merge two IDs. The first 'ID' is kept; the second is dropped, but the 'VALUE'
# of the second is added to the 'VALUE' of the first.
# Note: the IDs were converted with df['ID'].astype(int) above, hence integers are required.
# mergeRows(4321, 1234)
mergeRows(1234, 4321)
Outputs:
Keeping: ID 1234, YEAR 2013, VALUE 24
Dropping: ID 4321, YEAR 2013, VALUE 6
Frame now looks like:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30 #<-- sum of 6 + 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 169 2012 47
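For a larger DataFrame, the same rule (keep one ID, drop the other, add the VALUEs only where the YEAR matches) can be sketched without mutating a global frame. The function name `merge_ids` and its parameters `keep` and `drop` are hypothetical, and the IDs are left as strings here, unlike the answer above.

```python
import pandas as pd

def merge_ids(frame, keep, drop):
    """Fold rows of `drop` into rows of `keep` that share the same YEAR."""
    out = frame.copy()
    years_of_keep = set(out.loc[out['ID'] == keep, 'YEAR'])
    # Relabel only those `drop` rows whose YEAR also occurs for `keep`
    match = (out['ID'] == drop) & out['YEAR'].isin(years_of_keep)
    out.loc[match, 'ID'] = keep
    # Rows that now share ID and YEAR collapse into one row with VALUE summed;
    # sort=False keeps the groups in order of first appearance
    return out.groupby(['ID', 'YEAR'], sort=False, as_index=False)['VALUE'].sum()

d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'],
     'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'],
     'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)

result = merge_ids(df, keep='1234', drop='4321')
print(result)
```

Rows of the dropped ID whose YEAR has no counterpart under the kept ID are left untouched, which matches the requirement that the merge only happens when the years agree.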

Recode multiple values in several columns in Python [similar to R]

I am trying to translate my R script to Python. I have survey data with several date of birth and education level columns, one pair for each family member (from family member 1 to member 10). Here is a sample:
id_name dob_1 dob_2 dob_3 education_1 education_2 education_3
12 1958 2001 2005 1 5 1
13 1990 1999 1932 2 1 3
14 1974 1965 1965 3 3 3
15 1963 1963 1990 4 3 1
16 2020 1995 1988 1 1 2
I had a function in R to check the logic and recode wrong education levels in all columns, like this:
# R function
edu_recode <- function(dob, edu){
  case_when(
    dob >= 2003 & (edu == 1 | edu == 2 | edu == 3 | edu == 4) ~ 8,
    dob > 2000 & (edu == 1 | edu == 2 | edu == 3 | edu == 4) ~ 1,
    dob >= 1996 & (edu == 3 | edu == 4) ~ 2,
    dob > 1995 & edu == 4 ~ 3,
    (dob >= 2001 & dob <= 2002) & edu == 8 ~ 1,
    TRUE ~ as.numeric(edu)
  )
}
and I apply it to all columns like this:
library(tidyverse)
df %>%
  mutate(education_1 = edu_recode(dob_1,education_1),
         education_2 = edu_recode(dob_2,education_2),
         education_3 = edu_recode(dob_3,education_3),
         education_4 = edu_recode(dob_4,education_4),
         education_5 = edu_recode(dob_5,education_5),
         education_6 = edu_recode(dob_6,education_6),
         education_7 = edu_recode(dob_7,education_7),
         education_8 = edu_recode(dob_8,education_8),
         education_9 = edu_recode(dob_9,education_9),
         education_10 = edu_recode(dob_10,education_10)
  )
Is there a way to do a similar process in Python instead of manually recoding each column?
You can write a function that combines pipe with np.select, as well as a dictionary (to abstract as much manual processing as possible):
import numpy as np
import pandas as pd
def edu_recode(df, dob, edu):
    df = df.copy()
    # The conditions mirror the case_when branches of the R function, in the same order
    cond1 = (df[dob] >= 2003) & (df[edu].isin([1, 2, 3, 4]))
    cond2 = (df[dob] > 2000) & (df[edu].isin([1, 2, 3, 4]))
    cond3 = (df[dob] >= 1996) & (df[edu].isin([3, 4]))
    cond4 = (df[dob] > 1995) & (df[edu] == 4)
    cond5 = (df[dob].isin([2001, 2002])) & (df[edu] == 8)
    condlist = [cond1, cond2, cond3, cond4, cond5]
    choicelist = [8, 1, 2, 3, 1]
    # Rows matching no condition keep their original education value
    return np.select(condlist, choicelist, pd.to_numeric(df[edu]))
# sticking to the sample data, you can extend this
mapping = {f"education_{num}": df.pipe(edu_recode, f"dob_{num}", f"education_{num}")
           for num in range(1, 4)}
df.assign(**mapping)
id_name dob_1 dob_2 dob_3 education_1 education_2 education_3
0 12 1958 2001 2005 1 5 8
1 13 1990 1999 1932 2 1 3
2 14 1974 1965 1965 3 3 3
3 15 1963 1963 1990 4 3 1
4 16 2020 1995 1988 8 1 2
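To avoid hard-coding the number of family members, the column pairs can be discovered from the column names. The sketch below assumes every `education_N` column has a matching `dob_N` column; this variant of `edu_recode` takes two Series (closer to the R signature) and is an illustration, not the answer's exact function.

```python
import numpy as np
import pandas as pd

def edu_recode(dob, edu):
    """Recode one education column given its date-of-birth column."""
    low = edu.isin([1, 2, 3, 4])
    conditions = [
        (dob >= 2003) & low,
        (dob > 2000) & low,
        (dob >= 1996) & edu.isin([3, 4]),
        (dob > 1995) & (edu == 4),
        dob.isin([2001, 2002]) & (edu == 8),
    ]
    choices = [8, 1, 2, 3, 1]
    # np.select takes the first matching condition; others keep their value
    return np.select(conditions, choices, default=edu)

df = pd.DataFrame({
    'id_name': [12, 13, 14, 15, 16],
    'dob_1': [1958, 1990, 1974, 1963, 2020],
    'dob_2': [2001, 1999, 1965, 1963, 1995],
    'dob_3': [2005, 1932, 1965, 1990, 1988],
    'education_1': [1, 2, 3, 4, 1],
    'education_2': [5, 1, 3, 3, 1],
    'education_3': [1, 3, 3, 1, 2],
})

# Pair each education_N column with dob_N by its numeric suffix
for col in list(df.columns):
    if col.startswith('education_'):
        suffix = col.rsplit('_', 1)[1]
        df[col] = edu_recode(df[f'dob_{suffix}'], df[col])
print(df)
```

With ten family members the loop stays the same; only the input frame grows.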

calculating percentile values for each columns group by another column values - Pandas dataframe

I have a dataframe that looks like below -
Year Salary Amount
0 2019 1200 53
1 2020 3443 455
2 2021 6777 123
3 2019 5466 313
4 2020 4656 545
5 2021 4565 775
6 2019 4654 567
7 2020 7867 657
8 2021 6766 567
Python script to create the dataframe above -
import pandas as pd
import numpy as np
d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567]
})
I want to calculate certain percentile values for all the columns grouped by 'Year'.
Desired output should look like -
I am running the Python script below to calculate certain percentile values -
df_percentile = pd.DataFrame()
p_list = [0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99]
c_list = []
p_values = []
for cols in d.columns[1:]:
    for p in p_list:
        c_list.append(cols + '_' + str(p))
        p_values.append(np.percentile(d[cols], p))
print(len(c_list), len(p_values))
df_percentile['Name'] = pd.Series(c_list)
df_percentile['Value'] = pd.Series(p_values)
print(df_percentile)
Output -
Name Value
0 Salary_0.05 1208.9720
1 Salary_0.1 1217.9440
2 Salary_0.25 1244.8600
3 Salary_0.5 1289.7200
4 Salary_0.75 1334.5800
5 Salary_0.95 1370.4680
6 Salary_0.99 1377.6456
7 Amount_0.05 53.2800
8 Amount_0.1 53.5600
9 Amount_0.25 54.4000
10 Amount_0.5 55.8000
11 Amount_0.75 57.2000
12 Amount_0.95 58.3200
13 Amount_0.99 58.5440
How can I get the output in the required format without having to do extra data manipulation/formatting or in fewer lines of code?
You can try pivot followed by quantile (d is the dataframe from the question):
(d.pivot(columns='Year')
  .quantile([0.01, 0.05, 0.75, 0.95, 0.99])
  .stack('Year')
)
Output:
Salary Amount
Year
0.01 2019 1269.08 58.20
2020 3467.26 456.80
2021 4609.02 131.88
0.05 2019 1545.40 79.00
2020 3564.30 464.00
2021 4785.10 167.40
0.75 2019 5060.00 440.00
2020 6261.50 601.00
2021 6771.50 671.00
0.95 2019 5384.80 541.60
2020 7545.90 645.80
2021 6775.90 754.20
0.99 2019 5449.76 561.92
2020 7802.78 654.76
2021 6776.78 770.84
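An alternative sketch uses groupby with quantile directly, which takes fractions between 0 and 1. Note that np.percentile expects values between 0 and 100, so passing 0.05 to it asks for the 0.05th percentile; that, together with the missing grouping by Year, is why the values in the question sit so close to the column minimum. The result name `result` and the index label `Quantile` are illustrative.

```python
import pandas as pd

d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567],
})

p_list = [0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99]

# One row per (Year, quantile) pair, one column per measured variable
result = (d.groupby('Year')[['Salary', 'Amount']]
            .quantile(p_list)
            .rename_axis(['Year', 'Quantile']))
print(result)
```

If one column per quantile is preferred, calling `result.unstack('Quantile')` on this frame pivots the quantile level into the columns.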

counting different types of transactions in different locations. python

I have a class called transactions with the following attributes.
transactions = ([time_of_day, day_of_month ,week_day, duration, amount, trans_type, location])
this is the sample data types
time date weekday duration amount type location
0:07 3 thu 2 balance driveup
0:07 3 thu 6 20 withdrawal campusA
0:20 1 tue 2 357 deposit campusB
the type of transactions are
balance, withdrawal, deposit, advance, transfer
i have to count the number of different types of transaction in different location
which will result in something like this
Location | Advance | Balance | Deposit | Transfer | Withdrawal | Total
'driveup'| 4 | 191 | 247 | 28 | 530 | 1000
'campus' | 1 | 2 | 3 | 4 | 5 | 15
the result should emit a list, something like this:
[['Location', 'Advance', 'Balance', 'Deposit', 'Transfer', 'Withdrawal', 'Total'],
['driveup', 4, 191, 247, 28, 530, 1000],['campus', 1, 2, 3, 4, 5, 15]]
note: the example data set and the resulting list only shows 1 location. there are 3 different locations. 'driveup', 'campusa', 'campusb'
How do I make the list?
I tried something like this:
atm_location = list(zip(amounts, transactions, locations))
for element in atm_location:
    a = element[1] # transactions
    b = element[2] # locations
    c = element[0] # amounts
    if a == 'advance' and b == 'driveup':
        drive_advance.append((a,b,c))
and so on. I basically just chain if and elif conditions; the code is very long, so I won't put it here.
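One way to avoid the long if/elif chain is to count (type, location) pairs with collections.Counter and then build the rows from the counts. This is a minimal sketch: the function name `count_by_location` and the small `sample` list are assumptions for illustration, not the real data set.

```python
from collections import Counter

TYPES = ['advance', 'balance', 'deposit', 'transfer', 'withdrawal']

def count_by_location(records):
    """records: iterable of (trans_type, location) pairs."""
    counts = Counter(records)
    locations = sorted({loc for _, loc in counts})
    # Header row, then one row per location with a running total at the end
    table = [['Location'] + [t.capitalize() for t in TYPES] + ['Total']]
    for loc in locations:
        row = [counts[(t, loc)] for t in TYPES]
        table.append([loc] + row + [sum(row)])
    return table

# Hypothetical sample in the shape produced by zip(transactions, locations)
sample = [
    ('balance', 'driveup'),
    ('withdrawal', 'campusA'),
    ('deposit', 'campusB'),
    ('balance', 'driveup'),
]
table = count_by_location(sample)
for line in table:
    print(line)
```

A Counter returns 0 for pairs it has never seen, so combinations that do not occur need no special handling.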

Generate calendar Python webpage

Re-phrasing my Question, as suggested by moderator.
I need to create a calendar with Python & CSS for a web page, I tried the following in Python:
#!/usr/bin/env python
import os, re, sys, calendar
from datetime import datetime
html = ""  # accumulates the page output
myCal = calendar.monthcalendar(2011,9)
html += str(myCal)
mycal = myCal[:1]
cal1 = myCal[1:2]
cal2 = myCal[2:3]
cal3 = myCal[3:4]
cal4 = myCal[4:5]
html += str(mycal)+'<br>'
html += str(cal1)+'<br>'
html += str(cal2)+'<br>'
html += str(cal3)+'<br>'
html += str(cal4)+'<br>'
html += "<br>"
This is the following output on the web page:
[[0, 0, 0, 1, 2, 3, 4]]<br>
[[5, 6, 7, 8, 9, 10, 11]]<br>
[[12, 13, 14, 15, 16, 17, 18]]<br>
[[19, 20, 21, 22, 23, 24, 25]]<br>
[[26, 27, 28, 29, 30, 0, 0]]<br>
How can I arrange the above in the following format below?
(
This is a SAMPLE format, I have not done the actual Day / Date match.
The format needs to be in TWO rows.
eg.DayDateimgNDayDateimgN next month
)
Dec , 2011
----------------------------------------------------------------------------
sun | mon | tue | wed thu fri sat sun mon tue wed thu fri sat sun
-------|-------|-------|----------------------------------------------------
.... 1 |.. 2 ..|.. 3 ..| 4 5 6 7 8 9 10 11 12 13 14 15<br>
------ |-------|-------|----------------------------------------------------
img1 | img2 | img3 | ....
Jan , 2012
----------------------------------------------------------------------------
sun | mon | tue | wed thu fri sat sun mon tue wed thu fri sat sun
-------|-------|-------|----------------------------------------------------
.... 1 |.. 2 ..|.. 3 ..| 4 5 6 7 8 9 10 11 12 13 14 15<br>
------ |-------|-------|----------------------------------------------------
img1 | img2 | img3 | ....
Feb , 2012
----------------------------------------------------------------------------
sun | mon | tue | wed thu fri sat sun mon tue wed thu fri sat sun
-------|-------|-------|----------------------------------------------------
.... 1 |.. 2 ..|.. 3 ..| 4 5 6 7 8 9 10 11 12 13 14 15<br>
------ |-------|-------|----------------------------------------------------
img1 | img2 | img3 | ....
I'm not sure exactly what you are looking for. This produces a table with the 3 rows you describe, but for the entire month (not just the first 15 days). It may be a starting point:
import calendar
import itertools
blank = " "
myCal = calendar.monthcalendar(2011,9)
day_names = itertools.cycle(['mon','tue','wed','thu','fri','sat','sun']) # endless iterator
cal = [day for week in myCal for day in week] # flatten list of lists
# empty lists to hold the data
headers = []
numbers = []
imgs = []
# fill lists
for d in cal:
    if d != 0:
        headers.append(next(day_names))
        numbers.append(d)
        imgs.append("image"+str(d))
    else:
        # padding day that belongs to the previous or next month
        headers.append(next(day_names))
        numbers.append(blank)
        imgs.append(blank)
# format data: one table row each for day names, day numbers and images
html = "<table><tr>{0}</tr><tr>{1}</tr><tr>{2}</tr></table>".format(
    "".join(["<td>%s</td>"% h for h in headers]),
    "".join(["<td>%s</td>"% n for n in numbers]),
    "".join(["<td>%s</td>"% i for i in imgs]))
I'm not quite sure I understand the exact format you're wanting, but you can put the calendar into a table of 3 rows.
Try this for each of your months:
import calendar
myCal = calendar.monthcalendar(2011,9)
# make the multi-row month into a single-row list
days = list()
for x in myCal:
    days.extend(x)
# match this up with the weekday names, with enough weeks to cover the month
# (monthcalendar starts each week on Monday by default)
weeks = len(myCal) * ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
# some images
images = ['image1', 'image2', '...']
# make sure there is at least a zero at the end of the list
days.append(0)
# find start and end indexes where the actual day numbers lie
start_index = days.index(1)
# if there are no zeros at the end of the list this will fail
end_index = days[start_index:].index(0) + len(days) - len(days[start_index:])
header = 'Sep, 2011'
# Create the table rows of just the items between start_index and end_index.
weekday_row = '<tr>%s</tr>'%(''.join(['<td>%s</td>'%(x)
                                      for x in weeks[start_index:end_index]]))
day_row = '<tr>%s</tr>'%(''.join(['<td>%d</td>'%(x)
                                  for x in days[start_index:end_index]]))
image_row = '<tr>%s</tr>' % (''.join(['<td>%s</td>' % (x) for x in images]))
# finally put them all together to form your month block
html = '<div>%s</div><table>\n\n%s\n\n%s\n\n%s</table>' % (header,
                                                           weekday_row,
                                                           day_row,
                                                           image_row)
You can repeat that as many times as you need
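Repeating the block for several months can be sketched as a small function. This version uses calendar.monthrange instead of trimming the zero padding from monthcalendar; the names `month_block` and `html_row` and the `imgN` placeholders are assumptions for illustration.

```python
import calendar

# Monday == 0, matching the weekday numbering used by the calendar module
DAY_NAMES = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']

def html_row(cells):
    """Wrap each cell in <td> and the whole row in <tr>."""
    return '<tr>' + ''.join(f'<td>{c}</td>' for c in cells) + '</tr>'

def month_block(year, month):
    """Return a header plus a 3-row table: weekday names, day numbers, images."""
    first_weekday, n_days = calendar.monthrange(year, month)
    days = list(range(1, n_days + 1))
    names = [DAY_NAMES[(first_weekday + day - 1) % 7] for day in days]
    images = [f'img{day}' for day in days]  # placeholder image labels
    header = f'{calendar.month_abbr[month]}, {year}'
    return (f'<div>{header}</div><table>'
            + html_row(names) + html_row(days) + html_row(images)
            + '</table>')

# Three consecutive months, as in the layout sketched in the question
html = ''.join(month_block(y, m) for y, m in [(2011, 12), (2012, 1), (2012, 2)])
```

Each call is independent, so the list of (year, month) pairs can be as long as the page needs.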
