I am currently working on a warehouse report automation task. The report I receive from the warehouse contains two columns, Order Ref and EQPT_NAME, in the following format:
Order Ref   EQPT_NAME
10-3423AC   NA
10-3423AC   NA
10-3423AC   PQLR22334
10-3423AC   NA
10-3410AC   NCRE267
10-3410AC   NA
10-3410AC   NA
10-3410AC   NA
I want to replace NA with the correct EQPT_NAME for each Order Ref using a Python pandas DataFrame.
Output:

Order Ref   EQPT_NAME
10-3423AC   PQLR22334
10-3423AC   PQLR22334
10-3423AC   PQLR22334
10-3423AC   PQLR22334
10-3410AC   NCRE267
10-3410AC   NCRE267
10-3410AC   NCRE267
10-3410AC   NCRE267
For each Order Ref, get the first valid value of EQPT_NAME, then broadcast this value to all rows of the group:
df['EQPT_NAME'] = df.groupby('Order Ref')['EQPT_NAME'].transform('first')
print(df)
# Output
Order Ref EQPT_NAME
0 10-3423AC PQLR22334
1 10-3423AC PQLR22334
2 10-3423AC PQLR22334
3 10-3423AC PQLR22334
4 10-3410AC NCRE267
5 10-3410AC NCRE267
6 10-3410AC NCRE267
7 10-3410AC NCRE267
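One caveat worth hedging: transform('first') only skips values that are genuinely missing (NaN). If the warehouse file arrives with the literal string "NA" (for example when read with keep_default_na=False), convert those strings to NaN first; a minimal sketch:

import numpy as np
import pandas as pd

# Literal "NA" strings -> real missing values, then broadcast per Order Ref
df = df.replace('NA', np.nan)
df['EQPT_NAME'] = df.groupby('Order Ref')['EQPT_NAME'].transform('first')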
I am looking for a method that will look at each date in "Date A" and find the next nearest date after that value in "Date B" by ID (group_by). I then want to calculate the difference in days. Below is the table that I would like.
ID | Date A | Date B | Difference|
11111 | 09/01/21 | 09/03/21 | 2 |
22222 | 09/06/21 | 09/20/21 | 11 |
11111 | 09/08/21 | 09/18/21 | 10 |
44444 | 09/04/21 | NA | 11 |
44444 | 09/10/21 | 09/15/21 | 5 |
22222 | NA | 09/17/21 | NA |
77777 | NA | 10/16/21 | NA |
77777 | 09/04/21 | 10/17/21 | 24 |
77777 | 09/01/21 | 09/28/21 | 27 |
If you could please help me out with this, I would greatly appreciate it!
Cheers
A dplyr solution via group_by is not obvious to me here, but here is a relatively straightforward sqldf solution. Presumably this could be translated into dplyr if you really wanted.
First, mock up the data within R:
df <- dplyr::tribble(
~'ID', ~'Date A', ~'Date B',
11111, '09/01/21', '09/03/21',
22222, '09/06/21', '09/20/21',
11111, '09/08/21', '09/18/21',
44444, '09/04/21', NA ,
44444, '09/10/21', '09/15/21',
22222, NA , '09/17/21',
77777, NA , '10/16/21',
77777, '09/04/21', '10/17/21',
77777, '09/01/21', '09/28/21'
)
df$`Date A` <- lubridate::mdy(df$`Date A`)
df$`Date B` <- lubridate::mdy(df$`Date B`)
df
Which looks like
# A tibble: 9 x 3
ID `Date A` `Date B`
<dbl> <date> <date>
1 11111 2021-09-01 2021-09-03
2 22222 2021-09-06 2021-09-20
3 11111 2021-09-08 2021-09-18
4 44444 2021-09-04 NA
5 44444 2021-09-10 2021-09-15
6 22222 NA 2021-09-17
7 77777 NA 2021-10-16
8 77777 2021-09-04 2021-10-17
9 77777 2021-09-01 2021-09-28
Then do an inequality join combined with a group by. The column I is added to handle nuances of the data, such as multiple occurrences of the same Date A within an ID.
df$I <- 1:nrow(df)
df <- sqldf::sqldf('
SELECT a.I, a.ID, a."Date A", a."Date B",
MIN(b."Date B") AS NextB
FROM df a
LEFT JOIN df b
ON a.ID = b.ID
AND a."Date A" < b."Date B"
GROUP BY a.I, a.ID, a."Date A", a."Date B"
ORDER BY a.I
')
df$Difference = df$NextB - as.integer(df$`Date A`)
df$I <- NULL
df$NextB <- NULL
df
Which matches your example data (and should generalize well for edge cases not in your example data). Unclear how well it might scale up to non-trivial data.
ID Date A Date B Difference
1 11111 2021-09-01 2021-09-03 2
2 22222 2021-09-06 2021-09-20 11
3 11111 2021-09-08 2021-09-18 10
4 44444 2021-09-04 <NA> 11
5 44444 2021-09-10 2021-09-15 5
6 22222 <NA> 2021-09-17 NA
7 77777 <NA> 2021-10-16 NA
8 77777 2021-09-04 2021-10-17 24
9 77777 2021-09-01 2021-09-28 27
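For completeness, a rough dplyr translation of the same inequality join (my sketch, not part of the original answer; it assumes dplyr >= 1.1, where join_by supports non-equi conditions):

library(dplyr)

df$I <- seq_len(nrow(df))   # row id, as in the sqldf version

result <- df %>%
  # join each row to every later Date B within the same ID
  left_join(df %>% select(ID, NextB = `Date B`),
            by = join_by(ID, `Date A` < NextB)) %>%
  # keep the earliest matching Date B per original row
  group_by(I) %>%
  slice_min(NextB, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(Difference = as.integer(NextB - `Date A`)) %>%
  select(ID, `Date A`, `Date B`, Difference)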
For df:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
2 13710770 2019-07-03 SLM607 O I
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
6 13728171 2019-09-17 SLM607 I I
7 13667452 2019-08-02 794580 O I
reproducible example:
data = {'id': [13710750, 13710760, 13710770, 13710780, 13667449, 13667450, 13728171, 13667452],
'Date': ['2019-07-01', '2019-07-01', '2019-07-03', '2019-09-03', '2019-08-02', '2019-08-02', '2019-09-17', '2019-08-02'],
'ITEM_ID': ['SLM607', 'SLM607', 'SLM607', 'SLM607', '887643', '792184', 'SLM607', '794580'],
'TYPE': ['O', 'O', 'O', 'O', 'O', 'O', 'I', 'O'],
'GROUP': ['X', 'M', 'I','N','I','I','I', 'I']}
df = pd.DataFrame(data)
df
How can I delete pairs of rows that have the same values for ITEM_ID and GROUP, where one row has TYPE O and comes first, and the other has TYPE I and comes later?
Expected outcome:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
7 13667452 2019-08-02 794580 O I
Use shift with groupby.filter: drop every (ITEM_ID, GROUP) group in which a TYPE 'I' row follows a TYPE 'O' row.
out = df.groupby(['ITEM_ID', 'GROUP']).filter(
    lambda x: ~(x['TYPE'].eq('I') & x['TYPE'].shift().eq('O')).any()
)
Out[7]:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
7 13667452 2019-08-02 794580 O I
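A cautious variant (my addition, not part of the answer above): the shift() comparison judges "O comes before I" by row order within each group, so if the incoming rows are not already chronological it may be safer to sort by Date first:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
out = (df.sort_values('Date', kind='stable')          # chronological order, stable ties
         .groupby(['ITEM_ID', 'GROUP'])
         .filter(lambda g: not (g['TYPE'].eq('I') & g['TYPE'].shift().eq('O')).any())
         .sort_index())                                # restore the original row order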
Working with the following python pandas dataframe "df":
Customer_ID | Transaction_ID | Item_ID
ABC 2017-04-12-333 X8973
ABC 2017-04-12-333 X2468
ABC 2017-05-22-658 X2906
ABC 2017-05-22-757 X8790
ABC 2017-07-13-864 X8790
BCD 2017-08-11-879 X2346
BCD 2017-08-11-879 X2468
I want a column that records, for each client, whether a row belongs to their 1st transaction, 2nd transaction, and so forth, ordered by date. (If there are two transactions on the same day, I count them as the same transaction number, since I don't have the time of day and can't tell which one came first; I basically treat them as one transaction.)
#get the date out of the Transaction_ID string
df['date'] = pd.to_datetime(df.Transaction_ID.str[:10])
#calculate the transaction number
df['trans_nr'] = df.groupby(['Customer_ID',"Transaction_ID", df['date'].dt.year]).cumcount()+1
Unfortunately, this is my output with the code above:
Customer_ID | Transaction_ID | Item_ID | date | trans_nr
ABC 2017-04-12-333 X8973 2017-04-12 1
ABC 2017-04-12-333 X2468 2017-04-12 2
ABC 2017-05-22-658 X2906 2017-05-22 1
ABC 2017-05-22-757 X8790 2017-05-22 1
ABC 2017-07-13-864 X8790 2017-07-13 1
BCD 2017-08-11-879 X2346 2017-08-11 1
BCD 2017-08-11-879 X2468 2017-08-11 2
Which is incorrect, this is the correct output I am looking for:
Customer_ID | Transaction_ID | Item_ID | date | trans_nr
ABC 2017-04-12-333 X8973 2017-04-12 1
ABC 2017-04-12-333 X2468 2017-04-12 1
ABC 2017-05-22-658 X2906 2017-05-22 2
ABC 2017-05-22-757 X8790 2017-05-22 2
ABC 2017-07-13-864 X8790 2017-07-13 3
BCD 2017-08-11-879 X2346 2017-08-11 1
BCD 2017-08-11-879 X2468 2017-08-11 1
Maybe the logic should be based only on Customer_ID and date (without Transaction_ID)?
I tried this
df['trans_nr'] = df.groupby(['Customer_ID','date']).cumcount()+1
But it also counts incorrectly.
Let's try:
df['trans_nr'] = df.groupby(['Customer_ID', df['date'].dt.year])['date']\
.transform(lambda x: (x.diff() != pd.Timedelta('0 days')).cumsum())
Output:
Customer_ID Transaction_ID Item_ID date trans_nr
0 ABC 2017-04-12-333 X8973 2017-04-12 1
1 ABC 2017-04-12-333 X2468 2017-04-12 1
2 ABC 2017-05-22-658 X2906 2017-05-22 2
3 ABC 2017-05-22-757 X8790 2017-05-22 2
4 ABC 2017-07-13-864 X8790 2017-07-13 3
5 BCD 2017-08-11-879 X2346 2017-08-11 1
6 BCD 2017-08-11-879 X2468 2017-08-11 1
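A small caveat about the approach above (my note, not part of the answer): the diff()/cumsum() trick increments the counter whenever the date changes from the previous row, so it assumes each customer's rows already appear in chronological order. If that is not guaranteed, sort first:

df = df.sort_values(['Customer_ID', 'date'])   # ensure chronological order per customer
df['trans_nr'] = df.groupby(['Customer_ID', df['date'].dt.year])['date']\
    .transform(lambda x: (x.diff() != pd.Timedelta('0 days')).cumsum())
df = df.sort_index()                           # restore the original row order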
Use a double groupby with ngroup(), i.e.
df['trans_nr'] = df.groupby('Customer_ID').apply(lambda x : \
x.groupby([x['date'].dt.date]).ngroup()+1).values
Customer_ID Transaction_ID Item_ID date trans_nr
0 ABC 2017-04-12-333 X8973 2017-04-12 1
1 ABC 2017-04-12-333 X2468 2017-04-12 1
2 ABC 2017-05-22-658 X2906 2017-05-22 2
3 ABC 2017-05-22-757 X8790 2017-05-22 2
4 ABC 2017-07-13-864 X8790 2017-07-13 3
5 BCD 2017-08-11-879 X2346 2017-08-11 1
6 BCD 2017-08-11-879 X2468 2017-08-11 1
One way would be to drop duplicate values before making the cumulative count:
trans_nr = (df
.drop_duplicates(subset=['Customer_ID', 'date'])
.set_index(['Customer_ID', 'date'])
.groupby(level='Customer_ID')
.cumcount() + 1
)
df.set_index(['Customer_ID', 'date'], inplace=True)
df['trans_nr'] = trans_nr
df.reset_index(inplace=True)
To get the transaction number, you first remove rows with duplicate Customer_ID and date values. Then you set their index using Customer_ID and date (for merging later) and perform your groupby and cumcount. This produces a series whose values are the cumulative count for each Customer_ID and date.
You also set the index for the original dataframe (again to allow for merging). Then you simply assign the trans_nr series to a column in df. The indices take care of the merging logic.
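As yet another option (my addition, assuming a reasonably recent pandas version), a dense rank of the date within each customer gives the same numbering directly; equal dates share a rank and later dates get the next integer. Add the year to the grouping if the count should reset each year.

df['trans_nr'] = df.groupby('Customer_ID')['date'].rank(method='dense').astype(int)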
I'm trying to munge my data from the following data frame into the one below it, where the values in columns B and C are combined into column names for the values in D, grouped by the values in A.
Below is a reproducible example.
set.seed(10)
fooDF <- data.frame(A = sample(1:4, 10, replace=TRUE), B = sample(letters[1:4], 10, replace=TRUE), C= sample(letters[1:4], 10, replace=TRUE), D = sample(1:4, 10, replace=TRUE))
fooDF[!duplicated(fooDF),]
A B C D
1 4 c b 2
2 4 d a 2
3 2 a b 4
4 3 c a 1
5 4 a b 3
6 4 b a 2
7 1 b d 2
8 1 a d 4
9 2 b a 3
10 2 d c 2
newdata <- data.frame(A = 1:4)
for(i in 1:nrow(fooDF)){
col_name <- paste(fooDF$B[i], fooDF$C[i], sep="")
newdata[newdata$A == fooDF$A[i], col_name ] <- fooDF$D[i]
}
The format I am trying to get it in.
> newdata
A cb da ab ca ba bd ad dc
1 1 NA NA NA NA NA 2 4 NA
2 2 NA NA 4 NA 3 NA NA 2
3 3 NA NA NA 1 NA NA NA NA
4 4 2 2 3 NA 2 NA NA NA
Right now I am doing it line by line, but that is infeasible for a large CSV containing 5 million+ lines. Is there a way to do it faster in R or Python?
In R, this can be done with tidyr
library(tidyr)
fooDF %>%
unite(BC, B, C, sep="") %>%
spread(BC, D)
# A ab ad ba bd ca cb da dc
#1 1 NA 4 NA 2 NA NA NA NA
#2 2 4 NA 3 NA NA NA NA 2
#3 3 NA NA NA NA 1 NA NA NA
#4 4 3 NA 2 NA NA 2 2 NA
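In current tidyr (1.0+), spread() is superseded by pivot_wider(); an equivalent sketch (my addition, not from the original answer):

library(tidyr)

fooDF %>%
  unite(BC, B, C, sep = "") %>%
  pivot_wider(names_from = BC, values_from = D)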
Or we can do this with dcast
library(data.table)
dcast(setDT(fooDF), A~paste0(B,C), value.var = "D")
# A ab ad ba bd ca cb da dc
#1: 1 NA 4 NA 2 NA NA NA NA
#2: 2 4 NA 3 NA NA NA NA 2
#3: 3 NA NA NA NA 1 NA NA NA
#4: 4 3 NA 2 NA NA 2 2 NA
data
fooDF <- structure(list(A = c(4L, 4L, 2L, 3L, 4L, 4L, 1L, 1L, 2L, 2L),
B = c("c", "d", "a", "c", "a", "b", "b", "a", "b", "d"),
C = c("b", "a", "b", "a", "b", "a", "d", "d", "a", "c"),
D = c(2L, 2L, 4L, 1L, 3L, 2L, 2L, 4L, 3L, 2L)), .Names = c("A",
"B", "C", "D"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10"))
First paste columns B and C together (into column "z"):
fooDF$z = paste0(fooDF$B,fooDF$C)
A B C D z
1 3 d c 3 dc
2 1 b d 3 bd
3 1 a a 2 aa
4 2 d a 1 da
5 4 d c 1 dc
6 2 d b 2 db
7 4 b d 3 bd
8 2 c d 3 cd
9 1 a b 2 ab
10 4 a b 2 ab
Then I'll remove columns B and C
fooDF$B = NULL
fooDF$C = NULL
And last do a reshape from long to wide:
finalFooDF = reshape(fooDF, timevar = "z", direction = "wide",idvar = "A")
A D.dc D.bd D.aa D.da D.db D.cd D.ab
1 3 3 NA NA NA NA NA NA
2 1 NA 3 2 NA NA NA 2
4 2 NA NA NA 1 2 3 NA
5 4 1 3 NA NA NA NA 2
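Since the question also asks about Python, here is a minimal pandas sketch of the same reshape (my addition; it uses the dput data above and pivot_table with aggfunc='first' to be safe about possible duplicate A/BC pairs):

import pandas as pd

fooDF = pd.DataFrame({
    'A': [4, 4, 2, 3, 4, 4, 1, 1, 2, 2],
    'B': list('cdacabbabd'),
    'C': list('bababaddac'),
    'D': [2, 2, 4, 1, 3, 2, 2, 4, 3, 2],
})

# Combine B and C into the future column names, then pivot D on them
fooDF['BC'] = fooDF['B'] + fooDF['C']
out = (fooDF.pivot_table(index='A', columns='BC', values='D', aggfunc='first')
            .reset_index())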
Given this data:
pd.DataFrame({'id':['aaa','aaa','abb','abb','abb','acd','acd','acd'],
'loc':['US','UK','FR','US','IN','US','CN','CN']})
id loc
0 aaa US
1 aaa UK
2 abb FR
3 abb US
4 abb IN
5 acd US
6 acd CN
7 acd CN
How do I pivot it to this:
id loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
I am looking for the most idiomatic method.
I think you can create a new column cols with groupby and cumcount, convert it to string with astype, and finally use pivot:
df['cols'] = 'loc' + (df.groupby('id')['id'].cumcount() + 1).astype(str)
print df
id loc cols
0 aaa US loc1
1 aaa UK loc2
2 abb FR loc1
3 abb US loc2
4 abb IN loc3
5 acd US loc1
6 acd CN loc2
7 acd CN loc3
print df.pivot(index='id', columns='cols', values='loc')
cols loc1 loc2 loc3
id
aaa US UK None
abb FR US IN
acd US CN CN
If you want to remove the index and column names, use rename_axis:
print df.pivot(index='id', columns='cols', values='loc').rename_axis(None)\
        .rename_axis(None, axis=1)
loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
All together, thank you Colin:
print pd.pivot(df['id'], 'loc' + (df.groupby('id').cumcount() + 1).astype(str), df['loc'])\
        .rename_axis(None)\
        .rename_axis(None, axis=1)
loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
I tried rank, but I get an error in version 0.18.0:
print df.groupby('id')['loc'].transform(lambda x: x.rank(method='first'))
#ValueError: first not supported for non-numeric data
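For what it's worth, in newer pandas versions the helper column is not needed; the cumulative count can feed set_index/unstack directly (my sketch, not from the original answer; missing cells come back as NaN rather than None):

import pandas as pd

df = pd.DataFrame({'id': ['aaa', 'aaa', 'abb', 'abb', 'abb', 'acd', 'acd', 'acd'],
                   'loc': ['US', 'UK', 'FR', 'US', 'IN', 'US', 'CN', 'CN']})

out = (df.set_index(['id', df.groupby('id').cumcount() + 1])['loc']
         .unstack()                    # one column per within-group position
         .add_prefix('loc')            # 1, 2, 3 -> loc1, loc2, loc3
         .rename_axis(None)            # drop the index name
         .rename_axis(None, axis=1))   # drop the columns name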