How to fill respective data in a pandas DataFrame - python

I am currently working on automating one of our warehouse reports. The report I receive from the warehouse contains two columns, Order Ref and EQPT_NAME, in the given format:
Order Ref   EQPT_NAME
10-3423AC   NA
10-3423AC   NA
10-3423AC   PQLR22334
10-3423AC   NA
10-3410AC   NCRE267
10-3410AC   NA
10-3410AC   NA
10-3410AC   NA
I want to replace NA with the correct EQPT_NAME as per Order Ref using pandas in Python.
Output
Order Ref   EQPT_NAME
10-3423AC   PQLR22334
10-3423AC   PQLR22334
10-3423AC   PQLR22334
10-3423AC   PQLR22334
10-3410AC   NCRE267
10-3410AC   NCRE267
10-3410AC   NCRE267
10-3410AC   NCRE267

For each Order Ref, get the first valid value from EQPT_NAME, then broadcast this value to all rows of the group:
df['EQPT_NAME'] = df.groupby('Order Ref')['EQPT_NAME'].transform('first')
print(df)
# Output
Order Ref EQPT_NAME
0 10-3423AC PQLR22334
1 10-3423AC PQLR22334
2 10-3423AC PQLR22334
3 10-3423AC PQLR22334
4 10-3410AC NCRE267
5 10-3410AC NCRE267
6 10-3410AC NCRE267
7 10-3410AC NCRE267
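One caveat, hedged as an assumption about the input: groupby(...).transform('first') only skips real missing values, so if the NA entries arrive as literal 'NA' strings, convert them first. A minimal sketch:
import numpy as np

# assumption: the report encodes missing equipment names as the string 'NA'
df['EQPT_NAME'] = df['EQPT_NAME'].replace('NA', np.nan)
df['EQPT_NAME'] = df.groupby('Order Ref')['EQPT_NAME'].transform('first')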

Related

How can I find the nearest date after another date in a different column in grouping by ID using R?

I am looking for a method that will look at each date in "Date A" and find the next nearest date after that value in "Date B" by ID (group_by). I then want to calculate the difference in days. Below is the table that I would like.
ID | Date A | Date B | Difference|
11111 | 09/01/21 | 09/03/21 | 2 |
22222 | 09/06/21 | 09/20/21 | 11 |
11111 | 09/08/21 | 09/18/21 | 10 |
44444 | 09/04/21 | NA | 11 |
44444 | 09/10/21 | 09/15/21 | 5 |
22222 | NA | 09/17/21 | NA |
77777 | NA | 10/16/21 | NA |
77777 | 09/04/21 | 10/17/21 | 24 |
77777 | 09/01/21 | 09/28/21 | 27 |
If you could please help me out with this, I would greatly appreciate it!
Cheers
A dplyr solution via group_by is not obvious to me here, but here is a relatively straightforward sqldf solution. Presumably this could be translated into dplyr if you really wanted.
First, mock up the data within R:
df <- dplyr::tribble(
  ~'ID', ~'Date A', ~'Date B',
  11111, '09/01/21', '09/03/21',
  22222, '09/06/21', '09/20/21',
  11111, '09/08/21', '09/18/21',
  44444, '09/04/21', NA,
  44444, '09/10/21', '09/15/21',
  22222, NA,         '09/17/21',
  77777, NA,         '10/16/21',
  77777, '09/04/21', '10/17/21',
  77777, '09/01/21', '09/28/21'
)
df$`Date A` <- lubridate::mdy(df$`Date A`)
df$`Date B` <- lubridate::mdy(df$`Date B`)
df
Which looks like
# A tibble: 9 x 3
ID `Date A` `Date B`
<dbl> <date> <date>
1 11111 2021-09-01 2021-09-03
2 22222 2021-09-06 2021-09-20
3 11111 2021-09-08 2021-09-18
4 44444 2021-09-04 NA
5 44444 2021-09-10 2021-09-15
6 22222 NA 2021-09-17
7 77777 NA 2021-10-16
8 77777 2021-09-04 2021-10-17
9 77777 2021-09-01 2021-09-28
Then do an inequality join combined with a group by. The column I is added to allow for nuances in the data, such as multiple occurrences of the same Date A within each ID:
df$I <- 1:nrow(df)
df <- sqldf::sqldf('
  SELECT a.I, a.ID, a."Date A", a."Date B",
         MIN(b."Date B") AS NextB
  FROM df a
  LEFT JOIN df b
    ON a.ID = b.ID
   AND a."Date A" < b."Date B"
  GROUP BY a.I, a.ID, a."Date A", a."Date B"
  ORDER BY a.I
')
df$Difference = df$NextB - as.integer(df$`Date A`)
df$I <- NULL
df$NextB <- NULL
df
Which matches your example data (and should generalize well to edge cases not in your example data). It is unclear how well it might scale up to non-trivial data.
ID Date A Date B Difference
1 11111 2021-09-01 2021-09-03 2
2 22222 2021-09-06 2021-09-20 11
3 11111 2021-09-08 2021-09-18 10
4 44444 2021-09-04 <NA> 11
5 44444 2021-09-10 2021-09-15 5
6 22222 <NA> 2021-09-17 NA
7 77777 <NA> 2021-10-16 NA
8 77777 2021-09-04 2021-10-17 24
9 77777 2021-09-01 2021-09-28 27
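For anyone wanting the Python route alongside the R one, here is a hedged pandas sketch of the same strict inequality join using merge_asof; it drops rows with a missing Date A for simplicity, so treat it as an illustration rather than a drop-in replacement:
import pandas as pd

df = pd.DataFrame({
    'ID': [11111, 22222, 11111, 44444, 44444, 22222, 77777, 77777, 77777],
    'Date A': ['09/01/21', '09/06/21', '09/08/21', '09/04/21', '09/10/21',
               None, None, '09/04/21', '09/01/21'],
    'Date B': ['09/03/21', '09/20/21', '09/18/21', None, '09/15/21',
               '09/17/21', '10/16/21', '10/17/21', '09/28/21'],
})
df['Date A'] = pd.to_datetime(df['Date A'], format='%m/%d/%y')
df['Date B'] = pd.to_datetime(df['Date B'], format='%m/%d/%y')

# merge_asof needs both sides sorted on their join keys
left = df.dropna(subset=['Date A']).sort_values('Date A')
right = (df[['ID', 'Date B']].dropna()
           .sort_values('Date B')
           .rename(columns={'Date B': 'NextB'}))

# direction='forward' with allow_exact_matches=False mirrors a."Date A" < b."Date B"
out = pd.merge_asof(left, right, left_on='Date A', right_on='NextB',
                    by='ID', direction='forward', allow_exact_matches=False)
out['Difference'] = (out['NextB'] - out['Date A']).dt.days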

Removing matching pairs in dataframe in Python

For df:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
2 13710770 2019-07-03 SLM607 O I
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
6 13728171 2019-09-17 SLM607 I I
7 13667452 2019-08-02 794580 O I
Reproducible example:
import pandas as pd

data = {'id': [13710750, 13710760, 13710770, 13710780, 13667449, 13667450, 13728171, 13667452],
        'Date': ['2019-07-01', '2019-07-01', '2019-07-03', '2019-09-03', '2019-08-02', '2019-08-02', '2019-09-17', '2019-08-02'],
        'ITEM_ID': ['SLM607', 'SLM607', 'SLM607', 'SLM607', '887643', '792184', 'SLM607', '794580'],
        'TYPE': ['O', 'O', 'O', 'O', 'O', 'O', 'I', 'O'],
        'GROUP': ['X', 'M', 'I', 'N', 'I', 'I', 'I', 'I']}
df = pd.DataFrame(data)
df
How can I delete pairs of rows that have the same values for ITEM_ID and GROUP, where one row has O for TYPE and comes first, and the other has I for TYPE and happens later?
Expected outcome:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
7 13667452 2019-08-02 794580 O I
Use shift with filter: within each (ITEM_ID, GROUP) group, drop the whole group if a TYPE of I ever follows a TYPE of O:
out = (df.groupby(['ITEM_ID', 'GROUP'])
         .filter(lambda x: ~(x['TYPE'].eq('I') & x['TYPE'].shift().eq('O')).any()))
Out[7]:
id Date ITEM_ID TYPE GROUP
0 13710750 2019-07-01 SLM607 O X
1 13710760 2019-07-01 SLM607 O M
3 13710780 2019-09-03 SLM607 O N
4 13667449 2019-08-02 887643 O I
5 13667450 2019-08-02 792184 O I
7 13667452 2019-08-02 794580 O I
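One hedged caveat: the shift-based check assumes the rows of each (ITEM_ID, GROUP) pair are already in chronological order. If that is not guaranteed, sort by Date first:
# sort so that an 'O' genuinely precedes the 'I' it pairs with
# (assumption: Date is the right ordering key)
out = (df.sort_values('Date')
         .groupby(['ITEM_ID', 'GROUP'])
         .filter(lambda g: ~(g['TYPE'].eq('I') & g['TYPE'].shift().eq('O')).any()))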

Python Pandas incorrect date count

Working with the following python pandas dataframe "df":
Customer_ID  Transaction_ID  Item_ID
ABC          2017-04-12-333  X8973
ABC          2017-04-12-333  X2468
ABC          2017-05-22-658  X2906
ABC          2017-05-22-757  X8790
ABC          2017-07-13-864  X8790
BCD          2017-08-11-879  X2346
BCD          2017-08-11-879  X2468
I want to number the transactions so that a new column records whether each row is the client's 1st transaction, 2nd transaction, and so forth, by date. (If there are two transactions on the same day, I count them as the same number, since I don't have the time of day and can't tell which came first; basically I treat them as one transaction.)
#get the date out of the Transaction_ID string
df['date'] = pd.to_datetime(df.Transaction_ID.str[:10])
#calculate the transaction number
df['trans_nr'] = df.groupby(['Customer_ID',"Transaction_ID", df['date'].dt.year]).cumcount()+1
Unfortunately, this is my output with the code above:
Customer_ID  Transaction_ID  Item_ID  date        trans_nr
ABC          2017-04-12-333  X8973    2017-04-12  1
ABC          2017-04-12-333  X2468    2017-04-12  2
ABC          2017-05-22-658  X2906    2017-05-22  1
ABC          2017-05-22-757  X8790    2017-05-22  1
ABC          2017-07-13-864  X8790    2017-07-13  1
BCD          2017-08-11-879  X2346    2017-08-11  1
BCD          2017-08-11-879  X2468    2017-08-11  2
Which is incorrect, this is the correct output I am looking for:
Customer_ID  Transaction_ID  Item_ID  date        trans_nr
ABC          2017-04-12-333  X8973    2017-04-12  1
ABC          2017-04-12-333  X2468    2017-04-12  1
ABC          2017-05-22-658  X2906    2017-05-22  2
ABC          2017-05-22-757  X8790    2017-05-22  2
ABC          2017-07-13-864  X8790    2017-07-13  3
BCD          2017-08-11-879  X2346    2017-08-11  1
BCD          2017-08-11-879  X2468    2017-08-11  1
Maybe the logic should be based only on Customer_ID and date (without Transaction_ID)?
I tried this
df['trans_nr'] = df.groupby(['Customer_ID', 'date']).cumcount() + 1
But it also counts incorrectly.
Let's try:
df['trans_nr'] = (df.groupby(['Customer_ID', df['date'].dt.year])['date']
                    .transform(lambda x: (x.diff() != pd.Timedelta('0 days')).cumsum()))
Output:
Customer_ID Transaction_ID Item_ID date trans_nr
0 ABC 2017-04-12-333 X8973 2017-04-12 1
1 ABC 2017-04-12-333 X2468 2017-04-12 1
2 ABC 2017-05-22-658 X2906 2017-05-22 2
3 ABC 2017-05-22-757 X8790 2017-05-22 2
4 ABC 2017-07-13-864 X8790 2017-07-13 3
5 BCD 2017-08-11-879 X2346 2017-08-11 1
6 BCD 2017-08-11-879 X2468 2017-08-11 1
Use a dual groupby with ngroup(), i.e.:
df['trans_nr'] = (df.groupby('Customer_ID')
                    .apply(lambda x: x.groupby(x['date'].dt.date).ngroup() + 1)
                    .values)
Customer_ID Transaction_ID Item_ID date trans_nr
0 ABC 2017-04-12-333 X8973 2017-04-12 1
1 ABC 2017-04-12-333 X2468 2017-04-12 1
2 ABC 2017-05-22-658 X2906 2017-05-22 2
3 ABC 2017-05-22-757 X8790 2017-05-22 2
4 ABC 2017-07-13-864 X8790 2017-07-13 3
5 BCD 2017-08-11-879 X2346 2017-08-11 1
6 BCD 2017-08-11-879 X2468 2017-08-11 1
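As a hedged alternative sketch, a dense rank of the dates within each customer gives the same 1-based numbering, since same-day transactions share a rank (this assumes you don't need the per-year split used above):
df['trans_nr'] = (df.groupby('Customer_ID')['date']
                    .transform(lambda x: x.rank(method='dense'))
                    .astype(int))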
One way would be to drop duplicate values before making the cumulative count:
trans_nr = (df
            .drop_duplicates(subset=['Customer_ID', 'date'])
            .set_index(['Customer_ID', 'date'])
            .groupby(level='Customer_ID')
            .cumcount() + 1
            )
df.set_index(['Customer_ID', 'date'], inplace=True)
df['trans_nr'] = trans_nr
df.reset_index(inplace=True)
To get the transaction number, you first remove rows with duplicate Customer_ID and date values. Then you set their index using Customer_ID and date (for merging later) and perform your groupby and cumcount. This produces a series whose values are the cumulative count for each Customer_ID and date.
You also set the index for the original dataframe (again to allow for merging). Then you simply assign the trans_nr series to a column in df. The indices take care of the merging logic.
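A sketch of the same merging logic made explicit, assuming you prefer a plain dictionary lookup over index alignment:
# build a (Customer_ID, date) -> count mapping from the deduplicated frame,
# then map it back onto every row of the full frame
dedup = df.drop_duplicates(subset=['Customer_ID', 'date'])
nr = dedup.groupby('Customer_ID').cumcount() + 1
key_to_nr = dict(zip(zip(dedup['Customer_ID'], dedup['date']), nr))
df['trans_nr'] = [key_to_nr[(c, d)] for c, d in zip(df['Customer_ID'], df['date'])]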

Manipulating data frame in R

I'm trying to munge my data from the first data frame below to the one after it, where the values in columns B and C are combined into column names for the values in D, grouped by the values in A.
Below is a reproducible example.
set.seed(10)
fooDF <- data.frame(A = sample(1:4, 10, replace=TRUE),
                    B = sample(letters[1:4], 10, replace=TRUE),
                    C = sample(letters[1:4], 10, replace=TRUE),
                    D = sample(1:4, 10, replace=TRUE))
fooDF[!duplicated(fooDF),]
A B C D
1 4 c b 2
2 4 d a 2
3 2 a b 4
4 3 c a 1
5 4 a b 3
6 4 b a 2
7 1 b d 2
8 1 a d 4
9 2 b a 3
10 2 d c 2
newdata <- data.frame(A = 1:4)
for(i in 1:nrow(fooDF)){
  col_name <- paste(fooDF$B[i], fooDF$C[i], sep="")
  newdata[newdata$A == fooDF$A[i], col_name] <- fooDF$D[i]
}
The format I am trying to get it into:
> newdata
A cb da ab ca ba bd ad dc
1 1 NA NA NA NA NA 2 4 NA
2 2 NA NA 4 NA 3 NA NA 2
3 3 NA NA NA 1 NA NA NA NA
4 4 2 2 3 NA 2 NA NA NA
Right now I am doing it line by line, but that is infeasible for a large csv containing 5+ million lines. Is there a way to do it faster in R or python?
In R, this can be done with tidyr
library(tidyr)
fooDF %>%
  unite(BC, B, C, sep="") %>%
  spread(BC, D)
# A ab ad ba bd ca cb da dc
#1 1 NA 4 NA 2 NA NA NA NA
#2 2 4 NA 3 NA NA NA NA 2
#3 3 NA NA NA NA 1 NA NA NA
#4 4 3 NA 2 NA NA 2 2 NA
Or we can do this with dcast
library(data.table)
dcast(setDT(fooDF), A~paste0(B,C), value.var = "D")
# A ab ad ba bd ca cb da dc
#1: 1 NA 4 NA 2 NA NA NA NA
#2: 2 4 NA 3 NA NA NA NA 2
#3: 3 NA NA NA NA 1 NA NA NA
#4: 4 3 NA 2 NA NA 2 2 NA
data
fooDF <- structure(list(A = c(4L, 4L, 2L, 3L, 4L, 4L, 1L, 1L, 2L, 2L),
                        B = c("c", "d", "a", "c", "a", "b", "b", "a", "b", "d"),
                        C = c("b", "a", "b", "a", "b", "a", "d", "d", "a", "c"),
                        D = c(2L, 2L, 4L, 1L, 3L, 2L, 2L, 4L, 3L, 2L)),
                   .Names = c("A", "B", "C", "D"), class = "data.frame",
                   row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
First paste columns B and C together (into column "z"):
fooDF$z = paste0(fooDF$B,fooDF$C)
A B C D z
1 3 d c 3 dc
2 1 b d 3 bd
3 1 a a 2 aa
4 2 d a 1 da
5 4 d c 1 dc
6 2 d b 2 db
7 4 b d 3 bd
8 2 c d 3 cd
9 1 a b 2 ab
10 4 a b 2 ab
Then I'll remove columns B and C:
fooDF$B = NULL
fooDF$C = NULL
And lastly, do a reshape from long to wide:
finalFooDF = reshape(fooDF, timevar = "z", direction = "wide",idvar = "A")
A D.dc D.bd D.aa D.da D.db D.cd D.ab
1 3 3 NA NA NA NA NA NA
2 1 NA 3 2 NA NA NA 2
4 2 NA NA NA 1 2 3 NA
5 4 1 3 NA NA NA NA 2
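Since the question also asks about Python, here is a hedged pandas sketch of the same long-to-wide reshape, using the structure() data above (note that pivot requires the (A, BC) pairs to be unique, which they are in this example):
import pandas as pd

fooDF = pd.DataFrame({
    'A': [4, 4, 2, 3, 4, 4, 1, 1, 2, 2],
    'B': ['c', 'd', 'a', 'c', 'a', 'b', 'b', 'a', 'b', 'd'],
    'C': ['b', 'a', 'b', 'a', 'b', 'a', 'd', 'd', 'a', 'c'],
    'D': [2, 2, 4, 1, 3, 2, 2, 4, 3, 2],
})

# paste B and C into one key, then pivot D on it (mirrors unite + spread)
out = (fooDF.assign(BC=fooDF['B'] + fooDF['C'])
            .pivot(index='A', columns='BC', values='D')
            .reset_index())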

pandas: pivoting on rank

Given this data:
import pandas as pd

df = pd.DataFrame({'id': ['aaa', 'aaa', 'abb', 'abb', 'abb', 'acd', 'acd', 'acd'],
                   'loc': ['US', 'UK', 'FR', 'US', 'IN', 'US', 'CN', 'CN']})
id loc
0 aaa US
1 aaa UK
2 abb FR
3 abb US
4 abb IN
5 acd US
6 acd CN
7 acd CN
How do I pivot it to this:
id loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
I am looking for the most idiomatic method.
I think you can create a new column cols with groupby and cumcount, convert it to string with astype, and finally use pivot:
df['cols'] = 'loc' + (df.groupby('id')['id'].cumcount() + 1).astype(str)
print(df)
id loc cols
0 aaa US loc1
1 aaa UK loc2
2 abb FR loc1
3 abb US loc2
4 abb IN loc3
5 acd US loc1
6 acd CN loc2
7 acd CN loc3
print(df.pivot(index='id', columns='cols', values='loc'))
cols loc1 loc2 loc3
id
aaa US UK None
abb FR US IN
acd US CN CN
If you want to remove the index and column names, use rename_axis:
print(df.pivot(index='id', columns='cols', values='loc')
        .rename_axis(None)
        .rename_axis(None, axis=1))
loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
All together, thank you Colin:
# note: this positional pd.pivot(index, columns, values) form is from older pandas
print(pd.pivot(df['id'],
               'loc' + (df.groupby('id').cumcount() + 1).astype(str),
               df['loc'])
        .rename_axis(None)
        .rename_axis(None, axis=1))
loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
I tried rank, but I get an error in version 0.18.0:
print(df.groupby('id')['loc'].transform(lambda x: x.rank(method='first')))
# ValueError: first not supported for non-numeric data
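For newer pandas, a hedged sketch of the same pivot in one chain via cumcount and unstack, with no helper column needed (missing slots come back as NaN rather than None):
# assumption: a recent pandas; cumcount numbers each id's rows 1..n,
# and unstack moves that number into the columns
out = (df.set_index(['id', df.groupby('id').cumcount() + 1])['loc']
         .unstack()
         .add_prefix('loc')
         .reset_index())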
