Inserting a new column with date format in Python

I have a dataframe, df, where I want to insert a new column named Date in a specific format.
df:
Name ID
Kelly A
John B
Desired output:
Date Name ID
2019-10-01 Kelly A
2019-10-01 John B
This is what I am doing:
df['2019-10-01'] = date
I am still researching this. Any insight is helpful

Try with
df['date'] = '2019-10-01'
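A minimal sketch of the full flow (assuming pandas and the sample frame above; the column reordering step is only there to match the desired layout):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Kelly", "John"], "ID": ["A", "B"]})

# Assigning a scalar broadcasts it to every row
df["Date"] = "2019-10-01"

# Reorder columns so Date comes first, matching the desired output
df = df[["Date", "Name", "ID"]]
```

If you need an actual datetime dtype rather than a string, wrap the literal in pd.to_datetime first.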

Related

Filter for most recent event by group with pandas

I'm trying to filter a pandas dataframe so that I'm able to get the most recent data point for each account number in the dataframe.
Here is an example of what the data looks like.
I'm looking for an output of one instance of an account with the product and most recent date.
account_number product sale_date
0 123 rental 2021-12-01
1 423 rental 2021-10-01
2 513 sale 2021-11-02
3 123 sale 2022-01-01
4 513 sale 2021-11-30
I was trying to use groupby and idxmax(), but it doesn't work with dates.
I also wanted to change the dtype away from datetime.
data_grouped = data.groupby('account_number')['sale_date'].max().idxmax()
Any ideas would be awesome.
To retain a subsetted data frame, consider sorting by account number and descending sale date, then calling DataFrame.groupby().head (which, unlike DataFrame.groupby().first, retains NaNs if they appear in the first row of a group):
data_grouped = (
    data.sort_values(
        ["account_number", "sale_date"], ascending=[True, False]
    )
    .reset_index(drop=True)
    .groupby("account_number")
    .head(1)
)
It seems the sale_date column has strings. If you convert it to datetime dtype, then you can use groupby + idxmax:
df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()]
Output:
account_number product sale_date
3 123 sale 2022-01-01
1 423 rental 2021-10-01
4 513 sale 2021-11-30
Would the keyword 'first' work? That would be:
data.groupby('account_number')['sale_date'].first()
You want the last keyword to get the most recent date after grouping (note this relies on the rows being sorted by date within each group), like this:
df.groupby(by=["account_number"])["sale_date"].last()
which will provide this output:
account_number
123 2022-01-01
423 2021-10-01
513 2021-11-30
Name: sale_date, dtype: datetime64[ns]
It is unclear why you want to transition away from using the datetime dtype, but you need it in order to correctly sort for the value you are looking for. Consider doing this as an intermediate step, then reformatting the column after processing.
I'll change my answer to use @Daniel Weigelbut's approach... and also note that you can apply .nth(n) to find the nth value in the general case (-1 for the most recent date).
new_data = data.groupby('account_number')['sale_date'].nth(-1)
My previous suggestion of creating a sorted multi index with
data.set_index(['account_number', 'sale_date'], inplace = True)
data_sorted = data.sort_index(level = [0, 1])
still works and might be more useful for more complex sorting. As others have said, make sure your date strings are datetime objects if you sort like this.
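Putting the advice in this thread together, a hedged sketch (reconstructing the question's sample data): convert sale_date to a real datetime, sort by it, then take .nth(-1) per group.

```python
import pandas as pd

data = pd.DataFrame({
    "account_number": [123, 423, 513, 123, 513],
    "product": ["rental", "rental", "sale", "sale", "sale"],
    "sale_date": ["2021-12-01", "2021-10-01", "2021-11-02",
                  "2022-01-01", "2021-11-30"],
})

# .nth(-1) picks the last row per group, so sort by date first;
# a real datetime dtype makes the sort unambiguous
data["sale_date"] = pd.to_datetime(data["sale_date"])
latest = (
    data.sort_values("sale_date")
        .groupby("account_number")
        .nth(-1)
)
```

This keeps the full rows (including product), unlike selecting only the sale_date column before .nth(-1).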

How would you go about finding the latest date in a dataframe groupby? [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 1 year ago.
I am attempting to create a sub-table from a larger dataset which lists out the unique ID, the name of the person and the date they attended an appointment.
For example,
df = pd.DataFrame({'ID': ['abc', 'def', 'abc', 'abc'],
                   'name': ['Alex', 'Bertie', 'Alex', 'Alex'],
                   'date_attended': ['01/01/2021', '05/01/2021', '11/01/2021', '20/01/2021']})
What I would like is a dataframe that shows the last time Alex and Bertie attended a class. So my dataframe would look like:
name date_attended
Alex 20/01/2021
Bertie 05/01/2021
I'm really struggling with this! So far I have tried (based off a previous question I saw here):
max_date_list = ['ID','date_attended']
df = df.groupby(['ID'])[max_date_list].transform('max').size()
but I keep getting an error. I know this would involve a groupby but I can't figure out how to get the maximum date. Would anyone know how to do this?
Try sort_values by 'date_attended' and drop_duplicates by 'ID':
df['date_attended'] = pd.to_datetime(df['date_attended'], dayfirst=True)
df.sort_values('date_attended', ascending=False).drop_duplicates('ID')
Output:
ID name date_attended
3 abc Alex 2021-01-20
1 def Bertie 2021-01-05
To match your expected output format exactly, you might want to groupby "name":
>>> df.groupby("name")["date_attended"].max()
name
Alex 20/01/2021
Bertie 05/01/2021
Name: date_attended, dtype: object
Alternatively, if you might have different ID with the same name:
>>> df.groupby("ID").agg({"name": "first", "date_attended": "max"}).set_index("name")
date_attended
name
Alex 20/01/2021
Bertie 05/01/2021

How do I aggregate rows in a pandas dataframe according to the latest dates in a column?

I have a dataframe containing materials, dates of purchase and purchase prices. I want to filter my dataframe such that I only keep one row containing each material, and that row contains the material at the latest purchase date and corresponding price.
How could I achieve this? I have racked my brains trying to work out how to apply aggregation functions to this but I just can't work out how.
Do a multisort and then use drop duplicates, keeping the first occurrence.
import pandas as pd
df.sort_values(by=['materials', 'purchase_date'], ascending=[True, False], inplace=True)
df.drop_duplicates(subset=['materials'], keep='first', inplace=True)
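A quick self-contained check of this pattern on made-up data (the column names follow the answer; the question's actual frame isn't shown):

```python
import pandas as pd

df = pd.DataFrame({
    "materials": ["steel", "steel", "copper", "copper"],
    "purchase_date": pd.to_datetime(
        ["2020-01-05", "2020-03-01", "2020-02-10", "2020-01-20"]),
    "purchase_price": [10, 12, 30, 28],
})

# Newest purchase first within each material, then keep that first row
df.sort_values(by=["materials", "purchase_date"],
               ascending=[True, False], inplace=True)
df.drop_duplicates(subset=["materials"], keep="first", inplace=True)
```

Each material survives exactly once, carrying its latest date and the corresponding price.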
Two steps
sort_values() by material and purchaseDate
groupby() material and take first row
import numpy as np
import pandas as pd

d = pd.date_range("1-apr-2020", "30-oct-2020", freq="W")
df = pd.DataFrame({"material": np.random.choice(list("abcd"), len(d)),
                   "purchaseDate": d,
                   "purchasePrice": np.random.randint(1, 100, len(d))})
df.sort_values(["material","purchaseDate"], ascending=[1,0]).groupby("material", as_index=False).first()
output

  material purchaseDate  purchasePrice
0        a   2020-09-27             85
1        b   2020-10-25             54
2        c   2020-10-11             21
3        d   2020-10-18             45

.dt.strftime gives back the wrong dates in pandas

I have a pandas dataframe with a column CreatedDate in it. Currently the values of the column look like this:
id CreatedDate
123 1586362930000
124 1586555550000
Desired output is:
id CreatedDate
123 2020-04-08T15:50:00Z
124 2020-04-08T15:45:00Z
I have tried the following:
# Change the column type from int to datetime64[ns]
df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])
new_df = df['CreatedDate'].dt.strftime("%Y-%m-%d"+"T"+"%H:%M:%S"+"Z")
The output is this:
id CreatedDate
123 1970-01-01 00:26:26.362930
124 1970-01-01 00:26:26.365487
Which is not what I have expected, I know for a fact that those days should be April 8th.
I have tested dt.strftime("%Y-%m-%d"+"T"+"%H:%M:%S"+"Z") with just a string and it returns the desired output, however, when I apply it to the dataframe it doesn't work properly
This is Unix time in milliseconds, so pass unit='ms' when converting:
pd.to_datetime(df.CreatedDate, unit='ms').dt.strftime("%Y-%m-%dT%H:%M:%SZ")
0 2020-04-08T16:22:10Z
1 2020-04-10T21:52:30Z
Name: CreatedDate, dtype: object

Data frame formation

I need to create a data frame for 100 customer_id along with their expenses for each day starting from 1st June 2019 to 31st August 2019. I have the customer IDs in a list and the dates in a list as well. How do I make a data frame in the format shown?
CustomerID TrxnDate
1 1-Jun-19
1 2-Jun-19
1 3-Jun-19
1 Upto....
1 31-Aug-19
2 1-Jun-19
2 2-Jun-19
2 3-Jun-19
2 Upto....
2 31-Aug-19
and so on for other 100 customer id
I already have a customer_id dataframe built with pandas; now I need to map each customer_id to the dates, i.e. customer ID 1 should have all dates from 1st June 2019 to 31st August 2019, then customer ID 2 should have the same dates, and so on. Please see the required data frame above.
# import module
import pandas as pd
# list of dates
lst = ['1-Jun-19', '2-Jun-19', '3-Jun-19']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
Repeat the operations for Customer ID and store in df2 or something and then
frames = [df, df2]
result = pd.concat(frames)
There are simpler methods, but this should give you an idea of how it is carried out.
Since you want specific dataframes, first create the dataframe for customer ID 1, then repeat the same for customer ID 2, and then concat those dataframes.
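One of those simpler methods is a cross product via pd.MultiIndex.from_product; a sketch (variable names are illustrative, using the 100 IDs and the date range from the question):

```python
import pandas as pd

customer_ids = range(1, 101)  # the 100 customer IDs
dates = pd.date_range("2019-06-01", "2019-08-31", freq="D")

# Every customer paired with every date, already in the required long format
df = pd.MultiIndex.from_product(
    [customer_ids, dates], names=["CustomerID", "TrxnDate"]
).to_frame(index=False)
```

If the 1-Jun-19 display format is required, TrxnDate can afterwards be formatted with dt.strftime("%d-%b-%y").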