Here is my current df:
ID Date
1 3/29/2017
2
3 11/5/2015
4
5 2/28/2017
I am trying to get year + month as a string in the new column. And this is my code:
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["yyyy_mm"] = df["Year"].map(str) + "-" + df["Month"].map(str)
The issue is when I extract the year and month from the date, it will return the float type.
ID Date Year Month yyyy_mm I hope to get this
1 3/29/2017 2017.0 3.0 2017.0-3.0 2017-3
2 nan-nan
3 11/5/2015 2015.0 11.0 2015.0-11.0 2015-11
4 nan-nan
5 2/28/2017 2017.0 2.0 2017.0-2.0 2017-2
I tried to use df["Date"].dt.year.astype(int) to convert it to int, so that there is no .0, but I got this error: Cannot convert non-finite values (NA or inf) to integer. That is because there are NaN values in the column.
I don't want to fillna all the years and months with 0 or something else; I just want to keep them empty, since the date is empty in those rows.
You should perform string conversion directly from Date using pd.Series.dt.strftime.
This not only ensures NaT rows stay missing, but also gives better-formatted strings, e.g. zero-padded months.
df["yyyy_mm"] = df['Date'].dt.strftime('%Y-%m')
print(df)
ID Date Year Month yyyy_mm
0 1 2017-03-29 2017.0 3.0 2017-03
1 2 NaT NaN NaN NaT
2 3 2015-11-05 2015.0 11.0 2015-11
3 4 NaT NaN NaN NaT
4 5 2017-02-28 2017.0 2.0 2017-02
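As a minimal sketch of the approach above, assuming a small frame built by hand (the values mirror the question; the `Int64` step is an addition that is not part of the original answer): `strftime` leaves missing dates missing, and pandas' nullable `Int64` dtype is one way to get an integer year without filling the gaps.

```python
import pandas as pd

# Sample data mirroring the question; None becomes NaT after parsing.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Date": pd.to_datetime(
        ["2017-03-29", None, "2015-11-05", None, "2017-02-28"]
    ),
})

# Format straight from the datetime column; rows without a date stay missing.
df["yyyy_mm"] = df["Date"].dt.strftime("%Y-%m")

# Optional: a nullable integer year, so there is no trailing ".0".
df["Year"] = df["Date"].dt.year.astype("Int64")

print(df)
```

How the missing entries are displayed (NaT or NaN) may depend on the pandas version.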
Related
I have this data frame
import pandas as pd
df = pd.DataFrame({'Found':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
Based on this data frame, I want the Mark from the previous year. I managed to get the maximum per Found group, but I want the last value instead. I used .max() and thought I could get it with .last(), but it didn't work.
Here is my code:
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['Found', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['Found', 'LastYear'])
print(df)
Found Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
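One possible way to get the last value rather than the maximum is sketched below. It assumes that "last" means the row with the latest date within each Found/year group, that the dates are day-first, and that the column is named `Found` as in the output above. The names `Year`, `last` and `Last_MarkLastYear` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Found': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Date': ['14/10/2021', '19/10/2020', '29/10/2019', '30/09/2021',
             '20/09/2020', '20/10/2021', '29/10/2020', '15/10/2019',
             '10/09/2020'],
    'Mark': [1, 2, 3, 4, 5, 1, 2, 3, 3],
})

# Assumption: dates are day/month/year.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Year'] = df['Date'].dt.year

# Sort by date first so that .last() picks the latest Mark in each year.
last = (df.sort_values('Date')
          .groupby(['Found', 'Year'], as_index=False)['Mark'].last()
          .rename(columns={'Mark': 'Last_MarkLastYear'}))

# Shift the key forward one year so each row matches its previous year.
last['Year'] += 1

out = df.merge(last, on=['Found', 'Year'], how='left')
print(out)
```

Rows from the earliest year have no previous year to look up, so they come out as NaN.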
I have the following df with Date elements being a string followed by YYYY.MM :
df =
Date Value
0 name 2019.06 1.0
1 string 2018.03 1.6
2 string 2017.12 1.0
3 string 2016.09 1.7
4 name 2018.09 6.0
...
And I would like to convert the Date column to the last business day (Monday to Friday) of its month.
So I could get this output:
df =
Date Value
0 2019-06-28 1.0
1 2018-03-30 1.6
2 2017-12-29 1.0
3 2016-09-30 1.7
4 2018-09-28 6.0
...
I tried re.search to start by searching for the date parts of each element of the column, but I can't figure out the solution for this.
Split off the date part, parse it, and add a business month end:
d = pd.to_datetime(df['Date'].str.split().str[-1], format='%Y.%m')
print(df.assign(Date=d + pd.offsets.BMonthEnd(1)))
Date Value
0 2019-06-28 1.0
1 2018-03-30 1.6
2 2017-12-29 1.0
3 2016-09-30 1.7
4 2018-09-28 6.0
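For reference, a self-contained sketch of the same idea, assuming the Date column holds strings whose last whitespace-separated token is in YYYY.MM form (the sample rows are copied from the question; `month_start` is a name introduced here):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["name 2019.06", "string 2018.03", "string 2017.12",
             "string 2016.09", "name 2018.09"],
    "Value": [1.0, 1.6, 1.0, 1.7, 6.0],
})

# Keep only the trailing "YYYY.MM" token and parse it as the first of the month.
month_start = pd.to_datetime(df["Date"].str.split().str[-1], format="%Y.%m")

# Roll forward to the last business day (Monday to Friday) of that month.
df["Date"] = month_start + pd.offsets.BMonthEnd(1)

print(df)
```

Note that `BMonthEnd` only skips weekends; public holidays are not taken into account.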
customer_id Order_date
1 2015-01-16
1 2015-01-19
2 2014-12-21
2 2015-01-10
1 2015-01-10
3 2018-01-18
3 2017-03-04
4 2019-11-05
4 2010-01-01
3 2019-02-03
Let's say I have data like this.
For an e-commerce firm, some people buy regularly, some buy once a year, some buy once a month, and so on. I need to find the number of days between consecutive transactions for each customer.
The list will be dynamic, since some people will have transacted a thousand times, some once, some ten times, etc. Any ideas on how to achieve this?
Output needed:
customer_id Order_date_Difference_in_days
1 6,3 # the difference between the first two dates, 2015-01-10 and 2015-01-16,
# is 6 days, and the difference between the next two consecutive dates,
# 2015-01-16 and 2015-01-19, is 3 days
2 20
3 320,381
4 3595
Basically, these are the differences between dates after first sorting them for each customer id.
You can also use the below for the current output:
m=(df.assign(Diff=df.sort_values(['customer_id','Order_date'])
.groupby('customer_id')['Order_date'].diff().dt.days).dropna())
m=m.assign(Diff=m['Diff'].astype(str)).groupby('customer_id')['Diff'].agg(','.join)
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
Name: Diff, dtype: object
First, sort the data by customer id and order date.
Make sure the column is a proper datetime by calling df['Order_date'] = pd.to_datetime(df['Order_date'])
df.sort_values(['customer_id','Order_date'],inplace=True)
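# requires: import numpy as np
# group_keys=False keeps the original row index so the result aligns with df
df["days"] = df.groupby("customer_id", group_keys=False)["Order_date"].apply(
lambda x: (x - x.shift()) / np.timedelta64(1, "D")
)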
print(df)
customer_id Order_date days
4 1 2015-01-10 NaN
0 1 2015-01-16 6.0
1 1 2015-01-19 3.0
2 2 2014-12-21 NaN
3 2 2015-01-10 20.0
6 3 2017-03-04 NaN
5 3 2018-01-18 320.0
9 3 2019-02-03 381.0
8 4 2010-01-01 NaN
7 4 2019-11-05 3595.0
Then you can do a simple agg, but you'll need to convert the values to strings.
df.dropna().groupby("customer_id")["days"].agg(
lambda x: ",".join(x.astype(str))
).to_frame()
days
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
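If the comma-separated output should contain whole numbers ("6,3" rather than "6.0,3.0"), one option is sketched below. It is a variation on the answers above, not part of them, and it assumes the differences are always whole days.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 1, 3, 3, 4, 4, 3],
    "Order_date": ["2015-01-16", "2015-01-19", "2014-12-21", "2015-01-10",
                   "2015-01-10", "2018-01-18", "2017-03-04", "2019-11-05",
                   "2010-01-01", "2019-02-03"],
})
df["Order_date"] = pd.to_datetime(df["Order_date"])
df = df.sort_values(["customer_id", "Order_date"])

# Days since the same customer's previous order (NaN for the first order).
df["days"] = df.groupby("customer_id")["Order_date"].diff().dt.days

# Drop the first order of each customer, then join the gaps as integers.
result = (df.dropna(subset=["days"])
            .groupby("customer_id")["days"]
            .agg(lambda s: ",".join(str(int(d)) for d in s)))

print(result)
```

Customers with a single order have no gaps and therefore do not appear in `result`.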
This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# date field a datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty dates"? I have tried to reindex on a date_range and a period_range, and I have tried to merge another index. I have tried all sorts of things all day, and I have read a lot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindex the data by date_range(start='2018-01-01', periods=12, freq='M')).
(Ideally, I would want the months transposed across the top as columns, by year.)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex:
s=df.groupby([df['account_id'],df.index.year,df.index.month]).sum()
idx=pd.MultiIndex.from_product([s.index.levels[0],s.index.levels[1],list(range(1,13))])
s=s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN
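The question also asks for the months across the top as columns. A rough sketch of how the reindexed result could be pivoted with `unstack` follows; the sample `data` records and the level names `year` and `month` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical records matching the frame shown in the question.
data = [
    {"date": "2018-01-01", "account_id": 1, "amount": 100.0},
    {"date": "2018-01-01", "account_id": 1, "amount": 50.0},
    {"date": "2018-06-01", "account_id": 1, "amount": 200.0},
    {"date": "2018-07-01", "account_id": 2, "amount": 100.0},
    {"date": "2018-10-01", "account_id": 2, "amount": 200.0},
]
df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# Sum per account, year and month.
s = df.groupby(["account_id", df.index.year, df.index.month])["amount"].sum()
s.index.names = ["account_id", "year", "month"]

# Reindex so that every account/year has all twelve months.
idx = pd.MultiIndex.from_product(
    [s.index.levels[0], s.index.levels[1], range(1, 13)],
    names=["account_id", "year", "month"],
)
s = s.reindex(idx)

# Move the month level to the columns: one row per account and year.
wide = s.unstack("month")
print(wide)
```

Appending `.fillna(0)` would show zeros instead of NaN for months without transactions.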
I have generated a DataFrame with pandas:
df_output = pd.DataFrame(columns=["id", "Payout date", "Amount"])
The 'Payout date' column holds a datetime and 'Amount' holds a float. I'm taking the values for each row from a CSV:
df=pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False)
but when I assign the values:
df_output.loc[df_output['id'] == index, 'Payout date'].iloc[0]=(parsed_date)
pay=payments.get()
ref=refunds.get()
df_output.loc[df_output['id'] == index, 'Amount'].iloc[0]=(pay+ref-for_next_day)
and then print the result, only the id column comes out correctly: 'Payout date' shows NaT and 'Amount' shows NaN, even when casting the values to floats or using
df_output['Amount']=pd.to_numeric(df_output['Amount'])
df_output['Payout date'] = pd.to_datetime(df_output['Payout date'])
I've also tried casting the values before passing them to the DataFrame, with no luck, so what I'm getting is this:
id Payout date Amount
1 NaT NaN
2 NaT NaN
3 NaT NaN
4 NaT NaN
5 NaT NaN
Instead, I'm looking for something like this:
id Payout date Amount
1 2019-03-11 3.2
2 2019-03-11 3.2
3 2019-03-11 3.2
4 2019-03-11 3.2
5 2019-03-11 3.2
EDIT
print(df_output.head(5))
print(df.head(5))
id Payout date Amount
1 NaT NaN
2 NaT NaN
3 NaT NaN
4 NaT NaN
5 NaT NaN
id Created (UTC) Type Currency Amount Fee Net
1 2016-07-27 13:28:00 charge mxn 672.0 31.54 640.46
2 2016-07-27 15:21:00 charge mxn 146.0 9.58 136.42
3 2016-07-27 16:18:00 charge mxn 200.0 11.83 188.17
4 2016-07-27 17:18:00 charge mxn 146.0 9.58 136.42
5 2016-07-27 18:11:00 charge mxn 286.0 15.43 270.57
Probably the easiest thing to do would be just to rename the columns of the dataframe you're loading:
df = pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False, index_col='id')
df.rename(columns={"Created (UTC)": 'Payout Date'}, inplace=True)
df_output = df[['Payout Date', 'Amount']]
EDIT:
If you're trying to assign a column from one dataframe to a column of another, just do this:
df_output['Amount'] = df['Amount']
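A likely reason the original assignments had no effect is that df_output.loc[mask, col].iloc[0] = value is chained indexing: the .loc[...] part returns a new object, so the value is written to a temporary copy rather than to df_output. Below is a minimal sketch of assigning through a single .loc call instead; the sample frame and the values of `index`, `parsed_date` and `amount` are placeholders invented for the example.

```python
import pandas as pd

# Placeholder frame with empty payout dates and amounts.
df_output = pd.DataFrame({
    "id": [1, 2, 3],
    "Payout date": pd.Series([pd.NaT] * 3, dtype="datetime64[ns]"),
    "Amount": [float("nan")] * 3,
})

index = 2                                  # hypothetical id being processed
parsed_date = pd.Timestamp("2019-03-11")   # hypothetical parsed date
amount = 3.2                               # stands in for pay + ref - for_next_day

# One .loc call with a row mask and a column label writes into df_output itself.
mask = df_output["id"] == index
df_output.loc[mask, "Payout date"] = parsed_date
df_output.loc[mask, "Amount"] = amount

print(df_output)
```

If several rows share the same id, this sets all of them, which differs from the `.iloc[0]` intent in the question.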