Pandas dataframe puts NaN and NaT - python

What I'm doing is I have generated a DataFrame with pandas:
df_output = pd.DataFrame(columns={"id","Payout date", "Amount"}
In column 'Payout date' is a datetime, and in 'Amount' a float. I'm taking the values for each row from a csv:
df=pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False)
but when I assign the values:
df_output.loc[df_output['id'] == index, 'Payout date'].iloc[0]=(parsed_date)
pay=payments.get()
ref=refunds.get()
df_output.loc[df_output['id'] == index, 'Amount'].iloc[0]=(pay+ref-for_next_day)
and I print it the columns 'Payout date' and 'Amount' it only prints the id correctly, and NaT for the payouts and NaN for the amount, even when casting them to floats, or using
df_output['Amount']=pd.to_numeric(df_output['Amount'])
df_output['Payout date'] = pd.to_datetime(df_output['Payout date'])
I've also tried casting the values before passing them to the DataFrame, with no luck, so what I'm getting is this:
id Payout date Amount
1 NaT NaN
2 NaT NaN
3 NaT NaN
4 NaT NaN
5 NaT NaN
Instead, I'm looking for something like this:
id Payout date Amount
1 2019-03-11 3.2
2 2019-03-11 3.2
3 2019-03-11 3.2
4 2019-03-11 3.2
5 2019-03-11 3.2
EDIT
print(df_output.head(5))
print(df.head(5))
id Payout date Amount
1 NaT NaN
2 NaT NaN
3 NaT NaN
4 NaT NaN
5 NaT NaN
id Created (UTC) Type Currency Amount Fee Net
1 2016-07-27 13:28:00 charge mxn 672.0 31.54 640.46
2 2016-07-27 15:21:00 charge mxn 146.0 9.58 136.42
3 2016-07-27 16:18:00 charge mxn 200.0 11.83 188.17
4 2016-07-27 17:18:00 charge mxn 146.0 9.58 136.42
5 2016-07-27 18:11:00 charge mxn 286.0 15.43 270.57

Probably the easiest thing to do would be just to rename the columns of the dataframe you're loading:
df = pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False, index_col='id')
df.columns(rename={"Created (UTC)":'Payout Date'}, inplace=True)
df_output = df[['Payout Date', 'Amount']]
EDIT:
if you're trying to assign a column in one dataframe to the column of another just do this:
output_df['Amount'] = df['Amount']

Related

How to create a new column with the last value of the previous year

I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
based on this data frame I wanted the MARK from the previous year, I managed to acquire the maximum COTA but I wanted the last one, I used .max() and I thought I could get it with .last() but it didn't work.
follow the example of my code.
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['Found', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['Found', 'LastYear'])
print (df)
Found Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year

How to create a new metric column based on 1 year lag of a date column?

I would like create a new column which references a date column - 1 year and displays the corresponding values:
import pandas as pd
Input DF
df = pd.DataFrame({'consumption': [0,1,3,5], 'date':[pd.to_datetime('2017-04-01'),
pd.to_datetime('2017-04-02'),
pd.to_datetime('2018-04-01'),
pd.to_datetime('2018-04-02')]})
>>> df
consumption date
0 2017-04-01
1 2017-04-02
3 2018-04-01
5 2018-04-02
Expected DF
df = pd.DataFrame({'consumption': [0,1,3,5],
'prev_year_consumption': [np.NAN,np.NAN,0,1],
'date':[pd.to_datetime('2017-04-01'),
pd.to_datetime('2017-04-02'),
pd.to_datetime('2018-04-01'),
pd.to_datetime('2018-04-02')]})
>>> df
consumption prev_year_consumption date
0 NAN 2017-04-01
1 NAN 2017-04-02
3 0 2018-04-01
5 1 2018-04-02
So prev_year_consumption is are simply values from the consumption column where 1 year is subtracted from date dynamically.
in SQL I would probably do something like
SELECT df_past.consumption as prev_year_consumption, df_current.consumption
FROM df as df_current
LEFT JOIN ON df df_past ON year(df_current.date) = year(df_past.date) - 1
Appreciate any hints
The notation in pandas is similar. We are still doing a self merge however we need to specify that the right_on (or left_on) has a DateOffset of 1 year:
new_df = df.merge(
df,
left_on='date',
right_on=df['date'] + pd.offsets.DateOffset(years=1),
how='left'
)
new_df:
date consumption_x date_x consumption_y date_y
0 2017-04-01 0 2017-04-01 NaN NaT
1 2017-04-02 1 2017-04-02 NaN NaT
2 2018-04-01 3 2018-04-01 0.0 2017-04-01
3 2018-04-02 5 2018-04-02 1.0 2017-04-02
We can further drop and rename columns to get exact output:
new_df = df.merge(
df,
left_on='date',
right_on=df['date'] + pd.offsets.DateOffset(years=1),
how='left'
).drop(columns=['date_x', 'date_y']).rename(columns={
'consumption_y': 'prev_year_consumption'
})
new_df:
date consumption_x prev_year_consumption
0 2017-04-01 0 NaN
1 2017-04-02 1 NaN
2 2018-04-01 3 0.0
3 2018-04-02 5 1.0

pandas: given a start and end date, add a column for each day in between, then add values?

This is my data:
df = pd.DataFrame([
{start_date: '2019/12/01', end_date: '2019/12/05', spend: 10000, campaign_id: 1}
{start_date: '2019/12/05', end_date: '2019/12/09', spend: 50000, campaign_id: 2}
{start_date: '2019/12/01', end_date: '', spend: 10000, campaign_id: 3}
{start_date: '2019/12/01', end_date: '2019/12/01', spend: 50, campaign_id: 4}
]);
I need to add a column to each row for each day since 2019/12/01, and calculate the spend on that campaign that day, which I'll get by dividing the spend on the campaign by the total number of days it was active.
So here I'd add a column for each day between 1 December and today (10 December). For row 1, the content of the five columns for 1 Dec to 5 Dec would be 2000, then for the six ocolumns from 5 Dec to 10 Dec it would be zero.
I know pandas is well-designed for this kind of problem, but I have no idea where to start!
Doesn't seem like a straight forward task to me. But first convert your date columns if you haven't already:
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
Then create a helper function for resampling:
def resampler(data, daterange):
temp = (data.set_index('start_date').groupby('campaign_id')
.apply(daterange)
.drop("campaign_id",axis=1)
.reset_index().rename(columns={"level_1":"start_date"}))
return temp
Now its a 3 step process. First resample your data according to end_date of each group:
df1 = resampler(df, lambda d: d.reindex(pd.date_range(min(d.index),max(d["end_date"]),freq="D")) if d["end_date"].notnull().all() else d)
df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))
With the average values calculated, resample again to current date:
dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")
df1 = resampler(df1,lambda d: d.reindex(dates))
Finally transpose your dataframe:
df1 = pd.concat([df1.drop("end_date",axis=1).set_index(["campaign_id","start_date"]).unstack(),
df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates,"end_date"]
print (df1)
#
2019-12-01 00:00:00 2019-12-02 00:00:00 2019-12-03 00:00:00 2019-12-04 00:00:00 2019-12-05 00:00:00 2019-12-06 00:00:00 2019-12-07 00:00:00 2019-12-08 00:00:00 2019-12-09 00:00:00 2019-12-10 00:00:00 end_date
campaign_id
1 2000.0 2000.0 2000.0 2000.0 2000.0 NaN NaN NaN NaN NaN 2019-12-05
2 NaN NaN NaN NaN 10000.0 10000.0 10000.0 10000.0 10000.0 NaN 2019-12-09
3 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2019-12-01

Leading and Trailing Padding Dates in Pandas DataFrame

This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# date field a datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty dates". I have tried to reindex on a date_range and period_range, I have tried to merge another index. I have tried all sorts of things all day, and I have read alot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindex the data by date_range(start='2018-01-01', period=12, freq='M')
(Ideally I would want the month to be transposed by year across the top as columns)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex
s=df.groupby([df['account_id'],df.index.year,df.index.month]).sum()
idx=pd.MultiIndex.from_product([s.index.levels[0],s.index.levels[1],list(range(1,13))])
s=s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN

Get Year & Month as string from Date with NA values

Here is current df:
ID Date
1 3/29/2017
2
3 11/5/2015
4
5 2/28/2017
I am trying to get year + month as a string in the new column. And this is my code:
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["yyyy_mm"] = df["Year"].map(str) + "-" + df["Month"].map(str)
The issue is when I extract the year and month from the date, it will return the float type.
ID Date Year Month yyyy_mm I hope to get this
1 3/29/2017 2017.0 3.0 2017.0-3.0 2017-3
2 nan-nan
3 11/5/2015 2015.0 11.0 2015.0-11.0 2015-11
4 nan-nan
5 2/28/2017 2017.0 2.0 2017.0-2.0 2017-2
I tried to use df["Date"].dt.year.astype(int) to convert it to int, so that there is no .0, but I got this error: Cannot convert non-finite values (NA or inf) to integer. Because there NAN in column.
I don't want to fillna for all the year and month with 0 or something else, i just want to keep them empty since date is empty at that row.
You should perform string conversion directly from Date using pd.Series.dt.strftime.
This not only ensures NaT rows remain NaT, but strings are better formatted, e.g. zero-padding for months.
df["yyyy_mm"] = df['Date'].dt.strftime('%Y-%m')
print(df)
ID Date Year Month yyyy_mm
0 1 2017-03-29 2017.0 3.0 2017-03
1 2 NaT NaN NaN NaT
2 3 2015-11-05 2015.0 11.0 2015-11
3 4 NaT NaN NaN NaT
4 5 2017-02-28 2017.0 2.0 2017-02

Categories