So, I have a dataframe like this one:
date ID value
2018-01-01 A 10
2018-02-01 A 11
2018-04-01 A 13
2017-08-01 B 20
2017-10-01 B 21
2017-11-01 B 23
Each group can have very different dates, and there are about 400k groups. So what I want to do is fill in the missing dates of each group in an efficient way, so it looks like this:
date ID value
2018-01-01 A 10
2018-02-01 A 11
2018-03-01 A nan
2018-04-01 A 13
2017-08-01 B 20
2017-09-01 B nan
2017-10-01 B 21
2017-11-01 B 23
I've tried two approaches:
df2 = df.groupby('ID').apply(lambda x: x.set_index('date').resample('D').pad())
And also:
df2= df.set_index(['date','ID']).unstack().stack(dropna=False).reset_index()
df2= df2.sort_values(by=['ID','date']).reset_index(drop=True)
df2= df2[df2.groupby('ID').value.ffill().notna()]
df2 = df2[df2.groupby('ID').value.bfill().notna()]
The first one is very slow, since it uses apply. I guess I could use something other than pad so I get nan instead of the previous value, but I'm not sure that would improve the performance enough; I waited around 15 minutes and it didn't finish running.
The second one fills from the first date in the whole dataframe to the last one, for every group, which produces a massive dataframe; afterwards I drop all the leading and trailing nan it generates. This is quite a bit faster than the first option, but still doesn't seem like the best one. Is there a better way to do this that scales to big dataframes?
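One possibly faster variant (a minimal sketch, not a definitive answer: it assumes the data is monthly, hence the 'MS' frequency, and that only the gaps inside each ID's own date range should be filled) is to let the built-in groupby/resample path do the work instead of a Python-level lambda:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df2 = (df.set_index('date')
         .groupby('ID')['value']
         .resample('MS')   # month-start frequency, assumed from the sample data
         .asfreq()         # insert missing periods as NaN instead of padding
         .reset_index())
This still works group by group internally, so whether it is fast enough at ~400k groups would need to be measured.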
I have a pandas data frame with panel data, i.e. data for multiple customers over a timeframe. I want to sample (for bootstrapping) a continuous three-month period (I always want to get full months) of a random customer, 90 times.
I have googled a bit and found several sampling techniques, but none that covers sampling based on three continuous months.
I was considering just making a list of all the month names and sampling three consecutive ones (although I'm not sure how to do "consecutive"). But how would I then be able to e.g. pick Nov21-Dec21-Jan22?
Would appreciate the help a lot!
import pandas as pd
date_range = pd.date_range("2020-01-01", "2022-01-01")
df = pd.DataFrame({"value":3}, index=date_range)
df.groupby(df.index.quarter).sample(5)
This would output:
Out[12]:
value
2021-01-14 3
2021-02-27 3
2020-01-20 3
2021-02-03 3
2021-02-19 3
2021-04-27 3
2021-06-29 3
2021-04-12 3
2020-06-24 3
2020-06-05 3
2021-07-30 3
2020-08-29 3
2021-07-03 3
2020-07-17 3
2020-09-12 3
2020-12-22 3
2021-12-13 3
2021-11-29 3
2021-12-19 3
2020-10-18 3
It selected 5 sample values from each quarter group.
From there you can format the date column (index) so it shows the month as text.
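As a hedged sketch of that last point (the '%b%y' label format and the consecutive-month slicing are my assumptions, not part of the answer above), you could label the months as text and pull three consecutive calendar months starting from a randomly chosen month:
import random

labels = df.index.strftime('%b%y')   # month-as-text labels, e.g. "Nov21"

# pick a random start month, then slice the three months that follow it
month_starts = list(df.index.to_period('M').unique().to_timestamp())
start = random.choice(month_starts[:-2])   # leave room for a full three-month window
end = start + pd.DateOffset(months=3) - pd.Timedelta(days=1)
window = df.loc[start:end]
Picking a random customer first (e.g. filtering the frame to one customer before this step) would cover the bootstrap over customers as well.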
In my code I've generated a range of dates using pd.date_range in an effort to compare it to a column of dates read in from Excel using pandas. The generated range of dates is referred to as "all_dates".
all_dates=pd.date_range(start='1998-12-31', end='2020-06-23')
for i, date in enumerate(period):  # where 'Period' is the column of Excel dates
    if date == all_dates[i]:  # loop until date from Excel doesn't match date from generated dates
        continue
    else:
        missing_dates_stock.append(i)  # keep list of locations where dates are missing
        stock_data.insert(i, "NaN")  # insert 'NaN' where missing date is found
This results in TypeError: argument of type 'Timestamp' is not iterable. How can I make the data types match such that I can iterate and compare them? Apologies as I am not very fluent in Python.
I think you are trying to create a NaN row if the date does not exist in the Excel file.
Here's a way to do it using df.merge.
I am creating df1 to simulate the Excel file. It has two columns, sale_dt and sale_amt. If a sale_dt does not exist, we want to create a separate row with NaN in the columns. To simulate this, I am creating a date range from 1998-12-31 through 2020-06-23, skipping 4 days in between, so we have a dataframe with 4 missing dates between every two rows. The solution should create 4 dummy rows with the correct date, in ascending order.
import pandas as pd
import random
#create the sales dataframe with missing dates
df1 = pd.DataFrame({'sale_dt':pd.date_range(start='1998-12-31', end='2020-06-23', freq='5D'),
                    'sale_amt':random.sample(range(1, 2000), 1570)
                   })
print (df1)
#now create a dataframe with all the dates between '1998-12-31' and '2020-06-23'
df2 = pd.DataFrame({'date':pd.date_range(start='1998-12-31', end='2020-06-23', freq='D')})
print (df2)
#now merge both dataframes with outer join so you get all the rows.
#i am also sorting the data in ascending order so you can see the dates
#also dropping the original sale_dt column and renaming the date column as sale_dt
#then resetting index
df1 = (df1.merge(df2,left_on='sale_dt',right_on='date',how='outer')
          .drop(columns=['sale_dt'])
          .rename(columns={'date':'sale_dt'})
          .sort_values(by='sale_dt')
          .reset_index(drop=True))
print (df1.head(20))
The original dataframe was:
sale_dt sale_amt
0 1998-12-31 1988
1 1999-01-05 1746
2 1999-01-10 1395
3 1999-01-15 538
4 1999-01-20 1186
... ... ...
1565 2020-06-03 560
1566 2020-06-08 615
1567 2020-06-13 858
1568 2020-06-18 298
1569 2020-06-23 1427
The output of this will be (first 20 rows):
sale_amt sale_dt
0 1988.0 1998-12-31
1 NaN 1999-01-01
2 NaN 1999-01-02
3 NaN 1999-01-03
4 NaN 1999-01-04
5 1746.0 1999-01-05
6 NaN 1999-01-06
7 NaN 1999-01-07
8 NaN 1999-01-08
9 NaN 1999-01-09
10 1395.0 1999-01-10
11 NaN 1999-01-11
12 NaN 1999-01-12
13 NaN 1999-01-13
14 NaN 1999-01-14
15 538.0 1999-01-15
16 NaN 1999-01-16
17 NaN 1999-01-17
18 NaN 1999-01-18
19 NaN 1999-01-19
I need to find the number of days between a request date and its most recent offer date for each apartment number. My example dataframe looks like the first 3 columns below and I'm trying to figure out how to calculate the 'days_since_offer' column. The apartment and or_date columns are already sorted.
apartment offer_req_type or_date days_since_offer
A request 12/4/2019 n/a
A request 12/30/2019 n/a
A offer 3/4/2020 0
A request 4/2/2020 29
A request 6/4/2020 92
A request 8/4/2020 153
A offer 12/4/2020 0
A request 1/1/2021 28
B offer 1/1/2019 0
B request 8/1/2019 212
B offer 10/1/2019 0
B request 1/1/2020 92
B request 9/1/2020 244
B offer 1/1/2021 0
B request 1/25/2021 24
I tried to create a new function which sort of gives me what I want if I pass it the dates for a single apartment. When I use the apply function, though, it gives me an error: "SpecificationError: Function names must be unique if there is no new column names assigned".
def func(attr, date_ser):
    offer_dt = date(1900,1,1)
    lapse_days = []
    for row in range(len(attr)):
        if attr[row] == 'offer':
            offer_dt = date_ser[row]
            lapse_days.append(-1)
        else:
            lapse_days.append(date_ser[row]-offer_dt)
    print(lapse_days)
    return lapse_days
df['days_since_offer'] = df.apply(func(df['offer_req_type'], df['or_date']))
I also tried to use groupby + diff functions like this and this but it's not the answer that I need:
df.groupby('offer_req_type').or_date.diff().dt.days
I also looked into using the shift method, but I'm not necessarily looking at sequential rows every time.
Any pointers on why my function is failing or if there is a better way to get the date differences that I need using a groupby method would be helpful!
I have played around and I am certainly not claiming this is the best way. I used df.apply() (edit: see below for an alternative without df.apply()).
import numpy as np
import pandas as pd
# SNIP: removed the df creation part for brevity.
df["or_date"] = pd.to_datetime(df["or_date"])
df.drop("days_since_offer", inplace=True, axis="columns")
def get_last_offer(row: pd.Series, df: pd.DataFrame):
    if row["offer_req_type"] == "offer":
        return
    temp_df = df[(df.apartment == row['apartment']) & (df.offer_req_type == "offer") & (df.or_date < row["or_date"])]
    if temp_df.empty:
        return
    else:
        x = row["or_date"]
        y = temp_df.iloc[-1:, -1:]["or_date"].values[0]
        return x - y
df["days_since_offer"] = df.apply(lambda row: get_last_offer(row, df), axis=1)
print(df)
This returns the following df:
0 A request 2019-12-04 NaT
1 A request 2019-12-30 NaT
2 A offer 2020-03-04 NaT
3 A request 2020-04-02 29 days
4 A request 2020-06-04 92 days
5 A request 2020-08-04 153 days
6 A offer 2020-12-04 NaT
7 A request 2021-01-01 28 days
8 B offer 2019-01-01 NaT
9 B request 2019-08-01 212 days
10 B offer 2019-10-01 NaT
11 B request 2020-01-01 92 days
12 B request 2020-09-01 336 days
13 B offer 2021-01-01 NaT
14 B request 2021-01-25 24 days
EDIT
I wondered whether I could find a way without using df.apply(). I ended up with the following lines (replace everything from the line def get_last_offer() onwards in the previous code bit):
df["offer_dates"] = np.where(df['offer_req_type'] == 'offer', df['or_date'], pd.NaT)
# OLD: df["offer_dates"].ffill(inplace=True)
df["offer_dates"] = df.groupby("apartment")["offer_dates"].ffill()
df["diff"] = pd.to_datetime(df["or_date"]) - pd.to_datetime(df["offer_dates"])
df.drop("offer_dates", inplace=True, axis="columns")
This creates a helper column (df['offer_dates']) which is filled for every row that has offer_req_type as 'offer'. It is then forward-filled, meaning every NaT value is replaced with the previous valid value. Then we calculate the df['diff'] column, with the exact same result. I like this bit better because it is cleaner and it has 4 lines rather than 12 lines of code :)
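One small addition on top of this (not part of the original answer): if plain integer day counts are wanted, as in the question's expected column, the Timedelta values can be converted with .dt.days:
df["days_since_offer"] = df["diff"].dt.days   # Timedelta -> float days (NaN where there is no prior offer)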
I was just curious as to what's going on here. I have 13 dataframes that look something like this:
df1:
time val
00:00 1
00:01 2
00:02 5
00:03 8
df2:
time val
00:04 5
00:05 12
00:06 4
df3:
time val
00:07 8
00:08 24
00:09 3
and so on. As you can see, each dataframe continues the time exactly where the previous one left off, which means ideally I would like them in one dataframe for simplicity's sake. Note that the example ones I used are significantly smaller than my actual ones. However, upon using the following:
df = pd.concat([pd.read_csv(i, usecols=[0,1,2]) for i in sample_files])
Where these 13 dataframes are produced through that list comprehension, I get a very strange result. It's as if I had set axis=1 inside the pd.concat() function. If I try to reference a column, say val:
df['val']
Pandas returns something that looks like this:
0 1
1 2
...
2 5
3 8
Name: val, Length: 4, dtype: float64
In this output it does not specify what happened to the other 11 val columns. If I then reference an index, as follows:
df['val'][0]
It returns:
0 1
0 5
0 8
Name: val, dtype: float64
which is the first index of each column. I am unsure as to why pandas is behaving like this, as I would imagine it just joins together columns with similar header names, but obviously this isn't the case.
If someone could explain this that would be great.
I believe your issue is that you are not resetting the index after concatenation and before selecting the data.
Try:
df = pd.concat([pd.read_csv(i, usecols=[0,1,2]) for i in sample_files])
df = df.reset_index(drop=True)
df['val'][0]
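Alternatively (my suggestion, not part of the answer above), pd.concat can hand back a fresh 0..n-1 row index directly through its ignore_index argument:
import pandas as pd

df = pd.concat(
    [pd.read_csv(i, usecols=[0, 1, 2]) for i in sample_files],
    ignore_index=True,   # discard the per-file indexes and renumber the rows
)
df['val'][0]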
I have the following data frames:
print(df_a)
mukey DI PI
0 100000 35 14
1 1000005 44 14
2 1000006 44 14
3 1000007 43 13
4 1000008 43 13
print(df_b)
mukey niccdcd
0 190236 4
1 190237 6
2 190238 7
3 190239 4
4 190240 7
When I try to join these data frames:
join_df = df_a.join(df_b, on='mukey', how='left')
I get the error:
*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')
Why is this so? The data frames do have common 'mukey' values.
Your error on the snippet of data you posted is a little cryptic: because both frames contain a 'mukey' column, the overlapping column names require you to supply a suffix for the left and right hand side (and since there are no common values in the snippet, the joined columns come back as NaN):
In [173]:
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
mukey_left DI PI mukey_right niccdcd
index
0 100000 35 14 NaN NaN
1 1000005 44 14 NaN NaN
2 1000006 44 14 NaN NaN
3 1000007 43 13 NaN NaN
4 1000008 43 13 NaN NaN
merge works because it doesn't have this restriction:
In [176]:
df_a.merge(df_b, on='mukey', how='left')
Out[176]:
mukey DI PI niccdcd
0 100000 35 14 NaN
1 1000005 44 14 NaN
2 1000006 44 14 NaN
3 1000007 43 13 NaN
4 1000008 43 13 NaN
The .join() function uses the index of the dataset passed as the argument, so you should call set_index on it, or use the .merge function instead.
Here are two examples that should work in your case:
join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')
or
join_df = df_a.merge(df_b, on='mukey', how='left')
This error indicates that the two tables have one or more columns with the same name.
The error message translates to: "I can see the same column in both tables but you haven't told me to rename either one before bringing them into the same table."
You either want to delete one of the columns before bringing it in from the other one, using del df['column name'], or use lsuffix to rename the original column, or rsuffix to rename the one that is being brought in.
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Anyone with the same error who doesn't want to provide a suffix can rename the columns instead. Also make sure the indexes of both DataFrames match in type and value if you don't want to provide the on='mukey' setting.
# rename example
df_a = df_a.rename(columns={'a_old': 'a_new', 'a2_old': 'a2_new'})
# set the index
df_a = df_a.set_index(['mukey'])
df_b = df_b.set_index(['mukey'])
df_a.join(df_b)
Mainly, join works exclusively on the index, not on the column names, so change the overlapping column names in the two dataframes (or align their indexes) and then try the join; they will be joined, otherwise this error is raised.