I'm trying to fill this DataFrame (df1) (I can start it with NaN or zero values):
27/05/2021 28/05/2021 29/05/2021 30/05/2021 31/05/2021 01/06/2021 02/06/2021 ...
Name1 Nan Nan Nan Nan Nan Nan Nan
Name2 Nan Nan Nan Nan Nan Nan Nan
Name3 Nan Nan Nan Nan Nan Nan Nan
Name4 Nan Nan Nan Nan Nan Nan Nan
According to the information in this DataFrame (df2):
Start1 End1 Dedication1 (h) Start2 End2 Dedication2 (h)
Name1 24/05/2021 31/05/2021 8 02/06/2021 10/07/2021 3
Name2 29/05/2021 31/05/2021 5 Nan Nan Nan
Name3 27/05/2021 01/06/2021 3 Nan Nan Nan
Name4 29/05/2021 07/08/2021 8 10/10/2021 10/12/2021 2
To get something like this (df3):
27/05/2021 28/05/2021 29/05/2021 30/05/2021 31/05/2021 01/06/2021 02/06/2021 ...
Name1 8 8 8 8 8 0 3
Name2 0 0 5 5 5 0 0
Name3 3 3 3 3 3 3 0
Name4 0 0 8 8 8 8 8
This is a schedule of working hours per day over several months. Both DataFrames will have the same index and the same number of rows.
Using the dates in df2, I need to fill the df1 values between each start day and end day with the dedication hours for that period.
I have tried loc over all rows with a lambda function to select columns by date, but I can't get the values filled between the dates. Perhaps I need several steps.
Thanks.
You could try this:
from datetime import datetime
import pandas as pd
# Setup (assumes the dedication columns in df2 are named "Dedication1" and "Dedication2")
limits = [("Start1", "End1", "Dedication1"), ("Start2", "End2", "Dedication2")]
df3 = df1.copy()
# Deal with NaN values
df3.fillna(0, inplace=True)
# Assign the result back: inplace fillna on a selected column may not update df2
df2["Start2"] = df2["Start2"].fillna("31/12/2099")
df2["End2"] = df2["End2"].fillna("31/12/2099")
df2["Dedication2"] = df2["Dedication2"].fillna(0)
# Iterate and fill df3
for i, row in df1.iterrows():
    for col in df1.columns:
        for start, end, dedication in limits:
            # True when the column date lies within [start, end] for this row
            mask = (
                datetime.strptime(df2.loc[i, start], "%d/%m/%Y")
                <= datetime.strptime(col, "%d/%m/%Y")
                <= datetime.strptime(df2.loc[i, end], "%d/%m/%Y")
            )
            if mask:
                df3.loc[i, col] = df2.loc[i, dedication]
# Format df3
df3 = df3.astype("int")
print(df3)
# Outputs
27/05/2021 28/05/2021 29/05/2021 ... 31/05/2021 01/06/2021 02/06/2021
Name1 8 8 8 ... 8 0 3
Name2 0 0 5 ... 5 0 0
Name3 3 3 3 ... 3 3 0
Name4 0 0 8 ... 8 8 8
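For a schedule spanning several months, the triple loop above can get slow. Here is a minimal vectorized sketch of the same idea, assuming the dedication columns are named "Dedication1" and "Dedication2" (as in the loop version) and rebuilding the sample data from the question. It compares every row's start/end against every column date at once with NumPy broadcasting; missing periods become NaT, which never compares as inside a range.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (column names are an assumption)
df2 = pd.DataFrame(
    {
        "Start1": ["24/05/2021", "29/05/2021", "27/05/2021", "29/05/2021"],
        "End1": ["31/05/2021", "31/05/2021", "01/06/2021", "07/08/2021"],
        "Dedication1": [8, 5, 3, 8],
        "Start2": ["02/06/2021", None, None, "10/10/2021"],
        "End2": ["10/07/2021", None, None, "10/12/2021"],
        "Dedication2": [3, None, None, 2],
    },
    index=["Name1", "Name2", "Name3", "Name4"],
)
dates = pd.date_range("2021-05-27", "2021-06-02", freq="D")

values = np.zeros((len(df2), len(dates)))
for start, end, hours in [("Start1", "End1", "Dedication1"),
                          ("Start2", "End2", "Dedication2")]:
    s = pd.to_datetime(df2[start], format="%d/%m/%Y").to_numpy()[:, None]
    e = pd.to_datetime(df2[end], format="%d/%m/%Y").to_numpy()[:, None]
    # inside[i, j] is True when date j falls within [start, end] of row i
    inside = (s <= dates.to_numpy()) & (dates.to_numpy() <= e)
    values = np.where(inside, df2[hours].to_numpy()[:, None], values)

df3 = pd.DataFrame(values, index=df2.index,
                   columns=dates.strftime("%d/%m/%Y")).astype(int)
print(df3)
```

If periods overlap for the same name, the later period wins, which matches the behaviour of the loop.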
I have the following dataframe:
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
>>> 9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00 ....
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d ...), the mean value between specific hours. The hours range from 9 to 15, and I want to group by period, for example to calculate the mean value between 09:00:00 and 11:00:00, between 11 and 12, between 13 and 15 (or any period I decide on).
I first tried to convert the column names to datetime format, as I thought that would make this easier:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then I got column names with a fake date, "1900-01-01 09:00:00"...
And also, the columns headers type was object, so I felt a bit lost...
My end goal is to be able to calculate new columns with the mean value for each row only between columns that fall inside the defined time period (e.g 9-11 etc...)
If you need regular periods, e.g. every 2 hours:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
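As far as I know, newer pandas releases deprecate both the axis=1 argument of resample and the uppercase '2H' alias, so please check this against your installed version. A minimal sketch that avoids both: transpose so the times become the index, resample with a Timedelta, and transpose back. The sample frame is rebuilt from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5], [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7], [1, 7, 8, 4, 3, 4, 3]]),
    columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
             '09:20:00', '09:25:00', '09:30:00'],
)
df.columns = pd.to_datetime(df.columns, format="%H:%M:%S")

# Resample along the (transposed) index in 2-hour bins, then transpose back
two_hourly = df.T.resample(pd.Timedelta(hours=2)).mean().T
print(two_hourly)
```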
If you need custom periods, you can use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
Finally, take the mean per column group. The NaN columns appear because the columns are categorical, so unused categories are kept:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
EDIT: The same solution with timedeltas, because the format is HH:MM:SS:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
# missing values because no columns fall between 11:00:00 and 12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
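If your pandas version no longer accepts the level argument in DataFrame.mean (I believe it was removed in 2.0), here is a minimal sketch that gets the same per-period means with plain boolean column masks. The periods dictionary and its labels are hypothetical names for illustration; each period includes its start hour and excludes its end hour, and periods with no matching columns are skipped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5], [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7], [1, 7, 8, 4, 3, 4, 3]]),
    columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
             '09:20:00', '09:25:00', '09:30:00'],
)

# Hypothetical periods: label -> (start hour inclusive, end hour exclusive)
periods = {'9-11': (9, 11), '11-12': (11, 12), '13-15': (13, 15)}

# Hour of day of each column as a float, e.g. 9.5 for 09:30:00
hours = pd.to_timedelta(df.columns).total_seconds().to_numpy() / 3600

means = pd.DataFrame(index=df.index)
for label, (lo, hi) in periods.items():
    in_period = (hours >= lo) & (hours < hi)
    if in_period.any():  # skip periods without any column
        means[label] = df.loc[:, in_period].mean(axis=1)
print(means)
```

The result can be joined back to the original frame with df.join(means) if you want the means as new columns.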
I am going to show you my code and the results after executing it.
First, import the libraries and create the DataFrame:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],
[1,7,8,4,3,4,3]]),
columns=
['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
It would be nice to create a class that defines what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end
    def __repr__(self):
        return self.initial + ' -- ' + self.end
With .loc we can get a sub-DataFrame containing only the columns I want:
def get_colMean(df, period):
    df2 = df.loc[:, period.initial:period.end]
    # take the mean of the selected columns (df2), not of the whole frame
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, columns=[col_name])
    return pd_colMean
Finally, we use .join in order to add the column with the means to the original DataFrame:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
I am going to show you my results:
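As a quick check, the helpers can be exercised like this. This is only a sketch: the period bounds '9:00:00' and '09:15:00' are column labels picked for illustration, and the label slice in .loc includes both ends.

```python
import numpy as np
import pandas as pd

class Period:
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end

def get_colMean(df, period):
    # label-based slice: both the initial and the end column are included
    sub = df.loc[:, period.initial:period.end]
    col_name = 'mean_' + period.initial + '--' + period.end
    return sub.mean(axis=1).to_frame(col_name)

df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5], [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7], [1, 7, 8, 4, 3, 4, 3]]),
    columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
             '09:20:00', '09:25:00', '09:30:00'],
)

period = Period('9:00:00', '09:15:00')
result = df.join(get_colMean(df, period))
print(result)
```

Note that the slice relies on the column labels being written exactly as they appear in the frame ('9:00:00' is not the same label as '09:00:00').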
Input
df1
id date v1
a 2020-1-1 1
a 2020-1-2 2
b 2020-1-4 10
b 2020-1-22 30
c 2020-2-4 10
c 2020-2-22 30
df2
id date v1
a 2020-1-3 1
b 2020-1-7 12
b 2020-1-22 13
c 2020-2-10 15
c 2020-2-22 60
Goal
id date v1 v2
a 2020-1-1 1 0
a 2020-1-2 2 0
a 2020-1-3 0 1
b 2020-1-4 10 0
b 2020-1-7 0 12
b 2020-1-22 30 13
c 2020-2-4 10 0
c 2020-2-10 0 15
c 2020-2-22 30 60
The details:
There are only two DataFrames, and within each id the dates are unique.
Combine the two DataFrames into df by id, so that each id contains all date values from both DataFrames.
The merged DataFrame contains v1 and v2 columns. When a date appears in both df1 and df2, both original values are kept; when a date appears in only one of them, the original value is kept and the other column is 0.
Try
I have searched the merge and concat documentation but could not find the answer.
First convert the date columns to datetimes with to_datetime so that they sort correctly. Then use DataFrame.merge with an outer join, renaming column v1 of df2 to v2 to avoid v1_x and v1_y columns in the output. Finally, replace missing values with DataFrame.fillna and sort the output with DataFrame.sort_values:
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
df = (df1.merge(df2.rename(columns={'v1':'v2'}), on=['id','date'], how='outer')
         .fillna(0)
         .sort_values(['id','date']))
print (df)
id date v1 v2
0 a 2020-01-01 1.0 0.0
1 a 2020-01-02 2.0 0.0
6 a 2020-01-03 0.0 1.0
2 b 2020-01-04 10.0 0.0
7 b 2020-01-07 0.0 12.0
3 b 2020-01-22 30.0 13.0
4 c 2020-02-04 10.0 0.0
8 c 2020-02-10 0.0 15.0
5 c 2020-02-22 30.0 60.0
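The outer join introduces NaN, so v1 and v2 come back as floats. If you want integers as in the desired output, a cast after fillna should do it. A minimal self-contained sketch, with the sample frames rebuilt from the question:

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': list('aabbcc'),
    'date': ['2020-1-1', '2020-1-2', '2020-1-4', '2020-1-22', '2020-2-4', '2020-2-22'],
    'v1': [1, 2, 10, 30, 10, 30],
})
df2 = pd.DataFrame({
    'id': list('abbcc'),
    'date': ['2020-1-3', '2020-1-7', '2020-1-22', '2020-2-10', '2020-2-22'],
    'v1': [1, 12, 13, 15, 60],
})
for frame in (df1, df2):
    frame['date'] = pd.to_datetime(frame['date'])

merged = (df1.merge(df2.rename(columns={'v1': 'v2'}), on=['id', 'date'], how='outer')
             .fillna({'v1': 0, 'v2': 0})               # only the value columns
             .astype({'v1': 'int64', 'v2': 'int64'})   # back to integers
             .sort_values(['id', 'date'])
             .reset_index(drop=True))
print(merged)
```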
I'd like to copy or duplicate the rows of a DataFrame based on the value of a column, in this case orig_qty. So if I have a DataFrame and using pandas==0.24.2:
import pandas as pd
d = {'a': ['2019-04-08', 4, 115.00], 'b': ['2019-04-09', 2, 103.00]}
df = pd.DataFrame.from_dict(
    d,
    orient='index',
    columns=['date', 'orig_qty', 'price']
)
Input
>>> print(df)
date orig_qty price
a 2019-04-08 4 115.0
b 2019-04-09 2 103.0
So in the example above the row with orig_qty=4 should be duplicated 4 times and the row with orig_qty=2 should be duplicated 2 times. After this transformation I'd like a DataFrame that looks like:
Desired Output
>>> print(new_df)
date orig_qty price fifo_qty
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-08 4 115.0 1
5 2019-04-09 2 103.0 1
6 2019-04-09 2 103.0 1
Note I do not really care about the index after the transformation. I can elaborate more on the use case for this, but essentially I'm doing some FIFO accounting where important changes can occur between values of orig_qty.
Use Index.repeat, DataFrame.loc, DataFrame.assign and DataFrame.reset_index:
new_df = df.loc[df.index.repeat(df['orig_qty'])].assign(fifo_qty=1).reset_index(drop=True)
[output]
date orig_qty price fifo_qty
0 2019-04-08 4 115.0 1
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-09 2 103.0 1
5 2019-04-09 2 103.0 1
Or use np.repeat:
import numpy as np
new_df = pd.DataFrame({col: np.repeat(df[col], df.orig_qty) for col in df.columns})
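For reference, here is the Index.repeat approach as a self-contained sketch with a sanity check that the number of rows equals the total quantity. The astype(int) call is a precaution I added in case orig_qty is not inferred as an integer column.

```python
import pandas as pd

d = {'a': ['2019-04-08', 4, 115.00], 'b': ['2019-04-09', 2, 103.00]}
df = pd.DataFrame.from_dict(d, orient='index',
                            columns=['date', 'orig_qty', 'price'])

# Repeat each index label orig_qty times, then select those rows
counts = df['orig_qty'].astype(int)
new_df = (df.loc[df.index.repeat(counts)]
            .assign(fifo_qty=1)
            .reset_index(drop=True))

# One output row per unit of quantity
print(len(new_df) == counts.sum())
print(new_df)
```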
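I have a DataFrame like the first table below. For each Id, I need to find the missing quarterly values and count how many quarters are missing between the existing ones.
The result should be a DataFrame like the second table below, with the start, end and count of the missing quarters: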
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
Id Start year end-year count
B 2018Q2 2018Q3 2
B 2017Q3 2018Q3 1
How can I achieve this using Python pandas?
Use:
# changed data for a more general solution: multiple missing years per group
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
# add rows for the missing years with reindex
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
         .reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
# boolean mask of non-NaN rows, stored in a variable for reuse
m = df1['val'].notnull().rename('g')
# cumulative sum as index: each run of consecutive NaNs gets its own group id
df1.index = m.cumsum()
# keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT: For monthly data (year stored as YYYYMM):
# convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
# resample to month starts with asfreq
df1 = df.set_index('year').groupby('Id')['Id'].resample('MS').asfreq().rename('val').reset_index()
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
# cumulative sum as index: each run of consecutive NaNs gets its own group id
df1.index = m.cumsum()
# keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
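Since the question is about quarters like 2018Q2, here is a minimal sketch of the same gap search done directly on quarterly periods. It is a different technique from the reindex/resample approach above: it sorts each Id's quarters and looks at the distance between neighbours. The output column names (start, end, count) are my own choice, and the data is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2019Q4', '2019Q3', '2019Q2', '2019Q1', '2018Q1',
             '2018Q4', '2017Q4', '2017Q2', '2017Q1'],
    'Data': [57170, 55150, 51109, 51109, 57170, 55150, 51109, 51109, 51109],
    'Id': list('AAAABBCCC'),
})

records = []
for id_, grp in df.groupby('Id'):
    # quarterly periods of this Id in chronological order
    quarters = sorted(pd.Period(q, freq='Q') for q in grp['year'])
    for prev, nxt in zip(quarters[:-1], quarters[1:]):
        # number of quarters strictly between two existing neighbours
        gap = nxt.ordinal - prev.ordinal - 1
        if gap > 0:
            records.append({'Id': id_, 'start': str(prev + 1),
                            'end': str(nxt - 1), 'count': gap})

gaps = pd.DataFrame(records, columns=['Id', 'start', 'end', 'count'])
print(gaps)
```

Each row of gaps describes one run of consecutive missing quarters, so an Id with several separate gaps gets several rows.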