I'm trying to fill this DataFrame (df1) (I can start it with NaN or zero values):
27/05/2021 28/05/2021 29/05/2021 30/05/2021 31/05/2021 01/06/2021 02/06/2021 ...
Name1 NaN NaN NaN NaN NaN NaN NaN
Name2 NaN NaN NaN NaN NaN NaN NaN
Name3 NaN NaN NaN NaN NaN NaN NaN
Name4 NaN NaN NaN NaN NaN NaN NaN
According to the information in this DataFrame (df2):
Start1 End1 Dedication1 (h) Start2 End2 Dedication2 (h)
Name1 24/05/2021 31/05/2021 8 02/06/2021 10/07/2021 3
Name2 29/05/2021 31/05/2021 5 NaN NaN NaN
Name3 27/05/2021 01/06/2021 3 NaN NaN NaN
Name4 29/05/2021 07/08/2021 8 10/10/2021 10/12/2021 2
To get something like this (df3):
27/05/2021 28/05/2021 29/05/2021 30/05/2021 31/05/2021 01/06/2021 02/06/2021 ...
Name1 8 8 8 8 8 0 3
Name2 0 0 5 5 5 0 0
Name3 3 3 3 3 3 3 0
Name4 0 0 8 8 8 8 8
This is a schedule with working hours per day over several months. Both DataFrames have the same index and number of rows.
According to the dates in df2, I need to fill the df1 values between each start day and end day with the dedication hours for that period.
I have tried loc over all rows with a lambda function to select the columns by date, but I can't get the values filled within the date ranges. Perhaps I need several steps.
Thanks.
You could try this:
from datetime import datetime

import pandas as pd

# Setup: the period columns of df2 (this assumes the dedication
# columns are named "Dedication1"/"Dedication2")
limits = [("Start1", "End1", "Dedication1"), ("Start2", "End2", "Dedication2")]
df3 = df1.copy()

# Deal with NaN values: a far-future sentinel date keeps empty
# second periods from ever matching
df3 = df3.fillna(0)
df2["Start2"] = df2["Start2"].fillna("31/12/2099")
df2["End2"] = df2["End2"].fillna("31/12/2099")
df2["Dedication2"] = df2["Dedication2"].fillna(0)

# Iterate and fill df3
for i in df1.index:
    for col in df1.columns:
        for start, end, dedication in limits:
            inside = (
                datetime.strptime(df2.loc[i, start], "%d/%m/%Y")
                <= datetime.strptime(col, "%d/%m/%Y")
                <= datetime.strptime(df2.loc[i, end], "%d/%m/%Y")
            )
            if inside:
                df3.loc[i, col] = df2.loc[i, dedication]

# Format df3
df3 = df3.astype("int")
print(df3)
# Outputs
27/05/2021 28/05/2021 29/05/2021 ... 31/05/2021 01/06/2021 02/06/2021
Name1 8 8 8 ... 8 0 3
Name2 0 0 5 ... 5 0 0
Name3 3 3 3 ... 3 3 0
Name4 0 0 8 ... 8 8 8
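For large schedules the triple loop can get slow. A vectorized sketch under the same assumptions (day-first date strings as column labels; only the first Start/End/Dedication period shown, and the small sample frames below are hypothetical reconstructions of the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction: 7 day columns, two names, one period only
dates = pd.date_range("2021-05-27", "2021-06-02").strftime("%d/%m/%Y")
df1 = pd.DataFrame(0, index=["Name1", "Name2"], columns=dates)
df2 = pd.DataFrame({"Start1": ["24/05/2021", "29/05/2021"],
                    "End1": ["31/05/2021", "31/05/2021"],
                    "Dedication1": [8, 5]},
                   index=["Name1", "Name2"])

# Parse everything once, then build a (rows x columns) boolean mask
cols = pd.to_datetime(df1.columns, dayfirst=True)
start = pd.to_datetime(df2["Start1"], dayfirst=True)
end = pd.to_datetime(df2["End1"], dayfirst=True)
mask = (cols.values >= start.values[:, None]) & (cols.values <= end.values[:, None])

# Where the mask is True take the row's dedication, else keep 0
df3 = pd.DataFrame(np.where(mask, df2["Dedication1"].values[:, None], 0),
                   index=df1.index, columns=df1.columns)
```

A second period would just be another mask combined the same way before building df3.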
I have a dataframe as follows with multiple rows per id (maximum 3).
dat = pd.DataFrame({'id':[1,1,1,2,2,3,4,4], 'code': ["A","B","D","B","D","A","A","D"], 'amount':[11,2,5,22,5,32,11,5]})
id code amount
0 1 A 11
1 1 B 2
2 1 D 5
3 2 B 22
4 2 D 5
5 3 A 32
6 4 A 11
7 4 D 5
I want to consolidate the df and have only one row per id so that it looks as follows:
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11 B 2 D 5
1 2 B 22 D 5 NaN NaN
2 3 A 32 NaN NaN NaN NaN
3 4 A 11 D 5 NaN NaN
How can I achieve this in pandas?
Use GroupBy.cumcount to create a counter, reshape with DataFrame.unstack and DataFrame.sort_index, then flatten the MultiIndex and convert id back to a column with DataFrame.reset_index:
df = (dat.set_index(['id', dat.groupby('id').cumcount().add(1)])
         .unstack()
         .sort_index(axis=1, level=1, sort_remaining=False))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11.0 B 2.0 D 5.0
1 2 B 22.0 D 5.0 NaN NaN
2 3 A 32.0 NaN NaN NaN NaN
3 4 A 11.0 D 5.0 NaN NaN
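Since the duplicate target suggests pivoting, an equivalent sketch of the same counter idea with DataFrame.pivot (an alternative route, not the answer above) might look like:

```python
import pandas as pd

dat = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 4, 4],
                    'code': ["A", "B", "D", "B", "D", "A", "A", "D"],
                    'amount': [11, 2, 5, 22, 5, 32, 11, 5]})

# Counter per id, then pivot both value columns on it
dat['n'] = dat.groupby('id').cumcount().add(1)
df = (dat.pivot(index='id', columns='n', values=['code', 'amount'])
         .sort_index(axis=1, level=1, sort_remaining=False))
df.columns = [f'{a}{b}' for a, b in df.columns]
df = df.reset_index()
```

The stable column sort interleaves the levels, giving code1, amount1, code2, amount2, and so on.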
EDIT: Upon request I provide an example that is closer to the real data I am working with.
So I have a table data that looks something like
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.265421 -0.623274 0.041326
4 -2.325031 -0.218792 -1.245911
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.042513 -0.128535
1 1.366463 -0.665195 0.35151
2 0.90347 0.094012 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.009618 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
(think: collection of time series) and a second table valid_range
start stop
run
0 1 3
1 2 5
For each run I want to drop all rows that do not satisfy start≤step≤stop.
I tried the following (table generating code at the end)
for idx in valid_range.index:
    slc = data.loc[idx]
    start, stop = valid_range.loc[idx]
    cond = (start <= slc.index) & (slc.index <= stop)
    data.loc[idx] = data.loc[idx][cond]
However, this results in:
value0 value1 value2
run step
0 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
1 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
I also tried data.loc[idx].drop(slc[cond].index, inplace=True) but it didn't have any effect...
Generating code for table
import numpy as np
from pandas import DataFrame, MultiIndex, Index
rng = np.random.default_rng(0)
valid_range = DataFrame({"start": [1, 2], "stop":[3, 5]}, index=Index(range(2), name="run"))
midx = MultiIndex(levels=[[],[]], codes=[[],[]], names=["run", "step"])
data = DataFrame(columns=[f"value{k}" for k in range(3)], index=midx)
for run in range(2):
    for step in range(6):
        data.loc[(run, step), :] = rng.normal(size=(3))
First, merge data and valid_range on 'run' using the merge method:
>>> data
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.26542 -0.623274 0.041326
4 -2.32503 -0.218792 -1.24591
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.04251 -0.128535
1 1.36646 -0.665195 0.35151
2 0.90347 0.0940123 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.00962 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
>>> valid_range
start stop
run
0 1 3
1 2 5
>>> merged = data.reset_index().merge(valid_range, how='left', on='run')
>>> merged
run step value0 value1 value2 start stop
0 0 0 0.12573 -0.132105 0.640423 1 3
1 0 1 0.1049 -0.535669 0.361595 1 3
2 0 2 1.304 0.947081 -0.703735 1 3
3 0 3 -1.26542 -0.623274 0.041326 1 3
4 0 4 -2.32503 -0.218792 -1.24591 1 3
5 0 5 -0.732267 -0.544259 -0.3163 1 3
6 1 0 0.411631 1.04251 -0.128535 2 5
7 1 1 1.36646 -0.665195 0.35151 2 5
8 1 2 0.90347 0.0940123 -0.743499 2 5
9 1 3 -0.921725 -0.457726 0.220195 2 5
10 1 4 -1.00962 -0.209176 -0.159225 2 5
11 1 5 0.540846 0.214659 0.355373 2 5
Then select the rows that satisfy the condition using eval, and use the resulting boolean array to mask data:
>>> cond = merged.eval('start <= step <= stop').to_numpy()
>>> data[cond]
             value0     value1    value2
run step
0   1        0.1049  -0.535669  0.361595
    2         1.304   0.947081 -0.703735
    3      -1.26542  -0.623274  0.041326
1   2       0.90347  0.0940123 -0.743499
    3     -0.921725  -0.457726  0.220195
    4      -1.00962  -0.209176 -0.159225
    5      0.540846   0.214659  0.355373
Or, if you prefer, here is a similar approach using query:
res = (
    data.reset_index()
        .merge(valid_range, on='run', how='left')
        .query('start <= step <= stop')
        .drop(columns=['start', 'stop'])
        .set_index(['run', 'step'])
)
I would use groupby, like this:
(df.groupby(level=0)
   .apply(lambda x: x[x['small'] > 1])
   .reset_index(level=0, drop=True)  # remove duplicate index
)
which gives:
                   big  small
animal attribute
cow    speed      30.0   20.0
       weight    250.0  150.0
falcon speed     320.0  250.0
lama   speed      45.0   30.0
       weight    200.0  100.0
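Applied to the data/valid_range frames from this question (rebuilt with the question's generating code; keep_valid is a hypothetical helper name), that groupby idea could look like:

```python
import numpy as np
import pandas as pd

# Rebuild the frames with the question's generating code (same seed)
rng = np.random.default_rng(0)
valid_range = pd.DataFrame({"start": [1, 2], "stop": [3, 5]},
                           index=pd.Index(range(2), name="run"))
idx = pd.MultiIndex.from_product([range(2), range(6)], names=["run", "step"])
data = pd.DataFrame(rng.normal(size=(12, 3)), index=idx,
                    columns=[f"value{k}" for k in range(3)])

def keep_valid(g):
    # g.name is the run label; keep rows with start <= step <= stop
    lo, hi = valid_range.loc[g.name]
    steps = g.index.get_level_values("step")
    return g[(lo <= steps) & (steps <= hi)]

# group_keys=False avoids prepending a duplicate 'run' index level
res = data.groupby(level="run", group_keys=False).apply(keep_valid)
```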
I have a Dataframe as shown below
A B C D
0 1 2 3.3 4
1 NaT NaN NaN NaN
2 NaT NaN NaN NaN
3 5 6 7 8
4 NaT NaN NaN NaN
5 NaT NaN NaN NaN
6 9 1 2 3
7 NaT NaN NaN NaN
8 NaT NaN NaN NaN
I need to copy the first row's values (1, 2, 3, 4) down to the non-null row at index 2. Then copy the row values (5, 6, 7, 8) down to index 5, and (9, 1, 2, 3) down to index 8, and so on. Is there any way to do this in Python or pandas? Quick help appreciated! Also, it is necessary not to fill column D.
When I tried ffill, column C gave 3.3456 as the value for the next row.
Expected Output:
A B C D
0 1 2 3.3 4
1 1 2 3.3 NaN
2 1 2 3.3 NaN
3 5 6 7 8
4 5 6 7 NaN
5 5 6 7 NaN
6 9 1 2 3
7 9 1 2 NaN
8 9 1 2 NaN
The question was changed, so to forward fill all columns except D, use Index.difference to get the list of column names, then ffill:
cols = df.columns.difference(['D'])
df[cols] = df[cols].ffill()
Or create a mask for all column names except D:
mask = df.columns != 'D'
df.loc[:, mask] = df.loc[:, mask].ffill()
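A minimal runnable sketch of the first approach, with a frame assumed from the question's tables (plain NaN used throughout, including where the question shows NaT):

```python
import numpy as np
import pandas as pd

# Frame assumed from the question's tables
df = pd.DataFrame({'A': [1, np.nan, np.nan, 5, np.nan, np.nan, 9, np.nan, np.nan],
                   'B': [2, np.nan, np.nan, 6, np.nan, np.nan, 1, np.nan, np.nan],
                   'C': [3.3, np.nan, np.nan, 7, np.nan, np.nan, 2, np.nan, np.nan],
                   'D': [4, np.nan, np.nan, 8, np.nan, np.nan, 3, np.nan, np.nan]})

# Forward fill every column except D
cols = df.columns.difference(['D'])
df[cols] = df[cols].ffill()
```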
EDIT: I cannot replicate your problem:
df = pd.DataFrame({'a':[2114.201789, np.nan, np.nan, 1]})
print (df)
a
0 2114.201789
1 NaN
2 NaN
3 1.000000
print (df.ffill())
a
0 2114.201789
1 2114.201789
2 2114.201789
3 1.000000
Just curious about the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID': [1,2,3,4,5,6,7,8,9,10],
                   'Run Distance': [234,35,77,787,243,5435,775,123,355,123],
                   'Goals': [12,23,56,7,8,0,4,2,1,34],
                   'Gender': ['m','m','m','f','f','m','f','m','f','m']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
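The numpy ancestry mentioned above is easy to see directly; a tiny illustration with the Goals values as a plain array:

```python
import numpy as np

goals = np.array([12, 23, 56, 7, 8, 0, 4, 2, 1, 34])
# Same shape in, same shape out: matching positions kept, the rest replaced
masked = np.where(goals > 10, goals, 0)
print(masked)
```

pandas' DataFrame.where is this same "keep or replace, never drop" idea extended to labeled data.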
If you check the docs for DataFrame.where, you'll see it replaces values where the condition is False - by default with NaN, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax, called boolean indexing, is for filtering rows - it removes the rows that do not match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by a condition and columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that matches the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (NaN by default).
I have a dataframe like this.
Project 4 Project1 Project2 Project3
0 NaN laptio AB NaN
1 NaN windows ten NaN
0 one NaN NaN NaN
1 two NaN NaN NaN
I want to delete the NaN values from the Project 4 column.
My desired output should be,
df,
Project 4 Project1 Project2 Project3
0 one laptio AB NaN
1 two windows ten NaN
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
If your data frame's index is just standard 0 to n ordered integers, you can pop the Project4 column to a series, drop the NaN values, reset the index, and then concatenate it back onto the data frame.
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1, 2, 3],
                   [np.nan, 4, 5, 6],
                   ['one', 7, 8, 9],
                   ['two', 10, 11, 12]], columns=['p4', 'p1', 'p2', 'p3'])
s = df.pop('p4')
pd.concat([df, s.dropna().reset_index(drop=True)], axis=1)
# returns:
p1 p2 p3 p4
0 1 2 3 one
1 4 5 6 two
2 7 8 9 NaN
3 10 11 12 NaN
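If the index is not a clean 0-to-n range, here is a sketch of an alternative (not the answer above) that leaves df in place and just moves the column's non-null values to the top, padding the rest with NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1, 2, 3],
                   [np.nan, 4, 5, 6],
                   ['one', 7, 8, 9],
                   ['two', 10, 11, 12]], columns=['p4', 'p1', 'p2', 'p3'])

# Collect non-null values, then pad with NaN to the original length
vals = df['p4'].dropna().to_numpy()
df['p4'] = np.concatenate([vals, np.full(len(df) - len(vals), np.nan)])
```

This assigns by position, so it works regardless of what the index labels are.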