I have a dataframe df3 that looks like this, with an unknown number of columns (AAA_??? can be anything from the dataset):
Date ID Calendar_Year Month DayName... AAA_1E AAA_BMITH AAA_4.1 AAA_CH
0 2019-09-17 8661 2019 Sep Sun... NaN NaN NaN NaN
1 2019-09-18 8662 2019 Sep Sun... 1.0 3.0 34.0 1.0
2 2019-09-19 8663 2019 Sep Sun... NaN NaN NaN NaN
3 2019-09-20 8664 2019 Sep Mon... NaN NaN NaN NaN
4 2019-09-20 8664 2019 Sep Mon... 2.0 4.0 32.0 3.0
5 2019-09-20 8664 2019 Sep Sat... NaN NaN NaN NaN
6 2019-09-20 8664 2019 Sep Sat... NaN NaN NaN NaN
7 2019-09-20 8664 2019 Sep Sat... 0.0 4.0 30.0 0.0
and another dataframe dfMeans that holds the means of a third dataframe:
Month Dayname ID ... AAA_BMITH AAA_4.1 AAA_CH
0 Jan Thu 7686.500000 ... 0.000000 28.045455 0.0
1 Jan Fri 7636.272727 ... 0.000000 28.136364 0.0
2 Jan Sat 7637.272727 ... 0.000000 27.045455 0.0
3 Jan Sun 7670.090909 ... 0.000000 27.090909 0.0
4 Jan Mon 7702.909091 ... 0.000000 27.727273 0.0
5 Jan Tue 7734.260870 ... 0.000000 27.956522 0.0
The dataframes will be joined on Month and Dayname.
I want to replace the NaNs in df3 with values from dfMeans
using this line
df3.update(dfMeans, overwrite=False, errors="raise")
but I get this error
raise ValueError("Data overlaps.")
ValueError: Data overlaps.
How can I update the NaNs with values from dfMeans and avoid this error?
Edit:
I have put all dataframes in one dataframe df
Month Dayname ID ... AAA_BMITH AAA_4.1 AAA_CH
0 Jan Thu 7686.500000 ... 0.000000 28.045455 0.0
1 Jan Fri 7636.272727 ... 0.000000 28.136364 0.0
2 Jan Sat 7637.272727 ... 0.000000 27.045455 0.0
3 Jan Sun 7670.090909 ... 0.000000 27.090909 0.0
4 Jan Mon 7702.909091 ... 0.000000 27.727273 0.0
5 Jan Tue 7734.260870 ... 0.000000 27.956522 0.0
How can I fill NaNs with average based on Month and Dayname?
Using fillna:
Data:
Date ID Calendar_Year Month Dayname AAA_1E AAA_BMITH AAA_4.1 AAA_CH
2019-09-17 8661 2019 Jan Sun NaN NaN NaN NaN
2019-09-18 8662 2019 Jan Sun 1.0 3.0 34.0 1.0
2019-09-19 8663 2019 Jan Sun NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Mon NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Mon 2.0 4.0 32.0 3.0
2019-09-20 8664 2019 Jan Sat NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Sat NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Sat 0.0 4.0 30.0 0.0
df.set_index(['Month', 'Dayname'], inplace=True)
df_mean:
Month Dayname ID AAA_BMITH AAA_4.1 AAA_CH
Jan Thu 7686.500000 0.0 28.045455 0.0
Jan Fri 7636.272727 0.0 28.136364 0.0
Jan Sat 7637.272727 0.0 27.045455 0.0
Jan Sun 7670.090909 0.0 27.090909 0.0
Jan Mon 7702.909091 0.0 27.727273 0.0
Jan Tue 7734.260870 0.0 27.956522 0.0
df_mean.set_index(['Month', 'Dayname'], inplace=True)
Update df:
This operation is based on matching index values. It doesn't work with multiple column names at once, so get the columns of interest and iterate through them. Note that AAA_1E isn't in df_mean.
for col in df.columns:
    if col in df_mean.columns:
        df[col].fillna(df_mean[col], inplace=True)
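A minimal end-to-end sketch of this approach on toy frames (the column name and values are assumed for illustration); plain assignment is used instead of inplace=True to avoid chained-assignment issues:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df and df_mean with one shared value column
df = pd.DataFrame({
    "Month": ["Jan", "Jan", "Jan"],
    "Dayname": ["Sun", "Sun", "Mon"],
    "AAA_4.1": [np.nan, 34.0, np.nan],
})
df_mean = pd.DataFrame({
    "Month": ["Jan", "Jan"],
    "Dayname": ["Sun", "Mon"],
    "AAA_4.1": [27.0, 28.0],
})

df.set_index(["Month", "Dayname"], inplace=True)
df_mean.set_index(["Month", "Dayname"], inplace=True)

# Fill NaNs column by column; values align on the (Month, Dayname) index
for col in df.columns:
    if col in df_mean.columns:
        df[col] = df[col].fillna(df_mean[col])

df.reset_index(inplace=True)
```

Each NaN picks up the mean for its (Month, Dayname) pair, and rows that already had values are untouched.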
You can groupby on 'Month' and 'DayName' and use apply to edit the dataframe.
Use fillna to fill the NaN values. fillna accepts a dictionary as its value parameter: the keys are column names and the values are scalars used to substitute the NaNs in each column. With loc you can select the proper value from dfMeans.
You can create the dictionary with a dict comprehension, using the intersection of the columns of df3 and dfMeans.
All this corresponds to the following statement:
df3filled = df3.groupby(['Month', 'DayName']).apply(lambda x: x.fillna(
    {col: dfMeans.loc[(dfMeans['Month'] == x.name[0]) & (dfMeans['Dayname'] == x.name[1]), col].iloc[0]
     for col in x.columns.intersection(dfMeans.columns)})).reset_index(drop=True)
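A runnable sketch of that statement on toy data (column names taken from the question, values made up; note that df3 uses 'DayName' while dfMeans uses 'Dayname', as in the original):

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    "Month": ["Jan", "Jan", "Jan"],
    "DayName": ["Sun", "Sun", "Mon"],
    "AAA_CH": [np.nan, 1.0, np.nan],
})
dfMeans = pd.DataFrame({
    "Month": ["Jan", "Jan"],
    "Dayname": ["Sun", "Mon"],
    "AAA_CH": [0.5, 3.0],
})

# For each (Month, DayName) group, build a {column: mean} dict from dfMeans
# and hand it to fillna; x.name holds the group's key tuple
df3filled = df3.groupby(['Month', 'DayName']).apply(lambda x: x.fillna(
    {col: dfMeans.loc[(dfMeans['Month'] == x.name[0]) & (dfMeans['Dayname'] == x.name[1]), col].iloc[0]
     for col in x.columns.intersection(dfMeans.columns)})).reset_index(drop=True)
```

The NaN for ('Jan', 'Sun') becomes 0.5 and the one for ('Jan', 'Mon') becomes 3.0, while the existing 1.0 is left alone.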
Related
I want to merge the following 2 data frames in Pandas but the result doesn't contain all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode
OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns but as you can see it doesn't and only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?
Seems to me that you just need to assign the output of the merge to a variable:
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
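To see why: DataFrame.merge returns a new DataFrame and never modifies its caller in place, which is also why it returns a DataFrame rather than None. A minimal sketch with made-up miniatures of the two frames:

```python
import pandas as pd

# Made-up miniatures of L1aIn and L2Std (values are hypothetical)
L1aIn = pd.DataFrame({"OrbitNumber": [35863, 35862], "L1aIn": [1, 1]})
L2Std = pd.DataFrame({"OrbitNumber": [35862, 35860], "L2Std": [1, 1]})

# merge returns a new DataFrame; L1aIn itself is left unchanged
merged_df = L1aIn.merge(L2Std, how="outer", on="OrbitNumber")

print(list(merged_df.columns))  # both L1aIn and L2Std are present
print(list(L1aIn.columns))      # the original frame is untouched
```

In the question, `df.columns` was printed on the original frame, which is why only the L1aIn columns showed up.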
I have a pandas dataframe containing several years of timeseries data as columns. Each starts in November and ends in the subsequent year. I'm trying to deal with NaN's in non-leap years. The structure can be recreated with something like this:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
ndays = 151
sdates = [(datetime(2019,11,1) + timedelta(days=x)).strftime("%b-%d") for x in range(ndays)]
columns=list(range(2016,2021))
df = pd.DataFrame(np.random.randint(0,100,size=(ndays, len(columns))), index=sdates, columns=columns)
df.loc["Feb-29",2017:2019] = np.nan
df.loc["Feb-28":"Mar-01"]
Out[61]:
2016 2017 2018 2019 2020
Feb-28 36 59.0 74.0 19.0 24
Feb-29 85 NaN NaN NaN 6
Mar-01 24 75.0 49.0 99.0 82
What I want to do is remove the "Feb-29" NaN data only (in the non-leap years) and then shift the data in those columns up a row, leaving the leap years as-is. Something like this, with Mar-01 and subsequent rows shifted up for 2017 through 2019:
2016 2017 2018 2019 2020
Feb-28 36 59.0 74.0 19.0 24
Feb-29 85 75.0 49.0 99.0 6
Mar-01 24 42.0 21.0 41.0 82
I don't care that "Mar-01" data will be labelled as "Feb-29" as eventually I'll be replacing the string date index with an integer index.
Note that I didn't include this in the example but I have NaN's at the start and end of the dataframe in varying rows that I do not want to remove (i.e., I can't just remove all NaN data, I need to target "Feb-29" specifically)
It sounds like you don't actually want to shift dates up, but rather number them correctly based on the day of the year? If so, this will work:
First, make the DataFrame long instead of wide:
df = pd.DataFrame(
{
"2016": {"Feb-28": 36, "Feb-29": 85, "Mar-01": 24},
"2017": {"Feb-28": 59.0, "Feb-29": None, "Mar-01": 75.0},
"2018": {"Feb-28": 74.0, "Feb-29": None, "Mar-01": 49.0},
"2019": {"Feb-28": 19.0, "Feb-29": None, "Mar-01": 99.0},
"2020": {"Feb-28": 24, "Feb-29": 6, "Mar-01": 82},
}
)
df = (
df.melt(ignore_index=False, var_name="year", value_name="value")
.reset_index()
.rename(columns={"index": "month-day"})
)
df
month-day year value
0 Feb-28 2016 36.0
1 Feb-29 2016 85.0
2 Mar-01 2016 24.0
3 Feb-28 2017 59.0
4 Feb-29 2017 NaN
5 Mar-01 2017 75.0
6 Feb-28 2018 74.0
7 Feb-29 2018 NaN
8 Mar-01 2018 49.0
9 Feb-28 2019 19.0
10 Feb-29 2019 NaN
11 Mar-01 2019 99.0
12 Feb-28 2020 24.0
13 Feb-29 2020 6.0
14 Mar-01 2020 82.0
Then remove rows containing an invalid date and get the day of the year for remaining days:
df["date"] = pd.to_datetime(
df.apply(lambda row: " ".join(row[["year", "month-day"]]), axis=1), errors="coerce",
)
df = df[df["date"].notna()]
df["day_of_year"] = df["date"].dt.dayofyear
df
month-day year value date day_of_year
0 Feb-28 2016 36.0 2016-02-28 59
1 Feb-29 2016 85.0 2016-02-29 60
2 Mar-01 2016 24.0 2016-03-01 61
3 Feb-28 2017 59.0 2017-02-28 59
5 Mar-01 2017 75.0 2017-03-01 60
6 Feb-28 2018 74.0 2018-02-28 59
8 Mar-01 2018 49.0 2018-03-01 60
9 Feb-28 2019 19.0 2019-02-28 59
11 Mar-01 2019 99.0 2019-03-01 60
12 Feb-28 2020 24.0 2020-02-28 59
13 Feb-29 2020 6.0 2020-02-29 60
14 Mar-01 2020 82.0 2020-03-01 61
I think this would do the trick?
is_leap = ~np.isnan(df.loc["Feb-29"])
leap = df.loc[:, is_leap]
nonleap = df.loc[:, ~is_leap]
nonleap = nonleap.drop(index="Feb-29")
df2 = pd.concat([
leap.reset_index(drop=True),
nonleap.reset_index(drop=True)
], axis=1).sort_index(axis=1)
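Run on a three-row miniature of the data (values assumed), this keeps the leap-year columns intact and closes the Feb-29 gap in the others:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {2016: [36.0, 85.0, 24.0], 2017: [59.0, np.nan, 75.0], 2020: [24.0, 6.0, 82.0]},
    index=["Feb-28", "Feb-29", "Mar-01"],
)

is_leap = ~np.isnan(df.loc["Feb-29"])               # True for 2016 and 2020
leap = df.loc[:, is_leap]
nonleap = df.loc[:, ~is_leap].drop(index="Feb-29")  # drop the all-NaN row

# Resetting both to positional indexes makes the non-leap columns
# shift up by one row when concatenated side by side
df2 = pd.concat(
    [leap.reset_index(drop=True), nonleap.reset_index(drop=True)],
    axis=1,
).sort_index(axis=1)
```

The 2017 column now reads 59.0, 75.0, NaN: Mar-01 has moved up into the Feb-29 slot, and the trailing NaN lands at the bottom, matching the behaviour the question asks for.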
This should move all the nulls to the end by sorting on pd.isnull, which I believe is what you want:
df = df.apply(lambda x: sorted(x, key=pd.isnull),axis=0)
Before:
2016 2017 2018 2019 2020
Feb-28 35 85.0 88.0 46.0 19
Feb-29 41 NaN NaN NaN 52
Mar-01 86 92.0 29.0 44.0 36
After:
2016 2017 2018 2019 2020
Feb-28 35 85.0 88.0 46.0 19
Feb-29 41 92.0 29.0 44.0 52
Mar-01 86 32.0 50.0 27.0 36
My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
n_students,
columns=pd.MultiIndex.from_arrays(
[enroll_year, grad_year],
names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is to stack the data, one column/index level for year of enrollment, one for year of graduation and one for the numbers of students, which should look like
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the missing records are dropped:
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So I tried .stack(dropna=False), but it expands the index levels to all combinations of enrollment and graduation years:
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year grad_year
# 2010 2014 100.0
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2011 2014 NaN
# 2015 100.0
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2012 2014 NaN
# 2015 NaN
# 2016 110.0
# 2017 NaN
# 2018 NaN
# 2013 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 110.0
# 2018 NaN
# 2014 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
existing_combn = list(zip(
df.columns.levels[0][df.columns.labels[0]],
df.columns.levels[1][df.columns.labels[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although it only adds a few more extra lines to my code, I wonder if there are any better and neater approaches.
Use unstack with pd.DataFrame, then reset_index, drop the unnecessary column, and rename the value column:
pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation:
print(pd.DataFrame(df.unstack()))
0
enroll_year grad_year
2010 2014 0 100.0
2011 2015 0 100.0
2012 2016 0 110.0
2013 2017 0 110.0
2014 2018 0 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
My .csv file looks like:
Area When Year Month Tickets
City Day 2015 1 14
City Night 2015 1 5
Rural Day 2015 1 18
Rural Night 2015 1 21
Suburbs Day 2015 1 15
Suburbs Night 2015 1 21
City Day 2015 2 13
containing 75 rows. I want both a row multiindex and column multiindex that looks like:
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 1 5.0 3.0 22.0 11.0 13.0 2.0
2 22.0 8.0 4.0 16.0 6.0 18.0
3 26.0 25.0 22.0 23.0 22.0 2.0
2016 1 20.0 25.0 39.0 14.0 3.0 10.0
2 4.0 14.0 16.0 26.0 1.0 24.0
3 22.0 17.0 7.0 24.0 12.0 20.0
I've read the .read_csv doc at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I can get the row multiindex with:
df2 = pd.read_csv('c:\\Data\Tickets.csv', index_col=[2, 3])
I've tried:
df2 = pd.read_csv('c:\\Data\Tickets.csv', index_col=[2, 3], header=[1, 3, 5])
thinking [1, 3, 5] fetches 'City', 'Rural', and 'Suburbs'. How do I get the desired column multiindex shown above?
Seems like you need pivot_table with multiple indexes and multiple columns.
Start by just reading your csv plainly:
df = pd.read_csv('Tickets.csv')
Then
df.pivot_table(index=['Year', 'Month'], columns=['Area', 'When'], values='Tickets')
With the input data you provided, you'd get
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 1 14.0 5.0 18.0 21.0 15.0 21.0
2 13.0 NaN NaN NaN NaN NaN
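For a self-contained check, the same call can be run on an inline copy of the sample rows (comma-separated here for convenience); passing a scalar to values avoids an extra 'Tickets' level in the columns:

```python
from io import StringIO

import pandas as pd

# Inline copy of the sample rows from the question
csv = StringIO("""Area,When,Year,Month,Tickets
City,Day,2015,1,14
City,Night,2015,1,5
Rural,Day,2015,1,18
Rural,Night,2015,1,21
Suburbs,Day,2015,1,15
Suburbs,Night,2015,1,21
City,Day,2015,2,13
""")
df = pd.read_csv(csv)

# (Year, Month) row MultiIndex, (Area, When) column MultiIndex
table = df.pivot_table(index=["Year", "Month"], columns=["Area", "When"], values="Tickets")
```

Each cell holds the mean of the matching Tickets values (pivot_table's default aggregation), which for this sample is just the single value per combination.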
I have a question for groupby() in pandas
If I have a DataFrame "df" like
user day click
0 U1 Mon 15
1 U2 Mon 7
2 U1 Wed 15
3 U3 Tue 21
4 U2 Tue 15
5 U2 Tue 10
When I use df.groupby(['user', 'day']).sum()
It would be
click
user day
U1 Mon 15
Tue NaN
Wed 15
U2 Mon 7
Tue 25
Wed NaN
U3 Mon NaN
Tue 21
Wed NaN
How can I get a DataFrame like this
day Mon Tue Wed
user
U1 15 NaN 15
U2 7 25 NaN
U3 NaN 21 NaN
Which means transform one column to be the column name of DataFrame.
Is there any method to do this?
Use the pivot function with day as the columns and click as the values.
df.groupby(['user', 'day']).sum().reset_index()\
.pivot(index='user',columns='day',values='click')
Out[388]:
day Mon Tue Wed
user
U1 15.0 NaN 15.0
U2 7.0 25.0 NaN
U3 NaN 21.0 NaN
Or you can reset only the second-level index, so you don't need to specify the index column in the pivot function.
df.groupby(['user', 'day']).sum().reset_index(level=1)\
.pivot(columns='day',values='click')
Just another way to use unstack():
df=df.groupby(['user', 'day']).sum().unstack('day') #unstack
df.columns = df.columns.droplevel() # drop first level column name
df
Output:
day Mon Tue Wed
user
U1 15.0 NaN 15.0
U2 7.0 25.0 NaN
U3 NaN 21.0 NaN