I have a DataFrame like this:
Year Month Day Rain (mm)
2021 1 1 15
2021 1 2 NaN
2021 1 3 12
And so on (there are multiple years). I have used the pivot_table function to convert the DataFrame into this:
Year 2021 2020 2019 2018 2017
Month Day
1 1 15
2 NaN
3 12
I used:
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first')
Now I would like to replace all NaN values, and also any possible -1 values, with zeros in every column (by columns I mean years), but I have not been able to do so. I have tried:
df = df.fillna(0)
And also:
df.loc[df['Rain (mm)'] == NaN, 'Rain (mm)'] = 0
But neither works; there is no error message or exception, the DataFrame just remains unchanged. What am I doing wrong? Any advice is highly appreciated.
I think the problem is that the NaN values are strings, so they cannot be replaced directly. First try converting the values to numeric:
df['Rain (mm)'] = pd.to_numeric(df['Rain (mm)'], errors='coerce')
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first').fillna(0)
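The question also mentions possible -1 values, which fillna does not touch; a minimal follow-up sketch, assuming -1 can only appear in the rain values:
df = df.replace(-1, 0)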
I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group  Year  Month  Weight(kg)  Nitrogen  Calcium
A      2020  01     10000       10        70
A      2020  01     15000       4         78
A      2021  05     12000       5         66
A      2021  05     10000       8         54
B      2021  08     14000       10        90
C      2021  08     50000       20        92
C      2021  08     40000       10        95
My desired result would look something like this:
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "Year", "Month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column, whilst I have dozens of columns. I therefore tried a for loop:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "Year", "Month"]).apply(wavg, column, "Weight(kg)").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to each other. They also lack a useful column name:
How could I adapt my code to return the desired df?
Change the function to work with multiple columns; to avoid losing the grouping columns, convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return d.mul(w, axis=0).div(w.sum())
    except ZeroDivisionError:
        return d.mean()
# columns used for the groupby
groups = ["Group", "Year", "Month"]
# all remaining columns to process
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
# create the index and process all columns in cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print (df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
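Note that the function above returns per-row weighted contributions. If a single aggregated row per group is wanted instead (the classic weighted average), a hedged variant under the same column assumptions (wavg_agg is a hypothetical name):
def wavg_agg(x, value, weight):
    # weighted average per group: sum of value*weight divided by total weight
    d = x[value]
    w = x[weight]
    try:
        return d.mul(w, axis=0).sum().div(w.sum())
    except ZeroDivisionError:
        return d.mean()

# e.g. for Group A, 2020-01: Calcium -> (70*10000 + 78*15000) / 25000 = 74.8
df2 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg_agg, cols, "Weight(kg)")
         .reset_index())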
Try via concat() and reset_index():
df = pd.concat(column_list, axis=1).reset_index()
OR
you can make changes here:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "Year", "Month"]).apply(wavg, column, "Weight(kg)").reset_index())
# finally:
df = pd.concat(column_list, axis=1)
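A hedged tweak to the loop above: iterating only over the value columns (skipping the grouping and weight columns is an assumption about the intent) and naming each resulting Series also fixes the missing column names:
value_cols = [c for c in df.columns if c not in ("Group", "Year", "Month", "Weight(kg)")]
column_list = [df.groupby(["Group", "Year", "Month"]).apply(wavg, col, "Weight(kg)").rename(col)
               for col in value_cols]
df = pd.concat(column_list, axis=1).reset_index()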
Let's say I have the following df:
year date_until
1 2010 -
2 2011 30.06.13
3 2011 NaN
4 2015 30.06.18
5 2020 -
I'd like to fill all - and NaNs in the date_until column with 30/06/{year +1}. I tried the following but it uses the whole year column instead of the corresponding value of the specific row:
df['date_until'] = df['date_until'].str.replace('-', f'30/06/{df["year"]+1}')
My final goal is to calculate the difference between year and the year of date_until, so maybe the step above is even unnecessary.
We can use pd.to_datetime here with errors='coerce' to ignore the faulty dates. Then use the dt.year to calculate the difference:
df['date_until'] = pd.to_datetime(df['date_until'], format='%d.%m.%y', errors='coerce')
df['diff_year'] = df['date_until'].dt.year - df['year']
year date_until diff_year
0 2010 NaT NaN
1 2011 2013-06-30 2.0
2 2011 NaT NaN
3 2015 2018-06-30 3.0
4 2020 NaT NaN
For everybody who is trying to replace values just like I wanted to in the first place, here is how you could solve it:
for i in range(len(df)):
    if pd.isna(df['date_until'].iloc[i]):
        df['date_until'].iloc[i] = f'30.06.{df["year"].iloc[i] + 1}'
    if df['date_until'].iloc[i] == '-':
        df['date_until'].iloc[i] = f'30.06.{df["year"].iloc[i] + 1}'
But @Erfan's approach is much cleaner.
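For the replacement itself, a vectorized sketch (assuming the same column names as above) that avoids the row loop:
import numpy as np

# build the per-row fallback string 30.06.{year+1}, then fill both '-' and NaN
fallback = '30.06.' + (df['year'] + 1).astype(str)
df['date_until'] = df['date_until'].replace('-', np.nan).fillna(fallback)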
My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column like a string and split at the dot.
Then apply a function that takes the number string and turns it into a new year value where possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        value += int(entry[-1])
    finally:
        return str(value)
df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1],errors='coerce').fillna(0) + 2015
# Sample DataFrame from @Mike Muller
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
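If integer years are preferred over the floats that fillna produces, a small hedged follow-up:
df['new'] = df['new'].astype(int)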
I have a dataframe of following structure (showing it as comma separated values):
day date hour cnt
Friday 9/15/2017 0 3
Friday 9/15/2017 1 5
Friday 9/15/2017 2 8
Friday 9/15/2017 3 6
...........................
Friday 9/15/2017 10
...........................
Saturday 9/16/2017 21 5
Saturday 9/16/2017 22 4
Some of the date values have data for every hour (0-23).
However, some of the date values can have missing hours. In the example, for 9/15/2017 data, there are no records for hour values from 9 to 13. For all these missing records, I need to add a new record with a cnt value (last column) of zero.
How do I achieve this in Python?
Provided you use a pandas.DataFrame, you may use the fillna() method:
DataFrame['cnt'].fillna(value=0)
Example:
Consider data:
one two three
a NaN 1.2 -0.355322
c NaN 3.3 0.983801
e 0.01 4 -0.712964
You may fill NaN using fillna():
data.fillna(0)
one two three
a 0 1.2 -0.355322
c 0 3.3 0.983801
e 0.01 4 -0.712964
You can generate a DatetimeIndex and use the resample method:
#suppose your dataframe is named df:
idx = pd.DatetimeIndex(pd.to_datetime(df['date']).add(pd.to_timedelta(df['hour'], unit='h')))
df.index = idx
df_filled = df[['cnt']].resample('1H').sum().fillna(0).astype(int)
df_filled['day'] = df_filled.index.strftime('%A')
df_filled['date'] = df_filled.index.strftime('%-m/%-d/%Y')
df_filled['hour'] = df_filled.index.strftime('%-H')
or you can do the pivot and unpivot trick:
df_filled = df.pivot(values='cnt', index='date', columns='hour').fillna(0).unstack()
df_filled = df_filled.reset_index().sort_values(by=['date', 'hour'])
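A third sketch under the same assumptions (df already indexed by the DatetimeIndex idx built above): reindex against a complete hourly range and fill the gaps with zero:
# every hour from midnight of the first day through the last timestamp
full_idx = pd.date_range(idx.min().normalize(), idx.max(), freq='1H')
df_filled = df[['cnt']].reindex(full_idx, fill_value=0)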
Suppose I have a dataframe with an index of monthly timesteps. I know I can use dataframe.groupby(lambda x: x.year) to group monthly data into yearly and apply other operations. Is there some way I could quickly group them, let's say by decade?
Thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have something on which you can use .year, you could still do lambda x: (x.year//10)*10.
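For instance, a minimal sketch of the lambda form with the DatetimeIndex from above (groupby applies the function to each index label):
df.groupby(lambda ts: (ts.year // 10) * 10).sum()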
If your DataFrame has headers, say ['Population', 'Salary', 'Vehicle count'], make Year your index: dataframe = dataframe.set_index('Year')
Use the code below to resample the data into decades (10-year bins); it also gives you the sum of all other columns within each decade:
dataframe = dataframe.resample('10AS').sum()
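Note that resample needs a datetime-like index, so integer years would first have to be converted; a hedged sketch of that conversion:
# turn integer years into timestamps at the start of each year
dataframe.index = pd.to_datetime(dataframe.index.astype(str), format='%Y')
dataframe = dataframe.resample('10AS').sum()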
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group up:
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: the iloc here chooses the first column in your dataframe.
You can find the various offset aliases here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases