Calculating the moving average for each unique value in a column - python

I have a csv file with a value that increases with time for n cities, like this:
city,date,value
saopaulo,2020-01-01,5
riodejaneiro,2020-01-01,3
curitiba,2020-01-01,7
...
saopaulo,2020-05-01,31
riodejaneiro,2020-05-01,55
curitiba,2020-05-01,41
What I want to do is to calculate the moving average of the column "value", but for each "city" separately.
I loaded the csv into a pandas DataFrame, but if I calculate df["value"].rolling(3).mean(), it computes the moving average over all the cities together.
What I want is to create a new column with the moving average but for each city. I was thinking about groupby, but I don't know exactly how to implement this.

You can use groupby:
df.groupby('city')['value'].rolling(3).mean()
To assign it back to the original frame, drop the city level so the index aligns:
df['roll'] = df.groupby('city')['value'].rolling(3).mean().droplevel(0)
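For completeness, a minimal end-to-end sketch assuming the CSV layout from the question (the file name is illustrative):
import pandas as pd

# Illustrative file name; assumes columns city, date, value as in the question.
df = pd.read_csv("cities.csv", parse_dates=["date"])

# Sort so each city's rows are in chronological order before rolling.
df = df.sort_values(["city", "date"])

# 3-period moving average computed per city, aligned back to the original rows.
df["roll"] = df.groupby("city")["value"].rolling(3).mean().droplevel(0)
print(df.head())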

Here you go:
def rolling_mean(group: pd.DataFrame) -> pd.DataFrame:
    # Whatever operation you want to do with the cities.
    # For each city, `group` is a DataFrame of that city's rows.
    # I'm guessing you'd like to set the date as a sorted index and calculate
    # your moving average based on that; if that's not the case, modify this function.
    return group.set_index('date').sort_index()[['value']].rolling(3).mean()

df.groupby("city").apply(rolling_mean)  # Use .reset_index() if you don't need the MultiIndex.

Maybe do this (suppose your dataframe is named df):
from collections import defaultdict
import pandas as pd

data = defaultdict(list)
for (place, date, value) in df.values:
    data[place].append(value)
new_df = pd.DataFrame(dict(data))
Now you have a new dataframe with each city as a column, so you can apply your function to each column (in a for loop):
   saopaulo  riodejaneiro  curitiba
0         5             3         7
1        31            55        41
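For what it's worth, the same wide layout can also be produced directly with pivot (a sketch, assuming each (date, city) pair appears once):
wide = df.pivot(index='date', columns='city', values='value')
rolled = wide.rolling(3).mean()  # moving average per city column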

Related

Is there a better way to group by a category, and then select values based on different column values in Pandas?

I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data looks something like this:
time value date
0 12.850000 19.195359 08-22-2019
1 9.733333 13.519543 09-19-2019
2 14.083333 9.191413 08-26-2019
3 16.616667 18.346598 08-19-2019
...
Where every date can occur multiple times, recording values at different points
during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately, my question is whether there is a nicer way to perform data manipulations that follow the general pattern "group by column A, filter or sort on column B, grab the resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min', 'min'), ('Max', 'max')])
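A sketch (using the column names from this question) of the general "group by A, sort by B, take C" pattern, combining the per-day opening value with the min/max aggregation above:
summary = (
    df.sort_values('time')
      .groupby('date')['value']
      .agg([('Open', 'first'), ('Min', 'min'), ('Max', 'max')])
)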

how can I group column by date and get the average from the other column in python?

From the dataframe below:
I would like to group the column 'datum' by date (01-01-2019 and so on) and, at the same time, get the average of the column 'PM10_gemiddelde'.
Right now every 01-01-2019 appears 24 times (hourly), and I need those rows combined into one while taking the average of 'PM10_gemiddelde'. See picture for the data.
Besides that, 'PM10_gemiddelde' also contains negative data. How can I easily remove that data in Python?
Thank you!
PS: I'm new to Python.
What you are trying to do can be achieved by:
data[['datum', 'PM10_gemiddelde']].loc[data['PM10_gemiddelde'] > 0].groupby(['datum']).mean()
You can create a new column with the average of PM10_gemiddelde using groupby along with transform. Try the following:
Assuming your dataframe is called df, start first by removing the negative data:
new_df = df[df['PM10_gemiddelde'] > 0]
Then, you can create a new column that contains the average value for every date:
new_df['avg_col'] = new_df.groupby('datum')['PM10_gemiddelde'].transform('mean')
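One caveat (a note added here, not from the original answer): slicing like this can trigger a SettingWithCopyWarning when the new column is added, so it may be safer to work on an explicit copy:
new_df = df[df['PM10_gemiddelde'] > 0].copy()
new_df['avg_col'] = new_df.groupby('datum')['PM10_gemiddelde'].transform('mean')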

Performing a calculation on a DataFrame based on condition in another DataFrame

I am working with COVID data and am trying to control for population and show incidence per 100,000.
I have one DataFrame with population:
Country   Population
China        1389102
Israel        830982
Iran          868912
I have a second DataFrame showing COVID data:
Date         Country  Confirmed
01/Jan/2020  China            8
01/Jan/2020  Israel           3
01/Jan/2020  Iran             2
02/Jan/2020  China           15
02/Jan/2020  Israel           5
02/Jan/2020  Iran             5
I wish to perform a calculation on my COVID DataFrame using info from the population DataFrame. That is to say, to normalise cases per 100,000 for each datapoint via:
(Chinese Datapoint/Chinese Population) * 100,000
Likewise for my other countries.
I am stumped on this one and not too sure whether to achieve my result via grouping data, zipping data, etc.
Any help welcome.
Edit: I should have added that confirmed cases are cumulative as each day goes on. So for example, for Jan 1st for China I wish to compute (8 / China population) * 100000, and likewise for Jan 2nd, Jan 3rd, Jan 4th... And again, likewise for each country. Essentially, I am performing a calculation on the entire DataFrame based on data in another DataFrame.
You could merge the 2 dataframes and perform the operation:
# Define the norm operation (per 100,000, as in the question)
def norm_cases(cases, population):
    return (cases / population) * 100000

# If the column name for country is the same in both dataframes
covid_df = covid_df.merge(population_df, on='country_column', how='left')
# For different column names, use instead:
# covid_df = covid_df.merge(population_df, left_on='covid_country_column', right_on='population_country_column', how='left')

covid_df['norm_cases'] = covid_df.apply(lambda x: norm_cases(x['cases_column'], x['population_column']), axis=1)
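A side note (added here, not part of the original answer): with the merged frame in hand, the same normalisation can be done without apply, using vectorised column arithmetic on the placeholder column names above:
covid_df['norm_cases'] = covid_df['cases_column'] / covid_df['population_column'] * 100000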
Assuming that your dataframes are called df1 and df2, and by "Datapoint" you mean the column Confirmed:
normed_cases = (
    df2.reset_index().groupby(['Country', 'Date']).sum()['Confirmed']
    / df1.set_index('Country')['Population'] * 100000)
1. Reset the index of df2 to make the date a column (only applicable if Date was the index before).
2. Group by country and date and sum the groups to get the total cases per country and date.
3. Set Country as the index of the first dataframe df1 to allow country-index-oriented division.
4. Divide by the population and multiply by 100,000.
I took an approach combining many of your suggestions. Step one, I merged my two dataframes. Step two, I divided my confirmed column by the population. Step three, I multiplied the same column by 100,000. There probably is a more elegant approach but this works.
covid_df = covid_df.merge(population_df, on='Country', how='left')
covid_df["Confirmed"] = covid_df["Confirmed"].divide(covid_df["Population"], axis="index")
covid_df["Confirmed"] = covid_df["Confirmed"] *100000
Suppose the DataFrame with population data is df_pop and the COVID data is df_data.
# Set Country as the index of df_pop
df_pop = df_pop.set_index(['Country'])

# Norm value
norm = 100000

# Calculate normalised cases
df_data['norm_cases'] = [(conf / df_pop.loc[country].Population) * norm
                         for (conf, country) in zip(df_data.Confirmed, df_data.Country)]
You can use df1.set_index('Country').join(df2.set_index('Country')) here; then it will be easy for you to perform these operations.
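A minimal sketch of that join-based approach, keeping the df1/df2 naming from the earlier answer (df1 = population, df2 = confirmed cases) and the column names from the question:
joined = df1.set_index('Country').join(df2.set_index('Country'))
joined['per_100k'] = joined['Confirmed'] / joined['Population'] * 100000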

Pandas calculated column from datetime index groups loop

I have a Pandas df with a Datetime Index. I want to loop over the following code with different values of strike, based on the index date value (different strike for different time period). Here is my code that produces what I am after for 1 strike across the whole time series:
import pandas as pd
import numpy as np
index=pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
strike = 40
payoffs = df[df>strike]-strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
print(dist)
I want to use different values of strike based on the time period (index value).
So far I have tried to create a categorical calculated column with the intention of using map or apply row wise on the df. I have also played around with creating a dictionary and mapping the dict across the df.
Even if I get the calculated column with the correct strike value, I can't think how to subtract the calculated column value (strike) from all other columns to get payoffs as above.
I feel like I need to use for loop and potentially create groups of date chunks that get appended together at the end of the loop, maybe with pd.concat.
Thanks in advance
I think you need to convert the DatetimeIndex to quarter periods with to_period, then to strings, and finally map by the dict.
For comparing, use gt with sub:
d = {'2017Q4':30, '2018Q1':40, '2018Q2':50, '2018Q3':60, '2018Q4':70}
strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)
payoffs = df[df.gt(strike, 0)].sub(strike, 0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
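A small clarification (added here, not from the original answer): strike is a Series aligned to df's DatetimeIndex, so the comparison and subtraction broadcast row-wise; writing the axis explicitly may read more clearly:
payoffs = df[df.gt(strike, axis=0)].sub(strike, axis=0)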
Mapping your dataframe index into a dictionary can be a starting point.
import random
import pandas as pd

a = dict()
a[2017] = 30
a[2018] = 40

# Given your index used in the example (21936 half-hour periods)
ranint = random.choices([30, 35, 40, 45], k=21936)
df = pd.DataFrame({'values': ranint}, index=index)

df['year'] = df.index.year
df['strike'] = df['year'].map(a)
df['returns'] = df['values'] - df['strike']
After adding the year and strike columns, the frame looks like:
                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30
Then you can extract the returns that are greater than 0:
df[df.returns>0]

Pandas and GeoPandas indexing and slicing

I am using GeoPandas and Pandas.
I have, say, a 300,000-row DataFrame, df, with 4 columns plus the index column.
id lat lon geometry
0 2009 40.711174 -73.99682 0
1 536 40.741444 -73.97536 0
2 228 40.754601 -73.97187 0
However, there are only a handful of unique ids (~200).
I want to generate a shapely.geometry.point.Point object for each (lat,lon) combination, similarly to what shown here: http://nbviewer.ipython.org/gist/kjordahl/7129098
(see cell#5),
where it loops through all rows of the dataframe; but for such a big dataset, I wanted to limit the loop to the much smaller number of unique ids.
Therefore, for a given id value, idvalue (e.g., 2009 from the first row), I want to create the GeoSeries and assign it directly to ALL rows that have id == idvalue.
My code looks like:
for count, iunique in enumerate(df.id.unique()):
    sc_start = GeoSeries([Point(np.array(df[df.id == iunique].lon)[0],
                                np.array(df[df.id == iunique].lat)[0])])
    df.loc[iunique, ['geometry']] = sc_start
However, things don't work; the geometry field does not change, and I think it's because the indexes of sc_start don't match the indexes of df.
how can I solve this? should I just stick to the loop through the whole df?
I would take the following approach:
First find the unique id's and create a GeoSeries of Points for this:
unique_ids = df.groupby('id', as_index=False).first()
unique_ids['geometry'] = GeoSeries([Point(x, y) for x, y in zip(unique_ids['lon'], unique_ids['lat'])])
Then merge these geometries with the original dataframe on matching ids:
df.merge(unique_ids[['id', 'geometry']], how='left', on='id')
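One small follow-up note (an assumption added here, not part of the original answer): merge returns a new frame rather than modifying df in place, and since df already has the placeholder geometry column from the question, it is worth dropping that first to avoid _x/_y suffixes:
df = df.drop(columns='geometry').merge(unique_ids[['id', 'geometry']], how='left', on='id')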
