How to automatically pivot data in pandas - python

I am used to working with Excel and am trying to learn Python, especially pandas. My goal is to plot a large dataset with Plotly/Dash. My dataset looks very much like the one in the pandas tutorial, except that I have more parameters and, with 20 locations, more locations as well.
                       date.utc            location parameter  value
2067  2019-05-07 01:00:00+00:00  London Westminster        no   23.0
2068  2019-05-07 01:00:00+00:00  London Westminster       no2   45.0
2069  2019-05-07 01:00:00+00:00  London Westminster      pm25   11.0
1003  2019-05-07 01:00:00+00:00             FR04014       no2   25.0
100   2019-05-07 01:00:00+00:00             BETR801      pm25   12.5
1098  2019-05-07 01:00:00+00:00             BETR801       no2   50.5
1109  2019-05-07 01:00:00+00:00  London Westminster        co    8.0
I import the file with pd.read_csv and then manually create a pivot for every location and every parameter in a separate variable, which is quite a lot of work.
Is there a way to automatically pivot this data? I want the locations grouped and a column for every parameter. My goal is to get this data into Dash: at the top I want a dropdown with the location, and on the right side I want to choose no, no2, pm ..., with individual axis labels for each parameter.
I found this code here on Stack Overflow and am trying to adapt it, but it doesn't work.
df = pd.read_csv('https://api.statbank.dk/v1/data/mpk100/CSV?valuePresentation=Value&timeOrder=Ascending&LAND=*&Tid=*', sep=';')
df = df[df['INDHOLD'] != '..']
df['rate'] = df['INDHOLD'].str.replace(',', '.').astype(float)
available_countries = df['LAND'].unique()
df.groupby('LAND')
Many thanks in advance.:)

If I understand you correctly:
x = df.pivot(index=["date.utc", "location"], columns="parameter", values="value")
print(x)
Prints:
parameter                                      co    no   no2  pm25
date.utc                  location
2019-05-07 01:00:00+00:00 BETR801             NaN   NaN  50.5  12.5
                          FR04014             NaN   NaN  25.0   NaN
                          London Westminster  8.0  23.0  45.0  11.0
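If the end goal is the Dash dropdown described in the question, the pivoted (wide) frame can then be sliced per location. A minimal sketch, assuming the long-format frame from the question is called df and that 'London Westminster' is the location picked in the dropdown:
import pandas as pd

# One row per (timestamp, location), one column per parameter.
wide = df.pivot(index=["date.utc", "location"], columns="parameter", values="value")

# Values for the location dropdown, and the time series of one selected location.
locations = wide.index.get_level_values("location").unique()
selected = wide.xs("London Westminster", level="location")   # columns: co, no, no2, pm25

print(locations)
print(selected.head())
Each column of selected can then back one trace (with its own axis label) in the Dash figure.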

Related

Pandas dataframe resample returning incorrect timestamps

I am using resample on a pandas dataframe with a datetime index, and the resulting dataframe contains unexpected new datetime values.
The original dataframe:
                           snow
valid
2022-06-01 19:00:00+00:00   NaN
2022-06-01 20:00:00+00:00   1.0
2022-06-01 21:00:00+00:00   2.0
2022-06-01 22:00:00+00:00   3.0
2022-06-01 23:00:00+00:00   4.0
2022-06-02 00:00:00+00:00   5.0
2022-06-02 01:00:00+00:00   6.0
2022-06-02 02:00:00+00:00   7.0
And I am applying the following pandas function
df.resample('3H').apply(np.max)
This is returning
                           snow
valid
2022-06-01 18:00:00+00:00   1.0
2022-06-01 21:00:00+00:00   4.0
2022-06-02 00:00:00+00:00   7.0
The first time should not be T18, it should be T19. I am not sure why this is happening. Passing extra keyword arguments to resample does not resolve this issue, and adding .dropna() before the resample does not fix it either.
Additionally, when iterating through the groups using
[{group[0]: group[1]} for group in df.resample('3H')]
the 0th group in this list is
{Timestamp('2022-06-01 18:00:00+0000', tz='UTC', freq='3H'): snow
valid
2022-06-01 19:00:00+00:00 NaN
2022-06-01 20:00:00+00:00 1.0}
The group contains one less value than I would expect from resample, and the key for this group is not what I would expect either.
If you need the bins to start at the first index, add the origin='start' parameter to DataFrame.resample:
df = df.resample('3H', origin='start').max()
print(df)
                           snow
valid
2022-06-01 19:00:00+00:00   2.0
2022-06-01 22:00:00+00:00   5.0
2022-06-02 01:00:00+00:00   7.0
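For reference, here is a self-contained sketch (reconstructing the sample data from the question) that contrasts the default anchoring with origin='start'. The default is origin='start_day', so the 3H bins are anchored to midnight of the first day, which is why the first bin starts at 18:00:
import numpy as np
import pandas as pd

idx = pd.date_range("2022-06-01 19:00", periods=8, freq="H", tz="UTC")
df = pd.DataFrame({"snow": [np.nan, 1, 2, 3, 4, 5, 6, 7]}, index=idx)
df.index.name = "valid"

# Default origin='start_day': bins anchored to midnight, so the first bin is 18:00-21:00.
print(df.resample("3H").max())

# origin='start': bins anchored to the first timestamp, so the first bin is 19:00-22:00.
print(df.resample("3H", origin="start").max())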

Convert summary data (cumulative cases) to daily cases in pandas

I have case data that is presented as a time series. The counts accumulate from day to day; what can I use to turn them into daily case counts?
My dataframe in pandas:
         data  sum_cases (cumulative)
0  2020-05-02                     4.0
1  2020-05-03                    21.0
2  2020-05-04                    37.0
3  2020-05-05                    51.0
I want them to look like this:
         data  sum_cases (cumulative)  daily_cases
0  2020-05-02                     4.0          4.0
1  2020-05-03                    21.0         17.0
2  2020-05-04                    37.0         16.0
3  2020-05-05                    51.0         14.0
If indeed your DF has the data in date order, then you might be able to get away with:
df['daily_cases'] = df['sum_cases'] - df['sum_cases'].shift(fill_value=0)
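A self-contained sketch with the sample frame from the question (assuming the cumulative column is named sum_cases, as in the line above); diff() gives the same result once the first row's NaN is filled back from the cumulative column:
import pandas as pd

df = pd.DataFrame({'data': pd.to_datetime(['2020-05-02', '2020-05-03', '2020-05-04', '2020-05-05']),
                   'sum_cases': [4.0, 21.0, 37.0, 51.0]})

# Subtract the previous cumulative total; fill_value=0 keeps the first day's count intact.
df['daily_cases'] = df['sum_cases'] - df['sum_cases'].shift(fill_value=0)

# Equivalent: consecutive differences, with the first row taken from the cumulative column.
# df['daily_cases'] = df['sum_cases'].diff().fillna(df['sum_cases'])

print(df)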

Python Pandas - Replace NaN values of a column with respect to another column using interpolate()

I am facing a problem while dealing with NaN values in the Temperature column with respect to the City column, using interpolate().
The df is:
import pandas as pd

data = {
    'City': ['Greenville', 'Charlotte', 'Los Gatos', 'Greenville', 'Carson City', 'Greenville', 'Greenville',
             'Charlotte', 'Carson City', 'Greenville', 'Charlotte', 'Fort Lauderdale', 'Rifle', 'Los Gatos',
             'Fort Lauderdale'],
    'Rec_times': ['2019-05-21 08:29:55', '2019-01-27 17:43:09', '2020-12-13 21:53:00', '2019-07-17 11:43:09',
                  '2018-04-17 16:51:23', '2019-10-07 13:28:09', '2020-01-07 11:38:10', '2019-11-03 07:13:09',
                  '2020-11-19 10:45:23', '2020-10-07 15:48:19', '2020-10-07 10:53:09', '2017-08-31 17:40:49',
                  '2016-08-31 17:40:49', '2021-11-13 20:13:10', '2016-08-31 19:43:29'],
    'Temperature': [30, 45, 26, 33, 50, None, 29, None, 48, 32, 47, 33, None, None, 28],
    'Pressure': [30, None, 26, 43, 50, 36, 29, None, 48, 32, None, 35, 23, 49, None]
}
df = pd.DataFrame(data)
df
Output:
City Rec_times Temperature Pressure
0 Greenville 2019-05-21 08:29:55 30.0 30.0
1 Charlotte 2019-01-27 17:43:09 45.0 NaN
2 Los Gatos 2020-12-13 21:53:00 26.0 26.0
3 Greenville 2019-07-17 11:43:09 33.0 43.0
4 Carson City 2018-04-17 16:51:23 50.0 50.0
5 Greenville 2019-10-07 13:28:09 NaN 36.0
6 Greenville 2020-01-07 11:38:10 29.0 29.0
7 Charlotte 2019-11-03 07:13:09 NaN NaN
8 Carson City 2020-11-19 10:45:23 48.0 48.0
9 Greenville 2020-10-07 15:48:19 32.0 32.0
10 Charlotte 2020-10-07 10:53:09 47.0 NaN
11 Fort Lauderdale 2017-08-31 17:40:49 33.0 35.0
12 Rifle 2016-08-31 17:40:49 NaN 23.0
13 Los Gatos 2021-11-13 20:13:10 NaN 49.0
14 Fort Lauderdale 2016-08-31 19:43:29 28.0 NaN
I want to deal with the NaN values in the Temperature column by grouping the records by City and using interpolate(method='time').
Example: consider the city 'Greenville'. It has 5 temperatures (30, 33, NaN, 29 and 32) recorded at different times. The NaN in Temperature should be replaced by grouping the records by City and interpolating with interpolate(method='time').
Note: if you know any other good method to replace the NaN in Temperature, feel free to post it as an 'Other solution'.
Use a lambda function on a DatetimeIndex created by DataFrame.set_index, together with GroupBy.transform:
df["Rec_times"] = pd.to_datetime(df["Rec_times"])
df['Temperature'] = (df.set_index('Rec_times')
.groupby("City")['Temperature']
.transform(lambda x: x.interpolate(method='time')).to_numpy())
One possible idea for replacing the values that are still missing after interpolate is to replace them with the mean of all values, like:
df1.Temperature = df1.Temperature.fillna(df1.Temperature.mean())
My understanding is that you want to replace the NaN in the Temperature column by an interpolation of the temperature in that specific city.
I would have to think about a more sophisticated solution, but here is a simple hack:
df["Rec_times"] = pd.to_datetime(df["Rec_times"]) # .interpolate requires datetime
df["idx"] = df.index # to restore original ordering
df_new = pd.DataFrame() # will hold new data
for (city,group) in df.groupby("City"):
group = group.set_index("Rec_times", drop=False)
df_new = pd.concat((df_new, group.interpolate(method='time')))
df_new = df_new.set_index("idx").sort_index() # Restore original ordering
df_new
Note that interpolation for Rifle will yield NaN given there is only one data point which is NaN.
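Either way, a final cleanup step can combine this with the idea from the first answer: fill whatever the time interpolation could not handle (such as Rifle) with the overall mean. A small follow-up that continues from the df_new built above (the mean is just one possible default):
# Cities with a single, missing reading have nothing to interpolate between,
# so fall back to the overall mean (or any other default) for the leftovers.
df_new['Temperature'] = df_new['Temperature'].fillna(df_new['Temperature'].mean())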

Incomplete filling when upsampling with `agg` for multiple columns (pandas resample)

I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times using different methods for each column. For user and total, I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column. I try to do the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and if there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, but if I were to just do resampled = df.resample('5T').ffill(), that would work for every column (but is undesired here as it would do so for the value column as well). The closest I have come is to individually run resampling for each column and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't seem to explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd

dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
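As far as I can tell, the functions passed to agg only see the rows that fall inside each 5-minute bin, so a bin with no original rows has nothing to forward-fill from and comes back NaN; filling across bins only happens when ffill is applied to the resampled result as a whole. Under that reading, one way to keep a per-column dict without resampling each column separately is to resample once with asfreq and post-fill column by column. This is only a sketch of the first (working) approach above, driven by a dict and reusing the df built from the example data; the 'zero' label is a made-up marker, not a pandas option:
# Hypothetical per-column fill plan: 'ffill' = forward fill, 'zero' = fillna(0).
fills = {'user': 'ffill', 'value': 'zero', 'total': 'ffill'}

resampled = df.resample('5T').asfreq()
for col, how in fills.items():
    if how == 'ffill':
        resampled[col] = resampled[col].ffill()
    else:  # 'zero'
        resampled[col] = resampled[col].fillna(0)

print(resampled)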

Merge variable number of rows in pandas dataframe

I'm new to pandas and to working with dataframes. I have a rather simple problem that I think should have a straightforward solution, but it is not clear to me (and I do not know pandas that well).
So I have many occurrences of rows with the same index in my dataframe:
                     Glucose  Insulin  Carbs
Hour
2018-05-16 06:43:00    156.0      0.0    0.0
2018-05-16 06:43:00      NaN      0.0   65.0
2018-05-16 06:43:00      NaN      7.0    0.0
And I would like to merge them to get this, a row which contains all the information available at a given time index:
                     Glucose  Insulin  Carbs
Hour
2018-05-16 06:43:00    156.0      7.0   65.0
2018-05-16 06:43:00      NaN      0.0   65.0
2018-05-16 06:43:00      NaN      7.0    0.0
Afterwards I would drop all rows which contain NaN in any column to get:
                     Glucose  Insulin  Carbs
Hour
2018-05-16 06:43:00    156.0      7.0   65.0
The problem is that in the same dataframe I have duplicates with less information, maybe only Carbs or Insulin.
                     Glucose  Insulin  Carbs
Hour
2018-05-19 06:15:00      NaN      1.5    0.0
2018-05-19 06:15:00    229.0      0.0    0.0
I already know the indices of these entries:
bad_indices = _df[ _df.Glucosa.isnull() ].index
What I would like to know is if there is a nice, Pythonic way to do such a task (both for the two-row and the three-row cases): maybe a pandas built-in method, or something semi-standard, or at least readable, because I don't want to write ugly (and easily breakable) code that has explicit considerations for each case.
You can replace 0 with NaN and then take the first non-NaN value per group:
df = df.mask(df == 0).groupby(level=0).first()
print(df)
                     Glucose  Insulin  Carbs
Hour
2018-05-16 06:43:00    156.0      7.0   65.0
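For completeness, a self-contained reproduction of this approach using the three rows from the question. Note that mask(df == 0) treats real zeros as missing, which matches the desired output shown in the question:
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2018-05-16 06:43:00'] * 3)
df = pd.DataFrame({'Glucose': [156.0, np.nan, np.nan],
                   'Insulin': [0.0, 0.0, 7.0],
                   'Carbs':   [0.0, 65.0, 0.0]},
                  index=pd.Index(idx, name='Hour'))

# Treat 0 as missing, then take the first non-NaN value per column within each timestamp.
merged = df.mask(df == 0).groupby(level=0).first()
print(merged)
#                      Glucose  Insulin  Carbs
# Hour
# 2018-05-16 06:43:00    156.0      7.0   65.0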
