I have created a DataFrame in Python using pandas that has the following output, with Date being the index:
Date Daily Anger Daily Haha Daily Like Daily Love Daily Sad Daily WoW
2019-08-31 1 2.0 132.0 8.0 0.0 5.0
2019-09-30 0 1.0 41.0 4.0 0.0 0.0
2019-10-31 15 1.0 117.0 4.0 0.0 2.0
2019-11-30 0 3.0 84.0 4.0 0.0 4.0
2019-12-31 2 17.0 98.0 20.0 5.0 7.0
I'm trying to get these values into a stacked bar chart where the X axis is the date and the Y axis shows the totals across these metrics.
I've spent the last couple of hours trying to get this to work via Google with no success. Could anyone help me?
If Date is a column, use the x parameter of DataFrame.plot.bar:
df.plot.bar(x='Date', stacked=True)
If Date is a DatetimeIndex, use only the stacked parameter:
df.plot.bar(stacked=True)
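For reference, a minimal, self-contained sketch of the DatetimeIndex case, rebuilt from the sample data above (the axis labels are my own additions):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {
        'Daily Anger': [1, 0, 15, 0, 2],
        'Daily Haha': [2.0, 1.0, 1.0, 3.0, 17.0],
        'Daily Like': [132.0, 41.0, 117.0, 84.0, 98.0],
        'Daily Love': [8.0, 4.0, 4.0, 4.0, 20.0],
        'Daily Sad': [0.0, 0.0, 0.0, 0.0, 5.0],
        'Daily WoW': [5.0, 0.0, 2.0, 4.0, 7.0],
    },
    index=pd.to_datetime(['2019-08-31', '2019-09-30', '2019-10-31',
                          '2019-11-30', '2019-12-31']),
)
df.index.name = 'Date'

# Each date becomes one bar; the metrics stack on top of each other
ax = df.plot.bar(stacked=True)
ax.set_xlabel('Date')
ax.set_ylabel('Total reactions')
plt.tight_layout()
plt.show()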
I have case data presented as a time series. The counts are cumulative from one day to the next; what can I use to turn them into daily case counts?
My dataframe in pandas:
data sum_cases (cumulative)
0 2020-05-02 4.0
1 2020-05-03 21.0
2 2020-05-04 37.0
3 2020-05-05 51.0
I want them to look like this:
data sum_cases(cumulative) daily_cases
0 2020-05-02 4.0 4.0
1 2020-05-03 21.0 17.0
2 2020-05-04 37.0 16.0
3 2020-05-05 51.0 14.0
If your DataFrame does indeed have the data in date order, then you might be able to get away with:
df['daily_cases'] = df['sum_cases'] - df['sum_cases'].shift(fill_value=0)
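A self-contained sketch using the sample above; diff() is an equivalent built-in, provided you fill the first row:
import pandas as pd

df = pd.DataFrame({
    'data': pd.to_datetime(['2020-05-02', '2020-05-03', '2020-05-04', '2020-05-05']),
    'sum_cases': [4.0, 21.0, 37.0, 51.0],
})

# Subtract the previous day's cumulative total; fill_value=0 makes the
# first day's count equal to its own cumulative value
df['daily_cases'] = df['sum_cases'] - df['sum_cases'].shift(fill_value=0)

# Equivalent alternative:
# df['daily_cases'] = df['sum_cases'].diff().fillna(df['sum_cases'])
print(df)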
I have a 3D dataframe with x, y, and time as the third dimension.
The data are five indices from satellite images taken at different times.
The x and y describe each pixel.
x y time SIPI classif
7.620001 -77.849990 2018-04-07 1.011107 2.0
2018-10-14 1.023407 2.0
2018-12-28 0.045107 3.0
2020-01-10 0.351107 2.0
2018-06-29 0.351107 2.0
-77.849899 2018-04-07 1.010777 8.0
2018-10-14 0.510562 2.0
2018-12-28 1.410766 4.0
2020-01-10 1.010666 8.0
2018-06-29 2.057068 8.0
-77.849809 2018-04-07 0.986991 1.0
2018-10-14 0.986991 8.0
2018-12-28 0.986991 5.0
2020-01-10 0.984791 5.0
2018-06-29 0.986991 3.0
-77.849718 2018-04-07 0.975965 10.0
2018-10-14 0.964765 7.0
2018-12-28 0.975965 10.0
2020-01-10 0.975965 10.0
2018-06-29 0.975965 3.0
-77.849627 2018-04-07 1.957747 2.0
2018-10-14 0.132445 6.0
2018-12-28 0.589677 2.0
2020-01-10 1.982445 2.0
2018-06-29 3.334456 7.0
I need to group the data and, as a new column, take the value from the 'classif' column that is most frequent across the five dates. The values are integers between 1 and 10. I want to add a condition that keeps only modes with a frequency of at least 3.
x y classif
7.620001 -77.849990 2.0
-77.849899 8.0
-77.849809 NaN
-77.849718 10.0
-77.849627 2.0
So as a result I need a dataframe where each pixel has the value with the highest frequency, and where that frequency is lower than 3 there should be a NaN value.
Can the pandas groupby function do that? I thought about value_counts(), but I'm not sure how to apply it to my dataset.
Thank you in advance!
Here is a clunky way to do it:
import numpy as np
import pandas as pd

# Get the modes per group and count how often they occur
df_modes = df.groupby(["x", "y"]).agg(
    {
        'classif': [lambda x: pd.Series.mode(x)[0],
                    lambda x: sum(x == pd.Series.mode(x)[0])]
    }
).reset_index()

# Rename the columns to something a bit more readable
df_modes.columns = ["x", "y", "classif_mode", "classif_mode_freq"]

# Discard modes whose frequency was less than 3
df_modes.loc[df_modes["classif_mode_freq"] < 3, "classif_mode"] = np.nan
Now df_modes.drop("classif_mode_freq", axis=1) will return
x y classif_mode
0 7.620001 -77.849990 2.0
1 7.620001 -77.849899 8.0
2 7.620001 -77.849809 NaN
3 7.620001 -77.849718 10.0
4 7.620001 -77.849627 2.0
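For what it's worth, a more compact sketch of the same idea (assuming, as in the display above, that x and y are index levels and classif is a column): a single aggregation function that returns the mode, or NaN when it occurs fewer than 3 times.
import numpy as np
import pandas as pd

def mode_or_nan(s, min_freq=3):
    # value_counts sorts by count descending, so the first entry is a mode
    counts = s.value_counts()
    return counts.index[0] if counts.iloc[0] >= min_freq else np.nan

result = df.groupby(['x', 'y'])['classif'].agg(mode_or_nan).reset_index()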
I am trying to make a graph that shows the average temperature on each day of the year, averaged over 19 years of NOAA data. (Side note: is there a better way to get historical weather data? NOAA's seems super inconsistent.) I was wondering what the best way to set up the data would be. The relevant columns of my data look like this:
DATE PRCP TAVG TMAX TMIN TOBS
0 1990-01-01 17.0 NaN 13.3 8.3 10.0
1 1990-01-02 0.0 NaN NaN NaN NaN
2 1990-01-03 0.0 NaN 13.3 2.8 10.0
3 1990-01-04 0.0 NaN 14.4 2.8 10.0
4 1990-01-05 0.0 NaN 14.4 2.8 11.1
... ... ... ... ... ... ...
10838 2019-12-27 0.0 NaN 15.0 4.4 13.3
10839 2019-12-28 0.0 NaN 14.4 5.0 13.9
10840 2019-12-29 3.6 NaN 15.0 5.6 14.4
10841 2019-12-30 0.0 NaN 14.4 6.7 12.2
10842 2019-12-31 0.0 NaN 15.0 6.7 13.9
10843 rows × 6 columns
The DATE column is of type datetime64[ns].
Here's my code:
import pandas as pd
from matplotlib import pyplot as plt
data = pd.read_csv('1990-2019.csv')
# separate the data by station
oceanside = data[data.STATION == 'USC00047767']
downtown = data[data.STATION == 'USW00023272']
oceanside.loc[:,'DATE'] = pd.to_datetime(oceanside.loc[:,'DATE'],format='%Y-%m-%d')
#This is the area I need help with:
oceanside['DATE'].dt.year
I've been trying to separate the data by year so I can then average it. I would like to do this without a for loop, because I plan on doing this with much larger data sets, and that would be super inefficient. I looked in the pandas documentation but couldn't find a function that seemed to do this. Am I missing something? Is that even the right way to do it?
I am new to pandas/python data analysis so it is very possible the answer is staring me in the face.
Any help would be greatly appreciated!
Create a dict of dataframes where each key is a year:
df_by_year = dict()
for year in oceanside.DATE.dt.year.unique():
    data = oceanside[oceanside.DATE.dt.year == year]
    df_by_year[year] = data
Get the data for a single year:
oceanside[oceanside.DATE.dt.year == 2019]
Get the average for each year:
oceanside.groupby(oceanside.DATE.dt.year).mean(numeric_only=True)
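Note that the question asks for the average of each calendar day across all 19 years; if that is the goal, grouping by day of year works the same way. A sketch, assuming DATE has already been converted with pd.to_datetime as above:
# One row per day of year (1-366), averaged across all years
day_of_year_avg = oceanside.groupby(oceanside.DATE.dt.dayofyear)[['TMAX', 'TMIN']].mean()
day_of_year_avg.plot()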
I want to calculate a rolling mean of different window sizes for each ticker in my dataframe. Ideally I could pass a list of window sizes and for each ticker I would get new columns (one for each rolling mean size). So if I wanted a rolling mean of 2 and one of 3, the output would be two columns for each ticker.
import datetime as dt
import numpy as np
import pandas as pd
Dt_df = pd.DataFrame({"Date":pd.date_range('2018-07-01', periods=5, freq='D')})
Tick_df = pd.DataFrame({"Ticker":['ABC',"HIJ","XYZ"]})
Mult_df = pd.merge(Tick_df.assign(key='x'), Dt_df.assign(key='x'), on='key').drop('key', axis=1)
df2 = pd.DataFrame(np.random.randint(low=5, high=10, size=(15, 1)), columns=['Price'])
df3 = Mult_df.join(df2, how='outer')
df3.set_index(['Ticker','Date'],inplace = True)
When I try to apply this function to the example dataset built above:
def my_RollMeans(x):
    w = [1, 2, 3]
    s = pd.Series(x)
    Bob = pd.DataFrame([s.rolling(w1).mean() for w1 in w]).T
    return Bob
to my dataframe df3 using various versions of apply or transform I get errors.
NewDF = df3.groupby('Ticker').Price.transform(my_RollMeans).fillna(0)
The latest error is:
Data must be 1-dimensional
IIUC, try using apply instead: transform must return a one-dimensional result aligned with its input, which is why a function that returns a DataFrame raises Data must be 1-dimensional, while apply accepts it. I also made a modification to your custom function so each rolling mean gets its own column name:
def my_RollMeans(x):
    w = [1, 2, 3]
    s = pd.Series(x)
    Bob = pd.DataFrame([s.rolling(w1).mean().rename('Price_' + str(w1)) for w1 in w]).T
    return Bob
df3.groupby('Ticker').apply(lambda x : my_RollMeans(x.Price)).fillna(0)
Output:
Price_1 Price_2 Price_3
Ticker Date
ABC 2018-07-01 9.0 0.0 0.000000
2018-07-02 8.0 8.5 0.000000
2018-07-03 7.0 7.5 8.000000
2018-07-04 8.0 7.5 7.666667
2018-07-05 8.0 8.0 7.666667
HIJ 2018-07-01 8.0 0.0 0.000000
2018-07-02 9.0 8.5 0.000000
2018-07-03 5.0 7.0 7.333333
2018-07-04 6.0 5.5 6.666667
2018-07-05 7.0 6.5 6.000000
XYZ 2018-07-01 9.0 0.0 0.000000
2018-07-02 5.0 7.0 0.000000
2018-07-03 9.0 7.0 7.666667
2018-07-04 8.0 8.5 7.333333
2018-07-05 6.0 7.0 7.666667
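For completeness, a loop over the window sizes with transform also works here, since each individual rolling mean is one-dimensional per group. A sketch under the same setup:
NewDF = df3.copy()
for w in [1, 2, 3]:
    # transform returns one value per input row, so alignment is automatic
    NewDF['Price_' + str(w)] = df3.groupby('Ticker')['Price'].transform(
        lambda s, w=w: s.rolling(w).mean()
    )
NewDF = NewDF.fillna(0)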
I'm new to pandas and working with dataframes. I have a rather simple problem that I think should have a straightforward solution, but it is not clear to me (I do not know pandas that well).
I have many occurrences of rows with the same index in my data frame:
Glucose Insulin Carbs
Hour
2018-05-16 06:43:00 156.0 0.0 0.0
2018-05-16 06:43:00 NaN 0.0 65.0
2018-05-16 06:43:00 NaN 7.0 0.0
And I would like to merge them to get this, a row which contains all the information available at a given time index:
Glucose Insulin Carbs
Hour
2018-05-16 06:43:00 156.0 7.0 65.0
2018-05-16 06:43:00 NaN 0.0 65.0
2018-05-16 06:43:00 NaN 7.0 0.0
Afterwards I would drop all rows which contain NaN in any column to get:
Glucose Insulin Carbs
Hour
2018-05-16 06:43:00 156.0 7.0 65.0
The problem is that in the same dataframe I have duplicates with less information, maybe only Carbs or Insulin.
Glucose Insulin Carbs
Hour
2018-05-19 06:15:00 NaN 1.5 0.0
2018-05-19 06:15:00 229.0 0.0 0.0
I already know the indices of these entries:
bad_indices = _df[ _df.Glucosa.isnull() ].index
What I would like to know is whether there's a nice Pythonic way to do such a task (for both the two-row and three-row cases): maybe a pandas built-in method, or something semi-standard, or at least readable, because I don't want to write ugly (and easily breakable) code with explicit handling for each case.
You can replace 0 with NaN and then take the first non-NaN value per group:
df = df.mask(df == 0).groupby(level=0).first()
print (df)
Glucose Insulin Carbs
Hour
2018-05-16 06:43:00 156.0 7.0 65.0
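Put together with the drop step described in the question, a self-contained sketch of the three-row example:
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2018-05-16 06:43:00'] * 3)
df = pd.DataFrame(
    {'Glucose': [156.0, np.nan, np.nan],
     'Insulin': [0.0, 0.0, 7.0],
     'Carbs': [0.0, 65.0, 0.0]},
    index=idx,
)
df.index.name = 'Hour'

# Treat 0 as missing, then keep the first non-NaN value per timestamp
merged = df.mask(df == 0).groupby(level=0).first()

# The follow-up step from the question: drop rows still missing a value
merged = merged.dropna()
print(merged)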