I have a dataframe as mentioned below:
Date,Time,Price,Volume
31/01/2019,09:15:00,10691.50,600
31/01/2019,09:15:01,10709.90,13950
31/01/2019,09:15:02,10701.95,9600
31/01/2019,09:15:03,10704.10,3450
31/01/2019,09:15:04,10700.05,2625
31/01/2019,09:15:05,10700.05,2400
31/01/2019,09:15:06,10698.10,3000
31/01/2019,09:15:07,10699.90,5925
31/01/2019,09:15:08,10699.25,5775
31/01/2019,09:15:09,10700.45,5925
31/01/2019,09:15:10,10700.00,4650
31/01/2019,09:15:11,10699.40,8025
31/01/2019,09:15:12,10698.95,5025
31/01/2019,09:15:13,10698.45,1950
31/01/2019,09:15:14,10696.15,3900
31/01/2019,09:15:15,10697.15,2475
31/01/2019,09:15:16,10697.05,4275
31/01/2019,09:15:17,10696.25,3225
31/01/2019,09:15:18,10696.25,3300
The data frame contains approximately 8000 rows. I want to plot both price and volume in the same chart. (Volume range: 0 - 8,00,000)
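If you want to compare price and volume against time, try this:
import pandas as pd
df = pd.read_csv('your_path_here')
# Price is drawn on a secondary (right-hand) y-axis because its scale differs from Volume
df.plot('Time', ['Price', 'Volume'], secondary_y='Price')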
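Here is a minimal self-contained sketch of the same idea, assuming a few rows from the question are inlined instead of read from a file. The `Agg` backend line is only there so it runs without a display, and the axis labels are my own choice.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import pandas as pd

# A few rows copied from the question, inlined so the sketch is self-contained
csv_text = """Date,Time,Price,Volume
31/01/2019,09:15:00,10691.50,600
31/01/2019,09:15:01,10709.90,13950
31/01/2019,09:15:02,10701.95,9600
31/01/2019,09:15:03,10704.10,3450
"""
df = pd.read_csv(io.StringIO(csv_text))

# Volume stays on the left axis; Price gets its own scale on the right axis
ax = df.plot(x="Time", y=["Price", "Volume"], secondary_y=["Price"])
ax.set_ylabel("Volume")
ax.right_ax.set_ylabel("Price")
```

pandas attaches the secondary axes to the returned object as `ax.right_ax`, which is why the two y-labels can be set separately.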
Edit: x-axis customization
Since you want x-axis customization, try this (it is just a basic example you can follow):
# Create a Datetime column while parsing the csv file
df = pd.read_csv('your_path_here', parse_dates= {'Datetime': ['Date', 'Time']})
Then you need to create two lists, one containing the positions on the x-axis and the other the labels.
Say you want labels every 5 seconds (your request for labels every 30 minutes is possible, but not with the data you provided):
positions = [p for p in df.Datetime if p.second in range(0, 60, 5)]
labels = [l.strftime('%H:%M:%S') for l in positions]
Then plot, passing the positions and labels lists to set_xticks and set_xticklabels:
ax = df.plot('Datetime', ['Price', 'Volume'], secondary_y='Price')
ax.set_xticks(positions)
ax.set_xticklabels(labels)
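The tick selection can be tried without plotting anything. The sketch below is one way to do it, assuming the Datetime column is built explicitly with `pd.to_datetime` (recent pandas releases deprecate combining columns through `parse_dates`). The helper name `tick_positions` and its parameters are hypothetical.

```python
import pandas as pd


def tick_positions(datetimes, every_seconds=5, fmt="%H:%M:%S"):
    """Return (positions, labels) for timestamps whose second is a multiple of every_seconds."""
    positions = [t for t in datetimes if t.second % every_seconds == 0]
    labels = [t.strftime(fmt) for t in positions]
    return positions, labels


# Twelve one-second rows in the question's day-first date format
df = pd.DataFrame({"Date": ["31/01/2019"] * 12,
                   "Time": [f"09:15:{s:02d}" for s in range(12)]})
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%d/%m/%Y %H:%M:%S")

positions, labels = tick_positions(df["Datetime"])
print(labels)  # ['09:15:00', '09:15:05', '09:15:10']
```

The two lists can then be passed to `ax.set_xticks(positions)` and `ax.set_xticklabels(labels)` as in the answer.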
I have a dataframe as follows:
I need something like this for each column (stress, depression and anxiety), showing each participant's data in each category.
I wrote the following Python code:
ax = data_full.plot(x="participants", y=["Stress","Depression","Anxiety"],kind="line", lw=3, ls='--', figsize = (12,6))
plt.grid(True)
plt.show()
and got output like this:
Split the participant column and merge it with the original data frame. Change the data frame to a data frame with only the columns you need in the merged data frame. Transform the data frame in its final form by pivoting. The resulting data frame is then used as the basis for the graph. Now we can adjust the x-axis tick marks, the legend position, and the y-axis limits.
dfs = pd.concat([df,df['participants'].str.split('_', expand=True)],axis=1)
dfs.columns = ['Stress', 'Depression', 'Anxiety', 'participants', 'category', 'group']
# Keep only the columns needed, then pivot to one column per group
fin_df = dfs[['category','group','Stress']]
fin_df = fin_df.pivot(index='category', columns='group', values='Stress')
# update
fin_df = fin_df.sort_index(ascending=False)
g = fin_df.plot(kind='line', title='Stress')
g.set_xticks([0,1])
g.set_xticklabels(['pre','post'])
g.legend(loc='center right')
g.set_ylim(5,25)
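To see what the split and pivot produce, here is a small sketch on made-up data. It assumes the participant labels look like `<category>_<group>` (for example `pre_p1`), which is my guess at the format; the values are invented.

```python
import pandas as pd

# Hypothetical data: labels are assumed to follow "<category>_<group>"
df = pd.DataFrame({
    "participants": ["pre_p1", "post_p1", "pre_p2", "post_p2"],
    "Stress": [20, 12, 18, 9],
})

# Split the label into two new columns and attach them to the original frame
parts = df["participants"].str.split("_", expand=True)
parts.columns = ["category", "group"]
dfs = pd.concat([df, parts], axis=1)

# One row per category (pre/post), one column per participant
fin_df = dfs.pivot(index="category", columns="group", values="Stress")
# "pre" sorts after "post" alphabetically, so reverse the index order
fin_df = fin_df.sort_index(ascending=False)
print(fin_df)
```

Calling `fin_df.plot(kind='line')` on this frame draws one line per participant, going from pre to post.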
I am trying to plot multiple different lines on the same graph using pandas and matplotlib.
I have a series of 100 synthetic temperature histories that I want to plot, all in grey, against the original real temperature history I used to generate the data.
How can I plot all of these series on the same graph? I know how to do it in MATLAB but am using Python for this project, and pandas has been the easiest way I have found to read in every single column from the output file without having to specify each column individually. The number of columns of data will change from 100 up to 1000, so I need a generic solution. My code plots each of the data series individually fine, but I just need to work out how to add them both to the same figure.
Here is the code so far:
# dirPath is the path to my working directory
outputFile = "output.csv"
original_data = "temperature_data.csv"
# Read in the synthetic temperatures from the output file, time is the index in the first column
data = pd.read_csv(outputFile,header=None, skiprows=1, index_col=0)
# Read in the original temperature data, time is the index in the first column
orig_data = pd.read_csv(dirPath+original_data,header=None, skiprows=1, index_col=0)
# Convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
# Plot all columns of synthetic data in grey
data = data.plot.line(title="ARMA Synthetic Temperature Histories",
xlabel="Time (yrs)",
ylabel=("Synthetic average hourly temperature (C)"),
color="#929591",
legend=None)
# Plot one column of original data in black
orig_data = orig_data.plot.line(color="k",legend="Original temperature data")
# Create and save figure
fig = data.get_figure()
fig = orig_data.get_figure()
fig.savefig("temp_arma.png")
This is some example data for the output data:
And this is the original data:
Plotting each individually gives these graphs - I just want them overlaid!
Your data.plot.line call returns an AxesSubplot instance; you can capture it and pass it to your second plotting command:
# plot 1
ax = data.plot.line(…)
# plot 2
orig_data.plot.line(…, ax=ax)
Try to run this code:
# convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
# Plot all columns of synthetic data in grey
ax = data.plot.line(title="ARMA Synthetic Temperature Histories",
xlabel="Time (yrs)",
ylabel=("Synthetic average hourly temperature (C)"),
color="#929591",
legend=None)
# Plot one column of original data in black
orig_data.plot.line(color="k",legend="Original temperature data", ax=ax)
# Create and save figure
ax.figure.savefig("temp_arma.png")
You should use matplotlib functions directly. They offer more control and are easy to use as well.
Part 1 - Reading files (borrowing your code)
# Read in the synthetic temperatures from the output file, time is the index in the first column
data = pd.read_csv(outputFile,header=None, skiprows=1, index_col=0)
# Read in the original temperature data, time is the index in the first column
orig_data = pd.read_csv(dirPath+original_data,header=None, skiprows=1, index_col=0)
# Convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
Part 2 - Plotting
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 8))
# Plot all synthetic histories in grey.
# Time is the index (index_col=0), so every column is a data series.
for col in data.columns:
    ax.plot(data.index, data[col], color='grey')
# Original data: its single column, in black
ax.plot(orig_data.index, orig_data.iloc[:, 0], color='k')
# Axis labels
ax.set_xlabel("Time (yrs)")
ax.set_ylabel("Average hourly temperature (C)")
fig.savefig("temp_arma.png")
Try the following:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data.plot.line(ax=ax,
title="ARMA Synthetic Temperature Histories",
xlabel="Time (yrs)",
ylabel="Synthetic average hourly temperature (C)",
color="#929591",
legend=False)
orig_data.rename(columns={orig_data.columns[0]: "Original temperature data"},
inplace=True)
orig_data.plot.line(ax=ax, color="k")
It's pretty much your original code with the following slight modifications:
Getting the ax object
fig, ax = plt.subplots()
and using it for the plotting
data.plot.line(ax=ax, ...
...
orig_data.plot.line(ax=ax, ...)
Result for some randomly generated sample data:
import random # For sample data only
# Sample data
data = pd.DataFrame({
f'col_{i}': [random.random() for _ in range(25)]
for i in range(1, 50)
})
orig_data = pd.DataFrame({
'col_0': [random.random() for _ in range(25)]
})
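Putting the pieces together, a runnable sketch of the overlay on random sample data might look like this. The sizes `n_steps` and `n_series` are arbitrary, and the `Agg` backend line is only there so it runs without a display.

```python
import random

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

random.seed(0)
n_steps, n_series = 25, 49

# Random stand-ins for the synthetic histories and the original series
data = pd.DataFrame({f"col_{i}": [random.random() for _ in range(n_steps)]
                     for i in range(1, n_series + 1)})
orig_data = pd.DataFrame({"Original temperature data":
                          [random.random() for _ in range(n_steps)]})

fig, ax = plt.subplots()
# Every synthetic column goes onto the shared Axes in grey
data.plot.line(ax=ax, color="#929591", legend=False)
# The original series is drawn last so it sits on top, in black
orig_data.plot.line(ax=ax, color="k")
```

Because both calls receive the same `ax`, the number of columns in `data` does not matter; it works the same for 100 or 1000 series.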
I am trying to figure out how to aggregate several columns of the same grouped pandas data. I need this to build a stacked line chart that draws on different aggregations of the same grouped data. Is there a compact way to do this in pandas?
my current attempt:
import pandas as pd
import matplotlib.pyplot as plt
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
df_re = df[df['retail_item'].str.contains("GROUND BEEF")]
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_rei = df_rei.reset_index(level=[0,1])
df_rei['week'] = pd.DatetimeIndex(df_rei['date']).week
df_rei['year'] = pd.DatetimeIndex(df_rei['date']).year
df_rei['week'] = df_rei['date'].dt.strftime('%W').astype('uint8')
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
Similarly, I also need to aggregate the data like this:
df_re['price_gap'] = df_re['high_price'] - df_re['low_price']
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_rei1 = dff_rei1.reset_index(level=[0,1])
dff_rei1['week'] = pd.DatetimeIndex(dff_rei1['date']).week
dff_rei1['year'] = pd.DatetimeIndex(dff_rei1['date']).year
dff_rei1['week'] = dff_rei1['date'].dt.strftime('%W').astype('uint8')
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
Problem
When I aggregate the data, these lines are nearly identical:
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
and
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
I think a better way might be a custom function with *args and **kwargs that switches which columns are aggregated, but how should I show a stacked line chart where each y-axis shows a different quantity? Is that doable in pandas?
Line plot
I produced the line chart as follows:
for g, d in df_ret_df1.groupby('retail_item'):
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    sns.lineplot(x='week', y='vals', hue='mm', data=d, alpha=.8)
    y1 = d[d.mm == 'max']
    y2 = d[d.mm == 'min']
    plt.fill_between(x=y1.week, y1=y1.vals, y2=y2.vals)
    for year in df['year'].unique():
        data = df_rei[(df_rei.date.dt.year == year) & (df_rei.retail_item == g)]
        sns.lineplot(x='week', y='price_gap', ci=None, data=data, palette=cmap, label=year, alpha=.8)
I want to reduce this duplication so that I can aggregate on different columns and make a stacked line chart (two vertical subplots) with the week on a shared x-axis: one subplot shows the number of ads on its y-axis and the other shows the price range for the same items over 52 weeks. Is there a better way of doing this?
This answer builds on the one by Andreas who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to implement that solution specifically to your case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:
The dates in the original dataset are already on a weekly frequency so groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
This example uses pandas and matplotlib so the variables do not need to be stacked as when using seaborn.
The aggregation step produces a MultiIndex for the columns. You can access the aggregated variables (min, max, mean) of each high-level variable by using df.xs.
The date is set as the index of the aggregated dataframe to use as the x variable. Using the DatetimeIndex as the x variable gives you more flexibility for formatting the tick labels and ensures that the data is always plotted in chronological order.
It is not clear in the question how the data for separate years should be displayed (in separate figures?) so here the entire time series is shown in a single figure.
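The point about the column MultiIndex and df.xs is easier to see on a toy frame. The values below are made up, and the aggregation names are given as strings.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are invented)
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01",
                            "2021-01-08", "2021-01-08"]),
    "number_of_ads": [10, 30, 20, 40],
    "price_gap": [1.0, 3.0, 2.0, 6.0],
})

# One dictionary describes every aggregation, so nothing is repeated
agg_dict = {"number_of_ads": ["min", "max", "mean"],
            "price_gap": ["min", "max", "mean"]}
agg = df.groupby("date").agg(agg_dict)

# Columns are now a two-level MultiIndex: (variable, statistic).
# xs selects all statistics of one variable and drops the top level.
ads = agg.xs("number_of_ads", axis=1)
print(ads)
```

`ads` has plain columns `min`, `max` and `mean`, indexed by date, which is the shape that `fill_between` and `plot` need.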
Import dataset and aggregate it as needed
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])
# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
agg_dict = {'number_of_ads': ['min', 'max', 'mean'],
            'price_gap': ['min', 'max', 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
.reset_index('retail_item'))
df_gbeef_agg
Plot aggregated variables in single figure containing small multiples
variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()
fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')
for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]
        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')
        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W') # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)
        # Format subplot according to position within the figure
        # Note: these helpers were deprecated in Matplotlib 3.4; newer versions
        # use ax.get_subplotspec().is_first_row() etc.
        if ax.is_first_row():
            ax.set_title(item, pad=10)
        if ax.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ax.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ax.is_last_col():
            ax.legend(frameon=False)
fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
             size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);
I am not sure if this answers your question fully, but based on your headline I guess it all boils down to:
import pandas as pd
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
# define which columns to group and in which way
dct = {'low_price': [max, min],
'high_price': min,
'year': 'mean'}
# actually group the columns
df.groupby(['region']).agg(dct)
Output:
               low_price         high_price         year
                     max    min         min         mean
region
ALASKA             16.99   1.33        1.33  2020.792123
HAWAII             12.99   1.33        1.33  2020.738318
MIDWEST            28.73   0.99        0.99  2020.690159
NORTHEAST          19.99   1.20        1.99  2020.709916
NORTHWEST          16.99   1.33        1.33  2020.736397
SOUTH CENTRAL      28.76   1.20        1.49  2020.700980
SOUTHEAST          21.99   1.33        1.48  2020.699655
SOUTHWEST          16.99   1.29        1.29  2020.704341
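If the two-level column header is inconvenient, pandas' named aggregation gives flat column names instead. This is an alternative to the dictionary form, not what the answer above used. The rows and the output column names below are made up.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
df = pd.DataFrame({
    "region": ["ALASKA", "ALASKA", "HAWAII"],
    "low_price": [1.33, 16.99, 12.99],
    "high_price": [1.33, 17.99, 13.99],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = df.groupby("region").agg(
    low_price_max=("low_price", "max"),
    low_price_min=("low_price", "min"),
    high_price_min=("high_price", "min"),
)
print(summary)
```

The result has one ordinary column per keyword, so it can be passed to plotting code without `xs`.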
I have 2 dataframes:
'stock' is a dataframe with columns Date and Price.
'events' is a dataframe with columns Date and Text.
My goal is to produce a graph of the stock prices and place dots on the line where the events occur. However, I do not know how to set the 'y' value for the events dataframe, as I want each dot to sit at the corresponding price in the stock dataframe.
I am able to plot the first dataframe fine with:
plt.plot('Date', 'Price', data=stock)
And I try to plot the event dots with:
plt.scatter('created_at', ???, data=events)
However, it is the ??? that I don't know how to set
Assuming Date and created_at are datetime:
stock = pd.DataFrame({'Date':['2021-01-01','2021-02-01','2021-03-01','2021-04-01','2021-05-01'],'Price':[1,5,3,4,10]})
events = pd.DataFrame({'created_at':['2021-02-01','2021-03-01'],'description':['a','b']})
stock.Date = pd.to_datetime(stock.Date)
events.created_at = pd.to_datetime(events.created_at)
Filter stock by events.created_at (or merge) and plot them onto the same ax:
stock_events = stock[stock.Date.isin(events.created_at)]
# or merge on the date columns
# stock_events = stock.merge(events, left_on='Date', right_on='created_at')
ax = stock.plot(x='Date', y='Price')
stock_events.plot.scatter(ax=ax, x='Date', y='Price', label='Event', c='r', s=50)
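The isin filter only finds events whose timestamp matches a stock date exactly. If an event can fall between two stock dates, one option is `pd.merge_asof`, which looks up the most recent price at or before each event. The sketch below is an alternative under that assumption and uses made-up data.

```python
import pandas as pd

stock = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
    "Price": [1, 5, 3],
})
events = pd.DataFrame({
    "created_at": pd.to_datetime(["2021-01-15", "2021-03-01"]),
    "description": ["a", "b"],
})

# Both frames must be sorted by their key. The default direction is
# "backward", so each event takes the latest price at or before it.
stock_events = pd.merge_asof(
    events.sort_values("created_at"),
    stock.sort_values("Date"),
    left_on="created_at",
    right_on="Date",
)
print(stock_events[["created_at", "description", "Price"]])
```

The `Price` column of `stock_events` can then serve as the y value in `plt.scatter`, with `created_at` as x.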
I have a few monthly datasets of usage stats stored in different CSVs, each with a couple hundred fields. I am keeping only the top 30 of each one, but the bottom of that list will change (and the top changes as well when stuff is banned, albeit less commonly). Currently the lines represent months, but I want the points to be (x=month, y=usage %), with the legend showing the different users.
column[0] is their number in the file (1-30)
column[1] is their name
column[2] is the usage percent
AprilStats = pd.read_csv(r'filepath', nrows=30)
MayStats = pd.read_csv(r'filepath', nrows=30)
JuneStats = pd.read_csv(r'filepath', nrows=30)
## Assign labels and sources
labels = [[AprilStats.columns[1]], [MayStats.columns[1]], [JuneStats.columns[1]]]
AprilUsage=np.array(AprilStats[AprilStats.columns[2]].tolist())
MayUsage=np.array(MayStats[MayStats.columns[2]].tolist())
JuneUsage=np.array(JuneStats[JuneStats.columns[2]].tolist())
x = np.array(AprilStats[AprilStats.columns[0]].tolist())
y = np.array(AprilStats[AprilStats.columns[2]].tolist())
my_xticks = AprilStats[AprilStats.columns[1]].tolist()
plt.xticks(x, my_xticks, rotation='55')
x1 = np.array(MayStats[MayStats.columns[0]].tolist())
y1 = np.array(MayStats[MayStats.columns[2]].tolist())
my_xticks1 = MayStats[MayStats.columns[1]].tolist()
plt.xticks(x, my_xticks1, rotation='55')
x2 = np.array(JuneStats[JuneStats.columns[0]].tolist())
y2 = np.array(JuneStats[JuneStats.columns[2]].tolist())
my_xticks2 = JuneStats[JuneStats.columns[1]].tolist()
plt.xticks(x, my_xticks2, rotation='55',)
### Plot the data
plt.rc('xtick', labelsize='xx-small')
plt.title('Little Cup Usage')
plt.ylabel('Usage (Percent)')
plt.plot(x,y,label='April', color='green', alpha=.4)
plt.plot(x1,y1,label='May', color='blue', alpha=.4)
plt.plot(x2,y2,label='June', color='red', alpha=.4)
plt.subplots_adjust(bottom=.2)
plt.legend()
plt.savefig('90daytest.png', dpi=500)
plt.show()
I think I am mislabeling them, but the month of usage isn't stored in the file. I reckon I could add it, but I'd rather not have to edit these files every month. Also, sorry if this is horribly inefficient code; I started learning Python less than two weeks ago and this is a little project for me to learn with.
I'd divide this into two steps:
Gather all the data into a single dataframe in which the rows correspond to the different months, the columns to the different names and the values are the usage %.
Plot each column as a different series in a scatter plot.
Step 1:
import datetime as dt
import pandas as pd

# Create a dictionary associating a file to each month
files = {dt.date(2019, 4, 1): 'april.csv',
         dt.date(2019, 5, 1): 'may.csv'}
# For each file, build a one-row data frame as follows and collect it:
#   month       name1  name2  ...
#   2019-04-01  0.5    0.2
frames = []
for month, file in files.items():
    data = pd.read_csv(file, usecols=['name', 'usage'], index_col='name')
    data = data.transpose()
    data['month'] = month
    data = data.set_index('month')
    frames.append(data)
# DataFrame.append was removed in pandas 2.0, so combine the rows with concat
df = pd.concat(frames)
Step 2:
# New figure
fig = plt.figure()
# Plot one series for each column in df
for name in df.columns:
    plt.scatter(x=df.index, y=df[name], label=name)
# Additional plot formatting code here
plt.show()
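The reshaping in Step 1 can also be done by stacking the monthly tables and pivoting once. Here is a sketch with in-memory frames standing in for the CSV files; the column names `name` and `usage` and the month keys are assumptions, and the values are made up.

```python
import pandas as pd

# Made-up monthly tables standing in for the CSV files
monthly = {
    "2019-04": pd.DataFrame({"name": ["A", "B"], "usage": [30.0, 20.0]}),
    "2019-05": pd.DataFrame({"name": ["A", "C"], "usage": [25.0, 22.0]}),
}

# Stack the tables; the dictionary keys become an index level called "month"
long_df = pd.concat(monthly, names=["month"]).reset_index(level="month")

# Rows are months, columns are names, values are the usage %
wide = long_df.pivot(index="month", columns="name", values="usage")
print(wide)
```

Names missing from a month's top list show up as NaN, so a user who drops out of the top 30 simply has a gap. `wide.plot(marker="o")` then draws one line per name with the month on the x-axis.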
I hope that helps.