How to make many plots with multiple groupby in pandas? - python

Sorry, I couldn't find how to do this by searching, so I am asking here.
Here is a sandbox data table:
mode X Y
0 1 3 10
1 1 4 11
2 1 3 12
3 1 4 13
4 2 3 14
5 2 4 15
6 2 3 16
7 2 4 17
I wrote the following sandbox code. I want a plot with TWO lines corresponding to the two modes ('mode 1' and 'mode 2'). The x-axis should show 3 and 4. For mode 1 the line should go from (3, (10+12)/2) to (4, (11+13)/2), i.e. Y averaged over the rows that share the same X, and analogously from (3, 15) to (4, 16) for mode 2.
But this code doesn't even run.
#!/usr/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1,1,1,1,2,2,2,2],[3,4,3,4,3,4,3,4],list(range(10,18))]).T
df.columns = ['mode','X','Y']
mode = df.groupby(['mode'])['mode'].mean()
Ox = df.groupby(['X'])['X'].mean()
Oy = df.groupby(['mode','X'])['Y'].mean()
for x in mode:
    plt.plot(Ox, Oy[Oy['mode'== x]] , label = 'test' + x)
plt.savefig('testpandas.pdf')

You might want to try the seaborn package, which has a lot of functionality for stuff like this
import seaborn as sns
sns.lmplot(data=df,hue='mode',x='X',y='Y',x_estimator=np.mean)
Here's one way to do it in plain pandas:
y_means=df.groupby(['mode','X'],as_index=False).mean()
for mode,g in y_means.groupby('mode'):
    plt.plot(g['X'],g['Y'],'o-',label = 'mode = ' + str(mode))
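The grouped means used in the plain-pandas approach can be checked numerically before plotting. The following is a minimal sketch, assuming the same sandbox data as in the question; reshaping with `unstack` is one possible way to get one column per mode, not necessarily what the answer's author had in mind.

```python
import pandas as pd

# Same sandbox data as in the question
df = pd.DataFrame({
    "mode": [1, 1, 1, 1, 2, 2, 2, 2],
    "X": [3, 4, 3, 4, 3, 4, 3, 4],
    "Y": list(range(10, 18)),
})

# Mean of Y for every (mode, X) pair, then move "mode" into the columns
means = df.groupby(["mode", "X"])["Y"].mean().unstack("mode")
print(means)
```

With the data in this shape, `means.plot(marker="o")` should draw one line per mode with X on the horizontal axis.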

This is an answer from the person who asked the question.
I actually found a solution myself:
#!/usr/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1,1,1,1,2,2,2,2],[3,4,3,4,3,4,3,4],list(range(10,18))]).T
df.columns = ['mode','X','Y']
mode = df.groupby(['mode'])['mode'].mean()
Ox = df.groupby(['X'])['X'].mean()
Oy = df.groupby(['mode','X'])['Y'].mean()
for x in mode:
    plt.plot(Ox, Oy[mode[x]] , label = 'test' + str(x))
plt.savefig('testpandas.png')

I would guess the easiest way to do this is to use a pivot_table. This reduces the whole thing to two lines:
piv = pd.pivot_table(df, columns="mode", index="X")
plt.plot(piv)
or even only one, if you use pandas integrated plotting functionality:
pd.pivot_table(df, columns="mode", index="X").plot()
The complete solution using matplotlib:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1,1,1,1,2,2,2,2],[3,4,3,4,3,4,3,4],list(range(10,18))]).T
df.columns = ['mode','X','Y']
piv = pd.pivot_table(df, columns="mode", index="X")
print(piv)
plt.plot(piv)
plt.legend(labels=["mode {}".format(c[1]) for c in piv.columns.values])
plt.show()
which prints the pivot table as
       Y
mode   1   2
X
3     11  15
4     12  16
and creates the plot
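A small variation on the pivot-table approach, offered as a sketch: passing `values="Y"` explicitly gives a single-level column index, so the legend labels can be built from the column values directly instead of indexing into tuples. The data is the sandbox table from the question; the variable names `piv` and `labels` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "mode": [1, 1, 1, 1, 2, 2, 2, 2],
    "X": [3, 4, 3, 4, 3, 4, 3, 4],
    "Y": list(range(10, 18)),
})

# With values="Y" the columns are just the mode values (1 and 2),
# not (Y, 1) and (Y, 2) tuples
piv = pd.pivot_table(df, values="Y", index="X", columns="mode", aggfunc="mean")

# Legend labels built from the flat column index
labels = ["mode {}".format(c) for c in piv.columns]
print(piv)
print(labels)
```

These labels could then be passed to `plt.legend(labels=labels)` after `plt.plot(piv)`.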

Related

boxplot structure disappears when pandas contains nan [duplicate]

I am using matplotlib to draw box plots, but the data contain some missing values (NaN). No box is displayed for the columns that contain NaN values.
Do you know how to solve this problem?
Here is the code.
import numpy as np
import matplotlib.pyplot as plt
#==============================================================================
# open data
#==============================================================================
filename='C:\\Users\\liren\\OneDrive\\Data\\DATA in the first field-final\\ks.csv'
AllData=np.genfromtxt(filename,delimiter=";",skip_header=0,dtype='str')
TreatmentCode = AllData[1:,0]
RepCode = AllData[1:,1]
KsData= AllData[1:,2:].astype('float')
DepthHeader = AllData[0,2:].astype('float')
TreatmentUnique = np.unique(TreatmentCode)[[3,1,4,2,8,6,9,7,0,5,10],]
nT = TreatmentUnique.size#nT=number of treatments
#nD=number of deepth;nR=numbers of replications;nT=number of treatments;iT=iterms of treatments
nD = 5
nR = 6
KsData_3D = np.zeros((nT,nD,nR))
for iT in range(nT):
    Treatment = TreatmentUnique[iT]
    TreatmentFilter = TreatmentCode == Treatment
    KsData_Filtered = KsData[TreatmentFilter,:]
    KsData_3D[iT,:,:] = KsData_Filtered.transpose()
iD = 4
fig=plt.figure()
ax = fig.add_subplot(111)
plt.boxplot(KsData_3D[:,iD,:].transpose())
ax.set_xticks(range(1,nT+1))
ax.set_xticklabels(TreatmentUnique)
ax.set_title(DepthHeader[iD])
Here is the final figure; the boxes for some of the treatments are missing.
You can remove the NaNs from the data first, then plot the filtered data.
To do that, first find the NaNs using np.isnan(data), then invert that Boolean array with the bitwise inversion operator ~. Indexing the data array with the result filters out the NaNs.
filtered_data = data[~np.isnan(data)]
In a complete example (adapted from here)
Tested in python 3.10, matplotlib 3.5.1, seaborn 0.11.2, numpy 1.21.5, pandas 1.4.2
For 1D data:
import matplotlib.pyplot as plt
import numpy as np
# fake up some data
np.random.seed(2022) # so the same data is created each time
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)
# Add a NaN
data[40] = np.nan
# Filter data using np.isnan
filtered_data = data[~np.isnan(data)]
# basic plot
plt.boxplot(filtered_data)
plt.show()
For 2D data:
For 2D data, you can't simply use the mask above, since then each column of the data array would have a different length. Instead, we can create a list, with each item in the list being the filtered data for each column of the data array.
A list comprehension can do this in one line: [d[m] for d, m in zip(data.T, mask.T)]
import matplotlib.pyplot as plt
import numpy as np
# fake up some data
np.random.seed(2022) # so the same data is created each time
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)
data = np.column_stack((data, data * 2., data + 20.))
# Add a NaN
data[30, 0] = np.nan
data[20, 1] = np.nan
# Filter data using np.isnan
mask = ~np.isnan(data)
filtered_data = [d[m] for d, m in zip(data.T, mask.T)]
# basic plot
plt.boxplot(filtered_data)
plt.show()
I'll leave it as an exercise to the reader to extend this to 3 or more dimensions, but you get the idea.
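As one possible take on that exercise, here is a sketch for a 3D array shaped like the asker's KsData_3D (treatments, depths, replications): slice out one depth, then filter NaNs per treatment. The helper name `columns_without_nan` and the random stand-in data are assumptions made for illustration.

```python
import numpy as np


def columns_without_nan(a):
    """Return a list with one 1D array per column of `a`, NaNs dropped."""
    a = np.asarray(a, dtype=float)
    return [col[~np.isnan(col)] for col in a.T]


# Hypothetical stand-in data: 3 treatments, 5 depths, 6 replications
rng = np.random.default_rng(0)
ks = rng.random((3, 5, 6))
ks[1, 4, 2] = np.nan
ks[2, 4, 0] = np.nan
ks[2, 4, 5] = np.nan

i_depth = 4
# The slice is (treatments, replications); transpose so that each
# treatment becomes one column, as plt.boxplot expects
per_treatment = columns_without_nan(ks[:, i_depth, :].T)
lengths = [len(v) for v in per_treatment]
print(lengths)  # [6, 5, 4]
```

The resulting list can be passed straight to `plt.boxplot(per_treatment)`, since boxplot accepts a sequence of arrays of different lengths.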
Use seaborn, which is a high-level API for matplotlib
seaborn.boxplot filters NaN under the hood
import seaborn as sns
sns.boxplot(data=data)
(Figures: seaborn box plots of the 1D and 2D data.)
NaN is also ignored if plotting from df.plot(kind='box') for pandas, which uses matplotlib as the default plotting backend.
import pandas as pd
df = pd.DataFrame(data)
df.plot(kind='box')
(Figures: pandas box plots of the 1D and 2D data.)

Calculating Bull/Bear Markets in Pandas

I was recently given the challenge of identifying bull and bear markets, using the values -1 and 1 to denote which one is which.
It is straightforward enough to do this with a for loop, but I know this is the worst way to do these things and that it's better to use numpy/pandas methods where possible. However, I'm not seeing an easy way to do it.
Is there a way to do this, perhaps using changes of +/- 20% from the current level to determine which regime you're in?
Here's a sample dataframe:
dates = pd.date_range(start='1950-01-01', periods=25000)
rand = np.random.RandomState(42)
vals = np.zeros(25000)
vals[0] = 15
for i in range(1, 25000):
    vals[i] = vals[i-1] + rand.normal(0, 1)
df = pd.DataFrame(vals, columns=['Price'], index=dates)
The plot of these prices looks like this:
Anyone have any recommendations to calculate what regime the current point is in?
If you have to use a for loop then that's fine.
Here is a solution using the S&P 500 index from Yahoo! Finance (ticker ^GSPC):
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import yfinance as yf
import requests_cache
session = requests_cache.CachedSession()
df = yf.download('^GSPC', session=session)
df = df[['Adj Close']].copy()
df['dd'] = df['Adj Close'].div(df['Adj Close'].cummax()).sub(1)
df['ddn'] = ((df['dd'] < 0.) & (df['dd'].shift() == 0.)).cumsum()
df['ddmax'] = df.groupby('ddn')['dd'].transform('min')
df['bear'] = (df['ddmax'] < -0.2) & (df['ddmax'] < df.groupby('ddn')['dd'].transform('cummin'))
df['bearn'] = ((df['bear'] == True) & (df['bear'].shift() == False)).cumsum()
bears = df.reset_index().query('bear == True').groupby('bearn')['Date'].agg(['min', 'max'])
print(bears)
df['Adj Close'].plot()
for i, row in bears.iterrows():
    plt.fill_between(row, df['Adj Close'].max(), alpha=0.25, color='r')
plt.gca().yaxis.set_major_formatter(plt.matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))
plt.ylabel('S&P 500 Index (^GSPC)')
plt.title('S&P 500 Index with Bear Markets (> 20% Declines)')
plt.savefig('bears.png')
plt.show()
Here are the bear markets in data frame bears:
min max
bearn
1 1956-08-06 1957-10-21
2 1961-12-13 1962-06-25
3 1966-02-10 1966-10-06
4 1968-12-02 1970-05-25
5 1973-01-12 1974-10-02
6 1980-12-01 1982-08-11
7 1987-08-26 1987-12-03
8 2000-03-27 2002-10-08
9 2007-10-10 2009-03-06
10 2020-02-20 2020-03-20
11 2022-01-04 2022-06-15
Here is a plot:
Edit: I think this is an improvement from my first solution since ^GSPC provides a longer time series and bear markets are not typically dividend-adjusted.
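The drawdown step at the heart of the solution above (price divided by its running maximum, minus one) can be tried on a tiny series without downloading anything. This is a simplified sketch: it only flags the days on which the drawdown from the running peak exceeds 20%, and does not reproduce the peak-to-trough episode logic of the full answer. The prices and the -1/1 labelling are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up prices: a rise, a fall of more than 20% from the peak, a recovery
prices = pd.Series([100.0, 110.0, 99.0, 87.0, 80.0, 95.0, 112.0, 100.0])

# Drawdown relative to the highest price seen so far (0 at a new high)
drawdown = prices.div(prices.cummax()).sub(1)

# Simplified regime label: -1 while more than 20% below the running peak, else 1
regime = np.where(drawdown < -0.2, -1, 1)
print(regime.tolist())  # [1, 1, 1, -1, -1, 1, 1, 1]
```

Everything here is vectorised, so the same three lines should work unchanged on the 25,000-row sample frame from the question.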
I think this might work:
import numpy as np
vals = np.random.normal(0, 1, 25000)
vals[0] = 15
vals = np.cumsum(vals)
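To check that a cumulative sum really replaces the loop from the question, the two versions can be compared with the same seed. This is a sketch under the assumption that drawing all steps in one call from a freshly seeded `RandomState` yields the same sequence as drawing them one at a time; the series is shortened to 1000 points to keep it quick.

```python
import numpy as np
import pandas as pd

n = 1000  # shortened from 25000 for a quick comparison

# Loop version, as in the question
rand = np.random.RandomState(42)
loop_vals = np.zeros(n)
loop_vals[0] = 15
for i in range(1, n):
    loop_vals[i] = loop_vals[i - 1] + rand.normal(0, 1)

# Vectorised version: draw all steps at once, then accumulate
rand = np.random.RandomState(42)
steps = rand.normal(0, 1, n - 1)
vec_vals = np.concatenate(([15.0], steps)).cumsum()

dates = pd.date_range(start="1950-01-01", periods=n)
df = pd.DataFrame(vec_vals, columns=["Price"], index=dates)
same = np.allclose(loop_vals, vec_vals)
print(same)
```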

Python .csv labeling done properly via jupyter

I'm not sure whether my graphs are done properly, or what would happen if I wanted to flip them upside down. I'd also like to print them and generate a .pdf file, but I'm not sure how to accomplish that. Please give me some advice if you have any. I'd appreciate it, all the best.
Changing variables countlessly
import numpy as np
np.__version__
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from operator import itemgetter
sns.set(style="darkgrid")
# t 1
m1 = np.array([[1,2,2],[-4,3,8],[-1,0,1]])
m2 = np.array([[1,4],[-2,2],[3,-6]])
print(m1.dot(m2));
# t 2
G = nx.Graph()
G.add_edges_from([
('A','D'),('A','B'),('B','D'),('B','C'),('B','E'),('C','D'),('C','E'),('D','E')
])
nx.draw(G, with_labels=True)
array = nx.betweenness_centrality(G)
array['B']
# t 3
df = pd.read_csv('xxx.csv')
df.set_index('OBJECTID', inplace=True)
df.head(1)
# t 4
sorted = df.groupby('NAME')['PT_ENROLL'].sum().sort_values(ascending=False)
sorted.head(7)
# t 5
df.groupby('NAICS_DESC')['NAME'].count().sort_values(ascending=False)
# t 6
df1 = df['TOT_ENROLL']
df2 = df['POPULATION']
plt.scatter(df1,df2)
# T3
df = pd.read_csv('Hospitals.csv')
df.set_index('OBJECTID', inplace=True)
df.head(5)
# T4
sorted = df.groupby('CITY')['NAME'].count().sort_values(ascending=False)
sorted.head(6)
# T5
sorted = df.groupby('NAME')['Y'].max().sort_values(ascending=False)
sorted.head(5)
# T6
df.groupby('OWNER')['BEDS'].sum().sort_values(ascending=False).plot(kind='bar')
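One way to read the two requests in this question is "invert the y-axis" and "save the figure as a PDF". Under that assumption, a minimal sketch could look like the following. The small DataFrame is a hypothetical stand-in for Hospitals.csv, and the output file name is arbitrary.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this line out inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the hospital data
df = pd.DataFrame({"OWNER": ["A", "B", "A", "C"], "BEDS": [10, 25, 5, 40]})
beds = df.groupby("OWNER")["BEDS"].sum().sort_values(ascending=False)

fig, ax = plt.subplots()
beds.plot(kind="bar", ax=ax)
ax.invert_yaxis()  # one reading of "upside down": bars hang from the top

# savefig picks the format from the file extension
out_path = os.path.join(tempfile.gettempdir(), "beds_by_owner.pdf")
fig.savefig(out_path, bbox_inches="tight")
plt.close(fig)
```

Drawing onto an explicit `ax` rather than relying on the implicit current figure makes it clear which figure `savefig` writes out.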

Pandas : Replace old column values and plot new column values with respect to equations

In a CSV file, I have three columns z, x, y. The 'z' column is used to group the x and y columns, which are then plotted for each value of 'z'. Below is the table of z, x, y.
z x y
23 1,75181E-07 6,949512
23 8,88901E-07 6,963877
23 1,61279E-05 7,293052
23 5,35262E-05 8,135064
23 8,56942E-05 8,903738
23 0,000114883 9,579907
23 0,01068653 211,0798
23 0,01070811 211,3568
23 0,0107263 211,5871
23 0,01074606 211,8401
23 0,01076813 212,1311
23 0,01078525 212,3436
40 1,75181E-07 6,949513217
40 8,889E-07 6,96388319
40 1,61277E-05 7,293169621
40 5,35248E-05 8,135499439
40 0,00029527 13,63721607
40 0,000319049 14,1825142
40 0,000340228 14,69608917
40 0,014110191 252,3893548
40 0,014132366 252,5804547
40 0,014155023 252,8030254
40 0,014180293 253,0374241
40 0,014202693 253,1983821
40 0,014226167 253,4140887
40 0,014251631 253,6566835
40 0,014272699 253,8120535
Now I need to replace the 'x' and 'y' columns with new values, say 'x1' and 'y1', given by the equations x1 = ln(1 + x) and y1 = y*(1+x), and then plot x1 and y1 grouped by the same 'z' column.
I have tried the code below. I am able to see my new values, but I am not able to plot them.
import csv
import os
import tkinter as tk
import sys
from tkinter import filedialog
import pandas as pd
import matplotlib.pyplot as plt
import math
from tkinter import ttk
import numpy as np
def readCSV(self):
    x=[] # Initializing empty lists to store the 3 columns in csv
    y=[]
    z=[]
    global df
    self.filename = filedialog.askopenfilename()
    df = pd.read_csv(self.filename, error_bad_lines=False) #Reading CSV file using pandas
    read = csv.reader(df, delimiter = ",")
    fig = plt.figure()
    ax= fig.add_subplot(111)
    df.set_index('x', inplace=True) #Setting index
    line = df.groupby('z')['y'].plot(legend=True,ax=ax) #grouping and plotting
    cursor = datacursor(line)
    gdf= df[df['z'] == 23]
    x=np.asarray(gdf.index.values)
    y=np.asarray(gdf['y'].values)
    x1 = np.log(1+x)
    y1 = y * (1 + x)
    df.set_index('x1', inplace=True) #Setting new index
    line = df.groupby('z')['y1'].plot(legend=True,ax=ax) #grouping and plotting for new values
    cursor = datacursor(line)
    gdf= df[df['z'] == 23]
    x1=np.asarray(gdf.index.values)
    print ("x1:",x1)
    y1=np.asarray(gdf['y1'].values)
    print ("y1:",y1)
    ax1 = ax.twinx()
    ax.grid(True)
    ax.set_ylim(0,None)
    ax.set_xlim(0,None)
    align_yaxis(ax, y.max(), ax1, 1)
    plt.show()
I get an error on the line df.set_index('x1', inplace=True):
KeyError: 'x1'
Thanks in advance.
You must assign 'x1' and 'y1' to the dataframe as columns before you can set either of them as the index:
x=np.asarray(df.index.values)
y=np.asarray(df['y'].values)
df['x1'] = np.log(1+x)
df['y1'] = y * (1+x)
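Put together, the column assignment might look like the sketch below. Several things here are assumptions rather than facts from the question: the file is taken to be semicolon-separated with decimal commas (the table shows decimal commas, but the real delimiter is not given), only four rows are included, and `np.log1p(x)` is swapped in for `np.log(1 + x)` because it is more accurate for very small x.

```python
import io

import numpy as np
import pandas as pd

# A few rows from the question, in an assumed semicolon-separated layout
csv_text = """z;x;y
23;1,75181E-07;6,949512
23;8,88901E-07;6,963877
40;1,75181E-07;6,949513217
40;8,889E-07;6,96388319
"""

# decimal="," makes pandas parse "6,949512" as the float 6.949512
df = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")

# New columns must exist in the frame before they can be used as index or plotted
df["x1"] = np.log1p(df["x"])        # ln(1 + x)
df["y1"] = df["y"] * (1 + df["x"])
print(df)
```

After this, `df.set_index('x1').groupby('z')['y1'].plot(legend=True)` should draw one line per z value without raising a KeyError.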

ggplot multiple plots in one object

I've created a script to create multiple plots in one object. The results I am looking for are two plots one over the other such that each plot has different y axis scale but x axis is fixed - dates. However, only one of the plots (the top) is properly created, the bottom plot is visible but empty i.e the geom_line is not visible. Furthermore, the y-axis of the second plot does not match the range of values - min to max. I also tried using facet_grid (scales="free") but no change in the y-axis. The y-axis for the second graph has a range of 0 to 0.05.
I've limited the date range to the past few weeks. This is the code I am using:
df = df.set_index('date')
weekly = df.resample('w-mon',label='left',closed='left').sum()
data = weekly[-4:].reset_index()
data= pd.melt(data, id_vars=['date'])
pplot = ggplot(aes(x="date", y="value", color="variable", group="variable"), data)
#geom_line()
scale_x_date(labels = date_format('%d.%m'),
limits=(data.date.min() - dt.timedelta(2),
data.date.max() + dt.timedelta(2)))
#facet_grid("variable", scales="free_y")
theme_bw()
A sample of the dataframe (df): it is a daily dataset containing values for the variables x and a, and 'date' is the index here:
date x a
2016-08-01 100 20
2016-08-02 50 0
2016-08-03 24 18
2016-08-04 0 10
The dataframe sample (to_plot) - weekly overview:
date variable value
0 2016-08-01 x 200
1 2016-08-08 x 211
2 2016-08-15 x 104
3 2016-08-22 x 332
4 2016-08-01 a 8
5 2016-08-08 a 15
6 2016-08-15 a 22
7 2016-08-22 a 6
Sorry for not adding the df dataframe before.
Your calls to the plot directives geom_line(), scale_x_date(), etc. are standing on their own in your script; you do not connect them to your plot object. Thus, they do not have any effect on your plot.
In order to apply a plot directive to an existing plot object, use the graphics language and "add" them to your plot object by connecting them with a + operator.
The result (as intended):
The full script:
from __future__ import print_function
import sys
import pandas as pd
import datetime as dt
from ggplot import *
if __name__ == '__main__':
    df = pd.DataFrame({
        'date': ['2016-08-01', '2016-08-08', '2016-08-15', '2016-08-22'],
        'x': [100, 50, 24, 0],
        'a': [20, 0, 18, 10]
    })
    df['date'] = pd.to_datetime(df['date'])
    data = pd.melt(df, id_vars=['date'])
    plt = ggplot(data, aes(x='date', y='value', color='variable', group='variable')) +\
        scale_x_date(
            labels=date_format('%y-%m-%d'),
            limits=(data.date.min() - dt.timedelta(2), data.date.max() + dt.timedelta(2))
        ) +\
        geom_line() +\
        facet_grid('variable', scales='free_y')
    plt.show()
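The point about connecting directives with the + operator can be illustrated without the ggplot package at all. The following is a toy sketch: `Plot`, `geom_line` and `theme_bw` here are hypothetical stand-ins written for this example, not the real library objects. It only shows that a bare call is evaluated and thrown away, while `+` returns a plot that includes the layer.

```python
class Plot:
    """Toy stand-in for a plot object; not the real ggplot class."""

    def __init__(self, layers=()):
        self.layers = tuple(layers)

    def __add__(self, layer):
        # Adding a layer returns a new plot that includes it
        return Plot(self.layers + (layer,))


def geom_line():
    return "geom_line"


def theme_bw():
    return "theme_bw"


# Standalone calls: each result is created and immediately discarded
p = Plot()
geom_line()
theme_bw()
standalone_layers = p.layers

# Chained with +: each layer ends up in the plot
p = Plot() + geom_line() + theme_bw()
chained_layers = p.layers

print(standalone_layers)  # ()
print(chained_layers)     # ('geom_line', 'theme_bw')
```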
