I'm new to pandas and trying to compare peak/trough values in time series price data to determine whether they are higher or lower than the previous peaks/troughs. I'd like to find three consecutively higher peaks and three consecutively higher troughs, and vise versa. If this is true, I'd perform some function. I've used the following code to create a dataframe in Quantconnect which returns a true value for each peak/trough. History is the name of the dataframe.
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
qb = QuantBook()
spy = qb.AddEquity("BA")
history = qb.History(qb.Securities.Keys, 360, Resolution.Daily).reset_index(level=0)
ilocs_min = argrelextrema(history.close.values, np.less_equal, order=3)[0]
ilocs_max = argrelextrema(history.close.values, np.greater_equal, order=3)[0]
print(ilocs_min)
history.close.plot(figsize=(20,8), alpha=.3)
# filter prices that are peaks and plot them differently to be visable on the plot
history.iloc[ilocs_max].close.plot(style='.', lw=10, color='green', marker="v");
history.iloc[ilocs_min].close.plot(style='.', lw=10, color='red', marker="^");
history['weekly_max'] = False
history['weekly_min'] = False
history.loc[history.iloc[ilocs_min].index, 'weekly_min'] = True
history.loc[history.iloc[ilocs_max].index, 'weekly_max'] = True
history.close.plot(figsize=(20,8), alpha =.3)
history[history['weekly_max']].close.plot(style='.', lw=10, color='green', marker="v");
history[history['weekly_min']].close.plot(style='.', lw=10, color='red', marker="^");
This returns the following dataframe:
Dataframe
Any help would be amazing!
Related
So I am doing this guided project in datacamp and it is essentially about Exploring the MarketCap of various Cryptocurrencies over the time. Even though I know other means to get the output, I am sticking to the proposed method.
So I need to make a bar graph for the top 10 cryptocurrencies(x axis) and their share of marketcap (y axis). I am able to get the desired output but I want to go one step up and sort the bar in the descending order. Right now, it is sorted based on the first letter of the respective crypto currencies. Here is the code,
#Declaring these now for later use in the plots
TOP_CAP_TITLE = 'Top 10 market capitalization'
TOP_CAP_YLABEL = '% of total cap'
# Selecting the first 10 rows and setting the index
cap10 = cap.iloc[:10,]
# Calculating market_cap_perc
cap10 = cap10.assign(market_cap_perc = round(cap10['market_cap_usd']/sum(cap['market_cap_usd'])*100,2))
# Plotting the barplot with the title defined above
fig, ax = plt.subplots(1,1)
ax.bar(cap10['symbol'], cap10['market_cap_perc'])
ax.set_title(TOP_CAP_TITLE)
ax.set_ylabel(TOP_CAP_YLABEL)
plt.show()
I've replicated your code with dummy data, and output the plot, is this the sorted plot you're looking for? Only need to sort the dataframe using df.sort_values()
import pandas as pd
import matplotlib.pyplot as plt
d = {'BCH': 8, 'BTC': 55, 'ETH': 12, 'MIOTA': 4, 'ADA': 0.5, 'BTG': 0.8, 'XMR': 0.7, 'DASH': 1, 'LTC': 0.99, 'XRP': 2.5}
cap = pd.DataFrame({'symbol': d.keys(), 'market_cap_perc': d.values()})
#Declaring these now for later use in the plots
TOP_CAP_TITLE = 'Top 10 market capitalization'
TOP_CAP_YLABEL = '% of total cap'
# Selecting the first 10 rows and setting the index
cap10 = cap.iloc[:10,]
# Calculating market_cap_perc
# cap10 = cap10.assign(market_cap_perc = round(cap10['market_cap_usd']/sum(cap['market_cap_usd'])*100,2))
cap10 = cap10.sort_values('market_cap_perc', ascending=False) #add this line
# Plotting the barplot with the title defined above
fig, ax = plt.subplots(1,1)
ax.bar(cap10['symbol'], cap10['market_cap_perc'])
ax.set_title(TOP_CAP_TITLE)
ax.set_ylabel(TOP_CAP_YLABEL)
plt.show()
You can sort cap10 before plotting:
cap10 = cap10.sort_values(by='market_cap_perc', ascending=False)
fig, ax = plt.subplots(1,1)
...
I have the code below with randomly generated dataframes and I would like to extract the x and y values of both plotted lines. These line plots show the Price on the Y-axis and are Volume weighted.
For some reason, the line values for the second distribution plot, cannot be stored on the variables "df_2_x", "df_2_y". The values of "df_1_x", "df_1_y" are also written on the other variables. Both print statements return True, so the arrays are completely equal.
If I put them in separate cells in a notebook, it does work.
I also looked at this solution: How to retrieve all data from seaborn distribution plot with mutliple distributions?
But this does not work for weighted distplots.
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2,12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100,3000)) for i in range(30)]
Price_2 = [round(random.uniform(0,10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100,1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1' : Price_1,
'Volume_1' : Volume_1})
df_2 = pd.DataFrame({'Price_2' : Price_2,
'Volume_2' :Volume_2})
df_1_x, df_1_y = sns.distplot(df_1.Price_1, hist_kws={"weights":list(df_1.Volume_1)}).get_lines()[0].get_data()
df_2_x, df_2_y = sns.distplot(df_2.Price_2, hist_kws={"weights":list(df_2.Volume_2)}).get_lines()[0].get_data()
print((df_1_x == df_2_x).all())
print((df_1_y == df_2_y).all())
Why does this happen, and how can I fix this?
Whether or not weight is used, doesn't make a difference here.
The principal problem is that you are extracting again the first curve in df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[0].get_data(). You'd want the second curve instead: df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[1].get_data().
Note that seaborn isn't really meant to concatenate commands. Sometimes it works, but it usually adds a lot of confusion. E.g. sns.distplot returns an ax (which represents a subplot). Graphical elements such as lines are added to that ax.
Also note that sns.distplot has been deprecated. It will be removed from Seaborn in one of the next versions. It is replaced by sns.histplot and sns.kdeplot.
Here is how the code could look like:
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2, 12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100, 3000)) for i in range(30)]
Price_2 = [round(random.uniform(0, 10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100, 1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1': Price_1,
'Volume_1': Volume_1})
df_2 = pd.DataFrame({'Price_2': Price_2,
'Volume_2': Volume_2})
ax = sns.histplot(x=df_1.Price_1, weights=list(df_1.Volume_1), bins=10, kde=True, kde_kws={'cut': 3})
sns.histplot(x=df_2.Price_2, weights=list(df_2.Volume_2), bins=10, kde=True, kde_kws={'cut': 3}, ax=ax)
df_1_x, df_1_y = ax.lines[0].get_data()
df_2_x, df_2_y = ax.lines[1].get_data()
# use fill_between to demonstrate where the extracted curves lie
ax.fill_between(df_1_x, 0, df_1_y, color='b', alpha=0.2)
ax.fill_between(df_2_x, 0, df_2_y, color='r', alpha=0.2)
plt.show()
Currently I am trying to create a Barplot that shows the amount of reviews for an app per week. The bar should however be colored according to a third variable which contains the average rating of the reviews in each week (range: 1 to 5).
I followed the instructions of the following post to create the graph: Python: Barplot with colorbar
The code works fine:
# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]]
df = pd.DataFrame(data, columns = ["week", "count", "score"])
# Convert to lists
data_x = list(df["week"])
data_hight = list(df["count"])
data_color = list(df["score"])
#Create Barplot:
data_color = [x / max(data_color) for x in data_color]
fig, ax = plt.subplots(figsize=(15, 4))
my_cmap = plt.cm.get_cmap('RdYlGn')
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)
sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(1,5))
sm.set_array([])
cbar = plt.colorbar(sm)
cbar.set_label('Color', rotation=270,labelpad=25)
plt.show()
Now to the issue: As you might notice the value of the average score in week 4 is "1.2". The Barplot does however indicate that the value lies around "2.5". I understand that this stems from the following code line, which standardizes the values by dividing it with the max value:
data_color = [x / max(data_color) for x in data_color]
Unfortunatly I am not able to change this command in a way that the colors resemble the absolute values of the scores, e.g. with a average score of 1.2 the last bar should be colored in deep red not light orange. I tried to just plug in the regular score values (Not standardized) to solve the issue, however, doing so creates all bars with the same green color... Since this is only my second python project, I have a hard time comprehending the process behind this matter and would be very thankful for any advice or solution.
Cheers Neil
You identified correctly that the normalization is the problem here. It is in the linked code by valued SO user #ImportanceOfBeingEarnest defined for the interval [0, 1]. If you want another normalization range [normmin, normmax], you have to take this into account during the normalization:
# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]]
df = pd.DataFrame(data, columns = ["week", "mycount", "score"])
# Not necessary to convert to lists, pandas series or numpy array is also fine
data_x = df.week
data_hight = df.mycount
data_color = df.score
#Create Barplot:
normmin=1
normmax=5
data_color = [(x-normmin) / (normmax-normmin) for x in data_color] #see the difference here
fig, ax = plt.subplots(figsize=(15, 4))
my_cmap = plt.cm.get_cmap('RdYlGn')
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)
sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(normmin,normmax))
sm.set_array([])
cbar = plt.colorbar(sm)
cbar.set_label('Color', rotation=270,labelpad=25)
plt.show()
Sample output:
Obviously, this does not check that all values are indeed within the range [normmin, normmax], so a better script would make sure that all values adhere to this specification. We could, alternatively, address this problem by clipping the values that are outside the normalization range:
#...
import numpy as np
#.....
#Create Barplot:
normmin=1
normmax=3.5
data_color = [(x-normmin) / (normmax-normmin) for x in np.clip(data_color, normmin, normmax)]
#....
You may also have noticed another change that I introduced. You don't have to provide lists - pandas series or numpy arrays are fine, too. And if you name your columns not like pandas functions such as count, you can access them as df.ABC instead of df["ABC"].
I plot a function which is based on the results of a curve fit I did in the query. Now I want to see how the curve fit actually fits the average values for every x value. I treid it with a for loop and a groupby.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
plt.style.use('seaborn-colorblind')
x = dataset['mrwSmpVWi']
c = dataset['c']
a = dataset['a']
b = dataset['b']
Snr = dataset['Seriennummer']
dataset["y"] = (c / (1 + (a) * np.exp(-b*(x))))
for number in dataset.groupby('mrwSmpVWi'):
dataset['m'] = dataset['mrwSmpP'].mean()
fig, ax = plt.subplots(figsize=(30,15))
for name, group in dataset.groupby('Seriennummer'):
group.plot(x="mrwSmpVWi", y="m", ax=ax, marker='o', linestyle='', ms=12, label =name)
group.plot(x="mrwSmpVWi", y="y", ax=ax, label =name)
plt.show()
The dataset with the values is huge and not sorted by mrwSmpVWi.
Has someone an idea why I only get a straight line for my average values?
You got to take a look at what you're doing with this line:
for number in dataset.groupby('mrwSmpVWi'):
dataset['m'] = dataset['mrwSmpP'].mean()
You probably want:
dataset['m'] = dataset.groupby('Seriennummer')['mrwSmpVWi'].transform('mean')
(assuming you were intending to calculate the mean of each group of Serienummer)
I want to represent correlation matrix using a heatmap. There is something called correlogram in R, but I don't think there's such a thing in Python.
How can I do this? The values go from -1 to 1, for example:
[[ 1. 0.00279981 0.95173379 0.02486161 -0.00324926 -0.00432099]
[ 0.00279981 1. 0.17728303 0.64425774 0.30735071 0.37379443]
[ 0.95173379 0.17728303 1. 0.27072266 0.02549031 0.03324756]
[ 0.02486161 0.64425774 0.27072266 1. 0.18336236 0.18913512]
[-0.00324926 0.30735071 0.02549031 0.18336236 1. 0.77678274]
[-0.00432099 0.37379443 0.03324756 0.18913512 0.77678274 1. ]]
I was able to produce the following heatmap based on another question, but the problem is that my values get 'cut' at 0, so I would like to have a map which goes from blue(-1) to red(1), or something like that, but here values below 0 are not presented in an adequate way.
Here's the code for that:
plt.imshow(correlation_matrix,cmap='hot',interpolation='nearest')
Another alternative is to use the heatmap function in seaborn to plot the covariance. This example uses the Auto data set from the ISLR package in R (the same as in the example you showed).
import pandas.rpy.common as com
import seaborn as sns
%matplotlib inline
# load the R package ISLR
infert = com.importr("ISLR")
# load the Auto dataset
auto_df = com.load_data('Auto')
# calculate the correlation matrix
corr = auto_df.corr()
# plot the heatmap
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
If you wanted to be even more fancy, you can use Pandas Style, for example:
cmap = cmap=sns.diverging_palette(5, 250, as_cmap=True)
def magnify():
return [dict(selector="th",
props=[("font-size", "7pt")]),
dict(selector="td",
props=[('padding', "0em 0em")]),
dict(selector="th:hover",
props=[("font-size", "12pt")]),
dict(selector="tr:hover td:hover",
props=[('max-width', '200px'),
('font-size', '12pt')])
]
corr.style.background_gradient(cmap, axis=1)\
.set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
.set_caption("Hover to magify")\
.set_precision(2)\
.set_table_styles(magnify())
How about this one?
import seaborn as sb
corr = df.corr()
sb.heatmap(corr, cmap="Blues", annot=True)
If your data is in a Pandas DataFrame, you can use Seaborn's heatmap function to create your desired plot.
import seaborn as sns
Var_Corr = df.corr()
# plot the heatmap and annotation on it
sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns, annot=True)
Correlation plot
From the question, it looks like the data is in a NumPy array. If that array has the name numpy_data, before you can use the step above, you would want to put it into a Pandas DataFrame using the following:
import pandas as pd
df = pd.DataFrame(numpy_data)
The code below will produce this plot:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# A list with your data slightly edited
l = [1.0,0.00279981,0.95173379,0.02486161,-0.00324926,-0.00432099,
0.00279981,1.0,0.17728303,0.64425774,0.30735071,0.37379443,
0.95173379,0.17728303,1.0,0.27072266,0.02549031,0.03324756,
0.02486161,0.64425774,0.27072266,1.0,0.18336236,0.18913512,
-0.00324926,0.30735071,0.02549031,0.18336236,1.0,0.77678274,
-0.00432099,0.37379443,0.03324756,0.18913512,0.77678274,1.00]
# Split list
n = 6
data = [l[i:i + n] for i in range(0, len(l), n)]
# A dataframe
df = pd.DataFrame(data)
def CorrMtx(df, dropDuplicates = True):
# Your dataset is already a correlation matrix.
# If you have a dateset where you need to include the calculation
# of a correlation matrix, just uncomment the line below:
# df = df.corr()
# Exclude duplicate correlations by masking uper right values
if dropDuplicates:
mask = np.zeros_like(df, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set background color / chart style
sns.set_style(style = 'white')
# Set up matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Add diverging colormap from red to blue
cmap = sns.diverging_palette(250, 10, as_cmap=True)
# Draw correlation plot with or without duplicates
if dropDuplicates:
sns.heatmap(df, mask=mask, cmap=cmap,
square=True,
linewidth=.5, cbar_kws={"shrink": .5}, ax=ax)
else:
sns.heatmap(df, cmap=cmap,
square=True,
linewidth=.5, cbar_kws={"shrink": .5}, ax=ax)
CorrMtx(df, dropDuplicates = False)
I put this together after it was announced that the outstanding seaborn corrplot was to be deprecated. The snippet above makes a resembling correlation plot based on seaborn heatmap. You can also specify the color range and select whether or not to drop duplicate correlations. Notice that I've used the same numbers as you, but that I've put them in a pandas dataframe. Regarding the choice of colors you can have a look at the documents for sns.diverging_palette. You asked for blue, but that falls out of this particular range of the color scale with your sample data. For both observations of
0.95173379, try changing to -0.95173379 and you'll get this:
import seaborn as sns
# label to make it neater
labels = {
's1':'vibration sensor',
'temp':'outer temperature',
'actPump':'flow rate',
'pressIn':'input pressure',
'pressOut':'output pressure',
'DrvActual':'acutal RPM',
'DrvSetPoint':'desired RPM',
'DrvVolt':'input voltage',
'DrvTemp':'inside temperature',
'DrvTorque':'motor torque'}
corr = corr.rename(labels)
# remove the top right triange - duplicate information
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Colors
cmap = sns.diverging_palette(500, 10, as_cmap=True)
# uncomment this if you want only the lower triangle matrix
# ans=sns.heatmap(corr, mask=mask, linewidths=1, cmap=cmap, center=0)
ans=sns.heatmap(corr, linewidths=1, cmap=cmap, center=0)
#save image
figure = ans.get_figure()
figure.savefig('correlations.png', dpi=800)
These are all reasonable answers, and it seems like the question has mostly been settled, but I thought I'd add one that doesn't use matplotlib/seaborn. In particular this solution uses altair which is based on a grammar of graphics (which might be a little more familiar to someone coming from ggplot).
# import libraries
import pandas as pd
import altair as alt
# download dataset and create correlation
df = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/master/data/penguins.json")
corr_df = df.corr()
# data preparation
pivot_cols = list(corr_df.columns)
corr_df['cat'] = corr_df.index
# actual chart
alt.Chart(corr_df).mark_rect(tooltip=True)\
.transform_fold(pivot_cols)\
.encode(
x="cat:N",
y='key:N',
color=alt.Color("value:Q", scale=alt.Scale(scheme="redyellowblue"))
)
This yields
If you should find yourself needing labels in those cells, you can just swap the #actual chart section for something like
base = alt.Chart(corr_df).transform_fold(pivot_cols).encode(x="cat:N", y='key:N').properties(height=300, width=300)
boxes = base.mark_rect().encode(color=alt.Color("value:Q", scale=alt.Scale(scheme="redyellowblue")))
labels = base.mark_text(size=30, color="white").encode(text=alt.Text("value:Q", format="0.1f"))
boxes + labels
Use the 'jet' colormap for a transition between blue and red.
Use pcolor() with the vmin, vmax parameters.
It is detailed in this answer:
https://stackoverflow.com/a/3376734/21974