Splitting data into bins in histplot - python

I have a problem with sns.histplot(). As far as I understand, the bins parameter should indicate how many bins the plot should have. Here is some code to visualize the (at least to me) strange behavior.
d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9 , 10 , 11 , 12, 13, 14, 15], 'col2': [1, 1, 1, 1, 1, 1, 2, 2, 2 , 2 , 2, 2, 2, 2, 2]}
df = pd.DataFrame(data=d)
sns.histplot(data=df, x='col1', multiple='dodge', hue='col2', binwidth=2, bins=8)
I have almost the same problem in my original code where I have:
hist = sns.histplot(data=Data, x=y_data, multiple='dodge', ax=axes[0], hue=i[2][1], binwidth=2, bins=10)
And as you can see, there is only one wide bin where the data has its minimum and std; the data is not split into the number of bins I declared. Why is this not splitting the data into the provided number of bins? How can I change the code to ensure a constant number of bins?

I think the problem is the binwidth parameter. Maybe just try to delete that parameter, or set it to a smaller value (0.2 or 0.1).

From the docs, regarding the binwidth parameter:
Width of each bin, overrides bins but can be used with binrange.
So if you pass both bins and binwidth, binwidth takes precedence and bins is ignored.
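To keep a constant number of bins, drop binwidth and pass only bins. Seaborn computes its bin edges the same way numpy does, so the effect is easy to check without plotting (a sketch, assuming seaborn's default binning behavior):

```python
import numpy as np

# col1 from the question: the integers 1..15
col1 = np.arange(1, 16)

# with bins=8 and no binwidth, 8 equal-width bins cover the data range
edges = np.histogram_bin_edges(col1, bins=8)
print(len(edges) - 1)  # 8 bins
```

With binwidth=2 in play instead, the bin count is derived from the data range divided by 2, which is why the requested bins=8 never takes effect.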


Applying a mask to a dataframe, but only over a certain range inside the dataframe

I currently have some code that uses a mask to calculate the mean of values that are overloads, and values that are baseline values. It does this over the entire length of the dataframe. However, now I want to only apply this to a certain range in the dataframe column, between first and last values (i.e., a specified region in the column, dictated by user input). Here is my code as it stands:
import numpy as np
import pandas as pd

mask_number = 5
no_overload_cycles = 1
hyst = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})
list_test = []
for i in range(0, len(hyst)-1, mask_number):
    for x in range(no_overload_cycles):
        list_test.append(i+x)
mask = np.array(list_test)
print(mask)
[ 0  5 10 15]
first = 4
last = 17
regression_area = hyst.iloc[first:last]
mean_range_overload = regression_area.loc[np.where(mask == regression_area.index)]['test'].mean()
mean_range_baseline = regression_area.drop(mask[first:last])['test'].mean()
So the overload mean would be cycles 5, 10, and 15 in test, and the baseline mean would be from positions 4 to 17, excluding 5, 10 and 15. This would be my expected output from this:
print (mean_range_overload)
4
print(mean_range_baseline)
4.545454
However, the no_overload_cycles value can change, and may for example, be 3, which would then create a mask of this:
mask_number = 5
no_overload_cycles = 3
hyst = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2 ,1, 5, 2]})
list_test = []
for i in range(0, len(hyst)-1, mask_number):
    for x in range(no_overload_cycles):
        list_test.append(i+x)
mask = np.array(list_test)
print(mask)
[ 0  1  2  5  6  7 10 11 12 15 16 17]
So the mean_range_overload would be the mean of the values at 5, 6, 7, 10, 11, 12, 15, 16, 17, and the mean_range_baseline would be the values in between these, within the range of first and last in the dataframe column.
Any help on this would be greatly appreciated!
Assuming no_overload_cycles == 1 always, you can simply use slice objects to index the DataFrame.
Say you wish to, in your example, specifically pick cycles 5, 10 and 15 and use them as overload. Then you can get them by doing df.loc[5:15:5].
On the other hand, if you wish to pick the 5th, 10th and 15th cycles from the range you selected, you can get them by doing df.iloc[5:15+1:5] (iloc does not include the right index, so we add one). No loops required.
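A quick sketch of the label-based version on the question's data (assuming the default integer index):

```python
import pandas as pd

hyst = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5,
                              3, 6, 3, 2, 1, 5, 2]})

# label-based slice with a step: rows 5, 10 and 15 (loc includes both ends)
overload = hyst.loc[5:15:5, "test"]
print(overload.mean())  # 4.0
```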
As mentioned in the comments, your question is slightly confusing, and it'd be helpful if you gave a better description and some expected results; in general I'd also advise you to decouple the domain-specific part of your problem before asking it in a forum, since not everyone knows what you mean by "overload", "baseline", "cycles" etc. I'm not commenting that since I still don't have enough reputation to do so.
I renamed a few of the variables, so what I called a "mask" is not exactly what you called a mask, but I reckon this is what you were trying to make:
import pandas as pd

mask_length = 5
overload_cycles_per_mask = 3
df = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})
selected_range = (4, 17)
overload_indices = []
baseline_indices = []

# `range` does not include the right hand side so we add one
# ideally you would specify the range as (4, 18) instead
for i in range(selected_range[0], selected_range[1] + 1):
    if i % mask_length < overload_cycles_per_mask:
        overload_indices.append(i)
    else:
        baseline_indices.append(i)

print(overload_indices)
print(df.iloc[overload_indices].test.mean())
print(baseline_indices)
print(df.iloc[baseline_indices].test.mean())
Basically, the DataFrame rows inside selected_range are divided into segments of length mask_length, each of which has their first overload_cycles_per_mask elements marked as overload, and any others, as baseline.
With that, you get two lists of indices, which you can directly pass to df.iloc, as according to the documentation it supports a list of integers.
Here is the output for mask_length = 5 and overload_cycles_per_mask = 1:
[5, 10, 15]
4.0
[4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17]
4.545454545454546
And here is for mask_length = 5 and overload_cycles_per_mask = 3:
[5, 6, 7, 10, 11, 12, 15, 16, 17]
3.6666666666666665
[4, 8, 9, 13, 14]
5.8
I do believe calling this a single mask makes things more confusing. In any case, I would tuck the logic for getting the indices away in some separate function to the one which calculates the mean.
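That separation could look something like this (a sketch; split_indices is a hypothetical name):

```python
def split_indices(first, last, mask_length, overload_count):
    """Split the inclusive index range [first, last] into overload and
    baseline indices: the first `overload_count` positions of each
    `mask_length`-long segment are overload, the rest are baseline."""
    overload, baseline = [], []
    for i in range(first, last + 1):
        if i % mask_length < overload_count:
            overload.append(i)
        else:
            baseline.append(i)
    return overload, baseline

overload, baseline = split_indices(4, 17, mask_length=5, overload_count=1)
print(overload)  # [5, 10, 15]
```

The mean calculation then only needs `df.iloc[overload].test.mean()` and stays independent of how the indices were chosen.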

How to plot the top 1% in a different color in a scatter plot?

I have two columns of numeric values in a .csv file.
Let's imagine
column1 = [1, 2, 3, 7, 9, 3, 14, 7, 9]
column2 = [2, 5, 2, 67, 8, 3, 6, 5, 2]
column1 and column2 are basically time series but there is no index number in the csv file.
I would like to plot the values that are in the top 1% in a different color.
So, my code is
features = ['column1' , 'column2']
df = pd.read_csv('XX.csv', usecols=features, sep=';', encoding='ISO-8859-1')
for features in df:
    new_df = df[[features]].quantile(q=.99, axis=0, numeric_only=True).iloc[0]
This code produces a threshold number for each column that represents its top 1%. The next step is plotting a scatter plot of the two columns and, if a value is above the threshold for either column, showing it in a different color.
I messed up here.
Add a column to your dataframe to specify the colours under certain conditions. Here I've interpreted your requirements to mean that if either column1 or column2 is over its respective threshold, then it gets a different colour. I have hardcoded the thresholds to make the code easier to read; you would probably use new_df["column1"] and new_df["column2"] in the conditions.
dftest = pd.DataFrame({
    "column1": [1, 2, 3, 7, 9, 3, 14, 7, 9],
    "column2": [2, 5, 2, 67, 8, 3, 6, 5, 2]
})  # set up test df
dftest["colour"] = "green" # put green everywhere in the new colour column
dftest.loc[(dftest["column1"] > 13.6) | (dftest["column2"] > 62),"colour"] = "red" # overwrite with red where you need to.
dftest.plot.scatter(x="column1", y="column2", c=dftest["colour"])
Output:
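To avoid hardcoding, the thresholds can be computed with DataFrame.quantile directly (a sketch on the same test data):

```python
import pandas as pd

dftest = pd.DataFrame({
    "column1": [1, 2, 3, 7, 9, 3, 14, 7, 9],
    "column2": [2, 5, 2, 67, 8, 3, 6, 5, 2]
})

# one 99th-percentile threshold per column
thresholds = dftest.quantile(q=0.99)

over = ((dftest["column1"] > thresholds["column1"])
        | (dftest["column2"] > thresholds["column2"]))
dftest["colour"] = "green"
dftest.loc[over, "colour"] = "red"
print(dftest["colour"].tolist())  # the rows containing 14 and 67 turn red
```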

stacked chart combine with alluvial plot - python

Surprisingly little info out there regarding python and the pyalluvial package. I'm hoping to combine stacked bars and a corresponding alluvial in the same figure.
Using the data below, I have three unique groups, outlined in Group. I want to display the proportion of each Group at each unique Point. I have the data formatted this way as I need three separate stacked bar charts, one for each Point.
So overall (Ove) highlights the overall proportion taken from all three Points: Group 1 makes up 70%, Group 2 makes up 20%, Group 3 makes up 10%. But the proportion of each group changes at the different Points. I'm hoping to show this like a standard stacked bar chart, but add the alluvial over the top.
import pandas as pd
import pyalluvial.alluvial as alluvial
df = pd.DataFrame({
    'Group': [1, 2, 3],
    'Ove': [0.7, 0.2, 0.1],
    'Point 1': [0.8, 0.1, 0.1],
    'Point 2': [0.6, 0.2, 0.2],
    'Point 3': [0.7, 0.3, 0.0],
})
ax = alluvial.plot(
    df=df,
    xaxis_names=['Group', 'Point 1', 'Point 2', 'Point 3'],
    y_name='Ove',
    alluvium='Group',
)
The output shows the overall group proportion (1st bar) correctly, but the following stacked bars don't show the right proportions.
If I transform the df and put the Points as a single column, then I don't get 3 separate bars.
As correctly pointed out by @darthbaba, pyalluvial expects the dataframe format to consist of frequencies matching different variable-type combinations. To give you an example of a valid input, each Point in each Group has been labelled as present (1) or absent (0):
df = pd.DataFrame({
    'Group': [1] * 6 + [2] * 6 + [3] * 6,
    'Point 1': [1, 1, 1, 1, 0, 0] * 3,
    'Point 2': [0, 1, 0, 1, 1, 0] * 3,
    'Point 3': [0, 0, 1, 1, 1, 1] * 3,
    'freq': [23, 11, 5, 7, 10, 12, 17, 3, 6, 17, 19, 20, 28, 4, 13, 8, 14, 9]
})
fig = alluvial.plot(df=df, xaxis_names=['Point 1','Point 2', 'Point 3'], y_name='freq', alluvium='Group', ignore_continuity=False)
Clearly, the above code doesn't resolve the issue since pyalluvial has yet to support the inclusion of stacked bars, much like how it's implemented in ggalluvial (see example #5). Therefore, unless you want to use ggalluvial, your best option IMO is to add the required functionality yourself. I'd start by modifying line #85.

Why are the columns in the matplotlib histogram not on top of the numbers

I would expect every column to be centered on top of its bin number. Instead, the 1 and 2 bars are to the right of the number, and the third is to the left. Why is it not even consistent?
import matplotlib.pyplot as plt
degrees = [1, 1, 2, 2, 2, 2, 3]
plt.hist(degrees)
plt.show()
Short answer: it is not supposed to; use plt.bar() instead. For a longer explanation, please read below.
Why bars are not on top of numbers 1,2&3?
The purpose of a histogram is to approximate the distribution of the data. For example
import numpy as np
plt.hist(np.random.normal(3, 7, 100))
which gives
Now, when you have much less data, which is integer valued, and call
plt.hist([1, 1, 2, 2, 2, 2, 3])
you also get approximation of the distribution of the data you provided. With the default parameters it looks like this:
The documentation of hist tells that
If you do not provide bins, it will default to 10.
If you do not provide range, it will default to min and max of your data
Therefore, your data will be put inside 10 bins, with min at 1 and max at 3. These bins will be
In [45]: np.linspace(1,3, 11)
Out[45]: array([1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. ])
Since you have only data inside bins 1.0 - 1.2, 2.0 - 2.2 and 2.8 - 3.0, you will see three bars centered at 1.1, 2.1 and 2.9.
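You can confirm the bin placement with numpy, which plt.hist uses under the hood:

```python
import numpy as np

# same 10 default bins over [1, 3]; the counts land in bins 0, 5 and 9
counts, edges = np.histogram([1, 1, 2, 2, 2, 2, 3], bins=10)
print(counts)  # [2 0 0 0 0 4 0 0 0 1]
```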
A guess of what you're after
If your data is integer (categorical) valued, like
degrees = [1, 1, 2, 2, 2, 2, 3]
and you want to know the relative sizes of these categories, you probably want to create a bar plot instead.
import matplotlib.pyplot as plt
from collections import Counter
degrees = [1, 1, 2, 2, 2, 2, 3]
counts = Counter(degrees)
plt.bar(counts.keys(), counts.values())
plt.show()
Because you didn't specify the bins. If you add the bins in:
degrees = [1, 1, 2, 2, 2, 2, 3]
plt.hist(degrees, bins=[1,2,3,4])
plt.show()

Plotting time series data group by month per product

Let's say the data used is something like this
df = pd.DataFrame({'Order_id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'Order_date': ['10/1/2020', '10/1/2020', '11/1/2020', '11/1/2020',
                                  '12/1/2020', '12/1/2020', '12/1/2020', '12/1/2020',
                                  '13/1/2020', '13/1/2020'],
                   'Product_nr': [0, 2, 1, 0, 2, 0, 2, 1, 2, 0],
                   'Quantity': [3, 1, 6, 5, 10, 1, 2, 5, 4, 3]})
# transforming the date column into datetime (dayfirst=True since the dates are day/month/year)
df['Order_date'] = pd.to_datetime(df['Order_date'], dayfirst=True)
and I'm trying to plot the number of ordered products per day per product over the given time span.
My initial idea would be something like
product_groups = df.groupby(['Product_nr'])
products_daily = pd.DataFrame()
for product, total_orders in product_groups:
    products_daily[product.day] = total_orders.values
products_daily.plot(subplots=True, legend=False)
pyplot.show()
I know there must be a groupby('Product_nr'), and the date should be split into days using Grouper(freq='D'). There should also be a for loop to combine them and then plot them all, but I really have no clue how to put those pieces together. How can I achieve this? My ultimate goal is actually to plot per month per product over 4 years of sales records, but given the example data here I changed it to daily.
Any suggestions or links to guides or tutorials are welcome too. Thank you very much!
You can pivot the table and use pandas' plot function:
(df.groupby(['Order_date', 'Product_nr'])
   ['Quantity'].sum()
   .unstack('Product_nr')
   .plot(subplots=True, layout=(1, 3))  # change layout to fit your data
)
Output:
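For the monthly version of the same idea, you can swap the date column for a month period as the first grouping key (a sketch; note dayfirst=True, since the sample dates are day-first):

```python
import pandas as pd

df = pd.DataFrame({'Order_id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'Order_date': ['10/1/2020', '10/1/2020', '11/1/2020', '11/1/2020',
                                  '12/1/2020', '12/1/2020', '12/1/2020', '12/1/2020',
                                  '13/1/2020', '13/1/2020'],
                   'Product_nr': [0, 2, 1, 0, 2, 0, 2, 1, 2, 0],
                   'Quantity': [3, 1, 6, 5, 10, 1, 2, 5, 4, 3]})
df['Order_date'] = pd.to_datetime(df['Order_date'], dayfirst=True)

# group by (month, product), then pivot products into columns
monthly = (df.groupby([df['Order_date'].dt.to_period('M'), 'Product_nr'])
             ['Quantity'].sum()
             .unstack('Product_nr'))
print(monthly)  # one row per month, one column per product
```

The resulting frame plots the same way with `monthly.plot(subplots=True)`, giving one panel per product across months.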
