matplotlib histogram with frequency and counts - python

I have data (from a space-delimited text file with two columns) which is already binned, but only with a bin width of 1. I want to increase this width to about 5. How can I do this using numpy/matplotlib in Python?
Using
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('file.txt')
x = data[:, 0]
y = data[:, 1]
plt.bar(x, y)
creates too many bars, and using
plt.hist(data)
doesn't plot the histogram appropriately. I guess I don't understand how matplotlib's histogram plotting works.
See some of the data below.
264 1
265 1
266 4
267 2
268 2
269 2
270 2
271 2
272 5
273 3
274 2
275 6
276 7
277 3
278 7
279 5
280 9
281 4
282 8
283 11
284 9
285 15
286 19
287 11
288 12
289 10
290 13
291 18
292 20
293 14
294 15

What if you use numpy.reshape to transform your data before using plt.bar, for example:
In [83]: import numpy as np
In [84]: import matplotlib.pyplot as plt
In [85]: data = np.array([[1,2,3,4,5,6], [4,3,8,9,1,2]]).T
In [86]: data
Out[86]:
array([[1, 4],
       [2, 3],
       [3, 8],
       [4, 9],
       [5, 1],
       [6, 2]])
In [87]: y = data[:,1].reshape(-1,2).sum(axis=1)
In [89]: y
Out[89]: array([ 7, 17, 3])
In [91]: x = data[:,0].reshape(-1,2).mean(axis=1)
In [92]: x
Out[92]: array([ 1.5, 3.5, 5.5])
In [96]: plt.bar(x, y)
Out[96]: <Container object of 3 artists>
In [97]: plt.show()
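The same trick applies directly to the question's data to get a bin width of 5: reshape the counts into rows of five and sum each row, and average the bin centres likewise. A minimal sketch, assuming file.txt is the two-column file from the question (leftover rows are trimmed so the reshape divides evenly):
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('file.txt')
width = 5
n = (len(data) // width) * width                 # drop leftover rows so reshape works
x = data[:n, 0].reshape(-1, width).mean(axis=1)  # centre of each wider bin
y = data[:n, 1].reshape(-1, width).sum(axis=1)   # summed counts per wider bin
plt.bar(x, y, width=width)
plt.show()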

I am not an expert at matplotlib, but I find hist to be incredibly useful. The examples on the matplotlib site give a great overview of some of its features.
I don't know how to use your sample data without transforming it, so I altered your example to dequantize the data before creating a histogram.
I calculated the bin size using this question's first answer.
import matplotlib.pyplot as plt
import numpy as np

data = np.loadtxt('file.txt')
dequantized = data[:, 0].repeat(data[:, 1].astype(int))
dequantized[0:7]
# Each row's first column is repeated the number of times found in the
# second column, creating a single array.
# array([ 264., 265., 266., 266., 266., 266., 267.])

def bins(xmin, xmax, binwidth, padding):
    # Returns an array of integers which can be used to represent bins
    return np.arange(
        xmin - (xmin % binwidth) - padding,
        xmax + binwidth + padding,
        binwidth)

histbins = bins(min(dequantized), max(dequantized), 5, 5)
plt.figure(1)
plt.hist(dequantized, histbins)
plt.show()
The resulting histogram looks like this:
I hope this example is useful.
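Note that since the data are already binned counts, you could also skip the dequantizing step entirely and pass the counts to hist via its weights argument. A minimal sketch, assuming the same file.txt layout as the question:
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('file.txt')
x, counts = data[:, 0], data[:, 1]
# Each bin centre counts as one sample weighted by its count,
# so hist re-bins the totals into the wider bins directly.
plt.hist(x, bins=np.arange(x.min(), x.max() + 5, 5), weights=counts)
plt.show()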

Related

Can't plot comparative (double) histogram from Pandas table

Here's the table from the dataframe:
   Points_groups   Qty Contracts   Qty Gones
--------------------------------------------
1  350+            108             275
2  300-350         725             1718
3  250-300         885             3170
4  200-250         2121            10890
5  150-200         3120            7925
6  100-150         653             1318
7  50-100          101             247
8  0-50            45              137
I'd like to get a grouped (double) bar chart out of it, where the x-axis is built from the 'Points_groups' column. I tried a bunch of options already, but I couldn't get it to work.
For example:
df.plot(kind ='hist')
plt.xlabel('Points_groups')
plt.ylabel("Number Of Students");
or
sns.distplot(df['Кол-во Ушедшие'])
sns.distplot(df['Кол-во Контракт'])
plt.show()
or
df.hist(column='Баллы_groups', by= ['Кол-во Контракт', 'Кол-во Ушедшие'], bins=2, grid=False, rwidth=0.9,color='purple', sharex=True);
Since you already have the distribution in your pandas dataframe, the plot you need can be achieved with the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Df = pd.DataFrame({'key': ['red', 'green', 'blue'], 'A': [1, 2, 1], 'B': [2, 4, 3]})
X_axis = np.arange(len(Df['key']))
plt.bar(X_axis - 0.2, Df['A'], 0.4, label = 'A')
plt.bar(X_axis + 0.2, Df['B'], 0.4, label = 'B')
X_label = list(Df['key'].values)
plt.xticks(X_axis, X_label)
plt.legend()
plt.show()
Since I don't have access to your data, I made some mock dataframe. This results in the following figure:
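Adapting the same pattern to the actual table from the question (a sketch; the dataframe is re-entered by hand here since I don't have the original):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Points_groups': ['350+', '300-350', '250-300', '200-250',
                      '150-200', '100-150', '50-100', '0-50'],
    'Qty Contracts': [108, 725, 885, 2121, 3120, 653, 101, 45],
    'Qty Gones': [275, 1718, 3170, 10890, 7925, 1318, 247, 137],
})
x = np.arange(len(df))
plt.bar(x - 0.2, df['Qty Contracts'], 0.4, label='Qty Contracts')
plt.bar(x + 0.2, df['Qty Gones'], 0.4, label='Qty Gones')
plt.xticks(x, df['Points_groups'])
plt.xlabel('Points_groups')
plt.ylabel('Number Of Students')
plt.legend()
plt.show()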

Z-Score computation of a Pandas' DataFrame returns differing classes

I am trying to calculate the z-score of a pandas DataFrame using scipy's zscore method.
Though the computation succeeds, I get different return types depending on which host the program runs on.
Thus I am guessing it is related to the different versions of the involved packages, but I haven't found the reason for the difference.
Why does the returned type differ between the two hosts?
         Host 1    Host 2
python   3.6.8     3.7.3
pandas   1.1.5     1.3.1
numpy    1.19.5    1.19.2
scipy    1.5.4     1.7.3
Example:
Host 1
import numpy as np
import pandas as pd
from scipy.stats import zscore
df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
# --------------------------------
In [5]: df
Out[5]:
     A    B    C
0  166  135  141
1  156  110  167
2  104  159  114
3  150  156  157
4  163  113  180
In [10]: zscore(df)
Out[10]:
array([[ 0.80546745,  0.01940194, -0.47372066],
       [ 0.36290292, -1.19321913,  0.66671797],
       [-1.93843265,  1.18351816, -1.65802232],
       [ 0.0973642 ,  1.03800363,  0.22808773],
       [ 0.67269809, -1.0477046 ,  1.23693729]])
In [11]: zscore(df, ddof=0)
Out[11]:
array([[ 0.80546745,  0.01940194, -0.47372066],
       [ 0.36290292, -1.19321913,  0.66671797],
       [-1.93843265,  1.18351816, -1.65802232],
       [ 0.0973642 ,  1.03800363,  0.22808773],
       [ 0.67269809, -1.0477046 ,  1.23693729]])
In [12]: type(zscore(df))
Out[12]: numpy.ndarray
Host 2
import numpy as np
import pandas as pd
from scipy.stats import zscore
df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
# --------------------------------
In [77]: df
Out[77]:
     A    B    C
0  151  188  190
1  195  199  103
2  130  174  188
3  168  194  146
4  171  138  129
In [78]: zscore(df)
Out[78]:
          A         B         C
0 -0.553990  0.428052  1.148875
1  1.477308  0.928963 -1.427210
2 -1.523474 -0.209472  1.089654
3  0.230829  0.701276 -0.153973
4  0.369327 -1.848819 -0.657346
In [79]: zscore(df, ddof=0)
Out[79]:
          A         B         C
0 -0.553990  0.428052  1.148875
1  1.477308  0.928963 -1.427210
2 -1.523474 -0.209472  1.089654
3  0.230829  0.701276 -0.153973
4  0.369327 -1.848819 -0.657346
In [80]: type(zscore(df))
Out[80]: pandas.core.frame.DataFrame
If we look at the source code of scipy's zscore in version v1.5.4 (such as on Host 1), we can see that the passed input gets converted to a numpy array using np.asanyarray(a), which is then further processed and returned. In version v1.7.3 on the other hand (such as on Host 2), the code uses the zmap function which calculates the z-score of the passed array/DataFrame while preserving its type (see this line).
In conclusion, the culprit for this behavior is the newer scipy version on Host 2. Hope this helps!
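If you need identical behavior on both hosts regardless of the scipy version, one option is to pass a bare ndarray in and rebuild the DataFrame yourself. A minimal sketch:
import numpy as np
import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])
# A plain ndarray input side-steps the version-dependent type preservation:
# zscore then returns an ndarray on both hosts, which we wrap back up.
z = pd.DataFrame(zscore(df.to_numpy()), index=df.index, columns=df.columns)
type(z)  # pandas.core.frame.DataFrame everywhere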

Spatial pie chart using geopandas

I am trying to make a pie chart that looks like the below -
I am using geopandas for that-
us_states = gpd.read_file("conus_state.shp")
data = gpd.read_file("data_file.shp")
fig, ax = plt.subplots(figsize= (10,10))
us_states.plot(color = "None", ax = ax)
data.plot(column = ["Column1","Column2"], ax= ax, kind = "pie",subplots=True)
This gives me the following error-
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\Users\LSRATH~1.STU\AppData\Local\Temp/ipykernel_17992/1047905594.py in <module>
1 fig, ax = plt.subplots(figsize= (10,10))
2 us_states.plot(color = "None", ax = ax)
----> 3 diff_env.plot(column = ["WS_MON1","WS_MON2"], ax= ax, kind = "pie")
c:\python38\lib\site-packages\geopandas\plotting.py in __call__(self, *args, **kwargs)
951 if kind in self._pandas_kinds:
952 # Access pandas plots
--> 953 return PlotAccessor(data)(kind=kind, **kwargs)
954 else:
955 # raise error
c:\python38\lib\site-packages\pandas\plotting\_core.py in __call__(self, *args, **kwargs)
921 if isinstance(data, ABCDataFrame):
922 if y is None and kwargs.get("subplots") is False:
--> 923 raise ValueError(
924 f"{kind} requires either y column or 'subplots=True'"
925 )
ValueError: pie requires either y column or 'subplots=True'
Even after specifying subplots=True, it does not work.
How can I make a pie chart using 2 columns of the dataframe?
Below are the first five rows of the relevant columns-
diff_env[["Column1", "Column2", "geometry"]].head().to_dict()
{'Column1': {0: 2, 1: 0, 2: 0, 3: 1, 4: 12},
'Column2': {0: 2, 1: 0, 2: 0, 3: 1, 4: 12},
'geometry': {0: <shapely.geometry.point.Point at 0x2c94e07f190>,
1: <shapely.geometry.point.Point at 0x2c94e07f130>,
2: <shapely.geometry.point.Point at 0x2c94e07f0d0>,
3: <shapely.geometry.point.Point at 0x2c94bb86d30>,
4: <shapely.geometry.point.Point at 0x2c94e07f310>}}
You have not provided any usable sample data, so I have randomly generated some.
This is inspired by How to plot scatter pie chart using matplotlib.
sample data
   value0  value1  geometry                                         size
0  5       3       POINT (-105.96116535117056 31.014979334448164)  312
1  2       3       POINT (-79.70609244147155 36.46222924414716)    439
2  4       7       POINT (-68.89518006688962 37.84436728093645)    363
3  7       9       POINT (-118.12344177257525 31.909303946488293)  303
4  2       7       POINT (-102.1001252173913 28.57591221070234)    326
5  3       3       POINT (-96.88772103678929 47.76324025083612)    522
6  5       8       POINT (-112.33188157190635 48.16975143812709)   487
7  7       6       POINT (-95.15025297658862 44.59245298996656)    594
8  3       1       POINT (-100.36265715719063 46.787613401337794)  421
9  2       4       POINT (-81.82966451505015 35.161393444816056)   401
full code
import geopandas as gpd
import numpy as np
import shapely
import matplotlib.pyplot as plt

states = (
    gpd.read_file(
        "https://raw.githubusercontent.com/nvkelso/natural-earth-vector/master/geojson/ne_110m_admin_1_states_provinces.geojson"
    )
    .loc[lambda d: d["iso_3166_2"].ne("US-AK"), "geometry"]
    .exterior
)

# geodataframe of points where pies are to be plotted
n = 10
pies = gpd.GeoDataFrame(
    geometry=[
        shapely.geometry.Point(xy)
        for xy in zip(
            np.random.choice(np.linspace(*states.total_bounds[[0, 2]], 300), n),
            np.random.choice(np.linspace(*states.total_bounds[[1, 3]], 300), n),
        )
    ],
    data={f"value{c}": np.random.randint(1, 10, n) for c in range(2)},
    crs=states.crs,
).pipe(lambda d: d.assign(size=np.random.randint(300, 600, n)))

# utility function inspired by https://stackoverflow.com/questions/56337732/how-to-plot-scatter-pie-chart-using-matplotlib
def draw_pie(dist, xpos, ypos, size, ax):
    # for incremental pie slices
    cumsum = np.cumsum(dist)
    cumsum = cumsum / cumsum[-1]
    pie = [0] + cumsum.tolist()
    colors = ["blue", "red", "yellow"]
    for i, (r1, r2) in enumerate(zip(pie[:-1], pie[1:])):
        angles = np.linspace(2 * np.pi * r1, 2 * np.pi * r2)
        x = [0] + np.cos(angles).tolist()
        y = [0] + np.sin(angles).tolist()
        xy = np.column_stack([x, y])
        ax.scatter([xpos], [ypos], marker=xy, s=size, color=colors[i], alpha=1)
    return ax

fig, ax = plt.subplots()
ax = states.plot(ax=ax, edgecolor="black", linewidth=0.5)
for _, r in pies.iterrows():
    ax = draw_pie([r.value0, r.value1], r.geometry.x, r.geometry.y, r["size"], ax)
output

Python : Reduce an array by only keeping number between two limits

I have an Nx4 array and I want to reduce it by keeping only the rows whose values are in a specific range for the second and third columns. I have written code that does not work, because it does not take into account that I am already shrinking the array while iterating over it.
Example of data/array :
1 358 33 7.1
2 659 85 7.1
3 111 145 7.1
4 558 116 7.1
5 632 40 7.1
6 415 335 7.1
7 207 30 7.1
8 564 47 7.1
9 352 41 7.1
10 700 570 7.1
11 275 499 7.1
12 482 177 7.1
13 737 565 7.1
14 298 43 7.1
15 155 195 7.1
16 598 417 7.1
17 93 313 7.1
18 1150 597 7.1
19 410 451 7.1
20 34 793 7.1
21 997 904 7.1
22 1024 452 7.1
23 740 128 7.1
24 522 86 7.1
25 679 643 7.1
26 973 37 7.1
27 372 42 7.1
For example, I want to keep the rows whose value is in the range [80, 2000] for the second column and in the range [130, 2000] for the third one. My real array has over 1'000'000 rows.
Here is my code :
def filter_data(data, XRANGE, YRANGE):
    data_f = np.copy(data)
    for l in range(len(data_f)):
        if XRANGE[0] < data_f[l, 1] < XRANGE[1] and YRANGE[0] < data_f[l, 1] < YRANGE[1]:
            pass
        else:
            data = np.delete(data, l, axis=0)
    return data
How could I do differently and well more efficiently ?
You can pull this off by using masks and combining them by computing their point-wise products (equivalent to the AND operator with booleans):
>>> x_range, y_range = [0, 2], [0, 5]
>>> data
array([[ 1,  2,  3],
       [ 1,  1,  1],
       [ 5,  1,  7],
       [ 1, 10,  2]])
First construct two masks based on constraints on data[:, 0] and data[:, 1]:
>>> x_mask = (data[:,0] > x_range[0])*(data[:,0] < x_range[1])
array([ True, True, False, True])
>>> y_mask = (data[:,1] > y_range[0])*(data[:,1] < y_range[1])
array([ True, True, True, False])
Essentially the resulting mask is equivalent to x > x_min & x < x_max & y > y_min & y < y_max:
>>> x_mask*y_mask
array([ True, True, False, False])
>>> data[x_mask*y_mask]
array([[1, 2, 3],
       [1, 1, 1]])
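Applied to the question's Nx4 array and ranges, the same idea looks like this (a sketch using &, the boolean AND operator, which reads more clearly than * for masks; data.txt is a hypothetical file holding the array):
import numpy as np

data = np.loadtxt('data.txt')  # hypothetical file with the Nx4 array
x_range, y_range = [80, 2000], [130, 2000]
# One boolean test per constraint, combined into a single mask;
# this is fully vectorised, so it stays fast on 1'000'000+ rows.
mask = ((data[:, 1] > x_range[0]) & (data[:, 1] < x_range[1]) &
        (data[:, 2] > y_range[0]) & (data[:, 2] < y_range[1]))
filtered = data[mask]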
Here is a simple example of what I think you are talking about.
I first create an example (n, 3) array.
Then I find where the values in the second and third columns exceed a value (let's call it 4) and multiply that mask with the original values of those columns.
Lastly, I concatenate this new array with the first column of the original array, as follows:
a = np.asarray([[ 2,  3,  4],
                [ 3,  4,  5],
                [ 4,  5,  6],
                [10, 12, 14]])
val = 4
b = a[:, 1:3] > val
c = a[:, 1:3] * b
np.concatenate((a[:, 0:1], c), axis=1)
EDIT: after you updated your example, for an (n, 4) array:
a = np.asarray([[ 2,  3,  4, 5],
                [ 3,  4,  5, 6],
                [ 4,  5,  6, 8],
                [10, 12, 14, 9]])
val = 4
b = a[:, 1:3] > val
c = a[:, 1:3] * b
np.concatenate((a[:, 0:1], c, a[:, 3:4]), axis=1)

subplots based on records of two different pandas DataFrames ( with same structure) using Seaborn or Matplotlib

I have two DataFrames like below. Both have the same structure (column names and index) but different values: DataFrame 1 holds observed values and DataFrame 2 holds predicted values. I want to draw a single figure of subplots, each representing one of the columns, with the Y-axis showing the values from both dataframes (two different lines) and the X-axis showing the index.
I know that I am supposed to post sample code of my attempts, but they all seem wrong, which is why I am not sharing them. I really want to use something like seaborn's sns.FacetGrid, as its plots have higher quality.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 819 228 8.93 1840
6 231 818 225 8.05 1710
7 234 817 225 7.90 1610
8 210 817 224 7.60 1590
9 200 816 221 7.53 1590
10 199 810 219 7.41 1550
You can modify the code below according to your needs:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
                          'b': [9, 20, 4, 16, 40, 11]})
# Creating a tidy dataframe as input for seaborn
merged = pd.concat([pd.melt(actual), pd.melt(predicted)]).reset_index()
merged['category'] = ''
merged.loc[:len(actual)*2, 'category'] = 'actual'
merged.loc[len(actual)*2:, 'category'] = 'predicted'
g = sns.FacetGrid(merged, col="category", hue="variable")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend();
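To get what the question literally asks for, one subplot per column with the observed and predicted lines overlaid, you can facet by variable and hue by source instead; a sketch under the same mock data:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
                          'b': [9, 20, 4, 16, 40, 11]})
# keys= labels each frame up front, so no manual index slicing is needed
merged = (pd.concat([actual, predicted], keys=['observed', 'predicted'],
                    names=['source', 'index'])
            .reset_index()
            .melt(id_vars=['source', 'index']))
g = sns.FacetGrid(merged, col="variable", hue="source")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend()
plt.show()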
