I would like to plot my linear regression model of bike sales so that the sales for each year are summed into a single point, instead of the two separate points per year that I get now.
This is my code:
import matplotlib.pyplot as plt
from sklearn import linear_model

## Does the purchase of bicycles increase or decrease?
plot1 = df.groupby('Year')['Product_Category'].value_counts().rename('count').reset_index()

x = plot1['Year'].values.reshape(-1, 1)
y = plot1['count'].values.reshape(-1, 1)

# linear regression
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)  # predict on the same years (x_test was undefined)

# plot
plt.scatter(x, y, color='black')
plt.plot(x, y, color='blue', linewidth=3)
This is my plot:
From what I understand of your example, one possible solution is to replace value_counts with count.
Example data:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Year': [ 2019, 2019, 2020, 2021], 'Product_Category': ['a', 'b', 'c', 'd']})
print(df)
Year Product_Category
0 2019 a
1 2019 b
2 2020 c
3 2021 d
Using count() instead returns one row per year:
plot1 = df.groupby('Year')['Product_Category'].count().rename('count').reset_index()
print(plot1)
Year count
0 2019 2
1 2020 1
2 2021 1
plot1 = df.groupby('Year')['Product_Category'].count().rename('count').reset_index()

# x, y
x = plot1['Year'].values.reshape(-1, 1)
y = plot1['count'].values.reshape(-1, 1)

# plot
plt.scatter(x, y, color='black')
plt.plot(x, y, color='blue', linewidth=3)
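If you also want the fitted regression line from your original code drawn over the aggregated points, here is a minimal sketch that combines the count() aggregation above with the LinearRegression part of your question (predicting on the same years, since no separate test set is defined):

import matplotlib.pyplot as plt
from sklearn import linear_model

plot1 = df.groupby('Year')['Product_Category'].count().rename('count').reset_index()
x = plot1['Year'].values.reshape(-1, 1)   # years
y = plot1['count'].values.reshape(-1, 1)  # bikes sold per year

regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)  # predict on the training years

plt.scatter(x, y, color='black')                # one aggregated point per year
plt.plot(x, y_pred, color='blue', linewidth=3)  # fitted trend line
plt.show()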
I'm dealing with the well-known Gapminder data file (here:
https://www.kaggle.com/datasets/tklimonova/gapminder-datacamp-2007?select=gapminder_full.csv)
df.head():
country year population continent life_exp gdp_cap
0 Afghanistan 2007 31889923 Asia 43.828 974.580338
1 Albania 2007 3600523 Europe 76.423 5937.029526
2 Algeria 2007 33333216 Africa 72.301 6223.367465
3 Angola 2007 12420476 Africa 42.731 4797.231267
4 Argentina 2007 40301927 Americas 75.320 12779.379640
I tried a scatter plot, but I am confused by the many lines appearing on the plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('gapminder_full.csv')

plt.style.use('seaborn')
x = np.array(df['gdp_cap'])
y = np.array(df['life_exp'])
plt.scatter(x, y, marker='o', alpha=1)

coeff = np.polyfit(x, y, 2)
plt.plot(x, coeff[0]*(x**2) + coeff[1]*x + coeff[2])
plt.show()
What am I doing wrong?
Your second plot is drawn on top of the first one. Add a plt.show() after the scatter plot so each figure is drawn separately:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv('gapminder_full.csv')

plt.style.use('seaborn')
x = np.array(df['gdp_cap'])
y = np.array(df['life_exp'])
plt.scatter(x, y, marker='o', alpha=1)
plt.show()

coeff = np.polyfit(x, y, 2)
plt.plot(x, coeff[0]*(x**2) + coeff[1]*x + coeff[2])
plt.show()
Output:
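As an aside, if you would rather keep the scatter and the fitted quadratic in a single figure, the criss-crossing lines come from plotting the curve against unsorted x values; sorting x first gives a single smooth curve. A minimal sketch, assuming the same column names as above:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('gapminder_full.csv')
x = np.array(df['gdp_cap'])
y = np.array(df['life_exp'])

order = np.argsort(x)          # sort by gdp_cap so the curve is drawn left to right
coeff = np.polyfit(x, y, 2)

plt.scatter(x, y, marker='o', alpha=1)
plt.plot(x[order], np.polyval(coeff, x[order]), color='red')  # smooth fitted curve
plt.show()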
I have a CSV file loaded into a DataFrame with 2 variables (the input DataFrame, denoted by X) and a NumPy array of my target variables (y).
They look something like this:
>X
Duration Grand Mean
0 142 383.076805
1 334 182.067833
2 97 232.677513
3 220 448.385085
4 127 251.524975
5 121 156.828771
>y
[13 11 11 13 12 11 11 13 12 11 12 13 11 12 12 13 13 12 13 12 11 13 13 12
12 13 13 13 12 13 13 11 13 13 11 13 11 12 13 13 13 11 11 12 13 13 12 12
12 11]
I have not included the actual data for this exercise because I get the same error with any CSV file I use, so I assume the problem is inherently tied to the method itself.
So, I tried:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
n_neighbors = 15
h = .02 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()
With the following error message:
TypeError: '(slice(None, None, None), 0)' is an invalid key
I have seen a similar post on this topic but I could not get the answer for that question to work for me.
Your error comes from the way you slice the pandas DataFrame: you index it as if it were a NumPy array, which does not work. One possible way of correcting it is to put the line:
X = X.values
at the top of your code and you're good to go.
Proof
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors

X = pd.DataFrame(np.random.randn(100, 2), columns=["Duration", "Grand Mean"])
X = X.values  # <--- put this line
y = np.random.choice([11, 12, 13], 100, True, [.33, .33, .34])

n_neighbors = 15
h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()
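Alternatively, if you prefer to keep X as a DataFrame instead of converting it with X.values (or its modern equivalent X.to_numpy()), the same slicing can be done positionally with .iloc. A small sketch of just the lines that would change, assuming the same two-column layout as above:

# positional column access on a DataFrame, instead of X[:, 0] / X[:, 1]
x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1

plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=cmap_bold,
            edgecolor='k', s=20)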
max min mincount maxcount
0 12 10 1 6
1 21 14 1 6
2 34 19 1 6
3 6 20 1 4
4 8 22 1 4
5 41 23 1 4
This is a pandas DataFrame. I want a grouped bar chart of the counts with the corresponding max/min values written above the bars; the text labels are very important.
Here is my code:
df = pd.DataFrame({'maxcount': max_count, 'mincount': min_count, 'max': max, 'min': min})
ax = df[['maxcount', 'mincount']].plot(kind='bar')
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# create your dataframe
d = {'max': [12, 21, 34, 6, 8, 41], 'min': [10, 14, 19, 20, 22, 23],
     'mincount': [1, 1, 1, 1, 1, 1], 'maxcount': [6, 6, 6, 4, 4, 4]}
df = pd.DataFrame(d)

# create 2 dataframes: counts (for plotting) and max_min (for the text labels)
counts = pd.DataFrame(df, columns=['maxcount', 'mincount'])
max_min = pd.DataFrame(df, columns=['max', 'min'])

# plot the counts
ax = counts[counts.columns].plot(kind='bar', colormap='Paired', figsize=(12, 4))

# using zip() and ax.annotate, specify where (the x position via z)
# and what (max_min or counts values) you want to write
for x, y, z in zip(max_min.iloc[:, 0].values, counts.iloc[:, 0].values, range(len(counts))):
    ax.annotate('%.d' % x, (z - 0.2, counts.iloc[z, 0]), va='bottom', ha='center', fontsize=10)
    ax.annotate('(%.d)' % y, (z - 0.1, counts.iloc[z, 0]), va='bottom', ha='center', fontsize=10)

for x, y, z in zip(max_min.iloc[:, 1].values, counts.iloc[:, 1].values, range(len(counts))):
    ax.annotate('%.d' % x, (z + 0.1, counts.iloc[z, 1]), va='bottom', ha='center', fontsize=10)
    ax.annotate('(%.d)' % y, (z + 0.2, counts.iloc[z, 1]), va='bottom', ha='center', fontsize=10)
This is the output:
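As a side note, if your matplotlib is 3.4 or newer, ax.bar_label can place the same value/count labels with less manual positioning. A minimal sketch under that version assumption, using the same example data:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'max': [12, 21, 34, 6, 8, 41], 'min': [10, 14, 19, 20, 22, 23],
                   'mincount': [1, 1, 1, 1, 1, 1], 'maxcount': [6, 6, 6, 4, 4, 4]})

ax = df[['maxcount', 'mincount']].plot(kind='bar', colormap='Paired', figsize=(12, 4))

# ax.containers holds one BarContainer per plotted column, in column order
for container, val_col, cnt_col in zip(ax.containers, ['max', 'min'], ['maxcount', 'mincount']):
    labels = ['%d (%d)' % (v, c) for v, c in zip(df[val_col], df[cnt_col])]
    ax.bar_label(container, labels=labels, fontsize=10)

plt.show()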
Here is what my dataframe looks like:
year item_id sales_quantity
2014 1 10
2014 1 4
... ... ...
2015 1 7
2015 1 10
... ... ...
2014 2 1
2014 2 8
... ... ...
2015 2 17
2015 2 30
... ... ...
2014 3 9
2014 3 18
... ... ...
For each item_id, I want to plot a boxplot showing the distribution for each year.
Here is what I tried:
data = pd.DataFrame.from_csv('electronics.csv')

grouped = data.groupby(['year'])
ncols = 4
nrows = int(np.ceil(grouped.ngroups / ncols))

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(35, 45),
                         sharey=False)

for (key, ax) in zip(grouped.groups.keys(), axes.flatten()):
    grouped.get_group(key).boxplot(x='year', y='sales_quantity',
                                   ax=ax, label=key)
I get the error boxplot() got multiple values for argument 'x'. Can someone please tell me how to do this right?
If I have only a single item, then the following works: sns.boxplot(data.sales_quantity, groupby=data.year). How can I extend it to multiple items?
Link to csv
Please check the comments in the code.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('electronics_157_3cols.csv')
print(df)

fig, axes = plt.subplots(1, len(df['item_id_copy'].unique()), sharey=True)

for n, i in enumerate(df['item_id_copy'].unique()):
    idf = df[df['item_id_copy'] == i][['year', 'sales_quantity']].pivot(columns='year')
    print(idf)
    idf.plot.box(ax=axes[n])
    axes[n].set_title('ID {}'.format(i))
    axes[n].set_xticklabels([e[1] for e in idf.columns], rotation=45)
    axes[n].set_ylim(0, 1)  # remove this line to show outliers properly (left in here to get a readable graph)

plt.show()
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('electronics_157_3cols.csv')
print(df)

fig, axes = plt.subplots(2, 5, sharey=True)
gen_i = (i for i in df['item_id_copy'].unique())

for r in range(2):
    for c in range(5):
        i = next(gen_i)
        idf = df[df['item_id_copy'] == i][['year', 'sales_quantity']].pivot(columns='year')
        print(idf)
        idf.plot.box(ax=axes[r][c])
        axes[r][c].set_title('ID {}'.format(i))
        axes[r][c].set_xticklabels([e[1] for e in idf.columns], rotation=0)
        axes[r][c].set_ylim(0, 1)

plt.show()
I will leave this simple version for others...
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_table('sample.txt', sep=r'\s+')

fig, axes = plt.subplots(1, 3, sharey=True)

for n, i in enumerate(df['item_id'].unique()):
    idf = df[df['item_id'] == i][['year', 'sales_quantity']].pivot(columns='year')
    print(idf)
    idf.plot.box(ax=axes[n])
    axes[n].set_title('Item ID {}'.format(i))
    axes[n].set_xticklabels([e[1] for e in idf.columns])

plt.show()
sample.txt
year item_id sales_quantity
2014 1 10
2014 1 4
2015 1 7
2015 1 10
2014 2 1
2014 2 8
2015 2 17
2015 2 30
2014 3 9
2014 3 18
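Since you mention seaborn, a per-item grid of year-wise boxplots can also be drawn in a single call with sns.catplot. A minimal sketch, assuming the same sample.txt columns as above:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_table('sample.txt', sep=r'\s+')

# one boxplot panel per item_id, with one box per year inside each panel
sns.catplot(data=df, x='year', y='sales_quantity', col='item_id', kind='box')
plt.show()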
I have a DataFrame with some time series. I created a correlation matrix from those time series and I'd like to create a hierarchical clustering on this correlation matrix. How can I do that?
#
# let's pretend this DataFrame contains some time series
#
df = pd.DataFrame((np.random.randn(150)).reshape(10,15))
          0         1         2  ...        13        14
0  0.369746  0.093882 -0.656211  ... -0.596936  0.095960
1  0.641457  1.120405 -0.468639  ... -2.070802 -1.254159
2  0.360756 -0.222554  0.367893  ...  0.566299  0.932898
3  0.733130  0.666270 -0.624351  ... -0.377017  0.340360
4 -0.263967  1.143818  0.554947  ...  0.220406 -0.585353
5  0.082964 -0.311667  1.323161  ... -1.190672 -0.828039
6  0.173685  0.719818 -0.881854  ... -1.048066 -1.388395
7  0.118301 -0.268945  0.909022  ...  0.094301  1.111376
8 -1.341381  0.599435 -0.318425  ...  1.053272 -0.763416
9 -1.146692  0.453125  0.150241  ...  0.454584  1.506249
#
# I can create a correlation matrix like this
#
correlation_matrix = df.corr(method='spearman')
0 1 ... 13 14
0 1.000000 -0.139394 ... 0.090909 0.309091
1 -0.139394 1.000000 ... -0.636364 0.115152
2 0.175758 0.733333 ... -0.515152 -0.163636
3 0.309091 0.163636 ... -0.248485 -0.127273
4 0.600000 -0.103030 ... 0.151515 0.175758
5 -0.078788 0.054545 ... -0.296970 -0.187879
6 -0.175758 -0.272727 ... 0.151515 -0.139394
7 0.163636 -0.042424 ... 0.187879 0.248485
8 0.030303 0.915152 ... -0.430303 0.296970
9 -0.696970 0.321212 ... -0.236364 -0.151515
10 0.163636 0.115152 ... -0.163636 0.381818
11 0.321212 -0.236364 ... -0.127273 -0.224242
12 -0.054545 -0.200000 ... 0.078788 0.236364
13 0.090909 -0.636364 ... 1.000000 0.381818
14 0.309091 0.115152 ... 0.381818 1.000000
Now, how can I build the hierarchical clustering on this matrix?
Here is a step-by-step guide on how to build the hierarchical clustering and dendrogram for our time series using SciPy. Note that scikit-learn (a powerful data analysis library built on top of SciPy) has many other clustering algorithms implemented as well.
First we build some synthetic time series to work with. We'll build 6 groups of correlated time series and we expect the hierarchical clustering to detect those six groups.
import numpy as np
import seaborn as sns
import pandas as pd
from scipy import stats
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt
#
# build 6 time series groups for testing, called: a, b, c, d, e, f
#
num_samples = 61
group_size = 10
#
# create the main time series for each group
#
x = np.linspace(0, 5, num_samples)
scale = 4
a = scale * np.sin(x)
b = scale * (np.cos(1+x*3) + np.linspace(0, 1, num_samples))
c = scale * (np.sin(2+x*6) + np.linspace(0, -1, num_samples))
d = scale * (np.cos(3+x*9) + np.linspace(0, 4, num_samples))
e = scale * (np.sin(4+x*12) + np.linspace(0, -4, num_samples))
f = scale * np.cos(x)
#
# from each main series build 'group_size' series
#
timeSeries = pd.DataFrame()
ax = None
for arr in [a, b, c, d, e, f]:
    arr = arr + np.random.rand(group_size, num_samples) + np.random.randn(group_size, 1)
    df = pd.DataFrame(arr)
    timeSeries = timeSeries.append(df)

    # We use seaborn to plot what we have
    #ax = sns.tsplot(ax=ax, data=df.values, ci=[68, 95])
    ax = sns.tsplot(ax=ax, data=df.values, err_style="unit_traces")
plt.show()
Now we do the clustering and plot it:
# Do the clustering
Z = hac.linkage(timeSeries, method='single', metric='correlation')

# Plot dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
hac.dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()
If we want to decide which kind of correlation to apply, or to use another distance metric, we can provide a custom metric function:
# Here we use the Pearson correlation coefficient as the basis for the distance
def my_metric(x, y):
    r = stats.pearsonr(x, y)[0]
    return 1 - r  # correlation to distance: range 0 to 2
# Do the clustering
Z = hac.linkage(timeSeries, method='single', metric=my_metric)

# Plot dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
hac.dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()
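Coming back to the correlation_matrix you already computed in the question: linkage can also work from a precomputed matrix if you first turn it into a condensed distance vector. A small sketch of that route, using 1 - correlation as the distance (the spearman correlation_matrix from above is assumed):

from scipy.spatial.distance import squareform

# convert the correlation matrix to a condensed distance vector
dist = 1 - correlation_matrix.values        # distance in the range 0..2
np.fill_diagonal(dist, 0)                   # the diagonal must be exactly zero
condensed = squareform(dist, checks=False)  # condensed form expected by linkage

Z = hac.linkage(condensed, method='average')
hac.dendrogram(Z)
plt.show()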
To retrieve the clusters we can use the fcluster function. It can be run in multiple ways (check the documentation), but in this example we give it the number of clusters we want as the target:
from scipy.cluster.hierarchy import fcluster

def print_clusters(timeSeries, Z, k, plot=False):
    # k: number of clusters I'd like to extract
    results = fcluster(Z, k, criterion='maxclust')

    # check the results
    s = pd.Series(results)
    clusters = s.unique()

    for c in clusters:
        cluster_indices = s[s == c].index
        print("Cluster %d number of entries %d" % (c, len(cluster_indices)))
        if plot:
            timeSeries.T.iloc[:, cluster_indices].plot()
            plt.show()

print_clusters(timeSeries, Z, 6, plot=False)
Output:
Cluster 2 number of entries 10
Cluster 5 number of entries 10
Cluster 3 number of entries 10
Cluster 6 number of entries 10
Cluster 1 number of entries 10
Cluster 4 number of entries 10
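If you would rather cut the dendrogram at a distance threshold than ask for a fixed number of clusters, fcluster also supports criterion='distance'. A short sketch (the threshold value here is only illustrative):

# samples end up in the same cluster if they merge below the distance threshold
threshold = 0.7  # illustrative value; pick it by inspecting the dendrogram
results = fcluster(Z, t=threshold, criterion='distance')
print(pd.Series(results).value_counts())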