Error in plotting of frequency histogram from csv data - python

I am working with a csv file with pandas module on python3. Csv file consists of 5 columns: job, company's name, description of the job, amount of reviews, location of the job; and i want to plot a frequency histogram , where i pick only the jobs containing the words "mechanical engineer" and find the frequencies of the 5 most frequent locations for the "mechanical engineer" job.
So,i defined a variable engloc which stores all the "mechanical engineer" jobs.
engloc=df[df.position.str.contains('mechanical engineer|mechanical engineering', flags=re.IGNORECASE, regex=True)].location
and did a histogram plot with matplotlib with code i found online
x = np.random.normal(size = 1000)
plt.hist(engloc, bins=50)
plt.gca().set(title='Frequency Histogram ', ylabel='Frequency');
but it printed like this
How can i plot a proper frequency histogram where it plots using only 5 of the most frequent locations for jobs containing "mechanical engineer" words, instead of putting all of the locations in the graph?
This is a sample from the csv file

Something along the following lines should help you with numerical data:
import numpy as np
counts_, bins_ = np.histogram(englog.values)
filtered = [(c,b) for (c,b) in zip(counts_,bins_) if counts_>=5]
counts, bins = list(zip(*filtered))
plt.hist(bins[:-1], bins, weights=counts)
For a string type try:
from collections import Counter
coords, counts = list(zip(*Counter(englog.values).most_common(5)))
plt.bar(coords, counts)

Related

plot matplotlib aggregated data python

I need plot of aggregrated data
import pandas as pd
basic_data= pd.read_csv('WHO-COVID-19-global-data _2.csv',parse_dates= ['Date_reported'] )
cum_daily_cases = basic_data.groupby('Date_reported')[['New_cases']].sum()
import pylab
x = cum_daily_cases['Date_reported']
y = cum_daily_cases['New_cases']
pylab.plot(x,y)
pylab.show()
Error: 'Date_reported'
Input: Date_reported, Country_code, Country, WHO_region, New_cases, Cumulative_cases, New_deaths, Cumulative_deaths 2020-01-03,AF,Afghanistan,EMRO,0,0,0,0
Output: the total quantity of "New cases" showed on the plot per day.
What should I do to run this plot? link to dataset
The column names contain a leading space (can be easily seen by checking basic_data.dtypes). Fix that by adding the following line immediately after basic_data was read:
basic_data.columns = [s.strip() for s in basic_data.columns]
In addition, your x variable should be the index after groupby-sum, not a column Date_reported. Correction:
x = cum_daily_cases.index
The plot should show as expected.

How to calculate expected values for a column using Poisson distribution & then compare with actual values?

I have a dataframe which contains the results of different games played. I need to calculate the expected results(how many games result with the same score) with Poisson distribution then compare actual results with expected results.So, imagine I have 2 games that resulted in result = 2, 4 games resulted in result = 9 and so on. I need expected results corresponding to actual values in terms of number of games resulted in a certain result.
I calculated the mean of the results column which I read also is called the expected value. Plotted a histogram of actual results.
import pandas as pd
import numpy as np
# Game Results DataFrame
game_results = pd.DataFrame({"game_id":[56,57,58,59,60],"result":[0,9,4,6,8]})
print(game_results)
# Histogram for result column
result = game_results["result"]
plt.hist(result)
plt.xlabel("Result")
plt.ylabel("Number of Games")
plt.title("Result Histogram")
lamb = result.mean()
You can draw a random poisson distribution using np.random.poisson with your mean and number of observations i.e. len(game_results):
import numpy as np
game_results = pd.DataFrame({"game_id":[56,57,58,59,60],"result":[0,9,4,6,8]})
# Get the lambda
lamb = result.mean()
# Draw a random poisson distribution using the lambda
game_results["expected"] = np.random.poisson(lamb, len(game_results))

A line graph for non-numeric data

I have a dataset with mostly non numeric forms. I would love to create a visualization for them but I am having an error message.
My data set looks like this
|plant_name|Customer_name|Job site|Delivery.Date|DeliveryQuantity|
|SN13|John|Sweden|01.01.2019|6|
|SN14|Ruth|France|01.04.2018|4|
|SN15|Jane|Serbia|01.01.2019|2|
|SN11|Rome|Denmark|01.04.2018|10|
|SN14|John|Sweden|03.04.2018|5|
|SN15|John|Sweden|04.09.2019|7|
|
I need to create a lineplot to show how many times John made a purchase using Delivery Date as my timeline (x-axis)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option("display.max_rows", 5)
hr_data = pd.read_excel("D:\data\Days_Calculation.xlsx", parse_dates = True)
x = hr_data['DeliveryDate']
y = hr_data ['Customer_name']
sns.lineplot(x,y)
Error: No numeric types to aggregate
My expected result show be a line graph like this
John's marker will present on the timeline (Delivery Date) on "01.01.2019", "03.04.2018" and "04.09.2019"
Another instance
To plot string vs float for example Total number of quantity (DeliveryQuantity) vs Customer Name .How can one approach this
how do one format the axes distance of a plot (not label)
Why not make Delivery Date a timestamp object instead of a string?
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"])
Now you got plot options.
Working with John.
john_data = hr_data[hr_data["Customer_name"]=="John"]
sns.countplot(john_data["Delivery.Date"])
Generally speaking you have to aggregate something when working with categorical data. Whether you will be counting names in a column or adding number of orders, or ranking some categories this is still numeric data.
plot_data = hr_data.pivot_table(index='DeliveryDate', columns='Customer_name', values='DeliveryQuantity', aggfunc='sum')
plt.xticks(LISTOFVALUESFORXRANGE)
plot_data.plot(legend=False)

Python plot lines with specific x values from numpy

I have a situation with a bunch of datafiles, these datafiles have a number of samples in a given time frame that depends on the system. i.e. At time t=1 for instance I might have a file with 10 items, or 20 items, at later times in that file I will always have the same number of items. The format is time, x, y, z in columns, and loaded into a numpy array. The time values show which frame, but as mentioned there's always the same, let's go with 10 as a sample. So I'll have a (10,4) numpy array where the time values are identical, but there are many frames in the file, so lets say 100 frames, so really I have (1000,4). I want to plot the data with time on the x-axis and manipulations of the other data on the y, but I am unsure how to do this with line plot methods in matplotlib. Normally to provide both x,y values I believe I need to do a scatter plot, so I'm hoping there's a better way to do this. What I ideally want is to treat each line that has the same time code as a different series (so it will colour differently), and the next bit of data for that same line number in the next frame (time value) will be labelled the same colour, giving those good contiguous lines. We can look at the time column and figure out how many items share a time code, let's call it "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
Edit:
I hope this is what you wanted.
seaborn scatterplot can categorize data to some groups which have the same codes (time code in this case) and use the same colors to them.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
names=["time","x","y","z","code"]) #use your file
df["time"]=pd.to_datetime(df["time"]) #recognize the data as Time
df["x"]=df["time"].dt.day # I changed the data into "Date only" and imported to x column. Easier to see on graph.
#just used random numbers in y and z in my data.
sns.scatterplot("x", "y", data = df, hue = "code") #hue does the grouping
plt.show()
I used csv file here but you can do to your text file as well by adding sep="\t" in the argument. I also added a code in the file. If you have it the code can group the data in the graph, so you don't have to separate or make a hierarchical index. If you want to change colors or grouping please see seaborn website.
Hope this helps.
Alternative, the method I used, but Tim's answer is still accurate as well. Since the time codes are not date/time information I modified my own code to add tags as a second column I call "p" (they're polymers).
import numpy as np
import pandas as pd
datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax = sns.scatterplot("t","x", data = df, hue = "p")
plt.show()
And of course the other columns can be plotted similarly if desired.

Output K-Means to CSV with SciKit Learn - give cluster names

I have the below scikit learn script which outputs a nice chart (below) with each of the clusters.
I have a couple of questions:
- How can I export this to CSV - with a cluster name or ID?
- How can I name the clusters?
- How can I make sure the clusters are always named the same thing? For example, I want to call the top right segment 'high spenders' how do I so that where it will always be correct?
Thanks!
#import the required libraries
# - matplotlib is a charting library
# - Seaborn builds on top of Matplotlib and introduces additional plot types. It also makes your traditional Matplotlib plots look a bit prettier.
# - Numpy is numerical Python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans
#Generate sample data, with distinct clusters for testing
#n_samples = the number of datapoints, equally split across each clusters
#centers = The number of centers to generate (number of clusters) - a center is the arithmetic mean of all the points belonging to the cluster.
#cluster_std = the standard deviation of the clusters - a quantity expressing by how much the members of a group differ from the mean value for the group (how tight is the cluster going to be)
#random_state = controls the random number generator being used. If you don't mention the random_state in the code, then whenever you execute your code a new random value is generated and the train and test datasets would have different values each time. However, if you use a particular value for random_state(random_state = 1 or any other value) everytime the result will be same,i.e, same values in train and test datasets.
#make_blobs generates "isotropic Gaussian blobs" - X is a numpy array with two columns which contain the (x, y) Gaussian coordinates of these points, whereas y contains the list of categories for each.
#X, y = simply means that the output of make_blobs() has two elements, that are assigned to X and y.
X, y = make_blobs(n_samples=300, centers=4,
cluster_std=0.50, random_state=0)
#X now looks like this - column zero becomes the X axis, column1 becomes the Y axis
array([[ 1.85219907, 1.10411295],
[-1.27582283, 7.76448722],
[ 1.0060939 , 4.43642592],
[-1.20998253, 7.83203579],
[ 1.92461484, 1.06347673],
[ 2.28565919, 0.79166208],
[-1.57379043, 2.69773813],
[ 1.04917913, 4.31668562],
[-1.07436851, 7.93489945],
[-1.15872975, 7.97295642]
#The below statement, will enable us to visualise matplotlib charts, even in ipython
#Using matplotlib backend: MacOSX
#Populating the interactive namespace from numpy and matplotlib
%pylab
#plot the chart
#s = the sizer of the points.
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], s=50);
#now, I am definining that I want to find 4 clusters within the data. The general rule I follow is, I will have 7 times less clusters than datapoints.
kmeans = KMeans(n_clusters=4)
#build the model, based on X with the number of clusters defined above
kmeans.fit(X)
#now we're going to find clusters in the randomly generated dataset
predict = kmeans.predict(X)
#now we can plot the prediction
#c = colour, which is based on the predict variable we defined above
#s = the size of the plots
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], c=predict, s=50)
Based on your code the following worked for me. You can certainly stay with numpy for storing the CSV but I simply prefer pandas. The sorting line should give you the same results everytime you run the code. However, since the initliazation of the clusters can have an impact I would also set a seed in your code, e.g. np.random.seed(42) and call the kmeans function with the random_state parameter, e.g. kmeans = KMeans(n_clusters=4, random_state=42)
# transform to dataframe
import pandas as pd
import seaborn as sns
df = pd.DataFrame(X)
df.columns = ["var1", "var2"]
df["cluster"] = predict
colors = sns.color_palette()[0:4]
df = df.sort_values("cluster")
# check plot
sns.scatterplot(df["var1"], df["var2"], hue=df["cluster"], palette=colors)
plt.show()
# define rename schema
mynames = {"0": "center_left", "1": "top_left", "2": "bot_right", "3": "center"}
df["cluster_name"] = [mynames[str(i)] for i in df.cluster]
# plot again to verify order
sns.scatterplot(df["var1"], df["var2"], hue=df["cluster_name"],
palette=colors)
sns.despine()
plt.show()
# save dataframe as CSV
df.to_csv("myoutput.csv")
The first plot looks like this:
The second plot looks like this:
The CSV will look like this:

Categories