Why is the code not plotting the expected output? - python

country = str(input())
import matplotlib.pyplot as plt
lines = f.readlines ()
x = []
y = []
results = []
for line in lines:
words = line.split(',')
f.close()
plt.plot(x,y)
plt.show()
First problem is in the title of the plot. It is giving Population inCountryI instead of Population in Country I.
Second problem is in the graph.

While my answer could point out the mistakes in your code, I think it might also be enlightening to show another, perhaps more standard way, of doing this. This is particularly useful if you're going to do this more often, or with large datasets.
Handling CSV files and creating subgroups out of them by yourself is nice, but can become very tricky. Python already has a built-in csv module, but the Pandas library is nowadays basically the default (there are other options as well) for handling tabular data, which means it is widely available and easy to install. It also works well with Matplotlib. (Read some of Pandas' user guide for a good overview.)
With Pandas, you can use the following (I've put comments on the code in between the actual code):
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize'] = (8, 8)
# Read the CSV file into a Pandas dataframe
# For a normal CSV, this will work fine without tweaks
df = pd.read_csv('population.csv')
# Convert the month and year columns to a datetime
# Years have to be converted to string type for that
# '%b%Y' is the format for month abbreviation (English) and 4-digit year;
# see e.g. https://strftime.org/
# Instead of creating a new column, we set the date as the index ("row-indices")
# of the dataframe
df.index = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
# We can remove the month and year columns now
df = df.drop(columns=['Month', 'Year'])
# For nicety, replace the dot in the country name with a space
df['Country'] = df['Country'].str.replace('.', ' ', regex=False)
# Group the dataframe by country, and loop over the groups
# The resulting grouped dataframes, `grouped`, will have just
# their index (date) values and population values
# The .plot() method will therefore automatically use
# the index/dates as x-axis, and the population as
# y-axis.
for country, grouped in df.groupby('Country'):
    # Use the convenience .plot() method
    grouped.plot()
    # Standard Matplotlib functions are still available
    plt.title(country)
The resulting plots are shown below (2, given the example data).
If you don't want a legend (since there is only one line), use grouped.plot(legend=None) instead.
If you want to pick one specific country, replace the whole for-loop with the following:
country = "Country II"
df[df['Country'] == country].plot()
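To try the steps above without the original file, here is a minimal sketch on a few made-up rows. The `Population` column name, the `Country.I` spelling and all values are assumptions, since the question does not show the CSV itself; only `Month`, `Year` and `Country` come from the code above.

```python
import io
import pandas as pd

# Hypothetical sample data; the real population.csv is not shown in the question
csv_text = """Month,Year,Country,Population
Jan,2020,Country.I,100
Feb,2020,Country.I,110
Jan,2020,Country.II,200
Feb,2020,Country.II,190
"""

df = pd.read_csv(io.StringIO(csv_text))
# Build a datetime index from month abbreviation + year, e.g. 'Jan2020'
df.index = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
df = df.drop(columns=['Month', 'Year'])
# Replace the literal dot in the country name with a space
df['Country'] = df['Country'].str.replace('.', ' ', regex=False)

# Boolean selection of a single country
selected = df[df['Country'] == 'Country II']
print(selected)
```

Calling `selected.plot()` afterwards would draw the population of that one country against the dates in the index.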
If you want to do even more, also have a look at the Seaborn library.
Resulting plots:

Related

Preparing Data-frame for Bokeh Consumption

Trying to plot with Bokeh using a data-frame but plot is displaying empty. Beginner here; missing something fundamental.
My plot works if I hard code some basic X and Y variables so I know the issue has to do with the data-frame I'm trying to use as a source.
...
df = pd.DataFrame(j)
df.columns = ['Team','Type','Date','SLA_MET']
df['SLA_MET']= df['SLA_MET'].round(2)
pd.set_option('display.max_columns', 10)
print(df)
source = ColumnDataSource(df)
p = figure(background_fill_color='gray',
           background_fill_alpha=0.5,
           border_fill_color='blue',
           border_fill_alpha=0.25,
           plot_height=600,
           plot_width=1000,
           x_axis_label='Month',
           x_axis_location='below',
           y_axis_label='% SLA Met',
           y_axis_location='left',
           title='Percentage of SLA Met',
           title_location='above',
           toolbar_location='below',
           tools='save')
p.line(source=source,x='Date',y='SLA_MET')
show(p)
Decided to pass clean lists to plot
for index, row in df.iterrows():
    if row[2] =='Service Request':
        sr_list.append(row[3])
    else:
        inc_list.append(row[3])
        date_list.append(row[1]) # Only need 1 list of dates
Problem is dates in scientific notation and dates are not in order.
Bokeh does not know what to do with the strings in your Date column. You have two options:
convert this column to real python/numpy/pandas (numeric) datetime values, and also set x_axis_type="datetime" in your figure call, or
use the string values as categorical factors
It's not clear what your intention is, so I can't recommend one vs the other.
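As a rough sketch of the first option, the data frame could be prepared like this before it is handed to Bokeh. The sample rows and the `'%Y-%m'` date format are assumptions, because the question does not show what the strings in the Date column look like.

```python
import pandas as pd

# Hypothetical rows; column names follow the question, values are made up
df = pd.DataFrame({
    'Team': ['Ops', 'Ops', 'Ops'],
    'Type': ['Incident', 'Incident', 'Incident'],
    'Date': ['2019-03', '2019-01', '2019-02'],
    'SLA_MET': [0.91, 0.85, 0.88],
})

# Convert the strings to real datetime values (assumed format: year-month)
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m')
# Sort chronologically so the line is drawn in date order
df = df.sort_values('Date').reset_index(drop=True)

print(df.dtypes)
# df can now be wrapped in ColumnDataSource(df) and drawn on a figure
# that was created with x_axis_type="datetime"
```

With a datetime axis the tick labels are formatted as dates rather than as large numbers, and sorting beforehand keeps the line from jumping back and forth.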

Selecting Specific Data to Sum and Plot

This is some of the data that is located in the Excel sheet.
I want to select musical theater shows (known in the code as 'ID') that had more minorities than Caucasians in the cast.
Once determined, I want to place the selected shows into a new data frame that will only hold those shows, because it will be easier to manipulate. In the new data frame, I want each show's row to also hold the related ethnicity, so I can compare it to the audience ethnicity. I then tried to plot this information.
So generally, I want to add up the values in specific rows if that row fits specific summation criteria. All data used in this project is located in an Excel sheet that is converted to a CSV and uploaded as a data frame. I would then like to plot the values of the cast in its entirety and compare the cast's ethnicity to the audience ethnicity.
I am working with Python and have tried to remove unneeded data by selecting columns with if statements, so that the data frame only includes the shows that have more minorities than Caucasians. I then tried to use this information in the plot. I am unsure whether I have to filter out all the unneeded columns if I am not using them in the calculations.
import numpy as np
import pandas as pd
#first need to import numpy so that calculations can be made
from google.colab import files
uploaded = files.upload()
# df = pd.read_csv('/content/drive/My Drive/allTheaterDataV2.csv')
import io
df = pd.read_csv(io.BytesIO(uploaded['allTheaterDataV2.csv']))
# need to download excel sheet as csv and then upload into colab so that it can
# be manipulated as a dataframe
# want to select shows(ID) that had more minorities than Caucasians in the cast
# once determined, the selected shows should be placed into a new data frame that
# will only hold the shows and the related ethnicity, and compared to audience ethnicity
# this information should then be plotted
# first we will determine the shows that have a majority ethnic cast
minorcal = list(df)
minorcal.remove('CAU')
minoritycastSUM = df[minorcal].sum(axis=1)
# print(minorcal)
# next, we determine how many people in the cast were Caucasian, so remove all others
caucasiancal = list(df)
# i first wanted to do caucasiancal.remove('AFRAM', 'ASIAM', 'LAT', 'OTH')
# but got the statement I could only have 1 argument so i just put each on their own line
caucasiancal.remove('AFRAM')
caucasiancal.remove('ASIAM')
caucasiancal.remove('LAT')
caucasiancal.remove('OTH')
idrowcaucal = df[caucasiancal].sum(axis=1)
minoritycompare = old.filter(['idrowcaucal','minoritycastSUM'])
print(minoritycompare)
# now compare the two values per line
if minoritycastSUM < caucasiancal:
    minoritydf = pd.df.minorcal.append()
# plot new data frame per each show and compare to audience ethnicity
df.plot(x=['AFRAM', 'ASIAM', 'CAU', 'LAT', 'OTH', 'WHT', 'BLK', 'ASN', 'HSP', 'MRO'], y = [''])
# i am unsure how to call the specific value for each column
plt.title('ID Ethnicity Comparison')
# i am unsure how to call the specific show so that only one show is per plot so for now i just subbed in 'ID'
plt.xlabel('Ethnicity comparison')
plt.ylabel('Number of Cast Members/Audience Members')
plt.show()
I would like to see the data frame with the specific shows that fit the criteria, and then the plot for each show, but right now I am getting errors about how to build the new data frame, and Python says that the lengths of the if statements cannot be used.
First of all, this will not be a complete answer, as
I don't know what you imagine your final plot to look like
I don't know what the columns in your DataFrame are (consider using more descriptive column labels, e.g. 'caucasian actors' instead of 'CAU',…)
it is unclear to me whether any trend can be derived from your data, since the screenshot you've posted shows equal audience compositions for the first shows
Nevertheless, I built upon the DataFrame in this answer, and maybe this initial plot of the "non caucasian/caucasian ratio" per show can point you in the right direction.
Perhaps you could build a similar set of sum & ratio columns for the audience columns, and then plot the actor ratio as a function of the audience ratio to see whether a more caucasian audience prefers more or less caucasian actors (I guess that's what you're after?).
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'ID':['Billy Elliot','next to normal','shrek','guys and dolls',
                         'west side story', 'pal joey'],
                   'Season' : [20082009,20082009,20082009,
                               20082009,20082009,20082009],
                   'AFRAM' : [2,0,4,4,0,1],
                   'ASIAM' : [0,0,1,0,0,0],
                   'CAU' : [48,10,25,24,28,20],
                   'LAT' : [1,0,1,3,18,0],
                   'OTH' : [0,0,0,0,0,0],
                   'WHT' : [73.7,73.7,73.7,73.7,73.7,73.7]})
## define a sum column for non caucasian actors (I suppose?)
df['non_cau']=df[['AFRAM','ASIAM','LAT','OTH']].sum(axis=1)
## build a ratio of non caucasian to caucasian
df['cau_ratio']=df['non_cau']/df['CAU']
## make a quick plot
fig,ax=plt.subplots()
ax.scatter(df['ID'],df['cau_ratio'])
ax.set_ylabel('non cau / cau ratio')
plt.tight_layout()
plt.show()
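For the selection step the question asks about, a plain `if` on two whole columns does not work, because comparing two Series gives one True/False per row rather than a single value. A boolean mask is the usual way to keep only the rows that meet the condition. The sketch below uses made-up shows and counts (not the data from the question) so that some rows actually pass the filter.

```python
import pandas as pd

# Hypothetical cast counts, made up for illustration only
df = pd.DataFrame({
    'ID': ['Show A', 'Show B', 'Show C'],
    'AFRAM': [2, 4, 1],
    'ASIAM': [1, 3, 0],
    'CAU': [10, 5, 8],
    'LAT': [0, 2, 9],
    'OTH': [0, 0, 0],
})

# Sum the non-Caucasian cast columns row by row
minority_cols = ['AFRAM', 'ASIAM', 'LAT', 'OTH']
df['non_cau'] = df[minority_cols].sum(axis=1)

# Row-wise comparison yields a boolean Series, used here as a filter
mask = df['non_cau'] > df['CAU']
minority_df = df[mask].reset_index(drop=True)

print(minority_df)
```

`minority_df` then holds only the shows with a majority non-Caucasian cast, and the audience columns could be carried along in the same rows for the comparison.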

How to draw a plot like this using Python? [duplicate]

I have a CSV file with 27,000 lines. I am trying to create a jitter plot, just like this one [https://static1.squarespace.com/static/56fd706140261df95349d4bd/t/59297c72579fb3d813d591c1/1495891103667/Jitter+Example+The+Truthful+Art.png?format=1000w].
The 'y' axis would be the column called "VALOR_REEMBOLSADO" (stands for "refund value"). The 'x' axis would be the column called "MES" (stands for "month").
It represents the spending of Brazilian senators in 2017. The CSV file is very organized, but originally has "VALOR_REEMBOLSADO" as a string and not as a float. I replaced the "," with ".", but I still can't plot the chart.
Can someone help me with the code? What code can create a chart like that?
Here you find the CSV file of the year 2017: https://www12.senado.leg.br/transparencia/dados-abertos-transparencia/dados-abertos-ceaps
At first I have to admit that I cannot understand some aspects of your question (first link doesn't work, and even more important: you want an x-axis which shows the months but in the plot, the data is shown over states).
But I see that your problems start already at the very beginning of reading the data in, so I'll try to give you the needed hints to start:
For reading in csv-data like this, I'd recommend pandas, usually imported with
import pandas as pd
It has a csv reader included, which is quite powerful. Generally, you should avoid manually tweaking the data sources you have (like changing decimal signs etc.), because this is something which is already addressed by importer functions like read_csv (and you don't want to do this again and again in the future with new data files but the same plot generation):
filepath = 'wherever/file/may/roam/2017.csv'
data = pd.read_csv(filepath, skiprows=1, sep=';', usecols=[1, 9], decimal=',')
With filepath you tell the importer where you stored the csv-file, skiprows=1 says that you're not interested in the first line of the file, sep defines the delimiter between the columns and via usecols you can pick only the columns of interest, 'MES' and 'VALOR_REEMBOLSADO' in your example.
decimal specifies the decimal sign of float numbers in your data.
Now data contains a pandas dataframe of your data:
In: data[:10]
Out:
   MES  VALOR_REEMBOLSADO
0    1              97.00
1    1            6000.00
2    1             418.04
3    1            1958.95
4    1            1178.67
5    1            1252.65
6    2              62.30
7    2             240.81
8    2            6000.00
9    2            2062.25
So this should be already something you can play around with.
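The read_csv parameters described above can be tried on a few lines of text before touching the real file. The sample below is made up (including the first line that gets skipped and the ANO and SENADOR columns), and it selects columns by name instead of by position only because the sample has fewer columns than the real file.

```python
import io
import pandas as pd

# Hypothetical excerpt: one leading line to skip, ';' as separator,
# ',' as decimal sign
raw = (
    "ULTIMA ATUALIZACAO;01/01/2018\n"
    "ANO;MES;SENADOR;VALOR_REEMBOLSADO\n"
    "2017;1;Senator A;97,00\n"
    "2017;1;Senator B;6000,00\n"
    "2017;2;Senator A;62,30\n"
)

sample = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,                              # ignore the first line
    sep=';',                                 # column delimiter
    usecols=['MES', 'VALOR_REEMBOLSADO'],    # keep only these columns
    decimal=',',                             # parse '97,00' as 97.0
)

# The value column is numeric right away, so it can be aggregated
monthly = sample.groupby('MES')['VALOR_REEMBOLSADO'].sum()
print(sample.dtypes)
print(monthly)
```

Because `decimal=','` does the conversion at read time, no manual search-and-replace in the file is needed.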
This data can now be plotted with matplotlib or seaborn if you like.
pandas itself has also some plotting methods already included.
However, your question differs from the example plot you added, as I pointed out, so from this point on it's a little difficult to help precisely your needs.
You can aggregate all equal months for example, to create a plot over months. For those cases there is a groupby method for Dataframes:
data.groupby('MES')
This only returns a so-called groupby object, but you can tell it what you want to do with the grouped data, e.g.:
In: data.groupby('MES').sum()
Out:
     VALOR_REEMBOLSADO
MES
1           1558581.11
2           1951731.07
3           2225328.21
4           2248882.83
5           2256224.68
6           2216981.94
7           2053173.90
8           2372847.10
9           2161915.35
10          2355417.34
11          2294658.51
12          2938033.00
if you are interested in the sum within each month. The same for the average with data.groupby('MES').mean(). And for a first plot you could just add the plotting method like
data.groupby('MES').sum().plot()
which produces a line plot of the summed refund values per month.
If you want to see the distribution and the mean value like in the picture in your question (but still plotted over months, not over states, because I don't see this information in your file) you could have a look at scatter plots:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(data['MES'],data['VALOR_REEMBOLSADO'])
plt.plot(data.groupby('MES').mean()['VALOR_REEMBOLSADO'], 'k_', ms=10)
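which produces a scatter plot of the individual refund values per month, with each monthly mean marked by a black horizontal dash.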
But as you mention seaborn in your tag list: this library provides a jitter plot like the one you refer to via stripplot. So this is finally the answer to the plotting part of your question, leading to this piece of code:
import pandas as pd
import seaborn as sns
filepath = 'https://raw.githubusercontent.com/gabrielacaesar/studyingPython/master/ceap-sf-new-12-04-2018.csv'
data = pd.read_csv(filepath, usecols=[1,9], decimal=',')
x = data['MES'].values
y = data['VALOR_REEMBOLSADO'].values
sns.stripplot(x=x, y=y, jitter=True)  # recent seaborn versions require x and y as keyword arguments
which produces the jitter plot of refund values per month.

plot the relationship between two variables with pandas

I am new to python but am aware about the usefulness of pandas, thus I would like to kindly ask if someone can help me to use pandas in order to address the below problem.
I have a dataset with buses, which looks like:
BusModel;BusID;ModeName;Value;Unit;UtcTime
Alpha;0001;Engine hours;985;h;2016-06-22 19:58:09.000
Alpha;0001;Engine hours;987;h;2016-06-22 21:58:09.000
Alpha;0001;Engine hours;989;h;2016-06-22 23:59:09.000
Alpha;0001;Fuel consumption;78;l;2016-06-22 19:58:09.000
Alpha;0001;Fuel consumption;88;l;2016-06-22 21:58:09.000
Alpha;0001;Fuel consumption;98;l;2016-06-22 23:59:09.000
The file is in .csv format and is separated by semicolons (;). Please note that I would like to plot the relationship between ‘Engine hours’ and ‘Fuel consumption’ by 'calculating the mean value of both for each day' based on the UtcTime. Moreover, I would like to plot graphs for all the buses in the dataset (not only 0001 but also 0002, 0003 etc.). How can I do that with a simple loop?
Start with the following in interactive mode:
import pandas as pd
df = pd.read_csv('bus.csv', sep=";", parse_dates=['UtcTime'])
You should be able to start playing around with the DataFrame and discovering functions you can directly use with the data. To select the rows of a single bus by ID, just do:
>>> bus1 = df[df.BusID == 1]
>>> bus1
Substitute 1 with the ID of the bus you require. This will return you a sub-DataFrame. To get BusID 1 and just their engine hours do:
>>> bus1[bus1.ModeName == "Engine hours"]
You can quickly get statistics of columns by doing
>>> bus1.Value.describe()
Once you grouped the data you need you can start plotting:
>>> import matplotlib.pyplot as plt
>>> bus1[bus1.ModeName == "Engine hours"].plot()
>>> bus1[bus1.ModeName == "Fuel consumption"].plot()
>>> plt.show()
There is more explanation in the docs. Please refer to http://pandas.pydata.org/pandas-docs/stable/.
If you really want to use pandas, remember this simple thing: never use a loop. Loops aren't scalable, so try to use built-in functions. First let's read your dataframe:
import pandas as pd
data = pd.read_csv('bus.csv',sep = ';')
Here is the weak point of my answer: I don't know how to manage dates efficiently. So create a column named day which contains the day from UtcTime (I would use an apply method like this: data['day'] = data['UtcTime'].apply(lambda x: x[:10]), but it's a hidden loop, so don't do that!)
Then to take only the data of a single bus, try a slicing method:
data_bus1 = data[data.BusID == 1]
Finally use the groupby function:
data_bus1[['ModeName','Value','day']].groupby(['ModeName','day'],as_index = False).mean()
Or if you don't need to separate your buses into different dataframes, you can use the groupby on the whole data:
data[['BusID','ModeName','Value','day']].groupby(['BusID','ModeName','day'],as_index = False).mean()
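One possible way to fill in the missing day column and get from the grouped means to a plot per bus is sketched below, using the six sample rows from the question. It assumes the timestamps are parsed with parse_dates, uses the `.dt` accessor to truncate them to the day, and reshapes with pivot_table so that both quantities sit side by side; this is one option among several, not the only way to do it.

```python
import io
import pandas as pd

# The sample rows from the question
csv_text = """BusModel;BusID;ModeName;Value;Unit;UtcTime
Alpha;0001;Engine hours;985;h;2016-06-22 19:58:09.000
Alpha;0001;Engine hours;987;h;2016-06-22 21:58:09.000
Alpha;0001;Engine hours;989;h;2016-06-22 23:59:09.000
Alpha;0001;Fuel consumption;78;l;2016-06-22 19:58:09.000
Alpha;0001;Fuel consumption;88;l;2016-06-22 21:58:09.000
Alpha;0001;Fuel consumption;98;l;2016-06-22 23:59:09.000
"""

df = pd.read_csv(io.StringIO(csv_text), sep=';', parse_dates=['UtcTime'])

# Truncate each timestamp to its day (vectorized, no apply needed)
df['day'] = df['UtcTime'].dt.floor('D')

# One row per (bus, day), one column per ModeName, holding the daily mean
daily = df.pivot_table(index=['BusID', 'day'], columns='ModeName',
                       values='Value', aggfunc='mean')
print(daily)

# One scatter plot per bus: daily mean engine hours vs. fuel consumption
for bus_id, per_bus in daily.groupby(level='BusID'):
    per_bus.plot.scatter(x='Engine hours', y='Fuel consumption',
                         title=f'Bus {bus_id}')
```

Note that BusID is read as an integer here, so 0001 becomes 1. In a script, a final `plt.show()` (after `import matplotlib.pyplot as plt`) displays the figures.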

How to get multiple legends from multiple pandas plots

I've got two dataframes (both indexed on time), and I'd like plot columns from both dataframes together on the same plot, with legend as if there were two columns in the same dataframe.
If I turn on legend with one column, it works fine, but if I try to do both, the 2nd one overwrites the first one.
import pandas as pd
# Use ERDDAP's built-in relative time functionality to get the last 7 days:
start='now-7days'
stop='now'
# URL for wind data
url='http://www.neracoos.org/erddap/tabledap/E01_met_all.csv?\
station,time,air_temperature,barometric_pressure,wind_gust,wind_speed,\
wind_direction,visibility\
&time>=%s&time<=%s' % (start,stop)
# load CSV data into Pandas
df_met = pd.read_csv(url,index_col='time',parse_dates=True,skiprows=[1]) # skip the units row
# URL for wave data
url='http://www.neracoos.org/erddap/tabledap/E01_accelerometer_all.csv?\
station,time,mooring_site_desc,significant_wave_height,dominant_wave_period&\
time>=%s&time<=%s' % (start,stop)
# Load the CSV data into Pandas
df_wave = pd.read_csv(url,index_col='time',parse_dates=True,skiprows=[1]) # skip the units row
plotting one works fine:
df_met['wind_speed'].plot(figsize=(12,4),legend=True);
but if I try to plot both, the first legend disappears:
df_met['wind_speed'].plot(figsize=(12,4),legend=True)
df_wave['significant_wave_height'].plot(secondary_y=True,legend=True);
Okay, thanks to the comment by unutbu pointing me to essentially the same question (which I searched for but didn't find), I just need to modify my plot command to:
import matplotlib.pyplot as plt
df_met['wind_speed'].plot(figsize=(12,4))
df_wave['significant_wave_height'].plot(secondary_y=True);
ax = plt.gca();
lines = ax.left_ax.get_lines() + ax.right_ax.get_lines()
ax.legend(lines, [l.get_label() for l in lines])
and now I get this, which is what I was looking for:
Well. Almost. It would be nice to get the (right) and (left) on the legend to make it clear which scale was for which line. @unutbu to the rescue again:
import matplotlib.pyplot as plt
df_met['wind_speed'].plot(figsize=(12,4))
df_wave['significant_wave_height'].plot(secondary_y=True);
ax = plt.gca();
lines = ax.left_ax.get_lines() + ax.right_ax.get_lines()
ax.legend(lines, ['{} ({})'.format(l.get_label(), side) for l, side in zip(lines, ('left', 'right'))]);
which produces a single legend listing both lines, each labeled with its side.
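If the left_ax and right_ax attributes are not available in a given pandas version, the same combined legend can be built with plain Matplotlib. The sketch below swaps in `twinx()` and `get_legend_handles_labels()` for the pandas plotting calls, and uses short made-up series in place of the downloaded wind and wave data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for df_met['wind_speed'] and
# df_wave['significant_wave_height']
idx = pd.date_range('2024-01-01', periods=5, freq='D')
wind = pd.Series([3.0, 4.5, 5.0, 4.0, 3.5], index=idx, name='wind_speed')
wave = pd.Series([0.5, 0.7, 0.9, 0.8, 0.6], index=idx,
                 name='significant_wave_height')

fig, ax_left = plt.subplots(figsize=(12, 4))
ax_right = ax_left.twinx()  # second y-axis sharing the same x-axis

ax_left.plot(wind.index, wind.values, color='tab:blue',
             label=f'{wind.name} (left)')
ax_right.plot(wave.index, wave.values, color='tab:orange',
              label=f'{wave.name} (right)')

# Collect the handles and labels from both axes into one legend
handles_l, labels_l = ax_left.get_legend_handles_labels()
handles_r, labels_r = ax_right.get_legend_handles_labels()
legend = ax_right.legend(handles_l + handles_r, labels_l + labels_r)
```

The legend is attached to the right-hand axes so that it is drawn on top of both lines.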
