Creating Mixed Charts from CSV Files in Python

I've developed a Perl script that manipulates data and gives me a final CSV file. Unfortunately, the Perl packages for graphs and charts are not supported on my system, and I'm not able to install them due to work restrictions. So I want to take the CSV file and put together something in Python to generate a mixed graph. I want the first column to be the labels on the x-axis, the next three columns to be bar graphs, and the last column (Target) to be a line across the x-axis.
Here is sample data:
Name PreviousWeekProg CurrentWeekProg ExpectedProg Target
Dan 94 92 95 94
Jarrod 34 56 60 94
Chris 45 43 50 94
Sam 89 90 90 94
Aaron 12 10 40 94
Jenna 56 79 80 94
Eric 90 45 90 94
I am looking for a graph like this:
I did some research, but being new to Python, I wanted to ask for some guidance on good modules to use for mixed charts and graphs. Sorry if my post is vague. Apart from looking at other references online, I don't know how to go about this. Also, my version of Python is 3.8 and I DO have matplotlib installed (which is what I was previously recommended to use).

Since the answer by @ShaunLowis doesn't include a complete example, I thought I'd add one. As far as reading the .csv file goes, the best way to do it in this case is probably to use pandas.read_csv(), as the other answer points out. In this example I have named the file test.csv and placed it in the same directory from which I run the script.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("./test.csv")
names = df['Name'].values
x = np.arange(len(names))
w = 0.3
plt.bar(x-w, df['PreviousWeekProg'].values, width=w, label='PreviousWeekProg')
plt.bar(x, df['CurrentWeekProg'].values, width=w, label='CurrentWeekProg')
plt.bar(x+w, df['ExpectedProg'].values, width=w, label='ExpectedProg')
plt.plot(x, df['Target'].values, lw=2, label='Target')
plt.xticks(x, names)
plt.ylim([0,100])
plt.tight_layout()
plt.xlabel('X label')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, ncol=5)
plt.savefig("CSVBarplots.png", bbox_inches="tight")
plt.show()
Explanation
From the pandas docs for read_csv() (arguments extraneous to the example excluded),
pandas.read_csv(filepath_or_buffer)
Read a comma-separated values (csv) file into DataFrame.
filepath_or_buffer: str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. [...] If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.
In this example I am specifying the path to the file, not a file object.
names = df['Name'].values
This extracts the values in the 'Name' column and converts them to a numpy.ndarray object. In order to plot multiple bars with one name I reference this answer. However, in order to use this method, we need an x array of numeric positions of the same length as the names array, hence
x = np.arange(len(names))
then set a width for the bars and offset the first and third bars accordingly, as outlined in the referenced answer
w = 0.3
plt.bar(x-w, df['PreviousWeekProg'].values, width=w, label='PreviousWeekProg')
plt.bar(x, df['CurrentWeekProg'].values, width=w, label='CurrentWeekProg')
plt.bar(x+w, df['ExpectedProg'].values, width=w, label='ExpectedProg')
from the matplotlib.pyplot.bar page (unused non-positional arguments excluded),
matplotlib.pyplot.bar(x, height, width=0.8)
The bars are positioned at x [...] their dimensions are given by width and height.
Each of x, height, and width may either be a scalar applying to all bars, or it may be a sequence of length N providing a separate value for each bar.
In this case, x and height are sequences of values (different for each bar) and width is a scalar (the same for each bar).
Next is the line for target which is pretty straightforward, simply plotting the x values created earlier against the values from the 'Target' column
plt.plot(x, df['Target'].values, lw=2, label='Target')
where lw specifies the linewidth. Disclaimer: if the target value isn't the same for each row of the .csv this will still work but may not look exactly how you want it to as is.
The next two lines,
plt.xticks(x, names)
plt.ylim([0,100])
just add the names below the bars at the appropriate x positions and then set the y limits to span the interval [0, 100].
The final touch here is to add the legend below the plot,
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, ncol=5)
see this answer for more on how to adjust this as desired.
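For reference, the same chart can also be sketched with matplotlib's object-oriented interface. This is only a minimal sketch: the sample data from the question is embedded as comma-separated text via io.StringIO (an assumption, so the snippet runs without test.csv), and bar_columns is a name introduced here for illustration.

```python
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumption: the question's sample data, written out as comma-separated text.
csv_text = """Name,PreviousWeekProg,CurrentWeekProg,ExpectedProg,Target
Dan,94,92,95,94
Jarrod,34,56,60,94
Chris,45,43,50,94
Sam,89,90,90,94
Aaron,12,10,40,94
Jenna,56,79,80,94
Eric,90,45,90,94
"""
df = pd.read_csv(io.StringIO(csv_text))

bar_columns = ["PreviousWeekProg", "CurrentWeekProg", "ExpectedProg"]
x = np.arange(len(df))  # one integer position per name
w = 0.3                 # width of a single bar

fig, ax = plt.subplots()
# Offsets of -w, 0 and +w place the three bars side by side around each x.
for k, col in enumerate(bar_columns):
    ax.bar(x + (k - 1) * w, df[col], width=w, label=col)

# The target is drawn as a line over the same x positions.
ax.plot(x, df["Target"], lw=2, color="black", label="Target")

ax.set_xticks(x)
ax.set_xticklabels(df["Name"])
ax.set_ylim(0, 100)
ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.1), ncol=4)
fig.tight_layout()
```

Calling plt.show() or fig.savefig("CSVBarplots.png", bbox_inches="tight") afterwards displays or saves the result.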

I would recommend reading in your .csv file using the 'read_csv()' utility of the Pandas library like so:
import pandas as pd
df = pd.read_csv(filepath)
This stores the information in a Dataframe object. You can then access your columns by:
my_column = df['PreviousWeekProg']
After which you can call:
my_column.plot(kind='bar')
On whichever column you wish to plot.
Configuring subplots is a different beast, for which I would recommend using matplotlib's pyplot.
I would recommend starting with these figure and axes object declarations, then going from there:
import matplotlib.pyplot as plt
# plt.subplots() creates the figure and four separate axes in one call;
# repeated bare plt.subplot() calls would all refer to the same axes.
fig, (ax1, ax2, ax3, ax4) = plt.subplots(nrows=4)
Where you can read more about adding in axes data here.
Let me know if this helps!
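As a rough illustration of how the pandas plotting call and a matplotlib axes object fit together for the chart in the question: DataFrame.plot.bar can draw several columns as grouped bars on one axes, and the target line can then be added to the same axes. The inline data (only the first three rows of the sample) and the variable names below are assumptions made so the sketch is self-contained.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Assumption: first three rows of the question's sample data, as CSV text.
csv_text = """Name,PreviousWeekProg,CurrentWeekProg,ExpectedProg,Target
Dan,94,92,95,94
Jarrod,34,56,60,94
Chris,45,43,50,94
"""
df = pd.read_csv(io.StringIO(csv_text))

fig, ax = plt.subplots()

# One call draws the three progress columns as grouped bars per name.
df.plot.bar(
    x="Name",
    y=["PreviousWeekProg", "CurrentWeekProg", "ExpectedProg"],
    ax=ax,
    rot=0,
)

# pandas puts the bar groups at integer positions 0, 1, 2, ...,
# so the line is plotted against those same positions.
ax.plot(range(len(df)), df["Target"], color="black", lw=2, label="Target")

ax.set_ylim(0, 100)
ax.legend()  # rebuild the legend so it also lists the Target line
```

A single axes is enough here because the bars and the line share the same x and y scales.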

You can use the parameter hue in the package seaborn. First, you need to reshape your data set with the function melt:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df1 = df.melt(id_vars=['Name', 'Target'])
print(df1.head(10))
Output:
Name Target variable value
0 Dan 94 PreviousWeekProg 94
1 Jarrod 94 PreviousWeekProg 34
2 Chris 94 PreviousWeekProg 45
3 Sam 94 PreviousWeekProg 89
4 Aaron 94 PreviousWeekProg 12
5 Jenna 94 PreviousWeekProg 56
6 Eric 94 PreviousWeekProg 90
7 Dan 94 CurrentWeekProg 92
8 Jarrod 94 CurrentWeekProg 56
9 Chris 94 CurrentWeekProg 43
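To make the reshaping step concrete, here is a small sketch of melt on the first two rows of the sample data. The frame is typed in by hand here, which is an assumption; in practice df would come from pd.read_csv. Each progress column becomes a group of rows, with the original column name stored in 'variable'.

```python
import pandas as pd

# Assumption: first two rows of the question's sample data, entered by hand.
df = pd.DataFrame(
    {
        "Name": ["Dan", "Jarrod"],
        "PreviousWeekProg": [94, 34],
        "CurrentWeekProg": [92, 56],
        "ExpectedProg": [95, 60],
        "Target": [94, 94],
    }
)

# Name and Target are kept as identifier columns; every other column is
# stacked into (variable, value) pairs.
df1 = df.melt(id_vars=["Name", "Target"])

print(df1.shape)  # 2 names x 3 melted columns = 6 rows, 4 columns
print(df1)
```

This long format is what lets a single 'variable' column drive the hue grouping.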
Now you can use the column 'variable' as your hue parameter in the function barplot:
sns.set_style("whitegrid") # use the whitegrid style (set before creating the figure so it takes effect)
fig, ax = plt.subplots(figsize=(10, 5)) # set the size of a figure
sns.barplot(x='Name', y='value', hue='variable', data=df1) # plot
xmin, xmax = plt.xlim() # get x-axis limits
ax.hlines(y=df1['Target'], xmin=xmin, xmax=xmax, color='red') # add multiple lines
# or ax.axhline(y=df1['Target'].max()) to add a single line
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.06), ncol=4, frameon=False) # move legend to the bottom
plt.title('Student Progress', loc='center') # add title
plt.yticks(np.arange(df1['value'].min(), df1['value'].max()+1, 10.0)) # change tick frequency
plt.xlabel('') # set xlabel
plt.ylabel('') # set ylabel
plt.show() # show plot

Related

Python - Pandas - Plot Data frame without headers

I am trying to use the .plot() function in pandas to plot data into a line graph.
The data is sorted by date, with 48 values after each date. Sample below:
1 2 ... 46 47 48
18 2018-02-19 1.317956 1.192840 ... 1.959250 1.782985 1.418093
19 2018-02-20 1.356267 1.192248 ... 2.123432 1.760629 1.569340
20 2018-02-21 1.417181 1.288694 ... 2.086715 1.823581 1.612062
21 2018-02-22 1.431536 1.279514 ... 2.201972 1.878109 1.694159
etc until row 346.
I tried the below but .plot does not seem to take positional arguments:
df.plot(x=df.iloc[0:346,0],y=[0:346,1:49]
How would I go about plotting my rows by date (the 1st column) on a line graph, and can I expand this to include multiple axes?
There are multiple ways to do this, some of which are directly through the pandas dataframe. However, given your sample plotting line, I think the easiest might be to just use matplotlib directly:
import matplotlib.pyplot as plt
plt.plot(df.iloc[0:346,0],df.iloc[0:346,1:49])
For multiple axes you can add a few lines to make subplots. For example:
import matplotlib.pyplot as plt
fig = plt.figure()
# plt.plot() has no ax keyword, so call plot() on each axes object instead
ax1 = plt.subplot(1,2,1)
ax1.plot(df.iloc[0:346,0],df.iloc[0:346,1:10])
ax2 = plt.subplot(1,2,2)
ax2.plot(df.iloc[0:346,0],df.iloc[0:346,20:30])
You can also do this using the pandas plot() function that you were trying to use; it takes an ax argument, where you can provide the axes on which to plot. If you want to stick to pandas, I think you'd be best off setting the index to be a datetime index (see this link as an example: https://stackoverflow.com/a/27051371/12133280) and then using df.plot(y='column_name', ax=ax1). The x axis will be the index, which you would have set to be the date.
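The pandas route can be sketched roughly as follows. The frame below is synthetic (random numbers shaped like the question's data: a date in column 0 and values in columns 1 to 48), and by_date is a name introduced only for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: column 0 holds the date,
# columns 1..48 hold the values (random numbers, not the asker's data).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 48)) + 1.0, columns=range(1, 49))
df.insert(0, 0, pd.date_range("2018-02-19", periods=4, freq="D"))

# Make sure the date column is a real datetime, then use it as the index
# so that pandas puts it on the x axis automatically.
df[0] = pd.to_datetime(df[0])
by_date = df.set_index(0)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4), sharey=True)

# Select columns by label and tell pandas which axes to draw on.
by_date[list(range(1, 10))].plot(ax=ax1, legend=False)
by_date[list(range(20, 30))].plot(ax=ax2, legend=False)
```

Each selected column becomes one line, so the left panel shows columns 1 to 9 and the right panel columns 20 to 29.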

Plots, by Label, frequency of words

I need to create separate plots based on a label. My dataset is
Label Word Frequency
439 10.0 glass 600
471 10.0 tv 34
463 10.0 screen 31
437 10.0 laptop 15
454 10.0 info 15
65 -1.0 dog 1
68 -1.0 cat 1
69 -1.0 win 1
70 -2.0 man 1
71 -2.0 woman 1
In this case I would expect three plots, one for 10, one for -1 and one for -2, with on the x axis Word column and on the y-axis the Frequency (it is already sorted in descending order by Label).
I have tried as follows:
df['Word'].hist(by=df['Label'])
But it seems to be wrong as the output is far away from the expected one.
Any help would be great
You don't want to be using a histogram here: a histogram plot is where the columns of your dataframe contain raw data, and the hist function buckets the raw values and finds the frequencies of each bucket, and then plots.
Your dataframe is already bucketed, with a column in which the frequencies have already been calculated; what you need is the df.plot.bar() method. Unfortunately, this is quite new, and does not yet allow a by parameter, so you have to deal with the subplots manually.
Full walkthrough code for the cut-down example you have provided follows. Obviously you can make it more generic by not hardcoding the number of subplots required in the line marked [1].
# Set up:
import matplotlib.pyplot as plt
import pandas as pd
import io
txt = """Label,Word,Frequency
10.0,glass,600
10.0,tv,34
10.0,screen,31
10.0,laptop,15
10.0,info,15
-1.0,dog,1
-1.0,cat,1
-1.0,win,1
-2.0,man,1
-2.0,woman,1"""
dfA = pd.read_csv((io.StringIO(txt)))
labels = dfA["Label"].unique()
# Set up subplots on which to plot.
# Make more generic by not hardcoding nrows and ncols in [1],
# but calculating them depending on how many labels you have.
fig, axes = plt.subplots(nrows=2, ncols=2) # [1]
ax_list = axes.flatten() # axes is a 2D NumPy array of Axes objects;
# ax_list is a flat 1D array, which is easier to index.
# Loop through labels and plot the bar chart to the corresponding axis object.
for i in range(len(labels)):
    dfA[dfA["Label"]==labels[i]].plot.bar(x="Word", y="Frequency", ax=ax_list[i])
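One possible way to avoid hardcoding the grid, sketched under the assumption that two columns of subplots are wanted (ncols is a free choice here, not something the answer prescribes): compute the number of rows from the number of labels and hide any axes left over.

```python
import io
import math

import matplotlib.pyplot as plt
import pandas as pd

txt = """Label,Word,Frequency
10.0,glass,600
10.0,tv,34
10.0,screen,31
10.0,laptop,15
10.0,info,15
-1.0,dog,1
-1.0,cat,1
-1.0,win,1
-2.0,man,1
-2.0,woman,1"""
df = pd.read_csv(io.StringIO(txt))

labels = df["Label"].unique()
ncols = 2                                # assumed layout choice
nrows = math.ceil(len(labels) / ncols)   # enough rows for every label

# squeeze=False keeps axes two-dimensional even for a single row or column.
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False)
ax_list = axes.flatten()

# sort=False keeps the labels in the order they appear in the data.
for ax, (label, group) in zip(ax_list, df.groupby("Label", sort=False)):
    group.plot.bar(x="Word", y="Frequency", ax=ax, legend=False,
                   title=f"Label {label}")

# Hide the unused axes when the grid has more cells than labels.
for ax in ax_list[len(labels):]:
    ax.set_visible(False)

fig.tight_layout()
```

With the three labels in the example this gives a 2 x 2 grid with one hidden cell.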

matplotlib: plotting more than one figure at once

I am working with 3 pandas dataframes having the same column structures(number and type), only that the datasets are for different years.
I would like to plot the ECDF for each of the dataframes, but every time I do this, I do it individually (I lack Python skills). Also, one of the figures (2018) is scaled differently on the x-axis, making it a bit difficult to compare. Here's how I do it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Cdf
df1 = pd.read_csv('2016.csv')
df2 = pd.read_csv('2017.csv')
df3 = pd.read_csv('2018.csv')
#some info about the dfs
df1.columns.values
array(['id', 'trip_id', 'distance', 'duration', 'speed', 'foot', 'bike',
'car', 'bus', 'metro', 'mode'], dtype=object)
modal_class = df1['mode']
print(modal_class[:5])
0 bus
1 metro
2 bike
3 foot
4 car
def decorate_ecdf(title, x, y):
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(title)
#plotting the ecdf for 2016 dataset
for name, group in df1.groupby('mode'):
    Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2016)','speed (m/s)', 'ECDF'
decorate_ecdf(title,x,y)
#plotting the ecdf for 2017 dataset
for name, group in df2.groupby('mode'):
    Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2017)','speed (m/s)', 'ECDF'
decorate_ecdf(title,x,y)
#plotting the ecdf for 2018 dataset
for name, group in df3.groupby('mode'):
    Cdf.from_seq(group.speed).plot()
title, x, y = 'Speed distribution by travel mode (April 2018)','speed (m/s)', 'ECDF'
decorate_ecdf(title,x,y)
Output:
I am pretty sure this isn't the Pythonic way of doing it, just a dirty way to get the work done. You can also see how the 2018 plot is scaled differently on the x-axis.
Is there a way to enforce that all figures are scaled the same way?
How do I re-write my code such that the figures are plotted by calling a function once?
When using pyplot, you can plot using an implicit method with plt.plot(), or you can use the explicit method, by creating and calling the figure and axis objects with fig, ax = plt.subplots(). What happened here is, in my view, a side-effect from using the implicit method.
For example, you can use two pd.DataFrame.plot() commands and have them share the same axis by supplying the returned axis to the other function.
foo = pd.DataFrame(dict(a=[1,2,3], b=[4,5,6]))
bar = pd.DataFrame(dict(c=[3,2,1], d=[6,5,4]))
ax = foo.plot()
bar.plot(ax=ax) # ax object is updated
ax.plot([0,3], [1,1], 'k--')
You can also create the figure and axis objects beforehand, and supply them as needed. Also, it's perfectly fine to have multiple plot commands. Often, my code is 25% work, 75% fiddling with plots. Don't try to be clever and lose out on readability.
fig, axes = plt.subplots(nrows=3, ncols=1, sharex=True)
# In this case, axes is a numpy array with 3 axis objects
# You can access the objects with indexing
# All will have the same x range
axes[0].plot([-1, 2], [1,1])
axes[1].plot([-2, 1], [1,1])
axes[2].plot([1,3],[1,1])
So you can combine both of these snippets into your own code. First, create the figure and axes objects, then plot each dataframe, supplying the correct axis to each call with the keyword ax.
Also, suppose you have three axis objects and they have different x limits. You can get them all, then set the three to have the same minimum value and the same maximum value. For example:
axis_list = [ax1, ax2, ax3] # suppose you created these separately and want to enforce the same axis limits
minimum_x = min([ax.get_xlim()[0] for ax in axis_list])
maximum_x = max([ax.get_xlim()[1] for ax in axis_list])
for ax in axis_list:
    ax.set_xlim(minimum_x, maximum_x)
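Putting these pieces together for the ECDF case might look like the sketch below. Two assumptions are worth flagging: the ECDF is computed directly with NumPy instead of empiricaldist, so the snippet has no extra dependency, and the three dataframes are synthetic stand-ins for the CSV files. plot_ecdf_by_mode is a helper name made up for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_ecdf_by_mode(df, ax, title):
    """Draw one ECDF of speed per travel mode on the given axes."""
    for name, group in df.groupby("mode"):
        speeds = np.sort(group["speed"].to_numpy())
        # Fraction of observations less than or equal to each sorted value.
        ecdf = np.arange(1, len(speeds) + 1) / len(speeds)
        ax.step(speeds, ecdf, where="post", label=name)
    ax.set_title(title)
    ax.set_ylabel("ECDF")
    ax.legend()


# Synthetic stand-ins for 2016.csv, 2017.csv and 2018.csv (made-up numbers);
# the 2018 speeds are deliberately spread over a wider range.
rng = np.random.default_rng(42)
modes = np.repeat(["bike", "bus", "car"], 20)
frames = {
    year: pd.DataFrame(
        {"mode": modes, "speed": rng.gamma(2.0, scale, size=modes.size)}
    )
    for year, scale in [(2016, 2.0), (2017, 2.5), (2018, 6.0)]
}

# sharex=True forces all three subplots onto the same x range.
fig, axes = plt.subplots(nrows=3, ncols=1, sharex=True, figsize=(6, 9))
for ax, (year, df) in zip(axes, frames.items()):
    plot_ecdf_by_mode(df, ax, f"Speed distribution by travel mode (April {year})")
axes[-1].set_xlabel("speed (m/s)")
fig.tight_layout()
```

The plotting logic now lives in one function that is called once per dataframe, and sharex=True takes care of the common x scale.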

Single legend at changing categories (!) in subplots from pandas df

Roughly speaking I need to create one legend for several subplots, which have a changing number of categories = legend entries.
But let me clarify this a bit deeper:
I have a figure with 20 subplots, one for each country within my spatial scope:
fig, ax = plt.subplots(nrows=4, ncols=5, sharex=True, sharey=False, figsize = (32,18))
Within a loop, I do some logic to group the data I need into a normal 2-dimensional pandas DataFrame stats and plot it to each of these 20 axes:
colors = stats.index.to_series().map(type_to_color()).tolist()
stats.T.plot.bar(ax=ax[i,j], stacked=True, legend=False, color=colors)
However, the stats DataFrame changes its size from loop to loop, since not every category applies to each of these countries (i.e. in one country there can be only two types, in another there are more than 10).
For this reason I pre-defined a specific color for each type.
So far, I am creating one legend for every subplot within the loop:
ax[i,j].legend(fontsize=9, loc='upper right')
This works, but it blows up the subplots unnecessarily. How can I plot one big legend above/below/beside these plots, since I have already defined the corresponding colors?
The given approach here with fig.legend(handles, labels, ...) does not work since the line handles are not available from the pandas plot.
Plotting the legend directly with
plt.legend(loc = 'lower center',bbox_to_anchor = (0,-0.3,1,1),
bbox_transform = plt.gcf().transFigure)
shows only the entries for the very last subplot, which is not sufficient.
Any help is greatly appreciated! Thank you so much!
Edit
For example the DataFrame stats could in one country look like this:
2015 2020 2025 2030 2035 2040
Hydro 29.229082 28.964424 28.528139 27.120194 25.932098 24.675778
Natural Gas 0.926800 0.926800 0.926800 0.926800 0.003600 NaN
Wind 25.799950 25.797550 0.776400 0.520800 0.234400 NaN
Whereas in another country it might look like this:
2015 2020 2025 2030 2035
Bioenergy 0.033690 0.033690 0.030000 NaN NaN
Hard Coal 5.307300 0.065100 0.021000 NaN NaN
Hydro 22.834454 23.930642 23.169014 21.639914 19.623791
Natural Gas 8.378116 8.674121 8.013598 6.755498 5.255450
Solar 5.100403 5.100403 5.100403 5.100403 5.093403
Wind 8.983560 8.974740 8.967240 8.378300 0.195800
Here's how it works to get the legend into an alphabetical order without messing the colors up:
import matplotlib.patches as mpatches
import collections
fig, ax = plt.subplots(nrows=4, ncols=5, sharex=True, sharey=False, figsize = (32,18))
labels_mpatches = collections.OrderedDict()
for a, b in enumerate(countries()):
    # do some data logic here
    colors = stats.index.to_series().map(type_to_color()).tolist()
    stats.T.plot.bar(ax=ax[i,j],stacked=True,legend=False,color=colors)
    # Pass the legend information into the OrderedDict
    stats_handle, stats_labels = ax[i,j].get_legend_handles_labels()
    for u, v in enumerate(stats_labels):
        if v not in labels_mpatches:
            labels_mpatches[v] = mpatches.Patch(color=colors[u], label=v)
# After the loop, do the legend layouting.
labels_mpatches = collections.OrderedDict(sorted(labels_mpatches.items()))
fig.legend(labels_mpatches.values(), labels_mpatches.keys())
# !!! Please Note: In previous versions this here worked, but does not anymore:
# fig.legend(handles=labels_mpatches.values(),labels=labels_mpatches.keys())
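A self-contained sketch of the same idea follows. The two small stats frames, the country names and the type_to_color dictionary are invented for illustration (the original type_to_color() and countries() helpers are not shown in the post), and the handles are passed to fig.legend as a plain list.

```python
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical fixed colour per type, so each type looks the same everywhere.
type_to_color = {
    "Hydro": "tab:blue",
    "Natural Gas": "tab:orange",
    "Solar": "gold",
    "Wind": "tab:green",
}

# Made-up per-country frames with different sets of types (rows).
country_stats = {
    "Country A": pd.DataFrame(
        {2015: [29.2, 0.9, 25.8], 2020: [29.0, 0.9, 25.8]},
        index=["Hydro", "Natural Gas", "Wind"],
    ),
    "Country B": pd.DataFrame(
        {2015: [22.8, 5.1], 2020: [23.9, 5.1]},
        index=["Hydro", "Solar"],
    ),
}

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
seen = {}  # type name -> legend patch, filled while plotting

for ax, (country, stats) in zip(axes, country_stats.items()):
    colors = [type_to_color[t] for t in stats.index]
    stats.T.plot.bar(ax=ax, stacked=True, legend=False, color=colors,
                     title=country)
    # Remember one patch per type the first time that type is plotted.
    for t, c in zip(stats.index, colors):
        seen.setdefault(t, mpatches.Patch(color=c, label=t))

# One figure-level legend, sorted alphabetically by type name.
handles = [seen[t] for t in sorted(seen)]
legend = fig.legend(handles=handles, loc="lower center", ncol=len(handles))
```

Because the patches are created from the predefined colours rather than taken from the plotted artists, the legend stays correct even when a type is missing from some subplots.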

Frequency plot in Python/Pandas DataFrame

I have parsed a very large dataframe with several columns and values like this:
Name Age Points ...
XYZ 42 32pts ...
ABC 41 32pts ...
DEF 32 35pts
GHI 52 35pts
JHK 72 35pts
MNU 43 42pts
LKT 32 32pts
LKI 42 42pts
JHI 42 35pts
JHP 42 42pts
XXX 42 42pts
XYY 42 35pts
I have imported numpy and matplotlib.
I need to plot a graph of the number of times each value in the column 'Points' occurs. I don't need any bins for the plotting. So it is more of a plot to see how many times the same score of points occurs over a large dataset.
So essentially the bar plot (or histogram, if you can call it that) should show that 32pts occurs thrice, 35pts occurs 5 times and 42pts occurs 4 times. If I can plot the values in sorted order, all the better. I have tried df.hist() but it is not working for me.
Any clues? Thanks.
I would plot the results of the column's value_counts method directly:
import matplotlib.pyplot as plt
import pandas
data = load_my_data()
fig, ax = plt.subplots()
data['Points'].value_counts().plot(ax=ax, kind='bar')
If you want to remove the string 'pts' from all of the elements in your column, you can do something like this:
df['points_int'] = df['Points'].str.replace('pts', '').astype(int)
That assumes they all end with 'pts'. If it varies from line to line, you need to look into regular expressions like this:
Split columns using pandas
And the official docs: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
The seaborn package has a countplot function that can be used to make a frequency plot:
import seaborn as sns
ax = sns.countplot(x="Points",data=df)
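To get the bars in sorted order, one option is to convert the points to integers first and sort the counts by value. The sketch below types in the 'Points' column from the question by hand (an assumption, standing in for the real dataframe) and uses only pandas and matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The 'Points' column from the question's sample, entered by hand.
df = pd.DataFrame(
    {
        "Points": [
            "32pts", "32pts", "35pts", "35pts", "35pts", "42pts",
            "32pts", "42pts", "35pts", "42pts", "42pts", "35pts",
        ]
    }
)

# Strip the literal suffix and convert to integers so sorting is numeric.
df["points_int"] = df["Points"].str.replace("pts", "", regex=False).astype(int)

# value_counts() orders by frequency; sort_index() reorders by point value.
counts = df["points_int"].value_counts().sort_index()

fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Points")
ax.set_ylabel("Number of occurrences")
```

For the sample this yields three bars, for 32, 35 and 42 points, with heights 3, 5 and 4.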
