Plots, by Label, frequency of words

I need to create separate plots based on a label. My dataset is:
Label Word Frequency
439 10.0 glass 600
471 10.0 tv 34
463 10.0 screen 31
437 10.0 laptop 15
454 10.0 info 15
65 -1.0 dog 1
68 -1.0 cat 1
69 -1.0 win 1
70 -2.0 man 1
71 -2.0 woman 1
In this case I would expect three plots, one for 10, one for -1 and one for -2, with on the x axis Word column and on the y-axis the Frequency (it is already sorted in descending order by Label).
I have tried as follows:
df['Word'].hist(by=df['Label'])
But it seems to be wrong as the output is far away from the expected one.
Any help would be great

You don't want a histogram here. A histogram is for a dataframe whose columns contain raw data: the hist function buckets the raw values, counts how many fall into each bucket, and then plots those counts.
Your dataframe is already bucketed, with a column in which the frequencies have already been calculated, so what you need is the df.plot.bar() method. Unfortunately, this is quite new and does not yet allow a by parameter, so you have to deal with the subplots manually.
Full walkthrough code for the cut-down example you have provided follows. You can make it more generic by not hardcoding the number of subplots required in the line marked [1].
# Set up:
import matplotlib.pyplot as plt
import pandas as pd
import io
txt = """Label,Word,Frequency
10.0,glass,600
10.0,tv,34
10.0,screen,31
10.0,laptop,15
10.0,info,15
-1.0,dog,1
-1.0,cat,1
-1.0,win,1
-2.0,man,1
-2.0,woman,1"""
dfA = pd.read_csv(io.StringIO(txt))
labels = dfA["Label"].unique()
# Set up subplots on which to plot.
# Make more generic by not hardcoding nrows and ncols in [1],
# but calculating them depending on how many labels you have.
fig, axes = plt.subplots(nrows=2, ncols=2) # [1]
ax_list = axes.flatten() # axes is a 2D NumPy array of Axes objects;
# ax_list is a flat 1D array, which is easier to index.
# Loop through labels and plot the bar chart to the corresponding axis object.
for i in range(len(labels)):
    dfA[dfA["Label"]==labels[i]].plot.bar(x="Word", y="Frequency", ax=ax_list[i])
plt.show()
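One way to avoid hardcoding the grid in the line marked [1] is to derive it from the number of labels. This is only a minimal sketch: the helper name grid_shape and the default of two columns are assumptions made for illustration, not part of pandas or matplotlib.

```python
import math


def grid_shape(n_plots, ncols=2):
    """Return (nrows, ncols) for a grid large enough to hold n_plots subplots."""
    if n_plots < 1:
        raise ValueError("need at least one plot")
    ncols = min(ncols, n_plots)         # never more columns than plots
    nrows = math.ceil(n_plots / ncols)  # enough rows to fit the remainder
    return nrows, ncols


# Three labels, as in the question.
print(grid_shape(3))
```

With that, the line marked [1] could become nrows, ncols = grid_shape(len(labels)) followed by plt.subplots(nrows=nrows, ncols=ncols, squeeze=False). Passing squeeze=False keeps axes two-dimensional even for a single row, so axes.flatten() still works. Any leftover axes can be hidden with set_visible(False).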

Related

Creating Mixed Charts from CSV Files in Python

I've developed a Perl script that manipulates data and gives me a final csv file. Unfortunately, the packages for graphs and charts in Perl are not supported on my system and I'm not able to install them due to work restrictions. So I want to take the csv file and put together something in Python to generate a mixed graph. I want the first column to be the labels on the x-axis, the next three columns to be bar graphs, and the last column (Target) to be a line across the x-axis.
Here is sample data:
Name PreviousWeekProg CurrentWeekProg ExpectedProg Target
Dan 94 92 95 94
Jarrod 34 56 60 94
Chris 45 43 50 94
Sam 89 90 90 94
Aaron 12 10 40 94
Jenna 56 79 80 94
Eric 90 45 90 94
I am looking for a graph like this:
I did some researching but being as clueless as I am in python, I wanted to ask for some guidance on good modules to use for mixed charts and graphs in python. Sorry, if my post is vague. Besides looking at other references online, I'm pretty clueless about how to go about this. Also, my version of python is 3.8 and I DO have matplotlib installed (which is what i was previously recommended to use).

Since the answer by @ShaunLowis doesn't include a complete example I thought I'd add one. As far as reading the .csv file goes, the best way to do it in this case is probably to use pandas.read_csv() as the other answer points out. In this example I have named the file test.csv and placed it in the same directory from which I run the script.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("./test.csv")
names = df['Name'].values
x = np.arange(len(names))
w = 0.3
plt.bar(x-w, df['PreviousWeekProg'].values, width=w, label='PreviousWeekProg')
plt.bar(x, df['CurrentWeekProg'].values, width=w, label='CurrentWeekProg')
plt.bar(x+w, df['ExpectedProg'].values, width=w, label='ExpectedProg')
plt.plot(x, df['Target'].values, lw=2, label='Target')
plt.xticks(x, names)
plt.ylim([0,100])
plt.tight_layout()
plt.xlabel('X label')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, ncol=5)
plt.savefig("CSVBarplots.png", bbox_inches="tight")
plt.show()
Explanation
From the pandas docs for read_csv() (arguments extraneous to the example excluded),
pandas.read_csv(filepath_or_buffer)
Read a comma-separated values (csv) file into DataFrame.
filepath_or_buffer: str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. [...] If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.
In this example I am specifying the path to the file, not a file object.
names = df['Name'].values
This extracts the values in the 'Name' column and converts them to a numpy.ndarray object. In order to plot multiple bars with one name I reference this answer. However, to use this method we need an x array of numeric positions with the same length as the names array, hence
x = np.arange(len(names))
then set a width for the bars and offset the first and third bars accordingly, as outlined in the referenced answer
w = 0.3
plt.bar(x-w, df['PreviousWeekProg'].values, width=w, label='PreviousWeekProg')
plt.bar(x, df['CurrentWeekProg'].values, width=w, label='CurrentWeekProg')
plt.bar(x+w, df['ExpectedProg'].values, width=w, label='ExpectedProg')
from the matplotlib.pyplot.bar page (unused non-positional arguments excluded),
matplotlib.pyplot.bar(x, height, width=0.8)
The bars are positioned at x [...] their dimensions are given by width and height.
Each of x, height, and width may either be a scalar applying to all bars, or it may be a sequence of length N providing a separate value for each bar.
In this case, x and height are sequences of values (different for each bar) and width is a scalar (the same for each bar).
Next is the line for target which is pretty straightforward, simply plotting the x values created earlier against the values from the 'Target' column
plt.plot(x, df['Target'].values, lw=2, label='Target')
where lw specifies the linewidth. Disclaimer: if the target value isn't the same for each row of the .csv this will still work but may not look exactly how you want it to as is.
The next two lines,
plt.xticks(x, names)
plt.ylim([0,100])
just add the names below the bars at the appropriate x positions and then set the y limits to span the interval [0, 100].
The final touch here is to add the legend below the plot,
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, ncol=5)
see this answer for more on how to adjust this as desired.

I would recommend reading in your .csv file using the 'read_csv()' utility of the Pandas library like so:
import pandas as pd
df = pd.read_csv(filepath)
This stores the information in a Dataframe object. You can then access your columns by:
my_column = df['PreviousWeekProg']
After which you can call:
my_column.plot(kind='bar')
On whichever column you wish to plot.
Configuring subplots is a different beast, for which I would recommend using matplotlib's pyplot.
I would recommend starting with these figure and axes object declarations, then going from there:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=2) # one figure with four separate axes
ax1, ax2, ax3, ax4 = axes.flatten()
Where you can read more about adding in axes data here.
Let me know if this helps!

You can use the parameter hue in the package seaborn. First, you need to reshape your data set with the function melt:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df1 = df.melt(id_vars=['Name', 'Target'])
print(df1.head(10))
Output:
Name Target variable value
0 Dan 94 PreviousWeekProg 94
1 Jarrod 94 PreviousWeekProg 34
2 Chris 94 PreviousWeekProg 45
3 Sam 94 PreviousWeekProg 89
4 Aaron 94 PreviousWeekProg 12
5 Jenna 94 PreviousWeekProg 56
6 Eric 94 PreviousWeekProg 90
7 Dan 94 CurrentWeekProg 92
8 Jarrod 94 CurrentWeekProg 56
9 Chris 94 CurrentWeekProg 43
Now you can use the column 'variable' as your hue parameter in the function barplot:
sns.set_style("whitegrid") # use the whitegrid style; set it before creating the figure so it takes effect
fig, ax = plt.subplots(figsize=(10, 5)) # set the size of a figure
sns.barplot(x='Name', y='value', hue='variable', data=df1, ax=ax) # plot
xmin, xmax = ax.get_xlim() # get x-axis limits
ax.hlines(y=df1['Target'], xmin=xmin, xmax=xmax, color='red') # add one line per row (they overlap if Target is constant)
# or ax.axhline(y=df1['Target'].max()) to add a single line
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.06), ncol=4, frameon=False) # move legend to the bottom
plt.title('Student Progress', loc='center') # add title
plt.yticks(np.arange(df1['value'].min(), df1['value'].max()+1, 10.0)) # change tick frequency
plt.xlabel('') # clear xlabel
plt.ylabel('') # clear ylabel
plt.show() # show plot
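The reshaping step can be checked on its own. The sketch below rebuilds the sample data from the question inline (an assumption made so the snippet runs without test.csv) and shows what melt produces before anything is plotted.

```python
import pandas as pd

# Sample data from the question, built inline instead of read from test.csv.
df = pd.DataFrame({
    "Name": ["Dan", "Jarrod", "Chris", "Sam", "Aaron", "Jenna", "Eric"],
    "PreviousWeekProg": [94, 34, 45, 89, 12, 56, 90],
    "CurrentWeekProg": [92, 56, 43, 90, 10, 79, 45],
    "ExpectedProg": [95, 60, 50, 90, 40, 80, 90],
    "Target": [94] * 7,
})

# Name and Target stay as identifier columns; the three progress columns
# are stacked into 'variable' (the old column name) and 'value' (its entry).
df1 = df.melt(id_vars=["Name", "Target"])

print(df1.shape)
print(df1["variable"].unique())
```

Each person now appears once per progress column, which is the long layout that the hue parameter of barplot expects.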

Plot multiple lines from one data frame

I have all the data I want to plot in one pandas data frame, e.g.:
date flower_color flower_count
0 2017-08-01 blue 1
1 2017-08-01 red 2
2 2017-08-02 blue 5
3 2017-08-02 red 2
I need a few different lines on one plot: x-value should be the date from the first column and y-value should be flower_count, and the y-value should depend on the flower_color given in the second column.
How can I do that without filtering the original df and saving it as a new object first? My only idea was to create a data frame for only red flowers and then specifying it like:
figure.line(x="date", y="flower_count", source=red_flower_ds)
figure.line(x="date", y="flower_count", source=blue_flower_ds)

You can try this
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for name, group in df.groupby('flower_color'):
    group.plot(x='date', y='flower_count', ax=ax, label=name)
plt.show()
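An alternative that also avoids creating one dataframe per color is to pivot the data so that each color becomes its own column. This is a sketch under the assumption that every (date, flower_color) pair occurs only once; the sample frame is rebuilt inline from the question, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-01", "2017-08-02", "2017-08-02"]),
    "flower_color": ["blue", "red", "blue", "red"],
    "flower_count": [1, 2, 5, 2],
})

# One column per color, indexed by date.
wide = df.pivot(index="date", columns="flower_color", values="flower_count")

# DataFrame.plot draws one line per column on a single axes.
ax = wide.plot()
ax.set_ylabel("flower_count")
```

If a date and color combination can repeat, pivot raises an error; in that case pivot_table with an aggregation function such as sum is the usual substitute.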

If my understanding is right, you need a plot with two subplots, where the x-axis of both subplots is the date and the y-axis is the flower count for each color?
In this case, you can use subplots with pandas plotting (z is the dataframe). Note that x and y are passed as plain column names, not lists:
fig, axes = plt.subplots(2)
z[z.flower_color == 'blue'].plot(x='date', y='flower_count', ax=axes[0]).set_ylabel('blue')
z[z.flower_color == 'red'].plot(x='date', y='flower_count', ax=axes[1]).set_ylabel('red')
plt.show()
The output will be like:
Hope it helps.

Single legend at changing categories (!) in subplots from pandas df

Roughly speaking I need to create one legend for several subplots, which have a changing number of categories = legend entries.
But let me clarify this a bit deeper:
I have a figure with 20 subplots, one for each country within my spatial scope:
fig, ax = plt.subplots(nrows=4, ncols=5, sharex=True, sharey=False, figsize = (32,18))
Within a loop, I do some logic to group the data I need into a normal 2-dimensional pandas DataFrame stats and plot it to each of these 20 axes:
colors = stats.index.to_series().map(type_to_color()).tolist()
stats.T.plot.bar(ax=ax[i,j], stacked=True, legend=False, color=colors)
However, the stats DataFrame changes its size from loop to loop, since not every category applies to each of these countries (i.e. in one country there can be only two types, while in another there are more than 10).
For this reason I pre-defined a specific color for each type.
So far, I am creating one legend for every subplot within the loop:
ax[i,j].legend(fontsize=9, loc='upper right')
This works, however it blows up the subplots unnecessarily. How can I plot one big legend above/below/beside these plots, since I have already defined the according color.
The given approach here with fig.legend(handles, labels, ...) does not work since the line handles are not available from the pandas plot.
Plotting the legend directly with
plt.legend(loc = 'lower center',bbox_to_anchor = (0,-0.3,1,1),
bbox_transform = plt.gcf().transFigure)
shows only the entries for the very last subplot, which is not sufficient.
Any help is greatly appreciated! Thank you so much!
Edit
For example the DataFrame stats could in one country look like this:
2015 2020 2025 2030 2035 2040
Hydro 29.229082 28.964424 28.528139 27.120194 25.932098 24.675778
Natural Gas 0.926800 0.926800 0.926800 0.926800 0.003600 NaN
Wind 25.799950 25.797550 0.776400 0.520800 0.234400 NaN
Whereas in another country it might look like this:
2015 2020 2025 2030 2035
Bioenergy 0.033690 0.033690 0.030000 NaN NaN
Hard Coal 5.307300 0.065100 0.021000 NaN NaN
Hydro 22.834454 23.930642 23.169014 21.639914 19.623791
Natural Gas 8.378116 8.674121 8.013598 6.755498 5.255450
Solar 5.100403 5.100403 5.100403 5.100403 5.093403
Wind 8.983560 8.974740 8.967240 8.378300 0.195800

Here's how it works to get the legend into an alphabetical order without messing the colors up:
import collections
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=4, ncols=5, sharex=True, sharey=False, figsize = (32,18))
labels_mpatches = collections.OrderedDict()
for a, b in enumerate(countries()):
    # do some data logic here (this is where stats, i and j get their values)
    colors = stats.index.to_series().map(type_to_color()).tolist()
    stats.T.plot.bar(ax=ax[i,j],stacked=True,legend=False,color=colors)
    # Pass the legend information into the OrderedDict
    stats_handle, stats_labels = ax[i,j].get_legend_handles_labels()
    for u, v in enumerate(stats_labels):
        if v not in labels_mpatches:
            labels_mpatches[v] = mpatches.Patch(color=colors[u], label=v)
# After the loop, lay out the legend.
labels_mpatches = collections.OrderedDict(sorted(labels_mpatches.items()))
fig.legend(labels_mpatches.values(), labels_mpatches.keys())
# !!! Please Note: In previous versions this here worked, but does not anymore:
# fig.legend(handles=labels_mpatches.values(),labels=labels_mpatches.keys())
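Because the colors are fixed per type, the shared legend can also be built from the color mapping alone, without reading handles back from each subplot. The following is only a sketch: the dictionary type_to_color and the list types_per_country are made-up stand-ins for the question's type_to_color() function and per-country stats.index values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

# Hypothetical color mapping standing in for type_to_color() from the question.
type_to_color = {
    "Hydro": "tab:blue",
    "Natural Gas": "tab:gray",
    "Solar": "gold",
    "Wind": "tab:green",
}

# Hypothetical record of which types occurred in each subplot.
types_per_country = [
    ["Hydro", "Natural Gas", "Wind"],
    ["Hydro", "Solar", "Wind"],
]

# Union of all types that were actually plotted, in alphabetical order.
used = sorted({t for types in types_per_country for t in types})

# One proxy patch per type, colored from the shared mapping.
handles = [mpatches.Patch(color=type_to_color[t], label=t) for t in used]

fig, axes = plt.subplots(nrows=1, ncols=2)
fig.legend(handles=handles, loc="lower center", ncol=len(handles))
```

In the real loop, the list of used types could be collected by extending a set with stats.index on every iteration.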

plot dataframe kind= line with axis labels

I already asked about labeling the axes and got an answer that satisfied me for the data I had at that time. But now I'm trying to plot the dataframe with kind=line to get a better view of the evaluation of my values. I'm using the same pandas methods, but they don't work for kind=line in the same way as for kind=bar, and the axis labels don't show up. So my dataframe:
name Homework_note Class_note Behavior_note
Alice Ji 7 6 6
Eleonora LI 2 5 4
Mike The 6 5 3
Helen Wo 5 3 5
the script I use:
df=pd.read_csv(os.path.join(path,'class_notes.csv'), sep='\t|,', engine='python')
df.columns=['name', 'Homework_note', 'Class_note', 'Behavior_note']
ax=df.plot(kind='line', x='name', color=['red','blue', 'green'], figsize=(400,100))
ax.set_xlabel("Names", fontsize=56)
ax.set_ylabel("Notes", fontsize=56)
ax.set_title("Notes evaluation", fontsize=79)
plt.legend(loc=2,prop={'size':60})
plt.savefig(os.path.join(path,'notes_names.png'), bbox_inches='tight', dpi=100)
What else can I add to put labels on the axes (both x and y)? I prefer to stay with these pandas methods because I find them more comfortable for working with dataframes, but I haven't found a way to add the labels with this type of line plot.

The labels will show up with a smaller figsize (it is given in inches!). For the legend, take a look here
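To illustrate the figsize point: the script in the question asks for a figure of 400 by 100 inches. A sketch with a more conventional size follows; the figure size and font sizes are illustrative guesses rather than recommended values, and the data is rebuilt inline from the question instead of being read from class_notes.csv.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Ji", "Eleonora LI", "Mike The", "Helen Wo"],
    "Homework_note": [7, 2, 6, 5],
    "Class_note": [6, 5, 5, 3],
    "Behavior_note": [6, 4, 3, 5],
})

# figsize is (width, height) in inches, so (12, 4) is already a wide figure.
ax = df.plot(kind="line", x="name", color=["red", "blue", "green"], figsize=(12, 4))
ax.set_xlabel("Names", fontsize=14)
ax.set_ylabel("Notes", fontsize=14)
ax.set_title("Notes evaluation", fontsize=16)
ax.legend(loc=2, prop={"size": 10})
```

The font sizes are scaled down together with the figure, since 56 pt labels were chosen for the oversized canvas.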

Pandas plot dataframe as scatter complains of unknown item

I have thousands of data points for two values, Tm1 and Tm2, for a series of text labels of this type:
Tm1 Tm2
ID
A01 51 NaN
A03 51 NaN
A05 47 52
A07 47 52
A09 49 NaN
I managed to create a pandas DataFrame with the values from csv. I now want to plot the Tm1 and Tm2 as y values against the text ID's as x values in a scatter plot, with different color dots in pandas/matplotlib.
With a test case like this I can get a line plot
from pandas import *
df2= DataFrame([52,54,56],index=["A01","A02","A03"],columns=["Tm1"])
df2["Tm2"] = [None,42,None]
Tm1 Tm2
A01 52 NaN
A02 54 42
A03 56 NaN
I want to not connect the individual values with lines and just have the Tm1 and Tm2 values as scatter dots in different colors.
When I try to plot using
df2.reset_index().plot(kind="scatter",x='index',y=["Tm1"])
I get an error:
KeyError: u'no item named index'
I know this is a very basic plotting command, but I'm sorry, I have no idea how to achieve this in pandas/matplotlib. The scatter command does need an x and a y value, but I am somehow missing a key pandas concept for understanding how to do this.

I think the problem here is that you are trying to plot a scatter graph against a non-numeric series. That will fail, although the error message you are given is so misleading that it could be considered a bug.
You could, however, explicitly set the xticks to use one per category and use the second argument of xticks to set the xtick labels. Like this:
import matplotlib.pyplot as plt
df1 = df2.reset_index() #df1 will have a numeric index, and a
#column named 'index' containing the index labels from df2
plt.scatter(df1.index,df1['Tm1'],c='b',label='Tm1')
plt.scatter(df1.index,df1['Tm2'],c='r',label='Tm2')
plt.legend(loc=4) # Optional - show labelled legend, loc=4 puts it at bottom right
plt.xticks(df1.index,df1['index']) # explicitly set one tick per category and label them
# according to the labels in column df1['index']
plt.show()
I've just tested it with 1.4.3 and it worked OK
For the example data you gave, this yields:
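More recent matplotlib releases also accept strings as x values directly and treat them as categories, so the manual xticks step may no longer be needed. A minimal sketch, assuming such a release is installed; the test frame from the question is rebuilt inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df2 = pd.DataFrame(
    {"Tm1": [52, 54, 56], "Tm2": [None, 42, None]},
    index=["A01", "A02", "A03"],
).astype(float)  # missing Tm2 values become NaN

ids = df2.index.tolist()  # string IDs used as categorical x positions

fig, ax = plt.subplots()
ax.scatter(ids, df2["Tm1"], c="b", label="Tm1")
ax.scatter(ids, df2["Tm2"], c="r", label="Tm2")  # NaN points are simply not drawn
ax.legend(loc=4)
```

With thousands of IDs the tick labels will overlap, so thinning them out or rotating them is likely to be necessary in either approach.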
