I have parsed a very large dataframe with several columns and values like these:
Name Age Points ...
XYZ 42 32pts ...
ABC 41 32pts ...
DEF 32 35pts
GHI 52 35pts
JHK 72 35pts
MNU 43 42pts
LKT 32 32pts
LKI 42 42pts
JHI 42 35pts
JHP 42 42pts
XXX 42 42pts
XYY 42 35pts
I have imported numpy and matplotlib.
I need to plot a graph of the number of times each value in the column 'Points' occurs. I don't need any bins for the plotting, so it is more of a plot to see how many times the same score occurs over a large dataset.
So essentially the bar plot (or histogram, if you can call it that) should show that 32pts occurs 3 times, 35pts occurs 5 times and 42pts occurs 4 times. If I can plot the values in sorted order, all the better. I have tried df.hist() but it is not working for me.
Any clues? Thanks.
I would plot the results of the column's value_counts method directly:
import matplotlib.pyplot as plt
import pandas
data = load_my_data()
fig, ax = plt.subplots()
data['Points'].value_counts().plot(ax=ax, kind='bar')
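To get the bars in order of score, as the question asks, the counts can be sorted by their index before plotting. This is a minimal sketch, assuming the sample rows from the question stand in for the real data (the inline DataFrame is made up for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the real data: the 'Points' column from the question's sample.
df = pd.DataFrame({"Points": ["32pts", "32pts", "35pts", "35pts", "35pts", "42pts",
                              "32pts", "42pts", "35pts", "42pts", "42pts", "35pts"]})

# value_counts() orders by frequency; sort_index() reorders by the label instead.
# Labels are strings, so the order is lexicographic (fine for two-digit scores).
counts = df["Points"].value_counts().sort_index()

fig, ax = plt.subplots()
counts.plot(ax=ax, kind="bar")
ax.set_xlabel("Points")
ax.set_ylabel("Occurrences")
```

Call plt.show() afterwards to display the figure interactively.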
If you want to remove the string 'pts' from all of the elements in your column, you can do something like this:
df['points_int'] = df['Points'].str.replace('pts', '').astype(int)
That assumes they all end with 'pts'. If it varies from line to line, you need to look into regular expressions like this:
Split columns using pandas
And the official docs: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
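If the suffix does vary, one option is to pull out the leading digits with a regular expression rather than removing a fixed string. A small sketch, assuming every entry contains at least one run of digits (the sample values here are invented to show differing suffixes):

```python
import pandas as pd

# Hypothetical values with inconsistent suffixes.
df = pd.DataFrame({"Points": ["32pts", "35 points", "42"]})

# str.extract returns the first capture group; expand=False gives a Series.
# astype(int) would fail on rows with no digits at all, so this assumes none exist.
df["points_int"] = df["Points"].str.extract(r"(\d+)", expand=False).astype(int)
```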
Seaborn package has countplot function which can be made use of to make frequency plot:
import seaborn as sns
ax = sns.countplot(x="Points",data=df)
I am trying to use the .plot() function in pandas to plot data into a line graph.
The data is sorted by date, with 48 columns after each date. Sample below:
1 2 ... 46 47 48
18 2018-02-19 1.317956 1.192840 ... 1.959250 1.782985 1.418093
19 2018-02-20 1.356267 1.192248 ... 2.123432 1.760629 1.569340
20 2018-02-21 1.417181 1.288694 ... 2.086715 1.823581 1.612062
21 2018-02-22 1.431536 1.279514 ... 2.201972 1.878109 1.694159
etc until row 346.
I tried the below but .plot does not seem to take positional arguments:
df.plot(x=df.iloc[0:346,0],y=[0:346,1:49]
How would I go about plotting my rows by date (the 1st column) on a line graph, and can I expand this to include multiple axes?
There are multiple ways to do this, some of which are directly through the pandas dataframe. However, given your sample plotting line, I think the easiest might be to just use matplotlib directly:
import matplotlib.pyplot as plt
plt.plot(df.iloc[0:346,0],df.iloc[0:346,1:49])
For multiple axes you can add a few lines to make subplots. For example:
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = plt.subplot(1,2,1)
ax1.plot(df.iloc[0:346,0],df.iloc[0:346,1:10]) # plt.plot() has no ax argument, so call plot on the axes itself
ax2 = plt.subplot(1,2,2)
ax2.plot(df.iloc[0:346,0],df.iloc[0:346,20:30])
You can also do this using the pandas plot() function that you were trying to use: it takes an ax argument, which lets you specify the axes on which to plot. If you want to stick to pandas, I think you'd be best off setting the index to be a datetime index (see this link as an example: https://stackoverflow.com/a/27051371/12133280) and then using df.plot(y='column_name', ax=ax1). The x axis will be the index, which you would have set to be the date.
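The pandas route might look roughly like the sketch below. It assumes the date sits in a column called "date" and that the value columns are named "1", "2" and "48"; those names and the tiny inline frame are placeholders built from the question's sample, not the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data modelled on the question's sample rows.
raw = pd.DataFrame({
    "date": ["2018-02-19", "2018-02-20", "2018-02-21", "2018-02-22"],
    "1": [1.317956, 1.356267, 1.417181, 1.431536],
    "2": [1.192840, 1.192248, 1.288694, 1.279514],
    "48": [1.418093, 1.569340, 1.612062, 1.694159],
})

# Use the parsed dates as the index so pandas puts them on the x axis.
df = raw.set_index(pd.to_datetime(raw["date"])).drop(columns="date")

# One figure, two side-by-side axes; each plot call targets one of them via ax=.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df[["1", "2"]].plot(ax=ax1)
df[["48"]].plot(ax=ax2)
```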
I need to create separate plots based on a label. My dataset is
Label Word Frequency
439 10.0 glass 600
471 10.0 tv 34
463 10.0 screen 31
437 10.0 laptop 15
454 10.0 info 15
65 -1.0 dog 1
68 -1.0 cat 1
69 -1.0 win 1
70 -2.0 man 1
71 -2.0 woman 1
In this case I would expect three plots, one for 10, one for -1 and one for -2, with the Word column on the x-axis and the Frequency on the y-axis (the data is already sorted in descending order by Label).
I have tried as follows:
df['Word'].hist(by=df['Label'])
But it seems to be wrong as the output is far away from the expected one.
Any help would be great
You don't want to be using a histogram here: a histogram plot is where the columns of your dataframe contain raw data, and the hist function buckets the raw values and finds the frequencies of each bucket, and then plots.
Your dataframe is already bucketed, with a column in which the frequencies have already been calculated; what you need is the df.plot.bar() method. Unfortunately, this is quite new, and does not yet allow a by parameter, so you have to deal with the subplots manually.
Full walkthrough code for the cut-down example you have provided follows. Obviously you can make it more generic by not hardcoding the number of subplots required in the line marked [1].
# Set up:
import matplotlib.pyplot as plt
import pandas as pd
import io
txt = """Label,Word,Frequency
10.0,glass,600
10.0,tv,34
10.0,screen,31
10.0,laptop,15
10.0,info,15
-1.0,dog,1
-1.0,cat,1
-1.0,win,1
-2.0,man,1
-2.0,woman,1"""
dfA = pd.read_csv((io.StringIO(txt)))
labels = dfA["Label"].unique()
# Set up subplots on which to plot.
# Make more generic by not hardcoding nrows and ncols in [1],
# but calculating them depending on how many labels you have.
fig, axes = plt.subplots(nrows=2, ncols=2) # [1]
ax_list = axes.flatten() # axes is a 2D numpy array of Axes objects;
# ax_list is a flat 1D array which is easier to index.
# Loop through labels and plot the bar chart to the corresponding axis object.
for i in range(len(labels)):
    dfA[dfA["Label"]==labels[i]].plot.bar(x="Word", y="Frequency", ax=ax_list[i])
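A more generic version of the line marked [1] could compute the grid from the number of labels. The sketch below is one way to do that, assuming a fixed two-column layout (the ncols value and the per-plot titles are arbitrary choices, not part of the walkthrough above):

```python
import io
import math

import matplotlib.pyplot as plt
import pandas as pd

txt = """Label,Word,Frequency
10.0,glass,600
10.0,tv,34
10.0,screen,31
10.0,laptop,15
10.0,info,15
-1.0,dog,1
-1.0,cat,1
-1.0,win,1
-2.0,man,1
-2.0,woman,1"""
dfA = pd.read_csv(io.StringIO(txt))
labels = dfA["Label"].unique()

# Derive the grid size from the number of labels instead of hardcoding it.
ncols = 2
nrows = math.ceil(len(labels) / ncols)
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False)
ax_list = axes.flatten()

# One bar chart per label, each on its own axes.
for ax, label in zip(ax_list, labels):
    subset = dfA[dfA["Label"] == label]
    subset.plot.bar(x="Word", y="Frequency", ax=ax, title=f"Label {label}")

# Hide any axes left over when the grid has more cells than labels.
for ax in ax_list[len(labels):]:
    ax.set_visible(False)

fig.tight_layout()
```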
I've developed a Perl script that manipulates data and gives me a final csv file. Unfortunately, the packages for graphs and charts in Perl are not supported on my system and I'm not able to install them due to work restrictions. So I want to take the csv file and put together something in Python to generate a mixed graph. I want the first column to be the labels on the x-axis, the next three columns to be bar graphs, and the fourth column to be a line across the x-axis.
Here is sample data:
Name PreviousWeekProg CurrentWeekProg ExpectedProg Target
Dan 94 92 95 94
Jarrod 34 56 60 94
Chris 45 43 50 94
Sam 89 90 90 94
Aaron 12 10 40 94
Jenna 56 79 80 94
Eric 90 45 90 94
I am looking for a graph like this:
I did some research, but being as clueless as I am in Python, I wanted to ask for some guidance on good modules to use for mixed charts and graphs. Sorry if my post is vague. Besides looking at other references online, I'm pretty clueless about how to go about this. Also, my version of Python is 3.8 and I DO have matplotlib installed (which is what I was previously recommended to use).
Since the answer by @ShaunLowis doesn't include a complete example I thought I'd add one. As far as reading the .csv file goes, the best way to do it in this case is probably to use pandas.read_csv() as the other answer points out. In this example I have named the file test.csv and placed it in the same directory from which I run the script.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("./test.csv")
names = df['Name'].values
x = np.arange(len(names))
w = 0.3
plt.bar(x-w, df['PreviousWeekProg'].values, width=w, label='PreviousWeekProg')
plt.bar(x, df['CurrentWeekProg'].values, width=w, label='CurrentWeekProg')
plt.bar(x+w, df['ExpectedProg'].values, width=w, label='ExpectedProg')
plt.plot(x, df['Target'].values, lw=2, label='Target')
plt.xticks(x, names)
plt.ylim([0,100])
plt.tight_layout()
plt.xlabel('X label')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, ncol=5)
plt.savefig("CSVBarplots.png", bbox_inches="tight")
plt.show()
Explanation
From the pandas docs for read_csv() (arguments extraneous to the example excluded),
pandas.read_csv(filepath_or_buffer)
Read a comma-separated values (csv) file into DataFrame.
filepath_or_buffer: str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. [...] If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.
In this example I am specifying the path to the file, not a file object.
names = df['Name'].values
This extracts the values in the 'Name' column and converts them to a numpy.ndarray object. In order to plot multiple bars with one name I reference this answer. However, in order to use this method, we need a numeric x array of the same length as the names array, hence
x = np.arange(len(names))
then set a width for the bars and offset the first and third bars accordingly, as outlined in the referenced answer
w = 0.3
plt.bar(x-w, df['PreviousWeekProg'].values, width=w, label='PreviousWeekProg')
plt.bar(x, df['CurrentWeekProg'].values, width=w, label='CurrentWeekProg')
plt.bar(x+w, df['ExpectedProg'].values, width=w, label='ExpectedProg')
from the matplotlib.pyplot.bar page (unused non-positional arguments excluded),
matplotlib.pyplot.bar(x, height, width=0.8)
The bars are positioned at x [...] their dimensions are given by width and height.
Each of x, height, and width may either be a scalar applying to all bars, or it may be a sequence of length N providing a separate value for each bar.
In this case, x and height are sequences of values (different for each bar) and width is a scalar (the same for each bar).
Next is the line for target which is pretty straightforward, simply plotting the x values created earlier against the values from the 'Target' column
plt.plot(x, df['Target'].values, lw=2, label='Target')
where lw specifies the linewidth. Disclaimer: if the target value isn't the same for each row of the .csv this will still work but may not look exactly how you want it to as is.
The next two lines,
plt.xticks(x, names)
plt.ylim([0,100])
just add the names below the bars at the appropriate x positions and then set the y limits to span the interval [0, 100].
The final touch here is to add the legend below the plot,
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, ncol=5)
see this answer for more on how to adjust this as desired.
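The offset idea generalises to any number of bar series: centre the group on each integer x position and shift bar i by (i - (n - 1) / 2) widths. The following is a sketch only; it inlines the question's sample values instead of reading test.csv, and the 0.8 total group width is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

names = ["Dan", "Jarrod", "Chris", "Sam", "Aaron", "Jenna", "Eric"]
series = {
    "PreviousWeekProg": [94, 34, 45, 89, 12, 56, 90],
    "CurrentWeekProg": [92, 56, 43, 90, 10, 79, 45],
    "ExpectedProg": [95, 60, 50, 90, 40, 80, 90],
}
target = 94  # same for every row in the sample data

x = np.arange(len(names))
w = 0.8 / len(series)  # split a total group width of 0.8 between the series

fig, ax = plt.subplots()
offsets = []
for i, (label, heights) in enumerate(series.items()):
    # Centre the group on x: with three series the offsets are -w, 0, +w.
    offset = (i - (len(series) - 1) / 2) * w
    offsets.append(offset)
    ax.bar(x + offset, heights, width=w, label=label)

ax.axhline(target, color="black", lw=2, label="Target")
ax.set_xticks(x)
ax.set_xticklabels(names)
ax.set_ylim(0, 100)
ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.1), ncol=4)
```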
I would recommend reading in your .csv file using the 'read_csv()' utility of the Pandas library like so:
import pandas as pd
df = pd.read_csv(filepath)
This stores the information in a Dataframe object. You can then access your columns by:
my_column = df['PreviousWeekProg']
After which you can call:
my_column.plot(kind='bar')
On whichever column you wish to plot.
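To get several columns as grouped bars in one call, DataFrame.plot also accepts a list for y. A hedged sketch, using an inline copy of the question's sample instead of a file, with the Target line drawn over the integer bar positions that pandas uses for a bar chart:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Inline copy of the sample data from the question.
df = pd.DataFrame({
    "Name": ["Dan", "Jarrod", "Chris", "Sam", "Aaron", "Jenna", "Eric"],
    "PreviousWeekProg": [94, 34, 45, 89, 12, 56, 90],
    "CurrentWeekProg": [92, 56, 43, 90, 10, 79, 45],
    "ExpectedProg": [95, 60, 50, 90, 40, 80, 90],
    "Target": [94] * 7,
})
bar_columns = ["PreviousWeekProg", "CurrentWeekProg", "ExpectedProg"]

fig, ax = plt.subplots()
# Grouped bars: one group per name, one bar per listed column.
df.plot(x="Name", y=bar_columns, kind="bar", ax=ax)
# Bar groups sit at positions 0..n-1, so plot the target against those positions.
ax.plot(range(len(df)), df["Target"], color="black", lw=2, label="Target")
ax.legend()
```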
Configuring subplots is a different beast, for which I would recommend using matplotlib's pyplot.
I would recommend starting with these figure and axes object declarations, then going from there:
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = plt.subplot(2, 2, 1) # 2x2 grid, first position
ax2 = plt.subplot(2, 2, 2)
ax3 = plt.subplot(2, 2, 3)
ax4 = plt.subplot(2, 2, 4)
Where you can read more about adding in axes data here.
Let me know if this helps!
You can use the parameter hue in the package seaborn. First, you need to reshape your data set with the function melt:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df1 = df.melt(id_vars=['Name', 'Target'])
print(df1.head(10))
Output:
Name Target variable value
0 Dan 94 PreviousWeekProg 94
1 Jarrod 94 PreviousWeekProg 34
2 Chris 94 PreviousWeekProg 45
3 Sam 94 PreviousWeekProg 89
4 Aaron 94 PreviousWeekProg 12
5 Jenna 94 PreviousWeekProg 56
6 Eric 94 PreviousWeekProg 90
7 Dan 94 CurrentWeekProg 92
8 Jarrod 94 CurrentWeekProg 56
9 Chris 94 CurrentWeekProg 43
Now you can use the column 'variable' as your hue parameter in the function barplot:
sns.set_style("whitegrid") # use the whitegrid style (set before the figure is created so it takes effect)
fig, ax = plt.subplots(figsize=(10, 5)) # set the size of a figure
sns.barplot(x='Name', y='value', hue='variable', data=df1) # plot
xmin, xmax = plt.xlim() # get x-axis limits
ax.hlines(y=df1['Target'], xmin=xmin, xmax=xmax, color='red') # add multiple lines
# or ax.axhline(y=df1['Target'].max()) to add a single line
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.06), ncol=4, frameon=False) # move legend to the bottom
plt.title('Student Progress', loc='center') # add title
plt.yticks(np.arange(df1['value'].min(), df1['value'].max()+1, 10.0)) # change tick frequency
plt.xlabel('') # set xlabel
plt.ylabel('') # set ylabel
plt.show() # show plot
I have a dataframe with a column (C_NC) containing two values, namely C and NC. I plotted the frequency of the C and NC values with
df['C_NC'].value_counts().plot(kind='bar')
Though this graph is nice, I also want to have exact frequency number on each bar in bar chart. I am quite new to Data visualization with Pandas Dataframe. Is there a way to do this with pandas dataframe?
Use:
s=df['C_NC'].value_counts()
s.plot(kind='bar',yticks=s)
Example
Here is the same situation reproduced with random data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
s1=pd.Series(np.random.randint(0,2,300))
s=s1.value_counts()
print(s)
1 156
0 144
dtype: int64
s1.value_counts().plot(kind='bar')
we can now show the exact values
s.plot(kind='bar',yticks=s)
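Setting yticks puts the exact counts on the axis, not on the bars. If the number should sit on top of each bar, one alternative is to loop over the bar patches and write a text label at each bar's height. A sketch with a small made-up C/NC column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for df['C_NC'].
s = pd.Series(["C", "NC", "C", "C", "NC"]).value_counts()

fig, ax = plt.subplots()
s.plot(kind="bar", ax=ax)

# Each bar is a Rectangle patch; place its height as text just above its top edge.
for patch in ax.patches:
    height = patch.get_height()
    ax.text(patch.get_x() + patch.get_width() / 2, height,
            str(int(height)), ha="center", va="bottom")
```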
I have thousands of data points for two values, Tm1 and Tm2, for a series of text labels of this type:
Tm1 Tm2
ID
A01 51 NaN
A03 51 NaN
A05 47 52
A07 47 52
A09 49 NaN
I managed to create a pandas DataFrame with the values from csv. I now want to plot the Tm1 and Tm2 as y values against the text ID's as x values in a scatter plot, with different color dots in pandas/matplotlib.
With a test case like this I can get a line plot
from pandas import *
df2= DataFrame([52,54,56],index=["A01","A02","A03"],columns=["Tm1"])
df2["Tm2"] = [None,42,None]
Tm1 Tm2
A01 52 NaN
A02 54 42
A03 56 NaN
I want to not connect the individual values with lines and just have the Tm1 and Tm2 values as scatter dots in different colors.
When I try to plot using
df2.reset_index().plot(kind="scatter",x='index',y=["Tm1"])
I get an error:
KeyError: u'no item named index'
I know this is a very basic plotting command, but I'm sorry, I have no idea how to achieve this in pandas/matplotlib. The scatter command does need an x and a y value, but I am somehow missing some key pandas concept in understanding how to do this.
I think the problem here is that you are trying to plot a scatter graph against a non-numeric series. That will fail - although the error message you are given is so misleading that it could be considered a bug.
You could, however, explicitly set the xticks to use one per category and use the second argument of xticks to set the xtick labels. Like this:
import matplotlib.pyplot as plt
df1 = df2.reset_index() #df1 will have a numeric index, and a
#column named 'index' containing the index labels from df2
plt.scatter(df1.index,df1['Tm1'],c='b',label='Tm1')
plt.scatter(df1.index,df1['Tm2'],c='r',label='Tm2')
plt.legend(loc=4) # Optional - show labelled legend, loc=4 puts it at bottom right
plt.xticks(df1.index,df1['index']) # explicitly set one tick per category and label them
# according to the labels in column df1['index']
plt.show()
I've just tested it with 1.4.3 and it worked OK
For the example data you gave, this yields:
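Separately from the approach above: reasonably recent matplotlib releases accept string values on an axis directly and treat them as categories, so the manual tick step may not be needed. A sketch under that assumption, using the question's test frame with NaN written explicitly:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df2 = pd.DataFrame({"Tm1": [52, 54, 56], "Tm2": [np.nan, 42, np.nan]},
                   index=["A01", "A02", "A03"])

fig, ax = plt.subplots()
ids = list(df2.index)  # string IDs are handled as categorical x values
ax.scatter(ids, df2["Tm1"], c="b", label="Tm1")
ax.scatter(ids, df2["Tm2"], c="r", label="Tm2")  # NaN points are simply not drawn
ax.legend(loc=4)
```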