Pandas plot dataframe as scatter complains of unknown item - python

I have thousands of data points for two values, Tm1 and Tm2, for a series of text labels of the form:
Tm1 Tm2
ID
A01 51 NaN
A03 51 NaN
A05 47 52
A07 47 52
A09 49 NaN
I managed to create a pandas DataFrame with the values from csv. I now want to plot the Tm1 and Tm2 as y values against the text ID's as x values in a scatter plot, with different color dots in pandas/matplotlib.
With a test case like this I can get a line plot
from pandas import *
df2= DataFrame([52,54,56],index=["A01","A02","A03"],columns=["Tm1"])
df2["Tm2"] = [None,42,None]
Tm1 Tm2
A01 52 NaN
A02 54 42
A03 56 NaN
I want to not connect the individual values with lines and just have the Tm1 and Tm2 values as scatter dots in different colors.
When I try to plot using
df2.reset_index().plot(kind="scatter",x='index',y=["Tm1"])
I get an error:
KeyError: u'no item named index'
I know this is a very basic plotting command, but I have no idea how to achieve this in pandas/matplotlib. The scatter command needs an x and a y value, but I seem to be missing some key pandas concept for how to supply them.

I think the problem here is that you are trying to plot a scatter graph against a non-numeric series. That will fail - although the error message you are given is so misleading that it could be considered a bug.
You could, however, explicitly set the xticks to use one per category and use the second argument of xticks to set the xtick labels. Like this:
import matplotlib.pyplot as plt
df1 = df2.reset_index() #df1 will have a numeric index, and a
#column named 'index' containing the index labels from df2
plt.scatter(df1.index,df1['Tm1'],c='b',label='Tm1')
plt.scatter(df1.index,df1['Tm2'],c='r',label='Tm2')
plt.legend(loc=4) # Optional - show labelled legend, loc=4 puts it at bottom right
plt.xticks(df1.index,df1['index']) # explicitly set one tick per category and label them
# according to the labels in column df1['index']
plt.show()
I've just tested it with 1.4.3 and it worked OK
For the example data you gave, this yields:
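The same idea can be wrapped in a small helper. This is only a sketch: the function name scatter_by_label is made up for illustration, and it assumes the DataFrame's index holds the text IDs. It plots each column against integer positions and then relabels the ticks with the IDs.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_by_label(df, columns, ax=None):
    """Scatter each column against integer positions, labelled with df.index.

    Hypothetical helper, not part of pandas or matplotlib.
    """
    if ax is None:
        _, ax = plt.subplots()
    positions = np.arange(len(df.index))      # numeric x values: 0, 1, 2, ...
    for col in columns:
        # Each scatter call takes the next colour in the default cycle;
        # NaN values are simply not drawn.
        ax.scatter(positions, df[col], label=col)
    ax.set_xticks(positions)                  # one tick per row
    ax.set_xticklabels(list(df.index))        # show the text IDs instead of 0, 1, 2
    ax.legend(loc="lower right")
    return ax


# Test data from the question.
df2 = pd.DataFrame(
    {"Tm1": [52, 54, 56], "Tm2": [np.nan, 42, np.nan]},
    index=["A01", "A02", "A03"],
)
ax = scatter_by_label(df2, ["Tm1", "Tm2"])
ax.figure.savefig("tm_scatter.png")
```

With thousands of IDs, one tick per row becomes unreadable, so you would probably want to label only every n-th position.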

Related

Python - Pandas - Plot Data frame without headers

I am trying to use the .plot() function in pandas to plot data into a line graph.
The data is sorted by date, with 48 values after each date. Sample below:
1 2 ... 46 47 48
18 2018-02-19 1.317956 1.192840 ... 1.959250 1.782985 1.418093
19 2018-02-20 1.356267 1.192248 ... 2.123432 1.760629 1.569340
20 2018-02-21 1.417181 1.288694 ... 2.086715 1.823581 1.612062
21 2018-02-22 1.431536 1.279514 ... 2.201972 1.878109 1.694159
etc until row 346.
I tried the below but .plot does not seem to take positional arguments:
df.plot(x=df.iloc[0:346,0],y=[0:346,1:49]
How would I go about plotting my rows by date (the 1st column) on a line graph, and can I expand this to include multiple axes?
There are multiple ways to do this, some of which are directly through the pandas dataframe. However, given your sample plotting line, I think the easiest might be to just use matplotlib directly:
import matplotlib.pyplot as plt
plt.plot(df.iloc[0:346,0],df.iloc[0:346,1:49])
For multiple axes you can add a few lines to make subplots. For example:
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = plt.subplot(1,2,1)
# plt.plot() does not accept an ax argument, so call plot() on each Axes object instead
ax1.plot(df.iloc[0:346,0],df.iloc[0:346,1:10])
ax2 = plt.subplot(1,2,2)
ax2.plot(df.iloc[0:346,0],df.iloc[0:346,20:30])
You can also do this using the pandas plot() function that you were trying to use: it takes an ax argument, where you can provide the axes on which to plot. If you want to stick to pandas, I think you'd be best off setting the index to be a datetime index (see this link as an example: https://stackoverflow.com/a/27051371/12133280) and then using df.plot(y='column_name',ax=ax1). The x axis will be the index, which you would have set to be the date.
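A minimal sketch of the pandas route, under some assumptions: the first column is named "date" here (the question does not give its name), and the values are random placeholders rather than the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder data shaped like the question's: one date column plus columns 1..48.
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.uniform(1.0, 2.5, size=(4, 48)), columns=range(1, 49))
raw.insert(0, "date", ["2018-02-19", "2018-02-20", "2018-02-21", "2018-02-22"])

# Parse the dates and make them the index, so pandas uses them as the x axis.
raw["date"] = pd.to_datetime(raw["date"])
df = raw.set_index("date")

# Two side-by-side axes; each DataFrame.plot call draws onto the axes it is given.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df.iloc[:, 0:9].plot(ax=ax1, legend=False)     # columns 1..9
df.iloc[:, 19:29].plot(ax=ax2, legend=False)   # columns 20..29
fig.autofmt_xdate()                            # rotate the date labels
fig.savefig("by_date.png")
```

Positional slicing with iloc is used for the column ranges so the example does not depend on how the column labels are typed.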

Plots, by Label, frequency of words

I need to create separate plots based on a label. My dataset is
Label Word Frequency
439 10.0 glass 600
471 10.0 tv 34
463 10.0 screen 31
437 10.0 laptop 15
454 10.0 info 15
65 -1.0 dog 1
68 -1.0 cat 1
69 -1.0 win 1
70 -2.0 man 1
71 -2.0 woman 1
In this case I would expect three plots, one for 10, one for -1 and one for -2, with on the x axis Word column and on the y-axis the Frequency (it is already sorted in descending order by Label).
I have tried as follows:
df['Word'].hist(by=df['Label'])
But it seems to be wrong as the output is far away from the expected one.
Any help would be great
You don't want to be using a histogram here: a histogram plot is where the columns of your dataframe contain raw data, and the hist function buckets the raw values and finds the frequencies of each bucket, and then plots.
Your dataframe is already bucketed, with a column in which the frequencies have already been calculated; what you need is the df.plot.bar() method. Unfortunately, this is quite new, and does not yet allow a by parameter, so you have to deal with the subplots manually.
Full walkthrough code for the cut-down example you have provided follows. Obviously you can make it more generic by not hardcoding the number of subplots required in the line marked [1].
# Set up:
import matplotlib.pyplot as plt
import pandas as pd
import io
txt = """Label,Word,Frequency
10.0,glass,600
10.0,tv,34
10.0,screen,31
10.0,laptop,15
10.0,info,15
-1.0,dog,1
-1.0,cat,1
-1.0,win,1
-2.0,man,1
-2.0,woman,1"""
dfA = pd.read_csv((io.StringIO(txt)))
labels = dfA["Label"].unique()
# Set up subplots on which to plot.
# Make more generic by not hardcoding nrows and ncols in [1],
# but calculating them depending on how many labels you have.
fig, axes = plt.subplots(nrows=2, ncols=2) # [1]
ax_list = axes.flatten() # axes is a 2-D NumPy array of Axes objects;
# ax_list is a flat 1-D array, which is easier to index.
# Loop through labels and plot the bar chart to the corresponding axis object.
for i in range(len(labels)):
    dfA[dfA["Label"]==labels[i]].plot.bar(x="Word", y="Frequency", ax=ax_list[i])
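One way to avoid hardcoding the grid in [1] is to derive the number of rows from the number of labels. The sketch below fixes the grid at two columns, which is an arbitrary choice, and hides any axes left over.

```python
import io
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

txt = """Label,Word,Frequency
10.0,glass,600
10.0,tv,34
10.0,screen,31
10.0,laptop,15
10.0,info,15
-1.0,dog,1
-1.0,cat,1
-1.0,win,1
-2.0,man,1
-2.0,woman,1"""
dfA = pd.read_csv(io.StringIO(txt))
labels = dfA["Label"].unique()

# Grid size computed from the number of labels (column count is an assumption).
ncols = 2
nrows = math.ceil(len(labels) / ncols)
# squeeze=False keeps axes 2-D even when there is a single row or column.
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False)
ax_list = axes.flatten()

# One bar chart per label, each on its own axes.
for ax, label in zip(ax_list, labels):
    subset = dfA[dfA["Label"] == label]
    subset.plot.bar(x="Word", y="Frequency", ax=ax,
                    title=f"Label {label}", legend=False)

# Hide axes that have no label assigned to them.
for ax in ax_list[len(labels):]:
    ax.set_visible(False)

fig.tight_layout()
fig.savefig("by_label.png")
```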

Plot pandas dataframe using column names as x axis

I have the following pandas Data Frame:
and I need to make line plots using the column names (400, 400.5, 401....) as the x axis and the data frame values as the y axis, and using the index column ('fluorophore') as the label for that line plot. I want to be able to choose which fluorophores I want to plot.
How can I accomplish that?
I do not know your dataset, so if it's always just full columns of NaN you could do
df[non_nan_cols].T[['FAM', 'TET']].plot.line()
Where non_nan_cols is a list of your columns that do not contain NaN values.
Alternatively, you could
import numpy as np
import matplotlib.pyplot as plt
choice_of_fp = df.index.tolist()
x_val = np.asarray(df.columns.tolist())
for i in choice_of_fp:
    mask = np.isfinite(df.loc[i].values)  # True where the row has a real value
    plt.plot(x_val[mask], df.loc[i].values[mask], label=i)
plt.legend()
plt.show()
This version also handles rows that contain NaN values. Here choice_of_fp is a list containing the fluorophores you want to plot.
You can do the below and it will use all columns except the index and plot the chart.
abs_data.set_index('fluorophore').plot()
If you want to filter values for fluorophore then you can do this
abs_data[abs_data.fluorophore.isin(['A', 'B'])].set_index('fluorophore').plot()
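As a small self-contained sketch of the transpose approach: the numbers below are made up, "HEX" is an extra row added only for illustration, and the frame is assumed to be indexed by fluorophore with the wavelengths as column names. Selecting rows with loc and transposing puts the wavelengths on the x axis and gives one line per chosen fluorophore.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up values; columns are wavelengths, rows are fluorophores.
wavelengths = [400.0, 400.5, 401.0, 401.5]
df = pd.DataFrame(
    [[0.10, 0.20, 0.30, 0.40],
     [0.40, 0.30, np.nan, 0.10],
     [0.20, 0.20, 0.20, 0.20]],
    index=pd.Index(["FAM", "TET", "HEX"], name="fluorophore"),
    columns=wavelengths,
)

chosen = ["FAM", "TET"]             # the fluorophores to plot
# After .T the wavelengths form the index (x axis) and each chosen
# fluorophore becomes a column, i.e. one labelled line.
ax = df.loc[chosen].T.plot.line()
ax.set_xlabel("wavelength")
ax.figure.savefig("fluorophores.png")
```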

Python: Plot histogram of dataframe with one column as the labels, and the other as the values

I have a dataframe with two columns. I want to plot a histogram with the 'Word_Length' column as the x-axis labels and the y-axis values as the 'Count'
Here's a short example of what the data looks like. Both Columns values are integers.
Word_Length Count
1 265
9 67
3 45
I guess you need DataFrame.plot.bar rather than a histogram: a histogram bins raw numerical data to show its distribution, whereas your counts have already been computed:
df.plot.bar(x = 'Word_Length', y='Count')
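A short runnable version with the three sample rows from the question. Sorting by Word_Length first is an optional extra, not something the question asked for, so that the bars appear in order of length.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# The sample rows from the question.
df = pd.DataFrame({"Word_Length": [1, 9, 3], "Count": [265, 67, 45]})

# Bars are drawn in row order, so sort first to get lengths 1, 3, 9.
ax = df.sort_values("Word_Length").plot.bar(x="Word_Length", y="Count", legend=False)
ax.set_ylabel("Count")
ax.figure.savefig("word_length.png")
```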

Frequency plot in Python/Pandas DataFrame

I have parsed a very large dataframe with several columns and values like this:
Name Age Points ...
XYZ 42 32pts ...
ABC 41 32pts ...
DEF 32 35pts
GHI 52 35pts
JHK 72 35pts
MNU 43 42pts
LKT 32 32pts
LKI 42 42pts
JHI 42 35pts
JHP 42 42pts
XXX 42 42pts
XYY 42 35pts
I have imported numpy and matplotlib.
I need to plot a graph of the number of times each value in the column 'Points' occurs. I don't need any bins for the plotting; it is more of a plot to see how many times the same score occurs over a large dataset.
So essentially the bar plot (or histogram, if you can call it that) should show that 32pts occurs 3 times, 35pts occurs 5 times and 42pts occurs 4 times. If I can plot the values in sorted order, all the better. I have tried df.hist() but it is not working for me.
Any clues? Thanks.
I would plot the result of the column's value_counts method directly:
import matplotlib.pyplot as plt
import pandas
data = load_my_data()
fig, ax = plt.subplots()
data['Points'].value_counts().plot(ax=ax, kind='bar')
If you want to remove the string 'pts' from all of the elements in your column, you can do something like this:
df['points_int'] = df['Points'].str.replace('pts', '').astype(int)
That assumes they all end with 'pts'. If the suffix varies from line to line, you need to look into regular expressions, as in this question:
Split columns using pandas
And the official docs: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
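Putting the two steps together on the Points column from the question, as a sketch: a regular expression pulls out the leading digits, which also copes with a suffix that varies, and sort_index puts the bars in order of score rather than order of frequency.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# The 'Points' values from the question's sample, in row order.
df = pd.DataFrame({"Points": [
    "32pts", "32pts", "35pts", "35pts", "35pts", "42pts",
    "32pts", "42pts", "35pts", "42pts", "42pts", "35pts",
]})

# Keep only the digits, whatever text follows them, and convert to int.
points = df["Points"].str.extract(r"(\d+)", expand=False).astype(int)

# Count occurrences of each score, then order by score.
counts = points.value_counts().sort_index()

fig, ax = plt.subplots()
counts.plot(ax=ax, kind="bar")
ax.set_xlabel("Points")
ax.set_ylabel("Occurrences")
fig.savefig("points_frequency.png")
```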
The Seaborn package has a countplot function that can be used to make a frequency plot:
import seaborn as sns
ax = sns.countplot(x="Points",data=df)
