unable to plot two columns from DataFrame after using pandas.read_csv - python

I'm trying to plot two columns that have been read in using pandas.read_csv, the code:-
from pandas import read_csv
from matplotlib import pyplot
data = read_csv('Stats.csv', sep=',')
#data = data.astype(float)
data.plot(x = 1, y = 2)
pyplot.show()
the csv file snippet:-
1,a4,2000,125,1.9,2.8,25.6
2,a4,7000,125,1.7,2.3,18
3,a2,7000,30,0.84,1.1,8.11
4,a2,5000,30,0.83,1.05,6.87
5,a2,4000,45,2.8,3.48,16.54
when x = 1 and y = 2 it will plot the second column against the fourth not the third as I expected
When I try to plot the third column against the fourth (x = 2, y = 3) it plots the third against the fifth
I'm trying to plot the third against the fourth right now, when both x and y = 2 it will plot the third column against the fourth but the values are incorrect, what am I missing? is the read_csv changing the order of the columns?

Your input csv is without headers which doesn't help clarity (see Murali's comment). But I think the problem stems from the nature of column that contains a4,a2.
This column can be used for the x axis but not for y axis (non-numeric data on an x axis appears to be just read in order). Hence the count offset. So as y "reads over" the column at 1 (all 0 indexed) - but x does not.
Conducting
data.plot(x=1,y=0)
and
data.plot(x=0,y=1)
and inspecting the axis helps visualise what's going on.
Bizarrely this means you can do
df.plot(x=1,y=1)
to get what you want.

Related

How to Write a plt.scatter(x, y) function in one line where y=function of x

I was plotting a scatter plot to show null values in dataframe. As you can see the plt.scatter() function is not expressive enough. Relation between list(range(0,1200)) and 'a' is not clear unless you see the previous lines. Can the plt.scatter(x,y) be written in a more explicit way where it could be easily understood how x and y is related. Like if somebody only see the plt.scatter(x,y) , they would understand what it is about.
a = []
for i in range(0,1200):
feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum()>i]
a.append(len(feature_with_na))
plt.scatter(list(range(0,1200)), a)
On your x axis you have the number, then on the y-axis you want to plot the number of columns in your DataFrame that have more than that number of null values.
Instead of your loop you can count the number of null values within each column and use numpy.broadcasting, ([:, None]), to compare with an array of your numbers. This allows you to specify an xarr of the numbers, then you use that same array in the comparison.
Sample Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
df = pd.DataFrame(np.random.choice([1,2,3,4,5,np.NaN], (100,10)))
Code
# Range of 'x' values to consider
xarr = np.arange(0, 100)
plt.scatter(xarr, (df.isnull().sum().to_numpy()>xarr[:, None]).sum(axis=1))
ALollz answer is good, but here's a less numpy-heavy alternative if that's your thing:
feature_null_counts = df.isnull().sum()
n_nulls = list(range(100))
features_with_n_nulls = [sum(feature_null_counts > n) for n in n_nulls]
plt.scatter(n_nulls, features_with_n_nulls)

Plot multiple line graph from Pandas into Seaborn

I'm trying to plot a multi line-graph plot from a pandas dataframe using seaborn. Below is a .csv of the of the data and the desired plot. In excel I simply selected the whole dataset and swapped the axis. Technically there are 110 lines (rows) on this, but many aren't visible because they only contain 0's.
This is my code:
individual_burst_data = {'nb001':nb001, 'nb002':nb002, 'nb003':nb003, 'nb004':nb004, 'nb005':nb005, 'nb006':nb006, 'nb007':nb007, 'nb008':nb008, 'nb009':nb009, 'nb010':nb010, 'nb011':nb011, 'nb012':nb012, 'nb013':nb013, 'nb015':nb015, 'nb016':nb016 }
ibd_panda_conv = pd.DataFrame(individual_burst_data)
sns.lineplot(data = ibd_panda_conv, x = individual_burst_data, y =ibd_panda_conv)
Other sources seem to only extract one column, whereas I need all the columns.
I tried to create an index for the y-axis
index_data = list(range(0,len(individual_burst_data)))
but this didn't work either.
The seaborn lineplot() documentation says:
Passing the entire wide-form dataset to data plots a separate line for each column
Since you want a line for each row instead, you need to transpose your dataframe, so try this:
sns.lineplot(data=ibd_panda_conv.T, dashes=False)

How do I format a y-axis 'y' in matplotlib going between pandas dataframes and simple variables?

CSV1only is a dataframe uploaded from a CSV
Let CSV1only as a dataframe be a column such that:
TRADINGITEMID:
1233
2455
3123
1235
5098
as a small example
How can I plot a scatterplot accordingly, specifically the y-axis?
I tried:
import pandas as pd
import matplotlib.pyplot as plt
CSV1only.plot(kind='scatter',x='TRADINGITEMID', y= [1,2], color='b')
plt.xlabel('TRADINGITEMID Numbers')
plt.ylabel('Range')
plt.title('Distribution of ItemIDNumbers')
and it doesn't work because of the y.
So, my main question is just how I can get a 0, 1, 2 y-axis for this scatter plot, as I want to make a distribution graph.
The following code doesn't work because it doesn't match the amount of rows included in the original TRADINGITEMID column, which has 5000 rows:
newcolumn_values = [1, 2]
CSV1only['un et deux'] = newcolumn_values
#and then I changed the y = [1,2] from before into y = ['un et deux']
Therefore the solution would need to work from any integer 1 to N, N being the # of rows. Yet, it would only have a range of [0, 2] or some [0, m], m being some arbitrary integer.
Don't need to worry about the actual pandas data frame CSV1only.
The 'TRADINGITEMIDNUMBERS' contains 5000 rows of unique numbers, so I just wanna plot those numbers on a line, with the y-axis being instances (which will never pass 1 since it is unique).
I think you are looking for the following: You need to generate y-values starting from 0 until n-1 where n is the total number of rows
y = np.arange(len(CSV1only['TRADINGITEMID']))
plt.scatter(CSV1only['TRADINGITEMID'], y, c='DarkBlue')

pandas plot line segments for each row

I have dataframes with columns containing x,y coordinates for multiple points. One row can consist of several points.
I'm trying to find out an easy way to be able to plot lines between each point generating a curve for each row of data.
Here is a simplified example where two lines are represented by two points each.
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
df.plot(y=['p1_y','p2_y'], x=['p1_x','p2_x'])
when trying to plot them I expect line 1 to start where x=1 and line 2 to start where x=2.
Instead, the x axis contains two value-pairs (1,2) and (2,3) and both lines have the same start and end-point in x-axis.
How do I get around this problem?
Edit:
If using matplotlib, the following hardcoded values generates the plot i'm interested in
plt.plot([[1,2],[2,3]],[[10,9],[11,12]])
While I'm sure that there should be a more succinct way using pure pandas, here's a simple approach using matplotlib and some derivatives from the original df.(I hope I understood the question correctly)
Assumption: In df, you place x values in even columns and y values in odd columns
Obtain x values
x = df.loc[:, df.columns[::2]]
x
p1_x p2_x
0 1 2
1 2 3
Obtain y values
y = df.loc[:, df.columns[1::2]]
y
p1_y p2_y
0 10 11
1 9 12
Then plot using a for loop
for i in range(len(df)):
plt.plot(x.iloc[i,:], y.iloc[i,:])
One does not need to create additional data frames. One can loop through the rows to plot these lines:
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
for i in range(len(df)): # for each row:
# plt.plot([list of Xs], [list of Ys])
plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]])
plt.show()
The lines will be drawn in different colors. To get lines of same color, one can add option c='k' or whatever color one wants.
plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]], c='k')
I generaly don't use the pandas plotting because I think it is rather limited, if using matplotlib is not an issue, the following code works:
from matplotlib import pyplot as plt
plt.plot(df.p1_x,df.p1_y)
plt.plot(df.p2_x,df.p2_y)
plt.plot()
if you got lots of lines to plot, you can use a for loop.

plot multiple data series from numpy array

I had a very ambitious project (for my novice level) to use on numpy array, where I load a series of data, and make different plots based on my needs - I have uploaded a slim version of my data file input_data and wanted to make plots based on: F (where I would like to choose the desired F before looping), and each series will have the data from E column (e.g. A12 one data series, A23 another data series in the plot, etc) and on the X axis I would like to use the corresponding values in D.
so to summarize for a chosen value on column F I want to have 4 different data series (as the number of variables on column E) and the data should be reference (x-axis) on the value of column D (which is date)
I stumbled in the first step (although spend too much time) where I wanted to plot all data with F column identifier as one plot.
Here is what I have up to now:
import os
import numpy as np
N = 8 #different values on column F
M = 4 #different values on column E
dataset = open('array_data.txt').readlines()[1:]
data = np.genfromtxt(dataset)
my_array = data
day = len(my_array)/M/N # number of measurement sets - variation on column D
for i in range(0, len(my_array), N):
plt.xlim(0, )
plt.ylim(-1, 2)
plt.plot(my_array[i, 0], my_array[i, 2], 'o')
plt.hold(True)
plt.show()
this does nothing.... and I still have a long way to go..
With pandas you can do:
import pandas as pd
dataset = pd.read_table("toplot.txt", sep="\t")
#make D index (automatically puts it on the x axis)
dataset.set_index("D", inplace=True)
#plotting R vs. D
dataset.R.plot()
#plotting F vs. D
dataset.F.plot()
dataset is a DataFrame object and DataFrame.plot is just a wrapper around the matplotlib function to plot the series.
I'm not clear on how you are wanting to plot it, but it sound like you'll need to select some values of a column. This would be:
# get where F == 1000
maskF = dataset.F == 1000
# get the values where F == 1000
rows = dataset[maskF]
# get the values where A12 is in column E
rows = rows[rows.E == "A12"]
#remove the we don't want to see
del rows["E"]
del rows["F"]
#Plot the result
rows.plot(xlim=(0,None), ylim=(-1,2))

Categories