pandas: plot multiple lines based on a certain column - python

I have a dataframe like this
timestamp|type|value_1|value_2
t1|A|v1|v2
t2|B|v3|v4
t3|C|v5|v6
t4|A|v7|v8
I would like to plot a graph with 6 lines, one for each combination of type and value,
for example
type A - value_1
type A - value_2
type B - value_1
type B - value_2
type C - value_1
type C - value_2
thanks,
it is like doing this
A = df[df["type"] == "A"]
A.plot(x="timestamp", y=["value_1", "value_2"])
do this for three types
and combine those 6 lines on the same graph

I think you can reshape the DataFrame so that each type/value pair becomes its own column, and then plot:
df['g'] = df.groupby('type').cumcount()
df = df.set_index(['timestamp','g', 'type']).unstack().reset_index(level=1, drop=True)
df.columns = df.columns.map('_'.join)
df.plot()
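An alternative to the reshape above is to loop over the groups and draw each line on one shared set of axes. This is only a minimal sketch: the sample data is hypothetical and merely mirrors the column names from the question, and the headless backend line is there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data in the same shape as the question's frame.
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=6, freq="D"),
    "type": ["A", "B", "C", "A", "B", "C"],
    "value_1": [1, 2, 3, 4, 5, 6],
    "value_2": [6, 5, 4, 3, 2, 1],
})

fig, ax = plt.subplots()
# One pass per type; each pass draws value_1 and value_2 on the shared axes.
for type_name, group in df.groupby("type"):
    for column in ["value_1", "value_2"]:
        ax.plot(group["timestamp"], group[column], marker="o",
                label=f"type {type_name} - {column}")
ax.legend()

# Collect the labels of the drawn lines to confirm there are six of them.
line_labels = [line.get_label() for line in ax.get_lines()]
```

Compared with the reshape, this keeps the frame in its original long form and makes it easy to style each line individually.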

As far as the plotting goes I recommend you check out:
MatPlotLib: Multiple datasets on the same scatter plot and Multiple data set plotting with matplotlib.pyplot.plot_date , as well as this tutorial.
For the selection of data to plot I recommend the section "selection by label" in the pandas docs. I suppose you could store the values from your corresponding columns / rows in some temporary variables x1 - xn and y1 - yn and then just plot all the pairs, which could look something like:
xs = sheet.loc[<appropriate labels>]
ys = sheet.loc[<appropriate labels>]
for i in range(len(xs)):
    plt.plot(xs[i],ys[i],<further arguments>)
plt.show()
In your case, just accessing the 'values' label might not be sufficient, as only every n-th element of that column seems to belong to any given type. In this question you can see how to get a new list with only the appropriate values inside. Basically something like:
allXs = sheet.loc['v1']
xsTypeA = allXs[1::4]
...
hope that helps.
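Positional slicing such as `[1::4]` only works when the types repeat in a strictly regular order. A boolean mask on the type column avoids that assumption. The sketch below uses a hypothetical frame named `sheet` with made-up numbers; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's data.
sheet = pd.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "type": ["A", "B", "C", "A"],
    "value_1": [10, 30, 50, 70],
    "value_2": [20, 40, 60, 80],
})

# Boolean mask: True for the rows that belong to type A.
is_type_a = sheet["type"] == "A"

# Select matching rows and one column at a time by label.
xs_type_a = sheet.loc[is_type_a, "timestamp"]
ys_type_a = sheet.loc[is_type_a, "value_1"]
```

The resulting `xs_type_a` and `ys_type_a` can be passed to `plt.plot` as one x/y pair, and the same pattern repeats for the other types and value columns.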

Related

Unable to create histogram from slice due to datetime datatype error

I am looking to remove the upper outliers of some columns in a DataFrame (specifically in the 'vehicle_age' and 'odometer' columns) in order to then build a histogram.
I have been able to successfully build the initial histograms like so:
crankshaft_ads['odometer'].plot(kind='hist', bins=25, range= (0, 1000000))
And I would like to build histograms without the upper outliers, as a comparison. Here is what I tried so far:
q1_age = crankshaft_ads['vehicle_age'].quantile(0.25)
q1_odometer = crankshaft_ads['odometer'].quantile(0.25)
q3_age = crankshaft_ads['vehicle_age'].quantile(0.75)
q3_odometer = crankshaft_ads['odometer'].quantile(0.75)
iqr_age = q3_age - q1_age
iqr_odometer = q3_odometer - q1_odometer
upper_limit_age = q3_age + (1.5 * iqr_age)
upper_limit_odometer = q3_odometer + (1.5 * iqr_odometer)
crankshaft_ads['upper_limit_age'] = upper_limit_age
crankshaft_ads['upper_limit_odometer'] = upper_limit_odometer
(crankshaft_ads
    .query('vehicle_age < upper_limit_age')
    .plot(kind='hist', bins=10)
)
(crankshaft_ads
    .query('odometer < upper_limit_odometer')
    .plot(kind='hist', bins=25)
)
I would need help with the .query() elements. I get the following error (it happens when running the .plot line it seems):
ValueError: view limit minimum -49500.0 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
There is one column in the DataFrame that has datetime datatype, but what I'm trying to do is build a histogram for the 2 columns mentioned above, with the upper outliers filtered out. Is this the wrong approach?
Thanks for your help.
It seems that you have not selected the columns you want to plot in your plotting functions. The queries you have written select a subset of the whole dataframe, not only the column mentioned in each query. So both plotting functions are attempting to plot a histogram for each column in a single figure, including the datetime column.
Here are three ways you could solve this problem, taking your first plotting function as an example:
# Solution 1: apply query to whole dataframe then select column in plotting function
crankshaft_ads.query('vehicle_age < @upper_limit_age').plot.hist(y='vehicle_age', bins=10)
# Solution 2: first select column then select values to plot in histogram
crankshaft_ads['vehicle_age'][crankshaft_ads['vehicle_age'] < upper_limit_age].plot.hist(bins=10)
# Solution 3: first select all dataframe rows meeting condition then select column in plotting function
crankshaft_ads[crankshaft_ads['vehicle_age'] < upper_limit_age].plot.hist(y='vehicle_age', bins=10)
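To see the whole flow end to end, here is a minimal sketch on hypothetical data. The `date_posted` column name and all values are made up for illustration, and `upper_limit` is a helper introduced here, not something from the question. The point is that selecting the column before plotting keeps the datetime column out of the histogram.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import pandas as pd

# Hypothetical data: one datetime column plus the two numeric columns.
crankshaft_ads = pd.DataFrame({
    "date_posted": pd.date_range("2019-01-01", periods=8, freq="D"),
    "vehicle_age": [1, 2, 3, 4, 5, 6, 7, 60],
    "odometer": [10_000, 20_000, 30_000, 40_000,
                 50_000, 60_000, 70_000, 900_000],
})

def upper_limit(series):
    """Return Q3 + 1.5 * IQR, the usual upper fence for outliers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

upper_limit_age = upper_limit(crankshaft_ads["vehicle_age"])

# '@' lets query() refer to a Python variable instead of a column.
filtered = crankshaft_ads.query("vehicle_age < @upper_limit_age")

# Plot only the selected column, so the datetime column is never drawn.
ax = filtered["vehicle_age"].plot(kind="hist", bins=10)
```

With the `@` prefix there is no need to store the limit as an extra column in the DataFrame.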

How do I access the integers given by nunique in Pandas?

I am trying to access the items in each column that is outputted given the following code. It outputs two columns, 'Accurate_Episode_Date' values, and the count (the frequency of each Date). My goal is to plot the date on the x axis, and the count on the y axis using a scatterplot, but first I need to be able to access the actual count values.
data = pd.read_csv('CovidDataset.csv')
Barrie = data.loc[data['Reporting_PHU_City'] == 'Barrie']
dates_barrie = Barrie[['Accurate_Episode_Date']]
num = data.groupby('Accurate_Episode_Date')['_id'].nunique()
print(num.tail(5))
The code above outputs the following:
2021-01-10T00:00:00 1326
2021-01-11T00:00:00 1875
2021-01-12T00:00:00 1274
2021-01-13T00:00:00 492
2021-01-14T00:00:00 8
Again, I want to plot the dates on the x axis, and the counts on the y axis in scatterplot form. How do I access the count and date values?
EDIT: I just want a way to plot dates like 2021-01-10T00:00:00 and so on on the x axis, and the corresponding count: 1326 on the Y-axis.
Turns out this was mainly a data type issue. Basically all that was needed was accessing the datetime index and typecasting it to string with num.index.astype(str).
You could probably change it "in-place" and use the plot like below.
num.index = num.index.astype(str)
num.plot()
If you only want to access the values of a DataFrame or Series you just need to access them like this: num.values
If you want to plot the date column on X, you don't need to access that column separately, just use pandas internals:
from datetime import datetime, timedelta
import numpy as np
import pandas as pd

# some dummy dates + counts
dates = [datetime.now() + timedelta(hours=i) for i in range(1, 6)]
values = np.random.randint(1, 10, 5)
df = pd.DataFrame({
    "Date": dates,
    "Values": values,
})
# if you only have 1 other column you can skip `y`
df.plot(x="Date", y="Values")
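Applied to the grouped counts from the question, the dates live in the index of the resulting Series and the counts in its values. The sketch below uses a small hypothetical stand-in for `CovidDataset.csv`; only the column names are taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for the CSV in the question.
data = pd.DataFrame({
    "_id": [1, 2, 3, 4, 5, 6],
    "Accurate_Episode_Date": (
        ["2021-01-10T00:00:00"] * 3
        + ["2021-01-11T00:00:00"] * 2
        + ["2021-01-12T00:00:00"]
    ),
})

num = data.groupby("Accurate_Episode_Date")["_id"].nunique()

# The group keys are the index; the counts are the values.
dates = pd.to_datetime(num.index)
counts = num.to_numpy()

# Alternatively, turn the Series into a two-column DataFrame.
counts_df = num.reset_index(name="count")
```

From here, `plt.scatter(dates, counts)` would draw the scatterplot with real datetime values on the x axis.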
You need to convert the date column using pd.to_datetime(df['dates']); then you can plot.
Updated answer:
Here there is no need to convert with pd.to_datetime(df['dates']):
ax = df[['count']].plot()
ax.set_xticks(df.index)
ax.set_xticklabels(df['date'])

Seaborn stripplot of datetime objects not working

Following the first example from URL:
http://seaborn.pydata.org/tutorial/categorical.html
I am able to load the dataset called 'tips' and reproduce the stripplot showed. However this plot is not shown when applied to my pandas dataframe (called df) consisting of datetime objects. My df consists of 19300 rows and 7 columns, of which 2 columns are in the form of datetime objects (dates and times respectively). I would like to use the Python Seaborn package's stripplot function to visualize these two df columns together. My code reads as follows:
sns.stripplot(x=df['DATE'], y=df['TIME'], data=df);
And the output error reads as follows:
TypeError: float() argument must be a string or a number
I have made sure to remove the header from the data columns before applying the plotting command.
Other failed attempts include (but not limited to)
sns.stripplot(x=df['DATE'], y=df['TIME']);
It is my guess that this error might be due to the datetime object nature of the column data types, and that this type must somehow be changed into either strings or integer values. Is this correct? And how might one proceed to accomplish this task?
To illustrate the df data, here is a working code which uses matplotlib.pyplot (as plt)
ax1.plot(x, y, 'o', label='Events')
Any help is much appreciated.
One can also try to convert dates/times into seconds to plot them as numeric values:
dates = df.DATE
times = df.TIME
start_date = dates.min()
dates_as_seconds = dates.map(lambda d: (d - start_date).total_seconds())
times_as_seconds = times.map(lambda t: t.second + t.minute*60 + t.hour*3600)
ax = sns.stripplot(x=dates_as_seconds, y=times_as_seconds)
ax.set_xticklabels(dates)
ax.set_yticklabels(times)
Of course, the data frame should be sorted by dates and times so that the ticks match the values.
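The conversion step on its own can be sketched as follows. This assumes `DATE` holds pandas timestamps and `TIME` holds `datetime.time` objects, which the question does not confirm; the sample values are hypothetical.

```python
import datetime as dt
import pandas as pd

# Hypothetical data with the assumed column types.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02"]),
    "TIME": [dt.time(8, 30, 0), dt.time(17, 5, 10), dt.time(0, 0, 59)],
})

# Dates become plain strings, which can serve as categories.
date_labels = df["DATE"].dt.strftime("%Y-%m-%d")

# Times become numbers: seconds elapsed since midnight.
seconds_since_midnight = df["TIME"].map(
    lambda t: t.hour * 3600 + t.minute * 60 + t.second
)
```

Passing `x=date_labels, y=seconds_since_midnight` to `sns.stripplot` then gives it one categorical and one numeric variable, which is the combination it is designed for.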
After applying the following code to previous script:
x = df['DATE']
data = df['TIME']
y = data[1:len(x)]
x = x[1:len(x)]
s = []
for time in y:
    a = int(str(time).replace(':',''))
    s.append(a)
k = []
for date in x:
    a = str(date)
    k.append(a)
x = k
y = s
stripplot worked:
sns.stripplot(x=x, y=y)
You just need to pass the variable names as x and y, not the data themselves. For example:
sns.stripplot(x="value", y="measurement", hue="species",
              data=iris, dodge=True, alpha=.25, zorder=1)
https://seaborn.pydata.org/examples/jitter_stripplot.html

Pandas dataframe hist not plotting categorical variables

I have a dataframe where I want to plot the histograms of each column.
df_play = pd.DataFrame({'a':['cat','dog','cat'],'b':['apple','orange','orange']})
df_play['a'] = df_play['a'].astype('category')
df_play['b'] = df_play['b'].astype('category')
df_play
df_play.hist(layout = (12,10))
However, I'm getting ValueError: num must be 1 <= num <= 0, not 1
When I tried with integers instead of category in the values, it worked fine, but I really want the names of the unique strings to be on the x-axis.
You can just apply pd.value_counts across columns and plot.
>>> df_play.apply(pd.value_counts).T.stack().plot(kind='bar')
If you want proper subplots or something more intricate, I'd suggest you just iterate with value_counts and create the subplots yourself.
Since there is no natural parameter for binning, perhaps what you want rather than histograms are bar plots of the value counts for each Series? If so, you can achieve that through
df_play['a'].value_counts().plot(kind='bar')
I realized a way to do this is to first create the fig and axs, then loop through the column names of the dataframe whose value counts we want to plot.
fig, axs = plt.subplots(1, len(df_play.columns), figsize=(10,6))
for i,x in enumerate(df_play.columns):
    df_play[x].value_counts().plot(kind='bar',ax=axs[i])
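A self-contained variant of this loop is sketched below. The only addition beyond the question's data is `squeeze=False`, a design choice made here so that `axs` stays two-dimensional even when the frame has a single column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df_play = pd.DataFrame({
    "a": ["cat", "dog", "cat"],
    "b": ["apple", "orange", "orange"],
}).astype("category")

# squeeze=False keeps axs as a 2-D array regardless of the column count.
fig, axs = plt.subplots(1, len(df_play.columns), figsize=(10, 6),
                        squeeze=False)

# One bar chart of value counts per column, category names on the x axis.
for ax, column in zip(axs[0], df_play.columns):
    df_play[column].value_counts().plot(kind="bar", ax=ax, title=column)

# Number of bars drawn in each subplot.
bar_counts = [len(ax.patches) for ax in axs[0]]
```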

Distributing a pandas DataFrame feature at random

I am reading in a set of data using pandas and plotting this using matplotlib. One column is a "category", eg "Sports", "Entertainment", but for some rows this is marked "Random", which means I need to distribute this value and add it randomly to one column. Ideally I would like to do this in the dataframe so that all values would be distributed.
My basic graph code is as follows :
df.category.value_counts().plot(kind="barh", alpha=a_bar)
title("Category Distribution")
The behaviour I would like is
If category == "Random"{
Assign this value to another column at random.
}
How can I accomplish this?
possibly:
# take the original value_counts, drop 'Random'
ts1 = df.category.value_counts()
rand_cnt = ts1['Random']
ts1.drop('Random', inplace=True)
# randomly choose from the other categories
ts2 = pd.Series(np.random.choice(ts1.index, rand_cnt)).value_counts()
# align the two series, and add them up
ts2 = ts2.reindex_like(ts1).fillna(0)
(ts1 + ts2).plot(kind='barh')
if you want to modify the original data-frame, then
idx = df.category == 'Random'
xs = df.category[~idx].unique() # all other categories
# randomly assign to categories which are 'Random'
df.loc[idx, 'category'] = np.random.choice(xs, idx.sum())
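The in-place reassignment can be tried on a small hypothetical frame like this. The category names and the fixed seed are assumptions made for the sake of a reproducible sketch, and it uses NumPy's `default_rng` generator rather than the legacy `np.random.choice`.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the 'Random' marker comes from the question.
df = pd.DataFrame({
    "category": ["Sports", "Random", "Entertainment",
                 "Random", "Sports", "News"],
})

rng = np.random.default_rng(seed=0)  # seeded so reruns give the same result

# Rows that need a real category, and the categories to draw from.
is_random = df["category"] == "Random"
other_categories = df.loc[~is_random, "category"].unique()

# Draw one replacement per 'Random' row and write it back in place.
df.loc[is_random, "category"] = rng.choice(
    other_categories, size=int(is_random.sum())
)
```

Note that this draws uniformly from the distinct categories. Drawing from `df.loc[~is_random, "category"]` without `.unique()` would instead weight each category by how often it already occurs.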
