How do I access the integers given by nunique in Pandas? - python

I am trying to access the items in each column of the output produced by the following code. It prints two columns: the 'Accurate_Episode_Date' values and the count (the frequency of each date). My goal is to plot the date on the x-axis and the count on the y-axis as a scatter plot, but first I need to be able to access the actual count values.
import pandas as pd
data = pd.read_csv('CovidDataset.csv')
Barrie = data.loc[data['Reporting_PHU_City'] == 'Barrie']
dates_barrie = Barrie[['Accurate_Episode_Date']]
num = data.groupby('Accurate_Episode_Date')['_id'].nunique()
print(num.tail(5))
The code above outputs the following:
2021-01-10T00:00:00 1326
2021-01-11T00:00:00 1875
2021-01-12T00:00:00 1274
2021-01-13T00:00:00 492
2021-01-14T00:00:00 8
Again, I want to plot the dates on the x axis, and the counts on the y axis in scatterplot form. How do I access the count and date values?
EDIT: I just want a way to plot dates like 2021-01-10T00:00:00 and so on on the x axis, and the corresponding count: 1326 on the Y-axis.

Turns out this was mainly a data type issue. Basically all that was needed was accessing the datetime index and typecasting it to string with num.index.astype(str).
You could probably change it "in-place" and use the plot like below.
num.index = num.index.astype(str)
num.plot()
If you only want to access the values of a DataFrame or Series you just need to access them like this: num.values
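To make the two pieces explicit, here is a minimal sketch, assuming a tiny made-up DataFrame in place of CovidDataset.csv (the rows below are invented for illustration). The result of groupby(...).nunique() is a Series, so the dates live in its index and the counts in its values:

```python
import pandas as pd

# Hypothetical stand-in for CovidDataset.csv (values invented for illustration)
data = pd.DataFrame({
    "_id": [1, 2, 3, 4, 5, 6],
    "Accurate_Episode_Date": [
        "2021-01-12T00:00:00", "2021-01-12T00:00:00", "2021-01-12T00:00:00",
        "2021-01-13T00:00:00", "2021-01-13T00:00:00",
        "2021-01-14T00:00:00",
    ],
})

# One row per date; the value is the number of unique ids on that date
num = data.groupby("Accurate_Episode_Date")["_id"].nunique()

dates = pd.to_datetime(num.index)  # x values, parsed into real datetimes
counts = num.to_numpy()            # y values, a plain NumPy integer array
```

With matplotlib available, plt.scatter(dates, counts) should then put the dates on the x-axis and the counts on the y-axis.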
If you want to plot the date column on X, you don't need to access that column separately, just use pandas internals:
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
# some dummy dates + counts
dates = [datetime.now() + timedelta(hours=i) for i in range(1, 6)]
values = np.random.randint(1, 10, 5)
df = pd.DataFrame({
    "Date": dates,
    "Values": values,
})
# if you only have 1 other column you can skip `y`
df.plot(x="Date", y="Values")

You need to convert the date column with pd.to_datetime(df['dates']); then you can plot.
Updated answer:
Here there is no need to convert with pd.to_datetime(df['dates']):
# use bracket access: df.count is a DataFrame method, not the 'count' column
ax = df[['count']].plot()
ax.set_xticks(df['count'].index)
ax.set_xticklabels(df['date'])

Related

Plotting values above a threshold in Python

Having issues with plotting values above a set threshold using a pandas dataframe.
I have a dataframe that has 21453 rows and 20 columns, and one of the columns is just 1 and 0 values. I'm trying to plot this column using the following code:
lst1 = []
for x in range(0, len(df)):
    if(df_smooth['Active'][x] == 1):
        lst1.append(df_smooth['Time'][x])
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst1)
But get the following errors:
x and y must have same first dimension, but have shapes (21453,) and (9,)
Any suggestions on how to fix this?
The error is most likely caused by the line plt.plot(df_smooth['Time'], lst1). lst1 is only a subset of df_smooth['Time'], while df_smooth['Time'] is the full series, so the x and y arguments have different lengths.
One solution is to also build a filtered x list, for example:
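lst_X = []
lst_Y = []
for x in range(0, len(df_smooth)):
    if(df_smooth['Active'][x] == 1):
        lst_X.append(df_smooth['Time'][x])
        lst_Y.append(df_smooth['Time'][x])
# both lists now have the same length, so they can be plotted together
plt.plot(lst_X, lst_Y)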
Another option is to build a sub-dataframe -
sub_df = df_smooth[df_smooth['Active']==1]
plt.plot(sub_df['Time'], sub_df['Time'])
(assuming the correct column as Y column is Time, otherwise just replace it with the correct column)
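As a minimal sketch of the sub-dataframe idea, assuming a small invented df_smooth with the same column names as in the question, a boolean mask selects the matching rows so that x and y always come from the same subset and therefore have the same length:

```python
import pandas as pd

# Invented sample data; only the column names come from the question
df_smooth = pd.DataFrame({
    "Time":   [0.0, 0.1, 0.2, 0.3, 0.4],
    "CH1":    [1.0, 1.5, 0.8, 2.1, 1.9],
    "Active": [0, 1, 0, 1, 1],
})

# Keep only the rows where Active == 1
active = df_smooth[df_smooth["Active"] == 1]

# x and y are taken from the same filtered frame, so their shapes match
x_active = active["Time"]
y_active = active["CH1"]  # assumption: CH1 is the column wanted on the y-axis
```

These can then be drawn on top of the full series, for example with plt.plot(df_smooth['Time'], df_smooth['CH1']) followed by plt.plot(x_active, y_active, 'o').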
It seems you are trying to plot two data series of different lengths with plt.plot(). This causes the error, because plt.plot() expects both series to have the same length.
You will need to ensure that both data series have the same length before trying to plot them. One way to do this is to create a new list that contains the same number of elements as the df_smooth['Time'] data series, and then fill it with the corresponding values from the lst1 data series.
# Create a new list with the same length as the 'Time' data series
lst2 = [0] * len(df_smooth['Time'])
# Loop through the 'lst1' data series and copy the values to the corresponding
# indices in the 'lst2' data series
for x in range(0, len(lst1)):
    lst2[x] = lst1[x]
# Plot the 'Time' and 'lst2' data series using the plt.plot() function
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst2)
I think this should work.
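Padding with zeros draws the inactive points at 0. Another way to keep the full length is Series.where, which replaces the non-matching entries with NaN; matplotlib leaves gaps at NaN values instead of drawing them. A minimal sketch, assuming invented sample values and that CH1 is the column of interest:

```python
import pandas as pd

# Invented sample data; only the column names come from the question
df_smooth = pd.DataFrame({
    "Time":   [0.0, 0.1, 0.2, 0.3, 0.4],
    "CH1":    [1.0, 1.5, 0.8, 2.1, 1.9],
    "Active": [0, 1, 0, 1, 1],
})

# Same length as the original column; NaN wherever Active != 1
ch1_active = df_smooth["CH1"].where(df_smooth["Active"] == 1)
```

Because ch1_active has the same length as df_smooth['Time'], plt.plot(df_smooth['Time'], ch1_active) should no longer raise the shape error.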

Unable to create histogram from slice due to datetime datatype error

I am looking to remove the upper outliers of some columns in a DataFrame (specifically the 'vehicle_age' and 'odometer' columns) in order to then build a histogram.
I have been able to successfully build the initial histograms like so:
crankshaft_ads['odometer'].plot(kind='hist', bins=25, range= (0, 1000000))
And I would like to build histograms without the upper outliers, as a comparison. Here is what I tried so far:
q1_age = crankshaft_ads['vehicle_age'].quantile(0.25)
q1_odometer = crankshaft_ads['odometer'].quantile(0.25)
q3_age = crankshaft_ads['vehicle_age'].quantile(0.75)
q3_odometer = crankshaft_ads['odometer'].quantile(0.75)
iqr_age = q3_age - q1_age
iqr_odometer = q3_odometer - q1_odometer
upper_limit_age = q3_age + (1.5 * iqr_age)
upper_limit_odometer = q3_odometer + (1.5 * iqr_odometer)
crankshaft_ads['upper_limit_age'] = upper_limit_age
crankshaft_ads['upper_limit_odometer'] = upper_limit_odometer
(crankshaft_ads
    .query('vehicle_age < upper_limit_age')
    .plot(kind='hist', bins=10)
)
(crankshaft_ads
    .query('odometer < upper_limit_odometer')
    .plot(kind='hist', bins=25)
)
I would need help with the .query() elements. I get the following error (it happens when running the .plot line it seems):
ValueError: view limit minimum -49500.0 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
There is one column in the DataFrame that has datetime datatype, but what I'm trying to do is build a histogram for the 2 columns mentioned above, with the upper outliers filtered out. Is this the wrong approach?
Thanks for your help.
It seems that you have not selected the columns you want to plot in your plotting functions. The queries you have written select a subset of the whole dataframe, not only the column mentioned in each query. So both plotting functions are attempting to plot a histogram for each column in a single figure, including the datetime column.
Here are three ways you could solve this problem, taking your first plotting function as an example:
# Solution 1: apply query to whole dataframe then select column in plotting function
crankshaft_ads.query('vehicle_age < @upper_limit_age').plot.hist(y='vehicle_age', bins=10)
# Solution 2: first select column then select values to plot in histogram
crankshaft_ads['vehicle_age'][crankshaft_ads['vehicle_age'] < upper_limit_age].plot.hist(bins=10)
# Solution 3: first select all dataframe rows meeting condition then select column in plotting function
crankshaft_ads[crankshaft_ads['vehicle_age'] < upper_limit_age].plot.hist(y='vehicle_age', bins=10)
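The filtering step can be checked on its own before plotting. The following is a minimal sketch, assuming a small invented frame (the date_posted column is a hypothetical stand-in for the datetime column mentioned in the question). Inside query(), the @ prefix refers to a Python variable rather than a column, and selecting the single column afterwards keeps the datetime column out of the histogram:

```python
import pandas as pd

# Invented sample data; date_posted is a hypothetical datetime column
ads = pd.DataFrame({
    "vehicle_age": [1, 2, 3, 4, 5, 6, 7, 8, 40],
    "date_posted": pd.to_datetime(["2018-01-01"] * 9),
})

# Upper outlier limit from the interquartile range
q1, q3 = ads["vehicle_age"].quantile([0.25, 0.75])
upper_limit_age = q3 + 1.5 * (q3 - q1)

# '@' refers to the local variable; then keep only the column to plot
filtered_age = ads.query("vehicle_age < @upper_limit_age")["vehicle_age"]
```

Calling filtered_age.plot(kind='hist', bins=10) then only involves numeric data.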

Resampling Pandas but with given dates

I want to resample my pandas dataframe, which has a datetime index. When I use the resample method, it returns the resampled data indexed by the last date of each period, which doesn't always exist in the original data. For example, my original data runs from 2000-01-03 to 2005-12-29, but when I resample it yearly I get a row for 2005-12-31. This is a problem for me when I use concat on the resampled data.
Y = price.resample("Y").first()
M = price.resample("M").first()
W = price.resample("W").first()
total = pd.concat([price,W,M,Y], axis=1, sort=False)
#example
price = pd.DataFrame([1315.23, 1324.97, 1376.54, 1351.46, 1343.55, 1369.89, 1380.2 ,
1371.18, 1359.99, 1340.93, 1312.15, 1322.74, 1305.6 , 1264.74,
1274.86, 1305.97, 1305.97, 1315.19, 1328.92, 1334.22, 1320.28],
index = ['2000-12-01', '2000-12-04', '2000-12-05', '2000-12-06',
'2000-12-07', '2000-12-08', '2000-12-11', '2000-12-12',
'2000-12-13', '2000-12-14', '2000-12-15', '2000-12-18',
'2000-12-19', '2000-12-20', '2000-12-21', '2000-12-22',
'2000-12-25', '2000-12-26', '2000-12-27', '2000-12-28',
'2000-12-29'])
price.index = pd.to_datetime(price.index)
price.resample("W").first()
#see how 12-03, 12-10, 12-17, 12-24, 12-31 are not dates that are in the original index
Have you considered just dropping undesired rows afterwards?
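The following code drops the rows that resample creates for periods with no data in the original index, because those rows are filled with NaN. An aggregation such as first() has to be applied before dropna(), since the Resampler object itself has no dropna method.
price.resample('W').first().dropna()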
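If the goal is to keep labels that really occur in the original index, one possible alternative (a swapped-in technique, not the approach from the answer above) is to group by calendar period and keep the first row of each group, which preserves the original dates. A minimal sketch, assuming invented prices on the business days of December 2000:

```python
import pandas as pd

# Invented prices on the business days from 2000-12-01 to 2000-12-29
idx = pd.bdate_range("2000-12-01", "2000-12-29")
price = pd.DataFrame({"price": range(len(idx))}, index=idx)

# Group rows by calendar week / month and keep the first row of each group;
# head(1) keeps the original index labels instead of period-end dates
weekly = price.groupby(price.index.to_period("W")).head(1)
monthly = price.groupby(price.index.to_period("M")).head(1)
```

Since every label in weekly and monthly also exists in price, pd.concat([price, weekly, monthly], axis=1) should not introduce new dates.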

pandas plot multiple lines based on a certain column

I have a dataframe like this
timestamp|type|value_1|value_2
t1|A|v1|v2
t2|B|v3|v4
t3|C|v5|v6
t4|A|v7|v8
I would like to plot a graph with 6 lines, one for each combination of type and value column,
for example:
type A - value_1
type A - value_2
type B - value_1
type B - value_2
type C - value_1
type C - value_2
thanks,
It is like doing this:
A = df[df["type"] == "A"]
A.plot(x="timestamp", y=["value_1", "value_2"])
Do this for the three types
and combine those 6 lines on the same graph.
I think you can reshape DataFrame to columns and then plot:
df['g'] = df.groupby('type').cumcount()
df = df.set_index(['timestamp','g', 'type']).unstack().reset_index(level=1, drop=True)
df.columns = df.columns.map('_'.join)
df.plot()
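To see what the reshape produces, here is a minimal sketch that runs the same steps on the four rows from the question, assuming invented numeric values in place of v1 to v8:

```python
import pandas as pd

# The four rows from the question, with invented numbers for v1..v8
df = pd.DataFrame({
    "timestamp": ["t1", "t2", "t3", "t4"],
    "type": ["A", "B", "C", "A"],
    "value_1": [10, 30, 50, 70],
    "value_2": [20, 40, 60, 80],
})

# Counter within each type, so repeated (timestamp, type) pairs stay unique
df["g"] = df.groupby("type").cumcount()

# Move 'type' into the columns, then drop the helper level 'g' from the index
wide = (df.set_index(["timestamp", "g", "type"])
          .unstack()
          .reset_index(level=1, drop=True))

# Flatten ('value_1', 'A') into 'value_1_A'
wide.columns = wide.columns.map("_".join)
```

The result has one column per type and value combination, indexed by timestamp, so wide.plot() draws one line per column. Timestamps where a type has no row hold NaN in that type's columns.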
As far as the plotting goes I recommend you check out:
MatPlotLib: Multiple datasets on the same scatter plot and Multiple data set plotting with matplotlib.pyplot.plot_date , as well as this tutorial.
For the selection of data to plot I recommend the section "selection by label" in the pandas docs. I suppose you could store the values from your corresponding columns / rows in some temporary variables x1 - xn and y1 - yn and then just plot all the pairs, which could look something like:
xs = sheet.loc[<appropriate labels>]
ys = sheet.loc[<appropriate labels>]
for i in range(len(xs)):
    plt.plot(xs[i],ys[i],<further arguments>)
plt.show()
In your case, just accessing the 'values' label might not be sufficient, as only every n-th element of that column seems to belong to any given type. In this question you can see how to get a new list with only the appropriate values inside. Basically something like:
allXs = sheet.loc['v1']
xsTypeA = allXs[1::4]
...
hope that helps.

Distributing a pandas DataFrame feature at random

I am reading in a set of data using pandas and plotting this using matplotlib. One column is a "category", eg "Sports", "Entertainment", but for some rows this is marked "Random", which means I need to distribute this value and add it randomly to one column. Ideally I would like to do this in the dataframe so that all values would be distributed.
My basic graph code is as follows :
df.category.value_counts().plot(kind="barh", alpha=a_bar)
title("Category Distribution")
The behaviour I would like is
If category == "Random"{
Assign this value to another column at random.
}
How can I accomplish this?
possibly:
# take the original value_counts, drop 'Random'
ts1 = df.category.value_counts()
rand_cnt = ts1['Random']
ts1.drop('Random', inplace=True)
# randomly choose from the other categories
ts2 = pd.Series(np.random.choice(ts1.index, rand_cnt)).value_counts()
# align the two series, and add them up
ts2 = ts2.reindex_like(ts1).fillna(0)
(ts1 + ts2).plot(kind='barh')
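The same redistribution of counts can be checked on a small example. This is a minimal sketch, assuming an invented category column and a seeded NumPy generator (the seed is only there to make the run repeatable); it uses Series.add with fill_value=0 as a variation on the reindex_like step:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded only for repeatability

# Invented sample data
df = pd.DataFrame({"category": ["Sports", "Random", "Entertainment",
                                "Random", "Sports", "Random"]})

counts = df["category"].value_counts()
n_random = counts["Random"]      # how many rows need redistributing
counts = counts.drop("Random")   # the real categories only

# Draw one real category per 'Random' row and count the draws
extra = pd.Series(rng.choice(counts.index.to_numpy(), size=n_random)).value_counts()

# Add the draws to the original counts; categories never drawn get +0
combined = counts.add(extra, fill_value=0).astype(int)
```

The total is preserved, so combined.plot(kind='barh') shows every row of the original data exactly once.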
if you want to modify the original data-frame, then
idx = df.category == 'Random'
xs = df.category[~idx].unique() # all other categories
# randomly assign to categories which are 'Random'
df.loc[idx, 'category'] = np.random.choice(xs, idx.sum())
