I have the following dataframe:
  Month Year    Location   Revenue
0    2015-01  Location 1      0.00
1    2015-03  Location 1   1105.50
2    2015-04  Location 1  44034.28
3    2015-05  Location 1  56756.39
4    2015-06  Location 1  51502.22
There are about two years' worth of data and 5 different locations. I want to create a line plot with seaborn that shows 5 different lines (one for each location), with Revenue on the y-axis and Month Year on the x-axis.
sns.lineplot(x="Month Year",
             y="Revenue",
             hue="Location",
             data=rev_by_month,
             palette="tab10")
When I run the code above, however, I receive the following error:
view limit minimum -0.05500000000000001 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
For the record, the Month Year column was created using the pandas .to_datetime() function.
I have a pandas dataframe named 'epu' with three columns, 'year', 'month' and 'score'. It looks like this:
     year  month       score
0    1970      1  125.224739
1    1970      1   99.020813
2    1970      1  112.190506
..    ...    ...         ...
447  2022      4  154.661957
448  2022      5  168.034912
449  2022      6  143.154816
As can be seen, the years range from 1970 to 2022. I want to plot 'score' on the y-axis and 'year' on the x-axis.
epu['score'].plot()
would give me the following graph. The score plot seems to be alright, but I don't get year labels on the x-axis.
But, plotting with
epu.plot(x='year',y='score')
would give me a strange-looking graph like below.
So, my question is: how can I generate the graph in the first picture, but with the year labels on the x-axis as in the second picture? Do I need some advanced matplotlib features? I tried to search for answers here, but I might be missing something and couldn't find one for my problem.
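One way to get real year labels is to combine the 'year' and 'month' columns into an actual date and plot against that, so matplotlib can place date ticks itself. A sketch with a few made-up rows:

```python
import pandas as pd

# A small frame in the question's shape (scores are made up)
epu = pd.DataFrame({
    "year":  [1970, 1970, 1971, 2022],
    "month": [1, 6, 3, 6],
    "score": [125.224739, 99.020813, 112.190506, 143.154816],
})

# Combine year and month into a real date so the x-axis gets proper labels;
# pd.to_datetime accepts a frame with 'year', 'month' and 'day' columns
epu["date"] = pd.to_datetime(epu[["year", "month"]].assign(day=1))

ax = epu.plot(x="date", y="score")
```

Plotting against the combined date keeps the rows in calendar order and avoids the sawtooth effect of plotting against a repeating 'year' column.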
I have a dataset that is built like this:
hour  weekday
12    2
14    1
12    2
and so on.
I want to display in a heatmap, per weekday, when the dataframe had the most action (the sum of all events that happened on that weekday during that hour).
I tried to work with groupby:
hm = df.groupby(['hour']).sum()
which shows me all events per hour, but does not split the events across the weekdays.
How can I reshape the data so that I have the weekdays on the x-axis and, for each weekday, the per-hour sums on the y-axis?
Thanks for your help!
The output you expect is unclear, but I imagine you could be looking for pandas.crosstab:
import pandas as pd
import seaborn as sns

# count occurrences of each (hour, weekday) pair
hm = pd.crosstab(df['hour'], df['weekday'])

# plot the counts as a heatmap
sns.heatmap(hm, cmap='Greys')
output:
weekday  1  2
hour
12       0  2
14       1  0
So I want to be able to find out how many clusters there are in a time-series frequency table.
The input would be a date index with the frequency, i.e. the kind of output you would get when using .resample('D').sum().
Input Example:
Index       Count
01-01-2022      3
02-01-2022      4
03-01-2022      2
04-01-2022      2
05-01-2022      2
...           ...
27-01-2022      5
28-01-2022      4
29-01-2022      2
30-01-2022      3
31-01-2022      2
Assume the dates not shown (... in the table) all have a frequency of 0.
So essentially there are two clusters in the month of January 2022: the first at the beginning of the month and the second at the end.
Cluster 1 is the date range 01-01-2022 to 05-01-2022.
Cluster 2 is the date range 27-01-2022 to 31-01-2022.
Do you know which clustering algorithm would let me get the number of clusters from this type of data?
Or is a clustering algorithm even necessary?
Thank you for your help
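Since a "cluster" here is just a run of consecutive days with non-zero counts, a clustering algorithm is arguably unnecessary: a run-labelling pass over the daily series is enough. A sketch, assuming a daily-resampled series like the table above:

```python
import pandas as pd

# Daily counts for January 2022 in the question's shape; unlisted days are 0
idx = pd.date_range("2022-01-01", "2022-01-31", freq="D")
counts = pd.Series(0, index=idx)
counts["2022-01-01":"2022-01-05"] = [3, 4, 2, 2, 2]
counts["2022-01-27":"2022-01-31"] = [5, 4, 2, 3, 2]

# A day starts a new cluster when it is non-zero and the previous day was zero;
# the cumulative sum of those start markers labels each run with its cluster id
active = counts > 0
cluster_id = (active & ~active.shift(fill_value=False)).cumsum()[active]
n_clusters = cluster_id.nunique()
```

For this data `n_clusters` comes out as 2, matching the two ranges described above. If clusters may be separated by small gaps of zero days, a threshold on the gap length (or a 1-D method such as DBSCAN on the day numbers) would be the next step.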
I need to plot the frequency of items by date. My csv contains three columns: one for Date, one for Name & Surname, and another for Birthday.
I am interested in plotting the frequency of people recorded on each date. My expected output would be:
          Date  Count
0   01/01/2018      9
1   01/02/2018     12
2   01/03/2018      6
3   01/04/2018      4
4   01/05/2018      5
..         ...    ...
..  02/27/2020    122
..  02/28/2020     84
The table above was obtained as follows:
by_date = df.groupby(df['Date']).size().reset_index(name='Count')
Date is a column in my csv file, but Count is not, which is why I am having difficulty drawing a line plot.
How can I plot the frequency as a list of numbers/column?
Although not absolutely required, you should convert the Date column to Timestamps to make later analysis easier:
df['Date'] = pd.to_datetime(df['Date'])
Now, to your question. To count how many births there are per day, you can use value_counts:
births = df['Date'].value_counts()
But you don't even have to do that for plotting a histogram! Use hist:
import matplotlib.dates as mdates

# tick locators: major ticks on years, minor ticks on months
year = mdates.YearLocator()
month = mdates.MonthLocator()
formatter = mdates.ConciseDateFormatter(year)

ax = df['Date'].hist()
ax.set_title('# of births')
ax.xaxis.set_major_locator(year)
ax.xaxis.set_minor_locator(month)
ax.xaxis.set_major_formatter(formatter)
Result (from random data):
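If the goal really is a line plot rather than a histogram, the by_date frame from the question can be plotted directly once Count exists as a column. A sketch with a few made-up rows:

```python
import pandas as pd

# Hypothetical frame in the csv's shape (only Date matters here)
df = pd.DataFrame({
    "Date": ["01/01/2018", "01/01/2018", "01/02/2018", "01/03/2018"],
    "Name": ["A", "B", "C", "D"],
})
df["Date"] = pd.to_datetime(df["Date"])

# One row per date with its frequency, exactly as in the question
by_date = df.groupby("Date").size().reset_index(name="Count")

# A line plot of Count against Date
ax = by_date.plot(x="Date", y="Count")
```

groupby(...).size() already sorts by date, so the line is drawn in chronological order.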
I have a series of transactions similar to this table:
ID  Customer       Date     Amount
 1         A  6/12/2018  33,223.00
 2         A  9/20/2018     635.00
 3         B   8/3/2018   8,643.00
 4         B  8/30/2018   1,231.00
 5         C  5/29/2018   7,522.00
However, I need to get the mean amount over the last six months (as of today).
I was using
df.groupby('Customer').resample('W')['Amount'].sum()
And get something like this:
CustomerCode  PayDate
A             2018-05-21        268
              2018-05-28       0.00
              2018-06-11       0.00
              2018-06-18    472,657
              2018-06-25       0.00
However, with this solution I only get the range of dates where the customers actually had an amount. I need to extend the weeks for each customer so I can cover the whole six-month range (in weeks). In this example, for customer A I would need everything from the week of 2018-04-05 (exactly six months before today) up to the current week, filled with 0 of course, since there was no amount.
Here is the solution I found to my question. First, I create the dates I want (the last six months, at weekly frequency):
import datetime
import pandas as pd

dates = pd.date_range(datetime.date.today() - datetime.timedelta(days=6 * 365 / 12),
                      pd.Timestamp.today(),  # pd.datetime was removed in pandas 2.0
                      freq='W')
Then I create a MultiIndex from the product of the customers and the dates:
multi_index = pd.MultiIndex.from_product([pd.Index(df['Customer'].unique()),
                                          dates],
                                         names=('Customer', 'Date'))
Then I reindex the df using the newly created MultiIndex and, lastly, fill the missing values with zeroes. Note that both methods return new objects, so the result has to be assigned back:
df = df.reindex(multi_index)
df = df.fillna(0)
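Putting the three steps together, a runnable sketch of the whole pipeline (sample rows in the question's shape; the weekly sums need the dates in the index before resampling):

```python
import pandas as pd

# Transactions in the question's shape (values are illustrative)
df = pd.DataFrame({
    "Customer": ["A", "A", "B"],
    "Date": pd.to_datetime(["2018-06-12", "2018-09-20", "2018-08-03"]),
    "Amount": [33223.0, 635.0, 8643.0],
})

# Weekly sums per customer (only weeks that actually have transactions)
weekly = df.set_index("Date").groupby("Customer").resample("W")["Amount"].sum()

# Every week of the last six months, for every customer
today = pd.Timestamp.today().normalize()
dates = pd.date_range(today - pd.DateOffset(months=6), today, freq="W")
full_index = pd.MultiIndex.from_product(
    [df["Customer"].unique(), dates], names=["Customer", "Date"])

# Reindex onto the full grid; weeks with no activity become 0
weekly_full = weekly.reindex(full_index, fill_value=0)
```

Passing fill_value=0 to reindex does the zero-filling in the same step, so a separate fillna call is not needed.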
Resample is super flexible. To get a six-month sum instead of the weekly sum you currently have, all you need is:
df.groupby('Customer').resample('6M')['Amount'].sum()
That groups by month end; for month start, use '6MS'.
More documentation on available frequencies can be found here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
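For reference, a runnable sketch of that call using the sample rows from the question (resample needs a DatetimeIndex, so Date is moved into the index first):

```python
import pandas as pd

# Transactions in the question's shape
df = pd.DataFrame({
    "Customer": ["A", "A", "B", "B", "C"],
    "Date": pd.to_datetime(["2018-06-12", "2018-09-20",
                            "2018-08-03", "2018-08-30", "2018-05-29"]),
    "Amount": [33223.0, 635.0, 8643.0, 1231.0, 7522.0],
})

# Six-month sums per customer, labelled by month end
six_month = (df.set_index("Date")
               .groupby("Customer")
               .resample("6M")["Amount"]
               .sum())
```

Swapping .sum() for .mean() gives the mean amount per six-month window that the original question asked for.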