Count frequency and plot - Python

I need to plot the frequency of items by date. My csv contains three columns: one for Date, one for Name & Surname, and another for Birthday.
I am interested in plotting the frequency of people recorded on each date. My expected output would be:
          Date  Count
0   01/01/2018      9
1   01/02/2018     12
2   01/03/2018      6
3   01/04/2018      4
4   01/05/2018      5
..         ...    ...
..  02/27/2020    122
..  02/28/2020     84
The table above was obtained as follows:
by_date = df.groupby(df['Date']).size().reset_index(name='Count')
Date is a column in my csv file, but Count is not, which is why I am having difficulty drawing a line plot.
How can I plot the frequency as a list of numbers/column?

Although not strictly required, you should convert the Date column to Timestamp values to make later analysis easier:
df['Date'] = pd.to_datetime(df['Date'])
Now, to your question. To count how many births there are per day, you can use value_counts:
births = df['Date'].value_counts()
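For the line plot you asked about, a minimal sketch (hedged; it assumes the births counts computed above, or your own by_date frame) is:
births.sort_index().plot()  # line plot of births per date
by_date.plot(x='Date', y='Count')  # equivalent, using your grouped frame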
But you don't even have to do that for plotting a histogram! Use hist:
import matplotlib.dates as mdates
# locators and a concise formatter for the date axis
year = mdates.YearLocator()
month = mdates.MonthLocator()
formatter = mdates.ConciseDateFormatter(year)
# plot the histogram of dates and tidy up the x-axis ticks
ax = df['Date'].hist()
ax.set_title('# of births')
ax.xaxis.set_major_locator(year)
ax.xaxis.set_minor_locator(month)
ax.xaxis.set_major_formatter(formatter)
Result (from random data):


Python / Pandas - How can I group a DF by two columns

I have a dataset that is built like this:
hour  weekday
12    2
14    1
12    2
and so on.
I want to display in a heatmap, per weekday, when the dataframe had the most action (i.e. the sum of all events that happened on that weekday during that hour).
I tried to work with groupby:
hm = df.groupby(['hour']).sum()
which shows me all events per hour, but does not split the events across the weekdays.
How can I reshape the data so that the weekdays are on the x-axis and, for each hour on the y-axis, I get the sum of events on that weekday?
Thanks for your help!
The output you expect is unclear, but I imagine you could be looking for pandas.crosstab:
# computing crosstab
hm = pd.crosstab(df['hour'], df['weekday'])
# plotting heatmap
import seaborn as sns
sns.heatmap(hm, cmap='Greys')
output:
weekday  1  2
hour
12       0  2
14       1  0
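If you prefer to stick with groupby, an equivalent sketch (hedged, producing the same table as the crosstab above) is to count the (hour, weekday) pairs and unstack the weekday level into columns:
hm = df.groupby(['hour', 'weekday']).size().unstack(fill_value=0)  # counts per hour/weekday
sns.heatmap(hm, cmap='Greys')  # same heatmap as above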

Count occurrences in column based on another column (date)

I am trying to count the number of "Type" occurrences by what month they are in.
Daily data is given, so to group by month I tried using .resample(), but the problem is that it combines all the strings into one long string, and then I can't count the number of occurrences with str.count() because it returns the wrong value (it finds too many matches since it isn't looking for the exact pattern).
I think it has to be done in more than one step...
I have tried SO many things... I even heard there is a pivot table?
Sample data:
Type   Date
Cat    2020-01-01
Cat    2020-01-01
Bird   2020-01-01
Dog    2020-01-01
Cat    2020-02-01
Cat    2020-03-01
Bird   2020-03-01
Cat    2020-05-02
... For all the months over a few years...
Converted to the following format (the column headers could be in numeric form as well):
      January 2020  February 2020
Cat              4              1
Bird             1              0
Dog              1              0
As far as I know, Pandas does not have a single standard function or typical approach for this, so below is a code snippet that produces your desired result.
If you do not mind using extra packages, there are packages you can use for quicker/easier binary encoding (e.g. category_encoders).
import pandas as pd
# your data in dictionary format
d = {
"Type":["Cat","Cat","Bird","Dog","Cat","Cat","Bird","Cat"],
"Date":["2020-01-01","2020-01-01","2020-01-01","2020-01-01","2020-02-01","2020-03-01","2020-03-01","2020-05-02"]
}
# create a dataframe with the dates as index
df = pd.DataFrame(data=d['Type'], index=pd.to_datetime(d['Date']))
animals = list(df[0].unique())  # a list containing all unique animals
ndf = pd.DataFrame(index=animals)  # empty new dataframe with all animals as index
for animal in animals:
    ndf.loc[animal, df.index.month.unique()] = (  # at row = animal, insert all unique months
        (df == animal).groupby(df.index.month)  # group by month, using .month (returns 1 for Jan)
        .sum()  # sum works because we used a boolean comparison
        .transpose()  # transpose due to desired output format
        .values  # array of values to insert
    )
# convert column names back to date time and save as string in desired format
ndf.columns = pd.to_datetime(ndf.columns, format='%m').strftime('%B 2020')
Result
      January 2020  February 2020  March 2020  May 2020
Cat              2              1           1         1
Bird             1              0           1         0
Dog              1              0           0         0
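For comparison, a shorter sketch of the same idea (hedged; it reuses the df built above, whose animal names sit in column 0) is a crosstab of animal against month:
counts = pd.crosstab(df[0], df.index.to_period('M'))  # rows: animals, columns: monthly periods
counts.columns = counts.columns.strftime('%B %Y')  # e.g. 'January 2020'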

Calculating Monthly Anomalies in Pandas

Hello StackOverflow Community,
I have been interested in calculating anomalies for data in pandas 1.2.0 using Python 3.9.1 and Numpy 1.19.5, but have been struggling to figure out the most "Pythonic" and "pandas" way to complete this task (or any way, for that matter). Below I have created some dummy data and put it into a pandas DataFrame. In addition, I have tried to clearly outline my methodology for calculating monthly anomalies for the dummy data.
What I am trying to do is take "n" years of monthly values (in this example, 2 years of monthly data = 25 months) and calculate monthly averages across all years (for example, group all the January values together and calculate the mean). I have been able to do this using pandas.
Next, I would like to take each monthly average and subtract it from all elements in my DataFrame that fall into that specific month (for example subtract each January value from the overall January mean value). In the code below you will see some lines of code that attempt to do this subtraction, but to no avail.
If anyone has any thought or tips on what may be a good way to approach this, I really appreciate your insight. If you require further clarification, let me know. Thanks for your time and thoughts.
-Marian
#Import packages
import numpy as np
import pandas as pd
#-------------------------------------------------------------
#Create a pandas dataframe with some data that will represent:
#Column of dates for two years, at monthly resolution
#Column of corresponding values for each date.
#Create two years worth of monthly dates
dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
#Create some random data that will act as our data that we want to compute the anomalies of
values = np.random.randint(0,100,size=25)
#Put our dates and values into a dataframe to demonstrate how we have tried to calculate our anomalies
df = pd.DataFrame({'Dates': dates, 'Values': values})
#-------------------------------------------------------------
#Anomalies will be computed by finding the mean value of each month over all years
#and then subtracting that monthly mean from each element that falls in that particular month
#Group our df according to the month of each entry and calculate monthly mean for each month
monthly_means = df.groupby(df['Dates'].dt.month).mean()
#-------------------------------------------------------------
#Now, how do we go about subtracting these grouped monthly means from each element that falls
#in the corresponding month.
#For example, if the monthly mean over 2 years for January is 20 and the value is 21 in January 2018, the anomaly would be +1 for January 2018
#Example lines of code I have tried, but have not worked
#ValueError:Unable to coerce to Series, length must be 1: given 12
#anomalies = socal_csv.groupby(socal_csv['Date'].dt.month) - monthly_means
#TypeError: unhashable type: "list"
#anomalies = socal_csv.groupby(socal_csv['Date'].dt.month).transform([np.subtract])
You can use pd.merge like this:
import numpy as np
import pandas as pd
dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
values = np.random.randint(0, 100, size=25)
df = pd.DataFrame({'Dates': dates, 'Values': values})
monthly_means = df.groupby(df['Dates'].dt.month)[['Values']].mean()
df['month'] = df['Dates'].dt.strftime("%m").astype(int)
df = df.merge(monthly_means.reset_index().rename(columns={'Dates': 'month', 'Values': 'Mean'}), on='month', how='left')
df['Diff'] = df['Mean'] - df['Values']
output:
df['Diff']
Out[19]:
0 33.333333
1 19.500000
2 -29.500000
3 -22.500000
4 -24.000000
5 -3.000000
6 10.000000
7 2.500000
8 14.500000
9 -17.500000
10 44.000000
11 31.000000
12 -11.666667
13 -19.500000
14 29.500000
15 22.500000
16 24.000000
17 3.000000
18 -10.000000
19 -2.500000
20 -14.500000
21 17.500000
22 -44.000000
23 -31.000000
24 -21.666667
You can use abs() if you want the absolute difference.
A one-line solution is:
df = pd.DataFrame({'Values': values}, index=dates)
df.groupby(df.index.month).transform(lambda x: x-x.mean())
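If you would rather keep Dates as an ordinary column and store the anomaly next to the original values, a hedged variant (value minus monthly mean, matching the +1 example above) is:
df['Anomaly'] = df['Values'] - df.groupby(df['Dates'].dt.month)['Values'].transform('mean')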

Dataframe of the Top X Values of the Top Y Days - Pandas Groupby

I have data about three variables where I want to find the largest X values of one variable on a per day basis. Previously I wrote some code to find the hour where the max value of the day occurred, but now I want to add some options to find more max hours per day.
I've been able to find the Top X values per day for all the days, but I've gotten stuck on narrowing it down to the Top X Values from the Top X Days. I've included pictures detailing what the end result would hopefully look like.
Data
Identified Top 2 Hours
Code
df = pd.DataFrame(
{'ID':['ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1','ID_1'],
'Year':[2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018],
'Month':[6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],
'Day':[12,12,12,12,13,13,13,13,14,14,14,14,15,15,15,15,16,16,16,16,17,17,17,17],
'Hour':[19,20,21,22,11,12,13,19,19,20,21,22,18,19,20,21,19,20,21,23,19,20,21,22],
'var_1': [0.83,0.97,0.69,0.73,0.66,0.68,0.78,0.82,1.05,1.05,1.08,0.88,0.96,0.81,0.71,0.88,1.08,1.02,0.88,0.79,0.91,0.91,0.80,0.96],
'var_2': [47.90,42.85,67.37,57.18,66.13,59.96,52.63,54.75,32.54,36.58,36.99,37.23,46.94,52.80,68.79,50.84,37.79,43.54,48.04,38.01,42.22,47.13,50.96,44.19],
'var_3': [99.02,98.10,98.99,99.12,98.78,98.90,99.09,99.20,99.22,99.11,99.18,99.24,99.00,98.90,98.87,99.07,99.06,98.86,98.92,99.32,98.93,98.97,98.99,99.21],})
# Get the top 2 var2 values each day
top_two_var2_each_day = df.groupby(['ID', 'Year', 'Month', 'Day'])['var_2'].nlargest(2)
top_two_var2_each_day = top_two_var2_each_day.reset_index()
# set the level_4 column (the original row position) as the index
top_two_var2_each_day = top_two_var2_each_day.set_index('level_4')
# use the index from top_two_var2_each_day to get the rows from df, i.e. the values of the other variables when the top 2 values occurred
top_2_all_vars = df[df.index.isin(top_two_var2_each_day.index)]
End Goal Result
I figure the best way would be to average the two hours to identify which days have the largest average, then go back into the top_2_all_vars dataframe and grab the rows for those days. I am unsure how to proceed.
mean_day = top_2_all_vars.groupby(['ID', 'Year', 'Month', 'Day'],as_index=False)['var_2'].mean()
top_2_day = mean_day.nlargest(2, 'var_2')
Final Dataframe
This is the result I am trying to find. A dataframe consisting of the Top 2 values of var_2 from each of the Top 2 days.
Here is the code I previously used to find the single largest value of each day, but I don't know how I would make it work for more than a single max per day.
# For each ID and Day, Find the Hour where the Max Amount of var_2 occurred and save the index location
df_idx = df.groupby(['ID', 'Year', 'Month', 'Day',])['var_2'].transform(max) == df['var_2']
# Now the hour has been found, store the rows in a new dataframe based on the saved index location
top_var2_hour_of_each_day = df[df_idx]
Using groupby may not be the best way to go about it, but I am open to anything.
This is one approach:
If your data spans multiple months, it's a lot harder to deal with when the month and day are in different columns. So first I made a new column called 'Date' which just combines the month and the day.
df['Date'] = df['Month'].astype('str')+"-"+df['Day'].astype('str')
Next we need the top two values of var_2 per day, and then average them. So we can create a really simple function to find exactly that.
def topTwoMean(series):
    top = series.sort_values(ascending=False).iloc[0]
    second = series.sort_values(ascending=False).iloc[1]
    return (top + second) / 2
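As an aside, series.nlargest(2).mean() computes the same quantity in one call, so the body could equivalently be written as (a hedged alternative, not the original answer's code):
def topTwoMean(series):
    # mean of the two largest values
    return series.nlargest(2).mean()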
We then use our function, sort by the average of var_2 to get the highest 2 days, then save the dates to a list.
maxDates = df.groupby('Date').agg({'var_2': [topTwoMean]})\
.sort_values(by = ('var_2', 'topTwoMean'), ascending = False)\
.reset_index()['Date']\
.head(2)\
.to_list()
Finally we filter by the dates chosen above, then find the highest two of var_2 on those days.
df[df['Date'].isin(maxDates)]\
.groupby('Date')\
.apply(lambda x: x.sort_values('var_2', ascending = False).head(2))\
.reset_index(drop = True)
ID Year Month Day Hour var_1 var_2 var_3 Date
0 ID_1 2018 6 12 21 0.69 67.37 98.99 6-12
1 ID_1 2018 6 12 22 0.73 57.18 99.12 6-12
2 ID_1 2018 6 13 11 0.66 66.13 98.78 6-13
3 ID_1 2018 6 13 12 0.68 59.96 98.90 6-13
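An alternative sketch that skips the helper function and uses nlargest for both steps (hedged; it assumes the same 'Date' column created above):
daily_top2 = df.groupby('Date', group_keys=False).apply(lambda g: g.nlargest(2, 'var_2'))  # top 2 hours per day
best_days = daily_top2.groupby('Date')['var_2'].mean().nlargest(2).index  # top 2 days by that average
result = daily_top2[daily_top2['Date'].isin(best_days)]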

LinePlot with Seaborn: View Limit Minimum Error

I have the following dataframe:
  Month Year    Location   Revenue
0    2015-01  Location 1      0.00
1    2015-03  Location 1   1105.50
2    2015-04  Location 1  44034.28
3    2015-05  Location 1  56756.39
4    2015-06  Location 1  51502.22
There are about two years worth of data. There are 5 different locations. I want to create a lineplot with seaborn that shows 5 different lines (one for each location) with Revenue on the y-axis, Month Year on the x-axis.
import seaborn as sns
sns.lineplot(x="Month Year",
             y="Revenue",
             hue="Location",
             data=rev_by_month,
             palette="tab10")
When I run the code above, however, I receive the following error:
view limit minimum -0.05500000000000001 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
For the record, the Month Year column was created using the pandas .to_datetime() function.
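A common cause of this error (a hedged guess, not a confirmed diagnosis of this exact dataframe) is that the x column is no longer of datetime64 dtype by the time it reaches seaborn, for example after a later conversion to strings or Periods. A quick sanity check, using the column names from the question:
print(rev_by_month['Month Year'].dtype)  # should report datetime64[ns]
rev_by_month['Month Year'] = pd.to_datetime(rev_by_month['Month Year'])  # re-convert if it is not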
