My dataset has three columns, namely date, sold, and item.
I would like to investigate where changes in trend (such as peaks or drops) in market sales occur.
Date Sold Item
01/02/2018 1 socks
01/03/2018 4 t-shirts
01/04/2018 3 pants
01/04/2018 2 shirts
01/05/2018 1 socks
...
12/12/2018 21 watches
12/12/2018 35 toys
...
12/22/2018 43 flowers
12/22/2018 25 toys
12/22/2018 32 shirts
12/22/2018 70 pijamas
...
12/31/2018 12 toys
12/31/2018 2 skirts
To do this, I have been considering two things:
number of total sales per date (e.g. 1 on Jan 2, 2018; 4 on Jan 3, 2018; 5 on Jan 4, 2018; and so on);
number of sales per item through time (i.e. looking at each item trend through time separately)
Both points should be achievable with groupby. However, my difficulty is in plotting all the items in the same figure (preferably as a line plot).
What I have done is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
df = pd.read_csv("./MarketSales.csv")
sales_plot = df["Item"].groupby("Sold").sum().sort_values("Sold",ascending=False).plot()
sales_plot.set_xlabel("Date")
sales_plot.set_ylabel("Frequency")
Unfortunately, the code above does not generate the expected results.
What I find most challenging in Python is combining groupby with plot. I hope you can help me understand the right approach.
I'm not sure why you group by 'Sold', since you seem to be interested in the number sold per date. Here are two lines of code that address your two points:
df.groupby(['Date'])['Sold'].sum().plot()
#and
df.groupby(['Date','Item'])['Sold'].sum().unstack().plot()
Also, you may want to convert your dates first with df['Date'] = pd.to_datetime(df['Date']) for a better visualization over time.
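Both one-liners can be checked quickly on a small hypothetical frame shaped like the question's data (the values below are made up for illustration):

```python
import pandas as pd

# Hypothetical frame mirroring the question's columns
df = pd.DataFrame({
    "Date": ["01/02/2018", "01/03/2018", "01/04/2018", "01/04/2018"],
    "Sold": [1, 4, 3, 2],
    "Item": ["socks", "t-shirts", "pants", "shirts"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Point 1: total sales per date (1, 4, then 3 + 2 = 5)
total_per_date = df.groupby("Date")["Sold"].sum()

# Point 2: one column per item; per_item.plot() would draw one line per item
per_item = df.groupby(["Date", "Item"])["Sold"].sum().unstack()
print(total_per_date)
print(per_item)
```

The unstack step is what makes the multi-item line plot possible: each item becomes its own column, and pandas draws one line per column.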
I am extracting the number of orders for each category by date, counting how often each product type appears, using Python. If you have any other method, please let me know.
These are the libraries I used:
#Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
Here is the count of each product I have:
print("The total number for each product type:")
data.DrugType.value_counts()
The output is:
productA 8124
productB 3047
Transforming the variable from a datetime value to a date value:
data['date'] = pd.to_datetime(data['Assessment_Date']).dt.date
data['date']=pd.to_datetime(data['date'])
Here is the code I wrote to find the number of product orders by order date:
print( data.groupby('DrugType')['date'].sum() )
This is the error I got:
KEYERROR: 'date' is not a column in data
I want the output to look like the one below:
productA 1-11-2022 8124
productB 1-11-2022 3047
If you have another way to write code that counts the number of orders for each category, please let me know. Thank you, and please do not close my question without a helpful answer.
Let's start with some reproducible example data. It has statistics similar to yours.
import datetime as dt
from random import randrange, seed
import pandas as pd
start = dt.datetime(2022, 1, 1)
seed(0)
data = pd.DataFrame([
    dict(
        DrugType="product" + ("A" if randrange(100) < 73 else "B"),
        date=start + dt.timedelta(days=randrange(365)),
        qty=1,
    )
    for _ in range(11_171)
])
Now, it's not entirely clear what you're looking for.
Here are two possibilities.
>>> data.groupby(["date", "DrugType"]).count()
qty
date DrugType
2022-01-01 productA 20
productB 15
2022-01-02 productA 26
productB 10
... ...
2022-12-30 productA 24
productB 8
2022-12-31 productA 24
productB 9
[730 rows x 1 columns]
That shows that every day last year we sold both kinds of drugs, typically more of A than B.
Alternatively, perhaps you were looking for something
simpler, focused on just a single Series.
>>> data.groupby(["date"]).qty.count()
date
2022-01-01 35
2022-01-02 36
2022-01-03 27
..
2022-12-29 31
2022-12-30 32
2022-12-31 33
Name: qty, Length: 365, dtype: int64
Which shows that typically we sold
a little more than 30 units each day,
and the daily totals match the subtotals
that were broken out above.
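If the goal is the tabular layout shown in the question (product, date, count), resetting the index after the count gets close. A minimal sketch on hypothetical data with the question's column names:

```python
import pandas as pd

# Hypothetical frame with the question's columns
data = pd.DataFrame({
    "DrugType": ["productA", "productA", "productB", "productA"],
    "date": pd.to_datetime(["2022-11-01", "2022-11-01",
                            "2022-11-01", "2022-11-02"]),
})

# Count rows per product per day, then flatten to the requested layout
counts = (
    data.groupby(["DrugType", "date"])
        .size()
        .reset_index(name="count")
)
print(counts)
```

Note that summing a date column (as in the original attempt) is not meaningful; counting rows per (DrugType, date) group is what produces the desired table.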
I have the following dataset:
Date       ID  Fruit
2021-2-2   1   Apple
2021-2-2   1   Pear
2021-2-2   1   Apple
2021-2-2   2   Pear
2021-2-2   2   Pear
2021-2-2   2   Apple
2021-3-2   3   Apple
2021-3-2   3   Apple
I have removed duplicate "Fruit" rows based on ID (there can only be one apple per ID number, but multiple apples per month). Now I would like to generate multiple scatter/line plots (one per "Fruit" type), with the x-axis as month (i.e. Jan. 2021, Feb. 2021, Mar. 2021, etc.) and the y-axis as the frequency or count of that "Fruit" in that month.
If I could generate new columns in a new sheet in Excel that I could then plot as x and y that would be great too. Something like this for Apples specifically:
Month      Number of Apples
Jan 2021   0
Feb 2021   2
Mar 2021   1
I've tried the following, which lets me remove duplicates, but I can't figure out how to count the number of Apples in the Fruit column that occur within a given timeframe (a month, for now) and set that as the y-axis.
import numpy as np
import pandas as pd
import re
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_excel('FruitExample.xlsx',
usecols=("A:E"), sheet_name=('Data'))
df_Example = df.drop_duplicates(subset=["ID Number", "Fruit"], keep="first")
df_Example.plot(x="Date", y=count("Fruit"), style="o")
plt.show()
I've tried to use groupby and categorical but can't seem to count this up properly and plot it. Here is an example of a plot that would be great.
Make sure the dates are in datetime format
df['Date']=pd.to_datetime(df['Date'])
Then create a column for month-year,
df['Month-Year']=df['Date'].dt.to_period('M') #M for month
new_df=pd.DataFrame(df.groupby(['Month-Year','Fruit'])['ID'].count())
new_df.reset_index(inplace=True)
Make sure to change back to datetime as seaborn can't handle 'period' type
new_df['Month-Year']=new_df['Month-Year'].apply(lambda x: x.to_timestamp())
Then plot,
import seaborn as sns
sns.lineplot(x='Month-Year',y='ID',data=new_df,hue='Fruit')
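Putting the steps above together, here is a self-contained sketch using the question's sample data (the deduplication step from the question is included for completeness):

```python
import pandas as pd

# The question's sample data
df = pd.DataFrame({
    "Date": ["2021-2-2"] * 6 + ["2021-3-2"] * 2,
    "ID": [1, 1, 1, 2, 2, 2, 3, 3],
    "Fruit": ["Apple", "Pear", "Apple", "Pear",
              "Pear", "Apple", "Apple", "Apple"],
})
df = df.drop_duplicates(subset=["ID", "Fruit"])  # one fruit per ID

# Datetime conversion and month bucketing
df["Date"] = pd.to_datetime(df["Date"])
df["Month-Year"] = df["Date"].dt.to_period("M")

# Count fruits per month, then convert the period back to a timestamp
new_df = df.groupby(["Month-Year", "Fruit"])["ID"].count().reset_index()
new_df["Month-Year"] = new_df["Month-Year"].dt.to_timestamp()
print(new_df)
```

The resulting frame has one row per (month, fruit) pair, which is exactly the shape sns.lineplot expects with hue='Fruit'.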
I would like to plot a stacked bar plot from a csv file in python. I have three columns of data
year word frequency
2018 xyz 12
2017 gfh 14
2018 sdd 10
2015 fdh 1
2014 sss 3
2014 gfh 12
2013 gfh 2
2012 gfh 4
2011 wer 5
2010 krj 4
2009 krj 4
2019 bfg 4
... 300+ rows of data.
I need to go through all the data and plot a stacked bar plot categorized by year: the x-axis is word, the y-axis is frequency, and the legend colors indicate the year. I want to see how each word evolved year by year. Some technology words are used repeatedly every year, so the stacked bars should add the values on top of each other. For example, the word gfh first plots 14 for the year 2017; then, for 2014, I want gfh to plot a value of 12 (in a different color) on top of the 2017 segment. How do I do this?
So far I have loaded the csv file in my code, but I don't understand how to go over all the rows and stack the words appropriately (as some words repeat across all the years). Any help is highly appreciated. The years are arranged in random order in the csv, but I sorted them year-wise to make it easier.
I am just learning Python and trying to understand this plotting routine, since I have 40 years of data and ~20 words. I thought a stacked bar plot is the best way to represent them, but any other visualisation method is also welcome.
This can be done using pandas:
import pandas as pd
df = pd.read_csv("file.csv")
# Aggregate data
df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})
# Create list to sort by
sorter = (
    df.groupby(["word"], as_index=False)
    .agg({"frequency": "sum"})
    .sort_values("frequency")["word"]
    .values
)
# Pivot, reindex, and plot
df = df.pivot(index="word", columns="year", values="frequency")
df = df.reindex(sorter)
df.plot.bar(stacked=True)
This outputs a stacked bar chart with one bar per word and one colored segment per year.
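The pivot step is the heart of this: each word becomes a row (one bar) and each year a column (one stacked segment). A minimal check on a few rows from the question's data:

```python
import pandas as pd

# A small subset of the question's data
df = pd.DataFrame({
    "year": [2018, 2017, 2014, 2013],
    "word": ["xyz", "gfh", "gfh", "gfh"],
    "frequency": [12, 14, 12, 2],
})
df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})

# Order words by total frequency so the tallest bar ends up last
sorter = (
    df.groupby("word", as_index=False)
      .agg({"frequency": "sum"})
      .sort_values("frequency")["word"]
      .values
)

# One row per word, one column per year; missing (word, year) pairs become NaN
pivoted = df.pivot(index="word", columns="year", values="frequency").reindex(sorter)
print(pivoted)
```

Calling pivoted.plot.bar(stacked=True) on this frame then stacks each word's yearly segments exactly as the question describes.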
I need to plot the frequency of items by date. My csv contains three columns: Date, Name & Surname, and Birthday.
I am interested in plotting the number of people recorded on each date. My expected output would be:
Date Count
0 01/01/2018 9
1 01/02/2018 12
2 01/03/2018 6
3 01/04/2018 4
4 01/05/2018 5
.. ... ...
.. 02/27/2020 122
.. 02/28/2020 84
The table above was found as follows:
by_date = df.groupby(df['Date']).size().reset_index(name='Count')
Date is a column in my csv file, but Count is not, which is why I am having difficulty drawing a line plot.
How can I plot the frequency as a list of numbers/column?
Although not absolutely required, you should convert the Date column into Timestamp for easier analysis in later steps:
df['Date'] = pd.to_datetime(df['Date'])
Now, to your question. To count how many births there are per day, you can use value_counts:
births = df['Date'].value_counts()
But you don't even have to do that for plotting a histogram! Use hist:
import matplotlib.dates as mdates
year = mdates.YearLocator()
month = mdates.MonthLocator()
formatter = mdates.ConciseDateFormatter(year)
ax = df['Date'].hist()
ax.set_title('# of births')
ax.xaxis.set_major_locator(year)
ax.xaxis.set_minor_locator(month)
ax.xaxis.set_major_formatter(formatter)
Result (from random data):
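If you specifically want a line plot of the daily counts (the Count column from the question) rather than a histogram, sorting the value_counts result by its index gives a series ready to plot. A sketch on hypothetical dates:

```python
import pandas as pd

# Hypothetical data: three records on Jan 1, one on Jan 2
df = pd.DataFrame({"Date": pd.to_datetime(
    ["01/01/2018", "01/01/2018", "01/02/2018", "01/01/2018"])})

# value_counts returns counts keyed by date; sort_index restores
# chronological order so the line is drawn left to right
births = df["Date"].value_counts().sort_index()
# births.plot() draws the daily counts as a line over time
print(births)
```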
I have a question. I am working with a pandas DataFrame that has a DatetimeIndex. I want to count a particular column, grouped by month.
For example:
df.groupby(df.index.month)["count_interest"].count()
Assuming that I am analyzing data from December 2019 onward, I get a result like this:
date
1 246
2 360
3 27
12 170
In reality, December 2019 is supposed to come first. What can I do? When I plot the frame grouped by month, December 2019 shows up last, which is chronologically incorrect.
See plot below for your understanding:
You can try reindex:
df.groupby(df.index.month)["count_interest"].count().reindex([12,1,2,3])
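The reindex approach can be verified on a small hypothetical frame with a December-to-March index like the question's:

```python
import pandas as pd

# Hypothetical index spanning December 2019 through March 2020
idx = pd.to_datetime(
    ["2019-12-05", "2019-12-10", "2020-01-03", "2020-02-07", "2020-03-01"])
df = pd.DataFrame({"count_interest": [1, 1, 1, 1, 1]}, index=idx)

# Grouping by month number sorts the keys numerically: 1, 2, 3, 12
counts = df.groupby(df.index.month)["count_interest"].count()

# reindex puts December back in front
ordered = counts.reindex([12, 1, 2, 3])
print(ordered)
```

An alternative that avoids the manual reindex is grouping by df.index.to_period('M') instead of df.index.month: periods keep December 2019 and the 2020 months distinct and sort chronologically on their own.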