I would like to plot a stacked bar plot from a csv file in python. I have three columns of data
year word frequency
2018 xyz 12
2017 gfh 14
2018 sdd 10
2015 fdh 1
2014 sss 3
2014 gfh 12
2013 gfh 2
2012 gfh 4
2011 wer 5
2010 krj 4
2009 krj 4
2019 bfg 4
... 300+ rows of data.
I need to go through all the data and plot a stacked bar plot which is categorized based on the year, so x axis is word and y axis is frequency, the legend color should show year wise. I want to see how the evolution of each word occured year wise. Some of the technology words are repeatedly used in every year and hence the stack bar graph should add the values on top and plot, for example the word gfh initially plots 14 for year 2017, and then in year 2014 I want the gfh word to plot (in a different color) for a value of 12 on top of the gfh of 2017. How do I do this? So far I called the csv file in my code. But I don't understand how could it go over all the rows and stack the words appropriately (as some words repeat through all the years). Any help is highly appreciated. Also the years are arranged in random order in csv but I sorted them year wise to make it easier. I am just learning python and trying to understand this plotting routine since i have 40 years of data and ~20 words. So I thought stacked bar plot is the best way to represent them. Any other visualisation method is also welcome.
This can be done using pandas:
import pandas as pd
df = pd.read_csv("file.csv")
# Aggregate data
df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})
# Create list to sort by
sorter = (
df.groupby(["word"], as_index=False)
.agg({"frequency": "sum"})
.sort_values("frequency")["word"]
.values
)
# Pivot, reindex, and plot
df = df.pivot(index="word", columns="year", values="frequency")
df = df.reindex(sorter)
df.plot.bar(stacked=True)
Which outputs:
Related
I have a panda dataframe named 'epu' with three columns 'year', 'month' and 'score'. It looks like this:
year month score
0 1970 1 125.224739
1 1970 1 99.020813
2 1970 1 112.190506
.
.
.
447 2022 4 154.661957
448 2022 5 168.034912
449 2022 6 143.154816
As can be seen, the year range is from 1970 to 2022. I wanted to plot 'score' on the y-axis and 'year' on the x-axis.
epu['score'].plot()
would give me the following graph. The score plot seems to be alright, but I don't get year labels on the x-axis.
But, plotting with
epu.plot(x='year',y='score')
would give me a strange-looking graph like below.
So, my question is how can I generate the graph in the first picture with the year labels on the x-axis of the second picture? Do I need some advanced features of matplotlib? I tried to search for answers here, but I might be missing something and couldn't find an answer for my problem.
I'm trying to visually compare two columns in a data frame and it either makes a weird table with 'frequency' instead of one of the columns
I tried these options:
ct1=pd.crosstab(df['releaseyear'],df['score'],normalize=True)
ct1.plot()
df.plot( x='releaseyear', y='score', kind='hist')
and also a scatter plot which get the x and y right but I don't know how normalize it so it will only show the average of each year and not all the data
plt.scatter(df['releaseyear'], df['score'])
plt.show()
There is no proper data which can be used to reproduce the dataframe or clue about how dataframe looks.
This answer is according to what i understood if data is like this
year score
2001 20
2001 18
2002 12
2002 16
then first use groupby and group data according to year and apply required aggregate function.
df=df.groupby('year').mean().reset_index()
output
year score
0 2001 19.0
1 2002 14.0
you can then plot the data accordingly.
I have a data frame like so. I am trying to make a plot with the mean of 'number' for each year on the y and the year on the x. I think what I have to do to do this is make a new data frame with 2 columns 'year' and 'avg number' for each year. How would I go about doing that?
year number
0 2010 40
1 2010 44
2 2011 33
3 2011 32
4 2012 34
5 2012 56
When opening a question about pandas please make sure you following these guidelines: How to make good reproducible pandas examples. It will help us reproduce your environment.
Assuming your dataframe is stored in the df variable:
df.groupby('year').mean().plot()
I have a dataframe that looks like this, with the default pandas index starting at 0:
index Year Count Name
0 2005 70000 Apple
1 2005 60000 Banana
2 2006 20000 Pineapple
3 2007 70000 Cherry
4 2007 60000 Coconut
5 2007 40000 Pear
6 2008 90000 Grape
7 2008 10000 Apricot
I would like to create a stacked bar plot of this data.
However, using the df.groupby() function will only allow me to call a function such as .mean() or .count() on this data in order to plot the data by year. I am getting the following result which separates each data point and does not group them by the shared year.
I have seen the matplotlib example for stacked bar charts, but they are grouped by a common index, in this case I do not have a common index I want to plot by. Is there a way to group and plot this data without rearranging the entire dataframe?
If I understood you correctly, you could do this using pivot first:
df1 = pd.pivot_table(df, values='Count', index='Year', columns='Name')
df1.plot(kind='bar')
Output:
Or with the argument stacked=True:
df1.plot(kind='bar', stacked=True)
I have a DataFrame contains as following, where first row is the "columns":
id,year,type,sale
1,1998,a,5
2,2000,b,10
3,1999,c,20
4,2001,b,15
5,2001,a,25
6,1998,b,5
...
I want to draw two figures, the first one is like
The second one is like
Figures in my draft might not be in right scale. I am a newbie to Python and I understand plotting functionality is powerful in Python. I believe there must be very easy to plot such figures.
The Pandas library provides simple and efficient tools to analyze and plot DataFrames.
Considering that the pandas library is installed and that the data are in a .csv file (matching the example you provided).
1. import the pandas library and load the data
import pandas as pd
data = pd.read_csv('filename.csv')
You now have a Pandas Dataframe as follow:
id year type sale
0 1 1998 a 5
1 2 2000 b 10
2 3 1999 c 20
3 4 2001 b 15
4 5 2001 a 25
5 6 1998 b 5
2. Plot the "sale" vs "type"
This is easily achieved by:
data.plot('type', 'sale', kind='bar')
which results in
If you want the sale for each type to be summed, data.groupby('type').sum().plot(y='sale', kind='bar') will do the trick (see #3 for explanation)
3. Plot the "sale" vs "year"
This is basically the same command, except that you have to first sum all the sale in the same year using the groupby pandas function.
data.groupby('year').sum().plot(y='sale', kind='bar')
This will result in
Edit:
4 Unstack the different type per year
You can also unstack the different 'type' per year for each bar by using groupby on 2 variables
data.groupby(['year', 'type']).sum().unstack().plot(y='sale', kind='bar', stacked=True)
Note:
See the Pandas Documentation on visualization for more information about achieving the layout you want.