I have a small pd.DataFrame that looks like this:
Col1  NumCol
0     10000000
1     7500000
2     12500000
3     37500000
4     110000000
5     65000000
NumCol is actually dollar values.
I want to create a seaborn lineplot, but instead of using the numerical values which create a funky looking axis, I'd like to show dollar values.
sns.lineplot(data=plot_df, x='Col1', y='NumCol') properly creates:
However, I'd like the axes to show $10,000,000, $7,500,000, etc.
I know I can create a string-representation of the column using
plot_df['NumCol_Str'] = plot_df.NumCol.apply(lambda x : "${:,}".format(x))
Which creates:
Col1  NumCol     NumCol_Str
0     10000000   $10,000,000
1     7500000    $7,500,000
2     12500000   $12,500,000
3     37500000   $37,500,000
4     110000000  $110,000,000
5     65000000   $65,000,000
However, when I plot the string column, the values on the y-axis are no longer in numerical order:
sns.lineplot(data=plot_df, x='Col1', y='NumCol_Str')
How can I properly plot the linegraph while keeping the new string notation on the axis?
MRE below:
plot_df = pd.DataFrame.from_dict({'Col1': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5},
'NumCol': {0: 10000000,
1: 7500000,
2: 12500000,
3: 37500000,
4: 110000000,
5: 65000000}})
plot_df['NumCol_Str'] = plot_df.NumCol.apply(lambda x : "${:,}".format(x))
sns.lineplot(data=plot_df, x='Col1', y='NumCol_Str')
sns.lineplot(data=plot_df, x='Col1', y='NumCol')
Plot the numeric values, then change how the tick labels are rendered with a matplotlib tick formatter:
import matplotlib.ticker as mtick
ax = sns.lineplot(data=plot_df, x='Col1', y='NumCol')
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('${x:,.0f}'))
EDIT:
Or even simpler, as pointed out by @BigBen (passing a format string directly requires matplotlib 3.3 or later):
ax.yaxis.set_major_formatter('${x:,.0f}')
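For completeness, here is a minimal end-to-end sketch of the formatter approach. It draws the line with plain matplotlib rather than seaborn (an assumption made here to keep the dependencies small; sns.lineplot returns a matplotlib Axes, so the same formatter call applies to it), and it selects the non-interactive Agg backend only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd

plot_df = pd.DataFrame({
    "Col1": [0, 1, 2, 3, 4, 5],
    "NumCol": [10000000, 7500000, 12500000, 37500000, 110000000, 65000000],
})

fig, ax = plt.subplots()
# Plot the numeric column so the y-axis keeps its numeric scale and ordering.
ax.plot(plot_df["Col1"], plot_df["NumCol"])

# Only the tick labels change; the underlying data stay numeric.
dollar_formatter = mtick.StrMethodFormatter("${x:,.0f}")
ax.yaxis.set_major_formatter(dollar_formatter)

fig.canvas.draw()  # force the tick labels to be generated
tick_labels = [t.get_text() for t in ax.get_yticklabels()]
print(tick_labels)
```

Because the data stay numeric, the line keeps its shape and the axis stays ordered; the string column is not needed for plotting at all.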
Consider my series as below: First column is article_id and the second column is frequency count.
article_id
1 39
2 49
3 187
4 159
5 158
...
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
I got this series from a dataframe with the following command:
logs.loc[logs['article_id'] <= 17029].groupby('article_id')['article_id'].count()
logs is the dataframe here and article_id is one of the columns in it.
How do I plot a bar chart (using Matplotlib) such that the article_id is on the x-axis and the frequency count is on the y-axis?
My natural instinct was to convert it into a list using .tolist() but that doesn't preserve the article_id.
IIUC you need Series.plot.bar:
#pandas 0.17.0 and above
s.plot.bar()
#pandas below 0.17.0
s.plot('bar')
Sample:
import pandas as pd
import matplotlib.pyplot as plt
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
print (s)
1 39
2 49
3 187
4 159
5 158
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
s.plot.bar()
plt.show()
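Since the question asks about Matplotlib specifically, the same chart can also be sketched without the pandas plotting wrapper by passing the index and the values to ax.bar. This is only an illustration under a couple of assumptions: the ids are cast to strings so the bars are evenly spaced, and the Agg backend is selected so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159, 5: 158,
               16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
              name="article_id").sort_index()

fig, ax = plt.subplots()
# Casting the ids to strings makes matplotlib treat them as categories,
# so the bars are evenly spaced instead of sitting at x=1 ... x=16980.
bars = ax.bar(s.index.astype(str), s.values)
ax.set_xlabel("article_id")
ax.set_ylabel("frequency")
ax.tick_params(axis="x", labelrotation=90)
```

Unlike .tolist(), this keeps the article_id: the index supplies the x positions and the values supply the bar heights.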
The new pandas API suggests the following way:
import pandas as pd
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
s.plot(kind="bar", figsize=(20,10))
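If you are working in Jupyter, you don't need to import matplotlib yourself or call plt.show(); pandas uses matplotlib behind the scenes, so it still has to be installed.
Just pass 'bar' as the kind parameter of plot.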
Example
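# read_csv comes from pandas (from pandas import read_csv); parser is a date-parsing function that is not shown here.
# Note: the squeeze argument was removed from read_csv in pandas 2.0; on newer versions, drop it and call .squeeze('columns') on the result instead.
series = read_csv('BwsCount.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
series.plot(kind='bar')
The default value of kind is 'line' (i.e. series.plot() draws a line graph).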
For your reference:
kind : str
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
I have the following pandas groupby object, and I'd like to turn the result into a new dataframe.
The following code computes the conditional probability:
bin_probs = data.groupby('season')['bin'].value_counts()/data.groupby('season')['bin'].count()
I've tried the following code, but it returns as follows.
I'd like the season to be filled in on each row. How can I do that?
a = pd.DataFrame(data_5.groupby('season')['bin'].value_counts()/data_5.groupby('season')['bin'].count())
a is a DataFrame, but with a 2-level index, so my interpretation is you want a dataframe without a multi-level index.
The index can't be reset when the name in the index and the column are the same.
Use pandas.Series.reset_index, and set name='normalized_bin', to rename the bin column.
This would not work with the implementation in the OP, because that is a dataframe.
This works with the following implementation, because a pandas.Series is created with .groupby.
The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random # for test data
import numpy as np # for test data
# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
bin season
0 2 summer
1 4 winter
2 1 summer
3 5 winter
4 2 spring
# groupby, normalize and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
season bin normalized_bin
0 fall 2 0.15600
1 fall 9 0.11600
2 fall 3 0.10800
3 fall 4 0.10400
4 fall 6 0.10000
5 fall 0 0.09600
6 fall 8 0.09600
7 fall 5 0.08400
8 fall 7 0.08000
9 fall 1 0.06000
10 spring 0 0.11524
11 spring 8 0.11524
12 spring 9 0.11524
13 spring 3 0.11152
14 spring 1 0.10037
Using the OP code for a
As already noted above, use normalize=True to get normalized values
The solution in the OP creates a DataFrame, because the .groupby is wrapped with the DataFrame constructor, pandas.DataFrame.
To reset the index, you must first rename the bin column with pandas.DataFrame.rename, and then use pandas.DataFrame.reset_index.
a = pd.DataFrame(df.groupby('season')['bin'].value_counts()/df.groupby('season')['bin'].count()).rename(columns={'bin': 'normalized_bin'}).reset_index()
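As a quick sanity check, the manual division and normalize=True should give the same numbers. The sketch below uses a tiny made-up data set (not the data from the question) and computes the per-season totals with a groupby transform, which is one way among several to do the division.

```python
import numpy as np
import pandas as pd

# Tiny made-up data set, used only to illustrate the equivalence.
df = pd.DataFrame({
    "season": ["fall", "fall", "fall", "fall", "spring", "spring"],
    "bin":    [1, 1, 2, 3, 0, 1],
})

# Manual normalization: counts per (season, bin) divided by the season total.
counts = df.groupby("season")["bin"].value_counts()
manual = counts / counts.groupby(level="season").transform("sum")

# Built-in normalization.
normalized = df.groupby("season")["bin"].value_counts(normalize=True)

# Flat DataFrame with 'season' repeated on every row.
a = normalized.reset_index(name="normalized_bin")
print(a)
```

Within each season the normalized_bin values add up to 1, which is what a conditional probability of bin given season should do.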
Other Resources
See Pandas unable to reset index because name exist to reset by a level.
Plotting
It is easier to plot from the multi-index Series, by using pandas.Series.unstack(), and then use pandas.DataFrame.plot.bar
For side-by-side bars, set stacked=False.
Each stacked bar sums to 1, because this is normalized data.
import matplotlib.pyplot as plt
s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()
# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')
You are looking for parameter normalize:
bin_probs = data.groupby('season')['bin'].value_counts(normalize=True)
I have a ProductDf which has many versions of the same product. I want to keep only the last iteration of each product, so I did the following:
productIndexDf= ProductDf.groupby('productId').apply(lambda
x:x['startDtTime'].reset_index()).reset_index()
productToPick = productIndexDf.groupby('productId')['index'].max()
# get the value of productToPick into a string
productIndex = productToPick.to_string(header=False,
index=False).replace('\n',' ')
productIndex = productIndex.split()
productIndex = list(map(int, productIndex))
productIndex.sort()
productIndexStr = ','.join(str(e) for e in productIndex)
Once I have the indices, calling iloc manually with them works:
filteredProductDf = ProductDf.iloc[[7,8],:]
If I pass it the string, I get an error:
filteredProductDf = ProductDf.iloc[productIndexStr,:]
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
I also tried this:
filteredProductDf = ProductDf[productIndexStr]
But then I get this issue:
KeyError: '7,8'
The pandas DataFrame.iloc indexer works only with integer positions (an integer, a slice, a list of integers, or a boolean array), so a string such as '7,8' is rejected. If you want to select rows by index label, including string labels, use the DataFrame.loc indexer instead.
You can read more about these methods at the following links.
Use of Pandas Dataframe iloc method
Use of Pandas Dataframe loc method
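To make the point about iloc concrete, here is a small sketch in which the comma-separated string is converted back into a list of integers before indexing. The frame and the names product_df and product_index_str are made up for illustration and only stand in for the ones in the question.

```python
import pandas as pd

# Made-up frame standing in for ProductDf.
product_df = pd.DataFrame({
    "productId": ["A001", "A001", "A001", "A002", "A002"],
    "totalSold": [100, 150, 300, 220, 250],
})

product_index_str = "2,4"  # the kind of string built in the question

# iloc needs integers, so split the string and convert each piece.
positions = [int(p) for p in product_index_str.split(",")]
filtered = product_df.iloc[positions, :]
print(filtered)
```

In the question's code the string round trip is probably unnecessary: productToPick already holds integers, so something like productToPick.tolist() should yield a list that iloc accepts directly.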
OK, I think there is some confusion here.
Given a dataframe that looks like this:
avgPrice productId startDtTime totalSold
0 42.5 A001 01/05/2018 100
1 55.5 A001 02/05/2018 150
2 48.5 A001 03/05/2018 300
3 42.5 A002 01/05/2018 220
4 53.5 A002 02/05/2018 250
I assume that you are interested in rows 2 and 4 (the last row for each productId). In pandas, the easiest way would be to use drop_duplicates() with the parameter keep='last'. Consider this example:
import pandas as pd
d = {'startDtTime': {0: '01/05/2018', 1: '02/05/2018',
2: '03/05/2018', 3: '01/05/2018', 4: '02/05/2018'},
'totalSold': {0: 100, 1: 150, 2: 300, 3: 220, 4: 250},
'productId': {0: 'A001', 1: 'A001', 2: 'A001', 3: 'A002', 4: 'A002'},
'avgPrice': {0: 42.5, 1: 55.5, 2: 48.5, 3: 42.5, 4: 53.5}
}
# Recreate dataframe
ProductDf = pd.DataFrame(d)
# Convert column with dates to datetime objects
ProductDf['startDtTime'] = pd.to_datetime(ProductDf['startDtTime'])
# Sort values by productId and startDtTime to ensure correct order
ProductDf.sort_values(by=['productId','startDtTime'], inplace=True)
# Drop the duplicates
ProductDf.drop_duplicates(['productId'], keep='last', inplace=True)
print(ProductDf)
And you get:
avgPrice productId startDtTime totalSold
2 48.5 A001 2018-03-05 300
4 53.5 A002 2018-02-05 250
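If the original frame should stay unmodified and unsorted, an alternative sketch is to look up the row with the latest startDtTime per product via groupby and idxmax. This is a different technique from drop_duplicates, shown only as an option; the month/day/year date format is an assumption chosen to match the output above.

```python
import pandas as pd

product_df = pd.DataFrame({
    "avgPrice": [42.5, 55.5, 48.5, 42.5, 53.5],
    "productId": ["A001", "A001", "A001", "A002", "A002"],
    "startDtTime": ["01/05/2018", "02/05/2018", "03/05/2018",
                    "01/05/2018", "02/05/2018"],
    "totalSold": [100, 150, 300, 220, 250],
})
# Assumption: the dates are month/day/year, matching the output shown above.
product_df["startDtTime"] = pd.to_datetime(product_df["startDtTime"],
                                           format="%m/%d/%Y")

# Index label of the latest startDtTime within each productId.
latest_idx = product_df.groupby("productId")["startDtTime"].idxmax()
latest = product_df.loc[latest_idx]
print(latest)
```

Here product_df still contains all five rows afterwards, while latest holds one row per productId.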
I have dataframe total_year, which contains three columns (year, action, comedy).
How can I plot two columns (action and comedy) on y-axis?
My code plots only one:
total_year[-15:].plot(x='year', y='action', figsize=(10,5), grid=True)
Several column names may be provided to the y argument of the pandas plotting function. Those should be specified in a list, as follows.
df.plot(x="year", y=["action", "comedy"])
Complete example:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"year": [1914,1915,1916,1919,1920],
"action" : [2.6,3.4,3.25,2.8,1.75],
"comedy" : [2.5,2.9,3.0,3.3,3.4] })
df.plot(x="year", y=["action", "comedy"])
plt.show()
By default, pandas.DataFrame.plot() uses the index for the x-axis and plots every other numeric column as y values.
So setting the year column as the index will do the trick:
total_year.set_index('year').plot(figsize=(10,5), grid=True)
When using pandas.DataFrame.plot, it's only necessary to specify a column to the x parameter.
The caveat is, the rest of the columns with numeric values will be used for y.
The following code contains extra columns to demonstrate. Note, 'date' is left as a string. However, if 'date' is converted to a datetime dtype, the plot API will also plot the 'date' column on the y-axis.
If the dataframe includes many columns, some of which should not be plotted, then specify the y parameter as shown in this answer, but if the dataframe contains only columns to be plotted, then specify only the x parameter.
In cases where the index is to be used as the x-axis, then it is not necessary to specify x=.
import pandas as pd
# test data
data = {'year': [1914, 1915, 1916, 1919, 1920],
'action': [2.67, 3.43, 3.26, 2.82, 1.75],
'comedy': [2.53, 2.93, 3.02, 3.37, 3.45],
'test1': ['a', 'b', 'c', 'd', 'e'],
'date': ['1914-01-01', '1915-01-01', '1916-01-01', '1919-01-01', '1920-01-01']}
# create the dataframe
df = pd.DataFrame(data)
# display(df)
year action comedy test1 date
0 1914 2.67 2.53 a 1914-01-01
1 1915 3.43 2.93 b 1915-01-01
2 1916 3.26 3.02 c 1916-01-01
3 1919 2.82 3.37 d 1919-01-01
4 1920 1.75 3.45 e 1920-01-01
# plot the dataframe
df.plot(x='year', figsize=(10, 5), grid=True)
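One way to confirm which columns ended up on the y-axis is to inspect the lines on the returned Axes. The sketch below uses a trimmed version of the test data (the 'date' column is left out for brevity) and the Agg backend so it runs without a display; both are choices made here, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "year": [1914, 1915, 1916, 1919, 1920],
    "action": [2.67, 3.43, 3.26, 2.82, 1.75],
    "comedy": [2.53, 2.93, 3.02, 3.37, 3.45],
    "test1": ["a", "b", "c", "d", "e"],  # non-numeric, so it is not plotted
})

# Only x is given; pandas picks the remaining numeric columns for y.
ax = df.plot(x="year", figsize=(10, 5), grid=True)
plotted = [line.get_label() for line in ax.get_lines()]
print(plotted)
```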
I am trying to plot three lines on the same plot in Matplotlib. They are InvoicesThisYear, DisputesThisYear, and PercentThisYear (Which is Disputes/Invoices)
The original input is two columns of dates -- one for the date of a logged dispute and one for the date of a logged invoice.
I use the dates to count up the number of disputes and invoices per month during a certain year.
Then I try to graph it, but it comes up empty. I started with just trying to print PercentThisYear and InvoicesThisYear.
PercentThisYear = (DisputesFYThisYear/InvoicesFYThisYear).fillna(0.0)
#Percent_ThisYear.plot(kind = 'line')
#InvoicesFYThisYear.plot(kind = 'line')
plt.plot(PercentThisYear)
plt.xlabel('Date')
plt.ylabel('Percent')
plt.title('Customer Disputes')
# Remove the plot frame lines. They are unnecessary chartjunk.
ax = plt.subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax2 = ax.twinx()
ax2.plot(InvoicesFYThisYear)
# Ensure that the axis ticks only show up on the bottom and left of the plot.
# Ticks on the right and top of the plot are generally unnecessary chartjunk.
ax.get_xaxis().tick_bottom()
#ax.get_yaxis().tick_left()
# Limit the range of the plot to only where the data is.
# Avoid unnecessary whitespace.
datenow = datetime.datetime.now()
dstart = datetime.datetime(2015,4,1)
print(datenow)
#plt.ylim(0, .14)
plt.xlim(dstart, datenow)
firsts=[]
for i in range(dstart.month, datenow.month+1):
firsts.append(datetime.datetime(2015,i,1))
plt.xticks(firsts)
plt.show()
This is the output... The dates are all messed up and nothing is plotted, but the scales on the axes look right. What am I doing wrong?
Here is the set up leading up to the graph if that is helpful
The Input looks like this:
InvoicesThisYear
Out[82]:
7 7529
5 5511
6 4934
8 3552
dtype: int64
DisputesThisYear
Out[83]:
2 211
1 98
7 54
4 43
3 32
6 29
5 21
8 8
dtype: int64
PercentThisYear
Out[84]:
1 0.000000
2 0.000000
3 0.000000
4 0.000000
5 0.003810
6 0.005877
7 0.007172
8 0.002252
dtype: float64
Matplotlib has no way of knowing which dates are associated with which data points. When you call plot with only one argument y, Matplotlib automatically assumes that the x-values are range(len(y)). You need to supply the dates as the first argument to plot. Assuming that InvoicesThisYear is a count of the number of invoices each month, starting at 1 and ending at 8, you could do something like
import datetime
import matplotlib.pyplot as plt
import pandas as pd
InvoicesFYThisYear = pd.DataFrame([0, 0, 0, 0, 5511, 4934, 7529, 3552])
Disputes = pd.DataFrame([98, 211, 32, 43, 21, 29, 54, 8])
PercentThisYear = (Disputes / InvoicesFYThisYear)
datenow = datetime.date.today()
ax = plt.subplot(111)
dates = [datetime.date(2015, i, 1) for i in range(1, 9)]  # range replaces Python 2's xrange
plt.plot(dates, PercentThisYear)
ax2 = ax.twinx()
ax2.plot(dates, InvoicesFYThisYear)
dstart = datetime.datetime(2015,4,1)
plt.xlim(dstart, datenow)
plt.xticks(dates, dates)
plt.show()
If your data is in a Pandas series and the index is an integer representing the month, all you have to do is change the index to datetime objects instead. The plot method for pandas.Series will handle things automatically from there. Here's how you might do that:
Invoices = pd.Series((211, 98, 54, 43, 32, 29, 21, 8), index = (2, 1, 7, 4, 3, 6, 5, 8))
dates = [datetime.date(2015, month, 1) for month in Invoices.index]
Invoices.index = dates
Invoices.plot()
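Putting both ideas together, a Python 3 sketch of the two-axis plot might look like the following. It reuses the monthly counts shown in the question, assumes the year is 2015 (taken from the dstart value in the question), and introduces a hypothetical helper, with_month_dates, that is not part of pandas or matplotlib. Because the lines are drawn with matplotlib directly, the index is sorted by date first.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

invoices = pd.Series([7529, 5511, 4934, 3552], index=[7, 5, 6, 8])
disputes = pd.Series([211, 98, 54, 43, 32, 29, 21, 8],
                     index=[2, 1, 7, 4, 3, 6, 5, 8])


def with_month_dates(s, year=2015):
    """Hypothetical helper: index by the first day of each month, sorted by date."""
    out = s.copy()
    out.index = pd.DatetimeIndex(
        [pd.Timestamp(year=year, month=int(m), day=1) for m in s.index])
    return out.sort_index()


# Division aligns on the month index; months without invoices become NaN, then 0.
percent = with_month_dates((disputes / invoices).fillna(0.0))
invoices_by_date = with_month_dates(invoices)

fig, ax = plt.subplots()
ax.plot(percent.index, percent.values, color="tab:blue")
ax.set_ylabel("Percent")
ax2 = ax.twinx()  # second y-axis sharing the same date x-axis
ax2.plot(invoices_by_date.index, invoices_by_date.values, color="tab:orange")
ax2.set_ylabel("Invoices")
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

Passing the dates explicitly as the x values is what ties each data point to its month; with only y values, matplotlib falls back to positions 0, 1, 2, and so on.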