Pandas timeseries plot showing abnormal characters

I used pandas to plot some random time-series data, and I found that it was showing each month as a number followed by a square. Is there any way to fix this?
Here is the code:
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> ts.plot()
<matplotlib.axes._subplots.AxesSubplot object at 0xb2072b0c>
>>> plt.show()
Here is the plot:

This can happen if your system locale is set to a non-English one. Given your name I am going to assume you might be using a Chinese locale. So, Pandas or Matplotlib is generating Chinese calendar characters, but the rendering engine you are using cannot display them.
You have at least two options:
Change your system locale, at least when running this code.
Try a different "backend" for Matplotlib. To see which ones are available on your system, see the question "List of all available matplotlib backends".
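A minimal sketch of the first option, using only the standard library. It assumes the month labels on the axis are ultimately produced through strftime, which is an assumption about the plotting internals rather than something confirmed in this thread. Setting LC_TIME to the portable "C" locale before plotting makes strftime return English month names:

```python
import locale
from datetime import date

# LC_TIME controls the month and weekday names produced by strftime.
# "C" is the portable default locale and yields English names.
locale.setlocale(locale.LC_TIME, "C")

january = date(2000, 1, 1)
month_label = january.strftime("%b")
print(month_label)  # Jan
```

If the labels still come out wrong after this call, the backend may be resetting the locale itself, in which case the second option is the one to try.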


Seaborn lineplot unexpected behaviour

I am hoping to understand why the following Seaborn lineplot behaviour occurs.
Spikes are occurring through the time-series and additional data has been added to the left of the actual data.
How can I prevent this unexpected behaviour in Seaborn?
Regular plot of data:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
aussie_property[['Sydney(SYDD)']].plot();
Seaborn plot of data:
sns.lineplot(data=aussie_property, x='date', y='Sydney(SYDD)');
This is not a seaborn problem but a question of ambiguous datetimes.
Convert date to a datetime object with the following code:
aussie_property['date'] = pd.to_datetime(aussie_property['Date'], dayfirst=True)
and you get your expected plot with seaborn
Generally, it is advisable to provide the format during datetime conversions, e.g.,
aussie_property['date'] = pd.to_datetime(aussie_property['Date'], format="%d/%m/%Y")
because, as we have seen here, dates like 10/12/2020 are ambiguous. The parser first read the values as month/day/year, later hit a date for which this cannot be the case (a "month" above 12), and switched to day/month/year for those rows, giving rise to these time-travelling spikes in your seaborn graph. Why didn't you see them in the pandas plot, you ask? That plot is drawn against the index rather than the date column, so the conversion problem does not show up there.
More information on the format codes can be found in the Python documentation.
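To make the ambiguity concrete, here is a small sketch that parses the same string with two explicit formats. The variable names are only for illustration:

```python
import pandas as pd

raw = "10/12/2020"  # ambiguous: 10 December or 12 October?

# An explicit format removes the guesswork from the parser.
day_first = pd.to_datetime(raw, format="%d/%m/%Y")
month_first = pd.to_datetime(raw, format="%m/%d/%Y")

print(day_first.date())    # 2020-12-10
print(month_first.date())  # 2020-10-12
```

The two results are almost two months apart, which is the kind of jump that shows up as a spike when only some rows are read the wrong way round.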

Why is my boxplot not showing up in python? [duplicate]

I am new to Python and am working on displaying a boxplot for a dataset with 2 numeric columns and 1 character column with values (A,B,C,D). I want to show a boxplot of the values for either of the 2 numeric columns by the character column. I have followed some tutorials online but the plots are not showing up.
I have tried adding .show() or .plot() on the end of some of my code, but receive warnings that those attributes don't exist. I have tried using matplotlib and it seems to work better when I use that module, but I want to learn how to do this when using pandas.
import pandas as pd
datafile="C:\\Users\\…\\TestFile.xlsx"
data=pd.read_excel(datafile)
data.boxplot('Col1', by='Col2')
I want a boxplot to show up automatically when I run this code or be able to run one more line to have it pop up, but everything I've tried has failed. What step(s) am I missing?
You should use plt.show(). Look at the following code:
import pandas as pd
import matplotlib.pyplot as plt
datafile="C:\\Users\\…\\TestFile.xlsx"
data=pd.read_excel(datafile)
data.boxplot('Col1', by='Col2')
plt.show()
The Seaborn library makes it easy to draw all sorts of plots between two columns of a dataframe. Place the categorical column on the x-axis and the numerical column on the y-axis. There is also a fancier version of the boxplot in Seaborn known as boxenplot.
import seaborn as sns
# Col2 is the categorical column and Col1 the numerical one, as in data.boxplot('Col1', by='Col2')
sns.boxplot(x = data['Col2'], y = data['Col1'])
import seaborn as sns
sns.boxenplot(x = data['Col2'], y = data['Col1'])

how to make small multiple box plots with long data frame in python

I have a long data frame like the simplified sample below:
import pandas as pd
import numpy as np
data={'nm':['A','B']*12,'var':['vol','vol','ratio','ratio','price','price']*4,'value':np.random.randn(24)}
sample=pd.DataFrame(data)
sample
And wish to create small multiple box plots using var as facet, nm as category and value as value, how can I do so using matplotlib or seaborn? I've searched for similar code but the examples looked complex.
Perhaps you can start with seaborn's catplot:
import seaborn as sns
sns.catplot(x='nm', y='value', col='var', kind='box', data=sample)

Use locale with seaborn

Currently I'm trying to visualize some data I am working on with seaborn. I need to use a comma as decimal separator, so I was thinking about simply changing the locale. I found this answer to a similar question, which sets the locale and uses matplotlib to plot some data.
This also works for me, but when using seaborn instead of matplotlib directly, it doesn't use the locale anymore. Unfortunately, I can't find any setting to change in seaborn or any other workaround. Is there a way?
Here is some example data. Note that I had to use 'german' instead of "de_DE". The x-axis labels all use the standard point as decimal separator.
import locale
# Set to German locale to get comma decimal separator
locale.setlocale(locale.LC_NUMERIC, 'german')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Tell matplotlib to use the locale we set above
plt.rcParams['axes.formatter.use_locale'] = True
df = pd.DataFrame([[1,2,3],[4,5,6]]).T
df.columns = [0.3,0.7]
sns.boxplot(data=df)
The "numbers" shown on the x axis for such boxplots are determined via a matplotlib.ticker.FixedFormatter (find out via print(ax.xaxis.get_major_formatter())).
This fixed formatter just puts labels on ticks one by one from a list of labels. This makes sense because your boxes are positioned at 0 and 1, yet you want them to be labeled as 0.3, 0.7. I suppose this concept becomes clearer when thinking about what should happen for a dataframe with df.columns=["apple","banana"].
So the FixedFormatter ignores the locale, because it just takes the labels as they are. The solution I would propose here (although some of those in the comments are equally valid) would be to format the labels yourself.
ax.set_xticklabels(["{:n}".format(l) for l in df.columns])
The n format here is just the same as the usual g, but takes into account the locale (see Python's format specification mini-language). Of course using any other format of choice is equally possible. Also note that setting the labels here via ax.set_xticklabels only works because of the fixed locations used by boxplot. For other types of plots with continuous axes, this would not be recommended, and instead the concepts from the linked answers should be used.
Complete code:
import locale
# Set to German locale to get comma decimal separator
locale.setlocale(locale.LC_NUMERIC, 'german')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame([[1,2,3],[4,5,6]]).T
df.columns = [0.3,0.7]
ax = sns.boxplot(data=df)
ax.set_xticklabels(["{:n}".format(l) for l in df.columns])
plt.show()
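The effect of the n format can be checked without any plotting. The sketch below is standard library only. The German locale names it tries are guesses, since names differ between platforms and none of them may be installed on a given machine, so it falls back gracefully:

```python
import locale

value = 0.3

# In the portable "C" locale, n gives the same result as g.
locale.setlocale(locale.LC_NUMERIC, "C")
plain = "{:n}".format(value)
print(plain)  # 0.3

# Candidate names for a German locale (assumptions, platform dependent).
localized = None
for name in ("de_DE.UTF-8", "de_DE", "german"):
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
    except locale.Error:
        continue  # this name is not available here, try the next one
    localized = "{:n}".format(value)
    break

# Restore the portable locale so later code is not affected.
locale.setlocale(locale.LC_NUMERIC, "C")
print(localized)  # 0,3 if a German locale was found, otherwise None
```

The same expression is what the list comprehension passed to ax.set_xticklabels evaluates for each column label.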

Simple way to plot time series with real dates using pandas

Starting from the following CSV data, loaded into a pandas data frame...
Buchung;Betrag;Saldo
27.06.2016;-1.000,00;42.374,95
02.06.2016;500,00;43.374,95
01.06.2016;-1.000,00;42.874,95
13.05.2016;-500,00;43.874,95
02.05.2016;500,00;44.374,95
04.04.2016;500,00;43.874,95
02.03.2016;500,00;43.374,95
10.02.2016;1.000,00;42.874,95
02.02.2016;500,00;41.874,95
01.02.2016;1.000,00;41.374,95
04.01.2016;300,00;40.374,95
30.12.2015;234,54;40.074,95
02.12.2015;300,00;39.840,41
02.11.2015;300,00;39.540,41
08.10.2015;1.000,00;39.240,41
02.10.2015;300,00;38.240,41
02.09.2015;300,00;37.940,41
31.08.2015;2.000,00;37.640,41
... I would like an intuitive way to plot the time series given by the dates in column "Buchung" and the monetary values in column "Saldo".
I tried
seaborn.tsplot(data=data, time="Buchung", value="Saldo")
which yields
ValueError: could not convert string to float: '31.08.2015'
What is an easy way to read the dates and values and plot the time series? I assume this is such a common problem that there must be a three line solution.
You need to convert your date column into the correct format:
data['Buchung'] = pd.to_datetime(data['Buchung'], format='%d.%m.%Y')
Now your plot will work.
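Though you didn't ask, I think you will also run into a similar problem because your numbers (in 'Betrag' and 'Saldo') seem to be strings as well. So I recommend you convert them to numeric before plotting. Here is how you can do that by simple string manipulation:
# Drop the thousands separator, turn the decimal comma into a point, then cast to float.
# regex=False makes '.' a literal dot rather than a pattern.
data["Saldo"] = data["Saldo"].str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)
data["Betrag"] = data["Betrag"].str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)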
Or set the locale:
import locale
# The data appears to be in a European format, German locale might
# fit. Try this on Windows machine:
locale.setlocale(locale.LC_ALL, 'de')
data['Betrag'] = data['Betrag'].apply(locale.atof)
data['Saldo'] = data['Saldo'].apply(locale.atof)
# This will reset the locale to system default
locale.setlocale(locale.LC_ALL, '')
On an Ubuntu machine, follow this answer. If the above code does not work on a Windows machine, inspect locale.locale_alias, a dictionary of locale names known to Python, and pick a name from that.
Output
Using matplotlib since I cannot install Seaborn on the machine I am working from.
from matplotlib import pyplot as plt
plt.plot(data['Buchung'], data['Saldo'], '-')
_ = plt.xticks(rotation=45)
Note: this has been produced using the locale method. Hence the month names are in German.
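As a possible shortcut, the separators can be declared while reading the file instead of being fixed afterwards. This is a sketch under assumptions, not the method used above: it embeds the first three rows of the sample as a string, and it forces the date column to be read as text so that the thousands setting cannot touch it:

```python
import io
import pandas as pd

# First rows of the sample data from the question.
csv_text = """Buchung;Betrag;Saldo
27.06.2016;-1.000,00;42.374,95
02.06.2016;500,00;43.374,95
01.06.2016;-1.000,00;42.874,95
"""

data = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                 # semicolon-separated columns
    decimal=",",             # decimal comma
    thousands=".",           # point as thousands separator
    dtype={"Buchung": str},  # keep dates as text until parsed explicitly
)
data["Buchung"] = pd.to_datetime(data["Buchung"], format="%d.%m.%Y")
data = data.sort_values("Buchung")  # oldest booking first

print(data.dtypes)
```

After this, 'Betrag' and 'Saldo' are floats and 'Buchung' is a datetime column, so the plt.plot call shown above can be used unchanged.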
