Simple way to plot time series with real dates using pandas - python

Starting from the following CSV data, loaded into a pandas data frame...
Buchung;Betrag;Saldo
27.06.2016;-1.000,00;42.374,95
02.06.2016;500,00;43.374,95
01.06.2016;-1.000,00;42.874,95
13.05.2016;-500,00;43.874,95
02.05.2016;500,00;44.374,95
04.04.2016;500,00;43.874,95
02.03.2016;500,00;43.374,95
10.02.2016;1.000,00;42.874,95
02.02.2016;500,00;41.874,95
01.02.2016;1.000,00;41.374,95
04.01.2016;300,00;40.374,95
30.12.2015;234,54;40.074,95
02.12.2015;300,00;39.840,41
02.11.2015;300,00;39.540,41
08.10.2015;1.000,00;39.240,41
02.10.2015;300,00;38.240,41
02.09.2015;300,00;37.940,41
31.08.2015;2.000,00;37.640,41
... I would like an intuitive way to plot the time series given by the dates in column "Buchung" and the monetary values in column "Saldo".
I tried
seaborn.tsplot(data=data, time="Buchung", value="Saldo")
which yields
ValueError: could not convert string to float: '31.08.2015'
What is an easy way to read the dates and values and plot the time series? I assume this is such a common problem that there must be a three line solution.

You need to convert your date column into the correct format:
data['Buchung'] = pd.to_datetime(data['Buchung'], format='%d.%m.%Y')
Now your plot will work.
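Though you didn't ask, you will likely run into a similar problem because the numbers in 'Betrag' and 'Saldo' are strings as well, so I recommend converting them to numeric before plotting. Here is how to do that with simple string manipulation (regex=False makes pandas treat '.' as a literal dot rather than a regular-expression wildcard, and astype(float) performs the actual conversion):
data["Saldo"] = data["Saldo"].str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)
data["Betrag"] = data["Betrag"].str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)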
Or set the locale:
import locale
# The data appears to be in a European format, German locale might
# fit. Try this on Windows machine:
locale.setlocale(locale.LC_ALL, 'de')
data['Betrag'] = data['Betrag'].apply(locale.atof)
data['Saldo'] = data['Saldo'].apply(locale.atof)
# This will reset the locale to system default
locale.setlocale(locale.LC_ALL, '')
On an Ubuntu machine, follow this answer. If the above code does not work on a Windows machine, inspect the locale.locale_alias dictionary for the locale names Python knows about and pick a name from there.
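Note that locale.locale_alias is a plain dictionary of locale names that Python recognises; whether a given locale is actually installed is up to the operating system. A minimal sketch for narrowing it down to German candidates (the "de" prefix filter is an assumption made for this data set):

```python
import locale

# locale_alias maps normalised alias names to full locale names.
# It lists names Python knows about, not locales installed on this machine.
german_aliases = sorted(
    name for name in locale.locale_alias if name.startswith("de")
)

for name in german_aliases:
    print(name, "->", locale.locale_alias[name])
```

Any candidate still has to be passed to locale.setlocale to find out whether the system accepts it; a locale.Error means it is not available.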
Output
Using matplotlib since I cannot install Seaborn on the machine I am working from.
from matplotlib import pyplot as plt
plt.plot(data['Buchung'], data['Saldo'], '-')
_ = plt.xticks(rotation=45)
Note: this has been produced using the locale method. Hence the month names are in German.
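As an alternative to converting the columns after loading, the number format can be declared when the file is read. The following is a minimal sketch under stated assumptions: the first three rows of the data above stand in for the real file, and the date column is read as a string on purpose so that the thousands separator is not applied to it.

```python
import io

import pandas as pd

# First rows of the data from the question, used here in place of the real file.
csv_text = """Buchung;Betrag;Saldo
27.06.2016;-1.000,00;42.374,95
02.06.2016;500,00;43.374,95
01.06.2016;-1.000,00;42.874,95
"""

data = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                 # columns are separated by semicolons
    decimal=",",             # comma is the decimal separator
    thousands=".",           # dot groups thousands
    dtype={"Buchung": str},  # keep dates as text so '.' is not treated as a thousands separator
)

# Convert the date strings explicitly, then order the rows chronologically.
data["Buchung"] = pd.to_datetime(data["Buchung"], format="%d.%m.%Y")
data = data.sort_values("Buchung").reset_index(drop=True)

print(data.dtypes)
```

With the columns typed this way, the plt.plot(data['Buchung'], data['Saldo'], '-') call shown above can be used unchanged.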

Related

Why doesn't Statsmodels OLS support reading in columns with multiple words?

I've been experimenting with Seaborn's lmplot() and Statsmodels .ols() functions for simple linear regression plots and their associated p-values, r-squared, etc.
I've noticed that when I specify which columns I want to use for lmplot, I can specify a column even if it has multiple words for it:
import seaborn as sns
import pandas as pd
input_csv = pd.read_csv('./test.csv',index_col = 0,header = 0)
input_csv
sns.lmplot(x='Age',y='Count of Specific Strands',data = input_csv)
<seaborn.axisgrid.FacetGrid at 0x2800985b710>
However, if I try to use ols, I'm getting an error for inputting in "Count of Specific Strands" as my dependent variable (I've only listed out the last couple of lines in the error):
import statsmodels.formula.api as smf
test_results = smf.ols('Count of Specific Strands ~ Age',data = input_csv).fit()
File "<unknown>", line 1
Count of Specific Strands
^
SyntaxError: invalid syntax
Conversely, if I specify the "Count of Specific Strands" column by position as shown below, the regression works:
test_results = smf.ols('input_csv.iloc[:,1] ~ Age',data = input_csv).fit()
test_results.summary()
Does anyone know why this is? Is it just because of how Statsmodels was written? Is there an alternative to specify the dependent variable for regression analysis that doesn't involve iloc or loc?
This is due to the way the formula parser patsy is written: see this link for more information
The authors of patsy have, however, thought of this problem: (quoted from here)
This flexibility does create problems in one case, though – because we
interpret whatever you write in-between the + signs as Python code,
you do in fact have to write valid Python code. And this can be tricky
if your variable names have funny characters in them, like whitespace
or punctuation. Fortunately, patsy has a builtin “transformation”
called Q() that lets you “quote” such variables
Therefore, in your case, you should be able to write:
smf.ols('Q("Count of Specific Strands") ~ Age',data = input_csv).fit()
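The underlying reason can be illustrated without statsmodels. The sketch below only checks whether a formula term is a valid Python expression, using the standard library's ast module. It is an illustration of the rule quoted above, not a description of how patsy is implemented, and the helper name is hypothetical.

```python
import ast


def is_valid_python_expression(term):
    """Return True if `term` parses as a single Python expression."""
    try:
        ast.parse(term, mode="eval")
        return True
    except SyntaxError:
        return False


# A column name with spaces is not valid Python on its own ...
raw_ok = is_valid_python_expression("Count of Specific Strands")

# ... but a call to Q() with the name as a string argument is.
quoted_ok = is_valid_python_expression('Q("Count of Specific Strands")')

print(raw_ok, quoted_ok)
```

The same check explains why renaming the column to something like Count_of_Specific_Strands would also work, since that is a valid Python identifier.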

Use locale with seaborn

Currently I'm trying to visualize some data I am working on with seaborn. I need to use a comma as decimal separator, so I was thinking about simply changing the locale. I found this answer to a similar question, which sets the locale and uses matplotlib to plot some data.
This also works for me, but when using seaborn instead of matplotlib directly, it doesn't use the locale anymore. Unfortunately, I can't find any setting to change in seaborn or any other workaround. Is there a way?
Here is some example data. Note that I had to use 'german' instead of "de_DE". The x-axis labels all use the standard point as decimal separator.
import locale
# Set to German locale to get comma decimal separator
locale.setlocale(locale.LC_NUMERIC, 'german')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Tell matplotlib to use the locale we set above
plt.rcParams['axes.formatter.use_locale'] = True
df = pd.DataFrame([[1,2,3],[4,5,6]]).T
df.columns = [0.3,0.7]
sns.boxplot(data=df)
The "numbers" shown on the x axis for such boxplots are determined via a
matplotlib.ticker.FixedFormatter (find out via print(ax.xaxis.get_major_formatter())).
This fixed formatter just puts labels on ticks one by one from a list of labels. This makes sense because your boxes are positioned at 0 and 1, yet you want them to be labeled as 0.3, 0.7. I suppose this concept becomes clearer when thinking about what should happen for a dataframe with df.columns=["apple","banana"].
So the FixedFormatter ignores the locale, because it just takes the labels as they are. The solution I would propose here (although some of those in the comments are equally valid) would be to format the labels yourself.
ax.set_xticklabels(["{:n}".format(l) for l in df.columns])
The n format here is just the same as the usual g, but takes into account the locale. (See python format mini language). Of course using any other format of choice is equally possible. Also note that setting the labels here via ax.set_xticklabels only works because of the fixed locations used by boxplot. For other types of plots with continuous axes, this would not be recommended, and instead the concepts from the linked answers should be used.
Complete code:
import locale
# Set to German locale to get comma decimal separator
locale.setlocale(locale.LC_NUMERIC, 'german')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame([[1,2,3],[4,5,6]]).T
df.columns = [0.3,0.7]
ax = sns.boxplot(data=df)
ax.set_xticklabels(["{:n}".format(l) for l in df.columns])
plt.show()
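If the German locale is not installed on the machine, the "{:n}" format cannot produce a comma. A locale-independent fallback, which is a different technique from the one in the answer above, is to build the label strings by hand. A minimal sketch, with comma_label as a hypothetical helper name:

```python
def comma_label(value):
    """Format a number like '{:g}', but with a comma as decimal separator."""
    return "{:g}".format(value).replace(".", ",")


# Same column values as in the example data frame above.
columns = [0.3, 0.7]
labels = [comma_label(c) for c in columns]

print(labels)
```

The resulting list can be passed to ax.set_xticklabels in place of the locale-based list comprehension. As with the original approach, this is only appropriate for the fixed tick positions of a boxplot.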

Reading values from a csv, then putting them in a string using python

I have a spreadsheet with Toronto real estate info - this includes the latitude and longitude of condos in the area. In fact, it even has a column with the lat & long combined. What I'd like to do is put these coordinates on a map, probably with folium (but I'm open to alternatives). From what I can tell, folium uses the following format:
map_1 = folium.Map(location=[45.372, -121.6972],
zoom_start=12,
tiles='Stamen Terrain')
folium.Marker([45.3288, -121.6625], popup='Mt. Hood Meadows').add_to(map_1)
folium.Marker([45.3311, -121.7113], popup='Timberline Lodge').add_to(map_1)
map_1
So as far as I can tell I need to do two things:
1) Generate a sufficient amount of lines with the content: folium.Marker([x, y]).add_to(map_1)
2) Fill in x and y with the lat/long values from the spreadsheet
So far I've been able to pull in the lat/long column from the csv, but that's as far as I've gotten:
import pandas as pd
import folium
df_raw = pd.read_excel('df_condo_v9_t1.xlsx', sheet_name=0, header=0)
df_raw.shape
df_raw.dtypes
df_lat = df_raw['Latlng']
df_lat.head()
If you really need to look at the csv, it's at: https://github.com/vshideler/toronto-condos
Any suggestions would be appreciated!
If you're going to be working with data like this more in the future, I would strongly recommend that you read through the pandas documentation. This includes a 10 minute "getting started" guide which covers many of the common use-cases, including examples very similar to this one.
That being said, you have most of the components that you need in the code that you gave above. The two issues that I see are as follows:
1) If you look at the DataFrame's data types (df_raw.dtypes in your code above), you'll see that your Latlng column is still a string, while pandas has already converted the Latitude and Longitude columns to floats for you. That suggests it is easier to work with those two columns, since they are directly usable for positioning your markers.
2) You'll probably want to configure your map a bit: the defaults that you took from the example code create a terrain map centered somewhere in Oregon. Given that you're plotting properties in Toronto, neither of those seems like a good option. I generally like to center my map at the mean point of my data.
A simple example to get things working could look like:
import pandas as pd
import folium
df = pd.read_csv("df_condo_v9_t1.csv")
map_center = [df["Latitude"].mean(), df["Longitude"].mean()]
map_1 = folium.Map(location=map_center, zoom_start=16)
for i, row in df[["Latitude", "Longitude"]].dropna().iterrows():
    position = (row["Latitude"], row["Longitude"])
    folium.Marker(position).add_to(map_1)
map_1
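The data preparation part of this can be checked on its own before any map is drawn. The sketch below uses a few made-up coordinates purely for illustration (they are not taken from the spreadsheet) and builds the center point and the list of marker positions with pandas only:

```python
import pandas as pd

# Made-up coordinates for illustration only; one row is missing a latitude.
df = pd.DataFrame({
    "Latitude": [43.65, 43.66, None],
    "Longitude": [-79.38, -79.40, -79.39],
})

# Drop incomplete rows first so the center and the markers use the same data.
coords = df[["Latitude", "Longitude"]].dropna()

map_center = [coords["Latitude"].mean(), coords["Longitude"].mean()]

# One (lat, lon) tuple per remaining row, ready to be passed to folium.Marker.
positions = list(coords.itertuples(index=False, name=None))

print(map_center)
print(positions)
```

Each tuple in positions can then be handed to folium.Marker(...).add_to(map_1) in a loop, exactly as in the example above.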

ggplot geom_histogram behaves differently between Python and R

I am trying to do some exploratory data analysis and I have a data frame with an integer age column and a "category" column. Making a histogram of the age is easy enough. What I want to do is maintain this age histogram but color the bars based on the categorical variables.
import numpy as np
import pandas as pd
ageSeries.hist(bins=np.arange(-0.5, 116.5, 1))
I was able to do what I wanted easily in one line with ggplot2 in R
ggplot(data, aes(x=Age, fill=Category)) + geom_histogram(binwidth = 1)
I wasn't able to find a good solution in Python, but then I realized there was a ggplot2 library for Python and installed it. I tried to do the same ggplot command...
ggplot(data, aes(x="Age", fill="Category")) + geom_histogram(binwidth = 1)
Looking at these results we can see that the different categories are treated as different series and overlaid rather than stacked. I don't want to mess around with transparencies, and I still want to maintain the overall distribution of the population.
Is this something I can fix with a parameter in the ggplot call, or is there a straightforward way to do this in Python at all without doing a bunch of extra dataframe manipulations?

Pandas timeseries plot showing abnormal characters

I used pandas to plot some random time-series data, and I found that it was showing each month as a number followed by a square. Is there any way to fix this?
Here is the code:
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> ts.plot()
<matplotlib.axes._subplots.AxesSubplot object at 0xb2072b0c>
>>> plt.show()
Here is the plot:
This can happen if your system locale is set to a non-English one. Given your name I am going to assume you might be using a Chinese locale. So, Pandas or Matplotlib is generating Chinese calendar characters, but the rendering engine you are using cannot display them.
You have at least two options:
Change your system locale, at least when running this code.
Try a different "backend" for Matplotlib. You can get the list of available ones on your system by following this: List of all available matplotlib backends
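For the first option, the locale can be changed for the running process only, without touching the system settings. A minimal sketch, assuming the unreadable month labels come from the LC_TIME category; the "C" locale is always available and yields English month abbreviations:

```python
import datetime
import locale

# Switch only date/time formatting to the portable "C" locale.
# This affects the current Python process, not the system configuration.
locale.setlocale(locale.LC_TIME, "C")

# strftime now produces English month abbreviations.
month_label = datetime.date(2000, 1, 1).strftime("%b")
print(month_label)
```

The setlocale call would go before ts.plot() in the code above. Whether it removes the squares depends on the labels really being locale-dependent month names, which is the assumption made in this answer.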
