Inconsistent automatic pandas date labeling - python
I was wondering how exactly pandas formats the dates on the x-axis. I am using the same script on a bunch of data results, which all have the same pandas df format. However, pandas formats the dates of each df differently. How can I make this more consistent?
Each df has a DatetimeIndex like this, with dtype='datetime64[ns]':
>>> df.index
DatetimeIndex(['2014-10-02', '2014-10-03', '2014-10-04', '2014-10-05',
'2014-10-06', '2014-10-07', '2014-10-08', '2014-10-09',
'2014-10-10', '2014-10-11',
...
'2015-09-23', '2015-09-24', '2015-09-25', '2015-09-26',
'2015-09-27', '2015-09-28', '2015-09-29', '2015-09-30',
'2015-10-01', '2015-10-02'],
dtype='datetime64[ns]', name='Date', length=366, freq=None)
Eventually, I plot with df.plot() where the df has two columns.
But the axes of the plots have different styles, like this:
I would like all plots to have the x-axis style of the first plot. pandas should do this automatically, so I'd rather not start formatting the xticks by hand, since I have quite a lot of data to plot. Could anyone explain what to do? Thanks!
EDIT:
I'm reading two csv-files from 2015. The first has the model results of about 200 stations, the second has the gauge measurements of the same stations. Later, I read another two csv-files from 2016 with the same format.
import pandas as pd
df_model = pd.read_csv(path_model, sep=';', index_col=0, parse_dates=True)
df_gauge = pd.read_csv(path_gauge, sep=';', index_col=0, parse_dates=True)
df = pd.DataFrame(columns=['model', 'gauge'], index=df_model.index)
df['model'] = df_model['station_1'].copy()
df['gauge'] = df_gauge['station_1'].copy()
df.plot()
I do this for each year, so the x-axis should look the same, right?
I do not think this is possible unless you modify the pandas library. I looked around a bit for options that can be set in pandas, but couldn't find one. Pandas tries to intelligently select the type of axis ticks using logic implemented here (I think). So in my opinion, it would be best to define your own function to make the plots and then overwrite the tick formatting (although you would rather not do that).
There are many references around the internet which show how to do this. I used this one by "Simone Centellegher" and this Stack Overflow answer to come up with a function that may work for you (tested in Python 3.7.1 with matplotlib 3.0.2, pandas 0.23.4):
import pandas as pd
import numpy as np
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

## pass df with columns you want to plot
def my_plotter(df, xaxis, y_cols):
    fig, ax = plt.subplots()
    ax.plot(xaxis, df[y_cols])
    # Minor ticks every month, major ticks every year
    ax.xaxis.set_minor_locator(mdates.MonthLocator())
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('%b'))
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b\n%Y'))
    # Remove overlapping major and minor ticks
    majticklocs = ax.xaxis.get_majorticklocs()
    minticklocs = ax.xaxis.get_minorticklocs()
    minticks = ax.xaxis.get_minor_ticks()
    for i in range(len(minticks)):
        cur_mintickloc = minticklocs[i]
        if cur_mintickloc in majticklocs:
            minticks[i].set_visible(False)
    return fig, ax

df = pd.DataFrame({'values': np.random.randint(0, 1000, 36)},
                  index=pd.date_range(start='2014-01-01',
                                      end='2016-12-31', freq='M'))
fig, ax = my_plotter(df, df.index, ["values"])
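As a possible shorter variant of the same idea, the sketch below routes every frame through one small helper so that the tick logic is identical for all plots. It is only a sketch under assumptions: the helper name plot_consistent and the random sample data are made up for illustration, and ConciseDateFormatter needs matplotlib 3.1 or later (newer than the versions tested above).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_consistent(df, ax=None):
    """Hypothetical helper: plot df with the same date-tick logic every time."""
    if ax is None:
        _, ax = plt.subplots()
    # x_compat=True tells pandas to skip its own date-tick heuristics,
    # so the matplotlib locator/formatter set below stays in control
    df.plot(ax=ax, x_compat=True)
    locator = mdates.AutoDateLocator()
    ax.xaxis.set_major_locator(locator)
    ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
    return ax


# Made-up stand-in for one year of daily model/gauge data
idx = pd.date_range("2014-10-02", periods=366)
rng = np.random.default_rng(0)
df = pd.DataFrame({"model": rng.random(366), "gauge": rng.random(366)}, index=idx)
ax = plot_consistent(df)
```

Calling the same helper for the 2015 and the 2016 frames should then give axes that follow the same rules, although the exact tick positions still depend on the date range of each frame.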
Related
Plot data frame fast and with correct date format
I have the data shown in the screenshot in a DataFrame, and I would like to plot it quickly and with the correct date format. Calling
df.plot(marker='.', linestyle='none')
is much faster than plotting each column against the "Date" column separately, e.g. with plt.plot(df["Date"], df["D30"]), so I would like to keep using DataFrame.plot() directly. However, as the graph shows, the dates are not correct: my actual starting date is 2006-01-10, but the figure starts at 70-01 (1970-01-01).
I find the official matplotlib documentation for DateFormatter quite confusing and not very helpful. I searched for an easy and clear solution, but most answers use plt.plot(x, y), where x is the date and y is the actual value; after that it is easy to adjust the date format in the figure. But that would make my plot very slow, since I am plotting 11 columns in total. Any idea how I can plot a DataFrame quickly and with the correct date format?
import os
import datetime as dt
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
date_format = mdates.DateFormatter('%y%m')
df_file = r"C:\Codes\df_file.csv"
df = pd.read_csv(df_file)
print(len(df), df.info(), df["Date"][0], type(df["Date"][0]))
df.head(2)
fig = plt.figure(figsize=(12.0, 8.0))
df.plot(marker='.', linestyle='none')
plt.title("data_frame_show date", fontsize=16)
plt.gca().xaxis.set_major_formatter(date_format)
plt.legend(loc=(1.04, 0))
plt.show()
partial input:
Date,D10,D30,D60,D91,D122,D152,D182,D273,D365,D547,D730
2006-01-10,,0.1373444,0.1544265,0.1541397,0.1429375,0.1421464,0.1426055,0.1460771,0.1486266,0.1551848,0.1593932
2006-01-11,,0.135426,0.1411246,0.141093,0.1384091,0.1383636,0.1395791,0.1438944,0.1469191,0.1553112,0.1598582
2006-01-12,,0.1311339,0.1292621,0.1304292,0.1363482,0.1362213,0.1367843,0.1404174,0.1439877,0.152306,0.1568677
2006-01-13,,0.1594458,0.1355387,0.1367246,0.1434708,0.143745,0.1441349,0.1453056,0.1481918,0.157193,0.1607564 2006-01-16,,0.1374846,0.1182223,0.1272385,0.1415359,0.1418881,0.1430098,0.1468544,0.1496407,0.1547714,0.158936 2006-01-17,,0.1453834,0.1418838,0.143198,0.1437924,0.143473,0.1440987,0.1473208,0.1501543,0.1590842,0.1629096 2006-01-18,,0.1385479,0.141472,0.1481763,0.1515037,0.1511353,0.1511544,0.1535245,0.1554254,0.1626349,0.1663554 2006-01-19,,0.1639788,0.1462084,0.1483903,0.1486906,0.1483109,0.1492335,0.1539002,0.1563708,0.1611751,0.1644693 2006-01-20,,0.189771,0.178394,0.1638331,0.1565402,0.1559029,0.1553547,0.1526479,0.1516396,0.1614136,0.1646431 2006-01-23,,0.1420271,0.1570005,0.1614942,0.1607205,0.1605297,0.1630065,0.1653838,0.1642349,0.166809,0.1701779 2006-01-24,,0.1814291,0.1633585,0.1563364,0.1548823,0.15382,0.1545099,0.1590869,0.1609158,0.1653819,0.1681759 2006-01-25,,0.1272998,0.1445222,0.1487031,0.1522032,0.152714,0.1524364,0.1532192,0.1550062,0.1635665,0.1658293 2006-01-26,,0.1392162,0.1413034,0.1443807,0.1476261,0.1482458,0.1473548,0.1471019,0.1493254,0.1578586,0.160699 2006-01-27,,0.1360269,0.1374056,0.1387952,0.1426731,0.1441445,0.144917,0.1462428,0.1478979,0.1519537,0.1550311 2006-01-30,,0.1439245,0.1430108,0.1434628,0.1448731,0.1450397,0.1454756,0.1467621,0.1487521,0.1538424,0.1561802 2006-01-31,,0.1483135,0.1468713,0.1473837,0.1519043,0.1519379,0.1502139,0.1504632,0.1529254,0.1571567,0.1589795 2006-02-01,,0.1464208,0.1447363,0.1443483,0.1459808,0.1477726,0.1505124,0.1520256,0.1535773,0.1589145,0.1607383 2006-02-02,,0.1484249,0.1414394,0.1412338,0.1497531,0.1500731,0.1475751,0.147502,0.1512457,0.1571017,0.1606797 2006-02-03,,0.1496503,0.1485318,0.1502473,0.1565336,0.156727,0.1556335,0.1560396,0.1579241,0.1619183,0.1634751 2006-02-06,,0.149966,0.1457216,0.1475524,0.1539103,0.1546401,0.154973,0.1553681,0.1570598,0.161173,0.1630743 2006-02-08,,0.1463649,0.1436135,0.1454147,0.1498372,0.1507231,0.1520234,0.1538407,0.1563603,0.1617697,0.1639547 
2006-02-09,,0.1401312,0.1432856,0.1437166,0.1443243,0.1463163,0.148681,0.1496198,0.1516376,0.1584639,0.1615756 2006-02-10,,0.1339916,0.1405194,0.1432779,0.1464605,0.1470921,0.1484831,0.1514307,0.1550715,0.1599564,0.1623171 2006-02-13,,0.1470304,0.1423007,0.1446087,0.1470668,0.1485171,0.1503383,0.1508497,0.1532987,0.1591155,0.1615874 2006-02-14,,0.1454322,0.1449017,0.1455735,0.1462286,0.1478059,0.1501469,0.1522522,0.1541999,0.157668,0.1601427 2006-02-15,,0.1429312,0.1455881,0.1464055,0.1471812,0.1489883,0.1514654,0.153837,0.1559375,0.16082,0.1631557 2006-02-16,,0.134637,0.1373471,0.140634,0.1432172,0.145788,0.14875,0.1507805,0.15325,0.1581015,0.1613797 2006-02-20,,0.1303785,0.1334454,0.139216,0.1423217,0.1454704,0.1477552,0.1487534,0.1509405,0.1554398,0.1588761 2006-02-21,,0.1359587,0.1370814,0.1416117,0.1418016,0.1441761,0.1468109,0.1476679,0.1496546,0.1561362,0.1607204 2006-02-22,,0.1302253,0.1337104,0.1415016,0.141451,0.1438881,0.1467031,0.1502449,0.1514018,0.1531452,0.1582335 2006-02-23,,0.1282022,0.1333902,0.1342376,0.1385976,0.1453201,0.1481733,0.1490296,0.1512885,0.1554035,0.1593463 2006-02-24,,0.1269229,0.1304391,0.1348061,0.1378378,0.1419301,0.1442134,0.1472283,0.1507224,0.1555662,0.1595938 2006-02-27,,0.1254707,0.128201,0.1334554,0.1374389,0.1427246,0.1446071,0.1465459,0.1496113,0.1541296,0.1578174 2006-02-28,,0.1346332,0.1361773,0.139586,0.1421924,0.1468084,0.1489651,0.1505661,0.1541479,0.1606205,0.1675438 2006-03-01,,0.1301198,0.1318495,0.1343342,0.1376886,0.1434328,0.1459977,0.1490832,0.1525961,0.1557153,0.1593923 2006-03-02,,0.1304425,0.1347556,0.1398592,0.1420431,0.1457691,0.1479747,0.1510143,0.1544964,0.1589201,0.1616325 2006-03-03,,0.1311674,0.1339681,0.138887,0.1418598,0.1451706,0.1472144,0.1495689,0.1536886,0.1599843,0.162247 2006-03-06,,0.1308081,0.1367775,0.1412145,0.1436582,0.1480171,0.1495588,0.1511633,0.1545973,0.1588486,0.1616268 2006-03-07,,0.1344355,0.1387528,0.143365,0.1459607,0.1482421,0.1491656,0.1512236,0.1550063,0.1593201,0.1615385
When plotting time series, pandas uses the index for the x-axis when you call the plot function. I would suggest converting the "Date" column to datetime and making it the index:
df = df.assign(
    Date=lambda x: pd.to_datetime(x["Date"], format="%Y-%m-%d")
).set_index("Date")
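A minimal sketch of that suggestion, assuming the dates can already be parsed while reading the file. The in-memory CSV below reuses three rows and two columns of the partial input from the question and stands in for the real file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Small stand-in for the real CSV file (values taken from the question)
csv_text = """Date,D30,D60
2006-01-10,0.1373444,0.1544265
2006-01-11,0.135426,0.1411246
2006-01-12,0.1311339,0.1292621
"""

# Parse "Date" while reading and use it as the index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# With a DatetimeIndex, DataFrame.plot() puts real dates on the x-axis
ax = df.plot(marker=".", linestyle="none")
```

Because all columns are still drawn by a single DataFrame.plot() call, this keeps the approach the question describes as fast.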
Python vs matplotlib - Chart generation issue
I have the Python code below, but as output it gives a chart like the one in the attachment, which is really messy. Can anybody tell me how to fix the issue and put the days in ascending order on the x-axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C/desktop/data.xlsx")
df = df.loc[df['month'] == 8]
df = df.astype({'day': str})
plt.plot('day', 'cases', data=df)
At first I didn't convert the day to str, and it came out like this. Because it had decimal numbers, I converted it to str, and now this happens.
What you got is typical of an unsorted dataset with many points per group. As you did not provide an example, here is one:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.DataFrame({'day': np.random.randint(1,21,size=100),
                   'cases': np.random.randint(0,50000,size=100),
                  })
plt.plot('day', 'cases', data=df)
There is no reason to plot a line in this case; you can use a scatter plot instead:
plt.scatter('day', 'cases', data=df)
To make more sense of your data, you can also compute an aggregated value (e.g. the mean):
plt.plot('day', 'cases', data=df.groupby('day', as_index=False)['cases'].mean())
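The question mentions converting the day to str because it showed decimal numbers. A possible alternative, sketched here on made-up data, is to cast the column to int instead, so that the days stay numeric and sort in ascending order on the axis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: days stored as floats, unsorted, with repeated days
df = pd.DataFrame({"day": [3.0, 1.0, 2.0, 1.0], "cases": [30, 10, 20, 14]})

# Cast to int (3.0 -> 3) rather than to str, so the axis stays numeric
df["day"] = df["day"].astype(int)

# One value per day, in ascending day order
daily = df.groupby("day", as_index=False)["cases"].mean().sort_values("day")

fig, ax = plt.subplots()
ax.plot(daily["day"], daily["cases"], marker="o")
```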
Integrating over range of dates, and labeling the xaxis
I am trying to integrate 2 curves as they change through time using pandas. I am loading data from a CSV file (contents shown below), where the dates are the x-axis and both the Oil and Water points are on the y-axis. I have learned to use the cross-section option to isolate the "NAME" values, but am having trouble finding a good way to integrate with dates as the x-axis. Eventually I would like to take the integrals of both curves and stack them against each other.
I am also having trouble with the plot defaulting the x-ticks to arbitrary values instead of the dates. I can change the labels/ticks manually, but I have a large CSV to process and would like to automate this. Any help would be greatly appreciated.
Contents of the CSV in text form:
NAME,DATE,O,W
A,1/20/2000,12,50
B,1/20/2000,25,28
C,1/20/2000,14,15
A,1/21/2000,34,50
B,1/21/2000,8,3
C,1/21/2000,10,19
A,1/22/2000,47,35
B,1/22/2000,4,27
C,1/22/2000,46,1
A,1/23/2000,19,31
B,1/23/2000,18,10
C,1/23/2000,19,41
Further to my comment above, here is some sample code (using logic from the example mentioned) to label your x-axis with formatted dates. Hope this helps.
Data Collection / Imports:
Just re-creating your dataset for the example.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
header = ['NAME', 'DATE', 'O', 'W']
data = [['A','1/20/2000',12,50],
        ['B','1/20/2000',25,28],
        ['C','1/20/2000',14,15],
        ['A','1/21/2000',34,50],
        ['B','1/21/2000',8,3],
        ['C','1/21/2000',10,19],
        ['A','1/22/2000',47,35],
        ['B','1/22/2000',4,27],
        ['C','1/22/2000',46,1],
        ['A','1/23/2000',19,31],
        ['B','1/23/2000',18,10],
        ['C','1/23/2000',19,41]]
df = pd.DataFrame(data, columns=header)
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
# Subset to just the 'A' labels.
df_a = df[df['NAME'] == 'A']
Plotting:
# Define the number of ticks you need.
nticks = 4
# Define the date format.
mask = '%m-%d-%Y'
# Create the set of custom date labels.
step = int(df_a.shape[0] / nticks)
xdata = np.arange(df_a.shape[0])
xlabels = df_a['DATE'].dt.strftime(mask).tolist()[::step]
# Create the plot.
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, df_a['O'], label='Oil')
ax.plot(xdata, df_a['W'], label='Water')
ax.set_xticks(np.arange(df_a.shape[0], step=step))
ax.set_xticklabels(xlabels, rotation=45, horizontalalignment='right')
ax.set_title('Test in Naming Labels for the X-Axis')
ax.legend()
I'd recommend converting the x-axis into integers or floats (seconds, minutes, hours, or days since a reference time, depending on the precision you need). You can then use the usual methods to integrate, and the x-axis will no longer default to some other values. See How to convert datetime to integer in python
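A minimal sketch of that idea, using the 'A' rows from the CSV in the question. The choice of days since the first date as the unit and the helper name trapezoid are assumptions made for illustration; the trapezoidal rule is written out by hand so the sketch does not depend on a particular NumPy version.

```python
import numpy as np
import pandas as pd

# The 'A' rows from the question's CSV
df_a = pd.DataFrame({
    "DATE": ["1/20/2000", "1/21/2000", "1/22/2000", "1/23/2000"],
    "O": [12, 34, 47, 19],
    "W": [50, 50, 35, 31],
})
df_a["DATE"] = pd.to_datetime(df_a["DATE"], format="%m/%d/%Y")

# Numeric x-axis: elapsed days since the first date
elapsed = df_a["DATE"] - df_a["DATE"].iloc[0]
days = elapsed.dt.total_seconds().to_numpy() / 86400.0


def trapezoid(y, x):
    """Hypothetical helper: trapezoidal-rule integral of y over x."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))


oil_total = trapezoid(df_a["O"], days)    # 96.5 (value * days)
water_total = trapezoid(df_a["W"], days)  # 125.5 (value * days)
```

The two totals can then be compared or stacked directly, and the original dates remain available in df_a["DATE"] for labeling the axis.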
Matplotlib - show x axis values - dates [duplicate]
Compare the following code:
test = pd.DataFrame({'date':['20170527','20170526','20170525'],'ratio1':[1,0.98,0.97]})
test['date'] = pd.to_datetime(test['date'])
test = test.set_index('date')
ax = test.plot()
I added a DateFormatter at the end:
test = pd.DataFrame({'date':['20170527','20170526','20170525'],'ratio1':[1,0.98,0.97]})
test['date'] = pd.to_datetime(test['date'])
test = test.set_index('date')
ax = test.plot()
ax.xaxis.set_minor_formatter(dates.DateFormatter('%d\n\n%a')) ## Added this line
The issue with the second graph is that it starts on 5-24 instead of 5-25. Also, 5-25 of 2017 is a Thursday, not a Monday. What is causing the issue? Is this timezone related? (I don't understand why the date numbers are stacked on top of each other either.)
In general, the datetime utilities of pandas and matplotlib are incompatible, so trying to use a matplotlib.dates object on a date axis created with pandas will in most cases fail.
One reason can be seen from the documentation:
datetime objects are converted to floating point numbers which represent time in days since 0001-01-01 UTC, plus 1. For example, 0001-01-01, 06:00 is 1.25, not 0.25.
However, this is not the only difference, and it is thus advisable not to mix pandas and matplotlib when it comes to datetime objects.
There is, however, the option to tell pandas not to use its own datetime format. In that case, using the matplotlib.dates tickers is possible. This is controlled via
df.plot(x_compat=True)
Since pandas does not provide sophisticated formatting capabilities for dates, one can use matplotlib for plotting and formatting.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dates
df = pd.DataFrame({'date':['20170527','20170526','20170525'],'ratio1':[1,0.98,0.97]})
df['date'] = pd.to_datetime(df['date'])
usePandas=True
if usePandas:
    # Either use pandas
    df = df.set_index('date')
    df.plot(x_compat=True)
    plt.gca().xaxis.set_major_locator(dates.DayLocator())
    plt.gca().xaxis.set_major_formatter(dates.DateFormatter('%d\n\n%a'))
    plt.gca().invert_xaxis()
    plt.gcf().autofmt_xdate(rotation=0, ha="center")
else:
    # or use matplotlib
    plt.plot(df["date"], df["ratio1"])
    plt.gca().xaxis.set_major_locator(dates.DayLocator())
    plt.gca().xaxis.set_major_formatter(dates.DateFormatter('%d\n\n%a'))
    plt.gca().invert_xaxis()
plt.show()
Updated using the matplotlib object-oriented API:
usePandas=True
if usePandas:
    # Either use pandas
    df = df.set_index('date')
    ax = df.plot(x_compat=True, figsize=(6, 4))
    ax.xaxis.set_major_locator(dates.DayLocator())
    ax.xaxis.set_major_formatter(dates.DateFormatter('%d\n\n%a'))
    ax.invert_xaxis()
    ax.get_figure().autofmt_xdate(rotation=0, ha="center")
else:
    # or use matplotlib
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot('date', 'ratio1', data=df)
    ax.xaxis.set_major_locator(dates.DayLocator())
    ax.xaxis.set_major_formatter(dates.DateFormatter('%d\n\n%a'))
    ax.invert_xaxis()  # invert_xaxis is an Axes method, not a Figure method
plt.show()
Time-series boxplot in pandas
How can I create a boxplot for a pandas time series where I have a box for each day? Sample dataset of hourly data, where one box should consist of 24 values:
import numpy as np
import pandas as pd
n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts.plot()
I am aware that I could make an extra column for the day, but I would like to have proper x-axis labeling and x-limit functionality (like in ts.plot()), so being able to work with the datetime index would be great. There is a similar question for R/ggplot2 here, if it helps to clarify what I want.
If it's an option for you, I would recommend using Seaborn, which is a wrapper for Matplotlib. You could do it yourself by looping over the groups from your time series, but that's much more work.
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
# x and y are passed by keyword; recent seaborn releases require this
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
Note that I'm passing the day of year as the grouper to seaborn; if your data spans multiple years, this wouldn't work. You could then consider something like:
ts.index.to_series().apply(lambda x: x.strftime('%Y%m%d'))
Edit: for 3-hourly boxes you could use this as a grouper (it needs import datetime), but it only works if there are no minutes or smaller units defined:
[(dt - datetime.timedelta(hours=int(dt.hour % 3))).strftime('%Y%m%d%H') for dt in ts.index]
(Not enough rep to comment on the accepted solution, so adding an answer instead.)
The accepted code has two small errors: (1) the numpy import needs to be added, and (2) the x and y parameters in the boxplot statement need to be swapped. The following produces the plot shown.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
# x is the grouping variable (day of year), y holds the values
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
I have a solution that may be helpful. It only uses native pandas and allows for hierarchical date-time grouping (i.e. spanning years). The key is that if you pass a function to groupby(), it will be called on each element of the dataframe's index. If your index is a DatetimeIndex (or similar), each element is a Timestamp, so you can use its convenience methods (such as strftime) to build the groups. Try this:
import numpy as np
import pandas as pd
n = 480
ts = pd.DataFrame(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts.groupby(lambda x: x.strftime("%Y-%m-%d")).boxplot(subplots=False, figsize=(12,9), rot=90)
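For completeness, the "loop over the groups yourself" route mentioned in the accepted answer could look roughly like the sketch below, which uses only pandas and matplotlib. It assumes one box per calendar day; the random data is made up, and the hourly index is built from a Timedelta so that no frequency alias is involved.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

n = 480
index = pd.date_range(start="2014-02-01", periods=n, freq=pd.Timedelta(hours=1))
ts = pd.Series(np.random.default_rng(0).standard_normal(n), index=index)

# Group by calendar day: normalize() sets every timestamp to midnight
grouped = ts.groupby(ts.index.normalize())
labels = [day.strftime("%Y-%m-%d") for day, _ in grouped]
samples = [values.to_numpy() for _, values in grouped]

fig, ax = plt.subplots(figsize=(12, 5))
ax.boxplot(samples)  # boxes are drawn at positions 1..len(samples)
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels, rotation=90)
```

Since the grouping key is a full date, this also works when the data spans several years, at the cost of a categorical rather than a true datetime x-axis.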