Make a rolling correlation plot on time series data in Python
I want to compute a correlation on a rolling weekly basis in time series data, so that I can see how the rolling correlation moves each year. I tried the built-in pandas.corr() and pandas.rolling_corr() functions to get the rolling correlation and then draw a line plot, but I couldn't get a correct correlation line chart. I don't know how I should aggregate the time series to get a rolling correlation line chart. Does anyone know a way of doing this in Python? Is there a workaround to get a rolling correlation line chart from time series data in pandas?
My attempt:
I tried pandas.corr() to get the correlation, but it did not help me generate a rolling correlation line chart. Here is my new attempt, which is not working either. I assume I need to find the right way to aggregate the data to make a rolling correlation line chart.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/eb784c86c44fd7ed3f2504157a33dc23/raw/79b6aa4f2e0ffd1eb626dffdcb609eb2cb8dae48/corr.csv'
df = pd.read_csv(url)
df['date'] = pd.to_datetime(df['date'])
def get_corr(df, window=4):
    dfs = []
    for key, value in df:
        value["ROLL_CORR"] = pd.rolling_corr(value["prod_A_price"],value["prod_B_price"], window)
        dfs.append(value)
    df_final = pd.concat(dfs)
    return df_final
corr_df = get_corr(df, window=12)
fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
sns.lineplot(x='week', y='ROLL_CORR', hue='year', data=corr_df,alpha=.8)
plt.show()
plt.close()
Doing it this way does not work for me. I want to see how the rolling correlations move each year. Can anyone point out a possible way of making a rolling correlation line chart from time series data in Python?
Desired output
Here is the rolling correlation line chart that I want to get. Note that the desired plot was generated in MS Excel. Is there a way to do this in Python? How should I correct my current attempt to get the desired output?
Using your code and description as a starting point:
pandas' Rolling class has an apply function that can be leveraged (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.apply.html#pandas.core.window.rolling.Rolling.apply)
Two tricks are involved to make the code work:
1. Accessing the whole row in the applied function (see "Pandas rolling apply using multiple columns")
2. Calling the rolling function on a single pandas.Series (here df['week']) to avoid running the applied function once per column
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/eb784c86c44fd7ed3f2504157a33dc23/raw/79b6aa4f2e0ffd1eb626dffdcb609eb2cb8dae48/corr.csv'
df = pd.read_csv(url)
def get_corr(ser):
    # ser is one window of df['week']; its index is used to fetch the full rows
    rolling_df = df.loc[ser.index]
    return rolling_df['prod_A_price'].corr(rolling_df['prod_B_price'])
df['ROLL_CORR'] = df['week'].rolling(4).apply(get_corr)
number_years = 3
# DataFrame.append was removed in pandas 2.0, so collect the average rows and concatenate once
avg_rows = []
for week, df_week in df.groupby('week'):
    avg_rows.append({
        'week': week,
        'year': f'{number_years} year avg',
        'ROLL_CORR': df_week.sort_values(by='date').head(number_years)['ROLL_CORR'].mean()
    })
df = pd.concat([df, pd.DataFrame(avg_rows)], ignore_index=True)
fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
sns.lineplot(x='week', y='ROLL_CORR', hue='year', data=df,alpha=.8)
plt.show()
plt.close()
[Figure: the rolling correlation line chart generated by seaborn, including the 3 year average]
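As a side note, pandas.rolling_corr from the question no longer exists in current pandas versions; its replacement is the corr method on a rolling window. The sketch below is a minimal illustration on synthetic data (the values, the date range and the window size are assumptions, only the column names come from the question), and it may be a simpler alternative to the apply-based approach above when only two columns are involved.

```python
import numpy as np
import pandas as pd

# Synthetic weekly prices; column names mirror the question, the values are made up
rng = np.random.default_rng(0)
weeks = pd.date_range("2018-01-07", periods=30, freq="W")
prices = pd.DataFrame({
    "date": weeks,
    "prod_A_price": rng.normal(100, 5, size=30),
    "prod_B_price": rng.normal(50, 3, size=30),
})

window = 4
# Rolling.corr pairs each window of prod_A_price with the same window of prod_B_price
prices["ROLL_CORR"] = (
    prices["prod_A_price"].rolling(window).corr(prices["prod_B_price"])
)

# The first window - 1 rows have no complete window, so they are NaN
print(prices[["date", "ROLL_CORR"]].head(6))
```

The resulting ROLL_CORR column can be passed to sns.lineplot in the same way as in the answer's code.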
Related
Graphing a pandas DataFrame in Python: how to plot a line after filtering a DataFrame
I have a pandas DataFrame with patient id, date of visit, location, weight, and heart rate. I need to graph the number of visits at one location in the dataset over a period of 12 months, with the month number on the horizontal axis. Any suggestions about how I may go about this? I tried splitting the data into 3 datasets and then graphing the number of visits counted from each one, but creating new columns and assigning the values wasn't working. It only worked when I was graphing the values of all of the clinics; after splitting into 3 DataFrames, it stopped working.
Here is a working example of filtering a DataFrame and using the filtered results to plot a chart.
import pandas as pd
import matplotlib.pyplot as plt
# larger dataframe example
d = {'x values':[1,2,3,4,5,6,7,8,9],'y values':[2,4,6,8,10,12,14,16,18]}
df = pd.DataFrame(d)
# apply filter
df = df[df['x values'] < 5]
# plot chart
plt.plot(df['x values'], df['y values'])
plt.show()
Simply place your data into an ndarray and plot it with matplotlib.pyplot, or plot directly from a DataFrame column, for example plt.plot(df['something']).
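To connect the filtering idea to the original question (visits per month at one location), here is a minimal sketch. The column names, the location names and the records are assumptions, since the question's data was not included.

```python
import pandas as pd

# Hypothetical visit records; column names are assumptions based on the question
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "date_of_visit": pd.to_datetime([
        "2021-01-05", "2021-01-20", "2021-02-11",
        "2021-02-15", "2021-03-02", "2021-03-09",
    ]),
    "location": ["North", "North", "South", "North", "North", "South"],
})

# Keep one location, then count its visits per calendar month
north = visits[visits["location"] == "North"]
monthly_counts = (
    north.groupby(north["date_of_visit"].dt.month)
    .size()
    .reindex(range(1, 13), fill_value=0)  # months without visits count as 0
)

print(monthly_counts)
```

Calling plt.plot(monthly_counts.index, monthly_counts.values) would then draw the line with the month number on the horizontal axis. Filtering before grouping avoids having to split the data into separate DataFrames per location.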
Plot data frame fast and with correct date format
I have the data shown in the screenshot, in a DataFrame, and I would like to plot it quickly and with the correct date format. The following call is much faster than using e.g. plt.plot(df["Date"], df["D30"]):
df.plot(marker='.', linestyle='none')
So I would like to keep using DataFrame.plot() directly, because it is much faster than plotting each column against the "Date" column separately. However, as shown in the graph, the dates are not correct. My actual starting date is 2006-01-10, but the figure starts from 70-01 (1970-01-01). I find the official matplotlib DateFormatter documentation confusing and not very helpful. I searched for an easy and clear solution, but most answers relate to plt.plot(x, y), where x is the date and y is the actual value; after that it is easy to adjust the date format in the figure. But that makes my plot very slow, since I am plotting 11 columns in total. Any idea how I can plot the DataFrame quickly and with the correct date format?
import os
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
date_format = mdates.DateFormatter('%y%m')
df_file = r"C:\Codes\df_file.csv"
df = pd.read_csv(df_file)
print(len(df), df.info(), df["Date"][0], type(df["Date"][0]))
df.head(2)
fig = plt.figure(figsize=(12.0, 8.0))
df.plot(marker='.', linestyle='none')
plt.title("data_frame_show date", fontsize=16)
plt.gca().xaxis.set_major_formatter(date_format)
plt.legend(loc=(1.04, 0))
plt.show()
Partial input:
Date,D10,D30,D60,D91,D122,D152,D182,D273,D365,D547,D730
2006-01-10,,0.1373444,0.1544265,0.1541397,0.1429375,0.1421464,0.1426055,0.1460771,0.1486266,0.1551848,0.1593932
2006-01-11,,0.135426,0.1411246,0.141093,0.1384091,0.1383636,0.1395791,0.1438944,0.1469191,0.1553112,0.1598582
2006-01-12,,0.1311339,0.1292621,0.1304292,0.1363482,0.1362213,0.1367843,0.1404174,0.1439877,0.152306,0.1568677
2006-01-13,,0.1594458,0.1355387,0.1367246,0.1434708,0.143745,0.1441349,0.1453056,0.1481918,0.157193,0.1607564
2006-01-16,,0.1374846,0.1182223,0.1272385,0.1415359,0.1418881,0.1430098,0.1468544,0.1496407,0.1547714,0.158936
2006-01-17,,0.1453834,0.1418838,0.143198,0.1437924,0.143473,0.1440987,0.1473208,0.1501543,0.1590842,0.1629096
2006-01-18,,0.1385479,0.141472,0.1481763,0.1515037,0.1511353,0.1511544,0.1535245,0.1554254,0.1626349,0.1663554
2006-01-19,,0.1639788,0.1462084,0.1483903,0.1486906,0.1483109,0.1492335,0.1539002,0.1563708,0.1611751,0.1644693
2006-01-20,,0.189771,0.178394,0.1638331,0.1565402,0.1559029,0.1553547,0.1526479,0.1516396,0.1614136,0.1646431
2006-01-23,,0.1420271,0.1570005,0.1614942,0.1607205,0.1605297,0.1630065,0.1653838,0.1642349,0.166809,0.1701779
2006-01-24,,0.1814291,0.1633585,0.1563364,0.1548823,0.15382,0.1545099,0.1590869,0.1609158,0.1653819,0.1681759
2006-01-25,,0.1272998,0.1445222,0.1487031,0.1522032,0.152714,0.1524364,0.1532192,0.1550062,0.1635665,0.1658293
2006-01-26,,0.1392162,0.1413034,0.1443807,0.1476261,0.1482458,0.1473548,0.1471019,0.1493254,0.1578586,0.160699
2006-01-27,,0.1360269,0.1374056,0.1387952,0.1426731,0.1441445,0.144917,0.1462428,0.1478979,0.1519537,0.1550311
2006-01-30,,0.1439245,0.1430108,0.1434628,0.1448731,0.1450397,0.1454756,0.1467621,0.1487521,0.1538424,0.1561802
2006-01-31,,0.1483135,0.1468713,0.1473837,0.1519043,0.1519379,0.1502139,0.1504632,0.1529254,0.1571567,0.1589795
2006-02-01,,0.1464208,0.1447363,0.1443483,0.1459808,0.1477726,0.1505124,0.1520256,0.1535773,0.1589145,0.1607383
2006-02-02,,0.1484249,0.1414394,0.1412338,0.1497531,0.1500731,0.1475751,0.147502,0.1512457,0.1571017,0.1606797
2006-02-03,,0.1496503,0.1485318,0.1502473,0.1565336,0.156727,0.1556335,0.1560396,0.1579241,0.1619183,0.1634751
2006-02-06,,0.149966,0.1457216,0.1475524,0.1539103,0.1546401,0.154973,0.1553681,0.1570598,0.161173,0.1630743
2006-02-08,,0.1463649,0.1436135,0.1454147,0.1498372,0.1507231,0.1520234,0.1538407,0.1563603,0.1617697,0.1639547
2006-02-09,,0.1401312,0.1432856,0.1437166,0.1443243,0.1463163,0.148681,0.1496198,0.1516376,0.1584639,0.1615756
2006-02-10,,0.1339916,0.1405194,0.1432779,0.1464605,0.1470921,0.1484831,0.1514307,0.1550715,0.1599564,0.1623171
2006-02-13,,0.1470304,0.1423007,0.1446087,0.1470668,0.1485171,0.1503383,0.1508497,0.1532987,0.1591155,0.1615874
2006-02-14,,0.1454322,0.1449017,0.1455735,0.1462286,0.1478059,0.1501469,0.1522522,0.1541999,0.157668,0.1601427
2006-02-15,,0.1429312,0.1455881,0.1464055,0.1471812,0.1489883,0.1514654,0.153837,0.1559375,0.16082,0.1631557
2006-02-16,,0.134637,0.1373471,0.140634,0.1432172,0.145788,0.14875,0.1507805,0.15325,0.1581015,0.1613797
2006-02-20,,0.1303785,0.1334454,0.139216,0.1423217,0.1454704,0.1477552,0.1487534,0.1509405,0.1554398,0.1588761
2006-02-21,,0.1359587,0.1370814,0.1416117,0.1418016,0.1441761,0.1468109,0.1476679,0.1496546,0.1561362,0.1607204
2006-02-22,,0.1302253,0.1337104,0.1415016,0.141451,0.1438881,0.1467031,0.1502449,0.1514018,0.1531452,0.1582335
2006-02-23,,0.1282022,0.1333902,0.1342376,0.1385976,0.1453201,0.1481733,0.1490296,0.1512885,0.1554035,0.1593463
2006-02-24,,0.1269229,0.1304391,0.1348061,0.1378378,0.1419301,0.1442134,0.1472283,0.1507224,0.1555662,0.1595938
2006-02-27,,0.1254707,0.128201,0.1334554,0.1374389,0.1427246,0.1446071,0.1465459,0.1496113,0.1541296,0.1578174
2006-02-28,,0.1346332,0.1361773,0.139586,0.1421924,0.1468084,0.1489651,0.1505661,0.1541479,0.1606205,0.1675438
2006-03-01,,0.1301198,0.1318495,0.1343342,0.1376886,0.1434328,0.1459977,0.1490832,0.1525961,0.1557153,0.1593923
2006-03-02,,0.1304425,0.1347556,0.1398592,0.1420431,0.1457691,0.1479747,0.1510143,0.1544964,0.1589201,0.1616325
2006-03-03,,0.1311674,0.1339681,0.138887,0.1418598,0.1451706,0.1472144,0.1495689,0.1536886,0.1599843,0.162247
2006-03-06,,0.1308081,0.1367775,0.1412145,0.1436582,0.1480171,0.1495588,0.1511633,0.1545973,0.1588486,0.1616268
2006-03-07,,0.1344355,0.1387528,0.143365,0.1459607,0.1482421,0.1491656,0.1512236,0.1550063,0.1593201,0.1615385
When plotting time series, pandas uses the index for the x-axis when you call the plot function. I would suggest converting the "Date" column to datetime and setting it as the index:
df = df.assign(
    Date=lambda x: pd.to_datetime(x["Date"], format="%Y-%m-%d")
).set_index("Date")
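A minimal sketch of that suggestion on a few rows in the same layout as the question's partial input (only two value columns are kept here for brevity). Without the conversion, the index is the default 0, 1, 2, ... range, which is likely why a date formatter on the axis shows dates near 1970.

```python
import io
import pandas as pd

# A few rows in the same layout as the question's input (two value columns only)
csv_text = """Date,D30,D60
2006-01-10,0.1373444,0.1544265
2006-01-11,0.135426,0.1411246
2006-01-12,0.1311339,0.1292621
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.index)  # default integer index; "Date" is still a column of strings

# Parse the strings and move them into the index
indexed = raw.assign(
    Date=lambda x: pd.to_datetime(x["Date"], format="%Y-%m-%d")
).set_index("Date")

print(type(indexed.index).__name__)
print(indexed.index.min())
```

After this step, indexed.plot(marker='.', linestyle='none') still plots all columns in one call, but with real dates on the x-axis.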
Graphing a dataframe line plot with a legend in Matplotlib
I'm working with a dataset that has grades and states, and I need to create line graphs by state showing what percent of each state's students fall into which bins. My methodology (so far) is as follows.
First I import the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
records = [{'Name':'A', 'Grade':'.15','State':'NJ'},{'Name':'B', 'Grade':'.15','State':'NJ'},{'Name':'C', 'Grade':'.43','State':'CA'},{'Name':'D', 'Grade':'.75','State':'CA'},{'Name':'E', 'Grade':'.17','State':'NJ'},{'Name':'F', 'Grade':'.85','State':'HI'},{'Name':'G', 'Grade':'.89','State':'HI'},{'Name':'H', 'Grade':'.38','State':'CA'},{'Name':'I', 'Grade':'.98','State':'NJ'},{'Name':'J', 'Grade':'.49','State':'NJ'},{'Name':'K', 'Grade':'.17','State':'CA'},{'Name':'K', 'Grade':'.94','State':'HI'},{'Name':'M', 'Grade':'.33','State':'HI'},{'Name':'N', 'Grade':'.22','State':'NJ'},{'Name':'O', 'Grade':'.7','State':'NJ'}]
df = pd.DataFrame(records)
df.Grade = df.Grade.astype(float)
Next I cut each grade into a bin:
df['bin'] = pd.cut(df['Grade'],[-np.inf,.05,.1,.15,.2,.25,.3,.35,.4,.45,.5,.55,.6,.65,.7,.75,.8,.85,.9,.95,1],labels=False)/10
Then I create a pivot table giving me the count of people by bin in each state:
df2 = pd.pivot_table(df,index=['bin'],columns='State',values=['Name'],aggfunc=pd.Series.nunique,margins=True)
df2 = df2.fillna(0)
Then I convert those counts into percentages and remove the margin rows:
df3 = df2.div(df2.iloc[-1])
df3 = df3.iloc[:-1,:-1]
Now I want to create a line graph with multiple lines (one for each state), with the bin on the x-axis and the percentage on the y-axis. df3.plot() gives me the chart I want, but I would like to accomplish the same using matplotlib, because it offers me greater customization of the graph. Running plt.plot(df3) gives me the lines I need, but I can't get the legend to work properly. Any thoughts on how to accomplish this?
It may not be the best way, but I use the pandas plot function to draw df3, then get the legend handles and build new label names. Note that the string processing of the legend labels is specific to this data.
line = df3.plot(kind='line')
handles, labels = line.get_legend_handles_labels()
label = []
for l in labels:
    label.append(l[7:-1])
plt.legend(handles, label, loc='best')
You can do this:
plt.plot(df3,label="label")
plt.legend()
plt.show()
If this helps solve your issue, don't forget to mark it as the accepted answer.
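Since one label string cannot distinguish the states, another option is to call plot once per column and pass the column name as the label. This is a sketch under assumptions: the small DataFrame below is a hypothetical stand-in for df3 with plain state names as columns, and the Agg backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for df3: bins as the index, one column per state
shares = pd.DataFrame(
    {"CA": [0.25, 0.50, 0.25], "HI": [0.00, 0.25, 0.75], "NJ": [0.50, 0.25, 0.25]},
    index=[0.1, 0.4, 0.8],
)

fig, ax = plt.subplots()
for state in shares.columns:
    # One plot call per column, so each line carries its own legend label
    ax.plot(shares.index, shares[state], label=state)

ax.set_xlabel("bin")
ax.set_ylabel("share of students")
legend = ax.legend(loc="best")
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

With the pivot table from the question, the columns are probably tuples such as ('Name', 'CA'), in which case the last element of each tuple would serve as the label.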
Plot stacked bar chart from pandas data frame
I have a DataFrame (shown with payout_df.head(10)). What would be the easiest, smartest and fastest way to replicate the following Excel plot? I've tried different approaches, but couldn't get everything into place. Thanks
If you just want a stacked bar chart, one way is to loop over the columns of the DataFrame, plot each one, and keep track of the cumulative sum, which you then pass as the bottom argument of pyplot.bar:
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval=0
fig = plt.figure(figsize=(12,8))
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
    plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
    cumval = cumval+payout_df[col]
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
Although no sample data was provided, I think the following code will produce the desired graph:
import pandas as pd
import matplotlib.pyplot as plt
df.payout = pd.to_datetime(df.payout)
grouped = df.groupby(pd.Grouper(key='payout', freq='M')).sum()
grouped.plot(x=grouped.index.year, kind='bar', stacked=True)
plt.show()
I don't know how to reproduce this fancy x-axis style. Also, your payout column must be a datetime, otherwise pd.Grouper won't work (see the available frequencies).
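Because the question's data was not posted, here is a small self-contained sketch of the monthly grouping followed by a stacked bar plot. The column names "fixed" and "bonus" and all values are made up, and grouping by a monthly period is used here as a swapped-in alternative to pd.Grouper.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical payout data; the column names other than "payout" are made up
payouts = pd.DataFrame({
    "payout": pd.to_datetime(["2020-01-10", "2020-01-25", "2020-02-05", "2020-02-20"]),
    "fixed": [100.0, 50.0, 80.0, 20.0],
    "bonus": [10.0, 5.0, 0.0, 30.0],
})

# Sum the value columns per calendar month
monthly = payouts.groupby(payouts["payout"].dt.to_period("M"))[["fixed", "bonus"]].sum()
print(monthly)

# Each column becomes one layer of the stacked bars, with one bar per month
ax = monthly.plot(kind="bar", stacked=True, figsize=(8, 4))
ax.set_xlabel("month")
ax.legend(loc="best")
```

Letting DataFrame.plot use the grouped index as the x-axis keeps one bar per month without passing x explicitly.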
Pandas Series not plotting to timeseries chart
I have a data set of house prices (House Price Data). When I use a subset of the data in a NumPy array, I can plot it in a nice time series chart (first image). However, when I use the same data in a pandas Series, the chart goes all lumpy (second image). How can I create a smooth time series line graph (like the first image) using a pandas Series?
Here is what I am doing to get the nice-looking time series chart with a NumPy array (after importing numpy as np, pandas as pd and matplotlib.pyplot as plt):
data = pd.read_csv('HPI.csv', index_col='Date', parse_dates=True)  # pull in csv file, make index the date column and parse the dates
brixton = data[data['RegionName'] == 'Lambeth']  # pull out a subset for the region Lambeth
prices = brixton['AveragePrice'].values  # create a numpy array of the average price values
plt.plot(prices)  # plot
plt.show()  # show
Here is what I am doing to get the lumpy one using a pandas Series:
data = pd.read_csv('HPI.csv', index_col='Date', parse_dates=True)
brixton = data[data['RegionName'] == 'Lambeth']
prices_panda = brixton['AveragePrice']
plt.plot(prices_panda)
plt.show()
How do I make this second graph show as a nice, smooth, proper time series?
This is my first StackOverflow question, so please shout if I have left anything out or not been clear. Any help greatly appreciated.
When you passed parse_dates=True, pandas parsed the dates with its default convention, which is month-day-year. Your data follows the British convention, which is day-month-year. As a result, instead of having a data point for the first of every month, your plot shows data points for the first 12 days of January and a flat line for the rest of each year. You need to swap the day and month, for example:
data.index = pd.to_datetime({'year':data.index.year,'month':data.index.day,'day':data.index.month})
The date format in the file you have is Day/Month/Year. For pandas to interpret this format correctly, you can pass the option dayfirst=True to the read_csv call.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data/UK-HPI-full-file-2017-08.csv', index_col='Date', parse_dates=True, dayfirst=True)
brixton = data[data['RegionName'] == 'Lambeth']
prices_panda = brixton['AveragePrice']
plt.plot(prices_panda)
plt.show()
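The effect of dayfirst can be seen on a tiny in-memory example. The three rows below are a hypothetical extract in day/month/year format, not the actual house price file.

```python
import io
import pandas as pd

# Hypothetical extract with day/month/year dates (the first of each month)
csv_text = """Date,AveragePrice
01/01/2017,500000
01/02/2017,505000
01/03/2017,510000
"""

# Default parsing reads these ambiguous dates as month/day/year
month_first = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

# dayfirst=True reads them as day/month/year
day_first = pd.read_csv(
    io.StringIO(csv_text), index_col="Date", parse_dates=True, dayfirst=True
)

print(list(month_first.index.month), list(month_first.index.day))
print(list(day_first.index.month), list(day_first.index.day))
```

With these dates, the default parse places all three rows in January (days 1, 2 and 3), while dayfirst=True yields the first day of January, February and March, which is the spacing a monthly time series plot needs.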