Calculating Bull/Bear Markets in Pandas - python

I was recently given a challenge of calculating the presence of Bull/Bear markets using the values of -1, 1 to denote which one is which.
It is straight forward enough to do this with a for-loop but I know this is the worst way to do these things and it's better to use numpy/pandas methods if possible. However, I'm not seeing an easy way to do it.
Any ways of how to do this, maybe using changes of +/- 20% from current place to determine which regime you're in.
Here's a sample dataframe:
dates = pd.date_range(start='1950-01-01', periods=25000)
rand = np.random.RandomState(42)
vals = np.zeros(25000)
vals[0] = 15
for i in range(1, 25000):
vals[i] = vals[i-1] + rand.normal(0, 1)
df = pd.DataFrame(vals, columns=['Price'], index=dates)
The plot of these prices looks like this:
Anyone have any recommendations to calculate what regime the current point is in?
If you have to use a for loop then that's fine.

Here is a solution using the S&P 500 index from Yahoo! Finance (ticker ^GSPC):
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import yfinance as yf
import requests_cache
session = requests_cache.CachedSession()
df = yf.download('^GSPC', session=session)
df = df[['Adj Close']].copy()
df['dd'] = df['Adj Close'].div(df['Adj Close'].cummax()).sub(1)
df['ddn'] = ((df['dd'] < 0.) & (df['dd'].shift() == 0.)).cumsum()
df['ddmax'] = df.groupby('ddn')['dd'].transform('min')
df['bear'] = (df['ddmax'] < -0.2) & (df['ddmax'] < df.groupby('ddn')['dd'].transform('cummin'))
df['bearn'] = ((df['bear'] == True) & (df['bear'].shift() == False)).cumsum()
bears = df.reset_index().query('bear == True').groupby('bearn')['Date'].agg(['min', 'max'])
print(bears)
df['Adj Close'].plot()
for i, row in bears.iterrows():
plt.fill_between(row, df['Adj Close'].max(), alpha=0.25, color='r')
plt.gca().yaxis.set_major_formatter(plt.matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))
plt.ylabel('S&P 500 Index (^GSPC)')
plt.title('S&P 500 Index with Bear Markets (> 20% Declines)')
plt.savefig('bears.png')
plt.show()
Here are the bear markets in data frame bears:
min max
bearn
1 1956-08-06 1957-10-21
2 1961-12-13 1962-06-25
3 1966-02-10 1966-10-06
4 1968-12-02 1970-05-25
5 1973-01-12 1974-10-02
6 1980-12-01 1982-08-11
7 1987-08-26 1987-12-03
8 2000-03-27 2002-10-08
9 2007-10-10 2009-03-06
10 2020-02-20 2020-03-20
11 2022-01-04 2022-06-15
Here is a plot:
Edit: I think this is an improvement from my first solution since ^GSPC provides a longer time series and bear markets are not typically dividend-adjusted.

I think this might work:
import numpy as np
vals = np.random.normal(0, 1, 25000)
vals[0] = 15
vals = np.cumsum(vals)

Related

Find intersection points for two stock timeseries

Background
I am trying to find intersection points of two series. In this stock example, I would like to find the intersection points of SMA20 & SMA50. Simple Moving Average (SMA) is commonly used as stock indicators, combined with intersections and other strategies will help one to make decision. Below is the code example.
Code
You can run the following with jupyter.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
datafile = 'output_XAG_D1_20200101_to_20200601.csv'
#This creates a dataframe from the CSV file:
data = pd.read_csv(datafile, index_col = 'Date')
#This selects the 'Adj Close' column
close = data['BidClose']
#This converts the date strings in the index into pandas datetime format:
close.index = pd.to_datetime(close.index)
close
sma20 = close.rolling(window=20).mean()
sma50 = close.rolling(window=50).mean()
priceSma_df = pd.DataFrame({
'BidClose' : close,
'SMA 20' : sma20,
'SMA 50' : sma50
})
priceSma_df.plot()
plt.show()
Sample Data
This is the data file used in example output_XAG_D1_20200101_to_20200601.csv
Date,BidOpen,BidHigh,BidLow,BidClose,AskOpen,AskHigh,AskLow,AskClose,Volume
01.01.2020 22:00:00,1520.15,1531.26,1518.35,1527.78,1520.65,1531.75,1518.73,1531.73,205667
01.02.2020 22:00:00,1527.78,1553.43,1526.72,1551.06,1531.73,1553.77,1528.17,1551.53,457713
01.05.2020 22:00:00,1551.06,1588.16,1551.06,1564.4,1551.53,1590.51,1551.53,1568.32,540496
01.06.2020 22:00:00,1564.4,1577.18,1555.2,1571.62,1568.32,1577.59,1555.54,1575.56,466430
01.07.2020 22:00:00,1571.62,1611.27,1552.13,1554.79,1575.56,1611.74,1552.48,1558.72,987671
01.08.2020 22:00:00,1554.79,1561.24,1540.08,1549.78,1558.72,1561.58,1540.5,1553.73,473799
01.09.2020 22:00:00,1549.78,1563.0,1545.62,1562.44,1553.73,1563.41,1545.96,1562.95,362002
01.12.2020 22:00:00,1562.44,1562.44,1545.38,1545.46,1562.95,1563.06,1546.71,1549.25,280809
01.13.2020 22:00:00,1545.46,1548.77,1535.78,1545.1,1549.25,1549.25,1536.19,1548.87,378200
01.14.2020 22:00:00,1545.1,1558.04,1543.79,1554.89,1548.87,1558.83,1546.31,1558.75,309719
01.15.2020 22:00:00,1554.89,1557.98,1547.91,1551.18,1558.75,1558.75,1548.24,1554.91,253944
01.16.2020 22:00:00,1551.18,1561.12,1549.28,1556.68,1554.91,1561.55,1549.59,1557.15,239186
01.19.2020 22:00:00,1556.68,1562.69,1556.25,1560.77,1557.15,1562.97,1556.61,1561.17,92020
01.20.2020 22:00:00,1560.77,1568.49,1546.21,1556.8,1561.17,1568.87,1546.56,1558.5,364753
01.21.2020 22:00:00,1556.8,1559.18,1550.07,1558.59,1558.5,1559.47,1550.42,1559.31,238468
01.22.2020 22:00:00,1558.59,1567.83,1551.8,1562.45,1559.31,1568.16,1552.11,1564.17,365518
01.23.2020 22:00:00,1562.45,1575.77,1556.44,1570.39,1564.17,1576.12,1556.76,1570.87,368529
01.26.2020 22:00:00,1570.39,1588.41,1570.39,1580.51,1570.87,1588.97,1570.87,1582.33,510524
01.27.2020 22:00:00,1580.51,1582.93,1565.31,1567.15,1582.33,1583.3,1565.79,1570.62,384205
01.28.2020 22:00:00,1567.15,1577.93,1563.27,1576.7,1570.62,1578.22,1563.61,1577.25,328766
01.29.2020 22:00:00,1576.7,1585.87,1572.19,1573.23,1577.25,1586.18,1572.44,1575.33,522371
01.30.2020 22:00:00,1573.23,1589.98,1570.82,1589.75,1575.33,1590.37,1571.14,1590.31,482710
02.02.2020 22:00:00,1589.75,1593.09,1568.65,1575.62,1590.31,1595.82,1569.85,1578.35,488585
02.03.2020 22:00:00,1575.62,1579.56,1548.95,1552.55,1578.35,1579.87,1549.31,1556.4,393037
02.04.2020 22:00:00,1552.55,1562.3,1547.34,1554.62,1556.4,1562.64,1547.72,1556.42,473172
02.05.2020 22:00:00,1554.62,1568.14,1552.39,1565.08,1556.42,1568.51,1552.73,1567.0,365580
02.06.2020 22:00:00,1565.08,1574.02,1559.82,1570.11,1567.0,1574.33,1560.7,1570.55,424269
02.09.2020 22:00:00,1570.11,1576.9,1567.9,1571.05,1570.55,1577.25,1568.21,1573.34,326606
02.10.2020 22:00:00,1571.05,1573.92,1561.92,1566.12,1573.34,1574.27,1562.24,1568.12,310037
02.11.2020 22:00:00,1566.12,1570.39,1561.45,1564.26,1568.12,1570.71,1561.91,1567.02,269032
02.12.2020 22:00:00,1564.26,1578.24,1564.26,1574.5,1567.02,1578.52,1565.81,1576.63,368438
02.13.2020 22:00:00,1574.5,1584.87,1572.44,1584.49,1576.63,1585.29,1573.28,1584.91,250788
02.16.2020 22:00:00,1584.49,1584.49,1578.7,1580.79,1584.91,1584.91,1579.06,1581.31,101499
02.17.2020 22:00:00,1580.79,1604.97,1580.79,1601.06,1581.31,1605.33,1581.31,1603.08,321542
02.18.2020 22:00:00,1601.06,1612.83,1599.41,1611.27,1603.08,1613.4,1599.77,1613.34,357488
02.19.2020 22:00:00,1611.27,1623.62,1603.74,1618.48,1613.34,1623.98,1604.12,1621.27,535148
02.20.2020 22:00:00,1618.48,1649.26,1618.48,1643.42,1621.27,1649.52,1619.19,1643.87,590262
02.23.2020 22:00:00,1643.42,1689.22,1643.42,1658.62,1643.87,1689.55,1643.87,1659.07,1016570
02.24.2020 22:00:00,1658.62,1660.76,1624.9,1633.19,1659.07,1661.52,1625.5,1636.23,1222774
02.25.2020 22:00:00,1633.19,1654.88,1624.74,1640.4,1636.23,1655.23,1625.11,1642.59,1004692
02.26.2020 22:00:00,1640.4,1660.3,1635.15,1643.99,1642.59,1660.6,1635.6,1646.42,1084115
02.27.2020 22:00:00,1643.99,1649.39,1562.74,1584.95,1646.42,1649.84,1563.22,1585.58,1174015
03.01.2020 22:00:00,1584.95,1610.94,1575.29,1586.55,1585.58,1611.26,1575.88,1590.33,1115889
03.02.2020 22:00:00,1586.55,1649.16,1586.55,1640.19,1590.33,1649.6,1589.43,1644.16,889364
03.03.2020 22:00:00,1640.19,1652.81,1631.73,1635.95,1644.16,1653.51,1632.1,1639.05,589438
03.04.2020 22:00:00,1635.95,1674.51,1634.91,1669.36,1639.05,1674.9,1635.3,1672.83,643444
03.05.2020 22:00:00,1669.36,1692.1,1641.61,1673.89,1672.83,1692.65,1642.75,1674.46,1005737
03.08.2020 21:00:00,1673.89,1703.19,1656.98,1678.31,1674.46,1703.52,1657.88,1679.2,910166
03.09.2020 21:00:00,1678.31,1680.43,1641.37,1648.71,1679.2,1681.18,1641.94,1649.75,943377
03.10.2020 21:00:00,1648.71,1671.15,1632.9,1634.42,1649.75,1671.56,1633.31,1637.07,793816
03.11.2020 21:00:00,1634.42,1650.28,1560.5,1578.29,1637.07,1650.8,1560.92,1580.01,1009172
03.12.2020 21:00:00,1578.29,1597.85,1504.34,1528.99,1580.01,1598.36,1505.14,1530.09,1052940
03.15.2020 21:00:00,1528.99,1575.2,1451.08,1509.12,1530.09,1576.05,1451.49,1512.94,1196812
03.16.2020 21:00:00,1509.12,1553.91,1465.4,1528.57,1512.94,1554.21,1466.1,1529.43,1079729
03.17.2020 21:00:00,1528.57,1545.93,1472.49,1485.85,1529.43,1546.74,1472.99,1486.75,976857
03.18.2020 21:00:00,1485.85,1500.68,1463.49,1471.89,1486.75,1501.6,1464.64,1474.16,833803
03.19.2020 21:00:00,1471.89,1516.07,1454.46,1497.01,1474.16,1516.57,1455.93,1497.82,721471
03.22.2020 21:00:00,1497.01,1560.86,1482.21,1551.45,1497.82,1561.65,1483.22,1553.09,707830
03.23.2020 21:00:00,1551.45,1631.23,1551.45,1621.05,1553.09,1638.75,1553.09,1631.35,164862
03.24.2020 21:00:00,1621.05,1636.23,1588.82,1615.77,1631.35,1650.03,1601.29,1618.47,205272
03.25.2020 21:00:00,1615.77,1642.96,1587.7,1628.31,1618.47,1649.81,1599.87,1633.29,152804
03.26.2020 21:00:00,1628.31,1630.48,1606.76,1617.5,1633.29,1638.48,1616.9,1622.8,307278
03.29.2020 21:00:00,1617.5,1631.48,1602.51,1620.91,1622.8,1643.86,1612.55,1623.77,291653
03.30.2020 21:00:00,1620.91,1626.55,1573.37,1574.9,1623.77,1627.31,1575.24,1579.1,371507
03.31.2020 21:00:00,1574.9,1600.41,1560.13,1590.13,1579.1,1603.42,1570.75,1592.43,412780
04.01.2020 21:00:00,1590.13,1619.76,1582.42,1612.07,1592.43,1621.1,1583.37,1614.49,704652
04.02.2020 21:00:00,1612.07,1625.21,1605.39,1618.63,1614.49,1626.83,1607.69,1621.37,409490
04.05.2020 21:00:00,1618.63,1668.35,1608.59,1657.77,1621.37,1670.98,1609.7,1663.43,381690
04.06.2020 21:00:00,1657.77,1671.95,1641.84,1644.84,1663.43,1677.53,1643.4,1650.46,286313
04.07.2020 21:00:00,1644.84,1656.39,1640.1,1644.06,1650.46,1657.43,1643.46,1646.66,219464
04.08.2020 21:00:00,1644.06,1689.66,1643.05,1682.16,1646.66,1691.13,1644.83,1686.74,300111
04.12.2020 21:00:00,1682.16,1722.25,1677.35,1709.16,1686.74,1725.48,1680.49,1718.28,280905
04.13.2020 21:00:00,1709.16,1747.04,1708.56,1726.18,1718.28,1748.88,1709.36,1729.72,435098
04.14.2020 21:00:00,1726.18,1730.53,1706.67,1714.35,1729.72,1732.97,1708.95,1717.25,419065
04.15.2020 21:00:00,1714.35,1738.65,1707.83,1715.99,1717.25,1740.35,1708.93,1720.09,615105
04.16.2020 21:00:00,1715.99,1718.46,1677.16,1683.2,1720.09,1720.09,1680.55,1684.97,587875
04.19.2020 21:00:00,1683.2,1702.49,1671.1,1694.71,1684.97,1703.46,1672.02,1697.29,412116
04.20.2020 21:00:00,1694.71,1697.66,1659.42,1683.4,1697.29,1698.44,1662.3,1686.58,502893
04.21.2020 21:00:00,1683.4,1718.21,1679.61,1713.67,1686.58,1719.19,1680.71,1716.91,647622
04.22.2020 21:00:00,1713.67,1738.59,1706.93,1729.89,1716.91,1739.47,1707.72,1731.83,751833
04.23.2020 21:00:00,1729.89,1736.31,1710.56,1726.74,1731.83,1736.98,1711.03,1727.71,608827
04.26.2020 21:00:00,1726.74,1727.55,1705.99,1713.36,1727.71,1728.55,1706.72,1715.29,698217
04.27.2020 21:00:00,1713.36,1716.52,1691.41,1707.66,1715.29,1718.02,1692.51,1710.22,749906
04.28.2020 21:00:00,1707.66,1717.42,1697.65,1711.58,1710.22,1718.57,1698.4,1715.42,630720
04.29.2020 21:00:00,1711.58,1721.94,1681.36,1684.97,1715.42,1722.79,1681.91,1687.92,631609
04.30.2020 21:00:00,1684.97,1705.87,1669.62,1699.92,1687.92,1706.33,1670.81,1701.66,764742
05.03.2020 21:00:00,1699.92,1714.75,1691.46,1700.42,1701.66,1715.83,1692.96,1702.17,355859
05.04.2020 21:00:00,1700.42,1711.64,1688.55,1703.04,1702.17,1712.55,1690.42,1706.71,415576
05.05.2020 21:00:00,1703.04,1708.1,1681.6,1685.18,1706.71,1708.71,1682.33,1688.33,346814
05.06.2020 21:00:00,1685.18,1721.95,1683.59,1715.17,1688.33,1722.53,1684.8,1716.91,379103
05.07.2020 21:00:00,1715.17,1723.54,1701.49,1704.06,1716.91,1724.42,1702.1,1705.25,409225
05.10.2020 21:00:00,1704.06,1712.02,1691.75,1696.68,1705.25,1713.03,1692.45,1697.58,438010
05.11.2020 21:00:00,1696.68,1710.94,1693.56,1701.46,1697.58,1711.31,1693.92,1703.32,369988
05.12.2020 21:00:00,1701.46,1718.11,1698.86,1716.09,1703.32,1718.69,1699.4,1718.63,518107
05.13.2020 21:00:00,1716.09,1736.16,1710.79,1727.71,1718.63,1736.55,1711.33,1731.38,447401
05.14.2020 21:00:00,1727.71,1751.56,1727.71,1743.94,1731.38,1752.1,1728.89,1744.96,561909
05.17.2020 21:00:00,1743.94,1765.3,1727.4,1731.73,1744.96,1765.92,1728.08,1732.99,495628
05.18.2020 21:00:00,1731.73,1747.76,1725.05,1743.52,1732.99,1748.24,1726.29,1746.9,596250
05.19.2020 21:00:00,1743.52,1753.8,1742.04,1747.22,1746.9,1754.28,1742.62,1748.48,497960
05.20.2020 21:00:00,1747.22,1748.7,1717.14,1726.56,1748.48,1751.18,1717.39,1727.82,557122
05.21.2020 21:00:00,1726.56,1740.06,1723.33,1735.67,1727.82,1740.7,1724.41,1736.73,336867
05.24.2020 21:00:00,1735.67,1735.67,1721.61,1727.88,1736.73,1736.73,1721.83,1730.25,164650
05.25.2020 21:00:00,1727.88,1735.39,1708.48,1710.1,1730.25,1735.99,1709.34,1712.21,404914
05.26.2020 21:00:00,1710.1,1715.93,1693.57,1708.36,1712.21,1716.3,1694.04,1709.85,436519
05.27.2020 21:00:00,1708.36,1727.42,1703.41,1717.28,1709.85,1727.93,1705.85,1721.0,416306
05.28.2020 21:00:00,1717.28,1737.58,1712.55,1731.2,1721.0,1738.26,1713.24,1732.07,399698
05.31.2020 21:00:00,1731.2,1744.51,1726.98,1738.73,1732.07,1745.11,1727.93,1742.56,365219
Problem
This is the result for this code and I'm looking for ways to find intersections for SMA20 (yellow) and SMA50 (green) lines and thus able to get alerts whenever these lines cross.
Solution
Print out intersections indication crossing from above or below relative to each series.
import numpy as np
g20=sma20.values
g50=sma50.values
# np.sign(...) return -1, 0 or 1
# np.diff(...) return value difference for (n-1) - n, to obtain intersections
# np.argwhere(...) remove zeros, preserves turning points only
idx20 = np.argwhere(np.diff(np.sign(g20 - g50))).flatten()
priceSma_df.plot()
plt.scatter(close.index[idx20], sma50[idx20], color='red')
plt.show()
import numpy as np
f=close.values
g20=sma20.values
g50=sma50.values
idx20 = np.argwhere(np.diff(np.sign(f - g20))).flatten()
idx50 = np.argwhere(np.diff(np.sign(f - g50))).flatten()
priceSma_df = pd.DataFrame({
'BidClose' : close,
'SMA 20' : sma20,
'SMA 50' : sma50
})
priceSma_df.plot()
plt.scatter(close.index[idx20], sma20[idx20], color='orange')
plt.scatter(close.index[idx50], sma50[idx50], color='green')
plt.show()

Create and pass random values to Pandas dataframes with hard bounds

I am trying to simulate a pandas dataframe, using random values, with a combination of hard upper/lower values. I am using np.random.normal, as the original data is fairly normally distributed.
The code I am using to create the dataframe is:
df = pd.DataFrame({
"Temp": np.random.normal(6.809892, 2.975827,93),
"Sun": np.random.normal(1.615054,2.053996,93),
"Rel Hum": np.random.normal(87.153118,5.529958,93)
})
In the above example, I would like there to be a hard lower and upper bound for all three values. For example, Rel. Hum. could not go below 0, or above 100. Edit: all three values would not have the same bounds, either upper or lower. Temp can go negative, while sun would be bounded at 0, and 24)
How can I force these values, while creating a relatively normally distribution, and passing them to the dataframe at the same time?
Edit : Note that this samples from a truncated normal for the given parameters and will most likely not be truly normally distributed, sorry for the confusion.
Use scipy truncated normal defined as :
"The standard form of this distribution is a standard normal truncated to the range [a, b]"
from scipy.stats import truncnorm
low_bound = 0
upper_bound = 100
mean = 8
std = 2
a, b = (low_bound - mean) / std, (upper_bound - mean) / std
n_samples = 1000
samples = truncnorm.rvs(a = a, b = b,
loc = mean, scale = std,
size = n_samples)
Thanks to ALollz for the corrections !
Try clip() function to bound the values, example:
>>> df[df['Rel Hum']>100].head()
Temp Sun Rel Hum
32 4.734005 4.102939 100.064077
Name: Rel Hum, Length: 93, dtype: float64
>>> df[df['Rel Hum']>100].head()
Temp Sun Rel Hum
32 4.734005 4.102939 100.064077
>>> df['Rel Hum'].clip(0, 100, inplace=True) # assigns values outside boundary to 0 and 100
>>> df.head()
Temp Sun Rel Hum
0 9.714943 6.255931 93.105135
1 0.551001 3.063972 85.923184
2 7.780588 3.580514 79.124139
3 3.766066 3.684801 84.543149
4 8.541507 -3.066196 83.598925
>>> df[df['Rel Hum']>100].head()
Empty DataFrame
Columns: [Temp, Sun, Rel Hum]
Index: []
Just do a clip:
df = pd.DataFrame({
"Temp": np.random.normal(6.809892, 2.975827,93),
"Sun": np.random.normal(1.615054,2.053996,93),
"Rel Hum": np.random.normal(87.153118,5.529958,93)
}).clip(0,100)
And plot:
df.plot.density(subplots=True);
gives:
You can clip, though this leaves you with a spike at the edges:
import pandas as pd
import numpy as np
N = 10**5
df = pd.DataFrame({"Rel Hum": np.random.normal(87.153118,5.529958, N)})
df['Rel Hum'].clip(lower=0, upper=100).plot(kind='hist', bins=np.arange(60,101,1))
If you want to avoid that spike redraw out of bounds points until everything is within bounds:
while not df['Rel Hum'].between(0, 100).all():
m = ~df['Rel Hum'].between(0, 100)
df.loc[m, 'Rel Hum'] = np.random.normal(87.153118, 5.529958, m.sum())
df['Rel Hum'].plot(kind='hist', bins=np.arange(60,101,1))

How to perform time series analysis that contains multiple groups in Python using fbProphet or other models?

All,
My dataset looks like following. I am trying to predict the 'amount' for next 6 months using either the fbProphet or other model. But my issue is that I would like to predict amount based on each groups i.e A,B,C,D for next 6 months. I am not sure how to do that in python using fbProphet or other model ? I referenced official page of fbprophet, but the only information I found is that "Prophet" takes two columns only One is "Date" and other is "amount" .
I am new to python, so any help with code explanation is greatly appreciated!
import pandas as pd
data = {'Date':['2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-05-01','2017-06-01','2017-07-01'],'Group':['A','B','C','D','C','A','B'],
'Amount':['12.1','13','15','10','12','9.0','5.6']}
df = pd.DataFrame(data)
print (df)
output:
Date Group Amount
0 2017-01-01 A 12.1
1 2017-02-01 B 13
2 2017-03-01 C 15
3 2017-04-01 D 10
4 2017-05-01 C 12
5 2017-06-01 A 9.0
6 2017-07-01 B 5.6
fbprophet requires two columns ds and y, so you need to first rename the two columns
df = df.rename(columns={'Date': 'ds', 'Amount':'y'})
Assuming that your groups are independent from each other and you want to get one prediction for each group, you can group the dataframe by "Group" column and run forecast for each group
from fbprophet import Prophet
grouped = df.groupby('Group')
for g in grouped.groups:
group = grouped.get_group(g)
m = Prophet()
m.fit(group)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
print(forecast.tail())
Take note that the input dataframe that you supply in the question is not sufficient for the model because group D only has a single data point. fbprophet's forecast needs at least 2 non-Nan rows.
EDIT: if you want to merge all predictions into one dataframe, the idea is to name the yhat for each observations differently, do pd.merge() in the loop, and then cherry-pick the columns that you need at the end:
final = pd.DataFrame()
for g in grouped.groups:
group = grouped.get_group(g)
m = Prophet()
m.fit(group)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
forecast = forecast.rename(columns={'yhat': 'yhat_'+g})
final = pd.merge(final, forecast.set_index('ds'), how='outer', left_index=True, right_index=True)
final = final[['yhat_' + g for g in grouped.groups.keys()]]
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.stattools import adfuller
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
# Before doing any modeling using ARIMA or SARIMAS etc Confirm that
# your time-series is stationary by using Augmented Dick Fuller test
# or other tests.
# Create a list of all groups or get from Data using np.unique or other methods
groups_iter = ['A', 'B', 'C', 'D']
dict_org = {}
dict_pred = {}
group_accuracy = {}
# Iterate over all groups and get data
# from Dataframe by filtering for specific group
for i in range(len(groups_iter)):
X = data[data['Group'] == groups_iter[i]]['Amount'].values
size = int(len(X) * 0.70)
train, test = X[0:size], X[size:len(X)]
history = [x for in train]
# Using ARIMA model here you can also do grid search for best parameters
for t in range(len(test)):
model = ARIMA(history, order = (5, 1, 0))
model_fit = model.fit(disp = 0)
output = model_fit.forecast()
yhat = output[0]
predictions.append(yhat)
obs = test[t]
history.append(obs)
print("Predicted:%f, expected:%f" %(yhat, obs))
error = mean_squared_log_error(test, predictions)
dict_org.update({groups_iter[i]: test})
dict_pred.update({group_iter[i]: test})
print("Group: ", group_iter[i], "Test MSE:%f"% error)
group_accuracy.update({group_iter[i]: error})
plt.plot(test)
plt.plot(predictions, color = 'red')
plt.show()
I know this is old but I was trying to predict outcomes for different clients and I tried to use Aditya Santoso solution above but got into some errors, so I added a couple of modifications and finally this worked for me:
df = pd.read_csv('file.csv')
df = pd.DataFrame(df)
df = df.rename(columns={'date': 'ds', 'amount': 'y', 'client_id': 'client_id'})
#I had to filter first clients with less than 3 records to avoid errors as prophet only works for 2+ records by group
df = df.groupby('client_id').filter(lambda x: len(x) > 2)
df.client_id = df.client_id.astype(str)
final = pd.DataFrame(columns=['client','ds','yhat'])
grouped = df.groupby('client_id')
for g in grouped.groups:
group = grouped.get_group(g)
m = Prophet()
m.fit(group)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
#I added a column with client id
forecast['client'] = g
#I used concat instead of merge
final = pd.concat([final, forecast], ignore_index=True)
final.head(10)

Pandas finding local max and min

I have a pandas data frame with two columns one is temperature the other is time.
I would like to make third and fourth columns called min and max. Each of these columns would be filled with nan's except where there is a local min or max, then it would have the value of that extrema.
Here is a sample of what the data looks like, essentially I am trying to identify all the peaks and low points in the figure.
Are there any built in tools with pandas that can accomplish this?
The solution offered by fuglede is great but if your data is very noisy (like the one in the picture) you will end up with lots of misleading local extremes. I suggest that you use scipy.signal.argrelextrema() method. The .argrelextrema() method has its own limitations but it has a useful feature where you can specify the number of points to be compared, kind of like a noise filtering algorithm. for example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema
# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
n = 5 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal,
order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal,
order=n)[0]]['data']
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()
Some points:
you might need to check the points afterward to ensure there are no twine points very close to each other.
you can play with n to filter the noisy points
argrelextrema returns a tuple and the [0] at the end extracts a numpy array
Assuming that the column of interest is labelled data, one solution would be
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
xs.append(xs[-1]*0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
# Find local peaks
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
df.data.plot()
using Numpy
ser = np.random.randint(-40, 40, 100) # 100 points
peak = np.where(np.diff(ser) < 0)[0]
or
double_difference = np.diff(np.sign(np.diff(ser)))
peak = np.where(double_difference == -2)[0]
using Pandas
ser = pd.Series(np.random.randint(2, 5, 100))
peak_df = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
peak = peak_df.index
You can do something similar to Foad's .argrelextrema() solution, but with the Pandas .rolling() function:
# Find local peaks
n = 5 #rolling period
local_min_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).min()]
local_max_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).max()]
plt.scatter(local_min_vals.index, local_min_vals, c='r')
plt.scatter(local_max_vals.index, local_max_vals, c='g')

How to more efficiently calculate a rolling ratio

i have data length is over 3000.
below are code for making 20days value ( Volume Ration in Stock market)
it took more than 2 min.
is there any good way to reduce running time.
import pandas as pd
import numpy as np
from pandas.io.data import DataReader
import matplotlib.pylab as plt
data = DataReader('047040.KS','yahoo',start='2010')
data['vr']=0
data['Volume Ratio']=0
data['acend']=0
data['vr'] = np.sign(data['Close']-data['Open'])
data['vr'] = np.where(data['vr']==0,0.5,data['vr'])
data['vr'] = np.where(data['vr']<0,0,data['vr'])
data['acend'] = np.multiply(data['Volume'],data['vr'])
for i in range(len(data['Open'])):
if i<19:
data['Volume Ratio'][i]=0
else:
data['Volume Ratio'][i] = ((sum(data['acend'][i-19:i]))/((sum(data['Volume'][i-19:i])-sum(data['acend'][i-19:i]))))*100
Consider using conditional row selection and rolling.sum():
data.loc[data.index[:20], 'Volume Ratio'] = 0
data.loc[data.index[20:], 'Volume Ratio'] = (data.loc[:20:, 'acend'].rolling(window=20).sum() / (data.loc[:20:, 'Volume'].rolling(window=20).sum() - data.loc[:20:, 'acend'].rolling(window=20).sum()) * 100
or, simplified - .rolling.sum() will create np.nan for the first 20 values so just use .fillna(0):
data['new_col'] = data['acend'].rolling(window=20).sum().div(data['Volume'].rolling(window=20).sum().subtract(data['acend'].rolling(window=20).sum()).mul(100).fillna(0)

Categories