Find intersection points for two stock time series - Python
Background
I am trying to find the intersection points of two series. In this stock example, I would like to find the intersection points of SMA20 and SMA50. The Simple Moving Average (SMA) is commonly used as a stock indicator; combined with crossover detection and other strategies, it can help inform trading decisions. Below is a code example.
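For reference, the n-day SMA at day t is just the arithmetic mean of the last n closing prices, which is exactly what close.rolling(window=n).mean() computes below:

SMA_n(t) = (close(t) + close(t-1) + ... + close(t-n+1)) / n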
Code
You can run the following in Jupyter.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
datafile = 'output_XAG_D1_20200101_to_20200601.csv'
#This creates a dataframe from the CSV file:
data = pd.read_csv(datafile, index_col = 'Date')
#This selects the 'BidClose' column:
close = data['BidClose']
#This converts the date strings in the index into pandas datetime format:
close.index = pd.to_datetime(close.index)
close  # display the series (in Jupyter)
sma20 = close.rolling(window=20).mean()
sma50 = close.rolling(window=50).mean()
priceSma_df = pd.DataFrame({
    'BidClose': close,
    'SMA 20': sma20,
    'SMA 50': sma50
})
priceSma_df.plot()
plt.show()
Sample Data
This is the data file used in the example, output_XAG_D1_20200101_to_20200601.csv:
Date,BidOpen,BidHigh,BidLow,BidClose,AskOpen,AskHigh,AskLow,AskClose,Volume
01.01.2020 22:00:00,1520.15,1531.26,1518.35,1527.78,1520.65,1531.75,1518.73,1531.73,205667
01.02.2020 22:00:00,1527.78,1553.43,1526.72,1551.06,1531.73,1553.77,1528.17,1551.53,457713
01.05.2020 22:00:00,1551.06,1588.16,1551.06,1564.4,1551.53,1590.51,1551.53,1568.32,540496
01.06.2020 22:00:00,1564.4,1577.18,1555.2,1571.62,1568.32,1577.59,1555.54,1575.56,466430
01.07.2020 22:00:00,1571.62,1611.27,1552.13,1554.79,1575.56,1611.74,1552.48,1558.72,987671
01.08.2020 22:00:00,1554.79,1561.24,1540.08,1549.78,1558.72,1561.58,1540.5,1553.73,473799
01.09.2020 22:00:00,1549.78,1563.0,1545.62,1562.44,1553.73,1563.41,1545.96,1562.95,362002
01.12.2020 22:00:00,1562.44,1562.44,1545.38,1545.46,1562.95,1563.06,1546.71,1549.25,280809
01.13.2020 22:00:00,1545.46,1548.77,1535.78,1545.1,1549.25,1549.25,1536.19,1548.87,378200
01.14.2020 22:00:00,1545.1,1558.04,1543.79,1554.89,1548.87,1558.83,1546.31,1558.75,309719
01.15.2020 22:00:00,1554.89,1557.98,1547.91,1551.18,1558.75,1558.75,1548.24,1554.91,253944
01.16.2020 22:00:00,1551.18,1561.12,1549.28,1556.68,1554.91,1561.55,1549.59,1557.15,239186
01.19.2020 22:00:00,1556.68,1562.69,1556.25,1560.77,1557.15,1562.97,1556.61,1561.17,92020
01.20.2020 22:00:00,1560.77,1568.49,1546.21,1556.8,1561.17,1568.87,1546.56,1558.5,364753
01.21.2020 22:00:00,1556.8,1559.18,1550.07,1558.59,1558.5,1559.47,1550.42,1559.31,238468
01.22.2020 22:00:00,1558.59,1567.83,1551.8,1562.45,1559.31,1568.16,1552.11,1564.17,365518
01.23.2020 22:00:00,1562.45,1575.77,1556.44,1570.39,1564.17,1576.12,1556.76,1570.87,368529
01.26.2020 22:00:00,1570.39,1588.41,1570.39,1580.51,1570.87,1588.97,1570.87,1582.33,510524
01.27.2020 22:00:00,1580.51,1582.93,1565.31,1567.15,1582.33,1583.3,1565.79,1570.62,384205
01.28.2020 22:00:00,1567.15,1577.93,1563.27,1576.7,1570.62,1578.22,1563.61,1577.25,328766
01.29.2020 22:00:00,1576.7,1585.87,1572.19,1573.23,1577.25,1586.18,1572.44,1575.33,522371
01.30.2020 22:00:00,1573.23,1589.98,1570.82,1589.75,1575.33,1590.37,1571.14,1590.31,482710
02.02.2020 22:00:00,1589.75,1593.09,1568.65,1575.62,1590.31,1595.82,1569.85,1578.35,488585
02.03.2020 22:00:00,1575.62,1579.56,1548.95,1552.55,1578.35,1579.87,1549.31,1556.4,393037
02.04.2020 22:00:00,1552.55,1562.3,1547.34,1554.62,1556.4,1562.64,1547.72,1556.42,473172
02.05.2020 22:00:00,1554.62,1568.14,1552.39,1565.08,1556.42,1568.51,1552.73,1567.0,365580
02.06.2020 22:00:00,1565.08,1574.02,1559.82,1570.11,1567.0,1574.33,1560.7,1570.55,424269
02.09.2020 22:00:00,1570.11,1576.9,1567.9,1571.05,1570.55,1577.25,1568.21,1573.34,326606
02.10.2020 22:00:00,1571.05,1573.92,1561.92,1566.12,1573.34,1574.27,1562.24,1568.12,310037
02.11.2020 22:00:00,1566.12,1570.39,1561.45,1564.26,1568.12,1570.71,1561.91,1567.02,269032
02.12.2020 22:00:00,1564.26,1578.24,1564.26,1574.5,1567.02,1578.52,1565.81,1576.63,368438
02.13.2020 22:00:00,1574.5,1584.87,1572.44,1584.49,1576.63,1585.29,1573.28,1584.91,250788
02.16.2020 22:00:00,1584.49,1584.49,1578.7,1580.79,1584.91,1584.91,1579.06,1581.31,101499
02.17.2020 22:00:00,1580.79,1604.97,1580.79,1601.06,1581.31,1605.33,1581.31,1603.08,321542
02.18.2020 22:00:00,1601.06,1612.83,1599.41,1611.27,1603.08,1613.4,1599.77,1613.34,357488
02.19.2020 22:00:00,1611.27,1623.62,1603.74,1618.48,1613.34,1623.98,1604.12,1621.27,535148
02.20.2020 22:00:00,1618.48,1649.26,1618.48,1643.42,1621.27,1649.52,1619.19,1643.87,590262
02.23.2020 22:00:00,1643.42,1689.22,1643.42,1658.62,1643.87,1689.55,1643.87,1659.07,1016570
02.24.2020 22:00:00,1658.62,1660.76,1624.9,1633.19,1659.07,1661.52,1625.5,1636.23,1222774
02.25.2020 22:00:00,1633.19,1654.88,1624.74,1640.4,1636.23,1655.23,1625.11,1642.59,1004692
02.26.2020 22:00:00,1640.4,1660.3,1635.15,1643.99,1642.59,1660.6,1635.6,1646.42,1084115
02.27.2020 22:00:00,1643.99,1649.39,1562.74,1584.95,1646.42,1649.84,1563.22,1585.58,1174015
03.01.2020 22:00:00,1584.95,1610.94,1575.29,1586.55,1585.58,1611.26,1575.88,1590.33,1115889
03.02.2020 22:00:00,1586.55,1649.16,1586.55,1640.19,1590.33,1649.6,1589.43,1644.16,889364
03.03.2020 22:00:00,1640.19,1652.81,1631.73,1635.95,1644.16,1653.51,1632.1,1639.05,589438
03.04.2020 22:00:00,1635.95,1674.51,1634.91,1669.36,1639.05,1674.9,1635.3,1672.83,643444
03.05.2020 22:00:00,1669.36,1692.1,1641.61,1673.89,1672.83,1692.65,1642.75,1674.46,1005737
03.08.2020 21:00:00,1673.89,1703.19,1656.98,1678.31,1674.46,1703.52,1657.88,1679.2,910166
03.09.2020 21:00:00,1678.31,1680.43,1641.37,1648.71,1679.2,1681.18,1641.94,1649.75,943377
03.10.2020 21:00:00,1648.71,1671.15,1632.9,1634.42,1649.75,1671.56,1633.31,1637.07,793816
03.11.2020 21:00:00,1634.42,1650.28,1560.5,1578.29,1637.07,1650.8,1560.92,1580.01,1009172
03.12.2020 21:00:00,1578.29,1597.85,1504.34,1528.99,1580.01,1598.36,1505.14,1530.09,1052940
03.15.2020 21:00:00,1528.99,1575.2,1451.08,1509.12,1530.09,1576.05,1451.49,1512.94,1196812
03.16.2020 21:00:00,1509.12,1553.91,1465.4,1528.57,1512.94,1554.21,1466.1,1529.43,1079729
03.17.2020 21:00:00,1528.57,1545.93,1472.49,1485.85,1529.43,1546.74,1472.99,1486.75,976857
03.18.2020 21:00:00,1485.85,1500.68,1463.49,1471.89,1486.75,1501.6,1464.64,1474.16,833803
03.19.2020 21:00:00,1471.89,1516.07,1454.46,1497.01,1474.16,1516.57,1455.93,1497.82,721471
03.22.2020 21:00:00,1497.01,1560.86,1482.21,1551.45,1497.82,1561.65,1483.22,1553.09,707830
03.23.2020 21:00:00,1551.45,1631.23,1551.45,1621.05,1553.09,1638.75,1553.09,1631.35,164862
03.24.2020 21:00:00,1621.05,1636.23,1588.82,1615.77,1631.35,1650.03,1601.29,1618.47,205272
03.25.2020 21:00:00,1615.77,1642.96,1587.7,1628.31,1618.47,1649.81,1599.87,1633.29,152804
03.26.2020 21:00:00,1628.31,1630.48,1606.76,1617.5,1633.29,1638.48,1616.9,1622.8,307278
03.29.2020 21:00:00,1617.5,1631.48,1602.51,1620.91,1622.8,1643.86,1612.55,1623.77,291653
03.30.2020 21:00:00,1620.91,1626.55,1573.37,1574.9,1623.77,1627.31,1575.24,1579.1,371507
03.31.2020 21:00:00,1574.9,1600.41,1560.13,1590.13,1579.1,1603.42,1570.75,1592.43,412780
04.01.2020 21:00:00,1590.13,1619.76,1582.42,1612.07,1592.43,1621.1,1583.37,1614.49,704652
04.02.2020 21:00:00,1612.07,1625.21,1605.39,1618.63,1614.49,1626.83,1607.69,1621.37,409490
04.05.2020 21:00:00,1618.63,1668.35,1608.59,1657.77,1621.37,1670.98,1609.7,1663.43,381690
04.06.2020 21:00:00,1657.77,1671.95,1641.84,1644.84,1663.43,1677.53,1643.4,1650.46,286313
04.07.2020 21:00:00,1644.84,1656.39,1640.1,1644.06,1650.46,1657.43,1643.46,1646.66,219464
04.08.2020 21:00:00,1644.06,1689.66,1643.05,1682.16,1646.66,1691.13,1644.83,1686.74,300111
04.12.2020 21:00:00,1682.16,1722.25,1677.35,1709.16,1686.74,1725.48,1680.49,1718.28,280905
04.13.2020 21:00:00,1709.16,1747.04,1708.56,1726.18,1718.28,1748.88,1709.36,1729.72,435098
04.14.2020 21:00:00,1726.18,1730.53,1706.67,1714.35,1729.72,1732.97,1708.95,1717.25,419065
04.15.2020 21:00:00,1714.35,1738.65,1707.83,1715.99,1717.25,1740.35,1708.93,1720.09,615105
04.16.2020 21:00:00,1715.99,1718.46,1677.16,1683.2,1720.09,1720.09,1680.55,1684.97,587875
04.19.2020 21:00:00,1683.2,1702.49,1671.1,1694.71,1684.97,1703.46,1672.02,1697.29,412116
04.20.2020 21:00:00,1694.71,1697.66,1659.42,1683.4,1697.29,1698.44,1662.3,1686.58,502893
04.21.2020 21:00:00,1683.4,1718.21,1679.61,1713.67,1686.58,1719.19,1680.71,1716.91,647622
04.22.2020 21:00:00,1713.67,1738.59,1706.93,1729.89,1716.91,1739.47,1707.72,1731.83,751833
04.23.2020 21:00:00,1729.89,1736.31,1710.56,1726.74,1731.83,1736.98,1711.03,1727.71,608827
04.26.2020 21:00:00,1726.74,1727.55,1705.99,1713.36,1727.71,1728.55,1706.72,1715.29,698217
04.27.2020 21:00:00,1713.36,1716.52,1691.41,1707.66,1715.29,1718.02,1692.51,1710.22,749906
04.28.2020 21:00:00,1707.66,1717.42,1697.65,1711.58,1710.22,1718.57,1698.4,1715.42,630720
04.29.2020 21:00:00,1711.58,1721.94,1681.36,1684.97,1715.42,1722.79,1681.91,1687.92,631609
04.30.2020 21:00:00,1684.97,1705.87,1669.62,1699.92,1687.92,1706.33,1670.81,1701.66,764742
05.03.2020 21:00:00,1699.92,1714.75,1691.46,1700.42,1701.66,1715.83,1692.96,1702.17,355859
05.04.2020 21:00:00,1700.42,1711.64,1688.55,1703.04,1702.17,1712.55,1690.42,1706.71,415576
05.05.2020 21:00:00,1703.04,1708.1,1681.6,1685.18,1706.71,1708.71,1682.33,1688.33,346814
05.06.2020 21:00:00,1685.18,1721.95,1683.59,1715.17,1688.33,1722.53,1684.8,1716.91,379103
05.07.2020 21:00:00,1715.17,1723.54,1701.49,1704.06,1716.91,1724.42,1702.1,1705.25,409225
05.10.2020 21:00:00,1704.06,1712.02,1691.75,1696.68,1705.25,1713.03,1692.45,1697.58,438010
05.11.2020 21:00:00,1696.68,1710.94,1693.56,1701.46,1697.58,1711.31,1693.92,1703.32,369988
05.12.2020 21:00:00,1701.46,1718.11,1698.86,1716.09,1703.32,1718.69,1699.4,1718.63,518107
05.13.2020 21:00:00,1716.09,1736.16,1710.79,1727.71,1718.63,1736.55,1711.33,1731.38,447401
05.14.2020 21:00:00,1727.71,1751.56,1727.71,1743.94,1731.38,1752.1,1728.89,1744.96,561909
05.17.2020 21:00:00,1743.94,1765.3,1727.4,1731.73,1744.96,1765.92,1728.08,1732.99,495628
05.18.2020 21:00:00,1731.73,1747.76,1725.05,1743.52,1732.99,1748.24,1726.29,1746.9,596250
05.19.2020 21:00:00,1743.52,1753.8,1742.04,1747.22,1746.9,1754.28,1742.62,1748.48,497960
05.20.2020 21:00:00,1747.22,1748.7,1717.14,1726.56,1748.48,1751.18,1717.39,1727.82,557122
05.21.2020 21:00:00,1726.56,1740.06,1723.33,1735.67,1727.82,1740.7,1724.41,1736.73,336867
05.24.2020 21:00:00,1735.67,1735.67,1721.61,1727.88,1736.73,1736.73,1721.83,1730.25,164650
05.25.2020 21:00:00,1727.88,1735.39,1708.48,1710.1,1730.25,1735.99,1709.34,1712.21,404914
05.26.2020 21:00:00,1710.1,1715.93,1693.57,1708.36,1712.21,1716.3,1694.04,1709.85,436519
05.27.2020 21:00:00,1708.36,1727.42,1703.41,1717.28,1709.85,1727.93,1705.85,1721.0,416306
05.28.2020 21:00:00,1717.28,1737.58,1712.55,1731.2,1721.0,1738.26,1713.24,1732.07,399698
05.31.2020 21:00:00,1731.2,1744.51,1726.98,1738.73,1732.07,1745.11,1727.93,1742.56,365219
Problem
The code above plots BidClose together with SMA20 (yellow) and SMA50 (green). I'm looking for a way to find the intersection points of the SMA20 and SMA50 lines, so that I can get an alert whenever these lines cross.
Solution
Find the intersection indices with NumPy and mark them on the plot. The sign of the difference between the two series tells whether each crossing is from above or below relative to the other series.
import numpy as np
g20=sma20.values
g50=sma50.values
# np.sign(...) returns -1, 0 or 1 for each element
# np.diff(...) returns the difference between consecutive elements, a[n+1] - a[n],
#   so a sign change (i.e. a crossing) shows up as a non-zero value
# np.argwhere(...) keeps the indices of the non-zero values, i.e. the crossings only
idx20 = np.argwhere(np.diff(np.sign(g20 - g50))).flatten()
priceSma_df.plot()
plt.scatter(close.index[idx20], sma50.iloc[idx20], color='red')
plt.show()
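To turn those indices into the alerts the Problem section asks for, here is a minimal sketch building on the arrays above. The message format is my own addition, and the mask is there because the leading NaN values of the rolling means would otherwise also register as sign changes:

# positions where both SMAs are defined
valid = ~np.isnan(g20) & ~np.isnan(g50)
rel = np.sign(g20 - g50)
cross = np.argwhere(np.diff(rel) != 0).flatten()
cross = cross[valid[cross] & valid[cross + 1]]  # keep crossings between two valid points

for i in cross:
    direction = 'above' if rel[i + 1] > 0 else 'below'
    print(f"{close.index[i + 1].date()}: SMA20 crossed {direction} SMA50")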
The same technique marks the points where the price itself crosses each SMA:

import numpy as np
f=close.values
g20=sma20.values
g50=sma50.values
idx20 = np.argwhere(np.diff(np.sign(f - g20))).flatten()
idx50 = np.argwhere(np.diff(np.sign(f - g50))).flatten()
priceSma_df = pd.DataFrame({
'BidClose' : close,
'SMA 20' : sma20,
'SMA 50' : sma50
})
priceSma_df.plot()
plt.scatter(close.index[idx20], sma20.iloc[idx20], color='orange')
plt.scatter(close.index[idx50], sma50.iloc[idx50], color='green')
plt.show()
Related
Identify in a pandas series when the trend changes from positive to negative
I have a pandas dataframe with securities prices and several moving average trend lines of various moving average lengths. The data frames are sufficiently large that I would like to identify the most efficient way to capture the index of a particular series where the slope changes (in this example, let's just say from positive to negative for a given series in the dataframe). My hack seems very "hacky". I am currently doing the following (note, imagine this is for a single moving average series):

filter = (df.diff() > 0).diff().dropna(axis=0)
new_df = df[filter].dropna(axis=0)

Full example code below:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create a sample Dataframe
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
close = pd.Series([1, 2, 3, 4, 2, 1, 4, 3])
df = pd.DataFrame({"date": days, "prices": close})
df.set_index("date", inplace=True)
print("Original DF")
print(df)

# Long Explanation
updays = (df.diff() > 0)  # True for all updays, False for all downdays
print("Updays df is")
print(updays)
reversal_df = updays.diff()  # this will only show change days as True
reversal_df.dropna(axis=0, inplace=True)  # Handle the first day
trade_df = df[reversal_df].dropna()  # Select only the days where the trend reversed
print("These are the days where the trend reverses itself from negative to positive or vice versa")
print(trade_df)

# Simplified below by combining the above into two lines
filter = (df.diff() > 0).diff().dropna(axis=0)
new_df = df[filter].dropna(axis=0)
print("The final result is this: ")
print(new_df)

Any help here would be appreciated. Note, I'm more interested in balancing efficiencies between how best to do this so I can understand it, and how to make it sufficiently quick to compute.
Multiple moving average solution. Look for the comment # *** THE SOLUTION BEGINS HERE *** to see the solution; before that it is just generating data, printing and plotting to validate. What I do here is calculate the sign of the MVA slopes, so a positive slope will have a value of 1 and a negative slope a value of -1:

Slope_i = MVA(i; periods) - MVA(i-1; periods)
m<periods>_slp_sgn_i = sign(Slope_i)

Then, to spot slope changes, I calculate:

m<periods>_slp_chg_i = sign(m<periods>_slp_sgn_i - m<periods>_slp_sgn_{i-1})

So, for example, if the slope changes from 1 (positive) to -1 (negative):

sign(-1 - 1) = sign(-2) = -1

On the other hand, if it changes from -1 to 1:

sign(1 - (-1)) = sign(2) = 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# GENERATE DATA RANDOM PRICE
_periods = 1000
_value_0 = 1.1300
_std = 0.0005
_freq = '5T'
_output_col = 'ask'
_last_date = pd.to_datetime('2021-12-15')

_p_t = np.zeros(_periods)
_wn = np.random.normal(loc=0, scale=_std, size=_periods)
_p_t[0] = _value_0
_wn[0] = 0
_date_index = pd.date_range(end=_last_date, periods=_periods, freq=_freq)

df = pd.DataFrame(np.stack([_p_t, _wn], axis=1), columns=[_output_col, "wn"], index=_date_index)
# build the random walk by position (avoids chained assignment,
# which can silently fail to write in newer pandas)
for i in range(1, _periods):
    df.iloc[i, 0] = df.iloc[i - 1, 0] + df.iloc[i, 1]
print(df.head(5))

# CALCULATE MOVING AVERAGES (3)
df['mva_25'] = df['ask'].rolling(25).mean()
df['mva_50'] = df['ask'].rolling(50).mean()
df['mva_100'] = df['ask'].rolling(100).mean()

# plot to check
df['ask'].plot(figsize=(15,5))
df['mva_25'].plot(figsize=(15,5))
df['mva_50'].plot(figsize=(15,5))
df['mva_100'].plot(figsize=(15,5))
plt.show()

# *** THE SOLUTION BEGINS HERE ***

# calculate mva slope directions
# positive slope: 1, negative slope: -1
df['m25_slp_sgn'] = np.sign(df['mva_25'] - df['mva_25'].shift(1))
df['m50_slp_sgn'] = np.sign(df['mva_50'] - df['mva_50'].shift(1))
df['m100_slp_sgn'] = np.sign(df['mva_100'] - df['mva_100'].shift(1))

# CALCULATE CHANGE IN SLOPE
# from 1 to -1: -1
# from -1 to 1: 1
df['m25_slp_chg'] = np.sign(df['m25_slp_sgn'] - df['m25_slp_sgn'].shift(1))
df['m50_slp_chg'] = np.sign(df['m50_slp_sgn'] - df['m50_slp_sgn'].shift(1))
df['m100_slp_chg'] = np.sign(df['m100_slp_sgn'] - df['m100_slp_sgn'].shift(1))

# clean NaN
df.dropna(inplace=True)

# print data to visually check
print(df.iloc[20:40][['mva_25', 'm25_slp_sgn', 'm25_slp_chg']])

# query where slope of MVA25 changes from positive to negative
df[(df['m25_slp_chg'] == -1)].head(5)

WARNING: Data is generated randomly, so the plots and printouts change each time you execute the code.
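To see the slope-sign mechanics in isolation, here is a tiny hand-checkable example (my own sketch, not part of the original answer):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 2.5, 2.0, 2.2, 2.8])
slp_sgn = np.sign(s - s.shift(1))              # 1 while rising, -1 while falling
slp_chg = np.sign(slp_sgn - slp_sgn.shift(1))  # -1 right after a top, 1 right after a bottom
print(pd.DataFrame({'s': s, 'slp_sgn': slp_sgn, 'slp_chg': slp_chg}))
# slp_chg is -1 at index 3 (the turn after the top at 3.0)
# and 1 at index 5 (the turn after the bottom at 2.0)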
how to create a stacked bar chart indicating time spent on nest per day
I have some data of an owl being present in the nest box. In a previous question you helped me visualize when the owl is in the box. In addition, I created a plot of the hours per day spent in the box with the code below (probably this can be done more efficiently):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# raw data indicating time spent in box (each row represents start and end time)
time = pd.DatetimeIndex(["2021-12-01 18:08","2021-12-01 18:11",
                         "2021-12-02 05:27","2021-12-02 05:29",
                         "2021-12-02 22:40","2021-12-02 22:43",
                         "2021-12-03 19:24","2021-12-03 19:27",
                         "2021-12-06 18:04","2021-12-06 18:06",
                         "2021-12-07 05:28","2021-12-07 05:30",
                         "2021-12-10 03:05","2021-12-10 03:10",
                         "2021-12-10 07:11","2021-12-10 07:13",
                         "2021-12-10 20:40","2021-12-10 20:41",
                         "2021-12-12 19:42","2021-12-12 19:45",
                         "2021-12-13 04:13","2021-12-13 04:17",
                         "2021-12-15 04:28","2021-12-15 04:30",
                         "2021-12-15 05:21","2021-12-15 05:25",
                         "2021-12-15 17:40","2021-12-15 17:44",
                         "2021-12-15 22:31","2021-12-15 22:37",
                         "2021-12-16 04:24","2021-12-16 04:28",
                         "2021-12-16 19:58","2021-12-16 20:09",
                         "2021-12-17 17:42","2021-12-17 18:04",
                         "2021-12-17 22:19","2021-12-17 22:26",
                         "2021-12-18 05:41","2021-12-18 05:44",
                         "2021-12-19 07:40","2021-12-19 16:55",
                         "2021-12-19 20:39","2021-12-19 20:52",
                         "2021-12-19 21:56","2021-12-19 23:17",
                         "2021-12-21 04:53","2021-12-21 04:59",
                         "2021-12-21 05:37","2021-12-21 05:39",
                         "2021-12-22 08:06","2021-12-22 17:22",
                         "2021-12-22 20:04","2021-12-22 21:24",
                         "2021-12-22 21:44","2021-12-22 22:47",
                         "2021-12-23 02:20","2021-12-23 06:17",
                         "2021-12-23 08:07","2021-12-23 16:54",
                         "2021-12-23 19:36","2021-12-23 23:59:59",
                         "2021-12-24 00:00","2021-12-24 00:28",
                         "2021-12-24 07:53","2021-12-24 17:00",
                         ])

# create dataframe with column indicating presence (1) or absence (0)
time_df = pd.DataFrame(data={'present': [1, 0] * int(len(time) / 2)}, index=time)

# calculate interval length in minutes and add to time_df
time_df['interval'] = time_df.index.to_series().diff().dt.total_seconds() / 60

# add column with day to time_df
time_df['day'] = time.day

# select only intervals where owl is present
timeinbox = time_df.iloc[1::2, :]
interval = timeinbox.interval.tolist()
day = timeinbox.day.tolist()

# sum multiple intervals per day
interval_tot = [interval[0]]
day_tot = [day[0]]
for i in range(1, len(day)):
    if day[i] == day[i-1]:
        interval_tot[-1] += interval[i]
    else:
        day_tot.append(day[i])
        interval_tot.append(interval[i])

# recalculate to hours
for i in range(len(interval_tot)):
    interval_tot[i] = interval_tot[i] / 60

plt.figure(figsize=(15, 5))
plt.grid(zorder=0)
plt.bar(day_tot, interval_tot, color='g', zorder=3)
plt.xlim([1, 31])
plt.xlabel('day in December')
plt.ylabel('hours per day in nest box')
plt.xticks(np.arange(1, 31, 1))
plt.ylim([0, 24])

Now I would like to combine all the data in one plot by making a stacked bar chart, where each day is represented by a bar and each bar indicates, for each of the 24*60 minutes, whether the owl is present or not. Is this possible from the current data structure?
The data seems to have been created manually, so I have changed the format of the presented data. My approach is to flag with 1 each minute the owl spends in the box, using a continuous 1-minute index between each start and end time. To create the non-stay time, I build a 1-minute time series index running from the start date to the end date + 1 day and reindex the newly created data frame with it, filling the gaps with 0. This is the data for the graph. For the graph, I take the data frame day by day, create a color list with red for stay and green for non-stay, and then stack bars of height one in a bar graph. It may be necessary to consider grouping the data into hourly units (see the sketch after the code).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import timedelta
import io

data = '''
start_time,end_time
"2021-12-01 18:08","2021-12-01 18:11"
"2021-12-02 05:27","2021-12-02 05:29"
"2021-12-02 22:40","2021-12-02 22:43"
"2021-12-03 19:24","2021-12-03 19:27"
"2021-12-06 18:04","2021-12-06 18:06"
"2021-12-07 05:28","2021-12-07 05:30"
"2021-12-10 03:05","2021-12-10 03:10"
"2021-12-10 07:11","2021-12-10 07:13"
"2021-12-10 20:40","2021-12-10 20:41"
"2021-12-12 19:42","2021-12-12 19:45"
"2021-12-13 04:13","2021-12-13 04:17"
"2021-12-15 04:28","2021-12-15 04:30"
"2021-12-15 05:21","2021-12-15 05:25"
"2021-12-15 17:40","2021-12-15 17:44"
"2021-12-15 22:31","2021-12-15 22:37"
"2021-12-16 04:24","2021-12-16 04:28"
"2021-12-16 19:58","2021-12-16 20:09"
"2021-12-17 17:42","2021-12-17 18:04"
"2021-12-17 22:19","2021-12-17 22:26"
"2021-12-18 05:41","2021-12-18 05:44"
"2021-12-19 07:40","2021-12-19 16:55"
"2021-12-19 20:39","2021-12-19 20:52"
"2021-12-19 21:56","2021-12-19 23:17"
"2021-12-21 04:53","2021-12-21 04:59"
"2021-12-21 05:37","2021-12-21 05:39"
"2021-12-22 08:06","2021-12-22 17:22"
"2021-12-22 20:04","2021-12-22 21:24"
"2021-12-22 21:44","2021-12-22 22:47"
"2021-12-23 02:20","2021-12-23 06:17"
"2021-12-23 08:07","2021-12-23 16:54"
"2021-12-23 19:36","2021-12-24 00:00"
"2021-12-24 00:00","2021-12-24 00:28"
"2021-12-24 07:53","2021-12-24 17:00"
'''

df = pd.read_csv(io.StringIO(data), sep=',')
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

time_df = pd.DataFrame()
for idx, row in df.iterrows():
    rng = pd.date_range(row['start_time'], row['end_time'] - timedelta(minutes=1), freq='1min')
    tmp = pd.DataFrame({'present': [1] * len(rng)}, index=rng)
    time_df = pd.concat([time_df, tmp])  # DataFrame.append is deprecated, so use concat

date_add = pd.date_range(time_df.index[0].date(),
                         time_df.index[-1].date() + timedelta(days=1), freq='1min')
time_df = time_df.reindex(date_add, fill_value=0)
time_df['day'] = time_df.index.day

fig, ax = plt.subplots(figsize=(8, 15))
ax.set_yticks(np.arange(0, 1500, 60))
ax.set_ylim(0, 1440)
ax.set_xticks(np.arange(1, 25, 1))

days = time_df['day'].unique()
for d in days:
    day_df = time_df.query('day == @d')
    colors = ['r' if p == 1 else 'g' for p in day_df['present']]
    for i in range(len(day_df)):
        ax.bar(d, height=1, width=0.5, bottom=i + 1, color=colors[i])
plt.show()
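As for the closing suggestion about grouping into hourly units, a minimal sketch of that idea, assuming the time_df built above (the resample step and the day-by-hour table are my own additions, not part of the original answer):

# minutes of presence within each hour
hourly = time_df['present'].resample('1h').sum()

# pivot into a day x hour table (rows: day of month, columns: hour of day)
per_day = hourly.groupby([hourly.index.day, hourly.index.hour]).sum().unstack(fill_value=0)
print(per_day)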
Analysing height difference from columns and selecting max difference in Python
I have a .csv file containing x y data from transects (.csv file here). The file can contain a few dozen transects (this example has only 4). I want to calculate the elevation change for each transect and then select the transect with the highest elevation change.

x      y      lines
0      3.444  1
0.009  3.445  1
0.180  3.449  1
0.027  3.449  1
...
0      2.115  2
0.008  2.115  2
0.017  2.115  2
0.027  2.116  2

I've tried to calculate the change with pandas.DataFrame.diff, but I'm unable to select the highest elevation change from this.

UPDATE: I found a way to calculate the height difference for 1 transect. The goal is now to loop this script through the other transects and let it select the transect with the highest difference. Not sure how to create a loop from this...

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import savgol_filter, find_peaks

df = pd.read_csv('transect4.csv', delimiter=',', header=None, names=['x', 'y', 'lines'])
df_1 = df['lines'] == 1
df1 = df[df_1]

plt.plot(df1['x'], df1['y'], label='Original Topography')

# apply a Savitzky-Golay filter
smooth = savgol_filter(df1.y.values, window_length=351, polyorder=5)

# find the maximums
peaks_idx_max, _ = find_peaks(smooth, prominence=0.01)

# reciprocal, so mins will become maxes
smooth_rec = 1 / smooth

# find the mins now
peaks_idx_mins, _ = find_peaks(smooth_rec, prominence=0.01)

plt.xlabel('Distance')
plt.ylabel('Height')
plt.plot(df1['x'], smooth, label='Smoothed Topography')

# plot them
plt.scatter(df1.x.values[peaks_idx_max], smooth[peaks_idx_max], s=55, c='green', label='Local Max Cusp')
plt.scatter(df1.x.values[peaks_idx_mins], smooth[peaks_idx_mins], s=55, c='black', label='Local Min Cusp')
plt.legend(loc='upper left')
plt.show()

# Export to csv
df['Cusp_max'] = False
df['Cusp_min'] = False
df.loc[df1.x[peaks_idx_max].index, 'Cusp_max'] = True
df.loc[df1.x[peaks_idx_mins].index, 'Cusp_min'] = True
data = df[df['Cusp_max'] | df['Cusp_min']]
data.to_csv(r'Cusp_total.csv')

# Calculate height difference
my_data = pd.read_csv('Cusp_total.csv', delimiter=',', header=0, names=['ID', 'x', 'y', 'lines'])
df1_diff = pd.DataFrame(my_data)
df1_diff['Diff_Cusps'] = df1_diff['y'].diff(-1)

# Only use positive numbers for average
df1_pos = df1_diff[df1_diff['Diff_Cusps'] > 0]
print("Average Height Difference: ", df1_pos['Diff_Cusps'].mean(), "m")

Ideally, the script would select the transect with the highest elevation change from an unknown number of transects in the .csv file, which would then be exported to a new .csv file.
You need to group by the lines column. Not sure if this is what you meant by elevation change, but this gives the difference of elevations (max(y) - min(y)) for each group, where groups are formed by all rows sharing the same value of 'lines', each group representing one transect. This should help you with what you are missing in your logic (sorry, can't put more time in).

frame = pd.read_csv('transect4.csv', header=None, names=['x', 'y', 'lines'])
groups = frame.groupby('lines')
groups['y'].max() - groups['y'].min()  # the max elevation change of each group
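To finish the asker's remaining step of selecting the transect with the largest change and exporting it, a minimal sketch building on the groupby above (the output filename is a placeholder of my own):

elev_change = groups['y'].max() - groups['y'].min()
best_line = elev_change.idxmax()  # 'lines' value of the transect with the largest change
frame[frame['lines'] == best_line].to_csv('highest_transect.csv', index=False)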
Averaging several time-series together with confidence interval (with test code)
Sounds very complicated, but a simple plot will make it easy to understand: I have three curves of the cumulative sum of some values over time, which are the blue lines. I want to average (or somehow combine in a statistically correct way) the three curves into one smooth curve and add a confidence interval. I tried one simple solution - combining all the data into one curve, averaging it with the "rolling" function in pandas, and getting the standard deviation for it. I plotted those as the purple curve with the confidence interval around it. The problem with my real data, as illustrated in the plot, is that the curve isn't smooth at all, and there are sharp jumps in the confidence interval, which isn't a good representation of the 3 separate curves since there are no jumps in them. Is there a better way to represent the 3 different curves in one smooth curve with a nice confidence interval? I supply test code, tested on Python 3.5.1 with numpy and pandas (don't change the seed in order to get the same curves). There is one constraint - increasing the number of points for the "rolling" function isn't a solution for me because some of my data is too short for that.

Test code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(seed=42)

## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0, 10000, size=100), columns=['vals'])
df1_combined_sorted = pd.concat([df1_time, df1_values], axis=1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])

df2_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000, 13000, size=100), columns=['vals'])
df2_combined_sorted = pd.concat([df2_time, df2_values], axis=1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])

df3_time = pd.DataFrame(np.random.uniform(0, 1000, size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0, 4000, size=100), columns=['vals'])
df3_combined_sorted = pd.concat([df3_time, df3_values], axis=1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])

## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
                                    df2_combined_sorted_cumulative,
                                    df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time = pd.concat([df1_combined_sorted['time'],
                         df2_combined_sorted['time'],
                         df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis=1)

## creating confidence intervals
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()

## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
                 ma['vals'] + 2 * mstd['vals'], color='b', alpha=0.2)
plt.plot(df_all_sorted['time'], ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
plt.show()
First of all, your sample code could be re-written to make better use of pandas. For example:

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0, max_time, size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0, max_val, size=100), columns=['vals'])
    df = pd.concat([times, vals], axis=1).sort_values(by=['time']).\
         reset_index().drop('index', axis=1)
    df['cumulative'] = df.vals.cumsum()
    return df

# generate the dataframes
df1, df2, df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])

# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2 * std_val
    max_val = mean_val + 2 * std_val

    plt.figure(figsize=(16, 9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')
    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()

The reason your curves aren't that smooth may be that your rolling window is not large enough. You can increase this window size to get smoother graphs; for example, render(20) and render(30) give progressively smoother curves.

A better way, though, might be interpolating each df['cumulative'] onto the entire time window and computing the mean/confidence interval on these series. With that in mind, we can modify the code as follows:

np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0, max_time, size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0, max_val, size=100), columns=['vals'])
    # note that we set time as index of the returned data
    df = pd.concat([times, vals], axis=1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df

df1, df2, df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# rename column for later plotting
for i, df in zip(range(3), dfs):
    df.rename(columns={'cumulative': f'cumulative_{i}'}, inplace=True)

# concatenate the dataframes with common time index
df_all = pd.concat(dfs, sort=False).sort_index()

# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)

# plot graphs
mean_val = df_all.iloc[:, 1:].mean(axis=1)
std_val = df_all.iloc[:, 1:].std(axis=1)
min_val = mean_val - 2 * std_val
max_val = mean_val + 2 * std_val

fig, ax = plt.subplots(1, 1, figsize=(16, 9))
df_all.iloc[:, 1:4].plot(ax=ax)
plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()

This produces the three interpolated curves together with a much smoother mean curve and confidence band.
Pandas finding local max and min
I have a pandas data frame with two columns: one is temperature, the other is time. I would like to add third and fourth columns called min and max. Each of these columns would be filled with NaNs, except where there is a local min or max; then it would have the value of that extremum. Here is a sample of what the data looks like; essentially, I am trying to identify all the peaks and low points in the figure. Are there any built-in tools with pandas that can accomplish this?
The solution offered by fuglede is great, but if your data is very noisy (like the one in the picture) you will end up with lots of misleading local extremes. I suggest that you use the scipy.signal.argrelextrema() method. It has its own limitations, but it has a useful feature where you can specify the number of points to be compared, kind of like a noise-filtering algorithm. For example:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema

# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])

n = 5  # number of points to be checked before and after

# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal, order=n)[0]]['data']

# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()

Some points:
- you might need to check the points afterward to ensure there are no twin points very close to each other
- you can play with n to filter the noisy points
- argrelextrema returns a tuple, and the [0] at the end extracts a numpy array
Assuming that the column of interest is labelled data, one solution would be

df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]

For example:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])

# Find local peaks
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]

# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
df.data.plot()
using Numpy

ser = np.random.randint(-40, 40, 100)  # 100 points
# note: this flags every index where the next value is lower,
# i.e. every downward step, not only the peaks
peak = np.where(np.diff(ser) < 0)[0]

or

# this isolates true peaks (a rise followed by a fall);
# the reported index is one less than the peak's position in ser
double_difference = np.diff(np.sign(np.diff(ser)))
peak = np.where(double_difference == -2)[0]

using Pandas

ser = pd.Series(np.random.randint(2, 5, 100))
peak_df = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
peak = peak_df.index
You can do something similar to Foad's .argrelextrema() solution, but with the pandas .rolling() function:

# Find local peaks
n = 5  # rolling period
local_min_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).min()]
local_max_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).max()]

# Plot results
plt.scatter(local_min_vals.index, local_min_vals, c='r')
plt.scatter(local_max_vals.index, local_max_vals, c='g')