Analysing height difference from columns and selecting max difference in Python
I have a .csv file containing x y data from transects (.csv file here).
The file can contain a few dozen transects (the example has only 4).
I want to calculate the elevation change of each transect and then select the transect with the greatest elevation change.
x y lines
0 3.444 1
0.009 3.445 1
0.180 3.449 1
0.027 3.449 1
...
0 2.115 2
0.008 2.115 2
0.017 2.115 2
0.027 2.116 2
I've tried to calculate the change with pandas.DataFrame.diff, but I'm unable to select the highest elevation change from the result.
UPDATE: I found a way to calculate the height difference for one transect. The goal is now to loop this script over the other transects and have it select the transect with the highest difference. I'm not sure how to build that loop...
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import savgol_filter, find_peaks

df = pd.read_csv('transect4.csv', delimiter=',', header=None, names=['x', 'y', 'lines'])
#select the first transect (lines == 1)
df1 = df[df['lines'] == 1]
plt.plot(df1['x'], df1['y'], label='Original Topography')
#apply a Savitzky-Golay filter
smooth = savgol_filter(df1.y.values, window_length = 351, polyorder = 5)
#find the maximums
peaks_idx_max, _ = find_peaks(smooth, prominence = 0.01)
#reciprocal, so mins will become max
smooth_rec = 1/smooth
#find the mins now
peaks_idx_mins, _ = find_peaks(smooth_rec, prominence = 0.01)
plt.xlabel('Distance')
plt.ylabel('Height')
plt.plot(df1['x'], smooth, label='Smoothed Topography')
#plot them
plt.scatter(df1.x.values[peaks_idx_max], smooth[peaks_idx_max], s = 55,
c = 'green', label = 'Local Max Cusp')
plt.scatter(df1.x.values[peaks_idx_mins], smooth[peaks_idx_mins], s = 55,
c = 'black', label = 'Local Min Cusp')
plt.legend(loc='upper left')
plt.show()
#Export to csv
df['Cusp_max']=False
df['Cusp_min']=False
df.loc[df1.x[peaks_idx_max].index, 'Cusp_max']=True
df.loc[df1.x[peaks_idx_mins].index, 'Cusp_min']=True
data=df[df['Cusp_max'] | df['Cusp_min']]
data.to_csv(r'Cusp_total.csv')
#Calculate height difference
my_data = pd.read_csv('Cusp_total.csv', delimiter=',', header=0,
                      names=['ID', 'x', 'y', 'lines', 'Cusp_max', 'Cusp_min'])
#keep only transect 1
df1_diff = my_data[my_data['lines'] == 1].copy()
df1_diff['Diff_Cusps'] = df1_diff['y'].diff(-1)
#Only use positive numbers for average
df1_pos = df1_diff[df1_diff['Diff_Cusps'] > 0]
print("Average Height Difference: ", df1_pos['Diff_Cusps'].mean(), "m")
Ideally, the script would select the transect with the highest elevation change from an unknown number of transects in the .csv file, which will then be exported to a new .csv file.
You need to group by the lines column.
I'm not sure if this is what you mean by elevation change, but the following gives the difference in elevation (max(y) - min(y)) for each group, where a group is formed by all rows sharing the same value of lines. This should help with what you are missing in your logic (sorry, I can't put more time in).
frame = pd.read_csv('transect4.csv', header=None, names=['x', 'y', 'lines'])
groups = frame.groupby('lines')
groups['y'].max() - groups['y'].min()
# Should give you the elevation change (max - min) of each group.
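To then pick the transect with the largest elevation change and write it to a new file, a minimal follow-up sketch (the output file name max_transect.csv is just an illustration):
ranges = groups['y'].max() - groups['y'].min()   # elevation change per transect
best_line = ranges.idxmax()                      # transect id with the largest change
print("Transect", best_line, "has the largest elevation change:", ranges.max(), "m")
frame[frame['lines'] == best_line].to_csv('max_transect.csv', index=False)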
Related
Generate values in separate dataframe
I am trying to generate random data with Pandas. The data needs to be stored in two columns. The first column needs to contain categorical variables (Stratum_1 through Stratum_19), and each of these stratums can contain a random number of values. The second column needs to hold data in the range between 1 and 180000000, with a standard deviation of 453210, a mean of 170000, and 100000 rows. I tried:
categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9', 'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}
desired_mean = 170000
desired_std_dev = 453210
df = pd.DataFrame(np.random.randint(0,180000000,size=(100000, 1)),columns=list('1'))
I tried the code above but don't know how to combine the categorical and numerical values with the desired mean and standard deviation. Can anybody help me solve this problem?
I decided to use the gamma distribution to generate your desired sample, since the given parameters are not suitable for the normal distribution.
Code
import numpy as np
import pandas as pd

# desired parameters
n_rows = 100000
lower, upper = 1, 180000000
mu, sigma = 170000, 453210

# amount of shift
delta = lower

# parameters for the gamma distribution
shape = ((mu - delta) / sigma) ** 2
scale = sigma**2 / (mu - delta)

# Create a dataframe
categories = {'name': [f'Stratum_{i}' for i in range(1, 19 + 1)]}
df = pd.DataFrame(categories).sample(n=n_rows, replace=True).reset_index(drop=True)

# Generate samples along with your desired parameters
generator = np.random.default_rng()
while True:
    df['value'] = generator.gamma(shape=shape, scale=scale, size=n_rows) + delta
    if df.value.max() <= upper:
        break

# Show statistics
print(df.describe())
Output
              value
count       100,000
mean        169,403   (Target: 170,000)
std         449,668   (Target: 453,210)
min               1
25%         39.4267
50%         5529.28
75%         105,748
max     9.45114e+06
Try:
import numpy as np
categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9', 'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}
desired_mean = 170000
desired_std_dev = 453210
df = pd.DataFrame({'num': np.random.normal(170000, 453210, size=(300000, 1)).reshape(-1),
                   'cat': np.random.choice(categorical['name'], 300000)})
df[(0 < df['num']) & (df['num'] < 180000000)].sample(100000)
result:
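A caveat worth noting (not part of the original answer): truncating the normal sample to (0, 180000000) shifts the realised statistics away from the targets, so it is worth checking them after sampling, e.g.:
sample = df[(0 < df['num']) & (df['num'] < 180000000)].sample(100000)
# the mean rises well above 170000 and the std shrinks once the left tail below 0 is cut off
print(sample['num'].mean(), sample['num'].std())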
Find intersection points for two stock timeseries
Background I am trying to find intersection points of two series. In this stock example, I would like to find the intersection points of SMA20 & SMA50. Simple Moving Average (SMA) is commonly used as stock indicators, combined with intersections and other strategies will help one to make decision. Below is the code example. Code You can run the following with jupyter. import pandas as pd import matplotlib.pyplot as plt %matplotlib inline datafile = 'output_XAG_D1_20200101_to_20200601.csv' #This creates a dataframe from the CSV file: data = pd.read_csv(datafile, index_col = 'Date') #This selects the 'Adj Close' column close = data['BidClose'] #This converts the date strings in the index into pandas datetime format: close.index = pd.to_datetime(close.index) close sma20 = close.rolling(window=20).mean() sma50 = close.rolling(window=50).mean() priceSma_df = pd.DataFrame({ 'BidClose' : close, 'SMA 20' : sma20, 'SMA 50' : sma50 }) priceSma_df.plot() plt.show() Sample Data This is the data file used in example output_XAG_D1_20200101_to_20200601.csv Date,BidOpen,BidHigh,BidLow,BidClose,AskOpen,AskHigh,AskLow,AskClose,Volume 01.01.2020 22:00:00,1520.15,1531.26,1518.35,1527.78,1520.65,1531.75,1518.73,1531.73,205667 01.02.2020 22:00:00,1527.78,1553.43,1526.72,1551.06,1531.73,1553.77,1528.17,1551.53,457713 01.05.2020 22:00:00,1551.06,1588.16,1551.06,1564.4,1551.53,1590.51,1551.53,1568.32,540496 01.06.2020 22:00:00,1564.4,1577.18,1555.2,1571.62,1568.32,1577.59,1555.54,1575.56,466430 01.07.2020 22:00:00,1571.62,1611.27,1552.13,1554.79,1575.56,1611.74,1552.48,1558.72,987671 01.08.2020 22:00:00,1554.79,1561.24,1540.08,1549.78,1558.72,1561.58,1540.5,1553.73,473799 01.09.2020 22:00:00,1549.78,1563.0,1545.62,1562.44,1553.73,1563.41,1545.96,1562.95,362002 01.12.2020 22:00:00,1562.44,1562.44,1545.38,1545.46,1562.95,1563.06,1546.71,1549.25,280809 01.13.2020 22:00:00,1545.46,1548.77,1535.78,1545.1,1549.25,1549.25,1536.19,1548.87,378200 01.14.2020 22:00:00,1545.1,1558.04,1543.79,1554.89,1548.87,1558.83,1546.31,1558.75,309719 01.15.2020 22:00:00,1554.89,1557.98,1547.91,1551.18,1558.75,1558.75,1548.24,1554.91,253944 01.16.2020 22:00:00,1551.18,1561.12,1549.28,1556.68,1554.91,1561.55,1549.59,1557.15,239186 01.19.2020 22:00:00,1556.68,1562.69,1556.25,1560.77,1557.15,1562.97,1556.61,1561.17,92020 01.20.2020 22:00:00,1560.77,1568.49,1546.21,1556.8,1561.17,1568.87,1546.56,1558.5,364753 01.21.2020 22:00:00,1556.8,1559.18,1550.07,1558.59,1558.5,1559.47,1550.42,1559.31,238468 01.22.2020 22:00:00,1558.59,1567.83,1551.8,1562.45,1559.31,1568.16,1552.11,1564.17,365518 01.23.2020 22:00:00,1562.45,1575.77,1556.44,1570.39,1564.17,1576.12,1556.76,1570.87,368529 01.26.2020 22:00:00,1570.39,1588.41,1570.39,1580.51,1570.87,1588.97,1570.87,1582.33,510524 01.27.2020 22:00:00,1580.51,1582.93,1565.31,1567.15,1582.33,1583.3,1565.79,1570.62,384205 01.28.2020 22:00:00,1567.15,1577.93,1563.27,1576.7,1570.62,1578.22,1563.61,1577.25,328766 01.29.2020 22:00:00,1576.7,1585.87,1572.19,1573.23,1577.25,1586.18,1572.44,1575.33,522371 01.30.2020 22:00:00,1573.23,1589.98,1570.82,1589.75,1575.33,1590.37,1571.14,1590.31,482710 02.02.2020 22:00:00,1589.75,1593.09,1568.65,1575.62,1590.31,1595.82,1569.85,1578.35,488585 02.03.2020 22:00:00,1575.62,1579.56,1548.95,1552.55,1578.35,1579.87,1549.31,1556.4,393037 02.04.2020 22:00:00,1552.55,1562.3,1547.34,1554.62,1556.4,1562.64,1547.72,1556.42,473172 02.05.2020 22:00:00,1554.62,1568.14,1552.39,1565.08,1556.42,1568.51,1552.73,1567.0,365580 02.06.2020 
22:00:00,1565.08,1574.02,1559.82,1570.11,1567.0,1574.33,1560.7,1570.55,424269 02.09.2020 22:00:00,1570.11,1576.9,1567.9,1571.05,1570.55,1577.25,1568.21,1573.34,326606 02.10.2020 22:00:00,1571.05,1573.92,1561.92,1566.12,1573.34,1574.27,1562.24,1568.12,310037 02.11.2020 22:00:00,1566.12,1570.39,1561.45,1564.26,1568.12,1570.71,1561.91,1567.02,269032 02.12.2020 22:00:00,1564.26,1578.24,1564.26,1574.5,1567.02,1578.52,1565.81,1576.63,368438 02.13.2020 22:00:00,1574.5,1584.87,1572.44,1584.49,1576.63,1585.29,1573.28,1584.91,250788 02.16.2020 22:00:00,1584.49,1584.49,1578.7,1580.79,1584.91,1584.91,1579.06,1581.31,101499 02.17.2020 22:00:00,1580.79,1604.97,1580.79,1601.06,1581.31,1605.33,1581.31,1603.08,321542 02.18.2020 22:00:00,1601.06,1612.83,1599.41,1611.27,1603.08,1613.4,1599.77,1613.34,357488 02.19.2020 22:00:00,1611.27,1623.62,1603.74,1618.48,1613.34,1623.98,1604.12,1621.27,535148 02.20.2020 22:00:00,1618.48,1649.26,1618.48,1643.42,1621.27,1649.52,1619.19,1643.87,590262 02.23.2020 22:00:00,1643.42,1689.22,1643.42,1658.62,1643.87,1689.55,1643.87,1659.07,1016570 02.24.2020 22:00:00,1658.62,1660.76,1624.9,1633.19,1659.07,1661.52,1625.5,1636.23,1222774 02.25.2020 22:00:00,1633.19,1654.88,1624.74,1640.4,1636.23,1655.23,1625.11,1642.59,1004692 02.26.2020 22:00:00,1640.4,1660.3,1635.15,1643.99,1642.59,1660.6,1635.6,1646.42,1084115 02.27.2020 22:00:00,1643.99,1649.39,1562.74,1584.95,1646.42,1649.84,1563.22,1585.58,1174015 03.01.2020 22:00:00,1584.95,1610.94,1575.29,1586.55,1585.58,1611.26,1575.88,1590.33,1115889 03.02.2020 22:00:00,1586.55,1649.16,1586.55,1640.19,1590.33,1649.6,1589.43,1644.16,889364 03.03.2020 22:00:00,1640.19,1652.81,1631.73,1635.95,1644.16,1653.51,1632.1,1639.05,589438 03.04.2020 22:00:00,1635.95,1674.51,1634.91,1669.36,1639.05,1674.9,1635.3,1672.83,643444 03.05.2020 22:00:00,1669.36,1692.1,1641.61,1673.89,1672.83,1692.65,1642.75,1674.46,1005737 03.08.2020 21:00:00,1673.89,1703.19,1656.98,1678.31,1674.46,1703.52,1657.88,1679.2,910166 03.09.2020 21:00:00,1678.31,1680.43,1641.37,1648.71,1679.2,1681.18,1641.94,1649.75,943377 03.10.2020 21:00:00,1648.71,1671.15,1632.9,1634.42,1649.75,1671.56,1633.31,1637.07,793816 03.11.2020 21:00:00,1634.42,1650.28,1560.5,1578.29,1637.07,1650.8,1560.92,1580.01,1009172 03.12.2020 21:00:00,1578.29,1597.85,1504.34,1528.99,1580.01,1598.36,1505.14,1530.09,1052940 03.15.2020 21:00:00,1528.99,1575.2,1451.08,1509.12,1530.09,1576.05,1451.49,1512.94,1196812 03.16.2020 21:00:00,1509.12,1553.91,1465.4,1528.57,1512.94,1554.21,1466.1,1529.43,1079729 03.17.2020 21:00:00,1528.57,1545.93,1472.49,1485.85,1529.43,1546.74,1472.99,1486.75,976857 03.18.2020 21:00:00,1485.85,1500.68,1463.49,1471.89,1486.75,1501.6,1464.64,1474.16,833803 03.19.2020 21:00:00,1471.89,1516.07,1454.46,1497.01,1474.16,1516.57,1455.93,1497.82,721471 03.22.2020 21:00:00,1497.01,1560.86,1482.21,1551.45,1497.82,1561.65,1483.22,1553.09,707830 03.23.2020 21:00:00,1551.45,1631.23,1551.45,1621.05,1553.09,1638.75,1553.09,1631.35,164862 03.24.2020 21:00:00,1621.05,1636.23,1588.82,1615.77,1631.35,1650.03,1601.29,1618.47,205272 03.25.2020 21:00:00,1615.77,1642.96,1587.7,1628.31,1618.47,1649.81,1599.87,1633.29,152804 03.26.2020 21:00:00,1628.31,1630.48,1606.76,1617.5,1633.29,1638.48,1616.9,1622.8,307278 03.29.2020 21:00:00,1617.5,1631.48,1602.51,1620.91,1622.8,1643.86,1612.55,1623.77,291653 03.30.2020 21:00:00,1620.91,1626.55,1573.37,1574.9,1623.77,1627.31,1575.24,1579.1,371507 03.31.2020 21:00:00,1574.9,1600.41,1560.13,1590.13,1579.1,1603.42,1570.75,1592.43,412780 04.01.2020 
21:00:00,1590.13,1619.76,1582.42,1612.07,1592.43,1621.1,1583.37,1614.49,704652 04.02.2020 21:00:00,1612.07,1625.21,1605.39,1618.63,1614.49,1626.83,1607.69,1621.37,409490 04.05.2020 21:00:00,1618.63,1668.35,1608.59,1657.77,1621.37,1670.98,1609.7,1663.43,381690 04.06.2020 21:00:00,1657.77,1671.95,1641.84,1644.84,1663.43,1677.53,1643.4,1650.46,286313 04.07.2020 21:00:00,1644.84,1656.39,1640.1,1644.06,1650.46,1657.43,1643.46,1646.66,219464 04.08.2020 21:00:00,1644.06,1689.66,1643.05,1682.16,1646.66,1691.13,1644.83,1686.74,300111 04.12.2020 21:00:00,1682.16,1722.25,1677.35,1709.16,1686.74,1725.48,1680.49,1718.28,280905 04.13.2020 21:00:00,1709.16,1747.04,1708.56,1726.18,1718.28,1748.88,1709.36,1729.72,435098 04.14.2020 21:00:00,1726.18,1730.53,1706.67,1714.35,1729.72,1732.97,1708.95,1717.25,419065 04.15.2020 21:00:00,1714.35,1738.65,1707.83,1715.99,1717.25,1740.35,1708.93,1720.09,615105 04.16.2020 21:00:00,1715.99,1718.46,1677.16,1683.2,1720.09,1720.09,1680.55,1684.97,587875 04.19.2020 21:00:00,1683.2,1702.49,1671.1,1694.71,1684.97,1703.46,1672.02,1697.29,412116 04.20.2020 21:00:00,1694.71,1697.66,1659.42,1683.4,1697.29,1698.44,1662.3,1686.58,502893 04.21.2020 21:00:00,1683.4,1718.21,1679.61,1713.67,1686.58,1719.19,1680.71,1716.91,647622 04.22.2020 21:00:00,1713.67,1738.59,1706.93,1729.89,1716.91,1739.47,1707.72,1731.83,751833 04.23.2020 21:00:00,1729.89,1736.31,1710.56,1726.74,1731.83,1736.98,1711.03,1727.71,608827 04.26.2020 21:00:00,1726.74,1727.55,1705.99,1713.36,1727.71,1728.55,1706.72,1715.29,698217 04.27.2020 21:00:00,1713.36,1716.52,1691.41,1707.66,1715.29,1718.02,1692.51,1710.22,749906 04.28.2020 21:00:00,1707.66,1717.42,1697.65,1711.58,1710.22,1718.57,1698.4,1715.42,630720 04.29.2020 21:00:00,1711.58,1721.94,1681.36,1684.97,1715.42,1722.79,1681.91,1687.92,631609 04.30.2020 21:00:00,1684.97,1705.87,1669.62,1699.92,1687.92,1706.33,1670.81,1701.66,764742 05.03.2020 21:00:00,1699.92,1714.75,1691.46,1700.42,1701.66,1715.83,1692.96,1702.17,355859 05.04.2020 21:00:00,1700.42,1711.64,1688.55,1703.04,1702.17,1712.55,1690.42,1706.71,415576 05.05.2020 21:00:00,1703.04,1708.1,1681.6,1685.18,1706.71,1708.71,1682.33,1688.33,346814 05.06.2020 21:00:00,1685.18,1721.95,1683.59,1715.17,1688.33,1722.53,1684.8,1716.91,379103 05.07.2020 21:00:00,1715.17,1723.54,1701.49,1704.06,1716.91,1724.42,1702.1,1705.25,409225 05.10.2020 21:00:00,1704.06,1712.02,1691.75,1696.68,1705.25,1713.03,1692.45,1697.58,438010 05.11.2020 21:00:00,1696.68,1710.94,1693.56,1701.46,1697.58,1711.31,1693.92,1703.32,369988 05.12.2020 21:00:00,1701.46,1718.11,1698.86,1716.09,1703.32,1718.69,1699.4,1718.63,518107 05.13.2020 21:00:00,1716.09,1736.16,1710.79,1727.71,1718.63,1736.55,1711.33,1731.38,447401 05.14.2020 21:00:00,1727.71,1751.56,1727.71,1743.94,1731.38,1752.1,1728.89,1744.96,561909 05.17.2020 21:00:00,1743.94,1765.3,1727.4,1731.73,1744.96,1765.92,1728.08,1732.99,495628 05.18.2020 21:00:00,1731.73,1747.76,1725.05,1743.52,1732.99,1748.24,1726.29,1746.9,596250 05.19.2020 21:00:00,1743.52,1753.8,1742.04,1747.22,1746.9,1754.28,1742.62,1748.48,497960 05.20.2020 21:00:00,1747.22,1748.7,1717.14,1726.56,1748.48,1751.18,1717.39,1727.82,557122 05.21.2020 21:00:00,1726.56,1740.06,1723.33,1735.67,1727.82,1740.7,1724.41,1736.73,336867 05.24.2020 21:00:00,1735.67,1735.67,1721.61,1727.88,1736.73,1736.73,1721.83,1730.25,164650 05.25.2020 21:00:00,1727.88,1735.39,1708.48,1710.1,1730.25,1735.99,1709.34,1712.21,404914 05.26.2020 21:00:00,1710.1,1715.93,1693.57,1708.36,1712.21,1716.3,1694.04,1709.85,436519 05.27.2020 
21:00:00,1708.36,1727.42,1703.41,1717.28,1709.85,1727.93,1705.85,1721.0,416306 05.28.2020 21:00:00,1717.28,1737.58,1712.55,1731.2,1721.0,1738.26,1713.24,1732.07,399698 05.31.2020 21:00:00,1731.2,1744.51,1726.98,1738.73,1732.07,1745.11,1727.93,1742.56,365219
Problem
This is the result of this code. I'm looking for ways to find the intersections of the SMA20 (yellow) and SMA50 (green) lines, so that I can get an alert whenever these lines cross.
Solution
Print out the intersections, indicating whether the crossing is from above or below relative to each series.
import numpy as np
g20 = sma20.values
g50 = sma50.values
# np.sign(...) returns -1, 0 or 1
# np.diff(...) returns the difference between consecutive values, so sign changes mark intersections
# np.argwhere(...) removes zeros, preserving the turning points only
idx20 = np.argwhere(np.diff(np.sign(g20 - g50))).flatten()
priceSma_df.plot()
plt.scatter(close.index[idx20], sma50[idx20], color='red')
plt.show()
import numpy as np
f = close.values
g20 = sma20.values
g50 = sma50.values

idx20 = np.argwhere(np.diff(np.sign(f - g20))).flatten()
idx50 = np.argwhere(np.diff(np.sign(f - g50))).flatten()

priceSma_df = pd.DataFrame({
    'BidClose' : close,
    'SMA 20' : sma20,
    'SMA 50' : sma50
})
priceSma_df.plot()
plt.scatter(close.index[idx20], sma20[idx20], color='orange')
plt.scatter(close.index[idx50], sma50[idx50], color='green')
plt.show()
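A small follow-up sketch (not from the original answers) for recovering the direction of each SMA20/SMA50 crossing, assuming the close, sma20 and sma50 series defined above; the sign of the jump tells whether SMA20 moved above or below SMA50:
cross = np.diff(np.sign(sma20.values - sma50.values))   # +2 where SMA20 crosses above SMA50, -2 where it crosses below
for i in np.argwhere(cross).flatten():
    if np.isnan(cross[i]):   # skip the warm-up period where the SMAs are still NaN
        continue
    direction = 'above' if cross[i] > 0 else 'below'
    print(close.index[i + 1].date(), ': SMA20 crossed', direction, 'SMA50')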
2D bin (x,y) and calculate mean of values (c) of 10 deepest data points (z)
For a data set consisting of:
coordinates x, y
depth z
a certain value c
I would like to do the following more efficiently:
bin the data set in 2D bins based on the coordinates (x, y)
take the 10 deepest data points (z) per bin
calculate the mean value of c of these 10 data points per bin
Finally, show a 2D heatmap with the calculated mean values.
I have found a working solution, but it takes too long for small bins and/or large data sets. Is there a more efficient way of achieving the same result?
Current working example
Example dataframe:
import numpy as np
from numpy.random import rand
import pandas as pd
import math
import matplotlib.pyplot as plt

n = 10000
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
Bin the data set:
cell_size = 0.01
nx = math.ceil((max(df['x']) - min(df['x'])) / cell_size)
ny = math.ceil((max(df['y']) - min(df['y'])) / cell_size)
x_range = np.arange(0, nx)
y_range = np.arange(0, ny)
df['xbin'], x_edges = pd.cut(x=df['x'], bins=nx, labels=x_range, retbins=True)
df['ybin'], y_edges = pd.cut(x=df['y'], bins=ny, labels=y_range, retbins=True)
Code that now takes too long:
df = df.groupby(['xbin', 'ybin']).apply(
    lambda d: d.sort_values('z').head(10).mean())
Update an empty DataFrame for the bins without data and show the result:
index = pd.MultiIndex.from_product([x_range, y_range], names=['xbin', 'ybin'])
tot_df = pd.DataFrame(index=index, columns=['z', 'c'])
tot_df.update(df)
zval = tot_df['c'].astype('float').values
zval = zval.reshape((nx, ny))
zval = zval.T
zval = np.flipud(zval)
extent = [min(x_edges), max(x_edges), min(y_edges), max(y_edges)]
plt.matshow(zval, aspect='auto', extent=extent)
plt.show()
You can use np.searchsorted to bin the rows by x and y and then use groupby to take the 10 deepest values and calculate the means. As groupby maintains the order within each group, you can sort the values before applying the bins. groupby will also perform better without apply.
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
df = df.sort_values("z", ascending=False)

bins = np.linspace(0, 1, 11)
df["bin_x"] = np.searchsorted(bins, df['x'].values) - 1
df["bin_y"] = np.searchsorted(bins, df['y'].values) - 1

result = df.groupby(["bin_x", "bin_y"]).head(10)
result.groupby(["bin_x", "bin_y"])["c"].mean()
Result
bin_x  bin_y
0      0        0.369531
       1        0.601803
       2        0.554452
       3        0.575464
       4        0.455198
                  ...
9      5        0.469838
       6        0.420772
       7        0.367549
       8        0.379200
       9        0.523083
Name: c, Length: 100, dtype: float64
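To get back to the 2D heatmap the question asks for, a rough sketch building on the 10x10 binning above (reindexing fills bins that received no data with NaN; np, pd and plt as imported in the question):
means = result.groupby(["bin_x", "bin_y"])["c"].mean()
full_index = pd.MultiIndex.from_product([range(10), range(10)], names=["bin_x", "bin_y"])
grid = means.reindex(full_index).values.reshape(10, 10)   # rows = bin_x, columns = bin_y
plt.matshow(np.flipud(grid.T), aspect='auto', extent=[0, 1, 0, 1])
plt.show()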
Averaging several time-series together with confidence interval (with test code)
Sounds very complicated, but a simple plot will make it easy to understand: I have three curves of the cumulative sum of some values over time, which are the blue lines.
I want to average (or somehow combine in a statistically correct way) the three curves into one smooth curve and add a confidence interval.
I tried one simple solution - combining all the data into one curve, averaging it with the "rolling" function in pandas, and getting its standard deviation. I plotted those as the purple curve with the confidence interval around it.
The problem with my real data, as illustrated in the plot above, is that the curve isn't smooth at all; there are also sharp jumps in the confidence interval, which isn't a good representation of the 3 separate curves, as there are no jumps in them.
Is there a better way to represent the 3 different curves in one smooth curve with a nice confidence interval?
I supply a test code, tested on python 3.5.1 with numpy and pandas (don't change the seed in order to get the same curves). There are some constraints - increasing the number of points for the "rolling" function isn't a solution for me because some of my data is too short for that.
Test code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

np.random.seed(seed=42)

## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0,10000,size=100), columns=['vals'])
df1_combined_sorted = pd.concat([df1_time, df1_values], axis = 1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])

df2_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000,13000,size=100), columns=['vals'])
df2_combined_sorted = pd.concat([df2_time, df2_values], axis = 1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])

df3_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0,4000,size=100), columns=['vals'])
df3_combined_sorted = pd.concat([df3_time, df3_values], axis = 1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])

## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative, df2_combined_sorted_cumulative, df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time = pd.concat([df1_combined_sorted['time'], df2_combined_sorted['time'], df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis = 1)

## creating confidence intervals
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()

## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'], ma['vals'] + 2 * mstd['vals'], color='b', alpha=0.2)
plt.plot(df_all_sorted['time'], ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
matplotlib.use('Agg')
plt.show()
First of all, your sample code could be re-written to make better use of pd. For example
np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    df = pd.concat([times, vals], axis = 1).sort_values(by=['time']).\
        reset_index().drop('index', axis=1)
    df['cumulative'] = df.vals.cumsum()
    return df

# generate the dataframes
df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])

# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val

    plt.figure(figsize=(16,9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')
    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()
The reason your curves aren't that smooth may be that your rolling window is not large enough. You can increase this window size to get smoother graphs. For example, render(20) gives:
while render(30) gives:
However, the better way might be to impute each df['cumulative'] over the entire time window and compute the mean/confidence interval on these series. With that in mind, we can modify the code as follows:
np.random.seed(seed=42)

## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    # note that we set time as index of the returned data
    df = pd.concat([times, vals], axis = 1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df

df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)

# rename column for later plotting
for i,df in zip(range(3),dfs):
    df.rename(columns={'cumulative':f'cummulative_{i}'}, inplace=True)

# concatenate the dataframes with common time index
df_all = pd.concat(dfs, sort=False).sort_index()

# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)

# plot graphs
mean_val = df_all.iloc[:,1:].mean(axis=1)
std_val = df_all.iloc[:,1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val

fig, ax = plt.subplots(1,1,figsize=(16,9))
df_all.iloc[:,1:4].plot(ax=ax)
plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()
and we get:
Pandas finding local max and min
I have a pandas data frame with two columns: one is temperature, the other is time.
I would like to add third and fourth columns called min and max. Each of these columns would be filled with NaNs, except where there is a local min or max; there it would hold the value of that extremum.
Here is a sample of what the data looks like; essentially, I am trying to identify all the peaks and low points in the figure.
Are there any built-in tools in pandas that can accomplish this?
The solution offered by fuglede is great, but if your data is very noisy (like the one in the picture) you will end up with lots of misleading local extrema. I suggest that you use the scipy.signal.argrelextrema() method. The .argrelextrema() method has its own limitations, but it has a useful feature where you can specify the number of points to be compared, kind of like a noise-filtering algorithm. For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema

# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])

n = 5  # number of points to be checked before and after

# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal, order=n)[0]]['data']

# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()
Some points:
- you might need to check the points afterwards to ensure there are no twin points very close to each other
- you can play with n to filter out noisy points
- argrelextrema returns a tuple, and the [0] at the end extracts a numpy array
Assuming that the column of interest is labelled data, one solution would be
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1]*0.9 + r)
df = pd.DataFrame(xs, columns=['data'])

# Find local peaks
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]

# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
df.data.plot()
using Numpy
ser = np.random.randint(-40, 40, 100)  # 100 points
peak = np.where(np.diff(ser) < 0)[0]   # note: this flags every index where the next value is lower, not only the peaks
or
double_difference = np.diff(np.sign(np.diff(ser)))
peak = np.where(double_difference == -2)[0]
using Pandas
ser = pd.Series(np.random.randint(2, 5, 100))
peak_df = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
peak = peak_df.index
You can do something similar to Foad's .argrelextrema() solution, but with the Pandas .rolling() function:
# Find local peaks
n = 5  # rolling period
local_min_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).min()]
local_max_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).max()]

plt.scatter(local_min_vals.index, local_min_vals, c='r')
plt.scatter(local_max_vals.index, local_max_vals, c='g')
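The question asks for the extrema as extra columns rather than as a plot; a minimal sketch building on the rolling-window idea above (df and n as defined in that snippet; where() keeps the value only where the condition holds and puts NaN elsewhere):
df['min'] = df['data'].where(df['data'] == df['data'].rolling(n, center=True).min())
df['max'] = df['data'].where(df['data'] == df['data'].rolling(n, center=True).max())
print(df[['data', 'min', 'max']].head(20))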