Iterate through a converted datetime pandas dataframe with an external function - python

https://rhodesmill.org/skyfield/positions.html#azimuth-and-altitude-from-a-geographic-position
Hi, I have a function that generates a sun-shot azimuth for a specific date, time, and place, using the skyfield package for astronomical calculations.
What I want:
iterate through the df2.cdt rows as fix times (currently astro = Nieuwe_diep.at(ts.utc(2019, 12, 31, 8, 41, 44)).observe(sun) is hard-coded for fix #1),
add a new column to df2, called df2.azimuth, containing the output of az for each row through row iteration.
Currently I can only generate the azimuth for the first fix, with this code:
# Sunshot Azimuth - hour angle method
from skyfield.api import N,S,E,W, wgs84
from skyfield.api import load
import pandas as pd
# location Port of Den Helder, Nieuwe diep:
lat = 52+(57/60)+(26.9/3600)
lon = 4+(46/60)+(37.5/3600)
# fix1 # 2019-12-31 08:41:44 UTC
ts = load.timescale()
t = ts.utc(2019, 12, 31)
planets = load('de421.bsp')
earth, sun = planets['earth'], planets['sun']
# Altitude and azimuth in the sky for a specific geographic location
earth = planets['earth']
Nieuwe_diep = earth + wgs84.latlon(lat * N, lon * E, elevation_m=6)
astro = Nieuwe_diep.at(ts.utc(2019, 12, 31, 8, 41, 44)).observe(sun)
app = astro.apparent()
alt, az, distance = app.altaz()
print('alt: ' + alt.dstr())
print('az: ' + az.dstr())
print(distance)
print('lat, lon: ' + str(lat), str(lon))
#dt_utc = df2['datetime_UTC']
print('az: {:.3f}'.format(az.degrees)) # desired output for azimuth in decimal degrees
print('az: '+ az.dstr(format=u'{0}{1}°{2:02}′{3:02}.{4:0{5}}″'))
which results in:
alt: 04deg 18' 42.2"
az: 138deg 52' 22.3"
0.983305 au
lat, lon: 52.95747222222222 4.777083333333334
az: 138.873
az: 138°52′22.3″
I have a pandas dataframe that consists of the times at which I want to know the sun's azimuth. The column cdt holds the converted datetime:
# cdt: converted datetime
df2['cdt'] = df2['datetime_UTC'].dt.strftime('%Y, %m, %d, %H, %M, %S')
print(df2)
cdt = df2.cdt
date time datetime_UTC cdt
0 2019-12-31 08:41:44 2019-12-31 08:41:44 2019, 12, 31, 08, 41, 44
1 2019-12-31 08:43:16 2019-12-31 08:43:16 2019, 12, 31, 08, 43, 16
2 2019-12-31 08:44:12 2019-12-31 08:44:12 2019, 12, 31, 08, 44, 12
3 2019-12-31 08:44:52 2019-12-31 08:44:52 2019, 12, 31, 08, 44, 52
4 2019-12-31 08:46:01 2019-12-31 08:46:01 2019, 12, 31, 08, 46, 01
5 2019-12-31 08:46:42 2019-12-31 08:46:42 2019, 12, 31, 08, 46, 42
6 2019-12-31 08:47:21 2019-12-31 08:47:21 2019, 12, 31, 08, 47, 21
7 2019-12-31 08:48:12 2019-12-31 08:48:12 2019, 12, 31, 08, 48, 12
8 2019-12-31 08:48:58 2019-12-31 08:48:58 2019, 12, 31, 08, 48, 58
9 2019-12-31 09:07:08 2019-12-31 09:07:08 2019, 12, 31, 09, 07, 08
10 2019-12-31 09:07:24 2019-12-31 09:07:24 2019, 12, 31, 09, 07, 24
11 2019-12-31 09:07:45 2019-12-31 09:07:45 2019, 12, 31, 09, 07, 45
12 2019-12-31 09:08:03 2019-12-31 09:08:03 2019, 12, 31, 09, 08, 03
13 2019-12-31 09:08:19 2019-12-31 09:08:19 2019, 12, 31, 09, 08, 19
14 2019-12-31 09:08:34 2019-12-31 09:08:34 2019, 12, 31, 09, 08, 34
15 2019-12-31 09:08:50 2019-12-31 09:08:50 2019, 12, 31, 09, 08, 50
16 2019-12-31 09:09:13 2019-12-31 09:09:13 2019, 12, 31, 09, 09, 13
17 2019-12-31 09:09:33 2019-12-31 09:09:33 2019, 12, 31, 09, 09, 33
18 2019-12-31 09:09:57 2019-12-31 09:09:57 2019, 12, 31, 09, 09, 57
19 2019-12-31 09:10:20 2019-12-31 09:10:20 2019, 12, 31, 09, 10, 20

I think this would work. You'd have to take the output and collect it in a list, dictionary, or another dataframe. Also, it seems like there should be a better way to pass and parse the UTC time, but I'm not familiar with the library.
import io
import pandas as pd
from skyfield.api import N, E, wgs84, load

data = '''date time datetime_UTC
2019-12-31 08:41:44 2019-12-31 08:41:44
2019-12-31 08:43:16 2019-12-31 08:43:16
2019-12-31 08:44:12 2019-12-31 08:44:12
'''
df2 = pd.read_csv(io.StringIO(data), sep=' \s+', engine='python')
df2['datetime_UTC'] = pd.to_datetime(df2['datetime_UTC'])
# note I changed the formatting to remove spaces for later parsing
df2['cdt'] = df2['datetime_UTC'].dt.strftime('%Y,%m,%d,%H,%M,%S')

def calc_az(tutc):
    # unpack the comma-separated timestamp into integer components
    yr, mo, da, hr, mi, se = (int(part) for part in tutc.split(','))
    # location Port of Den Helder, Nieuwe diep:
    lat = 52 + (57/60) + (26.9/3600)
    lon = 4 + (46/60) + (37.5/3600)
    ts = load.timescale()
    planets = load('de421.bsp')
    earth, sun = planets['earth'], planets['sun']
    # altitude and azimuth in the sky for a specific geographic location
    Nieuwe_diep = earth + wgs84.latlon(lat * N, lon * E, elevation_m=6)
    astro = Nieuwe_diep.at(ts.utc(yr, mo, da, hr, mi, se)).observe(sun)
    app = astro.apparent()
    alt, az, distance = app.altaz()
    print('alt: ' + alt.dstr())
    print('az: ' + az.dstr())
    print(distance)
    print('lat, lon: ' + str(lat), str(lon))
    print('az: {:.3f}'.format(az.degrees))  # azimuth in decimal degrees
    print('az: ' + az.dstr(format=u'{0}{1}°{2:02}′{3:02}.{4:0{5}}″'))
    print('\n' * 2)
    return

df2['cdt'].apply(calc_az)
output
alt: 04deg 18' 42.2"
az: 138deg 52' 22.3"
0.983305 au
lat, lon: 52.95747222222222 4.777083333333334
az: 138.873
az: 138°52′22.3″
alt: 04deg 27' 47.3"
az: 139deg 11' 31.5"
0.983305 au
lat, lon: 52.95747222222222 4.777083333333334
az: 139.192
az: 139°11′31.5″
alt: 04deg 33' 17.4"
az: 139deg 23' 11.9"
0.983305 au
lat, lon: 52.95747222222222 4.777083333333334
az: 139.387
az: 139°23′11.9″
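For what it's worth, a vectorized alternative (my sketch, not part of the answer above): Skyfield's ts.utc accepts arrays, so you can load the ephemeris once, compute every azimuth in a single call, and assign the result directly as the new df2.azimuth column. This assumes df2['datetime_UTC'] is already a datetime64 column in UTC.

from skyfield.api import N, E, load, wgs84

ts = load.timescale()
planets = load('de421.bsp')
earth, sun = planets['earth'], planets['sun']

lat = 52 + 57/60 + 26.9/3600
lon = 4 + 46/60 + 37.5/3600
Nieuwe_diep = earth + wgs84.latlon(lat * N, lon * E, elevation_m=6)

# One vectorized Time object covering all rows; no per-row string parsing
# and no reloading of de421.bsp for every fix.
dt = df2['datetime_UTC'].dt
t = ts.utc(dt.year.to_numpy(), dt.month.to_numpy(), dt.day.to_numpy(),
           dt.hour.to_numpy(), dt.minute.to_numpy(), dt.second.to_numpy())

alt, az, distance = Nieuwe_diep.at(t).observe(sun).apparent().altaz()
df2['azimuth'] = az.degrees  # decimal degrees, one value per row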

Related

Python: How to split dataframe with datetime index by number of observations?

data = {
    'aapl': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'aal' : [33, 33, 33, 32, 31, 30, 34, 29, 27, 26],
}
data = pd.DataFrame(data)
data.index = pd.date_range('2011-01-01', '2011-01-10')
n_obs = len(data) * 0.3
train, test = data[:n_obs], data[n_obs:]
>>> TypeError: cannot do slice indexing on DatetimeIndex with these indexers [3.0] of type float
I can probably slice the dataframe by date like df[:'2011-01-05'], but I want to split the data by number of observations, which I'm having difficulty doing with the method above.
You need to make sure you have an integer for slicing:
n_obs = int(len(data) * 0.3)
train, test = data[:n_obs], data[n_obs:]
output:
# train
aapl aal
2011-01-01 11 33
2011-01-02 12 33
2011-01-03 13 33
# test
aapl aal
2011-01-04 14 32
2011-01-05 15 31
2011-01-06 16 30
2011-01-07 17 34
2011-01-08 18 29
2011-01-09 19 27
2011-01-10 20 26
If you want to train/test a model you might be interested in getting a random sample:
test = data.sample(frac=0.3)
train = data.loc[data.index.difference(test.index)]
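One small addition of mine: .iloc makes the positional intent explicit (plain [] slicing on a DataFrame happens to be positional for integers, but .iloc says so outright), and passing random_state makes the random split reproducible.

n_obs = int(len(data) * 0.3)
train, test = data.iloc[:n_obs], data.iloc[n_obs:]

# reproducible variant of the random split
test = data.sample(frac=0.3, random_state=0)
train = data.loc[data.index.difference(test.index)]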

raise ValueError(err) - Implementation of multithreading using concurrent.future in Python

I have written Python code that scrapes information from a website, and I tried to apply multithreading to it. Here's my code before applying multithreading; it runs perfectly on my PC.
import time

import requests
from bs4 import BeautifulSoup
import pandas as pd
import investpy

def getCurrencyHistorical():
    t1 = time.perf_counter()
    headers = {'Accept-Language': 'en-US,en;q=0.9',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive'}
    links = {"USD-IDR": "https://www.investing.com/currencies/usd-idr-historical-data",
             "USD-JPY": "https://www.investing.com/currencies/usd-jpy-historical-data",
             "USD-CNY": "https://www.investing.com/currencies/usd-cny-historical-data"}
    column = []
    output = []
    for key, value in links.items():
        page = requests.get(value, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = soup.select('table')[0]
        # column names come from the <th> cells, data rows from the <td> cells
        rows = table.find_all('tr')
        for row in rows:
            cols = row.find_all('th')
            cols = [item.text.strip() for item in cols]
            column.append(cols)
            outs = row.find_all('td')
            outs = [item.text.strip() for item in outs]
            outs.append(key)
            output.append(outs)
        del output[0]
        #print(value)
        #print(output)
    column[0].append('Currency')
    df = pd.DataFrame(output, columns=column[0])
    t2 = time.perf_counter()
    print(f'Finished in {t2-t1} seconds')
    return df
But when I convert it as below, I get an error. Here's the code after applying multithreading:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import concurrent.futures
from functools import partial
import psutil

def process_data(key, page):
    soup = BeautifulSoup(page, 'html.parser')
    table = soup.select('table')[0]
    #ColumnName
    rows = table.find_all('tr')
    for row in rows:
        cols = row.find_all('th')
        cols = [item.text.strip() for item in cols]
        outs = row.find_all('td')
        outs = [item.text.strip() for item in outs]
        outs.append(key)
    return cols, outs

def getCurrencyHistorical(session, pool_executor, item):
    key, value = item
    page = session.get(value)
    f = pool_executor.submit(process_data, key, page.content)
    return f.result()

def main():
    t1 = time.perf_counter()
    links = {"USD-IDR": "https://www.investing.com/currencies/usd-idr-historical-data",
             "USD-JPY": "https://www.investing.com/currencies/usd-jpy-historical-data",
             "USD-CNY": "https://www.investing.com/currencies/usd-cny-historical-data"}
    with requests.Session() as session:
        user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
        session.headers = {'User-Agent': user_agent}
        column = []
        output = []
        with concurrent.futures.ProcessPoolExecutor(psutil.cpu_count(logical=False)) as pool_executor, \
             concurrent.futures.ThreadPoolExecutor(max_workers=len(links)) as executor:
            for return_value in executor.map(partial(getCurrencyHistorical, session, pool_executor), links.items()):
                cols, outs = return_value
                column.append(cols)
                output.append(outs)
    del output[0]
    column[0].append('Currency')
    df = pd.DataFrame(output, columns=column[0])
    t2 = time.perf_counter()
    print(f'Finished in {t2-t1} seconds')
    print(df)

# Required for Windows:
if __name__ == '__main__':
    main()
I get the error raise ValueError(err) from err. ValueError: 1 columns passed, passed data had 7 columns, and it comes from the line df = pd.DataFrame(output, columns = column[0]). What is wrong? Thank you.
process_data should look just like the non-multiprocessing case except that it processes only one key-value pair, but that's not what you have done: your loop keeps overwriting cols and outs instead of accumulating rows. The main process must then extend (not append) its lists with the per-page lists returned by process_data.
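To make the append-versus-extend point concrete, here is a tiny standalone illustration (mine, not from the original code): once process_data returns a list of rows per page, append nests that list one level too deep, so pd.DataFrame sees a single malformed column, while extend splices the rows in flat.

rows_from_page = [['Aug 26, 2021', '14,417.5', 'USD-IDR'],
                  ['Aug 25, 2021', '14,395.0', 'USD-IDR']]

output = []
output.append(rows_from_page)       # output == [[row, row]] -- one element holding everything
print(len(output), len(output[0]))  # 1 2

output = []
output.extend(rows_from_page)       # output == [row, row] -- one element per row
print(len(output), len(output[0]))  # 2 3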
Update
You were not retrieving the data items for key "USD-JPY" because you were not looking at the correct table. You should be looking at the table with id 'curr_table'. I have also updated the multiprocessing pool size per my comment to your question.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import concurrent.futures
from functools import partial
from os import cpu_count

def process_data(key, page):
    soup = BeautifulSoup(page, 'html.parser')
    table = soup.find('table', {'id': 'curr_table'})
    #ColumnName
    rows = table.find_all('tr')
    column = []
    output = []
    for row in rows:
        cols = row.find_all('th')
        cols = [item.text.strip() for item in cols]
        column.append(cols)
        outs = row.find_all('td')
        outs = [item.text.strip() for item in outs]
        outs.append(key)
        output.append(outs)
    del output[0]
    return column, output

def getCurrencyHistorical(session, pool_executor, item):
    key, value = item
    page = session.get(value)
    f = pool_executor.submit(process_data, key, page.content)
    return f.result()

def main():
    t1 = time.perf_counter()
    links = {"USD-IDR": "https://www.investing.com/currencies/usd-idr-historical-data",
             "USD-JPY": "https://www.investing.com/currencies/usd-jpy-historical-data",
             "USD-CNY": "https://www.investing.com/currencies/usd-cny-historical-data"}
    with requests.Session() as session:
        user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
        session.headers = {'User-Agent': user_agent}
        column = []
        output = []
        with concurrent.futures.ProcessPoolExecutor(min(len(links), cpu_count())) as pool_executor, \
             concurrent.futures.ThreadPoolExecutor(max_workers=len(links)) as executor:
            for return_value in executor.map(partial(getCurrencyHistorical, session, pool_executor), links.items()):
                cols, outs = return_value
                column.extend(cols)
                output.extend(outs)
    column[0].append('Currency')
    df = pd.DataFrame(output, columns=column[0])
    t2 = time.perf_counter()
    print(f'Finished in {t2-t1} seconds')
    pd.set_option("display.max_rows", None, "display.max_columns", None)
    print(df)

# Required for Windows:
if __name__ == '__main__':
    main()
Prints:
Finished in 2.1944901 seconds
Date Price Open High Low Change % Currency
0 Aug 26, 2021 14,417.5 14,425.0 14,430.0 14,411.0 0.16% USD-IDR
1 Aug 25, 2021 14,395.0 14,405.0 14,421.0 14,387.5 0.03% USD-IDR
2 Aug 24, 2021 14,390.0 14,395.0 14,407.5 14,377.5 -0.14% USD-IDR
3 Aug 23, 2021 14,410.0 14,435.0 14,438.5 14,404.0 -0.28% USD-IDR
4 Aug 20, 2021 14,450.0 14,475.0 14,485.0 14,422.5 0.35% USD-IDR
5 Aug 19, 2021 14,400.0 14,405.0 14,425.0 14,392.5 0.21% USD-IDR
6 Aug 18, 2021 14,370.0 14,387.5 14,400.0 14,372.5 0.00% USD-IDR
7 Aug 16, 2021 14,370.0 14,390.0 14,395.0 14,371.5 -0.10% USD-IDR
8 Aug 13, 2021 14,385.0 14,382.5 14,395.0 14,366.0 0.03% USD-IDR
9 Aug 12, 2021 14,380.0 14,395.0 14,407.5 14,366.0 0.00% USD-IDR
10 Aug 10, 2021 14,380.0 14,375.0 14,402.0 14,375.0 0.14% USD-IDR
11 Aug 09, 2021 14,360.0 14,370.0 14,387.5 14,357.5 0.07% USD-IDR
12 Aug 06, 2021 14,350.0 14,360.0 14,377.5 14,347.5 0.07% USD-IDR
13 Aug 05, 2021 14,340.0 14,330.0 14,360.0 14,321.0 0.21% USD-IDR
14 Aug 04, 2021 14,310.0 14,325.0 14,347.5 14,304.5 -0.21% USD-IDR
15 Aug 03, 2021 14,340.0 14,375.0 14,388.0 14,338.5 -0.55% USD-IDR
16 Aug 02, 2021 14,420.0 14,465.0 14,472.5 14,422.5 -0.28% USD-IDR
17 Jul 30, 2021 14,460.0 14,435.0 14,477.5 14,434.5 -0.14% USD-IDR
18 Jul 29, 2021 14,480.0 14,490.0 14,502.5 14,482.5 -0.03% USD-IDR
19 Jul 28, 2021 14,485.0 14,500.0 14,512.5 14,485.0 -0.03% USD-IDR
20 Jul 27, 2021 14,490.0 14,473.5 14,497.5 14,465.0 0.07% USD-IDR
21 Jul 26, 2021 14,480.0 14,510.0 14,522.5 14,470.0 -0.07% USD-IDR
22 Aug 26, 2021 110.10 109.98 110.23 109.93 0.10% USD-JPY
23 Aug 25, 2021 109.99 109.64 110.13 109.61 0.34% USD-JPY
24 Aug 24, 2021 109.62 109.69 109.89 109.41 -0.05% USD-JPY
25 Aug 23, 2021 109.68 109.81 110.15 109.65 -0.11% USD-JPY
26 Aug 20, 2021 109.80 109.75 109.89 109.57 0.07% USD-JPY
27 Aug 19, 2021 109.72 109.76 110.23 109.49 -0.02% USD-JPY
28 Aug 18, 2021 109.74 109.57 110.07 109.47 0.16% USD-JPY
29 Aug 17, 2021 109.57 109.22 109.66 109.12 0.31% USD-JPY
30 Aug 16, 2021 109.23 109.71 109.76 109.11 -0.31% USD-JPY
31 Aug 13, 2021 109.57 110.39 110.46 109.54 -0.73% USD-JPY
32 Aug 12, 2021 110.38 110.42 110.55 110.31 -0.02% USD-JPY
33 Aug 11, 2021 110.40 110.58 110.81 110.31 -0.14% USD-JPY
34 Aug 10, 2021 110.56 110.29 110.60 110.28 0.25% USD-JPY
35 Aug 09, 2021 110.28 110.26 110.36 110.02 0.03% USD-JPY
36 Aug 06, 2021 110.25 109.77 110.36 109.69 0.46% USD-JPY
37 Aug 05, 2021 109.74 109.49 109.79 109.40 0.25% USD-JPY
38 Aug 04, 2021 109.47 109.07 109.68 108.72 0.39% USD-JPY
39 Aug 03, 2021 109.04 109.32 109.36 108.88 -0.22% USD-JPY
40 Aug 02, 2021 109.28 109.69 109.79 109.18 -0.38% USD-JPY
41 Jul 30, 2021 109.70 109.49 109.83 109.36 0.22% USD-JPY
42 Jul 29, 2021 109.46 109.91 109.96 109.42 -0.40% USD-JPY
43 Jul 28, 2021 109.90 109.75 110.29 109.74 0.13% USD-JPY
44 Jul 27, 2021 109.76 110.36 110.41 109.58 -0.53% USD-JPY
45 Jul 26, 2021 110.34 110.57 110.59 110.11 -0.18% USD-JPY
46 Aug 26, 2021 6.4815 6.4725 6.4866 6.4725 0.09% USD-CNY
47 Aug 25, 2021 6.4756 6.4714 6.4811 6.4707 0.07% USD-CNY
48 Aug 24, 2021 6.4710 6.4790 6.4851 6.4676 -0.15% USD-CNY
49 Aug 23, 2021 6.4805 6.4915 6.4973 6.4788 -0.32% USD-CNY
50 Aug 20, 2021 6.5012 6.4960 6.5057 6.4935 0.11% USD-CNY
51 Aug 19, 2021 6.4942 6.4847 6.4997 6.4840 0.16% USD-CNY
52 Aug 18, 2021 6.4841 6.4861 6.4872 6.4776 -0.02% USD-CNY
53 Aug 17, 2021 6.4854 6.4787 6.4889 6.4759 0.17% USD-CNY
54 Aug 16, 2021 6.4742 6.4774 6.4810 6.4719 -0.04% USD-CNY
55 Aug 13, 2021 6.4768 6.4778 6.4854 6.4749 -0.02% USD-CNY
56 Aug 12, 2021 6.4782 6.4767 6.4811 6.4719 -0.00% USD-CNY
57 Aug 11, 2021 6.4783 6.4846 6.4894 6.4752 -0.11% USD-CNY
58 Aug 10, 2021 6.4852 6.4826 6.4875 6.4774 -0.01% USD-CNY
59 Aug 09, 2021 6.4857 6.4835 6.4895 6.4731 0.05% USD-CNY
60 Aug 06, 2021 6.4825 6.4660 6.4848 6.4622 0.34% USD-CNY
61 Aug 05, 2021 6.4608 6.4671 6.4677 6.4595 -0.07% USD-CNY
62 Aug 04, 2021 6.4655 6.4662 6.4673 6.4555 -0.07% USD-CNY
63 Aug 03, 2021 6.4700 6.4656 6.4710 6.4604 0.12% USD-CNY
64 Aug 02, 2021 6.4620 6.4615 6.4693 6.4580 0.02% USD-CNY
65 Jul 30, 2021 6.4609 6.4645 6.4693 6.4506 0.07% USD-CNY
66 Jul 29, 2021 6.4562 6.4908 6.4908 6.4544 -0.53% USD-CNY
67 Jul 28, 2021 6.4905 6.5095 6.5101 6.4891 -0.31% USD-CNY
68 Jul 27, 2021 6.5104 6.4760 6.5132 6.4735 0.43% USD-CNY
69 Jul 26, 2021 6.4825 6.4790 6.4875 6.4785 0.03% USD-CNY

Get 25 quantile in cumsum pandas

Suppose I have the following DataFrame:
df = pd.DataFrame({'id': [2, 4, 10, 12, 13, 14, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31, 42, 50, 54],
                   'value': [37410.0, 18400.0, 200000.0, 392000.0, 108000.0, 423000.0, 80000.0, 307950.0,
                             50807.0, 201740.0, 182700.0, 131300.0, 282005.0, 428800.0, 56000.0, 412400.0,
                             1091595.0, 1237200.0, 927500.0]})
And I do the following:
df.sort_values(by='id').set_index('id').cumsum()
value
id
2 37410.0
4 55810.0
10 255810.0
12 647810.0
13 755810.0
14 1178810.0
19 1258810.0
20 1566760.0
21 1617567.0
22 1819307.0
24 2002007.0
25 2133307.0
27 2415312.0
29 2844112.0
30 2900112.0
31 3312512.0
42 4404107.0
50 5641307.0
54 6568807.0
I want to know the first element of id that is bigger than 25% of the cumulative sum. In this example, 25% of the cumsum would be 1,642,201.75. The first element to exceed that would be 22. I know it can be done with a for loop, but I think that would be pretty inefficient.
You could do:
percentile_25 = df['value'].sum() * 0.25
res = df[df['value'].cumsum() > percentile_25].head(1)
print(res)
Output
id value
9 22 201740.0
Or use searchsorted to do the search in O(log N):
percentile_25 = df['value'].sum() * 0.25
i = df['value'].cumsum().searchsorted(percentile_25)
res = df.iloc[i]
print(res)
Output
id 22.0
value 201740.0
Name: 9, dtype: float64
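If you only need the id itself rather than the whole row (a small extra step on my part), note that searchsorted returns a plain integer position, so you can index straight into the id column:

i = df['value'].cumsum().searchsorted(percentile_25)
print(df['id'].iloc[i])  # 22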

Can you explain the output: diff.sort_values(ascending=False).index.astype

Can anyone explain the following statement.
list(diff.sort_values(ascending=False).index.astype(int)[0:5])
Output: Int64Index([24, 26, 17, 2, 1], dtype='int64')
It sorts first, but what is index doing, and how do I get 24, 26, 17, 2, 1?
diff is a Series:
ipdb> diff
1 0.017647
2 0.311765
3 -0.060000
4 -0.120000
5 -0.040000
6 -0.120000
7 -0.190000
8 -0.200000
9 -0.100000
10 -0.011176
11 -0.130000
12 0.008824
13 -0.060000
14 -0.090000
15 -0.060000
16 0.008824
17 0.341765
18 -0.140000
19 -0.050000
20 -0.060000
21 -0.040000
22 -0.210000
23 0.008824
24 0.585882
25 -0.060000
26 0.555882
27 -0.031176
28 -0.060000
29 -0.170000
30 -0.220000
31 -0.170000
32 -0.040000
dtype: float64
Your code returns a list of the index values of the top 5 values of the Series, sorted in descending order.
The first 'column' printed for a pandas Series is called the index, so after sorting, your code converts the index values to integers and takes a slice of the first five.
print (diff.sort_values(ascending=False))
24 0.585882
26 0.555882
17 0.341765
2 0.311765
1 0.017647
12 0.008824
23 0.008824
16 0.008824
10 -0.011176
27 -0.031176
32 -0.040000
21 -0.040000
5 -0.040000
19 -0.050000
15 -0.060000
3 -0.060000
13 -0.060000
25 -0.060000
28 -0.060000
20 -0.060000
14 -0.090000
9 -0.100000
6 -0.120000
4 -0.120000
11 -0.130000
18 -0.140000
31 -0.170000
29 -0.170000
7 -0.190000
8 -0.200000
22 -0.210000
30 -0.220000
Name: a, dtype: float64
print (diff.sort_values(ascending=False).index.astype(int))
Int64Index([24, 26, 17, 2, 1, 12, 23, 16, 10, 27, 32, 21, 5, 19, 15, 3, 13,
25, 28, 20, 14, 9, 6, 4, 11, 18, 31, 29, 7, 8, 22, 30],
dtype='int64')
print (diff.sort_values(ascending=False).index.astype(int)[0:5])
Int64Index([24, 26, 17, 2, 1], dtype='int64')
print (list(diff.sort_values(ascending=False).index.astype(int)[0:5]))
[24, 26, 17, 2, 1]
Here's what's happening:
diff.sort_values(ascending=False) - sorts the Series. By default, ascending is True, but you've set it to False, so it returns the Series sorted in descending order.
pandas.Series.index returns the row labels of the Series (the numbers 1 - 32 in your case, now ordered by the sorted values).
.astype(int) casts the index labels to integers.
[0:5] just picks the first five labels (positions 0 through 4).
Let me know if this helps!
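As an aside (my suggestion, equivalent for this data): Series.nlargest gets the same top-5 labels without sorting the whole Series and slicing by hand.

print(list(diff.nlargest(5).index.astype(int)))
# [24, 26, 17, 2, 1]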

Down-sampling specific period on dataframe using Pandas

I have a long time series that starts in 1963 and ends in 2013. However, from 1963 until 2007 it has an hourly sampling period, while after 2007 the sampling rate changes to 5 minutes. Is it possible to resample the data after 2007 so that the entire time series has hourly sampling? Data slice below.
yr, m, d, h, m, s, sl
2007, 11, 30, 19, 0, 0, 2180
2007, 11, 30, 20, 0, 0, 2310
2007, 11, 30, 21, 0, 0, 2400
2007, 11, 30, 22, 0, 0, 2400
2007, 11, 30, 23, 0, 0, 2270
2008, 1, 1, 0, 0, 0, 2210
2008, 1, 1, 0, 5, 0, 2210
2008, 1, 1, 0, 10, 0, 2210
2008, 1, 1, 0, 15, 0, 2200
2008, 1, 1, 0, 20, 0, 2200
2008, 1, 1, 0, 25, 0, 2200
2008, 1, 1, 0, 30, 0, 2200
2008, 1, 1, 0, 35, 0, 2200
2008, 1, 1, 0, 40, 0, 2200
2008, 1, 1, 0, 45, 0, 2200
2008, 1, 1, 0, 50, 0, 2200
2008, 1, 1, 0, 55, 0, 2200
2008, 1, 1, 1, 0, 0, 2190
2008, 1, 1, 1, 5, 0, 2190
Thanks!
Give your dataframe proper column names
df.columns = 'year month day hour minute second sl'.split()
Solution
df.groupby(['year', 'month', 'day', 'hour'], as_index=False).first()
year month day hour minute second sl
0 2007 11 30 19 0 0 2180
1 2007 11 30 20 0 0 2310
2 2007 11 30 21 0 0 2400
3 2007 11 30 22 0 0 2400
4 2007 11 30 23 0 0 2270
5 2008 1 1 0 0 0 2210
6 2008 1 1 1 0 0 2190
Option 2
Here is an option that builds off of the column renaming. We'll use pd.to_datetime to cleverly get at our dates, then use resample. However, you have time gaps and will have to address nulls and re-cast dtypes.
df.set_index(
    pd.to_datetime(df.drop(columns='sl'))
).resample('H').first().dropna().astype(df.dtypes)
year month day hour minute second sl
2007-11-30 19:00:00 2007 11 30 19 0 0 2180
2007-11-30 20:00:00 2007 11 30 20 0 0 2310
2007-11-30 21:00:00 2007 11 30 21 0 0 2400
2007-11-30 22:00:00 2007 11 30 22 0 0 2400
2007-11-30 23:00:00 2007 11 30 23 0 0 2270
2008-01-01 00:00:00 2008 1 1 0 0 0 2210
2008-01-01 01:00:00 2008 1 1 1 0 0 2190
Rename the minute column for convenience:
df.columns = ['yr', 'm', 'd', 'h', 'M', 's', 'sl']
Create a datetime column:
from datetime import datetime as dt
df['dt'] = df.apply(axis=1, func=lambda x: dt(x.yr, x.m, x.d, x.h, x.M, x.s))
Resample:
For pandas < 0.19:
df = df.set_index('dt').resample('60T').first().reset_index('dt')
For pandas >= 0.19 (resample returns a Resampler, so an aggregation such as first is needed):
df = df.resample('60T', on='dt').first()
You'd better first append a datetime column to your dataframe. pd.to_datetime can assemble it from component columns, but it only recognizes names like 'year', 'month', 'day', 'hour', 'minute', and 'second', so rename the columns first:
df.columns = ['year', 'month', 'day', 'hour', 'minute', 'second', 'sl']
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])
Then set the datetime column as the dataframe index:
df.set_index('datetime', inplace=True)
Now you can apply the resample method on your dataframe at the preferred sampling rate:
df.resample('60T').mean()
Here I used mean to aggregate; you can use another method based on your needs.
See the pandas documentation as a reference.
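Putting the pieces of these answers together, a minimal end-to-end sketch (my own, using a slice of the question's data and the column renaming above):

import io
import pandas as pd

raw = '''yr, m, d, h, m, s, sl
2007, 11, 30, 19, 0, 0, 2180
2008, 1, 1, 0, 0, 0, 2210
2008, 1, 1, 0, 5, 0, 2210
2008, 1, 1, 1, 0, 0, 2190
'''
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
df.columns = ['year', 'month', 'day', 'hour', 'minute', 'second', 'sl']
df.index = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])

# keep the on-the-hour reading for each hour; drop the empty hours in the
# gap between the 2007 and 2008 samples
hourly = df['sl'].resample('60T').first().dropna()
print(hourly)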
