I'm trying to extract a table from a webpage and have tried a number of alternatives, but the table always seems to remain empty.
The two attempts I thought were most promising are attached below. Any way of extracting the data from the webpage would be helpful. I have also included a screenshot of the table I want to extract.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome()
browser.set_window_size(1120, 550)
# Create an URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
browser.get(url)
element = WebDriverWait(browser, 3).until(
    EC.presence_of_element_located((By.ID, "tbl-datatable"))
)
data = element.get_attribute('tbl-datatable')
print(data)
browser.quit()
or alternatively,
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Create an URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
# Create object page
page = requests.get(url)
# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
soup
# Obtain information from tag <table>
table1 = soup.find("table", id='tbl-datatable')
table1
# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
# Create a dataframe
mydata = pd.DataFrame(columns = headers)
# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
The best first attempt at scraping table data is pandas.read_html(): it works in most cases, needs adjustments in some, and fails only in specific ones.
The issue here is that requests needs a User-Agent header to avoid the 403 response, so we have to help pandas with that:
pd.read_html(
    requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
                 headers={'User-Agent': 'some user agent string'}).text
)[0]
Now the table can be scraped, but it has to be transformed a bit, because that is what the browser does while rendering: .dropna(axis=1) drops columns containing NaN values, and [:-1] slices off the last row, which contains no relevant information:
pd.read_html(
    requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
                 headers={'User-Agent': 'some user agent string'}).text
)[0].dropna(axis=1)[:-1]
You could also use selenium, give it some time.sleep(3) while the browser renders the table in its final form, and then process driver.page_source, but in my opinion that is a bit too much in this case.
Example
import pandas as pd
import requests
df = pd.read_html(
    requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
                 headers={'User-Agent': 'some user agent string'}).text
)[0].dropna(axis=1)[:-1]
df.columns = ['DATE','FROM', 'TO', 'FLIGHT', 'FLIGHT TIME', 'STD', 'ATD', 'STA','STATUS']
df
Output
     DATE         FROM               TO                 FLIGHT  FLIGHT TIME  STD    ATD    STA    STATUS
0    10 Dec 2022  Tokunoshima (TKN)  Kagoshima (KOJ)    JL3798  —            10:00  —      11:10  Scheduled
1    10 Dec 2022  Amami (ASJ)        Tokunoshima (TKN)  JL3843  —            08:55  —      09:30  Scheduled
...  ...          ...                ...                ...     ...          ...    ...    ...    ...
58   03 Dec 2022  Amami (ASJ)        Kagoshima (KOJ)    JL3724  0:56         01:45  02:02  02:50  Landed 02:58
59   03 Dec 2022  Kagoshima (KOJ)    Amami (ASJ)        JL3725  1:06         00:00  00:09  01:15  Landed 01:14
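For readers who do not have requests installed, the same idea of sending an explicit User-Agent can be sketched with the standard library alone. This is only an illustration: the helper name build_request is hypothetical, the 'Mozilla/5.0' value is a placeholder, and the sketch builds the request without sending it, so it does not verify what the site accepts.

```python
from urllib.request import Request

URL = "https://www.flightradar24.com/data/aircraft/ja11jc"


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a GET request carrying an explicit User-Agent header.

    build_request is a hypothetical helper; the default user-agent
    string is a placeholder, not a value the site is known to require.
    """
    return Request(url, headers={"User-Agent": user_agent})


request = build_request(URL)

# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))

# To actually fetch the page, pass the object to urllib.request.urlopen(request).
```

The HTML returned by such a request could then be handed to pd.read_html() in the same way as the requests-based text above.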
Try this
import requests
import pandas as pd
response = requests.get('https://www.flightradar24.com/data/aircraft/ja11jc', headers={'User-agent': 'Mozilla/5.0'})
df = pd.read_html(response.text)[0][:-1]
df = df.dropna(axis=1, how='all')
print(df.to_string(index=False))
OUTPUT:
JL3798 10 Dec 2022 - Scheduled STD 10:00 ATD — STA 11:10 FROM Tokunoshima (TKN) TO Kagoshima (KOJ) 10 Dec 2022 Tokunoshima (TKN) Kagoshima (KOJ) JL3798 — 10:00 — 11:10 Scheduled Play
JL3843 10 Dec 2022 - Scheduled STD 08:55 ATD — STA 09:30 FROM Amami (ASJ) TO Tokunoshima (TKN) 10 Dec 2022 Amami (ASJ) Tokunoshima (TKN) JL3843 — 08:55 — 09:30 Scheduled Play
JL3844 10 Dec 2022 - Scheduled STD 07:45 ATD — STA 08:15 FROM Tokunoshima (TKN) TO Amami (ASJ) 10 Dec 2022 Tokunoshima (TKN) Amami (ASJ) JL3844 — 07:45 — 08:15 Scheduled Play
JL3710 10 Dec 2022 - Scheduled STD 06:45 ATD — STA 07:15 FROM Okierabu (OKE) TO Tokunoshima (TKN) 10 Dec 2022 Okierabu (OKE) Tokunoshima (TKN) JL3710 — 06:45 — 07:15 Scheduled Play
JL3715 10 Dec 2022 - Estimated departure 05:25 STD 05:25 ATD — STA 06:15 FROM Okinawa (OKA) TO Okierabu (OKE) 10 Dec 2022 Okinawa (OKA) Okierabu (OKE) JL3715 — 05:25 — 06:15 Estimated departure 05:25 Play
JL3716 10 Dec 2022 - Scheduled STD 03:55 ATD — STA 04:45 FROM Okierabu (OKE) TO Okinawa (OKA) 10 Dec 2022 Okierabu (OKE) Okinawa (OKA) JL3716 — 03:55 — 04:45 Scheduled Play
JL3711 10 Dec 2022 - Scheduled STD 02:55 ATD — STA 03:25 FROM Tokunoshima (TKN) TO Okierabu (OKE) 10 Dec 2022 Tokunoshima (TKN) Okierabu (OKE) JL3711 — 02:55 — 03:25 Scheduled Play
JL3841 10 Dec 2022 - Scheduled STD 01:50 ATD — STA 02:25 FROM Amami (ASJ) TO Tokunoshima (TKN) 10 Dec 2022 Amami (ASJ) Tokunoshima (TKN) JL3841 — 01:50 — 02:25 Scheduled Play
JL3842 10 Dec 2022 - Scheduled STD 00:35 ATD — STA 01:05 FROM Tokunoshima (TKN) TO Amami (ASJ) 10 Dec 2022 Tokunoshima (TKN) Amami (ASJ) JL3842 — 00:35 — 01:05 Scheduled Play
JL3791 09 Dec 2022 - Estimated departure 22:45 STD 22:45 ATD — STA 00:05 FROM Kagoshima (KOJ) TO Tokunoshima (TKN) 09 Dec 2022 Kagoshima (KOJ) Tokunoshima (TKN) JL3791 — 22:45 — 00:05 Estimated departure 22:45 Play
JL3734 09 Dec 2022 0:57 Landed 09:38 STD 08:40 ATD 08:41 STA 09:45 FROM Amami (ASJ) TO Kagoshima (KOJ) 09 Dec 2022 Amami (ASJ) Kagoshima (KOJ) JL3734 0:57 08:40 08:41 09:45 Landed 09:38 KML CSV Play
JL3836 09 Dec 2022 0:15 Landed 07:52 STD 07:40 ATD 07:37 STA 08:00 FROM Kikai (KKX) TO Amami (ASJ) 09 Dec 2022 Kikai (KKX) Amami (ASJ) JL3836 0:15 07:40 07:37 08:00 Landed 07:52 KML CSV Play
JL3837 09 Dec 2022 0:11 Landed 07:00 STD 06:50 ATD 06:49 STA 07:10 FROM Amami (ASJ) TO Kikai (KKX) 09 Dec 2022 Amami (ASJ) Kikai (KKX) JL3837 0:11 06:50 06:49 07:10 Landed 07:00 KML CSV Play
JL3867 09 Dec 2022 0:46 Landed 05:45 STD 04:50 ATD 04:59 STA 05:50 FROM Okinawa (OKA) TO Amami (ASJ) 09 Dec 2022 Okinawa (OKA) Amami (ASJ) JL3867 0:46 04:50 04:59 05:50 Landed 05:45 KML CSV Play
JL3866 09 Dec 2022 0:30 Landed 04:15 STD 03:25 ATD 03:45 STA 04:10 FROM Yoronjima (RNJ) TO Okinawa (OKA) 09 Dec 2022 Yoronjima (RNJ) Okinawa (OKA) JL3866 0:30 03:25 03:45 04:10 Landed 04:15 KML CSV Play
JL3861 09 Dec 2022 0:35 Landed 02:59 STD 02:10 ATD 02:24 STA 02:55 FROM Amami (ASJ) TO Yoronjima (RNJ) 09 Dec 2022 Amami (ASJ) Yoronjima (RNJ) JL3861 0:35 02:10 02:24 02:55 Landed 02:59 KML CSV Play
JL3803 09 Dec 2022 - Unknown STD 01:12 ATD 01:12 STA 01:28 FROM Kikai (KKX) TO Amami (ASJ) 09 Dec 2022 Kikai (KKX) Amami (ASJ) JL3803 — 01:12 01:12 01:28 Unknown KML CSV Play
JL3830 09 Dec 2022 - Estimated departure 01:12 STD 01:10 ATD — STA 01:30 FROM Kikai (KKX) TO Amami (ASJ) 09 Dec 2022 Kikai (KKX) Amami (ASJ) JL3830 — 01:10 — 01:30 Estimated departure 01:12 Play
JL3831 09 Dec 2022 0:12 Landed 00:30 STD 00:20 ATD 00:18 STA 00:40 FROM Amami (ASJ) TO Kikai (KKX) 09 Dec 2022 Amami (ASJ) Kikai (KKX) JL3831 0:12 00:20 00:18 00:40 Landed 00:30 KML CSV Play
JL3721 08 Dec 2022 0:59 Landed 23:34 STD 22:25 ATD 22:35 STA 23:40 FROM Kagoshima (KOJ) TO Amami (ASJ) 08 Dec 2022 Kagoshima (KOJ) Amami (ASJ) JL3721 0:59 22:25 22:35 23:40 Landed 23:34 KML CSV Play
JL3734 08 Dec 2022 0:55 Landed 09:35 STD 08:40 ATD 08:41 STA 09:45 FROM Amami (ASJ) TO Kagoshima (KOJ) 08 Dec 2022 Amami (ASJ) Kagoshima (KOJ) JL3734 0:55 08:40 08:41 09:45 Landed 09:35 KML CSV Play
JL3836 08 Dec 2022 0:09 Landed 07:51 STD 07:40 ATD 07:42 STA 08:00 FROM Kikai (KKX) TO Amami (ASJ) 08 Dec 2022 Kikai (KKX) Amami (ASJ) JL3836 0:09 07:40 07:42 08:00 Landed 07:51 KML CSV Play
JL3837 08 Dec 2022 0:09 Landed 07:03 STD 06:50 ATD 06:54 STA 07:10 FROM Amami (ASJ) TO Kikai (KKX) 08 Dec 2022 Amami (ASJ) Kikai (KKX) JL3837 0:09 06:50 06:54 07:10 Landed 07:03 KML CSV Play
JL3867 08 Dec 2022 0:48 Landed 05:41 STD 04:50 ATD 04:53 STA 05:50 FROM Okinawa (OKA) TO Amami (ASJ) 08 Dec 2022 Okinawa (OKA) Amami (ASJ) JL3867 0:48 04:50 04:53 05:50 Landed 05:41 KML CSV Play
JL3866 08 Dec 2022 0:25 Landed 04:01 STD 03:25 ATD 03:36 STA 04:10 FROM Yoronjima (RNJ) TO Okinawa (OKA) 08 Dec 2022 Yoronjima (RNJ) Okinawa (OKA) JL3866 0:25 03:25 03:36 04:10 Landed 04:01 KML CSV Play
JL3861 08 Dec 2022 0:32 Landed 02:55 STD 02:10 ATD 02:24 STA 02:55 FROM Amami (ASJ) TO Yoronjima (RNJ) 08 Dec 2022 Amami (ASJ) Yoronjima (RNJ) JL3861 0:32 02:10 02:24 02:55 Landed 02:55 KML CSV Play
JL3830 08 Dec 2022 0:09 Landed 01:20 STD 01:10 ATD 01:11 STA 01:30 FROM Kikai (KKX) TO Amami (ASJ) 08 Dec 2022 Kikai (KKX) Amami (ASJ) JL3830 0:09 01:10 01:11 01:30 Landed 01:20 KML CSV Play
JL3831 08 Dec 2022 0:08 Landed 00:28 STD 00:20 ATD 00:20 STA 00:40 FROM Amami (ASJ) TO Kikai (KKX) 08 Dec 2022 Amami (ASJ) Kikai (KKX) JL3831 0:08 00:20 00:20 00:40 Landed 00:28 KML CSV Play
JL3721 07 Dec 2022 0:57 Landed 23:29 STD 22:25 ATD 22:31 STA 23:40 FROM Kagoshima (KOJ) TO Amami (ASJ) 07 Dec 2022 Kagoshima (KOJ) Amami (ASJ) JL3721 0:57 22:25 22:31 23:40 Landed 23:29 KML CSV Play
JL3772 07 Dec 2022 0:30 Landed 09:32 STD 09:00 ATD 09:02 STA 09:30 FROM Tanegashima (TNE) TO Kagoshima (KOJ) 07 Dec 2022 Tanegashima (TNE) Kagoshima (KOJ) JL3772 0:30 09:00 09:02 09:30 Landed 09:32 KML CSV Play
JL3777 07 Dec 2022 0:27 Landed 08:25 STD 07:50 ATD 07:58 STA 08:30 FROM Kagoshima (KOJ) TO Tanegashima (TNE) 07 Dec 2022 Kagoshima (KOJ) Tanegashima (TNE) JL3777 0:27 07:50 07:58 08:30 Landed 08:25 KML CSV Play
JL3784 07 Dec 2022 0:55 Landed 07:13 STD 06:05 ATD 06:18 STA 07:10 FROM Kikai (KKX) TO Kagoshima (KOJ) 07 Dec 2022 Kikai (KKX) Kagoshima (KOJ) JL3784 0:55 06:05 06:18 07:10 Landed 07:13 KML CSV Play
JL3785 07 Dec 2022 1:00 Landed 05:32 STD 04:25 ATD 04:31 STA 05:35 FROM Kagoshima (KOJ) TO Kikai (KKX) 07 Dec 2022 Kagoshima (KOJ) Kikai (KKX) JL3785 1:00 04:25 04:31 05:35 Landed 05:32 KML CSV Play
JL3762 07 Dec 2022 0:26 Landed 03:37 STD 03:10 ATD 03:10 STA 03:45 FROM Tanegashima (TNE) TO Kagoshima (KOJ) 07 Dec 2022 Tanegashima (TNE) Kagoshima (KOJ) JL3762 0:26 03:10 03:10 03:45 Landed 03:37 KML CSV Play
JL3763 07 Dec 2022 0:24 Landed 02:32 STD 02:00 ATD 02:08 STA 02:40 FROM Kagoshima (KOJ) TO Tanegashima (TNE) 07 Dec 2022 Kagoshima (KOJ) Tanegashima (TNE) JL3763 0:24 02:00 02:08 02:40 Landed 02:32 KML CSV Play
JL3780 07 Dec 2022 0:51 Landed 01:13 STD 00:15 ATD 00:22 STA 01:20 FROM Kikai (KKX) TO Kagoshima (KOJ) 07 Dec 2022 Kikai (KKX) Kagoshima (KOJ) JL3780 0:51 00:15 00:22 01:20 Landed 01:13 KML CSV Play
JL3783 06 Dec 2022 1:00 Landed 23:45 STD 22:35 ATD 22:45 STA 23:45 FROM Kagoshima (KOJ) TO Kikai (KKX) 06 Dec 2022 Kagoshima (KOJ) Kikai (KKX) JL3783 1:00 22:35 22:45 23:45 Landed 23:45 KML CSV Play
JL3772 06 Dec 2022 0:25 Landed 09:15 STD 09:00 ATD 08:50 STA 09:30 FROM Tanegashima (TNE) TO Kagoshima (KOJ) 06 Dec 2022 Tanegashima (TNE) Kagoshima (KOJ) JL3772 0:25 09:00 08:50 09:30 Landed 09:15 KML CSV Play
JL3777 06 Dec 2022 0:25 Landed 08:15 STD 07:50 ATD 07:50 STA 08:30 FROM Kagoshima (KOJ) TO Tanegashima (TNE) 06 Dec 2022 Kagoshima (KOJ) Tanegashima (TNE) JL3777 0:25 07:50 07:50 08:30 Landed 08:15 KML CSV Play
JL3784 06 Dec 2022 0:55 Landed 07:09 STD 06:05 ATD 06:14 STA 07:10 FROM Kikai (KKX) TO Kagoshima (KOJ) 06 Dec 2022 Kikai (KKX) Kagoshima (KOJ) JL3784 0:55 06:05 06:14 07:10 Landed 07:09 KML CSV Play
JL3785 06 Dec 2022 0:59 Landed 05:34 STD 04:25 ATD 04:35 STA 05:35 FROM Kagoshima (KOJ) TO Kikai (KKX) 06 Dec 2022 Kagoshima (KOJ) Kikai (KKX) JL3785 0:59 04:25 04:35 05:35 Landed 05:34 KML CSV Play
JL3762 06 Dec 2022 0:22 Landed 03:48 STD 03:10 ATD 03:26 STA 03:45 FROM Tanegashima (TNE) TO Kagoshima (KOJ) 06 Dec 2022 Tanegashima (TNE) Kagoshima (KOJ) JL3762 0:22 03:10 03:26 03:45 Landed 03:48 KML CSV Play
JL3763 06 Dec 2022 0:24 Landed 02:47 STD 02:00 ATD 02:23 STA 02:40 FROM Kagoshima (KOJ) TO Tanegashima (TNE) 06 Dec 2022 Kagoshima (KOJ) Tanegashima (TNE) JL3763 0:24 02:00 02:23 02:40 Landed 02:47 KML CSV Play
JL3780 06 Dec 2022 0:54 Landed 01:38 STD 00:15 ATD 00:44 STA 01:20 FROM Kikai (KKX) TO Kagoshima (KOJ) 06 Dec 2022 Kikai (KKX) Kagoshima (KOJ) JL3780 0:54 00:15 00:44 01:20 Landed 01:38 KML CSV Play
JL3783 05 Dec 2022 1:04 Landed 23:45 STD 22:35 ATD 22:41 STA 23:45 FROM Kagoshima (KOJ) TO Kikai (KKX) 05 Dec 2022 Kagoshima (KOJ) Kikai (KKX) JL3783 1:04 22:35 22:41 23:45 Landed 23:45 KML CSV Play
JL3686 05 Dec 2022 0:53 Landed 12:09 STD 11:10 ATD 11:16 STA 12:10 FROM Matsuyama (MYJ) TO Kagoshima (KOJ) 05 Dec 2022 Matsuyama (MYJ) Kagoshima (KOJ) JL3686 0:53 11:10 11:16 12:10 Landed 12:09 KML CSV Play
JL3687 05 Dec 2022 0:42 Landed 10:44 STD 09:45 ATD 10:02 STA 10:40 FROM Kagoshima (KOJ) TO Matsuyama (MYJ) 05 Dec 2022 Kagoshima (KOJ) Matsuyama (MYJ) JL3687 0:42 09:45 10:02 10:40 Landed 10:44 KML CSV Play
JL3808 05 Dec 2022 1:07 Landed 09:22 STD 07:50 ATD 08:15 STA 09:05 FROM Okierabu (OKE) TO Kagoshima (KOJ) 05 Dec 2022 Okierabu (OKE) Kagoshima (KOJ) JL3808 1:07 07:50 08:15 09:05 Landed 09:22 KML CSV Play
JL3809 05 Dec 2022 1:28 Landed 07:31 STD 05:55 ATD 06:03 STA 07:20 FROM Kagoshima (KOJ) TO Okierabu (OKE) 05 Dec 2022 Kagoshima (KOJ) Okierabu (OKE) JL3809 1:28 05:55 06:03 07:20 Landed 07:31 KML CSV Play
JL3785 05 Dec 2022 - Canceled STD 04:25 ATD — STA 05:35 FROM Kagoshima (KOJ) TO Kikai (KKX) 05 Dec 2022 Kagoshima (KOJ) Kikai (KKX) JL3785 — 04:25 — 05:35 Canceled Play
JL3762 05 Dec 2022 0:25 Landed 03:49 STD 03:10 ATD 03:24 STA 03:45 FROM Tanegashima (TNE) TO Kagoshima (KOJ) 05 Dec 2022 Tanegashima (TNE) Kagoshima (KOJ) JL3762 0:25 03:10 03:24 03:45 Landed 03:49 KML CSV Play
JL3763 05 Dec 2022 0:28 Landed 02:48 STD 02:00 ATD 02:20 STA 02:40 FROM Kagoshima (KOJ) TO Tanegashima (TNE) 05 Dec 2022 Kagoshima (KOJ) Tanegashima (TNE) JL3763 0:28 02:00 02:20 02:40 Landed 02:48 KML CSV Play
JL3780 05 Dec 2022 0:51 Landed 01:34 STD 00:15 ATD 00:42 STA 01:20 FROM Kikai (KKX) TO Kagoshima (KOJ) 05 Dec 2022 Kikai (KKX) Kagoshima (KOJ) JL3780 0:51 00:15 00:42 01:20 Landed 01:34 KML CSV Play
JL3783 04 Dec 2022 1:03 Landed 23:45 STD 22:35 ATD 22:42 STA 23:45 FROM Kagoshima (KOJ) TO Kikai (KKX) 04 Dec 2022 Kagoshima (KOJ) Kikai (KKX) JL3783 1:03 22:35 22:42 23:45 Landed 23:45 KML CSV Play
JL3464 04 Dec 2022 0:57 Landed 08:10 STD 07:10 ATD 07:13 STA 08:15 FROM Amami (ASJ) TO Kagoshima (KOJ) 04 Dec 2022 Amami (ASJ) Kagoshima (KOJ) JL3464 0:57 07:10 07:13 08:15 Landed 08:10 KML CSV Play
JL3465 04 Dec 2022 1:05 Landed 06:31 STD 05:25 ATD 05:26 STA 06:40 FROM Kagoshima (KOJ) TO Amami (ASJ) 04 Dec 2022 Kagoshima (KOJ) Amami (ASJ) JL3465 1:05 05:25 05:26 06:40 Landed 06:31 KML CSV Play
JL3724 04 Dec 2022 0:53 Landed 02:43 STD 01:45 ATD 01:50 STA 02:50 FROM Amami (ASJ) TO Kagoshima (KOJ) 04 Dec 2022 Amami (ASJ) Kagoshima (KOJ) JL3724 0:53 01:45 01:50 02:50 Landed 02:43 KML CSV Play
JL3725 04 Dec 2022 0:56 Landed 01:02 STD 00:00 ATD 00:05 STA 01:15 FROM Kagoshima (KOJ) TO Amami (ASJ) 04 Dec 2022 Kagoshima (KOJ) Amami (ASJ) JL3725 0:56 00:00 00:05 01:15 Landed 01:02 KML CSV Play
JL3724 03 Dec 2022 0:56 Landed 02:58 STD 01:45 ATD 02:02 STA 02:50 FROM Amami (ASJ) TO Kagoshima (KOJ) 03 Dec 2022 Amami (ASJ) Kagoshima (KOJ) JL3724 0:56 01:45 02:02 02:50 Landed 02:58 KML CSV Play
JL3725 03 Dec 2022 1:06 Landed 01:14 STD 00:00 ATD 00:09 STA 01:15 FROM Kagoshima (KOJ) TO Amami (ASJ) 03 Dec 2022 Kagoshima (KOJ) Amami (ASJ) JL3725 1:06 00:00 00:09 01:15 Landed 01:14 KML CSV Play
Related
I have a web page which I can access from my server. The contents of the web page are as below.
xys.server.com - /xys/reports/
[To Parent Directory]
3/4/2021 6:09 AM <dir> All_Master
3/4/2021 6:09 AM <dir> Hartland
3/4/2021 6:09 AM <dir> Hauppauge
3/4/2021 6:09 AM <dir> Hazelwood
2/15/2019 7:41 AM 58224 NetBackup Retention and Full Backup Occupancy.xlsx
1/1/2022 11:00 AM 23959 OpsCenter_All_Master_Server_Backup_Report_01_01_2022_10_00_45_259_AM_49.zip
2/1/2022 11:00 AM 18989 OpsCenter_All_Master_Server_Backup_Report_01_02_2022_10_00_04_813_AM_4.zip
3/1/2022 11:00 AM 18969 OpsCenter_All_Master_Server_Backup_Report_01_03_2022_10_00_24_664_AM_17.zip
4/1/2021 10:00 AM 21709 OpsCenter_All_Master_Server_Backup_Report_01_04_2021_10_00_02_266_AM_31.zip
5/1/2021 10:00 AM 27491 OpsCenter_All_Master_Server_Backup_Report_01_05_2021_10_00_27_655_AM_11.zip
6/1/2021 10:00 AM 21260 OpsCenter_All_Master_Server_Backup_Report_01_06_2021_10_00_54_053_AM_19.zip
7/1/2021 10:00 AM 19898 OpsCenter_All_Master_Server_Backup_Report_01_07_2021_10_00_12_544_AM_42.zip
8/1/2021 10:00 AM 22642 OpsCenter_All_Master_Server_Backup_Report_01_08_2021_10_00_28_384_AM_25.zip
9/1/2021 10:00 AM 19426 OpsCenter_All_Master_Server_Backup_Report_01_09_2021_10_00_43_851_AM_70.zip
10/1/2021 10:01 AM 19149 OpsCenter_All_Master_Server_Backup_Report_01_10_2021_10_01_00_422_AM_7.zip
11/1/2021 10:00 AM 19638 OpsCenter_All_Master_Server_Backup_Report_01_11_2021_10_00_15_326_AM_20.zip
12/1/2021 11:00 AM 19375 OpsCenter_All_Master_Server_Backup_Report_01_12_2021_10_00_29_943_AM_13.zip
1/2/2022 11:00 AM 22281 OpsCenter_All_Master_Server_Backup_Report_02_01_2022_10_00_45_803_AM_37.zip
2/2/2022 11:00 AM 19435 OpsCenter_All_Master_Server_Backup_Report_02_02_2022_10_00_05_577_AM_71.zip
3/2/2022 11:00 AM 19380 OpsCenter_All_Master_Server_Backup_Report_02_03_2022_10_00_24_973_AM_90.zip
4/2/2021 10:00 AM 21411 OpsCenter_All_Master_Server_Backup_Report_02_04_2021_10_00_03_069_AM_56.zip
Now I need to get the contents of this page in a structured format. I am using the requests module, but the data is highly unstructured and difficult to parse. The code is below:
req = requests.get(url)
print (req.content.decode('utf-8'))
Output is like :
<pre>[To Parent Directory]<br><br> 3/4/2021 6:09 AM <dir> All_Master<br> 3/4/2021 6:09 AM <dir> Hartland<br> 3/4/2021 6:09 AM <dir> Hauppauge<br> 3/4/2021 6:09 AM <dir> Hazelwood<br> 2/15/2019 7:41 AM 58224 NetBackup Retention and Full Backup Occupancy.xlsx<br> 1/1/2022 11:00 AM 23959 OpsCenter_All_Master_Server_Backup_Report_01_01_2022_10_00_45_259_AM_49.zip<br> 2/1/2022 11:00 AM 18989 OpsCenter_All_Master_Server_Backup_Report_01_02_2022_10_00_04_813_AM_4.zip<br> 3/1/2022 11:00 AM 18969 OpsCenter_All_Master_Server_Backup_Report_01_03_2022_10_00_24_664_AM_17.zip<br> 4/1/2021 10:00 AM 21709 OpsCenter_All_Master_Server_Backup_Report_01_04_2021_10_00_02_266_AM_31.zip<br> 5/1/2021 10:00 AM 27491 OpsCenter_All_Master_Server_Backup_Report_01_05_2021_10_00_27_655_AM_11.zip<br> 6/1/2021 10:00 AM 21260 OpsCenter_All_Master_Server_Backup_Report_01_06_2021_10_00_54_053_AM_19.zip<br> 7/1/2021 10:00 AM 19898 OpsCenter_All_Master_Server_Backup_Report_01_07_2021_10_00_12_544_AM_42.zip<br> 8/1/2021 10:00 AM 22642 OpsCenter_All_Master_Server_Backup_Report_01_08_2021_10_00_28_384_AM_25.zip<br> 9/1/2021 10:00 AM 19426 OpsCenter_All_Master_Server_Backup_Report_01_09_2021_10_00_43_851_AM_70.zip<br> 10/1/2021 10:01 AM 19149 OpsCenter_All_Master_Server_Backup_Report_01_10_2021_10_01_00_422_AM_7.zip<br> 11/1/2021 10:00 AM 19638 OpsCenter_All_Master_Server_Backup_Report_01_11_2021_10_00_15_326_AM_20.zip<br> 12/1/2021 11:00 AM 19375 OpsCenter_All_Master_Server_Backup_Report_01_12_2021_10_00_29_943_AM_13.zip<br> 1/2/2022 11:00 AM 22281 OpsCenter_All_Master_Server_Backup_Report_02_01_2022_10_00_45_803_AM_37.zip<br> 2/2/2022 11:00 AM 19435 OpsCenter_All_Master_Server_Backup_Report_02_02_2022_10_00_05_577_AM_71.zip<br> 3/2/2022 11:00 AM 19380 OpsCenter_All_Master_Server_Backup_Report_02_03_2022_10_00_24_973_AM_90.zip<br> 4/2/2021 10:00 AM 21411 OpsCenter_All_Master_Server_Backup_Report_02_04_2021_10_00_03_069_AM_56.zip<br> 5/2/2021 10:00 AM 24191 
OpsCenter_All_Master_Server_Backup_Report_02_05_2021_10_00_28_556_AM_14.zip<br> 6/2/2021 10:00 AM 21675 OpsCenter_All_Master_Server_Backup_Report_02_06_2021_10_00_54_962_AM_73.zip<br> 7/2/2021 10:00 AM 19954 OpsCenter_All_Master_Server_Backup_Report_02_07_2021_10_00_13_058_AM_31.zip<br> 8/2/2021 10:00 AM 21085 OpsCenter_All_Master_Server_Backup_Report_02_08_2021_10_00_28_778_AM_79.zip<br> 9/2/2021 10:00 AM 19691 OpsCenter_All_Master_Server_Backup_Report_02_09_2021_10_00_44_294_AM_5.zip<br> 10/2/2021 10:01 AM 23477 OpsCenter_All_Master_Server_Backup_Report_02_10_2021_10_01_00_793_AM_9.zip<br> 11/2/2021 10:00 AM 2
This is very unstructured.
Kindly suggest a way to make this content more readable so it is easy to parse the data...
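One possible approach, sketched under assumptions: split the response on the <br> separators and match each line with a regular expression. The sample string, the ENTRY pattern, and the parse_listing helper are all hypothetical and assume the text looks exactly as printed above (month/day/year dates, a 12-hour clock, and either <dir> or a byte count before the name). If the real response contains anchor tags or escaped entities, those would need to be stripped first.

```python
import re
from datetime import datetime

# Hypothetical sample copied from the listing above; nothing is fetched here.
RAW = (
    "<pre>[To Parent Directory]<br><br>"
    " 3/4/2021 6:09 AM <dir> All_Master<br>"
    " 2/15/2019 7:41 AM 58224 NetBackup Retention and Full Backup Occupancy.xlsx<br>"
    " 1/1/2022 11:00 AM 23959 "
    "OpsCenter_All_Master_Server_Backup_Report_01_01_2022_10_00_45_259_AM_49.zip<br>"
    "</pre>"
)

# Date, time, then "<dir>" or a byte count, then the name (which may contain spaces).
ENTRY = re.compile(
    r"^\s*(?P<date>\d{1,2}/\d{1,2}/\d{4})\s+"
    r"(?P<time>\d{1,2}:\d{2}\s+[AP]M)\s+"
    r"(?P<size><dir>|\d+)\s+"
    r"(?P<name>.+?)\s*$"
)


def parse_listing(html):
    """Return one dict per listing entry; chunks that do not match are skipped."""
    entries = []
    for chunk in re.split(r"<br\s*/?>", html, flags=re.IGNORECASE):
        match = ENTRY.match(chunk)
        if match is None:
            continue  # e.g. the "[To Parent Directory]" line or the closing tag
        is_dir = match["size"] == "<dir>"
        entries.append({
            "modified": datetime.strptime(
                f"{match['date']} {match['time']}", "%m/%d/%Y %I:%M %p"
            ),
            "is_dir": is_dir,
            "size": None if is_dir else int(match["size"]),
            "name": match["name"],
        })
    return entries


entries = parse_listing(RAW)
for entry in entries:
    print(entry)
```

The resulting list of dicts can be passed to pandas.DataFrame(entries) if a table is wanted.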
I have a list:
['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
How can I get this result:
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25',
'Mon Oct 25', 'Mon Oct 25']
We could use re.sub here for a regex-based approach:
import re
inp = ['Sun Oct 24 10:31:10 +0000 2021', 'Sun Oct 24 10:45:02 +0000 2021', 'Mon Oct 25 04:13:27 +0000 2021', 'Mon Oct 25 04:26:20 +0000 2021', 'Mon Oct 25 04:32:32 +0000 2021', 'Mon Oct 25 04:56:39 +0000 2021', 'Mon Oct 25 05:21:21 +0000 2021', 'Mon Oct 25 06:46:27 +0000 2021', 'Mon Oct 25 08:59:13 +0000 2021']
output = [re.sub(r'\s+\d{2}:.*$', '', x) for x in inp]
print(output)
# ['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25',
#  'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
If the dates have a fixed format, you can just take the first 10 characters of each date string:
dates = ['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
trunc_dates = [
date[:10] for date in dates
]
print(trunc_dates)
Output
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
You can also use a more reliable solution: parse each date with dateutil, then format it:
from dateutil import parser
dates = ['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
trunc_dates = [
parser.parse(date).strftime('%a %b %d')
for date in dates
]
print(trunc_dates)
Output
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
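If installing dateutil is not an option, the same parse-then-format idea can be sketched with the standard library. The format string below is an assumption that every input follows the exact layout shown in the question; strptime raises ValueError for anything else. The list is shortened for brevity.

```python
from datetime import datetime

dates = [
    'Sun Oct 24 10:31:10 +0000 2021',
    'Mon Oct 25 04:13:27 +0000 2021',
    'Mon Oct 25 08:59:13 +0000 2021',
]

# Assumed layout: weekday, month, day, time, UTC offset, year.
FORMAT = '%a %b %d %H:%M:%S %z %Y'

trunc_dates = [
    datetime.strptime(date, FORMAT).strftime('%a %b %d')
    for date in dates
]
print(trunc_dates)
```

Compared with slicing the first 10 characters, this fails loudly on malformed input instead of silently returning a wrong prefix.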
I have the following pandas dataframe that was converted to string with to_string().
It was printed like this:
S T Q U X A D
02:36 06:00 06:00 06:00 06:30 09:46 07:56
02:37 06:10 06:15 06:15 06:40 09:48 08:00
12:00 11:00 12:00 12:00 07:43 12:00 18:03
13:15 13:00 13:15 13:15 07:50 13:15 18:08
14:00 14:00 14:00 14:00 14:00 19:00
15:15 15:00 14:15 15:15 15:15 19:05
16:15 16:00 15:15 16:15 16:15 20:15
17:15 17:00 17:15 17:15 17:15 20:17
18:15 21:22 21:19 19:55 18:15 20:18
19:15 21:24 21:21 19:58 19:15 20:19
The gaps are due to empty values in the dataframe. I would like to keep the column alignment, perhaps by replacing the empty values with tabs. I would also like to center align the header line.
This wasn't printed in a terminal, but was sent over telegram with the requests post command. I think though, it is just a print formatting problem, independent of the telegram requests library.
The desired output would be like this:
S T Q U X A D
02:36 06:00 06:00 06:00 06:30 09:46 07:56
02:37 06:10 06:15 06:15 06:40 09:48 08:00
12:00 11:00 12:00 12:00 07:43 12:00 18:03
13:15 13:00 13:15 13:15 07:50 13:15 18:08
14:00 14:00 14:00 14:00 14:00 19:00
15:15 15:00 14:15 15:15 15:15 19:05
16:15 16:00 15:15 16:15 16:15 20:15
17:15 17:00 17:15 17:15 17:15 20:17
18:15 21:22 21:19 19:55 18:15 20:18
19:15 21:24 21:21 19:58 19:15 20:19
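You can use the dataframe's style.set_properties to set some of these options (note that Styler properties are CSS, so they apply to HTML rendering rather than to the to_string() output), for example:
df.style.set_properties(**{'text-align': 'center'})
Read more here: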
https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.set_properties.html
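For plain-text output, one alternative is to pad every cell to a fixed width so that empty values still occupy their column. The sketch below is an assumption-laden illustration, not the answer's method: the sample rows, the WIDTH constant, and the format_table helper are hypothetical, and the alignment only holds when the receiving client displays the text in a monospaced font.

```python
# Hypothetical sample: the empty string marks a missing value in column X.
header = ["S", "T", "Q", "U", "X", "A", "D"]
rows = [
    ["02:36", "06:00", "06:00", "06:00", "06:30", "09:46", "07:56"],
    ["14:00", "14:00", "14:00", "14:00", "", "14:00", "19:00"],
]

WIDTH = 5  # every value in this sample is a 5-character HH:MM string


def format_table(header, rows, width=WIDTH, sep=" "):
    """Center the header names and pad every cell to the same width."""
    lines = [sep.join(name.center(width) for name in header)]
    for row in rows:
        # ljust pads empty cells with spaces so later columns stay in place.
        lines.append(sep.join(cell.ljust(width) for cell in row))
    return "\n".join(lines)


text = format_table(header, rows)
print(text)
```

With a real dataframe, the rows could be produced by something like df.fillna("").astype(str).values.tolist() before formatting.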
I want to copy the table (id=symbolMarket) and save it as a pandas dataframe in this link https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data
How should I do it in the simple/beautiful way please?
Obviously I can retrieve the element one by one, but I believe there is a better way.
(I am using selenium to access the page, if this helps)
Many thanks for sharing knowledge with me
I was pretty hesitant to post this since it is pretty basic, and there is an abundance of solutions that show how to read a html table into pandas dataframe. Makes me wonder if you even attempted to look it up first.
But just use pd.read_html(). It returns a list of dataframes, so you only have to figure out which dataframe in that list you want:
import pandas as pd
url = 'https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data'
tables = pd.read_html(url)
table = tables[3]
print (table)
Output:
0 1 ... 5 6
0 Date Open ... Change (Pips) Change (%)
1 Mar 20, 2019 21:00 25737 ... +253.0 +0.97%
2 Mar 19, 2019 21:00 25871 ... -135.0 -0.52%
3 Mar 18, 2019 21:00 25935 ... -63.0 -0.24%
4 Mar 17, 2019 21:00 25864 ... +70.0 +0.27%
5 Mar 16, 2019 21:00 25864 ... -20.0 -0.08%
6 Mar 14, 2019 21:00 25716 ... +153.0 +0.59%
7 Mar 13, 2019 21:00 25756 ... -40.0 -0.16%
8 Mar 12, 2019 21:00 25575 ... +185.0 +0.72%
9 Mar 11, 2019 21:00 25686 ... -93.0 -0.36%
10 Mar 10, 2019 21:00 25470 ... +212.0 +0.83%
11 Mar 09, 2019 21:00 25470 ... -29.0 -0.11%
12 Mar 07, 2019 21:00 25459 ... +61.0 +0.24%
13 Mar 06, 2019 21:00 25673 ... -197.0 -0.77%
14 Mar 05, 2019 21:00 25786 ... -108.0 -0.42%
15 Mar 04, 2019 21:00 25805 ... +3.0 +0.01%
16 Mar 03, 2019 21:00 26114 ... -300.0 -1.16%
17 Feb 28, 2019 21:00 25911 ... +138.0 +0.53%
18 Feb 27, 2019 21:00 26018 ... -89.0 -0.34%
19 Feb 26, 2019 21:00 26005 ... +31.0 +0.12%
20 Feb 25, 2019 21:00 26093 ... -63.0 -0.24%
21 Feb 24, 2019 21:00 26094 ... -3.0 -0.01%
22 Feb 21, 2019 21:00 25825 ... +210.0 +0.81%
23 Feb 20, 2019 21:00 25962 ... -120.0 -0.46%
24 Feb 19, 2019 21:00 25877 ... +88.0 +0.34%
25 Feb 18, 2019 21:00 25894 ... -9.0 -0.03%
26 Feb 17, 2019 21:00 25905 ... +5.0 +0.02%
27 Feb 14, 2019 21:00 25404 ... +500.0 +1.93%
28 Feb 13, 2019 21:00 25483 ... -68.0 -0.27%
29 Feb 12, 2019 21:00 25418 ... +102.0 +0.40%
.. ... ... ... ... ...
71 Dec 11, 2018 21:00 24341 ... +208.0 +0.85%
72 Dec 10, 2018 21:00 24490 ... -152.0 -0.62%
73 Dec 09, 2018 21:00 24338 ... +144.0 +0.59%
74 Dec 06, 2018 21:00 24921 ... -517.0 -2.12%
75 Dec 05, 2018 21:00 25118 ... -189.0 -0.76%
76 Dec 04, 2018 21:00 25033 ... +134.0 +0.53%
77 Dec 03, 2018 21:00 25837 ... -798.0 -3.19%
78 Dec 02, 2018 21:00 25897 ... -55.0 -0.21%
79 Nov 29, 2018 21:00 25367 ... +220.0 +0.86%
80 Nov 28, 2018 21:00 25327 ... +62.0 +0.24%
81 Nov 27, 2018 21:00 24794 ... +568.0 +2.24%
82 Nov 26, 2018 21:00 24546 ... +253.0 +1.02%
83 Nov 25, 2018 21:00 24300 ... +230.0 +0.94%
84 Nov 22, 2018 21:00 24367 ... -80.0 -0.33%
85 Nov 21, 2018 21:00 24497 ... -144.0 -0.59%
86 Nov 20, 2018 21:00 24461 ... +38.0 +0.16%
87 Nov 19, 2018 21:00 25063 ... -604.0 -2.47%
88 Nov 18, 2018 21:00 25410 ... -342.0 -1.36%
89 Nov 15, 2018 21:00 25335 ... +135.0 +0.53%
90 Nov 14, 2018 21:00 25085 ... +256.0 +1.01%
91 Nov 13, 2018 21:00 25378 ... -273.0 -1.09%
92 Nov 12, 2018 21:00 25422 ... -65.0 -0.26%
93 Nov 11, 2018 21:00 25987 ... -577.0 -2.27%
94 Nov 08, 2018 21:00 26184 ... -202.0 -0.78%
95 Nov 07, 2018 21:00 26190 ... +15.0 +0.06%
96 Nov 06, 2018 21:00 25663 ... +572.0 +2.18%
97 Nov 05, 2018 21:00 25481 ... +200.0 +0.78%
98 Nov 04, 2018 21:00 25267 ... +221.0 +0.87%
99 Nov 01, 2018 21:00 25240 ... +40.0 +0.16%
100 Oct 31, 2018 21:00 25090 ... +229.0 +0.90%
[101 rows x 7 columns]
I have a dataframe with a timestamp column and I am applying a lambda function to that column. When I do that, I get the following error:
row['date'] = pd.Timestamp(row['date']).apply(lambda t: t.replace(minute=15*(t.minute//15)).strftime('%H:%M'))
AttributeError: 'Timestamp' object has no attribute 'apply'
How can I do that in pandas?
example: output:
05:06 05:00
05:20 05:15
09:18 09:15
10:03 10:00
It seems you need to_datetime to convert the column to datetimes instead of Timestamp, which converts only a scalar:
row['date'] = (pd.to_datetime(row['date'])
               .apply(lambda t: t.replace(minute=15*(t.minute//15)))
               .dt.strftime('%H:%M'))
EDIT:
print (df)
a b
0 05:06 05:00
1 05:20 05:15
2 09:18 09:15
3 10:03 10:00
df['date'] = (pd.to_datetime(df['a'])
              .apply(lambda t: t.replace(minute=15*(t.minute//15)))
              .dt.strftime('%H:%M'))
print (df)
a b date
0 05:06 05:00 05:00
1 05:20 05:15 05:15
2 09:18 09:15 09:15
3 10:03 10:00 10:00
Another solution, but with different output (it rounds to the nearest 15 minutes instead of rounding down):
df['date'] = pd.to_datetime(df['a']).dt.round('15min').dt.strftime('%H:%M')
For checking output you can use:
L = ['5:' + str(x).zfill(2) for x in range(60)]
df = pd.DataFrame({'a':L})
#print (df)
df['date1'] = pd.to_datetime(df['a']).dt.round('15min').dt.strftime('%H:%M')
df['date'] = (pd.to_datetime(df['a'])
              .apply(lambda t: t.replace(minute=15*(t.minute//15)))
              .dt.strftime('%H:%M'))
print (df)
a date1 date
0 5:00 05:00 05:00
1 5:01 05:00 05:00
2 5:02 05:00 05:00
3 5:03 05:00 05:00
4 5:04 05:00 05:00
5 5:05 05:00 05:00
6 5:06 05:00 05:00
7 5:07 05:00 05:00
8 5:08 05:15 05:00
9 5:09 05:15 05:00
10 5:10 05:15 05:00
11 5:11 05:15 05:00
12 5:12 05:15 05:00
13 5:13 05:15 05:00
14 5:14 05:15 05:00
15 5:15 05:15 05:15
16 5:16 05:15 05:15
17 5:17 05:15 05:15
18 5:18 05:15 05:15
19 5:19 05:15 05:15
20 5:20 05:15 05:15
21 5:21 05:15 05:15
22 5:22 05:15 05:15
23 5:23 05:30 05:15
24 5:24 05:30 05:15
25 5:25 05:30 05:15
26 5:26 05:30 05:15
27 5:27 05:30 05:15
28 5:28 05:30 05:15
29 5:29 05:30 05:15
30 5:30 05:30 05:30
31 5:31 05:30 05:30
32 5:32 05:30 05:30
33 5:33 05:30 05:30
34 5:34 05:30 05:30
35 5:35 05:30 05:30
36 5:36 05:30 05:30
37 5:37 05:30 05:30
38 5:38 05:45 05:30
39 5:39 05:45 05:30
40 5:40 05:45 05:30
41 5:41 05:45 05:30
42 5:42 05:45 05:30
43 5:43 05:45 05:30
44 5:44 05:45 05:30
45 5:45 05:45 05:45
46 5:46 05:45 05:45
47 5:47 05:45 05:45
48 5:48 05:45 05:45
49 5:49 05:45 05:45
50 5:50 05:45 05:45
51 5:51 05:45 05:45
52 5:52 05:45 05:45
53 5:53 06:00 05:45
54 5:54 06:00 05:45
55 5:55 06:00 05:45
56 5:56 06:00 05:45
57 5:57 06:00 05:45
58 5:58 06:00 05:45
59 5:59 06:00 05:45
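The rounding-down arithmetic used in the lambda can be checked outside pandas with a small standard-library sketch. The helper name floor_to_quarter_hour and the sample values are illustrative only. Within pandas itself, Series.dt.floor('15min') is a vectorized way to get the same rounding-down behaviour.

```python
from datetime import datetime


def floor_to_quarter_hour(text):
    """Floor an 'H:MM' or 'HH:MM' string to the previous 15-minute mark."""
    t = datetime.strptime(text, "%H:%M")
    # Integer division drops the remainder, so 20 // 15 * 15 gives 15.
    floored = t.replace(minute=15 * (t.minute // 15))
    return floored.strftime("%H:%M")


samples = ["05:06", "05:20", "09:18", "10:03", "5:59"]
floored_samples = [floor_to_quarter_hour(s) for s in samples]
print(floored_samples)
```

The first four inputs are the ones from the question, and the results match the requested output column.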