POST request with JSESSIONID - Python

So I need to scrape this site, but everything is dynamic. I can't append the query param I need to the URL, so I have to send it with a POST request. I extracted the headers and the payload, but something breaks along the way and I get the results of the starting page, not the page for the POST request I sent. Also, the JSESSIONID I get at the end is not the same as the one I sent in the headers. Here is my code:
# post_URL = "https://lekovi.zdravstvo.gov.mk/drugsregister.searchform"
import requests

session = requests.Session()
cookie = session.get(URL).cookies.get("JSESSIONID")
print(cookie)
headers = {
"Accept": "text/javascript, text/html, application/xml, text/xml, */*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-MK,en-US;q=0.9,en-GB;q=0.8,en;q=0.7,mk;q=0.6",
"Connection": "keep-alive",
"Content-Length": "819",
"Content-type": "application/x-www-form-urlencoded; charset=UTF-8",
"Cookie": f"SERVERID=APPC_L2; JSESSIONID={cookie}",
"Host": "lekovi.zdravstvo.gov.mk",
"Origin": "https://lekovi.zdravstvo.gov.mk",
"Referer": "https://lekovi.zdravstvo.gov.mk/drugsregister/overview",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"Sec-GPC": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36",
"X-Prototype-Version": "1.7",
"X-Requested-With": "XMLHttpReques",
}
payload = {
"t:ac": "overview",
"t:submit": ["submit_3","submit_0"],
"t:formdata": "Db5ytL52OazQLgFwZVqY/TPR99w=:H4sIAAAAAAAAAJVSu0oDQRS9BoRAGhF8NWLho1sDmkYbY0QQYiIGa5md3F1HdmfWO7N5NFb+hI1fIFb6BRZ2/oMfYGNhZeFMshrFR7TaYc89c86dcy4fYbS9BAtblIZ6H0OhDdJaICL78bssSUi1WMQZkUDSBCVFoccSxo/QMyxBbahb8rgijITv+UyjV/btT8bNtsCoOd9AkyYLB7eFh4m7lxyMVKHAlTSkohqL0cB49Zi12HLEZLjcMCRkuN5JDEz1LWx2y5mFSt/Cf7yW/+t1jxRHrRupHwuthZK3V83V4PniPgfQSdpzMPtZOmYyDSw7JSRpt9EncApgYOwj4NYcTnXM0a9jDpJp7CMp4qo5UHBArQfUqWKB4dQfFEKUSIK7aUXM8HeFDHD261Q2fDi1r7AI898HFFsTKrAPmzLJ3zeZfAt618L1YCeD/3pNX3MaJj8PaxehOSzaFmz82gKu4kRJlEZ7vdjN1xKcN55mbq7PKjnIVSHPI2Gnd5pO2JUZI7TvbGpZhiOuvPlMfmVwLL4CyWl1EGkDAAA=",
"filterByApprovalCarrier": "",
"manufacturerName": "",
"nameNumberOrCode": "paracetamol", # This is the thing i'm searching for, its an input field
"genericNameOrAtc": "",
"filterByModeOfIssuance": "",
"t:zoneid": "gridZone,"
}
d = session.post(URL, headers=headers, data=payload)
print(d.cookies.get("JSESSIONID"))```

To get the data from the server you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://lekovi.zdravstvo.gov.mk/drugsregister/overview"
api_url = "https://lekovi.zdravstvo.gov.mk/drugsregister.searchform"
data = {
"t:ac": "overview",
"t:submit": '["submit_3","submit_0"]',
"t:formdata": "",
"filterByApprovalCarrier": "",
"manufacturerName": "",
"nameNumberOrCode": "paracetamol",
"genericNameOrAtc": "",
"filterByModeOfIssuance": "",
"t:zoneid": "gridZone",
}
headers = {"X-Requested-With": "XMLHttpRequest"}
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data["t:formdata"] = soup.select_one('#searchForm [name="t:formdata"]')["value"]
# parse the returned data with BeautifulSoup or with pandas:
# soup = BeautifulSoup(
# requests.post(api_url, data=data, headers=headers).json()["zones"][
# "gridZone"
# ],
# "html.parser",
# )
df = pd.read_html(
requests.post(api_url, data=data, headers=headers).json()["zones"][
"gridZone"
]
)[0]
print(df)
Prints:
# Латинично име Генеричко име Јачина Пакување Фармацевтска форма Начин на издавање Производител Носител на одобрение Број на решение Датум на решение Датум на важност Датум на обнова Големопродажна цена без ДДВ Малопродажна цена со ДДВ Варијации Г/О/БС
0 1 IBUPROFEN/PARACETAMOL ALKALOID ibuprofen, paracetamol 200 mg/500 mg 10 таблети (блистер 1 х 10)/кутија филм-обложена таблета BRp АЛКАЛОИД АД Скопје - Фармацевтска, Хемиска, Козметичка индустрија, Скопје, Р. Северна Македонија АЛКАЛОИД АД СКОПЈЕ- фармацевтска, хемиска, козметичка индустрија-Скопје, Република Северна Македонија 11-4851/2 02.10.2019 30.09.2024 NaN 0 0 NaN Г
1 2 IBUPROFEN/PARACETAMOL ALKALOID ibuprofen, paracetamol 200 mg/500 mg 20 таблети (блистер 2 х 10)/кутија филм-обложена таблета BRp АЛКАЛОИД АД Скопје - Фармацевтска, Хемиска, Козметичка индустрија, Скопје, Р. Северна Македонија АЛКАЛОИД АД СКОПЈЕ- фармацевтска, хемиска, козметичка индустрија-Скопје, Република Северна Македонија 11-5026/2 02.10.2019 30.09.2024 NaN 0 0 NaN Г
2 3 PARACETAMOL paracetamol 120 mg/5 ml темно стаклено шише х 100 ml + пластична лажичка/кутија сируп BRp РЕПЛЕК ФАРМ ДООЕЛ СКОПЈЕ, Скопје, Р. Северна Македонија РЕПЛЕК ФАРМ ДООЕЛ СКОПЈЕ 11-1862/6 19.07.2016 NaN NaN 0 0 jQuery('#showVarriationsModal_102b50fe87d6c80 .close').click(); Г
3 4 PARACETAMOL paracetamol 500 mg 500 таблети (блистер 50 х 10)/кутија таблета BRp РЕПЛЕК ФАРМ ДООЕЛ СКОПЈЕ, Скопје, Р. Северна Македонија РЕПЛЕК ФАРМ ДООЕЛ СКОПЈЕ 11-3159/4 18.04.2018 NaN NaN 0 0 NaN Г
4 5 PARACETAMOL paracetamol 300 mg 10 супозитории (2 x 5 PVC/PE алвеоли)/кутија супозиторија BRp РЕПЛЕК ФАРМ ДООЕЛ СКОПЈЕ, Скопје, Р. Северна Македонија РЕПЛЕК ФАРМ ДООЕЛ СКОПЈЕ 11-4694/5 12.04.2018 NaN NaN 0 0 NaN Г
5 6 PARACETAMOL paracetamol 500 mg 20 таблети (блистер 2 х 10)/кутија таблета BRp GALENIKA AD, Белград, Србија ГАЛЕНИКА ДООЕЛ Скопје 11-5776/2 03.12.2020 25.09.2024 NaN 0 0 NaN Г
6 7 PARACETAMOL paracetamol 120 mg/5 ml темно стаклено шише х 100 ml + пластично лажиче/кутија сируп BRp GALENIKA AD, Белград, Србија ГАЛЕНИКА ДООЕЛ Скопје 11-5777/2 03.12.2020 25.09.2024 NaN 0 0 NaN Г
7 8 PARACETAMOL paracetamol 125 mg 10 супозитории (блистер 2 х 5)/кутија супозиторија BRp PROFARMA Sh.a, Тирана, Албанија ТАРА-ФАРМ дооел 11-689/4 30.03.2017 30.03.2022 NaN 0 0 NaN Г
8 9 PARACETAMOL paracetamol 250 mg 10 супозитории (блистер 2 х 5)/кутија супозиторија BRp PROFARMA Sh.a, Тирана, Албанија ТАРА-ФАРМ дооел 11-690/8 30.03.2017 30.03.2022 NaN 0 0 NaN Г
9 10 PARACETAMOL paracetamol 120 mg/5 ml темно стаклено шише х 100 ml/кутија сируп BRp BOSNALIJEK d.d., Сараево, Босна и Херцеговина Претставништво БОСНАЛИЈЕК Д.Д. 11-8985/2 11.06.2019 NaN NaN 0 0 NaN Г
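A note on the JSESSIONID mismatch in the question: requests.Session already stores the cookies it receives and re-sends them on later requests to the same site, so the hand-built Cookie header is normally unnecessary, and a different JSESSIONID in the response usually just means the server issued a fresh session. A minimal sketch of relying on the Session's own cookie jar (the overview URL is taken from the answer above):
import requests

URL = "https://lekovi.zdravstvo.gov.mk/drugsregister/overview"

with requests.Session() as session:
    session.get(URL)  # the server sets JSESSIONID; the Session stores it
    print(session.cookies.get("JSESSIONID"))
    # any further session.post(...) to the same site re-sends that cookie
    # automatically, so there is no need to build the "Cookie" header by hand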

Related

Remove HTML markup (getting the desired text)

When parsing data from an AJAX table, the values come back with the type "bs4.element.Tag" (checked via "type"), even though I specified the text attribute when requesting, and I can't get the text I need out of the HTML markup. For example, functions like replace/strip etc. do not work with this type of data.
The class containing the number of comments is class_="tcl2". That is the problem: I can't just delete the 2 through re.sub, because the number of comments can itself be equal to 2.
Code:
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import json
import re
import time
catalog = {}
def parse():
    header = {
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.167 YaBrowser/22.7.5.940 Yowser/2.5 Safari/537.36"
    }
    session = requests.Session()
    retry = Retry(connect=1, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    inb = 1
    comp = 'all'
    while inb <= 1:
        url_not_based = f"http://foodmarkets.ru/firms/filter_result/7/{comp}/0/posted/page{inb}/"
        session.mount(f'http://foodmarkets.ru/firms/filter_result/7/{comp}/0/posted/page{inb}/', adapter)
        r = session.get(url_not_based, verify=True, headers=header, timeout=5)
        soup = BeautifulSoup(r.text, "lxml")
        rounded_block = soup.find_all('tr')
        for round in rounded_block:
            round_сompany = round.find('td', class_='tcl'>'href')
            clear_comp1 = re.sub(r'[a-zA-Z<>/\t\n=''0-9.:""]', '', str(round_сompany))
            clear_comp2 = re.sub(r'[\xa0]', ' ', clear_comp1)
            round_сity = round.find('td', class_="tc2 nowrap")
            clear_city1 = re.sub(r'[a-zA-Z<>/\t\n=''0-9.:""]', '', str(round_сity))
            clear_city2 = re.sub(r'[\xa0]', ' ', clear_city1)
            round_сommment = round.find('td', class_="tc2 cntr")
            clear_comm1 = re.sub(r'[a-zA-Z<>""''/\t\n=.:]', '', str(round_сommment))
            if round_сompany in catalog:
                continue
            else:
                catalog[round_сompany] = {
                    "Company": clear_comp2,
                    "City": clear_city2,
                    "Comment": clear_comm1,
                }
        inb = inb + 1
        time.sleep(0.5)
    # print(catalog)
    # with open("catalog.json", "w", encoding='utf-8') as file:
    #     json.dump(catalog, file, indent=4, ensure_ascii=False)

if __name__ == "__main__":
    parse()
To get the data from the table into a DataFrame you can use the following code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://foodmarkets.ru/firms/filter_result/7/all/0/posted/page1/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for row in soup.select('tr:has(td.tcl)'):
    tds = [cell.get_text(strip=True, separator=' ') for cell in row.select('td')]
    all_data.append(tds)
df = pd.DataFrame(all_data, columns=['Название компании', 'Города', 'Комментариев', 'Последнее сообщение'])
print(df)
Prints:
Название компании Города Комментариев Последнее сообщение
0 РусКонфета ООО Санкт-Петербург 0 Сегодня 08:46 от ruskonfeta.ru
1 "СУББОТА" ООО Санкт-Петербург 0 Вчера 14:07 от limon4ik
2 "КАРАБАНОВСКИЙ КОНДИТЕРСКИЙ КОМБИНАТ" ООО Москва 1 Вчера 12:13 от kurmanskiy
3 Мажор ООО Москва 0 30.01.2023 23:11 от bstgroup
4 ОрионСвит ООО Минск 0 16.01.2023 09:28 от Boetc
5 КД Стайл ООО Санкт-Петербург 1 11.01.2023 15:00 от kozhemyakinaJ
6 БАСТИОН ООО Владивосток 0 10.01.2023 14:52 от dv_zakupka
7 ИП Давыдов Р.А. ИП Саратов 0 21.12.2022 07:53 от dfkthbz98
8 Гипермаркет Сити Центр ООО Назрань 0 28.11.2022 21:23 от Calibri
9 ЭКА ООО Краснодар 1 Вчера 08:49 от intense
10 Арсанукаев Бекхан Бадруддинович ИП Грозный 1 26.10.2022 08:33 от sale555
11 ООО "Хлебный Дом" ООО Симферополь 0 20.10.2022 07:39 от AlexMeganom
12 Горелый Николай Иванович ИП Брянск 0 18.10.2022 10:20 от Dinastya Vlad
13 АЛЬЯНС ПРОДУКТ ООО Орел 1 10.10.2022 12:32 от viola_vrn2017
14 ООО «ТК Русские Традиции» ООО Москва 1 25.11.2022 15:34 от ZefirVK
15 "Технотрейд" ООО Минск 0 23.09.2022 08:28 от Alejandros
16 ООО ТК АТЛАС ООО Киров 0 15.09.2022 09:47 от Sal291279
17 Владторг ООО Владивосток 4 25.01.2023 05:54 от Andrey_Bragin
18 Кондитерская фабрика "Золотая Русь" ООО Тула 0 30.08.2022 14:48 от ilya_aldagir
19 ООО "Кондитерская фабрика "Финтур" ООО Санкт-Петербург 1 15.08.2022 11:15 от dvp_wholesaler
20 Новая Система Услуг ООО Тамбов 0 04.08.2022 11:32 от NSU
21 Шидакова И.М. (Ника-Трейд) ИП Нальчик 1 17.10.2022 12:16 от otdelprodazh.6a
22 Лапин Вячеслав Геннадьевич ИП Белгород 4 24.01.2023 13:24 от Anton Bel
23 ТД Первый Вкус ИП Москва 5 18.01.2023 12:34 от pvkioni
24 ГУДДРИНКС ООО Москва 0 25.07.2022 12:49 от sergeiaustraliangroup.ru
25 ООО ГХП Бизнес Гифт ООО Москва 0 19.07.2022 14:51 от visss71
26 Винотека ООО Севастополь 0 13.07.2022 12:30 от Ooo.vinoteka
27 Череповецхлеб АО Череповец 0 06.07.2022 13:54 от Alexcher35
28 Лысанов ИВ(Лысанова ЛБ Мещуров СВ) ИП Пермь 6 10.11.2022 08:34 от Andrey_Bragin
29 ХОРЕКА РБ ООО Уфа 0 09.06.2022 23:52 от horecarb
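As a side note on the original problem: a bs4.element.Tag is not a string, so running re.sub over str(tag) keeps fighting the markup; calling .get_text() on the tag returns only the visible text. A minimal sketch with made-up markup (the real page will differ, the class names just mirror the ones used above):
from bs4 import BeautifulSoup

# illustrative markup only
html = '<table><tr><td class="tcl"><a href="/firm/1">РусКонфета ООО</a></td><td class="tc2 cntr">2</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

company_td = soup.find("td", class_="tcl")
comments_td = soup.find("td", class_="tc2 cntr")

print(type(company_td))                  # <class 'bs4.element.Tag'>
print(company_td.get_text(strip=True))   # РусКонфета ООО
print(comments_td.get_text(strip=True))  # 2 -> the real comment count, no regex needed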

Beautiful Soup 4 Python Webscraping price from website (not the same)

Someone automatically closed my post. This is not the same question as someone else's; please do not close it.
I am trying to get the price from the website. Can anyone please find the error in my code?
import requests
from bs4 import BeautifulSoup
def priceTracker():
    url = 'https://www.britishairways.com/travel/book/public/en_gb/flightList?onds=LGW-AMS_2022-3-11&ad=1&yad=0&ch=0&inf=0&cabin=M&flex=LOWEST&usedChangeSearch=true'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # print(soup.prettify)
    price = soup.find(class_=" heading-sm ng-tns-c70-4 ng-star-inserted").next_sibling
    # print(soup)  # test if soup is working properly
    print(price)

while True:
    priceTracker()
I have attached the DOM screen of the price. I have updated the URL (in case it does not work, you can get it by going to the main website and pressing the search button).
The page is rendered through JavaScript. You can get the data through the API, but it requires a little work:
import requests
import pandas as pd
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
s.get('https://www.britishairways.com', headers=headers)
cookies = s.cookies.get_dict()
cookieStr = ''
for k, v in cookies.items():
    cookieStr += f'{k}={v};'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
'ba_integrity_tokenV1': '27f532c2f83fb5c560bcd25af3125d9658321fb753c1becc68735fd076ccbc46',
'ba_api_context': 'https://www.britishairways.com/api/sc4',
'ba_client_applicationName': 'ba.com',
'Authorization': 'Bearer 09067a6cba44a44a7119a15c123064f6',
'x-dtpc': '1$590599503_459h15vVJSUFBOHGPMRNQQHGCWULORUCSWNCPSO-0e0',
'ba_client_sessionId': '72bb7a96-f635-4a55-bf5f-125f8c83c464',
'Content-Type': 'application/json',
'Referer': 'https://www.britishairways.com/travel/book/public/en_gb/flightList?onds=LGW-AMS_2022-3-11&ad=1&yad=0&ch=0&inf=0&cabin=M&flex=LOWEST&usedChangeSearch=true',
'Cookie': cookieStr}
url = 'https://www.britishairways.com/api/sc4/badotcomadapter-paa/rs/v1/flightavailability/search;ondwanted=1'
payload = {"ondSearches":[
{"originLocationCode":"LGW",
"destinationLocationCode":"AMS",
"departureDate":"2022-03-11"
}
],
"cabin":"M",
"ticketFlexibility":"LOWEST",
"passengerMix":{
"adultCount":1,
"youngAdultCount":0,
"childCount":0,
"infantCount":0
},
"cug":'false',
"includeCalendar":'true',
"calendarDays":3,
"baIntegrityTokenV1":"27f532c2f83fb5c560bcd25af3125d9658321fb753c1becc68735fd076ccbc46"}
jsonData = s.post(url, json=payload, headers=headers).json()
calendarEntries = pd.DataFrame(jsonData['calendar']['calendarEntries'])
flightEvents = pd.json_normalize(jsonData['flightOption'],
record_path=['flightEvents'])
availableCabinsForOption = pd.json_normalize(jsonData['flightOption'],
record_path=['availableCabinsForOption'])
Output:
for table in [calendarEntries, flightEvents, availableCabinsForOption]:
    print(table)
date cheapestSegmentPrice cheapestJourneyPrice
0 2022-03-08 47.07 47.07
1 2022-03-09 54.07 54.07
2 2022-03-10 51.07 51.07
3 2022-03-11 80.07 80.07
4 2022-03-12 69.73 69.73
5 2022-03-13 51.07 51.07
6 2022-03-14 54.07 54.07
eventType duration ... aircraft.aircraftCode aircraft.aircraftName
0 FLIGHT_SEGMENT PT1H30M ... 319 Airbus A319 jet
1 FLIGHT_SEGMENT PT1H25M ... 320 Airbus A320 jet
2 FLIGHT_SEGMENT PT1H10M ... E90 Embraer E190SR
3 FLIGHT_SEGMENT PT1H25M ... 319 Airbus A319 jet
4 FLIGHT_SEGMENT PT1H5M ... E90 Embraer E190SR
5 FLIGHT_SEGMENT PT1H5M ... E90 Embraer E190SR
6 FLIGHT_SEGMENT PT1H20M ... 319 Airbus A319 jet
7 FLIGHT_SEGMENT PT1H25M ... 319 Airbus A319 jet
[8 rows x 32 columns]
availabilityInSellingClass ... fareBasisCode.BA2758
0 9 ... NaN
1 9 ... NaN
2 9 ... NaN
3 1 ... NaN
4 1 ... NaN
5 9 ... NaN
6 9 ... NaN
7 9 ... NaN
8 9 ... NaN
9 2 ... NaN
10 2 ... NaN
11 9 ... NaN
12 9 ... NaN
13 9 ... NaN
14 9 ... NaN
15 7 ... NaN
16 7 ... NaN
17 9 ... NaN
18 2 ... NaN
19 2 ... NaN
20 9 ... NaN
21 2 ... [KZ0RO]
22 2 ... [KV2RO]
23 9 ... [DMV2RO]
[24 rows x 33 columns]
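A small follow-up on the DataFrames above: once calendarEntries is built, the cheapest price for the requested date can be read straight out of it. A minimal sketch, assuming the column names shown in the printed output (date, cheapestJourneyPrice):
# calendarEntries comes from the snippet above
target = calendarEntries.loc[calendarEntries["date"] == "2022-03-11", "cheapestJourneyPrice"]
if not target.empty:
    print(f"Cheapest journey price on 2022-03-11: {target.iloc[0]}")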
Try this:
import requests
from bs4 import BeautifulSoup
def priceTracker():
    url = 'https://www.britishairways.com/travel/book/public/en_gb/flightList?onds=LGW-AMS_2022-3-11&ad=1&yad=0&ch=0&inf=0&cabin=M&flex=LOWEST&usedChangeSearch=true'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # print(soup.prettify)
    price = soup.find(class_="ng-star-inserted").text  # changed
    # print(soup)  # test if soup is working properly
    print(price)

while True:
    priceTracker()

Scraping data from Morningstar using an API

I have a very specific issue which I have not been able to find a solution to.
Recently, I began a project for which I am monitoring about 100 ETFs and Mutual funds based on specific data acquired from Morningstar. The current solution works great - but I later found out that I need more data from another "Tab" within the website. Specifically, I am trying to get data from the 1st table from the following website: https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1
Right now, I have the code below for scraping data from a table on the "Indhold" tab of the website and exporting it to Excel. My question is therefore: how do I adjust the code to scrape data from another part of the website?
To briefly explain the code and reiterate: the code below scrapes data from another tab of the same websites. The many, many IDs each identify the page for a single mutual fund/ETF. The setup works very well, so I am hoping to simply adjust it (if that is possible) to extract the table from the link above. I have very limited knowledge of the topic, so any help is much, much appreciated.
import requests
import re
import pandas as pd
from openpyxl import load_workbook
auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'
# Create a Pandas Excel writer using XlsxWriter as the engine.
path= r'/Users/karlemilthulstrup/Downloads/data2.xlsm'
book = load_workbook(path ,read_only = False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
ids = ['F00000VA2N','F0GBR064OO','F00000YKC2','F000015MVX','F0000020YA','0P00015YTR','0P00015YTT','F0GBR05V8D','F0GBR06XKI','F000013CKH','F00000MG6K','F000014G49',
'F00000WC0Z','F00000QSD2','F000016551','F0000146QH','F0000146QI','F0GBR04KZS','F0GBR064VU','F00000VXLM','F0000119R1','F0GBR04L4T','F000015CS3','F000015CS5','F000015CS6',
'F000015CS4','F000013BZE','F0GBR05W0Q','F000016M1C','F0GBR04L68','F00000Z9T9','F0GBR04JI8','F00000Z9TG','F0GBR04L2P','F000014CU8','F00000ZG2G','F00000MLEW',
'F000013ZOY','F000016614','F00000WUI9','F000015KRL','F0GBR04LCR','F000010ES9','F00000P780','F0GBR04HC3','F000015CV6','F00000YWCK','F00000YWCJ','F00000NAI5',
'F0GBR04L81','F0GBR05KNU','F0GBR06XKB','F00000NAI3','F0GBR06XKF','F000016UA9','F000013FC2','F000014NRE','0P0000CNVT','0P0000CNVX','F000015KRI',
'F000015KRG','F00000XLK7','F0GBR04IDG','F00000XLK6','F00000073J','F00000XLK4','F000013CKG','F000013CKJ','F000013CKK','F000016P8R','F000016P8S','F000011JG6',
'F000014UZQ','F0000159PE','F0GBR04KZG','F0000002OY','F00000TW9K','F0000175CC','F00000NBEL','F000016054','F000016056','F00000TEYP','F0000025UI','F0GBR04FV7',
'F00000WP01','F000011SQ4','F0GBR04KZO','F000010E19','F000013ZOX','F0GBR04HD7','F00000YKC1','F0GBR064UG','F00000JSDD','F000010ROF','F0000100CA','F0000100CD',
'FOGBR05KQ0','F0GBR04LBB','F0GBR04LBZ','F0GBR04LCN','F00000WLA7','F0000147D7','F00000ZB5E','F00000WC0Y']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
payload = {
'languageId': 'da-DK',
'locale': 'da-DK',
'clientId': 'MDC_intl',
'benchmarkId': 'category',
'component': 'sal-components-mip-factor-profile',
'version': '3.40.1'}
for api_id in ids:
    payload = {
        'Site': 'dk',
        'FC': '%s' % api_id,
        'IT': 'FO',
        'LANG': 'da-DK',
    }
    response = requests.get(auth, params=payload)
    search = re.search('(tokenMaaS:[\w\s]*\")(.*)(\")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)
    headers.update({'Authorization': bearer})
    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' % api_id
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        historicRange = v.pop('historicRange')
        row.update(v)
        for each in historicRange:
            row.update(each)
        rows.append(row.copy())
    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' % sheetName)

writer.save()
writer.close()
If I understand you correctly, you want to get the first table of that URL in the form of a pandas DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# load the page into soup:
url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# find correct table:
tbl = soup.select_one(".returnsCalenderYearTable")
# remove the first row (it's not header):
tbl.tr.extract()
# convert the html to pandas DF:
df = pd.read_html(str(tbl))[0]
# move the first row to header:
df.columns = map(str, df.loc[0])
df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
print(df)
Prints:
Name 2014* 2015* 2016* 2017* 2018 2019 2020 31-08
0 Samlet afkast % 2627 1490 1432 584 -589 2648 -482 1841
1 +/- Kategori 1130 583 808 -255 164 22 -910 -080
2 +/- Indeks 788 591 363 -320 -127 -262 -1106 -162
3 Rank i kategori 2 9 4 80 38 54 92 63
EDIT: To load from multiple URLs:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1",
]
all_data = []
for url in urls:
    print("Loading URL {}".format(url))
    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # find correct table:
    tbl = soup.select_one(".returnsCalenderYearTable")
    # remove the first row (it's not header):
    tbl.tr.extract()
    # convert the html to pandas DF:
    df = pd.read_html(str(tbl))[0]
    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])
df = pd.concat(all_data, ignore_index=True)
print(df)
Prints:
Name 2016 2017 2018 2019 2020 31-08 Company URL
0 Samlet afkast % 1755.0 942.0 -1317.0 1757.0 -189.0 3018 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
1 +/- Kategori 966.0 -54.0 -186.0 -662.0 -967.0 1152 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
2 +/- Indeks 686.0 38.0 -854.0 -1153.0 -813.0 1015 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
3 Rank i kategori 10.0 24.0 85.0 84.0 77.0 4 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
4 Samlet afkast % NaN 1016.0 -940.0 1899.0 767.0 2238 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
5 +/- Kategori NaN 20.0 190.0 -520.0 -12.0 373 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
6 +/- Indeks NaN 112.0 -478.0 -1011.0 143.0 235 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
7 Rank i kategori NaN 26.0 69.0 92.0 43.0 25 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
8 Samlet afkast % NaN NaN -939.0 1898.0 766.0 2239 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
9 +/- Kategori NaN NaN 191.0 -521.0 -12.0 373 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
10 +/- Indeks NaN NaN -477.0 -1012.0 142.0 236 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
11 Rank i kategori NaN NaN 68.0 92.0 44.0 24 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
12 Samlet afkast % NaN NaN NaN NaN NaN 2384 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
13 +/- Kategori NaN NaN NaN NaN NaN 518 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
14 +/- Indeks NaN NaN NaN NaN NaN 381 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
15 Rank i kategori NaN NaN NaN NaN NaN 18 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
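If, as in the original script, the end goal is Excel, the combined frame can be written out with pandas. A minimal sketch, assuming the df produced by the loop above and a hypothetical output file name (openpyxl must be installed):
out_path = "morningstar_returns.xlsx"  # hypothetical file name

with pd.ExcelWriter(out_path, engine="openpyxl") as writer:
    # everything on one sheet...
    df.to_excel(writer, sheet_name="all_funds", index=False)
    # ...or one sheet per fund (Excel caps sheet names at 31 characters)
    for company, group in df.groupby("Company"):
        group.to_excel(writer, sheet_name=str(company)[:31], index=False)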

How to pull actual data from multiple pages of a website using Selenium, BeautifulSoup and Pandas?

I am new to pulling data with Python. I want to build an Excel file from tables pulled off a website.
The website URL: "https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml"
On this webpage the tables are split across separate pages for each hour's data, because one hour contains around 500 rows. I want to pull all the data for every hour, but my problem is that I keep pulling the same table even when the page changes.
I am using the BeautifulSoup, pandas and Selenium libraries. I will show you my code to explain myself.
import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"]
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1)
tomorrow = str(tomorrow)
words = tarih.split('.')
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")
def tablo_cek():
    tablo = source.find_all("table")  # table on the page
    dfs = pd.read_html(str(tablo))  # convert the table to a dataframe
    dfs.append(dfs)  # append the newly pulled table
    print(dfs)
    return tablo

if tomorrow == yeni_tarih:
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/Users/tugba.ozkan/AppData/Local/SeleniumBasic/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")
    num = 0
    while num < 24:
        user.send_keys(num)  # send the hour value
        driver.find_element_by_id('j_idt206:goster').click()  # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")  # next page for that hour
        nextpage.click()  # go to the next page
        user = driver.find_element_by_name("j_idt206:txt1")  # re-locate the hour input
        time.sleep(1)
        tablo_cek()
        num = num + 1  # increase the hour by one
        user.clear()  # reset the hour
else:
    print("Güncelleme gelmedi")
In this situation:
nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")  # next page for that hour
nextpage.click()
when Python clicks the button to go to the next page, the next page is shown, but then it should pull that next table, and it doesn't. In the output I see the same table appended again with the same values, like this:
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
I will also offer another solution, since you can pull that data directly with requests. It also gives you the option of how many rows to pull per page (and you can iterate through each page); however, if you set that limit high enough, you can get it all in one request. There are about 400+ rows, so I set the limit to 1000, and then you only need page 0:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
page = '0'
payload = {
'javax.faces.partial.ajax': 'true',
'javax.faces.source': 'j_idt206:dt',
'javax.faces.partial.execute': 'j_idt206:dt',
'javax.faces.partial.render': 'j_idt206:dt',
'j_idt206:dt': 'j_idt206:dt',
'j_idt206:dt_pagination': 'true',
'j_idt206:dt_first': page,
'j_idt206:dt_rows': '1000',
'j_idt206:dt_skipChildren': 'true',
'j_idt206:dt_encodeFeature': 'true',
'j_idt206': 'j_idt206',
'j_idt206:date1_input': '04.02.2021',
'j_idt206:txt1': '0',
'j_idt206:dt_rppDD': '1000'
}
rows = []
hours = list(range(0,24))
for hour in hours:
    payload.update({'j_idt206:txt1': str(hour)})
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text.replace('![CDATA[', ''), 'lxml')
    columns = ['Fiyat (TL/MWh)', 'Talep (MWh)', 'Arz (MWh)', 'hour']
    trs = soup.find_all('tr')
    for row in trs:
        data = row.find_all('td')
        data = [x.text for x in data] + [str(hour)]
        rows.append(data)
df = pd.DataFrame(rows, columns=columns)
Output:
print(df)
Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0,00 25.113,70 17.708,10
1 0,01 25.077,69 17.712,10
2 0,02 25.077,67 17.723,10
3 0,85 25.076,57 17.723,12
4 0,86 25.076,05 17.746,12
.. ... ... ...
448 571,01 19.317,10 29.529,60
449 571,80 19.316,86 29.529,60
450 571,90 19.316,83 29.529,70
451 571,99 19.316,80 29.529,70
452 572,00 19.316,80 29.540,70
[453 rows x 3 columns]
To find this just takes a little investigative work. If you go to Dev Tools -> Network -> XHR, you can check whether the data is embedded somewhere in those requests. If you find it there, go to the Headers tab and you can get the URL and parameters at the bottom.
In MOST cases you'll see the data returned in a nice JSON format. Not the case here: it comes back in a slightly different way, as XML, so it needs a tad more work to pull out the tags and such. But not impossible.
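One extra step you may want on that result: the values come back as strings in Turkish number format (dot as thousands separator, comma as decimal separator), so they need converting before doing any maths. A minimal sketch, assuming the df built by the snippet above:
# df comes from the request-based snippet above; values look like "25.113,70"
for col in ['Fiyat (TL/MWh)', 'Talep (MWh)', 'Arz (MWh)']:
    df[col] = pd.to_numeric(
        df[col]
        .str.replace('.', '', regex=False)    # drop thousands separators
        .str.replace(',', '.', regex=False),  # comma -> decimal point
        errors="coerce",
    )
print(df.dtypes)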
That's because you pull the initial HTML here: source = BeautifulSoup(r.content,"lxml"), and then keep processing that same content. You need to pull the HTML for each page that you go to. It's just a matter of adding one line; I commented where I added it:
import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"]
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1)
tomorrow = str(tomorrow)
words = tarih.split('.')
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")
def tablo_cek():
    source = BeautifulSoup(driver.page_source, "lxml")  # <-- get the current html
    tablo = source.find_all("table")  # table on the page
    dfs = pd.read_html(str(tablo))  # convert the table to a dataframe
    dfs.append(dfs)  # append the newly pulled table
    print(dfs)
    return tablo

if tomorrow == yeni_tarih:
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")
    num = 0
    tablo_cek()  # <-- need to get that data before moving to next page
    while num < 24:
        user.send_keys(num)  # send the hour value
        driver.find_element_by_id('j_idt206:goster').click()  # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")  # next page for that hour
        nextpage.click()  # go to the next page
        user = driver.find_element_by_name("j_idt206:txt1")  # re-locate the hour input
        time.sleep(1)
        tablo_cek()
        num = num + 1  # increase the hour by one
        user.clear()  # reset the hour
else:
    print("Güncelleme gelmedi")
Output:
True
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.11370 17.70810
1 1 25.07769 17.71210
2 2 25.07767 17.72310
3 85 25.07657 17.72312
4 86 25.07605 17.74612
.. ... ... ...
91 10000 23.97000 17.97907
92 10001 23.91500 17.97907
93 10014 23.91500 17.97907
94 10015 23.91500 17.97907
95 10100 23.91499 17.97909
[96 rows x 3 columns], [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 10101 23.91499 18.04009
1 10440 23.91497 18.04015
2 10999 23.91493 18.04025
3 11000 23.89993 18.04025
4 11733 23.89988 18.04039
.. ... ... ...
91 23999 23.55087 19.40180
92 24000 23.55087 19.40200
93 24001 23.53867 19.40200
94 24221 23.53863 19.40200
95 24222 23.53863 19.40200
[96 rows x 3 columns], [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 24360 21.33871 19.8112
1 24499 21.33868 19.8112
2 24500 21.33868 19.8112
3 24574 21.33867 19.8112
4 24575 21.33867 19.8112
.. ... ... ...
91 29864 21.18720 20.3708
92 29899 21.18720 20.3708
93 29900 21.18720 20.3808
94 29999 21.18720 20.3808
95 30000 21.18530 20.3811
[96 rows x 3 columns], [...]]
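Since the stated goal was an Excel file, a natural last step with either approach is to collect the per-hour DataFrames in a list and write them out once at the end. A minimal sketch of that final step, where df_list is a hypothetical list you would append to inside the loop (the sample values are just copied from the output above; writing .xlsx requires openpyxl):
import pandas as pd

# df_list is a hypothetical list; in the script above you would append one DataFrame per hour to it
df_list = [
    pd.DataFrame({"Fiyat (TL/MWh)": [0, 1],
                  "Talep (MWh)": [25.11370, 25.07769],
                  "Arz (MWh)": [17.70810, 17.71210]}),
]

all_hours = pd.concat(df_list, ignore_index=True)  # one table covering all hours
all_hours.to_excel("arz_talep.xlsx", index=False)
print(all_hours)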

Extracting first rows from multiple tables and add one column (Python)

I'm trying to generate a list of the latest currency quotes from Investing.com.
I have the following code:
head = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"}
ISO_Code=[]
Latest=[]
for item in ISO_CURR_ID.ISO_Code[:4]:
url = 'http://www.investing.com/currencies/usd-'+item+'-historical-data'
r = requests.get(url, headers=head)
soup = BeautifulSoup(r.content, 'html.parser')
try:
CurrHistoricRange = pd.read_html(r.content,attrs = {'id': 'curr_table'}, flavor="bs4")[0]
Item='USD/'+item
ISO_Code.append(np.array(Item))
# Latest.append(np.array(CurrHistoricRange[:1]))
Latest.append(CurrHistoricRange[:1])
except:
pass
where ISO_CURR_ID.ISO_Code is:
In [69]:ISO_CURR_ID.ISO_Code[:4]
Out[69]:
0 EUR
1 GBP
2 JPY
3 CHF
I need the final format to be a table like this:
ISO_Code Date Price Open High Low Change %
0 EUR Jun 21, 2016, 0.8877, 0.8833, 0.8893, 0.881, -0.14%
But I'm having problems understanding how to merge those first rows without repeating the column names. So I'm getting a result like this if I use:
Final=pd.DataFrame(dict(ISO_Code = ISO_Code, Latest_Quotes = Latest))
Final
Out[71]:
ISO_Code Latest_Quotes
0 USD/EUR Date Price Open High Low...
1 USD/GBP Date Price Open High Lo...
2 USD/JPY Date Price Open High Low...
3 USD/CHF Date Price Open High Low...
I think this is a cleaner way to accomplish what you are trying to do:
head = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"}
latest_data = []
for item in ISO_CURR_ID.ISO_Code:
    url = 'http://www.investing.com/currencies/usd-' + item + '-historical-data'
    r = requests.get(url, headers=head)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        CurrHistoricRange = pd.read_html(r.content, attrs={'id': 'curr_table'}, flavor="bs4")[0]
        Item = 'USD/' + item
        data = CurrHistoricRange.iloc[0].to_dict()
        data["ISO_Code"] = Item
        latest_data.append(data)
    except Exception as e:
        print(e)

def getDf(latest_list, order=["ISO_Code", "Date", "Price", "Open", "High", "Low", "Change %"]):
    return pd.DataFrame(latest_list, columns=order)

getDf(latest_data)
Outputs:
ISO_Code Date Price Open High Low Change %
0 USD/EUR Jun 21, 2016 0.8882 0.8833 0.8893 0.8810 0.55%
1 USD/GBP Jun 21, 2016 0.6822 0.6815 0.6829 0.6766 0.10%
2 USD/JPY Jun 21, 2016 104.75 103.82 104.82 103.60 0.88%
3 USD/CHF Jun 21, 2016 0.9613 0.9620 0.9623 0.9572 -0.07%
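A small usage note on that result: Date comes back as text, so if you plan to sort or filter on it, converting the column is worth doing (the price columns are usually numeric already, but pd.to_numeric is a harmless safety net). A minimal sketch on the frame returned by getDf above, assuming pandas is imported as pd as in the other snippets:
df = getDf(latest_data)
df["Date"] = pd.to_datetime(df["Date"])                # "Jun 21, 2016" -> Timestamp
for col in ["Price", "Open", "High", "Low"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")  # no-op if already numeric
print(df.dtypes)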
I would recommend using pandas.Panel for that, similar to pandas_datareader:
import requests
from bs4 import BeautifulSoup
import pandas as pd
head = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
ISO_Code=[]
Latest=[]
URL = 'http://www.investing.com/currencies/usd-{}-historical-data'
dfs = {}
curr_ser = pd.Series(['EUR','GBP','JPY','CHF'])
# for item in ISO_CURR_ID.ISO_Code[:4]:
for item in curr_ser:
    url = URL.format(item)
    r = requests.get(url, headers=head)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        Item = 'USD/' + item
        dfs[Item] = pd.read_html(r.content, attrs={'id': 'curr_table'}, flavor="bs4")[0]
        # CurrHistoricRange = pd.read_html(r.content, attrs={'id': 'curr_table'}, flavor="bs4")[0]
        # ISO_Code.append(np.array(Item))
        # Latest.append(np.array(CurrHistoricRange[:1]))
        # Latest.append(CurrHistoricRange[:1])
    except:
        pass
# create Panel out of dictionary of DataFrames
p = pd.Panel(dfs)
# slice first row from all DFs
t = p[:,0,:]
print(t)
print(t.T)
Output:
USD/CHF USD/EUR USD/GBP USD/JPY
Date Jun 21, 2016 Jun 21, 2016 Jun 21, 2016 Jun 21, 2016
Price 0.9618 0.8887 0.6828 104.97
Open 0.962 0.8833 0.6815 103.82
High 0.9623 0.8893 0.6829 104.97
Low 0.9572 0.881 0.6766 103.6
Change % -0.02% 0.61% 0.19% 1.09%
Date Price Open High Low Change %
USD/CHF Jun 21, 2016 0.9618 0.962 0.9623 0.9572 -0.02%
USD/EUR Jun 21, 2016 0.8887 0.8833 0.8893 0.881 0.61%
USD/GBP Jun 21, 2016 0.6828 0.6815 0.6829 0.6766 0.19%
USD/JPY Jun 21, 2016 104.97 103.82 104.97 103.6 1.09%
and if we sort DF's indices (by dates) like this:
dfs[Item] = pd.read_html(r.content,
attrs = {'id': 'curr_table'},
flavor="bs4",
parse_dates=['Date'],
index_col=[0]
)[0].sort_index()
# create Panel out of dictionary of DataFrames
p = pd.Panel(dfs)
now we can do a lot of funny things:
In [18]: p.axes
Out[18]:
[Index(['USD/CHF', 'USD/EUR', 'USD/GBP', 'USD/JPY'], dtype='object'),
 DatetimeIndex(['2016-05-23', '2016-05-24', '2016-05-25', '2016-05-26', '2016-05-27', '2016-05-30', '2016-05-31', '2016-06-01', '2016-06-02',
                '2016-06-03', '2016-06-06', '2016-06-07', '2016-06-08', '2016-06-09', '2016-06-10', '2016-06-13', '2016-06-14', '2016-06-15',
                '2016-06-16', '2016-06-17', '2016-06-19', '2016-06-20', '2016-06-21'],
               dtype='datetime64[ns]', name='Date', freq=None),
Index(['Price', 'Open', 'High', 'Low', 'Change %'], dtype='object')]
In [19]: p.keys()
Out[19]: Index(['USD/CHF', 'USD/EUR', 'USD/GBP', 'USD/JPY'], dtype='object')
In [22]: p.to_frame().head(10)
Out[22]:
USD/CHF USD/EUR USD/GBP USD/JPY
Date minor
2016-05-23 Price 0.9896 0.8913 0.6904 109.23
Open 0.9905 0.8913 0.6893 110.08
High 0.9924 0.8942 0.6925 110.25
Low 0.9879 0.8893 0.6872 109.08
Change % -0.06% 0.03% 0.12% -0.84%
2016-05-24 Price 0.9933 0.8976 0.6833 109.99
Open 0.9892 0.891 0.6903 109.22
High 0.9946 0.8983 0.6911 110.12
Low 0.9882 0.8906 0.6827 109.14
Change % 0.37% 0.71% -1.03% 0.70%
indexing by currency pair and by dates
In [25]: p['USD/EUR', '2016-06-10':'2016-06-15', :]
Out[25]:
Price Open High Low Change %
Date
2016-06-10 0.8889 0.8835 0.8893 0.8825 0.59%
2016-06-13 0.8855 0.8885 0.8903 0.8846 -0.38%
2016-06-14 0.8922 0.8856 0.8939 0.8846 0.76%
2016-06-15 0.8881 0.892 0.8939 0.8848 -0.46%
index by currency pair
In [26]: p['USD/EUR', :, :]
Out[26]:
Price Open High Low Change %
Date
2016-05-23 0.8913 0.8913 0.8942 0.8893 0.03%
2016-05-24 0.8976 0.891 0.8983 0.8906 0.71%
2016-05-25 0.8964 0.8974 0.8986 0.8953 -0.13%
2016-05-26 0.8933 0.8963 0.8975 0.8913 -0.35%
2016-05-27 0.8997 0.8931 0.9003 0.8926 0.72%
2016-05-30 0.8971 0.8995 0.9012 0.8969 -0.29%
2016-05-31 0.8983 0.8975 0.8993 0.8949 0.13%
2016-06-01 0.8938 0.8981 0.9 0.8929 -0.50%
2016-06-02 0.8968 0.8937 0.8974 0.8911 0.34%
2016-06-03 0.8798 0.8968 0.8981 0.8787 -1.90%
2016-06-06 0.8807 0.8807 0.8831 0.8777 0.10%
2016-06-07 0.8804 0.8805 0.8821 0.8785 -0.03%
2016-06-08 0.8777 0.8803 0.8812 0.8762 -0.31%
2016-06-09 0.8837 0.877 0.8847 0.8758 0.68%
2016-06-10 0.8889 0.8835 0.8893 0.8825 0.59%
2016-06-13 0.8855 0.8885 0.8903 0.8846 -0.38%
2016-06-14 0.8922 0.8856 0.8939 0.8846 0.76%
2016-06-15 0.8881 0.892 0.8939 0.8848 -0.46%
2016-06-16 0.8908 0.8879 0.8986 0.8851 0.30%
2016-06-17 0.8868 0.8907 0.8914 0.885 -0.45%
2016-06-19 0.8813 0.8822 0.8841 0.8811 -0.63%
2016-06-20 0.8833 0.8861 0.8864 0.8783 0.23%
2016-06-21 0.8891 0.8833 0.8893 0.881 0.66%
index by date
In [28]: p[:, '2016-06-20', :]
Out[28]:
USD/CHF USD/EUR USD/GBP USD/JPY
Price 0.962 0.8833 0.6815 103.84
Open 0.9599 0.8861 0.6857 104.63
High 0.9633 0.8864 0.6881 104.84
Low 0.9576 0.8783 0.6794 103.78
Change % 0.22% 0.23% -0.61% -0.75%
In [29]: p[:, :, 'Change %']
Out[29]:
USD/CHF USD/EUR USD/GBP USD/JPY
Date
2016-05-23 -0.06% 0.03% 0.12% -0.84%
2016-05-24 0.37% 0.71% -1.03% 0.70%
2016-05-25 -0.20% -0.13% -0.42% 0.18%
2016-05-26 -0.20% -0.35% 0.18% -0.38%
2016-05-27 0.55% 0.72% 0.31% 0.42%
2016-05-30 -0.25% -0.29% -0.07% 0.82%
2016-05-31 0.14% 0.13% 1.10% -0.38%
2016-06-01 -0.55% -0.50% 0.42% -1.07%
2016-06-02 0.23% 0.34% -0.04% -0.61%
2016-06-03 -1.45% -1.90% -0.66% -2.14%
2016-06-06 -0.56% 0.10% 0.55% 0.97%
2016-06-07 -0.55% -0.03% -0.71% -0.19%
2016-06-08 -0.62% -0.31% 0.28% -0.35%
2016-06-09 0.55% 0.68% 0.30% 0.10%
2016-06-10 0.02% 0.59% 1.42% -0.10%
2016-06-13 -0.03% -0.38% -0.11% -0.68%
2016-06-14 -0.11% 0.76% 1.11% -0.13%
2016-06-15 -0.21% -0.46% -0.64% -0.08%
2016-06-16 0.40% 0.30% 0.03% -1.67%
2016-06-17 -0.54% -0.45% -1.08% -0.12%
2016-06-19 0.00% -0.63% -1.55% 0.48%
2016-06-20 0.22% 0.23% -0.61% -0.75%
2016-06-21 0.02% 0.66% 0.35% 0.98%
index by two axes
In [30]: p[:, '2016-06-10':'2016-06-15', 'Change %']
Out[30]:
USD/CHF USD/EUR USD/GBP USD/JPY
Date
2016-06-10 0.02% 0.59% 1.42% -0.10%
2016-06-13 -0.03% -0.38% -0.11% -0.68%
2016-06-14 -0.11% 0.76% 1.11% -0.13%
2016-06-15 -0.21% -0.46% -0.64% -0.08%
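One caveat for readers on current pandas: pandas.Panel was deprecated and later removed, so on recent versions the same dictionary-of-DataFrames idea is usually expressed with pd.concat and a MultiIndex. A rough equivalent of the slicing above, assuming the dfs dict built earlier with the date-indexed (sorted) variant:
import pandas as pd

# dfs is the {"USD/EUR": DataFrame, ...} dict from the loop above,
# each frame indexed by Date as in the sorted variant
panel_like = pd.concat(dfs, names=["Pair", "Date"])  # MultiIndex of (Pair, Date)

# "Change %" for every pair, dates as rows (like p[:, :, 'Change %'])
print(panel_like["Change %"].unstack(level="Pair").tail())

# one date across all pairs (like p[:, '2016-06-20', :])
print(panel_like.xs(pd.Timestamp("2016-06-20"), level="Date"))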
