I want to web-scrape the information on:
https://rotogrinders.com/resultsdb/date/2019-01-13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d
There is a main table with a User column. When you click on a user, a second table appears beside it showing the team that user entered in the contest. I want to extract the team of every user, so I need to go through all the users by clicking on them and then extracting the information from the second table. Here is my code to extract the team of the first user:
from selenium import webdriver
import csv
from selenium.webdriver.support.ui import Select
from datetime import date, timedelta
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chromedriver = "C:/Users/Michel/Desktop/python/package/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(chromedriver)
DFSteam = []
driver.get("https://rotogrinders.com/resultsdb/date/2019-01-13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d")
Team1 = driver.find_element_by_css_selector("table.ant-table-fixed")
print(Team1.text)
driver.close()
However, I am not able to iterate through the different users. I noticed that when I click on a user, the tr class of that row switches from inactive to active in the page source, but I do not know how to use that. Moreover, I would like to store the extracted teams in a data frame; I am not sure whether it is better to do that during the loop or afterwards.
The data frame would look like this:
RANK(team) / C / C / W / W / W / D / D / G / UTIL / TOTAL($) / Total Points
1 / Mark Scheifel / Mickael Backlund / Artemi Panarin / Nick Foligno / Michael Frolik / Mark Giordano / Zach Werenski / Connor Hellebuyck / Brandon Tanev / 50,000 / 54.60
You have the right idea. It's just a matter of finding the username element to click on, grabbing the lineup table, and reformatting it so it can be combined into one results dataframe.
The user name text is tagged with <a>, so you just need to find the <a> tag whose text matches the user name.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
url = 'https://rotogrinders.com/resultsdb/date/2019-01-13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d'
# Open Browser and go to site
driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
driver.get(url)
# Waits until tables are loaded and has text. Timeouts after 60 seconds
WebDriverWait(driver, 60).until(ec.presence_of_element_located((By.XPATH, './/tbody//tr//td//span//a[text() != ""]')))
# Get tables to get the user names
tables = pd.read_html(driver.page_source)
users_df = tables[0][['Rank','User']]
users_df['User'] = users_df['User'].str.replace(' Member', '')
# Initialize results dataframe and iterate through users
results = pd.DataFrame()
for i, row in users_df.iterrows():
    rank = row['Rank']
    user = row['User']

    # Find the user name and click on the name
    user_link = driver.find_elements(By.XPATH, "//a[text()='%s']" % (user))[0]
    user_link.click()

    # Get the lineup table after clicking on the user name
    tables = pd.read_html(driver.page_source)
    lineup = tables[1]
    #print (user)
    #print (lineup)

    # Restructure to put into results dataframe
    lineup.loc[9, 'Name'] = lineup.iloc[9]['Salary']
    lineup.loc[10, 'Name'] = lineup.iloc[9]['Pts']

    temp_df = pd.DataFrame(lineup['Name'].values.reshape(-1, 11),
                           columns=lineup['Pos'].iloc[:9].tolist() + ['Total_$', 'Total_Pts'])
    temp_df.insert(loc=0, column='User', value=user)
    temp_df.insert(loc=0, column='Rank', value=rank)

    results = results.append(temp_df)

results = results.reset_index(drop=True)
driver.close()
Output:
print (results)
Rank User ... Total_$ Total_Pts
0 1 Canadaman101 ... $50,000.00 54.6
1 2 MayhemLikeMe27 ... $50,000.00 53.9
2 2 gunslinger58 ... $50,000.00 53.9
3 4 oilkings ... $48,600.00 53.6
4 5 TTB19 ... $50,000.00 53.4
5 6 Adamjloder ... $49,800.00 53.1
6 7 DollarBillW ... $49,900.00 52.6
7 8 Biglarry696 ... $49,900.00 52.4
8 8 tical1994 ... $49,900.00 52.4
9 8 rollem02 ... $49,900.00 52.4
10 8 kchoban ... $50,000.00 52.4
11 8 TBirdSCIL ... $49,900.00 52.4
12 13 manny716 ... $49,900.00 52.1
13 14 JayKooks ... $50,000.00 51.9
14 15 Cambie19 ... $49,900.00 51.4
15 16 mjh6588 ... $50,000.00 51.1
16 16 shanefriesen ... $50,000.00 51.1
17 16 mnfish42 ... $50,000.00 51.1
18 19 Pugsly55 ... $49,900.00 50.9
19 19 volpez7 ... $50,000.00 50.9
20 19 Scherr47 ... $49,900.00 50.9
21 19 Testosterown ... $50,000.00 50.9
22 23 markm22 ... $49,700.00 50.6
23 23 foreveryoung12 ... $49,800.00 50.6
24 23 STP_Picks ... $49,900.00 50.6
25 26 jibbinghippo ... $49,800.00 50.4
26 26 loumister35 ... $49,900.00 50.4
27 26 creels3 ... $50,000.00 50.4
28 26 JayKooks ... $50,000.00 51.9
29 26 mmeiselman731 ... $49,900.00 50.4
30 26 volpez7 ... $50,000.00 50.9
31 26 tommienation1 ... $49,900.00 50.4
32 26 jibbinghippo ... $49,800.00 50.4
33 26 Testosterown ... $50,000.00 50.9
34 35 nut07 ... $50,000.00 49.9
35 35 volpez7 ... $50,000.00 50.9
36 35 durfdurf ... $50,000.00 49.9
37 35 chupacabra21 ... $50,000.00 49.9
38 39 Mbermes01 ... $50,000.00 49.6
39 40 suerte41 ... $50,000.00 49.4
40 40 spliksskins77 ... $50,000.00 49.4
41 42 Andrewskoff ... $49,600.00 49.1
42 42 Alky14 ... $49,800.00 49.1
43 42 bretned ... $50,000.00 49.1
44 42 bretned ... $50,000.00 49.1
45 42 gehrig38 ... $49,700.00 49.1
46 42 d-train_91 ... $49,500.00 49.1
47 42 DiamondDallas ... $50,000.00 49.1
48 49 jdmre ... $50,000.00 48.9
49 49 Devosty ... $50,000.00 48.9
[50 rows x 13 columns]
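One note on the results.append(temp_df) call: DataFrame.append was removed in pandas 2.0, so on a newer pandas you can collect the per-user frames in a list and concatenate once at the end. A minimal sketch of that pattern (with toy stand-in frames, not the real lineup parsing):
import pandas as pd

frames = []
for rank, user, pts in [(1, 'Canadaman101', 54.6), (2, 'MayhemLikeMe27', 53.9)]:
    # In the real loop this would be the temp_df built from the clicked lineup table.
    temp_df = pd.DataFrame({'Rank': [rank], 'User': [user], 'Total_Pts': [pts]})
    frames.append(temp_df)

results = pd.concat(frames, ignore_index=True)
print(results)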
I am trying to scrape the table from:
https://worldpopulationreview.com/states
My code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://worldpopulationreview.com/states'
page = requests.get(url)
soup = BeautifulSoup(page.text,'lxml')
table = soup.find('table', {'class': 'jsx-a3119e4553b2cac7 table is-striped is-hoverable is-fullwidth tp-table-body is-narrow'})
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)

df = pd.DataFrame(columns=headers)
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data
df
Currently returns
'NoneType' object has no attribute 'find_all'
Clearly the error is because the table variable is returning nothing, but I believe I have the table tag correct.
The table data is loaded dynamically by JavaScript, and bs4 can't render JS. You can still do the job by pairing bs4 with an automation tool such as selenium and then reading the rendered table into a pandas DataFrame.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
driver.get('https://worldpopulationreview.com/states')
driver.maximize_window()
time.sleep(8)
soup = BeautifulSoup(driver.page_source,"lxml")
#You can pull the table directly from the web page
df = pd.read_html(str(soup))[0]
print(df)
#OR
#table= soup.select_one('table[class="jsx-a3119e4553b2cac7 table is-striped is-hoverable is-fullwidth tp-table-body is-narrow"]')
# df = pd.read_html(str(table))[0]
# print(df)
Output:
Rank State 2022 Population Growth Rate ... 2010 Population Growth Since 2010 % of US Density (/mi²)
0 1 California 39995077 0.57% ... 37253956 7.36% 11.93% 257
1 2 Texas 29945493 1.35% ... 25145561 19.09% 8.93% 115
2 3 Florida 22085563 1.25% ... 18801310 17.47% 6.59% 412
3 4 New York 20365879 0.41% ... 19378102 5.10% 6.07% 432
4 5 Pennsylvania 13062764 0.23% ... 12702379 2.84% 3.90% 292
5 6 Illinois 12808884 -0.01% ... 12830632 -0.17% 3.82% 231
6 7 Ohio 11852036 0.22% ... 11536504 2.74% 3.53% 290
7 8 Georgia 10916760 0.95% ... 9687653 12.69% 3.26% 190
8 9 North Carolina 10620168 0.86% ... 9535483 11.38% 3.17% 218
9 10 Michigan 10116069 0.19% ... 9883640 2.35% 3.02% 179
10 11 New Jersey 9388414 0.53% ... 8791894 6.78% 2.80% 1277
11 12 Virginia 8757467 0.73% ... 8001024 9.45% 2.61% 222
12 13 Washington 7901429 1.26% ... 6724540 17.50% 2.36% 119
13 14 Arizona 7303398 1.05% ... 6392017 14.26% 2.18% 64
14 15 Massachusetts 7126375 0.68% ... 6547629 8.84% 2.13% 914
15 16 Tennessee 7023788 0.81% ... 6346105 10.68% 2.09% 170
16 17 Indiana 6845874 0.44% ... 6483802 5.58% 2.04% 191
17 18 Maryland 6257958 0.65% ... 5773552 8.39% 1.87% 645
18 19 Missouri 6188111 0.27% ... 5988927 3.33% 1.85% 90
19 20 Wisconsin 5935064 0.35% ... 5686986 4.36% 1.77% 110
20 21 Colorado 5922618 1.27% ... 5029196 17.76% 1.77% 57
21 22 Minnesota 5787008 0.70% ... 5303925 9.11% 1.73% 73
22 23 South Carolina 5217037 0.95% ... 4625364 12.79% 1.56% 174
23 24 Alabama 5073187 0.48% ... 4779736 6.14% 1.51% 100
24 25 Louisiana 4682633 0.27% ... 4533372 3.29% 1.40% 108
25 26 Kentucky 4539130 0.37% ... 4339367 4.60% 1.35% 115
26 27 Oregon 4318492 0.95% ... 3831074 12.72% 1.29% 45
27 28 Oklahoma 4000953 0.52% ... 3751351 6.65% 1.19% 58
28 29 Connecticut 3612314 0.09% ... 3574097 1.07% 1.08% 746
29 30 Utah 3373162 1.53% ... 2763885 22.04% 1.01% 41
30 31 Iowa 3219171 0.45% ... 3046355 5.67% 0.96% 58
31 32 Nevada 3185426 1.28% ... 2700551 17.95% 0.95% 29
32 33 Arkansas 3030646 0.32% ... 2915918 3.93% 0.90% 58
33 34 Mississippi 2960075 -0.02% ... 2967297 -0.24% 0.88% 63
34 35 Kansas 2954832 0.29% ... 2853118 3.57% 0.88% 36
35 36 New Mexico 2129190 0.27% ... 2059179 3.40% 0.64% 18
36 37 Nebraska 1988536 0.68% ... 1826341 8.88% 0.59% 26
37 38 Idaho 1893410 1.45% ... 1567582 20.79% 0.56% 23
38 39 West Virginia 1781860 -0.33% ... 1852994 -3.84% 0.53% 74
39 40 Hawaii 1474265 0.65% ... 1360301 8.38% 0.44% 230
40 41 New Hampshire 1389741 0.44% ... 1316470 5.57% 0.41% 155
41 42 Maine 1369159 0.25% ... 1328361 3.07% 0.41% 44
42 43 Rhode Island 1106341 0.41% ... 1052567 5.11% 0.33% 1070
43 44 Montana 1103187 0.87% ... 989415 11.50% 0.33% 8
44 45 Delaware 1008350 0.92% ... 897934 12.30% 0.30% 517
45 46 South Dakota 901165 0.81% ... 814180 10.68% 0.27% 12
46 47 North Dakota 800394 1.35% ... 672591 19.00% 0.24% 12
47 48 Alaska 738023 0.31% ... 710231 3.91% 0.22% 1
48 49 Vermont 646545 0.27% ... 625741 3.32% 0.19% 70
49 50 Wyoming 579495 0.23% ... 563626 2.82% 0.17% 6
[50 rows x 9 columns]
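As a side note, the fixed time.sleep(8) can be replaced with an explicit wait so the script continues as soon as the table has actually rendered. A sketch, reusing the driver from the snippet above and assuming a populated table row is a good signal that the page is ready:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 30 seconds for at least one table row instead of sleeping a fixed 8 seconds.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
)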
The table is rendered dynamically from JSON that is placed at the end of the source code, so you do not need selenium. Simply extract the tag and load the JSON, which also includes all the additional information used elsewhere on the page:
soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
Example
import requests, json
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
pd.DataFrame(
json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
)
Because there is also additional information that is used for the map, simply choose the columns you need by header; a selection sketch follows the table below.
Output:
    fips         state  densityMi   pop2022   pop2021   pop2020   pop2019   pop2010  growthRate  growth  growthSince2010    area     fill          Name  rank
0      6    California    256.742  39995077  39766650  39538223  39309799  37253956  0.00574419  228427        0.0735793  155779  #084594    California     1
1     48         Texas    114.632  29945493  29545499  29145505  28745507  25145561   0.0135382  399994         0.190886  261232  #084594         Texas     2
2     12       Florida    411.852  22085563  21811875  21538187  21264502  18801310   0.0125477  273688         0.174682   53625  #084594       Florida     3
3     36      New York    432.158  20365879  20283564  20201249  20118937  19378102  0.00405821   82315        0.0509739   47126  #084594      New York     4
4     42  Pennsylvania    291.951  13062764  13032732  13002700  12972667  12702379  0.00230435   30032        0.0283715   44743  #2171b5  Pennsylvania     5
..   ...           ...        ...       ...       ...       ...       ...       ...         ...     ...              ...     ...      ...           ...   ...
45    46  South Dakota     11.887    901165    893916    886667    879421    814180  0.00810926    7249         0.106838   75811  #c6dbef  South Dakota    46
46    38  North Dakota    11.5997    800394    789744    779094    768441    672591   0.0134854   10650         0.190016   69001  #c6dbef  North Dakota    47
47     2        Alaska    1.29332    738023    735707    733391    731075    710231  0.00314799    2316        0.0391309  570641  #c6dbef        Alaska    48
48    50       Vermont     70.147    646545    644811    643077    641347    625741  0.00268916    1734         0.033247    9217  #c6dbef       Vermont    49
49    56       Wyoming    5.96845    579495    578173    576851    575524    563626  0.00228651    1322        0.0281552   97093  #c6dbef       Wyoming    50
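For example, to keep just the columns that correspond to the visible population table and drop the map-only fields such as fill, you can select by header. A sketch; the column names are the ones listed above:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
df = pd.DataFrame(
    json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
)

# 'fill' is only the map shading colour; keep the columns shown in the table instead.
df = df[['rank', 'state', 'densityMi', 'pop2022', 'pop2010', 'growthRate', 'growthSince2010']]
print(df.head())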
I am new to pulling data with Python. I want to build an Excel file by pulling tables from a website.
The website url: "https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml"
On this webpage the tables are split across separate pages for each hour, because one hour contains around 500 rows. I want to pull all of the data for every hour.
But my problem is that I keep pulling the same table even when the page changes.
I am using the beautifulsoup, pandas and selenium libraries. Here is my code to explain myself.
import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"]
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1)
tomorrow = str(tomorrow)
words = tarih.split('.')
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")
def tablo_cek():
    tablo = source.find_all("table")  # table(s) on the page
    dfs = pd.read_html(str(tablo))  # read the tables into dataframes
    dfs.append(dfs)  # append the newly pulled table
    print(dfs)
    return tablo

if tomorrow == yeni_tarih:
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/Users/tugba.ozkan/AppData/Local/SeleniumBasic/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")
    num = 0
    while num < 24:
        user.send_keys(num)  # send the hour value
        driver.find_element_by_id('j_idt206:goster').click()  # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")  # next-page button for that hour
        nextpage.click()  # go to the next page
        user = driver.find_element_by_name("j_idt206:txt1")  # re-locate the hour input
        time.sleep(1)
        tablo_cek()
        num = num + 1  # increment the hour
        user.clear()  # reset the hour
else:
    print("Güncelleme gelmedi")  # no update published yet
In this situation:
nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")  # next-page button for that hour
nextpage.click()
when Python clicks the button to go to the next page, the next page is displayed and the next table should then be pulled, but it doesn't work.
In the output I see the same values appended again and again, like this:
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.0101 19.15990
1 1 24.9741 19.16390
2 2 24.9741 19.18510
3 85 24.9741 19.18512
4 86 24.9736 19.20762
5 99 24.9736 19.20763
6 100 24.6197 19.20763
7 101 24.5697 19.20763
8 300 24.5697 19.20768
9 301 24.5697 19.20768
10 363 24.5697 19.20770
11 364 24.5497 19.20770
12 400 24.5497 19.20771
13 401 24.5297 19.20771
14 498 24.5297 19.20773
15 499 24.5297 19.36473
16 500 24.5297 19.36473
17 501 24.4097 19.36473
18 563 24.4097 19.36475
19 564 24.3897 19.36475
20 999 24.3897 19.36487
21 1000 24.3097 19.36487
22 1001 24.1897 19.36487
23 1449 24.1897 19.36499, [...]]
I will also offer up another solution, since you can pull that data directly with requests. It also gives you the option of how many rows to pull per page (and you can iterate through each page; see the sketch at the end of this answer); however, if you set that limit high enough, you can get it all in one request. There are about 400+ rows, so I set the limit to 1000, and then you only need page 0:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
page = '0'
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'j_idt206:dt',
    'javax.faces.partial.execute': 'j_idt206:dt',
    'javax.faces.partial.render': 'j_idt206:dt',
    'j_idt206:dt': 'j_idt206:dt',
    'j_idt206:dt_pagination': 'true',
    'j_idt206:dt_first': page,
    'j_idt206:dt_rows': '1000',
    'j_idt206:dt_skipChildren': 'true',
    'j_idt206:dt_encodeFeature': 'true',
    'j_idt206': 'j_idt206',
    'j_idt206:date1_input': '04.02.2021',
    'j_idt206:txt1': '0',
    'j_idt206:dt_rppDD': '1000'
}

rows = []
columns = ['Fiyat (TL/MWh)', 'Talep (MWh)', 'Arz (MWh)', 'hour']
hours = list(range(0, 24))
for hour in hours:
    payload.update({'j_idt206:txt1': str(hour)})
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text.replace('![CDATA[', ''), 'lxml')
    trs = soup.find_all('tr')
    for row in trs:
        data = row.find_all('td')
        data = [x.text for x in data] + [str(hour)]
        rows.append(data)

df = pd.DataFrame(rows, columns=columns)
Output:
print(df)
Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0,00 25.113,70 17.708,10
1 0,01 25.077,69 17.712,10
2 0,02 25.077,67 17.723,10
3 0,85 25.076,57 17.723,12
4 0,86 25.076,05 17.746,12
.. ... ... ...
448 571,01 19.317,10 29.529,60
449 571,80 19.316,86 29.529,60
450 571,90 19.316,83 29.529,70
451 571,99 19.316,80 29.529,70
452 572,00 19.316,80 29.540,70
[453 rows x 3 columns]
Finding this just takes a little investigative work. If you go to Dev Tools -> Network -> XHR, check whether the data is embedded somewhere in those requests. If you find it there, go to the Headers tab and you can get the url and the parameters at the bottom.
In most cases you'll see the data returned in a nice json format. That's not the case here: it comes back as xml, so it needs a tad more work to pull out the tags and such, but it's not impossible.
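And if you would rather keep the page size small and iterate, the same request can be repeated while stepping 'j_idt206:dt_first' by the page size. A rough sketch reusing url, headers, payload and rows from the code above; it assumes an empty page means there is nothing left for the hour currently set in payload['j_idt206:txt1']:
# Page through the rows 100 at a time instead of requesting 1000 at once.
payload['j_idt206:dt_rows'] = '100'
payload['j_idt206:dt_rppDD'] = '100'

first = 0
while True:
    payload['j_idt206:dt_first'] = str(first)  # offset of the first row on this page
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text.replace('![CDATA[', ''), 'lxml')
    trs = soup.find_all('tr')
    if not trs:  # no rows came back, so we are past the last page
        break
    for row in trs:
        rows.append([x.text for x in row.find_all('td')])
    first += 100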
That's because you pull the initial html only once, here: source = BeautifulSoup(r.content,"lxml"), and then keep parsing that same content.
You need to pull the html for each page that you go to. It's just a matter of adding 1 line. I commented where I added it:
import requests
r = requests.get('https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml')
from bs4 import BeautifulSoup
source = BeautifulSoup(r.content,"lxml")
metin =source.title.get_text()
source.find("input",attrs={"id":"j_idt206:txt1"})
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
tarih = source.find("input",attrs={"id":"j_idt206:date1_input"})["value"]
import datetime
import time
x = datetime.datetime.now()
today = datetime.date.today()
# print(today)
tomorrow = today + datetime.timedelta(days = 1)
tomorrow = str(tomorrow)
words = tarih.split('.')
yeni_tarih = '.'.join(reversed(words))
yeni_tarih =yeni_tarih.replace(".","-")
def tablo_cek():
    source = BeautifulSoup(driver.page_source, "lxml")  # <-- get the current html
    tablo = source.find_all("table")  # table(s) on the page
    dfs = pd.read_html(str(tablo))  # read the tables into dataframes
    dfs.append(dfs)  # append the newly pulled table
    print(dfs)
    return tablo

if tomorrow == yeni_tarih:
    print(yeni_tarih == tomorrow)
    driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    driver.get("https://seffaflik.epias.com.tr/transparency/piyasalar/gop/arz-talep.xhtml")
    time.sleep(1)
    driver.find_element_by_xpath("//select/option[@value='96']").click()
    time.sleep(1)
    user = driver.find_element_by_name("j_idt206:txt1")
    nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")
    num = 0
    tablo_cek()  # <-- need to get that data before moving to next page
    while num < 24:
        user.send_keys(num)  # send the hour value
        driver.find_element_by_id('j_idt206:goster').click()  # apply the hour
        nextpage = driver.find_element_by_xpath("//a/span[@class='ui-icon ui-icon-seek-next']")  # next-page button for that hour
        nextpage.click()  # go to the next page
        user = driver.find_element_by_name("j_idt206:txt1")  # re-locate the hour input
        time.sleep(1)
        tablo_cek()
        num = num + 1  # increment the hour
        user.clear()  # reset the hour
else:
    print("Güncelleme gelmedi")  # no update published yet
Output:
True
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 0 25.11370 17.70810
1 1 25.07769 17.71210
2 2 25.07767 17.72310
3 85 25.07657 17.72312
4 86 25.07605 17.74612
.. ... ... ...
91 10000 23.97000 17.97907
92 10001 23.91500 17.97907
93 10014 23.91500 17.97907
94 10015 23.91500 17.97907
95 10100 23.91499 17.97909
[96 rows x 3 columns], [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 10101 23.91499 18.04009
1 10440 23.91497 18.04015
2 10999 23.91493 18.04025
3 11000 23.89993 18.04025
4 11733 23.89988 18.04039
.. ... ... ...
91 23999 23.55087 19.40180
92 24000 23.55087 19.40200
93 24001 23.53867 19.40200
94 24221 23.53863 19.40200
95 24222 23.53863 19.40200
[96 rows x 3 columns], [...]]
[ Fiyat (TL/MWh) Talep (MWh) Arz (MWh)
0 24360 21.33871 19.8112
1 24499 21.33868 19.8112
2 24500 21.33868 19.8112
3 24574 21.33867 19.8112
4 24575 21.33867 19.8112
.. ... ... ...
91 29864 21.18720 20.3708
92 29899 21.18720 20.3708
93 29900 21.18720 20.3808
94 29999 21.18720 20.3808
95 30000 21.18530 20.3811
[96 rows x 3 columns], [...]]
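If the end goal is one combined dataframe rather than printing each page, tablo_cek can also stash each parsed table in a list and the pages can be concatenated after the loop. A small sketch of that variation (same idea as above, reusing the driver from the loop, not tested against the site):
import pandas as pd
from bs4 import BeautifulSoup

all_pages = []

def tablo_cek():
    source = BeautifulSoup(driver.page_source, "lxml")  # current page html
    dfs = pd.read_html(str(source.find_all("table")))   # tables on this page as dataframes
    all_pages.extend(dfs)                                # keep them for later
    return dfs

# ... run the same clicking loop as above, then combine everything:
combined = pd.concat(all_pages, ignore_index=True)
print(combined)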
It is the first time I am using pandas and I do not really know how to approach my problem.
In fact I have 2 data frames:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
and
[78793 rows x 2 columns]
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column corresponds to the cluster ID and inside those clusters there are several sequence IDs.
What I need to do is first to split all my clusters (in R it would be like: liste = split(x = data$V2, f = data$V1)).
And then, create a function which displays the most similar pair of sequences within each cluster.
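For reference, the pandas counterpart of that R split would be a groupby on the cluster column, something like this sketch (using the column names shown above):
# Dict mapping each cluster_name to the list of its seq_names,
# analogous to R's split(x = data$V2, f = data$V1).
liste = cluster.groupby('cluster_name')['seq_names'].apply(list).to_dict()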
Here is an example:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the third column contains the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account sequences paired with themselves.
If someone could give me some clues it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in blast with sequences from two different clusters. In other words: in this solution the cluster ID of a pairing is determined from only one of the two sequence IDs.
Including cluster information and pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings whose seqids share the same suffix, the most readable way is to add columns holding these values:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec-values can be filtered the same way like it was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here, apart from the assumption mentioned above, is that if there are several exactly equal max values, only one of them is taken into account.
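If keeping all tied maxima matters, one way around that is to compare each row against its cluster's maximum instead of using idxmax; a sketch on the same data frame:
# Keep every pairing whose pident equals the maximum pident of its cluster,
# so exact ties are all retained instead of picking a single row.
max_per_cluster = data.groupby('cluster_name')['pident'].transform('max')
result_with_ties = data[data['pident'] == max_per_cluster]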
Note: if you don't want the spec columns to be of type string, you can easily turn them into integers on the fly:
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
This filters out any rows with a 100% match, then merges the dataframes first on sseqid and then on qseqid, combining both into results_df. Let me know if this works. You can then order by cluster name (see the one-liner after the code).
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names',right_on='sseqid')
results_df = results_df.append(cluster.merge(blast, left_on='seq_names',right_on='qseqid'))
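For example, to order the combined result by cluster:
results_df = results_df.sort_values('cluster_name').reset_index(drop=True)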
I have some data in a csv file as shown below (only partial data is shown here).
SourceID local_date local_time Vge BSs PC hour Type
7208 8/01/2015 11:00:19 15.4 87 +BC_MSG 11 MAIN
11060 8/01/2015 11:01:56 14.9 67 +AB_MSG 11 MAIN
3737 8/01/2015 11:02:09 15.4 88 +AB_MSG 11 MAIN
9683 8/01/2015 11:07:19 14.9 69 +AB_MSG 11 MAIN
9276 8/01/2015 11:07:52 15.4 88 +AB_MSG 11 MAIN
7754 8/01/2015 11:09:26 14.7 62 +AF_MSG 11 MAIN
11111 8/01/2015 11:10:06 15.2 80 +AF_MSG 11 MAIN
9276 8/01/2015 11:10:52 15.4 88 +AB_MSG 11 MAIN
11111 8/01/2015 11:12:56 15.2 80 +AB_MSG 11 MAIN
6148 8/01/2015 11:15:29 15 70 +AB_MSG 11 MAIN
11111 8/01/2015 11:15:56 15.2 80 +AB_MSG 11 MAIN
9866 8/01/2015 11:16:28 4.102 80 +SUB_MSG 11 SUB
9866 8/01/2015 11:16:38 15.1 78 +AH_MSG 11 MAIN
9866 8/01/2015 11:16:38 4.086 78 +SUB_MSG 11 SUB
20729 8/01/2015 11:23:21 11.6 82 +AB_MSG 11 MAIN
9276 8/01/2015 11:25:52 15.4 88 +AB_MSG 11 MAIN
11111 8/01/2015 11:34:16 15.2 80 +AF_MSG 11 MAIN
20190 8/01/2015 11:36:09 11.2 55 +AF_MSG 11 MAIN
7208 8/01/2015 11:37:09 15.3 85 +AB_MSG 11 MAIN
7208 8/01/2015 11:38:39 15.3 86 +AB_MSG 11 MAIN
7754 8/01/2015 11:39:16 14.7 61 +AB_MSG 11 MAIN
8968 8/01/2015 11:39:39 15.5 91 +AB_MSG 11 MAIN
3737 8/01/2015 11:41:09 15.4 88 +AB_MSG 11 MAIN
9683 8/01/2015 11:41:39 14.9 69 +AF_MSG 11 MAIN
20729 8/01/2015 11:44:36 11.6 81 +AB_MSG 11 MAIN
9704 8/01/2015 11:45:20 14.9 68 +AF_MSG 11 MAIN
11111 8/01/2015 11:46:06 4.111 87 +SUB_MSG 11 PAN
I have the following python program that uses pandas to process this input
import sys
import csv
import operator
import os
from glob import glob
import fileinput
from relativeDates import *
import datetime
import math
import pprint
import numpy as np
import pandas as pd
from io import StringIO
COLLECTION = 'NEW'
BATTERY = r'C:\MyFolder\Analysis\\{}'.format(COLLECTION)
INPUT_FILE = BATTERY + r'\in.csv'
OUTPUT_FILE = BATTERY + r'\out.csv'
with open(INPUT_FILE) as fin:
    df = pd.read_csv(INPUT_FILE,
                     usecols=["SourceID", "local_date", "local_time", "Vge", 'BSs', 'PC'],
                     header=0)

#df.set_index(['SourceID','local_date','local_time','Vge','BSs','PC'],inplace=True)
df.drop_duplicates(inplace=True)
#df.reset_index(inplace=True)

hour_list = []
gb = df['local_time'].groupby(df['local_date'])
for i in list(gb)[0][1]:
    hour_list.append(i.split(':')[0])
for j in list(gb)[1][1]:
    hour_list.append(str(int(j.split(':')[0]) + 24))

df['hour'] = pd.Series(hour_list, index=df.index)
df.set_index(['SourceID','local_date','local_time','Vge'], inplace=True)

#gb = df['hour'].groupby(df['PC'])
#print(list(gb))
gb = df['PC']

class_list = []
for msg in df['PC']:
    if 'SUB' in msg:
        class_list.append('SUB')
    else:
        class_list.append('MAIN')
df['Type'] = pd.Series(class_list, index=df.index)
print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean))
gb = df['Type'].groupby(df['hour'])
#print(list(gb))
#print(list(df.groupby(['hour','Type']).count()))
df.to_csv(OUTPUT_FILE)
I want to get an average of the BSs field over time. This is what I am attempting to do in print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean)) above.
However, a few things need to be considered.
The Vge values can be classified into 2 types based on the Type field.
The number of Vge values we get can vary widely from hour to hour.
The whole data set covers 24 hours.
The Vge values can be received from a number of SourceIDs.
The Vge values can vary a little between SourceIDs but should be somewhat similar during the same time interval (same hour).
In such a situation, the simple mean calculated by print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean)) won't be sufficient, because the number of samples differs between time periods (hours).
What function should be used in such a situation?
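One possibility, sketched purely as an illustration and assuming the intent is that no single SourceID should dominate an hour, is a two-step mean: first average BSs per SourceID within each hour and Type, then average those per-source means:
# Step 1: one mean per (hour, Type, SourceID), so a chatty source counts once.
per_source = df.groupby(['hour', 'Type', 'SourceID'])['BSs'].mean()

# Step 2: average those per-source means to get one value per (hour, Type).
hourly = per_source.groupby(level=['hour', 'Type']).mean()
print(hourly)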
I am trying to read a JSON file using pandas:
import pandas as pd
df = pd.read_json('https://data.gov.in/node/305681/datastore/export/json')
I get ValueError: arrays must all be same length
Some other JSON pages show this error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
How do I somehow read the values? I am not particular about data validity.
Looking at the json it is valid, but it's nested with data and fields:
import json
import requests
In [11]: d = json.loads(requests.get('https://data.gov.in/node/305681/datastore/export/json').text)
In [12]: list(d.keys())
Out[12]: ['data', 'fields']
You want the data as the content, and fields as the column names:
In [13]: pd.DataFrame(d["data"], columns=[x["label"] for x in d["fields"]])
Out[13]:
S. No. States/UTs 2008-09 2009-10 2010-11 2011-12 2012-13
0 1 Andhra Pradesh 183446.36 193958.45 201277.09 212103.27 222973.83
1 2 Arunachal Pradesh 360.5 380.15 407.42 419 438.69
2 3 Assam 4658.93 4671.22 4707.31 4705 4709.58
3 4 Bihar 10740.43 11001.77 7446.08 7552 8371.86
4 5 Chhattisgarh 9737.92 10520.01 12454.34 12984.44 13704.06
5 6 Goa 148.61 148 149 149.45 457.87
6 7 Gujarat 12675.35 12761.98 13269.23 14269.19 14558.39
7 8 Haryana 38149.81 38453.06 39644.17 41141.91 42342.66
8 9 Himachal Pradesh 977.3 1000.26 1020.62 1049.66 1069.39
9 10 Jammu and Kashmir 7208.26 7242.01 7725.19 6519.8 6715.41
10 11 Jharkhand 3994.77 3924.73 4153.16 4313.22 4238.95
11 12 Karnataka 23687.61 29094.3 30674.18 34698.77 36773.33
12 13 Kerala 15094.54 16329.52 16856.02 17048.89 22375.28
13 14 Madhya Pradesh 6712.6 7075.48 7577.23 7971.53 8710.78
14 15 Maharashtra 35502.28 38640.12 42245.1 43860.99 45661.07
15 16 Manipur 1105.25 1119 1137.05 1149.17 1162.19
16 17 Meghalaya 994.52 999.47 1010.77 1021.14 1028.18
17 18 Mizoram 411.14 370.92 387.32 349.33 352.02
18 19 Nagaland 831.92 833.5 802.03 703.65 617.98
19 20 Odisha 19940.15 23193.01 23570.78 23006.87 23229.84
20 21 Punjab 36789.7 32828.13 35449.01 36030 37911.01
21 22 Rajasthan 6449.17 6713.38 6696.92 9605.43 10334.9
22 23 Sikkim 136.51 136.07 139.83 146.24 146
23 24 Tamil Nadu 88097.59 108475.73 115137.14 118518.45 119333.55
24 25 Tripura 1388.41 1442.39 1569.45 1650 1565.17
25 26 Uttar Pradesh 10139.8 10596.17 10990.72 16075.42 17073.67
26 27 Uttarakhand 1961.81 2535.77 2613.81 2711.96 3079.14
27 28 West Bengal 33055.7 36977.96 39939.32 43432.71 47114.91
28 29 Andaman and Nicobar Islands 617.58 657.44 671.78 780 741.32
29 30 Chandigarh 272.88 248.53 180.06 180.56 170.27
30 31 Dadra and Nagar Haveli 70.66 70.71 70.28 73 73
31 32 Daman and Diu 18.83 18.9 18.81 19.67 20
32 33 Delhi 1.17 1.17 1.17 1.23 NA
33 34 Lakshadweep 134.64 138.22 137.98 139.86 139.99
34 35 Puducherry 111.69 112.84 113.53 116 112.89
See also json_normalize for more complex json DataFrame extraction.
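For instance, json_normalize (pd.json_normalize in pandas 1.0+, previously pandas.io.json.json_normalize) flattens nested records into dotted column names; a toy example with made-up data, not the data.gov.in payload:
import pandas as pd

records = [
    {"state": "Goa", "population": {"2008-09": 148.61, "2012-13": 457.87}},
    {"state": "Delhi", "population": {"2008-09": 1.17, "2012-13": None}},
]

# Nested dicts become columns like 'population.2008-09'.
print(pd.json_normalize(records))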
The following listed both the keys and values for me:
import json
import requests
import pandas as pd
df = json.loads(requests.get('https://api.github.com/repos/akkhil2012/MachineLearning').text)
data = pd.DataFrame.from_dict(df, orient='index')
print(data)