Parse URLs with Python

So I'm using this code to get a list of URLs; the thing is, I need one column with the URLs and another with the tags or the text:
import requests
from bs4 import BeautifulSoup

getpage = requests.get('https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published')
getpage_soup = BeautifulSoup(getpage.text, 'html.parser')
all_links = getpage_soup.findAll('a')
for link in all_links:
    print(link)
What I'm expecting is a dataframe similar to this:
pd.DataFrame({'link': ['https://drive.google.com/file/d/1t1hLPvUkfCde1wglfjAh--r8NpLONbRf/view?usp=sharing'], 'tag': ['Estatal 2020']})

Using the first example of what you need, this may help you:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published"
data = []
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find('div', {'class': 'article-body'})  # get the "article-body" div
for ul in div.findAll('ul'):  # get all 'ul' tags inside "article-body"
    for li in ul.findAll('li'):  # get all 'li' inside each 'ul'
        for link in li.findAll('a', href=True):  # get each 'a' inside the 'li'
            data.append([link['href'], link.text])  # link['href'] = url | link.text = "Estatal 2020"
dataframe = pd.DataFrame(data, columns=['link', 'tag'])
print(dataframe)
Output:
link tag
0 https://drive.google.com/file/d/1t1hLPvUkfCde1... Estatal 2020
1 https://drive.google.com/open?id=17MnLmvY_YW5Z... Estatal 2019
2 https://drive.google.com/open?id=11DcfF4Pvp_21... Estatal 2018
3 https://drive.google.com/open?id=1Y0aqq6w2EQij... Estatal 2017
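As a side note, the three nested loops can be collapsed into a single CSS selector (a sketch, assuming the same page structure):
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# 'div.article-body ul li a[href]' matches every anchor with an href
# inside an li, inside a ul, inside the article-body div
data = [[a['href'], a.text] for a in soup.select('div.article-body ul li a[href]')]
dataframe = pd.DataFrame(data, columns=['link', 'tag'])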

You could try this:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

getpage = requests.get('https://www.gob.mx/sesnsp/acciones-y-programas/incidencia-delictiva-del-fuero-comun-nueva-metodologia?state=published')
getpage_soup = BeautifulSoup(getpage.text, 'html.parser')
# get all the urls whose protocol is http or https
all_links = getpage_soup.findAll('a', attrs={'href': re.compile("(^http://)|(^https://)")})
data = []
for link in all_links:
    if link.text.strip() == '':  # if the link has no text, add its id instead
        data.append([link['href'], link.get('id')])
    else:
        data.append([link['href'], link.text.strip()])  # add the text without leading/trailing whitespace
df = pd.DataFrame(data, columns=['link', 'tag'])  # create the dataframe
print(df)
Output:
link tag
0 https://coronavirus.gob.mx/ Información importante Coronavirus COVID-19
1 https://www.gob.mx/busqueda?utf8=✓ botbusca
2 https://www.gob.mx/sesnsp/acciones-y-programas... Transparencia
3 https://drive.google.com/file/d/1t1hLPvUkfCde1... Estatal 2020
4 https://drive.google.com/open?id=17MnLmvY_YW5Z... Estatal 2019
5 https://drive.google.com/open?id=11DcfF4Pvp_21... Estatal 2018
6 https://drive.google.com/open?id=1Y0aqq6w2EQij... Estatal 2017
7 https://drive.google.com/open?id=1mgFsF3rdoYLE... Estatal 2016
8 https://drive.google.com/open?id=1RQhk58-fHNPr... Estatal 2015
9 https://drive.google.com/file/d/1WIzrjJTF24DCX... Estatal 2020
10 https://drive.google.com/open?id=1QtjDM7pczeST... Estatal 2019
11 https://drive.google.com/open?id=15l9hl4eUmFCM... Estatal 2018
12 https://drive.google.com/open?id=1FO4W0HK8cdPk... Estatal 2017
13 https://drive.google.com/open?id=1tDEjJ1XLdFP8... Estatal 2016
14 https://drive.google.com/open?id=1lCeFrMi_D-Gr... Estatal 2015
15 https://drive.google.com/file/d/1q8AdhfxpLdF_l... Estatal 2015 - 2020
16 https://drive.google.com/file/d/1jopZOChRppi6Q... Mayo 2020
17 https://drive.google.com/open?id=1CvHXHC48SYWT... Febrero 2020
18 https://drive.google.com/open?id=1QxUe0HwLNNZH... Enero 2020
19 https://drive.google.com/open?id=1KZzHGdTlH5ya... Diciembre 2019
20 https://drive.google.com/open?id=119VQ5-1JPnWZ... Noviembre 2019
21 https://drive.google.com/open?id=1CbNV3sTkSn3t... Octubre 2019
22 https://drive.google.com/open?id=1gpMM2pi6Ta-r... Septiembre 2019
23 https://drive.google.com/open?id=1dHUhpr-DbOPx... Agosto 2019
24 https://drive.google.com/open?id=18CQlwY07tTaa... Julio 2019
25 https://drive.google.com/open?id=1EnhF4IOFxqLr... Junio 2019
26 https://drive.google.com/open?id=1wrTEwP5Q3xwZ... Mayo 2019
27 https://drive.google.com/open?id=1ZuY20S-5Gi8l... Abril 2019
28 https://drive.google.com/open?id=1P2Xvs7kLLclg... Marzo 2019
29 https://drive.google.com/open?id=16FWEKbbJ83KL... Febrero 2019
30 https://drive.google.com/open?id=1mIw1XKJBY8ZV... Enero 2019
31 https://drive.google.com/open?id=1iTGBC1Ge4UWP... Diciembre 2018
32 https://drive.google.com/open?id=1Kmtir0rhQLf7... Noviembre 2018
33 https://drive.google.com/open?id=1r7SHNfKVXGfe... Octubre 2018
34 https://drive.google.com/open?id=1IKpGJbJuNQKW... Septiembre 2018
35 https://drive.google.com/open?id=1spqdNT0T0pen... Agosto 2018
36 https://drive.google.com/open?id=1k07ZSk2c4irk... Julio 2018
37 https://drive.google.com/open?id=1HX4SlChjRbMm... Junio 2018
38 https://drive.google.com/open?id=1ErSyO9-rfHi3... Mayo 2018
39 https://drive.google.com/open?id=1cK5lR33-mA6-... Abril 2018
40 https://drive.google.com/open?id=1MaqJaSfq2KxB... Marzo 2018
41 https://drive.google.com/open?id=1GaoDPWud-2Iy... Febrero 2018
42 https://drive.google.com/open?id=1OXITYyRrUBwj... Enero 2018
43 https://drive.google.com/file/d/1KwjGdNYez72_z... Estatal 2015 - 2020
44 https://drive.google.com/file/d/14fDk5sBry1DOo... Municipal 2015 - 2020
45 https://www.gob.mx/sesnsp/acciones-y-programas... Regresar al menú principal de Incidencia Delic...
46 https://www.facebook.com/sharer/sharer.php?u=h... Compartir
47 http://www.participa.gob.mx Participa
48 https://datos.gob.mx/ Datos
49 https://www.gob.mx/publicaciones Publicaciones Oficiales
50 https://www.infomex.org.mx/gobiernofederal/hom... Sistema Infomex
51 http://www.inai.org.mx INAI
52 http://www.ordenjuridico.gob.mx Marco Jurídico
53 https://www.facebook.com/gobmexico Facebook
54 https://twitter.com/GobiernoMX Twitter
And if you only want the ones that start with "Estatal", you can add this to the code above:
import numpy as np

mask = np.where(df.tag.str.startswith('Estatal'), True, False)
print(df[mask])
Output:
link tag
3 https://drive.google.com/file/d/1t1hLPvUkfCde1... Estatal 2020
4 https://drive.google.com/open?id=17MnLmvY_YW5Z... Estatal 2019
5 https://drive.google.com/open?id=11DcfF4Pvp_21... Estatal 2018
6 https://drive.google.com/open?id=1Y0aqq6w2EQij... Estatal 2017
7 https://drive.google.com/open?id=1mgFsF3rdoYLE... Estatal 2016
8 https://drive.google.com/open?id=1RQhk58-fHNPr... Estatal 2015
9 https://drive.google.com/file/d/1WIzrjJTF24DCX... Estatal 2020
10 https://drive.google.com/open?id=1QtjDM7pczeST... Estatal 2019
11 https://drive.google.com/open?id=15l9hl4eUmFCM... Estatal 2018
12 https://drive.google.com/open?id=1FO4W0HK8cdPk... Estatal 2017
13 https://drive.google.com/open?id=1tDEjJ1XLdFP8... Estatal 2016
14 https://drive.google.com/open?id=1lCeFrMi_D-Gr... Estatal 2015
15 https://drive.google.com/file/d/1q8AdhfxpLdF_l... Estatal 2015 - 2020
43 https://drive.google.com/file/d/1KwjGdNYez72_z... Estatal 2015 - 2020
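A small aside: the np.where mask is equivalent to passing the boolean Series to the dataframe directly, so the numpy import can be dropped; the na=False flag of str.startswith treats rows whose tag ended up as None as non-matches:
print(df[df.tag.str.startswith('Estatal', na=False)])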

Related

pd.read_html() not reading date

When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back as empty. Is there a workaround for this? I've tried using beautiful soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select the correct table, here the first one:
tbl = soup.select("table")[0]
# remove the "(aged XX)" part from the third column:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]
df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF
Try setting the parse_dates parameter to True in the read_html call.
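For completeness, a minimal sketch of that suggestion (untested; it may not strip the "(aged XX)" suffix, so the BeautifulSoup clean-up above can still be necessary):
import pandas as pd

url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
# parse_dates=True asks read_html to attempt date conversion on date-like columns
tables = pd.read_html(url, parse_dates=True)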

Python Pandas multiindex

I'm trying to create a table like in this example:
Example_picture
My code:
import pandas as pd

data = list(range(39))  # mockup for 39 values
columns = pd.MultiIndex.from_product(
    [['1', '2', '6'], [str(year) for year in range(2007, 2020)]],
    names=['Factor', 'Year'])
df = pd.DataFrame(data, index=['World'], columns=columns)
print(df)
But I get this error:
Shape of passed values is (39, 1), indices imply (1, 39)
What did I do wrong?
You need to wrap the data in a list to force the DataFrame constructor to interpret the list as a row:
data = list(range(39))
columns = pd.MultiIndex.from_product(
    [['1', '2', '6'], [str(year) for year in range(2007, 2020)]],
    names=['Factor', 'Year'])
df = pd.DataFrame([data], index=['World'], columns=columns)
Output:
Factor 1 2 6
Year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
World 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
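Equivalently (a sketch), the data can be reshaped into an explicit 1x39 array so its shape matches the single row and 39 columns implied by the index and columns:
import numpy as np
import pandas as pd

data = np.arange(39).reshape(1, -1)  # 1 row, 39 columns
columns = pd.MultiIndex.from_product(
    [['1', '2', '6'], [str(year) for year in range(2007, 2020)]],
    names=['Factor', 'Year'])
df = pd.DataFrame(data, index=['World'], columns=columns)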

Web scrape Sports-Reference with Python Beautiful Soup

I am trying to scrape data from Nick Saban's sports reference page so that I can pull in the list of All-Americans he coached and then his Bowl-Win Loss Percentage.
I am new to Python so this has been a massive struggle. When I inspect the page I see div id="leaderboard_all-americans" class="data_grid_box".
When I run the code below I am getting the Coaching Record table, which is the first table on the site. I tried using different indexes thinking it may give me a different result but that did not work either.
Ultimately, I want to get the All-American data and turn it into a data frame.
import requests
import bs4
import pandas as pd
saban2 = requests.get("https://www.sports-reference.com/cfb/coaches/nick-saban-1.html")
saban_soup2 = bs4.BeautifulSoup(saban2.text,"lxml")
saban_select = saban_soup2.select('div',{"id":"leaderboard_all-americans"})
saban_df2 = pd.read_html(str(saban_select))
sports-reference.com stores the HTML tables as comments in the basic request response. You have to first grab the commented block with the All-Americans and bowl results, and then parse that result:
import bs4
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd

d = soup(requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text, 'html.parser')
# grab the HTML comment that contains the All-Americans leaderboard
block = [i for i in d.find_all(string=lambda text: isinstance(text, bs4.Comment))
         if 'id="leaderboard_all-americans"' in i][0]
b = soup(str(block), 'html.parser')  # re-parse the comment text as HTML
players = b.select('#leaderboard_all-americans table.no_columns tr')
# each row holds an <a> with the name, followed by the year(s) in parentheses
p_results = [{'name': i.td.a.text, 'year': i.td.contents[-1][2:-1]} for i in players]
all_americans = pd.DataFrame(p_results)
# the bowl win-loss leaderboard lives in the same commented block
bowl_win_loss = b.select_one('#leaderboard_win_loss_pct_post td.single').contents[-2]
print(all_americans)
print(bowl_win_loss)
Output:
all_americans
name year
0 Jonathan Allen 2016
1 Javier Arenas 2009
2 Mark Barron 2011
3 Antoine Caldwell 2008
4 Ha Ha Clinton-Dix 2013
5 Terrence Cody 2008-2009
6 Landon Collins 2014
7 Amari Cooper 2014
8 Landon Dickerson 2020
9 Minkah Fitzpatrick 2016-2017
10 Reuben Foster 2016
11 Najee Harris 2020
12 Derrick Henry 2015
13 Dont'a Hightower 2011
14 Mark Ingram 2009
15 Jerry Jeudy 2018
16 Mike Johnson 2009
17 Barrett Jones 2011-2012
18 Mac Jones 2020
19 Ryan Kelly 2015
20 Cyrus Kouandjio 2013
21 Chad Lavalais 2003
22 Alex Leatherwood 2020
23 Rolando McClain 2009
24 Demarcus Milliner 2012
25 C.J. Mosley 2012-2013
26 Reggie Ragland 2015
27 Josh Reed 2001
28 Trent Richardson 2011
29 A'Shawn Robinson 2015
30 Cam Robinson 2016
31 Andre Smith 2008
32 DeVonta Smith 2020
33 Marcus Spears 2004
34 Patrick Surtain II 2020
35 Tua Tagovailoa 2018
36 Deionte Thompson 2018
37 Chance Warmack 2012
38 Ben Wilkerson 2004
39 Jonah Williams 2018
40 Quinnen Williams 2018
bowl_win_loss:
' .63 (#23)'
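If you only need the tabular data rather than individual cells, a shorter untested variant is to hand the un-commented block straight to pandas, since read_html parses every table it finds in a string:
# one DataFrame per <table> found in the comment's HTML
tables = pd.read_html(str(block))
print(tables[0])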

Pandas cumulative sum without changing week order number

I have a dataframe which looks like the following:
df:
RY Week no Value
2020 14 3.95321
2020 15 3.56425
2020 16 0.07042
2020 17 6.45417
2020 18 0.00029
2020 19 0.27737
2020 20 4.12644
2020 21 0.32753
2020 22 0.47239
2020 23 0.28756
2020 24 1.83029
2020 25 0.75385
2020 26 2.08981
2020 27 2.05611
2020 28 1.00614
2020 29 0.02105
2020 30 0.58101
2020 31 3.49083
2020 32 8.29013
2020 33 8.99825
2020 34 2.66293
2020 35 0.16448
2020 36 2.26301
2020 37 1.09302
2020 38 1.66566
2020 39 1.47233
2020 40 6.42708
2020 41 2.67947
2020 42 6.79551
2020 43 4.45881
2020 44 1.87972
2020 45 0.76284
2020 46 1.8671
2020 47 2.07159
2020 48 2.87303
2020 49 7.66944
2020 50 1.20421
2020 51 9.04416
2020 52 2.2625
2020 1 1.17026
2020 2 14.22263
2020 3 1.36464
2020 4 2.64862
2020 5 8.69916
2020 6 4.51259
2020 7 2.83411
2020 8 3.64183
2020 9 4.77292
2020 10 1.64729
2020 11 1.6878
2020 12 2.24874
2020 13 0.32712
I created the Week no column from a date. In my scenario the regulatory year starts on 1st April and ends on 31st March of the next year, which is why Week no starts at 14 and ends at 13. Now I want to create another column that contains the cumulative sum of the Value column. I tried cumsum() with the following code:
df['Cummulative Value'] = df.groupby('RY')['Value'].apply(lambda x:x.cumsum())
The problem with the above code is that it starts calculating the cumulative sum from week no 1, not from week no 14 onwards. Is there any way to calculate the cumulative sum without disturbing the week order?
You can sort the values by RY and Week no before GroupBy.cumsum and then sort the index to restore the original row order:
# create a default index so the final sort_index restores the original row order
df = df.reset_index(drop=True)
df['Cummulative Value'] = df.sort_values(['RY','Week no']).groupby('RY')['Value'].cumsum().sort_index()
print (df)
RY Week no Value Cummulative Value
0 2020 14 3.95321 53.73092
1 2020 15 3.56425 57.29517
2 2020 16 0.07042 57.36559
3 2020 17 6.45417 63.81976
4 2020 18 0.00029 63.82005
5 2020 19 0.27737 64.09742
6 2020 20 4.12644 68.22386
7 2020 21 0.32753 68.55139
8 2020 22 0.47239 69.02378
9 2020 23 0.28756 69.31134
10 2020 24 1.83029 71.14163
11 2020 25 0.75385 71.89548
12 2020 26 2.08981 73.98529
13 2020 27 2.05611 76.04140
14 2020 28 1.00614 77.04754
15 2020 29 0.02105 77.06859
16 2020 30 0.58101 77.64960
17 2020 31 3.49083 81.14043
18 2020 32 8.29013 89.43056
19 2020 33 8.99825 98.42881
20 2020 34 2.66293 101.09174
21 2020 35 0.16448 101.25622
22 2020 36 2.26301 103.51923
23 2020 37 1.09302 104.61225
24 2020 38 1.66566 106.27791
25 2020 39 1.47233 107.75024
26 2020 40 6.42708 114.17732
27 2020 41 2.67947 116.85679
28 2020 42 6.79551 123.65230
29 2020 43 4.45881 128.11111
30 2020 44 1.87972 129.99083
31 2020 45 0.76284 130.75367
32 2020 46 1.86710 132.62077
33 2020 47 2.07159 134.69236
34 2020 48 2.87303 137.56539
35 2020 49 7.66944 145.23483
36 2020 50 1.20421 146.43904
37 2020 51 9.04416 155.48320
38 2020 52 2.26250 157.74570
39 2020 1 1.17026 1.17026
40 2020 2 14.22263 15.39289
41 2020 3 1.36464 16.75753
42 2020 4 2.64862 19.40615
43 2020 5 8.69916 28.10531
44 2020 6 4.51259 32.61790
45 2020 7 2.83411 35.45201
46 2020 8 3.64183 39.09384
47 2020 9 4.77292 43.86676
48 2020 10 1.64729 45.51405
49 2020 11 1.68780 47.20185
50 2020 12 2.24874 49.45059
51 2020 13 0.32712 49.77771
EDIT: After some discussion, the solution can be simplified to a plain GroupBy.cumsum:
df['Cummulative Value'] = df.groupby('RY')['Value'].cumsum()
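A tiny toy example (hypothetical numbers) of why this answers the question: GroupBy.cumsum accumulates in the order the rows appear, so with week 14 physically first within each RY the running total starts there:
import pandas as pd

toy = pd.DataFrame({'RY': [2020, 2020, 2020],
                    'Week no': [14, 15, 1],
                    'Value': [1.0, 2.0, 3.0]})
# accumulates in row order (weeks 14, 15, 1), not in week-number order
print(toy.groupby('RY')['Value'].cumsum())  # -> 1.0, 3.0, 6.0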

Python not recognizing multilevel index

I have a dataframe that currently looks like this:
country series year value
usa a 2010 21
usa b 2015 22
usa a 2017 23
usa b 2010 22
usa b 2017 23
aus a 2010 21
aus b 2015 22
aus a 2017 23
aus b 2010 22
aus b 2017 23
When I run this code, it reduces the duplication of the countries but not the series, as I expected it to.
pop2.set_index(['Country','Series'])
I want:
country series year value
usa a 2010 21
2017 23
b 2010 22
2015 22
2017 23
aus a 2010 21
2017 23
b 2010 22
2015 22
2017 23
Instead, it is returning:
country series year value
usa a 2010 21
b 2015 22
a 2017 23
b 2010 22
b 2017 23
aus a 2010 21
b 2015 22
a 2017 23
b 2010 22
b 2017 23
There must be an index label for each row to display in a dataframe. Therefore, what you need is another level of index; then you can show the index "grouping" as you wish.
Let's try this:
import numpy as np

df.set_index(['country', 'series', np.arange(df.shape[0])]).sort_index()
Output:
year value
country series
aus a 5 2010 21
7 2017 23
b 6 2015 22
8 2010 22
9 2017 23
usa a 0 2010 21
2 2017 23
b 1 2015 22
3 2010 22
4 2017 23
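As an aside, a sketch of an alternative (assuming the lowercase column names shown in the printed frame): pandas only blanks out an index label when it repeats on consecutive rows, so sorting the index gives the grouped display without the extra counter level, although countries then appear alphabetically:
# adjacent duplicate (country, series) labels are sparsified on display
print(df.set_index(['country', 'series']).sort_index())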
