I have a .csv that looks like the sample below. I was wondering what the best way would be to keep the first few columns (id, account_id, date, amount, payments) intact while creating a new column containing the name of the column in which each observation is marked with an 'X'.
The first 10 rows of the csv look like:
id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-
There used to be a DataFrame.lookup method for this, but it has been deprecated in favor of melt + loc[].
The idea is to use id_vars as the grouping, so all the other columns get melted into a single column along with their respective values. Then filter to the rows where that value is 'X', effectively dropping the rest.
import pandas as pd
df = pd.read_csv('test.txt')
df = df.melt(id_vars=['id','account_id','date','amount','payments'], var_name='x_col')
df = df.loc[df['value']=='X'].drop(columns='value')
print(df)
Output
id account_id date amount payments x_col
0 4959 2 1994-01-05 80952 3373 24_A
5 4973 67 1996-05-02 165960 6915 24_A
11 4961 19 1996-04-29 30276 2523 12_B
22 4962 25 1997-12-08 30276 2523 12_A
26 4986 97 1997-08-10 102876 8573 12_A
33 4967 37 1998-10-14 318480 5308 60_D
44 4968 38 1998-04-19 110736 2307 48_C
48 4989 105 1998-12-05 352704 7348 48_C
57 4988 103 1997-12-06 265320 7370 36_D
69 4990 110 1997-09-08 162576 4516 36_C
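If every row is guaranteed to contain exactly one 'X', a sketch of an alternative that skips the melt entirely (same file as above; note that idxmax returns the first matching column, so rows without any 'X' would need extra handling):
import pandas as pd
df = pd.read_csv('test.txt')
id_cols = ['id', 'account_id', 'date', 'amount', 'payments']
# Compare the flag columns against 'X' and take the name of the matching column per row
out = df[id_cols].assign(x_col=df.drop(columns=id_cols).eq('X').idxmax(axis=1))
print(out)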
I have a very specific issue to which I have not been able to find a solution.
Recently, I began a project in which I am monitoring about 100 ETFs and mutual funds based on specific data acquired from Morningstar. The current solution works great, but I later found out that I need more data from another "tab" within the website. Specifically, I am trying to get data from the first table on the following website: https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1
Right now, I have the code below for scraping data from a table on the tab "Indhold" of the website and exporting it to Excel. My question is therefore: how do I adjust the code to scrape data from another part of the website?
To briefly explain the code and reiterate: the code below scrapes data from another tab of the same website. The many, many IDs correspond to the pages for each mutual fund/ETF. The setup works very well, so I am hoping to simply adjust it (if that is possible) to extract the table from the link above. I have very limited knowledge of the topic, so any help is much, much appreciated.
import requests
import re
import pandas as pd
from openpyxl import load_workbook
auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'
# Create a Pandas Excel writer using openpyxl as the engine (keeping the existing workbook and its VBA).
path = r'/Users/karlemilthulstrup/Downloads/data2.xlsm'
book = load_workbook(path, read_only=False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
ids = ['F00000VA2N','F0GBR064OO','F00000YKC2','F000015MVX','F0000020YA','0P00015YTR','0P00015YTT','F0GBR05V8D','F0GBR06XKI','F000013CKH','F00000MG6K','F000014G49',
'F00000WC0Z','F00000QSD2','F000016551','F0000146QH','F0000146QI','F0GBR04KZS','F0GBR064VU','F00000VXLM','F0000119R1','F0GBR04L4T','F000015CS3','F000015CS5','F000015CS6',
'F000015CS4','F000013BZE','F0GBR05W0Q','F000016M1C','F0GBR04L68','F00000Z9T9','F0GBR04JI8','F00000Z9TG','F0GBR04L2P','F000014CU8','F00000ZG2G','F00000MLEW',
'F000013ZOY','F000016614','F00000WUI9','F000015KRL','F0GBR04LCR','F000010ES9','F00000P780','F0GBR04HC3','F000015CV6','F00000YWCK','F00000YWCJ','F00000NAI5',
'F0GBR04L81','F0GBR05KNU','F0GBR06XKB','F00000NAI3','F0GBR06XKF','F000016UA9','F000013FC2','F000014NRE','0P0000CNVT','0P0000CNVX','F000015KRI',
'F000015KRG','F00000XLK7','F0GBR04IDG','F00000XLK6','F00000073J','F00000XLK4','F000013CKG','F000013CKJ','F000013CKK','F000016P8R','F000016P8S','F000011JG6',
'F000014UZQ','F0000159PE','F0GBR04KZG','F0000002OY','F00000TW9K','F0000175CC','F00000NBEL','F000016054','F000016056','F00000TEYP','F0000025UI','F0GBR04FV7',
'F00000WP01','F000011SQ4','F0GBR04KZO','F000010E19','F000013ZOX','F0GBR04HD7','F00000YKC1','F0GBR064UG','F00000JSDD','F000010ROF','F0000100CA','F0000100CD',
'FOGBR05KQ0','F0GBR04LBB','F0GBR04LBZ','F0GBR04LCN','F00000WLA7','F0000147D7','F00000ZB5E','F00000WC0Y']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
payload = {
    'languageId': 'da-DK',
    'locale': 'da-DK',
    'clientId': 'MDC_intl',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-factor-profile',
    'version': '3.40.1'}
for api_id in ids:
    payload = {
        'Site': 'dk',
        'FC': '%s' % api_id,
        'IT': 'FO',
        'LANG': 'da-DK'}
    response = requests.get(auth, params=payload)
    search = re.search('(tokenMaaS:[\w\s]*\")(.*)(\")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)
    headers.update({'Authorization': bearer})
    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' % api_id
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        historicRange = v.pop('historicRange')
        row.update(v)
        for each in historicRange:
            row.update(each)
        rows.append(row.copy())
    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' % sheetName)
writer.save()
writer.close()
If I understand you correctly, you want to get the first table of that URL in the form of a pandas DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# load the page into soup:
url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# find correct table:
tbl = soup.select_one(".returnsCalenderYearTable")
# remove the first row (it's not header):
tbl.tr.extract()
# convert the html to pandas DF:
df = pd.read_html(str(tbl))[0]
# move the first row to header:
df.columns = map(str, df.loc[0])
df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
print(df)
Prints:
Name 2014* 2015* 2016* 2017* 2018 2019 2020 31-08
0 Samlet afkast % 2627 1490 1432 584 -589 2648 -482 1841
1 +/- Kategori 1130 583 808 -255 164 22 -910 -080
2 +/- Indeks 788 591 363 -320 -127 -262 -1106 -162
3 Rank i kategori 2 9 4 80 38 54 92 63
EDIT: To load from multiple URLs:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
    "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1",
    "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1",
    "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1",
    "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1",
]

all_data = []
for url in urls:
    print("Loading URL {}".format(url))
    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # find correct table:
    tbl = soup.select_one(".returnsCalenderYearTable")
    # remove the first row (it's not header):
    tbl.tr.extract()
    # convert the html to pandas DF:
    df = pd.read_html(str(tbl))[0]
    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])

df = pd.concat(all_data, ignore_index=True)
print(df)
Prints:
Name 2016 2017 2018 2019 2020 31-08 Company URL
0 Samlet afkast % 1755.0 942.0 -1317.0 1757.0 -189.0 3018 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
1 +/- Kategori 966.0 -54.0 -186.0 -662.0 -967.0 1152 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
2 +/- Indeks 686.0 38.0 -854.0 -1153.0 -813.0 1015 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
3 Rank i kategori 10.0 24.0 85.0 84.0 77.0 4 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
4 Samlet afkast % NaN 1016.0 -940.0 1899.0 767.0 2238 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
5 +/- Kategori NaN 20.0 190.0 -520.0 -12.0 373 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
6 +/- Indeks NaN 112.0 -478.0 -1011.0 143.0 235 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
7 Rank i kategori NaN 26.0 69.0 92.0 43.0 25 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
8 Samlet afkast % NaN NaN -939.0 1898.0 766.0 2239 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
9 +/- Kategori NaN NaN 191.0 -521.0 -12.0 373 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
10 +/- Indeks NaN NaN -477.0 -1012.0 142.0 236 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
11 Rank i kategori NaN NaN 68.0 92.0 44.0 24 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
12 Samlet afkast % NaN NaN NaN NaN NaN 2384 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
13 +/- Kategori NaN NaN NaN NaN NaN 518 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
14 +/- Indeks NaN NaN NaN NaN NaN 381 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
15 Rank i kategori NaN NaN NaN NaN NaN 18 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
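If you also want to push the combined frame into an Excel file, as in your original script, a minimal sketch reusing the concatenated df from the loop above (the file name and sheet name here are assumptions, not taken from your setup):
import pandas as pd
# Write the concatenated calendar-year table to a new workbook (file/sheet names are placeholders):
with pd.ExcelWriter('calendar_year_returns.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='CalendarYear', index=False)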
I'm having some trouble and I couldn't find any answers on the internet (which is very rare!).
I'm doing some web scraping of meteorological data at a daily granularity, by city.
When I scrape a single day, the 'city' column is treated as a string, but when I scrape more than one day (e.g. two days), the 'city' column is no longer a string, and the same code that worked for the one-day scrape no longer works for the two-day scrape.
Here's my code for scraping:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url_template = 'https://www.wetterkontor.de/de/wetter/deutschland/extremwerte.asp?id={}'

def get_weather(date):
    url = url_template.format(date)
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc)
    table = soup.find('table', id="extremwerte")
    rows = []
    for row in table.find_all('tr'):
        rows.append([val.text for val in row.find_all('td')])
    headers = [header.text for header in table.find_all('th')]
    df = pd.DataFrame(rows[2:], columns=headers)
    df['date'] = date
    return df

df2 = pd.DataFrame()
df2 = df2.fillna(0)
for d in pd.date_range(start='20170101', end='20170101'):
    df2 = pd.concat([df2, get_weather(d.strftime('%Y%m%d'))])
When I only scrape 20170101 as above, the code below works:
for i, row in df2.iterrows():
    if isinstance(df2['Wetterstation\xa0'][i], str) is True:
        df2.at[i, 'Wetterstation\xa0'] = df2['Wetterstation\xa0'][i].replace('/', '-')
But when I change the end date to 20170202, for example, the code no longer works and gives me this error message:
AttributeError: 'Series' object has no attribute 'encode'
Can you please explain what has changed? Why does the type of the column change when I scrape more than one day?
Thank you in advance for your time!
It seems you need to specify the correct parser, in this case lxml (the standard html.parser produces incorrect results here).
Running this code for end date 20170202:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url_template = 'https://www.wetterkontor.de/de/wetter/deutschland/extremwerte.asp?id={}'

def get_weather(date):
    url = url_template.format(date)
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'lxml')  # <--- specify `lxml` parser
    table = soup.find('table', id="extremwerte")
    rows = []
    for row in table.find_all('tr'):
        rows.append([val.text for val in row.find_all('td')])
    headers = [header.text for header in table.find_all('th')]
    df = pd.DataFrame(rows[2:], columns=headers)
    df['date'] = date
    return df

df2 = pd.DataFrame()
df2 = df2.fillna(0)
for d in pd.date_range(start='20170101', end='20170202'):
    df2 = pd.concat([df2, get_weather(d.strftime('%Y%m%d'))])

print(df2)
Produces:
Wetterstation MinimumTemp.[°C] MaximumTemp.[°C] Minimum 5 cmüber dem Erd-boden (Nacht) Schnee-höhe[cm] StärksteWindböe[Bft] Nieder-schlag[l/m2] Sonnen-scheindauer[h] date
0 Aachen -5,1 1,9 -6,9 0 644 1,9 5,3 20170101
1 Ahaus (Münsterland) -1,2 0,7 -1 0 429 3,9 0 20170101
2 Albstadt (Schwäbische Alb, 759 m) -9,9 4,2 -10,8 0 7,1 20170101
3 Aldersbach (Lkr. Passau, Niederbayern) -8,3 -1,7 -7,8 0 4,9 20170101
4 Alfeld (Leine) -4,1 1,6 -4,4 0 322 1,6 1,5 20170101
.. ... ... ... ... ... ... ... ... ...
965 Zernien (N) 0 20170202
966 Zielitz (N) 0 20170202
967 Zinnwald-Georgenfeld (Erzgebirge, 877 m) -7,4 0,2 -0,2 55 540 0,5 0 20170202
968 Zugspitze -7,3 -5,3 235 12123 0 5,8 20170202
969 Zwiesel (Bayerischer Wald, 612 m) -4,4 9,3 -0,3 44 213 0,1 2,4 20170202
[27841 rows x 9 columns]
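Once the table is parsed correctly, the row-by-row replacement from your question can also be written as a single vectorized call; a sketch assuming the column header still ends with the non-breaking space ('\xa0'):
# Vectorized equivalent of the iterrows() loop; NaN cells are simply left untouched by .str
df2['Wetterstation\xa0'] = df2['Wetterstation\xa0'].str.replace('/', '-', regex=False)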
I'm a newbie learning BeautifulSoup. Could someone have a look at the following code? I'm trying to scrape data from a website, without any success. I'd like to create a dataframe with the sum of player arrivals per year and a column with the players' average age.
The resulting dataframe repeats the same values for every year (screenshot of the erroneous dataframe omitted).
my code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

anos_list = list(range(2005, 2018))
anos_lista = []
valor_contratos_lista = []
idade_média_lista = []

for ano_lista in anos_list:
    url = 'https://www.transfermarkt.com/flamengo-rio-de-janeiro/transfers/verein/614/saison_id/' + str(anos_list) + ''
    page = requests.get(url, headers={'User-Agent': 'Custom5'})
    soup = BeautifulSoup(page.text, 'html.parser')
    tag_list = soup.tfoot.find_all('td')
    valor = (tag_list[0].string)
    idade = (tag_list[1].string)
    ano = ano_lista
    valor_contratos_lista.append(valor)
    idade_media_lista.append(idade)
    anos_lista.append(ano)

flamengo_df = pd.DataFrame({'Ano': ano_lista,
                            'Despesa com contratações': valor_contratos_lista,
                            'Média de idade': idade_média_lista
                            })
flamengo_df.to_csv('flamengo.csv', encoding='utf-8')
Here's my approach:
Using Beautiful Soup + Regex:
import requests
from bs4 import BeautifulSoup
import re
import numpy as np

# Set min and max years as variables
min_year = 2005
max_year = 2019
year_range = list(range(min_year, max_year + 1))
base_url = 'https://www.transfermarkt.com/flamengo-rio-de-janeiro/transfers/verein/614/saison_id/'

# Begin iterating
records = []
for year in year_range:
    url = base_url + str(year)

    # get the page
    page = requests.get(url, headers={'User-Agent': 'Custom5'})
    soup = BeautifulSoup(page.text, 'html.parser')

    # I used the class of "responsive table"
    tables = soup.find_all('div', {'class': 'responsive-table'})
    rows = tables[0].find_all('tr')
    cells = [row.find_all('td', {'class': 'zentriert'}) for row in rows]

    # get variable names:
    variables = [x.text for x in rows[0].find_all('th')]
    variables_values = {x: [] for x in variables}

    # get values
    for row in rows:
        values = [' '.join(x.text.split()) for x in row.find_all('td')]
        values = [x for x in values if x != '']
        if len(variables) < len(values):
            values.pop(4)
            values.pop(2)
        for k, v in zip(variables_values.keys(), values):
            variables_values[k].append(v)

    num_pattern = re.compile('[0-9,]+')
    to_float = lambda x: float(x) if x != '' else np.NAN
    get_nums = lambda x: to_float(''.join(num_pattern.findall(x)).replace(',', '.'))

    # Add values to an individual record
    rec = {
        'Url': url,
        'Year': year,
        'Total Transfers': len(variables_values['Player']),
        'Avg Age': np.mean([int(x) for x in variables_values['Age']]),
        'Avg Cost': np.nanmean([get_nums(x) for x in variables_values['Fee'] if ('loan' not in x)]),
        'Total Cost': np.nansum([get_nums(x) for x in variables_values['Fee'] if ('loan' not in x)]),
    }

    # Store record
    records.append(rec)
Thereafter, initialize the dataframe:
Of note, some of the fee figures represent millions and would need to be adjusted for; see the sketch after the output below.
import pandas as pd
# Drop the URL
df = pd.DataFrame(records, columns=['Year','Total Transfers','Avg Age','Avg Cost','Total Cost'])
Year Total Transfers Avg Age Avg Cost Total Cost
0 2005 26 22.038462 2.000000 2.00
1 2006 32 23.906250 240.660000 1203.30
2 2007 37 22.837838 462.750000 1851.00
3 2008 41 22.926829 217.750000 871.00
4 2009 31 23.419355 175.000000 350.00
5 2010 46 23.239130 225.763333 1354.58
6 2011 47 23.042553 340.600000 1703.00
7 2012 45 24.133333 345.820000 1037.46
8 2013 36 24.166667 207.166667 621.50
9 2014 37 24.189189 111.700000 335.10
10 2015 49 23.530612 413.312000 2066.56
11 2016 41 23.341463 241.500000 966.00
12 2017 31 24.000000 101.433333 304.30
13 2018 18 25.388889 123.055000 738.33
14 2019 10 25.300000 NaN 0.00
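As noted above, some of the fee figures are in millions while others are in thousands; a hedged sketch of a unit normalizer (the 'm'/'Th.' suffix convention is an assumption about the raw Transfermarkt fee strings and should be verified against the live page; get_nums is the helper defined inside the loop above):
def fee_to_eur(fee):
    # Assumed suffixes: 'm' = millions, 'Th.' = thousands -- verify before relying on this
    value = get_nums(fee)
    if 'm' in fee:
        return value * 1_000_000
    if 'Th' in fee:
        return value * 1_000
    return value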