Unable to do web scraping from URL using Python Alchemy - python

I have a script where I'm trying to web-scrape data into a table, but I'm getting this error:
raise exc.with_traceback(traceback)
ValueError: No tables found
Script:
import pandas as pd
import logging
from sqlalchemy import create_engine
from urllib.parse import quote

db_connection = f"mysql://{username}:{quote(pwd)}@{host}:{port}/{db}"
ds_connection = create_engine(db_connection)

a = pd.read_html("https://www.centralbank.ae/en/forex-eibor/exchange-rates/")
df = pd.DataFrame(a[0])
df_final = df.loc[:, ['Currency', 'Rate']]
df_final.to_sql('rate_table', ds_connection, if_exists='append', index=False)
Can anyone suggest what's wrong here?

One easy way to obtain those exchange rates is to query the API the page itself calls to load them (check the Network tab in Dev Tools):
import pandas as pd
import requests

headers = {'Accept-Language': 'en-US,en;q=0.9',
           'Referer': 'https://www.centralbank.ae/en/forex-eibor/exchange-rates/'}

# POST to the endpoint the page itself calls; the response is an HTML
# fragment containing the rates table, which read_html parses directly.
r = requests.post('https://www.centralbank.ae/umbraco/Surface/Exchange/GetExchangeRateAllCurrency', headers=headers)

dfs = pd.read_html(r.text)
print(dfs[0].loc[:, ['Currency', 'Rates']])
This returns:
                     Currency     Rates
0                   US Dollar    3.6725
1              Argentine Peso  0.026993
2           Australian Dollar   2.52753
3             Bangladesh Taka  0.038508
4               Bahrani Dinar   9.74293
5               Brunei Dollar   2.64095
6              Brazilian Real  0.706549
7               Botswana Pula  0.287552
8              Belarus Rouble   1.45526
9             Canadian Dollar   2.82565
10                Swiss Franc   3.83311
11               Chilean Peso  0.003884
12    Chinese Yuan - Offshore  0.536978
13               Chinese Yuan  0.538829
14             Colombian Peso  0.000832
15               Czech Koruna  0.149763
16               Danish Krone  0.496304
17             Algerian Dinar  0.025944
18                Egypt Pound  0.191775
19                       Euro   3.69096
20                   GB Pound   4.34256
21            Hongkong Dollar  0.468079
22           Hungarian Forint  0.009112
23           Indonesia Rupiah  0.000248
24               Indian Rupee  0.045976
25              Iceland Krona  0.026232
26               Jordan Dinar   5.17472
27               Japanese Yen  0.026818
28             Kenya Shilling  0.030681
29                 Korean Won  0.002746
30              Kuwaiti Dinar   11.9423
31           Kazakhstan Tenge  0.007704
32              Lebanon Pound  0.002418
33            Sri Lanka Rupee  0.010201
34            Moroccan Dirham  0.353346
35            Macedonia Denar  0.059901
36               Mexican Peso  0.181874
37           Malaysia Ringgit  0.820395
38             Nigerian Naira  0.008737
39            Norwegian Krone   0.37486
40          NewZealand Dollar   2.27287
41                 Omani Rial   9.53921
42                   Peru Sol  0.952659
43            Philippine Piso  0.065562
44             Pakistan Rupee  0.017077
45               Polish Zloty  0.777446
46               Qatari Riyal   1.00254
47              Serbian Dinar  0.031445
48              Russia Rouble   0.06178
49                Saudi Riyal  0.977847
50             Sudanese Pound  0.006479
51              Swedish Krona  0.347245
52           Singapore Dollar   2.64038
53                  Thai Baht  0.102612
54             Tunisian Dinar    1.1505
55               Turkish Lira   0.20272
56            Trin Tob Dollar  0.541411
57              Taiwan Dollar  0.121961
58          Tanzania Shilling  0.001575
59            Uganda Shilling  0.000959
60               Vietnam Dong  0.000157
61                 Yemen Rial   0.01468
62          South Africa Rand  0.216405
63             Zambian Kwacha  0.227752
64           Azerbaijan manat   2.16157
65              Bulgarian lev    1.8873
66              Croatian kuna  0.491344
67             Ethiopian birr  0.069656
68                Iraqi dinar  0.002516
69         Israeli new shekel   1.12309
70               Libyan dinar  0.752115
71            Mauritian rupee  0.079837
72               Romanian leu  0.755612
73               Syrian pound  0.001462
74              Turkmen manat   1.05079
75            Uzbekistani som  0.000336

Related

Scraping non-interactable table from dynamic webpage

I've seen a couple of posts with this same question, but their scripts usually wait until one of the elements (buttons) is clickable. Here is the table I'm trying to scrape:
https://ropercenter.cornell.edu/presidential-approval/highslows
On my first couple of tries, my code returned all the rows except both Polling Organization columns. Without changing anything, it now only scrapes the table headers and the tbody tag (no table rows).
url = "https://ropercenter.cornell.edu/presidential-approval/highslows"
driver = webdriver.Firefox()
driver.get(url)
driver.implicitly_wait(12)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
approvalData = pd.read_html(str(table[0]))
approvalData = pd.DataFrame(approvalData[0], columns = ['President', 'Highest %', 'Polling Organization & Dates H' 'Lowest %', 'Polling Organization & Dates L'])
Should I use explicit wait? If so, which condition should I wait for since the dynamic table is not interactive?
Also, why did the output of my code change after running it multiple times?
A simpler (if slightly cheating) solution, which solves your problem in another way, is to look at what the frontend does (using developer tools): it calls an API that returns JSON, so no Selenium is really needed. requests and pandas are enough.
import requests
import pandas as pd

url = "https://ropercenter.cornell.edu/presidential-approval/api/presidents/highlow"
data = requests.get(url).json()
df = pd.json_normalize(data)  # pd.io.json.json_normalize is deprecated in newer pandas
>>> df
>>> df
president.id president.active president.surname president.givenname president.shortname ... low.approve low.disapprove low.noOpinion low.sampleSize low.presidentName
0 e9c0d19b-dfe9-49cf-9939-d06a0f256e57 True Biden Joe None ... 33 53 13 1313.0 Joe Biden
1 bc9855d5-8e97-4448-b62e-1fb2865c79e6 True Trump Donald None ... 29 68 3 5360.0 Donald Trump
2 1c49881f-0f0c-4a53-9b2c-0dd6540f88e4 True Obama Barack None ... 37 57 5 1017.0 Barack Obama
3 ceda6415-5975-404d-8049-978758a7d1f8 True Bush George W. W. Bush ... 19 77 4 1100.0 George W. Bush
4 4f7344de-a7bd-4bc6-9147-87963ae51095 True Clinton Bill None ... 36 50 14 800.0 Bill Clinton
5 116721f1-f947-4c14-b0b5-d521ed5a4c8b True Bush George H.W. H.W. Bush ... 29 60 11 1001.0 George H.W. Bush
6 43720f8f-0b9f-43b0-8c0d-63da059e7a57 True Reagan Ronald None ... 35 56 9 1555.0 Ronald Reagan
7 7aa76fd3-e1bc-4e9a-b13c-463a64e0c864 True Carter Jimmy None ... 28 59 13 1542.0 Jimmy Carter
8 6255dd77-531d-46c6-bb26-627e2a4b3654 True Ford Gerald None ... 37 39 24 1519.0 Gerald Ford
9 f1a23b06-4200-41e6-b137-dd46260ac4d8 True Nixon Richard None ... 23 55 22 1589.0 Richard Nixon
10 772aabfd-289b-4f10-aaae-81a82dd3dbc6 True Johnson Lyndon B. None ... 35 52 13 1526.0 Lyndon B. Johnson
11 d849b5a8-f711-4ac9-9728-c3915e17bb6a True Kennedy John F. None ... 56 30 14 1550.0 John F. Kennedy
12 e22fd64a-cf20-4bc4-8db6-b4e71dc4483d True Eisenhower Dwight D. None ... 48 36 16 NaN Dwight D. Eisenhower
13 ab0bfa04-61da-49d1-8069-6992f6124f17 True Truman Harry S. None ... 22 65 13 NaN Harry S. Truman
14 11edf04f-9d8d-4678-976d-b9339b46705d True Roosevelt Franklin D. None ... 48 43 8 NaN Franklin D. Roosevelt
[15 rows x 41 columns]
>>> df.columns
Index(['president.id', 'president.active', 'president.surname',
'president.givenname', 'president.shortname', 'president.fullname',
'president.number', 'president.terms', 'president.ratings',
'president.termCount', 'president.ratingCount', 'high.id',
'high.active', 'high.organization.id', 'high.organization.active',
'high.organization.name', 'high.organization.ratingCount',
'high.pollingStart', 'high.pollingEnd', 'high.updated',
'high.president', 'high.approve', 'high.disapprove', 'high.noOpinion',
'high.sampleSize', 'high.presidentName', 'low.id', 'low.active',
'low.organization.id', 'low.organization.active',
'low.organization.name', 'low.organization.ratingCount',
'low.pollingStart', 'low.pollingEnd', 'low.updated', 'low.president',
'low.approve', 'low.disapprove', 'low.noOpinion', 'low.sampleSize',
'low.presidentName'],
dtype='object')
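From there, a sketch of reducing the normalized frame to roughly the same columns as the page's table, using names from the df.columns listing above (which columns best match the page is partly a guess):
summary = df[['president.fullname',
              'high.approve', 'high.organization.name',
              'low.approve', 'low.organization.name']].rename(columns={
    'president.fullname': 'President',
    'high.approve': 'Highest %',
    'high.organization.name': 'Polling Organization H',
    'low.approve': 'Lowest %',
    'low.organization.name': 'Polling Organization L',
})
print(summary.head())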
To extract the table contents using only Selenium, GeckoDriver and Firefox, you need to induce WebDriverWait for visibility_of_element_located(); then, with a pandas DataFrame, you can use the following locator strategy:
Code Block:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')  # a Chromium/Blink flag; Firefox/Gecko ignores it
s = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = webdriver.Firefox(service=s, options=options)
driver.get('https://ropercenter.cornell.edu/presidential-approval/highslows')
tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped']"))).get_attribute("outerHTML")
tabledf = pd.read_html(tabledata)
print(tabledf)
driver.quit()
Console Output:
[ President Highest % ... Lowest % Polling Organization & Dates.1
0 Joe Biden 63% ... 33% Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
1 Donald Trump 49% ... 29% PewJan 8th, 2021 - Jan 12th, 2021
2 Barack Obama 76% ... 37% Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
3 George W. Bush 92% ... 19% American Research GroupFeb 16th, 2008 - Feb 19...
4 Bill Clinton 73% ... 36% Yankelovich Partners / TIME / CNNMay 26th, 199...
5 George H.W. Bush 89% ... 29% Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
6 Ronald Reagan 68% ... 35% Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
7 Jimmy Carter 75% ... 28% Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
8 Gerald Ford 71% ... 37% Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
9 Richard Nixon 70% ... 23% Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
10 Lyndon B. Johnson 80% ... 35% Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
11 John F. Kennedy 83% ... 56% Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
12 Dwight D. Eisenhower 78% ... 48% Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
13 Harry S. Truman 87% ... 22% Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
14 Franklin D. Roosevelt 84% ... 48% Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
[15 rows x 5 columns]]

How to separate a combined column, but with incongruent data

I'm preparing for a new job where I'll be receiving data submissions of varying quality; often dates/chars/etc. are combined together nonsensically and must be separated before analysis. I'm thinking ahead about how this might be solved.
Using a fictitious example below, I combined region, rep, and product together.
file['combine'] = file['Region'] + file['Sales Rep'] + file['Product']
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 EastShirlenePencil
1 3 South Anderson Folder 17 69 SouthAndersonFolder
2 3 West Shelli Folder 17 185 WestShelliFolder
3 3 South Damion Binder 30 159 SouthDamionBinder
4 3 West Shirlene Stapler 25 41 WestShirleneStapler
Assuming no other data, the question is, how can the 'combine' column be split up?
Many thanks in advance!
If you want space between the strings, you can do:
df["combine"] = df[["Region", "Sales Rep", "Product"]].apply(" ".join, axis=1)
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine
0 3 East Shirlene Pencil 5 71 East Shirlene Pencil
1 3 South Anderson Folder 17 69 South Anderson Folder
2 3 West Shelli Folder 17 185 West Shelli Folder
3 3 South Damion Binder 30 159 South Damion Binder
4 3 West Shirlene Stapler 25 41 West Shirlene Stapler
Or, if you want to split the already-combined string:
import re
df["separated"] = df["combine"].apply(lambda x: re.findall(r"[A-Z][^A-Z]*", x))
print(df)
Prints:
Shift Region Sales Rep Product Cost per Units Sold combine separated
0 3 East Shirlene Pencil 5 71 EastShirlenePencil [East, Shirlene, Pencil]
1 3 South Anderson Folder 17 69 SouthAndersonFolder [South, Anderson, Folder]
2 3 West Shelli Folder 17 185 WestShelliFolder [West, Shelli, Folder]
3 3 South Damion Binder 30 159 SouthDamionBinder [South, Damion, Binder]
4 3 West Shirlene Stapler 25 41 WestShirleneStapler [West, Shirlene, Stapler]
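Note the CamelCase regex assumes each field starts with exactly one capital letter, so a name like 'McDonald' would be over-split. If the vocabularies for some fields are known in advance (plausible for regions and products), a sketch that anchors on them and captures whatever is left as the rep's name (the lists below are hypothetical):
import pandas as pd

# Hypothetical reference lists -- in practice these come from your own data.
regions = ["East", "South", "West", "North"]
products = ["Pencil", "Folder", "Binder", "Stapler"]

df = pd.DataFrame({"combine": ["EastShirlenePencil", "WestMcDonaldStapler"]})

# Anchor the known region and product; the non-greedy middle group is the rep.
pattern = rf"^({'|'.join(regions)})(.+?)({'|'.join(products)})$"
df[["Region", "Sales Rep", "Product"]] = df["combine"].str.extract(pattern)
print(df)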

scraping basketball results and associate related competition to each match

I want to scrape basketball results from this webpage:
http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29
I created the code using bs4 and requests:
import requests
from bs4 import BeautifulSoup

url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
    r = session.get(url, timeout=30)
    soup = BeautifulSoup(r.content, 'html.parser')
The issue I face is how to attach the competition to each row I scrape.
I want to build a table where each row is a match result (competition, home team, away team, score, ...).
Selenium
Try this (selenium):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time

res = []
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(url)
time.sleep(2)
page = driver.page_source
driver.close()

soup = BeautifulSoup(page, 'html.parser')
span = soup.select_one('span#live')
tables = span.select('table')

for table in tables:
    if table.get('class'):
        competition = table.select_one('a b font').text
    else:
        for home, away in zip(table.select('tr.b1')[0::2], table.select('tr.b1')[1::2]):
            res.append([f"{competition}",
                        f"{home.select_one('td a').text}",
                        f"{away.select_one('td a').text}",
                        f"{home.select_one('td.red').text}",
                        f"{away.select_one('td.red').text}",
                        f"{home.select_one('td.odds1').text}",
                        f"{away.select_one('td.odds1').text}",
                        f"{home.select('td font')[0].text}/{home.select('td font')[1].text}",
                        f"{away.select('td font')[0].text}/{away.select('td font')[1].text}",
                        f"{home.select('td div a')[-1].get('href')}"])

df = pd.DataFrame(res, columns=['competition',
                                'home',
                                'away',
                                'home score',
                                'away score',
                                'home odds',
                                'away odds',
                                'home ht',
                                'away ht',
                                'odds'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score home odds away odds home ht away ht odds
0 National Basketball Association Portland Trail Blazers Oklahoma City Thunder 120 131 2.72 1.45 50/70 63/68 http://data.nowgoal.group/OddsCompBasket/387520.html
1 National Basketball Association Houston Rockets Boston Celtics 137 112 1.49 2.58 77/60 60/52 http://data.nowgoal.group/OddsCompBasket/387521.html
2 National Basketball Association Philadelphia 76ers Dallas Mavericks 115 118 2.04 1.76 39/64 48/55 http://data.nowgoal.group/OddsCompBasket/387522.html
3 Women’s National Basketball Association Connecticut Sun Washington Mystics 89 94 2.28 1.59 52/37 48/46 http://data.nowgoal.group/OddsCompBasket/385886.html
4 Women’s National Basketball Association Chicago Sky Los Angeles Sparks 96 78 2.72 1.43 40/56 36/42 http://data.nowgoal.group/OddsCompBasket/385618.html
5 Women’s National Basketball Association Seattle Storm Minnesota Lynx 90 66 1.21 4.19 41/49 35/31 http://data.nowgoal.group/OddsCompBasket/385884.html
6 Friendly Competition Labas Pasauli LT Balduasenaras 85 78 52/33 31/47 http://data.nowgoal.group/OddsCompBasket/387769.html
7 Friendly Competition BC Vikings Nemuno Banga KK 66 72 29/37 30/42 http://data.nowgoal.group/OddsCompBasket/387771.html
8 Friendly Competition NRG Kiev Hizhaki 51 76 31/20 28/48 http://data.nowgoal.group/OddsCompBasket/387766.html
9 Friendly Competition Finland Estonia 97 76 2.77 1.40 48/49 29/47 http://data.nowgoal.group/OddsCompBasket/387740.html
10 Friendly Competition Synkarb Sk nemenchine 82 79 37/45 38/41 http://data.nowgoal.group/OddsCompBasket/387770.html
and so on....
It also saves the results to Res.csv.
Requests
Try this (requests):
import pandas as pd
from bs4 import BeautifulSoup
import requests

res = []
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('h')

for item in items:
    values = item.text.split('^')
    res.append([f'{values[1]}', f'{values[8]}', f'{values[10]}', f'{values[11]}', f'{values[12]}'])

df = pd.DataFrame(res, columns=['competition', 'home', 'away', 'home score', 'away score'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score
0 NBA Portland Trail Blazers Oklahoma City Thunder 120 131
1 NBA Houston Rockets Boston Celtics 137 112
2 NBA Philadelphia 76ers Dallas Mavericks 115 118
3 WNBA Connecticut Sun Washington Mystics 89 94
4 WNBA Chicago Sky Los Angeles Sparks 96 78
5 WNBA Seattle Storm Minnesota Lynx 90 66
6 FC Labas Pasauli LT Balduasenaras 85 78
7 FC BC Vikings Nemuno Banga KK 66 72
8 FC NRG Kiev Hizhaki 51 76
It also saves the results to Res.csv.
If you do not want the index column, simply add index=False, i.e. df.to_csv('Res.csv', index=False).
Note on selenium: you need selenium and geckodriver, and in this code geckodriver is loaded from c:/program/geckodriver.exe.
The selenium version is slower, but it avoids having to find the underlying XHR request with devtools.
This page uses JavaScript to load its data, but requests/BeautifulSoup can't run JavaScript, so you have two options.
First: use Selenium to control a real web browser which can run JavaScript. This works better when a page uses complex JavaScript to generate its data, but it is slower because the browser has to render the page and run the JavaScript.
Second: use DevTools in Firefox/Chrome (Network tab, XHR filter) to find the URL that the JavaScript/AJAX (XHR) uses to get data from the server, and request that URL directly. Often you get JSON, which converts straight to a Python list/dictionary, so you don't even need BeautifulSoup to scrape the data. This is faster, but sometimes the page builds the request with JavaScript code that is hard to replicate in Python.
I chose the second method.
I found it reads data from
http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000
but it gives XML data so it still needs BeautifulSoup (or lxml) to scrape data.
import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
r = requests.get(url)

soup = BS(r.text, 'html.parser')
all_items = soup.find_all('h')

for item in all_items:
    values = item.text.split('^')
    #print(values)
    print(values[8], values[11])
    print(values[10], values[12])
    print('---')
Result:
Portland Trail Blazers 120
Oklahoma City Thunder 131
---
Houston Rockets 137
Boston Celtics 112
---
Philadelphia 76ers 115
Dallas Mavericks 118
---
Connecticut Sun 89
Washington Mystics 94
---
Chicago Sky 96
Los Angeles Sparks 78
---
Seattle Storm 90
Minnesota Lynx 66
---
Labas Pasauli LT 85
Balduasenaras 78
---
BC Vikings 66
Nemuno Banga KK 72
---
NRG Kiev 51
Hizhaki 76
---
Finland 97
Estonia 76
---
Synkarb 82
Sk nemenchine 79
---
CS Sfaxien (w) 51
ES Cap Bon (w) 54
---
Police De La Circulation (w) 43
Etoile Sportive Sahel (w) 39
---
CA Bizertin 63
ES Goulette 71
---
JS Manazeh 77
AS Hammamet 53
---
Southern Huskies 84
Canterbury Rams 98
---
Taranaki Mountainairs 99
Franklin Bulls 90
---
Chaophraya Thunder 67
Thai General Equipment 102
---
Airforce Madgoat Basketball Club 60
HiTech Bangkok City 77
---
Bizoni 82
Leningrad 75
---
chameleon 104
Leningrad 80
---
Bizoni 71
Zubuyu 57
---
Drakony 89
chameleon 79
---
Dragoni 71
Zubuyu 87
---
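One caveat: the trailing t= value in that URL looks like a millisecond Unix timestamp acting as a cache-buster (an assumption based on its format, not confirmed by the site). If the hard-coded value ever stops working, a sketch that regenerates it per request:
import time
import requests

# Build the query with a fresh millisecond timestamp (assumed cache-buster).
date = '2020-07-29'
t = int(time.time() * 1000)
url = f'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date={date}&timezone=2&kind=0&t={t}'
r = requests.get(url)
print(r.status_code)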

Change color of pie chart according to section label (pandas/matplotlib)

I am working from a DataFrame called plot_df that looks like this:
Country Visual Format $
0 France DEFAULT 4.378900e+03
1 France DIGITAL3D 1.170000e+02
2 France IMAX3D 0.000000e+00
3 Hong Kong DIGITAL 1.061189e+07
4 Hong Kong DIGITAL3D 1.881850e+05
5 India DBOX 1.137234e+06
6 India DIGIMAX 2.653723e+06
7 India DIGITAL 3.283665e+07
8 Japan DEFAULT 5.819080e+07
9 Japan DIGIMAX 8.193800e+06
10 Kuwait DEFAULT 6.130250e+04
11 Kuwait DIGITAL3D 1.099000e+03
12 Kuwait IMAX3D 1.057550e+04
13 Kuwait MXP3D 8.736000e+03
14 Malaysia DIGIMAX 2.941200e+04
15 Malaysia DIGITAL 2.590491e+06
16 Malaysia MXP2D 9.478000e+03
17 Mexico 4D3D 3.806130e+06
18 Mexico DIGIMAX3D 0.000000e+00
19 Mexico DIGITAL 3.631979e+07
20 Mexico DIGITAL3D 7.510887e+06
21 Netherlands, The 4D3D 4.435451e+04
22 Netherlands, The DIGIMAX3D 7.488704e+04
23 Netherlands, The DIGITAL 3.350028e+04
24 Netherlands, The DIGITAL3D 2.521642e+05
25 Netherlands, The MXP3D 3.298899e+04
26 Peru DIGITAL 1.707998e+06
27 Peru DIGITAL3D 1.030680e+05
28 Peru MXP2D 3.961500e+04
29 Peru MXP3D 4.077950e+04
30 Peru PLF 1.310630e+05
31 Spain DIGIMAX3D 7.717070e+03
32 Spain DIGITAL 5.198949e+05
33 Spain DIGITAL3D 2.494451e+04
34 Spain MXP3D 1.025880e+04
35 Thailand DIGITAL 3.217920e+05
36 Turkey 4D3D 5.433525e+04
37 Turkey DIGITAL 2.693310e+05
38 Turkey DIGITAL3D 6.161560e+05
39 Turkey MXP3D 4.168149e+04
40 UK & Ireland DEFAULT 1.170058e+06
41 UK & Ireland DIGITAL3D 1.755717e+05
42 UK & Ireland IMAX3D 1.065599e+05
43 United Arab Emirates DEFAULT 4.317666e+06
44 United Arab Emirates DIGITAL3D 2.808751e+04
45 United Arab Emirates IMAX3D 6.832500e+04
I am trying to create a number of pie chart subplots, one per country. This is my code so far:
fig, axes = plt.subplots(nrows=int(np.ceil(plot_df.index.get_level_values(0).nunique()/3)),
                         ncols=3,
                         figsize=(15,15))
fig.tight_layout()
axes_list = [item for sublist in axes for item in sublist]

for country in plot_df.index.get_level_values(0).unique():
    ax = axes_list.pop(0)
    plot_df.loc[(country, slice(None))].plot(kind='pie',
                                             subplots=True,
                                             legend=False,
                                             autopct='%1.1f%%',
                                             ax=ax)
    ax.set_title(country, fontweight='bold')
    ax.tick_params(bottom=False)
    ax.set_ylabel(ylabel=None)

for ax in axes_list:
    ax.remove()
My end result will look something like this:
My question has to do with the colors assigned to each visual format. Every country has a different set of formats and this leads to inconsistent assignment of colors to labels. (For example, DIGITAL is BLUE in Hong Kong but is GREEN in India).
Is there a way to create a dictionary, with visual formats as keys and colors as values, and assign this dictionary to the pandas plot color parameter? Thanks.
You can use the colors parameter for pie charts. Since this takes an array, you'll have to create an array that corresponds to your input data for each plot.
cdict = {'DIGITAL': 'r', 'DIGIMAX3D': 'y', 'DIGITAL3D': 'b', ...}

for country in plot_df.index.get_level_values(0).unique():
    ax = axes_list.pop(0)
    df = plot_df.loc[(country, slice(None))]
    colors = [cdict[x] for x in df.index]  # colors based on index of input data
    df.plot(kind='pie', colors=colors, subplots=True, legend=False, autopct='%1.1f%%', ax=ax)
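If you'd rather not hand-write cdict, a small sketch (assuming plot_df is indexed by (Country, Visual Format) as in the question) builds one color per distinct format from a colormap, so the same mapping covers every country; the tab20 colormap is just an assumption:
import matplotlib.pyplot as plt

# Collect every distinct visual format across all countries (level 1 of the index).
formats = plot_df.index.get_level_values(1).unique()

# Map each format to a fixed color from a qualitative colormap (tab20 assumed;
# any colormap with enough distinct entries works).
cmap = plt.get_cmap('tab20')
cdict = {fmt: cmap(i % 20) for i, fmt in enumerate(formats)}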
Another thing you could do is to reshape the data and plot with pandas:
(df['$'].unstack(0,fill_value=0)
.plot.pie(subplots=True, layout=(4,4), figsize=(12,12))
);
Output: one pie chart per country, with colors consistent across formats because every pie shares the same index.

Read excel file in spyder but some data missing

I am new to Python and am trying to read my Excel file in Spyder (Anaconda). However, when I run it, some rows are missing and replaced with '...'. I have seven columns and 100 rows in my Excel file. The column arrangement is also quite weird.
This is my code:
import pandas as pd

print(" Comparing within 100 Airline \n\n")

def view():
    airlines = pd.ExcelFile('Airline_final.xlsx')
    df1 = pd.read_excel("Airline_final.xlsx", sheet_name=2)
    print("\n\n 1: list of all Airlines \n")
    print(df1)

view()
Here is what I get:
18 #051 Cubana Cuba
19 #003 Aigle Azur France
20 #011 Air Corsica France
21 #012 Air France France
22 #019 Air Mediterranee France
23 #050 Corsair France
24 #072 HOP France
25 #087 Joon France
26 #006 Air Berlin Germany
27 #049 Condor Flugdienst Germany
28 #057 Eurowings Germany
29 #064 Germania Germany
.. ... ... ...
70 #018 Air Mandalay Myanmar
71 #020 Air KBZ Myanmar
72 #067 Golden Myanmar Airlines Myanmar
73 #017 Air Koryo North Korea
74 #080 Jetstar Asia Singapore
75 #036 Binter Canarias Spain
76 #040 Canaryfly Spain
77 #073 Iberia and Iberia Express Spain
To print the whole dataframe use:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df1)
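Alternatively (a matter of preference, not something the snippet above requires), the same display options can be set globally for the session:
import pandas as pd

# Applies to every subsequent print; reset with pd.reset_option('display.max_rows').
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(df1)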
