Table data not getting extracted using Python

I am trying to extract the table data from the website http://bet.hkjc.com/racing/index.aspx?date=22-01-2017&venue=ST&raceno=1&lang=en
The table with the horse data is what I want to extract. I am using this bit of code, but it returns an empty list:
import requests
from lxml import html

page = requests.get("http://bet.hkjc.com/racing/index.aspx?date=22-01-2017&venue=ST&raceno=1&lang=en")
tree = html.fromstring(page.content)
temp = tree.xpath('//*[#id="horseTable"]')
print(temp)
Please help!

I would highly recommend using requests to fetch the page and BeautifulSoup to parse it.
Here's an example:
import bs4
import requests

content = requests.get("http://bet.hkjc.com/racing/index.aspx?date=22-01-2017&venue=ST&raceno=1&lang=en").text  # Get page content
soup = bs4.BeautifulSoup(content, 'lxml')  # Parse page content
table = soup.find('table', {'id': 'horseTable'})  # Locate that table tag
rows = table.find_all('tr')  # Find all row tags in that table
for row in rows:
    columns = row.find_all('td')  # Find all data cells in each row
    for column in columns:
        print(column.text.strip(), end=' ')  # Output data in each cell
    print()
The output:
T/P No. Colour Horse Draw Wt. Jockey Trainer Body Wt. Rtg. Gear Last 6 Runs
1 HEALTHY LUCK 3 133 D Whyte K L Man 59 1
2 GRACE HEART 11 130 C Y Ho (-2) C Fownes 56 B 12/8/9/11/7/7
3 CITY LEGEND 7 126 K Teetan T P Yung 52
4 GENERAL O'REILLY 4 126 K C Ng (-5) Y S Tsui 52
5 JOLLY AMBER 9 126 O Murphy P F Yiu 52 B1
6 KEEP MOVING 12 126 N Callan P F Yiu 52
7 LUNAR ZEPHYR 13 126 M L Yeung (-2) T P Yung 52 V1
8 MERRYGOWIN 5 126 Z Purton P O'Sullivan 52
9 VICTORY MUSIC 8 126 T Berry J Moore 52 SR 10
10 VITAL SPRING 1 126 J Moreira J Size 52
11 PRINCE HARMONY 14 122 S de Sousa W Y So 48 H/P 9/6/10
12 FUN MANAGER 6 120 T H So (-2) C H Yip 46 10/9/10/14/8/10
13 MASSIVE MOVE 10 118 K K Chiong (-5) L Ho 44 V 3/8/6/3/7/5
14 HAPPY SOUND 2 116 H W Lai A Lee 42 E-/B/TT 12/11/11/5/9/1
Standby Horse
T/P No. Horse Wt. Trainer Body Wt. Rtg. Gear Last 6 Runs
1 BELOVED 131 P F Yiu 57 H 11
2 EXPONENTS 124 Y S Tsui 50 B 7/7

I think you should have a look at the solution provided in the link below:
https://chihacknight.org/blog/2014/11/26/an-intro-to-web-scraping-with-python.html
It is quite a helpful link if you refer to the code and amend it to suit your needs.
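Also worth noting: the XPath in the question returns an empty list because #id is not valid XPath syntax; attributes are selected with @. A minimal sketch with the corrected selector, assuming the table is present in the static HTML that requests receives:
import requests
from lxml import html

page = requests.get("http://bet.hkjc.com/racing/index.aspx?date=22-01-2017&venue=ST&raceno=1&lang=en")
tree = html.fromstring(page.content)
# Use @id (the attribute axis), not #id, to match the element by its id attribute.
rows = tree.xpath('//*[@id="horseTable"]//tr')
for row in rows:
    cells = [cell.text_content().strip() for cell in row.xpath('./td')]
    print(cells)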

Related

Trying to access variables while scraping website; trying to get var in script

I am trying to web scrape info from this website: http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed=.
For context, I'm trying to find the tyre brand (Bridgestone, Michelin), pattern (e.g. Turanza T001, Ecopia EP500), tyre size (205/55. 16 V (91), 225/50. 16 W (100) XL), seasonality (if available) (Summer, Winter) and price.
My measurements for the tyre are Width – 205, Aspect Ratio – 55, Rim Size - 16.
I found all the info I need here at the var allTyres section. The problem is, I am struggling with how to extract the "manufacturer" (brand), "description" (description has the pattern and size), "winter" (it would have 0 for no and 1 for yes), "summer" (same as winter) and "price".
Afterwards, I want to export the data in CSV format.
Thanks
To create a pandas DataFrame from the allTyres data you can do the following (from the DataFrame you can select the columns you want, save it to CSV, etc.):
import re
import json
import requests
import pandas as pd
url = "http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed="
data = json.loads(
    re.search(r"allTyres = (.*);", requests.get(url).text).group(1)
)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data)
print(df.head())
Prints:
id ManufacturerID width profile rim speed load description part_no pattern manufacturer extra_load run_flat winter summer OEList price tyre_class rolling_resistance wet_grip Graphic noise_db noise_rating info pattern_name recommended rating
0 1881920 647 205 55 16 V 91 205/55VR16 BUDGET VR 2055516VBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
1 3901788 647 205 55 16 H 91 205/55R16 BUDGET 91H 2055516HBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
2 1881957 647 205 55 16 W 91 205/55ZR16 BUDGET ZR 2055516ZBUD Economy N N 0 1 53.54 C1 G F BUD 73 3 0 1
3 6022423 129 205 55 16 H 91 205/55R16 91H UROYAL RAINSPORT 5 2055516HUN09BGS RainSport 5 Uniroyal N N 0 1 70.46 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
4 6022424 129 205 55 16 V 91 205/55R16 91V UROYAL RAINSPORT 5 2055516VUN09BGR RainSport 5 Uniroyal N N 0 1 70.81 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
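Continuing from the snippet above, a small follow-up sketch that keeps only the fields mentioned in the question and exports them to CSV (the filename is just an example):
cols = ["manufacturer", "description", "winter", "summer", "price"]
df[cols].to_csv("tyres.csv", index=False)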

Replace blank value in dataframe based on another column condition

I have many blanks in a merged data set and I want to fill them with a condition.
My current code looks like this
import pandas as pd
import csv
import numpy as np
pd.set_option('display.max_columns', 500)
# Read all files into pandas dataframes
Jan = pd.read_csv(r'C:\~\Documents\Jan.csv')
Feb = pd.read_csv(r'C:\~\Documents\Feb.csv')
Mar = pd.read_csv(r'C:\~\Documents\Mar.csv')
Jan=pd.DataFrame({'Department':['52','5','56','70','7'],'Item':['2515','254','818','','']})
Feb=pd.DataFrame({'Department':['52','56','765','7','40'],'Item':['2515','818','524','','']})
Mar=pd.DataFrame({'Department':['7','70','5','8','52'],'Item':['45','','818','','']})
all_df_list = [Jan, Feb, Mar]
appended_df = pd.concat(all_df_list)
df = appended_df
df.to_csv(r"C:\~\Documents\SallesDS.csv", index=False)
Data set:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7
40
7 45
70
5 818
8
52
What I want is to fill the empty cells in Item with the value that corresponds to that row's Department.
So if Department is 52 and Item is empty, it should be filled with 2515;
if Department is 7 and Item is empty, fill it with 45,
and the result should look like this
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7 45
40
7 45
70
5 818
8
52 2515
I tried the following methods but none of them worked.
1
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(52)), 'Item'] = 2515
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(7)), 'Item'] = 45
2
df["Item"] = df["Item"].fillna(df["Department"])
df = df.replace({"Item":{"52":"2515", "7":"45"}})
Both either return an error or do not work.
Answer:
Hi, I have used the code below and it worked:
b = [52]
df.Item = np.where(df.Department.isin(b), df.Item.fillna(2515), df.Item)
a = [7]
df.Item = np.where(df.Department.isin(a), df.Item.fillna(45), df.Item)
Hope it helps someone who faces the same issue.
The following solution first creates a map of each department and its maximum corresponding item (assuming there is one), and then matches that item to a department with a blank item. Note that in your data frame, the empty items are an empty string ("") and not NaN.
Create a map:
values = df.groupby('Department').max()
values['Item'] = values['Item'].apply(lambda x: np.nan if x == "" else x)
values = values.dropna().reset_index()
Department Item
0 5 818
1 52 2515
2 56 818
3 7 45
4 765 524
Then use df.apply():
df['Item'] = df.apply(lambda x: values[values['Department'] == x['Department']]['Item'].values if x['Item'] == "" else x['Item'], axis=1)
In this case, the new values will have brackets around them. They can be removed with str.replace():
df['Item'] = df['Item'].astype(str).str.replace(r'\[|\'|\'|\]', "", regex=True)
The result:
Department Item
0 52 2515
1 5 254
2 56 818
3 70
4 7 45
0 52 2515
1 56 818
2 765 524
3 7 45
4 40
0 7 45
1 70
2 5 818
3 8
4 52 2515
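For reference, a more compact sketch of the same idea, starting again from the df built in the question (blanks as empty strings) and keeping the same maximum-Item-per-Department rule used above:
import numpy as np
import pandas as pd

# Build a Department -> Item map from the rows that already have an Item.
mapping = (
    df.replace("", np.nan)
      .dropna()
      .groupby("Department")["Item"]
      .max()
)

# Fill blank Items from the map; departments without a known Item stay blank.
fill_values = df["Department"].map(mapping).fillna("")
df["Item"] = np.where(df["Item"].eq(""), fill_values, df["Item"])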

I want to separate a data frame based on marks and then download it

This is my data frame:
Name Age Stream Percentage
0 A 21 Math 88
1 B 19 Commerce 92
2 C 20 Arts 95
3 D 18 Biology 70
0 E 21 Math 88
1 F 19 Commerce 92
2 G 20 Arts 95
3 H 18 Biology 70
I want to download a different Excel file for each subject in one loop, so basically I should get 4 Excel files, one for each subject.
I tried this but it didn't work:
n = 0
for subjects in df.stream:
    df.to_excel("sub" + str(n) + ".xlsx")
    n += 1
I think groupby is helpful here, and you can use enumerate to keep track of the index.
for i, (group, group_df) in enumerate(df.groupby('stream')):
    group_df.to_excel('sub{}.xlsx'.format(i))
    # Alternatively, to name the file based on the stream:
    # group_df.to_excel('sub{}.xlsx'.format(group))
group is going to be the name of the stream.
group_df is going to be a sub-dataframe containing all the data in that group.
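For completeness, a minimal self-contained run on the sample data from the question (note the column there is capitalised 'Stream', so the groupby key must match it), which should produce one Excel file per subject:
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Age': [21, 19, 20, 18, 21, 19, 20, 18],
    'Stream': ['Math', 'Commerce', 'Arts', 'Biology'] * 2,
    'Percentage': [88, 92, 95, 70, 88, 92, 95, 70],
})

# One file per stream, e.g. sub_Arts.xlsx, sub_Biology.xlsx, sub_Commerce.xlsx, sub_Math.xlsx
for group, group_df in df.groupby('Stream'):
    group_df.to_excel('sub_{}.xlsx'.format(group), index=False)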

How to scrape a non-tabulated list from Wikipedia and create a dataframe?

en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
In the link above, there is un-tabulated data for the Istanbul neighbourhoods.
I want to fetch these neighbourhoods into a data frame with this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('a',{'class':"new"})
neighborhoods=[]
for item in tocList:
    text = item.get_text()
    neighborhoods.append(text)
df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
but some data is not fetched; check the data below and compare it to the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findAll() is not fetching the neighbourhoods that appear as plain list items rather than as links with that class, e.g.
<ol><li>Burgazada</li>
<li>Heybeliada</li>
Also, can I develop the code to produce 2 columns, 'Neighborhood' and its 'District'?
Are you trying to fetch this list from the Table of Contents?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('span',{'class':"toctext"})
districts=[]
blocked_words = ['Neighbourhoods by districts','Further reading', 'External links']
for item in tocList:
    text = item.get_text()
    if text not in blocked_words:
        districts.append(text)
df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu
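The code above only collects the district names from the table of contents; the question also asks for a Neighborhood/District pairing. A rough sketch of one way to attempt that, assuming each district section heading carries the usual mw-headline span and is followed by list items naming its neighbourhoods (the live page's markup may differ, so treat this as a starting point rather than a finished answer):
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
soup = BeautifulSoup(requests.get(wikiurl).text, 'html.parser')

rows = []
for headline in soup.find_all('span', {'class': 'mw-headline'}):
    district = headline.get_text()
    heading = headline.find_parent(['h2', 'h3'])
    if heading is None:
        continue
    # Walk the elements after the heading until the next heading,
    # collecting list items as that district's neighbourhoods.
    for sibling in heading.find_next_siblings():
        if sibling.name in ('h2', 'h3'):
            break
        if sibling.name in ('ol', 'ul'):
            for li in sibling.find_all('li'):
                rows.append({'District': district, 'Neighborhood': li.get_text().strip()})

df = pd.DataFrame(rows)
print(df.head())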

Python parsing data from a website using regular expression

I'm trying to parse some data from this website:
http://www.csfbl.com/freeagents.asp?leagueid=2237
I've written some code:
import urllib.request
import re
name = re.compile('<td>(.+?)')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')
url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'
sock = urllib.request.urlopen(url).read().decode("utf-8")
#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)
First question: player_id returns the whole URL, "player.asp?playerid=4209661". I was unable to get just the number part. How can I do that?
(my attempt is described in #player_id_num)
Second question: I am not able to get stat_c when span_class is empty, as in "".
Is there a way I can get these resolved? I am not very familiar with RE (regular expressions); I looked up tutorials online, but it's still unclear what I am doing wrong.
Very simple using the pandas library.
Code:
import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
# print(dfs[3])
# dfs[3].to_csv("stats.csv")  # Send to a CSV file.
print(dfs[3].head())
Result:
0 1 2 3 4 5 6 7 8 9 10 \
0 Pos Name Age T PO FI CO SY HR RA GL
1 P George Pacheco 38 R 4858 7484 8090 7888 6777 4353 6979
2 P David Montoya 34 R 3944 5976 6673 8699 6267 6685 5459
3 P Robert Cole 34 R 5769 7189 7285 5863 6267 5868 5462
4 P Juanold McDonald 32 R 69100 5772 4953 4866 5976 67100 5362
11 12 13 14 15 16
0 AR EN RL Fatigue Salary NaN
1 3747 6171 -3 100% --- $3,672,000
2 5257 5975 -4 96% 2% $2,736,000
3 4953 5061 -4 96% 3% $2,401,000
4 5982 5263 -4 100% --- $1,890,000
You can apply whatever cleaning methods you want from here onwards. The code is rudimentary, so it's up to you to improve it.
More Code:
import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3]  # "First" stats table.

# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()

# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)

# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1, :]
df.columns = header
df.reset_index(drop=True, inplace=True)

# Pandas cannot create two header rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change instead.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]

# Interleave the "real" and "potential" column names
# (a plain-zip replacement for the itertools round-robin trick
# from http://stackoverflow.com/a/3678930/2548721).
comb = [name for pair in zip(orig, clone) for name in pair]

# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]

# Slow but does it cleanly: split each combined value into
# its "real" (first two characters) and "potential" (rest) parts.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])

df = df[new_header]  # Drop the other columns.
print(df.head())
More result:
Pos Name Age T POr POp FIr FIp COr COp ... RAp GLr \
0 P George Pacheco 38 R 48 58 74 84 80 90 ... 53 69
1 P David Montoya 34 R 39 44 59 76 66 73 ... 85 54
2 P Robert Cole 34 R 57 69 71 89 72 85 ... 68 54
3 P Juanold McDonald 32 R 69 100 57 72 49 53 ... 100 53
4 P Trevor White 37 R 61 66 62 64 67 67 ... 38 48
GLp ARr ARp ENr ENp RL Fatigue Salary
0 79 37 47 61 71 -3 100% $3,672,000
1 59 52 57 59 75 -4 96% $2,736,000
2 62 49 53 50 61 -4 96% $2,401,000
3 62 59 82 52 63 -4 100% $1,890,000
4 50 70 100 62 69 -4 100% $1,887,000
What I did, essentially, was separate the Real values from the Potential values. Some tricks were used, but it gets the job done, at least for the first table of players. The next few tables require a degree of manipulation.
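As a side note on the first question in the post (getting just the numeric id rather than the whole href), a small sketch that captures only the digits; the href format is taken from the question, so adjust the pattern if the page's markup differs:
import re
import urllib.request

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
sock = urllib.request.urlopen(url).read().decode("utf-8")

# Capture only the digits of the player id instead of the whole href.
player_ids = re.findall(r'href="player\.asp\?playerid=(\d+)"', sock)
print(player_ids[:5])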
