BeautifulSoup extracting data from multiple tables - python

I'm trying to extract some data from two html tables in a html file with BeautifulSoup.
This is actually the first time I'm using it, and I've searched a lot of questions/examples, but none seem to work in my case.
The html contains two tables: the first holds the headers of the first column (which are always text), and the second contains the data of the following columns. Moreover, the table contains text, numbers and also symbols, which makes everything more complicated for a novice like me. Here's the layout of the html, copied from the browser.
I was able to extract the whole html content of the rows, but only for the first table, so in reality I am not getting any data, only the content of the first column.
The output I'm trying to obtain is a string containing the "joint" information of the tables (Col1= text, Col2=number, Col3=number, Col4=number, Col5=number), such as:
Canada, 6, 5, 2, 1
Here's the list of the Xpaths for each item:
"Canada": /html/body/div/div[1]/table/tbody[2]/tr[2]/td/div/a
"6": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[1]
"5": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[3]
"2": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[5]
"1": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[7]
I would be also happy with strings in "rough" html format, as long as there is one string per row, so that I'll be able to parse it further with the methods I already know. Here's the code I have so far. Thanks!
from BeautifulSoup import BeautifulSoup
html="""
my html code
"""
soup = BeautifulSoup(html)
table=soup.find("table")
for row in table.findAll('tr'):
    col = row.findAll('td')
    print row, col

Using bs4, but this should work:
from bs4 import BeautifulSoup as bsoup

with open("htmlsample.html") as ofile:
    soup = bsoup(ofile)

tables = soup.find_all("tbody")
storeTable = tables[0].find_all("tr")        # rows of the label column
storeValueRows = tables[2].find_all("tr")    # rows of the matching value table

storeRank = []
for row in storeTable:
    storeRank.append(row.get_text().strip())

storeMatrix = []
for row in storeValueRows:
    storeMatrixRow = []
    # Only every second cell is kept; the cells in between are skipped
    for cell in row.find_all("td")[::2]:
        storeMatrixRow.append(cell.get_text().strip())
    storeMatrix.append(", ".join(storeMatrixRow))

for record in zip(storeRank, storeMatrix):
    print(" ".join(record))
The above will print out:
# of countries - rank 1 reached 0, 0, 1, 9
# of countries - rank 5 reached 0, 8, 49, 29
# of countries - rank 10 reached 25, 31, 49, 32
# of countries - rank 100 reached 49, 49, 49, 32
# of countries - rank 500 reached 49, 49, 49, 32
# of countries - rank 1000 reached 49, 49, 49, 32
[Finished in 0.5s]
Changing storeTable to tables[1] and storeValueRows to tables[3] will print out:
Country
Canada 6, 5, 2, 1
Brazil 7, 5, 2, 1
Hungary 7, 6, 2, 2
Sweden 9, 5, 1, 1
Malaysia 10, 5, 2, 1
Mexico 10, 5, 2, 2
Greece 10, 6, 2, 1
Israel 10, 6, 2, 1
Bulgaria 10, 6, 2, -
Chile 10, 6, 2, -
Vietnam 10, 6, 2, -
Ireland 10, 6, 2, -
Kuwait 10, 6, 2, -
Finland 10, 7, 2, -
United Arab Emirates 10, 7, 2, -
Argentina 10, 7, 2, -
Slovakia 10, 7, 2, -
Romania 10, 8, 2, -
Belgium 10, 9, 2, 3
New Zealand 10, 13, 2, -
Portugal 10, 14, 2, -
Indonesia 10, 14, 2, -
South Africa 10, 15, 2, -
Ukraine 10, 15, 2, -
Philippines 10, 16, 2, -
United Kingdom 11, 5, 2, 1
Denmark 11, 6, 2, 2
Australia 12, 9, 2, 3
United States 13, 9, 2, 2
Austria 13, 9, 2, 3
Turkey 14, 5, 2, 1
Egypt 14, 5, 2, 1
Netherlands 14, 8, 2, 2
Spain 14, 11, 2, 4
Thailand 15, 10, 2, 3
Singapore 16, 10, 2, 2
Switzerland 16, 10, 2, 3
Taiwan 17, 12, 2, 4
Poland 17, 13, 2, 5
France 18, 8, 2, 3
Czech Republic 18, 13, 2, 6
Germany 19, 11, 2, 3
Norway 20, 14, 2, 5
India 20, 14, 2, 5
Italy 20, 15, 2, 7
Hong Kong 26, 21, 2, -
Japan 33, 16, 4, 5
Russia 33, 17, 2, 7
South Korea 46, 27, 2, 5
[Finished in 0.6s]
Not the best of code and can be improved further. However, the logic applies well.
Hope this helps.
EDIT:
If you want the format South Korea, 46, 27, 2, 5 instead of South Korea 46, 27, 2, 5 (note the , after the country name), just change this:
storeRank.append(row.get_text().strip())
to this:
storeRank.append(row.get_text().strip() + ",")
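The pairing logic above can be tried without the original page. The following is a minimal sketch on a made-up HTML fragment (the markup, the names `name_table` and `value_table`, and the filler cells between values are assumptions for illustration, not the real page): one table holds the names, the other holds values in every second cell, and zip joins them row by row.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: a name table and a value table.
html = """
<div>
  <div><table><tbody>
    <tr><td><div><a>Canada</a></div></td></tr>
    <tr><td><div><a>Brazil</a></div></td></tr>
  </tbody></table></div>
  <div><div><table><tbody>
    <tr><td>6</td><td>=</td><td>5</td><td>+1</td><td>2</td><td>=</td><td>1</td></tr>
    <tr><td>7</td><td>=</td><td>5</td><td>-1</td><td>2</td><td>=</td><td>1</td></tr>
  </tbody></table></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
name_table, value_table = soup.find_all("table")

# One name per row of the first table.
names = [tr.get_text(strip=True) for tr in name_table.find_all("tr")]

# Keep every second cell of each row in the second table.
values = [
    [td.get_text(strip=True) for td in tr.find_all("td")[::2]]
    for tr in value_table.find_all("tr")
]

# Join the two tables row by row into one string per row.
rows = [", ".join([name] + vals) for name, vals in zip(names, values)]
for line in rows:
    print(line)
```

Since zip stops at the shorter input, a header row present in only one of the two tables would shift the pairing, so it is worth checking that both lists have the same length.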

Looks like you are scraping data from http://www.appannie.com.
Here is the code to get the data. I am sure some parts of the code can be improved or written in a pythonic way. But it gets what you want. Also, I used Beautiful Soup 4 instead of 3.
from bs4 import BeautifulSoup

html_file = open('test2.html')
soup = BeautifulSoup(html_file)

# Country names: second tbody of the second ranking table
countries = []
countries_table = soup.find_all('table', attrs={'class':'data-table table-rank'})[1]
countries_body = countries_table.find_all('tbody')[1]
countries_row = countries_body.find_all('tr', attrs={"class": "ranks"})
for row in countries_row:
    countries.append(row.div.a.text)

# Values: second tbody of the fourth ranking table
data = []
data_table = soup.find_all('table', attrs={'class':'data-table table-rank'})[3]
data_body = data_table.find_all('tbody')[1]
data_row = data_body.find_all('tr', attrs={"class": "ranks"})
for row in data_row:
    tds = row.find_all('td')
    sublist = []
    for td in tds[::2]:
        sublist.append(td.text)
    data.append(sublist)

for element in zip(countries, data):
    print(element)
Hope this helps :)

Just thought I would put my alternate version of this here. I don't even know why people still use BeautifulSoup for web scraping; it's much easier to use XPath directly through lxml. Here is the same problem, perhaps in a form that is easier to read and update:
from lxml import html, etree

tree = html.parse("sample.html").xpath('//body/div/div')

def lxml_getValue(x):
    # Text content of an element, with surrounding whitespace removed
    return etree.tostring(x, method="text", encoding="unicode").strip()

def lxml_getData(x):
    # The values sit in every second cell (td 0, 2, 4 and 6)
    tds = x.xpath('.//td')
    return ", ".join(lxml_getValue(tds[i]) for i in (0, 2, 4, 6))

locations = tree[0].xpath('.//tbody')[1].xpath('./tr')
locations.pop(0)  # Don't need first row
data = tree[1].xpath('.//tbody')[1].xpath('./tr')
data.pop(0)  # Don't need first row
for f, b in zip(locations, data):
    print(lxml_getValue(f), lxml_getData(b))

Related

return the days in a month that are not in a list

I have a list of days: list = [1,5,16,29]
Consider the current month to be September and the year 2021.
I have a user-wise day df:
user_id day month year
1 1 9 2021
1 2 9 2021
1 6 9 2021
1 14 9 2021
1 22 9 2021
1 18 9 2021
2 2 9 2021
2 17 9 2021
2 3 9 2021
2 30 9 2021
2 29 9 2021
2 28 9 2021
How can I get the user wise days of given month and year that are not present in respective users df['day'] and in the list?
Expected result
user_id remaining_days_of_month
1 3,4,7,8,9,10,11,12,13,15,17,19,20,21,23,24,25,26,27,28,30
2 4,6,7,8,9,10,11,12,13,14,15,18,19,20,21,22,23,24,25,26,27
You can use calendar.monthrange to get the number of days in a given year and month (it returns a tuple: the weekday of the first day, and the number of days in the month):
import calendar
import numpy as np

def get_remaining_days(group, lst):
    month = group.month.unique()[0]
    days_to_remove = np.unique(np.concatenate((group.day, lst)))
    # monthrange returns (weekday of the 1st, number of days), so build the day range from the second item
    lst_of_days = list(range(1, calendar.monthrange(2021, month)[1] + 1))
    remaining_days = [i for i in lst_of_days if i not in days_to_remove]
    return remaining_days

lst = [1,5,16,29]
result = df.groupby(by=["user_id", "month"]).apply(lambda x: get_remaining_days(x, lst))
result.name = "remaining_days_of_month"
result = result.to_frame()
result
This works for different months of the same user. If your data happens to span different years too, it won't require much change.
Use calendar.monthrange to get the size of a month, then make a set difference
import pandas as pd
import calendar
df = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'day': [1, 2, 6, 14, 22, 18, 2, 17, 3, 30, 29, 28],
'month': [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9]})
month = df['month'].iloc[0]
values = [1, 5, 16, 29]
days_of_month = set(range(1, 1 + calendar.monthrange(2021, month)[1])).difference(values)
df: pd.DataFrame = df.groupby('user_id')['day'].apply(list).reset_index()
df['day'] = df['day'].apply(lambda cell: set(days_of_month).difference(cell))
   user_id  day
0        1  {3, 4, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19, 20, 21, 23, 24, 25, 26, 27, 28, 30}
1        2  {4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27}
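For reference, here is a minimal sketch that produces the comma-separated format from the question and also reads the year from the data instead of hard-coding 2021. The names `excluded` and `rows` are my own, and the frame is rebuilt from the question's sample. It loops over the groups rather than using apply, and relies on the second item returned by calendar.monthrange being the number of days in the month.

```python
import calendar

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "user_id": [1] * 6 + [2] * 6,
    "day": [1, 2, 6, 14, 22, 18, 2, 17, 3, 30, 29, 28],
    "month": [9] * 12,
    "year": [2021] * 12,
})
excluded = {1, 5, 16, 29}  # the extra list of days to leave out

rows = []
for (user_id, year, month), group in df.groupby(["user_id", "year", "month"]):
    # monthrange returns (weekday of the first day, number of days in the month).
    _, n_days = calendar.monthrange(int(year), int(month))
    taken = set(group["day"].tolist()) | excluded
    remaining = [d for d in range(1, n_days + 1) if d not in taken]
    rows.append({
        "user_id": user_id,
        "remaining_days_of_month": ",".join(map(str, remaining)),
    })

result = pd.DataFrame(rows)
print(result)
```

Grouping by year and month as well as user means each user gets one row per month present in the data.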

Alternative to for loops for calculating 15^6 combinations in Python

Today, I have a nested for loop in python to calculate the value of all different combinations in a horse racing card consisting of six different races; i.e. six different arrays (of different lengths, but up to 15 items per array). It can be up to 11 390 625 combinations (15^6).
For each horse in each race, I calculate a value (EV) which I want to multiply.
Array 1: 1A,1B,1C,1D,1E,1F
Array 2: 2A,2B,2C,2D,2E,2F
Array 3: 3A,3B,3C,3D,3E,3F
Array 4: 4A,4B,4C,4D,4E,4F
Array 5: 5A,5B,5C,5D,5E,5F
Array 6: 6A,6B,6C,6D,6E,6F
1A * 1B * 1C * 1D * 1E * 1F = X,XX
.... .... .... .... ... ...
6A * 6B * 6C * 6D * 6E * 6F = X,XX
Doing four levels is OK. It takes me about 3 minutes.
I have yet not been able to do six levels.
I need help in creating a better way of doing this, and have no idea how to proceed. Does numpy perhaps offer help here? Pandas? I've tried compiling the code with Cython, but it did not help much.
My function takes in a list containing the horses in numerical order and their EV. (Since horse starting numbers do not start with zero, I add 1 to the index). I iterate through all the different races, and save the output for the combination into a dataframe.
def calculateCombos(horses_in_race_1,horses_in_race_2,horses_in_race_3,horses_in_race_4,horses_in_race_5,horses_in_race_6,totalCombinations, df):
    totalCombinations = 0
    for idx1, hr1_ev in enumerate(horses_in_race_1):
        hr1_no = idx1 + 1
        for idx2, hr2_ev in enumerate(horses_in_race_2):
            hr2_no = idx2 + 1
            for idx3, hr3_ev in enumerate(horses_in_race_3):
                hr3_no = idx3 + 1
                for idx4, hr4_ev in enumerate(horses_in_race_4):
                    hr4_no = idx4 + 1
                    for idx5, hr5_ev in enumerate(horses_in_race_5):
                        hr5_no = idx5 + 1
                        for idx6, hr6_ev in enumerate(horses_in_race_6):
                            hr6_no = idx6 + 1
                            totalCombinations = totalCombinations + 1
                            combinationEV = hr1_ev * hr2_ev * hr3_ev * hr4_ev * hr5_ev * hr6_ev
                            new_row = {'Race1':str(hr1_no),'Race2':str(hr2_no),'Race3':str(hr3_no),'Race4':str(hr4_no),'Race5':str(hr5_no),'Race6':str(hr6_no), 'EV':combinationEV}
                            df = appendCombinationToDF(df, new_row)
    return df
Why don't you try this and see if you can run the function without any issues? This works on my laptop (I'm using PyCharm). If you can't run this, then I would say that you need a better PC perhaps. I did not encounter any memory error.
Assume that we have the following:
horses_in_race_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_3 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_4 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_5 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
horses_in_race_6 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
I have re-written the function as follows - made a change in enumeration. Also, not using df as I do not know what function this is - appendCombinationToDF
def calculateCombos(horses_in_race_1,horses_in_race_2,horses_in_race_3,horses_in_race_4,horses_in_race_5,horses_in_race_6):
    for idx1, hr1_ev in enumerate(horses_in_race_1, start = 1):
        for idx2, hr2_ev in enumerate(horses_in_race_2, start = 1):
            for idx3, hr3_ev in enumerate(horses_in_race_3, start = 1):
                for idx4, hr4_ev in enumerate(horses_in_race_4, start = 1):
                    for idx5, hr5_ev in enumerate(horses_in_race_5, start = 1):
                        for idx6, hr6_ev in enumerate(horses_in_race_6, start = 1):
                            combinationEV = hr1_ev * hr2_ev * hr3_ev * hr4_ev * hr5_ev * hr6_ev
                            new_row = {'Race1':str(idx1),'Race2':str(idx2),'Race3':str(idx3),'Race4':str(idx4),'Race5':str(idx5),'Race6':str(idx6), 'EV':combinationEV}
                            l.append(new_row)
                            #df = appendCombinationToDF(df, new_row)

l = [] # df = ...
calculateCombos(horses_in_race_1, horses_in_race_2, horses_in_race_3, horses_in_race_4, horses_in_race_5, horses_in_race_6)
Executing len(l), I get:
11390625 # maximum combinations possible. This means that above function ran successfully and computation succeeded.
If the above can be executed, replace list l with df and see if function can execute without encountering memory error. I was able to run the above in less than 20-30 seconds.
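Since the question asks whether numpy can help, here is a minimal sketch of a vectorised alternative. It is not the approach used above: the helper names `combination_evs` and `as_rows` are made up for illustration, and the three-race card is tiny so the numbers can be checked by hand. The idea is that an outer product over all races yields every combination's EV at once, with no Python-level loop over combinations.

```python
from functools import reduce

import numpy as np


def combination_evs(races):
    """Return an array whose entry [i1, ..., ik] is the product of the chosen EVs."""
    arrays = [np.asarray(race, dtype=float) for race in races]
    # Each outer product adds one axis, so six races give a 6-D array.
    return reduce(np.multiply.outer, arrays)


def as_rows(ev):
    """Flatten to one row per combination: 1-based horse numbers and the EV."""
    numbers = np.indices(ev.shape).reshape(ev.ndim, -1).T + 1
    return numbers, ev.ravel()


# Tiny illustrative card with three races (made-up EV values).
races = [[1.0, 2.0], [3.0, 4.0, 5.0], [0.5, 2.0]]
ev = combination_evs(races)
numbers, values = as_rows(ev)

print(ev.shape)                  # one axis per race
print(numbers[-1], values[-1])   # last combination and its EV
```

For six races of 15 horses the EV array holds 15^6 float64 values, roughly 90 MB, and the table of horse numbers is several times larger, so it may be preferable to keep only the EV array and recover horse numbers on demand with np.unravel_index.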

i have a problem with duplicate rows in python

I want to remove duplicate data in rows from a DataFrame without affecting the shape of the DataFrame.
I have already tried this code, but I could not bring the year column into the new DataFrame.
Any help is appreciated, thanks.
dicpn = dict()
for value in ss.Country.unique():
    if len(ss.loc[(ss.Country == value) & ss.Country.duplicated(keep=False)]) > 0:
        all_values = ss.loc[(ss.Country == value) & ss.Country.duplicated(keep=False), 'Count'].tolist()
        dicpn[value] = all_values
    elif len(ss.loc[(ss.Country == value) & ss.Country.duplicated(keep=False)]) == 0:
        dicpn[value] = ss.loc[(ss.Country== value), 'Count'].tolist()
# make a new dataframe
df2 = pd.DataFrame(columns=['landen', 'Count'])
df2.landen = list(dicpn.keys())
df2.Count = list(dicpn.values())
landen Count
0 Argentina [6, 15, 10, 4, 11, 1, 13, 7, 8, 1, 2, 2, 22, 3...
1 Australia [2, 1, 2230, 1, 3, 1, 5, 55, 38, 48, 9, 1, 2, ...
2 Belgium [1289, 1, 1620, 3, 8, 28, 13, 37, 2, 1, 560, 3...
3 Canada [1, 5, 230, 3, 4, 9, 3, 1, 1376, 159, 11, 44, ...
4 China [168, 12, 1, 114, 5, 8961, 1, 33, 4, 3, 23, 21...
5 Denmark [11, 20, 4, 2, 479, 140, 5, 53, 9, 2, 12, 16, ...
Country PublishYear Count
8 Argentina 2003 6
9 Argentina 2014 15
10 Argentina 2010 10
11 Argentina 2015 4
12 Argentina 2014 11
... ... ... ...
254169 United States 2004 2
254170 United States 2004 955
254171 United States 2003 10
254172 United States 2015 16
254173 United States 2012 259

How to randomly select a specific sequence from a list?

I have a list of hours starting from (0 is midnight).
hour = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
I want to generate a sequence of 3 consecutive hours randomly. Example:
[3,6]
or
[15, 18]
or
[23,2]
and so on. random.sample does not achieve what I want!
import random
hourSequence = sorted(random.sample(range(1,24), 2))
Any suggestions?
I'm not exactly sure what you want, but probably:
import random
s = random.randint(0, 23)
r = [s, (s+3)%24]
r
Out[14]: [16, 19]
Note: None of the other answers take into consideration the possible sequence [23,0,1].
Consider the following, which uses itertools from the Python standard library:
from itertools import islice, cycle
from random import choice

hours = list(range(24)) # List w/ 24h
hours_cycle = cycle(hours) # Transform the list into a cycle
select_init = islice(hours_cycle, choice(hours), None) # Start the iterator at a random position
# Get the next 3 values from the iterator
select_range = []
for i in range(3):
    select_range.append(next(select_init))
print(select_range)
This will print a sequence of three values from your hours list in a circular way, so the results can also include, for example, [23,0,1].
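The same circular behaviour can also be sketched with plain modular arithmetic. The function name `random_consecutive_hours` and its parameters are my own choices for illustration: pick a random start hour, then wrap each following hour with `% 24`.

```python
import random


def random_consecutive_hours(length=3, hours_in_day=24, rng=random):
    """Return `length` consecutive hours from a random start, wrapping past midnight."""
    start = rng.randrange(hours_in_day)
    return [(start + k) % hours_in_day for k in range(length)]


window = random_consecutive_hours()
print(window)
```

Passing a seeded `random.Random` instance as `rng` makes the result reproducible, which is handy for testing.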
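You can try this:
import random
hour = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
index = random.randint(0, len(hour) - 1)
# Wrap around with modulo so that e.g. 23 pairs with 2 instead of raising an IndexError
l = [hour[index], hour[(index + 3) % len(hour)]]
print(l)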
You can pick a random element from the hour list you already created and take the element that is 3 places after it:
import random

def random_sequence_endpoints(l, span):
    i = random.choice(range(len(l)))
    # Index into the list that was passed in, wrapping around at the end
    return [l[i], l[(i+span) % len(l)]]

hour = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
result = random_sequence_endpoints(hour, 3)
This will work not only for the hours list above but for any other list, whatever elements it contains.

loop for computing average of selected data in dataframe using pandas

I have a 3 row x 96 column dataframe. I'm trying to compute the average of the two rows beneath the index (row1:96) for every 12 data points. Here is my dataframe:
Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 \
0 1461274.92 1458079.44 1456807.1 1459216.08 1458643.24 1457145.19
1 478167.44 479528.72 480316.08 475569.52 472989.01 476054.89
2 ------ ------ ------ ------ ------ ------
Run 7 Run 8 Run 9 Run 10 ... Run 87 \
0 1458117.08 1455184.82 1455768.69 1454738.07 ... 1441822.45
1 473630.89 476282.93 475530.87 474200.22 ... 468525.2
2 ------ ------ ------ ------ ... ------
Run 88 Run 89 Run 90 Run 91 Run 92 Run 93 \
0 1445339.53 1461050.97 1446849.43 1438870.43 1431275.76 1430781.28
1 460076.8 473263.06 455885.07 475245.64 483875.35 487065.25
2 ------ ------ ------ ------ ------ ------
Run 94 Run 95 Run 96
0 1436007.32 1435238.23 1444300.51
1 474328.87 475789.12 458681.11
2 ------ ------ ------
[3 rows x 96 columns]
Currently I am trying to use df.irow(0) to select all the data in row index 0.
something along the lines of:
selection = np.arange(0,13)
for i in selection:
new_df = pd.DataFrame()
data = df.irow(0)
........
then i get lost
I just don't know how to link this range with the dataframe in order to compute the mean for every 12 data points in each column.
To summarize, I want the average for every 12 runs in each column. So, I should end up with a separate dataframe with 2 * 8 average values (96/12).
any ideas?
thanks.
You can do a groupby on axis=1 (using some dummy data I made up):
>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
0 1 2 3 4 5 6 7
0 0.609643 0.452047 0.536786 0.377845 0.544321 0.214615 0.541185 0.544462
1 0.382945 0.596034 0.659157 0.437576 0.490161 0.435382 0.476376 0.423039
First we extract the data and force recognition of a float (the presence of the ------ row means that you've probably got an object dtype, which will make the mean unhappy.)
Then we make an array saying what groups we want to put the different columns in:
>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7], dtype=int32)
which we feed as an argument to groupby. .mean() handles the rest.
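As far as I know, newer pandas releases deprecate the axis=1 argument of groupby. A minimal sketch that avoids it groups the transposed frame instead and transposes the result back. The data below is made up (simple counters rather than the question's values), and the names `groups` and `means` are my own.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's frame: 2 numeric rows plus a "------" row.
df = pd.DataFrame(
    {f"Run {i + 1}": [float(i), 2.0 * i, "------"] for i in range(96)}
)

h = df.iloc[:2].astype(float)            # drop the dashes row, force floats
groups = np.arange(h.shape[1]) // 12     # 0 for runs 1-12, 1 for runs 13-24, ...

# Group the transposed frame by block number, then transpose back.
means = h.T.groupby(groups).mean().T
print(means)
```

The result has one row per original row and one column per block of 12 runs.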
It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.
The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:
In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})
In [15]: df
Out[15]:
x y
0 1 3
1 2 4
2 - -
In [16]: df.iloc[2] = df.iloc[0:2].sum()
In [17]: df
Out[17]:
x y
0 1 3
1 2 4
2 3 7
However, in your case you want to sum each group of eight cells in df.iloc[2], so you might be better simply taking the result of the summing expression with the statement
ds = df.iloc[0:2].sum()
which with your data will have the form
col1 0
col2 1
col3 2
col4 3
...
col93 92
col94 93
col95 94
col96 95
(These numbers are representative, you will obviously see your column sums). You can then turn this into a 12x8 matrix with
ds.values.reshape(12, 8)
whose value is
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63],
[64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87],
[88, 89, 90, 91, 92, 93, 94, 95]])
but summing this array will give you the sum of all elements, so instead create another DataFrame with
rs = pd.DataFrame(ds.values.reshape(12, 8))
and then sum that:
rs.sum()
giving
0 528
1 540
2 552
3 564
4 576
5 588
6 600
7 612
dtype: int64
You may find in practice that it is easier to simply create two 12x8 matrices in the first place, which you can add together before creating a dataframe which you can then sum. Much depends on how you are reading your data.
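If each group is meant to be 12 consecutive runs averaged per row, as the question describes, a reshape can express that directly. This is a minimal sketch on made-up values. With real data the array might come from something like df.iloc[:2].astype(float).to_numpy(), which is an assumption about how the frame is loaded.

```python
import numpy as np

# Stand-in data: 2 rows x 96 runs (made-up values, not the question's data).
values = np.vstack([
    np.arange(96, dtype=float),
    2 * np.arange(96, dtype=float),
])

# Split the 96 columns into 8 consecutive blocks of 12 and average each block.
block_means = values.reshape(2, 8, 12).mean(axis=2)
print(block_means)
```

The shape (2, 8, 12) keeps the two rows separate, so the result is a 2 x 8 array with one mean per row and block.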
