Web scraping with Python and Pandas - Pagination - python

With this short code I can get data from the table:
import pandas as pd
df=pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular',parse_dates=True)
df[0].to_csv('2023_I_M_800.csv')
I am trying to get data from all pages, or from a set number of them, but since this website doesn't use ul or li elements I don't know exactly how to build it.
Any help or idea would be appreciated.

Since the URL contains the page number, why not just loop over the pages and concatenate the results?
https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular
import pandas as pd

F, L = 1, 4  # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    sub_df.insert(0, "page_number", page)
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv') # <- uncomment this line to make a .csv
NB : You can access each sub_df separately by using key-indexing notation : dico[num_page].
Output :
print(out)
page_number Rank ... Date Results Score
0 1 1 ... 22 JAN 2023 1230
1 1 2 ... 22 JAN 2023 1204
2 1 3 ... 29 JAN 2023 1204
3 1 4 ... 27 JAN 2023 1192
4 1 5 ... 28 JAN 2023 1189
.. ... ... ... ... ...
395 4 394 ... 21 JAN 2023 977
396 4 394 ... 28 JAN 2023 977
397 4 398 ... 27 JAN 2023 977
398 4 399 ... 28 JAN 2023 977
399 4 399 ... 29 JAN 2023 977
[400 rows x 11 columns]

Try this:
import pandas as pd
for page in range(1, 10):
    df = pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)
    df[0].to_csv(f'2023_I_M_800_page_{page}.csv')
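If the number of pages is not known in advance, one option is to stop as soon as a page comes back empty. The sketch below is only an illustration: the names `BASE_URL`, `read_first_table` and `scrape_pages` are made up for this example, and it assumes the site returns an empty table past the last page, which I have not verified (`pd.read_html` may instead raise `ValueError` if no table is found). The `fetch` parameter lets you swap in a fake reader to try the loop without hitting the site.

```python
import pandas as pd

# URL template taken from the question; only the page number varies.
BASE_URL = (
    "https://www.worldathletics.org/records/toplists/middle-long/800-metres/"
    "indoor/men/senior/2023?regionType=world&timing=electronic&page={page}"
    "&bestResultsOnly=false&oversizedTrack=regular"
)


def read_first_table(url):
    """Default reader: first HTML table found at the URL (hypothetical helper)."""
    return pd.read_html(url, parse_dates=True)[0]


def scrape_pages(first, last, fetch=read_first_table):
    """Collect pages first..last, stopping early at the first empty table."""
    frames = []
    for page in range(first, last + 1):
        table = fetch(BASE_URL.format(page=page))
        if table.empty:
            # Assumption: an empty table means we are past the last page.
            break
        frames.append(table.assign(page_number=page))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Usage would be something like `out = scrape_pages(1, 50)` followed by `out.to_csv('2023_I_M_800.csv', index=False)`.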

Related

Instantiating python dataframe in a for loop

I am trying to create a condition using a for loop and an if statement for a pandas dataframe object. To specify which rows of the data table to extract under a given condition, I looked up the row indexes and defined the locations before the for loop. The specifics look something like this:
import pandas as pd
input_csv_file = "./CSV/Officers_and_Shareholders.csv"
df = pd.read_csv(input_csv_file, skiprows=10, on_bad_lines='skip')
df.fillna('', inplace=True)
# df.drop([0, 3], inplace=True)
df.columns = ['Nama', 'Jabatan', 'Alamat', 'Klasifikasi Saham', 'Jumlah Lembar Saham', 'Total']
# print(df.shape)
# print(df.columns)
# print(df.iloc[:53])
# shareholders = df.iloc[24:42]
# print(shareholders)
# officers = df.iloc[0:23]
# print(officers)
dataframe = df.query("Total.ne('-')")
def get_shareholder_by_row_index():
    for column in df.columns:
        if object(df.iloc[column][:53]) == dataframe:
            shareholders = df.iloc[24:42]
            print(shareholders)
        # elif object(df[:53][column]) != dataframe:
        #     officers = df.iloc[0:23]
        #     print(officers)
Because the format of the CSV file is not proper, I forced dataframe to re-create a header on top of the original CSV file, which I indicate under df.columns. The df.iloc[24:42] and df.iloc[0:23] are able to specifically locate the data range in the dataframe, but it doesn't return so when instantiated inside the for loop. Objectively, I want to create a function where if the row under the column Total is empty (-), then return the officers, but if the row under the column Total is not empty, then return shareholders. In this case, how should I modify the for loop and the if statement?
The desired output for shareholders will be:
24 PT CTCORP INFRASTRUKTUR D INDONESIA, ... Rp. 3.200.000.000
25 Nomor SK :- I ...
26 JalanKaptenPierreTendeanKavling12-14A ...
27 PT INTRERPORT PATIMBAN AGUNG, ... Rp. 2.900.000.000
28 Nomor SK :- ...
29 ...
30 ...
31 ...
32 ...
33 ...
34 PT PATIMBAN MAJU BERSAMA, ... Rp. 2.900.000.000
35 Nomor SK :AHU- ...
36 0061318.AH.01.01.TAHUN 2021 ...
37 Tanggal SK :30 September 2021 ...
38 ...
39 ...
40 PT TERMINAL PETIKEMAS ... Rp. 1.000.000.000
41 SURABAYA, ...
42 Nomor SK :- ...
and for the officers, it will return:
Nama ... Total
1 NIK: 3171060201830005 ...
2 NPWP: 246383541071000 ...
3 TTL: Jakarta, 02 Januari 1983 ...
5 NIK: 1271121011700003 ...
6 NPWP: 070970173112000 ...
7 TTL: Bogor, 10 November 1970 ...
8 ARLAN SEPTIA ANANDA ...
9 RASAM, ...
10 NIK: 3174051209620003 ...
11 NPWP: 080878200013000 ...
12 TTL: Jakarta, 12 September ...
13 1962 ...
15 NIK: 3171011605660004 ...
16 NPWP: 070141650093000 ...
17 TTL: Jakarta, 16 Mei 1966 ...
18 FUAD RIZAL, ...
21 PURNOMO, UTAMA RASRINIK: 3578032408610001 ...
22 NPWP: 097468813615000 ...
23 TTL: SLEMAN, 24 Agustus 1961 ...
Shareholders and officers will be printed with respect to the index (row number).
If this is not the desired answer, please add a little more detail.
def get_shareholder_by_row_index():
    for i in range(len(df)):
        # per the question: an empty Total means an officer row, otherwise a shareholder row
        if df["Total"].iloc[i] == '':
            print(i, " officers")
            print(df.iloc[i])
            # whatever your code is goes here
        else:
            print(i, " shareholders")
            print(df.iloc[i])
            # whatever your code is goes here
# this will give you the indices where the row under Total is empty
print(df["Total"].iloc[:53][df["Total"] == ''])
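A loop is not strictly needed for this kind of split; a boolean mask can separate the two groups in one step. The sketch below is only an illustration on made-up rows: `split_by_total` and the toy data are hypothetical, and it assumes that `'-'` in the Total column is the marker for "empty", as the question describes. Continuation rows with a blank Total (like rows 25 and 26 in the desired output) would need extra handling that is not shown here.

```python
import pandas as pd


def split_by_total(df, empty_marker="-"):
    """Return (officers, shareholders) based on the Total column.

    Assumption: a row whose Total equals `empty_marker` is an officer row;
    every other row is treated as a shareholder row.
    """
    is_empty = df["Total"].astype(str).str.strip() == empty_marker
    return df[is_empty], df[~is_empty]


# Toy data, not taken from the real CSV.
toy = pd.DataFrame({
    "Nama": ["Officer A", "Company X", "Officer B"],
    "Total": ["-", "Rp. 1.000.000", "-"],
})

officers, shareholders = split_by_total(toy)
print(officers)
print(shareholders)
```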

Panda having problem merging dataframes together

I'm trying to merge two dataframes together using .merge and "inner" to find common column, here are the two dataframes, 1st one,
Year Month Brunei Darussalam ... Australia New Zealand Africa
0 1978 Jan na ... 28421 3612 587
1 1978 Feb na ... 13982 2521 354
2 1978 Mar na ... 16536 2727 405
3 1978 Apr na ... 16499 3197 736
4 1978 May na ... 20690 5130 514
.. ... ... ... ... ... ... ...
474 2017 Jul 5625 ... 104873 15358 6964
475 2017 Aug 4610 ... 75171 11197 6987
476 2017 Sep 5387 ... 100987 12021 5458
477 2017 Oct 4202 ... 90940 11834 5635
478 2017 Nov 5258 ... 81821 9348 6717
2nd one,
Year Month
0 1980 Jul
1 1980 Aug
2 1980 Sep
3 1980 Oct
4 1980 Nov
I tried to use this input to initialize my command,
merge = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
print(merge)
but I keep getting this ERROR,
Traceback (most recent call last):
File "main.py", line 52, in <module>
merge = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 74, in merge
op = _MergeOperation(
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 672, in __init__
self._maybe_coerce_merge_keys()
File "/opt/virtualenvs/python3/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1193, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
It means the Year column is numeric in one dataframe and holds strings in the other.
Both columns need the same type before merging, for example:
dataframe['Year'] = dataframe['Year'].astype(int)
df['Year'] = df['Year'].astype(int)
df1 = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
Or:
dataframe['Year'] = dataframe['Year'].astype(str)
df['Year'] = df['Year'].astype(str)
df1 = pd.merge(dataframe,df, how='inner', on=['Year', 'Month'])
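To see the mismatch and the fix end to end, here is a small self-contained sketch. The frames `left` and `right` and the `Visitors` column are made up for illustration; only the `Year`/`Month` keys mirror the question. It uses `pd.to_numeric` as one way to convert the string column, which is a choice on my part rather than the only option.

```python
import pandas as pd

# Toy frames: Year is an integer on the left and a string on the right.
left = pd.DataFrame({
    "Year": [1980, 1980, 1981],
    "Month": ["Jul", "Aug", "Jul"],
    "Visitors": [10, 20, 30],
})
right = pd.DataFrame({"Year": ["1980", "1980"], "Month": ["Jul", "Aug"]})

# Inspect the key dtypes before merging; they differ here.
print(left["Year"].dtype, right["Year"].dtype)

# Convert the string keys to numbers so both sides match, then merge.
right["Year"] = pd.to_numeric(right["Year"])
merged = pd.merge(left, right, how="inner", on=["Year", "Month"])
print(merged)
```

Checking `dtypes` on the key columns first is usually the quickest way to confirm this is the cause of the error.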

calculating percentile values for each columns group by another column values - Pandas dataframe

I have a dataframe that looks like below -
Year Salary Amount
0 2019 1200 53
1 2020 3443 455
2 2021 6777 123
3 2019 5466 313
4 2020 4656 545
5 2021 4565 775
6 2019 4654 567
7 2020 7867 657
8 2021 6766 567
Python script to get the dataframe below -
import pandas as pd
import numpy as np
d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567],
})
I want to calculate certain percentile values for all the columns grouped by 'Year'.
Desired output should look like -
I am running below python script to perform the calculations to calculate certain percentile values-
df_percentile = pd.DataFrame()
p_list = [0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99]
c_list = []
p_values = []
for cols in d.columns[1:]:
    for p in p_list:
        c_list.append(cols + '_' + str(p))
        p_values.append(np.percentile(d[cols], p))
print(len(c_list), len(p_values))
df_percentile['Name'] = pd.Series(c_list)
df_percentile['Value'] = pd.Series(p_values)
print(df_percentile)
Output -
Name Value
0 Salary_0.05 1208.9720
1 Salary_0.1 1217.9440
2 Salary_0.25 1244.8600
3 Salary_0.5 1289.7200
4 Salary_0.75 1334.5800
5 Salary_0.95 1370.4680
6 Salary_0.99 1377.6456
7 Amount_0.05 53.2800
8 Amount_0.1 53.5600
9 Amount_0.25 54.4000
10 Amount_0.5 55.8000
11 Amount_0.75 57.2000
12 Amount_0.95 58.3200
13 Amount_0.99 58.5440
How can I get the output in the required format without having to do extra data manipulation/formatting or in fewer lines of code?
You can try pivot followed by quantile:
(d.pivot(columns='Year')
   .quantile([0.01, 0.05, 0.75, 0.95, 0.99])
   .stack('Year')
)
Output:
Salary Amount
Year
0.01 2019 1269.08 58.20
2020 3467.26 456.80
2021 4609.02 131.88
0.05 2019 1545.40 79.00
2020 3564.30 464.00
2021 4785.10 167.40
0.75 2019 5060.00 440.00
2020 6261.50 601.00
2021 6771.50 671.00
0.95 2019 5384.80 541.60
2020 7545.90 645.80
2021 6775.90 754.20
0.99 2019 5449.76 561.92
2020 7802.78 654.76
2021 6776.78 770.84
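An alternative that may read more directly is `groupby` followed by `quantile` with a list of probabilities. This is a sketch using the dataframe `d` from the question and the question's own `p_list`; the index comes out as (Year, quantile) rather than (quantile, Year), and the level name `Quantile` is just a label chosen here. Note that `quantile` expects fractions between 0 and 1, whereas `np.percentile` expects values between 0 and 100.

```python
import pandas as pd

d = pd.DataFrame({
    "Year": [2019, 2020, 2021] * 3,
    "Salary": [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    "Amount": [53, 455, 123, 313, 545, 775, 567, 657, 567],
})

quantiles = [0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99]

# One row per (Year, quantile) pair, one column per original column.
result = d.groupby("Year").quantile(quantiles)
result.index.names = ["Year", "Quantile"]
print(result)
```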

Get new dataframe by multiple conditions with pd.Dataframe.isin()

I am trying to write a function that takes a df and a dictionary mapping columns to values. The function should return only the rows whose values match the values listed for every key in 'criteria'.
For example:
df_isr13 = filterby_criteria(df, {"Area":["USA"], "Year":[2013]})
Only rows with "Year"=2013 and "Area"="USA" are included in the output.
I tried:
def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        return df[df[key].isin(values)]
but only the first criterion is applied.
How can I get a new dataframe that satisfies all the criteria using pd.DataFrame.isin()?
You can use a for loop and apply each criterion in turn with the pandas merge function:
def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        df = pd.merge(df[df[key].isin(values)], df, how='inner')
    return df
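The same result can also be had without a merge by combining the `isin` masks with a logical AND. This is a minimal sketch, not the answerer's code; the sample frame is made up for illustration. It keeps the original index and does not duplicate rows when the input contains repeated rows.

```python
import pandas as pd


def filterby_criteria(df, criteria):
    """Keep rows where every column in `criteria` has one of its allowed values."""
    mask = pd.Series(True, index=df.index)
    for key, values in criteria.items():
        # A row must pass every criterion, so the masks are AND-ed together.
        mask &= df[key].isin(values)
    return df[mask]


# Toy data for illustration only.
sample = pd.DataFrame({
    "Area": ["USA", "USA", "Israel", "USA"],
    "Year": [2013, 2014, 2013, 2013],
    "Value": [1, 2, 3, 4],
})

filtered = filterby_criteria(sample, {"Area": ["USA"], "Year": [2013]})
print(filtered)
```

The early `return` inside the loop in the question is what limited it to the first criterion; here the return happens only after all criteria have been applied.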
Consider a simple merge of two data frames since by default merge uses all matching names:
from itertools import product
import pandas as pd
def filterby_criteria(df, criteria):
    # EXTRACT DICT ITEMS
    k, v = criteria.keys(), criteria.values()
    # BUILD DF OF ALL POSSIBLE MATCHES
    all_matches = pd.DataFrame(product(*v), columns=list(k))
    # RETURN MERGED DF
    return df.merge(all_matches)
To demonstrate with random, seeded data:
Data
import numpy as np
import pandas as pd
np.random.seed(61219)
tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
years = list(range(2013, 2019))
random_df = pd.DataFrame({'Tool': np.random.choice(tools, 500),
                          'Int': np.random.randint(1, 10, 500),
                          'Num': np.random.uniform(1, 100, 500),
                          'Year': np.random.choice(years, 500)
                          })
print(random_df.head(10))
# Tool Int Num Year
# 0 spss 4 96.465327 2016
# 1 sas 7 23.455771 2016
# 2 r 5 87.349825 2014
# 3 julia 4 18.214028 2017
# 4 julia 7 17.977237 2016
# 5 stata 3 41.196579 2013
# 6 stata 8 84.943676 2014
# 7 python 4 60.576030 2017
# 8 spss 4 47.024075 2018
# 9 stata 3 87.271072 2017
Function call
criteria = {"Tool":["python", "r"], "Year":[2013, 2015]}
def filterby_criteria(df, criteria):
    k, v = criteria.keys(), criteria.values()
    all_matches = pd.DataFrame(product(*v), columns=list(k))
    return df.merge(all_matches)
final_df = filterby_criteria(random_df, criteria)
Output
print(final_df)
# Tool Int Num Year
# 0 python 8 96.611384 2015
# 1 python 7 66.782828 2015
# 2 python 9 73.638629 2015
# 3 python 4 70.763264 2015
# 4 python 2 28.311917 2015
# 5 python 3 69.888967 2015
# 6 python 8 97.609694 2015
# 7 python 3 59.198276 2015
# 8 python 3 64.497017 2015
# 9 python 8 87.672138 2015
# 10 python 9 33.605467 2015
# 11 python 8 25.225665 2015
# 12 r 3 72.202364 2013
# 13 r 1 62.192478 2013
# 14 r 7 39.264766 2013
# 15 r 3 14.599786 2013
# 16 r 4 22.963723 2013
# 17 r 1 97.647922 2013
# 18 r 5 60.457344 2013
# 19 r 5 15.711207 2013
# 20 r 7 80.273330 2013
# 21 r 7 74.190107 2013
# 22 r 7 37.923396 2013
# 23 r 2 91.970678 2013
# 24 r 4 31.489810 2013
# 25 r 1 37.580665 2013
# 26 r 2 9.686955 2013
# 27 r 6 56.238919 2013
# 28 r 6 72.820625 2015
# 29 r 3 61.255351 2015
# 30 r 4 45.690621 2015
# 31 r 5 71.143601 2015
# 32 r 6 54.744846 2015
# 33 r 1 68.171978 2015
# 34 r 5 8.521637 2015
# 35 r 7 87.027681 2015
# 36 r 3 93.614377 2015
# 37 r 7 37.918881 2015
# 38 r 3 7.715963 2015
# 39 python 1 42.681928 2013
# 40 python 6 57.354726 2013
# 41 python 1 48.189897 2013
# 42 python 4 12.201131 2013
# 43 python 9 1.078999 2013
# 44 python 9 75.615457 2013
# 45 python 8 12.631277 2013
# 46 python 9 82.227578 2013
# 47 python 7 97.802213 2013
# 48 python 1 57.103964 2013
# 49 python 1 1.941839 2013
# 50 python 3 81.981437 2013
# 51 python 1 56.869551 2013

DataFrame from string that look like table

I have a problem creating a DataFrame from a string that looks like a table. Specifically, I want to create the same table as my data. This is my data, and below it is my code:
0 2017 IX 2018 X 2018 X 2018 X 2018
0 2017 IX 2018 0 2017 IX 2018
UKUPNO 1.053 1.075 1.093 103,8 101,7 1.633 1.669 1.701 104,2 101,9
A Poljoprivreda, šumarstvo i ribolov 907 888 925 102,0 104,2 1.394 1.356 1.420 101,9 104,7
B Vađenje ruda i kamena 913 919 839 91,9 91,3 1.395 1.406 1.297 93,0 92,2
C Prerađivačka industrija 769 764 775 100,8 101,4 1.176 1.169 1.187 100,9 101,5
D Proizvodnja i snabdijevanje 1.574 1.570 1.647 104,6 104,9 2.459 2.455 2.579 104,9 105,1
električnom energijom, plinom,
parom i klimatizacija
E Snabdijevanje vodom; uklanjanje 956 973 954 99,8 98,0 1.462 1.491 1.462 100,0 98,1
otpadnih voda, upravljanje otpadom
import io
import pandas as pd
TESTDATA = io.StringIO(''' ''')
df = pd.read_csv(TESTDATA, sep='delimiter', header=None, engine='python')
When I run my code, I get this DataFrame:
0 Prosječna neto plaća ...
1 u KM ...
2 Index Index ...
3 0 2017 IX 2018 X 2018 X 2018 ...
4 0 2017 IX 2018 ...
5 UKUPNO ...
6 A Poljoprivreda, šumarstvo i ribolov ...
7 B Vađenje ruda i kamena ...
8 C Prerađivačka industrija ...
9 D Proizvodnja i snabdijevanje ...
10 električnom energijom, plinom,
