I am using the html link below to read the table in the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664
The last part of the link (allbin) is an ID. This ID changes, and by using different IDs you can access different tables and records. The link itself stays the same; only the ID at the end changes each time. I have about 1000 more IDs like this. So, how can I actually use those different IDs to access the different tables and join them together?
Output Like this,
ID        Number        Type          FileDate
2016664   NB 14581-26   New Building  12/21/2020
4257909   NB 1481-29    New Building  3/6/2021
4138920   NB 481-29     New Building  9/4/2020
List of other IDs for use:
['4257909', '4138920', '4533715']
This was my attempt; I can read a single table with this code.
import requests
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
df
To get all pages for the list of IDs you can use the following example:
import requests
import pandas as pd
from io import StringIO

url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"

def get_info(ID, page=1):
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            df = pd.read_html(StringIO(t))[3].loc[1:, :]
            if len(df) == 0:
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            page += 25
        except requests.exceptions.ReadTimeout:
            print("Timeout...")
            continue
    return out

list_of_ids = [2016664, 4257909, 4138920, 4533715]

dfs = []
for ID in list_of_ids:
    dfs.extend(get_info(ID))

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)
Prints:
NUMBER NUMBER TYPE FILE DATE ID
1 ALT 1469-1890 NaN ALTERATION 00/00/0000 2016664
2 ALT 1313-1874 NaN ALTERATION 00/00/0000 2016664
3 BN 332-1938 NaN BUILDING NOTICE 00/00/0000 2016664
4 BN 636-1916 NaN BUILDING NOTICE 00/00/0000 2016664
5 CO NB 1295-1923 (PDF) CERTIFICATE OF OCCUPANCY 00/00/0000 2016664
...
And saves data.csv (screenshot from LibreOffice omitted).
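If you only need the rows for one of the IDs afterwards, here is a small usage sketch on the combined frame (using the "ID" column added above):

# Usage sketch: select just the rows collected for a single ID from the combined frame.
df_single = df[df["ID"] == 2016664]
print(df_single.head())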
The code below will extract all the tables in the web page:
import numpy as np
import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
df_list = pd.read_html(url)  # returns a list of dataframes from the web page
print(len(df_list))  # print the number of dataframes

i = 0
while i < len(df_list):  # loop through the list to print all tables
    df = df_list[i]
    print(df)
    i = i + 1
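A slightly more idiomatic sketch of the same loop, using enumerate instead of a manual counter:

# Same behaviour as the while loop above, written with enumerate.
for i, df in enumerate(df_list):
    print(f"Table {i}:")
    print(df)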
Related
I have a JSON file and I'd like to merge the result objects together and get a dataframe that uses the "properties" keys as columns (ID, Title, Site, Last edited time, Link, Status).
Here is what I tried:
import pandas as pd
import json
data = json.load(open('db.json',encoding='utf-8'))
df = pd.DataFrame(data["results"])
df2 = pd.DataFrame(df["properties"])
print(df2)
Here is the json: https://dpaste.com/GV94XD64Y
Here is the result I am expecting:
Site Last edited time Link Status ID Title
0 stackoverflow.com 2023-01-16T20:44:00.000Z https://stackoverflow.com None 1 page 0
1 stackoverflow.com 2023-01-16T20:44:00.000Z https://stackoverflow.com None 1 page 1
You can apply pd.Series to the properties column:
df2 = df.properties.apply(pd.Series)
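A fuller sketch of the same idea, assuming db.json has the structure described in the question (a "results" list whose items each contain a "properties" dict of scalar values):

import json
import pandas as pd

# Assumption: each item in data["results"] has a "properties" dict whose values
# are the fields listed above (ID, Title, Site, Last edited time, Link, Status).
data = json.load(open('db.json', encoding='utf-8'))
df = pd.DataFrame(data["results"])

# Expand each properties dict into its own set of columns.
df2 = df["properties"].apply(pd.Series)
print(df2)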
I recently pulled data from the YouTube API, and I'm trying to create a data frame using that information.
When I loop through each item with the "print" function, I get 25 rows of output for each variable (which is what I want in the data frame I create).
How can I create a new data frame that contains 25 rows using this information, instead of just one line in the data frame?
When I loop through each item like this:
df = pd.DataFrame(columns = ['video_title', 'video_id', 'date_created'])

#For Loop to Create columns for DataFrame
x = 0
while x < len(response['items']):
    video_title = response['items'][x]['snippet']['title']
    video_id = response['items'][x]['id']['videoId']
    date_created = response['items'][x]['snippet']['publishedAt']
    x = x + 1
    #print(video_title, video_id)

df = df.append({'video_title': video_title, 'video_id': video_id,
                'date_created': date_created}, ignore_index=True)
=========ANSWER UPDATE==========
THANK YOU TO EVERYONE THAT GAVE INPUT !!!
The code that created the Dataframe was:
import pandas as pd

x = 0
video_title = []
video_id = []
date_created = []

while x < len(response['items']):
    video_title.append(response['items'][x]['snippet']['title'])
    video_id.append(response['items'][x]['id']['videoId'])
    date_created.append(response['items'][x]['snippet']['publishedAt'])
    x = x + 1
    #print(video_title, video_id)

df = pd.DataFrame({'video_title': video_title, 'video_id': video_id,
                   'date_created': date_created})
Based on what I know about the YouTube API's return objects, the values of 'title', 'videoId' and 'publishedAt' are strings.
A strategy for making a single df from these strings is:
Store the strings in lists, so you end up with three lists.
Convert the lists into a df.
You will get a df with x rows, based on the x values that are retrieved.
Example:
import pandas as pd

x = 0
video_title = []
video_id = []
date_created = []

while x < len(response['items']):
    video_title.append(response['items'][x]['snippet']['title'])
    video_id.append(response['items'][x]['id']['videoId'])
    date_created.append(response['items'][x]['snippet']['publishedAt'])
    x = x + 1
    #print(video_title, video_id)

df = pd.DataFrame({'video_title': video_title, 'video_id': video_id,
                   'date_created': date_created})
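For comparison, a more compact sketch that builds the rows as a list of dicts; it assumes the same response structure as above:

# Assumes `response` is the dict returned by the YouTube API, with
# items[i]['snippet']['title'], items[i]['id']['videoId'] and
# items[i]['snippet']['publishedAt'] present for every item.
rows = [
    {
        "video_title": item["snippet"]["title"],
        "video_id": item["id"]["videoId"],
        "date_created": item["snippet"]["publishedAt"],
    }
    for item in response["items"]
]
df = pd.DataFrame(rows)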
Scenario:
Parse a PDF bank statement and transform it into a clean, formatted CSV file.
What I've tried:
I managed to parse the PDF file (tabular format) using the camelot library but failed to produce the desired result in terms of formatting.
Code:
import camelot
import pandas as pd

tables = camelot.read_pdf('test.pdf', pages = '3')
for i, table in enumerate(tables):
    print(f'table_id:{i}')
    print(f'page:{table.page}')
    print(f'coordinates:{table._bbox}')

tables = camelot.read_pdf('test.pdf', flavor='stream', pages = '3')
df = tables[0].df  # take the first detected table as a DataFrame

columns = df.iloc[0]
df.columns = columns
df = df.drop(0)
df.head()

for c in df.select_dtypes('object').columns:
    df[c] = df[c].str.replace('$', '', regex=False)
    df[c] = df[c].str.replace('-', '', regex=False)

def convert_to_float(num):
    try:
        return float(num.replace(',', ''))
    except:
        return 0

for col in ['Deposits', 'Withdrawals', 'Balance']:
    df[col] = df[col].map(convert_to_float)
My result: (screenshot omitted)
Desired output: (screenshot omitted)
The logic I came up with is to move those rows up (I guess to row n-1) when the Date column is NaN, but I don't know whether this logic is right. Can anyone help me sort this out properly?
I tried pandas groupby and aggregation functions, but they only merge the whole data and remove the NaN and duplicate dates, which is not suitable because every entry is necessary.
Using transform:
df.loc[~df.Date.isna(), 'group'] = 1
g = df.group.fillna(0).cumsum()
df['Description'] = df.groupby(g)['Description'].transform(' '.join)
new_df = df.loc[~df['Date'].isna()]
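A minimal sketch of what this does, with a toy frame standing in for the parsed statement (it assumes Date and Description columns, where continuation rows have NaN dates):

import numpy as np
import pandas as pd

# Toy stand-in for the parsed statement: the second row is a continuation
# of the first row's Description and has no Date of its own.
df = pd.DataFrame({
    "Date": ["01/02/2021", np.nan, "01/05/2021"],
    "Description": ["POS PURCHASE", "GROCERY STORE", "ATM WITHDRAWAL"],
    "Balance": [100.0, np.nan, 80.0],
})

df.loc[~df.Date.isna(), 'group'] = 1
g = df.group.fillna(0).cumsum()
df['Description'] = df.groupby(g)['Description'].transform(' '.join)
new_df = df.loc[~df['Date'].isna()]
print(new_df)
# The continuation row is folded into the dated row above it:
# "POS PURCHASE GROCERY STORE" now sits on the 01/02/2021 row.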
I have written code to get data from a website for a particular date, say 26th Feb 2021. The processed data from the code is as follows.
Client Type               Client    DII     FII      Pro
Future Index Long         126331    584     82434    27321
Future Index Short        133088    34291   40107    29184
Option Index Call Long    1022372   267     198308   310605
Option Index Put Long     790647    12740   291494   292811
Option Index Call Short   964795    0       147444   419313
Option Index Put Short    919882    0       157139   310671
I want to convert this dataframe into multiple dataframes; for example, dfclient should look like this:
Date        Future Index Long  Future Index Short  Option Index Call Long  Option Index Put Long  Option Index Call Short  Option Index Put Short
26-02-2021  126331             133088              1022372                 790647                 964795                   919882
What is the fastest way to achieve this objective?
I would need to run a loop, as I want data for the last 10 business days.
The entire code is as follows:
from numpy.core.fromnumeric import transpose
import pandas as pd
import datetime as dt
import xlwings as xw
from pandas.tseries.offsets import *
import requests as req
hols = ["2021-01-26", "2021-03-11", "2021-03-29", "2021-04-02",
"2021-04-14", "2021-04-21", "2021-05-13", "2021-07-21",
"2021-08-19", "2021-09-10", "2021-10-15", "2021-11-04",
"2021-11-05", "2021-11-19"]
hols = pd.to_datetime(hols)
bdays = pd.date_range(end = dt.date.today(),periods=60,freq = BDay())
wdays = bdays.difference(hols)[-10:]
wodays = pd.DataFrame(wdays,columns = ['Business_day'])
wodays['Datestring'] = wodays['Business_day'].dt.strftime("%d%m%Y")
# getting data from NSE website
url = 'https://archives.nseindia.com/content/nsccl/fao_participant_oi_'+wodays.Datestring[0]+'.csv'
headers = {'User-Agent': 'Mozilla/5.0'}
OC = req.get(url,headers=headers).content
data = pd.read_csv(url, header = 1, usecols = [0,1,2,5,6,7,8], index_col = 0 )
data = data.head(4).transpose()
To loop over all the dates and save a data frame for each date, you need to create a dictionary of data frames, in your case a dictionary called dfclient.
Here is the chunk of code you need:
dfclient = {}
for i, date in enumerate(wodays.Datestring):
    # getting data from NSE website
    url = 'https://archives.nseindia.com/content/nsccl/fao_participant_oi_' + date + '.csv'
    headers = {'User-Agent': 'Mozilla/5.0'}
    OC = req.get(url, headers=headers).content
    data = pd.read_csv(url, header=1, usecols=[0, 1, 2, 5, 6, 7, 8], index_col=0)
    data = data.head(4).iloc[0]
    data = pd.DataFrame(data).transpose().reset_index()
    data.at[0, 'index'] = dt.date(wdays[i].year, wdays[i].month, wdays[i].day)
    data.rename(columns={'index': 'Date'}, inplace=True)
    dfclient[date] = data
You can access each data frame in dfclient by using the Datestring of wodays as the key, i.e. dfclient['01032021'] or dfclient['02032021'] will give you the required data frame.
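If you later want all ten days stacked into a single frame, a small follow-up sketch (assuming the loop above has run):

# Follow-up sketch: stack the per-date frames into one table, one row per business day.
dfclient_all = pd.concat(dfclient.values(), ignore_index=True)
print(dfclient_all)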
Currently working with pybliometrics (Scopus), I want to create a loop that allows me to get affiliation information for multiple authors.
Basically, this is the idea of my loop. How do I do that with many authors?
from pybliometrics.scopus import AuthorRetrieval
import pandas as pd
import numpy as np
au = AuthorRetrieval(authorid)
au.affiliation_history
au.identifier
x = au.identifier
refs2 = au.affiliation_history
len(refs2)
refs2
df = pd.DataFrame(refs2)
df.columns
a_history = df
df['authorid'] = x
#moving authorid to 0
cols = list(df)
cols.insert(0, cols.pop(cols.index('authorid')))
df = df.loc[:, cols]
df.to_excel("af_historyfinal.xlsx")
Turning your code into a loop over multiple author IDs? Nothing easier than that. Let's say AUTHOR_IDS equals 7004212771 and 57209617104:
import pandas as pd
from pybliometrics.scopus import AuthorRetrieval

def retrieve_affiliations(auth_id):
    """Author's affiliation history from Scopus as DataFrame."""
    au = AuthorRetrieval(auth_id)
    df = pd.DataFrame(au.affiliation_history)
    df["auth_id"] = au.identifier
    return df

AUTHOR_IDS = [7004212771, 57209617104]

# Option 1, for few IDs
df = pd.concat([retrieve_affiliations(a) for a in AUTHOR_IDS])

# Option 2, for many IDs (note: DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame()
for a in AUTHOR_IDS:
    df = df.append(retrieve_affiliations(a))

# Have author ID as first column
df = df.set_index("auth_id").reset_index()

df.to_excel("af_historyfinal.xlsx", index=False)
If, say, your IDs are in a comma-separated file called "input.csv", with one column called "authors", then you start with
AUTHOR_IDS = pd.read_csv("input.csv")["authors"].unique()
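Putting it together, a short sketch that feeds the CSV-derived IDs into Option 1 above (it assumes input.csv exists with an "authors" column of Scopus author IDs):

# Read the author IDs from the CSV and reuse retrieve_affiliations() from above.
AUTHOR_IDS = pd.read_csv("input.csv")["authors"].unique()
df = pd.concat([retrieve_affiliations(a) for a in AUTHOR_IDS])
df = df.set_index("auth_id").reset_index()
df.to_excel("af_historyfinal.xlsx", index=False)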