merge json results and use properties as pandas columns - python

I have a JSON file and I'd like to merge the result objects together and have a dataframe that uses "properties" as columns (ID, Title, Site, Last edited time, Link, Status).
Here is what I tried:
import pandas as pd
import json
data = json.load(open('db.json',encoding='utf-8'))
df = pd.DataFrame(data["results"])
df2 = pd.DataFrame(df["properties"])
print(df2)
Here is the json: https://dpaste.com/GV94XD64Y
Here is the result I am expecting:
Site Last edited time Link Status ID Title
0 stackoverflow.com 2023-01-16T20:44:00.000Z https://stackoverflow.com None 1 page 0
1 stackoverflow.com 2023-01-16T20:44:00.000Z https://stackoverflow.com None 1 page 1

You can apply pd.Series on properties:
df2 = df.properties.apply(pd.Series)
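If the "properties" values are flat dicts, pd.json_normalize over the results list gets to the same frame in one step. A small sketch with made-up records, since the linked JSON isn't reproduced here:

```python
import pandas as pd

# toy stand-in for data["results"]; the real JSON is behind the dpaste link
results = [
    {"id": "a", "properties": {"ID": 1, "Title": "page 0", "Site": "stackoverflow.com"}},
    {"id": "b", "properties": {"ID": 1, "Title": "page 1", "Site": "stackoverflow.com"}},
]

df = pd.DataFrame(results)
df2 = df["properties"].apply(pd.Series)   # one column per property key
print(df2)

# json_normalize flattens nested keys, prefixing them with "properties."
df3 = pd.json_normalize(results)
df3.columns = [c.replace("properties.", "") for c in df3.columns]
```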

Read tables from HTML page by changing the ID using python

I am using the html link below to read the table in the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664
The last part of the link (allbin) is an ID. This ID changes, and by using different IDs you can access different tables and records. The link otherwise remains the same; just the ID at the end changes every time. I have about 1000 more different IDs like this. So, how can I actually use those different IDs to access different tables and join them together?
Output Like this,
ID Number Type FileDate
2016664 NB 14581-26 New Building 12/21/2020
4257909 NB 1481-29 New Building 3/6/2021
4138920 NB 481-29 New Building 9/4/2020
List of other ID for use:
['4257909', '4138920', '4533715']
This was my attempt, I can read a single table with this code.
import requests
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
df
To get all pages from the list of IDs you can use the next example:
import requests
import pandas as pd
from io import StringIO
url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"
def get_info(ID, page=1):
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            df = pd.read_html(StringIO(t))[3].loc[1:, :]
            if len(df) == 0:
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            page += 25
        except requests.exceptions.ReadTimeout:
            print("Timeout...")
            continue
    return out
list_of_ids = [2016664, 4257909, 4138920, 4533715]
dfs = []
for ID in list_of_ids:
    dfs.extend(get_info(ID))
df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)
Prints:
NUMBER NUMBER TYPE FILE DATE ID
1 ALT 1469-1890 NaN ALTERATION 00/00/0000 2016664
2 ALT 1313-1874 NaN ALTERATION 00/00/0000 2016664
3 BN 332-1938 NaN BUILDING NOTICE 00/00/0000 2016664
4 BN 636-1916 NaN BUILDING NOTICE 00/00/0000 2016664
5 CO NB 1295-1923 (PDF) CERTIFICATE OF OCCUPANCY 00/00/0000 2016664
...
And saves data.csv (screenshot from LibreOffice):
The code below will extract all the tables in a web page
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
df_list = pd.read_html(url)  # returns a list of dataframes from the web page
print(len(df_list))  # print the number of dataframes
for df in df_list:  # loop through the list to print all tables
    print(df)
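As an aside, if you know a header cell that appears in the target table, read_html's match argument can select it instead of hard-coding an index like df_list[3]. A sketch with inline HTML standing in for the page (the real code would pass the URL):

```python
import pandas as pd
from io import StringIO

# inline HTML stand-in for the page; hypothetical table contents
html = """
<table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
<table><tr><th>Number</th><th>Type</th></tr>
<tr><td>NB 14581-26</td><td>New Building</td></tr></table>
"""

# match= keeps only tables whose text matches the given pattern
df_list = pd.read_html(StringIO(html), match="Number")
df = df_list[0]
print(df)
```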

Python Pandas Dataframe from API JSON Response

I am new to Python. Can I please seek some help from the experts here?
I wish to construct a dataframe from the https://api.cryptowat.ch/markets/summaries JSON response, based on the following filter criteria:
Kraken-listed currency pairs (please take note, there are kraken-futures; I don't want those)
Currency paired with USD only, i.e. aaveusd, adausd...
The ideal dataframe I am looking for is (somehow Excel loads this JSON perfectly; screenshot below):
[screenshot: Dataframe_Excel_Screenshot]
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
kraken_assets = resp.json()
df = pd.json_normalize(kraken_assets)
print(df)
Output:
result.binance-us:aaveusd.price.last result.binance-us:aaveusd.price.high ...
0 264.48 267.32 ...
[1 rows x 62688 columns]
When I just paste the link in a browser, the JSON response comes with double quotes ("), but when I get it via Python code all double quotes (") are changed to single quotes ('). Any idea why? I tried to solve it with json_normalize, but then the response is changed to [1 rows x 62688 columns]. I am not sure how to even go about working with 1 row and 62k columns, and I don't know how to extract the exact info in the dataframe format I need (please see the Excel screenshot).
Any help is much appreciated. thank you!
The result JSON is a dict, so:
- load it into a dataframe
- decode the columns into products & measures
- filter to the required data
import requests
import pandas as pd
import numpy as np
# load results into a data frame
df = pd.json_normalize(requests.get("https://api.cryptowat.ch/markets/summaries").json()["result"])
# columns are encoded as product and measure. decode columns and transpose into rows that include product and measure
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product","measure"])
df = df.T
# finally filter down to required data and structure measures as columns
df.loc[df.index.get_level_values("product").str[:7]=="kraken:"].unstack("measure").droplevel(0,1)
sample output
product         price.last  price.high  price.low   price.change.percentage  price.change.absolute  volume       volumeQuote
kraken:aaveaud  347.41      347.41      338.14      0.0274147                9.27                   1.77707      613.281
kraken:aavebtc  0.008154    0.008289    0.007874    0.0219326                0.000175               403.506      3.2797
kraken:aaveeth  0.1327      0.1346      0.1327      -0.00673653              -0.0009                287.113      38.3549
kraken:aaveeur  219.87      226.46      209.07      0.0331751                7.06                   1202.65      259205
kraken:aavegbp  191.55      191.55      179.43      0.030559                 5.68                   6.74476      1238.35
kraken:aaveusd  259.53      267.48      246.64      0.0339841                8.53                   3623.66      929624
kraken:adaaud   1.61792     1.64602     1.563       0.0211692                0.03354                5183.61      8366.21
kraken:adabtc   3.757e-05   3.776e-05   3.673e-05   0.0110334                4.1e-07                252403       9.41614
kraken:adaeth   0.0006108   0.00063     0.0006069   -0.0175326               -1.09e-05              590839       367.706
kraken:adaeur   1.01188     1.03087     0.977345    0.0209986                0.020811               1.99104e+06  1.98693e+06
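The decode-and-filter steps in the answer above can be checked on a tiny synthetic result dict (hypothetical keys and values, same shape as the API response):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the API's "result" mapping
result = {
    "kraken:aaveusd": {"price": {"last": 259.53, "high": 267.48}, "volume": 3623.66},
    "binance:btcusd": {"price": {"last": 40000.0, "high": 41000.0}, "volume": 12.0},
}

df = pd.json_normalize(result)  # columns like "kraken:aaveusd.price.last"
# split each column name once: product before the first dot, measure after it
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product", "measure"])
df = df.T

# keep kraken products only, then pivot measures back into columns
kraken = (df.loc[df.index.get_level_values("product").str[:7] == "kraken:"]
            .unstack("measure")
            .droplevel(0, 1))
print(kraken)
```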
Hello, try the below code. I have understood the structure of the dataset and modified it to get the desired output.
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
a = resp.json()
# creating a dataframe from key='result'
da = pd.DataFrame(a['result'])
# using transpose to get the required columns and index
da = da.transpose()
# the price column contains a dict which needs to become separate columns on the data frame
db = da['price'].to_dict()
da.drop('price', axis=1, inplace=True)
# initialising a separate data frame for price
z = pd.DataFrame({})
for i in db.keys():
    i = pd.DataFrame(db[i], index=[i])
    z = pd.concat([z, i], axis=0)
da = pd.concat([z, da], axis=1)
da.to_excel('nex.xlsx')
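The price-dict expansion done by the loop above can also be written with apply(pd.Series), shown here on a small made-up result dict:

```python
import pandas as pd

# made-up stand-in for a['result']
result = {
    "kraken:aaveusd": {"price": {"last": 259.53, "high": 267.48, "low": 246.64}, "volume": 3623.66},
    "kraken:adausd":  {"price": {"last": 1.30,   "high": 1.35,   "low": 1.25},   "volume": 5183.61},
}

da = pd.DataFrame(result).transpose()          # one row per pair
prices = da["price"].apply(pd.Series)          # expand each price dict into columns
da = pd.concat([prices, da.drop(columns="price")], axis=1)
print(da)
```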

Python, extracting a string between two specific characters for all rows in a dataframe

I am currently trying to write a function that will extract the string between two specific characters.
My data set contains emails only, which look like this: pstroulgerrn#time.com.
I am trying to extract everything after the # and everything before the ., so that the email listed above would output time.
Here is my code so far:
new = df_personal['email']  # 1000x1 dataframe of emails

def extract_company(x):
    y = x[x.find('#')+1 : x.find('.')]
    return y

extract_company(new)
Note : If I change new to df_personal['email'][0] the correct output is displayed for that row.
However, when trying to do it for the entire dataframe, I get an error saying :
AttributeError: 'Series' object has no attribute 'find'
You can extract a series of all matching texts using regex:
import pandas as pd
df = pd.DataFrame( ['kabawonga#something.whereever','kabawonga#omg.whatever'])
df.columns = ['email']
print(df)
k = df["email"].str.extract(r"#(.+)\.")
print(k)
Output:
# df
email
0 kabawonga#something.whereever
1 kabawonga#omg.whatever
# extraction
0
0 something
1 omg
See pandas.Series.str.extract
Try:
df_personal["domain"]=df_personal["email"].str.extract(r"\#([^\.]+)\.")
Outputs (for the sample data):
import pandas as pd
df_personal=pd.DataFrame({"email": ["abc#yahoo.com", "xyz.abc#gmail.com", "john.doe#aol.co.uk"]})
df_personal["domain"]=df_personal["email"].str.extract(r"\#([^\.]+)\.")
>>> df_personal
email domain
0 abc#yahoo.com yahoo
1 xyz.abc#gmail.com gmail
2 john.doe#aol.co.uk aol
You can do it with an apply function, by first splitting by a . and then by # for each of the row:
Snippet:
import pandas as pd
df = pd.DataFrame( ['abc#xyz.dot','def#qwe.dot','def#ert.dot.dot'])
df.columns = ['email']
df["domain"] = df["email"].apply(lambda x: x.split(".")[0].split("#")[1])
Output:
df
Out[37]:
email domain
0 abc#xyz.dot xyz
1 def#qwe.dot qwe
2 def#ert.dot.dot ert

Creating a pandas DataFrame from list of lists containing data from bs4 object

I'm looking to create a DataFrame containing data scraped from a website. The data is placed into two lists - Job title and URL that links to the job application page. My aim is to then pass them into a list to create a DataFrame as demonstrated by https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/
list_job_titles = []
list_job_URLs = []
for a in soup.find_all('a', href=re.compile("work-placement-internship")):
    URL_from_soup = a['href'] + " "
    title_from_soup = a.text.strip()
    list_job_titles.append(title_from_soup)
    list_job_URLs.append(URL_from_soup)
    time.sleep(0.1)
data = [[list_job_titles], [list_job_URLs]]
df = pd.DataFrame(data, columns=['Job title', 'URL'])
I have tested the web scraping aspect of the script and it obtains all desired information from the site. However when it comes to creating the DataFrame I get the error:
ValueError: 2 columns passed, passed data had 1 columns
I have then tried passing in one column header:
df = pd.DataFrame(data, columns=['Job title'])
To which I get the output:
Job Title
0 [Some job title...
1 [https://someURL...
Any idea how to split this into 2 columns, one for the title and one for the URL?
Cheers
Try this:
Replace:
df = pd.DataFrame(data, columns=['Job title', 'URL'])
With:
df = pd.DataFrame({"Job title": list_job_titles, "URL": list_job_URLs})
Try something like this:
df = pd.DataFrame({"Job Title": list_job_titles, "Job URLs": list_job_URLs})
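Either way, the fix is to pass the two flat lists rather than nesting each inside another list. You can equivalently zip them into rows; a quick check with stand-in values (hypothetical titles and URLs, since the scraped data isn't shown):

```python
import pandas as pd

# stand-in lists; the real ones come from the soup loop in the question
list_job_titles = ["Intern - Data", "Intern - Web"]
list_job_URLs = ["https://example.com/a ", "https://example.com/b "]

# zip pairs the i-th title with the i-th URL to form one row each
df = pd.DataFrame(list(zip(list_job_titles, list_job_URLs)),
                  columns=["Job title", "URL"])
print(df)
```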

splitting url and getting values from that URl in columns

Hi, say I have a column in a data frame named submission that contains:
mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip
I want one column, say Zest, with the value DDA1610095 in it, and a new column, say Type, with .zip in it. How can I do that using pandas?
You can use str.split to extract the zip file name from the url:
df
url
0 mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip
df['zip'] = df.url.str.split('/',expand=True).T[0] \
[df.url.str.split('/',expand=True).T.shape[0]-1]
df.T
Out[46]:
0
url mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip
zip DDA1610095.zip
Try using str.split and add another str so you can index each row.
data = [{'ID': '1',
         'URL': 'https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip'}]
df = pd.DataFrame(data)
print(df)
ID URL
0 1 https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d...
# Get the file name and strip '.zip' (regex=False, since '.' is a regex metacharacter)
df['Zest'] = df.URL.str.split('/').str[-1].str.replace('.zip', '', regex=False)
#assign the type into the next column.
df['Type'] = df.URL.str.split('.').str[-1]
print(df)
ID URL Zest Type
0 1 https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d... DDA1610095 zip
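An alternative is a single str.extract with two capture groups, which also keeps the leading dot the question asked for (a sketch on the same sample URL):

```python
import pandas as pd

df = pd.DataFrame({"URL": ["https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip"]})
# group 1: file stem after the last '/'; group 2: the extension including its dot
df[["Zest", "Type"]] = df["URL"].str.extract(r"/([^/]+)(\.[^./]+)$")
print(df)
```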
