Fast Parsing a huge 12 GB JSON file with Python - python

I have a 12 GB JSON file in which every line contains information about a scientific paper.
I want to parse it and create 3 pandas dataframes that contain information about venues, authors, and how many times an author has published in a venue. Below you can see the code I have written. My problem is that this code takes many days to run. Is there a way to make it faster?
import json
import ijson
import pandas as pd

venues = pd.DataFrame(columns=['id', 'raw', 'type'])
authors = pd.DataFrame(columns=['id', 'name'])
main = pd.DataFrame(columns=['author_id', 'venue_id', 'number_of_times'])

with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    papers = ijson.items(infile, 'item')
    for paper in papers:
        if 'id' not in paper["venue"]:
            if 'type' not in paper["venue"]:
                venues = venues.append({'raw': paper["venue"]["raw"]}, ignore_index=True)
            else:
                venues = venues.append({'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        else:
            venues = venues.append({'id': paper["venue"]["id"], 'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        paper_authors = paper["authors"]
        paper_authors_json = json.dumps(paper_authors)
        obj = ijson.items(paper_authors_json, 'item')
        for author in obj:
            authors = authors.append({'id': author["id"], 'name': author["name"]}, ignore_index=True)
            main = main.append({'author_id': author["id"], 'venue_raw': venues.iloc[-1]['raw'], 'number_of_times': 1}, ignore_index=True)

authors = authors.drop_duplicates(subset=None, keep='first', inplace=False)
venues = venues.drop_duplicates(subset=None, keep='first', inplace=False)
main = main.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False).sum()

Apache Spark can read JSON files in multiple chunks in parallel, which makes this much faster:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
For a regular multi-line JSON file, set the multiLine parameter to True.
If you're not familiar with Spark, you can use a pandas-compatible layer on top of Spark called Koalas:
https://koalas.readthedocs.io/en/latest/
Koalas read_json call:
https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_json.html
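A minimal PySpark sketch of that idea, assuming the file is line-delimited JSON (one paper per line) and that the records have the same venue/authors fields as in the question; the names venues, authors and main are just illustrative:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One JSON record per line lets Spark split the file and read it in parallel;
# use .option("multiLine", True) instead if the file is one big JSON document.
papers = spark.read.json(r'C:\Users\dintz\Documents\test.json')

# Distinct venues and authors.
venues = papers.select("venue.id", "venue.raw", "venue.type").distinct()
authors = (papers.select(F.explode("authors").alias("a"))
                 .select("a.id", "a.name").distinct())

# How many times each author published in each venue.
main = (papers
        .select(F.explode("authors").alias("a"), F.col("venue.raw").alias("venue_raw"))
        .groupBy(F.col("a.id").alias("author_id"), "venue_raw")
        .count())

# Collect to pandas only if the aggregated results fit in memory.
venues_pd, authors_pd, main_pd = venues.toPandas(), authors.toPandas(), main.toPandas()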

You are using the wrong tool for this task; do not use pandas for this scenario.
Let's look at the last 3 lines of your code: they are simple and clean, but filling that data into pandas dataframes in the first place is not so easy when you cannot use a pandas input function such as read_json() or read_csv().
I prefer to use pure Python for this simple task. If your PC has sufficient memory, use a dict to collect the unique authors and venues, use itertools.groupby for the grouping, and use more_itertools.ilen to calculate the count (a fuller sketch follows the snippet below).
authors = {}
venues = {}
for paper in papers:
    venues[paper["venue"]["id"]] = (paper["venue"]["raw"], paper["venue"]["type"])
    for author in paper["authors"]:
        authors[author["id"]] = author["name"]
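A fuller sketch of this approach, streaming the same file with ijson as in the question and using itertools.groupby plus more_itertools.ilen for the counts; it assumes every paper has venue.id, venue.raw, venue.type and an authors list:
import ijson
from itertools import groupby
from more_itertools import ilen

authors = {}
venues = {}
pairs = []  # one (author_id, venue_id) entry per authorship

with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    for paper in ijson.items(infile, 'item'):
        venue = paper["venue"]
        venues[venue["id"]] = (venue["raw"], venue["type"])
        for author in paper["authors"]:
            authors[author["id"]] = author["name"]
            pairs.append((author["id"], venue["id"]))

# groupby only groups consecutive items, so sort first; ilen counts each group
# without materialising it as a list.
pairs.sort()
number_of_times = {key: ilen(group) for key, group in groupby(pairs)}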

Related

How can I align columns if rows have different number of values?

I am scraping data with Python. I get a CSV file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve. Sometimes the scraped items have two statuses and sometimes just one. The second status then shifts the other values in the columns to the right, and as a result the dates are not all in the same column, which would be useful for sorting the rows.
Do you have any idea how to make the columns merge if there are two statuses, for example, or any other solutions?
Maybe it is also an issue that I still need to separate the values into columns manually in Excel.
Here is my code:
#call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd

# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)

# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page=' + str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)

#create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
make the columns merge if there are two statuses for example or other solutions
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu/{link}'
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a ')
    ]
and then, since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looked as expected.
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever-present fields as the first three columns, followed by the status + date range columns as pairs. Finally, I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                divs[0].text, divs[4].text.split("\n")[1]]

initiatives = [processDiv(item) for item in initiative_item]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).

Get only the name of a DataFrame - Python - Pandas

I'm actually working on an ETL project with crappy data that I'm trying to get right.
For this, I'm trying to create a function that would take the names of my DFs and export them to CSV files that would be easy for me to deal with in Power BI.
I've started with a function that will take my DFs and clean the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if s.__contains__("Fact"):
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = x[['periode_facture']].apply(lambda x: pd.to_datetime(x, format='%Y%m'))
If I don't set 'x' as a DataFrame, it doesn't work but that's not my problem.
As you can see, I have set a list variable which I would like to increment with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet so... There it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (probably, when you are creating them / reading data into them)
Then you can access the name like a normal attribute
import pandas as pd
df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
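Putting that together with the export function from the question, a minimal sketch (the dataframe contents and names here are made up):
import pandas as pd

# Name each dataframe as you create it.
factures = pd.DataFrame(data=[1, 2, 3])
factures.name = 'factures'

clients = pd.DataFrame(data=['a', 'b'])
clients.name = 'clients'

df_liste = [factures, clients]

def export(dfs):
    for df in dfs:
        # .name is a plain attribute set above; pandas does not carry it through
        # operations that return new dataframes, so set it right before exporting.
        df.to_csv(f'{df.name}.csv', encoding='utf-8')

export(df_liste)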

Assemble a dataframe from two csv files

I wrote the following code to form a data frame containing the energy consumption and the temperature. The data for each of the variables is collected from a different csv file:
def match_data():
    pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
    temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
    new_time = []
    new_pwr = []
    new_tmp = []
    for i in range(1, len(pwr_data)):
        for j in range(1, len(temp_data)):
            if pwr_data['time'][i] == temp_data['Date'][j]:
                time = pwr_data['time'][i]
                pwr = pwr_data['watt_hour'][i]
                tmp = temp_data['Temp'][j]
                new_time.append(time)
                new_pwr.append(pwr)
                new_tmp.append(tmp)
    return pd.DataFrame({'Time': new_time, 'watt_hour': new_pwr, 'Temp': new_tmp})
I was trying to collect data with matching time indices so that I can assemble them in a data frame.
The code works well, but it takes time (43 seconds for around 1300 data points). At the moment I don't have much data, but I was wondering if there is a more efficient and faster way to do this.
Do the pwr_data['time'] and temp_data['Date'] columns have the same granularity?
If so, you can pd.merge() the two dataframes after reading them.
# read data
pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
# merge data on time and Date columns
# you can set the how to be 'inner' or 'right' depending on your needs
df = pd.merge(pwr_data, temp_data, how='left', left_on='time', right_on='Date')
Just as @greco recommended, this did the trick, and in no time!
pd.merge(pwr_data,temp_data,how='inner',left_on='time',right_on='Date')
'time' and 'Date' are the columns on which you want to base the merge.

How to read empty delta partitions without failing in Azure Databricks?

I'm looking for a workaround. Sometimes our automated framework will read delta partitions that do not exist. It fails because there are no parquet files in the partition.
I don't want it to fail.
What I do then is:
spark_read.format('delta').option("basePath", location) \
    .load('/mnt/water/green/date=20221209/object=34')
Instead, I want it to return an empty dataframe, a dataframe with no records.
I did that with the code below, but found it a bit cumbersome, and was wondering if there is a better way.
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.strip('/').split("/")
for folder_pruning_token in folder_partition:
    folder_pruning_token_split = folder_pruning_token.split("=")
    column_name = folder_pruning_token_split[0]
    column_value = folder_pruning_token_split[1]
    df = df.filter(df[column_name] == column_value)
You really don't need to do that trick with Delta Lake tables. This trick was primarily used for Parquet and other file formats to avoid scanning files on HDFS or cloud storage, which is very expensive.
You just need to load data, and filter data using where/filter. It's similar to what you do:
df = spark_read.format('delta').load(location) \
.filter("date = '20221209' and object = 34")
If you need to, you can of course extract those values automatically, with maybe slightly simpler code:
df = spark_read.format('delta').load(location)
# strip the leading '/' so every token is a key=value pair
folder_partition = '/date=20221209/object=34'.strip('/').split("/")
cols = [f"{s[0]} = '{s[1]}'"
        for s in [f.split('=') for f in folder_partition]]
df = df.filter(" and ".join(cols))

Pandas DF.output write to columns (current data is written all to one row or one column)

I am using Selenium to extract data from the HTML body of a webpage and am writing the data to a .csv file using pandas.
The data is extracted and written to the file; however, I would like to control the formatting so that the data is written to specified columns. After reading many threads and docs I am not able to understand how to do this.
The current CSV file output is as follows, with all the data in one row or one column:
0,
B09KBFH6HM,
dropdownAvailable,
90,
1,
B09KBNJ4F1,
dropdownAvailable,
100,
2,
B09KBPFPCL,
dropdownAvailable,
110
or, if I use the [count] count += 1 method, it will all be in one row:
0,B09KBFH6HM,dropdownAvailable,90,1,B09KBNJ4F1,dropdownAvailable,100,2,B09KBPFPCL,dropdownAvailable,110
I would like the output to be formatted as follows,
/col1 /col2 /col3 /col4
0, B09KBFH6HM, dropdownAvailable, 90,
1, B09KBNJ4F1, dropdownAvailable, 100,
2, B09KBPFPCL, dropdownAvailable, 110
I have tried using the columns= option but I get errors in the terminal, and I don't understand from the append docs which feature I should use to achieve this:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html?highlight=append#pandas.DataFrame.append
A simplified version is as follows
from selenium import webdriver
import pandas as pd

price = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    price.append(element.get_attribute("value"))
    price.append(element.get_attribute("class"))
    price.append(element.get_attribute("data-a-html-content"))
output = pd.DataFrame(price)
output.to_csv("Data.csv", encoding='utf-8-sig')
driver.close()
Do I need to parse each item separately and append?
I would like each of the .get_attribute values to be written to a new column.
Is there any advice you can offer, as I am not very proficient with pandas? Thank you for your help.
An approach similar to @user17242583's, but a little shorter:
data = [[e.get_attribute("value"), e.get_attribute("class"), e.get_attribute("data-a-html-content")] for e in options]
df = pd.DataFrame(data, columns=['ASIN', 'dropdownAvailable', 'size']) # third column maybe is the product size
df.to_csv("Data.csv", encoding='utf-8-sig')
Adding all your items to the price list is going to cause them all to be in one column. Instead, store separate lists for each column, in a dict, like this (name them whatever you want):
data = {
    'values': [],
    'classes': [],
    'data_a_html_contents': [],
}
...
for element in options:
    data['values'].append(element.get_attribute("value"))
    data['classes'].append(element.get_attribute("class"))
    data['data_a_html_contents'].append(element.get_attribute("data-a-html-content"))
...
output = pd.DataFrame(data)
output.to_csv("Data.csv", encoding='utf-8-sig')
You were collecting the value, class and data-a-html-content and appending them to the same list price. Hence, the list becomes:
price = [value1, class1, data-a-html-content1, value2, class2, data-a-html-content2, ...]
Hence, within the dataframe all of these values end up in a single column.
Solution
To get value, class and data-a-html-content in separate columns you can adopt either of the two approaches below:
Pass a dictionary to the dataframe.
Pass a list of lists to the dataframe.
While @user17242583 and @h.devillefletcher suggest a dictionary, you can still achieve the same using a list of lists as follows:
values = []
classes = []
data_a_html_contents = []

driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    values.append(element.get_attribute("value"))
    classes.append(element.get_attribute("class"))
    data_a_html_contents.append(element.get_attribute("data-a-html-content"))
df = pd.DataFrame(data=list(zip(values, classes, data_a_html_contents)), columns=['Value', 'Class', 'Data-a-Html-Content'])
df.to_csv("Data.csv", encoding='utf-8-sig')
References
You can find a couple of relevant detailed discussions in:
Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe
Python Selenium: How do I print the values from a website in a text file?
