Hugging Face datasets: load_dataset gets stuck when loading Wikipedia - python

I am trying to run the following lines of code:
from datasets import load_dataset
wiki_data = load_dataset("wikipedia", language="en", date="20230101", beam_runner="DirectRunner")
However, the script gets stuck after it prints the message:
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args:
Also, while the script keeps running, I can see memory usage constantly increasing.
What could the problem be, and how can I solve it?
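For reference, one workaround I am considering is to load one of the pre-processed dumps instead of building the dated dump with Beam. This is just a sketch; the exact config name (I have seen "20220301.en" mentioned) may differ from what the Hub currently offers:
from datasets import load_dataset
# Pre-processed config: no Apache Beam step should be required here.
# "20220301.en" is the name I have seen referenced; check the dataset card
# for the configs that are actually available.
wiki_data = load_dataset("wikipedia", "20220301.en")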

Related

Fail to load a subpart of "open-images-v6" with Fiftyone

Context
I'm trying to retrieve a large amount of data to train a CNN.
More specifically, I'm looking for pictures of Swimming pools.
I have found a lot of them in the open-images-v6 database made by Google.
So now, I just want to download these particular images (I don't want 9 million images to end up in my download folder).
Problem
To do this, I carefully followed the instructions on the Download page (see: https://storage.googleapis.com/openimages/web/download.html).
So I installed "fiftyone" and tried out the "testing" procedure (i.e. loading the "quickstart" dataset and navigating through the data) without encountering any issues so far.
But when I tried to retrieve the Swimming pool images with the following code, I ran into a lot of issues:
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="validation",
    label_types="detections",
    classes="Swimming pool",
)
session = fo.launch_app(dataset)
I will skip right to the problem I couldn't figure out:
when I run the code, it properly downloads a bunch of .csv files, but when it tries to download the data (the images), it shows a pretty nasty-looking error:
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
What I have tried
After hours of searching for the origin of the error, I eventually discovered that it was somehow linked to AWS, but I have absolutely no clue what I can do on that front.
I saw a random tutorial on the internet that recommended installing "awscli" via pip, but nothing changed.
I tried to import other datasets with the same procedure (e.g. foz.load_zoo_dataset("coco-2017")) and it seemed to work (at least the download started, but I stopped it early).
Thank you for your time.
Thank you for the AWS hint; that finally got me on the right track.
FiftyOne uses Python's os.path.join(), which creates Windows-style paths when running on Windows. The S3 blob storage can't resolve those Windows paths and therefore raises the 404 error.
Since this is a bug in FiftyOne itself (I will create a PR to get it fixed), you will need to patch FiftyOne yourself for now.
Go to your Python site-packages directory, then open fiftyone/utils/openimages.py
In this file, add the following code to the import statements:
import re
Then search for the _download_images_if_necessary method and replace this line:
fp_download = os.path.join(split, image_id + ".jpg")
with this one:
fp_download = re.sub(r"\\", "/", os.path.join(split, image_id + ".jpg"))
This did fix the problem for me.
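If you prefer to avoid the regex, a plain string replacement should have the same effect (it is the same substitution, just without re):
fp_download = os.path.join(split, image_id + ".jpg").replace("\\", "/")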

CPU and RAM needed to run Python code on Azure

I'm trying to recreate one of the charts found on the Rt COVID-19 website, using a previously archived version of the code found here. I am using Spyder via Anaconda to run the Python scripts.
After cloning the repo, I created a project and attempted to run the following lines of code, pulled from this Jupyter notebook in the repo, which should output the tables needed to create the 'Oregon' graph.
import pymc3 as pm
import pandas as pd
import numpy as np
import arviz as az
from matplotlib import pyplot as plt
from covid.models.generative import GenerativeModel
from covid.data import get_and_process_covidtracking_data, summarize_inference_data

df = get_and_process_covidtracking_data(run_date=pd.Timestamp.today() - pd.Timedelta(days=1))
region = "OR"
model_data = df.loc[region]
gm = GenerativeModel(region, model_data)
gm.sample()
To see an example of the desired output, use the link to the Jupyter notebook referenced above.
The issue I am running into is that my computer is not powerful enough to run the NUTS sampler. Whereas the authors of the Jupyter notebook are able to run the sampler in about 7 minutes, my computer gives me an estimated run time of 4 hours and gets incredibly hot in the process. As such, I simply stop the model from running lest my computer explode into flames.
Some IT folks I know said they can create an instance in Azure to run these scripts, which would give me significantly more computing power, but they need to know how much CPU and RAM I need. Can anybody help me out with this? I only need to run the model once, for example to recreate the Oregon chart, rather than all 50 charts shown on the website. More generally, is running the model in a cloud computing environment really the right solution to this problem, and is it feasible?
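In case it helps with sizing the Azure instance, here is a rough sketch of how I could log peak memory and available cores while the sampler runs locally (psutil is an extra dependency, and this snippet is my own guess, not part of the repo):
import os
import threading
import time

import psutil

def log_peak_memory(interval=5):
    # Poll this process's resident memory every few seconds and print the
    # running peak, so the numbers can be used to choose a VM size.
    proc = psutil.Process(os.getpid())
    peak = 0
    while True:
        peak = max(peak, proc.memory_info().rss)
        print(f"peak RSS so far: {peak / 1e9:.2f} GB, logical cores: {psutil.cpu_count()}")
        time.sleep(interval)

threading.Thread(target=log_peak_memory, daemon=True).start()
gm.sample()  # run the sampler as before while the monitor thread logs usage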

PyCharm: Process finished with exit code -1073741515 (0xC0000135)

I've been trying to scrape data from websites using Selenium in Python (3.7.4, 32-bit).
The script runs through properly and is supposed to concatenate the columns (using NumPy) and then write the data frame to .csv files.
However, PyCharm (2018.3.7) gives me the following error code at some point during the data scraping:
Process finished with exit code -1073741515 (0xC0000135)
I could not find anything specific about the error code. Does anyone know why it would occur?

Jupyter notebook's response is slow when the code has many lines

I have a question about Jupyter notebooks.
When I copy and paste 663 lines of Python code into a Jupyter notebook,
it responds much more slowly than a notebook with just a few lines of code.
Has anyone experienced this issue?
Does anyone know a solution?
Without any information about your code it is really difficult to give you an answer.
However, try to keep your output under control: generating too much output in a single run can overwhelm the kernel.
Moreover, it does not make much sense to run almost 700 lines of code in a single cell; are you sure you're using the right tool?
Sometimes one piece of code can slow down the whole session; if you split your execution into smaller pieces over multiple cells, you will find where the real bottleneck is.
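For example, splitting the script across cells and timing each one with the built-in %%time cell magic quickly shows which chunk is slow (the loop below is just a stand-in for one chunk of your 663 lines):
%%time
# Put one chunk of the original script in this cell; the printed "Wall time"
# tells you how long this chunk takes on its own.
total = sum(x ** 2 for x in range(1_000_000))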
Add this to your notebook and then click on the link after you execute the cell. You can then track the progress of what's running and see which statements are causing it to be slow. You could also split the code up into multiple cells to see where the slowdown is occurring.
from IPython.core.display import display, HTML
from pyspark import SparkContext

# sc = SparkContext.getOrCreate()
sc = SparkContext()
spark_url = sc.uiWebUrl

display(HTML('''
<p>
<br />Spark connection is ready! Use this URL to monitor your Spark application!
</p>
<p>
{spark_url}
</p>'''.format(spark_url=spark_url)))

Implementing DCT2/DCT3 in Python

I am having some issues with my code for the final implementation of a data-to-image library using the JPEG DCT-II/DCT-III process. Linked below is the source code I am using; I am running the Python code on SageMathCloud. I've been trying to figure out this specific error for the past several hours, and no matter how I approach it, it just doesn't work. I get the same error message every time, and I just can't track down the reason why.
Gist
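For comparison while debugging, the behaviour the JPEG step ultimately has to reproduce is the standard DCT-II / DCT-III round trip, which scipy can demonstrate in a few lines (this is just a reference sketch, not the code from the gist):
import numpy as np
from scipy.fftpack import dct

# 8x8 stand-in for a JPEG block
block = np.arange(64, dtype=float).reshape(8, 8)

# 2-D DCT-II: apply the 1-D transform along both axes
coeffs = dct(dct(block, type=2, norm="ortho", axis=0), type=2, norm="ortho", axis=1)

# 2-D DCT-III is the exact inverse when norm="ortho"
restored = dct(dct(coeffs, type=3, norm="ortho", axis=0), type=3, norm="ortho", axis=1)

assert np.allclose(block, restored)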
