load dataset into rdd from website in spark - python

I am trying to load a dataset from a tar.gz file on a website in PySpark:
dataset = spark.sparkContext.textFile('https://www.example/example.tar.gz') (the url is just an example)
and then
dataset.collect()
but I get an error.

You cannot load files into core Spark directly from a website. You have to download the files from the website to your local file system and load them as follows:
dataset = spark.sparkContext.textFile("file:///your/local/file/path")
or, after placing the files in HDFS, using:
dataset = spark.sparkContext.textFile("your hdfs file path")
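For completeness, here is a minimal sketch of that workflow (the URL and paths are placeholders, and spark is the existing SparkSession from the question). Note that textFile() cannot unpack a tar.gz archive, so the files have to be extracted before Spark reads them:
import tarfile
import urllib.request
url = "https://www.example.com/example.tar.gz"  # placeholder URL
archive_path = "/tmp/example.tar.gz"
extract_dir = "/tmp/example_data"
# Download the archive to the local file system first...
urllib.request.urlretrieve(url, archive_path)
# ...then extract it, since textFile() expects plain text files.
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(extract_dir)
# Now Spark can read the extracted files from the local file system.
dataset = spark.sparkContext.textFile(f"file://{extract_dir}")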

Related

(Python) Unable to import Excel file using Colab

I have tried to import a CSV file from my local disk but failed:
import pandas as pd
df = pd.read_csv(r'C:/Users/Username/Desktop/da/SeoulBikeData.csv', encoding="gbk")
print(df)
error:
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Username/Desktop/da/SeoulBikeData.csv'
First you need to upload the file into the local directory of Colab: in the sidebar, click on the folder icon, then click the upload icon and upload the file you want to read.
import pandas as pd
df = pd.read_csv(r'fummy.csv', encoding="gbk")
df
Finally your file will be loaded.
But there is a limitation: whenever you close your notebook on Colab, you have to do the whole process again.
A better way is to upload the file to Drive, mount the drive, and load the file from Drive:
from google.colab import drive
drive.mount('/content/drive')
After that:
df = pd.read_csv("copy path file here")
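For example, if the file sits at the top level of My Drive (a hypothetical location; Colab mounts Drive under /content/drive), the call would look like:
df = pd.read_csv("/content/drive/MyDrive/SeoulBikeData.csv", encoding="gbk")
df.head()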
A third way to load a file in Google Colab is:
import io
from google.colab import files
uploaded = files.upload()  # opens a file picker; returns {filename: bytes}
df2 = pd.read_csv(io.BytesIO(uploaded['fummy.csv']))
Click on "Choose Files" and select the file from its directory.
Try uploading the file first. Then, to read it, go to the file in the Files panel, copy its path, and try the following code:
from google.colab import files
uploaded = files.upload()
df = pd.read_csv("file_path_file_name")
df.head()
You cannot import Excel files from a local directory directly in Google Colab. You have to upload the Excel file into the Colab directory, and then you can import it.
In Google Colab you have to upload it into the sample_data folder, but the file will be lost after you close the session. An alternative is a local Jupyter Notebook, where you can do everything you do in Google Colab but offline.
But if you want to use Google Colab, here is where you need to upload:
Files >> sample_data >> upload files

Colab: Have you specified the `path` option in the configuration file /root/.pylidcrc?

I am trying to use pylidc in Google Colab and I receive this error when I try to use its functions:
"RuntimeError: Could not establish path to dicom files. Have you specified the `path` option in the configuration file /root/.pylidcrc?"
I need to edit/create the pylidc configuration file located somewhere in the root folder, but this folder cannot be opened from Colab. How do I configure a file that is in the root folder without being able to access it?
Here is the documentation of pylidc: https://pylidc.github.io/install.html
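One possible workaround (a sketch, not a confirmed fix): a Colab cell can write to /root with open() even though the file browser does not show that folder. The [dicom]/path layout follows the linked pylidc install docs; the DICOM directory below is a placeholder for wherever your LIDC-IDRI data actually lives:
# Write the config file programmatically; open() can reach /root even
# though Colab's file browser does not display it.
config = "[dicom]\npath = /content/drive/MyDrive/LIDC-IDRI\nwarn = True\n"
with open("/root/.pylidcrc", "w") as f:
    f.write(config)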

Cannot read file from NAS

I am trying to read an excel file from a NAS using Jupyter Notebook (macOS, Python 3, SynologyDS218+).
Script worked absolutely fine when the file was stored locally, but I cannot determine the correct code or file path adjustment to access the file once moved to the NAS.
I am logged into the NAS and from Mac Finder the file path is:
Server: smb://NAS/home/folder/file.xlsx
I have reviewed...
How to read an excel file directly from a Server with Python
Python - how to read path file/folder from server
https://apple.stackexchange.com/questions/337472/how-to-access-files-on-a-networked-smb-server-with-python-on-macos
... and tried numerous code variations as a result, but with no success.
I am using:
pd.read_excel("//NAS/home/folder/file.xlsx", sheet_name='total', header=84, index_col=0, usecols='B,AL,DC', skiprows=0, parse_dates=True).dropna()
But regardless of the code/file path variation, the same error is returned:
FileNotFoundError: [Errno 2] No such file or directory: '//NAS/home/folder/file.xlsx'
I finally located the correct code/file path adjustment necessary to read the file. See https://apple.stackexchange.com/questions/349102/accessing-a-nas-device-from-python-code-on-mac-os
Although the drive was mapped and Mac Finder indicated a file path of "smb://NAS/home/folder/file.xlsx", copying the file path from Finder to the clipboard with Control+Command+C instead returned "/Volumes/home/folder/file.xlsx". Using this file path located the file and allowed it to be read.
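Putting that together with the read_excel call from the question (same parameters, only the path changed to the mounted /Volumes location):
import pandas as pd
# Once the SMB share is mounted, macOS exposes it under /Volumes.
df = pd.read_excel(
    "/Volumes/home/folder/file.xlsx",
    sheet_name='total',
    header=84,
    index_col=0,
    usecols='B,AL,DC',
    skiprows=0,
    parse_dates=True,
).dropna()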

Accessing external file in Python UDF

I am using Hive with a Python UDF. I defined a SQL file in which I add the Python UDF and call it. So far so good, and I can process my query results using my Python function.
However, I now have to use an external .txt file in my Python UDF. I uploaded that file to my cluster (the same directory as the .sql and .py files) and I also added it in my .sql file using this command:
ADD FILE /home/ra/stopWords.txt;
When I open this file in my Python UDF like this:
file = open("/home/ra/stopWords.txt", "r")
I get several errors. I cannot figure out how to add nested files and use them in Hive.
Any idea?
All added files are located in the current working directory (./) of the UDF script.
If you add a single file using ADD FILE /dir1/dir2/dir3/myfile.txt, its path will be
./myfile.txt
If you add a directory using ADD FILE /dir1/dir2, the file's path will be
./dir2/dir3/myfile.txt
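So in this case the open() call inside the UDF should use the relative name rather than the absolute cluster path. A minimal sketch:
# stopWords.txt was shipped with ADD FILE, so it appears in the UDF's
# working directory under its base name.
with open("./stopWords.txt", "r") as f:
    stop_words = set(line.strip() for line in f)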

How to save file as a reg hive file when saving it in a script?

I have a Python script that exports a file, but there is a problem with the result: I can't figure out what extension I have to save it as to make it a reg hive file. I am exporting the SAM key and its contents from the Windows registry with a Python script. How can I save it as a reg hive file from a script?
Here is the code:
import os
os.system(r"reg export HKEY_LOCAL_MACHINE\SAM C:\Export.*")
I don't know what extension to put after C:\Export.
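For what it's worth, a hedged sketch of one likely fix: reg export always writes a text-format .reg file, whereas reg save writes a binary registry hive, which may be what "reg hive file" means here. The output name C:\Export.hiv is a placeholder (.hiv is a convention, not a requirement), and reg save must be run from an elevated prompt:
import os
# reg save writes the key as a binary hive file rather than a text
# .reg export; requires administrator privileges.
os.system(r"reg save HKEY_LOCAL_MACHINE\SAM C:\Export.hiv")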
