PySpark write fails because the file doesn't exist

I am trying to write to a file in PySpark, but I get an error saying the file doesn't exist. I'm new to PySpark.
I have this code for the write:
result.repartition(1).write.partitionBy('client', 'payload_type').json(OUTPUT_PATH, mode='append')
Is it possible to add a parameter that forces the folder/file to be created if it doesn't exist?

I made a mistake: the error is not on the line in the post but on this one:
existing_data = spark.read.json(OUTPUT_PATH)
with:
OUTPUT_PATH = f"s3a://{BUCKET_DEST}/{SOURCE}/"
On the first execution, neither folder exists. Can I force them to be created on the read if they don't exist?

Related

Importing DataFrames - no such file or directory

I am trying to import a csv file as a dataframe into a Jupyter notebook.
rest_relation = pd.read_csv('store_id_relation.csv', delimiter=',')
But I get this error
FileNotFoundError: [Errno 2] No such file or directory: 'store_id_relation.csv'
The store_id_relation.csv file is definitely in the data folder, and I have tried adding data\ to the file location, but I get the same error. What's going wrong here?

Using the filename only works if the file is located in the current working directory. You can check the current working directory using os.getcwd().
import os
current_directory = os.getcwd()
print(current_directory)
One way you can make your code work is to change the working directory to the one where the file is located.
os.chdir("Path to wherever your file is located")
Or you can substitute the filename with the full path to the file. The full path would look something like C:\Users\Documents\store_id_relation.csv.
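A minimal sketch of the full-path approach, using pathlib. The helper name resolve_data_file and its base_dir parameter are assumptions made for illustration and are not part of the original question. It builds an absolute path and, if the file is missing, reports the working directory and the folder contents so the mismatch is easy to spot.

```python
import os
from pathlib import Path


def resolve_data_file(filename, base_dir="."):
    """Return the absolute path to filename inside base_dir.

    Hypothetical helper: raises FileNotFoundError with diagnostic details
    (working directory and folder listing) when the file is not there.
    """
    base = Path(base_dir).resolve()
    candidate = base / filename
    if not candidate.is_file():
        # List what the folder really contains to help spot typos.
        listing = sorted(p.name for p in base.iterdir()) if base.is_dir() else []
        raise FileNotFoundError(
            f"{candidate} not found; cwd is {os.getcwd()}; "
            f"{base} contains {listing}"
        )
    return candidate
```

Under these assumptions, the returned path could be passed straight to the reader, for example pd.read_csv(resolve_data_file('store_id_relation.csv', 'data')).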

Always try to pass the full path whenever possible.
By default, read_csv resolves a relative filename against the current working directory, so provide the full path to the read_csv function.
However, in your case, data and Data are different names, and file paths are case sensitive on some systems (for example, most Linux file systems).
You can try the path below to read the file into a dataframe:
rest_relation = pd.read_csv('Data\\store_id_relation.csv', delimiter=',')
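To check whether a folder or file name differs only by case, one option is to compare names case-insensitively against the directory listing. This is a sketch only; find_case_insensitive is a hypothetical helper name, not something from the original answers.

```python
from pathlib import Path


def find_case_insensitive(directory, name):
    """Return the entries in directory whose names equal name, ignoring case."""
    wanted = name.casefold()
    return [p for p in Path(directory).iterdir() if p.name.casefold() == wanted]
```

If this returns an entry named Data when you asked for data, the spelling used in the code does not match the spelling on disk.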

If you don't want to change the argument to read_csv(), make sure you run the code from the directory that contains the .csv file. Otherwise, change into that directory before running the Python file, or change the argument to read_csv().

Importing a file using pandas

Data visualisation: I want to import a file using pandas. I assigned the result to a variable and the file name is correct, but I get an error message saying the file does not exist.
Code: sample_data = pd.read_csv('sample_data.csv')

You're probably in the wrong working directory. Run os.getcwd() to check. If so, you can either change working directories with os.chdir() or give an absolute path to the file instead of a relative path.
If you're already in the right working directory, run os.listdir() to make sure the file is actually there.

OK, in this case, get the file path by right-clicking the file and opening Properties (on Windows; I don't know about Mac). Then copy the file path and paste it in place of the file name. For now, it should look something like this (I don't know your actual file path):
sample_data = pd.read_csv('C:\Users\SVISHWANATH\Downloads\datasets')
Now, after the last folder, add your file name. It should look like this:
sample_data = pd.read_csv('C:\Users\SVISHWANATH\Downloads\datasets\sample_data.csv')
However, this will still not work, because backslashes start escape sequences in a regular Python string (here \U is read as the start of a Unicode escape). Put an r before the quotes to make it a raw string:
sample_data = pd.read_csv(r'C:\Users\SVISHWANATH\Downloads\datasets\sample_data.csv')
Now this should work, as all the requirements are met.
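As a small illustration of why the r prefix helps, the snippet below compares three spellings of the same Windows path. The path itself is made up for the example.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only for illustration.
raw = r'C:\Users\me\Downloads\datasets\sample_data.csv'
escaped = 'C:\\Users\\me\\Downloads\\datasets\\sample_data.csv'
forward = 'C:/Users/me/Downloads/datasets/sample_data.csv'

# The raw string and the doubled-backslash string contain the same text.
same_text = raw == escaped

# Windows accepts both separators; PureWindowsPath treats them alike.
same_path = PureWindowsPath(raw) == PureWindowsPath(forward)
```

Any of the three spellings avoids the escape-sequence problem; the raw string is simply the least typing when pasting a path copied from Windows.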

You need to store the .csv file in the folder the program runs from (its current working directory); otherwise you have to give a proper path to the file:
sample_data = pd.read_csv('filepath')

How to use json file that placed in the python project?

I have a project in PyCharm, and I want to use a JSON file that is placed in this project.
How can I load it?
I use this:
import json
file = open('~/PycharmProjects/Test/JSON_new.json')
x = json.load(file)
And received the following error:
FileNotFoundError: [Errno 2] No such file or directory: '~/PycharmProjects/Test/JSON_new.json'
But the path is correct.
EDIT: I understood what the problem is. Instead of a JSON file, a txt file was created (even though I selected JSON). It creates txt files; maybe someone knows how to solve it? I can only create .py files directly, not other file types.
Is it correct if I create a scratch JSON file and place it in Scratches?

You may want to use the following path (on Linux):
file = open('/home/<user>/PycharmProjects/Test/JSON_new.json')
Replace <user> with your username. You need to know the correct path to the file, which you can find with the pwd command in a terminal.
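An alternative to hard-coding the home directory is to expand the ~ yourself, since open() passes the path to the operating system unchanged and ~ is a shell shorthand. A minimal sketch follows; load_json_with_tilde is a hypothetical helper name.

```python
import json
import os


def load_json_with_tilde(path):
    """Load a JSON file from a path that may start with '~'."""
    # open() does not expand '~', so expand it to the home directory first.
    full_path = os.path.expanduser(path)
    with open(full_path, encoding="utf-8") as handle:
        return json.load(handle)
```

With this, the path from the question could be passed as written, for example load_json_with_tilde('~/PycharmProjects/Test/JSON_new.json'), provided the file really exists there.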

You can use the json module for this. Open the file as a separate object and pass it to json.load; if you have a JSON string instead, use json.loads.
import json
file = open('/path/to/json/file.json')
file_opened = json.load(file)
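To make the difference between json.load and json.loads concrete, here is a short sketch. A temporary file stands in for the real JSON file, and the sample data is invented for the example.

```python
import json
import os
import tempfile

# json.loads parses a JSON document held in a string.
from_string = json.loads('{"name": "example", "values": [1, 2, 3]}')

# json.load parses a JSON document read from an open file object.
# A temporary file stands in for the real JSON file here.
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "file.json")
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(from_string, handle)
    with open(json_path, encoding="utf-8") as handle:
        from_file = json.load(handle)
```

Using with blocks also closes the file automatically, which the bare open() call above does not do.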

How to modify a .pd file from python

I have a .pd file called 'missing' on my computer. The path is C:\me\Desktop\missing.pd
Inside this file there are just dates. I have an algorithm that creates and populates this 'missing.pd' with dates every time I run it. It basically creates a dataframe containing some dates (sometimes empty), then creates the missing.pd file on my computer and adds the dates.
What I am trying to do is to not recreate the missing.pd file every time (which is what my code does so far).
I want my code to do the following:
if C:\me\Desktop\missing.pd exists, check whether the dates of my created dataframe are already in it and add the ones that are not; if missing.pd does not exist, create it and fill it with the dates.
So far, this part of the code is:
path = r"C:\me\Desktop\missing.pd"
missing = pd.DataFrame(missing)
missing.to_pickle( os.path.join(path,"%s_missing.pd"%(country)))

You can use os.path.isfile(filename) to check if the file exists.
import os.path
if os.path.isfile(path):
    """Your date checking code here."""

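One way the existence check could be combined with the "add only the missing dates" step is sketched below. To stay dependency-free it uses the standard pickle module and a plain list of dates rather than a DataFrame, which is a substitution and not the asker's setup; with pandas, the same flow would use pd.read_pickle and DataFrame.to_pickle. The function name update_missing_dates is hypothetical.

```python
import os
import pickle


def update_missing_dates(path, new_dates):
    """Add new_dates to the pickled list at path, creating the file if needed.

    Returns the full list of dates stored after the update.
    """
    if os.path.isfile(path):
        # The file exists: load what is already stored.
        with open(path, "rb") as handle:
            stored = pickle.load(handle)
    else:
        # First run: start from an empty list.
        stored = []
    known = set(stored)
    for date in new_dates:
        if date not in known:
            stored.append(date)
            known.add(date)
    with open(path, "wb") as handle:
        pickle.dump(stored, handle)
    return stored
```

Note that path here must be the full file path, such as C:\me\Desktop\missing.pd, not a folder joined with another file name.
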
Copying file from one directory to another

Does anyone know how I can copy/duplicate a file from one directory into another without specification of src path? I got it to work with "shutil.copy2" but it's not exactly what I am looking for since the src argument asks for the path.
My goal is to be able to copy/duplicate a file from one directory into another by filename. Has anyone done this before, if so can you guide me in the right direction? - Thanks
#----------------------------------------------------------------------------------------------------------------#
# These params will be used for specifying which template you want to copy and where to output
#----------------------------------------------------------------------------------------------------------------#
'''Load file from x directory into current working directory '''
#PullTemplate: Specify which template you want to copy, by directory path
TemplateRepo = ("/home/hadoop/BackupFolders/Case_Project/scripts")
#OutputTemplate: Let's you specify where you want to output the copied template.
#Originally set to your current working directory (u".")
OutputTemplate = (u".")
shutil.copy2(TemplateRepo, OutputTemplate)

Well, if you are trying to load a file in the same project, you need at least the folder name inside that project.
You can use json, something like this:
import json
# someFiles is just a folder name inside the project's main folder.
with open("someFiles\\file_name", "r") as whatever_u_want:
    var_of_choice = json.load(whatever_u_want)
print(var_of_choice)
Once the file is open, you can save the variable var_of_choice under any file name you wish, wherever you wish, using the json.dump method.

Click the file you want to copy, and create a duplicate by choosing Duplicate under File (below the Jupyter logo at the top).
Select the copied file (file_copy) and choose Move under File.
Choose the file path where you want to move the copied file.
Rename the copied file as you wish.
For more info, you can refer to: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781785884870/1/ch01lvl1sec12/basic-notebook-operations
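For the original goal of copying by filename, note that shutil.copy2 copies a single file, so its source must be a file path rather than a folder. A sketch that keeps the folder fixed and takes only the filename is shown below; copy_by_name and its parameters are hypothetical names chosen for the example.

```python
import os
import shutil


def copy_by_name(filename, source_dir, dest_dir="."):
    """Copy filename from source_dir into dest_dir and return the new path."""
    src = os.path.join(source_dir, filename)
    if not os.path.isfile(src):
        raise FileNotFoundError(f"{filename} not found in {source_dir}")
    # copy2 also preserves file metadata such as modification times.
    return shutil.copy2(src, dest_dir)
```

Under these assumptions, the template folder from the question would be passed as source_dir, and the caller only supplies the template's filename.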

We Keep Coding
