I am using Databricks (PySpark) to write a CSV file to Azure Blob Storage using:
file_location = "/mnt/ndemo/nsalman/curation/movies/"
df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(file_location)
The file that is created is named: part-00000-tid-3921235530521294160-fb002878-253d-44f5-a773-7bda908c7178-13-1-c000.csv
Now I am renaming it to "movies.csv" using this:
filePath = "/mnt/ndemo/nsalman/curation/movies/"
fs.rename(spark._jvm.org.apache.hadoop.fs.Path(filePath+"part*"), spark._jvm.org.apache.hadoop.fs.Path(filePath+"movies.csv"))
After running it, the file is still not renamed. Since I am new to PySpark, I am not sure why my file is not being renamed. Can anyone please let me know where I am going wrong?
Try this
old_file_name = "test1.csv"
new_file_name = "test2.csv"
dbutils.fs.mv(old_file_name,new_file_name)
This works for me.
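Note that fs.rename won't expand the part* wildcard from the question; it expects literal paths. A minimal sketch that lists the directory and moves the matching part file instead, assuming the mount path from the question:
file_location = "/mnt/ndemo/nsalman/curation/movies/"
# dbutils.fs.ls returns FileInfo entries with .name and .path
for f in dbutils.fs.ls(file_location):
    if f.name.startswith("part-"):
        # Move the single part file to the friendly name
        dbutils.fs.mv(f.path, file_location + "movies.csv")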
You can use the following command if you want to change a folder name:
dbutils.fs.mv("dbfs:/tmp/test", "dbfs:/tmp/test2", recurse=True)
If you want to change a single file name:
dbutils.fs.mv("dbfs:/mnt/all_tables.txt", "dbfs:/mnt/all_tables.txt_newname")
I am trying to write to a file in PySpark, but I get an error saying the file doesn't exist. I'm new to PySpark.
I have this code for write:
result.repartition(1).write.partitionBy('client', 'payload_type').json(OUTPUT_PATH, mode='append')
Is it possible to add a parameter to force creation of the folder/file if it doesn't exist?
I made a mistake: the error is not at the line posted above but at this one:
existing_data = spark.read.json(OUTPUT_PATH)
with:
OUTPUT_PATH = f"s3a://{BUCKET_DEST}/{SOURCE}/"
On the first execution, neither folder exists. Can I force it to be created if it doesn't exist on the read?
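Spark's reader won't create a missing path for you, but one common workaround is to catch the failure on the first run and fall back to "no existing data". A minimal sketch, assuming the OUTPUT_PATH above and that the failure surfaces as Spark's AnalysisException:
from pyspark.sql.utils import AnalysisException
try:
    existing_data = spark.read.json(OUTPUT_PATH)
except AnalysisException:
    # First execution: nothing has been written yet, so treat it as no data
    existing_data = None
The append-mode write shown above should then create the missing folders by itself.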
What I want: to get the file name of the file that was just uploaded into Google Colab via the following code.
from google.colab import files
uploaded = files.upload()
I tried printing the file once I uploaded it, and I get the following output.
print(uploaded)
{'2019.06.01_Short Maintenance - Vehicle, Malibu_TGZ.csv': b'Year,Miles,$/Gallon,Total $,Vehicle\r\n6/1/2019,343.4,2.529,28,Malibu\r\n6/8/2019,34.3,2.529,5,Malibu\r\n6/8/2019,315.6,2.529,33.1,Malibu\r\n6/30/2019,323,2.399,30,Malibu\r\n7/5/2019,316.4,2.559,31,Malibu\r\n7/12/2019,334.6,2.529,30.45,Malibu\r\n7/21/2019,288.7,2.459,33.75,Malibu\r\n7/29/2019,336.7,2.419,28,Malibu\r\n8/6/2019,317.3,2.379,30.45,Malibu\r\n8/14/2019,340.9,2.359,30.1,Malibu\r\n8/22/2019,307.4,2.299,29.85,Malibu\r\n9/1/2019,239.1,2.279,29.7,Malibu\r\n9/14/2019,237.8,2.419,28.9,Malibu\r\n9/6/2019,288,2.469,30.4,Malibu\r\n10/13/2019,305.7,2.299,27.81,Malibu\r\n10/20/2019,330.7,2.369,30.05,Malibu\r\n11/8/2019,257,2.429,32.4,Malibu\r\n12/3/2019,249.3,2.319,5.01,Malibu\r\n12/7/2019,37.2,2.099,25,Malibu\r\n12/22/2019,276.4,2.229,29.4,Malibu\r\n1/12/2020,334,2.199,5,Malibu\r\n1/19/2020,51,2.009,28.15,Malibu\r\n2/8/2020,231.5,2.079,25.8,Malibu\r\n2/23/2020,254.7,2.159,25.75,Malibu\r\n3/19/2020,235.3,1.879,23.15,Malibu\r\n5/22/2020,303,1.699,23.15,Malibu\r\n'}
It appears to be a dict with the key as the file name and the value as a byte string of all the data in the file. I don't know how to get the key, assuming that is what I need to do.
It's in the keys of uploaded. You can use iter() and next() to get it.
filename = next(iter(uploaded))
Access it by getting the keys:
filenames = uploaded.keys()
for file in filenames:
    data = uploaded[file]
If you have more than one file, just create a list for your data and append retrieved values.
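For instance, if the uploads are all CSVs like the one shown, each value can be wrapped in a BytesIO and read straight into pandas. A short sketch (the frames list is just illustrative):
import io
import pandas as pd
frames = []
for filename, content in uploaded.items():
    # uploaded maps file name -> raw bytes, so wrap the bytes for pandas
    frames.append(pd.read_csv(io.BytesIO(content)))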
I'm pretty new to Azure and have been having a problem whilst trying to export to a CSV. I want to rename the output file from the default part-0000-tid-12345 naming to something more recognisable. My problem is that when I export the file, it creates a subdirectory with the file name, and then within that directory I get the file. Is there a way of getting rid of the directory that's created? I.e. the path looks like the write path below but adds a directory: ...outbound/cs_notes_.csv/filename.csv
%python
import os, sys, datetime

readPath = "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/notes/rg"
writePath = "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/notes/outbound"

file_list = dbutils.fs.ls(readPath)

Current_Date = datetime.datetime.today().strftime('%Y-%m-%d-%H-%M-%S')
fname = "CS_Notes_" + str(Current_Date) + ".csv"

# Copy the single part file to the outbound path under the dated name,
# then remove the original
for i in file_list:
    file_name = i[1]
    if file_name.startswith("part-00000"):
        dbutils.fs.cp(readPath + "/" + file_name, writePath + "/" + fname)
        dbutils.fs.rm(readPath + "/" + file_name)
Any help would be appreciated
It's not possible to directly change the output file name in Apache Spark.
Spark uses the Hadoop file format, which requires data to be partitioned - that's why you get the part- files. You can easily rename the output file after processing, just as in the similar SO thread.
You may refer to that SO thread, which addresses the same issue.
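To also get rid of the extra directory, one approach is to copy the part file out to a flat destination and then delete the directory Spark created. A minimal sketch along the lines of the question's code (src_dir is illustrative; substitute the directory Spark actually wrote):
# Spark wrote a directory such as .../outbound/cs_notes_.csv/;
# copy its single part file out to a flat file, then drop the directory.
src_dir = writePath + "/cs_notes_.csv"
for f in dbutils.fs.ls(src_dir):
    if f.name.startswith("part-"):
        dbutils.fs.cp(f.path, writePath + "/" + fname)
dbutils.fs.rm(src_dir, recurse=True)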
Hope this helps.
I'm currently working on a Python Flask API.
For demo purposes, I have a folder in the server containing .tar.gz files.
Basically I'm wondering how I can load these files, knowing their relative path name (say file.tar.gz), into a file object. I need the tar file in that form to be able to run the following code on it, where f would be the tar file:
tar = tarfile.open(mode="r:gz", fileobj=f)
for member in tar.getnames():
    tf = tar.extractfile(member)
Thanks in advance!
Not very familiar with this, but just saving it normally with the .tar.gz extension should work. If so, and if you already have the file compressed, then a very simple snippet could do it:
compressed_data = b'your file'
with open('file.tar.gz', 'wb') as fo:
    fo.write(compressed_data)
Will this do the job, or am I getting something wrong here?
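For the reading side of the question, the archive on disk can simply be opened in binary mode and passed in as fileobj. A minimal sketch, assuming file.tar.gz sits in the server folder mentioned above:
import tarfile
# Open the archive from its relative path as a binary file object
with open('file.tar.gz', 'rb') as f:
    tar = tarfile.open(mode="r:gz", fileobj=f)
    for member in tar.getnames():
        tf = tar.extractfile(member)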
I am currently learning Pandas for data analysis and having some issues reading a CSV file in the Atom editor.
When I am running the following code:
import pandas as pd
df = pd.read_csv("FBI-CRIME11.csv")
print(df.head())
I get an error message, which ends with
OSError: File b'FBI-CRIME11.csv' does not exist
Here is the full path to the file: /Users/alekseinabatov/Documents/Python/FBI-CRIME11.csv.
When I try to run it this way:
df = pd.read_csv(Users/alekseinabatov/Documents/Python/"FBI-CRIME11.csv")
I get another error:
NameError: name 'Users' is not defined
I have also put this directory into the "Project Home" field in the editor settings, though I am not quite sure if it makes any difference.
I bet there is an easy way to get it to work. I would really appreciate your help!
Have you tried?
df = pd.read_csv("Users/alekseinabatov/Documents/Python/FBI-CRIME11.csv")
or maybe
df = pd.read_csv('Users/alekseinabatov/Documents/Python/"FBI-CRIME11.csv"')
(If the file name has quotes)
Just referring to the filename like
df = pd.read_csv("FBI-CRIME11.csv")
generally only works if the file is in the same directory as the script.
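If you don't want to depend on the working directory, one option in a script is to build the path from the script's own location. A small sketch (only the CSV name comes from the question):
import os
import pandas as pd
# Resolve the CSV relative to this script file, not the current working directory
here = os.path.dirname(os.path.abspath(__file__))
df = pd.read_csv(os.path.join(here, "FBI-CRIME11.csv"))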
If you are using Windows, make sure you specify the path to the file as follows:
PATH = "C:\\Users\\path\\to\\file.csv"
I had an issue with the path; it turns out that you need to include the leading '/' to get it to work!
I am using VSCode/Python on macOS
I also experienced the same problem, which I solved as follows:
dataset = pd.read_csv('C:\\Users\\path\\to\\file.csv')
In a Jupyter notebook it works for me with just the relative path. For example:
df = pd.read_csv('file.csv')
But, for example, in VSCode I have to put the complete path:
df = pd.read_csv('/home/code/file.csv')
You are missing the '/' before Users. I assume that you are using a Mac, guessing from the file path names. Your root directory is '/'.
I had the same issue, but it was happening because my file was called "geo_data.csv.csv" - my new laptop wasn't showing file extensions, so the doubled extension was invisible in Windows Explorer.
Very silly, I know, but if this solution doesn't work for you, try that :-)
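One quick way to catch this kind of thing is to let Python list the real file names, extensions included. A tiny sketch (run it in whatever folder the CSV should be in):
import os
# Shows the true names, including a hidden double extension like .csv.csv
print(os.listdir('.'))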
Just change the CSV file name. Once I changed it, it worked fine for me. Previously I had named it data.csv; I changed it to CNC_1.csv.
What worked for me:
import csv
import os
import pandas as pd

base = os.path.normpath(r"path")

with open(base, 'r') as csvfile:
    readCSV = csv.reader(csvfile, delimiter='|')
    data = []
    for row in readCSV:
        data.append(row)

df = pd.DataFrame(data[1:], columns=data[0][0:15])
print(df)
This reads in the file, delimited by |, and appends each row to a list, which is then converted to a pandas DataFrame (taking the first 15 columns).
Make sure your source file is actually saved in .csv format. I tried all the steps: adding the full path to the file, adding and deleting header=0, adding skiprows=0, but nothing worked, because I had saved the Excel file (data file) in workbook format and not in CSV format. So keep in mind to check your file extension first.
Adnane's answer helped me.
Here's my full code on my Mac; hope this helps someone. All my CSV files are saved in /Users/lionelyu/Documents/Python/Python Projects/
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
path = '/Users/lionelyu/Documents/Python/Python Projects/'
aapl = pd.read_csv(path + 'AAPL_CLOSE.csv', index_col='Date', parse_dates=True)
cisco = pd.read_csv(path + 'CISCO_CLOSE.csv', index_col='Date', parse_dates=True)
ibm = pd.read_csv(path + 'IBM_CLOSE.csv', index_col='Date', parse_dates=True)
amzn = pd.read_csv(path + 'AMZN_CLOSE.csv', index_col='Date', parse_dates=True)
Run "pwd" command first in cli to find out what is your current project's direction and then add the name of the file to your path!
Try this
import os
import pandas as pd

cd = os.getcwd()
dataset_train = pd.read_csv(cd + "/Google_Stock_Price_Train.csv")
In my case I just removed .csv from the end. I am using Ubuntu.
pd.read_csv("/home/mypc/Documents/pcap/s2csv")
Sometimes we overlook a small issue that is not a Python or IDE fault but a logical error: we assume a file is a .csv when it is actually an Excel worksheet file. When you try to open that file with an import, the interpreter will throw the error.
To resolve the issue, open your target file in Microsoft Excel and save it in .csv format. It is also important to pay attention to the encoding, because it will help you open the file when you try it with:
with open('YourTargetFile.csv', 'r', encoding='UTF-8') as file:
So you are set to go. Now try to open your file like this:
import csv

with open('plain.csv', 'r', encoding='UTF-8') as file:
    load = csv.reader(file)
    for line in load:
        print(line)
What works for me is
dataset = pd.read_csv('FBI_CRIME11.csv')
Highlight it and press enter. It also depends on the IDE you are using. I am using Anaconda Spyder or Jupyter.
I am using a Mac. I had the same problem: the .csv file was in the same folder as the Python script, yet Spyder was still unable to locate the file. I changed the file name from capital letters to all lowercase and it worked.