Azure output to csv using python

I'm pretty new to Azure and have been having a problem whilst trying to export to a csv. I want to rename the output file from the default part-0000-tid-12345 naming to something more recognisable. My problem is that when I export the file, it creates a subdirectory with the filename, and the file itself ends up inside that directory. Is there a way of getting rid of the directory that's created? That is, the path looks like the write path below but with an extra directory added: ...outbound/cs_notes_.csv/filename.csv
%python
import os, sys, datetime
readPath = "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/notes/rg"
writePath = "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/notes/outbound"
file_list = dbutils.fs.ls(readPath)
for i in file_list:
    file_path = i[0]
    file_name = i[1]
    file_name
Current_Date = datetime.datetime.today().strftime ('%Y-%m-%d-%H-%M-%S')
fname = "CS_Notes_" + str(Current_Date) + ".csv"
for i in file_list:
    if i[1].startswith("part-00000"):
        dbutils.fs.cp(readPath+"/"+file_name,writePath+"/"+fname)
        dbutils.fs.rm(readPath+"/"+file_name)
Any help would be appreciated

It is not possible to set the output file name directly in Apache Spark.
Spark uses the Hadoop file format, which requires data to be partitioned; that is why you get part- files. You can, however, rename the output file after processing.
You may refer to a similar SO thread, which addresses the same issue.
Hope this helps.
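The "rename after processing" step can be sketched as follows. This is only an illustration on the local filesystem, not the Databricks API: promote_part_file and the demo paths are hypothetical names, and it assumes the job wrote exactly one part- file. On DBFS the same steps would presumably be done with dbutils.fs.mv and dbutils.fs.rm instead of shutil.

```python
import shutil
import tempfile
from pathlib import Path


def promote_part_file(output_dir, target_file):
    """Move the single part-* file in output_dir to target_file, then remove output_dir."""
    output_dir = Path(output_dir)
    target_file = Path(target_file)
    part_files = sorted(p for p in output_dir.glob("part-*") if p.is_file())
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    # Stage the file next to the target first, so the target may reuse the directory's name.
    staged = target_file.with_name(target_file.name + ".staged")
    shutil.move(str(part_files[0]), str(staged))
    shutil.rmtree(output_dir)  # also drops _SUCCESS and other marker files
    staged.rename(target_file)
    return target_file


# Demo on a throwaway directory that mimics Spark's output layout (made-up data).
demo_root = Path(tempfile.mkdtemp())
spark_out = demo_root / "cs_notes.csv"
spark_out.mkdir()
(spark_out / "part-00000-tid-12345.csv").write_text("id,note\n1,hello\n")
(spark_out / "_SUCCESS").write_text("")
result = promote_part_file(spark_out, demo_root / "cs_notes.csv")
print(result.is_file())
```

Staging the file before deleting the directory is what lets the final file take the very name the directory had, which is the layout the question asks for.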

Related

Bulk convert files extension using Python

I am trying to develop a bulk webp to png converter using Python.
I am using the webptools library (https://pypi.org/project/webptools/).
The documentation above only shows how to convert one file at a time and requires the user to supply the file name.
So, what I am trying to do is scan the folder for *.webp and convert each file to *.png with the original filename. I couldn't work out the output file names. I suppose that with the current code it keeps overwriting the same file x.png, so it ends up with just 1 output file. I can't figure out how to fix this.
I am new to Python and hope to get some guidance or help here. Thank you very much.
from webptools import dwebp
import os, glob
os.chdir("./images") # working directory
webp_list = []
for file in glob.glob("*.webp"):
    webp_list = file
print([webp_list])
for files in webp_list:
    print(dwebp(input_image=webp_list, output_image="x.png", option="-o", logging="-v"))
# documentation - code allows only 1 input and 1 output
# print(dwebp(input_image="sample.webp", output_image="sample.png", option="-o", logging="-v"))
After you do
webp_list = []
for file in glob.glob("*.webp"):
    webp_list = file
print([webp_list])
webp_list is the name of the last matching file rather than a list of file names. According to its documentation, glob.glob itself will
Return a possibly-empty list of path names that match pathname(...)
so there is no need for such a contortion and you can simply do
webp_list = glob.glob("*.webp")
instead. You then need a different output filename for each input, for which I propose the following solution:
for filename in webp_list:
    outname = filename[:-4] + "png"
    dwebp(input_image=filename, output_image=outname, option="-o", logging="-v")
filename[:-4] is the filename without its last 4 characters (webp in this case), which is then concatenated with png.
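As an alternative to slicing off a fixed number of characters, the output name can be derived with pathlib, which replaces whatever the final extension is. A minimal sketch, in which png_name is a hypothetical helper and the file names are made up; it deliberately does not call dwebp:

```python
from pathlib import Path


def png_name(webp_name):
    """Return webp_name with its final extension replaced by .png."""
    return str(Path(webp_name).with_suffix(".png"))


for name in ["cat.webp", "holiday.photo.webp"]:
    print(name, "->", png_name(name))
```

The result of png_name(filename) could then be passed as output_image in the loop above.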
I've never used this library before, so my suggestion is based only on how I guess it should work:
from webptools import dwebp
import os, glob
os.chdir("./images") # working directory
for file in glob.glob("*.webp"):
    output_file = file[:-4] + 'png'
    dwebp(input_image=file, output_image=output_file, option="-o", logging="-v")

Changing filenames in folders

I have a folder that contains a lot of files, many of which are copies whose names make them impossible to open.
Example:
cow.txt
cow.txt(1)
cow.txt(2)
cow.txt(3)
dog.txt
dog.txt(1)
I would like to have all the files named in a way that allows them to be opened. Example:
cow.txt
cow(1).txt
cow(2).txt
cow(3).txt
dog.txt
dog(1).txt
Any help you can provide would be greatly appreciated. I am just looking to make sure their names are changed, and am not looking to read each individual file. In addition, if possible, I would like to break up the files into 20k blocks. Thank you in advance.
I have tried using os.rename to simply rename the files, but I am confused about how to do this efficiently, as the numbers come after the .txt. I then decided to read all the file names into a pandas data frame and fix them that way. However, I am confused about how to pull the files and save them under the new names.
list_of_files = os.listdir()
df = pd.DataFrame(list_of_files, columns = ['File_Name'])
df['.txt_removed'] = df.replace(to_replace = '.txt', value = '', regex = True)
df['txt_add'] = df['.txt_removed'] + '.txt'
To pull the files I would do something like this
for filewant in df['txt_add']:
    if filewant in os.listdir():
        shutil.copy(os.path.join(filewant), 'new location')
I do not think this option will work, even though it gives me my intended result, as I would like to change the file names themselves.
You can use Python's standard library: the os module has the os.rename function.
It works like this:
os.rename('cow.txt(1)', 'cow(1).txt')
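To apply os.rename to a whole folder, the new name can be computed with a regular expression. The sketch below assumes that every copy is named name.txt(N) with N a number; fixed_name and rename_copies are hypothetical helpers, and the demo runs on a temporary folder with made-up files.

```python
import os
import re
import tempfile

# Matches names such as "cow.txt(1)": a stem, ".txt", then a copy number in parentheses.
COPY_PATTERN = re.compile(r"^(?P<stem>.+)\.txt(?P<copy>\(\d+\))$")


def fixed_name(name):
    """Return 'cow(1).txt' for 'cow.txt(1)'; return None if name has no copy suffix."""
    match = COPY_PATTERN.match(name)
    if match is None:
        return None
    return f"{match['stem']}{match['copy']}.txt"


def rename_copies(directory):
    """Rename every 'name.txt(N)' in directory and return the list of new names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        new_name = fixed_name(name)
        if new_name is None:
            continue  # already a plain .txt file, or not a copy at all
        target = os.path.join(directory, new_name)
        if os.path.exists(target):
            continue  # skip rather than overwrite an existing file
        os.rename(os.path.join(directory, name), target)
        renamed.append(new_name)
    return renamed


# Demo on a temporary folder that mimics the example in the question.
demo_dir = tempfile.mkdtemp()
for name in ["cow.txt", "cow.txt(1)", "cow.txt(2)", "dog.txt", "dog.txt(1)"]:
    open(os.path.join(demo_dir, name), "w").close()
print(rename_copies(demo_dir))
```

Only the names are touched; no file is opened or read, which matches the requirement in the question.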
Create a .py file, paste the code below into it, then run it. Change the /mydir path to the path of the directory that holds the files. The code loops through the directory, finds every file whose name has a copy suffix after .txt (for example cow.txt(1)) and renames it so that the suffix comes before the extension (cow(1).txt). I hope it works.
import glob, os
os.chdir("/mydir")
for file in glob.glob("*.txt(*)"):
    # Split "cow.txt(1)" into the stem "cow" and the copy suffix "(1)"
    stem, suffix = file.split(".txt", 1)
    new_name = stem + suffix + ".txt"  # "cow(1).txt"
    os.rename(file, new_name)
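The question also asks about breaking the files up into 20k blocks, which the answers here do not cover. One possible reading is "groups of at most 20,000 file names"; under that assumption, a minimal sketch could look like this (chunked is a hypothetical helper and the file names are generated only for the demo):

```python
def chunked(items, size):
    """Yield consecutive lists of at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up file names, standing in for the result of os.listdir().
file_names = [f"file_{n}.txt" for n in range(45000)]
blocks = list(chunked(file_names, 20000))
print([len(block) for block in blocks])
```

Each block could then be moved into its own subfolder or processed in a separate run.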

Trying to print name of all csv files within a given folder

I am trying to write a program in python that loops through data from various csv files within a folder. Right now I just want to see that the program can identify the files in a folder but I am unable to have my code print the file names in my folder. This is what I have so far, and I'm not sure what my problem is. Could it be the periods in the folder names in the file path?
import glob
path = "Users/Sarah/Documents/College/Lab/SEM EDS/1.28.20 CZTS hexane/*.csv"
for fname in glob.glob(path):
    print fname
No error messages are popping up but nothing will print. Does anyone know what I'm doing wrong?
Are you on a Linux-based system? If you're not, try switching the / for \\ (Python usually accepts / on Windows too, so this alone may not be the cause).
Is the directory you're giving the full path, from the root folder? You might need to specify a FULL path (drive included).
If that still fails, it may sound silly, but check that there actually are files in there, as your code otherwise seems fine.
The code below worked for me and listed the csv files correctly (see the C:\\ part, which could be what you're missing).
import glob
path = "C:\\Users\\xhattam\\Downloads\\TEST_FOLDER\\*.csv"
for fname in glob.glob(path):
    print(fname)
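The checks suggested above (is the path absolute, does the folder exist, are there matching files) can be sketched as a small helper. diagnose_pattern is a hypothetical name, and the demo folder is a temporary one that only borrows the folder name from the question to show that periods in a folder name are not a problem for glob.

```python
import glob
import os
import tempfile


def diagnose_pattern(pattern):
    """Report why a glob pattern may be matching nothing."""
    folder = os.path.dirname(pattern)
    return {
        "is_absolute": os.path.isabs(pattern),
        # A relative folder is resolved against the current working directory.
        "resolved_folder": os.path.abspath(folder),
        "folder_exists": os.path.isdir(folder) if folder else True,
        "matches": sorted(glob.glob(pattern)),
    }


# Demo: a temporary folder whose name contains periods and spaces.
demo_dir = os.path.join(tempfile.mkdtemp(), "1.28.20 CZTS hexane")
os.makedirs(demo_dir)
open(os.path.join(demo_dir, "sample.csv"), "w").close()
report = diagnose_pattern(os.path.join(demo_dir, "*.csv"))
for key, value in report.items():
    print(key, "=", value)
```

If folder_exists is False, comparing resolved_folder with the real location of the files usually shows whether a leading slash or drive letter is missing.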
The following code gets a list of files in a folder and if they have csv in them it will print the file name.
import os
path = r"C:\temp"
filesfolders = os.listdir(path)
for file in filesfolders:
    if ".csv" in file:
        print (file)
Note the indentation in my code. You need to be careful not to mix tabs and spaces, as these are not the same to Python.
Alternatively, you could use os:
import os
path = "."  # folder to search (placeholder; replace with your own folder)
files_list = os.listdir(path)
out_list = []
for item in files_list:
    if item.endswith(".csv"):
        out_list.append(item)
print(out_list)
Are you sure you are using the correct path?
Try moving the Python script into the folder where the CSV files are, and then change it to this:
import glob
path = "./*.csv"
for fname in glob.glob(path):
    print(fname)

Python script runs correctly from pycharm but not from batch file

I have a Python script that scans some csv files in a directory, gets the last line from each, and adds them all to a new csv file. When I run the script inside PyCharm it runs correctly and does its job, but when I run it through a batch file (I need that for some automation later on) it produces an empty csv file instead of the one it's supposed to.
The batch file is created by writing in a .txt file:
"path of python.exe" "path of the .py file of the script"
and then changing the .txt extension to a .bat one (that's the process I found online for creating batch files). The script is:
import pandas as pd
import glob
import os
path = r'Path for origin files.'
r_path = r'Path of resulting file'
if os.path.exists(r_path + 'csv'):
    os.remove(r_path + 'csv')
if os.path.exists(r_path + 'txt'):
    os.remove(r_path + 'txt')
files = glob.glob(path)
column_list = [None] * 44
for i in range(44):
    column_list[i] = str(i + 1)
df = pd.DataFrame(columns = column_list)
for name in files:
    df_n = pd.read_csv(name, names = column_list)
    df = df.append(df_n.iloc[-1], ignore_index=True)
    del df_n
df.to_csv(r_path, index=False, header=False)
del df
What am I doing wrong?
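PyCharm sets up the environment before it runs your script: the run configuration defines a working directory (typically the folder that contains the script) and adds an environment variable called PYTHONPATH to the command. PYTHONPATH only tells Python where to look for modules to import; relative file paths are resolved against the working directory of the process.
For example, if your file path is awesomecsv.csv, how should the Python process know which folder to look in to find that file? It looks in the working directory, which for a batch file is the folder the batch file is started from.
cd /my/path/tothefolderwheremyscriptis/ && python my_script.py
Changing to the script's folder before running the command, as above, gives Python the same working directory it typically has under PyCharm.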
related documentation
The error is probably in the paths of the csv files. PyCharm is probably setting some kind of workspace folder for you, so try using a full path to the directory.
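One way to make the script independent of the folder it is started from is to build every path from the script's own location. A minimal sketch, assuming the data files live next to the script; data_path is a hypothetical helper, and awesomecsv.csv is the example name used above:

```python
import os
import tempfile
from pathlib import Path


def data_path(script_file, relative_name):
    """Return relative_name resolved against the folder that contains script_file."""
    return Path(script_file).resolve().parent / relative_name


# Inside the real script you would pass __file__, for example:
# csv_path = data_path(__file__, "awesomecsv.csv")

# Demo with a made-up script location in a temporary folder.
demo_dir = tempfile.mkdtemp()
demo_script = os.path.join(demo_dir, "my_script.py")
csv_path = data_path(demo_script, "awesomecsv.csv")
print(csv_path.is_absolute())
```

Because the result is always an absolute path, it is the same whether the script is started from PyCharm, a batch file or a scheduled task.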

Automatically create a file in folder (Python)

The problem is: I have to automatically create a file inside a folder that has itself been created automatically beforehand.
Let me explain better. First I post the code used to create the folder...
import os
from datetime import datetime
timestr = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
now = datetime.now()
newDirName = now.strftime("%Y%m%d-%H%M%S-%f")
folder = os.mkdir('C:\\Users\\User\\Desktop\\' + newDirName)
This code will create a folder on Desktop with timestamp (microseconds included to make it as unique as possible..) as name.
Now I would like to create also a file (for example a txt) inside the folder. I already have the code to do it...
file = open('B' + timestr, 'w')
file.write('Exact time is: ' + timestr)
file.close()
How can I "combine" this together ? First create the folder and, near immediately, the file inside it?
Thank you. If it's still not clear, feel free to ask.
Yes, just create a directory and then immediately a file inside it. All I/O operations in Python are synchronous by default so you won't get any race conditions.
The resulting code will be (I also made some improvements to your code):
import os
from datetime import datetime
timestr = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
dir_path = os.path.join('C:\\Users\\User\\Desktop', timestr)
os.mkdir(dir_path)  # os.mkdir returns None, so there is nothing to assign
with open(os.path.join(dir_path, 'B' + timestr), 'w') as file:
    file.write('Exact time is: ' + timestr)
You can also make your code (almost) cross-platform by replacing hard-coded desktop directory path with
os.path.join(os.path.expanduser('~'), 'Desktop')
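The same idea can also be written with pathlib. This is just a sketch: create_stamped_file is a hypothetical helper, and the demo writes into a temporary folder rather than the Desktop so that it can run anywhere.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def create_stamped_file(base_dir):
    """Create a timestamp-named folder in base_dir and a file 'B<timestamp>' inside it."""
    timestr = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    folder = Path(base_dir) / timestr
    folder.mkdir(parents=True)  # raises FileExistsError if the folder already exists
    file_path = folder / ("B" + timestr)
    file_path.write_text("Exact time is: " + timestr)
    return file_path


# Demo in a temporary folder; for the Desktop, pass Path.home() / "Desktop" instead.
created = create_stamped_file(tempfile.mkdtemp())
print(created)
```

Computing the timestamp once and reusing it for both names keeps the folder name and the file name in sync.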
