Getting a file name after a file is uploaded into Google Colab - python

What I want: to get the file name of the file that was just uploaded into Google Colab via the following code.
from google.colab import files
uploaded = files.upload()
What I tried: printing the variable once I uploaded the file, and I get the following output.
print(uploaded)
{'2019.06.01_Short Maintenance - Vehicle, Malibu_TGZ.csv': b'Year,Miles,$/Gallon,Total $,Vehicle\r\n6/1/2019,343.4,2.529,28,Malibu\r\n6/8/2019,34.3,2.529,5,Malibu\r\n6/8/2019,315.6,2.529,33.1,Malibu\r\n6/30/2019,323,2.399,30,Malibu\r\n7/5/2019,316.4,2.559,31,Malibu\r\n7/12/2019,334.6,2.529,30.45,Malibu\r\n7/21/2019,288.7,2.459,33.75,Malibu\r\n7/29/2019,336.7,2.419,28,Malibu\r\n8/6/2019,317.3,2.379,30.45,Malibu\r\n8/14/2019,340.9,2.359,30.1,Malibu\r\n8/22/2019,307.4,2.299,29.85,Malibu\r\n9/1/2019,239.1,2.279,29.7,Malibu\r\n9/14/2019,237.8,2.419,28.9,Malibu\r\n9/6/2019,288,2.469,30.4,Malibu\r\n10/13/2019,305.7,2.299,27.81,Malibu\r\n10/20/2019,330.7,2.369,30.05,Malibu\r\n11/8/2019,257,2.429,32.4,Malibu\r\n12/3/2019,249.3,2.319,5.01,Malibu\r\n12/7/2019,37.2,2.099,25,Malibu\r\n12/22/2019,276.4,2.229,29.4,Malibu\r\n1/12/2020,334,2.199,5,Malibu\r\n1/19/2020,51,2.009,28.15,Malibu\r\n2/8/2020,231.5,2.079,25.8,Malibu\r\n2/23/2020,254.7,2.159,25.75,Malibu\r\n3/19/2020,235.3,1.879,23.15,Malibu\r\n5/22/2020,303,1.699,23.15,Malibu\r\n'}
It appears to be a dict with the key as the file name and the value as a byte string of all the data in the file. I don't know how to get the key, assuming that is what I need to do.

It's in the keys of uploaded. You can use iter() and next() to get it.
filename = next(iter(uploaded))

Access the data by iterating over the keys:
filenames = uploaded.keys()
for file in filenames:
    data = uploaded[file]
If you have more than one file, just create a list for your data and append retrieved values.
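For example, a minimal sketch of that multi-file case (variable names are illustrative) that collects the contents of every uploaded file into a list:
from google.colab import files

uploaded = files.upload()

all_data = []
for filename in uploaded.keys():
    # uploaded[filename] holds the raw bytes of that file
    all_data.append(uploaded[filename])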

Related

Error in loading JSON data - can't find the file

I've been trying to solve this issue, but right now I need some help. I'm trying to load this JSON file (DBL) in Spyder IDE. I have stored the JSON data file and the Spyder file in the same folder in order to read the JSON file, but it's not working.
My Python code:
import json
file = open("dbl")
dbl = json.load(file)
print(dbl)
Every time I load the JSON file from the same folder as the spyder.py file, it can't find the file. I have stored my .py file in the same folder as the JSON file.
This is the error message:
FileNotFoundError: [Errno 2] No such file or directory: 'dbl.json'
The file, in fact, does not exist. The actual filename is dbl.json.json.
import json
file = open("dbl.json.json")
dbl = json.load(file)
print(dbl)
You need to include ".json" in the file path.
For example:
file = open("dbl.json")
If dbl is the name of the JSON file, then you should add the ".json" extension too. You might do this:
import json

# Opening the JSON file
f = open('dbl.json')

# json.load returns the JSON object as a dictionary
data = json.load(f)

# Iterating through the JSON list
for i in data:
    print(i)

# Closing the file
f.close()
The code is fine; however, it's good practice to add a file extension. It looks like you forgot to add the extension.
You are using relative paths. It is advised to use absolute paths. In this case, put the Python script and the dbl file in the same directory and try again.
In case you want to debug, just add the code below to the top of your script to see whether the file is present, and modify the script accordingly.
import os
print(os.listdir())
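If you prefer the absolute-path route, here is a minimal sketch (assuming dbl.json sits next to the script) that resolves the path from the script's own location, so the current working directory no longer matters:
import json
import os

# Build an absolute path relative to this script's location
script_dir = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(script_dir, "dbl.json")

with open(file_path) as f:
    dbl = json.load(f)
print(dbl)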

Renaming a csv file placed in Azure Blob Storage

I am using Databricks (Pyspark) to write a csv file inside Azure Blob Storage using:
file_location = "/mnt/ndemo/nsalman/curation/movies/"
df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(file_location)
The file that is created is named as : part-00000-tid-3921235530521294160-fb002878-253d-44f5-a773-7bda908c7178-13-1-c000.csv
Now I am renaming it to "movies.csv" using this:
filePath = "/mnt/ndemo/nsalman/curation/movies/"
fs.rename(spark._jvm.org.apache.hadoop.fs.Path(filePath+"part*"), spark._jvm.org.apache.hadoop.fs.Path(filePath+"movies.csv"))
After running it, my file is still not renamed. Since I am new to Pyspark, I am not sure why. Can anyone please let me know where I am going wrong?
Try this; it works for me:
old_file_name = "test1.csv"
new_file_name = "test2.csv"
dbutils.fs.mv(old_file_name, new_file_name)
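One likely cause in the question's code: the Hadoop rename call does not expand wildcards such as part*, so it looks for a file literally named "part*". A minimal sketch of a workaround (Databricks-only, using dbutils and the paths from the question) is to list the directory, find the actual part file, and move it:
file_location = "/mnt/ndemo/nsalman/curation/movies/"

# Find the single part file Spark wrote, then move it to the desired name
part_file = [f.path for f in dbutils.fs.ls(file_location)
             if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, file_location + "movies.csv")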
You can use the following command if you want to rename a folder:
dbutils.fs.mv("dbfs:/tmp/test", "dbfs:/tmp/test2", recurse=True)
If you want to rename a single file:
dbutils.fs.mv("dbfs:/mnt/all_tables.txt", "dbfs:/mnt/all_tables.txt_newname")

How to retrieve only the file name from an S3 folder path using pyspark

Hi, I have an AWS S3 bucket in which a few folders and subfolders are defined.
I need to retrieve only the filename, whichever folder it is in. How do I go about it?
s3 bucket name - abc
path - s3://abc/ann/folder1/folder2/folder3/file1
path - s3://abc/ann/folder1/folder2/file2
Code tried so far:
s3 = boto3.client('s3')
lst_obj = s3.list_objects(Bucket='abc', Prefix='ann/')
lst_obj["Contents"]
I'm further looping to get all the contents:
for file in lst_obj["Contents"]:
    do something...
Here file["Key"] gives me the whole path, but I just need the filename.
Here is an example of how to get the filenames.
import boto3
s3 = boto3.resource('s3')
for obj in s3.Bucket(name='<your bucket>').objects.filter(Prefix='<prefix>'):
    filename = obj.key.split('/')[-1]
    print(filename)
You can just extract the name by splitting the object's Key on the '/' symbol and taking the last element:
for file in lst_obj["Contents"]:
    name = file["Key"].split("/")[-1]
Using list_objects, even with a prefix, simply filters objects whose keys start with that prefix.
What you see as a path in S3 is actually part of the object's key; the key (which acts as a piece of metadata identifying the object) includes what might look like subfolders.
If you want the last part of the object key, you will need to split the key on the separator ('/').
You could do this with file['Key'].rsplit('/', 1)[-1], which would give you the filename.
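For example, with one of the keys from the question:
key = "ann/folder1/folder2/folder3/file1"
print(key.rsplit('/', 1)[-1])  # prints: file1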

How to iterate over JSON files in a directory and upload to mongodb

So I have a folder with about 500 JSON files. I need to upload all of them to a local mongodb database. I tried using Mongo Compass, but Compass can only upload one file at a time. In Python I tried to write some simple code to iterate through the folder and upload them one by one, but I ran into some problems. First of all, the JSON files are not comma-separated; rather, each line is a separate JSON object. So the files look like:
{ some JSON object }
{ some JSON object }
...
I wrote the following code to iterate through the folder and upload it:
import os
import pymongo
import json
import pandas as pd
import numpy as np

myclient = pymongo.MongoClient("mongodb://localhost:27017/")
mydb = myclient['Test']
mycol = mydb['data']

directory = os.fsencode("C:/Users/PB/Desktop/test/")
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        mycol.insert_many(filename)
The code basically goes through a folder, checks if it's a .json file, then inserts it into the database. That is what should happen. However, I get this error:
TypeError: document must be an instance of dict, bson.son.SON,
bson.raw_bson.RawBSONDocument, or a type that inherits from
collections.MutableMapping
I cannot seem to upload them through Python. I tried multiple variations of the code, but for some reason Python does not accept the JSON files.
The problem with these files seems to be that Python only allows comma-separated JSON files.
How could I fix this to upload all the files?
You're inserting the names of the files into Mongo, not the contents of the files.
Assuming you have multiple json files in a directory, where each file contains a json-object in each line...
You need to go through all the files, filter them, open them, read them line by line, parse each line into a dict, and then insert. Something like below:
os.chdir(directory)
for file in os.listdir(directory):
    if file.endswith(".json"):
        with open(file) as f:
            for line in f:
                mongo_obj = json.loads(line)
                mycol.insert_one(mongo_obj)
I did a chdir first to avoid having to pass the whole path to open
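A variation on the same sketch: if you'd rather send one batch per file instead of one round trip per line, parse each file into a list and use insert_many:
os.chdir(directory)
for file in os.listdir(directory):
    if file.endswith(".json"):
        with open(file) as f:
            # One parsed dict per non-empty line
            docs = [json.loads(line) for line in f if line.strip()]
        if docs:
            mycol.insert_many(docs)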

python split string and list full path name

I am using glob to get a list of all PDF files in a folder (I need the full path names to upload the files to the cloud).
Also, during the upload I need to assign a "title" to each file, which will be the item's name in the cloud.
I need to split on the last "\" and the "." and get the value in between. For example:
pdf_list = glob.glob(r'C:\User\username\Desktop\pdf\*.pdf')
an item in the list will be: "c:\User\username\Desktop\pdf\4434343434331.pdf"
I need a pythonic way to grab the PDF's file name into a separate variable while still in the for loop.
I am using a for loop to upload each PDF (using the file path):
for file in pdf_list:
    upload.file
    file.title(file.split(".")[0])
However, the above split will not return my desired result; I want something along those lines.
Actually, there is a function for this already:
import os

for file in pdf_list:
    file_name = os.path.basename(file)
    upload.file(file_name)
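Note that os.path.basename keeps the ".pdf" extension; if the title should be only the part between the last "\" and the ".", os.path.splitext strips it. A small example (on Windows, where backslashes are path separators):
import os

file_name = os.path.basename(r"c:\User\username\Desktop\pdf\4434343434331.pdf")
title = os.path.splitext(file_name)[0]  # '4434343434331'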
You can use pathlib, for example:
from pathlib import Path
p = list(Path('C:/User/username/Desktop/pdf').glob('*.pdf'))
first_filename = p[0].name
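Similarly, if you want the name without the extension, pathlib exposes it directly as the stem attribute:
first_title = p[0].stem  # e.g. '4434343434331'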
