I have multiple S3 file paths whose folder names encode a date. I want to extract the latest path from S3, based on that date, using Python and boto3.
For example, below are a few of the paths I have under my root folder (s3://my-bucket/all/stage/pqr/xyz/abc/raw/).
Sample Paths -
s3://my-bucket/all/stage/pqr/xyz/abc/raw/2020/12/11/10/20/file.parquet
s3://my-bucket/all/stage/pqr/xyz/abc/raw/2020/12/11/11/12/file.parquet
s3://my-bucket/all/stage/pqr/xyz/abc/raw/2020/12/11/12/01/file.parquet
s3://my-bucket/all/stage/pqr/xyz/abc/raw/2020/12/12/11/10/file.parquet
All the above paths follow the format s3://my-bucket/all/stage/pqr/xyz/abc/raw/YYYY/MM/DD/HH/mm/file.parquet.
So I need the path with the latest timestamp under the root path (s3://my-bucket/all/stage/pqr/xyz/abc/raw/), which here is s3://my-bucket/all/stage/pqr/xyz/abc/raw/2020/12/12/11/10/file.parquet.
How can I achieve this using Python and boto3?
Any help will be appreciated, as I am new to Python.
Please comment if the question is not clear.
from os import path
is one way to work with the key strings: split each key into its components (for example with os.path.split() or str.split('/')) and use your own comparison to check whether or not a given path's timestamp is the newest.
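For the S3 side specifically, here is a minimal sketch using boto3's list_objects_v2 paginator. The bucket name and prefix are assumptions taken from the sample paths above and may need adjusting; because the date components in the keys are zero-padded, comparing the key strings directly finds the latest timestamp.
import boto3

# Assumed bucket and prefix, taken from the sample paths above.
bucket = "my-bucket"
prefix = "all/stage/pqr/xyz/abc/raw/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

latest_key = None
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".parquet"):
            continue
        # Keys look like .../raw/YYYY/MM/DD/HH/mm/file.parquet, so the
        # zero-padded components sort chronologically as plain strings.
        if latest_key is None or key > latest_key:
            latest_key = key

if latest_key is not None:
    print(f"s3://{bucket}/{latest_key}")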
Related
I have been running a script in VBA which I would like to convert to Python. In VBA I use variables to identify the path from which the raw data are collected, so that the same script can be used on any machine. To do so in VBA, I identify the path using the following string:
"C:\Users\" & Environ$("username") & "\Data"
Is there a way I can identify the path in the same way in Python?
Use pathlib from the Python Standard Library.
import pathlib
print(pathlib.Path.home() / 'Data')
In older versions of Python (before pathlib) you could use os.path.
import os.path
print(os.path.join(os.path.expanduser('~'), 'Data'))
Some additional info: if you want to reference the path where your script is, you can do that too. You just have to know that __file__ contains the full path to your script.
import pathlib
# full path to the script
print(pathlib.Path(__file__))
# full path to the scripts folder
print(pathlib.Path(__file__).parent)
# full path to the subfolder 'subfolder' in the scripts folder
print(pathlib.Path(__file__).parent / 'subfolder')
# full path to a file in the subfolder
print(pathlib.Path(__file__).parent / 'subfolder' / 'datafile.csv')
Just to add to Matthias' answer: if you are looking to use Python to "find" a path to save a file, keep in mind that even though the correct path prints, some APIs only accept plain strings, so after identifying the correct Path you may need to convert it to a string.
i.e., suppose you set your path as a variable:
path = pathlib.Path(__file__).parent / 'subfolder' / 'datafile.csv'
then you need to convert path to a string to be used to open/save a file:
path1 = str(path)
then you can use path1 to open/save a file
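Putting the pieces together, a small sketch (the 'subfolder' and 'datafile.csv' names are just the hypothetical ones from the example above):
import pathlib

# Hypothetical layout: a data file in a subfolder next to this script.
path = pathlib.Path(__file__).parent / 'subfolder' / 'datafile.csv'

# str() gives the plain-string form for APIs that only accept strings.
path1 = str(path)
with open(path1) as f:
    data = f.read()
Note that since Python 3.6, open() and most of the standard library accept Path objects directly, so the str() conversion is mainly needed for older libraries.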
I am in the process of making an automated script. Basically I am downloading some shapefiles, unzipping them and then making a few changes each month. Each month I download the same dataset.
An issue I have found is that the dataset name changes each month after I download it, and I'm not sure how I can point the script to it if the name changes. I don't really want to have to update the script with the new file path each month.
For example November was
L:\load\Ten\Nov20\NSW\FME_68185551_1604301077137_7108\GSNSWDataset
And Dec is
L:\load\Ten\Dec20\NSW\FME_68185551_1606880934716_1252\GSNSWDataset
You could use glob with a wildcard in the changing number section. Something like:
import glob
from datetime import datetime

d = datetime.today().strftime('%b%y')  # e.g. 'Dec20'
fil = glob.glob("L:/load/Ten/%s/NSW/FME*/GSNSWDataset" % d)[0]
This should get you the correct path to your files and then you just read/manipulate however you need.
I have a Python file, converted from a Jupyter Notebook, and there is a subfolder called 'datasets' inside this file's folder. When I try to open a file that is inside that 'datasets' folder with this code:
import pandas as pd
# Load the CSV data into DataFrames
super_bowls = pd.read_csv('/datasets/super_bowls.csv')
It says that there is no such file or folder. Then I add this line:
os.getcwd()
The output is the top-level folder of the project, not the subfolder where this Python file is. I think maybe that's why it's not working.
So, how can I open that CSV file with relative paths? I don't want to use an absolute path because this code is going to be used on other computers.
Why is os.getcwd() not returning the actual folder path?
In my observation, the dot (.) notation to move to the parent directory sometimes does not work, depending on the operating system. What I generally do to make it OS-agnostic is this:
import pandas as pd
import os
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
super_bowls = pd.read_csv(__location__ + '/datasets/super_bowls.csv')
This works equally well on my Windows and Ubuntu machines.
I am not sure if there are other, better ways to achieve this. I would like to hear back if there are.
Per your comment below, the current working directory is
/Users/ivanparra/AprendizajePython/
while the file is in
/Users/ivanparra/AprendizajePython/Jupyter/datasets/super_bowls.csv
For that reason, going to the datasets subfolder of the current working directory (CWD) takes you to /Users/ivanparra/AprendizajePython/datasets which either doesn't exist or doesn't contain the file you're looking for.
You can do one of two things:
(1) Use an absolute path to the file, as in
super_bowls = pd.read_csv("/Users/ivanparra/AprendizajePython/Jupyter/datasets/super_bowls.csv")
(2) use the right relative path, as in
super_bowls = pd.read_csv("./Jupyter/datasets/super_bowls.csv")
There's also (3): use os.path.join to concatenate the CWD with the relative path, which is basically the same as (2).
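For completeness, a small sketch of option (3) with os.path.join, using the folder names from your comment; it assumes the script is run from /Users/ivanparra/AprendizajePython/.
import os
import pandas as pd

# Option (3), sketched: join the CWD with the relative path.
csv_path = os.path.join(os.getcwd(), "Jupyter", "datasets", "super_bowls.csv")
super_bowls = pd.read_csv(csv_path)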
The answer really lies in the response by user2357112:
os.getcwd() is working fine. The problem is in your expectations. The current working directory is the directory where Python is running, not the directory of any particular source file. – user2357112 supports Monica May 22 at 6:03
The solution is:
data_dir = os.path.dirname(__file__)
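For example, a minimal sketch that uses that data_dir to load the CSV from the question (assuming the datasets folder sits next to the script):
import os
import pandas as pd

# Resolve the data file relative to this script, not to the CWD.
data_dir = os.path.dirname(__file__)
super_bowls = pd.read_csv(os.path.join(data_dir, "datasets", "super_bowls.csv"))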
Try this code:
super_bowls = pd.read_csv(os.getcwd() + '/datasets/super_bowls.csv')
I noticed this problem a few years ago. I think it's a matter of design style. The problem is that your workspace folder is just a folder, not a project folder. Most of the time, your relative references are based on the current file.
VSCode actually supports setting the cwd dynamically, but that's not the default. If your working folder is not a rigorous, well-structured project, I recommend adding the following setting to launch.json. This is the simplest answer you need.
"cwd": "${fileDirname}"
Thanks to everyone who tried to help me. Thanks to Roy2012's response, I got code that works for me.
import pandas as pd
import os
currentPath = os.path.dirname(__file__)
# Load the CSV data into DataFrames
super_bowls = pd.read_csv(currentPath + '/datasets/super_bowls.csv')
The os.path.dirname call gives me the folder of the current file,
'/Users/ivanparra/AprendizajePython/Jupyter'
and lets me work with relative paths. With that it works like a charm!!
P.S.: As a side note, os.getcwd() behaves quite differently in a Jupyter Notebook than in a Python file. Inside a notebook the working directory is usually the notebook's own folder, but when running a Python file it is the folder the script was launched from (here, the top-level project folder).
I want this line to save the CSV in my current directory, alongside my Python file:
df.to_csv("./test.csv")
My python file is in "C:\Users\Micheal\Desktop\VisualStudioCodes\Q1"
Unfortunately it saves it in "C:\Users\Micheal" instead.
I have tried importing os.path to use os.curdir, but I get nothing but errors with that.
Is there even a way to save the CSV alongside the Python file using os.curdir?
Or is there a simpler way to just do this in Python without importing anything?
import os
directory_of_python_script = os.path.dirname(os.path.abspath(__file__))
df.to_csv(os.path.join(directory_of_python_script, "test.csv"))
And if you want to read same .csv file later,
pandas.read_csv(os.path.join(directory_of_python_script, "test.csv"))
Here, __file__ gives the (possibly relative) location of the Python script being run. We get the absolute path with os.path.abspath() and then take its parent directory with os.path.dirname().
os.path.join() joins two paths using the operating system's default path separator: '\' for Windows and '/' for Linux, for example.
This kind of approach should work; I haven't tried it, so if it does not work, let me know.
I have six directories that follow the format
\home\mydir\myproject\2012-01-23_03-01-34
\home\mydir\myproject\2012-01-11_01-00-57
\home\mydir\myproject\2010-01-11_01-00-57
\home\mydir\myproject\2010-01-11_01-00-54
\home\mydir\myproject\2010-01-08_01-00-54
Note the datetime as the final directory. It is exactly this format, and it is meant to indicate the time the directory was created. They all contain a file named myfile.xml. I want to parse out the latest and greatest myfile.xml. Does Python have any magic that can tell the latest (i.e. most up-to-date) directory from the name format I am using? If it does not, does it have any magic that can tell from the file timestamps which is the most up to date? The OS is Windows.
Another way of looking at this is that the most up to date directory will also have the highest number.
Thanks.
If you have those directory names in a list dirs, then max(dirs) will give you the latest.
If you really need the "most up to date" and there's a chance the files in the directories could be modified (and so be considered more up to date than files in directories with later names), then going by what the OS reports is more robust; for getting that information about file ages, see http://docs.python.org/release/2.5.2/lib/module-stat.html. If only the creation time implied by the folder name is relevant, then @Greg Hewgill has you covered.
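A short sketch of both approaches, assuming the layout from the question (a root folder whose subfolders are named YYYY-MM-DD_HH-MM-SS and each contain myfile.xml):
import os

root = r"\home\mydir\myproject"  # root folder from the question
subdirs = [d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))]

# Because the timestamp components are zero-padded, lexicographic order
# matches chronological order, so max() on the names finds the newest.
latest_by_name = max(subdirs)
print(os.path.join(root, latest_by_name, "myfile.xml"))

# Alternative: trust the filesystem's modification time instead of the name.
latest_by_mtime = max(subdirs, key=lambda d: os.path.getmtime(os.path.join(root, d)))
print(os.path.join(root, latest_by_mtime, "myfile.xml"))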