How can I read a csv file from Dropbox Online (without downloading it to local machine)?
I installed and imported dropbox in Python, created an app, and got a token from "https://www.dropbox.com/developers/apps". I have read some resources online, but I am still fuzzy on how to move forward from here.
Line 149 of the example here provides a function that downloads a file and returns its contents as a byte string.
https://github.com/dropbox/dropbox-sdk-python/blob/main/example/updown.py
Then just parse the byte string, for example using pandas:
import io
import pandas as pd
from example.updown import download  # import (or replicate) the download function from updown.py; this is only pseudo-code

file_as_bytes = download(dbx, folder, subfolder, name)
df = pd.read_csv(io.BytesIO(file_as_bytes))  # read_csv expects a file-like object, not raw bytes
print(df)
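Alternatively, the SDK's files_download call returns the file contents in memory, so you can skip updown.py entirely. A minimal sketch, assuming a valid access token and a Dropbox path such as /movie_data.csv (both are placeholders):

```python
import io

import pandas as pd

def dropbox_csv_to_df(token, path):
    """Download a CSV from Dropbox straight into memory and parse it."""
    import dropbox  # pip install dropbox
    dbx = dropbox.Dropbox(token)
    # files_download returns (FileMetadata, requests.Response);
    # response.content holds the raw bytes of the file
    metadata, response = dbx.files_download(path)
    return pd.read_csv(io.BytesIO(response.content))

# df = dropbox_csv_to_df("YOUR_ACCESS_TOKEN", "/movie_data.csv")
```

Nothing touches the local disk here; the bytes go from the HTTP response straight into pandas.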
I think, as suggested here, you can just read it directly over the URL (but you need to change the argument from ?dl=0 to ?dl=1):
e.g.
df = pd.read_csv("https://www.dropbox.com/s/asdfasdfasfasdf/movie_data.csv?dl=1")
Related
I am a beginner in Python programming but I believe the problem I am trying to solve might not be a big one.
So I am working on a program to present the last row of the latest CSV file to the end user. At the moment I am copying and pasting the latest file's path from the FTP directory into, for example:
pd.read_csv("ftp://123.4.567.890/folder1/folder2/123.csv")
where 123.csv is the latest file. Any suggestions on how I might get that 123.csv file into the pandas read_csv() function automatically? In addition, I am using Jupyter Notebook but am somehow unable to change the working directory from my OS to the FTP server; being able to do that might also be very helpful.
The arrangement of the files in the FTP directory looks like below, with no column names:
02/03/2021 12:00AM 37,471 312.csv
02/03/2021 12:00AM 24,138 312.raw
01/26/2021 12:00AM 31,246 612.csv
01/26/2021 12:00AM 19,098 2612.raw
02/01/2021 12:00AM 15,337 0100.csv
02/01/2021 12:00AM 9,858 0100.raw
02/02/2021 12:00AM 134,098 0112.csv
So, how can I fetch the latest CSV file from the listing above?
I would really appreciate your help.
Thanks
There's no magic solution that will have Pandas load the latest file from an FTP server.
You need to split your task into two steps:
Finding the latest file in the FTP server:
Python FTP get the most recent file by date
Your server seems to be IIS. IIS does not support MLSD, and your server is configured to use DOS-style listings. Most code you will find for parsing a LIST response targets *nix servers, so unless you can configure IIS to use *nix-style listings, most of that code won't work with your server. Either adjust the code, or use the less efficient MDTM solution (which should be fine if there are only a few files).
Loading that file to Pandas (you have that already).
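For the first step, the DOS-style listing shown in the question can be parsed directly. A sketch, assuming every line follows the MM/DD/YYYY HH:MM(AM|PM) size name layout from the question:

```python
from datetime import datetime

def latest_csv(listing_lines):
    """Return the name of the newest .csv entry in a DOS-style LIST output."""
    newest = None
    for line in listing_lines:
        # maxsplit=3 keeps file names containing spaces intact
        date, time, size, name = line.split(None, 3)
        if not name.lower().endswith(".csv"):
            continue
        stamp = datetime.strptime(date + " " + time, "%m/%d/%Y %I:%M%p")
        if newest is None or stamp > newest[0]:
            newest = (stamp, name)
    return newest[1] if newest else None
```

With ftplib you would collect the lines via ftp.retrlines('LIST', lines.append) and feed them to this function.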
Pandas can read CSV files directly using the FTP protocol (it's not limited to just the HTTP/HTTPS protocol).
You need to make sure your FTP URL is correct (the IP address in your question is not a valid IP address - IPv4 addresses are four 8-bit numbers, so 255.255.255.255 at most) and refers to the latest file. You may need to do some processing if the latest file doesn't have a standardised name. If you have control over the server, you could add a link, e.g. ftp://servername/latest.csv
Alternatively, you could do this dynamically on the client, using Python:
import ftplib
FTP_HOST = 'ftp.ifremer.fr'
FTP_DIR = '/ifremer/argo/etc/ObjectiveAnalysisWarning/incois/'
# connect to the remote server using anonymous FTP
ftp = ftplib.FTP(FTP_HOST, 'anonymous', '')
# change the remote working directory
ftp.cwd(FTP_DIR)
# load the modification dates for each file
results = [(name, ftp.voidcmd("MDTM " + name)) for name in ftp.nlst()]
# sort by modification date
results.sort(key=lambda x: x[1])
# get the filename for the most recently modified file
most_recent_filename = results[-1][0]
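Sorting the raw MDTM replies works here because they all share the 213 YYYYMMDDHHMMSS shape; if you want to be more robust (some servers append fractional seconds), you can parse the replies into real datetime objects. A small sketch:

```python
from datetime import datetime

def parse_mdtm(reply):
    """Parse an FTP MDTM reply such as '213 20210203120000' into a datetime.

    The leading '213 ' is the FTP status code; [:14] drops any
    fractional-seconds suffix some servers append.
    """
    timestamp = reply.split()[1][:14]
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")

# results.sort(key=lambda x: parse_mdtm(x[1]))
```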
Here is an example of using pandas to download a CSV from a publicly available FTP source:
import pandas as pd
df = pd.read_csv('ftp://ftp.ifremer.fr/ifremer/argo/etc/ObjectiveAnalysisWarning/incois/ar_scoop2_IN_20130722123443.csv')
To use the most recent file already identified by the earlier code and download it with pandas:
# FTP_DIR ends with '/', so simple concatenation forms a valid URL
df = pd.read_csv('ftp://' + FTP_HOST + FTP_DIR + most_recent_filename)
Adjust the URLs to your own, valid URL and you should have the DataFrame that you need.
To get the last row of a DataFrame:
df.iloc[-1, :]
I didn't write the code for a CSV file exactly, but here is something similar to your question.
I had the same problem with opening a Notepad file and had to copy the directory to the new file each time. Here is the code I wrote to overcome the problem:
import subprocess

filename = input("Please input the file name: ")  # input() already returns a str
newfile = filename + ".txt"
subprocess.Popen(["notepad", newfile])
So the code lets you enter the Notepad file name and concatenates it with ".txt". The concatenated string is then used as the path to the file, which is opened using subprocess.Popen.
This worked excellently for me. I know it may not be exactly what you asked, but I hope it helps.
Regards
I have an Excel sheet called last_run.xlsx, and I use a small Python script to upload it to Slack, as follows:
import slack

slack_token = "XXX"
client = slack.WebClient(token=slack_token)
response = client.files_upload(
    channels="#viktor",
    file="last_run.xlsx")
But when I receive it on Slack it is a weird zip file and not an Excel file any more... any idea what I am doing wrong?
Excel files are actually zipped collections of XML documents, so it appears that Slack's automatic file type detection is recognizing the upload as a ZIP file for that reason.
Manually specifying xlsx as the filetype does not change that behavior either.
What works is to also specify a filename; then the file is correctly recognized and uploaded as an Excel file.
Code:
import slack

client = slack.WebClient(token="MY_TOKEN")
response = client.files_upload(
    channels="#viktor",
    file="last_run.xlsx",
    filename="last_run.xlsx")
This looks like a bug in the automatic file type detection to me, so I would suggest submitting a bug report to Slack about this behavior.
I'm trying to import an Excel file that I have stored in a folder within a GitHub repository. Based on that, the file path should be
"C:\\Users\\'username'\\Documents\\GitHub\\'repository'\\'folder'\\'filename'.xlsx"
But when I enter the code
import pandas as pd
xlsfile="C:\\Users\\'username'\\Documents\\GitHub\\'repository'\\'folder'\\'filename'.xlsx"
xl1=pd.read_excel(xlsfile,sheet_name='sheet',skiprows=21)
I get an error that says the file path I entered doesn't exist. I know that the entire path to the file exists because my working directory also contains the file, so what could I be doing wrong?
I have no experience coding. Thanks.
Remove the "'" characters in your file path? And is your sheet really named 'sheet'? I think the default is 'Sheet1', etc.
There can be multiple things going on. As Joe stated, you probably shouldn't have ' ' around your file names; I'm assuming they were included so that you would fill in your local path (i.e. replace 'username' with Jack.Donaghue and so on). An example would look something like: "C:/Users/Jack.Donaghue/Documents/GitHub/YourRepoName/data/datafilename.xlsx"
Also, as colbster pointed out, confirm what your sheet is actually named. I've also run into issues with \ vs / in file paths, since I'm working on Windows 10.
I would recommend trying
import pandas as pd
xlsfile="C:/Users/'username'/Documents/GitHub/'repository'/'folder'/'filename'.xlsx"
xl1=pd.read_excel(xlsfile,sheet_name='sheet',skiprows=21)
I was wondering, is there any way to download only part of a .rar or .zip file without downloading the whole file? There is a zip file containing files A, B, C and D, and I only need A. Can I somehow use the zipfile module to download only that one file?
I am trying the code below:
import zipfile
from io import BytesIO

r = c.get(file)
z = zipfile.ZipFile(BytesIO(r.content))
for file1 in z.namelist():
    if 'time' not in file1:
        print("hi")
        z.extract(file1, download_path + filename)
This code downloads the whole zip file and only extracts the specific one. Can I somehow download only the file I need?
There is a similar question here, but it only shows a command-line approach on Linux; it doesn't address how this can be done using Python libraries.
The question @Juggernaut mentioned in a comment is actually very helpful, as it points you in the direction of the solution.
You need to create a replacement for BytesIO that returns the necessary information to ZipFile: it has to report the total length of the file and then serve whatever byte ranges ZipFile asks for.
How large are those files? Is it really worth the trouble?
Use remotezip: https://github.com/gtsystem/python-remotezip. You can install it using pip:
pip install remotezip
Usage example:
from remotezip import RemoteZip

with RemoteZip("https://path/to/zip/file.zip") as zip_file:
    for file in zip_file.namelist():
        if 'time' not in file:
            print("hi")
            zip_file.extract(file, path="/path/to/extract")
Note that to use this approach, the web server from which you receive the file needs to support the Range header.
I have a zipfile on my Google Drive. In that zipfile is an XML file, which I want to parse, extract a specific piece of information from, and save on my local computer (or wherever).
My goal is to use Python & Google Drive API (with help of PyDrive) to achieve this. The workflow could be as follows:
Connect to my Google Drive via Google Drive API (PyDrive)
Get my zipfile id
Load my zipfile to memory
Unzip, obtain the XML file
Parse the XML, extract the desired information
Save it as a csv on my local computer
Right now, I am able to do steps 1, 2, 4, 5 and 6, but I don't know how to load the zipfile into memory without writing it to my local HDD first.
The following PyDrive code will obtain the zipfile and place it on my local HDD, which is not exactly what I want:
toUnzip = drive.CreateFile({'id':'MY_FILE_ID'})
toUnzip.GetContentFile('zipstuff.zip')
I guess one solution could be as follows:
I could read the zipfile as a string with some encoding:
toUnzip = drive.CreateFile({'id':'MY_FILE_ID'})
zipAsString = toUnzip.GetContentString(encoding='??')
and then I could somehow (no idea how; perhaps StringIO could be useful) read this string with Python's zipfile library. Is this solution even possible? Is there a better way?
You could try io.BytesIO (zip archives are binary data, so on Python 3 it has to be BytesIO rather than StringIO); such objects emulate files but reside in memory.
Here is the code from a related SO post, updated for Python 3:
# get_zip_data() returns the bytes of a zip archive containing 'foo.txt', reading 'hey, foo'
import io
import zipfile

zipdata = io.BytesIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print(foofile.read())
# output: b'hey, foo'
or using a URL:
from urllib.request import urlopen

url = urlopen("http://www.test.com/file.zip")
archive = zipfile.ZipFile(io.BytesIO(url.read()))
Hope this helps.
Eventually, I solved it using BytesIO and the cp862 encoding:
toUnzipStringContent = toUnzip.GetContentString(encoding='cp862')
toUnzipBytesContent = BytesIO(toUnzipStringContent.encode('cp862'))
readZipfile = zipfile.ZipFile(toUnzipBytesContent, "r")