I currently have a GET request to a URL that returns three things: a .zip file, a .zipsig file, and a .txt file.
I'm only interested in the .zip file, which contains dozens of .json files. I would like to extract all of these .json files, preferably directly into a single pandas DataFrame, but extracting them into a folder also works.
Code so far, mostly stolen:
import io
import requests
import zipfile

license = requests.get(url, headers={'Authorization': "Api-Token " + 'blah'})
z = zipfile.ZipFile(io.BytesIO(license.content))
billingRecord = z.namelist()[0]  # only grabs the first member of the archive
z.extract(billingRecord, path="C:\\Users\\Me\\Downloads\\Json license")
This extracts the entire .zip file to the path. I would like to extract the individual .json files from said .zip file to the path.
import io
import json
import zipfile

import pandas as pd

dfs = []
with zipfile.ZipFile(io.BytesIO(license.content)) as zfile:
    for info in zfile.infolist():
        # each member of the downloaded archive that is itself a .zip gets opened in memory
        if info.filename.endswith('.zip'):
            zfiledata = io.BytesIO(zfile.read(info.filename))
            with zipfile.ZipFile(zfiledata) as json_zips:
                for info in json_zips.infolist():
                    if info.filename.endswith('.json'):
                        # parse each .json member and flatten it into a DataFrame
                        json_data = pd.json_normalize(json.loads(json_zips.read(info.filename)))
                        dfs.append(json_data)

df = pd.concat(dfs, sort=False)
print(df)
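If the downloaded .zip holds the .json files directly (rather than nesting further .zip archives inside it), a flatter version of the same idea should work. A minimal sketch, assuming license is the response from the GET request in the question:

import io
import json
import zipfile

import pandas as pd

dfs = []
with zipfile.ZipFile(io.BytesIO(license.content)) as zfile:
    for info in zfile.infolist():
        if info.filename.endswith('.json'):
            # read the member straight from memory and flatten it
            dfs.append(pd.json_normalize(json.loads(zfile.read(info.filename))))

df = pd.concat(dfs, sort=False)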
I would do something like this. Obviously this uses my test.zip file, but the steps are:
List the files from the archive using the .infolist() method on your z archive
Check if the filename ends with the json extension using .endswith('.json')
Extract that filename with .extract(info.filename, info.filename)
Obviously you've called your archive z while mine is archive, but that should get you started.
Example code:
import zipfile

with zipfile.ZipFile("test.zip", mode="r") as archive:
    for info in archive.infolist():
        print(info.filename)
        if info.filename.endswith('.png'):
            print('Match: ', info.filename)
            archive.extract(info.filename, info.filename)
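Note that the second argument to .extract() is the destination directory, so passing info.filename there drops each match into a folder named after the file itself. If you would rather collect everything in one output folder, something like this should work (the 'extracted_json' directory name is just an example):

archive.extract(info.filename, path="extracted_json")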
I've looked all over SO for an approach to this problem and none of the ones I have tried have worked. I've seen several posts about downloading zip files from a URL or unzipping files from a local directory in Python, but I am a bit confused about how to put it all together.
My problem: I have a page of zipped data that is organized by month going back to 2010. I'd like to use some Python code to:
scrape the page and nab only the .zip files (there's other data on the page)
unzip each respective monthly dataset
extract and concatenate all the .csv files in each unzipped folder to one long Pandas dataframe
I've tried
from urllib.request import urlopen

url = "https://s3.amazonaws.com/capitalbikeshare-data/2010-capitalbikeshare-tripdata.zip"
save_as = "tripdata1.csv"

# Download from URL
with urlopen(url) as file:
    content = file.read()

# Save to file
with open(save_as, 'wb') as download:
    download.write(content)
but this returns gibberish.
Then, I tried an approach I saw from a related SO post:
import glob
import pandas as pd
from zipfile import ZipFile

path = r'https://s3.amazonaws.com/capitalbikeshare-data/index.html' # my path to site

# load all zip files in folder
all_files = glob.glob(path + "/*.zip")

df_comb = pd.DataFrame()
flag = False
for filename in all_files:
    zip_file = ZipFile(filename)
    for text_file in zip_file.infolist():
        if text_file.filename.endswith('tripdata.csv'):
            df = pd.read_csv(zip_file.open(text_file.filename),
                             delimiter=';',
                             header=0,
                             index_col=['ride_id']
                             )
            if not flag:
                df_comb = df
                flag = True
            else:
                df_comb = pd.concat([df_comb, df])

print(df_comb.info())
But this returned a df with zero data, or, with additional tinkering, returned an error that there were no filenames on the page... :/
Final data should essentially just be a row-wise merge of all the monthly trip data from the index.
Any advice or fixes will be highly appreciated!
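A minimal sketch of one way to put the pieces together, assuming you already have (or have scraped) the list of .zip URLs: glob only works on the local filesystem, so each archive has to be fetched over HTTP and the raw bytes wrapped in BytesIO before ZipFile can open them. The second URL and the idea that every member CSV has compatible columns are assumptions here:

import io
from urllib.request import urlopen
from zipfile import ZipFile

import pandas as pd

# assumed list of archive URLs; in practice you would scrape these from the page
zip_urls = [
    "https://s3.amazonaws.com/capitalbikeshare-data/2010-capitalbikeshare-tripdata.zip",
    "https://s3.amazonaws.com/capitalbikeshare-data/2011-capitalbikeshare-tripdata.zip",
]

frames = []
for zip_url in zip_urls:
    raw = urlopen(zip_url).read()               # these are zip bytes, not CSV text
    with ZipFile(io.BytesIO(raw)) as zf:
        for member in zf.infolist():
            if member.filename.endswith('.csv'):
                frames.append(pd.read_csv(zf.open(member.filename)))

# row-wise merge of all the monthly trip data
df_comb = pd.concat(frames, ignore_index=True)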
How to download specific files from .txt url?
I have a URL, https://.storage.public.eu/opendata/files/open_data_files_access.txt (not real), listing multiple files (here are just a few; in reality there are around 5k files) that can be downloaded separately. However, I need to download only specific files, and to do this with Python.
For instance, I have a list combining folder names and file names. How do I download only the files that are on the list? Let's say the list is:
files = ['folder1_file_1.jpg', 'folder1_file_2.jpg', 'folder1_file_3.jpg', 'folder1_file_4.jpg', 'folder1_file_10.jpg', 'folder2_file_2.jpg', 'folder2_file_3.jpg', 'folder3_file_1.jpg', 'folder3_file_3.jpg', 'folder3_file_4.jpg']
How do I download only the files in this list and save them to a specified directory?
I assume that the answer is somewhere here, but it does not work for me:
uurl = 'https://.storage.public.eu/opendata/files/open_data_files_access.txt'

from requests import get  # to make GET request

def download(url, file_name):
    # open in binary mode
    with open(file_name, "wb") as file:
        # get request
        response = get(url)
        # write to file
        file.write(response.content)

file_name = ['folder1_file_1.jpg', 'folder1_file_2.jpg', 'folder1_file_3.jpg', 'folder1_file_4.jpg', 'folder1_file_10.jpg', 'folder2_file_2.jpg', 'folder2_file_3.jpg', 'folder3_file_1.jpg', 'folder3_file_3.jpg', 'folder3_file_4.jpg']
download(uurl, file_name)
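A minimal sketch of the filtering-and-downloading part, assuming the .txt file lists one download URL per line and that the file name is the last path segment of each URL (both are assumptions, since the real listing format isn't shown):

import os
import requests

wanted = set(files)          # the list of wanted file names from above
out_dir = 'downloads'        # example target directory
os.makedirs(out_dir, exist_ok=True)

listing = requests.get(uurl).text            # the .txt listing (assumed: one URL per line)
urls = [line.strip() for line in listing.splitlines() if line.strip()]

for file_url in urls:
    name = file_url.rsplit('/', 1)[-1]       # assumed: file name is the last URL segment
    if name in wanted:
        response = requests.get(file_url)
        with open(os.path.join(out_dir, name), 'wb') as fh:
            fh.write(response.content)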
I have an input file named file_a.xml.
I already created a function to parse out the XML and save it as a df. Then I used df.to_csv to save the output as file_a.csv.
Is there a way to do this automatically with default filename and extension?
I need to iterate over a folder with lots of .xml files, so I'd like to base the output filename & extension on the input XML file.
xml_file = open('file/path/dir/file_a.xml', 'r').read()

def XML_to_CSV(xml_file):
    ...code to parse out xml...
    return df

csv_data = df.to_csv('file/path/dir/file_a.csv', index=False)
Try something like this:
import os
import pandas as pd
from pathlib import Path
for file in os.listdir("your dir"):
    if file.endswith(".xml"):
        # ...make the xml turn into a df, e.g. df = XML_to_CSV(...)
        df.to_csv(Path(file).stem + '.csv', index=False)
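If you also want each .csv written next to its source file instead of into the working directory, pathlib's with_suffix keeps the whole path and only swaps the extension. A small variant, assuming the XML_to_CSV function from the question and that "your dir" is the folder of XML files:

from pathlib import Path

for xml_path in Path("your dir").glob("*.xml"):
    df = XML_to_CSV(xml_path.read_text())            # your existing parsing function
    df.to_csv(xml_path.with_suffix('.csv'), index=False)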
I am doing a data science project.
I am using a Google notebook for this job.
My dataset resides here, and I want to access it directly from the Python notebook.
I am using the following line of code to load it:
df = pd.read_csv('link')
But the command line is throwing an error like the one below.
What should I do?
It's difficult to answer exactly as there is a lack of detail, but here you go for this kind of request.
You have to import ZipFile & urlopen in order to get the data from the URL, extract the data from the zip, and then use the CSV file for pandas processing.
from zipfile import ZipFile
from urllib.request import urlopen
import pandas as pd
import os

URL = 'https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip'

# open and save the zip file onto computer
url = urlopen(URL)
output = open('05d2b4ea-c-Dataset.zip', 'wb')  # note the flag: "wb"
output.write(url.read())
output.close()

# read the zip file as a pandas dataframe
# (note: pd.read_csv can only read a zip that contains a single CSV file)
df = pd.read_csv('05d2b4ea-c-Dataset.zip')

# if keeping the zip file on disk is not wanted, then:
os.remove('05d2b4ea-c-Dataset.zip')  # remove the copy of the zipfile on disk
Use the urllib module to download the zip file into memory; urlopen returns a file-like object that you can read() and pass to ZipFile (a standard-library package).
Since here there are multiple files like
['test_data/AggregateData_Test.csv', 'test_data/TransactionData_Test.csv', 'train_data/AggregateData_Train.csv', 'train_data/Column_Descriptions.xlsx', 'train_data/sample_submission.csv', 'train_data/TransactionData_Train.csv']
Load it into a dict of dataframes with the filename as the key. Altogether the code will be:
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO

import pandas as pd

zip_in_memory = urlopen("https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip").read()

z = ZipFile(BytesIO(zip_in_memory))

dict_of_dfs = {file.filename: pd.read_csv(z.open(file.filename))
               for file in z.infolist()
               if file.filename.endswith('.csv')}
Now you can access the dataframe for each CSV like dict_of_dfs['test_data/AggregateData_Test.csv'].
Of course, all of this is unnecessary if you just download the zip from the link yourself and pass the local file to ZipFile.
I'm looking for a way to extract a specific file (knowing its name) from an archive containing multiple files, without writing anything to the hard drive.
I tried to use both StringIO and zipfile, but I either get the entire archive, or the same error from ZipFile (open requires something other than a StringIO object).
Needed behaviour:
archive.zip #containing ex_file1.ext, ex_file2.ext, target.ext
extracted_file #the targeted unzipped file
archive.zip = getFileFromUrl("file_url")
extracted_file = extractFromArchive(archive.zip, target.ext)
What I've tried so far:
import StringIO
import zipfile, requests

data = requests.get("file_url")
zfile = StringIO.StringIO(zipfile.ZipFile(data.content))
needed_file = zfile.open("Needed file name", "r").read()
There is a builtin library, zipfile, made for working with zip archives.
https://docs.python.org/2/library/zipfile.html
You can list the files in an archive:
ZipFile.namelist()
and extract a subset:
ZipFile.extract(member[, path[, pwd]])
EDIT:
This question has info on in-memory zips. TL;DR, ZipFile does work with in-memory file-like objects.
Python in-memory zip library
I finally found out why I didn't succeed, after a few hours of testing:
I was buffering the ZipFile object instead of buffering the file itself and then opening it as a ZipFile object, which raised a type error.
Here is the way to do it:
import StringIO
import zipfile, requests

data = requests.get(url)  # Getting the archive from the url
zfile = zipfile.ZipFile(StringIO.StringIO(data.content))  # Opening it in an emulated file
filenames = zfile.namelist()  # Listing all files
for name in filenames:
    if name == "Needed file name":  # Verify the file is present
        needed_file = zfile.open(name, "r").read()  # Getting the needed file content
        break
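On Python 3, StringIO.StringIO no longer exists; the same idea works with io.BytesIO, since the zip content is binary. A small sketch of the equivalent, with the member name left as a placeholder:

import io
import zipfile

import requests

data = requests.get(url)  # the archive URL from above
with zipfile.ZipFile(io.BytesIO(data.content)) as zfile:
    if "Needed file name" in zfile.namelist():
        needed_file = zfile.read("Needed file name")  # bytes of that single member, nothing written to disk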