Reading only .csv file within a .zip from URL with Pandas? - python

There is a .csv file contained within a .zip file at a URL that I am trying to read into a Pandas DataFrame; I don't want to download the .zip file to disk but rather read the data directly from the URL. I realize that pandas.read_csv() can only do this if the .csv file is the only file contained in the .zip; however, when I run this:
import pandas as pd
# specify zipped comma-separated values url
zip_csv_url = 'http://www12.statcan.gc.ca/census-recensement/2016/geo/ref/gaf/files-fichiers/2016_92-151_XBB_csv.zip'
df1 = pd.read_csv(zip_csv_url)
I get this:
ValueError: Multiple files found in compressed zip file ['2016_92-151_XBB.csv', '92-151-g2016001-eng.pdf', '92-151-g2016001-fra.pdf']
The error lists the contents of the .zip; I'm wondering how I can read the only available .csv file in the .zip into the new DataFrame (df1), as the .zip file from the URL I will be using would only ever have one .csv file within it. Thanks!
N.B.
The corresponding .zip file from a separate URL with shapefiles reads no problem with geopandas.read_file() when I run this code:
import geopandas as gpd
# specify zipped shapefile url
zip_shp_url = 'http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/ldb_000b16a_e.zip'
gdf1 = gpd.read_file(zip_shp_url)
This works despite a .pdf file also being contained within that .zip (shown in an image of the archive contents, omitted here).
It would appear that geopandas.read_file() has the ability to read only the requisite shapefiles for creating the GeoDataFrame while ignoring unnecessary data files. Since GeoPandas is based on Pandas, shouldn't Pandas also have the functionality to read only the .csv within a .zip that holds multiple other file types? Any thoughts?

import os
import zipfile
import pandas as pd
from io import BytesIO
from urllib.request import urlopen

# fetch the archive into memory
resp = urlopen(YOUR_ZIP_LINK)
files_zip = zipfile.ZipFile(BytesIO(resp.read()))
# files_zip.namelist()  # lists the archive's members

directory_to_extract_to = YOUR_DESTINATION_FOLDER
file = YOUR_csv_FILE_NAME

# extract only the csv, then read it back from disk
with files_zip as zip_ref:
    zip_ref.extract(file, directory_to_extract_to)
pd.read_csv(os.path.join(directory_to_extract_to, file))
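If you'd rather not touch the disk at all, the .csv member can be read straight out of the in-memory zip. A minimal sketch, using a small in-memory archive in place of the real download (with the real URL you would pass BytesIO(urlopen(zip_csv_url).read()) to ZipFile instead of building one by hand):

```python
import zipfile
import pandas as pd
from io import BytesIO

# Build a small in-memory zip standing in for the downloaded bytes
buf = BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('2016_92-151_XBB.csv', 'a,b\n1,2\n3,4\n')
    zf.writestr('92-151-g2016001-eng.pdf', b'%PDF-1.4 dummy')

with zipfile.ZipFile(buf) as zf:
    # pick out the single .csv member, ignoring the PDFs
    csv_name = [n for n in zf.namelist() if n.endswith('.csv')][0]
    with zf.open(csv_name) as f:
        df1 = pd.read_csv(f)
```

ZipFile.open() returns a file-like object, which pd.read_csv() accepts directly, so nothing is ever written to disk.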

Related

How to create a DataFrame from multiple text files using python

I need to create a DataFrame using multiple text files; all the text files are in the same directory.
Each text file contains data in the same layout (shown in an image of the text file format, omitted here).
From these multiple text files I need to create one DataFrame.
If it is possible to remove the last line (Name: 45559, dtype: object), then you should be able to load each txt file as a csv:
import os
import pandas as pd

txt_files_dir = '...'
files = os.listdir(txt_files_dir)

# os.listdir() returns bare file names, so join the directory back on
dfs_list = [pd.read_csv(os.path.join(txt_files_dir, f), sep=r'\s+') for f in files]
data_frame_result = pd.concat(dfs_list, axis=0, ignore_index=True)
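One way to drop that trailing Name: 45559, dtype: object line without editing the files first is pandas' skipfooter option (which requires engine='python'). A sketch with made-up file contents:

```python
import pandas as pd
from io import StringIO

# Hypothetical file contents: whitespace-separated columns plus the
# stray "Name: ..., dtype: object" summary line at the end.
text = """col1 col2
1 2
3 4
Name: 45559, dtype: object
"""

# skipfooter=1 drops the last line; it only works with the python engine
df = pd.read_csv(StringIO(text), sep=r'\s+', skipfooter=1, engine='python')
```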

Turn one dataframe into several dfs and add them as CSVs to zip archive (without saving files locally)

I have a data frame which I read in from a locally saved CSV file.
I then want to loop over said file and create several CSV files based on a string in one column.
Lastly, I want to add all those files to a zip file, but without saving them locally. I just want one zip archive including all the different CSV files.
All my attempts using the io or zipfile modules only resulted in one zip file with one CSV file in it (pretty much what I started with).
Any help would be much appreciated!
Here is my code so far, which works but saves all CSV files just to my hard drive.
import pandas as pd
from zipfile import ZipFile
df = pd.read_csv("myCSV.csv")
channelsList = df["Turn one column to list"].values.tolist()
channelsList = list(set(channelsList)) #delete duplicates from list
for channel in channelsList:
    newDf = df.loc[df['Something to match'] == channel]
    newDf.to_csv(f"{channel}.csv")  # saves csv files to disk
DataFrame.to_csv() can write to any file-like object, and ZipFile.writestr() can accept a string (or bytes), so it is possible to avoid writing the CSV files to disk using io.StringIO. See the example code below.
Note: If the channel is simply stored in a single column of your input data, then the more idiomatic (and more efficient) way to iterate over the partitions of your data is to use groupby().
from io import StringIO
from zipfile import ZipFile
import numpy as np
import pandas as pd
# Example data
df = pd.DataFrame(np.random.random((100,3)), columns=[*'xyz'])
df['channel'] = np.random.randint(5, size=len(df))
with ZipFile('/tmp/output.zip', 'w') as zf:
    for channel, channel_df in df.groupby('channel'):
        s = StringIO()
        channel_df.to_csv(s, index=False, header=True)
        zf.writestr(f"{channel}.csv", s.getvalue())
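The same pattern also works entirely in memory if you swap the path for a BytesIO buffer; a sketch that writes the archive and reads one member back to confirm the round trip:

```python
from io import BytesIO, StringIO
from zipfile import ZipFile
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'channel': [0, 1, 0]})

# Write each channel's rows into an in-memory zip instead of a file on disk
buf = BytesIO()
with ZipFile(buf, 'w') as zf:
    for channel, channel_df in df.groupby('channel'):
        s = StringIO()
        channel_df.to_csv(s, index=False)
        zf.writestr(f"{channel}.csv", s.getvalue())

# Read one member back to confirm the round trip
with ZipFile(buf) as zf:
    restored = pd.read_csv(zf.open('0.csv'))
```

buf.getvalue() then holds the complete zip as bytes, ready to upload or send over the network without ever saving it locally.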

Using Pandas, how to read a csv file inside a zip file which you fetch from a URL [Python]

This url
https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip
contains 2 csv files, and 1 pdf which is updated daily, containing Covid-19 Data.
I want to be able to load the Summary_stats_all_locs.csv as a Pandas DataFrame.
Usually if there is a url that points to a csv I can just use df = pd.read_csv(url) but since the csv is inside a zip, I can't do that here.
How would I do this?
Thanks
You will need to fetch the file first, then load it using the ZipFile module. Pandas can actually read csvs from inside a zip, but the problem here is that there are multiple files, so we need to do this ourselves and specify the file name.
import requests
import pandas as pd
from zipfile import ZipFile
from io import BytesIO
r = requests.get("https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip")
files = ZipFile(BytesIO(r.content))
df = pd.read_csv(files.open("2020_05_16/Summary_stats_all_locs.csv"))
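Since the dated folder inside the archive changes daily, hard-coding "2020_05_16/..." will break tomorrow; one option is to locate the member by its file-name suffix instead. A sketch against a small stand-in archive (the member names here are invented):

```python
from io import BytesIO
from zipfile import ZipFile
import pandas as pd

# Stand-in for BytesIO(r.content); the real folder date will differ
buf = BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('2020_05_16/Summary_stats_all_locs.csv', 'loc,val\nA,1\nB,2\n')
    zf.writestr('2020_05_16/other.csv', 'x\n1\n')

files = ZipFile(buf)
# Find the wanted csv regardless of the dated folder prefix
name = next(n for n in files.namelist()
            if n.endswith('Summary_stats_all_locs.csv'))
df = pd.read_csv(files.open(name))
```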

Python pandas create dataframe from csv embedded within a web txt file

I am trying to import CSV-formatted data into a Pandas dataframe. The CSV data is located within a .txt file that is located at a web URL. The issue is that I only want to import the part (or parts) of the .txt file that is formatted as CSV (shown in an image, omitted here). Essentially I need to skip the first 9 rows and then import rows 10-16 as CSV.
My code
import csv
import pandas as pd
import io
url = "http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt"
df = pd.read_csv(io.StringIO(url), skiprows = 9, sep =',', skipinitialspace = True)
df
I get a lengthy error msg that ultimately says "EmptyDataError: No columns to parse from file"
I have looked at similar examples Read .txt file with Python Pandas - strings and floats but this is different.
The code above reads the URL string itself as CSV data rather than the text file fetched from that URL. To see what I mean, take out the skiprows parameter and then show the data frame. You'll see this:
Empty DataFrame
Columns: [http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt]
Index: []
Note that the columns are the URL itself.
Import requests (you may have to install it first) and then try this:
content = requests.get(url).content
df = pd.read_csv(io.StringIO(content.decode('utf-8')), skiprows=9)
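Since only a slice of the file is CSV-formatted, nrows can also stop the read before any trailing non-CSV lines. A sketch with a stand-in string (the real file's exact layout may differ):

```python
import io
import pandas as pd

# Nine preamble lines, a small CSV block, then trailing notes; this string
# stands in for requests.get(url).content.decode('utf-8')
lines = [f"header line {i}" for i in range(1, 10)]
lines += ["month,calm,N,NE", "Jan,5,10,12", "Feb,6,11,13", "Table ends here"]
text = "\n".join(lines) + "\n"

# skiprows jumps past the preamble; nrows stops before the trailing notes
df = pd.read_csv(io.StringIO(text), skiprows=9, nrows=2)
```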

How do I convert several large text files into one CSV file if they are too large to be converted individually?

I have several large .text files that I want to consolidate into one .csv file. However, each of the files is too large to import into Excel on its own, let alone all together.
I want to use pandas to analyze the data, but don't know how to get the files all in one place.
How would I go about reading the data directly into Python, or into Excel for a .csv file?
The data in question is the 2019-2020 Contributions by individuals file on the FEC's website.
You can convert each of the files to csv and then concatenate them to form one final csv file:
import glob
import os
import pandas as pd

txt_path = "path/to/textfiles"   # use your paths
csv_path = 'pathtonewcsvfolder'

# convert each fixed-width text file to its own csv
for x, filename in enumerate(os.listdir(txt_path)):
    df = pd.read_fwf(os.path.join(txt_path, filename))
    df.to_csv(os.path.join(csv_path, 'log' + str(x) + '.csv'), index=False)

# then concatenate all the csvs into one
all_csv_files = glob.iglob(os.path.join(csv_path, "*.csv"))
converted_df = pd.concat((pd.read_csv(f) for f in all_csv_files), ignore_index=True)
converted_df.to_csv('converted.csv', index=False)
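If even a single file is too large to hold in memory, pandas can also stream it in chunks and append each chunk to one output csv, skipping the intermediate per-file csvs entirely. A minimal sketch with a small stand-in file (the FEC individual-contributions files are pipe-delimited, though the columns here are invented):

```python
import os
import pandas as pd

# Write a small stand-in input file; a real file would be far larger
with open('big.txt', 'w') as f:
    f.write('a|b\n')
    for i in range(10):
        f.write(f'{i}|{i * 2}\n')

# Stream the file in chunks so it never has to fit in memory at once;
# write the header only with the first chunk, then append
first = True
for chunk in pd.read_csv('big.txt', sep='|', chunksize=4):
    chunk.to_csv('combined.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False

os.remove('big.txt')
```

The same loop can wrap an outer loop over several input files, all appending to the one combined.csv.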
