Here is the data I am interested in:
http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip
The archive consists of 3 files.
I want to download the zip with pandas and create a DataFrame from one of them, Production_Crops_E_All_Data.csv:
import pandas as pd
url="http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip"
df=pd.read_csv(url)
Pandas can download files, it can work with zips, and of course it can work with CSV files. But how can I work with one specific file in an archive that contains many files?
Right now I get this error:
ValueError: Multiple files found in compressed zip file %s
This post doesn't answer my question because I have multiple files in one zip:
Read a zipped file as a pandas DataFrame
From this link, try this:
from zipfile import ZipFile
import io
from urllib.request import urlopen
import pandas as pd

# download the archive into memory and open it as a ZipFile
r = urlopen("http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip").read()
file = ZipFile(io.BytesIO(r))

# read each CSV member straight from the archive (the files are latin1-encoded, not utf-8)
data_df = pd.read_csv(file.open("Production_Crops_E_All_Data.csv"), encoding='latin1')
data_df_noflags = pd.read_csv(file.open("Production_Crops_E_All_Data_NOFLAG.csv"), encoding='latin1')
data_df_flags = pd.read_csv(file.open("Production_Crops_E_Flags.csv"), encoding='latin1')
Hope this helps!
EDIT: updated for Python 3 (StringIO moved to the io module).
EDIT: updated the urllib import and switched from StringIO to BytesIO. Also, your CSV files are not UTF-8 encoded; I tried latin1 and that worked.
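If you don't know the member names up front, you can list them from the archive before opening anything; a small sketch reusing the file object from the snippet above:
# list the archive members to see the exact CSV names before opening them
print(file.namelist())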
You could use Python's datatable, which is a reimplementation of R's data.table in Python.
Read in the data:
from datatable import fread

# the exact file to be extracted is known, so simply append it to the zip name
url = "Production_Crops_E_All_Data.zip/Production_Crops_E_All_Data.csv"
df = fread(url)

# convert to pandas
pandas_df = df.to_pandas()
You can equally work within datatable; note, however, that it is not as feature-rich as pandas, but it is a powerful and very fast tool.
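As a rough illustration of working within datatable (the column names Area, Year and Value are assumptions based on the FAOSTAT layout, adjust them to your actual data):
import datatable as dt
from datatable import f, by

DT = dt.fread("Production_Crops_E_All_Data.zip/Production_Crops_E_All_Data.csv")
recent = DT[f.Year >= 2010, :]                    # row filter
avg_value = DT[:, dt.mean(f.Value), by(f.Area)]   # group-by aggregation
avg_value.to_pandas()                             # hand off to pandas when needed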
Update: You can use the zipfile module as well:
from zipfile import ZipFile
from io import BytesIO
import pandas as pd

# open the local archive itself (not the combined "zip/csv" path used by fread above)
with ZipFile("Production_Crops_E_All_Data.zip") as myzip:
    with myzip.open("Production_Crops_E_All_Data.csv") as myfile:
        data = myfile.read()

# read the data into pandas
# had to toy a bit with the encoding;
# thankfully it is a known issue on SO:
# https://stackoverflow.com/a/51843284/7175713
df = pd.read_csv(BytesIO(data), encoding="iso-8859-1", low_memory=False)
I have a data frame which I read in from a locally saved CSV file.
I then want to loop over said file and create several CSV files based on a string in one column.
Lastly, I want to add all those files to a zip file, but without saving them locally. I just want one zip archive including all the different CSV files.
All my attempts using the io or zipfile modules only resulted in one zip file with one CSV file in it (pretty much what I started with).
Any help would be much appreciated!
Here is my code so far, which works but saves all the CSV files to my hard drive.
import pandas as pd
from zipfile import ZipFile
df = pd.read_csv("myCSV.csv")
channelsList = df["Turn one column to list"].values.tolist()
channelsList = list(set(channelsList)) #delete duplicates from list
for channel in channelsList:
    newDf = df.loc[df['Something to match'] == channel]
    newDf.to_csv(f"{channel}.csv")  # saves csv files to disk
DataFrame.to_csv() can write to any file-like object, and ZipFile.writestr() can accept a string (or bytes), so it is possible to avoid writing the CSV files to disk using io.StringIO. See the example code below.
Note: If the channel is simply stored in a single column of your input data, then the more idiomatic (and more efficient) way to iterate over the partitions of your data is to use groupby().
from io import StringIO
from zipfile import ZipFile
import numpy as np
import pandas as pd
# Example data
df = pd.DataFrame(np.random.random((100,3)), columns=[*'xyz'])
df['channel'] = np.random.randint(5, size=len(df))
with ZipFile('/tmp/output.zip', 'w') as zf:
    for channel, channel_df in df.groupby('channel'):
        s = StringIO()
        channel_df.to_csv(s, index=False, header=True)
        zf.writestr(f"{channel}.csv", s.getvalue())
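If you want to sanity-check the result afterwards, you can read a member straight back out of the archive (a small sketch reusing the path above):
with ZipFile('/tmp/output.zip') as zf:
    print(zf.namelist())                            # one CSV per channel
    check = pd.read_csv(zf.open(zf.namelist()[0]))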
I am looking to read in a .csv.gz file that is in the same directory as my Python script, using only the gzip and pandas modules.
So far I have:
import gzip
import pandas as pd
data = gzip.open('test_data.csv.gz', mode='rb')
How do I proceed to convert/read this file in as a DataFrame, without using the csv module as seen in similarly answered questions?
You can use pandas.read_csv directly:
import pandas as pd
df = pd.read_csv('test_data.csv.gz', compression='gzip')
If you must use gzip:
import gzip
with gzip.open('test_data.csv.gz', mode='rb') as csv:
    df = pd.read_csv(csv)
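If the gzipped bytes are already sitting in a variable (say, from a download), the same idea works with an in-memory buffer; a small sketch, with the file name taken from the question:
import io
import pandas as pd
with open('test_data.csv.gz', 'rb') as fh:
    raw = fh.read()  # gzipped bytes held in memory
df = pd.read_csv(io.BytesIO(raw), compression='gzip')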
I have a .ndjson file of 20 GB that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool. This is the tool: https://pinetools.com/split-files
Now I get a file that has the extension .ndjson.000 (and I do not know what that is).
I'm trying to open it as JSON or as a CSV file to read it into pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched ''"' when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
    print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem is that I have some emojis in my file, so I do not know how to encode them?
ndjson is now supported out of the box with the argument lines=True:
import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
I think pandas.read_json cannot handle ndjson correctly.
According to this issue, you can do something like this to read it:
import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S.: All credit for this code goes to KristianHolsheimer from the GitHub issue.
ndjson (newline-delimited JSON) is a JSON-lines format, that is, each line is a JSON document. It is ideal for a dataset lacking rigid structure ('non-SQL') where the file size is large enough to warrant multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your json strings do not contain newlines, you can alternatively use:
import json
with open("dump.ndjson.000") as f:
data = [json.loads(l) for l in f.readlines()]
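If you go that route and still want a DataFrame, the resulting list of dicts converts directly (a short sketch, assuming the same file as above):
import json
import pandas as pd
with open("dump.ndjson.000") as f:
    records = [json.loads(line) for line in f]
df = pd.DataFrame(records)  # one row per JSON object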
This URL
https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip
contains 2 CSV files and 1 PDF, updated daily, containing Covid-19 data.
I want to be able to load the Summary_stats_all_locs.csv as a Pandas DataFrame.
Usually, if there is a URL that points to a CSV, I can just use df = pd.read_csv(url), but since the CSV is inside a zip, I can't do that here.
How would I do this?
Thanks
You will need to first fetch the file, then load it using the ZipFile module. Pandas can actually read CSVs from inside a zip, but the problem here is that there are multiple files, so we need to do this ourselves and specify the file name.
import requests
import pandas as pd
from zipfile import ZipFile
from io import BytesIO
r = requests.get("https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip")
files = ZipFile(BytesIO(r.content))
df = pd.read_csv(files.open("2020_05_16/Summary_stats_all_locs.csv"))
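Note that the folder name inside the archive is date-stamped and changes daily, so rather than hard-coding it you can look the member up first; a small sketch reusing the files object above (and assuming the CSV name itself stays the same):
# find the CSV regardless of the date-stamped folder it sits in
csv_name = next(n for n in files.namelist() if n.endswith("Summary_stats_all_locs.csv"))
df = pd.read_csv(files.open(csv_name))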
I am reading a parquet file and transforming it into a DataFrame.
from fastparquet import ParquetFile
pf = ParquetFile('file.parquet')
df = pf.to_pandas()
Is there a way to read a parquet file from a variable (that was previously read and now holds the parquet data)?
Thanks.
In pandas there is a method to deal with parquet. Here is a reference to the docs. Something like this:
import pandas as pd
pd.read_parquet('file.parquet')
should work. Also please read this post for engine selection.
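For example, to force a specific engine instead of the default 'auto' (assuming fastparquet is installed):
import pandas as pd
df = pd.read_parquet('file.parquet', engine='fastparquet')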
You can also read a file from a variable using pandas.read_parquet, with the following code. I tested this with the pyarrow backend, but it should also work for the fastparquet backend.
import pandas as pd
import io
with open("file.parquet", "rb") as f:
data = f.read()
buf = io.BytesIO(data)
df = pd.read_parquet(buf)
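The same pattern works when the bytes come from somewhere other than a local file, e.g. an HTTP response (a sketch with a hypothetical URL):
import io
import requests
import pandas as pd
resp = requests.get("https://example.com/data.parquet")  # hypothetical URL
df = pd.read_parquet(io.BytesIO(resp.content))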