I am looking to read in a .csv.gz file that is in the same directory as my python script using the gzip and pandas module only.
So far I have,
import gzip
import pandas as pd
data = gzip.open('test_data.csv.gz', mode='rb')
How do I proceed in converting / reading this file in as a dataframe without using the csv module as seen in similarly answered questions?
You can use pandas.read_csv directly:
import pandas as pd
df = pd.read_csv('test_data.csv.gz', compression='gzip')
If you must use gzip, open the file and pass the file object straight to pandas:
import gzip

with gzip.open('test_data.csv.gz', mode='rb') as f:
    df = pd.read_csv(f)
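Note that read_csv defaults to compression='infer', so for a file ending in .gz you can usually omit the argument entirely; a minimal sketch, assuming the same local test_data.csv.gz:
import pandas as pd

# compression is inferred from the .gz extension and decompressed on the fly
df = pd.read_csv('test_data.csv.gz')
print(df.head())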
Related
I have a 20 GB .ndjson file that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool. This is the tool: https://pinetools.com/split-files
Now I get one file that has the extension .ndjson.000 (and I do not know what that is).
I'm trying to open it as JSON or as a CSV file to read it into pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched ''"' when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
    print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem is that I have some emojis in my file, so I do not know how to encode them?
ndjson is now supported out of the box in pandas with the argument lines=True:
import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
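Since the original file is around 20 GB, you may not need to split it at all: when lines=True, read_json also accepts a chunksize and returns an iterator of DataFrames. A sketch, assuming the unsplit dump.ndjson and an arbitrarily chosen chunk size:
import pandas as pd

chunks = []
# with lines=True and chunksize set, read_json yields DataFrames lazily
for chunk in pd.read_json('dump.ndjson', lines=True, chunksize=100_000):
    # filter or aggregate each chunk here before keeping it (placeholder)
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)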
I think pandas.read_json cannot handle ndjson correctly.
According to this issue, you can do something like this to read it:
import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S.: All credit for this code goes to KristianHolsheimer from the GitHub issue.
ndjson (newline-delimited JSON) is a JSON-lines format, that is, each line is a complete JSON document. It is ideal for a dataset lacking rigid structure ('non-SQL') where the file size is large enough to warrant splitting it across multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your json strings do not contain newlines, you can alternatively use:
import json
with open("dump.ndjson.000") as f:
    data = [json.loads(l) for l in f.readlines()]
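One more thing worth checking: because the online tool splits by size, the last line of each piece (and possibly the first) may be cut in the middle of a record, which would explain the "Unterminated string" error rather than the emojis. A sketch that simply skips lines that fail to parse, assuming the rest of the piece is valid ndjson:
import json
import pandas as pd

records = []
with open('dump.ndjson.000', encoding='utf-8') as f:
    for line in f:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # truncated first/last line of the split piece: skip it
            pass

df = pd.DataFrame.from_records(records)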
Here is the data I am interested in:
http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip
It consists of 3 files: Production_Crops_E_All_Data.csv, Production_Crops_E_All_Data_NOFLAG.csv and Production_Crops_E_Flags.csv.
I want to download the zip with pandas and create a DataFrame from one file, Production_Crops_E_All_Data.csv:
import pandas as pd
url="http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip"
df=pd.read_csv(url)
Pandas can download files, it can work with zips, and of course it can work with CSV files. But how can I work with one specific file in an archive that contains many files?
Right now I get this error:
ValueError: ('Multiple files found in compressed zip file %s)
This post doesn't answer my question because I have multiple files in one zip:
Read a zipped file as a pandas DataFrame
From this link
Try this:
from zipfile import ZipFile
import io
from urllib.request import urlopen
import pandas as pd
r = urlopen("http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip").read()
file = ZipFile(io.BytesIO(r))
data_df = pd.read_csv(file.open("Production_Crops_E_All_Data.csv"), encoding='latin1')
data_df_noflags = pd.read_csv(file.open("Production_Crops_E_All_Data_NOFLAG.csv"), encoding='latin1')
data_df_flags = pd.read_csv(file.open("Production_Crops_E_Flags.csv"), encoding='latin1')
Hope this helps!
EDIT: updated for Python 3: StringIO to io.StringIO.
EDIT: updated the import of urllib and changed the usage of StringIO to BytesIO. Also, your CSV files are not UTF-8 encoded; I tried latin1 and that worked.
You could use Python's datatable, which is a reimplementation of R's data.table in Python.
Read in the data:
from datatable import fread

# the exact file to be extracted is known, so simply append it to the zip name:
url = "Production_Crops_E_All_Data.zip/Production_Crops_E_All_Data.csv"
df = fread(url)

# convert to pandas
df_pandas = df.to_pandas()
You can equally work within datatable; note, however, that it is not as feature-rich as pandas, but it is a powerful and very fast tool.
Update: you can use the zipfile module as well:
from zipfile import ZipFile
from io import BytesIO
import pandas as pd

with ZipFile("Production_Crops_E_All_Data.zip") as myzip:
    with myzip.open("Production_Crops_E_All_Data.csv") as myfile:
        data = myfile.read()

# read data into pandas
# had to toy a bit with the encoding,
# thankfully it is a known issue on SO
# https://stackoverflow.com/a/51843284/7175713
df = pd.read_csv(BytesIO(data), encoding="iso-8859-1", low_memory=False)
This url
https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip
contains 2 CSV files and 1 PDF, updated daily, containing COVID-19 data.
I want to be able to load the Summary_stats_all_locs.csv as a Pandas DataFrame.
Usually, if there is a URL that points to a CSV, I can just use df = pd.read_csv(url), but since the CSV is inside a zip, I can't do that here.
How would I do this?
Thanks
You will need to first fetch the file, then load it using the ZipFile module. Pandas can actually read CSVs from inside a zip, but the problem here is that there are multiple files, so we need to do this and specify the file name.
import requests
import pandas as pd
from zipfile import ZipFile
from io import BytesIO
r = requests.get("https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip")
files = ZipFile(BytesIO(r.content))
pd.read_csv(files.open("2020_05_16/Summary_stats_all_locs.csv"))
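Because the dated folder inside the archive changes every day, you may not want to hard-code 2020_05_16/...; ZipFile.namelist() lets you locate the right member at run time. A sketch, where the endswith match is an assumption about how the archive is laid out:
import requests
import pandas as pd
from zipfile import ZipFile
from io import BytesIO

r = requests.get("https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip")
files = ZipFile(BytesIO(r.content))

# pick the member that ends with the CSV we want, whatever the dated folder is called
name = next(n for n in files.namelist() if n.endswith("Summary_stats_all_locs.csv"))
df = pd.read_csv(files.open(name))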
I am reading a parquet file and transforming it into dataframe.
from fastparquet import ParquetFile
pf = ParquetFile('file.parquet')
df = pf.to_pandas()
Is there a way to read a parquet file from a variable (one that was previously read and now holds the parquet data)?
Thanks.
In pandas there is a method to deal with parquet. Here is a reference to the docs. Something like this:
import pandas as pd
pd.read_parquet('file.parquet')
should work. Also please read this post for engine selection.
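If you want to pin a specific backend rather than letting pandas choose, read_parquet also takes an engine argument; a minimal sketch:
import pandas as pd

# engine can be 'pyarrow' or 'fastparquet' (the default 'auto' tries pyarrow first)
df = pd.read_parquet('file.parquet', engine='pyarrow')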
You can also read a parquet file from a variable using pandas.read_parquet with the following code. I tested this with the pyarrow backend, but it should also work with the fastparquet backend.
import pandas as pd
import io
with open("file.parquet", "rb") as f:
data = f.read()
buf = io.BytesIO(data)
df = pd.read_parquet(buf)
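If the parquet bytes are already sitting in a variable (for example downloaded over the network earlier), you can skip the file entirely and wrap them in a BytesIO. A sketch, where the URL is purely hypothetical; any bytes object holding parquet data works the same way:
import io
import requests
import pandas as pd

raw_bytes = requests.get("https://example.com/file.parquet").content  # hypothetical source
df = pd.read_parquet(io.BytesIO(raw_bytes))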
I have a bunch of DAT files that I need to convert to XLS files using Python. Should I use the CSV library to do this or is there a better way?
I'd use pandas.
import pandas as pd
df = pd.read_table('DATA.DAT')
df.to_excel('DATA.xlsx')
And of course you can set up a loop to get through all your files. Something along these lines, maybe:
import glob
import os

os.chdir("C:\\FILEPATH\\")
for file in glob.glob("*.DAT"):
    # show which file is being converted
    print(file)
    df = pd.read_table(file)
    file1 = file.replace('.DAT', '.xlsx')
    df.to_excel(file1)
If xlsxwriter complains about strings being converted to URLs, you can create the writer with that behaviour disabled:
writer = pd.ExcelWriter('pandas_example.xlsx',
                        engine='xlsxwriter',
                        options={'strings_to_urls': False})
Or you can simply use:
df.to_excel('example.xlsx')
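Putting the pieces together, a sketch of the conversion loop that routes each DataFrame through such a writer (note that newer pandas versions expect engine_kwargs={'options': {'strings_to_urls': False}} instead of options=):
import glob
import os
import pandas as pd

os.chdir("C:\\FILEPATH\\")
for file in glob.glob("*.DAT"):
    df = pd.read_table(file)
    out = file.replace('.DAT', '.xlsx')
    # on newer pandas, pass engine_kwargs={'options': {'strings_to_urls': False}}
    with pd.ExcelWriter(out, engine='xlsxwriter',
                        options={'strings_to_urls': False}) as writer:
        df.to_excel(writer)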