I have a data frame that I want to export with to_csv.
I need the result to be a CSV file inside a zip archive.
I tried the compression argument, but it did not work as planned:
metadata_table.to_csv(r'/tmp/meta.gz', compression='gzip')
This code creates a compressed file, but the file inside it is not recognized as an Excel/CSV file; it opens as a plain text file. If I change the file name to .csv, I only get a regular CSV (opened by Excel) with all the information garbled inside.
Is it possible to do this with one command, rather than exporting to CSV first and compressing it into a zip afterwards?
Try saving with the file name file.csv.gz, as shown below:
import pandas as pd
data.to_csv('file.csv.gz', compression='gzip')
Hope this is helpful!
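Note that gzip and zip are different formats, so a .gz file is not a zip archive. Here is a minimal standard-library sketch of placing a CSV inside a real zip in one step, assuming the CSV text is already available as a string (for instance, DataFrame.to_csv() returns a string when called without a path). The function name write_csv_to_zip and the file names are illustrative, not from the original post.

```python
import zipfile


def write_csv_to_zip(csv_text, zip_path, member_name):
    """Store csv_text as a single member called member_name inside zip_path."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # writestr adds the text directly, so no intermediate CSV file is needed
        zf.writestr(member_name, csv_text)
```

With pandas this could be called as write_csv_to_zip(metadata_table.to_csv(), '/tmp/meta.zip', 'meta.csv'). Recent pandas versions (1.0 and later) also accept a dict such as compression={'method': 'zip', 'archive_name': 'meta.csv'} in to_csv, which may make the helper unnecessary.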
I have an Excel file that is generated daily and can have 50k+ rows. Is there a way to read only the last row (which holds the sums of the columns)?
Right now I read the entire sheet and keep only the last row, but this takes a huge amount of runtime.
My code:
df=pd.read_excel(filepath,header=1,usecols="O:AC")
df=df.tail(1)
Pandas is quite slow, especially with large in-memory data. You could consider a lazy-loading approach, for example with dask.
Otherwise, you can open the file with open() and read only the last line:
with open(filepath, "r") as file:
    last_line = file.readlines()[-1]
I don't think there is a way to reduce the runtime when you read an Excel file.
When you read an Excel file, or a single sheet of one, all of its data is loaded into memory first. Even if you pass skiprows to pd.read_excel, the rows are only filtered after all the data has been loaded, so it cannot reduce the runtime.
If you really want to reduce the time spent reading the file, you should save it in another format, such as .csv or .txt.
Also, you generally can't read Microsoft Excel files as text files using methods like readlines or read. You should either convert the files to another format first (.csv is a good choice, since it can be read with the csv module) or use a dedicated Python module such as pyexcel or openpyxl to read .xlsx files directly.
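If the daily file can be saved as CSV, as suggested above, the last row can be read without keeping the whole file in memory. Here is a minimal sketch using only the standard library; the function name last_csv_row is an assumption for illustration. The file is still scanned once from start to end, but only one row is held at a time.

```python
import csv
from collections import deque


def last_csv_row(path):
    """Return the last row of a CSV file as a list of strings, or None if empty."""
    with open(path, newline="") as f:
        # A deque with maxlen=1 discards every row except the most recent one
        tail = deque(csv.reader(f), maxlen=1)
    return tail[0] if tail else None
```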
Update: Sorry, it seems my question wasn't asked properly. I am analyzing a transportation network consisting of more than 5000 links. All the data is included in one big CSV file. I have several JSON files, each of which contains a subset of this network. I am trying to loop through all the JSON files INDIVIDUALLY (i.e. not trying to concatenate them), read each JSON file, extract the matching information from the CSV file, perform a calculation, and save the result along with the file name in a new dataframe, with one row per JSON file.
This is the code I wrote, but I am not sure it is efficient enough.
import glob
import json
import os

import pandas as pd

name = []
percent_of_truck = []
path_to_json = r'\\directory'

# `final` is the DataFrame that holds the data from the main CSV file
z = glob.glob(os.path.join(path_to_json, '*.json'))
for i in z:
    with open(i, 'r') as myfile:
        l = json.load(myfile)
    name.append(i)
    d_2019 = final.loc[final['LINK_ID'].isin(l)]  # retrieve data from main CSV file
    # calculation: length-weighted average of AADTT16 / AADT16
    avg_m = (d_2019['AADTT16'] / d_2019['AADT16'] * d_2019['Length']).sum() / d_2019['Length'].sum()
    percent_of_truck.append(avg_m)

f = pd.DataFrame()
f['Name'] = name
f['% of truck'] = percent_of_truck
I'm assuming here that you just want a dictionary of all the JSON. If so, use the json library (import json), and this code may be of use:
import json

def importSomeJSONFile(f):
    # parse the JSON file at path f and return the resulting Python object
    with open(f) as fh:
        return json.load(fh)

# make sure the file exists in the same directory
example = importSomeJSONFile("example.json")
print(example)
# access a value within it, replacing key with what you want, like "name"
key = "name"
print(example[key])
Since you haven't added a schema or any other specific requirements, you can follow this approach to solve your problem in any language you prefer:
1. Get the directory of the JSON files that need to be read.
2. Get a list of all files present in the directory.
3. For each file name returned in step 2:
   - Read the file.
   - Parse the JSON from the string.
   - Perform the required calculation.
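The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: each JSON file is assumed to hold a list of link IDs, links is a hypothetical dict mapping a link ID to a record with the AADTT16, AADT16 and Length fields used in the question, and both function names are made up.

```python
import json
from pathlib import Path


def weighted_truck_share(link_ids, links):
    """Length-weighted average of AADTT16 / AADT16 over the given link IDs."""
    rows = [links[i] for i in link_ids if i in links]
    total_length = sum(r["Length"] for r in rows)
    if total_length == 0:
        return None  # no matching links, so there is nothing to average
    weighted = sum(r["AADTT16"] / r["AADT16"] * r["Length"] for r in rows)
    return weighted / total_length


def process_directory(directory, links):
    """Return {file name: truck share} for every *.json file in directory."""
    results = {}
    for path in sorted(Path(directory).glob("*.json")):  # list the files
        with open(path) as f:                            # read each file
            link_ids = json.load(f)                      # parse the JSON
        results[path.name] = weighted_truck_share(link_ids, links)
    return results
```

Looking records up in a dict keyed by link ID avoids rescanning the full table for every file, which is one possible way to keep the loop cheap.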
I have two questions regarding reading data from a file in .xlsx format.
Is it possible to convert an .xlsx file to .csv without actually opening the file in pandas or using xlrd? Because when I have to open many files this is quite slow and I was trying to speed it up.
Is it possible to use some sort of for loop to loop through decoded xlsx lines? I put an example below.
xlsx_file = 'some_file.xlsx'
with open(xlsx_file) as lines:
    for line in lines:
        <do something like I would do for a normal string>
I would like to know if this is possible without the well known xlrd module.
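An .xlsx file is a zip archive of XML parts, so opening it as plain text and looping over lines does not give usable rows. A rough sketch of reading rows with only the standard library (no xlrd or pandas) follows. It assumes the first worksheet is stored at xl/worksheets/sheet1.xml, which is common but not guaranteed, and it handles only plain values and shared strings, not dates, formulas, inline strings or styles. Empty cells are usually omitted from the XML, so columns may not line up between rows. The function name iter_xlsx_rows is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def iter_xlsx_rows(path, sheet="xl/worksheets/sheet1.xml"):
    """Yield each worksheet row as a list of cell values (strings or None)."""
    with zipfile.ZipFile(path) as zf:
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in root.iter(NS + "si"):
                shared.append("".join(t.text or "" for t in si.iter(NS + "t")))
        with zf.open(sheet) as f:
            # iterparse streams the XML instead of building the whole tree
            for _, elem in ET.iterparse(f):
                if elem.tag != NS + "row":
                    continue
                values = []
                for cell in elem.iter(NS + "c"):
                    v = cell.find(NS + "v")
                    text = v.text if v is not None else None
                    if cell.get("t") == "s" and text is not None:
                        text = shared[int(text)]  # index into the shared strings
                    values.append(text)
                yield values
                elem.clear()  # release the cells of the finished row
```

Each yielded row can then be handled like a list of ordinary strings inside a for loop.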
Problem statement:
I have a directory of gzip files, and each gzip file contains a text file.
My code unzips all the gzip files, reads each unzipped text file, and combines the output into one text file. It then applies a condition and, if the condition is met, writes the result to Excel.
This process is tedious and lengthy.
Can anyone please help me write code that reads the data directly from the gzipped text files and writes the contents to Excel?
IIUC, you can use pandas, first reading the file with read_csv:
import pandas as pd
df = pd.read_csv('yourfile.gzip', compression='gzip')
Then apply your conditions to df and write the dataframe to Excel using to_excel:
df.to_excel('file.xlsx')
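If pandas is not an option, the gzipped text files can also be read in place with the standard library, without unzipping them to disk first. Here is a minimal sketch, assuming the archives end in .gz and that keep is a caller-supplied condition. The function name and the CSV output (which Excel can open) are illustrative choices, not the original poster's code.

```python
import csv
import gzip
from pathlib import Path


def filter_gzipped_lines(directory, keep, out_csv):
    """Copy lines for which keep(line) is true from *.gz files into out_csv."""
    count = 0
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        for gz_path in sorted(Path(directory).glob("*.gz")):
            # Mode "rt" decompresses on the fly and yields text lines
            with gzip.open(gz_path, "rt") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if keep(line):
                        writer.writerow([gz_path.name, line])
                        count += 1
    return count  # number of lines written
```

For example, filter_gzipped_lines('logs', lambda s: 'ERROR' in s, 'errors.csv') would collect every line containing ERROR, together with the name of the archive it came from.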
I have many Python scripts that output CSV files. It is occasionally convenient to open these files in Excel. After installing OS X Mavericks, Excel no longer opens these files properly: Excel doesn't parse the files and it duplicates the rows of the file until it runs out of memory. Specifically, when Excel attempts to open the file, a prompt appears that reads: "File not loaded completely."
Example of code I'm using to generate the CSV files:
import csv
with open('csv_test.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow([1,2,3])
    writer.writerow([4,5,6])
Even the simple file generated by the above code fails to load in Excel. However, if I open the CSV file in a text editor and copy/paste the text into Excel, parse it with text to columns, and then save as CSV from Excel, then I can reopen the CSV file in Excel without issue. Do I need to pass an additional parameter in my scripts to make Excel parse the CSV files the same way it used to? Or is there some setting I can change in OS X Mavericks or Excel? Thanks.
I may have had a similar problem: the error message "SYLK: File format is not valid" appeared when opening a CSV file generated by Python. The solution is really funny. The first two characters of the file must not be an uppercase I and D (ID). Also see "SYLK: File format is not valid" error message when you open file.
Possible solution 1: use *.txt instead of *.csv. In this case Excel (at least 2010) will show you an import data wizard where you can specify delimiters, character encoding, field types, etc.
UPD: Solution 2:
The python "csv" module has a "dialect" feature. For example, the following modification of your code generates valid csv file for my environment (Python 2.7, Excel 2010, Windows7, locale with ";" list delimiters):
import csv
with open('csv_test2.csv', 'wb') as f:
    csv.excel.delimiter=';'
    writer = csv.writer(f, dialect=csv.excel)
    writer.writerow([1,2,3])
    writer.writerow([4,5,6])
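The snippets above are Python 2 code; in Python 3 the csv module expects a text-mode file opened with newline=''. Here is a minimal Python 3 sketch of the same idea, passing the delimiter to csv.writer instead of changing csv.excel globally. The function name write_rows is made up for illustration, and whether ';' is the right delimiter depends on the locale Excel runs under.

```python
import csv


def write_rows(path, rows, delimiter=";"):
    """Write rows to path as CSV using the excel dialect with a custom delimiter."""
    # newline="" lets the csv module control the line endings itself
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, dialect="excel", delimiter=delimiter)
        writer.writerows(rows)
```

Calling write_rows('csv_test2.csv', [[1, 2, 3], [4, 5, 6]]) writes the same two rows as the example above.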