I use the following function to write a pandas DataFrame to Excel:
def write_dataset(
    train: pd.DataFrame,
    forecast: pd.DataFrame,
    config,
    out_path: str,
) -> None:
    forecast = forecast.rename(
        columns={
            "col": "col_predicted",
        }
    )
    df = pd.concat([train, forecast])
    df.drop(["id"], axis=1, inplace=True)
    if config.join_meta:
        df.drop(
            ["some_col", "some_other_col"],
            axis=1,
            inplace=True,
        )
    df.sort_values(config.id_columns, inplace=True)
    df.rename(columns={"date": "month"}, inplace=True)
    df["a_col"] = df["a_col"].round(2)
    df.to_excel(out_path, index=False)
Just before the df.to_excel() call, the DataFrame looks completely normal, just containing some NaNs. But the file it writes is a 0-byte file, which I can't even open with Excel. I use this function for 6 different DataFrames, and somehow it works for some and not for others. Also, on my colleague's computer it always works fine.
I'm using Python 3.10.4, pandas 1.4.2 and openpyxl 3.0.9.
Any ideas what is happening and how to fix that behavior?
I encountered this issue on my Mac, and was similarly stumped for a while. Then I realized that the file appears as 0 bytes once the code has begun to create the file but hasn't yet finished.
So in my case, I found that all I had to do was wait a long time, and eventually (more than 5-10 minutes) the file jumped from 0 bytes to its full size. My file was about 14 MB, so it shouldn't have required that much time. My guess is that this is an issue related to how the OS handles scheduling and permissions among various processes and memory locations, which would explain why some DataFrames work fine and others don't.
(So it might be worth double checking that you don't have other processes that might be trying to claim write access of the write destination. I've seen programs like automatic backup services claim access to folders and cause conflicts along these lines.)
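If another process is grabbing the file while it is still being written, one workaround (not from the answer above; the helper name is hypothetical) is to write to a temporary file in the same directory and only rename it into place once to_excel() has returned:

import os
import tempfile
import pandas as pd

def write_excel_atomically(df: pd.DataFrame, out_path: str) -> None:
    out_dir = os.path.dirname(os.path.abspath(out_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".xlsx", dir=out_dir)
    os.close(fd)                             # let pandas/openpyxl open the path itself
    try:
        df.to_excel(tmp_path, index=False)   # the slow, interruptible part
        os.replace(tmp_path, out_path)       # atomic rename on the same filesystem
    except BaseException:
        os.remove(tmp_path)
        raise

Other processes then only ever see either the old file or the fully written one, never a half-written workbook.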
Background:
I am creating a program that will need to keep track of what it has run, when, what it has sent, and to whom. The logging module in Python doesn't appear to accomplish what I need, but I'm still pretty new to this, so I may be wrong. Alternative solutions that accomplish the same end are also welcome.
The program will need to take in a data file (preferably .xlsx or .csv) which will be formatted as something like this (the Nones will need to be filled in by the program):
Run_ID   Date_Requested   Time_Requested   Requestor          Date_Completed   Time_Completed
R_423h   9/8/2022         1806             email#email.com    None             None
The program will then need to compare the Run_IDs from the log to the new Run_IDs provided in a similar format to the table above (in a .csv), i.e.:
ResponseId
R_jals893
R_hejl8234
I can compare the IDs myself, but the issue then becomes that the program will need to update the log with the new IDs it has run, along with the times they were run, the emails, and so on, and then resave the log file. I'm sure this is easy, but it's throwing me for a loop.
My code:
log = pd.read_excel('run_log.xlsx', usecols=None, parse_dates=True)
new_run_requests = pd.read_csv('Run+Request+Sheet_September+6,+2022_14.28.csv', parse_dates=True)
old_runs = log.Run_ID[:]
new_runs = new_run_requests.ResponseId[:]
log['Run_ID'] = pd.concat([old_runs, new_runs], ignore_index=True)
After this, the DataFrame does not change.
This is one of the 2 or 3 things I have tried. Suggestions are appreciated!
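A minimal sketch of one way to do the update, assuming the column names shown in the tables above; assigning a longer Series to a single column, as in the code above, leaves the frame unchanged because the extra values have no rows to land in, so the new IDs need to be appended as whole rows instead (which of the other columns get filled in, and how, is left out here):

import pandas as pd

log = pd.read_excel('run_log.xlsx')
new_run_requests = pd.read_csv('Run+Request+Sheet_September+6,+2022_14.28.csv')

# keep only ResponseIds that are not already in the log, renamed to match the log's column
new_rows = new_run_requests.rename(columns={'ResponseId': 'Run_ID'})[['Run_ID']]
new_rows = new_rows[~new_rows['Run_ID'].isin(log['Run_ID'])]

# append whole rows; the columns not filled in here come through as NaN and can be set afterwards
log = pd.concat([log, new_rows], ignore_index=True)
log.to_excel('run_log.xlsx', index=False)   # resave the log file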
We have a big file [file_name.tar.gz], big in the sense that our machine cannot handle it in one go. It contains three types of files inside it, let us say [first_file.unl, second_file.unl, third_file.unl].
Background about the .unl extension: pd.read_csv is able to read these files successfully without giving any kind of error.
I am trying the steps below in order to accomplish the task.
Step 1:
all_files = glob.glob(path + "/*.gz")
The step above is able to list all three types of files. Now I use the code below to process them further.
Step 2:
li = []
for filename in all_files:
    df_a = pd.read_csv(filename, index_col=False, header=0, names=header_name,
                       low_memory=False, sep="|")
    li.append(df_a)
Step 3:
frame = pd.concat(li, axis=0, ignore_index= True)
All three steps work perfectly if:
we have a small amount of data that fits in our machine's memory
we have only one type of file inside the archive
How do we overcome this problem? Please help.
We are expecting code that has the ability to read a particular file type from the archive in chunks and create a DataFrame from it.
Also, please advise whether, apart from the pandas library, there is any other approach or library that could handle this more efficiently, considering that our data resides on a Linux server.
You can refer to this link:
How do I read a large csv file with pandas?
In general, you can try working in chunks.
For better performance, I suggest using Dask or PySpark.
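For instance, a minimal Dask sketch, assuming the .unl files have already been extracted from the archive into a directory and are pipe-delimited with the same header_name list as in the question (the path is an assumption):

from dask import dataframe as dd

# builds a lazy dataframe over every extracted .unl file; data is only loaded
# partition by partition when a result is actually computed
ddf = dd.read_csv("extracted/*.unl", sep="|", header=0, names=header_name)
n_rows = len(ddf)  # triggers a chunked computation instead of loading everything at once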
Use tarfile's open, next, and extractfile to get the entries, where extractfile returns a file object with which you can read that entry. You can provide that object to read_csv.
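A minimal sketch of that approach, reading each archive member in chunks so nothing has to fit in memory at once (header_name and the chunk size are assumptions carried over from the question):

import tarfile
import pandas as pd

li = []
with tarfile.open("file_name.tar.gz", "r:gz") as tar:
    for member in tar:                      # iterates the entries without extracting to disk
        if not member.name.endswith(".unl"):
            continue
        fileobj = tar.extractfile(member)   # file-like object for this entry
        if fileobj is None:                 # directories and special entries
            continue
        for chunk in pd.read_csv(fileobj, sep="|", header=0, names=header_name,
                                 chunksize=100_000):
            li.append(chunk)                # or process/aggregate each chunk here instead

frame = pd.concat(li, ignore_index=True)    # only do this if the combined result fits in memory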
I inherited a project using Dask Dataframe to create a dataframe.
from dask import dataframe as dd
# leaving out param values for brevity
df = dd.read_csv(
    's3://some-bucket/*.csv.gz',
    sep=delimiter,
    header=header,
    names=partition_column_names,
    compression=table_compression,
    encoding='utf-8',
    error_bad_lines=False,
    warn_bad_lines=True,
    parse_dates=date_columns,
    dtype=column_dtype,
    blocksize=None,
)
df_len = len(df)
# more stuff
I take that DataFrame, process it, and turn it into Parquet.
The process works fine, but occasionally (I still haven't identified the pattern) it just hangs on the len(df). No errors, no exiting, nothing.
Is there any concept in Dask DataFrames of a timeout on a DataFrame operation? Perhaps an option to turn on debugging to get better insight into what is happening?
The diagnostics dashboard provides the most information here. https://docs.dask.org/en/latest/diagnostics-distributed.html has the richest information, but the local schedulers provide some information too (https://docs.dask.org/en/latest/diagnostics-local.html).
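A small sketch of how that might be wired up, assuming it is acceptable to run against a local distributed scheduler (the dashboard URL shown is the default one):

from dask.distributed import Client

client = Client()                 # local cluster; switches work onto the distributed scheduler
print(client.dashboard_link)      # usually http://127.0.0.1:8787/status

df_len = len(df)                  # watch the task stream while this runs to see where it stalls

If you stay on the plain local schedulers instead, wrapping the call in dask.diagnostics.ProgressBar at least gives a progress readout while len(df) runs.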
Due to some limitations of the consumer of my data, I need to "rewrite" some parquet files to convert timestamps that are in nanosecond precision to timestamps that are in millisecond precision.
I have implemented this and it works but I am not completely satisfied with it.
import pandas as pd

df = pd.read_parquet(f's3://{bucket}/{key}', engine='pyarrow')

for col_name in df.columns:
    if df[col_name].dtype == 'datetime64[ns]':
        df[col_name] = df[col_name].values.astype('datetime64[ms]')

df.to_parquet(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
              engine='pyarrow', index=False)
I'm currently running this job in Lambda for each file, but I can see this may be expensive, and it may not always work if the job takes longer than 15 minutes, as that is the maximum time Lambdas can run.
The files can be on the larger side (>500 MB).
Any ideas or other methods I could consider? I am unable to use pyspark as my dataset has unsigned integers in it.
You could try rewriting all columns at once. Maybe this would reduce some memory copies in pandas, thus speeding up the process if you have many columns:
df_datetimes = df.select_dtypes(include="datetime64[ns]")
df[df_datetimes.columns] = df_datetimes.astype("datetime64[ms]")
Add use_deprecated_int96_timestamps=True to df.to_parquet() when you first write the file, and it will save as a nanosecond timestamp. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
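For reference, a sketch of what that keyword looks like on the write from the question above; pandas forwards extra keyword arguments to the pyarrow writer (the output path is the same placeholder as in the question):

df.to_parquet(
    f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
    engine='pyarrow',
    index=False,
    use_deprecated_int96_timestamps=True,  # store timestamps as INT96, which carries nanoseconds
)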
I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I am interested in using IPython and tried to import the .dta file using pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file results in all the RAM being used (after ~30 minutes) and my computer stalling out. This doesn't 'feel' right, in that I am able to open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:
%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')
and I am using IPython in Enthought's Canopy program. The reason for the '%time' is because I am interested in benchmarking this against R's read.dta().
My questions are:
Is there something I am doing wrong that is resulting in Pandas having issues?
Is there a workaround to get the data into a Pandas dataframe?
Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:
def load_large_dta(fname):
    import sys

    reader = pd.read_stata(fname, iterator=True)
    chunks = []
    try:
        chunk = reader.get_chunk(100 * 1000)
        while len(chunk) > 0:
            chunks.append(chunk)
            chunk = reader.get_chunk(100 * 1000)
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass

    df = pd.concat(chunks, ignore_index=True)
    print('\nloaded {} rows'.format(len(df)))
    return df
I loaded an 11 GB Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.
This notebook shows it in action.
For all the people who end up on this page, please upgrade pandas to the latest version. I had this exact problem with a stalled computer during load (a 300 MB Stata file but only 8 GB of system RAM), and upgrading from v0.14 to v0.16.2 solved the issue in a snap.
Currently, it's v0.16.2. There have been significant improvements in speed, though I don't know the specifics. See: most efficient I/O setup between Stata and Python (Pandas)
There is a simpler way to solve it using pandas' built-in function read_stata.
Assume your large file is named large.dta.
import pandas as pd

reader = pd.read_stata("large.dta", chunksize=100000)
chunks = []
for itm in reader:
    chunks.append(itm)
df = pd.concat(chunks, ignore_index=True)
df.to_csv("large.csv")
Question 1.
There's not much I can say about this.
Question 2.
Consider exporting your .dta file to .csv using the Stata command outsheet or export delimited and then using read_csv() in pandas. In fact, you could take the newly created .csv file, use it as input for R, and compare with pandas (if that's of interest). read_csv is likely to have had more testing than read_stata.
Run help outsheet for details of the exporting.
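For instance, a small sketch of the pandas side of that comparison, assuming the export was written next to the original .dta file (the filename is an assumption):

# exported from Stata first, e.g.: export delimited using "my_dta_file.csv", replace
%time df_csv = pd.read_csv(data_dir + 'my_dta_file.csv')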
You should not be reading a 3 GB+ file into an in-memory data object; that's a recipe for disaster (and has nothing to do with pandas).
The right way to do this is to mem-map the file and access the data as needed.
You should consider converting your file to a more appropriate format (csv or hdf) and then you can use the Dask wrapper around pandas DataFrame for chunk-loading the data as needed:
from dask import dataframe as dd
# If you don't want to use all the columns, make a selection
columns = ['column1', 'column2']
data = dd.read_csv('your_file.csv', usecols=columns)
This will transparently take care of chunk-loading, multicore data handling and all that stuff.