My dataset looks at flight delays and cancellations from 2009 to 2018. Here are the important points to consider:
Each year is its own CSV file, so '2009.csv', '2010.csv', all the way to '2018.csv'.
Each file is roughly 700 MB.
I used the following to combine the CSV files:
import pandas as pd
import numpy as np
import os, sys
import glob
os.chdir('c:\\folder')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_airline_csv.to_csv('combined_airline_csv.csv', index=False, encoding='utf-8-sig')
When I run this, I receive the following message:
MemoryError: Unable to allocate 43.3 MiB for an array with shape (5674621,) and data type float64.
I am presuming that my file is too large and that I will need to run this on a virtual machine (e.g. AWS).
Any thoughts?
Thank you!
This is a duplicate of how to merge 200 csv files in Python.
Since you just want to combine them into one file, there is no need to load all the data into a DataFrame at the same time. Since they all have the same structure, I would advise creating one file writer, then opening each file with a file reader and writing (if we want to be fancy, let's call it streaming) the data line by line. Just be careful not to copy the headers each time, since you only want them once. Pandas is simply not the best tool for this task :)
In general, this is a typical task that can also be done easily, and even faster, directly on the command line (the exact command depends on the OS).
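For illustration, here is a minimal Python sketch of the line-by-line streaming approach described above; the output file name and working directory are assumptions.
import glob

# Collect the inputs before creating the output so the output file
# is not swept up by the glob pattern itself.
sources = sorted(glob.glob('*.csv'))

with open('combined_airline.csv', 'w', encoding='utf-8') as out:
    header_written = False
    for path in sources:
        with open(path, 'r', encoding='utf-8') as src:
            header = src.readline()      # first line of every file is the header
            if not header_written:
                out.write(header)        # keep the header from the first file only
                header_written = True
            for line in src:
                out.write(line)          # stream the remaining lines unchanged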
I have a data frame which I read in from a locally saved CSV file.
I then want to loop over said file and create several CSV files based on a string in one column.
Lastly, I want to add all those files to a zip file, but without saving them locally. I just want one zip archive including all the different CSV files.
All my attempts using the io or zipfile modules only resulted in one zip file with one CSV file in it (pretty much what I started with).
Any help would be much appreciated!
Here is my code so far, which works but saves all CSV files just to my hard drive.
import pandas as pd
from zipfile import ZipFile
df = pd.read_csv("myCSV.csv")
channelsList = df["Turn one column to list"].values.tolist()
channelsList = list(set(channelsList)) #delete duplicates from list
for channel in channelsList:
    newDf = df.loc[df['Something to match'] == channel]
    newDf.to_csv(f"{channel}.csv")  # saves csv files to disk
DataFrame.to_csv() can write to any file-like object, and ZipFile.writestr() can accept a string (or bytes), so it is possible to avoid writing the CSV files to disk using io.StringIO. See the example code below.
Note: If the channel is simply stored in a single column of your input data, then the more idiomatic (and more efficient) way to iterate over the partitions of your data is to use groupby().
from io import StringIO
from zipfile import ZipFile
import numpy as np
import pandas as pd
# Example data
df = pd.DataFrame(np.random.random((100,3)), columns=[*'xyz'])
df['channel'] = np.random.randint(5, size=len(df))
with ZipFile('/tmp/output.zip', 'w') as zf:
    for channel, channel_df in df.groupby('channel'):
        s = StringIO()
        channel_df.to_csv(s, index=False, header=True)
        zf.writestr(f"{channel}.csv", s.getvalue())
We have a big file_name.tar.gz file, big in the sense that our machine cannot handle it in one go. It contains three types of files, let us say first_file.unl, second_file.unl, third_file.unl.
Background about the .unl extension: pd.read_csv is able to read these files successfully without giving any kind of errors.
I am trying the below steps in order to accomplish the task.
step 1:
all_files = glob.glob(path + "/*.gz")
The above step is able to list all three types of files. Now I use the below code to process further.
step 2:
li = []
for filename in all_files:
    df_a = pd.read_csv(filename, index_col=False, header=0, names=header_name,
                       low_memory=False, sep="|")
    li.append(df_a)
step 3:
frame = pd.concat(li, axis=0, ignore_index=True)
All three steps work perfectly if:
we have small data that fits in our machine's memory
we have only one type of file inside the archive
How do we overcome this problem? Please help.
We are expecting code that can read a particular file type in chunks and create a data frame for it.
Also, please advise: apart from the pandas library, is there any other approach or library that could handle this more efficiently, considering that our data resides on a Linux server?
You can refer to this link:
How do I read a large csv file with pandas?
In general, you can try working with chunks.
For better performance, I suggest using Dask or PySpark.
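For illustration, a minimal sketch of the chunked approach with pandas; the file name, separator, chunk size, and the per-chunk reduction are assumptions you would adapt to your data. A lazy Dask alternative is noted in the trailing comment.
import pandas as pd

pieces = []
# Read the large file in chunks of 100,000 rows instead of all at once.
for chunk in pd.read_csv('first_file.unl', sep='|', header=0, chunksize=100_000):
    # Reduce each chunk here (select columns, filter rows, or aggregate) so the
    # full raw file never has to sit in memory at once.
    pieces.append(chunk.iloc[:, :5])   # placeholder reduction: keep the first five columns

frame = pd.concat(pieces, ignore_index=True)

# Dask alternative (stays lazy and out-of-core until you call .compute()):
# import dask.dataframe as dd
# frame = dd.read_csv('first_file.unl', sep='|')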
Use tarfile's open, next, and extractfile to get the entries, where extractfile returns a file object with which you can read that entry. You can provide that object to read_csv.
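A minimal sketch of that tarfile approach, reading each .unl member straight from the archive without extracting it to disk; the archive name, separator, and header handling are assumptions.
import tarfile
import pandas as pd

frames = []
with tarfile.open('file_name.tar.gz', 'r:gz') as tar:
    for member in tar:                          # iterate over the archive's entries
        if member.isfile() and member.name.endswith('.unl'):
            fileobj = tar.extractfile(member)   # file-like object for this entry
            if fileobj is not None:
                frames.append(pd.read_csv(fileobj, sep='|', header=0))

frame = pd.concat(frames, axis=0, ignore_index=True)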
I have a folder that has, say, a few hundred files and is growing every hour. I am trying to consolidate all the data into a single file for analysis use. But the script I wrote is not too effective for processing these data, as it reads all the content in the folder and appends it to an xlsx file. The processing time is simply too long.
What I am seeking is to enhance and improve my script:
1) To be able to only read and extract data from new files that have not been previously read
2) To extract and append these data to update the xlsx file.
I just need some enlightenment to help me improve the script.
Part of my code is as follows:
import pandas as pd
import numpy as np
import os
import dask.dataframe as dd
import glob
import schedule
import time
import re
import datetime as dt
def job():
    # Select the path to download the files
    path = r'V:\DB\ABCD\BEFORE\8_INCHES'
    files = glob.glob(path + "/*.csv")
    df = None
    # Extracting of information from files
    for i, file in enumerate(files):
        if i == 0:
            df = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
            df['Path'] = file
            df['Machine No'] = re.findall("MC-11", str(df["Path"]))
            df['Process'] = re.findall("ABCD", str(df["Path"]))
            df['Before/After'] = re.findall("BEFORE", str(df["Path"]))
            df['Wafer Size'] = re.findall("8_INCHES", str(df["Path"]))
            df['Employee ID'] = df["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
            df['Date'] = df["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
            df['Lot Number'] = df["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
            df['Part Number'] = df["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
            df["Part Number"].fillna("ENGINNERING SAMPLE", inplace=True)
        else:
            tmp = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
            tmp['Path'] = file
            tmp['Machine No'] = tmp["Path"].str.extract(r'(\D{3}\d{2})', expand=False)
            tmp['Process'] = tmp["Path"].str.extract(r'(\w{8})', expand=False)
            tmp['Before/After'] = tmp["Path"].str.extract(r'([B][E][F][O][R][E])', expand=False)
            tmp['Wafer Size'] = tmp["Path"].str.extract(r'(\d\_\D{6})', expand=False)
            tmp['Employee ID'] = tmp["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
            tmp['Date'] = tmp["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
            tmp['Lot Number'] = tmp["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
            tmp['Part Number'] = tmp["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
            tmp["Part Number"].fillna("ENGINNERING SAMPLE", inplace=True)
            df = df.append(tmp)
    export_excel = df.to_excel(r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx')

# schedule to run every hour
schedule.every(1).hour.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
In general terms you'll want to do the following:
Read in the xlsx file at the start of your script.
Extract a set with all the filenames already parsed (the Path attribute).
For each file you iterate over, check if it is contained within the set of already processed files.
This assumes that existing files don't have their content updated. If that could happen, you may want to track metrics like last change date (a checksum would be most reliable, but probably too expensive to compute).
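A minimal sketch of that idea, assuming the consolidated workbook keeps each row's source file in a 'Path' column; the paths are taken from the question and the per-file enrichment is left as a placeholder.
import glob
import os
import pandas as pd

master_path = r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx'
source_glob = r'V:\DB\ABCD\BEFORE\8_INCHES\*.csv'

# Load what was consolidated previously and remember which files it came from.
if os.path.isfile(master_path):
    master = pd.read_excel(master_path)
    processed = set(master['Path'])
else:
    master = pd.DataFrame()
    processed = set()

new_frames = []
for file in glob.glob(source_glob):
    if file in processed:
        continue                          # already parsed on an earlier run
    tmp = pd.read_csv(file, delimiter='|', index_col=False)
    tmp['Path'] = file                    # apply the rest of your per-file enrichment here
    new_frames.append(tmp)

if new_frames:
    master = pd.concat([master, *new_frames], ignore_index=True)
    master.to_excel(master_path, index=False)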
I am trying to write a pandas dataframe to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
the write syntax is
df.to_parquet(path, mode='append')
the read syntax is
pd.read_parquet(path)
It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't have this implementation.
Below is from pandas doc:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine : {'auto', 'pyarrow', 'fastparquet'}
**kwargs : additional arguments passed to the parquet library.
The **kwargs argument we need to pass here is append=True (from fastparquet).
import pandas as pd
import os.path
file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2,], 'col2': [3, 4]})
if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, then you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times adds the data from each run to the parquet file.
If I inspect the metadata, I can see that this resulted in 3 row groups.
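For reference, one way to check the row-group count yourself, assuming fastparquet is installed:
from fastparquet import ParquetFile

pf = ParquetFile(file_path)
print(len(pf.row_groups))   # 3 after running the script three times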
Note:
Append could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression will work better, since compression operates within a row group only, and there will be less overhead spent on storing statistics, since each row group stores its own statistics.
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"
# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)
# Write direct to your parquet file
pq.write_to_dataset(table, root_path=output)
This will automatically append to your table.
I used the AWS Data Wrangler library. It works like a charm.
Below are the reference docs:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to S3. I have not included the JSON processing logic, as this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler
import awswrangler as wr
import pandas as pd
evet_data = pd.DataFrame({'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
                         columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
#print(evet_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=evet_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as e:
    print(str(e))
I am ready to help with any clarifications. In a few other posts I have read the suggestion to read the data and overwrite it again, but as the data gets larger that will slow down the process; it is inefficient.
There is no append mode in DataFrame.to_parquet(). What you can do instead is read the existing file, change it, and write it back, overwriting it.
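A minimal sketch of that read-modify-overwrite workaround; the file name and the new rows are placeholders.
import os
import pandas as pd

path = 'data.parquet'
new_rows = pd.DataFrame({'col1': [5], 'col2': [6]})

# Read what is already there (if anything), add the new rows, and rewrite the file.
if os.path.isfile(path):
    combined = pd.concat([pd.read_parquet(path), new_rows], ignore_index=True)
else:
    combined = new_rows

combined.to_parquet(path, index=False)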
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
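A small usage sketch under that assumption: create the file on the first call, then append on later calls; the file name and data are placeholders.
import os.path
import pandas as pd
from fastparquet import write

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
file_name = 'data.parquet'

if not os.path.isfile(file_name):
    write(file_name, df)                 # first call creates the file
else:
    write(file_name, df, append=True)    # later calls add new row groups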
Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it is already there. To append to a parquet object, just add a new file to the same parquet directory.
import datetime
import os
import pandas as pd

os.makedirs(path, exist_ok=True)
# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))
# read
pd.read_parquet(path)
I'm quite stuck with some code I'm writing in Python. I'm a beginner and maybe it is really easy, but I just can't see it. Any help would be appreciated, so thank you in advance :)
Here is the problem: I have to read some special data files with the special extension .fen into a pandas DataFrame. These .fen files are inside a zipped .fenx file that contains the .fen file and a .cfg configuration file.
In the code I've written I use the zipfile library in order to unzip the files, and then get them into the DataFrame. This code is the following:
import zipfile
import numpy as np
import pandas as pd
def readfenxfile(Directory, File):
    fenxzip = zipfile.ZipFile(Directory + '\\' + File, 'r')
    fenxzip.extractall()
    fenxzip.close()
    cfgGeneral, cfgDevice, cfgChannels, cfgDtypes = readCfgFile(Directory, File[:-5] + '.CFG')
    # readCfgFile reads the .cfg file and returns some important data.
    # Here only cfgDtypes matters, as it contains the type of data inside the .fen,
    # which becomes the column index of the final DataFrame.
    if cfgChannels != None:
        dtDtype = eval('np.dtype([' + cfgDtypes + '])')
        dt = np.fromfile(Directory + '\\' + File[:-5] + '.fen', dtype=dtDtype)
        dt = pd.DataFrame(dt)
    else:
        dt = []
    return dt, cfgChannels, cfgDtypes
Now, the extractall() method saves the unzipped files to the hard drive. The .fenx files can be quite big, so this need to store them (and afterwards delete them) is really slow. I would like to do the same as I do now, but get the .fen and .cfg files into memory, not onto the hard drive.
I have tried things like fenxzip.read('whateverthenameofthefileis.fen') and some other methods like .open() from the zipfile library. But I can't get what .read() returns into a numpy array in any way I tried.
I know it can be a difficult question to answer, because you don't have the files to try and see what happens. But if someone has any ideas I would be glad to read them. :) Thank you very much!
Here is the solution I finally found, in case it can be helpful for anyone. It uses the tempfile library to create a temporary file object in memory.
import zipfile
import tempfile
import numpy as np
import pandas as pd
def readfenxfile(Directory, File, ExtractDirectory):
    fenxzip = zipfile.ZipFile(Directory + r'\\' + File, 'r')
    fenfile = tempfile.SpooledTemporaryFile(max_size=10000000000, mode='w+b')
    fenfile.write(fenxzip.read(File[:-5] + '.fen'))
    cfgGeneral, cfgDevice, cfgChannels, cfgDtypes = readCfgFile(fenxzip, File[:-5] + '.CFG')
    if cfgChannels != None:
        dtDtype = eval('np.dtype([' + cfgDtypes + '])')
        fenfile.seek(0)
        dt = np.fromfile(fenfile, dtype=dtDtype)
        dt = pd.DataFrame(dt)
    else:
        dt = []
    fenfile.close()
    fenxzip.close()
    return dt, cfgChannels, cfgDtypes