Python Pandas Import Large Files

I use the following code to load various log files and then process them later in the code. It works perfectly unless the log file is large. I know there is no other issue, because if I manually split the large log file into smaller files, the code works again without any problems.
So what's the best way forward? Is it better to write some code before pandas to split the large files? Or can pandas import large files (in chunks?) while still letting me process them?
code:
import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        try:
            # read each line as a single column, then split it into email/data
            df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
            df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
        except Exception:
            # the snippet is truncated here in the question; handle or skip unreadable files
            pass
error:
MemoryError: Unable to allocate 75.8 MiB for an array with shape (4968252,) and data
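pandas can import the file in chunks: passing chunksize to read_csv returns an iterator of smaller DataFrames instead of one large one. A minimal sketch reusing the read_csv call from the question (the chunk size of 100,000 lines is an assumption):
import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        # chunksize makes read_csv return an iterator of DataFrames instead of one frame
        reader = pd.read_csv(os.path.join(root, file), sep='\n', header=None,
                             engine='python', quoting=3, chunksize=100_000)
        for chunk in reader:
            part = (chunk[0].str.strip(' \t"')
                            .str.split('[,|;: \t]+', n=1, expand=True)
                            .rename(columns={0: 'email', 1: 'data'}))
            # process `part` here (or append it to an on-disk output) before the next chunk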

Related

Is there a limit for Pandas Apply for many files?

I have 10,000 CSV datasets for my project. I need to preprocess all of them, but pandas only seems to process five per run. Here is my code for opening all the files.
import os
import pandas as pd

path = "./for_processing"
# note: `files` is only used to drive the loop; the filenames below are built from the counter
files = [file for file in os.listdir(path) if not file.startswith('.')]
index_cleaned_news = 0
for file in files:
    NEWS_DATA = pd.read_csv(path + "/kompas/clean_news" + str(index_cleaned_news) + ".csv")
    # Process here
    NEWS_DATA.to_csv(path + "/process_kompas/preprocess_news" + str(index_cleaned_news) + ".csv")
    index_cleaned_news = index_cleaned_news + 1
Here is a screenshot from when I try to run the code.
Does anybody know how to fix this, or is it a limitation of pandas? Thank you
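pandas itself has no cap on how many files a loop can process, so this usually comes down to memory. A minimal sketch, assuming the same folder layout and file naming as the snippet above: it keeps only one DataFrame alive per iteration and frees it before the next read.
import gc
import os
import pandas as pd

path = "./for_processing"

# handle one file per iteration so only a single DataFrame is in memory at a time
for index_cleaned_news in range(10_000):
    in_file = path + "/kompas/clean_news" + str(index_cleaned_news) + ".csv"
    out_file = path + "/process_kompas/preprocess_news" + str(index_cleaned_news) + ".csv"
    if not os.path.exists(in_file):
        continue
    news_data = pd.read_csv(in_file)
    # ... preprocessing goes here ...
    news_data.to_csv(out_file, index=False)
    del news_data      # drop the reference so its memory can be reclaimed
    gc.collect()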

handling zip files using python library pandas

We have a big [file_name.tar.gz] file, big in the sense that our machine cannot handle it in one go. It contains three types of files, let us say [first_file.unl, second_file.unl, third_file.unl].
Background on the .unl extension: pd.read_csv is able to read these files successfully without any errors.
I am trying the steps below to accomplish the task.
step 1:
all_files = glob.glob(path + "/*.gz")
The above step is able to list all three types of files. Now I'm using the code below to process them further.
step 2:
li = []
for filename in all_files:
    df_a = pd.read_csv(filename, index_col=False, header=0, names=header_name,
                       low_memory=False, sep="|")
    li.append(df_a)
step 3:
frame = pd.concat(li, axis=0, ignore_index=True)
All three steps work perfectly if:
we have a small amount of data that fits in our machine's memory
we have only one type of file inside the archive
How do we overcome this problem? Please help.
We are expecting code that can read a file in chunks for a particular file type and create a data frame for it.
Also, please advise: apart from the pandas library, is there any other approach or library that could handle this more efficiently, considering that our data resides on a Linux server?
You can refer to this link:
How do I read a large csv file with pandas?
In general, you can try working in chunks.
For better performance, I suggest using Dask or PySpark.
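A minimal Dask sketch of that idea. It is lazy and partitioned, so nothing is loaded all at once; the directory, glob pattern and column names below are placeholders, assuming the .unl files have already been extracted from the archive.
import dask.dataframe as dd

path = "/data/extracted_unl"        # hypothetical directory holding the extracted .unl files
header_name = ["col_a", "col_b"]    # placeholder; use the real column list from the question

# Dask reads the files lazily, one partition at a time
ddf = dd.read_csv(path + "/*.unl", sep="|", header=0, names=header_name)

# work only happens when you call .compute() or write the result out
print(ddf.head())
ddf.to_csv(path + "/combined-*.csv", index=False)   # one output file per partition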
Use tarfile's open, next, and extractfile to get the entries, where extractfile returns a file object with which you can read that entry. You can provide that object to read_csv.
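A rough sketch of that approach, combined with chunked reading so no single member has to fit in memory. The chunk size and output naming are assumptions, and header_name is the column list from the question.
import os
import tarfile
import pandas as pd

header_name = ["col_a", "col_b"]    # placeholder; use the real column list from the question

with tarfile.open("file_name.tar.gz", "r:gz") as tar:
    member = tar.next()
    while member is not None:
        if member.isfile() and member.name.endswith(".unl"):
            out_name = os.path.basename(member.name) + ".csv"   # one output per archive member
            with tar.extractfile(member) as fh:
                # extractfile returns a file-like object that read_csv accepts directly
                for i, chunk in enumerate(pd.read_csv(fh, sep="|", header=0,
                                                      names=header_name, index_col=False,
                                                      chunksize=100_000)):
                    # append each chunk to the output, writing the header only once
                    chunk.to_csv(out_name, mode="w" if i == 0 else "a",
                                 header=(i == 0), index=False)
        member = tar.next()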

Difficulty combining csv files into a single file

My dataset looks at flight delays and cancellations from 2009 to 2018. Here are the important points to consider:
Each year is its own csv file so '2009.csv', '2010.csv', all the way to '2018.csv'
Each file is roughly 700 MB.
I used the following to combine the csv files:
import pandas as pd
import numpy as np
import os, sys
import glob

os.chdir('c:\\folder')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_airline_csv.to_csv('combined_airline_csv.csv', index=False, encoding='utf-8-sig')
When I run this, I receive the following message:
MemoryError: Unable to allocate 43.3MiB for an array with shape(5674621, ) and data type float64.
I am presuming that my file is too large and that I will need to run this on a virtual machine (e.g. AWS).
Any thoughts?
Thank you!
This is a duplicate of how to merge 200 csv files in Python.
Since you just want to combine them into one file, there is no need to load all of the data into a dataframe at the same time. Since they all have the same structure, I would advise creating one file writer, then opening each file with a file reader and writing (if we want to be fancy, let's call it streaming) the data line by line. Just be careful not to copy the headers each time, since you only want them once. Pandas is simply not the best tool for this task :)
In general, this is a typical task that can also be done easily, and even faster, directly on the command line (the exact command depends on the OS).
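A minimal sketch of that streaming approach (the output filename is an assumption): only one line is held in memory at a time, and the header is copied from the first file only.
import glob

csv_files = sorted(glob.glob('*.csv'))      # the yearly files, e.g. 2009.csv ... 2018.csv

with open('combined_airline.csv', 'w', encoding='utf-8-sig') as out:
    for i, filename in enumerate(csv_files):
        with open(filename, 'r', encoding='utf-8') as src:
            header = src.readline()
            if i == 0:
                out.write(header)           # keep the header from the first file only
            for line in src:
                out.write(line)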

pandas.read_csv() in a for loop. Why is my code not working for more than 2 cycles?

It works fine for simple files but not with more complex ones.
My files are not corrupted and they are in the right directory.
I tried it with simple generated files (1,2,3,4... a,b,c,d...).
I put it on GitHub tonight so you can run the code and see the files.
import os
import glob
import pandas as pd

def concatenate(indir='./files/', outfile='./all.csv'):
    os.chdir(indir)
    fileList = glob.glob('*.CSV')
    dfList = []
    '''colnames = ['Time', 'Number', 'Reaction', 'Code', 'Message', 'date']'''
    print(len(fileList))
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=0)
        dfList.append(df)
        '''print(dfList)'''
    concatDf = pd.concat(dfList, axis=0)
    '''concatDf.columns = colnames'''
    concatDf.to_csv(outfile, index=None)

concatenate()
Error
Unable to open parsers.pyx: Unable to read file (Error: File not found
(/Users/alf4/Documents/vs_code/files/pandas/_libs/parsers.pyx)).
But it only happens after more than two files.
Complex ones? Do you mean bigger csv files?
Instead of appending data to an empty list and then concatenating back into a dataframe, we can do it in a single step: take an empty dataframe (df1) and keep appending df to df1 in the loop,
df1 = df1.append(df)
and then write it out at the end:
df1.to_csv(outfile, index=None)
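Spelled out as a sketch of that loop. Note that DataFrame.append was removed in pandas 2.0, so pd.concat is used here as the equivalent single-step update.
import glob
import os
import pandas as pd

def concatenate(indir='./files/', outfile='./all.csv'):
    os.chdir(indir)
    df1 = pd.DataFrame()
    for filename in glob.glob('*.CSV'):
        df = pd.read_csv(filename, header=0)
        # grow df1 one file at a time (pd.concat replaces the removed DataFrame.append)
        df1 = pd.concat([df1, df], axis=0)
    df1.to_csv(outfile, index=False)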
I am sorry for this question / the wrong topic, because it turned out not to be a code problem.
It seems that the pandas installation is broken. I put the code on repl.it to share it here, and there it works. At the moment I am trying to repair the Python and pandas installation.
Many thanks to the people in the comments for their help.

Python - Pandas Concatenate Multiple Text Files Within Multiple Zip Files

I am having problems getting txt files located in zipped files to load/concatenate using pandas. There are many examples on here with pd.concat(zip_file.open), but I still can't get anything to work in my case, since I have more than one zip file and multiple txt files in each.
For example, let's say I have TWO zipped files in a specific folder "Main". Each zipped file contains FIVE txt files. I want to read all of these txt files and pd.concat them all together. In my real-world example I will have dozens of zip folders, each containing five txt files.
Can you help please?
Folder and File Structure for Example:
'C:/User/Example/Main'
TAG_001.zip
sample001_1.txt
sample001_2.txt
sample001_3.txt
sample001_4.txt
sample001_5.txt
TAG_002.zip
sample002_1.txt
sample002_2.txt
sample002_3.txt
sample002_4.txt
sample002_5.txt
I started like this but everything after this is throwing errors:
import os
import glob
import pandas as pd
import zipfile
path = 'C:/User/Example/Main'
ziplist = glob.glob(os.path.join(path, "*TAG*.zip"))
This isn't efficient but it should give you some idea of how it might be done.
import os
import zipfile
import pandas as pd

frames = {}
BASE_DIR = 'C:/User/Example/Main'
_, _, zip_filenames = list(os.walk(BASE_DIR))[0]
for zip_filename in zip_filenames:
    with zipfile.ZipFile(os.path.join(BASE_DIR, zip_filename)) as zip_:
        for filename in zip_.namelist():
            with zip_.open(filename) as file_:
                new_frame = pd.read_csv(file_, sep='\t')
                frame = frames.get(filename)
                if frame is not None:
                    # keep the concatenated result, otherwise it would be discarded
                    frames[filename] = pd.concat([frame, new_frame])
                else:
                    frames[filename] = new_frame
# once all frames have been collected, loop over the dict and write them back out
Depending on how much data there is, you will have to design a solution that balances processing power, memory and disk space. This solution could potentially use up a lot of memory.
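As a sketch of that final step, continuing from the frames dict built above and concatenating everything into a single output (the output path is an assumption):
import pandas as pd

# combine every per-file frame into one DataFrame and write it out once
combined = pd.concat(frames.values(), ignore_index=True)
combined.to_csv('C:/User/Example/Main/combined.csv', index=False)   # hypothetical output path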
