Is there a limit for Pandas Apply for many files? - python

I have 10,000 CSV datasets for my project. I need to preprocess all of them, but Pandas seems to stop after only five files in one run. Is there a limit? Here is my code for opening all the files:
path = "./for_processing"
files = [file for file in os.listdir(path) if not file.startswith('.')]
index_cleaned_news = 0
for file in files:
NEWS_DATA = pd.read_csv(path+"/kompas/clean_news"+str(index_cleaned_news)+".csv")
#Process here
NEWS_DATA.to_csv(path+"/process_kompas/preprocess_news"+str(index_cleaned_news)+".csv")
index_cleaned_news = index_cleaned_news+1
Here is a screenshot of what happens when I try to run the code.
Does anybody know how to fix this, or is it a limit imposed by Pandas? Thank you.
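Pandas itself has no limit on how many files can be read or processed in a loop. One thing worth checking in the code above: the loop runs once per entry in ./for_processing rather than once per file in ./for_processing/kompas, so if for_processing only contains a handful of entries, the loop stops after that many iterations. A hedged sketch that lists the clean_news files directly and derives the index from each file name (the folder layout follows the question; the sorting and name parsing are assumptions):
import os
import pandas as pd

path = "./for_processing"

# List the cleaned CSV files themselves, not the entries of the parent folder
files = sorted(f for f in os.listdir(path + "/kompas")
               if f.startswith("clean_news") and f.endswith(".csv"))

for file in files:
    news_data = pd.read_csv(path + "/kompas/" + file)
    # Process here
    # Reuse the numeric part of the input name for the output name
    index_cleaned_news = file[len("clean_news"):-len(".csv")]
    news_data.to_csv(path + "/process_kompas/preprocess_news" + index_cleaned_news + ".csv",
                     index=False)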

Related

Python Pandas Import Large Files

I use the following code to load various log files and then process them later on. The code works perfectly unless a log file is large. I know there isn't any other issue, because if I manually split a large log file into smaller files, the code works again without any problems.
So, what's the best way out? Is it better to write some code before pandas to split large files? Or can pandas import large files (in chunks?) but still let me process them?
code:
import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        try:
            df = pd.read_csv(os.path.join(root, file), sep='\n', header=None, engine='python', quoting=3)
            df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
        except Exception:
            continue  # the original except clause is not shown in the question
error:
MemoryError: Unable to allocate 75.8 MiB for an array with shape (4968252,) and data
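Pandas can read a large file in chunks via read_csv's chunksize parameter, so the whole file never has to be parsed in one piece. A minimal sketch, reusing the read_csv arguments and split logic from the question (the file name and chunk size are placeholders); if even the combined result is too big, each processed chunk can be appended to an output CSV instead of collected in memory:
import pandas as pd

pieces = []
# Read the log file in fixed-size chunks instead of all at once
reader = pd.read_csv('big.log', sep='\n', header=None, engine='python',
                     quoting=3, chunksize=100_000)
for chunk in reader:
    part = chunk[0].str.strip(' \t"').str.split('[,|;: \t]+', n=1, expand=True)
    pieces.append(part.rename(columns={0: 'email', 1: 'data'}))

df = pd.concat(pieces, ignore_index=True)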

How do I convert several large text files into one CSV file if they are too large to be converted individually?

I have several large .text files that I want to consolidate into one .csv file. However, each of the files is too large to import into Excel on its own, let alone all together.
I want to use pandas to analyze the data, but I don't know how to get the files all in one place.
How would I go about reading the data directly into Python, or into Excel as a .csv file?
The data in question is the 2019-2020 Contributions by individuals file on the FEC's website.
You can convert each of the files to CSV and then concatenate them to form one final CSV file:
import os
import glob
import pandas as pd

csv_path = 'pathtonewcsvfolder'  # use your path
text_path = 'path/to/textfiles'
all_files = os.listdir(text_path)

# Convert each text file to its own CSV file
x = 0
for filename in all_files:
    df = pd.read_fwf(os.path.join(text_path, filename))
    df.to_csv(os.path.join(csv_path, 'log' + str(x) + '.csv'), index=False)
    x += 1

# Concatenate all the intermediate CSV files into one
all_csv_files = glob.iglob(os.path.join(csv_path, "*.csv"))
converted_df = pd.concat((pd.read_csv(f) for f in all_csv_files), ignore_index=True)
converted_df.to_csv('converted.csv', index=False)
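If the intermediate CSV files are not actually needed, a hedged variant streams each text file in chunks and appends straight to the final CSV, so no single file has to fit in memory (this assumes read_fwf handles the layout, as in the answer above; the folder path, output name, and chunk size are placeholders):
import os
import pandas as pd

text_path = 'path/to/textfiles'  # placeholder input folder
out_path = 'converted.csv'       # placeholder output file

first = True
for filename in sorted(os.listdir(text_path)):
    # Stream each file in manageable chunks instead of loading it whole
    for chunk in pd.read_fwf(os.path.join(text_path, filename), chunksize=100_000):
        # Write the header only for the very first chunk, then keep appending
        chunk.to_csv(out_path, mode='w' if first else 'a', header=first, index=False)
        first = False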

How to merge multiple Excel files from a folder and its subfolders using Python

I have multiple Excel spreadsheets in a given folder and its subfolders. All have the same file name string, with a date-and-time suffix. How do I merge them all into one single file, using the worksheet names and titles as the index when appending the data frames? Typically there would be either ~100 files of 200 KB each in the subfolders, or ~10 files of 20 MB.
This may help you merge all the xlsx files in the current directory:
import glob
import os
import pandas as pd

# Read every .xlsx file in the current directory and stack them into one frame
frames = []
for file in glob.glob(os.getcwd() + "\\*.xlsx"):
    frames.append(pd.read_excel(file))

output = pd.concat(frames, ignore_index=True)
output.to_csv(os.getcwd() + "\\outPut.csv", index=False, na_rep="NA", header=None)
print("Completed +::")
Note: you need the xlrd-1.1.0 library along with pandas to read xlsx files.
I have tried a version with static file name definitions; it would be better if it consolidated by column header from a dynamically picked file list, matching any file whose extension is .xls* (xls / xlsx / xlsb / xlsm), .csv, or .txt.
import pandas as pd

db = pd.read_excel("/data/Sites/Cluster1 0815.xlsx")
db1 = pd.read_excel("/data/Sites/Cluster2 0815.xlsx")
db2 = pd.read_excel("/data/Sites/Cluster3 0815.xlsx")

# Stack the three frames and write them out as one CSV
sdb = pd.concat([db, db1, db2], ignore_index=True)
sdb.to_csv("/data/Sites/sites db.csv", index=False, na_rep="NA", header=None)
The dynamic file list merge produced the expected output; however, the processing time has to be taken into account.
When running on batches of files, the code generated the error shown in the attached snapshot (note that these files are asymmetric in the information they carry).
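For the dynamic file pick described above, a minimal sketch along these lines might work, assuming glob's recursive matching and the standard pandas readers for each extension (the root folder is a placeholder, the .txt/.csv files are assumed to be comma-separated, and .xlsb additionally needs the pyxlsb package):
import glob
import os
import pandas as pd

root = "/data/Sites"  # placeholder root folder; subfolders are searched too

frames = []
# Pick up Excel-like and text files dynamically from all subfolders
for pattern in ("**/*.xls*", "**/*.csv", "**/*.txt"):
    for path in glob.glob(os.path.join(root, pattern), recursive=True):
        if path.lower().endswith((".csv", ".txt")):
            frames.append(pd.read_csv(path))
        else:
            frames.append(pd.read_excel(path))

# Consolidating by column header falls out of pd.concat: it aligns on column
# names and fills missing columns with NaN for asymmetric files
merged = pd.concat(frames, ignore_index=True, sort=False)
merged.to_csv(os.path.join(root, "merged.csv"), index=False, na_rep="NA")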

Python - Pandas Concatenate Multiple Text Files Within Multiple Zip Files

I am having problems getting txt files located inside zip files to load and concatenate using pandas. There are many examples on here using pd.concat(zip_file.open), but I still can't get anything to work in my case, since I have more than one zip file and multiple txt files in each.
For example, let's say I have TWO zip files in a specific folder "Main". Each zip file contains FIVE txt files. I want to read all of these txt files and pd.concat them all together. In my real-world example I will have dozens of zip folders, each containing five txt files.
Can you help please?
Folder and File Structure for Example:
'C:/User/Example/Main'
    TAG_001.zip
        sample001_1.txt
        sample001_2.txt
        sample001_3.txt
        sample001_4.txt
        sample001_5.txt
    TAG_002.zip
        sample002_1.txt
        sample002_2.txt
        sample002_3.txt
        sample002_4.txt
        sample002_5.txt
I started like this but everything after this is throwing errors:
import os
import glob
import pandas as pd
import zipfile
path = 'C:/User/Example/Main'
ziplist = glob.glob(os.path.join(path, "*TAG*.zip"))
This isn't efficient but it should give you some idea of how it might be done.
import os
import zipfile

import pandas as pd

frames = {}
BASE_DIR = 'C:/User/Example/Main'

# os.walk yields (dirpath, dirnames, filenames); grab the top-level file names
_, _, zip_filenames = list(os.walk(BASE_DIR))[0]

for zip_filename in zip_filenames:
    with zipfile.ZipFile(os.path.join(BASE_DIR, zip_filename)) as zip_:
        for filename in zip_.namelist():
            with zip_.open(filename) as file_:
                new_frame = pd.read_csv(file_, sep='\t')
                frame = frames.get(filename)
                if frame is not None:
                    # keep the concatenated result instead of discarding it
                    frames[filename] = pd.concat([frame, new_frame])
                else:
                    frames[filename] = new_frame
# once all frames have been concatenated, loop over the dict and write them back out
Depending on how much data there is, you will have to design a solution that balances processing power, memory, and disk space. This solution could potentially use up a lot of memory.
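A minimal sketch of that last write-out step, assuming the frames dict built by the snippet above; since the question ultimately wants everything concatenated together, it stacks all of the per-file frames and writes a single CSV (the output path is a placeholder):
import pandas as pd

# 'frames' is the dict of DataFrames built in the answer above
combined = pd.concat(frames.values(), ignore_index=True)
combined.to_csv('C:/User/Example/Main/combined.csv', index=False)  # placeholder output path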

Extract data from multiple excel files in multiple directories in python pandas

I am new to Python and I am posting a question on Stack Overflow for the first time. Please help me solve the problem.
My main directory is 'E:\Data Science\Macros\ZBILL_Dump'; it contains month-wise folders, and each folder contains date-wise Excel data.
I was able to extract data from a single folder:
import os
import pandas as pd
import numpy as np

# Find file names in the specified directory
loc = 'E:\Data Science\Macros\ZBILL_Dump\Apr17\\'
files = os.listdir(loc)

# Keep ONLY the Excel files
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

# Create empty dataframe and read in new data
zbill = pd.DataFrame()
for f in files_xlsx:
    New_data = pd.read_excel(os.path.normpath(loc + f), 'Sheet1')
    zbill = zbill.append(New_data)

zbill.head()
I am trying to extract data from my main directory, i.e. "ZBILL_Dump", which contains many subfolders, but I could not do it. Could somebody please help me?
Thanks a lot.
You can use glob.
import glob
import pandas as pd

# grab excel files only
pattern = 'E:\Data Science\Macros\ZBILL_Dump\Apr17\\*.xlsx'

# Save all file matches: xlsx_files
xlsx_files = glob.glob(pattern)

# Create an empty list: frames
frames = []

# Iterate over xlsx_files
for file in xlsx_files:
    # Read each xlsx into a DataFrame
    df = pd.read_excel(file)
    # Append df to frames
    frames.append(df)

# Concatenate frames into a single dataframe
zbill = pd.concat(frames)
You can use glob wildcards if you want to look in different sub-directories. Use 'filepath/*/*.xlsx' to search one level down. More info here: https://docs.python.org/3/library/glob.html
Use glob with its recursive feature for searching sub-directories:
import glob
files = glob.glob(r'E:\Data Science\Macros\ZBILL_Dump\**\*.xlsx', recursive=True)
Docs: https://docs.python.org/3/library/glob.html
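A hedged usage example, combining that recursive search with the concat pattern from the earlier answers (each file's first sheet is assumed to be the one wanted):
import glob
import pandas as pd

files = glob.glob(r'E:\Data Science\Macros\ZBILL_Dump\**\*.xlsx', recursive=True)
zbill = pd.concat((pd.read_excel(f) for f in files), ignore_index=True)
zbill.head()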
