Python - Pandas Concatenate Multiple Text Files Within Multiple Zip Files - python

I am having trouble loading and concatenating txt files that are stored inside zip files using pandas. There are many examples on here using pd.concat with zip_file.open, but I still can't get anything to work in my case, since I have more than one zip file and multiple txt files in each.
For example, let's say I have two zip files in a specific folder "Main", and each zip file contains five txt files. I want to read all of these txt files and pd.concat them together. In my real-world case I will have dozens of zip files, each containing five txt files.
Can you help please?
Folder and file structure for the example:
'C:/User/Example/Main'
    TAG_001.zip
        sample001_1.txt
        sample001_2.txt
        sample001_3.txt
        sample001_4.txt
        sample001_5.txt
    TAG_002.zip
        sample002_1.txt
        sample002_2.txt
        sample002_3.txt
        sample002_4.txt
        sample002_5.txt
I started like this but everything after this is throwing errors:
import os
import glob
import pandas as pd
import zipfile
path = 'C:/User/Example/Main'
ziplist = glob.glob(os.path.join(path, "*TAG*.zip"))

This isn't efficient but it should give you some idea of how it might be done.
import os
import zipfile
import pandas as pd
frames = {}
BASE_DIR = 'C:/User/Example/Main'
# Take the file names from the top level of BASE_DIR only
_, _, zip_filenames = next(os.walk(BASE_DIR))
for zip_filename in zip_filenames:
    if not zip_filename.endswith('.zip'):
        continue  # skip anything in the folder that is not a zip archive
    with zipfile.ZipFile(os.path.join(BASE_DIR, zip_filename)) as zip_:
        for filename in zip_.namelist():
            with zip_.open(filename) as file_:
                new_frame = pd.read_csv(file_, sep='\t')
            frame = frames.get(filename)
            if frame is not None:
                # pd.concat returns a new frame, so store the result
                frames[filename] = pd.concat([frame, new_frame])
            else:
                frames[filename] = new_frame
# Once all frames have been concatenated, loop over the dict and write them back out
Depending on how much data there is, you will have to design a solution that balances processing power, memory, and disk space. This solution could potentially use up a lot of memory.
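Since the goal in the question is one combined DataFrame, a possible sketch is to collect every frame in a list and call pd.concat once at the end. The function name concat_zipped_text_files, its parameters, and the source_zip and source_file columns are illustrative assumptions, and the tab separator is simply carried over from the code above.

```python
import glob
import os
import zipfile

import pandas as pd


def concat_zipped_text_files(folder, zip_pattern="*TAG*.zip", sep="\t"):
    """Read every .txt member of every matching zip into one DataFrame."""
    frames = []
    for zip_path in sorted(glob.glob(os.path.join(folder, zip_pattern))):
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                if not member.lower().endswith(".txt"):
                    continue  # ignore anything that is not a text file
                with archive.open(member) as handle:
                    frame = pd.read_csv(handle, sep=sep)
                # Assumed bookkeeping columns so each row can be traced back
                frame["source_zip"] = os.path.basename(zip_path)
                frame["source_file"] = member
                frames.append(frame)
    if not frames:
        return pd.DataFrame()
    # One concat at the end avoids repeatedly copying a growing frame
    return pd.concat(frames, ignore_index=True)
```

It could be called as concat_zipped_text_files('C:/User/Example/Main'). Every file is still held in memory at once, so the memory caveat above applies here too.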

Related

Turn one dataframe into several dfs and add them as CSVs to zip archive (without saving files locally)

I have a data frame which I read in from a locally saved CSV file.
I then want to loop over said file and create several CSV files based on a string in one column.
Lastly, I want to add all those files to a zip file, but without saving them locally. I just want one zip archive including all the different CSV files.
All my attempts using the io or zipfile modules only resulted in one zip file with one CSV file in it (pretty much with what I started with)
Any help would be much appreciated!
Here is my code so far, which works but saves all CSV files just to my hard drive.
import pandas as pd
from zipfile import ZipFile
df = pd.read_csv("myCSV.csv")
channelsList = df["Turn one column to list"].values.tolist()
channelsList = list(set(channelsList)) #delete duplicates from list
for channel in channelsList:
    newDf = df.loc[df['Something to match'] == channel]
    newDf.to_csv(f"{channel}.csv")  # saves csv files to disk
DataFrame.to_csv() can write to any file-like object, and ZipFile.writestr() can accept a string (or bytes), so it is possible to avoid writing the CSV files to disk using io.StringIO. See the example code below.
Note: If the channel is simply stored in a single column of your input data, then the more idiomatic (and more efficient) way to iterate over the partitions of your data is to use groupby().
from io import StringIO
from zipfile import ZipFile
import numpy as np
import pandas as pd
# Example data
df = pd.DataFrame(np.random.random((100,3)), columns=[*'xyz'])
df['channel'] = np.random.randint(5, size=len(df))
with ZipFile('/tmp/output.zip', 'w') as zf:
    for channel, channel_df in df.groupby('channel'):
        s = StringIO()
        channel_df.to_csv(s, index=False, header=True)
        zf.writestr(f"{channel}.csv", s.getvalue())
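If the archive itself should also stay off the disk (for example, to send it in an HTTP response), the same idea can be sketched with an in-memory buffer. The helper name frames_to_zip_bytes and its parameters are hypothetical and not part of the answer above.

```python
from io import BytesIO
from zipfile import ZIP_DEFLATED, ZipFile

import pandas as pd


def frames_to_zip_bytes(df, group_column):
    """Return the bytes of a zip archive holding one CSV per group."""
    buffer = BytesIO()
    with ZipFile(buffer, "w", compression=ZIP_DEFLATED) as zf:
        for key, group in df.groupby(group_column):
            # to_csv() without a path returns the CSV text as a string
            zf.writestr(f"{key}.csv", group.to_csv(index=False))
    # Read the buffer only after the archive has been closed and finalized
    return buffer.getvalue()
```

The returned bytes can be written out later in one step or passed on to whatever consumes the archive.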

Programmatically ingesting xl files into pandas data frames by reading the filename

I have a folder with 6 files: 4 are Excel files that I would like to bring into pandas, and 2 are just other files. I want to use pathlib to work with the folder and automatically ingest the Excel files I want into individual pandas dataframes. I would also like to name each new dataframe after its Excel file (without the file extension).
For example:
import pandas as pd
import pathlib as pl
folder = pl.WindowsPath(r'C:\Users\username\project\output')
files = [e for e in folder.iterdir()]
for i in files:
    print(i)
['C:\Users\username\project\output\john.xlsx',
'C:\Users\username\project\output\paul.xlsx',
'C:\Users\username\project\output\random other file not for df.xlsx',
'C:\Users\username\project\output\george.xlsx',
'C:\Users\username\project\output\requirements for project.txt',
'C:\Users\username\project\output\ringo.xlsx' ]
From here, I'd like to be able to do something like
for i in files:
    if ' ' not in str(i.name):
        str(i.name.strip('.xlsx')) = pd.read_excel(i)
That is: read the file name and, if it doesn't contain any spaces, take the name, remove the file extension, and use that as the variable name for a pandas dataframe built from the Excel file.
If what I'm doing isn't possible then I have other ways to do it, but they repeat a lot of code.
Any help is appreciated.
Using pathlib and re, we can exclude any files that match a certain pattern inside a dictionary comprehension; here, that means any file with whitespace in its name.
from pathlib import Path
import re
import pandas as pd
pth = r'C:\Users\username\project\output'
files = Path(pth).glob('*.xlsx')  # use `rglob` if you also want to search subdirectories
# Raw string for the regex so that \s is not treated as a string escape
dfs = {file.stem: pd.read_excel(file) for file in files
       if not re.search(r'\s', file.stem)}
Based on the above, you'll get:
{'john': pandas.core.frame.DataFrame,
'paul': pandas.core.frame.DataFrame,
'george': pandas.core.frame.DataFrame,
'ringo': pandas.core.frame.DataFrame}
where pandas.core.frame.DataFrame stands for your target dataframe.
You can then access each one by key, for example dfs['john'].
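One reason to prefer Path.stem over the name.strip('.xlsx') idea from the question: str.strip treats its argument as a set of characters to remove from both ends, not as a suffix. A small illustration follows; the file name alex.xlsx is made up for the example.

```python
from pathlib import Path

names = ["john.xlsx", "paul.xlsx", "alex.xlsx"]

# str.strip removes any of the characters '.', 'x', 'l', 's' from both ends
stripped = [name.strip(".xlsx") for name in names]

# Path.stem removes only the final suffix
stems = [Path(name).stem for name in names]

print(stripped)  # ['john', 'pau', 'ale']
print(stems)     # ['john', 'paul', 'alex']
```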

Extract data from multiple excel files in multiple directories in python pandas

I am new to Python and this is my first question on Stack Overflow. Please help me solve this problem.
My main directory is 'E:\Data Science\Macros\ZBILL_Dump'; it contains month-wise folders, and each folder contains date-wise Excel files.
I was able to extract data from a single folder:
import os
import pandas as pd
import numpy as np
# Find file names in the specified directory
loc = 'E:\Data Science\Macros\ZBILL_Dump\Apr17\\'
files = os.listdir(loc)
# Find the ONLY Excel files
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# Create empty dataframe and read in new data
zbill = pd.DataFrame()
for f in files_xlsx:
    New_data = pd.read_excel(os.path.normpath(loc + f), 'Sheet1')
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    zbill = pd.concat([zbill, New_data])
zbill.head()
I am trying to extract data from my main directory, i.e. "ZBILL_Dump", which contains many subfolders, but I could not do it. Could somebody please help me?
Thanks a lot.
You can use glob.
import glob
import pandas as pd
# Grab Excel files only (raw string so the backslashes are kept literally)
pattern = r'E:\Data Science\Macros\ZBILL_Dump\Apr17\*.xlsx'
# Save all file matches: xlsx_files
xlsx_files = glob.glob(pattern)
# Create an empty list: frames
frames = []
# Iterate over xlsx_files
for file in xlsx_files:
    # Read xlsx into a DataFrame (pandas has read_excel, not read_xlsx)
    df = pd.read_excel(file)
    # Append df to frames
    frames.append(df)
# Concatenate frames into dataframe
zbill = pd.concat(frames)
You can use wildcards if you want to look in different sub-directories (glob patterns are shell-style wildcards, not regular expressions). Use 'filepath/*/*.xlsx' to search the next level down. More info here: https://docs.python.org/3/library/glob.html
Use glob with its recursive feature for searching sub-directories:
import glob
files = glob.glob(r'E:\Data Science\Macros\ZBILL_Dump\**\*.xlsx', recursive=True)
Docs: https://docs.python.org/3/library/glob.html
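Putting the recursive search and the concatenation together might look like the sketch below. The function name concat_from_subfolders, the reader parameter, and the source_folder column are assumptions added for illustration; the reader is left swappable so the same loop can be tried on other file types.

```python
from pathlib import Path

import pandas as pd


def concat_from_subfolders(root, pattern="*.xlsx", reader=pd.read_excel):
    """Read every file matching pattern under root, at any depth, into one frame."""
    frames = []
    for path in sorted(Path(root).rglob(pattern)):
        frame = reader(path)
        # Assumed extra column: remember which month folder the rows came from
        frame["source_folder"] = path.parent.name
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

For the layout in the question this would be called as concat_from_subfolders(r'E:\Data Science\Macros\ZBILL_Dump').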

Using Pandas read_table with list of files

I am pretty new to Python in general, but am trying to make a script that takes data from certain files in a folder and puts it into an Excel spreadsheet.
The code I have will find the file type that I want in my specified folder, and then make a list with the full file paths.
import os
file_paths = []
for folder, subs, files in os.walk('C://Users/Dir'):
    for filename in files:
        if filename.endswith(".log") or filename.endswith(".txt"):
            file_paths.append(os.path.abspath(os.path.join(folder,filename)))
It will also take a specific file path, pull data from the correct column, and put it into excel in the correct cells.
import pandas as pd
import numpy
for i in range(len(file_paths)):
    fields = ['RDCR']
    data = pd.read_table(file_paths[i], sep= "\s+", names = fields, usecols=[3],
Where I am having trouble is making the read_table iterate through my list of files and put the data into an excel sheet where every time it reads a new file it moves over one column in the spreadsheet.
Ideally, the for loop would see how long the file_paths list is, and use that as the range. It would then use the file_paths[i] to input the file names into the read_table one by one.
What happens is that it finds the length of file_paths, and instead of iterating through the files in it one by one, it just inputs the data from the last file on the list.
Any help would be much appreciated! Thank you!
Try to concatenate all of them at once and write to excel one time.
from glob import glob
import pandas as pd
files = glob('C://Users/Dir/*.log') + glob('C://Users/Dir/*.txt')
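def read_file(f):
    fields = ['RDCR']
    return pd.read_table(
        f, sep=r"\s+",
        names=fields, usecols=[3])

# Place each file's column side by side, then write the sheet once.
# to_excel() returns None, so keep the DataFrame in its own variable.
df = pd.concat([read_file(f) for f in files], axis=1)
df.to_excel('out.xlsx')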
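With the approach above every column ends up with the same name, RDCR. A possible variation, sketched here under the assumption that labelling each column with its file name is useful, builds the table from a dict keyed by file stem. The helper name collect_column and its parameters are hypothetical.

```python
from pathlib import Path

import pandas as pd


def collect_column(paths, column_index=3):
    """Return a DataFrame with one column per file, named after the file stem."""
    series_by_file = {}
    for path in map(Path, paths):
        # Whitespace-delimited file without a header row; keep one column only
        frame = pd.read_csv(path, sep=r"\s+", header=None, usecols=[column_index])
        series_by_file[path.stem] = frame.iloc[:, 0].reset_index(drop=True)
    # Shorter files are padded with NaN when the columns are aligned
    return pd.DataFrame(series_by_file)
```

The result can then be written once with to_excel, as in the answer above.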

Read multiple .xlsx files from a directory into separate Pandas data frames based on file name

I want to load multiple xlsx files with varying structures from a directory and assign each its own data frame based on the file name. I have 30+ files with differing structures, but for brevity please consider the following:
3 Excel files [wild_animals.xlsx, farm_animals.xlsx, domestic_animals.xlsx]
I want to assign each its own data frame, so if the file name contains 'wild' it is assigned to wild_df, if 'farm' then farm_df, and if 'domestic' then dom_df. This is just the first step in a process, as the actual files contain a lot of 'noise' that needs to be cleaned depending on file type. The file names will also change on a weekly basis, with only a few key markers staying the same.
My assumption is that the glob module is the best way to begin, but I become a bit lost when it comes to taking specific parts of the file name and using them to assign a specific df, so any help is appreciated.
I asked a similar question a while back, but it was part of a wider question, most of which I have now solved.
I would parse them into a dictionary of DataFrames:
import os
import glob
import pandas as pd
files = glob.glob('/path/to/*.xlsx')
dfs = {}
for f in files:
    # Key each frame by the file name without directory or extension
    dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)
Then you can access them as normal dictionary elements:
dfs['wild_animals']
dfs['domestic_animals']
etc.
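Because the question says the file names change weekly and only a few markers stay the same, the dictionary could be keyed by marker instead of by the full file name. The sketch below is one way to do that; MARKERS, key_for, and load_by_marker are hypothetical names, and the reader is passed in so the matching logic can be tried without any Excel files.

```python
from pathlib import Path

# Hypothetical mapping from a marker in the file name to the key to store it under
MARKERS = {"wild": "wild_df", "farm": "farm_df", "domestic": "dom_df"}


def key_for(filename, markers=MARKERS):
    """Return the key of the first marker found in the file name, or None."""
    stem = Path(filename).stem.lower()
    for marker, key in markers.items():
        if marker in stem:
            return key
    return None


def load_by_marker(paths, reader, markers=MARKERS):
    """Build {key: reader(path)} for every path whose name contains a marker."""
    loaded = {}
    for path in paths:
        key = key_for(path, markers)
        if key is not None:
            loaded[key] = reader(path)
    return loaded
```

With pandas this might be called as load_by_marker(glob.glob('/path/to/*.xlsx'), pd.read_excel). If two files share a marker, the later one replaces the earlier one, so files that belong together would need to be collected in a list instead.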
You need to get all the xlsx files; then, using a dict comprehension, you can access any element:
import pandas as pd
import os
import glob
path = 'Your_path'
extension = 'xlsx'
os.chdir(path)
result = glob.glob('*.{}'.format(extension))
# Keep the mapping so each workbook can be looked up by file name
dfs = {elm: pd.ExcelFile(elm) for elm in result}
For completeness, I wanted to show the solution I ended up using. It is very close to Khelili's suggestion, with a few tweaks to suit my particular code, including not creating a DataFrame at this stage:
import os
import pandas as pd
import openpyxl as excel
import glob
#setting up path
path = 'data_inputs'
extension = 'xlsx'
os.chdir(path)
files = [i for i in glob.glob('*.{}'.format(extension))]
#Grouping files - brings multiple files of same type together in a list
wild_groups = ([s for s in files if "wild" in s])
domestic_groups = ([s for s in files if "domestic" in s])
#Sets up a dictionary associated with the file groupings to be called in another module
file_names = {"WILD":wild_groups, "DOMESTIC":domestic_groups}
...
