Concatenating Excel and CSV files - python

I've been asked to compile data files into one Excel spreadsheet using Python, but the files are a mix of Excel files and CSVs. I'm trying to use the following code:
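import glob
import pandas as pd
# Files whose names contain "Light", minus those matching "*all*" or "*Untitled"
par = set(glob.glob("*Light*")) - set(glob.glob("*all*")) - set(glob.glob("*Untitled"))
df = pd.DataFrame()
for file in par:
    print(file)
    # pd.read() is a placeholder: pandas has no single reader that handles every format
    df = pd.concat([df, pd.read(file)])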
Is there a way to read files in more than one format (i.e. both xlsx and csv) and combine them with pd.concat, instead of handling only one format or the other?
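One possible approach, sketched here rather than taken from an answer, is to choose the pandas reader from each file's extension and then hand all the frames to pd.concat in a single call. The names READERS, read_any and concat_files are hypothetical, and the sketch assumes the files share compatible columns and that a suitable Excel engine (for example openpyxl for .xlsx, xlrd for .xls) is installed.

```python
from pathlib import Path

import pandas as pd

# Map each supported extension to the pandas reader that handles it.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
}


def read_any(path):
    """Read one CSV or Excel file into a DataFrame, chosen by extension."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {path}") from None
    return reader(path)


def concat_files(paths):
    """Read every supported file in `paths` and stack the rows."""
    frames = [read_any(p) for p in paths if Path(p).suffix.lower() in READERS]
    if not frames:
        return pd.DataFrame()
    # Concatenating once is cheaper than growing a DataFrame inside the loop.
    return pd.concat(frames, ignore_index=True)
```

With the question's file set, this could be called as concat_files(par); files with other extensions are skipped instead of raising.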

Related

How to use Python to extract Excel Sheet data

I have many folders, each containing one Excel file named like 1Aug2022, 2Aug2022, and so on.
I want Python to read through all the folders and open only the Excel file with a given name, such as 19AUG2022. That file has many sheets, with names like IP-1*****, IP-2*****, IP-3*****. I then want to go to the sheet named IP-2***** and extract two columns of data.
How can I do this in Python?
You can use the pandas package: https://pandas.pydata.org/
An example:
import pandas as pd
your_excel_path = "your/path/to/the/excel/file"
data = pd.read_excel(your_excel_path, sheet_name="IP-2*****")  # read one specific sheet; returns a DataFrame
data = pd.read_excel(your_excel_path, sheet_name=None)  # read all sheets; returns a dict of DataFrames keyed by sheet name
As Fergus said, use pandas.
The code to search all directories may look like this:
import os
import pandas as pd
directory_to_search = "./"
sheet_name = "IP-2*****"  # placeholder: replace with the full sheet name
for root, dirs, files in os.walk(directory_to_search):
    for file in files:
        # Compare without the extension and ignore case (e.g. 19Aug2022.xlsx)
        if os.path.splitext(file)[0].upper() == "19AUG2022":
            df = pd.read_excel(io=os.path.join(root, file), sheet_name=sheet_name)
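If only the start of the sheet name is known, the sheet can be picked by prefix instead. The following is a rough sketch under that assumption; the function names find_workbooks, matching_sheets and extract_columns are made up for illustration, and which two columns to read is left to the caller through usecols.

```python
from pathlib import Path

import pandas as pd


def find_workbooks(root, stem):
    """Yield Excel files under `root` whose name (minus extension) equals `stem`, ignoring case."""
    for path in Path(root).rglob("*.xls*"):
        if path.stem.lower() == stem.lower():
            yield path


def matching_sheets(sheet_names, prefix):
    """Return the sheet names that start with `prefix`, in workbook order."""
    return [name for name in sheet_names if name.startswith(prefix)]


def extract_columns(root, stem, prefix, usecols):
    """Collect the chosen columns from every matching sheet, keyed by (file, sheet)."""
    results = {}
    for path in find_workbooks(root, stem):
        # ExcelFile opens the workbook once, so sheet names can be inspected before parsing.
        with pd.ExcelFile(path) as workbook:
            for sheet in matching_sheets(workbook.sheet_names, prefix):
                results[(path, sheet)] = workbook.parse(sheet, usecols=usecols)
    return results
```

A call might look like extract_columns("./", "19AUG2022", "IP-2", usecols="A:B"), where "A:B" is only an example of an Excel column range.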

programmatically ingesting xl files to pandas data frame by reading filename

I have a folder with 6 files, 4 are excel files that I would like to bring into pandas and 2 are just other files. I want to be able to use pathlib to work with the folder to automatically ingest the excel files I want into individual pandas dataframes. I would also like to be able to name each new dataframe with the name of the excel file (without the file extension)
For example:
import pandas as pd
import pathlib as pl
folder = pl.WindowsPath(r'C:\Users\username\project\output')
files = [e for e in folder.iterdir()]
for i in files:
    print(i)
['C:\Users\username\project\output\john.xlsx',
'C:\Users\username\project\output\paul.xlsx',
'C:\Users\username\project\output\random other file not for df.xlsx',
'C:\Users\username\project\output\george.xlsx',
'C:\Users\username\project\output\requirements for project.txt',
'C:\Users\username\project\output\ringo.xlsx' ]
From here, I'd like to be able to do something like:
for i in files:
    if ' ' not in str(i.name):
        str(i.name.strip('.xlsx')) = pd.read_excel(i)  # pseudocode, not valid Python
That is: read the file name and, if it doesn't contain any spaces, remove the file extension and use the result as the variable name for a pandas DataFrame built from the Excel file.
If what I'm doing isn't possible then I have other ways to do it, but they repeat a lot of code.
Any help is appreciated.
Using pathlib and re, we can exclude any files that match a certain pattern (here, any file with whitespace in its name) inside a dictionary comprehension.
from pathlib import Path
import re
import pandas as pd
pth = r'C:\Users\username\project\output'
files = Path(pth).glob('*.xlsx')  # use `rglob` if you want to search subdirectories too
dfs = {file.stem: pd.read_excel(file) for file in files
       if not re.search(r'\s', file.stem)}
Based on the above, you'll get:
{'john': pandas.core.frame.DataFrame,
'paul': pandas.core.frame.DataFrame,
'george': pandas.core.frame.DataFrame,
'ringo': pandas.core.frame.DataFrame}
where pandas.core.frame.DataFrame stands for the DataFrame loaded from each file.
You can then access each one by name, e.g. dfs['john'].
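One detail from the question is worth illustrating: str.strip('.xlsx') does not remove a suffix. It removes any of the characters ".", "x", "l" and "s" from both ends of the string, which is one reason to prefer Path.stem as in the answer. A minimal sketch with made-up file names (alex.xlsx and sales.xlsx are hypothetical, not files from the question):

```python
from pathlib import Path

names = ["john.xlsx", "alex.xlsx", "sales.xlsx"]

for name in names:
    stripped = name.strip(".xlsx")  # strips the characters ., x, l, s from both ends
    stem = Path(name).stem          # drops only the final extension
    print(f"{name!r}: strip -> {stripped!r}, stem -> {stem!r}")
```

Here john.xlsx happens to survive, but alex.xlsx and sales.xlsx both come out of strip as 'ale', so two different files would collide under the same name.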

converting all the .xls to .xlsx

Hello, I am having trouble converting all my .xls files to .xlsx. An additional challenge is that each .xls file has multiple sheets, and I have a lot of files to convert. Can someone help me with a solution?
import glob
import pandas as pd
from pandas import ExcelWriter
_list_of_xls_files = glob.glob(r'C:\Users\enter_your_pc_username_here\Documents\*.xls')
for _xls_file in _list_of_xls_files:
    # sheet_name=None reads every sheet into a dict of DataFrames
    df = pd.read_excel(_xls_file, sheet_name=None)
    _list_of_tabs_inside_xls_file = df.keys()
    with ExcelWriter(str(_xls_file).replace('.xls', '.xlsx')) as writer:
        # Sheets are written as sheet0, sheet1, ... in their original order
        for n, _sheet_name in enumerate(_list_of_tabs_inside_xls_file):
            df[_sheet_name].to_excel(writer, sheet_name='sheet%s' % n)
Source: Using Pandas to pd.read_excel() for multiple worksheets of the same workbook

How to Include Source File in Column in Pandas Dataframe

I have a list of 50+ Excel files that I loop through and consolidate into one dataframe. However, I need to know the source of the data, since the data will repeat across these files.
Each file name is the date of the report. Since this data is time series data, I need to pull this date into the dataframe to do further manipulations.
import os
import glob
import pandas as pd
path = r"path"
extension = 'xls*'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
files_df = pd.concat([pd.read_excel(fp, usecols=[0,15], header=None) for fp in files], ignore_index=True)
I get the expected dataframe. I just do not know how to include the source file name as a third column. I thought there was an argument to do this in pd.read_excel(), but I couldn't find it.
For example, I have a list of the following files:
02-2019.xlsx
03-2011.xls
04-2014.xls
etc
I want to include those file names next to the data that comes from that file in the combined dataframe.
Maybe use the keys= parameter in pd.concat()?
import os
import glob
import pandas as pd
path = r"path"
extension = 'xls*'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
# remove ignore_index=True otherwise keys parameter won't work
files_df = pd.concat([pd.read_excel(fp, usecols=[0,15], header=None) for fp in files], keys=[f"{fp.split('.')[0]}" for fp in files])
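You can then move the keys out of the index with reset_index() and convert them with to_datetime():
files_df = files_df.reset_index(level=0).rename(columns={'level_0': 'source'})
files_df['source'] = pd.to_datetime(files_df['source'], format='%m-%Y')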
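Another way to reach a similar result, sketched here under the assumption that each file name (minus its extension) is a month-year string such as 02-2019, is to tag every frame with its source before concatenating. The function name tag_and_concat and the column names source_file and report_date are illustrative choices and not part of the answer above.

```python
from pathlib import Path

import pandas as pd


def tag_and_concat(frames_by_file):
    """Stack DataFrames, recording each row's source file and report date.

    `frames_by_file` maps a file name such as "02-2019.xlsx" to the
    DataFrame read from it.
    """
    tagged = []
    for file_name, frame in frames_by_file.items():
        stem = Path(file_name).stem  # "02-2019.xlsx" -> "02-2019"
        tagged.append(
            frame.assign(
                source_file=file_name,
                # Assumes the MM-YYYY naming shown in the question.
                report_date=pd.to_datetime(stem, format="%m-%Y"),
            )
        )
    if not tagged:
        return pd.DataFrame()
    return pd.concat(tagged, ignore_index=True)
```

In the question's setting the input could be built as {fp: pd.read_excel(fp, usecols=[0, 15], header=None) for fp in files}. Because the source is stored in ordinary columns, ignore_index=True can stay.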

Python: convert DAT files to XLS

I have a bunch of DAT files that I need to convert to XLS files using Python. Should I use the CSV library to do this or is there a better way?
I'd use pandas.
import pandas as pd
df = pd.read_table('DATA.DAT')
df.to_excel('DATA.xlsx')
And of course you can set up a loop to go through all your files. Something along these lines, maybe:
import glob
import os
import pandas as pd
os.chdir("C:\\FILEPATH\\")
for file in glob.glob("*.DAT"):
    # Report which file is being converted
    print(file)
    df = pd.read_table(file)
    file1 = file.replace('DAT', 'xlsx')
    df.to_excel(file1)
# To pass engine options, create the writer explicitly
writer = pd.ExcelWriter('pandas_example.xlsx',
                        engine='xlsxwriter',
                        engine_kwargs={'options': {'strings_to_urls': False}})
or you can use:
df.to_excel('example.xlsx')
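One caveat about building the output name with replace('DAT', 'xlsx'): it rewrites every occurrence of DAT, so a file called DATA.DAT would be written as xlsxA.xlsx. The sketch below changes only the extension by using pathlib. The helper names xlsx_name and convert_folder are hypothetical, and it assumes the DAT files are tab-separated text and that an Excel engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd


def xlsx_name(dat_path):
    """Return the output path: same folder and stem, .xlsx extension."""
    return Path(dat_path).with_suffix(".xlsx")


def convert_folder(folder):
    """Convert every .DAT file in `folder`; return the paths written."""
    written = []
    for dat_path in sorted(Path(folder).glob("*.DAT")):
        df = pd.read_table(dat_path)  # assumes tab-separated text
        out_path = xlsx_name(dat_path)
        df.to_excel(out_path, index=False)  # needs an Excel engine such as openpyxl
        written.append(out_path)
    return written
```

If the files use another delimiter, passing sep to pd.read_table would be the place to adjust.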
