How to convert hundreds of .log files to .xlsx files [duplicate] - python

This question already has answers here:
How can I iterate over files in a given directory?
(11 answers)
Closed 2 years ago.
This is a portion of a larger piece of code.
I have a directory with hundreds of .log files that I need to convert to .xlsx files one at a time. I wrote this code:
import csv
import glob
import pandas as pd
df = pd.read_csv('input.log', delimiter=r"\s+", header=None, names=list(range(20)))
df.to_excel('input.xlsx', 'Sheet1')
This works for a single file. What do I need to add to make it loop through the directory and convert every file, regardless of how many there are?

import glob
import pandas as pd

# collect every .log file in the current directory
files = glob.glob("*.log")
for file in files:
    # whitespace-delimited log, no header row, up to 20 columns
    df = pd.read_csv(file, delimiter=r"\s+", header=None, names=list(range(20)))
    # writes next to the original, e.g. input.log -> input.log.xlsx
    df.to_excel(file + '.xlsx', index=False)
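If you'd rather the output be named input.xlsx instead of input.log.xlsx, a variation using pathlib works too; a small sketch, assuming the .log files sit in the current working directory:
from pathlib import Path
import pandas as pd
for log_file in Path('.').glob('*.log'):
    df = pd.read_csv(log_file, delimiter=r"\s+", header=None, names=list(range(20)))
    # with_suffix swaps the extension, e.g. input.log -> input.xlsx
    df.to_excel(log_file.with_suffix('.xlsx'), index=False)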

Related

Programmatically ingesting Excel files into pandas data frames by reading the filename

I have a folder with 6 files: 4 are Excel files that I would like to bring into pandas, and 2 are just other files. I want to use pathlib to work with the folder and automatically ingest the Excel files into individual pandas dataframes. I would also like to name each new dataframe after its Excel file (without the file extension).
For example:
import pandas as pd
import pathlib as pl
folder = pl.WindowsPath(r'C:\Users\username\project\output')
files = [e for e in folder.iterdir()]
for i in files:
    print(i)
['C:\Users\username\project\output\john.xlsx',
'C:\Users\username\project\output\paul.xlsx',
'C:\Users\username\project\output\random other file not for df.xlsx',
'C:\Users\username\project\output\george.xlsx',
'C:\Users\username\project\output\requirements for project.txt',
'C:\Users\username\project\output\ringo.xlsx' ]
From here, I'd like to be able to do something like:
for i in files:
    if ' ' not in str(i.name):
        str(i.name.strip('.xlsx')) = pd.read_excel(i)
That is: read the file name; if it doesn't contain any spaces, strip the file extension and use the remaining name as the variable name for a pandas dataframe built from that Excel file.
If what I'm doing isn't possible then I have other ways to do it, but they repeat a lot of code.
Any help is appreciated.
Using pathlib and re:
We can exclude any files that match a certain pattern in the dictionary comprehension, in this case any file whose name contains a space.
from pathlib import Path
import re
import pandas as pd
pth = r'C:\Users\username\project\output'
files = Path(pth).glob('*.xlsx')  # use `rglob` if you want to trawl subdirectories too
dfs = {file.stem: pd.read_excel(file)
       for file in files if not re.search(r'\s', file.stem)}
Based on the above you'll get:
{'john': pandas.core.frame.DataFrame,
'paul': pandas.core.frame.DataFrame,
'george': pandas.core.frame.DataFrame,
'ringo': pandas.core.frame.DataFrame}
where pandas.core.frame.DataFrame is your target dataframe.
You can then access each one with dfs['john'].
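Since the dataframes live in a dictionary, you can also loop over all of them at once, for example:
for name, df in dfs.items():
    print(name, df.shape)  # filename stem and (rows, columns)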

Concatenating Excel and CSV files

I've been asked to compile data files into one Excel spreadsheet using Python, but they are a mix of Excel files and CSVs. I'm trying to use the following code:
import glob
import os
import shutil
import pandas as pd

# everything matching *Light*, minus the *all* and *Untitled files
par_csv = set(glob.glob("*Light*")) - set(glob.glob("*all*")) - set(glob.glob("*Untitled"))

df = pd.DataFrame()
for file in par_csv:
    print(file)
    df = pd.concat([df, pd.read(file)])  # pd.read doesn't exist; how do I read both .xlsx and .csv here?
Is there a way I can use pd.concat to bring in files of more than one format (i.e. both xlsx and csv), instead of just one or the other?
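One possible approach (a sketch, assuming the same glob patterns as in the question) is to pick the reader based on each file's extension and then concatenate the resulting frames:
import glob
import os
import pandas as pd
files = set(glob.glob("*Light*")) - set(glob.glob("*all*")) - set(glob.glob("*Untitled"))
frames = []
for file in files:
    # choose the reader by extension; anything else is skipped
    ext = os.path.splitext(file)[1].lower()
    if ext in ('.xlsx', '.xls'):
        frames.append(pd.read_excel(file))
    elif ext == '.csv':
        frames.append(pd.read_csv(file))
combined = pd.concat(frames, ignore_index=True)
combined.to_excel('combined.xlsx', index=False)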

How do I convert several large text files into one CSV file if they are too large to be converted individually?

I have several large .text files that I want to consolidate into one .csv file. However, each of the files is too large to import into Excel on its own, let alone all together.
I want to use pandas to analyze the data, but don't know how to get the files all in one place.
How would I go about reading the data directly into Python, or into Excel for a .csv file?
The data in question is the 2019-2020 Contributions by individuals file on the FEC's website.
You can convert each of the files to CSV and then concatenate them to form one final CSV file:
import glob
import os
import pandas as pd
csv_path = 'pathtonewcsvfolder'  # use your path
txt_path = "path/to/textfiles"
all_files = os.listdir(txt_path)
x = 0
for filename in all_files:
    # join the directory so the path resolves from anywhere
    df = pd.read_fwf(os.path.join(txt_path, filename))
    df.to_csv(os.path.join(csv_path, 'log' + str(x) + '.csv'), index=False)
    x += 1
all_csv_files = glob.iglob(os.path.join(csv_path, "*.csv"))
converted_df = pd.concat((pd.read_csv(f) for f in all_csv_files), ignore_index=True)
converted_df.to_csv('converted.csv')
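If even a single text file is too large to hold in memory at once, pandas can also read it in pieces via the chunksize argument; a rough sketch, keeping read_fwf as above (the paths are placeholders):
import os
import pandas as pd
src_dir = "path/to/textfiles"
out_path = "converted.csv"
first_write = True
for filename in os.listdir(src_dir):
    # read each large file in 100,000-row chunks instead of all at once
    for chunk in pd.read_fwf(os.path.join(src_dir, filename), chunksize=100_000):
        # append everything to one CSV, writing the header only for the first chunk
        chunk.to_csv(out_path, mode="w" if first_write else "a",
                     header=first_write, index=False)
        first_write = False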

Import CSV file into Python [duplicate]

This question already has answers here:
How do I read and write CSV files with Python?
(7 answers)
Closed 4 years ago.
I tried several times to import a CSV file into Python 2.7.15, but it fails.
Please suggest how to import a CSV file in Python.
Thank you
There are two common ways to import a CSV file in Python.
First: using the standard library csv module
import csv
with open('filename.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        print(row)  # each row is a list of strings
Second: Using pandas
import pandas as pd
data = pd.read_csv('filename.csv')
data.head()  # display the first 5 rows of the loaded data
I would suggest importing with pandas, since that's the more convenient way to do it.
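As a side note, if the CSV has a header row, the standard library's csv.DictReader is also handy, since it yields each row as a dict keyed by column name:
import csv
with open('filename.csv', newline='') as csv_file:
    for row in csv.DictReader(csv_file):
        print(row)  # each row is a dict keyed by the header column names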

Error when adding a new column to pandas dataframe

I am trying to modify .csv files in a folder. The files contain flight information from years 2011-2016.
However, year information cannot be found in the values.
I would like to solve this by using the filename of the .csv file which contains the year. I am adding a new 'year' column after reading it into a pandas dataframe. I will then export the modified file to a new .csv with only the year as its filename.
However, I am encountering this error:
ValueError: Length of values does not match length of index
Code below for your reference.
import pandas as pd
import glob
import re
import os
path = r'data_caap/'
all_files = glob.glob(os.path.join(path, "*.csv"))
for f in all_files:
    df = pd.read_csv(f)
    year = re.findall(r'\d{4}', f)
    # Error here
    df['year'] = year
    # Error here
    df.to_csv(year)
Found the cause of the error.
It must be df['year'] = year[0]; findall returns a list. – DyZ
Thanks a lot @DyZ
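Putting that fix together, a corrected version of the loop might look like the sketch below (the output filename, year + '.csv', is an assumption based on the question):
import glob
import os
import re
import pandas as pd
path = r'data_caap/'
for f in glob.glob(os.path.join(path, "*.csv")):
    df = pd.read_csv(f)
    # findall returns a list such as ['2011'], so take the first match
    year = re.findall(r'\d{4}', f)[0]
    df['year'] = year
    # write to e.g. 2011.csv instead of passing a list as the filename
    df.to_csv(year + '.csv', index=False)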
