Creating Multiple .txt files from an Excel file with a Python Loop

My work is in the process of switching from SAS to Python and I'm trying to get a head start on it. I'm trying to write separate .txt files, each one representing all of the values of an Excel file column.
I've been able to upload my Excel sheet and create a .txt file fine, but for practical use, I need to find a way to create a loop that will go through and make each column into its own .txt file, naming the file "ColumnName.txt".
Uploading Excel Sheet:
import pandas as pd
wb = pd.read_excel('placements.xls')
Creating single .txt file: (Named each column A-Z for easy reference)
with open("A.txt", "w") as f:
for item in wb['A']:
f.write("%s\n" % item)
Trying my hand at a for loop (to no avail):
import glob
for file in glob.glob("*.txt"):
    f = open(file.rsplit(".", 1)[0] + ".txt", "w")
    f.write("%s\n" % item)
    f.close()
The first portion worked like a charm and gave me a .txt file with all of the relevant data.
When I use the glob command to attempt some iterations, it doesn't error out, but it only gives me one output file (A.txt), and the only data point in A.txt is the letter A. I'm sure my inputs are way off... after scrounging around forever, this is what I found that made sense and ran, but I don't think I'm understanding the inputs going into the command, or whether what I'm running is just totally inaccurate.
Any help anyone would give would be much appreciated! I'm sure it's a simple loop, just hard to wrap your head around when you're so new to python programming.
Thanks again!

I suggest using pandas to write the files with to_csv; only the extension needs to change to .txt:
# Uploading Excel Sheet:
import pandas as pd
df = pd.read_excel('placements.xls')

# Creating one .txt file per column (columns named A-Z for easy reference):
for col in df.columns:
    print(col)
    # Python 3.6+
    df[col].to_csv(f"{col}.txt", index=False, header=False)
    # Python below 3.6:
    # df[col].to_csv("{}.txt".format(col), index=False, header=False)
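For comparison, the loop the question was reaching for doesn't need glob at all: the thing to iterate over is the DataFrame's columns, not the .txt files already on disk. Here is a minimal sketch in the spirit of the question's original open/write approach (assuming the same placements.xls):

import pandas as pd

wb = pd.read_excel('placements.xls')
for col in wb.columns:                  # iterate over column names, not existing files
    with open(f"{col}.txt", "w") as f:  # e.g. "A.txt", "B.txt", ...
        for item in wb[col]:
            f.write("%s\n" % item)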

Related

Extract text from multiple PDFs and write to a single CSV

I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os
import pandas as pd

pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")]  # get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))]  # add directory path

df = pd.DataFrame(columns=['FileName', 'Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({'FileName': pdf_files[i], 'Text': scraped_text[i]}, ignore_index=True)
df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
df:
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from @AMC, I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters of the first PDF file, rather than looping through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
I guess you don't need pandas for that. You can make it simpler by using the csv module from the standard library.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path

folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')

with csv_file.open('w', encoding='utf-8', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # quoting must be passed as a keyword argument
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind: pdf_text must be a single line, or else your CSV file will be broken. One way to work around that is to pick an arbitrary character to use in place of the newline marks; the snippet above already does this with the pipe character, via .replace('\n', '|'), before writer.writerow. It is not meant to be a complete example but a starting point. I hope it helps.
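As an aside, if you do want the pandas route from the question: DataFrame.append returns a new frame rather than modifying df in place (and is deprecated in recent pandas), and scraped_text should be written whole rather than indexed with [i]. A minimal sketch under those assumptions, reusing the question's convert_pdf_to_txt:

import os
import pandas as pd

pdf_dir = "C:\\My\\Directory\\Path"
rows = []
for name in os.listdir(pdf_dir):
    if name.endswith(".pdf"):
        text = convert_pdf_to_txt(os.path.join(pdf_dir, name))  # the whole text, not text[i]
        rows.append({"FileName": name, "Text": text})

df = pd.DataFrame(rows, columns=["FileName", "Text"])  # build the frame once, at the end
df.to_csv("output.csv", index=False)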

Is there a way I can extract multiple pieces of data from multiple text files in python and save them as rows in a new .csv file?

Is there a way I can extract multiple pieces of data from a text file in python and save it as a row in a new .csv file? I need to do this for multiple input files and save the output as a single .csv file for all of the input files.
I have never used Python before, so I am quite clueless. I have used MATLAB before, and I know how I would do it there if it were numbers (but unfortunately it is text, which is why I am trying Python). So, to be clear, I need a new line in the .csv output file for each "ID" in the input files.
An example of the data is show below (2 separate files)
EXAMPLE DATA - FILE 1:
id,ARI201803290
version,2
info,visteam,COL
info,hometeam,ARI
info,site,PHO01
info,date,2018/03/29
id,ARI201803300
data,er,corbp001,2
version,2
info,visteam,COL
info,hometeam,ARI
info,site,PHO01
info,date,2018/03/30
data,er,delaj001,0
EXAMPLE DATA - FILE 2:
id,NYN201803290
version,2
info,visteam,SLN
info,hometeam,NYN
info,site,NYC20
info,usedh,false
info,date,2018/03/29
data,er,famij001,0
id,NYN201803310
version,2
info,visteam,SLN
info,hometeam,NYN
info,site,NYC20
info,date,2018/03/31
data,er,gselr001,0
I'm hoping to get the data into .csv format with all the details for one "id" on one line. There are multiple "id"s per text file and there are multiple files. I want to repeat this process for multiple text files so the outputs all land in the same .csv output file. I want the output to look as follows in the .csv file, with each piece of info in a new cell:
ARI201803290 COL ARI PHO01 2018/03/29 2
ARI201803300 COL ARI PHO01 2018/03/30 0
NYN201803290 SLN NYN NYC20 2018/03/29 0
NYN201803310 SLN NYN NYC20 2018/03/31 0
If I were doing it in MATLAB, I'd use a for loop and an if statement and say something like:
j = 1
k = 1
for i = 1:size(myMatrix, 1)
    if file1(i,1) == id
        output(k,1) = file1(i,2)
        k = k + 1
    elseif file1(i,1) == info
        output(j,2) = file1(i,3)
        j = j + 1
    etc.....
However, I obviously can't do this in MATLAB, because I have comma-separated text files, not a matrix. Does anyone have any suggestions on how I can translate my idea into Python code? Or any other suggestion? I am super new to Python, so I'm willing to try anything that might work.
Thank you very much in advance!
Python is very flexible and can do these jobs very easily.
There are a lot of CSV tools/modules in Python that handle pretty much every type of CSV and Excel file; however, I prefer to handle a CSV the same as a text file, because a CSV is simply a text file with comma-separated text, and simple is better than complicated.
Below is the code, with comments to explain most of it; you can tweak it to match your needs exactly.
import os

input_folder = 'myfolder/'  # path of the folder containing the text files on your disk
# create a list of file names with their full paths, using a list comprehension
data_files = [os.path.join(input_folder, file) for file in os.listdir(input_folder)]

# open our csv file for writing
csv_file = open('myoutput.csv', 'w')  # better to open files with a context manager as below, but this shows a different method

def write_to_csv(line):
    print(line)
    csv_file.write(line)

# loop through your text files
for file in data_files:
    with open(file, 'r') as f:  # use a context manager to open files (best practice)
        buff = []
        for line in f:
            line = line.strip()     # remove spaces and newlines
            line = line.split(',')  # split the line into a list of values
            if buff and line[0] == 'id':  # hit another 'id': flush the previous record
                write_to_csv(','.join(buff) + '\n')
                buff = []
            buff.append(line[-1])   # add the last value on the line
        write_to_csv(','.join(buff) + '\n')  # flush the file's last record

csv_file.close()  # must close any file handle opened without a context manager
output:
ARI201803290,2,COL,ARI,PHO01,2018/03/29,2
ARI201803300,2,COL,ARI,PHO01,2018/03/30,0
NYN201803290,2,SLN,NYN,NYC20,false,2018/03/29,0
NYN201803310,2,SLN,NYN,NYC20,2018/03/31,0
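If you need exactly the columns from the desired output (and want to skip extras such as version or usedh), one option is to collect each record into a dict keyed on the field names and flush it on every new 'id'. A sketch assuming the same hypothetical folder as above and that each record's lines sit between its 'id' line and the next one:

import csv
import glob

FIELDS = ['id', 'visteam', 'hometeam', 'site', 'date', 'er']

with open('myoutput.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    record = {}
    for path in glob.glob('myfolder/*.txt'):  # same hypothetical folder as above
        with open(path) as f:
            for line in f:
                parts = line.strip().split(',')
                if parts[0] == 'id':
                    if record:  # flush the previous record
                        writer.writerow([record.get(k, '') for k in FIELDS])
                    record = {'id': parts[1]}
                elif parts[0] == 'info' and parts[1] in FIELDS:
                    record[parts[1]] = parts[2]
                elif parts[0] == 'data' and parts[1] == 'er':
                    record['er'] = parts[3]
    if record:  # flush the last record
        writer.writerow([record.get(k, '') for k in FIELDS])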

No results when looping through files in folder to find regex and sending to Excel - Using Python 3.6 and Xlsxwriter

Before anything, I'd like to point out that I'm fairly new to Python and programming as a whole, so I apologize if I'm not being too clear in my question. If that is the case, please let me know what I'm doing wrong and I will make alterations to my question.
Quick Run Down:
I have created a script that iterates through a whole bunch of TXT files (~120 right now) within a specific folder. If a TXT file matches certain conditions (filename.endswith(".txt")), a loop is initiated, which is supposed to go into each individual text file and find all the emails via regex. With each instance of finding emails within a specific file, a list is created. Once all these emails are extracted (and their corresponding lists created), they are sent to Excel via xlsxwriter.
Main Issue:
There are NO emails/results when I open the Excel file that is created! Also, no errors are produced when the script runs. This script works perfectly fine when I do it file by file (meaning that I use a text file's specific path instead of iterating through the whole folder). What am I doing wrong?
Ideally (but not as important as the issue above):
I would like the script to create a sheet within the newly created workbook for every list, so that everything is organized. I have about 120 TXT files in the folder so far, but the files can be organized into groups based on file names (I don't think it's practical to have more than 50 sheets in a workbook). File names are shared as such...
Client_Info_LA(1) , Client_Info_LA(2), Self_Info(1),Self_Info(2)
Thus organizing all Client_Info_LA in one sheet and Self_Info in another (I was thinking of maybe using pandas to group them). This isn't as important to me as actually getting the script to output the data I need into Excel, but if anyone knows how to tackle this it would be really helpful!
Here's the script...
import re
import xlsxwriter
import os

# Create list of lists
n = len(os.listdir("C:\\Users\\me\\Desktop\\Emails_Project\\Txt_Emails"))
lists = [[] for _ in range(n)]  # For stack peeps: Is the list of lists causing the issue?

workbook = xlsxwriter.Workbook('TestEmails1.xlsx')
worksheet = workbook.add_worksheet()

# Find emails
for filename in os.listdir("C:\\Users\\me\\Desktop\\Emails_Project\\Txt_Emails"):
    if filename.endswith(".txt"):
        for emails in filename:
            if re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", emails):
                lists.append(emails)
                worksheet.write_column('A2', lists)
    else:
        continue
workbook.close()
I have been searching through the web and have tried multiple things -- nothing has worked. This is truly my last resort so if anyone can give me some guidance, suggestions, or insight as to how to fix this, I would truly appreciate it!
Mainly, there are two problems in the code provided.
First, it never opens the files. The variable filename is a string (a sequence of characters), so the loop for emails in filename: iterates over the characters of the string filename; hence, emails is a single character.
Second, worksheet.write_column('A2', lists) overwrites what was written before, because you specify the same starting cell every time.
Here is a suggested code snippet that writes the emails found in different files to different columns of the same sheet.
import re
import xlsxwriter
import os
import codecs

workbook = xlsxwriter.Workbook('TestEmails1.xlsx')
worksheet = workbook.add_worksheet()

path = "C:\\Users\\me\\Desktop\\Emails_Project\\Txt_Emails\\"
row_number = 0
col_number = 0
for filename in os.listdir(path):
    if filename.endswith(".txt"):
        file_handler = codecs.open(path + filename, 'r', 'utf-8')
        file_contents = file_handler.read()
        file_handler.close()  # close the handle once the contents are read
        found_mails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", file_contents)
        if found_mails:
            for mail in found_mails:
                worksheet.write(row_number, col_number, mail)
                row_number += 1
            row_number = 0
            col_number += 1
workbook.close()
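As for the per-group sheets mentioned in the question, one approach is to derive a group name from each file name (here, a hypothetical rule that strips the (n) suffix) and keep one worksheet per group. A sketch along those lines, reusing the path and regex from above:

import os
import re
import xlsxwriter

path = "C:\\Users\\me\\Desktop\\Emails_Project\\Txt_Emails\\"
workbook = xlsxwriter.Workbook('TestEmails2.xlsx')
sheets = {}  # group name -> [worksheet, next free row]

for filename in os.listdir(path):
    if not filename.endswith(".txt"):
        continue
    group = re.sub(r"\(\d+\)$", "", filename[:-4])  # "Client_Info_LA(1)" -> "Client_Info_LA"
    if group not in sheets:
        sheets[group] = [workbook.add_worksheet(group[:31]), 0]  # sheet names are capped at 31 chars
    with open(path + filename, encoding='utf-8') as f:
        mails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", f.read())
    ws, row = sheets[group]
    for mail in mails:
        ws.write(row, 0, mail)
        row += 1
    sheets[group][1] = row

workbook.close()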

How do you write an excel file as a csv?

At work, I need to carry out a process every month that involves downloading some files from a server, copying them into a new folder, and deleting the existing files; it's quite mundane. I tasked myself with writing a Python script to do this.
One step of this process is opening an Excel file and saving it as a CSV file. Is there any way of doing this in Python?
EDIT:
The main difficulty I have with this is two-fold. I know how to write a CSV file using Python's csv library, but:
1. How do I read Excel files into Python?
2. Does the result of reading in an Excel file and then writing it out as a CSV coincide with opening the file in Excel and performing save-as CSV manually?
3. Is there a better way of doing this than the way suggested here?
I guess I really want an answer for 1. 2 I can find out myself by using something like WinMerge...
To manipulate Excel files in Python, take a look at this question and answer, where the xlrd package is used. Example:
from xlrd import open_workbook
book = open_workbook('example.xlsx')
sheet = book.sheet_by_index(1)
To manipulate CSV files in Python, take a look at this question and answer and the documentation, where the csv library is used. Example:
import csv
with open('example.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        print(', '.join(row))
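Putting the two together answers question 1 and most of question 2: read each row with xlrd and hand it straight to csv.writer. A minimal sketch (assuming the first sheet; note that newer xlrd releases only read .xls, so for .xlsx you would substitute openpyxl or pandas.read_excel, and cell formatting such as dates may still differ from Excel's own save-as CSV):

import csv
from xlrd import open_workbook

book = open_workbook('example.xls')
sheet = book.sheet_by_index(0)  # first sheet

with open('example.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row_index in range(sheet.nrows):
        writer.writerow(sheet.row_values(row_index))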
I had to do this for a project in a course I took.
You don't need to involve Excel in the process, as you can simply create a CSV file and then open it in any program you like.
If you know how to write a .txt file, then you can create a CSV. I am not sure if this is the most efficient way, but it works.
When setting up the file, instead of something like:
data = [["Name", "Age", "Gender"], ["Joe", 13, "M"], ["Jill", 14, "F"]]
filename = input("What do you want to save the file as? ")
filename = filename + ".txt"
file = open(filename, "w")
for i in range(len(data)):
    line = ""
    for x in range(len(data[i])):
        line += str(data[i][x])
        line += ","
    file.write(line)
    file.write("\n")
file.close()
simply replace the file extension from .txt to .csv, like this:
filename = filename + ".csv"
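Since the question mentions Python's csv library, it's worth noting the same data can be written with csv.writer, which handles quoting for you and avoids the trailing comma; a short sketch with a hypothetical output name:

import csv

data = [["Name", "Age", "Gender"], ["Joe", 13, "M"], ["Jill", 14, "F"]]
with open("people.csv", "w", newline="") as f:  # newline="" avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerows(data)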

Looping through multiple excel files in python using pandas

I know this type of question is asked all the time. But I am having trouble figuring out the best way to do this.
I wrote a script that reformats a single excel file using pandas.
It works great.
Now I want to loop through multiple Excel files, perform the same reformat, and place the newly reformatted data from each Excel sheet at the bottom, one after another.
I believe the first step is to make a list of all excel files in the directory.
There are so many different ways to do this so I am having trouble finding the best way.
Below is the code I am currently using to import multiple .xlsx files and create a list.
import os
import glob
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
I am not sure if the previous glob code actually created the list that I need.
Then I have trouble understanding where to go from there.
The code below fails at pd.ExcelFile(File)
I believe I am missing something...
# create for loop
for File in FileList:
    for x in File:
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(File)
        xlsx_file
        # View the excel file's sheet names
        xlsx_file.sheet_names
        # Load the xlsx file's Data sheet as a dataframe
        df = xlsx_file.parse('Data', header=None)
        # select important rows,
        df_NoHeader = df[4:]
        # then it does some more reformatting
Any help is greatly appreciated
I solved my problem. Instead of using the glob function, I used os.listdir to read all my Excel files, looped through each one, reformatted it, then appended the final data to the end of a list.
# first, create an empty appended_data list to store the info
appended_data = []
for WorkingFile in os.listdir('C:\ExcelFiles'):
    if os.path.isfile(WorkingFile):
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(WorkingFile)
        # View the excel file's sheet names
        xlsx_file.sheet_names
        # Load the xlsx file's Data sheet as a dataframe
        df = xlsx_file.parse('sheet1', header=None)
        # ... do some reformatting; call the finished sheet reformatedDataSheet
        appended_data.append(reformatedDataSheet)
appended_data = pd.concat(appended_data)
And that's it; it does everything I wanted.
you need to change
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
to just
os.chdir('C:\ExcelWorkbooksFolder')
FileList = glob.glob('*.xlsx')
print(FileList)
Why does this fix it? glob returns a single list. Since you put for FileList in glob.glob(...), you walk that list one by one and put each result into FileList. At the end of your loop, FileList is a single filename, i.e. a single string.
When you do this code:
for File in FileList:
    for x in File:
the first line will set File to the first character of that last filename (as a string). The second line will set x to the first (and only) character of File. This is not likely to be a valid filename, so it throws an error.
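Putting it together, a minimal sketch of the corrected loop (assuming, as in the question, a sheet named 'Data' and that the first four rows are dropped):

import glob
import pandas as pd

file_list = glob.glob(r'C:\ExcelWorkbooksFolder\*.xlsx')  # one list of full paths

frames = []
for file in file_list:  # each file is a path string
    df = pd.ExcelFile(file).parse('Data', header=None)
    frames.append(df[4:])  # keep everything below the header rows
combined = pd.concat(frames, ignore_index=True)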
