Does anyone know a simple way to use python to load several CSV files into one given access table?
For example, my directory could have 100 files named import_*.csv (import_1.csv, import_2.csv, etc)
There is one destination table in MS Access that should receive all of these csv's.
I know I could use pyodbc and build up statements line-by-line to do this, but that's a lot of coding. You also then have to keep your SQL up-to-date as fields might get added or removed. MS Access has its own bulk load functionality - and I'm hoping that either this is accessible via Python or that Python has a library that will do the same.
It would be fantastic if there were a library out there that could do it as easily as:
dbobj.connectOdbc( dsn )
dbobj.bulkLoad( "MyTable" , "c:/temp/test.csv" )
Internally it takes some work to figure out the schema and to make it work. But hopefully someone out there has already done the heavy lifting?
Is there a way to do a bulk import? Reading into pandas is trivial enough - but then you have to get it into MS Access from there.
This is an old post, but I'll take a crack at it. So, you have 100+ CSV files in a directory and you want to push everything into MS Access. OK, I would combine all the CSV files into one single DataFrame in Python, save that DataFrame, and import it into MS Access.
#1 Use Python to merge all CSV files into one single dataframe:
# Something like...
import glob
import os

import pandas as pd

#os.chdir("C:\\your_path_here\\")
filelist = glob.glob("C:\\your_path_here\\*.csv")

dfList = []
for filename in filelist:
    print(filename)
    namedf = pd.read_csv(filename, skiprows=0, index_col=0)
    dfList.append(namedf)

# pd.concat replaces the deprecated DataFrame.append
results = pd.concat(dfList)
results.to_csv('C:\\your_path_here\\Combinefile.csv')
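To finish the first approach from Python, the combined data still has to be pushed into the Access table. A minimal sketch using pyodbc is below; it assumes the Microsoft Access ODBC driver is installed, that the destination table (called MyTable here as a placeholder) already exists with columns matching the CSV headers, and that the paths are placeholders.
# Sketch only: push the combined CSV into an existing Access table via pyodbc.
# Driver name, database path and table name are placeholders.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\your_path_here\YourDatabase.accdb;"
)
cursor = conn.cursor()

df = pd.read_csv(r"C:\your_path_here\Combinefile.csv", index_col=0)

# Build one parameterised INSERT from the DataFrame's columns, so the SQL
# does not need rewriting when fields are added or removed.
columns = ", ".join("[{}]".format(c) for c in df.columns)
placeholders = ", ".join("?" for _ in df.columns)
sql = "INSERT INTO MyTable ({}) VALUES ({})".format(columns, placeholders)

# Convert NaN to None and numpy scalars to plain Python objects for pyodbc
rows = df.astype(object).where(df.notna(), None).values.tolist()
cursor.executemany(sql, rows)
conn.commit()
conn.close()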
Alternatively, and this is how I would do it... use VBA in Access to consolidate all the CSV files into one single table (no need, whatsoever, for Python).
Private Sub Command1_Click()
    Dim strPathFile As String, strFile As String, strPath As String
    Dim strTable As String
    Dim blnHasFieldNames As Boolean

    ' Change this next line to True if the first row of each CSV file
    ' has field names
    blnHasFieldNames = False

    ' Replace C:\your_path_here\ with the real path to the folder that
    ' contains the CSV files
    strPath = "C:\your_path_here\"

    ' Replace tablename with the real name of the table into which
    ' the data are to be imported
    strTable = "tablename"

    strFile = Dir(strPath & "*.csv")
    Do While Len(strFile) > 0
        strPathFile = strPath & strFile
        DoCmd.TransferText acImportDelim, , strTable, strPathFile, blnHasFieldNames
        ' Uncomment the next line if you want to delete the
        ' CSV file after it's been imported
        ' Kill strPathFile
        strFile = Dir()
    Loop
End Sub
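And if you would rather drive Access's own bulk loader from Python instead of VBA, the same DoCmd.TransferText call can be reached through COM automation with pywin32. This is only a sketch, assuming MS Access and pywin32 are installed; the database path and table name are placeholders.
# Sketch: call Access's DoCmd.TransferText from Python via COM (pywin32).
# Database path and table name are placeholders.
import glob
import win32com.client

ACCESS_DB = r"C:\your_path_here\YourDatabase.accdb"
TABLE_NAME = "tablename"
AC_IMPORT_DELIM = 0  # value of the acImportDelim constant

access = win32com.client.Dispatch("Access.Application")
access.OpenCurrentDatabase(ACCESS_DB)
try:
    for csv_path in glob.glob(r"C:\your_path_here\*.csv"):
        # last argument True means the first row holds field names
        access.DoCmd.TransferText(AC_IMPORT_DELIM, "", TABLE_NAME, csv_path, True)
finally:
    access.CloseCurrentDatabase()
    access.Quit()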
I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")] #get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))] #add directory path
import pandas as pd
df = pd.DataFrame(columns=['FileName','Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({'FileName': pdf_files[i], 'Text': scraped_text[i]}, ignore_index=True)
df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
df:
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from @AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters in the first PDF file, rather than looping through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
I guess you don't need pandas for that. You can make it simpler by using the standard library csv.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path
folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')
with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind is that pdf_text has to be a single line, or else your CSV file will end up somewhat broken. One way to work around that is to pick an arbitrary character to use in place of the newline marks. If you pick the pipe character, for example (as the snippet above does), then you can do something like this prior to writer.writerow:
pdf_text.replace('\n', '|')
It is not meant to be a complete example but a starting point. I hope it helps.
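One gap in both snippets is the convert_pdf_to_txt helper, which the question only links to. If you are on pdfminer.six, a minimal stand-in (an assumption on my part, not the asker's original function) could be:
# Minimal stand-in for convert_pdf_to_txt, assuming the pdfminer.six package.
from pdfminer.high_level import extract_text

def convert_pdf_to_txt(path):
    # extract_text accepts a file path (str or Path) and returns the extracted text
    return extract_text(str(path))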
My work is in the process of switching from SAS to Python and I'm trying to get a head start on it. I'm trying to write separate .txt files, each one representing all of the values of an Excel file column.
I've been able to upload my Excel sheet and create a .txt file fine, but for practical uses, I need to find a way to create a loop that will go through and make each column into its own .txt file, and name the file "ColumnName.txt".
Uploading Excel Sheet:
import pandas as pd
wb = pd.read_excel('placements.xls')
Creating single .txt file: (Named each column A-Z for easy reference)
with open("A.txt", "w") as f:
for item in wb['A']:
f.write("%s\n" % item)
Trying my hand at a for loop (to no avail):
import glob
for file in glob.glob("*.txt"):
f = open(( file.rsplit( ".", 1 )[ 0 ] ) + ".txt", "w")
f.write("%s\n" % item)
f.close()
The first portion worked like a charm and gave me a .txt file with all of the relevant data.
When I used the glob command to attempt making some iterations, it doesn't error out, but only gives me one output file (A.txt) and the only data point in A.txt is the letter A. I'm sure my inputs are way off... after scrounging around forever it's what I found that made sense and ran, but I don't think I'm understanding the inputs going in to the command, or if what I'm running is just totally inaccurate.
Any help anyone would give would be much appreciated! I'm sure it's a simple loop, just hard to wrap your head around when you're so new to python programming.
Thanks again!
I suggest using pandas to write the files with to_csv, just changing the extension to .txt:
# Uploading Excel Sheet:
import pandas as pd
df = pd.read_excel('placements.xls')
# Creating single .txt file: (Named each column A-Z for easy reference)
for col in df.columns:
    print(col)
    # python 3.6+
    df[col].to_csv(f"{col}.txt", index=False, header=None)
    # python below 3.6
    # df[col].to_csv("{}.txt".format(col), index=False, header=None)
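If you would rather keep the plain open()/write() style from your first attempt, the loop just has to run over the DataFrame's columns instead of over existing .txt files - roughly:
# Plain-Python variant: one .txt file per column, named after the column.
import pandas as pd

wb = pd.read_excel('placements.xls')
for col in wb.columns:
    with open("{}.txt".format(col), "w") as f:
        for item in wb[col]:
            f.write("%s\n" % item)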
I need to iterate through all the .rrd files inside a given directory, fetch the data inside each RRD database, do some processing, and export everything into a single CSV file from a Python script.
How can this be done efficiently? It's also fine to just advise me on how to loop over multiple files and access the database data in each one.
I assume you have an rrdtool installation with Python bindings already on your system. If not, here is an installation description.
Then, to loop over .rrd files in a given directory and performing a fetch:
import os
import rrdtool
target_directory = "some/directory/with/rrdfiles"
rrd_files = [os.path.join(target_directory, f) for f in os.listdir(target_directory) if f.endswith('.rrd')]
data_container = []
for rrd_file in rrd_files:
    data = rrdtool.fetch(rrd_file, 'AVERAGE',   # or whichever CF you need
                         '--resolution', '200', # for example
                         )                      # and any other settings you need
    data_container.append(data)
The parameter list follows rrdfetch.
Once you have whatever data you need inside the rrd_files loop, you should accumulate it in a list of lists, with each sublist being one row of data. Writing them to a csv is as easy as this:
import csv
# read the .rrd data as above into a 'data_container' variable
with open('my_data.csv', 'w', newline='') as csv_file:
    rrd_writer = csv.writer(csv_file)
    for row in data_container:
        rrd_writer.writerow(row)
This should outline the general steps you have to follow; you will probably need to adapt them (the rrdfetch in particular) to your case.
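As a starting point for that adaptation: rrdtool.fetch returns a ((start, end, step), ds_names, rows) tuple, so one way to flatten it into CSV rows (a sketch only, with the exact timestamp alignment left to the rrdfetch docs) is:
# Sketch: flatten rrdtool.fetch results into rows for the csv writer.
# fetch returns ((start, end, step), ds_names, rows); each entry of rows
# holds one value per data source for one time step.
import csv

def fetch_result_to_rows(fetch_result):
    (start, end, step), ds_names, rows = fetch_result
    out = [['timestamp'] + list(ds_names)]  # header row
    for i, values in enumerate(rows):
        # timestamp is approximate; check the rrdfetch docs for exact alignment
        out.append([start + i * step] + list(values))
    return out

with open('my_data.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for fetch_result in data_container:  # built in the loop above
        writer.writerows(fetch_result_to_rows(fetch_result))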
Before anything, I'd like to point out that I'm fairly new to Python and programming as a whole, so I apologize if I'm not being too clear in my question. If that is the case, just please let me know what I'm doing wrong and I will make alterations to my question.
Quick Run Down:
I have created a script that iterates through a whole bunch of TXT files (~120 right now) within a specific folder directory. If a TXT file matches certain conditions (filename.endswith(".txt")), a loop is initiated, which is supposed to go into each individual text file and find all the emails via regex. With each instance of finding emails within a specific file, a list is created. Once all these emails are extracted (and their corresponding lists created), they are sent to Excel via xlsxwriter.
Main Issue:
There are NO emails/results when I open the excel file that is created! Also, no errors are produced when the script runs. This script works perfectly fine when I do it file by file(meaning that I use a text file's specific path instead of iterating through the whole folder). What am I doing wrong?
Ideally (but not as important as the issue above):
I would like the script to create a Sheet within the newly created Workbook for every list, that way it is organized. I have about 120 TXT files in the folder thus far, but the files can be organized into groups based on file names (I don't think it's practical to have over 50 sheets in a workbook). File names look like this...
Client_Info_LA(1) , Client_Info_LA(2), Self_Info(1),Self_Info(2)
Thus organizing all Client_Info_LA in one sheet and Self_Info in another(was thinking of maybe using pandas to groupby). This isn't as important to me as actually getting the script to output the data I need into Excel, but if anyone knows how to tackle this it would really be helpful!
Here's the script...
import re
import xlsxwriter
import os

'Create List of Lists'
n = len(os.listdir("C:\\Users\\me\\Desktop\\Emails_Project\\Txt_Emails"))
lists = [[] for _ in range(n)] #For stack peeps: Is the list of lists causing the issue?

workbook = xlsxwriter.Workbook('TestEmails1.xlsx')
worksheet = workbook.add_worksheet()

'Find emails'
for filename in os.listdir("C:\\Users\\me\\Desktop\\Emails_Project\\Txt_Emails"):
    if filename.endswith(".txt"):
        for emails in filename:
            if re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", emails):
                lists.append(emails)
                worksheet.write_column('A2', lists)
    else:
        continue
workbook.close()
I have been searching through the web and have tried multiple things -- nothing has worked. This is truly my last resort so if anyone can give me some guidance, suggestions, or insight as to how to fix this, I would truly appreciate it!
Mainly, there are two problems in the code provided.
First, it doesn't open the files. The variable filename is just a string (a sequence of characters), so the for loop for emails in filename: iterates over the characters of that string. Hence, emails is a single character.
Second, it overwrites what it has written before: worksheet.write_column('A2', lists) always writes to the same cell.
Here is a suggested code snippet that writes the emails found from different files to different columns in a same sheet.
import re
import xlsxwriter
import os
import codecs

workbook = xlsxwriter.Workbook('TestEmails1.xlsx')
worksheet = workbook.add_worksheet()

path = "C:\\Users\\me\\Desktop\\Emails_Project\\Txt_Emails\\"
row_number = 0
col_number = 0

for filename in os.listdir(path):
    if filename.endswith(".txt"):
        file_handler = codecs.open(path + filename, 'r', 'utf-8')
        file_contents = file_handler.read()
        file_handler.close()
        found_mails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", file_contents)
        if found_mails != []:
            for mail in found_mails:
                worksheet.write(row_number, col_number, mail)
                row_number += 1
            row_number = 0
            col_number += 1
workbook.close()
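For the "ideally" part (one sheet per file-name group such as Client_Info_LA or Self_Info), you can derive a group name from the file name and create a worksheet per group on the fly. A sketch, assuming the group is simply everything before the "(" in the file name; the output file name is a placeholder:
# Sketch: one worksheet per file-name group (e.g. "Client_Info_LA", "Self_Info"),
# assuming the group name is everything before the "(" in the file name.
import os
import re
import codecs
import xlsxwriter

path = "C:\\Users\\me\\Desktop\\Emails_Project\\Txt_Emails\\"
workbook = xlsxwriter.Workbook('TestEmails_grouped.xlsx')
sheets = {}  # group name -> [worksheet, next free row]

for filename in os.listdir(path):
    if not filename.endswith(".txt"):
        continue
    group = filename.split("(")[0].strip()[:31] or "misc"  # sheet names max 31 chars
    if group not in sheets:
        sheets[group] = [workbook.add_worksheet(group), 0]
    worksheet, row = sheets[group]
    with codecs.open(path + filename, 'r', 'utf-8') as file_handler:
        file_contents = file_handler.read()
    for mail in re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", file_contents):
        worksheet.write(row, 0, mail)
        row += 1
    sheets[group][1] = row

workbook.close()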
edit: see the bottom for my eventual solution
I have a directory of ~12,700 text files.
They have names like this:
1 - Re/ Report Novenator public call for bury - by Lizbett on Thu, 10 Sep 2009.txt
Where the leading number increments with each file (e.g. the last file in the directory begins with "12,700 - ").
Unfortunately, the files are not timesorted, and I need them to be. Luckily I have a separate CSV file where the ID numbers are mapped e.g. the 1 in the example above should really be 25 (since there are 24 messages before it), and 2 should really be 8, and 3 should be 1, and so forth, like so:
OLD_FILEID TIMESORT_FILEID
21 0
23 1
24 2
25 3
I don't need to change anything in the file title except for this single leading number which I need to swap with its associated value. In my head, the way this would work is to open a file name, check the digits which appear before the dash, look them up in the CSV, replace them with the associated value, and then save the file with the adjusted title and go on to the next file.
What would be the best way to go about doing something like this? I'm a python newbie but have played around enough to feel comfortable following most directions or suggestions. Thanks :)
e: following the instructions below as best I could I did this, which doesn't work, but I'm not sure why:
import os
import csv
import sys
#open and store the csv file
with open('timesortmap.csv','rb') as csvfile:
    timeReader = csv.reader(csvfile, delimiter = ',', quotechar='"')

#get the list of files
for filename in os.listdir('DiggOutput-TIMESORT/'):
    oldID = filename.split(' - ')[0]
    newFilename = filename.replace(oldID, timeReader[oldID], 1)
    os.rename(oldID, newFilename)
The error I get is:
TypeError: '_csv.reader' object is not subscriptable
I am not using DictReader, but that's because when I use csv.reader and print the rows, it looks like this:
['12740', '12738']
['12742', '12739']
['12738', '12740']
['12737', '12741']
['12739', '12742']
And when I use DictReader it looks like this:
{'FILEID-TS': '12738', 'FILEID-OLD': '12740'}
{'FILEID-TS': '12739', 'FILEID-OLD': '12742'}
{'FILEID-TS': '12740', 'FILEID-OLD': '12738'}
{'FILEID-TS': '12741', 'FILEID-OLD': '12737'}
{'FILEID-TS': '12742', 'FILEID-OLD': '12739'}
And I get this error in terminal:
File "TimeSorter.py", line 16, in <module>
newFilename = filename.replace(oldID, timeReader[oldID],1)
AttributeError: DictReader instance has no attribute '__getitem__'
This should really be very simple to do in Python just using the csv and os modules.
Python has a built-in dictionary type called dict that could be used to store the contents of the csv file in-memory while you are processing. Basically, you would need to read the csv file using the csv module and convert each entry into a dictionary entry, probably using the OLD_FILEID field as the key and the TIMESORT_FILEID as the value.
You can then use os.listdir() to get the list of files and use a loop to get each file name in turn. (If you need to filter the list of file names to exclude some files, take a look at the glob module). Inside your loop, you just need to extract the number associated with the file, which can be done using something like this:
file_number = filename.split(' - ')[0]
Then call os.rename() passing in the old file name and the new file name. The new filename can be found using something like:
new_filename = filename.replace(file_number, file_mapping[file_number], 1)
Where file_mapping is the dictionary created from the csv file. This will replace the first occurrence of the file_number with the number from your mapping file.
Edit
As TheodrosZelleke points out, there is the potential to overwrite an existing file by literally following what I laid out above. Several possible strategies:
Use os.rename() to move the renamed versions of the files into a different directory (e.g. a subdirectory of the current directory or, even better, a temporary directory created using tempfile.mkdtemp()). Once all the files have been renamed, use os.rename to move the files from the temporary directory back to the current directory.
Add an extension to the new filename, e.g., .tmp, assuming that the extension chosen will not cause other conflicts. Once all the renames are done, use a second loop to rename the files to exclude the .tmp extension.
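A rough sketch (Python 3) of that second strategy; the folder name and CSV layout are assumed to match the question:
# Sketch of the .tmp two-pass approach; folder name and CSV layout are
# assumed to match the question.
import csv
import os

with open('timesortmap.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    # next(reader)  # uncomment to skip a header row, if the CSV has one
    file_mapping = {old: new for old, new in reader}

folder = 'DiggOutput-TIMESORT/'

# Pass 1: rename each file to its new name plus a .tmp suffix
for filename in os.listdir(folder):
    file_number = filename.split(' - ')[0]
    new_filename = filename.replace(file_number, file_mapping[file_number], 1)
    os.rename(folder + filename, folder + new_filename + '.tmp')

# Pass 2: strip the .tmp suffix
for filename in os.listdir(folder):
    if filename.endswith('.tmp'):
        os.rename(folder + filename, folder + filename[:-len('.tmp')])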
Here's what I ended up working out with friends, should anyone find and look for this:
import os
import csv
import sys
IDs = {}
#open and store the csv file
with open('timesortmap.csv','rb') as csvfile:
    timeReader = csv.reader(csvfile, delimiter = ',', quotechar='"')
    # build a dictionary with the associated IDs
    for row in timeReader:
        IDs[ row[0] ] = row[1]

# get the list of files
path = 'DiggOutput-OLDID/'
tmpPath = 'DiggOutput-TIMESORT/'
for filename in os.listdir(path):
    oldID = filename.split(' - ')[0]
    newFilename = filename.replace(oldID, IDs[oldID])
    os.rename(path + filename, tmpPath + newFilename)