CSV reading problem using a custom Python script - python

I'm writing a custom script whose first task is to extract a CSV's data into a Python dictionary. There's some weird behaviour with a variable though: when executing the script below, instead of the expected input prompts, I get "Squeezed text (77 lines)" as output. If I expand that, I get a blank white screen, so there seems to be nothing there. I totally don't get what's happening.
My script:
import os
import io

separator = ";"
source_data_folder = os.path.realpath( __file__ ).replace( "extraction.py", "source_data" )
for source_file in os.listdir( source_data_folder ):
    iterated_source_file = io.open( source_data_folder + "/" + source_file, encoding='windows-1252' )
    source_data = {}
    source_data_key_indexes = {}
    line_counter = 0
    for iterated_line in iterated_source_file:
        iterated_lines_data = iterated_line.split( separator )
        column_counter = 0
        if line_counter == 0:
            for iterated_lines_field in iterated_lines_data:
                source_data[iterated_lines_field] = []
                source_data_key_indexes[column_counter] = iterated_lines_field
                column_counter += 1
        else:
            for iterated_lines_field in iterated_lines_data:
                source_data[source_data_key_indexes[column_counter]].append( iterated_lines_field )
                column_counter += 1
        line_counter += 1
    iterated_source_file.close()

    for column_index in source_data_key_indexes:
        input( "Shall the column '" + source_data_key_indexes[column_index] + "' be exported? (y/n)" )
When I put this part:

for column_index in source_data_key_indexes:
    input( "Shall the column '" + source_data_key_indexes[column_index] + "' be exported? (y/n)" )

out of the initial for loop, without any indentation, it works; but I need to call it inside the first for loop. I could maybe do this with a callback, but why is this actually happening?
I'm using Python v. 3.7.3 and am executing the script via the Python Shell v. 3.7.3.
Content of a sample CSV file, placed in the source_data folder, which is in the same location as the "extraction.py" file holding the code above:
first;second;third;fourth
this;is;the;1st
this;is;the;2nd
This CSV file was obtained by creating the corresponding table in a new Microsoft Office Excel sheet, with the three rows and four columns shown above, then saving the file via "Save as..." and selecting the UTF-8 CSV file type.
Note: I noticed that when I add the line
print( iterated_line )
below the line "if line_counter == 0:" of my code, I interestingly get the "Squeezed text (77 lines)" output again, followed by the visible content of the first line as a simple string. This is only true for the table header line (the very first one); for the others, only the line content is output. Interestingly, this happens for any CSV file I create in the above-mentioned way, no matter the number of rows, columns, or their content. So is this actually some formatting issue between Python and MS Excel?

import os
import csv

source_data_folder = os.path.realpath( __file__ ).replace("extraction.py", "source_data")
for filename in os.listdir(source_data_folder):
    # join with the folder path, since listdir() only returns bare file names
    with open(os.path.join(source_data_folder, filename), encoding='windows-1252') as fp:
        reader = csv.DictReader(fp, delimiter=';')
        table = list(reader)
        # Convert list of dicts to dict of lists
        table = {key: [item[key] for item in table] for key in table[0]}
        print(table)
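For the sample CSV shown in the question, and assuming source_data contains only that one file (no BOM and no hidden files such as .DS_Store), the printed table should look roughly like this:

{'first': ['this', 'this'], 'second': ['is', 'is'], 'third': ['the', 'the'], 'fourth': ['1st', '2nd']}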

I found the problem, weirdly thanks to this. The problem is that os.listdir() contained the .DS_Store file as its first element, which is where the buggy first iteration originates from, so replace:
for source_file in os.listdir( source_data_folder ):
with
# remove the .DS_Store default file which is auto-added in Mac OS
myfiles = os.listdir( source_data_folder )
del myfiles[0]
# iterate through all the source files present in the source folder
for source_file in myfiles:
And the only problem now is that I have the string \ufeff at the very start of the very first line only. To not include it, according to this, use the utf-8-sig encoding instead of utf-8; that indeed worked (the encoding change tells the engine to "omit the BOM in the result").
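Putting both fixes together, a minimal sketch of the corrected loop could look like this (the startswith check is my own generalisation of the del myfiles[0] approach, so hidden files are skipped by name rather than by position):

for source_file in os.listdir( source_data_folder ):
    if source_file.startswith( "." ):  # skip .DS_Store and other hidden files
        continue
    # utf-8-sig strips the BOM (\ufeff) that Excel's "CSV UTF-8" export prepends
    iterated_source_file = io.open( source_data_folder + "/" + source_file, encoding='utf-8-sig' )
    # ... rest of the parsing loop unchanged ...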

Related

Extracting a differentiating numerical value from multiple files - PowerShell/Python

I have multiple text files containing different text.
They all contain a single appearance of the same 2 lines I am interested in:
================================================================
Result: XX/100
I am trying to write a script to collect all those XX values (numerical values between 0 and 100), and paste them in a CSV file with the text file name in column A and the numerical value in column B.
I have considered using Python or PowerShell for this purpose.
How can I identify the line where "Result" appears right below the "===..." string, collect its content up to '\n', and then strip away "Result: " and "/100"?
"Result" and other numerical values could appear elsewhere in the files, but never in the quoted format below "=====", like the line I'm interested in.
Thank you!
Edit: I have written this poor naive attempt to collect the numerical values.
import os

dir_path = os.path.dirname(os.path.realpath(__file__))
for filename in os.listdir(dir_path):
    if filename.endswith(".txt"):
        with open(filename, "r") as f:
            lineFound = False
            for index, line in enumerate(f):
                if lineFound:
                    line = line.replace("Result: ", "")
                    line = line.replace("/100", "")
                    line.strip()
                    grade = line
                    lineFound = False
                    print(grade, end='')
                    continue
                if index > 3:
                    if "================================================================" in line:
                        lineFound = True
I'd still be happy to learn if there's a simple way to do this with PowerShell tbh
For the output, I used csv writer to append the results to a file one by one.
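In case it helps to see what that csv-writer step might look like, here is a small sketch (not the asker's actual code; the results.csv name and the column layout are assumptions):

import csv

# append one (filename, grade) pair per row to results.csv
with open("results.csv", "a", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow([filename, grade])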
So there's two steps involved here, first is to get a list of files. There's a ton of answers for that one on stackoverflow, but this one is stupidly complete.
Once you have the list of files, you can simply just load the files themselves one by one, and then do some simple string.split() to get the value you want.
Finally, write the results into a CSV file. Since the CSV file is a simple one, you don't need to use the CSV library for this.
See the code example below. Note that I copied/pasted the function for generating the list of files from my personal github repo. I reuse that one a lot.
import os


def get_files_from_path(path: str = ".", ext: str or list = None) -> list:
    """Find files in path and return them as a list.

    Gets all files in folders and subfolders.

    See the answer on the link below for a ridiculously
    complete answer for this.
    https://stackoverflow.com/a/41447012/9267296

    Args:
        path (str, optional): Which path to start on.
            Defaults to '.'.
        ext (str/list, optional): Optional file extension.
            Defaults to None.

    Returns:
        list: list of file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            filepath = f"{subdir}{os.sep}{fname}"
            if ext is None:
                result.append(filepath)
            elif type(ext) == str and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif type(ext) == list:
                for item in ext:
                    if fname.lower().endswith(item.lower()):
                        result.append(filepath)
    return result


filelist = get_files_from_path("path/to/files/", ext=".txt")

split1 = "================================================================\nResult: "
split2 = "/100"

with open("output.csv", "w") as outfile:
    outfile.write('filename, value\n')
    for filename in filelist:
        with open(filename) as infile:
            value = infile.read().split(split1)[1].split(split2)[0]
            print(value)
            outfile.write(f'"{filename}", {value}\n')
You could try this.
In this example the filename written to the CSV will be its full (absolute) path. You may just want the base filename.
Uses the same, albeit seemingly unnecessary, mechanism for deriving the source directory. It would be unusual to have your Python script in the same directory as your data.
import os
import glob

equals = '=' * 64
dir_path = os.path.dirname(os.path.realpath(__file__))
outfile = os.path.join(dir_path, 'foo.csv')

with open(outfile, 'w') as csv:
    print('A,B', file=csv)
    for file in glob.glob(os.path.join(dir_path, '*.txt')):
        prev = ''  # previous line read (empty at the start of each file)
        with open(file) as indata:
            for line in indata:
                t = line.split()
                if len(t) == 2 and t[0] == 'Result:' and prev.startswith(equals):
                    v = t[1].split('/')
                    if len(v) == 2 and v[1] == '100':
                        print(f'{file},{v[0]}', file=csv)
                    break
                prev = line

How to use Python to import a .tex file, create a new .tex file, and append to the new .tex file from the imported .tex file

I have several (1000+) .tex files which goes something like this:
File1.tex:
\documentclass[15pt]{article}
\begin{document}
Question1:
$f(x)=sin(x)$\\
Question2:
$f(x)=tan(x)$
\end{document}
File2.tex is similar in structure:
\documentclass[15pt]{article}
\begin{document}
Question1:
$f(x)=cos(x)$\\
Question2:
$f(x)=sec(x)$\\
Question3:
$f(x)=cot(x)$
\end{document}
What I would like to do is write a Python script that allows me to select question 1 from file1.tex and question 3 from file2.tex and compile a new file3.tex file (or PDF) with the following format:
\documentclass[15pt]{article}
\begin{document}
Question1:
$f(x)=sin(x)$\\
Question2:
$f(x)=cot(x)$
\end{document}
PS: I don't mind if this type of work can be carried out in LaTeX itself. I just thought that with Python I could eventually create a GUI.
So far I've managed to read/append a .tex file by manually typing what I want, rather than creating some sort of system that allows me to "copy" specific sections of one or more .tex files into another .tex file.
I used exactly what you had for file1.tex and file2.tex. I left comments throughout rather than explaining step by step.
PreProcess
The preprocessing step involves creating an xlsx file which has the names of all the tex files in its first column.
import os
import xlsxwriter

workbook = xlsxwriter.Workbook('Filenames.xlsx')
worksheet = workbook.add_worksheet("FileNames")
worksheet.write(0, 0, "NameCol")

path = os.getcwd()  # get path to directory
filecount = 1
for file in os.listdir(path):  # for loop over files in directory
    if file.split('.')[-1] == 'tex':  # only open latex files
        worksheet.write(filecount, 0, file)
        filecount += 1

workbook.close()
Select Problems
Now you go through the spreadsheet and, to the right of each file name (like I have), list which problems you want out of that file.
PostProcess
Now we can run through our xlsx file and create a new latex file from it.
import pandas as pd
import math
import os

# get data
allfileqs = []
df = pd.read_excel('Filenames.xlsx')
for row in df.iterrows():
    tempqs = []
    for i in range(len(row[1].values) - 1):
        if math.isnan(row[1].values[i + 1]):
            continue
        else:
            tempqs.append(int(row[1].values[i + 1]))
    allfileqs.append(tempqs)
print(allfileqs)
allfileqs = [["Question" + str(allfileqs[i][j]) + ':' for j in range(len(allfileqs[i]))] for i in range(len(allfileqs))]

starttex = [r'\documentclass[15pt]{article}', r'\begin{document}']
endtex = [r'\end{document}']

alloflines = []
path = os.getcwd()  # get path to directory
for file in os.listdir(path):  # for loop over files in directory
    if file.split('.')[-1] == 'tex':  # only open latex files
        lf = open(file, 'r')
        lines = lf.readlines()
        # remove all new lines, each item is on new line we know
        filt_lines = [lines[i].replace('\n', '') for i in range(len(lines)) if lines[i] != '\n']
        alloflines.append(filt_lines)  # save data for later
        lf.close()  # close file

# good now we have filtered lines
newfile = []
questcount = 1
for i in range(len(alloflines)):
    for j in range(len(alloflines[i])):
        if alloflines[i][j] in allfileqs[i]:
            newfile.append("Question" + str(questcount) + ":")
            newfile.append(alloflines[i][j + 1])
            questcount += 1

# okay cool we have beg, middle (newfile) and end of tex
newest = open('file3.tex', 'w')  # open as write only
starter = '\n\n'.join(starttex) + '\n' + '\n\n'.join(newfile) + '\n\n' + endtex[0]
for i in range(len(starter)):
    newest.write(starter[i])
newest.close()
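The question also mentions producing a PDF; if that's wanted, one option (an addition on my part, assuming pdflatex is installed and available on the PATH) is to compile the generated file from the same script:

import subprocess

# compile the generated file3.tex into file3.pdf
# (assumes pdflatex is installed and available on the PATH)
subprocess.run(["pdflatex", "file3.tex"], check=True)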

Splitting CSV file by row

Someone, please review this code for me. I am kind of confused by the paths in the code. The code is for splitting a CSV file based on a number of rows; I got it on GitHub and have been trying to use it to split a CSV file, but the code is too confusing for me.
You may also follow the link to see the code.
Assuming that the name of the CSV to be split is Dominant.csv,
source_filepath is C:\\Users\James\\Desktop\\Work,
dest_path is C:\\Users\James\\Desktop\\Work\\Processed,
result_filename_prefix is split,
My confusion is,
Does target_filename in the code mean my CSV file Dominant.csv? And what exactly is this target_filepath?
Could someone please reformat the code for me as per the given path and file names? Would be really thankful
import csv
import os
import sys

if len(sys.argv) != 5:
    raise Exception('Wrong number of arguments!')

SOURCE_FILEPATH = sys.argv[1]
DEST_PATH = sys.argv[2]
FILENAME_PREFIX = sys.argv[3]
ROW_LIMIT = int(sys.argv[4])


def split_csv(source_filepath, dest_path, result_filename_prefix, row_limit):
    """
    Split a source CSV into multiple CSVs of equal numbers of records,
    except the last file.

    The initial file's header row will be included as a header row in each split
    file.

    Split files follow a zero-index sequential naming convention like so:

        `{result_filename_prefix}_0.csv`

    :param source_filepath {str}:
        File name (including full path) for the file to be split.
    :param dest_path {str}:
        Full path to the directory where the split files should be saved.
    :param result_filename_prefix {str}:
        File name to be used for the generated files.
        Example: If `my_split_file` is provided as the prefix, then a resulting
        file might be named: `my_split_file_0.csv'
    :param row_limit {int}:
        Number of rows per file (header row is excluded from the row count).
    :return {NoneType}:
    """
    if row_limit <= 0:
        raise Exception('row_limit must be > 0')

    with open(source_filepath, 'r') as source:
        reader = csv.reader(source)
        headers = next(reader)

        file_number = 0
        records_exist = True

        while records_exist:
            i = 0
            target_filename = f'{result_filename_prefix}_{file_number}.csv'
            target_filepath = os.path.join(dest_path, target_filename)

            with open(target_filepath, 'w') as target:
                writer = csv.writer(target)

                while i < row_limit:
                    if i == 0:
                        writer.writerow(headers)
                    try:
                        writer.writerow(next(reader))
                        i += 1
                    except:
                        records_exist = False
                        break

            if i == 0:
                # we only wrote the header, so delete that file
                os.remove(target_filepath)

            file_number += 1


split_csv(SOURCE_FILEPATH, DEST_PATH, FILENAME_PREFIX, ROW_LIMIT)
target_filename is the name you want the output file to have.
target_filepath is the path to the output file, including its name.
In the split_csv function call:
SOURCE_FILEPATH is the path to the source file,
DEST_PATH is the path to the folder you want the output files in,
FILENAME_PREFIX is what you want the output file name(s) to start with, and
ROW_LIMIT is the maximum number of rows written to each output file.
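For the specific names in the question, the call could be set up roughly like this (a sketch; the row limit of 1000 is only an illustrative assumption, and note that SOURCE_FILEPATH must include the file name, not just the folder):

SOURCE_FILEPATH = r'C:\Users\James\Desktop\Work\Dominant.csv'  # the file to split
DEST_PATH = r'C:\Users\James\Desktop\Work\Processed'           # where the split files go
FILENAME_PREFIX = 'split'                                      # outputs named split_0.csv, split_1.csv, ...
ROW_LIMIT = 1000                                               # illustrative: rows per output file

split_csv(SOURCE_FILEPATH, DEST_PATH, FILENAME_PREFIX, ROW_LIMIT)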

Encoding issue with CSV write in Python

I was trying to extract a large amount of data from a PostgreSQL database and then write it to a CSV file, but I got a MemoryError. For that I used Python's StringIO module, and with a limited number of rows generated by the query it worked just fine; the file was written and there was no problem importing it as a text file into Excel.
Since it gave me back that error, I tried to write the file row by row instead, with a little function that takes care of the decoding/encoding stuff:
def convert_and_write(out_stream, input_value):
    output_value = input_value.decode('utf-8').encode('utf-16', 'ignore')
    out_stream.write(output_value)

with open( filename, "w" ) as out_file:
    for row in rs:
        tmpstr = ""
        for c in range( len( row ) ):
            xx = str( row[c] ).replace("\"", "").replace( chr(9), " " )
            tmpstr += xx + chr(9)
        convert_and_write( out_file,
                           tmpstr.replace( "\n", "-" ).replace( "\r", " " ) )
        convert_and_write( out_file, "\r\n" )
Now the file is written, but when I try to import it as text into Excel, there's a problem: the lines that divide the columns are drawn over the text, like this (I've hidden some data, but you get the idea). When I compare the file before/after in Notepad++, the encoding seems to be the same (UCS-2 Little Endian), but I can see strange marks on the first letter of the words in the newer file.
What should I do?
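No answer was recorded for this one, but a plausible cause (my inference, not something stated in the thread) is that every .encode('utf-16') call prepends its own byte order mark, so a BOM lands at the start of each written chunk; those would be the "strange marks" on the first letters. A minimal Python 3 style sketch that encodes the whole file once instead:

import io

# open the file once with a utf-16 encoding so only one BOM is written,
# instead of re-encoding (and re-prepending a BOM) on every write
with io.open(filename, "w", encoding="utf-16", newline="") as out_file:
    for row in rs:
        fields = [str(col).replace('"', "").replace("\t", " ") for col in row]
        line = "\t".join(fields).replace("\n", "-").replace("\r", " ")
        out_file.write(line + "\r\n")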

Python string replace in a file without touching the file if no substitution was made

What does Python's string.replace return if no string substitution was made?
Does Python's file.open(f, 'w') always touch the file even if no changes were made?
Using Python, I'm trying to replace occurrences of 'oldtext' with 'newtext' in a set of files. If a file contains 'oldtext', I want to do the replacement and save the file. Otherwise, do nothing, so the file maintains its old timestamp.
The following code works fine, except all files get written, even if no string substitution was made, and all files have a new timestamp.
for match in all_files('*.html', '.'):  # all_files returns all html files in current directory
    thefile = open(match)
    content = thefile.read()  # read entire file into memory
    thefile.close()
    thefile = open(match, 'w')
    thefile.write(content.replace(oldtext, newtext))  # write the file with the text substitution
    thefile.close()
In this code I'm trying to do the file.write only if a string substitution occurred, but still, all the files get a new timestamp:
count = 0
for match in all_files('*.html', '.'):  # all_files returns all html files in current directory
    thefile = open(match)
    content = thefile.read()  # read entire file into memory
    thefile.close()
    thefile = open(match, 'w')
    replacedText = content.replace(oldtext, newtext)
    if replacedText != '':
        count += 1
        thefile.write(replacedText)
    thefile.close()
print(count)  # print the number of files that we modified
At the end, count is the total number of files, not the number of files modified. Any suggestions? Thanks.
I'm using Python 3.1.2 on Windows.
What does Python's string.replace return if no string substitution was made?
It returns the original string.
Does Python's file.open(f, 'w') always touch the file even if no changes were made?
More than merely touching the file, it destroys any content f used to contain.
So, you can test if the file needs to be rewritten with if replacedText != content, and only open the file in write mode if this is the case:
count = 0
for match in all_files('*.html', '.'):  # all_files returns all html files in current directory
    with open(match) as thefile:
        content = thefile.read()  # read entire file into memory
    replacedText = content.replace(oldtext, newtext)
    if replacedText != content:
        with open(match, 'w') as thefile:
            count += 1
            thefile.write(replacedText)
print(count)  # print the number of files that we modified
What does Python's string.replace return if no string substitution was made?
str.replace() returns the string itself or a copy if the object is a subclass of string.
Does Python's file.open(f, 'w') always touch the file even if no changes were made?
open(f, 'w') opens and truncates the file f.
Note the code below is CPython specific; it won't work correctly on pypy, jython:
count = 0
for match in all_files('*.html', '.'):
    content = open(match).read()
    replacedText = content.replace(oldtext, newtext)
    if replacedText is not content:
        count += 1
        open(match, 'w').write(replacedText)
print(count)
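The identity check relies on a CPython detail: when nothing is replaced, str.replace can return the very same object. A tiny sketch of that behaviour (observed on CPython; other implementations may always return a new string, which is why the note above warns about PyPy and Jython):

s = "hello world"
print(s.replace("xyz", "abc") is s)      # True on CPython: no match, the same object comes back
print(s.replace("world", "there") is s)  # False: a substitution produced a new string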
Your case is a particular one: 'newtext' has exactly the same number of characters as 'oldtext'. Hence, it is possible to use one of the following pieces of code to overwrite, in place, either the word 'oldtext' itself or a whole line containing 'oldtext', with 'newtext' or with the line in which 'newtext' replaces 'oldtext'.
.
If the files are not super big, the content of each file can be read entirely into memory:
from os import fsync  # code using find()

count = 0
for match in all_files('*.html', '.'):
    with open(match, 'rb+') as thefile:
        diag = False
        fno = thefile.fileno()
        content = thefile.read()
        thefile.seek(0, 0)
        x = content.find(b'oldtext')  # bytes literals, since the file is opened in binary mode
        while x >= 0:
            diag = True
            thefile.seek(x, 1)
            thefile.write(b'newtext')
            thefile.flush()
            fsync(fno)
            x = content[thefile.tell():].find(b'oldtext')
        if diag:
            count += 1
or
from os import fsync  # code using a regex
import re

pat = re.compile(b'oldtext')  # bytes pattern, since the file is opened in binary mode

count = 0
for match in all_files('*.html', '.'):
    with open(match, 'rb+') as thefile:
        diag = False
        fno = thefile.fileno()
        content = thefile.read()
        thefile.seek(0, 0)
        prec = 0
        for mat in pat.finditer(content):
            diag = True
            thefile.seek(mat.start() - prec, 1)
            thefile.write(b'newtext')
            thefile.flush()
            fsync(fno)
            prec = mat.end()
        if diag:
            count += 1
.
For heavy files, reading and rewriting line by line is possible:
from os import fsync  # code for big files, using regex
import re

pat = re.compile(b'oldtext')

count = 0
for match in all_files('*.html', '.'):
    with open(match, 'rb+') as thefile:
        diag = False
        fno = thefile.fileno()
        line = thefile.readline()
        while line:
            if b'oldtext' in line:
                diag = True
                thefile.seek(-len(line), 1)
                thefile.write(pat.sub(b'newtext', line))
                thefile.flush()
                fsync(fno)
            line = thefile.readline()
        if diag:
            count += 1
.
The thefile.flush() and fsync(fno) instructions are required after each write so that the file handle thefile points to the exact position in the file at every moment. They make sure the writes ordered by the write() instruction actually take effect:
flush() does not necessarily write the file's data to disk. Use flush() followed by os.fsync() to ensure this behavior.
http://docs.python.org/library/stdtypes.html#file.flush
.
These programs do the minimum. So I think they are fast.
.
Nota bene: a file opened in mode 'rb+' keeps its time of last modification unchanged if no modification has been performed.
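A quick way to verify that claim on one of your own files (a small sketch; example.html is just a placeholder name):

import os

path = "example.html"  # placeholder file name
before = os.path.getmtime(path)

with open(path, 'rb+') as f:
    f.read()  # read only, no write

after = os.path.getmtime(path)
print(before == after)  # expected: True, since nothing was written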
