Could someone please review this code for me? I am confused by the paths in the code. The code splits a CSV file into several files based on a number of rows. I found it on GitHub and have been trying to use it to split a CSV file, but I find it too confusing.
You may also follow the link to see the code.
Assume that the name of the CSV to be split is Dominant.csv,
source_filepath is C:\\Users\\James\\Desktop\\Work,
dest_path is C:\\Users\\James\\Desktop\\Work\\Processed,
result_filename_prefix is split.
My confusion is:
Does target_filename in the code mean my CSV file Dominant.csv? And what exactly is target_filepath?
Could someone please adapt the code to the given paths and file names? I would be really thankful.
import csv
import os
import sys
if len(sys.argv) != 5:
    raise Exception('Wrong number of arguments!')
SOURCE_FILEPATH = sys.argv[1]
DEST_PATH = sys.argv[2]
FILENAME_PREFIX = sys.argv[3]
ROW_LIMIT = int(sys.argv[4])
def split_csv(source_filepath, dest_path, result_filename_prefix, row_limit):
    """
    Split a source CSV into multiple CSVs of equal numbers of records,
    except the last file.
    The initial file's header row will be included as a header row in each split
    file.
    Split files follow a zero-index sequential naming convention like so:
        `{result_filename_prefix}_0.csv`
    :param source_filepath {str}:
        File name (including full path) for the file to be split.
    :param dest_path {str}:
        Full path to the directory where the split files should be saved.
    :param result_filename_prefix {str}:
        File name to be used for the generated files.
        Example: If `my_split_file` is provided as the prefix, then a resulting
        file might be named: `my_split_file_0.csv`
    :param row_limit {int}:
        Number of rows per file (header row is excluded from the row count).
    :return {NoneType}:
    """
    if row_limit <= 0:
        raise Exception('row_limit must be > 0')
    with open(source_filepath, 'r') as source:
        reader = csv.reader(source)
        headers = next(reader)
        file_number = 0
        records_exist = True
        while records_exist:
            i = 0
            target_filename = f'{result_filename_prefix}_{file_number}.csv'
            target_filepath = os.path.join(dest_path, target_filename)
            with open(target_filepath, 'w') as target:
                writer = csv.writer(target)
                while i < row_limit:
                    if i == 0:
                        writer.writerow(headers)
                    try:
                        writer.writerow(next(reader))
                        i += 1
                    except:
                        records_exist = False
                        break
            if i == 0:
                # we only wrote the header, so delete that file
                os.remove(target_filepath)
            file_number += 1
split_csv(SOURCE_FILEPATH, DEST_PATH, FILENAME_PREFIX, ROW_LIMIT)
target_filename is the name of each output file the function generates (for example split_0.csv), not your input file Dominant.csv.
target_filepath is the full path to that output file, including its name.
In the split_csv function call:
SOURCE_FILEPATH is the path to the source file, including the file name
DEST_PATH is the path to the folder you want the output files in
FILENAME_PREFIX is what you want the output file name(s) to start with
ROW_LIMIT is the maximum number of rows (not counting the header) written to each output file.
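For the paths in the question, a minimal sketch of how the names fit together might look like the following. It uses ntpath (the Windows flavour of os.path) only so that the backslash joins behave the same on any operating system. The helper target_filepath_for is hypothetical and simply mirrors the two lines inside the while loop of split_csv.

```python
import ntpath  # Windows version of os.path, so the joins use backslashes everywhere

work_dir = r"C:\Users\James\Desktop\Work"

# source_filepath must include the file name, not just the folder
source_filepath = ntpath.join(work_dir, "Dominant.csv")
dest_path = ntpath.join(work_dir, "Processed")
result_filename_prefix = "split"


def target_filepath_for(file_number):
    """Hypothetical helper: rebuilds the path split_csv writes for one chunk."""
    target_filename = f"{result_filename_prefix}_{file_number}.csv"
    return ntpath.join(dest_path, target_filename)


print(source_filepath)
print(target_filepath_for(0))
print(target_filepath_for(1))
```

Assuming the script is saved as split_csv.py (a made-up name) and you want 1000 rows per file (also an assumption), the call would be python split_csv.py "C:\Users\James\Desktop\Work\Dominant.csv" "C:\Users\James\Desktop\Work\Processed" split 1000. The Processed folder has to exist beforehand, because open() does not create missing directories.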
I have multiple text files containing different text.
They all contain a single appearance of the same 2 lines I am interested in:
================================================================
Result: XX/100
I am trying to write a script to collect all those XX values (numerical values between 0 and 100), and paste them in a CSV file with the text file name in column A and the numerical value in column B.
I have considered using Python or PowerShell for this purpose.
How can I identify the line where "Result" appears under the string of "===..", collect its content until '\n', and then strip "Result: " and "/100" from it?
"Result" and other numerical values could appear elsewhere in the files, but never in the format quoted above and directly below the "=====" line, like the line I'm interested in.
Thank you!
Edit: I have written this poor naive attempt to collect the numerical values.
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
for filename in os.listdir(dir_path):
    if filename.endswith(".txt"):
        with open(filename,"r") as f:
            lineFound=False
            for index, line in enumerate(f):
                if lineFound:
                    line=line.replace("Result: ", "")
                    line=line.replace("/100","")
                    line.strip()
                    grade=line
                    lineFound=False
                    print(grade, end='')
                    continue
                if index>3:
                    if "================================================================" in line:
                        lineFound=True
I'd still be happy to learn if there's a simple way to do this with PowerShell tbh
For the output, I used csv writer to append the results to a file one by one.
So there are two steps involved here. The first is to get a list of files. There are a ton of answers for that on Stack Overflow, but this one is stupidly complete.
Once you have the list of files, you can simply load the files one by one, and then do some simple str.split() calls to get the value you want.
Finally, write the results into a CSV file. Since the CSV file is a simple one, you don't need to use the csv library for this.
See the code example below. Note that I copied and pasted the function for generating the list of files from my personal GitHub repo. I reuse that one a lot.
import os
def get_files_from_path(path: str = ".", ext=None) -> list:
    """Find files in path and return them as a list.
    Gets all files in folders and subfolders
    See the answer on the link below for a ridiculously
    complete answer for this.
    https://stackoverflow.com/a/41447012/9267296
    Args:
        path (str, optional): Which path to start on.
            Defaults to '.'.
        ext (str/list, optional): Optional file extension.
            Defaults to None.
    Returns:
        list: list of file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            # os.path.join avoids doubled separators when path ends with one
            filepath = os.path.join(subdir, fname)
            if ext is None:
                result.append(filepath)
            elif isinstance(ext, str) and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif isinstance(ext, list):
                for item in ext:
                    if fname.lower().endswith(item.lower()):
                        result.append(filepath)
    return result
filelist = get_files_from_path("path/to/files/", ext=".txt")
# everything between these two markers is the value we want
split1 = "================================================================\nResult: "
split2 = "/100"
with open("output.csv", "w") as outfile:
    outfile.write('filename, value\n')
    for filename in filelist:
        with open(filename) as infile:
            value = infile.read().split(split1)[1].split(split2)[0]
        print(value)
        outfile.write(f'"{filename}", {value}\n')
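If some files might not contain the marker at all, the split(split1)[1] lookup above raises an IndexError. One possible alternative, sketched here under the assumption that the separator line is exactly 64 "=" characters and the score is one to three digits, is a regular expression plus the csv module. The names RESULT_RE, extract_result and write_results are made up for this illustration, and the sample text is invented.

```python
import csv
import io
import re

# A line of exactly 64 '=' followed directly by a "Result: NN/100" line.
RESULT_RE = re.compile(r"^={64}\nResult: (\d{1,3})/100$", re.MULTILINE)


def extract_result(text):
    """Return the score as a string, or None if the two-line marker is absent."""
    match = RESULT_RE.search(text)
    return match.group(1) if match else None


def write_results(rows, outfile):
    """Write (filename, value) pairs as CSV rows below a header row."""
    writer = csv.writer(outfile)
    writer.writerow(["filename", "value"])
    writer.writerows(rows)


# Invented sample: a stray "Result" line first, then the real two-line marker.
sample = "intro\nResult: 12 of another kind\n" + "=" * 64 + "\nResult: 87/100\ntrailer\n"

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
write_results([("report.txt", extract_result(sample))], buffer)
print(buffer.getvalue())
```

Letting csv.writer do the quoting also keeps the output valid if a file name ever contains a comma.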
You could try this.
In this example the filename written to the CSV will be its full (absolute) path. You may just want the base filename.
It uses the same, albeit seemingly unnecessary, mechanism for deriving the source directory. It would be unusual to have your Python script in the same directory as your data.
import os
import glob
equals = '=' * 64
dir_path = os.path.dirname(os.path.realpath(__file__))
outfile = os.path.join(dir_path, 'foo.csv')
with open(outfile, 'w') as csv:
    print('A,B', file=csv)
    for file in glob.glob(os.path.join(dir_path, '*.txt')):
        prev = None
        with open(file) as indata:
            for line in indata:
                t = line.split()
                # prev is None on the first line, so check it before calling startswith
                if len(t) == 2 and t[0] == 'Result:' and prev and prev.startswith(equals):
                    v = t[1].split('/')
                    if len(v) == 2 and v[1] == '100':
                        print(f'{file},{v[0]}', file=csv)
                        break
                prev = line
I'm writing a custom script whose first task is to extract a CSV's data into a Python dictionary. There's some weird behaviour with a variable, though: when executing the script below, instead of the subsequent input prompts, I get "Squeezed text (77 lines)" as output. If I inspect that, I get an empty white screen, so there seems to be nothing in it. I totally don't get what's happening.
My script:
import os
import io
separator = ";"
source_data_folder = os.path.realpath( __file__ ).replace( "extraction.py", "source_data" )
for source_file in os.listdir( source_data_folder ):
    iterated_source_file = io.open( source_data_folder + "/" + source_file, encoding='windows-1252' )
    source_data = {}
    source_data_key_indexes = {}
    line_counter = 0
    for iterated_line in iterated_source_file:
        iterated_lines_data = iterated_line.split( "" + separator + "" )
        column_counter = 0
        if line_counter == 0:
            for iterated_lines_field in iterated_lines_data:
                source_data[iterated_lines_field] = []
                source_data_key_indexes[column_counter] = iterated_lines_field
                column_counter += 1
        else:
            for iterated_lines_field in iterated_lines_data:
                source_data[source_data_key_indexes[column_counter]].append( iterated_lines_field )
                column_counter += 1
        line_counter += 1
    iterated_source_file.close()
    for column_index in source_data_key_indexes:
        input( "Shall the column '" + source_data_key_indexes[column_index] + "' be exported? (y/n)" )
When I move this part:
for column_index in source_data_key_indexes:
    input( "Shall the column '" + source_data_key_indexes[column_index] + "' be exported? (y/n)" )
out of the initial for loop, without any indentation, it works; but I need to call it inside the first for loop. I could maybe do this with a callback, but why is this actually happening?
I'm using Python v. 3.7.3 and am executing the script via the Python Shell v. 3.7.3.
Content of a sample CSV file, placed in the source_data folder, which is in the same location as the "extraction.py" file holding the code above:
first;second;third;fourth
this;is;the;1st
this;is;the;2nd
This CSV file was obtained by creating the corresponding table in a new Microsoft Office Excel sheet, with the corresponding three rows and four columns, then saving it via "save as..." and selecting the UTF-8 CSV file type.
Note: I noticed that when I add the line
print( iterated_line )
below the line if line_counter == 0: of my code, I interestingly get the "Squeezed text (77 lines)" again, followed by the visible content of the first line as a simple string. This only happens for the table header line (the very first one); for the others, only the line content is output. Interestingly, this happens for any CSV file I create in the way described above, no matter the number of rows or columns, or their content. So is this actually some formatting issue between Python and MS Excel?
import os
import csv
source_data_folder = os.path.realpath( __file__ ).replace("extraction.py", "source_data")
for filename in os.listdir(source_data_folder):
    # os.listdir() returns bare names, so join them with the folder before opening
    with open(os.path.join(source_data_folder, filename), encoding='windows-1252') as fp:
        reader = csv.DictReader(fp, delimiter=';')
        table = list(reader)
    # Convert list of dicts to dict of lists
    table = {key: [item[key] for item in table] for key in table[0]}
    print(table)
I found the problem, weirdly thanks to this. The problem is that os.listdir() returned the .DS_Store file as its first element, which is where the buggy first iteration originates from, so replace:
for source_file in os.listdir( source_data_folder ):
with
# remove the .DS_Store default file which is auto-added in Mac OS
myfiles = os.listdir( source_data_folder )
del myfiles[0]
# iterate through all the source files present in the source folder
for source_file in myfiles:
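Deleting the first element only works as long as .DS_Store happens to come first, and os.listdir() is documented to return entries in arbitrary order. A slightly more defensive sketch, assuming the source files all end in .csv, filters by name instead. The helper name list_csv_files and the file names in the demo are made up for illustration.

```python
import os
import tempfile


def list_csv_files(folder):
    """Return the sorted names of CSV files in folder, ignoring hidden files."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".csv") and not name.startswith(".")
    )


# Demo in a throwaway folder containing a fake .DS_Store next to the data files.
with tempfile.TemporaryDirectory() as folder:
    for name in (".DS_Store", "b.csv", "a.csv", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    print(list_csv_files(folder))  # ['a.csv', 'b.csv']
```

The loop in the script would then read for source_file in list_csv_files( source_data_folder ):.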
The only remaining problem was that I had the string
\ufeff
at the very start of the very first line only. To get rid of it, according to this, using the utf-8-sig encoding instead of utf-8 indeed worked (the encoding change tells the engine to "omit the BOM in the result").
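A small sketch of the difference between the two encodings, using a throwaway file whose bytes start with the UTF-8 byte order mark, as Excel's UTF-8 CSV export does. The file name and its contents are invented for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")
    # Write the three BOM bytes first, then a header row and one data row.
    with open(path, "wb") as fp:
        fp.write(b"\xef\xbb\xbffirst;second\nthis;is\n")

    # Plain utf-8 keeps the BOM as an invisible character in the first field.
    with open(path, encoding="utf-8") as fp:
        header_plain = fp.readline().rstrip("\n").split(";")

    # utf-8-sig strips the BOM while decoding.
    with open(path, encoding="utf-8-sig") as fp:
        header_sig = fp.readline().rstrip("\n").split(";")

print(header_plain[0] == "first")  # False, the BOM is still attached
print(header_sig[0] == "first")    # True
```

This matters for the dictionary in the question because the first column name is used as a key, so with plain utf-8 the key would be "\ufefffirst" rather than "first".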
I have been trying to merge several .csv files from different subfolders (all with the same name) into one. I tried with R, but it ran out of memory during the process (it has to merge more than 20 million rows). I am now working on a Python script to do it (see below). The files also have many columns and I don't need all of them, but I don't know whether I can choose which columns to add to the new CSV:
import glob
import csv
import os
path= 'C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders'
result = glob.glob('*/certificates.csv')
#for i in result:
#full_path = "C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders\\" + result
#print(full_path)
os.chdir(path)
i=0
for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        print(name)
        try:
            i += 1
            if i % 10000 == 0:
                #just to see the progress
                print(i)
            if name == 'certificates.csv':
                creader = csv.reader(open(name))
                cwriter = csv.writer(open('processed_' + name, 'w'))
                for cline in creader:
                    new_line = [val for col, val in enumerate(cline)]
                    cwriter.writerow(new_line)
        except:
            print('problem with file: ' + name)
            pass
but it doesn't work, and it doesn't return any error either, so at the moment I am completely stuck.
Your indentation is wrong, and you are overwriting the output file for each new input file. Also, you are not using the glob result for anything. If the files you want to read are all immediately in subdirectories of path, you can do away with the os.walk() call and do the glob after you os.chdir().
import glob
import csv
import os
# No real need to have a variable for this really
path = 'C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders'
os.chdir(path)
# Obviously, can't use input file name in output file name
# because there is only one output file for many input files
with open('processed.csv', 'w') as dest:
    cwriter = csv.writer(dest)
    for i, name in enumerate(glob.glob('*/certificates.csv'), 1):
        if i % 10000 == 0:
            #just to see the progress
            print(i)
        try:
            with open(name) as csvin:
                creader = csv.reader(csvin)
                for cline in creader:
                    # no need to enumerate fields
                    cwriter.writerow(cline)
        except:
            print('problem with file: ' + name)
You probably just need to keep a merged.csv file open whilst reading in each of the certificates.csv files. glob.glob() can be used to recursively find all suitable files:
import glob
import csv
import os
path = r'C:\path\to\folder\where\all\files\are-allowated-in-subfolders'
os.chdir(path)
with open('merged.csv', 'w', newline='') as f_merged:
    csv_merged = csv.writer(f_merged)
    # '**' together with recursive=True is what makes glob descend into nested subfolders
    for filename in glob.glob(os.path.join(path, '**/certificates.csv'), recursive=True):
        print(filename)
        try:
            with open(filename) as f_csv:
                csv_merged.writerows(csv.reader(f_csv))
        except:
            print('problem with file: ', filename)
An r prefix can be added to your path to avoid needing to escape each backslash. Also newline='' should be added to the open() when using a csv.writer() to stop extra blank lines being written to your output file.
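The question also asks whether only some columns can be kept. One way to sketch that, assuming every certificates.csv starts with a header row, is csv.DictReader together with csv.DictWriter and extrasaction="ignore". The function name merge_selected_columns, the column names id and grade, and the demo data are all invented for illustration.

```python
import csv
import glob
import os
import tempfile


def merge_selected_columns(root, out_path, columns, name="certificates.csv"):
    """Stream every matching file under root into out_path, keeping only columns."""
    pattern = os.path.join(root, "**", name)
    with open(out_path, "w", newline="") as f_out:
        # extrasaction="ignore" silently drops any column not listed in fieldnames
        writer = csv.DictWriter(f_out, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()  # written once, not once per input file
        for filename in sorted(glob.glob(pattern, recursive=True)):
            with open(filename, newline="") as f_in:
                # rows are streamed one at a time, so memory use stays small
                writer.writerows(csv.DictReader(f_in))


# Tiny demonstration with made-up data in a temporary folder.
with tempfile.TemporaryDirectory() as root:
    demo = (("a", "id,grade,extra\n1,A,x\n"), ("b", "id,grade,extra\n2,B,y\n"))
    for sub, content in demo:
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, "certificates.csv"), "w", newline="") as f:
            f.write(content)
    out_path = os.path.join(root, "merged.csv")
    merge_selected_columns(root, out_path, ["id", "grade"])
    with open(out_path, newline="") as f:
        merged = f.read()

print(merged)
```

Unlike the plain csv.reader versions, this writes the header only once, so the per-file header rows do not end up repeated in the middle of the merged file.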
I have more than 30 text files. I need to do some processing on each text file and save the results in new text files with different names.
Example-1: precise_case_words.txt ---- processing ---- precise_case_sentences.txt
Example-2: random_case_words.txt ---- processing ---- random_case_sentences.txt
I need to do this for all the text files.
Present code:
new_list = []
with open('precise_case_words.txt') as inputfile:
    for line in inputfile:
        new_list.append(line)
final = open('precise_case_sentences.txt', 'w+')
for item in new_list:
    final.write("%s\n" % item)
I am manually copying and pasting this code every time and changing the names by hand. Please suggest a solution in Python that avoids the manual work.
Suppose you have all your *_case_words.txt files in the current directory:
import glob
in_file = glob.glob('*_case_words.txt')
prefix = [i.split('_')[0] for i in in_file]
for i, ifile in enumerate(in_file):
    data = []
    with open(ifile, 'r') as f:
        for line in f:
            data.append(line)
    with open(prefix[i] + '_case_sentences.txt', 'w') as f:
        # data is a list of lines, so use writelines() rather than write()
        f.writelines(data)
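Splitting on the first underscore works for names like precise_case_words.txt, but a prefix that itself contains an underscore would be cut short. A small sketch that swaps only the trailing part of the name, assuming every input ends in _words.txt, could look like this. The helper name output_path_for is made up for illustration.

```python
from pathlib import Path


def output_path_for(input_path, old_suffix="_words.txt", new_suffix="_sentences.txt"):
    """Return the output path by swapping the trailing old_suffix for new_suffix."""
    p = Path(input_path)
    if not p.name.endswith(old_suffix):
        raise ValueError(f"{p.name!r} does not end with {old_suffix!r}")
    # keep the folder, replace only the end of the file name
    return p.with_name(p.name[: -len(old_suffix)] + new_suffix)


print(output_path_for("precise_case_words.txt"))
print(output_path_for("very_random_case_words.txt"))
```

In the loop above, open(output_path_for(ifile), 'w') could then replace the prefix list entirely.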
This should give you an idea about how to handle it:
def rename(name,suffix):
    """renames a file with one . in it by splitting and inserting suffix before the ."""
    a,b = name.split('.')
    return ''.join([a,suffix,'.',b]) # recombine parts including suffix in it
def processFn(name):
    """Open file 'name', process it, save it under other name"""
    # scramble data by sorting and writing anew to renamed file
    with open(name,"r") as r, open(rename(name,"_mang"),"w") as w:
        for line in r:
            scrambled = ''.join(sorted(line.strip("\n")))+"\n"
            w.write(scrambled)
# list of filenames, see link below for how to get them with os.listdir()
names = ['fn1.txt','fn2.txt','fn3.txt']
# create demo data
for name in names:
    with open(name,"w") as w:
        for i in range(12):
            w.write("someword"+str(i)+"\n")
# process files
for name in names:
    processFn(name)
For file listings: see How do I list all files of a directory?
I chose to read and write line by line; you could instead read a whole file at once, process it, and write it out again in one block if you prefer.
fn1.txt:
someword0
someword1
someword2
someword3
someword4
someword5
someword6
someword7
someword8
someword9
someword10
someword11
into fn1_mang.txt:
0demoorsw
1demoorsw
2demoorsw
3demoorsw
4demoorsw
5demoorsw
6demoorsw
7demoorsw
8demoorsw
9demoorsw
01demoorsw
11demoorsw
I happened just today to be writing some code that does this.
I would like to run the following code on multiple fastq files in a folder. The folder contains different fastq files; first I have to read one file, perform the required operations, and store the results in a separate .fastq file, then read the second file, perform the same operations, and save the results in a new second .fastq file. The same procedure should be repeated for all the files in the folder.
How can I do this? Can someone suggest a way to do it?
from Bio.SeqIO.QualityIO import FastqGeneralIterator
fout=open("prova_FiltraN_CE_filt.fastq","w")
fin=open("prova_FiltraN_CE.fastq","rU")
maxN=0
countall=0
countincl=0
with open("prova_FiltraN_CE.fastq", "rU") as handle:
    for (title, sequence, quality) in FastqGeneralIterator(handle):
        countN = sequence.count("N", 0, len(sequence))
        countall+=1
        if countN==maxN:
            fout.write("#%s\n%s\n+\n%s\n" % (title, sequence, quality))
            countincl+=1
fin.close
fout.close
print countall, countincl
I think the following will do what you want. What I did was make your code into a function (and modified it to be what I think is more correct) and then called that function for every .fastq file found in the designated folder. The output file names are generated from the input files found.
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import glob
import os
def process(in_filepath, out_filepath):
    maxN = 0
    countall = 0
    countincl = 0
    # plain "r" mode: the old "rU" mode is no longer accepted by recent Python 3 versions
    with open(in_filepath, "r") as fin:
        with open(out_filepath, "w") as fout:
            for (title, sequence, quality) in FastqGeneralIterator(fin):
                countN = sequence.count("N", 0, len(sequence))
                countall += 1
                if countN == maxN:
                    # FASTQ title lines start with "@"
                    fout.write("@%s\n%s\n+\n%s\n" % (title, sequence, quality))
                    countincl += 1
    print(os.path.split(in_filepath)[1], countall, countincl)
folder = "/path/to/folder" # folder to process
for in_filepath in glob.glob(os.path.join(folder, "*.fastq")):
    root, ext = os.path.splitext(in_filepath)
    if not root.endswith("_filt"): # avoid processing existing output files
        out_filepath = root + "_filt" + ext
        process(in_filepath, out_filepath)