Just to get you started: I want to merge JSON array files into a single file, with a comma appended after each array.
I'm now getting a MemoryError in my code, please help me!
My code >
import os, sys

path = "censored"
dirs = os.listdir(path)
save_list = []
s = ""
for file in dirs:
    save_list.append(file)
for i in range(len(save_list)):
    f = open(path + save_list[i], 'r')
    s += f.read()
    s += s.replace("]", "],")
f.close()
ff = open("a", 'w')
ff.write(s)
ff.close()
print("done")
Error >
Traceback (most recent call last):
File "test.py", line 15, in <module>
s += s.replace("]", "],")
MemoryError
Wanted result >
file "a" contents:
[{a:b}]
file "b" contents:
[{c:d}]
desired file "c" contents:
[{a:b}], [{c:d}]
Based on your code, you're opening every single file one after another and only closing the last one; you should close each file as soon as you're done with it. The MemoryError itself, though, comes from s += s.replace("]", "],"): that statement appends a modified copy of everything read so far, so s roughly doubles on every iteration and grows exponentially.
You can also progressively write to the output file instead of accumulating everything in one string in memory and writing it in one shot.
There's also no need to save all the files from dirs into save_list and re-access it in another loop, so you can omit save_list.
Putting everything together, you'll get the following code snippet:
# everything above as follows
for file in dirs:
    curr_file_path = path + file
    curr_file_string = ""
    # using `with` closes the file automatically
    with open(curr_file_path, 'r') as f:
        raw_file = f.read()
        curr_file_string = raw_file.replace("]", "],")
    # open the output file in append mode ('a') to batch the writes;
    # similarly, this closes the output file after every write
    with open("output file", "a") as out_f:
        out_f.write(curr_file_string)
print("done")
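Note that the wanted output `[{a:b}], [{c:d}]` is not itself valid JSON. If what you eventually need is one file holding a single valid JSON array, a minimal sketch using the standard json module (the file names and temporary directories here are made up for the demo):

```python
import json
import os
import tempfile

def merge_json_arrays(dir_path, out_path):
    """Merge every JSON-array file in dir_path into one valid JSON array."""
    merged = []
    for name in sorted(os.listdir(dir_path)):
        with open(os.path.join(dir_path, name)) as f:
            merged.extend(json.load(f))   # each input file must hold a JSON array
    with open(out_path, "w") as out:
        json.dump(merged, out)

# tiny demo with two throwaway files standing in for file "a" and file "b"
src = tempfile.mkdtemp()
with open(os.path.join(src, "a.json"), "w") as f:
    f.write('[{"a": "b"}]')
with open(os.path.join(src, "b.json"), "w") as f:
    f.write('[{"c": "d"}]')
out_path = os.path.join(tempfile.mkdtemp(), "merged.json")
merge_json_arrays(src, out_path)
with open(out_path) as f:
    print(json.load(f))   # [{'a': 'b'}, {'c': 'd'}]
```

This trades the text-level "]", "]," trick for real parsing, so malformed input fails loudly instead of producing a broken output file.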
Related
I have most of the program done. The last part needs to open the file in append mode, add two names, and close the file. Then it has to open the file in read mode, print the contents, and close the file.
The file path has been assigned to a variable.
I keep getting the error below (the code is below that), and I don't know what to do to fix it.
Traceback (most recent call last):
File "C:\Users\gorri\Desktop\College Work\CPT180_ShellScripting\Assignments\Programs\workWithFiles2.py", line 34, in
cat_files = open(cat_files, 'a')
TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper
from pathlib import Path
import os
import os.path
path_E = Path('E:/CPT180Stuff')
os.chdir(path_E)
cat_files = (path_E / 'pets' / 'cats' / 'catnames.txt')
#opens catnames file to append end and add names.
cat_files = open(cat_files, 'a')
cat_files.write('Princess\nFreya\n')
cat_files.close()
cat_files = open(cat_files, 'r')
cat_files = cat_files.read()
print(cat_files)
cat_files.close()
In your current code, you first assign cat_files to the file path, but then in this line:
cat_files = open(cat_files, 'a')
you rebind cat_files to a file handle, which is not a path string. This is why the later open(cat_files, 'r') fails: open() expects the filename string (or a path object), not a file handle. You should use a different variable name for the handle, e.g.:
#opens catnames file to append end and add names.
f = open(cat_files, 'a')
f.write('Princess\nFreya\n')
f.close()

f = open(cat_files, 'r')
contents = f.read()  # keep the text in its own variable so the handle can still be closed
print(contents)
f.close()
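As a further cleanup, a with block closes the file for you even if an error occurs, so the close() calls disappear entirely. A minimal sketch of the same append-then-read sequence (using a throwaway temp path in place of the real catnames.txt):

```python
import os
import tempfile

cat_files = os.path.join(tempfile.mkdtemp(), "catnames.txt")  # stand-in for the real path

with open(cat_files, "a") as f:   # closed automatically when the block ends
    f.write("Princess\nFreya\n")

with open(cat_files, "r") as f:
    contents = f.read()           # the text lives in its own variable

print(contents)
```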
So I have some code that opens a text file containing a list of paths to files, like so:
C:/Users/User/Desktop/mini_mouse/1980
C:/Users/User/Desktop/mini_mouse/1982
C:/Users/User/Desktop/mini_mouse/1984
It then opens these files individually, line-by-line, and does some filtering to the files. I then want it to output the result to a completely different folder called:
output_location = 'C:/Users/User/Desktop/test2/'
As it stands, my code currently outputs the result to the place where the original file was opened, i.e. if it opens the file C:/Users/User/Desktop/mini_mouse/1980, the output ends up in the same folder under the name '1980_filtered'. I would, however, like the output to go into output_location. Can anyone see where I am going wrong? Any help would be greatly appreciated! Here is my code:
import os

def main():
    stop_words_path = 'C:/Users/User/Desktop/NLTK-stop-word-list.txt'
    stopwords = get_stop_words_list(stop_words_path)
    output_location = 'C:/Users/User/Desktop/test2/'
    list_file = 'C:/Users/User/Desktop/list_of_files.txt'
    with open(list_file, 'r') as f:
        for file_name in f:
            #print(file_name)
            if file_name.endswith('\n'):
                file_name = file_name[:-1]
            #print(file_name)
            file_path = os.path.join(file_name)  # joins the new path of the file to the current file in order to access the file
            filestring = ''  # file string which will take all the lines in the file and add them to itself
            with open(file_path, 'r') as f2:  # open the file
                print('just opened ' + file_name)
                print('\n')
                for line in f2:  # read file line by line
                    x = remove_stop_words(line, stopwords)  # remove stop words from line
                    filestring += x  # add newly filtered line to the file string
                    filestring += '\n'  # Create new line
            new_file_path = os.path.join(output_location, file_name) + '_filtered'  # creates a new file of the file that is currently being filtered of stopwords
            with open(new_file_path, 'a') as output_file:  # opens output file
                output_file.write(filestring)

if __name__ == "__main__":
    main()
Assuming you're using Windows (because you have a normal Windows filesystem): the backslash is Windows' native path separator, although Python's open() generally accepts forward slashes on Windows too, so the separator alone may not be the root cause. If you do use backslashes in string literals, you have to double them (or use a raw r'...' string), since a single backslash starts an escape sequence. Here is the same code with escaped backslashes:
import os

def main():
    stop_words_path = 'C:\\Users\\User\\Desktop\\NLTK-stop-word-list.txt'
    stopwords = get_stop_words_list(stop_words_path)
    output_location = 'C:\\Users\\User\\Desktop\\test2\\'
    list_file = 'C:\\Users\\User\\Desktop\\list_of_files.txt'
    with open(list_file, 'r') as f:
        for file_name in f:
            #print(file_name)
            if file_name.endswith('\n'):
                file_name = file_name[:-1]
            #print(file_name)
            file_path = os.path.join(file_name)  # joins the new path of the file to the current file in order to access the file
            filestring = ''  # file string which will take all the lines in the file and add them to itself
            with open(file_path, 'r') as f2:  # open the file
                print('just opened ' + file_name)
                print('\n')
                for line in f2:  # read file line by line
                    x = remove_stop_words(line, stopwords)  # remove stop words from line
                    filestring += x  # add newly filtered line to the file string
                    filestring += '\n'  # Create new line
            new_file_path = os.path.join(output_location, file_name) + '_filtered'  # creates a new file of the file that is currently being filtered of stopwords
            with open(new_file_path, 'a') as output_file:  # opens output file
                output_file.write(filestring)

if __name__ == "__main__":
    main()
Based on your code, it looks like the issue is in this line:
new_file_path = os.path.join(output_location, file_name) + '_filtered'
In Python's os.path.join(), any absolute path (or drive letter on Windows) among the arguments discards everything before it and restarts the join from that point. Since file_name comes straight from list_of_files.txt, and every path in that file is absolute (rooted at the C: drive), each call to os.path.join() drops output_location and resets to the original file path.
See "Why doesn't os.path.join() work in this case?" for a better explanation of this behavior.
When building the output path, you could strip the file name ("1980", for instance) from the path "C:/Users/User/Desktop/mini_mouse/1980" with os.path.basename() and join it onto output_location.
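A sketch of that fix, with the paths copied from the question:

```python
import os

file_path = 'C:/Users/User/Desktop/mini_mouse/1980'   # one line from list_of_files.txt
output_location = 'C:/Users/User/Desktop/test2/'

file_name = os.path.basename(file_path)               # isolates '1980'
new_file_path = os.path.join(output_location, file_name + '_filtered')
print(new_file_path)   # C:/Users/User/Desktop/test2/1980_filtered
```

Because file_name is now a bare name rather than an absolute path, os.path.join() no longer discards output_location.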
I have a small script that parses data from a text file into an Excel file. Instead of running the counter all the way to 166, it stops at 134 and then doesn't do anything.
I also have a file close operation, but it doesn't close the file, and it looks like the script continues to run.
Any thoughts? What am I doing wrong?
path = ('C:\\Users\\40081\\PycharmProjects\\abcd')
#file_name = open('parsed_DUT1.txt', 'r')
file_name = 'parsed_DUT1.txt'
count = 1
for line in file_name:
    inputfile = open(file_name)
    outputfile = open("parsed_DUT1" + '.xls', 'w+')
    while count < 166:
        for line in inputfile:
            text = "TestNum_" + str(count*1)
            if text in line:
                #data = text[text.find(" ")+1:].split()[0]
                outputfile.writelines(line)
                count = count+1
    inputfile.close()
    outputfile.close()
w+

Opens a file for both writing and reading. Overwrites the existing file if the file exists. If the file does not exist, creates a new file for reading and writing.

You are opening the output file in w+ mode, which truncates it every time. (Note also that the outer for line in file_name: loop iterates over the characters of the filename string, so both files get reopened once per character.) Try with

outputfile = open("parsed_DUT1" + '.xls', 'a')  # 'a' opens a file for appending.

I also suggest you deal with files using the with statement:

with open(file_name) as inputfile, open("parsed_DUT1" + '.xls', 'a') as outputfile:
    # do stuff with input and output files
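Putting the pieces together (append mode, with blocks, and no outer loop over the filename), a sketch along these lines should work; the file names and the TestNum_ pattern are taken from the question, and the demo writes a tiny stand-in input file first:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'parsed_DUT1.txt')
out_path = os.path.join(workdir, 'parsed_DUT1.xls')

with open(in_path, 'w') as demo:                # tiny stand-in input for the demo
    demo.write('TestNum_1 pass\nnoise\nTestNum_2 fail\n')

count = 1
with open(in_path) as inputfile, open(out_path, 'w') as outputfile:
    for line in inputfile:                      # single pass over the input
        if count < 166 and ('TestNum_' + str(count)) in line:
            outputfile.write(line)              # keep only the matching lines
            count += 1

with open(out_path) as f:
    print(f.read())
```

Both files are opened exactly once and closed automatically when the with block exits.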
All, I am just getting started with Python and I thought this may be a good time to see if it can help me automate a lot of the repetitive tasks I have to complete.
I am using a script I found on GitHub that will search and replace and then write a new file with the name output.txt. It works fine, but since I have lots of these files I need to be able to give them different names based on the text in the final modified document.
To make this a little more difficult, the name of the file is based on the text I will be modifying the document with.
So basically, after I run this script I have a file that sits at C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Result with the name output.txt. I would like to rename this modified new file based on the plain text found on a particular line, line 35, of the file.
I have figured out how to read the line within the file using
import linecache
line = linecache.getline("readme.txt", 1)

>>> line
'This is Python version 3.5.1\n'
I just need to figure out how to rename the file based on the variable "line"
Any Ideas?
#!/usr/bin/python
import os
import sys
import string
import re

## information/replacingvalues.txt is the text of the values you want in your final document
information = open("C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\information/replacingvalues.txt", 'r')
# Text_Find_and_Replace\Result\output.txt is the dir and the final document
output = open("C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Result\output.txt", 'w')
#field = open("C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Field/values.txt"
# Field is the file of words you will be replacing
field = open("C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Field/values.txt", 'r')

# modified code for AutoHotkey
# Text_Find_and_Replace\Test/remedy line 1.ahk is the original doc you want modified
with open("C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Test/remedy line 1.ahk", 'r') as myfile:
    inline = myfile.read()
#orig code
##with open("C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Test/input.txt", 'r') as myfile:
##    inline = myfile.read()

informations = []
fields = []
dictionary = {}
i = 0
for line in information:
    informations.append(line.splitlines())
for lines in field:
    fields.append(lines.split())
    i = i + 1
if len(fields) != len(informations):
    print("replacing values and values have different numbers")
    exit()
else:
    for i in range(0, i):
        rightvalue = str(informations[i])
        rightvalue = rightvalue.strip('[]')
        rightvalue = rightvalue[1:-1]
        leftvalue = str(fields[i])
        leftvalue = leftvalue.strip('[]')
        leftvalue = leftvalue.strip("'")
        dictionary[leftvalue] = rightvalue

robj = re.compile('|'.join(dictionary.keys()))
result = robj.sub(lambda m: dictionary[m.group(0)], inline)
output.write(result)

information.close()  # .close without parentheses would not actually close the file
output.close()
field.close()
I figured out how...

import os
import linecache

linecache.clearcache()
newfilename = linecache.getline("C:\python 3.5/remedy line 1.txt", 37)
filename = ("C:\python 3.5/output.ahk")
os.rename(filename, newfilename.strip())
linecache.clearcache()
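A variant of the same idea without linecache, shown with throwaway temp files: read the target line with plain file I/O, then join the new name onto os.path.dirname(src) so the renamed file stays in its original folder (os.rename with a bare name would otherwise move it to the current working directory). The line number and the .ahk name here are just stand-ins:

```python
import os
import tempfile

# stand-in for the output file, with the desired new name placed on line 37
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'output.ahk')
with open(src, 'w') as f:
    f.write('\n' * 36 + 'remedy line 1.ahk\n')   # lines 1-36 empty, line 37 = new name

with open(src) as f:
    new_name = f.readlines()[36].strip()          # line 37, 1-based

# join with the original directory so the file stays where it was
os.rename(src, os.path.join(os.path.dirname(src), new_name))
print(sorted(os.listdir(workdir)))                # ['remedy line 1.ahk']
```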
I'm new to Python and to the bioinformatics field. I'm using Python 2.6. I'm trying to select all fastq.gz files, then gzip.open each one (just a few lines, because the files are huge and reading them fully is time-wasting), then count 'J', then pick out those files whose 'J' count is NOT equal to 0.
The following is my code:
#!/usr/bin/python
import os, sys, re, gzip

path = "/home/XXX/nearline"
for file in os.listdir(path):
    if re.match('.*\.recal.fastq.gz', file):
        text = gzip.open(file, 'r').readlines()[:10]
        word_list = text.split()
        number = word_list.count('J') + 1
        if number != 0:
            print file
But I got some errors:
Traceback (most recent call last):
File "fastqfilter.py", line 9, in <module>
text = gzip.open(file,'r').readlines()[:10]
File "/share/lib/python2.6/gzip.py", line 33, in open
return GzipFile(filename, mode, compresslevel)
File "/share/lib/python2.6/gzip.py", line 79, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: 'ERR001268_1.recal.fastq.gz'
What is this traceback (all the File ... lines)?
Is there anything wrong with gzip here?
And why can't it find ERR001268_1.recal.fastq.gz? It's the first fastq file in the list, and it DOES exist there.
I hope you can give me some clues, and please point out any other errors in the script.
Thanks a lot.
Edit: thanks everyone. I followed Dan's suggestion and tried it on ONE fastq file first. My script goes like:

#!/usr/bin/python
import os, sys
import gzip
import itertools

file = gzip.open('/home/xug/nearline/ERR001274_1.recal.fastq.gz', 'r')
list(itertools.islice(file.xreadlines(), 10))
word_list = list.split()
number = word_list.count('J') + 1
if number != 0:
    print 'ERR001274_1.recal.fastq.gz'
Then errors are:
Traceback (most recent call last):
File "try2.py", line 8, in <module>
list(itertools.islice(text.xreadlines(),10))
AttributeError: GzipFiles instance has no attribute 'xreadlines'
Edit again: Thanks Dan, I solved the problem yesterday. It seems GzipFile doesn't support xreadlines, so I tried a similar way to what you suggested later, and it works. See below:

#!/usr/bin/python
import os, sys, re
import gzip
from itertools import islice

path = "/home/XXXX/nearline"
for file in os.listdir(path):
    if re.match('.*\.recal.fastq.gz', file):
        fullpath = os.path.join(path, file)
        myfile = gzip.open(fullpath, 'r')
        head = list(islice(myfile, 1000))
        word_str = ";".join(str(x) for x in head)
        number = word_str.count('J')
        if number != 0:
            print file
On this line:

text = gzip.open(file,'r').read()

file is a file name, not a full path, so:

fullpath = os.path.join(path, file)
text = gzip.open(fullpath, 'r').read()

As for F.readlines()[:10]: it reads the whole file into a list of lines and then takes the first 10.

import itertools
list(itertools.islice(F.xreadlines(), 10))

This does not read the whole file into memory; it reads only the first 10 lines into a list. But since gzip.open returns an object that has no .xreadlines() method, and since file objects are iterable over their lines anyway, just:

list(itertools.islice(F, 10))
would work as this test shows:
>>> import gzip,itertools
>>> list(itertools.islice(gzip.open("/home/dan/Desktop/rp718.ps.gz"),10))
['%!PS-Adobe-2.0\n', '%%Creator: dvips 5.528 Copyright 1986, 1994 Radical Eye Software\n', '%%Title: WLP-94-int.dvi\n', '%%CreationDate: Mon Jan 16 16:24:41 1995\n', '%%Pages: 6\n', '%%PageOrder: Ascend\n', '%%BoundingBox: 0 0 596 842\n', '%%EndComments\n', '%DVIPSCommandLine: dvips -f WLP-94-int.dvi\n', '%DVIPSParameters: dpi=300, comments removed\n']
Change your code to:

#!/usr/bin/python
import os, sys, re, gzip

path = "/home/XXX/nearline"
for file in os.listdir(path):
    if re.match('.*\.recal.fastq.gz', file):
        lines = gzip.open(os.path.join(path, file), 'r').readlines()[:10]
        number = "".join(lines).count('J')  # readlines() gives a list, so join before counting
        if number != 0:
            print file

(This also drops the stray + 1, which kept number from ever being 0, and counts on the joined text, since a list has no .split() method.)
It's trying to open ERR001268_1.recal.fastq.gz from the working directory, not from /home/XXX/nearline.