I have some text files (just using two here), and I want to read them into Python and manipulate them. I'm trying to store lists of strings (one string per word, one list per file).
My code currently looks like this:
(files are named m1.txt and m2.txt)
files = ['m1.txt', 'm2.txt']
dict = {'m1': [], 'm2': []}
for k in files:
    with open(k, 'r') as f:
        for line in f:
            for word in line.split():
                for i in range(1, 3):
                    dict['m' + str(i)].append(word)
This code ends up combining the words in both text files instead of giving me the words for each file separately. Ultimately I want to read lots of files so any help on how to separate them out would be much appreciated!
This example dynamically derives the key from the file name (without the extension) and uses it to pick which list in the dict we append to:
files = ['m1.txt', 'm2.txt']
file_store = {'m1': [], 'm2': []}
for file in files:
    prefix = file.split('.')[0]
    with open(file, 'r') as f:
        for line in f:
            for word in line.split():
                file_store[prefix].append(word)
Your inner for i in range(1, 3) loop appended every word to both lists regardless of which file was being processed, which is why the contents of the two files ended up combined.
Try something like this:
dict = {'m1': [], 'm2': []}
for i, k in enumerate(files):
    with open(k, 'r') as f:
        for line in f:
            for word in line.split():
                dict['m' + str(i + 1)].append(word)
I've left your code "as is", but the point above about not shadowing built-in names is important: dict is a Python built-in and is best not used as a variable name.
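If you ultimately read lots of files, one way to avoid hard-coding the dictionary keys (a sketch, assuming all files follow the name.txt pattern shown here) is to build the dict from the file names themselves:

import os

files = ['m1.txt', 'm2.txt']
words = {}
for path in files:
    key = os.path.splitext(path)[0]  # 'm1.txt' -> 'm1'
    with open(path) as f:
        words[key] = f.read().split()  # all whitespace-separated words, in file order

words['m1'] then holds the word list for m1.txt, and adding a new file only means adding its name to files.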
So I have two files: a YAML file that maps Tibetan words to their meanings, and a CSV file that contains only a word and its POS tag. As below:
yaml file :
ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།
ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།
ད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།
ད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།
csv file :
ད་ཆུ PART
ད་གདོད DET
Desired output:
ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་གདོད DET ད་གཟོད་དང་དོན་འདྲ།
Any idea on how to match the words from the csv file against the yaml file and add the extracted meanings to the csv?
From a functional point of view, you have:
a dictionary, meaning here a key: value thing
a list of words to search in that dictionary, and that will produce a record
If everything can fit in memory, you can first read the yaml file to produce a Python dictionary, then read the words file one line at a time, using the dictionary to generate each expected output line. If the yaml file is too large, you could use the dbm (or shelve) module as an on-disk dictionary.
As you have not shown any code, I cannot either... I can just say that you can simply process the second file as plain text, one line at a time. For the first one, you can either look for a yaml module on PyPI or, if the syntax is always as simple as the lines you have shown, process it as text one line at a time as well and use split to extract the key and the value.
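For the too-large-for-memory case, here is a minimal sketch using shelve as the on-disk dictionary (the file names, the UTF-8 encoding, and the simple word: meaning line format are assumptions based on the example above):

import shelve

# Build an on-disk dictionary from the simple "word: meaning" lines
with shelve.open('meanings') as db, open('dict.yaml', encoding='utf-8') as f:
    for line in f:
        if ': ' in line:
            word, meaning = line.strip().split(': ', 1)
            db[word] = meaning

# Stream the words file and look each word up without holding the dict in memory
with shelve.open('meanings') as db, open('input.csv', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        if parts:
            print(' '.join(parts), db.get(parts[0], ''))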
Assuming your files are called dict.yaml and input.csv.
You can start by turning the yaml file into a dictionary with
import yaml  # PyYAML, installable from PyPI

with open('dict.yaml', 'r') as file:
    trans_dict = yaml.safe_load(file)
Which should give you
>>> trans_dict
{'ད་གདོད': 'ད་གཟོད་དང་དོན་འདྲ།',
'ད་ཆུ': 'དངུལ་ཆུ་ཡི་མིང་གཞན།',
'ད་ཕྲུག': 'དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།',
'ད་བེར': 'སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།',
'ད་མེ་དུམ་མེ': 'དམ་དུམ་ལ་ལྟོས།'}
Then, you can iterate over the lines in the CSV and use the dictionary to get the definition:
outputs = []
with open('input.csv', 'r') as file:
    for line in file:
        term = line.strip()
        definition = trans_dict.get(term)
        outputs.append(
            term if definition is None
            else f"{term} {definition}"
        )
From here, your outputs variable should contain ['ད་ཆུ དངུལ་ཆུ་ཡི་མིང་གཞན།', 'ད་གདོད ད་གཟོད་དང་དོན་འདྲ།']. If you optionally wanted to write this out to a file, you could do
with open('output.txt', 'w') as file:
    file.write('\n'.join(outputs))
If you had more tokens on each line of the CSV (unclear from your post), you could iterate over those tokens within a line, but you'd be able to apply basically the same approach.
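Concretely, for the two-column word POS lines shown in the question, a small variation (assuming whitespace-separated columns) looks up only the first token and keeps the rest of the line:

outputs = []
with open('input.csv', 'r') as file:
    for line in file:
        parts = line.split()
        if not parts:
            continue
        line_out = ' '.join(parts)
        definition = trans_dict.get(parts[0])
        outputs.append(line_out if definition is None
                       else f"{line_out} {definition}")

This produces lines in exactly the desired ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན། format.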
The easiest solution that comes to mind is iterating over all lines in the YAML file and checking whether the word appears in the CSV file:
YAML_LINES = "ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།\nད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན\nད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན\nད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།\nད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།".split("\n")
CSV_LINES = "ད་ཆུ\nད་གདོད".split("\n")
for line in YAML_LINES:
    word, meaning = line.split(": ")
    if word in CSV_LINES:
        output = word + " " + meaning
        print(output)
The YAML_LINES and CSV_LINES lists are only there to provide a quick and dirty example.
I'm using Notepad++ to do a find and replace. Currently I have a huge number of text files, and I need to replace a different string in each file. I want to do it in batch. For example:
I have a folder that holds the huge number of text files. I have another text file that lists the find and replace string pairs, in order:
Text1 Text1-corrected
Text2 Text2-corrected
I have a small script that does this replacement, but only for the files open in Notepad++. To achieve this I'm using a Python script in Notepad++. The code is as follows:
with open('C:/replace.txt') as f:
    for l in f:
        s = l.split()
        editor.replace(s[0], s[1])
In simple words, the find and replace function should fetch the input from a file.
Thanks in advance.
with open('replace.txt') as f:
    replacements = [tuple(line.split()) for line in f]

for filename in filenames:
    with open(filename) as f:
        contents = f.read()
    for old, new in replacements:
        contents = contents.replace(old, new)
    with open(filename, 'w') as f:
        f.write(contents)
Read the replacements into a list of tuples, then go through each file: read its contents into memory, apply the replacements, and write the result back. Note that each file is opened twice, first for reading and then for writing, because opening a file in 'w' mode immediately truncates it, so you cannot read and write through the same handle here.
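The filenames variable above is assumed to be a list of the paths you want to process; for example, every .txt file in your folder could be collected with glob (the folder path is hypothetical):

import glob

filenames = glob.glob('C:/myfolder/*.txt')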
I have multiple .txt files for my Python project. All are lists of strings separated by line breaks. Previously, I imported each txt file and converted it into a list individually using
with open('general_responses.txt') as f:
    general_responses = f.read().splitlines()
But I want to automate this using a for loop in order to speed up the process, so that I can more easily add response lists to my project. So, this is the code I currently have, not that it works...
final_files = ['general_responses.txt', 'cat_responses.txt', 'dog_responses.txt']

for word in final_files:
    with open(word) as f:
        word = word[:-4]
        word = f.read().splitlines()
so when I run
print (general_responses)
my script should print out a list of strings from the txt file general_responses.txt
However, this does not work. Any suggestions?
EDIT:
for example, general_responses.txt would contain something along the lines of:
hi i'm fred
whats up
how are you doing today?
Assigning to the loop variable does not create a variable named after the file; it just rebinds word, and the result is discarded on the next iteration. Instead, collect the result for each file in a separate list, stored in a dictionary keyed by the file name (minus its extension):

final_files = ['general_responses.txt', 'cat_responses.txt', 'dog_responses.txt']

responses = {}
for filename in final_files:
    with open(filename) as f:
        responses[filename[:-4]] = f.read().splitlines()

Then responses['general_responses'] is the list of lines from general_responses.txt.
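A compact equivalent (a sketch using pathlib, with the same assumed file names) is a dict comprehension:

from pathlib import Path

responses = {Path(name).stem: Path(name).read_text().splitlines()
             for name in final_files}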
I'm looking for some help with my code, which is right below:
for file in file_name:
    if os.path.isfile(file):
        for line_number, line in enumerate(fileinput.input(file, inplace=1)):
            print(file)
            os.system("pause")
            if line_number == 1:
                line = line.replace('Object', '#Object')
            sys.stdout.write(line)
I wanted to modify some previously extracted files in order to plot them with matplotlib. To do so, I remove some lines and comment out some others.
My problem is the following:
Using for line_number, line in enumerate(fileinput.input(file, inplace=1)): gives me only 4 out of the 5 previously extracted files (even though the file_name list contains 5 references!)
Using for line_number, line in enumerate(file): gives me all 5 previously extracted files, BUT I don't know how to modify the file in place that way without creating another one...
Do you have any idea about this issue? Is this normal behaviour?
There are a number of things that might help you.
Firstly, file_name appears to be a list of file names. It might be better named file_names, and then you could use file_name for each individual one. You have verified that it does hold 5 entries.
The enumerate() function provides both an index and the item for each element when iterating over a list, which saves you having to maintain a separate counter variable, e.g.
for index, item in enumerate(["item1", "item2", "item3"]):
    print(index, item)
would print:
0 item1
1 item2
2 item3
This is not really required here, though, as you have chosen to use the fileinput library, which is designed to take a list of files and iterate over all of the lines in all of the files in one single loop. As such you need to tweak your approach a bit. Assuming your list of files is called file_names, you can write something as follows:
# Keep only the entries in the list that are actual files
file_names = [file_name for file_name in file_names if os.path.isfile(file_name)]

# Iterate over all lines in all files
for line in fileinput.input(file_names, inplace=1):
    if fileinput.filelineno() == 1:
        line = line.replace('Object', '#Object')
    sys.stdout.write(line)
The main point here is that it is better to filter out any non-files before passing the list to fileinput. I will leave it up to you to fix the output.
fileinput provides a number of functions to help you figure out which file or line number is currently being processed.
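For example, fileinput.filename(), fileinput.filelineno() and fileinput.lineno() report the current file name, the line number within that file, and the cumulative line number across all files:

import fileinput

for line in fileinput.input(file_names):
    print(fileinput.filename(), fileinput.filelineno(), fileinput.lineno())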
Assuming you're still having trouble, my typical approach is to open a file read-only, read its contents into a variable, close the file, build an edited copy in another variable, reopen the file for writing (wiping out the original), and finally write the edited contents.
I like this approach since I can simply change the file_name that gets written out if I want to test my edits without wiping out the original file.
Also, I recommend naming containers using plural nouns, as @Martin Evans suggests.
import os

file_names = ['file_1.txt', 'file_2.txt', 'file_3.txt', 'file_4.txt', 'file_5.txt']
file_names = [x for x in file_names if os.path.isfile(x)]  # see @Martin's answer again

for file_name in file_names:
    # Open read-only and put the contents into a list of line strings
    with open(file_name, 'r') as f_in:
        lines = f_in.read().splitlines()

    # Put the lines you want to write out in out_lines
    out_lines = []
    for index_no, line in enumerate(lines):
        if index_no == 1:
            out_lines.append(line.replace('Object', '#Object'))
        # elif ...:  add further conditions here as needed
        else:
            out_lines.append(line)

    # Uncomment to write to a different file name when testing your edits
    # with open(file_name + '.out', 'w') as f_out:
    #     f_out.write('\n'.join(out_lines))

    # Write out the file, clobbering the original
    with open(file_name, 'w') as f_out:
        f_out.write('\n'.join(out_lines))
The downside of this approach is that each file needs to be small enough to fit into memory twice (lines + out_lines).
Best of luck!
I have a 'key' file that looks like this (MyKeyFile):
afdasdfa
ghjdfghd
wrtwertwt
asdf
I call these keys, and they are identical to the first word of the lines that I want to extract from a 'source' file. The source file (MySourceFile) would look something like this (1st column = the key, following columns = tab-delimited data):
afdasdfa (several tab delimited columns)
.
.
ghjdfghd ( several tab delimited columns)
.
wrtwertwt
.
.
asdf
The '.' lines indicate lines that are currently of no interest.
I am an absolute novice in Python and this is how far I've come:
with open('MyKeyFile', 'r') as infile, \
     open('MyOutFile', 'w') as outfile:
    for line in infile:
        for runner in source:
            # pick up the first word of the line in source
            # if match, print the entire line to MyOutFile
            # here I need help
outfile.close()
I realize there may be better ways to do this. All feedback is appreciated - along my way of solving it, or along more sophisticated ones.
Thanks
jd
I think this would be a cleaner way of doing it, assuming that your "key" file is called "key_file.txt" and your main file is called "main_file.txt":
keys = []
my_file = open("key_file.txt", "r")  # r is for reading files, w is for writing to them
for line in my_file.readlines():
    keys.append(line.strip())  # strip() removes the trailing newline, so keys match cleanly
my_file.close()

# now you have a list of strings called keys.
# take each line from the main text file and check to see if it contains any portion of a given key.
new_file = open("main_file.txt", "r")
for line in new_file.readlines():
    for key in keys:
        if line.find(key) > -1:
            print("I FOUND A LINE THAT CONTAINS THE TEXT OF SOME KEY", line)
You can modify the print call, or get rid of it and do whatever you want with the matched line that contains the text of some key. Let me know if this works.
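Note that find() matches a key anywhere in the line, not only in the first word. To match only the first column, as described in the question, you could compare the first token instead (a small variation of the loop above):

for line in new_file.readlines():
    parts = line.split()
    if parts and parts[0] in keys:
        print("I FOUND A LINE WHOSE FIRST WORD IS A KEY", line)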
As I understood it (correct me in the comments if I am wrong), you have 3 files:
MySourceFile
MyKeyFile
MyOutFile
And you want to:
Read keys from MyKeyFile
Read source from MySourceFile
Iterate over lines in the source
If a line's first word is in keys: append that line to MyOutFile
Close MyOutFile
So here is the Code:
with open('MySourceFile', 'r') as sourcefile:
    source = sourcefile.read().splitlines()

with open('MyKeyFile', 'r') as keyfile:
    keys = keyfile.read().split()

with open('MyOutFile', 'w') as outfile:
    for line in source:
        if line.split():
            if line.split()[0] in keys:
                outfile.write(line + "\n")
# no explicit close needed: the with statement closes MyOutFile automatically
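If MyKeyFile is large, turning keys into a set makes each membership test effectively constant-time instead of a scan of the whole list; it is a one-line change to the code above:

keys = set(keyfile.read().split())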