compare two lists of txt files, match and split - python

The first file contains the filenames of the video files, and the second a list of files from a specific path with their extensions.
I want to compare the two lists and add the matching file extensions to the results from list2, in order, stacked by find result (line by line).
I always get 83 matches in random order out of the 183 there should be. And is there a limit?
My code:
# list of video filenames with extension
list1 = set(line.strip() for line in open('D://RTS//marko//lab//python//xml_parse//aveco2venice_playlist//data//export//06-tv1-filelist.txt', 'r', encoding="UTF8"))
# list of files from playlist
list2 = set(line.strip() for line in open('D://RTS//marko//lab//python//xml_parse//aveco2venice_playlist//data//export//05-tv1-matpath.txt', 'r', encoding="UTF8"))

def filename(name):
    return name.split('.')[0]

list2_fn = [filename(name) for name in list2]
found_fn = [name for name in list1 if filename(name) in list2_fn]

# print results
output_diff = open('D://RTS//marko//lab//python//xml_parse//aveco2venice_playlist//data//export//05-tv1-differente.txt', 'a', encoding="UTF8")
print(found_fn, file=output_diff)
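A set in Python has no defined order, which is why the matches come out shuffled, and deduplication inside the sets may be what shrinks the 183 expected matches down to 83; there is no match limit. A minimal sketch that keeps the input order by reading into a list and using a set only for the lookup (the short filenames stand in for the full paths above):

# Sketch: a list preserves file order; a set of stems keeps the lookup fast.
# 'filelist.txt', 'matpath.txt' and 'differente.txt' stand in for the full paths.
with open('filelist.txt', encoding='UTF8') as f:
    list1 = [line.strip() for line in f]                    # video filenames with extension
with open('matpath.txt', encoding='UTF8') as f:
    list2_fn = {line.strip().split('.')[0] for line in f}   # stems only

found_fn = [name for name in list1 if name.split('.')[0] in list2_fn]

# 'w' instead of 'a' so reruns don't append duplicates
with open('differente.txt', 'w', encoding='UTF8') as out:
    out.write('\n'.join(found_fn))                          # one match per line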

Related

Making a dictionary from files in which keys are filenames and values are strings with specific character

So my problem is: I have proteomes in FASTA format, which look like this:
Name of the example file:
GCA_003547095.1_protein.faa
Contents:
>CAG77607.1
ABCDEF
>CAG72141.1
CSSDAS
And I also have files that contain just names of the proteins, i.e.:
Filename:
PF00001
Contents:
CAG77607.1
CAG72141.1
My task is to iterate through proteomes using list of proteins to find out how many proteins are in each proteome. PE told me that it should be a dictionary made from filenames of proteomes as keys and sequence names after ">" as values.
My approach was as follows:
import pandas as pd

file_names = open("proteomes_list").readlines()
d = {x: pd.read_csv("/proteomes/" + "GCA_003547095.1_protein.faa").columns.tolist() for x in file_names}
print(d)
As you can see, I've made the proteome filenames into a list (using a simple bash "ls"; these are ONLY the names of the proteomes) and then created a dictionary with the sequence names as values - unfortunately each proteome (including the tested one) has only one value.
I would be grateful if you could shed some light on my case.
My goal was to make a dictionary where the key would be e.g. GCA_003547095.1_protein.faa and the value e.g. CAG77607.1, CAG72141.1.
Is this the output you expect? This function iterates over your file and grabs the FASTA headers, i.e. the names of the proteins expected in the file. (Your pandas attempt gave a single value per key because read_csv treats the first line of the file as the header row, and the hard-coded path reads the same proteome for every key.) Here is a quick function that creates a list of the FASTA headers.
You can then build the dictionary you mentioned by iterating over the file names and updating the parent dictionary:
import os

def extract_proteomes(folder: str, filename: str) -> list[str]:
    with open(os.path.join(folder, filename), mode='r') as file:
        content: list[str] = file.read().split('\n')
    # FASTA headers start with '>'; strip the marker and keep the name
    protein_names = [i[1:] for i in content if i.startswith('>')]
    # files like PF00001 have no '>' lines, so keep every non-empty line
    if not protein_names:
        protein_names = [i for i in content if i]
    return protein_names

folder = "/Users/user/Downloads/"
files = ["GCA_003547095.1_protein.faa", "PF00001"]

d = {}
for i in files:
    d.update({i: extract_proteomes(folder=folder, filename=i)})
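The same dictionary can also be built with a comprehension, which reads a little more idiomatically than repeated update() calls:

d = {name: extract_proteomes(folder=folder, filename=name) for name in files}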

Creating a list with words counted from multiple .docx files

I'm trying to do a project where I automate my invoices for translation jobs. Basically the script reads multiple .docx files in a folder, counts the words in every file, then writes those filenames and the corresponding word counts into an Excel file.
I've created a word counter script, but can't figure out how to add the counted words to a list, so that I can later extract values from it for my Excel file and create an invoice.
Here is my code:
import docx
import os
import re
from docx import Document

# Folder to work with
folder = r'D:/Tulk_1'
files = os.listdir(folder)

# Lists with the names of files and word counts for each file
list_files = []
list_words = []

for file in files:
    # Getting the absolute location
    location = folder + '/' + file
    # Adding filenames to the list
    list_files.append(file)
    # Word counter
    document = docx.Document(location)
    newparatextlist = []
    for paratext in document.paragraphs:
        newparatextlist.append(paratext.text)
    # Printing file names
    print(file)
    # Printing word counts for each file
    print(len(re.findall(r'\w+', '\n'.join(newparatextlist))))
Output:
cold calls.docx
2950
Kristības.docx
1068
Tulkojums starpniecības līgums.docx
946
Tulkojums_PL_ULIHA_39_41_4 (1).docx
788
Unfortunately I copied the counter part from the web and the last line is too complicated for me:
print(len(re.findall(r'\w+', '\n'.join(newparatextlist))))
So I don't know how to extract the results out of it into a list.
When I try to store the last line into a variable like this:
x = len(re.findall(r'\w+', '\n'.join(newparatextlist)))
The output contains the word count for only one of the files:
cold calls.docx
Kristības.docx
Tulkojums starpniecības līgums.docx
Tulkojums_PL_ULIHA_39_41_4 (1).docx
788
Maybe you could help me to break the last line into smaller steps? Or perhaps there are easier solutions to my task?
EDIT:
The desired output for the:
print(list_words)
should be:
[2950, 1068, 946, 788]
Similar as it already is for file names:
print(list_files)
output:
['cold calls.docx', 'Kristības.docx', 'Tulkojums starpniecības līgums.docx', 'Tulkojums_PL_ULIHA_39_41_4 (1).docx']
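The long line decomposes into named steps, and appending the result inside the loop (rather than assigning it once after the loop, which only keeps the last file's count) fills list_words as desired. A sketch of the loop body, reusing the names from the question:

# inside the `for file in files:` loop, once newparatextlist is filled
full_text = '\n'.join(newparatextlist)   # all paragraphs joined into one string
words = re.findall(r'\w+', full_text)    # every run of word characters
list_words.append(len(words))            # collect the count for this file

# after the loop, list_words == [2950, 1068, 946, 788]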

Build a dictionary from .txt files analysis

I have a basic program that can count the number of words in a given text file. I am trying to turn this into a program that can take in several different .txt files, with an arbitrary number of keywords analyzed within those files, and output a list of dictionaries of the results (or a similar object).
The output I am looking for is a list of dictionaries, one per .txt file in the filenames list, where each dictionary's keys and values are the arbitrary words passed to the first function and their word counts, respectively.
I have two functions that I have created and cannot seem to get any output whatsoever, which means that something is wrong.
Code:
def word_count(filename, *selected_words):
    """Count the approximate number of words in a file."""
    with open(filename, "r", encoding='utf-8') as f_obj:
        contents = f_obj.read()
    filename = {}
    filename['document'] = filename
    filename['total_words'] = len(contents.split())
    for word in selected_words:
        count = contents.lower().count(word)
        filename[word] = count
    return filename

def analysis_output():
    for file in files:
        word_count(file, 'the', 'yes')  # WORD_COUNT FUNCTION

files = ['alice.txt', 'siddhartha.txt',
         'moby_dick.txt', 'little_women.txt']

analysis_output()
When I run this, I am not getting any output - no errors either, nothing telling me the code has run (likely improperly). Any advice on how to turn this into a list of dictionaries is helpful!
You simply forgot to define a variable to receive the output from word_count (and note that reusing filename as the name of the dictionary overwrites the parameter holding the document's name). In fact, you can do it this way:
def word_count(filename, *selected_words):
    """Count the approximate number of words in a file."""
    with open(filename, "r", encoding='utf-8') as f_obj:
        contents = f_obj.read()
    results_dict = {}
    results_dict['document'] = filename
    results_dict['total_words'] = len(contents.split())
    for word in selected_words:
        count = contents.lower().count(word)
        results_dict[word] = count
    return results_dict

def analysis_output():
    output = []
    for file in files:
        output.append(word_count(file, 'the', 'yes'))  # WORD_COUNT FUNCTION
    return output

files = ['alice.txt', 'siddhartha.txt',
         'moby_dick.txt', 'little_women.txt']

final_result = analysis_output()
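final_result is then a list with one dictionary per file, shaped like the following (the counts depend on the texts, so they are left as placeholders here):

[{'document': 'alice.txt', 'total_words': ..., 'the': ..., 'yes': ...},
 {'document': 'siddhartha.txt', 'total_words': ..., 'the': ..., 'yes': ...},
 ...]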
My solution below solves your problem in a slightly different way: it uses lists and strings only, no dictionaries. I've added extra comments where needed - I hope you will find it useful.
def get_words_string(file_name):
    """Get a lower-case string of all words from a file."""
    try:
        with open(file_name, "r", encoding='utf-8') as file_object:
            contents = file_object.read().lower()
            return contents
    except FileNotFoundError:
        print('File not found')

def count_words(words_string, *words_to_count):
    """Count the occurrences of each of *words_to_count in words_string."""
    for word in words_to_count:
        print(f'{word} occurs {words_string.count(word)} times')

files = [
    'text files/alice.txt',
    'text files/moby_dick.txt',
    'text files/pride_and_pre.txt',
]

for file in files:
    print(file)
    # try block in case a file is missing,
    # so the program can continue
    try:
        count_words(get_words_string(file), 'yes', 'the', 'cat', 'honour')
    except:
        pass

Why is the appended list inside the target list cleared when I clear the original list from which I appended?

I need to append the content of txt files as sublists into one target list.
final_product would be:
[[content of file 1], [content of file 2], [content of file 3]]
I tried to solve it with this code:
import os

list_of_files = []
product = []
final_product = []

working_dir = r"U:/Work/"
os.chdir(working_dir)

def create_sublists(fname):
    with open(fname, 'r', encoding='ansi') as f:
        for line in f:
            product.append(line.strip())  # append each line from file to product
        final_product.append(product)  # append the product list to final_product
        product.clear()  # clear the product list
    f.close()

# create list of all files in the working directory
for file in os.listdir(working_dir):
    if file.endswith('.txt'):
        list_of_files.append(file)

for file in list_of_files:
    create_sublists(file)

print(final_product)
I thought it would work this way: the first file writes its content into the list product, that list is appended into final_product, product is cleared, then the second file is appended, and so on.
But it creates this:
[ [], [], [], [], [], [] ]
When I don't use product.clear() it fills final_product in this (wrong) way:
[[content_file1], [content_file1, content_file2],
 [content_file1, content_file2, content_file3], ....]
Then when I use product.clear() it deletes everything appended in final_product. Why?
As deceze points out in the comments, you're always using the same list: final_product ends up holding several references to the one product list, so clearing product empties every "sublist" at once.
I'm not sure why you are doing it this way; just create a new list on each iteration.
def create_sublists(fname):
    product = []
    with ...
Also note, you don't have to close f explicitly; it's automatically closed when the with block exits. That's the whole point of with.
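For completeness, a sketch of the corrected function with the list made local (otherwise the same logic as in the question):

def create_sublists(fname):
    product = []  # a fresh list per file, so final_product keeps distinct sublists
    with open(fname, 'r', encoding='ansi') as f:
        for line in f:
            product.append(line.strip())
    final_product.append(product)  # no clear() needed; the next call makes a new list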

Python create dict from CSV and use the file name as key

I have a simple CSV text file called "allMaps.txt". It contains the following:
cat, dog, fish
How can I take the file name "allMaps" and use it as a key, along with the contents being values?
I wish to achieve this format:
{"allMaps": "cat", "dog", "fish"}
I have a range of txt files all containing values separated by commas so a more dynamic method that does it for all would be beneficial!
The other txt files are:
allMaps.txt
fiveMaps.txt
tenMaps.txt
sevenMaps.txt
They all contain comma separated values. Is there a way to look into the folder and convert each one on the text files into a key-value dict?
Assuming you have the file names in a list.
files = ["allMaps.txt", "fiveMaps.txt", "tenMaps.txt", "sevenMaps.txt"]
You can do the following:
my_dict = {}
for file in files:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
    my_dict[file.replace(".txt", "")] = items
If the files are all in the same folder, you could do the following instead of maintaining a list of files:
import os
files = os.listdir("<folder>")
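Note that os.listdir also returns subdirectories and files with other extensions, so in practice you would likely filter the listing, e.g.:

files = [f for f in os.listdir("<folder>") if f.endswith(".txt")]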
Given the file names, you can create a dictionary whose keys are the filenames and whose values are lists of the file data:
files = ['allMaps.txt', 'fiveMaps.txt', 'tenMaps.txt', 'sevenMaps.txt']
final_results = {i:[b.strip('\n').split(', ') for b in open(i)] for i in files}
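Note that this keeps the .txt suffix in the keys and produces one inner list per line of the file. If each file is a single comma-separated line, as in allMaps.txt, a flatter variant of the same comprehension may be closer to the desired output:

final_results = {i.replace('.txt', ''): open(i).read().strip().split(', ') for i in files}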
