I have a list of files where each file has two columns.
The 1st column contains words, and the 2nd column contains numbers.
I want to extract all the unique words across the files and sum the numbers associated with each word. This I was able to do...
The second task is to count the number of files in which each word appears. I am having trouble with this part... I am using a dictionary for it.
Here is my code:
import os

currentdir = " "  # CHANGE INPUT PATH
resultdir = " "   # CHANGE OUTPUT ACCORDINGLY

if not os.path.exists(resultdir):
    os.makedirs(resultdir)

systemcallcount = {}
for root, dirs, files in os.walk(currentdir):
    for name in files:
        outfile2 = open(os.path.join(root, name), 'r')
        for line in outfile2:
            words = line.split()
            if words[0] not in systemcallcount:
                systemcallcount[words[0]] = int(words[1])
            else:
                systemcallcount[words[0]] += int(words[1])
        outfile2.close()

for key, value in systemcallcount.items():
    print(key)
    print(value)
For example, I have two files:
file1    file2
a 2      a 3
b 3      b 1
c 1
so the output would be:
a 5 2
b 4 2
c 1 1
To explain the second column of the output: a is 2 because it occurs in both files, whereas c is 1 as it appears only in file1.
I hope this helps.
This code takes a string and checks a folder for files that contain it:
# https://www.opentechguides.com/how-to/article/python/59/files-containing-text.html
import os

search_string = "python"
search_path = r"C:\Users\You\Desktop\Project\Files"
extension = "txt"  # files extension

# Number of files found is 0
files_no = 0
# loop through files in the path specified
for fname in os.listdir(search_path):
    if fname.endswith(extension):
        # Open file for reading
        fo = open(os.path.join(search_path, fname))
        # Read the first line from the file
        line = fo.readline()
        # Initialize counter for line number
        line_no = 1
        # Loop until EOF
        while line != '':
            # Search for string in line
            index = line.find(search_string)
            if index != -1:
                # print the occurrence
                print(fname, "[", line_no, ",", index, "] ", line, sep="")
            # Read next line
            line = fo.readline()
            # Increment line counter
            line_no += 1
        # Increment files counter
        files_no += 1
        # Close the file
        fo.close()
One way is to use collections.defaultdict. You can create a set of words and then increment your dictionary counter for each file, for each word.
import os
from collections import defaultdict

d = defaultdict(int)
for root, dirs, files in os.walk(currentdir):
    for name in files:
        with open(os.path.join(root, name), 'r') as outfile2:
            words = {line.split()[0] for line in outfile2}
            for word in words:
                d[word] += 1
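Combined with the summing dictionary from the question, a minimal sketch that prints both numbers at once (assuming currentdir as above and well-formed two-column lines):

import os
from collections import defaultdict

totals = defaultdict(int)       # word -> sum of the numbers
file_counts = defaultdict(int)  # word -> number of files containing it

for root, dirs, files in os.walk(currentdir):
    for name in files:
        seen = set()
        with open(os.path.join(root, name), 'r') as fh:
            for line in fh:
                word, number = line.split()
                totals[word] += int(number)
                seen.add(word)
        for word in seen:
            file_counts[word] += 1

for word in sorted(totals):
    print(word, totals[word], file_counts[word])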
Another way is to use Pandas for both of your tasks:
Read the files into a table.
Note the source file in a separate column.
Apply functions to get unique words, sum the numbers, and count the source files for each word.
Here is the code:
import pandas as pd
import sys, os

files = os.listdir(currentdir)
dfs = []
for f in files:
    df = pd.read_csv(currentdir + "/" + f, sep='\t', header=None)
    df['source_file'] = f
    dfs.append(df)

def concat(x):
    return pd.Series(dict(A=x[0].unique()[0],
                          B=x[1].sum(),
                          C=len(x['source_file'])))

df = pd.concat(dfs, ignore_index=True).groupby(0).apply(concat)

# Print result to standard output
df.to_csv(sys.stdout, sep='\t', header=None, index=None)
You may refer here: Pandas groupby: How to get a union of strings
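As a side note, newer pandas (0.25+) can express the same aggregation more directly with named aggregation; a sketch, where nunique() counts the distinct source files per word:

summary = (pd.concat(dfs, ignore_index=True)
             .groupby(0)
             .agg(B=(1, 'sum'), C=('source_file', 'nunique'))
             .reset_index())
summary.to_csv(sys.stdout, sep='\t', header=False, index=False)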
It appears that you want to parse the file into a dictionary of lists, so that for the input you provided:
file1    file2
a 2      a 3
b 3      b 1
c 1
... you get the following data structure after parsing:
{'a': [2, 3], 'b': [3, 1], 'c': [1]}
From that, you can easily get everything you need.
Parsing this way should be rather simple using a defaultdict:
from collections import defaultdict

parsed_data = defaultdict(list)
for filename in list_of_filenames:
    with open(filename) as f:
        for line in f:
            name, number = line.split()
            parsed_data[name].append(int(number))
After that, printing the data you are interested in should be trivial:
for name, values in parsed_data.items():
    print('{} {} {}'.format(name, sum(values), len(values)))
The solution assumes that the same name will not appear twice in the same file. It is not specified what should happen in that case.
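If duplicates within a file should be merged (summed once, counted as a single occurrence), a sketch under that assumption, reusing list_of_filenames from above:

from collections import defaultdict

parsed_data = defaultdict(list)
for filename in list_of_filenames:
    per_file = defaultdict(int)
    with open(filename) as f:
        for line in f:
            name, number = line.split()
            per_file[name] += int(number)  # merge duplicates within this file
    for name, total in per_file.items():
        parsed_data[name].append(total)    # one entry per file, as before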
TL;DR: The solution for your problems is defaultdict.
Related
I have a set of files that are read in line by line. I would like the last line of every file to have the name of the file next to it. This is the code that accomplishes reading in the files, but I don't know how to get the filenames to show up:
import glob

a = []
def convert_txt_to_dataframe(path):
    for files in glob.glob(path + "./*manual.txt"):
        for x in open(files):
            a.append(x)
So this accomplishes importing all the text files line by line; now I want the last line of every file to have an accompanying filename next to it.
I want it to look something like:
Hello Goodbye
0 Thank you for being a loyal customer. MyDocuments/TextFile1
1 Thank you for being a horrible customer. MyDocuments/TextFile1
2 Thank you for being a nice customer. MyDocuments/TextFile3
So I'm assuming you are taking a list of files, and those indices you mentioned [0, 1, 2] refer to the last lines of each file in your list. With that in mind, I would try a simpler approach instead of a DataFrame. Even if you need the DataFrame for other reasons, perhaps you can convert to text as a last step and try this:
Example File ("ExampleText2"):
I love coffee
I love creamer
I love coffee and creamer
I have a rash..
Code:
last = []
with open('exampleText2.txt', 'r') as f:
    last = f.readlines()[-1] + " other FileName"
Output:
last
'I have a rash.. other FileName'
readlines() returns a list of all the lines in your file, so you can index with -1 to pull the last line and then append to it.
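Extending the same idea across all the files, a sketch assuming the glob pattern from the question (path is a placeholder for your directory):

import glob

last_lines = []
for fname in glob.glob(path + "/*manual.txt"):  # path is assumed, as in the question
    with open(fname) as f:
        lines = f.readlines()
    if lines:  # guard against empty files
        last_lines.append(lines[-1].rstrip('\n') + " " + fname)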
I'm assuming that the number of lines is greater than or equal to the number of files.
import glob

words = ['Thank you for being a loyal customer.',
         'Thank you for being a horrible customer.',
         'Thank you for being a nice customer.']

def convert(path):
    a = []
    z = 0
    for files in glob.glob(path + "/*.txt"):
        temp = [words[z], files]
        a.append(temp)
        z += 1
    print(a)

convert(your_path)
The question is ill-defined, but assuming the OP wants the result shown in the DataFrame example (i.e. not just the last line is somehow decorated with the filename, but all lines are), here is a way to achieve that. For this example, we just have two files: file1.txt contains two lines: 'a' and 'b', file2.txt contains one line: 'c'.
We write a file reader that returns a list of lists: each sublist contains the filename and a line.
import glob

def get_file(filename):
    with open(filename) as f:
        return [[filename, line.rstrip('\n')] for line in f]
Try it:
m = map(get_file, glob.glob('file*.txt'))
list(m)
Out[]:
[[['file2.txt', 'c']], [['file1.txt', 'a'], ['file1.txt', 'b']]]
Let us flatten these lists to get one two-dimensional array. Also, it is probably nicer to get a result where the files are sorted alphabetically.
def flatten(m):
    return [k for sublist in m for k in sublist]

m = map(get_file, sorted(glob.glob('file*.txt')))
flatten(m)
Out[]:
[['file1.txt', 'a'], ['file1.txt', 'b'], ['file2.txt', 'c']]
Now, it sometimes helps to have the line number (say if we are going to put that data in a DataFrame and do further sorting and analytics). Our reader becomes:
import pandas as pd

def get_file(filename):
    with open(filename) as f:
        return [[filename, lineno, line.rstrip('\n')]
                for lineno, line in enumerate(f, start=1)]

m = map(get_file, sorted(glob.glob('file*.txt')))
out = pd.DataFrame(flatten(m), columns=['filename', 'lineno', 'line'])
out
Out[]:
    filename  lineno line
0  file1.txt       1    a
1  file1.txt       2    b
2  file2.txt       1    c
Notice that the map above lends itself nicely to multi-threaded reading if we do have a large number of files:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    m = pool.map(get_file, sorted(glob.glob('file*.txt')))
    out = pd.DataFrame(flatten(m), columns=['filename', 'lineno', 'line'])
out
Out[]:
    filename  lineno line
0  file1.txt       1    a
1  file1.txt       2    b
2  file2.txt       1    c
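Since the original question asked for only the last line of each file, a short follow-up on out (a sketch; tail(1) keeps the last row of each filename group):

last = out.groupby('filename').tail(1)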
I have 2 text files and 2 lists (FIRST_LIST, SECOND_LIST). For each file, I want to count how many of the words from FIRST_LIST and SECOND_LIST it contains, individually.
FIRST_LIST = "accessorizes", "accessorizing", "accessorized", "accessorize"
SECOND_LIST = "accessorize", "accessorized", "accessorizes", "accessorizing"
(this data comes from ".txt" files, not from a string)
text_File1 (contains):
This is a very good question, and you have received good answers
which describe interesting topics accessorized accessorize.
text_File2 (contains):
is more applied,using accessorize accessorized,accessorizes,accessorizing
output format:
File1 first list count=2
File1 second list count=0
File2 first list count=0
File2 second list count=4
With the code below I have tried to achieve this functionality, but I am not able to get the expected output. Any help is appreciated.
Reading all the .txt files:
import os
import glob

files = []
for filename in glob.glob("*.txt"):
    files.append(filename)
Creating a function to remove punctuation:
# remove punctuation
import re
def remove_punctuation(line):
    return re.sub(r'[^\w\s]', '', line)
Reading multiple files from "filename" in a loop, but it merges them. I need to keep the text1 file counts and the text2 file counts separate:
two_files = []
for filename in files:
    for line in open(filename):
        #two_files.append(remove_punctuation(line))
        print(remove_punctuation(line), end='')
        two_files.append(remove_punctuation(line))

FIRST_LIST = "accessorizes", "accessorizing", "accessorized", "accessorize"
SECOND_LIST = "accessorize", "accessorized", "accessorizes", "accessorizing"

c = []
for match in FIRST_LIST:
    if any(match in value for value in two_files):
        #c=match+1
        print(match)
        c.append(match)
print(c)
len(c)

d = []
for match in SECOND_LIST:
    if any(match in value for value in two_files):
        #c=match+1
        print(match)
        d.append(match)
print(d)
len(d)
I'm not sure this is exactly what you wanted, but I think the problem is that you are appending the lines from both files to the same list. You should create a list for each file. Try:
import glob
import re

files = []
for filename in glob.glob("*.txt"):
    files.append(filename)

# remove punctuation
def remove_punctuation(line):
    return re.sub(r'[^\w\s]', '', line)

two_files = []
for filename in files:
    temp = []
    for line in open(filename):
        temp.append(remove_punctuation(line))
    two_files.append(temp)

FIRST_LIST = "accessorizes", "accessorizing", "accessorized", "accessorize"
SECOND_LIST = "accessorize", "accessorized", "accessorizes", "accessorizing"

c = []
d = []
for file in two_files:
    temp = []
    for match in FIRST_LIST:
        for value in file:
            if match in value:
                temp.append(match)
    c.append(temp)
    temp2 = []
    for match in SECOND_LIST:
        for value in file:
            if match in value:
                temp2.append(match)
    d.append(temp2)

print('File1 first list count = ' + str(len(c[0])))
print('File1 second list count = ' + str(len(d[0])))
print('File2 first list count = ' + str(len(c[1])))
print('File2 second list count = ' + str(len(d[1])))
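A more compact variant of the same counting, as a sketch (it counts one hit per matching word-line pair, just like the loops above):

def count_matches(lines, word_list):
    return sum(1 for value in lines for match in word_list if match in value)

for i, file_lines in enumerate(two_files, start=1):
    print('File%d first list count = %d' % (i, count_matches(file_lines, FIRST_LIST)))
    print('File%d second list count = %d' % (i, count_matches(file_lines, SECOND_LIST)))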
I have 2 txt files (file_a.txt and file_b.txt).
file_a.txt contains a long list of 4-letter combinations (one combination per line):
aaaa
bcsg
aacd
gdee
aadw
hwer
etc.
file_b.txt contains a list of letter combinations of various length (some with spaces):
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
etc.
I am looking for a Python script that would allow me to do the following:
read file_a.txt line by line
take each 4-letter combination (e.g. aaai)
read file_b.txt and find all the various-length letter combinations starting with that 4-letter combination (e.g. aaaibjkes, aaailoiersaaageehikjaaa, aaailoiuwegoiglkjaaake, etc.)
print the results of each search to a separate txt file named after the 4-letter combination.
File aaai.txt:
aaaibjkes
aaailoiersaaageehikjaaa
aaailoiuwegoiglkjaaake
etc.
File bcsi.txt:
bcspwiopiejowih
bcsiweyoieotpwe
etc.
I'm sorry, I'm a newbie. Can someone point me in the right direction, please? So far I've only got:
#I presume I will have to use regex at some point
import re
file1 = open('file_a.txt', 'r').readlines()
file2 = open('file_b.txt', 'r').readlines()
#Should I look into findall()?
I hope this helps:
file1 = open('file_a.txt', 'r')
file2 = open('file_b.txt', 'r')

# get every item in your second file into a list
mylist = file2.readlines()

# read each line in the first file
for line in file1:
    searchStr = line.strip()
    if not searchStr:
        continue  # skip blank lines, which would match everything
    # find this line in your second file
    exists = [s for s in mylist if searchStr in s]
    if exists:
        # if this line exists in your second file then create a file for it
        fileNew = open(searchStr + '.txt', 'w')
        for match in exists:
            fileNew.write(match)
        fileNew.close()

file1.close()
file2.close()
What you can do is open both files and run down both files line by line using for loops.
You can have two for loops: the first reads file_a.txt, as you will be reading through it only once; the second reads through file_b.txt and looks for the string at the start of each line.
To do so, you can use .find() to search for the string. Since it has to be at the start, the returned value should be 0.
file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")

for a_line in file_a:
    # This result value will be written into your new file
    result = ""
    # This is what we will search with
    search_val = a_line.strip("\n")
    print "---- Using " + search_val + " from file_a to search. ----"
    for b_line in file_b:
        print "Searching file_b using " + b_line.strip("\n")
        if b_line.strip("\n").find(search_val) == 0:
            result += (b_line)
    print "---- Search ended ----"
    # Set the read pointer to the start of the file again
    file_b.seek(0, 0)
    if result:
        # Write the contents of "result" into a file with the name of "search_val"
        with open(search_val + ".txt", "a") as f:
            f.write(result)

file_a.close()
file_b.close()
Test Cases:
I am using the test cases in your question:
file_a.txt
aaaa
bcsg
aacd
gdee
aadw
hwer
file_b.txt
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
The program produces an output file bcsg.txt as it is supposed to with bcsgiweyoieotpwe inside.
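As a side note, str.startswith expresses the same check more directly: if b_line.strip("\n").startswith(search_val): behaves the same as .find(search_val) == 0.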
Try this:
f1 = open("a.txt", "r").readlines()
f2 = open("b.txt", "r").readlines()

file1 = [word.replace("\n", "") for word in f1]
file2 = [word.replace("\n", "") for word in f2]

data = []
data_dict = {}
for short_word in file1:
    data += [[short_word, w] for w in file2 if w.startswith(short_word)]

for single_data in data:
    if single_data[0] in data_dict:
        data_dict[single_data[0]].append(single_data[1])
    else:
        data_dict[single_data[0]] = [single_data[1]]

for key, val in data_dict.items():
    open(key + ".txt", "w").writelines("\n".join(val))
    print(key + ".txt created")
In the code below I'm opening a fileList and checking each file in the fileList.
If the name of the file corresponds with the first 4 characters of a line in another text file, I extract the number written in that line with line.split()[1] and assign the int of this string to d. Afterwards I use this d to divide the counter.
Here's part of my function:
fp = open('yearTerm.txt', 'r')  # open the text file

def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        # problem is here: this for finds d for the first file and uses it for all
        for line in fp:
            if date_stamp.startswith(line[:4]):
                d = int(line.split()[1])
                print d
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        # One word per line, strip space. No empty lines.
        fw = open(f, 'r')
        fileWords = Counter(w for w in fw.read().split())
        # For each unique word, count occurrences and store in dict.
        for stemWord, stemFreq in fileWords.items():
            Freq = stemFreq / d
            if stemWord not in wordDict:
                wordDict[stemWord] = [(date_stamp, Freq)]
            else:
                wordDict[stemWord].append((date_stamp, Freq))
This works but gives me the wrong output: the for loop that finds d is only run once, but I want it to run for each file, as each file has a different d. I don't know how to change this loop to get the right d for each file, or what else I should use.
I appreciate any advice.
I don't quite understand what you are trying to do, but if you want to do some processing for each "good" line in fp, you should move the corresponding code under that if:
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        for line in fp:
            if date_stamp.startswith(line[:4]):
                d = int(line.split()[1])
                print d
                print "Processing file: " + str(f)
                fileWordList = []
                fileWordSet = set()
                ...
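A separate issue, not shown above: fp is opened once at module level, so its iterator is exhausted after the first file, which matches the symptom of d only being found once. A minimal sketch that rescans the file for each input (assuming the same names as in the question):

def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        d = None
        with open('yearTerm.txt') as fp:  # re-open (or fp.seek(0)) for every file
            for line in fp:
                if date_stamp.startswith(line[:4]):
                    d = int(line.split()[1])
                    break
        if d is None:
            continue  # no matching entry for this file
        # ... process file f using d, as before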
I want to compare multiple gzipped files (15-20) and recover from them the lines that are common. But this is not so simple: lines only count as the same if they match in certain columns, and I would also like count information on how many files each line was present in. If 1, the line is unique to a file, etc. It would also be nice to keep those file names.
Each file looks like this:
##SAMPLE=<ID=NormalID,Description="Cancer-paired normal sample. Sample ID 'NORMAL'">
##SAMPLE=<ID=CancerID,Description="Cancer sample. Sample ID 'TUMOR'">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NormalID_NORMAL CancerID_TUMOR
chrX 136109567 . C CT . PASS IC=8;IHP=8;NT=ref;QSI=35;QSI_NT=35;RC=7;RU=T;SGT=ref->het;SOMATIC;TQSI=1;TQSI_NT=1;phastCons;CSQ=T|ENSG00000165370|ENST00000298110|Transcript|5KB_downstream_variant|||||||||YES|GPR101||||| DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50 23:23:21,21:0,0:2,2:21.59:0.33:0.00 33:33:16,16:13,13:4,4:33.38:0.90:0.00
chrX 150462334 . T TA . PASS IC=2;IHP=2;NT=ref;QSI=56;QSI_NT=56;RC=1;RU=A;SGT=ref->het;SOMATIC;TQSI=2;TQSI_NT=2;CSQ=A||||intergenic_variant||||||||||||||| DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50 30:30:30,30:0,0:0,0:31.99:0.00:0.00 37:37:15,17:16,16:6,5:36.7:0.31:0.00
Files are tab delimited.
If a line starts with #, ignore it. We are interested only in those that do not.
Taking 0-based Python coordinates, we are interested in fields 0, 1, 2, 3, 4. They have to match between files for a line to be reported as common. However, we still need to hold the information from the rest of the columns/fields, so that they can be written to the output file.
Right now I have the following code:
import gzip

filenames = ['a', 'b', 'c']
files = [gzip.open(name) for name in filenames]
sets = [set(line.strip() for line in file if not line.startswith('#')) for file in files]
common = set.intersection(*sets)
for file in files:
    file.close()
print common
In my current code I do not know where the if not line.startswith() check should go, or how to specify which columns of a line should be matched. Not to mention that I have no idea how to get the lines that are, for example, present in 6 files, or in 10 out of a total of 15 files.
Any help with this?
Collect the lines in a dictionary with the fields that make them similar as key:
import gzip
from collections import defaultdict

d = defaultdict(list)

def process(filename, line):
    if line[0] == '#':
        return
    fields = line.split('\t')
    key = tuple(fields[0:5])  # Fields that make lines similar/same
    d[key].append((filename, line))

for filename in filenames:
    with gzip.open(filename) as fh:
        for line in fh:
            process(filename, line.strip())
Now you have a dictionary mapping each key to a list of filename-line tuples. You can, for example, print all the lines whose key appears at least 10 times:
for l in d.values():
    if len(l) < 10:
        continue
    print 'Same key found %d times:' % len(l)
    for filename, line in l:
        print '%s: %s' % (filename, line)
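If the same key can occur more than once within a single file and you want the number of distinct files rather than the raw count, a sketch that counts unique filenames per key:

for key, entries in d.items():
    files_seen = {filename for filename, _ in entries}
    if len(files_seen) >= 10:
        print('%s found in %d files' % ('\t'.join(key), len(files_seen)))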