Help, guys!
I have a list of 150 text files, and one text file with query strings: (
SRR1005851
SRR1299210
SRR1021605
SRR1299782
SRR1299369
SRR1006158
...etc).
I want to search for each of these query strings in the list of 150 text files.
If, for example, SRR1005851 is found in at least 120 of the files, SRR1005851 will be appended to an output file.
The search will iterate over every query string and through all 150 files.
Summary: I am looking for which query strings are found in at least 90% of the 150 files.
I don't think I fully understand your question. Posting your code and an example file would have been very helpful.
This code will count all entries in all files, then identify the unique entries per file. After that, it will count each entry's occurrences across the files. Finally, it will select only the entries that appeared in at least 90% of all files.
Also, this code could have been shorter, but for readability's sake, I created many variables, with long, meaningful names.
Please read the comments ;)
import os
from collections import Counter
from sys import argv

# adjust your cut point
PERCENT_CUT = 0.9

# here we are going to save each file's entries, so we can sum them later
files_dict = {}
# total_files is the number you'll need to check each count against
total_files = 0
# raw total entries, including duplicates
total_entries = 0

# the first argument is the script name, so the second one is the folder to search
search_dir = argv[1]

# list everything under search_dir - ideally only your input files
# CHECK HOW TO READ ONLY SPECIFIC FILE TYPES if anything else lives in the same folder
files_list = os.listdir(search_dir)
total_files = len(files_list)

print('Files READ:')
# iterate over each file found in the given folder
for file_name in files_list:
    print(" " + file_name)
    file_object = open(os.path.join(search_dir, file_name), 'r')
    # build a list of entries with the newlines stripped
    # (list() is needed in Python 3, where map() returns an iterator)
    file_entries = list(map(lambda it: it.strip("\r\n"), file_object.readlines()))
    # gotta count 'em all
    total_entries += len(file_entries)
    # a set doesn't allow duplicate entries
    entries_set = set(file_entries)
    # create a dict from the set, setting each key's value to 1
    file_entries_dict = dict.fromkeys(entries_set, 1)
    # wrap each file's dict in a Counter so the per-file counts can be summed later
    files_dict[file_name] = Counter(file_entries_dict)
    file_object.close()

print("\n\nALL ENTRIES COUNT: " + str(total_entries))

# now we create a Counter that will hold each unique key's total across all files
entries_dict = Counter({})
for file_dict_key, file_dict_value in files_dict.items():
    print(str(file_dict_key) + " - " + str(file_dict_value))
    entries_dict += file_dict_value

print("\nUNIQUE ENTRIES COUNT: " + str(len(entries_dict.keys())))

# 90% from your question
cut_line = total_files * PERCENT_CUT
print("\nNeeds at least " + str(int(cut_line)) + " occurrences to be listed below")

# output_dict is the final dict, holding entries present in at least 90% of the files
output_dict = {}
# this is PYTHON 3 - CHECK YOUR VERSION, as older versions use iteritems() instead of items() below
for entry, count in entries_dict.items():
    # ">=" rather than ">", because the question asks for "at least" 90%
    if count >= cut_line:
        output_dict[entry] = count

print(output_dict)
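To run it, assuming the script were saved as count_entries.py (the name is hypothetical), you would pass the folder holding the 150 files as the only argument:

python count_entries.py /path/to/files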
I am doing a project in which I extract data from three different data sets and combine it to look at campaign contributions. To do this I turned the relevant data from two of the sets into dictionaries (canDict and otherDict) with ID numbers as keys and the information I need (party affiliation) as values. Then I wrote a program to pull party information based on the key (my third set includes these ID numbers as well) and match it with the employer of the donating party and the amount donated. That was a long-winded explanation, but I thought it would help with understanding this chunk of code.
My problem is that, for some reason, my third dictionary (employerDict) won't populate. By the end of this step I should have a dictionary containing employers as keys and a list of tuples as values, but after running it, the dictionary remains blank. I've been over this line by line a dozen times and I'm pulling my hair out - I can't for the life of me think why it won't work, which makes it hard to search for answers. I've commented almost every line to try to make it easier to understand out of context. Can anyone spot my mistake?
Update: I added a counter, n, to the outermost for loop to see if the program was iterating at all.
Update 2: I added another if statement in the creation of the variable party, in case the ID at data[0] did not exist in canDict or in otherDict. I also added some already suggested fixes from the comments.
n = 0
with open(path3) as f:  # path3 is a txt file
    for line in f:
        n += 1
        if n % 10000 == 0:
            print(n)
        data = line.split("|")  # splitting each line into its fields (delimited by the symbol |)
        party = canDict.get(data[0])  # data[0] is an ID number; canDict and otherDict map these IDs to party affiliations
        if party is None:
            party = otherDict[data[0]]  # if there is no matching ID number in canDict, search otherDict
            if party is None:
                party = 'Other'
            else:
                print('ERROR: party is None')
        x = (party, int(data[14]))  # a tuple of the party (found above) and an integer amount from path3
        employer = data[11]  # index 11 in path3 is the employer of the person
        if employer != '':
            value = employerDict.get(employer)  # if the employer field is not blank, see if this employer is already a key in employerDict
            if value is None:
                employerDict[employer] = [x]  # if the key does not exist, create it with a list containing the tuple x
            else:
                employerDict[employer].append(x)  # if it does exist, append the tuple x to the existing list
        else:
            print("ERROR: employer == ''")
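For reference, a minimal sketch of the fallback lookup the updates describe, using dict.get() with a default so that an ID missing from both dictionaries can never raise a KeyError (names from the question; the exact intent is an assumption):

# hedged sketch: .get() returns None (or a supplied default) instead of raising
party = canDict.get(data[0])
if party is None:
    party = otherDict.get(data[0], 'Other')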
Thanks for all the input, everyone - however, it looks like it's a problem with my data file, not a problem with the program. Dangit.
I'm trying to write 18 list items into 3 text files, 6 items for each of them. The paths[0] and paths[1] text files come out correct, but the third one, paths[2], only gets 3 items.
mainStudents = ["1 Name Surname D1", "2 Name Surname D2" ...16x]
def randomize(path):
    count = 0
    currentpathIs = getprojdir + "\\" + "\shuffled" + path
    print(currentpathIs)
    with open(currentpathIs, "a+") as file:
        while True:
            try:
                file.write(mainStudents[count] + "\n")
                del(mainStudents[count])
                print(count)
                count += 1
            except Exception as e:
                print(mainStudents)
                break
            if count == 6:
                break

randomize(paths[0])
randomize(paths[1])
randomize(paths[2])
I'm getting this error:
Traceback (most recent call last):
  File "C:\Users\user\Desktop\New folder\python.py", line 53, in <module>
    randomize(paths[2])
  File "C:\Users\user\Desktop\New folder\python.py", line 43, in randomize
    file.write(mainStudents[count] + "\n")
IndexError: list index out of range
But there are still 3 items left in the mainStudents list?
The problem is that when you delete an item from your list, you decrease its size, so your count keeps incrementing and eventually tries to access an index that no longer exists. Take a very simple example of a list of size two:
mainStudents = ["1 Name Surname D1", "2 Name Surname D2"]
Now, when you call your method, the first iteration will work, because you access mainStudents[0].
But on your second iteration, you have deleted that item from the list, so now your list looks like:
['2 Name Surname D2']
Which is now a list of size one, whose only valid index is 0.
So the next iteration of your while loop will have count at 1, and that is exactly where your IndexError comes from.
The combination of using a while loop and del-ing items from your list is what causes the issue. Instead, decide what exactly you want to iterate over - from your logic it seems to be mainStudents. So why not just do that instead?
def randomize(path):
    currentpathIs = getprojdir + "\\" + "\shuffled" + path
    print(currentpathIs)
    with open(currentpathIs, "a+") as file:
        for student in mainStudents:
            file.write(student + "\n")
And you can simplify that further by converting your list to a single string separated by \n, using the available string method join:
'\n'.join(mainStudents)
Furthermore, there are methods available to facilitate path creation. Take a look at the os module, more specifically os.path.join. Your code can then be further simplified to:
from os.path import join

def randomize(path):
    currentpathIs = join(getprojdir, "shuffled", path)
    with open(currentpathIs, "a+") as file:
        file.write('\n'.join(mainStudents))
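Since the original goal was six students per file rather than the whole list in each, here is a minimal sketch of that variant; it slices the list instead of deleting from it while iterating (paths, getprojdir and mainStudents are assumed to be defined as in the question, and "shuffled" is assumed to name a folder, as above):

import os

CHUNK = 6  # six students per file

def write_chunk(path, students):
    # join the project dir, the shuffled folder and the file path fragment
    current_path = os.path.join(getprojdir, "shuffled", path)
    with open(current_path, "a+") as f:
        f.write("\n".join(students) + "\n")

for i, path in enumerate(paths):
    # slice out this file's six students; no need to mutate mainStudents
    write_chunk(path, mainStudents[i * CHUNK:(i + 1) * CHUNK])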
I'm using openpyxl in Python, and I'm trying to run through 50k rows, grab data from each row, and place it into a file. However, what I'm finding is that it runs incredibly slowly the farther I get into it. The first 1k rows go super fast, less than a minute, but after that it takes longer and longer to do the next 1k rows.
I was opening a .xlsx file. I wonder if it would be faster to open a .txt file as a CSV, or to read a JSON file, or to convert the data somehow into something that reads faster?
I have 20 unique names in a given column, and each row pairs one of those names with a value. I'm trying to build a comma-separated string of the entire value column for each unique name.
Value1: 1243,345,34,124,
Value2: 1243,345,34,124,
etc, etc
I'm running through the value list, checking whether the name exists as a file; if it does, it will open that file and append the new value to it; if the file doesn't exist, it will create the file and then set it to append. I have a dictionary that holds all the open append-mode file handles, so any time I want to write something, it will grab the file name, look up the handle in the dict, and write to that file, so it doesn't keep opening new files every time it runs.
The first 1k took less than a minute, but now I'm on the 4k-to-5k records and it's already been running 5 minutes. It seems to take longer as it goes up in records; I wonder how to speed it up. It's not printing to the console at all.
writeFile = 1
theDict = {}
for row in ws.iter_rows(rowRange):
    for cell in row:
        # grabbing the value
        theStringValueLocation = "B" + str(counter)
        theValue = ws[theStringValueLocation].value
        theName = cell.value
        textfilename = theName + ".txt"
        if os.path.isfile(textfilename):
            listToAddTo = theDict[theName]
            listToAddTo.write("," + theValue)
            if counter == 1000:
                print "1000"
                st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
        else:
            writeFileName = open(textfilename, 'w')
            writeFileName.write(theValue)
            writeFileName = open(textfilename, 'a')
            theDict[theName] = writeFileName
        counter = counter + 1
I added some timestamps to the above code (they are not shown there, but you can see the output below). The problem I'm seeing is that each 1k run takes longer and longer: 2 minutes the first time, then 3 minutes, then 5 minutes, then 7 minutes. By the time it hits 50k, I'm worried it's going to take an hour or more.
1000
2016-02-25 15:15:08
2000
2016-02-25 15:17:07
3000
2016-02-25 15:20:52
2016-02-25 15:25:28
4000
2016-02-25 15:32:00
5000
2016-02-25 15:40:02
6000
2016-02-25 15:51:34
7000
2016-02-25 16:03:29
8000
2016-02-25 16:18:52
9000
2016-02-25 16:35:30
10000
Some things I should make clear: I don't know the names of the values ahead of time; maybe I should run through and grab those in a separate Python script to make this go faster?
Second, I need a string of all values separated by commas; that's why I put them into a text file to grab later. I was thinking of doing it with a list as was suggested to me, but I'm wondering if that will have the same problem. I'm thinking the problem has to do with reading from Excel. Any way I can get a comma-separated string out of it, I can do it another way.
Or maybe I could do try/except instead of searching for the file every time, and if there is an error, I can assume I need to create a new file? Maybe the file-exists lookup on every row is what makes it go really slow?
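(A minimal sketch of that try/except idea, reusing the question's names theDict, theName and theValue purely for illustration, not the original code:

try:
    f = theDict[theName]              # reuse the already-open file handle
except KeyError:                      # first time this name is seen
    f = open(theName + ".txt", "a")
    theDict[theName] = f
f.write("," + theValue)

Whether this is faster than os.path.isfile() would need measuring; the dict lookup itself is cheap either way.)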
This question is a continuation of my original one here, and I took some suggestions from there: What is the fastest performance tuple for large data sets in python?
I think what you're trying to do is get a key out of column B of the row, and use that for the filename to append to. Let's speed it up a lot:
from collections import defaultdict

Value_entries = defaultdict(list)  # dict of lists of row data

for row in ws.iter_rows(rowRange):
    key = row[1].value
    Value_entries[key].extend([cell.value for cell in row])

# All done. Now write files:
for key in Value_entries.keys():
    with open(key + '.txt', 'w') as f:
        f.write(','.join(Value_entries[key]))
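One caveat: str.join() only accepts strings, so if the sheet contains numeric cells, the values may need converting first, for example:

f.write(','.join(str(v) for v in Value_entries[key]))  # join() requires strings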
It looks like you only want cells from the B-column. In this case you can use ws.get_squared_range() to restrict the number of cells to look at.
for row in ws.get_squared_range(min_col=2, max_col=2, min_row=1, max_row=ws.max_row):
    for cell in row:  # each row is always a sequence
        filename = cell.value
        if os.path.isfile(filename):  # note: os.path.isfile - there is no os.path.isfilename
            …
It's not clear what's happening with the else branch of your code, but you should probably be closing any files you open as soon as you have finished with them.
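A minimal sketch of that advice, using a context manager so the file is closed as soon as the block exits (filename comes from the loop above; value is a stand-in for whatever you write):

with open(filename, 'a') as f:  # closed automatically when the block ends
    f.write(value)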
Based on the other question you linked to, and the code above, it appears you have a spreadsheet of name-value pairs. The name is in column A and the value is in column B. A name can appear multiple times in column A, and there can be a different value in column B each time. The goal is to create a list of all the values that show up for each name.
First, a few observations on the code above:
counter is never initialized. Presumably it is initialized to 1.
open(textfilename,...) is called twice without closing the file in between. Calling open allocates some memory to hold data related to operating on the file. The memory allocated for the first open call may not get freed until much later, maybe not until the program ends. It is better practice to close files when you are done with them (see using open as a context manager).
The looping logic isn't correct. Consider:
First iteration of inner loop:
for cell in row:                        # cell refers to A1
    valueLocation = "B" + str(counter)  # valueLocation is "B1"
    value = ws[valueLocation].value     # value gets contents of cell B1
    name = cell.value                   # name gets contents of cell A1
    textfilename = name + ".txt"
    ...
    # opens file with name based on contents of cell A1, and
    # writes value from cell B1 to the file
    ...
    counter = counter + 1               # counter = 2
But each row has at least two cells, so on the second iteration of the inner loop:
for cell in row:                        # cell now refers to cell B1
    valueLocation = "B" + str(counter)  # valueLocation is "B2"
    value = ws[valueLocation].value     # value gets contents of cell B2
    name = cell.value                   # name gets contents of cell B1
    textfilename = name + ".txt"
    ...
    # opens file with name based on contents of cell "B1"  <<<< wrong file
    # writes the value of cell "B2" to the file            <<<< wrong value
    ...
    counter = counter + 1               # counter = 3 when cell B1 is processed
Repeat for each of 50K rows. Depending on how many unique values are in column B, the program could be trying to have hundreds or thousands of open files (based on contents of cells A1, B1, A2, B2, ...) ==>> very slow or program crashes.
iter_rows() returns a tuple of the cells in the row.
As people suggested in the other question, use a dictionary and lists to store the values, and write them all out at the end. Like so (I'm using Python 3.5, so you may have to adjust this if you are using 2.7).
Here is a straightforward solution:
from collections import defaultdict

data = defaultdict(list)

# gather the values into lists associated with each name
# data will look like { 'name1':['value1', 'value42', ...],
#                       'name2':['value7', 'value23', ...],
#                       ...}
for row in ws.iter_rows():
    name = row[0].value
    value = row[1].value
    data[name].append(value)

for key, valuelist in data.items():
    # turn the list of strings into a long comma-separated string
    # e.g., ['value1', 'value42', ...] => 'value1,value42,...'
    value = ",".join(valuelist)
    with open(key + ".txt", "w") as f:
        f.write(value)
I have two Python scripts that I would like to combine and run as one program, but I am unsure about what exactly I need to alter to make the two scripts work together.
Here is my first code:
import random

with open('filename.txt') as fin:
    lines = fin.readlines()

random.shuffle(lines)
for i, line in enumerate(lines):
    if i >= 0 and i < 6800:
        print(line, end='')
And here is the second:
import csv

with open("Randomfile.txt") as f:
    dict1 = {}
    r = csv.reader(f, delimiter="\t")
    for row in r:
        a, b, v = row
        dict1.setdefault((a, b), []).append(v)

# for key in dict1:
#     print(key[0])
#     print(key[1])
#     print(d[key][0])

with open("filename2.txt") as f:
    dict2 = {}
    r = csv.reader(f, delimiter="\t")
    for row in r:
        a, b, v = row
        dict2.setdefault((a, b), []).append(v)

# for key in dict2:
#     print(key[0])

count = 0
for key1 in dict1:
    for key2 in dict2:
        if (key1[0] == key2[0]) and abs((float(key1[1].split(" ")[0])) - (float(key2[1].split(" ")[0]))) < 0:
            count += 1
print(count)
What I usually do is this: using the first script, I extract a random set of elements and save it as a text file; then I open that file in the second script and compare it with my other file to get my results.
However, I would like to skip the saving and reopening process. I want to place my first script inside my second and alter the code so that it runs as one program, so that when my elements are extracted they are automatically compared to my other file.
I have read up and watched videos about using
if __name__ == "__main__":
but I don't really understand its function. So if that is the solution, I would love to understand how to use it to solve my problem.
Please help me figure out how I can combine the two scripts, altering them both so the code runs as one. I am happy to cooperate and clarify anything.
[EDIT] My files are in the following format.
An example of my random file:
3 10045 0.120559958
4 157465 0.590642951
1 222471 0.947959795
3 222473 0.083341617
2 222541 0.054014337
5 222588 0.060296547
An example of my other file (that i am comparing to my random file):
2 143521109 4.57E-08
1 201466556 5.57E-08
1 11566373 8.43E-08
1 143627370 8.61E-08
6 98624499 1.02E-07
Imagine that instead of having two scripts, each script was a function and then they were both called from another function.
In other words, you would have the following:
def first_code():
    ...  # code of first script goes here

def second_code():
    ...  # code of second script goes here

def master_function():
    first_code()
    second_code()
Now, if master_function() is called, so are the other two. If you replace that last definition with a main guard:
if __name__ == "__main__":
    first_code()
    second_code()
It will automatically run if you execute the script from your command line.
Well, I modified your code as follows:
import random

with open('filename.txt') as fin:
    lines = fin.readlines()

random.shuffle(lines)
rnd_str = []
for i, line in enumerate(lines):
    if i >= 0 and i < 6800:
        rnd_str.append(line)

dict1 = {}
for row in rnd_str:
    a, b, v = row.split()
    dict1.setdefault((a, b), []).append(v)

dict2 = {}
with open("filename2.txt") as f:
    # the files are whitespace-separated, so split() is used instead of csv.reader
    for row in f:
        a, b, v = row.split()
        dict2.setdefault((a, b), []).append(v)

count = 0
for key1 in dict1:
    for key2 in dict2:
        if (key1[0] == key2[0]) and ((float(key1[1]) - (float(key2[1]))) < 0):
            count += 1
print(count)
Thus you have no need to save the random file; you can process its content directly in the second part of the code, i.e., compare it with the other file's content.
Note: there was a flaw in your code:
abs((float(key1[1].split(" ")[0])) - (float(key2[1].split(" ")[0]))) < 0
which made me smile, because how can abs(x) ever be < 0?
Anyway, the script works now; it returns 4 on the samples you gave.
Instead of printing in the first program, try to build a dictionary from that output; then you can work with that dict instead of copying the output, saving it, and loading it again. You will save a lot of time.
So in the first script, create a dict and change the print into an append to that dict. You don't need another script; just extend the first one with the code from the second, working with the dict instead of a new file.
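A minimal sketch of that suggestion, reusing the question's own parsing logic (the file name and the 6800 cutoff come from the question; everything else is illustrative):

import random

rnd_dict = {}
with open('filename.txt') as fin:
    lines = fin.readlines()
random.shuffle(lines)
for line in lines[:6800]:           # keep the first 6800 shuffled lines
    a, b, v = line.split()
    rnd_dict.setdefault((a, b), []).append(v)
# rnd_dict can now be compared directly against the dict built from the other file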
Don't worry, there's no need to alter your code. Just make a new script and put this in it:
def code1():
    import firstprogram

def code2():
    import secondprogram

code1()
code2()
This will run your first program and then your second. Just make sure to replace firstprogram and secondprogram with the names of your two programs.
One thing you can do is type in the second file name in the main file. For example, my first file name is 'main.py' and the second one is 'float.py' you can merge these file together by typing:
_merge_ = 'float.py'
in the main file, which is 'main.py'
Hope it works!!
Thanking you all in advance.
Regards,
VC
I'm trying to print data from my text file in Python:
text_file = open("Class1.txt", "r")
data = text_file.read().splitlines()
for li in data:
    namelist = li.split(":")[0]
    scorelist = li.split(":")[1]
print(namelist)
print(scorelist)
text_file.close()
My text file has:
Jim:13524
Harry:3
Jarrod:10
Jacob:0
Harold:5
Charlie:3
Jj:0
It only shows the last entry.
Shell:
Would you like to view class 1, 2 or 3? 1
Jj
0
The problem is that you are overwriting the values of namelist and scorelist on each pass through the loop, and only printing them after the loop ends. You need to add each item to a list. Building up a sequential list of items is usually done with list.append() or a list comprehension. Read the documentation, or do some tutorials?
To actually create the lists, you can do this:

namelist, scorelist = [], []
for li in data:
    namelist.append(li.split(":")[0])
    scorelist.append(li.split(":")[1])
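Then, to print the paired results, you can zip the two lists together (a small illustrative addition, not part of the original answer):

for name, score in zip(namelist, scorelist):  # pairs item i of each list
    print(name, score)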
Alternately, this might be a better overall approach:
with open("Class1.txt", "r") as text_file:
    # strip() removes the trailing newline before splitting on ":"
    names_scores = [(e[0], e[1]) for e in (li.strip().split(":") for li in text_file)]
for name, score in names_scores:
    print(name, score)
This assumes you really just want to extract the names and scores and print them, not do anything else. How you handle and store the data depends a lot on what you are doing with it once you extract it from the file.