I have two files which I loaded into lists. The content of the first file is something like this:
d.complex.1
23
34
56
58
68
76
.
.
.
etc
d.complex.179
43
34
59
69
76
.
.
.
etc
The content of the second file is also in the same format, but with different numerical values. Please treat everything from one d.complex.* header to the next as one set.
Now I am interested in comparing each numerical value from one set of the first file with each numerical value of the sets in the second file. I would like to record the number of times each numerical value has appeared in the second file overall.
For example, the number 23 from d.complex.1 could have appeared 5 times in file 2 under different sets. All I want to do is record the number of occurrences of number 23 in file 2 including all sets of file 2.
My initial approach was to load them into lists and compare, but I was not able to achieve this. I searched on Google and came across sets, but being a Python noob, I need some guidance. Can anyone help me?
If you feel the question is not clear, please let me know. I have also pasted the complete file 1 and file 2 here:
http://pastebin.com/mwAWEcTa
http://pastebin.com/DuXDDRYT
Open the file using Python's open function, then iterate over all its lines. Check whether the line contains a number; if so, increase its count in a defaultdict instance as described here.
Repeat this for the other file and compare the resulting dicts.
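For illustration only, here is a minimal sketch of that approach; the file name and the way numeric lines are detected are assumptions, not something stated in the question:

from collections import defaultdict

def count_numbers(filename):
    # Map each number (kept as a string) to how often it appears in the file,
    # ignoring the d.complex.* headers and blank lines.
    counts = defaultdict(int)
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('d.complex'):
                counts[line] += 1
    return counts

counts_file2 = count_numbers('file2.txt')   # hypothetical file name
print(counts_file2.get('23', 0))            # how often 23 occurs anywhere in file 2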
First create a function which can load a given file. As you may want to maintain the individual sets and also count the occurrences of each number, it is best to have a dict for the whole file where the keys are set names, e.g. d.complex.1, and for each such set keep another dict for the numbers in that set. The code below explains it better:
def file_loader(f):
    file_dict = {}
    current_set = None
    for line in f:
        if line.startswith('d.complex'):
            file_dict[line] = current_set = {}
            continue
        if current_set is not None:
            current_set[line] = current_set.get(line, 0) + 1
    return file_dict
Now you can easily write a function which will count a number in a given file_dict:
def count_number(file_dict, num):
    count = 0
    for set_name, number_set in file_dict.iteritems():
        count += number_set.get(num, 0)
    return count
e.g. here is a usage example:
s = """d.complex.1
10
11
12
10
11
12"""
file_dict = file_loader(s.split("\n"))
print file_dict
print count_number(file_dict, '10')
output is:
{'d.complex.1': {'11': 2, '10': 2, '12': 2}}
2
You may have to improve the file loader, e.g. skip empty lines, convert the numbers to int, etc.; a possible sketch of such improvements follows.
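As a rough sketch (not part of the original answer), a tightened loader might strip whitespace, skip blank lines, and store the numbers as ints:

def file_loader(f):
    file_dict = {}
    current_set = None
    for line in f:
        line = line.strip()
        if not line:                        # skip empty lines
            continue
        if line.startswith('d.complex'):
            file_dict[line] = current_set = {}
        elif current_set is not None:
            num = int(line)                 # store the numbers as ints
            current_set[num] = current_set.get(num, 0) + 1
    return file_dict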
For this problem, I have a separate txt file which contains a list of values down below:
Years+1900 Populationx106
0 1650
10 1750
20 1860
30 2070
40 2300
50 2560
60 3040
70 3710
80 4450
90 5280
100 6080
110 6870
For the problem I'm working on, I'm supposed to take that file and path name and then do calculations on the data with some functions I created. I have finished the functions I need, but I'm having an issue running them because I believe the code reads the "Years+1900 Populationx106" header first instead of the numbers below it.
Here's the code for my functions:
# Input: year
# Output: estimate of population for that year
def pop(year):
    return 1436.53*((1.01395)**year)

# Input: data
# Return: the average error as per equation 18.
def error(data):
    error = 0
    for i in data:
        error += (abs(i[1]-pop(i[0]))/i[1])
    return 100*error/12
Here is the code I created to retrieve the data from my separate txt file:
def get_data(path,name):
    with open("Assignment7/pop.txt", "r") as path:
        path = open("Assignment7/pop.txt", "r")
        name = path.read()
    return name
The error I'm receiving is for the part below. It is an index error, and it says the string index is out of range. I believe this is because it is reading the first part of the data in pop.txt. How can I skip the first line of pop.txt so that it only reads the numerical values?
error +=(abs(i[1]-pop(i[0]))/i[1])
I have tried changing the index values already, however it still says that my string index is out of range.
Let's assume you are correct and passing the first line of your text file to your function is breaking it.
You can "throw away" the first line of the text file by reading it as a single line (but doing nothing with it) and then reading the data you actually want, like this:
def get_data(path, name):
    with open("Assignment7/pop.txt", "r") as path:
        header = path.readline()  # Read the header line, but don't use it
        name = path.read()        # Read subsequent lines as the data you want
    return name
I suspect that you've simply read the entire file as one string, so each element i is a single character and has no dimensionality. You'll need to split the file contents on the newline character to get individual lines (and likely split each line again to get the two separate columns).
Python String Split will be useful for that.
You're correct that the first line will pose issues, but it can be skipped with a path.readline() call as Richard said; a rough sketch of the whole parsing step is below.
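For illustration only, a minimal sketch of that parsing, assuming the file layout shown above; it returns the data in the (year, population) pair form that error() expects:

def get_data(path):
    # Return a list of (year, population) pairs as floats,
    # skipping the header line.
    data = []
    with open(path, "r") as f:
        f.readline()                    # throw away the header line
        for line in f:
            parts = line.split()        # split on whitespace
            if len(parts) == 2:
                data.append((float(parts[0]), float(parts[1])))
    return data

data = get_data("Assignment7/pop.txt")
print(error(data))                      # error() as defined in the question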
I have two lists structured somehow like this:
A= [[1,27],[2,27],[3,27],[4,28],[5,29]]
B= [[6,30],[7,31],[8,31]]
and I have a file that has numbers:
1 5
2 3
3 1
4 2
5 5
6....
I want code that reads this file and maps the numbers to the lists, e.g. if the file has 1 it should look in list A and output 27, and if it has 6 it should look in B and print 30, so that I get
27 29
27 27
27 27
28 27
29 29
30 31
The problem is that my code gives an index error. I read the file line by line and have an if condition that checks whether the number I read from the file is less than the maximum number in list A; if so, it outputs the second element of that sublist, and otherwise it should move on. Instead of moving on to list B, however, my program still reads A and gives an index error.
with open(filename) as myfile:
    for line in myfile.readlines():
        parts=line.split()
        if parts[0]< maxnumforA:
            print A[int(parts[0])-1]
        else:
            print B[int(parts[0]-1)
You should turn those lists into dictionaries. For example:
_A = dict(A)
_B = dict(B)

with open(filename) as myfile:
    for line in myfile:
        parts = line.split()
        for part in parts:
            part = int(part)
            if part in _A:
                print _A[part]
            elif part in _B:
                print _B[part]
If the action that will take place does not need to know whether the value comes from A or B, both can be turned into a single dictionary:
d = dict(A + B)  # Creating the dictionary

with open(filename) as myfile:
    for line in myfile:
        parts = line.split()
        for part in parts:
            part = int(part)
            if part in d:
                print d[part]
Creating the dictionary can be accomplished in many different ways; I will list some of them (a short sketch follows the list):
d = dict(A + B): First joins both lists into a single list (without modifying A or B) and then turns the result into a dictionary. It's the clearest way to do it.
d = {**dict(A), **dict(B)}: Turns both lists into two separate dictionaries (without modifying A or B), unpacks them, and packs both into a single dictionary. Slightly (and I mean really slightly) faster than the previous method and less clear. Proposed by @Nf4r.
d = dict(A) followed by d.update(B): Turns the first list into a dictionary and then updates that dictionary with the content of the second list. Fastest method; it needs one line of code per list and generates no temporary objects, so it is also more memory-efficient.
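As a minimal sketch of those three approaches, using the A and B from the question (the unpacking variant needs Python 3.5+):

A = [[1, 27], [2, 27], [3, 27], [4, 28], [5, 29]]
B = [[6, 30], [7, 31], [8, 31]]

d1 = dict(A + B)                # join the lists, then convert

d2 = {**dict(A), **dict(B)}     # unpack two dicts into one (Python 3.5+)

d3 = dict(A)                    # convert the first list...
d3.update(B)                    # ...then merge in the second

assert d1 == d2 == d3           # all three give {1: 27, 2: 27, ..., 8: 31}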
As everyone stated before, a dict would be much better. I don't know if the left-side values in each list are unique, but if they are, you could just go for:
d = {**dict(A), **dict(B)}  # available in Python 3.x

with open(filename) as file:
    for line in file.readlines():
        for num in line.split():
            if int(num) in d:
                print(d[int(num)])
I have a 5GB file of businesses and I'm trying to extract all the businesses whose business type codes (SNACODE) start with the SNACODE corresponding to grocery stores. For example, SNACODEs for some businesses could be 42443013, 44511003, 44419041, 44512001, 44522004, and I want all businesses whose codes start with any of my grocery SNACODEs, codes = [4451,4452,447,772,45299,45291,45212]. In this case, I'd get the rows for 44511003, 44512001, and 44522004.
Based on what I googled, the most efficient way to read in the file seemed to be one row at a time (if not the SQL route). I then used a for loop and checked whether my SNACODE column started with any of my codes (which probably was a bad idea, but it was the only approach I could get to work).
I have no idea how many rows are in the file, but there are 84 columns. My computer was running for so long that I asked a friend who said it should only take 10-20 min to complete this task. My friend edited the code but I think he misunderstood what I was trying to do because his result returns nothing.
I am now trying to find a more efficient method than re-doing my 9.5 hours and having my laptop run for an unknown amount of time. The closest thing I've been able to find is most efficient way to find partial string matches in large file of strings (python), but it doesn't seem like what I was looking for.
Questions:
What's the best way to do this? How long should this take?
Is there any way that I can start where I stopped? (I have no idea how many rows of my 5gb file I read, but I have the last saved line of data--is there a fast/easy way to find the line corresponding to a unique ID in the file without having to read each line?)
This is what I tried -- in 9.5 hours it output a 72MB file (200k+ rows) of grocery stores:
codes = [4451,4452,447,772,45299,45291,45212] #codes for grocery stores
for df in pd.read_csv('infogroup_bus_2010.csv',sep=',', chunksize=1):
    data = np.asarray(df)
    data = pd.DataFrame(data, columns = headers)
    for code in codes:
        if np.char.startswith(str(data["SNACODE"][0]), str(code)):
            with open("grocery.csv", "a") as myfile:
                data.to_csv(myfile, header = False)
            print code
            break #break code for loop if match
grocery.to_csv("grocery.csv", sep = '\t')
This is what my friend edited it to. I'm pretty sure the x = df[df.SNACODE.isin(codes)] is only matching perfect matches, and thus returning nothing.
codes = [4451,4452,447,772,45299,45291,45212]
matched = []

for df in pd.read_csv('infogroup_bus_2010.csv',sep=',', chunksize=1024*1024, dtype = str, low_memory=False):
    x = df[df.SNACODE.isin(codes)]
    if len(x):
        matched.append(x)
        print "Processed chunk and found {} matches".format(len(x))

output = pd.concat(matched, axis=0)
output.to_csv("grocery.csv", index = False)
Thanks!
To increase speed you could pre-build a single regexp matching the lines you need, then read the raw file lines (no CSV parsing) and check them with the regexp...
import re

codes = [4451,4452,447,772,45299,45291,45212]
col_num = 4  # zero-based index of the SNACODE column

expr = re.compile("[^,]*," * col_num +
                  "(" + "|".join(map(str, codes)) + ")" +
                  ".*")

for L in open('infogroup_bus_2010.csv'):
    if expr.match(L):
        print L
Note that this is just a simple sketch as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain a comma you need a more sophisticated regexp like:
...
'([^"][^,]*,|"([^"]|"")*",)' * col_num +
...
that ignores commas inside double-quotes
You can probably make your pandas solution much faster:
import pandas as pd

codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
codes = [str(code) for code in codes]

sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})

with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            for target_code in codes:
                if code.startswith(target_code):
                    fout.write('{}\n'.format(code))
Read only the needed column with usecols=['SNACODE']. You can adjust the chunk size with chunksize=int(1e6). Depending on your RAM you can likely make it much bigger.
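As a small optional refinement (not part of the original answer): str.startswith also accepts a tuple of prefixes, so the inner loop over codes can be collapsed into one test. A minimal sketch, assuming the same file and column names as above:

import pandas as pd

prefixes = tuple(str(code) for code in [4451, 4452, 447, 772, 45299, 45291, 45212])

sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})

with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            if code.startswith(prefixes):   # one check against all prefixes
                fout.write('{}\n'.format(code))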
I have a text file with many tens of thousands short sentences like this:
go to venice
come back from grece
new york here i come
from belgium to russia and back to spain
I run a tagging algorithm which produces a tagged output of this sentence file:
go to <place>venice</place>
come back from <place>grece</place>
<place>new york</place> here i come
from <place>belgium</place> to <place>russia</place> and back to <place>spain</place>
The algorithm runs over the input multiple times and produces slightly different tagging each time. My goal is to identify the lines where those differences occur; in other words, print all utterances for which the tagging differs across the N result files.
For example N=10, I get 10 tagged files. Suppose line 1 is tagged all the time the same for all 10 tagged files - do not print it. Suppose line 2 is tagged once this way and 9 times other way - print it. And so on.
For N=2 is easy, I just run diff. But what to do if I have N=10 results?
If you have the tagged files - just create a counter for each line of how many times you've seen it:
# use defaultdict for convenience
from collections import defaultdict

# start counting at 0
counter_dict = defaultdict(lambda: 0)

tagged_file_names = ['tagged1.txt', 'tagged2.txt', ...]

# add all lines of each file to dict
for file_name in tagged_file_names:
    with open(file_name) as f:
        # use enumerate to maintain order
        # produces (LINE_NUMBER, LINE CONTENT) tuples (hashable)
        for line_with_number in enumerate(f.readlines()):
            counter_dict[line_with_number] += 1

# print all values that do not repeat in all files (in same location)
for key, value in counter_dict.iteritems():
    if value < len(tagged_file_names):
        print "line number %d: [%s] only repeated %d times" % (
            key[0], key[1].strip(), value
        )
Walkthrough:
First of all, we create a data structure that lets us count our entries, which are numbered lines. This data structure is a collections.defaultdict with a default value of 0, which is the count of a newly added line (increased by 1 with each add).
Then, we create the actual entry using a tuple, which is hashable, so it can be used as a dictionary key, and is by default deeply comparable to other tuples. This means (1, "lolz") is equal to (1, "lolz") but different from (1, "not lolz") or (2, "lolz") - so it fits our use of deep-comparing lines to account for content as well as position.
Now all that's left to do is add all entries using a straightforward for loop and print the keys (which correspond to numbered lines) that do not appear in all files (that is, their count is less than the number of tagged files provided).
Example:
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged1.txt
123
abc
def
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged2.txt
123
def
def
reut@tHP-EliteBook-8470p:~/python/counter$ ./difference_counter.py
line number 1: [abc] only repeated 1 times
line number 1: [def] only repeated 1 times
If you compare all of them to the first text, you can get a list of all texts that are different. This might not be the quickest way, but it would work.
import difflib

n1 = '1 2 3 4 5 6'
n2 = '1 2 3 4 5 6'
n3 = '1 2 4 5 6 7'

l = [n1, n2, n3]
# collect every text that differs from the reference text l[0]
m = [x for x in l if x != l[0]]
# show a unified diff of the first differing text against the reference
diff = difflib.unified_diff(l[0].split(), m[0].split())
print '\n'.join(diff)
This is my first time writing a Python script, and I'm having some trouble getting started. Let's say I have a txt file named Test.txt that contains this information:
x y z Type of atom
ATOM 1 C1 GLN D 10 26.395 3.904 4.923 C
ATOM 2 O1 GLN D 10 26.431 2.638 5.002 O
ATOM 3 O2 GLN D 10 26.085 4.471 3.796 O
ATOM 4 C2 GLN D 10 26.642 4.743 6.148 C
What I want to do is eventually write a script that will find the center of mass of these three atoms. So basically I want to sum up all of the x values in that txt file with each number multiplied by a given value depending on the type of atom.
I know I need to define the positions for each x-value, but I'm having trouble figuring out how to represent these x-values as numbers instead of as text pulled from a string. I have to keep in mind that I'll need to multiply these numbers by a value that depends on the type of atom, so I need a way to keep them associated with each atom type. Can anyone push me in the right direction?
mass_dictionary = {'C':12.0107,
                   'O':15.999
                   #Others...?
                  }

# If your files are this structured, you can just
# hardcode some column assumptions.
coords_idxs = [6,7,8]
type_idx = 9

# Open file, get lines, close file.
# Probably prudent to add try-except here for bad file names.
f_open = open("Test.txt",'r')
lines = f_open.readlines()
f_open.close()

# Initialize an array to hold needed intermediate data.
output_coms = []; total_mass = 0.0;

# Loop through the lines of the file.
for line in lines:

    # Split the line on white space.
    line_stuff = line.split()

    # If the line is empty or fails to start with 'ATOM', skip it.
    if (not line_stuff) or (not line_stuff[0]=='ATOM'):
        pass

    # Otherwise, append the mass-weighted coordinates to a list and increment total mass.
    else:
        output_coms.append([mass_dictionary[line_stuff[type_idx]]*float(line_stuff[i]) for i in coords_idxs])
        total_mass = total_mass + mass_dictionary[line_stuff[type_idx]]

# After getting all the data, finish off the averages.
avg_x, avg_y, avg_z = tuple(map( lambda x: (1.0/total_mass)*sum(x), [[elem[i] for elem in output_coms] for i in [0,1,2]]))

# A lot of this will be better with NumPy arrays if you'll be using this often or on
# larger files. Python Pandas might be an even better option if you want to just
# store the file data and play with it in Python.
Basically, using the open function in Python you can open any file, so you can do something as follows (the following snippet is not a solution to the whole problem, but an approach):
def read_file():
    f = open("filename", 'r')
    for line in f:
        line_list = line.split()
        ....
        ....
    f.close()
From this point on you have a nice setup for working with these values. Basically, the second line just opens the file for reading. The third line defines a for loop that reads the file one line at a time, and each line goes into the line variable.
The last line in that snippet breaks the string -- at every whitespace -- into a list, so line_list[0] will be the value in your first column and so forth. From this point, if you have any programming experience, you can just use if statements and such to get the logic that you want.
** Also keep in mind that the values stored in that list will all be strings, so if you want to perform any arithmetic operations such as adding, you have to be careful.
* Edited for syntax correction
If you have pandas installed, check out the read_fwf function, which imports a fixed-width file and creates a DataFrame (a 2-d tabular data structure). It'll save you lines of code on the import and also give you a lot of data-munging functionality if you want to do any additional data manipulation.
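As a rough sketch of that suggestion (the column names are made up for illustration, and the inferred column boundaries assume the columns in Test.txt are aligned):

import pandas as pd

# Skip the "x y z Type of atom" header line and let pandas infer the
# fixed-width column boundaries from the data rows.
cols = ['record', 'serial', 'atom', 'res', 'chain', 'resnum',
        'x', 'y', 'z', 'element']
df = pd.read_fwf('Test.txt', skiprows=1, names=cols)

masses = {'C': 12.0107, 'O': 15.999}
df['mass'] = df['element'].map(masses)

# mass-weighted average of the coordinates gives the center of mass
com = (df[['x', 'y', 'z']].multiply(df['mass'], axis=0).sum()
       / df['mass'].sum())
print(com)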