I have a list of floating point numbers in a file in column like this:
123.456
234.567
345.678
How can I generate an output file where each line is the value on that line minus the value on the line just above it? For the input file above, the output generated should be:
123.456-123.456
234.567-123.456
345.678-234.567
The first value should yield zero, but each of the other values should have the value just above it subtracted from it. This is not a homework question; it is a small part of a bigger problem, and I am stuck at this point. Help much appreciated. Thanks!
This will work:
diffs = [0] + [j - data[i] for i,j in enumerate(data[1:])]
So, assuming data.txt contains:
123.456
234.567
345.678
then
with open('data.txt') as f:
    data = f.readlines()

diffs = [0] + [float(j) - float(data[i]) for i,j in enumerate(data[1:])]
print diffs
will yield
[0, 111.111, 111.11099999999999]
This answer assumes you want to keep the computed values for further processing.
If at some point you want to write these out to a file, line by line:
with open('result.txt', 'w') as outf:
    for i in diffs:
        outf.write('{0:12.5f}\n'.format(i))
and adjust the field widths to suit your needs (right now 12 spaces reserved, 5 after the decimal point), written out to file result.txt.
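For example, this is what the format above produces for a single value (12 characters total, 5 after the decimal point):

>>> '{0:12.5f}'.format(111.111)
'   111.11100'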
UPDATE: Given (from the comments below) that there is possibly too much data to hold in memory, this solution should work. Python 2.6 doesn't allow opening both files in the same with, hence the separate statements.
with open('result2.txt', 'w') as outf:
    outf.write('{0:12.5f}\n'.format(0.0))
    prev_item = 0.0
    with open('data.txt') as inf:
        for i, item in enumerate(inf):
            item = float(item.strip())
            val = item - prev_item
            if i > 0:
                outf.write('{0:12.5f}\n'.format(val))
            prev_item = item
Has a bit of a feel of a hack. Doesn't create a huge list in memory though.
Given a list of values:
[values[i] - values[i-1] if i > 0 else 0.0 for i in range(len(values))]
Instead of list comprehensions or generator expressions, why not write your own generator that can have arbitrarily complex logic and easily operate on enormous data sets?
from itertools import imap

def differences(values):
    yield 0  # The initial 0 you wanted
    iterator = imap(float, values)
    last = iterator.next()
    for value in iterator:
        yield value - last
        last = value

with open('data.txt') as f:
    data = f.readlines()

with open('outfile.txt', 'w') as f:
    for value in differences(data):
        f.write('%s\n' % value)
If data holds just a few values, the benefit wouldn't necessarily be so clear (although the explicitness of the code itself might be nice next year when you have to come back and maintain it). But suppose data was a stream of values from a huge (or infinite!) source and you wanted to process the first thousand values from it:
diffs = differences(enormousdataset)
for count in xrange(1000):
    print diffs.next()
Finally, this plays well with data sources that aren't indexable. Solutions that track index numbers to look up values don't play well with the output of generators.
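For instance, a quick sketch feeding differences() a generator expression rather than a list (the values here are made up purely for illustration):

squares = (x * x for x in xrange(1, 6))   # a non-indexable source yielding 1, 4, 9, 16, 25
print list(differences(squares))          # [0, 3.0, 5.0, 7.0, 9.0]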
I have a CSV file that has district, party, votes, state, and year columns. I want to rank every district from Texas from the lowest to the highest number of Democratic votes, only for the year 2018. How can I achieve this with the most beginner-friendly code? Would I have to use selection/insertion sort?
If you were to do this using Java (not sure how beginner friendly it is), you could do something like
String s = Files.readString(pathToYourFile);
and then split the string into an array of strings on the newline character "\n".
Then you can split each line on the character "," since it is CSV, and parse it into an object. Now you can put the objects in a List and do a Collections.sort(yourList).
Or, you could use a SQL database :)
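For what it's worth, here is a minimal sketch of the SQL-database route using Python's built-in sqlite3 module. The file name, column names and the 'DEMOCRAT' label are guesses based on the question, so adjust them to match the real data:

import csv
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE results '
             '(district TEXT, party TEXT, votes INTEGER, state TEXT, year INTEGER)')

# 'election.csv' and the column names are hypothetical; match them to your file's header
with open('election.csv') as f:
    rows = [(r['district'], r['party'], int(r['votes']), r['state'], int(r['year']))
            for r in csv.DictReader(f)]
conn.executemany('INSERT INTO results VALUES (?, ?, ?, ?, ?)', rows)

# rank Texas districts by Democratic votes in 2018, lowest first
query = ('SELECT district, votes FROM results '
         "WHERE state = 'TX' AND party = 'DEMOCRAT' AND year = 2018 "
         'ORDER BY votes ASC')
for district, votes in conn.execute(query):
    print(district, votes)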
I would use bubble sort; it is the simplest algorithm for sorting numbers.
def bubble_sort(your_list):
    has_swapped = True
    index_of_votes_value = 0  # assuming the list is 2D and the votes value comes first in each row
    num_of_iterations = 0
    while has_swapped:
        has_swapped = False
        for i in range(len(your_list) - num_of_iterations - 1):
            if your_list[i][index_of_votes_value] > your_list[i+1][index_of_votes_value]:
                # Swap the whole rows so the votes stay attached to their district
                your_list[i], your_list[i+1] = your_list[i+1], your_list[i]
                has_swapped = True
        num_of_iterations += 1
This algorithm iterates through your list, it checks if each index is larger than the one after it, and if it is, it swaps them. It keeps going through the list until the list is in order.
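For example, a quick sketch of calling it on some made-up rows (votes first in each row, matching index_of_votes_value above; the districts and numbers are purely illustrative):

rows = [[1200, 'DEM', 'TX-07'], [350, 'DEM', 'TX-02'], [980, 'DEM', 'TX-21']]
bubble_sort(rows)
print(rows)  # [[350, 'DEM', 'TX-02'], [980, 'DEM', 'TX-21'], [1200, 'DEM', 'TX-07']]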
This is how I would personally do it:
with open('data.csv') as f:
    file_content = f.read()

# split on ',' since the file is comma-separated; x[2] is assumed to be the votes column
data = [line.split(',') for line in file_content.splitlines()]
sorted_data = sorted(data, key=lambda x: int(x[2]))
print('\n'.join(','.join(x) for x in sorted_data))
I have a piece of code where I need to go through many lists to read and assign values.
The code does the following:
First, for each element in the entities list (first list, ~1300 entries), the code reads that entity's text file, which contains many lines (second list, ~5000); each line contains two values. The code then checks whether the first value on each line exists in the features list (third list, ~17000); if it does, it writes the second value from that line into the matrix.
The code is working, but it is inefficient and extremely slow (more than 12 hours).
# first list
for i in range(len(entities_list)-1):
    fin = open('/home/rana/'+entities_list[i]+'.txt','r')
    # second list
    for line in fin.readlines():
        # third list
        for j in range(len(features_list)-1):
            if features_list[j] == line.split()[0]:
                co_occurrence_matrix[i,j] = float(line.split()[1])
I would appreciate it if someone could give me an idea of how to solve this issue.
Your feature lookup in the inner loop is slow (O(n)) and repeated 1300 x 5000 ~ 6.5M times. The first thing you can do is convert features_list to a dict, which speeds that lookup up to O(1) and eliminates the third loop:
features = dict(zip(features_list, range(len(features_list))))

for i in range(len(entities_list)):
    with open('/home/rana/'+entities_list[i]+'.txt', 'r') as fin:
        for line in fin:
            key, value = line.split()
            if key in features:
                j = features[key]
                co_occurrence_matrix[i,j] = float(value)
You can completely optimize away the third loop by creating a map upfront:
# first create a matrix map for fast features lookup
features_map = {feature: index for index, feature in enumerate(features_list)}

for index, entity in enumerate(entities_list):
    with open('/home/rana/{}.txt'.format(entity), 'r') as f:
        for line in f:
            feature, value = line.split()  # you might want to validate this, tho
            if feature in features_map:
                co_occurrence_matrix[index, features_map[feature]] = float(value)
You might be able to speed it up further by delegating your I/O part (loading of the files) over multiple threads if the files are particularly big.
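If you want to try that, here is a rough sketch using concurrent.futures (Python 3, or the futures backport on Python 2). It assumes the features_map, entities_list and co_occurrence_matrix from above; the worker count is arbitrary and the actual gain depends heavily on your disk, so treat it as a starting point rather than a guaranteed speedup:

from concurrent.futures import ThreadPoolExecutor

def load_entity(args):
    index, entity = args
    # Read one entity file and collect the (row, column, value) triples it contributes.
    triples = []
    with open('/home/rana/{}.txt'.format(entity), 'r') as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            feature, value = parts
            if feature in features_map:
                triples.append((index, features_map[feature], float(value)))
    return triples

# Threads read the files in parallel; the matrix is filled in the main thread.
with ThreadPoolExecutor(max_workers=8) as pool:
    for triples in pool.map(load_entity, enumerate(entities_list)):
        for i, j, value in triples:
            co_occurrence_matrix[i, j] = value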
I have a sample file that looks like this:
#XXXXXXXXX
VXVXVXVXVX
+
ZZZZZZZZZZZ
#AAAAAA
YBYBYBYBYBYBYB
ZZZZZZZZZZZZ
...
I wish to only read the lines that fall on the index 4i+2, where i starts at 0. So I should read the VXVXV (4*0+2 = 2)... line and the YBYB...(4*1 +2 = 6)line in the snippet above. I need to count the number of 'V's, 'X's,'Y's and 'B's and store in a pre-existing dict.
fp = open(fileName, "r")
lines = fp.readlines()
for i in xrange(1, len(lines), 4):
    for c in str(lines[i]):
        if c == 'V':
            some_dict['V'] += 1
Can someone explain how do I avoid going off index and only read in the lines at the 4*i+2 index of the lines list?
Can't you just slice the list of lines?
lines = fp.readlines()
interesting_lines = lines[2::4]
Edit for others questioning how it works:
The "full" slice syntax is three parts: start:end:step
The start is the starting index, or 0 by default. Thus, for a 4 * i + 2, when i == 0, that is index #2.
The end is the ending index, or len(sequence) by default. Slices go up to but not including the last index.
The step is the increment between chosen items, 1 by default. Normally, a slice like 3:7 would return elements 3,4,5,6 (and not 7). But when you add a step parameter, you can do things like "step by 4".
Doing "step by 4" means start+0, start+4, start+8, start+12, ... which is what the OP wants, so long as the start parameter is chosen correctly.
You can do one of the following:
Start xrange at 0, then add 2 to i in the inner loop:
for i in xrange(0, len(lines), 4):
    for c in str(lines[i+2]):
        if c == 'V':
            some_dict['V'] += 1
Or start xrange at 2, then access lines[i] the way your original program does:
for i in xrange(2, len(lines), 4):
    for c in str(lines[i]):
        if c == 'V':
            some_dict['V'] += 1
I'm not quite clear on what you're trying to do here: are you actually trying to read only the lines you want from disk? (In which case you've gone wrong from the start, because readlines() reads the whole file.) Or are you just trying to filter the list of lines to pick out the ones you want?
I'll assume the latter, in which case the easiest thing to do would be to use a list comprehension to filter the lines by index, e.g. something simple like:
indices = [x[0] * 4 + 2 for x in enumerate(lines)]
filtered_lines = [lines[i] for i in indices if len(lines) > i]
and there you go, you've got just the lines you want, no index errors or anything silly like that. Then you can separate out and simplify the rest of your code to do the counting, just operating on the filtered list.
(just slightly edited the first list comp to be a little more idiomatic)
I already gave a similar answer to another question: How would I do this in a file?
A better solution (avoiding unnecessary for loops) would be:
fp = open(fileName, "r")

def addToDict(letter):
    someDict[letter] += 1

[addToDict(c) for a in fp.readlines()[2::4] for c in a if c == 'V']
I tried to make this an anonymous function without success, if someone can do that it would be excellent.
So, I'm having a lot of trouble finding the largest decimal value in a massive list of strings (1500ish). Here's what I have within a function (to find the max value):
all_data_lines = data.split('\n')
maxvalue = 0.0
for item in all_data_lines:
    temp = item.split(',')[1]
    if float(temp) > maxvalue:
        maxvalue = float(temp)
return maxvalue
The data file is essentially a huge list separated by newlines and then by commas, so I need to compare the second comma-separated element on every line.
This is what I have above. For some reason, I'm getting this error:
in max_temperature
temp = item.split(',')[1];
IndexError: list index out of range
You apparently have lines with no comma on them; perhaps you have empty lines. If you are using data.split('\n'), then you are liable to end up with a last, empty value, for example:
>>> '1\n2\n'.split('\n')
['1', '2', '']
>>> '1\n2\n'.splitlines()
['1', '2']
Using str.splitlines() on the other hand produces a list without a last empty value.
Rather than split on each line manually, and loop manually, use the csv module and a generator expression:
import csv

def foo(data):
    reader = csv.reader(data.splitlines(), quoting=csv.QUOTE_NONNUMERIC)
    return max(r[1] for r in reader if len(r) > 1)
This delegates splitting to the csv.reader() object, leaving you free to focus on testing for rows with enough elements to have a second column.
The csv.QUOTE_NONNUMERIC option tells the reader to convert values to floats for you so you don't even have to do that anymore either. This, however, works only if all columns without quotes are numeric. If this is not the case and you get ValueErrors instead, you can still do the conversion manually:
def foo(data):
    reader = csv.reader(data.splitlines())
    return max(float(r[1]) for r in reader if len(r) > 1)
I have a list of lists called sorted_lists.
I'm using the line below to write them into a txt file. The thing is, it writes ALL the lists. I'm trying to figure out how to write only the first n (n = any number), for example the first 10 lists.
f.write ("\n".join("\t".join (row) for row in sorted_lists)+"\n")
Try the following:
f.write ("\n".join("\t".join (row) for row in sorted_lists[0:N])+"\n")
where N is the number of lists you want to print.
sorted_lists[0:N] takes the first N lists (indices 0 through N-1, so N lists in total; sorted_lists[N] is excluded). You could also write sorted_lists[:N], which implicitly starts from the first item of the list (item 0). They are the same; the latter may be considered more elegant.
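For example:

>>> nums = [10, 20, 30, 40, 50]
>>> nums[0:3]
[10, 20, 30]
>>> nums[:3]
[10, 20, 30]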
f.write ('\n'.join('\t'.join(row) for row in sorted_lists[:n])+'\n')
where n is the number of lists you want to write.
Why not simplify this code and use the right tools:
from itertools import islice
import csv

first10 = islice(sorted_lists, 10)
with open('output.tab', 'wb') as fout:
    tabout = csv.writer(fout, delimiter='\t')
    tabout.writerows(first10)
You should read up on Python's slicing features.
If you want to look at only the first 10 entries of sorted_lists, you could do sorted_lists[0:10].