How to save two-column array in a file in Python [duplicate] - python

This question already has an answer here:
How to add new column to output file in Python?
(1 answer)
Closed 7 years ago.
I have this code (see this thread) for saving a two-column array to a file. The thing is that I need to call this function N times:
def save(self):
    n = self.n
    with open("test.csv", "a") as f:
        f.write("name\tnum\n")
        for k, v in tripo.items():
            if v:
                f.write(n + "\t")
                f.write("{}\n".format(k.split(".")[0]))
                for s in v:
                    f.write(n + "\t")
                    f.write("\n".join([s.split(".")[0]]) + "\n")
This is the sample content of tripo for n=1:
{
    '1.txt': [],
    '2.txt': [],
    '5.txt': [],
    '4.txt': ['3.txt', '6.txt'],
    '7.txt': ['8.txt']
}
This is the expected output for n=1...N:
name num
1 4
1 3
1 6
1 7
1 8
...
N 3
N 6
N ...
However, the above-given code puts some values in the same column.
UPDATE:
For instance, if I have the entry '170.txt': ['46.txt','58.txt','86.txt'], then I receive this result:
1 1 1 1 170
46
58
86
instead of:
1 170
1 46
1 58
1 86

import os

tripo = [
    ('1.txt', []),
    ('2.txt', []),
    ('5.txt', []),
    ('4.txt', ['3.txt', '6.txt']),
    ('7.txt', ['8.txt'])
]

def getname(f):
    return os.path.splitext(f)[0]

def getresult(t):
    result = []
    for k, v in t:
        values = [getname(n) for n in v]
        if len(values) > 0:
            result.append(getname(k))
            for x in values:
                result.append(x)
    return result

def writedown(n, r):
    # one "n<TAB>value" pair per output row
    with open("test.csv", "a") as f:
        for x in r:
            f.write("%s\t%s\n" % (n, x))
            print("%s\t%s\n" % (n, x))

print(getresult(tripo))
writedown(1, getresult(tripo))

Use pickle. Use pickle.dump to store the object to a file and pickle.load to load it back.
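A minimal round-trip sketch (assuming the tripo dict from the question; the file name tripo.pkl is just an example, and the result is a binary pickle file rather than a CSV):

import pickle

# assuming `tripo` is the dict from the question
with open("tripo.pkl", "wb") as f:
    pickle.dump(tripo, f)

with open("tripo.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == tripo)  # True if the round trip preserved the structure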

I don't quite understand your question.
Is the object representation correct but the writing to the file incorrect?
If that is the case then, as Dan said, using pickle could be useful.
import pickle

# writing
f = open('test.csv', 'wb')
s = pickle.dumps(obj)  # obj is the structure you want to preserve
f.write(s)
f.close()

# for reading
f = open('test.csv', 'rb')
serialized_object = pickle.load(f)
f.close()
The serialized_object variable should have the structure you want to preserve.

Related

Memory efficient way to read an array of integers from single line of input in python2.7

I want to read a single line of input containing integers separated by spaces.
Currently I use the following.
A = map(int, raw_input().split())
But now the N is around 10^5 and I don't need the whole array of integers, I just need to read them 1 at a time, in the same sequence as the input.
Can you suggest an efficient way to do this in Python 2.7?
Use generators:
numbers = '1 2 5 18 10 12 16 17 22 50'
gen = (int(x) for x in numbers.split())
for g in gen:
    print g
1
2
5
18
10
12
16
17
22
50
The generator object yields one item at a time and won't construct a whole list.
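Note that the generator expression above still calls numbers.split(), which builds the full list of tokens first. If the input is one huge line and even that is too much, something along these lines would yield one integer at a time (a rough sketch; read_ints is a hypothetical helper that reads sys.stdin in fixed-size chunks):

import sys

def read_ints(stream, bufsize=4096):
    leftover = ''
    while True:
        buf = stream.read(bufsize)
        if not buf:
            break
        buf = leftover + buf
        parts = buf.split()
        # if the buffer does not end on whitespace, the last token may be cut off
        if not buf[-1].isspace():
            leftover = parts.pop()
        else:
            leftover = ''
        for p in parts:
            yield int(p)
    if leftover:
        yield int(leftover)

for n in read_ints(sys.stdin):
    print n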
You could parse the data a character at a time; this would reduce memory usage:
data = "1 50 30 1000 20 4 1 2"
number = []
numbers = []
for c in data:
    if c == ' ':
        if number:
            numbers.append(int(''.join(number)))
            number = []
    else:
        number.append(c)

if number:
    numbers.append(int(''.join(number)))

print numbers
Giving you:
[1, 50, 30, 1000, 20, 4, 1, 2]
Probably quite a bit slower though.
Alternatively, you could use itertools.groupby() to read groups of digits as follows:
from itertools import groupby

data = "1 50 30 1000 20 4 1 2"
numbers = []
for k, g in groupby(data, lambda c: c.isdigit()):
    if k:
        numbers.append(int(''.join(g)))

print numbers
If you're able to destroy the original string, split accepts a parameter for the maximum number of breaks.
See docs for more details and examples.
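A rough sketch of that idea, consuming the string one number at a time (reusing the numbers string from the earlier example):

s = '1 2 5 18 10 12 16 17 22 50'
while s:
    parts = s.split(' ', 1)   # split off only the first token
    print int(parts[0])
    s = parts[1] if len(parts) > 1 else ''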

Python - read 1000 lines from a file at a time

I've checked this, this and this.
The third link seemed to have the answer, yet it didn't do the job.
I can't use a solution that brings the whole file into main memory, as the files I'll be working with will be very large. So I decided to use islice as shown in the third link. The first two links were irrelevant, as they used it for only 2 lines or to read 1000 characters, whereas I need 1000 lines. For now N is 1000.
My file contains 1 million lines:
Sample:
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
1 6 1
1 7 1
1 8 1
1 9 1
1 10 1
So if I read 1000 lines at a time, I should go through the while loop 1000 times, yet when I print p to check how many iterations I've been through, it doesn't stop at 1000. It reached 19038838 after running my program for 1400 seconds!
CODE:
def _parse(pathToFile, N, alg):
    p = 1
    with open(pathToFile) as f:
        while True:
            myList = []
            next_N_lines = islice(f, N)
            if not next_N_lines:
                break
            for line in next_N_lines:
                s = line.split()
                x, y, w = [int(v) for v in s]
                obj = CoresetPoint(x, y)
                Wobj = CoresetWeightedPoint(obj, w)
                myList.append(Wobj)
            a = CoresetPoints(myList)
            client.compressPoints(a)  # this line is not the problem
            print(p)
            p = p + 1
    c = client.getTotalCoreset()
    return c
What am I doing wrong?
As @Ev.kounis said, your while loop doesn't seem to work properly.
I would recommend using a generator (yield) to read a chunk of data at a time, like this:
def get_line():
    with open('your file') as file:
        for i in file:
            yield i

lines_required = 1000
gen = get_line()
chunk = [next(gen) for i in range(lines_required)]
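Alternatively, a minimal sketch of fixing the original loop directly: islice always returns an iterator object, which is truthy even at end of file, so materialize each chunk with list() and stop when it comes back empty (read_in_chunks and points.txt are just illustrative names):

from itertools import islice

def read_in_chunks(path, n=1000):
    with open(path) as f:
        while True:
            chunk = list(islice(f, n))   # at most n lines
            if not chunk:                # empty list means end of file
                break
            yield chunk

# usage: each `lines` is a list of up to 1000 raw lines
# for lines in read_in_chunks("points.txt"):
#     process(lines)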

What is the equivalent to scala.util.Try in pyspark?

I've got a lousy HTTPD access_log and just want to skip the "lousy" lines.
In scala this is straightforward:
import scala.util.Try
val log = sc.textFile("access_log")
log.map(_.split(' ')).map(a => Try(a(8))).filter(_.isSuccess).map(_.get).map(code => (code,1)).reduceByKey(_ + _).collect()
For Python I've got the following solution, explicitly defining a function rather than using the lambda notation:
log = sc.textFile("access_log")

def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'

log.map(lambda s: s.split(' ')).map(wrapException).filter(lambda s: s != 'error').map(lambda code: (code, 1)).reduceByKey(lambda acu, value: acu + value).collect()
Is there a better way doing this (e.g. like in Scala) in pyspark?
Thanks a lot!
Better is a subjective term but there are a few approaches you can try.
The simplest thing you can do in this particular case is to avoid exceptions altogether. All you need is a flatMap and some slicing:
log.flatMap(lambda s : s.split(' ')[8:9])
As you can see, this means there is no need for exception handling or a subsequent filter.
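For example, the full pipeline from the question could then look roughly like this (a sketch, reusing the log RDD from the question):

counts = (log.map(lambda s: s.split(' '))
             .flatMap(lambda a: a[8:9])              # empty list when the line is too short
             .map(lambda code: (code, 1))
             .reduceByKey(lambda acc, value: acc + value))

counts.collect()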
The previous idea can be extended with a simple wrapper:
def seq_try(f, *args, **kwargs):
    try:
        return [f(*args, **kwargs)]
    except:
        return []
and example usage
from operator import div # FYI operator provides getitem as well.
rdd = sc.parallelize([1, 2, 0, 3, 0, 5, "foo"])
rdd.flatMap(lambda x: seq_try(div, 1., x)).collect()
## [1.0, 0.5, 0.3333333333333333, 0.2]
Finally, a more OO approach:
import inspect as _inspect

class _Try(object): pass

class Failure(_Try):
    def __init__(self, e):
        if Exception not in _inspect.getmro(e.__class__):
            msg = "Invalid type for Failure: {0}"
            raise TypeError(msg.format(e.__class__))
        self._e = e
        self.isSuccess = False
        self.isFailure = True
    def get(self): raise self._e
    def __repr__(self):
        return "Failure({0})".format(repr(self._e))

class Success(_Try):
    def __init__(self, v):
        self._v = v
        self.isSuccess = True
        self.isFailure = False
    def get(self): return self._v
    def __repr__(self):
        return "Success({0})".format(repr(self._v))

def Try(f, *args, **kwargs):
    try:
        return Success(f(*args, **kwargs))
    except Exception as e:
        return Failure(e)
and example usage:
tries = rdd.map(lambda x: Try(div, 1.0, x))
tries.collect()
## [Success(1.0),
## Success(0.5),
## Failure(ZeroDivisionError('float division by zero',)),
## Success(0.3333333333333333),
## Failure(ZeroDivisionError('float division by zero',)),
## Success(0.2),
## Failure(TypeError("unsupported operand type(s) for /: 'float' and 'str'",))]
tries.filter(lambda x: x.isSuccess).map(lambda x: x.get()).collect()
## [1.0, 0.5, 0.3333333333333333, 0.2]
You can even use pattern matching with multipledispatch
from multipledispatch import dispatch
from operator import getitem

@dispatch(Success)
def check(x): return "Another great success"

@dispatch(Failure)
def check(x): return "What a failure"
a_list = [1, 2, 3]
check(Try(getitem, a_list, 1))
## 'Another great success'
check(Try(getitem, a_list, 10))
## 'What a failure'
If you like this approach, I've pushed a slightly more complete implementation to GitHub and PyPI.
First, let me generate some random data to start working with.
import random

number_of_rows = int(1e6)
line_error = "error line"
text = []
for i in range(number_of_rows):
    choice = random.choice([1, 2, 3, 4])
    if choice == 1:
        line = line_error
    elif choice == 2:
        line = "1 2 3 4 5 6 7 8 9_1"
    elif choice == 3:
        line = "1 2 3 4 5 6 7 8 9_2"
    elif choice == 4:
        line = "1 2 3 4 5 6 7 8 9_3"
    text.append(line)
Now the generated text looks like this:
1 2 3 4 5 6 7 8 9_2
error line
1 2 3 4 5 6 7 8 9_3
1 2 3 4 5 6 7 8 9_2
1 2 3 4 5 6 7 8 9_3
1 2 3 4 5 6 7 8 9_1
error line
1 2 3 4 5 6 7 8 9_2
....
Your solution:
def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'

log.map(lambda s: s.split(' ')).map(wrapException).filter(lambda s: s != 'error').map(lambda code: (code, 1)).reduceByKey(lambda acu, value: acu + value).collect()
#[('9_3', 250885), ('9_1', 249307), ('9_2', 249772)]
Here is my solution:
from operator import add

def myfunction(l):
    try:
        return (l.split(' ')[8], 1)
    except:
        return ('MYERROR', 1)

log.map(myfunction).reduceByKey(add).collect()
#[('9_3', 250885), ('9_1', 249307), ('MYERROR', 250036), ('9_2', 249772)]
Comments:
(1) I highly recommend also counting the "error" lines: it won't add much overhead, and it doubles as a sanity check, since all the counts should add up to the total number of rows in the log. If you filter those lines out, you have no idea whether they are truly bad lines or whether something went wrong in your coding logic.
(2) I try to package all the line-level operations into one function and avoid chaining map and filter calls, so the code is more readable.
(3) From a performance perspective, I generated a sample of 1M records; my code finished in 3 seconds and yours in 2 seconds. That is not a fair comparison, since the data is so small and my cluster is pretty beefy, so I would recommend generating a bigger file (1e12 rows?) and benchmarking on your own setup.

How to randomly sample from 4 csv files so that no more than 2/3 rows appear in order from each csv file, in Python

Hi, I'm very new to Python and I'm trying to create a program that takes a random sample from a CSV file and makes a new file subject to some conditions. What I have done so far is probably highly over-complicated and not efficient (though it doesn't need to be).
I have 4 CSV files that contain 264 rows in total, where each full row is unique, though they all share common values in some columns.
csv1 = 72 rows, csv2 = 72 rows, csv3 = 60 rows, csv4 = 60 rows. I need to take a random sample of 160 rows which will make 4 blocks of 40, where in each block 10 must come from each csv file. The tricky part is that no more than 2 or 3 rows from the same CSV file can appear in order in the final file.
So far I have managed to take a random sample of 40 rows from each CSV (just using random.sample) and output them to 4 new CSV files. Then I split each of those into 4 files of 10 rows each, placed in separate folders (1-4), so I now have 4 folders each containing 4 CSV files. Now I need to combine these so that rows from the same original CSV file don't appear more than 2 or 3 times in a row, and the row order is as random as possible. This is where I'm completely lost. I'm presuming that I should combine the 4 files in each folder (which I can do) and then re-sample or shuffle in a loop until the conditions are met, or something to that effect, but I'm not sure how to proceed, or whether I'm going about this in completely the wrong way. Any help anyone can give me would be greatly appreciated, and I can provide any further details that are necessary.
var_start = 1
total_condition_amount_start = 1

while (var_start < 5):
    with open("condition"+`var_start`+".csv", "rb") as population1:
        conditions1 = [line for line in population1]
    random_selection1 = random.sample(conditions1, 40)
    with open("./temp/40cond"+`var_start`+".csv", "wb") as temp_output:
        temp_output.write("".join(random_selection1))
    var_start = var_start + 1

while (total_condition_amount_start < total_condition_amount):
    folder_no = 1
    splitter.split(open("./temp/40cond"+`total_condition_amount_start`+".csv", 'rb'))
    shutil.move("./temp/output_1.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_2.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_3.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_4.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    total_condition_amount_start = total_condition_amount_start + 1
You should probably try using the built-in csv lib: http://docs.python.org/3.3/library/csv.html
That way you can handle each file as a list of dictionaries, which will make your task a lot easier.
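For instance, a minimal sketch of reading one of the files that way (assuming condition1.csv has a header row; the column names in the comment are made up):

import csv

with open("condition1.csv", "rb") as f:    # "rb" as in the original code (Python 2)
    rows = list(csv.DictReader(f))         # each row becomes a dict keyed by the header

print rows[0]      # e.g. {'col_a': '...', 'col_b': '...'}
print len(rows)    # number of data rows in this file

The code below uses randomly generated placeholder data in place of rows read this way.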
from random import randint, sample, choice

def create_random_list(length):
    return [randint(0, 100) for i in range(length)]

# This should be your list of four initial csv files
# with the 264 rows in total, read with the csv lib
lists = [create_random_list(264) for i in range(4)]

# Take a randomized sample from the lists
lists = map(lambda x: sample(x, 40), lists)

# Add some bookkeeping variables to the lists
lists = map(lambda x: {'data': x, 'full_count': 0}, lists)

final = [[] for i in range(4)]

for l in final:
    prev = None
    count = 0
    while len(l) < 40:
        current = choice(lists)
        if current['full_count'] == 10 or (current is prev and count == 3):
            continue
        # Take an item from the chosen list if it hasn't been used 3 times in a
        # row or is already used 10 times. Append that item to the final list
        total_left = 40 - len(l)
        maxx = 0
        for i in lists:
            if i is not current and 10 - i['full_count'] > maxx:
                maxx = 10 - i['full_count']
        current_left = 10 - current['full_count']
        max_left = maxx + maxx / 3.0
        if maxx > 3 and total_left <= max_left:
            # Make sure that in the future it can still be split into sets of
            # max 3
            continue
        l.append(current['data'].pop())
        count += 1
        current['full_count'] += 1
        if current is not prev:
            count = 0
        prev = current
    for li in lists:
        li['full_count'] = 0

Printing in a loop

I have the following file I'm trying to manipulate.
1 2 -3 5 10 8.2
5 8 5 4 0 6
4 3 2 3 -2 15
-3 4 0 2 4 2.33
2 1 1 1 2.5 0
0 2 6 0 8 5
The file just contains numbers.
I'm trying to write a program to subtract the rows from each other and print the results to a file. My program is below; dtest.txt is the name of the input file, and the name of the program is make_distance.py.
from math import *

posnfile = open("dtest.txt", "r")
posn = posnfile.readlines()
posnfile.close()

for i in range(len(posn) - 1):
    for j in range(0, 1):
        if (j == 0):
            Xp = float(posn[i].split()[0])
            Yp = float(posn[i].split()[1])
            Zp = float(posn[i].split()[2])
            Xc = float(posn[i+1].split()[0])
            Yc = float(posn[i+1].split()[1])
            Zc = float(posn[i+1].split()[2])
        else:
            Xp = float(posn[i].split()[3*j+1])
            Yp = float(posn[i].split()[3*j+2])
            Zp = float(posn[i].split()[3*j+3])
            Xc = float(posn[i+1].split()[3*j+1])
            Yc = float(posn[i+1].split()[3*j+2])
            Zc = float(posn[i+1].split()[3*j+3])
        Px = fabs(Xc-Xp)
        Py = fabs(Yc-Yp)
        Pz = fabs(Zc-Zp)
        print Px, Py, Pz
The program calculates the values correctly, but when I call it and redirect the output to a file,
mpipython make_distance.py > distance.dat
the output file (distance.dat) contains only 3 columns when it should contain 6. How do I tell the program which columns to print to for each step j = 0, 1, ...?
For j = 0 the program should output to the first 3 columns, for j = 1 to the next 3 columns (3, 4, 5), and so on.
Finally, the len function gives the number of rows in the input file, but what function gives the number of columns?
Thanks.
Append a , to the end of your print statement and it will not print a newline; then, when you exit the inner for loop, add a bare print to move to the next row:
for j in range(0, 1):
    ...
    print Px, Py, Pz,
print
Assuming all rows have the same number of columns, you can get the number of columns by using len(row.split()).
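For the sample file above, that would look like this (reusing the posn list from the question):

ncols = len(posn[0].split())   # 6 for the sample dtest.txt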
Also, you can definitely shorten your code quite a bit. I'm not sure what the purpose of j is, but the following should be equivalent to what you're doing now:
for j in range(0, 1):
    Xp, Yp, Zp = map(float, posn[i].split()[3*j:3*j+3])
    Xc, Yc, Zc = map(float, posn[i+1].split()[3*j:3*j+3])
    ...
You don't need to:
use numpy
read the whole file in at once
know how many columns
use awkward comma at end of print statement
use list subscripting
use math.fabs()
explicitly close your file
Try this (untested):
with open("dtest.txt", "r") as posnfile:
previous = None
for line in posnfile:
current = [float(x) for x in line.split()]
if previous:
delta = [abs(c - p) for c, p in zip(current, previous)]
print ' '.join(str(d) for d in delta)
previous = current
Just in case your dtest.txt grows larger and you don't want to redirect your output but rather write to distance.dat directly, especially if you want to use numpy. Thanks @John for pointing out my mistake in the old code ;-)
import numpy as np
pos = np.genfromtxt("dtest.txt")
dis = np.array([np.abs(pos[j+1] - pos[j]) for j in xrange(len(pos)-1)])
np.savetxt("distance.dat",dis)
