joining every 4th line in csv-file - python

I'd like to join every 4th line together so I thought something like this would work:
import csv
filename = "mycsv.csv"
f = open(filename, "rb")
new_csv = []
count = 1
for i, line in enumerate(file(filename)):
    line = line.rstrip()
    print line
    if count % 4 == 0:
        new_csv.append(old_line_1 + old_line_2 + old_line_3+line)
    else:
        old_line_1 = line[i-2]
        old_line_2 = line[i-1]
        old_line_3 = line
    count += 1
print new_csv
But line[i-1] and line[i-2] do not give me the previous lines as I expected. So how can I access the lines one and two before the current one?

The variable line contains only the line for the current iteration, so accessing line[i-1] will only give you one character within the current line. The other answer is probably the tersest way to put it but, building on your code, you could do something like this instead:
filename = "mycsv.csv"
new_csv = []
lines = []
with open(filename) as f:
    # iterate over the raw text lines; csv.reader would yield lists, which have no rstrip()
    for i, line in enumerate(f):
        line = line.rstrip()
        lines.append(line)
        if (i + 1) % 4 == 0:
            new_csv.append("".join(lines))
            lines = []
print(new_csv)

This should do as you require
join_every_n = 4
with open(filename) as f: # `file` in the question is the Python 2 built-in; open() works in both
    all_lines = [line.rstrip() for line in f]
transposed_lines = list(zip(*[all_lines[n::join_every_n] for n in range(join_every_n)]))
joined = [''.join([l1,l2,l3,l4]) for (l1,l2,l3,l4) in transposed_lines]
Likewise, you could also do
joined = list(map(''.join, transposed_lines))
Explanation
This returns every i-th element of your_list, starting at offset n:
your_list[n::i]
Combining this across range(4) gives one list per offset, such that you get
[[line0, line4, ...], [line1, line5, ...], [line2, line6, ...], [line3, line7, ...]]
The zip(*...) call then transposes this array so that it becomes
[(line0, line1, line2, line3), (line4, line5, line6, line7), ...]
Now you can simply unpack and join each individual group.
Example
all_lines = list(map(str, range(100))) # list() so the slicing also works on Python 3
transposed_lines = zip(*[all_lines[n::4] for n in range(4)])
joined = [''.join([l1,l2,l3,l4]) for (l1,l2,l3,l4) in transposed_lines]
gives
['0123',
'4567',
'891011',
...
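A related way to get the same grouping without building the intermediate slices is to zip one iterator with itself n times. This is only a minimal sketch: the function name join_every_n_lines and the sample data are made up for illustration and do not come from the answers above. Leftover lines that do not fill a complete group are dropped, the same as with the slice-and-zip version.

```python
def join_every_n_lines(lines, n=4, sep=""):
    """Join each consecutive group of n lines into one string.

    Leftover lines that do not fill a complete group are dropped.
    """
    it = iter(lines)
    # zip pulls n items in turn from the *same* iterator,
    # so each tuple holds n consecutive lines.
    return [sep.join(group) for group in zip(*[it] * n)]


sample = [str(i) for i in range(10)]  # hypothetical data: "0" .. "9"
joined = join_every_n_lines(sample, 4)
print(joined)  # ['0123', '4567']
```

Because it only needs an iterator, the same function should also accept an open file object directly, as long as the trailing newlines are stripped first.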

Related

Selecting line from file by using "startswith" and "next" commands

I have a file from which I want to create a list ("timestep") from the numbers which appear after each line "ITEM: TIMESTEP" so:
timestep = [253400, 253500, .. etc]
Here is the sample of the file I have:
ITEM: TIMESTEP
253400
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
ITEM: TIMESTEP
253500
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
To do this I tried to use "startswith" and "next" together, but it didn't work. Is there another way to do it? Here is the code I'm trying to use:
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
    for line in lines:
        line = line.split()
        if line[0].startswith("ITEM: TIMESTEP"):
            timestep.append(next(line))
print(timestep)
The logic is to decide whether or not to append the current line to timestep. So what you need is a flag that tells you to append the current line whenever that flag is True.
timestep = []
append_to_list = False # decision variable
with open(file, 'r') as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip() # remove "\n" from line
        if line.startswith("ITEM"):
            # Update append_to_list
            if line == 'ITEM: TIMESTEP':
                append_to_list = True
            else:
                append_to_list = False
        else:
            # append to list if line doesn't start with "ITEM" and append_to_list is True
            if append_to_list:
                timestep.append(line)
print(timestep)
output:
['253400', '253500']
First - I don't like this, because it doesn't scale. You can only get the first immediately following line nicely, anything else will be just ugh...
But you asked, so ... for x in lines will create an iterator over lines and use that to keep the position. You don't have access to that iterator, so next will not be the next element you're expecting. But you can make your own iterator and use that:
lines_iter = iter(lines)
for line in lines_iter:
    # whatever was here
    timestep.append(next(lines_iter))
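To make the shared-iterator idea concrete, here is a small self-contained sketch. The in-memory lines list is a hypothetical stand-in for f.readlines() on the sample file, and converting the value with int() is my own assumption about the desired result.

```python
# Hypothetical stand-in for the result of f.readlines()
lines = [
    "ITEM: TIMESTEP\n",
    "253400\n",
    "ITEM: NUMBER OF ATOMS\n",
    "378\n",
    "ITEM: TIMESTEP\n",
    "253500\n",
]

timestep = []
lines_iter = iter(lines)
for line in lines_iter:
    if line.startswith("ITEM: TIMESTEP"):
        # next() advances the same iterator the for loop is using,
        # so the value line is consumed here and not seen again by the loop.
        value = next(lines_iter, None)
        if value is not None:
            timestep.append(int(value))

print(timestep)  # [253400, 253500]
```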
However, if you ever want to scale it... for is not a good way to iterate over a file like this. You want to know what is in the next/previous line. I would suggest using while:
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
i = 0
while i < len(lines):
    if lines[i].startswith("ITEM: TIMESTEP"):
        i += 1
        # collect every line up to the next ITEM header (or the end of the file)
        while i < len(lines) and not lines[i].startswith("ITEM: "):
            timestep.append(lines[i].strip())
            i += 1
    else:
        i += 1
This way you can extend it for different types of ITEMS of variable length.
So the problem with your code is subtle. You have a list lines which you iterate over, but you can't call next on a list.
Instead, turn it into an explicit iterator and you should be fine
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
    lines_iter = iter(lines)
    for line in lines_iter:
        line = line.strip() # removes the newline
        if line.startswith("ITEM: TIMESTEP"):
            timestep.append(next(lines_iter, None)) # the second argument here prevents errors
                                                    # when ITEM: TIMESTEP appears as the
                                                    # last line in the file
print(timestep)
I'm also not sure why you included line.split, which seems to be incorrect (in any case line.split()[0].startswith('ITEM: TIMESTEP') can never be true, since the split will separate ITEM: and TIMESTEP into separate elements of the resulting list.)
For a more robust answer, consider grouping your data based on when the line begins with ITEM.
def process_file(f):
    ITEM_MARKER = 'ITEM: '
    item_title = '(none)'
    values = []
    for line in f:
        if line.startswith(ITEM_MARKER):
            if values:
                yield (item_title, values)
            item_title = line[len(ITEM_MARKER):].strip() # strip off the marker
            values = []
        else:
            values.append(line.strip())
    if values:
        yield (item_title, values)
This will let you pass in the whole file and will lazily produce a set of values for each ITEM: <whatever> group. Then you can aggregate in some reasonable way.
with open(file, 'r') as f:
    groups = process_file(f)
    aggregations = {}
    for name, values in groups:
        aggregations.setdefault(name, []).extend(values)
print(aggregations['TIMESTEP']) # this is what you want
You can use enumerate to help with index referencing. We can check to see if the string ITEM: TIMESTEP is in the previous line then add the integer to our timestep list.
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
for i, line in enumerate(lines):
    # skip i == 0, where lines[i-1] would wrap around to the last line
    if i > 0 and "ITEM: TIMESTEP" in lines[i-1]:
        timestep.append(int(line.strip()))
print(timestep)

How to increase the speed of CSV data matching?

I have a script that parses two CSV files and compares the first column of one file with the second column of the other. The problem is that the files are big, so the process takes some time to finish. How can I improve the speed? I tried to use yield from lines before the for loop, but then I have to convert lines[1:] to list(lines[1:]), so it makes no difference.
def pk():
    with open('way/to/first.csv') as csv_file:
        lines = csv_file.readlines()
    full_list = []
    for line in lines[1:]:
        array = line.split(',')
        list_pk = array[0].replace('"', '')
        full_list.append(list_pk)
    return full_list
def fk():
    with open('way/to/second.csv') as csv_file:
        lines = csv_file.readlines()
    full_list = []
    for line in lines[1:]:
        array = line.split(',')
        list_fk = array[1].replace('"', '')
        full_list.append(list_fk)
    return full_list
def res():
    f = fk()
    p = pk()
    for i in f:
        if i not in p:
            raise AssertionError(f'{i} not found')
Try using python's "set difference" to find the elements in set A that do not have a match in set B:
def res():
    fset = set(fk())
    pset = set(pk())
    print('items in F that are missing from P:')
    print(fset - pset)
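The speedup comes from membership tests: `i not in p` on a list scans the whole list each time, while a set lookup takes roughly constant time on average. As a rough sketch of the same idea with the standard csv module (which also handles the quoted fields), assuming a hypothetical helper column_values and in-memory stand-ins for the two files:

```python
import csv
import io


def column_values(csv_file, index):
    """Return the set of values in column `index`, skipping the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header
    return {row[index] for row in reader if len(row) > index}


# Hypothetical in-memory stand-ins for first.csv and second.csv.
first = io.StringIO('"id","name"\n"1","a"\n"2","b"\n')
second = io.StringIO('"x","fk"\n"p","1"\n"q","3"\n')

pset = column_values(first, 0)   # primary keys: first column
fset = column_values(second, 1)  # foreign keys: second column
missing = fset - pset
print(sorted(missing))  # ['3']
```

With real files, the io.StringIO objects would be replaced by open(path, newline='') handles.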

string manipulation and adding values based on row they are

I have a delimited text file, and for each line I am trying to generate every pairwise combination of its items, tagging each item in a pair with its line number.
Here is an example (you can download it here too if you want https://gist.github.com/anonymous/4107418c63b88c6da44281a8ae7a321f)
"A,B "
"AFD,DNGS,SGDH "
"NHYG,QHD,lkd,uyete"
"AFD,TTT"
I want to have it like this
A_1 B_1
AFD_2 DNGS_2
AFD_2 SGDH_2
DNGS_2 SGDH_2
NHYG_3 QHD_3
NHYG_3 lkd_3
NHYG_3 uyete_3
QHD_3 lkd_3
QHD_3 uyete_3
lkd_3 uyete_3
AFD_4 TTT_4
It means, A_1 and B_1 are coming from the first row
AFD_2 & DNGS_2 are coming from the second row , etc etc
I have tried to do it but I cannot figure it out
#!/usr/bin/python
import itertools
# make my output
out = {}
# give a name to my data
file_name = 'data.txt'
# read all the lines
for n, line in enumerate(open(file_name).readlines()):
    # split each line by comma
    item1 = line.split('\t')
    # split each string from another one by a comma
    item2 = item1.split(',')
    # iterate over all combinations of 2 strings
    for i in itertools.combinations(item2,2):
        # save the data into out
        out.write('\t'.join(i))
Output Answer 1
"A_1, B "_1
"AFD_2, DNGS_2
"AFD_2, SGDH "_2
DNGS_2, SGDH "_2
"NHYG_3, QHD_3
"NHYG_3, lkd_3
"NHYG_3, uyete"_3
QHD_3, lkd_3
QHD_3, uyete"_3
lkd_3, uyete"_3
"AFD_4, TTT"_4
answer 2
"A_1 B "_1
"AFD_2 DNGS_2
"AFD_2 SGDH "_2
DNGS_2 SGDH "_2
"NHYG_3 QHD_3
"NHYG_3 lkd_3
"NHYG_3 uyete"_3
QHD_3 lkd_3
QHD_3 uyete"_3
lkd_3 uyete"_3
"AFD_4 TTT"_4
Try this
#!/usr/bin/python
from itertools import combinations
with open('data1.txt') as f:
    result = []
    for n, line in enumerate(f, start=1):
        items = line.strip().split(',')
        x = [['%s_%d' % (x, n) for x in item] for item in combinations(items, 2)]
        result.append(x)
for res in result:
    for elem in res:
        print(',\t'.join(elem))
You need a list of lists of lists to represent the pairs for each row. You can build them using a list comprehension inside a loop.
I wasn't sure what you wanted as your actual output format, but this prints your expected output.
If there are quotes in the input file, the simple fix is
items = line.replace("\"", "").strip().split(',')
for the code above. This would break if there were other double quotes in the data, so it is only OK if you know there aren't any.
Otherwise, create a small function to strip the quotes. This example also writes to a file.
#!/usr/bin/python
from itertools import combinations
def remquotes(s):
    beg, end = 0, len(s)
    if s[0] == '"': beg = 1
    if s[-1] == '"': end = -1
    return s[beg:end]
with open('data1.txt') as f:
    result = []
    for n, line in enumerate(f, start=1):
        items = remquotes(line.strip()).strip().split(',')
        x = [['%s_%d' % (x, n) for x in item] for item in combinations(items, 2)]
        result.append(x)
with open('out.txt', 'w') as fout:
    for res in result:
        for elem in res:
            linestr = ',\t'.join(elem)
            print(linestr)
            fout.write(linestr + '\n')
This is similar to the other answer, except that, based on the comments, it looks like you actually want to write to a tab-delimited text file rather than to a dictionary.
#!/usr/bin/python
import itertools
file_name = 'data.txt'
out_file = 'out.txt'
with open(file_name) as infile, open(out_file, "w") as out:
    for n,line in enumerate(infile):
        row = [i + "_" + str(n+1) for i in line.strip().split(",")]
        for i in itertools.combinations(row,2):
            out.write('\t'.join(i) + '\n')
The following seems to work with a minimal amount of code:
import itertools
input_filename = 'data.txt'
output_filename = 'split_data.txt'
with open(input_filename, 'rt') as inp, open(output_filename, 'wt') as outp:
    for n, line in enumerate(inp, 1):
        items = ('{}_{}'.format(x.strip(), n)
                 for x in line.replace('"', '').split(','))
        for combo in itertools.combinations(items, 2):
            outp.write('\t'.join(combo) + '\n')
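For checking the logic without touching any files, the same approach can be wrapped in a generator and run on the first two sample rows. This is just a sketch; numbered_pairs and the in-memory sample list are names I made up for illustration.

```python
from itertools import combinations


def numbered_pairs(lines):
    """Yield tab-separated pairs, each item suffixed with its 1-based row number."""
    for n, line in enumerate(lines, 1):
        # drop the surrounding quotes, then split and trim each item
        items = [x.strip() for x in line.replace('"', '').split(',')]
        tagged = ['{}_{}'.format(x, n) for x in items if x]
        for a, b in combinations(tagged, 2):
            yield a + '\t' + b


sample = ['"A,B "\n', '"AFD,DNGS,SGDH "\n']  # first two rows of the example file
pairs = list(numbered_pairs(sample))
print('\n'.join(pairs))
```

This should print A_1 and B_1 on the first line, followed by the three pairs from row 2, each separated by a tab.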

Python: use a for loop to read specific multiple lines from txt files

I want to use Python to read specific groups of lines from a txt file, for example lines 7 to 10, 17 to 20, 27 to 30, etc.
Here is the code I wrote, but it only prints the numbers from the first 3 lines. Why? I am very new to Python.
with open('OpenDR Data.txt', 'r') as f:
    for poseNum in range(0, 4):
        Data = f.readlines()[7+10*poseNum:10+10*poseNum]
        for line in Data:
            matAll = line.split()
            MatList = map(float, matAll)
            MatArray1D = np.array(MatList)
            print MatArray1D
This simplifies the math a little to choose the relevant lines. You don't need to use readlines().
import numpy as np
with open('OpenDR Data.txt', 'r') as fp:
    for idx, line in enumerate(fp, 1):
        if idx % 10 in [7,8,9,0]:
            matAll = line.split()
            MatList = list(map(float, matAll)) # list() so this also works on Python 3
            MatArray1D = np.array(MatList)
            print(MatArray1D)
with open('OpenDR Data.txt') as f:
    lines = f.readlines()
for poseNum in range(0, 4):
    Data = lines[7+10*poseNum:10+10*poseNum]
You should only call readlines() once, so you should do it outside the loop:
import numpy as np
with open('OpenDR Data.txt', 'r') as f:
    lines = f.readlines()
for poseNum in range(0, 4):
    Data = lines[7+10*poseNum:10+10*poseNum]
    for line in Data:
        matAll = line.split()
        MatList = list(map(float, matAll)) # list() so this also works on Python 3
        MatArray1D = np.array(MatList)
        print(MatArray1D)
You can use a combination of list slicing and a list comprehension.
start = 7
end = 10
interval = 10
groups = 3
with open('data.txt') as f:
    lines = f.readlines()
mult_lines = [lines[start-1 + interval*i:end + interval*i] for i in range(groups)]
This returns a list of lists, one per group of lines (i.e. 7 through 10, 17 through 20, and so on).
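If the file is too large to read in one go, a streaming variant of the same selection is possible. The sketch below is an assumption-laden illustration: select_line_groups is a made-up name, the line numbers are 1-based and inclusive, and a trailing group that is cut off by the end of the input is dropped.

```python
def select_line_groups(lines, start=7, end=10, interval=10):
    """Return groups of lines start..end (1-based, inclusive), repeating every `interval` lines."""
    groups = []
    current = []
    for number, line in enumerate(lines, 1):
        # Position within the current block of `interval` lines, 1-based.
        position = (number - 1) % interval + 1
        if start <= position <= end:
            current.append(line.rstrip("\n"))
            if position == end:
                groups.append(current)
                current = []
    return groups


sample = ["line %d\n" % i for i in range(1, 31)]  # hypothetical 30-line file
groups = select_line_groups(sample)
print(groups[0])  # ['line 7', 'line 8', 'line 9', 'line 10']
```

Since it only iterates once, an open file object can be passed in place of the sample list.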

How to rearrange numbers from different lines of a text file in python?

So I have a text file consisting of one column; each line consists of two numbers
190..255
337..2799
2801..3733
3734..5020
5234..5530
5683..6459
8238..9191
9306..9893
I would like to discard the very first and the very last number, in this case 190 and 9893,
and basically move the rest of the numbers one spot forward, like this:
My desired output
255..337
2799..2801
3733..3734
5020..5234
5530..5683
6459..8238
9191..9306
I hope that makes sense I'm not sure how to approach this
lines = """190..255
337..2799
2801..3733"""
values = [int(v) for line in lines.split() for v in line.split('..')]
# values = [190, 255, 337, 2799, 2801, 3733]
pairs = zip(values[1:-1:2], values[2:-1:2])
# pairs = [(255, 337), (2799, 2801)]
out = '\n'.join('%d..%d' % pair for pair in pairs)
# out = "255..337\n2799..2801"
Try this:
with open(filename, 'r') as f:
    lines = f.readlines()
numbers = []
for row in lines:
    numbers.extend(row.strip().split('..')) # strip() drops the trailing newline
numbers = numbers[1:len(numbers)-1]
newLines = ['..'.join(numbers[idx:idx+2]) for idx in range(0, len(numbers), 2)]
with open(filename, 'w') as f:
    for line in newLines:
        f.write(line)
        f.write('\n')
Try this:
Read all of them into one list, split each line into two numbers, so you have one list of all your numbers.
Remove the first and last item from your list
Write out your list, two items at a time, with dots in between them.
Here's an example:
a = """190..255
337..2799
2801..3733
3734..5020
5234..5530
5683..6459
8238..9191
9306..9893"""
a_list = a.replace('..','\n').split()
b_list = a_list[1:-1]
b = ''
for i in range(len(b_list) // 2): # one pair per two items of the trimmed list
    b += '..'.join(b_list[2*i:2*i+2]) + '\n'
temp = []
with open('temp.txt') as ofile:
    for x in ofile:
        temp.append(x.rstrip("\n"))
# pair the second number of each line with the first number of the next line
for x in range(0, len(temp) - 1):
    print(temp[x].split("..")[1] + ".." + temp[x+1].split("..")[0])
Maybe this will help:
def makeColumns(listOfNumbers):
    n = int()
    while n < len(listOfNumbers):
        print(listOfNumbers[n], '..', listOfNumbers[(n+1)])
        n += 2
def trim(listOfNumbers):
    listOfNumbers.pop(0)
    listOfNumbers.pop((len(listOfNumbers) - 1))
listOfNumbers = [190, 255, 337, 2799, 2801, 3733, 3734, 5020, 5234, 5530, 5683, 6459, 8238, 9191, 9306, 9893]
makeColumns(listOfNumbers)
print()
trim(listOfNumbers)
makeColumns(listOfNumbers)
I think this might be useful too. I am reading data from a file name list.
data = open("list","r")
temp = []
value = []
print(data)
for line in data:
    temp = line.split("..")
    value.append(temp[0])
    value.append(temp[1])
for i in range(1,(len(value)-1),2):
    print(value[i].strip()+".."+value[i+1])
print(value)
After reading the data, I split each line and store the result in the temporary list. After that, I copy the data to the main list value, which holds all of the data. Then I iterate from the second element to the second-to-last element to get the output of interest. The strip function is used to remove the '\n' character from the value.
You can later write these values to a file instead of printing them.
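As a rough sketch of that last step, the re-pairing can be put in a function that returns the new lines, which are then written out in one go. The name shift_pairs is hypothetical, and io.StringIO stands in for a real output file here so the snippet runs without touching the disk.

```python
import io


def shift_pairs(lines):
    """Drop the first and last number, then re-pair the remaining ones."""
    numbers = []
    for line in lines:
        line = line.strip()
        if line:
            numbers.extend(line.split(".."))
    inner = numbers[1:-1]
    return ["..".join(inner[i:i + 2]) for i in range(0, len(inner), 2)]


sample = ["190..255\n", "337..2799\n", "2801..3733\n"]
result = shift_pairs(sample)

out = io.StringIO()  # stand-in for open("output.txt", "w")
out.write("\n".join(result) + "\n")
print(out.getvalue(), end="")
```

For the three sample lines this produces 255..337 and 2799..2801, matching the first two lines of the desired output.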
