How to read a file word by word

I have a PPM file that I need to do certain operations on. The file is structured as in the following example. The first line, 'P3', just says what kind of document it is. The second line gives the pixel dimensions of the image, so in this case it's telling us that the image is 480x640. The third line declares the maximum value any color can take. After that come the lines of pixel data. Every group of three integers gives an RGB value for one pixel. So in this example, the first pixel has RGB value 49, 49, 49. The second pixel has RGB value 48, 48, 48, and so on.
P3
480 640
255
49 49 49 48 48 48 47 47 47 46 46 46 45 45 45 42 42 42 38 38
38 35 35 35 23 23 23 8 8 8 7 7 7 17 17 17 21 21 21 29 29
29 41 41 41 47 47 47 49 49 49 42 42 42 33 33 33 24 24 24 18 18
...
Now as you may notice, this particular picture is supposed to be 640 pixels wide, which means 640*3 integers make up the first row of pixels. But here the first line is very, very far from containing 640*3 integers. So the line breaks in this file are meaningless, hence my problem.
The main way to read files in Python is line by line. But I need to collect these integers into groups of 640*3 and treat each group like a line. How would one do this? I know I could read the file in line by line and append every line to some list, but that list would be massive and I assume it would place an unacceptable burden on a device's memory. Other than that, I'm out of ideas. Help would be appreciated.

To read three space-separated words at a time from a file:
with open(filename, 'rb') as file:
    kind, dimensions, max_color = map(next, [file]*3) # read 3 lines
    rgbs = zip(*[(int(word) for line in file for word in line.split())] * 3)
Output
[(49, 49, 49),
(48, 48, 48),
(47, 47, 47),
(46, 46, 46),
(45, 45, 45),
(42, 42, 42),
...
See What is the most “pythonic” way to iterate over a list in chunks?
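To avoid creating the whole list at once in Python 2, you could use itertools.izip(), which lets you read one RGB value at a time. In Python 3, zip() is already lazy, so the tuples must be consumed while the file is still open.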
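The same grouping idea can be sketched for Python 3, where the pixels are streamed instead of collected into a list. This is a minimal sketch, assuming an ASCII (P3) file with no comment lines; iter_pixels is a hypothetical name, io.StringIO stands in for a real file, and row_length is set by hand because which header number is the width depends on how the file was written.

```python
import io
from itertools import islice

def iter_pixels(lines):
    """Return an iterator of (r, g, b) tuples from text lines, ignoring line breaks."""
    numbers = (int(word) for line in lines for word in line.split())
    # zip() pulls from the same iterator three times per tuple,
    # so consecutive integers end up grouped in threes.
    return zip(numbers, numbers, numbers)

# A tiny made-up file: 2x2 pixels, with line breaks in arbitrary places.
sample = io.StringIO("P3\n2 2\n255\n49 49 49 48 48\n48 47 47 47 46 46 46\n")
kind = next(sample).strip()
first, second = (int(n) for n in next(sample).split())
max_color = int(next(sample))

pixels = iter_pixels(sample)
row_length = 2  # pixels per row (assumed here)
rows = []
while True:
    row = list(islice(pixels, row_length))  # take one row's worth of pixels
    if not row:
        break
    rows.append(row)
print(rows)
```

Only one row of pixels is held in memory at a time if each row is processed inside the loop rather than appended to rows.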

Probably not the most 'pythonic' way but...
Iterate through the lines containing integers.
Keep four counts: color_code_count (counts up to 3), numbers_processed (counts up to 1920), col (0-639), and row (0-479).
For each integer you encounter, add it to a temporary list at index color_code_count. Increment color_code_count and numbers_processed.
Once color_code_count is 3, take your temporary list and create a 3-tuple, or triplet (your structure will look like (49, 49, 49) for the first pixel), and add that to a grid of 480 rows by 640 columns: insert your (49, 49, 49) into pixels[row][col].
Increment col.
Reset color_code_count.
'numbers_processed' will continue to increment until you get to 1920.
Once you hit 1920, you've reached the end of the first row.
Reset numbers_processed and col to zero, increment row by 1.
By this point, you should have 640 tuple3s or triplets in the row zero starting with (49,49,49), (48, 48, 48), (47, 47, 47), etc. And you're now starting to insert pixel values in row 1 column 0.
Like I said, probably not the most 'pythonic' way. There are probably better ways of doing this using join and map but I think this might work? This 'solution' if you want to call it that, shouldn't care about number of integers on any line since you're keeping count of how many numbers you expect to run through (1920) before you start a new row.
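The counting steps above can be sketched roughly as follows. This is only one possible reading of the description: group_into_rows is a hypothetical name, list lengths play the role of the explicit counters, and an incomplete final row is silently dropped.

```python
def group_into_rows(numbers, width):
    """Group a flat stream of integers into rows of `width` (r, g, b) tuples."""
    pixels = []       # finished rows; pixels[row][col] is one triplet
    current_row = []  # the row currently being filled
    triplet = []      # temporary list for one pixel
    for number in numbers:
        triplet.append(number)
        if len(triplet) == 3:              # color_code_count reached 3
            current_row.append(tuple(triplet))
            triplet = []                   # reset color_code_count
            if len(current_row) == width:  # numbers_processed reached width * 3
                pixels.append(current_row)
                current_row = []           # start the next row at column 0
    return pixels

# Twelve integers with width 2 give two rows of two pixels each.
print(group_into_rows(range(12), 2))
```

Because the function only counts integers, it does not matter how many of them appear on any one line of the file.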

A possible way to go through each word is to iterate through each line and then .split() it into words.
the_file = open("file.txt", "r")
for line in the_file:
    for word in line.split():
        pass  #-----Your Code-----
From there you can do whatever you want with your "words." You can add if-statements to check whether there are numbers in each line (though this is not very pythonic):
for line in the_file:
    if "1" in line or "2" in line ...:
        for word in line.split():
            pass  #-----Your Code-----
Or you can test whether there is anything in each word (much more pythonic). Note that split() with no arguments already drops empty strings and newlines, so this check is only a safeguard:
for line in the_file:
    for word in line.split():
        if word:
            pass  #-----Your Code-----
I would recommend adding each of your new "lines" to a new document.
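Writing the regrouped "lines" to a new document could look something like this. It is a minimal sketch: regroup_words is a hypothetical helper, and io.StringIO objects stand in for the real input and output files.

```python
import io

def regroup_words(source, target, words_per_line):
    """Copy whitespace-separated words from source to target, words_per_line per output line."""
    buffer = []
    for line in source:
        for word in line.split():
            buffer.append(word)
            if len(buffer) == words_per_line:
                target.write(" ".join(buffer) + "\n")
                buffer = []
    if buffer:  # write any incomplete final line as-is
        target.write(" ".join(buffer) + "\n")

source = io.StringIO("1 2 3 4 5\n6 7\n8 9 10 11 12 13\n")
target = io.StringIO()
regroup_words(source, target, 6)
print(target.getvalue())
```

For the image in the question, words_per_line would be 640*3, and the header lines would need to be read off before calling the helper.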

I am a C programmer. Sorry if this code looks like C Style:
f = open("pixel.ppm", "r")
type = f.readline()
height, width = f.readline().split()
height, width = int(height), int(width)
max_color = int(f.readline())
colors = []
count = 0
col_count = 0
line = []
while(col_count < height):
    count = 0
    i = 0
    row =[]
    while(count < width * 3):
        temp = f.readline().strip()
        if(temp == ""):
            col_count = height
            break
        temp = temp.split()
        line.extend(temp)
        i = 0
        while(i + 2 < len(line)):
            row.append({'r':int(line[i]),'g':int(line[i+1]),'b':int(line[i+2])})
            i = i+3
            count = count +3
            if(count >= width *3):
                break
        if(i < len(line)):
            line = line[i:len(line)]
        else:
            line = []
    col_count += 1
    colors.append(row)
for row in colors:
    for rgb in row:
        print(rgb)
    print("\n")
You can tweak this according to your needs. I tested it on this file:
P4
3 4
256
4 5 6 4 7 3
2 7 9 4
2 4
6 8 0
3 4 5 6 7 8 9 0
2 3 5 6 7 9 2
2 4 5 7 2
2

This seems to do the trick:
from re import findall

def _split_list(lst, i):
    return lst[:i], lst[i:]

def iter_ppm_rows(path):
    with open(path) as f:
        ftype = f.readline().strip()
        h, w = (int(s) for s in f.readline().split(' '))
        maxcolor = int(f.readline())
        rlen = w * 3
        row = []
        next_row = []
        for line in f:
            # raw string; matching digits only also catches a final number with no trailing whitespace
            line_ints = [int(i) for i in findall(r'\d+', line)]
            if not row:
                row, next_row = _split_list(line_ints, rlen)
            else:
                rest_of_row, next_row = _split_list(line_ints, rlen - len(row))
                row += rest_of_row
            if len(row) == rlen:
                yield row
                row = next_row
                next_row = []
It isn't very pretty, but it allows for varying whitespace between numbers in the file, as well as varying line lengths.
I tested it on a file that looked like the following:
P3
120 160
255
0 1 2 3 4 5 6 7
8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[...]
9993 9994 9995 9996 9997 9998 9999
That file used random line lengths, but printed numbers in order so it was easy to tell at what value the rows began and stopped. Note that its dimensions are different than in the question's example file.
Using the following test code...
for row in iter_ppm_rows('mock_ppm.txt'):
print(len(row), row[0], row[-1])
...the result was the following, which seems to not be skipping over any data and returning rows of the right size.
480 0 479
480 480 959
480 960 1439
480 1440 1919
480 1920 2399
480 2400 2879
480 2880 3359
480 3360 3839
480 3840 4319
480 4320 4799
480 4800 5279
480 5280 5759
480 5760 6239
480 6240 6719
480 6720 7199
480 7200 7679
480 7680 8159
480 8160 8639
480 8640 9119
480 9120 9599
As can be seen, trailing data at the end of the file that can't represent a complete row was not yielded, which was expected but you'd likely want to account for it somehow.
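One way to account for the trailing data is to fail loudly instead of dropping it. The sketch below is an assumption about what "accounting for it" might mean, not part of the answer's code: iter_rows is a hypothetical generator over an already flattened stream of integers, and it raises ValueError when the stream ends in the middle of a row.

```python
def iter_rows(numbers, row_length):
    """Yield lists of row_length integers; raise if the stream ends mid-row."""
    row = []
    for number in numbers:
        row.append(number)
        if len(row) == row_length:
            yield row
            row = []
    if row:
        # Leftover values that cannot form a complete row.
        raise ValueError(
            "incomplete final row: got %d of %d values" % (len(row), row_length)
        )

print(list(iter_rows(range(6), 3)))
```

Whether to raise, pad, or yield the short row depends on what the later processing expects.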

Related

The very fast way to find repeating combinations in Python using pandas?

I have this "DrawsDB.csv" sample file as input:
Day,Hour,N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14,N15,N16,N17,N18,N19,N20
1996-03-18,15:00,4,9,10,16,21,22,23,26,27,34,35,41,42,48,62,66,68,73,76,78
1996-03-19,15:00,6,12,15,19,28,33,35,39,44,48,49,59,62,63,64,67,69,71,75,77
1996-03-21,15:00,2,4,6,7,15,16,17,19,20,26,28,45,48,52,54,69,72,73,75,77
1996-03-22,15:00,3,8,15,17,19,25,30,33,34,35,36,38,44,49,60,61,64,67,68,75
1996-03-25,15:00,2,10,11,14,18,22,26,27,29,30,42,44,45,55,60,61,66,67,75,79
2022-01-01,15:00,1,9,12,17,33,34,36,37,38,44,45,46,53,56,58,60,62,63,70,72
2022-01-01,22:50,1,3,4,14,19,22,24,27,32,33,35,36,44,48,53,55,69,70,76,78
2022-01-02,15:00,13,15,16,19,22,24,31,37,38,43,47,58,64,66,70,72,73,75,76,78
2022-01-02,22:50,5,10,11,14,16,28,29,36,41,53,54,56,58,59,61,67,68,71,73,77
2022-01-03,15:00,8,9,10,11,15,20,21,22,26,30,35,36,39,42,52,58,63,64,73,80
2022-01-03,22:50,4,9,17,21,22,32,33,34,36,37,38,41,48,49,50,60,64,69,70,75
2022-01-04,15:00,4,5,7,9,11,16,17,21,22,25,30,37,38,39,44,49,52,60,65,78
2022-01-04,22:50,17,18,22,26,27,30,31,40,43,49,55,62,63,64,65,71,72,73,76,80
2022-01-05,15:00,1,5,8,14,15,20,23,25,26,33,34,35,37,47,54,59,67,70,72,76
2022-01-05,22:50,6,7,14,15,16,18,26,37,39,41,45,51,52,54,55,59,61,70,71,80
2022-01-06,15:00,9,10,11,17,28,30,32,41,42,44,45,49,50,51,55,65,67,72,76,78
2022-01-06,22:50,1,2,6,9,11,15,21,26,31,37,40,43,47,51,52,54,67,68,73,75
This is just a sample. The real CSV file has more than 50,000 rows in total.
Columns N1 to N20 contain random values that do not repeat within a row. They are sorted from the smallest (N1) to the largest (N20).
I want to get repeating combos (e.g. of 5 numbers let's say) across all rows from the DataFrame from columns N1 to N20.
So, for the entire .csv file posted above the output should be:
(6, 15, 26, 52, 54) 3
(17, 33, 34, 36, 38) 3
(17, 33, 34, 36, 60) 3
(17, 33, 34, 38, 60) 3
(17, 33, 36, 38, 60) 3
(17, 34, 36, 38, 60) 3
(33, 34, 36, 38, 60) 3
...
This is the full output, which I'm not posting here because of text size limitations:
https://pastebin.com/4EVXXSn1
Please check it out.
Sorry for making such long output, I tried to create a shorter one but didn't succeed in getting representative combos for it.
This is the Python code I wrote to accomplish what I need: (please read its commented lines too)
import pandas as pd
from itertools import combinations
from collections import Counter

df = pd.read_csv("DrawsDB.csv")
# looping through db using method found here:
# https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
df = df.reset_index() # make sure indexes pair with number of rows
draws = []
# please read this: https://stackoverflow.com/a/55557758/7710871 (Conclusion:iter is very slow)
for index, row in df.iterrows():
    draws.append(
        [row['N1'], row['N2'], row['N3'], row['N4'], row['N5'], row['N6'], row['N7'], row['N8'], row['N9'], row['N10'],
         row['N11'], row['N12'], row['N13'], row['N14'], row['N15'], row['N16'], row['N17'], row['N18'], row['N19'],
         row['N20']])
# comparing to each other in order to check for repeating combos:
repeating_combos = []
for i in range(len(draws)):
    for j in draws[i + 1:]:
        repeating_combos.append(sorted(list(set(draws[i]).intersection(j))))
# e.g. getting any repeating combo of 5 across all rows:
combos_of_5 = []
for each in repeating_combos:
    if len(each) == 5:
        combos_of_5.append(tuple(each))
        # print(each)
    elif len(each) > 5:
        # e.g. a repeating sequence of 6 numbers means in fact 6 combos taken by 5 numbers in this case.
        # e.g. a repeating sequence of 7 numbers means in fact 21 combos of 5 numbers and so on.
        # Combinations(k, n)
        for cmb in combinations(each, 5):
            combos_of_5.append(tuple(sorted(list(set(cmb)))))
# count how many times each combo appear:
x = Counter(combos_of_5)
sorted_x = dict(sorted(x.items(), key=lambda item: item[1], reverse=True))
for k, v in sorted_x.items():
    print(k, v)
It works very well, as expected, but there is one problem: for a bigger DataFrame it takes a long time to run. Worse, if you want repeating combinations of more than 5 numbers (say 6, 7, 8 or 9), it will take forever.
How can I do this in pure pandas in a much faster and smarter way?
Also, please note that the code does not generate every possible combo first and then search for each one in the DataFrame, because that would take even longer.
Thank you very much in advance!
P.S. What if the numbers from N1 to N20 were not sorted? Will this make any difference?
I have read this topic and many others already, but none asks for the same thing, so I think this is not a duplicate, and it could help others who have the same or a very similar problem.

Proof of work:
Given this part of your dataframe:
index  Day         Hour   N1  N2  N3  N4  N5  N6  N7  N8  N9  N10  N11  N12  N13  N14  N15  N16  N17  N18
0      1996-03-18  15:00   4   9  10  16  21  22  23  26  27   34   35   41   42   48   62   66   68   73
1      1996-03-19  15:00   6  12  15  19  28  33  35  39  44   48   49   59   62   63   64   67   69   71
2      1996-03-21  15:00   2   4   6   7  15  16  17  19  20   26   28   45   48   52   54   69   72   73
3      1996-03-22  15:00   3   8  15  17  19  25  30  33  34   35   36   38   44   49   60   61   64   67
You can update your code with something similar to the one below:
check = [6,15]
df['check'] = df.iloc[:,2:].apply(lambda r: all(s in r.values for s in check), axis=1)
true_count = df.check.sum()
print(f'The following numbers {check} appear {true_count} time(s) in the dataframe.')
Result:
The following numbers [6, 15] appear 2 time(s) in the dataframe.
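One way to make the pairwise comparison from the question cheaper, without pandas, is to encode each draw as an integer bitmask so that the intersection of two draws is a single `&`. The following is a minimal sketch of that swapped-in technique, assuming the numbers are small non-negative integers that do not repeat within a row; count_repeating_combos is a hypothetical name. It still compares every pair of rows, so it counts the way the question's code does (once per pair of rows that share the combo) and remains quadratic in the number of rows.

```python
from collections import Counter
from itertools import combinations

def count_repeating_combos(draws, k):
    """Count k-number combos shared by pairs of draws, using bitmask intersections."""
    masks = []
    for draw in draws:
        mask = 0
        for n in draw:
            mask |= 1 << n  # set bit n for each number in the draw
        masks.append(mask)
    counts = Counter()
    for a, b in combinations(masks, 2):
        common = a & b  # bits set in both draws
        if bin(common).count("1") >= k:
            numbers = [n for n in range(common.bit_length()) if (common >> n) & 1]
            counts.update(combinations(numbers, k))
    return counts

# Made-up draws, much smaller than the real data.
sample_draws = [[1, 2, 3, 4], [2, 3, 4, 5], [2, 3, 9, 10]]
result = count_repeating_combos(sample_draws, 2)
print(result.most_common())
```

Because the bits are read back in increasing order, each combo comes out as a sorted tuple whether or not the input rows were sorted.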

how to successively increment a 2-dimensional list

Part of this assignment deals with a 1-dimensional list and a 2-dimensional list. The 2-D list has 10 rows, with 4 elements each; the 1-D list has 4 elements.
The assignment calls for copying the gamma list (see code) into the first row of the inStock list. Then each row after the first needs to be successively multiplied by 3. By successively I mean multiplying everything in the first row of inStock by three and storing those values in the second row, then taking the values stored in the second row, multiplying those by three and storing the results in the third row of inStock, and so on.
I understand how to copy gamma but I am having trouble figuring out how to increment based off the previous list.
I am having difficulty creating a function that increments inStock successively.
This is what I have done. It increases the elements in gamma by three and stores them into the first row of inStock. But all the while loop does is take the values from the first row of inStock and store them into the other rows, rather than increment them successively.
row = 10
col = 4
gamma = [11, 13, 15, 17]
inStock = [[0] * col] * row
def copyGamma(listG, gamma):
    listG[0] = gamma.copy()
    x = 0
    while x < 9:
        x +=1
        listG[x] = [i * 3 for i in listG[0]]
    return listG
retList = copyGamma(inStock, gamma)
print(retList)
#this is the output of the above code
11 13 15 17 #this is inStock[0]
33 39 45 51 #this is inStock[1]
33 39 45 51 #this is inStock[2]
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
#This is the output i am looking for, format does not matter:
11 13 15 17 #This is inStock[0]
33 39 45 51 #This is inStock[1]
99 117 135 153 #This *should* be inStock[2]
297 351 405 459 #and so on
891 1053 1215 1377
2673 3159 3645 4131
8019 9477 10935 12393
24057 28431 32805 37179
72171 85293 98415 111537
216513 255879 295245 334611

You can use a list comprehension and the fact that each row's elements are effectively multiplied by a power of 3:
inStock = [[x * 3**i for x in gamma] for i in range(row)]
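If the assignment expects each row to be computed from the previous one, the loop from the question only needs to read from the row just written rather than from row 0. A minimal sketch of that variant (fill_successively is a hypothetical name, and the factor of 3 comes from the question):

```python
def fill_successively(gamma, rows, factor=3):
    """Return a table whose first row is gamma and whose later rows are factor times the row above."""
    table = [list(gamma)]  # copy gamma into row 0
    for _ in range(rows - 1):
        # Build the next row from the most recent one, not from row 0.
        table.append([value * factor for value in table[-1]])
    return table

in_stock = fill_successively([11, 13, 15, 17], 10)
for stock_row in in_stock:
    print(*stock_row)
```

Building every row as a fresh list also avoids the aliasing that `[[0] * col] * row` creates, where all ten rows initially refer to the same inner list.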

Trying to construct a greedy algorithm with python

So I'm trying to create a greedy algorithm for a knapsack problem. The txt file below is the knap20.txt file. The first line gives the number of items, in this case 20. The last line gives the capacity of the knapsack, in this case 524. The remaining lines give the index, value and weight of each item.
Ideally, my function should return the solution as a list, together with the total value.
From what I can tell by my results, my program is working correctly. Is it working as you would expect, and how can I improve it?
txt file
20
1 91 29
2 60 65
3 61 71
4 9 60
5 79 45
6 46 71
7 19 22
8 57 97
9 8 6
10 84 91
11 20 57
12 72 60
13 32 49
14 31 89
15 28 2
16 81 30
17 55 90
18 43 25
19 100 82
20 27 19
524
python file
import os
import matplotlib.pyplot as plt

def get_optimal_value(capacity, weights, values):
    value = 0.
    numItems = len(values)
    valuePerWeight = sorted([[values[i] / weights[i], weights[i]] for i in range(numItems)], reverse=True)
    while capacity > 0 and numItems > 0:
        maxi = 0
        idx = None
        for i in range(numItems):
            if valuePerWeight[i][1] > 0 and maxi < valuePerWeight[i][0]:
                maxi = valuePerWeight[i][0]
                idx = i
        if idx is None:
            return 0.
        if valuePerWeight[idx][1] <= capacity:
            value += valuePerWeight[idx][0]*valuePerWeight[idx][1]
            capacity -= valuePerWeight[idx][1]
        else:
            if valuePerWeight[idx][1] > 0:
                value += (capacity / valuePerWeight[idx][1]) * valuePerWeight[idx][1] * valuePerWeight[idx][0]
                return values, value
        valuePerWeight.pop(idx)
        numItems -= 1
    return value

def read_kfile(fname):
    print('file started')
    with open(fname) as kfile:
        print('fname found', fname)
        lines = kfile.readlines() # reads the whole file
    n = int(lines[0])
    c = int(lines[n+1])
    vs = []
    ws = []
    lines = lines[1:n+1] # Removes the first and last line
    for l in lines:
        numbers = l.split() # Converts the string into a list
        vs.append(int(numbers[1])) # Appends value, need to convert to int
        ws.append(int(numbers[2])) # Appends weight, need to convert to int
    return n, c, vs, ws

dir_path = os.path.dirname(os.path.realpath(__file__)) # Get the directory where the file is located
os.chdir(dir_path) # Change the working directory so we can read the file
knapfile = 'knap20.txt'
nitems, capacity, values, weights = read_kfile(knapfile)
val1,val2 = get_optimal_value(capacity, weights, values)
print ('values',val1)
print('value',val2)
result
values [91, 60, 61, 9, 79, 46, 19, 57, 8, 84, 20, 72, 32, 31, 28, 81, 55, 43, 100, 27]
value 733.2394366197183
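For comparison, the greedy rule for the fractional variant of the problem (take items in decreasing value-per-weight order and split the last one that does not fit) can be sketched as below. This is a sketch under assumptions, not a review of the code above: greedy_fractional_knapsack is a hypothetical name, all weights are assumed to be positive, and the returned list holds the fraction taken of each item in its original position.

```python
def greedy_fractional_knapsack(capacity, weights, values):
    """Return (fractions_taken, total_value) for the fractional knapsack greedy rule."""
    # Item indices ordered by value per unit of weight, best first.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i],
                   reverse=True)
    taken = [0.0] * len(values)
    total = 0.0
    for i in order:
        if capacity <= 0:
            break
        fraction = min(1.0, capacity / weights[i])  # whole item, or the part that fits
        taken[i] = fraction
        total += fraction * values[i]
        capacity -= fraction * weights[i]
    return taken, total

# Small made-up instance: capacity 50, three items.
taken, total = greedy_fractional_knapsack(50, [10, 20, 30], [60, 100, 120])
print(taken, total)
```

Returning the same pair on every path keeps the caller's unpacking simple, and the fractions make it possible to see which items ended up in the knapsack.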

Python: How to write values to a csv file from another csv file

For index.csv file, its fourth column has ten numbers ranging from 1-5. Each number can be regarded as an index, and each index corresponds with an array of numbers in filename.csv.
The row number of filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
import collections

data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
out = np.zeros((len(data2),len(data1)))
for row in data2:
    for ch_row in range(len(data1)):
        if (row[3] == ch_row + 1):
            out = row.tolist() + data1[ch_row].tolist()
            print(out)
writer = csv.writer(open('dn.csv','w'), delimiter=',',quoting=csv.QUOTE_ALL)
writer.writerow(out)
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these numbers in the 5th, 6th and 7th columns:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
If I do "print(out)" inside the loop, it prints the correct rows. However, when I enter "out" in the shell afterwards, only one row appears, like [1.0, 1.0, 1.0, 1.0, 20.0, 30.0, 50.0]
What I need is to store all the values in the "out" variables and write them to the dn.csv file.

This ought to do the trick for you:
Code:
from csv import reader, writer

data = list(reader(open("filename.csv", "r"), delimiter=" "))
out = writer(open("output.csv", "w"), delimiter=" ")
for row in reader(open("index.csv", "r"), delimiter=" "):
    out.writerow(row + data[int(row[3])])
index.csv:
0 0 0 1
0 0 0 2
0 0 0 3
filename.csv:
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
This produces the output:
0 0 0 1 70 60 45
0 0 0 2 35 26 77
0 0 0 3 93 37 68
Note: There's no need to use numpy here. The standard library csv module will do most of the work for you.
I also had to modify your sample datasets a bit, as what you showed had indexes out of bounds of the sample data in filename.csv.
Please also note that Python (like most languages) uses zero-based indexing, so you may have to adjust the above code to fit your needs exactly.

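# assumes csv, data1 and data2 are defined as in the question
with open('dn.csv','w') as f:
    writer = csv.writer(f, delimiter=',',quoting=csv.QUOTE_ALL)
    for row in data2:
        idx = int(row[3])  # genfromtxt returns floats; list/array indexes must be ints
        out = [idx] + [x for x in data1[idx-1]]
        writer.writerow(out)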
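The lookup can also be tried out without touching any files. The sketch below is an illustration only: join_by_index is a hypothetical helper, the index is treated as 1-based as in the question, the first three columns of the index rows are made-up zeros (the question elides them), and io.StringIO stands in for dn.csv.

```python
import csv
import io

def join_by_index(index_rows, data_rows, index_col=3):
    """Append to each index row the data row selected by its 1-based index column."""
    joined = []
    for row in index_rows:
        idx = int(float(row[index_col]))  # accepts "1" as well as "1.0"
        joined.append(list(row) + list(data_rows[idx - 1]))
    return joined

data_rows = [["20", "30", "50"], ["70", "60", "45"], ["35", "26", "77"],
             ["93", "37", "68"], ["13", "08", "55"]]
index_rows = [["0", "0", "0", "1"], ["0", "0", "0", "5"], ["0", "0", "0", "3"]]

joined = join_by_index(index_rows, data_rows)
buffer = io.StringIO()
csv.writer(buffer).writerows(joined)  # one writer call for all rows
print(buffer.getvalue())
```

Collecting every row before writing, or writing inside the loop, avoids the situation in the question where only the last value of out reaches the file.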

Parsing members of a variable length python string

I am using sed in python to read the text from a log file into a single string.
Here is the command:
sys_output=commands.getoutput('sed -n "/SYS /,/Tot /p" %s.log' % cim_input_prefix)
and here is a printout of sys_output
SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 18 21 59 89 92 1.6163
2 RHF CCSD 4 7 22 36 2 0.0036
Tot 94 1.6199
SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 4 4 14 19 1 0.0002
Tot 1 0.0002
SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 4 9 36 55 8 0.0416
2 RHF CCSD 18 25 73 108 200 5.3587
3 RHF CCSD 4 10 29 48 6 0.0217
Tot 214 5.4221
Which has three groups, with [2,1,3] rows of interest.
The log files my script will encounter may have a variable number of groups and rows, so I can't simply split the string and pull out the useful information.
I am interested in the index of group and row, and the memory column.
How can I parse this large string to obtain a dictionary such as:
{'1-1': 92, '1-2': 2, '2-1': 1, '3-1': 8, '3-2': 200, '3-3': 6}?
Thank you very much for your time

Some kind of state machine based on the particular traits of the output may make life easier than worrying too much about indices.
This snippet works with the example and could be tailored to handle corner cases.
import collections

with open("cpu_text", "r") as f:
    lines = f.readlines()
lines = [line.strip() for line in lines]
group_id = 0
group_member_id = 0
output_dict = collections.OrderedDict()
for line in lines:
    if line.find("SYS") > -1:
        group_id += 1
    elif line.find("Tot") > -1:
        group_member_id = 0
    else:
        group_member_id += 1
        key = "{0}-{1}".format(group_id, group_member_id)
        memory = line.split()[7]
        output_dict[key] = memory
print(output_dict)
Output:
OrderedDict([('1-1', '92'), ('1-2', '2'), ('2-1', '1'), ('3-1', '8'), ('3-2', '200'), ('3-3', '6')])
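A variant that works directly on the string and returns integer values, as in the dictionary the question asks for, might look like this. It is a minimal sketch: parse_memory is a hypothetical name, and it assumes every data row has the MEMORY value as its eighth whitespace-separated field and its row number as the first field.

```python
def parse_memory(sys_output):
    """Map 'group-row' keys to the integer MEMORY column of each data row."""
    result = {}
    group = 0
    for line in sys_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "SYS":
            group += 1  # a header line starts a new group
        elif fields[0] != "Tot":
            # fields[0] is the row number within the group, fields[7] is MEMORY
            result["{}-{}".format(group, fields[0])] = int(fields[7])
    return result

sample_output = """SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 18 21 59 89 92 1.6163
2 RHF CCSD 4 7 22 36 2 0.0036
Tot 94 1.6199
SYS SCFTYP METHOD NC NO NU NBS MEMORY CPU TIME
1 RHF CCSD 4 4 14 19 1 0.0002
Tot 1 0.0002
"""
print(parse_memory(sample_output))
```

Checking the first field rather than searching the whole line avoids misreading a data row that happens to contain "SYS" or "Tot" somewhere else.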
