Digit frequency in first digit in csv, No import - python

I'm a complete beginner at Python and am taking a course at my university. Any tips or advice on this question would be much appreciated.
I'm having trouble writing code that counts the frequency of the first digit of each number in a CSV file.
No imports are allowed.
For example, given the values below from the CSV, we have to work out how many times each of 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 appears as the first digit of a number.
E.g. in 5.385686, 3665, 6942, 4053, 7726, 4601, 7302 there is one 3 as a first digit, two 4s, one 5, and so on.
I removed everything other than digits and . from the file (by replacing every other character in the ASCII table with a space).
I tried putting all the data into a list first and returned '5.385686', but I have no idea what to do next.
expected output:
[[26, 22, 28, 22, 16, 20, 31, 22, 13, 0]]
I'm showing only some part from CSV.
5.385686 3665 6942 4053 7726 4601 7302
11754.41657 7859 7002 1502 8754 449 472
800.1759341 2161 4958 3738 5105 1472 2487
1055.19226 7473 3713 4302 3174 6415 9094
1747.798453 2685 5343 3207 2137 1934 1101
2551.157404 3200 4655 2673 4270 821 330
480.7713868 1172 847 3683 9486 2258 6323
19018.97818 3678 5628 1171 7270 8333 2534
505.5652756 7222 4105 6529 169 307 3142
3759.276869 9649 1445 5944 8892 371 8307
4753 6737 906 5057 4401 8698 533
2790 5239 6392 8637 8785 1331 6848
3328 639 3519 7829 6796 3935 2893
6331 2986 6076 1085 7715 8241 5688
This is what I got so far:
def filename():
    file = open("sample_accounts.csv", "r")
    filecsv = file.read()
    filecsv = filecsv.lower()
    a = []
    b = [ ]
    chlist = list(range(128))
    del chlist[48:58]
    del chlist[46]
    for c in chlist:
        filecsv = filecsv.replace(chr(c)," ")
    a.append(chlist)
    ftlist = filecsv.split()
    greet = ftlist
    a.append(ftlist)
    for i in greet:
        return greet[0]
    # for i in greet:
    # return greet[i]
    #
    # dic = {}
    #
    # for word in ftlist:
    # dic[word] = dic.get(word,0) + 1
    #
    # # for item in dic: # **** *
    # # print(item, dic[item])
    # return greet
d = filename()

You can do that by storing the count of each digit in a dictionary:
count = {}
with open('path to your file') as file:
    for line in file:
        for number in line.split(' '):
            number = number.strip()
            if len(number) < 1:
                continue
            digit = number[0]
            if digit.isdigit():
                digit = int(digit)
                if digit in count:
                    count[digit] = count[digit] + 1
                else:
                    count[digit] = 1
# print the counts ordered by digit
print([count[d] for d in sorted(count)])
Output:
[14, 11, 16, 12, 10, 11, 9, 11, 4]

Based exclusively on the CSV snippet in the question, you can do something like this:
csv_dat = """[your csv snippet]"""
csv_lst = csv_dat.split() # split on any whitespace, line breaks included; you may already have this list in your code
fd_lst = [] # initialize a list for the first digit of each entry
for item in csv_lst:
    fd_lst.append(item[0]) # select the first character in each entry
print('digit frequency')
for x in sorted(set(fd_lst)): # count each distinct character once
    print(x, '\t', fd_lst.count(x))
Output:
digit frequency
1 14
2 11
3 16
4 12
5 10
6 11
7 9
8 11
9 4
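To get the list format shown in the question's expected output, one option is a fixed-size list instead of a dictionary. The following is only a minimal sketch that uses no imports. The function name `first_digit_counts` is made up for illustration, and the ordering (counts for first digits 1 through 9, then 0, wrapped in an outer list) is an assumption read off the expected output rather than something the question states.

```python
def first_digit_counts(text):
    """Count first digits of whitespace-separated numbers in text.

    Returns [[n1, n2, ..., n9, n0]]; this order is an assumption
    inferred from the expected output in the question.
    """
    counts = [0] * 10  # index d holds the count for digit d
    for token in text.split():
        first = token[0]
        if '0' <= first <= '9':  # skip anything not starting with a digit
            counts[ord(first) - ord('0')] += 1
    return [counts[1:] + counts[:1]]  # move the count for 0 to the end


sample = "5.385686 3665 6942 4053 7726 4601 7302"
print(first_digit_counts(sample))  # [[0, 0, 1, 2, 1, 1, 2, 0, 0, 0]]
```

For the real file you would pass in the cleaned text, for example the string returned by `file.read()`.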


Obtaining a dictionary out of regular expressions

I have a question that includes various steps.
I am parsing a file that looks like this:
9
123
0 987
3 890 234 111
1 0 1 90 1 34 1 09 1 67
1 684321
2 352 69
1 1 1 243 1 198 1 678 1 11
2 098765
1 143
1 2 1 23 1 63 1 978 1 379
3 784658
1 43
1 3 1 546 1 789 1 12 1 098
I want to make these lines in the file the keys of a dictionary (ignoring the first number and taking only the second one, because the first just indicates the index of the key):
0 987
1 684321
2 098765
3 784658
And these lines the values of the elements (again ignoring only the first number, because it just indicates how many elements there are):
3 890 234 111
2 352 69
1 143
1 43
So at the end it has to look like this:
d = {987 : [890, 234, 111], 684321 : [352, 69],
098765 : [143], 784658 : [43]}
So far I have this:
findkeys = re.findall(r"\d\t(\d+)\n", line)
findelements = re.findall(r"\d\t(\d+)", line)
listss.append("".join(findelements))
d = {findkeys: listss}
The regular expressions need more exceptions: the one for the keys also matches other lines that I don't want as keys but that also contain just one number. In the example file, the number 43 shows up as a result.
And the regular expression for the elements gives me back all the lines.
I don't know whether it would be easier to make the code ignore the lines I don't need, but I don't know how to do that.
I want to keep it as simple as possible.
Thanks!
with open('filename.txt') as f:
    lines = f.readlines()
lines = [x.strip() for x in lines]
lines = lines[2:]
keys = lines[::3]
values = lines[1::3]
output lines:
['0 987',
'3 890 234 111',
'1 0 1 90 1 34 1 09 1 67',
'1 684321',
'2 352 69',
'1 1 1 243 1 198 1 678 1 11',
'2 098765',
'1 143',
'1 2 1 23 1 63 1 978 1 379',
'3 784658',
'1 43',
'1 3 1 546 1 789 1 12 1 098']
output keys:
['0 987', '1 684321', '2 098765', '3 784658']
output values:
['3 890 234 111', '2 352 69', '1 143', '1 43']
Now you just have to put it together ! Iterate through keys and values.
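Putting the two lists together could look like the sketch below. It starts from the `keys` and `values` lists exactly as printed above. Keeping the dictionary keys as strings is a choice made here so that the leading zero of 098765 survives; converting them with `int()` would drop it.

```python
# The two lists produced above, copied from the output shown in the answer.
keys = ['0 987', '1 684321', '2 098765', '3 784658']
values = ['3 890 234 111', '2 352 69', '1 143', '1 43']

d = {}
for key_line, value_line in zip(keys, values):
    key = key_line.split()[1]                          # drop the leading index
    d[key] = [int(v) for v in value_line.split()[1:]]  # drop the leading count

print(d)
# {'987': [890, 234, 111], '684321': [352, 69], '098765': [143], '784658': [43]}
```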
Once you have the lines in a list (the lines variable, untrimmed, so still including the two header lines), you can simply use re to isolate the numbers and a dictionary/list comprehension to build the desired data structure.
Based on your example data, after the two header lines every third line is a key, with its values on the following line. This means you only need to stride by 3 through the list.
findall() gives you the list of numbers (as text) on each line, and you can ignore the first one with simple subscripts.
import re
value = re.compile(r"(\d+)")
numbers = [[int(v) for v in value.findall(line)] for line in lines]
intDict = {key[1]: values[1:] for key, values in zip(numbers[2::3], numbers[3::3])}
You could also do it using split(). Called without an argument it splits on any run of whitespace and produces no empty entries; the filter below only matters if you split on a single space with split(" "):
numbers = [[int(v) for v in line.split() if v != ""] for line in lines]
intDict = {key[1]: values[1:] for key, values in zip(numbers[2::3], numbers[3::3])}
You could build yourself a parser with e.g. parsimonious:
from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar
data = """
9
123
0 987
3 890 234 111
1 0 1 90 1 34 1 09 1 67
1 684321
2 352 69
1 1 1 243 1 198 1 678 1 11
2 098765
1 143
1 2 1 23 1 63 1 978 1 379
3 784658
1 43
1 3 1 546 1 789 1 12 1 098
"""
grammar = Grammar(
    r"""
    data = (important / garbage)+
    important = keyline newline valueline
    garbage = ~".*" newline?
    keyline = ws number ws number
    valueline = (ws number)+
    newline = ~"[\n\r]"
    number = ~"\d+"
    ws = ~"[ \t]+"
    """
)
tree = grammar.parse(data)
class DataVisitor(NodeVisitor):
    output = {}
    current = None
    def generic_visit(self, node, visited_children):
        return node.text or visited_children
    def visit_keyline(self, node, children):
        key = node.text.split()[-1]
        self.current = key
    def visit_valueline(self, node, children):
        values = node.text.split()
        self.output[self.current] = [int(x) for x in values[1:]]
dv = DataVisitor()
dv.visit(tree)
print(dv.output)
This yields
{'987': [890, 234, 111], '684321': [352, 69], '098765': [143], '784658': [43]}
The idea here is that every "keyline" is only composed of two numbers with the second being the soon-to-be keyword. The next line is the valueline.

Trying to construct a greedy algorithm with python

I'm trying to write a greedy algorithm for a knapsack problem. The text file below is knap20.txt. The first line gives the number of items, in this case 20. The last line gives the capacity of the knapsack, in this case 524. The remaining lines give the index, value and weight of each item.
Ideally my function should return the solution in a list, along with the value of the weights.
As far as I can tell from my results, my program is working correctly. Is it working as you would expect, and how can I improve it?
txt file
20
1 91 29
2 60 65
3 61 71
4 9 60
5 79 45
6 46 71
7 19 22
8 57 97
9 8 6
10 84 91
11 20 57
12 72 60
13 32 49
14 31 89
15 28 2
16 81 30
17 55 90
18 43 25
19 100 82
20 27 19
524
python file
import os
import matplotlib.pyplot as plt
def get_optimal_value(capacity, weights, values):
    value = 0.
    numItems = len(values)
    valuePerWeight = sorted([[values[i] / weights[i], weights[i]] for i in range(numItems)], reverse=True)
    while capacity > 0 and numItems > 0:
        maxi = 0
        idx = None
        for i in range(numItems):
            if valuePerWeight[i][1] > 0 and maxi < valuePerWeight[i][0]:
                maxi = valuePerWeight[i][0]
                idx = i
        if idx is None:
            return 0.
        if valuePerWeight[idx][1] <= capacity:
            value += valuePerWeight[idx][0]*valuePerWeight[idx][1]
            capacity -= valuePerWeight[idx][1]
        else:
            if valuePerWeight[idx][1] > 0:
                value += (capacity / valuePerWeight[idx][1]) * valuePerWeight[idx][1] * valuePerWeight[idx][0]
                return values, value
        valuePerWeight.pop(idx)
        numItems -= 1
    return value
def read_kfile(fname):
    print('file started')
    with open(fname) as kfile:
        print('fname found', fname)
        lines = kfile.readlines() # reads the whole file
    n = int(lines[0])
    c = int(lines[n+1])
    vs = []
    ws = []
    lines = lines[1:n+1] # Removes the first and last line
    for l in lines:
        numbers = l.split() # Converts the string into a list
        vs.append(int(numbers[1])) # Appends value, need to convert to int
        ws.append(int(numbers[2])) # Appends weight, need to convert to int
    return n, c, vs, ws
dir_path = os.path.dirname(os.path.realpath(__file__)) # Get the directory where the file is located
os.chdir(dir_path) # Change the working directory so we can read the file
knapfile = 'knap20.txt'
nitems, capacity, values, weights = read_kfile(knapfile)
val1,val2 = get_optimal_value(capacity, weights, values)
print ('values',val1)
print('value',val2)
result
values [91, 60, 61, 9, 79, 46, 19, 57, 8, 84, 20, 72, 32, 31, 28, 81, 55, 43, 100, 27]
value 733.2394366197183
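The posted function takes a fraction of the first item that no longer fits, so it effectively solves the fractional knapsack problem. A more compact sketch of that same greedy rule, which also reports which items were taken, might look like the following. The function name `greedy_fractional_knapsack` and the (index, fraction) return format are illustrative assumptions, and the code assumes all weights are positive. Greedy selection by value/weight ratio is optimal for the fractional problem, but it is not guaranteed to be optimal if items must be taken whole.

```python
def greedy_fractional_knapsack(capacity, values, weights):
    """Greedy by value/weight ratio; the last item may be taken partially.

    Returns (taken, total_value), where taken is a list of
    (item_index, fraction) pairs with 0-based indices.
    Assumes every weight is positive.
    """
    # Sort item indices by value per unit of weight, best first.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i],
                   reverse=True)
    taken = []
    total = 0.0
    for i in order:
        if weights[i] <= capacity:
            # The whole item fits.
            taken.append((i, 1.0))
            total += values[i]
            capacity -= weights[i]
        else:
            # Fill the remaining capacity with part of this item, then stop.
            fraction = capacity / weights[i]
            if fraction > 0:
                taken.append((i, fraction))
                total += values[i] * fraction
            break
    return taken, total


# Values, weights and capacity copied from knap20.txt above.
values = [91, 60, 61, 9, 79, 46, 19, 57, 8, 84,
          20, 72, 32, 31, 28, 81, 55, 43, 100, 27]
weights = [29, 65, 71, 60, 45, 71, 22, 97, 6, 91,
           57, 60, 49, 89, 2, 30, 90, 25, 82, 19]
taken, total = greedy_fractional_knapsack(524, values, weights)
print(taken)
print(total)  # agrees with the 733.239... result shown above
```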

Reading values from a text file with different row and column size in python

I have read other similar posts, but they don't seem to work in my case, so I'm posting a new question here.
I have a text file which has varying row and column sizes. I am interested in the rows of values which have a specific parameter. E.g. in the sample text file below, I want the last two values of each line which has the number '1' in the second position. That is, I want the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with the values '101 to 104' because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is that the code I have come up with, shown below, starts from '101' but then also reads values from the other lines, up to '304' or beyond. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that need to be skipped
# at the beginning of the .txt file. Initially I find out where the range
# of the lines lies which I need.
# The two_noded_elem_start is the first line having the '1' at the second position
# and four_noded_elem_start is the first line number having '2' in the second position.
# So, basically I'm reading between these two parameters.
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")
for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break
for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)
input_file.close()
output_file.close()
*EDIT: The piece of code used to find the parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter that is found out
# from a previous piece of code and '$EndElements' is what
# is at the end of the text file "mesh_outer_region.msh".
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")
for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break
for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = element_list[0]
        two_noded_elem_start = int(two_noded_elem_start)
        break
input_file.close()
>>> with open('filename') as fh:  # Open the file
...     for line in fh:  # For each line in the file
...         values = line.split()  # Split the values into a list
...         if len(values) > 1 and values[1] == '1':  # Compare the second value, skipping lines with fewer than two fields
...             print(values[-2], values[-1])  # Print the 2nd from last and last
1 101
101 2
2 102
102 3
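If the node section can also contain lines whose second field happens to be '1', it may be safer to look only between the $Elements and $EndElements markers. The generator below is a sketch of that idea, not a full Gmsh parser. The function name `two_node_element_ends` and the node line in the sample are made up for illustration.

```python
def two_node_element_ends(lines):
    """Yield the last two fields of each element line whose second field is '1'.

    Only lines between '$Elements' and '$EndElements' are inspected, so
    entries in the '$Nodes' section are never mistaken for elements.
    """
    in_elements = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == '$Elements':
            in_elements = True
        elif fields[0] == '$EndElements':
            in_elements = False
        elif in_elements and len(fields) > 2 and fields[1] == '1':
            yield fields[-2], fields[-1]


# Shortened sample; the node line "1 1 0 0" is hypothetical.
sample = """$Nodes
1 1 0 0
$EndNodes
$Elements
630
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
301 2 2 0 303 178 78 250
$EndElements
""".splitlines()

print(list(two_node_element_ends(sample)))
# [('1', '101'), ('101', '2')]
```

Because the function accepts any iterable of lines, an open file object can be passed in directly.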

Unsure why program similar to bubble-sort is not working

I have been working on a programming challenge, problem here, which basically states:
Given integer array, you are to iterate through all pairs of neighbor
elements, starting from beginning - and swap members of each pair
where first element is greater than second.
And then return the amount of swaps made and the checksum of the final answer. My program seemingly does both the sorting and the checksum according to how it wants. But my final answer is off for everything but the test input they gave.
So: 1 4 3 2 6 5 -1
Results in the correct output: 3 5242536 with my program.
But something like:
2 96 7439 92999 240 70748 3 842 74 706 4 86 7 463 1871 7963 904 327 6268 20955 92662 278 57 8 5912 724 70916 13 388 1 697 99666 6924 2 100 186 37504 1 27631 59556 33041 87 9 45276 -1
Results in: 39 1291223 when the correct answer is 39 3485793.
Here's what I have at the moment:
# Python 2.7
def check_sum(data):
    data = [str(x) for x in str(data)[::]]
    numbers = len(data)
    result = 0
    for number in range(numbers):
        result += int(data[number])
        result *= 113
        result %= 10000007
    return(str(result))
def bubble_in_array(data):
    numbers = data[:-1]
    numbers = [int(x) for x in numbers]
    swap_count = 0
    for x in range(len(numbers)-1):
        if numbers[x] > numbers[x+1]:
            temp = numbers[x+1]
            numbers[x+1] = numbers[x]
            numbers[x] = temp
            swap_count += 1
    raw_number = int(''.join([str(x) for x in numbers]))
    print('%s %s') % (str(swap_count), check_sum(raw_number))
bubble_in_array(raw_input().split())
Does anyone have any idea where I am going wrong?
The issue is with your way of calculating Checksum. It fails when the array has numbers with more than one digit. For example:
2 96 7439 92999 240 70748 3 842 74 706 4 86 7 463 1871 7963 904 327 6268 20955 92662 278 57 8 5912 724 70916 13 388 1 697 99666 6924 2 100 186 37504 1 27631 59556 33041 87 9 45276 -1
You are calculating Checksum for 2967439240707483842747064867463187179639043276268209559266227857859127247091613388169792999692421001863750412763159556330418794527699666
digit by digit while you should calculate the Checksum of [2, 96, 7439, 240, 70748, 3, 842, 74, 706, 4, 86, 7, 463, 1871, 7963, 904, 327, 6268, 20955, 92662, 278, 57, 8, 5912, 724, 70916, 13, 388, 1, 697, 92999, 6924, 2, 100, 186, 37504, 1, 27631, 59556, 33041, 87, 9, 45276, 99666]
The fix:
# Python 2.7
def check_sum(data):
    result = 0
    for number in data:
        result += number
        result *= 113
        result %= 10000007
    return(result)
def bubble_in_array(data):
    numbers = [int(x) for x in data[:-1]]
    swap_count = 0
    for x in xrange(len(numbers)-1):
        if numbers[x] > numbers[x+1]:
            numbers[x+1], numbers[x] = numbers[x], numbers[x+1]
            swap_count += 1
    print('%d %d') % (swap_count, check_sum(numbers))
bubble_in_array(raw_input().split())
More notes:
To swap two variables in Python you don't need a temp variable; just use a, b = b, a.
In Python 2.x, prefer xrange over range for loops, since xrange does not build the whole list in memory.
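For readers on Python 3, the same fix could be sketched as follows. The function names `checksum` and `bubble_pass` are chosen here for illustration, and the functions return their results instead of printing so they are easy to check against the sample input from the question.

```python
def checksum(numbers):
    """Fold each array element (not each digit) into the running checksum."""
    result = 0
    for number in numbers:
        result = (result + number) * 113 % 10000007
    return result


def bubble_pass(tokens):
    """Run a single left-to-right pass; return (swap_count, checksum)."""
    numbers = [int(t) for t in tokens[:-1]]  # drop the trailing -1 terminator
    swaps = 0
    for i in range(len(numbers) - 1):
        if numbers[i] > numbers[i + 1]:
            numbers[i], numbers[i + 1] = numbers[i + 1], numbers[i]
            swaps += 1
    return swaps, checksum(numbers)


print(bubble_pass("1 4 3 2 6 5 -1".split()))  # (3, 5242536)
```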

How to read a file word by word

I have a PPM file that I need to perform certain operations on. The file is structured as in the following example. The first line, 'P3', just says what kind of document it is. The second line gives the pixel dimensions of the image, so in this case it tells us that the image is 480x640. The third line declares the maximum value any color can take. After that come the lines of data. Every group of three integers gives an RGB value for one pixel. So in this example, the first pixel has RGB value 49, 49, 49, the second pixel has RGB value 48, 48, 48, and so on.
P3
480 640
255
49 49 49 48 48 48 47 47 47 46 46 46 45 45 45 42 42 42 38 38
38 35 35 35 23 23 23 8 8 8 7 7 7 17 17 17 21 21 21 29 29
29 41 41 41 47 47 47 49 49 49 42 42 42 33 33 33 24 24 24 18 18
...
Now as you may notice, this particular picture is supposed to be 640 pixels wide which means 640*3 integers will provide the first row of pixels. But here the first row is very, very far from containing 640*3 integers. So the line-breaks in this file are meaningless, hence my problem.
The usual way to read files in Python is line by line. But I need to collect these integers into groups of 640*3 and treat each group like a line. How would one do this? I know I could read the file in line by line and append every line to some list, but then that list would be massive, and I assume that doing so would place an unacceptable burden on a device's memory. Other than that, I'm out of ideas. Help would be appreciated.
To read three space-separated words at a time from a file:
with open(filename, 'rb') as file:
    kind, dimensions, max_color = map(next, [file]*3) # read 3 lines
    rgbs = zip(*[(int(word) for line in file for word in line.split())] * 3)
Output
[(49, 49, 49),
(48, 48, 48),
(47, 47, 47),
(46, 46, 46),
(45, 45, 45),
(42, 42, 42),
...
See What is the most “pythonic” way to iterate over a list in chunks?
To avoid creating the whole list at once, you could use itertools.izip(), which lets you read one RGB value at a time (in Python 3 the built-in zip() is already lazy).
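Building on the lazy approach, the triples can be grouped into image rows without holding the whole file in memory. This is a Python 3 sketch under the assumption that the three header lines have already been read; `iter_pixel_rows` and its parameters are names chosen here for illustration.

```python
from itertools import islice


def iter_pixel_rows(lines, width):
    """Yield one image row at a time as a list of (r, g, b) tuples.

    `lines` is any iterable of text lines holding only the pixel data,
    i.e. the three PPM header lines have already been consumed.
    """
    numbers = (int(word) for line in lines for word in line.split())
    pixels = zip(numbers, numbers, numbers)  # lazy triples from one iterator
    while True:
        row = list(islice(pixels, width))
        if len(row) < width:  # end of data or an incomplete final row
            return
        yield row


# Tiny made-up example: a 2-pixel-wide image split across uneven lines.
body = ["1 2 3 4", "5 6 7 8 9", "10 11 12"]
for row in iter_pixel_rows(body, width=2):
    print(row)
# [(1, 2, 3), (4, 5, 6)]
# [(7, 8, 9), (10, 11, 12)]
```

Only one row of pixels is held in memory at a time, and the line breaks in the file play no role. An incomplete final row is silently dropped in this sketch.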
Probably not the most 'pythonic' way but...
Iterate through the lines containing integers.
Keep four counts - a count of 3 - color_code_count, a count of 1920 - numbers_processed, a count - col (0-639), and another - rows (0-479).
For each integer you encounter, add it to a temporary list at index color_code_count. Increment color_code_count and numbers_processed.
Once color_code_count is 3, you take your temporary list and create a tuple 3 or triplet (not sure what the term is but your structure will look like (49,49,49) for the first pixel), and add that to a list of 640 columns, and 480 rows - insert your (49, 49, 49) into pixels[col][row].
Increment col.
Reset color_code_count.
'numbers_processed' will continue to increment until you get to 1920.
Once you hit 1920, you've reached the end of the first row.
Reset numbers_processed and col to zero, increment row by 1.
By this point, you should have 640 tuple3s or triplets in the row zero starting with (49,49,49), (48, 48, 48), (47, 47, 47), etc. And you're now starting to insert pixel values in row 1 column 0.
Like I said, probably not the most 'pythonic' way. There are probably better ways of doing this using join and map but I think this might work? This 'solution' if you want to call it that, shouldn't care about number of integers on any line since you're keeping count of how many numbers you expect to run through (1920) before you start a new row.
A possible way to go through each word is to iterate over each line and then .split() it into words.
the_file = open("file.txt", "r")
for line in the_file:
    for word in line.split():
        #-----Your Code-----
From there you can do whatever you want with your "words". You can add if-statements to check whether there are numbers in each line (though this is not very pythonic):
for line in the_file:
    if "1" not in line or "2" not in line ...:
        for word in line.split():
            #-----Your Code-----
Or you can test whether there is anything in each line (much more pythonic):
for line in the_file:
    for word in line.split():
        if len(word) != 0 or word != "\n":
            #-----Your Code-----
I would recommend adding each of your new "lines" to a new document.
I am a C programmer. Sorry if this code looks like C Style:
f = open("pixel.ppm", "r")
type = f.readline()
height, width = f.readline().split()
height, width = int(height), int(width)
max_color = int(f.readline());
colors = []
count = 0
col_count = 0
line = []
while(col_count < height):
    count = 0
    i = 0
    row =[]
    while(count < width * 3):
        temp = f.readline().strip()
        if(temp == ""):
            col_count = height
            break
        temp = temp.split()
        line.extend(temp)
        i = 0
        while(i + 2 < len(line)):
            row.append({'r':int(line[i]),'g':int(line[i+1]),'b':int(line[i+2])})
            i = i+3
            count = count +3
            if(count >= width *3):
                break
        if(i < len(line)):
            line = line[i:len(line)]
        else:
            line = []
    col_count += 1
    colors.append(row)
for row in colors:
    for rgb in row:
        print(rgb)
    print("\n")
You can tweak this according to your needs. I tested it on this file:
P4
3 4
256
4 5 6 4 7 3
2 7 9 4
2 4
6 8 0
3 4 5 6 7 8 9 0
2 3 5 6 7 9 2
2 4 5 7 2
2
This seems to do the trick:
from re import findall
def _split_list(lst, i):
    return lst[:i], lst[i:]
def iter_ppm_rows(path):
    with open(path) as f:
        ftype = f.readline().strip()
        h, w = (int(s) for s in f.readline().split(' '))
        maxcolor = int(f.readline())
        rlen = w * 3
        row = []
        next_row = []
        for line in f:
            line_ints = [int(i) for i in findall(r'\d+\s+', line)]
            if not row:
                row, next_row = _split_list(line_ints, rlen)
            else:
                rest_of_row, next_row = _split_list(line_ints, rlen - len(row))
                row += rest_of_row
            if len(row) == rlen:
                yield row
                row = next_row
                next_row = []
It isn't very pretty, but it allows for varying whitespace between numbers in the file, as well as varying line lengths.
I tested it on a file that looked like the following:
P3
120 160
255
0 1 2 3 4 5 6 7
8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[...]
9993 9994 9995 9996 9997 9998 9999
That file used random line lengths, but printed numbers in order so it was easy to tell at what value the rows began and stopped. Note that its dimensions are different than in the question's example file.
Using the following test code...
for row in iter_ppm_rows('mock_ppm.txt'):
    print(len(row), row[0], row[-1])
...the result was the following, which seems to not be skipping over any data and returning rows of the right size.
480 0 479
480 480 959
480 960 1439
480 1440 1919
480 1920 2399
480 2400 2879
480 2880 3359
480 3360 3839
480 3840 4319
480 4320 4799
480 4800 5279
480 5280 5759
480 5760 6239
480 6240 6719
480 6720 7199
480 7200 7679
480 7680 8159
480 8160 8639
480 8640 9119
480 9120 9599
As can be seen, trailing data at the end of the file that can't represent a complete row was not yielded, which was expected but you'd likely want to account for it somehow.
