Hey guys, so I have set up 2 dictionaries which have the same keys but different values. I am trying to get the code to print output like this:
Digit Count %
1
2
3
4
5
6
7
8
9
The Count column should hold the countList values and the % column the num_Freq values, with the numbers going down the Count and % columns respectively.
Okay, so the data file looks like this (only showing a few lines because the file is pretty big):
Census Data
Alabama Winfield 4534
Alabama Woodland 208
Alabama Woodstock 1081
Alabama Woodville 743
Alabama Yellow Bluff 175
Alabama York 2477
Alaska Adak 361
The count is the number of occurrences of the first digit of each number. I basically turned each line into a list and appended the last value of the list (the number) to a new list. Then I counted how many times 1, 2, 3, 4, 5, 6, 7, 8, 9 appear as the first digit; that's what countList represents. I stored that in a dictionary with the digits as keys and the counts as values. The % is the relative frequency of the count: I set up a new list and calculated the relative frequency, which is the count divided by the sum of all the counts, times 100, rounded to one digit. The % column has the relative frequency of each digit. I put that into a dictionary as well, where the keys are the digits 1 through 9. So now I just need to print these numbers in the 3 columns.
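In other words, the % computation is just this (a toy sketch with made-up counts):

counts = {'1': 30, '2': 17, '3': 12}   # digit -> count (made-up numbers)
total = sum(counts.values())           # 59
freqs = {d: round(c / total * 100, 1) for d, c in counts.items()}
print(freqs)                           # {'1': 50.8, '2': 28.8, '3': 20.3}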
Here is my code so far
def main():
    num_freq = {}
    pop_num = []
    inFile = open("Census__2008.txt", "r")
    count = 0
    for line in inFile:
        if (count == 0):
            count += 1
            continue
        else:
            count += 1
            line = line.strip()
            word_list = line.split()
            pop_num.append(word_list[-1])
    counts = {}
    for x in pop_num:
        k = str(x)[0]
        counts.setdefault(k, 0)
        counts[k] += 1
    countList = [counts[str(i)] for i in range(1, 10)]
    sumList = sum(countList)
    dictCount = {}
    dictCount[1] = countList[0]
    dictCount[2] = countList[1]
    dictCount[3] = countList[2]
    dictCount[4] = countList[3]
    dictCount[5] = countList[4]
    dictCount[6] = countList[5]
    dictCount[7] = countList[6]
    dictCount[8] = countList[7]
    dictCount[9] = countList[8]
    num_Freq = []
    for elm in countList:
        rel_Freq = 0
        rel_Freq = rel_Freq + ((elm / sumList) * 100.0)
        rel_Freq = round(rel_Freq, 1)
        num_Freq.append(rel_Freq)
    freqCount = {}
    freqCount[1] = num_Freq[0]
    freqCount[2] = num_Freq[1]
    freqCount[3] = num_Freq[2]
    freqCount[4] = num_Freq[3]
    freqCount[5] = num_Freq[4]
    freqCount[6] = num_Freq[5]
    freqCount[7] = num_Freq[6]
    freqCount[8] = num_Freq[7]
    freqCount[9] = num_Freq[8]
    print ("Digit" " ", "Count", " ", "%")
    print (

main()
Using your code, you just need to do:
for i in range(1, 10):
    print(i, dictCount[i], freqCount[i])
But you can simplify it a lot:
import collections

data = []
with open("Census__2008.txt") as fh:
    fh.readline()  # skip the header line
    for line in fh:
        value = line.split()[-1]
        data.append(value)

c = collections.Counter([x[0] for x in data])
total = sum(c.values())

print("Digit", "Count", "%")
for k, v in sorted(c.items()):
    freq = v / total * 100
    round_freq = round(freq, 1)
    print(k, v, round_freq)
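If you also want the columns lined up like the table in the question, str.format field widths do that; a sketch (the column widths here are arbitrary):

print("{:<7}{:<7}{:<5}".format("Digit", "Count", "%"))
for k, v in sorted(c.items()):
    print("{:<7}{:<7}{:<5}".format(k, v, round(v / total * 100, 1)))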
Related
I have space-delimited data in a text file that looks like the following:
0 1 2 3
1 2 3
3 4 5 6
1 3 5
1
2 3 5
3 5
Each line has a different length.
I need to read it starting from line 2 ('1 2 3')
and parse it and get the following information:
Number of unique data = (1,2,3,4,5,6)=6
Count of each data:
count data (1)=3
count data (2)=2
count data (3)=5
count data (4)=1
count data (5)=4
count data (6)=1
Number of lines=6
Sort the data in descending order:
data (3)
data (5)
data (1)
data (2)
data (4)
data (6)
I did this:
import csv

file = open('data.txt')
csvreader = csv.reader(file)
header = []
header = next(csvreader)
print(header)
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
After this step, what should I do to get the expected results?
I would do something like this:
from collections import Counter

with open('data.txt', 'r') as file:
    lines = file.readlines()

lines = lines[1:]  # skip first line

data = []
for line in lines:
    data += line.strip().split(" ")

counter = Counter(data)

print(f'unique data: {list(counter.keys())}')
print(f'count data: {list(sorted(counter.most_common(), key=lambda x: x[0]))}')
print(f'number of lines: {len(lines)}')
print(f'sort data: {[x[0] for x in counter.most_common()]}')
A simple brute force approach:
nums = []
counts = {}

for row in open('data.txt'):
    if row[0] == '0':
        continue
    nums.extend([int(k) for k in row.rstrip().split()])
print(nums)

for n in nums:
    if n not in counts:
        counts[n] = 1
    else:
        counts[n] += 1
print(counts)

ordering = list(sorted(counts.items(), key=lambda k: -k[1]))
print(ordering)
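Incidentally, the manual counting and descending sort above are what collections.Counter already provides; a sketch of the same result:

from collections import Counter

counts = Counter(nums)
ordering = counts.most_common()  # list of (value, count) pairs, most common first
print(ordering)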
Here is another approach:
def getData(infile):
    """ Read file lines and return lines 1 thru end. """
    lnes = []
    with open(infile, 'r') as data:
        lnes = data.readlines()
    return lnes[1:]

def parseData(ld):
    """ Parse data and print desired results. """
    unique_symbols = set()
    all_symbols = dict()
    for l in ld:
        symbols = l.strip().split()
        for s in symbols:
            unique_symbols.add(s)
            cnt = all_symbols.pop(s, 0)
            cnt += 1
            all_symbols[s] = cnt
    print(f'Number of Unique Symbols = {len(unique_symbols)}')
    print(f'Number of Lines Processed = {len(ld)}')
    for symb in unique_symbols:
        print(f'Number of {symb} = {all_symbols[symb]}')
    # descending by count (ties broken by symbol), matching the requested output
    ordered = sorted(unique_symbols, key=lambda s: (-all_symbols[s], s))
    print(f"Descending Sort of Symbols by Count = {', '.join(ordered)}")
On executing:
infile = r'spaced_text.txt'
parseData(getData(infile))
Produces:
Number of Unique Symbols = 6
Number of Lines Processed = 6
Number of 2 = 2
Number of 5 = 4
Number of 3 = 5
Number of 1 = 3
Number of 6 = 1
Number of 4 = 1
Descending Sort of Symbols by Count = 3, 5, 1, 2, 4, 6
I have 3 text files as:
List1.txt:
032_M5, 5
035_M9, 5
036_M4, 3
038_M2, 6
041_M1, 6
List2.txt:
032_M5, 6
035_M9, 6
036_M4, 5
038_M2, 5
041_M1, 6
List3.txt:
032_M5, 6
035_M9, 6
036_M4, 4
038_M2, 5
041_M1, 6
where the 1st part (i.e. the string) of the lines in all 3 text files is the same, but the 2nd part (i.e. the number) changes.
I want to get three output files from this:
Output1.txt --> All lines where the numbers corresponding to a string are all different.
Example:
036_M4 3, 5, 4
Output2.txt --> All lines where the numbers corresponding to a string are all the same.
Example:
041_M1, 6
Output3.txt --> All lines where at least two of the numbers corresponding to a string are the same (which includes the results of Output2.txt).
Example:
032_M5, 6
035_M9, 6
038_M2, 5
041_M1, 6
Then I need the count of lines with number 1, number 2, number 3, number 4, number 5, and number 6 from Output3.txt.
Here is what I tried. It is giving me the wrong output.
from collections import defaultdict

data = defaultdict(list)
for fileName in ["List1.txt", "List2.txt", "List3.txt"]:
    with open(fileName, 'r') as file1:
        for line in file1:
            col1, value = line.split(",")
            data[col1].append(int(value))

with open("Output3.txt", "w") as output:
    for col1, values in data.items():
        if len(values) < 3:
            continue
        result = max(x for x in values)
        output.write(f"{col1}, {result}\n")
Here is an approach that does not use any Python modules and relies entirely on native built-in functions:
with open("List1.txt", "r") as list1, open("List2.txt", "r") as list2, open("List3.txt", "r") as list3:
# Forming association between keywords and numbers.
data1 = list1.readlines()
totalKeys = [elem.split(',')[0] for elem in data1]
numbers1 = [elem.split(',')[1].strip() for elem in data1]
numbers2 = [elem.split(',')[1].strip() for elem in list2.readlines()]
numbers3 = [elem.split(',')[1].strip() for elem in list3.readlines()]
totalValues = list(zip(numbers1,numbers2,numbers3))
totalDict = dict(zip(totalKeys,totalValues))
#Outputs
output1 = []
output2 = []
output3 = []
for key in totalDict.keys():
#Output1
if len(set(totalDict[key])) == 3:
output1.append([key, totalDict[key]])
#Output2
if len(set(totalDict[key])) == 1:
output2.append([key, totalDict[key][0]])
#Output3
if len(set(totalDict[key])) <= 2:
output3.append([key, max(totalDict[key], key=lambda elem: totalDict[key].count(elem))])
#Output1
print('Output1:')
for elem in output1:
print(elem[0] + ' ' + ", ".join(elem[1]))
print()
#Output2
print('Output2:')
for elem in output2:
print(elem[0] + ' ' + " ".join(elem[1]))
print()
#Output3
print('Output3:')
for elem in output3:
print(elem[0] + ' ' + " ".join(elem[1]))
The result of the above will be:
Output1:
036_M4 3, 5, 4
Output2:
041_M1 6
Output3:
032_M5 6
035_M9 6
038_M2 5
041_M1 6
max gives the biggest number in the list, not the most commonly occurring. For that, use statistics.mode
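For example, on a list where the biggest value is not the most common one:

>>> from statistics import mode
>>> max([6, 5, 5])
6
>>> mode([6, 5, 5])
5

Applied to your code: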
from collections import defaultdict
from statistics import mode

data = defaultdict(list)
for fileName in ["List1.txt", "List2.txt", "List3.txt"]:
    with open(fileName, 'r') as file1:
        for line in file1:
            col1, value = line.split(",")
            data[col1].append(int(value))

with open("Output1.txt", "w") as output:
    for col1, values in data.items():
        if len(values) < 3:
            continue
        if values[0] != values[1] != values[2] and values[0] != values[2]:
            output.write(f"{col1}, {values[0]}, {values[1]}, {values[2]}\n")

with open("Output2.txt", "w") as output:
    for col1, values in data.items():
        if len(values) < 3:
            continue
        if values[0] == values[1] == values[2]:
            output.write(f"{col1}, {values[0]}\n")

with open("Output3.txt", "w") as output:
    for col1, values in data.items():
        if len(values) < 3:
            continue
        if len(set(values)) <= 2:  # at least two of the three values match
            output.write(f"{col1}, {mode(values)}\n")
I have a file with 3 scores for each person, one person per row. I want to use these scores and get the average of all 3 of them. Their scores are separated by tabs, and I want the output in descending order of average. For example:
tam 10 6 11
tom 3 7 3
tim 5 4 6
these people would come out with an average of:
tam 9
tim 5
tom 4
I want these to print to the Python shell, but not be saved to the file.
with open("file.txt") as file1:
d = {}
count = 0
for line in file1:
column = line.split()
names = column[0]
average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip()))/3
count = 0
while count < 3:
d.setdefault(names, []).append(average)
count = count + 1
for names, v in sorted(d.items()):
averages = (sum(v)/3)
print(names,average)
averageslist=[]
averageslist.append(averages)
My code only finds the first person's average and outputs it for all of them. I also want it to be in descending order of averages.
You can use the following code, which parses your file into a list of (name, average) tuples and prints every entry of the list sorted by average:
import operator

with open("file.txt") as f:
    data = []
    for line in f:
        parts = line.split()
        name = parts[0]
        vals = parts[1:]
        avg = sum(int(x) for x in vals) / len(vals)
        data.append((name, avg))

for person in sorted(data, key=operator.itemgetter(1), reverse=True):
    print("{} {}".format(*person))
You are almost correct. You already calculate the average in the first step, so there is no need for sum(v)/3 again. Try this:
with open("file.txt") as file1:
d = {}
count = 0
for line in file1:
column = line.split()
names = column[0]
average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip()))/3
d[names] = average
for names, v in sorted(d.items(),key=lambda x:x[1],reverse=True): #increasing order==>sorted(d.items(),key=lambda x:x[1])
print(names,v)
#output
('tam', 9)
('tim', 5)
('tom', 4)
To sort by name
for names, v in sorted(d.items()):
print(names,v)
#output
('tam', 9)
('tim', 5)
('tom', 4)
The issue is this:
averages = (sum(v)/3)
print(names,average)
Notice that on the first line you are computing averages (with an s at the end) and on the next line you are printing average (without an s).
Try This:
from operator import itemgetter

with open("file.txt") as file1:
    d = {}
    for line in file1:
        column = line.split()
        names = column[0]
        average = (int(column[1].strip()) + int(column[2].strip()) + int(column[3].strip())) / 3
        d.setdefault(names, []).append(average)

for names, v in sorted(d.items(), key=itemgetter(1), reverse=True):
    print(names, v)  # note: v is a one-element list here
I'm stuck on a function in my program that formats multiple lines of numbers separated by spaces. The following code takes the unformatted list of lists and makes it into a table separated by spaces, without brackets:
def printMatrix(matrix):
    return ('\n'.join(' '.join(map(str, row)) for row in matrix))
I would like all of the numbers to line up nicely in the output, though. I can't figure out how to stick the format operator into the list comprehension to make this happen. The input is always a square matrix (2x2, 3x3, etc.).
Here's the rest of the program, for clarity:
# Magic Squares
def main():
    file = "matrix.txt"
    matrix = readMatrix(file)
    print(printMatrix(matrix))
    result1 = colSum(matrix)
    result2 = rowSum(matrix)
    result3 = list(diagonalSums(matrix))
    sumList = result1 + result2 + result3
    check = checkSums(sumList)
    if check == True:
        print("This matrix is a magic square.")
    else:
        print("This matrix is NOT a magic square.")

def readMatrix(file):
    with open(file) as contents:
        return [[int(item) for item in line.split()] for line in contents]

def colSum(matrix):
    answer = []
    for column in range(len(matrix[0])):
        t = 0
        for row in matrix:
            t += row[column]
        answer.append(t)
    return answer

def rowSum(matrix):
    return [sum(column) for column in matrix]

def diagonalSums(matrix):
    l = len(matrix[0])
    diag1 = [matrix[i][i] for i in range(l)]
    diag2 = [matrix[l - 1 - i][i] for i in range(l - 1, -1, -1)]
    return sum(diag1), sum(diag2)

def checkSums(sumList):
    return all(x == sumList[0] for x in sumList)

def printMatrix(matrix):
    return ('\n'.join(' '.join(map(str, row)) for row in matrix))

main()
def printMatrix(matrix):
    return "\n".join((("{:<10}" * len(row)).format(*row)) for row in matrix)
In [19]: arr=[[1,332,3,44,5],[6,7,8,9,100]]
In [20]: print(printMatrix(arr))
1         332       3         44        5
6         7         8         9         100
"{:<10}"*len(row)) creates a {} for each number left aligned 10 <:10 then we use str.format format(*row) to unpack each row.
Something like this should do the trick:
def print_matrix(matrix, pad_string=' ', padding_amount=1,
                 number_delimiter=' ', number_width=3):
    '''
    Convert a list of lists holding integers to a string table.
    '''
    format_expr = '{{:{}>{}d}}'.format(number_delimiter, number_width)
    padding = pad_string * padding_amount
    result = ''
    for row in matrix:
        for col in row:
            result = '{}{}{}'.format(
                result,
                padding,
                format_expr
            ).format(col)
        result = '{}\n'.format(result)
    return result
a = [
    [1, 2, 3, 22, 450],
    [333, 21, 13, 5, 7]
]

print(print_matrix(a))
#    1   2   3  22 450
#  333  21  13   5   7

print(print_matrix(a, number_delimiter=0))
#  001 002 003 022 450
#  333 021 013 005 007
I have the following function:
import os
import json
import re

def filetxt():
    word_freq = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []
    for l in range(0, 500):
        # Open File
        if os.path.exists("C:/Twitter/json/user_" + str(l) + ".json") == True:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                for i in range(len(text_f)):
                    text.append(text_f[str(i)]['text'])
                    total_t = total_t + 1
        else:
            pass

    # Filter
    occ = 0
    import string
    for i in range(len(text)):
        s = text[i]  # Sample string
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(#)', s)
        occ = len(a) + len(b) + occ
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("", ""), string.punctuation)

        # Create Wordlist/Dictionary
        word_list = text[i].lower().split(None)
        for word in word_list:
            word_freq[word] = word_freq.get(word, 0) + 1
        keys = word_freq.keys()
        numbo = range(1, len(keys) + 1)
        WList = ', '.join(keys)
        NList = str(numbo).strip('[]')
        WList = WList.split(", ")
        NList = NList.split(", ")
        W2N = dict(zip(WList, NList))
        for k in range(0, len(word_list)):
            word_list[k] = W2N[word_list[k]]
        for i in range(0, len(word_list) - 1):
            lvl1.append(word_list[i])
            lvl2.append(word_list[i + 1])
I have used the profiler and found that the greatest CPU time is spent on the zip() function and on the join and split parts of the code. I'm looking to see if there is anything I have overlooked that I could clean up to make the code more optimized, since the greatest lag seems to be in how I am working with the dictionaries and the zip() function. Any help would be appreciated, thanks!
p.s. The basic purpose of this function is that I load in files which contain 20 or so tweets each, so I am most likely going to end up with about 20k - 50k files being sent through this function. The output is a list of all the distinct words in a tweet, followed by which words link to what, e.g.:
1 "love"
2 "pasa"
3 "mirar"
4 "ants"
5 "kers"
6 "morir"
7 "dreaming"
8 "tan"
9 "rapido"
10 "one"
11 "much"
12 "la"
...
10 1
13 12
1 7
12 2
7 3
2 4
3 11
4 8
11 6
8 9
6 5
9 20
5 8
20 25
8 18
25 9
18 17
9 2
...
I think you want something like:
import json
import string
from collections import defaultdict

try:
    rng = xrange   # Python 2
except NameError:
    rng = range    # Python 3

def filetxt():
    users = 0
    total_t = 0
    occ = 0
    wordcount = defaultdict(int)
    wordpairs = defaultdict(lambda: defaultdict(int))
    for filenum in rng(500):
        try:
            with open("C:/Twitter/json/user_" + str(filenum) + ".json", 'r') as inf:
                users += 1
                tweets = json.load(inf)
                total_t += len(tweets)
                for txt in (r['text'] for r in tweets):
                    occ += txt.count('RT') + txt.count('#')
                    prev = None
                    for word in txt.encode('utf-8').translate(None, string.punctuation).lower().split():
                        wordcount[word] += 1
                        wordpairs[prev][word] += 1
                        prev = word
        except IOError:
            pass
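To turn those structures into the numbered word list and pair list from the question, something like this sketch should work (the 1-based numbering is my assumption to match the sample output; the pairs come out aggregated rather than repeated):

index = {word: i for i, word in enumerate(wordcount, start=1)}
for word, i in index.items():
    print(i, word)
for prev, following in wordpairs.items():
    if prev is None:
        continue  # the None key collects the first word of each tweet, not a real pair
    for word in following:
        print(index[prev], index[word])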
I hope you don't mind I took the liberty of modifying your code to something that I would more likely write.
import json
import re
from itertools import izip

def filetxt():
    # keeps track of word count for each word
    word_freq = {}
    # list of words which we've found
    word_list = []
    # mapping from word -> index in word_list
    word_map = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []

    ####### You should replace this with a glob (see: glob module)
    for l in range(0, 500):
        # Open File
        try:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                # in this file there are multiple tweets so add the text
                # for each one
                for t in text_f.itervalues():
                    text.append(t)  ## CHECK THIS
        except IOError:
            pass
    total_t = len(text)

    # Filter
    occ = 0
    import string
    for s in text:
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(#)', s)
        occ += len(a) + len(b)
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("", ""), string.punctuation)

        # make a list of words that are in the text s
        words = s.lower().split(None)
        for word in words:
            # try/except is quicker when we expect not to miss,
            # and it will be rare for us not to have
            # a word in our list already
            try:
                word_freq[word] += 1
            except KeyError:
                # we've never seen this word before so add it to our list
                word_freq[word] = 1
                word_map[word] = len(word_list)
                word_list.append(word)

        # little trick to get each word and the word that follows
        for curword, nextword in izip(words, words[1:]):
            lvl1.append(word_map[curword])
            lvl2.append(word_map[nextword])
What this is going to do is give you the following. lvl1 will give you a list of numbers corresponding to words in word_list, so word_list[lvl1[0]] will be the first word in the first tweet you processed. lvl2[0] will be the index of the word that follows lvl1[0], so you can say word_list[lvl2[0]] is the word that follows word_list[lvl1[0]]. This code basically maintains word_map, word_list and word_freq as it builds this.
Please note that the way you were doing this before, specifically the way you were creating W2N, will not work properly. Dictionaries do not maintain order. Ordered dictionaries are coming in 3.1, but just forget about it for now. Basically, when you were doing word_freq.keys(), it changed every time you added a new word, so there was no consistency. See this example:
>>> x = dict()
>>> x[5] = 2
>>> x
{5: 2}
>>> x[1] = 24
>>> x
{1: 24, 5: 2}
>>> x[10] = 14
>>> x
{1: 24, 10: 14, 5: 2}
>>>
So 5 used to be the 2nd one, but now it's the 3rd.
I also updated it to use a 0 index instead of 1 index. I don't know why you were using range(1, len(...)+1) rather than just range(len(...)).
Regardless, you should get away from thinking about for loops in the traditional C/C++/Java sense, where you loop over numbers. You should consider that unless you need an index number, you don't need it.
Rule of thumb: if you need an index, you probably also need the element at that index, so you should be using enumerate anyway.
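For instance:

# instead of indexing with a counter:
for i in range(len(word_list)):
    print(i, word_list[i])

# let enumerate hand you both:
for i, word in enumerate(word_list):
    print(i, word)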
Hope this helps...
A few things. These lines are weird for me when put together:
WList = ', '.join(keys)
<snip>
WList = WList.split(", ")
That should be WList = list(keys).
Are you sure you want to optimize this? I mean, is it really so slow that it's worth your time? And finally, a description of what the script should do would be great, instead of letting us decipher it from the code :)
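If you do want to confirm whether it's worth optimizing, the standard-library profiler gives a quick answer; for example:

import cProfile
cProfile.run('filetxt()', sort='cumulative')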