My program has to do two things with this file.
It needs to print the following information:
def getlines(somefile):
    f = open(somefile).readlines()
    lines = [line for line in f if not line.startswith("#") and not line.strip() == ""]
    return lines

entries = getlines(input("Name of input file: "))
animal_visits = {}
month_visits = [0] * 13

for entry in entries:
    # count visits for each animal
    animal = entry[:3]
    animal_visits[animal] = animal_visits.get(animal, 0) + 1
    # count visits for each month
    month = int(entry[4:6])
    month_visits[month] += 1

print("Total Number of visits for each animal")
for x in sorted(animal_visits):
    print(x, "\t", animal_visits[x])
print("====================================================")
print("Month with highest number of visits to the stations")
print(month_visits.index(max(month_visits)))
Outputs:
Name of input file: log
Total Number of visits for each animal
a01 3
a02 3
a03 8
====================================================
Month with highest number of visits to the stations
1
I prepared the following script:
from datetime import datetime        # to parse the date strings
from collections import defaultdict  # to accumulate frequencies
import calendar                      # to get the names of the months

# Store the names of the months
MONTHS = [item for item in calendar.month_name]


def entries(filename):
    """Yields triplets (animal, date, station) contained in
    `filename`.
    """
    with open(filename, "rb") as fp:
        for line in (_line.strip() for _line in fp):
            # skip comments
            if line.startswith("#"):
                continue
            try:
                # obtain the entry or try the next line
                animal, datestr, station = line.split(":")
            except ValueError:
                continue
            # convert the date string to an actual datetime object
            date = datetime.strptime(datestr, "%m-%d-%Y")
            # yield the value
            yield animal, date, station


def visits_per_animal(data):
    """Count of visits per station sorted by animal."""
    # a dictionary whose values are implicitly initialized to 0
    counter = defaultdict(int)
    for animal, date, station in data:
        counter[animal] += 1
    # print the outcome
    print "Visits Per Animal"
    for animal in sorted(counter.keys()):
        print "{0}: {1}".format(animal, counter[animal])


def month_of_highest_frequency(data):
    """Calculates the month with the highest frequency."""
    # same as above: values are implicitly initialized to 0 for new keys
    counter = defaultdict(int)
    for animal, date, station in data:
        counter[date.month] += 1
    # select the (key, value) pair where the value is maximum
    month_max, visits_max = max(counter.iteritems(), key=lambda t: t[1])
    # pretty-print
    print "{0} has the most visits ({1})".format(MONTHS[month_max], visits_max)


def main(filename):
    """Main program: read the data, then apply both functions."""
    data = [entry for entry in entries(filename)]
    visits_per_animal(data)
    month_of_highest_frequency(data)


if __name__ == "__main__":
    import sys
    main(sys.argv[1])
Use as:
$ python animalvisits.py animalvisits.txt
Visits Per Animal
a01: 3
a02: 3
a03: 8
January has the most visits (3)
Having done that, I must advise you against this approach. Querying data like this is inefficient, difficult, and error-prone. I recommend storing your data in an actual database (Python offers an excellent binding for SQLite) and using SQL for your reductions.
If you adopt the SQLite philosophy, you can simply store your queries as plain text files and run them on demand (via Python, a GUI, or the command line).
Visit http://docs.python.org/2/library/sqlite3.html for more details.
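A minimal sketch of that SQLite route; the table layout and the hard-coded sample rows are made up to stand in for the parsed log entries:

```python
import sqlite3

# an in-memory database for illustration; use a filename to persist it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (animal TEXT, month INTEGER, station TEXT)")

# load the parsed entries once (here: hard-coded sample rows)
rows = [("a01", 1, "s1"), ("a01", 1, "s2"), ("a03", 2, "s1")]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)

# visits per animal
for animal, count in conn.execute(
        "SELECT animal, COUNT(*) FROM visits GROUP BY animal ORDER BY animal"):
    print("{0}: {1}".format(animal, count))

# month with the most visits
month, count = conn.execute(
    "SELECT month, COUNT(*) AS n FROM visits GROUP BY month ORDER BY n DESC"
).fetchone()
print("Month {0} has the most visits ({1})".format(month, count))
```

Both reductions then become one-line SQL queries instead of Python loops.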
Have you tried using regular expressions? I suspect your code would reduce to very few lines with them: use findall() with the appropriate patterns, store the matches in a list, and then count the length of that list.
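For illustration, a minimal sketch of the idea; the log format and the pattern are assumptions based on the question's output:

```python
import re

log = """# comment line
a01:01-24-2011:s1
a02:01-25-2011:s2
a01:02-02-2011:s1
"""

# findall returns one match per line that starts with "a01:";
# the length of that list is the visit count for that animal
visits_a01 = re.findall(r"^a01:", log, flags=re.MULTILINE)
print(len(visits_a01))  # → 2
```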
I have a companies.json file with 3.6 million records (every record contains an id and a VAT number) and an events.json file with 76,000 records (each with roughly 20 properties). I wrote a script that does the following:
Open both JSON files
Loop through the 76,000 event records (each is a dict)
Check if the status of the event is new
If the status is new, check if the event has a companyID
If the event has a companyID, loop through the 3.6 million records to find the matching company ID.
Check if the matching company record has a VAT number
Replace the companyID with the VAT number and add a companyIDIsVat boolean.
When all looping is done, write the events to a new JSON file.
The script works fine, but it takes 6-7 hours to complete. Is there a way to speed it up?
Current script
import json

counter = 0
with open('companies.json', 'r') as companiesFile:
    with open('events.json', 'r') as eventsFile:
        events = json.load(eventsFile)
        companies = json.load(companiesFile)
        for index, event in enumerate(events):
            print('Counter: ' + str(index))
            if 'status' in event:
                if event['status'] == 'new':
                    if 'companyID' in event:
                        for company in companies:
                            if event['companyID'] == company['_id']:
                                if 'vat' in company:
                                    event['companyID'] = company['vat']
                                    event['companyIDIsVat'] = 1
                                    counter = counter + 1

print('Found matches: ' + str(counter))
with open('new_events.json', 'w', encoding='utf-8') as f:
    json.dump(events, f, ensure_ascii=False, indent=4)
So, the problem is that you are repeatedly searching through the entire companies list. Lists are inefficient for searching here, because each lookup is a linear scan, i.e. O(N). You can get a constant-time lookup if you use a dict instead, assuming company['_id'] is unique. Basically, you want to index on your IDs: for constant-time lookups, use a dictionary, i.e. a map (a hash map in CPython, and probably in every Python implementation):
import json

counter = 0
with open('companies.json', 'r') as companiesFile:
    with open('events.json', 'r') as eventsFile:
        events = json.load(eventsFile)
        companies = {
            c["_id"]: c for c in json.load(companiesFile)
        }
        for index, event in enumerate(events):
            print('Counter: ' + str(index))
            if 'status' in event:
                if (
                    event['status'] == 'new'
                    and 'companyID' in event
                    and event['companyID'] in companies
                ):
                    company = companies[event['companyID']]
                    if 'vat' in company:
                        event['companyID'] = company['vat']
                        event['companyIDIsVat'] = 1
                        counter = counter + 1

print('Found matches: ' + str(counter))
with open('new_events.json', 'w', encoding='utf-8') as f:
    json.dump(events, f, ensure_ascii=False, indent=4)
This is a minimal modification to your script.
You should probably just save companies.json in that keyed structure in the first place.
Again, this assumes the companies are unique by ID. If they are not, you can use a dictionary of lists instead, which should still be significantly faster as long as there aren't many repeated IDs.
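For completeness, a sketch of that dictionary-of-lists fallback (the records here are made up):

```python
from collections import defaultdict

companies = [
    {"_id": "c1", "vat": "BE001"},
    {"_id": "c1", "vat": "BE002"},  # duplicate ID
    {"_id": "c2", "vat": "BE003"},
]

# index: _id -> list of all company records sharing that ID
by_id = defaultdict(list)
for company in companies:
    by_id[company["_id"]].append(company)

# lookup is still O(1); you then decide how to handle the duplicates
for company in by_id.get("c1", []):
    print(company["vat"])
```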
I am not sure the title of this question is right. I know the result is not a list, and I am trying to put the results into a dictionary, but it only keeps the last value of my loop.
I have pasted all my code, but my question is specifically about the candidates loop, where I am trying to get the percentage of votes per candidate. When I print the information it looks like this:
[screenshot of the printed election results]
As you can see, the third section of the results shows each candidate with the percentage and total votes next to them. This result is what I am not sure about (it is not a list and not a dictionary).
I am trying to write this to my output CSV, but however I try, I only ever end up writing the last result, which is O'Tooley.
I am new at this, so I am not sure why, even if I save my percentage in a list after each loop, I still save only the percentage for O'Tooley. That is why I decided to print after each loop; it was my only way to make sure all the results look as in the picture.
import os
import csv

electiondatapath = os.path.join('../..', 'gt-atl-data-pt-03-2020-u-c', '03-Python', 'Homework', 'PyPoll', 'Resources', 'election_data.csv')

with open(electiondatapath) as csvelectionfile:
    csvreader = csv.reader(csvelectionfile, delimiter=',')
    # Read the header row first
    csv_header = next(csvelectionfile)
    # hold the number of rows, which will be the total votes
    num_rows = 0
    # total votes per candidate
    totalvotesDic = {}
    # list to zip and write to csv
    results = []
    for row in csvreader:
        # total number of votes cast
        num_rows += 1
        # If the candidate is not in the dictionary keys, add it and
        # count it as one; otherwise add 1 to its votes
        if row[2] not in totalvotesDic.keys():
            totalvotesDic[row[2]] = 1
        else:
            totalvotesDic[row[2]] += 1

print("Election Results")
print("-----------------------")
print(f"Total Votes: {num_rows}")
print("-----------------------")

# get the percentage of votes and print the result next to each candidate and total votes
for candidates in totalvotesDic.keys():
    #totalvotesDic[candidates].append("{:.2%}".format(totalvotesDic[candidates] / num_rows))
    candidates_info = candidates, "{:.2%}".format(totalvotesDic[candidates] / num_rows), "(", totalvotesDic[candidates], ")"
    print(candidates, "{:.2%}".format(totalvotesDic[candidates] / num_rows), "(", totalvotesDic[candidates], ")")

# get the winner out of the candidates
winner = max(totalvotesDic, key=totalvotesDic.get)
print("-----------------------")
print(f"Winner: {winner}")
print("-----------------------")

# append to the list to zip
results.append("Election Results")
results.append(f"Total Votes: {num_rows}")
results.append(candidates_info)
results.append(f"Winner: {winner}")

# zip the list together
cleaned_csv = zip(results)
# Set variable for output file
output_file = os.path.join("output_Pypoll.csv")
# Open the output file
with open(output_file, "w") as datafile:
    writer = csv.writer(datafile)
    # Write the zipped rows
    writer.writerows(cleaned_csv)
In each iteration, you overwrite the variable candidates_info with the data for just one candidate. You need to concatenate the strings, as in this example:
candidates_info = ""
for candidates in totalvotesDic.keys():
    candidates_info = '\n'.join([candidates_info, candidates + "{:.2%}".format(totalvotesDic[candidates] / num_rows) + "(" + str(totalvotesDic[candidates]) + ")"])

print(candidates_info)
# prints
# O'Tooley 0.00%(2)
# Someone 30.00%(..)
Also, you don't need keys(). Try this instead:
candidates_info = ""
for candidates, votes in totalvotesDic.items():
    candidates_info = '\n'.join([candidates_info, str(candidates) + "{:.2%}".format(votes / num_rows) + "(" + str(votes) + ")"])
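As a further sketch (not the original code; the vote counts here are made up), you can sidestep the string building entirely by collecting one list per candidate and letting csv.writer lay out the rows:

```python
import csv

totalvotesDic = {"Khan": 6, "Correy": 3, "O'Tooley": 1}  # made-up counts
num_rows = sum(totalvotesDic.values())

rows = [["Election Results"], [f"Total Votes: {num_rows}"]]
for candidate, votes in totalvotesDic.items():
    # one CSV row per candidate: name, percentage, vote count
    rows.append([candidate, "{:.2%}".format(votes / num_rows), votes])
rows.append([f"Winner: {max(totalvotesDic, key=totalvotesDic.get)}"])

with open("output_Pypoll.csv", "w", newline="") as datafile:
    csv.writer(datafile).writerows(rows)
```

Every candidate then lands in the output file, not just the last one.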
I have a question regarding a current program that I am trying to modify.
The current program I have:
import re
import sys


def extract_names(filename):
    names = []
    f = open(filename, 'rU')
    text = f.read()
    yearmatch = re.search(r'Popularity\sin\s(\d\d\d\d)', text)
    if not yearmatch:
        sys.stderr.write('unavailable year\n')
        sys.exit(1)
    year = yearmatch.group(1)
    names.append(year)
    # finds all patterns of rank, boyname, and girlname; creates tuples
    yeartuples = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text)
    rankednames = {}
    for rank_tuple in yeartuples:
        (rank, boyname, girlname) = rank_tuple
        if boyname not in rankednames:
            rankednames[boyname] = rank
        if girlname not in rankednames:
            rankednames[girlname] = rank
    sorted_names = sorted(rankednames.keys(), key=lambda x: int(rankednames[x]), reverse=True)
    for name in sorted_names:
        names.append(name + " " + rankednames[name])
    return names[:20]


# Boilerplate from this point
def main():
    args = sys.argv[1:]
    if not args:
        print 'usage: [--summaryfile] file [file ...]'
        sys.exit(1)
    summary = False
    if args[0] == '--summaryfile':
        summary = True
        del args[0]
    for filename in args:
        names = extract_names(filename)
        text = '\n'.join(names)
        if summary:
            outf = open(filename + '.summary', 'w')
            outf.write(text + '\n')
            outf.close()
        else:
            print text


if __name__ == '__main__':
    main()
It takes information from a website's table of the most popular baby names for a given year, builds a list from that data, and prints the names ordered from the lowest rank (1000) to the highest rank (1). The modification I am trying to make should sort the names alphabetically (a first), but within each letter group (all of the a's, all of the b's, etc.) sort them in descending rank order, so the lowest-ranked name starting with a would be the first name to show up. I have tried re.search for each letter, but I don't think it works as intended that way. I am having the most trouble with the sorting within the letter groups. Are there any other approaches or solutions?
In the call to sorted, replace:
key=lambda x: int(rankednames[x]), reverse = True
with:
key=lambda x: (x[0], -int(rankednames[x]))
The general point is that you can always use a tuple to combine two or more different sort keys with one used first and the other as a "tie-breaker". The specific point is that we can easily simulate reverse=True because the key happens to be an integer and therefore can be negated: this trick wouldn't work for a string key.
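A quick illustration with made-up names and ranks:

```python
rankednames = {"Aaron": "812", "Abel": "54", "Brian": "3"}  # hypothetical data

# primary key: first letter, ascending; tie-breaker: rank, descending
ordered = sorted(rankednames, key=lambda x: (x[0], -int(rankednames[x])))
print(ordered)  # → ['Aaron', 'Abel', 'Brian']
```

Within the letter group "A", the lowest-ranked name (Aaron, rank 812) comes first, which is exactly the ordering the question asks for.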
I am trying to measure the length of vectors based on a value of the first column of my input data.
For instance: my input data is as follows:
dog nmod+n+-n 4
dog nmod+n+n-a-commitment-n 6
child into+ns-j+vn-pass-rb-divide-v 3
child nmod+n+ns-commitment-n 5
child nmod+n+n-pledge-n 3
hello nmod+n+ns 2
The value that I want to calculate is based on an identical value in the first column. For instance, I would calculate a value based on all rows in which dog is in the first column, then I would calculate a value based on all rows in which child is in the first column... and so on.
I have worked out the mathematics to calculate the vector length (Euclidean norm). However, I am unsure how to group the calculation by the identical values in the first column.
So far, this is the code that I have written:
#!/usr/bin/python
import os
import sys
import getopt
import datetime
import math

print "starting:",
print datetime.datetime.now()


def countVectorLength(infile, outfile):
    with open(infile, 'rb') as inputfile:
        flem, _, fw = next(inputfile).split()
        current_lem = flem
        weights = [float(fw)]
        for line in inputfile:
            lem, _, w = line.split()
            if lem == current_lem:
                weights.append(float(w))
            else:
                print current_lem,
                print math.sqrt(sum([math.pow(weight, 2) for weight in weights]))
                current_lem = lem
                weights = [float(w)]
        print current_lem,
        print math.sqrt(sum([math.pow(weight, 2) for weight in weights]))

print "Finish:",
print datetime.datetime.now()

path = '/Path/to/Input/'
pathout = '/Path/to/Output'
listing = os.listdir(path)
for infile in listing:
    outfile = 'output' + infile
    print "current file is:" + infile
    countVectorLength(path + infile, pathout + outfile)
This code outputs the length of vector of each individual lemma. The above data gives me the following output:
dog 7.211102550927978
child 6.48074069840786
hello 2
UPDATE
I have been working on it and have managed to get the working code shown in the sample above. However, as you can see, the code had a problem with the output of the very last line of each file, which I solved rather crudely by printing it manually after the loop. Because of this problem, it does not permit a clean iteration through the directory: it outputs the results of all files into one appended document. Is there a cleaner, more Pythonic way to write each individual result directly to its corresponding file in the output directory?
First, you need to transform the input into something like
dog => [4, 6]
child => [3, 5, 3]
etc.
It goes like this:
from collections import defaultdict

data = defaultdict(list)
for line in open(infile):  # `infile` as in your script
    parts = line.split()   # the sample data is whitespace-separated
    data[parts[0]].append(float(parts[2]))
Once this is done, the rest is obvious:
import math

def vector_len(vec):
    # you already have this part: the Euclidean norm
    return math.sqrt(sum(w * w for w in vec))

vector_lens = {name: vector_len(values) for name, values in data.items()}
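To make the whole pipeline concrete, here is a self-contained sketch using the sample rows from the question (read inline rather than from a file):

```python
import math
from collections import defaultdict

lines = """dog nmod+n+-n 4
dog nmod+n+n-a-commitment-n 6
child into+ns-j+vn-pass-rb-divide-v 3
child nmod+n+ns-commitment-n 5
child nmod+n+n-pledge-n 3
hello nmod+n+ns 2""".splitlines()

# group the weights by the lemma in the first column
data = defaultdict(list)
for line in lines:
    lem, _, w = line.split()
    data[lem].append(float(w))

# Euclidean norm per lemma; the "dog" line gives sqrt(4**2 + 6**2) ≈ 7.2111
for lem, weights in data.items():
    print(lem, math.sqrt(sum(w * w for w in weights)))
```

Grouping first (instead of relying on consecutive lines) also removes the need for the manual last-line print in the original script.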
2012-05-10 BRAD 10
2012-05-08 BRAD 40
2012-05-08 BRAD 60
2012-05-12 TOM 100
I wanted the output to be:
2012-05-08 BRAD|2|100
2012-05-10 BRAD|1|10
2012-05-12 TOM|1|100
I started with this code:
import os, sys

fo = open("meawoo.txt", "w")
f = open("test.txt", "r")
fn = f.readlines()
f.close()
for line in fn:
    line = line.strip()
    sline = line.split("|")
    p = sline[1].split(" ")[0], sline[2], sline[4]
    print p
    fo.writelines(str(p) + "\n")
fo.close()

o_read = open("meawoo.txt", "r")
x_read = o_read.readlines()

from operator import itemgetter
x_read.sort(key=itemgetter(0))

from itertools import groupby
z = groupby(x_read, itemgetter(0))
print z
for elt, items in groupby(x_read, itemgetter(0)):
    print elt, items
    for i in items:
        print i
It would be very helpful if you could suggest some useful changes to my work. Thanks in advance.
The following code should print the data in your wanted format (as far as I understand it):
d = {}
with open("testdata.txt") as f:
    for line in f:
        parts = line.split()
        if parts[0] in d:
            if parts[1] in d[parts[0]]:
                d[parts[0]][parts[1]][0] += int(parts[2])
            else:
                d[parts[0]][parts[1]] = [int(parts[2]), 0]
            d[parts[0]][parts[1]][1] += 1
        else:
            d[parts[0]] = {parts[1]: [int(parts[2]), 1]}

for date in sorted(d):
    for name in sorted(d[date]):
        print "%s %s|%d|%d" % (date, name, d[date][name][1], d[date][name][0])
I save every line in a dictionary with the line's date as the key; each value is another dictionary with the name as the key, whose value is a two-element list: the first element is the cumulative sum of the numbers for this name on this date, and the second is the number of summands for this date/name combination. I then print the dictionary in your demanded format. I use the fact that comparing two dates gives the same result as comparing them as strings in the YYYY-MM-DD format, so I can just use the sorted function on the date strings. I sort on the names too.
For an example (adapted to not being able to use a file) see http://ideone.com/rx3h2. It gives the same output you demanded.
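Since the original attempt already reaches for itertools.groupby, here is a sketch of that route as well (sample lines inlined; remember that groupby only merges *adjacent* equal keys, so the data must be sorted first):

```python
from itertools import groupby
from operator import itemgetter

lines = [
    "2012-05-10 BRAD 10",
    "2012-05-08 BRAD 40",
    "2012-05-08 BRAD 60",
    "2012-05-12 TOM 100",
]

# parse into [date, name, value] rows and sort by (date, name)
rows = sorted((line.split() for line in lines), key=itemgetter(0, 1))

# each group is one run of rows sharing the same (date, name) pair
for (date, name), items in groupby(rows, key=itemgetter(0, 1)):
    values = [int(v) for _, _, v in items]
    print("%s %s|%d|%d" % (date, name, len(values), sum(values)))
# prints:
# 2012-05-08 BRAD|2|100
# 2012-05-10 BRAD|1|10
# 2012-05-12 TOM|1|100
```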