Python - can text file data be stored in code?

I'm writing a program that requires lots of Date lookups (Fiscal Year, Month, Week). To simplify the lookups I created a Dictionary where the Key is a date (used for the lookup) and the Value is a Class Object. I put the class def and the code to read the dates data (a .txt file) in a separate file, not the main file. BTW, this is not a question about Date objects.
The code is:
# filename: MYDATES
class cMyDates:
    def __init__(self, myList):
        self.Week_Start = myList[1]
        self.Week_End = myList[2]
        self.Week_Num = myList[3]
        self.Month_Num = myList[4]
        self.Month = myList[5]
        self.Year = myList[6]
        self.Day_Num = myList[7]

d_Date = {}  # <-- this is the dictionary of Class Objects

# open the file with all the Dates Data
myDateFile = "myDates.log"
f = open(myDateFile, "rb")

# parse the Data and add it to the Dictionary
for line in f:
    myList = line.replace(' ', '').split(',')
    k = myList[0]
    val = cMyDates(myList)
    d_Date[k] = val
The actual dates data, from the text file, are just long strings separated by commas (these strings are reduced in size for clarity, as is the class __init__):
2012-12-30, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 30, Sun
2012-12-31, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 31, Mon
In my main program I import this data:
import MYDATES as myDate
From here I can access my dictionary object like this:
myDate.d_Date
Everything works fine. My question: Is there a way to store this data inside the python code somehow, instead of in a separate text file? The program will always require this same information and it will never change. It's like a glorified static variable. I figured if I could keep this inside a .pyc file then perhaps it would run faster. Ok, before you jump on me about 'faster' or the amount of time it takes to read the external data and create the dictionary... It doesn't take long (about 0.00999 sec on average, I benchmarked it). The question is just for my edification - in case I need to do something like this again, but on a much larger scale where the time "might" matter.
I thought of storing the dates data in an array (coming from VB thinking) or List (Python) and just feeding it to the dictionary object, but it seems as though you can only .append to a List instead of giving it a predetermined size. Then I thought about creating a dictionary, or dictionaries, but that just seemed overwhelming considering the amount of data I have and the fact I would have to read thru these dictionaries to create another dictionary of Class Objects. It didn't seem right.
Can anybody suggest a different way to populate my dictionary of class objects besides storing the data in a separate text file and reading thru it in the code?

You can have list literals:
values = [1, 2, 3, 4]
Also a dictionary literal:
d = {'2012-12-30': cMyDates(['2012-12-30', '2012-12-30', '2013-01-05', 1, 12, 'Dec', 2012, 30, 'Sun']),
     '2012-12-31': cMyDates(['2012-12-31', '2012-12-30', '2013-01-05', 1, 12, 'Dec', 2012, 31, 'Mon'])}
You probably want a proper constructor for your class instead of passing a list:
class cMyDates:
    def __init__(self, Week_Start, Week_End, Week_Num, Month_Num, Month, Year, Day_Num):
        self.Week_Start = Week_Start
        self.Week_End = Week_End
        self.Week_Num = Week_Num
        self.Month_Num = Month_Num
        self.Month = Month
        self.Year = Year
        self.Day_Num = Day_Num
Then your literal can look like this, which is a lot nicer:
d = {'2012-12-30': cMyDates(Week_Start='2012-12-30',
                            Week_End='2013-01-05',
                            Week_Num=1,
                            Month_Num=12,
                            Month='Dec',
                            Year=2012,
                            Day_Num=30),
     '2012-12-31': cMyDates(Week_Start='2012-12-31',
                            Week_End='2013-01-05',
                            Week_Num=1,
                            Month_Num=12,
                            Month='Dec',
                            Year=2012,
                            Day_Num=31)}
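With the dictionary defined at module level like this, lookups in the main program work exactly as before, for example (assuming the same import alias as in the question):

import MYDATES as myDate

myDate.d_Date['2012-12-30'].Month    # 'Dec'
myDate.d_Date['2012-12-30'].Day_Num  # 30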

Sure - put the text in a long string (a triple-quoted string literal), starting with either ''' or """ and finishing with the same sequence, conveniently on a line of its own.
I use this mostly where I have some literal xml I want to parse, where xml is the original format so I don't want to parse-then-print-then-paste-into-python-file whenever it changes. Just doing a paste to replace the xml is much easier.
The long string looks like this:
dates='''2012-12-30, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 30, Sun
2012-12-31, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 31, Mon
'''
Obviously you will have to parse this out - if you use StringIO to open the string with a file-like interface, your parsing of it should be unchanged.
BTW if instead of doing a separate open, you use the with statement, closing the file is neatly handled, regardless of exceptions. BTW2, not sure why you are opening your text file "rb" - you should use "rt".
Revised code looks like this:
with open(myDateFile, "rt") as f:
    # parse the Data and add it to the Dictionary
    for line in f:
        myList = line.replace(' ', '').split(',')
        k = myList[0]
        val = cMyDates(myList)
        d_Date[k] = val
or, I think this should work (untested, it's late):
from io import StringIO

with StringIO(dates) as f:
    # parse the Data and add it to the Dictionary
    for line in f:
        myList = line.replace(' ', '').split(',')
        k = myList[0]
        val = cMyDates(myList)
        d_Date[k] = val

Related

Python - Get newest dict value where string = string

I have this code and it works. But I want to get two different files.
file_type returns either NP or KL. So I want to get the NP file with the max value and I want to get the KL file with the max value.
The dict looks like
{"Blah_Blah_NP_2022-11-01_003006.xlsx": "2022-03-11",
"Blah_Blah_KL_2022-11-01_003006.xlsx": "2022-03-11"}
This is my code and right now I am just getting the max date without regard to time. Since the date is formatted how it is and I don't care about time, I can just use max().
I'm having trouble expanding the below code to give me the greatest NP file and the greatest KL file. Again, file_type returns the NP or KL string from the file name.
file_dict = {}
file_path = Path(r'\\place\Report')
for file in file_path.iterdir():
    if file.is_file():
        path_object = Path(file)
        filename = path_object.name
        stem = path_object.stem
        file_type = stem.split("_")[2]
        file_date = stem.split("_")[3]
        file_dict.update({filename: file_date})
newest = max(file_dict, key=file_dict.get)
return newest
I basically want newest where file_type = NP and also newest where file_type = KL
You could filter the dictionary into two dictionaries (or however many you need if there are more types) and then get the max date for each of those.
But the whole operation can be done efficiently in only a few lines:
from pathlib import Path
from datetime import datetime

def get_newest():
    maxs = {}
    for file in Path(r'./examples').iterdir():
        if file.is_file():
            *_, t, d, _ = file.stem.split('_')
            d = datetime(*map(int, d.split('-')))
            maxs[t] = d if t not in maxs else max(d, maxs[t])
    return maxs

print(get_newest())
This:
collects the maximum date for each type into a dict maxs
loops over the files like you did (but in a location where I created some examples following your pattern)
only looks at the files, like your code
assumes the files all meet your pattern, and splits them over '_', only keeping the next to last part as the date and the part before it as the type
converts the date into a datetime object
keeps whichever is greater, the new date or a previously stored one (if any)
Result:
{'KL': datetime.datetime(2023, 11, 1, 0, 0), 'NP': datetime.datetime(2022, 11, 2, 0, 0)}
The files in the folder:
Blah_Blah_KL_2022-11-01_003006.txt
Blah_Blah_KL_2023-11-01_003006.txt
Blah_Blah_NP_2022-11-02_003051.txt
Blah_Blah_NP_2022-11-01_003006.txt
Blah_Blah_KL_2021-11-01_003006.txt
In the comments you asked
no idea how the above code is getting the diff file types and the max. Is it just looking for all the diff types in general? It's hard to know what each piece is with names like s, d, t, etc. Really lost on *_, t, d, _ = and also d = datetime(*map(int, d.split('-')))
That's a fair point, I prefer short names when I think the meaning is clear, but a descriptive name might have been better. t is for type (and type would be a bad name, shadowing type, so perhaps file_type). d is for date, or dt for datetime might have been better. I don't see s?
The *_, t, d, _ = is called 'extended iterable unpacking' (PEP 3132). It takes all the values from what follows and only keeps the 3rd-to-last and 2nd-to-last, as t and d respectively, throwing the rest away. The final _ takes up a position, but the underscore indicates we "don't care" about whatever lands in that position. The *_ similarly gobbles up all the values at the start.
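For example, with one of the file stems from above:

stem = "Blah_Blah_NP_2022-11-01_003006"
*_, t, d, _ = stem.split('_')
# *_ absorbs ['Blah', 'Blah'], t is 'NP', d is '2022-11-01',
# and the final _ takes '003006', which we ignore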
The d = datetime(*map(int, d.split('-'))) is best read from the inside out. d.split('-') just takes a date string like '2022-11-01' and splits it. The map(int, ...) that's applied to the result applies the int() function to every part of that result - so it turns ('2022', '11', '01') into (2022, 11, 1). The * in front of map() spreads the results as parameters to datetime - so, datetime(2022, 11, 1) would be called in this example.
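Spelled out step by step:

from datetime import datetime

d = '2022-11-01'
parts = d.split('-')        # ['2022', '11', '01']
numbers = map(int, parts)   # yields 2022, 11, 1
dt = datetime(*numbers)     # datetime(2022, 11, 1, 0, 0)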
This is what I both like and hate about Python - as you get better at it, there are very concise (and arguably beautiful - user #ArtemErmakov seems to agree) ways to write clean solutions. But they become hard to read unless you know most of the basics of the language. They're not easy to understand for a beginner, which is arguably a bad feature of a language.
To answer the broader question: since the loop takes each file, gets the type (like 'KL') from it and gets the date, it can then check the dictionary, add the date if the type is new, or if the type was already in the dictionary, update it with the maximum of the two, which is what this line does:
maxs[t] = d if t not in maxs else max(d, maxs[t])
I would recommend you keep asking questions - and whenever you see something like this code, try to break it down into all it small parts, and see what specific parts you don't understand. Python is a powerful language.
As a bonus, here is the same solution, but written a bit more clearly to show what is going on:
from pathlib import Path
from datetime import datetime

def get_newest_too():
    maximums = {}
    for file_path in Path(r'./examples').iterdir():
        if file_path.is_file():
            split_file = file_path.stem.split('_')
            file_type = split_file[-3]
            date_time_text = split_file[-2]
            date_time_parts = (int(part) for part in date_time_text.split('-'))
            date_time = datetime(*date_time_parts)  # spreading is just right here
            if file_type in maximums:
                maximums[file_type] = max(date_time, maximums[file_type])
            else:
                maximums[file_type] = date_time
    return maximums

print(get_newest_too())
Edit: From the comments, it became clear that you had trouble selecting the actual file of each specific type for which the date was the maximum for that type.
Here's how to do that:
from pathlib import Path
from datetime import datetime

def get_newest():
    maxs = {}
    for file in Path(r'./examples').iterdir():
        if file.is_file():
            *_, t, d, _ = file.stem.split('_')
            d = datetime(*map(int, d.split('-')))
            maxs[t] = (d, file) if t not in maxs else max((d, file), maxs[t])
    return {f: d for _, (d, f) in maxs.items()}

print(get_newest())
Result:
{WindowsPath('examples/Blah_Blah_KL_2023-11-01_003006.txt'): datetime.datetime(2023, 11, 1, 0, 0), WindowsPath('examples/Blah_Blah_NP_2022-11-02_003051.txt'): datetime.datetime(2022, 11, 2, 0, 0)}
You could construct another dict containing only the items you need:
file_dict_NP = {key:value for key, value in file_dict.items() if 'NP' in key}
And then do the same thing on it:
newest_NP = max(file_dict_NP, key=file_dict_NP.get)

Reading data from large text file with strings and floats in Python

I'm having trouble reading large amounts of data from a text file, and splitting and removing certain objects from it to get a more refined list. For example, let's say I have a text file, we'll call it 'data.txt', that has this data in it.
Some Header Here
Object Number = 1
Object Symbol = A
Mass of Object = 1
Weight of Object = 1.2040
Hight of Object = 0.394
Width of Object = 4.2304
Object Number = 2
Object Symbol = B
Mass Number = 2
Weight of Object = 1.596
Height of Object = 3.293
Width of Object = 4.654
.
.
. ...Same format continuing down
My problem is taking the data I need from this file. Let's say I'm only interested in the Object Number and Mass of Object, which repeats through the file, but with different numerical values. I need a list of this data. Example
Object Number Mass of Object
1 1
2 2
. .
. .
. .
etc.
With the headers excluded of course, as this data will be applied to an equation. I'm very new to Python, and don't have any knowledge of OOP. What would be the easiest way to do this? I know the basics of opening and writing to text files, even a little bit of using the split and strip functions. I've researched quite a bit on this site about sorting data, but I can't get it to work for me.
Try this:
object_number = []   # list of Object Number
mass_of_object = []  # list of Mass of Object

with open('data.txt') as f:
    for line in f:
        if line.startswith('Object Number'):
            object_number.append(int(line.split('=')[1]))
        elif line.startswith('Mass of Object'):
            mass_of_object.append(int(line.split('=')[1]))
In my opinion a dictionary (and its sub-classes) is more efficient than a group of lists for huge data input.
Moreover, my code doesn't need any modification if you need to extract new object data from your file.
from collections import defaultdict

checklist = ["Object Number", "Mass of Object"]
data = defaultdict()

with open("text.txt") as f:
    # iterating over the file allows
    # you to read it automatically one line at a time
    for line in f:
        for regmatch in checklist:
            if line.startswith(regmatch):
                # this is to erase newline characters
                val = line.rstrip()
                val = val.split(" = ")[1]
                data.setdefault(regmatch, []).append(val)

print data
This is the output:
defaultdict(None, {'Object Number': ['1', '2'], 'Mass of Object': ['1']})
Here is some theory about speed, here are some tips about performance optimization, and here is something about the dependency between the type of data and implementation efficiency.
Last, some examples about re (regular expression):
https://docs.python.org/2/howto/regex.html

Vector data from a file

I am just starting out with Python. I have some fortran and some Matlab skills, but I am by no means a coder. I need to post-process some output files.
I can't figure out how to read each value into the respective variable. The data looks something like this:
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
...
In my limited Matlab notation, I would like to assign the following variables of the form tN1(i,j) something like this:
tN1(1,1)=5097600; tN1(1,2)=5443200; tN2(1,3)=8467200; #time between 'h' and 'N#'
hmN1(1,1)=2348.13; hmN1(1,2)=2348.12; hmN2(1,3)=2348.11; #value in 2nd column
hsN1(1,1)=2348.35; hsN1(1,2)=2348.36; hsN2(1,3)=2348.39; #value in 3rd column
I will have about 30 sets, or tN1(1:30,1:j); hmN1(1:30,1:j);hsN1(1:30,1:j)
I know it may not seem like it, but I have been trying to figure this out for 2 days now. I am trying to learn this on my own and it seems I am missing something fundamental in my understanding of python.
I wrote a simple script which does what you ask. It creates three dictionaries, t, hm and hs, keyed by the N values.
import csv
import re

path = 'vector_data.txt'

# Using the <with func as obj> syntax handles the closing of the file for you.
with open(path) as in_file:
    # Use the csv package to read csv files
    csv_reader = csv.reader(in_file, delimiter=' ')

    # Create empty dictionaries to store the values
    t = dict()
    hm = dict()
    hs = dict()

    # Iterate over all rows
    for row in csv_reader:
        # Get the <n> and <t_i> values by using regular expressions, only
        # save the integer part (hence [1:] and [1:-1])
        n = int(re.findall('N[0-9]+', row[0])[0][1:])
        t_i = int(re.findall('h.+N', row[0])[0][1:-1])

        # Cast the other values to float
        hm_i = float(row[1])
        hs_i = float(row[2])

        # Try to append the values to an existing list in the dictionaries.
        # If that fails, new lists are added to the dictionaries.
        try:
            t[n].append(t_i)
            hm[n].append(hm_i)
            hs[n].append(hs_i)
        except KeyError:
            t[n] = [t_i]
            hm[n] = [hm_i]
            hs[n] = [hs_i]
Output:
>> t
{1: [5097600, 5443200], 2: [8467200]}
>> hm
{1: [2348.13, 2348.12], 2: [2348.11]}
>> hs
{1: [2348.35, 2348.36], 2: [2348.39]}
(remember that Python uses zero-indexing)
Thanks for all your comments. Suggested readings led to other things which helped. Here is what I came up with:
if len(line) >= 45:
    if line[0:45] == " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS":  # indicates data to follow, after 4 lines of junk text
        for i in range(0, 4):
            junk = file.readline()
        for i in range(0, int(nobs)):
            line = file.readline()
            sline = line.split()
            obsname.append(sline[0])
            hm.append(sline[1])
            hs.append(sline[2])

Filter large file using python, using contents of another

I have a ~1GB text file of data entries and another list of names that I would like to use to filter them. Running through every name for each entry will be terribly slow. What's the most efficient way of doing this in Python? Is it possible to use a hash table if the name is embedded in the entry? Can I make use of the fact that the name part is consistently placed?
Example files:
Entries file -- each part of the entry is separated by a tab, until the names
246 lalala name="Jack";surname="Smith"
1357 dedada name="Mary";surname="White"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
Names file -- each on a newline
Jack
Dan
Ryan
Desired output -- only entries with a name in the names file
246 lalala name="Jack";surname="Smith"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
You can use the set data structure to store the names — it offers efficient lookup but if the names list is very large then you may run into memory troubles.
The general idea is to iterate through all the names, adding them to a set, then checking if each name from each line from the data file is contained in the set. As the format of the entries doesn't vary, you should be able to extract the names with a simple regular expression.
If you run into troubles with the size of the names set, you can read n lines from the names file and repeat the process for each set of names, unless you require sorting.
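A minimal sketch of that idea (the file names here are placeholders for the question's entries and names files; the name="..." pattern follows the example entries above):

import re

# Placeholder file names for illustration.
name_pattern = re.compile(r'name="([^"]*)"')

with open('names.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}  # set gives O(1) membership tests

with open('entries.txt') as entries, open('filtered.txt', 'w') as out:
    for line in entries:
        match = name_pattern.search(line)
        if match and match.group(1) in wanted:
            out.write(line)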
My first instinct was to make a dictionary with names as keys, assuming that it was most efficient to look up the names using the hash of keys in the dictionary.
Given the answer by #rfw, which uses a set of names, I edited the code as below and tested it with both methods, a dict of names and a set.
I built a dummy dataset of over 40 M records and over 5400 names. Using this dataset, the set method consistently had the edge on my machine.
import re
from collections import Counter
import time

# names file downloaded from http://www.tucows.com/preview/520007
# the set contains over 5400 names
f = open('./names.txt', 'r')
names = [name.rstrip() for name in f.read().split(',')]
name_set = set(names)        # set of unique names
names_dict = Counter(names)  # Counter ~= dict of names with counts

# Expect: 246 lalala name="Jack";surname="Smith"
pattern = re.compile(r'.*\sname="([^"]*)"')

def select_rows_set():
    f = open('./data.txt', 'r')
    out_f = open('./data_out_set.txt', 'a')
    for record in f.readlines():
        name = pattern.match(record).groups()[0]
        if name in name_set:
            out_f.write(record)
    out_f.close()
    f.close()

def select_rows_dict():
    f = open('./data.txt', 'r')
    out_f = open('./data_out_dict.txt', 'a')
    for record in f.readlines():
        name = pattern.match(record).groups()[0]
        if name in names_dict:
            out_f.write(record)
    out_f.close()
    f.close()

if __name__ == '__main__':
    # One round to time the use of name_set
    t0 = time.time()
    select_rows_set()
    t1 = time.time()
    time_for_set = t1 - t0
    print 'Total set: ', time_for_set

    # One round to time the use of names_dict
    t0 = time.time()
    select_rows_dict()
    t1 = time.time()
    time_for_dict = t1 - t0
    print 'Total dict: ', time_for_dict
I assumed that a Counter, being at heart a dictionary, and easier to build from the dataset, does not add any overhead to the access time. Happy to be corrected if I am missing something.
Your data is clearly structured as a table so this may be applicable.
Data structure for maintaining tabular data in memory?
You could create a custom data structure with its own "search by name" function. That'd be a list of dictionaries of some sort. This should take less memory than the size of your text file, as it'll remove duplicate information you have on each line such as "name" and "surname", which would become dictionary keys. If you know a bit of SQL (very little is required here), an in-memory SQLite table along the lines of the linked question is another option.
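For example, a rough sketch of the list-of-dictionaries idea (the "id" and "tag" field names are made up for illustration, based on the example entries):

entries = [
    {"id": 246, "tag": "lalala", "name": "Jack", "surname": "Smith"},
    {"id": 1357, "tag": "dedada", "name": "Mary", "surname": "White"},
    {"id": 123456, "tag": "lala", "name": "Dan", "surname": "Brown"},
]

def search_by_name(rows, name):
    # Return every entry whose name field matches.
    return [row for row in rows if row["name"] == name]

print(search_by_name(entries, "Jack"))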

urllib.urlencode problems

I’m trying to write a python script that sends a query to TweetSentiments.com API.
The idea is that it will perform like this –
Read CSV tweet file > construct query > interrogate API > format JSON response > write to CSV file.
So far I’ve come up with this –
import csv
import urllib
import os

count = 0
TweetList = []  ## Creates empty list to store tweets.

TweetWriter = csv.writer(open('test.csv', 'w'), dialect='excel', delimiter=' ', quotechar='|')
TweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))

for rows in TweetReader:
    TweetList.append(rows)
#print TweetList[0]

for rows in TweetList:
    data = urllib.urlencode(TweetList[rows])
    connect = httplib.HTTPConnection("http://data.tweetsentiments.com:8080/api/analyze.json?q=")
    connect.result = json.load(urllib.request("POST", "", data))
    TweetWriter.write(result)
But when it's run I get “line 20, data = urllib.urlencode(TweetList[rows]) TypeError: list indices must be integers, not list”.
I know my list “TweetList” is storing the tweets just as I’d like, but I don’t think I’m using urllib.urlencode correctly. The API requires that queries are sent like –
http://data.tweetsentiments.com:8080/api/analyze.json?q= (text to analyze)
So the idea was that urllib.urlencode would simply add the tweets to the end of the address to allow a query.
The last four lines of code have become a mess after looking at so many examples. Your help would be much appreciated.
I'm not 100% sure what it is you're trying to do since I don't know the format of the files you are reading, but this part looks suspicious:
for rows in TweetList:
    data = urllib.urlencode(TweetList[rows])
since TweetList is a list, the for loop puts a single value from the list into rows on each iteration, so this, for example:
list = [1, 2, 3, 4]
for num in list:
    print num
will print 1 2 3 4. But this:
list = [1, 2, 3, 4]
for num in list:
    print list[num]
will end up with this error: IndexError: list index out of range.
Can you please elaborate a bit more about the format of the files you are reading?
Edit
If I understand you correctly, you need something like this:
tweets = []
tweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))

for row in tweetReader:
    tweets.append({'tweet': row[0], 'date': row[1]})

for row in tweets:
    data = urllib.urlencode(row)
    .....
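A rough sketch of how that loop might continue, using the Python 2 urllib from the question (the assumption here is that the API simply reads the text to analyze from the q parameter, as the URL above suggests):

import urllib  # Python 2 urllib, as in the question

base_url = "http://data.tweetsentiments.com:8080/api/analyze.json?"
for row in tweets:
    query = urllib.urlencode({'q': row['tweet']})        # q=<encoded tweet text>
    response = urllib.urlopen(base_url + query).read()   # fetch the analysis
    # ...decode the JSON in response and write it out with TweetWriter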
