Python - Get newest dict value where string = string - python

I have this code and it works. But I want to get two different files.
file_type returns either NP or KL. So I want to get the NP file with the max value and I want to get the KL file with the max value.
The dict looks like
{"Blah_Blah_NP_2022-11-01_003006.xlsx": "2022-03-11",
"Blah_Blah_KL_2022-11-01_003006.xlsx": "2022-03-11"}
This is my code and right now I am just getting the max date without regard to time. Since the date is formatted how it is and I don't care about time, I can just use max().
I'm having trouble expanding the below code to give me the greatest NP file and the greatest KL file. Again, file_type returns the NP or KL string from the file name.
from pathlib import Path

def get_newest():
    file_dict = {}
    file_path = Path(r'\\place\Report')
    for file in file_path.iterdir():
        if file.is_file():
            path_object = Path(file)
            filename = path_object.name
            stem = path_object.stem
            file_type = stem.split("_")[2]
            file_date = stem.split("_")[3]
            file_dict.update({filename: file_date})
    newest = max(file_dict, key=file_dict.get)
    return newest
I basically want newest where file_type = NP and also newest where file_type = KL

You could filter the dictionary into two dictionaries (or however many you need if there's more types) and then get the max date for any of those.
But the whole operation can be done efficiently in only few lines:
from pathlib import Path
from datetime import datetime

def get_newest():
    maxs = {}
    for file in Path(r'./examples').iterdir():
        if file.is_file():
            *_, t, d, _ = file.stem.split('_')
            d = datetime(*map(int, d.split('-')))
            maxs[t] = d if t not in maxs else max(d, maxs[t])
    return maxs

print(get_newest())
This:
collects the maximum date for each type into a dict maxs
loops over the files like you did (but in a location where I created some examples following your pattern)
only looks at the files, like your code
assumes the files all meet your pattern, and splits them over '_', only keeping the next to last part as the date and the part before it as the type
converts the date into a datetime object
keeps whichever is greater, the new date or a previously stored one (if any)
Result:
{'KL': datetime.datetime(2023, 11, 1, 0, 0), 'NP': datetime.datetime(2022, 11, 2, 0, 0)}
The files in the folder:
Blah_Blah_KL_2022-11-01_003006.txt
Blah_Blah_KL_2023-11-01_003006.txt
Blah_Blah_NP_2022-11-02_003051.txt
Blah_Blah_NP_2022-11-01_003006.txt
Blah_Blah_KL_2021-11-01_003006.txt
In the comments you asked
no idea how the above code is getting the diff file types and the max. Is it just looking for all the diff types in general? It's hard to know what each piece is with names like s, d, t, etc. Really lost on *_, t, d, _ = and also d = datetime(*map(int, d.split('-')))
That's a fair point, I prefer short names when I think the meaning is clear, but a descriptive name might have been better. t is for type (and type would be a bad name, shadowing type, so perhaps file_type). d is for date, or dt for datetime might have been better. I don't see s?
The *_, t, d, _ = is called 'extended tuple unpacking', it takes all the results from what follows and only keeps the 3rd and 2nd to last, as t and d respectively, and throws the rest away. The _ takes up a position, but the underscore indicates we "don't care" about whatever is in that position. And the *_ similarly gobbles up all values at the start, as explained in the linked PEP article.
The d = datetime(*map(int, d.split('-'))) is best read from the inside out. d.split('-') just takes a date string like '2022-11-01' and splits it. The map(int, ...) that's applied to the result applies the int() function to every part of that result - so it turns ('2022', '11', '01') into (2022, 11, 1). The * in front of map() spreads the results as parameters to datetime - so, datetime(2022, 11, 1) would be called in this example.
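The two lines explained above can be tried in isolation. This is a minimal sketch on a single hard-coded stem (taken from the example files), with longer names than t and d and with the map() result unpacked into named variables first, so each step is visible:

```python
from datetime import datetime

stem = "Blah_Blah_NP_2022-11-02_003051"

# Extended unpacking: *_ swallows the leading parts, the final _ drops the time.
*_, file_type, date_text, _ = stem.split("_")

# Inside out: split the text, turn each piece into an int, pass them to datetime().
year, month, day = map(int, date_text.split("-"))
file_date = datetime(year, month, day)

print(file_type, file_date)
```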
This is what I both like and hate about Python - as you get better at it, there are very concise (and arguably beautiful - user @ArtemErmakov seems to agree) ways to write clean solutions. But they become hard to read unless you know most of the basics of the language. They're not easy to understand for a beginner, which is arguably a bad feature of a language.
To answer the broader question: since the loop takes each file, gets the type (like 'KL') from it and gets the date, it can then check the dictionary, add the date if the type is new, or if the type was already in the dictionary, update it with the maximum of the two, which is what this line does:
maxs[t] = d if t not in maxs else max(d, maxs[t])
I would recommend you keep asking questions - and whenever you see something like this code, try to break it down into all its small parts, and see what specific parts you don't understand. Python is a powerful language.
As a bonus, here is the same solution, but written a bit more clearly to show what is going on:
from pathlib import Path
from datetime import datetime

def get_newest_too():
    maximums = {}
    for file_path in Path(r'./examples').iterdir():
        if file_path.is_file():
            split_file = file_path.stem.split('_')
            file_type = split_file[-3]
            date_time_text = split_file[-2]
            date_time_parts = (int(part) for part in date_time_text.split('-'))
            date_time = datetime(*date_time_parts)  # spreading is just right here
            if file_type in maximums:
                maximums[file_type] = max(date_time, maximums[file_type])
            else:
                maximums[file_type] = date_time
    return maximums

print(get_newest_too())
Edit: From the comments, it became clear that you had trouble selecting the actual file of each specific type for which the date was the maximum for that type.
Here's how to do that:
from pathlib import Path
from datetime import datetime

def get_newest():
    maxs = {}
    for file in Path(r'./examples').iterdir():
        if file.is_file():
            *_, t, d, _ = file.stem.split('_')
            d = datetime(*map(int, d.split('-')))
            maxs[t] = (d, file) if t not in maxs else max((d, file), maxs[t])
    return {f: d for _, (d, f) in maxs.items()}

print(get_newest())
Result:
{WindowsPath('examples/Blah_Blah_KL_2023-11-01_003006.txt'): datetime.datetime(2023, 11, 1, 0, 0), WindowsPath('examples/Blah_Blah_NP_2022-11-02_003051.txt'): datetime.datetime(2022, 11, 2, 0, 0)}
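If two files of the same type ever share a date, the trailing time part of the name can break the tie. The following is only a sketch under that assumption: it works on a plain list of names instead of a folder, the helper name newest_by_type is made up for illustration, and it assumes the last part of the stem is always a six-digit HHMMSS time, which the question does not confirm.

```python
from datetime import datetime

def newest_by_type(filenames):
    """Return {file_type: newest file name}, comparing date and time together."""
    newest = {}
    for name in filenames:
        stem = name.rsplit(".", 1)[0]  # drop the extension
        *_, file_type, date_text, time_text = stem.split("_")
        # Assumption: the final part is HHMMSS, e.g. 003051
        stamp = datetime.strptime(f"{date_text}_{time_text}", "%Y-%m-%d_%H%M%S")
        if file_type not in newest or stamp > newest[file_type][0]:
            newest[file_type] = (stamp, name)
    return {file_type: name for file_type, (stamp, name) in newest.items()}

names = [
    "Blah_Blah_KL_2022-11-01_003006.txt",
    "Blah_Blah_KL_2023-11-01_003006.txt",
    "Blah_Blah_NP_2022-11-02_003051.txt",
    "Blah_Blah_NP_2022-11-01_003006.txt",
    "Blah_Blah_KL_2021-11-01_003006.txt",
]
result = newest_by_type(names)
print(result)
```

Keying the result by type rather than by path makes it easy to ask for result['NP'] or result['KL'] directly.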

You could construct another dict containing only the items you need:
file_dict_NP = {key:value for key, value in file_dict.items() if 'NP' in key}
And then do the same thing on it:
newest_NP = max(file_dict_NP, key=file_dict_NP.get)
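To avoid repeating those two lines for every type, the same filter can go in a loop. A minimal sketch, assuming the type always appears between underscores in the name; the sample dict and the function name newest_for_each_type are invented for the example:

```python
# Hypothetical sample data in the same shape as the question's dict
file_dict = {
    "Blah_Blah_NP_2022-11-01_003006.xlsx": "2022-11-01",
    "Blah_Blah_NP_2022-11-02_003051.xlsx": "2022-11-02",
    "Blah_Blah_KL_2022-11-01_003006.xlsx": "2022-11-01",
}

def newest_for_each_type(file_dict, file_types=("NP", "KL")):
    newest = {}
    for file_type in file_types:
        # Keep only the entries whose name contains e.g. "_NP_"
        subset = {name: date for name, date in file_dict.items()
                  if f"_{file_type}_" in name}
        if subset:  # max() raises ValueError on an empty dict
            newest[file_type] = max(subset, key=subset.get)
    return newest

newest = newest_for_each_type(file_dict)
print(newest)
```

Matching on "_NP_" rather than just "NP" is a small safeguard against names that happen to contain those letters elsewhere.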

Related

Pandas Styler.to_latex() - how to pass commands and do simple editing

How do I pass the following commands into the latex environment?
\centering (I need landscape tables to be centered)
and
\caption* (I need to skip for a panel the table numbering)
In addition, I would need to add parentheses and asterisks to the t-statistics, meaning row-specific formatting on the dataframes.
For example:
Current:

variable             value
const                2.439628
t stat               13.921319
FamFirm              0.114914
t stat               0.351283
founder              0.154914
t stat               2.351283
Adjusted R Square    0.291328

I want this:

variable             value
const                2.439628
t stat               (13.921319)***
FamFirm              0.114914
t stat               (0.351283)
founder              0.154914
t stat               (1.651283)**
Adjusted R Square    0.291328
I'm doing my research papers in DataSpell. All empirical work is in Python, and then I use Latex (TexiFy) to create the pdf within DataSpell. Due to this workflow, I can't edit tables in latex code while they get overwritten every time I run the jupyter notebook.
In case it helps, here's an example of how I pass a table to the latex environment:
# drop index to column
panel_a.reset_index(inplace=True)
# write Latex index and cut names to appropriate length
ind_list = [
    "ageFirm",
    "meanAgeF",
    "lnAssets",
    "bsVol",
    "roa",
    "fndrCeo",
    "lnQ",
    "sic",
    "hightech",
    "nonFndrFam"
]
# assign the list of values to the column
panel_a["index"] = ind_list
# format column names
header = ["", "count", "mean", "std", "min", "25%", "50%", "75%", "max"]
panel_a.columns = header
with open(
    os.path.join(r"/.../tables/panel_a.tex"), "w"
) as tf:
    tf.write(
        panel_a
        .style
        .format(precision=3)
        .format_index(escape="latex", axis=1)
        .hide(level=0, axis=0)
        .to_latex(
            caption="Panel A: Summary Statistics for the Full Sample",
            label="tab:table_label",
            hrules=True,
        ))
You're asking three questions in one. I think I can do you two out of three (I hear that "ain't bad").
How to pass \centering to the LaTeX env using Styler.to_latex?
Use the position_float parameter. Simplified:
df.style.to_latex(position_float='centering')
How to pass \caption*?
This one I don't know. Perhaps useful: Why is caption not working.
How to apply row-specific formatting?
This one's a little tricky. Let me give an example of how I would normally do this:
df = pd.DataFrame({'a': ['some_var', 't stat'], 'b': [1.01235, 2.01235]})
df.style.format({'a': str,
                 'b': lambda x: "{:.3f}".format(x) if x < 2 else '({:.3f})***'.format(x)})
Result:
You can see from this example that style.format accepts a callable (here nested inside a dict, but you could also do: .format(func, subset='value')). So, this is great if each value itself is evaluated (x < 2).
The problem in your case is that the evaluation is over some other value, namely a (not supplied) P value combined with panel_a['variable'] == 't stat'. Now, assuming you have those P values in a different column, I suggest you create a for loop to populate a list that becomes like this:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
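The loop that builds such a list could look roughly like the sketch below. It is pure Python so it can be checked without pandas. The p-values are made up, and the star thresholds (0.01, 0.05, 0.10) are only the common convention, not something stated in the question; adjust both to your data.

```python
def star_suffix(p_value):
    """Significance stars; the thresholds are an assumed convention."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.10:
        return "*"
    return ""

def build_fmt_list(variables, p_values):
    fmt_list = []
    for variable, p_value in zip(variables, p_values):
        if variable == "t stat":
            # t statistics get parentheses plus stars
            fmt_list.append("({:.3f})" + star_suffix(p_value))
        else:
            fmt_list.append("{:.3f}")
    return fmt_list

variables = ["const", "t stat", "FamFirm", "t stat",
             "founder", "t stat", "Adjusted R Square"]
# Hypothetical p-values; None where the row is not a t statistic
p_values = [None, 0.001, None, 0.73, None, 0.002, None]

fmt_list = build_fmt_list(variables, p_values)
print(fmt_list)
```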
Now, we can apply a function to df.style.format, and pop/select from the list like so:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

panel_a.style.format({'variable': str, 'value': func})
Result:
This solution is admittedly a bit "hacky", since modifying a globally declared list inside a function is far from good practice; e.g. if you modify the list again before calling func, its functionality is unlikely to result in the expected behaviour or worse, it may throw an error that is difficult to track down. I'm not sure how to remedy this other than simply turning all the floats into strings in panel_a.value inplace. In that case, of course, you don't need .format anymore, but it will alter your df and that's also not ideal. I guess you could make a copy first (df2 = df.copy()), but that will affect memory.
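One possible way to soften the global-list problem is to wrap the formats in a closure, so that each formatter walks its own private copy and the original list is never modified. This is a sketch, not a tested Styler recipe: the name make_sequential_formatter is invented here, and it still relies on the formatter being called exactly once per cell, in row order, just like the pop() version does.

```python
def make_sequential_formatter(formats):
    """Return a formatter that applies the given formats one after another."""
    remaining = iter(list(formats))  # private copy; the caller's list is untouched

    def format_next(value):
        return next(remaining).format(value)

    return format_next

fmt_list = ['{:.3f}', '({:.3f})***']
func = make_sequential_formatter(fmt_list)

first = func(2.439628)
second = func(13.921319)
print(first, second)
```

Building a fresh formatter right before each render keeps repeated notebook runs from starting with a half-consumed list.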
Anyway, hope this helps. So, in full you add this as follows to your code:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

with open(fname, "w") as tf:
    tf.write(
        panel_a
        .style
        .format({'variable': str, 'value': func})
        ...
        .to_latex(
            ...
            position_float='centering'
        ))

Python - can text file data be stored in code?

I'm writing a program that requires lots of Date lookups (Fiscal Year, Month, Week). To simplify the lookups I created a Dictionary where the Key is a date (used for the lookup) and the Value is a Class Object. I put the class def and the code to read the dates data (a .txt file) in separate file, not the main file. BTW, this is not a question about Date objects.
The code is:
# filename: MYDATES
class cMyDates:
    def __init__(self, myList):
        self.Week_Start = myList[1]
        self.Week_End = myList[2]
        self.Week_Num = myList[3]
        self.Month_Num = myList[4]
        self.Month = myList[5]
        self.Year = myList[6]
        self.Day_Num = myList[7]

d_Date = {}  # <-- this is the dictionary of Class Objects
# open the file with all the Dates Data
myDateFile = "myDates.log"
f = open(myDateFile, "rb")
# parse the Data and add it to the Dictionary
for line in f:
    myList = line.replace(' ','').split(',')
    k = myList[0]
    val = cMyDates(myList)
    d_Date[k] = val
The actual dates data, from the text file, are just long comma-separated strings (these strings are shortened for clarity, as is the class's __init__):
2012-12-30, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 30, Sun
2012-12-31, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 31, Mon
In my main program I import this data:
import MYDATES as myDate
From here I can access my dictionary object like this:
myDate.d_Date
Everything works fine. My question: Is there a way to store this data inside the python code somehow, instead of in a separate text file? The program will always require this same information and it will never change. It's like a glorified static variable. I figured if I could keep this inside a .pyc file then perhaps it would run faster. Ok, before you jump on me about 'faster' or the amount of time it takes to read the external data and create the dictionary... It doesn't take long (about 0.00999 sec on average, I benchmarked it). The question is just for my edification - in case I need to do something like this again, but on a much larger scale where the time "might" matter.
I thought of storing the dates data in an array (coming from VB thinking) or List (Python) and just feeding it to the dictionary object, but it seems as though you can only .append to a List instead of giving it a predetermined size. Then I thought about creating a dictionary, or dictionaries, but that just seemed overwhelming considering the amount of data I have and the fact I would have to read thru these dictionaries to create another dictionary of Class Objects. It didn't seem right.
Can anybody suggest a different way to populate my dictionary of class objects besides storing the data in a separate text file and reading thru it in the code?
You can have list literals:
values = [1, 2, 3, 4]
Also a dictionary literal:
d = {'2012-12-30': cMyDates(['2012-12-30', '2012-12-30', '2013-01-05', 1, 12, 'Dec', 2012, 30, 'Sun']),
'2012-12-31': cMyDates(['2012-12-31', '2012-12-30', '2013-01-05', 1, 12, 'Dec', 2012, 31, 'Mon'])}
You probably want a proper constructor for your class instead of passing a list:
class cMyDates:
    def __init__(self, Week_Start, Week_End, Week_Num, Month_Num, Month, Year, Day_Num):
        self.Week_Start = Week_Start
        self.Week_End = Week_End
        self.Week_Num = Week_Num
        self.Month_Num = Month_Num
        self.Month = Month
        self.Year = Year
        self.Day_Num = Day_Num

Then your literal can look like this, which is a lot nicer:
d = {'2012-12-30': cMyDates(Week_Start='2012-12-30',
                            Week_End='2013-01-05',
                            Week_Num=1,
                            Month_Num=12,
                            Month='Dec',
                            Year=2012,
                            Day_Num=30),
     '2012-12-31': cMyDates(Week_Start='2012-12-30',
                            Week_End='2013-01-05',
                            Week_Num=1,
                            Month_Num=12,
                            Month='Dec',
                            Year=2012,
                            Day_Num=31)}
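If writing one constructor call per date feels too verbose for a large table, a middle ground is to keep the raw rows in a tuple literal and build the dictionary with a comprehension. A rough sketch, using the keyword-style constructor from above; the name _ROWS is invented, and the weekday column from the text file is left out because the class has no field for it:

```python
class cMyDates:
    def __init__(self, Week_Start, Week_End, Week_Num, Month_Num, Month, Year, Day_Num):
        self.Week_Start = Week_Start
        self.Week_End = Week_End
        self.Week_Num = Week_Num
        self.Month_Num = Month_Num
        self.Month = Month
        self.Year = Year
        self.Day_Num = Day_Num

# One tuple per date: lookup key first, then the constructor arguments in order.
_ROWS = (
    ("2012-12-30", "2012-12-30", "2013-01-05", 1, 12, "Dec", 2012, 30),
    ("2012-12-31", "2012-12-30", "2013-01-05", 1, 12, "Dec", 2012, 31),
)

# key, *fields splits each row into the key and the remaining values
d_Date = {key: cMyDates(*fields) for key, *fields in _ROWS}

print(d_Date["2012-12-31"].Week_Start, d_Date["2012-12-31"].Day_Num)
```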
Sure - put the text in a long (triple-quoted) string: it starts with either ''' or """, ends with the same sequence, and may span as many lines as you like.
I use this mostly where I have some literal xml I want to parse, where xml is the original format so I don't want to parse-then-print-then-paste-into-python-file whenever it changes. Just doing a paste to replace the xml is much easier.
A long string looks like this:
dates='''2012-12-30, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 30, Sun
2012-12-31, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 31, Mon
'''
Obviously you will have to parse this out - if you use StringIO to open the string with a file-like interface, your parsing of it should be unchanged.
BTW if instead of doing a separate open, you use the with statement, closing the file is neatly handled, regardless of exceptions. BTW2, not sure why you are opening your text file "rb" - you should use "rt".
Revised code looks like this:
with open(myDateFile, "rt") as f:
    # parse the Data and add it to the Dictionary
    for line in f:
        myList = line.replace(' ','').split(',')
        k = myList[0]
        val = cMyDates(myList)
        d_Date[k] = val
or, I think this should work (untested, it's late):
from io import StringIO

with StringIO(dates) as f:
    # parse the Data and add it to the Dictionary
    for line in f:
        myList = line.replace(' ','').split(',')
        k = myList[0]
        val = cMyDates(myList)
        d_Date[k] = val
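Put together, a self-contained version of the idea might look like this. It is a sketch rather than a drop-in replacement: it keeps the list-based class from the question, adds a strip() so the trailing newline does not end up in the last field, and wraps the loop in a function (load_dates is a made-up name) so nothing runs at import time unless you call it.

```python
from io import StringIO

DATES = """2012-12-30, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 30, Sun
2012-12-31, 2012-12-30, 2013-01-05, 1, 12, Dec, 2012, 31, Mon
"""

class cMyDates:
    def __init__(self, myList):
        self.Week_Start = myList[1]
        self.Week_End = myList[2]
        self.Week_Num = myList[3]
        self.Month_Num = myList[4]
        self.Month = myList[5]
        self.Year = myList[6]
        self.Day_Num = myList[7]

def load_dates(text):
    d_Date = {}
    with StringIO(text) as f:  # file-like view of the embedded string
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            myList = line.strip().replace(" ", "").split(",")
            d_Date[myList[0]] = cMyDates(myList)
    return d_Date

d_Date = load_dates(DATES)
print(sorted(d_Date))
```

Note that every field stays a string here, exactly as it would when read from the text file.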

Python: Joining and writing (XML.etrees) trees stored in a list

I'm looping over some XML files and producing trees that I would like to store in a defaultdict(list) type. With each loop and the next child found will be stored in a separate part of the dictionary.
d = defaultdict(list)
counter = 0
for child in root.findall(something):
    tree = ET.ElementTree(something)
    d[int(x)].append(tree)
    counter += 1
So then repeating this for several files would result in nicely indexed results; a set of trees that were in position 1 across different parsed files and so on. The question is, how do I then join all of d, and write the trees (as a cumulative tree) to a file?
I can loop through the dict to get each tree:
for x in d:
    for y in d[x]:
        print(y)
This gives a complete list of trees that were in my dict. Now, how do I produce one massive tree from this?
Sample input file 1
Sample input file 2
Required results from 1&2
Given the apparent difficulty in doing this, I'm happy to accept more general answers that show how I can otherwise get the result I am looking for from two or more files.
Use Spyne:
from spyne.model.primitive import *
from spyne.model.complex import *

class GpsInfo(ComplexModel):
    UTC = DateTime
    Latitude = Double
    Longitude = Double
    DopplerTime = Double
    Quality = Unicode
    HDOP = Unicode
    Altitude = Double
    Speed = Double
    Heading = Double
    Estimated = Boolean

class Header(ComplexModel):
    Name = Unicode
    Time = DateTime
    SeqNo = Integer

class CTrailData(ComplexModel):
    index = UnsignedInteger
    gpsInfo = GpsInfo
    Header = Header

class CTrail(ComplexModel):
    LastError = AnyXml
    MaxTrial = Integer
    Trail = Array(CTrailData)

from lxml import etree
from spyne.util.xml import *

file_1 = get_xml_as_object(etree.fromstring(open('file1').read()), CTrail)
file_2 = get_xml_as_object(etree.fromstring(open('file2').read()), CTrail)
file_1.Trail.extend(file_2.Trail)
file_1.Trail.sort(key=lambda x: x.index)
elt = get_object_as_xml(file_1, no_namespace=True)
print(etree.tostring(elt, pretty_print=True).decode())
While doing this, Spyne also converts the data fields from string to their native Python formats as well, so it'll be much easier for you to work with the data from this xml document.
Also, if you don't mind using the latest version from git, you can do e.g.:
class GpsInfo(ComplexModel):
    # (...)
    doppler_time = Double(sub_name="DopplerTime")
    # (...)
so that you can get data from the CamelCased tags without having to violate PEP8.
Use lxml.objectify:
from lxml import etree, objectify
obj_1 = objectify.fromstring(open('file1').read())
obj_2 = objectify.fromstring(open('file2').read())
obj_1.Trail.CTrailData.extend(obj_2.Trail.CTrailData)
# .sort() won't work as objectify's lists are not regular python lists.
obj_1.Trail.CTrailData = sorted(obj_1.Trail.CTrailData, key=lambda x: x.index)
print(etree.tostring(obj_1, pretty_print=True).decode())
It doesn't do the additional conversion work that the Spyne variant does, but for your use case, that might be enough.
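If you would rather stay with the standard library, the same merge can be sketched with xml.etree.ElementTree alone. The sample inputs were not preserved in the question, so the two documents below are invented, and the CTrail/Trail/CTrailData/index layout is an assumption taken from the class names in the Spyne answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs; only the element names follow the answers above.
FILE_1 = ("<CTrail><Trail>"
          "<CTrailData><index>0</index></CTrailData>"
          "<CTrailData><index>2</index></CTrailData>"
          "</Trail></CTrail>")
FILE_2 = ("<CTrail><Trail>"
          "<CTrailData><index>1</index></CTrailData>"
          "<CTrailData><index>3</index></CTrailData>"
          "</Trail></CTrail>")

def merge_trails(xml_texts):
    """Append every document's CTrailData elements to the first document."""
    roots = [ET.fromstring(text) for text in xml_texts]
    merged = roots[0]
    trail = merged.find("Trail")
    for other in roots[1:]:
        trail.extend(list(other.find("Trail")))
    # Reorder the children in place by their numeric <index>
    trail[:] = sorted(trail, key=lambda elem: int(elem.findtext("index")))
    return merged

merged = merge_trails([FILE_1, FILE_2])
indexes = [int(elem.findtext("index")) for elem in merged.find("Trail")]
print(indexes)
print(ET.tostring(merged, encoding="unicode"))
```

To write the result to disk, ET.ElementTree(merged).write(path) does the job; the values stay plain strings, as with lxml.objectify.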

Python: How do I iterate over several files with similar names (the variation in each name is the date)?

I wrote a program that filters files containing tweets to pull location and time from specific ones. Each file contains one day's worth of tweets.
I would like to run this program over one year's worth of tweets, which would involve iterating over 365 files with names like this: 2011-**-**.tweets.dat.gz, with the stars representing numbers that complete the file name to make it a date for each day in the year.
Basically, I'm looking for code that will loop over 2011-01-01.tweets.dat.gz, 2011-01-02.tweets.dat.gz, ..., all the way through 2011-12-31.tweets.dat.gz.
What I'm imagining now is somehow telling the program to loop over all files with the name 2011-*.tweets.dat.gz, but I'm not sure exactly how that would work or how to structure it, or even if the * syntax is correct.
Any tips?
Easiest way is indeed with a glob:
from glob import iglob

for pathname in iglob("/path/to/folder/2011-*.tweets.dat.gz"):
    print(pathname)  # or do whatever
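One thing to keep in mind is that glob makes no promise about the order of its results. Because the names start with ISO-style dates, sorting them alphabetically also sorts them chronologically. The sketch below demonstrates this in a temporary folder with a few invented files; daily_files is a made-up helper name, and the stricter [0-9][0-9] pattern is an optional extra that skips names that merely start with 2011-.

```python
import gzip
import tempfile
from pathlib import Path

def daily_files(folder, year=2011):
    """Return that year's daily tweet files, oldest first."""
    pattern = f"{year}-[0-9][0-9]-[0-9][0-9].tweets.dat.gz"
    return sorted(Path(folder).glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    # Create a few example files, deliberately out of order
    for day in ("2011-01-02", "2011-01-01", "2011-12-31"):
        with gzip.open(Path(tmp) / f"{day}.tweets.dat.gz", "wt") as handle:
            handle.write("example tweet\n")
    (Path(tmp) / "2011-notes.tweets.dat.gz").touch()  # should not match

    names = []
    for path in daily_files(tmp):
        with gzip.open(path, "rt") as handle:  # read the compressed text
            handle.read()
        names.append(path.name)

print(names)
```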
Use the datetime module:
>>> from datetime import datetime,timedelta
>>> d = datetime(2011,1,1)
>>> while d < datetime(2012, 1, 1):
...     filename = "{}{}".format(d.strftime("%Y-%m-%d"), '.tweets.dat.gz')
...     print(filename)
...     d = d + timedelta(days=1)
...
2011-01-01.tweets.dat.gz
2011-01-02.tweets.dat.gz
2011-01-03.tweets.dat.gz
2011-01-04.tweets.dat.gz
2011-01-05.tweets.dat.gz
2011-01-06.tweets.dat.gz
2011-01-07.tweets.dat.gz
2011-01-08.tweets.dat.gz
2011-01-09.tweets.dat.gz
2011-01-10.tweets.dat.gz
...
...
2011-12-27.tweets.dat.gz
2011-12-28.tweets.dat.gz
2011-12-29.tweets.dat.gz
2011-12-30.tweets.dat.gz
2011-12-31.tweets.dat.gz

basic python vlookup equivalent

I'm looking for the equivalent to the vlookup function in excel. I have a script where I read in a csv file. I would like to be able to query an associated value from another column in the .csv. Script so far:
import matplotlib
import matplotlib.mlab as mlab
import glob

for files in glob.glob("*.csv"):
    print(files)
    r = mlab.csv2rec(files)
    r.cols = r.dtype.names
    depVar = r[r.cols[0]]
    indVar = r[r.cols[1]]
    print(indVar)
This will read in from .csv files in the same folder the script is in. In the above example depVar is the first column in the .csv, and indVar is the second column. In my case, I know a value for indVar, and I want to return the associated value for depVar. I'd like to add a command like:
depVar = r[r.cols[0]]
indVar = r[r.cols[1]]
print indVar
depVarAt5 = lookup value in depVar where indVar = 5 (I could sub in things for the 5 later)
In my case, all values in all fields are numbers and all of the values of indVar are unique. I want to be able to define a new variable (depVarAt5 in last example) equal to the associated value.
Here's example .csv contents, name the file anything and place it in same folder as script. In this example, depVarAt5 should be set equal to 16.1309.
Temp,Depth
16.1309,5
16.1476,94.4007
16.2488,100.552
16.4232,106.573
16.4637,112.796
16.478,118.696
16.4961,124.925
16.5105,131.101
16.5462,137.325
16.7016,143.186
16.8575,149.101
16.9369,155.148
17.0462,161.187
I think this solves your problem quite directly:
import numpy
import glob

for f in glob.glob("*.csv"):
    print(f)
    r = numpy.recfromcsv(f)
    print(numpy.interp(5, r.depth, r.temp))
I'm pretty sure numpy is a prerequisite for matplotlib.
Not sure what that r object is, but since it has a member called cols, I'm going to assume it also has a member called rows which contains the row data.
>>> r.rows
[[16.1309, 5], [16.1476, 94.4007], ...]
In that case, your pseudocode very nearly contains a valid generator expression/list comprehension.
depVarAt5 = lookup value in depVar where indVar = 5 (I could sub in things for the 5 later)
becomes
depVarAt5 = [row[0] for row in r.rows if row[1] == 5]
Or, more generally
depVarValue = [row[depVarColIndex] for row in r.rows if row[indVarColIndex] == searchValue]
so
def vlookup(rows, searchColumn, dataColumn, searchValue):
return [row[dataColumn] for row in rows if row[searchColumn] == searchValue]
Throw a [0] on the end of that if you can guarantee there will be exactly one output per input.
There's also a csv module in the Python standard library which you might prefer to work with. =)
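For what it's worth, an exact-match lookup with the csv module could be sketched as follows. The data is the first few rows of the question's sample, read from a string so the example runs without a file; the function signature is adapted to column names rather than indexes, which is a choice made here, not part of the answer above.

```python
import csv
from io import StringIO

CSV_TEXT = """Temp,Depth
16.1309,5
16.1476,94.4007
16.2488,100.552
"""

def vlookup(csv_file, search_column, data_column, search_value):
    """Return the first data_column value whose search_column equals search_value."""
    for row in csv.DictReader(csv_file):
        if float(row[search_column]) == search_value:
            return float(row[data_column])
    return None  # no match found

depVarAt5 = vlookup(StringIO(CSV_TEXT), "Depth", "Temp", 5)
print(depVarAt5)
```

With a real file, pass open("yourfile.csv", newline="") instead of the StringIO object.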
For arbitrary orderings and exact matches you can use indVar.index() and index depVar with the returned index.
If indVar is ordered and (well, "or", sort of) you need closest match then you should look at using bisect on indVar.
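A closest-match lookup with bisect might look like the sketch below. It assumes plain Python lists with the lookup column sorted in ascending order; the function name closest_lookup and the tie-breaking rule (prefer the lower neighbour) are choices made for this example.

```python
from bisect import bisect_left

def closest_lookup(ind_values, dep_values, target):
    """Return the dep value whose ind value is nearest to target.

    ind_values must be sorted in ascending order.
    """
    position = bisect_left(ind_values, target)
    if position == 0:
        return dep_values[0]
    if position == len(ind_values):
        return dep_values[-1]
    before, after = ind_values[position - 1], ind_values[position]
    # Pick whichever neighbour is nearer; ties go to the lower one
    if after - target < target - before:
        return dep_values[position]
    return dep_values[position - 1]

depth = [5, 94.4007, 100.552, 106.573]
temp = [16.1309, 16.1476, 16.2488, 16.4232]

exact = closest_lookup(depth, temp, 5)
nearest = closest_lookup(depth, temp, 99)
print(exact, nearest)
```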
