I have a .pd file called 'missing' on my computer. The path is C:\me\Desktop\missing.pd
Inside this file there is just dates. I have an algo which create and populate this 'missing.pd' with dates everytime I run it. My algo basically create a dataframe with inside some dates, sometime empty and then create the missing.pd file on my computer and add the dates.
What I am trying to do is to not recreate everytime the missing.pd file (that's what my code do so far).
I want to say to my code :
if C:\me\Desktop\missing.pd exist, then check inside if the dates of my created dataframe are already here, if no add the ones which are not already here, if missing.pd do not exist, create it and fill it with the dates.
so far for this part of the code, it is :
path = r"C:\me\Desktop\missing.pd"
missing = pd.DataFrame(missing)
missing.to_pickle( os.path.join(path,"%s_missing.pd"%(country)))
You can use os.path.isfile(filename) to check if the file exists. Documentation here.
import os.path
if os.path.isfile(path):
"""Your date checking code here."""
Related
I have some CSV files with its extension which keeps changes all the time.
Active_Count_1618861363072
Deposit_1618861402104
Game_Type_Wise_Net_Sell_1618861383176
Total_Count_1618861351976
I want to read these files automatically
df1=pd.read_csv('Active_count_'.csv)
df2=pd.read_csv('Deposit_'.csv)
df3=pd.read_csv('Game_Type_Wise_Net_Sell_'.csv)
df4=pd.read_csv('Total_Count_'.csv)
I want this in such a way that I want to keep after the underscore dynamic and load the CSV files.
Is there a way I can achieve this?
This can be achieved outside Pandas using only standard Python functionality:
import glob
active_count_filename = glob.glob('Active_Count_*.csv')[0]
df1 = pd.read_csv(active_count_filename)
This assumes that there is exactly one Active_count_* file - if none exists, it will throw an error, if more than one exists, one will be chosen randomly.
I'm trying incrementally to build a financial statement database. The first steps center around collecting 10-Ks from the SEC's EDGAR database. I have code for pulling the relevant 8-Ks, 10-Ks, and 10-Qs by CIK number and accession number, and retrieving the relevant excel spreadsheet. The code below is now centering on trying to create a folder within a directory, then name the folder with the CIK code, then pull the spreadsheet from the EDGAR database, and save the spreadsheet to the folder with the CIK code. My example is a csv file I'm calling "accessionnumtest.csv", which has headings:
company_name,report_type,cik,date,cik_accession
and data:
4Less Group, Inc.,10K/A,1438901,11/27/2019,edgar/data/1438901/000121390019024801.txt
AB INTERNATIONAL GROUP CORP.,10K,1605331,10/22/2019,edgar/data/1605331/000166357719000384.txt
ABM INDUSTRIES INC /DE/,10K,771497,12/20/2019,edgar/data/771497/000162828019015259.txt
ACTUANT CORP,10K,6955,10/29/2019,edgar/data/6955/000000695519000033.txt
my code is below
import os
import pandas as pd
path = os.getcwd()
folder_path = "C:/metricdatadb/"
df = pd.read_csv("accessionnumtest.csv")
folder_name = df['cik']
print(folder_name)
for row in df.iterrows():
dir = df.path.join(folder_path, folder_name)
os.makedirs(dir)
This code is giving me, AttributeError: 'DataFrame' object has no attribute 'path' error. I have renamed the path, checked for whitespace in the headers. Any suggestions are appreciated.
Regarding the error: os.path.join. Not pd.path.join. You are calling the wrong module.
That being said, your code is not doing what you are trying to do regardless of the error. folder_name will not update for each row. You could do row.cik to get the value for each iterrows()
dir = os.path.join(folder_path, row.cik)
It is relatively unclear what you're working towards accomplishing, particularly with .csv files and Pandas. The code you have seems to have a lot of curious errors in it, which I think might be ameliorated by going back to learn some of the more simple Python concepts before trying something as difficult as web-scraping. Note I don't mean to give up, rather than building up the fundamentals is a necessary step in this type of project.
That said, if I'm understanding your intent correctly, you want to create a file hierarchy for 10-K, 10-Q, etc. filings for several CIKs.
There shouldn't be any need to use .csv files, or pandas for this.
Probably the simplest way to do this would be to do it in the same step you download them.
Pseudocode for this would be as follows:
for cik in list_of_ciks:
first_file = find_first_file_online();
if first_file is 10-K:
save_to_10-K folder for CIK
if first_file is 10-Q:
save_to_10-Q folder for CIK
As I said above, you can skip the .csv file (Also, note that CSV stands for "comma-separated-value." Some of the entries in your data contain commas, e.g. "4Less Group, Inc." This is incompatible with a CSV file, as it will split the single entry into two columns on the comma, shifting all of your data one column).
When you process the data, you'll want to build the folders as you go.
When you iterate through a new CIK, create the master folder for that CIK. When you encounter a 10-K, create a folder for 10-K's and save it with a unique name. Since you need to use the accession numbers to get the excel sheets, that wouldn't be a bad naming convention to follow.
It would be something like this:
import requests
import pathlib
cik_list = [cik_1, cik_2... cik_n]
for cik in cik_list:
file = requests.get("cik/accession/Report.xlsx").data
with open(pathlib.Path(cik, report_type, accession_number + ".xlsx", "wb")) as excel_file:
excel_file.write(file.data)
The above code will definitely not run, and does not include everything you would need to make it work, since that information has been written by you. Integrating the above concepts into your code is up to you.
To reiterate, you have the CIK, the accession number, and the report type. To save the files in folders, you need only create the folders as you go, with the form "CIK/report_type/accession.xlsx"
I'm writing several related python programs that need to access the same file however, this file will be updated/replaced intermittently and I need them all to access the new file. My current idea is to have a specific folder where the latest file is placed whenever it needs to be replaced and was curious how I could have python select whatever text file is in the folder.
Or, would I be better off creating a program that has a Class entirely dedicated to holding the information of the file and have each program reference the file in that class. I could have the Class use tkinter.filedialog to select a new file whenever necessary and perhaps have a text file that has the path or name to the file that I need to access and have the other programs reference that.
Edit: I don't need to write to the file at all just read from it. However, I would like to have it so that I do not need to manually update the path to the file every time I run the program or update the file path.
Edit2: Changed title to suit the question more
If the requirement is to get the most recently modified file in a specific directory:
import os
mypath = r'C:\path\to\wherever'
myfiles = [(f,os.stat(os.path.join(mypath,f)).st_mtime) for f in os.listdir(mypath)]
mysortedfiles = sorted(myfiles,key=lambda x: x[1],reverse=True)
print('Most recently updated: %s'%mysortedfiles[0][0])
Basically, get a list of files in the directory, together with their modified time as a list of tuples, sort on modified date, then get the one you want.
It sounds like you're looking for a singleton pattern, which is a neat way of hiding a lot of logic into an 'only one instance' object.
This means the logic for identifying, retrieving, and delivering the file is all in one place, and your programs interact with it by saying 'give me the one instance of that thing'. If you need to alter how it identifies, retrieves, or delivers what that one thing is, you can keep that hidden.
It's worth noting that the singleton pattern can be considered an antipattern as it's a form of global state, it depends on the context of the program if this is a deal breaker or not.
To "have python select whatever text file is in the folder", you could use the glob library to get a list of file(s) in the directory, see: https://docs.python.org/2/library/glob.html
You can also use os.listdir() to list all of the files in a directory, without matching pattern names.
Then, open() and read() whatever file or files you find in that directory.
basically I have a structure like "myapp/installed_application_10101010/", however, the number of that folder will change regularly.
Therefore I need a way to find this folder everytime. Almost similar to how unix works when you press tab in the terminal, it'll autofill the rest of the name.
You could use the glob module for this.
import glob
possible_folders = glob.glob('myapp/installed_application_*/')
# returns ['myapp/installed_application_10101010/']
The method returns a list because there could be multiple matches.
I am taking zip file as input which contains multiple files and folders,I am extracting it and then I want to change the last modified time of each content in zip to some new date and time set by user.
I am using os.utime() to change the date and time, but changes get reflected only to the files and not to the folders inside zip.
timeInStr = raw_input("Enter the new time =format: dd-mm-yyyy HH:MM:SS -")
timeInDt=datetime.datetime.strptime(timeInStr, '%d-%m-%Y %H:%M:%S')
timeInTS=mktime(timeInDt.timetuple())
epochTime=(datetime.datetime(timeInDt.year, timeInDt.month, timeInDt.day, timeInDt.hour, timeInDt.minute, timeInDt.second)-datetime.datetime(1970,1,1)).total_seconds()
z=zp.ZipFile(inputZipFile,"a",zp.ZIP_DEFLATED)
for files in z.infolist():
z.extract(files, srcFolderName)
fileName=files.filename
new= fileName.replace('/',os.path.sep)
correctName= srcFolderName+os.path.sep+new
print correctName
if(correctName.endswith(os.path.sep)):
correc=correctName[:-1]
print correc
os.utime(correc, (timeInTS, timeInTS))
else:
os.utime(correctName, (timeInTS, timeInTS))
I am using Python 2.7 as platform
Base to the directory permission is this question on SO. The directory only changes its timestamp when the directory itself changes for ex: when you create a new file in it. So to update the timestamp of folder you can create a temp file and then delete it. There should be a better way but till you find it you can manage using this.
I ran into a similar problem. Here is the code I used to get past the issue.
As user966588 stated, the directory's timestamp is updating as the directory changes.
In the post I linked, I held onto any directory metadata updates until after my directory was fully-populated in order for the timestamp change to stay.