I'm trying to build a financial statement database incrementally. The first steps center on collecting 10-Ks from the SEC's EDGAR database. I have code for pulling the relevant 8-Ks, 10-Ks, and 10-Qs by CIK number and accession number, and for retrieving the relevant Excel spreadsheet. The code below tries to create a folder within a directory, name the folder with the CIK code, pull the spreadsheet from the EDGAR database, and save the spreadsheet to that folder. My example is a CSV file I'm calling "accessionnumtest.csv", which has headings:
company_name,report_type,cik,date,cik_accession
and data:
4Less Group, Inc.,10K/A,1438901,11/27/2019,edgar/data/1438901/000121390019024801.txt
AB INTERNATIONAL GROUP CORP.,10K,1605331,10/22/2019,edgar/data/1605331/000166357719000384.txt
ABM INDUSTRIES INC /DE/,10K,771497,12/20/2019,edgar/data/771497/000162828019015259.txt
ACTUANT CORP,10K,6955,10/29/2019,edgar/data/6955/000000695519000033.txt
My code is below:
import os
import pandas as pd
path = os.getcwd()
folder_path = "C:/metricdatadb/"
df = pd.read_csv("accessionnumtest.csv")
folder_name = df['cik']
print(folder_name)
for row in df.iterrows():
    dir = df.path.join(folder_path, folder_name)
    os.makedirs(dir)
This code gives me the error AttributeError: 'DataFrame' object has no attribute 'path'. I have renamed the path and checked for whitespace in the headers. Any suggestions are appreciated.
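Regarding the error: the function is os.path.join, not df.path.join. You are calling join on the DataFrame instead of on the os module.
That being said, your code would not do what you want even without the error. folder_name is the whole cik column and does not update for each row. Since iterrows() yields (index, row) pairs, unpack them and use row.cik to get the value for the current row. It is an integer, so convert it to a string before joining:
for index, row in df.iterrows():
    dir = os.path.join(folder_path, str(row.cik))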
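Combining the two points above (join paths with os.path.join and read the CIK from each row), a minimal sketch could look like this. It assumes the table has a cik column as in the question; the helper name make_cik_folders, the sample frame, and the temporary base folder are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def make_cik_folders(df, folder_path):
    """Create one sub-folder per CIK under folder_path and return the paths."""
    created = []
    for _, row in df.iterrows():  # iterrows() yields (index, row) pairs
        # os.path.join needs strings, and cik is read as an integer
        target = os.path.join(folder_path, str(row["cik"]))
        os.makedirs(target, exist_ok=True)  # no error if it already exists
        created.append(target)
    return created


# Illustrative data only; in practice df would come from pd.read_csv(...)
sample = pd.DataFrame({"cik": [1438901, 1605331, 771497, 6955]})
base = tempfile.mkdtemp()
folders = make_cik_folders(sample, base)
print(folders)
```

Passing exist_ok=True lets the script be rerun without failing on folders that were created earlier.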
It is not entirely clear what you're trying to accomplish, particularly with the .csv files and pandas. The code you have contains a number of basic errors, which I think could be avoided by going back over some of the simpler Python concepts before attempting something as involved as web scraping. Note that I don't mean you should give up; rather, building up the fundamentals is a necessary step in this type of project.
That said, if I'm understanding your intent correctly, you want to create a file hierarchy for 10-K, 10-Q, etc. filings for several CIKs.
There shouldn't be any need to use .csv files, or pandas for this.
Probably the simplest way to do this would be to do it in the same step you download them.
Pseudocode for this would be as follows:
for cik in list_of_ciks:
    first_file = find_first_file_online()
    if first_file is 10-K:
        save_to_10-K folder for CIK
    if first_file is 10-Q:
        save_to_10-Q folder for CIK
As I said above, you can skip the .csv file. (Also, note that CSV stands for "comma-separated values". Some of the entries in your data contain commas, e.g. "4Less Group, Inc." Unless such a field is enclosed in quotes, the parser will split it into two columns on the comma, shifting the rest of that row one column to the right.)
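If the CSV file is kept after all, quoting the fields that contain commas avoids the column shift. The short sketch below uses only the standard csv module and a shortened version of the first data row from the question to show the difference; the variable names are illustrative.

```python
import csv
import io

# Unquoted: the comma inside the company name splits it into two fields.
unquoted = "4Less Group, Inc.,10K/A,1438901\n"
# Quoted: the csv module keeps the name as a single field.
quoted = '"4Less Group, Inc.",10K/A,1438901\n'

unquoted_row = next(csv.reader(io.StringIO(unquoted)))
quoted_row = next(csv.reader(io.StringIO(quoted)))

print(len(unquoted_row), unquoted_row)
print(len(quoted_row), quoted_row)
```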
When you process the data, you'll want to build the folders as you go.
When you iterate through a new CIK, create the master folder for that CIK. When you encounter a 10-K, create a folder for 10-K's and save it with a unique name. Since you need to use the accession numbers to get the excel sheets, that wouldn't be a bad naming convention to follow.
It would be something like this:
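import requests
import pathlib

cik_list = [cik_1, cik_2, ..., cik_n]  # placeholder for your CIKs
for cik in cik_list:
    # placeholder URL; report_type and accession_number come from your own lookup
    response = requests.get("cik/accession/Report.xlsx")
    target = pathlib.Path(str(cik), report_type, accession_number + ".xlsx")
    target.parent.mkdir(parents=True, exist_ok=True)  # build folders as you go
    with open(target, "wb") as excel_file:
        excel_file.write(response.content)  # raw bytes of the download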
The above code will definitely not run as written, and it does not include everything you would need to make it work, since the missing pieces (the CIK list, report types, accession numbers, and real URLs) come from the code you have already written. Integrating the above concepts into your code is up to you.
To reiterate, you have the CIK, the accession number, and the report type. To save the files in folders, you need only create the folders as you go, with the form "CIK/report_type/accession.xlsx"
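As a rough illustration of that naming scheme, the sketch below derives the target path from one row of the index file shown in the question. The function name target_path is hypothetical, and replacing the slash in "10K/A" with a hyphen is an assumption made here so the report type does not turn into an extra folder level. PurePosixPath is used only so the printed result looks the same on every operating system; a real script would more likely use pathlib.Path.

```python
from pathlib import PurePosixPath


def target_path(cik, report_type, cik_accession):
    """Build 'CIK/report_type/accession.xlsx' from one row of the index."""
    # 'edgar/data/1438901/000121390019024801.txt' -> '000121390019024801'
    accession = PurePosixPath(cik_accession).stem
    # '10K/A' contains a slash, which would create an extra folder level
    safe_type = report_type.replace("/", "-")
    return PurePosixPath(str(cik), safe_type, accession + ".xlsx")


example = target_path(
    1438901, "10K/A", "edgar/data/1438901/000121390019024801.txt"
)
print(example)  # 1438901/10K-A/000121390019024801.xlsx
```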
Related
I have a folder containing thousands of images and each image needs a unique list of keywords added to it. I also have a table with fields showing the file path and associated list of desired keywords for each image. For example, one record might need the tags, "ORASH (a survey site code), Crew 1, Transect A Upstream, Site Layout". While the next record might need the tags, "ORWLW, Crew 2, Amphibian, Pacific Giant Salamander".
How do I iterate over each image to add the IPTC keywords to them? I'm using python 3 and the iptcinfo3 module but am willing to try other modules that may work.
Here's where I'm at now:
import os
import pandas as pd
from iptcinfo3 import IPTCInfo
srcdir = r'E:\photos'
files = os.listdir(srcdir)
# Create a dataframe from the table containing filepaths and associated keywords.
df = pd.read_excel(r'E:\photo_info.xlsx')
# Create a dictionary with the filename as the key and the tags as the value.
references = dict(df.set_index('basename')['tags'])
for file in files:
    # Get the full filepath for each image.
    filepath = os.path.join(srcdir, file)
    # Create an object for a file that may not have IPTC data (ignore the 'Marker scan...' notification).
    info = IPTCInfo(filepath, force=True)
At this point, I imagined I'd use info['keywords'] = ... in conjunction with the 'references' dictionary to plug the keywords into the correct files. Then info.save_as(filepath). I'm just not experienced enough to know how to make this work or even if it's a reasonable way of doing it. Any help would be appreciated!
I saved the table with the filenames and keywords as a .csv file where the fields and records looked like this (though the text in the 'Subject' field didn't include the quotes):
SourceFile, Artist, Subject
E:\photos\0048.JPG, MARY GRAY, "YEAR2022, REQUIRED, GPS UNIT WITH TIME"
Because I use Jupyter Lab for other portions of this workflow, I ran this code there:
import os
os.system(r'cmd d: & exiftool -overwrite_original -sep ", " -csv="E:\photos\metadata.csv" E:\photos')
This opens the Windows command prompt, changes the path to the D: drive (where the exiftool.exe file was saved), invokes exiftool, sets it to overwrite the original image file rather than create a copy, defines the keyword separator in the .csv file, reads the .csv file that has the list of filenames and associated keywords, then runs it on the E:\photos directory.
Worked like a charm on about 4,000 photos!
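For anyone who prefers not to go through a shell string, the same exiftool options can be passed as an argument list to subprocess.run. This is only a sketch of that variation: the function name build_exiftool_command is made up, the flags are the ones described above, and the call itself is left commented out because it needs exiftool installed and on the PATH.

```python
import subprocess


def build_exiftool_command(csv_path, photo_dir, exiftool="exiftool"):
    """Return the argument list for the exiftool call described above."""
    return [
        exiftool,
        "-overwrite_original",  # modify the images in place
        "-sep", ", ",           # separator used inside list-type fields
        f"-csv={csv_path}",     # table of file names and tags
        photo_dir,
    ]


command = build_exiftool_command(r"E:\photos\metadata.csv", r"E:\photos")
print(command)
# To actually run it (requires exiftool to be installed and on PATH):
# subprocess.run(command, check=True)
```

With an argument list there is no need to escape the quotes around the separator or the paths.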
I have data that has been collected and organized in multiple folders.
In each folder, there can be multiple similar runs -- e.g. data collected under the same conditions, at different times. These filenames contain a number that increments. Each folder contains similar data collected under different conditions. For example, I can have an idle folder containing files named idle_1.csv, idle_2.csv, idle_3.csv, etc. Then I can have another folder, pos1, containing pos1_1.csv, pos1_2.csv, etc.
In order to keep track of what folder and what file the data in the arrays came from, I want to use the folder name, "idle", "pos1", etc, as the array name. Then, each file within that folder (or the data resulting from processing each file in that folder, rather) becomes another row in that array.
For example, if the name of the folder is stored in variable arrname, and the file index is stored in variable arrndx, I want to write the value into that array:
arrname[arrndx]=value
This doesn't work, giving the following error:
TypeError: 'str' object does not support item assignment
Then, I thought about using a dictionary to do this, but I think I still would run into the same issue. If I use a dictionary, I think I need each dictionary's name to be the name derived from the folder name -- creating the same issue. If I instead try to use it as a key in a dictionary, the entries get overwritten with data from every file from the same folder since the name is the same:
arrays['name']=arrname
arrays['index']=int(arrndx)
arrays['val']=value
arrays['name': arrname, 'index':arrndx, 'val':value]
I can't use 'index' either since it is not unique across each different folder.
So, I'm stumped. I guess I could predefine all the arrays, and then write to the correct one based on the variable name, but that could result in a large case statement (is there such a thing in python?) or a big if statement. Maybe there is no avoiding this in my case, but I'm thinking there has to be a more elegant way...
EDIT
I was able to work around my issue using globals():
globals()[arrname].insert(int(arrndx),value)
However, I believe this is not the "correct" solution, although I don't understand why it is frowned upon to do this.
Use a nested dictionary with the folder names at the first level and the file indices (or names) at the second.
from pathlib import Path

data = {}
base_dir = 'base'
for folder in Path(base_dir).resolve().glob('*'):
    if not folder.is_dir():
        continue
    data[folder.name] = {}
    for csv in folder.glob('*.csv'):
        file_id = csv.stem.split('_')[1]
        data[folder.name][file_id] = csv
The above example just saves the file name in the structure but you could alternatively load the file's data (e.g. using Pandas) and save that to the dictionary. It all depends what you want to do with it afterwards.
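To connect this back to the arrname and arrndx variables from the question, here is a minimal sketch of storing a processed value per folder and file index. The store helper and the numbers are invented for illustration; collections.defaultdict is used so the inner dictionary is created on first use instead of reaching for globals().

```python
from collections import defaultdict

# Outer key: folder name; inner key: file index; value: whatever was computed
data = defaultdict(dict)


def store(arrname, arrndx, value):
    """Record value for file number arrndx of folder arrname."""
    data[arrname][int(arrndx)] = value


store("idle", "1", 0.12)
store("idle", "2", 0.15)
store("pos1", "1", 3.40)

# Values for one folder, ordered by file index
idle_values = [data["idle"][i] for i in sorted(data["idle"])]
print(idle_values)
```

Because the folder name is a key rather than a variable name, the same index can appear under different folders without overwriting anything.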
What about:
foldername = 'idle'  # say your folder name is idle, for example
files = {}
# number_of_files_inside_folder is assumed to be known already
files[foldername] = [foldername + "_" + str(i) + ".csv" for i in range(1, number_of_files_inside_folder + 1)]
Does that solve your problem?
I have a question. I'm reading a FITS file and then using information from its header to determine the other files related to the original FITS file. But for some of the FITS files, the other files (blaze_file, bis_file, ccf_table) are not available, and because of that my code fails with the obvious error: No such file or directory.
import pandas as pd
import sys, os
import numpy as np
from glob import glob
from astropy.io import fits
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        e2ds_hdu = fits.open(filename)
        e2ds_header = e2ds_hdu[0].header
        date = e2ds_header['DATE-OBS']
        date2 = date = date[0:19]
        blaze_file = e2ds_header['HIERARCH ESO DRS BLAZE FILE']
        bis_file = glob('HARPS.' + date2 + '*_bis_G2_A.fits')
        ccf_table = glob('HARPS.' + date2 + '*_ccf_G2_A.tbl')
        if not all(file in os.listdir(PATH) for file in [blaze_file,bis_file,ccf_table]):
            continue
So what I want is for my code to run only if all the files are available, and to skip the file otherwise. The problem is that I define the other files as variables inside the for loop, because I'm using the header information. So how can I define them before the for loop and then use something like the check at the end of the code above?
Can anyone help me out with this?
The filenames returned by os.listdir() are always relative to the path given there.
In order to be used, they have to be joined with this path.
Example:
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        filepath = os.path.join(PATH, filename)
        e2ds_hdu = fits.open(filepath)
        …
Let the filenames be ['a', 'b', 'a_e2ds_A.fits', 'b_e2ds_A.fits']. The code now skips the first two names and prepends the directory path to the remaining two.
a_e2ds_A.fits becomes home/Desktop/2d_spectra/a_e2ds_A.fits and
b_e2ds_A.fits becomes home/Desktop/2d_spectra/b_e2ds_A.fits.
Now they can be opened even when the script's working directory is not the data directory itself. (Note that PATH as written is a relative path, so it is still resolved against the working directory.)
I should become accustomed to reading a question in full before trying to answer it.
The problem I mentioned only shows up if you start the script from a directory other than the one containing the files. Nevertheless, fixing it will make your code much more robust.
Your real problem, however, lies somewhere else: you examine a file and then, after checking its contents, want to read files whose names depend on information from that first file.
There are several ways to accomplish your goal:
1. Just extend your loop with the proper tests.
Pseudo code:
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if all files exist:
            proceed
or
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if not all files exist:
            continue # actual keyword, no pseudo code!
        proceed
2. Put some functionality into functions (variation of 1.)
3. Create a loop in a generator function which yields the "interesting information" of one fits file (or alternatively nothing) and have another loop run over them to actually work with the data.
If I am still missing some points or am not detailed enough, please let me know.
Since you have to read the FITS file to know the names of the dependent files, there's no way to avoid reading the FITS file first. The only thing you can do is test for the existence of the dependent files before trying to read them, and skip the rest of the loop (using continue) if any of them is missing.
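The existence test could be sketched as a small helper like the one below. The name related_files and the demo file names are hypothetical; the glob patterns are the ones from the question. Note that glob returns a list, so an empty list is what signals a missing file, and that the blaze file name has to be joined with the directory before checking it.

```python
import glob
import os
import tempfile


def related_files(directory, blaze_name, date_prefix):
    """Return (blaze, bis, ccf) paths, or None if any of them is missing."""
    blaze = os.path.join(directory, blaze_name)
    bis = glob.glob(os.path.join(directory, "HARPS." + date_prefix + "*_bis_G2_A.fits"))
    ccf = glob.glob(os.path.join(directory, "HARPS." + date_prefix + "*_ccf_G2_A.tbl"))
    # glob returns a list, so an empty list means "no such file"
    if not (os.path.exists(blaze) and bis and ccf):
        return None
    return blaze, bis[0], ccf[0]


# Demo with empty stand-in files in a temporary directory
demo = tempfile.mkdtemp()
for name in ("blaze.fits", "HARPS.night1_bis_G2_A.fits"):
    open(os.path.join(demo, name), "w").close()

before = related_files(demo, "blaze.fits", "night1")  # ccf table is missing
open(os.path.join(demo, "HARPS.night1_ccf_G2_A.tbl"), "w").close()
after = related_files(demo, "blaze.fits", "night1")
print(before, after)
```

Inside the loop over FITS files, a None result would be the cue to continue with the next file.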
Edit this line
e2ds_hdu = fits.open(filename)
And replace with
e2ds_hdu = fits.open(os.path.join(PATH, filename))
I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
You are looking to avoid extracting to disk. In the Python zipfile docs there is ZipFile.open(), which gives you a file-like object: an object that mostly behaves like a regular file on disk, but is read from the archive without being extracted first. It returns bytes when read, at least in Python 3.
Something like this...
from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    print(myzip.namelist())
    for name in myzip.namelist():
        with myzip.open(name) as myfile:
            # ZipFile.open() yields bytes; wrap it so csv.reader gets text
            text = io.TextIOWrapper(myfile, encoding='utf-8', newline='')
            for row in csv.reader(text):
                print(row)
The documentation of Python is actually quite good once one has learned how to find things as well as some of the basic programming terms/descriptions used in the documentation.
ZipFile.open() returns bytes while the csv module expects text, hence the extra step of wrapping the member in io.TextIOWrapper. (StringIO and BytesIO live in the io module, not in csv.)
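To pull out specific values rather than print every row, csv.DictReader can be layered on top of the file-like object that ZipFile.open() returns. The sketch below builds a tiny archive in memory so it runs without any files on disk; the member name runs.csv and its columns are invented for the example, and UTF-8 encoding is an assumption.

```python
import csv
import io
from zipfile import ZipFile

# Build a small archive in memory so the example does not need files on disk
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("runs.csv", "name,value\nidle,1\npos1,2\n")

values = {}
with ZipFile(buffer) as archive:
    with archive.open("runs.csv") as member:
        # Decode the bytes stream so DictReader receives text
        text = io.TextIOWrapper(member, encoding="utf-8", newline="")
        for row in csv.DictReader(text):
            values[row["name"]] = int(row["value"])

print(values)
```

With a real archive, ZipFile('abc.zip') would take the place of the in-memory buffer.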
I've got a pretty simple task but I haven't done too many functions with excel within python and I'm not sure how to go about doing this.
What I need to do:
Look at many Excel files within subfolders, rename them according to information within each file, and store them all in one folder somewhere else.
The data is structured like this:
Main Folder
    Subfolder1
        File1
        File2
        File3
    ...
For about a hundred subfolders and several files within each subfolder.
From here, I want to pull the company name, part number, and date from within the file and use those to rename the excel file. Not sure how to rename the file.
Then save it somewhere else. I'm having trouble finding all these functions, any advice?
Check the os and os.path module for listing folder contents (walk, listdir) and working with path names (abspath, basename etc.)
Also, shutil has some interesting functions for copying stuff. Check out copyfile and specify the dst parameter based on the data you read from the excel file.
This page can help you getting at the Excel data: http://www.python-excel.org/
You probably want to have some high-level code like this:
for subfolder_name in os.listdir(MAIN_FOLDER):
    # exercise left to reader: filter out non-folders
    subfolder_path = os.path.join(MAIN_FOLDER, subfolder_name)
    for excel_file_name in os.listdir(subfolder_path):
        # exercise left to reader: filter out non-excel-files
        excel_file_path = os.path.join(subfolder_path, excel_file_name)
        new_excel_file_name = extract_filename_from_excel_file(excel_file_path)
        new_excel_file_path = os.path.join(NEW_MAIN_FOLDER, subfolder_name,
                                           new_excel_file_name)
        shutil.copyfile(excel_file_path, new_excel_file_path)
You'll have to provide extract_filename_from_excel_file yourself using the xlrd module from the site I mentioned.
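A variation that puts everything into a single target folder, as the question asks, might look like the sketch below. It uses os.walk instead of nested listdir calls and takes the naming function as a parameter. Both collect_files and the stand-in name_from_path are hypothetical; the real naming function would read the company name, part number, and date from the workbook. The target folder is assumed to lie outside the main folder.

```python
import os
import shutil
import tempfile


def collect_files(main_folder, target_folder, make_name):
    """Copy every file under main_folder into target_folder under a new name."""
    os.makedirs(target_folder, exist_ok=True)
    copied = []
    for folder, _subfolders, file_names in os.walk(main_folder):
        for file_name in file_names:
            source = os.path.join(folder, file_name)
            destination = os.path.join(target_folder, make_name(source))
            shutil.copyfile(source, destination)
            copied.append(destination)
    return copied


# Stand-in for the Excel lookup: prefix the file with its subfolder's name
def name_from_path(path):
    return os.path.basename(os.path.dirname(path)) + "_" + os.path.basename(path)


# Demo on a throwaway folder tree with one empty file
main = tempfile.mkdtemp()
os.makedirs(os.path.join(main, "Subfolder1"))
open(os.path.join(main, "Subfolder1", "File1.xls"), "w").close()
target = os.path.join(tempfile.mkdtemp(), "renamed")
result = collect_files(main, target, name_from_path)
print(result)
```

If two files end up with the same generated name, the later copy overwrites the earlier one, so the naming function should produce unique names.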