Read a CSV file with a specific name pattern dynamically in pandas - Python

I have some CSV files whose filename suffix keeps changing all the time:
Active_Count_1618861363072
Deposit_1618861402104
Game_Type_Wise_Net_Sell_1618861383176
Total_Count_1618861351976
I want to read these files automatically, something like this (the * stands for the part I want to keep dynamic; this is not valid code as written):
df1 = pd.read_csv('Active_Count_*.csv')
df2 = pd.read_csv('Deposit_*.csv')
df3 = pd.read_csv('Game_Type_Wise_Net_Sell_*.csv')
df4 = pd.read_csv('Total_Count_*.csv')
I want to keep the part after the last underscore dynamic and still load the CSV files.
Is there a way I can achieve this?

This can be achieved outside Pandas using only standard Python functionality:
import glob
active_count_filename = glob.glob('Active_Count_*.csv')[0]
df1 = pd.read_csv(active_count_filename)
This assumes that there is exactly one Active_Count_* file: if none exists, indexing with [0] raises an IndexError; if more than one exists, an arbitrary one is picked (glob does not guarantee any particular order).
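If you need this for all four prefixes, here is a minimal sketch (the prefix list and dictionary layout are assumptions for illustration; sorted() makes the choice deterministic, and since the timestamp suffixes have equal length, the last entry is the newest file):
import glob
import pandas as pd

prefixes = ['Active_Count_', 'Deposit_', 'Game_Type_Wise_Net_Sell_', 'Total_Count_']
frames = {}
for prefix in prefixes:
    matches = sorted(glob.glob(prefix + '*.csv'))
    if not matches:
        raise FileNotFoundError('no file matching ' + prefix + '*.csv')
    frames[prefix.rstrip('_')] = pd.read_csv(matches[-1])  # newest timestamp

df1 = frames['Active_Count']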

Related

AttributeError: 'DataFrame' object has no attribute 'path'

I'm trying to incrementally build a financial statement database. The first steps center on collecting 10-Ks from the SEC's EDGAR database. I have code for pulling the relevant 8-Ks, 10-Ks, and 10-Qs by CIK number and accession number, and for retrieving the relevant Excel spreadsheet. The code below centers on creating a folder within a directory, naming the folder with the CIK code, pulling the spreadsheet from the EDGAR database, and saving the spreadsheet to the folder with the CIK code. My example is a CSV file I'm calling "accessionnumtest.csv", which has headings:
company_name,report_type,cik,date,cik_accession
and data:
4Less Group, Inc.,10K/A,1438901,11/27/2019,edgar/data/1438901/000121390019024801.txt
AB INTERNATIONAL GROUP CORP.,10K,1605331,10/22/2019,edgar/data/1605331/000166357719000384.txt
ABM INDUSTRIES INC /DE/,10K,771497,12/20/2019,edgar/data/771497/000162828019015259.txt
ACTUANT CORP,10K,6955,10/29/2019,edgar/data/6955/000000695519000033.txt
My code is below:
import os
import pandas as pd
path = os.getcwd()
folder_path = "C:/metricdatadb/"
df = pd.read_csv("accessionnumtest.csv")
folder_name = df['cik']
print(folder_name)
for row in df.iterrows():
    dir = df.path.join(folder_path, folder_name)
    os.makedirs(dir)
This code is giving me the AttributeError: 'DataFrame' object has no attribute 'path' error. I have renamed the path and checked for whitespace in the headers. Any suggestions are appreciated.
Regarding the error: it is os.path.join, not df.path.join. You are calling path.join on the DataFrame instead of the os module.
That being said, your code is not doing what you are trying to do regardless of the error: folder_name holds the entire cik column and will not update for each row. iterrows() yields (index, row) tuples, so unpack the tuple and use the row's cik value, converted to a string since os.path.join expects strings:
for _, row in df.iterrows():
    dir_path = os.path.join(folder_path, str(row.cik))
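Putting it together, a minimal sketch of the corrected loop (exist_ok=True is my addition so re-running the script does not fail on folders that already exist):
import os
import pandas as pd

folder_path = "C:/metricdatadb/"
df = pd.read_csv("accessionnumtest.csv")

for _, row in df.iterrows():
    dir_path = os.path.join(folder_path, str(row.cik))
    os.makedirs(dir_path, exist_ok=True)  # one folder per CIK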
It is relatively unclear what you're working toward accomplishing, particularly with the .csv files and Pandas. The code you have has a number of curious errors in it, which I think might be ameliorated by going back to learn some of the simpler Python concepts before trying something as difficult as web scraping. Note I don't mean you should give up; rather, building up the fundamentals is a necessary step in this type of project.
That said, if I'm understanding your intent correctly, you want to create a file hierarchy for 10-K, 10-Q, etc. filings for several CIKs.
There shouldn't be any need to use .csv files, or pandas for this.
Probably the simplest way to do this would be to do it in the same step you download them.
Pseudocode for this would be as follows:
for cik in list_of_ciks:
    first_file = find_first_file_online()
    if first_file is 10-K:
        save to 10-K folder for CIK
    if first_file is 10-Q:
        save to 10-Q folder for CIK
As I said above, you can skip the .csv file. (Also, note that CSV stands for "comma-separated values". Some of the entries in your data contain commas, e.g. "4Less Group, Inc." Unless such a field is quoted, a CSV parser will split the single entry into two columns on the comma, shifting all of your data over by one column.)
When you process the data, you'll want to build the folders as you go.
When you iterate through a new CIK, create the master folder for that CIK. When you encounter a 10-K, create a folder for 10-K's and save it with a unique name. Since you need to use the accession numbers to get the excel sheets, that wouldn't be a bad naming convention to follow.
It would be something like this:
import requests
import pathlib

cik_list = [cik_1, cik_2, ..., cik_n]
for cik in cik_list:
    response = requests.get("cik/accession/Report.xlsx")  # placeholder URL
    out_path = pathlib.Path(cik, report_type, accession_number + ".xlsx")
    with open(out_path, "wb") as excel_file:
        excel_file.write(response.content)  # the raw bytes of the spreadsheet
The above code still will not run as-is, and does not include everything you would need to make it work, since that information has to come from your own code. Integrating the above concepts into your code is up to you.
To reiterate, you have the CIK, the accession number, and the report type. To save the files in folders, you need only create the folders as you go, with the form "CIK/report_type/accession.xlsx"
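For illustration, a minimal sketch of that naming scheme using pathlib (the sample values are hypothetical, taken from one row of the accessionnumtest.csv data above; in practice they would come from your download loop):
import pathlib

# hypothetical values for one filing, from the question's sample data
cik, report_type, accession = "1438901", "10K", "000121390019024801"

out_path = pathlib.Path(cik, report_type, accession + ".xlsx")
out_path.parent.mkdir(parents=True, exist_ok=True)  # creates CIK/report_type/ as needed
print(out_path)  # 1438901/10K/000121390019024801.xlsx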

Adding a path to pandas to_csv function

I have a small chunk of code using Pandas that reads an incoming CSV, performs some simple computations, adds a column, and then turns the dataframe back into a CSV using to_csv.
I was running it all in a Jupyter notebook and it worked great; the output CSV file would appear right in the directory when I ran it. I have now changed my code to be run from the command line, and when I run it, I don't see the output CSV file anywhere. The way I did this was saving the file as a .py, saving it into a folder right on my desktop, and putting the incoming CSV in the same folder.
From similar questions on Stack Overflow I am gathering that, right before I use to_csv at the end of my code, I might need to add the path into that line as a variable, such as this:
path = 'C:\Users\ab\Desktop\conversion'
final2.to_csv(path, 'Combined Book.csv', index=False)
However after adding this, I am still not seeing this output CSV file in the directory anywhere after running my pretty simple .py code from the command line.
Does anyone have any guidance? Let me know what other information I could add for clarity. I don't think sample code of the pandas computations is necessary, it is as simple as adding a column with data based on one of my incoming columns.
Join the path and the filename together and pass that to to_csv (use a raw string so backslashes like \U are not treated as escape sequences):
import os

path = r'C:\Users\ab\Desktop\conversion'
output_file = os.path.join(path, 'Combined Book.csv')
final2.to_csv(output_file, index=False)
I'm pretty sure that you have mixed up the arguments. The path should include the filename in it:
path = r'C:\Users\ab\Desktop\conversion\Combined_Book.csv'
final2.to_csv(path, index=False)
Otherwise 'Combined Book.csv' is passed as the second positional argument, sep, and pandas rejects it because the value separator must be a single character.
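To make the mix-up concrete, a minimal sketch (final2 here is a stand-in DataFrame; the path is the one from the question):
import pandas as pd

final2 = pd.DataFrame({"a": [1, 2]})

# 'Combined Book.csv' would land in the sep parameter here, and pandas
# rejects it because the value separator must be a single character:
# final2.to_csv(r'C:\Users\ab\Desktop\conversion', 'Combined Book.csv', index=False)

# Correct: a single path argument with the filename included.
final2.to_csv(r'C:\Users\ab\Desktop\conversion\Combined Book.csv', index=False)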
I think an absolute path is what you are looking for:
import pandas as pd
.....
final2.to_csv(r'C:\Users\ab\Desktop\conversion\Combined Book.csv', index=False)
Or, for example:
path_to_file = r"C:\Users\ab\Desktop\conversion\Combined Book.csv"
final2.to_csv(path_to_file, encoding="utf-8")
Though this is a late answer, it may be useful for someone facing similar issues. It is better to get the CSV folder path dynamically instead of hardcoding it. We can do so using os.getcwd(), and then join the folder path with the CSV file name using os.path.join(os.getcwd(), 'csvFileName').
Example:
import os
path = os.getcwd()
export_path = os.path.join(path,'Combined Book.csv')
final2.to_csv(export_path, index=False, header=True)

Run only if the "if" statement is true

So I have a question. I'm reading a fits file and then using information from its header to define other files which are related to the original fits file. But for some of the fits files, the other files (blaze_file, bis_file, ccf_table) are not available, and because of that my code gives the fairly obvious "No such file or directory" error.
import pandas as pd
import sys, os
import numpy as np
from glob import glob
from astropy.io import fits
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        e2ds_hdu = fits.open(filename)
        e2ds_header = e2ds_hdu[0].header
        date = e2ds_header['DATE-OBS']
        date2 = date = date[0:19]
        blaze_file = e2ds_header['HIERARCH ESO DRS BLAZE FILE']
        bis_file = glob('HARPS.' + date2 + '*_bis_G2_A.fits')
        ccf_table = glob('HARPS.' + date2 + '*_ccf_G2_A.tbl')
        if not all(file in os.listdir(PATH) for file in [blaze_file, bis_file, ccf_table]):
            continue
So what I want to do is make my code run only if all the files are available, and otherwise skip that fits file. But the problem is that I'm defining the other files as variables inside the for loop, since I'm using the header information. So how can I define them before the for loop and then use something like the check above? Can anyone help me out with this?
The filenames returned by os.listdir() are always relative to the path given there.
In order to be used, they have to be joined with this path.
Example:
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        filepath = os.path.join(PATH, filename)
        e2ds_hdu = fits.open(filepath)
        …
Let the filenames be ['a', 'b', 'a_e2ds_A.fits', 'b_e2ds_A.fits']. The code now excludes the first two names and then prepends the file path to the remaining two:
a_e2ds_A.fits becomes home/Desktop/2d_spectra/a_e2ds_A.fits and
b_e2ds_A.fits becomes home/Desktop/2d_spectra/b_e2ds_A.fits.
Now they can be accessed from everywhere, not just from the given file path.
I should become accustomed to reading a question in full before trying to answer it.
The problem I mentioned only becomes a problem if you start the script from a path outside the said directory. Nevertheless, applying the fix will make your code much more consistent.
Your real problem, however, lies somewhere else: you examine a file and then, after checking its contents, want to read files whose names depend on information from that first file.
There are several ways to accomplish your goal:
Just extend your loop with the proper tests.
Pseudocode:
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if all files exist:
            proceed
or
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if not all files exist:
            continue  # actual keyword, no pseudocode!
        proceed
Put some functionality into functions (variation of 1.)
Create a loop in a generator function which yields the "interesting information" of one fits file (or alternatively nothing) and have another loop run over the results to actually work with the data; see the sketch after this list.
If I am still missing some points or am not detailed enough, please let me know.
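Here is a minimal sketch of option 3, assuming the header keywords and glob patterns from the question (the function name complete_observations is mine, and the glob patterns are joined with the directory path, which the original code omitted):
import os
from glob import glob
from astropy.io import fits

PATH = os.path.join("home", "Desktop", "2d_spectra")

def complete_observations(path):
    """Yield (e2ds_path, blaze, bis, ccf) only when all companion files exist."""
    for filename in os.listdir(path):
        if not filename.endswith("_e2ds_A.fits"):
            continue
        filepath = os.path.join(path, filename)
        with fits.open(filepath) as hdu:
            header = hdu[0].header
            date2 = header['DATE-OBS'][0:19]
            blaze_file = header['HIERARCH ESO DRS BLAZE FILE']
        bis = glob(os.path.join(path, 'HARPS.' + date2 + '*_bis_G2_A.fits'))
        ccf = glob(os.path.join(path, 'HARPS.' + date2 + '*_ccf_G2_A.tbl'))
        if bis and ccf and os.path.exists(os.path.join(path, blaze_file)):
            yield filepath, blaze_file, bis[0], ccf[0]

for e2ds_path, blaze_file, bis_file, ccf_table in complete_observations(PATH):
    ...  # work with one complete set of files here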
Since you have to read the fits file to know the names of the other dependent files, there is no way to avoid reading the fits file first. The only thing you can do is test for the dependent files' existence before trying to read them, and skip the rest of the loop (using continue) if they are missing.
Edit this line:
e2ds_hdu = fits.open(filename)
and replace it with:
e2ds_hdu = fits.open(os.path.join(PATH, filename))

How to work with CSV files inside a zipped folder?

I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
You are looking to avoid extracting to disk. The zipfile docs for Python include ZipFile.open(), which gives you a file-like object: an object that mostly behaves like a regular file on disk, but lives in memory. Reading it gives a bytes object, at least in Python 3.
Something like this...
from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    print(myzip.filelist)
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            mc = myfile.read()                        # raw bytes
            c = csv.reader(io.StringIO(mc.decode()))  # decode, then parse rows
            for row in c:
                print(row)
The documentation of Python is actually quite good once one has learned how to find things as well as some of the basic programming terms/descriptions used in the documentation.
The csv module works on text, not bytes (and StringIO/BytesIO live in the io module, not in csv), hence the extra step of decoding and wrapping in io.StringIO.
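If the files are large, a variation on the same idea is to let io.TextIOWrapper decode the member on the fly instead of reading everything into memory first. A minimal sketch, reusing the hypothetical abc.zip from above:
import csv
import io
from zipfile import ZipFile

with ZipFile('abc.zip') as myzip:
    for info in myzip.infolist():
        with myzip.open(info.filename) as binary_file:
            # wrap the binary stream so csv.reader sees decoded text lines
            text_file = io.TextIOWrapper(binary_file, encoding='utf-8', newline='')
            for row in csv.reader(text_file):
                print(row)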

Grabbing all files in a query using Python

I'm processing a large number of files using Python. The files' data are related to each other through their file names.
If I were to perform a CMD command to do this (in Windows), it would look something like:
DIR filePrefix_??.txt
and this would return all the file names I need for that group.
Is there a similar function I can use in Python?
Have a look at the glob module.
glob.glob("filePrefix_??.txt")
returns a list of matching file names.
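For example, a short sketch that mirrors the DIR command and peeks at each match (the prefix is taken from the question; as with DIR, each ? matches exactly one character):
import glob

# each ? matches exactly one character; * would match any run of characters
for name in sorted(glob.glob("filePrefix_??.txt")):
    with open(name) as f:
        print(name, f.readline().rstrip())  # file name plus its first line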
