I need to iterate through all .rrd files inside a given directory, fetch the data from each RRD database, perform some actions on it, and export everything into a single CSV file from a Python script.
How can this be done efficiently? Advice on how to loop over multiple files and access the database data in each one would be appreciated.
I assume you already have an rrdtool installation with Python bindings on your system. If not, here is an installation description.
Then, to loop over the .rrd files in a given directory and perform a fetch on each one:
import os

import rrdtool

target_directory = "some/directory/with/rrdfiles"
rrd_files = [os.path.join(target_directory, f)
             for f in os.listdir(target_directory) if f.endswith('.rrd')]

data_container = []
for rrd_file in rrd_files:
    data = rrdtool.fetch(rrd_file,
                         'AVERAGE',               # or whichever CF you need
                         '--resolution', '200',   # for example
                         )                        # and any other settings you need
    data_container.append(data)
The parameter list follows rrdfetch.
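Note that each rrdtool.fetch call returns a nested structure, roughly ((start, end, step), ds_names, values), rather than flat rows. Here is a minimal sketch of flattening those results into one CSV row per data point, assuming that layout (adapt the timestamp handling to your needs):

# Flatten the fetch results into (timestamp, value, value, ...) rows.
flat_rows = []
for (start, end, step), ds_names, values in data_container:
    timestamp = start
    for value_row in values:
        flat_rows.append([timestamp] + list(value_row))
        timestamp += step
# Use flat_rows in place of data_container below if you want one CSV row per data point.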
Once you have whatever data you need inside the rrd_files loop, accumulate it in a list of lists, with each sublist being one row of data. Writing those rows to a CSV is then as easy as this:
import csv

# read the .rrd data as above into a 'data_container' variable
with open('my_data.csv', 'w', newline='') as csv_file:
    rrd_writer = csv.writer(csv_file)
    for row in data_container:
        rrd_writer.writerow(row)
This should outline the general steps you have to follow; you will probably need to adapt them (the rrdfetch parameters in particular) to your case.
Update: Sorry, it seems my question wasn't asked properly. I am analyzing a transportation network consisting of more than 5000 links. All the data is included in one big CSV file. I have several JSON files, each of which contains a subset of this network. I am trying to loop through all the JSON files INDIVIDUALLY (i.e. not concatenating them), read each JSON file, extract the corresponding information from the CSV file, perform a calculation, and save the result along with the file name in a new dataframe. Something like this:
This is the code I wrote, but I'm not sure it's efficient enough.
import glob
import json
import os

import pandas as pd

name = []
percent_of_truck = []
path_to_json = r'\\directory'  # folder that holds the JSON files

z = glob.glob(os.path.join(path_to_json, '*.json'))
for i in z:
    with open(i, 'r') as myfile:
        l = json.load(myfile)
    name.append(i)
    d_2019 = final.loc[final['LINK_ID'].isin(l)]  # retrieve data from the main CSV file ('final' is the dataframe read from it)
    avg_m = (d_2019['AADTT16'] / d_2019['AADT16'] * d_2019['Length']).sum() / d_2019['Length'].sum()  # calculation
    percent_of_truck.append(avg_m)

f = pd.DataFrame()
f['Name'] = name
f['% of truck'] = percent_of_truck
I'm assuming here you just want a dictionary of all the JSON. If so, use the json library (import json); this code may be of use:
import json

def importSomeJSONFile(f):
    # make sure the file exists in the same directory
    with open(f) as fh:
        return json.load(fh)

example = importSomeJSONFile("example.json")
print(example)

# access a value within it, replacing "name" with whichever key you want
print(example["name"])
Since you haven't added any schema or other specific requirements, you can follow this approach to solve your problem in any language you prefer (a Python sketch follows the steps):
1. Get the directory of the JSON files that need to be read.
2. Get the list of all files present in that directory.
3. For each file name returned in step 2:
   - Read the file.
   - Parse the JSON from the string.
   - Perform the required calculation.
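A minimal Python sketch of those steps, assuming each JSON file holds a list of link IDs and with calculate() standing in for whatever computation you need (the directory and calculate are hypothetical placeholders):

import glob
import json
import os

path_to_json = 'path/to/json_files'  # step 1: directory with the JSON files

results = {}
for json_path in glob.glob(os.path.join(path_to_json, '*.json')):  # step 2
    with open(json_path) as fh:       # step 3: read the file
        link_ids = json.load(fh)      # parse the JSON
    results[os.path.basename(json_path)] = calculate(link_ids)  # required calculation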
I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os

pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")]  # get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))]  # add directory path

import pandas as pd

df = pd.DataFrame(columns=['FileName', 'Text'])

for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({'FileName': pdf_files[i], 'Text': scraped_text[i]}, ignore_index=True)

df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
df:
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from @AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters of the first PDF file, rather than looping through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or the CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
I guess you don't need pandas for that. You can make it simpler by using the csv module from the standard library.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path

folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')

with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind is to be sure pdf_text will be a single line, or else your CSV file will be kind of broken. One way to work around that is to pick an arbitrary character to use in place of the newline marks. If you pick the pipe character, for example, then you can do something like this prior to writer.writerow (as already done in the example above):
pdf_text = pdf_text.replace('\n', '|')
It is not meant to be a complete example but a starting point. I hope it helps.
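If you would rather keep the original pandas approach, a minimal fix (assuming convert_pdf_to_txt from the question) is to append the whole scraped text instead of indexing it with i, and to build the DataFrame once at the end, because DataFrame.append does not modify the frame in place:

rows = []
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])   # whole text, not scraped_text[i]
    rows.append({'FileName': pdf_files[i], 'Text': scraped_text})

df = pd.DataFrame(rows, columns=['FileName', 'Text'])  # build the frame once
df.to_csv('output.csv', index=False)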
I'm currently learning how to code and I have run into this challenge that I have been trying to solve for the last couple of days.
I have over 2000 CSV files that I would like to import into a particular PostgreSQL table at once, instead of using the import data function in pgAdmin 4, which only allows importing one CSV file at a time. How should I go about doing this? I'm using Windows OS.
A simple way is to use Cygwin or an inner Ubuntu shell (WSL) to run this script:
all_files=("file_1.csv" "file_2.csv")  # or change this to * to pick up every csv in the directory
dir_name=<path_to_files>
export PGUSER=<username_here>
export PGPASSWORD=<password_here>
export PGHOST=localhost
export PGPORT=5432
db_name=<dbname_here>
table_name=<table_name_here>
echo "write db"
for file in "${all_files[@]}"; do
    # \copy runs COPY from the client side, so the CSV only needs to be readable on this machine
    psql -d "$db_name" -c "\copy $table_name FROM '$dir_name/$file' CSV HEADER" >/dev/null
done
If you want to do this purely in Python, I have given an approach below. It's possible that you wouldn't need to chunk the list (that you could hold all of the files in memory at once and not need to do it in batches). It's also possible that the files are radically different sizes and you'd need something more sophisticated than plain batches to avoid creating an in-memory file object that exceeds your RAM. Or you might choose to do it in 2000 separate transactions, but I suspect some kind of batching will be faster (untested).
import csv
import io
import os

import psycopg2

CSV_DIR = 'the_csv_folder/'  # Relative path here, might need to be an absolute path

def chunks(l, n):
    """
    https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]

# Get a list of all the CSV files in the directory
all_files = os.listdir(CSV_DIR)

# Chunk the list of files. Let's go with 100 files per chunk, can be changed
chunked_file_list = chunks(all_files, 100)

# Iterate the chunks and aggregate the files in each chunk into a single
# in-memory file
for chunk in chunked_file_list:
    # This is the file to aggregate into
    string_buffer = io.StringIO()
    csv_writer = csv.writer(string_buffer)
    for file in chunk:
        with open(CSV_DIR + file) as infile:
            reader = csv.reader(infile)
            data = list(reader)  # csv.reader has no readlines(); collect the parsed rows
        # Transfer the read data to the aggregated file
        csv_writer.writerows(data)
    # Rewind the buffer so copy_from() reads it from the start
    string_buffer.seek(0)
    # Now we have aggregated the chunk, copy the file to Postgres
    with psycopg2.connect(dbname='the_database_name',
                          user='the_user_name',
                          password='the_password',
                          host='the_host') as conn:
        c = conn.cursor()
        # Headers need to be the table field names, in the order they appear in
        # the csv
        headers = ['first_name', 'last_name', ...]
        # Now upload the data as though it was a file
        c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
        conn.commit()
I am new to programming. I have hundreds of CSV files in a folder, and certain files have the letters DIF in the second column. I want to rewrite the CSV files without those lines in them. I have attempted doing that for one file and have put my attempt below. I also need help getting the program to do that for all the files in my directory. Any help would be appreciated.
Thank you
import csv

reader = csv.reader(open("40_5.csv", "r"))
for row in reader:
    if row[1] == 'DIF':
        csv.writer(open('40_5N.csv', 'w')).writerow(row)
I made some changes to your code:
import csv
import glob
import os

fns = glob.glob('*.csv')
for fn in fns:
    with open(fn, newline='') as infile, open(os.path.join('out', fn), 'w', newline='') as outfile:
        reader = csv.reader(infile)
        w = csv.writer(outfile)
        for row in reader:
            if 'DIF' not in row:
                w.writerow(row)
The glob call produces a list of all files ending with .csv in the current directory. If you want to give the source directory as an argument to your program, have a look at sys.argv or argparse (especially the latter is very powerful for command line parsing; see the short sketch after the links below).
You also have to be careful when opening a file in 'w' mode: it truncates the file, i.e. in your loop you would always overwrite the existing file, ending up with only one CSV line.
The directory 'out' must exist or the script will produce an IOError.
Links:
open
sys.argv
argparse
glob
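A minimal argparse sketch for passing the source directory on the command line (the argument name and script name are assumptions):

import argparse
import glob
import os

# Hypothetical CLI: python filter_dif.py path/to/csvs
parser = argparse.ArgumentParser(description='Strip DIF rows from CSV files')
parser.add_argument('source_dir', help='directory containing the .csv files')
args = parser.parse_args()

fns = glob.glob(os.path.join(args.source_dir, '*.csv'))
print(fns)  # plug this list into the loop shown above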
Most sequence types support the in and not in operators, which are much simpler to use for testing values than figuring out index positions.
with open('40_5N.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in reader:
        if 'DIF' not in row:
            writer.writerow(row)
If you're willing to install numpy, you can also read a CSV file into the convenient numpy array format with either recfromcsv or the more general genfromtxt (genfromtxt requires you to specify the comma delimiter), and you can specify which rows and columns to ignore. Documentation for genfromtxt can be found here:
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
And here for recfromcsv: http://nullege.com/codes/search/numpy.recfromcsv?fulldoc=1
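A minimal sketch of the genfromtxt route, assuming a file named 40_5.csv with a header row (the column indices here are only illustrative):

import numpy as np

# Read the CSV into a structured array; dtype=None lets numpy infer column types.
data = np.genfromtxt('40_5.csv', delimiter=',', dtype=None, encoding='utf-8',
                     skip_header=1,      # ignore the header row
                     usecols=(0, 1, 2))  # keep only the columns you care about
print(data.shape)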
I know I can include Python code from a common file using import MyModuleName - but how do I go about importing just a dict?
The problem I'm trying to solve is I have a dict that needs to be in a file in an editable location, while the actual script is in another file. The dict might also be edited by hand, by a non-programmer.
script.py
airportName = 'BRISTOL'
myAirportCode = airportCode[airportName]
myDict.py
airportCode = {'ABERDEEN': 'ABZ', 'BELFAST INTERNATIONAL': 'BFS', 'BIRMINGHAM INTERNATIONAL': 'BHX', 'BIRMINGHAM INTL': 'BHX', 'BOURNMOUTH': 'BOH', 'BRISTOL': 'BRS'}
How do I access the airportCode dict from within script.py?
Just import it
import myDict
print(myDict.airportCode)
or, better
from myDict import airportCode
print(airportCode)
Just be careful to put both scripts in the same directory (or make a Python package, i.e. a subdirectory with an __init__.py file, or put the directory containing myDict.py on the PYTHONPATH; but these are "advanced options", just put both files in the same directory and it'll be fine).
Assuming your import myDict works, you need to do the following:
from myDict import airportCode
Well, it doesn't need to be a .py file. If the file contains just the dict literal, you could do:
airportCode = eval(open("myDict").read())
It's a gaping security hole, though: eval will happily evaluate whatever expression is in that file.
Another module you might want to look at is csv, for importing CSV files. Then your users could edit the data with a spreadsheet and you don't have to teach them Python syntax.
When you perform an import in Python you are really just pulling names into your current namespace. It does not really matter what those names refer to, so:
from myDict import airportCode
will work regardless of whether airportCode is a function, a class or just a plain variable, as in your case.
If your dict has to be hand-editable by a non-programmer, perhaps it might make more sense to use a CSV file for this. Then your editor can even use Excel.
So you can use:
import csv
csvfile = csv.reader(open("airports.csv"))
airportCode = dict(csvfile)
to read a CSV file like
"ABERDEEN","ABZ"
"BELFAST INTERNATIONAL","BFS"
"BIRMINGHAM INTERNATIONAL","BHX"
"BIRMINGHAM INTL","BHX"
"BOURNMOUTH","BOH"
"BRISTOL","BRS"
Careful: If an airport were in that list twice, the last occurrence would silently "overwrite" any previous one(s).
Use csv. Stick import csv with the rest of your module imports, and then you can do the following:
f = open('somefile.csv', newline='')
reader = csv.DictReader(f, ('airport', 'iatacode'))
for row in reader:
    print(row)
which should give you one dictionary per row, keyed by the field names:
{'airport': 'Aberdeen', 'iatacode': 'ABZ'}
to create the csv file:
f = open('somefile.csv', 'w', newline='')
writer = csv.DictWriter(f, ('airport', 'iatacode'))
writer.writeheader()
for airport, code in airportCode.items():
    writer.writerow({'airport': airport, 'iatacode': code})
f.close()
which will create a csv file with airports and IATA TLAs in two columns, with airport and iatacode as the headers.
You can also skip the dicts and work with plain lists of strings by using csv.reader and csv.writer rather than DictReader and DictWriter.
By default, the csv module produces excel-style csv, but you can set whatever dialect you like as a kwarg.
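For completeness, a minimal sketch of the same round trip with plain csv.reader and csv.writer (reusing the somefile.csv name from above):

import csv

# Write the airport codes as two plain columns, no header row.
with open('somefile.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for airport, code in airportCode.items():
        writer.writerow([airport, code])

# Read them back; csv.reader yields a plain list of strings per row.
with open('somefile.csv', newline='') as f:
    airportCode = dict(csv.reader(f))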
from myDict import airportCode

airportName = 'BRISTOL'
myAirportCode = airportCode[airportName]
If myDict has to be accessed from a Python module in a different directory, you have to provide an __init__.py file (i.e. turn that directory into a package).
For more information about this topic, have a look at the modules chapter of the Python documentation.
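A minimal sketch of such a layout, with a hypothetical package directory called config:

# Hypothetical project layout:
#
#   project/
#       script.py
#       config/
#           __init__.py   # can be empty; it marks 'config' as a package
#           myDict.py     # contains the airportCode dict
#
# Inside script.py you would then write:
from config.myDict import airportCode

print(airportCode['BRISTOL'])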