I have a list of HDF5 files which I would like to open, read the appropriate values into a new dictionary, and eventually write to a text file. I don't necessarily know the values, so the user defines them in an array as an input to the code. The number of files needed is defined by the number of days' worth of data the user wants to look at.
new_data_dic = {}
for j in range(len(values)):
    new_data_dic[values[j]] = rbsp_ephm[values[j]]
for i in (np.arange(len(filenames_a)-1)+1):
    rbsp_ephm = h5py.File(filenames_a[i])
    for j in range(len(values)):
        new_data_dic[values[j]].append(rbsp_ephm[values[j]])
This works fine if I only have one file, but if I have two or more it seems to close the key? I'm not sure if this is exactly what is happening, but when I ask what new_data_dic is, it gives {'Bfs_geo_a': <Closed HDF5 dataset>, ...}, which will not write to a text file. I've tried closing the HDF5 file before opening the next (rbsp_ephm.close()), but I get the same error.
Thanks for any and all help!
I don't really understand your problem... are you trying to create a list of hdf5 datasets?
Or did you just forget the [()] to access the values in the dataset itself?
Here is a simple standalone example that works just fine:
import h5py

# File creation
filenames_a = []
values = ['values/toto', 'values/tata', 'values/tutu']
nb_file = 5
tmp = 0
for i in range(nb_file):
    fname = 'file%s.h5' % i
    filenames_a.append(fname)
    file = h5py.File(fname, 'w')
    grp = file.create_group('values')
    for value in values:
        file[value] = tmp
        tmp += 1
    file.close()
# the thing you want
new_data_dict = {value: [] for value in values}
for fname in filenames_a:
    rbsp_ephm = h5py.File(fname, 'r')
    for value in values:
        new_data_dict[value].append(rbsp_ephm[value][()])

print new_data_dict
It returns:
{'values/tutu': [2, 5, 8, 11, 14], 'values/toto': [0, 3, 6, 9, 12], 'values/tata': [1, 4, 7, 10, 13]}
Does it answer your question?
This may not be exactly the right solution, but you could try extracting the data as numpy arrays, which are a more flexible format than the h5py dataset one. See below how to do it:
>>> print type(file['Average/u'])
<class 'h5py.highlevel.Dataset'>
>>> print type(file['Average/u'][:])
<type 'numpy.ndarray'>
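Note that because [:] (like [()]) copies the data into a plain numpy object, the values stay usable after the file is closed. A minimal sketch, reusing the file names from the example above:

import h5py

with h5py.File('file0.h5', 'r') as f:     # one of the files created above
    toto = f['values/toto'][()]           # copied into memory, not a live dataset
# the file is closed here, but toto is a plain value and still prints fine
print toto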
And just in case, you should try to use a more "pythonic" way for your loop, that is:
for j in values:
    new_data_dic[j] = rbsp_ephm[j]
instead of:
for j in range(len(values)):
    new_data_dic[values[j]] = rbsp_ephm[values[j]]
I am trying to parse data from several *.csv files and save them as lists for later manipulation, but I keep failing.
I have read numerous tutorials and related topics on SO and other sites, but couldn't find a solution to my problem. After several days of working on the code, I am stuck and don't know how to proceed.
# saves filepaths of *.csv files in lists (constant)
CSV_OLDFILE = glob.glob("./oldcsv/*.csv")
assert isinstance(CSV_OLDFILE, list)
CSV_NEWFILE = glob.glob("./newcsv/*.csv")
assert isinstance(CSV_NEWFILE, list)

def get_data(input):
    """copies numbers from *.csv files, saves them in list RAW_NUMBERS"""
    for i in range(0, 5):  # for each of the six files
        with open(input[i], 'r') as input[i]:  # open as "read"
            for line in input[i]:  # parse lines for data
                input.append(int(line))  # add to list
    return input

def write_data(input):
    """writes list PROCESSED_NUMBERS_FINAL into new *.csv files"""
    for i in range(0, 5):  # for each of the six files
        with open(input[i], 'w') as data:  # open as "write"
            data = csv.writer(input[i])
    return data

RAW_NUMBERS = get_data(CSV_OLDFILE)
# other steps for processing data
write_data(PROCESSED_NUMBERS_FINAL)
Actual result:
TypeError: object of type '_io.TextIOWrapper' has no len()
Expected result: save data from *.csv files, manipulate and write them to new *.csv files.
I think the problem is probably that I'm calling len() on a file object, but I don't know what the correct implementation should look like.
Complete traceback:
Traceback (most recent call last):
  File "./solution.py", line 100, in <module>
    PROCESSED_NUMBERS = slowsort_start(RAW_NUMBERS)
  File "./solution.py", line 73, in slowsort_start
    (input[i], 0, len(input[i])-1))
TypeError: object of type '_io.TextIOWrapper' has no len()
Question: read data from *.csv files, manipulate the numbers and write them to new *.csv files.
An OOP solution that holds the numbers in a dict of dict:list.
Initialize the class object with the in_path and out_path.
import os, csv

class ReadProcessWrite:
    def __init__(self, in_path, out_path):
        self.in_path = in_path
        self.out_path = out_path
        self.number = {}
Read all files from self.in_path, filtering for .csv files.
Create a dict with key ['raw'] and assign all numbers from this *.csv to a list.
Note: assuming one number per line!
    def read_numbers(self):
        for fname in os.listdir(self.in_path):
            if fname.endswith('.csv'):
                self.number[fname] = {}
                with open(os.path.join(self.in_path, fname)) as in_csv:
                    self.number[fname]['raw'] = [int(number[0]) for number in csv.reader(in_csv)]
                    print('read_numbers {} {}'.format(fname, self.number[fname]['raw']))
        return self
Process the ['raw'] numbers and assign the result to the key ['final'].
    def process_numbers(self):
        def process(numbers):
            return [n*10 for n in numbers]
        for fname in self.number:
            print('process_numbers {} {}'.format(fname, self.number[fname]['raw']))
            # other steps for processing data
            self.number[fname]['final'] = process(self.number[fname]['raw'])
        return self
Write the results from key ['final'] to self.out_path, using the same .csv filenames.
    def write_numbers(self):
        for fname in self.number:
            print('write_numbers {} {}'.format(fname, self.number[fname]['final']))
            with open(os.path.join(self.out_path, fname), 'w') as out_csv:
                csv.writer(out_csv).writerows([[row] for row in self.number[fname]['final']])
Usage:
if __name__ == "__main__":
    ReadProcessWrite('oldcsv', 'newcsv').read_numbers().process_numbers().write_numbers()
Output:
read_numbers 001.csv [1, 2, 3]
read_numbers 003.csv [7, 8, 9]
read_numbers 002.csv [4, 5, 6]
process_numbers 003.csv [7, 8, 9]
process_numbers 002.csv [4, 5, 6]
process_numbers 001.csv [1, 2, 3]
write_numbers 003.csv [70, 80, 90]
write_numbers 002.csv [40, 50, 60]
write_numbers 001.csv [10, 20, 30]
Tested with Python: 3.4.2
So this is the solution I found, after lots of trial-and-error and research:
# initializing lists for later use
RAW_DATA = []        # unsorted numbers
SORTED_DATA = []     # sorted numbers
PROCESSED_DATA = []  # sorted and multiplied numbers

def read_data(filepath):  # from oldfiles
    """returns parsed unprocessed numbers from old *.csv files"""
    numbers = open(filepath, "r").read().splitlines()  # reads, gets input from rows
    return numbers

def get_data(filepath):  # from oldfiles
    """fills list raw_data with parsed input from old *.csv files"""
    for i in range(0, 6):  # for each of the six files
        RAW_DATA.append(read_data(filepath[i]))  # add their data to list

def write_data(filepath):  # parameter: newfile
    """create new *.csv files with input from sorted_data and permission 600"""
    for i in range(0, 6):  # for each of the six files
        with open(filepath[i], "w", newline="\n") as file:  # open with "write"
            writer = csv.writer(file)  # calls method for writing
            for item in SORTED_DATA[i]:  # prevents data from being treated as one object
                writer.writerow([item])  # puts each entry in row
        os.chmod(filepath[i], 0o600)  # sets permission to 600 (octal)
This lets me read from files, as well as create and write to files. Given that I need a specific setup, with data only ever being found in "column A", I chose this solution. But thanks again to everybody who answered and commented!
I am making something like a random sentence generator. I want to make a random sentence from words taken randomly from 10 .csv files. The files change size frequently, so I have to count their size before I select a random line. I have already made this, but I'm using way too much code... it currently does something like this:
def file_len(fname):
    f = open(fname)
    try:
        for i, l in enumerate(f):
            pass
    finally:
        f.close()
    return i + 1
then selecting random lines...
for all of them I do this:
file1_size = file_len('items/file1.csv')
file1_line = randrange(file1_size) + 1
file1_out = linecache.getline('items/file1.csv', file1_line)
and when it is done for all of them, I just print the output...
print file1_out + file2_out + file3_out ...
Also, sometimes I only want to use some files and not others, so I'd just print the ones I want... e.g. if I just want files number 3, 4 and 7 then I do:
print file3_out + file4_out + file7_out
Obviously that's 30 lines just for counting lines, selecting a random one and assigning it to a variable - 3 lines of code for each file. But things are getting more complex, and I thought a dictionary variable might be able to do what I want more quickly and with less code.
I thought it would be good to generate a variable whereby we end up with
random_lines = {'file1.csv': 24, 'file2.csv': 13, 'file3.csv': 3, 'file4.csv': 22, 'file5.csv': 8, 'file6.csv': 97, 'file7.csv': 34, 'file8.csv': 67, 'file9.csv': 31, 'file10.csv': 86}
(The key is the filename and the integer is a random line within the file, re-assigned each time the code is run)
Then, some kind of process that picks the required items (let's say sometimes we only want to use lines from files 1, 6, 8, and 10) and outputs the random line
output = file1.csv random line + file6.csv random line + file8.csv random line + file10.csv random line
then
print output
If anyone can see the obvious quick way to do this (I don't think it's rocket science but I am a beginner at python!) then I'd be very grateful!
Any time you find yourself reusing the same code over and over in an object-oriented language, use a function instead.
def GetOutput(file_name):
    file_size = file_len(file_name)
    file_line = randrange(file_size) + 1
    return linecache.getline(file_name, file_line)

file_1 = GetOutput('file_1.txt')
file_2 = GetOutput('file_2.txt')
...
You can further simplify by storing everything in a dict as you suggest in your original question.
input_files = ['file_1.txt', 'file_2.txt', ...]
outputs = {}
for file_name in input_files:
    outputs[file_name] = GetOutput(file_name)
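Then, to use only some of the files as in your question, just pick those keys out of the dict and join them - a small sketch (the file names are the placeholders from above):

wanted = ['file_1.txt', 'file_6.txt', 'file_8.txt', 'file_10.txt']
output = ' '.join(outputs[name].strip() for name in wanted)  # strip the trailing newlines
print output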
I'm trying to create different objects (using classes and objects) and save them in a file to edit or retrieve them later. However, this is how it looks:
GlobalCategories = []
GlobalContent = []

def LoadData(x, y):
    import pickle
    with open('bin.dat') as f:
        x, y = pickle.load(f)

def SaveData(x, y):
    import pickle
    with open('bin.dat', 'wb') as f:
        pickle.dump([x, y], f)

def Loader(x, y):
    try:
        LoadData(x, y)
    except:
        SaveData(x, y)
and this is the snippet that shows how I save the info to the lists (Tema is the class and the other stuff are the methods of that class):
newtheme = Tema()
newtheme.setInfo_name(newstr)
newtheme.setInfo_code(newcode)
GlobalCategories.append(newtheme)
SaveData(GlobalContent, GlobalCategories)
x and y are global lists where I store the objects. (I have noticed that it saves the memory address of each object.)
When I first run it, it creates the file and saves the information to the file. However, if I close it, try to run it again and load the info, the program erases the information and creates the file again, so anything that was stored is gone.
I don't know if this is a proper way to store objects or if there's a better way, so any advice is very welcome.
@abarnert: Thank you, abarnert! What I want to do is save a list with two lists inside. For example, one list is going to save the make (Toyota, Nissan, etc.) and the other list the car model (Tundra, Murano). Each element is an object which I add to a list when created.
newtheme = Theme()
newtheme.setInfo_name(newstr)
GlobalCategories.append(newtheme)
This is how I save the object in the global list. GlobalCategories is one of the two lists I want to load later after I have closed the program (it would be like the list of car companies from the example). Now, where I have the problem is loading the objects from the lists after I have closed and restarted the program, because I am able to retrieve and edit them from the list as long as I have not closed the shell.
I need to load and store the makes and the car objects in their respective lists once I start the program, so I can manipulate them later.
Thank you abarnert once again!
It's hard to know what the problem is without context of how you are trying to use your LoadData and SaveData functions. However, here is a little demo that does what I think you want.
import pickle
import random

def load_data():
    try:
        with open("bin.dat", "rb") as f:  # binary mode, to match the "wb" used for writing
            x, y = pickle.load(f)
    except:
        x, y = [], []
    return x, y

def save_data(data):
    with open("bin.dat", "wb") as f:
        pickle.dump(data, f)

if __name__ == "__main__":
    x, y = load_data()
    print x, y
    x.append(random.randint(1, 10))
    y.append(random.randint(1, 10))
    save_data([x, y])
OUTPUT FROM CONSECUTIVE RUNS
[] []
[9] [9]
[9, 10] [9, 9]
[9, 10, 2] [9, 9, 4]
[9, 10, 2, 5] [9, 9, 4, 1]
[9, 10, 2, 5, 6] [9, 9, 4, 1, 9]
[9, 10, 2, 5, 6, 10] [9, 9, 4, 1, 9, 1]
It's hard to be sure, but I'm guessing your problem is that you're writing a binary file, then trying to read it back as text, and you're using Python 2.x on Windows.
In this code:
def LoadData(x,y):
    import pickle
    with open('bin.dat') as f:
        x,y = pickle.load(f)
If you happened to have any LF newline characters in the binary pickle stream, opening the file as text will convert them to CR/LF pairs. This will cause the pickle to be invalid, and therefore it'll raise an exception.
In this code:
def Loader(x,y):
    try:
        LoadData(x,y)
    except:
        SaveData(x,y)
… you just swallow any exception and save some empty values.
You probably only want to handle file-not-found errors here (IOError, OSError, or FileNotFoundError, depending on your Python version).
But you definitely want to put the exception into a variable to help debug your problem, like this:
def Loader(x,y):
    try:
        LoadData(x,y)
    except Exception as e:
        SaveData(x,y)
You can put a breakpoint on the SaveData line in the debugger, or just add a print(e) line and watch the output, to see why you're getting there.
Meanwhile, even after you fix that, LoadData will never do anything useful. Assigning x,y = pickle.load(f) just rebinds the local variables x and y. The fact that they have the same names as the local variables in Loader doesn't mean that Loader's variables get changed. Neither does the fact that they used to refer to the same values.
Python doesn't have "reference variables" or "output parameters". The normal way to do this is to just return values you want to pass back to the caller:
def LoadData():
    import pickle
    with open('bin.dat') as f:
        x,y = pickle.load(f)
    return x,y
And of course Loader has to call it properly:
def Loader(x,y):
    try:
        x,y = LoadData()
    except:
        SaveData(x,y)
And you have the exact same problem again in Loader, so you need to fix it again there, and in its caller.
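Putting those fixes together, a minimal sketch of how the whole chain might look (the binary read mode and the narrower except clause are my assumptions, beyond the original code):

import pickle

def LoadData():
    with open('bin.dat', 'rb') as f:   # binary mode, matching the 'wb' used when saving
        x, y = pickle.load(f)
    return x, y

def SaveData(x, y):
    with open('bin.dat', 'wb') as f:
        pickle.dump([x, y], f)

def Loader(x, y):
    try:
        x, y = LoadData()
    except (IOError, OSError):         # only swallow file-access errors
        SaveData(x, y)
    return x, y

# the caller must rebind its own names from the return value:
GlobalContent, GlobalCategories = Loader(GlobalContent, GlobalCategories)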
I want to import several coordinates (there could be up to 20,000) from a text file.
These coordinates need to be added into a list, looking like the following:
coords = [[0,0],[1,0],[2,0],[0,1],[1,1],[2,1],[0,2],[1,2],[2,2]]
However, when I want to import the coordinates I get the following error:
invalid literal for int() with base 10
I can't figure out how to import the coordinates correctly.
Does anyone have any suggestions as to why this does not work?
I think there's some problem with creating the integers.
I use the following script:
Bronbestand = open("D:\\Documents\\SkyDrive\\afstuderen\\99 EEM - Abaqus 6.11.2\\scripting\\testuitlezen4.txt", "r")

headerLine = Bronbestand.readline()
valueList = headerLine.split(",")

xValueIndex = valueList.index("x")
#xValueIndex = int(xValueIndex)
yValueIndex = valueList.index("y")
#yValueIndex = int(yValueIndex)

coordList = []
for line in Bronbestand.readlines():
    segmentedLine = line.split(",")
    coordList.extend([segmentedLine[xValueIndex], segmentedLine[yValueIndex]])

coordList = [x.strip(' ') for x in coordList]
coordList = [x.strip('\n') for x in coordList]

coordList2 = []
#CoordList3 = [map(int, x) for x in coordList]
for i in coordList:
    coordList2 = [coordList[int(i)], coordList[int(i)]]

print "coordList = ", coordList
print "coordList2 = ", coordList2
#print "coordList3 = ", coordList3
The coordinates to be imported look like this (this is "Bronbestand" in the script):
id,x,y,
1, -1.24344945, 4.84291601
2, -2.40876842, 4.38153362
3, -3.42273545, 3.6448431
4, -4.22163963, 2.67913389
5, -4.7552824, 1.54508495
6, -4.99013376, -0.313952595
7, -4.7552824, -1.54508495
8, -4.22163963, -2.67913389
9, -3.42273545, -3.6448431
Thus the script should result in:
[[-1.24344945, 4.84291601],[-2.40876842, 4.38153362],[-3.42273545, 3.6448431],[-4.22163963, 2.67913389],[-4.7552824, 1.54508495],[-4.99013376,-0.313952595],[-4.7552824, -1.54508495],[-4.22163963, -2.67913389],[-3.42273545, -3.6448431]]
I also tried importing the coordinates with the native python csv parser but this didn't work either.
Thank you all in advance for the help!
Your numbers are not integers, so the conversion to int fails.
Try using float(i) instead of int(i) to convert to floating point numbers.
>>> int('1.5')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
int('1.5')
ValueError: invalid literal for int() with base 10: '1.5'
>>> float('1.5')
1.5
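Applied to your script, that means converting each pair with float() and building the coordinate pairs directly, instead of the coordList2 loop - a minimal sketch using your variable names:

coordList = []
for line in Bronbestand.readlines():
    segmentedLine = line.split(",")
    x = float(segmentedLine[xValueIndex])   # float() tolerates surrounding spaces and newlines
    y = float(segmentedLine[yValueIndex])
    coordList.append([x, y])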
Other answers have said why your script fails, however, there is another issue here - you are massively reinventing the wheel.
This whole thing can be done in a couple of lines using the csv module and a list comprehension:
import csv

with open("test.csv") as file:
    data = csv.reader(file)
    next(data)
    print([[float(x) for x in line[1:]] for line in data])
Gives us:
[[-1.24344945, 4.84291601], [-2.40876842, 4.38153362], [-3.42273545, 3.6448431], [-4.22163963, 2.67913389], [-4.7552824, 1.54508495], [-4.99013376, -0.313952595], [-4.7552824, -1.54508495], [-4.22163963, -2.67913389], [-3.42273545, -3.6448431]]
We open the file, make a csv.reader() to parse the csv file, skip the header row, then make a list of the numbers parsed as floats, ignoring the first column.
As pointed out in the comments, as you are dealing with a lot of data, you may wish to iterate over the data lazily. While making a list is good to test the output, in general, you probably want a generator rather than a list. E.g:
([float(x) for x in line[1:]] for line in data)
Note that the file will need to remain open while you utilize this generator (remain inside the with block).
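For example, a minimal sketch that consumes the coordinates lazily, one pair at a time:

import csv

with open("test.csv") as f:
    data = csv.reader(f)
    next(data)                                            # skip the header row
    coords = ([float(x) for x in line[1:]] for line in data)
    for x, y in coords:                                   # the file must stay open here
        print(x, y)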
I'm trying to write a Python script that sends a query to the TweetSentiments.com API.
The idea is that it will perform like this:
reads CSV tweet file > constructs query > interrogates API > formats JSON response > writes to CSV file.
So far I’ve come up with this –
import csv
import urllib
import os

count = 0
TweetList = []  ## Creates empty list to store tweets.

TweetWriter = csv.writer(open('test.csv', 'w'), dialect='excel', delimiter=' ', quotechar='|')
TweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))

for rows in TweetReader:
    TweetList.append(rows)
#print TweetList[0]

for rows in TweetList:
    data = urllib.urlencode(TweetList[rows])
    connect = httplib.HTTPConnection("http://data.tweetsentiments.com:8080/api/analyze.json?q=")
    connect.result = json.load(urllib.request("POST", "", data))
    TweetWriter.write(result)
But when it's run I get "line 20, data = urllib.urlencode(TweetList[rows]) TypeError: list indices must be integers, not list".
I know my list TweetList is storing the tweets just as I'd like, but I don't think I'm using urllib.urlencode correctly. The API requires that queries are sent like:
http://data.tweetsentiments.com:8080/api/analyze.json?q= (text to analyze)
So the idea was that urllib.urlencode would simply add the tweets to the end of the address to allow a query.
The last four lines of code have become a mess after looking at so many examples. Your help would be much appreciated.
I'm not 100% sure what it is you're trying to do since I don't know the format of the files you are reading, but this part looks suspicious:
for rows in TweetList:
    data = urllib.urlencode(TweetList[rows])
since TweetList is a list, the for loop puts into rows a single value from the list on each iteration, so this, for example:
list = [1, 2, 3, 4]
for num in list:
    print num
will print 1 2 3 4. But this:
list = [1, 2, 3, 4]
for num in list:
    print list[num]
will end up with this error: IndexError: list index out of range.
Can you please elaborate a bit more about the format of the files you are reading?
Edit
If I understand you correctly, you need something like this:
tweets = []
tweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))
for row in tweetReader:
    tweets.append({ 'tweet': row[0], 'date': row[1] })

for row in tweets:
    data = urllib.urlencode(row)
    .....
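From there, since the API expects the text in the q parameter of the URL, a sketch of how the rest might look (untested against the real service; urllib2 is assumed for making the request):

import csv
import json
import urllib
import urllib2

API = "http://data.tweetsentiments.com:8080/api/analyze.json"

tweets = []
tweetReader = csv.reader(open("C:\StoredTweets.csv", "r"))
for row in tweetReader:
    tweets.append({'tweet': row[0], 'date': row[1]})

results = []
for row in tweets:
    url = API + "?" + urllib.urlencode({'q': row['tweet']})  # builds ...analyze.json?q=<text to analyze>
    results.append(json.load(urllib2.urlopen(url)))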