I am trying to load the contents of a txt file in a Python script in order to use it for training my SVM model.
I would like to load the data as they appear in my txt file:
[ 0.02713807 0.01802697 0.01690036 0.01501216 0.01466412 0.01638859 0.0210163 0.02658022 0.03664452 0.05064286 0.06027664 0.06431134 0.04303673 0.03247764 0.02293602 0.01847688 0.0174582 0.01860664 0.02576164 0.02296149 0.0582211 0.37246149]
[ 0.03623561 0.05211099 0.02469929 0.0134991 0.01029103 0.00880611 0.00898548 0.00870684 0.0117465 0.01962223 0.03895351 0.01956952 0.00972828 0.00704872 0.00656471 0.00689743 0.00854528 0.01128713 0.02119957 0.05047751 0.05028719 0.57473797]
And the code that I am using is the one below:
data = []
with open('data2.txt') as f:
    for y in f:
        data.append(float(y.strip()))
print(data)
When I run my script I get this error:
ValueError: could not convert string to float: '[ 0.02713807 0.01802697 0.01690036 0.01501216 0.01466412 0.01638859'
How should I solve this? Any advice, please?
Use a regular expression to retrieve the numbers from each line:

import re

data = []
with open('file.txt') as f:
    for line in f:
        numbers = re.search(r'\[\s*(.*\d)\s*\]', line).group(1)
        data.append(list(map(float, numbers.split())))
print(data)
Output:
[[0.02713807, 0.01802697, 0.01690036, 0.01501216, 0.01466412, 0.01638859, 0.0210163, 0.02658022, 0.03664452, 0.05064286, 0.06027664, 0.06431134, 0.04303673, 0.03247764, 0.02293602, 0.01847688, 0.0174582, 0.01860664, 0.02576164, 0.02296149, 0.0582211, 0.37246149], [0.03623561, 0.05211099, 0.02469929, 0.0134991, 0.01029103, 0.00880611, 0.00898548, 0.00870684, 0.0117465, 0.01962223, 0.03895351, 0.01956952, 0.00972828, 0.00704872, 0.00656471, 0.00689743, 0.00854528, 0.01128713, 0.02119957, 0.05047751, 0.05028719, 0.57473797]]
If you generated this file with np.savetxt, the obvious way to load it is np.loadtxt.
More generally, you should never just save stuff to a file in "whatever format, I don't care" and then beat your head against the wall trying to figure out how to parse that format. Save stuff in a format you know how to load. Use np.savetxt and you can load it with np.loadtxt; np.save and np.load; json.dump and json.load; pickle.dump and pickle.load; csv.writer and csv.reader… they come in matched pairs for a reason. (What about formats like just appending str(row) to a file? There is no function that reads that. So the answer is: don't do that.)
And then, the whole problem of "how do I parse something that looks like the repr of a list of floats but with the commas all removed" never arises in the first place.
I'm not sure how you can actually get output that looks like that out of savetxt. By default, if you write a 2D array to a file, you get columns separated by a single space, not blocks of columns with extra space between the blocks, and you don't get brackets around each row. There are a zillion arguments to control the format in different ways, but I have no idea what combination of arguments would give you that format.
But presumably, you know what you called. So you can pass the equivalent arguments to loadtxt.
Or, ideally, make things simpler: change your code to just call savetxt with the default arguments in the first place.
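As a sketch of that default round trip (the array here is a stand-in for the real feature rows, not the asker's data):

import numpy as np

data = np.random.rand(2, 22)      # stand-in for the real feature rows
np.savetxt('data2.txt', data)     # one whitespace-separated row per line, no brackets
loaded = np.loadtxt('data2.txt')  # the matched loader parses it straight back
print(loaded.shape)               # (2, 22)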
Split the line, replace the dot with an empty string, and check if what remains is a number:
data = []
with open('./data.txt') as f:
    for l in f:
        data.append([y for y in l.split() if y.replace('.', '', 1).isdigit()])
print(data)
Output:
[['0.02713807', '0.01802697', '0.01690036', '0.01501216', '0.01466412'], ['0.03623561', '0.05211099', '0.02469929', '0.0134991', '0.01029103']]
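Note that this returns the numbers as strings; if you need floats (e.g. for the SVM), a hedged tweak is to convert after filtering:

data.append([float(y) for y in l.split() if y.replace('.', '', 1).isdigit()])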
f = open('my_file.txt', 'r+')
my_file_data = f.read()
f.close()
The above code opens 'my_file.txt' in read mode then stores the data it reads from my_file.txt in my_file_data and closes the file. The read function reads the whole file at once. You can use the following to read the file line by line and store it in a list:
f = open('my_file.txt', 'r+')
lines = [line for line in f.readlines()]
f.close()
How about this:
data = []
with open('data2.txt') as f:
    for l in f:
        data.append(list(map(float, l[1:-2].split())))
print(data)
Output:
[[0.02713807, 0.01802697, 0.01690036, 0.01501216, 0.01466412, 0.01638859, 0.0210163, 0.02658022, 0.03664452, 0.05064286, 0.06027664, 0.06431134, 0.04303673, 0.03247764, 0.02293602, 0.01847688, 0.0174582, 0.01860664, 0.02576164, 0.02296149, 0.0582211, 0.37246149], [0.03623561, 0.05211099, 0.02469929, 0.0134991, 0.01029103, 0.00880611, 0.00898548, 0.00870684, 0.0117465, 0.01962223, 0.03895351, 0.01956952, 0.00972828, 0.00704872, 0.00656471, 0.00689743, 0.00854528, 0.01128713, 0.02119957, 0.05047751, 0.05028719, 0.5747379]]
If you want a list of NumPy arrays, do:
import numpy as np

data = []
with open('data2.txt') as f:
    for l in f:
        data.append(np.array(list(map(float, l[1:-2].split()))))
print(data)
Output:
[array([ 0.02713807, 0.01802697, 0.01690036, 0.01501216, 0.01466412,
0.01638859, 0.0210163 , 0.02658022, 0.03664452, 0.05064286,
0.06027664, 0.06431134, 0.04303673, 0.03247764, 0.02293602,
0.01847688, 0.0174582 , 0.01860664, 0.02576164, 0.02296149,
0.0582211 , 0.37246149]), array([ 0.03623561, 0.05211099, 0.02469929, 0.0134991 , 0.01029103,
0.00880611, 0.00898548, 0.00870684, 0.0117465 , 0.01962223,
0.03895351, 0.01956952, 0.00972828, 0.00704872, 0.00656471,
0.00689743, 0.00854528, 0.01128713, 0.02119957, 0.05047751,
0.05028719, 0.5747379 ])]
I am trying to set up a simple data file format, and I am working with these files in Python for analysis. The format basically consists of header information followed by the data. For syntax and future extensibility reasons, I want to use a JSON object for the header information. An example file looks like this:
{
    "name": "my material",
    "sample-id": null,
    "description": "some material",
    "funit": "MHz",
    "filetype": "material_data"
}
18 6.269311533 0.128658208 0.962033017 0.566268827
18.10945274 6.268810641 0.128691962 0.961950095 0.565591807
18.21890547 6.268312637 0.128725463 0.961814928 0.564998228...
If the data length/structure is always the same, this is not hard to parse. However, it raised the question of the most flexible way to parse out the JSON object, given an unknown number of lines, an unknown number of nested curly braces, and potentially more than one JSON object in the file.
If there is only one JSON object in the file, one can use this regular expression:
import re

with open(fname, 'r') as fp:
    fstring = fp.read()
json_string = re.search('{.*}', fstring, flags=re.S).group(0)
However, if there is more than one JSON string, and I want to grab the first one, I need to use something like this:
def grab_json(mystring):
    lbracket = 0
    rbracket = 0
    lbracket_pos = 0
    rbracket_pos = 0
    for i in range(len(mystring)):
        if mystring[i] == '{':
            lbracket = 1
            lbracket_pos = i
            break
    for i in range(lbracket_pos + 1, len(mystring)):
        if mystring[i] == '}':
            rbracket += 1
            if rbracket == lbracket:
                rbracket_pos = i
                break
        elif mystring[i] == '{':
            lbracket += 1
    json_string = mystring[lbracket_pos : rbracket_pos + 1]
    return json_string, lbracket_pos, rbracket_pos

json_string, beg_pos, end_pos = grab_json(fstring)
I guess the question as always: is there a better way to do this? Better meaning simpler code, more flexible code, more robust code, or really anything?
The easiest solution, as Klaus suggested, is just to use JSON for the entire file. That makes your life much simpler because then writing is just json.dump and reading is just json.load.
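A hedged sketch of that whole-file JSON approach (the key names here are just examples, not the questioner's schema):

import json

record = {'name': 'my material', 'funit': 'MHz',
          'data': [[18.0, 6.269311533], [18.10945274, 6.268810641]]}

with open('sample.json', 'w') as fd:
    json.dump(record, fd)       # writing is one call

with open('sample.json') as fd:
    record = json.load(fd)      # and so is reading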
A second solution is to put the metadata in a separate file, which keeps reading and writing simple at the expense of multiple files for each data set.
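A minimal sketch of that variant (the file names are assumptions):

import json
import numpy as np

with open('sample.header.json') as fd:
    metadata = json.load(fd)              # the JSON header lives in its own file
data = np.loadtxt('sample.data.txt')      # the numeric block, whitespace-separated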
A third solution would be, when writing the file to disk, to prepend the length of the JSON data. So writing might look something like:
metadata_json = json.dumps(metadata)
myfile.write('%d\n' % len(metadata_json))
myfile.write(metadata_json)
myfile.write(data)
Then reading looks like:
with open('myfile') as fd:
    header_len = fd.readline()               # first line holds the JSON length
    metadata_json = fd.read(int(header_len))
    metadata = json.loads(metadata_json)
    data = fd.read()
A fourth option is to adopt an existing storage format (maybe HDF5?) that already has the features you are looking for in terms of storing both data and metadata in the same file.
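For instance, with HDF5 via h5py, a sketch could look like this (dataset and attribute names are assumptions mirroring the example header):

import h5py
import numpy as np

data = np.loadtxt('sample.data.txt')           # the numeric block
with h5py.File('sample.h5', 'w') as f:
    dset = f.create_dataset('data', data=data)
    dset.attrs['name'] = 'my material'         # header fields become attributes
    dset.attrs['funit'] = 'MHz'

with h5py.File('sample.h5', 'r') as f:
    print(f['data'].attrs['name'], f['data'].shape)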
I would store the headers separately. That gives you the possibility of using the same header file for multiple data files.
Alternatively, you may want to take a look at the Apache Parquet format, especially if you want to process your data on distributed clusters using Spark.
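A minimal Parquet sketch with pyarrow (the column names are assumptions):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'freq': [18.0, 18.10945274, 18.21890547],
                  'value': [6.269311533, 6.268810641, 6.268312637]})
pq.write_table(table, 'material.parquet')
back = pq.read_table('material.parquet')   # columns and types round-trip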
I am trying to put data from a text file into an array. Below is the array I am trying to create:
[("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)]
But instead, when I use the text file and load the data from it, I get the output below. It should output as above; I realise I have to split it, but I don't really know how for this sort of array. Could anyone help me with this?
['("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)']
Below is the text file I am trying to load:
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
And this is how I'm loading it:
file = open("slide.txt", "r")
scale = [file.readline()]
If you mean a list instead of an array:
with open(filename) as f:
    list_name = f.readlines()
Some questions come to mind about what the rest of your implementation looks like and how you think it will all work, but below is an example of how this could be done in a pretty straightforward way:
class W(object):
    pass

class S(object):
    pass

class WS(W, S):
    pass

class R(object):
    pass

def main():
    # separate parts that should become tuples eventually
    text = str()
    with open("data", "r") as fh:
        text = fh.read()
    parts = text.split("),")

    # remove unwanted characters and whitespace
    cleaned = list()
    for part in parts:
        part = part.replace('(', '')
        part = part.replace(')', '')
        cleaned.append(part.strip())

    # convert text parts into tuples with actual data types
    list_of_tuples = list()
    for part in cleaned:
        t = construct_tuple(part)
        list_of_tuples.append(t)

    # now use the data for something
    print list_of_tuples

def construct_tuple(data):
    t = tuple()
    content = data.split(',')
    for item in content:
        t = t + (get_type(item),)
    return t

# there needs to be some way to decide what type/object should be used:
def get_type(id):
    type_mapping = {
        '"harmonic minor"': 'harmonic minor',
        '"major"': 'major',
        '"relative minor"': 'relative minor',
        's': S(),
        'w': W(),
        'w+s': WS(),
        'r': R()
    }
    return type_mapping.get(id)

if __name__ == "__main__":
    main()
This code makes some assumptions:
there is a file data with the content:
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
you want a list of tuples which contains the values.
It's acceptable to have w+s represented by some data type, as it would be difficult to have something like w+s appear inside a tuple without it being evaluated when the tuple is created. Another way to do it would be to have w and s represented by data types that can be used with +.
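For that last alternative, here is a minimal sketch of step objects that support + (the semitone counts are an assumption, not from the question):

class Step(object):
    def __init__(self, semitones):
        self.semitones = semitones
    def __add__(self, other):
        return Step(self.semitones + other.semitones)

w = Step(2)   # whole step
s = Step(1)   # half step
ws = w + s    # combines without any special WS class
print ws.semitones   # 3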
So even if this works, it might be a good idea to think about the format of the text file (if you have control of that) and see whether it could be changed into something that lets you use a parsing library in a simple way, e.g. see how it could more easily be represented as CSV, or even turn it into JSON.
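As a sketch of the JSON option (encoding the steps as plain strings is an assumption), scales.json could be:

{"major": ["r", "w", "w", "s", "w", "w", "w", "s"],
 "relative minor": ["r", "w", "s", "w", "w", "s", "w", "w"],
 "harmonic minor": ["r", "w", "s", "w", "w", "s", "w+s", "s"]}

and loading it needs no custom parsing at all:

import json

with open('scales.json') as fh:
    scales = json.load(fh)
print scales['major']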
I’m trying to split downloaded data into a 2D array with different data types. The downloaded data looks like this:
000|17:40
000|17:45
010|17:50
025|17:55
056|18:00
178|18:05
202|18:10
203|18:15
190|18:20
072|18:25
013|18:30
002|18:35
000|18:40
000|18:45
000|18:50
000|18:55
000|19:00
000|19:05
000|19:10
000|19:15
000|19:20
000|19:25
000|19:30
000|19:35
000|19:40
I’m using the following code to parse this into a two dimensional array:
#!/usr/bin/python
import urllib2

response = urllib2.urlopen('http://gps.buienradar.nl/getrr.php?lat=52&lon=4')
html = response.read()
htmlsplit = []
for record in html.split("\r\n"):
    htmlsplit.append(record.split("|"))
print htmlsplit
This is working great, but as expected, it treats everything as a string. I’ve found some examples that split into integers. That’s great if both sides were integers, but in my case it’s an integer | string (or maybe some kind of Python time format).
How can I split this directly into different data types?
Something like this?
for record in html.split("\r\n"):  # beware, newlines are treacherous!
    s = record.split("|")
    htmlsplit.append((int(s[0]), s[1]))
Just write a parser for each record if your data is this simple. However, I would add a try/except clause to catch errors from non-conforming lines, empty lines, etc., which may be present in the data; the code above is very fragile. Also, you might want to break only at \n and then clean your strings with strip() (i.e. replace s[1] by s[1].strip()); the integer conversion takes care of that automatically.
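A hedged version with that error handling might look like this:

htmlsplit = []
for record in html.split("\n"):
    s = record.split("|")
    try:
        htmlsplit.append((int(s[0].strip()), s[1].strip()))
    except (ValueError, IndexError):
        pass  # skip empty or malformed lines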
Use str.splitlines instead of splitting on \r\n
Use the csv module to iterate over the lines:
import csv
txt = '000|17:40\n000|17:45\n000|17:50\n000|17:55\n000|18:00\n000|18:05\n000|18:10\n000|18:15\n000|18:20\n000|18:25\n000|18:30\n000|18:35\n000|18:40\n000|18:45\n000|18:50\n000|18:55\n000|19:00\n000|19:05\n000|19:10\n000|19:15\n000|19:20\n000|19:25\n000|19:30\n000|19:35\n000|19:40\n'
reader = csv.reader(txt.splitlines(), delimiter='|')
column1 = []
column2 = []
for c1, c2 in reader:
    column1.append(c1)
    column2.append(c2)
You can also use the DictReader
import StringIO

reader2 = csv.DictReader(StringIO.StringIO(txt),
                         fieldnames=['int', 'time'],
                         delimiter='|')
column1 = []
column2 = []
for row in reader2:
    column1.append(row['int'])
    column2.append(row['time'])
So I have a bunch of line of codes like these in a row in my program:
str = str.replace('ten', '10s')
str = str.replace('twy', '20s')
str = str.replace('fy', '40s')
...
I want to make it so that I don't have to manually open my source file to add new cases, for example ('sy', '70'). I know I have to put all these in a function somehow, but I'd like to map cases that are not in my "mapper lib" from the command line. A configuration file, maybe? How?
Thanks!
You could use a config file in JSON format like this:
[
["ten", "10s"],
["twy", "20s"],
["fy", "40s"]
]
Save it as 'replacements.json' and then use it this way:
import json

with open('replacements.json') as i:
    replacements = json.load(i)

text = 'ten, twy, fy'
for r in replacements:
    text = text.replace(r[0], r[1])
Then when you need to change the values just edit the replacements.json file without touching any Python code.
The format for your replacements file could be anything, but JSON is easy to use and edit.
A simple solution could be to put those in a file, read them in your program, and do your replaces in a loop.
There are many ways to do this; if it's a rarely changing thing, you could consider doing it with a Python dict:
mappings = {
    'ten': '10s',
    'twy': '20s',
    'fy': '40s',
}

def replace(str_):
    for s, r in mappings.iteritems():
        str_ = str_.replace(s, r)  # str.replace returns a new string
    return str_
Alternatively, in a text file (make sure you use a safe delimiter that isn't used in any of the keys!):
mappings.txt
ten|10s
twy|20s
fy|40s
And the Python part:
mappings = {}
for line in open('mappings.txt'):
    k, v = line.split('|', 1)
    mappings[k] = v.strip()  # drop the trailing newline
And use the replace from above :)
You could use csv to store the replacements in a human-editable form in a file:
import csv

with open('replacements.csv', 'rb') as f:
    replacements = list(csv.reader(f))

for old, new in replacements:
    your_string = your_string.replace(old, new)
where replacements.csv:
ten,10s
twy,20s
fy,40s
It avoids unnecessary markup such as ", [] in the json format and allows a delimiter (,) to be present in a string itself unlike the plain text format from #WoLpH's answer.