I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the line/readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the
            #    line that includes "Subject:". Ignore the "Subject:" line, stop
            #    the "Full text:" subloop, and add the concatenated full text to
            #    the list array.
            # 2. Continue the for loop within which all of this sits.
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
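To see why that shadowing matters, here's a quick demonstration of the kind of error it produces:

```python
list = []                    # shadows the built-in list class
list.append("Title: something")

try:
    chars = list("abc")      # this calls our empty list, not the class
except TypeError as e:
    print(e)                 # 'list' object is not callable

del list                     # remove the shadow; the built-in is back
assert list("abc") == ["a", "b", "c"]
```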
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to mean catching StopIteration exceptions - though it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
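For completeness, a rough sketch of that iterator approach (using a list of lines to stand in for the file handle, and assuming every "Full text:" block is eventually terminated by a "Subject:" line):

```python
lines = [
    "Title: title of an article\n",
    "Full text: unfortunately the full text of each article,\n",
    "is on numerous lines.\n",
    "Subject: Python\n",
]

titles, texts, subjects = [], [], []
it = iter(lines)              # a real file handle works the same way
try:
    while True:
        line = next(it)
        if line.startswith("Title:"):
            titles.append(line)
        elif line.startswith("Full text:"):
            full_text = line
            line = next(it)   # may raise StopIteration mid-block
            while not line.startswith("Subject:"):
                full_text += line
                line = next(it)
            texts.append(full_text)
            subjects.append(line)
except StopIteration:
    pass                      # reached end of input
```

Note how the StopIteration handling has to account for the file ending mid-article, which is exactly the headache mentioned above.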
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read all file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
I have a script that reads through a log file that contains hundreds of these logs, and looks for the ones that have a "On, Off, or Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
    text = f.read()
words = ["On", "Off", "Switch"]
text2 = text.split('\n')
for l in text.split('\n'):
    if (words[0] in l or words[1] in l or words[2] in l):
        log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before script: everything after the "In" time is useless for what I'm looking for, so I only output the first three indices
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that creates an empty set, or a one-element set containing True, depending on whether one of the words matches.
(Untested)
import re
logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)
with open(logfile, 'r') as f:
    for line in f:
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever to store these strings, e.g., append to a list
                # or insert in a dict
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if so wanted), a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
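As a small sketch of that conversion (assuming the timestamps always carry fractional seconds and a trailing Z, as in your example):

```python
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%f"

outtime = "2020-01-31T00:30:20.150Z"
intime = "2020-01-31T00:30:20.140Z"

# strip the trailing Z before parsing, since %f doesn't consume it
out_dt = datetime.strptime(outtime.rstrip("Zz"), fmt)
in_dt = datetime.strptime(intime.rstrip("Zz"), fmt)

duration = out_dt - in_dt          # a datetime.timedelta
print(duration.total_seconds())    # 0.01
```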
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
Regex is probably the way to go (fastness, efficiency etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
all_text = " ".join(data)
# This is inefficient and will create throwaway intermediate strings - if you
# are in a hurry or operate on 100s of MB of data, this is NOT the way to go.

# Iterate pairs of ("bad thing", "what to replace it with") (or a list of bad things).
for thing in [(": ", ":"), (list('[]{}"'), "")]:
    whatt = thing[0]
    withh = thing[1]
    # if given a list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p, withh)
    else:
        all_text = all_text.replace(whatt, withh)

# The format is now far better suited to splitting/filtering.
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:", "Switch:", "Out:"})]
print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict( part.split(":",1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from your values, as shown in 0 0's post.
I am trying to load the content of a txt file in a Python script in order to use it for training my model in svm.
I would like to load the data as they are on my txt file:
[ 0.02713807 0.01802697 0.01690036 0.01501216 0.01466412 0.01638859 0.0210163 0.02658022 0.03664452 0.05064286 0.06027664 0.06431134 0.04303673 0.03247764 0.02293602 0.01847688 0.0174582 0.01860664 0.02576164 0.02296149 0.0582211 0.37246149]
[ 0.03623561 0.05211099 0.02469929 0.0134991 0.01029103 0.00880611 0.00898548 0.00870684 0.0117465 0.01962223 0.03895351 0.01956952 0.00972828 0.00704872 0.00656471 0.00689743 0.00854528 0.01128713 0.02119957 0.05047751 0.05028719 0.57473797]
And the code that I am using is the one below:
data = []
with open('data2.txt') as f:
    for y in f:
        data.append(float(y.strip()))
print(data)
When I am running my script I am getting this error:
ValueError: could not convert string to float: '[ 0.02713807 0.01802697 0.01690036 0.01501216 0.01466412 0.01638859'
How should I solve this, any advice please?
Use a regular expression to retrieve the numbers from each line:
import re

data = []
with open('file.txt') as f:
    for line in f:
        numbers = re.search(r'\[\s*(.*\d)\s*\]', line).group(1)
        data.append(list(map(float, numbers.split())))
print(data)
Output:
[[0.02713807, 0.01802697, 0.01690036, 0.01501216, 0.01466412, 0.01638859, 0.0210163, 0.02658022, 0.03664452, 0.05064286, 0.06027664, 0.06431134, 0.04303673, 0.03247764, 0.02293602, 0.01847688, 0.0174582, 0.01860664, 0.02576164, 0.02296149, 0.0582211, 0.37246149], [0.03623561, 0.05211099, 0.02469929, 0.0134991, 0.01029103, 0.00880611, 0.00898548, 0.00870684, 0.0117465, 0.01962223, 0.03895351, 0.01956952, 0.00972828, 0.00704872, 0.00656471, 0.00689743, 0.00854528, 0.01128713, 0.02119957, 0.05047751, 0.05028719, 0.57473797]]
If you generated this file with np.savetxt, the obvious way to load it is np.loadtxt.
More generally, you should never just save stuff to a file in "whatever format, I don't care" and then beat your head against the wall trying to figure out how to parse that format. Save stuff in a format you know how to load. Use np.savetxt and you can load it with np.loadtxt; np.save and np.load; json.dump and json.load; pickle.dump and pickle.load; csv.writer and csv.reader… they come in matched pairs for a reason. (What about formats like just appending str(row) to a file? There is no function that reads that. So the answer is: don't do that.)
And then, the whole problem of "how do I parse something that looks like the repr of a list of floats but with the commas all removed" never arises in the first place.
I'm not sure how you can actually get output that looks like that out of savetxt. By default, if you write a 2D array to a file, you get columns separated by a single space, not blocks of columns with extra space between the blocks, and you don't get brackets around each row. There are a zillion arguments to control the format in different ways, but I have no idea what combination of arguments would give you that format.
But presumably, you know what you called. So you can pass the equivalent arguments to loadtxt.
Or, ideally, make things simpler: change your code to just call savetxt with the default arguments in the first place.
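For illustration, a minimal round trip with the default savetxt arguments (writing to a temp file here just to keep the example self-contained):

```python
import numpy as np
import tempfile, os

a = np.array([[0.02713807, 0.01802697],
              [0.03623561, 0.05211099]])

path = os.path.join(tempfile.gettempdir(), 'data2.txt')
np.savetxt(path, a)      # default format: one row per line, space-separated
b = np.loadtxt(path)     # parses straight back into a 2-D array

assert np.allclose(a, b)
```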
split the line
replace the dot with an empty string
check if what remains is a number
data = []
with open('./data.txt') as f:
    for l in f:
        data.append([y for y in l.split() if y.replace('.', '', 1).isdigit()])
print(data)
output
[['0.02713807', '0.01802697', '0.01690036', '0.01501216', '0.01466412'], ['0.03623561', '0.05211099', '0.02469929', '0.0134991', '0.01029103']]
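Note that this approach leaves the values as strings; if you need actual floats (e.g., for svm training), one extra step converts them:

```python
rows = [['0.02713807', '0.01802697'], ['0.03623561', '0.05211099']]

# convert each string in each row to a float
as_floats = [[float(x) for x in row] for row in rows]
print(as_floats)   # [[0.02713807, 0.01802697], [0.03623561, 0.05211099]]
```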
f = open('my_file.txt', 'r+')
my_file_data = f.read()
f.close()
The above code opens 'my_file.txt' in read mode, stores the data it reads from my_file.txt in my_file_data, and closes the file. The read function reads the whole file at once. You can use the following to read the file line by line and store the lines in a list:
f = open('my_file', 'r+')
lines = [line for line in f.readlines()]
f.close()
How about this:
data = []
with open('data2.txt') as f:
    for l in f:
        data.append(list(map(float, l[1:-2].split())))
print(data)
Output:
[[0.02713807, 0.01802697, 0.01690036, 0.01501216, 0.01466412, 0.01638859, 0.0210163, 0.02658022, 0.03664452, 0.05064286, 0.06027664, 0.06431134, 0.04303673, 0.03247764, 0.02293602, 0.01847688, 0.0174582, 0.01860664, 0.02576164, 0.02296149, 0.0582211, 0.37246149], [0.03623561, 0.05211099, 0.02469929, 0.0134991, 0.01029103, 0.00880611, 0.00898548, 0.00870684, 0.0117465, 0.01962223, 0.03895351, 0.01956952, 0.00972828, 0.00704872, 0.00656471, 0.00689743, 0.00854528, 0.01128713, 0.02119957, 0.05047751, 0.05028719, 0.5747379]]
If you want list of NumPy arrays do:
import numpy as np
data = []
with open('data2.txt') as f:
    for l in f:
        data.append(np.array(list(map(float, l[1:-2].split()))))
print(data)
Output:
[array([ 0.02713807, 0.01802697, 0.01690036, 0.01501216, 0.01466412,
0.01638859, 0.0210163 , 0.02658022, 0.03664452, 0.05064286,
0.06027664, 0.06431134, 0.04303673, 0.03247764, 0.02293602,
0.01847688, 0.0174582 , 0.01860664, 0.02576164, 0.02296149,
0.0582211 , 0.37246149]), array([ 0.03623561, 0.05211099, 0.02469929, 0.0134991 , 0.01029103,
0.00880611, 0.00898548, 0.00870684, 0.0117465 , 0.01962223,
0.03895351, 0.01956952, 0.00972828, 0.00704872, 0.00656471,
0.00689743, 0.00854528, 0.01128713, 0.02119957, 0.05047751,
0.05028719, 0.5747379 ])]
I've a property file abc.prop that contains the following.
x=(A B)
y=(C D)
I've a python script abc.py which is able to load the property file abc.prop.
But I'm not able to iterate and convert both the arrays from abc.prop as follows,
x_array=['A','B']
y_array=['C','D']
I tried the following, but I want to know if there's a better way of doing it, instead of using replace() and stripping off braces.
importConfigFile = "abc.prop"
propInputStream = FileInputStream(importConfigFile)
configProps = Properties()
configProps.load(propInputStream)
x_str=configProps.get("x")
x_str=x_str.replace("(","")
x_str=x_str.replace(")","")
x_array=x_str.split(' ')
Please suggest a way to achieve this.
I'm not aware of any special bash-to-Python data structure converters, and I doubt there are any. The only thing I can suggest is a slightly cleaner and more dynamic way of doing this.
data = {}
with open('abc.prop', 'r') as f:
    for line in f:
        parts = line.split('=')
        key = parts[0].strip()
        value = parts[1].strip('()\n')
        values = value.split()
        data[key] = [x.strip() for x in values]
print(data)
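As a quick check of what that loop produces, here is the same logic run against the two sample lines inline instead of a file (using partition, which behaves like split('=') for this input):

```python
lines = ["x=(A B)\n", "y=(C D)\n"]

data = {}
for line in lines:
    key, _, value = line.partition('=')
    data[key.strip()] = value.strip('()\n').split()

print(data)   # {'x': ['A', 'B'], 'y': ['C', 'D']}
```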
My code, which is meant to replace certain letters (a with e, e with a, and s with 3, specifically), is not working, but I am not quite sure what the error is, as it is not changing the text file I am feeding it.
pattern = "ae|ea|s3"

def encode(pattern, filename):
    message = open(filename, 'r+')
    output = []
    pattern2 = pattern.split('|')
    for letter in message:
        isfound = false
        for keypair in pattern2:
            if letter == keypair[0]:
                output.append(keypair[1])
                isfound = true
                if isfound == true:
                    break;
        if isfound == false:
            output.append(letter)
    message.close()
Been racking my brain out trying to figure this out for a while now..
It is not changing the text file because you never write the output back to the file. As it stands, the function builds the output string and then drops it at the end of the function. Either return the output string from the function and store it outside, or replace the file contents inside the function by writing to the file (not appending).
As this seems like an exercise I prefer to not add the code to do it, as you will probably learn more from writing the function yourself.
Here is a quick implementation with the desired result, you will need to modify it yourself to read files, etc:
def encode(pattern, string):
    rep = {}
    for pair in pattern.split("|"):
        rep[pair[0]] = pair[1]
    out = []
    for c in string:
        out.append(rep.get(c, c))
    return "".join(out)

print(encode("ae|ea|s3", "Hello, this is my default string to replace"))
# output => "Hallo, thi3 i3 my dafeult 3tring to rapleca"
If you want to modify a file, you need to specifically tell your program to write to the file. Simply appending to your output variable will not change it.
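A minimal sketch of that write-back step, assuming the same pattern format as above (read the whole file, transform it, then reopen in write mode to replace the contents):

```python
def encode_file(pattern, filename):
    # build the replacement map from the "ae|ea|s3" style pattern
    rep = dict((pair[0], pair[1]) for pair in pattern.split("|"))

    with open(filename) as f:
        text = f.read()

    encoded = "".join(rep.get(c, c) for c in text)

    with open(filename, "w") as f:   # 'w' truncates, so the file is replaced
        f.write(encoded)
```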