I would like to parse a machine log file, re-arrange the data, and write it to a .csv file, which I will import into a Google spreadsheet. Or write the data directly to the spreadsheet.
Here is an example of what the log looks like:
39 14 15 5 2016 39 14 15 5 2016 0
39 14 15 5 2016 40 14 15 5 2016 0.609
43 14 15 5 2016 44 14 15 5 2016 2.182
The output should look like this:
start_date,start_time,end_time,meters
15/5/16,14:39,14:39,0
15/5/16,14:39,14:40,0.609
15/5/16,14:43,14:44,2.182
I wrote the following Python code:
file = open("c:\SteelUsage.bsu")
for line in file.readlines():
    print(line) #just for verification
    line=line.strip()
    position=[]
    numbers=line.split()
    for number in numbers:
        position.append(number)
        print(number) #just for verification
The idea is to save each number in a row to a list, so I can then re-write the numbers in the right order according to their position.
For example: in row #1 the string "39" will have position 0, "14" position 1, etc.
But it seems the code I wrote stores each number as a new list, because when I change print(number) to print(number[0]), the code prints the first digit of each number instead of printing the first number (39).
Where did I go wrong?
Thank you
Nothing is stored as a new list: number in your inner loop is a single string, so number[0] is its first character; the whole line's numbers live in position, which you rebuild for every line. Do something like this, then write the result out to your csv file:
with open(r'c:\SteelUsage.bsu', 'r') as reader:  # raw string so the backslash isn't treated as an escape
    for line in reader:
        inp = line.strip().split()
        out = '%s/%s/%s,%s:%s,%s:%s,%s' % (inp[2], inp[3], inp[4], inp[1], inp[0], inp[6], inp[5], inp[10])
        print(out)
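If you want the .csv written directly instead of printed, here is a minimal sketch using the csv module (the output path is an assumption; the raw strings keep the backslashes in the Windows paths from being treated as escapes):

import csv

with open(r'c:\SteelUsage.bsu') as source, \
     open(r'c:\SteelUsage.csv', 'w', newline='') as target:
    writer = csv.writer(target)
    writer.writerow(['start_date', 'start_time', 'end_time', 'meters'])
    for line in source:
        inp = line.split()
        if len(inp) < 11:
            continue  # skip blank or malformed lines
        writer.writerow([
            '%s/%s/%s' % (inp[2], inp[3], inp[4][-2:]),  # start_date, two-digit year
            '%s:%s' % (inp[1], inp[0]),                  # start_time (hour:minute)
            '%s:%s' % (inp[6], inp[5]),                  # end_time
            inp[10],                                     # meters
        ])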
I have the attached file which I need to load in Python. I need to ignore the NETSIM and 10 values at the top and read the remaining lines. I used the following code to read the file:
import pandas as pd
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep=r'\\\t',engine='python',skiprows=(0,1,2), header=None)
I used the tab separator in my code, but the output still shows up as follows:
0 0\t0.362291\t0.441396
1 1\t0.156279\t0.341383
2 2\t0.699696\t0.045577
3 3\t0.714313\t0.171668
4 4\t0.378966\t0.495494
5 5\t0.961942\t0.444337
6 6\t0.726886\t0.575888
7 7\t0.168639\t0.406223
8 8\t0.875627\t0.061439
9 9\t0.540054\t0.317061
10 5\t7\t155200000.000000\t54000000.000000\t37997...
11 3\t4\t155200000.000000\t40500000.000000\t24507...
12 4\t6\t155200000.000000\t33000000.000000\t18606...
13 5\t6\t155200000.000000\t72000000.000000\t39198...
14 4\t1\t155200000.000000\t40500000.000000\t24507...
15 3\t9\t155200000.000000\t39000000.000000\t22698...
Can someone please guide me as to what's wrong?
You want to split on a literal tab character, not the string \\t, so you shouldn't be using a raw string literal here. Change sep to just '\t'.
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep='\t',engine='python',skiprows=(0,1,2), header=None)
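As an aside, skiprows also accepts a plain integer, and with a single-character separator the default C engine is fine, so the call can be shortened (an untested sketch of the same read):

import pandas as pd

# sep='\t' splits on a real tab; skiprows=3 drops the three header lines
x = pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',
                sep='\t', skiprows=3, header=None)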
I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. These files are delivered from outside my organization, so there is no chance I can change their format.
Firstly, the file is space delimited but the fields that represent a "column" of data can span multiple rows. For example, if you had a row that was supposed to be 25 columns of data, it may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data type format where the first 3 or so lines will be describing the data (including a field that tells me how many rows there are) and the next N rows are data. Then it will go back to the 3 lines format again to describe the next set of data. That means I can't just set up a reader for the N columns format and let it run to EOF.
I'm afraid the built-in Python file-reading functionality could get really ugly really fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.
This is a sketch of how you can proceed (EDIT: with some modifications):
file = open("testfile.txt", "r")
# store data for the different sections here
datasections = list()
while True:
current_row = []
# read three lines
l1 = file.readline()
if line == '': # or other end condition
break
l2 = file.readline()
l3 = file.readline()
# extract the following information from l1, l2, l3
nrows = # extract the number rows in the next section
ncols = # extract the number of columns in the next section
# loop while len(current_row) < nrows * ncols:
# read next line, isolate the items using str.split()
# append items to current_row
# break current_row into the lines after each ncols-th item
# store data in datasections in a new array
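If you want behaviour closer to MATLAB's textscan (consume exactly nrows * ncols numbers from an open handle and leave it positioned at the next section's header), a token generator keeps memory usage flat. A sketch, assuming, as in your example, that each section ends on a line boundary:

from itertools import islice

def tokens(fh):
    # yield whitespace-separated tokens one line at a time,
    # ignoring where the physical line breaks fall
    for line in fh:
        yield from line.split()

def read_section(fh, nrows, ncols):
    # consume exactly nrows * ncols numbers from the handle
    flat = [float(t) for t in islice(tokens(fh), nrows * ncols)]
    # reshape the flat list into nrows rows of ncols each
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]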
I'm trying to create a Series in Pandas from a list of dates presented as strings, thus:
['2016-08-09',
'2015-08-03',
'2017-08-15',
'2017-12-14',
...
but when I apply pd.Series to the list, the result in a Jupyter notebook displays as:
0 [[[2016-08-09]]]
1 [[[2015-08-03]]]
2 [[[2017-08-15]]]
3 [[[2017-12-14]]]
...
Is there a simple way to fix it? The data has come from an Xml feed parsed using lxml.objectify.
I don't normally get these problems when reading from csv and just curious what I might be doing wrong.
UPDATE:
The code to grab the data and an example site:
import lxml.objectify
import pandas as pd

def parse_sitemap(url):
    root = lxml.objectify.parse(url)
    rooted = root.getroot()
    output_1 = [child.getchildren()[0] for child in rooted.getchildren()]
    output_0 = [child.getchildren()[1] for child in rooted.getchildren()]
    return output_1

results = parse_sitemap("sitemap.xml")
pd.Series(results)
If you print out type(results[0]), you'll see that what you get is not a string:
print(type(results[0]))
Output:
lxml.objectify.StringElement
This is not a string, and pandas doesn't seem to be playing nice with it. But the fix is easy. Just convert to string using pd.Series.astype:
s = pd.Series(results).astype(str)
print(s)
0 2017-08-09T11:20:38Z
1 2017-08-09T11:10:55Z
2 2017-08-09T15:36:20Z
3 2017-08-09T16:36:59Z
4 2017-08-02T09:56:50Z
5 2017-08-02T19:33:31Z
6 2017-08-03T07:32:24Z
7 2017-08-03T07:35:35Z
8 2017-08-03T07:54:12Z
9 2017-07-31T16:38:34Z
10 2017-07-31T15:42:24Z
11 2017-07-31T15:44:56Z
12 2017-07-31T15:23:25Z
13 2017-08-01T08:30:27Z
14 2017-08-01T11:01:57Z
15 2017-08-03T13:52:39Z
16 2017-08-03T14:29:55Z
17 2017-08-03T13:39:24Z
18 2017-08-03T13:39:00Z
19 2017-08-03T15:30:58Z
20 2017-08-06T11:29:24Z
21 2017-08-03T10:19:43Z
22 2017-08-14T18:42:49Z
23 2017-08-15T15:42:04Z
24 2017-08-17T08:58:19Z
25 2017-08-18T13:37:52Z
26 2017-08-18T13:38:14Z
27 2017-08-18T13:45:42Z
28 2017-08-03T09:56:42Z
29 2017-08-01T11:01:22Z
dtype: object
I think all you need to do is:
pd.Series(dates)
but there's not enough info in the question to say for sure.
Additionally, if you want to use datetime64 objects, you can do:
pd.Series(pd.to_datetime(dates))
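Alternatively, convert to plain strings at parse time, so pandas never sees the objectify elements at all. A small tweak to the question's parse_sitemap (a sketch; str() on an objectify StringElement returns its text content):

import lxml.objectify

def parse_sitemap(url):
    root = lxml.objectify.parse(url)
    rooted = root.getroot()
    # str() turns each StringElement into its text content,
    # so the list holds plain Python strings
    return [str(child.getchildren()[0]) for child in rooted.getchildren()]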
So I have a text file that I need to trim based on a value in the second-to-last column: if it says 1, delete the line; if 0, keep the line.
The text looks like this, it just has thousands of rows:
#name #bunch of values #column of interest
00051079+4547116 00 05 10.896 +45 47 11.570 0 0 \n
00051079+4547117 00 05 10.896 +45 47 11.570 432 3 0 0 \n
00051079+4547118 00 05 10.896 +45 47 11.570 34 6 1 0 \n
I have tried this (plus about a hundred variations of this):
with open("Desktop/MStars.txt") as M:
data = M.read()
data = data.split('\n')
mactivity = [row.split()[-2] for row in data]
#name = [row.split(' ')[0] for row in data]
#print ((mactivity))
with open("Desktop/MStars.txt","r") as input:
with open("Desktop/MStarsReduced.txt","w") as output:
for line in input:
if mactivity =="0":
output.write(line)
Thank you in advance, it is driving me mad.
A line read from a plain text file is a string, not a list of fields, so you have to split it into columns before you can index them.
Editing your last little code block:
with open("Desktop/MStars.txt","r") as input:
with open("Desktop/MStarsReduced.txt","w") as output:
for line in input:
if line[-2] == 0:
output.write(line)
This will write your line if and only if the second-to-last field is "0". Otherwise, it will not be written.
I have a file with the following format:
# Data set number 1
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
4010 3 5 1001 2010 3355 107 2039
# Data set number 2
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
I want to read the data set number, the number of lines, and the max number of column 3. I searched and found that the csv module can read headers, but can it read these numbers from the header lines and store them? What I did was:
nnn = linecache.getline(filename, 1)
nnnn = nnn(line.split()[4])
number = linecache.getline(filename, 3)
number2 = number(line.split()[4])
mmm = linecache.getline(filename, 5)
mmmm = mmm(line.split()[7])
mmmmm = int(mmmm)
max_nb = range(mmmmm)
n_data = int(nnnn)
n_frame = range(n_data)
singleframe = natoms + 6
How can I read those numbers and store them using the csv module? I skip the 6 header lines by using 'singleframe', but I'm also curious how the csv module can read the 6 header lines. Thanks
You don't really have a CSV file; you have a proprietary format instead. Just parse it directly, using regular expressions to quickly extract your desired data:
import re
set_number = re.compile(r'Data set number (\d+)')
patterns = {
    'line_count': re.compile(r'Number of lines (\d+)'),
    'max_num': re.compile(r'Max number of column 3 is (\d+)'),
}
with open(filename, 'r') as infh:
    results = {}
    set_numbers = []
    for line in infh:
        if not line.startswith('#'):
            # skip data lines; the header lines all start with '#'
            continue
        # use search rather than match: the line begins with '# ',
        # so a match anchored at the start would never succeed
        set_match = set_number.search(line)
        if set_match:
            set_numbers.append(int(set_match.group(1)))
        else:
            for name, pattern in patterns.items():
                match = pattern.search(line)
                if match:
                    results[name] = int(match.group(1))
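With the sample file above, this leaves set_numbers == [1, 2] and results == {'line_count': 4010, 'max_num': 5}. Note that results only keeps the values from the most recently seen header block; if you need them per section, start a fresh dict whenever a 'Data set number' line appears.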
Do not use the linecache module. It'll read the whole file into memory, and is really only intended for access to Python source files; whenever a traceback needs to be printed this module caches the source files involved with the current stack. You'd only use it for smaller files from which you need random lines, repeatedly.