parse text file and generate new .csv file based on that data


I would like to parse a machine log file, rearrange the data and write it to a .csv file, which I will import into a Google spreadsheet (or write the data directly to the spreadsheet).
Here is an example of what the log looks like:
39 14 15 5 2016 39 14 15 5 2016 0
39 14 15 5 2016 40 14 15 5 2016 0.609
43 14 15 5 2016 44 14 15 5 2016 2.182
The output should look like this:
start_date,start_time,end_time,meters
15/5/16,14:39,14:39,0
15/5/16,14:39,14:40,0.609
15/5/16,14:43,14:44,2.182
I wrote the following Python code:
file = open("c:\SteelUsage.bsu")
for line in file.readlines():
    print(line) #just for verification
    line=line.strip()
    position=[]
    numbers=line.split()
    for number in numbers:
        position.append(number)
        print(number)#just for verification
The idea is to save each number in a row to a list, so that I can rewrite the numbers in the right order according to their position.
For example, in row #1 the string "39" will have position 0, "14" position 1, etc.
But it seems the code I wrote stores each number as a new list, because when I change print(number) to print(number[0]), the code prints the first digit of each number instead of the first number (39).
Where did I go wrong?
Thank you.

Do something like this, then write each formatted line out to your CSV file.
with open(r'c:\SteelUsage.bsu', 'r') as reader:
    for line in reader:
        inp = line.split()
        if len(inp) < 11:
            continue  # skip blank or incomplete lines
        # inp[4][-2:] keeps the last two digits of the year (2016 -> 16)
        out = '%s/%s/%s,%s:%s,%s:%s,%s' % (inp[2], inp[3], inp[4][-2:], inp[1], inp[0], inp[6], inp[5], inp[10])
        print(out)
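As for the original question: each item in numbers is a string, so number[0] is the first character of that string, while numbers[0] (or position[0]) is the first number on the line. The following is a minimal sketch of the whole conversion using the csv module. The function name convert and the in-memory buffer are illustrative assumptions, not part of the original code, and the field order is taken from the sample log lines above.

```python
import csv
import io

def convert(log_lines, out_file):
    """Write one CSV row per log line (hypothetical helper for illustration)."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["start_date", "start_time", "end_time", "meters"])
    for line in log_lines:
        fields = line.split()
        if len(fields) != 11:
            continue  # skip blank or malformed lines
        # Assumed layout: start minute, hour, day, month, year,
        # then end minute, hour, day, month, year, then meters.
        s_min, s_hour, day, month, year, e_min, e_hour = fields[:7]
        meters = fields[10]
        writer.writerow([
            f"{day}/{month}/{year[-2:]}",
            f"{s_hour}:{s_min}",
            f"{e_hour}:{e_min}",
            meters,
        ])

log = [
    "39 14 15 5 2016 39 14 15 5 2016 0",
    "39 14 15 5 2016 40 14 15 5 2016 0.609",
    "43 14 15 5 2016 44 14 15 5 2016 2.182",
]
buffer = io.StringIO()
convert(log, buffer)
print(buffer.getvalue())
```

To write a real file, pass an open file object instead of the buffer, for example one opened with open("output.csv", "w", newline=""). Minutes are copied as they appear in the log, so a single-digit minute is not zero-padded.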

Related

Tab and line separation in python pandas

I have the attached file, which I need to load in Python. I need to ignore the NETSIM and 10 values at the top and read the rest. I used the following code to read the file:
import pandas as pd
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep=r'\\\t',engine='python',skiprows=(0,1,2), header=None)
I used the tab separator in my code, but the output still looks like this:
0 0\t0.362291\t0.441396
1 1\t0.156279\t0.341383
2 2\t0.699696\t0.045577
3 3\t0.714313\t0.171668
4 4\t0.378966\t0.495494
5 5\t0.961942\t0.444337
6 6\t0.726886\t0.575888
7 7\t0.168639\t0.406223
8 8\t0.875627\t0.061439
9 9\t0.540054\t0.317061
10 5\t7\t155200000.000000\t54000000.000000\t37997...
11 3\t4\t155200000.000000\t40500000.000000\t24507...
12 4\t6\t155200000.000000\t33000000.000000\t18606...
13 5\t6\t155200000.000000\t72000000.000000\t39198...
14 4\t1\t155200000.000000\t40500000.000000\t24507...
15 3\t9\t155200000.000000\t39000000.000000\t22698...
Can someone please guide me as to what's wrong?

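You want to split on a tab character alone. With engine='python', a separator longer than one character is treated as a regular expression, so the raw string r'\\\t' matches a literal backslash followed by a tab. That sequence never occurs in the file, so nothing is split. Change sep to just '\t'.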
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep='\t',engine='python',skiprows=(0,1,2), header=None)
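The difference can be reproduced without the original file. This is a small sketch that assumes a stand-in for input10.txt: three header lines followed by tab-separated rows, held in memory with io.StringIO. The variable names are illustrative only.

```python
import io
import pandas as pd

# Stand-in for input10.txt (assumed layout): three header lines, then data.
sample = (
    "NETSIM\n"
    "10\n"
    "header line\n"
    "0\t0.362291\t0.441396\n"
    "1\t0.156279\t0.341383\n"
)

# Regex separator that requires a backslash before the tab: no line is split.
unsplit = pd.read_csv(io.StringIO(sample), sep=r"\\\t", engine="python",
                      skiprows=[0, 1, 2], header=None)

# Plain tab separator: each row is split into its three fields.
split = pd.read_csv(io.StringIO(sample), sep="\t",
                    skiprows=[0, 1, 2], header=None)

print(unsplit.shape, split.shape)
```

With a plain single-character separator, engine='python' is no longer required, so it is left out of the second call.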

Read Delimited File That Wraps Lines

I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. I am delivered these files from outside my organization, so there is no chance I can change their format.
Firstly, the file is space delimited but the fields that represent a "column" of data can span multiple rows. For example, if you had a row that was supposed to be 25 columns of data, it may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data type format where the first 3 or so lines will be describing the data (including a field that tells me how many rows there are) and the next N rows are data. Then it will go back to the 3 lines format again to describe the next set of data. That means I can't just set up a reader for the N columns format and let it run to EOF.
I'm afraid the built in python file reading functionality could get really ugly real fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.

This is a sketch of how you can proceed:
(EDIT: with some modifications)
# store data for the different sections here
datasections = list()
with open("testfile.txt", "r") as file:
    while True:
        # read the three header lines
        l1 = file.readline()
        if l1 == '': # or other end condition
            break
        l2 = file.readline()
        l3 = file.readline()
        # extract the number of rows and columns of the next section from
        # l1, l2, l3; parse_header is a placeholder you write for your format
        nrows, ncols = parse_header(l1, l2, l3)
        # read lines until the whole section has been collected
        current_row = []
        while len(current_row) < nrows * ncols:
            line = file.readline()
            if line == '':
                raise ValueError("file ended in the middle of a section")
            # isolate the items using str.split() and append them
            current_row.extend(line.split())
        # break current_row into lines after each ncols-th item
        rows = [current_row[i:i + ncols] for i in range(0, len(current_row), ncols)]
        # store the data in datasections
        datasections.append(rows)
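A runnable variant of this idea, written as a generator so that only one section is held in memory at a time. The header layout is an assumption made for the demo: three header lines, the last of which reads "<nrows> <ncols>". The real files will need their own header parsing, and the name read_sections is hypothetical. The sketch also assumes that every section ends at a line break.

```python
import io

def read_sections(fh, n_header=3):
    """Yield (header_lines, rows) for each section of an open file handle."""
    while True:
        header = [fh.readline() for _ in range(n_header)]
        if header[0] == "":
            return  # end of file
        # ASSUMPTION: the last header line looks like "<nrows> <ncols>".
        nrows, ncols = (int(x) for x in header[-1].split())
        items = []
        while len(items) < nrows * ncols:
            line = fh.readline()
            if line == "":
                raise ValueError("file ended in the middle of a section")
            items.extend(float(x) for x in line.split())
        # Cut the flat list of values into rows of ncols values each.
        rows = [items[i:i + ncols] for i in range(0, len(items), ncols)]
        yield [h.strip() for h in header], rows

demo = io.StringIO(
    "set A\nunits m\n2 5\n"
    "1 2 3\n4 5 6 7\n8 9 10\n"
    "set B\nunits s\n1 3\n"
    "1\n2 3\n"
)
sections = list(read_sections(demo))
for header, rows in sections:
    print(header[0], rows)
```

Because the generator reads from the handle only when the next section is requested, the caller can process and discard each section before moving on, which is close to what the textscan call does with its open file handle.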

Why does Pandas Series created from list appear enclosed with square brackets?

I'm trying to create a Series in Pandas from a list of dates presented as strings, thus:
['2016-08-09',
'2015-08-03',
'2017-08-15',
'2017-12-14',
...
but when I apply pd.Series from within the Pandas module the result in Jupyter notebook displays as:
0 [[[2016-08-09]]]
1 [[[2015-08-03]]]
2 [[[2017-08-15]]]
3 [[[2017-12-14]]]
...
Is there a simple way to fix it? The data has come from an XML feed parsed using lxml.objectify.
I don't normally get these problems when reading from CSV and am just curious what I might be doing wrong.
UPDATE:
The code to grab the data and an example site:
import lxml.objectify
import pandas as pd
def parse_sitemap(url):
    root = lxml.objectify.parse(url)
    rooted = root.getroot()
    output_1 = [child.getchildren()[0] for child in rooted.getchildren()]
    output_0 = [child.getchildren()[1] for child in rooted.getchildren()]
    return output_1
results = parse_sitemap("sitemap.xml")
pd.Series(results)

If you print out type(results[0]), you'll see that what you get is not a string:
print(type(results[0]))
Output:
lxml.objectify.StringElement
This is not a string, and pandas doesn't seem to be playing nice with it. But the fix is easy. Just convert to string using pd.Series.astype:
s = pd.Series(results).astype(str)
print(s)
0 2017-08-09T11:20:38Z
1 2017-08-09T11:10:55Z
2 2017-08-09T15:36:20Z
3 2017-08-09T16:36:59Z
4 2017-08-02T09:56:50Z
5 2017-08-02T19:33:31Z
6 2017-08-03T07:32:24Z
7 2017-08-03T07:35:35Z
8 2017-08-03T07:54:12Z
9 2017-07-31T16:38:34Z
10 2017-07-31T15:42:24Z
11 2017-07-31T15:44:56Z
12 2017-07-31T15:23:25Z
13 2017-08-01T08:30:27Z
14 2017-08-01T11:01:57Z
15 2017-08-03T13:52:39Z
16 2017-08-03T14:29:55Z
17 2017-08-03T13:39:24Z
18 2017-08-03T13:39:00Z
19 2017-08-03T15:30:58Z
20 2017-08-06T11:29:24Z
21 2017-08-03T10:19:43Z
22 2017-08-14T18:42:49Z
23 2017-08-15T15:42:04Z
24 2017-08-17T08:58:19Z
25 2017-08-18T13:37:52Z
26 2017-08-18T13:38:14Z
27 2017-08-18T13:45:42Z
28 2017-08-03T09:56:42Z
29 2017-08-01T11:01:22Z
dtype: object

I think all you need to do is:
pd.Series(dates)
but there's not enough information in the question to say for sure.
Additionally, if you want to use datetime64 objects, you can do:
pd.Series(pd.to_datetime(dates))
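A short sketch of the two options side by side, assuming dates is a list of plain Python strings (for example after converting each parsed element with str). The sample values are the ones shown in the question.

```python
import pandas as pd

# ASSUMPTION: dates already holds plain strings, not lxml elements.
dates = ["2016-08-09", "2015-08-03", "2017-08-15", "2017-12-14"]

as_text = pd.Series(dates)                   # values stay strings
as_dates = pd.Series(pd.to_datetime(dates))  # values become datetimes

# Datetime values expose the .dt accessor, e.g. to pull out the year.
print(as_dates.dt.year.tolist())
```

Converting to datetimes is worth it when the values will be sorted, filtered by range, or resampled; for display only, the string Series is enough.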

Remove rows based on a value in second last column in Python

So I have a text file that I need to trim based on a value in the second last column - if it says 1, delete the line, if 0, keep the line.
The text looks like this, it just has thousands of rows:
#name #bunch of values #column of interest
00051079+4547116 00 05 10.896 +45 47 11.570 0 0 \n
00051079+4547117 00 05 10.896 +45 47 11.570 432 3 0 0 \n
00051079+4547118 00 05 10.896 +45 47 11.570 34 6 1 0 \n
I have tried this (plus about a hundred variations of this):
with open("Desktop/MStars.txt") as M:
    data = M.read()
data = data.split('\n')
mactivity = [row.split()[-2] for row in data]
#name = [row.split(' ')[0] for row in data]
#print ((mactivity))
with open("Desktop/MStars.txt","r") as input:
    with open("Desktop/MStarsReduced.txt","w") as output:
        for line in input:
            if mactivity =="0":
                output.write(line)
Thank you in advance, it is driving me mad.

Each line read from a plain text file is a string, not a list, so split it into fields before looking at the second-to-last one. The fields are strings as well, so compare against "0" rather than the integer 0.
Editing your last little code block:
with open("Desktop/MStars.txt","r") as infile:
    with open("Desktop/MStarsReduced.txt","w") as outfile:
        for line in infile:
            fields = line.split()
            if len(fields) >= 2 and fields[-2] == "0":
                outfile.write(line)
This will write your line if and only if the second-to-last field is 0. Otherwise, it will not be written.
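To see why the line has to be split before indexing, compare indexing the raw string with indexing its fields. This is a small self-contained sketch; the helper name keep_inactive is an assumption for illustration, and the sample rows are taken from the question with a real newline at the end of each.

```python
def keep_inactive(lines):
    """Yield only the lines whose second-to-last field is "0"."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[-2] == "0":
            yield line

sample = [
    "00051079+4547116 00 05 10.896 +45 47 11.570 0 0\n",
    "00051079+4547117 00 05 10.896 +45 47 11.570 432 3 0 0\n",
    "00051079+4547118 00 05 10.896 +45 47 11.570 34 6 1 0\n",
]

last = sample[-1]
char_before_newline = last[-2]         # a single character of the string
second_last_field = last.split()[-2]   # the column of interest

kept = list(keep_inactive(sample))
print(repr(char_before_newline), repr(second_last_field), len(kept))
```

On the last sample row the character before the newline is "0" while the second-to-last field is "1", so indexing the string directly would test the wrong column. A header line beginning with # is also dropped by this filter, because its second-to-last word is not "0".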

python csv module read data from header

I have a file in the following format:
# Data set number 1
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
4010 3 5 1001 2010 3355 107 2039
# Data set number 2
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
I hope to read the data set number, the number of lines, and the maximum of column 3 from each header. I searched and found that the csv module can read headers, but can it read those numbers from the header and store them? What I did was:
nnn = linecache.getline(filename, 1)
nnnn = nnn(line.split()[4])
number = linecache.getline(filename, 3)
number2 = number(line.split()[4])
mmm = linecache.getline(filename, 5)
mmmm = mmm(line.split()[7])
mmmmm = int(mmmm)
max_nb = range(mmmmm)
n_data = int(nnnn)
n_frame = range(n_data)
singleframe = natoms + 6
How can I read and store those numbers using the csv module? I skip the 6 header lines by using 'singleframe', but I am also curious how the csv module could read the 6 header lines. Thanks.

You don't really have a CSV file; you have a proprietary format instead. Just parse it directly, using regular expressions to quickly extract your desired data:
import re
set_number = re.compile(r'Data set number (\d+)')
patterns = {
    'line_count': re.compile(r'Number of lines (\d+)'),
    'max_num': re.compile(r'Max number of column 3 is (\d+)'),
}
with open(filename, 'r') as infh:
    results = {}
    set_numbers = []
    for line in infh:
        if not line.startswith('#'):
            # skip data lines; only comment lines carry header values
            continue
        # use search(), not match(): the text follows the leading '# '
        set_match = set_number.search(line)
        if set_match:
            set_numbers.append(int(set_match.group(1)))
        else:
            for name, pattern in patterns.items():
                match = pattern.search(line)
                if match:
                    results[name] = int(match.group(1))
Do not use the linecache module. It'll read the whole file into memory, and is really only intended for access to Python source files; whenever a traceback needs to be printed this module caches the source files involved with the current stack. You'd only use it for smaller files from which you need random lines, repeatedly.
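If the values of each data set should be kept apart rather than overwritten, one option is to collect a dictionary per header. The following is a sketch under that assumption; read_headers and the shortened in-memory sample are illustrative, and the patterns are the ones used above.

```python
import io
import re

HEADER_PATTERNS = {
    "set_number": re.compile(r"Data set number (\d+)"),
    "line_count": re.compile(r"Number of lines (\d+)"),
    "max_num": re.compile(r"Max number of column 3 is (\d+)"),
}

def read_headers(fh):
    """Return one dict of header values per data set found in fh."""
    datasets = []
    for line in fh:
        if not line.startswith("#"):
            continue  # data line, not part of a header
        for name, pattern in HEADER_PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            if name == "set_number":
                datasets.append({})  # a new data set starts here
            if datasets:
                datasets[-1][name] = int(match.group(1))
    return datasets

sample = io.StringIO(
    "# Data set number 1\n#\n# Number of lines 4010\n"
    "# Max number of column 3 is 5\n# Blahblah\n"
    "1 2 1 110\n"
    "# Data set number 2\n#\n# Number of lines 12\n"
    "# Max number of column 3 is 7\n"
    "2 1 2 33 46 17\n"
)
headers = read_headers(sample)
print(headers)
```

The file is still read line by line, so memory use stays small even for large inputs.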
