How to merge rows from a .dat file using Python

I have data like this in a .dat format
1 13 0.54
1 15 0.65
1 67 0.55
2 355 0.54
2 456 0.29
3 432 0.55
3 542 0.333
I want to merge the rows starting with 1, 2, and so on, to produce a final file like this:
1 13 0.54 15 0.65 67 0.55
2 355 0.54 456 0.29
3 432 0.55 542 0.333
Can someone please help me? I am new to Python. Unless I get the file into this format, I cannot run my Abaqus code.

Explanation: first we split the file into lines, and then we split each line on whitespace.
We then use itertools.groupby to group the lines by their first element. Note that groupby only groups consecutive items, which is fine here because the file is already sorted by its first column.
For each group we take the values, drop the first element of each, join them on spaces, and prepend the key we grouped by plus a space.
from itertools import groupby

with open("file.dat") as f:
    lines = [line.split() for line in f.readlines()]

filteredlines = [line for line in lines if len(line)]

for k, v in groupby(filteredlines, lambda x: x[0]):
    print k + " " + " ".join([" ".join(velem[1:]) for velem in v])
print
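For readers on Python 3, a minimal port of the same approach (not part of the original answer) could look like this:
from itertools import groupby

# Python 3 port of the snippet above
with open("file.dat") as f:
    lines = [line.split() for line in f]

filteredlines = [line for line in lines if line]

for k, v in groupby(filteredlines, key=lambda x: x[0]):
    print(k, " ".join(" ".join(velem[1:]) for velem in v))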

The Python CSV library can be used to both read your DAT file and also create your CSV file as follows:
import csv, itertools

with open("input.dat", "r") as f_input, open("output.csv", "wb") as f_output:
    csv_input = csv.reader(f_input, delimiter=" ", skipinitialspace=True)
    csv_output = csv.writer(f_output, delimiter=" ")
    dat = [line for line in csv_input if len(line)]
    for k, g in itertools.groupby(dat, lambda x: x[0]):
        csv_output.writerow([k] + list(itertools.chain.from_iterable([value[1:] for value in g])))
It produces an output file as follows:
1 13 0.54 15 0.65 67 0.55
2 355 0.54 456 0.29
3 432 0.55 542 0.333
Tested using Python 2.7

Related

Load files with two indexes in one dataframe

I have 28 files, each with the same number of rows and columns. How can I load them so the index does not run through all the files' data (0-2911), but instead restarts (0-103) for each file, with a second index (1-28) identifying each new file?
Here is the code that I wrote that iterates through all data:
import pandas as pd
import glob

path = r"C:/Users/Measurment_Data/Test_1"
all_files = glob.glob(path + "/*.dat")

li = []
for filename in all_files:
    df = pd.read_csv(filename, sep="\t", names=["Voltage", "Current"], header=None)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
frame
Output:
ID Voltage Current
0 NaN 1.000000e+00
1 0.00 -3.047149e-06
2 0.04 -4.941096e-06
3 0.08 -4.472754e-06
4 0.12 -1.053477e-05
... ... ...
2907 -0.16 1.194359e-06
2908 -0.12 5.489425e-06
2909 -0.08 -9.656614e-09
2910 -0.04 -3.427169e-06
2911 -0.00 -2.173696e-06
I would like to have new indexes for every new loaded file. Something like this:
File ID Curr Volt
1 0 0.00 1.00E+00
1 1 0.00 -3.05E-06
1 2 0.04 -4.94E-06
...
1 102 0.08 -4.47E-06
1 103 0.12 -1.05E-05
...
2 0 0.00 2.00E+00
2 1 4.00 -3.05E-06
2 2 0.44 -3.94E-06
...
2 102 5.08 -6.47E-06
2 103 0.22 -6.05E-05
...
...
27 0 0.00 2.00E+00
27 1 4.00 -3.05E-06
27 2 0.44 -3.94E-06
...
27 102 5.08 -6.47E-06
27 103 0.22 -6.05E-05
...
28 0 0.00 2.00E+00
28 1 4.00 -3.05E-06
28 2 0.44 -3.94E-06
...
28 102 5.08 -6.47E-06
28 103 0.22 -6.05E-05
I would like to easily access the values of every file by index, for example rows 0-5 from each of the 28 files.
Just define a new column after you read every file, then concatenate, leaving ignore_index at its default (False) so each file keeps its own 0-based index:
import pandas as pd
import glob

path = r"C:/Users/Measurment_Data/Test_1"
all_files = glob.glob(path + "/*.dat")

li = []
j = 1
for filename in all_files:
    df = pd.read_csv(filename, sep="\t", names=["Voltage", "Current"], header=None)
    df.insert(0, "File", j)  # tag every row of this file with its file number
    j += 1
    li.append(df)

frame = pd.concat(li, axis=0)
frame
Give it a try!
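If you would rather have the file number and the per-file row number as a real two-level index instead of a column, pd.concat's keys argument builds that MultiIndex directly. A minimal sketch under the same assumptions about the files:
import pandas as pd
import glob

path = r"C:/Users/Measurment_Data/Test_1"
all_files = glob.glob(path + "/*.dat")

li = [pd.read_csv(f, sep="\t", names=["Voltage", "Current"], header=None)
      for f in all_files]

# keys labels each file 1..28, names labels the two index levels
frame = pd.concat(li, keys=range(1, len(li) + 1), names=["File", "ID"])

# e.g. rows 0-5 of every file:
frame.loc[(slice(None), slice(0, 5)), :]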

Extract data out of a csv file

I am trying to extract data out of a csv file and output the data to another csv file.
relate task perform
0 avc asd
1 12 24
2 34 54
3 22 33
4 11 11
5 335 534
Time A B C D
0 0.334 0.334 0.334 0.334
1 0.543 0.543 0.543 0.543
2 0.752 0.752 0.752 0.752
3 0.961 0.961 0.961 0.961
4 1.17 1.17 1.17 1.17
5 1.379 1.379 1.379 1.379
I am writing a Python script to read the above table. I want all the data from the Time, A, B, C, and D rows onwards in a separate file.
import csv
import pandas as pd
import os

read_file = False
with open('xyz.csv', mode='r', encoding='utf-8') as f_read:
    reader = csv.reader(f_read)
    for row in reader:
        if 'Time' in row:
I am stuck here. All the rows should have been parsed into 'reader'. Now, how can I extract the data from the Time line onwards into a separate file?
Is there a better method to achieve this?
Should I use pandas instead of plain Python?
I read many similar answers on Stack Overflow, but I am confused about how to finish this problem. Your help is appreciated.
Best
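One minimal way to finish this (a sketch, assuming the 'Time' header row marks where the wanted block starts; 'extracted.csv' is a placeholder output name): flip a flag when the header row appears, then write that row and everything after it.
import csv

# Sketch: copy everything from the 'Time' header row onwards into a new file
with open('xyz.csv', mode='r', encoding='utf-8', newline='') as f_read, \
        open('extracted.csv', mode='w', encoding='utf-8', newline='') as f_write:
    reader = csv.reader(f_read)
    writer = csv.writer(f_write)
    found = False
    for row in reader:
        if 'Time' in row:
            found = True
        if found:
            writer.writerow(row)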

Writing lines from input file to output file based on the order in a list

I have an input file input.dat that looks like this:
0.00 0.00
0.00 0.00
0.00 0.00
-0.28 1.39
-0.49 1.24
-0.57 1.65
-0.61 2.11
-0.90 1.73
-0.87 2.29
I have a list denoting line numbers as follows:
linenum = [7, 2, 6]
I need to write to a file output_veloc_max.dat the rows in input.dat that correspond to linenum values in the same order.
The result should look like this:
-0.61 2.11
0.00 0.00
-0.57 1.65
I have written the following code:
linenum = [7, 2, 6]
i = 1
with open('inputv.dat', 'r') as f5, open('output_veloc_max.dat', 'w') as out:
    for line1 in f5:
        if i in linenum:
            print(line1, end=' ', file=out)
            print(i, line1)
        i += 1
But, it gives me output that looks like this:
2 0.00 0.00
6 -0.57 1.65
7 -0.61 2.11
What am I doing wrong?
Store the values as you encounter them in a dictionary d with the keys denoting the line number and the value holding the line contents. Write them to the file with writelines according to the order of linenum. Use enumerate(fileobj, 1) to get a line number for each line instead of an explicit counter like i:
linenum = [7, 2, 6]
d = {}
with open('inputv.dat', 'r') as f5, open('output_veloc_max.dat', 'w') as out:
    for num, line1 in enumerate(f5, 1):
        if num in linenum:
            d[num] = line1
    out.writelines([d[i] for i in linenum])
Of course, you can further trim this down with a dictionary comprehension:
linenum = [7, 2, 6]
with open('inputv.dat', 'r') as f5, open('output_veloc_max.dat', 'w') as out:
    d = {k: v for k, v in enumerate(f5, 1) if k in linenum}
    out.writelines([d[i] for i in linenum])
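One caveat, not in the original answer: if linenum ever references a line number past the end of the file, d[i] raises a KeyError. A defensive variant simply skips the missing numbers:
out.writelines([d[i] for i in linenum if i in d])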

How do I get only the lines that have the highest value if they are inside a time window?

I am new to the python and scripting in general, so I would really appreciate some guidance in writing a python script.
So, to the point:
I have a big number of files in a directory. Some files are empty, others contain rows like this:
16 2009-09-30T20:07:59.659Z 0.05 0.27 13.559 6
16 2009-09-30T20:08:49.409Z 0.22 0.312 15.691 7
16 2009-09-30T20:12:17.409Z -0.09 0.235 11.826 4
16 2009-09-30T20:12:51.159Z 0.15 0.249 12.513 6
16 2009-09-30T20:15:57.209Z 0.16 0.234 11.776 4
16 2009-09-30T20:21:17.109Z 0.38 0.303 15.201 6
16 2009-09-30T20:23:47.959Z 0.07 0.259 13.008 5
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
16 2009-09-30T20:37:48.609Z -0.02 0.256 12.861 4
16 2009-09-30T20:44:19.359Z 0.14 0.251 12.597 4
16 2009-09-30T20:48:39.759Z 0.03 0.284 14.244 5
16 2009-09-30T20:49:36.159Z -0.07 0.278 13.98 4
16 2009-09-30T20:57:54.609Z 0.01 0.304 15.294 4
16 2009-09-30T20:59:47.759Z 0.27 0.265 13.333 4
16 2009-09-30T21:02:56.209Z 0.28 0.272 13.645 6
and so on.
I want to get these lines out of the files into a new file, but there are some conditions:
if two or more successive lines fall inside a time window of 6 seconds, then only the line with the highest threshold should be printed to the new file.
So, something like that:
Original:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
in output file:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
Keep in mind that lines from different files can fall inside the same 6-second window, so the line that ends up in the output is the one with the highest threshold across all files.
Here is the code, which shows what each field in a line means:
import glob
from datetime import datetime

path = './*.cat'
files = glob.glob(path)

for file in files:
    in_file = open(file, 'r')
    out_file = open("times_final", "w")
    for line in in_file.readlines():
        split_line = line.strip().split(' ')
        template_number = split_line[0]
        t = datetime.strptime(split_line[1], '%Y-%m-%dT%H:%M:%S.%fZ')
        mag = split_line[2]
        num = split_line[3]
        threshold = float(split_line[4])
        no_detections = split_line[5]
    in_file.close()
    out_file.close()
Thank you very much for hints, guidelines, ...
You said in the comments that you already know how to merge the files into one file sorted by t, and that the 6-second windows start with the first row and are based on the actual data.
So you need a way to remember the maximum threshold per window, and to write a line only once you are sure you have processed all rows in its window. A sample implementation:
from datetime import datetime, timedelta
from csv import DictReader, DictWriter

fieldnames = ("template_number", "t", "mag", "num", "threshold", "no_detections")

with open('master_data') as f_in, open("times_final", "w") as f_out:
    reader = DictReader(f_in, delimiter=" ", fieldnames=fieldnames)
    writer = DictWriter(f_out, delimiter=" ", fieldnames=fieldnames,
                        lineterminator="\n")

    window_start = datetime(1900, 1, 1)
    window_timedelta = timedelta(seconds=6)
    window_max = 0
    window_row = None

    for row in reader:
        try:
            t = datetime.strptime(row["t"], "%Y-%m-%dT%H:%M:%S.%fZ")
            threshold = float(row["threshold"])
        except ValueError:
            # replace by actual error handling; skip rows that don't parse
            print("Problem with: {}".format(row))
            continue

        # switch to a new window after 6 seconds
        if t - window_start > window_timedelta:
            # write out the previous window before switching
            if window_row:
                writer.writerow(window_row)
            window_start = t
            window_max = threshold
            window_row = row
        # remember the max threshold inside a single window
        elif threshold > window_max:
            window_max = threshold
            window_row = row

    # don't forget the last window
    if window_row:
        writer.writerow(window_row)
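The answer above assumes a single master_data file that is already merged and sorted by t. For completeness, a minimal sketch of building it from the .cat files (assuming every non-empty line has the same six-field layout shown in the question):
import glob
from datetime import datetime

def parse_t(line):
    # the timestamp is the second whitespace-separated field
    return datetime.strptime(line.split()[1], "%Y-%m-%dT%H:%M:%S.%fZ")

lines = []
for path in glob.glob('./*.cat'):
    with open(path) as f:
        lines.extend(line for line in f if line.strip())

lines.sort(key=parse_t)

with open('master_data', 'w') as out:
    out.writelines(lines)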

Delimiter error from np.loadtxt Python

I'm trying to sum some values in a list, so I loaded the .dat file that contains the values. But the only way I could get Python to sum the data was by separating it with ','. Now, this is what I get:
altura = np.loadtxt("bio.dat", delimiter=',', usecols=(5,), dtype='float')

  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 846, in loadtxt
    vals = [vals[i] for i in usecols]
IndexError: list index out of range
This is my code
import numpy as np
altura = np.loadtxt("bio.dat",delimiter=',',usecols=(5,),dtype='str')
print altura
And this is the file 'bio.dat'
1 Katherine Oquendo M 18 1.50 50
2 Pablo Restrepo H 20 1.83 79
3 Ana Agudelo M 18 1.58 45
4 Kevin Vargas H 20 1.74 80
5 Pablo madrid H 20 1.70 55
What I intend to do is:
x = sum(altura)
What should I do about the separator?
In my case, some lines include a # character. numpy then ignores the rest of the line, because # means 'comment'. So try again with the comments parameter, like:
altura = np.loadtxt("bio.dat", delimiter=',', usecols=(5,), dtype='str', comments='')
And I recommend not using np.loadtxt at all, because it's incredibly slow if you must process a large (>1M line) file.
Alternatively, you can convert the tab-delimited file to CSV first.
The csv module supports tab-delimited files; supply the delimiter argument to reader:
import csv
txt_file = r"mytxt.txt"
csv_file = r"mycsv.csv"
# use 'with' if the program isn't going to immediately terminate
# so you don't leave files open
# the 'b' is necessary on Windows
# it prevents \x1a, Ctrl-z, from ending the stream prematurely
# and also stops Python converting to / from different line terminators
# On other platforms, it has no effect
in_txt = csv.reader(open(txt_file, "rb"), delimiter = '\t')
out_csv = csv.writer(open(csv_file, 'wb'))
out_csv.writerows(in_txt)
This answer is not my work; it is the work of agf found at https://stackoverflow.com/a/10220428/3767980.
The file doesn't need to be comma-separated. Here's my sample run, using StringIO to simulate a file. I assume you want to sum the numbers that look like a person's height (in meters).
In [17]: from StringIO import StringIO
In [18]: s="""\
1 Katherine Oquendo M 18 1.50 50
2 Pablo Restrepo H 20 1.83 79
3 Ana Agudelo M 18 1.58 45
4 Kevin Vargas H 20 1.74 80
5 Pablo madrid H 20 1.70 55
"""
In [19]: S=StringIO(s)
In [20]: data=np.loadtxt(S,dtype=float,usecols=(5,))
In [21]: data
Out[21]: array([ 1.5 , 1.83, 1.58, 1.74, 1.7 ])
In [22]: np.sum(data)
Out[22]: 8.3499999999999996
As a script (with the data in a .txt file):
import numpy as np
fname = 'stack25828405.txt'
data=np.loadtxt(fname,dtype=float,usecols=(5,))
print data
print np.sum(data)
2119:~/mypy$ python2.7 stack25828405.py
[ 1.5 1.83 1.58 1.74 1.7 ]
8.35
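Applied back to the original bio.dat, this means dropping delimiter=',' entirely; a sketch, assuming the six-field layout shown in the question:
import numpy as np

# whitespace is loadtxt's default delimiter, so no delimiter argument is needed;
# column 5 holds the heights (1.50, 1.83, ...)
altura = np.loadtxt("bio.dat", usecols=(5,), dtype=float)
x = sum(altura)  # 8.35 for the sample data
print(x)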
