Input file, modify column, output file - python

I have data in a text file and I would like to be able to modify the file by columns and output the file again. I normally write in C (basic ability) but chose Python for its obvious string benefits. I haven't ever used Python before, so I'm a tad stuck. I have been reading up on similar problems, but they only show how to change whole lines. To be honest, I have no clue what to do.
Say I have the file
1 2 3
4 5 6
7 8 9
and I want to be able to change column two with some function, say multiply it by 2, so I get
1 4 3
4 10 6
7 16 9
Ideally I would be able to easily change the program so I can apply any function to any column.
For anyone who is interested, it is for modifying lab data for plotting, e.g. taking the log of the first column.

Python is an excellent general-purpose language, but I might suggest that if you are on a Unix-based system, you take a look at awk. The awk language is designed for exactly this kind of text-based transformation. The power of awk is easy to see for your question, as the solution is only a few characters: awk '{$2=$2*2;print}'.
$ cat file
1 2 3
4 5 6
7 8 9
$ awk '{$2=$2*2;print}' file
1 4 3
4 10 6
7 16 9
# Multiply the third column by 10
$ awk '{$3=$3*10;print}' file
1 2 30
4 5 60
7 8 90
In awk, each column is referenced by $i, where i is the ith field. So we just set the value of the second field to the value of the second field multiplied by two, and print the line. This can be written even more concisely as awk '{$2=$2*2}1' file, but it's best to be clear at the beginning.

Here is a very simple Python solution:
for line in open("myfile.txt"):
    col = line.strip().split(' ')
    print col[0], int(col[1])*2, col[2]
There are plenty of improvements that could be made, but I'll leave those as an exercise for you.
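Building on that, here is a hedged sketch of how you might generalise it to apply any function to any column, as the question asks; the transform_column name and the {:g} formatting are my own choices, not part of the answer above:

import math

def transform_column(path, col_index, func):
    # Print each line with func applied to the chosen (0-based) column
    with open(path) as f:
        for line in f:
            cols = line.split()
            cols[col_index] = '{:g}'.format(func(float(cols[col_index])))
            print(' '.join(cols))

transform_column("myfile.txt", 1, lambda x: x * 2)  # double the second column
transform_column("myfile.txt", 0, math.log)         # take the log of the first column

Redirect the output to a file (python script.py > out.txt), or collect the lines and write them out, to get the modified file.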

I would use pandas or just numpy. Read your file with:
data = pd.read_csv('file.txt', header=None, delim_whitespace=True)
then work with the data in a spreadsheet-like style, e.g.:
data.values[:,1] *= 2
and finally write it back out with:
data.to_csv('output.txt', sep=' ', header=False, index=False)
(sep, header and index keep the output in the same plain whitespace-separated, headerless shape as the input; a bare to_csv would add a comma-separated header row and an index column.)
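Putting the pieces together with the log example from the question, a short sketch (the file names are the ones used above; np.log is numpy's elementwise log):

import numpy as np
import pandas as pd

data = pd.read_csv('file.txt', header=None, delim_whitespace=True)
data[1] *= 2               # multiply the second column by 2
data[0] = np.log(data[0])  # or: take the log of the first column
data.to_csv('output.txt', sep=' ', header=False, index=False)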

As @sudo_O said, there are much more efficient tools than Python for this task. However, here is a possible solution:
from itertools import imap, repeat
import csv

fun = pow

with open('m.in', 'r') as input_file:
    with open('m.out', 'wb') as out_file:
        inpt = csv.reader(input_file, delimiter=' ')
        out = csv.writer(out_file, delimiter=' ')
        for row in inpt:
            row = [int(e) for e in row]  # conversion
            opt = repeat(2, len(row))    # exponent 2 for every value
            # write ( function(data, argument) )
            out.writerow([str(elem) for elem in imap(fun, row, opt)])
Here it raises every number to the power of two (multiplies it by itself), but you can configure it to square only the second column, leaving the others unchanged, by changing opt: opt = [1 + (col == 1) for col in range(len(row))] (exponent 2 for column 1, 1 otherwise).
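For reference, a Python 3 sketch of the same idea (itertools.imap is gone in Python 3, where the built-in map is already lazy); target_col is my own addition to restrict fun to one column:

import csv

fun = pow
target_col = 1  # apply fun to the second column only

with open('m.in') as input_file, open('m.out', 'w', newline='') as out_file:
    inpt = csv.reader(input_file, delimiter=' ')
    out = csv.writer(out_file, delimiter=' ')
    for row in inpt:
        row = [int(e) for e in row]
        out.writerow([fun(e, 2) if i == target_col else e
                      for i, e in enumerate(row)])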

Related

Is there a way to write a one column pd data frame on top of a mxn pd dataframe to a text file in Python

I have two data frames of different dimensions that I want to write to a text (.txt) file, one on top of the other. I'm sure it's easy, but I can't find a way to do it.
The data I want to write is:
import numpy as np
import pandas as pd
preamble = pd.DataFrame(np.array(["software", "version", "frequency: 100", "firmware:100.10.1"]).T)
data = pd.DataFrame.from_dict({"frame": np.array([1, 2, 3, 4, 5]), "X": np.array([2,4,6,8,10]), "Y": np.array([3,6,9,12,15]), "Z": np.array([1,2,3,4,5])})
I want to create a text file that looks like this:
software
version
frequency:100
firmware: 100.10.1
frame X Y Z
1 2 3 1
2 4 6 2
3 6 8 3
4 8 10 4
5 10 12 5
I tried to get the format correctly at the top end.
I want to keep the [frame, X, Y, Z] headers where they are. But place the "preamble" at the top in a column.
I've tried to append and combine the two data frames, but can't do it. I don't think that's possible.
I've tried looking for ways to write the preamble in cell (column = 1, row = 1) and then start the data in cell (column = 1, row = 5).
Any help here would be appreciated! Please let me know if you need more information!
The main idea is to use to_string().
Here is an idea of how to do this in your example:
with open('myfile.txt', 'w') as fp:
    fp.write(preamble.to_string(index=False).replace('0', '').replace(' ', '')[1:])
    fp.write('\n')
    fp.write(data.to_string())
Here I removed some elements from the first string and did not use the index. I also added one newline character \n to separate the two DataFrames.
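If the preamble is available as a plain list of strings anyway, a simpler sketch (same file name assumed) avoids the string surgery on the first DataFrame:

preamble_lines = ["software", "version", "frequency: 100", "firmware:100.10.1"]

with open('myfile.txt', 'w') as fp:
    fp.write('\n'.join(preamble_lines) + '\n')
    fp.write(data.to_string(index=False))

to_string(index=False) keeps the [frame, X, Y, Z] headers where they are while dropping the row index.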

Save each Excel-spreadsheet-row with header in separate .txt-file (saved as a parameter-sample to be read by simulation programs)

I'm a building energy simulation modeller with an Excel question, aiming to enable automated large-scale simulations using parameter samples (generated using Monte Carlo). Now I have the following question about saving my samples:
I want to save each row of an Excel-spreadsheet in a separate .txt-file in a 'special' way to be read by simulation programs.
Let's say I have the following Excel file with 4 parameters (a, b, c, d) and 20 values underneath:
a b c d
2 3 5 7
6 7 9 1
3 2 6 2
5 8 7 6
6 2 3 4
Each row of this spreadsheet represents a simulation-parameter-sample.
I want to store each row in a separate .txt-file as follows (so 5 '.txt'-files for this spreadsheet):
'1.txt' should contain:
a=2;
b=3;
c=5;
d=7;
'2.txt' should contain:
a=6;
b=7;
c=9;
d=1;
and so on for files '3.txt', '4.txt' and '5.txt'.
So basically matching the header with its corresponding value underneath for each row in a separate .txt-file ('header equals value;').
Is there an Excel add-in that does this, or is it better to use some VBA code? Anybody have an idea?
(I'm quite experienced in simulation modelling but not in programming, hence this rather easy parameter-sample-saving question in Excel. Solutions in Python are also welcome if that's easier for you people.)
My idea would be to use Python along with pandas, as it's one of the most flexible solutions and your use case might expand in the future.
I'm going to try to make this as simple as possible. I'm assuming that you have Python, that you know how to install packages via pip or conda, and that you are ready to run a Python script on whatever system you are using.
First your script needs to import pandas and read the file into a DataFrame:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
(Note that you might need to install the xlrd package, in addition to pandas)
Now you have a powerful data structure that you can manipulate in plenty of ways. I guess the most intuitive one would be to loop over all items. Use string formatting, which is best explained over here, and put the strings together the way you need them:
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    print(s)
Now you just need to write to a file using Python's built-in open. I'll just name the files by the index of the row, but note that this solution will overwrite older text files created by earlier runs of this script. You might want to add something unique, like the date and time or the name of the file you read, or increment the file name across multiple runs of the script.
All together we get:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
file_count = 0
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    out_file = open('test_{:03}.txt'.format(file_count), "w")
    out_file.write(s)
    out_file.close()
    file_count += 1
Note that it's probably not the most elegant way and that there are one-liners out there, but since you are not a programmer I thought you might prefer a more intuitive way that you can tweak yourself easily. One of the shorter variants is sketched below.
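For reference, one of those more compact variants, using iterrows (a sketch; row.items() is items on newer pandas, iteritems on older versions):

import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')
for i, row in df.iterrows():
    # one file per sample row: "a=2;\nb=3;..." and so on
    with open('test_{:03}.txt'.format(i), 'w') as f:
        f.write(''.join('{}={};\n'.format(col, val) for col, val in row.items()))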
I got this to work in Excel. You can expand the length of the variables x,y and z to match your situation and use LastRow, LastColumn methods to find the dimensions of your data set. I named the original worksheet "Data", as shown below.
Sub TestExportText()
    Dim Hdr(1 To 4) As String
    Dim x As Long
    Dim y As Long
    Dim z As Long
    For x = 1 To 4
        Hdr(x) = Cells(1, x)
    Next x
    x = 1
    For y = 1 To 5
        ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
        ActiveSheet.Name = y
        For z = 1 To 4
            With ActiveSheet
                .Cells(z, 1) = Hdr(z) & "=" & Sheets("Data").Cells(x + 1, z) & ";"
            End With
        Next z
        x = x + 1
        ActiveSheet.Move
        ActiveWorkbook.ActiveSheet.SaveAs Filename:="File" & y & ".txt", FileFormat:=xlTextWindows
        ActiveWorkbook.Close SaveChanges:=False
    Next y
End Sub
If you can save your Excel spreadsheet as a CSV file, then this Python script will do what you want.
with open('data.csv') as csv_file:
    data_list = [l.rstrip('\n').split(',') for l in csv_file]

# Row 0 holds the headers; each following row becomes one .txt file
for counter in range(1, len(data_list)):
    output_file_name = str(counter) + '.txt'
    with open(output_file_name, 'w') as out_file:
        for col in range(len(data_list[counter])):
            output_string = data_list[0][col] + '=' + data_list[counter][col] + ';\n'
            out_file.write(output_string)

Reading csv in loop stops at row that does not match

I am trying to read a CSV, then iterate through an SDE to find matching features and their fields, and then print them.
There is a table in the list, and I'm not able to skip over it and continue reading the CSV.
I get "IOError: table 1 does not exist", and I only get the features that come before the table.
import arcpy
from arcpy import env
import sys
import os
import csv
with open('C:/Users/user/Desktop/features_to_look_for.csv', 'r') as t1:
    objectsinESRI = [r[0] for r in csv.reader(t1)]

env.workspace = "//conn/features#dev.sde"
fcs = arcpy.ListFeatureClasses('sometext.*')

for fcs in objectsinESRI:
    fieldList = arcpy.ListFields(fcs)
    for field in fieldList:
        print fcs + " " + ("{0}".format(field.name))
Sample CSV rows (I can't seem to post a screenshot of the Excel file):
feature 1
feature 2
feature 3
feature 4
table 1
feature 5
feature 6
feature 7
feature 8
feature 9
Result
feature 1
feature 2
feature 3
feature 4
Desired Result
feature 1
feature 2
feature 3
feature 4
feature 5
feature 6
feature 7
feature 8
feature 9
So, as stated, I have no clue about arcpy, but this seems the way to start. Looking at the docs, your objectsinESRI seems to be the equivalent of the datasets in the example. From there I extrapolate the following code which, depending on what print(fc) is printing, you may need to extend with yet another for.
So try this:
for object in objectsinESRI:
    for fc in fcs:
        print(fc)
Or maybe this:
for object in objectsinESRI:
    for fc in fcs:
        for field in arcpy.ListFields(fc):
            print(object + " " + ("{0}".format(field.name)))
Then again, I may be completely wrong, of course, but just write the outermost for first, see what it gives you, and keep building from there :) One more idea for skipping the bad row is sketched below.
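Since the reported failure is an IOError on "table 1", another hedged possibility (not part of the sketch above) is to guard the ListFields call and skip the entries that raise:

for fcs in objectsinESRI:
    try:
        fieldList = arcpy.ListFields(fcs)
    except IOError:
        continue  # e.g. "table 1 does not exist": skip it and keep going
    for field in fieldList:
        print fcs + " " + ("{0}".format(field.name))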

Read Delimited File That Wraps Lines

I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. I am delivered these files from outside my organization, so there is no chance I can change their format.
Firstly, the file is space-delimited, but the fields that represent a "column" of data can span multiple rows. For example, if you had a row that was supposed to be 25 columns of data, it might be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data format, where the first 3 or so lines describe the data (including a field that tells me how many rows there are) and the next N rows are data. Then it goes back to the 3-line description format for the next set of data. That means I can't just set up a reader for the N-column format and let it run to EOF.
I'm afraid the built-in Python file-reading functionality could get really ugly really fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.
This is a sketch of how you can proceed:
(EDIT: with some modifications)
file = open("testfile.txt", "r")
# store data for the different sections here
datasections = list()
while True:
current_row = []
# read three lines
l1 = file.readline()
if line == '': # or other end condition
break
l2 = file.readline()
l3 = file.readline()
# extract the following information from l1, l2, l3
nrows = # extract the number rows in the next section
ncols = # extract the number of columns in the next section
# loop while len(current_row) < nrows * ncols:
# read next line, isolate the items using str.split()
# append items to current_row
# break current_row into the lines after each ncols-th item
# store data in datasections in a new array
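Filling in the blanks, a runnable sketch under one loud assumption: that nrows and ncols are the first two integers on the first description line (the real header format will differ, so adapt that line):

def read_sections(path):
    datasections = []
    with open(path) as f:
        while True:
            l1 = f.readline()
            if l1 == '':
                break
            # ASSUMPTION: the first two integers on l1 are nrows and ncols
            nrows, ncols = [int(tok) for tok in l1.split()[:2]]
            f.readline()  # skip the remaining two description lines
            f.readline()
            items = []
            while len(items) < nrows * ncols:
                line = f.readline()
                if not line:
                    raise ValueError("unexpected end of file")
                items.extend(line.split())
            # regroup the flat item list into nrows rows of ncols columns
            rows = [items[i * ncols:(i + 1) * ncols] for i in range(nrows)]
            datasections.append(rows)
    return datasections

To keep RAM usage low, as with the MATLAB textscan approach, you could turn this into a generator and yield each section instead of accumulating them all.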

python csv: getting subset

Here is a snapshot of my CSV:
alex 123f 1
harry fwef 2
alex sef 3
alex gsdf 4
alex wf35 6
harry sdfsdf 3
I would like to get the subset of this data where anything in the first column (harry, alex) occurs at least 4 times, so I want the resulting data set to be:
alex 123f 1
alex sef 3
alex gsdf 4
alex wf35 6
Clearly, you cannot decide which rows are interesting until you've seen all rows (since the very last row might be the one turning some count from three to four and thereby making some previously seen rows interesting, for example;-). So, unless your CSV file is horribly huge, suck it all into memory, first, as a list...:
import csv
with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))
then, do the counting -- Python 2.7 has a better way, but assuming you're still on 2.6 like most of us...:
import collections
counter = collections.defaultdict(int)
for row in data:
    counter[row[0]] += 1
and finally do the selection loop...:
for row in data:
    if counter[row[0]] >= 4:
        print row
Of course, this prints each interesting row as a roughly-hewed list (with square brackets and quotes around the items), but it will be easy to format it in any way you might prefer.
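(The "better way" in Python 2.7 alluded to above is collections.Counter, which does the whole tally in one line, for reference:)

from collections import Counter
counter = Counter(row[0] for row in data)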
If Python is not a must:
$ gawk '{b[$1]++;c[++d,$1]=$0}END{for(i in b){if(b[i]>=4){for(j=1;j<=d;j++){print c[j,i]}}}}' file
And yes, a 70MB file is fine.
