Splitting up a data file in Python, round 2

I recently found a solution to my earlier question, in which I wanted to take two columns of a data file and put them into two arrays.
I now have this code, which does the job nicely.
import pylab
Xvals = []; Yvals = []
# the with block closes the file once the lines have been read
with open('BGBiasRE_IM3_wCSrescaled.txt', 'r') as i:
    lines = [line.split() for line in i if line[:4] not in ('time', 'Step')]
Xvals, Yvals = zip(*lines)
V = [0, 0.004, 0, 0.0004]
pylab.plot(Xvals, Yvals, marker='o')
pylab.axis(V)
pylab.xlabel('Time (ms)')
pylab.ylabel('Current (A)')
pylab.title('Title')
pylab.show()
But I now realise I screwed up the question. I have a data file laid out as below,
time I(R_stkb)
Step Information: Temp=0 (Run: 1/11)
0.000000000000000e+000 0.000000e+000
9.999999960041972e-012 8.924141e-012
1.999999992008394e-011 9.623148e-012
Step Information: Temp=10 (Run: 2/11)
0.000000000000000e+000 0.000000e+000
9.999999960041972e-012 4.924141e-012
1.999999992008394e-011 8.623148e-012
(Note: No empty lines between each data line, and a Tab between the two data values)
The code above puts all the step data into one pair of arrays, so I get two long arrays when what I want is a separate pair of arrays for each step, so that I can plot them separately later on. I also need to get the step name (in this case Temp=10) and use it to name the arrays for each chunk of step data. For example, I would like to end up with arrays like these:
Temp0_Xvals = [ 0.000000000000000e+000, 9.999999960041972e-012, 1.999999992008394e-011]
Temp0_Yvals = [ 0.000000e+000, 8.924141e-012, 9.623148e-012]
Temp10_Xvals = [...]
Temp10_Yvals = [...] etc etc
Obviously this makes the problem much more complicated and I have no idea where to start.

I would do something along these lines:
i = open('BGBiasRE_IM3_wCSrescaled.txt', 'r')
Xnames, Ynames = [], []
count = 0
for line in i:
    if count > 0:  # skip the header line
        line_tmp = line.split()
        if line_tmp[0] == 'Step':
            # 'Temp=10' -> '10'
            step = (line_tmp[2].split('='))[1]
            varnameX = 'Temp' + str(step) +'_Xvals'
            varnameY = 'Temp' + str(step) +'_Yvals'
            globals()[varnameX] = []
            globals()[varnameY] = []
            Xnames.append(varnameX)
            Ynames.append(varnameY)
        else:
            globals()[varnameX].append(float(line_tmp[0]))
            globals()[varnameY].append(float(line_tmp[1]))
    count += 1
i.close()
for name in Xnames:
    print(name + ' = ' + str(eval(name)))
for name in Ynames:
    print(name + ' = ' + str(eval(name)))
This is for sure not the most efficient solution, but it works for your specific problem.
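A variation on the same idea, as a minimal sketch: instead of creating variables through globals(), the blocks can be collected in a dictionary keyed by the step name. The function name parse_steps and the inline sample text are illustrative assumptions, not part of the answer above.

```python
import io

def parse_steps(lines):
    """Group (x, y) values by the label found on each 'Step Information' line."""
    steps = {}
    current = None
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == 'time':
            continue  # skip blank lines and the column header
        if fields[0] == 'Step':
            # 'Temp=10' -> 'Temp10'
            current = fields[2].replace('=', '')
            steps[current] = ([], [])
        elif current is not None:
            xs, ys = steps[current]
            xs.append(float(fields[0]))
            ys.append(float(fields[1]))
    return steps

# Hypothetical in-memory stand-in for the data file.
sample = io.StringIO(
    "time\tI(R_stkb)\n"
    "Step Information: Temp=0 (Run: 1/11)\n"
    "0.000000000000000e+000\t0.000000e+000\n"
    "9.999999960041972e-012\t8.924141e-012\n"
    "Step Information: Temp=10 (Run: 2/11)\n"
    "0.000000000000000e+000\t0.000000e+000\n"
    "9.999999960041972e-012\t4.924141e-012\n"
)
steps = parse_steps(sample)
for name, (xs, ys) in steps.items():
    print(name, xs, ys)
```

With this layout, steps['Temp10'][0] and steps['Temp10'][1] play the role of Temp10_Xvals and Temp10_Yvals, and plotting every step is a single loop over the dictionary.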

Use csv with the excel_tab dialect.
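As a rough sketch of what that could look like: the csv module registers the csv.excel_tab dialect under the name 'excel-tab'. The sample text below is an in-memory stand-in for the file, and the filtering rule (keep rows with exactly two fields, skip the header by name) is an assumption based on the layout shown in the question.

```python
import csv
import io

# Hypothetical in-memory stand-in for the tab-separated data file.
sample = io.StringIO(
    "time\tI(R_stkb)\n"
    "Step Information: Temp=0 (Run: 1/11)\n"
    "0.000000000000000e+000\t0.000000e+000\n"
    "9.999999960041972e-012\t8.924141e-012\n"
)

rows = []
for row in csv.reader(sample, dialect='excel-tab'):
    # Step lines contain no tab, so they arrive as a single field and are dropped;
    # the header row is skipped by its first field.
    if len(row) == 2 and row[0] != 'time':
        rows.append((float(row[0]), float(row[1])))
print(rows)
```

The Step lines would still need separate handling if the data has to be split per step.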

Related

python Pythagoras function

So I have a set of positional data that comes from a factory sensor. It produces x, y and z info in meters relative to a known lat/long position. I have a function that will convert a distance in meters from the lat/long, but I need to use the x and y data in a Pythagoras function to determine that distance. Let me try to clarify with an example of the JSON data the sensor gives.
[
  {
    "id": "84eb18677194",
    "name": "forklift_0001",
    "areaId": "Tracking001",
    "areaName": "Hall1",
    "color": "#FF0000",
    "coordinateSystemId": "CoordSys001",
    "coordinateSystemName": null,
    "covarianceMatrix": [
      0.82,
      -0.07,
      -0.07,
      0.55
    ],
    "position": [ #this is the x,y and z data, in meters from the ref point
      18.11,
      33.48,
      2.15
    ],
In this record the forklift is 18.11m along and 33.48m up from the reference lat/long. The sensor is 2.15m high, and that is a constant piece of info I don't need. To work out the distance from the reference point I need to use Pythagoras and then convert that data back into lat/long so my analysis tool can present it.
My problem (as far as Python goes) is that I can't figure out how to make it see 18.11 & 33.48 as the x & y and tell it to disregard 2.15 entirely. Here is what I have so far.
import math
import json
import pprint
import os
from glob import iglob
rootdir_glob = 'C:/Users/username/Desktop/test_folder**/*"'  # Note the added asterisks, use forward slash
# This will return absolute paths
file_list = [f for f in
             iglob('C:/Users/username/Desktop/test_folder/13/00**/*', recursive=True)
             if os.path.isfile(f)]
for f in file_list:
    print('Input file: ' + f)  # Replace with desired operations
    with open(f, 'r') as f:
        distros = json.load(f)
    output_file = 'position_data_blob_14' + str(output_nr) + '.csv'  # output file name may be changed
def pythagoras(a,b):
    value = math.sqrt(a*a + b*b)
    return value
result = pythagoras(str(distro['position']))  # I am totally stuck here :/
print(result)
This piece of script is part of a wider project to parse the file by machine and people, and also by work and non-work times of day.
If someone could give me some tips on how to make the Pythagoras part work I'd be really grateful. I am not sure if I should define it as a function, but as I've typed this I am wondering if it should be a 'for' loop which uses the x & y and ignores the z.
All help really appreciated.

Try this:
position = distro['position'] # Get the full list
result = pythagoras(position[0], position[1]) # Get the first and second element from the list
print(result)
Why do you use str() on the argument of the function? What were you trying to do?

You're passing one input, a list of numbers, into a function that takes two numbers as input. There are two solutions to this - either change what you pass in, or change the function.
distro['position'] = [18.11, 33.48, 2.15], so for the first solution all you need to do is pass in distro['position'][0] and distro['position'][1]:
result = pythagoras(distro['position'][0], distro['position'][1])
Alternatively (which in my opinion is more elegant), pass in the list to the function and have the function extract the values it cares about:
def pythagoras(input_triple):
    a,b,c = input_triple  # c (the height) is unpacked but not used
    value = math.sqrt(a*a + b*b)
    return value
result = pythagoras(distro['position'])
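For what it's worth, the standard library already provides this calculation as math.hypot. The sketch below is only an illustration: the function name planar_distance and the trimmed-down JSON string are assumptions, standing in for one record of the sensor output.

```python
import json
import math

# Trimmed-down stand-in for the sensor output (assumed shape, one record).
raw = '[{"name": "forklift_0001", "position": [18.11, 33.48, 2.15]}]'

def planar_distance(position):
    """Distance from the reference point using x and y only; z is ignored."""
    x, y = position[:2]
    return math.hypot(x, y)

for distro in json.loads(raw):
    print(distro['name'], planar_distance(distro['position']))
```

Slicing with position[:2] keeps the function working even if a record carries only x and y.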

The solution I used was:
for f in file_list:
    print('Input file: ' + f)  # Replace with desired operations
    with open(f, 'r') as f:
        distros = json.load(f)
    output_file = '13_01' + str(output_nr) + '.csv'  # output file name may be changed
    with open(output_file, 'w') as text_file:
        for distro in distros:
            position = distro['position']
            # no trailing comma here, otherwise result becomes a one-element tuple
            result = math.sqrt(position[0]*position[0] + position[1]*position[1])
            print(result, file=text_file)
    print('Output written to file: ' + output_file)
    output_nr = output_nr + 1

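Did you check the data types of the arguments you're passing?
def pythagoras(a,b):
    value = math.sqrt(int(a)**2 + int(b)**2)
    return value
This covers the case of integers. Note that int() truncates fractional values, so float() would be the conversion to use for positions such as 18.11.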

Find if rows in a large file contain a substring from a separate list?

I have a large (30GB) file consisting of random terms and sentences. I have two separate lists of words and phrases that I want to apply to it, marking (or alternatively filtering) each row in which a term from one of the lists appears.
If a term from list X appears in a row of the large file, mark it X; if from list Y, mark it Y. When done, take the rows marked X and output them to a file, and do the same for Y in a separate file. My problem is that both of my lists are 1500 terms long and take a while to go through line by line.
After fiddling around for a while, I've arrived at my current method, which filters each chunk on whether a row contains a term. My issue is that it is very slow, and I am wondering if there is a way to speed up my script. I am using Pandas to process the file in chunks of 1 million rows; it takes around 3 minutes to process a chunk with my current method:
import pandas as pd
chunksize = 10 ** 6  # 1 million rows per chunk, as described above
with open('lists/health_whitelist_final.txt', "r") as white_text_file:
    white_list = white_text_file.read().split(',\n')
with open('lists/health_blacklist_final.txt', "r") as black_text_file:
    black_list = black_text_file.read().split(',\n')
# VERTICAL (an output-name prefix) is assumed to be defined elsewhere in the script
for chunk in pd.read_csv('final_cleaned_corpus.csv', chunksize=chunksize, names=['Keyword']):
    print("Chunk")
    chunk_y = chunk[chunk['Keyword'].str.contains('|'.join(white_list), na=False)]
    chunk_y.to_csv(VERTICAL+'_y_list.csv', mode='a', header=None)
    chunk_x = chunk[chunk['Keyword'].str.contains('|'.join(black_list), na=False)]
    chunk_x.to_csv(VERTICAL+'_x_list.csv', mode='a', header=None)
My first attempt was less Pythonic, but the aim was to break out of the loop the first time a term appears. This way was slower than my current one:
def x_or_y(keyword):
    #print(keyword)
    iden = ''
    for item in y_list:
        if item in keyword:
            iden = 'Y'
            break
    for item in x_list:
        if item in keyword:
            iden = 'X'
            break
    return iden
Is there a faster way I'm missing here?

If I understand correctly, you need to do the following:
fileContent = ['yes','foo','junk','yes','foo','junk']
x_list = ['yes','no','maybe-so']
y_list = ['foo','bar','fizzbuzz']
def x_or_y(keyword):
    if keyword in x_list:
        return 'X'
    if keyword in y_list:
        return 'Y'
    return ''
results = map(x_or_y, fileContent)
print(list(results))
Here is an example: https://repl.it/F566
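Note that the snippet above tests whether the whole keyword equals a list entry, whereas the question asks about terms appearing inside a row. A minimal sketch of the substring variant follows, compiling each list once into a single alternation pattern with re.escape so that metacharacters in the terms are matched literally. The term lists, the names build_matcher and label, and the sample rows are all illustrative assumptions, and the approach has not been benchmarked against the 30GB file.

```python
import re

# Illustrative term lists; the real ones would be loaded from the list files.
x_terms = ['yes', 'no', 'maybe-so']
y_terms = ['foo', 'bar', 'fizz.buzz']

def build_matcher(terms):
    """Compile one alternation pattern, escaping regex metacharacters in each term."""
    return re.compile('|'.join(re.escape(t) for t in terms))

# Compile once, outside any per-row or per-chunk loop.
x_re = build_matcher(x_terms)
y_re = build_matcher(y_terms)

def label(row):
    """Return 'X', 'Y', or '' depending on which list has a term inside the row."""
    if x_re.search(row):
        return 'X'
    if y_re.search(row):
        return 'Y'
    return ''

rows = ['say yes to this', 'a foo walks in', 'fizzXbuzz is a term', 'junk']
print([label(r) for r in rows])
```

As written, a row that contains terms from both lists is labelled 'X' only; whether that is the desired behaviour depends on the use case.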

Merging lists obtained by a loop

I've only started python recently but am stuck on a problem.
# function that tells how to read the urls and how to process the data the
# way I need it.
def htmlreader(i):
    # makes variable websites because it is used in a loop.
    pricedata = urllib2.urlopen(
        "http://website.com/" + (",".join(priceids.split(",")[i:i + 200]))).read()
    # here my information processing begins but that is fine.
    pricewebstring = pricedata.split("},{")
    # results in [[1234,2345,3456],[3456,4567,5678]] for example.
    array1 = [re.findall(r"\d+", a) for a in pricewebstring]
    # writes obtained array to my text file
    itemtxt2.write(str(array1) + '\n')
i = 0
while i <= totalitemnumber:
    htmlreader(i)
    i = i + 200
See the comments in the script as well.
This is in a loop and will each time give me an array (defined by array1).
Because I print this to a txt file it results in a txt file with separate arrays.
I need one big array so it needs to merge the results of htmlreader(i).
So my output is something like:
[[1234,2345,3456],[3456,4567,5678]]
[[6789,4567,2345],[3565,1234,2345]]
But I want:
[[1234,2345,3456],[3456,4567,5678],[6789,4567,2345],[3565,1234,2345]]
Any ideas how I can approach this?

Since you want to gather all the elements in a single list, you can simply gather them in another list, by flattening it like this
def htmlreader(i, result):
    ...
    result.extend([re.findall(r"\d+", a) for a in pricewebstring])
i, result = 0, []
while i <= totalitemnumber:
    htmlreader(i, result)
    i = i + 200
itemtxt2.write(str(result) + '\n')
In this case, the result created by re.findall (a list) is added to the result list. Finally, you are writing the entire list as a whole to the file.
If the above shown method is confusing, then change it like this
def htmlreader(i):
    ...
    return [re.findall(r"\d+", a) for a in pricewebstring]
i, result = 0, []
while i <= totalitemnumber:
    result.extend(htmlreader(i))
    i = i + 200
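To see the effect of extend in isolation, here is a small self-contained sketch. The two strings in batches are hypothetical response bodies standing in for what urlopen(...).read() would return, and parse_batch is an illustrative name for the parsing step.

```python
import re

def parse_batch(pricedata):
    """Split one response body into records and pull the digit runs out of each."""
    return [re.findall(r"\d+", a) for a in pricedata.split("},{")]

# Hypothetical response bodies, one per request.
batches = [
    "{1234,2345,3456},{3456,4567,5678}",
    "{6789,4567,2345},{3565,1234,2345}",
]

result = []
for body in batches:
    # extend adds each record to the same list; append would nest one list per batch
    result.extend(parse_batch(body))

print(result)
```

Note that re.findall returns strings, so the merged list holds four records of digit strings rather than integers.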

Python script for transforming and sorting columns in ascending order, decimal cases

I wrote a script in Python removing tabs/blank spaces between two columns of strings (x,y coordinates) plus separating the columns by a comma and listing the maximum and minimum values of each column (2 values for each the x and y coordinates). E.g.:
100000.00 60000.00
200000.00 63000.00
300000.00 62000.00
400000.00 61000.00
500000.00 64000.00
became:
100000.00,60000.00
200000.00,63000.00
300000.00,62000.00
400000.00,61000.00
500000.00,64000.00
10000000 50000000 60000000 640000000
This is the code I used:
import string
input = open(r'C:\coordinates.txt', 'r')
output = open(r'C:\coordinates_new.txt', 'wb')
s = input.readline()
while s <> '':
    s = input.readline()
    liste = s.split()
    x = liste[0]
    y = liste[1]
    output.write(str(x) + ',' + str(y))
    output.write('\n')
    s = input.readline()
input.close()
output.close()
I need to change the above code to also transform the coordinates from two decimal to one decimal values and each of the two new columns to be sorted in ascending order based on the values of the x coordinate (left column).
I started by writing the following but not only is it not sorting the values, it is placing the y coordinates on the left and the x on the right. In addition I don't know how to transform the decimals since the values are strings and the only function I know is using %f and that needs floats. Any suggestions to improve the code below?
import string
input = open(r'C:\coordinates.txt', 'r')
output = open(r'C:\coordinates_sorted.txt', 'wb')
s = input.readline()
while s <> '':
    s = input.readline()
    liste = string.split(s)
    x = liste[0]
    y = liste[1]
    output.write(str(x) + ',' + str(y))
    output.write('\n')
    sorted(s, key=lambda x: x[o])
    s = input.readline()
input.close()
output.close()
thanks!

First, try to format your code according to PEP8—it'll be easier to read. (I've done the cleanup in your post already).
Second, Tim is right in that you should try to learn how to write your code as (idiomatic) Python not just as if translated directly from its C equivalent.
As a starting point, I'll post your 2nd snippet here, refactored as idiomatic Python:
# there is no need to import the `string` module; `.strip()` is a built-in
# method of strings (i.e. objects of type `str`).
# read in the data as a list of pairs of raw (i.e. unparsed) coordinates in
# string form:
with open(r'C:\coordinates.txt') as in_file:
    coords_raw = [line.strip().split() for line in in_file.readlines()]
# convert the raw list into a list of pairs (2-tuples) containing the parsed
# (i.e. float not string) data:
coord_pairs = [(float(x_raw), float(y_raw)) for x_raw, y_raw in coords_raw]
coord_pairs.sort() # you want to sort the entire data set, not just values on
# individual lines as in your original snippet
# build a list of all x and y values we have (this could be done in one line
# using some `zip()` hackery, but I'd like to keep it readable (for you at
# least)):
all_xs = [x for x, y in coord_pairs]
all_ys = [y for x, y in coord_pairs]
# compute min and max:
x_min, x_max = min(all_xs), max(all_xs)
y_min, y_max = min(all_ys), max(all_ys)
# NOTE: the above section performs well for small data sets; for large ones, you
# should combine the 4 lines in a single for loop so as to NOT have to read
# everything to memory and iterate over the data 6 times.
# write everything out
with open(r'C:\coordinates_sorted.txt', 'wb') as out_file:
    # here, we're doing 3 things in one line:
    #   * iterate over all coordinate pairs and convert the pairs to the string
    #     form
    #   * join the string forms with a newline character
    #   * write the result of the join+iterate expression to the file
    out_file.write('\n'.join('%f,%f' % (x, y) for x, y in coord_pairs))
    out_file.write('\n\n')
    out_file.write('%f %f %f %f' % (x_min, x_max, y_min, y_max))
with open(...) as <var_name> gives you guaranteed closing of the file handle as with try-finally; also, it's shorter than open(...) and .close() on separate lines. Also, with can be used for other purposes, but is commonly used for dealing with files. I suggest you look up how to use try-finally as well as with/context managers in Python, in addition to everything else you might have learned here.

Your code looks more like C than like Python; it is quite unidiomatic. I suggest you read the Python tutorial to find some inspiration. For example, iterating using a while loop is usually the wrong approach. The string module is deprecated for the most part, <> should be !=, you don't need to call str() on an object that's already a string...
Then, there are some errors. For example, sorted() returns a sorted version of the iterable you're passing - you need to assign that to something, or the result will be discarded. But you're calling it on a string, anyway, which won't give you the desired result. You also wrote x[o] where you clearly meant x[0].
You should be using something like this (assuming Python 2):
with open(r'C:\coordinates.txt') as infile:
    values = []
    for line in infile:
        values.append(map(float, line.split()))
values.sort()
with open(r'C:\coordinates_sorted.txt', 'w') as outfile:
    for value in values:
        outfile.write("{:.1f},{:.1f}\n".format(*value))
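For readers on Python 3, a rough equivalent is sketched below. It differs in one detail: map() returns a lazy iterator there, so each row is turned into a tuple before sorting. The io.StringIO objects are in-memory stand-ins for the two files and are an assumption made to keep the sketch self-contained.

```python
import io

# In-memory stand-ins for the input and output files (assumption for the sketch).
infile = io.StringIO(
    "300000.00\t62000.00\n"
    "100000.00\t60000.00\n"
    "200000.00\t63000.00\n"
)
outfile = io.StringIO()

# tuple(...) is needed in Python 3, where map() returns a lazy iterator.
# Tuples sort by their first element (x) first, then by y.
values = sorted(tuple(map(float, line.split())) for line in infile if line.strip())

for x, y in values:
    # {:.1f} formats each float with one decimal place
    outfile.write("{:.1f},{:.1f}\n".format(x, y))

print(outfile.getvalue(), end="")
```

Swapping the StringIO objects for open(...) calls inside with blocks gives the file-based version.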

Appending multiple raster properties to a comma-delimited table

First-time post and python newb who has exhausted all other options. I am interested in appending selected raster properties (using the arcpy.GetRasterProperties_management(input_raster, "property_type") function) to a comma-delimited table, but am having trouble figuring out how to do this for multiple results. As an abridged example (of my actual script), I have created two 'for' loops; one for each raster property I am interested in outputting (i.e. Cell Size X, Cell Size Y). My list of rasters include S01Clip_30m through S05Clip_30m. My goal is to create a .txt file that should look something like this:
RasterName, CellSizeX, CellSizeY
S01Clip_30m, 88.9372, 88.9375
S02Clip_30m, 88.9374, 88.9371
The code I have so far is below (with some uncertain, botched syntax at the bottom). When I run it, I get this result:
S05Clip_30m, 88.9374
(last raster in the list, CellSizeY)
I appreciate any help you can provide on the crucial bottom code block.
import arcpy
from arcpy import env
env.workspace = ('C:\\StudyAreas\\Aggregates.gdb')
InFolder = ('C:\\dre\\python\\tables')
OutputFile = open(InFolder + '\\' + 'RasterProps.txt', 'a')
rlist = arcpy.ListRasters('*','*')
for grid in rlist:
    if grid[-8:] == "Clip_30m":
        result = arcpy.GetRasterProperties_management(grid,'CELLSIZEX')
        CellSizeX = result.getOutput(0)
for grid in rlist:
    if grid[-8:] == "Clip_30m":
        result = arcpy.GetRasterProperties_management(grid,'CELLSIZEY')
        CellSizeY = result.getOutput(0)
> I know the syntax below is incorrect, but I know there are *some* elements that
> should be included based on other example scripts that I have...
> if result.getOutput(0) == CellSizeX:
> coltype = CellSizeX
> elif result.getOutput(0) == CellSizeY:
> coltype = CellSizeY
> r = ''.join(grid)
> colname = r[0:]
> OutputFile.writelines(colname+','+coltype+'\n')

After receiving help from another Q&A forum on my script, I am now providing the answer to my own GIS-related question to close this thread (and move to gis.stackexchange :) - thanks to L.Yip's comment). Here is the final corrected script which outputs my two raster properties (Cell Size in X-direction, Cell Size in Y-direction) for a list of rasters into a .txt file:
import arcpy
from arcpy import env
env.workspace = ('C:\\StudyAreas\\Aggregates.gdb')
InFolder = ('C:\\dre\\python\\tables')
OutputFile = open(InFolder + '\\' + 'RasterProps.txt', 'a')
rlist = arcpy.ListRasters('*','*')
for grid in rlist:
    if grid[-8:] == "Clip_30m":
        resultX = arcpy.GetRasterProperties_management(grid,'CELLSIZEX')
        CellSizeX = resultX.getOutput(0)
        resultY = arcpy.GetRasterProperties_management(grid,'CELLSIZEY')
        CellSizeY = resultY.getOutput(0)
        OutputFile.write(grid + ',' + str(CellSizeX) + ',' + str(CellSizeY) + '\n')
OutputFile.close()
My results after running the script:
S01Clip_30m,88.937158083333,88.9371580833333
S02Clip_30m,88.937158083333,88.937158083333
S03Clip_30m,88.9371580833371,88.9371580833333
S04Clip_30m,88.9371580833308,88.937158083333
S05Clip_30m,88.9371580833349,88.937158083333
Thanks!
