Translating a gridded csv file with numpy - python

I need to get some meteorological data into a MySQL database.
File inputFile.csv is a comma-delimited list of values. There are 241 lines and 481 values per line.
Each line maps to a certain latitude, and each value's position within the line maps to a certain longitude.
There are two additional files with the same structure, lat.csv and lon.csv. These files contain the coordinates that the values in inputFile.csv map to.
So to find the latitude and longitude for a value in inputFile.csv, we need to refer to the values at the same line/position (or row/column) within lat.csv and lon.csv.
I want to translate inputFile.csv using lat.csv and lon.csv so that my output file contains a list of values (from inputFile.csv), latitudes, and longitudes.
Here is a small visual example:
inputFile.csv
3,5,1,4,5
1,4,1,2,5
5,7,3,8,0
lat.csv
22,31,51,21,52
55,21,24,66,12
11,23,12,55,55
lon.csv
12,35,12,52,11
35,11,25,33,42
62,53,45,25,54
output:
val lat lon
3 22 12
5 31 35
1 51 12
4 21 52
5 52 11
1 55 35
4 21 11
1 24 25
2 66 33
etc
What is the best way to do this in python/numpy?

I suppose that since you know the total size of the array you want, you can preallocate it:
import numpy as np

a = np.empty((241*481, 3))
Now you can add the data:
for i, fname in enumerate(('inputFile.csv', 'lat.csv', 'lon.csv')):
    with open(fname) as f:
        data = np.fromfile(f, sep=',')
    a[:, i] = data.ravel()
If you don't know the number of elements up front, you can generate a 2-d list instead (a list of np.ndarrays):
alist = []
for fname in ('inputFile.csv', 'lat.csv', 'lon.csv'):
    with open(fname) as f:
        data = np.fromfile(f, sep=',')
    alist.append(data.ravel())
a = np.array(alist).T

Using only numpy functions:
import numpy as np

inputFile = np.genfromtxt('inputFile.csv', delimiter=',').reshape(-1)
lat = np.genfromtxt('lat.csv', delimiter=',').reshape(-1)
lon = np.genfromtxt('lon.csv', delimiter=',').reshape(-1)
# transpose so each row is one (val, lat, lon) triple
output = np.vstack((inputFile, lat, lon)).T
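If you also need to write that out as a text file with one val/lat/lon row per grid point (the question mentions an output file), a minimal sketch with np.savetxt could be (the output filename is just an example):
# one space-separated "val lat lon" row per grid point
np.savetxt('output.txt', output, fmt='%g', header='val lat lon', comments='')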

Read Delimited File That Wraps Lines

I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. These files are delivered from outside my organization, so there is no chance I can change their format.
Firstly, the file is space delimited but the fields that represent a "column" of data can span multiple rows. For example, if you had a row that was supposed to be 25 columns of data, it may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data type format where the first 3 or so lines will be describing the data (including a field that tells me how many rows there are) and the next N rows are data. Then it will go back to the 3 lines format again to describe the next set of data. That means I can't just set up a reader for the N columns format and let it run to EOF.
I'm afraid the built-in Python file-reading functionality could get really ugly really fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.
This is a sketch of how you can proceed:
(EDIT: with some modifications)
f = open("testfile.txt", "r")
# store data for the different sections here
datasections = list()
while True:
    current_row = []
    # read the three header lines
    l1 = f.readline()
    if l1 == '':  # or other end condition
        break
    l2 = f.readline()
    l3 = f.readline()
    # extract the following information from l1, l2, l3
    nrows = ...  # extract the number of rows in the next section
    ncols = ...  # extract the number of columns in the next section
    # loop while len(current_row) < nrows * ncols:
    #     read the next line, isolate the items using str.split()
    #     append the items to current_row
    # break current_row into rows after each ncols-th item
    # store the section's data in datasections as a new array
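To make that concrete, here is a fuller (untested) sketch. It assumes, purely for illustration, that the first header line of each section carries the row and column counts as its first two integers; adjust the header parsing to the real format. The key point is to keep collecting tokens until nrows * ncols values have been read, regardless of how the lines wrap:
import numpy as np

def read_sections(fname):
    """Yield one (nrows, ncols) array per section of the file."""
    with open(fname) as f:
        while True:
            header = f.readline()
            if header == '':              # end of file
                break
            # ASSUMPTION: the first two integers on the header line are the
            # row and column counts of the section that follows
            nrows, ncols = [int(tok) for tok in header.split()[:2]]
            f.readline()                  # skip the remaining two header lines
            f.readline()
            values = []
            while len(values) < nrows * ncols:
                line = f.readline()
                if line == '':
                    raise ValueError('file ended in the middle of a section')
                values.extend(float(tok) for tok in line.split())
            yield np.array(values).reshape(nrows, ncols)

sections = list(read_sections('testfile.txt'))
Because the file is read one line at a time and each section is yielded as soon as it is complete, only one section needs to be held in memory at once, which addresses the RAM concern.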

Python txt matrix from multiple files

How can I convert line-wise frequency distributions from multiple TXT files into a single matrix? Each of the files has exactly the same structure, in that all words/terms/phrases are in the same order and contained in every file. Unique to each file are the filename, an issue date, and the respective frequency of the words/terms/phrases, given by a number after ":". See the following:
What my input files look like:
FilenameA Date:31.12.20XX
('financial' 'statement'):15
('corporate-taxes'):3
('assets'):8
('available-for-sale' 'property'):2
('auditors'):23
I have multiple files which have the exact same order of words/phrases and only differ in the frequency (the number behind ":").
Now I want to create a single file containing a matrix, which keeps all words as top column and attaches the file characteristics (filename, date and frequencies) as row wise entries:
Desired Output:
Filename Date ('financial' 'statement') ('corporate-taxes') ... ('auditors')
A 2008 15 3 23
B 2010 9 6 11
C 2013 1 8 4
...
Really appreciate any help, would be great to have a loop which reads all files from a directory and outputs the above matrix.
The following code should help you:
import os

# Compute matrix
titles = ['Filename', 'Date']
matrix = [titles]
for directory, __, files in os.walk('files'):  # replace with your directory
    for filename in files:
        with open(os.path.join(directory, filename)) as f:
            name, date = f.readline().strip().split()
            row = [name[8:], date.split('.')[-1]]
            for line in f:
                header, value = line.strip().split(':')
                if len(matrix) == 1:
                    titles.append(header)
                row.append(value)
            matrix.append(row)

# Work out column widths
column_widths = [0] * len(titles)
for row in matrix:
    for column, data in enumerate(row):
        column_widths[column] = max(column_widths[column], len(data))
formats = ['{:%s%ss}' % ('^' if c > 1 else '<', w) for c, w in enumerate(column_widths)]

# Print matrix
for row in matrix:
    for column, data in enumerate(row):
        print formats[column].format(data),
    print
Sample output:
Filename Date ('financial' 'statement') ('corporate-taxes') ('assets') ('available-for-sale' 'property') ('auditors')
A 2012 15 3 8 2 23
B 2010 9 6 8 2 11
C 2010 1 8 8 2 4
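If you would rather write the matrix to a file than print it, a minimal sketch using the csv module (the output filename is just an example) could be:
import csv

# write the same matrix as a tab-separated file instead of printing it
with open('matrix.tsv', 'wb') as out:  # on Python 3 use open('matrix.tsv', 'w', newline='')
    writer = csv.writer(out, delimiter='\t')
    writer.writerows(matrix)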

Selecting columns based on external list/data in python

I have a data set with various Region map variables (around 1000). Sample data looks like:
Userid regionmap1 regionmap2 regionmap3 and so on.
78 7 na na
45 na na na
67 1 na na
Here the numbers in the regionmap variables represent the number of views. Now I have an external file with only 10 regionmap entries; it contains 10 rows, each naming a different regionmap variable:
Regionmap1
Regionmap3
Regionmap7
.....
.....
Regionmap856.
So my task is to keep only these regionmap variables as columns in the original file and delete all the other 990 columns. So the final data should look like:
Userid Regionmap1 regionmap3 regionmap7 ........ regionmap856
78 7 na na na
45 na na na na
67 1 na na na
It would be great if anyone could help me with this in Python.
This is pretty easy to do. What have you tried?
Here's a general procedure to help you get started:
1 - open the smaller file with the regionmaps you want to keep and read those lines into a list.
2 - open the larger file and create a dictionary of lists to contain the data. You can think of the dict's keys as basically column headers. The values are lists that represent the column values for all your records.
3 - now, remove key-value pairs from your dict where the key is not in your list from step 1 and is not userid.
4 - use the resulting dict to write out a new file.
Definitely not the only approach, but it's a simple one that you should be able to start with; a rough sketch of it follows below. Hope that helps :)
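A rough, untested sketch of that procedure (the filenames are placeholders and both files are assumed to be whitespace-delimited; it only keeps requested columns that actually appear in the large file):
# 1 - read the regionmap names to keep into a list
with open('regionmaps_to_keep.txt') as f:      # placeholder filename
    keep = [line.strip().lower() for line in f if line.strip()]

# 2 - build a dict of lists from the large file: header -> column values
with open('large_file.txt') as f:              # placeholder filename
    headers = f.readline().split()
    columns = {h: [] for h in headers}
    for line in f:
        for h, value in zip(headers, line.split()):
            columns[h].append(value)

# 3 - drop the columns that are neither userid nor in the keep list
wanted = ['userid'] + keep
columns = {h: col for h, col in columns.items() if h.lower() in wanted}

# 4 - write the surviving columns back out, preserving the original order
kept_headers = [h for h in headers if h.lower() in wanted]
with open('filtered_file.txt', 'w') as out:
    out.write(' '.join(kept_headers) + '\n')
    for row in zip(*[columns[h] for h in kept_headers]):
        out.write(' '.join(row) + '\n')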
Here is a solution adapted to your problem. You can tweak the formatting afterwards to make the output file look nicer.
import StringIO
import numpy as np

# Preparing an object that simulates a file (f is the "file")
f = StringIO.StringIO()
f.write("""Userid regionmap1 regionmap2 regionmap3
78 7 na na
45 na na na
67 1 na na""")
f.seek(0)

# Reading the file and getting the header (1st line)
head = f.readline().strip("\n").split()
data = []
for a in f:
    data.append([float(e) for e in a.replace('na', 'NaN').split()])
data = np.array(data)

# Columns to keep
s = ("Regionmap1", "Regionmap3")
s = map(lambda e: e.lower(), s)
s = ["Userid", ] + s
# Indices of the columns to keep
idx, = np.where([e in s for e in head])

# Saving the new data in a file (simulated with StringIO)
ff = StringIO.StringIO()
ff.write(' '.join(tuple(s)) + '\n')
np.savetxt(ff, data[:, idx])
The rendered file looks like:
Userid regionmap1 regionmap3
7.800000000000000000e+01 7.000000000000000000e+00 nan
4.500000000000000000e+01 nan nan
6.700000000000000000e+01 1.000000000000000000e+00 nan
Try this! This code builds a dictionary with the headers as keys and the lists of column values as values:
f = open('2.txt', 'r')  # opening the large file
data = f.readlines()
f.close()
hdrs = data[0].split('\t')  # assuming the large file is tab-separated, and the first line is the header line
data_dict = {}  # main data
for each_line in data[1:]:  # starting from the second line, as the first line is the header line
    splitdata = each_line.split('\t')  # splitting the line on tabs
    for i, d in enumerate(splitdata):
        tmpval = data_dict.get(hdrs[i], [])
        tmpval.append(d)
        data_dict[hdrs[i]] = tmpval  # appending the column value to its respective header
for k, v in data_dict.items():  # printing the final data dict
    print k, v
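This builds the full dictionary; to finish the job you would still filter it against the external list of regionmaps and write the kept columns out, for example (a sketch, with '1.txt' as a placeholder name for the external file):
# keep only the Userid column plus the regionmaps listed in the external file
with open('1.txt') as f:  # placeholder name for the external list of regionmaps
    wanted = set(line.strip().lower() for line in f if line.strip())
filtered = {k: v for k, v in data_dict.items()
            if k.strip().lower() in wanted or k.strip().lower() == 'userid'}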

how to convert text values in an array to float and add in python

I have a text file containing the values:
120 25 50
149 33 50
095 41 50
093 05 50
081 11 50
I extracted the values in the first column and put them into an array called adjusted.
How do I convert the values from text to float and add 5 to each of them using a for loop?
My desired output is:
125
154
100
098
086
Here is my code:
adjusted = [(adj + (y)) for y in floats]
A1 = adjusted[0:1]
A2 = adjusted[1:2]
A3 = adjusted[2:3]
A4 = adjusted[3:4]
A5 = adjusted[4:5]
print A1
print A2
print A3
print A4
print A5
A11= [float(x) for x in adjusted]
FbearingBC = 5 + float(A11)
print FbearingBC
It gives me an error saying I can't add float and string.
Please help.
Assuming that you have:
adjusted = ['120', '149', '095', ...]
The simplest way to convert to float and add five is:
converted = [float(s) + 5 for s in adjusted]
This will create a new list, iterating over each string s in the old list, converting it to float and adding 5.
Assigning each value in the list to a separate name (A1, etc.) is not a good way to proceed; this will break if you have more or fewer entries than expected. It is much cleaner and less error-prone to access each value by index (e.g. adjusted[0]) or iterate over them (e.g. for a in adjusted:).
This code should work:
with open('your_text_file.txt', 'rt') as f:
    for line in f:
        print(float(line.split()[0]) + 5)
It will display:
125.0
154.0
100.0
98.0
86.0
Or if you need all your values in a list:
with open('your_text_file.txt', 'rt') as f:
    values = [float(line.split()[0]) + 5 for line in f]
print(values)
Since you are reading the data from a file, this could be done as below:
with open('data.txt', 'r+') as f:  # where data.txt is the file containing your data
    for line in f.readlines():  # get one line of data from the file
        number = float(line.split(' ')[0])  # split the line on ' ' and take the first number
        print "%03d" % (number + 5)  # add 5 to it and print it using the format "%03d"
Hope this helps.
>>> [float(value)+5 for value in adjusted]
[125.0, 154.0, 100.0, 98.0, 86.0]
This is a list comprehension, so it converts to float, adds 5, and collects the results into a list all in one line. However, if all of your numbers are integers you should convert them to int so you don't have all of the .0s.
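If you also want the zero-padded strings shown in the desired output (098, 086 and so on), one way (just a sketch) is to format each value as a three-digit integer:
# format each adjusted value as a zero-padded 3-digit integer string
padded = ['%03d' % (int(value) + 5) for value in adjusted]
# padded == ['125', '154', '100', '098', '086']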

Calculating an average for every X number of lines

I am trying to take data from a text file and calculate an average for every 600 lines of that file. I'm loading the text from the file, putting it into a numpy array and enumerating it. I can get the average for the first 600 lines but I'm not sure how to write a loop so that python calculates an average for every 600 lines and then puts this into a new text file. Here is my code so far:
import numpy as np
#loads file and places it in array
data = np.loadtxt('244UTZ10htz.txt', delimiter = '\t', skiprows = 2)
shape = np.shape(data)
#creates array for u wind values
for i, d in enumerate(data):
    data[i] = (d[3])
    if i == 600:
        minavg = np.mean(data[i == 600])
#finds total u mean for day
ubar = np.mean(data)
Based on what I understand from your question, it sounds like you have some file that you want to take the mean of every line up to the 600th one, and repeat that multiple times till there is no more data. So at line 600 you average lines 0 - 600, at line 1200 you average lines 600 to 1200.
Modulus division would be one approach to taking the average when you hit every 600th line, without having to use a separate variable to keep count of how many lines you've looped through. Additionally, I used Numpy Array Slicing to create a view of the original data, containing only the 4th column of the data set.
This example should do what you want, but it is entirely untested... I'm also not terribly familiar with numpy, so there are better ways to do this, as mentioned in the other answers:
import numpy as np

# loads file and places it in an array
data = np.loadtxt('244UTZ10htz.txt', delimiter='\t', skiprows=2)
shape = np.shape(data)

# view of the u wind values (4th column)
data_you_want = data[:, 3]

daily_averages = list()
# every 600th line, average the previous 600 values
for i, d in enumerate(data_you_want):
    if (i % 600) == 0 and i != 0:
        avg_for_day = np.mean(data_you_want[i - 600:i])
        daily_averages.append(avg_for_day)
You can either modify the example above to write the mean out to a new file, instead of appending to a list as I have done, or just write the daily_averages list out to whatever file you want.
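For example (a small sketch; the output filename is just a placeholder), the list could be written out afterwards with numpy:
# one daily average per line in the output file
np.savetxt('daily_averages.txt', daily_averages, fmt='%.4f')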
As a bonus, here is a Python solution using only the CSV library. It hasn't been tested much, but theoretically should work and might be fairly easy to understand for someone new to Python.
import csv

data = list()
daily_average = list()
num_lines = 600
with open('testme.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t")
    for i, row in enumerate(reader):
        if (i % num_lines) == 0 and i != 0:
            # float() avoids integer division when averaging
            average = sum(data[i - num_lines:i]) / float(num_lines)
            daily_average.append(average)
        data.append(int(row[3]))
Hope this helps!
A simple solution would be:
import numpy as np

data = np.loadtxt('244UTZ10htz.txt', delimiter='\t', skiprows=2)
mydata = []
counter = 0
for i, d in enumerate(data):
    mydata.append(d[3])
    counter += 1
    # find the average of the previous 600 lines
    if counter == 600:
        minavg = np.mean(np.asarray(mydata))
        # reset the counter and start collecting again from zero
        counter = 0
        mydata = []
The following program uses array slicing to get the column, and then a list comprehension indexing into the column to get the means. It might be simpler to use a for loop for the latter.
Slicing / indexing into the array rather than creating new objects also has the advantage of speed as you're just creating new views into existing data.
import numpy as np

# test data
nr = 11
nc = 3
a = np.array([np.array(range(nc)) + i*10 for i in range(nr)])
print a

# slice to get the column
col = a[:, 1]
print col

# comprehension stepping through the column to get the means
numpermean = 2
means = [np.mean(col[i:min(len(col), i + numpermean)])
         for i in range(0, len(col), numpermean)]
print means
it prints
[[ 0 1 2]
[ 10 11 12]
[ 20 21 22]
[ 30 31 32]
[ 40 41 42]
[ 50 51 52]
[ 60 61 62]
[ 70 71 72]
[ 80 81 82]
[ 90 91 92]
[100 101 102]]
[ 1 11 21 31 41 51 61 71 81 91 101]
[6.0, 26.0, 46.0, 66.0, 86.0, 101.0]
Something like this works. Maybe not that readable. But should be fairly fast.
n = int(data.shape[0]/600)
interestingData = data[:,3]
daily_averages = np.mean(interestingData[:600*n].reshape(-1, 600), axis=1)
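To unpack that: the slice interestingData[:600*n] drops any leftover lines that don't fill a complete block of 600, and reshape(-1, 600) turns the remaining values into one row per block, so np.mean(..., axis=1) averages each block. A tiny self-contained example of the same idea (block size 3 instead of 600, made-up values):
import numpy as np

values = np.arange(10.0)      # 10 values with a block size of 3, so the last value is dropped
block = 3
n = values.shape[0] // block  # number of complete blocks
block_means = np.mean(values[:block * n].reshape(-1, block), axis=1)
# block_means == array([1., 4., 7.])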
