How to make a 2-dimensional array in Python with document data - python

I want to make a two-dimensional array with string values from my CSV data document, but I am having trouble with indexes
my data =
1.alquran,tunjuk,taqwa,perintah,larang,manfaat
2.taqwa,ghaib,allah,malaikat,surga,neraka,rasul,iman,ibadah,manfaat,taat,ridha
3.taqwa,alquran,hadist,kitab,allah,akhirat,ciri
in a CSV document
def ubah(kata):
    a = []
    for i in range(0, 19):
        a.append([kata.values[i, j] for j in range(0, 13)])
    return a
and the desired result is
[['alquran', 'tunjuk', 'taqwa', 'perintah', 'larang', 'manfaat'],
 ['taqwa', 'ghaib', 'allah', 'malaikat', 'surga', 'neraka', 'rasul', 'iman', 'ibadah', 'manfaat', 'taat', 'ridha'],
 ['taqwa', 'alquran', 'hadist', 'kitab', 'allah', 'akhirat', 'ciri']]

You can modify your function as:
def ubah(kata):
    a = []
    line = kata.split("\n")  # will create an array of rows
    for i in range(len(line)):
        a.append(line[i].split(","))  # will add the separated values
    return a

with open("data.csv", 'r') as df:
    kata = df.read()

dataarray = ubah(kata)  # calling the function
print(dataarray)
The above program gives the result you want:
[['alquran', 'tunjuk', 'taqwa', 'perintah', 'larang', 'manfaat'], ['taqwa', 'ghaib', 'allah', 'malaikat', 'surga', 'neraka', 'rasul', 'iman', 'ibadah', 'manfaat', 'taat', 'ridha'], ['taqwa', 'alquran', 'hadist', 'kitab', 'allah', 'akhirat', 'ciri']]
Hope this helps.
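As a side note, Python's built-in csv module can do the row splitting for you and copes with rows of unequal length, so no fixed range(0, 13) index is needed. A minimal sketch, assuming the file is named data.csv with one record per line (and that the leading 1., 2., 3. numbering shown above is not actually part of the file):
import csv

def ubah(path):
    with open(path, 'r') as f:
        # csv.reader returns each row as a list of strings,
        # regardless of how many values the row contains
        return [row for row in csv.reader(f)]

dataarray = ubah("data.csv")
print(dataarray)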

Please remove values[i,j] from the for loop and replace it with j.
for i in range(0, 19):
    a.append([kata[j] for j in range(0, 3)])

Related

How to read data from text file into array with Python

I have a bit trouble with some data stored in a text file on hand for regression analysis using Python.
The data are stored in the format that look like this:
2104,3,399900 1600,3,329900 2400,3,369000 ....
I need to do some analysis like finding mean by this:
(2104+1600+...)/number of data
I think the appropriate step is to store the data in an array, but I have no idea how to store it. I can think of two ways to do so. The first one is to use 3 arrays that store the values like
a = [2104 1600 2400 ...]
b = [3 3 3 ...]
c = [399900 329900 369000 ...]
The second way is to store them as
a = [2104 3 399900], b = [1600 3 329900], and so on.
Which one is better?
Also, how do I write code so that the data can be stored into an array? I was thinking of something like this:
with open("file.txt", "r") as ins:
    array = []
    elt.strip(',."\'?!*:') for line in ins:
        array.append(line)
Is that correct?
You could use:
with open('data.txt') as data:
    substrings = data.read().split()

values = [map(int, substring.split(',')) for substring in substrings]
average = sum([a for a, b, c in values]) / float(len(values))
print average
With this data.txt:
2104,3,399900 1600,3,329900 2400,3,369000
2105,3,399900 1601,3,329900 2401,3,369000
It outputs :
2035.16666667
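Note that this snippet is Python 2 (print is a statement and map returns a list that can be unpacked). A rough Python 3 equivalent of the same idea, under the same data.txt assumption:
with open('data.txt') as data:
    substrings = data.read().split()

# list(...) is needed in Python 3 because map returns an iterator
values = [list(map(int, s.split(','))) for s in substrings]
average = sum(a for a, b, c in values) / len(values)  # / is true division in Python 3
print(average)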
Using pandas and numpy you can get the data into an array as follows (this is a Python 2 / IPython session; it assumes import pandas as pd and import StringIO beforehand):
In [37]: data = "2104,3,399900 1600,3,329900 2400,3,369000"
In [38]: d = pd.read_csv(StringIO.StringIO(data), sep=',| ', header=None, index_col=None, engine="python")
In [39]: d.values.reshape(3, d.shape[1]/3)
Out[39]:
array([[  2104,      3, 399900],
       [  1600,      3, 329900],
       [  2400,      3, 369000]])
Instead of having multiple arrays a, b, c... you could store your data as an array of arrays (a 2 dimensional array). For example:
[[2104,3,399900],
[1600,3,329900],
[2400,3,369000]...]
This way you don't have to deal with dynamically naming your arrays. Whether you store your data as 3 arrays of length n or n arrays of length 3 is up to you; I would prefer the second way. To read the data into your array you should then use the split() function, which will split your input into a list. So in your case:
with open("file.txt", "r") as ins:
tmp = ins.read().split(" ")
array = [i.split(",") for i in tmp]
>>> array
[['2104', '3', '399900'], ['1600', '3', '329900'], ['2400', '3', '369000']]
Edit:
To find the mean e.g. for the first element in each list you could do the following:
arraymean = sum([int(i[0]) for i in array]) / len(array)
Where the 0 in i[0] specifies the first element in each list. Note that this code uses list comprehension, which you can learn more about in this post if you want to.
Also, this code stores the values in the array as strings, hence the cast to int in the part that gets the mean. If you want to store the data as int directly, just edit the file-reading section:
array = [[int(j) for j in i.split(",")] for i in tmp]
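If you want the mean of every column rather than just the first, one possible extension is to transpose the list of lists with zip (this assumes the int-converting version of array just above):
# zip(*array) groups all first elements together, then all second elements, etc.
column_means = [sum(col) / float(len(col)) for col in zip(*array)]
print(column_means)  # [2034.666..., 3.0, 366266.666...] for the sample data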
This is a quick solution without error checking (using a list comprehension technique, PEP 202). If your file has a consistent format, you can do the following:
import numpy as np
a = np.array([np.array(i.split(",")).astype("float") for i in open("example.txt").read().split(" ")])
Should you print it:
print(a)
print("Mean of column 0: ", np.mean(a[:, 0]))
You'll obtain the following:
[[  2.10400000e+03   3.00000000e+00   3.99900000e+05]
 [  1.60000000e+03   3.00000000e+00   3.29900000e+05]
 [  2.40000000e+03   3.00000000e+00   3.69000000e+05]]
Mean of column 0:  2034.66666667
Notice how, in the code snippet, I specified the "," as the separator inside each triplet, and the space " " as the separator between triplets. This is the exact content of the file I used as an example:
2104,3,399900 1600,3,329900 2400,3,369000

Find if rows in a large file contain a substring from a separate list?

I have a large (30GB) file consisting of random terms and sentences. I have two separate lists of words and phrases I want to apply to it, marking (or alternatively filtering) each row in which a term from a list appears.
If a term from list X appears in a row of the large file, mark it X; if a term from list Y appears, mark it Y. When done, write the rows marked X to one file, and likewise the rows marked Y to a separate file. My problem is that both of my lists are 1500 terms long and take a while to go through line by line.
After fidgeting around for a while, I've arrived at my current method, which filters each chunk on whether it contains a term. My issue is that it is very slow, and I am wondering if there is a way to speed up my script. I was using Pandas to process the file in chunks of 1 million rows; it takes around 3 minutes to process a chunk with my current method:
white_text_file = open('lists/health_whitelist_final.txt', "r")
white_list = white_text_file.read().split(',\n')

black_text_file = open('lists/health_blacklist_final.txt', "r")
black_list = black_text_file.read().split(',\n')

for chunk in pd.read_csv('final_cleaned_corpus.csv', chunksize=chunksize, names=['Keyword']):
    print("Chunk")
    chunk_y = chunk[chunk['Keyword'].str.contains('|'.join(white_list), na=False)]
    chunk_y.to_csv(VERTICAL+'_y_list.csv', mode='a', header=None)
    chunk_x = chunk[chunk['Keyword'].str.contains('|'.join(black_list), na=False)]
    chunk_x.to_csv(VERTICAL+'_x_list.csv', mode='a', header=None)
My first attempt was less pythonic, but the aim was to break out of the loop the first time an element appears; this way was slower than my current one:
def x_or_y(keyword):
    # print(keyword)
    iden = ''
    for item in y_list:
        if item in keyword:
            iden = 'Y'
            break
    for item in x_list:
        if item in keyword:
            iden = 'X'
            break
    return iden
Is there a faster way I'm missing here?
If I understand correctly, you need to do the following:
fileContent = ['yes', 'foo', 'junk', 'yes', 'foo', 'junk']
x_list = ['yes', 'no', 'maybe-so']
y_list = ['foo', 'bar', 'fizzbuzz']

def x_or_y(keyword):
    if keyword in x_list:
        return 'X'
    if keyword in y_list:
        return 'Y'
    return ''

results = map(x_or_y, fileContent)
print(list(results))
Here is an example: https://repl.it/F566
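Note that the sketch above tests exact membership, while the question needs substring matches. Back in the pandas approach, one optimization worth trying is escaping the terms and compiling the alternation pattern once, outside the chunk loop, instead of re-joining 1500 terms for every chunk; pandas' str.contains accepts a precompiled regex. A rough sketch reusing the names from the question (white_list, black_list, VERTICAL, chunksize):
import re
import pandas as pd

# Build each pattern once; re.escape keeps terms containing regex
# metacharacters from breaking the pattern
white_re = re.compile('|'.join(re.escape(t) for t in white_list))
black_re = re.compile('|'.join(re.escape(t) for t in black_list))

for chunk in pd.read_csv('final_cleaned_corpus.csv', chunksize=chunksize,
                         names=['Keyword']):
    chunk_y = chunk[chunk['Keyword'].str.contains(white_re, na=False)]
    chunk_y.to_csv(VERTICAL + '_y_list.csv', mode='a', header=None)
    chunk_x = chunk[chunk['Keyword'].str.contains(black_re, na=False)]
    chunk_x.to_csv(VERTICAL + '_x_list.csv', mode='a', header=None)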

Vector data from a file

I am just starting out with Python. I have some fortran and some Matlab skills, but I am by no means a coder. I need to post-process some output files.
I can't figure out how to read each value into the respective variable. The data looks something like this:
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
...
In my limited Matlab notation, I would like to assign the following variables of the form tN1(i,j) something like this:
tN1(1,1)=5097600; tN1(1,2)=5443200; tN2(1,3)=8467200; #time between 'h' and 'N#'
hmN1(1,1)=2348.13; hmN1(1,2)=2348.12; hmN2(1,3)=2348.11; #value in 2nd column
hsN1(1,1)=2348.35; hsN1(1,2)=2348.36; hsN2(1,3)=2348.39; #value in 3rd column
I will have about 30 sets, or tN1(1:30,1:j); hmN1(1:30,1:j);hsN1(1:30,1:j)
I know it may not seem like it, but I have been trying to figure this out for 2 days now. I am trying to learn this on my own and it seems I am missing something fundamental in my understanding of python.
I wrote a simple script which does what you ask. It creates three dictionaries, t, hm and hs, which have the N values as keys.
import csv
import re

path = 'vector_data.txt'

# Using the <with func as obj> syntax handles the closing of the file for you.
with open(path) as in_file:
    # Use the csv package to read csv files
    csv_reader = csv.reader(in_file, delimiter=' ')
    # Create empty dictionaries to store the values
    t = dict()
    hm = dict()
    hs = dict()
    # Iterate over all rows
    for row in csv_reader:
        # Get the <n> and <t_i> values by using regular expressions; only
        # save the integer part (hence [1:] and [1:-1])
        n = int(re.findall('N[0-9]+', row[0])[0][1:])
        t_i = int(re.findall('h.+N', row[0])[0][1:-1])
        # Cast the other values to float
        hm_i = float(row[1])
        hs_i = float(row[2])
        # Try to append the values to an existing list in the dictionaries.
        # If that fails, new lists are added to the dictionaries.
        try:
            t[n].append(t_i)
            hm[n].append(hm_i)
            hs[n].append(hs_i)
        except KeyError:
            t[n] = [t_i]
            hm[n] = [hm_i]
            hs[n] = [hs_i]
Output:
>> t
{1: [5097600, 5443200], 2: [8467200]}
>> hm
{1: [2348.13, 2348.12], 2: [2348.11]}
>> hs
{1: [2348.35, 2348.36], 2: [2348.39]}
(remember that Python uses zero-indexing)
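Since you are coming from Matlab: if you then want numeric arrays rather than plain lists (closer to your tN1, hmN1 variables), the dictionary entries convert directly. A small sketch, assuming the t, hm, hs dictionaries built above and numpy imported as np:
import numpy as np

tN1 = np.array(t[1])     # array([5097600, 5443200])
hmN1 = np.array(hm[1])   # array([2348.13, 2348.12])
hsN2 = np.array(hs[2])   # array([2348.39])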
Thanks for all your comments. Suggested readings led to other things which helped. Here is what I came up with:
if len(line) >= 45:
    if line[0:45] == " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS":  #! indicates data to follow, after 4 lines of junk text
        for i in range(0, 4):
            junk = file.readline()
        for i in range(0, int(nobs)):
            line = file.readline()
            sline = line.split()
            obsname.append(sline[0])
            hm.append(sline[1])
            hs.append(sline[2])

How to convert list into array in a function - python?

I defined a function (procedure) to read a file. I want it to return arrays with the data I read from the file, as follows:
import csv
import numpy as np
import matplotlib.pyplot as plt

# Subroutine to read the day, Ta, Tp from a file and convert them into arrays
def readstr(fname, day, Ta, Tp):
    van = open(fname, 'r')
    van_csv = van.readlines()[7:]  # Skip seven lines
    van.close()  # close the file
    van_csv = csv.reader(van_csv)  # now the file is separated by columns
    for row in van_csv:  # Passing the values of each column to the lists
        day.append(row[1])
        Ta.append(row[8])
        Tp.append(row[7])
    day = np.array(day, dtype=np.integer)
    Ta = np.array(Ta, dtype=np.float)
    Tp = np.array(Tp, dtype=np.float)

van = "file"

# Defining the lists
dayVan = []
Tav = []
Tpv = []

readstr(van, dayVan, Tav, Tpv)
print Tav
I thought it would work, but dayVan, Tpv and Tav remain lists.
The line
Ta = np.array(Ta, dtype=np.float)
creates a new array object from the contents of the list Ta and then assigns this array to the local identifier Ta. It does not change the global name that references the list.
Python doesn't have "variables". It has identifiers. When doing a = b you simply say "bind the name a to the object bound to b". The a is simply a label that can be used to retrieve an object. If you then do a = 0 you are re-binding the label a but this does not affect the object bound to b. The identifiers are not memory locations.
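A tiny demonstration of that rebinding behaviour:
b = [1, 2, 3]
a = b        # a and b are two names bound to the same list object
a = 0        # rebinds the name a only; b still names the list
print(b)     # [1, 2, 3]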
To pass the resulting arrays out of the function you can:
Return them and re-assign the global Ta.
Assign directly to the global variable. However, in order to do this the local Ta would need a new name and you'd have to use the global statement. (Note: avoid this solution.)
The transformation is done correctly, but only inside your function.
Try returning day, Ta and Tp at the end of your function and receiving them from the caller; it will work better.
def readstr(fname):
    van = open(fname, 'r')
    van_csv = van.readlines()[7:]  # Skip seven lines
    van.close()  # close the file
    van_csv = csv.reader(van_csv)  # now the file is separated by columns
    day, Ta, Tp = [], [], []
    for row in van_csv:  # Passing the values of each column to the lists
        day.append(row[1])
        Ta.append(row[8])
        Tp.append(row[7])
    day = np.array(day, dtype=np.integer)
    Ta = np.array(Ta, dtype=np.float)
    Tp = np.array(Tp, dtype=np.float)
    return day, Ta, Tp

dayVan, Tav, Tpv = readstr(van)
Perhaps you can simply do:
dayVan, Tpv, Tav = np.loadtxt(fname, usecols=(1,7,8), skiprows=7, delimiter=',', unpack=True)

Why doesn't this return the average of the column of the CSV file?

def averager(filename):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    avgr = []
    final = ""
    x = 0
    i = 0
    while i < range(len(avg[0])):
        while x < range(len(avg)):
            avgr += str((avg[x[i]]))
            x += 1
        final += str((sum(avgr) / (len(avgr))))
        clear(avgr)
        i += 1
    return final
The error I get is:
File "C:\Users\konrad\Desktop\exp\trail3.py", line 11, in averager
avgr+=str((avg[x[i]]))
TypeError: 'int' object has no attribute '__getitem__'
x is just an integer, so you can't index it.
So, this:
x[i]
should never work. That's what the error is complaining about.
UPDATE
Since you asked for a recommendation on how to simplify your code (in a below comment), here goes:
Assuming your CSV file looks something like:
-9,2,12,90...
1423,1,51,-12...
...
You can read the file in like this:
with open(<filename>, 'r') as file_reader:
    file_lines = file_reader.read().split('\n')
Notice that I used .split('\n'). This causes the file's contents to be stored in file_lines as, well, a list of the lines in the file.
So, assuming you want the ith column to be summed, this can easily be done with comprehensions:
ith_col_sum = sum(float(line.split(',')[i]) for line in file_lines if line)
So then to average it all out you could just divide the sum by the number of lines:
average = ith_col_sum / len(file_lines)
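One caveat with the len-based average: if the file ends with a newline, split('\n') leaves a trailing empty string, which the if line guard skips in the sum but len(file_lines) still counts. A guarded variant:
data_lines = [line for line in file_lines if line]  # drop empty lines first
ith_col_sum = sum(float(line.split(',')[i]) for line in data_lines)
average = ith_col_sum / len(data_lines)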
Others have pointed out the root cause of your error. Here is a different way to write your method:
import csv

def csv_average(filename, column):
    """ Returns the average of the values in
        column for the csv file """
    column_values = []
    with open(filename) as f:
        reader = csv.reader(f)
        for row in reader:
            # csv gives strings, so convert before summing
            column_values.append(float(row[column]))
    return sum(column_values) / len(column_values)
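Usage would then look something like this ('data.csv' is a placeholder filename; column indexes are zero-based):
print(csv_average('data.csv', 0))  # average of the first column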
Let's pick through this code:
def averager(filename):
averager as a name is not as clear as it could be. How about averagecsv, for example?
f=open(filename, "r")
avg=f.readlines()
avg is poorly named. It isn't the average of everything! It's a bunch of lines. Call it csvlines for example.
f.close()
avgr=[]
avgr is poorly named. What is it? Names should be meaningful, otherwise why give them?
final=""
x=0
i=0
while i < range(len(avg[0])):
while x < range(len(avg)):
As mentioned in comments, you can replace these with for loops, as in for i in range(len(avg[0])):. This saves you from needing to declare and increment the variable in question.
avgr+=str((avg[x[i]]))
Huh? Let's break this line down.
The poorly named avg is our lines from the csv file.
So, we index into avg by x, okay, that would give us the line number x. But... x[i] is meaningless, since x is an integer, and integers don't support array access. I guess what you're trying to do here is... split the file into rows, then the rows into columns, since it's csv. Right?
So let's ditch the code. You want something like this, using the split http://docs.python.org/2/library/stdtypes.html#str.split function:
totalaverage = 0
for col in range(len(csvlines[0].split(","))):
    average = 0
    for row in range(len(csvlines)):
        average += int(csvlines[row].split(",")[col])
    totalaverage += average / len(csvlines)
return totalaverage
BUT wait! There's more! Python has a built-in csv parser that is safer than splitting by ",". Check it out here: http://docs.python.org/2/library/csv.html
In response to OP asking how he should go about this in one of the comments, here is my suggestion:
import csv
from collections import defaultdict

with open('numcsv.csv') as f:
    reader = csv.reader(f)
    # defaultdict(list) is used so each column starts with a list we can append to
    numbers = defaultdict(list)
    for row in reader:
        for column, value in enumerate(row, start=1):
            # convert the value to a float: 1. the number may be a float and
            # 2. when we calc the average we need to force float division
            numbers[column].append(float(value))

# simple comprehension to print the averages: %d = integer, %f = float.
# items() goes over key,value pairs
print('\n'.join(["Column %d had average of: %f" % (i, sum(column) / (len(column)))
                 for i, column in numbers.items()]))
Producing
>>>
Column 1 had average of: 2.400000
Column 2 had average of: 2.000000
Column 3 had average of: 1.800000
For a file:
1,2,3
1,2,3
3,2,1
3,2,1
4,2,1
Here are two methods. The first one just gets the average for a line (which is what your code above looks like it's doing). The second gets the average for a column (which is what your question asked):
''' This just gets the avg for a line '''
def averager(filename):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0
    for i in xrange(len(avg)):
        count += len(avg[i])
    return count / len(avg)

''' This gets the avg for all "columns"
    char is what we split on: , ; | (etc)
'''
def averager2(filename, char):
    f = open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0  # count of items
    total = 0  # sum of all the lengths
    for i in xrange(len(avg)):
        cols = avg[i].split(char)
        count += len(cols)
        for j in xrange(len(cols)):
            total += len(cols[j].strip())  # Remove line endings
    return total / float(count)
