I'm trying to sum some values in a list, so I loaded the .dat file that contains the values, but the only way I can get Python to sum the data is by separating it with ','. Now, this is what I get:
altura = np.loadtxt("bio.dat",delimiter=',',usecols=(5,),dtype='float')
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 846, in loadtxt
vals = [vals[i] for i in usecols]
IndexError: list index out of range
This is my code:
import numpy as np
altura = np.loadtxt("bio.dat",delimiter=',',usecols=(5,),dtype='str')
print altura
And this is the file 'bio.dat':
1 Katherine Oquendo M 18 1.50 50
2 Pablo Restrepo H 20 1.83 79
3 Ana Agudelo M 18 1.58 45
4 Kevin Vargas H 20 1.74 80
5 Pablo madrid H 20 1.70 55
What I intend to do is:
x=sum(altura)
What should I do about the separator?
In my case, some lines included a # character. numpy then ignores the rest of such a line, because # marks a comment. So try again with the comments parameter, like:
altura = np.loadtxt("bio.dat",delimiter=',',usecols=(5,),dtype='str',comments='')
Also, I recommend not using np.loadtxt, because it is incredibly slow if you must process a large (>1M lines) file.
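The answer above warns against np.loadtxt for large files but does not name an alternative. A minimal sketch of one commonly used option, pandas.read_csv, assuming pandas is installed and that bio.dat is whitespace-separated as shown in the question:
import pandas as pd

# header=None because bio.dat has no header row; sep=r"\s+" splits on runs of whitespace
df = pd.read_csv("bio.dat", sep=r"\s+", header=None)
altura = df[5]          # sixth whitespace-separated field: the height
print(altura.sum())     # 8.35 for the sample data above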
Alternatively, you can convert the tab-delimited file to CSV first.
The csv module supports tab-delimited files; supply the delimiter argument to reader:
import csv
txt_file = r"mytxt.txt"
csv_file = r"mycsv.csv"
# use 'with' if the program isn't going to immediately terminate
# so you don't leave files open
# the 'b' is necessary on Windows
# it prevents \x1a, Ctrl-z, from ending the stream prematurely
# and also stops Python converting to / from different line terminators
# On other platforms, it has no effect
in_txt = csv.reader(open(txt_file, "rb"), delimiter = '\t')
out_csv = csv.writer(open(csv_file, 'wb'))
out_csv.writerows(in_txt)
This answer is not my work; it is the work of agf found at https://stackoverflow.com/a/10220428/3767980.
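For readers on Python 3, a sketch of the same conversion; the 'rb'/'wb' modes above are Python 2 specific, and in Python 3 the csv module expects text mode with newline='':
import csv

txt_file = "mytxt.txt"
csv_file = "mycsv.csv"

# newline='' lets the csv module manage line terminators itself
with open(txt_file, newline='') as fin, open(csv_file, 'w', newline='') as fout:
    in_txt = csv.reader(fin, delimiter='\t')
    out_csv = csv.writer(fout)
    out_csv.writerows(in_txt)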
The file doesn't need to be comma-separated. Here's my sample run, using StringIO to simulate a file. I assume you want to sum the numbers that look like a person's height (in meters).
In [17]: from StringIO import StringIO
In [18]: s="""\
1 Katherine Oquendo M 18 1.50 50
2 Pablo Restrepo H 20 1.83 79
3 Ana Agudelo M 18 1.58 45
4 Kevin Vargas H 20 1.74 80
5 Pablo madrid H 20 1.70 55
"""
In [19]: S=StringIO(s)
In [20]: data=np.loadtxt(S,dtype=float,usecols=(5,))
In [21]: data
Out[21]: array([ 1.5 , 1.83, 1.58, 1.74, 1.7 ])
In [22]: np.sum(data)
Out[22]: 8.3499999999999996
As a script (with the data in a .txt file):
import numpy as np
fname = 'stack25828405.txt'
data=np.loadtxt(fname,dtype=float,usecols=(5,))
print data
print np.sum(data)
2119:~/mypy$ python2.7 stack25828405.py
[ 1.5 1.83 1.58 1.74 1.7 ]
8.35
My dataframe is like this:
star_rating actors_list
0 9.3 [u'Tim Robbins', u'Morgan Freeman']
1 9.2 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 [u'Al Pacino', u'Robert De Niro']
3 9.0 [u'Christian Bale', u'Heath Ledger']
4 8.9 [u'John Travolta', u'Uma Thurman']
I want to extract the most frequent names in the actors_list column. I found the code below; do you have a better suggestion, especially for big data?
import pandas as pd
df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
Expected output (for this data):
robert de niro 13
tom hanks 12
clint eastwood 11
johnny depp 10
al pacino 10
james stewart 9
By my tests, it would be much faster to do the regex cleanup after counting.
from itertools import chain
import re
p = re.compile("""^u['"](.*)['"]$""")
ser = pd.Series(list(chain.from_iterable(
x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]
ser.head()
Robert De Niro 18
Brad Pitt 14
Clint Eastwood 14
Tom Hanks 14
Al Pacino 13
dtype: int64
It's always better to go for plain Python than to depend on pandas here, since pandas consumes a huge amount of memory if the lists are large.
If the longest list has 1000 elements, then all the shorter lists will be padded with NaNs when you use expand=True, which is a waste of memory. Try this instead.
df = pd.concat([df]*1000) # For the sake of large df.
%%timeit
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop
%%timeit
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop
%%timeit
words = {}
for i in df['actors_list']:
    for w in i:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
100 loops, best of 3: 5.44 ms per loop
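The manual dictionary counting above can also be written with collections.Counter from the standard library; a minimal sketch, assuming df['actors_list'] already holds Python lists as produced by the strip/split cell above (it counts the same strings as the loop):
from collections import Counter
from itertools import chain

# one pass over all rows; each row is a list of actor-name strings
words = Counter(chain.from_iterable(df['actors_list']))
print(words.most_common(5))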
I would use ast to convert the list-like strings to actual lists:
import ast
df.actors_list=df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()
Timing these approaches gave me the chart below (image not included here), in which:
coldspeed's code is wen2()
Dark's code is wen4()
my code is wen1()
W-B's code is wen3()
I am trying to extract data out of a csv file and output the data to another csv file.
relate task perform
0 avc asd
1 12 24
2 34 54
3 22 33
4 11 11
5 335 534
Time A B C D
0 0.334 0.334 0.334 0.334
1 0.543 0.543 0.543 0.543
2 0.752 0.752 0.752 0.752
3 0.961 0.961 0.961 0.961
4 1.17 1.17 1.17 1.17
5 1.379 1.379 1.379 1.379
I am writing a Python script to read the above table. I want all the data from Time, A, B, C, and D onwards in a separate file.
import csv
import pandas as pd
import os
read_file = False
with open('xyz.csv', mode='r', encoding='utf-8') as f_read:
    reader = csv.reader(f_read)
    for row in reader:
        if 'Time' in row:
I am stuck here. I read all the data into 'reader', so all the rows should have been parsed. Now, how can I extract the data from the 'Time' line onwards into a separate file?
Is there a better method to achieve the above objective?
Should I use pandas instead of regular python commands?
I read many similar answers on stackoverflow but I am confused on how to finish this problem. Your help is appreciated.
Best
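A minimal sketch of one way to finish this, assuming the rows of interest start at the line containing the cell 'Time' and that 'xyz_out.csv' is an acceptable (hypothetical) name for the output file:
import csv

found = False
with open('xyz.csv', mode='r', encoding='utf-8') as f_read, \
        open('xyz_out.csv', mode='w', encoding='utf-8', newline='') as f_out:
    reader = csv.reader(f_read)
    writer = csv.writer(f_out)
    for row in reader:
        if not found and 'Time' in row:
            found = True              # this row is the 'Time A B C D' header
        if found:
            writer.writerow(row)      # copy the header and every row after it
pandas would also work (read the whole file, find the index of the 'Time' row, and slice from there), but the plain csv version keeps memory use minimal.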
I have to copy information from a pre-existing text file, add POS tags on the same line, and write it to a new file, but I have no clue how to get the correct output. Thanks in advance.
My current output:
0 5 1001 China
5 7 1002 's
8 17 1003 state-run
18 23 1004 media
24 27 1005 say
28 29 1006 a
NNP POS JJ NNS VBP DT
Code:
import sys
import nltk
def main():
    list1 = []
    read = open("en.tok.off", "r")
    data = read.read()
    result = ''.join([i for i in data if not i.isdigit()])
    result = result.split()
    data3 = nltk.pos_tag(result)
    words, tags = zip(*data3)
    tags = " ".join(tags)
    print(tags)
    outfile = open("en.tok.off.pos", "w")
    outfile.write(data)
    outfile.write(tags)
    outfile.close()

main()
I want NNP in the fifth column of 0 5 1001 China, POS on the same line after 5 7 1002 's, and so on.
Desired output:
0 5 1001 China NNP
5 7 1002 's POS
8 17 1003 state-run JJ
Instead of throwing away the numbers (which will also discard numbers in your text!), collect everything and extract the fourth column for tagging.
data = read.read()
rows = [ line.split() for line in data.splitlines() ]
words = [ row[3] for row in rows ]
tagged = nltk.pos_tag(words)
You can then put the pieces back together like this.
tagged_rows = [ row + [ tw[1] ] for row, tw in zip(rows, tagged) ]
(In relatively new versions of Python, you can compact the above even more). An alternative would be to use a library like numpy or pandas, which let you add a column to your data. But I think this approach is simpler and therefore preferable.
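To round this off, a short sketch of writing the result back out, reusing the output file name from the question (en.tok.off.pos) and the tagged_rows list built above:
# each row already holds strings, so just append the tag and join with spaces
with open("en.tok.off.pos", "w") as outfile:
    for row in tagged_rows:
        outfile.write(" ".join(row) + "\n")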
Using the latest nltk release (v3.2.4), there's an align_tokens function that might be helpful for the token offsets (your first two columns).
As for printing the POS as the last column, simply put the elements you want per word into a list and use the str.join() function, e.g.:
>>> x = ['a', 'b', 'c']
>>> print ('\t'.join(x))
a b c
>>> x.append('d')
>>> print ('\t'.join(x))
a b c d
Specific to your code:
[in]:
$ cat test.txt
China's state-run media say a
[code]:
from nltk.tokenize.treebank import TreebankWordTokenizer
from nltk.tokenize.util import align_tokens
from nltk import pos_tag
tb = TreebankWordTokenizer()
with open('test.txt') as fin:
    for line in fin:
        text = line.strip()
        tokens = tb.tokenize(text)
        tagged_tokens = pos_tag(tokens)
        offsets = align_tokens(tokens, text)
        for (start, end), (tok, tag) in zip(offsets, tagged_tokens):
            print('\t'.join([str(start), str(end), tok, tag]))
[out]:
0 5 China NNP
5 7 's POS
8 17 state-run JJ
18 23 media NNS
24 27 say VBP
28 29 a DT
I am new to Python and scripting in general, so I would really appreciate some guidance in writing a Python script.
So, to the point:
I have a large number of files in a directory. Some files are empty; others contain rows like this:
16 2009-09-30T20:07:59.659Z 0.05 0.27 13.559 6
16 2009-09-30T20:08:49.409Z 0.22 0.312 15.691 7
16 2009-09-30T20:12:17.409Z -0.09 0.235 11.826 4
16 2009-09-30T20:12:51.159Z 0.15 0.249 12.513 6
16 2009-09-30T20:15:57.209Z 0.16 0.234 11.776 4
16 2009-09-30T20:21:17.109Z 0.38 0.303 15.201 6
16 2009-09-30T20:23:47.959Z 0.07 0.259 13.008 5
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
16 2009-09-30T20:37:48.609Z -0.02 0.256 12.861 4
16 2009-09-30T20:44:19.359Z 0.14 0.251 12.597 4
16 2009-09-30T20:48:39.759Z 0.03 0.284 14.244 5
16 2009-09-30T20:49:36.159Z -0.07 0.278 13.98 4
16 2009-09-30T20:57:54.609Z 0.01 0.304 15.294 4
16 2009-09-30T20:59:47.759Z 0.27 0.265 13.333 4
16 2009-09-30T21:02:56.209Z 0.28 0.272 13.645 6
and so on.
I want to get these lines out of the files into a new file, but there are some conditions.
If two or more successive lines are inside a time window of 6 seconds, then only the line with the highest threshold should be written to the new file.
So, something like this:
Original:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
in output file:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
Keep in mind that lines from different files can fall inside the same 6-second window, so the line that ends up in the output is the one with the highest threshold across all files.
The code that shows what each field in a line means is here:
import glob
from datetime import datetime
path = './*.cat'
files=glob.glob(path)
for file in files:
    in_file = open(file, 'r')
    out_file = open("times_final", "w")
    for line in in_file.readlines():
        split_line = line.strip().split(' ')
        template_number = split_line[0]
        t = datetime.strptime(split_line[1], '%Y-%m-%dT%H:%M:%S.%fZ')
        mag = split_line[2]
        num = split_line[3]
        threshold = float(split_line[4])
        no_detections = split_line[5]
    in_file.close()
out_file.close()
Thank you very much for hints, guidelines, ...
You said in the comments that you know how to merge multiple files into one sorted by t, and that the 6-second windows start with the first row and are based on the actual data.
So you need a way to remember the maximum threshold per window and write only after you are sure you have processed all rows in a window. Sample implementation:
from datetime import datetime, timedelta
from csv import DictReader, DictWriter
fieldnames=("template_number", "t", "mag","num", "threshold", "no_detections")
with open('master_data') as f_in, open("times_final", "w") as f_out:
    reader = DictReader(f_in, delimiter=" ", fieldnames=fieldnames)
    writer = DictWriter(f_out, delimiter=" ", fieldnames=fieldnames,
                        lineterminator="\n")

    window_start = datetime(1900, 1, 1)
    window_timedelta = timedelta(seconds=6)
    window_max = 0
    window_row = None

    for row in reader:
        try:
            t = datetime.strptime(row["t"], "%Y-%m-%dT%H:%M:%S.%fZ")
            threshold = float(row["threshold"])
        except ValueError:
            # replace by actual error handling
            print("Problem with: {}".format(row))

        # switch to new window after 6 seconds
        if t - window_start > window_timedelta:
            # write out previous window before switching
            if window_row:
                writer.writerow(window_row)
            window_start = t
            window_max = threshold
            window_row = row
        # remember max threshold inside a single window
        elif threshold > window_max:
            window_max = threshold
            window_row = row

    # don't forget the last window
    if window_row:
        writer.writerow(window_row)
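The code above assumes a single merged file named master_data, sorted by t. A sketch of one way to build it from the .cat files, relying on the fact that the zero-padded ISO-8601 timestamps sort chronologically as plain strings:
import glob

lines = []
for fname in glob.glob('./*.cat'):
    with open(fname) as f:
        lines.extend(line for line in f if line.strip())

# column 1 is the timestamp; a plain string sort orders it correctly
lines.sort(key=lambda line: line.split()[1])

with open('master_data', 'w') as out:
    out.writelines(lines)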
import csv
with open('Met.csv', 'r') as f:
    reader = csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE)
    for row in reader:
        print row
I am not able to figure out how to get a column from the csv file. I tried:
print row[:column_name]
name id nametype recclass mass (g) fall year GeoLocation
Aachen 1 Valid L5 21 Fell 01/01/1880 (50.775000, 6.083330)
Aarhus 2 Valid H6 720 Fell 1/1/1951 (53.775000, 6.586560)
Abee 6 Valid EH4 -- Fell 1/1/1952 (50.775000, 6.083330)
Acapul 10 Valid A 353 Fell 1/1/1952 (50.775000, 6.083330)
Acapul 1914 valid A -- Fell 1/1/1952 (50.775000, 6.083330)
AdhiK 379 Valid EH4 56655 Fell 1/1/1919 (50.775000, 6.083330)
and I want the average of mass (g).
Try pandas instead of reading with the csv module:
import pandas as pd
data = pd.read_csv('Met.csv')
It is far easier to grab columns and perform operations using pandas.
Here I am loading the CSV contents into a dataframe.
Loaded data (sample data):
>>> data
name id nametype recclass mass
0 Aarhus 2 Valid H6 720
1 Abee 6 Valid EH4 107000
2 Acapulco 10 Valid Acapulcoite 914
3 Achiras 370 Valid L6 780
4 Adhi Kot 379 Valid EH4 4239
5 Adzhi 390 Valid LL3-6 910
6 Agen 392 Valid H5 30000
Just the Mass column :
You can access individual columns as data['column name']
>>> data['mass']
0 720
1 107000
2 914
3 780
4 4239
5 910
6 30000
Name: mass, dtype: int64
Average of Mass column :
>>> data['mass'].mean()
20651.857142857141
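Note that the question's data marks missing masses with '--' (see the Abee row). A sketch of handling that, assuming the file is genuinely comma-separated and the column header is literally mass (g):
import pandas as pd

# treat '--' as missing so the column parses as numeric; mean() skips NaN by default
data = pd.read_csv('Met.csv', na_values=['--'])
print(data['mass (g)'].mean())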
You can use csv.DictReader() instead of csv.reader(). The following code works fine for me:
import csv
mass_list = []
with open("../data/Met.csv", "r") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        mass = row["mass"]
        # skip empty cells and the '--' placeholder for missing values
        if mass and mass != "--":
            mass_list.append(float(mass))

avg_mass = sum(mass_list) / len(mass_list)
print "avg of mass: ", avg_mass
Hope it helps.