2012-05-10 BRAD 10
2012-05-08 BRAD 40
2012-05-08 BRAD 60
2012-05-12 TOM 100
I wanted output like:
2012-05-08 BRAD|2|100
2012-05-10 BRAD|1|10
2012-05-12 TOM|1|100
I started with this code:
import os, sys

fo = open("meawoo.txt", "w")
f = open("test.txt", "r")
fn = f.readlines()
f.close()
for line in fn:
    line = line.strip()
    sline = line.split("|")
    p = sline[1].split(" ")[0], sline[2], sline[4]
    print p
    fo.writelines(str(p) + "\n")
fo.close()

o_read = open("meawoo.txt", "r")
x_read = o_read.readlines()

from operator import itemgetter
x_read.sort(key=itemgetter(0))

from itertools import groupby
z = groupby(x_read, itemgetter(0))
print z
for elt, items in groupby(x_read, itemgetter(0)):
    print elt, items
    for i in items:
        print i
It would be very helpful if you could suggest some useful changes to my work. Thanks in advance.
The following code should print the data in the format you want (as far as I understand it):
d = {}
with open("testdata.txt") as f:
    for line in f:
        parts = line.split()
        if parts[0] in d:
            if parts[1] in d[parts[0]]:
                d[parts[0]][parts[1]][0] += int(parts[2])
            else:
                d[parts[0]][parts[1]] = [int(parts[2]), 0]
            d[parts[0]][parts[1]][1] += 1
        else:
            d[parts[0]] = {parts[1]: [int(parts[2]), 1]}

for date in sorted(d):
    for name in sorted(d[date]):
        print "%s %s|%d|%d" % (date, name, d[date][name][0], d[date][name][1])
I save every line in a dictionary keyed by date. Each value is another dictionary keyed by name, whose value is a two-element list: the cumulative sum of the numbers for this name on this date, and the count of summands for this date/name combination. I then print the dictionary in the requested format. I also use the fact that comparing dates formatted as YYYY-MM-DD strings gives the same result as comparing the dates themselves, so I can simply apply sorted to the date strings. I sort on names too.
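To illustrate that sorting point, here is a minimal standalone sketch (the sample dates are made up):

# Lexicographic order of ISO-formatted date strings matches
# chronological order, so no datetime parsing is needed.
dates = ["2012-05-12", "2012-05-08", "2012-05-10"]
print(sorted(dates))  # ['2012-05-08', '2012-05-10', '2012-05-12']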
For an example (adapted to not being able to use a file) see http://ideone.com/rx3h2. It gives the same output you requested.
How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In linux I can get the closest line and do the treatment I need using grep, sed, cut and others, but I'd like in Python.
Any help will be greatly appreciated!
Thank you.
How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
You can try:
with open("text_filter.txt") as f:
text = f.readlines() # read text lines to list
filter= "37.0459"
match = [i for i,x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
if len(text) >= match[0]+3: # if list has 3 items after filter, print it
print("".join(text[match[0]:match[0]+3]).strip())
print(text[match[0]].strip())
if match[0] >= 3: # if list has 3 items before filter, print it
print("".join(text[match[0]-3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
You can use pandas to import the data into a DataFrame and then manipulate it easily. As per your question, the value to check is not an exact match, so I have converted it to a string and compared prefixes.
import pandas as pd

data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])  # import the text file as a DataFrame
value_to_check = 37.0459  # user defined

for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i, 0])[:len(str(value_to_check))]:
        break

print(data.iloc[i - 3:i + 4, :])
Output:
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
A solution with iterators that only keeps the necessary lines in memory and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice

def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)  # rolling buffer of the `before` most recent lines
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                # found it: the buffered lines + the match + the next `after` lines
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
#  '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']

print(find_in_file('test.txt', 37.044))  # only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
#  '37.04396,-95.5618\n']
Also, it works if there are fewer than the expected number of lines before or after the match. We match floats, not strings, since '37.04' would erroneously match '37.0444' otherwise.
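A quick illustration of that substring pitfall (a standalone sketch, not part of the answer's code):

line = "37.0444,-95.57054"
print("37.04" in line)                       # True: substring matching gives a false positive
print(37.04 in map(float, line.split(','))) # False: float comparison requires an exact value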
This solution prints the before and after elements even if there are fewer than 3 of them.
I am also matching on strings, since the question implies that partial matches are wanted, i.e. 37.0459 should match 37.04597.
search_term = '37.04462'

with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines]  # remove '\n'

for lat, lon in lines:
    if search_term in lat:
        index = lines.index([lat, lon])
        break

left = 0
right = 0
for k in range(1, 4):  # 4 because range's end is not included
    if index - k >= 0:
        left += 1
    if index + k <= (len(lines) - 1):
        right += 1

for i in range(index - left, index + right + 1):  # +1 because range's end is not included
    print(lines[i][0], lines[i][1])
I have a dataset in a raw text file (a log file). I am building a Python list by reading this file line by line, and from that list I will create a dataframe using pyspark. As you can see in the dataset, some values are missing in their respective columns, and I want to fill them with "NA". This is a sample of the dataset; a missing value can occur in any column, and columns are separated by whitespace.
==============================================
empcode Emnname Date DESC
12d sf 2018-02-06 dghsjf
asf2 asdfw2 2018-02-16 fsfsfg
dsf21 sdf2 2016-02-06 sdgfsgf
sdgg dsds dkfd-sffddfdf aaaa
dfd gfg dfsdffd aaaa
df dfdf efef
4fr freff
----------------------------------------------
My code:
path="something/demo.txt"
EndStr="----------------------------------------------"
FilterStr="=============================================="
findStr="empcode Emnname"
def PrepareList(findStr):
with open(path) as f:
out=[]
for line in f:
if line.rstrip()==Findstr:
#print(line)
tmp=[]
tmp.append(re.sub("\s+",",",line.strip()))
#print(tmp)
for line in f:
if line.rstrip()==EndStr:
out.append(tmp)
break
tmp.append(re.sub("\s+",",",line.strip()))
return (tmp)
f.close()
LstEmp=[]
LstEmp=prepareDataset("empcode Emnname Dept DESC")
print(LstEmp)
My output is:
['empcode,Emnname,Date,DESC',
'12d,sf,2018-02-06,dghsjf',
'asf2,asdfw2,2018-02-16,fsfsfg',
'dsf21,sdf2,2016-02-06,sdgfsgf',
'sdgg,dsds,dkfd-sffddfdf,aaaa',
'dfd,gfg,dfsdffd,aaaa',
'df,dfdf,efef',
'4fr,freff']
Expected output:
['empcode,Emnname,Date,DESC',
'12d,sf,2018-02-06,dghsjf',
'asf2,asdfw2,2018-02-16,fsfsfg',
'dsf21,sdf2,2016-02-06,sdgfsgf',
'sdgg,dsds,dkfd-sffddfdf,aaaa',
'dfd,gfg,dfsdffd,aaaa',
'df,NA,dfdf,efef',
'4fr,NA,NA,freff']
Here I tried to follow a general approach, where you won't have to hard-code the column spans. For returning a DataFrame you can use pd.read_csv with StringIO. Kindly modify the path as per your file location. This code is extended from yours to keep it simple to understand; there are more elegant ways to write the same logic.
import re
import pandas as pd
import StringIO

path = "/home/clik/clik/demo.txt"
EndStr = "------------------------------"
FilterStr = "=================="
FindStr = "empcode Emnname"

def match(sp1, sp2):
    # Score how well two (start, end) spans line up: overlapping spans
    # score positive, disjoint spans score negative.
    disjunct = max(sp1[0] - sp2[1], sp2[0] - sp1[1])
    if disjunct >= 0:
        return -abs((sp1[0] + sp1[1]) / 2.0 - (sp2[0] + sp2[1]) / 2.0)
    return float(disjunct) / min(sp1[0] - sp2[1], sp2[0] - sp1[1])

def PrepareList():
    with open(path) as f:
        out = []
        for i, line in enumerate(f):
            if line.rstrip().startswith(FindStr):
                tmp = []
                # remember where each header column starts and ends
                col_spans = [m.span() for m in re.finditer(r"[^\s][^\s]+", line)]
                tmp.append(re.sub(r"\s+", ",", line.strip()))
                for line in f:
                    if line.rstrip().startswith(EndStr):
                        out.append(tmp)
                        break
                    row = [None] * len(col_spans)
                    # assign each token to the header column whose span fits best
                    for m in re.finditer(r"[^\s][^\s]+", line):
                        colmatches = [match(m.span(), cspan) for cspan in col_spans]
                        max_index = max(enumerate(colmatches), key=lambda e: e[1])[0]
                        row[max_index] = m.group() if row[max_index] is None else (row[max_index] + ' ' + m.group())
                    tmp.append(','.join(['NA' if e is None else e for e in row]))
                # for a pandas dataframe:
                # return pd.read_csv(StringIO.StringIO('\n'.join(tmp)))
                # for returning a list of lists:
                # return tmp
                # for returning a list of tuples:
                return map(tuple, tmp)

LstEmp = PrepareList()
For converting a list of tuples to a pyspark DataFrame, here is a tutorial: http://bigdataplaybook.blogspot.in/2017/01/create-dataframe-from-list-of-tuples.html
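For reference, a minimal sketch of that conversion (the SparkSession setup and sample rows are my assumptions, not from the tutorial):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("demo").getOrCreate()
rows = [("12d", "sf", "2018-02-06", "dghsjf"),
        ("df", "NA", "dfdf", "efef")]
# createDataFrame accepts a list of tuples plus a list of column names
df = spark.createDataFrame(rows, ["empcode", "Emnname", "Date", "DESC"])
df.show()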
From the dataset it appears that while the text in the fields is variable in length, the fields themselves start and end at fixed positions. This usually happens with tab-separated fields.
==============================================
empcode Emnname Date DESC
12d sf 2018-02-06 dghsjf
asf2 asdfw2 2018-02-16 fsfsfg
dsf21 sdf2 2016-02-06 sdgfsgf
If this is the case the following should work:
for line in f:
    if line.rstrip() == FindStr:
        tmp = []
        tmp.append(re.sub("\t", ",", line.strip()))
        for line in f:
            if line.rstrip() == EndStr:
                out.append(tmp)
                break
            tmp.append(re.sub("\t", ",", line.strip()))
        return tmp
I have replaced the \s in your code with \t and removed the +. In a Python regex, + matches one or more occurrences of the preceding pattern, so \s+ swallows the entire whitespace run between two non-empty fields, including any empty field in between.
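To see the difference on a row with a missing middle field (a made-up tab-separated line, assuming single tabs between fields):

import re

line = "df\t\tdfdf\tefef"  # the Emnname field is empty
print(re.sub(r"\s+", ",", line))  # df,dfdf,efef  - the empty field vanishes
print(re.sub(r"\t", ",", line))   # df,,dfdf,efef - the empty field survives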
Alternatively, if the input file is not tab-separated, you can extract field values by treating the fields as fixed-length and then applying strip():
fields = [(0, 10),
          (10, 20),
          (20, 36),
          (36, 100)]  # assuming the last field will not exceed this length

field_values = [line[x[0]:x[1]].strip() for x in fields]
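For example, applied to one of the sample rows above (the spans are illustrative; adjust them to your real column widths):

line = "12d       sf        2018-02-06      dghsjf"
fields = [(0, 10), (10, 20), (20, 36), (36, 100)]
print([line[x[0]:x[1]].strip() for x in fields])  # ['12d', 'sf', '2018-02-06', 'dghsjf']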
I have some data which looks like:
key abc key
value 1
value 2
value 3
key bcd key
value 2
value 3
value 4
...
...
Based on it, what I want is to construct a data structure like:
{'abc':[1,2,3]}
{'bcd':[2,3,4]}
...
Is a regular expression a good choice for this? If so, how do I write the regular expression so that the process behaves like a for loop (inside the loop, I can do some work to build a data structure from the data I got)?
Thanks.
Using regular expressions can be more robust than using string slicing to identify values in a text file. If you are confident about the format of your data, string slicing will be fine.
import re

keyPat = re.compile(r'key (\w+) key')
valuePat = re.compile(r'value (\d+)')

result = {}
for line in open('data.txt'):
    if keyPat.search(line):
        match = keyPat.search(line).group(1)
        tempL = []
        result[match] = tempL
    elif valuePat.search(line):
        match = valuePat.search(line).group(1)
        tempL.append(int(match))
    else:
        print('Did not match:', line)

print(result)
x="""key abc key
value 1
value 2
value 3
key bcd key
value 2
value 3
value 4"""
j= re.findall(r"key (.*?) key\n([\s\S]*?)(?=\nkey|$)",x)
d={}
for i in j:
k=map(int,re.findall(r"value (.*?)(?=\nvalue|$)",i[1]))
d[i[0]]=k
print d
The following code should work if the data is always in that format.
str=""
with open(FILENAME, "r") as f:
str =f.read()
regex = r'key ([^\s]*) key\nvalue (\d)+\nvalue (\d)+\nvalue (\d+)'
matches=re.findall(regex, str)
dic={}
for match in matches:
dic[match[0]] = map(int, match[1:])
print dic
EDIT: The other answer by meelo is more robust, as it handles cases where there may be more or fewer than 3 values.
My program has to do two things with this file.
It needs to print the following information:
def getlines(somefile):
    f = open(somefile).readlines()
    lines = [line for line in f if not line.startswith("#") and not line.strip() == ""]
    return lines

entries = getlines(input("Name of input file: "))

animal_visits = {}
month_visits = [0] * 13  # index 0 is unused; months are 1-12

for entry in entries:
    # count visits for each animal
    animal = entry[:3]
    animal_visits[animal] = animal_visits.get(animal, 0) + 1
    # count visits for each month
    month = int(entry[4:6])
    month_visits[month] += 1

print("Total Number of visits for each animal")
for x in sorted(animal_visits):
    print(x, "\t", animal_visits[x])
print("====================================================")
print("Month with highest number of visits to the stations")
print(month_visits.index(max(month_visits)))
Outputs:
Name of input file: log
Total Number of visits for each animal
a01 3
a02 3
a03 8
====================================================
Month with highest number of visits to the stations
1
I prepared the following script:
from datetime import datetime        # to parse your string as a date
from collections import defaultdict  # to accumulate frequencies
import calendar                      # to get the names of the months

# Store the names of the months
MONTHS = [item for item in calendar.month_name]

def entries(filename):
    """Yields triplets (animal, date, station) contained in
    `filename`.
    """
    with open(filename, "rb") as fp:
        for line in (_line.strip() for _line in fp):
            # skip comments
            if line.startswith("#"):
                continue
            try:
                # obtain the entry or try the next line
                animal, datestr, station = line.split(":")
            except ValueError:
                continue
            # convert the date string to an actual datetime object
            date = datetime.strptime(datestr, "%m-%d-%Y")
            # yield the value
            yield animal, date, station

def visits_per_animal(data):
    """Count of visits per station sorted by animal."""
    # a dictionary whose values are implicitly initialized to
    # integer 0
    counter = defaultdict(int)
    for animal, date, station in data:
        counter[animal] += 1
    # print the outcome
    print "Visits Per Animal"
    for animal in sorted(counter.keys()):
        print "{0}: {1}".format(animal, counter[animal])

def month_of_highest_frequency(data):
    """Calculates the month with the highest frequency."""
    # same as above: a dictionary implicitly creating integer 0 for a
    # new key
    counter = defaultdict(int)
    for animal, date, station in data:
        counter[date.month] += 1
    # select the (key, value) pair where value is maximum
    month_max, visits_max = max(counter.iteritems(), key=lambda t: t[1])
    # pretty-print
    print "{0} has the most visits ({1})".format(MONTHS[month_max], visits_max)

def main(filename):
    """main program: get the data and apply the functions"""
    data = [entry for entry in entries(filename)]
    visits_per_animal(data)
    month_of_highest_frequency(data)

if __name__ == "__main__":
    import sys
    main(sys.argv[1])
Use as:
$ python animalvisits.py animalvisits.txt
Visits Per Animal
a01: 3
a02: 3
a03: 8
January has the most visits (3)
Having done that, I must advise you against this approach. Querying data like this is inefficient, difficult, and error-prone. I recommend you store your data in an actual database (Python ships with an excellent binding for SQLite) and use SQL for your reductions.
If you adopt the SQLite philosophy, you can simply store your queries as plain text files and run them on demand (via Python, a GUI, or the command line).
Visit http://docs.python.org/2/library/sqlite3.html for more details.
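A rough sketch of that idea (the table layout and sample rows are my own invention, and I store dates as YYYY-MM-DD so SQLite's strftime can pick out the month):

import sqlite3

conn = sqlite3.connect("visits.db")
conn.execute("CREATE TABLE IF NOT EXISTS visits (animal TEXT, date TEXT, station TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)",
                 [("a01", "2013-01-15", "s1"), ("a02", "2013-01-20", "s2")])

# visits per animal
for row in conn.execute("SELECT animal, COUNT(*) FROM visits GROUP BY animal"):
    print(row)

# month with the most visits
print(conn.execute("""SELECT strftime('%m', date) AS month, COUNT(*) AS n
                      FROM visits GROUP BY month ORDER BY n DESC LIMIT 1""").fetchone())

conn.close()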
Have you tried using a regex? I suspect your code would reduce to very few lines if you did: call findall() with the appropriate regular expressions, store the matches in a list, and then take the length of the list.
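For instance, a minimal sketch of that suggestion (the animal:date:station log format is assumed from the script above; the sample lines are made up):

import re

log = """# comment line
a01:01-15-2013:s1
a02:01-20-2013:s2
a01:02-02-2013:s1"""

# one findall per question: animal IDs, and the month of each visit
animals = re.findall(r"^(\w+):", log, re.MULTILINE)
months = re.findall(r":(\d{2})-\d{2}-\d{4}:", log)
print(len(animals), animals)  # 3 ['a01', 'a02', 'a01']
print(months)                 # ['01', '01', '02']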
I have a file with a few columns like:
PAIR 1MFK 1 URANIUM 82 HELIUM 112 2.5506
PAIR 2JGH 2 PLUTONIUM 98 POTASSIUM 88 5.3003
PAIR 345G 3 SODIUM 23 CARBON 14 1.664
PAIR 4IG5 4 LITHIUM 82 ARGON 99 2.5506
PAIR 234G 5 URANIUM 99 KRYPTON 89 1.664
Now what I want to do is read the last column, count how many times each value repeats, and generate an output file containing two columns: 'VALUE' and 'NO OF TIMES REPEATED'.
I have tried:
inp = ('filename'.'r').read().strip().replace('\t',' ').split('\n')
from collections import defaultdict
D = defaultdict(line)
for line in map(str.split, inp):
    k = line[-1]
    D[k].append(line)
I'm stuck here.
Please help!
There are a number of issues with the code as posted: ('filename'.'r') is not a valid way to open a file (you want open('filename', 'r')), and the argument to defaultdict should be list, not line. Here is a fixed-up version of your code:
from collections import defaultdict

D = defaultdict(list)
for line in open('filename', 'r'):
    k = line.split()[-1]
    D[k].append(line)

print 'VALUE NO TIMES REPEATED'
print '----- -----------------'
for value, lines in D.items():
    print '%-6s %d' % (value, len(lines))
Another way to do it is to use collections.Counter to conveniently sum the number of repetitions. That lets you simplify the code a bit:
from collections import Counter

D = Counter()
for line in open('filename', 'r'):
    k = line.split()[-1]
    D[k] += 1

print 'VALUE NO TIMES REPEATED'
print '----- -----------------'
for value, count in D.items():
    print '%-6s %d' % (value, count)
Now what I want to do is read the last column, count how many times each value repeats, and generate an output file containing two columns: 'VALUE' and 'NO OF TIMES REPEATED'.
So use collections.Counter to count the number of times each value appears, not a defaultdict. (It's not at all clear what you're trying to do with the defaultdict, and your initialization won't work anyway: defaultdict is constructed with a callable that creates a default value. The default value you apparently had in mind is an empty list, so you would pass list to the defaultdict.) You don't need to store the lines to count them; the Counter counts them for you automatically.
Also, processing the entire file ahead of time is a bit ugly, since you can iterate over the file directly and get lines, which does part of the processing for you. In fact, you can do that iteration directly in the Counter construction.
Here is a complete solution:
from collections import Counter

with open('input', 'r') as data:
    histogram = Counter(line.split('\t')[-1].strip() for line in data)

with open('output', 'w') as result:
    for item in histogram.iteritems():
        result.write('%s\t%s\n' % item)