How can I categorize numbers that are inside a text file? - python

I have a text file that has 5000 lines. Its format is like this:
1,3,4,1,2,3,5,build
2,6,4,6,7,3,4,demolish
3,6,10,2,3,1,3,demolish
4,4,1,2,3,4,5,demolish
5,1,1,1,1,6,8,build
I want to make different lists for example:
for second column:
second_build=[3,1]
second_demolish=[6,6,4]
I've tried something like this:
with open('cons.data') as file:
    second_build=[line.split(',')[1] for line in file if line.split(',')[7]=='build']
But it did not work.

You can get values for each column/action as follows:
lines = """1,3,4,1,2,3,5,build
2,6,4,6,7,3,4,demolish
3,6,10,2,3,1,3,demolish
4,4,1,2,3,4,5,demolish
5,1,1,1,1,6,8,build""".split(
"\n"
)
build_cols = [list() for _ in range(7)]
demolish_cols = [list() for _ in range(7)]
data = {"build": build_cols, "demolish": demolish_cols}
for line in lines:
    tokens = line.split(",")
    for bc, tok in zip(data[tokens[-1]], tokens):
        bc.append(tok)
# to access second column build values:
print(build_cols[1])
# ['3', '1']
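Here, build_cols stores a list of lists in which each entry represents a column. For each build line, you append the item from each column to the list at the corresponding position in build_cols; demolish_cols is filled the same way from the demolish lines. Note that the values are stored as strings.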
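The same idea can be sketched so that it reads from any iterable of lines (such as an open file) and stores integers instead of strings. The function name `group_columns` and the use of `defaultdict` are illustrative choices, not part of the answer above, and the sketch assumes every line has the same number of comma-separated fields with the action in the last one.

```python
from collections import defaultdict

def group_columns(lines):
    """Group column values by the action found in the last field of each line."""
    groups = defaultdict(list)  # action -> list of columns
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        *numbers, action = line.split(",")
        columns = groups[action]
        if not columns:
            # first line seen for this action: create one empty list per column
            columns.extend([] for _ in numbers)
        for column, value in zip(columns, numbers):
            column.append(int(value))
    return groups

sample = [
    "1,3,4,1,2,3,5,build\n",
    "2,6,4,6,7,3,4,demolish\n",
    "3,6,10,2,3,1,3,demolish\n",
    "4,4,1,2,3,4,5,demolish\n",
    "5,1,1,1,1,6,8,build\n",
]
groups = group_columns(sample)
print(groups["build"][1])     # [3, 1]
print(groups["demolish"][1])  # [6, 6, 4]
```

With the file from the question, the call would presumably look like `with open('cons.data') as f: groups = group_columns(f)`.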

First store the result of readlines() in a variable, so the lines can be iterated more than once. Then add rstrip() in the list comprehension: every line (except possibly the last) ends with '\n', so the last field is 'build\n' rather than 'build'. Strip the newline before comparing, and convert the values to integers:
with open('cons.data') as file:
    f=file.readlines()
second_build=[int(line.split(',')[1]) for line in f if line.rstrip().split(',')[-1]=='build']
second_demolish=[int(line.split(',')[1]) for line in f if line.rstrip().split(',')[-1]=='demolish']
And now:
print(second_build)
print(second_demolish)
prints:
[3, 1]
[6, 6, 4]
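To see why the original comparison never matched, it can help to look at what `split` returns for a line that still carries its newline. This is a small sketch using in-memory strings rather than the real file; `io.StringIO` stands in for the file handle. It also shows that a file-like object is exhausted after one pass, which is presumably why the answer stores `readlines()` in a variable before running two comprehensions.

```python
import io

line = "1,3,4,1,2,3,5,build\n"
last_raw = line.split(",")[-1]             # newline still attached
last_clean = line.rstrip().split(",")[-1]  # newline removed first
print(repr(last_raw))    # 'build\n'
print(repr(last_clean))  # 'build'

# A file-like object can only be iterated once: the second
# comprehension over the same handle sees no lines at all.
handle = io.StringIO("1,3,4,1,2,3,5,build\n2,6,4,6,7,3,4,demolish\n")
first_pass = [row for row in handle]
second_pass = [row for row in handle]
print(len(first_pass), len(second_pass))  # 2 0
```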

Related

Need to split list into a nested list

I'm trying to make a sublist based on the end of a line and #.
For example, the file contains:
#
2.1,-3.1
-0.7,4.1
#
3.8,1.5
-1.2,1.1
and the output needs to be:
[[[2.1, -3.1], [-0.7, 4.1]], [[3.8, 1.5], [-1.2, 1.1]]]
but with this code:
results = []
fileToProcess = open("numerical.txt", "r")
for line in fileToProcess:
    results.append(line.strip().split(' '))
print(results)
I get:
[['#'], ['2.1', '-3.1'], ['-0.7', '4.1'], ['#'], ['3.8', '1.5'], ['-1.2', '1.1']]
Assuming Python as a programming language, and assuming you want exactly the output to be like this:
[[[2.1, -3.1], [-0.7, 4.1]], [[3.8, 1.5], [-1.2, 1.1]]]
Here is how to do it:
I commented the code for better understanding. Please tell me if something isn't clear.
fileToProcess = open("numerical.txt", "r")
results = []
hashtag_results = []
# For each line, we have two cases: either the line contains hashtags or contains numbers.
for line in fileToProcess:
    '''
    If the line doesn't contain hashtags, then we want to:
    1. Separate the text by "," and not spaces.
    2. Parse the text as floats using list comprehension.
    3. Append the parsed line to hashtag_results which contains
       all lists between two hashtags.
    '''
    if not line.startswith("#"):
        line_results = [float(x) for x in line.strip().split(',')]
        hashtag_results.append(line_results)
    '''
    If the line contains a hashtag AND hashtag_results ISN'T EMPTY:
    then we want to append the whole hashtag_results list to the final results list.
    '''
    if line.startswith("#") and hashtag_results:
        results.append(hashtag_results)
        hashtag_results = []
# For the final line, we append the last hashtag_results to the final results too.
results.append(hashtag_results)
print(results)
[[[2.1, -3.1], [-0.7, 4.1]], [[3.8, 1.5], [-1.2, 1.1]]]
The general idea in your OP looks fine, but you need to split on "," (instead of " ") and append to results a list of the numerical values.
Another issue is that you don't close the file once you're finished with it. I suggest using the built-in context manager construct (https://book.pythontips.com/en/latest/context_managers.html), which open() supports; it automatically closes the file once you leave the context manager scope.
Parsing data from a file is a common data processing task in Python, so it can be done in a more "pythonic" way with a list comprehension.
# use a context manager, so once you leave the `with` block,
# the file is closed(!)
with open("numerical.txt", "r") as fileToProcess:
    results = [
        # split the line on "," and interpret each element as a float
        [float(val) for val in line.strip().split(",")]
        # iterate through each line in the file
        for line in fileToProcess
        # ignore lines that just have '#'
        if line.strip() != "#"
    ]
# here, the file is closed, and `results` contains the parsed rows as one flat list:
# results = [[2.1, -3.1], [-0.7, 4.1], [3.8, 1.5], [-1.2, 1.1]]
# (the rows are not grouped into one sub-list per '#' block)
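If the nested grouping from the question is needed (one sub-list per `#` block), one possible sketch uses `itertools.groupby` to split the lines into runs of separator and data lines. The function name `parse_blocks` is hypothetical, and the sketch assumes that a separator line contains only `#`.

```python
from itertools import groupby

def parse_blocks(lines):
    """Return one list of float rows per block of lines between '#' separators."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    blocks = []
    # groupby yields consecutive runs that share the same key:
    # True for separator lines, False for data lines
    for is_separator, run in groupby(non_empty, key=lambda s: s == "#"):
        if not is_separator:
            blocks.append([[float(v) for v in row.split(",")] for row in run])
    return blocks

sample = ["#\n", "2.1,-3.1\n", "-0.7,4.1\n", "#\n", "3.8,1.5\n", "-1.2,1.1\n"]
print(parse_blocks(sample))
# [[[2.1, -3.1], [-0.7, 4.1]], [[3.8, 1.5], [-1.2, 1.1]]]
```

Because the function accepts any iterable of lines, an open file handle could be passed in place of `sample`.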

Find first line of text according to value in Python

How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In Linux I can get the closest line and do the processing I need using grep, sed, cut and others, but I'd like to do it in Python.
Any help will be greatly appreciated!
Thank you.
How can I do a search of a value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get 3 rows above and 3 rows below?
You can try:
with open("text_filter.txt") as f:
    text = f.readlines() # read text lines to list
pattern = "37.0459" # renamed from `filter`, which shadows the built-in
match = [i for i,x in enumerate(text) if pattern in x] # get list index of item matching pattern
if match:
    first = match[0]
    start = max(first-3, 0) # clamp so a match near the top does not produce a negative index
    # up to 3 lines before the match, the match itself, and up to 3 lines after it
    print("".join(text[start:first+4]).strip())
Output:
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
You can use pandas to import the data into a dataframe and then easily manipulate it. As per your question, the value to check is not an exact match, so I convert both values to strings and compare prefixes.
import pandas as pd
data = pd.read_csv("file.txt", header=None, names=["latitude","longitude"]) #imports text file as dataframe
value_to_check = 37.0459 # user defined
for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i,0])[:len(str(value_to_check))]:
        break
print(data.iloc[i-3:i+4,:])
output
    latitude  longitude
9   37.04502  -95.58204
10  37.04516  -95.58184
11  37.04572  -95.58139
12  37.04597  -95.58127
13  37.04565  -95.58073
14  37.04546  -95.58033
15  37.04516  -95.57948
A solution with iterators, that only keeps in memory the necessary lines and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice
def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                # the buffered lines before the match, the match, then the next `after` lines
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
# '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there are fewer than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.
This solution prints the before and after elements even if there are fewer than 3 of them.
I am also using strings, as the question implies that you want partial matches too, i.e. 37.0459 will match 37.04597.
search_term='37.04462'
with open('file.txt') as f:
    lines = f.readlines()
lines = [line.strip().split(',') for line in lines] #remove '\n'
for lat,lon in lines:
    if search_term in lat:
        index=lines.index([lat,lon])
        break
left=0
right=0
for k in range (1,4): # because the last one is not included
    if index-k >=0:
        left+=1
    if index+k<=(len(lines)-1):
        right+=1
for i in range(index-left,index+right+1): # because the last one is not included
    print(lines[i][0],lines[i][1])
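The answers above differ mainly in how they match (substring, prefix, or float equality) and in how they handle a match near the start or end of the file. As a rough sketch that combines prefix matching on the latitude with clamped slicing, assuming the whole file fits in memory; `context_window` and its parameters are made-up names for illustration:

```python
def context_window(lines, prefix, before=3, after=3):
    """Return the first line whose latitude starts with `prefix`,
    plus up to `before` lines above it and `after` lines below it."""
    rows = [line.strip() for line in lines if line.strip()]
    for index, row in enumerate(rows):
        latitude = row.split(",")[0]
        if latitude.startswith(prefix):
            start = max(index - before, 0)  # clamp at the top of the file
            # slicing past the end is safe, so no clamp is needed at the bottom
            return rows[start:index + after + 1]
    return []  # no line matched

sample = [
    "37.04502,-95.58204", "37.04516,-95.58184", "37.04572,-95.58139",
    "37.04597,-95.58127", "37.04565,-95.58073", "37.04546,-95.58033",
    "37.04516,-95.57948", "37.04508,-95.57914",
]
for row in context_window(sample, "37.0459"):
    print(row)
```

Returning an empty list when nothing matches is a design choice made here; raising an exception, as one of the answers does, is equally reasonable.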

How to sort a large number of lists to get a top 10 of the longest lists

So I have a text file with around 400,000 lists that mostly look like this.
100005 127545 202036 257630 362970 376927 429080
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 296858 300258 341525 348922 359832 365744
382502 390538 410857 433453 479170 489980 540746
10001 27638 51569 88226 116422 126227 159947 162938 184977 188045
191044 246142 265214 290507 300258 341525 348922 359832 365744 382502
So far I have a for loop that goes line by line and turns the current line into a temp array list.
How would I create a top ten of the lists with the most elements in the whole file?
This is the code I have now.
file = open('node.txt', 'r')
adj = {}
top_ten = []
at_least_3 = 0
for line in file:
    data = line.split()
    adj[data[0]] = data[1:]
And this is what one of the lists looks like
['99995', '110038', '330533', '333808', '344852', '376948', '470766', '499315']
# collect the lines
lines = []
with open("so.txt") as f:
    for line in f:
        # split each line into a list
        lines.append(line.split())
# sort the lines by length, descending
lines = sorted(lines, key=lambda x: -len(x))
# print the first 10 lines
print(lines[:10])
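Sorting every line works, but only the ten longest lists are needed. A possible alternative, sketched here with `heapq.nlargest` from the standard library, avoids sorting the whole input. The helper name `longest_lists` is illustrative and the sample rows are shortened, made-up data.

```python
import heapq

def longest_lists(lines, n=10):
    """Return the n longest whitespace-separated lists, longest first."""
    rows = (line.split() for line in lines)  # lazily split each line
    return heapq.nlargest(n, rows, key=len)

sample = [
    "100005 127545 202036",
    "10001 27638 51569 88226 116422",
    "382502 390538",
    "191044 246142 265214 290507",
]
top_two = longest_lists(sample, n=2)
print([len(row) for row in top_two])  # [5, 4]
```

For the real file the call would presumably be `with open('node.txt') as f: top_ten = longest_lists(f)`.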
Why not use collections to display the top 10? i.e.:
import re
import collections
file = open('numbers.txt', 'r')
content = file.read()
numbers = re.findall(r"\d+", content)
counter = collections.Counter(numbers)
print(counter.most_common(10))
When wanting to count and then find the one(s) with the highest counts, collections.Counter comes to mind:
from collections import Counter
lists = Counter()
with open('node.txt', 'r') as file:
    for line in file:
        values = line.split()
        lists[tuple(values)] = len(values)
print('Length Data')
print('====== ====')
for values, length in lists.most_common(10):
    print('{:2d} {}'.format(length, list(values)))
Output (using sample file data):
Length Data
====== ====
10 ['191044', '246142', '265214', '290507', '300258', '341525', '348922', '359832', '365744', '382502']
10 ['191044', '246142', '265214', '290507', '296858', '300258', '341525', '348922', '359832', '365744']
10 ['10001', '27638', '51569', '88226', '116422', '126227', '159947', '162938', '184977', '188045']
7 ['382502', '390538', '410857', '433453', '479170', '489980', '540746']
7 ['100005', '127545', '202036', '257630', '362970', '376927', '429080']
Use a for loop and max() maybe? You say you've got a for loop that's placing the values into a temp array. From that you could use "max()" to pick out the largest value and put that into a list.
As an explicit for loop, appending max() to a new list:
newlist = []
for x in data:
    largest = max(x)
    newlist.append(largest)
Or as a list comprehension:
newlist = [max(x) for x in data]
Then from there you have to do the same process on the new list(s) until you get to the desired top 10 scenario.
EDIT: I've just realised that I've misread your question. You want to get the lists with the most elements, not the highest values. OK.
len() is a good one for this.
longest = []
for templist in data:
    if len(templist) > len(longest):
        longest = templist
That would give you the current longest list, and from there you could create a top 10 list of lengths or of the temp lists themselves, or both.
If your data is really as shown with each number the same length, then I would make a dictionary with key = line, value = length, get the top value / key pairs in the dictionary and voila. Sounds easy enough.

Append element to list, error: list index out of range

I need to analyse the Brightkite network and its check-ins. Basically, I have to count the number of distinct users who checked in at each location. When I run this piece of code on a small file (just the first 300 lines cut from the original file) it works well. But if I try to do the same with the original file, I get the error
users.append(columns[4])
IndexError: list index out of range. What could it be?
Here is my code:
from collections import Counter
f = open("b.txt")
locations = []
users = []
for line in f:
    columns = line.strip().split("\t")
    locations.append(columns[0])
    users.append(columns[4])
l = Counter(locations)
ml = l.most_common(10)
print(ml)
Here is the structure of the data
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e
58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411
58188 2008-12-30T15:30:08Z 46.522621 14.849618 58e12bc0d67e11
58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411
58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411
You should use the csv module and update the counter as you go:
from collections import Counter
import csv
with open("Brightkite_totalCheckins.txt") as f:
    r = csv.reader(f,delimiter="\t")
    cn = Counter()
    users = []
    for row in r:
        # update Counter as you go, no need to build another list
        # locations is row[4] not row[0]
        cn[row[4]] += 1
        # same as columns[]
        users.append(row[0])
print(cn.most_common(10))
Output from the full file:
[('00000000000000000000000000000000', 254619), ('ee81ef22a22411ddb5e97f082c799f59', 17396), ('ede07eeea22411dda0ef53e233ec57ca', 16896), ('ee8b1d0ea22411ddb074dbd65f1665cf', 16687), ('ee78cc1ca22411dd9b3d576115a846a7', 14487), ('eefadd1aa22411ddb0fd7f1c9c809c0c', 12227), ('ecceeae0a22411dd831d5f56beef969a', 10731), ('ef45799ca22411dd9236df37bed1f662', 9648), ('d12e8e8aa22411dd90196fa5c210e3cc', 9283), ('ed58942aa22411dd96ff97a15c29d430', 8640)]
If you print the lines using repr you see the file is tab separated:
'7611\t2009-08-30T11:07:52Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-30T00:15:20Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T20:28:13Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:53:59Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:19:36Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:16:45Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T11:52:32Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
..................
The very last line is:
'58227\t2009-01-21T00:24:35Z\t33.833333\t35.833333\t9f6b83bca22411dd85460384f67fcdb0\n'
so make sure that matches and that you have not modified the file, and there will be no IndexError.
Your code fails because you have some lines that look like '7573\t\t\t\t\n', the first of which is line number 1909858, so stripping and splitting leaves you with ['7573'].
Using the csv module, however, gives you ['7573', '', '', '', ''].
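The difference can be reproduced on the problematic line without the full data set. A small sketch comparing the original `strip().split("\t")` with a version that removes only the newline, and with `csv.reader`:

```python
import csv

line = "7573\t\t\t\t\n"

# strip() removes all surrounding whitespace, including the trailing tabs
stripped = line.strip().split("\t")
# removing only the newline keeps the empty fields
newline_only = line.rstrip("\n").split("\t")
# csv.reader also keeps the empty fields
csv_row = next(csv.reader([line], delimiter="\t"))

print(stripped)      # ['7573']
print(newline_only)  # ['7573', '', '', '', '']
print(csv_row)       # ['7573', '', '', '', '']
```

Either of the last two approaches keeps index 4 valid for such lines, although the value found there is an empty string.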
If you actually want a list of ten unique locations, you need to find the entries whose count is equal to 1:
# generator expression of key/values where value == 1
unique = (tup for tup in cn.items() if tup[1] == 1)  # use cn.iteritems() on Python 2
from itertools import islice
# take first 10 elements from unique
sli = list(islice(unique,10))
print(sli)
[('2d4920e7273c755704c06f2201832d89', 1), ('a4ef963e84f83133484227465e2113e9', 1), ('474f93a6585111dea018003048c10834', 1), ('413754d668b411de9a19003048c0801e', 1), ('d115daaca22411ddb75a33290983eb13', 1), ('4bac110041ad11de8fca003048c0801e', 1), ('fc706c121ec1f54e0a828548ac5e26b8', 1), ('1bcd0cf0f0bd11ddb822003048c0801e', 1), ('e6ed6c09b8994ed125f3c5ef6c210844', 1), ('493ef9b049cfb2c6c24667a931f1592172074545', 1)]
To get the count of all unique locations we can use the rest of our generator expression with sum adding 1 for every element and add the total to the length of what we took with islice:
print(sum(1 for _ in unique) + len(sli))
Which gives you 426831 unique locations.
Using re.split or str.split is not going to work for an obvious reason:
In [13]: re.split("\s+", '7573\t\t\t\t\n'.rstrip())
Out[13]: ['7573']
In [14]: '7573\t\t\t\t\n'.rstrip().split()
Out[14]: ['7573']
The problem was in your data; I checked the website data you provided. The data is not actually separated by tabs, just by spaces. I added some lines to replace the spaces with tabs and then split the line. It works now.
from collections import Counter
f = open("b.txt")
locations = []
users = []
for line in f:
    line = line.replace(" ","\t")
    line = line.replace(" ","\t")
    line = line.replace(" ","\t")
    line = line.replace(" ","\t")
    columns = line.strip().split("\t")
    locations.append(columns[0])
    users.append(columns[4])
l = Counter(locations)
ml = l.most_common(10)
print(ml)
Note: If you have errors like this, check your data; the error message makes it clear that there is no element at index 4 (the fifth field).
I hope this is the error you want to resolve.
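For the task stated in the question, counting the distinct users who checked in at each location, one possible sketch collects a set of user ids per location. It assumes, as the csv answer above notes, that the user id is the first tab-separated field and the location id is the fifth. `distinct_users_per_location` is a hypothetical helper name, skipping rows with an empty location is a choice made here, and the sample rows are made up in the shape of the question's data.

```python
import csv
from collections import Counter, defaultdict

def distinct_users_per_location(lines):
    """Map each location id to the number of distinct users seen there."""
    users_by_location = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 5 or not row[4]:
            continue  # skip short rows and rows without a location id
        users_by_location[row[4]].add(row[0])
    return Counter({loc: len(users) for loc, users in users_by_location.items()})

sample = [
    "58186\t2008-12-03T21:09:14Z\t39.633321\t-105.317215\tee8b88dea22411\n",
    "58186\t2008-11-30T22:30:12Z\t39.633321\t-105.317215\tee8b88dea22411\n",
    "58187\t2008-08-14T21:23:55Z\t41.257924\t-95.938081\tee8b88dea22411\n",
    "58187\t2008-08-14T07:08:59Z\t41.295474\t-95.999814\tf3bb9560a2532e\n",
    "7573\t\t\t\t\n",
]
counts = distinct_users_per_location(sample)
print(counts.most_common(2))
# [('ee8b88dea22411', 2), ('f3bb9560a2532e', 1)]
```

Using a set per location means repeated check-ins by the same user are counted once, which is the difference from counting raw rows with `Counter`.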

Python splitting up line into separate lists

I have data in a text file that is space separated into right-aligned columns. I would like to be able to take each column and put it in a list, basically like you would do with an array. I can't seem to find an equivalent to
left(strname,#ofcharacters)/mid(strname,firstcharacter,lastcharacter)/right(strname,#ofcharacters)
like you would normally use in VB to accomplish the task. How do I separate the data in Python and put each 'unit' in a list together with the matching values from the following lines?
Is it possible? Oh yeah, some columns are 12 characters apart (right aligned) while others are 15 characters apart.
-1234 56 32452 68584.4 Extra_data
-5356 9 546 12434.5 Extra_data
- 90 12 2345 43522.1 Extra_data
Desired output:
[-1234, -5356, -90]
[56, 9, 12]
[32452, 546, 2345]
etc
The equivalent method in Python you are looking for is str.split() without any arguments, which splits the string on whitespace. It also takes care of any trailing newline/spaces and, unlike your VB example, you do not need to care about the data width.
Example
with open("data.txt") as fin:
    data = map(str.split, fin) #Split each line of data on white-spaces
    data = list(zip(*data)) #Transpose the Data (list() materialises the result on Python 3)
But if you have columns that contain whitespace, you need to split the data based on column position:
>>> def split_on_width(data, pos):
        if pos[-1] != len(data):
            pos = pos + (len(data), )
        indexes = zip(pos, pos[1:]) #Create an index pair with current start and
                                    #end as next start
        return [data[start: end].strip() for start, end in indexes] #Slice the data using
                                                                    #the indexes
>>> def trynum(n):
        try:
            return int(n)
        except ValueError:
            pass
        try:
            return float(n)
        except ValueError:
            return n
>>> pos
(0, 5, 13, 22, 36)
>>> with open("test.txt") as fin:
        data = (split_on_width(data.strip(), pos) for data in fin)
        data = [[trynum(n) for n in row] for row in zip(*data)]
>>> data
[[-1234, -5356, -90], [56, 9, 12], [32452, 546, 2345], [68584.4, 12434.5, 43522.1], ['Extra_data', 'Extra_data', 'Extra_data']]
Just use str.split() with no arguments; it splits an input string on arbitrary width whitespace:
>>> ' some_value another_column 123.45 42 \n'.split()
['some_value', 'another_column', '123.45', '42']
Note that any columns containing whitespace would also be split.
If you want lists of columns, you need to transpose the rows:
with open(filename) as inputfh:
    columns = zip(*(l.split() for l in inputfh))
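In Python 3, `zip` and `map` return iterators rather than lists, so it can be convenient to materialise the transposed columns explicitly. A minimal sketch, assuming no column contains embedded whitespace; `read_columns` and `to_number` are illustrative names, the conversion deliberately leaves non-numeric fields as strings, and the sample rows are reconstructed from the question with made-up spacing.

```python
def to_number(text):
    """Convert text to int or float when possible, else return it unchanged."""
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return text

def read_columns(lines):
    """Split each line on whitespace and return the data column by column."""
    rows = [line.split() for line in lines if line.strip()]
    # zip(*rows) transposes rows into columns
    return [[to_number(value) for value in column] for column in zip(*rows)]

sample = [
    "  -1234     56   32452   68584.4  Extra_data\n",
    "  -5356      9     546   12434.5  Extra_data\n",
    "    -90     12    2345   43522.1  Extra_data\n",
]
columns = read_columns(sample)
print(columns[0])  # [-1234, -5356, -90]
print(columns[3])  # [68584.4, 12434.5, 43522.1]
```

Note that `zip` stops at the shortest row, so a line with missing fields would silently truncate the later columns.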
