I am trying to extract certain strings of data from a text file.
The code I use is the following. I want to find the particular strings (all_actions) in that text file, store each one in an array or list if it is found, and then display them in the same order.
import string
solution_path = "/homer/my_dir/solution_detail.txt"
solution = open(solution_path).read()
all_actions = ['company_name','email_address','full_name']
n = 0
sequence_array = []
for line in solution:
    for action in all_actions:
        if action in line:
            sequence_array[n] = action
            n = n+1
for x in range(len(sequence_array)):
    print (sequence_array[x])
But this code does not do anything; it just runs without any error.
There are multiple problems with the code.
.read() on a file produces a single string. As a result, for line in solution: iterates over each character of the file's text, not over each line. (The name line is not special, in case you thought it was. The iteration depends only on what is being iterated over.) The natural way to get lines from the file is to loop over the file itself, while it is open. To keep the file open and make sure it closes properly, we use a with block.
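To see the difference concretely, here is a minimal sketch (using the same solution_path as above):
text = open(solution_path).read()   # one big string
for line in text:
    print(line)                     # prints one *character* per iteration

with open(solution_path) as f:
    for line in f:                  # the open file yields one line at a time
        print(line)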
You may not simply assign to sequence_array[n] unless the list is already at least n+1 elements long. (The reason you don't get an error from this is that if action in line: is never true, because of the first point.) Fortunately, we can simply .append to the end of the list instead.
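For example (a tiny illustration, not part of the original code):
items = []
# items[0] = 'x'    # would raise IndexError: list assignment index out of range
items.append('x')   # works: the list grows as needed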
If a line contains more than one of the all_actions, it would be stored multiple times. This is probably not what you want to happen. The built-in any function makes it easier to deal with this problem; we can supply it with a generator expression for an elegant solution. But if your exact needs are different, then of course there are different approaches.
While the last loop is okay in theory, it is better to loop directly, the same way you attempt to loop over solution. But instead of building up a list, we could instead just print the results as they are found.
So, for example:
with open(solution_path) as solution:
    for line in solution:
        if any(action in line for action in all_actions):
            print(line)
What is happening is that solution contains all the text inside the file. Therefore, when you iterate with for line in solution, you are actually iterating over each and every character separately, which is why you never get any hits.
Try the following code (I can't test it since I don't have your file):
solution_path = "/homer/my_dir/solution_detail.txt"
all_actions = ['company_name','email_address','full_name']
sequence_array = []
with open(solution_path, 'r') as f:
    for line in f.readlines():
        for action in all_actions:
            if action in line:
                sequence_array.append(action)
This will collect all the actions in the document. If you want to print all of them:
for action in sequence_array:
    print(action)
Related
I'm trying to make a function that, given input from the user, can map the input to a list of strings in a text file, and return some integer corresponding to the string in the file. Essentially, I check whether the user's input is in the file and return the index of the matching string in the file. I have a working function, but it seems slow and error-prone.
def parseInput(input):
    Gates = []
    try:
        textfile = open("words.txt")
        while nextLine:
            nextLine = textfile.readline()
            Gates[n] = nextLine #increment n somewhere
    finally:
        textfile.close()
    while n <= len(Gates):
        nextString = Gates[n]
        if input in nextString:
            #Exit loop
    with open("wordsToInts.txt") as textfile:
        #Same procedure as the try loop(why isn't that one a with loop?)
        if(correct):
            return number
This seems rather... bad. I just can't seem to think of a better way to do this though. I have full control over words.txt and wordsToInts.txt (should I combine these?), so I can format them as I please. I'm looking for suggestions re: the function itself, but if a change to the text files would help, I would like to know. My goal is to reduce cause for error, but I will add error checking later. Please suggest a better way to write this function. If writing in code, please use Python. Pseudocode is fine, however.
I would say to combine the files. You can have your words, and their corresponding values as follows:
words.txt
string1|something here
string2|something here
Then you can store each line as an entry to a dictionary and recall the value based on your input:
def parse_input(input):
    word_dict = {}
    with open('words.txt') as f:
        for line in f.readlines():
            line_key, line_value = line.split('|', 1)
            word_dict[line_key] = line_value.rstrip('\n')
    try:
        return word_dict[input]
    except KeyError:
        return None
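As a usage sketch, assuming words.txt holds the two sample lines shown above:
print(parse_input('string1'))   # -> 'something here'
print(parse_input('missing'))   # -> None (key not found)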
I'm trying to make a function that, given input from the User, can map input to a list of strings in a text file, and return some integer corresponding to the string in the file. Essentially, I check if what the user input is in the file and return the index of the matching string in the file
def get_line_number(input):
    """Finds the number of the first line in which `input` occurs.
    If input isn't found, returns -1.
    """
    with open('words.txt') as f:
        for i, line in enumerate(f):
            if input in line:
                return i
    return -1
This function will meet the specification from your description, with the additional assumption that the strings you care about are on separate lines. Notable things:
File objects in Python act as iterators over the lines of their contents. You don't have to read the lines into a list if all you need to do is check each individual line.
The enumerate function takes an iterator and returns a generator which yields a tuple like (index, element), where element is an element in your iterator and index is its position inside the iterator.
The term iterator means any object that's a sequence of things you can access in a for loop.
The term generator means an object which generates elements to iterate through "on-the-fly". What this means in this case is that you can access each line of a file one by one without having to load the entire file into your machine's memory.
This function is written in the standard Pythonic style, with a docstring, appropriate casing on variable names, and a descriptive name.
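To make the enumerate point above concrete, here is a small sketch (the file name and its contents are just assumptions here):
with open('words.txt') as f:
    for i, line in enumerate(f):
        print(i, line.rstrip('\n'))   # 0, 1, 2, ... paired with each line of the file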
(I'm using python 2.7 for now)
So maybe I'm not understanding how this line of code is working, because for one part of my program, it seems to be working fine, while in another part, it doesn't.
elif not any(user in line for line in data):
Basically, I have a csv file that I'm reading from and storing in the variable "data" like this:
f = open("scores.csv")
data = csv.reader(f)
the variable "user" is a string from an Entry box in Tkinter,
and the variable "line" is an arbitrary name for the for loop, just like in a piece of code that says "for i in range(69):"
So what my brain thinks this line should do is: if it fails to find a match of user in any of the lines of the csv file, run the code under that statement. But it doesn't seem to do that!
However, later on in my code I try something similar:
elif any(user in line for line in data):
and this seems to work without any problems!!
I have no idea why, and I could not find anyone else on the internet trying to do this, lol.
I'm trying to make a login form as a beginner project, as I somewhat know python, so I wanted to see what I can do, but I seem to be stuck here.
I have uploaded my code to github for anyone to review:
https://github.com/Arunscape/login-form/blob/master/login.py
oh and don't worry about the "passwords" in the csv file, they're of course fake!
Any help is appreciated. Thanks!!!
The problem you have is that data is an iterator, not a sequence you can iterate on multiple times. After you call any with a generator expression iterating over data, some or all of the items will have been consumed. Later calls will only see what is left over (which may be nothing if the first iteration had to check all the data).
You can reproduce the issue with a much simpler bit of code:
iterator = iter(range(10)) # an iterator over the numbers 0 through 9
first_result = any(x == 3 for x in iterator) # this will be True
second_result = any(x == 3 for x in iterator) # the same expression will be False this time!
The first any call consumes (via the generator expression) the numbers 0 through 3 from the iterator. Then it stops and any returns True (stopping early in this way is known as "short circuiting").
The second any call only gets to consume the remaining items, it can't see the ones that were already yielded to the first any call. Since the iterator will only yield one 3, the second any call will return False after consuming the rest of the numbers.
For your code to work correctly with data being an iterator, you can only iterate over it once.
If there are not too many values in your csv file, you might be better off reading all the rows into a list, which you can iterate over as many times as you want. Try:
data = list(csv.reader(f))
It might make sense to parse the data into a more meaningful data structure though, rather than a list (e.g. a dictionary mapping usernames to passwords, which you could query in O(1) time rather than O(N) time).
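For instance, assuming each row of scores.csv starts with a username followed by a password (the actual column layout isn't shown, so this is only a sketch) and that user holds the string from the Entry box:
import csv

with open("scores.csv") as f:
    credentials = {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}

if user in credentials:      # O(1) lookup instead of scanning every row
    print("user exists")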
The following is code I have written that tries to open individual files, which are long strips of data, and read them into an array. Essentially I have files that run over 15 times (24 hours to 360 hours), and each file has an iteration of 50, hence the two loops. I then try to open the files into an array. When I try to print a specific element in the array, I get the error "'file' object has no attribute '__getitem__'". Any ideas what the problem is? Thanks.
#!/usr/bin/python
############################################
#
import csv
import sys
import numpy as np
import scipy as sp
#
#############################################
level = input("Enter a level: ");
LEVEL = str(level);
MODEL = raw_input("Enter a model: ");
NX = 360;
NY = 181;
date = 201409060000;
DATE = str(date);
#############################################
FileList = [];
data = [];
for j in range(1,51,1):
    J = str(j);
    for i in range(24,384,24):
        I = str(i);
        fileName = '/Users/alexg/ECMWF_DATA/DAT_FILES/'+MODEL+'_'+LEVEL+'_v_'+J+'_FT0'+I+'_'+DATE+'.dat';
        FileList.append(fileName);
        fo = open(fileName,"rb");
        data.append(fo);
        fo.close();
print data[1][1];
print FileList;
EDITED TO ADD:
Below, find the CORRECT array that the python script should be producing (sorry, it won't let me post this inline yet):
http://i.stack.imgur.com/ItSxd.png
The problem I now run into, is that the first three values in the first row of the output matrix are:
-7.090874
-7.004936
-6.920952
These values are actually the first three values of the 11th row in the array below, which is the how it should look (performed in MATLAB). The next three values the python script outputs (as what it believes to be the second row) are:
-5.255577
-5.159874
-5.064171
These values should be found in the 22nd row. In other words, python is placing the 11th row of values in the first position, the 22nd in the second and so on. I don't have a clue as to why, or where in the code I'm specifying it do this.
You're appending the file objects themselves to data, not their contents:
fo = open(fileName,"rb");
data.append(fo);
So, when you try to print data[1][1], data[1] is a file object (a closed file object, to boot, but it would be just as broken if still open), so data[1][1] tries to treat that file object as if it were a sequence, and file objects aren't sequences.
It's not clear what format your data are in, or how you want to split it up.
If "long strips of data" just means "a bunch of lines", then you probably wanted this:
data.append(list(fo))
A file object is an iterable of lines, it's just not a sequence. You can copy any iterable into a sequence with the list function. So now, data[1][1] will be the second line in the second file.
(The difference between "iterable" and "sequence" probably isn't obvious to a newcomer to Python. The tutorial section on Iterators explains it briefly, the Glossary gives some more information, and the ABCs in the collections module define exactly what you can do with each kind of thing. But briefly: An iterable is anything you can loop over. Some iterables are sequences, like list, which means they're indexable collections that you can access like spam[0]. Others are not, like file, which just reads one line at a time into memory as you loop over it.)
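A small illustration of that difference (the file name here is hypothetical):
fo = open('example.dat', 'rb')
# fo[1]            # TypeError: a file object is not indexable
lines = list(fo)   # copy the iterable into a real sequence (a list of lines)
print(lines[1])    # indexing works now: the second line of the file
fo.close()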
If, on the other hand, you actually imported csv for a reason, you more likely wanted something like this:
reader = csv.reader(fo)
data.append(list(reader))
Now, data[1][1] will be a list of the columns from the second row of the second file.
Or maybe you just wanted to treat it as a sequence of characters:
data.append(fo.read())
Now, data[1][1] will be the second character of the second file.
There are plenty of other things you could just as easily mean, and easy ways to write each one of them… but until you know which one you want, you can't write it.
I have a file which has about 25000 lines, and it's a s19 format file.
Each line is like: S214 780010 00802000000010000000000A508CC78C 7A
There are no spaces in the actual file; the first part, 780010, is the address of this line, and I want it to be a dict key, with the data part, 00802000000010000000000A508CC78C, as the value of that key. I wrote my code like this:
def __init__(self,filename):
    infile = file(filename,'r')
    self.all_lines = infile.readlines()
    self.dict_by_address = {}
    for i in range(0, self.get_line_number()):
        self.dict_by_address[self.get_address_of_line(i)] = self.get_data_of_line(i)
    infile.close()
get_address_of_line() and get_data_of_line() are both simply string slicing functions, and get_line_number() iterates over self.all_lines and returns an int.
The problem is, the init process takes over 1 minute. Is the way I construct the dict wrong, or does Python just need that long to do this?
And by the way, I'm new to Python :) Maybe the code looks more C/C++-like; any advice on how to program more Pythonically is appreciated :)
How about something like this? (I made a test file with just a line S21478001000802000000010000000000A508CC78C7A so you might have to adjust the slicing.)
>>> with open('test.test') as f:
... dict_by_address = {line[4:10]:line[10:-3] for line in f}
...
>>> dict_by_address
{'780010': '00802000000010000000000A508CC78C'}
This code should be tremendously faster than what you have now. EDIT: As #sth pointed out, this doesn't work because there are no spaces in the actual file. I'll add a corrected version at the end.
def __init__(self,filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            _, key, value, _ = line.split()
            self.dict_by_address[key] = value
Some comments:
Best practice in Python is to use a with statement, unless you are using an old Python that doesn't have it.
Best practice is to use open() rather than file(); I don't think Python 3.x even has file().
You can use the open file object as an iterator, and when you iterate it you get one line from the input. This is better than calling the .readlines() method, which slurps all the data into a list; then you use the data one time and delete the list. Since the input file is large, that means you are probably causing swapping to virtual memory, which is always slow. This version avoids building and deleting the giant list.
Then, having created a giant list of input lines, you use range() to make a big list of integers. Again it wastes time and memory to build a list, use it once, then delete the list. You can avoid this overhead by using xrange() but even better is just to build the dictionary as you go, as part of the same loop that is reading lines from the file.
It might be better to use your special slicing functions to pull out the "address" and "data" fields, but if the input is regular (always follows the pattern of your example) you can just do what I showed here. line.split() splits the line on white space, giving a list of four strings. Then we unpack it into four variables using "destructuring assignment". Since we only want to save two of the values, I used the variable name _ (a single underscore) for the other two. That's not really a language feature, but it is an idiom in the Python community: when you have data you don't care about you can assign it to _. This line will raise an exception if there are ever any number of values other than 4, so if it is possible to have blank lines or comment lines or whatever, you should add checks and handle the error (at least wrap that line in a try:/except).
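To illustrate the unpacking and the suggested error handling, here is a small sketch on a hypothetical space-separated record:
line = "S214 780010 00802000000010000000000A508CC78C 7A"   # hypothetical spaced record
try:
    _, key, value, _ = line.split()
except ValueError:
    key = value = None   # not exactly four fields; handle the bad line as appropriate
print(key)    # '780010'
print(value)  # '00802000000010000000000A508CC78C'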
EDIT: corrected version:
def __init__(self,filename):
    self.dict_by_address = {}
    with open(filename, 'r') as infile:
        for line in infile:
            key = extract_address(line)
            value = extract_data(line)
            self.dict_by_address[key] = value
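extract_address and extract_data here stand in for your existing slicing functions; based on the record layout described in the question, they might look roughly like this (the exact offsets are an assumption, so adjust them to your format):
def extract_address(line):
    # characters 4..9: the 6-character address field after the 'S214' prefix (assumed)
    return line[4:10]

def extract_data(line):
    # everything between the address and the 2-character checksum plus newline (assumed)
    return line[10:-3]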
First, Python Newbie; be patient/kind.
Next, once a month I receive a large text file (think 7 million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine if the record is duplicated, I compress all special characters from the part number (except . and #) and create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straightforward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) to get the previous record where there was a supplier code/compressed part number match.
So far, my code looks like this:
# Compress Full Part to a Compressed Part
# and Check for Duplicates on Supplier Code
# and Compressed Part combination
import sys
import re
import time
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
start=time.time()
try:
    file1 = open("C:\Accounting\May Accounting\May.txt", "r")
except IOError:
    print >> sys.stderr, "Cannot Open Read File"
    sys.exit(1)
try:
    file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a")
except IOError:
    print >> sys.stderr, "Cannot Open Write File"
    sys.exit(1)
hdrList="CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR"
file2.write(hdrList+chr(10))
lines_seen=set()
affirm="YES"
records = file1.readlines()
for record in records:
    fields = record.split(chr(124))
    if fields[0]=="CIGSupplier":
        continue #If incoming file has a header line, skip it
    file2.write(fields[0]+"|"), #Supplier Code
    file2.write(fields[1]+"|"), #Full_Part
    file2.write(fields[2]+"|"), #Part Status
    file2.write(fields[3]+"|"), #Alias Flag
    file2.write(re.sub("[$\r\n]", "", fields[4])+"|"), #Acquisition Flag
    file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1])+"|"), #Compressed_Part
    dupechk=fields[0]+"|"+re.sub("[^0-9a-zA-Z.#]", "", fields[1])
    if dupechk not in lines_seen:
        file2.write(chr(10))
        lines_seen.add(dupechk)
    else:
        file2.write(affirm+chr(10))
print "it took", time.time() - start, "seconds."
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
file2.close()
file1.close()
It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import them into Access and do a self join to locate the duplicates. Loading/querying/exporting a file this size in Access takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file.
Confusing enough?
Thanks.
Maybe you could consider building a dictionary that maps (supplier_number, compressed_part_number) tuples to data structures (nested lists perhaps, or instances of a custom class for improved readability and maintainability) holding the line numbers of the records that match the key tuple, plus possibly the complete records themselves.
This would end up putting all the data from the file into a large in-memory dictionary, which might or might not be a problem depending on your requirements; if you skip the actual records and only hold line numbers, the dictionary will be much smaller.
You can then iterate over the entries in the dictionary spitting out the duplicates to a file as you go.
I think you should sort the entries in the input file first. It may consume too much memory, but you should first try to read all input into memory, sort it based upon the value of dupechk, and then iterate over all entries and easily see whether there are two or more identical records. Because identical records are grouped, it is easy to output just those records.
This might be more efficient/feasible for the large files you are dealing with:
Sort the file based on the supplier code and compressed part number - dump it to a temporary file. I don't think it is worth actually tacking on the compressed part number, just compute it from the full part number when needed. However, that is pure conjecture and definitely deserves some quick benchmarking.
Iterate through the temporary file (might want to take advantage of 'with'). Check if current line's supplier code and compressed part number is identical to previous one - if it is, you have identified a duplicate. Handle as you see fit. Since the file is sorted you reduce the memory requirement of needing to store all the lines in memory to a set of consecutive identical lines.
You are already reading the whole file into memory. You don't need to sort. Instead of a set, have a dict mapping (supplier, compressed_pn) to line_number_last_seen - 1. That way, when you discover a duplicate, you can output the two duplicate records immediately. This method requires only one pass over the file. You don't need to write a temporary file.
If you often have 3 or more records with the same key, you may wish to use an approach that maps the key to a list of line indices. At the end of reading the file, you iterate over the dictionary looking for lists with more than 1 entry.
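A rough sketch of that key-to-line-numbers idea (the file names and the exact field positions are assumptions based on the question):
import re
from collections import defaultdict

seen = defaultdict(list)   # (supplier code, compressed part number) -> line numbers

with open("May.txt") as infile:                     # hypothetical input name
    for line_number, record in enumerate(infile):
        fields = record.split("|")
        key = (fields[0], re.sub("[^0-9a-zA-Z.#]", "", fields[1]))
        seen[key].append(line_number)

with open("duplicates.txt", "w") as outfile:        # hypothetical output name
    for key, line_numbers in seen.items():
        if len(line_numbers) > 1:                   # more than one record shares this key
            outfile.write("%s|%s -> lines %s\n" % (key[0], key[1], line_numbers))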
Couple of comments:
Using file.readlines on a large file is wasteful - it reads the entire file into memory. You should, instead, take advantage of the fact that a file is iterable, reading a single line at a time by default.
Your file format is basically a CSV, with a pipe instead of a comma as a separator. So, use the csv module. The csv module is written in C and avoids most of the interpreter overhead. It also provides a nice iterable interface which does not require reading the whole file into memory, either.
You should additionally use a DictReader from the csv module. If the header is in the file, great, the class will parse it and use as the keys further on. If not, specify the header in the code. Either way, fields[0] is uninformative and error prone. fields["CIGSUPPLIER"] is much more self-documenting.
Just as with reading, use the csv module for writing. Again, you can specify the delimiter.
Don't use file2.write(chr(10)). Use file2.write('\n'), and open your file appropriately. Alternatively, if you're using the csv.writer class, these become unnecessary.
Otherwise, your logic and flow look alright. I'd overall advise against using the chr(*) calls unless that character is truly unprintable. Newlines and pipes are printable (or have supported escapes), and should be used as such.
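Putting those suggestions together, a rough sketch with the csv module (the field names come from the header in the question, the input is assumed to carry the first five columns in order, and the file names are shortened, so treat this as illustrative rather than a drop-in replacement):
import csv
import re

out_fields = ["CIGSUPPLIER", "FULL_PART", "PART_STATUS", "ALIAS_FLAG",
              "ACQUISITION_FLAG", "COMPRESSED_PART", "DUPLICATE_INDICATOR"]

seen = set()
with open("May.txt", "rb") as infile, open("May_COMPRESSPN.txt", "wb") as outfile:
    reader = csv.DictReader(infile, fieldnames=out_fields[:5], delimiter="|")
    writer = csv.DictWriter(outfile, fieldnames=out_fields, delimiter="|")
    writer.writeheader()
    for row in reader:
        if row["CIGSUPPLIER"] == "CIGSupplier":
            continue                                 # skip an input header line, if present
        row["COMPRESSED_PART"] = re.sub("[^0-9a-zA-Z.#]", "", row["FULL_PART"])
        key = (row["CIGSUPPLIER"], row["COMPRESSED_PART"])
        row["DUPLICATE_INDICATOR"] = "YES" if key in seen else ""
        seen.add(key)
        writer.writerow(row)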