Searching through data in txt files

I'm teaching myself Python and want to learn how to search through text files. For example, I've got a long list of full names and addresses, and I want to be able to type in a first name and then print the details corresponding to that name. What would be the best way to go about this? Thanks!
The data I have is in a .txt file in columns like this:
Doe, John        London
Doe, Jane        Paris

If you've designed the data format, fixed-width columns aren't a very good one. But if you're stuck with them, they're easy to deal with.
First, you want to parse your data:
addressbook = []
with open('addressbook.txt', 'r') as f:
    for line in f:
        name, city = line[:17], line[17:]
        last, first = name.split(',')
        addressbook.append((first, last, city))
But now, you want to be able to search by first name. You can do that, but it might be slow for a huge addressbook, and the code won't be very direct:
def printDetails(addressbook, firstname):
    for (first, last, city) in addressbook:
        if first == firstname:
            print first, last, city
What if, instead of just a list of tuples, we used a dictionary, mapping first names to the other fields?
addressbook = {}
with open('addressbook.txt', 'r') as f:
    for line in f:
        name, city = line[:17], line[17:]
        last, first = name.split(',')
        addressbook[first] = (last, city)
But that's no good—each new "John" will erase any previous "John". So what we really want is a dictionary, mapping first names to lists of tuples:
import collections
addressbook = collections.defaultdict(list)
with open('addressbook.txt', 'r') as f:
    for line in f:
        name, city = line[:17], line[17:]
        last, first = name.split(',')
        addressbook[first].append((last, city))
Now, if I want to see the details for that first name:
def printDetails(addressbook, firstname):
    for (last, city) in addressbook[firstname]:
        print firstname, last, city
Whichever way you go, there are a few obvious places to improve this. For example, you may notice that some of the fields have extra spaces at the start or end. How would you get rid of those? If you call printDetails on "Joe" and there is no "Joe", you get nothing at all; maybe a nice error message would be better. But once you've got the basics working, you can always add more later.
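One possible way to handle those two improvements is sketched below in Python 3 (the snippets above are Python 2). The 17-character name column matches the slices used above; the names NAME_WIDTH, load_addressbook and format_details, the sample rows, and the wording of the message are all made up for illustration.

```python
import collections
import io

NAME_WIDTH = 17  # assumed column width, matching the slices used above


def load_addressbook(lines):
    """Map each first name to a list of (last, city) tuples."""
    addressbook = collections.defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, city = line[:NAME_WIDTH], line[NAME_WIDTH:]
        last, first = name.split(',')
        # strip() removes the padding spaces and the trailing newline
        addressbook[first.strip()].append((last.strip(), city.strip()))
    return addressbook


def format_details(addressbook, firstname):
    """Return one line per match, or a message if nothing matches."""
    # .get() avoids creating an empty entry in the defaultdict
    matches = addressbook.get(firstname, [])
    if not matches:
        return ["No entry found for %r" % firstname]
    return ["%s %s, %s" % (firstname, last, city) for last, city in matches]


# Stand-in for open('addressbook.txt'); the rows are padded to NAME_WIDTH
sample = io.StringIO(
    "Doe, John".ljust(NAME_WIDTH) + "London\n"
    + "Doe, Jane".ljust(NAME_WIDTH) + "Paris\n"
)
book = load_addressbook(sample)
for text in format_details(book, "Jane") + format_details(book, "Joe"):
    print(text)
```

Returning the lines instead of printing them directly keeps the lookup easy to check; whether a missing name should produce a message or raise an error is a judgment call.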

I would make judicious use of the split command. It depends on how your file is delimited, of course, but your example shows that the characters splitting the data fields are spaces.
For each line in the file, do something like this:
last, first, city = [data.strip(',') for data in line.split()]
And then run your comparison based on those attributes.
Obviously, this will break if your data fields have spaces in them, so ensure that's not the case before you take a simple approach like this.
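If the city can contain spaces, limiting the number of splits is one way around that. This is only a sketch (Python 3), and it assumes the last name ends at the first comma and the first name is a single word; parse_line and the sample row are hypothetical.

```python
def parse_line(line):
    """Split 'Last, First City Name' into (last, first, city)."""
    last, rest = line.split(',', 1)    # the first comma ends the last name
    first, city = rest.split(None, 1)  # the next word is the first name
    return last.strip(), first, city.strip()


print(parse_line("Doe, John New York\n"))
```

A line with no city at all would raise a ValueError here, so real data may need a check before unpacking.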

To read a text file in Python, you do something like this:
f = open('yourtextfile.txt')
for line in f:
    # The for loop goes through the whole file line by line.
    # Now you can do what you want with the line; in your example,
    # you want to extract the first and last name and the city.
    print line
f.close()

You could do something as simple as this:
name = raw_input('Type in a first name: ')  # name to search for
with open('x.txt', 'r') as f:  # 'r' means we only intend to read
    for s in f:
        if s.split()[1] == name:  # s.split()[1] will return the first name
            print s
            break  # end the loop once we've found a match
    else:
        print 'Name not found.'  # this will be executed if no match is found
Type in a first name: Jane
Doe, Jane Paris
Relevant documentation
Reading and Writing Files
open

Related

how to read a large text file in a data frame from a .txt file in python

I have a large text file that contains names and long paragraphs of statements made by several different people. The file format is .txt, and I am trying to separate the name and the statement into two different columns of a data frame.
The data is in this format:
Harvey: I’m inclined to give you a shot. But what if I decide to go the other way?
Mike: I’d say that’s fair. Sometimes I like to hang out with people who aren’t that bright, you know, just to see how the other half lives.
Mike in the club
(mike speaking to jessica.)
Jessica: How are you mike?
Mike: good!
.....
....
and so on
The length of the text file is 4 million.
In the output I need a dataframe with a name column holding the speaker's name and a statement column holding that person's statement.

If the format is always "name: one-liner-no-colon", you could try:
import pandas as pd
df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')
or go manually:
import pandas as pd
file_contents = []
current_name = ""
current_dialogue = ""
with open("untitled.txt", "r") as f:
    for line in f:
        splitted_line = line.split(": ")
        if len(splitted_line) > 1:
            # you are on a row with "name: " on it
            # first stop the current dialogue and save it
            if current_name:
                file_contents.append([current_name, current_dialogue])
            # then update the name encountered
            current_name = splitted_line.pop(0)
            current_dialogue = ""
        current_dialogue += ": ".join(splitted_line)
# add the last dialogue line
file_contents.append([current_name, current_dialogue])
df = pd.DataFrame(file_contents)
df

If you read the file line-by-line, you can use something like this to split the speaker from the spoken text, without using regex.
def find_speaker_and_text_from_line(line):
    split = line.split(": ")
    name = split.pop(0)
    rest = ": ".join(split)
    return name, rest
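The same idea can be written with str.partition, which splits only on the first ": ". The sketch below (Python 3) also folds lines without a speaker into the previous statement; that is an assumption about how such lines should be treated, and parse_transcript and the sample lines are made up.

```python
def parse_transcript(lines):
    """Return a list of [speaker, statement] rows."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, sep, rest = line.partition(": ")
        if sep:
            rows.append([name, rest])  # a new "Name: text" row
        elif rows:
            rows[-1][1] += " " + line  # continuation of the previous statement
    return rows


sample = [
    "Harvey: I'm inclined to give you a shot.\n",
    "Mike: I'd say that's fair: sometimes.\n",
    "and so on\n",
]
rows = parse_transcript(sample)
print(rows)
```

The rows can then be handed to pd.DataFrame(rows, columns=["name", "statement"]). Note that a continuation line that happens to contain ": " would be mistaken for a new speaker, so stage directions and similar lines may need their own rule.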

How to get text between 2 lines with Python

So I have a text file that is structured like this:
Product ID List:
ABB:
578SH8
EFC025
EFC967
CNC:
HDJ834
HSLA87
...
...
This file continues on with many companies' names and Id's below them. I need to then get the ID's of the chosen company and append them to a list, where they will be used to search a website. Here is the current line I have to get the data:
PID = open('PID.txt').read().split()
This works great if there are only Product ID's of only 1 company in there and no text. This does not work for what I plan on doing however... How can I have the reader read from (an example) after where it says ABB: to before the next company? I was thinking maybe add some kind of thing in the file like ABB END to know where to cut to, but I still don't know how to cut out between lines in the first place... If you could let me know, that would be great!

Two consecutive newlines act as a delimiter, so just split there and construct a dictionary of the data:
data = {i.split()[0]: i.split()[1:] for i in open('PID.txt').read().split('\n\n')}

Since the file is structured like that you could follow these steps:
Split based on the two newline characters \n\n into a list
Split each list on a single newline character \n
Drop the first element for a list containing the IDs for each company
Use the first element (mentioned above) as needed for the company name (make sure to remove the colon)
Also, take a look at regular expressions for parsing data like this.
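The steps above could be sketched like this in Python 3. It assumes the company blocks really are separated by blank lines; parse_blocks and the sample string are illustrative only.

```python
def parse_blocks(text):
    """Map each company name to its list of product IDs."""
    companies = {}
    for block in text.split("\n\n"):  # one block per company
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if len(lines) < 2:
            continue  # header or empty block: no IDs to keep
        name = lines[0].rstrip(":")  # company name without the colon
        companies[name] = lines[1:]  # everything after the name is an ID
    return companies


sample = (
    "Product ID List:\n\n"
    "ABB:\n578SH8\nEFC025\nEFC967\n\n"
    "CNC:\nHDJ834\nHSLA87\n"
)
data = parse_blocks(sample)
print(data["ABB"])
```

A company with no IDs under it is skipped by the length check, which may or may not be what you want.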

with open('file.txt', 'r') as f:  # open the file
    next(f)  # skip the first line
    results = {}  # initialize a dictionary
    for line in f:  # iterate through the remainder of the file
        if ':' in line:  # if the line contains a :
            current = line.strip()  # strip the whitespace
            results[current] = []  # and add it as a dictionary entry
        elif line.strip():  # otherwise, and if content remains after stripping whitespace,
            results[current].append(line.strip())  # append this line to the relevant list

This should at least get you started. You will likely have better luck using dictionaries than lists, at least for the first part of your logic. By what method will you pass the codes along?
a = {}
f1 = open(r"C:\sotest.txt", 'r')  # raw string so the backslash is taken literally
current_key = ''
for row in f1:
    strrow = row.strip('\n')
    if strrow == "":
        pass
    elif ":" in strrow:
        current_key = strrow.strip(':')
        a[current_key] = []
    else:
        a[current_key].append(strrow)
f1.close()
for key in a:
    print key
    for item in a[key]:
        print item

How to check a list for a string then select the rest of that item

At the moment I have a text file with people who swim and their times, such as this:
jack 12
sarah 20
ben 4
Now I would like to be able to search this for, say, sarah and have it return her time.
This is what I currently have:
def Timers(swimmer):
    myFile = open("race.txt","r")
    lists = []
    for eachLine in myFile:
        lists += [eachLine.rstrip("\n")]
So I compiled them all into a single list. I know I can check the list to see whether a swimmer is in it, but I cannot work out how to select just the time.
I know that once I get, say, sarah 12, I can use split and then format it to get the time.
Thank you for the help.

You want a dict (a Python mapping) instead, and you want to read the file only once:
def Timers():
    with open("race.txt", "r") as myFile:
        swimmers = {}
        for eachLine in myFile:
            if eachLine.strip():
                swimmer, timer = eachLine.split()
                swimmers[swimmer] = timer
    return swimmers
The .split() call splits the line on whitespace, giving you a name and a timer string for each line.
Now Timers() returns a mapping containing all swimmer names as the keys, and their times as values. You can simply look up each and every swimmer:
timers = Timers()
print timers['sarah']
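A Python 3 variant of the same idea, sketched under the assumption that the times are whole numbers: converting them with int() lets you compare them, and .get() gives a fallback for an unknown swimmer. load_times and the fallback text are made up; the rows come from the question.

```python
import io


def load_times(lines):
    """Map each swimmer's name to their time as an integer."""
    times = {}
    for line in lines:
        if line.strip():
            swimmer, timer = line.split()
            times[swimmer] = int(timer)  # assumes whole-number times
    return times


# Stand-in for open("race.txt"), using the rows from the question
times = load_times(io.StringIO("jack 12\nsarah 20\nben 4\n"))
print(times["sarah"])
print(times.get("alice", "no such swimmer"))  # .get() avoids a KeyError
print(min(times, key=times.get))  # name of the swimmer with the lowest time
```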

Another approach to the problem:
def Timer(swimmer):
    myFile = open("race.txt", "r")
    lists = myFile.readlines()
    found = [l for l in lists if l.startswith(swimmer)][0]  # Gets first found swimmer
    time = found.split()[-1]  # Gets last item (e.g. time) in the split list
    myFile.close()
    return time
print Timer('jack')
This works even if the swimmer is specified with both first and last name. I used the same way to open the file as you did. But you really should use the with-statement as in the previous answer!

Adding data to a CSV file through Python

I've got a function that adds new data to a csv file. I've got it somewhat working. However, I'm having a few problems.
1. When I add the new values (name, phone, address, birthday), it adds them all in one column, rather than separate columns in the same row. (Not really much idea on how to split them up in various columns...)
2. I can only add numbers rather than string values. So if I write add_friend(blah, 31, 12, 45), it will come back saying blah is not defined. However, if I write add_friend(3,4,5,6), it'll add that to the new row, but into a single column.
3. An objective with the function is: If you try and add a friend that's already in the csv (say, Bob), and his address, phone, birthday are already in the csv, if you add_friend(Bob, address, phone, birthday), it should state False, and not add it. However, I have no clue how to do this. Any ideas?
Here is my code:
def add_friend (name, phone, address, birthday):
    with open('friends.csv', 'ab') as f:
        newrow = [name, phone, address, birthday]
        friendwriter = csv.writer(open('friends.csv', 'ab'), delimiter=' ',
                                  quotechar='|', quoting=csv.QUOTE_MINIMAL)
        friendwriter.writerow(newrow)
        #friendreader = csv.reader(open('friends.csv', 'rb'), delimiter=' ', quotechar='|')
        #for row in friendreader:
        #print ' '.join(row)
        print newrow

Based on your requirements, and what you appear to be trying to do, I've written the following. It should be verbose enough to be understandable.
You need to be consistent with your delimiters and other properties when reading the CSV files.
Also, try to move "friends.csv" into a global, or at least into a constant, rather than hard-coding it in each function.
import csv
def print_friends():
    with open("friends.csv", "rb") as f:
        reader = csv.reader(f, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            print row
def friend_exists(friend):
    with open("friends.csv", "rb") as f:
        reader = csv.reader(f, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            if row == friend:
                return True
    return False
def add_friend(name, phone, address, birthday):
    friend = [name, phone, address, birthday]
    if friend_exists(friend):
        return False
    # closing the file via "with" makes sure the new row is flushed to disk
    with open("friends.csv", "ab") as f:
        writer = csv.writer(f, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(friend)
    return True
print "print_friends: "
print_friends()
print "friend_exists: "
test_friend = ["barney", "4321 9876", "New York", "2000"]
print friend_exists(test_friend)
print "add_friend: "
print add_friend("barney", "4321 9876", "New York", "2000")

1. It doesn't do that. What makes you think that's what it does? It's possible that the quoting scheme you really want isn't the one you specified to csv.writer: i.e., spaces delimit columns, and | is the quoting character.
2. blah is not a string literal; "blah" is. blah without quotes is a variable reference, and the variable doesn't exist here.
3. In order to check whether a name is already in the CSV file, you have to read the whole CSV file first, checking for the details. Open the file twice: first for reading ('r'), using csv.reader to turn it into a row iterator and find all the names, and then for appending. You can add those names to a set, and then check against that set from then on.
re #3:
To get this set, you could define a function like so:
def get_people():
    with open(..., 'r') as f:
        return set(map(tuple, csv.reader(f)))
And then, if you assigned the set somewhere, e.g. existing_people = get_people(),
you could check against it when adding new people, as follows:
newrow = (name, phone, address, birthday)
if newrow in existing_people:
    return False
else:
    existing_people.add(newrow)
    friendwriter.writerow(newrow)
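Put together, the read-once-then-check idea might look like the following Python 3 sketch. The space delimiter and | quote character are taken from the question; DIALECT, load_friends, the extra path and known parameters, and the temporary file are assumptions made so the example runs on its own.

```python
import csv
import os
import tempfile

# Assumed dialect: the space delimiter and '|' quote character from the question
DIALECT = dict(delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)


def load_friends(path):
    """Read every row once and return the rows as a set of tuples."""
    if not os.path.exists(path):
        return set()
    with open(path, newline='') as f:
        return set(map(tuple, csv.reader(f, **DIALECT)))


def add_friend(path, known, name, phone, address, birthday):
    """Append the row unless it is already known; return True if added."""
    row = (name, phone, address, birthday)
    if row in known:
        return False
    with open(path, 'a', newline='') as f:
        csv.writer(f, **DIALECT).writerow(row)
    known.add(row)
    return True


# Demonstration in a temporary directory so no real file is touched
path = os.path.join(tempfile.mkdtemp(), 'friends.csv')
known = load_friends(path)
first = add_friend(path, known, 'barney', '4321 9876', 'New York', '2000')
second = add_friend(path, known, 'barney', '4321 9876', 'New York', '2000')
print(first, second)
```

Everything read back from a CSV file is a string, so this comparison only works reliably if the values passed in are strings too.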

You haven't said how experienced with Python you already are, so I am aiming this a little low; no offence intended.
There are several "requirements" for your homework. In general, you should ensure that one function does one thing. So, to meet all your requirements, you’ll need several functions; look at creating at least one module (i.e., a file with functions in it).
A space delimiter and a | for quotes is pretty unusual. For the current file, what is the delimiter between columns? And what is used to quote/escape text? (By "escaping text", I mean: if I have a csv file that uses commas as the column delimiter, and I want to put a sentence with commas into just one column, I need to tell the difference between a comma that means "new column" and a comma that is part of a sentence in a column. Microsoft decided that Excel would support double quotes, so "hello, sailor" became a de facto standard.)
If you want to know if "bob brown" is already in the file, you will need to read the whole file first before trying to insert. You can do this using 'r', then 'w'. But should you read the whole file every time you want to insert one record? What if you have a hundred records to add: should you read the whole file each time? Is there a way to store the names during the adding process?
blah is not a string. It needs to be quoted to be a string literal ("blah"). blah just refers to a variable whose name is blah. If it says blah is not defined, that’s because you have not declared the variable blah to hold anything.

Python- File Parsing

Write a program which reads a text
file called input.txt which contains
an arbitrary number of lines of the
form "<city>, <country>" then records this
information using a dictionary, and
finally outputs to the screen a list
of countries represented in the file
and the number of cities contained.
For example, if input.txt contained the following:
New York, US
Angers, France
Los Angeles, US
Pau, France
Dunkerque, France
Mecca, Saudi Arabia
The program would output the following (in some order):
Saudi Arabia : 1
US : 2
France : 3
My code:
from os import dirname
def parseFile(filename, envin, envout = {}):
    exec "from sys import path" in envin
    exec "path.append(\"" + dirname(filename) + "\")" in envin
    envin.pop("path")
    lines = open(filename, 'r').read()
    exec lines in envin
    returndict = {}
    for key in envout:
        returndict[key] = envin[key]
    return returndict
I get a "Syntax error: invalid syntax..." when I use my file name.
I used the file name input.txt.

I don't understand what you are trying to do, so I can't really explain how to fix it. In particular, why are you execing the lines of the file? And why write exec "foo" instead of just foo? I think you should go back to a basic Python tutorial...
Anyway, what you need to do is:
open the file using its full path
for line in file: process the line and store it in a dictionary
return the dictionary
That's it, no exec involved.

Yup, that's a whole lot of crap you either don't need or shouldn't do. Here's how I'd do it prior to Python 2.7 (after that, use collections.Counter as shown in the other answers). Mind you, this returns the dictionary containing the counts rather than printing it; you'd have to do the printing externally. I'd also prefer not to give a complete solution for homework, but it's already been done, so I suppose there's no real damage in explaining a bit about it.
def parseFile(filename):
    with open(filename, 'r') as fh:
        lines = fh.readlines()
    d = {}
    for country in [line.split(',')[1].strip() for line in lines]:
        d[country] = d.get(country, 0) + 1
    return d
Let's break that down a bit, shall we?
with open(filename, 'r') as fh:
    lines = fh.readlines()
This is how you'd normally open a text file for reading. It will raise an IOError exception if the file doesn't exist, you don't have permission to read it, or the like, so you'll want to catch that. readlines() reads the entire file and splits it into lines; each line becomes an element in a list.
d = {}
This simply initializes an empty dictionary.
for country in [line.split(',')[1].strip() for line in lines]:
Here is where the fun starts. The bracket-enclosed part on the right is called a list comprehension, and it basically generates a list for you. What it says, in plain English, is "for each element 'line' in the list 'lines', take that element/line, split it on each comma, take the second element (index 1) of the list you get from the split, strip off any whitespace from it, and use the result as an element in the new list".
Then the left part just iterates over the generated list, giving the name 'country' to the current element in the scope of the loop body.
d[country] = d.get(country, 0) + 1
OK, ponder for a second what would happen if, instead of the above line, we'd used the following:
d[country] = d[country] + 1
It'd crash with a KeyError exception, right? That's because d[country] doesn't have a value the first time around.
So we use the get() method, which all dictionaries have. Here's the nifty part: get() takes an optional second argument, which is what we want to get back if the element we're looking for doesn't exist. So instead of crashing, it returns 0, which (unlike None) we can add 1 to, and we update the dictionary with the new count. Then we just return the lot of it.
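To see the difference in isolation, here is a small Python 3 sketch; count_countries and the sample lines are illustrative, and the counting line is the same d.get(country, 0) + 1 pattern as above.

```python
def count_countries(lines):
    """Count how often each country appears in 'City, Country' lines."""
    counts = {}
    for line in lines:
        country = line.split(',')[1].strip()
        counts[country] = counts.get(country, 0) + 1  # 0 when the key is new
    return counts


sample = ["New York, US\n", "Angers, France\n", "Los Angeles, US\n"]
print(count_countries(sample))

# Plain indexing on a missing key raises KeyError instead of returning a default
try:
    {}["US"] + 1
except KeyError as exc:
    print("KeyError:", exc)
```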
Hope it helps.

I would use a defaultdict plus a list to maintain the structure of the information, so that additional statistics can be derived.
import collections
def parse_cities(filepath):
    countries_cities_map = collections.defaultdict(list)
    with open(filepath) as fd:
        for line in fd:
            values = line.strip().split(',')
            if len(values) == 2:
                # strip the space that follows the comma
                city, country = [value.strip() for value in values]
                countries_cities_map[country].append(city)
    return countries_cities_map
def format_cities_per_country(countries_cities_map):
    for country, cities in countries_cities_map.iteritems():
        print " {ncities} Cities found in {country} country".format(country=country, ncities=len(cities))
if __name__ == '__main__':
    import sys
    filepath = sys.argv[1]
    format_cities_per_country(parse_cities(filepath))

import collections
def readFile(fname):
    with open(fname) as inf:
        return [tuple(s.strip() for s in line.split(",")) for line in inf]
def countCountries(city_list):
    return collections.Counter(country for city, country in city_list)
def main():
    cities = readFile("input.txt")
    countries = countCountries(cities)
    print("{0} cities found in {1} countries:".format(len(cities), len(countries)))
    for country, num in countries.iteritems():
        print("{country}: {num}".format(country=country, num=num))
if __name__ == "__main__":
    main()
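The answers above target Python 2, where dictionaries have iteritems(). On Python 3 that method is gone, so a port might look like the sketch below. count_cities is a made-up name, the sample rows come from the question, and most_common() is used only to get the countries ordered by count.

```python
import collections


def count_cities(lines):
    """Return a Counter mapping each country to its number of cities."""
    pairs = (line.split(",") for line in lines if line.strip())
    return collections.Counter(country.strip() for _city, country in pairs)


sample = [
    "New York, US\n", "Angers, France\n", "Los Angeles, US\n",
    "Pau, France\n", "Dunkerque, France\n", "Mecca, Saudi Arabia\n",
]
for country, num in count_cities(sample).most_common():
    print("{0} : {1}".format(country, num))
```

This assumes every non-blank line has exactly one comma; a line with more or fewer would raise a ValueError when unpacked.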
