Modifying a txt file in Python 3 - python

I am working on a school project to make a video club management program and I need some help. Here is what I am trying to do:
I have a txt file with the client data, in which there is this:
clientId:clientFirstName:clientLastName:clientPhoneNumber
The : is the separator in every data file.
And in the movie title data file I got this:
movieid:movieKindFlag:MovieName:MovieAvalaible:MovieRented:CopieInTotal
In the rentedData file, each record should look like this:
idClient:IdMovie:DateOfReturn
I am able to do this part. Where I fail, due to lack of experience, is this:
I need to make a container with 3 levels for the movie data file, because I want to track the available and rented numbers (changing them when I rent a movie and when I return one).
The first level represents the whole file (calling it prints the whole file), the second level holds each line in its own container, and the third holds every word of the line in a container.
Here is an example of what I mean:
dataMovie = [[[movie id], [movie title], [MovieAvailable], [MovieRented], [CopieInTotal]], [[movie id], [movie title], [MovieAvailable], [MovieRented], [CopieInTotal]]]
I actually know that I can do this for two layers in this way:
DataMovie = []
MovieInfo = open('Data_Movie', 'r')

# Reading the file and putting it into a container
for ligne in MovieInfo:
    print(ligne, end='')
    words = ligne.split(":")
    DataMovie.append(words)

print(DataMovie)
MovieInfo.close()
It separates all the words into this:
[[MovieID],[MovieTitle],[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal], [MovieID],[MovieTitle],[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal]]
Each line ends up in the same container (the second layer), but the lines themselves are not separated. That is not very helpful, since I need to change specific information (the quantity available and the quantity rented) so I can refuse a rental when all of the copies are out.

I think you should be using dictionaries to store your data, rather than just embedding lists on top of one another.
Here is a quick page about dictionaries.
http://www.network-theory.co.uk/docs/pytut/Dictionaries.html
So your data might look like this:
movieDictionary = {"movie_id": 234234, "movie title": "Iron Man",
                   "MovieAvailable": True, "MovieRented": False, "CopieInTotal": 20}
Then when you want to retrieve a value.
movieDictionary["movie_id"]
would yield the value.
234234
you can also embed lists inside of a dictionary value.
Does this help answer your question?
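For instance, here is a rough sketch of loading the colon-separated movie file from your question into a list of such dictionaries (the file name is taken from your code; the key names here are just illustrative):

movies = []
with open('Data_Movie', 'r') as movie_file:
    for line in movie_file:
        fields = line.rstrip('\n').split(':')
        movies.append({
            'movie_id': int(fields[0]),
            'kind': fields[1],
            'title': fields[2],
            'available': int(fields[3]),
            'rented': int(fields[4]),
            'total': int(fields[5]),
        })

# e.g. only rent out a copy of the first movie if one is available
if movies[0]['available'] > 0:
    movies[0]['available'] -= 1
    movies[0]['rented'] += 1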

If you have to use a txt file, storing it in XML format might make the task easier, since there are already several good XML parsers for Python.
For example, using ElementTree, you could structure your data like this:
<?xml version="1.0"?>
<movies>
    <movie id="1">
        <type>movieKind</type>
        <name>firstmovie</name>
        <MovieAvalaible>True</MovieAvalaible>
        <MovieRented>False</MovieRented>
        <CopieInTotal>2</CopieInTotal>
    </movie>
    <movie id="2">
        <type>movieKind</type>
        <name>firstmovie2</name>
        <MovieAvalaible>True</MovieAvalaible>
        <MovieRented>False</MovieRented>
        <CopieInTotal>3</CopieInTotal>
    </movie>
</movies>
and then access and modify it like this:
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
search = root.findall('.//movie[@id="2"]')
for element in search:
    rented = element.find('MovieRented')
    rented.text = "False"
tree.write('data.xml')
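Building on that, if you store the availability numbers in the XML instead of True/False, a rental update could look something like this sketch (untested; the counter meaning of the two elements is an assumption):

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
movie = tree.getroot().find('.//movie[@id="2"]')
available = movie.find('MovieAvalaible')
rented = movie.find('MovieRented')

# only rent out a copy when one is still available
if int(available.text) > 0:
    available.text = str(int(available.text) - 1)
    rented.text = str(int(rented.text) + 1)
    tree.write('data.xml')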

What you are actually doing is creating three databases:
one for clients
one for movies
one for rentals
A relatively easy way to read text files with one record per line and a : separator is to create a csv.reader object. For storing the databases into your program I would recommend using lists of collections.namedtuple objects for the clients and the rentals.
import csv
from collections import namedtuple

Rental = namedtuple('Rental', ['client', 'movie', 'returndate'])

with open('rentals.txt', newline='') as rentalsfile:
    rentalsreader = csv.reader(rentalsfile, delimiter=':')
    rentals = [Rental(int(row[0]), int(row[1]), row[2]) for row in rentalsreader]
And a list of dictionaries for the movies:
with open('movies.txt', newline='') as moviesfile:
    moviesreader = csv.reader(moviesfile, delimiter=':')
    movies = [{'id': int(row[0]), 'kind': row[1], 'name': row[2],
               'available': int(row[3]), 'rented': int(row[4]), 'total': int(row[5])}
              for row in moviesreader]
The main reason for using a list of dictionaries for the movies is that a named tuple is a tuple and therefore immutable, and presumably you want to be able to change available and rented.
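For illustration, a small sketch of how a rental update could then look on that list of dictionaries (rent_movie is a hypothetical helper, not part of the original answer):

def rent_movie(movies, movie_id):
    """Mark one copy of the given movie as rented; return True on success."""
    for movie in movies:
        if movie['id'] == movie_id:
            if movie['available'] > 0:
                movie['available'] -= 1
                movie['rented'] += 1
                return True
            return False  # every copy is already out
    raise ValueError('unknown movie id: {}'.format(movie_id))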
Referring to your comment on Daniel Rasmuson's answer: since you only put the values of the fields in the text files, you will have to hardcode the names of the fields into your program one way or another.
An alternative solution is to store the data in json files. Those are easily mapped to Python data structures.
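For example, a minimal sketch of the movie data as JSON (movies.json and its field names are invented for illustration):

import json

# movies.json could contain:
# [{"id": 1, "kind": "action", "name": "Iron Man",
#   "available": 2, "rented": 0, "total": 2}]

with open('movies.json') as f:
    movies = json.load(f)  # loads straight into a list of dicts

movies[0]['rented'] += 1  # mutate in memory...

with open('movies.json', 'w') as f:
    json.dump(movies, f, indent=2)  # ...and write it back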

This might be what you were looking for:
# Using OrderedDict so we always get the items in the right order when iterating,
# so the values match up with the categories/headers.
from collections import OrderedDict as Odict

class DataContainer(object):
    def __init__(self, fileName):
        '''
        Load the text file into a list. The first line is assumed to be a header
        line and is used to set the dictionary keys. OrderedDict fixes the order
        of iteration, so the values match up with the headers again when written.
        '''
        self.file = fileName
        self.data = []
        with open(self.file, 'r') as content:
            self.header = next(content).rstrip('\n').split(':')
            for line in content:
                words = line.rstrip('\n').split(':')
                self.data.append(Odict(zip(self.header, words)))

    def __call__(self):
        '''Outputs the contents as a string that can be written back to the file'''
        lines = [':'.join(self.header)]
        for i in self.data:
            lines.append(':'.join(i.values()))
        return '\n'.join(lines)

    def __getitem__(self, index):
        '''Allows index access: self[index]'''
        return self.data[index]

    def __setitem__(self, index, value):
        '''Allows editing of values: self[index] = value'''
        self.data[index] = value

d = DataContainer('data.txt')
d[0]['MovieAvalaible'] = 'newValue'  # example of how to set the values

# Will print out a string with the contents
print(d())
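To persist an edit, you can write the rendered string back to the file:

with open('data.txt', 'w') as f:
    f.write(d())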

Related

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the line/readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
list.append(line)
if b in line:
1. Concatenate this line with each line after it, until i reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.<br>
2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions; on the other hand, it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
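For what it's worth, here is a rough sketch of that iterator-based variant (same file layout assumed; it only collects the full-text blocks, and leans on StopIteration to end the loop):

texts = []
with open('sample.txt', encoding="utf8") as f:
    lines = iter(f)
    try:
        line = next(lines)
        while True:
            if line.startswith("Full text:"):
                full_text = line
                line = next(lines)
                # keep absorbing body lines until the "Subject:" line
                while not line.startswith("Subject:"):
                    full_text += line
                    line = next(lines)
                texts.append(full_text)
            line = next(lines)
    except StopIteration:
        pass  # end of file reached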
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np

# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# split text on keys
chunks = re.split(regex, text)[1:]

# reshape the flat list of records to group key/value pairs and keep
# information on the same article together
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

Extracting data elements from large unstructured text files with Python

I am trying to extract data elements from large unstructured text files (1,000,000 to 15,000,000 lines per file) with no consistent delimiter. The order of the data elements is consistent.
Sample data:
NAME FIRSTNAME LASTNAME DATE-OF-BIRTH 01/01/2019 ID-NUMBER 123
ADDRESS-1 1234 FAKE STREET COUNTY-CODE 123
ADDRESS-2
CITY NOWHERE STATE OH ZIP 12345
RANDOM DATA .... 700+ LINES
NAME FIRSTNAME2 LASTNAME2 DATE-OF-BIRTH 01/01/2019 ID-NUMBER 4567
ADDRESS-1 123456 OTHER STREET COUNTY-CODE 45678
ADDRESS-2
CITY SOMEWHERE STATE MI ZIP 65432
RANDOM DATA .... 700+ LINES
I'm looking for a way to create a CSV output with the values of a few fields listed below:
NAME, COUNTY-CODE, ZIP
FIRSTNAME LASTNAME, 123, 12345
FIRSTNAME2 LASTNAME2, 45678, 65432
The data is NOT tab delimited and spacing will vary. Any help would be greatly appreciated!
Hmm...
I assume that you have a bunch of lines, each containing ID VALUE pairs, and that each chunk starts with the id NAME.
So I would use the re module to search for the expected patterns, with an occurrence of NAME starting a new element. As real first and last names can be more than one word (John Fitzgerald Kennedy), I would take the NAME to be everything between NAME and DATE-OF-BIRTH.
IMHO a simple way is to build a dict while parsing the lines, and write the dict out with a DictWriter whenever a new NAME is reached, and again at end of file. I would only keep the first occurrence of each keyword if more than one is found, but you could also raise an exception.
The code could be:
import re
import csv

# prepare the patterns to search for
name = re.compile(r"NAME\s+(.*)\s+DATE")
zip_code = re.compile(r"ZIP\s*([0-9]+)")
county_code = re.compile(r"COUNTY-CODE\s*([0-9]+)")

with open("input.txt") as fdin, open("output.csv", "w", newline='') as fdout:
    wr = csv.DictWriter(fdout, fieldnames=['NAME', 'COUNTY-CODE', 'ZIP'])
    elt = {}
    wr.writeheader()
    for line in fdin:
        # process NAME
        mx = name.search(line)
        if mx:
            if elt:  # write previous dict if any
                wr.writerow(elt)
            elt = {'NAME': mx.group(1).strip()}  # initialize a new dict
        # process other keywords
        if 'COUNTY-CODE' not in elt:  # only keep the first one
            mx = county_code.search(line)
            if mx:
                elt['COUNTY-CODE'] = mx.group(1).strip()  # update the dict with it
        if 'ZIP' not in elt:
            mx = zip_code.search(line)
            if mx:
                elt['ZIP'] = mx.group(1)
    wr.writerow(elt)  # don't forget the last dict
The problem is very similar to the one found in another SO question.
The solution is to construct a partial grammar that parses the structure of the recognized construct while skipping what can't be recognized.
In your case, using textX, that would be something along these lines (I haven't tested it, but you get the picture):
from textx import metamodel_from_str

mm = metamodel_from_str(r'''
File:
    ( /(?s:.*?(?=NAME))/ persons*=Person | 'NAME' )*
    /(?s).*/;
Person:
    'NAME' first_name=Name last_name=Name birth_date=Date
    'ADDRESS-1' address_1=UntilEOL
    'ADDRESS-2' address_2=UntilEOL
    'CITY' city=UntilEOL
;
Name: /\w+/;
Date: /\d{4}-\d{2}-\d{2}/;
UntilEOL[noskipws]: /.*?\n/;
''')

data_model = mm.model_from_file('some_input_file.txt')
# Here data_model is an object with attribute `persons`,
# where each person has attributes `first_name`, `last_name`, ...
# from the `Person` rule above.
Note: this solution assumes that the start of the structural part is marked by the keyword NAME, but it is OK for the keyword to appear in the random data, as on an invalid parse of the Person rule the parser will consume the word NAME and continue.
Depending on your actual data you will have to tune the grammar a bit (e.g. the particular regexes; the Date rule above expects YYYY-MM-DD, while your sample shows 01/01/2019).

Can't get unique word/phrase counter to work - Python

I'm having trouble getting anything to write to my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter

# path to folder containing all the files
str_dir_folder = '../data'

# name and location of output file
str_output_file = '../data/word_count.txt'

# the list where all the words will be placed
list_file_data = '../data/phrases.txt'

# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):

        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()

        # add data to list
        list_file_data.append(str_file_data)

# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)

# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)

# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)

# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)
    # populates output list
    list_output_data.append(str_out_line)

# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
It would be helpful if we got more information, such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
list_file_data is assigned '../data/phrases.txt', but there is then a loop through all files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
You named your file 'phrases.txt' but then check for files that end with 'data'. I've removed this logic.
You've placed the data set into a list, but findall works just fine with strings and ignores the special characters that you've manually removed. Test at https://regex101.com/ to make sure.
Changed 'w+' to '\w+' - check out the above link.
Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object, which has an 'iteritems' method to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of manually generating CSVs, I've imported csv and utilized the writerow method (and quoting options).
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall

# name and location of output file
str_output_file = '../data/word_count.txt'

# the input file with all the phrases
list_file_data = '../data/phrases.txt'

if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))

with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()

# find all the words and put them into a list
list_all_words = findall('\w+', str_file_data)

# dictionary with all the times a word has been used
counter_word_count = Counter(list_all_words)

with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
Something like this?
from collections import Counter
from glob import glob

def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()

tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())

for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]

how to put data from a text file into array in python

I am trying to put data from a text file into an array. Below is the array I am trying to create:
[("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)]
But instead, when I use the text file and load the data from it, I get the output below. It should look like the above; I realise I have to split it, but I don't really know how for this sort of array. Could anyone help me with this?
['("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)']
Below is the text file I am trying to load:
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
And this is how I'm loading it:
file = open("slide.txt", "r")
scale = [file.readline()]
If you mean a list instead of an array:
with open(filename) as f:
    list_name = f.readlines()
Some questions come to mind about what the rest of your implementation looks like and how you intend it all to work, but below is an example of how this could be done in a pretty straightforward way:
class W(object):
    pass

class S(object):
    pass

class WS(W, S):
    pass

class R(object):
    pass

def main():
    # separate parts that should become tuples eventually
    text = str()
    with open("data", "r") as fh:
        text = fh.read()
    parts = text.split("),")

    # remove unwanted characters and whitespace
    cleaned = list()
    for part in parts:
        part = part.replace('(', '')
        part = part.replace(')', '')
        cleaned.append(part.strip())

    # convert text parts into tuples with actual data types
    list_of_tuples = list()
    for part in cleaned:
        t = construct_tuple(part)
        list_of_tuples.append(t)

    # now use the data for something
    print list_of_tuples

def construct_tuple(data):
    t = tuple()
    content = data.split(',')
    for item in content:
        t = t + (get_type(item),)
    return t

# there needs to be some way to decide what type/object should be used:
def get_type(id):
    type_mapping = {
        '"harmonic minor"': 'harmonic minor',
        '"major"': 'major',
        '"relative minor"': 'relative minor',
        's': S(),
        'w': W(),
        'w+s': WS(),
        'r': R()
    }
    return type_mapping.get(id)

if __name__ == "__main__":
    main()
This code makes some assumptions:
there is a file data with the content:
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
you want a list of tuples which contains the values.
It's acceptable to have w+s represented by some data type, as it would be difficult to have something like w+s appear inside a tuple without it being evaluated when the tuple is created. Another way to do it would be to have w and s represented by data types that can be used with +.
So even if this works, it might be a good idea to think about the format of the text file (if you have control of that) and see if it can be changed into something that would let you use a parsing library in a simple way, e.g. see how it could be more easily represented as csv, or even turn it into json.
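For instance, a hypothetical JSON version of the same data (slide.json is an invented file name) would let you load it without any manual splitting:

import json

# slide.json could contain:
# [["major", "r", "w", "w", "s", "w", "w", "w", "s"],
#  ["relative minor", "r", "w", "s", "w", "w", "s", "w", "w"],
#  ["harmonic minor", "r", "w", "s", "w", "w", "s", "w+s", "s"]]

with open('slide.json') as f:
    scales = [tuple(scale) for scale in json.load(f)]

Here the notes stay plain strings; you could still map them to objects with something like the get_type function above.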

Writing multiple strings into csv from python

I have written a Python web scraper and would like to output the data strings I have gotten into a csv/excel file. So far, I have a for loop that accesses multiple database websites and stores the data in a string. I would like to write out these strings each time I complete the web scraping, before moving on to the next page.
Someone suggested creating a whole repository of them, or a dictionary, and then referencing it. I tried implementing it, but my code instead returns the data in one cell instead of spanning multiple cells, because I have a header at the top that separates the data into my desired attributes.
import csv

Substances = []
Whole_list = []

f = open(filename)  # chemtest.txt
for sub in f:
    Substances.append(sub)
    print sub

for substance in Substances:
    # some logic
    names1 = [data]
    Whole_list.append(names1)

with open('chemtest.csv', 'wb') as myfile:  # creates new chemtest.csv
    wr = csv.writer(myfile)
    wr.writerow(Whole_list)
So far I'm running through 2 websites as a test and my outputs are:
names1 = ['Acetaldehyde', 'Acetaldehyde', '75-07-0', 'GO1N1ZPR3B', 'CC=O']
Whole_list = [['Acetaldehyde', 'Acetaldehyde', '75-07-0', 'GO1N1ZPR3B', 'CC=O']]
names1 = ['Acetone', 'Acetone', '67-64-1', '1364PS73AF', '=O']
Whole_list = [['Acetaldehyde', 'Acetaldehyde', '75-07-0', 'GO1N1ZPR3B', 'CC=O'], ['Acetone', 'Acetone', '67-64-1', '1364PS73AF', '=O']]
What is wrong with my method exactly and how can I improve it?
Use writerows (note the s at the end). writerow is for writing one line at a time.
wr.writerows(Whole_list)
As a side note, capitalized variable names are usually reserved to classes, so prefer whole_list.
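A minimal sketch of the difference, with made-up rows (Python 3 style file handling):

import csv

rows = [['Acetaldehyde', '75-07-0'], ['Acetone', '67-64-1']]

with open('chemtest.csv', 'w', newline='') as f:
    wr = csv.writer(f)
    wr.writerow(['name', 'cas'])  # writerow: a single row (the header)
    wr.writerows(rows)            # writerows: every row in the list at once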
