I am reading lines from a CSV file and trying to create class instances from each row. I am having problems passing the correct number of parameters: the string from the CSV is not being split up in a way that Python recognizes as separate arguments, so it says that I have only given 2 parameters (self, and one whole row).
I tried using split() and strip(), but can't get it to work.
Any help would be appreciated. This has taken me way too long to figure out.
Here is an example of what the rows look like.
Current input:
whiskers,pugsley,Raja,Panders,Smokey
kitty,woofs,Tigger,Yin,Baluga
Current code:
import sys

class Animals:
    def__init__(self,cats,dogs,tiger,panda,bear)
        self.cats = cats
        self.dogs = dogs
        self.tiger = tiger
        self.panda = panda
        self.bear = bear

csv = open(file, 'r')
rowList = csv.readlines()

for row in rowList:
    animalList = Animals(row.split(',')) # Fails here...
    animals = []
    animals = animals.append(animalsList) # Want to add to list

print animals
You seem to have a couple of syntax errors; I went ahead and corrected those. Next, I used the csv module to split up the rows properly. Finally, when you're passing a list to a function and you want its items treated as separate arguments, you should use the * operator.
import sys, csv

file = "file.txt" # added for testing purposes

class Animals:
    def __init__(self, cats, dogs, tiger, panda, bear):
        self.cats = cats
        self.dogs = dogs
        self.tiger = tiger
        self.panda = panda
        self.bear = bear

reader = csv.reader(open(file, 'r'))  # avoid calling this 'csv', which would shadow the module

animalsList = []
for row in reader:
    animalClass = Animals(*row)       # * unpacks the row into five separate arguments
    animalsList.append(animalClass)   # add the new instance to the list
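To make the * operator concrete, here is a minimal illustration with made-up values (not tied to the file above):

row = ['whiskers', 'pugsley', 'Raja', 'Panders', 'Smokey']

# These two calls are equivalent: * unpacks the list into positional arguments.
a1 = Animals(row[0], row[1], row[2], row[3], row[4])
a2 = Animals(*row)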
I have a data set with a column containing Google Drive links to resumes. I have 5000 rows, so there are 5000 links, and I am trying to extract information like years of experience and salary from these resumes into 2 separate columns. So far I've seen many examples mentioned here on SO.
For example: the code mentioned below can only read the data from one file. How do I replicate this for multiple rows?
Please help me with this, else I will have to manually go through 5000 resumes and fill in the data.
Hoping that I'll get a solution for this painful problem that I have.
import re
import PyPDF2

pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print page_content.encode('utf-8')

# to extract salary, experience using regular expressions
prog = re.compile(r"\s*(Name|name|nick).*")
result = prog.match("Name: Bob Exampleson")
if result:
    print result.group(0)

result = prog.match("University: MIT")
if result:
    print result.group(0)
Use a loop. Basically you put your main code into a function (easier to read) and create a list of filenames. Then you iterate over this list, using the values from the list as the argument for your function:
Note: I didn't check your scraping code, just showing how to loop. There are also way more efficient ways to do this, but I'm assuming you're somewhat of a Python beginner, so let's keep it simple to start with.
# add your imports to the top
import re
import PyPDF2

# put the main code in a function for readability
# (note: the function must be defined before the loop that calls it)
def get_data(filename):
    pdf_file = open(filename, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content.encode('utf-8'))

    prog = re.compile(r"\s*(Name|name|nick).*")
    result = prog.match("Name: Bob Exampleson")
    if result:
        print(result.group(0))

    result = prog.match("University: MIT")
    if result:
        print(result.group(0))

# create a list of your filenames
files_list = ['a.pdf', 'b.pdf', 'c.pdf']

for filename in files_list:  # iterate over the list
    get_data(filename)
So now your next question might be, how do I create this list with 5000 filenames? This depends on what the files are called and where they are stored. If they are sequential, you could do something like:
files_list = []  # empty list
num_files = 5000  # total number of files

for i in range(1, num_files+1):
    files_list.append(f'myfile-{i}.pdf')
This will create a list with 'myfile-1.pdf', 'myfile-2.pdf', etc.
Hopefully this is enough to get you started.
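If the PDFs already sit in a folder rather than following a sequential naming scheme, the standard library's glob module can build the list for you. A minimal sketch (the resumes/ directory is an assumption for illustration):

from glob import glob

# Collect every .pdf in a (hypothetical) resumes/ folder, in sorted order.
files_list = sorted(glob('resumes/*.pdf'))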
You can also use return in your function to build a new list with all of the output, which you can use later on, instead of printing the output as you go:
output = []

def doSomething(i):
    return i * 2

for i in range(1, 100):
    output.append(doSomething(i))
# output is now a list with values like:
# [2, 4, 6, 8, 10, 12, ...]
I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the line and readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'

with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'

with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the
            #    line that includes "Subject:". Ignore the "Subject:" line, stop the
            #    "Full text:" subloop, add the concatenated full text to the list.
            # 2. Continue the for loop within which all of this sits.
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the built-in list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - though it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge cases.
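Since the stated end goal is a DataFrame, the three lists collected by the loop above can be assembled into one directly. A minimal sketch, assuming pandas is installed (the raw lines still carry their "Title:" prefixes and trailing newlines, so strip those as needed):

import pandas as pd

# titles, texts, and subjects are the lists filled by the state-machine loop above.
df = pd.DataFrame({'Title': titles, 'Full text': texts, 'Subject': subjects})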
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np

# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# split the text on the keys; the capture group keeps them in the result
chunks = re.split(regex, text)[1:]

# reshape the flat list of alternating keys/values into one group of
# (key, value) pairs per article, then build one dict per article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
I'm having trouble getting anything to write in my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document, and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter

# path to folder containing all the files
str_dir_folder = '../data'

# name and location of output file
str_output_file = '../data/word_count.txt'

# the list where all the words will be placed
list_file_data = '../data/phrases.txt'

# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):

        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()

        # add data to list
        list_file_data.append(str_file_data)

# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)

# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)

# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)

# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)

    # populates output list
    list_output_data.append(str_out_line)

# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
It would be helpful if we got more information, such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
list_file_data is assigned to '../data/phrases.txt', but there is then a loop through all the files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
You named your file 'phrases.txt' but then check whether the files end with 'data'. I've removed this logic.
You've placed the data set into a list, when findall works just fine with strings and ignores the special characters that you've manually removed. Test at https://regex101.com/ to make sure.
Changed 'w+' to '\w+' - check out the above link.
Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object, which has an iteritems method to roll through each key and value. I also changed the variable name to counter_word_count to be slightly more accurate.
Instead of manually generating CSVs, I've imported csv and utilized the writerow method (and quoting options).
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall

# name and location of output file
str_output_file = '../data/word_count.txt'

# the file containing the phrases
list_file_data = '../data/phrases.txt'

if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))

with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()

# find all the words and put them into a list
list_all_words = findall('\w+', str_file_data)

# dictionary with all the times a word has been used
counter_word_count = Counter(list_all_words)

with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
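One caveat worth flagging: iteritems only exists on Python 2. If you are running Python 3, the equivalent loop (inside the same with block) uses items() instead:

# Python 3: Counter has no iteritems(); use items() instead.
for key, value in counter_word_count.items():
    writer.writerow([key, value])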
Something like this?
from collections import Counter
from glob import glob

def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()

tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())

for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]
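For what it's worth, this works because Counter supports addition: sum() starts from the empty Counter() and folds each per-line Counter into a running total. A minimal illustration of that behavior:

from collections import Counter

# Adding Counters merges them, summing the counts of shared keys.
total = Counter({'a': 1, 'b': 2}) + Counter({'b': 3})
print(total)  # Counter({'b': 5, 'a': 1})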
I am trying to measure the length of vectors based on a value of the first column of my input data.
For instance: my input data is as follows:
dog nmod+n+-n 4
dog nmod+n+n-a-commitment-n 6
child into+ns-j+vn-pass-rb-divide-v 3
child nmod+n+ns-commitment-n 5
child nmod+n+n-pledge-n 3
hello nmod+n+ns 2
The value that I want to calculate is based on an identical value in the first column. For instance, I would calculate a value based on all rows in which dog is in the first column, then I would calculate a value based on all rows in which child is in the first column... and so on.
I have worked out the mathematics to calculate the vector length (Euclidean norm). However, I am unsure how to base the calculation on grouping the identical values in the first column.
So far, this is the code that I have written:
#!/usr/bin/python
import os
import sys
import getopt
import datetime
import math

print "starting:",
print datetime.datetime.now()

def countVectorLength(infile, outfile):
    with open(infile, 'rb') as inputfile:
        flem, _, fw = next(inputfile).split()
        current_lem = flem
        weights = [float(fw)]
        for line in inputfile:
            lem, _, w = line.split()
            if lem == current_lem:
                weights.append(float(w))
            else:
                print current_lem,
                print math.sqrt(sum([math.pow(weight,2) for weight in weights]))
                current_lem = lem
                weights = [float(w)]
        print current_lem,
        print math.sqrt(sum([math.pow(weight,2) for weight in weights]))

print "Finish:",
print datetime.datetime.now()

path = '/Path/to/Input/'
pathout = '/Path/to/Output'
listing = os.listdir(path)
for infile in listing:
    outfile = 'output' + infile
    print "current file is:" + infile
    countVectorLength(path + infile, pathout + outfile)
This code outputs the length of vector of each individual lemma. The above data gives me the following output:
dog 7.211102550927978
child 6.48074069840786
hello 2
UPDATE
I have been working on it and have managed to get the working code shown in the sample above. However, as you will be able to see, the code has a problem with the output of the very last line of each file, which I have solved rather rudimentarily by manually printing it after the loop. Because of this problem, it does not permit a clean iteration through the directory: it outputs the results of all files into one appended document. Is there a cleaner, more Pythonic way to output each individual corresponding file directly into the outpath directory?
First thing, you need to transform the input into something like
dog => [4,6]
child => [3,5,3]
etc
It goes like this:
from collections import defaultdict

data = defaultdict(list)
for line in file:
    line = line.split('\t')
    data[line[0]].append(float(line[2]))  # convert the weight to a number right away
Once this is done, the rest is obvious:
import math

def vector_len(vec):
    # you already got that; for completeness, the Euclidean norm:
    return math.sqrt(sum(w ** 2 for w in vec))

vector_lens = {name: vector_len(values) for name, values in data.items()}
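The update in the question also asks for a cleaner way to write one output file per input file instead of printing everything together. A minimal sketch building on the pieces above, assuming path and pathout are the directories from the question and that the columns are tab-separated (os.path.join also sidesteps the missing trailing slash on pathout):

import os
from collections import defaultdict

def write_vector_lengths(inpath, outpath):
    # Group the weights per lemma for one file, then write one line per lemma.
    data = defaultdict(list)
    with open(inpath) as f:
        for line in f:
            fields = line.split('\t')
            data[fields[0]].append(float(fields[2]))
    with open(outpath, 'w') as out:
        for name, values in data.items():
            out.write('%s %s\n' % (name, vector_len(values)))

for infile in os.listdir(path):
    write_vector_lengths(os.path.join(path, infile),
                         os.path.join(pathout, 'output' + infile))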
I am working on a school project to make a video club management program and I need some help. Here is what I am trying to do:
I have a txt file with the client data, in which there is this:
clientId:clientFirstName:clientLastName:clientPhoneNumber
The : is the separator for any file in data.
And in the movie title data file I got this:
movieid:movieKindFlag:MovieName:MovieAvalaible:MovieRented:CopieInTotal
where it is going is that in the rentedData file there should be that:
idClient:IdMovie:DateOfReturn
I am able to do this part. Where I fail, due to lack of experience:
I need to make a container with 3 levels for the movie data file, because I want to track the available and rented numbers (changing them when I rent a movie and when I return one).
The first level represents the whole file (calling it will print the whole file), the second level should hold each line in a container, and the third holds every word of the line in a container.
Here is an example of what I mean:
dataMovie = [[[movie id], [movie title], [MovieAvailable], [MovieRented], [CopieInTotal]], [[movie id], [movie title], [MovieAvailable], [MovieRented], [CopieInTotal]]]
I actually know that I can do this for two layers, in this way:
DataMovie = []
MovieInfo = open('Data_Movie', 'r')

# Reading the file and putting it into a container
for ligne in MovieInfo:
    print(ligne, end='')
    words = ligne.split(":")
    DataMovie.append(words)

print(DataMovie)
MovieInfo.close()
It separates all the words into this:
[[MovieID],[MovieTitle],[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal], [MovieID],[MovieTitle],[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal]]
Each line is in the same container (second layer), but the lines are not separated, which is not very helpful since I need to change specific information about the quantity available and the quantity rented, so as not to rent the movie if all of the copies are rented.
I think you should be using dictionaries to store your data, rather than just embedding lists inside one another.
Here is a quick page about dictionaries.
http://www.network-theory.co.uk/docs/pytut/Dictionaries.html
So your data might look like
movieDictionary = {"movie_id": 234234, "movie title": "Iron Man",
                   "MovieAvailable": True, "MovieRented": False, "CopieInTotal": 20}
Then when you want to retrieve a value.
movieDictionary["movie_id"]
would yield the value.
234234
You can also embed lists inside of a dictionary value.
Does this help answer your question?
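To connect this to the rental tracking described in the question: if the movie dictionaries are themselves kept in a dictionary keyed by id, renting and returning become a couple of lookups. A minimal sketch (the field names follow the question; the exact structure is an assumption):

movies = {234234: {"name": "Iron Man", "MovieRented": 0, "CopieInTotal": 20}}

def rent_movie(movie_id):
    movie = movies[movie_id]
    # Only rent if at least one copy is still on the shelf.
    if movie["MovieRented"] < movie["CopieInTotal"]:
        movie["MovieRented"] += 1
    else:
        print("All copies are currently rented.")

def return_movie(movie_id):
    movies[movie_id]["MovieRented"] -= 1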
If you have to use a txt file, storing it in XML format might make the task easier, since there are already several good XML parsers for Python.
For example, ElementTree:
You could structure your data like this:
<?xml version="1.0"?>
<movies>
    <movie id="1">
        <type>movieKind</type>
        <name>firstmovie</name>
        <MovieAvalaible>True</MovieAvalaible>
        <MovieRented>False</MovieRented>
        <CopieInTotal>2</CopieInTotal>
    </movie>
    <movie id="2">
        <type>movieKind</type>
        <name>firstmovie2</name>
        <MovieAvalaible>True</MovieAvalaible>
        <MovieRented>False</MovieRented>
        <CopieInTotal>3</CopieInTotal>
    </movie>
</movies>
and then access and modify it like this:
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()

# note: XPath attribute predicates use @, i.e. [@id="2"]
search = root.findall('.//movie[@id="2"]')
for element in search:
    rented = element.find('MovieRented')
    rented.text = "False"
tree.write('data.xml')
What you are actually doing is creating three databases:
one for clients
one for movies
one for rentals
A relatively easy way to read text files with one record per line and a : separator is to create a csv.reader object. For storing the databases into your program I would recommend using lists of collections.namedtuple objects for the clients and the rentals.
import csv
from collections import namedtuple

Rental = namedtuple('Rental', ['client', 'movie', 'returndate'])

with open('rentals.txt', newline='') as rentalsfile:
    rentalsreader = csv.reader(rentalsfile, delimiter=':')
    rentals = [Rental(int(row[0]), int(row[1]), row[2]) for row in rentalsreader]
And a list of dictionaries for the movies:
with open('movies.txt', newline='') as moviesfile:
    moviesreader = csv.reader(moviesfile, delimiter=':')
    movies = [{'id': int(row[0]), 'kind': row[1], 'name': row[2],
               'rented': int(row[3]), 'total': int(row[4])} for row in moviesreader]
The main reason for using a list of dictionaries for the movies is that a named tuple is a tuple and therefore immutable, and presumably you want to be able to change rented.
Referring to your comment on Daniel Rasmuson's answer: since you only put the values of the fields in the text files, you will have to hardcode the names of the fields into your program one way or another.
An alternative solution is to store the data in json files. Those are easily mapped to Python data structures.
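To illustrate that last point, here is a minimal sketch of round-tripping one movie record through JSON (the file name and record structure are assumptions for illustration):

import json

movie = {'id': 1, 'kind': 'movieKind', 'name': 'firstmovie', 'rented': 0, 'total': 2}

# Write the record out...
with open('movies.json', 'w') as f:
    json.dump([movie], f)

# ...and read it straight back into Python data structures.
with open('movies.json') as f:
    movies = json.load(f)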
This might be what you were looking for:
# Using OrderedDict so we always get the items in the right order when iterating,
# so the values match up with the categories/headers
from collections import OrderedDict as Odict

class DataContainer(object):
    def __init__(self, fileName):
        '''
        Loads the text file into a list. The first line is assumed to be a header
        line and is used to set the dictionary keys. Using OrderedDict fixes the
        order of iteration for the dicts, so values match up with the headers
        again when written back out.
        '''
        self.file = fileName
        self.data = []
        with open(self.file, 'r') as content:
            self.header = content.next().split('\n')[0].split(':')
            for line in content:
                words = line.split('\n')[0].split(':')
                self.data.append(Odict(zip(self.header, words)))

    def __call__(self):
        '''Outputs the contents as a string that can be written back to the file'''
        lines = []
        lines.append(':'.join(self.header))
        for i in self.data:
            this_line = ':'.join(i.values())
            lines.append(this_line)
        newContent = '\n'.join(lines)
        return newContent

    def __getitem__(self, index):
        '''Allows index access: self[index]'''
        return self.data[index]

    def __setitem__(self, index, value):
        '''Allows editing of values: self[index] = value'''
        self.data[index] = value

d = DataContainer('data.txt')
d[0]['MovieAvalaible'] = 'newValue'  # Example of how to set the values

# Will print out a string with the contents
print d()
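Since calling the instance returns the serialized contents, persisting an edit is one more step. A small usage sketch (Python 2, to match the code above):

# Write the modified contents back to the file.
with open('data.txt', 'w') as f:
    f.write(d())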