Splitting a list a specific amount of times - python

I have an input file containing first names, last names and GPAs of students. I'm having a bit of trouble, as this is the first time I'm working with splitting text files. I want to split each line of the text file, store it in the Temp list, then put the first and last names in Names and the GPAs in Scores. The code below splits the file and puts the result in Temp, but unfortunately it splits into first name, last name and GPA. Is there a way to have it split into full name and GPA instead of first name, last name and GPA? This is my output:
(screenshot of the output omitted)
This is what I came up with so far:
def main():
    try:
        inputFile = open("input.txt", "r")
        outputFile = open("output.txt", "w")
        Names = []
        Scores = []
        Temp = []
        for i in inputFile:
            splitlist = i.split()
            Temp.append(splitlist)
        print(Temp)
    except:
        print(" ")

main()

You can use rsplit and specify a maximum number of splits:
splitlist = i.rsplit(maxsplit=1)
Except for splitting from the right, rsplit() behaves like split().
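For example, with a hypothetical input line, maxsplit=1 keeps the full name together:

```python
# A hypothetical input line: first name, last name, GPA
line = "John Smith 3.8"

# rsplit with maxsplit=1 splits only at the last whitespace,
# so the full name stays in one piece.
name, gpa = line.rsplit(maxsplit=1)

print(name)  # John Smith
print(gpa)   # 3.8
```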

So, basically, there are two ways to achieve your desired result.
1. Splitting at the last occurrence (easy, recommended method)
In Python, strings can be split in two ways:
"Normal" splitting (str#split)
"Reverse" splitting (str#rsplit), which is like split but works from the end of the string
Therefore, you can do something like this:
splitlist = i.rsplit(maxsplit=1)
2. Overengineered method (not recommended, but fun)
You can reverse your string, use split with a maxsplit of 1, reverse the resulting list, AND reverse every entry in the list.
So:
splitlist = [j[::-1] for j in i[::-1].split(maxsplit=1)][::-1]
;)

Related

Why am I getting an IndexError from a for loop?

I'm writing code that will take dates and numeric values from a csv file and compare them.
date_location = 3
numeric_location = 4

with open('file1.csv', 'r') as f1:
    next(f1)
    with open('file2.csv', 'r') as f2:
        next(f2)
        for i in f1:
            f1_date = (i.split()[date_location])
            f1_number = (i.split()[numeric_location])
            for j in f2:
                f2_date = (j.split()[date_location])
                f2_number = (j.split()[numeric_location])
                print(f1_date, f1_number)
                print(f2_date, f2_number)
                if f1_date == f2_date:
                    print(f1_date == f2_date)
                    if f2_number > f1_number:
                        print('WIN')
                        continue
                    elif f2_number <= f1_number:
                        print('lose')
            f2.seek(0, 0)
I get the error IndexError: list index out of range for f1_date = (i.split()[date_location]), which I assume will also affect:
f1_number = (i.split()[numeric_location])
f2_date = (j.split()[date_location])
f2_number = (j.split()[numeric_location])
Can anyone explain why? I haven't found a way to make it so this error doesn't show.
EDIT: I forgot to change the separator for .split() after messing around with the for loop using text files.
Two main possibilities:
1) Your csv files are not space delimited, and since .split() with no argument splits on whitespace, you will not have at least 4 whitespace-separated items in i.split() (or 5 for numeric_location).
2) Your csv is space delimited, but is ragged, i.e. it has incomplete rows, so for some row there is no data for column 4.
I also highly suggest using a library for reading CSVs. csv is in the standard library, and pandas has built-in handling of ragged lines.
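As a sketch of the csv approach (the sample rows and column layout here are made up; io.StringIO stands in for the real file):

```python
import csv
import io

date_location = 3
numeric_location = 4

# Stand-in for file1.csv: a header row plus two hypothetical data rows.
f1 = io.StringIO(
    "name,team,pos,date,score\n"
    "a,x,g,2020-01-01,10\n"
    "b,y,f,2020-01-02,12\n"
)

reader = csv.reader(f1)  # csv handles the comma delimiter for you
next(reader)             # skip the header row
rows = []
for row in reader:
    # Guard against ragged rows before indexing
    if len(row) > numeric_location:
        rows.append((row[date_location], float(row[numeric_location])))

print(rows)  # [('2020-01-01', 10.0), ('2020-01-02', 12.0)]
```

With a real file you would use `with open('file1.csv', newline='') as f1:` in place of the StringIO object.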
f1_number = (i.split()[numeric_location])
This is doing a lot in a single line. I suggest you split this into two lines:
f1 = i.split()
f1_number = f1[numeric_location]
f1_date = f1[date_location]
Now you will see which of these causes the problem. You should add a print(f1) to see the value after the split. Most likely it doesn't have as many elements as you think it does. Or your indexes are off from what they should be.
The call to i.split() is going to generate a new list, which will contain each word from the string i. So
"this is an example".split() == ["this", "is", "an", "example"]
You are trying to access the element at index 3 of the resulting list, and the index error tells you that this list has fewer than four members. I suggest printing the result of i.split(). Very likely this is either an off-by-one error, or the first line of your file contains something different from what you are expecting.
Also, split() by default splits on whitespace; given that you have a csv, you may have wanted to do split(',').
The error is happening because in your case i.split() returns only one element, but date_location is equal to 3.
You need to pass a separator matching your csv file to the str.split method.
You can read more about it here
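For example, with a hypothetical comma-delimited line, the default whitespace split leaves everything in one element:

```python
line = "2020-01-01,team-a,2020-01-01,42"

# Default split() splits on whitespace; the line has none, so one element.
print(line.split())     # ['2020-01-01,team-a,2020-01-01,42']

# Passing the comma separator yields the individual columns.
print(line.split(','))  # ['2020-01-01', 'team-a', '2020-01-01', '42']
```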

Python: Appending string constructed out of multiple lines to list

I'm trying to parse a txt file and put sentences in a list that fit my criteria.
The text file consists of several thousand lines, and I'm looking for lines that start with a specific string; let's call this string 'start'.
The lines in this text file can belong together and are separated with \n at random.
This means I have to look for any string that starts with 'start', put it in an empty string 'complete', and then continue scanning each line after that to see if it also starts with 'start'.
If not, then I need to append it to 'complete', because then it is part of the entire sentence. If it does, I need to append 'complete' to a list, create a new, empty 'complete' string and start appending to that one. This way I can loop through the entire text file without paying attention to the number of lines a sentence consists of.
My code thus far:
import sys, string

lines_1 = []
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''

with open(sys.argv[1]) as f:
    data = f.read()
    for line in data:
        if line.lower().startswith(startswith):
            completeline = line
        else:
            completeline += line
    lines_1.append(completeline)

# check some stuff in output
for l in lines_1:
    print "______"
    print l
print len(lines_1)
However, this puts the entire content in one item in the list, where I'd like everything to be separated.
Keep in mind that one sentence can span one, two, 10 or 1000 lines, so the code needs to spot the next startswith value, append the existing completeline to the list and then fill completeline up with the next sentence.
Much obliged!
Two issues:
Iterating over a string, not lines:
When you iterate over a string, the value yielded is a character, not a line. This means for line in data: is going character by character through the string. Split your input by newlines, returning a list, which you then iterate over. e.g. for line in data.split('\n'):
Overwriting the completeline inside the loop
You append a completed line at the end of the loop, but not when you start recording a new line inside the loop. Change the if in the loop to something like this:
if line.lower().startswith(startswith):
    if completeline:
        lines_1.append(completeline)
    completeline = line
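Putting both fixes together, a minimal sketch (the sample data is made up; continuation lines are joined with a space here, unlike the original += which concatenates directly, and a final append after the loop keeps the last sentence from being lost):

```python
lines_1 = []
startswith = ('keys', 'values', 'files', 'folders', 'total')
completeline = ''

# Stand-in for the file contents read by f.read()
data = "keys one\nmore text\nvalues two"

for line in data.split('\n'):
    if line.lower().startswith(startswith):
        if completeline:
            lines_1.append(completeline)  # finish the previous sentence
        completeline = line               # start a new sentence
    else:
        completeline += ' ' + line        # continuation line

if completeline:  # don't forget the last sentence
    lines_1.append(completeline)

print(lines_1)  # ['keys one more text', 'values two']
```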
For a task like this
"I'm trying to parse a txt file and put sentences in a list that fit my criteria"
I usually prefer using a dictionary, for example:
from collections import defaultdict

def satisfiesCriteria(criteria, sentence):
    if sentence.lower().startswith(criteria):
        return True
    return False

seperatedItems = defaultdict(list)
for sentence in fileDataAsAList:
    if satisfiesCriteria("start", sentence):
        seperatedItems["start"].append(sentence)
Something like this should suffice. The code is just to give you an idea of what you might like to do. You can have a list of criteria and loop over them, which will add sentences related to the different criteria into the dictionary, something like this:
mycriterias = ['start', 'begin', 'whatever']
for criteria in mycriterias:
    for sentence in fileDataAsAList:
        if satisfiesCriteria(criteria, sentence):
            seperatedItems[criteria].append(sentence)
mind the spellings :p

remove duplicates from list product of tab delimited file and further classification

I have a tab-delimited file that I need to extract all of the column 12 content from (which documents categories). However, the column 12 content is highly repetitive, so first I need to get a list that returns just the categories (by removing repeats). Then I need to find a way to get the number of lines per category. My attempt is as follows:
def remove_duplicates(l):  # define function to remove duplicates
    return list(set(l))

input = sys.argv[1]  # command line arguments to open tab file
infile = open(input)
for lines in infile:  # split content into lines
    words = lines.split("\t")  # split lines into words i.e. columns
    dataB2.append(words[11])  # column 12 contains the desired repetitive categories
dataB2 = dataA.sort()  # sort the categories
dataB2 = remove_duplicates(dataA)  # attempting to remove duplicates but this just returns an infinite list of 0's in the print command
print(len(dataB2))
infile.close()
I have no idea how I would get the number of lines for each category, though.
So my questions are: how do I eliminate the repeats effectively, and how do I get the number of lines for each category?
I suggest using a Python Counter to implement this. A Counter does almost exactly what you are asking for, so your code would look as follows:
from collections import Counter
import sys

count = Counter()
# Note that the with open()... syntax is generally preferred.
with open(sys.argv[1]) as infile:
    for lines in infile:  # split content into lines
        words = lines.split("\t")  # split lines into words i.e. columns
        count.update([words[11]])
print count
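To sketch how the Counter behaves (with made-up rows and a shorter column index than the real file's 11), it answers both questions at once: len() gives the number of unique categories, and the counts give the lines per category:

```python
from collections import Counter

count = Counter()
# Hypothetical tab-delimited rows; the category is in the last column here.
rows = ["a\tb\tcat1", "c\td\tcat2", "e\tf\tcat1"]
for line in rows:
    count.update([line.split("\t")[2]])  # index 2 here; 11 in the real file

print(len(count))            # number of unique categories: 2
print(count['cat1'])         # lines in category 'cat1': 2
print(count.most_common(1))  # [('cat1', 2)]
```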
All you need to do is read each line from the file, split it by tabs, grab column 12 for each line and put it in a list. (If you don't care about repeated lines, just make column_12 = set() and use add(item) instead of append(item).) Then you simply use len() to get the length of the collection. Or, if you want both, you can use a list and change it to a set later.
EDIT: To count each category (thank you Tom Morris for alerting me to the fact I didn't actually answer the question): iterate over the set of column_12, so as to not count anything more than once, and use the list's built-in count() method.
with open(infile, 'r') as fob:
    column_12 = []
    for line in fob:
        column_12.append(line.split('\t')[11])
    print 'Unique lines in column 12 %d' % len(set(column_12))
    print 'All lines in column 12 %d' % len(column_12)
    print 'Count per category:'
    for cat in set(column_12):
        print '%s - %d' % (cat, column_12.count(cat))

count word in textfile

I have a textfile that I want to count the word "quack" in.
textfile named "quacker.txt" example:
This is the textfile quack.
Oh, and how quack did quack do in his exams back in 2009?\n Well, he passed with nine P grades and one B.\n He says that quack he wants to go to university in the\n future but decided to try and make a career on YouTube before that Quack....\n So, far, it’s going very quack well Quack!!!!
So here I want 7 as the output.
readf = open("quacker.txt", "r")
lst = []
for x in readf:
    lst.append(str(x).rstrip('\n'))
readf.close()
# above gives a list of each row.

cv = 0
for i in lst:
    if "quack" in i.strip():
        cv += 1
The above only counts one "quack" per element of the list.
Well if the file isn't too long, you could try:
with open('quacker.txt') as f:
    text = f.read().lower()  # make it all lowercase so the count works below
quacks = text.count('quack')
As @PadraicCunningham mentioned in the comments, this would also count the 'quack' in words like 'quacks' or 'quacking'. But if that's not an issue, then this is fine.
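If whole-word matching does matter, one alternative (not used in the answers here) is a regex with word boundaries:

```python
import re

text = "Quack quacks quacking quack!"

# \b restricts the match to whole words; re.IGNORECASE catches 'Quack'.
# 'quacks' and 'quacking' are not matched, 'quack!' is.
whole_words = re.findall(r'\bquack\b', text, re.IGNORECASE)
print(len(whole_words))  # 2
```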
You're incrementing by one if the line contains the string, but what if the line has several occurrences of 'quack'?
Try:
for line in lst:
    for word in line.split():
        if 'quack' in word:
            cv += 1
You need to lower, strip and split to get an accurate count:
from string import punctuation

with open("test.txt") as f:
    quacks = sum(word.lower().strip(punctuation) == "quack"
                 for line in f for word in line.split())
print(quacks)
7
You need to split the file into individual words or you will get false positives using in or count. word.lower().strip(punctuation) lowercases each word and removes any punctuation; sum adds up all the times word.lower().strip(punctuation) == "quack" is True.
In your own code, x is already a string, so calling str(x) is unnecessary. You could also just check each line the first time you iterate; there is no need to add the strings to a list and then iterate a second time. The reason you only get one returned is most likely that all the data is actually on a single line. You are also comparing quack to Quack, which will not match; you need to lowercase the string.

I cannot get split to work, what am I doing wrong?

Here is the code for the program that I have written so far. I am trying to calculate the efficiency of NBA players for a class project. When I run the program on a comma-delimited file that contains all the stats, instead of splitting on each comma it creates a list entry of the entire line of the stat file. I either get an index out of range error, or it treats each character as an index point instead of the separate fields. I am new to this, but it seems it should be creating a list for each line in the file, separated into the elements of that list, so I get a list of lists. I hope I have made myself understood.
Here is the code:
def get_data_list(file_name):
    data_file = open(file_name, "r")
    data_list = []
    for line_str in data_file:
        # strip end-of-line, split on commas, and append items to list
        line_str.strip()
        line_str.split(',')
        print(line_str)
        data_list.append(line_str)
    print(data_list)

file_name1 = input("File name: ")
result_list = get_data_list(file_name1)
print(result_list)
I do not see how to post the data file for you to look at and try it with, but any file of numbers that are comma-delimited should work.
If there is a way to post the data file or email to you for you to help me with it I would be happy to do so.
Boliver
Strings are immutable objects; this means you can't change them in place. Any operation on a string returns a new one. Now look at your code:
line_str.strip() # returns a string
line_str.split(',') # returns a list of strings
data_list.append(line_str) # appends original 'line_str' (i.e. the entire line)
You could solve this by:
stripped = line_str.strip()
data = stripped.split(',')
data_list.append(data)
Or, chaining the string operations:
data = line_str.strip().split(',')
data_list.append(data)
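A quick check with a made-up stat line shows why the assignment matters:

```python
line = "Jordan,23,SG,30.1\n"

# Calling the methods without assigning the results changes nothing:
line.strip()
line.split(',')
print(line == "Jordan,23,SG,30.1\n")  # True - the original is untouched

# The chained version returns the list of fields:
data = line.strip().split(',')
print(data)  # ['Jordan', '23', 'SG', '30.1']
```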
