I'm trying to parse through a dictionary in a .CSV file, using two lists in separate .txt files so that the script knows what it is looking for. The idea is to find a line in the .CSV file that matches both a Word and an IDNumber, and then pull out a third variable if there is a match. However, the code is running really slowly. Any ideas how I could make it more efficient?
import csv

IDNumberList_filename = 'IDs.txt'
WordsOfInterest_filename = 'dictionary_WordsOfInterest.txt'
Dictionary_filename = 'dictionary_individualwords.csv'
WordsOfInterest_ReadIn = open(WordsOfInterest_filename).read().split('\n')
#IDNumberListtoRead = open(IDNumberList_filename).read().split('\n')

for CurrentIDNumber in open(IDNumberList_filename).readlines():
    for CurrentWord in open(WordsOfInterest_filename).readlines():
        FoundCurrent = 0
        with open(Dictionary_filename, newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                if ((row['IDNumber'] == CurrentIDNumber) and (row['Word'] == CurrentWord)):
                    FoundCurrent = 1
                    CurrentProportion = row['CurrentProportion']
        if FoundCurrent == 0:
            CurrentProportion = 0
        else:
            CurrentProportion = 1
            print('found')
First of all, consider loading the file dictionary_individualwords.csv into memory. A Python dictionary is probably the proper data structure for this case.
You are opening the CSV file N times, where N = (# lines in IDs.txt) * (# lines in dictionary_WordsOfInterest.txt). If the file is not too large, you can avoid that by saving its contents to a dictionary or a list of lists.
In the same way, you reopen dictionary_WordsOfInterest.txt every time you read a new line from IDs.txt.
Also, it seems that you are looking for every possible (CurrentIDNumber, CurrentWord) pair from the txt files. So, for example, you can store the ids in one set and the words in another, and for each row in the csv file check whether both the id and the word are in their respective sets.
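A minimal sketch of that set-based approach (column names taken from the question's code; wrapped in a function here so it can be fed test data, but with the real files you would pass the open file handles and the CSV text):

```python
import csv
import io

def find_matches(id_lines, word_lines, csv_text):
    # Build the lookup sets once; strip newlines so comparisons can match
    ids = {line.strip() for line in id_lines if line.strip()}
    words = {line.strip() for line in word_lines if line.strip()}
    matches = []
    # Single pass over the CSV: O(1) membership test per row
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row['IDNumber'] in ids and row['Word'] in words:
            matches.append((row['IDNumber'], row['Word'], row['CurrentProportion']))
    return matches
```

With the question's files you would call it roughly as find_matches(open('IDs.txt'), open('dictionary_WordsOfInterest.txt'), open('dictionary_individualwords.csv', newline='', encoding='utf-8').read()).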
As you use readlines for the .txt files, you already build in-memory lists from them. You should build those lists first and then parse the csv file only once. Something like:
import csv

IDNumberList_filename = 'IDs.txt'
WordsOfInterest_filename = 'dictionary_WordsOfInterest.txt'
Dictionary_filename = 'dictionary_individualwords.csv'

# strip the trailing newlines, otherwise the comparisons below can never match
numberlist = [line.strip() for line in open(IDNumberList_filename)]
wordlist = [line.strip() for line in open(WordsOfInterest_filename)]

with open(Dictionary_filename, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        FoundCurrent = 0
        for CurrentIDNumber in numberlist:
            for CurrentWord in wordlist:
                if row['IDNumber'] == CurrentIDNumber and row['Word'] == CurrentWord:
                    FoundCurrent = 1
                    CurrentProportion = row['CurrentProportion']
        if FoundCurrent == 0:
            CurrentProportion = 0
        else:
            CurrentProportion = 1
            print('found')
Beware: untested
I'm having some difficulties with my code - wondering if anyone could help me as to where I'm going wrong.
The general syntax of the goal I'm trying to achieve is:
Get user input
Split input into individual variables
Write variables (amend) to 'data.csv'
Read variables from newly amended 'data.csv'
Add variables to list
If variable 1 <= length of list, #run some code
If variable 2 <= length of list, #run some code
Here is my python code:
from selenium import webdriver
import time
import csv

x = raw_input("Enter numbers separated by a space")
integers = [[int(i)] for i in x.split()]

with open("data.csv", "a") as f:
    writer = csv.writer(f)
    writer.writerows(integers)

with open('data.csv', 'r') as f:
    file_contents = f.read()
previous_FONs = file_contents.split(' ')

if list.count(integers[i]) == 1:
    #run some code
elif list.count(integers[i]) == 2:
    #run some code
The error message I'm receiving is TypeError: count() takes exactly one argument (0 given)
Because of the following line
integers = [[int(i)] for i in x.split()]
you're creating a list of lists. Therefore you're passing lists to the count method. Try this one:
integers = [int(i) for i in x.split()]
Edit: Based on your explanation of what you want to achieve, this code should do it:
import csv

x = raw_input('Enter numbers separated by a space: ')
new_FONs = [[int(i)] for i in x.split()]

with open('data.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerows(new_FONs)

with open('data.csv', 'r') as f:
    all_FONs_str = [line.split() for line in f]
all_FONs = [[int(FON[0])] for FON in all_FONs_str]

# For each of the user-input numbers
for FON in new_FONs:
    # Count the occurrences of this number in the CSV file
    FON_count = all_FONs.count(FON)
    if FON_count == 1:
        print('{} occurs once'.format(FON[0]))
        # do stuff
    elif FON_count == 2:
        print('{} occurs twice'.format(FON[0]))
        # do stuff
I have changed the name of the list read from the CSV to all_FONs, just as a reminder that it contains the old entries as well as the new ones (since we wrote them to the file before reading).
In addition, you need to convert the entries, since reading from a CSV gives you strings, not integers, which would make the comparison fail. Maybe the whole conversion to int is not necessary and you could just work with strings, but that depends on what you need.
Edit2: Sorry forgot to change from input to raw_input for Python 2.7 :)
Right now I have a small script that writes and read data to a CSV file.
Brief snippet of the write function:
with open(filename, 'w') as f1:
    writer = csv.writer(f1, delimiter=';', lineterminator='\n')
    for a, b in my_function:
        # do_things_to_get_data
        writer.writerow([tech_link, str(total), str(avg), str(unique_count)])
Then brief snippet of reading the file:
infile = open(filename, "r")
for line in infile:
    row = line.split(";")
    tech = row[0]
    total = row[1]
    average = row[2]
    days_worked = row[3]
    # do_things_with_each_row_of_data
I'd like to just skip the CSV part all together and see if I can just hold all that data in a variable but I'm not sure what that looks like. Any help is appreciated.
Thank you.
...no point in me saving data to a csv file just to read it later in the script
Just keep it in a list of lists
data = []
for a, b in my_function:
    # do_things_to_get_data
    data.append([tech_link, str(total), str(avg), str(unique_count)])
...
for tech, total, average, days_worked in data:
    # do_things_with_each_row_of_data
It might be worth saving each row as a namedtuple or a dictionary
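For example, the rows could be kept as namedtuples (a sketch; the field names and the sample values are taken from or invented to match the snippet above, not the poster's actual data):

```python
from collections import namedtuple

# A named tuple documents what each position in the row holds
Record = namedtuple('Record', ['tech_link', 'total', 'avg', 'unique_count'])

data = []
# hypothetical values standing in for do_things_to_get_data
data.append(Record('some_link', str(10), str(2.5), str(4)))

for rec in data:
    # fields can now be read by name instead of by index
    print(rec.tech_link, rec.total, rec.avg, rec.unique_count)
```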
I want to extract Neutral words from the given csv file (to a separate .txt file), but I'm fairly new to python and don't know much about file handling. I could not find a neutral words dataset, but after searching here and there, this is what I was able to find.
Here is the GitHub project from where I want to extract data (just in case anyone needs to know): hoffman-prezioso-projects/Amazon_Review_Sentiment_Analysis
Neutral Words
Word Sentiment Score
a 0.0125160264947
the 0.00423728459134
it -0.0294755274737
and 0.0810574365028
an 0.0318918766949
or -0.274298468178
normal -0.0270787859177
So basically I want to extract only those words (text) from csv where the numeric value is 0.something.
Even without using any libraries, this is fairly easy with the csv you're using.
First open the file (I'm going to assume you have the path saved in the variable filename), then read the file with the readlines() function, and then filter out according to the condition you give.
with open(filename, 'r') as csv:  # Open the file for reading
    rows = [line.split(',') for line in csv.readlines()]  # Read the file line by line, splitting on commas
filter = [line[0] for line in rows if abs(float(line[1])) < 1]
# Keep only the words whose score is strictly between -1 and 1
This is now the accepted answer, so I'm adding a disclaimer. There are numerous reasons why this code should not be applied to other CSVs without thought.
It reads the entire CSV in memory
It does not account for e.g. quoting
It is acceptable for very simple CSVs but the other answers here are better if you cannot be certain that the CSV won't break this code.
Here is one way to do it with only vanilla libs and not holding the whole file in memory
import csv

def get_vals(filename):
    with open(filename, 'r', newline='') as fin:
        reader = csv.reader(fin)
        for line in reader:
            if float(line[-1]) <= 0:  # convert: the reader yields strings
                yield line[0]

words = get_vals(filename)
for word in words:
    # do stuff...
Use pandas like so:
import pandas
df = pandas.read_csv("yourfile.csv")
df.columns = ['word', 'sentiment']
to choose words by sentiment:
positive = df[df['sentiment'] > 0]['word']
negative = df[df['sentiment'] < 0]['word']
neutral = df[df['sentiment'] == 0]['word']
If you don't want to use any additional libraries, you can try with csv module. Note that delimiter='\t' can be different in your case.
import csv

f = open('name.txt', 'r')
reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
for row in reader:
    if float(row[1]) > 0.0:
        print(row[0] + ' ' + row[1])
I want to search a CSV file and print either True or False, depending on whether or not I found the string. However, I'm running into the problem whereby it will return a false positive if it finds the string embedded in a larger string of text. E.g.: It will return True if string is foo and the term foobar is in the CSV file. I need to be able to return exact matches.
username = input()
if username in open('Users.csv').read():
    print("True")
else:
    print("False")
I've looked at using mmap, re and csv module functions, but I haven't got anywhere with them.
EDIT: Here is an alternative method:
import re
import csv

username = input()
with open('Users.csv', 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        re.search(r'\bNOTSUREHERE\b', username)
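For what it's worth, the re route can work if the pattern is built from the username and run against the file's text rather than the other way round; a hedged sketch (the function name is mine):

```python
import re

def exact_match(username, text):
    # \b matches only at a word boundary, so 'foo' will not match inside 'foobar';
    # re.escape protects any regex metacharacters in the username.
    # Note: \b behaves as expected only when the name starts and ends
    # with word characters (letters, digits, underscore).
    return re.search(r'\b' + re.escape(username) + r'\b', text) is not None
```

Usage would be something like exact_match(username, open('Users.csv').read()).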
When you look inside a csv file using the csv module, it returns each row as a list of columns. So if you want to look up your string, you should modify your code as such:
import csv

username = input()
with open('Users.csv', 'rt') as f:
    reader = csv.reader(f, delimiter=',')  # good point by @paco
    for row in reader:
        for field in row:
            if field == username:
                print("is in file")
but as it is a csv file, you might expect the username to be at a given column:
with open('Users.csv', 'rt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        if username == row[2]:  # if the username shall be on column 3 (-> index 2)
            print("is in file")
You should have a look at the csv module in python.
import csv

is_in_file = False
with open('my_file.csv', 'r') as csvfile:
    my_content = csv.reader(csvfile, delimiter=',')
    for row in my_content:
        if username in row:
            is_in_file = True
print(is_in_file)
It assumes that your delimiter is a comma (replace it with your delimiter if needed). Note that username must be defined previously. Also change the name of the file.
The code loops through all the lines in the CSV file. row is a list of strings containing each element of your row. For example, if you have this in your CSV file: Joe,Peter,Michel then the row will be ['Joe', 'Peter', 'Michel']. Then you can check whether your username is in that list.
I have used the top comment, it works and looks OK, but it was too slow for me.
I had an array of many strings that I wanted to check if they were in a large csv-file. No other requirements.
For this purpose I used (simplified, I iterated through a string of arrays and did other work than print):
with open('my_csv.csv', 'rt') as c:
    str_arr_csv = c.readlines()
Together with:
if str(my_str) in str(str_arr_csv):
    print("True")
The reduction in time was about ~90% for me. The code looks ugly, but I'm all about speed. Sometimes.
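Note that this is the substring test the question warned about, so it can report false positives. A set of all fields is similarly fast for repeated lookups without that problem (a sketch, assuming comma-separated values; the helper name is mine):

```python
import csv

def build_field_set(lines):
    """Collect every individual field into a set for O(1) exact lookups."""
    fields = set()
    for row in csv.reader(lines):
        fields.update(field.strip() for field in row)
    return fields

# with open('my_csv.csv', newline='') as c:
#     fields = build_field_set(c)
# hits = [s for s in my_strings if s in fields]
```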
import csv

scoresList = []
with open("playerScores_v2.txt") as csvfile:
    scores = csv.reader(csvfile, delimiter=",")
    for row in scores:
        scoresList.append(row)

playername = input("Enter the player name you would like the score for:")
print("{0:40} {1:10} {2:10}".format("Name", "Level", "Score"))
for i in range(0, len(scoresList)):
    print("{0:40} {1:10} {2:10}".format(scoresList[i][0], scoresList[i][1], scoresList[i][2]))
EXTENDED ALGO:
As I can have some values with spaces in my csv:
", atleft,atright , both " ,
I patched the code of zmo as follows:
if field.strip() == username:
and it's OK, thanks.
OLD FASHION ALGO
I had previously coded an 'old fashion' algorithm that takes care of any allowed separator (here comma, space and newline), so I was curious to compare performances.
With 10000 rounds on a very simple csv file, I got:
------------------ algo 1 old fashion ---------------
done in 1.931804895401001 s.
------------------ algo 2 with csv ---------------
done in 1.926626205444336 s.
As this is not too bad (0.25% longer), I think that this good old hand-made algo can help somebody (and it will be useful if there are more parasitic chars, since strip() only handles whitespace).
This algo works on bytes, so it can be used for things other than strings.
It searches for a name not embedded in another by checking that the bytes on its left and right are allowed separators.
It mainly uses loops, with ejection as soon as possible through break or continue.
def separatorsNok(x):
    return (x != 44) and (x != 32) and (x != 10) and (x != 13)  # comma space lf cr

# set as a function to be able to run several chained tests
def searchUserName(userName, fileName):
    # read file as binary (supposed to be utf-8, like userName)
    f = open(fileName, 'rb')
    contents = f.read()
    f.close()
    lenOfFile = len(contents)
    # set username in bytes
    userBytes = bytearray(userName.encode('utf-8'))
    lenOfUser = len(userBytes)
    posInFile = 0
    found = False
    while posInFile < lenOfFile:
        found = False
        posInUser = 0
        # search full name
        while posInFile < lenOfFile:
            if contents[posInFile] == userBytes[posInUser]:
                posInUser += 1
                if posInUser == lenOfUser:
                    found = True
                    break
            elif posInUser > 0:
                # partial match failed: restart just after its first byte
                posInFile -= posInUser
                posInUser = 0
            posInFile += 1
        if not found:
            continue
        # found a full name, check if isolated on left and on right
        # left ok at very beginning, or after space, comma or newline
        if posInFile >= lenOfUser:
            if separatorsNok(contents[posInFile - lenOfUser]):  # previous left byte
                posInFile += 1  # advance, otherwise the same spot is rescanned forever
                continue
        # right ok at very end, or before space, comma or newline
        if posInFile < lenOfFile - 1:
            if separatorsNok(contents[posInFile + 1]):  # next right byte
                posInFile += 1
                continue
        # found and bordered
        break
    # main while
    if found:
        print(userName, "is in file")  # at posInFile-lenOfUser+1
    else:
        pass
to check: searchUserName('pirla','test.csv')
Like the other answers, the code exits at the first match but can easily be extended to find all matches.
HTH
#!/usr/bin/python
with open('my.csv', 'r') as f:
    lines = f.readlines()

cnt = 0
for entry in lines:
    if 'foo' in entry:
        cnt += 1

print("No of foo entry Count :".ljust(20, '.'), cnt)
I would like to use the Python CSV module to open a CSV file for appending. Then, from a list of CSV files, I would like to read each csv file and write it to the appended CSV file. My script works great - except that I cannot find a way to remove the headers from all but the first CSV file being read. I am certain that my else block of code is not executing properly. Perhaps my syntax for my if else code is the problem? Any thoughts would be appreciated.
writeFile = open(append_file, 'a+b')
writer = csv.writer(writeFile, dialect='excel')
for files in lstFiles:
    readFile = open(input_file, 'rU')
    reader = csv.reader(readFile, dialect='excel')
    for i in range(0, len(lstFiles)):
        if i == 0:
            oldHeader = readFile.readline()
            newHeader = writeFile.write(oldHeader)
            for row in reader:
                writer.writerow(row)
        else:
            reader.next()
            for row in reader:
                row = readFile.readlines()
                writer.writerow(row)
    readFile.close()
writeFile.close()
You're effectively iterating over lstFiles twice. For each file in your list, you're running your inner for loop up from 0. You want something like:
writeFile = open(append_file, 'a+b')
writer = csv.writer(writeFile, dialect='excel')
headers_needed = True
for input_file in lstFiles:
    readFile = open(input_file, 'rU')
    reader = csv.reader(readFile, dialect='excel')
    oldHeader = reader.next()
    if headers_needed:
        newHeader = writer.writerow(oldHeader)
        headers_needed = False
    for row in reader:
        writer.writerow(row)
    readFile.close()
writeFile.close()
You could also use enumerate over the lstFiles to iterate over tuples containing the iteration count and the filename, but I think the boolean shows the logic more clearly.
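For comparison, the enumerate variant turns the "first file" question into an index check (a sketch with hypothetical file names; the real loop would write rows instead of collecting flags):

```python
# enumerate pairs each filename with its index, so the header decision
# becomes an index test instead of a separate boolean flag
lstFiles = ['a.csv', 'b.csv', 'c.csv']  # hypothetical file list
header_flags = []
for i, input_file in enumerate(lstFiles):
    header_flags.append(i == 0)  # write the header only for the first file
```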
You probably do not want to mix iterating over the csv reader and directly calling readline on the underlying file.
I think you're iterating too many times (over various things: both your list of files and the files themselves). You've definitely got some consistency problems; it's a little hard to be sure since we can't see your variable initializations. This is what I think you want:
with open(append_file, 'a+b') as writeFile:
    need_headers = True
    for input_file in lstFiles:
        with open(input_file, 'rU') as readFile:
            headers = readFile.readline()
            if need_headers:
                # Write the headers only if we need them
                writeFile.write(headers)
                need_headers = False
            # Now write the rest of the input file.
            for line in readFile:
                writeFile.write(line)
I took out all the csv-specific stuff since there's no reason to use it for this operation. I also cleaned the code up considerably to make it easier to follow, using the files as context managers and a well-named boolean instead of the "magic" i == 0 check. The result is a much nicer block of code that (hopefully) won't have you jumping through hoops to understand what's going on.