Check whether string is in CSV - python

I want to search a CSV file and print either True or False, depending on whether or not I found the string. However, I'm running into a problem: it returns a false positive if it finds the string embedded in a larger string. E.g., it will return True if the string is foo and the term foobar is in the CSV file. I need to be able to return exact matches only.
username = input()
if username in open('Users.csv').read():
    print("True")
else:
    print("False")
I've looked at using mmap, re and csv module functions, but I haven't got anywhere with them.
EDIT: Here is an alternative method:
import re
import csv

username = input()
with open('Users.csv', 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        re.search(r'\bNOTSUREHERE\b', username)
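For reference, a minimal sketch of how that word-boundary idea could be completed (an assumption on my part: the pattern should be the escaped username searched against the file contents, and \b only behaves as expected when the username consists of word characters):

import re

username = input()
pattern = re.compile(r'\b' + re.escape(username) + r'\b')
with open('Users.csv', 'rt') as f:
    # \b word boundaries prevent 'foo' from matching inside 'foobar'
    print(bool(pattern.search(f.read())))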

When you look inside a csv file using the csv module, it will return each row as a list of columns. So if you want to look up your string, you should modify your code as follows:
import csv

username = input()
with open('Users.csv', 'rt') as f:
    reader = csv.reader(f, delimiter=',')  # good point by @paco
    for row in reader:
        for field in row:
            if field == username:
                print("is in file")
But as it is a csv file, you might expect the username to be in a given column:
with open('Users.csv', 'rt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        if username == row[2]:  # if the username shall be in column 3 (-> index 2)
            print("is in file")

You should have a look at the csv module in Python.

import csv

is_in_file = False
with open('my_file.csv', 'rt') as csvfile:
    my_content = csv.reader(csvfile, delimiter=',')
    for row in my_content:
        if username in row:
            is_in_file = True
print(is_in_file)
It assumes that your delimiter is a comma (replace it with your delimiter if needed). Note that username must be defined previously, and change the file name to match yours.
The code loops through all the lines in the CSV file; row is a list of strings containing each element of your row. For example, if you have this in your CSV file: Joe,Peter,Michel then row will be ['Joe', 'Peter', 'Michel']. Then you can check whether your username is in that list.
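A quick illustration of why this membership test avoids the substring false positive from the question:

row = ['Joe', 'Peter', 'Michel']
print('Joe' in row)   # True: exact element match
print('Jo' in row)    # False: 'in' on a list compares whole elements, not substrings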

I have used the top comment; it works and looks OK, but it was too slow for me.
I had an array of many strings that I wanted to check for in a large csv file. No other requirements.
For this purpose I used (simplified; I actually iterated through an array of strings and did other work than print):
with open('my_csv.csv', 'rt') as c:
    str_arr_csv = c.readlines()
Together with:
if str(my_str) in str(str_arr_csv):
    print("True")
The reduction in time was about 90% for me. The code looks ugly, but I'm all about speed. Sometimes.
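Note that this substring test reintroduces the false-positive problem from the original question ('foo' will match a line containing 'foobar'). A sketch that keeps the speed but restores exact matching, assuming each cell holds one value, would build a set of fields once and test membership against it:

import csv

with open('my_csv.csv', 'rt') as c:
    fields = {field.strip() for row in csv.reader(c) for field in row}

# Set lookup is O(1) per query, so many strings can be checked cheaply
if my_str in fields:
    print("True")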

import csv

scoresList = []
with open("playerScores_v2.txt") as csvfile:
    scores = csv.reader(csvfile, delimiter=",")
    for row in scores:
        scoresList.append(row)

playername = input("Enter the player name you would like the score for:")
print("{0:40} {1:10} {2:10}".format("Name", "Level", "Score"))
for i in range(0, len(scoresList)):
    print("{0:40} {1:10} {2:10}".format(scoresList[i][0], scoresList[i][1], scoresList[i][2]))

EXTENDED ALGO:
Since my csv can have some values with spaces:
", atleft,atright , both " ,
I patched zmo's code as follows:
if field.strip() == username:
and it's OK, thanks.
OLD-FASHIONED ALGO
I had previously coded an 'old-fashioned' algorithm that takes care of any allowed separators (here comma, space and newline), so I was curious to compare performance.
With 10000 rounds on a very simple csv file, I got:
------------------ algo 1 old fashion ---------------
done in 1.931804895401001 s.
------------------ algo 2 with csv ---------------
done in 1.926626205444336 s.
As this is not too bad (0.25% longer), I think this good old hand-made algo can still help somebody (and it will be useful when there are more parasitic characters, since strip() only removes whitespace).
This algo works on bytes and can be used for things other than strings.
It searches for a name not embedded in another by checking that the bytes to the left and right of a hit are among the allowed separators.
It mainly uses loops that eject as soon as possible through break or continue.
def separatorsNok(x):
    return (x != 44) and (x != 32) and (x != 10) and (x != 13)  # comma space lf cr

# set as a function to be able to run several chained tests
def searchUserName(userName, fileName):
    # read file as binary (supposed to be utf-8, like userName)
    f = open(fileName, 'rb')
    contents = f.read()
    f.close()
    lenOfFile = len(contents)
    # set username in bytes
    userBytes = bytearray(userName.encode('utf-8'))
    lenOfUser = len(userBytes)
    posInFile = 0
    found = False
    while posInFile < lenOfFile:
        found = False
        posInUser = 0
        # search full name
        while posInFile < lenOfFile:
            if contents[posInFile] == userBytes[posInUser]:
                posInUser += 1
                if posInUser == lenOfUser:
                    found = True
                    break
            else:
                # on a mismatch, restart just after the byte where this attempt began
                posInFile -= posInUser
                posInUser = 0
            posInFile += 1
        if not found:
            continue
        # found a full name, check it is isolated on the left and on the right
        # left ok at very beginning, or after space, comma or newline
        if posInFile >= lenOfUser:
            if separatorsNok(contents[posInFile - lenOfUser]):  # previousLeft
                posInFile += 1  # skip past this hit so the scan can move on
                continue
        # right ok at very end, or before space, comma or newline
        if posInFile < lenOfFile - 1:
            if separatorsNok(contents[posInFile + 1]):  # nextRight
                posInFile += 1
                continue
        # found and bordered
        break
    # main while
    if found:
        print(userName, "is in file")  # at posInFile-lenOfUser+1
    else:
        pass
To check: searchUserName('pirla', 'test.csv')
Like the other answers, the code exits at the first match, but it can easily be extended to find all matches.
HTH
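For comparison, a compact sketch of the same "bordered by separators" check using re on bytes (assuming, as above, that the file is utf-8 like userName):

import re

def searchUserNameRe(userName, fileName):
    sep = rb'[ ,\r\n]'  # the same allowed separators: space, comma, cr, lf
    pattern = (rb'(?:\A|' + sep + rb')' +
               re.escape(userName.encode('utf-8')) +
               rb'(?:' + sep + rb'|\Z)')
    with open(fileName, 'rb') as f:
        return re.search(pattern, f.read()) is not None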

#!/usr/bin/python
import csv

with open('my.csv', 'r') as f:
    lines = f.readlines()

cnt = 0
for entry in lines:
    if 'foo' in entry:
        cnt += 1
print("No of foo entry Count :".ljust(20, '.'), cnt)

Related

My code is not working properly only on my device

I created a program to create a csv containing every number from 0 to 1000000:
import csv

nums = list(range(0, 1000000))
with open('codes.csv', 'w') as f:
    writer = csv.writer(f)
    for val in nums:
        writer.writerow([val])
Then another program to remove a number, taken as input, from the file:
import csv
import os

while True:
    members = input("Please enter a number to be deleted: ")
    lines = list()
    with open('codes.csv', 'r') as readFile:
        reader = csv.reader(readFile)
        for row in reader:
            if all(field != members for field in row):
                lines.append(row)
            else:
                print('Removed')
    os.remove('codes.csv')
    with open('codes.csv', 'w') as writeFile:
        writer = csv.writer(writeFile)
        writer.writerows(lines)
The above code works fine on every other device except my PC: in the first program it creates the csv file with empty rows between every number, and in the second program the number of empty rows multiplies and the file size also multiplies.
What is wrong with my device then?
Thanks in advance
I think you shouldn't use a csv file for single column data. Use a json file instead.
And the code that you've written for checking which value not to remove is unnecessary. Instead, you could write a list of numbers to the file, read it back into a variable, remove the number you want using the list.remove() method, and then write it back to the file.
Here's how I would've done it:
import json

with open("codes.json", "w") as f:  # Write the numbers to the file
    f.write(json.dumps(list(range(0, 1000000))))

nums = None
with open("codes.json", "r") as f:  # Read the list in the file into nums
    nums = json.load(f)

to_remove = int(input("Number to remove: "))
nums.remove(to_remove)  # Removes the number you want to

with open("codes.json", "w") as f:  # Dump the list back to the file
    f.write(json.dumps(nums))
Seems like you have different Python versions.
There is a difference between the built-in Python 2 open() and Python 3 open(). Python 3 defaults to universal newlines mode, while in Python 2 the newline handling depends on the mode argument of the open() function.
The csv module docs provide a few examples where open() is called with the newline argument explicitly set to the empty string, newline='':
import csv

with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
Try to do the same. Without an explicit newline='', your writerows call probably adds one more newline character per row.
CSV stands for Comma-Separated Values; here your records also contain spaces. To remove the empty lines, add newline='' when opening the file for writing.
Since this format is tabular data, you cannot simply delete an element, or the table will lose its shape. You need to insert an empty string or "NaN" in place of the deleted element.
I reduced the number of entries and arranged them as a table for clarity.
import csv

def write_csv(file, seq):
    with open(file, 'w', newline='') as f:
        writer = csv.writer(f)
        for val in seq:
            writer.writerow([v for v in val])

nums = ((j * 10 + i for i in range(0, 10)) for j in range(0, 10))
write_csv('codes.csv', nums)

nums_new = []
members = input("Please enter a number, from 0 to 100, to be deleted: ")
with open('codes.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        rows_new = []
        for elem in row:
            if elem == members:
                elem = ""
            rows_new.append(elem)
        nums_new.append(rows_new)

write_csv('codesdel.csv', nums_new)

Python - Get the largest age from a list in a txt file

Any idea how I should get the largest age from the text file and print it?
The text file:
Name, Address, Age,Hobby
Abu, “18, Jalan Satu, Penang”, 18, “Badminton, Swimming”
Choo, “Vista Gambier, 10-3A-88, Changkat Bukit Gambier Dua, 11700, Penang”, 17, Dancing
Mutu, Kolej Abdul Rahman, 20, “Shopping, Investing, Youtube-ing”
This is my code:
with open("iv.txt",encoding="utf8") as file:
data = file.read()
splitdata = data.split('\n')
I am not getting what I want from this.
This works! I hope it helps. Let me know if there are any questions.
This approach essentially assumes that values associated with Hobby do not have numbers in them.
import csv

max_age = 0
with open("iv.txt", newline='', encoding="utf8") as f:
    # spamreader is a reader object used to iterate over lines of f
    # delimiter=',' is the default but I like to be explicit
    spamreader = csv.reader(f, delimiter=',')
    # skip the header row
    next(spamreader)
    # each row read from the file is returned as a list of strings
    for row in spamreader:
        # reversed() returns a reverse iterator (start from the end of the list)
        for i in reversed(row):
            try:
                i = int(i)
                break
            # ValueError is raised when string i is not an int
            except ValueError:
                pass
        print(i)
        if i > max_age:
            max_age = i

print(f"\nMax age from file: {max_age}")
Output:
18
17
20
Max age from file: 20
csv.reader from Python's Standard Library returns a reader object (spamreader here) used to iterate over lines of f. Each row (i.e. line) read from the file f is returned as a list of strings.
The delimiter (in our case, ',', which is also the default) determines how a raw line from the file is broken up into mutually exclusive but exhaustive parts -- these parts become the elements of the list that is associated with a given line.
Given a raw line, the string associated with the start of the line to the first comma is an element, then the string associated with any part of the line that is enclosed by two commas is also an element, and finally the string associated with the last comma to the end of the line is also an element.
For each line/list of the file, we start iterating from the end of the list, using the reversed built-in function, because we know that age is the second-to-last category. We assume that the hobby category does not have numbers in them such that the number would appear as an element of the list for the raw line. For example, for the line associated with Abu, if instead of "Badminton, Swimming" we had "Badminton, 30, Swimming", then the code would not have the desired effect as 30 would be treated as Abu's age.
I'm sure there is a built-in feature to parse a composite string like the one you posted, but as I don't know it, I've created a CustomParser class to do the job:
class CustomParser():
    def __init__(self, line: str, delimiter: str):
        self.line = line
        self.delimiter = delimiter

    def split(self):
        word = ''
        words = []
        inside_string = False
        for letter in self.line:  # was 'line'; it must use the instance attribute
            if letter in '“”"':
                inside_string = not inside_string
                continue
            if letter == self.delimiter and not inside_string:
                words.append(word.strip())
                word = ''
                continue
            word += letter
        words.append(word.strip())
        return words

with open('people_data.csv') as file:
    ages = []
    for line in file:
        ages.append(CustomParser(line, ',').split()[2])
print(max(ages[1:]))
Hope that helps.
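Alternatively, a sketch using only the stdlib csv module, under the assumption that the curly quotes can simply be normalized to straight quotes first:

import csv

with open('people_data.csv', encoding='utf8') as file:
    # Normalize curly quotes so csv.reader can treat them as the quote character
    normalized = (line.replace('“', '"').replace('”', '"') for line in file)
    reader = csv.reader(normalized, skipinitialspace=True)
    next(reader)  # skip the header row
    ages = [int(row[2]) for row in reader]
print(max(ages))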

Reading and writing variables from CSV file in Python (Selenium)

I'm having some difficulties with my code - wondering if anyone could help me as to where I'm going wrong.
The general outline of what I'm trying to achieve is:
Get user input
Split input into individual variables
Write variables (amend) to 'data.csv'
Read variables from newly amended 'data.csv'
Add variables to list
If variable 1 <= length of list, #run some code
If variable 2 <= length of list, #run some code
Here is my python code:
from selenium import webdriver
import time
import csv

x = raw_input("Enter numbers separated by a space")
integers = [[int(i)] for i in x.split()]
with open("data.csv", "a") as f:
    writer = csv.writer(f)
    writer.writerows(integers)

with open('data.csv', 'r') as f:
    file_contents = f.read()
previous_FONs = file_contents.split(' ')

if list.count(integers[i]) == 1:
    #run some code
elif list.count(integers[i]) == 2:
    #run some code
The error message I'm receiving is TypeError: count() takes exactly one argument (0 given)
Because of the following line
integers = [[int(i)] for i in x.split()]
you're creating a list of lists. Therefore you're passing lists to the count method. Try this one:
integers = [int(i) for i in x.split()]
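A quick demonstration of the difference (hypothetical values):

x = '1 2 2 3'
nested = [[int(i)] for i in x.split()]   # [[1], [2], [2], [3]]
flat = [int(i) for i in x.split()]       # [1, 2, 2, 3]
print(nested.count(2))   # 0: the list contains [2], not 2
print(flat.count(2))     # 2: now the elements are plain ints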
Edit: Based on your explanation what you want to achieve, this code should do it:
import csv

x = raw_input('Enter numbers separated by a space: ')
new_FONs = [[int(i)] for i in x.split()]
with open('data.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_FONs)

with open('data.csv', 'r') as f:
    all_FONs_str = [line.split() for line in f]
all_FONs = [[int(FON[0])] for FON in all_FONs_str]

# For each of the user-input numbers
for FON in new_FONs:
    # Count the occurrence of this number in the CSV file
    FON_count = all_FONs.count(FON)
    if FON_count == 1:
        print(f'{FON[0]} occurs once')
        # do stuff
    elif FON_count == 2:
        print(f'{FON[0]} occurs twice')
        # do stuff
I have changed the name of the list read from CSV to all_FONs, just as a reminder that it contains the old entries as well as the new ones (as we wrote them into the file before reading).
In addition, you need to convert the entries, as reading from CSV gives you strings, not integers, which would make the comparison difficult. Maybe the whole conversion to int is not necessary and you could just work with strings; that depends on what you need.
Edit2: Sorry forgot to change from input to raw_input for Python 2.7 :)
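If you do stay with strings (as suggested above), a sketch of the string-only variant, assuming the same one-number-per-line data.csv layout:

x = raw_input('Enter numbers separated by a space: ')
with open('data.csv', 'r') as f:
    all_FONs = [line.strip() for line in f if line.strip()]

for FON in x.split():
    # counts whole-line matches, so '1' does not match '11'
    print('%s occurs %d times' % (FON, all_FONs.count(FON)))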

Reading numbers off a list from a txt file, but only up to a comma

This is data from a lab experiment (around 717 lines of data). Rather than trying to do it in Excel, I want to import and graph it in either Python or MATLAB. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
more numbers : see Screenshot of more data from my file
I just can't figure out how to read the line up until a comma. Specifically, I need the Load numbers for one of my arrays/list, so for example on the first line I only need 62.638 (which would be the first number on my first index on my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice

with open('Blue_bar_GroupD.txt', 'r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7:  # to skip the string data
        next(BB_csv)
        x += 1
    for row in islice(BB_csv, 0, 758):
        print(row[0])  # testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that has the 0th-index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines till the first data row and then parse the data into a list for later use; 700+ lines can easily be processed in memory.
Therefore you need to:
read the file line by line
remember the last non-empty line before the numbers start (== header)
check whether a line consists only of numbers, commas and dots; if not, increase a skip-counter (== data)
seek back to 0
skip enough lines to get to the header or data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open ("t.txt","w") as w:
w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv
def convert_row(row):
    """Convert one row of data into a list of mixed floats and others.
    Float is the preferred data type, else string is used - no other type tried."""
    d = []
    for v in row:
        try:
            # convert to float && add
            d.append(float(v))
        except ValueError:
            # not a float, append as is
            d.append(v)
    return d

def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
skiplines = 0
with open("t.txt", "r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header):  # skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Docs:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib.
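For example, a minimal sketch that plots Load against Time, assuming data has the shape shown in the output above (header row first, then numeric rows):

import matplotlib.pyplot as plt

header, rows = data[0], data[1:]
time = [r[1] for r in rows]   # Time (s)
load = [r[0] for r in rows]   # Load (lbf)

plt.plot(time, load)
plt.xlabel(header[1])
plt.ylabel(header[0])
plt.show()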
I would recommend reading your file with Python:
data = []
with open('my_txt.txt', 'r') as fd:
    # Suppress header lines
    for i in range(6):
        fd.readline()
    # Read each data line up to the first comma
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
leads to a list containing your data of the first column
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the python solution)
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: You can also use MATLAB's textscan function to achieve what you want without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you will need to do 3 steps: one, read all lines; two, determine which lines to use; last, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
if ('"' not in line) and (line != '\n'):
grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists containing your group of floats.
Explanation for fun:
In the "for" loop, i searched for the double quote to eliminate any string as all strings are concocted between quotes. The other one is for skipping empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do
grid[0][0]
as python's list counts from 0 to n-1 for n elements.
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Here 6 and 0 should be replaced by the row and column offsets you want. In this case the data starts at row 7 and you want all the columns; the second line then copies the data in column 1 into another vector.
As another note, try typing doc dlmread in MATLAB - it brings up the help page for dlmread. This is really useful when you're looking for MATLAB functions, as it has suggestions for similar functions at the bottom.

How to clean large malformed CSV file using Python

I'm attempting to use Python 2.7.5 to clean up a malformed CSV file. The CSV file is fairly large (over 1GB). The first row of the file correctly lists the column headings, but after that each field is on a new line (unless it is blank) and some fields are multi-line. The multi-line fields are not surrounded by quotes, but need to be surrounded by quotes in the output. The number of columns is static and known. The pattern in the sample input provided is repeated throughout the length of the file.
Input file (sample):
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname
,my_username
,10.0.0.1
192.168.1.1
,2015-02-11 13:41:54 -0600
,,true
,false
my_2nd_hostname
,my_2nd_username
,10.0.0.2
192.168.1.2
,2015-02-11 14:04:41 -0600
,true
,,false
Desired output:
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname,my_username,"10.0.0.1 192.168.1.1",2015-02-11 13:41:54 -0600,,true,false
my_2nd_hostname,my_2nd_username,"10.0.0.2 192.168.1.2",2015-02-11 14:04:41 -0600,true,,false
I've gone down a couple of paths that each address one of the issues, only to realize that they don't handle another aspect of the malformed data. I would appreciate it if anyone could help me identify an efficient way to clean up this file.
Thanks
EDIT
I have several code scraps from going down different paths, but here is the current iteration. It isn't pretty, just a bunch of hacks to try and figure this out.
import csv

inputfile = open('input.csv', 'r')
outputfile_1 = open('output.csv', 'w')
counter = 1
for line in inputfile:
    # Skip header row
    if counter == 1:
        outputfile_1.write(line)
        counter = counter + 1
    else:
        line = line.replace('\r', '').replace('\n', '')
        outputfile_1.write(line)
inputfile.close()
outputfile_1.close()
with open('output.csv', 'r') as f:
    text = f.read()

comma_count = text.count(',')  # comma_count/6 = total number of rows
# need to insert a newline after the field contents after every 6th comma
# unfortunately the last field of the row and the first field of the next row
# are now rammed up together because of the newline replaces above...
# then process as normal CSV

# one path I started to go down... but this isn't even functional
groups = text.split(',')
counter2 = 1
while (counter2 <= comma_count/6):
    line = ','.join(groups[:(6*counter2)]), ','.join(groups[(6*counter2):])
    print line
    counter2 = counter2 + 1
EDIT 2
Thanks to @DSM and @Ryan Vincent for getting me on the right track. Using their ideas I made the following code, which seems to correct my malformed CSV. I'm sure there are many places for improvement though, which I would happily accept.
import csv
import re

outputfile_1 = open('output.csv', 'wb')
wr = csv.writer(outputfile_1, quoting=csv.QUOTE_ALL)

with open('input.csv', 'r') as f:
    text = f.read()

comma_indices = [m.start() for m in re.finditer(',', text)]  # Find all the commas - the fields are between them
cursor = 0
field_counter = 1
row_count = 0
csv_row = []
for index in comma_indices:
    newrowflag = False
    if "\r" in text[cursor:index]:
        # This chunk has two fields, the last of one row and the first of the next
        next_field = text[cursor:index].split('\r')
        next_field_trimmed = next_field[0].replace('\n', ' ').rstrip().lstrip()
        csv_row.extend([next_field_trimmed])  # Add the last field of this row
        # Reset the cursor to be in the middle of the chunk (after the last field and before the next)
        # And set a flag that we need to start the next csv_row before we move on to the next comma index
        cursor = cursor + text[cursor:index].index('\r') + 1
        newrowflag = True
    else:
        next_field_trimmed = text[cursor:index].replace('\n', ' ').rstrip().lstrip()
        csv_row.extend([next_field_trimmed])
    # Advance the cursor to the character after the comma to start the next field
    cursor = index + 1
    # If we've done 7 fields then we've finished the row
    if field_counter % 7 == 0:
        row_count = row_count + 1
        wr.writerow(csv_row)
        # Reset
        csv_row = []
        # If the last chunk had 2 fields in it...
        if newrowflag:
            next_field_trimmed = next_field[1].replace('\n', ' ').rstrip().lstrip()
            csv_row.extend([next_field_trimmed])
            field_counter = field_counter + 1
    field_counter = field_counter + 1

# Write the last row
wr.writerow(csv_row)
outputfile_1.close()

# Process output.csv as normal CSV file...
This is a comment about how I would tackle this.
For each line:
I can easily identify the start and end of certain groups:
Hostname - there is only one
usernames - read these until you meet something that does not look like a username (comma delimited)
ip address - read these until you meet a timestamp - identified with a pattern match - be aware these are separated by space rather than comma. The end of the group is identified by the trailing comma.
timestamp - easy to identify with a pattern match
test1, test2, test3 - certain to be there as comma delimited fields
Notes: I would use the 'pattern' matches to check that I have the correct thing in the correct place. It enables spotting errors sooner.
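A rough sketch of that pattern-match idea (the helper names and regexes here are my own illustrative assumptions based on the sample data, not tested against the full file):

import re

TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [-+]\d{4}$')
IP = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}$')

def is_timestamp(field):
    return bool(TIMESTAMP.match(field.strip()))

def is_ip_group(field):
    # IP addresses within one field are separated by spaces, not commas
    parts = field.strip().split()
    return bool(parts) and all(IP.match(p) for p in parts)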
From your data excerpt it seems like any line that starts with a comma needs to be joined to the preceding line, and any line starting with anything other than a comma marks a new row.
If that's the case, then you could use something like the following code to clean up the CSV file so that the standard library csv parser can handle it.
#!/usr/bin/python
raw_data = 'somefilename.raw'
csv_data = 'somefilename.csv'
with open(raw_data, 'Ur') as inp, open(csv_data, 'wb') as out:
    row = list()
    for line in inp:
        line = line.rstrip('\n')  # rstrip() returns a new string, so assign it
        if line.startswith(','):
            row.append(line)
        else:
            out.write(''.join(row) + '\n')
            row = list()
            row.append(line)
    # Don't forget to write the last row!
    out.write(''.join(row) + '\n')
This is a miniature state machine ... accumulating lines into each row until we find a line that doesn't start with a comma, writing the previous row and so on.
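One follow-up the question still needs is quoting the multi-value IP field. A sketch of that step, reusing csv.QUOTE_ALL as the OP's EDIT 2 does (note it quotes every field, not just the multi-line ones):

import csv

with open(csv_data, 'rb') as inp, open('quoted.csv', 'wb') as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for parsed_row in csv.reader(inp):
        writer.writerow(parsed_row)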
