I have a similar problem to this one: find position of a substring in a string
The difference is that I don't know what my "mystr" is. I know my substring, but the string in the input file could be a random number of words in any order; I only know that one of those words contains the substring cola.
For example, a csv file: fanta,coca_cola,sprite, in any order.
If my substring is "cola", how can I write code that says
mystr.find('cola')
or
match = re.search(r"[^a-zA-Z](cola)[^a-zA-Z]", mystr)
or
if "cola" in mystr
when I don't know what my "mystr" is?
This is my code:
import csv

with open('first.csv', 'rb') as fp_in, open('second.csv', 'wb') as fp_out:
    reader = csv.DictReader(fp_in)
    rows = [row for row in reader]
    writer = csv.writer(fp_out, delimiter=',')
    writer.writerow(["new_cola"])

    def headers1(name):
        if "cola" in name:
            return row.get("cola")

    for row in rows:
        writer.writerow([headers1("cola")])
and the first.csv:
fanta,cocacola,banana
0,1,0
1,2,1
so it prints out
new_cola
""
""
when it should print out
new_cola
1
2
Here is a working example:
import csv

with open("first.csv", "rb") as fp_in, open("second.csv", "wb") as fp_out:
    reader = csv.DictReader(fp_in)
    writer = csv.writer(fp_out, delimiter=",")
    writer.writerow(["new_cola"])

    def filter_cola(row):
        for k, v in row.iteritems():
            if "cola" in k:
                yield v

    for row in reader:
        writer.writerow(list(filter_cola(row)))
Notes:
- rows = [row for row in reader] is unnecessary and inefficient (it converts a generator to a list, which consumes a lot of memory for large files)
- instead of return row.get("cola") you meant return row.get(name)
- in the statement return row.get("cola") you access the variable row outside of the current scope
- you can also use the unix tool cut, for example:
cut -d "," -f 2 < first.csv > second.csv
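For reference, the same approach would look roughly like this in Python 3 (a sketch: Python 3's csv module expects text-mode files opened with newline='', and iteritems() becomes items()):

import csv

# Python 3 sketch: text mode with newline='' instead of 'rb'/'wb'
with open("first.csv", newline="") as fp_in, \
     open("second.csv", "w", newline="") as fp_out:
    reader = csv.DictReader(fp_in)
    writer = csv.writer(fp_out)
    writer.writerow(["new_cola"])
    for row in reader:
        # emit the value of every column whose header contains "cola"
        writer.writerow([v for k, v in row.items() if "cola" in k])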
I created a program to create a csv file containing every number from 0 to 1000000:
import csv

nums = list(range(0, 1000000))
with open('codes.csv', 'w') as f:
    writer = csv.writer(f)
    for val in nums:
        writer.writerow([val])
then another program to remove a number, taken as input, from the file:
import csv
import os

while True:
    members = input("Please enter a number to be deleted: ")
    lines = list()
    with open('codes.csv', 'r') as readFile:
        reader = csv.reader(readFile)
        for row in reader:
            if all(field != members for field in row):
                lines.append(row)
            else:
                print('Removed')
    os.remove('codes.csv')
    with open('codes.csv', 'w') as writeFile:
        writer = csv.writer(writeFile)
        writer.writerows(lines)
The above code works fine on every device except my pc: the first program creates the csv file with empty rows between every number, and in the second program the number of empty rows multiplies and the file size grows accordingly.
What is wrong with my device, then?
Thanks in advance
I think you shouldn't use a csv file for single-column data; use a json file instead.
Also, the code you've written to check which values not to remove is unnecessary. Instead, you can write a list of numbers to the file, read it back into a variable, remove the number you want with the list.remove() method, and then write the list back to the file.
Here's how I would've done it:
import json

# Write the numbers to the file
with open("codes.json", "w") as f:
    f.write(json.dumps(list(range(0, 1000000))))

# Read the list in the file back into nums
nums = None
with open("codes.json", "r") as f:
    nums = json.load(f)

to_remove = int(input("Number to remove: "))
nums.remove(to_remove)  # Removes the number you want

# Dump the list back to the file
with open("codes.json", "w") as f:
    f.write(json.dumps(nums))
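One caveat: list.remove() raises a ValueError if the number isn't in the list, so you may want to guard the call, for example:

to_remove = int(input("Number to remove: "))
if to_remove in nums:
    nums.remove(to_remove)  # removes the first occurrence
else:
    print("Number not found")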
Seems like you have different Python versions.
There is a difference between the built-in Python 2 open() and Python 3 open(). Python 3 defaults to universal newlines mode, while Python 2 newline handling depends on the mode argument of the open() function.
The csv module docs provide several examples where open() is called with the newline argument explicitly set to the empty string, newline='':
import csv

with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
Try to do the same. Without the explicit newline='', your writerow calls probably add an extra newline character each.
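Applied to your first program, that would look roughly like this (Python 3 assumed):

import csv

# newline='' prevents the extra blank rows on Windows
with open('codes.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for val in range(1000000):
        writer.writerow([val])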
CSV stands for Comma-Separated Values; in your file the records come out separated by blank lines.
To remove the empty lines, add newline="" when opening the file for writing.
Since this format is tabular data, you cannot simply delete an element, or the table will shift out of alignment. You need to insert an empty string or "NaN" in place of the deleted element.
I reduced the number of entries and arranged them as a table for clarity.
import csv

def write_csv(file, seq):
    with open(file, 'w', newline='') as f:
        writer = csv.writer(f)
        for val in seq:
            writer.writerow([v for v in val])

nums = ((j*10 + i for i in range(0, 10)) for j in range(0, 10))
write_csv('codes.csv', nums)

nums_new = []
members = input("Please enter a number, from 0 to 100, to be deleted: ")
with open('codes.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        rows_new = []
        for elem in row:
            if elem == members:
                elem = ""
            rows_new.append(elem)
        nums_new.append(rows_new)
write_csv('codesdel.csv', nums_new)
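For illustration: if you enter 13, the second row of codesdel.csv comes out with an empty cell in place of the deleted element:

10,11,12,,14,15,16,17,18,19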
This is my code. I am able to print each line, but when a blank line appears it prints ';' because of the CSV file format, so I want to skip blank lines:
import csv
import time

ifile = open(r"C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv", "rb")
empty_lines = 0
for line in csv.reader(ifile):
    if not line:
        empty_lines += 1
        continue
    print line
If you want to skip all whitespace lines, you should use the str.isspace() test.
Since you may want to do something more complicated than just printing the non-blank lines to the console (no need to use the csv module for that), here is an example that involves a DictReader:
#!/usr/bin/env python
# Tested with Python 2.7

# I prefer this style of importing - it hides the csv module
# in case you do from this_file.py import * inside of __init__.py
import csv as _csv

# Real comments are more complicated ...
def is_comment(line):
    return line.startswith('#')

# Kind of silly wrapper
def is_whitespace(line):
    return line.isspace()

def iter_filtered(in_file, *filters):
    for line in in_file:
        if not any(fltr(line) for fltr in filters):
            yield line

# A disadvantage of this approach is that it requires storing rows in RAM.
# However, the largest CSV files I worked with were all under 100 MB.
def read_and_filter_csv(csv_path, *filters):
    with open(csv_path, 'rb') as fin:
        iter_clean_lines = iter_filtered(fin, *filters)
        reader = _csv.DictReader(iter_clean_lines, delimiter=';')
        return [row for row in reader]

# Stores all processed lines in RAM
def main_v1(csv_path):
    for row in read_and_filter_csv(csv_path, is_comment, is_whitespace):
        print(row)  # Or do something else with it

# Simpler, less refactored version, does not use with
def main_v2(csv_path):
    try:
        fin = open(csv_path, 'rb')
        reader = _csv.DictReader((line for line in fin if not
                                  line.startswith('#') and not line.isspace()),
                                 delimiter=';')
        for row in reader:
            print(row)  # Or do something else with it
    finally:
        fin.close()

if __name__ == '__main__':
    csv_path = r"C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv"
    main_v1(csv_path)
    print('\n' * 3)
    main_v2(csv_path)
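A hypothetical input file this script would handle: ';'-delimited, with a comment line and a blank line that both get filtered out before the DictReader sees them:

# this comment line is skipped
name;value
a;1

b;2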
Instead of
if not line:
This should work:
if not ''.join(line).strip():
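To see how this test behaves on a few sample rows:

print(not ''.join([]).strip())          # True:  empty row, skipped
print(not ''.join(['', ' ']).strip())   # True:  whitespace-only row, skipped
print(not ''.join(['a', '']).strip())   # False: row with data, kept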
My suggestion would be to just use the csv reader, which can split the file into rows. That way you can simply check whether the row is empty, and if so, continue.
import csv

with open('some.csv', 'r') as csvfile:
    # the delimiter depends on how your CSV separates values
    csvReader = csv.reader(csvfile, delimiter='\t')
    for row in csvReader:
        # check if row is empty
        if not row:
            continue
You can always check the number of comma-separated values. It seems to be much more productive and efficient.
When reading the lines iteratively, since these are lists of comma-separated values, you get a list object, so if there is no element (a blank line) we can skip it.
import csv

with open(filename) as csv_file:  # 'filename' is your csv file path
    csv_reader = csv.reader(csv_file, delimiter=",")
    for row in csv_reader:
        if len(row) == 0:
            continue
You can strip leading and trailing whitespace, and if the length is zero after that, the line is empty. The example below uses the simpler any(row) test, which keeps a row as soon as any cell is non-empty:
import csv

with open('userlist.csv') as f:
    reader = csv.reader(f)
    user_header = next(reader)     # Add this line if there is a header
    user_list = []                 # Create a new user list for input
    for row in reader:
        if any(row):               # Pick up only the non-blank rows
            print(row)             # Just for verification
            user_list.append(row)  # Collect the rest of the data into the list
This example just prints the data in array form while skipping the empty lines:
import csv

file = open("data.csv", "r")
data = csv.reader(file)
for line in data:
    if line:
        print line
file.close()
I find it much clearer than the other provided examples.
import csv

ifile = csv.reader(open(r'C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv', 'rb'), delimiter=';')
for line in ifile:
    if set(line).pop() == '':
        pass
    else:
        for cell_value in line:
            print cell_value
I've a large csv file (comma delimited). I would like to replace a few random cells that contain the value "NIL" with an empty string "".
I tried this to find the keyword "NIL" and replace it with an empty string, but it's giving me an empty csv file:
ifile = open('outfile', 'rb')
reader = csv.reader(ifile, delimiter='\t')
ofile = open('pp', 'wb')
writer = csv.writer(ofile, delimiter='\t')

findlist = ['NIL']
replacelist = [' ']

s = ifile.read()
for item, replacement in zip(findlist, replacelist):
    s = s.replace(item, replacement)
ofile.write(s)
From seeing your code, I feel you should directly read the file:
with open("test.csv") as opened_file:
data = opened_file.read()
then use a regex to change every NIL to "" or " ", and save the data back to the file:
import re

data = re.sub("NIL", " ", data)  # this replaces NIL with " " in the data string
NOTE: you can give any regex instead of NIL; for more info, see the re module docs.
EDIT 1: re.sub returns a new string, so you need to assign the result back to data.
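Putting the steps together, a minimal sketch (assuming the result should be written back to the same file):

import re

with open("test.csv") as opened_file:
    data = opened_file.read()

data = re.sub("NIL", " ", data)  # re.sub returns a new string

with open("test.csv", "w") as opened_file:
    opened_file.write(data)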
A few tweaks and your example works. I edited your question to get rid of some indenting errors, assuming those were a cut/paste problem. The next problem is that you don't import csv... but even though you create a reader and writer, you don't actually use them, so they can simply be removed. So, opening in text instead of binary mode, we have:
ifile = open('outfile')  # 'outfile' is the input file...
ofile = open('pp', 'w')
findlist = ['NIL']
replacelist = [' ']
s = ifile.read()
for item, replacement in zip(findlist, replacelist):
    s = s.replace(item, replacement)
ofile.write(s)
We could add 'with' clauses and use a dict to make the replacements clearer:
replace_this = {'NIL': ' '}

with open('outfile') as ifile, open('pp', 'w') as ofile:
    s = ifile.read()
    for item, replacement in replace_this.items():
        s = s.replace(item, replacement)
    ofile.write(s)
The only real problem now is that it also changes things like "NILIST" to " IST". If this is a csv with all numbers except for "NIL", that's not a problem. But you could also use the csv module to change only cells that are exactly "NIL".
with open('outfile') as ifile, open('pp', 'w') as ofile:
    reader = csv.reader(ifile)
    writer = csv.writer(ofile)
    for row in reader:
        # row is a list of columns. The following builds a new list
        # while checking and changing any column that is 'NIL'.
        writer.writerow([c if c.strip() != 'NIL' else ' '
                         for c in row])
I'm working on a script to remove bad characters from a csv file and then store the result in a list.
The script runs fine but doesn't remove the bad characters, so I'm a bit puzzled. Any pointers or help on why it's not working are appreciated.
import csv

def remove_bad(item):
    item = item.replace("%", "")
    item = item.replace("test", "")
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()),
                    row[1].strip().title()))
print raw
If I have a csv file with one line:
tst%,testT
then your script, slightly modified, should indeed filter out the "bad" characters. I changed it to pass both items separately to remove_bad (because you mentioned you had to "remove bad characters from a csv", not only from the first column):
import csv

def remove_bad(item):
    item = item.replace("%", "")
    item = item.replace("test", "")
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()), remove_bad(row[1].strip()).title()))
print raw
Also, I put title() after the function call (otherwise, "test" wouldn't get filtered out).
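To see why the order matters:

>>> "testT".title()              # title() first turns "test" into "Testt"...
'Testt'
>>> "Testt".replace("test", "")  # ...which remove_bad then fails to match
'Testt'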
Output (the rows get stored in a list of tuples, as in your example):
[('tst', 'T')]
Feel free to ask questions
import re
import csv

p = re.compile('(test|%|anyotherchars)')  # insert bad chars instead of anyotherchars

def remove_bad(item):
    item = p.sub('', item)
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()),
                    row[1].strip().title()  # do you really need strip() without args?
                    ))  # here you create a tuple which you append to the list
print raw
I have two csv files, each of which contains ngrams that look like this:
drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8
It's a three-word phrase followed by a frequency number, followed by a relative frequency number.
I want to write a script that finds the ngrams that are in both csv files, divides their relative frequencies, and prints them to a new csv file. I want it to find a match whenever the three-word phrase matches a three-word phrase in the other file, and then divide the relative frequency of the phrase in the first csv file by the relative frequency of that same phrase in the second csv file. Then I want to print the phrase and the quotient of the two relative frequencies to a new csv file.
Below is as far as I've gotten. My script compares lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that this is because I'm finding the intersection between the two entire sets, but I have no idea how to do this differently. Please forgive me; I'm new to coding. Any help you can give me to get a little closer would be a big help.
import csv
import io

alist, blist = [], []
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))
matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)
print matches
print len(matches)
Without dumping res into a new file (tedious): the idea is that the first element is the phrase and the other two are the frequencies, and a dict instead of a set does the matching and mapping together.
import csv
import io

alist, blist = [], []
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

f_dict = {e[0]: e[1:] for e in alist}
s_dict = {e[0]: e[1:] for e in blist}
res = {}
for k, v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1]) / float(s_dict[k][1])
print(res)
You could store the relative frequencies from the 1st file into a dictionary, then iterate over the 2nd file and if the 1st column matches anything seen in the original file, write out the result directly to the output file:
import csv

tmp = {}

# if one file is much larger than the other, load the smaller one here;
# make sure it will fit into memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract a fixed number of columns from each row
    for txt, abs, rel in csv.reader(fr):
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)
    # the 2nd input file is processed one line at a time to save memory;
    # the order of items from this file is preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, abs, rel in csv.reader(fr):
            if txt in tmp:
                # not sure what you want to do with the absolute frequency, I use 0 here:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))
My script is comparing lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that that is because I'm finding the intersection between two entire sets but I have no idea how to do this differently.
This is exactly what dictionaries are used for: when you have a separate key and value (or when only part of the value is the key). So:
a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}
Now, you can't directly use set methods on dictionaries. Python 3 offers some help here, but you're using 2.7. So, you have to write it explicitly:
matches = {key for key in a_dict if key in b_dict}
Or:
matches = set(a_dict) & set(b_dict)
But you really don't need the set; all you want to do here is iterate over them. So:
for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])
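For your task, do_stuff_with would be the relative-frequency division; a sketch (the name is just the placeholder used above):

def do_stuff_with(a_rel, b_rel):
    # divide the two relative frequencies, as described in the question
    print(float(a_rel) / float(b_rel))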
As a side note, you really don't need to build up the lists in the first place just to turn them into sets, or dicts. Just build up the sets or dicts:
a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row
Also, if you know about comprehensions, all three versions are crying out to be converted:
with open("ngrams.csv", "rb") as fileA:
reader = csv.reader(fileA, delimiter=',')
# Now any of these
a_list = list(reader)
a_set = {tuple(row) for row in reader}
a_dict = {row[0]: row for row in reader}
Avoid saving small numbers as they are; they run into underflow problems (see What are arithmetic underflow and overflow in C?), and dividing one small number by another compounds the problem. So preprocess your relative frequencies like this:
>>> import math
>>> num = 1.435486010883783160220299732E-8
>>> logged = math.log(num)
>>> logged
-18.0591772685384
>>> math.exp(logged)
1.4354860108837844e-08
Now to reading the csv. Since you're only manipulating the relative frequencies, your second column doesn't matter, so let's skip it and save the first column (i.e. the phrases) as keys and the third column (i.e. the relative freq) as values:
import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))
Now for the tricky part: you want to divide the relative frequency of ngramdict2's phrases by that of ngramdict1's phrases, i.e.:
if phrase_from_ngramdict1 == phrase_from_ngramdict2:
    relfreq = relfreq_from_ngramdict2 / relfreq_from_ngramdict1
Since we kept the relative frequencies in logarithmic units, we don't have to divide; we simply subtract, i.e.
if phrase_from_ngramdict1 == phrase_from_ngramdict2:
    logrelfreq = logrelfreq_from_ngramdict2 - logrelfreq_from_ngramdict1
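A quick sanity check that subtracting logs matches dividing the raw values (numbers taken from the sample data above):

import math

a = 4.306458032651349480660899195E-8   # rel. freq. of "and since that" in file 1
b = 2.1532290163256747e-08             # rel. freq. of "and since that" in file 2

print(b / a)                                 # direct division: ~0.5
print(math.exp(math.log(b) - math.log(a)))   # log subtraction gives the same ratio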
And to get the phrases that occur in both, you won't need to check the phrases one by one; simply cast the dictionary.keys() into a set and then do set1.intersection(set2), see https://docs.python.org/2/tutorial/datastructures.html
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)
print overlap_phrases
[out]:
set(['drinks while strutting', 'the state face', 'and since that'])
So now let's print it out with the relative frequencies:
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)]) + '\n')
The ngramcombined.csv looks like this:
drinks while strutting,-0.69314718056
the state face,-1.09861228867
and since that,-0.69314718056
Here's the full code:
import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))

# Find the intersecting phrases.
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)]) + '\n')
If you like SUPER UNREADABLE but short code (in number of lines):
import csv, math

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {row[0]: math.log(float(row[2])) for row in csv.reader(open(ngramfile1, 'r'), delimiter=',')}
ngramdict2 = {row[0]: math.log(float(row[2])) for row in csv.reader(open(ngramfile2, 'r'), delimiter=',')}

# Find the intersecting phrases.
overlap_phrases = set(ngramdict1.keys()).intersection(set(ngramdict2.keys()))

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        fout.write(",".join([p, str(ngramdict2[p] - ngramdict1[p])]) + '\n')