I'm working on a script that removes bad characters from a CSV file and stores the cleaned values in a list.
The script runs fine but doesn't remove the bad characters, so I'm a bit puzzled. Any pointers on why it isn't working would be appreciated.
def remove_bad(item):
    item = item.replace("%", "")
    item = item.replace("test", "")
    return item
raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()),
                    row[1].strip().title()))
print raw
If I have a csv-file with one line:
tst%,testT
Then your script, slightly modified, does filter out the "bad" characters. I changed it to pass both items to remove_bad separately (because you mentioned you had to "remove bad characters from a csv", not only from the first column):
import csv
def remove_bad(item):
    item = item.replace("%","")
    item = item.replace("test","")
    return item
raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()), remove_bad(row[1].strip()).title()))
print raw
Also, I call title() after remove_bad. Otherwise "testT" would first become "Testt", and the case-sensitive replace("test", "") would no longer match it.
Output (the rows will get stored in a list as tuples, as in your example):
[('tst', 'T')]
Feel free to ask questions
import re
import csv
# Replace "anyotherchars" with whatever other bad substrings you need to strip.
p = re.compile('(test|%|anyotherchars)')
def remove_bad(item):
    return p.sub('', item)
raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        # Build a tuple and append it to the list.
        raw.append((remove_bad(row[0].strip()),
                    row[1].strip().title()))  # do you really need strip() with no arguments?
print raw
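As a rough sketch of the regex approach above, the pattern can be built from a list of bad substrings so that characters with special meaning in a regex are escaped automatically. This version assumes Python 3 and reads the sample line from an in-memory buffer instead of test.csv; BAD_SUBSTRINGS and clean_rows are names invented for the illustration.

```python
import csv
import io
import re

# Hypothetical list of substrings to strip; re.escape guards regex metacharacters.
BAD_SUBSTRINGS = ["test", "%"]
BAD_PATTERN = re.compile("|".join(re.escape(s) for s in BAD_SUBSTRINGS))


def remove_bad(item):
    """Return item with every bad substring removed."""
    return BAD_PATTERN.sub("", item)


def clean_rows(lines):
    """Parse CSV lines and clean the first two columns of each row."""
    cleaned = []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or short rows
        cleaned.append((remove_bad(row[0].strip()),
                        remove_bad(row[1].strip()).title()))
    return cleaned


sample = io.StringIO("tst%,testT\n")
print(clean_rows(sample))  # [('tst', 'T')]
```

Adding another bad substring then only means appending to the list, without touching the pattern by hand.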
Related
I have a sample file called 'scores.txt' which holds the following values:
10,0,6,3,7,4
I want to be able to somehow take each value from the line, and append it to a list so that it becomes sampleList = [10,0,6,3,7,4].
I have tried doing this using the following code below,
score_list = []
opener = open('scores.txt','r')
for i in opener:
    score_list.append(i)
print (score_list)
which partially works, but for some reason, it doesn't do it properly. It just sticks all the values into one index instead of separate indexes. How can I make it so all the values get put into their own separate index?
You have CSV data (comma separated). Easiest is to use the csv module:
import csv
all_values = []
with open('scores.txt', newline='') as infile:
    reader = csv.reader(infile)
    for row in reader:
        all_values.extend(row)
Otherwise, split the values. Each line you read is a string with the ',' character between the digits:
all_values = []
with open('scores.txt', newline='') as infile:
    for line in infile:
        all_values.extend(line.strip().split(','))
Either way, all_values ends up as a list of strings. If every value consists only of digits, you can convert them to integers as you go:
all_values.extend(map(int, row))
or
all_values.extend(map(int, line.strip().split(',')))
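Putting the csv variant and the integer conversion together might look like the sketch below. It assumes Python 3 and uses an in-memory buffer holding the sample line in place of scores.txt; read_scores is a name chosen for the example.

```python
import csv
import io


def read_scores(lines):
    """Flatten every CSV row into one list of ints."""
    scores = []
    for row in csv.reader(lines):
        scores.extend(int(value) for value in row)
    return scores


sample = io.StringIO("10,0,6,3,7,4\n")
print(read_scores(sample))  # [10, 0, 6, 3, 7, 4]
```

With a real file you would pass the object returned by open('scores.txt', newline='') instead of the buffer.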
Here is a compact way to do it without any imports (it reads only the first line of the file):
with open('tmp.txt','r') as f:
    score_list = f.readline().rstrip().split(",")
# Convert to list of int
score_list = [int(v) for v in score_list]
print(score_list)
Just split each line on commas and extend score_list with the resulting list, like below:
opener = open('scores.txt','r')
score_list = []
for line in opener:
    score_list.extend(map(int,line.rstrip().split(',')))
print( score_list )
I want to create a csv from an existing csv, by splitting its rows.
Input csv:
A,R,T,11,12,13,14,15,21,22,23,24,25
Output csv:
A,R,T,11,12,13,14,15
A,R,T,21,22,23,24,25
So far my code looks like:
def update_csv(name):
    #load csv file
    file_ = open(name, 'rb')
    #init first values
    current_a = ""
    current_r = ""
    current_first_time = ""
    file_content = csv.reader(file_)
    #LOOP
    for row in file_content:
        current_a = row[0]
        current_r = row[1]
        current_first_time = row[2]
        i = 2
        #Write row to new csv
        with open("updated_"+name, 'wb') as f:
            writer = csv.writer(f)
            writer.writerow((current_a,
                             current_r,
                             current_first_time,
                             ",".join((row[x] for x in range(i+1,i+5)))
                             ))
        #do only one row, for debug purposes
        return
But the row contains double quotes that I can't get rid of:
A002,R051,02-00-00,"05-21-11,00:00:00,REGULAR,003169391"
I've tried to use writer = csv.writer(f,quoting=csv.QUOTE_NONE) and got a _csv.Error: need to escape, but no escapechar set.
What is the correct approach to delete those quotes?
I think you could simplify the logic to split each row into two using something along these lines:
def update_csv(name):
    with open(name, 'rb') as file_:
        with open("updated_"+name, 'wb') as f:
            writer = csv.writer(f)
            # read one row from input csv
            for row in csv.reader(file_):
                # write 2 rows to new csv
                writer.writerow(row[:8])
                writer.writerow(row[:3] + row[8:])
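If a row can hold more than two groups of values, the same slicing idea generalizes. The following is only a sketch under assumptions: Python 3, in-memory buffers instead of files, and three key columns followed by groups of five values as in the sample input. split_row, key_count and chunk_size are made-up names.

```python
import csv
import io


def split_row(row, key_count=3, chunk_size=5):
    """Yield one output row per chunk of values, each prefixed by the key columns."""
    keys, values = row[:key_count], row[key_count:]
    for start in range(0, len(values), chunk_size):
        yield keys + values[start:start + chunk_size]


source = io.StringIO("A,R,T,11,12,13,14,15,21,22,23,24,25\n")
target = io.StringIO()
writer = csv.writer(target, lineterminator="\n")
for row in csv.reader(source):
    # Each item is written as its own field, so no quotes appear.
    writer.writerows(split_row(row))
print(target.getvalue())
```

For the sample input this produces the two rows A,R,T,11,12,13,14,15 and A,R,T,21,22,23,24,25.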
writer.writerow expects an iterable and writes each item in it as one field, separated by the appropriate delimiter. So:
writer.writerow([1, 2, 3])
would write "1,2,3\n" to the file.
Your call provides an iterable in which one item is a string that already contains the delimiter. The writer therefore needs a way either to escape the delimiter or to quote that item. For example,
writer.writerow([1, '2,3'])
doesn't just give "1,2,3\n", but '1,"2,3"\n': the string counts as one item in the output.
Therefore, if you don't want quotes in the output, you need to provide an escape character (e.g. '/') to mark the delimiters that shouldn't be counted as such (giving something like "1,2/,3\n").
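The three cases described above can be reproduced with a small helper. This is a sketch assuming Python 3; render is a hypothetical function that writes one row into an in-memory buffer, and lineterminator is set to "\n" only to keep the output easy to read.

```python
import csv
import io


def render(row, **writer_options):
    """Write one row with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n", **writer_options).writerow(row)
    return buffer.getvalue()


print(repr(render([1, 2, 3])))   # '1,2,3\n'
print(repr(render([1, "2,3"])))  # '1,"2,3"\n'
# QUOTE_NONE only works when an escape character is supplied.
print(repr(render([1, "2,3"], quoting=csv.QUOTE_NONE, escapechar="/")))  # '1,2/,3\n'
```

Calling render([1, "2,3"], quoting=csv.QUOTE_NONE) without an escapechar raises csv.Error, which is the error reported in the question.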
However, I think what you actually want is to include all of those elements as separate items. Don't ",".join(...) them yourself; try:
writer.writerow([current_a, current_r, current_first_time] + row[i+1:i+5])
to provide the relevant items from row (the same indexes as range(i+1, i+5)) as separate items in the row being written.
I have a similar problem to this question: find position of a substring in a string
The difference is that I don't know what my "mystr" is. I know my substring, but the string in the input file could be any number of words in any order; I only know that one of those words contains the substring cola.
For example a csv file: fanta,coca_cola,sprite in any order.
If my substring is "cola", then how can I write code that does
mystr.find('cola')
or
match = re.search(r"[^a-zA-Z](cola)[^a-zA-Z]", mystr)
or
if "cola" in mystr
When I don't know what my "mystr" is?
this is my code:
import csv
with open('first.csv', 'rb') as fp_in, open('second.csv', 'wb') as fp_out:
    reader = csv.DictReader(fp_in)
    rows = [row for row in reader]
    writer = csv.writer(fp_out, delimiter = ',')
    writer.writerow(["new_cola"])
    def headers1(name):
        if "cola" in name:
            return row.get("cola")
    for row in rows:
        writer.writerow([headers1("cola")])
and the first.csv:
fanta,cocacola,banana
0,1,0
1,2,1
so it prints out
new_cola
""
""
when it should print out
new_cola
1
2
Here is a working example:
import csv
with open("first.csv", "rb") as fp_in, open("second.csv", "wb") as fp_out:
    reader = csv.DictReader(fp_in)
    writer = csv.writer(fp_out, delimiter = ",")
    writer.writerow(["new_cola"])
    def filter_cola(row):
        for k,v in row.iteritems():
            if "cola" in k:
                yield v
    for row in reader:
        writer.writerow(list(filter_cola(row)))
Notes:
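rows = [row for row in reader] is unnecessary and inefficient: it copies every row from the reader into a list, which consumes a lot of memory for huge files.
row.get("cola") looks up a column named exactly "cola", which doesn't exist in first.csv (the column is called cocacola), so it returns None and an empty field is written. You need to find the keys that contain "cola" instead.
Inside headers1, row is neither a parameter nor a local variable; the function relies on the loop variable from the enclosing scope.
You can also use the Unix tool cut (note that this copies the original cocacola header instead of writing new_cola). For example: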
cut -d "," -f 2 < first.csv > second.csv
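Since the header row is the same for every data row, the matching column names can also be worked out once, before the loop. The sketch below assumes Python 3 and uses in-memory buffers with the contents of first.csv from the question; matching_columns is a name made up for the example.

```python
import csv
import io


def matching_columns(fieldnames, needle):
    """Return the header names that contain needle, in file order."""
    return [name for name in fieldnames if needle in name]


source = io.StringIO("fanta,cocacola,banana\n0,1,0\n1,2,1\n")
target = io.StringIO()

reader = csv.DictReader(source)
# Accessing fieldnames reads the header row from the input.
columns = matching_columns(reader.fieldnames, "cola")
writer = csv.writer(target, lineterminator="\n")
writer.writerow(["new_cola"])
for row in reader:
    writer.writerow([row[name] for name in columns])
print(target.getvalue())
```

Looking the names up from reader.fieldnames also keeps the output columns in the order they appear in the file.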
I'm writing a program that reads names and statistics related to those names from a file. Each line of the file is another person and their stats. For each person, I'd like to make their last name a key and everything else linked to that key in the dictionary. The program first stores data from the file in an array and then I'm trying to get those array elements into the dictionary, but I'm not sure how to do that. Plus I'm not sure if each time the for loop iterates, it will overwrite the previous contents of the dictionary. Here's the code I'm using to attempt this:
f = open("people.in", "r")
tmp = None
people
l = f.readline()
while l:
    tmp = l.split(',')
    print tmp
    people = {tmp[2] : tmp[0])
    l = f.readline()
people['Smith']
The error I'm currently getting is that the syntax is incorrect, however I have no idea how to transfer the array elements into the dictionary other than like this.
Use key assignment:
people = {}
for line in f:
    tmp = line.rstrip('\n').split(',')
    people[tmp[2]] = tmp[0]
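This loops over the file object directly, so there is no need for .readline() calls, and it removes the newline. Because the dictionary is created once before the loop and each iteration only assigns a single key, earlier entries are not overwritten unless two lines share the same key.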
You appear to have CSV data; you could also use the csv module here:
import csv
people = {}
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        people[row[2]] = row[0]
or even a dict comprehension:
import csv
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    people = {r[2]: r[0] for r in reader}
Here the csv module takes care of the splitting and removing newlines.
The syntax error stems from trying to close the opening { with a ) instead of }:
people = {tmp[2] : tmp[0]) # should be }
If you need to collect multiple entries per row[2] value, collect these in a list; a collections.defaultdict instance makes that easier:
import csv
from collections import defaultdict
people = defaultdict(list)
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        people[row[2]].append(row[0])
In response to Generalkidd's comment above about multiple people with the same last name, here is an addition to Martijn Pieters' solution, posted as an answer for better formatting:
import csv
people = {}
with open("people.in", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        if row[2] not in people:
            people[row[2]] = []
        people[row[2]].append(row[0])
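To see the difference between plain key assignment and collecting values in lists, the two approaches can be run side by side. This is a sketch assuming Python 3; the sample rows (first name, a statistic, last name) are invented, since the actual layout of people.in is not shown, and last_wins and keep_all are hypothetical names.

```python
import csv
import io
from collections import defaultdict


def last_wins(rows, key_index=2, value_index=0):
    """Plain key assignment: a repeated key overwrites the earlier value."""
    result = {}
    for row in rows:
        result[row[key_index]] = row[value_index]
    return result


def keep_all(rows, key_index=2, value_index=0):
    """Grouping with defaultdict(list): every value for a key is kept."""
    result = defaultdict(list)
    for row in rows:
        result[row[key_index]].append(row[value_index])
    return dict(result)


# Made-up sample rows: first name, a statistic, last name.
sample = "John,12,Smith\nJane,7,Smith\nAda,9,Jones\n"
rows = list(csv.reader(io.StringIO(sample)))

print(last_wins(rows))  # {'Smith': 'Jane', 'Jones': 'Ada'}
print(keep_all(rows))   # {'Smith': ['John', 'Jane'], 'Jones': ['Ada']}
```

Which one fits depends on whether last names are unique in the data.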
I have this code to check indexes in a file to see if they match, but to start off I am having trouble being able to select an index. What do I have to do in order to be able to do so, because at this moment it doesn't show the values as being in a list.
def checkOS():
    fid = open("C:/Python/NSRLOS.txt", 'r')
    fhand = open("C:/Python/sha_sub_hashes.out", 'r')
    sLine = fhand.readline()
    line = fid.readline()
    outdata = []
    print line
checkOS()
Right now it prints:
"190","Windows 2000","2000","609"
I only want it to print: (so index[0])
190
When I try index[0], I just get ' " ', the first character of the whole string. I want a list so that I can select a value by index.
Try using line.split(",") to split the line by the commas, then strip out the quotation marks by slicing the result.
Example:
>>> line = '"190","Windows 2000","2000","609"'
>>> sliced = line.split(',')
>>> print sliced
['"190"', '"Windows 2000"', '"2000"', '"609"']
>>> first_item = sliced[0][1:-1]
>>> print first_item
190
...and here's the whole thing, abstracted into a function:
def get_item(line, index):
    return line.split(',')[index][1:-1]
(This assumes, of course, that all the items in the line are separated by commas, that they're all wrapped in quotation marks, and that there are no spaces after the commas (although you could handle that by calling item.strip() to remove whitespace). It also fails if a quoted item contains a comma, as noted in the comments.)
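The caveat about commas inside quoted items can be checked directly. In this sketch (assuming Python 3) get_item is rewritten on top of csv.reader, and the "SP4" line is invented purely to show where the split-based version breaks.

```python
import csv


def get_item(line, index):
    """Parse one CSV line with csv.reader and return the field at index."""
    return next(csv.reader([line]))[index]


line = '"190","Windows 2000","2000","609"'
print(get_item(line, 0))  # 190

# Made-up line with a comma inside a quoted field.
tricky = '"190","Windows 2000, SP4","2000","609"'
print(tricky.split(",")[1][1:-1])  # Windows 200   (split cuts the field in two)
print(get_item(tricky, 1))         # Windows 2000, SP4
```

csv.reader accepts any iterable of lines, so wrapping a single string in a list is enough to parse it.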
What if you use split() to split the line on each comma and take the first value? Try this.
[0] applied to a string only returns the first character.
You want the first item of a comma-separated list. You could write your own parsing code, or you could use the csv module which already handles this.
import csv
def get_first_row(fname):
    with open(fname, 'rb') as inf:
        incsv = csv.reader(inf)
        try:
            row = next(incsv)  # builtin next() instead of the Python 2-only .next() method
        except StopIteration:
            row = [None]  # the file is empty
    return row
def checkOS():
    fid = get_first_row("C:/Python/NSRLOS.txt")[0]
    fhand = get_first_row("C:/Python/sha_sub_hashes.out")[0]
    print fid
csv.reader would be a good start.
import csv
from itertools import izip
with open('file1.csv') as fid, open('file2.csv') as fhand:
    fidcsv = csv.reader(fid)
    fhandcsv = csv.reader(fhand)
    for row1, row2 in izip(fidcsv, fhandcsv):
        print row1, row2, row1[1] # etc...
Using csv.reader handles CSV-formatted files better than plain str methods. izip reads line 1 from both files, then line 2 from both, and so on, stopping when the shorter file runs out of rows (not sure if this is what you want, though). row1 and row2 are each a list of columns, so you can index them, e.g. if row1[0] == row2[0]:, or apply whatever logic you wish.
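A minimal sketch of that pairwise comparison, assuming Python 3 (where the builtin zip is lazy like izip) and two invented in-memory files; matching_first_columns is a hypothetical helper name.

```python
import csv
import io


def matching_first_columns(lines_a, lines_b):
    """Yield the first-column value of each row pair whose first columns match."""
    for row_a, row_b in zip(csv.reader(lines_a), csv.reader(lines_b)):
        if row_a and row_b and row_a[0] == row_b[0]:
            yield row_a[0]


# Made-up contents; the second file is shorter, so the third row of the first is never read.
file_a = io.StringIO('"190","Windows 2000"\n"191","Windows XP"\n"192","Linux"\n')
file_b = io.StringIO('"190","abc"\n"999","def"\n')
print(list(matching_first_columns(file_a, file_b)))  # ['190']
```

If the rows of the two files are not aligned line by line, a lookup structure such as a set of first-column values would be needed instead of zip.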