I'm working on a Python script which opens a DBF file and then creates a text output (either .txt or .csv) of its contents.
I've now managed to get it writing the output file, but I need to remove the space character from one of the database fields (it's a UK car registration number, e.g. I need "AB09 CDE" to output as "AB09CDE"). I've been unable to work out how to do this, as the data seems to be nested lists. The field is rec[7] in the code below.
if __name__ == '__main__':
    import sys, csv
    from cStringIO import StringIO
    from operator import itemgetter

    # Read a database
    filename = 'DATABASE.DBF'
    if len(sys.argv) == 2:
        filename = sys.argv[1]
    f = open(filename, 'rb')
    db = list(dbfreader(f))
    f.close()
    fieldnames, fieldspecs, records = db[0], db[1], db[2:]
    # Remove some fields that we don't want to use...
    del fieldnames[0:]
    del fieldspecs[0:]
    # Put the relevant data into the temporary table
    records = [rec[7:8] + rec[9:12] + rec[3:4] for rec in records]
    # Create output file
    output_file = 'OUTPUT.txt'
    f = open(output_file, 'wb')
    csv.writer(f).writerows(records)
This also adds a lot of spaces to the end of each outputted value. How would I get rid of these?
I'm fairly new to Python so any guidance would be gratefully received!
The problem is that you are using slicing:
>>> L = [1,2,3,4,5,6,7,8,9,10]
>>> L[7]
8
>>> L[7:8] #NOTE: it's a *list* of a single element!
[8]
To replace the spaces in rec[7], wrap the result in a list before concatenating:
records = [[rec[7].replace(' ', '')] + rec[9:12] + rec[3:4] for rec in records]
Note that rec[7].replace(' ', '') is a plain string, so the unbracketed version
records = [rec[7].replace(' ', '') + rec[9:12] + rec[3:4] for rec in records]
would raise a TypeError, because you can't add a string to a list.
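The trailing spaces mentioned in the question can be handled the same way with str.strip(). A minimal sketch on a made-up record (the field indices follow the question's code; the sample values are assumptions):

```python
# A fake record shaped like the question's rows (values are made up)
rec = ["x"] * 7 + ["AB09 CDE", "skip", "a1   ", "b2 ", "c3  ", "d"]

# replace() removes the space in the registration; strip() removes
# the padding that DBF character fields carry at the end
row = ([rec[7].replace(" ", "")]
       + [f.strip() for f in rec[9:12]]
       + [f.strip() for f in rec[3:4]])
```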
Python documentation wrote:
str.replace(old, new[, count]) Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
This example demonstrates it (note that replace returns a new string; the original is unchanged):
In [13]: a = "AB09 CDE"
In [14]: a.replace(" ", "")
Out[14]: 'AB09CDE'
In [15]: print a
AB09 CDE
So, if rec were a plain string, you could write:
records = [rec.replace(" ", "") for rec in records]
but since each rec here is a list, apply replace to the rec[7] element instead.
Related
I have a large csv file (comma delimited). I would like to replace a few random cells containing the value "NIL" with an empty string "".
I tried the following to find the keyword "NIL" and replace it with an empty string, but it's giving me an empty csv file:
ifile = open('outfile', 'rb')
reader = csv.reader(ifile, delimiter='\t')
ofile = open('pp', 'wb')
writer = csv.writer(ofile, delimiter='\t')
findlist = ['NIL']
replacelist = [' ']
s = ifile.read()
for item, replacement in zip(findlist, replacelist):
    s = s.replace(item, replacement)
ofile.write(s)
From seeing your code, I feel you should directly read the file:
with open("test.csv") as opened_file:
    data = opened_file.read()
then use a regex to change all NIL to "" or " " and save the data back to the file.
import re
data = re.sub("NIL", " ", data)  # this replaces NIL with " " in the data string
NOTE: you can give any regex instead of NIL.
For more info see the re module.
EDIT 1: re.sub returns a new string, so you need to assign the result back to data.
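If whole-cell matching matters, a word-boundary anchor keeps substrings such as "NILIST" intact. A small sketch on made-up data:

```python
import re

data = "NIL,5,NILIST\nNIL,7,8\n"  # made-up sample rows
# \b matches only at word boundaries, so NILIST is left alone
cleaned = re.sub(r"\bNIL\b", "", data)
```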
A few tweaks and your example works. I edited your question to get rid of some indentation errors, assuming those were a cut/paste problem. The next problem is that you don't import csv; but even though you create a reader and a writer, you never actually use them, so they can simply be removed. So, opening in text mode instead of binary mode, we have:
ifile = open('outfile')  # 'outfile' is the input file...
ofile = open('pp', 'w')
findlist = ['NIL']
replacelist = [' ']
s = ifile.read()
for item, replacement in zip(findlist, replacelist):
    s = s.replace(item, replacement)
ofile.write(s)
We could add 'with' clauses and use a dict to make the replacements clearer:
replace_this = {'NIL': ' '}
with open('outfile') as ifile, open('pp', 'w') as ofile:
    s = ifile.read()
    for item, replacement in replace_this.items():
        s = s.replace(item, replacement)
    ofile.write(s)
The only real problem now is that it also changes things like "NILIST" to "IST". If this is a csv with all numbers except for "NIL", that's not a problem. But you could also use the csv module to only change cells that are exactly "NIL".
import csv

with open('outfile') as ifile, open('pp', 'w') as ofile:
    reader = csv.reader(ifile)
    writer = csv.writer(ofile)
    for row in reader:
        # row is a list of columns. The following builds a new list
        # while checking and changing any column that is 'NIL'.
        writer.writerow([c if c.strip() != 'NIL' else ' '
                         for c in row])
I'm working on a script to remove bad characters from a csv file and then store the rows in a list.
The script runs fine but doesn't remove the bad characters, so I'm a bit puzzled. Any pointers on why it's not working would be appreciated.
def remove_bad(item):
    item = item.replace("%", "")
    item = item.replace("test", "")
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()),
                    row[1].strip().title()))
print raw
If I have a csv-file with one line:
tst%,testT
Then your script, slightly modified, does indeed filter the "bad" characters. I changed it to pass both items separately to remove_bad (because you mentioned you had to "remove bad characters from a csv", not only from the first column):
import csv

def remove_bad(item):
    item = item.replace("%", "")
    item = item.replace("test", "")
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()),
                    remove_bad(row[1].strip()).title()))
print raw
Also, I put title() after the function call (otherwise "test" wouldn't get filtered out, because title() capitalizes it to "Test" first).
Output (the rows will get stored in a list as tuples, as in your example):
[('tst', 'T')]
Feel free to ask questions
import re
import csv

p = re.compile(r'(test|%|anyotherchars)')  # insert your bad chars instead of anyotherchars

def remove_bad(item):
    item = p.sub('', item)
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()),
                    row[1].strip().title()  # do you really need strip() without args?
                    ))  # here you create a tuple which you then append to the list
print raw
Following up on my previous question, because I couldn't get a satisfactory answer. Now I have data like this; I don't know what exactly it is:
["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]
I'd like my final output to be written to a csv file like below. How can I achieve this?
A ,B ,C
a1,a2 ,b1 ,c1
a2,a4 ,b3 ,ct
Assuming that ["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"] is one long string as the original post seems to imply, i.e.:
"""["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""
then the following code should work:
# ORIGINAL STRING
s = """["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""

# GET RID OF UNNECESSARY CHARACTERS FOR OUR CSV
s = s.replace("][", "--")  # temporary chars to help split into lines later on
s = s.replace("[", "")
s = s.replace("]", "")
s = s.replace("\'", "")
s = s.replace("\"", "")

# SPLIT UP INTO A LIST OF LINES OF TEXT
lines = s.split("--")

# WRITE EACH LINE IN TURN TO A CSV FILE
with open("myFile.csv", mode="w") as textFile:
    # mode = "w" overwrites an existing file (or creates a new one);
    # mode = "a" appends to an existing file
    for line in lines:
        textFile.write(line + "\n")
An alternative way, again assuming that the data is encoded as one long string:
import ast

# ORIGINAL STRING
s = """["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""

# PARSE INTO A LIST OF LISTS WITH STRING ELEMENTS
s2 = s.replace("][", "],[")
s2 = ast.literal_eval(s2)
s2 = [ast.literal_eval(s2[x][0]) for x in range(len(s2))]

# WRITE EACH LIST AS A LINE IN THE CSV FILE
with open("myFile.csv", mode="w") as textFile:
    # mode = "w" overwrites an existing file (or creates a new one);
    # mode = "a" appends to an existing file
    for i in range(len(s2)):
        line = ",".join(s2[i])
        textFile.write(line + "\n")
Since the given input won't be accepted directly by any built-in data structure, you need to convert it either into a string or a list of lists. The following assumes your input is a string. You can modify the formatting as per your requirement.
#!/usr/bin/python
from ast import literal_eval

def csv(li):
    file_handle = open("test.csv", "w")
    # strip the outer double quotes and split the list by commas
    for outer in li:
        temp = outer[0].strip("'")
        temp = temp.split("',")
        value = ""
        # build a formatted string (change this as per your requirement)
        for inner in temp:
            value += '{0: <10}'.format(inner.strip("'")) + '{0: >10}'.format(",")
        value = value.strip(", ")
        # write the built string into the file
        file_handle.write(value + "\n")
    file_handle.close()

# assuming your input is a string
def main():
    li_str = """["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""
    li = []
    start_pos, end_pos = 0, -1
    # break each [...] chunk into a new list and append it to li
    while start_pos != -1:
        start_pos = li_str.find("[", end_pos+1)
        if start_pos == -1:
            break
        end_pos = li_str.find("]", start_pos+1)
        li.append(literal_eval(li_str[start_pos:end_pos+1]))
    # li now contains a list of lists, i.e. the same as the input
    csv(li)

if __name__ == "__main__":
    main()
I know this question has been asked before, but never with the following caveats:
I'm a complete python n00b. Also a JSON noob.
The JSON file / string is not the same as those seen in json2csv examples.
The CSV file output is supposed to have standard columns.
Due to point number 1, I'm not aware of most terminologies and technologies used for this. So please bear with me.
Point number 2: Here's a single line of the supposed JSON file:
"id":"123456","about":"YESH","can_post":true,"category":"Community","checkins":0,"description":"OLE!","has_added_app":false,"is_community_page":false,"is_published":true,"likes":48,"link":"www.fake.com","name":"Test Name","parking":{"lot":0,"street":0,"valet":0},"talking_about_count":0,"website":"www.fake.com/blog","were_here_count":0^
Weird, I know - it lacks braces and brackets and stuff. Which is why I'm convinced posted solutions won't work.
I'm not sure what the 0^ at the end of the line is, but I see it at the end of every line. I'm assuming the 0 is the value for "were_here_count" while the ^ is a... line terminator? EDIT: Apparently, I can just disregard it.
Of note is that the value of "parking" appears to be yet another array - I'm fine with just displaying it as is (minus the double quotes).
Point number 3: Here's the columns of the supposed CSV file output. This is the complete column set - the JSON file won't always have them all.
ID STRING,
ABOUT STRING,
ATTIRE STRING,
BAND_MEMBERS STRING,
BEST_PAGE STRING,
BIRTHDAY STRING,
BOOKING_AGENT STRING,
CAN_POST STRING,
CATEGORY STRING,
CATEGORY_LIST STRING,
CHECKINS STRING,
COMPANY_OVERVIEW STRING,
COVER STRING,
CONTEXT STRING,
CURRENT_LOCATION STRING,
DESCRIPTION STRING,
DIRECTED_BY STRING,
FOUNDED STRING,
GENERAL_INFO STRING,
GENERAL_MANAGER STRING,
GLOBAL_BRAND_PARENT_PAGE STRING,
HOMETOWN STRING,
HOURS STRING,
IS_PERMANENTLY_CLOSED STRING,
IS_PUBLISHED STRING,
IS_UNCLAIMED STRING,
LIKES STRING,
LINK STRING,
LOCATION STRING,
MISSION STRING,
NAME STRING,
PARKING STRING,
PHONE STRING,
PRESS_CONTACT STRING,
PRICE_RANGE STRING,
PRODUCTS STRING,
RESTAURANT_SERVICES STRING,
RESTAURANT_SPECIALTIES STRING,
TALKING_ABOUT_COUNT STRING,
USERNAME STRING,
WEBSITE STRING,
WERE_HERE_COUNT STRING
Here's my code so far:
import os

num = '1'
inPath = "./fb-data_input/"
outPath = "./fb-data_output/"

# Get list of files, put them in fileNameList array
fileNameList = os.listdir(inPath)

# Process per file
for item in fileNameList:
    print("Processing: " + item)
    fb_inputFile = open(inPath + item, "rb").read().split("\n")
    fb_outputFile = open(outPath + "fbdata-IAB-output" + num, "wb")
    num = str(int(num) + 1)  # Python has no ++ operator
    jsonString = fb_inputFile.split("\",\"")
    jsonField = jsonString[0]
    jsonValue = jsonString[1]
    jsonHash[?] = [?,?]
    #Do Code stuff here
Up until the for loop, it just loads the json file names into an array, and then processes them one by one.
Here's my logic for the rest of the code:
Split the json string by something. Perhaps the "," so that other commas won't get split.
Store it into a hashmap / 2D array (dynamic?)
Trim away the JSON fields and the first and/or last double quotes.
Add the resulting output to another hashmap, with those set columns, putting in null in a column that the JSON file does not have.
And then I output the result to a CSV.
It sounds logical in my head, but I'm pretty sure there's something I missed. And of course, I have a hard time putting it in code.
Can I have some help on this? Thanks.
P.S.
Additional information:
OS: Mac OSX
Target platform OS: Ubuntu of some sort
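For what it's worth, each line becomes valid JSON once the trailing ^ is stripped and the whole thing is wrapped in braces; a minimal sketch using the sample line from the question (shortened here):

```python
import json

line = '"id":"123456","about":"YESH","parking":{"lot":0,"street":0,"valet":0},"were_here_count":0^'
# drop the trailing ^ and add the missing braces, then parse normally
record = json.loads("{%s}" % line.rstrip("^"))
```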
Here is a full solution, based on your original code:
import os
import json
from csv import DictWriter
import codecs

def get_columns():
    columns = []
    with open("columns.txt") as f:
        columns = [line.split()[0] for line in f if line.strip()]
    return columns

if __name__ == "__main__":
    in_path = "./fb-data_input/"
    out_path = "./fb-data_output/"
    columns = get_columns()
    bad_keys = ("has_added_app", "is_community_page")
    for filename in os.listdir(in_path):
        json_filename = os.path.join(in_path, filename)
        csv_filename = os.path.join(out_path, "%s.csv" % (os.path.basename(filename)))
        with open(json_filename) as f, open(csv_filename, "wb") as csv_file:
            csv_file.write(codecs.BOM_UTF8)
            csv = DictWriter(csv_file, columns)
            csv.writeheader()
            for line_number, line in enumerate(f, start=1):
                try:
                    data = json.loads("{%s}" % (line.strip().strip('^')))
                    # fix parking column
                    if "parking" in data:
                        data['parking'] = ", ".join("%s: %s" % (k, str(v)) for k, v in data['parking'].items())
                    data = {k.upper(): unicode(v).encode('utf8') for k, v in data.items() if k not in bad_keys}
                except Exception, e:
                    import traceback
                    traceback.print_exc()
                    data = {columns[0]: "Error on line %s of %s: %s" % (line_number, json_filename, e)}
                csv.writerow(data)
Edited: Full unicode support plus extended error information.
So, first off, your string is valid json if you just add curly braces around it. You can then deserialize it with Python's json library. Set up your csv columns as a dictionary with each of them pointing to whatever you want as a default value (None? ""? Your choice). Once you've deserialized the json to a dict, just loop through each key there and fill in the csv_cols dict as appropriate. Then just use Python's csv module to write it out:
import json
import csv

string = '"id":"123456","about":"YESH","can_post":true,"category":"Community","checkins":0,"description":"OLE!","has_added_app":false,"is_community_page":false,"is_published":true,"likes":48,"link":"www.fake.com","name":"Test Name","parking":{"lot":0,"street":0,"valet":0},"talking_about_count":0,"website":"www.fake.com/blog","were_here_count":0^'
string = '{%s}' % string[:-1]
json_dict = json.loads(string)

# make 'parking' a string. I'm assuming that's your only hash.
json_dict['parking'] = json.dumps(json_dict['parking'])

csv_cols_list = ['a', 'b', 'c']  # put your actual csv columns here
csv_cols = {col: '' for col in csv_cols_list}
for k, v in json_dict.iteritems():
    if k in csv_cols:
        csv_cols[k] = v
# now just write to csv using Python's csv library
Note: this is a general answer that assumes that your "json" will be valid key/value pairs. Your "parking" key is a special case you'll need to deal with somehow. I left it as is because I don't know what you want with it. I'm also assuming the '^' at the end of your string was a typo.
[EDIT] Changed to account for parking and the '^' at the end. [/EDIT]
Either way, the general idea here is what you want.
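One way to deal with the "parking" value, if you just want it displayed as-is in a single CSV cell, is to serialize the nested dict back to a string. A sketch (whether this display format suits you is an assumption):

```python
import json

parking = {"lot": 0, "street": 0, "valet": 0}  # the nested value from the sample line
# json.dumps flattens the dict into one cell-sized string;
# sort_keys makes the output deterministic
cell = json.dumps(parking, sort_keys=True)
```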
The first thing is that your input is not JSON. It's just a delimited string where each column and value is quoted.
Here is a solution that would work:
import csv

columns = ['ID', 'ABOUT', ... ]

with open('input_file.txt', 'r') as f, open('output_file.txt', 'w') as o:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(o, delimiter=',')
    writer.writerow(columns)
    for row in reader:
        data = {}
        for cell in row:
            k, v = cell.split(':', 1)
            data[k.strip('"').upper()] = v.strip('"')
        row = [data.get(c, '') for c in columns]
        writer.writerow(row)
In this loop, a dictionary is created for each line we read from the input file. The key is the first value of each 'foo:bar' pair, converted to upper case.
Next, for each column, we try to fetch a value from this dictionary in the order that the columns are written out. If a value for the column doesn't exist, a blank '' is returned. These values are collected in a list row. This makes sure no matter how many columns are missing, we write an equal number of columns to the output.
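The dict.get fallback is what keeps the column count stable; a tiny demonstration with an assumed subset of the columns:

```python
columns = ['ID', 'ABOUT', 'LIKES']      # assumed subset of the full column list
data = {'ID': '123456', 'LIKES': '48'}  # parsed line missing the ABOUT key
# missing columns become empty strings, so every row has len(columns) cells
row = [data.get(c, '') for c in columns]
```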
I have an input file in the below format:
<ftnt>
<p><su>1</su> aaaaaaaaaaa </p>
</ftnt>
...........
...........
...........
... the <su>1</su> is available in the .........
I need to convert this to the below format by replacing the value and deleting the whole data in ftnt tags:
"""...
...
... the aaaaaaaaaaa is available in the ..........."""
Please find the code which I have written below. Initially I saved the keys and values in a dictionary and tried to replace the value based on the key using grouping.
import re

dict = {}
in_file = open("in.txt", "r")
outfile = open("out.txt", "w")
File1 = in_file.read()
infile1 = File1.replace("\n", " ")
for mo in re.finditer(r'<p><su>(\d+)</su>(.*?)</p>', infile1):
    dict[mo.group(1)] = mo.group(2)
subval = re.sub(r'<p><su>(\d+)</su>(.*?)</p>', '', infile1)
subval = re.sub('<su>(\d+)</su>', dict[\\1], subval)
outfile.write(subval)
I tried to use the dictionary in re.sub but I am getting an error. I don't know why this happens; could you please tell me how to do it? I'd appreciate any help here.
Try using a lambda for the second argument to re.sub, rather than a string with backreferences:
subval = re.sub('<su>(\d+)</su>',lambda m:dict[m.group(1)], subval)
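A self-contained run of that approach, using the sample footnote text from the question:

```python
import re

subs = {"1": "aaaaaaaaaaa"}  # built earlier from the <p><su>...</su>...</p> block
text = "... the <su>1</su> is available in the ..."
# the lambda receives each match object and looks the number up in the dict
result = re.sub(r"<su>(\d+)</su>", lambda m: subs[m.group(1)], text)
```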
First off, don't name dictionaries dict or you'll shadow the built-in dict function. Second, \\1 doesn't work outside of a string, hence the syntax error. I think the best bet is to take advantage of str.format:
import re

# store the substitutions
subs = {}

# read the data
in_file = open("in.txt", "r")
contents = in_file.read().replace("\n", " ")
in_file.close()

# save some regexes for later
ftnt_tag = re.compile(r'<ftnt>.*</ftnt>')
var_tag = re.compile(r'<p><su>(\d+)</su>(.*?)</p>')

# pull the ftnt block out
ftnt = ftnt_tag.findall(contents)[0]
contents = ftnt_tag.sub('', contents)

# pull out the su substitutions
for match in var_tag.finditer(ftnt):
    # prefix the keys with "s" so they aren't numbers, useful for format
    subs["s" + match.group(1)] = match.group(2)

# replace <su>1</su> with {s1}
contents = re.sub(r"<su>(\d+)</su>", r"{s\1}", contents)

# now that the <su> markers are the keys, we can just use str.format
out_file = open("out.txt", "w")
out_file.write(contents.format(**subs))
out_file.close()