Python reading csv problem: extra whitespace - python

When I tried to parse a CSV exported by an MS SQL 2005 Express Edition query, the string Python gives me is totally unexpected. For example, if the line in the CSV file is "aaa,bbb,ccc,dddd", then when Python parses it as a string it becomes "a a a , b b b , c c c , d d d d", or something like that. What is happening?
I tried to remove the spaces in the code, but it doesn't work.
import os
import random

f1 = open('a.txt', 'r')
f2 = open('dec_sql.txt', 'w')
text = 'abc'
while(text != ''):
    text = f1.readline()
    if(text == ''):
        break
    splited = text.split(',')
    for i in range(0, 32):
        splited[i] = splited[i].replace(' ', '')
    sql = 'insert into dbo.INBOUND_RATED_DEC2010 values ('
    sql += '\'' + splited[0] + '\', '
    sql += '\'' + splited[1] + '\', '
    sql += '\'' + splited[2] + '\', '
    sql += '\'' + splited[3] + '\', '
    sql += '\'' + splited[4] + '\', '
    sql += '\'' + splited[5] + '\', '
    sql += '\'' + splited[6] + '\', '
    sql += '\'' + splited[7] + '\', '
    sql += '\'' + splited[8] + '\', '
    sql += '\'' + splited[9] + '\', '
    sql += '\'' + splited[10] + '\', '
    sql += '\'' + splited[11] + '\', '
    sql += '\'' + splited[12] + '\', '
    sql += '\'' + splited[13] + '\', '
    sql += '\'' + splited[14] + '\', '
    sql += '\'' + splited[15] + '\', '
    sql += '\'' + splited[16] + '\', '
    sql += '\'' + splited[17] + '\', '
    sql += '\'' + splited[18] + '\', '
    sql += '\'' + splited[19] + '\', '
    sql += '\'' + splited[20] + '\', '
    sql += '\'' + splited[21] + '\', '
    sql += '\'' + splited[22] + '\', '
    sql += '\'' + splited[23] + '\', '
    sql += '\'' + splited[24] + '\', '
    sql += '\'' + splited[25] + '\', '
    sql += '\'' + splited[26] + '\', '
    sql += '\'' + splited[27] + '\', '
    sql += '\'' + splited[28] + '\', '
    sql += '\'' + splited[29] + '\', '
    sql += '\'' + splited[30] + '\', '
    sql += '\'' + splited[31] + '\', '
    sql += '\'' + splited[32] + '\' '
    sql += ')'
    print sql
    f2.write(sql + '\n')
f2.close()
f1.close()

Sounds to me like the output of the MS SQL 2005 query is a Unicode file. The Python csv module cannot handle Unicode files, but there is sample code in the csv module's documentation describing how to work around the problem.
Alternatively, some text editors allow you to save a file with a different encoding. For example, I opened the results of an MS SQL 2005 query in Notepad++; it told me the file was UCS-2 encoded, and I was able to convert it to UTF-8 from the Encoding menu.
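The workaround in the Python 2 csv documentation is roughly this shape: decode the file with its real encoding, then re-encode each line as UTF-8 before handing it to csv.reader. A minimal sketch, assuming the export really is UTF-16 (the filename is just an example):
import codecs
import csv

def utf8_lines(path, encoding='utf-16'):
    # Decode with the file's real encoding, then re-encode as UTF-8,
    # which the Python 2 csv module can digest.
    for line in codecs.open(path, 'r', encoding):
        yield line.encode('utf-8')

for row in csv.reader(utf8_lines('a.txt')):
    print [cell.decode('utf-8') for cell in row]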

Try opening the file in Notepad and using the Replace All function to replace ' ' with ''.

Your file is most likely encoded with a two-byte character encoding, probably UTF-16 (but it could be some other encoding).
To get the csv module to read it properly, open the file with a codec so that it is decoded as it is read; doing that, you have unicode objects (not str objects) inside your Python program.
So, instead of opening the file with
my_file = open("data.dat", "rt")
Use:
import codecs
my_file = codecs.open("data.dat", "rt", "utf-16")
And then feed this to the csv module, with:
import csv
reader = csv.reader(my_file)
first_line = True
for line in reader:
    if first_line:  # skip the header line
        first_line = False
        continue
    # assemble sql query and issue it
Another thing: having your "query" constructed in 32 lines of repetitive code is not a nice way to program. Even in languages that lack rich string-processing facilities there are better ways to do it, but in Python you can simply write:
sql = 'insert into dbo.INBOUND_RATED_DEC2010 values (%s);' % ", ".join("'%s'" % value for value in splited )
instead of those 33 lines assembling your query. (This tells Python to insert a string inside the parentheses in the first string. After the % operator, the string ", " is used with the join method so that it pastes together all the elements of the sequence passed to join. That sequence is a generator producing, for each value in your splited array, a string containing the value enclosed in single quotes.)
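For illustration, with a hypothetical three-element list the one-liner produces:
splited = ['aaa', 'bbb', 'ccc']
sql = 'insert into dbo.INBOUND_RATED_DEC2010 values (%s);' % ", ".join("'%s'" % value for value in splited)
print sql
# prints: insert into dbo.INBOUND_RATED_DEC2010 values ('aaa', 'bbb', 'ccc');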

It may help to use Python's built-in csv reader. This looks like an issue with Unicode, a problem that frustrated me a lot.
import re
import tkFileDialog
import csv

ENCODING_REGEX_REPLACEMENT_LIST = [(re.compile('\xe2\x80\x99'), "'"),
                                   (re.compile('\xe2\x80\x94'), "--"),
                                   (re.compile('\xe2\x80\x9c'), '"'),
                                   (re.compile('\xe2\x80\x9d'), '"'),
                                   (re.compile('\xe2\x80\xa6'), '...')]

def correct_encoding(csv_row):
    for key in csv_row.keys():
        # if there is a value for the current key
        if csv_row[key]:
            try:
                csv_row[key] = unicode(csv_row[key], errors='strict')
            except ValueError:
                # we have a bad encoding; iterate through all the known
                # bad encodings in ENCODING_REGEX_REPLACEMENT_LIST, replace
                # everything, and then try again
                for (regex, replacement) in ENCODING_REGEX_REPLACEMENT_LIST:
                    csv_row[key] = regex.sub(replacement, csv_row[key])
                print(csv_row)
                csv_row[key] = unicode(csv_row[key])
        # if there is NOT a value for the current key
        else:
            csv_row[key] = unicode('')
    return csv_row

filename = tkFileDialog.askopenfilename()
csv_reader = csv.DictReader(open(filename, "rb"), dialect='excel')  # assuming similar dialect
for csv_row in csv_reader:
    csv_row = correct_encoding(csv_row)
    # your application logic here

Related

Tweepy error with exporting array content

I am looking to extract tweets and write them to a CSV file, but I cannot figure out how to get my code to generate the file. I am using Tweepy to extract the tweets. I would like the CSV file to contain the following cells: user, date, tweet, likes, retweets, total, eng rate, rating, tweet ID.
import tweepy
import csv

auth = tweepy.OAuthHandler("", "")
auth.set_access_token("", "")
api = tweepy.API(auth)

try:
    api.verify_credentials()
    print("Authentication OK")
except:
    print("Error during authentication")

def timeline(username):
    tweets = api.user_timeline(screen_name=username, count='100', tweet_mode="extended")
    for status in tweets:
        eng = round(((status.favorite_count + status.retweet_count) / status.user.followers_count) * 100, 2)
        if (not status.retweeted) and ('RT #' not in status.full_text) and (eng <= 0.02):
            print(status.user.screen_name + ',' + str(status.created_at) + ',' + status.full_text + ",Likes: " + str(status.favorite_count) + ",Retweets: " + str(status.retweet_count) + ',Total: ' + str(status.favorite_count + status.retweet_count) + ',Engagement rate: ' + str(eng) + '%' + 'Rating: Low' + ',Tweet ID: ' + str(status.id))
        elif (not status.retweeted) and ('RT #' not in status.full_text) and (0.02 < eng <= 0.09):
            print(status.user.screen_name + ',' + str(status.created_at) + ',' + status.full_text + ",Likes: " + str(status.favorite_count) + ",Retweets: " + str(status.retweet_count) + ',Total: ' + str(status.favorite_count + status.retweet_count) + ',Engagement rate: ' + str(eng) + '%' + 'Rating: Good' + ',Tweet ID: ' + str(status.id))
        elif (not status.retweeted) and ('RT #' not in status.full_text) and (0.09 < eng <= 0.33):
            print(status.user.screen_name + ',' + str(status.created_at) + ',' + status.full_text + ",Likes: " + str(status.favorite_count) + ",Retweets: " + str(status.retweet_count) + ',Total: ' + str(status.favorite_count + status.retweet_count) + ',Engagement rate: ' + str(eng) + '%' + 'Rating: High' + ',Tweet ID: ' + str(status.id))
        elif (not status.retweeted) and ('RT #' not in status.full_text) and (0.33 < eng):
            print(status.user.screen_name + ',' + str(status.created_at) + ',' + status.full_text + ",Likes: " + str(status.favorite_count) + ",Retweets: " + str(status.retweet_count) + ',Total: ' + str(status.favorite_count + status.retweet_count) + ',Engagement rate: ' + str(eng) + '%' + 'Rating: Very High' + ',Tweet ID: ' + str(status.id))

tweet = timeline("twitter")
with open('tweet.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow([tweet])
You can look at https://docs.python.org/3/library/csv.html for info on how to generate a CSV file in Python. Quick example:
import csv

with open('some_output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["field1", "field2", "field3"])
Your function get_tweets does not return a value, but you are trying to retrieve a value from that function, which results in None. Also, it looks like the tweet value will be a list of strings. The writerow method of csv.writer should get a list of items, not a list of lists. I have modified your code to address those issues. Let me know if it works.
def get_tweets(username):
    tweets = api.user_timeline(screen_name=username, count=100)
    tweets_for_csv = [tweet.text for tweet in tweets]
    print(tweets_for_csv)
    return tweets_for_csv

tweet = get_tweets("fazeclan")
with open('tweet.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(tweet)
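If the goal is one row per tweet with the columns listed in the question (user, date, tweet, likes, retweets, total), a sketch along these lines may be closer to what is wanted; it assumes the authenticated api object from the question and writes a row inside the loop:
import csv

def write_timeline_csv(api, username, path='tweet.csv'):
    # Hypothetical helper: fetch the timeline and write one CSV row per tweet.
    tweets = api.user_timeline(screen_name=username, count=100, tweet_mode="extended")
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["user", "date", "tweet", "likes", "retweets", "total"])
        for status in tweets:
            total = status.favorite_count + status.retweet_count
            writer.writerow([status.user.screen_name, status.created_at,
                             status.full_text, status.favorite_count,
                             status.retweet_count, total])

write_timeline_csv(api, "twitter")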

How to encrypt a .bin file

I need to encrypt 3 .bin files which contain 2 keys for Diffie-Hellman. I have no clue how to do that; all I could think of is what I did in the following Python file. I have an example of what the output should look like, but my code doesn't seem to produce the right keys. The output file server.ini is used by a client to connect to a server.
import base64

fileList = [['game_key.bin', 'Game'], ['gate_key.bin', 'Gate'], ['auth_key.bin', 'Auth']]
iniList = []
for i in fileList:
    file = open(i[0], 'rb')
    n = list(file.read(64))
    x = list(file.read(64))
    file.close()
    n.reverse()
    x.reverse()
    iniList.append(['Server.' + i[1] + '.N "' + base64.b64encode("".join(n)) + '"\n',
                    'Server.' + i[1] + '.X "' + base64.b64encode("".join(x)) + '"\n'])
iniList[0].append('\n')

# time for user input
ip = '"' + raw_input('Hostname: ') + '"'
dispName = 'Server.DispName ' + '"' + raw_input('DispName: ') + '"' + '\n'
statusUrl = 'Server.Status ' + '"' + raw_input('Status URL: ') + '"' + '\n'
signupUrl = 'Server.Signup ' + '"' + raw_input('Signup URL: ') + '"' + '\n'
for l in range(1, 3):
    iniList[l].append('Server.' + fileList[l][1] + '.Host ' + ip + '\n\n')
for l in [[dispName], [statusUrl], [signupUrl]]:
    iniList.append(l)

outFile = open('server.ini', 'w')
for l in iniList:
    for i in l:
        outFile.write(i)
outFile.close()
The following was in my example file:
# Keys are Base64-encoded 512 bit RC4 keys, as generated by DirtSand's keygen
# command. Note that they MUST be quoted in the commands below, or the client
# won't parse them correctly!
I also tried it without reversing n and x.

Concatenating strings containing many quotations results in slashes in output

I am trying to build a string that needs to contain specific double and single quotation characters for executing a SQL expression.
I need my output to be formatted like this:
" "Full_Stree" = 'ALLENDALE RD' "
where the value ALLENDALE RD comes from a variable set in a for loop. In the following code sample, the variable tOS is what I am trying to pass into the query variable.
tOS = "ALLENDALE RD"
query = '" "Full_Stree" = ' + "'" + tOS + "' " + '"'
and when I print the value of query variable I get this output:
'" "Full_Stree" = \'ALLENDALE RD\' "'
The slashes are causing my query to fail. I also tried using a modulus operator to pass the value of the tOS variable, but get the same results:
where = '" "Full_Stree" = \'%s\' "' % (tOS)
print where
'" "Full_Stree" = \'ALLENDALE RD\' "'
How can I get my string concatenated into the correct format, leaving the slashes out of the expression?
What you are seeing is the repr of your string.
>>> s = '" "Full_Stree" = \'ALLENDALE RD\' "'
>>> s  # without print, the console displays the repr
'" "Full_Stree" = \'ALLENDALE RD\' "'
>>> print s  # with print, the string itself is displayed
" "Full_Stree" = 'ALLENDALE RD' "
Your real problem is the extra quotes at the beginning and end of your where-clause.
This
query = '" "Full_Stree" = ' + "'" + tOS + "' " + '"'
should be
query = '"Full_Stree" = ' + "'" + tOS + "'"
It is more clearly written as
query = """"Full_Stree" = '%s'""" % tOS
The ArcGIS docs recommend something more like this:
dataset = '/path/to/featureclass/shapefile/or/table'
field = arcpy.AddFieldDelimiters(dataset, 'Full_Stree')
whereclause = "%s = '%s'" % (field, tOS)
arcpy.AddFieldDelimiters makes sure that the field name includes the proper quoting style for the dataset you are using (some use double-quotes and some use square brackets).
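A hedged illustration (the shapefile path is hypothetical; for a shapefile the delimiter happens to be double quotes):
import arcpy

dataset = 'C:/data/streets.shp'  # hypothetical shapefile path
field = arcpy.AddFieldDelimiters(dataset, 'Full_Stree')
whereclause = "%s = '%s'" % (field, 'ALLENDALE RD')
print whereclause  # "Full_Stree" = 'ALLENDALE RD'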
Somehow the way I already tried worked out:
where = '" "Full_Stree" = \'%s\' "' % (tOS)
print where
'" "Full_Stree" = \'ALLENDALE RD\' "'
Can't you just use triple quotes?
a=""" "Full_Street" = 'ALLENDALE RD' """
print a
"Full_Street" = 'ALLENDALE RD'

Trouble with apostrophe in arcpy search cursor where clause

I've put together a tkinter form and Python script for downloading files from an FTP site. The filenames are in the attribute table of a shapefile, along with an overall name that each filename corresponds to. In other words, I look up a name such as "CABOT" and download the filename 34092_18.tif. However, if a name has an apostrophe, such as "O'KEAN", it gives me trouble. I try to replace the apostrophe, as I've done in previous scripts, but it doesn't download anything.
whereExp = quadField + " = " + "'" + quadName.replace("'", '"') + "'"
quadFields = ["FILENAME"]
c = arcpy.da.SearchCursor(collarlessQuad, quadFields, whereExp)
for row in c:
    tifFile = row[0]
    tifName = quadName.replace("'", '') + '_' + tifFile
    # fullUrl = ftpUrl + tifFile
    local_filename = os.path.join(downloadDir, tifName)
    lf = open(local_filename, "wb")
    ftp.retrbinary('RETR ' + tifFile, lf.write)
    lf.close()
Here is an example of a portion of a script that works fine by replacing the apostrophe:
where_clause = quadField + " = " + "'" + quad.replace("'", '"') + "'"
#out_quad = quad.replace("'", "") + ".shp"
arcpy.MakeFeatureLayer_management(quadTable, "quadLayer")
select_out_feature_class = arcpy.SelectLayerByAttribute_management("quadLayer", "NEW_SELECTION", where_clause)
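For what it's worth, the standard way to put a literal apostrophe inside a SQL string literal is to double it rather than swap in a double quote; a minimal sketch against the same variables (not tested against this particular dataset):
# Escape the apostrophe by doubling it, the usual SQL convention,
# so O'KEAN becomes 'O''KEAN' inside the where clause.
whereExp = quadField + " = '" + quadName.replace("'", "''") + "'"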

Python non-ascii characters

I have a Python file that creates and populates a table in MS SQL. The only sticking point is that the code breaks if there are any non-ASCII characters or single apostrophes (and there are quite a few of each). Although I can run the replace function to rid the strings of apostrophes, I would prefer to keep them intact. I have also tried converting the data to utf-8, but no luck there either.
Below are the error messages I get:
"'ascii' codec can't encode character u'\u2013' in position..." (for non-ASCII characters)
and for the single quotes:
<class 'pyodbc.ProgrammingError'>: ('42000', "[42000] [Microsoft][ODBC SQL Server Driver][SQL Server] Incorrect syntax near 'S, 230 X 90M.; Eligibilty....
When I try to encode a string in utf-8, I instead get the following error message:
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe2 in position 219: ordinal not in range(128)
The Python code is included below. I believe the point in the code where this break occurs is the following line: InsertValue = str(row.GetValue(CurrentField['Name'])).
# -*- coding: utf-8 -*-
import pyodbc
import sys
import arcpy
import arcgisscripting

gp = arcgisscripting.create(9.3)

SQL_KEYWORDS = ['PERCENT', 'SELECT', 'INSERT', 'DROP', 'TABLE']

#SourceFGDB = '###'
#SourceTable = '###'
SourceTable = sys.argv[1]
TempInputName = sys.argv[2]
SourceTable2 = sys.argv[3]

#---------------------------------------------------------------------------------------------------------------------
# Target Database Settings
#---------------------------------------------------------------------------------------------------------------------
TargetDatabaseDriver = "{SQL Server}"
TargetDatabaseServer = "###"
TargetDatabaseName = "###"
TargetDatabaseUser = "###"
TargetDatabasePassword = "###"

# Get schema from FGDB table.
# This should be an ordered list of dictionary elements [{'FGDB_Name', 'FGDB_Alias', 'FGDB_Type', FGDB_Width, FGDB_Precision, FGDB_Scale}, {}]
if not gp.Exists(SourceTable):
    print('- The source does not exist.')
    sys.exit(102)

#### Should see if it is actually a table type. Could be a Feature Data Set or something...
print(' - Processing Items From : ' + SourceTable)
FieldList = []
Field_List = gp.ListFields(SourceTable)
print(' - Getting number of rows.')
result = gp.GetCount_management(SourceTable)
Number_of_Features = gp.GetCount_management(SourceTable)
print(' - Number of Rows: ' + str(Number_of_Features))
print(' - Getting fields.')
Field_List1 = gp.ListFields(SourceTable, 'Layer')
Field_List2 = gp.ListFields(SourceTable, 'Comments')
Field_List3 = gp.ListFields(SourceTable, 'Category')
Field_List4 = gp.ListFields(SourceTable, 'State')
Field_List5 = gp.ListFields(SourceTable, 'Label')
Field_List6 = gp.ListFields(SourceTable, 'DateUpdate')
Field_List7 = gp.ListFields(SourceTable, 'OBJECTID')
for Current_Field in Field_List1 + Field_List2 + Field_List3 + Field_List4 + Field_List5 + Field_List6 + Field_List7:
    print(' - Field Found: ' + Current_Field.Name)
    if Current_Field.AliasName in SQL_KEYWORDS:
        Target_Name = Current_Field.Name + '_'
    else:
        Target_Name = Current_Field.Name
    print(' - Alias : ' + Current_Field.AliasName)
    print(' - Type : ' + Current_Field.Type)
    print(' - Length : ' + str(Current_Field.Length))
    print(' - Scale : ' + str(Current_Field.Scale))
    print(' - Precision: ' + str(Current_Field.Precision))
    FieldList.append({'Name': Current_Field.Name, 'AliasName': Current_Field.AliasName, 'Type': Current_Field.Type, 'Length': Current_Field.Length, 'Scale': Current_Field.Scale, 'Precision': Current_Field.Precision, 'Unique': 'UNIQUE', 'Target_Name': Target_Name})

# Create table in SQL Server based on FGDB table schema.
cnxn = pyodbc.connect(r'DRIVER={SQL Server};SERVER=###;DATABASE=###;UID=sql_webenvas;PWD=###')
cursor = cnxn.cursor()

#### DROP the table first?
try:
    DropTableSQL = 'DROP TABLE dbo.' + TempInputName + '_Test;'
    print DropTableSQL
    cursor.execute(DropTableSQL)
    cnxn.commit()
except:
    print('WARNING: Can not drop table - may not exist: ' + TempInputName + '_Test')

CreateTableSQL = ('CREATE TABLE ' + TempInputName + '_Test '
                  ' (Layer varchar(500), Comments varchar(5000), State int, Label varchar(500), DateUpdate DATETIME, Category varchar(50), OBJECTID int)')
cursor.execute(CreateTableSQL)
cnxn.commit()

# Cursor through each row in the FGDB table, get values, and insert into the SQL Server Table.
# We got Number_of_Features earlier, just use that.
Number_Processed = 0
print(' - Processing ' + str(Number_of_Features) + ' features.')
rows = gp.SearchCursor(SourceTable)
row = rows.Next()
while row:
    if Number_Processed % 10000 == 0:
        print(' - Processed ' + str(Number_Processed) + ' of ' + str(Number_of_Features))
    InsertSQLFields = 'INSERT INTO ' + TempInputName + '_Test ('
    InsertSQLValues = 'VALUES ('
    for CurrentField in FieldList:
        InsertSQLFields = InsertSQLFields + CurrentField['Target_Name'] + ', '
        InsertValue = str(row.GetValue(CurrentField['Name']))
        if InsertValue in ['None']:
            InsertValue = 'NULL'
        # Use an escape quote for the SQL.
        InsertValue = InsertValue.replace("'", "' '")
        if CurrentField['Type'].upper() in ['STRING', 'CHAR', 'TEXT']:
            if InsertValue == 'NULL':
                InsertSQLValues = InsertSQLValues + "NULL, "
            else:
                InsertSQLValues = InsertSQLValues + "'" + InsertValue + "', "
        elif CurrentField['Type'].upper() in ['GEOMETRY']:
            ## We're not handling geometry transfers at this time.
            if InsertValue == 'NULL':
                InsertSQLValues = InsertSQLValues + '0' + ', '
            else:
                InsertSQLValues = InsertSQLValues + '1' + ', '
        else:
            InsertSQLValues = InsertSQLValues + InsertValue + ', '
    InsertSQLFields = InsertSQLFields[:-2] + ')'
    InsertSQLValues = InsertSQLValues[:-2] + ')'
    InsertSQL = InsertSQLFields + ' ' + InsertSQLValues
    ## print InsertSQL
    cursor.execute(InsertSQL)
    cnxn.commit()
    Number_Processed = Number_Processed + 1
    row = rows.Next()
print(' - Processed all ' + str(Number_Processed))
del row
del rows
James, I believe the real issue is that you are not using Unicode across the board. Try to do the following:
Make sure the input file you are using to populate the DB is in UTF-8 and that you are reading it with the UTF-8 decoder.
Make sure your DB is actually storing the data as Unicode.
When you retrieve data from the file or from the DB, or want to manipulate strings (with the + operator, for instance), you need to make sure that all parts are proper Unicode. You can NOT use the str() method; you need to use unicode(), as Dave pointed out. If you define strings in your code, use u'my string' instead of 'my string' (otherwise it is not considered Unicode; see the sketch below).
Also, please provide us the full stack trace and the exception name.
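A minimal sketch of those points (the file name, table, and column are hypothetical):
# -*- coding: utf-8 -*-
import codecs

# Read the input decoded as UTF-8, so every line is a unicode object.
f = codecs.open('input.txt', 'r', 'utf-8')
line = f.readline()
f.close()

# Keep everything unicode: u'' literals and unicode(), never str().
value = unicode(line).strip()
sql = u"INSERT INTO my_table (col) VALUES ('%s')" % value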
I'm going to use my psychic debugging skills and say you are trying to str()ify something and getting an error with the ascii codec. What you really should do is use the utf-8 codec instead, like this:
insert_value_uni = unicode(row.GetValue(CurrentField['Name']))
InsertValue = insert_value_uni.encode('utf-8')
Or you can take the view that only ASCII is allowed and use the awesomely named Unicode Hammer
In general you want to convert to unicode on data input and convert to the desired encoding on output.
It will be easier to find your problem if you do this. It means changing all strings to unicode: 'INSERT INTO ' becomes u'INSERT INTO '. (Notice the "u" before the string.)
Then, when you send the string to be executed, convert it to the desired encoding, "utf8":
cursor.execute(InsertSQL.encode("utf8")) # Where InsertSQL is unicode
Also, you should declare the encoding at the top of your source code. This means adding the encoding cookie to one of the first two lines of the file:
#!/usr/bin/python
# -*- coding: <encoding name> -*-
If you're pulling data from a file to build your string, you can use codecs.open to auto-convert from a specific encoding to unicode on load.
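As a side note beyond these answers: pyodbc can also bind parameters, which sidesteps manual quoting and escaping of values entirely. A hypothetical sketch (table and column names invented):
# Let the driver handle quoting and encoding by passing values as
# parameters instead of splicing them into the SQL string.
insert_sql = u'INSERT INTO my_table (Layer, Comments) VALUES (?, ?)'
cursor.execute(insert_sql, layer_value, comments_value)
cnxn.commit()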
Converting my str() calls to unicode() solved the problem. A simple answer, and I appreciate everyone's help on this.
