I have a CSV file that contains some strangely (incorrectly) encoded Danish characters (å-ø-æ). In my Django view I'm trying to grab a string from the first row, and the date from the second row of the file. The file looks like this if I copy-paste it:
01,01,Project Name: SAM_LOGIK_rsm¿de_HD,,,Statistics as of: Sat Oct 01 17:09:16 2016
02,01,Project created: Tue Apr 12 09:10:16 2016,,,Last Session Started: Sat Oct 01 16:59:22 2016
The string SAM_LOGIK_rsm¿de_HD should be SAM_LOGIK_Årsmøde_HD - which is the value I want to store in the DB.
I am decoding the file with iso-8859-1 (otherwise I get an error).
with open(latest, 'rt', encoding='iso-8859-1') as csvfile:
    for i, row in enumerate(csvfile):
        if "Project Name:" in row:
            this = row.split(',')
            project_list.append(this[2][14:])  # gets the project name as is
            if i >= 1:
                break
        else:
            this = row.split(',')
            date = datetime.strptime(this[5][22:-1], '%c')  # datetime object
            project_list.append(date)
            if i >= 1:
                break  # break at row 2
csvfile.close()
This stores the string 'as is', and I'm not sure what to do to convert it back into Danish before I store it in the DB. The DB and Django are set up to work with Danish characters.
If I try to decode it as utf-8, I get a UnicodeDecodeError which reveals some more information:
01,01,Project Name: SAM_LOGIK_\x81rsm\xbfde_HD,,,Statistics as of: Sat Oct'
01 17:09:16 2016\r02,01,Project created: Tue Apr 12 09:10:16 2016,,,Last'
EDIT:
I found out that the strings in the CSV are actually corrupted, and the application that created them (Avid Media Composer) at least consistently applies the same values for Å-å-Æ-æ-Ø-ø:
Å = \x81 unassigned in UTF8
å = Œ - u"\u0153" OE ligature
Æ = ® - chr(174)
æ = ¾ - chr(190)
Ø = » - chr(187)
ø = ¿ - chr(191)
I fixed it like this.
replacements = {'\x81': 'Å', 'Œ': 'å', '®': 'Æ', '¾': 'æ', '¿': 'ø', '»': 'Ø'}
with open(newest, 'rt', encoding='iso-8859-1') as csvfile:
    for i, row in enumerate(csvfile):
        if "Project Name:" in row:
            this = row.split(',')
            project_list.append("".join([replacements.get(c, c) for c in this[2][14:]]))
            if i >= 1:
                break
        else:
            this = row.split(',')
            date = datetime.strptime(this[5][22:-1], '%c')  # datetime object
            project_list.append(date)
            if i >= 1:
                break  # break at row 2
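As a side note, for a fixed character-for-character substitution like this, `str.translate` with a table built by `str.maketrans` does the same job as the join above in a single call (same mapping, just a different mechanism):

```python
# Same substitution table as above, applied via str.translate
replacements = {'\x81': 'Å', 'Œ': 'å', '®': 'Æ', '¾': 'æ', '¿': 'ø', '»': 'Ø'}
table = str.maketrans(replacements)

name = 'SAM_LOGIK_\x81rsm¿de_HD'
print(name.translate(table))  # SAM_LOGIK_Årsmøde_HD
```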
Try this (in Python 2; in Python 3 the row is already a decoded str):
row.decode('iso-8859-1').encode('utf-8')
And if you use the with statement, closing the file isn't necessary.
I'm trying to parse based on grouping; below is the input file to parse.
I can't get my regex to aggregate multiple groups into the expected output. I need some recommendations for printing the data in the expected format. (Note: Group 2 can contain different other strings in the actual log file.)
#Parse out the timedate stamp Jan 20 03:25:08 to capture two groups
Example groups
1.) Jan 20 03:25 2.) logrotate
1.) Jan 20 05:03 2.) ntpd
logfile= """Jan 20 03:25:08 fakehost logrotate: ALERT exited abnormally with [1]
Jan 20 03:25:08 fakehost run-parts(/etc/cron.daily)[20447]: finished logrotate
Jan 20 03:26:21 fakehost anacron[28969]: Job 'cron.daily' terminated
Jan 20 03:26:21 fakehost anacron[28969]: Normal exit (1 job run)
Jan 20 03:30:01 fakehost CROND[31462]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 20 03:30:01 fakehost CROND[31461]: (root) CMD (/var/system/bin/sys-cmd -F
Jan 20 05:03:03 fakehost ntpd[3705]: synchronized to time.faux.biz, stratum 2
"""
Expected output:
minute,total_count,logrotate,CROND,ntpd,anacron,run-parts
Jan 20 03:25,2,1,0,0,0,1
Jan 20 03:26,2,0,2,0,1,1
Jan 20 03:30,2,0,2,0,0,0
Jan 20 05:03,1,0,0,1,0,0
This is my code:
import re

output = {}
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile", "r+") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            if match.group(1) and match.group(2):
                print(match.groups())
                # Stuck here trying to arrange the data
                output[match.group(1)]['total_count'] += 1
                output[match.group(1)][match.group(2)] += 1
for k, v in output.items():
    print('{0} {1}'.format(k, v))
import re

output = []
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile.txt", "r+") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            if match.group(1) and match.group(2):
                dataDict = {'minute': match.group(1), 'total_count': 1}
                dataDict[match.group(2)] = 1
                lastInsertedIndex = len(output) - 1
                if len(output) > 0:  # data exists; check if same-minute data exists or not
                    if output[lastInsertedIndex]['minute'] == match.group(1):
                        # same minute, update existing data
                        lastInsertedIndexDict = output[lastInsertedIndex]
                        if match.group(2) in lastInsertedIndexDict:
                            lastInsertedIndexDict[match.group(2)] += 1  # updating group(2)
                        else:
                            lastInsertedIndexDict[match.group(2)] = 1
                        # updating total count
                        lastInsertedIndexDict['total_count'] += 1
                        output[lastInsertedIndex] = lastInsertedIndexDict
                    else:  # new minute, simply append
                        output.append(dataDict)
                else:  # output list is empty
                    output.append(dataDict)
for data in output:
    print(data)
The idea here is: after we have match.groups(), create a dictionary with the minute as a key and total_count set to 1. Then set the value 1 for the newly found match.group(2) key.
Since the data is in increasing order of time, check whether the previously inserted entry is for the same minute or a different one.
If it is the same minute, increase the dictionary's total_count and match.group(2) values by 1.
If it is a different minute, simply append the dictionary to the output list.
Currently the output list prints keys and values. In case you want to print only the values, change print(data) in the last line to print(data.values()).
Just to mention, I have assumed that you are not facing any issue with the regex and that whatever regex you have provided fulfills your requirement.
In case you face any issue with the regex or need help with it, do let me know in a comment.
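As an aside, the bookkeeping above can be compressed with collections.defaultdict and Counter. A sketch using the same regex on a shortened copy of the sample log (note that with this pattern, run-parts is captured as just "run", since \w+ stops at the hyphen):

```python
import re
from collections import Counter, defaultdict

regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
counts = defaultdict(Counter)  # minute -> Counter of program names

logfile = """Jan 20 03:25:08 fakehost logrotate: ALERT exited abnormally with [1]
Jan 20 03:25:08 fakehost run-parts(/etc/cron.daily)[20447]: finished logrotate
Jan 20 05:03:03 fakehost ntpd[3705]: synchronized to time.faux.biz, stratum 2
"""

for line in logfile.splitlines():
    match = regex.match(line)
    if match:
        minute, program = match.groups()
        counts[minute][program] += 1

for minute, programs in counts.items():
    # e.g. Jan 20 03:25 2 {'logrotate': 1, 'run': 1}
    print(minute, sum(programs.values()), dict(programs))
```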
I have converted a PDF bank statement to a txt file. Here is a snippet of the .txt file:
15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK
What is the easiest way of rewriting the text file in Python to create a new line at certain points, i.e. after a number 'xx.xx' where a new date such as 'xx Apr' begins?
For example the text to become:
15 Apr 20DDOPEN 100.00
BENNON WATER SRVCS29.00
DDBG BUSINESS106.00...(etc)
I am just trying to make a PDF more readable and useful when working amongst my other files.
If you know of another PDF to txt python converter which works better, I would also be interested.
Thanks for your help
First step would be getting the text file into Python:
with open("file.txt") as file:
    data = file.read()
This next part, initially, I thought you wouldn't be able to do; but in your example, each part contains a number XX.XX. The important thing to notice here is that there is a '.' in each number.
Using Python's string find() method, you can iteratively look for that '.' and add a newline character two characters later. You can change my indices below to remove the DD as well if you want.
index = 0
while index != -1:
    index = data.find('.', index)
    if index != -1:
        data = data[:index + 3] + '\n' + data[index + 3:]
        index += 4  # move past this number (and the inserted newline) before searching again
Then you need to write the new data back to the file:
with open('file.txt', 'w') as file:
    file.write(data)
For the given input the following should work:
import re

counter = 0
l = "15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK"
nums = re.finditer(r"\d+\.\d+", l)
for elem in nums:
    idx = elem.span()[1] + counter  # shift by the number of newlines already inserted
    l = l[:idx] + '\n' + l[idx:]
    counter += 1
print(l)
The output is:
15 Apr 20DDOPEN 100.00
DDBENNON WATER SRVCS29.00
DDBG BUSINESS106.00
BPC BOB PETROL MINISTRY78.03
BPC BARBARA STREAMING DATA30.50
CRPAYPAL Z4J22FR450.00
CRPAYNAL AAWDL4Z4J22222KHMG30.0019
,028.4917
Apr 20CRCASH IN AT HSBC BANK
Then you should easily be able to write the result line by line to a file.
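If the amounts always have exactly two decimal places (which the sample suggests), the same split can be done in one re.sub call, and it also separates the trailing balance "19,028.49" from the next date correctly:

```python
import re

line = ("15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00"
        "CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK")

# Insert a newline after every amount of the form xx.xx (exactly two decimals)
result = re.sub(r'(\d+\.\d\d)', r'\1\n', line)
print(result)
```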
I have two UTF-8 text files:
repr(file1.txt):
\nSTATEMENT OF WORK\n\n\nSTATEMENT OF WORK NO. 7\nEffective Date: February 15, 2015
repr(file2.txt):
RENEWAL/AMENDMENT\n\nTHIS agreement is entered as of July 25, 2014. b
Their respective Brat annotation files have the following annotation:
file1.ann:
T1 date 61 78 February 15, 2015
file2.ann:
T1 date 53 67 July 25, 2014.
But when I use Python to retrieve the characters from the .txt files using the above offsets, I get:
file1.read()[61:78]:
February 15, 2015
file2.read()[53:67]:
ly 25, 2014. b
Why does my offsetting work in the first case but not the second case?
The problem comes from the fact that carriage returns ('\r' in the text file) and newlines ('\n') are not treated the same way on Windows and on Unix/Mac. If you use a Windows system to generate or modify the .txt files, there will be some '\r\n' pairs, but brat (which was not designed with Windows in mind) only counts the '\n' characters.
Using Python, you can convert from Windows offsets to brat offsets with a dict, after opening the file with the argument newline='', which ensures the '\r' characters are kept in the resulting W_Content variable:
with open('file.txt', newline='', encoding='utf-8') as f:
    W_Content = f.read()

counter = -1
UfromW_dic = {}
for n, char in enumerate(W_Content):
    if char != '\r':
        counter += 1
        UfromW_dic[n] = counter
After that, the initial span [x,y] will be found at [UfromW_dic[x], UfromW_dic[y]].
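Going the other way, from the brat offsets stored in the .ann file back into the raw text that still contains '\r', means inverting that dict. A toy sketch (the sample string here is made up):

```python
# Raw Windows-style text vs. what brat sees ('\r' stripped)
raw = "RENEWAL\r\n\r\nTHIS agreement"
seen_by_brat = raw.replace('\r', '')

# Build the raw -> brat offset map as above, then invert it
counter = -1
UfromW_dic = {}
for n, char in enumerate(raw):
    if char != '\r':
        counter += 1
        UfromW_dic[n] = counter
raw_from_brat = {v: k for k, v in UfromW_dic.items()}

start = seen_by_brat.index("THIS")  # the offset brat would report
print(raw[raw_from_brat[start]:])   # starts with "THIS"
```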
I have a large number of text files to read from in Python. Each file is structured as the following sample:
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is a multiline abstract of the paper)
blablabla
blablabla
\\
I would like to automatically extract and store (e.g., as a list) the Title, Authors, and abstract (the text between the second and third \\ - note that it starts with an indent) from each text file. Also note that the white line between Date (revised) and Title is really there (it is not a typo that I introduced).
My attempts so far have involved (I am showing the steps for a single text file, say the first file in the list):
filename = os.listdir(path)[0]
test = pd.read_csv(filename, header=None, delimiter="\t")
Which gives me:
0
0 ----------------------------------------------...
1 \\
2 Paper: some_integer
3 From: <some_email_address>
4 Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
5 Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6 Title: some_title...
7 Authors: name_1, name_2
8 Comments: 28 pages, JHEP latex
9 Report-no: DUKE-CGTP-00-01
10 \\
11 blabla...
12 blabla...
13 blabla...
14 \\
I can then select a given row (e.g., the one featuring the title) with:
test[test[0].str.contains("Title")].to_string()
But it is truncated, it is not a clean string (some attributes show up), and I find this entire pandas-based approach quite tedious, actually... There must be an easier way to directly select the rows of interest from the text file using regex. At least I hope so...
You could process it line by line:
import re

data = {}
abstract_lines = []
with open('myfile.txt', 'r') as infile:
    for line in infile:
        if ':' in line:
            # split on the first ':' only, so values containing ':' (e.g. 12:08:13) stay intact
            key, value = line.split(':', 1)
            data[key.strip()] = value.strip()
        elif re.search(r'\w', line):
            # a line with no ':' but with word characters belongs to the abstract
            abstract_lines.append(line.strip())
data['abstract'] = ' '.join(abstract_lines)
How about iterating over each line in the file, splitting on the first ": " if it is present in the line, and collecting the results of the splits in a dictionary:
with open("input.txt") as f:
    data = dict(line.strip().split(": ", 1) for line in f if ": " in line)
As a result, the data would contain:
{
'Comments': '28 pages, JHEP latex',
'Paper': 'some_integer',
'From': '<some_email_address>',
'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT (27kb)',
'Title': 'some_title',
'Date': 'Wed, 4 Apr 2001 12:08:13 GMT (27kb)',
'Authors': 'name_1, name_2'
}
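The key/value split above does not capture the abstract. Since the abstract sits between the second and third `\\` lines, one way to pick it up (a sketch, assuming each `\\` separator sits on its own line) is:

```python
import re

# A miniature stand-in for one of the files described in the question
text = ("---------------\n"
        "\\\\\n"
        "Paper: some_integer\n"
        "Title: some_title\n"
        "\\\\\n"
        "  blablabla line one\n"
        "  blablabla line two\n"
        "\\\\\n")

# Split on lines consisting of exactly two backslashes;
# the abstract is the block between the 2nd and 3rd separators
blocks = re.split(r'^\\\\\s*$', text, flags=re.M)
abstract = ' '.join(line.strip() for line in blocks[2].splitlines() if line.strip())
print(abstract)  # blablabla line one blablabla line two
```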
If your files really always have the same structure, you could come up with:
# -*- coding: utf-8 -*-
import re
string = """
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is the abstract of the paper)
\\
"""
rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]       # Title at the beginning of a line
    Authors:\s(?P<authors>.+)[\n\r]    # Authors: ...
    Comments:\s(?P<comments>.+)[\n\r]  # ... and so on ...
    .*[\n\r]
    (?P<abstract>.+)""",
    re.MULTILINE | re.VERBOSE)  # MULTILINE so that the caret matches at any line start,
                                # VERBOSE to allow this commented layout
for match in rx.finditer(string):
    print(match.group('title'), match.group('authors'), match.group('abstract'))
    # some_title name_1, name_2 blablabla (this is the abstract of the paper)
This approach takes Title as the anchor (beginning of a line) and skims the text afterwards. The named groups may not really be necessary but make the code easier to understand. The pattern [\n\r] looks for newline characters.
See a demo on regex101.com.
This pattern will get you started:
\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\
Assume 'txtfile.txt' is in the format shown at the top. If using Python 2.7.x:
import re

with open('txtfile.txt', 'r') as f:
    input_string = f.read()

p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print re.findall(p, input_string)
Output:
[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n blablabla\n blablabla')]
I have a Python script that generates a CSV (data parsed from a website).
Here is an example of the CSV file:
File1.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
China;Beijing;BeiwaiOnline BFSU;;;
Italy;Curno;Bergamo, Anderson House;;Yes;
File2.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
This;Is;A;New;Line;;
Italy;Curno;Bergamo, Anderson House;;Yes;
As you can see:
China;Beijing;BeiwaiOnline BFSU;;; ==> this line from File1.csv is no longer present in File2.csv, and This;Is;A;New;Line;; ==> this line from File2.csv is new (it is not present in File1.csv).
I am looking for a way to compare these two CSV files (one important thing to know is that the order of the lines doesn't matter; they can be anywhere in the file).
What I'd like to have is a script which can tell me:
- One new line : This;Is;A;New;Line;;
- One removed line : China;Beijing;BeiwaiOnline BFSU;;;
And so on ... !
I've tried, but without any success:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import csv

f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
now = [row for row in c2]
past = [row for row in c1]
for row in now:
    #print row
    lol = past.index(row)
    print lol
f1.close()
f2.close()
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Any idea of the best way to proceed ? Thank you so much in advance ;)
EDIT:
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
s1 = set(c1)
s2 = set(c2)
lol = s1 - s2
print type(lol)
print lol
This seems to be a good idea but :
Traceback (most recent call last):
File "compare.py", line 20, in <module>
s1 = set(c1)
TypeError: unhashable type: 'list'
EDIT 2 (please disregard the above):
With your help, here is the script I'm writing:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import csv

### COMPARISON THING ###
x = 0
fichiers = os.listdir('/me/CSV')
for fichier in fichiers:
    if '.csv' in fichier:
        print('%s -----> %s' % (x, fichier))
        x = x + 1
choice = raw_input("Which file do you want to compare with the new output ? ->>>")
past_file = fichiers[int(choice)]
print 'We gonna compare %s to our output' % past_file
s_now = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/now.csv', 'r'), delimiter=';'))  ## OUR OUTPUT
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/' + past_file, 'r'), delimiter=';'))  ## CHOOSEN ONE
added = [";".join(row) for row in s_now - s_past]  # in "now" but not in "past"
removed = [";".join(row) for row in s_past - s_now]  # in "past" but not in "now"
c = csv.writer(open("CHANGELOG.csv", "a"), delimiter=";")
line = ['AD']
for item_added in added:
    line.append(item_added)
    c.writerow(['AD', item_added])
line = ['RM']
for item_removed in removed:
    line.append(item_removed)
c.writerow(line)
Two kinds of errors:
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: line contains NULL byte
or
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: newline inside string
It was working a few minutes ago, but I changed the CSV files to test with different data and here I am :-)
Sorry, last question!
If your data is not prohibitively large, loading them into a set (or frozenset) will be an easy approach:
s_now = frozenset(tuple(row) for row in csv.reader(open('now.csv', 'r'), delimiter=';'))
s_past = frozenset(tuple(row) for row in csv.reader(open('past.csv', 'r'), delimiter=';'))
To get the list of entries that were added:
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
# Or, simply "added = list(s_now - s_past)" to keep them as tuples.
Similarly, the list of entries that were removed:
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
To address your updated question on why you're seeing TypeError: unhashable type: 'list': the csv reader returns each row as a list when iterated. Lists are not hashable and therefore cannot be inserted into a set.
To address this, you'll need to convert the list entries into tuples before adding them to the set. See the previous section in my answer for an example of how this can be done.
To address the additional errors you're seeing, they are both due to the content of your CSV files.
_csv.Error: newline inside string
It looks like you have quote characters (") somewhere in the data, which confuses the parser. I'm not familiar enough with the csv module to tell you exactly what has gone wrong, not without having a peek at your data anyway.
I did however manage to reproduce the error as such:
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";")]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: newline inside string
In this case, it can be fixed by instructing the reader not to do any special processing of quotes (see csv.QUOTE_NONE). (Do note that this disables the handling of quoted data, whereby delimiters can appear within a quoted string without the string being split into separate entries.)
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";", quoting=csv.QUOTE_NONE)]
[['hello', 'wo', '"rld']]
_csv.Error: line contains NULL byte
I'm guessing this might be down to the encoding of your CSV files. See the following questions:
Python CSV error: line contains NULL byte
"Line contains NULL byte" in CSV reader (Python)
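A common cause of NULL bytes is a file that was saved in a 16-bit encoding such as UTF-16 (this is only a guess; check your files' actual encoding). A small sketch reproducing the error and fixing it by decoding with the right codec first:

```python
import csv
import io

raw = 'a;b;c\r\nd;e;f\r\n'.encode('utf-16')  # UTF-16 interleaves \x00 bytes

# Treating the bytes as 8-bit text exposes the NULL bytes to the csv module
try:
    list(csv.reader(io.StringIO(raw.decode('latin-1')), delimiter=';'))
except csv.Error as exc:
    print(exc)  # the NULL-byte error from the question

# Decoding with the correct codec first makes the data parse normally
rows = list(csv.reader(io.StringIO(raw.decode('utf-16')), delimiter=';'))
print(rows)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```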
Read the csv files line by line into sets. Compare the sets.
>>> s1 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... China;Beijing;BeiwaiOnline BFSU;;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s2 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... This;Is;A;New;Line;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s1 - s2
set(['China;Beijing;BeiwaiOnline BFSU;;;'])
>>> s2 - s1
set(['This;Is;A;New;Line;;'])
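Applied to files on disk rather than inline strings, the same line-set comparison looks like this (a sketch; the two sample files are created in a temp directory here only to make it self-contained). Skipping the csv module entirely also sidesteps the NULL-byte and quoting errors, as long as lines only need to match verbatim:

```python
import os
import tempfile

# Stand-ins for now.csv and past.csv
tmp = tempfile.mkdtemp()
now_path = os.path.join(tmp, 'now.csv')
past_path = os.path.join(tmp, 'past.csv')
with open(now_path, 'w') as f:
    f.write('Italy;Curno;Bergamo, Anderson House;;Yes;\nThis;Is;A;New;Line;;\n')
with open(past_path, 'w') as f:
    f.write('Italy;Curno;Bergamo, Anderson House;;Yes;\nChina;Beijing;BeiwaiOnline BFSU;;;\n')

# Read each file as a set of raw lines; line order is irrelevant in a set
with open(now_path) as f1, open(past_path) as f2:
    s_now = set(f1.read().splitlines())
    s_past = set(f2.read().splitlines())

print(s_now - s_past)  # added lines
print(s_past - s_now)  # removed lines
```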