I have a .pst (Outlook) file containing old emails and email contacts (around 3980 of them), which I'd like to export to a machine-readable format.
Outlook 2016 already has an option to export the contacts to a .csv file, but after the export is performed, one can see that the file is not structured properly. The "Notes" field may contain a message with multiple newline characters. This, in turn, breaks the .csv format, since every entry should start with the value of the first contact field (but in these cases, the lines contain the successive content of the mentioned "Notes" field). When the "Notes" field ends, the next line usually contains the rest of the values of the entry.
Example csv output:
"Title","First Name",... <- header field values of the exported .csv
"","John","","Travolta","","ValueX","","","ValueY",,,"ValueZ",... <- start of the contact entry
www.link1.com <- start of the "Notes" field (same contact)
.................. <- "Notes" field continued (same contact)
www.link2.com <- "Notes" field continued (same contact)
................... <- "Notes" field continued (same contact)
"asd","asdas","asdasd","asdasd" <- rest of the contact fields (same contact)
"","Nicolas","Cage","","","ValueX","","","ValueY",,,"ValueZ",... <- 2nd contact (in one line)
I'd like to fix the formatting of the exported file, so that the "Notes" field does not stretch across multiple lines and each contact is represented as a single line in the file.
I think I have two options here:
write a script (Python) that goes over the lines and fixes the formatting (I'd like to avoid this, since the script might overlook something);
find an API for parsing .pst files and try to serialize the contacts in a suitable format (specifying how to serialize the "Notes" field manually).
Does anybody know if I'm overlooking something and whether this could be solved in an easier way?
Kind regards.
EDIT: I'm talking about this issue.
The file exported from Outlook is not broken, although it may look like it is. In a CSV file, a newline character inside quotes is considered part of the cell, so if cells contain newlines, a single "row" will be loaded from multiple lines in the file.
For example, say a CSV row has four cells: a, b, c and d. It would look like:
a,b,c,d
Now change c to be c1\nc2, i.e. it has a newline in it:
a,b,"c1
c2",d
The cell is now quoted and spans multiple lines. The standard Python csv module is able to parse this correctly, including a standard CSV contacts file exported from Outlook.
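A quick round trip with the csv module (not part of the original example, just a sanity check) shows the quoted newline being written and read back as a single cell:
import csv
import io

# Write one row whose third cell contains a newline.
buf = io.StringIO()
csv.writer(buf).writerow(['a', 'b', 'c1\nc2', 'd'])
print(buf.getvalue())          # a,b,"c1
                               # c2",d

# Reading it back gives a single row of four cells again.
buf.seek(0)
print(next(csv.reader(buf)))   # ['a', 'b', 'c1\nc2', 'd']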
The following displays a name and home address from each contact given a standard contacts CSV file exported from Outlook:
import csv

with open('contacts.csv', 'r', newline='') as f_contacts:
    csv_contacts = csv.DictReader(f_contacts)
    for contact in csv_contacts:
        # Name fields are plain single-line values.
        print(contact['First Name'], contact['Last Name'])
        # The street fields may themselves contain trailing newlines,
        # so collapse any doubled blank lines after joining them.
        print("{}{}{}".format(contact['Home Street'],
                              contact['Home Street 2'],
                              contact['Home Street 3']).replace('\n\n', '\n'))
        print()
This assumes Python 3.x and was tested with a CSV file exported directly from Outlook.
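If you really do need one physical line per contact (for example, for a downstream tool that cannot handle quoted newlines), you can let the csv module do the parsing and flatten the fields yourself. This is only a minimal sketch; the output name flat_contacts.csv and the "; " separator are arbitrary choices:
import csv

# Read the Outlook export and write every contact back out as a single
# physical line, replacing newlines inside each field with "; ".
with open('contacts.csv', 'r', newline='') as f_in, \
        open('flat_contacts.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        writer.writerow([field.replace('\r\n', '; ').replace('\n', '; ')
                         for field in row])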
I am about to give up on this; the due date is almost here. I enrolled in a regex class this summer (biggest mistake of my life), and we have this project where we choose an old piece of software and make updates to it. I'm almost done with everything except this: I have a .txt file containing a database of monster attributes.
Anyway, the logic is that each line holds the columns/keys, separated by commas, and we need to delete/add/reposition the columns using any available tool (regex is the only thing I know of that can help me; do you know of anything else?).
Here is the OLD form:
ID,Name,JName,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,ExpPer,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
First, delete the 7th column from the last (deleting all ExpPer entries):
Results to:
ID,Name,JName,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
Second, duplicate the JName column into the next column:
Results to:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per
Third, pull the last 7 columns and insert them starting at the 31st column, i.e. change ...,dMotion,Drop1id,Drop1per,... to ...,dMotion,MEXP,...,MVP3per,Drop1id,...
Results to:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per
Fourth, finally, append these columns at the end: ,0,0,DONE,1:
Results to:
ID,Name,JName,Jname,LV,HP,SP,EXP,JEXP,Range1,ATK1,ATK2,DEF,MDEF,STR,AGI,VIT,INT,DEX,LUK,Range2,Range3,Scale,Race,Element,Mode,Speed,ADelay,aMotion,dMotion,MEXP,MVP1id,MVP1per,MVP2id,MVP2per,MVP3id,MVP3per,Drop1id,Drop1per,Drop2id,Drop2per,Drop3id,Drop3per,Drop4id,Drop4per,Drop5id,Drop5per,Drop6id,Drop6per,Drop7id,Drop7per,Drop8id,Drop8per,0,0,DONE,1
Hence, after running however many regex search/replace operations are needed,
the original:
1052,ROCKER,Rocker,9,198,0,20,16,1,24,29,5,10,1,9,18,10,14,15,10,12,1,4,22,129,200,1864,864,540,940,5000,909,5500,2298,4,1402,80,520,10,752,5,703,3,4021,10,0,0,0,0,0,0,0,0
would result to:
1052,ROCKER,Rocker,Rocker,9,198,0,20,16,1,24,29,5,10,1,9,18,10,14,15,10,12,1,4,22,129,200,1864,864,540,0,0,0,0,0,0,0,940,5000,909,5500,2298,4,1402,80,520,10,752,5,703,3,4021,10,0,0,DONE,1
I hope somebody can help me; there are 500+ monsters in this old database .txt file.
Thanks!
Microsoft Excel has a Text Import Wizard to import CSV data from any text file into an empty Excel worksheet. For small CSV files, this wizard can be used to load the data, then delete/move/copy columns, and finally save the modified data back to a file in CSV format.
But the question was about reformatting the CSV file using a text editor with regular expressions.
I used UltraEdit v21.20 with the Perl regular expression engine selected, but the expressions below should work in any text editor supporting Perl regular expressions. The search and replace strings should also work with Python.
Important:
The regular expressions below work only if the CSV file does not contain commas inside double-quoted values.
First, delete the 7th column from the last (deleting all ExpPer entries):
Search: ,[^,\r\n]*?(,(?:[^,\r\n]*?,){5}[^,\r\n]*)$
Replace: \1
Second, duplicate the JName column into the next column:
Search: ^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)
Replace: \1\2\2
Third, pull the last 7 columns and insert them starting at the 31st column:
Search: ^((?:[^,\r\n]*?,){30})((?:[^,\r\n]*?,){15}[^,]*?),((?:[^,\r\n]*?,){6}[^,\r\n]*)$
Replace: \1\3,\2
Fourth, finally, append ,0,0,DONE,1 at the end:
Search: (.)$
Replace: \1,0,0,DONE,1
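For reference, the same four steps can be scripted with Python's re module, applying the replaces in order to every line. This is a rough sketch under the same no-commas-in-quotes assumption; the file names mob_db.txt and mob_db_new.txt are just placeholders:
import re

# The four search/replace pairs from above, applied in order to each line.
steps = [
    # 1. Delete the 7th column from the last (ExpPer).
    (r',[^,\r\n]*?(,(?:[^,\r\n]*?,){5}[^,\r\n]*)$', r'\1'),
    # 2. Duplicate the JName column into the next column.
    (r'^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)', r'\1\2\2'),
    # 3. Move the last 7 columns to start at the 31st column.
    (r'^((?:[^,\r\n]*?,){30})((?:[^,\r\n]*?,){15}[^,]*?),((?:[^,\r\n]*?,){6}[^,\r\n]*)$',
     r'\1\3,\2'),
    # 4. Append ,0,0,DONE,1 at the end.
    (r'(.)$', r'\1,0,0,DONE,1'),
]

with open('mob_db.txt') as f_in, open('mob_db_new.txt', 'w') as f_out:
    for line in f_in:
        line = line.rstrip('\r\n')
        for pattern, repl in steps:
            line = re.sub(pattern, repl, line)
        f_out.write(line + '\n')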
But those 4 replaces can also be done with a single regular expression replace:
Search: ^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)((?:[^,\r\n]*?,){26})((?:[^,\r\n]*?,){16})([^,\r\n]*?,)[^,\r\n]*?,((?:[^,\r\n]*?,){5}[^,\r\n]*)$
Replace: \1\2\2\3\5\6,\40,0,DONE,1
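For completeness, the single-pass replace also translates to Python; the only catch is that in re.sub the group-4 reference followed by a literal 0 must be written as \g<4>0, because \40 would be read as a reference to group 40. Again a rough sketch with placeholder file names:
import re

# Single-pass version of the combined search/replace above.
pattern = re.compile(
    r'^((?:[^,\r\n]*?,){2})([^,\r\n]*?,)((?:[^,\r\n]*?,){26})'
    r'((?:[^,\r\n]*?,){16})([^,\r\n]*?,)[^,\r\n]*?,'
    r'((?:[^,\r\n]*?,){5}[^,\r\n]*)$'
)
# \g<4> instead of \4 so the following literal 0 is not taken as part of
# the group number.
replacement = r'\1\2\2\3\5\6,\g<4>0,0,DONE,1'

with open('mob_db.txt') as f_in, open('mob_db_new.txt', 'w') as f_out:
    for line in f_in:
        f_out.write(pattern.sub(replacement, line.rstrip('\r\n')) + '\n')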