I have a .pst (Outlook) file, which contains old emails and email contacts (around 3980 of them), which I'd like to export to a machine-readable format.
Outlook 2016 already has an option to export the contacts to a .csv file, but after the export, the file does not appear to be structured properly. The "Notes" field may contain a message, which might contain multiple newline characters. This, in turn, breaks the .csv format, since every line should start with the value of the first contact field (but in these cases, the lines hold the successive content of the "Notes" field). When the "Notes" field is finished, the next line usually contains the rest of the values of the entry.
Example csv output:
"Title","First Name",... <- header field values of the exported .csv
"","John","","Travolta","","ValueX","","","ValueY",,,"ValueZ",... <- start of the contact entry
www.link1.com <- start of the "Notes" field (same contact)
.................. <- "Notes" field continued (same contact)
www.link2.com <- "Notes" field continued (same contact)
................... <- "Notes" field continued (same contact)
"asd","asdas","asdasd","asdasd" <- rest of the contact fields (same contact)
"","Nicolas","Cage","","","ValueX","","","ValueY",,,"ValueZ",... <- 2nd contact (in one line)
I'd like to fix the formatting of the exported file, so the "Notes" field would not stretch across multiple lines and each contact would be represented in the file as a single line.
I think I have two options here:
write a script (Python), which goes over the lines and fixes the formatting (I'd like to avoid doing this, since the script might overlook something).
find an API for parsing .pst files and try to serialize the contacts in a suitable format (by specifying how to serialize the "Notes" field manually).
Does anybody know if I'm overlooking something and if this could be solved in an easier way?
Kind regards.
EDIT: I'm talking about this issue.
The file exported from Outlook is not broken, although it may look like it is. A newline character inside quotes is considered part of the cell. So if cells contain newlines, a single "row" will be loaded from several lines in the file.
For example for a CSV say you have four cells in one row, a, b, c and d. This would look like:
a,b,c,d
Now change c to be c1\nc2, i.e. it has a newline in it:
a,b,"c1
c2",d
The cell is now quoted and appears on multiple lines. The standard Python CSV library will be able to correctly parse this, including a standard Outlook exported CSV contact file.
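As a quick check of the four-cell example above, here is a minimal sketch that parses it from an in-memory string (the variable names are only for illustration):

```python
import csv
import io

# The four-cell example from above; the third cell contains a newline.
raw = 'a,b,"c1\nc2",d\n'

# newline='' leaves line endings untouched so the csv module can handle them.
rows = list(csv.reader(io.StringIO(raw, newline='')))

print(len(rows))  # number of parsed rows
print(rows[0])    # the cells of the first row
```

Although the text spans two physical lines, the reader should hand back a single row with four cells.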
The following displays a name and home address from each contact given a standard contacts CSV file exported from Outlook:
import csv

with open('contacts.csv', 'r', newline='') as f_contacts:
    csv_contacts = csv.DictReader(f_contacts)
    for contact in csv_contacts:
        print(contact['First Name'], contact['Last Name'])
        print("{}{}{}".format(contact['Home Street'], contact['Home Street 2'], contact['Home Street 3']).replace('\n\n','\n'))
        print()
This assumes you are using Python 3.x and was tested using a CSV file exported directly from Outlook.
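If you still want every contact on a single physical line, one possible approach is to read the file with the csv module and write it back with the line breaks inside each cell replaced. This is only a sketch: the helper name `flatten_rows`, the space used as a replacement, and the sample data are assumptions, not part of the Outlook export.

```python
import csv
import io


def flatten_rows(reader, writer, replacement=" "):
    """Copy rows, replacing line breaks inside cells so each record fits on one line."""
    for row in reader:
        writer.writerow([replacement.join(cell.splitlines()) for cell in row])


# In-memory demonstration with made-up data; with real files, open both
# the input and the output file with newline=''.
source = io.StringIO(
    '"First Name","Notes"\r\n"John","www.link1.com\r\nwww.link2.com"\r\n',
    newline='',
)
target = io.StringIO(newline='')
flatten_rows(csv.reader(source), csv.writer(target, quoting=csv.QUOTE_ALL))
print(target.getvalue())
```

Replacing the line breaks loses the original layout of the notes, so a placeholder other than a space may be preferable if the notes need to be restored later.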
As the title mentions, my issue is that I don't understand quite how to extract the data I need for my table (The columns for the table I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.)
I think regex is what I need but my class did not go over this, so I am confused on how to parse in order to extract and output the correct data into an organized table...
I am supposed to turn my text file from this
https://pastebin.com/ZM8EPu0p
and export it into a more readable format like this- example output is below
Here is what I have so far.
def readFile(court):
    csv_rows = []
    entry = {}  # fields collected for the current record
    # read the txt file and split it into chunks of data by paragraph
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")
    for chunk in data_chunks:
        chunk = chunk.strip()  # .strip() removes leading/trailing whitespace
        if chunk[:4].isnumeric():  # if the first 4 characters are digits
            entry = {}  # initialize an empty dictionary
        elif not chunk and entry:  # empty chunk and the entry dict is not empty
            csv_rows.append(entry)  # store the finished entry as a row
            entry = {}
        else:
            # parse here?
            print(chunk)
    return csv_rows
readFile("/Users/mia/Desktop/School/programming/court.txt")
It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks.
First, your input looks like a text file, so you could parse it line by line using readlines(): https://www.w3schools.com/python/ref_file_readlines.asp
Then, I noticed that your data can be split into pages. You would need to prepare a lot of regular expressions, but you can start with one that identifies where each page starts. You may want to read this, as your expressions might get quite complicated: https://www.w3schools.com/python/python_regex.asp
The goal of this step is to collect all lines from a page in some container (a list, a dict, whatever you find suitable).
Afterwards, write some code that parses the information page by page. For simplicity, I suggest starting with something easy, like the columns for "no, file number and defendant".
Once you can extract some data reliably, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
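The page-splitting step could be sketched roughly as follows. The `PAGE_START` pattern is purely hypothetical (I am assuming each page begins with a line like "Page 1"), so it has to be adapted to whatever actually marks a new page in your file; `split_pages` and the sample lines are likewise made up for illustration.

```python
import re

# Hypothetical page header; the real pattern depends on the input file.
PAGE_START = re.compile(r"^\s*Page\s+\d+", re.IGNORECASE)


def split_pages(lines, page_start=PAGE_START):
    """Group lines into pages; a new page begins at each line matching page_start."""
    pages = []
    for line in lines:
        if page_start.match(line) or not pages:
            pages.append([])
        pages[-1].append(line.rstrip("\n"))
    return pages


sample = ["Page 1\n", "first line\n", "second line\n", "Page 2\n", "third line\n"]
pages = split_pages(sample)
print(len(pages))
for page in pages:
    print(page)
```

With a real file, the open file object can be passed in directly, since iterating over it yields one line at a time. Each page is then a plain list of lines that further regular expressions can work on.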
train, test = data.TabularDataset.splits(path="./data/", train="train.csv",test="test.csv",format="csv",fields=[("Tweet",TEXT), ("Affect Dimension",LABEL)])
I have this code and want to evaluate, if the loaded data is correct or if it's using wrong columns for the actual text fields etc.
If my file has the columns "Tweet" for the texts and "Affect Dimension" for the class name, is it correct to put them like this in the fields section?
Edit: TabularDataset includes an Example object, in which the data can be read. When reading csv files, only a "," is accepted as a delimiter. Everything else will result in corrupted data.
You can use any field name, irrespective of what your file has. Also, I recommend not using whitespace in the field names.
So, rename Affect Dimension to Affect_Dimension or anything convenient for you.
Then you can iterate over different fields like below to check the read data.
for i in train.Tweet:
    print(i)
for i in train.Affect_Dimension:
    print(i)
for i in test.Tweet:
    print(i)
for i in test.Affect_Dimension:
    print(i)
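As far as I know, a list of fields is matched to the CSV columns by position rather than by name, so it can also help to check the column order of the file before loading it. A minimal sketch using only the standard library (the helper `header_matches` and the sample data are made up for illustration and are not part of torchtext):

```python
import csv
import io


def header_matches(f, expected_columns):
    """Return True if the first CSV row equals expected_columns, in order."""
    header = next(csv.reader(f), [])
    return [name.strip() for name in header] == list(expected_columns)


# Made-up sample standing in for train.csv.
sample = io.StringIO("Tweet,Affect Dimension\nsome text,joy\n", newline="")
print(header_matches(sample, ["Tweet", "Affect Dimension"]))
```

For a real file, pass an object opened with open("./data/train.csv", newline="") instead of the in-memory sample.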
I have a dataset and some data. When user selects particular data, the relevant rows/columns of it should be displayed. I have a csv file with different professions, their average pay, locations and skills needed. Now if the user selects a profession, everything linked to this profession tuple should be displayed.
Example: the columns of row are : Lawyer, $45000, US and Canada, Degree in law.
Now if the user selects lawyer as the profession, the various values like $45000, US and Canada should be displayed one after the other. How can I do this directly from a CSV file?
I will design this part of the website in Python Flask.
To answer your question of whether you can do this directly from a CSV file: I don't think there is a way to validate information or search records directly in a CSV. You could, however, try one of the following approaches.
One way this can be done is by simply opening the CSV file, saving each line/record as a sublist, and appending it to a parent list. Add the following code snippet to your application:
import csv

final_list = []
with open('your_file.csv', 'r', newline='') as f:
    for record in csv.reader(f):
        final_list.append(record)  # each record is a list of column values
print(final_list)
# [['Lawyer', '$45000', 'US and Canada', 'Degree in law'],
#  ['Doctor', ...], ...]
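Once the records are in memory, looking up a selected profession could be sketched like this. I am assuming the profession is the first column and that there is no header row; `load_professions` and the second sample row are made up for illustration.

```python
import csv
import io

# Made-up sample data; the column order (profession, pay, locations, skills)
# is assumed from the example in the question.
SAMPLE_CSV = """Lawyer,$45000,US and Canada,Degree in law
Doctor,$60000,UK,Degree in medicine
"""


def load_professions(f):
    """Map a lower-cased profession name to the remaining values of its row."""
    professions = {}
    for row in csv.reader(f):
        if row:  # skip blank lines
            professions[row[0].strip().lower()] = [cell.strip() for cell in row[1:]]
    return professions


professions = load_professions(io.StringIO(SAMPLE_CSV, newline=""))
details = professions.get("lawyer")  # None if the profession is unknown
print(details)
```

In a Flask view, the selected profession would come from the request, and the resulting list could be passed to a template for display.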
In case you want to use a Python module, check this: someone has already answered a similar question.
Hope this helps :)
The data I pull from DB comes in the following format:
+jacket
online trading account
+neptune
When I write this data to a CSV I end up with a #NAME? error. I tried adding single quote ' to the front of the values when I pull the data, however, this does not fix the issue. I need to write the values exactly as they come, with the plus sign at the front.
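You simply need to format the desired output column as a text column in the spreadsheet application that opens the file (the #NAME? error appears because a leading + is interpreted as the start of a formula). This will result in: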
+jacket
online trading account
+neptune
being written to the file exactly as is. No more #NAME? errors.
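To confirm that the file itself keeps the plus sign, the values can be written and read back. A minimal sketch, assuming Python's csv module is used for writing (the question does not say which tool is used):

```python
import csv
import io

values = ["+jacket", "online trading account", "+neptune"]

# Write one value per row into an in-memory file.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows([value] for value in values)

# Read the content back and compare it with the original values.
buffer.seek(0)
read_back = [row[0] for row in csv.reader(buffer)]
print(read_back == values)
```

If the comparison holds, nothing needs to change on the writing side; only the column format on the reading side matters.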
I've just written my first script for my first proper job (rather proud).
Basically I parse a large xml file that contains 3 types of data: estates, symbols and types.
It then creates 3 .txt files listing all items of the 3 types, one file for estates, one for symbols and one for types.
What I need to do now is format the output for use in our internal wiki.
I want to be able to filter it with a drop down menu, so that I can select an "Estate" and see what "Symbols" are in the "Estate" and then see what "Type" these symbols are.
For scope there are ~50 estates, ~26 types, and about 93000 Symbols (They all vary from day to day).
A symbol belongs to an estate and each symbol has a type.
If you want any code snippets from either the xml doc or my current script, feel free to ask, I didn't want to dump a load of code in here.
EDIT:
Here is an example of how the XML is formatted, showing the symbol name, its estate and then its type
<Symbol SymbolName="<SYMNAME>" Estate="<ESTATENAME>" TickType="<TYPE>" />
Names have been omitted for confidentiality.
EDIT2:
Had the idea of using dictionaries to better sort the parsed data.
Eg.
dictionary1 = {symbol1[estate], symbol2[estate]}
dictionary2 = {symbol1[type], symbol2[type]}
TO CLARIFY: I have a bunch of data from an XML file that needs to be written to an output file in such a way that it can be filtered on a web page (drop-down menus preferable).
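A rough sketch of the dictionary idea, grouping symbols by estate and keeping each symbol's type. The <Symbols> root element and all names in the sample are placeholders I made up; only the attribute names come from the XML line shown earlier. JSON is just one possible output format that a web page could load to fill drop-down menus.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample; the <Symbols> root element and all names are placeholders.
SAMPLE_XML = """<Symbols>
  <Symbol SymbolName="SYM1" Estate="ESTATE_A" TickType="TYPE_X" />
  <Symbol SymbolName="SYM2" Estate="ESTATE_A" TickType="TYPE_Y" />
  <Symbol SymbolName="SYM3" Estate="ESTATE_B" TickType="TYPE_X" />
</Symbols>"""


def group_by_estate(xml_text):
    """Return {estate: {symbol: type}} built from the <Symbol> elements."""
    grouped = defaultdict(dict)
    root = ET.fromstring(xml_text)
    for symbol in root.iter("Symbol"):
        estate = symbol.get("Estate")
        grouped[estate][symbol.get("SymbolName")] = symbol.get("TickType")
    return dict(grouped)


estates = group_by_estate(SAMPLE_XML)
print(json.dumps(estates, indent=2, sort_keys=True))
```

Selecting an estate then gives its symbols, and each symbol maps directly to its type.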