python parsing file into data structure

So I started looking into it, and I haven't found a good way to parse a file following the format I will show you below. I have taken a data structures course, but it doesn't really help me with what I want to do. Any help will be greatly appreciated!
Goal: Create a tool that can read, create, and manipulate a custom file type
File Format: I'm sure there is a name for this type of format, but I couldn't find it. The format is subject to some change, since variable names can be added, removed, or renamed. Also, the data after each variable name could be one of several different types. Right now the files do not use subgroups, but I want to be prepared in case that changes. The only things I can count on remaining constant are the GROUP = groupName, END_GROUP = groupName, and varName = data lines.
GROUP = myGroup
name1 = String, datenum, number, list, array
name2 = String, datenum, number, list, array
// . . .
name# = String, datenum, number, list, array
GROUP = mySubGroup
name1 = String, datenum, number, list, array
END_GROUP = mySubGroup
// More names could go here
END_GROUP = myGroup
GROUP = myGroup2
// etc.
END_GROUP = myGroup2
Strings and dates are enclosed in double quotes (e.g. "myString")
Numbers are written as plain ASCII text, using E notation when they are very large or small (e.g. 5.023E-6)
Lists are comma-separated and enclosed in parentheses (e.g. (1,2,3,4) )
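To make the value rules concrete, classifying a single value could look something like this. This is only a sketch of what I have in mind; the function name and patterns are my own first guess, and nested lists or commas inside quoted strings are not handled:

```python
import re

QUOTED_RE = re.compile(r'^"(.*)"$')                           # strings and dates
NUMBER_RE = re.compile(r'^[+-]?\d+(\.\d+)?([Ee][+-]?\d+)?$')  # plain and E-format
LIST_RE = re.compile(r'^\((.*)\)$')                           # comma-separated lists

def parse_value(text):
    text = text.strip()
    m = LIST_RE.match(text)
    if m:
        # Recurse on each comma-separated element of the list
        return [parse_value(part) for part in m.group(1).split(',')]
    m = QUOTED_RE.match(text)
    if m:
        return m.group(1)      # keep dates as plain text for now
    if NUMBER_RE.match(text):
        return float(text)     # numbers behave as numbers, not strings
    return text                # fall back to the raw text
```

Since parse_value('5.023E-6') returns a float, arithmetic works directly on the result.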
Additional Info:
I want to be able to easily read a file and manipulate it as needed. For example, if I read the file and I want to change an attribute of a specific variable within a group I should be able to do something along the lines of dataStructure.groupName.varName = newData.
It should be easy to create my own file (using a default template that I will make myself or a custom template that has been passed in).
I want it to treat numbers as numbers and not strings: I should be able to add, subtract, multiply, etc. values within the data structure that are numbers.
The big kicker: I'd like to have this written in vanilla Python, since our systems have only the most basic modules. It is a huge pain for someone to download another module, since they have to create their own virtual environment and install the module into it. This tool should be as system independent as possible.
Initial Attempt: I was thinking of using a dictionary to organize the data in levels. I do, however, like the idea of dot access (like what one would see using MATLAB structures). I wrote a function that reads all the lines of the file and strips the newline character from each line. From there I want to check for every GROUP = I can find, and keep adding data to that group until I hit an END_GROUP line. Using regular expressions I should be able to parse each line to determine whether its value is a date, number, string, etc.
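To make the dot access concrete, types.SimpleNamespace from the standard library (so still vanilla Python) already gives attribute-style access, and a stack can track nested groups while scanning lines. This is only a sketch of the idea; values are kept as raw strings here rather than parsed into real types:

```python
from types import SimpleNamespace

def parse_groups(lines):
    """Sketch: nest SimpleNamespace objects from GROUP/END_GROUP lines."""
    root = SimpleNamespace()
    stack = [root]                          # stack[-1] is the innermost open group
    for line in lines:
        key, _, value = (part.strip() for part in line.partition('='))
        if key == 'GROUP':
            group = SimpleNamespace()
            setattr(stack[-1], value, group)
            stack.append(group)
        elif key == 'END_GROUP':
            stack.pop()                     # real code should check value matches
        elif key:
            setattr(stack[-1], key, value)  # raw string; parse it for real use
    return root

data = parse_groups([
    'GROUP = myGroup',
    'name1 = "hello"',
    'GROUP = mySubGroup',
    'name2 = 5.023E-6',
    'END_GROUP = mySubGroup',
    'END_GROUP = myGroup',
])
```

After parsing, data.myGroup.mySubGroup.name2 can be read and reassigned with plain attribute syntax, which is the MATLAB-style access I want.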
I am asking this question because I hope to have some insight on things I may be missing. I'd like for this tool to be used long after I've left the dev team which is why I'm trying to do my best to make it as intuitive and easy to use as possible. Thank you all for your help, I really appreciate it! Let me know if you need any more information to help you help me.
EDIT: To clarify what help I need, here are my two main questions I am hoping to answer:
How should I build a data structure to hold grouped data?
Is there an accepted algorithm for parsing data like this?

Related

Data inside a JSON has letters and numbers I do not need, how to get data I need in Python

I am looking at extracting data from a JSON file, but the values I need have extra letters and numbers before, and sometimes after, the part I want. I would like to know if it is possible to remove the characters I do not need. Here is an example of the data:
"most_common_aircraft":[{"planned_aircraft":"B738/L","dcount":4592},{"planned_aircraft":"H/B744/L","dcount":3639},{"planned_aircraft":"H/B77L/L","dcount":2579},{"planned_aircraft":"H/B772/L","dcount":1894},{"planned_aircraft":"H/B763/L","dcount":1661},{"planned_aircraft":"H/B748/L","dcount":1303},{"planned_aircraft":"B712/L","dcount":1289},{"planned_aircraft":"B739/L","dcount":1198},{"planned_aircraft":"H/B77W/L","dcount":978},{"planned_aircraft":"B738","dcount":957}]
"H/B77L/L , B752/L, A320/X, B738,"
All I am interested in is the main four letters/numbers: for example, instead of "H/B77L/L" I want just "B77L", and instead of "B752/L" I want "B752". The data is very mixed, so some entries have letters in front, some at the end, some both, and others are already in the format I want. Is there a way to remove the additional letters while extracting the data from a JSON file using Python? If not, since I am using Pandas, would it be better to extract them all to a dataframe and then compare it to another dataframe that has the correct sequence without the additional letters?
I have managed to find the answer and solve my problem. I will put it here to help others who may have a similar problem:
for entry in json_data['results']:
    for value in entry['most_common_aircraft']:
        for splitted_string in value['planned_aircraft'].split('/'):
            if len(splitted_string) == 4:
                value['planned_aircraft'] = splitted_string
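If the fixed length of 4 ever stops holding, a regex alternative could match the shape of the code instead. The pattern below assumes a letter followed by two digits and one alphanumeric character, which fits the examples in the question but is only my guess at the format:

```python
import re

def aircraft_type(code):
    """Extract the main type from strings like 'H/B77L/L' (pattern is a guess)."""
    m = re.search(r'\b([A-Z]\d{2}[A-Z0-9])\b', code)
    return m.group(1) if m else code

# aircraft_type('H/B77L/L') -> 'B77L'; aircraft_type('B738') -> 'B738'
```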

Using Python & NLP, how can I extract certain text strings & corresponding numbers preceding the strings from Excel column having a lot of free text?

I am relatively new to Python and very new to NLP (and nltk) and I have searched the net for guidance but not finding a complete solution. Unfortunately the sparse code I have been playing with is on another network, but I am including an example spreadsheet. I would like to get suggested steps in plain English (more detailed than I have below) so I could first try to script it myself in Python 3. Unless it would simply be easier for you to just help with the scripting... in which case, thank you.
Problem: A few columns of an otherwise robust spreadsheet are very unstructured with anywhere from 500-5000 English characters that tell a story. I need to essentially make it a bit more structured by pulling out the quantifiable data. I need to:
1) Search for a string in the user supplied unstructured free text column (The user inputs the column header) (I think I am doing this right)
2) Make that string a NEW column header in Excel (I think I am doing this right)
3) Grab the number before the string (This is where I am getting stuck. And as you will see in the sheet, sometimes there is no space between the number and text and of course, sometimes there are misspellings)
4) Put that number in the NEW column on the same row (Have not gotten to this step yet)
I will have to do this repeatedly for multiple keywords but I can figure that part out, I believe, with a loop or something. Thank you very much for your time and expertise...
If I'm understanding this correctly, first we need to obtain the numbers from the string of text.
cell_val = sheet1wb1.cell(row=rowNum,column=4).value
This will create a list containing every number in the string
new_ = [int(s) for s in cell_val.split() if s.isdigit()]
print(new_)
You can use the list to assign the values to the column.
Then assign the first number in the list to the 5th column:
sheet1wb1.cell(row=rowNum, column=5).value = str(new_[0])
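One caveat: isdigit() only finds numbers that stand alone between spaces, and step 3 of the question says there is sometimes no space between the number and the text. A regex can grab a number immediately preceding a keyword, with or without a space; the function name and the keyword used below are only illustrative:

```python
import re

def number_before(keyword, text):
    """Return the number immediately before `keyword` in `text`, else None.
    Matches integers or decimals, with or without a space before the keyword."""
    pattern = re.compile(r'(\d+(?:\.\d+)?)\s*' + re.escape(keyword), re.IGNORECASE)
    m = pattern.search(text)
    return float(m.group(1)) if m else None
```

Misspelled keywords would still be missed; handling those needs fuzzy matching on top of this.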
I think I have found what I am looking for. https://community.esri.com/thread/86096 has 3 or 4 scripts that seem to do the trick. Thank you..!

Store data based on location in a dataset python

Forgive me if this question is trivial, I am just having some trouble finding a solution online, and I'm a bit new to python. Essentially, I have a dataset which is full of various numbers all of which are arranged in this format:
6.1101,17.592
5.5277,9.1302
8.5186,13.662
I'm trying to write some Python to get the number on either side of the comma. I assume it's some type of splitting, but I can't seem to find anything that works for this problem specifically, since I want to take ALL the numbers on the left and store them in one variable, then take ALL the numbers on the right and store them in another. The goal is to plot the data points, and normally I would modify the data set, but it's a challenge problem, so I am trying to figure this out with the data as is.
Here's one way:
with open('mydata.csv') as f:
    lines = f.read().splitlines()

left_numbers, right_numbers = [], []
for line in lines:
    numbers = line.split(',')
    left_num = float(numbers[0])
    right_num = float(numbers[1])
    left_numbers.append(left_num)
    right_numbers.append(right_num)
Edit: added float conversion
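For what it's worth, the same result can be written more compactly with zip and argument unpacking; this is equivalent to the loop above, just wrapped in a function here so it can take any iterable of lines:

```python
def split_columns(lines):
    """Split 'left,right' lines into two lists of floats (same as the loop)."""
    pairs = (line.split(',') for line in lines if line.strip())
    left, right = zip(*((float(a), float(b)) for a, b in pairs))
    return list(left), list(right)

left_numbers, right_numbers = split_columns(['6.1101,17.592',
                                             '5.5277,9.1302',
                                             '8.5186,13.662'])
# left_numbers == [6.1101, 5.5277, 8.5186]
```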

How to correctly parse as text numbers separated by mixed commas and dots in excel file using Python?

I'm importing data coming from excel files that come from another office.
In one of the columns, for each cell, I have lists of numbers used as tags. These were manually inserted, by different people and (my guess) using computers with different thousands settings, so the result is very heterogeneous.
As an example I have:
tags= ['205', '306.3', '3,206,302','7.205.206']
If this was a CSV file (I tried converting one single file to check), using
pd.read_csv(my_file,sep=';')
would give me exactly the above mentioned list.
Unfortunately as said, we're talking about excel files (plural) and I have to deal with it, and using
pd.read_excel(my_file, sheetname=my_sheet, encoding='utf-16', converters={'my_column': str})
what I get instead is:
tags= ['205', '306.3', '3,206,302','7205206']
As you see, whenever the number can be expressed logically in thousands (so, not the second number in my list) the dot is recognised as a thousands separator and I get a single number, instead of three.
I tried reading documentation, and searching on stackoverflow and google, but the keywords to describe this problem are too vague and I didn't find a viable solution, yet.
How can I get the right list using excel files?
Thanks.
This problem is likely happening because pandas runs its number parser before its date parser.
One possible fix is to add a thousands separator. For example, if you are actually using ',' as your thousands separator, you could add thousands=',' in your excel reader:
pd.read_excel(my_file, sheetname=my_sheet, encoding='utf-16', thousands=',', converters={'my_column': str})
You could also pick an arbitrary thousands separator that doesn't exist in your data to keep the output unchanged, if thousands=None (which should be the default according to the documentation) doesn't already deal with your problem. You should also make sure that you are converting the fields to str (in which case using thousands is somewhat redundant, as it is not applied to strings anyway).
EDIT:
I tried using the following dummy data ('test.xlsx'):
a b c d
205 306.3 3,206,302 7.205.206
and with
dataf = pandas.read_excel('test.xlsx', header=0, converters={'a':str, 'b':str,'c':str,'d':str})
print(dataf.to_string())
I got the following output:
Columns: [205, 306.3, 3,206,302, 7.205.206]
Which is exactly what you were looking for. Are you sure you have the latest version of pandas and that you are in fact not using converters = {'col':int} or float in your converters keyword?
As it stands, it sounds like you are either converting your fields to numeric (int or float), or there is a problem elsewhere in your code. pandas read_excel seems to work as described, and I can get the results you specified with the code above. In other words: your code should work; if it doesn't, it might be due to an outdated pandas version, other parts of your code, or even problems with the source data. As it stands, it's not possible to answer your question further with the information you have provided.

Tips for reading in a complex file - Python

I have complex, variable text files that I want to read into Python, but I'm not sure what the best strategy would be. I'm not looking for you to code anything for me, just some tips about what modules would best suit my needs/tips etc.
The files look something like:
Program
Username: X Laser: X Em: X
exp 1
sample 1
Time: X Notes: X
Read 1 X data
Read 2 X data
# unknown number of reads
sample 2
Time: X Notes: X
Read 1 X data
...
# Unknown number of samples
exp 2
sample 1
...
# Unknown number of experiments, samples and reads
# The 4 spaces between certain words represent tabs
To analyse this data I need to get the data for each reading and know which sample and experiment it came from. Also, I can change the output file format but I think the way I have written it here is the easiest to read.
To read this file into Python, the best way I can think of would be to read it in row by row and search for keywords with regular expressions. For example, search the row for the "exp" keyword and record the number after it, then search for "sample" in the next line, and so on. Of course, this would not work if a keyword was used in the 'notes' section.
So, I'm kind of stumped as to what would best suit my needs (it's hard to use something if you don't know it exists!)
Thanks for your time.
It's a typical task for a syntactic analyzer. In this case:
lexical constructs do not cross line boundaries, and there is a single construct ("statement") per line; in other words, each line is a single statement
the full syntax for a single line can be covered by a set of regexes
the structure of compounds (entities connecting multiple "statements" into something bigger) is simple and straightforward
So a (relatively) simple scannerless parser based on lines, a DFA, and the aforementioned set of regexes can be applied:
set up the initial parser state (=current position relative to various entities to be tracked) and the parse tree (=data structure representing the information from the file in a convenient way)
for each line
classify it, e.g. by matching against the regexes applicable to the current state
use the matched regex's groups to get the line's statement's meaningful parts
using these parts, update the state and the parse tree
See "get the path in a file inside {} by python" for an example. There, I do not construct a parse tree (it wasn't needed) but only track the current state.
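Applied to the file in the question, the sketch below uses a small rule table of anchored regexes plus the current (experiment, sample) position as state. The tab-based indentation and exact wording of the lines are assumptions about the real files:

```python
import re

# Anchored regexes: a keyword buried inside a free-text Notes field cannot
# match, because each rule must match the whole line from the start.
RULES = [
    ('exp',    re.compile(r'^exp (\d+)$')),
    ('sample', re.compile(r'^\tsample (\d+)$')),
    ('read',   re.compile(r'^\t\tRead (\d+)\t(.*)$')),
]

def parse(lines):
    """Yield (experiment, sample, read_number, data) for every Read line."""
    exp = sample = None
    for line in lines:
        for kind, regex in RULES:
            m = regex.match(line)
            if not m:
                continue
            if kind == 'exp':
                exp, sample = int(m.group(1)), None
            elif kind == 'sample':
                sample = int(m.group(1))
            else:
                yield exp, sample, int(m.group(1)), m.group(2)
            break  # first matching rule wins; other lines are ignored

rows = list(parse(['exp 1', '\tsample 1', '\t\tRead 1\t0.5', '\t\tRead 2\t0.7']))
# rows == [(1, 1, 1, '0.5'), (1, 1, 2, '0.7')]
```

Each yielded tuple carries its experiment and sample numbers, so every reading stays traceable to where it came from.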
