I have a list, linelists, read from a file as shown below. How do I make a list of only the 3rd column of the output (the one starting with 0.002147)? This is what my output looks like:
Assuming linelists looks something like this:
l = ['col11\tcol21\tcol31\tcol41\tcol51',
'col12\tcol22\tcol32\tcol42\tcol52']
you can select the third column with
col3list = [line.split("\t")[2] for line in l]
which gives you ['col31', 'col32'].
Based on the image you provided, it is not clear if the columns are separated by a tab or multiple spaces. If the separators are spaces, you would need to change the argument of the split function.
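If the columns turn out to be separated by runs of spaces rather than tabs, a minimal sketch is to call split() with no argument, which splits on any run of whitespace. The sample lines below are made up for illustration; only the value 0.002147 comes from the question.

```python
# Hypothetical sample lines; the real file contents are not shown in the question.
lines = [
    "1   2.5    0.002147   7",
    "3   4.5    0.003000   9",
]

# split() with no argument splits on any run of whitespace (spaces or tabs).
col3list = [line.split()[2] for line in lines]
print(col3list)
```

The values are still strings at this point; wrap each one in float() if numbers are needed.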
Something similar to your question was also already answered here.
In general I would recommend loading files with the python library pandas, which can handle all kinds of data loading and selection tasks for you.
I am trying to extract data from a JSON file, but the data I need has numbers and letters before and sometimes after it. I would like to know if it is possible to remove the unnecessary numbers and letters that I do not need. Here is an example of the data:
"most_common_aircraft":[{"planned_aircraft":"B738/L","dcount":4592},{"planned_aircraft":"H/B744/L","dcount":3639},{"planned_aircraft":"H/B77L/L","dcount":2579},{"planned_aircraft":"H/B772/L","dcount":1894},{"planned_aircraft":"H/B763/L","dcount":1661},{"planned_aircraft":"H/B748/L","dcount":1303},{"planned_aircraft":"B712/L","dcount":1289},{"planned_aircraft":"B739/L","dcount":1198},{"planned_aircraft":"H/B77W/L","dcount":978},{"planned_aircraft":"B738","dcount":957}]
"H/B77L/L , B752/L, A320/X, B738,"
All I am interested in is the main 4 letters/numbers: for example, instead of "H/B77L/L" I want just "B77L", and instead of "B752/L" I want "B752". The data is very mixed, so some values have letters in front, some at the end and some both, and others are already in the format I want. Is there a way to remove the additional letters while extracting the data from the JSON file using Python? If not, since I am using pandas, would it be better to extract them all to a dataframe and then compare it to another dataframe that has the correct sequence without the additional letters?
I have managed to find the answer and solve my problem. I will put it here to help others who may have a similar problem:
for entry in json_data['results']:
    for value in entry['most_common_aircraft']:
        # Split on '/' and keep the part that is exactly 4 characters long
        for splitted_string in value['planned_aircraft'].split('/'):
            if len(splitted_string) == 4:
                value['planned_aircraft'] = splitted_string
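The same idea can be written as a small standalone helper, which makes it easier to try on individual values. This is only a sketch: the function name clean_aircraft_code is made up here, and it assumes the wanted code is always the first 4-character part.

```python
def clean_aircraft_code(code):
    """Return the first 4-character part of a '/'-separated code."""
    for part in code.split('/'):
        if len(part) == 4:
            return part
    # No 4-character part found: return the value unchanged.
    return code


for raw in ["H/B77L/L", "B752/L", "A320/X", "B738"]:
    print(raw, "->", clean_aircraft_code(raw))
```

Because the helper takes a plain string, it could also be applied to a pandas column after loading, if that turns out to be more convenient.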
I am relatively new to Python and very new to NLP (and nltk). I have searched the net for guidance but have not found a complete solution. Unfortunately the sparse code I have been playing with is on another network, but I am including an example spreadsheet. I would like to get suggested steps in plain English (more detailed than I have below) so I could first try to script it myself in Python 3. Unless it would simply be easier for you to just help with the scripting... in which case, thank you.
Problem: A few columns of an otherwise robust spreadsheet are very unstructured with anywhere from 500-5000 English characters that tell a story. I need to essentially make it a bit more structured by pulling out the quantifiable data. I need to:
1) Search for a string in the user supplied unstructured free text column (The user inputs the column header) (I think I am doing this right)
2) Make that string a NEW column header in Excel (I think I am doing this right)
3) Grab the number before the string (This is where I am getting stuck. And as you will see in the sheet, sometimes there is no space between the number and text and of course, sometimes there are misspellings)
4) Put that number in the NEW column on the same row (Have not gotten to this step yet)
I will have to do this repeatedly for multiple keywords but I can figure that part out, I believe, with a loop or something. Thank you very much for your time and expertise...
If I'm understanding this correctly, first we need to obtain the numbers from the string of text.
cell_val = sheet1wb1.cell(row=rowNum,column=4).value
This will create a list containing every whitespace-separated whole number in the string:
new_ = [int(s) for s in cell_val.split() if s.isdigit()]
print(new_)
You can then use the list to assign values to the column.
For example, write the first number in the list to the 5th column (index 0 is the first item; this assumes the list is not empty):
sheet1wb1.cell(row=rowNum, column=5).value = str(new_[0])
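Note that split() with isdigit() will miss a number that is glued to the following word, which the question says does happen. A rough sketch using a regular expression to grab the number directly before a keyword, with or without a space, is shown here. The function name number_before and the sample sentence are invented for illustration, and misspelled keywords are not handled.

```python
import re


def number_before(text, keyword):
    """Return the integer directly before keyword, or None if there is none."""
    # \d+ is the number, \s* allows zero or more spaces before the keyword.
    pattern = r"(\d+)\s*" + re.escape(keyword)
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


story = "We counted 12 trucks and 3cars near the gate."
print(number_before(story, "trucks"))
print(number_before(story, "cars"))
print(number_before(story, "boats"))
```

The returned value can then be written to the new column in the same way as above.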
I think I have found what I am looking for. https://community.esri.com/thread/86096 has 3 or 4 scripts that seem to do the trick. Thank you..!
So I started looking into it, and I haven't found a good way to parse a file following the format I will show you below. I have taken a data structures course, but it doesn't really help me with what I want to do. Any help will be greatly appreciated!
Goal: Create a tool that can read, create, and manipulate a custom file type
File Format: I'm sure there is a name for this type of format, but I couldn't find it. Anyways, the format is subject to some change since the variable names can be added, removed, or changed. Also, after each variable name the data could be one of several different types. Right now the files do not use sub groups, but I want to be prepared in case they decide to change that. The only things I can think of that will remain constant are the GROUP = groupName, END_GROUP = groupName, and the varName = data.
GROUP = myGroup
name1 = String, datenum, number, list, array
name2 = String, datenum, number, list, array
// . . .
name# = String, datenum, number, list, array
GROUP = mySubGroup
name1 = String, datenum, number, list, array
END_GROUP = mySubGroup
// More names could go here
END_GROUP = myGroup
GROUP = myGroup2
// etc.
END_GROUP = myGroup2
Strings and dates are enclosed in " (ie "myString")
Numbers are written as a raw ascii encoded number. They also use the E format if they are large or small (ie 5.023E-6)
Lists are comma separated and enclosed in parentheses (ie (1,2,3,4) )
Additional Info:
I want to be able to easily read a file and manipulate it as needed. For example, if I read the file and I want to change an attribute of a specific variable within a group I should be able to do something along the lines of dataStructure.groupName.varName = newData.
It should be easy to create my own file (using a default template that I will make myself or a custom template that has been passed in).
I want it to treat numbers as numbers and not strings. I should be able to add, subtract, multiply, etc. values within the data structure that are numbers
The big kicker, I'd like to have this written in vanilla python since our systems have only the most basic modules. It is a huge pain for someone to download another module since they have to create their own virtual environment and import the module to it. This tool should be as system independent as possible
Initial Attempt: I was thinking of using a dictionary to organize the data in levels. I do, however, like the idea of using dot structures (like what one would see using MATLAB structures). I wrote a function that will read all the lines of the file and remove the newline characters from each line. From there I want to check for every GROUP = I can find. I would start adding data to that group until I hit an END_GROUP line. Using regular expressions I should be able to parse out the line to determine whether it is a date, number, string, etc.
I am asking this question because I hope to have some insight on things I may be missing. I'd like for this tool to be used long after I've left the dev team which is why I'm trying to do my best to make it as intuitive and easy to use as possible. Thank you all for your help, I really appreciate it! Let me know if you need any more information to help you help me.
EDIT: To clarify what help I need, here are my two main questions I am hoping to answer:
How should I build a data structure to hold grouped data?
Is there an accepted algorithm for parsing data like this?
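One possible shape for the approach described under Initial Attempt is nested dictionaries plus a stack of currently open groups, using only the standard library. This is a sketch under assumptions, not a definitive parser: the names parse_value and parse_groups are made up, lines starting with // are treated as comments, and list items are assumed not to contain commas inside quoted strings.

```python
def parse_value(text):
    """Convert one raw value into a str, int, float, or list."""
    text = text.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]                      # quoted string or date
    if text.startswith('(') and text.endswith(')'):
        inner = text[1:-1].strip()
        return [parse_value(item) for item in inner.split(',')] if inner else []
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)                     # also handles 5.023E-6
    except ValueError:
        return text                            # unknown format: keep raw text


def parse_groups(lines):
    """Build nested dicts from GROUP / END_GROUP / name = value lines."""
    root = {}
    stack = [root]                             # innermost open group is last
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('//'):
            continue
        name, _, value = line.partition('=')
        name, value = name.strip(), value.strip()
        if name == 'GROUP':
            group = {}
            stack[-1][value] = group
            stack.append(group)
        elif name == 'END_GROUP':
            stack.pop()
        else:
            stack[-1][name] = parse_value(value)
    return root


sample = [
    'GROUP = myGroup',
    'name1 = "myString"',
    'name2 = 5.023E-6',
    'GROUP = mySubGroup',
    'name1 = (1,2,3,4)',
    'END_GROUP = mySubGroup',
    'END_GROUP = myGroup',
]
data = parse_groups(sample)
print(data['myGroup']['mySubGroup']['name1'])
```

Numbers come back as int or float, so arithmetic works directly on them. Dot access such as dataStructure.groupName.varName would need a thin wrapper class on top of these dictionaries.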
My Python script parses some text from an Excel file. It strips whitespace from the data and changes the delimiters
(from " : "--> " , ")
and my script outputs to a CSV file. Much of the data looks like this
(what data looks like in Excel)
Some values are shifted over by a single column because there is an extra comma or two.
CSV == Comma separated values.
I have tried using if statements to add or remove commas to try to shore it up, but that ends up completely messing up the relative order the data was originally in. It is driving me nuts!
To try it another way, I installed the pandas library (a data manipulation library) using pip.
Is it possible to merge columns that have no column headers inside a single DataFrame? There's plenty of advice regarding separate DataFrames but not much for a single one.
Furthermore how can I merge the columns while retaining the row position. The emails are in the correct row position but not the column position.
Or am I on the wrong track completely? Is pandas overkill for a simple parsing script? I've been learning Python as I go along to try to complete the script, so I might have missed a simple way of doing it.
Some sample data:
C5XXEmployeeNumXX,C5XXEmployeeNumXX,JohnSmith,1,,John,,Smith,,IT Supp.Centre,EU,,London1,,,59XXXX,ITServiceDesk,LOND01,,,,Notmaintained,,,,,,,,john.smith#company.com,
Snippet of parsing logic
for line in f:
    # finds the identifier for users
    if ':LON ' in line:
        # parsing logic.
        # Delimiters are swapped. Whitespace is scrubbed
        line = line.replace(':', ',')
        line = line.replace(' ', '')
You can use a separator/delimiter of your choice. Check out: https://docs.python.org/2/library/csv.html#csv.Dialect.delimiter.
Also, regarding the order: if you are reading each row into a list it should be fine, but if you are reading the contents of a row into a dict, keep in mind that a plain dict only preserves insertion order from Python 3.7 onwards.
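As a minimal sketch of the delimiter suggestion, the csv module can write rows with a delimiter other than a comma, so stray commas inside the data no longer shift columns. The semicolon and the sample row below are assumptions made for illustration; the row is a shortened version of the sample data in the question.

```python
import csv
import io

# Hypothetical row, shortened from the sample data in the question.
rows = [["C5XXEmployeeNumXX", "John", "Smith", "john.smith#company.com"]]

# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerows(rows)

output = buffer.getvalue()
print(output)
```

When writing to a real file, open it with newline="" as the csv documentation recommends.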
My preference would be for this to be in Python since I am working on learning more. If you can provide help in bash that would still be helpful, though.
I've looked around Stack Overflow and found some helpful things but not enough for me to finish this.
I have two CSV files with some shared fields. The data is not INT. I would like to join based on matching 3 specific fields and write it out to a new output.csv when all the processing is done.
sourceA.csv looks like this:
fieldname_1,fieldname_2,fieldname_3,fieldname_4,fieldname_5,fieldname_6,fieldname_7,fieldname_8,fieldname_9,fieldname_10,fieldname_11,fieldname_12,fieldname_13,fieldname_14,fieldname_15,fieldname_16
sourceB.csv looks like this:
fieldname_4,fieldname_5,fieldname_OTHER,fieldname_8,fieldname_16
As you can see, sourceB.csv has 4 field names that are also in sourceA.csv and one field name that does not. The data in fieldname_OTHER will need to replace the data in sourceA[fieldname_6].
The whole process should go like this:
Replace data in sourceA[fieldname_6] with data from sourceB[fieldname_OTHER] if all of the following criteria are met:
data in sourceA[fieldname_4]=sourceB[fieldname_4]
data in sourceA[fieldname_8]=sourceB[fieldname_8]
data in sourceA[fieldname_16]=sourceB[fieldname_16]
(The data in sourceB[fieldname_5] does not need to be evaluated.)
If the above criteria aren't met, just replace sourceA[fieldname_6] with the text ANY.
Write each processed line out to output.csv.
A sample of what I would like the output to be based on the input CSVs and processing outlined above:
dataA,dataB,dataC,dataD,dataE,dataOTHER,dataG,dataH,dataI,dataJ,dataK,dataL,dataM,dataN,dataO,dataP
I hope the details I've provided haven't made it more confusing than it needs to be. Thank you for all your help!
I'm not sure I'd bother with SQL for a one-off merge like this. It's straightforward in Python.
Read in both files with the csv module, to get two lists. Index sourceA into a dictionary whose key is the tuple of fields that need to be matched. You can then loop over sourceB, find the matching row instantly, and merge into it from sourceB.
When you're done, you can just output the list you read from sourceA: the dict and the list point to the same values, which you've now updated.
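A rough sketch of that approach with the csv module follows. The function names merge_rows and merge_files are made up, and two choices here are assumptions rather than part of the answer: several sourceA rows may share the same key, so each key maps to a list of rows, and every sourceA row starts as ANY so that unmatched rows end up with the required default.

```python
import csv

# The three fields that must match, as listed in the question.
KEY_FIELDS = ("fieldname_4", "fieldname_8", "fieldname_16")


def merge_rows(rows_a, rows_b):
    """Update fieldname_6 in rows_a in place, using matches from rows_b."""
    index = {}
    for row in rows_a:
        row["fieldname_6"] = "ANY"            # default when nothing matches
        key = tuple(row[f] for f in KEY_FIELDS)
        index.setdefault(key, []).append(row)
    for row_b in rows_b:
        key = tuple(row_b[f] for f in KEY_FIELDS)
        for row_a in index.get(key, []):
            row_a["fieldname_6"] = row_b["fieldname_OTHER"]
    return rows_a


def merge_files(path_a, path_b, path_out):
    """Read both CSV files, merge them, and write the result."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        reader_a = csv.DictReader(fa)
        fieldnames = reader_a.fieldnames
        rows_a = list(reader_a)
        rows_b = list(csv.DictReader(fb))
    merge_rows(rows_a, rows_b)
    with open(path_out, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows_a)
```

It would be called as merge_files("sourceA.csv", "sourceB.csv", "output.csv"). The index dictionary and the rows_a list refer to the same row objects, which is why updating through the index is enough.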