Use Python to find groups of consecutive numbers in Excel

I started learning Python a couple of weeks ago through Codecademy, and it has been a slow process for me. Unfortunately, I am now at a point where I need a script to do a specific job for me.
I have an Excel document that is a list of ID numbers, for example:
72100234
72100235
72100239
There are roughly 50,000 ID numbers. I need a script that will do a couple of things for me:
- First, show me how the job is done so I can learn from it.
- Second, its actual purpose: go through the list, find the consecutive numbers, and group them together in either a separate file or another tab.
Example:
72100234
72100235
72100236
72100239
72100240
I have to find these blocks of numbers so they can be assigned to a specific agency, and going through the list manually... yeah, been there, done that, don't want to do it again.
Thanks for any help.

Try to divide your problem into separate steps.
The first step could be to get familiar with a library for accessing your Excel file. The one I tend to use is "openpyxl". To find out how to get data from an Excel file, you can use this link: https://openpyxl.readthedocs.io/en/stable/optimized.html.
I am new here as well, and I found out that you should first make an effort to try it yourself; afterwards people will be very helpful.
Once you have worked out how to get the data from Excel into Python, you could post some of it. That would make it easier to understand your problem. I hope this helps a bit.
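For reference, here is a minimal sketch of how the two steps might fit together. It assumes the IDs are integers in column A of the first sheet with no header row; the function names `group_consecutive` and `read_ids` are made up for illustration, and `read_ids` needs the third-party openpyxl package installed. The grouping itself is plain Python: sort the numbers, then start a new group whenever a number is not exactly one more than the previous one.

```python
def group_consecutive(ids):
    """Split a collection of integers into runs of consecutive values."""
    groups = []
    for n in sorted(set(ids)):  # sort and drop duplicates first
        if groups and n == groups[-1][-1] + 1:
            groups[-1].append(n)  # n continues the current run
        else:
            groups.append([n])  # n starts a new run
    return groups


def read_ids(path):
    """Read integer IDs from column A of the active sheet (assumed layout)."""
    from openpyxl import load_workbook  # third-party: pip install openpyxl

    wb = load_workbook(path, read_only=True)
    ws = wb.active
    rows = ws.iter_rows(min_col=1, max_col=1, values_only=True)
    # Skip blank rows; assumes every non-blank cell holds a number.
    return [int(row[0]) for row in rows if row and row[0] is not None]


# The sample numbers from the question, used here instead of a real workbook.
sample = [72100234, 72100235, 72100236, 72100239, 72100240]
for group in group_consecutive(sample):
    print(group[0], "-", group[-1], f"({len(group)} IDs)")
```

With a real file you would call something like `group_consecutive(read_ids("ids.xlsx"))`, where the file name is only a placeholder. Writing the groups to another tab or file is a separate step and is left out here.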

Related

PyPDF2 Extract from field or location

I have a Python script that runs fine: it scans a folder and collects data based on text line position. That works well, but if any lines have missing data, it throws my line numbers off.
I have looked in the PDF file using iText RUPS, and I can find a reference to one set of the data I need:
BT
582 -158.78 Td
(213447) Tj
ET
The information I want is in the brackets. Can I somehow use the coordinates? If all else fails, I might be able to get people to agree to start the info I need to collect with a flag such as XX12345 or YY12345, so that I can easily pick out the data from the text extraction, but I'd rather find a better way.
I have not added code examples, as that part works fine; it's just the next step I'm struggling with. I can add them if anyone wishes.
Many thanks
I tried to use just text extraction, but missing inputs throw my numbering scheme off.
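To illustrate the coordinate idea, here is a rough sketch that pulls the `Td` operands and the `Tj` string out of a text object shaped like the one above. It assumes you already have the page's content stream as decoded text (getting that out of PyPDF2 is not shown), and the names `TEXT_OBJECT` and `find_text_objects` are invented for this example. Real content streams can use other operators (`Tm`, `TJ`), escaped parentheses, or hex strings, and the `Td` operands may be offsets rather than absolute page positions, so treat this as a starting point only.

```python
import re

# Matches one simple text object of the form shown in the question:
#   BT  <x> <y> Td  (<text>) Tj  ET
TEXT_OBJECT = re.compile(
    r"BT\s+(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td\s+\((.*?)\)\s+Tj\s+ET"
)


def find_text_objects(stream_text):
    """Return (x, y, text) tuples for simple Td/Tj text objects."""
    return [
        (float(x), float(y), text)
        for x, y, text in TEXT_OBJECT.findall(stream_text)
    ]


# The snippet from the question, as it appears in iText RUPS.
snippet = """BT
582 -158.78 Td
(213447) Tj
ET"""

print(find_text_objects(snippet))
```

Once each value comes with its operands, you could select the field by position instead of by line number, which would not shift when other lines are missing.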

Python - Change and update the same header files from two different projects

I am performing data analysis. I want to segment the steps of data analysis into different projects, as the analysis will be performed in the same order, but not usually all at the same time. There is so much code and data cleaning that keeping all of this in the same project may get confusing.
However, I have been keeping any header files for tracking columns of information in the data consistent. It is possible I will change the sequence of these headers at some point and want to run all sequences of code. I also want to make sure that the header used remains the same so I don't erroneously analyze one piece of data instead of another. I use headers so that if a column order changes at any time, I am accessing the index of the data based on the header that matches the data changes rather than changing every instance of appearance of a particular column number throughout my code.
To accomplish this, I would like to track multiple projects that access the SAME header files, and to update and alter the header files without having to access them from each project individually.
Finally, I don't want to just store it somewhere on my computer and not track it, because I work from two different work stations.
Any good solutions or best practice for what I want to do? Have I made an error somewhere in project set-up? I am mostly self-taught and so have developed my own project organization and sequence of data analysis based on my own ideas and research as I go, so if I've done some terribly bad practice that would be great to know.
I've found a possible solution that uses independent branches in the same repo for tracking two separate projects, but I'm not convinced this is the best solution either.
Thanks!

Removing specific section from thousands of pdfs (using python)

There is a case in my job where I have to remove a specific section (Glossary) from thousands of PDF documents.
The text I want to remove has a different font from the other parts:
Example:
"Floor" the lower surface of a room, on which one may walk.
"exchange" an act of giving one thing and receiving another (especially of the same type or value) in return.
Can you please suggest a way to do this faster?
One possible way to solve this is to find the section you want to delete using a regex, then use one of the Python PDF-editing libraries to delete it.
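As a sketch of the regex half of that suggestion only: assuming the page text has already been extracted to a string, and that every glossary entry looks like the examples in the question (a quoted term at the start of a line, followed by its definition), a pattern like the one below could locate the entries. The names `GLOSSARY_ENTRY` and `find_glossary_entries` are hypothetical, and actually removing the matched text from the PDF still needs a PDF-editing library, which is not shown.

```python
import re

# A glossary entry as in the question: a quoted term at the start of a line,
# followed by whitespace and the definition text.
GLOSSARY_ENTRY = re.compile(r'^"[^"\n]+"[ \t]+\S.*$', re.MULTILINE)


def find_glossary_entries(text):
    """Return every line that looks like a glossary entry."""
    return [m.group(0) for m in GLOSSARY_ENTRY.finditer(text)]


# Made-up page text wrapped around the two entries from the question.
page_text = '''Some body text.
"Floor" the lower surface of a room, on which one may walk.
"exchange" an act of giving one thing and receiving another (especially of the same type or value) in return.
More body text.'''

entries = find_glossary_entries(page_text)
print(len(entries), "glossary entries found")
```

Quoted text at the start of an ordinary paragraph would match too, so the different font mentioned in the question may be a more reliable signal if your library exposes it.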

How can I format google sheets so I can export my data properly?

I plan to make an educational web game. I have thousands of trivia questions I need to write down in a way that can be easily transferred out and automatically organized based on their column, at a later date.
I was advised to use Google Sheets so I can later export a .csv, which should be easy for a developer to work with. When I exported a .csv and opened it in pandas (Python), the A column was cut off and 1 column was used as a 'header', not just a normal entry: https://imgur.com/a/olcpVO8. This obviously won't work and seems to be an issue.
Should I just leave the first row and column empty and work around the issue? I don't want to write thousands of sets only to find out I did this the wrong way. Can anyone give any insight into whether this is my best option and how I should best format it?
I have to write Questions(1), Answers(4), Explanations(1) per entry
I hope this makes sense, thanks for your time.
I tried this and had no issue at all with the CSV exported from Google Sheets, using the same data as in your example.
In my opinion, whatever software you're using in your second screenshot is the issue. It seems to be removing numbers from the first row because it treats that row as your header row. Check your software for options like "First column contains headers" or "Use row 1 as Header" and make sure they aren't enabled.
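One way to sidestep the header question is to put real column names in row 1 of the sheet, so that whatever reads the file has a genuine header row to consume. The sketch below uses Python's standard `csv` module on a made-up export; the column names and trivia rows are assumptions for illustration, not taken from your sheet. (If you stay with pandas, `read_csv` treats the first row as the header by default, and passing `header=None` tells it not to.)

```python
import csv
import io

# Hypothetical export: row 1 holds the column names, each later row is one entry.
exported = io.StringIO(
    "Question,Answer1,Answer2,Answer3,Answer4,Explanation\n"
    "What is 2 + 2?,3,4,5,22,Two plus two equals four.\n"
    '"Which planet is known as the Red Planet?",Venus,Mars,Jupiter,Saturn,'
    '"Mars looks red, due to iron oxide."\n'
)

# DictReader uses the first row as field names and returns the rest as data.
rows = list(csv.DictReader(exported))
for row in rows:
    print(row["Question"], "->", row["Explanation"])
```

Fields that contain commas are wrapped in quotes in the export, so they come back as a single value rather than being split across columns.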

Database searches in separate files

I am looking for a kind of database that can search across separate files (e.g. pdf, xls, doc) that I get from different suppliers. My idea is something like this:
For example, I need to search for a part number and check different data about it. The file containing the part number must then be opened with the part number marked. If there are multiple hits, the database should display a list of the various files containing the searched item number. The list should act as links that open the file with the item number selected when selecting one from the list.
Does this already exist or how do I approach it?
Today, it's all assembled into a single PDF file of more than 1000 pages, and it's a time-consuming and laborious process to maintain.
I've only used VBA in connection with Excel, so maybe this is too complicated for me. But is it feasible for a programmer without spending 1000 hours on it?
Please help me :-)
Either Access or Excel could do this. I noticed the Python tag. I'm sure Python could handle this as well, although it seems more like a database solution would be best. It sounds like a one-to-many scenario. See the link below for some ideas of how this technique works.
https://www.tutorialspoint.com/ms_access/ms_access_one_to_many_relationship.htm
Also, below is a link with a whole bunch of MS Access templates. Take a look at that and hopefully that will give you some ideas of how to get started.
https://www.microsoftaccessexpert.com/Microsoft-Access-Templates.aspx
I agree, keeping this in a PDF with 1000 pages is NOT the way to go!!
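Since the Python tag came up, here is a rough sketch of the same one-to-many idea using Python's built-in `sqlite3` module instead of Access. Everything in it is hypothetical: the table names, the `add_hit` and `find_files` helpers, and the sample part numbers and paths. Extracting part numbers from the pdf, xls, and doc files, and opening a file at the right spot, are separate problems that are not covered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to keep the index between runs
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
    CREATE TABLE part_hits (
        part_number TEXT NOT NULL,
        file_id     INTEGER NOT NULL REFERENCES files(id),
        page        INTEGER
    );
    CREATE INDEX idx_part ON part_hits(part_number);
""")


def add_hit(part_number, path, page=None):
    """Record that part_number occurs in the file at path (optionally on a page)."""
    conn.execute("INSERT OR IGNORE INTO files (path) VALUES (?)", (path,))
    file_id = conn.execute(
        "SELECT id FROM files WHERE path = ?", (path,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO part_hits VALUES (?, ?, ?)", (part_number, file_id, page)
    )


def find_files(part_number):
    """Return (path, page) for every file that mentions part_number."""
    rows = conn.execute(
        "SELECT f.path, h.page FROM part_hits h "
        "JOIN files f ON f.id = h.file_id "
        "WHERE h.part_number = ? ORDER BY f.path, h.page",
        (part_number,),
    )
    return rows.fetchall()


# Made-up sample data: one part number appears in two supplier files.
add_hit("AB-1001", "supplier_a/catalog.pdf", 12)
add_hit("AB-1001", "supplier_b/prices.xls")
add_hit("CD-2002", "supplier_a/catalog.pdf", 40)
print(find_files("AB-1001"))
```

The list returned by `find_files` is what you would show as the clickable list of hits.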
