How to count the number of lines in a text? [closed] - python

I have a list of text like this, and I want to count the number of lines in it. I thought of splitting it by ". ", but that will create extra lines for words like "a.m.". Please help!
[
'when people hear ai they often think about sentient robots and magic boxes. ai today is much more mundane and simple—but that doesn’t mean it’s not powerful. another misconception is that high-profile research projects can be applied directly to any business situation. ai done right can create an extreme return on investments (rois)—for instance through automation or precise prediction. but it does take thought, time, and proper implementation. we have seen that success and value generated by ai projects are increased when there is a grounded understanding and expectation of what the technology can deliver from the c-suite down.',
'“artificial intelligence (ai) is a science and a set of computational technologies that are inspired by—but typically operate quite differently from—the ways people use their nervous systems and bodies to sense, learn, reason and take action.”3 lately there has been a big rise in the day-to-day use of machines powered by ai. these machines are wired using cross-disciplinary approaches based on mathematics, computer science, statistics, psychology, and more.4 virtual assistants are becoming more common, most of the web shops predict your purchases, many companies make use of chatbots in their customer service and many companies use algorithms to detect fraud.',
'ai and deep learning technology employed in office entry systems will bring proper time tracking of each employee. as this system tries to learn each person with an image processing technology whose data is feed forwarded to a deep learning model where deep learning isn’t an algorithm per se, but rather a family of algorithms that implements deep networks (many layers). these networks are so deep that new methods of computation, such as graphics processing units (gpus), are required to train them, in addition to clusters of compute nodes. so using deep learning we can take detect the employee using face and person recognition scan and through which login, logout timing is recorded. using an ai system we can even identify each employee’s entry time, their working hours, non-working hours by tracking the movement of an employee in the office so that system can predict and report hr for the salary for each employee based on their working hours. our system can take feed from cctv to track movements of employees and this system is capable of recognizing a person even he/she is being masked as in this pandemic situation by taking their iris scan. with this system installed inside the office, the following are some of the benefits:',
'for several countries, regulations insist that the employer must keep documents available that can demonstrate the working hours performed by each employee. in the event of control from the labor inspectorate or a dispute with an employee, the employer must be able to explain and justify the working hours for the company. this can be made easy as our system is tracking employee movements',
'this is about monitoring user connection times to detect suspicious access times. in the event where compromised credentials are used to log on at 3 a.m. on a saturday, a notification on this access could alert the it team that an attack is possibly underway.',
'to manage and react to employees’ attendance, overtime thresholds, productivity, and suspicious access times, our system records and stores detailed and interactive reporting on users’ connection times. these records allow you to better manage users’ connection times and provide accurate, detailed data required by management.',
'4)if you want to avoid paying overtime, make sure that your employees respect certain working time quotas or even avoid suspicious access. our system will alert the hr officer about each employee’s office in and out time so that they can accordingly take action.',
'5)last but not least it reduces human resource needs to keep track of the records and sending the report to hr and hr officials has to check through the report so this system will reduce times and human resource needs',
'with the use of ai and deep learning technologies, we can automate some routines stuff with more functionality which humans need more resources to keep track thereby reducing time spent on manual data entry works rather companies can think of making their position high in the competitive world.'
]

Sentence splitting is not a trivial task. I'd suggest you use a ready-made library like NLTK.
import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models
text = "..."  # your raw text
sentences = nltk.sent_tokenize(text)
This works fairly well, but don't expect perfect results.
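Since your input is a list of paragraph strings rather than a single string, a minimal sketch would tokenize each item and sum the counts (the name paragraphs below is just a placeholder for your list):
import nltk
nltk.download('punkt')       # as above, needed once

paragraphs = [...]           # your list of paragraph strings
total_sentences = sum(len(nltk.sent_tokenize(p)) for p in paragraphs)
print(total_sentences)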

I have a list of text like this- and I want to count no. of lines in this. I thought of splitting it by ". "
Nice idea, but if you want to count the number of lines, why don't you simply take the length of that list?
length = len(that_long_text_list)
And then you can add to this length the number of sentence-ending periods, after removing abbreviations like "a.m." with a regular expression. To do this, check this question about regex pattern matching for acronyms.
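A rough sketch of that idea (the regex below is only illustrative and will not catch every acronym):
import re

paragraphs = [...]   # your list of paragraph strings
sentence_count = 0
for p in paragraphs:
    cleaned = re.sub(r"\b(?:[a-z]\.){2,}", "", p)   # drop acronyms like "a.m." or "e.g."
    sentence_count += cleaned.count(". ") + 1        # "+ 1" for the final sentence of each paragraph
print(sentence_count)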

Since we don't know how many characters fit into a line as you define it, we can't know for sure. My suggestion would be as follows:
Split the text into paragraphs, since the start of a new paragraph will, in any case, introduce a line break. This is easy - each item in your list will be a paragraph.
For each of these paragraphs, divide its length in characters by the number of characters in a line and round up to the nearest whole number.
So my code would be as follows:
raw_text = [...]      # your raw text in the form of a list of paragraphs
line_count = 0
chars_per_line = 80   # example value: whatever the number of characters in one of your lines is
for paragraph in raw_text:
    # integer-divide the paragraph's character count by the line width and add one to round up
    line_count += (len(paragraph) // chars_per_line) + 1
Not perfect because it doesn't take the rules of syllabification into account, but close enough, I guess.

Related

How to treat a trained and tuned model as a function for optimization?

I'm currently running a project to optimize media mixed marketing channel spend. I followed this guide and am using the code here: https://towardsdatascience.com/an-upgraded-marketing-mix-modeling-in-python-5ebb3bddc1b6.
Towards the end of the article, it states:
"We have talked about optimizing spendings and how this was not possible with the old model. Well, with the new one it is, but I will not go into detail here. In short, treat our tuned_model a function and optimize it with a program of your choice, for example, Optuna or scipy.optimize.minimize. Add some budget constraints to the optimization, such as that the sum of spendings should be less than 1,000,000 in total."
I'm not sure how I can do this.
The goal here is to optimize spend (capped at a certain amount) to hit maximum profit.
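One way to read that advice (a sketch only: it assumes tuned_model is the fitted estimator from the linked article, and the channel count, starting point, and budget cap below are illustrative) is to wrap the model's prediction in an objective function and hand it to scipy.optimize.minimize with a budget constraint:
import numpy as np
from scipy.optimize import minimize

# `tuned_model` is the fitted estimator from the article; the values below are assumptions.
n_channels = 3
total_budget = 1_000_000

def negative_outcome(spend):
    # minimize() minimizes, so return the negative of the model's prediction
    return -float(tuned_model.predict(spend.reshape(1, -1))[0])

constraints = [{"type": "ineq", "fun": lambda s: total_budget - s.sum()}]  # sum(spend) <= budget
bounds = [(0.0, total_budget)] * n_channels                                # no negative spend
x0 = np.full(n_channels, total_budget / n_channels)                        # even split as a start

result = minimize(negative_outcome, x0, method="SLSQP",
                  bounds=bounds, constraints=constraints)
print(result.x, -result.fun)   # spend per channel and the predicted outcome at that spend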

Approach for classifying job description sentences

I need to classify/categorize the various sentences in the job_experience section of n=630 job descriptions. I'm particularly interested in extracting the work experience and ability-related sentences, but I need to be able to keep them attached to the job_title that they are associated with.
Current state of these job descriptions: many different ways of saying similar things (e.g., "Needs Microsoft Office skills." "Experience using Microsoft Word, PowerPoint." "Minimum 3 years experience in related field." "Minimum three years experience in similar role.").
In the future, we will need to condense these job description statements so that, for example, the same statement can be applied to multiple jobs, and where managers select from a drop-down list of job experience statements.
So I would like to categorize these individual sentences so that we can begin condensing them and deciding on which statements will be used going forward.
I've been researching what I should do and I would appreciate any suggestions on which approach will be the most efficient. I am familiar with R but use it mostly for data wrangling and visualization. LDA, kmeans text clustering, feature identification... these are the things I'm finding in my research (scikit-learn.org) and mostly with application in Python.
Is Python best for this kind of thing? Can I use R?
Which algorithmic approach is best for a beginner?
I know this isn't magic - just looking for the best approach to this task.
My data looks like:
df <- data.frame(job_title = c("Recruiter","Recruiter","Recruiter","Recruiter",
"File Clerk","File Clerk",
"Learning & Org. Development Specialist","Learning & Org. Development Specialist","Learning & Org. Development Specialist","Learning & Org. Development Specialist",
"CNA","CNA","CNA"),
job_experience = c("Minimum 1 year experience in recruitment or related human resources function.",
"Proficient in Microsoft Office Applications.",
"High school diploma required.",
"Bachelors Degree in Human Resources or related field preferred.",
"High School diploma preferred.",
"Ability to use relevant computer systems.",
"Bachelors Degree in related field (e.g., Human Resources, Education, Organizational Development).",
"Minimum 2 years experience applying L&OD principles and practices in an organizational setting.",
"Previous work experience in Human Resources preferred.",
"Experience with a learning management system (LMS).",
"High school diploma or GED equivalent.",
"Certified Nursing Assistant, certified by the Virginia Board of Health Professions.",
"CPR certification required at date of hire."))
My goal is to have a dataset like this (new column = job_exp_category):
job_title job_experience job_exp_category
"Recruiter" "Minimum 1 year experience in recruitment..." "Work experience"
"Recruiter" "Proficient in Microsoft Office Applicati..." "Skill/Ability"
"Recruiter" "High school diploma required." "Degree"
... ... ...
"CNA" "Certified Nursing Assistant, certificati..." "Certification/License"
"CNA" "CPR certification required at date of hire." "Certification/License"
Thank you for any insight SO community.
In case someone sees this post and has a similar need, here is what I (OP) ended up doing:
Following the content in this link, I used a combination of supervised learning (random forest) to classify the job description statements into four categories (degree, work experience, certification/license, and ksa) and unsupervised learning (kmeans cluster analysis) to bucket work experience statements that used similar words (e.g., cluster 1 = statements that referenced Microsoft office products).
The general process included:
PHASE 1 (identify work experience-related job description statements):
hand-coding a sample of the job description statements into the appropriate category
converting my data set to a tidytext dataframe that would prepare it for analysis
creating a training data set using the hand-coded job description statements with their associated categories, then creating a test data set that contained the to-be-categorized job description statements
estimating a random forest model (supervised learning) using the caret package [details: method="ranger", out-of-bag resampling method, number of trees = 200]. My OOB prediction error was 2.96%.
used predict() to predict the job description categories on the remaining data (i.e., test data set)
PHASE 2 (classify work experience statements into related buckets):
being comfortable with my prediction error for this task, I filtered to include only the job description statements that fell under 'work experience'
after cleaning the data set to avoid clustering on non-important words (e.g., preferred) I used kmeans() cluster analysis (k = 200) to bucket the work experience statements into clusters based on the words used.
At this point, we're still ultimately deciding what the final work experience descriptions/categorizations will be, but this process is now more efficient with the trimmings gone and a bit of a head start on how to categorize appropriately.
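The OP worked in R (tidytext + caret + kmeans); for anyone wanting to reproduce the same two-phase idea with the Python/scikit-learn stack mentioned in the question, a rough, illustrative sketch (all texts and labels below are made up) might look like:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

# Phase 1: supervised classification of a hand-labelled sample
labeled_texts = ["Minimum 1 year experience in recruitment.",
                 "High school diploma required."]
labels = ["Work experience", "Degree"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(labeled_texts)
clf = RandomForestClassifier(n_estimators=200).fit(X_train, labels)

unlabeled_texts = ["Minimum three years experience in a similar role.",
                   "Proficient in Microsoft Office Applications."]
predicted = clf.predict(vectorizer.transform(unlabeled_texts))

# Phase 2: cluster the statements predicted as "Work experience"
work_exp = [t for t, c in zip(unlabeled_texts, predicted) if c == "Work experience"]
if work_exp:
    clusters = KMeans(n_clusters=min(200, len(work_exp))).fit_predict(
        vectorizer.transform(work_exp))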

Vehicle routing with time window Implementation in Python

I am working on a use case in which I have multiple vehicles acting as depots, delivery boys, and a set of customers in diverse locations to whom we serve fresh food. A customer places an order from an application, then a delivery boy receives the order, gets the food from the van, and delivers it within a promised delivery time (15 minutes). I want to optimize this problem so that the operational cost of traveling is reduced and delivery time is minimized. Just wanted to know: is there any implementation in Python to solve the VRPTW problem? Please help.
You can find implementations of Dijkstra's shortest-path algorithm in Python.
An example implementation is
http://code.activestate.com/recipes/577506-dijkstra-shortest-path-algorithm/
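In case the linked recipe goes away, a minimal heapq-based sketch of Dijkstra's algorithm (the tiny example graph is made up) looks roughly like this:
import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbour, edge_weight), ...]} adjacency lists
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry, a shorter path was already found
        for neighbour, weight in graph.get(node, []):
            new_d = d + weight
            if new_d < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_d
                heapq.heappush(heap, (new_d, neighbour))
    return dist

graph = {"depot": [("cust1", 4), ("cust2", 2)], "cust2": [("cust1", 1)], "cust1": []}
print(dijkstra(graph, "depot"))   # {'depot': 0, 'cust1': 3, 'cust2': 2}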
Read some research papers on the vehicle routing problem. I've seen papers that provide complete algorithms for vehicle routing, approaching it in different ways and considering multiple criteria. Hence, it's possible to implement one or more of the algorithms from these papers and test them to pick the best solution.
If you want to solve a routing problem, the very first thing to figure out is what variant of the vehicle routing problem you're solving. I'm going to assume the vans are stationary (i.e. you're not trying to optimise the positioning of the vans themselves as well). Firstly, the problem is dynamic as it's happening in realtime - i.e. it's a realtime route optimisation problem. If the delivery people are pre-assigned to a single van, then this might be considered a dynamic multi-trip vehicle routing problem (with time windows obviously). Generally speaking though it's a dynamic pickup and delivery vehicle routing problem, as presumably the delivery people can pick up from different vans (so DPDVRPTW). You'd almost certainly need soft time windows as well, making it a
DPDVRP with soft time windows. Soft time windows are essential because in a realtime setting you generally want to deliver as fast as possible, and so want to minimise how late you are. Normal 'hard' time windows like in the VRPTW don't let you deliver after a certain time, but place no cost penalty on delivering before this time (i.e. they're binary). Therefore you can't use them to minimise lateness.
I'm afraid I don't know of any open source solver in python or any other language that solves the dynamic pickup and delivery vehicle routing problem with soft time windows.
This survey article has a good overview of the subject. We also published a white paper on developing realtime route optimisers, which is probably an easier read than the academic paper. (Disclaimer - I am the author of this white paper).

How to programmatically classify a list of objects

I'm trying to take a long list of objects (in this case, applications from the iTunes App Store) and classify them more specifically. For instance, there are a bunch of applications currently classified as "Education," but I'd like to label them as Biology, English, Math, etc.
Is this an AI/Machine Learning problem? I have no background in that area whatsoever but would like some resources or ideas on where to start for this sort of thing.
Yes, you are correct. Classification is a machine learning problem, and classifying stuff based on text data involves natural language processing.
The canonical classification problem is spam detection using a Naive Bayes classifier, which is very simple. The idea is as follows:
Gather a bunch of data (emails), and label them by class (spam, or not spam)
For each email, remove stopwords, and get a list of the unique words in that email
Now, for each word, calculate the probability it appears in a spam email, vs a non-spam email (ie count occurrences in spam, vs non spam)
Now you have a model: the probability of an email being spam, given that it contains a word. However, an email contains many words. In Naive Bayes, you assume the words occur independently of each other (which turns out to be an OK assumption), and multiply the probabilities of all words in the email against each other.
You usually divide data into training and testing, so you'll have a set of emails you train your model on, and then a set of labeled stuff you test against where you calculate precision and recall.
I'd highly recommend playing around with NLTK, a python machine learning and nlp library. It's very user friendly and has good docs and tutorials, and is a good way to get acquainted with the field.
EDIT: Here's an explanation of how to build a simple NB classifier with code.
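As a rough, self-contained illustration of that workflow (the tiny training set below is invented; real spam detection needs far more data), NLTK's built-in Naive Bayes classifier can be driven with simple bag-of-words features:
import nltk

def word_features(text):
    # crude bag-of-words features: {"contains(word)": True, ...}
    return {"contains({})".format(w.lower()): True for w in text.split()}

train = [("win money now", "spam"),
         ("cheap pills offer", "spam"),
         ("meeting at noon tomorrow", "ham"),
         ("project report attached", "ham")]
train_set = [(word_features(text), label) for text, label in train]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_features("free money offer")))   # most likely 'spam'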
Probably not. You'd need to do a fair bit of work to extract data in some usable form (such as names), and at the end of the day, there are probably few enough categories that it would simply be easier to manually identify a list of keywords for each category and set a parser loose on titles/descriptions.
For example, you could look through half a dozen biology apps, and realize that in the names/descriptions/whatever you have access to, the words "cell," "life," and "grow" appear fairly often - not as a result of some machine learning, but as a result of your own human intuition. So build a parser to classify everything with those words as biology apps, and do similar things for other categories.
Unless you're trying to classify the entire iTunes app store, that should be sufficient, and it would be a relatively small task for you to manually check any apps with multiple classifications or no classifications. The labor involved with using a simple parser + checking anomalies manually is probably far less than the labor involved with building a more complex parser to aid machine learning, setting up machine learning, and then checking everything again, because machine learning is not 100% accurate.
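A toy version of that keyword-based parser (the keyword lists and example description are invented purely for illustration) could be as simple as:
# Hypothetical keyword lists; in practice you would build these by skimming a few apps per category
KEYWORDS = {
    "Biology": {"cell", "life", "grow"},
    "Math": {"algebra", "geometry", "equation"},
}

def classify(description):
    words = set(description.lower().split())
    matches = [category for category, kws in KEYWORDS.items() if words & kws]
    return matches or ["unclassified"]   # flag for manual review

print(classify("Learn how a cell can grow and divide"))   # ['Biology']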

How do content discovery engines, like Zemanta and Open Calais work?

I was wondering how a semantic service like Open Calais figures out the names of companies, or people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know what images to suggest to a piece of text for instance?
Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (@OpenCalais) or to email us at team@opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and entirety.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal
I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not too difficult to search for those terms together with some of the other entities in context and then use the results of that search to determine how confident you are that the extracted term is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (e.g.: the word 'bank' has different meanings depending on its context, and that can be very important when extracting information from text. Given the sentences "the plane banked left", "the snow bank was high", and "they robbed the bank", you can see how disambiguation can play an important part in language understanding).
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
There is a python port of (parts of) OpenNLP called NLTK (http://nltk.sourceforge.net) but I don't have much experience with it yet. Most of my work has been with the Java and C# ports, which work well.
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.
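Since NLTK comes up above, here is a minimal sketch of most of that pipeline (WSD omitted) using NLTK's built-in tools; the commented-out download names are assumptions and may differ between NLTK versions:
import nltk
# One-time model downloads (exact package names vary across NLTK versions):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

text = "Mr. Jones joined Acme Corp. in New York."
for sentence in nltk.sent_tokenize(text):    # sentence detection ("Mr." does not end the sentence)
    tokens = nltk.word_tokenize(sentence)    # tokenization
    tagged = nltk.pos_tag(tokens)            # part-of-speech tagging
    tree = nltk.ne_chunk(tagged)             # named entity chunking
    print(tree)                              # PERSON / ORGANIZATION / GPE subtrees, model permitting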
Open Calais probably uses language-parsing technology and language statistics to guess which words or phrases are Names, Places, Companies, etc. Then it is just another step to do some kind of search for those entities and return metadata.
Zemanta probably does something similar, but matches the phrases against metadata attached to images in order to acquire related results.
It certainly isn't easy.
