Text Classification & Recommendation [closed] - python

I'm trying to create a machine learning algorithm for address classification (or similar-address classification) for rural areas (villages). I have historical data, which includes a list of Addresses (independent variable), Village Name (independent variable), Pin-Codes (independent variable), Customer Mobile Number, and Route No (dependent variable). The Route No is for the delivery cart and helps it cover the maximum number of delivery destinations in that area.
Challenges -
"Address" can be misspelled.
"Village Name" can be null.
"Pin-Codes" can be wrong.
Good thing -
Not all the independent variables can be wrong/null at the same time.
Now, the point of creating this algorithm is to select the best Route Number on the basis of "Address", "Village", "Pin-Code", and the historical data (in which we manually selected the route for the delivery carts).
I'm a beginner, and I'm confused about how to do this and which process to use.
Tasks I have done:
Address cleaning - removed short words, removed big words, removed stop words.
Now I'm trying to do it with word vectors, but I'm not able to.

For this, you'll first have to build a dataset consisting of the names of as many villages as you can, because many villages have similar names, so identifying a typo is pretty difficult and risky: the difference may be only one or two letters. So, a bigger dataset is better.
Then try using TF-IDF on the combination of village name and PIN code (this link may be helpful for Indian data), or you can go for fuzzy string matching.
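Here is a minimal sketch of that idea, assuming your history sits in a pandas DataFrame with hypothetical columns "village", "pincode", and "route_no"; character n-grams give the TF-IDF vectors some robustness to one- or two-letter typos:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy history with made-up names; in practice load your real delivery records.
history = pd.DataFrame({
    "village": ["Rampur", "Rampura", "Sitapur"],
    "pincode": ["262001", "262002", "261001"],
    "route_no": [1, 2, 3],
})

# Vectorize village + PIN code together with character n-grams (typo-tolerant).
docs = (history["village"] + " " + history["pincode"]).str.lower()
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
matrix = vec.fit_transform(docs)

def predict_route(village, pincode):
    """Return the route of the most similar historical record."""
    query = vec.transform([f"{village} {pincode}".lower()])
    return history["route_no"].iloc[cosine_similarity(query, matrix).argmax()]

print(predict_route("Rampor", "262001"))  # misspelled village still matches route 1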
Hope it helps! Happy coding!

Related

How to train a ML Model for detecting duplicates without having data sets [closed]

I am currently working on my master's thesis. My topic is the development of a duplicate checking system in ABAP for a customer in the SAP environment. Up to now, the whole thing works in such a way that if the checking system does not know exactly whether an incoming data record is a duplicate or not, a clerk intervenes and makes the final decision. The clerk is to be "replaced by a machine learning model", which is to be located on a Python server, so that the clerk intervenes less in the process and becomes more and more of an "AI trainer". The communication between the duplicate checking system and the ML model is done by a REST API. The ML model should decide whether incoming data records are duplicates or not.
My first problem is that I don't have any training data to create an ML model. The second problem is that I still do not know exactly what structure my training data should have. It is possible to get client data records from the client, but for various reasons this is quite unlikely. Is there a way to generate "synthetic data" to create an ML model? What could it look like for my application? Which tools could I use to make my work a little easier?
Best regards
You can't.
When you don't have any real-world data, or records of how humans classified it, you cannot train an ML system to classify real-world data.
What you can do is train a system with data you generated and classified in a way you believe to be similar to what the program might encounter in the wild. This would allow you to try your methodology and see if it works at all. But if you want to create an ML model which is actually useful in the real world, you need to repeat the training process with real-world data.
There is a lot of variability in what your data COULD be. You need to narrow down what it will look like.
You can always create random data:
>>> import numpy as np
>>> N = 5
>>> np.random.random((N, N))
Create yourself a whole bunch of random arrays, then copy one of them, and test whether your system can catch the duplicate.
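For example, a sketch of that experiment under the assumption of purely numeric records (the 0.05 threshold is arbitrary):

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
records = rng.random((100, 5))                              # 100 synthetic records
records[42] = records[7]                                    # plant an exact duplicate
records[13] = records[7] + rng.normal(scale=0.01, size=5)   # plant a near-duplicate

# Flag pairs whose Euclidean distance falls below a (made-up) threshold.
dist = squareform(pdist(records))
np.fill_diagonal(dist, np.inf)
print(np.argwhere(dist < 0.05))   # should include (7, 42) and (7, 13)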

How can I increase email open rates with machine learning? [closed]

For example, I have email sending logs like this:
day_of_week | time    | cta        | is_opened | is_clicked
1           | 10:00AM | CLICK HERE | True      | False
7           | 07:30PM | BUY NOW    | False     | False
...
I want to write a program to see "best performing day and time to send emails".
This example covers only the sending day/time. I want to be able to add extra parameters (like CTA, sender name, etc.) when I need them.
Is machine learning the best way to do this? (I have no experience in ML.) I'm experienced with Python, and I think I could use TensorFlow for it.
ps: These are marketing emails that we send our members, not spam or malware.
There are two ways to view your case:
given day, time, etc., predict whether the email will be opened/clicked or not.
given cta, etc., predict the best day-time to send the email.
For the first case, you can use a neural net or any classifier to predict whether it will be opened/clicked or not.
For the second case (which I assume is yours), you may look at multivariate regression, because the two variables you need to predict (day_of_week, time) may not be handled separately (e.g. by creating two models and predicting day_of_week and time independently); you need to predict the two variables simultaneously. But you need to clean your data first, so it contains only opened/clicked emails.
And of course, you can implement it using TensorFlow.
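As a starting point, here is a minimal sketch of the first view (assumed column names; plain scikit-learn rather than TensorFlow, but the idea is the same): score every (day, hour) slot by predicted open probability and pick the best.

import itertools
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy log; in practice load your real sending logs.
logs = pd.DataFrame({
    "day_of_week": [1, 7, 3, 1, 5, 2, 6, 1],
    "hour":        [10, 19, 9, 11, 16, 10, 20, 10],  # time converted to a 24h hour
    "is_opened":   [1, 0, 1, 1, 0, 1, 0, 1],
})

model = LogisticRegression().fit(logs[["day_of_week", "hour"]], logs["is_opened"])

# Score every (day, hour) combination and pick the most promising slot.
grid = pd.DataFrame(list(itertools.product(range(1, 8), range(24))),
                    columns=["day_of_week", "hour"])
grid["p_open"] = model.predict_proba(grid)[:, 1]
print(grid.sort_values("p_open", ascending=False).head())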

Text analysis with a mix of text & categorical columns in R [closed]

I have a dataset of IT operations tickets with fields like Ticket No, Description, Category, SubCategory, Priority, etc.
What I need to do is use the available data (except ticket no) to predict the ticket priority. Sample data is shown below.
Number | Priority | Created_on | Description          | Category    | Sub Category
719515 | MEDIUM   | 05-01-2016 | MedWay 3rd Lucene... | Server      | Change
720317 | MEDIUM   | 07-01-2016 | DI - Medway 13146409 | Application | Incident
720447 | MEDIUM   | 08-01-2016 | DI QLD Chermside...  | Application | Medway
Please guide me on this.
Answering without more information is a bit tough, and this is more of a context question than a code question. But here is the logic I would use to start evaluating this problem. Keep in mind it might involve writing a few separate scripts, each performing part of the task.
Try breaking the problem up into smaller pieces. You cannot do an analysis without all the data, so start by creating the data.
You already have the category and subcategory: make a list of all the unique factors in each and create a set of weights for each based on your system and business needs. As you make subcategory weights, keep in mind how they will interact with categories (+/- as well as magnitude).
Write a script to read the description and count all the non-trivial words. Create some kind of classification for words to help you build lists that will inform the model with categories and subcategories.
Is the value an error message, a machine name, or some other code or type of problem you can extract using keywords?
How are all the word groupings meaningful?
How would they contribute to making a decision?
Think about the categories when you decide these things.
Then, with all of the parts, decide on a model; build, test, and refine. I know there is no code in this, but the problem-solving part of data science happens outside of code most of the time.
You need to come up with the code yourself; if you get stuck, post an edit and we can help. For orientation, the sketch below shows the rough shape such a pipeline could take.
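(The question asks about R, but for concreteness here is a minimal Python sketch; the same TF-IDF-plus-one-hot pipeline exists in e.g. caret or tidymodels. The priority labels are made up so there are two classes to learn.)

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy tickets with hypothetical priorities; load your real export instead.
tickets = pd.DataFrame({
    "Description": ["MedWay 3rd Lucene", "DI - Medway 13146409", "DI QLD Chermside"],
    "Category":    ["Server", "Application", "Application"],
    "SubCategory": ["Change", "Incident", "Medway"],
    "Priority":    ["MEDIUM", "HIGH", "MEDIUM"],
})

# TF-IDF for the free text, one-hot encoding for the categorical fields.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "Description"),
    ("cats", OneHotEncoder(handle_unknown="ignore"), ["Category", "SubCategory"]),
])
clf = Pipeline([("features", features), ("model", LogisticRegression())])

X = tickets[["Description", "Category", "SubCategory"]]
clf.fit(X, tickets["Priority"])
print(clf.predict(X))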

finding word associations using natural language processing [closed]

Given words like "romantic" or "underground", I'd like to use python to go through a list of text data and retrieve entries that contain those words and associated words such as "girlfriend" or "hole-in-the-wall".
It's been suggested that I work with NLTK to do this, but I have no idea where to start and I know nothing about language processing or linguistics. Any pointers would be much appreciated.
You haven't given us much to go on. But let's assume you have a paragraph of text. Here's one I just stole from a Yelp review:
What a beautiful train station in the heart of New York City. I've grown up seeing memorable images of GCT on newspapers, in movies, and in magazines, so I was well aware of what the interior of the station looked like. However, it's still a gem. To stand in the centre of the main hall during rush hour is an interesting experience- commuters streaming vigorously around you, sunlight beaming in through the massive windows, announcements booming on the PA system. It's a true NY experience.
Okay, there are a bunch of words there. What kind of words do you want? Adjectives? Adverbs? NLTK will help you "tag" the words, so you can find all the ad-words: "beautiful", "memorable", "interesting", "massive", "true".
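A small sketch of that tagging step with NLTK's default tokenizer and POS tagger (the downloadable resource names may vary slightly across NLTK versions):

import nltk
nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger model

text = "What a beautiful train station in the heart of New York City."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# JJ* tags are adjectives, RB* tags are adverbs.
print([w for w, tag in tagged if tag.startswith(("JJ", "RB"))])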
Now, what are you going to do with them? Maybe you can throw in some verbs and nouns; "beaming" sounds pretty good. But "announcements" isn't so interesting.
Regardless, you can build an associations database. This ad-word appears in a paragraph with these other words.
Maybe you can count the frequency of each word over your total corpus. Maybe "restaurant" appears a lot, but "pesthole" is relatively rare. So you can filter that way? (Only keep "interesting" words.)
Or maybe you go the other way, and extract synonyms: if "romantic" and "girlfriend" appear together a lot, then call them "correlated words" and use them as part of your search engine?
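If you go the co-occurrence route, a toy counter might look like this (assuming `paragraphs` is a list of review strings):

from collections import Counter
from itertools import combinations

paragraphs = [
    "romantic dinner with my girlfriend at a hole-in-the-wall place",
    "romantic spot, took my girlfriend there twice",
]

# Count how often each pair of words shares a paragraph.
pair_counts = Counter()
for p in paragraphs:
    words = set(p.lower().split())
    pair_counts.update(combinations(sorted(words), 2))

# Frequent pairs are candidate "correlated words" for the search engine.
print(pair_counts.most_common(3))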
We don't know what you're trying to accomplish, so it's hard to make suggestions. But yes, NLTK can help you select certain subgroups of words, IF that's actually relevant.

What is the appropriate machine learning algorithm for a restaurant's sales prediction? [closed]

I want to estimate the sales of a restaurant three days in advance, so that the staff can order fresh ingredients in time. I started off using linear regression, but noticed the following:
For the restaurant, it is worse if a customer doesn't get the food he ordered than if food is eventually thrown away.
I figured I might just need a skewed cost function, but I am not sure. Maybe there is something already implemented.
Another question: some days there are reservations (pre-orders) for the restaurant, so I know they will need at least a certain amount. How do I include that?
Thank you!
Pretty general question, requiring more than a Stack Overflow response. The first thing I'd consider is setting up a predictive algorithm like the linear regression you spoke of. You can also add a constant to it, as in mx + b, where b is the known quantity of food for reservations. So you would run linear regression, and then add a constant to the finalized prediction approximating the impact of reservations. As you get more data, you could start to incorporate reservations as a variable in your model.
From there, you would want to build another model for estimating the amount to buy, because you are going to have a cost function that places more emphasis on having extra vs. too little. You would have to know the cost vs. the profit to develop an algorithm for calculating the risk associated with too much food vs. too little, but it would not be difficult. You might want to research profit curves: https://en.wikipedia.org/wiki/Profit_maximization
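One concrete way to get that skewed cost function is quantile (pinball) loss: training on, say, the 0.8 quantile penalizes under-prediction (running out of food) more than over-prediction (waste). A sketch with made-up features, which also applies the reservations as a floor:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 7, size=(200, 1))            # toy feature: day of week
y = 50 + 10 * X[:, 0] + rng.normal(0, 5, 200)    # toy daily sales

# alpha=0.8 skews the loss: running short costs more than over-buying.
model = GradientBoostingRegressor(loss="quantile", alpha=0.8)
model.fit(X, y)

reservations = 80                                # known pre-orders that day
forecast = model.predict([[5]])[0]
order_amount = max(forecast, reservations)       # never plan below pre-orders
print(round(order_amount, 1))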
Hopefully that's enough to get you started!
