I'm very familiar with pandas and use it on a daily basis.
Recently I worked on a couple of projects where I needed to build things very efficiently, and I realized I needed to understand the fundamentals pandas is built on.
I'm looking for a book or other material that will help me understand pandas better in terms of efficiency.
Any recommendations?
If you're already familiar with pandas (e.g. you can fairly quickly write the syntax to accomplish a given task, and you know a few different approaches to most common tasks), then I would suggest the better approach is to benchmark your options yourself.
You can use timeit for that. You're likely to get much better insight into things this way, and you'll also be building a base you can add to later as you identify specific needs. No need for very fancy coding: just dump a bunch of test cases with timeit in a file. It doesn't take long, and it is more reusable and adaptable than reading about some test case in a benchmark that may or may not mirror the behavior of your actual data.
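For example, here is a minimal sketch of such a file, comparing a vectorized column product against a row-wise apply; the task, column names, and sizes are arbitrary, purely for illustration:

    import timeit
    import numpy as np
    import pandas as pd

    # Toy frame to benchmark against; swap in a sample of your real data.
    df = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})

    def vectorized():
        return df["a"] * df["b"]

    def row_apply():
        return df.apply(lambda row: row["a"] * row["b"], axis=1)

    # timeit accepts any callable; average over a few repetitions.
    for fn in (vectorized, row_apply):
        t = timeit.timeit(fn, number=10)
        print(f"{fn.__name__}: {t / 10:.4f}s per run")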
From the quick Google searches I did earlier, what you'll find on this topic is, as far as I know, mostly very broad recommendations that you likely already know about: use proper datatypes (ints are faster than floats), avoid for-loops and use vectorized operations instead, etc. That is good advice, but based on your level of understanding of pandas, it sounds like you're already past that level of advice.
I'm trying to get some approximation ratios for the Maximum Independent Set problem, so I need some exact solutions!
I've found libraries written in C++ (e.g. https://github.com/iPapatsoris/Maximum-Independent-Set), but I wondered if there were any directly in Python. I know of the networkx maximal independent set function, but that only returns a maximal independent set, not a maximum one.
I realise it's far from the most efficient language to use, but I'm only solving small Erdős–Rényi graphs (N < 20).
In addition to this, are there any libraries that solve this for the weighted problem, where some nodes matter more than others?
This is the only Python library I could find:
https://github.com/pchervi/Graph-Coloring/blob/master/Coloring_MWIS_heuristics.py
I haven't checked that it works correctly, however.
I've been using KaMIS instead, which is a C++ implementation.
https://github.com/KarlsruheMIS/KaMIS
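That said, for graphs this small you may not need a dedicated solver at all: a maximum (weight) independent set in G is exactly a maximum (weight) clique in the complement of G, and networkx ships an exact solver for the latter. A minimal sketch, assuming networkx >= 2.5 (which provides max_weight_clique; note it requires integer weights):

    import networkx as nx

    # An independent set in G is a clique in complement(G), so solve
    # maximum clique on the complement. Exact, but exponential in the
    # worst case, which is fine for N < 20.
    G = nx.erdos_renyi_graph(15, 0.3, seed=42)
    C = nx.complement(G)

    # Unweighted: weight=None makes every node count as 1.
    mis, size = nx.max_weight_clique(C, weight=None)
    print("maximum independent set:", mis, "size:", size)

    # Weighted: complement() does not copy node attributes, so attach the
    # (integer) weights to the complement graph before solving.
    for n in C.nodes:
        C.nodes[n]["w"] = n + 1  # hypothetical weights, for illustration
    mwis, total = nx.max_weight_clique(C, weight="w")
    print("maximum weight independent set:", mwis, "weight:", total)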
Why is it so important to do pre-processing, and what are the simple steps for doing it? Can anyone help? I am working in Python.
I have a dataframe containing null values. The data also contains outliers, and moreover it is not distributed uniformly.
My question is: what protocol should I follow in order to fill the null values, should I remove the outliers (given that doing so might lead to loss of information), and what are the steps to make the data uniformly distributed?
Firstly, it really does not matter which language you are working in; both Python and R are popular in data science.
Secondly, you cannot feed raw data into machine learning models; you need to clean it first. Here are some simple steps (a small sketch follows the list):
1. Fill missing values: often there are missing values in the data, and you have to fill them in. The question is how; there are plenty of methods you can look up.
2. Remove skewness and outliers: data often contains values that fall outside the range of the rest of the data, so you have to bring those values back within that range.
3. One-hot encoding: categorical values need to be transformed into a numeric encoding.
There are more steps beyond these, and there are tons of blogs you can go through.
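For instance, a minimal sketch of those three steps in pandas; the column names, the median fill, and the percentile clipping are all just illustrative choices:

    import pandas as pd

    # Toy frame: a missing value and an outlier in the numeric columns.
    df = pd.DataFrame({
        "age": [25, None, 40, 120],
        "income": [30_000, 45_000, None, 52_000],
        "city": ["NY", "LA", "NY", "SF"],
    })

    # 1. Fill missing values (here: column median; many strategies exist).
    for col in ["age", "income"]:
        df[col] = df[col].fillna(df[col].median())

    # 2. Clip outliers to the 1st-99th percentile range (one simple option).
    for col in ["age", "income"]:
        lo, hi = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lo, hi)

    # 3. One-hot encode the categorical column.
    df = pd.get_dummies(df, columns=["city"])
    print(df)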
We can perform noise reduction using open-source software like Audacity, which is commonly used for this purpose (see the linked screenshot: denoising with Audacity).
Is there a Python library that can perform a similar function?
If you want to reduce noise the Audacity way, then to my understanding you should program your algorithm yourself using the filters provided by the scipy library.
Besides that, PyAudio is a dedicated library for working with audio, and there is a kickstart tutorial for it.
If you are not restricted to Python, you can check out Essentia. This is an exhaustive library for music and audio analysis.
In a nutshell: while Python libraries provide the functionality, it is you who should code your noise reduction algorithm, tailored to your needs. Maybe you can follow Audacity's approach.
You can refer to this question for better technical/implementation clarity: Noise reduction on wave file.
Good luck! Try to be precise, and post questions focusing on implementation in a particular programming language rather than on generic things.
As a general guideline:
Understand the behavior of your noise, and then you can choose your noise removal strategy accordingly. Maybe you need a simple low-pass or high-pass filter.
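For the simple low-pass case, here is a minimal sketch with scipy; the sample rate, cutoff, and filter order are illustrative, not recommendations, and this of course only removes the noise above the cutoff:

    import numpy as np
    from scipy.signal import butter, filtfilt

    # Toy signal: a 440 Hz tone we want to keep, buried in broadband noise.
    fs = 44_100                                      # sample rate in Hz
    t = np.arange(0, 1.0, 1 / fs)
    clean = np.sin(2 * np.pi * 440 * t)
    noisy = clean + 0.3 * np.random.randn(t.size)

    # 4th-order Butterworth low-pass with a 1 kHz cutoff; filtfilt runs the
    # filter forwards and backwards so the output has no phase shift.
    b, a = butter(4, 1000, btype="low", fs=fs)
    denoised = filtfilt(b, a, noisy)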
Here is the problem: when given a block of text, I want to suggest possible topics. For example, a news article about Kobe Bryant would have suggested tags like 'basketball', 'nba', and 'sports'.
I have a fairly large training dataset (350k+ entries) that includes bodies of text and the tags that users have assigned to them. There are about 40k pre-existing topics; however, many of the topics do not have many entries in them. I would say only about 5k of the topics have more than 10 entries. Users cannot assign topics that don't already exist in the system. I'd also like to include that
Does anyone have any suggestions for algorithms to use?
If anyone has suggestions for Python libraries as well, that would be awesome.
There have been attempts at similar problems; one example is right here on Stack Overflow. When you write a question, Stack Overflow itself suggests some tags without your intervention, though you can manually add or remove them.
Out-of-the-box classification would fail because the number of tags is really huge. There are two directions from which you could work on this problem.
Nearest Neighbors
Easy, fast and effective. You have a labelled training set. When a new document comes in, you look for the closest matches; e.g. words like 'tags', 'training', 'dataset', and 'labels' helped your question map to other similar questions on Stack Overflow. In those questions the machine-learning tag was present, so that tag would be suggested. The best way to implement this is to index your training data (the search-engine tactic). You can use Lucene, Elasticsearch, or something similar. When a new document appears, use it as a query and search for the top 10 matching documents stored previously. Poll their tags, sort the tags, and use the scores of the matching documents to judge how important each tag is. Done. A sketch of this approach follows below.
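To make the polling step concrete, here is a minimal sketch using scikit-learn in place of a full search engine; the tiny corpus, the TF-IDF representation, and the similarity weighting are all illustrative choices:

    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    # Toy labelled corpus; in practice this is your 350k documents.
    docs = ["kobe bryant scores 60 in final nba game",
            "lakers win nba championship",
            "python pandas dataframe groupby question"]
    tags = [["basketball", "nba"],
            ["basketball", "nba", "sports"],
            ["python", "pandas"]]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

    def suggest_tags(text, top_k=3):
        distances, indices = index.kneighbors(vectorizer.transform([text]))
        scores = Counter()
        # Poll each neighbour's tags, weighted by similarity (1 - distance).
        for dist, i in zip(distances[0], indices[0]):
            for tag in tags[i]:
                scores[tag] += 1.0 - dist
        return [tag for tag, _ in scores.most_common(top_k)]

    print(suggest_tags("nba finals basketball highlights"))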
Probabilistic Models
The idea is along the lines of classification, but off-the-shelf tools won't help you with it. Check out works like Clayton Stanley's Predicting Tags for StackOverflow Posts, Darren Kuo's On Word Prediction Methods, or Schuster's report on Predicting Tags for StackOverflow Questions.
If you have this problem as part of a long-term academic project or research, working on Method 2 would be better. However, if you need an off-the-shelf solution, use Method 1. Lucene is a great indexing tool used even in production. It is originally in Java, but you can easily find wrappers for Python. Other alternatives are Elasticsearch, Katta, and many more.
P.S. A lot of experimentation is required while playing with the tag scores.
I am trying to determine the following step characteristics for a step response in Python:
RiseTime
SettlingTime
SettlingMin
SettlingMax
Overshoot
Undershoot
Peak
PeakTime
Matlab offers me the function stepinfo, but I am unable to find a suitable alternative in Python. I did try to roll my own using numpy and scipy, but I haven't had much luck yet; my knowledge of signal processing is lacking.
Most of the information I can find on the internet looks rather complicated, but I would like to learn more about this. If anyone could recommend a good book or other source to learn from, I would appreciate it! Thank you!
This is the step response that I currently have: [step response plot]
This discussion suggests a sort of implementation:
    def step_info(t, yout):
        # Overshoot: peak height relative to the final (steady-state) value, in percent.
        print("OS: %f%%" % ((yout.max() / yout[-1] - 1) * 100))
        # Rise time: first time the response crosses 90% of the final value.
        print("Tr: %fs" % (t[next(i for i in range(len(yout)) if yout[i] > yout[-1] * 0.90)] - t[0]))
        # Settling time: last time the response is outside a 2% band around the final value.
        print("Ts: %fs" % (t[next(len(yout) - i for i in range(2, len(yout)) if abs(yout[-i] / yout[-1] - 1) > 0.02)] - t[0]))
Then you can use numpy and the functions in scipy's signal-processing module to get the other information that you want.
Could you not just implement the formulas? (Assuming that this is a second-order system, i.e. it has two dominant poles and can be approximated as second order.)
For rise and settling time there are a few different approximations, so the internet is your friend.
You could also figure out the damped frequency ωd (from the maxima and minima of your plot data), and use that to figure out the natural frequency ωn via ωd = ωn·√(1 − ζ²), where ζ is the damping ratio.
There are a handful of formulas that relate these various quantities, depending on what you know.
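For instance, here is a minimal sketch of that formula route, using the classic second-order approximations and cross-checking them against a step response simulated with scipy; the ζ and ωn values are purely illustrative:

    import numpy as np
    from scipy import signal

    # Standard second-order system G(s) = wn^2 / (s^2 + 2*zeta*wn*s + wn^2).
    zeta, wn = 0.5, 2.0

    # Classic second-order approximations.
    overshoot = 100 * np.exp(-zeta * np.pi / np.sqrt(1 - zeta**2))  # percent overshoot
    peak_time = np.pi / (wn * np.sqrt(1 - zeta**2))                 # time of first peak
    settling_time = 4 / (zeta * wn)                                 # 2% criterion
    print(f"OS ~ {overshoot:.1f}%  Tp ~ {peak_time:.2f}s  Ts ~ {settling_time:.2f}s")

    # Cross-check against a simulated step response.
    sys = signal.TransferFunction([wn**2], [1, 2 * zeta * wn, wn**2])
    t, y = signal.step(sys)
    print(f"simulated peak: {y.max():.3f} at t = {t[np.argmax(y)]:.2f}s")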