How to lock down pandas dataframe structure [closed] - python

Simply put, what are the preferred practices for writing larger Python applications that use pandas DataFrames as their primary method of data representation?
I often find myself struggling with inconsistencies in dataframes: invariants leak through in the data, datatypes are not what you expect, and so on.
I'm wondering what the best practices are for writing larger, stable applications in pandas. I want to take advantage of the array representation of data for speed, but I also want a clean way to define the "bounds" of a dataframe, i.e. what it is allowed to contain. Concretely:
Assertions on receiving a dataframe from a caller.
Forcing a dataframe parameter to have specific dtypes.
Defining a dataframe "type" based upon the columns it has.
Opportunities for OOP at the dataframe level (a rough sketch follows below).
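One way to cover the first three points is a small validation helper called at every function boundary. Below is a minimal sketch under assumed names (TRADES_SCHEMA, validate_frame, and process_trades are hypothetical, not an established API); libraries such as pandera offer a more complete version of the same idea.

import pandas as pd

# Hypothetical schema for illustration: column name -> required dtype string.
TRADES_SCHEMA = {
    "timestamp": "datetime64[ns]",
    "account": "object",
    "amount": "float64",
}

def validate_frame(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Assert that df has the columns in schema, coercing dtypes to match."""
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, dtype in schema.items():
        if str(df[col].dtype) != dtype:
            # Coerce rather than fail; raising here instead is equally valid.
            df[col] = df[col].astype(dtype)
    return df

def process_trades(df: pd.DataFrame) -> pd.DataFrame:
    df = validate_frame(df, TRADES_SCHEMA)  # check the contract at the boundary
    return df[df["amount"] > 0]

For the fourth point (OOP), a common approach is to wrap the dataframe in a small class whose methods maintain the invariants, rather than subclassing DataFrame itself, which pandas makes awkward.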
Also, sorry for the vague nature of this. I'm starting on a project, and I want to ask this question before I get too far off course. I've been burned in the past by not enforcing enough structure on my dataframes.

Related

When to use Lists, Sets, Dictionaries, or tuples in python? [closed]

I'm new to Python, and I'm a bit confused about the use cases of its data types.
Can someone please explain in detail when to use each data type, with an example if possible?
Thank you.
Lists are used when you have ordered data that you want to modify further, e.g. by sorting, appending, or removing items.
A dictionary is used when you have two sets of data where each item of the first set corresponds to an item of the other set. The position of the data doesn't matter; only the mapping between the two sets does.
A tuple is used when the position of each item is meaningful and you don't want the contents to change, since tuples are immutable.
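A small illustration of all three cases (the values are made up for demonstration):

scores = [72, 95, 81]            # list: ordered and mutable
scores.sort()                    # in-place modification is fine

capitals = {"France": "Paris", "Japan": "Tokyo"}  # dict: key -> value mapping
print(capitals["Japan"])         # look up by key; position is irrelevant

point = (3, 4)                   # tuple: position carries meaning, immutable
x, y = point                     # unpack by position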

Can dividing code too much make it inefficient? [closed]

If code is divided into too many segments, can this make the program slow?
For example: creating a separate file for just a single function.
In my case I'm using Python. Suppose there are two functions that I need in the main.py file, and I place each of them in a different file (containing just that function).
Suppose also that both of those files import the same library.
How can this affect efficiency, both machine-performance-wise and team-wise?
It depends on the language and the framework you use. In Python, though, the runtime cost of splitting code across modules is negligible: each module is imported once and cached in sys.modules, so importing the same library from two files does not load it twice. Dividing the code too much can make it unreadable, however, and that is (most of the time) the bigger problem. Since most of the time you will (or should) be working in a team, consider how readable your code will be for them.
Answering this in a definitive way is difficult, though. Ask a senior developer on your team for guidelines.
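As a sketch of the layout in question (the file and function names are hypothetical), the three files below work together with no meaningful overhead from the split:

# helpers_a.py
import math

def area(r):
    return math.pi * r ** 2

# helpers_b.py
import math  # same library again; Python reuses the cached module

def circumference(r):
    return 2 * math.pi * r

# main.py
from helpers_a import area
from helpers_b import circumference

print(area(1.0), circumference(1.0))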

How do I pivot a DataFrame correctly? [closed]

This is my first post on Stack Overflow, so apologies in advance for any mistakes in asking this question.
I am trying to pivot a DataFrame, but I am struggling to understand how it should be done properly, accounting for changes in values. I am a beginner in Python and pandas.
The dataset I am using can be found here: https://www.kaggle.com/szymonjanowski/internet-articles-data-with-users-engagement
I have processed this dataset to this point (screenshot: article_data df).
What I would like to do next is to pivot this df so that 'source_id' becomes the columns. I have done that using the pivot_table method, but I get a lot of NaN values (screenshot: pivoted data).
Moreover, I am not sure whether the pivot accounts only for unique values in the 'source_id' column. To check, I was trying to write a for loop that iterates through the unique values of source_id and stores them in the pivoted DataFrame, but I don't know how to write that code.
If you could give me some advice on what I am doing right and wrong (and some ideas on how to fix it), I would be very thankful.
Since you have duplicate values in source_id, you'd need to perform some sort of aggregation grouped by that column and then use .unstack(). That's not advisable here, though, since you have a lot of text data that cannot be aggregated.
You can try
df.set_index('source_id').T
but note that pandas does allow duplicate index values, so after the transpose you'd end up with duplicate column labels, which makes selecting columns ambiguous.
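As a sketch of the aggregation route (the column names and values below are invented stand-ins for the real frame, so adjust them to your data):

import pandas as pd

# Hypothetical frame standing in for article_data.
df = pd.DataFrame({
    "source_id": ["bbc", "bbc", "cnn"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01"],
    "engagement": [10, 20, 30],
})

# Aggregate the duplicates, then reshape so source_id becomes the columns.
pivoted = df.pivot_table(index="date", columns="source_id",
                         values="engagement", aggfunc="sum")
# Equivalent spelling:
# df.groupby(["date", "source_id"])["engagement"].sum().unstack()
print(pivoted)

The NaNs you saw are expected: any (index, column) pair with no matching rows has no value. You can fill them with .fillna(0) if zeros make sense for your data.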

Why is list comprehension so prevalent in python? [closed]

Often you see a question asked about a better method of doing something, or just a general looping question, and very often the top answers use some convoluted list/dict/set comprehension that takes others longer to understand than it would to write a simple, understandable loop themselves.
Since I can't imagine it provides any speed benefit, is there any use for it in Python other than to look smart or be Pythonic?
Thanks.
I believe the goal is to make your code as concise and efficient as possible. A comprehension is usually somewhat faster than the equivalent for loop, not because it fits on one line, but because the loop runs in specialized bytecode and avoids the repeated list.append attribute lookup; across many iterations in a large application, that can add up.
Additionally, although it seems harder to understand initially, once the idiom is familiar it is much quicker for an outside reader to grasp a short expression than pages of loops when trying to see what you're attempting to accomplish.
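For a concrete comparison, both snippets below build the same list; the comprehension states the intent in a single expression:

# Explicit loop with append
squares = []
for n in range(10):
    if n % 2 == 0:
        squares.append(n * n)

# Equivalent list comprehension
squares = [n * n for n in range(10) if n % 2 == 0]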

Array vs object - what's faster in Python [closed]

I am wondering what I should do for the purposes of my project.
I am going to operate on about 100,000 rows, every time.
What I wanted to do is create a nested object ("{}") and then, if I need to look up a value, just index into it, for example:
data['2018']['09']['Marketing']['AccountName']
The second option is to pull everything into an array ("[]"), and when I need a value, write a function that goes through the array and sums the numbers for specific parameters.
But I don't know which method is faster.
I will be thankful if you can shed some light on this.
Thanks in advance.
To answer the direct question first: looking up a key in a dict is O(1) on average, while scanning a list is O(n), so for keyed lookups the nested dict will be far faster at 100,000 rows.
If performance (speed) is the overriding concern, pure Python might not be the ideal choice anyway.
Otherwise:
Might I suggest the use of a proper database, such as SQLite (which ships with Python as the sqlite3 module).
And maybe SQLAlchemy as an abstraction layer (https://docs.sqlalchemy.org/en/latest/orm/tutorial.html).
After all, databases were made for exactly this kind of task.
If that seems overkill: have a look at pandas.
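A minimal sketch of the two lookup strategies being compared (the keys and row layout are invented for illustration):

# Option 1: nested dict -- direct O(1) key lookups.
data = {"2018": {"09": {"Marketing": {"AccountName": 1250.0}}}}
value = data["2018"]["09"]["Marketing"]["AccountName"]

# Option 2: flat list of rows -- an O(n) scan to sum matching parameters.
rows = [
    ("2018", "09", "Marketing", "AccountName", 1250.0),
    ("2018", "09", "Sales", "AccountName", 990.0),
]
total = sum(
    amount
    for year, month, dept, account, amount in rows
    if (year, month, dept, account) == ("2018", "09", "Marketing", "AccountName")
)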
