This is my first post on Stack Overflow, so apologies in advance for any mistakes in asking this question.
I am trying to pivot a DataFrame, but I am struggling to understand how to do it properly while accounting for repeated values. I am a beginner in Python and pandas.
The dataset I am using can be found here: https://www.kaggle.com/szymonjanowski/internet-articles-data-with-users-engagement
I have processed this dataset to this point: article_data df (screenshot).
What I would like to do next is to pivot this df so that 'source_id' becomes the columns. I have done that using the pivot_table method, but I get a lot of NaN values. Here is a screenshot of the result I get: pivoted data.
Moreover, I am not sure whether the pivot accounts only for unique values in the 'source_id' column. To check that, I was trying to write a for loop that iterates through the unique values of source_id and stores them in the pivoted df, but I don't know how to write that code.
If you could give me some advice on what I am doing right and what I am not (and some ideas on how to fix it), I would be very thankful.
Since you have duplicate values in source_id, you'd need to perform some sort of aggregation grouped by that column and then use .unstack(). That's not advisable though since you have a lot of text data that cannot be aggregated.
You can try
df.set_index('source_id').T
though note that if source_id is not unique this produces duplicate column labels after the transpose, which pandas allows but which can make later selection awkward.
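For the numeric columns, here is a minimal sketch of the groupby-then-unstack idea; note that 'published_date' and 'engagement' are made-up column names for illustration, not taken from the Kaggle dataset:
import pandas as pd

# Toy data standing in for article_data; the column names other than
# 'source_id' are hypothetical placeholders.
df = pd.DataFrame({
    'published_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'source_id': ['bbc-news', 'cnn', 'bbc-news', 'cnn'],
    'engagement': [10, 20, 30, 40],
})

# Aggregate per (date, source), then move source_id into the columns.
wide = (df.groupby(['published_date', 'source_id'])['engagement']
          .mean()
          .unstack('source_id'))
print(wide)
NaN values would still appear wherever a given source has no row for a given date, which is expected rather than an error.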
I read a csv file into a dataframe with hundreds of columns and thousands of rows. I need to apply a calculation that uses two columns as conditions, and then, for the cases that match those conditions, apply a calculation to the relevant rows and columns (hundreds of rows and thousands of columns).
I couldn't find a way to do this in the dataframe itself, so I split the dataframe into separate dataframes based on one of the two conditions, converted them to dictionaries, used the other condition on the dictionaries, applied the calculations to the dictionary values in a nested for loop, and then converted the dictionaries back into dataframes and merged them together.
If I had millions of rows this could be a slow process, so does it make sense to figure out how to change the values in the dataframe itself instead of converting a dataframe into a dictionary (or list) to apply some calculations and then converting back to a dataframe?
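In general, the kind of conditional update described above can be done directly on the dataframe with a boolean mask and .loc, without converting to dictionaries. A minimal sketch, assuming hypothetical column names 'cond_a', 'cond_b', and 'value':
import pandas as pd

# 'cond_a' and 'cond_b' stand in for the two condition columns,
# 'value' for one of the columns being recalculated.
df = pd.DataFrame({
    'cond_a': ['x', 'y', 'x', 'y'],
    'cond_b': [1, 2, 1, 3],
    'value':  [10.0, 20.0, 30.0, 40.0],
})

# Build a boolean mask from the two conditions...
mask = (df['cond_a'] == 'x') & (df['cond_b'] == 1)

# ...and apply the calculation only to the matching rows, in place.
df.loc[mask, 'value'] = df.loc[mask, 'value'] * 1.1
print(df)
This vectorized form scales to millions of rows far better than a nested Python loop over dictionary values.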
Let's say that you have an insanely large dataframe with many observations (rows) and labels/characteristics (columns), and the first thing you want to do is exclude all the columns that contain irrelevant information. For that you first need to glance over the different values in the columns, but you can't really do that with head or tail.
Is there a function that returns all the non-repeated values of every column of a dataframe, instead of doing it column by column? Thanks in advance.
I'm able to do it for single columns through the function unique. For example, df.color.unique() gives me the list of the different colors there are, but I want to do it directly for all 100 columns of my dataframe.
You can use a for loop to print the unique values of each column:
for column in df.columns:
    print(f"{column}: {df[column].unique()}")
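If you would rather collect the results instead of printing them, a dict comprehension does the same thing; a small self-contained sketch (the example dataframe is just a placeholder for your own data):
import pandas as pd

# Small example dataframe; replace with your own data.
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 2]})

# Map each column name to its array of unique values.
uniques = {column: df[column].unique() for column in df.columns}
print(uniques)

# If you only need the number of distinct values per column,
# df.nunique() returns those counts as a Series.
print(df.nunique())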
I read two SQL queries (both return the same 7-column structure) into two different dataframes and want to compare the two resulting datasets to check whether they match.
I have tried the .equals method but I got:
ValueError: too many values to unpack (expected 2)
I am writing the code using Python pandas. Let me know if something like that is possible; I am new to Python and any help or advice would be appreciated.
Thanks in advance.
You can check the (exact) equality of a DataFrame like this:
import pandas as pd
df1 = pd.DataFrame({1: [20], 2: [30]}) # here would be your first sql-query
df2 = pd.DataFrame({1: [20], 2: [30]}) # here would be your second sql-query
df1.equals(df2) # results in True/False
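.equals only returns True when the two frames hold the same values in the same order with the same dtypes. If the two queries may return rows in a different order, one common approach (a sketch, assuming a shared key column named 'id') is to sort and reset the index before comparing:
a = df1.sort_values('id').reset_index(drop=True)
b = df2.sort_values('id').reset_index(drop=True)
a.equals(b)
pandas.testing.assert_frame_equal(a, b) is an alternative that raises an error describing the first difference it finds, which is handy when debugging mismatches.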
Simply put, what are the preferred practices for writing larger Python applications that use pandas dataframes as their primary method of data representation?
I often find myself struggling to keep dataframes consistent: invariants get violated in the data, datatypes are not what you expect, and so on.
I'm wondering what the best practices are for writing larger, stable applications with pandas. I want to take advantage of the array representation of the data for speed, but I also want a clean way to further define the "bounds" of a dataframe, i.e. what it should contain. For example:
Assertions on receiving a dataframe from a caller.
Forcing a dataframe parameter to have specific dtypes.
Defining a dataframe "type" based upon the columns it has.
Opportunities for OOP at the dataframe level.
Also, sorry for the vague nature of this. I'm starting on a project, and I want to ask this question before I get too far off course. I've been burned in the past by not enforcing enough structure when it comes to dataframes.
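One lightweight pattern along the lines of the first three points above is a small validation helper that each function calls on the dataframes it receives, asserting the expected columns and dtypes. This is only a sketch of one possible convention; the schema contents are hypothetical placeholders:
import pandas as pd

# Expected structure for one dataframe "type"; the column names and
# dtypes here are placeholders, not from any real application.
ORDERS_SCHEMA = {
    'order_id': 'int64',
    'customer': 'object',
    'amount': 'float64',
}

def validate(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Assert that df contains every column in schema with the expected dtype."""
    missing = set(schema) - set(df.columns)
    assert not missing, f"missing columns: {missing}"
    for column, dtype in schema.items():
        assert df[column].dtype == dtype, (
            f"column {column!r} has dtype {df[column].dtype}, expected {dtype}")
    return df

orders = pd.DataFrame({'order_id': [1], 'customer': ['a'], 'amount': [9.99]})
validate(orders, ORDERS_SCHEMA)
There are also libraries dedicated to this kind of dataframe schema validation, so it is worth checking whether one fits your project before rolling your own.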
I have a dataframe in pandas and I am trying to find the easiest way to find the max value across each row and create a new column with that max value. See below for an example:
MA10D MA30D MA50D MA100D MA200D
19.838 17.197333 16.5896 16.5207 16.52065
19.296 17.015333 16.4758 16.4676 16.48300
18.722 16.833000 16.3680 16.4106 16.44475
So in the first row of the new column I would want 19.838, then 19.296, and then 18.722 (it is just by chance that in this example all the numbers are under the MA10D column). Can someone help me find the best way to do this?
In pandas, the vast majority of operations work along the index, i.e. down the rows to produce one result per column; that is axis=0. When it makes sense to apply an operation across the columns, i.e. per row, use axis=1.
Finding the maximum is a standard dataframe operation. df.max() is equivalent to df.max(axis=0) and gives one resulting row with the max of each column. For your case, use df.max(axis=1).
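Applied to the sample data above, a minimal version looks like this (the new column name 'max' is arbitrary):
import pandas as pd

df = pd.DataFrame({
    'MA10D':  [19.838, 19.296, 18.722],
    'MA30D':  [17.197333, 17.015333, 16.833000],
    'MA50D':  [16.5896, 16.4758, 16.3680],
    'MA100D': [16.5207, 16.4676, 16.4106],
    'MA200D': [16.52065, 16.48300, 16.44475],
})

# Row-wise maximum stored in a new column.
df['max'] = df.max(axis=1)
print(df)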