Exploding two columns in pandas: ValueError - python

I have tried to explode two columns as follows
df2 = df[['Name', 'Surname', 'Properties', 'Score']].copy()
df2 = df2.reset_index()
df2.set_index('Name').apply(pd.Series.explode).reset_index()
but I have got the error: ValueError: cannot reindex from a duplicate axis
My dataset looks like
Name  Surname      Properties                   Score
A.    McLarry      ['prop1','prop2']            [1,2]
G.    Livingstone  []                           []
S.    Silver       ['prop5','prop3','prop2']    [2,55,2]
...
I would like to explode both Properties and Score. If you can tell me what I am doing wrong, it would be great!

Try using pd.Series.explode as an apply function, after setting ALL the other columns as the index. After that, you can reset_index to get the columns back:
df.set_index(['Name','Surname']).apply(pd.Series.explode).reset_index()
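A minimal runnable sketch of that fix, using made-up data shaped like the question (names and values are illustrative). Setting every non-list column as the index keeps the index unique, which is what the original `set_index('Name')` on a reset integer index failed to guarantee:

```python
import pandas as pd

# Sample data mirroring the question
df = pd.DataFrame({
    "Name": ["A.", "G.", "S."],
    "Surname": ["McLarry", "Livingstone", "Silver"],
    "Properties": [["prop1", "prop2"], [], ["prop5", "prop3", "prop2"]],
    "Score": [[1, 2], [], [2, 55, 2]],
})

# With all non-list columns in the index, each column explodes to the
# same length and the results align row by row.
out = df.set_index(["Name", "Surname"]).apply(pd.Series.explode).reset_index()
print(out)
```

Note that an empty list explodes to a single row of NaN, so the Livingstone row survives with missing Properties/Score rather than disappearing.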

Related

How to drop columns and rows with missing values?

I've been trying to take a pandas.DataFrame and drop its rows and columns with missing values simultaneously. While trying to use dropna on both axes at once, I found out that this is no longer supported. So then I tried using dropna to drop the columns and then the rows, and vice versa, and obviously the results come out different, since after the first drop the values no longer reflect the initial state.
So to give an example I receive:
pandas.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                  "toy": [numpy.nan, 'Batmobile', 'Bullwhip'],
                  "weapon": [numpy.nan, 'Boomerang', 'Gun']})
and return:
pandas.DataFrame({"name": ['Batman', 'Catwoman']})
Any help will be appreciated.
Test whether all values are non-missing, per column and per row, with DataFrame.notna and DataFrame.all, then select both axes at once with DataFrame.loc:
m = df.notna()
df0 = df.loc[m.all(1), m.all()]
print(df0)
name
1 Batman
2 Catwoman
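A self-contained version of that answer, reproducing the Batman example from the question. Both masks are computed against the original frame, so rows and columns are dropped "simultaneously" rather than sequentially:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman"],
                   "toy": [np.nan, "Batmobile", "Bullwhip"],
                   "weapon": [np.nan, "Boomerang", "Gun"]})

m = df.notna()
# Keep rows where every value is present AND columns where every value
# is present, both judged against the untouched original frame.
df0 = df.loc[m.all(axis=1), m.all(axis=0)]
print(df0)
```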

Is there a way to return a pandas dataframe with a modified column?

Say I have a dataframe df with column "age".
Say "age" has some NaN values and I want to create two new dataframes, dfMean and dfMedian, which fill in the NaN values differently.
This is the way I would do it:
# Step 1:
dfMean = df
dfMean["age"].fillna(df["age"].mean(),inplace=True)
# Step 2:
dfMedian= df
dfMedian["age"].fillna(df["age"].median(),inplace=True)
I'm curious whether there's a way to do each of these steps in one line instead of two, by returning the modified dataframe without needing to copy the original. But I haven't been able to find anything so far. Thanks, and let me know if I can clarify or if you have a better title in mind for the question :)
By doing dfMean = dfMean["age"].fillna(df["age"].mean()) you create a Series, not a DataFrame.
To add two new Series (=columns) to your DataFrame, use:
df2 = df.assign(age_fill_mean=df["age"].fillna(df["age"].mean()),
                age_fill_median=df["age"].fillna(df["age"].median()))
Alternatively, you can use pandas.DataFrame.agg(), which will
"Aggregate using one or more operations over the specified axis.":
df.agg({'age' : ['mean', 'median']})
No, you need to define the two new DataFrames with two calls to DataFrame.fillna, passing a dictionary to specify the column names whose missing values should be replaced:
dfMean = df.fillna({'age': df["age"].mean()})
dfMedian = df.fillna({'age': df["age"].median()})
One line is:
dfMean, dfMedian = df.fillna({'age': df["age"].mean()}), df.fillna({'age': df["age"].median()})
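A runnable sketch of the dictionary approach with made-up ages, picked so the mean and median actually differ. Each fillna call returns a new DataFrame, so the original df (and its NaNs) is left untouched, which was the asker's real problem with dfMean = df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, np.nan, 35.0, np.nan, 80.0]})

# fillna with a dict returns a new frame; df itself keeps its NaNs.
dfMean = df.fillna({"age": df["age"].mean()})      # mean of 20, 35, 80 is 45
dfMedian = df.fillna({"age": df["age"].median()})  # median is 35
print(dfMean["age"].tolist(), dfMedian["age"].tolist())
```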

Sort dataframe by value returns "For a multi-index, the label must be a tuple with elements corresponding to each level."

Objective: Based on a dataframe with 5 columns, return a dataframe with 3 columns, including one which is the count, sorted from largest count to smallest.
What I have tried:
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year']).agg(['count'])
df = df.sort_values(by='NumInstances', ascending=False)
print(df)
Error:
ValueError: The column label 'NumInstances' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
Before this gets marked as a duplicate: I have gone through all the other suggested duplicates, and it seems they all suggest using the same code as I have above.
Is there something small that I am doing that may be incorrect?
Thanks!
I guess you need to remove the multi-index -
Try this -
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year']).agg(['count']).reset_index()
or -
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year'], as_index=False).agg(['count'])
Found the issue. Applying an agg to the NumInstances column made the column name a tuple, ('NumInstances', 'sum') in my case (the second element is whatever aggregation was used), so I just updated the sort code to:
df = df.sort_values(by=('NumInstances', 'sum'), ascending=False)
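A small sketch of that tuple-label fix with invented data. With agg(['count']) the column becomes the tuple ('NumInstances', 'count'), and that tuple is what sort_values needs:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["A", "A", "B", "B", "B"],
                   "Year": [2020, 2020, 2021, 2021, 2021],
                   "NumInstances": [1, 2, 3, 4, 5]})

g = df[["Country", "Year", "NumInstances"]].groupby(["Country", "Year"]).agg(["count"])
# agg(['count']) produces MultiIndex columns: ('NumInstances', 'count')
g = g.sort_values(by=("NumInstances", "count"), ascending=False)
print(g)
```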

Dataframe sum(axis=1) is returning Nan Values

I'm trying to make a sum of the second column ('ALL_PPA'), grouping by Numéro_département
Here's my code :
df.fillna(0,inplace=True)
df = df.loc[:, ('Numéro_département','ALL_PPA')]
df = df.groupby('Numéro_département').sum(axis=1)
print(df)
My DF is full of numbers and I don't have any NaN values, but when I apply df.sum(axis=1), some rows come out with a NaN value.
(Screenshots of the table before and after sum() omitted.)
My question is: how am I supposed to do this? I've tried to use the numpy library, but it doesn't work the way I want it to.
Drop the first row of that dataframe, as it just has the column names in it, and convert the frame to int. Right now, it has object dtype because of the mixed data types:
df2 = df.iloc[1:].astype(int).copy()
Then, apply groupby.sum() and specify the column as well:
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
I think using .dropna() before summing the DF will help remove any rows or columns (depending on the axis you choose) with NaN values. According to the screenshot provided, please drop the first line of the DF, as it is a string.
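A minimal sketch of the fix with made-up data. Assuming, as in the question's screenshot, that the value column was read in as strings (which is what made the results come out wrong), the cast must happen before the groupby; here only the value column is cast rather than the whole frame:

```python
import pandas as pd

# Illustrative frame; in the question ALL_PPA arrived as strings
df = pd.DataFrame({"Numéro_département": ["01", "01", "02"],
                   "ALL_PPA": ["10", "20", "5"]})  # object dtype

df["ALL_PPA"] = df["ALL_PPA"].astype(int)          # convert before summing
totals = df.groupby("Numéro_département")["ALL_PPA"].sum()
print(totals)
```

Note that sum(axis=1) sums across columns within a row; for a per-group total of one column, the plain groupby-then-sum above is what's needed.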

Adding column(s) if not existing using Pandas

I am using Pandas with PsychoPy to reorder my results in a dataframe. The problem is that the dataframe is going to vary according to the participant performance. However, I would like to have a common dataframe, where non-existing columns are created as empty. Then the columns have to be in a specific order in the output file.
Let's suppose I have a dataframe from a participant with the following columns:
x = ["Error_1", "Error_2", "Error_3"]
I want the final dataframe to look like this:
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
Where "Error_4" is created as an empty column.
I tried applying something like this (adapted from another question):
if "Error_4" not in x:
    x["Error_4"] = ""
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
In principle it should work; however, I have around 70 other columns for which I should do this, and it doesn't seem practical to do it for each of them.
Do you have any suggestions?
I also tried creating a new dataframe with all the possible columns, e.g.:
y = ["Error_1", "Error_2", "Error_3", "Error_4"]
However, it is still not clear to me how to merge the dataframes x and y skipping columns with the same header.
Use DataFrame.reindex:
x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value='')
Thanks for the reply, I followed your suggestion and adapted it. I post it here, since it may be useful for someone else.
First I create a dataframe y as I want my output to look like:
y = ["Error_1", "Error_2", "Error_3", "Error_4", "Error_5", "Error_6"]
Then, I get my actual output file df and modify it as df2, adding to it all the columns of y in the exact same order.
df = pd.DataFrame(myData)
columns = df.columns.values.tolist()
df2 = df.reindex(columns = y, fill_value='')
In this case, all the columns that are absent in df2 but are present in y, are going to be added to df2.
However, let's suppose that df2 has a column "Error_7" that is absent in y. To keep track of these columns I am just applying merge and creating a new dataframe df3:
df3 = pd.merge(df2,df)
df3.to_csv(filename+'UPDATED.csv')
The missing columns are going to be added at the end of the dataframe.
If you think this procedure might have drawbacks, or if there is another way to do it, let me know :)
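One possible alternative to the merge step, sketched here with invented columns: append any extra columns to the reindex target itself, so a single reindex both fills the missing columns and keeps unexpected ones like "Error_7" at the end. This is an assumption about the desired output order, not part of the original answers:

```python
import pandas as pd

y = ["Error_1", "Error_2", "Error_3", "Error_4", "Error_5", "Error_6"]
df = pd.DataFrame({"Error_1": [1], "Error_2": [2], "Error_7": [7]})

# Columns from y come first (missing ones filled with ''),
# then any extra columns not listed in y are appended at the end.
extra = [c for c in df.columns if c not in y]
df2 = df.reindex(columns=y + extra, fill_value="")
print(df2.columns.tolist())
```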
