Using count inside pivot table Pandas - python

I want to count values from a column called "profitable_trades". This values are "TRUE" or "FALSE" depending if they are <0 or not. The problem I'm facing is that seems you can't use COUNT with Numpy. Is that correct?
What I want to know it's how many TRUE or FALSE I have for each item.
Here is the code I'm using:
currencies = ['audcad','audchf','audjpy','cadchf','eurcad','gbpaud','gbpchf','nzdusd']
filtered_df = df[df['Item'].isin(currencies)]
df3 = pd.pivot_table(filtered_df,index=["profitable_trades","month"],columns=["Item"],values=["profitable_trades"],aggfunc=[np.count],margins=True)
Here is the output:
AttributeError: module 'numpy' has no attribute 'count'
Any idea about how to use count inside a pivot table?
Thanks!

np.count does not exist. You can just use aggfunc = len:
df3 = pd.pivot_table(
df,
index=["profitable_trades","month"],
columns=["Item"],
aggfunc=len,
margins=True
)

Related

How to delete multiple rows in data frame at panda in python?

I am using pandas to make a dataframe. I want to delete 12 initial rows by drop function. every resources website says that you should use drop to delete the rows unfortunately it doesn't work. I don't know why. the error says that 'list' object has no attribute 'drop' could you do me a favor and find it what should I do?
url=Exp01.html
url=str(url)
df = pd.read_html(url)
df = df.drop(index=['1','12'],axis=0,inplace=True)
print(df)
You can slice the rows out:
df = df.loc[11:]
df
loc in general is configured this way:
df.loc[x:y]
where x is the starting index and y is the ending index.
[11:] gives starting index as 11 and no ending index
Pandas read_html returns a list of dataframes.
So df is a list on your example. First, take a look at what the list holds.
If it's just one table (dataframe), you can change it to:
df = pd.read_html(url)[0]
Full code:
url=Exp01.html
url=str(url)
df = pd.read_html(url)[0]
df.drop(index=df.index[:12], axis=0, inplace=True)

Sort dataframe by value returns "For a multi-index, the label must be a tuple with elements corresponding to each level."

Objective: Based off dataframe with 5 columns, return dataframe with 3 columns including one which is the count and sort by largest count to smallest.
What I have tried:
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year']).agg(['count'])
df = df.sort_values(by='NumInstances', ascending=False)
print(df)
Error:
ValueError: The column label 'NumInstances' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
Before this gets mark as a duplicate, I have gone through all other suggested duplicates and it seems they all suggest using the same code as I have above.
Is there something small that I am doing that may be incorrect?
Thanks!
I guess you need to remove multi-index -
Try this -
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year']).agg(['count']).reset_index()
or -
df = df[['Country', 'Year','NumInstances']].groupby(['Country', 'Year'], as_index=False).agg(['count'])
Found the issue. Adding an agg to the NumInstances column made the NumInstances column name a tuple of ('NumInstances', 'sum'), therefore I just updated the sort code to:
df = df.sort_values(by=('NumInstances', 'sum'), ascending=False)

Drop/edit rows in dataframe where entry doesn't meet condition

I know this has been asked before but I cannot find an answer that is working for me. I have a dataframe df that contains a column age, but the values are not all integers, some are strings like 35-59. I want to drop those entries. I have tried these two solutions as suggested by kite but they both give me AttributeError: 'Series' object has no attribute 'isnumeric'
df.drop(df[df.age.isnumeric()].index, inplace=True)
df = df.query("age.isnumeric()")
df = df.reset_index(drop=True)
Additionally is there a simple way to edit the value of an entry if it matches a certain condition? For example instead of deleting rows that have age as a range of values, I could replace it with a random value within that range.
Try with:
df.drop(df[df.age.str.isnumeric() == False].index, inplace=True)
If you check documentation isnumeric is a method of Series.str and not of Series. That's why you get that error.
Also you will need the ==False because you have mixed types and get a series with only booleans.
I'm posting it in case this also helps you with your last question. You can use pandas.DataFrame.at with pandas.DataFrame.Itertuples for iteration over rows of the dataframe and replace values:
for row in df.itertuples():
# iterate every row and change the value of that column
if row.age == 'non_desirable_value:
df.at[row.Index, "age"] = 'desirable_value'
Hence, it could be:
for row in df.itertuples():
if row.age.str.isnumeric() == False or row.age == 'non_desirable_value':
df.at[row.Index, "age"] = 'desirable_value'

Renaming pandas columns gives not found in index error

I have a data frame called v where columns are = ['self','id','desc','name','arch','rel']. And when I rename is as follows it won't let me drop columns giving column not found in axis error.
case1:
for i in range(0,len(v.columns)):
#I'm trying to add 'v_' prefix to all col names
v.columns.values[i] = 'v_' + v.columns.values[i]
v.drop('v_self',1)
#leads to error
KeyError: "['v_self'] not found in axis"
But if I do it as follows then it works fine
case2:
v.columns = ['v_self','v_id','v_desc','v_name','v_arch','v_rel']
v.drop('v_self',1)
# no error
In both cases if I do following it give same results for its columns
v.columns
#both cases gives
Index(['v_self', 'v_id', 'v_description', 'v_name', 'v_archived',
'v_released'],
dtype='object')
I can't understand why in the case1 it gives an error? Please help, thanks.
That's because .values returns the underlying values. You're not supposed to change those directly. Assigning directly to .columns is supported though.
Try something like this:
import pandas
df = pandas.DataFrame(
[
{key: 0 for key in ["self", "id", "desc", "name", "arch", "rel"]}
for _ in range(100)
]
)
# Add a v_ to every column
df.columns = [f"v_{column}" for column in df.columns]
# Drop one column
df = df.drop(columns=["v_self"])
To your "case 1":
You meet a bug (#38547) in pandas — “Direct renaming of 1 column seems to be accepted, but only old name is working”.
It means that after that "renaming", you may delete the first column
not by using
v.drop('v_self',1)
but using the old name
v.drop('self',1)`.
Of course, the better option is not using such a buggy renaming in the
current versions of pandas.
To renaming columns by adding a prefix to every label:
There is a direct dateframe method .add_prefix() for it, isn't it?
v = df.add_prefix("v_")

print the unique values in every column in a pandas dataframe

I have a dataframe (df) and want to print the unique values from each column in the dataframe.
I need to substitute the variable (i) [column name] into the print statement
column_list = df.columns.values.tolist()
for column_name in column_list:
print(df."[column_name]".unique()
Update
When I use this: I get "Unexpected EOF Parsing" with no extra details.
column_list = sorted_data.columns.values.tolist()
for column_name in column_list:
print(sorted_data[column_name].unique()
What is the difference between your syntax YS-L (above) and the below:
for column_name in sorted_data:
print(column_name)
s = sorted_data[column_name].unique()
for i in s:
print(str(i))
It can be written more concisely like this:
for col in df:
print(df[col].unique())
Generally, you can access a column of the DataFrame through indexing using the [] operator (e.g. df['col']), or through attribute (e.g. df.col).
Attribute accessing makes the code a bit more concise when the target column name is known beforehand, but has several caveats -- for example, it does not work when the column name is not a valid Python identifier (e.g. df.123), or clashes with the built-in DataFrame attribute (e.g. df.index). On the other hand, the [] notation should always work.
Most upvoted answer is a loop solution, hence adding a one line solution using pandas apply() method and lambda function.
print(df.apply(lambda col: col.unique()))
This will get the unique values in proper format:
pd.Series({col:df[col].unique() for col in df})
If you're trying to create multiple separate dataframes as mentioned in your comments, create a dictionary of dataframes:
df_dict = dict(zip([i for i in df.columns] , [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]))
Then you can access any dataframe easily using the name of the column:
df_dict[column name]
We can make this even more concise:
df.describe(include='all').loc['unique', :]
Pandas describe gives a few key statistics about each column, but we can just grab the 'unique' statistic and leave it at that.
Note that this will give a unique count of NaN for numeric columns - if you want to include those columns as well, you can do something like this:
df.astype('object').describe(include='all').loc['unique', :]
I was seeking for a solution to this problem as well, and the code below proved to be more helpful in my situation,
for col in df:
print(col)
print(df[col].unique())
print('\n')
It gives something like below:
Fuel_Type
['Diesel' 'Petrol' 'CNG']
HP
[ 90 192 69 110 97 71 116 98 86 72 107 73]
Met_Color
[1 0]
The code below could provide you a list of unique values for each field, I find it very useful when you want to take a deeper look at the data frame:
for col in list(df):
print(col)
print(df[col].unique())
You can also sort the unique values if you want them to be sorted:
import numpy as np
for col in list(df):
print(col)
print(np.sort(df[col].unique()))
cu = []
i = []
for cn in card.columns[:7]:
cu.append(card[cn].unique())
i.append(cn)
pd.DataFrame( cu, index=i).T
Simply do this:
for i in df.columns:
print(df[i].unique())
Or in short it can be written as:
for val in df['column_name'].unique():
print(val)
Even better. Here's code to view all the unique values as a dataframe column-wise transposed:
columns=[*df.columns]
unique_values={}
for i in columns:
unique_values[i]=df[i].unique()
unique=pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in unique_vals.items() ]))
unique.fillna('').T
This solution constructs a dataframe of unique values with some stats and gracefully handles any unhashable column types.
Resulting dataframe columns are: col, unique_len, df_len, perc_unique, unique_values
df_len = len(df)
unique_cols_list = []
for col in df:
try:
unique_values = df[col].unique()
unique_len = len(unique_values)
except TypeError: # not all cols are hashable
unique_values = ""
unique_len = -1
perc_unique = unique_len*100/df_len
unique_cols_list.append((col, unique_len, df_len, perc_unique, unique_values))
df_unique_cols = pd.DataFrame(unique_cols_list, columns=["col", "unique_len", "df_len", "perc_unique", "unique_values"])
df_unique_cols = df_unique_cols[df_unique_cols["unique_len"] > 0].sort_values("unique_len", ascending=False)
print(df_unique_cols)
The best way to do that:
Series.unique()
For example students.age.unique() the output will be the different values that occurred in the age column of the students' data frame.
To get only the number of how many different values:
Series.nunique()

Categories