I have a DataFrame named df_train with 20 columns. Is there a Pythonic way to view info on only one column by selecting its name?
Basically I am trying to loop through the df and extract the number of unique values and the number of missing values per column:
print("\nUnique Values:")
for col in df_train.columns:
    print(f'{col:<25}: {df_train[col].nunique()} unique values. \tMissing values: {} ')
If you want the total number of null values, this is the pythonic way to achieve it:
df_train[col].isnull().sum()
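Putting the two together, the loop from the question could be completed like this (df_train here is a small made-up stand-in; any DataFrame works the same way):

```python
import pandas as pd

# hypothetical stand-in for the asker's df_train
df_train = pd.DataFrame({
    "city": ["NYC", "LA", None, "NYC"],
    "score": [1, 2, 2, None],
})

print("\nUnique Values:")
for col in df_train.columns:
    # nunique() ignores NaN; isnull().sum() counts the missing entries
    missing = df_train[col].isnull().sum()
    print(f"{col:<25}: {df_train[col].nunique()} unique values.\tMissing values: {missing}")
```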
Yes, there is a way to select individual columns from a DataFrame:
df_train['your_column_name']
This will extract only the column with <your_column_name>.
PS: This is my first StackOverflow answer. Please be nice.
I see a lot of questions about dropping rows that have a certain value in a column, or dropping entire columns, but suppose we have a Pandas DataFrame like the one below.
In this case, how could one write a line to go through the CSV, and drop all rows like 2 and 4? Thank you.
You could try
~((~df).all(axis=1))
to get a Boolean mask that is True for the rows you want to keep (i.e. rows that contain at least one True value). To get the dataframe with just those rows, you would use
df = df[~((~df).all(axis=1))]
A more detailed explanation is here:
Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError
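A minimal sketch of how that mask behaves on an all-boolean frame like the one in the question (the column names and values here are invented):

```python
import pandas as pd

# hypothetical all-boolean frame; rows at index 1 and 3 are all False
df = pd.DataFrame({
    "A": [True, False, True, False],
    "B": [False, False, True, False],
})

# (~df).all(axis=1) is True where every value in the row is False;
# negating it again gives the rows to keep
keep = ~((~df).all(axis=1))
df = df[keep]
print(df.index.tolist())  # the surviving rows
```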
This should help:
for i in list(df.index):
    count = 0
    for column_name in df.columns:
        if df.loc[i, column_name] == False:
            count = count + 1
    if count == df.shape[1]:
        df.drop(index=i, inplace=True)
Apologies if this is contained in a previous answer but I've read this one: How to select rows from a DataFrame based on column values? and can't work out how to do what I need to do:
Suppose we have some pandas DataFrame X and one of the columns is 'timestamp'. The entries are formatted like '2010-11-03 09:44:05'. I want to select just those rows that correspond to a specific day; for example, just those rows for which the string in the timestamp column starts with '2010-11-03'. Is there a neat way to do this? Can I do it with a mask or Boolean indexing? Or should I write a separate line to peel off the day from each entry and then select the rows? Bear in mind the dataframe is large, if that helps.
i.e. I want to write something like
X.loc[X['timestamp'].startswith('2010-11-03')]
or
mask = '2010-11-03' in X["timestamp"]
but these don't actually make any sense.
This should work:
X[X['timestamp'].str.startswith('2010-11-03')]
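A quick sketch with made-up data to show the filter in action (this assumes the timestamp column really holds strings, as in the question):

```python
import pandas as pd

# hypothetical frame with string timestamps
X = pd.DataFrame({
    "timestamp": ["2010-11-03 09:44:05", "2010-11-04 10:00:00", "2010-11-03 23:59:59"],
    "value": [1, 2, 3],
})

# .str.startswith returns a Boolean mask usable for indexing
day = X[X["timestamp"].str.startswith("2010-11-03")]
print(len(day))  # 2
```

If the column is actually a datetime64 dtype rather than strings, comparing `X["timestamp"].dt.date` against the target date is the equivalent move.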
I have the following df and I'm trying to figure out how to extract the unique values from each list in each row in order to simplify my df.
It's as if you applied unique() to the first row and got 'NEUTRALREGION' only once. Please note that I have another 4 columns with the same requirement.
I solved this using df.applymap(lambda x: set(x)).
That allowed me to check the unique values in each cell.
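A small sketch of that approach, with an invented column of lists:

```python
import pandas as pd

# hypothetical cells holding lists with repeated entries
df = pd.DataFrame({"region": [["NEUTRALREGION", "NEUTRALREGION"], ["A", "B", "A"]]})

# set() collapses each list to its unique values
unique_per_cell = df.applymap(lambda x: set(x))
print(unique_per_cell.loc[0, "region"])  # {'NEUTRALREGION'}
```

Note that in pandas 2.1+ `DataFrame.applymap` is deprecated in favour of `DataFrame.map`, which takes the same callable.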
Our objective right now is to drop the duplicate player rows, but keep the row with the highest count in the G column (Games played). What code can we use to achieve this? I've attached a link to the image of our Pandas output here.
You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence.
df.drop_duplicates(['Player'], keep='first')
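The two steps combined, on a made-up stats frame (player names and numbers invented):

```python
import pandas as pd

# hypothetical data: Jordan appears twice with different game counts
df = pd.DataFrame({
    "Player": ["Jordan", "Jordan", "Bird"],
    "G": [82, 60, 74],
    "Team": ["CHI", "WAS", "BOS"],
})

# sort so the highest G per player comes first, then keep the first duplicate
df = df.sort_values(by="G", ascending=False)
result = df.drop_duplicates(["Player"], keep="first")
print(result)
```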
There are two ways that I can think of:
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'] , keep = 'last')
The first method groups rows by Player and keeps only the maximum of G (note that it returns just the Player and G columns). The second uses Pandas' drop_duplicates method to keep the full row with the highest G.
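A quick comparison of the two on invented data, showing the practical difference: the groupby version loses the other columns, while sort + drop_duplicates keeps the whole winning row:

```python
import pandas as pd

# hypothetical data with an extra Team column
df = pd.DataFrame({
    "Player": ["Jordan", "Jordan", "Bird"],
    "G": [82, 60, 74],
    "Team": ["CHI", "WAS", "BOS"],
})

# keeps only the grouped key and the aggregated column
by_group = df.groupby("Player", as_index=False)["G"].max()

# keeps every column of the row with the highest G per player
by_drop = df.sort_values("G").drop_duplicates(["Player"], keep="last")
print(list(by_group.columns), list(by_drop.columns))
```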
Try this. Assuming your dataframe object is df1:
series = df1.groupby('Player')['G'].max()  # this returns a Series
pd.DataFrame(series)
Let me know if this works for you.
I am working on a dataframe with multiple columns; one of the columns has more than 1000 rows containing string values. Kindly check the table below for more details:
In the image above, I want to change the string values in the column Group_Number to numbers by taking the value from the first column (MasterGroup) and appending a counter incremented by one (01), so the values look like below:
I also need to make sure that if a string is duplicated, then instead of assigning a new number it is replaced with the number already assigned. For example, in the image above ANAYSIM is duplicated, and instead of a new sequence number I want the already-given number repeated for that string.
I have checked different links, but they focus on assigning user-supplied values:
Pandas DataFrame: replace all values in a column, based on condition
Change one value based on another value in pandas
Conditional Replace Pandas
Any help with achieving the desired outcome is highly appreciated.
We could do cumcount with groupby:
s = (df.groupby('MasterGroup').cumcount() + 1).astype(str).str.zfill(2)
t = pd.to_numeric(df.Group_number, errors='coerce')  # NaN where the value is still a string
Then we assign:
df.loc[t.isnull(), 'Group_number'] = df.MasterGroup.astype(str) + s
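A runnable sketch of this approach on invented data. I use pd.to_numeric with errors='coerce' to flag which Group_number values are still strings, and zero-pad the per-group counter so it reads 01, 02, …; note that cumcount numbers every row, so satisfying the duplicate-string requirement exactly would need an extra mapping step:

```python
import pandas as pd

# hypothetical data: some Group_number entries are strings, some numeric
df = pd.DataFrame({
    "MasterGroup": [100, 100, 200, 100],
    "Group_number": ["ANAYSIM", 10002, "OTHER", "ANAYSIM"],
})

# per-group running counter: "01", "02", ... within each MasterGroup
s = (df.groupby("MasterGroup").cumcount() + 1).astype(str).str.zfill(2)

# NaN marks rows whose Group_number is not numeric yet
t = pd.to_numeric(df.Group_number, errors="coerce")

# replace only the string rows with MasterGroup + counter
df.loc[t.isnull(), "Group_number"] = df.MasterGroup.astype(str) + s
print(df)
```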