import polars as pl
import pandas as pd
A = ['a','a','a','a','a','a','a','b','b','b','b','b','b','b']
B = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]
df = pl.DataFrame({'cola': A, 'colb': B})
df_pd = df.to_pandas()
# row index of the max colb within each cola group
index = df_pd.groupby('cola')['colb'].idxmax()
df_pd.loc[index, 'top'] = 1
In pandas I can build the 'top' column using idxmax(), as above.
In Polars, however, I tried arg_max():
index = df[pl.col('colb').arg_max().over('cola').flatten()]
but that does not give me what I want.
Is there a way to generate a 'top' column like this in Polars?
Thanks a lot!
In Polars, window functions (the .over() expressions) perform an aggregation plus a self-join (see https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.over.html?highlight=over#polars.Expr.over), which means they cannot return a unique value per row, which is what you are after.
A way to compute the top column is to use apply:
df.groupby("cola").apply(lambda x: x.with_columns([pl.col("colb"), (pl.col("colb")==pl.col("colb").max()).alias("top")]))
I want to do data inspection and print the count of rows that match a certain value in one of the columns. Below is my code:
import numpy as np
import pandas as pd
data = pd.read_csv("census.csv")
The census.csv has a column "income" which has 3 values: '<=50K', '=50K' and '>50K'.
I want to print the number of rows that have the income value '<=50K'.
I tried the following:
count = data['income']='<=50K'
That does not work, though.
Sum the Boolean selection:
data['income'].eq('<=50K').sum()
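This works because eq() returns a Boolean Series and True sums as 1. A self-contained sketch with hypothetical inline data standing in for census.csv:

import pandas as pd

# hypothetical stand-in for the census data
data = pd.DataFrame({'income': ['<=50K', '>50K', '<=50K', '<=50K']})

# eq() yields a Boolean Series; summing it counts the matches
count = data['income'].eq('<=50K').sum()
print(count)  # 3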
The key is to learn how to filter pandas rows.
Quick answer:
import pandas as pd
data = pd.read_csv("census.csv")
df2 = data[data['income']=='<=50K']
print(df2)
print(len(df2))
Slightly longer answer:
import pandas as pd
data = pd.read_csv("census.csv")
mask = data['income']=='<=50K'  # named 'mask' to avoid shadowing the built-in filter()
print(mask)  # notice the Boolean Series built from the filter criteria
df2 = data[mask]  # next we use that Boolean Series to filter the data
print(df2)
print(len(df2))
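If you want counts for every distinct income value at once, value_counts is a common alternative (a sketch under the same census.csv assumption):

import pandas as pd

data = pd.read_csv("census.csv")
# tally of rows per distinct 'income' value, most frequent first
print(data['income'].value_counts())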
Thank you in advance for taking the time to help me! (Code provided below) (Data Here)
I am trying to average the first 3 columns and insert it as a new column labeled 'Topsoil'. What is the best way to go about doing that?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
raw_data = pd.read_csv('all-deep-soil-temperatures.csv', index_col=1, parse_dates=True)
df_all_stations = raw_data.copy()
# NOTE: df_selected_station is one station's data taken from df_all_stations;
# the selection step is not shown in the question
df_selected_station.fillna(method='ffill', inplace=True)
df_selected_station_D = df_selected_station.resample(rule='D').mean()
df_selected_station_D['Day'] = df_selected_station_D.index.dayofyear
mean = df_selected_station_D.groupby(by='Day').mean()
mean['Day'] = mean.index
#mean.head()
Try this:
mean['avg3col'] = mean[['5 cm', '10 cm', '15 cm']].mean(axis=1)
df['new column'] = (df['col1'] + df['col2'] + df['col3']) / 3
You could use the apply method in the following way:
mean['Topsoil'] = mean.apply(lambda row: np.mean(row[0:3]), axis=1)
You can read about the apply method in the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The logic is that you perform the same task along a specific axis multiple times.
Note: it is unwise to give a data structure the name of a function; in your case mean_df would be a better name than mean.
Use DataFrame.iloc to select by position (the first 3 columns) and take the row-wise mean:
mean['Topsoil'] = mean.iloc[:, :3].mean(axis=1)
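As a quick check, a sketch with hypothetical depth columns:

import pandas as pd

# hypothetical frame whose first three columns are the soil depths
mean = pd.DataFrame({
    '5 cm': [1.0, 2.0],
    '10 cm': [3.0, 4.0],
    '15 cm': [5.0, 6.0],
    'Day': [1, 2],
})
mean['Topsoil'] = mean.iloc[:, :3].mean(axis=1)
print(mean['Topsoil'].tolist())  # [3.0, 4.0]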
I can successfully fill my new column with group counts, but I suspect there is a simpler way:
# How do I simplify this?
def f(gr):
    return pd.Series([gr['class_name'].count()] * gr.shape[0], index=gr.index)

df['class_size'] = df.groupby("class_name").apply(f).reset_index(level=0, drop=True)
column_list = ['class_name', 'class_size']
df[column_list].head(5)
This produces the expected class_size column (output table omitted).
I think you need transform:
df['class_size'] = df.groupby('class_name')['class_name'].transform('size')
Or:
df['class_size'] = df.groupby('class_name')['class_name'].transform('count')
What is the difference between size and count in pandas?
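In short: size counts every row in each group, while count excludes NaN values (on class_name itself they coincide, since the grouping key has no NaN). A small sketch of the difference on a column that does contain NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})
print(df.groupby('g')['v'].transform('size').tolist())   # [2, 2, 1] (NaN included)
print(df.groupby('g')['v'].transform('count').tolist())  # [1, 1, 1] (NaN excluded)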
Depending on your DataFrame shape you can also just do a count on the groupby:
import pandas as pd

df = pd.DataFrame({'class names': list('abracadabra'), 'class count': 1})
df.groupby('class names').count().reset_index()
#   class names  class count
# 0           a            5
# 1           b            2
# 2           c            1
# 3           d            1
# 4           r            2
I am trying to calculate the mean of Score 1, but only for rows where the Dates column equals Oct-16:
What I originally tried was:
import pandas as pd
import numpy as np
import os
dataFrame = pd.read_csv("test.csv")
for date in dataFrame["Dates"]:
    if date == "Oct-16":
        print(date)  # just checking
        print(dataFrame["Score 1"].mean())
But my result is the mean of the whole Score 1 column.
Another thing I tried was manually telling it which indices to calculate the mean for:
dataFrame["Score 1"].iloc[0:2].mean()
But ideally I would like to find a way to do it if Dates == "Oct-16".
Iterating through the rows doesn't take advantage of Pandas' strengths. If you want to do something with a column based on values of another column, you can use .loc[]:
dataFrame.loc[dataFrame['Dates'] == 'Oct-16', 'Score 1']
The first part of .loc[] selects the rows you want, using your specified criteria (dataFrame['Dates'] == 'Oct-16'). The second part specifies the column you want (Score 1). Then if you want to get the mean, you can just put .mean() on the end:
dataFrame.loc[dataFrame['Dates'] == 'Oct-16', 'Score 1'].mean()
To get the mean for every date at once, use groupby:
dataframe.groupby('Dates')['Score 1'].mean()
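A sketch with hypothetical data shaped like the question's:

import pandas as pd

dataFrame = pd.DataFrame({
    'Dates': ['Oct-16', 'Oct-16', 'Nov-16'],
    'Score 1': [10, 20, 30],
})
print(dataFrame.groupby('Dates')['Score 1'].mean())
# Dates
# Nov-16    30.0
# Oct-16    15.0
# Name: Score 1, dtype: float64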
import pandas as pd
import numpy as np

dataFrame = pd.read_csv("test.csv")
dates = dataFrame["Dates"]
score1s = dataFrame["Score 1"]
result = []
for i in range(len(dates)):
    if dates[i] == "Oct-16":
        result.append(score1s[i])
print(np.mean(result))  # a plain list has no .mean(); use np.mean instead
I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of the values in one column, leaving the others alone. The only way I could find to accomplish this is to edit the column, remove the original column, and then merge the edited one back in. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
    return float(var)

# create a new Series with the value type I want
df2 = df1['column'].apply(make_float)
# remove the original column
df3 = df1.drop('column', axis=1)
# merge the dataframes back together
df1 = pd.concat([df3, df2], axis=1)
It also doesn't work to apply the function to the column directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
This will raise an error if the conversion fails for any row.
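If some values may not parse cleanly, pd.to_numeric with errors='coerce' is a common alternative; unparseable entries become NaN instead of raising (sketch with hypothetical data):

import pandas as pd

df1 = pd.DataFrame({'column': ['1.5', '2', 'oops']})  # hypothetical values
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')
print(df1['column'].tolist())  # [1.5, 2.0, nan]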
apply does not work in place; it returns a new Series, which you discard in this line:
df1['column'].apply(make_float)
Apart from Yakym's solution, you can also do this:
df['column'] += 0.0
Note that this only converts an already-numeric column (for example int to float); adding 0.0 to strings raises a TypeError.