Dear fellows, I'm having difficulty applying a condition to a column in my DataFrame: I want to iterate over the column and extract only the values that start with the number 6. The values in that column are floats.
The column is called "Vendor".
This is my DataFrame, and I want to sum the values from the column "Amount in loc.curr.2", but only for rows where the value in column "Vendor" starts with 6.
This is what I've been trying:
Also this:
idx = df_spend['Vendor'].apply(lambda x: str(x).startswith('6'))
This should create a Boolean pandas.Series that you can use as an index.
summed_col = df_spend.loc[idx, "Amount in loc.curr.2"].sum()
summed_col now contains the sum of that column over the matching rows.
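Here is a minimal sketch on made-up data (the vendor numbers and amounts below are assumptions, not your real values):

import pandas as pd

df_spend = pd.DataFrame({
    "Vendor": [6001.0, 7002.0, 6010.0],
    "Amount in loc.curr.2": [100.0, 50.0, 25.0],
})

# Build the boolean mask, then sum only the matching rows
idx = df_spend["Vendor"].apply(lambda x: str(x).startswith("6"))
print(df_spend.loc[idx, "Amount in loc.curr.2"].sum())  # 125.0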
Definitely take a look at the pandas documentation for the apply function: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Hope this works! :)
I'm getting the following error when trying to group by and sum a dataframe on specific columns.
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
I've checked other solutions and it's not a double column name header issue.
See df3 below, which I want to group by on all columns except the last two, which I want to sum().
dfs.head() shows that if I just group by the column names it works fine, but not with iloc, which I thought was the correct way to pull back the columns I want to group by.
I need to use iloc as the final dataframe will have many more columns.
df.iloc[:,0:3] returns a dataframe, so you are trying to group a dataframe by another dataframe.
But you just need a list of column names.
Can you try this:
dfs = df3.groupby(list(df3.iloc[:, 0:3].columns))[['Churn_Alive_1', 'Churn_Alive_0']].sum()
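For example, on a toy frame (these column names are assumptions standing in for your real ones):

import pandas as pd

df3 = pd.DataFrame({
    "Region": ["A", "A", "B"],
    "Plan": ["x", "x", "y"],
    "Year": [2020, 2020, 2021],
    "Churn_Alive_1": [1, 2, 3],
    "Churn_Alive_0": [4, 5, 6],
})

# iloc[:, 0:3].columns is just the first three column labels
group_cols = list(df3.iloc[:, 0:3].columns)
dfs = df3.groupby(group_cols)[['Churn_Alive_1', 'Churn_Alive_0']].sum()
print(dfs)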
I have a dataframe that contains product sales for each day from 2018 through 2021. The dataframe has four columns (Date, Place, ProductCategory and Sales). I want to fill the gaps in the first two columns (Date, Place) using the available data. Once the data is added, I would like to delete the rows that have no data in ProductCategory. I would like to do this in Python with pandas.
The sample of my data set looked like this:
I would like the dataframe to look like this:
Use fillna with method 'ffill', which propagates the last valid observation forward to the next valid one. Then drop the rows that contain NAs.
df['Date'].fillna(method='ffill',inplace=True)
df['Place'].fillna(method='ffill',inplace=True)
df.dropna(inplace=True)
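As a quick check on made-up rows (.ffill() is the shorthand for fillna(method='ffill'), and the sample values below are assumptions):

import pandas as pd

df = pd.DataFrame({
    "Date": ["2018-01-01", None, None],
    "Place": ["Rome", None, "Milan"],
    "ProductCategory": ["Food", None, "Drinks"],
    "Sales": [10, 20, 30],
})

# Forward-fill the first two columns, then drop rows that still have NAs
df[["Date", "Place"]] = df[["Date", "Place"]].ffill()
df.dropna(inplace=True)
print(df)  # the middle row, which has no ProductCategory, is dropped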
You can use the forward-filling method to replace null values with the value of the nearest one above: df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill'). Next, drop the rows with missing values: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
Compute the frequency of categories in the column by plotting;
from the plot you can see bars representing the most repeated values:
df['column'].value_counts().plot.bar()
Then get the most frequent value via the index: index[0] gives the most repeated value,
index[1] gives the 2nd most repeated, and so on, so you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
Then fill the missing values with that value:
df['column'].fillna(most_frequent_attribute, inplace=True)
To fill multiple columns with the same method, just define this as a function, like this:
def impute_nan(df, column):
    most_frequent_category = df[column].mode()[0]
    df[column].fillna(most_frequent_category, inplace=True)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)
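For example, on a hypothetical frame with gaps in two columns:

import pandas as pd

df = pd.DataFrame({
    "column1": ["a", "a", None, "b"],
    "column2": ["x", None, "y", "y"],
})

for feature in ["column1", "column2"]:
    impute_nan(df, feature)

print(df)  # NaNs replaced by each column's mode: 'a' and 'y'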
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; they are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR whose 'ID' is equal to one in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed here https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and my goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want to get a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
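To see the difference between the two on toy frames (the columns besides 'ID' are made up):

import pandas as pd

MR = pd.DataFrame({"ID": [1, 2, 2, 3], "mr_col": ["a", "b", "c", "d"]})
DT = pd.DataFrame({"ID": [2, 3], "dt_col": ["x", "y"]})

# isin() keeps MR's own rows and columns, duplicates included
print(MR.loc[MR.ID.isin(DT.ID), :])

# merge() combines columns from both frames on matching IDs
print(pd.merge(MR, DT, on='ID'))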
I have a dataframe which looks as follows:
I want to multiply elements in a row except for the "depreciation_rate" column with the value in the same row in the "depreciation_rate" column.
I tried df2.iloc[:,6:26]*df2["depreciation_rate"] as well as df2.iloc[:,6:26].mul(df2["depreciation_rate"])
I get the same result with both, which looks as follows: NaN values and additional columns that I don't want. I think the elements in each row are also being multiplied by values from other rows of the "depreciation_rate" column. What would be a good way to solve this issue?
Try using mul() along axis=0:
df2.iloc[:,6:26].mul(df2["depreciation_rate"], axis=0)
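Without axis=0, pandas aligns the Series' index labels against the frame's column labels, which is what produces the extra NaN columns. A small sketch (the column names here are assumptions):

import pandas as pd

df2 = pd.DataFrame({
    "year_1": [100.0, 200.0],
    "year_2": [300.0, 400.0],
    "depreciation_rate": [0.1, 0.2],
})

# axis=0 aligns the Series on the row index instead of on column labels
print(df2[["year_1", "year_2"]].mul(df2["depreciation_rate"], axis=0))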
I would like to sort a pandas dataframe, as follows:
order by the first column;
if two rows are equal in the first column, order by the second column; if two rows are equal in the second column, order by the third column, and so on.
I would like to obtain the same behaviour as this MATLAB function (https://it.mathworks.com/help/matlab/ref/double.sortrows.html#bt8bz9j-2).
Is there a function in pandas for this?
I hope I have been clear, thanks!
In pandas we have pd.DataFrame.sort_values():
out = df.sort_values(df.columns.tolist())
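For example, on a small made-up frame:

import pandas as pd

df = pd.DataFrame({"a": [2, 1, 1], "b": [5, 9, 3], "c": [0, 1, 2]})

# Sorting by every column in order mirrors MATLAB's sortrows
out = df.sort_values(df.columns.tolist())
print(out)  # ties in 'a' are broken by 'b', then by 'c'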