I have a dataset with many columns. I want to extract the numeric columns, replace their missing values with the column mean, and then put these modified columns back in place of the originals in the dataframe.
df1 = df.select_dtypes(include = ["number"]).apply(lambda x: x.fillna(x.mean()),axis=0)
df.loc[df.select_dtypes(include = ["number"])] = df1
I managed to extract the numeric columns, but I couldn't replace them in the original dataframe. The idea is not to have to manually list which columns are numeric.
It's probably easier to assign a new/changed DataFrame. This will only change the columns you altered.
new_df = df.assign(**df.select_dtypes('number').apply(lambda x: x.fillna(x.mean())))
If you want to preserve the original DataFrame, you can do it in steps:
cols = df.select_dtypes('number').columns
df[cols] = df[cols].apply(lambda x: x.fillna(x.mean()))
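As a quick check, here's a minimal, self-contained sketch of this two-step approach on a toy frame (the data and column names are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: one numeric column with a NaN, one non-numeric column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", "z"],
})

# Select numeric columns, then fill each one's NaNs with its own mean.
cols = df.select_dtypes("number").columns
df[cols] = df[cols].apply(lambda x: x.fillna(x.mean()))

print(df["a"].tolist())  # [1.0, 2.0, 3.0] -- the NaN became mean(1.0, 3.0)
```

The non-numeric column `b` is untouched, and no column names had to be listed by hand.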
I have the following pandas dataframe
I would like it to be converted to a pandas dataframe with one row. Is there a simple way to do it? I tried pivot but got weird results.
You can pivot, swap the levels of the column names, shift values up to fill NaN values, and flatten the column names:
out = df.pivot(columns='Study Identification').swaplevel(0,1,axis=1).apply(lambda x: pd.Series(x.dropna().values)).fillna('')
out.columns = out.columns.map(''.join)
So in your case, reshape the df with unstack:
s = df.set_index('A',append=True).unstack(level=1).swaplevel(0,1,axis=1)
s.columns = s.columns.map(''.join)
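Since the question's original dataframe isn't shown, here is a minimal sketch on a hypothetical frame, just to illustrate how the unstack/swaplevel/join chain flattens the column names:

```python
import pandas as pd

# Hypothetical input; the original question's data is not shown.
df = pd.DataFrame({"A": ["p", "q"], "val": [10, 20]})

# Move 'A' into the index, unstack it into the columns,
# put the 'A' values on the outer column level, then flatten.
s = df.set_index("A", append=True).unstack(level=1).swaplevel(0, 1, axis=1)
s.columns = s.columns.map("".join)

print(list(s.columns))  # ['pval', 'qval']
```

Each value of `A` becomes part of a flattened column name, which is the reshaping step the answer relies on.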
I have a DataFrame with columns like:
>>> df.columns
['A_ugly_column_name', 'B_ugly_column_name', ...]
and a Series, series_column_names, with nice column names like:
>>> series_column_names = pd.Series(
data=["A_ugly_column_name", "B_ugly_column_name"],
index=["A", "B"],
)
>>> print(series_column_names)
A A_ugly_column_name
B B_ugly_column_name
...
Name: column_names, dtype: object
Is there a nice way to rename the columns in df according to series_column_names? More specifically, I'd like to rename the columns in df to the index in column_names where value in the series is the old column name in df.
Some context - I have several DataFrames with columns for the same thing, but they're all named slightly differently. I have a DataFrame where, like here, the index is a standardized name and the columns contain the column names used by the various DataFrames. I want to use this "name mapping" DataFrame to rename the columns in the several DataFrames to the same thing.
A solution I have...
So far, the best solution I have is:
>>> df.rename(columns=lambda old: series_column_names.index[series_column_names == old][0])
which works but I'm wondering if there's a better, more pandas-native way to do this.
First, create a dictionary out of your series by using .str.split:
cols = {y: x for x, y in series_column_names.str.split(r'\s+').tolist()}
print(cols)
Edit.
If your series has your target column names as the index and the old names as the values, you can still create a dictionary by inverting the keys and values.
cols = {y : x for x,y in series_column_names.to_dict().items()}
or
cols = dict(zip(series_column_names.tolist(), series_column_names.index))
print(cols)
{'B_ugly_column_name': 'B_nice_column_name',
'C_ugly_column_name': 'C_nice_column_name',
'A_ugly_column_name': 'A_nice_column_name'}
Then assign your column names:
df.columns = df.columns.map(cols)
print(df)
A_nice_column_name B_nice_column_name
0 0 0
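Putting the dict approach together on a small, made-up frame:

```python
import pandas as pd

series_column_names = pd.Series(
    data=["A_ugly_column_name", "B_ugly_column_name"],
    index=["A", "B"],
)

# Invert the series into a dict: old (ugly) name -> new (nice) name.
cols = dict(zip(series_column_names.tolist(), series_column_names.index))

df = pd.DataFrame({"A_ugly_column_name": [0], "B_ugly_column_name": [0]})
df.columns = df.columns.map(cols)

print(list(df.columns))  # ['A', 'B']
```

One caveat: `Index.map` turns any column missing from the dict into NaN, so this assumes every column in `df` appears in the mapping.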
Just invert the index/values in series_column_names and use the result to rename. It doesn't matter if there are extra names.
series_column_names = pd.Series(
data=["A_ugly", "B_ugly", "C_ugly"],
index=["A", "B", "C"],
)
df.rename(columns=pd.Series(series_column_names.index.values, index=series_column_names))
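A runnable sketch of this, with hypothetical ugly names (note the extra "C_ugly" entry is simply ignored by rename):

```python
import pandas as pd

df = pd.DataFrame({"A_ugly": [1], "B_ugly": [2]})

series_column_names = pd.Series(
    data=["A_ugly", "B_ugly", "C_ugly"],
    index=["A", "B", "C"],
)

# Build the inverse mapping (old name -> nice name) and rename.
renamed = df.rename(
    columns=pd.Series(series_column_names.index.values,
                      index=series_column_names)
)

print(list(renamed.columns))  # ['A', 'B']
```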
Wouldn't it be as simple as this?
series_column_names = pd.Series(['A_nice_column_name', 'B_nice_column_name'])
df.columns = series_column_names
I have a dataframe ('df') containing several columns and would like to only keep those columns with a column header starting with the prefix 'x1' or 'x4'. That is, I want to 'drop' all columns except those with a column header starting with either 'x1' or 'x4'.
How can I do this in one step?
I know that if I wanted to keep only those columns with the x1 prefix I could do:
df = df[df.columns.drop(list(df.filter(regex='x1')))]
..but this results in me losing columns with the x4 prefix, which I want to keep.
Similarly, if I wanted to keep only those columns with the x4 prefix I can do:
df = df[df.columns.drop(list(df.filter(regex='x4')))]
..but this results in me losing columns with the x1 prefix, which I want to keep.
You can use df.loc with list comprehension:
df.loc[:, [x for x in df.columns if x.startswith(('x1', 'x4'))]]
It will show you all rows, and only the columns whose names start with 'x1' or 'x4'.
You can choose the desired columns first and then just select those columns.
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if x.startswith("x1") or x.startswith("x4")]
df = df[desired_columns]
You can also use a function:
def is_valid(x):
    return x.startswith("x1") or x.startswith("x4")
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if is_valid(x)]
df = df[desired_columns]
You can also use the filter option:
df.filter(regex='^x1|^x4')
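For example, on a made-up frame with one column of each kind:

```python
import pandas as pd

df = pd.DataFrame({"x1_a": [1], "x2_b": [2], "x4_c": [3]})

# Keep only columns whose names start with 'x1' or 'x4'.
kept = df.filter(regex="^x1|^x4")

print(list(kept.columns))  # ['x1_a', 'x4_c']
```

The `^` anchors matter: without them, `filter` would also keep a column like `"foo_x1"` whose name merely contains `x1`.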
What I have is a list of DataFrames.
What is important to note is that the DataFrames differ in shape, having between 2 and 7 columns, and the columns are named from 0 up to the number of columns (e.g. df1 has 5 columns named 0,1,2,3,4, while df2 has 4 columns named 0,1,2,3).
What I would like is to check whether any row in a column contains a certain string, and if so, delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is the below, and I get an error that column 5 is not in axis (it only exists in some of the DataFrames):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs:
    for col in df.columns:
        # If you are unsure about column types, cast column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
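A self-contained version of the loop above on two toy frames of different widths (data invented for illustration):

```python
import pandas as pd

# Two frames, as in the question: integer column labels, differing shapes.
dfs = [
    pd.DataFrame({0: ["DEC", "a"], 1: ["b", "c"]}),
    pd.DataFrame({0: ["x", "y"]}),
]

for df in dfs:
    for col in df.columns:
        # Cast to string, then drop the column if any cell contains "DEC".
        if df[col].astype(str).str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)

print([list(d.columns) for d in dfs])  # [[1], [0]]
```

Because each frame is checked against its own columns, the "column 5 is not in axis" error from the question cannot occur.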
You can write a custom function that checks whether the dataframe has the pattern or not. You can use pd.Series.str.contains with pd.Series.any
def func(s):
    return s.str.contains('DEC').any()
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
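For illustration, the same idea on two hypothetical frames (an `astype(str)` is added here in case some columns are not strings):

```python
import pandas as pd

def func(s):
    # True if any cell in this column contains "DEC".
    return s.astype(str).str.contains("DEC").any()

list_dfs1 = [
    pd.DataFrame({0: ["DEC", "a"], 1: ["b", "c"]}),
    pd.DataFrame({0: ["x", "y"], 1: ["DEC", "z"]}),
]

# df.apply(func) yields one boolean per column; ~ keeps the clean ones.
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]

print([list(d.columns) for d in list_df])  # [[1], [0]]
```

Unlike the in-place loop, this builds new frames and leaves the originals untouched.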
I would take another approach: concatenate the list into one data frame and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column with "DEC"
df.mask(df == "DEC").dropna(axis=1, how="any")
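A sketch on invented data. Two caveats: this matches only cells that exactly equal "DEC" (not substrings), and since frames of differing widths introduce NaNs on concat, those columns would be dropped too:

```python
import pandas as pd

list_dfs1 = [
    pd.DataFrame({0: ["DEC", "a"], 1: ["b", "c"]}),
    pd.DataFrame({0: ["x", "y"], 1: ["d", "e"]}),
]

# Stack all frames, turn "DEC" cells into NaN, drop any column with a NaN.
df = pd.concat(list_dfs1)
out = df.mask(df == "DEC").dropna(axis=1, how="any")

print(list(out.columns))  # [1]
```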
I have a dataframe containing 4 columns: the first 3 are numerical variables that indicate features of the variable in the last column, and the last column contains strings.
I want to merge the last string column by grouping on the previous 3 columns with the groupby function. That works (I mean the strings that share the same features logged in the first three columns were merged successfully).
Previously the length of the dataframe was 1200, and the length of the merged dataframe is 1100. I found the resulting df is multi-indexed (hierarchical index) and only contains 2 columns. So I tried the reindex method with a generated ascending numerical list. Sadly I failed.
df1.columns
*[Out]Index(['time', 'column','author', 'text'], dtype='object')
series = df1.groupby(['time', 'column', 'author'])['body_text'].sum()  # merge the last column by the first 3 columns
dfx = series.to_frame()  # get the new df
dfx.columns
*[Out]Index(['author', 'text'], dtype='object')
len(dfx)
*[Out]1100
indexs = list(range(1100))
dfx.reindex(index = indexs)
*[Out]Exception: cannot handle a non-unique multi-index!
Reindex here is not necessary; better to use DataFrame.reset_index, or to pass as_index=False to DataFrame.groupby:
dfx = df1.groupby(['time', 'column','author'])['body_text'].sum().reset_index()
Or:
dfx = df1.groupby(['time', 'column','author'], as_index=False)['body_text'].sum()
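A minimal example with made-up data; note that summing an object column concatenates the strings within each group, and as_index=False keeps the group keys as ordinary columns with a flat integer index:

```python
import pandas as pd

df1 = pd.DataFrame({
    "time":      [1, 1, 2],
    "column":    ["a", "a", "b"],
    "author":    ["u", "u", "v"],
    "body_text": ["foo", "bar", "baz"],
})

# Group on the first three columns; sum concatenates the strings per group.
dfx = df1.groupby(["time", "column", "author"], as_index=False)["body_text"].sum()

print(len(dfx), dfx["body_text"].tolist())  # 2 ['foobar', 'baz']
```

The result has a plain RangeIndex, so no reindex step is needed afterwards.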