How to remove/replace any string from a dataframe? - python

I have a dataframe with 950 rows and 204 columns, and I want to find and replace every possible string value in it. When it is only one column, I can do that with the two lines of code below:
for col in ['name of column']:
    df[col].replace(r'^([A-Za-z]|[0-9]|_)+$', np.NaN, regex=True, inplace=True)
But now that there are more than 200 columns, how can I do that?
Any help is appreciated.

When it is only one column, I can do that with the two lines of code below: ...
You could, but you'd better not - iterating over a dataframe column by column can hurt performance (though with your current amount of data that hardly matters).
Just use replace on dataframe itself:
df = df.replace(r'^([A-Za-z]|[0-9]|_)+$', np.NaN, regex=True)
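For example, a minimal self-contained sketch (the sample data and column names are made up) showing that the single replace call covers every column at once:
import pandas as pd
import numpy as np

# Small example frame with a mix of strings and numbers
df = pd.DataFrame({'a': ['abc_1', 12.5], 'b': ['x', 'y2'], 'c': [3.4, 'foo']})

# Every purely alphanumeric/underscore string becomes NaN; numbers are left alone
df = df.replace(r'^([A-Za-z]|[0-9]|_)+$', np.nan, regex=True)
print(df)
#       a    b    c
# 0   NaN  NaN  3.4
# 1  12.5  NaN  NaN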

Related

Pandas: How to read contents of a CSV into a single column?

I want to read a file 'tos_year.csv' into a Pandas dataframe, such that all values are in one single column. I will later use pd.concat() to add this column to an existing dataframe.
The CSV file holds 80 entries in the form of years, i.e. "... 1966,1966,1966,1966,1967,1967,... "
What I can't figure out is how to read the values into one column with 80 rows, instead of 80 columns with one row.
This is probably quite basic but I'm new to this. Here's my code:
import pandas as pd
tos_year = pd.read_csv('tos_year.csv').T
tos_year.reset_index(inplace=True)
tos_year.columns = ['Year']
As you can see, I tried reading it in and then transposing the dataframe. But on the initial read the year numbers are interpreted as column names, and since there cannot be several columns with identical names, I end up with a dataframe that holds str-values like
...
1966
1966.1
1966.2
1966.3
1967
1967.1
...
which is not what I want. So clearly, it's preferable to read it in correctly from the start.
Thanks for any advice!
Add header=None to avoid parsing the years as column names, then transpose and rename the column, e.g. with DataFrame.set_axis:
tos_year = pd.read_csv('tos_year.csv', header=None).T.set_axis(['Year'], axis=1)
Or:
tos_year = pd.read_csv('tos_year.csv', header=None).T
tos_year.columns = ['Year']
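As a quick self-contained check, here is a sketch that uses io.StringIO in place of the actual file (the sample years are made up) to show the effect of header=None:
import io
import pandas as pd

# Stand-in for 'tos_year.csv': a single row of comma-separated years
csv_data = io.StringIO('1966,1966,1966,1967,1967')

tos_year = pd.read_csv(csv_data, header=None).T.set_axis(['Year'], axis=1)
print(tos_year)
#    Year
# 0  1966
# 1  1966
# 2  1966
# 3  1967
# 4  1967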

Pivot a dataframe by splitting a string & format specific columns

I am faced with a problem that is above my level of pandas - but might well be simple once I know the steps.
I have a dataframe with column names as below, and I want to extract the period from each column name and pivot it into a row, as in the second example below.
I also want to format each column differently - currently everything is just a number, but some columns should be percentages and others plain numbers with a certain number of decimals. What I have now and what I want are outlined below.
I have tried a few things - creating a multi index with a string splitting method and then pivoting the multi index. I feel I am on the right track but just cannot make it work at present. Any help appreciated.
What I have now in a dataframe:
client_return_12m,client_return_36m,client_return_60m,client_sharpe_12m,client_sharpe_36m,client_sharpe_60m
0.34116,0.56439,0.701156,0.74320,0.82349,0.76889
After:
period,client_return,client_sharpe
12m,34.1%,0.74
36m,56.4%,0.82
60m,70.1%,0.77
Use Series.str.rsplit to split on the last _, then reshape with DataFrame.stack:
df.columns = df.columns.str.rsplit('_', expand=True, n=1)
df = df.stack().reset_index(level=0, drop=True).rename_axis('period').reset_index()
print (df)
period client_return client_sharpe
0 12m 0.341160 0.74320
1 36m 0.564390 0.82349
2 60m 0.701156 0.76889
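For the formatting part of the question, a possible follow-up sketch (assuming the returns should be shown as percentages with one decimal and the Sharpe ratios rounded to two decimals, as in the desired output):
# Format returns as percentages and round the Sharpe ratios
df['client_return'] = df['client_return'].map(lambda x: f'{x:.1%}')
df['client_sharpe'] = df['client_sharpe'].round(2)
print(df)
#   period client_return  client_sharpe
# 0    12m         34.1%           0.74
# 1    36m         56.4%           0.82
# 2    60m         70.1%           0.77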

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do, the merged data doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc.
And this is my code that tries to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
with this output (the id, ts and tastant columns should match up with the first dataframe but don't).
Check your dtypes and make sure they match between the two dataframes. Pandas makes assumptions about data types when it imports; it could be treating numbers as int in one dataframe and object in the other.
For the string columns, check for extra whitespace. It can creep into datasets, and since you can't see it but pandas can, it results in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
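A sketch of both checks, reusing the key columns from the merge call in the question (adjust the names to the actual data):
keys = ['trial', 'experiment', 'subject', 'date', 'procedure']

# Compare the dtypes of the merge keys in both frames
print(first.dtypes[keys])
print(second.dtypes[keys])

# Normalize the string keys: cast to str and strip stray whitespace
for col in keys:
    first[col] = first[col].astype(str).str.strip()
    second[col] = second[col].astype(str).str.strip()

combined = first.merge(second, on=keys, how='left')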

Aggregate Python DF based on column

I have a big dataframe (approximately 35 columns), where one column, concat_strs, is a concatenation of 8 other columns in the dataframe. It is used to detect duplicates. What I want to do is aggregate the rows where concat_strs has the same value, summing the columns val, abs_val, price and abs_price.
I have done the following:
agg_attributes = {'val': 'sum', 'abs_val': 'sum', 'price': 'sum', 'abs_price': 'sum'}
final_df= df.groupby('concat_strs', as_index=False).aggregate(agg_attributes)
But, when I look at final_df, I notice 2 issues:
Other columns are removed, so I have only 5 columns. I have tried to do final_df.reindex(columns=df.columns), but then all of the other columns are NaN
The number of rows in the final_df remains the same as in the df (ca. 300k rows). However, it should be reduced (checked manually)
The question is - what is done wrong and is there any improvement suggestion?
You group by concat_strs, so only concat_strs and the columns in agg_attributes are kept - after a groupby, pandas does not know what to do with the other columns.
You can include all the other columns in the aggregation with 'first' to keep the first value of each column (if duplicated), or 'last', etc., depending on what you need.
Also, I'm not sure this is a good way to dedup - can you simply drop the duplicates instead?
You don't need concat_strs either, as groupby accepts a list of columns to group on.
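A sketch of the 'first' suggestion, building on the question's code (only concat_strs and the aggregated columns are known from the question; everything else is derived generically):
agg_attributes = {'val': 'sum', 'abs_val': 'sum', 'price': 'sum', 'abs_price': 'sum'}

# Keep the first value of every remaining column instead of dropping it
other_cols = [c for c in df.columns
              if c not in agg_attributes and c != 'concat_strs']
agg_attributes.update({c: 'first' for c in other_cols})

final_df = df.groupby('concat_strs', as_index=False).agg(agg_attributes)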
Not sure if I understood the question correctly, but you can try this:
final_df = df.groupby(['concat_strs']).sum()

What's the most efficient way to drop columns (from beginning and end) in pandas from a large dataframe?

I am trying to drop a number of columns from the beginning and end of the pandas dataframe.
My dataframe has 397 rows and 291 columns. I currently have this solution to remove the first 8 columns, but I also want to remove some at the end:
SMPS_Data = SMPS_Data.drop(SMPS_Data.columns[0:8], axis=1)
I know I could just repeat this step and remove the last few columns, but I was hoping there is a more direct way to approach this problem.
I tried using
SMPS_Data = SMPS_Data.drop(SMPS_Data.columns[0:8,278:291], axis=1)
but it doesn't work.
Also, it seems that the .drop method somehow slows down the console responsiveness, so maybe there's a cleaner way to do it?
You could use .drop() if you want to remove your columns by their column names:
drop_these = ['column_name1', 'column_name2', 'last_columns']
df = df.drop(columns=drop_these)
If you want to remove them by their position instead, you could use .iloc[]:
df.iloc[:, 8:15]   # columns at positions 8 through 14
df.iloc[:, :-5]    # all columns except the last five
df.iloc[:, 2:-5]   # all columns except the first two and the last five
See the pandas documentation on indexing and selecting data for more information.
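For the specific shape in the question (drop the first 8 columns and everything from position 278 on), a single positional slice avoids two separate drop calls; the 278 cut-off is taken from the question and may need adjusting:
SMPS_Data = SMPS_Data.iloc[:, 8:278]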
