Python Pandas Copy Columns

How do I copy multiple columns from one dataframe to a new dataframe? It would also be nice to rename them at the same time.
df2['colA'] = df1['col-a']  # This works
df2['colA', 'colB'] = df1['col-a', 'col-b']  # Tried and failed
Thanks

You have to use double brackets:
df2[['colA', 'colB']] = df1[['col-a', 'col-b']]

This is tried and tested for pandas 1.3.0:
df2 = df1[['col-a', 'col-b']].copy()
If you also want to rename the column names at the same time, you can write:
df2 = pd.DataFrame(columns=['colA', 'colB'])
df2[['colA', 'colB']] = df1[['col-a', 'col-b']]
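Alternatively, here is a minimal sketch (column names taken from the question, toy data assumed) that copies and renames in one chained step with DataFrame.rename:
import pandas as pd

# stand-in for the question's df1
df1 = pd.DataFrame({'col-a': [1, 2], 'col-b': [3, 4]})
# copy the two columns and rename them in the same expression
df2 = df1[['col-a', 'col-b']].rename(columns={'col-a': 'colA', 'col-b': 'colB'})
print(df2.columns.tolist())  # ['colA', 'colB']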

The following also works:
# original DataFrame
df = pd.DataFrame({'a': ['hello', 'cheerio', 'hi', 'bye'], 'b': [1, 0, 1, 0]})
# new DataFrame created from 2 original cols (new cols renamed)
df_new = pd.DataFrame(columns=['greeting', 'mode'], data=df[['a','b']].values)
If you want to use a condition for the new dataframe:
df_new = pd.DataFrame(columns=['farewell', 'mode'], data=df[df['b']==0][['a','b']].values)
Or if you want to use just particular rows (by index), you can use loc:
df_new = pd.DataFrame(columns=['greetings', 'mode'], data=df.loc[2:3][['a','b']].values)
# if you need to preserve the row index, add index=... as an argument:
df_new = pd.DataFrame(columns=['farewell', 'mode'],
                      data=df.loc[2:3][['a', 'b']].values,
                      index=df.loc[2:3].index)

As of pandas 1.2.4, the easiest way would be:
df2[['colA', 'colB']] = df1[['col-a', 'col-b']].values
Note that it doesn't require .copy(): .values returns the underlying NumPy array, and since an array carries no column labels, the assignment copies the data into df2 positionally instead of aligning on column names, so no separate copy step is needed.
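To make that concrete, a small sketch (toy data assumed): the array carries no labels, so the target columns are filled positionally.
import pandas as pd

df1 = pd.DataFrame({'col-a': [1, 2], 'col-b': [3, 4]})
df2 = pd.DataFrame(index=df1.index)
# no column-name alignment happens, because a NumPy array has no labels
df2[['colA', 'colB']] = df1[['col-a', 'col-b']].values
print(df2)
#    colA  colB
# 0     1     3
# 1     2     4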

Related

Is there a function for making a new dataframe using pandas to select only part of a word?

I am looking to select all values that include "Hennessy" in the name, e.g. "Hennessy Black Cognac", "Hennessy XO". I know it would simply be
trial = Sales[Sales["Description"] == "Hennessy"]
if I wanted only the value "Hennessy", but I want it if it contains the word "Hennessy" at all.
I'm working in Python with pandas imported.
Thanks :)
You can use the in keyword to check whether a substring is present in a string. Applied element-wise, that looks like this:
trial = Sales[Sales["Description"].apply(lambda x: "hennessy" in x.lower())]
You can try using str.startswith:
import pandas as pd
# initialize list of lists
data = [['Hennessy Black Cognac', 10], ['Hennessy XO', 15], ['julian merger', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
new_df = df.loc[df.Name.str.startswith('Hennessy', na=False)]
new_df
Or you can use apply to apply any string-matching function to your column element-wise:
df_new = df[df['Name'].apply(lambda x: x.startswith('Hennessy'))]
df_new
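Since the question asks for matches anywhere in the string rather than only at the start, a sketch using Series.str.contains (case-insensitive; data reused from the example above) may fit better:
import pandas as pd

data = [['Hennessy Black Cognac', 10], ['Hennessy XO', 15], ['julian merger', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
# na=False treats missing names as non-matches
new_df = df[df['Name'].str.contains('hennessy', case=False, na=False)]
print(new_df)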

Rename columns in a pandas.DataFrame via lookup from a Series

I have a DataFrame with columns like:
>>> df.columns
['A_ugly_column_name', 'B_ugly_column_name', ...]
and a Series, series_column_names, with nice column names like:
>>> series_column_names = pd.Series(
...     data=["A_ugly_column_name", "B_ugly_column_name"],
...     index=["A", "B"],
... )
>>> print(series_column_names)
A A_ugly_column_name
B B_ugly_column_name
...
Name: column_names, dtype: object
Is there a nice way to rename the columns in df according to series_column_names? More specifically, I'd like to rename each column in df to the index entry in series_column_names whose value is that column's current name.
Some context - I have several DataFrames with columns for the same thing, but they're all named slightly differently. I have a DataFrame where, like here, the index is a standardized name and the columns contain the column names used by the various DataFrames. I want to use this "name mapping" DataFrame to rename the columns in the several DataFrames to the same thing.
A solution I have...
So far, the best solution I have is:
>>> df.rename(columns=lambda old: series_column_names.index[series_column_names == old][0])
which works but I'm wondering if there's a better, more pandas-native way to do this.
First create a dictionary out of your series by using .str.split (this assumes each value holds both names separated by whitespace):
cols = {y: x for x, y in series_column_names.str.split(r'\s+').tolist()}
print(cols)
Edit.
If your series has the target column names as the index and the old names as the values, you can still create a dictionary by inverting the keys and values:
cols = {y: x for x, y in series_column_names.to_dict().items()}
or
cols = dict(zip(series_column_names.tolist(), series_column_names.index))
print(cols)
{'A_ugly_column_name': 'A', 'B_ugly_column_name': 'B'}
then assign your column names.
df.columns = df.columns.map(cols)
print(df)
   A  B
0  0  0
Just invert the index/values in series_column_names and use the result to rename. It doesn't matter if there are extra names.
series_column_names = pd.Series(
    data=["A_ugly", "B_ugly", "C_ugly"],
    index=["A", "B", "C"],
)
df.rename(columns=pd.Series(series_column_names.index.values, index=series_column_names))
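For completeness, a runnable sketch of this inverse-mapping rename, with toy data taken from the question:
import pandas as pd

df = pd.DataFrame([[0, 0]], columns=['A_ugly_column_name', 'B_ugly_column_name'])
series_column_names = pd.Series(
    data=['A_ugly_column_name', 'B_ugly_column_name'],
    index=['A', 'B'],
)
# invert: old name -> new name; rename accepts any dict-like mapping
mapping = pd.Series(series_column_names.index.values, index=series_column_names)
df = df.rename(columns=mapping)
print(df.columns.tolist())  # ['A', 'B']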
Wouldn't it be as simple as this? (Note that this assigns names positionally, so it only works when the order already matches.)
series_column_names = pd.Series(['A_nice_column_name', 'B_nice_column_name'])
df.columns = series_column_names

How to apply pandas.DataFrame.dropna on a subset of columns with inplace = True and axis = 1?

import pandas as pd
df = pd.DataFrame({
    'col1': [99, None, 99],
    'col2': [4, 5, 6],
    'col3': [7, None, None]})
col_list = ['col1', 'col2']
df[col_list].dropna(axis=1, thresh=2, inplace=True)
This returns a warning and leaves the dataframe unchanged:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
The following generates no warning but still leaves the DataFrame unchanged.
df.loc[:,col_list].dropna(axis=1, thresh=2, inplace=True)
Problem:
From among a list of columns specified by the user, remove those columns from the dataframe which have fewer than 'thresh' non-null values. Make no changes to the columns that are not in the list.
I need to use inplace=True to avoid making a copy of the dataframe, since it is huge.
I cannot loop over the columns and apply dropna one column at a time, because pandas.Series.dropna does not have the 'thresh' argument.
Funnily enough, dropna does not support this functionality, but there is a workaround.
v = df[col_list].notna().sum().lt(2)  # thresh=2: drop columns with fewer than 2 non-null values
df.drop(v.index[v], axis=1, inplace=True)
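Applied to the question's data, a quick check (thresh=2 as in the example): both columns in col_list have at least two non-null values, so nothing is dropped, and col3 is never touched.
import pandas as pd

df = pd.DataFrame({
    'col1': [99, None, 99],
    'col2': [4, 5, 6],
    'col3': [7, None, None]})
col_list = ['col1', 'col2']
# drop the col_list columns having fewer than 2 non-null values
v = df[col_list].notna().sum().lt(2)
df.drop(v.index[v], axis=1, inplace=True)
print(df.columns.tolist())  # ['col1', 'col2', 'col3']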
By the way,
I need to use inplace=True to avoid making a copy of the dataframe
I'm sorry to inform you that even with inplace=True, a copy is generated. The only difference is that the copy is assigned back to the original object in-place, so a new object is not returned.
I think the problem is that df[col_list] (the slicing) creates a new dataframe, and inplace=True then acts on that copy rather than on the original.
You might try the subset param of dropna, but note that with axis=1, subset refers to row labels (the rows whose values are counted), not a list of columns to protect, so passing col_list here raises a KeyError rather than solving the problem:
df.dropna(axis=1, thresh=2, subset=col_list, inplace=True)
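To make the caveat concrete, here is what subset actually does with axis=1 (row labels assumed from the example): it restricts which rows are counted, not which columns may be dropped.
import pandas as pd

df = pd.DataFrame({
    'col1': [99, None, 99],
    'col2': [4, 5, 6],
    'col3': [7, None, None]})
# only rows 0 and 2 are counted; col3 has a single non-null value there
df.dropna(axis=1, thresh=2, subset=[0, 2], inplace=True)
print(df.columns.tolist())  # ['col1', 'col2']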

Pandas how to concat two dataframes without losing the column headers

I have the following toy code:
import pandas as pd
df = pd.DataFrame()
df["foo"] = [1, 2, 3, 4]
df2 = pd.DataFrame()
df2["bar"] = [4, 5, 6, 7]
df = pd.concat([df, df2], ignore_index=True, axis=1)
print(list(df))
Output: [0,1]
Expected Output: [foo,bar] (order is not important)
Is there any way to concatenate two dataframes without losing the original column headers, if I can guarantee that the headers will be unique?
Iterating through the columns and then adding them to one of the DataFrames comes to mind, but is there a pandas function, or concat parameter that I am unaware of?
Thanks!
As stated in the merge, join, and concat documentation, ignore_index=True removes all name references along the concatenation axis and uses a range (0...n-1) instead; with axis=1 that means the column names. So it should give you the result you want once you remove the ignore_index argument or set it to False (the default).
df = pd.concat([df, df2], axis=1)
This will join your df and df2 based on their indexes (rows with the same index are concatenated side by side; if one dataframe has no row at that index, the values are filled with NaN).
If your dataframes have different indexes and you still want to concatenate them this way, you can either reset to a temporary index and join on that (as sketched below), or set the new dataframe's columns after using concat(..., ignore_index=True).
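A minimal sketch of the temporary-index idea (toy frames assumed): resetting both indexes makes the rows pair up positionally while the column names survive.
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4]}, index=[10, 11, 12, 13])
df2 = pd.DataFrame({'bar': [4, 5, 6, 7]})
# drop both indexes so concat pairs rows by position, not by label
out = pd.concat([df.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(list(out))  # ['foo', 'bar']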
I don't think the accepted answer answers the question, which is about column headers, not indexes.
I am facing the same problem, and my workaround is to add the column names after the concatenation:
df.columns = ["foo", "bar"]

Selecting/excluding sets of columns in pandas [duplicate]

This question already has answers here: Delete a column from a Pandas DataFrame
I would like to create views or dataframes from an existing dataframe based on column selections.
For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the following, but it didn't work:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# Try to create a second dataframe df2 from df with all columns except 'B' and 'D'
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
# This returns an error ("unhashable type: set")
df2 = df[my_cols]
What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?
You can either drop the columns you do not need or select the ones you want:
# Using DataFrame.drop, by column position
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# or drop by name
df1 = df1.drop(['B', 'C'], axis=1)
# Select the ones you want
df1 = df[['A', 'D']]
There is a new index method called difference. It returns the original columns, with the columns passed as argument removed.
Here, the result is used to remove columns B and D from df:
df2 = df[df.columns.difference(['B', 'D'])]
Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.
Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:
# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns
# does not create a copy of the dataframe
subset = df.columns.difference(['B', 'D'])
df = df.drop_duplicates(subset=subset)
Another option, without dropping or filtering in a loop:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]
# or more simply include columns:
df[['A', 'B']]
# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]
# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically
df[df.columns.difference(['C', 'D'])]
You don't really need to convert that into a set:
cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
Also have a look into the built-in DataFrame.filter function.
Minimalistic but greedy approach (sufficient for the given df):
df.filter(regex="[^BD]")
Conservative/lazy approach (exact matches only):
df.filter(regex="^(?!(B|D)$).*$")
Conservative and generic:
exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
You have 4 columns: A, B, C, D.
Here is a better way to select the columns you need for the new dataframe:
df2 = df1[['A', 'D']]
If you wish to use column numbers instead, use iloc:
df2 = df1.iloc[:, [0, 3]]
You just need to convert your set to a list:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
Here's how to create a copy of a DataFrame excluding a list of columns:
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)
But be careful! You mention views in your question, suggesting that if you changed df, you'd want df2 to change too. (Like a view would in a database.)
This method doesn't achieve that:
>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
A B C D
0 999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
A C
0 0.251262 -1.980673
Note that this is also true of #piggybox's method (although that method is nice and slick and Pythonic; I'm not doing it down!).
For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.
In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.
As of pandas 0.20.0, usecols accepts callables. This update allows more flexible options for reading columns:
skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)
This pattern is essentially the inverse of the traditional usecols usage: the specified columns are skipped instead of kept.
Given
Data in a file
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
filename = "foo.csv"
df.to_csv(filename)
Code
skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1
Output
A C
0 0.062350 0.076924
1 -0.016872 1.091446
2 0.213050 1.646109
3 -1.196928 1.153497
4 -0.628839 -0.856529
...
Details
A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).
Note that for the OP's situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.
A request was raised for a "skipcols" option in this issue and was addressed in a later issue.
