Change the order of DataFrame columns using column names - python

I need to move the 3rd column to the end of the column list. This can be done as follows:
cols = df.columns.tolist()
cols = cols[0:2] + cols[3:] + [cols[2]]
df = df[cols]
However, what if I don't know the index of the column I'm interested in? Let's say the only thing I know is that the column called MyMagicCol should be moved to the end of the column list.
How can I do this?

If you want to do it within a list, and you know the name of the column you're interested in, you could just do
cols = [c for c in df.columns if c != 'MyMagicCol'] + ['MyMagicCol']
df = df[cols]
You could also do this with df.loc and use the column names to slice your dataframe if you wanted to work with the columns directly.
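For example, a minimal sketch (the dataframe contents here are made up for illustration):
import pandas as pd
df = pd.DataFrame({'a': [1], 'MyMagicCol': [2], 'b': [3]})
cols = [c for c in df.columns if c != 'MyMagicCol'] + ['MyMagicCol']
df = df.loc[:, cols]  # same result as df[cols], slicing by column labels
print(df.columns.tolist())  # ['a', 'b', 'MyMagicCol']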

Related

Python/Pandas: drop columns *not* containing either of two strings in one step?

I have a dataframe ('df') containing several columns and would like to only keep those columns with a column header starting with the prefix 'x1' or 'x4'. That is, I want to 'drop' all columns except those with a column header starting with either 'x1' or 'x4'.
How can I do this in one step?
I know that if I wanted to keep only those columns with the x1 prefix I could do:
df = df[df.columns.drop(list(df.filter(regex='x1')))]
...but this results in me losing columns with the x4 prefix, which I want to keep.
Similarly, if I wanted to keep only those columns with the x4 prefix I can do:
df = df[df.columns.drop(list(df.filter(regex='x4')))]
...but this results in me losing columns with the x1 prefix, which I want to keep.
You can use df.loc with a list comprehension:
df.loc[:, [x for x in df.columns if x.startswith(('x1', 'x4'))]]
It will show all rows, and only the columns whose names begin with 'x1' or 'x4'.
You can choose the desired columns first and then just select those columns.
import pandas as pd

data = [{"x1": "a", "x2": "a", "x4": "a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if x.startswith("x1") or x.startswith("x4")]
df = df[desired_columns]
You can also use a function:
def is_valid(x):
    return x.startswith("x1") or x.startswith("x4")
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if is_valid(x)]
df = df[desired_columns]
You can also use the filter option:
df.filter(regex='^x1|^x4')
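Using the same sample data as above, a quick check of what filter returns:
import pandas as pd
data = [{"x1": "a", "x2": "a", "x4": "a"}]
df = pd.DataFrame(data)
print(df.filter(regex='^x1|^x4'))
#   x1 x4
# 0  a  a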

List of Dataframes, drop Dataframe column (columns have different names) if row contains a special string

What I have is a list of DataFrames.
What is important to note is that the shape of the dataframes differs between 2 and 7 columns, and the columns are named from 0 up to the number of columns (e.g. df1 has 5 columns named 0,1,2,3,4; df2 has 4 columns named 0,1,2,3).
What I would like is to check whether any row in a column contains a certain string, and then delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is the below, and I get an error that column 5 is not in the axis (it is there for some DFs):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
You can write a custom function that checks whether a column contains the pattern, using pd.Series.str.contains with pd.Series.any:
def func(s):
    return s.str.contains('DEC').any()

list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
I would take another approach: concatenate the list into a single data frame and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column containing "DEC":
df.mask(df == "DEC").dropna(axis=1, how="any")
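As a minimal sketch with made-up data (two same-shaped frames, so the concat itself does not introduce NaNs):
import pandas as pd
df1 = pd.DataFrame({0: ["a", "b"], 1: ["DEC", "c"]})
df2 = pd.DataFrame({0: ["d", "e"], 1: ["f", "g"]})
list_dfs1 = [df1, df2]
df = pd.concat(list_dfs1)
print(df.mask(df == "DEC").dropna(axis=1, how="any"))
# Column 1 is dropped because it contained "DEC" in df1; only column 0 remains.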

Find the difference between data frames based on specific columns and output the entire record

I want to compare 2 csv files (A and B) and find the rows which are present in B but not in A, based only on specific columns.
I found a few answers, but they still don't give the result I expect.
Answer 1:
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2:
df = pd.concat([old, new])  # concat dataframes
df = df.reset_index(drop=True)  # reset the index
df_gpby = df.groupby(list(df.columns))  # group by all columns
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # indices of rows that appear only once
final = df.reindex(idx)
This takes as an input specific columns and also outputs specific columns. I want to print the whole record and not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Names of the columns you want to merge on}]
new = pd.merge(A, B, how='right', on=columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{the same column as above}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
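As a concrete sketch of the merge approach above, with made-up frames keyed on 'id' and 'date', plus a 'score' column that exists in both files:
import pandas as pd
A = pd.DataFrame({'id': [1, 2], 'date': ['d1', 'd2'], 'score': [10, 20]})
B = pd.DataFrame({'id': [1, 3], 'date': ['d1', 'd3'], 'score': [10, 30]})
columns = ['id', 'date']
new = pd.merge(A, B, how='right', on=columns)
# 'score_x' comes from A, so it is NaN exactly on the rows of B that had no match in A
col = new['score_x'].dropna()
new = new[~new['score_x'].isin(col)]
# new now holds the row (id=3, date='d3') that is present in B but not in A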

Python Pandas: How to remove all columns from dataframe that contains the values in a list?

include_cols_path = sys.argv[5]
with open(include_cols_path) as f:
    include_cols = f.read().splitlines()
include_cols is a list of strings
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True).toPandas()
df1 is a dataframe of a large file. I would like to only retain the columns with names that contain any of the strings in include_cols.
final_cols = [col for col in df1.columns.values if col in include_cols]
df1 = df1[final_cols]
Doing this in pandas is certainly a dupe. However, it seems that you are converting a spark DataFrame to a pandas DataFrame.
Instead of performing the (expensive) collect operation and then filtering the columns you want, it's better to just filter on the spark side using select():
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True)
pandas_df = df1.select(include_cols).toPandas()
You should also think about whether or not converting to a pandas DataFrame is really what you want to do. Just about anything you can do in pandas can also be done in spark.
EDIT
I misunderstood your question originally. Based on your comments, I think this is what you're looking for:
selected_columns = [c for c in df1.columns if any([x in c for x in include_cols])]
pandas_df = df1.select(selected_columns).toPandas()
Explanation:
Iterate through the columns in df1 and keep only those for which at least one of the strings in include_cols is contained in the column name. The any() function returns True if at least one of the conditions is True.
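A tiny illustration of that membership test (names made up):
include_cols = ['user', 'id']
c = 'user_name'
any([x in c for x in include_cols])  # True, because 'user' appears in 'user_name'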
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
For example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=np.random.random((5, 5)), columns=list('ABCDE'))
include_cols = ['A', 'C', 'Z']
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
>>> A C
0 0.247271 0.761153
1 0.390240 0.050055
2 0.333401 0.823384
3 0.821196 0.929520
4 0.210226 0.406168
The '|'.join(include_cols) part creates an or condition from all elements of the input list, in the above example A|C|Z. The condition is True if one of the elements is contained in a column name, via the .contains() method on the column names.

How to add suffix and prefix to all columns in python/pyspark dataframe

I have a data frame in pyspark with more than 100 columns. What I want to do is, for all the column names, add backticks (`) at the start and at the end of the column name.
For example:
column name is testing user. I want `testing user`
Is there a method to do this in pyspark/python? When we apply the code, it should return a data frame.
Use a list comprehension in python:
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
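A short sketch of that conditional aliasing, assuming a hypothetical list_of_cols_to_change:
from pyspark.sql import functions as F
list_of_cols_to_change = ['testing user']  # hypothetical; list the columns you want to touch
df_new = df.select([
    F.col(c).alias("prefix_" + c + "_suffix" if c in list_of_cols_to_change else c)
    for c in df.columns
])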
To add prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2, ...]); df is the dataframe whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different df.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) columns.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
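Hypothetical usage, giving every column of sdf an 'x_' prefix:
sdf = add_prefix(sdf, 'x_')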
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do something like:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
from pyspark.sql.functions import col
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated and then joined the two copies together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.
