I need to move the 3rd column to the end of the column list. This can be done as follows:
cols = df.columns.tolist()
cols = cols[:2] + cols[3:] + [cols[2]]
df = df[cols]
However, what if I don't know the index of the column I'm interested in? Let's say the only thing I know is that the column called MyMagicCol should be moved to the end of the column list.
How can I do this?
If you want to do it within a list, and you know the name of the column you're interested in, you could just do
cols = [c for c in df.columns if c != 'MyMagicCol'] + ['MyMagicCol']
df = df[cols]
You could also do this with df.loc and use the column names to slice your dataframe if you wanted to work with the columns directly.
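For instance, a minimal sketch of the df.loc variant (the example frame here is made up, not from the question):
import pandas as pd
df = pd.DataFrame({'A': [1], 'B': [2], 'MyMagicCol': [3], 'C': [4]})
others = [c for c in df.columns if c != 'MyMagicCol']
df = df.loc[:, others + ['MyMagicCol']]
print(df.columns.tolist())   # ['A', 'B', 'C', 'MyMagicCol']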
I have a dataframe ('df') containing several columns and would like to only keep those columns with a column header starting with the prefix 'x1' or 'x4'. That is, I want to 'drop' all columns except those with a column header starting with either 'x1' or 'x4'.
How can I do this in one step?
I know that if I wanted to keep only those columns with the x1 prefix I could do:
df = df[df.columns.drop(list(df.filter(regex='x1')))]
..but this results in me losing columns with the x4 prefix, which I want to keep.
Similarly, if I wanted to keep only those columns with the x4 prefix I can do:
df = df[df.columns.drop(list(df.filter(regex='x4')))]
..but this results in me losing columns with the x1 prefix, which I want to keep.
You can use df.loc with list comprehension:
df.loc[:, [x for x in df.columns if x.startswith(('x1', 'x4'))]]
It will show all rows, and only those columns whose names begin with 'x1' or 'x4'.
You can choose the desired columns first and then just select those columns.
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if x.startswith("x1") or x.startswith("x4")]
df = df[desired_columns]
You can also use a function:
def is_valid(x):
    return x.startswith("x1") or x.startswith("x4")
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if is_valid(x)]
df = df[desired_columns]
You can also use the filter option:
df.filter(regex='^x1|^x4')
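For instance, reusing the small example frame from the previous answer (a sketch):
import pandas as pd
df = pd.DataFrame([{"x1": "a", "x2": "a", "x4": "a"}])
df.filter(regex='^x1|^x4')   # keeps only the x1 and x4 columns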
What I have is a list of DataFrames.
What is important to note is that the shape of the DataFrames varies between 2 and 7 columns, and the columns are named 0 through len(columns) - 1 (e.g. df1 has 5 columns named 0,1,2,3,4; df2 has 4 columns named 0,1,2,3).
What I would like to do is check whether any row in a column contains a certain string and, if so, delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is the code below, and I get an error that column 5 is not in axis (it is there for some of the DataFrames):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs:
    for col in df.columns:
        # If you are unsure about column types, cast the column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
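A quick way to sanity-check the loop on a toy list (a sketch; the integer column names mirror the question, and list_dfs is the name used in the loop above):
import pandas as pd
list_dfs = [pd.DataFrame({0: ["DEC-01", "x"], 1: ["a", "b"]})]
# After running the loop above, list_dfs[0] keeps only column 1,
# because column 0 contains the substring "DEC".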
You can write a custom function that checks whether a column contains the pattern, using pd.Series.str.contains together with pd.Series.any:
def func(s):
    return s.str.contains('DEC').any()
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
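A small illustration of how the boolean mask behaves (a sketch assuming string columns; the tiny frame is made up):
import pandas as pd
d = pd.DataFrame({0: ["abc", "DEC-1"], 1: ["x", "y"]})
d.apply(func)             # column 0 -> True, column 1 -> False (func from above)
d.loc[:, ~d.apply(func)]  # keeps only column 1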
I would take another approach: concatenate the list into a single data frame and then eliminate the columns in which the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column with "DEC"
df.mask(df == "DEC").dropna(axis=1, how="any")
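A minimal illustration (a sketch with a made-up frame; note that df == "DEC" is an exact cell match, not a substring check like str.contains):
import pandas as pd
sample = pd.DataFrame({0: ["DEC", "x"], 1: ["a", "b"]})
sample.mask(sample == "DEC").dropna(axis=1, how="any")   # drops column 0, keeps column 1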
I want to compare 2 CSVs (A and B) and find the rows which are present in B but not in A, based only on specific columns.
I found a few answers to that, but they still don't give the result I expect.
Answer 1:
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2:
df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns))  # group by all columns
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # indices of rows that appear only once
final = df.reindex(idx)
This takes specific columns as input and also only outputs those specific columns. I want the whole record, not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
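To make the placeholders concrete, here is a sketch with made-up data (key1, key2 and val are illustrative names, not from the question):
import pandas as pd
A = pd.DataFrame({'key1': [1, 2], 'key2': ['a', 'b'], 'val': [10, 20]})
B = pd.DataFrame({'key1': [2, 3], 'key2': ['b', 'c'], 'val': [20, 30]})
columns = ['key1', 'key2']
new = pd.merge(A, B, how='right', on=columns)
col = new['val_x'].dropna()
new = new[~new['val_x'].isin(col)]   # rows present in B but not in A on key1/key2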
include_cols_path = sys.argv[5]
with open(include_cols_path) as f:
    include_cols = f.read().splitlines()
include_cols is a list of strings
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True).toPandas()
df1 is a dataframe of a large file. I would like to only retain the columns with names that contain any of the strings in include_cols.
final_cols = [col for col in df.columns.values if col in include_cols]
df = df[final_cols]
Doing this in pandas is certainly a dupe. However, it seems that you are converting a spark DataFrame to a pandas DataFrame.
Instead of performing the (expensive) collect operation and then filtering the columns you want, it's better to just filter on the spark side using select():
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True)
pandas_df = df1.select(include_cols).toPandas()
You should also think about whether or not converting to a pandas DataFrame is really what you want to do. Just about anything you can do in pandas can also be done in spark.
EDIT
I misunderstood your question originally. Based on your comments, I think this is what you're looking for:
selected_columns = [c for c in df1.columns if any([x in c for x in include_cols])]
pandas_df = df1.select(selected_columns).toPandas()
Explanation:
Iterate through the columns in df1 and keep only those for which at least one of the strings in include_cols is contained in the column name. The any() function returns True if at least one of the conditions is True.
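A quick way to see what the comprehension keeps, using plain Python (the column names here are invented for illustration):
include_cols = ['lot', 'id']
column_names = ['lot_number', 'customer_id', 'price']   # stand-in for df1.columns
selected_columns = [c for c in column_names if any(x in c for x in include_cols)]
print(selected_columns)   # ['lot_number', 'customer_id']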
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
For example:
df1 = pd.DataFrame(data=np.random.random((5, 5)), columns=list('ABCDE'))
include_cols = ['A', 'C', 'Z']
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
>>> A C
0 0.247271 0.761153
1 0.390240 0.050055
2 0.333401 0.823384
3 0.821196 0.929520
4 0.210226 0.406168
The '|'.join(include_cols) part creates an or condition over all elements of the input list, A|C|Z in the example above. The condition is True if one of the elements is contained in a column name, which is checked with the .contains() method on the column names.
I have a data frame in pyspark with more than 100 columns. What I want to do is add backticks (`) at the start and end of every column name.
For example:
If the column name is testing user, I want `testing user`.
Is there a method to do this in pyspark/python? When applied, the code should return a data frame.
Use a list comprehension in Python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
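For instance, a sketch of that conditional rename (list_of_cols_to_change and the prefix/suffix strings are illustrative, reusing F and df from the snippet above):
list_of_cols_to_change = ["testing user"]
df_new = df.select([
    F.col(c).alias("prefix_" + c + "_suffix" if c in list_of_cols_to_change else c)
    for c in df.columns
])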
To add prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2...]) of the dataframe whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different df for use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return list of new columns(aliased).
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might use:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
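A usage sketch (the prefix string is just an example):
df_prefixed = add_prefix(df, 'prefix_')
print(df_prefixed.columns)   # every column name now starts with 'prefix_'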
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do it like this:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
from pyspark.sql.functions import col
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both had the same column names I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.
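As a side note, an arguably simpler sketch that achieves the same '_prec' renaming in one go is toDF (an alternative to the reduce approach above, not the original code):
df = df.toDF(*[c + '_prec' for c in df.columns])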