How to select all columns that start with "durations" or "shape"? - python

How can I select all columns whose header names start with "durations" or "shape", instead of defining a long list of column names? I need to select these columns and replace blank fields with 0. At the moment I list them explicitly:
column_names = ['durations.blockMinutes_x',
                'durations.scheduledBlockMinutes_y']
data[column_names] = data[column_names].fillna(0)

You could use the str.startswith method on the DataFrame's columns:
df = data[data.columns[data.columns.str.startswith('durations') | data.columns.str.startswith('shape')]]
df = df.fillna(0)
Or you could use the contains method:
df = data.iloc[:, data.columns.str.contains('^durations|^shape')]
df = df.fillna(0)

I would use the select method:
df.select(lambda c: c.startswith('durations') or c.startswith('shape'), axis=1)
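Note that DataFrame.select was deprecated in pandas 0.21 and later removed, so on a current pandas a minimal equivalent (a sketch using .loc with a callable) would be:
df.loc[:, lambda d: d.columns.str.startswith(('durations', 'shape'))]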

Use my_dataframe.columns.values.tolist() to get the column names (based on Get list from pandas DataFrame column headers):
column_names = [x for x in data.columns.values.tolist() if x.startswith("durations") or x.startswith("shape")]
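Then fill the blanks exactly as in the question:
data[column_names] = data[column_names].fillna(0)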

A simple and easy way:
data[data.filter(regex='durations|shape').columns].fillna(0)
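Since filter already returns the matching columns, the intermediate .columns lookup can be dropped; anchoring the regex with ^ restricts the match to the start of each name:
data.filter(regex='^durations|^shape').fillna(0)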

Related

Is there a function for making a new dataframe using pandas to select only part of a word?

I am looking to select all values that include "hennessy" in the name, i.e. "Hennessy Black Cognac", "Hennessy XO". I know it would simply be
trial = Sales[Sales["Description"]if=="Hennessy"]
if I wanted only the value "Hennessy", but I want it if it contains the word "Hennessy" at all.
I am working in Python with pandas imported.
Thanks :)
The in keyword tests membership in a sequence, which is not what you want here; for elementwise substring matching on a Series, use str.contains (with case=False for a case-insensitive match):
trial = Sales[Sales["Description"].str.contains("hennessy", case=False, na=False)]
You can try using str.startswith:
import pandas as pd
# initialize list of lists
data = [['Hennessy Black Cognac', 10], ['Hennessy XO', 15], ['julian merger', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
new_df = df.loc[df.Name.str.startswith('Hennessy', na=False)]
new_df
Or you can use apply to run any string-matching function elementwise on the column:
df_new = df[df['Name'].apply(lambda x: x.startswith('Hennessy'))]
df_new
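For the "contains anywhere" matching the question actually asks for, the same sample frame works with str.contains (case=False makes it case-insensitive, na=False treats missing names as non-matches):
new_df = df.loc[df.Name.str.contains('Hennessy', case=False, na=False)]
new_df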

How to query a Pandas Dataframe based on column values

I have a dataframe:
ID  Name
1   A
2   B
3   C
I defined a list:
mylist =[A,C]
If I want to extract only the rows where Name is equal to A or C (namely, the entries of mylist), I try the following code:
df_new = df[(df['Name'].isin(mylist))]
>>> df_new
As a result, I get an empty table. Any suggestions as to why this happens?
Just remove the redundant parentheses around df['Name'].isin(...):
df_new = df[df['Name'].isin(mylist)]
Found the solution: it was a problem with the list itself that caused the empty table.
The format of the list should be:
mylist =['A','C']
instead of
mylist =[A,C]
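For illustration (assuming A and C were not meant to be variables), the difference is:
mylist = [A, C]       # looks up variables named A and C (NameError if undefined)
mylist = ['A', 'C']   # string literals that match the values in the Name column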
You could use .loc with a lambda, as it's more readable:
import pandas as pd
dataf = pd.DataFrame({'ID':[1,2,3],'Name':['A','B','C']})
names = ['A','C']
# keep rows where column Name is in names
df = dataf.loc[lambda d: d['Name'].isin(names)]
print(df)

Python/Pandas: drop columns *not* containing either of two strings in one step?

I have a dataframe ('df') containing several columns and would like to only keep those columns with a column header starting with the prefix 'x1' or 'x4'. That is, I want to 'drop' all columns except those with a column header starting with either 'x1' or 'x4'.
How can I do this in one step?
I know that if I wanted to keep only those columns with the x1 prefix I could do:
df = df[list(df.filter(regex='x1'))]
..but this results in me losing the columns with the x4 prefix, which I want to keep.
Similarly, if I wanted to keep only those columns with the x4 prefix I could do:
df = df[list(df.filter(regex='x4'))]
..but this results in me losing the columns with the x1 prefix, which I want to keep.
You can use df.loc with list comprehension:
df.loc[:, [x for x in df.columns if x.startswith(('x1', 'x4'))]]
It returns all rows, and only the columns whose names begin with 'x1' or 'x4'.
You can choose the desired columns first and then just select those columns.
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if x.startswith("x1") or x.startswith("x4")]
df = df[desired_columns]
You can also use a function:
def is_valid(x):
    return x.startswith("x1") or x.startswith("x4")
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if is_valid(x)]
df = df[desired_columns]
You can also use the filter option:
df.filter(regex='^x1|^x4')
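Applied to the sample data above, this keeps just the x1 and x4 columns:
import pandas as pd
df = pd.DataFrame([{"x1": "a", "x2": "a", "x4": "a"}])
df.filter(regex='^x1|^x4')  # keeps columns x1 and x4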

How to change multiple columns' types in pyspark?

I am just studying pyspark. I want to change the column types like this:
df1 = df.select(df.Date.cast('double'), df.Time.cast('double'),
                df.NetValue.cast('double'), df.Units.cast('double'))
You can see that df is a DataFrame, and I select 4 columns and cast them all to double. Because of the select, all other columns are dropped. But what if df has hundreds of columns and I just need to change those 4? I want to keep all of the columns. How can I do that?
Try this:
from pyspark.sql.functions import col

cols_to_cast = ['Date', 'Time', 'NetValue', 'Units']
df = df.select([col(c).cast('double') if c in cols_to_cast else col(c) for c in df.columns])
Or cast the columns in place with a loop:
for c in cols_to_cast:
    df = df.withColumn(c, df[c].cast('double'))
Another way using selectExpr():
df1 = df.selectExpr("cast(Date as double) Date",
                    "cast(NetValue as string) NetValue")
df1.printSchema()
Using withColumn():
from pyspark.sql.types import DoubleType, StringType
df1 = df.withColumn("Date", df["Date"].cast(DoubleType())) \
        .withColumn("NetValue", df["NetValue"].cast(StringType()))
df1.printSchema()
See the pyspark.sql.types documentation.
I understand that you would like to have a non-for-loop answer that preserves the original set of columns whilst only updating a subset. The following should be the answer you were looking for:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("double").alias(c) for c in subset),
               *(x for x in df.columns if x not in subset))
where subset is a list of the column names you would like to update.
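For the four columns in the question that would be, for example:
subset = ['Date', 'Time', 'NetValue', 'Units']
Note that this puts the cast columns first, so the column order changes.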

How to add suffix and prefix to all columns in python/pyspark dataframe

I have a data frame in pyspark with more than 100 columns. For all the column names, I would like to add backticks (`) at the start and end of each column name.
For example:
the column name testing user should become `testing user`
Is there a method to do this in pyspark/python? When we apply the code it should return a data frame.
Use list comprehension in python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
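Spelled out, with list_of_cols_to_change standing in for whatever subset of columns you want to rename:
df_new = df.select([F.col(c).alias("prefix_" + c + "_suffix") if c in list_of_cols_to_change else F.col(c) for c in df.columns])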
To add prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2...]) of the dataframe whose columns we want to prefix/suffix.
df.columns
Iterate through the list above and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different df.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) columns.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
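For example, to prefix every column:
df = add_prefix(df, 'my_prefix_')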
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df = df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do it like this:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice then joined together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.
