Create a subset of a DataFrame depending on column name - python

I have a pandas DataFrame called timedata with different column names, some of which contain the word Vibration and some the word Eccentricity. Is it possible to create a DataFrame of just the columns containing the word Vibration?
I have tried using
vib = []
for i in timedata:
    if 'Vibration' in i:
        vib.append(i)
to then create a DataFrame based on the indices of these columns. This really does not seem like the most efficient way to do it, and I'm sure there must be something simple to do with a list comprehension.
EDIT
Dataframe of form:
from numpy.random import randn
from pandas import DataFrame

df = DataFrame({'Ch 1:Load': randn(10),
                'Ch 2:Vibration Brg 1T ': randn(10),
                'Ch 3:Eccentricity Brg 1H ': randn(10),
                'Ch 4:Vibration Brg 2T ': randn(10)})
Sorry, I'm having a slow day! Thanks for any help.

Something like this will manually select all columns with the word "Vibration" in them:
df[[col for col in df.columns if "Vibration" in col]]
You can also do the same with the filter method:
df.filter(like="Vibration")
If you want to do a more flexible filter, you can use the regex option. E.g. to look if "Vibration" or "Ecc" is in the column name:
df.filter(regex='Ecc|Vibration')
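For instance, with the example frame from the question, either approach keeps just the two vibration channels (a quick sketch; the random data is just filler):

from numpy.random import randn
from pandas import DataFrame

df = DataFrame({'Ch 1:Load': randn(10),
                'Ch 2:Vibration Brg 1T ': randn(10),
                'Ch 3:Eccentricity Brg 1H ': randn(10),
                'Ch 4:Vibration Brg 2T ': randn(10)})

vib = df.filter(like='Vibration')
print(vib.columns.tolist())
# ['Ch 2:Vibration Brg 1T ', 'Ch 4:Vibration Brg 2T ']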

newDf = Df.loc[:, ['Vibration']]
or
newDf = Df.loc[:, ['Vibration', 'eccentricity']]
to get more columns. Note that .loc selects by exact label, so this works only if the columns are literally named 'Vibration' and 'eccentricity'.
To search for a value in a column:
newDf = Df[Df["ColumnName"] == "vibration"]

Related

Extract string values from a DataFrame column

I have the following DataFrame:

Student  food
      1  R0100000
      2  R0200000
      3  R0300000
      4  R0400000
I need to extract the values of the "food" column of the df DataFrame as strings when I filter the data.
For example, when I filter by the Student=1, I need the return value of "R0100000" as a string value, without any other characters or spaces.
This is the code to create the same DataFrame as mine:
import pandas as pd

data = {'Student': [1, 2, 3, 4], 'food': ['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df = pd.DataFrame(data)
I tried to select the DataFrame column and apply str(), but it does not return the desired result:
df_new = df.loc[df['Student'] == 1]
df_new = df_new.food
df_str = str(df_new)
del df_new
This works for me:
s = df[df.Student == 1]['food'][0]
s = s.strip()
It's pretty simple: first get the column, like col = df["food"], and then use col[index] to get the respective value.
So your answer would be df["food"][0].
Also, you can use iloc and loc for this kind of lookup:
df.iloc[rows, columns], so we can use this property to get the answer as df.iloc[0, 1]
df.loc[rows, column_names], for example df.loc[0, "food"]
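Putting the answers above together, a minimal self-contained sketch; the switch from the hard-coded label 0 to .iloc[0] is my own substitution, so it also works when the matching row does not carry index label 0:

import pandas as pd

data = {'Student': [1, 2, 3, 4],
        'food': ['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df = pd.DataFrame(data)

# .iloc[0] takes the first matching row by position, so this still works
# when the boolean filter matches a row whose index label is not 0
s = df.loc[df['Student'] == 1, 'food'].iloc[0].strip()
print(s)  # R0100000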

Create JSON dynamically with group by column name from a dataframe

I am trying to create datasets from the names of the columns of a dataframe, where I have the columns ['NAME1', 'EMAIL1', 'NAME2', 'EMAIL2', 'NAME3', 'EMAIL3', etc.].
I'm trying to split the dataframe based on the 'EMAIL' columns through a loop, but it's not working properly.
I need it to be JSON, because between each 'EMAILn' column there may be a difference in the number of columns.
This is my input (shown as an image in the original post):
I need this (also an image in the original post):
This is my code:
for i in df_entities.filter(regex=('^(EMAIL)' + str(i))):
    df_groups = df_temp_1.groupby(i)
    df_detail = df_groups.get_group(i)
    display(df_detail)
What do you recommend I do?
Thanks in advance.
Regards
filter returns a copy of your dataframe with only the matching columns, but you're trying to loop over just the column names. Just add .columns:
for i in df_entities.filter(regex=('^(EMAIL)' + str(i))).columns:
    ...                                              # ^^^^^^^^ the .columns is the important part
From your input and desired output, simply call pandas.wide_to_long:
long_df = pd.wide_to_long(
    df_entities.reset_index(),
    stubnames=["NAME", "EMAIL"],
    i="index",
    j="version"
)
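For illustration, a small self-contained sketch; the two-contact row below is invented, since the real input was only shown as an image, and to_json is one way to get the JSON the question asks for:

import pandas as pd

# Invented stand-in for the real frame, which was only shown as an image
df_entities = pd.DataFrame({
    "NAME1": ["Ann"], "EMAIL1": ["ann@x.com"],
    "NAME2": ["Bob"], "EMAIL2": ["bob@x.com"],
})

long_df = pd.wide_to_long(
    df_entities.reset_index(),
    stubnames=["NAME", "EMAIL"],
    i="index",
    j="version"
)
print(long_df)
#                NAME      EMAIL
# index version
# 0     1         Ann  ann@x.com
#       2         Bob  bob@x.com

# One record per contact, if JSON is the end goal
print(long_df.reset_index().to_json(orient="records"))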

Using pd.DataFrame.replace with an apply function as the replace value

I have several dataframes that have some columns mixed in with dates in this ASP.NET format "/Date(1239018869048)/". I've figured out how to parse this into Python's datetime format for a given column. However, I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates it finds that match a regex, using pd.DataFrame.replace.
Something like:
def pretty_dates():
    # Messy logic here

df.replace(to_replace=r'\/Date\((\d+)\)', value=pretty_dates(df), regex=True)
The problem with this is that the df being passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So the concept I'm trying to figure out is whether the replacement value in df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To try to add some clarity: I have many columns in my dataframe, over a hundred, that contain this date format. I would like not to list out every single column that has a date. Is there a way to apply the function to clean my dates across all the columns in my dataset? So I do not want to clean one column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
import pandas as pd

s = pd.Series(['/Date(1239018869048)/',
               '/Date(1239018869048)/'], dtype=str)
s = s.str.replace(r'/Date\(', '', regex=True)
s = s.str.replace(r'\)/', '', regex=True)
print(s)
0 1239018869048
1 1239018869048
dtype: object
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps you:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x + x)  # do some logic instead
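Tying this back to the question, a hedged sketch of a cell-level pretty_dates function (the name is borrowed from the question) applied to every column at once, assuming the embedded numbers are milliseconds since the epoch; the sample frame is invented:

import re
import pandas as pd

ASP_DATE = re.compile(r'/Date\((\d+)\)/')

def pretty_dates(cell):
    """Turn '/Date(1239018869048)/' into a Timestamp; pass everything else through."""
    if isinstance(cell, str):
        m = ASP_DATE.fullmatch(cell)
        if m:
            return pd.to_datetime(int(m.group(1)), unit='ms')
    return cell

df = pd.DataFrame({'a': ['/Date(1239018869048)/', 'x'],
                   'b': ['/Date(1239018869048)/', 'y']})

# DataFrame.map visits every cell, so no column list is needed
# (on pandas < 2.1 the same method is called applymap)
df = df.map(pretty_dates)
print(df)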

How to add suffix and prefix to all columns in python/pyspark dataframe

I have a data frame in pyspark with more than 100 columns. What I want to do is add backticks (`) at the start and end of every column name.
For example:
The column name is testing user; I want `testing user`.
Is there a method to do this in pyspark/python? When we apply the code, it should return a data frame.
Use a list comprehension in Python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom Python logic within the alias() function, like: "prefix_" + c + "_suffix" if c in list_of_cols_to_change else c
To add a prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2...]) of the dataframe whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different df for use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) columns.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
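For instance, a minimal usage sketch; the SparkSession setup and the sample columns are illustrative, not from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 2.0), (3, 4.0)], ['testing user', 'load'])

renamed = add_prefix(sdf, 'prefix_')
print(renamed.columns)  # ['prefix_testing user', 'prefix_load']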
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do it like this -
old = "First Last Age"
new = ["`" + field + "`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.

Selecting Columns by a string - PyTables

I know we can access columns using table.cols.somecolumn, but I need to apply the same operation on 10-15 columns of my table, so I'd like an iterative solution. I have the names of the columns as strings in a list: ['col1', 'col2', 'col3'].
So I'm looking for something along the lines of:
for col in columnlist:
    thiscol = table.cols[col]
    # apply whatever operation
Try this:
columnlist = ['col1', 'col2', 'col3']
for col in columnlist:
    thiscol = getattr(table.cols, col)
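As a fuller sketch of how the loop might be used: the file name, table path, and the doubling operation below are all assumptions; only the getattr() lookup comes from the answer above:

import tables  # PyTables

with tables.open_file('data.h5', mode='a') as h5:
    table = h5.root.mytable                 # hypothetical table node
    for col in ['col1', 'col2', 'col3']:
        thiscol = getattr(table.cols, col)  # equivalent to table.cols.col1, etc.
        values = thiscol[:]                 # read the whole column as a NumPy array
        thiscol[:] = values * 2.0           # write it back, doubling every value
    table.flush()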
