Selecting columns by a string - PyTables - Python

I know we can access columns using table.cols.somecolumn, but I need to apply the same operation on 10-15 columns of my table, so I'd like an iterative solution. I have the names of the columns as strings in a list: ['col1','col2','col3'].
So I'm looking for something along the lines of:
for col in columnlist:
    thiscol = table.cols[col]
    # apply whatever operation

Try this:
columnlist = ['col1', 'col2', 'col3']
for col in columnlist:
    thiscol = getattr(table.cols, col)
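getattr works; alternatively, PyTables itself offers lookup by name. A minimal sketch of both, with hypothetical file and table names:

import tables

with tables.open_file('data.h5', mode='r') as h5:  # hypothetical file
    table = h5.root.mytable  # hypothetical table node
    for col in ['col1', 'col2', 'col3']:
        column = table.cols._f_col(col)  # Column object, like table.cols.col1
        values = table.col(col)          # whole column read as a NumPy array
        print(col, values.mean())        # apply whatever operation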

Related

Remove objects which has been repeated in two columns in dataframe

I have a data frame like this:
and the dataset in the CSV file is here.
This data was extracted from the IMDb dataset.
But I have a problem: I have not been able to remove an actor's name that is repeated within the same row. For example, in row number 4 I want to drop 'Marie Gruber' from both the name and actors columns.
I tried apply and various conditions, but the code never treats the two values as the same.
Like this code:
data[data['name'] != data['actors']]
There are trailing spaces in the actors column, so first remove them with Series.str.strip:
data['actors'] = data['actors'].str.strip()
data[data['name'] != data['actors']]
Or use skipinitialspace=True in read_csv:
data = pd.read_csv(file, skipinitialspace=True)
data[data['name'] != data['actors']]
Use the pandas.DataFrame.drop function:
data.drop(data[data.apply(lambda x: x['name'] in x['actors'], axis=1)].index)
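For illustration, a minimal sketch of the strip-then-compare approach with hypothetical data:

import pandas as pd

# hypothetical data mimicking the question: trailing spaces in 'actors'
data = pd.DataFrame({'name': ['Marie Gruber', 'John Doe'],
                     'actors': ['Marie Gruber ', 'Jane Roe']})

data['actors'] = data['actors'].str.strip()  # normalize whitespace first
print(data[data['name'] != data['actors']])  # only the John Doe row remains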

Summary Row for a pd.DataFrame with multiindex

I have a multiIndex dataframe created with pandas similar to this one:
nest = {'A1': dfx[['aa', 'bb', 'cc']],
        'B1': dfx[['dd']],
        'C1': dfx[['ee', 'ff']]}
reform = {(outerKey, innerKey): values
          for outerKey, innerDict in nest.items()
          for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)
What I am trying to achieve is to add a new row at the end of the dataframe that contains a summary of the total for the three categories represented by the new index (A1, B1, C1).
I have tried with df.loc (what I would normally use in this case) but I get an error, and similarly with iloc.
a1sum = dfzx['A1'].sum().to_list()
a1sum = sum(a1sum)
b1sum = dfzx['B1'].sum().to_list()
b1sum = sum(b1sum)
c1sum = dfzx['C1'].sum().to_list()
c1sum = sum(c1sum)
totalcat = a1sum, b1sum, c1sum
newrow = ['Total', totalcat]
newrow
dfzx.loc[len(dfzx)] = newrow
ValueError: cannot set a row with mismatched columns
#Alternatively
newrow2 = ['Total', a1sum, b1sum, c1sum]
newrow2
dfzx.loc[len(dfzx)] = newrow2
ValueError: cannot set a row with mismatched columns
How can I fix the mistake? Or is there another function that would allow me to proceed?
Note: the DataFrame is destined to be exported to an Excel file (I use ExcelWriter).
The type of result I want to achieve in the end is this one (a gray "SUM" row at the bottom).
I came up with a sort of solution on my own.
I created a separate DataFrame in Pandas that contains the summary.
I used ExcelWriter to place both dataframes on the same Excel worksheet.
Technically it would then be possible to style and format the data in Excel (xlsxwriter or styleframe seem to be popular modules for this). Alternatively, one could do that manually.
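A minimal sketch of that approach, assuming dfzx and the three group totals (a1sum, b1sum, c1sum) from the question; the file and sheet names are hypothetical. The original ValueError occurs because dfzx has six columns under the MultiIndex (aa through ff), while newrow supplies only two or four values; writing the summary as a separate frame sidesteps that mismatch.

import pandas as pd

summary = pd.DataFrame([[a1sum, b1sum, c1sum]],
                       index=['Total'], columns=['A1', 'B1', 'C1'])

with pd.ExcelWriter('output.xlsx') as writer:  # hypothetical file name
    dfzx.to_excel(writer, sheet_name='Sheet1')
    # write the summary a couple of rows below the main frame
    summary.to_excel(writer, sheet_name='Sheet1',
                     startrow=len(dfzx) + 2, header=False)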

Exclude column from being read using pd.ExcelFile().parse()

I would like to exclude certain columns from being read when using pd.ExcelFile('my.xls').parse()
The Excel file I am trying to parse has too many columns to list them all in the usecols argument, since I only need to get rid of a single column that is causing trouble.
Is there a simple way to invert the list passed to usecols (I know ~ doesn't work on a list), or something similar?
We can usually do
head = list(pd.read_excel('your.xls', nrows=1))  # read only the header row
df = pd.read_excel('your.xls', usecols=[col for col in head if col != 'the one drop'])
However, why not read the whole file and then drop the column:
df = pd.read_excel('your.xls').drop('the col drop', axis=1)
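Recent pandas versions also accept a callable for usecols in read_excel, which avoids reading the header twice; a one-liner keeping the question's placeholder column name:

df = pd.read_excel('your.xls', usecols=lambda c: c != 'the one drop')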

Take all unique values from certain columns in pandas dataframe

I have a simple question for style and how to do something correctly.
I want to take all the unique values of certain columns in a pandas dataframe and create a map ['columnName'] -> [valueA,valueB,...]. Here is my code that does that:
listUnVals = {}
for col in df:
    if (col != 'colA') and (col != 'colB'):
        listUnVals[col] = df[col].unique()
I want to exclude some columns like colA and colB. Is there a better way to filter out the columns I don't want than writing if (col != ...) and (col != ...)? I hoped to write a lambda expression that filters these values, but I couldn't get it right.
Any answer would be appreciated.
A couple of ways to remove the unneeded columns:
df.columns[~df.columns.isin(['colA', 'colB'])]
Or,
df.columns.difference(['colA', 'colB'])
And you can skip the loop entirely with
{c: df[c].unique() for c in df.columns[~df.columns.isin(['colA', 'colB'])]}
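For illustration, a minimal run with a hypothetical frame:

import pandas as pd

df = pd.DataFrame({'colA': [1, 2], 'colB': [3, 4],
                   'colC': ['x', 'x'], 'colD': ['y', 'z']})

keep = df.columns[~df.columns.isin(['colA', 'colB'])]
print({c: df[c].unique() for c in keep})
# {'colC': array(['x'], dtype=object), 'colD': array(['y', 'z'], dtype=object)}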
You can create a list of unwanted columns and then check membership with in:
>>> unwanted = ['columnA', 'columnB']
>>> for col in df:
...     if col not in unwanted:
...         listUnVals[col] = df[col].unique()
Or using dict comprehension:
{col : df[col].unique() for col in df if col not in unwanted}

Create a subset of a DataFrame depending on column name

I have a pandas DataFrame called timedata with different column names, some of which contain the word Vibration, some Eccentricity. Is it possible to create a dataframe of just the columns containing the word Vibration?
I have tried using
vib = []
for i in timedata:
    if 'Vibration' in i:
        vib.append(i)  # note: list.append mutates in place and returns None
to then create a DataFrame based on the indices of these columns. This really does not seem like the most efficient way to do it, and I'm sure there must be something simple to do with a list comprehension.
EDIT
Dataframe of form:
from numpy.random import randn
from pandas import DataFrame

df = DataFrame({'Ch 1:Load': randn(10),
                'Ch 2:Vibration Brg 1T ': randn(10),
                'Ch 3:Eccentricity Brg 1H ': randn(10),
                'Ch 4:Vibration Brg 2T ': randn(10)})
Sorry, I'm having a slow day! Thanks for any help.
Something like this manually selects all columns with the word "Vibration" in them:
df[[col for col in df.columns if "Vibration" in col]]
You can also do the same with the filter method:
df.filter(like="Vibration")
If you want to do a more flexible filter, you can use the regex option. E.g. to look if "Vibration" or "Ecc" is in the column name:
df.filter(regex='Ecc|Vibration')
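With the example frame from the question, both approaches return just the two vibration channels:

df.filter(like="Vibration").columns
# Index(['Ch 2:Vibration Brg 1T ', 'Ch 4:Vibration Brg 2T '], dtype='object')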
If columns are literally named Vibration and eccentricity, plain label selection also works:
newDf = Df.loc[:, ['Vibration']]
or
newDf = Df.loc[:, ['Vibration', 'eccentricity']]
to select more columns.
To search for a value in a column:
newDf = Df[Df["ColumnName"] == "vibration"]
