Check if a value exists in pandas dataframe index - python

I am sure there is an obvious way to do this, but I can't think of anything slick right now.
Basically, instead of raising an exception, I would like to get True or False telling me whether a value exists in the index of a pandas DataFrame.
import pandas as pd
df = pd.DataFrame({'test':[1,2,3,4]}, index=['a','b','c','d'])
df.loc['g'] # raises a KeyError; I want something that gives False
What I have working now is the following:
sum(df.index == 'g')

This should do the trick
'g' in df.index
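For the example frame from the question, this gives exactly the boolean you're after:
'a' in df.index  # True
'g' in df.index  # False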

A MultiIndex works a little differently from a single index. Here are some methods for a multi-indexed dataframe.
df = pd.DataFrame({'col1': ['a', 'b','c', 'd'], 'col2': ['X','X','Y', 'Y'], 'col3': [1, 2, 3, 4]}, columns=['col1', 'col2', 'col3'])
df = df.set_index(['col1', 'col2'])
When checking a single value, in df.index only matches against the first level.
'a' in df.index # True
'X' in df.index # False
Check df.index.levels for other levels.
'a' in df.index.levels[0] # True
'X' in df.index.levels[1] # True
To check a full combination of levels, test a tuple against df.index.
('a', 'X') in df.index # True
('a', 'Y') in df.index # False

Just for reference, since it was something I was looking for: you can test for presence within the values or the index by appending ".values", e.g.
'g' in df.<your selected field>.values
'g' in df.index.values
I find that appending ".values" to get a plain ndarray out makes existence/"in" checks run more smoothly with other Python tools. Just thought I'd toss that out there for people.
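As a quick sketch of the difference, using the frame from the question (its only column is 'test'):
'g' in df['test'].values  # False -- checks the column's values
3 in df['test'].values    # True
'g' in df.index.values    # False
'a' in df.index.values    # True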

The code below does not print a boolean, but it allows for dataframe subsetting by index... I understand this is likely not the most efficient way to solve the problem, but (1) I like the way it reads and (2) you can easily subset to where the df1 index exists in df2:
df3 = df1[df1.index.isin(df2.index)]
or where df1 index does not exist in df2...
df3 = df1[~df1.index.isin(df2.index)]
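A minimal sketch of both directions (df1 and df2 here are made-up frames, just to illustrate):
import pandas as pd
df1 = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'y': [9, 9]}, index=['b', 'c'])
df1[df1.index.isin(df2.index)]   # keeps rows 'b' and 'c'
df1[~df1.index.isin(df2.index)]  # keeps row 'a'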

With DataFrame df_data:
>>> df_data
  id   name  value
0  a  ampha      1
1  b   beta      2
2  c     ce      3
I tried:
>>> getattr(df_data, 'value').isin([1]).any()
True
>>> getattr(df_data, 'value').isin(['1']).any()
True
but:
>>> 1 in getattr(df_data, 'value')
True
>>> '1' in getattr(df_data, 'value')
False
So fun :D The catch is that the in operator on a Series tests membership against the index labels, not the values, which is why 1 in ... is True (1 is one of the row labels 0, 1, 2) while '1' is not.
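A tiny made-up Series makes the distinction explicit:
s = pd.Series([10, 20], index=['a', 'b'])
'a' in s        # True  -- 'in' checks the index labels
10 in s         # False -- the values are not searched
10 in s.values  # True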

import pandas
df = pandas.DataFrame({'g': [1]}, index=['isStop'])
# df.loc['g']
if 'g' in df.index:       # False here: 'g' is a column, not an index label
    print("find g")
if 'isStop' in df.index:  # True
    print("find a")

I like to use:
if 'value' in df.index.get_level_values(0):
    print(True)
The get_level_values method is handy because it lets you check index values regardless of whether your index is simple or composite.
Use 0 (zero) if you have a single index in your dataframe, or if you want to check the first level of a MultiIndex. Use 1 for the second level, and so on...
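For example, with the two-level frame built in the earlier answer (col1/col2 set as the index):
'a' in df.index.get_level_values(0)  # True
'X' in df.index.get_level_values(1)  # True
'X' in df.index.get_level_values(0)  # False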


Pandas set column names by position

I have the following code:
df1 = pd.read_excel(f, sheet_name=0, header=6)
# Drop Columns by position
df1 = df1.drop([df1.columns[5],df1.columns[8],df1.columns[10],df1.columns[14],df1.columns[15],df1.columns[16],df1.columns[17],df1.columns[18],df1.columns[19],df1.columns[21],df1.columns[22],df1.columns[23],df1.columns[24],df1.columns[25]], axis=1)
# rename cols
This is where I am struggling: each time I attempt to rename the cols by position, df1 comes back as "None", which is a <class 'NoneType'> (when I use print(type(df1))). Note that df1 is the dataframe I expect right after dropping the columns.
I get this with everything I have tried below:
column_indices = [0,1,2,3,4,5,6,7,8,9,10,11]
new_names = ['AWG Item Code','Description','UPC','PK','Size','Regular Case Cost','Unit Scan','AMAP','Case Bill Back','Monday Start Date','Sunday End Date','Net Unit']
old_names = df1.columns[column_indices]
df1 = df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
And with:
df1 = df1.rename({df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
When I remove the inplace=True (essentially setting it to False), it returns the dataframe but without any of the changes I want.
The tricky part is that in this program my column headers will change each time, but the columns the data is in will not. Otherwise I would just use df = df.rename(columns={"a": "newname"})
A simpler version of your code could be:
df1.columns = new_names
It should work as intended, i.e. renaming columns in the index order.
Otherwise, in your own code: if you print df1.columns[column_indices],
you do not get a list but a pandas.core.indexes.base.Index.
Also note that rename(..., inplace=True) returns None, so assigning its result back to df1 is what leaves you with a NoneType. To correct your code, just change the last two lines to:
old_names = df1.columns[column_indices].tolist()
df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
Have a nice day
I was dumb and missing columns=
df1.rename(columns={df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
works fine
I am not sure whether this answers your question:
There is a simple way to rename the columns:
Say I have a data frame df. I can see the column names using the following code:
df.columns.to_list()
which gives me, suppose, the following column names:
['A', 'B', 'C','D']
And I want to keep the first three columns and rename them as 'E', 'F' and 'G' respectively. The following code gives me the desired outcome:
df = df[['A','B','C']]
df.columns = ['E','F','G']
New outcome:
df.columns.to_list()
output: ['E','F','G']

Pandas Add column and fill it with complex Concatenate

I have this Excel formula in A2: =IF(B1=1;CONCAT(D1:Z1);"Null")
All cells are strings or integers, but some are empty. I filled the empty ones with "null".
I've tried to translate it into pandas, and so far I wrote this:
'''
import pandas as pd
df = pd.read_table('C:/*path*/001.txt', sep=';', header=0, dtype=str)
rowcount = 0
for row in df:
rowcount+= 1
n = rowcount
m = len(df)
df['A']=""
for i in range(1,n):
if df[i-1]["B"]==1:
for k in range(2,m):
if df[i][k]!="Null"
df[i]['A']+=df[i][k]
'''
I can't find anything close enough to my problem in existing questions. Can anyone help?

I'm not sure this is exactly what you're expecting. If you need to fill empty cells in the dataframe with the string 'null', you can use this:
df.fillna('null', inplace=True)
If you provide the expected output along with your input file, it may be helpful for contributors.
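A minimal sketch of what that does (tiny made-up frame, just to illustrate):
df = pd.DataFrame({'B': ['1', None], 'C': ['x', None]})
df.fillna('null', inplace=True)
# the two None cells now hold the string 'null'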
Test dataframe:
df = pd.DataFrame({
    "b": [1, 0, 1],
    "c": ["dummy", "dummy", "dummy"],
    "d": ["red", "green", "blue"],
    "e": ["-a", "-b", "-c"]
})
First step: add a new column and fill it with "NULL".
df["concatenated"] = "NULL"
Second step: filter to where column b is 1, and set the new column there to the concatenation of the columns D to Z (in this test frame that is just d and e, collected in sub_columns; adjust the list to your real data):
sub_columns = ["d", "e"]
df.loc[df["b"] == 1, "concatenated"] = df[sub_columns].sum(axis=1)
df
Output:
   b      c      d   e concatenated
0  1  dummy    red  -a        red-a
1  0  dummy  green  -b         NULL
2  1  dummy   blue  -c       blue-c
EDIT: I notice there is an offset in your excel formula. Not sure if this is deliberate, but experiment with df.shift(-1) if so.
There's a lot to unpack here.
Firstly, len(df) gives us the row count. In your code, n and m will be one and the same.
Secondly, please never do chain indexing in pandas unless you absolutely have to. There's a number of reasons not to, one of them being that it's easy to make a mistake; also, assignment can fail. Here, we have a default range index, so we can use df.loc[i-1, 'B'] in place of df[i - 1]["B"].
Thirdly, the dtype is str, so please use =='1' rather than ==1.
If I understand your problem correctly, the following code should help you:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
'B': ['1','2','0','1'],
'C': ['first', 'second', 'third', 'fourth'],
'D': ['-1', '-2', '-3', '-4']
})
In [3]: RELEVANT_COLUMNS = ['C', 'D'] # You can also extract them in any automatic way you please
In [4]: df["A"] = ""
In [5]: df.loc[df['B'] == '1', 'A'] = df.loc[df['B'] == '1', RELEVANT_COLUMNS].sum(axis=1)
In [6]: df
Out[6]:
   B       C   D         A
0  1   first  -1   first-1
1  2  second  -2
2  0   third  -3
3  1  fourth  -4  fourth-4
We note which columns to concat (In [3]); we do not want to make the mistake of adding a column later on and accidentally including it. Here, if we add 'A' it doesn't hurt, because it's full of empty strings, but it's more manageable to save the columns we concat.
We then add the column with empty strings (In [4]) (if we skip this step, we'll get NaNs instead of empty strings for the records where B does not equal 1).
In [5] uses pandas' boolean indexing (through the Series-to-scalar equality operator) to limit our scope to where column B equals '1', and then we pull up the columns to concat and do just that, using an axis-reducing sum operation.

remove first 2 rows in a dataframe based on the value in another column

I have a df with stock tickers in one column, and the next column, 'Fast Add', is either populated with the value 'Add' or empty.
I want to remove the first 2 stock tickers, but only where the 'Fast Add' column = 'Add'. The code below removes the first 2 lines, but I need to add a condition so that it only removes the first 2 lines where the 'Fast Add' column = 'Add'. Can someone help please?
new_df = df_obj[2:]
You can use the drop function in Pandas to remove specific indices from a DataFrame. Here's a code example for your use case:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Ticker': ['A', 'B', 'C', 'D'],
    'Fast Add': ['Add', np.nan, 'Add', 'Add']
})
new_df = df.drop(df[df['Fast Add'] == 'Add'][:2].index)
new_df is a DataFrame with the following contents:
Ticker Fast Add
1 B NaN
3 D Add
The approach here is to select all the rows you want to remove and then pass their indices into DataFrame.drop() to remove them.
References:
https://showmecode.info/pandas/DataFrame/remove-rows/ (personal site)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
IIUC something like this should work:
df_obj["record_idx"] = df_obj.groupby('FastAdd').cumcount()
new_df = df_obj.query("record_idx >= 2 & FastAdd == 'ADD'")
You can also use a cheap hack like below:
df_obj.sort_values("FastAdd", inplace = True)
new_df = df_obj.iloc[2:].copy()

How do I filter out multiple columns with a certain string in Python

I'm new to Python and especially to pandas, so I don't really know what I'm doing. I have 10 columns with 100,000 rows of 4-letter strings. I need to filter out the rows which don't contain 'DDD' in all of the columns.
I tried to do it with iloc and loc, but it doesn't work:
import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()
It returns me an error: 'DataFrame' object has no attribute 'str'
I suggest doing it without a for loop like this:
df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only string columns
df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only some string columns
selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
You can do this, but only if all your columns are of string type:
for column in df.columns:
    df = df[df[column].str.contains('DDD')]
You can use str.contains, but only on a Series, not on a DataFrame. So to use it we look at each column (which is a Series) one by one, looping over them with a for loop:
>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
...                    ['DDDE', 'DDDF', 'DDDG', 'DHDD'],
...                    ['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
...                    ['DMDD', 'DNDN', 'DDOD', 'DDDP']],
...                   columns=['A', 'B', 'C', 'D'])
>>> for column in df.columns:
...     df = df[df[column].str.contains('DDD')]
In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.
This gives you:
>>> print(df)
      A     B     C     D
0  DDDA  DDDB  DDDC  DDDD
2  DDDI  DDDJ  DDDK  DDDL
As you're only looping over 10 columns this shouldn't be too slow.
Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.

KeyError when running df through function

I'm trying to apply the function below to a dataframe and return only the rows that qualify, but I get a KeyError. What am I doing wrong?
import numpy as np
import pandas as pd

N = 100
np.random.seed(0)
df = pd.DataFrame({
    'X': np.random.uniform(-3, 10, N),
    'Y': np.random.uniform(-3, 10, N),
    'Z': np.random.uniform(-3, 10, N),
})
def func_sec(df):
    for i in range(len(df)):
        for k in range(i+1, len(df)+1):
            df_sum = df[i:k].sum()
            m = (df_sum > 2).all() & (df_sum.sum() > 10)
    return df[m]
func_sec(df)
Like others have noted, the KeyError is thrown because of df[m]. Your column names aren't booleans, they are 'X', 'Y', 'Z'. The pandas documentation has a section on boolean indexing, so I suggest you check it out.
Long story short, you can't do df[True], but you can do something like df[df['X'] > 10].
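A quick sketch of the difference, using the df from the question:
mask = df['X'] > 2  # a boolean Series, one entry per row
df[mask]            # fine: keeps the rows where the mask is True
df[True]            # KeyError: True is neither a column label nor a mask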
For a dataframe df you can select by column e.g. 'X' in your case:
df['X']
or slice some rows
df[0:10]
If you try something invalid like df[0] or df[True] you will get a key error.
