How to get the row and column number in a dataframe in Pandas? - python

How can I get the row and column number of the cell in a Pandas dataframe that contains a certain value? For example, I have the following dataframe:

   ClientID LastName
0        34  Johnson
1        67    Smith
2        53    Brows

I need to know the row and column of "Smith" (row 1, column LastName).

Maybe this is a solution or a first step to a solution.
If you filter for the value you are looking for, all items that are not that value are replaced with NaN. Now you can drop all rows and columns in which every value is NaN. This leaves a DataFrame containing only your item and its indices. Then you can ask for the index and the column name.
import numpy as np
import pandas as pd

df = pd.DataFrame({'LastName': ['a', 'Smith', 'b'], 'other': [1, 2, 3]})

# keep only the matching cells, then drop the all-NaN rows and columns
value = df[df == 'Smith'].dropna(axis=0, how='all').dropna(axis=1, how='all')
print(value.index.values)    # row index of the match
print(value.columns.values)  # column name of the match
But I think this can be improved.
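One possible refinement (a sketch, and stack's NaN handling has shifted a little across pandas versions, so treat it as illustrative): stacking the filtered frame drops the NaN cells and yields the (row, column) pairs directly.

# each entry of the stacked index is a (row label, column name) pair for a match
matches = df[df.eq('Smith')].stack().index.tolist()
print(matches)  # expected: [(1, 'LastName')]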

Here's a one-liner that efficiently gets the row and column of a value:
df = pd.DataFrame({"ClientID": [34, 67, 53], "LastName": ["Johnson", "Smith", "Brows"] })
result = next(x[1:] for x in ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False)) for j, v in zip(df.columns, row_tup)) if x[0] == "Smith")
print(result)
Output
(1, "LastName")
Unpacking that one-liner
# This is a generator that unpacks the dataframe and gets the value, row number (i) and column name (j) for every value in the dataframe
item_generator = ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False)) for j, v in zip(df.columns, row_tup))
# This iterates through the generator until it finds a match
# It outputs just the row and column number by leaving off the first item in the tuple
next(x[1:] for x in item_generator if x[0] == "Smith")
Props to this answer for the second half of the solution.

I tried to simplify the code and make it more readable. This is my attempt:
df = pd.DataFrame({'points': [25, 12, 15, 14, 19],
                   'assists': [5, 7, 7, 9, 12],
                   'rebounds': [11, 8, 10, 6, 6]})

index = df.index      # allows getting the row index
columns = df.columns  # allows getting the column name
value_to_be_checked = 6

for i in index[df.isin([value_to_be_checked]).any(axis=1)].to_list():
    for j, e in enumerate(df.iloc[i]):
        if e == value_to_be_checked:
            print("(row {}, column {})".format(i, columns[j]))

Just to add another possible solution to the bucket. If you really need to search your whole DataFrame, you may consider using numpy.where, such as:
import numpy as np
value = 'Smith'
rows, cols = np.where(df.values == value)
where_are_you = [(df.index[row], df.columns[col]) for row, col in zip(rows, cols)]
So, if your DataFrame is like
   ClientID First Name LastName
0        34         Mr    Smith
1        67      Keanu   Reeves
2        53     Master     Yoda
3        99      Smith    Smith
4       100      Harry   Potter
The code output will be:
[(0, 'LastName'), (3, 'First Name'), (3, 'LastName')]
Edit: Just to satisfy everybody's curiosity, here is a benchmark of all the answers.
The code is written below. I removed the print statements to be fair, because they would make the code really slow for bigger dataframes.
val = 0

def setup(n=10):
    return pd.DataFrame(np.random.randint(-100, 100, (n, 3)))

def nested_for(df):
    index = df.index      # allows getting the row index
    columns = df.columns  # allows getting the column name
    value_to_be_checked = val
    for i in index[df.isin([value_to_be_checked]).any(axis=1)].to_list():
        for j, e in enumerate(df.iloc[i]):
            if e == value_to_be_checked:
                _ = "(row {}, column {})".format(i, columns[j])

def df_twin_dropna(df):
    value = df[df == val].dropna(axis=0, how='all').dropna(axis=1, how='all')
    return value.index.values, value.columns.values

def numpy_where(df):
    rows, cols = np.where(df.values == val)
    return [(df.index[row], df.columns[col]) for row, col in zip(rows, cols)]

def one_line_generator(df):
    return [x[1:] for x in ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False))
                            for j, v in zip(df.columns, row_tup)) if x[0] == val]
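The timing harness itself is not reproduced above; here is a minimal sketch of how the functions could be compared (the frame sizes and repeat count are assumptions, not the original benchmark settings):

import timeit

for n in (10, 100, 1000, 10000):
    df = setup(n)
    for func in (nested_for, df_twin_dropna, numpy_where, one_line_generator):
        t = timeit.timeit(lambda: func(df), number=10)
        print(f"n={n:>6}  {func.__name__:>18}: {t:.4f}s")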

You can do this by looping through all the columns and finding the matching rows. This will give you a list of all the cells that match your criteria:
Method 1 (without comprehension):
import pandas as pd

# assume this df and that we are looking for 'Smith'
df = pd.DataFrame({
    'clientid': [34, 67, 53],
    'lastname': ['Johnson', 'Smith', 'Brows']
})
Searchval = 'Smith'
l1 = []
# loop through all the columns
for col in df.columns:
    # find the matching rows
    for i in range(len(df[col][df[col].eq(Searchval)].index)):
        # append the output to the list
        l1.append((df[col][df[col].eq(Searchval)].index[i], col))
print(l1)
Method 2 (With comprehension):
import pandas as pd

df = pd.DataFrame({
    'clientid': [34, 67, 53],
    'lastname': ['Johnson', 'Smith', 'Brows']
})
# Value to search
Searchval = 'Smith'
# use a list comprehension to find the rows in each column which match the criteria,
# saving them in a list in case we get multiple matches
l = [(df[col][df[col].eq(Searchval)].index[i], col) for col in df.columns
     for i in range(len(df[col][df[col].eq(Searchval)].index))]
print(l)

Thanks for submitting your request. This is something you can find with a Google search. Please make some attempt to find answers before asking a new question.
You can find simple and excellent dataframe examples that include column and row selection here: https://studymachinelearning.com/python-pandas-dataframe/
You can also see the official documentation here: https://pandas.pydata.org/pandas-docs/stable/
Select a column by column name:
df['col']
Select a row by index label:
df.loc['b']
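For completeness, a tiny runnable illustration of both selections (the frame and the 'b' label below are made up for the example):

import pandas as pd

df = pd.DataFrame({'col': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df['col'])    # select the column named 'col'
print(df.loc['b'])  # select the row whose index label is 'b'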

Related

How can I use a list of data to match data in a dataframe?

I have a list of coordinates, and I need to match them against a dataframe that contains a unique id and index for each coordinate. I want to match the coordinates and print the id and index of each coordinate in the list.
e.g.
List_coords = [[1,2],[3,4],[5,6]]
df =
Index  ID  Coords
    1  23  [1,2]
    2  34  [3,4]
    3  45  [4,5]
    4  56  [5,6]
I expect to get something like 1-23, 2-34, 4-56 and save them to another list. How can I do this?
Is this what you are looking for?
match = df['Coords'].isin(List_coords)
(df.loc[match, 'Index'].astype(str) + '-' + df.loc[match, 'ID'].astype(str)).tolist()
The output is
['1-23', '2-34', '4-56']
IIUC, you want to get a list from the Index and ID columns by concatenating them with '-', but only for those rows whose 'Coords' is in List_coords?
Then:
m = df['Coords'].isin(List_coords)
out = df.Index.astype(str).add('-').add(df.ID.astype(str))
out = out[m].tolist()
print(out):
['1-23', '2-34', '4-56']
I think you need,
List_coords = [[1,2],[3,4],[5,6]]
df_matched = df[df['Coords'].isin(List_coords)]
output = df_matched[["Index", "ID"]].astype(str).apply(lambda row: row.str.cat(sep="-"), axis=1).values.tolist()
print(output)
>> ['1-23', '2-34', '4-56']
You could use Pandas 'merge'. This solution merges two DataFrames together: one with the ids + coordinates and another built from the list of coordinates being looked up.
import pandas as pd
# Create the parent DF
parent_df = pd.DataFrame([
    [23, [1, 2]],
    [45, [4, 5]],
    [56, [5, 6]],
    [34, [3, 4]]
], columns=['id', 'coordinates'])

# Cast to string to perform the merge
parent_df['coordinates'] = parent_df['coordinates'].astype(str)

# Take a list of input coords, set as a DF
input_coords = [[1, 2], [3, 4], [5, 6], [99, 99]]
list_of_list_of_input_coords = [[coord] for coord in input_coords]
input_coords_df = pd.DataFrame(list_of_list_of_input_coords, columns=['coordinates'])
input_coords_df['coordinates'] = input_coords_df['coordinates'].astype(str)

# Merge the DFs together
merged_df = input_coords_df.merge(parent_df, how='left', on=['coordinates'])

# Create a final list of the ID and coordinates
final_list = []
for index, row in merged_df.iterrows():
    final_list.append([row['id'], row['coordinates']])
This would give a final result in a list:
[[23.0, '[1, 2]'], [34.0, '[3, 4]'], [56.0, '[5, 6]'], [nan, '[99, 99]']]

How to split one row into multiple rows in python

I have a pandas dataframe that has one long row as a result of a flattened json list.
I want to go from the example:
{'0_id': 1, '0_name': 'a', '0_address': 'USA', '1_id': 2, '1_name': 'b', '1_address': 'UK', '1_hobby': 'ski'}
to a table like the following:

id  name  address  hobby
 1  a     USA
 2  b     UK       ski
Any help is greatly appreciated :)
There you go:
import json

json_data = '{"0_id": 1, "0_name": "a", "0_address": "USA", "1_id": 2, "1_name": "b", "1_address": "UK", "1_hobby": "ski"}'
arr = json.loads(json_data)

result = {}
for k in arr:
    kk = k.split("_")
    if int(kk[0]) not in result:
        result[int(kk[0])] = {"id": "", "name": "", "hobby": ""}
    result[int(kk[0])][kk[1]] = arr[k]

for key in result:
    print("%s %s %s" % (key, result[key]["name"], result[key]["address"]))
If you want the fields to be more dynamic, you have two choices: either go through the whole array first and gather all possible field names, then build the empty template from them, or just check whether a key exists in result when you return the results :)
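A sketch of the first option, reusing the arr dict built above (the names fields and rec are made up for the illustration):

# collect every field name that appears after the "<index>_" prefix
fields = sorted({k.split("_", 1)[1] for k in arr})

result = {}
for k, v in arr.items():
    idx, field = k.split("_", 1)
    # start each record with every known field set to an empty string
    result.setdefault(int(idx), {f: "" for f in fields})[field] = v

for key, rec in result.items():
    print(key, rec)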
This way only works if every column follows this pattern, but should otherwise be pretty robust.
import pandas as pd

data = {'0_id': '1', '0_name': 'a', '0_address': 'USA', '1_id': '2', '1_name': 'b', '1_address': 'UK', '1_hobby': 'ski'}
df = pd.DataFrame(data, index=[0])

indexes = set(x.split('_')[0] for x in df.columns)
to_concat = []
for i in indexes:
    # pick the columns belonging to this record and strip the "<index>_" prefix
    target_columns = [col for col in df.columns if col.startswith(i)]
    df_slice = df[target_columns]
    df_slice.columns = [x.split('_')[1] for x in df_slice.columns]
    to_concat.append(df_slice)
new_df = pd.concat(to_concat)
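One small optional tweak: because each slice keeps the original row label 0, the concatenated frame ends up with duplicate index labels. If that matters, ignore_index renumbers the rows (a sketch):

new_df = pd.concat(to_concat, ignore_index=True)
print(new_df)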

How to set K random column values to empty in a DataFrame?

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40],
                   'C': [20, 40, 60, 80]})
df['A'] = ''
print(df)
I want to set K of column A's values to an empty value, and these K values should be randomly selected; the other len(df)-K values of column A should not be affected. I wrote this function to generate the random row indexes, but how do I set the df's values in column A to empty for those indexes?
import random

def random_rows(df, K=2):
    row_count = df.shape[0]
    row_indexes = [i for i in range(row_count)]
    if row_count < K:
        K = row_count
    selected_row_indexes = random.sample(row_indexes, K)
    return selected_row_indexes
You can use sample to get the random rows and loc to modify them:
df.loc[df['A'].sample(n=2).index, 'A'] = '' # or whatever value you want
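Putting that together with the frame from the question (K = 2 here is just an assumption for the illustration):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40],
                   'C': [20, 40, 60, 80]})

K = 2
# sample K row labels at random and blank out column A for only those rows
df.loc[df['A'].sample(n=K).index, 'A'] = ''
print(df)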

Pandas: selecting columns in a DataFrame question - e.g. row[1]['Column']

I don't understand this line of code
minimum.append(min(j[1]['Data_Value']))
...specifically
j[1]['Data_Value']
I know the full code returns the minimum value and stores it in a list called minimum, but what does the j[1] do there? I've tried using other numbers to figure it out but get an error. Is it selecting the index or something?
Full code below. Thanks!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
df1 = pd.read_csv('./data/C2A2_data/BinnedCsvs_d400/ed157460d30113a689e487b88dcbef1f5d64cbd8bb7825f5f485013d.csv')
minimum = []
maximum = []
month = []
df1 = df1[~(df1['Date'].str.endswith(r'02-29'))]
times1 = pd.DatetimeIndex(df1['Date'])
df = df1[times1.year != 2015]
times = pd.DatetimeIndex(df['Date'])
for j in df.groupby([times.month, times.day]):
    minimum.append(min(j[1]['Data_Value']))
    maximum.append(max(j[1]['Data_Value']))
Explanation
Iterating over a pandas groupby yields tuples of (key, dataframe), where key is the value of the groupby key for that group. See below for an example.
Looping over these j's means looping over these tuples.
j[0] refers to the group "key"
j[1] means taking the dataframe component of that tuple. ['Data_Value'] takes a column of that dataframe.
Example
df = pd.DataFrame({'a': [1, 1, 2], 'b': [2, 4, 6]})
df_grouped = df.groupby('a')
for j in df_grouped:
    print(f"Groupby key (col a): {j[0]}")
    print("dataframe:")
    print(j[1])
Yields:
Groupby key (col a): 1
dataframe:
   a  b
0  1  2
1  1  4
Groupby key (col a): 2
dataframe:
   a  b
2  2  6
More readable solution
Another, more convenient way to get the min/max of Data_Value for every month-day combination is this:
data_value_summary = df \
    .groupby([times.month, times.day]) \
    .agg({'Data_Value': [min, max]}) \
    ['Data_Value']  # < this removes the second header level from the newly created dataframe

minimum = data_value_summary['min']
maximum = data_value_summary['max']

Pandas use cell value as dict key to return dict value

My question relates to using the values in a dataframe column as dictionary keys in order to look up their respective values and apply a condition.
I have a dataframe, df, containing a column "count" that has integers from 1 to 8 and a column "category" whose values are either "A", "B", or "C".
I have a dictionary, dct, containing the pairs A:2, B:4, C:6.
This is my (incorrect) code:
result = df[df["count"] >= dct.get(df["category"])]
So I want to return a dataframe where the "count" value for a given row is equal to or greater than the value retrieved from the dictionary using the "category" letter in the same row.
So if there were count values of (1, 2, 6, 6) and category values of (A, B, C, A), the third and fourth rows would be returned in the resultant dataframe.
How do I modify the above code to achieve this?
A good way to go is to map your dictionary into the existing dataframe and then apply a query on the new dataframe:
import pandas as pd
df = pd.DataFrame(data={'count': [4, 5, 6], 'category': ['A', 'B', 'C']})
dct = {'A': 5, 'B': 4, 'C': -1}
df['min_count'] = df['category'].map(dct)
df = df.query('count >= min_count')
Following your logic:
import pandas as pd

dct = {'A': 2, 'B': 4, 'C': 6}
df = pd.DataFrame({'count': [1, 2, 5, 6],
                   'category': ['A', 'B', 'C', 'A']})
print('original dataframe')
print(df)

def process_row(x):
    return True if x['count'] >= dct[x['category']] else False

f = df.apply(lambda row: process_row(row), axis=1)
df = df[f]
print('final output')
print(df)
output:
original dataframe
   count category
0      1        A
1      2        B
2      5        C
3      6        A
final output
   count category
3      6        A
A small modification to your code:
result = df[df['count'] >= df['category'].apply(lambda x: dct[x])]
You cannot directly use dct.get(df['category']) because df['category'] is a whole Series, and a Series is not hashable, so it cannot be used as a dictionary key (dictionary keys must be hashable objects).
So, apply and lambda to the rescue! :)
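A vectorized variant of the same idea (a sketch; .map looks up each category in dct without a Python-level lambda):

result = df[df['count'] >= df['category'].map(dct)]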
