I don't understand this line of code
minimum.append(min(j[1]['Data_Value']))
...specifically
j[1]['Data_Value']
I know the full code returns the minimum value and stores it in a list called minimum, but what does the j[1] do there? I've tried using other numbers to figure it out but get an error. Is it selecting the index or something?
Full code below. Thanks!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
df1 = pd.read_csv('./data/C2A2_data/BinnedCsvs_d400/ed157460d30113a689e487b88dcbef1f5d64cbd8bb7825f5f485013d.csv')
minimum = []
maximum = []
month = []
df1 = df1[~(df1['Date'].str.endswith(r'02-29'))]
times1 = pd.DatetimeIndex(df1['Date'])
df = df1[times1.year != 2015]
times = pd.DatetimeIndex(df['Date'])
for j in df.groupby([times.month, times.day]):
    minimum.append(min(j[1]['Data_Value']))
    maximum.append(max(j[1]['Data_Value']))
Explanation
Iterating over a pandas groupby object yields tuples of (key, dataframe), where the key is the value of the groupby key for that group. See below for an example.
Looping over these j's means looping over these tuples:
j[0] refers to the group "key"
j[1] is the dataframe component of that tuple; ['Data_Value'] then selects a column of that dataframe.
Example
df = pd.DataFrame({'a': [1, 1, 2], 'b': [2, 4, 6]})
df_grouped = df.groupby('a')
for j in df_grouped:
    print(f"Groupby key (col a): {j[0]}")
    print("dataframe:")
    print(j[1])
Yields:
Groupby key (col a): 1
dataframe:
a b
0 1 2
1 1 4
Groupby key (col a): 2
dataframe:
a b
2 2 6
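As a side note, you can make the loop more readable by unpacking the tuple directly in the for statement, which avoids the j[0]/j[1] indexing entirely. A minimal sketch applied to the original code:
for (month, day), group in df.groupby([times.month, times.day]):
    minimum.append(min(group['Data_Value']))
    maximum.append(max(group['Data_Value']))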
More readable solution
Another, more convenient way to get the min/max of Data_Value for every month-day combination is this:
data_value_summary = df \
    .groupby([times.month, times.day]) \
    .agg({'Data_Value': [min, max]}) \
    ['Data_Value']  # <- this removes the second header level from the newly created dataframe
minimum = data_value_summary['min']
maximum = data_value_summary['max']
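If you are on pandas 0.25 or newer, the same thing can be written with named aggregation, which avoids the second header level altogether. A minimal sketch under that version assumption:
data_value_summary = df.groupby([times.month, times.day]).agg(
    minimum=('Data_Value', 'min'),  # (source column, aggregation function)
    maximum=('Data_Value', 'max'),
)
minimum = data_value_summary['minimum']
maximum = data_value_summary['maximum']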
Related
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40],
                   'C': [20, 40, 60, 80]})
df['A'] = ''
print(df)
I want to set K of column A's values to an empty value, and these K values should be randomly selected; the other len(df)-K values of column A should not be affected. I wrote this function to generate the random row indexes, but how do I set the df's rows to empty for these indexes?
import random

def random_rows(df, K=2):  # the default argument has to come after the non-default one
    row_count = df.shape[0]  # shape[0] is the number of rows (shape[1] would be columns)
    row_indexes = [i for i in range(row_count)]
    if row_count < K:
        K = row_count
    selected_row_indexes = random.sample(row_indexes, K)
    return selected_row_indexes
You can use sample to get the random rows and loc to modify them:
df.loc[df['A'].sample(n=2).index, 'A'] = '' # or whatever value you want
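A short end-to-end sketch of that answer (K=2 and the random_state seed are illustrative; the seed is only there for reproducibility):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40],
                   'C': [20, 40, 60, 80]})
K = 2
# sample K row labels from column A, then blank out column A for just those rows
df.loc[df['A'].sample(n=K, random_state=0).index, 'A'] = ''
print(df)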
How can I get the row number and the column of a cell that contains a certain value in a dataframe, using Pandas? For example, I have a dataframe of client records, and I need to know the row and column of "Smith" (row 1, column LastName).
Maybe this is a solution or a first step to a solution.
If you filter the dataframe for the value you are looking for, all items that do not match are replaced with NaN. You can then drop all rows and all columns that consist entirely of NaN. This leaves a DataFrame containing only your item together with its indices, so you can ask for the index and the column name.
import numpy as np
import pandas as pd
df = pd.DataFrame({'LastName':['a', 'Smith', 'b'], 'other':[1,2,3]})
value = df[df=='Smith'].dropna(axis=0, how='all').dropna(axis=1, how='all')
print(value.index.values)
print(value.columns.values)
But I think this can be improved.
Here's a one liner that efficiently gets the row and column of a value:
df = pd.DataFrame({"ClientID": [34, 67, 53], "LastName": ["Johnson", "Smith", "Brows"] })
result = next(x[1:] for x in ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False)) for j, v in zip(df.columns, row_tup)) if x[0] == "Smith")
print(result)
Output
(1, "LastName")
Unpacking that one liner
# This is a generator that unpacks the dataframe and gets the value, row number (i) and column name (j) for every value in the dataframe
item_generator = ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False)) for j, v in zip(df.columns, row_tup))
# This iterates through the generator until it finds a match
# It outputs just the row number and column name by leaving off the first item in the tuple
next(x[1:] for x in item_generator if x[0] == "Smith")
Props to this answer for the second half of the solution
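One caveat worth adding (not part of the original answer): next raises StopIteration if no cell matches, so you may want to pass a default:
# returns None instead of raising StopIteration when there is no match
result = next((x[1:] for x in item_generator if x[0] == "Smith"), None)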
I tried to simplify the code and make it more readable. This is my attempt:
df = pd.DataFrame({'points': [25, 12, 15, 14, 19],
                   'assists': [5, 7, 7, 9, 12],
                   'rebounds': [11, 8, 10, 6, 6]})
index = df.index      # allows getting the row index
columns = df.columns  # allows getting the column name
value_to_be_checked = 6
for i in index[df.isin([value_to_be_checked]).any(axis=1)].to_list():
    for j, e in enumerate(df.iloc[i]):
        if e == value_to_be_checked:
            print("(row {}, column {})".format(i, columns[j]))
Just to add another possible solution to the bucket. If you really need to search your whole DataFrame, you may consider using numpy.where, such as:
import numpy as np
value = 'Smith'
rows, cols = np.where(df.values == value)
where_are_you = [(df.index[row], df.columns[col]) for row, col in zip(rows, cols)]
So, if your DataFrame is like
ClientID First Name LastName
0 34 Mr Smith
1 67 Keanu Reeves
2 53 Master Yoda
3 99 Smith Smith
4 100 Harry Potter
The code output will be:
[(0, 'LastName'), (3, 'First Name'), (3, 'LastName')]
Edit: Just to satisfy everybody's curiosity, here is a benchmark of all the answers.
The code is written below. I removed the print statements to be fair, because they would make the code really slow for bigger dataframes.
val = 0

def setup(n=10):
    return pd.DataFrame(np.random.randint(-100, 100, (n, 3)))

def nested_for(df):
    index = df.index      # allows getting the row index
    columns = df.columns  # allows getting the column name
    value_to_be_checked = val
    for i in index[df.isin([value_to_be_checked]).any(axis=1)].to_list():
        for j, e in enumerate(df.iloc[i]):
            if e == value_to_be_checked:
                _ = "(row {}, column {})".format(i, columns[j])

def df_twin_dropna(df):
    value = df[df == val].dropna(axis=0, how='all').dropna(axis=1, how='all')
    return value.index.values, value.columns.values

def numpy_where(df):
    rows, cols = np.where(df.values == val)
    return [(df.index[row], df.columns[col]) for row, col in zip(rows, cols)]

def one_line_generator(df):
    # note: this searches for val (not "Smith"), since the benchmark dataframe is numeric
    return [x[1:] for x in ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False))
                            for j, v in zip(df.columns, row_tup)) if x[0] == val]
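The post does not include the timing harness itself; a minimal sketch using Python's timeit module (the dataframe size and repetition count are illustrative) could look like this:
import timeit

df_bench = setup(n=1000)
for fn in (nested_for, df_twin_dropna, numpy_where, one_line_generator):
    # time 100 runs of each candidate on the same dataframe
    t = timeit.timeit(lambda: fn(df_bench), number=100)
    print(f"{fn.__name__}: {t:.4f} s for 100 runs")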
You can do this by looping through all the columns and finding the matching rows. This will give you a list of all the cells that match your criteria:
Method 1 (without comprehension):
import pandas as pd
# assume this df and that we are looking for 'abc'
df = pd.DataFrame({
'clientid': [34, 67, 53],
'lastname': ['Johnson', 'Smith', 'Brows']
})
Searchval = 'Smith'
l1 = []
# loop through all the columns
for col in df.columns:
    # finding the matching rows
    for i in range(len(df[col][df[col].eq(Searchval)].index)):
        # appending the output to the list
        l1.append((df[col][df[col].eq(Searchval)].index[i], col))
print(l1)
Method 2 (With comprehension):
import pandas as pd
df = pd.DataFrame({
'clientid': [34, 67, 53],
'lastname': ['Johnson', 'Smith', 'Brows']
})
#Value to search
Searchval = 'Smith'
#using list comprehension to find the rows in each column which matches the criteria
#and saving it in a list in case we get multiple matches
l = [(df[col][df[col].eq(Searchval)].index[i], col) for col in df.columns
     for i in range(len(df[col][df[col].eq(Searchval)].index))]
print(l)
Thanks for submitting your request. This is something you can find with a Google search. Please make some attempt to find answers before asking a new question.
You can find simple and excellent dataframe examples that include column and row selection here: https://studymachinelearning.com/python-pandas-dataframe/
You can also see the official documentation here: https://pandas.pydata.org/pandas-docs/stable/
Select a column by column name:
df['col']
select a row by index:
df.loc['b']
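Putting the two together on a small illustrative dataframe (the values here are made up for the example):
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3]}, index=['a', 'b', 'c'])
print(df['col'])           # select a column by column name
print(df.loc['b'])         # select a row by index label
print(df.loc['b', 'col'])  # select a single cell: row 'b', column 'col'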
I have a dataframe column as:
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9]})
My required output list (keeping the min and max values of each range) is:
[[1,4],[5,7],[8,9]]
Here's how far I got:
import pandas as pd
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9]})
# Convert df column to a unique value list and sort ascending
us = df['a'].unique().tolist()
us.sort()
lst1 = [int(v) for v in us]
# Create groups of 3 values
lst2 = [lst1[i:i + 3] for i in range(0, len(lst1), 3)]
# Keep only min and max of these groups
How do I convert this:
[[1,3,4],[5,6,7],[8,9]]
to my desired output?
You can use a list comprehension for this:
lst3 = [[min(i), max(i)] for i in lst2]
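For the example above, lst3 is [[1, 4], [5, 7], [8, 9]], which is the desired output.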
You can also do it entirely with pandas:
df = df.sort_values("a").drop_duplicates().reset_index(drop=True)
df.groupby(df.index // 3).agg(['min', 'max']).values.tolist()
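For the example dataframe this also yields [[1, 4], [5, 7], [8, 9]]: the sorted unique values are grouped in threes by integer-dividing the index, then each group is aggregated with min and max.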
Let's say I have this pandas dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': np.random.randint(-10, 10, size=100),
                   'y': np.random.randint(-10, 10, size=100)})
And I have an arbitrary query that selects rows, e.g.
query = (df['x'] > 3) & (df['y'] < 0)
How do I get groups of the rows that match this query AND the next successive k rows (if there are fewer than k remaining, return however many are available)?
For example, for k = 2, a cumbersome and manual way to do it is:
# 1st value
sel0 = df[query].reset_index()
# 2nd value
sel1 = df[query.shift(1).fillna(False)].reset_index()
# 3rd value
sel2 = df[query.shift(2).fillna(False)].reset_index()
concat_df = pd.concat([sel0, sel1, sel2])
grouped_df = concat_df.groupby(concat_df.index)
groups = [grouped_df.get_group(i) for i in grouped_df.groups]
Is there a one-liner that can generalize this to any k and execute it fast?
I think you can do this using cumsum, groupby and head.
Try this: for k=2 use head(3), the current record plus the next two:
df.groupby(query.cumsum()).head(3)
and to generalize, try this:
k=2
df.groupby(query.cumsum()).head(k+1)
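One caveat worth flagging (my addition, not part of the original answer): rows that come before the first match all share cumsum value 0, so they form a group of their own and their first k+1 rows would be returned too. A minimal sketch that excludes them:
# keep only rows at or after the first match before grouping
mask = query.cumsum()
result = df[mask > 0].groupby(mask[mask > 0]).head(k + 1)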
My question relates to using the values in a dataframe column as keys to look up their respective values and run a conditional.
I have a dataframe, df, containing a column "count" that has integers from 1 to 8, and a column "category" that has values either "A", "B", or "C".
I have a dictionary, dct, containing pairs A:2, B:4, C:6
This is my (incorrect) code:
result = df[df["count"] >= dct.get(df["category"])]
So I want to return a dataframe where the "count" value for a given row is equal to or more than the value retrieved from the dictionary using the "category" letter in the same row.
So if there were count values of (1, 2, 6, 6) and category values of (A, B, C, A), the third and fourth rows would be returned in the resultant dataframe.
How do I modify the above code to achieve this?
A good way to go is to map your dictionary into the existing dataframe and then apply a query on the new dataframe:
import pandas as pd
df = pd.DataFrame(data={'count': [4, 5, 6], 'category': ['A', 'B', 'C']})
dct = {'A':5, 'B':4, 'C':-1}
df['min_count'] = df['category'].map(dct)
df = df.query('count>min_count')
Following your logic:
import pandas as pd
dct = {'A':2, 'B':4, 'C':6}
df = pd.DataFrame({'count': [1, 2, 5, 6],
                   'category': ['A', 'B', 'C', 'A']})
print('original dataframe')
print(df)
def process_row(x):
    # the comparison already evaluates to a boolean, so return it directly
    return x['count'] >= dct[x['category']]

f = df.apply(process_row, axis=1)
df = df[f]
print('final output')
print(df)
output:
original dataframe
count category
0 1 A
1 2 B
2 5 C
3 6 A
final output
count category
3 6 A
A small modification to your code:
result = df[df['count'] >= df['category'].apply(lambda x: dct[x])]
You cannot directly use dct.get(df['category']) because df['category'] returns a Series, which is mutable and unhashable, and so cannot be used as a dictionary key (dictionary keys must be hashable).
So, apply and lambda to the rescue! :)
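As a vectorized alternative (my addition, building on the map call shown in the earlier answer), you can avoid apply entirely:
# map each category letter to its threshold, then compare the two columns element-wise
result = df[df['count'] >= df['category'].map(dct)]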