I would like to locate a specific row (given all of its column values) within a pandas DataFrame.
My attempts so far:
import pandas as pd

df = pd.DataFrame(
    columns=["A", "B", "C"],
    data=[
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [10, 11, 12],
    ])
# row to find (last one)
row = {"A" : 10, "B" : 11, "C" : 12}
# chain
idx = df[(df["A"] == 10) & (df["B"] == 11) & (df["B"] == 11)].index[0]
print(idx)
# iterative
mask = pd.Series([True] * len(df))
for k, v in row.items():
    mask &= (df[k] == v)
idx = df[mask].index[0]
print(idx)
# pandas series
for idx in df.index:
    print(idx, (df.iloc[idx, :] == pd.Series(row)).all())
Is there a simpler way to do that? Something like idx = df.find(row)?
This functionality is often needed, for example, to locate one specific sample in a time series. I cannot believe that there is no straightforward way to do it.
Do you simply want?
df[df.eq(row).all(axis=1)]  # add .index if the indices are needed
output:
A B C
3 10 11 12
Or, if you have more columns and want to ignore them for the comparison:
df[df[list(row)].eq(row).all(axis=1)]
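For example, a quick sketch of that subset comparison, using the frame from the question plus a hypothetical extra column D that should be ignored:

import pandas as pd

df = pd.DataFrame(
    columns=["A", "B", "C", "D"],
    data=[
        [1, 2, 3, 0],
        [4, 5, 6, 0],
        [7, 8, 9, 0],
        [10, 11, 12, 0],
    ])
row = {"A": 10, "B": 11, "C": 12}

# df[list(row)] restricts the comparison to the columns present in `row`
idx = df[df[list(row)].eq(row).all(axis=1)].index[0]
print(idx)  # 3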
I am a Python newbie and have a question.
As a simple example, I have three variables:
a = 3
b = 10
c = 1
I'd like to create a data frame with three columns ('a', 'b', and 'c'), where each column holds the original value +/- a certain constant, restricted to values >0 and <=10.
If the constant is 1 then:
the possible values of 'a' will be 2, 3, 4
the possible values of 'b' will be 9, 10
the possible values of 'c' will be 1, 2
The final data frame will consist of all possible combination of a, b and c.
Do you know any Python code to do so?
Here is some code to start from.
import pandas as pd
data = [[3 , 10, 1]]
df1 = pd.DataFrame(data, columns=['a', 'b', 'c'])
You may use itertools.product for this.
Create 3 separate lists with the accepted values for each variable. This can be done by calling a method that returns the list of possible values:
def list_of_values(n):
    if 1 < n < 10:
        return [n - 1, n, n + 1]
    elif n == 1:
        return [1, 2]
    elif n == 10:
        return [9, 10]
    return []
So you will have the following:
a = [2, 3, 4]
b = [9, 10]
c = [1, 2]
Next, do the following:
from itertools import product

l = product(a, b, c)
data = list(l)
pd.DataFrame(data, columns=['a', 'b', 'c'])
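As a quick sanity check of the product step, assuming the helper returned the lists a, b, and c shown above:

from itertools import product

a, b, c = [2, 3, 4], [9, 10], [1, 2]
data = list(product(a, b, c))
print(len(data))   # 12 combinations (3 * 2 * 2)
print(data[:3])    # [(2, 9, 1), (2, 9, 2), (2, 10, 1)]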
I have two DataFrames, df1 and df2. The information in df1 has to be used to populate cells in df2 if a specific condition is met. This is an example:
df1 = pd.DataFrame({"A":[1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],"B":[1, 2, 3, 1, 2, 2, 3, 1, 2, 3, 4],"C":[5, 3, 2,10,11,12, 4, 5, 7, 2, 7], "D":[0.5, 0.3, 0.5, 0.7, 0.5, 0.6, 0.1, 0.6, 0.6, 0.5, 0.6]})
df2 = pd.DataFrame({"A":[5, 5, 6, 6, 6], "B":[1, 2, 1, 2, 3], "C":np.nan, "D":np.nan})
The np.nan entries in df2 are meant to represent the cells that need to be populated. These are empty at the start of the process.
To populate df2, I need to use the values in the column df2['B']. Specifically, in this example, if the value of df2['B'] is equal to 1, then I need to get a random sample, with replacement, from df1[df1['B']==1], for both df1['C'] and df1['D']. Importantly, these values are not independent. Therefore, I need to draw a random row from the subset of rows of df1 where df1['B'] is equal to one. And then I need to do this for all rows in df2.
Doing df1[df1['B'] == 1][['C', 'D']].sample(replace=True) draws one random sample for the case where df1['B'] equals one, but:
How do I assign the corresponding values to df2?
How do I do this for every row in df2?
I have tried several alternatives with loops, such as
for index, value in df2.iterrows():
    if df2.loc[index, 'B'] == 1:
        temp_df = df1[df1['B'] == 1][['C', 'D']].sample(n=1, replace=True)
    if df2.loc[index, 'B'] == 2:
        temp_df = df1[df1['B'] == 2][['C', 'D']].sample(n=1, replace=True)
    if df2.loc[index, 'B'] == 3:
        temp_df = df1[df1['B'] == 3][['C', 'D']].sample(n=1, replace=True)
    if df2.loc[index, 'B'] == 4:
        temp_df = df1[df1['B'] == 4][['C', 'D']].sample(n=1, replace=True)
    df2.loc[index, 'C'] = temp_df['C']
    df2.loc[index, 'D'] = temp_df['D']
but I get an error message saying
---> 15 df2.loc[index, 'C'] = temp_df['C']
16 df2.loc[index, 'D'] = temp_df['D']
...
ValueError: Incompatible indexer with Series
where the ... denotes lines from the error message that I skipped.
Here's one approach:
(i) get the sample sizes from df2 with groupby + size.
(ii) use groupby + apply where we use a lambda function to sample items from df1 with the sample sizes obtained from (i) for each unique "B".
(iii) assign the sampled values back to df2 positionally (we sort df2 by "B" first so its rows line up with the groupby output, then restore the original order).
cols = ['C', 'D']
sample_sizes = df2.groupby('B').size()
df2 = df2.sort_values(by='B')
df2[cols] = (df1[df1['B'].isin(sample_sizes.index)]
             .groupby('B')[cols]
             .apply(lambda g: g.sample(sample_sizes[g.name], replace=True))
             .to_numpy())
df2 = df2.sort_index()
One possible sample:
   A  B   C    D
0  5  1   5  0.6
1  5  2  12  0.6
2  6  1  10  0.7
3  6  2  11  0.5
4  6  3   4  0.1
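Alternatively, a minimal per-row sketch with Series.apply; it is slower but needs no sorting, and it assumes every value in df2['B'] also occurs in df1['B']:

def draw(b):
    # draw one (C, D) pair, with replacement, from the df1 rows where B == b
    return df1.loc[df1['B'] == b, ['C', 'D']].sample(n=1, replace=True).iloc[0]

df2[['C', 'D']] = df2['B'].apply(draw)

Since draw returns a Series, apply expands the result into C and D columns, and the assignment aligns on df2's index.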
So I have a dataset that contains the history of a specific tag from a start date to an end date. I am trying to compare rows based on a date column: if they share the same month, day, and year, I add the value from the next column to a temporary list; once I have all the items for a given date, I take the min and max of that list, subtract them, append the result to another list, and empty the temp list to start over.
For the sake of time and simplicity, I am presenting an example as a 2D list. Here's my example data:
dataset = [[1,5],[1,6],[1,10],[1,23],[2,4],[2,8],[2,12],[3,10],[3,20],[3,40],[4,50],[4,500]]
Here, the first column acts as the date and the second as the value.
The issue I am having: I can't seem to compare every row based on its first column so that the value in the second column gets included in the temp list for the min/max operations. Based on the above 2D list I would expect to get [18, 8, 30, 450], but the result is [5, 4, 10].
dataset = [[1,5],[1,6],[1,10],[1,23],[2,4],[2,8],[2,12],[3,10],[3,20],[3,40],[4,50],[4,500]]
temp_list = []
daily_total = []
for i in range(len(dataset) - 1):
    if dataset[i][0] == dataset[i + 1][0]:
        temp_list.append(dataset[i][1])
    else:
        max_ = max(temp_list)
        min_ = min(temp_list)
        total = max_ - min_
        daily_total.append(total)
        temp_list = []
print(daily_total)
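Your loop only appends dataset[i][1] when the next row has the same date, so the last value of each group never makes it into temp_list (the first group becomes [5, 6, 10] instead of [5, 6, 10, 23]), and the final group is never flushed into daily_total at all, which is why the 450 is missing. Grouping the values first avoids both problems.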
Try:
tmp = {}
for d, v in dataset:
    tmp.setdefault(d, []).append(v)  # group the values by date
out = [max(v) - min(v) for v in tmp.values()]
print(out)
Prints:
[18, 8, 30, 450]
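(Dicts preserve insertion order in Python 3.7+, so the differences come out in the order the dates first appear.)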
Here is a solution using pandas:
import pandas as pd
dataset = [
    [1, 5],
    [1, 6],
    [1, 10],
    [1, 23],
    [2, 4],
    [2, 8],
    [2, 12],
    [3, 10],
    [3, 20],
    [3, 40],
    [4, 50],
    [4, 500],
]
df = pd.DataFrame(dataset)
df.columns = ["date", "value"]
df = df.groupby("date").agg(min_value=("value", "min"), max_value=("value", "max"))
df["res"] = df["max_value"] - df["min_value"]
df["res"].to_list()
Output:
[18, 8, 30, 450]
I can use np.select to insert a new column and set its values for a single DataFrame.
But when I combine both DataFrames, np.select no longer works. It seems to be an index/length error.
import pandas as pd
import numpy as np
df = pd.DataFrame([[3, 2, 1], [4, 5, 6]], columns=['col1', 'col2', 'col3'], index=['a', 'b'])
df2 = pd.DataFrame([[14, 15, 16], [17, 16, 15]], columns=['col1', 'col2', 'col3'], index=['c', 'e'])
count = pd.concat([df, df2])  # df.append(df2) was removed in pandas 2.0
print(count)
conditions = [
    (df["col1"] >= df["col2"]) & (df["col2"] >= df["col3"]),
]
choices = [100]
count["col4"] = np.select(conditions, choices, default='WHAT')
count
This succeeds when the conditions are built from a single DataFrame, but after combining the two it fails with the error:
ValueError: Length of values does not match length of index
I think there is a typo in your code when it comes to count vs df: the conditions are built from df (2 rows), so np.select returns only 2 values, which cannot fill count's 4 rows. The following code works fine.
import pandas as pd
import numpy as np
df = pd.DataFrame([[3, 2, 1], [4, 5, 6]], columns=['col1', 'col2', 'col3'], index=['a', 'b'])
df2 = pd.DataFrame([[14, 15, 16], [17, 16, 15]], columns=['col1', 'col2', 'col3'], index=['c', 'e'])
count = pd.concat([df, df2])
print(count)
conditions = [
    (count["col1"] >= count["col2"]) & (count["col2"] >= count["col3"]),
]
print(conditions)
choices = [100]
count["col4"] = np.select(conditions,choices, default='WHAT')
count
I want to find duplicates in a selection of columns of a DataFrame:
from collections import defaultdict
import numpy as np

# convert the sub-df into a matrix
mat = df[['idx', 'a', 'b']].values

str_dict = defaultdict(set)
for x in np.ndindex(mat.shape[0]):
    concat = ''.join(str(v) for v in mat[x][1:])
    # take idx as the values of each key a + b
    str_dict[concat].update([mat[x][0]])

dups = {}
for key in str_dict.keys():
    dup = str_dict[key]
    if len(dup) < 2:
        continue
    dups[key] = dup
The code finds duplicates of the concatenation of a and b: it uses the concatenation as the key of a set defaultdict (str_dict) and updates each key with the idx values; finally, it uses a dict (dups) to store any concatenation whose value set has length >= 2.
I am wondering if there is a better way to do that in terms of efficiency.
You can just concatenate and convert to set:
res = set(df['a'].astype(str) + df['b'].astype(str))
Example:
df = pd.DataFrame({'idx': [1, 2, 3],
                   'a': [4, 4, 5],
                   'b': [5, 5, 6]})
res = set(df['a'].astype(str) + df['b'].astype(str))
print(res)
# {'56', '45'}
If you need to map indices too:
df = pd.DataFrame({'idx': [1, 2, 3],
                   'a': [41, 4, 5],
                   'b': [3, 13, 6]})
df['conc'] = (df['a'].astype(str) + df['b'].astype(str))
df = df.reset_index()
res = df.groupby('conc')['index'].apply(set).to_dict()
print(res)
# {'413': {0, 1}, '56': {2}}
You can select the columns you need before drop_duplicates:
df[['a', 'b']].drop_duplicates().astype(str).apply(np.sum, 1).tolist()
# ['45', '56']
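For completeness, a sketch using pandas' built-in duplicated; keep=False flags every occurrence of a repeated ['a', 'b'] combination, not just the later ones:

import pandas as pd

df = pd.DataFrame({'idx': [1, 2, 3],
                   'a': [4, 4, 5],
                   'b': [5, 5, 6]})

# keep only rows whose (a, b) combination occurs more than once,
# then collect the idx values for each duplicated combination
dup_mask = df.duplicated(subset=['a', 'b'], keep=False)
dups = df[dup_mask].groupby(['a', 'b'])['idx'].apply(set).to_dict()
print(dups)
# {(4, 5): {1, 2}}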