How do I NOT use iterrows to solve my problem?

I've been reading about how it is best practice to avoid using iterrows to iterate through a pandas DataFrame, but I am not sure how else I can solve my particular problem:
How can I:
Find the "time" of the first instance of the value "c" in one DataFrame, df1, grouped by "num" and sorted by "time"
Then add that "time" into a separate DataFrame, df2, based on "num".
Here is an example of my input DataFrame:
import pandas as pd
df = pd.DataFrame({'num': [2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7,
                           8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9],
                   'state': ['a', 'b', 'c', 'b', 'a', 'b', 'c', 'b', 'c', 'b', 'c',
                             'a', 'b', 'c', 'b', 'c', 'b', 'c', 'a', 'b', 'c', 'b',
                             'c', 'b', 'c', 'b', 'c', 'b', 'c', 'b'],
                   'time': [234, 239, 244, 249, 100, 105, 110, 115, 120, 125, 130,
                            3, 8, 13, 18, 23, 28, 33, 551, 556, 561, 566, 571, 576,
                            581, 45, 50, 55, 60, 65]})
Expected output (df2):
num time
2 244
5 110
7 13
8 561
9 50
Every solution I attempt seems like it would require iterrows to load the "time" into df2.

You can do it in one line, using df.groupby() with min() as the aggregation function:
df[df.state == 'c'].drop('state', axis=1).groupby('num').aggregate(min)
time
num
2 244
5 110
7 13
8 561
9 50
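Note that num ends up as the index in that output. If you need it back as a regular column to match the expected df2, a minimal sketch of the same idea with as_index=False:
df2 = df[df.state == 'c'].groupby('num', as_index=False)['time'].min()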

It's hard to check without re-creating the df, but I think this should do it:
def first_c(group):
    # The rows are already sorted by time, so the first 'c' row is at position 0
    filtered = group[group['state'] == 'c'].iloc[0]
    return filtered[['num', 'time']]

df2 = df.groupby('num').apply(first_c)
1. Group by num.
2. Apply the function: filter for 'c' and take the first row by integer position with iloc.
3. Return num and time.
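If you want to avoid apply as well, a minimal sketch using only vectorized operations (it relies on the fact that the first 'c' per "num", once sorted by "time", is simply the earliest one): filter to the 'c' rows, sort, and keep the first row per group with drop_duplicates:
df2 = (df[df['state'] == 'c']
       .sort_values('time')
       .drop_duplicates('num')[['num', 'time']])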


Pandas latest update: filtering grouped-by objects breaks px.bar

I have a dataframe that I want to filter using pd.CategoricalDtype() and display as a bar chart with px.bar.
Before the latest pandas update this worked perfectly, but now the chart crashes with the error below:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 373, in bar
    return make_figure(
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2003, in make_figure
    groups, orders = get_groups_and_orders(args, grouper)
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1978, in get_groups_and_orders
    groups = {
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1979, in <dictcomp>
    sf: grouped.get_group(s if len(s) > 1 else s[0])
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 811, in get_group
    raise KeyError(name)
KeyError: 'C'
Code:
# Code outside px.bar
old_df2 = pd.DataFrame({'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
                        'id1': [18, 22, 19, 14, 14, 11, 20, 28],
                        'id2': [5, 7, 7, 9, 12, 9, 9, 4],
                        'id3': [11, 8, 10, 6, 6, 7, 9, 12]})
new_df = old_df2.groupby([pd.CategoricalDtype(old_df2.name), 'id2'])['id3'].count().fillna(0)
# Transform the count from a Series into a DataFrame
new_df = new_df.to_frame()
# Move the group keys from the index into columns
new_df.reset_index(inplace=True)
new_df = new_df[new_df["level_0"].isin(["A", "B"])]
new_df.rename(columns={'level_0': 'name'}, inplace=True)
# Not working, the error occurs here
fig_bar = px.bar(new_df.loc[::-1], x="id2", y="id3", color="name", barmode="group")
# Working version with identical data
new_df_list = new_df.to_dict("records")
unlinked_df = pd.DataFrame(new_df_list)
How can I fix the error?
I think you can convert the column to Categorical if you need the default behavior (categories are inferred from the data and are unordered):
new_df = old_df2.groupby([pd.Categorical(old_df2.name),'id2'])['id3'].count().fillna(0)
If you need CategoricalDtype, pass the unique values of old_df2.name as the categories:
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=old_df2.name.unique())
new_df = old_df2.groupby([old_df2.name.astype(cat_type),'id2'])['id3'].count().fillna(0)
Also change loc to iloc:
fig_bar = px.bar(new_df.iloc[::-1], x="id2", y="id3", color = "name", barmode="group")
EDIT: I did some research, and the problem is that when you filter a categorical column, missing categories are not removed. You can try cat.remove_unused_categories after isin:
old_df2 = pd.DataFrame({'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
                        'id1': [18, 22, 19, 14, 14, 11, 20, 28],
                        'id2': [5, 7, 7, 9, 12, 9, 9, 4],
                        'id3': [11, 8, 10, 6, 6, 7, 9, 12]})
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=old_df2.name.unique())
new_df = old_df2.groupby([old_df2.name.astype(cat_type),'id2'])['id3'].count().fillna(0)
# move the group keys from the index into columns
new_df = new_df.reset_index()
new_df = new_df[new_df["name"].isin(["A","B"])]
print (new_df['name'])
# 0 A
# 1 A
# 2 A
# 3 A
# 4 A
# 5 B
# 6 B
# 7 B
# 8 B
# 9 B
# Name: name, dtype: category
# Categories (3, object): ['A', 'B', 'C']
new_df['name'] = new_df['name'].cat.remove_unused_categories()
print (new_df['name'])
# 0 A
# 1 A
# 2 A
# 3 A
# 4 A
# 5 B
# 6 B
# 7 B
# 8 B
# 9 B
# Name: name, dtype: category
# Categories (2, object): ['A', 'B']
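Putting it together, a minimal sketch of the fixed pipeline feeding px.bar (same sample data as above):
import pandas as pd
import plotly.express as px
from pandas.api.types import CategoricalDtype

old_df2 = pd.DataFrame({'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
                        'id1': [18, 22, 19, 14, 14, 11, 20, 28],
                        'id2': [5, 7, 7, 9, 12, 9, 9, 4],
                        'id3': [11, 8, 10, 6, 6, 7, 9, 12]})
cat_type = CategoricalDtype(categories=old_df2.name.unique())
new_df = old_df2.groupby([old_df2.name.astype(cat_type), 'id2'])['id3'].count().fillna(0)
new_df = new_df.reset_index()
new_df = new_df[new_df['name'].isin(['A', 'B'])]
# Drop the unused 'C' category so plotly's internal get_group lookup no longer raises KeyError
new_df['name'] = new_df['name'].cat.remove_unused_categories()
fig_bar = px.bar(new_df.iloc[::-1], x="id2", y="id3", color="name", barmode="group")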

How to access a specific value in a list within a dictionary? [duplicate]

upatients = {
    'Junipero': [37, 114, 'M', 'I', 6, 2],
    'Sunita': [22, 27, 'F', 'B', 9, 5],
    'Issur': [38, 48, 'D', 'W', 4, 1],
    'Luitgard': [20, 105, 'M', 'L', 1, 4],
    'Rudy': [20, 27, 'D', 'O', 9, 5],
    'Ioudith': [19, 93, 'D', 'I', 4, 3]
}
for key in upatients:
    patients = key
    print(patients)
I am trying to access the values at position [5] in each list and sort the dictionary based on those values, but I am having trouble printing them.
Use the key argument of sorted() to sort by an element of each value.
def sort_dict_keys(d, index):
    return sorted(d, key=lambda p: d[p][index])
print(sort_dict_keys(upatients, 5))
To print the item at index 5 in each list you can use:
for key in upatients:
    patients = upatients[key]
    print(patients[5])
Here is one way of getting a sorted list of patients:
tmp = [(upatients[key][5], key) for key in upatients]
tmp.sort()
new_list = [x[1] for x in tmp]
print(new_list)
Output:
['Issur', 'Junipero', 'Ioudith', 'Luitgard', 'Rudy', 'Sunita']
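If you want the full (name, data) pairs rather than just the names, the same idea works over dict.items():
for name, data in sorted(upatients.items(), key=lambda kv: kv[1][5]):
    print(name, data[5])
Note that sorted() is stable, so ties on the value (Rudy and Sunita both have 5) keep the dictionary's insertion order, while the tuple approach above breaks ties alphabetically by name.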

How to show the initial array of values after concatenating and sorting

I have 3 arrays:
a = np.array([1, 4, 5, 11, 46])
b = np.array([3, 2, 12, 14, 42])
c = np.array([6, 23, 24, 45, 47])
I have merged these arrays and sorted them in ascending order:
new = np.sort(np.concatenate([a, b, c]))
which results in:
[ 1  2  3  4  5  6 11 12 14 23 24 42 45 46 47]
Now I am looking for a way to show which initial array (a, b, or c) each value came from, for example:
['a', 'b', 'b', 'a', 'a', 'c', 'a', 'b', 'b', 'c', 'c', 'b', 'c', 'a', 'c']
I am not sure if I am on the right track, or whether I should use dictionaries for this purpose.
What do you do if the same number appears in more than one of a, b, c?
If the same number appears in a, b, and c, 'a' should be added to the list; if it appears in b and c, 'b' should be added.
You can probably do it in fewer lines, but for readability:
import numpy as np

a = np.array([1, 4, 5, 11, 46])
b = np.array([3, 2, 12, 14, 42])
c = np.array([6, 23, 24, 45, 47])
new = np.concatenate([a, b, c])
# Label each position with the name of the array it came from
indices = np.concatenate([np.array(['a'] * a.size),
                          np.array(['b'] * b.size),
                          np.array(['c'] * c.size)])
# Reorder the labels by the sort order of the concatenated values
result = indices[new.argsort()]
print(result)
This gives the following output:
['a' 'b' 'b' 'a' 'a' 'c' 'a' 'b' 'b' 'c' 'c' 'b' 'c' 'a' 'c']
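Regarding the tie question from the comments: np.argsort is not guaranteed to be stable by default, so for duplicate values the label picked may not come from the earlier array. Passing kind='stable' makes ties keep concatenation order, so 'a' wins over 'b' and 'b' over 'c', as requested:
result = indices[new.argsort(kind='stable')]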

How can I group a Pandas DataFrame by Local Minima?

Originally, I had a Pandas DataFrame that consists of two columns A (for x-axis values) and B (for y-axis values) that are plotted to form a simple x-y coordinate graph. The data consisted of a few peaks, where the peaks all occurred at the same y-axis value with the same increments. Thus, I was able to do the following:
df = pd.read_csv(r'/Users/_______/Desktop/Data Packets/Cycle Data.csv')
nrows = int(df['B'].max() * 2) - 1
alphabet: list = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                  'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
groups = df.groupby(df.index // nrows)
for frameno, frame in groups:
    frame.to_csv("/Users/_______/Desktop/Cycle Test/" + alphabet[frameno] + "%s.csv" % frameno, index=False)
The above code parses the large cycle-data file into many data files of the same size, since the local minima and maxima of each cycle are the same.
However, I want to be able to parse a data file that has arbitrary peaks and minima. I can't split the large data file into equal-sized chunks, because each cycle is a different size. Here is an example illustration:
Edit: sample data (A is x-axis, B is y-axis):
data = {'A': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
              19, 20, 21, 22, 23, 24, 25, 26],
        'B': [0, 1, 2, 3, 4, 5, 6, 7, 5, 3, 1, -1, 1, 3, 5, 7, 9, 8, 7, 6, 5, 4,
              6, 8, 6, 4, 2]}
df = pd.DataFrame(data)
Edit 2: different sample data (Displacement goes from 1 to 50 back to 1, then 1 to 60 back to 1, etc. etc.):
Load Displacement
0 0.100000 1.0
1 0.101000 2.0
2 0.102000 3.0
3 0.103000 4.0
4 0.104000 5.0
.. ... ...
391 0.000006 5.0
392 0.000005 4.0
393 0.000004 3.0
394 0.000003 2.0
395 0.000002 1.0
col = df['B'] # replace with the appropriate column name
# find local minima. FIXED: use rightmost min value if repeating
minima = (col <= col.shift()) & (col < col.shift(-1))
# create groups
groups = minima.cumsum()
# group
df.groupby(groups).whatever() # replace with whatever the appropriate aggregation is
Example, count values:
df.groupby(groups).count()
Out[10]:
A B
B
0 11 11
1 10 10
2 6 6
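To then split the file into one CSV per cycle, as in the original fixed-size loop, you can iterate the same grouping (a minimal sketch; the output path is a placeholder):
for frameno, frame in df.groupby(groups):
    frame.to_csv("cycle_%s.csv" % frameno, index=False)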
We can try scipy's argrelextrema:
import numpy as np
from scipy.signal import argrelextrema

# find the positions of the strict local minima (replace 'B' with the appropriate column)
idx = argrelextrema(df['B'].values, np.less)
g = df.groupby(df.index.isin(df.index[idx[0]]).cumsum())
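On the sample data this produces the same three groups as the answer above, which you can check with:
print(g.size())
# 0    11
# 1    10
# 2     6
# dtype: int64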

Filter Model Prediction DataFrame

I have two dataframes:
lang_df = pd.DataFrame(data={'Content ID': [1, 1, 2, 2, 3, 3],
                             'User ID': [10, 11, 10, 11, 10, 11],
                             'Language': ['A', 'A', 'B', 'B', 'C', 'C']})
pred_df = pd.DataFrame(data={'Content ID': [4, 7, 14, 6, 6, 6],
                             'User ID': [10, 11, 10, 11, 10, 11],
                             'Language': ['A', 'D', 'Z', 'B', 'B', 'A']})
I want to filter out the rows in the second dataframe so that users only get content IDs in languages they have previously watched. Result for this example would look like:
result_df = pd.DataFrame(data={'Content ID': [4, 6, 6, 6],
                               'User ID': [10, 11, 10, 11],
                               'Language': ['A', 'B', 'B', 'A']})
I know how to do it using a for loop, but that seems highly inefficient.
You need an inner join between lang_df and pred_df on the User ID and Language columns:
lang_df[['User ID', 'Language']].merge(pred_df, on=['User ID', 'Language'])
Output:
User ID Language Content ID
0 10 A 4
1 11 A 6
2 10 B 6
3 11 B 6
You can use an inner join with merge to accomplish this, then use column filtering to return only the columns from the pred_df dataframe:
pred_df.merge(lang_df, on=['User ID','Language'], suffixes=('','_2'))[pred_df.columns]
Output:
Content ID User ID Language
0 4 10 A
1 6 11 B
2 6 10 B
3 6 11 A
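As an alternative to merging, a minimal sketch using a MultiIndex membership test, which keeps pred_df's rows and row order without a join:
keys = ['User ID', 'Language']
mask = pred_df.set_index(keys).index.isin(lang_df.set_index(keys).index)
result_df = pred_df[mask]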
