Originally, I had a Pandas DataFrame consisting of two columns, A (x-axis values) and B (y-axis values), that are plotted to form a simple x-y coordinate graph. The data consisted of a few peaks that all occurred at the same y-axis value with the same increments, so I was able to do the following:
df = pd.read_csv(r'/Users/_______/Desktop/Data Packets/Cycle Data.csv')
nrows = int(df['B'].max() * 2) - 1
alphabet = list('abcdefghijklmnopqrstuvwxyz')
groups = df.groupby(df.index // nrows)
for frameno, frame in groups:
    frame.to_csv("/Users/_______/Desktop/Cycle Test/" + alphabet[frameno] + "%s.csv" % frameno, index=False)
The above code parses the large cycle data file into many data files of the same size, since the local minima and maxima of each cycle are the same.
However, I want to be able to parse a data file that has arbitrary peaks and minima. I can't split the large data file into equally sized chunks, because each cycle now spans a different number of rows.
Edit: sample data (A is x-axis, B is y-axis):
data = {'A': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], 'B': [0, 1, 2, 3, 4, 5, 6, 7, 5, 3, 1, -1, 1, 3, 5, 7, 9, 8, 7, 6, 5, 4, 6, 8, 6, 4, 2]}
df = pd.DataFrame(data)
Edit 2: different sample data (Displacement goes from 1 to 50 back to 1, then 1 to 60 back to 1, etc. etc.):
Load Displacement
0 0.100000 1.0
1 0.101000 2.0
2 0.102000 3.0
3 0.103000 4.0
4 0.104000 5.0
.. ... ...
391 0.000006 5.0
392 0.000005 4.0
393 0.000004 3.0
394 0.000003 2.0
395 0.000002 1.0
col = df['B'] # replace with the appropriate column name
# find local minima. FIXED: use rightmost min value if repeating
minima = (col <= col.shift()) & (col < col.shift(-1))
# create groups
groups = minima.cumsum()
# group
df.groupby(groups).whatever() # replace with whatever the appropriate aggregation is
For example, count values:
df.groupby(groups).count()
Out[10]:
A B
B
0 11 11
1 10 10
2 6 6
We can try scipy's argrelextrema:
import numpy as np
from scipy.signal import argrelextrema
# indices of strict local minima
idx = argrelextrema(df['B'].values, np.less)  # replace 'B' with the appropriate column name
# start a new group at each local minimum
g = df.groupby(df.index.isin(df.index[idx[0]]).cumsum())
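Either grouping can then feed the same per-file export as in the original code; a minimal sketch, where the cycle_%s.csv file-name pattern is just an illustration:
# write each variable-length cycle to its own CSV file
for frameno, frame in g:
    frame.to_csv("/Users/_______/Desktop/Cycle Test/cycle_%s.csv" % frameno, index=False)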
I have a dataframe that I want to filter using pd.CategoricalDtype() and display as a bar chart using px.bar.
Before the latest pandas update this worked perfectly, but now the chart crashes with the error below:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/plotly/express/_chart_types.py", line 373, in bar
    return make_figure(
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 2003, in make_figure
    groups, orders = get_groups_and_orders(args, grouper)
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1978, in get_groups_and_orders
    groups = {
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/plotly/express/_core.py", line 1979, in <dictcomp>
    sf: grouped.get_group(s if len(s) > 1 else s[0])
  File "/home/marco/python-wsl/project_folder/venv/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 811, in get_group
    raise KeyError(name)
KeyError: 'C'
code:
# Code outside px.bar
old_df2 = pd.DataFrame({'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
'id1': [18, 22, 19, 14, 14, 11, 20, 28],
'id2': [5, 7, 7, 9, 12, 9, 9, 4],
'id3': [11, 8, 10, 6, 6, 7, 9, 12]})
new_df = old_df2.groupby([pd.CategoricalDtype(old_df2.name),'id2'])['id3'].count().fillna(0)
# Transforms count from series to data frame
new_df = new_df.to_frame()
# rowname to index
new_df.reset_index(inplace=True)
new_df = new_df[new_df["level_0"].isin(["A","B"])]
new_df.rename(columns={'level_0': 'name'}, inplace=True)
# Not working: the error occurs on the next line
fig_bar = px.bar(new_df.loc[::-1], x="id2", y="id3", color="name", barmode="group")
# Working version on identical data: round-tripping through a list of dicts
# drops the categorical dtype, so px.bar no longer sees the unused category
new_df_list = new_df.to_dict("records")
unlinked_df = pd.DataFrame(new_df_list)
How do I fix this error?
I think you can convert the column with pd.Categorical if you need the default behavior, where categories are inferred from the data and are unordered:
new_df = old_df2.groupby([pd.Categorical(old_df2.name),'id2'])['id3'].count().fillna(0)
If you need CategoricalDtype, pass the unique values of old_df2.name as the categories:
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=old_df2.name.unique())
new_df = old_df2.groupby([old_df2.name.astype(cat_type),'id2'])['id3'].count().fillna(0)
Also use iloc instead of loc:
fig_bar = px.bar(new_df.iloc[::-1], x="id2", y="id3", color = "name", barmode="group")
EDIT: After some research, the problem is that filtering a categorical column does not remove the now-missing categories. You can try cat.remove_unused_categories after isin:
old_df2 = pd.DataFrame({'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
'id1': [18, 22, 19, 14, 14, 11, 20, 28],
'id2': [5, 7, 7, 9, 12, 9, 9, 4],
'id3': [11, 8, 10, 6, 6, 7, 9, 12]})
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=old_df2.name.unique())
new_df = old_df2.groupby([old_df2.name.astype(cat_type),'id2'])['id3'].count().fillna(0)
# rowname to index
new_df = new_df.reset_index()
new_df = new_df[new_df["name"].isin(["A","B"])]
print (new_df['name'])
# 0 A
# 1 A
# 2 A
# 3 A
# 4 A
# 5 B
# 6 B
# 7 B
# 8 B
# 9 B
# Name: name, dtype: category
# Categories (3, object): ['A', 'B', 'C']
new_df['name'] = new_df['name'].cat.remove_unused_categories()
print (new_df['name'])
# 0 A
# 1 A
# 2 A
# 3 A
# 4 A
# 5 B
# 6 B
# 7 B
# 8 B
# 9 B
# Name: name, dtype: category
# Categories (2, object): ['A', 'B']
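With the unused category removed, the original call (switched to iloc as above) should render again; a minimal sketch of the final step:
# regroup and plot; 'C' is no longer among the categories, so get_group no longer fails
fig_bar = px.bar(new_df.iloc[::-1], x="id2", y="id3", color="name", barmode="group")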
I have two dataframes. One has demographic information about patients, and the other has some feature information. Below is some dummy data representing my dataset:
Demographics:
demographics = {
'PatientID': [10, 11, 12, 13],
'DOB': ['1971-10-23', '1969-06-18', '1973-04-20', '1971-05-31'],
'Sex': ['M', 'M', 'F', 'M'],
'Flag': [0, 1, 0, 0]
}
demographics = pd.DataFrame(demographics)
demographics['DOB'] = pd.to_datetime(demographics['DOB'])
Here is the printed dataframe:
print(demographics)
PatientID DOB Sex Flag
0 10 1971-10-23 M 0
1 11 1969-06-18 M 1
2 12 1973-04-20 F 0
3 13 1971-05-31 M 0
Features:
features = {
'PatientID': [10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
'Feature': ['A', 'B', 'A', 'A', 'C', 'B', 'C', 'A', 'B', 'B', 'A', 'C', 'D', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'B', 'C', 'C', 'C', 'B', 'B', 'C'],
}
features = pd.DataFrame(features)
Here is a count of each feature for each patient:
print(features.groupby(['PatientID', 'Feature']).size())
PatientID Feature
10 A 3
B 2
C 2
11 A 3
B 3
C 3
D 1
12 B 3
C 4
D 3
dtype: int64
I want to integrate each patient's feature counts into the demographics table. Note that patient 13 is absent from the features table. The final dataframe will look as shown below:
result = {
'PatientID': [10, 11, 12, 13],
'DOB': ['1971-10-23', '1969-06-18', '1973-04-20', '1971-05-31'],
'Feature_A': [3, 3, 0, 0],
'Feature_B': [2, 3, 3, 0],
'Feature_C': [2, 3, 4, 0],
'Feature_D': [0, 1, 3, 0],
'Sex': ['M', 'M', 'F', 'M'],
'Flag': [0, 1, 0, 0],
}
result = pd.DataFrame(result)
result['DOB'] = pd.to_datetime(result['DOB'])
print(result)
PatientID DOB Feature_A Feature_B Feature_C Feature_D Sex Flag
0 10 1971-10-23 3 2 2 0 M 0
1 11 1969-06-18 3 3 3 1 M 1
2 12 1973-04-20 0 3 4 3 F 0
3 13 1971-05-31 0 0 0 0 M 0
How can I get this result from these two dataframes?
Cross-tabulate features and merge with demographics.
# cross-tabulate feature df
# and reindex it by PatientID to carry PatientIDs without features
feature_counts = (
pd.crosstab(features['PatientID'], features['Feature'])
.add_prefix('Feature_')
.reindex(demographics['PatientID'], fill_value=0)
)
# merge the two
demographics.merge(feature_counts, on='PatientID')
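Note that merge matches on the index level name PatientID here; if that feels too implicit, or an older pandas version rejects it, here is a sketch of the same merge with an explicit column:
# turn the PatientID index back into a column before merging
demographics.merge(feature_counts.reset_index(), on='PatientID')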
Fix your code by adding unstack:
out = (features.groupby(['PatientID', 'Feature']).size()
       .unstack(fill_value=0)
       .add_prefix('Feature_')
       .reindex(demographics['PatientID'], fill_value=0)
       .reset_index()
       .merge(demographics))
Out[30]:
PatientID Feature_A Feature_B Feature_C Feature_D DOB Sex Flag
0 10 3 2 2 0 1971-10-23 M 0
1 11 3 3 3 1 1969-06-18 M 1
2 12 0 3 4 3 1973-04-20 F 0
3 13 0 0 0 0 1971-05-31 M 0
I've been reading about how it is best practice to avoid using iterrows to iterate through a pandas DataFrame, but I am not sure how else I can solve my particular problem:
How can I:
Find the "time" of the first instance of the value "c" in one DataFrame, df1, grouped by "num" and sorted by "time"
Then add that "time" into a separate DataFrame, df2, based on "num".
Here is an example of my input DataFrame:
import pandas as pd
df = pd.DataFrame({'num': [2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8,
8, 8, 8, 8, 9, 9, 9, 9, 9],
'state': ['a', 'b', 'c', 'b', 'a', 'b', 'c', 'b', 'c', 'b', 'c', 'a',
'b', 'c', 'b', 'c', 'b', 'c', 'a', 'b', 'c', 'b', 'c', 'b',
'c', 'b', 'c', 'b', 'c', 'b'],
'time': [234, 239, 244, 249, 100, 105, 110, 115, 120, 125, 130, 3, 8,
13, 18, 23, 28, 33, 551, 556, 561, 566, 571, 576, 581, 45, 50,
55, 60, 65]})
Expected output (df2):
num time
2 244
5 110
7 13
8 561
9 50
Every solution I attempt seems like it would require iterrows to load the "time" into df2.
You can do it in one line, using df.groupby() with min() as the aggregation function:
df[df.state == 'c'].drop('state', axis=1).groupby('num').aggregate(min)
time
num
2 244
5 110
7 13
8 561
9 50
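If you need df2 with num as a regular column, as in the expected output, a small variant of the same idea (it relies on rows being sorted by time within each num, as in the question):
# as_index=False keeps 'num' as a column instead of the index
df2 = df[df.state == 'c'].groupby('num', as_index=False)['time'].min()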
It's hard to check without re-creating the df, but I think this should do it:
def first_c(group):
    # first row where state == 'c' (rows are already sorted by time)
    filtered = group[group['state'] == 'c'].iloc[0]
    return filtered[['num', 'time']]
df2 = df.groupby('num').apply(first_c)
Group by num
Apply the function: filter for 'c' and take the first row with iloc
Return num and time
I have 3 arrays:
a = np.array([1, 4, 5, 11, 46])
b = np.array([3, 2, 12, 14, 42])
c = np.array([6, 23, 24, 45, 47])
I have merged these arrays and sorted them in ascending order:
new = np.sort(np.concatenate([a, b, c]))
which results in:
[ 1 2 3 4 5 6 11 12 14 23 24 42 45 46 47]
Now I am looking for a way to show from which initial array (a, b, or c) each value was picked.
For example, I would get ['a', 'b', 'b', 'a', 'a', 'c', 'a', 'b', 'b', 'c', 'c', 'b', 'c', 'a', 'c'].
I am not sure if I am on the right track, or should I use a dictionary for this purpose?
What do you do if the same number appears in more than one array?
If the same number appears in a, b, and c, 'a' will be added to the list; if it appears in b and c, 'b' will be added.
You can probably do it in fewer lines, but for readability:
import numpy as np
a = np.array([1, 4, 5, 11, 46])
b = np.array([3, 2, 12, 14, 42])
c = np.array([6, 23, 24, 45, 47])
new = np.concatenate([a, b, c])
# one label per element, in the same order as the concatenation
indices = np.concatenate([np.array(['a'] * a.size), np.array(['b'] * b.size), np.array(['c'] * c.size)])
# a stable sort keeps 'a' before 'b' before 'c' when values are equal
result = indices[new.argsort(kind='stable')]
print(result)
This gives the following output:
['a' 'b' 'b' 'a' 'a' 'c' 'a' 'b' 'b' 'c' 'c' 'b' 'c' 'a' 'c']
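If you also want the sorted values alongside the labels, computing the permutation once avoids sorting twice; a small extension of the same idea:
order = new.argsort(kind='stable')
sorted_values = new[order]    # [ 1  2  3  4  5  6 11 12 14 23 24 42 45 46 47]
sorted_labels = indices[order]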
I have a list A of the form:
A = ['P', 'Q', 'R', 'S', 'T', 'U']
and an array B of the form:
B = [[ 1 2 3 4 5 6]
[ 7 8 9 10 11 12]
[13 14 15 16 17 18]
[19 20 21 22 23 24]]
now I would like to create a structured array C of the form:
C = [[ P Q R S T U]
[ 1 2 3 4 5 6]
[ 7 8 9 10 11 12]
[13 14 15 16 17 18]
[19 20 21 22 23 24]]
so that I can extract columns with column names P, Q, R, etc. I tried the following code but it does not create a structured array and gives the following error.
Code
import numpy as np
A = (['P', 'Q', 'R', 'S', 'T', 'U'])
B = np.array([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]])
C = np.vstack((A, B))
print (C)
D = C['P']
Error
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
How can I create a structured array in this case?
Update
Both are variables; their shapes change during runtime, but the list and the array will always have the same number of columns.
If you want to do it in pure numpy you can do
A = np.array(['P', 'Q', 'R', 'S', 'T', 'U'])
B = np.array([[ 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24]])
# define the structured array with the names from A
C = np.zeros(B.shape[0], dtype={'names': A, 'formats': ['f8', 'f8', 'f8', 'f8', 'f8', 'f8']})
# copy the data from B into C, column by column
for i, n in enumerate(A):
    C[n] = B[:, i]
C['Q']
array([ 2., 8., 14., 20.])
Edit: you can automate the format list by using instead
C = np.zeros(B.shape[0], dtype={'names': A, 'formats': ['f8' for x in range(A.shape[0])]})
Furthermore, the names do not appear in C as data but in dtype. In order to get the names from C you can use
C.dtype.names
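As an alternative, a record array can be built in a single call; a minimal sketch using np.rec.fromarrays (here the fields keep B's integer dtype instead of being cast to 'f8'):
# one input array per field, taken from B's columns
R = np.rec.fromarrays(B.T, names=list(A))
R['Q']
# array([ 2,  8, 14, 20])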
This is what the pandas library is for:
>>> A = ['P', 'Q', 'R', 'S', 'T', 'U']
>>> B = np.arange(1, 25).reshape(4, 6)
>>> B
array([[ 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24]])
>>> import pandas as pd
>>> pd.DataFrame(B, columns=A)
P Q R S T U
0 1 2 3 4 5 6
1 7 8 9 10 11 12
2 13 14 15 16 17 18
3 19 20 21 22 23 24
>>> df = pd.DataFrame(B, columns=A)
>>> df['P']
0 1
1 7
2 13
3 19
Name: P, dtype: int64
>>> df['T']
0 5
1 11
2 17
3 23
Name: T, dtype: int64
>>>
http://pandas.pydata.org/pandas-docs/dev/tutorials.html
Your error occurs on:
D = C['P']
Here is a simple approach, using regular Python lists on the title row.
import numpy as np
A = (['P', 'Q', 'R', 'S', 'T', 'U'])
B = np.array([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]])
C = np.vstack((A, B))
print (C)
# select the full column (header row included) whose header is 'P'
D = C[0:len(C), list(C[0]).index('P')]
print(D)
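Because np.vstack upcasts the integers to strings to match A, the column prints as strings, header included:
['P' '1' '7' '13' '19']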