I have a dataframe with 4 fields (Responder, female, married, and children) which I plotted as a bar chart.
import pandas as pd
data2 = data1.groupby('Responder')
data3 = data2[['female', 'married', 'children']].mean()
data3.plot(kind='bar')
As you can see in the output, it was grouped, which is what I wanted. The only thing I want to do now is have each variable grouped together. So, for example, you would have two blue bars for female, the first for N and the second for Y. Then next to that, the N and Y bars for married, etc.
What is the syntax I need to do this?
When plotting a DataFrame, each column becomes a legend entry, and each row becomes a horizontal axis category.
# Example data (different from yours):
df = pd.DataFrame({'Responder': ['Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N'],
                   'female':    [0, 1, 1, 0, 1, 1, 0, 1],
                   'married':   [0, 1, 1, 1, 1, 0, 0, 1],
                   'children':  [0, 1, 0, 1, 1, 0, 1, 0]})
g = df.groupby('Responder')
res = g.mean().T
res
Responder     N     Y
female     1.00  0.25
married    0.75  0.50
children   0.25  0.75
res.plot(kind='bar')
By the way, I'm not sure if mean is the correct choice here, since your original data consists of binary counts. Would a normalized sum make more sense?
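For what it's worth, with 0/1 indicator columns the group mean is already a normalized sum: the group total divided by the group size. A quick sketch on the example data above to confirm the equivalence:
# For binary columns, sum / group size reproduces the group mean:
sums = g[['female', 'married', 'children']].sum()
props = sums.div(g.size(), axis=0)  # identical to g.mean() here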
I have a CSV file that looks like this (part of the data):
X and Y are my pixel coordinates.
I need to filter the ADC column for TDC values only (the column also contains 0 values), and then sum up the Energy values for every unique pixel coordinate, i.e. for x=0 y=0, x=0 y=1, x=0 y=2, ... up to x=127 y=127. In another column I need the number of duplicates for each pixel coordinate (that is, the number of rows that feed into the Energy summation).
I don't know how to write the appropriate conditions for this kind of task. I would appreciate any kind of help.
The following StackOverflow question and answers might help you out:
Group dataframe and get sum AND count?
But here is some code for your case which might be useful, too:
# import the pandas package, for doing data analysis and manipulation
import pandas as pd
# create a dummy dataframe using data of the type you are using (I hope)
df = pd.DataFrame(
    data={
        "X": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        "Y": [0, 0, 1, 1, 1, 1, 1, 1, 2, 2],
        "ADC": ["TDC", "TDC", "TDC", "TDC", "TDC", 0, 0, 0, "TDC", "TDC"],
        "Energy": [0, 0, 1, 1, 1, 2, 2, 2, 3, 3],
        "Time": [1.2, 1.2, 2.3, 2.3, 3.6, 3.61, 3.62, 0.66, 0.67, 0.68],
    }
)
# Use pandas' groupby and aggregation methods to get the sum of the energy
# for every unique combination of X and Y, and the number of times each
# combination appears.
df[df["ADC"] == "TDC"].groupby(by=["X", "Y"]).agg({"Energy": ['sum', 'count']}).reset_index()
The result I get from this in my dummy example is:
   X  Y Energy
           sum count
0  0  0      0     2
1  0  1      3     3
2  0  2      6     2
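The Energy column in that result carries a two-level header (sum and count). If flat column names are preferred, named aggregation (available in pandas 0.25+) is one way to get them directly; a sketch, where energy_sum and energy_count are just illustrative names:
# Same aggregation, but with flat, named output columns:
result = (
    df[df["ADC"] == "TDC"]
    .groupby(["X", "Y"])
    .agg(energy_sum=("Energy", "sum"), energy_count=("Energy", "count"))
    .reset_index()
)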
I have a numpy array where 0 denotes empty space and 1 denotes that a location is filled. I am trying to find a quick method of scanning the array for places where multiple zeros are adjacent to each other, and returning the location of the central zero.
For example, if I had the following array:
[0 1 0 1]
[0 0 0 1]
[0 1 0 1]
[1 1 1 1]
I want to return the locations for which there is an adjacent zero on either side of a central zero, e.g.
[1,1]
as this is the centre of 3 zeros, i.e. there is a zero on either side of the zero at this location.
I'm aware that this can be calculated using if statements, but wondered if there was a more Pythonic way of doing this.
Any help is greatly appreciated.
The desired output for arbitrary inputs is not exhaustively specified in the question, but here is a possible approach for this kind of problem, adapted to the details of the example output. It uses np.cumsum, np.bincount, np.where, and np.median to find the middle index of groups of consecutive zeros along the rows of a 2D array:
import numpy as np

def find_groups(x, min_size=3, value=0):
    # Compute a sequential label for groups in each row.
    xc = (x != value).cumsum(1)
    # Count the number of occurrences per group in each row.
    counts = np.apply_along_axis(
        lambda r: np.bincount(r, minlength=1 + xc.max()),
        axis=1, arr=xc)
    # Filter by minimum number of occurrences.
    i, j = np.where(counts >= min_size)
    # Compute the median index of each group.
    return [
        (ii, int(np.ceil(np.median(np.where(xc[ii] == jj)[0]))))
        for ii, jj in zip(i, j)
    ]
x = np.array([[0, 1, 0, 1],
              [0, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 1]])
print(find_groups(x))
# [(1, 1)]
It should work properly even for multiple rows with groups of varying sizes, and even multiple groups per row:
x2 = np.array([[0, 1, 0, 1, 1, 1, 1],
               [0, 0, 0, 1, 0, 0, 0],
               [0, 1, 0, 0, 0, 0, 1],
               [0, 0, 0, 0, 0, 0, 0]])
print(find_groups(x2))
# [(1, 1), (1, 5), (2, 3), (3, 3)]
I have a DataFrame df1, from which I've made a subset df2 of 4 columns and created a list of 5 items containing the max value from each row. Now, whichever column the max value for that row is in (i.e. column 1, 2, 3, or 4) determines the int label (i.e. 1, 2, 3, or 4) in the label column in df1.
The df2 subset exists because some of the other columns (not those 4) have a higher value than the 4 to compare, which obviously screws that up. I'm starting to think it should be a list or series?
Code:
df1 = pd.DataFrame({'x_1': [xvalues[0][0], xvalues[0][1], xvalues[0][2],
                            xvalues[0][3], xvalues[0][4]],
                    'x_2': [yvalues[0][0], yvalues[0][1], yvalues[0][2],
                            yvalues[0][3], yvalues[0][4]],
                    'True labels': [truelabels[0], truelabels[1],
                                    truelabels[2], truelabels[3],
                                    truelabels[4]],
                    'g11': [classifier1[0][0], classifier1[0][1],
                            classifier1[0][2], classifier1[0][3],
                            classifier1[0][4]],
                    'g12': [classifier1[1][0], classifier1[1][1],
                            classifier1[1][2], classifier1[1][3],
                            classifier1[1][4]],
                    'g13': [classifier1[2][0], classifier1[2][1],
                            classifier1[2][2], classifier1[2][3],
                            classifier1[2][4]],
                    'g14': [classifier1[3][0], classifier1[3][1],
                            classifier1[3][2], classifier1[3][3],
                            classifier1[3][4]],
                    'L1': [2, 5, 6, 7, 8],
                    'g21': [classifier2[0][0], classifier2[0][1],
                            classifier2[0][2], classifier2[0][3],
                            classifier2[0][4]],
                    'g22': [classifier2[1][0], classifier2[1][1],
                            classifier2[1][2], classifier2[1][3],
                            classifier2[1][4]],
                    'g23': [classifier2[2][0], classifier2[2][1],
                            classifier2[2][2], classifier2[2][3],
                            classifier2[2][4]],
                    'g24': [classifier2[3][0], classifier2[3][1],
                            classifier2[3][2], classifier2[3][3],
                            classifier2[3][4]],
                    'L2': [0, 0, 0, 0, 0],
                    'g31': [classifier3[0], classifier3[0],
                            classifier3[0], classifier3[0],
                            classifier3[0]],
                    'g32': [classifier3[1][0], classifier3[1][1],
                            classifier3[1][2], classifier3[1][3],
                            classifier3[1][4]],
                    'g33': [classifier3[2][0], classifier3[2][1],
                            classifier3[2][2], classifier3[2][3],
                            classifier3[2][4]],
                    'g34': [classifier3[3][0], classifier3[3][1],
                            classifier3[3][2], classifier3[3][3],
                            classifier3[3][4]],
                    'L3': [0, 0, 0, 0, 0],
                    'Assigned L': [1, 1, 1, 1, 1]},
                   index=['Datapoint1', 'D2', 'D3', 'D4', 'D5'])
df2 = df1[['g11', 'g12', 'g13', 'g14']]
hdf = df2.max(axis=1)
g11 = df1['g11'].to_list()
g12 = df1['g12'].to_list()
g13 = df1['g13'].to_list()
g14 = df1['g14'].to_list()
for item, label in zip(hdf, table['L1']):
    if hdf[item] in g11:
        df1['L1'][label] = labels[0]
        print(item, label)
    elif hdf[item] in g12:
        df1['L1'][label] = labels[1]
        print(item, label)
    elif hdf[item] in g13:
        df1['L1'][label] = labels[2]
        print(item, label)
    elif hdf[item] in g14:
        df1['L1'][label] = labels[3]
        print(item, label)
I have tried using .loc and .at, but when they didn't work I just scrapped that and tried something else; maybe those approaches would be better? This is where I'm at so far.
The error is coming from the for loop over hdf:
"cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.0311272081] of <class 'float'>"
I don't think the other values in the data frame are relevant; they are just there so people know I have made one. The 5 relevant columns in the dataframe are g11, g12, g13, g14 and L1.
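For what it's worth, since the column of the row maximum (not the value itself) is what determines the label, pandas' idxmax can express this without the lookup loop. A minimal sketch, with the column-to-label mapping assumed from the description above:
# idxmax(axis=1) returns the column name holding each row's maximum;
# mapping those names to integers fills L1 in one vectorized step.
label_map = {'g11': 1, 'g12': 2, 'g13': 3, 'g14': 4}  # assumed mapping
df1['L1'] = df1[['g11', 'g12', 'g13', 'g14']].idxmax(axis=1).map(label_map)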
Hello everyone
For a school project, I am stuck on the running time of an operation on a pandas DataFrame.
I have one dataframe df whose shape is (250 000 000, 200). This dataframe contains the values of variables describing the behaviour of sensors on a machine.
The rows are organized by 'Cycle' (every time the machine begins a new cycle, this variable is incremented by one), and within a cycle, 'CycleTime' describes the position of the row.
In the 'mean' DataFrame, I compute the mean of each variable grouped by 'CycleTime'.
The 'anomaly_matrix' DataFrame represents the global anomaly of each cycle, which is the sum of the squared differences of each row belonging to the cycle from the mean of the corresponding CycleTime.
An example of my code is below:
df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2],
                   'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1],
                   'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25],
                   'variable2': [1, 2, 1, 1, 2, 2, 1, 2],
                   'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})
mean = df.drop(['Cycle'], axis=1).groupby("CycleTime").agg('mean')
anomali_matrix = df.drop(['CycleTime'], axis=1).groupby("Cycle").agg('mean')
anomaly_matrix = anomali_matrix - anomali_matrix  # zero-filled frame with the right index
for index, row in df.iterrows():
    cycle = row["Cycle"]
    time = row["CycleTime"]
    anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2
>>> anomaly_matrix
       variable1  variable2     variable3
Cycle
0       0.047014       0.25  1.116111e+07
1       0.023681       0.25  3.917778e+06
2       0.018889       0.00  2.267778e+06
This calculation for my (250 000 000, 200) DataFrame takes 6 hours; it is due to the line anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2.
I tried to improve it by using an apply function, but I did not succeed in bringing the other DataFrame into that apply function. The same goes for trying to vectorize with pandas.
Do you have any idea how to accelerate this process? Thanks
You can use:
df1 = df.set_index(['Cycle', 'CycleTime'])
mean = df1.sub(df1.groupby('CycleTime').transform('mean'))**2
df2 = mean.groupby('Cycle').sum()
print(df2)
       variable1  variable2     variable3
Cycle
0       0.047014       0.25  1.116111e+07
1       0.023681       0.25  3.917778e+06
2       0.018889       0.00  2.267778e+06
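The same computation can also be written without the intermediate set_index, grouping by the original columns directly; a quick sketch, assuming the variable columns are the only ones besides Cycle and CycleTime:
# Square the deviations from the per-CycleTime mean, then sum them per Cycle.
cols = ['variable1', 'variable2', 'variable3']
sq = (df[cols] - df.groupby('CycleTime')[cols].transform('mean'))**2
out = sq.groupby(df['Cycle']).sum()
Either way, the speedup comes from replacing the row-by-row iterrows loop with vectorized pandas operations.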
This paper has a nice way of visualizing clusters of a dataset with binary features by plotting a 2D matrix and sorting the values according to a cluster.
In this case, there are three clusters, as indicated by the black dividing lines; the rows are sorted, and show which examples are in each cluster, and the columns are the features of each example.
Given a vector of cluster assignments and a pandas DataFrame, how can I replicate this using a Python library (e.g. seaborn)? Plotting a DataFrame using seaborn isn't difficult, nor is sorting the rows of the DataFrame to align with the cluster assignments. What I am most interested in is how to display those black dividing lines which delineate each cluster.
Dummy data:
"""
col1 col2
x1_c0 0 1
x2_c0 0 1
================= I want a line drawn here
x3_c1 1 0
================= and here
x4_c2 1 0
"""
import pandas as pd
import seaborn as sns
df = pd.DataFrame(
    data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
    index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2']
)
clus = [0, 0, 1, 2]  # This is the cluster assignment
sns.heatmap(df)
The link that mwaskom posted in a comment is a good starting place. The trick is figuring out the coordinates for the vertical and horizontal lines.
To illustrate what the code is actually doing, it's worthwhile to plot all of the lines individually:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame(data={'col1': [0, 0, 1, 1], 'col2': [1, 1, 0, 0]},
                  index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2'])

f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df)
ax.axvline(1, 0, 2, linewidth=3, c='w')
ax.axhline(1, 0, 1, linewidth=3, c='w')
ax.axhline(2, 0, 1, linewidth=3, c='w')
ax.axhline(3, 0, 1, linewidth=3, c='w')
f.tight_layout()
The way the axvline method works is that the first argument is the x location of the line, followed by the lower and upper bounds of the line in axis-fraction coordinates (in this case 1, 0, 2). The horizontal line takes the y location and then the x start and x stop of the line. The defaults draw the line across the entire plot, so you can typically leave them out.
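Given those defaults, the line-drawing calls above could be shortened; an equivalent version relying on the default bounds:
# The default bounds span the whole axis, so only the positions are needed:
ax.axvline(1, linewidth=3, c='w')
for y in (1, 2, 3):
    ax.axhline(y, linewidth=3, c='w')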
The code above creates a line for every value in the dataframe. If you want to create groups for the heatmap, you will want to create an index in your dataframe, or some other list of values to loop through. For instance, here is a more complicated example building on the one above:
df = pd.DataFrame(data={'col1': [0, 0, 1, 1, 1.5], 'col2': [1, 1, 0, 0, 2]},
                  index=['x1_c0', 'x2_c0', 'x3_c1', 'x4_c2', 'x5_c2'])
df['id_'] = df.index
df['group'] = [1, 2, 2, 3, 3]
df.set_index(['group', 'id_'], inplace=True)
df
             col1  col2
group id_
1     x1_c0   0.0     1
2     x2_c0   0.0     1
      x3_c1   1.0     0
3     x4_c2   1.0     0
      x5_c2   1.5     2
Then plot the heatmap with the groups:
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(df)
groups = df.index.get_level_values(0)
for i, group in enumerate(groups):
    if i and group != groups[i - 1]:
        ax.axhline(len(groups) - i, c="w", linewidth=3)
ax.axvline(1, c="w", linewidth=3)
f.tight_layout()
Because your heatmap is not symmetric, you may need to use a separate for loop for the columns, as sketched below.
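A minimal sketch of what that column loop might look like, assuming you have a list of per-column group labels (col_groups below is hypothetical, not from the question):
col_groups = [1, 1, 2]  # hypothetical per-column group labels
for j, cg in enumerate(col_groups):
    if j and cg != col_groups[j - 1]:
        ax.axvline(j, c="w", linewidth=3)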