Create a Single Boxplot from Multiple DataFrames

I have multiple DataFrames with different numbers of rows and the same number of columns, e.g.:
DATA
female_df1 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
female_df2 = pd.DataFrame({'ID': [75,1,7], 'value': [39, 66.7, 77.9]})
female_df3 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
female_df4 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
male_df1 = pd.DataFrame({'ID': [35,1,7], 'value': [15, 36.7, 87.9]})
male_df2 = pd.DataFrame({'ID': [5,11,17], 'value': [99, 96.7, 97.9]})
male_df3 = pd.DataFrame({'ID': [35,41,37], 'value': [15, 16.7, 17.9]})
male_df4 = pd.DataFrame({'ID': [51,11,27], 'value': [35, 36.7, 37.9]})
Now I would like to draw a single boxplot from these DataFrames. I used the code below to do so:
fig, ax2 = plt.subplots(figsize = (15,10))
vec = [female_df1['value'].values,female_df2['value'].values,female_df3['value'].values,female_df4['value'].values]
labels = ['f1','f2','f3', 'f4']
ax2.boxplot(vec, labels = labels)
plt.show()
The output is a boxplot of the female values. Similarly, I have male DataFrames with values, and I want to plot them side by side (i.e. fbeta1.0 and mbeta1.0) to compare the data distributions. Any insights are much appreciated.
Desired output plot: [image]

This is a bit manual, but should do what you need...
### IMPORTS ###
import pandas as pd
import matplotlib.pyplot as plt
### DATA ###
female_df1 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
female_df2 = pd.DataFrame({'ID': [75,1,7], 'value': [39, 66.7, 77.9]})
female_df3 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
female_df4 = pd.DataFrame({'ID': [5,21,17], 'value': [85, 56.7, 77.9]})
male_df1 = pd.DataFrame({'ID': [35,1,7], 'value': [15, 36.7, 87.9]})
male_df2 = pd.DataFrame({'ID': [5,11,17], 'value': [99, 96.7, 97.9]})
male_df3 = pd.DataFrame({'ID': [35,41,37], 'value': [15, 16.7, 17.9]})
male_df4 = pd.DataFrame({'ID': [51,11,27], 'value': [35, 36.7, 37.9]})
### PLOTTING ###
# One subplot per female/male pair
fig, ax = plt.subplots(1,4, figsize = (15,6))
ax[0].boxplot([female_df1['value'].values, male_df1['value'].values], labels = ['f1','m1'])
ax[1].boxplot([female_df2['value'].values, male_df2['value'].values], labels = ['f2','m2'])
ax[2].boxplot([female_df3['value'].values, male_df3['value'].values], labels = ['f3','m3'])
ax[3].boxplot([female_df4['value'].values, male_df4['value'].values], labels = ['f4','m4'])
ax[0].set_title("M1 & F1")
ax[1].set_title("M2 & F2")
ax[2].set_title("M3 & F3")
ax[3].set_title("M4 & F4")
plt.show()
[image: resulting plot]
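If all pairs should share one axes rather than four subplots, one possible approach is to give each box an explicit x position. The following is only a sketch under assumptions: the list names, colors, spacing of three x-units per group, and the "group N" tick labels are illustrative choices, not part of the original answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same sample data as above, collected into two lists (list names are illustrative).
female_dfs = [
    pd.DataFrame({'ID': [5, 21, 17], 'value': [85, 56.7, 77.9]}),
    pd.DataFrame({'ID': [75, 1, 7], 'value': [39, 66.7, 77.9]}),
    pd.DataFrame({'ID': [5, 21, 17], 'value': [85, 56.7, 77.9]}),
    pd.DataFrame({'ID': [5, 21, 17], 'value': [85, 56.7, 77.9]}),
]
male_dfs = [
    pd.DataFrame({'ID': [35, 1, 7], 'value': [15, 36.7, 87.9]}),
    pd.DataFrame({'ID': [5, 11, 17], 'value': [99, 96.7, 97.9]}),
    pd.DataFrame({'ID': [35, 41, 37], 'value': [15, 16.7, 17.9]}),
    pd.DataFrame({'ID': [51, 11, 27], 'value': [35, 36.7, 37.9]}),
]

n_groups = len(female_dfs)
# Each group occupies 3 x-units: female box, male box, then a gap.
female_pos = [3 * i + 1 for i in range(n_groups)]
male_pos = [3 * i + 2 for i in range(n_groups)]
centers = [3 * i + 1.5 for i in range(n_groups)]

fig, ax = plt.subplots(figsize=(15, 6))
# patch_artist=True makes the boxes fillable so the two sexes can be colored.
bp_f = ax.boxplot([d['value'].values for d in female_dfs],
                  positions=female_pos, widths=0.8, patch_artist=True,
                  boxprops={'facecolor': 'lightcoral'})
bp_m = ax.boxplot([d['value'].values for d in male_dfs],
                  positions=male_pos, widths=0.8, patch_artist=True,
                  boxprops={'facecolor': 'lightblue'})

# One tick per female/male pair, centered between the two boxes.
ax.set_xticks(centers)
ax.set_xticklabels([f'group {i + 1}' for i in range(n_groups)])
ax.set_xlim(0, 3 * n_groups)
ax.legend([bp_f['boxes'][0], bp_m['boxes'][0]], ['female', 'male'])
# Call plt.show() or fig.savefig(...) to view the result.
```

The positions are computed from the number of groups, so adding a fifth pair only requires appending to both lists.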

Related

Python: How to validate and append non-existing row in a dataset/dataframe?

How can we append a row to a dataset only if it does not already exist? I have a sample table with a list of names. The objective is to first check whether the name exists and, if it does not, append it to the dataset.
Please see the code below for reference:
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, pd.DataFrame([{'Name':'Jane', 'Age':25, 'Location':'Madrid'}])], ignore_index=True)
print(df)

You can check the condition before inserting into the DataFrame:
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
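if 'Jane' not in df['Name'].values:
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    new_row = pd.DataFrame([{'Name': 'Jane', 'Age': 25, 'Location': 'Madrid'}])
    df = pd.concat([df, new_row], ignore_index=True)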
print(df)

Complete a pandas data frame with values from other data frames

I have 3 DataFrames. I need to enrich df with the data columns from df2 and df3 so that df ends up with the columns 'Code', 'Quantity', 'Payment', 'Date', 'Name', 'Size', 'Product', 'product_id', 'Sector'.
Codes that are in df but in neither df2 nor df3 need to receive "unknown" for the string columns and 0 for the numeric columns.
import pandas as pd
data = {'Code': [356, 177, 395, 879, 952, 999],
'Quantity': [20, 21, 19, 18, 15, 10],
'Payment': [173.78, 253.79, 158.99, 400, 500, 500],
'Date': ['2022-06-01', '2022-09-01','2022-08-01','2022-07-03', '2022-06-09', '2022-06-09']
}
df = pd.DataFrame(data)
df['Date']= pd.to_datetime(df['Date'])
data2 = {'Code': [356, 177, 395, 893, 697, 689, 687],
'Name': ['John', 'Mary', 'Ann', 'Mike', 'Bill', 'Joana', 'Linda'],
'Product': ['RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'POU'],
'product_id': [189, 188, 16, 36, 59, 75, 55],
'Size': [1, 1, 3, 4, 5, 4, 7],
}
df2 = pd.DataFrame(data2)
data3 = {'Code': [879, 356, 389, 395, 893, 697, 689, 978],
'Name': ['Mark', 'John', 'Marry', 'Ann', 'Mike', 'Bill', 'Joana', 'James'],
'Product': ['TTT', 'RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'DTS'],
'product_id': [988, 189, 188, 16, 36, 59, 75, 66],
'Sector': ['rt' , 'dx', 'sx', 'da', 'sa','sd','ld', 'pc'],
}
df3 = pd.DataFrame(data3)
I was using the following code to obtain the unknown codes by comparing with df2, but now I also have to compare with df3 and add the data from the columns ['Name', 'Size', 'Product', 'product_id', 'Sector'].
common = df2.merge(df,on=['Code'])
new_codes = df[(~df['Code'].isin(common['Code']))]
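No answer is recorded here, so the following is only a sketch of one possible approach. It assumes that df2 and df3 agree on 'Name', 'Product' and 'product_id' whenever they share a 'Code' (true for the sample data), so the two can be combined into a single lookup table first. The names `lookup`, `result`, `unknown_codes`, `str_cols` and `num_cols` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Code': [356, 177, 395, 879, 952, 999],
    'Quantity': [20, 21, 19, 18, 15, 10],
    'Payment': [173.78, 253.79, 158.99, 400, 500, 500],
    'Date': pd.to_datetime(['2022-06-01', '2022-09-01', '2022-08-01',
                            '2022-07-03', '2022-06-09', '2022-06-09']),
})
df2 = pd.DataFrame({
    'Code': [356, 177, 395, 893, 697, 689, 687],
    'Name': ['John', 'Mary', 'Ann', 'Mike', 'Bill', 'Joana', 'Linda'],
    'Product': ['RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'POU'],
    'product_id': [189, 188, 16, 36, 59, 75, 55],
    'Size': [1, 1, 3, 4, 5, 4, 7],
})
df3 = pd.DataFrame({
    'Code': [879, 356, 389, 395, 893, 697, 689, 978],
    'Name': ['Mark', 'John', 'Marry', 'Ann', 'Mike', 'Bill', 'Joana', 'James'],
    'Product': ['TTT', 'RRR', 'RRT', 'NGF', 'TRA', 'FRT', 'RTW', 'DTS'],
    'product_id': [988, 189, 188, 16, 36, 59, 75, 66],
    'Sector': ['rt', 'dx', 'sx', 'da', 'sa', 'sd', 'ld', 'pc'],
})

# Combine df2 and df3 into one lookup table with one row per Code.
# Assumption: the shared columns carry identical values for shared codes.
shared_cols = ['Code', 'Name', 'Product', 'product_id']
lookup = df2.merge(df3, on=shared_cols, how='outer')

# Codes present in df but in neither df2 nor df3.
unknown_codes = df.loc[~df['Code'].isin(lookup['Code']), 'Code']

# A left merge keeps every row of df and leaves NaN where no match exists.
result = df.merge(lookup, on='Code', how='left')

# Fill the gaps: "unknown" for string columns, 0 for numeric columns.
str_cols = ['Name', 'Product', 'Sector']
num_cols = ['product_id', 'Size']
result[str_cols] = result[str_cols].fillna('unknown')
result[num_cols] = result[num_cols].fillna(0).astype(int)

print(result)
```

If the two lookup tables can disagree for the same code, the outer merge would produce more than one row per code, and a rule for resolving the conflict would be needed before merging into df.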

python: seaborn lineplot customize marker size based on values of a column

I have data like the following:
#df
df = pd.DataFrame({
'id': {0: -3, 1: 2, 2: -3, 3: 1},
'val': {0: 0.4, 1: 0.03, 2: 0.88, 3: 1.3},
'indicator': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'count': {0: 40000, 1: 5779, 2: 3000, 3: 31090}
})
df
If I run the following code, I get a scatter plot:
sns.relplot(x = 'id', y = 'val', hue = 'indicator', size = 'count', data = df)
I want a line connecting the dots. But if I change the plot to a line plot, I do not get the graph I want.
sns.lineplot(x = 'id', y = 'val', hue = 'indicator', size = 'count', data = df)

It seems you want to combine a lineplot with a scatterplot:
plt.figure()
sns.lineplot(x = 'id', y = 'val', hue = 'indicator', data = df)
sns.scatterplot(x = 'id', y = 'val', hue = 'indicator', size = 'count', data = df)

How to combine two (or more) DataFrames with different lengths and indexes, providing an appropriate index for both in a single DataFrame

I have two DataFrames (DF and DF2). Could anybody help me understand how to combine them so that the result looks like the third one (DF3)? I present a simple example here, but I need this to compile DataFrames that include different samples (or observations). Occasionally, samples encompass the same group of variables, but in most cases the samples contain different variables. Each column corresponds to one sample.
Any help is welcome!
DF -
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)
DF2 -
raw_data2 = {'first_name': ['Molly', 'Jake'],
'civil_status': ['Single', 'Single']}
df2 = pd.DataFrame(raw_data2, columns = ['first_name', 'civil_status'])
print(df2)
DF3 -
raw_data3 = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70],
'civil_status': ['NaN', 'Single', 'NaN', 'Single', 'NaN']}
df3 = pd.DataFrame(raw_data3, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore',
'civil_status'])
print(df3)

Use join:
df.set_index("first_name").join(df2.set_index("first_name"))
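As an alternative sketch, assuming 'first_name' should stay an ordinary column and the default integer index should be kept as in DF3, a left merge gives the same pairing. The name `combined` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
    'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
    'age': [42, 52, 36, 24, 73],
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70],
})
df2 = pd.DataFrame({
    'first_name': ['Molly', 'Jake'],
    'civil_status': ['Single', 'Single'],
})

# how='left' keeps every row of df; names missing from df2 get NaN.
combined = df.merge(df2, on='first_name', how='left')
print(combined)
```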

I applied the solution above in the real context, by using the code below:
arquivo1 = pd.read_excel(f1, header=7, index_col=False)
arquivo2 = pd.read_excel(f2, header=7, index_col=False)
joined = arquivo1.set_index("Element").join(arquivo2.set_index("Element"))
It raised ValueError: columns overlap but no suffix specified: Index(['AN', 'series', ' Net', ' [wt.%]', ' [norm. wt.%]', '[norm. at.%]',
'Error in %'],
dtype='object')
The pictures below show "arquivo1" and "arquivo2": [images]
When I pass the suffix 'Element' for both the left and the right side, it does join the two DataFrames.
joined = arquivo1.set_index("Element").join(arquivo2.set_index("Element"), lsuffix='Element', rsuffix='Element')
But when a DataFrame containing more variables (rows) is joined to the first, the new variables are simply dropped. Does anybody know how to fix this?
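A likely cause is that DataFrame.join defaults to how='left', which keeps only the index values of the left DataFrame. The sketch below uses small hypothetical stand-ins for the two Excel files (the element names, the single 'Net' column, and the suffixes '_1' and '_2' are all assumptions) to show how='outer' keeping rows that exist only on the right.

```python
import pandas as pd

# Hypothetical stand-ins for arquivo1 and arquivo2 (the real files are not shown).
arquivo1 = pd.DataFrame({'Element': ['C', 'O'], 'Net': [10, 20]})
arquivo2 = pd.DataFrame({'Element': ['C', 'O', 'Fe'], 'Net': [11, 21, 31]})

joined = arquivo1.set_index('Element').join(
    arquivo2.set_index('Element'),
    how='outer',    # keep elements found in either DataFrame
    lsuffix='_1',   # distinct suffixes keep the overlapping columns apart
    rsuffix='_2',
)
print(joined)
```

Using different suffixes for the left and right side also avoids ending up with two columns that share the same name.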

Custom sort for histogram

After looking at countless questions and answers on how to do custom sorting of the bars in bar charts (or a histogram in my case) it seemed the answer was to sort the dataframe as desired and then do the plot, only to find that the plot ignores the data and blithely sorts alphabetically. There does not seem to be a simple option to turn sorting off, or just supply a list to the plot to sort by.
Here's my sample code
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
diamonds.set_index('cut', inplace=True)
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
df = pd.DataFrame(diamonds.loc[cuts_order].carat)
df.reset_index(inplace=True)
plt.hist(df.cut);
This shows the cuts in alphabetical order, not in the order of the data. I was quite excited to have figured out a clever way of sorting the data, so the disappointment was all the greater when the plot ignored it.
What is the most straightforward way of doing this?
Here's what I get with the above code: [image]

Update of your code with the answers in the comments:
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
diamonds.set_index('cut', inplace=True)
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
df = pd.DataFrame(diamonds.loc[cuts_order].carat)
df.plot.bar(use_index=True, y='carat')

A histogram was not the right plot here. With the following code the bars, sorted as desired, are created:
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
c_classes = pd.api.types.CategoricalDtype(ordered = True, categories = cuts_order)
diamonds['cut'] = diamonds['cut'].astype(c_classes)
to_plot = diamonds.cut.value_counts(sort=False)
plt.bar(to_plot.index, to_plot.values)
Side note: matplotlib 2.1.0 behaves differently, because there plt.bar ignores the sort order it is given. I can only confirm that this works with 3.0.3 (and hopefully higher).
I also tried sorting the data by index, but this has no effect for some reason. It looks like value_counts(sort=False) does not return the values in the order in which they appear in the data:
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
diamonds.set_index('cut', inplace=True)
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
diamonds = diamonds.loc[cuts_order]
to_plot = diamonds.index.value_counts(sort=False)
plt.bar(to_plot.index, to_plot.values)
Seaborn is also an option as it potentially removes the dependency on the available matplotlib version:
import pandas as pd
import seaborn as sb
%matplotlib inline
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
'cut' : ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
'color': ['E', 'E', 'E', 'J', 'E'],
'clarity': ['SI2', 'SI1', 'VS1', 'VVS2', 'VS2'],
'depth': [61.5, 59.8, 56.9, 62.8, 65.1],
'table': [55, 61, 65, 57, 61],
'price': [326, 326, 327, 336, 337]})
cuts_order = ['Fair','Good','Very Good','Premium','Ideal']
c_classes = pd.api.types.CategoricalDtype(ordered = True, categories = cuts_order)
diamonds['cut'] = diamonds['cut'].astype(c_classes)
to_plot = diamonds.cut.value_counts(sort=False)
ax = sb.barplot(data = diamonds, x = to_plot.index, y = to_plot.values)
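Another way to avoid depending on how a given matplotlib version orders string categories is to plot against numeric positions and set the tick labels explicitly. This is a sketch, not part of the original answers; it assumes only the 'cut' column matters for the counts, and the name `counts` is illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

diamonds = pd.DataFrame({
    'carat': [0.23, 0.21, 0.23, 0.24, 0.22],
    'cut': ['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'],
})
cuts_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Count each cut, then reorder the counts explicitly; cuts that never
# occur in the data are kept with a count of 0.
counts = diamonds['cut'].value_counts().reindex(cuts_order, fill_value=0)

fig, ax = plt.subplots()
x_positions = list(range(len(counts)))
# Numeric x positions fix the bar order; the labels are attached afterwards.
ax.bar(x_positions, counts.values)
ax.set_xticks(x_positions)
ax.set_xticklabels(list(counts.index))
# Call plt.show() or fig.savefig(...) to view the result.
```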
