Create multiple barplots based off groupby conditions - python

I am trying to create multiple horizontal barplots for a dataset. The data deals with race times from a running race.
Dataframe has the following columns: Name, Age Group, Finish Time, Finish Place, Hometown, Times Ran The Race. Sample data below.
Name    Age Group  Finish Time  Finish Place  Hometown       Times Ran The Race
John    30-39      15.5         1             New York City  2
Mike    30-39      17.2         2             Denver         1
Travis  40-49      20.4         1             Louisville     3
James   40-49      22.1         2             New York City  1
I would like to create a bar plot similar to what is shown below. There would be 1 bar chart per age group, fastest runner on bottom of chart, runner name with city and number of times ran the race below their name.
Do I need a for loop or would a simple groupby work? The number and sizing of each age group can be dynamic based off the race so it is not a constant, but would be dependent on the dataframe that is used for each race.

I used a loop. For each age group I extract a temporary data frame, collect the label information (name, hometown, number of times run) for every runner, and join it into one multi-line string per runner. Then I draw a horizontal bar graph and replace the tick labels on the y-axis with those strings.
for ag in df['Age Group'].unique():
    label_all = []
    tmp = df[df['Age Group'] == ag]
    labels = [[x,y,z] for x,y,z in zip(tmp.Name.values, tmp.Hometown.values, tmp['Times Ran The Race'].values)]
    for k in range(len(labels)):
        label_all.append(labels[k])
    l_all = []
    for l in label_all:
        lbl = l[0] + '\n'+ l[1] + '\n' + str(l[2]) + ' Time'
        l_all.append(lbl)
    ax = tmp[['Name', 'Finish Time']].plot(kind='barh', legend=False)
    ax.set_title(ag +' Age Group')
    ax.set_yticklabels([l_all[x] for x in range(len(l_all))])
    ax.grid(axis='x')
    for i in ['top','bottom','left','right']:
        ax.spines[i].set_visible(False)

Here's a fairly compact solution. The only tricky part is the ordinal number, if you really want to have that. I copied the lambda solution from Ordinal numbers replacement; note that it needs integer division (//) to give correct results in Python 3.
Give this a try, and please upvote the answer if you like it.
import matplotlib.pyplot as plt
ordinal = lambda n: "{}{}".format(n,"tsnrhtdd"[(n//10%10!=1)*(n%10<4)*n%10::4])
for i, a in enumerate(df['Age Group'].unique()):
    plt.figure(i)
    dfa = df.loc[df['Age Group'] == a].copy()
    dfa['Info'] = dfa.Name + '\n' + dfa.Hometown + '\n' + \
        [ordinal(row) for row in dfa['Times Ran The Race']] + ' Time'
    plt.barh(dfa.Info, dfa['Finish Time'])
    plt.title(f'{a} Age Group')
    plt.xlabel("Time (Minutes)")
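On the "for loop or groupby" question: the two can be combined, since iterating over a groupby yields one sub-frame per age group however many groups a given race has. The following is only a sketch using the sample rows from the question; the helper names build_label, charts_by_age_group and plot_charts are made up for illustration, and the plotting part assumes Matplotlib is installed.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Name": ["John", "Mike", "Travis", "James"],
    "Age Group": ["30-39", "30-39", "40-49", "40-49"],
    "Finish Time": [15.5, 17.2, 20.4, 22.1],
    "Finish Place": [1, 2, 1, 2],
    "Hometown": ["New York City", "Denver", "Louisville", "New York City"],
    "Times Ran The Race": [2, 1, 3, 1],
})


def build_label(row):
    """Stack name, hometown and race count on three lines (hypothetical helper)."""
    return f"{row['Name']}\n{row['Hometown']}\n{row['Times Ran The Race']} Time"


def charts_by_age_group(frame):
    """Return {age group: (labels, times)} with the fastest runner first."""
    charts = {}
    for age_group, group in frame.groupby("Age Group"):
        group = group.sort_values("Finish Time")
        labels = group.apply(build_label, axis=1).tolist()
        charts[age_group] = (labels, group["Finish Time"].tolist())
    return charts


def plot_charts(charts):
    """Draw one horizontal bar chart per age group."""
    import matplotlib.pyplot as plt  # imported here so the data part runs without it

    for age_group, (labels, times) in charts.items():
        fig, ax = plt.subplots()
        ax.barh(labels, times)  # the first label is drawn at the bottom
        ax.set_title(f"{age_group} Age Group")
        ax.set_xlabel("Time (Minutes)")


charts = charts_by_age_group(df)
print(list(charts))
```

Calling plot_charts(charts) followed by plt.show() would then display the figures. Sorting by finish time before plotting is what puts the fastest runner at the bottom of each chart.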

Related

group by similar value of column in dataframe

I'm using a DataFrame that contains sample data on rocks and soils. I want to create 2 separate plots, one for rocks and one for soils, showing SO3 composition relative to SIO2. I created a dictionary of rocks only, but there are still 90+ samples. As it's shown in the figure, some have similar names. For example 'Adirondack' appears 3 times. I could manually go through them all, but that would take a while (P.S. I did, but I would still like to know the easier way than if ... elif ... statements, since I had to manually create a legend entry to avoid many duplicate entries).
How can I group together the samples whose names start with the same x letters (or perhaps the part of the name before the '_') and save them in a new dataframe or in my dictionary under a single key such as 'Adirondack (all)', so that the legend shows that name once and the entry holds all three sets of values for 'Adirondack_' etc.?
Rocks = APXSData[APXSData.Type.str.contains('R')]
RockLabels = Rocks['Sample'].to_list()
RockDict = {}
for i in RockLabels:
    SiO2val = np.extract(Rocks["Sample"]==i, Rocks["SiO2"])
    SO3val = np.extract(Rocks["Sample"]==i, Rocks["SO3"])
    newKey = i
    RockDict[newKey] = {'SiO2':SiO2val, 'SO3':SO3val}
DatabyRockSample = pd.DataFrame(RockDict)
fig = plt.figure()
for i in RockLabels:
    plt.scatter(
        DatabyRockSample[i]["SiO2"],
        DatabyRockSample[i]["SO3"],
        marker='o',
        label = i) #, color = colors[count], edgecolors = edgecolor[count],
plt.xlabel("SiO$_2$", labelpad = 10)
plt.ylabel("SO$_3$", labelpad = 10)
plt.title('Composition of all rocks \n at Gusev Crater')
plt.legend()
Let's prepare some dummy data:
import pandas as pd
df = pd.DataFrame({
    'Sol': [14,18,33,34,41],
    'Type': ['SU','RU','RB','RR','SU'],
    'Sample': ['Gusev_Soil','Adirondack_asis','Adirondack_brush','Adirondack_RAT','Gusev_Other'],
    'N': [45,126,129,128,76],
    'Na2O': [2.8,2.3,2.8,2.4,2.7],
    # ...
})
So here's our data frame:
   Sol Type            Sample    N  Na2O
0   14   SU        Gusev_Soil   45   2.8
1   18   RU   Adirondack_asis  126   2.3
2   33   RB  Adirondack_brush  129   2.8
3   34   RR    Adirondack_RAT  128   2.4
4   41   SU       Gusev_Other   76   2.7
We can use grouping in this way.
If the only option we have is matching first n letters, then:
n = 5
grouper = df['Sample'].str[:n]
groups = {name: group for name, group in df.groupby(grouper)}
If we can extract meaningful data by splitting, which is better I think, then:
# in this case we can split by '_' and get the first word
grouper = df['Sample'].str.split('_').str.get(0)
groups = {name: group for name, group in df.groupby(grouper)}
If splitting isn't that simple, say our words are separated by a space, underscore or hyphen, then we can use the str.extract method (with expand=False so that it returns a Series that groupby accepts):
grouper = df['Sample'].str.extract(r'\A(.*)(?=[ _-])', expand=False)
groups = {name: group for name, group in df.groupby(grouper)}
We can also avoid creating dictionaries. Let's see how we can iterate over the groups obtained by splitting as an example:
grouper = df['Sample'].str.split('_').str.get(0)
groups = df.groupby(grouper)
for name, dataframe in groups:
    print(f'name: {name}')
    print(dataframe, '\n')
Output:
name: Adirondack
   Sol Type            Sample    N  Na2O
1   18   RU   Adirondack_asis  126   2.3
2   33   RB  Adirondack_brush  129   2.8
3   34   RR    Adirondack_RAT  128   2.4
name: Gusev
   Sol Type       Sample   N  Na2O
0   14   SU   Gusev_Soil  45   2.8
4   41   SU  Gusev_Other  76   2.7
The same with rocks. IMO we can do better than APXSData.Type.str.contains('R'):
APXSData['Type'].str[0] == 'R'
APXSData['Type'].str.startswith('R')
Let's separate rocks and group them by the leading name:
is_rock = df['Type'].str.startswith('R')
grouper = df['Sample'].str.split('_').str.get(0)
groups_of_rocks = df[is_rock].groupby(grouper)
for k,v in groups_of_rocks:
    print(k)
    print(v)
Output:
Adirondack
   Sol Type            Sample    N  Na2O
1   18   RU   Adirondack_asis  126   2.3
2   33   RB  Adirondack_brush  129   2.8
3   34   RR    Adirondack_RAT  128   2.4
To plot data for some group of interest only, we can use get_group(name):
groups.get_group('Adirondack').plot.bar(x='Sample', y=['N','Na2O'])
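To tie this back to the dictionary the question asks for, the same grouping can build one entry per rock name that holds all of its SiO2 and SO3 values. This is a rough sketch only: the numeric values below are invented for illustration, the 'Humphrey_asis' row is an added dummy sample, and the name rock_dict is hypothetical.

```python
import pandas as pd

# Dummy data: the SiO2 / SO3 numbers are made up, not real APXS measurements.
df = pd.DataFrame({
    "Type": ["SU", "RU", "RB", "RR", "RU"],
    "Sample": ["Gusev_Soil", "Adirondack_asis", "Adirondack_brush",
               "Adirondack_RAT", "Humphrey_asis"],
    "SiO2": [46.3, 45.7, 45.9, 45.4, 46.1],
    "SO3": [5.8, 1.6, 1.5, 1.2, 1.3],
})

# Keep rocks only, then group by the part of the name before the first '_'.
rocks = df[df["Type"].str.startswith("R")]
prefix = rocks["Sample"].str.split("_").str.get(0)

# One dictionary entry per rock name, holding every value of that rock.
rock_dict = {
    f"{name} (all)": {
        "SiO2": group["SiO2"].to_numpy(),
        "SO3": group["SO3"].to_numpy(),
    }
    for name, group in rocks.groupby(prefix)
}

print(list(rock_dict))
```

Looping over rock_dict.items() and calling plt.scatter(values["SiO2"], values["SO3"], label=name) once per entry would then give a single legend entry per rock instead of one per sample.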
See also:
detail about str in pandas
pandas.Series.str.split
pandas.Series.str.get
pandas.Series.str.extract
regex in python
run help('pandas.core.strings.StringMethods') to see help offline

Add string in a certain position in column in dataframe

Basically this:
hash = "355879ACB6"
hash = hash[:4] + '-' + hash[4:]
print (hash)
3558-79ACB6
I got this part above from another stackoverflow post here
but for a DataFrame.
I am only able to successfully add strings before and after, like this:
data ['col1'] = data['col1'] + 'teststring'
If I try the solution from the link above [:amountofcharacterstocutafter] to add values at a certain position, which would be something like:
test = data[:2] + 'zz'
print (test)
It does not seem to be applicable, as the [:2] operator works differently for dataframes than it does for strings. It cuts the output after the first 2 rows.
Goal:
I want to add a ' - ' at a certain position. Let's say the input row value is 'TTTT1234', output should be 'TTTT-1234'. For every row.
You can perform the operation you presented on a string, but you have a column in a dataframe, so it's (a bit) different.
So while you can do this:
hash = "355879ACB6"
hash = hash[:4] + '-' + hash[4:]
to do the same on a dataframe you have at least 2 options.
Consider this dummy df:
  LOCATION        Hash
0      USA  355879ACB6
1      USA  455879ACB6
2      USA  388879ACB6
3      USA  800879ACB6
4    JAPAN  355870BCB6
5    JAPAN  355079ACB6
A. vectorization: the most efficient way
df['new_hash']=df['Hash'].str[:4]+'-'+df['Hash'].str[4:]
  LOCATION        Hash     new_hash
0      USA  355879ACB6  3558-79ACB6
1      USA  455879ACB6  4558-79ACB6
2      USA  388879ACB6  3888-79ACB6
3      USA  800879ACB6  8008-79ACB6
4    JAPAN  355870BCB6  3558-70BCB6
5    JAPAN  355079ACB6  3550-79ACB6
B. apply lambda: intuitive to implement but less attractive in terms of performance
df['new_hash'] = df.apply(lambda x: x['Hash'][:4]+'-'+x['Hash'][4:], axis=1)
Use pd.Series.str. For example:
import pandas as pd
df = pd.DataFrame({
"c": ["TTTT1234"]
})
df["c"].str[:4] + "-" + df["c"].str[4:] # It will output 'TTTT-1234'
pd.Series.str gives vectorized string functions.
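The slicing approach can be wrapped in a small reusable function when the position or the inserted text varies. This is just a sketch; insert_at is a hypothetical helper name, and the sample values are taken from the question.

```python
import pandas as pd


def insert_at(series, position, text):
    """Insert `text` into every string of `series` at index `position`."""
    return series.str[:position] + text + series.str[position:]


data = pd.DataFrame({"col1": ["TTTT1234", "355879ACB6"]})
data["col1_dashed"] = insert_at(data["col1"], 4, "-")
print(data["col1_dashed"].tolist())  # ['TTTT-1234', '3558-79ACB6']
```

Because it only uses the .str accessor, the operation stays vectorized over the whole column.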

Filtering CSV data using python and storing different values in array

I am trying to filter a CSV file and store the prices of the different commodities that are > 1000 in separate arrays. I am able to get the values for one commodity correctly, but the arrays for the other commodities are just duplicates of the first one.
The CSV file looks like the figure below:
CODE
import matplotlib.pyplot as plt
import csv
import pandas as pd
import numpy as np
# csv file name
filename = "CommodityPrice.csv"
# List gold price above 1000
gold_price_above_1000 = []
palladiun_price_above_1000 = []
gold_futr_price_above_1000 = []
cocoa_future_price_above_1000 = []
df = pd.read_csv(filename)
commodity = df["Commodity"]
price = df['Price']
for gold_price in price:
    if (gold_price <= 1000):
        break
    else:
        for gold in commodity:
            if ('Gold' == gold):
                gold_price_above_1000.append(gold_price)
                break
for palladiun_price in price:
    if (palladiun_price <= 1000):
        break
    else:
        for palladiun in commodity:
            if ('Palladiun' == palladiun):
                palladiun_price_above_1000.append(palladiun_price)
                break
for gold_futr_price in price:
    if (gold_futr_price <= 1000):
        break
    else:
        for gold_futr in commodity:
            if ('Gold Futr' == gold_futr):
                gold_futr_price_above_1000.append(gold_futr_price)
                break
for cocoa_future_price in price:
    if (cocoa_future_price <= 1000):
        break
    else:
        for cocoa_future in commodity:
            if ('Cocoa Future' == cocoa_future):
                cocoa_future_price_above_1000.append(cocoa_future_price)
                break
print(gold_price_above_1000)
print(palladiun_price_above_1000)
print(gold_futr_price_above_1000)
print(cocoa_future_price_above_1000)
plt.ylim(1000, 3000)
plt.plot(gold_price_above_1000)
plt.plot(palladiun_price_above_1000)
plt.plot(gold_futr_price_above_1000)
plt.plot(cocoa_future_price_above_1000)
plt.title('Commodity Price(>=1000)')
y = np.array(gold_price_above_1000)
plt.ylabel("Price")
plt.show()
print("SUCCESS")
Here is my question in detail,
Please use pandas and matplotlib to sort out the data in the csv and output and store the sorted data into the process chart. The output results are shown in the following figures.
Figure 1 The upper picture is to take all the products with Price> = 1000 in csv, mark all their prices in April and May and draw them into a linear graph. When outputting, the year in the date needs to be removed. The label name is marked and displayed. The title names of the chart, x-axis, and y- axis need to be marked. The range of the y-axis falls within 1000 ~ 3000, and the color of the line is not specified.
Figure 1 The picture below is from all the products with Price> = 1000 in csv. Mark their Change% in April and May and draw them into a dotted line graph. The dots need to be in a dot style other than '.' And 'o'. To mark, please mark the line with a line other than a solid line. When outputting, you need to remove the year from the date. You need to mark and display the label name of each line. The title names of the chart, x-axis, and y-axis must be marked. You need to add a grid line, the y-axis range falls from -15 to +15, and the color of the line is not specified.
The upper and lower two pictures in Figure 2 are changed to 1000> Price> = 500. The other conditions are basically the same as in Figure 1, except that the points and lines of the dot and line diagrams below Figure 2 need to use different styles from Figure 1.
The first and second pictures in Figure 1 must be displayed in the same window, as is the picture in Figure 2.
All of your blocks of code are doing the exact same thing. Changing the name of the loop variable doesn't change what it does.
for gold_price in price:
for palladiun_price in price:
for gold_futr_price in price:
for cocoa_future_price in price:
This is going through the exact same data. You haven't subsetted for specific commodities.
Using the break statement in that loop doesn't make sense either. It should be a pass.
Basically for every number above 1000, you iterate through your entire Commodities column and add number to the list for every time you see a specific commodity.
Read how to index and select data in pandas.
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
gold_price_above_1000 = df[(df.Commodity=='Gold') & (df.Price>1000)]
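The same boolean filter can be combined with groupby to fill one list per commodity in a single pass, instead of writing one block per commodity. This is a sketch on invented rows: only the column names Commodity and Price come from the question, the prices are hypothetical, and prices_by_commodity is a made-up name.

```python
import pandas as pd

# Hypothetical rows standing in for CommodityPrice.csv.
df = pd.DataFrame({
    "Commodity": ["Gold", "Gold", "Palladium", "Palladium", "Cocoa Future"],
    "Price": [1700.5, 1712.0, 2100.0, 950.0, 2400.0],
})

# Keep only rows above the threshold, then split them by commodity.
above_1000 = df[df["Price"] > 1000]
prices_by_commodity = {
    name: group["Price"].tolist()
    for name, group in above_1000.groupby("Commodity")
}

print(prices_by_commodity)
```

Each list can then be plotted with plt.plot(prices, label=name), so every line gets its own legend label.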

Find max halflife values relative to their temperature value in the same array

Basically I load the excel file into a pandas dataframe here:
dv = pd.read_excel('data.xlsx')
Then I clean it up and rename it to "cleaned" which is not important for this reproducible example, just mentioning for clarity:
if (selected_x.title()=="Viscosity" or selected_y.title()=="Viscosity"):
    cleaned = cleaned[cleaned.Study != "Yanqing Wang 2017"]
    cleaned = cleaned[cleaned.Study != "Thakore 2020"]
From there, I separate the cleaned dataframe into separate studies, this project is a composition of literature. I will include an example of two below:
yan = cleaned[cleaned.Study == "Yanqing Wang 2017"]
tha = cleaned[cleaned.Study == "Thakore 2020"]
Finally, I load each of the individual studies into traces, and display them in a graph. Selected y and selected x are strings, such as "Temperature (C) " and "Halflife (Min)":
trace1 = go.Scatter(y=tha[selected_y], x=tha[selected_x])
trace2 = go.Scatter(y=yan[selected_y], x=yan[selected_x])
What I need to do is, after splitting the array into individual studies, find the maximum halflife relative to each temperature (0,50,100,150,200,250,300) and compile them into separate lists, then find the max value of these lists, take the whole row and append them into the same list. I have tried to do this using stuff like:
yan50 = yanq[yanq['Temperature (C) '] == 50]
yan100 = yanq[yanq['Temperature (C) '] == 100]
yan150 = yanq[yanq['Temperature (C) '] == 150]
yan200 = yanq[yanq['Temperature (C) '] == 200]
yan250 = yanq[yanq['Temperature (C) '] == 250]
yan300 = yanq[yanq['Temperature (C) '] == 300]
To split the study into the varying degree lists. I am currently stuck where I have to find the max value in halflife column of each list and add the whole corresponding row into a new list. This is what I am trying:
yan = pd.DataFrame(columns=["Study","Gas","Surfactant","Surfactant Concentration","Additive","Additive Concentration","LiquidPhase","Quality","Pressure (Psi)","Temperature (C) ","Shear Rate (/Sec)","Halflife (Min)","Viscosity","Color"])
if (len(yan50) > 0):
    yan50.loc[yan50['Halflife (Min)'].idxmax()]
    yan50 = yan50.dropna()
    yan.append(yan50)
if (len(yan100) > 0):
    yan100.loc[yan100['Halflife (Min)'].idxmax()]
    yan100 = yan100.dropna()
    yan.append(yan100)
if (len(yan150) > 0):
    yan150.loc[yan150['Halflife (Min)'].idxmax()]
    yan150 = yan150.dropna()
    yan.append(yan150)
if (len(yan200) > 0):
    yan200.loc[yan200['Halflife (Min)'].idxmax()]
    yan200 = yan200.dropna()
    yan.append(yan200)
if (len(yan250) > 0):
    yan250.loc[yan250['Halflife (Min)'].idxmax()]
    yan250 = yan250.dropna()
    yan.append(yan250)
if (len(yan300) > 0):
    yan300.loc[yan300['Halflife (Min)'].idxmax()]
    yan300 = yan300.dropna()
    yan.append(yan300)
yan50.iloc[yan50['Halflife (Min)'].idxmax()]
The error I am getting is the individual temperature lists are empty.
I also got a bunch of Nan values for the separate temperature lists I compiled, and I am unsure if I am splitting the list correctly. I am not too strong with Pandas. Recommendations needed!
Link to CSV of data
------------Edit-------------
What I have, all the studies placed on the same temp points (50, 100, etc). I want to find the maximum value of halflife, so that only the topmost point shows. The reason I am doing this is to aid in data-visualization. Future plans beyond this topic include: connecting the max value dots with a line and comparing the trends of the separate studies halflife values.
IIUC, what you need is
df2 = df.groupby(['Study','Temperature (C) '])['Halflife (Min)'].max().reset_index(name='Max_halflife')
This will result in
               Study  Temperature (C)   Max_halflife
0       Thakore 2020                50        120.00
1       Thakore 2020               100          2.40
2       Thakore 2020               150          0.20
3  Yanqing Wang 2017                50        123.00
4  Yanqing Wang 2017               100          3.20
5  Yanqing Wang 2017               150          0.31
Then the code below should get you the graph you want.
import matplotlib.pyplot as plt
import seaborn as sns
df2 = df.groupby(['Study','Temperature (C) '])['Halflife (Min)'].max().reset_index(name='Max_halflife')
fig = plt.figure(figsize=(8, 5))
sns.scatterplot(x='Temperature (C) ', y='Max_halflife', data=df2, hue='Study')
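Since the question also asks for the whole row of each maximum rather than just the value, one option is to take idxmax per group and look those row labels up with .loc. This is a sketch on invented data: the column names follow the question, the values are made up, and it assumes no group consists only of NaN half-lives.

```python
import pandas as pd

# Made-up measurements; only the column names are taken from the question.
df = pd.DataFrame({
    "Study": ["Thakore 2020"] * 3 + ["Yanqing Wang 2017"] * 3,
    "Temperature (C) ": [50, 50, 100, 50, 100, 100],
    "Halflife (Min)": [120.0, 80.0, 2.4, 123.0, 3.2, 1.1],
    "Gas": ["N2", "CO2", "N2", "CO2", "N2", "CO2"],
})

# Row label of the largest half-life within each (study, temperature) group.
idx = df.groupby(["Study", "Temperature (C) "])["Halflife (Min)"].idxmax()

# Pull the complete rows for those labels.
max_rows = df.loc[idx].reset_index(drop=True)
print(max_rows)
```

The result keeps every original column, so it can be passed directly to the plotting call in place of df2 (using 'Halflife (Min)' as the y column).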

Pandas Dataframe: Accessing via composite index created by groupby operation

I want to calculate a group specific ratio gathered from two datasets.
The two Dataframes are read from a database with
leases = pd.read_sql_query(sql, connection)
sales = pd.read_sql_query(sql, connection)
one for real estate offered for sale, the other for rented objects.
Then I group both of them by their city and the category I'm interested in:
leasegroups = leases.groupby(['IDconjugate', "city"])
salegroups = sales.groupby(['IDconjugate', "city"])
Now I want to know the ratio between the cheapest rental object per category and city and the most expensively sold object to obtain a lower bound for possible return:
minlease = leasegroups['price'].min()
maxsale = salegroups['price'].max()
ratios = minlease*12/maxsale
I get an output like: Category - City: Ratio
But I cannot access the ratio object by city nor category. I tried creating a new dataframe with:
newframe = pd.DataFrame({"Minleases" : minlease,"Maxsales" : maxsale,"Ratios" : ratios})
newframe = newframe.loc[newframe['Ratios'].notnull()]
which gives me the correct rows, and newframe.index returns the groups.
newframe.index.names gives ['IDconjugate', 'city'] but indexing results in a KeyError. How can I make an index out of the different groups: ID0+city1, ID0+city2 etc... ?
EDIT:
The output looks like this:
                               Maxsales  Minleases    Ratios
IDconjugate city
1           argeles gazost        59500        337  0.067966
            chelles              129000        519  0.048279
            enghien-les-bains    143000        696  0.058406
            esbly                117990        495  0.050343
            foix                  58000        350  0.072414
The goal was to select the top ratios and plot them with bokeh, which takes a
dataframe object and plots a column versus an index as I understand it:
topselect = ratio.loc[ratio["Ratios"] > ratio["Ratios"].quantile(quant)]
dots = Dot(topselect, values='Ratios', label=topselect.index, tools=[hover,],
           title="{}% best minimal Lease/Sale Ratios per City and Group".format(topperc*100), width=600)
I really only needed the index as a list in the original order, so the following worked:
ids = []
cities = []
for l in topselect.index:
    ids.append(str(int(l[0])))
    cities.append(l[1])
newind = [i+"_"+j for i,j in zip(ids, cities)]
topselect.index = newind
Now the plot shows 1_city1 ... 1_city2 ... n_cityX on the x-axis. But I figure there must be some obvious way inside the pandas framework that I'm missing.
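For reference, a frame grouped on two keys carries a MultiIndex, which pandas can address with tuples, with xs on one level, or after flattening with reset_index. The sketch below is one possible way to do it, not necessarily the most idiomatic: the first two rows reuse ratios from the output above, the third row (category 2) is invented, and the variable names are illustrative.

```python
import pandas as pd

# Small stand-in for the grouped result; the (2, "foix") row is made up.
index = pd.MultiIndex.from_tuples(
    [(1, "chelles"), (1, "foix"), (2, "foix")],
    names=["IDconjugate", "city"],
)
newframe = pd.DataFrame({"Ratios": [0.048279, 0.072414, 0.061]}, index=index)

# A single (category, city) row is addressed with a tuple.
one_row = newframe.loc[(1, "foix")]

# All categories for one city: cross-section on the 'city' level.
foix = newframe.xs("foix", level="city")

# Turn both index levels back into ordinary columns.
flat = newframe.reset_index()

# Build "1_chelles"-style labels without explicit loops.
labels = newframe.index.map(lambda key: f"{int(key[0])}_{key[1]}")
print(list(labels))
```

Assigning labels back with newframe.index = labels would reproduce the flattened index built by the loop above.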
