Boxplot needs to use multiple groupby in Pandas - python

I am using pandas, Jupyter notebooks and Python.
I have the following dataset as a DataFrame:
Cars,Country,Type
1564,Australia,Stolen
200,Australia,Stolen
579,Australia,Stolen
156,Japan,Lost
900,Africa,Burnt
2000,USA,Stolen
1000,Indonesia,Stolen
900,Australia,Lost
798,Australia,Lost
128,Australia,Lost
200,Australia,Burnt
56,Australia,Burnt
348,Australia,Burnt
1246,USA,Burnt
I would like to know how I can use a box plot to answer the question "Number of cars in Australia that were affected by each type". So basically, I should have three boxplots (one for each type) showing the number of cars affected in Australia.
Please keep in mind that this is a subset of the real dataset.

You can select only the rows where the column "Country" equals "Australia" and group by the column "Type", as shown:
from io import StringIO

import pandas as pd

text_string = StringIO(
    """
Cars,Country,Type
1564,Australia,Stolen
200,Australia,Stolen
579,Australia,Stolen
156,Japan,Lost
900,Africa,Burnt
2000,USA,Stolen
1000,Indonesia,Stolen
900,Australia,Lost
798,Australia,Lost
128,Australia,Lost
200,Australia,Burnt
56,Australia,Burnt
348,Australia,Burnt
1246,USA,Burnt
"""
)
df = pd.read_csv(text_string, sep=",")
# Filter to Australian rows, then draw one box per "Type" using the "Cars" values
ax = df.loc[df['Country'] == 'Australia'].boxplot(column='Cars', by='Type')
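To see the numbers behind each box, the same selection can be summarised with groupby/describe. A minimal sketch, rebuilding the question's Australian subset inline:

```python
import pandas as pd

# Same data as in the question, reduced to the columns we need
df = pd.DataFrame({
    "Cars": [1564, 200, 579, 900, 798, 128, 200, 56, 348, 156],
    "Country": ["Australia"] * 9 + ["Japan"],
    "Type": ["Stolen"] * 3 + ["Lost"] * 3 + ["Burnt"] * 3 + ["Lost"],
})

# Keep only Australian rows, then summarise "Cars" per "Type"
aus = df.loc[df["Country"] == "Australia"]
stats = aus.groupby("Type")["Cars"].describe()
print(stats[["count", "50%"]])
```

Each row of `stats` corresponds to one box in the plot, with "50%" being the median line.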

Related

numpy where with multiple conditions linked to dataframe

I'm using numpy where with multiple conditions to assign a category based on a text string in a transaction description.
Part of the code is below:
import numpy as np

conditions = [
    df2['description'].str.contains('AEGON', na=False),
    df2['description'].str.contains('IB/PVV', na=False),
    df2['description'].str.contains('Picnic', na=False),
    df2['description'].str.contains('Jumbo', na=False),
]
values = [
    'Hypotheek',
    'Hypotheek',
    'Boodschappen',
    'Boodschappen',
]
df2['Classificatie'] = np.select(conditions, values, default='unknown')
I have many conditions, only partly shown here.
I want to create a table / dataframe instead of including every separate condition and value in the code. For instance, the following dataframe:
import pandas as pd

Conditions = {
    'Condition': ['AEGON', 'IB/PVV', 'Picnic', 'Jumbo'],
    'Value': ['Hypotheek', 'Hypotheek', 'Boodschappen', 'Boodschappen'],
}
df_conditions = pd.DataFrame(Conditions, columns=['Condition', 'Value'])
How can I adjust the condition so that str.contains looks for a text string as listed in df_conditions['Condition'], and applies the Value column to df2['Classificatie']?
The values are already a list in the variable explorer, but I can't find a way to have str.contains look for a value in a list / dataframe.
Desired outcome:
In [3]: iwantthis
Out[3]:
Description Classificatie
0 groceries Jumbo on date boodschappen
1 mortgage payment Aegon. Hypotheek
2 transfer picnic. Boodschappen
The first column is the input data frame, the second column is what I'm looking for.
Please note that my current code already allows me to create this column, but I want a more automated way using the df_conditions table.
I'm not yet really familiar with Python and I can't find anything online.
Try:
import re

df_conditions["Condition"] = df_conditions["Condition"].str.lower()
df_conditions = df_conditions.set_index("Condition")
tmp = df["Description"].str.extract(
    "(" + "|".join(re.escape(c) for c in df_conditions.index) + ")",
    flags=re.I,
)
df["Classificatie"] = tmp[0].str.lower().map(df_conditions["Value"])
print(df)
Prints:
Description Classificatie
0 groceries Jumbo on date Boodschappen
1 mortgage payment Aegon. Hypotheek
2 transfer picnic. Boodschappen
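Alternatively, if you want to keep the original np.select approach, the conditions and values lists can be built in a loop over df_conditions. A sketch assuming the column names from the question:

```python
import numpy as np
import pandas as pd

# Toy input resembling the question's desired-outcome frame
df2 = pd.DataFrame({"description": ["groceries Jumbo on date",
                                    "mortgage payment Aegon.",
                                    "transfer picnic."]})
df_conditions = pd.DataFrame({
    "Condition": ["AEGON", "IB/PVV", "Picnic", "Jumbo"],
    "Value": ["Hypotheek", "Hypotheek", "Boodschappen", "Boodschappen"],
})

# One case-insensitive str.contains condition per row of df_conditions
conditions = [df2["description"].str.contains(c, case=False, na=False)
              for c in df_conditions["Condition"]]
df2["Classificatie"] = np.select(conditions, df_conditions["Value"].tolist(),
                                 default="unknown")
print(df2)
```

This keeps the rules in data rather than in code, so adding a category is just a new row in df_conditions.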

Pandas/Geopandas Merge with a mask selection

I usually work with ArcPy but am trying to learn more pandas/geopandas. I have a mask applied to a CSV table and a shapefile that I want to merge together in order to find matches between the two based on a specific field.
However, when I try to merge them, I get the error "The truth value of a DataFrame is ambiguous." How do I merge a masked dataframe? I've included the segment of code below that creates the mask (using two date variables and a date field) and the merge, which uses the Location fields (named differently in each dataframe).
What do I need to do to make the masked dataframe work in the merge?
mask = (svc_df['createdate'] < curdate) & (svc_df['createdate'] >= backdate)
print(svc_df.loc[mask])
# Detect the sub-dataframe and then assign to a new dataframe
sel_df = svc_df.loc[mask]
#Create a geodf from alabama services
al_gdf = geopandas.read_file(alSvc_shp)
al_merge = al_gdf.merge(al_gdf, sel_df, left_on="Location", right_on="sketch_LOC")
I have synthesized an MWE from your code, generating a data frame and a geo data frame.
You have an error here:
al_merge = al_gdf.merge(al_gdf, sel_df, left_on="Location", right_on="sketch_LOC")
You used DataFrame.merge(), not pd.merge(), so only one data frame should be passed as a parameter.
Full working example below:
import pandas as pd
import numpy as np
import geopandas as gpd
# synthesize
svc_df = pd.DataFrame(
    {
        "createdate": pd.date_range("1-mar-2022", periods=30),
        "sketch_LOC": np.random.choice(["CHN", "USA", "IND", "JPN", "DEU"], 30),
    }
)
curdate = pd.to_datetime("today")
backdate = curdate - pd.Timedelta("5 days")
mask = (svc_df["createdate"] < curdate) & (svc_df["createdate"] >= backdate)
print(svc_df.loc[mask])
# Detect the sub-dataframe and then assign to a new dataframe
sel_df = svc_df.loc[mask]
# Create a geodf from alabama services
# al_gdf = geopandas.read_file(alSvc_shp)
# synthesize
al_gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres")).assign(
    Location=lambda d: d["iso_a3"]
)
al_merge = al_gdf.merge(sel_df, left_on="Location", right_on="sketch_LOC")
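The difference between the method and the module-level function can be illustrated with plain pandas on two hypothetical toy frames:

```python
import pandas as pd

# Hypothetical stand-ins for al_gdf and sel_df
left = pd.DataFrame({"Location": ["USA", "IND"], "pop": [330, 1400]})
right = pd.DataFrame({"sketch_LOC": ["IND", "USA"],
                      "createdate": ["2022-03-01", "2022-03-02"]})

# DataFrame.merge: the calling frame is the left side, so pass only the right frame
m1 = left.merge(right, left_on="Location", right_on="sketch_LOC")
# pd.merge: both frames are passed explicitly
m2 = pd.merge(left, right, left_on="Location", right_on="sketch_LOC")
assert m1.equals(m2)
```

Passing a second frame as a positional argument to DataFrame.merge() is what triggers the "truth value of a DataFrame is ambiguous" error, because it lands in the `how` parameter slot.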

Using pandas to categorise text data in one column and have corresponding categories stated in the next column

My Excel spreadsheet currently looks like this after inserting the new column "Expense" using the code:
import pandas as pd
df = pd.read_csv(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.csv")
df.insert(2, "Expense", " ")  # insert modifies df in place and returns None
df.to_excel(r"C:\Users\Mihir Patel\Project\Excel & CSV Stuff\June '20 CSVData.xlsx", index=None, header=True)
So because the Description column contains the word "DRAKES", I can categorise that expense as "Personal", which should appear in the Expense column next to it.
Similarly, the next one down contains "Optus", which is a mobile-related expense, so the word "Phone" should appear in the Expense column.
I have tried searching on Google and YouTube but I just can't seem to find an example for something like this.
Thanks for your help.
You can define a function which holds all these rules and simply apply it. For example:
def rules(x):
    if "DRAKES" in x["Description"]:
        return "Personal"
    if "OPTUS" in x["Description"]:
        return "Phone"
    return ""

df["Expense"] = df.apply(rules, axis=1)
I have solved my problem by using a while loop. I tried to use the method in quest's answer but I most likely didn't use it properly and kept getting an error. So I used a while loop to search each individual cell in the "Description" column and categorise it in the same row of the "Expenses" column.
My solution using a while loop:
import pandas as pd

df = pd.read_csv("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.csv")
df.insert(2, "Expenses", "")

description = "Description"
expense = "Expenses"
transfer = "Transfer"

i = -1  # Because I wanted Python to start searching from index 0
while i < 296:  # 296 is the row where my data ends
    i = i + 1
    if "Drakes".upper() in df.loc[i, description]:
        df.loc[i, expense] = "Personal"
    if "Optus".upper() in df.loc[i, description]:
        df.loc[i, expense] = "Phone"

df.sort_values(by=["Expenses"], inplace=True)
df.to_excel("C:\\Users\\Mihir Patel\\PycharmProjects\\pythonProject\\June '20 CSVData.xlsx", index=False)
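For reference, a vectorized alternative to the while loop uses str.contains with np.select, which avoids hard-coding the final row number. A sketch assuming the same column names, on made-up descriptions:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the CSV data
df = pd.DataFrame({"Description": ["DRAKES SUPERMARKET", "OPTUS BILL", "SALARY"]})

# One boolean condition per keyword, evaluated over the whole column at once
conditions = [
    df["Description"].str.contains("DRAKES", na=False),
    df["Description"].str.contains("OPTUS", na=False),
]
choices = ["Personal", "Phone"]
df["Expenses"] = np.select(conditions, choices, default="")
print(df)
```

This scales to any number of rows without tracking an index variable.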

Iterating through values of one column to get descriptive statistics for another column

I'm trying to get descriptive statistics for a column of data (the tb column which is a list of numbers) for every individual (i.e., each ID). Normally, I'd use a for i in range(len(list)) statement but since the ID is not a number I'm unsure of how to do that. Any tips would be helpful! The code included below gets me descriptive statistics for the entire tb column, instead of for tb data for each individual in the ID list.
df = pd.DataFrame(pd.read_csv("SurgeryTpref.csv")) #importing data
df.columns = ['date', 'time', 'tb', 'ID','before_after'] #column headers
df.to_numpy()
import pandas as pd
# read the data in with
df = pd.read_clipboard(sep=',')
# data
,date,time,tb,ID,before_after
0,6/29/20,4:15:33 PM,37.1,SR10,after
1,6/29/20,4:17:33 PM,38.1,SR10,after
2,6/29/20,4:19:33 PM,37.8,SR10,after
3,6/29/20,4:21:33 PM,37.5,SR10,after
4,6/29/20,4:23:33 PM,38.1,SR10,after
5,6/29/20,4:25:33 PM,38.5,SR10,after
6,6/29/20,4:27:33 PM,38.6,SR10,after
7,6/29/20,4:29:33 PM,37.6,SR10,after
8,6/29/20,4:31:33 PM,35.5,SR10,after
9,6/29/20,4:33:33 PM,34.7,SR10,after
summary = []
for individual in (ID):
    vals = df['tb'].describe()
    summary.append(vals)
print(summary)
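The loop above describes the whole tb column on every pass; pandas can produce per-ID statistics directly with groupby. A sketch on a small made-up frame:

```python
import pandas as pd

# Toy data shaped like the question's columns
df = pd.DataFrame({
    "tb": [37.1, 38.1, 37.8, 35.0, 36.0],
    "ID": ["SR10", "SR10", "SR10", "SR11", "SR11"],
})

# One row of descriptive statistics per individual
summary = df.groupby("ID")["tb"].describe()
print(summary)
```

No explicit iteration over IDs is needed, and string-valued IDs are handled naturally as group keys.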

Pandas: count number of times every value in one column appears in another column

I want to count the number of times a value in the Child column appears in the Parent column, then display this count in a new column named "child count". See the preview of df below.
I have this done via VBA (COUNTIFS) but now need dynamic visualization and animated display with data fed from a dir. So I resorted to Python and Pandas and tried below code after searching and reading answers like: Countif in pandas with multiple conditions | Determine if value is in pandas column | Iterate over rows in Pandas df | many others...
but still can't get the expected preview as illustrated in image below.
Any help will be very much appreciated. Thanks in advance.
#import libraries
import pandas as pd
import numpy as np
import os
#get datasets
path_dataset = r'D:\Auto'
df_ns = pd.read_csv(os.path.join(path_dataset, 'Scripts', 'data.csv'), index_col = False, encoding = 'ISO-8859-1', engine = 'python')
#preview dataframe
df_ns
#tried
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
#preview output
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
preview dataframe
preview output
expected output
[Edited] My data
Child = ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04', 'Tkt05', 'Tkt06', 'Tkt07', 'Tkt08', 'Tkt09', 'Tkt10']
Parent = [' ', ' ', 'Tkt03', ' ', ' ', 'Tkt03', ' ', 'Tkt03', ' ', ' ', 'Tkt06', ' ', ' ', ' ']
Site_Name = ['Yaounde', 'Douala', 'Bamenda', 'Bafoussam', 'Kumba', 'Garoua', 'Maroua', 'Ngaoundere', 'Buea', 'Ebolowa']
I created a lookalike of your df.
Before
Try this code
df['Count'] = [len(df[df['parent'].str.contains(value)]) for index, value in enumerate(df['child'])]

# breaking it down as line-by-line code
counts = []
for index, value in enumerate(df['child']):
    found = df[df['parent'].str.contains(value)]
    counts.append(len(found))
df['Count'] = counts
After
Hope this works for you.
Since I don't have access to your data, I cannot check the code I am giving you. You may run into problems with NaN values on this line, but you can give it a try:
df_ns['child_count'] = df_ns['Parent'].groupby(df_ns['Child']).value_counts()
I give a name to the new column and directly assign values to it through the groupby -> value_counts functions.
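If the Parent column holds exact Child values rather than substrings, the count can also be computed with value_counts plus map. A sketch on made-up data:

```python
import pandas as pd

# Toy frame mirroring the question's Child/Parent layout
df = pd.DataFrame({
    "Child": ["Tkt01", "Tkt02", "Tkt03", "Tkt06"],
    "Parent": [" ", "Tkt03", "Tkt03", " "],
})

# How often each value occurs in Parent, looked up per Child row
counts = df["Parent"].value_counts()
df["child count"] = df["Child"].map(counts).fillna(0).astype(int)
print(df)
```

Children that never appear as a parent map to NaN, which fillna(0) converts to a zero count.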
