Count Rows in Dictionary of Dataframes

I have a dictionary of dataframes and I am trying to count the rows in each dataframe. For the real data, my code counts just over ten thousand rows for a dataframe that has only a few rows.
I have tried to reproduce the error using dummy data. Unfortunately, the code works fine with the dummy data!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Dataframe
Df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
# Map
Ma = Df.groupby('D')
# Dictionary of Dataframes
Di = {}
for name, group in Ma:
    Di[str(name)] = group
# Count the Rows in each Dataframe
Li = []
for k in Di:
    Count = Di[k].shape[0]
    Li.append([Count])
# Flatten
Li_1 = []
for sublist in Li:
    for item in sublist:
        Li_1.append(item)
# Histogram
plt.hist(Li_1, bins=10)
plt.xlabel("Rows / Dataframe")
plt.ylabel("Frequency")
fig = plt.gcf()

To get the number of rows corresponding to each category in 'D', you can simply use .size when you do your groupby:
Df.groupby('D').size()
pandas also allows you to directly plot graphs, so your code can be reduced to:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
Df.groupby('D').size().plot.hist()
plt.xlabel("Rows / Dataframe")
plt.ylabel("Frequency")
fig = plt.gcf()
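If the dataframes are already stored in a dictionary, as in the question, the row counts can also be read off directly without building and flattening nested lists. The following is only a minimal sketch on the question's dummy data; the name row_counts is made up for illustration.

```python
import numpy as np
import pandas as pd

# Dummy data shaped like the question's example
Df = pd.DataFrame(np.random.randint(0, 10, size=(100, 4)), columns=list('ABCD'))

# Dictionary of dataframes, one per distinct value of 'D'
Di = {str(name): group for name, group in Df.groupby('D')}

# One row count per dataframe, keyed by the dictionary key
# (row_counts is a hypothetical name, not from the original post)
row_counts = pd.Series({key: len(frame) for key, frame in Di.items()})

print(row_counts)
# Every row belongs to exactly one group, so the counts add up to len(Df)
print(row_counts.sum() == len(Df))
```

Printing the counts next to their keys may make it easier to spot which dataframe in the real data reports an unexpected length.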

Assuming the data in column D is a categorical variable, you can get the count for each category using seaborn's countplot.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Dataframe
df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
# easy count plot in sns
sns.countplot(x='D',data=df)
plt.xlabel("category")
plt.ylabel("frequency")
If you are looking for a distribution plot rather than a categorical count plot, you can use the following code instead.
# for distribution plot
sns.histplot(df['D'], kde=False, bins=10)  # histplot replaces the deprecated distplot
plt.xlabel("Spread")
plt.ylabel("frequency")
If you want a distribution plot of the group sizes after the groupby (which does not make much sense to me), you can use the following:
# for distribution plot after group by
sns.histplot(df.groupby('D').size(), kde=False, bins=10)  # histplot replaces the deprecated distplot
plt.xlabel("Spread")
plt.ylabel("frequency")
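For a quick numeric check without any plotting, the per-category counts that a count plot draws can also be computed with value_counts. The sketch below assumes the same dummy data as above and simply confirms that the result matches groupby('D').size(); the names counts and sizes are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 4)), columns=list('ABCD'))

# Rows per category, ordered by category value rather than by frequency
counts = df['D'].value_counts().sort_index()

# The same numbers obtained through groupby
sizes = df.groupby('D').size()

print(counts.tolist() == sizes.tolist())
```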

Related

2 line plot using seaborn

I want to plot this array. I am using seaborn to do that. I used
import seaborn as sns
sns.set_style('whitegrid')
sns.kdeplot(data= score_for_modelA[:,0])
But this only plots column 1. My scores are in columns 1 and 2, and I want both of them plotted in the same graph.
The sample data is like this:
array([[0.67, 0.33], [0.45, 0.55], ......, [0.81, 0.19]])

You can try putting them into a data frame first, with the proper column names, for example:
import seaborn as sns
import numpy as np
import pandas as pd
# create sample dataframe in wide format
score_for_modelA = np.random.normal(0, 1, (50, 2))
df = pd.DataFrame(score_for_modelA, columns=['col1', 'col2'])
# use melt to convert the dataframe to a long form
dfm = df.melt()
# plot the long form dataframe
sns.kdeplot(data=dfm, hue="variable", x="value")
As pointed out by @JohanC, if you want all of the columns:
sns.kdeplot(data=df)
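To see what melt actually does to the wide dataframe before plotting, it can help to inspect the result. This is a small sketch on the same random sample data; with no arguments, melt stacks every column and names the two output columns 'variable' and 'value'.

```python
import numpy as np
import pandas as pd

# Sample wide-format data: 50 rows, one column per score
score_for_modelA = np.random.normal(0, 1, (50, 2))
df = pd.DataFrame(score_for_modelA, columns=['col1', 'col2'])

# Long format: one row per (original column, value) pair
dfm = df.melt()

print(dfm.shape)                        # (100, 2)
print(dfm.columns.tolist())             # ['variable', 'value']
print(dfm['variable'].unique().tolist())
```

The 'variable' column is what hue="variable" uses to draw one density curve per original column.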

Box and whisker plot on multiple columns

I am trying to make a box-and-whisker plot from my dataset, which has a Banner column and one column per region. (The screenshots of the dataset and of the chart I'm trying to make are not available here.)
My current lines of code are below -
import seaborn as sns
import matplotlib.pyplot as plt
d = df3.boxplot(column = ['Northern California','New York','Kansas','Texas'], by = 'Banner')
d
Thank you

I've recreated a dummy version of your dataset:
import numpy as np
import pandas as pd
dictionary = {'Banner':['Type1']*10+['Type2']*10,
              'Northen_californina':np.random.rand(20),
              'Texas':np.random.rand(20)}
df = pd.DataFrame(dictionary)
What you need is to melt (unpivot) your dataframe in order to have the geographical zone stored in a column rather than as column names. You can use the pandas.melt method and list all the columns you want in your boxplot in the value_vars argument.
With my dummy dataset you can do this:
df = pd.melt(df,id_vars=['Banner'],value_vars=['Northen_californina','Texas'],
             var_name='zone', value_name='amount')
Now you can apply a boxplot using the hue argument:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(9,9)) #for a bigger image
sns.boxplot(x="Banner", y="amount", hue="zone", data=df, palette="Set1")
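As a numeric cross-check of what the boxplot shows, the melted dataframe can be summarised per Banner and zone. This is only a sketch on the same kind of dummy data; the names wide, long_df and medians are made up for illustration.

```python
import numpy as np
import pandas as pd

# Dummy wide-format data, mirroring the answer's example
wide = pd.DataFrame({'Banner': ['Type1'] * 10 + ['Type2'] * 10,
                     'Northen_californina': np.random.rand(20),
                     'Texas': np.random.rand(20)})

# Unpivot: each (Banner, zone) observation becomes its own row
long_df = pd.melt(wide, id_vars=['Banner'],
                  value_vars=['Northen_californina', 'Texas'],
                  var_name='zone', value_name='amount')

# One median per box in the plot (2 banners x 2 zones)
medians = long_df.groupby(['Banner', 'zone'])['amount'].median()

print(long_df.shape)  # (40, 3)
print(medians)
```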

Plotting 2 columns of a csv with matplotlib error

I am trying to make a simple bar graph out of a 2 column CSV file. One column is the x axis names, the other column is the actual data which will be used for the bars. The CSV looks like this:
count,team
21,group1
15,group2
63,group3
22,group4
42,group5
72,group6
21,group7
23,group8
24,group9
31,group10
32,group11
I am using this code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv("sampleData.csv",sep=",").set_index('count')
d = dict(zip(df.index,df.values.tolist()))
df.plot.bar(x = 'count', y = 'team')
print(d)
However, I get the error
KeyError: 'count' from this line:
df.plot.bar(x = 'count', y = 'team')
I don't understand how there is an error for something that exists.

When you set count as the index, it is no longer a column, so only a single column, team, is left in your DataFrame and x = 'count' cannot be found. Don't set count as the index, and swap the x and y values when plotting the bar chart:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv("sampleData.csv", sep=",")
df.plot.bar(x = 'team', y = 'count')
Matplotlib solution:
plt.bar(df['team'], df['count'])
plt.xticks(rotation=45) # Just rotating for better visualization
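The cause of the KeyError can be reproduced without any plotting. The sketch below reads a shortened copy of the CSV from an in-memory string (an assumption made so the example is self-contained) and shows that 'count' disappears from the columns once it becomes the index.

```python
import io
import pandas as pd

# First rows of the question's CSV, inlined so no file is needed
csv_text = "count,team\n21,group1\n15,group2\n63,group3\n"

# As in the question: 'count' is moved into the index
indexed = pd.read_csv(io.StringIO(csv_text)).set_index('count')
print(indexed.columns.tolist())        # ['team']
print('count' in indexed.columns)      # False, hence the KeyError

# Without set_index both names are available as columns
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())          # ['count', 'team']
```

If the index has already been set elsewhere, calling reset_index() on the dataframe turns it back into a regular column.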

plot graph from python dataframe

I want to convert that dataframe into this dataframe and plot a matplotlib graph with the date along the x axis. (The images of the original and the changed dataframe are not available here.)

Use df.T.plot(kind='bar'):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('./housing_price_index_2010-11_100.csv', index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
df.T.plot(kind='bar')
plt.show()
You can also assign the transpose to a new variable and plot that (as you asked in the comment):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('./housing_price_index_2010-11_100.csv', index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
df_transposed = df.T
df_transposed.plot(kind='bar')
plt.show()
Both produce the same plot.
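Since the CSV itself is not shown, here is a small sketch of what the transpose does, using a made-up stand-in dataframe. The region names and values are purely hypothetical, and it is an assumption that the original file has dates as columns.

```python
import pandas as pd

# Hypothetical stand-in for the housing price CSV:
# one row per region, one column per date
df = pd.DataFrame({'2010-03': [100.0, 100.0],
                   '2010-06': [101.2, 99.5]},
                  index=['region_a', 'region_b'])

# Transposing swaps rows and columns, so dates become the index
df_transposed = df.T

print(df_transposed.index.tolist())    # ['2010-03', '2010-06']
print(df_transposed.columns.tolist())  # ['region_a', 'region_b']
```

Because DataFrame.plot puts the index on the x axis, plotting the transposed frame places the dates along the x axis with one bar series per region.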

By group, plot highest quantile data vs lowest, and capture stats

I wish to group a dataset by "assay", then compare intensities for small cells versus large cells. The problem I have is that in writing my code I only understand how to group the top and bottom cellArea quantiles of the entire dataFrame, rather than for each individual assay ('wt' and 'cnt').
As a final point, I would like to compare the mean values between the intensities of the two groups for each assay type...
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = DataFrame({'assay':['cnt']*10+['wt']*10,
                'image':['001']*10+['002']*5+['001']*5,
                'roi':['1']*5+['2']*5+['3']*5+['1']*5,
                'cellArea':[99,90,50,2,30,65,95,30,56,5,33,18,98,76,56,72,12,5,47,89],
                'intensity':[88,34,1,50,2,67,88,77,73,3,2,67,37,34,12,45,23,82,12,1]},
               columns=['assay','image','roi','cellArea','intensity'])
df.loc[(df['cellArea'] < df['cellArea'].quantile(.20)),'group'] = 'Small_CellArea'
df.loc[(df['cellArea'] > df['cellArea'].quantile(.80)),'group'] = 'Large_CellArea'
df = df.reset_index(drop=True)
sns.violinplot(data=df,y='intensity',x='assay',hue='group',capsize=1,ci=95,palette="Set3",inner='quartile',split=True, cut=0)
plt.ylim(-20,105)
plt.legend(loc='center', bbox_to_anchor=(0.5, 0.08), ncol=3, frameon=True, fancybox=True, shadow=True, fontsize=12)

Calculate the low and high quantiles per group, merge them back into the original data frame, and then assign the group variable as small or large by comparing each row against its own assay's thresholds:
import pandas as pd
quantileLow = df.groupby('assay').cellArea.quantile(0.2).reset_index()
quantileHigh = df.groupby('assay').cellArea.quantile(0.8).reset_index()
df = pd.merge(df, pd.merge(quantileLow, quantileHigh, on = "assay"), on = "assay")
df.loc[df['cellArea'] < df.cellArea_x,'group'] = 'Small_CellArea'
df.loc[df['cellArea'] > df.cellArea_y,'group'] = 'Large_CellArea'
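An alternative that avoids the two merges is groupby(...).transform, which broadcasts each assay's quantile back onto its rows. The sketch below uses the question's cellArea and intensity values and also computes the per-assay mean intensities the question asks about; it is one possible approach, not the answer's method, and the names low, high and mean_intensity are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'assay': ['cnt'] * 10 + ['wt'] * 10,
    'cellArea': [99, 90, 50, 2, 30, 65, 95, 30, 56, 5,
                 33, 18, 98, 76, 56, 72, 12, 5, 47, 89],
    'intensity': [88, 34, 1, 50, 2, 67, 88, 77, 73, 3,
                  2, 67, 37, 34, 12, 45, 23, 82, 12, 1],
})

# Per-assay 20% and 80% quantiles, aligned row by row with df
area_by_assay = df.groupby('assay')['cellArea']
low = area_by_assay.transform(lambda s: s.quantile(0.2))
high = area_by_assay.transform(lambda s: s.quantile(0.8))

# Rows between the two thresholds keep a missing group label
df.loc[df['cellArea'] < low, 'group'] = 'Small_CellArea'
df.loc[df['cellArea'] > high, 'group'] = 'Large_CellArea'

# Mean intensity per assay and size group (unlabelled rows are dropped)
mean_intensity = (df.groupby(['assay', 'group'])['intensity']
                    .mean()
                    .unstack('group'))
print(mean_intensity)
```

Each row of mean_intensity is one assay, so the small-cell and large-cell means can be compared side by side.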
