Plotting top 10 Values in Big Data - python

I need help plotting some categorical and numerical values in Python. The code is given below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('train_feature_store.csv')
df.info()
df.head()
df.columns
plt.figure(figsize=(20,6))
sns.countplot(x='Store', data=df)
plt.show()
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
However, the dataset is so large that I can't produce a meaningful plot from it. Basically, I just want to take the top 5 or top 10 values and plot them, as shown below:
To do this, I'm trying to put the result of the code below into a DataFrame and plot it, but I haven't been able to. Can anyone help me with this?
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
Below is a link to a sample dataset. It is only representative: the original dataset on which I'm doing the EDA has around 3 thousand unique stores and 60 thousand rows. Please help! Thanks!
https://drive.google.com/drive/folders/1PdXaKXKiQXX0wrHYT3ZABjfT3QLIYzQ0?usp=sharing

You were pretty close.
import pandas as pd
import seaborn as sns
df = pd.read_csv('train_feature_store.csv')
sns.set(rc={'figure.figsize':(16,9)})
g = df.groupby('Store', as_index=False)['Size'].sum().sort_values(by='Size', ascending=False).head(10)
sns.barplot(data=g, x='Store', y='Size', hue='Store', dodge=False).set(xticklabels=[]);
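As an alternative sketch using pandas alone, `nlargest` can stand in for the sort-then-head step. The small DataFrame below is made-up illustration data, not the real `train_feature_store.csv`, and `top_n` is a hypothetical parameter name.

```python
import pandas as pd

# Toy stand-in for train_feature_store.csv (values are made up for illustration).
df = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 4, 4, 5],
    "Size": [100, 100, 250, 250, 80, 300, 300, 40],
})

top_n = 3  # hypothetical; use 5 or 10 on the real data

# Sum Size per store, then keep only the top_n largest totals (already sorted).
top = df.groupby("Store")["Size"].sum().nlargest(top_n)
print(top)

# Bar chart of just those stores; the x axis keeps the descending order.
ax = top.plot.bar(title=f"Top {top_n} stores by total Size")
ax.set_ylabel("Total Size")
```

Because only `top_n` bars are drawn, the plot stays readable however many stores the full dataset contains.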

First of all, looking at the data, it seems to cover locations from Scotland to Kolkata.
Categorize the data by geography first, and then visualize it.
Regards
Maitryee

Related

Plotting complex graph in pandas

I have the following dataset
ids count
1 2000210
2 -23123
3 100
4 500
5 102300120
...
1 million 123213
I want a graph with groups of ids (all ids are unique) on the x axis and count on the y axis, as a distribution chart that looks like the following:
How can I achieve this with a pandas DataFrame in Python?
I tried different approaches, but I only get a basic plot, nothing as complex as the drawing.
What I tried:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
df.plot(x="range", y="count");
But the plots don't make any sense. I am also new to plotting in pandas. I searched the internet for a long time for charts like this and could really use some help with such graphs.
From what I understood from your question and comments, here is what you can do:
1) Import the libraries and set the default theme:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
2) Create your dataframe:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
3) Plot your data
3.1) Simple take using only the seaborn library:
sns.kdeplot(data=df, x="count", weights="range")
Output:
3.2) More complex take using seaborn and matplotlib libraries:
sns.histplot(x=df["count"], weights=df["range"], discrete=True,
color='darkblue', edgecolor='black',
kde=True, kde_kws={'cut': 2}, line_kws={'linewidth': 4})
plt.ylabel("range")
plt.show()
Output:
Personal note: please make sure to check all the solutions; if they are not enough, leave a comment and we will work together to find you a solution.
For a distribution plot of ids, you can use:
import numpy as np
import pandas as pd
np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])
df['ids'].plot(kind='kde')
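A kernel density estimate over a million rows may take a while to compute. As a rough sketch, assuming an approximate view of the distribution is acceptable, a density histogram of a random sample gives a similar picture much faster. The sample size and bin count below are arbitrary choices, not values from the thread.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])

# Work on a random sample instead of all rows (10,000 is an arbitrary choice).
sample = df["ids"].sample(n=10000, random_state=0)

# density=True normalises the histogram so the bars integrate to 1.
density, edges = np.histogram(sample, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2
area = float(np.sum(density * np.diff(edges)))

# Line plot of the estimated density against the bin centers.
ax = pd.Series(density, index=centers).plot(title="Approximate density of ids")
ax.set_xlabel("ids")
```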

How to display X axis from Pandas Dataframe Object to Matplotlib barchart

I have created a pandas dataframe object from a CSV file being read into a dataframe. The csv is very short and includes the following data:
Group Title, Hosts ,Soccer 5, Soccer 4, Soccer 3 , Soccer 2, Soccer 1, Soccer X ,Soccer Y, Soccer Total
Units,11,1,3,4,4,5,
[1 rows x 8 columns]
I have successfully displayed the data on a bar chart; however, I want the x axis to be labelled with each group title (Hosts, Soccer 5, Soccer 4 and so on) and plotted in alignment with the data.
Please see a picture of my current graph below to get a better understanding.
I know I can do this manually, but I want the labels to be read from the CSV file, i.e. the dataframe object. I have tried different methods to do this, such as adding the following code:
dataframe.plot.bar(dataframe['Group Title'], title="Soccer", ylabel="Quantity", xlabel="Devices")
My full code is below
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
dataframe = pd.read_csv(r'C:\Scripts\custom.csv', delimiter=",")
print(dataframe)
dataframe.plot()
#plt.rcParams['figure.figsize'] = (15,8)
matplotlib.style.use('ggplot')
dataframe.plot.bar(title="Devices", ylabel="Quantity", xlabel="Devices")
#Show Graphs
plt.show()
Any help or guidance will be appreciated.
Thank you
You can use seaborn's barplot:
import seaborn as sns
import matplotlib.pyplot as plt
sns.barplot(data=df)
plt.xticks(rotation=45)
Alternatively, with pandas only:
df.set_index('Group Title').T.plot.bar()
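To see why the transpose helps, here is a minimal sketch on a shortened, made-up version of the CSV, read from a string instead of `custom.csv`. The header in the question has stray spaces around some names, so this sketch also strips them; that step is an assumption about the real file.

```python
import io
import pandas as pd

# Shortened, made-up version of the CSV from the question.
csv_text = """Group Title, Hosts ,Soccer 5, Soccer 4
Units,11,1,3
"""

df = pd.read_csv(io.StringIO(csv_text))
# Remove the stray spaces around the header names.
df.columns = df.columns.str.strip()

# After the transpose, each former column header becomes a row label,
# which pandas uses as the tick label for that bar.
wide = df.set_index("Group Title").T
print(wide)

ax = wide.plot.bar(legend=False, title="Soccer")
ax.set_ylabel("Quantity")
```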

Difficulty choosing the right plot type in Python for a large number of categories

I have a data frame with 3 columns: Language, Total Value and Percentage. I am not sure which kind of plot to use in Python for the best visualization.
Below is the data:
import pandas as pd
data={'Language':['Haitian,Creole','Dutch','French','English','Xhosa','Afrikaans','Lati','Galicia','Quechua','Danish','Western,Frisia','Xhosa,French','French,Xhosa','Spanish','Norwegian,Nynorsk','Norwegia','Germa','Indonesia','Interlingua','Romania','French,English','Interlingue','Czech','Scots','Uzbek','Manx','Luxembourgish','Malagasy','Irish','Slovak','Inupiaq','Morisye','English,French','Finnish','Dutch,Afrikaans','Afar','Corsica','Portuguese','Dutch,English','Sundanese','Kinyarwanda','Malay','Volapük','Afrikaans,Dutch','Wolof','Basque','Estonia','Italia','Lithuania','Scottish,Gaelic','Hungaria','Breto','Kalaallisut','Welsh','Zhuang','Lingala','Occita','Maori','Khasi','Maltese','Seselwa,Creole,French','Vietnamese','Tagalog','Fijia','zzp','Romansh','Bislama','Polish','Swedish','Xhosa,English','English,Dutch','Catala','Hmong','Turkme','Somali','Nyanja','Turkish','Oromo','Ganda','Tswana','Javanese','Southern,Sotho','Samoa','Guarani','Aymara','Naur','Waray','Icelandic','Rundi','Latvia','Shona','Klingo','Tonga','Cebuano','Igbo','Aka','French,Dutch','Hawaiia','Esperanto','Albania','Yoruba','Swahili','Breton,French','Dutch,Danish','Serbia'],'Total_Value':['180455','86394','40609','18355','17882','2508','483','362','259','258','247','209','172','162','156','139','130','71','70','64','45','39','38','33','33','30','29','27','26','24','22','21','20','17','16','14','14','13','13','13','12','11','11','10','9','9','9','8','8','8','7','7','6','6','6','6','6','6','6','5','5','5','5','5','4','4','4','4','4','4','3','3','3','3','3','3','2','2','2','2','2','2','2','2','2','2','2','2','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1'],'Percentage':['0.515799403','0.246942305','0.116073802','0.052464592','0.051112604','0.007168684','0.001380572','0.001034714','0.000740307','0.000737448','0.000706007','0.00059739','0.000491632','0.000463049','0.000445899','0.000397307','0.000371583','0.000202941','0.000200083','0.000182933','0.000128625','0.000111475','0.000108616','
0.0000943','0.0000943','0.0000857','0.0000829','0.0000772','0.0000743','0.0000686','0.0000629','0.00006','0.0000572','0.0000486','0.0000457','0.00004','0.00004','0.0000372','0.0000372','0.0000372','0.0000343','0.0000314','0.0000314','0.0000286','0.0000257','0.0000257','0.0000257','0.0000229','0.0000229','0.0000229','0.00002','0.00002','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000143','0.0000143','0.0000143','0.0000143','0.0000143','0.0000114','0.0000114','0.0000114','0.0000114','0.0000114','0.0000114','0.00000857','0.00000857','0.00000857','0.00000857','0.00000857','0.00000857','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286']}
df = pd.DataFrame(data)
I don't know the best way to visualize these three attributes using matplotlib, seaborn or plotly.
The Language column has 106 categories, each with a corresponding total value and percentage.
Please help me find a good, interpretable visualization.
I tried the code below, but I could see only 52 languages on the x axis:
import chart_studio.plotly as py
import plotly.graph_objects as go
fig = go.Figure(data=go.Heatmap(
z=[code_lang['percentage']],
x=code_lang['Language'],
y=code_lang['Total Value'],
hoverongaps = False))
fig.show()
It would be helpful to know if there is a better option.
Here is a way to show the data as a wordcloud.
Some remarks:
the original Total_Value and Percentage columns are text strings; they need to be converted to numeric
the Total_Value and Percentage columns have equivalent information: only one of the two needs to be shown
a lot of the percentages are extremely small, so they become invisible in most types of visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
# data=...
df = pd.DataFrame(data)
df.Percentage = df.Percentage.astype(float)
df.Total_Value = df.Total_Value.astype(int)
word_dict = {}
for row in df.itertuples(index=False):
    word_dict[row.Language] = row.Percentage
wordcloud = WordCloud(background_color="white", width=1200, height=1000
).generate_from_frequencies(word_dict)
plt.axis('off')
plt.imshow(wordcloud)
plt.show()
In order to have the large values not overwhelm the smaller, the percentages could be brought closer together, e.g. using word_dict[row.Language] = row.Percentage ** .2.
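The effect of that exponent can be checked numerically. The sketch below uses four rows taken from the question's data; the shares are recomputed over this subset only, so they differ from the original Percentage column.

```python
import pandas as pd

# Four rows from the question's data; values arrive as text strings.
data = {
    "Language": ["Haitian,Creole", "Dutch", "Danish", "Serbia"],
    "Total_Value": ["180455", "86394", "258", "1"],
}
df = pd.DataFrame(data)

# Convert the text column to numbers before doing any arithmetic.
df["Total_Value"] = pd.to_numeric(df["Total_Value"])

# Share of each language within this subset.
share = df["Total_Value"] / df["Total_Value"].sum()
# Raising to a power below 1 pulls large and small values closer together.
compressed = share ** 0.2

ratio_before = share.max() / share.min()
ratio_after = compressed.max() / compressed.min()
print(f"largest/smallest before: {ratio_before:.0f}, after: {ratio_after:.2f}")
```

The gap between the largest and smallest value shrinks from a factor of about 180,000 to a factor of roughly 11, which is why the small languages become legible in the word cloud.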

Problem plotting single and double column data with a boxplot

I am trying to plot columns of data from a .csv file in a boxplot/violin plot using matplotlib.pyplot.
When the dataframe [df] is set to one column of data, plotting works fine. However, once I try to plot two columns, no plot is generated and the code seems to run indefinitely, so I think the problem is in how I am passing the data along. Each column is 54,500 rows long.
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pandas import read_csv
os.chdir(r"some_directory//")
df = read_csv(r"csv_file.csv")
# the csv file is 7 columns x 54500 rows, only concerned with two columns
df = df[['surge', 'sway']]
# re-size the dataframe to only use two columns
data = df[['surge', 'sway']]
#print data to just to confirm
print(data)
plt.violinplot(data, vert=True, showmeans=True, showmedians=True)
plt.show()
If I change the data line to data = df['surge'], I get a perfect plot with the 54501 surge values.
The program hangs when I introduce the second variable with data = df[['surge', 'sway']]. I should note that the same problem occurs with data = df[['surge']], so I think it has something to do with the double brackets and going from a list to an array, perhaps?
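One possible explanation, which is an assumption and not confirmed in this thread, is that some older Matplotlib releases treat each row of a two-dimensional DataFrame as its own dataset, so 54,500 rows would mean tens of thousands of violins. A sketch that avoids the ambiguity is to pass one 1-D array per column. The data below is a random stand-in for the real 'surge' and 'sway' columns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in for the 'surge' and 'sway' columns (not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "surge": rng.normal(0.0, 1.0, 54500),
    "sway": rng.normal(0.5, 2.0, 54500),
})

cols = ["surge", "sway"]
# One 1-D array per column, so each column becomes exactly one violin.
datasets = [df[c].to_numpy() for c in cols]

fig, ax = plt.subplots()
parts = ax.violinplot(datasets, showmeans=True, showmedians=True)

# Violins are drawn at positions 1, 2, ...; label them with the column names.
ax.set_xticks(range(1, len(cols) + 1))
ax.set_xticklabels(cols)
# Call plt.show() to display the figure in an interactive session.
```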

pandas scatter plot not showing all data

I am new to pandas data visualizations and I'm having some trouble with a simple scatter plot. I have a dataframe loaded from a CSV, with 6 columns and 137 rows. But when I try to scatter the data from two columns, I only see 20 data points in the generated graph. I expected to see all 137. Any suggestions?
Here is a tidbit of code:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv(file, sep=',', header=0)
df.plot.scatter(x="Parte_aerea_peso_fresco", y="APCEi", marker=".")
And here is the output.
Possibility 1)
Many points are in exactly the same spot. You can check this manually in your file.csv.
Possibility 2)
Some values are not valid, e.g. NaN (not a number) or a string.
Your dataframe is small, so you can check this possibility by printing your DataFrame:
print (df)
print (df[40:60])
df.describe()
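Both possibilities can also be counted directly instead of being checked by eye. The sketch below uses a made-up six-row frame with the two column names from the question; the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows using the column names from the question.
df = pd.DataFrame({
    "Parte_aerea_peso_fresco": ["1.0", "1.0", "2.5", "n/a", "3.0", "3.0"],
    "APCEi": [10.0, 10.0, 20.0, 30.0, np.nan, 40.0],
})
cols = ["Parte_aerea_peso_fresco", "APCEi"]

# Possibility 2: anything that is not a number becomes NaN here.
numeric = df[cols].apply(pd.to_numeric, errors="coerce")
n_invalid = int(numeric.isna().any(axis=1).sum())

# Possibility 1: rows that repeat an earlier (x, y) pair land on the same spot.
valid = numeric.dropna()
n_overlapping = int(valid.duplicated().sum())
n_distinct = len(valid.drop_duplicates())

print(f"invalid rows: {n_invalid}, overlapping: {n_overlapping}, distinct points: {n_distinct}")
```

If `n_distinct` on the real data comes out near 20, the missing points are explained by invalid values and overlaps and not by a plotting problem.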
