Plotting top 10 Values in Big Data - python
I need help plotting some categorical and numerical values in Python. The code is given below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('train_feature_store.csv')
df.info()
df.head()
df.columns
plt.figure(figsize=(20,6))
sns.countplot(x='Store', data=df)
plt.show()
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
However, the dataset is so large that I can't produce a meaningful plot. I just want to take the top 5 or top 10 values and plot them, as shown below:
I tried to put the result of the code below into a DataFrame and plot it, but couldn't get it to work. Can anyone help me with this?
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
Below is a link to a sample dataset. It is only representative: the original dataset I'm doing the EDA on has around 3 thousand unique stores and 60 thousand rows of data. Please help! Thanks!
https://drive.google.com/drive/folders/1PdXaKXKiQXX0wrHYT3ZABjfT3QLIYzQ0?usp=sharing
You were pretty close.
import pandas as pd
import seaborn as sns
df = pd.read_csv('train_feature_store.csv')
sns.set(rc={'figure.figsize':(16,9)})
g = df.groupby('Store', as_index=False)['Size'].sum().sort_values(by='Size', ascending=False).head(10)
sns.barplot(data=g, x='Store', y='Size', hue='Store', dodge=False).set(xticklabels=[]);
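The same top-10 idea can also be sketched with pandas alone. The snippet below is only an illustration: it uses synthetic data in place of train_feature_store.csv (the real values are an assumption here), and `nlargest` is swapped in for the `sort_values(...).head(10)` chain above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for train_feature_store.csv (assumption: only the
# 'Store' and 'Size' columns matter for this plot).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Store": rng.integers(1, 3001, size=60_000),
    "Size": rng.integers(1_000, 200_000, size=60_000),
})

# Sum Size per store, then keep the 10 largest totals.
# nlargest returns them already sorted in descending order.
top10 = df.groupby("Store")["Size"].sum().nlargest(10)

# A Series bar plot uses the index (store ids) as categorical labels,
# so the bars keep the descending order of top10.
ax = top10.plot.bar(figsize=(16, 9))
ax.set_xlabel("Store")
ax.set_ylabel("Total Size")
plt.tight_layout()
```

Call plt.show() or plt.savefig(...) afterwards to display or save the figure. Aggregating before plotting means only 10 bars are drawn, no matter how many rows the original file has.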
First of all, looking at the data, it seems to cover locations from Scotland to Kolkata.
Categorize the data by geography first and then visualize it.
Regards
Maitryee
Related
Plotting complex graph in pandas
I have the following dataset:
ids count
1 2000210
2 -23123
3 100
4 500
5 102300120
...
1 million 123213
I want a graph with the groups of ids (all unique ids) on the x axis and count on the y axis, as a distribution chart that looks like the following.
How can I achieve this with a pandas DataFrame in Python? I tried different ways, but I only get a basic plot, not one as complex as the drawing.
What I tried:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
df.plot(x="range", y="count");
But the plots don't make any sense. I am also new to plotting in pandas. I searched the internet for a long time for charts like this and could really use some help with such graphs.
From what I understood from your question and comments, here is what you can do:
1) Import the libraries and set the default theme:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
2) Create your dataframe:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
3) Plot your data
3.1) Simple take using only the seaborn library:
sns.kdeplot(data=df, x="count", weights="range")
3.2) More complex take using the seaborn and matplotlib libraries:
sns.histplot(x=df["count"], weights=df["range"], discrete=True, color='darkblue', edgecolor='black', kde=True, kde_kws={'cut': 2}, line_kws={'linewidth': 4})
plt.ylabel("range")
plt.show()
Personal note: please make sure to check all the solutions; if they are not enough, leave a comment and we will work together to find one.
For a distribution plot of ids, you can use:
import numpy as np
import pandas as pd
np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])
df['ids'].plot(kind='kde')
How to display X axis from Pandas Dataframe Object to Matplotlib barchart
I have created a pandas DataFrame by reading a CSV file. The CSV is very short and contains the following data:
Group Title, Hosts ,Soccer 5, Soccer 4, Soccer 3 , Soccer 2, Soccer 1, Soccer X ,Soccer Y, Soccer Total
Units,11,1,3,4,4,5,
[1 rows x 8 columns]
I have successfully displayed the data on a bar chart. However, I want the x axis to be labelled with each group title (Hosts, Soccer 5, Soccer 4 and so on), aligned with the data. Please see the picture of my current graph below to get a better understanding.
I know I can do this manually, but I want the labels to be read from the CSV file, i.e. the DataFrame. I have tried different methods, such as adding the following code:
dataframe.plot.bar(dataframe['Group Title'], title="Soccer", ylabel="Quantity", xlabel="Devices")
My full code is below:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
dataframe = pd.read_csv(r'C:\Scripts\custom.csv', delimiter=",")
print(dataframe)
dataframe.plot()
#plt.rcParams['figure.figsize'] = (15,8)
matplotlib.style.use('ggplot')
dataframe.plot.bar(title="Devices", ylabel="Quantity", xlabel="Devices")
#Show Graphs
plt.show()
Any help or guidance will be appreciated. Thank you
You can use seaborn's barplot:
import seaborn as sns
import matplotlib.pyplot as plt
sns.barplot(data=df)
plt.xticks(rotation=45)
Alternatively, with pandas only:
df.set_index('Group Title').T.plot.bar()
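To make the pandas-only variant concrete, here is a minimal sketch on a hypothetical one-row frame shaped like the CSV in the question. The values and the shortened column list are made up for illustration. It shows why the transpose matters: after `set_index('Group Title').T`, each former column name becomes a row label, and pandas uses those labels as x tick labels.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical one-row frame shaped like the question's CSV
# (column subset and values are assumptions for illustration).
df = pd.DataFrame({
    "Group Title": ["Units"],
    "Hosts": [11],
    "Soccer 5": [1],
    "Soccer 4": [3],
    "Soccer 3": [4],
})

# Move 'Group Title' into the index, then transpose so that each former
# column (Hosts, Soccer 5, ...) becomes one row, i.e. one bar.
per_group = df.set_index("Group Title").T

# The index of per_group supplies the x tick labels.
ax = per_group.plot.bar(title="Devices", xlabel="Devices", ylabel="Quantity", rot=45)
plt.tight_layout()
```

The legend then shows the original row label ("Units"), and the x axis shows one labelled bar per device group.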
Difficulty choosing the right plot type in Python for a large number of categories
I have a data frame with 3 columns: Language, Total Value and Percentage. I am not sure which plot type to use in Python for a good visualization. Below is the data:
import pandas as pd
data={'Language':['Haitian,Creole','Dutch','French','English','Xhosa','Afrikaans','Lati','Galicia','Quechua','Danish','Western,Frisia','Xhosa,French','French,Xhosa','Spanish','Norwegian,Nynorsk','Norwegia','Germa','Indonesia','Interlingua','Romania','French,English','Interlingue','Czech','Scots','Uzbek','Manx','Luxembourgish','Malagasy','Irish','Slovak','Inupiaq','Morisye','English,French','Finnish','Dutch,Afrikaans','Afar','Corsica','Portuguese','Dutch,English','Sundanese','Kinyarwanda','Malay','Volapük','Afrikaans,Dutch','Wolof','Basque','Estonia','Italia','Lithuania','Scottish,Gaelic','Hungaria','Breto','Kalaallisut','Welsh','Zhuang','Lingala','Occita','Maori','Khasi','Maltese','Seselwa,Creole,French','Vietnamese','Tagalog','Fijia','zzp','Romansh','Bislama','Polish','Swedish','Xhosa,English','English,Dutch','Catala','Hmong','Turkme','Somali','Nyanja','Turkish','Oromo','Ganda','Tswana','Javanese','Southern,Sotho','Samoa','Guarani','Aymara','Naur','Waray','Icelandic','Rundi','Latvia','Shona','Klingo','Tonga','Cebuano','Igbo','Aka','French,Dutch','Hawaiia','Esperanto','Albania','Yoruba','Swahili','Breton,French','Dutch,Danish','Serbia'],'Total_Value':['180455','86394','40609','18355','17882','2508','483','362','259','258','247','209','172','162','156','139','130','71','70','64','45','39','38','33','33','30','29','27','26','24','22','21','20','17','16','14','14','13','13','13','12','11','11','10','9','9','9','8','8','8','7','7','6','6','6','6','6','6','6','5','5','5','5','5','4','4','4','4','4','4','3','3','3','3','3','3','2','2','2','2','2','2','2','2','2','2','2','2','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1'],'Percentage':['0.515799403','0.246942305','0.116073802','0.052464592','0.051112604','0.007168684','0.001380572','0.001034714','0.000740307','0.000737448','0.000706007','0.00059739','0.000491632','0.000463049','0.000445899','0.000397307','0.000371583','0.000202941','0.000200083','0.000182933','0.000128625','0.000111475','0.000108616','0.0000943','0.0000943','0.0000857','0.0000829','0.0000772','0.0000743','0.0000686','0.0000629','0.00006','0.0000572','0.0000486','0.0000457','0.00004','0.00004','0.0000372','0.0000372','0.0000372','0.0000343','0.0000314','0.0000314','0.0000286','0.0000257','0.0000257','0.0000257','0.0000229','0.0000229','0.0000229','0.00002','0.00002','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000143','0.0000143','0.0000143','0.0000143','0.0000143','0.0000114','0.0000114','0.0000114','0.0000114','0.0000114','0.0000114','0.00000857','0.00000857','0.00000857','0.00000857','0.00000857','0.00000857','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286']}
df = pd.DataFrame(data)
I don't know the best way to visualize these three attributes using matplotlib, seaborn or plotly. The Language column has 106 categories, each with a corresponding total value and percentage. Please suggest a good, interpretable visualization.
I tried the code below, but I could see only 52 languages on the x axis:
import chart_studio.plotly as py
import plotly.graph_objects as go
fig = go.Figure(data=go.Heatmap( z=[code_lang['percentage']], x=code_lang['Language'], y=code_lang['Total Value'], hoverongaps = False))
fig.show()
It would be helpful if there is a better option.
Here is a way to show the data as a word cloud. Some remarks:
- The original Total_Value and Percentage columns are text strings; they need to be converted to numeric.
- The Total_Value and Percentage columns carry equivalent information, so only one of the two needs to be shown.
- Many of the percentages are extremely small, so they become invisible in most types of visualization.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
# data=...
df = pd.DataFrame(data)
df.Percentage = df.Percentage.astype(float)
df.Total_Value = df.Total_Value.astype(int)
word_dict = {}
for row in df.itertuples(index=False):
    word_dict[row.Language] = row.Percentage
wordcloud = WordCloud(background_color="white", width=1200, height=1000).generate_from_frequencies(word_dict)
plt.axis('off')
plt.imshow(wordcloud)
plt.show()
In order to keep the large values from overwhelming the smaller ones, the percentages could be brought closer together, e.g. using word_dict[row.Language] = row.Percentage ** .2.
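The effect of the `** .2` adjustment can be checked without drawing anything. The sketch below takes a handful of values from the question's data (already converted to float) and compares the spread between the largest and smallest weight before and after the transform. The variable names are illustrative only.

```python
import pandas as pd

# A few rows taken from the question's data, already numeric.
df = pd.DataFrame({
    "Language": ["Haitian,Creole", "Dutch", "French", "Lati", "Serbia"],
    "Percentage": [0.515799403, 0.246942305, 0.116073802, 0.001380572, 0.00000286],
})

# Weights as they would be passed to generate_from_frequencies.
raw = dict(zip(df["Language"], df["Percentage"]))
compressed = dict(zip(df["Language"], df["Percentage"] ** 0.2))

# Ratio between the largest and the smallest weight in each case.
raw_ratio = max(raw.values()) / min(raw.values())
compressed_ratio = max(compressed.values()) / min(compressed.values())

print(f"raw ratio: {raw_ratio:.0f}, compressed ratio: {compressed_ratio:.1f}")
```

Raising to a power between 0 and 1 keeps the ranking of the languages but shrinks the ratio between them, which is why the rare languages become readable in the cloud. The exponent is a free choice: values closer to 0 flatten the differences more.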
Problem plotting single and double column data with a boxplot
I am trying to plot columns of data from a .csv file in a boxplot/violin plot using matplotlib.pyplot. When I set the dataframe [df] to one column of data, the plotting works fine. However, once I try to plot two columns, no plot is generated and the code seems to just keep running, so I think the problem is in how I am passing the data along. Each column is 54,500 rows long.
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pandas import read_csv
os.chdir(r"some_directory//")
df = read_csv(r"csv_file.csv") # the csv file is 7 columns x 54500 rows, only concerned with two columns
df = df[['surge', 'sway']] # re-size the dataframe to only use two columns
data = df[['surge', 'sway']]
# print data just to confirm
print(data)
plt.violinplot(data, vert=True, showmeans=True, showmedians=True)
plt.show()
If I change the data line to data = df['surge'], I get a perfect plot with the 54501 surge values. When I introduce the second variable with data = df[['surge', 'sway']], the program hangs. I should note that the same problem exists if I use data = df[['surge']], so I think it has something to do with the double brackets and going from a list to an array, perhaps?
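One thing that may be worth trying is to hand matplotlib an explicit list of 1-D NumPy arrays, one per column, instead of the two-column DataFrame. The sketch below does not confirm the cause of the hang; it only shows a form of input that `violinplot` documents as a sequence of vectors. The data is synthetic, since the original CSV is not available.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the CSV (assumption: two numeric columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "surge": rng.normal(0.0, 1.0, 1000),
    "sway": rng.normal(0.5, 2.0, 1000),
})

cols = ["surge", "sway"]
# One 1-D array per column, so each column becomes exactly one violin.
# dropna() guards against missing values in the real file.
datasets = [df[c].dropna().to_numpy() for c in cols]

fig, ax = plt.subplots()
parts = ax.violinplot(datasets, showmeans=True, showmedians=True)

# Violins are drawn at positions 1..n by default; label them by column.
ax.set_xticks(range(1, len(cols) + 1))
ax.set_xticklabels(cols)
```

If this version renders quickly on the real data, the difference lies in how the DataFrame was being interpreted, not in the size of the columns.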
pandas scatter plot not showing all data
I am new to pandas data visualizations and I'm having some trouble with a simple scatter plot. I have a dataframe loaded from a CSV, with 6 columns and 137 rows. But when I try to scatter the data from two columns, I only see 20 data points in the generated graph. I expected to see all 137. Any suggestions? Here is a snippet of the code:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv(file, sep=',', header=0)
df.plot.scatter(x="Parte_aerea_peso_fresco", y="APCEi", marker=".")
And here is the output.
Possibility 1) Many points are at exactly the same spot. You can check this manually in your CSV file.
Possibility 2) Some values are not valid, e.g. NaN (not a number) or a string.
Your DataFrame is small, so you can check this possibility by printing it:
print(df)
print(df[40:60])
df.describe()
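Both possibilities can also be counted directly, without reading the printout by eye. The following is a minimal sketch on a small made-up frame; the column names come from the question, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the CSV; only the column names are
# taken from the question.
df = pd.DataFrame({
    "Parte_aerea_peso_fresco": [1.0, 1.0, 2.5, np.nan, "n/a", 3.0],
    "APCEi": [10, 10, 20, 30, 40, np.nan],
})
cols = ["Parte_aerea_peso_fresco", "APCEi"]

# Possibility 2: coerce both columns to numeric. Anything that is not a
# number (strings, blanks) becomes NaN, so invalid rows can be counted.
numeric = df[cols].apply(pd.to_numeric, errors="coerce")
invalid_rows = int(numeric.isna().any(axis=1).sum())

# Possibility 1: among the valid rows, count points that exactly repeat
# an earlier (x, y) pair and would be drawn on top of it.
valid = numeric.dropna()
overlapping = int(valid.duplicated().sum())
distinct_points = len(valid) - overlapping

print(invalid_rows, overlapping, distinct_points)
```

On the real data, `distinct_points` is the number of separate markers to expect in the scatter plot. If it comes out near 20, the missing points are explained by overlap and invalid values, not by the plotting call.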