pandas scatter plot not showing all data - python

I am new to pandas data visualizations and am having some trouble with a simple scatter plot. I have a DataFrame loaded from a CSV, with 6 columns and 137 rows. But when I try to scatter the data from two columns, I only see 20 data points in the generated graph. I expected to see all 137. Any suggestions?
Here is a tidbit of code:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv(file, sep=',', header=0)
df.plot.scatter(x="Parte_aerea_peso_fresco", y="APCEi", marker=".")
And here is the output.

Possibility 1)
Many points lie on exactly the same spot, so they are drawn on top of each other. You can check this manually in your CSV file.
Possibility 2)
Some values are not valid, e.g. NaN (not a number) or a string.
Your DataFrame is small, so you can check this possibility by printing it.
print(df)
print(df[40:60])
print(df.describe())
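Both possibilities can also be checked programmatically rather than by eye. The sketch below uses a small made-up DataFrame (the values are invented; only the column names come from the question) and counts how many rows would actually yield a distinct, plottable point.

```python
import pandas as pd

# Made-up sample standing in for the CSV; only the column names come from the question.
df = pd.DataFrame({
    "Parte_aerea_peso_fresco": [1.0, 1.0, 2.5, None, "n/a", 3.0],
    "APCEi": [10, 10, 20, 30, 40, 50],
})

cols = ["Parte_aerea_peso_fresco", "APCEi"]

# Coerce both columns to numbers; strings such as "n/a" become NaN.
numeric = df[cols].apply(pd.to_numeric, errors="coerce")

# Possibility 2: rows that cannot be plotted because a value is missing or invalid.
n_invalid = int(numeric.isna().any(axis=1).sum())

# Possibility 1: valid rows that fall exactly on an earlier point.
valid = numeric.dropna()
n_overlapping = int(valid.duplicated().sum())

n_distinct = len(valid) - n_overlapping
print(f"rows: {len(df)}, invalid: {n_invalid}, "
      f"overlapping: {n_overlapping}, distinct points: {n_distinct}")
```

If the number of distinct points is close to the number of markers visible in the plot, overlapping or invalid values explain the gap. Passing `alpha=0.3` to `df.plot.scatter` makes stacked points appear darker.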

Related

Graphing in Dataframe Pandas Python. How to plot a line after filtering a dataframe

So I have a pandas DataFrame with patient id, date of visit, location, weight, and heart rate. I need to plot a line showing the number of visits at one location in the dataset over a period of 12 months, with the month number on the horizontal axis.
Any other suggestions about how I may go about this?
I tried splitting the data into 3 datasets and then graphing the number of visits counted from each one, but creating new columns and assigning the values wasn't working. It only worked when I was graphing the values for all of the clinics; after I split the data into 3 DataFrames, it stopped working.
Here is a working example of filtering a DataFrame and using the filtered results to plot a chart.
import pandas as pd
import matplotlib.pyplot as plt
# larger dataframe example
d = {'x values':[1,2,3,4,5,6,7,8,9],'y values':[2,4,6,8,10,12,14,16,18]}
df = pd.DataFrame(d)
# apply filter
df = df[df['x values'] < 5]
# plot chart
plt.plot(df['x values'], df['y values'])
plt.show()
result:
Simply place your data into an ndarray and plot it with matplotlib.pyplot, or plot directly from a DataFrame column, for example plt.plot(df['something']).
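For the original question (visits per month at one location), a minimal sketch might look like the following. The column names `location` and `visit_date` and the sample rows are assumptions, since the question does not show the actual DataFrame.

```python
import pandas as pd

# Hypothetical data; column names and values are invented for illustration.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "location": ["Clinic A", "Clinic A", "Clinic B", "Clinic A", "Clinic B", "Clinic A"],
    "visit_date": ["2020-01-05", "2020-01-20", "2020-02-11",
                   "2020-03-02", "2020-03-15", "2020-03-30"],
})
df["visit_date"] = pd.to_datetime(df["visit_date"])

# Keep only the rows for one location.
clinic = df[df["location"] == "Clinic A"]

# Count visits per month number, filling months without visits with 0.
visits_per_month = (
    clinic.groupby(clinic["visit_date"].dt.month)
    .size()
    .reindex(range(1, 13), fill_value=0)
)
print(visits_per_month)
```

Calling `visits_per_month.plot(kind='line')` then puts the month number on the horizontal axis and the visit count on the vertical axis.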

Plotting top 10 Values in Big Data

I need help plotting some categorical and numerical values in Python. The code is given below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('train_feature_store.csv')
df.info()
df.head()
df.columns
plt.figure(figsize=(20,6))
sns.countplot(x='Store', data=df)
plt.show()
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
However, the dataset is so large that I'm not able to make a meaningful plot in Python. Basically, I just want to take the top 5 or top 10 values and plot them, as shown below:
In an attempt to plot this, I'm trying to put the result of the code below into a DataFrame and plot it, but I haven't managed to. Can anyone help me out?
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
Below is a link to the sample dataset. The sample is only representative; the original dataset on which I'm doing the EDA has around 3 thousand unique stores and 60 thousand rows. Please help! Thanks!
https://drive.google.com/drive/folders/1PdXaKXKiQXX0wrHYT3ZABjfT3QLIYzQ0?usp=sharing
You were pretty close.
import pandas as pd
import seaborn as sns
df = pd.read_csv('train_feature_store.csv')
sns.set(rc={'figure.figsize':(16,9)})
g = df.groupby('Store', as_index=False)['Size'].sum().sort_values(by='Size', ascending=False).head(10)
sns.barplot(data=g, x='Store', y='Size', hue='Store', dodge=False).set(xticklabels=[]);
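A similar result can be had with `nlargest`, which avoids the explicit sort. The sketch below uses a tiny invented table in place of `train_feature_store.csv`; only the `Store` and `Size` column names are taken from the question.

```python
import pandas as pd

# Invented rows standing in for the real CSV.
df = pd.DataFrame({
    "Store": ["A", "A", "B", "C", "C", "D"],
    "Size": [100, 50, 300, 20, 10, 200],
})

# Total size per store, then keep the n largest totals (n=3 here; use 10 for a top 10).
top = df.groupby("Store")["Size"].sum().nlargest(3)
print(top)
```

Calling `top.plot(kind='bar')` draws the bars in that descending order.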
First of all, looking at the data, it seems to cover locations from Scotland to Kolkata.
Categorize the data by geography first and then visualize it.
Regards
Maitryee

Create X Axis from row in pandas dataframe

I'm trying to plot the x-axis from the top row of my dataframe, and the y-axis from another row in my dataframe.
My dataframe looks like this:
sector_data =
Time 13:00 13:15 13:30 13:45
Utilities 1235654 1456267 1354894 1423124
Transports 506245 554862 534685 524962
Telecomms 142653 153264 162357 154698
I've tried a lot of different things, with this seeming to make the most sense. But nothing works:
sector_data.plot(kind='line',x='Time',y='Utilities')
plt.show()
I keep getting:
KeyError: 'Time'
It should end up looking like this:
Expected Chart
Given the little information you provide I believe this should help:
df = sector_data.T
df.plot(kind='line',x='Time',y='Utilities')
plt.show()
This is how I made a case example (I have already transposed the dataframe)
import pandas as pd
import matplotlib.pyplot as plt
a = {'Time':['13:00','13:15','13:30','13:45'],'Utilities':[1235654,1456267,1354894,1423124],'Transports':[506245,554862,534685,524962],'Telecomms':[142653,153264,162357,154698]}
df = pd.DataFrame(a)
df.plot(kind='line',x='Time',y='Utilities')
plt.show()
Output:
Let's take an example DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'ColA':['Time','Utilities','Transports','Telecomms'],'ColB':['13:00', 1235654, 506245, 142653],'ColC':['14:00', 1234654, 506145, 142650], 'ColD':['15:00', 4235654, 906245, 142053],'ColE':['16:00', 4205654, 906845, 742053]})
df = df.set_index('ColA')  # use column A (the row labels) as the index
Now you can easily plot with matplotlib
plt.plot(df.loc['Time'].values,df.loc['Utilities'].values)
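If the sector names actually sit in a column called `Time` and the times are the remaining column headers (an assumption about the layout, since the question only shows the printed frame), setting that column as the index before transposing gives one column per sector:

```python
import pandas as pd

# Assumed layout: sector names in a 'Time' column, one column per timestamp.
sector_data = pd.DataFrame({
    "Time": ["Utilities", "Transports", "Telecomms"],
    "13:00": [1235654, 506245, 142653],
    "13:15": [1456267, 554862, 153264],
    "13:30": [1354894, 534685, 162357],
    "13:45": [1423124, 524962, 154698],
})

# After the transpose, rows are timestamps and columns are sectors.
by_time = (
    sector_data.set_index("Time")
    .T
    .rename_axis(index="Time", columns="Sector")
)
print(by_time["Utilities"])
```

With this shape, `by_time["Utilities"].plot(kind="line")` uses the times on the x-axis.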

Problem plotting single and double column data with a boxplot

I am trying to plot columns of data from a .csv file in a boxplot/violin plot using matplotlib.pyplot.
When setting the dataframe [df] to one column of data, the plotting works fine. However, once I try to plot two columns, no plot is generated and the code seems to just keep running, so I think there is something wrong with how I am passing the data along. Each column is 54,500 rows long.
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pandas import read_csv
os.chdir(r"some_directory//")
df = read_csv(r"csv_file.csv")
# the csv file is 7 columns x 54500 rows, only concerned with two columns
# re-size the dataframe to only use two columns
df = df[['surge', 'sway']]
data = df[['surge', 'sway']]
# print data just to confirm
print(data)
plt.violinplot(data, vert=True, showmeans=True, showmedians=True)
plt.show()
If I change the data line to data = df['surge'] I get a perfect plot with the 54501 surge values.
When I introduce the second variable as data = df[['surge', 'sway']] is when the program gets hung up. I should note the same problem exists if I let data = df[['surge']] so I think it's something to do with the double braces and going from a list to an array, perhaps?
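One thing worth trying, offered as an assumption rather than a confirmed diagnosis: `violinplot` accepts a sequence of 1-D vectors, one per violin, so each column can be passed as its own array instead of handing over the two-dimensional DataFrame. Depending on the matplotlib version, a DataFrame may be read row by row, which would mean tens of thousands of tiny violins. The sketch below uses random data in place of the CSV.

```python
import numpy as np
import pandas as pd

# Random stand-in for the CSV; only the column names come from the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "surge": rng.normal(size=1000),
    "sway": rng.normal(size=1000),
})

# One 1-D NumPy array per column, with missing values dropped.
columns = ["surge", "sway"]
datasets = [df[name].dropna().to_numpy() for name in columns]

print(len(datasets), [d.shape for d in datasets])
```

Passing `datasets` to `plt.violinplot(datasets, vert=True, showmeans=True, showmedians=True)` should then draw exactly two violins.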

Pandas Series not plotting to timeseries chart

I have a data set of house prices (House Price Data). When I use a subset of the data in a NumPy array, I can plot it in this nice time series chart:
However, when I use the same data in a pandas Series, the chart goes all lumpy like this:
How can I create a smooth time series line graph (like the first image) using a pandas Series?
Here is what I am doing to get the nice-looking time series chart (using a NumPy array), after importing numpy as np, pandas as pd and matplotlib.pyplot as plt:
data = pd.read_csv('HPI.csv', index_col='Date', parse_dates=True) #pull in csv file, make index the date column and parse the dates
brixton = data[data['RegionName'] == 'Lambeth'] # pull out a subset for the region Lambeth
prices = brixton['AveragePrice'].values # create a numpy array of the average price values
plt.plot(prices) #plot
plt.show() #show
Here is what I am doing to get the lumpy one using a pandas Series:
data = pd.read_csv('HPI.csv', index_col='Date', parse_dates=True)
brixton = data[data['RegionName'] == 'Lambeth']
prices_panda = brixton['AveragePrice']
plt.plot(prices_panda)
plt.show()
How do I make this second graph show as a nice smooth proper time series?
* This is my first StackOverflow question so please shout if I have left anything out or not been clear *
Any help greatly appreciated
When you passed parse_dates=True, pandas read the dates using its default convention, which is month-day-year. Your data is formatted according to the British convention, which is day-month-year. As a result, instead of having a data point for the first of every month, your plot shows data points for the first 12 days of January and a flat line for the rest of each year. You need to swap the day and month back, for example:
data.index = pd.to_datetime({'year':data.index.year,'month':data.index.day,'day':data.index.month})
The date format in the file you have is Day/Month/Year. For pandas to interpret this format correctly, you can pass the option dayfirst=True to the read_csv call.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data/UK-HPI-full-file-2017-08.csv',
                   index_col='Date', parse_dates=True, dayfirst=True)
brixton = data[data['RegionName'] == 'Lambeth']
prices_panda = brixton['AveragePrice']
plt.plot(prices_panda)
plt.show()
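The effect of `dayfirst` is easy to see on a tiny in-memory CSV. The three dates below are invented, written day/month/year like the file in the question.

```python
import io

import pandas as pd

# Invented sample: the first day of January, February and March 2017 as DD/MM/YYYY.
csv_text = "Date,AveragePrice\n01/01/2017,100\n01/02/2017,110\n01/03/2017,120\n"

default = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)
day_first = pd.read_csv(io.StringIO(csv_text), index_col="Date",
                        parse_dates=True, dayfirst=True)

# Without dayfirst the first field is read as the month, so every row lands in January.
print(list(default.index.month))    # month numbers under the default parsing
print(list(day_first.index.month))  # month numbers with dayfirst=True
```

Under the default parsing all three rows fall in the same month on consecutive days, which is what produces the lumpy chart; with `dayfirst=True` they fall in three consecutive months.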
