csv file
Hello, I have a csv file that I want to turn into a graph. What I want is to plot the number of jobs in each region by city. The file has columns for both cities and countries; I want to drop the date-created column and keep just the city and the number of job offers.
Here is the code I tried to use and it didn't work:
import pandas as pd
from matplotlib.pyplot import pie, axis, show
%matplotlib inline
df = pd.read_csv ('compuTrabajo_business_summary_by_industry.csv')
sums = df.groupby(df["country;"])["business count"].sum()
axis('equal');
pie(sums, labels=sums.index);
show()
Thanks for the help
As Abhinav Kinagi already answered, pandas assumes that your values are separated by commas. You can either change your csv file or simply pass sep='|' to pd.read_csv. Your code should be
%matplotlib inline
import pandas as pd
from matplotlib.pyplot import pie, axis, show
df = pd.read_csv ('compuTrabajo_business_summary_by_industry.csv', sep='|')
sums = df.groupby(df["country"])["business count"].sum()
axis('equal');
pie(sums, labels=sums.index);
show()
I also removed the ; after country.
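If you are not sure which delimiter the file uses, one option is to let Python's csv.Sniffer guess it before calling read_csv. The sketch below runs on made-up inline data (the city column and the sample rows are assumptions, not your real file), so treat it as an illustration rather than a drop-in fix.

```python
import csv
import io

import pandas as pd

# Hypothetical sample standing in for the real file; the rows are invented.
raw = "city|country|business count\nLima|Peru|12\nCusco|Peru|3\nLima|Peru|5\n"

# Let csv.Sniffer guess the delimiter from the candidates we expect.
dialect = csv.Sniffer().sniff(raw, delimiters=",;|")
df = pd.read_csv(io.StringIO(raw), sep=dialect.delimiter)

# Total job offers per city, largest first.
sums = df.groupby("city")["business count"].sum().sort_values(ascending=False)
print(sums)
```

With a real file you would read the first few kilobytes, pass that text to sniff, and then hand the detected delimiter to pd.read_csv together with the file name.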
Related
I need help plotting some categorical and numerical Values in python. the code is given below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('train_feature_store.csv')
df.info()
df.head()
df.columns
plt.figure(figsize=(20,6))
sns.countplot(x='Store', data=df)
plt.show()
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
However, the dataset is so large that I can't produce a meaningful plot of everything. Basically, I just want to take the top 5 or top 10 values and plot those.
To do that, I'm trying to put the result of the code below into a dataframe and plot it, but I can't get it to work. Can anyone help me out?
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
Below is a link to a sample dataset. It is only representative; the original one I'm doing the EDA on has around 3 thousand unique stores and 60 thousand rows of data. Please help! Thanks!
https://drive.google.com/drive/folders/1PdXaKXKiQXX0wrHYT3ZABjfT3QLIYzQ0?usp=sharing
You were pretty close.
import pandas as pd
import seaborn as sns
df = pd.read_csv('train_feature_store.csv')
sns.set(rc={'figure.figsize':(16,9)})
g = df.groupby('Store', as_index=False)['Size'].sum().sort_values(by='Size', ascending=False).head(10)
sns.barplot(data=g, x='Store', y='Size', hue='Store', dodge=False).set(xticklabels=[]);
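An alternative way to pick the top stores, shown here as a small sketch on invented data (the store labels and sizes are made up), is Series.nlargest, which saves sorting the whole result by hand:

```python
import pandas as pd

# Invented stand-in for train_feature_store.csv; only the two relevant columns.
df = pd.DataFrame({
    "Store": ["A", "B", "A", "C", "B", "D"],
    "Size": [10, 40, 5, 25, 2, 1],
})

# Sum Size per store, then keep the 3 largest totals (use 10 for a top 10).
top = df.groupby("Store")["Size"].sum().nlargest(3)
print(top)

# top.plot(kind="bar") would draw these totals as a bar chart.
```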
First of all, looking at the data, it appears to cover locations from Scotland to Kolkata.
Categorize the data by geography first and then visualize it.
Regards
Maitryee
If I have this length.csv file content:
How can I use pandas to plot a dot (scatter) graph based on the xy and yx columns?
import pandas as pd
# Use a raw string so that \t and \f in the Windows path are not read as escape sequences
df = pd.read_csv(r'C:\path\to\folder\length.csv')
Now if you print df, you will get the following
df.plot(x='yx', y='xy', kind='scatter')
You can change your plot type to different types like line, bar etc.
Refer to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
You can easily use matplotlib. The plot method in Pandas is a wrapper for matplotlib.
If you wish to use Pandas, you can do it as such:
import pandas as pd
df = pd.read_csv('length.csv')
df.plot(x='xy', y='yx')
If you decide to go ahead with matplotlib, you can do as follows:
import matplotlib.pyplot as plt
import pandas as pd
# Include the next line only if you are in a notebook (like Jupyter or Colab)
%matplotlib inline
df = pd.read_csv('length.csv')
plt.plot(df['xy'], df['yx'])
plt.xlabel('xy')
plt.ylabel('yx')
plt.title('xy vs yx Plot')
plt.show()
I am trying to plot columns of data from a .csv file in a boxplot/violin plot using matplotlib.pyplot.
When I set the dataframe [df] to one column of data, the plotting works fine. However, once I try to plot two columns, no plot is generated and the code seems to just keep running, so I think the problem is in how I am passing the data along. Each column is 54,500 rows long.
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pandas import read_csv
os.chdir(r"some_directory//")
df = read_csv(r"csv_file.csv")
# the csv file is 7 columns x 54500 rows, only concerned with two columns
df = df[['surge', 'sway']]
# re-size the dataframe to only use two columns
data = df[['surge', 'sway']]
#print data to just to confirm
print(data)
plt.violinplot(data, vert=True, showmeans=True, showmedians=True)
plt.show()
If I change the data line to data = df['surge'] I get a perfect plot with the 54501 surge values.
When I introduce the second variable as data = df[['surge', 'sway']], the program gets hung up. I should note that the same problem exists if I let data = df[['surge']], so I think it has something to do with the double brackets and going from a list to an array, perhaps?
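No answer was recorded for this question, so the following is only a sketch of one thing to try, using random data in place of the real file. matplotlib's violinplot accepts a sequence of 1-D arrays, so each column can be passed explicitly as its own array instead of handing over the DataFrame. Whether this resolves the hang depends on the matplotlib version and on the data itself (for example NaN values), which cannot be verified here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the real surge/sway columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({"surge": rng.normal(size=500), "sway": rng.normal(size=500)})

# One 1-D array per violin, with missing values dropped first.
columns = ["surge", "sway"]
data = [df[c].dropna().to_numpy() for c in columns]

fig, ax = plt.subplots()
parts = ax.violinplot(data, showmeans=True, showmedians=True)

# Label the two violins (they are drawn at x positions 1 and 2).
ax.set_xticks([1, 2])
ax.set_xticklabels(columns)
```

In an interactive session, remove the backend line and call plt.show() at the end to display the figure.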
I am new to pandas data visualizations and I'm having some trouble with a simple scatter plot. I have a dataframe loaded from a csv, with 6 columns and 137 rows. But when I try to scatter the data from two columns, I only see 20 data points in the generated graph. I expected to see all 137. Any suggestions?
Here is a tidbit of code:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv(file, sep=',', header=0)
df.plot.scatter(x="Parte_aerea_peso_fresco", y="APCEi", marker=".")
And here is the output.
Possibility 1)
Many points are on exactly the same spot. You can manually check in your file.csv
Possibility 2)
Some values are not valid, e.g. NaN (not a number) or a string.
Your dataframe is small: You can check this possibility by printing your DataFrame.
print (df)
print (df[40:60])
df.describe()
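Both possibilities can also be checked programmatically. The sketch below uses a tiny invented frame with the two column names from the question; pd.to_numeric with errors="coerce" turns anything non-numeric into NaN so that it can be counted.

```python
import pandas as pd

# Invented sample: one duplicate point, one missing value, one stray string.
df = pd.DataFrame({
    "Parte_aerea_peso_fresco": [1.0, 1.0, 2.5, None, 3.0],
    "APCEi": ["4", "4", "5", "6", "n/a"],
})
cols = ["Parte_aerea_peso_fresco", "APCEi"]

# Possibility 2: coerce to numbers; invalid entries become NaN.
numeric = df[cols].apply(pd.to_numeric, errors="coerce")
invalid_rows = int(numeric.isna().any(axis=1).sum())

# Possibility 1: count rows that repeat an earlier (x, y) pair exactly.
valid = numeric.dropna()
overlapping = int(valid.duplicated().sum())

print(invalid_rows, overlapping)
```

If the two counts together account for the missing points, the plot itself is probably fine and the data is what needs attention.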
I have a csv file (excel spreadsheet) of a column of roughly a million numbers. I want to make a histogram of this data with the frequency of the numbers on the y-axis and the number quantities on the x-axis. I know matplotlib can plot a histogram, but my main problem is converting the csv file from string to float since a string can't be graphed. This is what I have:
import matplotlib.pyplot as plt
import csv
with open('D1.csv', 'rb') as data:
    rows = csv.reader(data, quoting = csv.QUOTE_NONNUMERIC)
    floats = [[item for number, item in enumerate(row) if item and (1 <= number <= 12)] for row in rows]
plt.hist(floats, bins=50)
plt.title("histogram")
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
You can do it in one line with pandas:
import pandas as pd
pd.read_csv('D1.csv', quoting=2)['column_you_want'].hist(bins=50)
Okay I finally got something to work with headings, titles, etc.
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('D1.csv', quoting=2)
data.hist(bins=50)
plt.xlim([0,115000])
plt.title("Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
My first problem was that matplotlib is needed to actually show the graph. Also, I needed to assign the result of
pd.read_csv('D1.csv', quoting=2)
to data so that I could plot its histogram with
data.hist
Thank you all for the help.
pandas' read_csv is very powerful, but if your csv file is simple (no headers, NaNs, or comments) you do not need pandas; you can use NumPy instead:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('D1.csv')
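# density=True replaces the normed argument, which was removed from newer matplotlib versions
plt.hist(data, density=True, bins='auto')
plt.show()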
(In fact loadtxt can deal with some headers and comments, but read_csv is more versatile)
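As a small illustration of that last remark, with made-up inline data standing in for D1.csv, loadtxt can skip a header row through skiprows and ignores comment lines:

```python
import io

import numpy as np

# Invented stand-in for a one-column file with a header and a comment line.
raw = "value\n1.5\n# a comment line\n2.0\n3.25\n"

# skiprows=1 drops the header; lines starting with '#' are ignored by default.
data = np.loadtxt(io.StringIO(raw), skiprows=1)
print(data)
```

For a file on disk, pass the file name instead of the StringIO object; add delimiter=',' if the file has more than one column.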