I have a CSV file with 27,000 lines. I am trying to create a jitter plot, just like this one [https://static1.squarespace.com/static/56fd706140261df95349d4bd/t/59297c72579fb3d813d591c1/1495891103667/Jitter+Example+The+Truthful+Art.png?format=1000w].
The 'y' axis would be the column called "VALOR_REEMBOLSADO" (stands for "refund value"). The 'x' axis would be the column called "MES" (stands for "month").
It represents the spending of Brazilian senators in 2017. The CSV file is well organized, but "VALOR_REEMBOLSADO" is originally stored as a string, not a float. I replaced the "," with ".", but I still can't plot the chart.
Can someone help me with the code? What code can create a chart like that?
Here you find the CSV file of the year 2017: https://www12.senado.leg.br/transparencia/dados-abertos-transparencia/dados-abertos-ceaps
First, I have to admit that I don't understand some aspects of your question: the first link doesn't work and, more importantly, you want an x-axis that shows the months, but in the example plot the data is shown over states.
But I see that your problems already start when reading in the data, so I'll try to give you the hints you need to get started:
For reading in csv-data like this, I'd recommend pandas, usually imported with
import pandas as pd
It includes a CSV reader, which is quite powerful. Generally, you should avoid manually tweaking your data sources (changing decimal signs, etc.), because importer functions like read_csv already address this, and you don't want to repeat the manual step every time a new data file needs the same plot:
filepath = 'wherever/file/may/roam/2017.csv'
data = pd.read_csv(filepath, skiprows=1, sep=';', usecols=[1, 9], decimal=',')
With filepath you tell the importer where you stored the csv-file, skiprows=1 says that you're not interested in the first line of the file, sep defines the delimiter between the columns and via usecols you can pick only the columns of interest, 'MES' and 'VALOR_REEMBOLSADO' in your example.
decimal specifies the decimal sign of float numbers in your data.
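To see the effect of these parameters without downloading the real file, here is a minimal sketch that reads a small in-memory sample. The sample rows and the title line are made up for illustration; only the column names 'MES' and 'VALOR_REEMBOLSADO' come from the question, and columns are selected by name here instead of by position.

```python
import io
import pandas as pd

# Made-up sample mimicking the layout described above:
# one title line to skip, ';' as separator, ',' as decimal sign.
sample = io.StringIO(
    "ULTIMA ATUALIZACAO;01/01/2018\n"
    "ANO;MES;VALOR_REEMBOLSADO\n"
    "2017;1;97,00\n"
    "2017;2;1958,95\n"
)

data = pd.read_csv(
    sample,
    skiprows=1,                               # ignore the title line
    sep=';',                                  # column delimiter
    usecols=['MES', 'VALOR_REEMBOLSADO'],     # keep only the columns of interest
    decimal=',',                              # parse "97,00" as the float 97.0
)

# The refund column should now be numeric rather than a string column.
print(data.dtypes)
```

Because the decimal sign is handled at import time, no manual search-and-replace in the file is needed.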
Now data contains a pandas dataframe of your data:
In: data[:10]
Out:
   MES  VALOR_REEMBOLSADO
0    1              97.00
1    1            6000.00
2    1             418.04
3    1            1958.95
4    1            1178.67
5    1            1252.65
6    2              62.30
7    2             240.81
8    2            6000.00
9    2            2062.25
So this should be already something you can play around with.
This data can now be plotted with matplotlib or seaborn if you like.
pandas itself has also some plotting methods already included.
However, as I pointed out, your question differs from the example plot you added, so from this point on it's a little difficult to address your needs precisely.
You can aggregate all equal months for example, to create a plot over months. For those cases there is a groupby method for Dataframes:
data.groupby('MES')
This only returns a so-called groupby object, but you can tell it what you want to do with the grouped data, e.g.:
In: data.groupby('MES').sum()
Out:
     VALOR_REEMBOLSADO
MES
1           1558581.11
2           1951731.07
3           2225328.21
4           2248882.83
5           2256224.68
6           2216981.94
7           2053173.90
8           2372847.10
9           2161915.35
10          2355417.34
11          2294658.51
12          2938033.00
if you are interested in the sum within each month. The same for the average with data.groupby('MES').mean(). And for a first plot you could just add the plotting method like
data.groupby('MES').sum().plot()
which produces a line plot of the monthly sums.
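If both the sum and the mean per month are of interest, they can be computed in one step. This is a small sketch on made-up numbers (not taken from the real file), using the agg method of the grouped column:

```python
import pandas as pd

# Made-up values, only the column names follow the question.
data = pd.DataFrame({
    'MES': [1, 1, 2, 2, 2],
    'VALOR_REEMBOLSADO': [100.0, 300.0, 50.0, 150.0, 400.0],
})

# One row per month, one column per aggregation.
summary = data.groupby('MES')['VALOR_REEMBOLSADO'].agg(['sum', 'mean'])
print(summary)
```

The result is an ordinary dataframe indexed by month, so summary.plot() would work on it just like on the single-aggregation results.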
If you want to see the distribution and the mean value like in the picture in your question (but still plotted over months, not over states, because I don't see this information in your file) you could have a look at scatter plots:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(data['MES'],data['VALOR_REEMBOLSADO'])
plt.plot(data.groupby('MES').mean()['VALOR_REEMBOLSADO'], 'k_', ms=10)
which produces a scatter plot of all refunds per month, with each monthly mean marked by a black dash.
But since you mention seaborn in your tag list: this library provides a jitter plot like the one you reference, via stripplot. So this is finally the answer to the plotting part of your question, leading to this piece of code:
import pandas as pd
import seaborn as sns
filepath = 'https://raw.githubusercontent.com/gabrielacaesar/studyingPython/master/ceap-sf-new-12-04-2018.csv'
data = pd.read_csv(filepath, usecols=[1,9], decimal=',')
x = data['MES'].values
y = data['VALOR_REEMBOLSADO'].values
sns.stripplot(x=x, y=y, jitter=True)  # recent seaborn versions require x and y as keyword arguments
which produces the jitter plot.
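To make clear what "jitter" means, here is a rough sketch of the same idea with plain matplotlib and NumPy instead of seaborn. This is not how seaborn implements stripplot internally, just an illustration of the principle; the data is synthetic and the jitter width of 0.3 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 300 made-up refunds over 12 months.
data = pd.DataFrame({
    'MES': rng.integers(1, 13, size=300),          # months 1..12
    'VALOR_REEMBOLSADO': rng.exponential(1000.0, size=300),
})

# Jitter: shift each point sideways by a small random amount so that
# points sharing the same month no longer sit on top of each other.
jitter = rng.uniform(-0.3, 0.3, size=len(data))
x_jittered = data['MES'] + jitter

fig, ax = plt.subplots()
ax.scatter(x_jittered, data['VALOR_REEMBOLSADO'], s=8, alpha=0.5)
ax.set_xticks(range(1, 13))
ax.set_xlabel('MES')
ax.set_ylabel('VALOR_REEMBOLSADO')
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to see the figure.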
I am using PyCharm to create plots of data, and I am following along with a Kaggle tutorial on seaborn. The bar plot shows flight delays over 12 months. The tutorial shows 1-12 on the x axis, but when I run my own code the axis only goes up to 11.
I am very new to python, and coding in general and trying to self teach, but I'm having a lot of problems navigating pycharm and solving this issue.
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
print("Setup Complete")
flight_delays = pd.read_csv(r'C:\Users\Matt\Desktop\Portfolio Projects\seaborn_work\flight_delays.csv')
print(flight_delays)
plt.figure(figsize=(10,6))
sns.barplot(x=flight_delays.index, y=flight_delays['NK'])
plt.ylabel("Arrival delay (In minutes)")
plt.title("Average Arrival Delay for Spirit Airline Flights, by Month")
plt.show()
I have tried using xlim to show all 12 x values, but that does not work for me, or I don't understand how to use the command.
https://www.kaggle.com/code/alexisbcook/bar-charts-and-heatmaps/tutorial
here is the link to the tutorial I am following as well.
Thank you
I assume you use https://www.kaggle.com/datasets/alexisbcook/data-for-datavis?select=flight_delays.csv ?
When you read a CSV with pandas, the index defaults to numbers starting from 0. So if there are 12 months (rows), the index runs from 0 to 11, unlike the Month column, which contains the numbers 1 to 12.
You can either replace the x argument with the Month column instead of the index:
sns.barplot(x=flight_delays['Month'], y=flight_delays['NK'])
Or you first set the Month column as index and then use the same command you did before:
flight_delays.set_index('Month', inplace=True)
sns.barplot(x=flight_delays.index, y=flight_delays['NK'])
You could also just explicitly set which column should be used as index column when you use read_csv, then you do not have to make any other changes to your code:
flight_delays = pd.read_csv(r'C:\Users\Matt\Desktop\Portfolio Projects\seaborn_work\flight_delays.csv', index_col='Month')
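The difference between the default index and index_col can be checked on a tiny in-memory example. The three rows below are made up; only the column names 'Month' and 'NK' come from the question.

```python
import io
import pandas as pd

# Made-up three-month excerpt in the same shape as the tutorial data.
csv_text = "Month,NK\n1,11.4\n2,5.2\n3,7.6\n"

# Without index_col, pandas numbers the rows starting from 0.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col='Month', the month numbers become the index.
month_index = pd.read_csv(io.StringIO(csv_text), index_col='Month')

print(list(default_index.index))
print(list(month_index.index))
```

With the second dataframe, plotting against the index puts the real month numbers on the x axis.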
country = str(input())
import matplotlib.pyplot as plt
lines = f.readlines ()
x = []
y = []
results = []
for line in lines:
words = line.split(',')
f.close()
plt.plot(x,y)
plt.show()
First problem is in the title of the plot. It is giving Population inCountryI instead of Population in Country I.
Second problem is in the graph.
While my answer could point out the mistakes in your code, I think it might also be enlightening to show another, perhaps more standard way, of doing this. This is particularly useful if you're going to do this more often, or with large datasets.
Handling CSV files and creating subgroups out of them by yourself is nice, but can become very tricky. Python already has a built-in csv module, but the Pandas library is nowadays basically the default (there are other options as well) for handling tabular data. Which means it is widely available, and/or easy to install. Plus it goes well with Matplotlib. (Read some of Pandas' user's guide for a good overview.)
With Pandas, you can use the following (I've put comments on the code in between the actual code):
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize'] = (8, 8)
# Read the CSV file into a Pandas dataframe
# For a normal CSV, this will work fine without tweaks
df = pd.read_csv('population.csv')
# Convert the month and year columns to a datetime
# Years have to be converted to string type for that
# '%b%Y' is the format for month abbreviation (English) and 4-digit year;
# see e.g. https://strftime.org/
# Instead of creating a new column, we set the date as the index ("row-indices")
# of the dataframe
df.index = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
# We can remove the month and year columns now
df = df.drop(columns=['Month', 'Year'])
# For nicety, replace the dot in the country name with a space
df['Country'] = df['Country'].str.replace('.', ' ', regex=False)
# Group the dataframe by country, and loop over the groups
# The resulting grouped dataframes, `grouped`, will have just
# their index (date) values and population values
# The .plot() method will therefore automatically use
# the index/dates as x-axis, and the population as
# y-axis.
for country, grouped in df.groupby('Country'):
    # Use the convenience .plot() method
    grouped.plot()
    # Standard Matplotlib functions are still available
    plt.title(country)
The resulting plots are shown below (2, given the example data).
If you don't want a legend (since there is only one line), use grouped.plot(legend=None) instead.
If you want to pick one specific country, remove and replace the whole for-loop with the following
country = "Country II"
df[df['Country'] == country].plot()
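Since the original population.csv is not shown, here is a self-contained sketch of the same steps on made-up rows. The column name 'Population' and all values are assumptions for illustration; the other column names follow the code above. The English month abbreviations assume an English (or C) locale.

```python
import pandas as pd

# Made-up rows; 'Population' is a hypothetical column name.
df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'Year': [2020, 2020, 2020, 2020],
    'Country': ['Country.I', 'Country.I', 'Country.II', 'Country.II'],
    'Population': [100, 110, 200, 190],
})

# Build a datetime index from month abbreviation plus year, e.g. 'Jan2020'.
df.index = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')
df = df.drop(columns=['Month', 'Year'])

# Replace the dot in the country name with a space (literal, not regex).
df['Country'] = df['Country'].str.replace('.', ' ', regex=False)

# Boolean mask: keep only the rows of one country.
selected = df[df['Country'] == 'Country II']
print(selected)
```

Calling selected.plot() would then draw the population of that one country over the datetime index.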
If you want to do even more, also have a look at the Seaborn library.
Resulting plots:
Updated with more info
I've seen this answered on here for single line plots, but I need help with a plot showing two variables, if that matters at all... I am fairly new to python in general. My line graph shows two different departments' funding over the years. I just want to reformat the y axis to display as a number in the hundreds of millions.
Using a csv for the general public funding report of Minneapolis.
msp_df = pd.read_csv('Minneapolis_Data_Snapshot_v2.csv',error_bad_lines=False)
msp_df.info()
Saved just the two depts I was interested in, to a dataframe.
CPED_df = (msp_df['Unnamed: 0'] == 'CPED')
msp_df.iloc[CPED_df.values]
police_df = (msp_df['Unnamed: 0'] == 'Police')
msp_df.iloc[police_df.values]
("test" is the new name of my data frame containing all the info as seen below.)
test = pd.DataFrame({'Year': range(2014,2021),
'CPED': msp_df.iloc[CPED_df.values].T.reset_index(drop=True).drop(0,0)[5].tolist(),
'Police': msp_df.iloc[police_df.values].T.reset_index(drop=True).drop(0,0)[4].tolist()})
(The numbers from the original dataset were being read as strings because of the commas, so I had to fix that first.)
test['Police2'] = test['Police'].str.replace(',','').astype(int)
test['CPED2'] = test['CPED'].str.replace(',','').astype(int)
And here is my code for the plot. It executes; I just want to reformat the y axis number scale. Right now it shows up as a decimal. (I've already imported pandas, seaborn, and matplotlib.)
plt.plot(test.Year, test.Police2, test.Year, test.CPED2)
plt.ylabel('Budget in Hundreds of Millions')
plt.xlabel('Year')
Current plot
Any help super appreciated! Thanks :)
The easiest way to reformat the y axis and force it to show certain values is to use
plt.yticks(ticks, labels)
For example, if you only want to display values from 0 to 1, you can do:
plt.yticks([0,0.2,0.5,0.7,1], ['a', 'b', 'c', 'd', 'e'])
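As an alternative to listing every tick by hand, matplotlib's FuncFormatter can rescale the labels automatically. This is a different technique from the plt.yticks call above, sketched here on made-up budget figures; the helper name millions and the "M" suffix are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Made-up budget figures in raw units.
years = [2018, 2019, 2020]
budget = [150_000_000, 175_000_000, 190_000_000]


def millions(value, pos):
    """Format a raw tick value as whole millions, e.g. 150000000 -> '150M'."""
    return f"{value / 1e6:.0f}M"


fig, ax = plt.subplots()
ax.plot(years, budget)
# Only the tick labels change; the plotted data stays in raw units.
ax.yaxis.set_major_formatter(FuncFormatter(millions))
ax.set_ylabel('Budget (millions)')
```

The formatter receives each tick value and its position, and whatever string it returns is used as the label.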
I am an MPH Epidemiology student in a data science introduction class with just about NO programming experience. I have uploaded a json file into pycharm, converted it to a dataframe using
pub_num = pd.DataFrame(papers['Publication_Year'].value_counts())
Then reset the index using
pub_num = pub_num.reset_index()
After resetting the index, it took the whole numbers that were in my dataframe and added 5 zeros after a decimal point. Now I'm trying to plot the dataframe, and I can't plot it correctly because it's not recognizing whole numbers.
Why is it adding zeroes and how do I get rid of them? It is showing up fine in my console. No zeros. But then I look in the environment and 'view as dataframe' in the bottom right corner, I can see all the zeroes. screen shot showing the console with no zeroes and the dataframe with zeroes.
I've tried changing back to int using df.astype(int) and changing the precision to 0. But neither have worked.
import json
import pandas as pd
import matplotlib.pyplot as plt
# open and prints out the json file
with open('Papers.json') as file:
    data = json.load(file)
# convert to pandas dataframe.
papers = pd.read_json('Papers.json')
# creates a dataframe to count the number of publications in each year
pub_num = pd.DataFrame(papers['Publication_Year'].value_counts())
pub_num = pub_num.reset_index()
pub_num.columns = ['Publication_Year', 'Counts']
print(pub_num)
The output of the df is:
Publication_Year Counts
0 2010 10
1 2009 5
my code for the plot is this:
plt.scatter(x = 'Publication_Year', y = 'Counts', data = pub_num)
plt.xlabel('Publication Year')
plt.ticklabel_format(useOffset=False)
plt.show()
Plot using plt.ticklabel_format(useOffset=False)
Plot if I don't use the plt.ticklabel_format function
UPDATE:
So I took the suggestion of transforming to date time using:
pub_num['Publication_Year'] = pd.to_datetime(pub_num['Publication_Year'],format='%Y')
This is the graph that came out:
Graph using the conversion to years instead of integers
It's still adding extra numbers after the year, which is why I honestly believe it is because there are zeroes after the decimal point in my df, as shown in the first picture.
This has nothing to do with zeroes in your data frame.
In your first output, you have only two rows.
Publication_Year Counts
0 2010 10
1 2009 5
In plotting terms, you'll have two ordered pairs : (2009, 5) and (2010, 10). This means you'll have two points in your graph.
That's exactly what's shown in the plot you linked. Since 2009 and 2010 are plain numbers, matplotlib places additional ticks between them on the x axis for readability. These values don't mean anything; they are just part of the x axis, and you can change them with plt.xticks (or the set_xticks and set_xticklabels methods of the axes).
When you make your values datetime, your data will look something like this:
Publication_Year Counts
0 2010-01-01 10
1 2009-01-01 5
Again, you'll have two points in your plot. Matplotlib will again add ticks between these points for readability. Since the data starts in January 2009 and ends in January 2010, you'll see months such as March, April, or July in between.
Again, this has nothing to do with decimal points.
If you add plt.xticks([2009, 2010]) just before your plt.show() line, you'll force the plot to have just two ticks: 2009 and 2010.
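A minimal sketch of this fix, using the two rows printed in the question (the Agg backend is only there so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The two rows shown in the question's output.
pub_num = pd.DataFrame({'Publication_Year': [2010, 2009], 'Counts': [10, 5]})

plt.figure()
plt.scatter(x='Publication_Year', y='Counts', data=pub_num)
plt.xlabel('Publication Year')

# Pin the x axis to exactly the two years present in the data.
plt.xticks([2009, 2010])

# Read the tick positions back to confirm that only two remain.
ticks = plt.gca().get_xticks()
print(ticks)
```

For more years, passing sorted(pub_num['Publication_Year'].unique()) to plt.xticks would avoid typing the list by hand.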
This must be very simple, but I am not able to figure out how to do it. I am trying to plot the data in my dataset.
Below is my code:
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('TipsReceivedPerMeal.csv')
plt.scatter(dataset[0],dataset[1])
plt.show()
The data in my CSV file is some random data that specifies what tip a waiter received for a particular meal.
Data in CSV
MealNumber TipReceived
1 17
2 10
3 5
4 7
5 14
6 25
Thanks in advance for the help.
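Another option, which only works when the columns carry the integer labels 0 and 1 (rather than the names shown in the question), is to replace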
plt.scatter(dataset[0],dataset[1])
with
plt.scatter(dataset[[0]],dataset[[1]])
There are several options, some already mentioned in previous answers:
plt.scatter(dataset['MealNumber'],dataset['TipReceived']) (as mentioned by #Ankit Malik)
plt.scatter(dataset.iloc[:,0],dataset.iloc[:,1])
plt.scatter(dataset[[0]],dataset[[1]]) (as mentioned by #Miriam; this one requires columns labeled 0 and 1)
For those to work with the data from the question, one should use the delim_whitespace=True parameter (recent pandas versions deprecate it in favor of the equivalent sep=r'\s+'), as otherwise the read-in would not work:
dataset = pd.read_csv('TipsReceivedPerMeal.csv', delim_whitespace=True)
Just replace:
plt.scatter(dataset[0],dataset[1])
With:
plt.scatter(dataset['MealNumber'],dataset['TipReceived'])
In pandas, columns can be referenced either by name or by position with iloc.
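Both ways of selecting a column can be compared on the first rows of the data shown in the question. This sketch reads them from an in-memory string and assumes the columns are separated by whitespace, as they appear in the question:

```python
import io
import pandas as pd

# First three rows of the data from the question, whitespace-separated.
text = "MealNumber TipReceived\n1 17\n2 10\n3 5\n"
dataset = pd.read_csv(io.StringIO(text), sep=r'\s+')

# Select the same column once by name and once by position.
by_name = dataset['TipReceived']
by_position = dataset.iloc[:, 1]

print(by_name.equals(by_position))
```

dataset[1], by contrast, looks for a column that is literally labeled 1, which is why the original call fails.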