Plotting from dataset in Python

This must be very simple, but I am not able to figure out how to do it. I am trying to plot the data present in my dataset.
Below is my code:
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('TipsReceivedPerMeal.csv')
plt.scatter(dataset[0],dataset[1])
plt.show()
The data in my CSV file is some random data that specifies the tip a waiter receives on a particular day.
Data in CSV
MealNumber TipReceived
1 17
2 10
3 5
4 7
5 14
6 25
Thanks in advance for the help.

Another option is to replace
plt.scatter(dataset[0],dataset[1])
with
plt.scatter(dataset[[0]],dataset[[1]])

There are several options, some already mentioned in previous answers:
plt.scatter(dataset['MealNumber'],dataset['TipReceived']) (as mentioned by #Ankit Malik)
plt.scatter(dataset.iloc[:,0],dataset.iloc[:,1])
plt.scatter(dataset[[0]],dataset[[1]]) (as mentioned by #Miriam; note that this only works when the columns are actually labelled 0 and 1, which is not the case when the header row is read as column names)
For these to work with the data from the question, one should use the delim_whitespace=True parameter (deprecated in recent pandas versions in favour of sep=r'\s+'), as otherwise the file would not be split into two columns:
dataset = pd.read_csv('TipsReceivedPerMeal.csv', delim_whitespace=True)

Just replace:
plt.scatter(dataset[0],dataset[1])
With:
plt.scatter(dataset['MealNumber'],dataset['TipReceived'])
In pandas, columns can be referenced either by name or by position with iloc.
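A minimal sketch of the two ways to select columns, using the sample rows from the question. The file is replaced by an in-memory string (an assumption made so the snippet runs on its own), and sep=r"\s+" is used as the whitespace separator:

```python
import io

import pandas as pd

# Sample rows copied from the question; an in-memory buffer stands in
# for 'TipsReceivedPerMeal.csv' so the snippet is self-contained.
raw = """MealNumber TipReceived
1 17
2 10
3 5
4 7
5 14
6 25
"""

# Split on runs of whitespace; the first line becomes the column names.
dataset = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Select a column by name ...
by_name = dataset["MealNumber"]
# ... or by position with iloc (all rows, first column).
by_position = dataset.iloc[:, 0]
print(by_name.equals(by_position))

# dataset[0] looks for a column *labelled* 0, which does not exist here.
try:
    dataset[0]
except KeyError:
    print("no column labelled 0")
```

Either selection can then be passed to plt.scatter as the x values, with dataset["TipReceived"] or dataset.iloc[:, 1] as the y values.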

Related

Why does my bar plot in python cut off part of the x variable when plotting?

I am using PyCharm to create plots of data, and I am following along with a Kaggle tutorial on seaborn. The bar plot shows flight delays over 12 months. In the tutorial the x axis shows 1-12, but when I run the code myself it only goes up to 11.
I am very new to Python and to coding in general, and I am trying to teach myself, but I'm having a lot of problems navigating PyCharm and solving this issue.
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
print("Setup Complete")
flight_delays = pd.read_csv(r'C:\Users\Matt\Desktop\Portfolio Projects\seaborn_work\flight_delays.csv')
print(flight_delays)
plt.figure(figsize=(10,6))
sns.barplot(x=flight_delays.index, y=flight_delays['NK'])
plt.ylabel("Arrival delay (In minutes)")
plt.title("Average Arrival Delay for Spirit Airline Flights, by Month")
plt.show()
I have tried using xlim to show all 12 x values, but that does not work for me, or I don't understand how to use the command.
https://www.kaggle.com/code/alexisbcook/bar-charts-and-heatmaps/tutorial
Here is the link to the tutorial I am following as well.
Thank you
I assume you use https://www.kaggle.com/datasets/alexisbcook/data-for-datavis?select=flight_delays.csv ?
When you read a CSV with pandas, by default the index will be numbers starting from 0. So if there are 12 months (rows), the index runs from 0 to 11, unlike the Month column, which contains the numbers 1 to 12.
You can either replace the x argument with the Month column instead of the index:
sns.barplot(x=flight_delays['Month'], y=flight_delays['NK'])
Or you first set the Month column as index and then use the same command you did before:
flight_delays.set_index('Month', inplace=True)
sns.barplot(x=flight_delays.index, y=flight_delays['NK'])
You could also just explicitly set which column should be used as index column when you use read_csv, then you do not have to make any other changes to your code:
flight_delays = pd.read_csv(r'C:\Users\Matt\Desktop\Portfolio Projects\seaborn_work\flight_delays.csv', index_col='Month')
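To see the difference between the default index and the Month column without the Kaggle file, here is a small sketch. The CSV is built in memory and the NK values are made up; only the column names Month and NK come from the question:

```python
import io

import pandas as pd

# Hypothetical stand-in for flight_delays.csv: 12 rows, one per month.
# The NK numbers are invented purely for illustration.
raw = "Month,NK\n" + "".join(f"{month},{month * 1.5}\n" for month in range(1, 13))

# Default behaviour: pandas creates a fresh index 0..11.
default_index = pd.read_csv(io.StringIO(raw))
print(list(default_index.index))

# With index_col, the Month column (1..12) becomes the index.
month_index = pd.read_csv(io.StringIO(raw), index_col="Month")
print(list(month_index.index))
```

With the second frame, x=month_index.index gives the 1 to 12 labels that the tutorial shows.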

Python Group by function

First of all, thanks to everyone here helping out others. I'm currently undergoing a career change and working on a dataset with COVID data as a training project.
The issue I am having is that I have daily reports of COVID data, and since that adds up to a massive amount of data over the last two years, I want to group the data by year_month, so that all the data from each month of each year is combined for more accessible and better visualizations/scatterplots.
First, I changed the data type of the date column, as it was displayed as an object.
dfgermany['date'] = pd.to_datetime(dfgermany['date'])
#Scatterplot between total vaccinations and people fully vaccinated in Germany
plt.scatter(dfgermany['date'].tolist(), dfgermany['people_fully_vaccinated'])
plt.title('Date vs People fully Vaccinated')
#plt.xlabel('Date from 2020-01 until 2022-04-19')
plt.ylabel('People Fully Vaccinated')
plt.show()
However, the result is not visually pleasing, as the dates on the axis overlap.
So I am thinking of somehow grouping the data and displaying each month's summary.
After looking up this for quite some time, I worked on this.
dfgermany['year_month'] = dfgermany['date'].dt.to_period('M')
This created a new column called year_month [name to be edited].
Following this, I used
dfgermany.groupby(['year_month']).sum()
which gave me the exact result I wanted, but I somehow did not get this to be "saved."
Dropping the date column is not an issue there.
dfgermany.drop(['date'], axis=1)
That leaves only the year_month column for each day's data point.
After this, I changed the data type back to DateTime
dfgermany['year_month'] = dfgermany['year_month'].astype(str)
dfgermany['year_month'] = pd.to_datetime(dfgermany['year_month'])
This is the result I'm getting.
The problem I'm having now is that the axis labels are still not readable, and the data points are stacked on top of each other rather than grouped. When I use the groupby function, the result is changed but not saved, so I cannot use it for a scatterplot.
dfgermany.groupby(['year_month']).sum()
The output looks exactly like I want but is not saved for the scatterplot... I researched this the whole day and didn't get any further. Can someone here assist, or better enlighten me on where I made a mistake?
Sorry for the wall of text on my first post
UPDATE
The base code/test set I have used for all trials so far is this one:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
#This option ensures that the graphs you create are displayed within the notebook without the need to "call" them specifically.
%matplotlib inline
#creating path
path = r'C:\Users\stefa\Jupyter Analysis\14-04-2022 Achievement6'
#importing Dataframe
dfgermany = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'CovidDE.csv'), index_col = False)
#Matplotlib known issue checkup with version
matplotlib.__version__
#Data Consistency checkup
dfgermany
dfgermany.dtypes
dfgermany['date'] = pd.to_datetime(dfgermany['date'])
I am fairly new to all this, so I'm truly sorry if I'm asking about things that are probably easy. I didn't know how to add the test dataset I have from Our World in Data: https://ourworldindata.org/coronavirus/country/germany
Since you haven't provided any sample input data or any expected output data, it is difficult to reproduce your problem, so forgive any of the following assumptions that may be incorrect.
Given an input dataframe of the form:
  Movie  Day  Length  View_Time
0     A    2     2.6        1.8
1     A    3     2.6        2.6
2     B    1     5.8        2.9
3     A    4     2.6        2.6
4     B    6     5.8        4.2
5     A    0     2.6        0.5
6     B    3     5.8        2.0
xdf = df.groupby(['Movie', 'Day']).Length.sum().to_frame()
will create a new dataframe
           Length
Movie Day
A     0       2.6
      2       2.6
      3       2.6
      4       2.6
B     1       5.8
      3       5.8
      6       5.8
Since you seem to be wanting to create a new dataframe that you can plot, this should provide the result you are looking for.
If this isn't the answer you need, please provide sufficient data so we can reproduce your problem
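The part that matters for the "not saved" problem is the assignment: groupby returns a new object and leaves the original frame untouched. A runnable sketch using the example data above (the name plot_df and the reset_index step are additions for illustration, not part of the original answer):

```python
import pandas as pd

# The example frame from this answer, built by hand.
df = pd.DataFrame({
    "Movie": ["A", "A", "B", "A", "B", "A", "B"],
    "Day": [2, 3, 1, 4, 6, 0, 3],
    "Length": [2.6, 2.6, 5.8, 2.6, 5.8, 2.6, 5.8],
    "View_Time": [1.8, 2.6, 2.9, 2.6, 4.2, 0.5, 2.0],
})

# groupby does not change df in place; the grouped result only
# persists if it is assigned to a name.
xdf = df.groupby(["Movie", "Day"]).Length.sum().to_frame()

# Optional: turn the group keys back into ordinary columns, which is
# often more convenient for plotting.
plot_df = xdf.reset_index()
print(plot_df)
```

Applied to the question, the same idea would be something like monthly = dfgermany.groupby(['year_month']).sum() and then plotting from monthly rather than from dfgermany. The same holds for dfgermany.drop(['date'], axis=1), which also returns a new frame instead of modifying dfgermany.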

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains codes that we use to identify materials. A material code looks like this -> 3A8356. There are different material codes in the same column, and they repeat a lot. I want to identify them and make a list that contains each code only once, with no repeats. Is there a way I can analyze the column and extract the repeated codes so that I can make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
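Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials']). Leave keep at its default of 'first': with keep=False, every code that occurs more than once would be dropped entirely instead of being kept once.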
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
the subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
According to the docs, the new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to renumber the index afterwards, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
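A small sketch with the six example codes from the question, built in memory instead of read from Excel (an assumption so that it runs without the file). It shows what the keep argument does and adds a sort_values step to match the ordering in the desired output:

```python
import pandas as pd

# The example column from the question.
df = pd.DataFrame(
    {"Materials": ["3A8356", "3A8376", "3A8356", "3A8356", "3A8346", "3A8346"]}
)

# Default keep='first': one row per distinct code survives.
unique_codes = (
    df.drop_duplicates(subset=["Materials"])
    .sort_values("Materials")
    .reset_index(drop=True)
)
print(unique_codes["Materials"].tolist())
# ['3A8346', '3A8356', '3A8376']

# keep=False: every code that appears more than once is removed entirely,
# so only codes that occur exactly once remain.
only_once = df.drop_duplicates(subset=["Materials"], keep=False)
print(only_once["Materials"].tolist())
# ['3A8376']
```

Since the header in the question sits in row 12 of the sheet, reading the real file will probably need something like pd.read_excel(path_to_file, header=11), though that depends on how the sheet is laid out.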

How to draw a plot like this using Python? [duplicate]

I have a CSV file with 27,000 lines. I am trying to create a jitter plot, just like this one: https://static1.squarespace.com/static/56fd706140261df95349d4bd/t/59297c72579fb3d813d591c1/1495891103667/Jitter+Example+The+Truthful+Art.png?format=1000w
The 'y' axis would be the column called "VALOR_REEMBOLSADO" (stands for "refund value"). The 'x' axis would be the column called "MES" (stands for "month").
It represents the spending of Brazilian senators in 2017. The CSV file is very organized, but originally has "VALOR_REEMBOLSADO" as a string and not as a float. I replaced "," with ".", but I still can't plot the chart.
Can someone help me with the code? What code can create a chart like that?
Here you find the CSV file of the year 2017: https://www12.senado.leg.br/transparencia/dados-abertos-transparencia/dados-abertos-ceaps
First, I have to admit that I cannot understand some aspects of your question (the first link doesn't work and, even more importantly, you want an x-axis that shows the months, but in the plot the data is shown over states).
But I see that your problems start already at the very beginning of reading the data in, so I'll try to give you the needed hints to start:
For reading in csv-data like this, I'd recommend pandas, usually imported with
import pandas as pd
It has a CSV reader included, which is quite powerful. Generally, you should avoid manually tweaking the data sources you have (like changing decimal signs etc.), because this is something that is already addressed by importer functions like read_csv (and you don't want to do this again and again in the future with new data files but the same plot generation):
filepath = 'wherever/file/may/roam/2017.csv'
data = pd.read_csv(filepath, skiprows=1, sep=';', usecols=[1, 9], decimal=',')
With filepath you tell the importer where you stored the csv-file, skiprows=1 says that you're not interested in the first line of the file, sep defines the delimiter between the columns and via usecols you can pick only the columns of interest, 'MES' and 'VALOR_REEMBOLSADO' in your example.
decimal specifies the decimal sign of float numbers in your data.
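To try these parameters without downloading the file, here is a sketch on a tiny in-memory sample. The extra first line, the SENADOR column and its values are made up for illustration; the two column names of interest and the numbers are taken from the output shown in this answer. Columns are picked by name here rather than by position:

```python
import io

import pandas as pd

# Hypothetical miniature of the senate file: one line to skip, then a
# semicolon-separated table with decimal commas.
raw = (
    "some first line that is not part of the table\n"
    "MES;SENADOR;VALOR_REEMBOLSADO\n"
    "1;A;97,00\n"
    "1;B;1958,95\n"
    "2;A;62,30\n"
)

sample = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,                           # ignore the first line
    sep=";",                              # column delimiter
    usecols=["MES", "VALOR_REEMBOLSADO"],  # keep only these columns
    decimal=",",                          # parse 1958,95 as 1958.95
)
print(sample.dtypes)
print(sample)
```

Because decimal=',' is handled by the reader, VALOR_REEMBOLSADO arrives as a float column and no manual string replacement is needed.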
Now data contains a pandas dataframe of your data:
In: data[:10]
Out:
MES VALOR_REEMBOLSADO
0 1 97.00
1 1 6000.00
2 1 418.04
3 1 1958.95
4 1 1178.67
5 1 1252.65
6 2 62.30
7 2 240.81
8 2 6000.00
9 2 2062.25
So this should be already something you can play around with.
This data can now be plotted with matplotlib or seaborn if you like.
pandas itself has also some plotting methods already included.
However, your question differs from the example plot you added, as I pointed out, so from this point on it's a little difficult to address your needs precisely.
You can aggregate all equal months for example, to create a plot over months. For those cases there is a groupby method for Dataframes:
data.groupby('MES')
This only returns a so-called groupby object, but you can tell it what you want to do with the grouped data, e.g.:
In: data.groupby('MES').sum()
Out:
VALOR_REEMBOLSADO
MES
1 1558581.11
2 1951731.07
3 2225328.21
4 2248882.83
5 2256224.68
6 2216981.94
7 2053173.90
8 2372847.10
9 2161915.35
10 2355417.34
11 2294658.51
12 2938033.00
if you are interested in the sum within each month. The same for the average with data.groupby('MES').mean(). And for a first plot you could just add the plotting method like
data.groupby('MES').sum().plot()
which produces
If you want to see the distribution and the mean value like in the picture in your question (but still plotted over months, not over states, because I don't see this information in your file) you could have a look at scatter plots:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(data['MES'],data['VALOR_REEMBOLSADO'])
plt.plot(data.groupby('MES').mean()['VALOR_REEMBOLSADO'], 'k_', ms=10)
which produces
But since you mention seaborn in your tag list: this library provides a jitter plot like the one you reference via stripplot. So this is finally the answer to the plotting part of your question, leading to this piece of code:
import pandas as pd
import seaborn as sns
filepath = 'https://raw.githubusercontent.com/gabrielacaesar/studyingPython/master/ceap-sf-new-12-04-2018.csv'
data = pd.read_csv(filepath, usecols=[1,9], decimal=',')
x = data['MES'].values
y = data['VALOR_REEMBOLSADO'].values
sns.stripplot(x=x, y=y, jitter=True)
which produces

Force Pandas to keep multiple columns with the same name

I'm building a program that collects data and adds it to an ongoing excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to Excel:
newColNames = ['x','x','x'.....]
df.to_excel(path,header=newColNames)
You can add spaces to the end of the column name. It will look the same in Excel, but pandas can tell the columns apart as long as each name has a different number of trailing spaces.
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x','x ','x  '])
df
x x x
0 1 2 3
1 4 5 6
2 7 8 9
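A sketch that combines both ideas: keep distinct names while working in pandas, and only write identical headers on the final output. To avoid needing an Excel engine, to_csv with an in-memory buffer is used here as a stand-in for to_excel (that swap is an assumption of this example); both accept a list of strings for header:

```python
import io

import pandas as pd

# Distinct names internally: each 'x' has a different number of
# trailing spaces, so pandas can tell the columns apart.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["x", "x ", "x  "])
print(df["x "].tolist())

# On output, replace the headers with identical names.
buffer = io.StringIO()
df.to_csv(buffer, header=["x", "x", "x"], index=False)
print(buffer.getvalue())

# Reading a file with repeated headers back in is where the x.1, x.2
# names come from: pandas renames duplicates on load.
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(reloaded.columns))
```

So the renaming only has to happen at the very last step, when the sheet is written.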
