Python Group by function

First of all, thanks to everyone here helping out others. I'm currently undergoing a career change and working on a dataset of COVID data as a training project.
The issue I'm having is that I have daily reports of COVID data, but since the last two years add up to a massive amount of data, I want to group it by year_month so that all the data from each month of each year gets combined, for more accessible and better visualizations/scatterplots.
First, I changed the data type of the date column, as it was displayed as an object.
dfgermany['date'] = pd.to_datetime(dfgermany['date'])
#Scatterplot between total vaccinations and people fully vaccinated in Germany
plt.scatter(dfgermany['date'], dfgermany['people_fully_vaccinated'])
plt.title('Date vs People fully Vaccinated')
#plt.xlabel('Date from 2020-01 until 2022-04-19')
plt.ylabel('People Fully Vaccinated')
plt.show()
However, the result is visually not very pleasing, as the dates on the axis overlap.
So I am thinking of somehow grouping the data and displaying a summary for each month.
After looking this up for quite some time, I came up with this:
dfgermany['year_month'] = dfgermany['date'].dt.to_period('M')
which created a new column called year_month (name to be edited).
Following this, I used
dfgermany.groupby(['year_month']).sum()
which gave me the exact result I wanted, but I somehow did not manage to get this result "saved".
Dropping the date column is not an issue there:
dfgermany.drop(['date'], axis=1)
leaving only the year_month column for each day's datapoint.
After this, I changed the data type back to datetime:
dfgermany['year_month'] = dfgermany['year_month'].astype(str)
dfgermany['year_month'] = pd.to_datetime(dfgermany['year_month'])
This is the result I'm getting.
The problem now is that the axis labels are still not readable, and the data is stacked on top of each other rather than grouped. When I use the groupby function, the result is changed but not saved, so I can't use it for a scatterplot.
dfgermany.groupby(['year_month']).sum()
The output looks exactly like I want, but it is not saved for the scatterplot... I researched this the whole day and didn't get any further. Can someone here assist, or better, enlighten me on where I made a mistake?
Sorry for the wall of text on my first post.
UPDATE
The base code/test set I used for all trials so far is this one:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
#This option ensures that the graphs you create are displayed within the notebook without the need to "call" them specifically.
%matplotlib inline
#creating path
path = r'C:\Users\stefa\Jupyter Analysis\14-04-2022 Achievement6'
#importing Dataframe
dfgermany = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'CovidDE.csv'), index_col = False)
#Matplotlib version check (known-issue checkup)
matplotlib.__version__
#Data Consistency checkup
dfgermany
dfgermany.dtypes
dfgermany['date'] = pd.to_datetime(dfgermany['date'])
I am fairly new to all this, so I'm truly sorry if I'm asking probably easy questions. I didn't know how I could add the test dataset I have from Our World in Data: https://ourworldindata.org/coronavirus/country/germany

Since you haven't provided any sample input data or expected output, it is difficult to reproduce your problem, so forgive any of the following assumptions, which may be incorrect.
Given an input dataframe of the form:
Movie Day Length View_Time
0 A 2 2.6 1.8
1 A 3 2.6 2.6
2 B 1 5.8 2.9
3 A 4 2.6 2.6
4 B 6 5.8 4.2
5 A 0 2.6 0.5
6 B 3 5.8 2.0
xdf = df.groupby(['Movie', 'Day']).Length.sum().to_frame()
will create a new dataframe
Length
Movie Day
A 0 2.6
2 2.6
3 2.6
4 2.6
B 1 5.8
3 5.8
6 5.8
Since you seem to want to create a new dataframe that you can plot, this should provide the result you are looking for.
If this isn't the answer you need, please provide sufficient data so we can reproduce your problem.
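Applied to the question's dataframe, the missing step is simply assigning the groupby result to a variable. A minimal sketch with invented numbers (the column names are taken from the question, the values are made up):

```python
import pandas as pd

# Hypothetical data shaped like the question's dfgermany
dfgermany = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-17"]),
    "people_fully_vaccinated": [100, 150, 200, 250],
})

dfgermany["year_month"] = dfgermany["date"].dt.to_period("M")

# The key step: assign the groupby result so it is "saved"
monthly = dfgermany.groupby("year_month")["people_fully_vaccinated"].sum().reset_index()

# Convert the Period back to a timestamp so matplotlib can plot it
monthly["year_month"] = monthly["year_month"].dt.to_timestamp()
print(monthly)
```

The `monthly` frame can then be passed to `plt.scatter(monthly['year_month'], monthly['people_fully_vaccinated'])`, giving one point per month instead of one per day.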

Related

Why does my bar plot in python cut off part of the x variable when plotting?

I am using PyCharm to create plots of data, and I am following along with a Kaggle tutorial on seaborn. The bar plot shows flight delays across 12 months; in the tutorial it shows 1-12 on the x axis, but when I execute this in my own code it shows only up to 11.
I am very new to Python, and to coding in general, and am trying to self-teach, but I'm having a lot of problems navigating PyCharm and solving this issue.
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
print("Setup Complete")
flight_delays = pd.read_csv(r'C:\Users\Matt\Desktop\Portfolio Projects\seaborn_work\flight_delays.csv')
print(flight_delays)
plt.figure(figsize=(10,6))
sns.barplot(x=flight_delays.index, y=flight_delays['NK'])
plt.ylabel("Arrival delay (In minutes)")
plt.title("Average Arrival Delay for Spirit Airline Flights, by Month")
plt.show()
I have tried using xlim to show all 12 x values, but that does not work for me, or I don't understand how to use the command.
https://www.kaggle.com/code/alexisbcook/bar-charts-and-heatmaps/tutorial
Here is the link to the tutorial I am following as well.
Thank you
I assume you are using https://www.kaggle.com/datasets/alexisbcook/data-for-datavis?select=flight_delays.csv ?
When you read a CSV with pandas, the index will by default be numbers starting from 0. So if there are 12 months (rows), the index runs 0 to 11, unlike the Month column, which contains the numbers 1 to 12.
You can either replace the x argument with the Month column instead of the index:
sns.barplot(x=flight_delays['Month'], y=flight_delays['NK'])
Or you can first set the Month column as the index and then use the same command as before:
flight_delays.set_index('Month', inplace=True)
sns.barplot(x=flight_delays.index, y=flight_delays['NK'])
You could also explicitly set which column should be used as the index when you call read_csv; then you do not have to make any other changes to your code:
flight_delays = pd.read_csv(r'C:\Users\Matt\Desktop\Portfolio Projects\seaborn_work\flight_delays.csv', index_col='Month')
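The index-versus-Month distinction can be sketched with a toy frame standing in for flight_delays (the column names mirror the question, the values are invented):

```python
import pandas as pd

# Toy frame standing in for flight_delays
flight_delays = pd.DataFrame({"Month": range(1, 13), "NK": [5.0] * 12})

# After read_csv, the default index runs 0..11, while Month runs 1..12
print(flight_delays.index.tolist()[:3])     # first labels of the default index
print(flight_delays["Month"].tolist()[:3])  # first values of the Month column

# Setting Month as the index makes the bar labels run 1..12
flight_delays = flight_delays.set_index("Month")
print(flight_delays.index.tolist()[-1])
```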

How do I format my training data for an LSTM network using Keras when I have multiple varying length time-series data? [duplicate]

This question already has an answer here:
Multivariate LSTM with missing values
(1 answer)
Closed 2 years ago.
I have two sets of training data of different lengths. I'll call these series the x_train data. Their shapes are (70480, 7) and (69058, 7), respectively. Each column represents a different sensor reading.
I am trying to use an LSTM network on this data. Should I merge the data into one object? How would I do that?
I also have two sets of data that are the resultant output from the x_train data. These are both of size (315, 1). Would I use these as my y_train data?
So far I have read the data using pandas.read_csv() as follows:
c4_x_train = pd.read_csv('path')
c4_y_train = pd.read_csv('path')
c6_x_train = pd.read_csv('path')
c6_y_train = pd.read_csv('path')
Any clarification is appreciated. Thanks!
Just a few points:
For fast file reading, consider using a different format like Parquet or Feather. Be careful about deprecation, though; for long-term storage, CSV is just fine.
pd.concat is your friend here. Use it like this:
from pathlib import Path
import pandas as pd

dir_path = Path(r"yourFolderPath")  # wrap the folder in a Path object so .glob works
files_list = [str(p) for p in dir_path.glob("**/*.csv")]
if files_list:
    source_dfs = [pd.read_csv(file_) for file_ in files_list]
    df = pd.concat(source_dfs, ignore_index=True)
This df then you can use to do your training.
Now, regarding the training: well, that really depends, as always. If you have the datetime in those CSVs and they are continuous, go right ahead. If you have breaks between the measurements, you might run into problems. Depending on trend, seasonality and noise, you could interpolate missing data. There are multiple approaches, such as the naive approach, filling with the mean, forecasting from the preceding values, and many more. There is no right or wrong; it really depends on what your data looks like.
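A small sketch of three common fill strategies on a toy series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy sensor series with one gap
s = pd.Series([1.0, 2.0, np.nan, 4.0])

print(s.fillna(s.mean()).tolist())  # fill the gap with the series mean
print(s.ffill().tolist())           # carry the previous value forward
print(s.interpolate().tolist())     # linear interpolation between neighbours
```

Which one is appropriate depends on the signal: forward-fill suits slowly changing sensors, interpolation suits smooth trends, and mean-fill is only a crude fallback.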
EDIT: Comments don't like code blocks, so here is a worked example:
#df1:
time value
1 1.4
2 2.5
#df2:
time value
3 1.1
4 1.0
#df1 and df2 are glued together by df = pd.concat([df1, df2], ignore_index=True) to become:
time value
1 1.4
2 2.5
3 1.1
4 1.0

How to draw a plot like this using Python? [duplicate]

I have a CSV file with 27,000 lines. I am trying to create a jitter plot, just like this one: https://static1.squarespace.com/static/56fd706140261df95349d4bd/t/59297c72579fb3d813d591c1/1495891103667/Jitter+Example+The+Truthful+Art.png?format=1000w
The 'y' axis would be the column called "VALOR_REEMBOLSADO" (stands for "refund value"). The 'x' axis would be the column called "MES" (stands for "month").
It represents the spending of Brazilian senators in 2017. The CSV file is well organized, but originally has "VALOR_REEMBOLSADO" as a string rather than a float. I replaced the "," with ".", but I still can't plot the chart.
Can someone help me with the code? What code can create a chart like that?
Here you find the CSV file of the year 2017: https://www12.senado.leg.br/transparencia/dados-abertos-transparencia/dados-abertos-ceaps
At first I have to admit that I cannot understand some aspects of your question (the first link doesn't work, and, more importantly, you want an x-axis showing the months, yet in the example plot the data is shown over states).
But I see that your problems start at the very beginning of reading the data in, so I'll try to give you the hints you need to get started:
For reading in csv-data like this, I'd recommend pandas, usually imported with
import pandas as pd
It has a CSV reader included, which is quite powerful. Generally, you should avoid manually tweaking your data sources (like changing decimal signs), because this is something already addressed by importer functions like read_csv (and you don't want to do it again and again in the future with new data files but the same plot generation):
filepath = 'wherever/file/may/roam/2017.csv'
data = pd.read_csv(filepath, skiprows=1, sep=';', usecols=[1, 9], decimal=',')
With filepath you tell the importer where you stored the csv-file, skiprows=1 says that you're not interested in the first line of the file, sep defines the delimiter between the columns and via usecols you can pick only the columns of interest, 'MES' and 'VALOR_REEMBOLSADO' in your example.
decimal specifies the decimal sign of float numbers in your data.
Now data contains a pandas dataframe of your data:
In: data[:10]
Out:
MES VALOR_REEMBOLSADO
0 1 97.00
1 1 6000.00
2 1 418.04
3 1 1958.95
4 1 1178.67
5 1 1252.65
6 2 62.30
7 2 240.81
8 2 6000.00
9 2 2062.25
So this should be already something you can play around with.
This data can now be plotted with matplotlib or seaborn if you like.
pandas itself has also some plotting methods already included.
However, your question differs from the example plot you added, as I pointed out, so from this point on it's a little difficult to help precisely your needs.
You can aggregate all equal months, for example, to create a plot over months. For those cases there is a groupby method for DataFrames:
data.groupby('MES')
This only returns a so-called GroupBy object, but you can tell it what you want to do with the grouped data, e.g.:
In: data.groupby('MES').sum()
Out:
VALOR_REEMBOLSADO
MES
1 1558581.11
2 1951731.07
3 2225328.21
4 2248882.83
5 2256224.68
6 2216981.94
7 2053173.90
8 2372847.10
9 2161915.35
10 2355417.34
11 2294658.51
12 2938033.00
if you are interested in the sum within each month. The same for the average with data.groupby('MES').mean(). And for a first plot you could just add the plotting method like
data.groupby('MES').sum().plot()
which produces
If you want to see the distribution and the mean value like in the picture in your question (but still plotted over months, not over states, because I don't see this information in your file) you could have a look at scatter plots:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(data['MES'],data['VALOR_REEMBOLSADO'])
plt.plot(data.groupby('MES').mean()['VALOR_REEMBOLSADO'], 'k_', ms=10)
which produces
But as you mention seaborn in your tag list: this library provides a jitter plot like the one you reference via stripplot. So this is finally the answer to the plotting part of your question, leading to this piece of code:
import pandas as pd
import seaborn as sns
filepath = 'https://raw.githubusercontent.com/gabrielacaesar/studyingPython/master/ceap-sf-new-12-04-2018.csv'
data = pd.read_csv(filepath, usecols=[1,9], decimal=',')
x = data['MES'].values
y = data['VALOR_REEMBOLSADO'].values
sns.stripplot(x=x, y=y, jitter=True)
which produces
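Since the linked CSV may not always be reachable, here is a self-contained sketch of the same stripplot call on synthetic data (the column names mirror the question; the numbers are invented):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import seaborn as sns

# Synthetic stand-in for the senate expenses data
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "MES": rng.integers(1, 13, size=200),
    "VALOR_REEMBOLSADO": rng.gamma(2.0, 500.0, size=200),
})

# Modern seaborn prefers keyword arguments and a dataframe input
ax = sns.stripplot(data=data, x="MES", y="VALOR_REEMBOLSADO", jitter=True)
ax.figure.savefig("jitter.png")
```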

Plotting from dataset in Python

This must be very simple, but I am not able to figure out how to do it. I am trying to plot the data present in my dataset.
Below is my code ,
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('TipsReceivedPerMeal.csv')
plt.scatter(dataset[0],dataset[1])
plt.show()
The data in my CSV file is some random data specifying what tip a waiter received on a particular day.
Data in CSV
MealNumber TipReceived
1 17
2 10
3 5
4 7
5 14
6 25
Thanks in advance for the help.
Another option is to replace
plt.scatter(dataset[0],dataset[1])
with
plt.scatter(dataset[[0]],dataset[[1]])
There are several options, some already mentioned in previous answers:
plt.scatter(dataset['MealNumber'],dataset['TipReceived']) (as mentioned by #Ankit Malik)
plt.scatter(dataset.iloc[:,0],dataset.iloc[:,1])
plt.scatter(dataset[[0]],dataset[[1]]) (as mentioned by #Miriam)
In order for these to work with the data from the question, one should use the delim_whitespace=True parameter, as otherwise the read-in would not work:
dataset = pd.read_csv('TipsReceivedPerMeal.csv', delim_whitespace=True)
Just replace:
plt.scatter(dataset[0],dataset[1])
With:
plt.scatter(dataset['MealNumber'],dataset['TipReceived'])
In Pandas columns can either be referenced by name or by column number with iloc.
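A minimal sketch of both referencing styles, using a toy version of the tips data from the question:

```python
import pandas as pd

# Toy version of the tips data
dataset = pd.DataFrame({"MealNumber": [1, 2, 3], "TipReceived": [17, 10, 5]})

by_name = dataset["TipReceived"]   # reference by column name
by_position = dataset.iloc[:, 1]   # reference by column number

print(by_name.equals(by_position))  # both select the same column
```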

Pandas: creating an indexed time series [starting from 100] from returns data

I have data on logarithmic returns of a variable in a Pandas DataFrame. I would like to turn these returns into an indexed time series which starts from 100 (or any arbitrary number). This kind of operation is very common for example when creating an inflation index or when comparing two series of different magnitude:
So the first value in, say, Jan 1st 2000 is set to equal 100 and the next value in Jan 2nd 2000 equals 100 * exp(return_2000_01_02) and so on. Example below:
I know that I can loop through the rows of a Pandas DataFrame using .iteritems(), as presented in this SO question:
iterating row by row through a pandas dataframe
I also know that I can turn the DataFrame into a numpy array, loop through the values in that array and turn the numpy array back to a Pandas DataFrame. The .as_matrix() method is explained here:
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.Series.html
An even simpler way to do it is to iterate the rows by using the Python and numpy indexing operators [] as documented in Pandas indexing:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
The problem is that all these solutions (except for iteritems) work "outside" Pandas and are, according to what I have read, inefficient.
Is there a way to create an indexed time series using purely Pandas? And if not, could you please suggest the most efficient way to do this? Finding solutions is surprisingly difficult, because index and indexing have a specific meaning in Pandas, which is not what I am after this time.
You can use a vectorized approach instead of a loop/iteration:
import pandas as pd
import numpy as np
df = pd.DataFrame({'return':np.array([np.nan, 0.01, -0.02, 0.05, 0.07, 0.01, -0.01])})
df['series'] = 100*np.exp(np.nan_to_num(df['return'].cumsum()))
#In [29]: df
#Out[29]:
# return series
#0 NaN 100.000000
#1 0.01 101.005017
#2 -0.02 99.004983
#3 0.05 104.081077
#4 0.07 111.627807
#5 0.01 112.749685
#6 -0.01 111.627807
#Crebit
I have created a framework to index prices in pandas quickly!
See the file on my GitHub below:
https://github.com/meinerst/JupyterWorkflow
It shows how you can pull prices from Yahoo Finance and how you can work with your existing dataframes.
I can't show the dataframe tables here. If you want to see them, follow the GitHub link.
Indexing financial time series (pandas)
This example uses data pulled from yahoo finance. If you have a dataframe from elsewhere, go to part 2.
Part 1 (Pulling data)
For this, make sure the yfinance package is installed.
#pip install yfinance
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
import datetime as dt
Insert the yahoo finance tickers into the variable 'tickers'. You can choose as many as you like.
tickers =['TSLA','AAPL','NFLX','MSFT']
Choose timeframe.
start=dt.datetime(2019,1,1)
end= dt.datetime.now()
In this example, the 'Adj Close' column is selected.
assets=yf.download(tickers,start,end)['Adj Close']
Part 2 (Indexing)
To graph a comparable price development graph the assets data frame needs to be indexed. New columns are added for this purpose.
First the indexing row is determined. In this case the initial prices.
assets_indexrow=assets[:1]
New columns are added to the original dataframe with the indexed price developments.
Insert your desired indexing value below. In this case, it is 100.
for ticker in tickers:
    assets[ticker+'_indexed'] = (assets[ticker] / assets_indexrow[ticker][0]) * 100
The original price columns are then dropped:
assets.drop(columns =tickers, inplace=True)
Graphing the result.
plt.figure(figsize=(14, 7))
for c in assets.columns.values:
    plt.plot(assets.index, assets[c], lw=3, alpha=0.8, label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('Value Change')
I can't insert the graph due to limited reputation points, but see here:
Indexed Graph
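As a side note, the per-ticker indexing loop above can also be written as a single vectorized expression; a minimal sketch with invented prices:

```python
import pandas as pd

# Toy price frame standing in for the downloaded 'Adj Close' data
assets = pd.DataFrame({
    "TSLA": [20.0, 22.0, 25.0],
    "AAPL": [40.0, 42.0, 44.0],
})

# Divide every row by the first row and scale to 100:
# the same indexing as the loop, in one line for all tickers
indexed = assets / assets.iloc[0] * 100
print(indexed)
```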
