I am new to Python and struggling to solve this efficiently. I have read a number of examples, but they were complex and hard to understand. For the dataframe below, I would like to create one subplot per column, ignoring the first two (Site_ID and Cell_ID):
Availability
VoLTE CSSR
VoLTE Attempts
Each subplot (Availability, etc.) should include the grouped Site_ID values as its legend, and each subplot should be saved to a desired location.
Sample Data:
Date Site_ID Cell_ID Availability VoLTE CSSR VoLTE Attempts
22/03/2019 23181 23181B11 100 99.546435 264
03/03/2019 91219 91219A11 100 99.973934 663
17/04/2019 61212 61212A80 100 99.898843 1289
29/04/2019 91219 91219B26 99.907407 100 147
24/03/2019 61212 61212A11 100 99.831425 812
25/04/2019 61212 61212B11 100 99.91107 2677
29/03/2019 91219 91219A26 100 99.980066 1087
05/04/2019 91705 91705C11 100 99.331263 1090
04/04/2019 91219 91219A26 100 99.984588 914
19/03/2019 61212 61212B11 94.21875 99.934376 2318
23/03/2019 23182 23182B11 100 99.47367 195
02/04/2019 91219 91219A26 100 99.980123 958
26/03/2019 23181 23181A11 100 99.48185 543
19/03/2019 61212 61212A11 94.21875 99.777605 1596
18/04/2019 23182 23182B11 100 99.978012 264
26/03/2019 23181 23181C11 100 99.829911 1347
01/03/2019 91219 91219A11 100 99.770661 1499
12/03/2019 91219 91219B11 100 99.832273 1397
19/04/2019 61212 61212B80 100 99.987946 430
12/03/2019 91705 91705C11 100 98.789819 1000
Here is my inefficient solution, and given there are over 100 columns, I am quite worried about scaling it.
# separate dataframes
Avail = new_df.loc[:,["Site_ID","Cell_ID","Availability"]]
V_CSSR = new_df.loc[:,["Site_ID","Cell_ID","VoLTE CSSR"]]
V_Atte = new_df.loc[:,["Site_ID","Cell_ID","VoLTE Attempts"]]
#plot each dataframe
Avail.groupby("Site_ID")["Availability"].plot(y="Availability", legend = True)
V_CSSR.groupby("Site_ID")["VoLTE CSSR"].plot(y="VoLTE CSSR", legend = True)
V_Atte.groupby("Site_ID")["VoLTE Attempts"].plot(y="VoLTE Attempts", legend = True)
This is the outcome I am after.
Not the best solution, but you can try:
import matplotlib.pyplot as plt

# columns to plot: everything except Date, Site_ID and Cell_ID
cols = df.columns.drop(["Date", "Site_ID", "Cell_ID"])

fig, axes = plt.subplots(1, 3, figsize=(10, 4))
for col, ax in zip(cols, axes):
    for site in df.Site_ID.unique():
        tmp_df = df[df.Site_ID.eq(site)]
        ax.plot(tmp_df.Date, tmp_df[col], label=site)
    ax.set_title(col)
    ax.legend()
plt.show()
Output:
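To scale this to the 100+ columns mentioned in the question and save each figure to a desired location, a sketch along these lines may help (the "plots" folder name is an assumption, not part of the original answer):
from pathlib import Path
import matplotlib.pyplot as plt

out_dir = Path("plots")  # hypothetical output folder
out_dir.mkdir(exist_ok=True)

# every column except the identifiers becomes its own figure
value_cols = df.columns.drop(["Date", "Site_ID", "Cell_ID"])
for col in value_cols:
    fig, ax = plt.subplots(figsize=(10, 4))
    for site, grp in df.groupby("Site_ID"):
        ax.plot(grp.Date, grp[col], label=site)
    ax.set_title(col)
    ax.legend(title="Site_ID")
    fig.savefig(out_dir / f"{col}.png")  # one file per column
    plt.close(fig)  # free memory when looping over 100+ columns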
I would like to use principal component analysis (PCA) in Python to understand which features matter most to my machine learning model, so I can get rid of the data that has less influence on my predictions.
To do this, I started with a simple example and I will implement that later on my real data. The following example consists of 5 columns (i.e., Five features or variables) and 100 rows (i.e., 100 samples).
My dataset is:
wt1 wt2 wt3 wt4 wt5 ko1 ko2 ko3 ko4 ko5
gene1 485 474 475 478 471 149 132 136 146 165
gene2 134 129 170 133 129 53 46 45 44 43
gene3 850 894 925 832 815 485 545 503 475 568
gene4 709 728 706 728 722 106 119 138 144 147
gene5 593 548 546 606 587 648 627 584 641 607
... ... ... ... ... ... ... ... ... ...
gene96 454 404 413 462 420 293 312 327 297 332
gene97 746 691 799 716 762 557 527 511 560 517
gene98 736 782 744 821 737 856 860 840 866 853
gene99 565 513 568 529 565 218 255 224 217 223
gene100 494 457 482 435 468 586 598 562 573 550
The features are wt1 to ko5, and I would like PCA to tell me which wt or ko columns I can remove without affecting the accuracy of my model.
Here is my code:
import pandas as pd
import numpy as np
import random as rd
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt
genes = ['gene' + str(i) for i in range(1, 101)]
wt = ['wt' + str(i) for i in range(1, 6)]
ko = ['ko' + str(i) for i in range(1, 6)]
data = pd.DataFrame(columns=[*wt, *ko], index=genes)
# For each gene in the index (i.e. gene1, gene2, ... gene100), we create 5 values for the "wt" samples and 5 values for the "ko" samples.
# The mean can vary between 10 and 1000.
for gene in data.index:
    data.loc[gene, 'wt1':'wt5'] = np.random.poisson(lam=rd.randrange(10, 1000), size=5)  # size=5 because we have wt1, wt2, ... wt5
    data.loc[gene, 'ko1':'ko5'] = np.random.poisson(lam=rd.randrange(10, 1000), size=5)  # size=5 because we have ko1, ko2, ... ko5
#print(data.head()) # only the first five rows
#print(data)
## Before we do PCA, we have to center and scale the data..
## After centering, the average value for each gene will be 0,
## After scaling, the standard deviation for the values for each gene will be 1
## Notice that we are passing in the transpose of our data, the scale function expects the samples to be rows instead of columns
scaled_data = preprocessing.scale(data.T)  ## or StandardScaler().fit_transform(data.T)
# Variance is calculated in sklearn as: (measurements - mean)**2 / the number of measurements
# Variance is calculated in R as: (measurements - mean)**2 / (the number of measurements - 1)
# In practice the difference between the two is negligible.
pca = PCA() ## PCA here is an object
## Now we call the fit method on the scaled data
pca.fit(scaled_data) ## This is where we do all of the PCA math (i.e. calculate loading scores and the variation each principal component accounts for..)
pca_data = pca.transform(scaled_data)  ## this is where we generate the coordinates for a PCA graph based on the loading scores and the scaled data
## We'll start with a scree plot to see how many principal components should go into the final plot.
# The first thing we do is calculate the percentage of variation that each principal component accounts for..
per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
plt.bar(x=range(1, len(per_var)+1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()
## Almost all the variation is along the first PC, so a 2-D graph using PC1 and PC2 should do a good job of representing the original data.
pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns=labels)  ## This organizes the new data created by pca.transform(scaled_data) into a matrix
plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('My PCA Graph')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))
for sample in pca_df.index:
    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))
plt.show()
loading_scores = pd.Series(pca.components_[0], index=genes)  # create a pandas Series with the loading scores for PC1
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)  # sort the loading scores by magnitude (absolute value)
top_10_genes = sorted_loading_scores[0:10].index.values  ## get the names of the top 10 indexes (which are the gene names)
print(loading_scores[top_10_genes])  ## print the top 10 gene names and their corresponding loading scores
The outputs of the code are the following figures:
As we can see, PC1 accounts for 89.5% of the variance and PC2 accounts for 2.8%.
So I can represent the original data using only PC1 and PC2.
My question is:
Is there a way to correlate PC1 and PC2 with the original data, so I can understand which features in the original data are the least important?
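One common approach (a sketch, not a definitive answer): the rows of pca.components_ hold the loading scores of every original feature on each PC, so collecting PC1 and PC2 into a DataFrame lets you rank features by how little they contribute. Note that because the code above fits PCA on data.T, the features here are the genes, not the wt/ko columns. Reusing the pca object and genes list from above:
## Sketch: rank features by their combined |loading| on PC1 and PC2;
## the ones at the top of this list influence the retained variance least.
loadings = pd.DataFrame(pca.components_[:2].T, index=genes, columns=['PC1', 'PC2'])
print(loadings.abs().sum(axis=1).sort_values().head(10))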
I have a dataframe as follows:
match_id team team_score
411 RCB 263
7937 KKR 250
620 RCB 248
206 CSK 246
11338 KKR 241
61 CSK 240
562 RCB 235
Now, I want to plot all these values as individual bars, but the output I get is something different:
Is there any way I can make separate bars for the same x-axis values?
When 'team' is used as x, all the values for each team are averaged, and a small error bar shows a confidence interval. To have each entry of the table as a separate bar, the index of the dataframe can be used for x. After creating the bars, they can be labeled with the team names.
Optionally, hue='team' colors the bars per team. Then dodge=False is needed to have the bars positioned nicely. In that case, Seaborn also creates a legend, which is not very useful, as the same information is now also present in the x-values. The legend can be suppressed via ax.legend_.remove().
from matplotlib import pyplot as plt
import pandas as pd
from io import StringIO
import seaborn as sns
data_str = StringIO("""match_id team team_score
411 RCB 263
7937 KKR 250
620 RCB 248
206 CSK 246
11338 KKR 241
61 CSK 240
562 RCB 235""")
df = pd.read_csv(data_str, delim_whitespace=True)
color_dict = {'RCB': 'dodgerblue', 'KKR': 'darkviolet', 'CSK': 'gold'}
ax = sns.barplot(x=df.index, y='team_score', hue='team', palette=color_dict, dodge=False, data=df)
ax.set_xticklabels(df['team'])
ax.legend_.remove()
plt.tight_layout()
plt.show()
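If seaborn is not available, a minimal plain-matplotlib sketch of the same idea, reusing df and color_dict from the snippet above:
plt.bar(df.index, df['team_score'], color=df['team'].map(color_dict))
plt.xticks(df.index, df['team'])  # label each bar with its team name
plt.ylabel('team_score')
plt.tight_layout()
plt.show()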
So I have this kind of dataset:
Time Type Profit
2 82 s/l -51.3
5 9 t/p 164.32
8 38 s/l -53.19
11 82 s/l -54.4
14 107 s/l -54.53
.. ... ... ...
730 111 s/l -70.72
731 111 s/l -70.72
732 111 s/l -70.72
733 113 s/l -65.13
734 113 s/l -65.13
[239 rows x 3 columns]
I want to plot a chart with X as the time (already in hours of the week) and Y as the profit (which can be positive or negative). For each hour (X), I would like two bars showing the profit. Negative profit would be shown as positive too in this case, but in a separate bar.
For example, if we have -65 and 70, they would both show as 65 and 70 on the chart, but the loss would have a different bar color.
This is my code so far:
import pandas as pd
import matplotlib.pyplot as plt

# read the csv file
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns=['Time', 'Type', 'Profit'])  # keep Profit numeric so it can be plotted
# turn the time column into hours of the week (findHourOfWeek is my own helper)
df['Time'] = df['Time'].apply(findHourOfWeek)
# keep winning trades (t/p) and losing trades (s/l)
df = df[(df['Type'] == 't/p') | (df['Type'] == 's/l')]
# plot the chart
ax = df.plot(title='Profits and Losses (Hour Of Week)', kind='bar')
#ax.legend(['Losses', 'Winners'])
plt.xlabel('Hour of Week')
plt.ylabel('Amount Of Profit/Loss')
plt.show()
You can groupby, unstack and plot:
(df.groupby(['Time','Type']).Profit.sum().abs()
.unstack('Type')
.plot.bar()
)
For your sample data above, the output is:
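For reference, here is a self-contained version of the same idea on a few rows of the sample data; the per-Type color mapping is an assumption, added to mirror the different bar colors asked for:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Time': [82, 9, 38, 82, 107],
                   'Type': ['s/l', 't/p', 's/l', 's/l', 's/l'],
                   'Profit': [-51.3, 164.32, -53.19, -54.4, -54.53]})

ax = (df.groupby(['Time', 'Type']).Profit.sum().abs()  # losses become positive
        .unstack('Type')                               # one column per Type
        .plot.bar(color={'s/l': 'tomato', 't/p': 'seagreen'}))
ax.set_xlabel('Hour of Week')
ax.set_ylabel('Amount Of Profit/Loss')
plt.show()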
I got the idea to try to visualize data for election donations from the FEC website. Basically, I would like to create a stacked bar chart, with the X-axis being the state, the Y-axis being the donated amount, and the 'stacks' being the different candidates, showing how much each candidate received from each state.
Code:
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
pathName = r"R:\Downloads\indiv20\by_date"
dataDir = Path(pathName)
filename = "itcont_2020_20010425_20190425.txt"
fullName = dataDir / filename
data = pd.read_csv(fullName, low_memory=False, sep="|", usecols=[0, 9, 12, 14])
data.columns = ['Filer ID', 'State', 'Occupation', 'Donation Amount ($)']
data = data.dropna(subset=['Donation Amount ($)'])
donations_by_state = data.groupby('State').sum()
plt.bar(donations_by_state.index, donations_by_state['Donation Amount ($)'])
plt.ylabel('Donation Amount ($)')
plt.xlabel('State')
plt.title('Donations per State')
plt.show()
This plots the total contributions per state, and works great. However, when I try this groupby method to group all the data I want, I'm not sure how to plot a stacked bar chart from this data:
donations_per_candidate_per_state = data['Donation Amount ($)'].groupby([data['State'], data['Filer ID']]).sum()
State Filer ID
AA C00005561 350
C00010603 600
C00042366 115
C00309567 1675
C00331694 2500
C00365536 270
C00401224 4495
C00411330 100
C00492991 300
C00540500 300
C00641381 250
C00696948 2800
C00697441 250
C00699090 67
C00703108 1400
AB C00401224 1386
AE C00000935 295
C00003418 276
C00010603 1750
C00027466 320
C00193433 105
C00211037 251
C00216614 226
C00341396 20
C00369033 150
C00394957 50
C00401224 26538
C00438713 50
C00457325 310
C00492785 300
...
ZZ C00580100 1490
C00603084 95
C00607861 750
C00608380 125
C00618371 2199
C00630665 1000
C00632133 600
C00632398 400
C00639500 208
C00639591 1450
C00640623 6402
C00653816 1000
C00666149 1000
C00666453 2800
C00683102 1000
C00689430 3524
C00693234 13283
C00693713 1000
C00694018 2750
C00694455 12761
C00695510 1045
C00696245 250
C00696419 3000
C00696526 500
C00696948 31296
C00697441 34396
C00698050 350
C00698258 2800
C00699090 5757
C00700732 475
Name: Donation Amount ($), Length: 32662, dtype: int64
It seems to have the data tabulated in the way I need, just not sure how to plot it.
You can unstack the 'Filer ID' level and plot a stacked bar chart:
df = donations_per_candidate_per_state.unstack('Filer ID')
df.plot(kind='bar', stacked=True)
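As a self-contained sketch with a few toy rows shaped like the Series above:
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([350, 600, 1386, 295, 276],
              index=pd.MultiIndex.from_tuples(
                  [('AA', 'C00005561'), ('AA', 'C00010603'),
                   ('AB', 'C00401224'),
                   ('AE', 'C00000935'), ('AE', 'C00003418')],
                  names=['State', 'Filer ID']),
              name='Donation Amount ($)')

df = s.unstack('Filer ID')  # one row per state, one column per candidate
df.plot(kind='bar', stacked=True)
plt.ylabel('Donation Amount ($)')
plt.show()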
I have written a program (code below) that gives me a dataframe for each file in a folder. The dataframe contains the quarters of the year from the file and a count of how often each quarter occurs in the file. For one file in the loop, the output looks, for example, like:
2008Q4 230
2009Q1 186
2009Q2 166
2009Q3 173
2009Q4 246
2010Q1 341
2010Q2 336
2010Q3 200
2010Q4 748
2011Q1 625
2011Q2 690
2011Q3 970
2011Q4 334
2012Q1 573
2012Q2 53
How can I create a big data frame where the counts for the quarters are summed up for all files in the folder?
path = "crisisuser"
os.chdir(path)
result = [i for i in glob.glob('*.{}'.format("csv"))]
os.chdir("..")
for i in result:
df = pd.read_csv("crisisuser/"+i)
df['quarter'] = pd.PeriodIndex(df.time, freq='Q')
df=df['quarter'].value_counts().sort_index()
I think you need to append all the Series to a list, then use concat and sum per index value:
out = []
for i in result:
    df = pd.read_csv("crisisuser/" + i)
    df['quarter'] = pd.PeriodIndex(df.time, freq='Q')
    out.append(df['quarter'].value_counts().sort_index())
s = pd.concat(out).sum(level=0)
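One caveat: Series.sum(level=...) was deprecated and later removed from pandas, so on current versions the last line becomes:
# equivalent on current pandas, where sum(level=...) no longer exists
s = pd.concat(out).groupby(level=0).sum()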