I know we can use the following code to create a decile column based on a column of a given data set, even when there are ties in the data (see How to qcut with non unique bin edges?):
import numpy as np
import pandas as pd
# create a sample
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(100, 3), columns=list('ABC'))
# sort by column C
df = df.sort_values('C', ascending=False)
# create decile by column C
df['decile'] = pd.qcut(df['C'].rank(method='first'), 10, labels=np.arange(10, 0, -1))
Is there an easy way to save the cut points from df and then use the same cut points to cut a new data set? For example:
np.random.seed([1])
df_new = pd.DataFrame(np.random.rand(100, 1), columns=list('C'))
You can use .left to get all the bin edges:
s1 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
s2 = pd.Series([2, 3, 4, 6, 1])
# qcut the reference data, then collect the left edge of each interval
a = pd.qcut(s1, 10).unique()
bins = [x.left for x in a] + [np.inf]
# reuse those edges to cut the new data
pd.cut(s2, bins=bins)
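Applied to the original example, a sketch along the same lines. Note that pd.qcut also has a retbins=True option that returns the edges directly (duplicates='drop' guards against tied edges); widening the outer edges to ±inf is my own addition, so that new values outside the original range still land in a bin:
# get the decile edges from the original data (retbins returns them)
_, edges = pd.qcut(df['C'], 10, retbins=True, duplicates='drop')
# widen the outer edges so new values outside the original range still bin
edges[0], edges[-1] = -np.inf, np.inf
# reuse the saved edges on the new data, with the same descending labels
df_new['decile'] = pd.cut(df_new['C'], bins=edges,
                          labels=np.arange(len(edges) - 1, 0, -1))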
I have a pandas dataframe and I am experimenting with scikit-learn Novelty and Outlier Detection. I am trying to figure out how to save my good dataset to a new CSV file after the outlier detector flags outliers.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
df = pd.read_csv('./ILCplusDAT.csv')
df = df.ffill().bfill()
npower_pid = df[['power','pid']].to_numpy()
Using scikit-learn's LocalOutlierFactor, the results visually look good to me, using only 2 of the columns (power and pid) of the original df:
plt.figure(figsize=(25, 8))
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.005)
good = lof.fit_predict(npower_pid) == 1  # True for inliers, False for outliers
plt.scatter(npower_pid[good, 0], npower_pid[good, 1], s=2, label="Good", color="#4CAF50")
plt.scatter(npower_pid[~good, 0], npower_pid[~good, 1], s=8, label="Bad", color="#F44336")
plt.legend();
This creates an interesting plot, and I would love to save a "filtered" version of the original data frame with the "BAD" data removed. Any tips greatly appreciated... hopefully this makes sense. The original data frame has 3 columns, but the filtered data shown in the plot uses only 2 of them. Can I still filter the original dataframe based on the output shown in this plot?
You want to filter df using your array, good:
# you can filter df using bool masking in .loc[...]
df.loc[good == True]
# or...
df.loc[good == False]
# ***NOTE: if you've altered the index in df you may have unexpected results.
# convert `good` into a `pd.Series` with the same index as `df`
s = pd.Series(good, index=df.index, name="is_inlier")  # True = kept by LOF
# ... join with df
df = df.join(s)
# then filter to True to keep the rows LOF marked good
df.loc[df.is_inlier == True]
# or False to see the flagged rows
df.loc[df.is_inlier == False]
Thanks to @Ian Thompson
My code, for what it's worth...
s = pd.Series(good, index=df.index, name="is_inlier")
df = df.join(s)
# df2 is filtered to remove BAD data
df2 = df[df['is_inlier'] == True]
df2 = df2[['pid', 'power', 'dat']]
df2.to_csv('./filteredILCdata.csv')
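For what it's worth, a more compact variant (a sketch assuming the good mask from the answer above and the same three column names) does the filter, column selection, and export in one step:
# keep only the rows LOF marked as inliers, restricted to the wanted columns
df.loc[good, ['pid', 'power', 'dat']].to_csv('./filteredILCdata.csv')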
I'm looking to convert a DataFrame to a NetworkX graph: I would like to use the Dataframe as a map where the indexes are the "sources" and the columns are the "targets". The values should be the weights.
df = pd.DataFrame(np.random.randint(0,3,size=(4, 4)), columns=list('ABCD'), index = list('ABCD'))
df
My attempt, which fails (DataFrame has no .column attribute, and from_pandas_edgelist expects an edge-list dataframe with column names for source and target):
G = nx.from_pandas_edgelist(
    df, source=df.index, target=df.column, create_using=nx.DiGraph
)
Here is an example DataFrame; each index should connect to a column if the value is non-zero. Would you know how to do this?
Use nx.from_pandas_adjacency():
import pandas as pd
import numpy as np
import networkx as nx
df = pd.DataFrame(np.random.randint(0,3,size=(4, 4)), columns=list('ABCD'), index = list('ABCD'))
G = nx.from_pandas_adjacency(df, create_using=nx.DiGraph)
As the comment from @Huug points out, be sure to pass create_using=nx.DiGraph to the call, to ensure the graph is created as a directed graph.
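To sanity-check the result, a small sketch that lists the edges: each nonzero cell (index row to column) becomes a directed edge, and the cell value is stored in the 'weight' attribute:
# print every directed edge with its weight attribute
for u, v, w in G.edges(data='weight'):
    print(f'{u} -> {v}: weight={w}')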
For the dataframe below, I want to write a function that:
I. extracts the outliers for each column and exports the output as a csv file (I need help with this one)
II. visualizes the columns using boxplots and exports them as a pdf file
Outlier definition: boundaries ±3 standard deviations from the mean,
OR
any data point that lies more than 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3) in a data set:
High = Q3 + 1.5 * IQR
Low = Q1 - 1.5 * IQR
See below for the dataset and my attempt:
# dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# initialise data of lists.
data = {'region':['R1', 'R1', 'R2', 'R2', 'R2','R1','R1','R1','R2','R2'],
'cost':[120.05, 181.90, 10.21, 133.01, 311.19,2003.4,112.4,763.2,414.8,812.5],
'commission':[110.21, 191.12, 190.21,15.31, 245.09,63.41,811.3,10.34, 153.10, 311.17],
'salary':[10022,19910, 19113,449999, 25519,140.29, 291.07, 390.22, 245.09, 4122.62],
'revenue':[14029, 29100, 39022, 24509, 412271,110.21, 191.12, 190.21, 12.00, 245.09],
'tax':[120.05, 181.90, 10.34, 153.10, 311.17,52119,32991,52883,69359,57835],
'debt':[100.22,199.10, 191.13,199.99, 255.19,41218, 52991,1021,69152,79355],
'income': [43211,7672991,56881,211,77342,100.22,199.10, 191.13,199.99, 255.19],
'rebate': [31.21,429.01,538.18,621.58,6932.5,120.05, 181.90, 10.34, 153.10, 311.17],
'scale':['small','small','small','mid','mid','large','large','mid','large','small']
}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
############## my attempt ####################
def outlier_extractor(data):
    # select numeric columns
    numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
    # (I) Extract and export outliers as csv ..... I need help with this one
    # (II) boxplot visualization
    plt.figure(figsize=(10, 9))
    for i, variable in enumerate(numeric_columns):
        plt.subplot(4, 4, i + 1)
        plt.boxplot(data[variable], whis=1.5)
        plt.tight_layout()
        plt.title(variable)
    plt.savefig('graph_outliers.pdf')
    plt.show()

# driver code
outlier_extractor(df)
Please comment and share your full code. Thanks in advance
def outlier_extractor(data):
    numeric_data = data.select_dtypes(include=np.number)
    # flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as an outlier
    Q1, Q3 = numeric_data.quantile(.25), numeric_data.quantile(.75)
    IQR = Q3 - Q1
    numeric_data[:] = np.where((numeric_data > Q3 + 1.5*IQR) | (numeric_data < Q1 - 1.5*IQR), np.nan, numeric_data)
    # save each filtered column (outliers dropped) to its own csv file
    numeric_data.apply(lambda series: series.dropna().to_csv(series.name + ".csv"))

    plt.figure(figsize=(10, 9))
    for i, variable in enumerate(numeric_data.columns):
        plt.subplot(4, 4, i + 1)
        plt.boxplot(data[variable], whis=1.5)
        plt.tight_layout()
        plt.title(variable)
    # plt.savefig('graph_outliers.pdf')
    plt.show()

outlier_extractor(df)
Note that the apply call saves each filtered column to its own csv file. From your description I thought that this was your task.
Note also that you don't need the seaborn package.
EDIT
To export the whole filtered dataframe, with missing values in place of the outliers, replace the to_csv row with:
numeric_data.to_csv("filtered_numeric_data.csv")
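If you also want the outliers themselves in a file (part I of the question as literally worded), here is a sketch under the same 1.5*IQR definition; the helper name export_outliers and the output path are just placeholders:
def export_outliers(data, path='outliers.csv'):
    # placeholder helper: collect every value outside the 1.5*IQR fences
    numeric_data = data.select_dtypes(include=np.number)
    Q1, Q3 = numeric_data.quantile(.25), numeric_data.quantile(.75)
    IQR = Q3 - Q1
    mask = (numeric_data > Q3 + 1.5*IQR) | (numeric_data < Q1 - 1.5*IQR)
    # keep only the flagged cells (everything else becomes NaN), then go long:
    # one row per (row index, column, outlier value)
    numeric_data[mask].stack().rename('value').to_csv(path)

export_outliers(df)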
I have the following problem:
Given a 2D dataframe, with one column of values and another giving the category of each point, I would like to run k-means on the per-category means and then, as a new column in the original data frame, assign to each row the centroid that its group's mean is closest to.
I would like to do this using groupby.
More generally, my problem is that apply (to my knowledge) can only use functions that are defined on the individual groups (like mean()), whereas k-means needs information on all the groups. Is there a nicer way than converting everything to numpy arrays and working with those?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2

k = 4
raw_data = np.random.randint(0, 100, size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A', 'B'])
groups = df.groupby('A')
means = groups.mean().unstack()
centroids, dictionary = kmeans2(means, k)
fig, ax = plt.subplots()
print(dictionary)
What I would like to get now is a new column in df that gives the value in dictionary for each entry.
You can achieve this with the following:
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans2
k = 4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means_data_frame = pd.DataFrame(groups.mean())
centroid, means_data_frame['cluster'] = kmeans2(means_data_frame['B'], k)
df = df.join(means_data_frame, rsuffix='_mean', on='A')
This will append 2 more columns to df, B_mean and cluster, denoting the group's mean and the cluster that the group's mean is closest to, respectively.
If you really want to use apply, you can write a function to read the cluster value from means_data_frame and assign it to a new column in df, as sketched below.
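A minimal sketch of that last idea, assuming the means_data_frame built above with its cluster column; .map does the lookup directly, and the apply version is shown for completeness:
# look up each row's cluster via its group key in 'A'
df['cluster'] = df['A'].map(means_data_frame['cluster'])
# the same thing with apply, as mentioned above
df['cluster'] = df['A'].apply(lambda a: means_data_frame.loc[a, 'cluster'])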
Assume two dataframes, each with a datetime index, and each with one column of unnamed data. The dataframes are of different lengths and the datetime indexes may or may not overlap.
df1 is length 20. df2 is length 400. The data column consists of random floats.
I want to iterate through df2 taking 20 units per iteration, with each iteration incrementing the start vector by one unit - and similarly the end vector by one unit. On each iteration I want to calculate the correlation between the 20 units of df1 and the 20 units I've selected for this iteration of df2. This correlation coefficient and other statistics will then be recorded.
Once the loop is complete I want to plot df1 with the 20-unit vector of df2 that satisfies my statistical search - thus needing to keep up with some level of indexing to reacquire the vector once analysis has been completed.
Any thoughts?
Without knowing more specifics of the question, such as why you are doing this or whether dates matter, this will do what you asked. I'm happy to update based on your feedback.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
df1 = pd.DataFrame({'a':[random.randint(0, 20) for x in range(20)]}, index = pd.date_range(start = '2013-01-01',periods = 20, freq = 'D'))
df2 = pd.DataFrame({'b':[random.randint(0, 20) for x in range(400)]}, index = pd.date_range(start = '2013-01-10',periods = 400, freq = 'D'))
results = []
for i in range(0, len(df2) - len(df1) + 1):     # 381 windows of 20 days each
    t0 = df1.reset_index()['a']                 # grab the numbers from df1
    t1 = df2.iloc[i:i+20].reset_index()['b']    # grab 20 days, incrementing by one each time
    t2 = df2.iloc[i:i+20].index[0]              # label the window by its first day
    results.append(pd.DataFrame({'corr': t0.corr(t1)}, index=[t2]))  # correlation for this window
corr = pd.concat(results)  # DataFrame.append was removed from pandas, so collect and concat instead
# plot it and save the graph
corr.plot()
plt.title("Correlation Graph")
plt.ylabel("(%)")
plt.grid(True)
plt.savefig('corr.png')  # save before show(), which clears the active figure
plt.show()
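To reacquire the winning window afterwards (the "statistical search" part of the question), here is a sketch assuming "best" means the highest correlation recorded in corr:
# the window is labelled by its first day, so idxmax gives that start date
best_start = corr['corr'].idxmax()
# re-slice the 20-unit vector of df2 that begins on that day
best_window = df2.loc[best_start:].iloc[:20]

# plot df1 against the best window on a shared 0..19 axis (the dates differ)
fig, ax = plt.subplots()
ax.plot(df1['a'].to_numpy(), label='df1')
ax.plot(best_window['b'].to_numpy(), label=f'df2 window starting {best_start.date()}')
ax.legend()
plt.show()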