Data quality check for numeric columns only - Python

I'm trying to set up a data quality check for the numeric columns in a dataframe. I want to run describe() to produce stats on each numeric column. How can I filter out the other columns before producing the stats? See the code I'm using:
import pandas
df1 = pandas.read_csv("D:/dc_Project/loans.csv")
print(df1.describe(include=sorted(df1)))

I went with the following from a teammate:
import pandas as pd
import numpy as np
df1 = pd.read_csv("D:/dc_Project/loans.csv")
df2 = df1.select_dtypes(include=np.number)
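With mixed dtypes, describe() already defaults to numeric columns only, but here is a minimal sketch of the full check (assuming loans.csv mixes numeric and non-numeric columns):
import pandas as pd
import numpy as np

df1 = pd.read_csv("D:/dc_Project/loans.csv")
# keep only the numeric columns, then produce the summary stats
df2 = df1.select_dtypes(include=np.number)
print(df2.describe())
# equivalently, let describe() do the dtype filtering itself
print(df1.describe(include=[np.number]))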

Related

How can I drop rows in Pandas based on a condition?

For example, let's take the penguins dataset: I want to drop all rows where the bill_length_mm column is more than 30:
import seaborn as sns
import pandas as pd
ds = sns.load_dataset("penguins")
ds.head()
ds.drop(ds[ds['bill_length_mm']>30])
This gives me an error. And if I add axis=1, it just drops every column in the dataset:
ds.drop(ds[ds['bill_length_mm']>30], axis=1)
So what should I do to achieve my goal?
Try:
ds = ds.drop(ds[ds['bill_length_mm'] > 30].index)
Or:
ds = ds[ds['bill_length_mm'] <= 30]
ds.drop expects index (or column) labels, not a boolean-filtered DataFrame, which is why your call fails. If you only want to keep the rows where bill_length_mm <= 30, you can use
ds = ds[ds['bill_length_mm'] <= 30]
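One subtlety worth checking against your data: the two forms differ when bill_length_mm contains NaN values. The boolean mask drops NaN rows, since NaN <= 30 evaluates to False, while the drop(...) form keeps them, since NaN > 30 is also False. A quick sketch:
import seaborn as sns

ds = sns.load_dataset("penguins")
kept = ds[ds['bill_length_mm'] <= 30]                    # NaN rows are excluded
dropped = ds.drop(ds[ds['bill_length_mm'] > 30].index)   # NaN rows survive
print(len(kept), len(dropped))  # the counts differ by the number of NaN rows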

How to import a CSV file?

I have downloaded the required CSV file but don't know how to import it into pandas using Python. The task is as follows:
Construct in Python four data frames (df1, df2, df3, df4) to store the four data sets (I, II, III, IV). Each data set consists of eleven (x, y) points; [.5 Marks]
Find the basic descriptive statistics in Python using the method describe(); [.5 Marks]
You can use the code below to create a pandas DataFrame:
import pandas as pd
df1 = pd.read_csv("dataset1.csv")  # path to the first downloaded CSV file
df2 = pd.read_csv("dataset2.csv")
# descriptive statistics
df1.describe()
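A sketch extending this to all four required data frames (assuming the files are named dataset1.csv through dataset4.csv; adjust the paths to wherever you downloaded them):
import pandas as pd

# load the four data sets into df1..df4
df1, df2, df3, df4 = (pd.read_csv(f"dataset{i}.csv") for i in range(1, 5))

# basic descriptive statistics for each data set
for i, frame in enumerate([df1, df2, df3, df4], start=1):
    print(f"dataset {i}")
    print(frame.describe())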

How can I get the difference between values in a Pandas dataframe grouped by another field?

I have a CSV of data I've loaded into a dataframe that I'm trying to massage: I want to create a new column that contains the difference from one record to another, grouped by another field.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon.sort_values('county', inplace=True)
# This is not working; I was hoping to find the differences from one day to another on a per-county basis
oregon['delta'] = oregon.groupby(['state','county'])['cases'].shift(1, fill_value=0)
oregon.tail()
Unfortunately, I'm getting results where the delta is always the same as the cases.
I'm new to Pandas and relatively inexperienced with Python, so bonus points if you can point me to the relevant documentation.
Try:
oregon['delta'] = oregon.groupby(['state', 'county'])['cases'].diff().fillna(0)
groupby(...)['cases'].diff() computes the difference between consecutive rows within each group; your shift(1) only copied the previous row's cases into delta without subtracting it from the current value.
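One caveat, since diff() works purely on row order: make sure the rows are in chronological order within each county first. A sketch rebuilding the sort from the code above (date is the index at this point, so it has to be reset first):
oregon = oregon.reset_index().sort_values(['county', 'date']).set_index('date')
oregon['delta'] = oregon.groupby(['state', 'county'])['cases'].diff().fillna(0)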

Data Frame Indexing

Using Python 3, I wrote the following code for some calculations:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def data(symbols):
    dates = pd.date_range('2016/01/01', '2016/12/23')
    df = pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp = pd.read_csv("/home/furqan/Desktop/Data/{}.csv".format(symbol),
                              index_col='Date', parse_dates=True, usecols=['Date', 'Close'],
                              na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    df = df / df.iloc[0, :]  # .ix is removed in modern pandas; use .iloc for positional indexing
    return df
symbols = ['FABL','HINOON']
df=data(symbols)
print(df)
p_value=(np.zeros((2,2),dtype="float"))
p_value[0,0]=0.5
p_value[1,1]=0.5
print(df.shape[1])
print(p_value.shape[0])
df=np.dot(df,p_value)
print(df.shape[1])
print(df.shape[0])
print(df)
When I print df the second time, the index has vanished. I think the issue is due to the matrix multiplication. How can I get the index and column headings back into df?
The issue is that np.dot is a NumPy function: it returns a plain NumPy array, which is why the existing column and index labels are lost.
So instead of
df=np.dot(df,p_value)
you can do
df=df.dot(p_value)
Additionally, because p_value is a plain NumPy array, it carries no labels of its own. You can either build a labeled DataFrame from it using the existing column names (note that its index must match df's columns so df.dot can align the two):
p_value = pd.DataFrame(np.zeros((2, 2), dtype="float"), index=df.columns, columns=df.columns)
or just overwrite the column names directly after computing the dot product, like so:
df.columns = ['FABL', 'HINOON']
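A minimal end-to-end sketch with made-up numbers (not the poster's data) showing that DataFrame.dot keeps both the index and the column labels:
import numpy as np
import pandas as pd

dates = pd.date_range('2016-01-01', periods=3)
df = pd.DataFrame({'FABL': [1.00, 1.10, 1.20],
                   'HINOON': [1.00, 0.90, 1.05]}, index=dates)

# weight matrix as a DataFrame; its index must match df's columns for alignment
p_value = pd.DataFrame(np.diag([0.5, 0.5]), index=df.columns, columns=df.columns)

result = df.dot(p_value)
print(result)  # the DatetimeIndex and the column names survive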

Complex aggregation after group by operation in Pandas DataFrame

I have a pandas DataFrame that I've grouped by a column. After this operation I need to generate all unique pairs between the rows of each group and perform some aggregate operation on all the pairs of a group. I've implemented the following sample algorithm to give you an idea. I want to refactor this code to use pandas operations, for better performance and/or lower code complexity.
Code:
import numpy as np
import pandas as pd
import itertools
#Construct Dataframe
samples=40
a=np.random.randint(3,size=(1,samples))
b=np.random.randint(9,size=(1,samples))
c=np.random.randn(1,samples)
d=np.append(a,b,axis=0)
e=np.append(d,c,axis=0)
e=e.transpose()
df = pd.DataFrame(e,columns=['attr1','attr2','value'])
df['attr1'] = df.attr1.astype('int')
df['attr2'] = df.attr2.astype('int')
#drop duplicate rows so (attr1,attr2) will be key
df = df.drop_duplicates(['attr1','attr2'])
#df = df.reset_index()
print(df)
for key, tup in df.groupby('attr1'):
    print('Group', key, ' length ', len(tup))
    # generate pairs
    agg = []
    for v1, v2 in itertools.combinations(list(tup['attr2']), 2):
        p1_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v1)]['value'])
        p2_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v2)]['value'])
        agg.append([key, (v1, v2), (p1_val - p2_val) ** 2])
    # insert pairs into a dataframe
    p = pd.DataFrame(agg, columns=['group', 'pair', 'value'])
    top = p.sort_values(by='value').head(4)
    print(top['pair'])
    # Perform some operation in df based on pair values
    # ....
I'm afraid that pandas DataFrames can't express such analysis directly. Do I have to stick to plain Python loops, as in the example?
I'm new to Pandas so any comments/suggestions are welcome.
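One possible direction (a sketch, not from the thread): the pair generation can be vectorized with a self-merge per group, assuming (attr1, attr2) is a key as in the example above:
import numpy as np
import pandas as pd

# same shape as the question's frame: (attr1, attr2) unique, one value per key
df = pd.DataFrame({'attr1': np.random.randint(3, size=40),
                   'attr2': np.random.randint(9, size=40),
                   'value': np.random.randn(40)}).drop_duplicates(['attr1', 'attr2'])

pairs = df.merge(df, on='attr1', suffixes=('_1', '_2'))
pairs = pairs[pairs['attr2_1'] < pairs['attr2_2']]         # keep each unordered pair once
pairs['value'] = (pairs['value_1'] - pairs['value_2']) ** 2
top = pairs.sort_values('value').groupby('attr1').head(4)  # 4 smallest squared diffs per group
print(top[['attr1', 'attr2_1', 'attr2_2', 'value']])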
