I have downloaded the required CSV file but don't know how to import it into pandas using Python.
Construct in Python four data frames (df1, df2, df3, df4) to store the four data sets (I, II, III, IV). Each dataset consists of eleven (x,y) points;
[.5 Marks]
Find the basic descriptive statistics in Python using the method describe();
[.5 Marks]
You can use the code below to create the pandas DataFrames:
import pandas as pd
df1 = pd.read_csv("dataset1")
df2 = pd.read_csv("dataset2")
#descriptive statistics
df1.describe()
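If the four files follow a common naming pattern (an assumption here; the actual file names and extensions on your machine may differ, so adjust the paths), you can load them in one pass and run describe() on each:
import pandas as pd
# file names below are assumed from the snippet above; change them to match your downloaded files
names = ["dataset1", "dataset2", "dataset3", "dataset4"]
frames = {name: pd.read_csv(name) for name in names}
df1, df2, df3, df4 = (frames[n] for n in names)
# basic descriptive statistics for each of the four data sets
for name, frame in frames.items():
    print(name)
    print(frame.describe())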
I'm trying to set up a data quality check for the numeric columns in a dataframe. I want to run describe() to produce stats on each numeric column. How can I filter out the other columns so the stats only cover the numeric ones? See the lines of code I'm using:
df1 = pandas.read_csv("D:/dc_Project/loans.csv")
print(df1.describe(include=sorted(df1)))
Went with the following from a teammate:
import pandas as pd
import numpy as np
df1 = pd.read_csv("D:/dc_Project/loans.csv")
# keep only the numeric columns
df2 = df1.select_dtypes(include=np.number)
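A small follow-up: when a DataFrame has mixed dtypes, describe() already restricts its output to numeric columns by default, and you can make that explicit with include. A minimal sketch using the same (assumed) file path as above:
import pandas as pd
import numpy as np
df1 = pd.read_csv("D:/dc_Project/loans.csv")
# restrict describe() to numeric columns without building a second DataFrame
print(df1.describe(include=[np.number]))
# or chain the selection and the summary in one line
print(df1.select_dtypes(include=np.number).describe())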
I have a CSV of data I've loaded into a dataframe that I'm trying to massage: I want to create a new column that contains the difference from one record to another, grouped by another field.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon.sort_values('county', inplace=True)
# This is not working; I was hoping to find the differences from one day to another on a per-county basis
oregon['delta'] = oregon.groupby(['state','county'])['cases'].shift(1, fill_value=0)
oregon.tail()
Unfortunately, I'm getting results where the delta is always the same as the cases.
I'm new at Pandas and relatively inexperienced with Python, so bonus points if you can point me towards how to best read the documentation.
Let's try:
oregon['delta'] = oregon.groupby(['state', 'county'])['cases'].diff().fillna(0)
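The difference matters because shift() only returns the previous row's value, while diff() returns the change between the current row and the previous one. A small toy example (made-up numbers, just to illustrate the two operations):
import pandas as pd
toy = pd.DataFrame({
    "county": ["A", "A", "A", "B", "B"],
    "cases":  [1, 3, 6, 2, 5],
})
toy["prev"]  = toy.groupby("county")["cases"].shift(1, fill_value=0)  # previous value per county
toy["delta"] = toy.groupby("county")["cases"].diff().fillna(0)        # change from previous row per county
print(toy)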
I have just started to learn to use Jupyter notebook. I have a data file called 'Diseases'.
Opening data file
import pandas as pd
df = pd.read_csv('Diseases.csv')
Choosing data from a column named 'DIABETES', i.e. choosing subject IDs that have diabetes, where yes is 1 and no is 0.
df[df.DIABETES == 1]
Now I want to export this cleaned data (that has fewer rows)
df.to_csv('diabetes-filtered.csv')
This exports the original data file, not the filtered df with fewer rows.
I saw in another question that the inplace argument needs to be used, but I don't know how.
You forgot to assign the filtered DataFrame back, here to df1:
import pandas as pd
df = pd.read_csv('Diseases.csv')
df1 = df[df.DIABETES == 1]
df1.to_csv('diabetes-filtered.csv')
Or you can chain filtering and exporting to file:
import pandas as pd
df = pd.read_csv('Diseases.csv')
df[df.DIABETES == 1].to_csv('diabetes-filtered.csv')
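One optional refinement: to_csv() writes the DataFrame index as an extra column by default; if you don't want it in the exported file, pass index=False. A sketch based on the same filtering as above:
import pandas as pd
df = pd.read_csv('Diseases.csv')
# export only the filtered rows, without the index column
df[df.DIABETES == 1].to_csv('diabetes-filtered.csv', index=False)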
TLDR: I created a dask dataframe from a dask bag. The dask dataframe treats every observation (event) as a column. So, instead of having rows of data for each event, I have a column for each event. The goal is to transpose the columns to rows in the same way that pandas can transpose a dataframe using df.T.
Details:
I have sample twitter data from my timeline here. To get to my starting point, here is the code to read a json from disk into a dask.bag and then convert that into a dask.dataframe
import dask.bag as db
import dask.dataframe as dd
import json
b = db.read_text('./sampleTwitter.json').map(json.loads)
df = b.to_dataframe()
df.head()
The problem: all my individual events (i.e. tweets) are recorded as columns rather than rows. In keeping with tidy-data principles, I would like a row for each event. pandas has a transpose method for DataFrames and dask.array has a transpose method for arrays. My goal is to do the same transpose operation, but on a dask dataframe. How would I do that?
Edit for solution
This code resolves the original transpose problem, cleans Twitter json files by defining the columns you want to keep and dropping the rest, and creates a new column by applying a function to a Series. Then, we write a MUCH smaller, cleaned file to disk.
import dask.dataframe as dd
from dask.delayed import delayed
import dask.bag as db
from dask.diagnostics import ProgressBar,Profiler, ResourceProfiler, CacheProfiler
import pandas as pd
import json
import glob
import os
# pull in all files (glob does not expand '~', so expand it explicitly)
filenames = glob.glob(os.path.expanduser('~/sampleTwitter*.json'))
# read each JSON file into a delayed pandas DataFrame, then combine them into one dask DataFrame
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
df = dd.from_delayed(dfs)
# see all the fields of the dataframe
fields = list(df.columns)
# identify the fields we want to keep
keepers = ['coordinates','id','user','created_at','lang']
# remove the keepers from the field list, so 'fields' ends up holding only the columns to drop
for f in keepers:
    if f in fields:
        fields.remove(f)
# drop the unwanted fields and keep only what's necessary
df = df.drop(fields, axis=1)
clean = df.coordinates.apply(lambda x: (x['coordinates'][0],x['coordinates'][1]), meta= ('coords',tuple))
df['coords'] = clean
# making new filenames from old filenames to save cleaned files
import re
newfilenames = []
for l in filenames:
    newfilenames.append(re.search(r'(?<=/).+?(?=\.)', l).group() + 'cleaned.json')
#newfilenames
# custom saver function for dataframes using newfilenames
def saver(frame, filename):
    return frame.to_json('./' + filename)
# converting back to a list of delayed objects, one per partition
dfs = df.to_delayed()
writes = [delayed(saver)(frame, fn) for frame, fn in zip(dfs, newfilenames)]
# writing the cleaned, MUCH smaller objects back to disk
dd.compute(*writes)
I think you can get the result you want by bypassing bag altogether, with code like
import glob
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
filenames = glob.glob('sampleTwitter*.json')
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
ddf = dd.from_delayed(dfs)
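If you prefer to keep the dask.bag step (for instance because the tweets are nested JSON), another option is to flatten each record into a plain dict before calling to_dataframe, so every tweet becomes one row. A minimal sketch, assuming each record has top-level 'id', 'created_at' and 'lang' keys (adjust the keys to your data):
import json
import dask.bag as db
b = db.read_text('./sampleTwitter.json').map(json.loads)
# keep a few flat fields per tweet so the resulting dataframe has one row per tweet
flat = b.map(lambda t: {'id': t.get('id'),
                        'created_at': t.get('created_at'),
                        'lang': t.get('lang')})
df = flat.to_dataframe()
df.head()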
I have a DataFrame in pandas that I've grouped by a column.
After this operation I need to generate all unique pairs between the rows of
each group and perform some aggregate operation on all the pairs of a group.
I've implemented the following sample algorithm to give you an idea. I want to refactor this code to work with pandas, to gain performance and/or reduce code complexity.
Code:
import numpy as np
import pandas as pd
import itertools
#Construct Dataframe
samples=40
a=np.random.randint(3,size=(1,samples))
b=np.random.randint(9,size=(1,samples))
c=np.random.randn(1,samples)
d=np.append(a,b,axis=0)
e=np.append(d,c,axis=0)
e=e.transpose()
df = pd.DataFrame(e,columns=['attr1','attr2','value'])
df['attr1'] = df.attr1.astype('int')
df['attr2'] = df.attr2.astype('int')
#drop duplicate rows so (attr1,attr2) will be key
df = df.drop_duplicates(['attr1','attr2'])
#df = df.reset_index()
print(df)
for key, tup in df.groupby('attr1'):
    print('Group', key, ' length ', len(tup))
    # generate pairs
    agg = []
    for v1, v2 in itertools.combinations(list(tup['attr2']), 2):
        p1_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v1)]['value'])
        p2_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v2)]['value'])
        agg.append([key, (v1, v2), (p1_val - p2_val) ** 2])
    # insert pairs to dataframe
    p = pd.DataFrame(agg, columns=['group', 'pair', 'value'])
    top = p.sort_values(by='value').head(4)
    print(top['pair'])
    # Perform some operation in df based on pair values
    # ....
I am worried that pandas DataFrames cannot provide this kind of analysis functionality.
Do I have to stick to plain Python loops like in the example?
I'm new to pandas, so any comments/suggestions are welcome.
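For what it's worth, one pandas-oriented way to avoid the inner Python loop is a self-merge on the group key, which builds every within-group pair and keeps the arithmetic vectorised. This is a sketch under the same setup as the example above (attr1 as group key, attr2 as item key, random data), not a drop-in replacement; note it ranks pairs across all groups, so add another groupby if you need a top-k per group:
import numpy as np
import pandas as pd
# build a small frame with (attr1, attr2) as key, similar to the example
rng = np.random.default_rng(0)
df = pd.DataFrame({'attr1': rng.integers(0, 3, 40),
                   'attr2': rng.integers(0, 9, 40),
                   'value': rng.normal(size=40)}).drop_duplicates(['attr1', 'attr2'])
# self-merge on the group key to get every pair within a group,
# then keep attr2_x < attr2_y so each unordered pair appears exactly once
pairs = df.merge(df, on='attr1', suffixes=('_x', '_y'))
pairs = pairs[pairs['attr2_x'] < pairs['attr2_y']].copy()
# vectorised squared difference for every pair
pairs['value'] = (pairs['value_x'] - pairs['value_y']) ** 2
pairs['pair'] = list(zip(pairs['attr2_x'], pairs['attr2_y']))
top = pairs.sort_values('value').head(4)
print(top[['attr1', 'pair', 'value']])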