Python conditional filtering in csv file - python

Please help! I have tried different approaches and packages to write a program that takes in 4 inputs and returns the writing score statistics of the group matching that combination of inputs, read from a csv file. This is my first project, so I would appreciate any insights/hints/tips!
Here is the csv sample (has 200 rows total):
id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
86 male high public general 33
141 male high public vocation 63
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
75 male middle public vocation 60
60 male middle public academic 57
Here is what I have so far:
import csv
import numpy

# Read the file (newline='' is what the csv module expects in Python 3)
with open('scores.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)  # skips header
    data = [row for row in csv_file_object]  # loads data into a list for processing
data = numpy.array(data)

# Ask for inputs and make them lower case
gender = input('Enter gender [male/female]: ').lower()
schtyp = input('Enter school type [public/private]: ').lower()
ses = input('Enter socioeconomic status [low/middle/high]: ').lower()
prog = input('Enter program status [general/vocation/academic]: ').lower()
What I am missing is how to filter and get stats only for a specific group. For example, say I input male, public, middle, and academic -- I'd want to get the average writing score for that subset. I tried the groupby function from pandas, but that only gets you stats for broad groups (such as public vs private). I also tried DataFrame from pandas, but that only gets me filtering for one input, and I'm not sure how to get the writing scores. Any hints would be greatly appreciated!

Agreeing with Ramon, Pandas is definitely the way to go, and has extraordinary filtering/sub-setting capability once you get used to it. But it can be tough to first wrap your head around (or at least it was for me!), so I dug up some examples of the sub-setting you need from some of my old code. The variable itu below is a Pandas DataFrame with data on various countries over time.
# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania' # returns True/False values
itu[subset] # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania'] # one-line command, equivalent to the above two lines
# Pandas has many built-in functions like .isin() to provide params to filter on
itu[itu.cntrycode.isin(['USA','FRA'])] # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])] # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])] # Both of above at same time
# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]
# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) &
    itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]
# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName'] # gives us UAE, UK, & US

Look at pandas. I think it will shorten your csv parsing work and give you the subset functionality you're asking for...
import pandas as pd
# sep=r'\s+' splits on any run of whitespace
data = pd.read_csv('fileName.txt', sep=r'\s+')
#get all of the male students
data[data['gender'] == 'male']
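To tie this back to the original question: combining the four inputs is a matter of building one boolean mask per column and joining them with `&`. Here is a minimal sketch, assuming the column names from the sample (`gender`, `ses`, `schtyp`, `prog`, `write`) and whitespace-delimited data. The inlined sample and the helper name `writing_stats` are only for illustration.

```python
import io
import pandas as pd

# A few rows from the question's sample, inlined so the sketch runs on its own.
SAMPLE = """id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
172 male middle public academic 47
113 male middle public academic 44
11 male middle public academic 34
84 male middle public general 57
"""

data = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")


def writing_stats(df, gender, schtyp, ses, prog):
    """Return summary statistics of `write` for one combination of inputs."""
    mask = (
        (df["gender"] == gender)
        & (df["schtyp"] == schtyp)
        & (df["ses"] == ses)
        & (df["prog"] == prog)
    )
    # .describe() gives count, mean, std, min, quartiles and max in one go
    return df.loc[mask, "write"].describe()


stats = writing_stats(data, "male", "public", "middle", "academic")
print(stats["count"], stats["mean"])
```

Each comparison needs its own parentheses because `&` binds more tightly than `==`. The values passed to `writing_stats` could come straight from the lower-cased `input()` calls in the question.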

Related

How to select rows in dataframe based on a condition

I have an emails dataframe in which I have given this query:
williams = emails[emails["employee"] == "kean-s"]
This selects all the rows that have employee kean-s. Then I count the frequencies and print the top most. This is how it's done:
williams["X-Folder"].value_counts()[:10]
This gives output like this:
attachments 2026
california 682
heat wave 244
ferc 188
pr-crisis management 92
federal legislation 88
rto 78
india 75
california - working group 72
environmental issues 71
Now, I need to print all the rows from emails that have the X-Folder column equal to attachments, california, heat wave, etc. How do I go about it? When I print values[0] it simply returns the frequency number and not the term corresponding to it (I tried printing it because if I'm able to loop through it, I'll just put a condition inside the dataframe).
Use Series.isin with boolean indexing for values of index:
df = williams[williams["X-Folder"].isin(williams["X-Folder"].value_counts()[:10].index)]
Or:
df = williams[williams["X-Folder"].isin(williams["X-Folder"].value_counts().index[:10])]
If you need to filter all rows in the original DataFrame (including rows that do not match kean-s), then use:
df1 = emails[emails["X-Folder"].isin(williams["X-Folder"].value_counts().index[:10])]
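The key point is that `value_counts()` returns a Series whose index holds the folder names and whose values hold the frequencies, so the names come from `.index`. A small sketch with a made-up stand-in for the `emails` DataFrame (the rows below are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the emails DataFrame from the question.
emails = pd.DataFrame({
    "employee": ["kean-s"] * 6 + ["other-x"] * 2,
    "X-Folder": [
        "attachments", "attachments", "attachments",
        "california", "california", "ferc",
        "attachments", "misc",
    ],
})

williams = emails[emails["employee"] == "kean-s"]

counts = williams["X-Folder"].value_counts()  # sorted by frequency, descending
top_folders = counts.index[:2]                # the folder names, not the counts

# Rows of the original DataFrame (any employee) that sit in those folders
in_top = emails[emails["X-Folder"].isin(top_folders)]

print(list(top_folders))
print(len(in_top))
```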

Multiprocessing fuzzy wuzzy string search - python

I am trying to do string matching with fuzzy wuzzy in Python and bring back the ID of the match. My dataset is huge: dataset1 = 1.8 million records, dataset2 = 1.6 million records.
What I have tried so far:
First I tried the record linkage package in Python. Unfortunately it ran out of memory when building the multi-index, so I moved to a more powerful AWS machine and built it successfully. However, when I ran the comparison on it, it ran forever, which I assume is due to the number of comparisons.
Then I tried string matching with fuzzy wuzzy, parallelising the process with the dask package, and ran it on sample data. It works fine, but I know the process will still take time because the search space is wide. I am looking for a way to add blocking or indexing to this piece of code.
test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'] , 'ID' : ['1','3','4','8']})
Here, I am trying to look for test.Address1 in test2.Address1 and bring its ID.
import dask.dataframe as dd
from fuzzywuzzy import fuzz

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # return my_value corresponding to the highest score
    # (.loc replaces .ix, which has been removed from pandas)
    return slave_df.loc[slave_df.score.idxmax(), 'ID']

dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2)))
# scheduler='processes' replaces the older get=dask.multiprocessing.get
dmaster.compute(scheduler='processes')
This works fine; however, I am not sure how to apply indexing to it so that the search space is limited to the same city.
Let's say I create an index on the city field, subset based on the city of the original string, and pass that city to the helper function:
# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False,inplace=True)
I don't know how to do that. Please advise. Thanks in advance.
I prefer using fuzzywuzzy.process.extractOne. That compares a string to an iterable of strings.
import pandas as pd
from fuzzywuzzy import process

def extract_one(col, other):
    # need this for dask later
    other = other.compute() if hasattr(other, 'compute') else other
    return pd.DataFrame([process.extractOne(x, other) for x in col],
                        columns=['Address1', 'score', 'idx'],
                        index=col.index)

extract_one(test.Address1, test2.Address1)
Address1 score idx
0 123 chese wy 92 0
1 234 kookie Pl 83 1
2 345 Pizzza DR 86 2
3 456 Pretzel Junktion 95 3
The idx is the index of the other passed to extract_one that matches closest. I would recommend having a meaningful index, to make joining the results later on easier.
For your second question, about filtering to cities, I would use a groupby and apply
gr1 = test.groupby('city')
gr2 = test2.groupby("city")
gr1.apply(lambda x: extract_one(x.Address1,
                                gr2.get_group(x.name).Address1))
Address1 score idx
0 123 chese wy 92 0
1 234 kookie Pl 83 1
2 345 Pizzza DR 86 2
3 456 Pretzel Junktion 95 3
The only difference with dask is the need to specify a meta to the apply:
ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)
dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')
meta = pd.DataFrame(columns=['Address1', 'score', 'idx'])
dgr1.apply(lambda x: extract_one(x.Address1,
                                 dgr2.get_group(x.name).Address1),
           meta=meta).compute()
Address1 score idx
city
U 0 234 kookie Pl 83 1
1 234 kookie Pl 28 1
X 0 123 chese wy 92 0
1 123 chese wy 28 0
Here's a notebook: https://gist.github.com/a932b3591346b898d6816a5efc2bc5ad
I'm curious to hear how the performance is. I'm assuming the actual string comparison done in fuzzy wuzzy will take the bulk of the time, but I'd love to hear back on how much overhead is spent in pandas and dask. Make sure you have the C extensions for computing the Levenshtein distance.
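The blocking idea from the question (only compare addresses that share a city) does not depend on pandas, dask, or the scoring library. The sketch below uses only the standard library and bucketizes the candidates by city in a dict. Note that `difflib.SequenceMatcher` is swapped in here as a stand-in for `fuzz.token_set_ratio`, so the scores will differ from fuzzy wuzzy's; the helper names `similarity` and `best_match` are made up for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Same toy data as test / test2 in the question.
master = [("123 Cheese Way", "X"), ("234 Cookie Place", "U"),
          ("345 Pizza Drive", "X"), ("456 Pretzel Junction", "U")]
lookup = [("123 chese wy", "X", "1"), ("234 kookie Pl", "U", "3"),
          ("345 Pizzza DR", "Z", "4"), ("456 Pretzel Junktion", "Y", "8")]


def similarity(a, b):
    """Stand-in scorer (0.0 to 1.0); not the fuzzywuzzy algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Blocking step: bucket the candidates by city once, up front.
blocks = defaultdict(list)
for address, city, rec_id in lookup:
    blocks[city].append((address, rec_id))


def best_match(address, city):
    """Return the ID of the closest candidate in the same city, or None."""
    candidates = blocks.get(city, [])
    if not candidates:
        return None
    _, best_id = max(candidates, key=lambda c: similarity(address, c[0]))
    return best_id


matches = {address: best_match(address, city) for address, city in master}
print(matches)
```

Each lookup now scans only one bucket instead of the whole second dataset. As in the grouped output above, an address is matched to whatever its city bucket contains, even when the best candidate is a poor one, so a minimum-score cutoff is probably worth adding on real data.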
I ran into the same problem once. The whole process takes forever, and even with multiprocessing it is not going to be very fast. The main cause of the slowness is the fuzzy matching itself, which is computationally expensive.
A more efficient alternative, in my opinion, would be to use an embedding (for example bag of words) and apply an ML method to it. Working with numerical vectors makes the whole process much faster!
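To make the bag-of-words suggestion concrete, here is one possible reading of it, sketched with the standard library only: each address becomes a bag of character trigrams, and candidates are ranked by cosine similarity between the bags. This is an assumption about what the answer has in mind, not the answerer's actual method, and the function names are invented for the example. The speed-up the answer refers to would come from doing the same thing with sparse matrices in a numerical library rather than from this pure-Python loop.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Bag of character n-grams, lower-cased and padded with spaces."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram bags (0.0 when either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


candidates = ["123 chese wy", "234 kookie Pl", "345 Pizzza DR", "456 Pretzel Junktion"]
vectors = [char_ngrams(c) for c in candidates]  # vectorize the candidates once


def closest(query):
    """Return (best candidate, similarity) for one query string."""
    q = char_ngrams(query)
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


print(closest("456 Pretzel Junction"))
print(closest("123 Cheese Way"))
```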

Python correlation matrix for categorical data

I have some data for a charity which contains the amount someone donated and some information about the person who donated like below.
gender age country donation_amount
F 25 UK 15
F 65 France 80
M 55 Germany 54
F 41 UK 3
M 74 France 99
I would like to find out which columns are most strongly correlated with the donation amount so I can investigate them further, e.g. certain countries donate a lot compared to others, so it would be good to target them. This is easy to do with the DataFrame.corr() method in pandas; however, it doesn't work with categorical data such as gender, only with numerical data such as age.
Does anyone know a way I can do this?
I have read about using pandas.get_dummies() to convert categorical variables into dummy/indicator variables. The problem is I have quite a lot of columns, and a couple of them have over 40 different categories for demographics, so this gets very big very quickly and is hard to interpret (the way I have been doing it at least!).
I also found this article saying you can use spearmanr, but I read elsewhere that you shouldn't use spearmanr for categorical data. DataFrame.corr(method='spearman') doesn't work on categorical data either.
(Python: Rank order correlation for categorical data)
This is my first post so apologies if I haven't explained myself very well! Please let me know and I will correct anything if needed.
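One way to get a single number per categorical column without expanding it into dummies is the correlation ratio (eta squared): the share of the variance in the numeric column that is explained by which category a row belongs to. This is a suggestion that does not come from the post itself, and the function name `correlation_ratio` is made up here. A minimal sketch on the five sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "M", "F", "M"],
    "age": [25, 65, 55, 41, 74],
    "country": ["UK", "France", "Germany", "UK", "France"],
    "donation_amount": [15, 80, 54, 3, 99],
})


def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares over total sum of squares."""
    grand_mean = values.mean()
    groups = values.groupby(categories)
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return ss_between / ss_total if ss_total else 0.0


scores = {
    col: correlation_ratio(df[col], df["donation_amount"])
    for col in ["gender", "country"]
}
# Rank the categorical columns by how much of the donation variance they explain
for col, eta_sq in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(col, round(eta_sq, 3))
```

The result lies between 0 (category tells you nothing) and 1 (category determines the amount). Columns with many categories and few rows per category will score high almost by construction, so with 40+ categories it is worth checking group sizes before trusting the ranking.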

Calculating a probability based on several variables in a Pandas dataframe

I'm still very new to Python and Pandas, so bear with me...
I have a dataframe of passengers on a ship that sank. I have broken this down into other dataframes by male and female, and also by class, to create probabilities for survival. I made a function that compares one dataframe to a dataframe of only survivors, and calculates the probability of survival among this group:
def survivability(total_pass_df, column, value):
    survivors = sum(did_survive[column] == value)
    total = len(total_pass_df)
    survival_prob = round((survivors / total), 2)
    return survival_prob
But now I'm trying to compare survivability among smaller groups - male first class passengers vs female third class passengers for example. I did make dataframes for both of these groups, but I still can't use my survivability function because I'm comparing two different columns - sex and class - rather than just one.
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
But I'm supposed to use Pandas for this, and I can't for the life of me work out in my head how to do it....
:/
Without a sample of the data frames you're working with, I can't be sure if I understand your question correctly. But based on your description of the pure-Python procedure,
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
you can do this in Pandas by simply writing
dataframe['survived'].mean()
That's it. Given that all the values are either 1 or 0, the mean will be the number of 1's divided by the total number of rows.
If you start out with a data frame that has columns like survived, sex, class, and so on, you can elegantly combine this with Pandas' boolean indexing to pick out the survival rates for different groups. Let me use the Socialcops Titanic passengers data set as an example to demonstrate. Assuming the DataFrame is called df, if you want to analyze only male passengers, you can get those records as
df[df['sex'] == 'male']
and then you can take the survived column of that and get the mean.
>>> df[df['sex'] == 'male']['survived'].mean()
0.19198457888493475
So 19% of male passengers survived. If you want to narrow down to male second-class passengers, you'll need to combine the conditions using &, like this:
>>> df[(df['sex'] == 'male') & (df['pclass'] == 2)]['survived'].mean()
0.14619883040935672
This is getting a little unwieldy, but there's an easier way that actually lets you do multiple categories at once. (The catch is that this is a somewhat more advanced Pandas technique and it might take a while to understand it.) Using the DataFrame.groupby() method, you can tell Pandas to group the rows of the data frame according to their values in certain columns. For example,
df.groupby('sex')
tells Pandas to group the rows by their sex: all male passengers' records are in one group, and all female passengers' records are in another group. The thing you get from groupby() is not a DataFrame, it's a special kind of object that lets you apply aggregation functions - that is, functions which take a whole group and turn it into one number (or something). So, for example, if you do this
>>> df.groupby('sex').mean()
pclass survived age sibsp parch fare \
sex
female 2.154506 0.727468 28.687071 0.652361 0.633047 46.198097
male 2.372479 0.190985 30.585233 0.413998 0.247924 26.154601
body
sex
female 166.62500
male 160.39823
you see that for each column, Pandas takes the average of that column's values over the male passengers' records, and also over the female passengers' records. All you care about here is the survival rate, so just use
>>> df.groupby('sex').mean()['survived']
sex
female 0.727468
male 0.190985
One big advantage of this is that you can give more than one column to group by, if you want to look at small groups. For example, sex and class:
>>> df.groupby(['sex', 'pclass']).mean()['survived']
sex pclass
female 1 0.965278
2 0.886792
3 0.490741
male 1 0.340782
2 0.146199
3 0.152130
(you have to give groupby a list of column names if you're giving more than one)
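For readers who want to try this without downloading the data set, here is a small self-contained sketch. The six rows are made up, and only the column names `sex`, `pclass` and `survived` follow the answer. It selects the `survived` column before calling `mean()`, which also sidesteps the error that newer pandas versions can raise when `mean()` is applied to a grouped frame containing text columns.

```python
import pandas as pd

# Hypothetical mini passenger list (not the real Titanic data).
df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "pclass": [1, 3, 3, 1, 3, 3],
    "survived": [1, 0, 0, 1, 1, 0],
})

# Survival rate for every (sex, class) combination at once
rates = df.groupby(["sex", "pclass"])["survived"].mean()
print(rates)

# The same number for a single group via boolean indexing
male_third = df[(df["sex"] == "male") & (df["pclass"] == 3)]["survived"].mean()
print(male_third)
```

The result of the grouped call is a Series with a two-level index, so a single group can be looked up with a tuple, for example `rates.loc[("female", 3)]`.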
Have you tried merging the two dataframes by passenger ID and then doing a pivot table in Pandas with whatever row subtotals and aggfunc=numpy.mean?
import pandas as pd

# Passenger List
p_list = pd.DataFrame()
p_list['ID'] = [1,2,3,4,5,6]
p_list['Class'] = ['1','2','2','1','2','1']
p_list['Gender'] = ['M','M','F','F','F','F']
# Survivor List
s_list = pd.DataFrame()
s_list['ID'] = [1,2,3,4,5,6]
s_list['Survived'] = [1,0,0,0,1,0]
# Merge the datasets
merged = pd.merge(p_list, s_list, how='left', on=['ID'])
# Pivot to get sub means ('mean' is the same aggregation as numpy.mean)
result = pd.pivot_table(merged, index=['Class','Gender'], values=['Survived'], aggfunc='mean', margins=True)
# Reset the index so Class and Gender become ordinary columns
result = result.reset_index()
print(result)
Class Gender Survived
0 1 F 0.000000
1 1 M 1.000000
2 2 F 0.500000
3 2 M 0.000000
4 All 0.333333

Python code to create graphs from columns of data

I am writing a script that produces histograms of specific columns in a tab-delimited text file. Currently, the program will create a single graph from a hard coded column number that I am using as a placeholder.
The input table looks something like this:
SAMPID TRAIT COHORT AGE BMI WEIGHT WAIST HEIGHT LDL HDL
123 LDL STUDY1 52 32.2 97.1 102 149 212.5 21.4
456 LDL STUDY1 33 33.7 77.0 101 161 233.2 61.2
789 LDL STUDY2 51 25.1 67.1 107 162 231.1 21.3
abc LDL STUDY2 76 33.1 80.4 99 134 220.5 21.2
...
And I have the following code:
import csv
from matplotlib import pyplot

# Read the whole tab-delimited table into a list of rows
with open("path", newline="") as f:
    input_table = list(csv.reader(f, delimiter="\t"))

column = []
missing = 0
nonmissing = 0
for E in input_table[1:3635]:  # the number of rows in the input table
    if E[8] == "":  # [8] is hard coded now, want to change this to column header name "LDL"
        missing += 1
    else:
        nonmissing += 1
        column.append(float(E[8]))

pyplot.hist(column, bins=20, label="the label")  # how to handle multiple histogram outputs if multiple column headers are specified?
print("n = ", nonmissing)
print("number of missing values: ", missing)
pyplot.show()
Can anyone offer suggestions that would allow me to expand/improve my program to do any of the following?
graph data from columns specified by header name, not the column number
iterate over a list containing multiple header names to create/display several histograms at once
Create a graph that only includes a subset of the data, as specified by a specific value in a column (ie, for a specific sample ID, or a specific COHORT value)
One component not shown here is that I will eventually have a separate input file that will contain a list of headers (ie "HDL", "LDL", "HEIGHT") needing to be graphed separately, but then displayed together in a grid-like manner.
I can provide additional information if needed.
Well, I have a few comments and suggestions, hope it helps.
In my opinion, the first thing you should do to get all those things you want is to structure your data.
Try to create, for each row from the file, a dictionary like
{'SAMPID': <value_1>, 'TRAIT': <value_2>, ...}
And then you will have a list of such dict objects, and you will be able to iterate it and filter by any field you wish.
That is the first and most important point.
After you do that, modularize your code; do not just write a single script that does the whole job. Identify the pieces of code that would otherwise be repeated (such as a filtering loop), put each into a function, and call it with the necessary arguments.
One additional detail: you don't need to hard-code the size of your list as in
for E in input_table[1:3635]:
Just write
for E in input_table[1:]:
and it will work for a list of any length. Of course, if you stop treating your data as raw text, that won't be necessary. Just iterate over your list of dicts normally.
If you have more doubts, let me know.
Francisco
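The list-of-dicts approach described above can be sketched with `csv.DictReader`, which builds one dict per row keyed by the header names. The helper names `column_values` and `plot_histograms` are invented for this example, and the sample is the table from the question with one LDL value blanked out to show the missing-value count. The plotting function is defined but not called, so the data helpers run even without matplotlib installed.

```python
import csv
import io

FIELDS = ["SAMPID", "TRAIT", "COHORT", "AGE", "BMI", "WEIGHT", "WAIST", "HEIGHT", "LDL", "HDL"]
ROWS = [
    ["123", "LDL", "STUDY1", "52", "32.2", "97.1", "102", "149", "212.5", "21.4"],
    ["456", "LDL", "STUDY1", "33", "33.7", "77.0", "101", "161", "233.2", "61.2"],
    ["789", "LDL", "STUDY2", "51", "25.1", "67.1", "107", "162", "231.1", "21.3"],
    ["abc", "LDL", "STUDY2", "76", "33.1", "80.4", "99", "134", "", "21.2"],  # LDL blanked for the demo
]
SAMPLE = "\n".join("\t".join(fields) for fields in [FIELDS] + ROWS)


def column_values(rows, header, **filters):
    """Return (values, n_missing) for `header`, keeping rows that match every filter."""
    values, missing = [], 0
    for row in rows:
        if any(row[key] != wanted for key, wanted in filters.items()):
            continue  # row is outside the requested subset
        if row[header] == "":
            missing += 1
        else:
            values.append(float(row[header]))
    return values, missing


def plot_histograms(rows, headers, **filters):
    """Draw one histogram per header, side by side in a single figure."""
    from matplotlib import pyplot  # imported lazily; only needed for plotting
    fig, axes = pyplot.subplots(1, len(headers), squeeze=False)
    for ax, header in zip(axes[0], headers):
        values, missing = column_values(rows, header, **filters)
        ax.hist(values, bins=20)
        ax.set_title(f"{header} (n={len(values)}, missing={missing})")
    pyplot.show()


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
ldl, ldl_missing = column_values(rows, "LDL", COHORT="STUDY2")
print(ldl, ldl_missing)
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, and a call such as `plot_histograms(rows, ["HDL", "LDL", "HEIGHT"], COHORT="STUDY1")` would cover all three items on the question's wish list: columns chosen by header name, several histograms at once, and a subset selected by a column value.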
