Let's say I have a (9000x9000) table like the following:
zone 304 305 306 307 308 ...
001 1 2 8 9 12 ...
002 6 8 3 7 1 ...
003 4 8 1 12 9 ...
004 2 7 3 16 34 ...
...
The main data table looks like this:
package # weight origin destination zone
123 2oz 004 305 7 to be inputted here
.
.
.
I need SAS to output the "zone" corresponding to a given ordered pair. I fear the only way would be with some type of loop. For instance, in the example above, the origin value comes from the row labels and the destination from the column labels; the intersection is the target value I need assigned to "zone".
A solution using python data wrangling libraries would work also.
Also, the 9000x9000 table is a CSV file exported from Excel.
You could use pandas; it has a built-in function to read from an Excel document: pandas.read_excel()
So for this file:
import pandas as pd
df = pd.read_excel('test.xlsx')
print(df[101][502])
Output:
67
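Side note on the lookup in the question (a sketch, not from the answer above): since the matrix is a CSV with origins as row labels and destinations as column headers, one option is to make the first column the index and use .loc. The file name 'zones.csv' is a placeholder, and whether the label normalization is needed depends on how the labels are stored.
import pandas as pd

# First column (origin) becomes the row index; remaining headers are the destinations.
zones = pd.read_csv('zones.csv', index_col=0)
zones.index = zones.index.astype(str).str.zfill(3)   # normalize 4 -> '004' if labels lost their zeros
zones.columns = zones.columns.astype(str)

# Single lookup: intersection of origin row and destination column.
print(zones.loc['004', '305'])                       # -> 7 in the sample table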
My approach:
Load the data set into a temporary array (9000x9000) and then look up each element as needed. It could be memory intensive, but 9000x9000 seems small enough to me.
Another safe approach: transpose the data into a long format:
Key1 Key2 Value
001 304 1
001 305 2
...
Then, in any language, it becomes a join/merge instead of lookup.
You can also use PROC IML, which loads the data as a matrix and then you can access using the indexes.
There are also ways in SAS to do this lookup via a merge, primarily using VVALUEX.
Without knowing how you're going to use it, I can't provide any more information.
EDIT: added the third option, IML. Basically there are many ways to do this; the best depends on how you're planning to use it overall.
EDIT2:
1. Import first data set into SAS (PROC IMPORT)
2. Transpose using PROC TRANSPOSE
3. Merge with either a data step or PROC SQL, by ORIGIN and DESTINATION, which will be straightforward. At this point it's really a standard lookup with two keys.
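Since the question mentions that a Python solution would also work, here is a rough pandas sketch of the same transpose-then-merge workflow (the file names and the 'origin'/'destination' column names are my assumptions, not from the thread):
import pandas as pd

zones = pd.read_csv('zones.csv', dtype=str)           # wide 9000x9000 matrix, first column = origin
packages = pd.read_csv('packages.csv', dtype=str)     # main table with origin/destination columns

# Wide -> long: one row per (origin, destination, zone) triple.
long = zones.melt(id_vars=zones.columns[0], var_name='destination', value_name='zone')
long = long.rename(columns={zones.columns[0]: 'origin'})

# The lookup is now an ordinary two-key merge.
packages = packages.merge(long, on=['origin', 'destination'], how='left')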
I am trying to create a dictionary from a dataframe in the following way.
My dataframe contains a column called station_id, whose values are unique; that is, each row corresponds to one station. There is another column called trip_id (see example below), and many stations can be associated with a single trip_id. For example:
l1=[1,1,2,2]
l2=[34,45,66,67]
df1=pd.DataFrame(list(zip(l1,l2)),columns=['trip_id','station_name'])
df1.head()
trip_id station_name
0 1 34
1 1 45
2 2 66
3 2 67
I am trying to get a dictionary d={1:[34,45],2:[66,67]}.
I solved it with a for loop in the following fashion.
from tqdm import tqdm
Trips_Stations = {}
Trips = set(df1['trip_id'])
T = list(Trips)
for i in tqdm(range(len(T))):
    c_id = T[i]
    Values = list(df1[df1.trip_id == c_id].station_name)
    Trips_Stations.update({c_id: Values})
Trips_Stations
My actual dataset has about 65000 rows. The above takes about 2 minutes to run. While this is acceptable for my application, I was wondering if there is a faster way to do it using base pandas.
Thanks
Somehow Stack Overflow suggested that I look at groupby. This is much faster:
d = df1.groupby('trip_id')['station_name'].apply(list)
from collections import OrderedDict
o_d = d.to_dict(OrderedDict)
o_d = dict(o_d)
It took about 30 seconds for the dataframe with 65,000 rows.
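As a side note (not from the thread): on Python 3.7+ a plain dict already preserves insertion order, so the OrderedDict detour can be skipped; a minimal variant of the same groupby:
d = df1.groupby('trip_id')['station_name'].apply(list).to_dict()   # {1: [34, 45], 2: [66, 67]}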
I am trying to do string matching and bring over the matched ID using fuzzywuzzy in Python. My datasets are huge: dataset1 = 1.8 million records, dataset2 = 1.6 million records.
What I have tried so far:
First I tried the recordlinkage package in Python. Unfortunately it ran out of memory when it built the multi-index, so I moved to an AWS machine with more power and built it successfully; however, when I tried to run the comparison it ran forever, which I accept is due to the number of comparisons.
Then I tried to do the string matching with fuzzywuzzy and parallelize the process using the dask package, and executed it on sample data. It works fine, but I know the process will still take time because the search space is wide. I am looking for a way to add blocking or indexing to this piece of code.
test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'] , 'ID' : ['1','3','4','8']})
Here, I am trying to look for test.Address1 in test2.Address1 and bring its ID.
import pandas as pd
import dask.dataframe as dd
from fuzzywuzzy import fuzz

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # return the ID corresponding to the highest score
    return slave_df.loc[slave_df.score.idxmax(), 'ID']

dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2)))
dmaster.compute(scheduler='processes')
This works fine, however I am not sure how to apply indexing to it to limit the search space to the same city.
Let's say I create an index on the city field, subset based on the city of the original string, and pass that city to the helper function:
# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False,inplace=True)
I don't know how to do that. Please advise. Thanks in advance.
I prefer using fuzzywuzzy.process.extractOne. That compares a string to an iterable of strings.
import pandas as pd
from fuzzywuzzy import process

def extract_one(col, other):
    # need this for dask later
    other = other.compute() if hasattr(other, 'compute') else other
    return pd.DataFrame([process.extractOne(x, other) for x in col],
                        columns=['Address1', 'score', 'idx'],
                        index=col.index)
extract_one(test.Address1, test2.Address1)
Address1 score idx
0 123 chese wy 92 0
1 234 kookie Pl 83 1
2 345 Pizzza DR 86 2
3 456 Pretzel Junktion 95 3
The idx is the index into the other passed to extract_one that matches most closely. I would recommend having a meaningful index, to make joining the results later on easier.
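For example, one way to pull the matched ID back onto the original frame using that idx column (a sketch, assuming the default integer indexes of test and test2):
matches = extract_one(test.Address1, test2.Address1)
test_with_id = test.assign(ID_there=test2.loc[matches['idx'], 'ID'].values)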
For your second question, about filtering to cities, I would use a groupby and apply
gr1 = test.groupby('city')
gr2 = test2.groupby("city")
gr1.apply(lambda x: extract_one(x.Address1,
gr2.get_group(x.name).Address1))
Address1 score idx
0 123 chese wy 92 0
1 234 kookie Pl 83 1
2 345 Pizzza DR 86 2
3 456 Pretzel Junktion 95 3
The only difference with dask is the need to specify a meta to the apply:
ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)
dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')
meta = pd.DataFrame(columns=['Address1', 'score', 'idx'])
dgr1.apply(lambda x: extract_one(x.Address1,
dgr2.get_group(x.name).Address1),
meta=meta).compute()
Address1 score idx
city
U 0 234 kookie Pl 83 1
1 234 kookie Pl 28 1
X 0 123 chese wy 92 0
1 123 chese wy 28 0
Here's a notebook: https://gist.github.com/a932b3591346b898d6816a5efc2bc5ad
I'm curious to hear how the performance is. I'm assuming the actual string comparison done in fuzzywuzzy will take the bulk of the time, but I'd love to hear back on how much overhead is spent in pandas and dask. Make sure you have the C extensions for computing the Levenshtein distance.
I ran into the same problem once. The whole process takes forever, and even if you use multiprocessing it is not really going to be very fast. The main cause of the slow speed is the fuzzy matching, because the processing is very tedious and requires a lot of time.
Alternatively, and more efficient in my opinion, would be to use an embedding, i.e. a bag of words, and apply an ML method on it. The fact that you work with numerical vectors makes the whole process way faster!
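A minimal sketch of that idea (my own illustration, not from the answer), using character n-gram TF-IDF vectors and a nearest-neighbour lookup from scikit-learn instead of pairwise fuzzy scores, on the test/test2 frames from the question:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Vectorize both address lists with the same character n-gram vocabulary.
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
X2 = vec.fit_transform(test2.Address1)          # candidates
X1 = vec.transform(test.Address1)               # queries

# One "best match" per query by cosine distance.
nn = NearestNeighbors(n_neighbors=1, metric='cosine').fit(X2)
dist, idx = nn.kneighbors(X1)
test['ID_there'] = test2['ID'].iloc[idx.ravel()].values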
I got a data set with 5,000,000 rows x 3 columns.
Basically, it looks like:
location os clicked
0 China ios 1
1 USA android 0
2 Japan ios 0
3 China android 1
So, I went to Pandas.DataFrame for some awesome and fast support.
Now I am going to replace the values in one column (Series) of the DataFrame according to a dict.
NOTE: the dict I used as a reference looks like:
{'China': 1,
 'USA': 2,
 'Japan': 3,
 ...}
because I built it from the unique values returned by pandas.DataFrame.Column_Label.drop_duplicates().
Finally, I got:
location os clicked
0 1 ios 1
1 2 android 0
2 3 ios 0
3 1 android 1
The full mapping took 446 s.
Is there a faster way to do this?
I think the replace() function wastes a lot of time on pointless searching. Am I heading in the right direction?
I can answer my own question now.
The point of doing this is to handle categorical data, which appears over and over again in classification tasks and the like. The usual approach is one-hot encoding, which converts categorical data into numerical vectors acceptable to the sklearn package or statsmodels.
To do so, simply read the csv file as a pandas.DataFrame using:
import pandas as pd
data = pd.read_csv(path, encoding='utf-8')   # path = location of the csv file
then:
data_binary = pd.get_dummies(data, prefix=['os','locate'],columns=['os','location'])
and all good to go.
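As a side note on the original timing question (my addition, not from the thread): for a plain value-to-integer mapping, Series.map with a dict is usually much faster than DataFrame.replace, because it does a single hashed lookup per element instead of searching:
mapping = {'China': 1, 'USA': 2, 'Japan': 3}           # built from drop_duplicates() as above
data['location'] = data['location'].map(mapping)       # vectorized dict lookup; unmapped values become NaN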
I've been reading a huge (5 GB) gzip file in the form:
User1 User2 W
0 11 12 1
1 12 11 2
2 13 14 1
3 14 13 2
which is basically a directed-graph representation of connections among users with a certain weight W. Since the file is so big, I tried to read it with networkx, building a directed graph and then converting it to undirected, but it took too much time. So I was thinking of doing the same thing by analysing a pandas dataframe. I would like to turn the previous dataframe into the form:
User1 User2 W
0 11 12 3
1 13 14 3
where links that appear in both directions have been merged into one, with W being the sum of the individual weights. Any help would be appreciated.
There is probably a more concise way, but this works. The main trick is just to normalize the data so that User1 is always the lower-numbered ID. Then you can use groupby, since 11,12 and 12,11 are now recognized as representing the same thing.
In [330]: df = pd.DataFrame({"User1":[11,12,13,14],"User2":[12,11,14,13],"W":[1,2,1,2]})
In [331]: df['U1'] = df[['User1','User2']].min(axis=1)
In [332]: df['U2'] = df[['User1','User2']].max(axis=1)
In [333]: df = df.drop(['User1','User2'],axis=1)
In [334]: df.groupby(['U1','U2'])['W'].sum()
Out[334]:
U1 U2
11 12 3
13 14 3
Name: W, dtype: int64
For more concise code that avoids creating new variables, you could replace the middle 3 steps with:
In [400]: df.loc[df.User1>df.User2,['User1','User2']] = df.loc[df.User1>df.User2,['User2','User1']].values
Note that column switching can be trickier than you'd think, see here: What is correct syntax to swap column values for selected rows in a pandas data frame using just one line?
As far as making this code fast in general, it will depend on your data. I don't think the above code will matter as much as other things you might do. For example, your problem should be amenable to a chunking approach where you iterate over sections of the file, aggregating as you go so the data gradually shrinks on each pass. In that case, the main thing you need to think about is sorting the data before chunking, so as to minimize how many passes you need to make. Doing it that way should allow you to do all the work in memory.
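A rough sketch of that chunked approach with pandas (the file name, separator, and chunk size are my assumptions):
import pandas as pd

chunks = []
for chunk in pd.read_csv('edges.txt.gz', sep=r'\s+', chunksize=1_000_000, compression='gzip'):
    # Normalize each pair so the smaller ID is always in U1.
    chunk['U1'] = chunk[['User1', 'User2']].min(axis=1)
    chunk['U2'] = chunk[['User1', 'User2']].max(axis=1)
    chunks.append(chunk.groupby(['U1', 'U2'])['W'].sum())    # partial sums per chunk

# Combine the partial results and sum again across chunks.
result = pd.concat(chunks).groupby(level=['U1', 'U2']).sum()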
I am writing a script that produces histograms of specific columns in a tab-delimited text file. Currently, the program creates a single graph from a hard-coded column number that I am using as a placeholder.
The input table looks something like this:
SAMPID TRAIT COHORT AGE BMI WEIGHT WAIST HEIGHT LDL HDL
123 LDL STUDY1 52 32.2 97.1 102 149 212.5 21.4
456 LDL STUDY1 33 33.7 77.0 101 161 233.2 61.2
789 LDL STUDY2 51 25.1 67.1 107 162 231.1 21.3
abc LDL STUDY2 76 33.1 80.4 99 134 220.5 21.2
...
And I have the following code:
import csv
import numpy
from matplotlib import pyplot

r = csv.reader(open("path", 'r'), delimiter='\t')
input_table = []
for row in r:
    input_table.append(row)

column = []
missing = 0
nonmissing = 0
for E in input_table[1:3635]:  # the number of rows in the input table
    if E[8] == "":  # [8] is hard coded now, want to change this to column header name "LDL"
        missing += 1
    else:
        nonmissing += 1
        column.append(float(E[8]))

pyplot.hist(column, bins=20, label="the label")  # how to handle multiple histogram outputs if multiple column headers are specified?
print("n = ", nonmissing)
print("number of missing values: ", missing)
pyplot.show()
Can anyone offer suggestions that would allow me to expand/improve my program to do any of the following?
graph data from columns specified by header name, not the column number
iterate over a list containing multiple header names to create/display several histograms at once
Create a graph that only includes a subset of the data, as specified by a specific value in a column (ie, for a specific sample ID, or a specific COHORT value)
One component not shown here is that I will eventually have a separate input file that will contain a list of headers (ie "HDL", "LDL", "HEIGHT") needing to be graphed separately, but then displayed together in a grid-like manner.
I can provide additional information if needed.
Well, I have a few comments and suggestions, hope it helps.
In my opinion, the first thing you should do to get all those things you want is to structure your data.
Try to create, for each row from the file, a dictionary like
{'SAMPID': <value_1>, 'TRAIT': <value_2>, ...}
And then you will have a list of such dict objects, and you will be able to iterate it and filter by any field you wish.
That is the first and most important point.
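A minimal sketch of that idea using csv.DictReader (the column names come from the sample table; "path" is the same placeholder used in the question's code):
import csv

with open("path", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))         # one dict per row, keyed by the header names

ldl = [float(r["LDL"]) for r in rows if r["LDL"] != ""]    # select a column by name instead of E[8]
study1 = [r for r in rows if r["COHORT"] == "STUDY2"]      # filter by a value in another column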
After you do that, modularize your code; do not just create a single script to get the whole job done. Identify the pieces of code that would be redundant (such as a filtering loop), put them into a function, and call it, passing all necessary args.
One additional detail: you don't need to hardcode the size of your list as in
for E in input_table[1:3635]:
Just write
for E in input_table[1:]
and it will work for any list size (note that [1:-1] would also drop the last row). Of course, if you stop treating your data as raw text, that won't be necessary; just iterate your list of dicts normally.
If you have more doubts, let me know.
Francisco
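Going back to the grid-of-histograms requirement from the question (my own sketch, not part of the answer above; it assumes the rows list of dicts from the csv.DictReader snippet earlier):
import math
from matplotlib import pyplot

headers = ["LDL", "HDL", "HEIGHT"]                 # would come from the separate header file
ncols = 2
nrows = math.ceil(len(headers) / ncols)
fig, axes = pyplot.subplots(nrows, ncols, squeeze=False)
for ax, name in zip(axes.flat, headers):
    values = [float(r[name]) for r in rows if r[name] != ""]   # rows: list of dicts built above
    ax.hist(values, bins=20)
    ax.set_title(name)
pyplot.show()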