I have a file (dozens of columns and millions of rows) that essentially looks like this:
customerID VARCHAR(11)
accountID VARCHAR(11)
snapshotDate Date
isOpen Boolean
...
The records for one account might look like this:
1,100,200901,1,...
1,100,200902,1,...
1,100,200903,1,...
1,100,200904,1,...
1,100,200905,1,...
1,100,200906,1,...
...
1,100,201504,1,...
1,100,201505,1,...
1,100,201506,1,...
When an account is closed, two things can happen. Typically, no further snapshots for that record will exist in the data. Occasionally, further records will continue to be added but the isOpen flag will be set to 0.
I want to add an additional Boolean column, called "closedInYr", that has a 0 value UNLESS THE ACCOUNT CLOSES WITHIN ONE YEAR AFTER THE SNAPSHOT DATE.
My solution is slow and gross. It takes each record, counts forward in time 12 months, and if it finds a record with the same customerID, accountID, and isOpen set to 1, it populates the record with a 0 in the "closedInYr" field, otherwise it populates the field with a 1. It works, but the performance is not acceptable, and we have a number of these kinds of files to process.
Any ideas on how to implement this? I use R, but am willing to code in Perl, Python, or practically anything except COBOL or VB.
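In pseudo-code, the per-record check is roughly the following (a simplified Python sketch of the approach described above, not the actual R implementation; it assumes the rows are dicts sorted by customerID, accountID and snapshotDate, with snapshotDate as a YYYYMM integer):
def closed_in_year(rows, i):
    cust, acct, snap = rows[i]["customerID"], rows[i]["accountID"], rows[i]["snapshotDate"]
    target = snap + 100           # YYYYMM arithmetic: the same month one year later
    for later in rows[i + 1:]:
        if (later["customerID"] == cust and later["accountID"] == acct
                and later["snapshotDate"] == target and later["isOpen"] == 1):
            return 0              # still open 12 months after this snapshot
    return 1                      # no open snapshot a year later -> closed within the year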
Thanks
I suggest using the Linux "date" command to convert the dates to Unix timestamps.
A Unix timestamp is the number of seconds elapsed since 1 January 1970, so a year is roughly 60s*60m*24h*365d seconds. If the difference between two timestamps is more than that, the gap is longer than a year.
It will be something like this:
>date --date='201106' "+%s"
1604642400
So if you use Perl, which is a pretty good file-handling language, you can parse the whole file in a few lines and run your date command via backticks or eval.
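If you end up in Python rather than Perl, the same timestamp arithmetic can be done without shelling out at all; a small sketch, assuming the YYYYMM snapshot format shown in the question:
from datetime import datetime

ONE_YEAR = 60 * 60 * 24 * 365   # seconds in a (non-leap) year

def seconds_between(a, b):
    """Seconds between two snapshot dates in the YYYYMM format used above."""
    return (datetime.strptime(b, "%Y%m") - datetime.strptime(a, "%Y%m")).total_seconds()

print(seconds_between("201406", "201506") <= ONE_YEAR)   # True  (exactly one year apart)
print(seconds_between("201406", "201507") <= ONE_YEAR)   # False (thirteen months apart)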
If all the snapshots for a given account appear in one row, and accounts that were open for the same period of time have rows of the same length (i.e., snapshots were taken at regular intervals), then one possibility might be filtering based on row lengths. If the longest open row has length N and one year's worth of snapshots is M, then you know an N-M row was open, at longest, one year less than the longest... That approach doesn't handle the case where snapshots keep getting added with the open flag set to 0, but it might at least reduce the number of searches that need to be made per row.
At least, that's an idea. More generally, searching from the end to find the last year where isOpen == 1 might cut the search down a little.
Of course, this all assumes each record is in one row. If not, maybe a melt is in order first?
I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new (index2) file that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1 except it only contains IDs from read1.
I am trying to implement the methods I've read here. But I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the associated 3 lines following the ID. 2) unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can some folks point me in some direction? Maybe none of those 5 methods are ideal for this. I don't know any information theory; we have plenty of RAM so I think holding the data in RAM for searching is the most efficient? I'm really not sure.
Here is a sample of what index1 looks like (IDs start with #M00347):
#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
If the data of index1 can fit in memory, the best approach is to do a single scan of that file and store all the data in a dictionary like this:
{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0":["CCTAAGGTTCGG","+","CDDDDFFFFFCB"],
"#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0":["CGCCATGCATCC","+","BBCCBBFFFFFF"],
..... }
Values can be stored as formatted strings if you prefer.
After this, you can do a single scan of read1 and, whenever an ID is encountered, do a simple lookup in the dictionary to retrieve the needed data.
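A minimal sketch of that two-pass approach (the file names are placeholders, and keying on the first whitespace-separated token of the header line is an assumption -- use whatever part of the ID actually matches between your files):
index1_records = {}

with open("index1.fastq") as fh:
    while True:
        header = fh.readline()
        if not header:
            break
        block = header + "".join(fh.readline() for _ in range(3))   # ID line + its next 3 lines
        index1_records[header.split()[0]] = block

with open("read1.fastq") as fh, open("index2.fastq", "w") as out:
    for line_number, line in enumerate(fh):
        if line_number % 4 == 0:                                    # IDs sit on every 4th line
            record = index1_records.get(line.split()[0])
            if record is not None:
                out.write(record)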
Each day I receive many different files from different vendors, and the sizes vary widely. I am looking for some dynamic code that will decide what is relevant across all files. I would like to think through how to break each file into components (df1, df2, df3 for example), which will make analysis easier.
Basically the first 6 lines are for overall information about the store (df1).
The 2nd component is reserved for specific item sales (starting on row 9, ending in a DIFFERENT row in every file), and I'm not sure how to capture that. I have tried something along the lines of
numb = df.loc['Type of payment'].index[0] - 2
but it is bringing in the tuple instead of the row location (int). How can I save upperrange and lowerrange as dynamic ints so that each day it will bring in the correct df2 data I am looking for?
The same problem exists at the bottom under "Type of payment" - you will notice that crypto is included for the 1st day but not the 2nd. I need to find a way to get a dynamic range to remove erroneous info and keep the integrity of the rest. I think finding the lowerrange will allow me to capture from that point to the end of the sheet, but I'm open to suggestions.
df = pd.read_csv('GMSALES.csv', skipfooter=2)
upperrange = df.loc['Item Number'] #brings in tuple
lowerrange = df.loc['Type of payment'] #brings in tuple
df1 = df.iloc[:,7] #this works
df2 = df.iloc[:('upperrange':'lowerrange')] # this is what I would like to get to
df3 = df.iloc[:(lowerrange:)] # this is what I would like to get to
Your organizational problem is that your data comes in as a spreadsheet that is used for physical organization more than functional organization. The "columns" are merely typographical tabs. The file contains several types of heterogeneous data; you are right in wanting to reorganize this into individual data frames.
Very simply, you need to parse the file, customer by customer -- either before or after reading it into RAM.
From your current organization, this involves simply scanning the "df2" range of your heterogeneous data frame. I think that the simplest way is to start from row 7 and look for "Item Number" in column A; that is your row of column names. Then scan until you find a row with nothing in column A; back up one row, and that gives you lowerrange.
Repeat with the payments: find the next row with "Type of payment". I will assume that you have some way to discriminate payment types from fake data, such as a list of legal payment types (strings). Scan from "Type of Payment" until you find a row with something other than a legal payment type; the previous row is your lowerrange for df3.
Can you take it from there?
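If it helps, here is a rough pandas sketch of that scan. It re-reads the file with header=None so the first column can be searched directly, and it assumes the marker strings appear exactly as quoted above; treat it as illustrative rather than a drop-in solution:
import pandas as pd

df = pd.read_csv('GMSALES.csv', header=None, skipfooter=2, engine='python')
col_a = df.iloc[:, 0].astype(str)

upperrange = col_a[col_a == 'Item Number'].index[0]        # column-name row of df2
blanks = col_a[upperrange + 1:] == 'nan'                   # empty cells in column A
item_end = blanks[blanks].index[0]                         # first blank after the item header
lowerrange = col_a[col_a == 'Type of payment'].index[0]    # start of df3

df1 = df.iloc[:6]                       # store-level information (first 6 lines)
df2 = df.iloc[upperrange:item_end]      # item sales block, including its header row
df3 = df.iloc[lowerrange:]              # payment block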
I have a dataset of thousands of files and I read/process them with PySpark.
First, I've created functions like the following one to process the whole dataset, and this is working great.
from pyspark.sql import Window
from pyspark.sql import functions as F

def get_volume_spark(data):
    days = lambda i: i * 86400  # This is 60sec*60min*24h
    partition = Window.partitionBy("name").orderBy(F.col("date").cast("long")).rangeBetween(days(-31), days(0))
    data = data.withColumn("monthly_volume", F.count(F.col("op_id")).over(partition))\
               .filter(F.col("monthly_volume") >= COUNT_THRESHOLD)
    return data
Every day new files arrive, and I want to process the new files ONLY and append the results to the previously created output, instead of re-processing the whole dataset (which grows every day), because that would take too long and the earlier work has already been done.
The other thing is that here I split by month, for example (I calculate the count per month), but nobody can guarantee that the new files will contain a whole month (and they certainly won't). So I want to keep a counter or something to resume where I left off.
I wanted to know if there's some way to do that, or whether this is not possible at all.
What I want to do: Calculate the most popular search queries for: past day, past 30 days, past 60 days, past 90 days, each calendar month, and for all time.
My raw data is a list of timestamped search queries, and I'm already running a nightly cron job for related data aggregation so I'd like to integrate this calculation into it. Reading through every query is fine (and as far as I can tell necessary) for a daily tally, but for the other time periods this is going to be an expensive calculation so I'm looking for a way to use my precounted data to save time.
What I don't want to do: Pull the records for every day in the period, sum all the tallies, sort the entire resulting list, and take the top X values. This is going to be inefficient, especially for the "all time" list.
I considered using heaps and binary trees to keep realtime sorts and/or access data faster, reading words off of each list in parallel and pushing their values into the heap with various constraints and ending conditions, but this always ruins either the lookup time or the sort time and I'm basically back to looking at everything.
I also thought about keeping running totals for each time period, adding the latest day and subtracting the earliest (saving monthly totals on the 1st of every month), but then I have to save complete counts for every time period every day (instead of just the top X) and I'm still looking through every record in the daily totals.
Is there any way to perform this faster, maybe using some other data structure or a fun mathematical property that I'm just not aware of? Also, in case anyone needs to know, this whole thing lives inside a Django project.
The short answer is No.
There is no guarantee that a Top-Ten-Of-Last-Year song was ever on a Top-Ten-Daily list (it's highly likely, but not guaranteed).
The only way to get an absolutely-for-sure Top Ten is to add up all the votes over the specified time period, then select the Top Ten.
You could use the Counter() class from the collections module, one of Python's high-performance container datatypes. It builds a dictionary with each search as a key and its frequency as the value.
from collections import Counter

cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)
Counter({'blue': 3, 'red': 2, 'green': 1})
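Since the goal is ultimately a top-X list per time period, note that Counter objects also support addition and most_common(), so if the precounted daily data is kept as (or converted to) Counters, longer windows can be built by merging them; a small sketch with made-up values:
from collections import Counter

day1 = Counter({'red': 2, 'blue': 3})      # hypothetical precounted daily tallies
day2 = Counter({'red': 5, 'green': 1})

last_two_days = day1 + day2                # merge the daily counts
print(last_two_days.most_common(2))        # [('red', 7), ('blue', 3)]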
I'm not sure if it fits with what you're doing, but if the data is stored via a Django model, you can avail yourself of aggregation to get the info in a single query.
Given:
from django.db import models

class SearchQuery(models.Model):
    query = models.CharField(max_length=255)  # CharField requires a max_length
    date = models.DateTimeField()
Then:
import datetime
from django.db.models import Count
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
days_ago_30 = today - datetime.timedelta(days=30)
...
top_yesterday = SearchQuery.objects.filter(date__range=(yesterday, today)).annotate(query_count=Count('query')).order_by('-query_count')
top_30_days = SearchQuery.objects.filter(date__range=(days_ago_30, today)).annotate(query_count=Count('query')).order_by('-query_count')
...
That's about as efficient as you'll get it done with Django's ORM, though it may not be the most efficient approach overall. However, doing things like adding an index on query will help a lot.
EDIT
It just occurred to me that you'll end up with dupes in the list with that. You can technically de-dupe the list after the fact, but if you're running Django 1.4+ with PostgreSQL as your database you can simply append .distinct('query') to the end of those querysets.
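Another way to avoid the dupes, regardless of the database backend, is to group on the query text before annotating (standard values() + annotate() aggregation, reusing the names from the snippet above):
top_30_days = (SearchQuery.objects
               .filter(date__range=(days_ago_30, today))
               .values('query')                        # group rows by the query text
               .annotate(query_count=Count('query'))
               .order_by('-query_count')[:10])         # top 10 for the window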
I have a question as to how I can perform this task in Python:
I have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
up to about 110,000 entries, with about 4,000 different longitude-latitude combinations.
I want to compute the average connections, average policy status, and average activity flag for each location,
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
and so on.
I have about 195 files with ~110,000 entries each (sort of a big data problem).
My files are .csv, but I'm reading them as .txt to work with them easily in Python (not sure if this is the best idea).
I'm still new to Python, so I'm not really sure what the best approach is, but I sincerely appreciate any help or guidance on this problem.
Thanks in advance!
No, if you have the files as .csv, treating them as plain text does not make sense, since Python ships with the excellent csv module.
You could read the csv rows into a dict to group them, but I'd suggest writing the data into a proper database and using SQL's AVG() and GROUP BY. Python ships with bindings for most databases; if you have none installed, consider using the built-in sqlite3 module.
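A minimal sketch of that route with sqlite3 (the table layout, column names, and file name are assumptions based on the sample entries above):
import csv
import sqlite3

conn = sqlite3.connect("entries.db")
conn.execute("""CREATE TABLE IF NOT EXISTS entries
                (ip TEXT, connections INTEGER, policystatus INTEGER,
                 activity INTEGER, longitude TEXT, latitude TEXT)""")

with open("data1.csv", newline="") as fh:      # repeat for each of the ~195 files
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)", csv.reader(fh))
conn.commit()

query = """SELECT longitude, latitude,
                  AVG(connections), AVG(policystatus), AVG(activity)
           FROM entries
           GROUP BY longitude, latitude"""
for row in conn.execute(query):
    print(row)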
I'll only give you the algorithm; you will learn more by writing the actual code yourself.
1. Use a dictionary, with the key as a pair of the form (longitude, latitude) and the value as a list of the form [ConnectionSum, policystatusSum, ActivityFlagSum, count].
2. Loop over the entries once:
a. for each entry, if the location already exists, add the connection, policy status and activity flag values to the existing sums and increment that location's count;
b. if the location does not exist yet, initialize it with the entry's values and a count of 1.
3. Do steps 1 and 2 for all files.
4. After all the entries have been scanned, loop over the dictionary and divide each of [ConnectionSum, policystatusSum, ActivityFlagSum] by that location's count to get the averages (see the sketch below).
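A minimal Python sketch of that algorithm (the file name and the column order are assumed from the sample entries above):
import csv

sums = {}  # (longitude, latitude) -> [connSum, policySum, activitySum, count]

for path in ["file001.csv"]:                   # replace with your ~195 file paths
    with open(path, newline="") as fh:
        for ip, conns, policy, activity, lon, lat in csv.reader(fh):
            entry = sums.setdefault((lon, lat), [0, 0, 0, 0])
            entry[0] += int(conns)
            entry[1] += int(policy)
            entry[2] += int(activity)
            entry[3] += 1

for (lon, lat), (c, p, a, n) in sums.items():
    print([lon, lat, c / n, p / n, a / n])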
As long as duplicate locations are restricted to the same file (or are even close to each other within a file), all you need is the stream-processing paradigm. For example, if you know that duplicate locations only appear within a single file, read each file, calculate the averages, then close the file. As long as you let the old data fall out of scope, the garbage collector will get rid of it for you. Basically do this:
def processFile(pathToFile):
    ...

totalResults = ...
for path in filePaths:
    partialResults = processFile(path)
    totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the O(1) method of calculating averages "on-line". If for example you are averaging 5,6,7, you would do 5/1=5.0, (5.0*1+6)/2=5.5, (5.5*2+7)/3=6. At each step, you only keep track of the current average and the number of elements. This solution will yield the minimal amount of memory used (no more than the size of your final result!), and doesn't care about which order you visit elements in. It would go something like this. See http://docs.python.org/library/csv.html for what functions you'll need in the CSV module.
import csv

def allTheRecords():
    for path in filePaths:                 # filePaths: your ~195 csv file paths
        with open(path, newline="") as fh:
            for row in csv.reader(fh):
                yield row                  # [ip, conns, policy, activity, longitude, latitude]

averages = {}  # keys are tuples (lat, long); values are dicts holding the running
               # averages and the element count: {'avgConn', 'avgPoli', 'avgActi', 'num'}

for ip, conns, poli, acti, longitude, latitude in allTheRecords():
    position = (latitude, longitude)
    cur = averages.get(position, {'avgConn': 0.0, 'avgPoli': 0.0, 'avgActi': 0.0, 'num': 0})
    n = cur['num'] + 1
    # the on-line update described above: newAvg = (oldAvg * oldCount + newValue) / newCount
    averages[position] = {'avgConn': (cur['avgConn'] * cur['num'] + int(conns)) / n,
                          'avgPoli': (cur['avgPoli'] * cur['num'] + int(poli)) / n,
                          'avgActi': (cur['avgActi'] * cur['num'] + int(acti)) / n,
                          'num': n}
(Do note that the notion of an "average at a location" is not, by itself, very useful: if you knew the exact location of every IP event to infinite precision, every location would appear only once and each "average" would just be the single value recorded there. The only reason you can compress your dataset at all is that your latitude and longitude have finite precision. If you run into this issue because you acquire more precise data, you can choose to round to an appropriate precision; rounding to within 10 meters or so may be reasonable (see latitude and longitude). This requires just a little bit of math/geometry.)
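If it comes to that, the rounding can be as simple as quantizing the coordinates before they are used as the dictionary key; the choice of 4 decimal places (roughly 11 m of latitude) is just an illustration to tune for your data:
# collapse nearby points to one location before the lookup
position = (round(float(latitude), 4), round(float(longitude), 4))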