I am trying to read data from a list, but there are some inconsistencies, so I want to store only the data that shows periodicity, assuming all the inconsistencies have the same format.
We expect a datetime object every 12 items, but some days have less data, and I am not interested in those dates (for simplicity's sake). When a date has missing data, I think it only has 6 elements instead of 11. The dates and all their data are items of a single list. So I'm trying to store the indices of the dates that don't follow the described pattern (at the position where the next date should be, the element we see is not a date).
I'm trying to do this using recursion, but every time I run the function I have created, the kernel restarts.
I cannot link the data of clean_values because AEMET OpenData deletes the requested data after about five minutes.
import datetime as dt

tbe = []

def recursive(x, clean_values):
    if 0 <= x < len(clean_values):
        # Scan from x onwards; starting the range at 0 on every call re-checks
        # the whole list from the beginning and never terminates, which is what
        # blows the stack and restarts the kernel.
        for i in range(x, len(clean_values), 12):
            if type(clean_values[i]) == dt.datetime:  # This position holds a date, as expected
                pass
            else:
                tbe.append(i - 12)  # Store the previous date (the one with the missing data)
                recursive(i - 6, clean_values)  # Restart at the position where we think the date actually is
                break
    return

recursive(0, clean_values)
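Since the recursion only exists to jump back to where the next date should be, the same scan can also be written as a plain loop, avoiding a deep call stack altogether. This is just a sketch under the same assumptions (a datetime every 12 items, and the i - 6 jump taken from the code above); find_broken_dates is a hypothetical helper name:

import datetime as dt

def find_broken_dates(clean_values):
    # Indices of the dates whose block doesn't have the expected 12 items
    tbe = []
    i = 0
    while 0 <= i < len(clean_values):
        if isinstance(clean_values[i], dt.datetime):
            i += 12  # Well-formed block: the next date should be 12 items ahead
        else:
            tbe.append(i - 12)  # The previous date is the one with missing data
            i -= 6              # Jump to where we think the misplaced date actually is
    return tbe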
Sorry I cannot provide more information
I have a dataset of thousands of files and I read/process them with PySpark.
First, I created functions like the following one to process the whole dataset, and this works great.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def get_volume_spark(data):
    days = lambda i: i * 86400  # 86400 = 60 sec * 60 min * 24 h, i.e. seconds per day
    # Rolling 31-day window per "name", ordered by the date cast to epoch seconds
    partition = Window.partitionBy("name") \
                      .orderBy(F.col("date").cast("long")) \
                      .rangeBetween(days(-31), days(0))
    data = data.withColumn("monthly_volume", F.count(F.col("op_id")).over(partition)) \
               .filter(F.col("monthly_volume") >= COUNT_THRESHOLD)
    return data
Every day new files arrive, and I want to process ONLY the new files and append the results to the output produced on the first run, instead of re-processing the whole dataset with more data every day, because that would take too long and the earlier work has already been done.
The other thing is that here I split by month, for example (I calculate the count per month), but nobody can guarantee that the new files will contain a whole month (and they certainly won't). So I want to keep a counter or something to resume where I left off.
I wanted to know if there's some way to do that, or if it's not possible at all.
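Something like the following incremental flow is what I have in mind; this is only a sketch, and the paths, the way new files are listed, and the Parquet output location are placeholder assumptions (it also ignores the partial-month problem described above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder: some way of listing only the files that arrived since the last run,
# e.g. by comparing against a checkpoint of already-processed file names.
new_files = ["/data/incoming/2020-01-02.csv"]

new_data = spark.read.csv(new_files, header=True, inferSchema=True)
new_data = get_volume_spark(new_data)

# Append the freshly processed rows to the existing output instead of rewriting everything.
new_data.write.mode("append").parquet("/data/output/volumes")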
I'm using InfluxDB in my project, and I'm facing an issue with queries when multiple points are written at once.
I'm using influxdb-python to write 1000 unique points to influxdb.
In the influxdb-python there is a function called influxclient.write_points()
I have two options now:
Write each point once every time (1000 times) or
Consolidate 1000 points and write all the points once.
The first option's code looks like this (pseudo code only), and it works:
thousand_points = [0...999]
while i < 1000:
    ...
    ...
    point = [{thousand_points[i]}]  # A point must be converted to a dictionary object first
    influxclient.write_points(point, time_precision="ms")
    i += 1
After writing all the points, when I write a query like this:
SELECT * FROM "mydb"
I get all the 1000 points.
To avoid the overhead added by a separate write on every iteration, I wanted to explore writing multiple points at once, which is supported by the write_points function.
write_points(points, time_precision=None, database=None,
retention_policy=None, tags=None, batch_size=None)
Write to multiple time series names.
Parameters: points (list of dictionaries, each dictionary represents
a point) – the list of points to be written in the database
So, what I did was:
thousand_points = [0...999]
points = []
while i < 1000:
    ...
    ...
    points.append({thousand_points[i]})  # A point must be converted to a dictionary object first
    i += 1
influxclient.write_points(points, time_precision="ms")
With this change, when I query:
SELECT * FROM "mydb"
I only get 1 point as the result. I don't understand why.
Any help will be much appreciated.
You might have a good case for a SeriesHelper
In essence, you set up a SeriesHelper class in advance, and every time you discover a data point to add, you make a call. The SeriesHelper will batch up the writes for you, up to bulk_size points per write.
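A minimal sketch of what that might look like; the connection details, measurement name, field, and tag are placeholder assumptions:

from influxdb import InfluxDBClient, SeriesHelper

db_client = InfluxDBClient(host="localhost", port=8086, database="mydb")

class MySeriesHelper(SeriesHelper):
    class Meta:
        client = db_client
        series_name = "my_measurement"  # placeholder measurement name
        fields = ["value"]
        tags = ["point_id"]
        bulk_size = 100                 # flush to InfluxDB every 100 points
        autocommit = True

for i, value in enumerate(thousand_points):
    MySeriesHelper(point_id=str(i), value=value)

MySeriesHelper.commit()  # push any points still sitting in the batch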
I know this was asked well over a year ago; however, in order to publish multiple data points in bulk to InfluxDB, it seems each data point needs to have a unique timestamp, otherwise it will just be continuously overwritten.
I'd import a datetime and add the following to each datapoint within the for loop:
'time': datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
So each datapoint should look something like...
{'fields': data, 'measurement': measurement, 'time': datetime....}
Hope this is helpful for anybody else who runs into this!
Edit: Reading the docs shows that another unique identifier is a tag, so you could instead include {'tag': i} (supposedly each iteration value is unique) if you don't wish to specify the time. (However, I haven't tried this.)
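Putting the two suggestions together, the loop from the question might look roughly like this; the measurement and field names are placeholders:

import datetime

points = []
for i, value in enumerate(thousand_points):
    points.append({
        "measurement": "my_measurement",           # placeholder name
        "tags": {"point_id": str(i)},              # unique tag, so identical timestamps don't collide
        "time": datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "fields": {"value": value},
    })

influxclient.write_points(points, time_precision="ms")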
I have a file (dozens of columns and millions of rows) that essentially looks like this:
customerID VARCHAR(11)
accountID VARCHAR(11)
snapshotDate Date
isOpen Boolean
...
One record in the file might look like this:
1,100,200901,1,...
1,100,200902,1,...
1,100,200903,1,...
1,100,200904,1,...
1,100,200905,1,...
1,100,200906,1,...
...
1,100,201504,1,...
1,100,201505,1,...
1,100,201506,1,...
When an account is closed, two things can happen. Typically, no further snapshots for that record will exist in the data. Occasionally, further records will continue to be added but the isOpen flag will be set to 0.
I want to add an additional Boolean column, called "closedInYr", that has a 0 value UNLESS THE ACCOUNT CLOSES WITHIN ONE YEAR AFTER THE SNAPSHOT DATE.
My solution is slow and gross. It takes each record, counts forward in time 12 months, and if it finds a record with the same customerID, accountID, and isOpen set to 1, it populates the record with a 0 in the "closedInYr" field, otherwise it populates the field with a 1. It works, but the performance is not acceptable, and we have a number of these kinds of files to process.
Any ideas on how to implement this? I use R, but am willing to code in Perl, Python, or practically anything except COBOL or VB.
Thanks
I suggest using the Linux "date" command to convert the dates to Unix timestamps.
A Unix timestamp is the number of seconds elapsed since 1 January 1970, so a year is basically 60s*60m*24h*365d seconds. If the difference between two timestamps is larger than that number, then they are more than a year apart.
It will be something like this:
>date --date='201106' "+%s"
1604642400
So if you use Perl, which is a pretty good file-handling language, you can parse your whole file in a few lines and run your date command from the script.
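The same comparison could also be done directly in Python (sticking to one language for the sketches here); the "%Y%m" format matches the snapshotDate values in the sample data:

import datetime

SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def to_epoch(yyyymm):
    # "201106" -> Unix timestamp for the first day of that month
    return datetime.datetime.strptime(yyyymm, "%Y%m").timestamp()

# True when the two snapshots are more than a year apart
more_than_a_year = (to_epoch("201506") - to_epoch("201106")) > SECONDS_PER_YEAR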
If all the snapshots for a given record appear in one row, and records that were open for the same period of time have the same length (i.e., snapshots were taken at regular intervals), then one possibility might be to filter based on row lengths. If the longest open row has length N and one year's worth of records is M, then you know a row of length N-M was open, at most, one year less than the longest. That approach doesn't handle the case where snapshots keep getting added with the isOpen flag set to 0, but it might at least cut the work down by reducing the number of searches that need to be made per row.
At least, that's an idea. More generally, searching from the end to find the last year where isOpen == 1 might cut the search down a little...
Of course, this all assumes each record is in one row. If not, maybe a melt is in order first?
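To illustrate the idea of searching from the end for the last isOpen == 1 snapshot, here is a rough pandas sketch in the long format the question describes; the file name and the 365-day cutoff are assumptions, and snapshots of accounts that are still open at the end of the data would need separate handling:

import pandas as pd

df = pd.read_csv("snapshots.csv",
                 dtype={"customerID": str, "accountID": str, "snapshotDate": str})
df["snapshotDate"] = pd.to_datetime(df["snapshotDate"], format="%Y%m")

# Last snapshot on which each account was still open
last_open = (
    df[df["isOpen"] == 1]
    .groupby(["customerID", "accountID"], as_index=False)["snapshotDate"]
    .max()
    .rename(columns={"snapshotDate": "lastOpenDate"})
)
df = df.merge(last_open, on=["customerID", "accountID"], how="left")

# closedInYr = 1 when the account's last open snapshot falls within a year of this one
df["closedInYr"] = (
    (df["lastOpenDate"] - df["snapshotDate"]) <= pd.Timedelta(days=365)
).astype(int)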
My question is: when I ran this code for the first time, it gave me the results correctly, i.e. in the format 2013-01-23.
But when I tried running this code the next time, I did not get the correct result (the output was 23/01/2013).
Why is it different the second time?
from pandas import *

fec1 = read_csv("/user_home/w_andalib_dvpy/sample_data/sample.csv")

def convert_date(val):
    d, m, y = val.split('/')
    return datetime(int(y), int(m), int(d))

# FECHA is the date column name in the raw file. Format: 23/01/2013
fec1.FECHA.map(convert_date)
fec1.FECHA
Parsing dates with pandas can be done at the time you read the csv by passing parse_dates=['yourdatecolumn'] and date_parser=convert_date to the pandas.read_csv method.
Doing it this way is a much faster operation than loading the data, then parsing the dates.
The reason you get different outputs when you do the same operation twice is probably that when you parse the dates, you take D/M/Y as input but produce Y/M/D as output; it basically flips the D and Y every time.
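A minimal sketch of what that looks like with the column name from the question; this keeps the asker's convert_date logic, and date_parser falls back to calling the function once per value if it cannot handle a whole array:

import pandas as pd

def convert_date(val):
    # FECHA arrives as "23/01/2013" (D/M/Y)
    d, m, y = val.split('/')
    return pd.Timestamp(int(y), int(m), int(d))

fec1 = pd.read_csv(
    "/user_home/w_andalib_dvpy/sample_data/sample.csv",
    parse_dates=["FECHA"],
    date_parser=convert_date,
)
fec1.FECHA.head()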
What I want to do: Calculate the most popular search queries for: past day, past 30 days, past 60 days, past 90 days, each calendar month, and for all time.
My raw data is a list of timestamped search queries, and I'm already running a nightly cron job for related data aggregation so I'd like to integrate this calculation into it. Reading through every query is fine (and as far as I can tell necessary) for a daily tally, but for the other time periods this is going to be an expensive calculation so I'm looking for a way to use my precounted data to save time.
What I don't want to do: Pull the records for every day in the period, sum all the tallies, sort the entire resulting list, and take the top X values. This is going to be inefficient, especially for the "all time" list.
I considered using heaps and binary trees to keep realtime sorts and/or access data faster, reading words off of each list in parallel and pushing their values into the heap with various constraints and ending conditions, but this always ruins either the lookup time or the sort time and I'm basically back to looking at everything.
I also thought about keeping running totals for each time period, adding the latest day and subtracting the earliest (saving monthly totals on the 1st of every month), but then I have to save complete counts for every time period every day (instead of just the top X) and I'm still looking through every record in the daily totals.
Is there any way to perform this faster, maybe using some other data structure or a fun mathematical property that I'm just not aware of? Also, in case anyone needs to know, this whole thing lives inside a Django project.
The short answer is No.
There is no guarantee that a Top-Ten-Of-Last-Year song was ever on a Top-Ten-Daily list (it's highly likely, but not guaranteed).
The only way to get an absolutely-for-sure Top Ten is to add up all the votes over the specified time period, then select the Top Ten.
You could use the Counter() class, part of the high-performance container datatypes in the collections module. Create a dictionary with all the searches as keys and a count of their frequency as the values.
from collections import Counter

cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)
# Counter({'blue': 3, 'red': 2, 'green': 1})
I'm not sure if it fits with what you're doing, but if the data is stored via a Django model, you can avail yourself of aggregation to get the info in a single query.
Given:
from django.db import models

class SearchQuery(models.Model):
    query = models.CharField(max_length=255)  # max_length is required for CharField
    date = models.DateTimeField()
Then:
import datetime
from django.db.models import Count

today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
days_ago_30 = today - datetime.timedelta(days=30)
...

top_yesterday = (SearchQuery.objects
                 .filter(date__range=(yesterday, today))
                 .annotate(query_count=Count('query'))
                 .order_by('-query_count'))

top_30_days = (SearchQuery.objects
               .filter(date__range=(days_ago_30, today))
               .annotate(query_count=Count('query'))
               .order_by('-query_count'))
...
That's about as efficiently as you'll get it done with Django's ORM, but it may not necessarily be the most efficient approach overall. However, doing things like adding an index on query will help a lot.
EDIT
It just occurred to me that you'll end up with dupes in the list with that. You can technically de-dupe the list after the fact, but if you're running Django 1.4+ with PostgreSQL as your database, you can simply append .distinct('query') to the end of those querysets.