Pyspark - Saving & using previously calculated values - python

I have a dataset of thousands of files and I read / treat them with PySpark.
First, I've created functions like the following one to treat the whole dataset and this is working great.
from pyspark.sql import Window, functions as F

def get_volume_spark(data):
    days = lambda i: i * 86400  # 86400 = 60 sec * 60 min * 24 h
    # 31-day rolling window per "name", ordered by the date cast to a Unix timestamp
    partition = Window.partitionBy("name").orderBy(F.col("date").cast("long")).rangeBetween(days(-31), days(0))
    data = data.withColumn("monthly_volume", F.count(F.col("op_id")).over(partition)) \
               .filter(F.col("monthly_volume") >= COUNT_THRESHOLD)
    return data
Every day new files arrive, and I want to treat ONLY the new files and append the results to the previously created output, instead of treating the whole dataset again (with more data every day), because that would take too long and those computations have already been done.
The other thing is that here I split by month, for example (I calculate the count per month), but nothing guarantees that the new files will contain a whole month (and they almost certainly won't). So I want to keep a counter or some saved state so I can resume where I left off.
I wanted to know if there's some way to do that, or whether it's not possible at all.
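Conceptually, what I picture is something like this rough sketch (the paths, the date, and the spark session here are only illustrative, not part of my actual code): keep a 31-day tail of already-processed rows as state, union it with the new day's files so the rolling window has enough history, append only the new results to the existing output, and refresh the state for the next run.

from pyspark.sql import functions as F

state = spark.read.parquet("/data/volume_state")             # 31-day tail kept from the previous run
new_data = spark.read.parquet("/data/incoming/2019-01-31")   # today's new files only

# Union the saved tail with the new rows so the 31-day window has enough history.
combined = state.unionByName(new_data)
result = get_volume_spark(combined)

# Append only the rows that come from the new files to the existing output.
(result.filter(F.col("date") >= F.lit("2019-01-31"))
       .write.mode("append").parquet("/data/volume_output"))

# Keep only the most recent 31 days as the state for tomorrow's run.
(combined.filter(F.col("date") >= F.date_sub(F.current_date(), 31))
         .write.mode("overwrite").parquet("/data/volume_state_new"))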

Related

subsetting very large files - python methods for optimal performance

I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new (index2) file that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1 except it only contains IDs from read1.
I am trying to implement the methods I've read here. But I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the associated 3 lines following the ID. 2) unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can some folks point me in some direction? Maybe none of those 5 methods are ideal for this. I don't know any information theory; we have plenty of RAM so I think holding the data in RAM for searching is the most efficient? I'm really not sure.
Here is a sample of what the index looks like (IDs start with #M00347):
#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
If the data of index1 can fit in memory, the best approach is to do a single scan of that file and store all the data in a dictionary like this:
{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0":["CCTAAGGTTCGG","+","CDDDDFFFFFCB"],
"#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0":["CGCCATGCATCC","+","BBCCBBFFFFFF"],
..... }
Values can be stored as formatted strings if you prefer.
After this, you can do a single scan over read1 and, when an ID is encountered, do a simple lookup in the dictionary to retrieve the needed data.
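A rough sketch of that single-scan approach (it assumes the file names from the question and that both files end cleanly on 4-line boundaries):

# One pass over index1: map each ID line to the three lines that follow it.
index = {}
with open("index1") as f:
    index_lines = f.read().splitlines()
for i in range(0, len(index_lines), 4):
    index[index_lines[i]] = index_lines[i + 1:i + 4]

# One pass over read1: look up each ID and write the matching 4-line record to index2.
with open("read1") as f, open("index2", "w") as out:
    read_lines = f.read().splitlines()
    for i in range(0, len(read_lines), 4):
        record_id = read_lines[i]
        if record_id in index:
            out.write(record_id + "\n")
            out.write("\n".join(index[record_id]) + "\n")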

Dataframe Generator Based On Conditions Pandas

I have manually created a bunch of dataframes to later concatenate back together based on a list of bigrams I have (my reason for doing this is out of the scope of this question). The problem is, I want to set this code to run daily or weekly, and the manually created dataframes will no longer work if the data has changed once refreshed. For instance, looking at the code below, what if "data_science," is no longer a bigram being pulled from my code next week, and I have another bigram like "hello_world," that is not listed below in my code? I need to set up one function that will do all of this for me. I am making about 50 dataframes from my real data, so even without the automation purposes it would be a huge time saver to get a function going for this. One KEY point to make is that I am grabbing all of these bigrams from a list and naming a dataframe for each one of them; my function below, with its list input, is what I am using that for.
data_science = df[df['column_name'].str.contains("data") &
                  df['column_name'].str.contains("science")]
data_science['bigram'] = "(data_science)"

p_value = df[df['column_name'].str.contains("p") &
             df['column_name'].str.contains("value")]
p_value['bigram'] = "(p_value)"

ab_testing = df[df['column_name'].str.contains("ab") &
                df['column_name'].str.contains("testing")]
ab_testing['bigram'] = "(ab_testing)"
I am trying something like this code below but have not figured out how to make it work yet.
def df_creator(df, my_list):
    frames = {}
    for a, b in my_list:
        subset = df[df['Message_stop'].str.contains(a) &
                    df['Message_stop'].str.contains(b)].copy()
        subset['bigram'] = f"({a}_{b})"
        frames[f"{a}_{b}"] = subset
    return frames
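Ideally I would then call it on my bigram list and concatenate the results back together, something like this (the bigram list here is just an example):

import pandas as pd

bigram_list = [("data", "science"), ("p", "value"), ("ab", "testing")]  # example bigrams

frames = df_creator(df, bigram_list)                       # one dataframe per bigram
combined = pd.concat(frames.values(), ignore_index=True)   # concatenate them back together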

Creating two lists from one randomly

I'm using pandas to import a lot of data from a CSV file, and once read, I format it to contain only numerical data. This returns a list of lists, numericalData[][], where each list contains around 140k data points.
From this list, I wish to create testing and training data. For my testing data, I want 30% of my read data numericalData, so I use the following bit of code:
testingAmount = len(numericalData0[0]) * trainingDataPercentage / 100
Works a treat. Then, I use numpy to select that amount of data from each column of my imported numericalData:
testingData.append(np.random.choice(numericalData[x], testingAmount) )
This then returns a sample with 38 columns (running in a loop), where each column has around 49k elements of data randomly selected from my imported numericalData.
The issue is, my trainingData needs to hold the other 70% of the data, but I'm unsure how to do this. I've tried comparing each element in my testingData and, if the elements weren't equal, adding the element to my trainingData; that resulted in an error and didn't work. Next, I tried to delete the selected testingData from my imported data and then save the remaining column to my trainingData; alas, that didn't work either.
I've only been working with python for the past week so I'm a bit lost on what to try now.
You can use random.shuffle and split the list after that. As a toy example:
import random

data = list(range(1, 11))  # shuffle needs a mutable sequence, so wrap range in list()
random.shuffle(data)
training = data[:5]
testing = data[5:]
To get more information, read the docs.
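For your actual case (a large table and a 30/70 split), you can apply the same shuffle-and-split idea to the row indices with numpy, so the two sets can never overlap. A sketch, with an illustrative array shape:

import numpy as np

numericalData = np.random.rand(140000, 38)            # placeholder for the imported data

indices = np.random.permutation(len(numericalData))   # shuffle the row indices once
split = int(len(numericalData) * 30 / 100)            # 30% of the rows for testing

testingData = numericalData[indices[:split]]
trainingData = numericalData[indices[split:]]         # the remaining 70%, with no overlap

Because both sets are sliced from one shuffled index, every row ends up in exactly one of them, which avoids the element-by-element comparison that failed.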

influxdb: Write multiple points vs single point multiple times

I'm using influxdb in my project and I'm facing an issue with queries when multiple points are written at once.
I'm using influxdb-python to write 1000 unique points to influxdb.
In the influxdb-python there is a function called influxclient.write_points()
I have two options now:
Write each point once every time (1000 times) or
Consolidate 1000 points and write all the points once.
The first option's code looks like this (pseudo code only) and it works:
thousand_points = [0...999]
i = 0
while i < 1000:
    ...
    ...
    point = [{thousand_points[i]}]  # a point must be converted to a dictionary object first
    influxclient.write_points(point, time_precision="ms")
    i += 1
After writing all the points, when I write a query like this:
SELECT * FROM "mydb"
I get all the 1000 points.
To avoid the overhead added by every write in every iteration, I felt like exploring writing multiple points at once, which is supported by the write_points function.
write_points(points, time_precision=None, database=None,
             retention_policy=None, tags=None, batch_size=None)
Write to multiple time series names.
Parameters: points (list of dictionaries, each dictionary represents a point) – the list of points to be written in the database
So, what I did was:
thousand_points = [0...999]
points = []
i = 0
while i < 1000:
    ...
    ...
    points.append({thousand_points[i]})  # a point must be converted to a dictionary object first
    i += 1
influxclient.write_points(points, time_precision="ms")
With this change, when I query:
SELECT * FROM "mydb"
I only get 1 point as the result. I don't understand why.
Any help will be much appreciated.
You might have a good case for a SeriesHelper.
In essence, you set up a SeriesHelper class in advance, and every time you discover a data point to add, you make a call. The SeriesHelper will batch up the writes for you, up to bulk_size points per write.
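For reference, a minimal sketch of the documented SeriesHelper pattern (the database, measurement, field, and tag names here are illustrative):

from influxdb import InfluxDBClient, SeriesHelper

myclient = InfluxDBClient(host='localhost', port=8086, database='mydb')
thousand_points = range(1000)     # stand-in for the question's data

class MySeriesHelper(SeriesHelper):
    class Meta:
        client = myclient              # client used for the actual writes
        series_name = 'my_measurement'
        fields = ['value']
        tags = ['point_id']
        bulk_size = 100                # flush to InfluxDB every 100 points
        autocommit = True

for i, value in enumerate(thousand_points):
    MySeriesHelper(point_id=str(i), value=value)

MySeriesHelper.commit()  # flush whatever is still buffered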
I know this was asked well over a year ago; however, in order to publish multiple data points in bulk to InfluxDB, it seems each data point needs to have a unique timestamp, otherwise it will just be continuously overwritten.
I'd import datetime and add the following to each data point within the for loop:
'time': datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
So each datapoint should look something like...
{'fields': data, 'measurement': measurement, 'time': datetime....}
Hope this is helpful for anybody else who runs into this!
Edit: Reading the docs shows that another unique identifier is a tag, so you could instead include {'tag' : i} (supposedly each iteration value is unique) if you don't wish to specify the time. (I haven't tried this, however.)
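Putting that together, a sketch of the bulk write with a unique time (and a unique tag) per point; the measurement name and connection settings are illustrative:

import datetime
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='mydb')
thousand_points = range(1000)                  # stand-in for the question's data
base = datetime.datetime.utcnow()

points = []
for i, value in enumerate(thousand_points):
    points.append({
        'measurement': 'my_measurement',
        'tags': {'point_id': str(i)},          # unique tag per point
        'time': (base + datetime.timedelta(milliseconds=i)).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),  # unique timestamp
        'fields': {'value': value},
    })

client.write_points(points)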

look ahead time analysis in R (data mining algorithm)

I have a file (dozens of columns and millions of rows) that essentially looks like this:
customerID VARCHAR(11)
accountID VARCHAR(11)
snapshotDate Date
isOpen Boolean
...
One record in the file might look like this:
1,100,200901,1,...
1,100,200902,1,...
1,100,200903,1,...
1,100,200904,1,...
1,100,200905,1,...
1,100,200906,1,...
...
1,100,201504,1,...
1,100,201505,1,...
1,100,201506,1,...
When an account is closed, two things can happen. Typically, no further snapshots for that record will exist in the data. Occasionally, further records will continue to be added but the isOpen flag will be set to 0.
I want to add an additional Boolean column, called "closedInYr", that has a 0 value UNLESS THE ACCOUNT CLOSES WITHIN ONE YEAR AFTER THE SNAPSHOT DATE.
My solution is slow and gross. It takes each record, counts forward in time 12 months, and if it finds a record with the same customerID, accountID, and isOpen set to 1, it populates the record with a 0 in the "closedInYr" field, otherwise it populates the field with a 1. It works, but the performance is not acceptable, and we have a number of these kinds of files to process.
Any ideas on how to implement this? I use R, but am willing to code in Perl, Python, or practically anything except COBOL or VB.
Thanks
I suggest using the Linux "date" command to convert the dates to Unix timestamps.
A Unix timestamp is the number of seconds elapsed since 1 January 1970, so a year is roughly 60s*60m*24h*365d seconds. If the difference between two timestamps is greater than that number, then they are more than a year apart.
It will be something like this:
>date --date='201106' "+%s"
1604642400
So if you use Perl, which is a pretty cool file-handling language, you can parse your whole file in a few lines and shell out to your date command.
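The same idea works in Python too, if you prefer (the helper name is just illustrative; the YYYYMM format follows the question):

import datetime

def to_unix(yyyymm):
    # Convert a YYYYMM snapshot date to a Unix timestamp.
    return datetime.datetime.strptime(yyyymm, "%Y%m").timestamp()

ONE_YEAR = 60 * 60 * 24 * 365  # seconds in a (non-leap) year

# Two snapshots are less than a year apart if their timestamps differ by less than ONE_YEAR.
print(to_unix("201204") - to_unix("201106") < ONE_YEAR)  # True: about ten months apart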
If all the snapshots for a given record appear in one row, and records that were open for the same period of time have the same length (i.e., snapshots were taken at regular intervals), then one possibility might be filtering based on row lengths. If the longest open row has length N and one year's worth of records is M, then you know a row of length N-M was open, at longest, one year less than the longest one. That approach doesn't handle the case where snapshots keep getting added with the open flag set to 0, but it might at least cut down the number of searches that need to be made per row.
At least, that's an idea. More generally, searching from the end to find the last year where isOpen == 1 might cut the search down a little...
Of course, this all assumes each record is in one row. If not, maybe a melt is in order first?
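As a rough pandas sketch of that search-from-the-end idea (it assumes YYYYMM snapshot dates and the column names from the question, and treats an account's last isOpen == 1 snapshot as its effective close date; accounts still open at the end of the file would need an extra check):

import pandas as pd

# Tiny stand-in for the real file; in practice this would come from pd.read_csv.
df = pd.DataFrame({
    "customerID":   [1, 1, 1, 1],
    "accountID":    [100, 100, 100, 100],
    "snapshotDate": ["200901", "200902", "201505", "201506"],
    "isOpen":       [1, 1, 1, 1],
})
df["snapshotDate"] = pd.to_datetime(df["snapshotDate"], format="%Y%m")

# Last snapshot date at which each account was still open (its effective close date).
last_open = (df[df["isOpen"] == 1]
             .groupby(["customerID", "accountID"])["snapshotDate"]
             .max()
             .rename("lastOpenDate")
             .reset_index())
df = df.merge(last_open, on=["customerID", "accountID"], how="left")

# closedInYr = 1 when no open snapshot exists more than 12 months after this one.
df["closedInYr"] = (
    df["lastOpenDate"] <= df["snapshotDate"] + pd.DateOffset(months=12)
).astype(int)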
