I have been reading documentation for a few hours now and I feel I am approaching the problem with the wrong mindset.
I have two tables in Hive, which I read with spark.table(table_A), with the same number and types of columns but different origins, so their data is different. Both tables contain flags that show whether or not a condition is met. There are around 20 columns, at least, and the count could increase in the future.
If table_A's first row is 0 0 1 1 0 and table_B's is 0 1 0 1 0, I would like the result to be the XNOR of the two, compared position by position: 1 0 0 1 1, since they have the same values in the first, fourth and fifth positions.
So I thought of the XNOR operation, which returns 1 if both values match and 0 otherwise.
I am facing a number of problems, one of them being the volume of my data (right now I am working with a sample of one week and it is already around 300 MB), so I am working with PySpark and avoiding pandas, since the data usually does not fit in memory and/or slows the operation down a lot.
Summing up, I have two objects of type pyspark.sql.dataframe.DataFrame, each has one of the tables, and so far the best I've got is something like this:
df_bitwise = df_flags_A.flag_column_A.bitwiseXOR(df_flags_B.flag_columns_B)
But sadly this returns a pyspark.sql.column.Column, and I do not know how to read that result or how to build a dataframe with it (I would like the end result to be roughly 20 of the above operations, one per column, each forming a column of a dataframe).
What am I doing wrong? I feel like this is not the right approach.
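For what it's worth, one way to get a whole XNOR dataframe (rather than a single Column) is to join the two tables on a shared key and build every flag column in one select. This is only a sketch, assuming a hypothetical key column called id that aligns the rows of the two tables; if no such key exists, one would have to be created first, since row order alone is not reliable in Spark:

from pyspark.sql import functions as F

flag_cols = [c for c in df_flags_A.columns if c != "id"]

# Suffix the flag columns so they do not collide after the join
a = df_flags_A.select("id", *[F.col(c).alias(c + "_a") for c in flag_cols])
b = df_flags_B.select("id", *[F.col(c).alias(c + "_b") for c in flag_cols])
joined = a.join(b, on="id")

# XNOR per column: 1 when both flags are equal, 0 otherwise
df_xnor = joined.select(
    "id",
    *[F.when(F.col(c + "_a") == F.col(c + "_b"), 1).otherwise(0).alias(c)
      for c in flag_cols]
)

The result is a regular DataFrame with one XNOR column per original flag column, which can be shown or written back to Hive.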
Related
I have a data frame df and I would like to keep a running total of the names that occur in one of its columns. I am trying to calculate the running total column:
name  running total
a     1
a     2
b     1
a     3
c     1
b     2
There are two ways I thought to do this:
Loop through the dataframe and use a separate dictionary containing name and current count. The current count for the relevant name would increase by 1 each time the loop is carried out, and that value would be copied into my dataframe.
Fill in the count field for each value in the dataframe. In Excel I would use COUNTIF combined with a drag-down formula (A$1:A1, fixing the first cell but leaving the second relative) so that the range I am looking in changes with the row.
The problem is I am not sure how to implement these. Does anyone have any ideas on which is preferable and how these could be implemented?
@bunji is right. I'm assuming you're using pandas and that your data is in a dataframe called df. To add the running totals to your dataframe, you could do something like this:
df['running total'] = df.groupby(['name']).cumcount() + 1
The + 1 gives you a 1 for your first occurrence instead of 0, which is what you would get otherwise.
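To see it end to end, here is a minimal, self-contained version of that one-liner using the example names from the question:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b', 'a', 'c', 'b']})
df['running total'] = df.groupby(['name']).cumcount() + 1
print(df)
#   name  running total
# 0    a              1
# 1    a              2
# 2    b              1
# 3    a              3
# 4    c              1
# 5    b              2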
I am not generally a Python user. I do most things in R and Stata. However, I cannot find a good semantic similarity package/API in either of those.
I have two data frames in the environment. One is questions, which consists of 2 columns and 3 rows. The other is results, which has 3 columns and 3 rows.
I am trying to compare each question (individually) in the first column of the questions dataframe to all of the questions in the second column. Then I want the output to populate the results dataframe. The function takes two strings as arguments. So far my code looks like this:
for i in range(1, 3):
    results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
I assume that I am pointing to the data incorrectly, but I really haven't been able to find good documentation about this seemingly straightforward issue.
By comparison, here is my working R code, which uses a slightly different function and only one dataframe.
for (i in 1:3){
    df[,i+2] <- levenshteinSim(df$yr1questions[i], df$yr2questions)
}
Any help would be greatly appreciated! I am trying to come up with proof-of-concept code to compare similar survey questions between years based on semantic meaning.
Bob
Let's try to compare (multiply) every question A to every question B:
import numpy as np
import pandas as pd

questions = pd.DataFrame(np.arange(6).reshape(3,2), columns=["question_A", "question_B"])
This gives:
   question_A  question_B
0           0           1
1           2           3
2           4           5
Then let's define a compare function:
def compare(row):
    return pd.Series(row[0] * questions['question_B'])

results = questions.apply(compare, axis=1)
That gives us:
   0   1   2
0  0   0   0
1  2   6  10
2  4  12  20
As you pointed out in the comments, here is a version comparing only two strings at a time:
def compare(row):
    question_a = row[0]
    return pd.Series([liteClient.compare(question_a, question_b) for question_b in questions['question_B']])
Based on what you've posted so far, here are some issues with what you've written, which are understandable given your R background:
for i in range(1, 3):
In Python 3.x, this creates a range object, which you can think of as a special kind of function (though it is really an object with iteration properties) that generates a sequence of numbers with a given step size (default 1), where the end value is exclusive. Additionally, you need to know that most programming languages, including Python, index starting at zero, not one.
What this range object does here is generate the sequence 1, 2 and that is it, so the arrays you are using i to index into will not be covered over all their indices. What I believe you want is something like:
for i in range(3):
Notice there is only one number here: it is the exclusive maximum of the range, with 0 as the implicit inclusive minimum, so this generates the sequence 0, 1, 2. If you have an array of size 3, these are all of its valid indices.
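A quick check in the interpreter illustrates the difference:

list(range(1, 3))   # [1, 2]
list(range(3))      # [0, 1, 2]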
This next line is a bit confusing to me, since I'm not familiar with R, but I sort of understand what you were trying to do. If I understand correctly, you are trying to compare two columns of 3 questions each, comparing each question in column 1 to the questions in column 2, resulting in a 3 x 3 matrix of comparison results which you are trying to store in results. Assuming the sizes are already correct (i.e. results is 3x3), I'd like to explain some peculiarities I see in this code.
results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
With results.iloc[i-1,i] you are indexing by row and column: i-1 is the row and i is the column. So without changing range(1,3), only the cells (0,1) and (1,2) are ever assigned. I believe liteClient.compare(...) is supposed to return either a 1x3 dataframe or a list of size 3, based on what you were trying to do with it; this may not be the case, however, since I'm not sure what object you are calling that member function on or where its documentation lives. Assuming it does return a list of size 3 or a dataframe, you'll need to change the way you assign the data, to this:
results.iloc[i,:] = ...
What is happening here is that iloc(...) takes a row positional argument and a slice positional argument: you are assigning all the columns of the results matrix at that row to the values returned by compare. With the change to the for statement, this will iterate over all rows of the dataframe.
liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
As this line currently stands, you are iterating over each column of the first row of questions and comparing each of them to all rows of the second column of questions.
I believe what you will want to do is change this to:
liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
What this does is, for each i (0, 1, 2), take the question at row i, column 0 and compare it to every row in column 1. If your questions dataframe is actually organized as 2 columns and 3 rows, this should work; otherwise you will need to change how you create questions as well.
In all, I believe the fixed program should look something like:
for i in range(3):
    results.iloc[i,:] = liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
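For completeness, here is a self-contained sketch of that fixed loop. Since liteClient is not available here, difflib.SequenceMatcher is used as a hypothetical stand-in for liteClient.compare, and the question texts are invented:

import difflib
import pandas as pd

questions = pd.DataFrame({
    "yr1questions": ["How old are you?", "Where do you live?", "What is your job?"],
    "yr2questions": ["What is your age?", "Where do you reside?", "What do you do?"],
})
results = pd.DataFrame(index=range(3), columns=range(3), dtype=float)

def fake_compare(a, b):
    # Stand-in similarity score between 0 and 1
    return difflib.SequenceMatcher(None, a, b).ratio()

for i in range(3):
    results.iloc[i, :] = [fake_compare(questions.iloc[i, 0], q)
                          for q in questions.iloc[:, 1]]
print(results)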
I have a decent-sized dataframe (roughly df.shape = (4000000, 2000)) that I want to run .groupby().max() on. Unfortunately, neither my laptop nor the server I have access to can do this without throwing a MemoryError (the laptop has 16 GB of RAM, the server 64 GB). This is likely due to the datatypes in a lot of the columns. Right now, I'm treating those as fixed and immutable (many, many dates, large integers, etc.), though perhaps that could be part of the solution.
The command I would like is simply new_df = df.groupby('KEY').max().
What is the most efficient way to break down this problem to prevent running into memory problems? Some things I've tried, to varying success:
Break the df into subsets, run .groupby().max() on those subsets, then concatenate (see the sketch after the edit below). Issue: the size of the full df can vary and is likely to grow. I'm not sure of the best way to break the df apart so that the subsets are guaranteed not to throw the MemoryError.
Include only a subset of columns on which to run the .groupby() in a new df, then merge this with the original. Issue: the number of columns in this subset can vary (much smaller or larger than the current count), though the names of the columns all share the prefix ind_.
Look for out-of-memory management tools. As of yet, I've not found anything useful.
Thanks for any insight or help you can provide.
EDIT TO ADD INFO
The data is for a predictive modeling exercise, and the number of columns stems from making binary indicator variables from a column of discrete (non-continuous/categorical) values. The df will grow in column size if another column of values goes through the same process. Also, the data is originally pulled from a SQL query; the set of items the query finds is likely to grow over time, meaning the number of rows will grow, and (since the number of distinct values in one or more columns may grow) so will the number of columns after making the binary indicator variables. The data pulled goes through extensive transformation and gets merged with other datasets, making it implausible to run the grouping in the database itself.
There are repeated observations by KEY, which have the same values except for the column I turn into indicators (this is just to show shape/value samples; the actual df has dates, integers 16 digits or longer, etc.):
KEY  Col1  Col2   Col3
A    1     blue   car
A    1     blue   bike
A    1     blue   train
B    2     green  car
B    2     green  plane
B    2     green  bike
This should become, with dummied Col3:
KEY  Col1  Col2   ind_car  ind_bike  ind_train  ind_plane
A    1     blue   1        1         1          0
B    2     green  1        1         0          1
So the .groupby('KEY') gets me the groups, and the .max() gets the new indicator columns with the right values. I know the .max() step might be bogged down by computing a "max" for a string or date column.
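A minimal sketch of the chunked approach from the first bullet above; the chunk size is an arbitrary illustration, and because the max is taken per chunk and then again over the partial results, keys that straddle chunk boundaries are still handled correctly:

import pandas as pd

def grouped_max_in_chunks(df, key='KEY', chunk_rows=500_000):
    partials = []
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        partials.append(chunk.groupby(key).max())
    # A key can appear in several chunks, so reduce the partial maxima again
    return pd.concat(partials).groupby(level=0).max()

# new_df = grouped_max_in_chunks(df)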
I am querying a database for a few variables from an experiment, one at a time, and storing the data in a pandas DataFrame. I can get the data that I need; it looks like this, for instance:
   file     time  variableid  data
0     1  1503657           1    11
1     1  1503757           1    22
There is data for several variables that I will be grabbing like this, and then I will combine them into a single DataFrame to be written out to a CSV. Each variable's data column will be added as a new column named after that variable (the file_id should always be the same). The time column values might differ (one DF could be longer than another, the data wasn't all sampled at the same times, etc.), but if I merge the tables on the time (and file) columns, any discrepancies are filled in with NaN (which I will fill with DF.fillna(0)), and the DF can be re-sorted by time.
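A small sketch of that merge step, with made-up variable names and values just to show the shape:

import pandas as pd

var1 = pd.DataFrame({'file': [1, 1], 'time': [1503657, 1503757], 'var1': [11, 22]})
var2 = pd.DataFrame({'file': [1, 1], 'time': [1503657, 1503800], 'var2': [5, 7]})

# Outer merge keeps every timestamp from both frames; gaps become NaN,
# which are then zero-filled and the result re-sorted by time
combined = (var1.merge(var2, on=['file', 'time'], how='outer')
                .fillna(0)
                .sort_values('time'))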
What I need, though, is a way to filter the data so that it fits a certain rate, such as every 100 milliseconds (1503700, 1503800, ...). The datapoint itself doesn't have to fit that rate exactly (and in fact the data rarely falls on a time ending in 00), but it should be the closest matching data for that time (it could be the closest before or after that time, as long as it is consistent throughout).
I thought about iterating over all the values in the time column and adding the closest-time row one by one (I would first create a blank DF with the desired times), but there are sometimes 50,000+ rows in a sample table. I found an answer about interpolating (link below), but I don't really want to add or modify any of the data itself, just pull the rows that most closely match the rate at which I want to sample the data (one reason is that some of the data is binary, and I wouldn't want to end up with something like 0.5 because the values before and after the desired time were 0 and 1). Any help is greatly appreciated, thanks.
combining pandas dataframes of different sampling rates
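One way to do this kind of nearest-time sampling without modifying any values is pandas' merge_asof with direction='nearest'. A hedged sketch, reusing the example rows above and picking the 100 ms grid bounds arbitrarily:

import pandas as pd

df = pd.DataFrame({
    'file': [1, 1, 1],
    'time': [1503657, 1503757, 1503862],
    'variableid': [1, 1, 1],
    'data': [11, 22, 33],
}).sort_values('time')

# Desired sample grid: every 100 ms
grid = pd.DataFrame({'time': range(1503700, 1503901, 100)})

# For each grid time, take the existing row whose time is closest,
# whether it falls before or after the grid point
sampled = pd.merge_asof(grid, df, on='time', direction='nearest')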
I have an Excel file with close to 13 columns of 0's and 1's. I want to perform bitwise counting on the columns like so:
A  B  Result
1  1  1
0  1  1
1  0  1
0  0  0
I tried LOOKUP, VLOOKUP, and COUNTIF(S), but nothing seems to be working for me. Are there any other functions I could use?
* EDIT *
I am actually looking to implement this in Python because it is part of a rather long workflow, and I don't want to interrupt the script by having to exit, do this, and then come back. What is a rather naive way of doing this in Python?
So far, I have tried to write something where I ask the user to provide an input of which columns they would like grouped but I could not make it work.
Thanks,
If you're doing a bitwise OR (as your example seems to show), you can just put this formula in your Result column:
=MIN(SUM(A1:B1),1)
And then just copy down
Or, you could use the OR function, which returns TRUE if any value is 1 and FALSE if all are 0:
=IF(OR(A1:B1),1,0)
The following formula will output 1 or 0, depending on whether there are 1 or more 1's in columns A through C...
=IF(COUNTIF(A2:C2,"=1")>=1,1,0)
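And for the Python side of the question (per the edit above), a minimal pandas sketch; the file name and column names are assumptions for illustration:

import pandas as pd

df = pd.read_excel("flags.xlsx")   # hypothetical file of 0/1 columns
cols = ["A", "B"]                  # the columns to combine

# Row-wise OR: 1 if any of the chosen columns is 1, else 0
df["Result"] = df[cols].max(axis=1)
# Equivalent: df["Result"] = df[cols].any(axis=1).astype(int)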