I have an Excel file with close to 13 columns of 0's and 1's. I want to perform bitwise counting on columns like so:
A B Result
1 1 1
0 1 1
1 0 1
0 0 0
I tried LOOKUP, VLOOKUP, and COUNTIF(S), but nothing seems to work for me. Are there any other functions I could use?
* EDIT *
I am actually looking to implement this in Python, because it is part of a rather long workflow and I don't want to interrupt the script by having to exit, do this, and then come back. What is a rather naive way of doing this in Python?
So far, I have tried to write something where I ask the user to provide an input of which columns they would like grouped, but I could not make it work.
Thanks,
If you're doing a bitwise OR (as your example seems to show) you can just put this formula in your Result column
=MIN(SUM(A1:B1),1)
And then just copy down
Or, you could use the OR function, which will return True if any value is 1 and False if all are 0:
=IF(OR(A1:B1),1,0)
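Since the edit asks for a Python version, here is a minimal pandas sketch of the same row-wise OR; the file name and column names are assumptions for illustration only.
import pandas as pd

# Assumed file and column names, purely for illustration
df = pd.read_excel("flags.xlsx")

cols = ["A", "B"]                    # columns the user wants grouped
df["Result"] = df[cols].max(axis=1)  # row-wise OR: 1 if any selected value is 1
print(df.head())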
The following formula will output 1 or 0, depending on whether there are one or more 1's in columns A through C...
=IF(COUNTIF(A2:C2,"=1")>0,1,0)
Related
I have a "large" DataFrame whose index is country codes (alpha-3) and whose columns are years (1900 to 2000), imported via pd.read_csv(...) [as I understand it, the column labels are actually strings, so I need to refer to a column as '1945', for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration (my current implementation is below; as you can see, using two loops is not optimal, and I guess I could get rid of one by using apply on each row):
def spread_values(df):
    # min_year and max_year are defined earlier in the script
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                # carry the last non-zero value forward
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension, because this is not optimal.
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need in order to conditionally update the cell). I think the core reason I can't make the vectorised or list-comprehension options work is that I do not have a small, fixed set of column names but rather a wide range (all the examples I see use a handful of named columns...)
What would be a more optimal / more elegant solution here?
Or are DataFrames not suited to my data at all? What should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of your dataframe) with the previous non-zero value per column.
If you want to do it row-wise, the .replace() function unfortunately does not accept an axis argument. But you can transpose your dataframe, replace the zeros, and transpose it again: df.T.replace(0, method='ffill').T
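For a quick check on toy data (the country code and year range below are made up), an equivalent row-wise approach uses mask plus ffill and avoids the double transpose:
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]],
    index=["FRA"],
    columns=[str(y) for y in range(1991, 2001)],
)

# Turn zeros into NaN, forward-fill along each row, then put any leading
# zeros (before the first non-zero value) back.
spread = df.mask(df == 0).ffill(axis=1).fillna(0).astype(int)
print(spread)  # 0 0 1 1 1 3 3 3 2 1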
I have been reading documentation for a few hours now and I feel I am approaching the problem with the wrong mindset.
I have two tables in Hive which I read with spark.table(table_A). They have the same number and types of columns, but different origins, so their data is different. Both tables reflect flags that show whether or not a condition is met. There are around 20 columns at least, and the number could increase in the future.
If the first row of table_A is 0 0 1 1 0 and the first row of table_B is 0 1 0 1 0, I would like the result to be the XNOR of the two, comparing positions: 1 0 0 1 1, since they have the same values in the first, fourth and fifth positions.
So I thought of the XNOR operation, which returns 1 if both values match and 0 otherwise.
I am facing a number of problems. One of them is the volume of my data (right now I am working with a one-week sample and it is already at the 300 MB mark), so I am working with pyspark and avoiding pandas, since the data usually does not fit in memory and/or slows the operation down a lot.
Summing up, I have two objects of type pyspark.sql.dataframe.DataFrame, each holding one of the tables, and so far the best I've got is something like this:
df_bitwise = df_flags_A.flag_column_A.bitwiseXOR(df_flags_B.flag_columns_B)
But sadly this returns a pyspark.sql.column.Column, and I do not know how to read that result or how to build a dataframe from it (I would like the end result to be roughly 20 of the above operations, one per column, each forming a column of a dataframe).
What am I doing wrong? I feel like this is not the right approach.
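One way to get a DataFrame back rather than a bare Column is to build all of the comparison columns in a single select. A rough sketch only: the join key (id) and the flag column names below are placeholders, not the real schema.
from pyspark.sql import functions as F

flag_cols = ["flag_1", "flag_2", "flag_3"]  # in reality ~20 columns

# Join the two tables on whatever key identifies matching rows
joined = df_flags_A.alias("a").join(df_flags_B.alias("b"), on="id")

# XNOR per flag: 1 when the two tables agree, 0 otherwise
df_xnor = joined.select(
    *[F.when(F.col("a." + c) == F.col("b." + c), 1).otherwise(0).alias(c)
      for c in flag_cols]
)
df_xnor.show()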
I am not generally a python user. I do most things in R and Stata. However, I cannot find a good semantic similarity package/API in either of those.
I have two data frames in the environment. One is questions, which consists of 2 columns and 3 rows. The other is results, which has 3 columns and 3 rows.
I am trying to compare each question (individually) in the first column of the questions dataframe to all of the questions in the second column. Then I want the output to populate the results dataframe. The function takes two strings as arguments. So far my code looks like this:
for i in range(1, 3):
    results.iloc[i-1, i] = liteClient.compare(questions.iloc[0, i], questions.iloc[:, 1])
I assume that I am pointing to the data incorrectly, but I really haven't been able to find good documentation about this seemingly straightforward issue.
By comparison, here is my working R code, which uses a slightly different function and only one dataframe.
for (i in 1:3){
  df[,i+2] <- levenshteinSim(df$yr1questions[i], df$yr2questions)
}
Any help would be greatly appreciated! I am trying to come up with proof-of-concept code to compare similar survey questions between years based on semantic meaning.
Bob
Let's try to compare (here, simply multiply) every question A with every question B.
import pandas as pd
import numpy as np

questions = pd.DataFrame(np.arange(6).reshape(3,2), columns=["question_A", "question_B"])
This gives:
question_A question_B
0 0 1
1 2 3
2 4 5
Then let's define a compare function:
def compare(row):
    return pd.Series(row[0]*questions['question_B'])
results = questions.apply(compare, axis=1)
That gives us:
0 1 2
0 0 0 0
1 2 6 10
2 4 12 20
As you pointed out in the comments, here is a version comparing only two strings at a time:
def compare(row):
    question_a = row[0]
    return pd.Series([liteClient.compare(question_a, question_b) for question_b in questions['question_B']])
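Applied exactly as before, this fills a 3 x 3 results frame in which cell (i, j) holds the similarity of question i from column A against question j from column B (assuming liteClient is already configured):
results = questions.apply(compare, axis=1)
print(results)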
Based on what you've posted so far, here are some issues with what you've written, which are understandable given your R background:
for i in range(1, 3):
In Python 3.x, this creates a range object, which you can think of as a special kind of function (though it is really an object with iteration properties) that generates a sequence of numbers with a certain step size (default 1), with the stop value excluded. Additionally, you need to know that most programming languages, Python included, index starting at zero, not one.
What this range object does here is generate the sequence 1, 2 and that is it.
So the arrays you are using i to index into will not be indexed over all of their indices. What I believe you want is something like:
for i in range(3):
Notice how there is only one number here; it is the exclusive maximum of the range, with 0 as the implicit inclusive minimum, so this will generate the sequence 0, 1, 2. If you have an array of size 3, that covers all possible indices for that array.
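For example:
>>> list(range(1, 3))
[1, 2]
>>> list(range(3))
[0, 1, 2]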
This next line is a bit confusing to me, since I'm not familiar with R, but I sort of understand what you were trying to do. If I understand correctly, you have two columns of 3 questions each, and you want to compare each question in column 1 to the questions in column 2, resulting in a 3 x 3 matrix of comparison results that you store in results. Assuming the sizes are already correct (i.e. results is 3x3), I'd like to explain some peculiarities I see in this code.
results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
With results.iloc[i-1,i] you are indexing by row and column: i-1 is the row and i is the column. So, without changing range(1,3), only the positions (0,1) and (1,2) are ever accessed, and that is it. I believe liteClient.compare(...) is supposed to return either a 1x3 dataframe or a list of size 3, based on what you were trying to do with it; this may not be the case, however, since I'm not sure what object you are calling that member function on, so I don't know where its documentation lives. Assuming it does return a list of size 3 or a dataframe, you'll need to change the way you assign the data, like this:
results.iloc[i,:] = ...
What is happening here is that iloc[...] takes a row position and a column slice; you are assigning all the columns of the results matrix at that row to the values returned by compare. With the for statement changed as above, this will iterate over all rows of the dataframe.
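A tiny illustration of that assignment pattern on toy data:
>>> import pandas as pd
>>> results = pd.DataFrame(0.0, index=range(3), columns=range(3))
>>> results.iloc[0, :] = [0.1, 0.2, 0.3]  # fills the entire first row
>>> results
     0    1    2
0  0.1  0.2  0.3
1  0.0  0.0  0.0
2  0.0  0.0  0.0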
liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
As this line currently stands, you are iterating over each column of the first row of questions, and comparing each of those values to all rows of the second column of questions.
I believe what you will want to do is change this to:
liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
What this does is, for each i (0, 1, 2), take the question at row i, column 0, and compare it to every row of column 1. If your questions dataframe is actually organized as 2 columns and 3 rows, this should work; otherwise you will need to change how you create questions as well.
All in all, I believe the fixed program should look something like:
for i in range(3):
    results.iloc[i, :] = liteClient.compare(questions.iloc[i, 0], questions.iloc[:, 1])
This is a problem I've encountered in various contexts, and I'm curious if I'm doing something wrong, or if my whole approach is off. The particular data/functions are not important here, but I'll include a concrete example in any case.
It's not uncommon to want a groupby/apply that does various operations on each group, and returns a new dataframe. An example might be something like this:
def patch_stats(df):
    first = df.iloc[0]
    diversity = (len(df['artist_id'].unique())/float(len(df))) * df['dist'].mean()
    start = first['ts']
    return pd.DataFrame({'diversity':[diversity],'start':[start]})
So, this is a grouping function that generates a new DataFrame with two columns, each derived from a different operation on the input data. Again, the specifics aren't too important here, but this is the issue:
When I look at the output, I get something like this:
result = df.groupby('patch_idx').apply(patch_stats)
print result
diversity start
patch_idx
0 0 0.876161 2007-02-24 22:54:28
1 0 0.588997 2007-02-25 01:55:39
2 0 0.655306 2007-02-25 04:27:05
3 0 0.986047 2007-02-25 05:37:58
4 0 0.997020 2007-02-25 06:27:08
5 0 0.639499 2007-02-25 17:40:56
6 0 0.687874 2007-02-26 05:24:11
7 0 0.003714 2007-02-26 07:07:20
8 0 0.065533 2007-02-26 09:01:11
9 0 0.000000 2007-02-26 19:23:52
10 0 0.068846 2007-02-26 20:43:03
...
It's all good, except I have an extraneous, unnamed index level that I don't want:
print result.index.names
FrozenList([u'patch_idx', None])
Now, this isn't a huge deal; I can always get rid of the extraneous index level with something like:
result = result.reset_index(level=1,drop=True)
But seeing how this comes up any time I have a grouping function that returns a DataFrame, I'm wondering if there's a better way to approach this. Is it bad form to have a grouping function that returns a DataFrame? If so, what's the right method to get the same kind of result? (Again, this is a general question fitting problems of this type.)
In your grouping function, return a Series instead of a DataFrame. Specifically, replace the last line of patch_stats with:
return pd.Series({'diversity':diversity, 'start':start})
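With that change, the whole function becomes:
def patch_stats(df):
    first = df.iloc[0]
    diversity = (len(df['artist_id'].unique())/float(len(df))) * df['dist'].mean()
    start = first['ts']
    return pd.Series({'diversity': diversity, 'start': start})
groupby(...).apply then stacks one Series per group into a DataFrame indexed only by patch_idx.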
I've encountered this same issue.
Solution
result = df.groupby('patch_idx', group_keys=False).apply(patch_stats)
print result
For data like this
import pandas as pd
df=pd.DataFrame({'group1': list('AABBCCAABBCC'),'group2':list('ZYYXYXXYZXYZ')})
I figured out, with some difficulty, that to make a frequency table of rows against columns, the most common way is as follows:
print df.pivot_table(index='group1',columns='group2',aggfunc=len,fill_value=0)
by which I get
group2 X Y Z
group1
A 1 2 1
B 2 1 1
C 1 2 1
I am just wondering if there is any 'faster' way to generate the same table. Not that there is anything wrong with this one; what I mean is something that involves less typing (without me having to write a custom function).
I am just comparing this with R, where the same result could have been achieved by
table(df$group1,df$group2)
Compared to this, entering non-default parameters like aggfunc and fill_value and typing out the argument names index and columns seems like a lot of additional effort.
In general, my (very limited) experience is that the Python equivalents of R functions are similar in conciseness.
Any suggestions on alternative methods would be great. I will need to make several of these tables with my data.
Here is an alternative method.
>>> df.groupby(['group1', 'group2']).group2.count().unstack().fillna(0)
group2 X Y Z
group1
A 1 2 1
B 2 1 1
C 1 2 1
pd.crosstab(df['group1'],df['group2'])
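For the example above, this produces the same table:
>>> pd.crosstab(df['group1'], df['group2'])
group2  X  Y  Z
group1
A       1  2  1
B       2  1  1
C       1  2  1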
This was exactly what I was looking for. Did not find it when I was searching for it initially.