I have a longitudinal dataset and am trying to create two variables corresponding to two time periods based on specific date ranges (period_1 and period_2), so that I can analyze the effect of each time period on my outcome.
My Stata code for grouping variables by ID is
gen period_1 = date_eval < mdy(5,4,2020)
preserve
collapse period_1=period_1
count if period_1
and it gives me a number of individuals during that period.
However, I get a different number if I use this SQL query in Python:
evals_period_1 = ps.sqldf('SELECT id, COUNT(date_eval) FROM df WHERE strftime(date_eval) < strftime("%m/%d/%Y",{}) GROUP BY id'.format('5/4/2020'))
Am I grouping by ID differently in these two codes? Please let me know what you think.
Agree with Nick that a reproducible example would have been useful, or at least a description of the results and how they differ from what you expected. However, I can still say something about your Stata code. See the reproducible example below, and note how your code always results in the count 1, even though the example randomizes the data differently on each run.
* Create a data set with 50 rows where period_1 is dummy (0,1) randomized
* differently each run
clear
set obs 50
gen period_1 = (runiform() < .5)
* List the first 5 rows
list in 1/5
* This collapses all rows and what you are left with is one row where the value
* is the average of all rows
collapse period_1=period_1
* List the one remaining observation
list
* Here the Stata syntax is probably not what you are expecting. period_1 is
* evaluated as the value in the single remaining row: the random mean around .5.
* (This is my understanding, assuming it follows what "display period_1" would do.)
count if period_1
* That is effectively identical to "count if .5", and Stata evaluates any
* nonzero number as "true", so the count where this statement is true is 1.
* This will always be the case in this code, unless the random number
* generator creates the corner case where all rows are 0.
count if .5
You probably want to drop the collapse line and change the last line to count if period_1 == 1. But how your data is formatted is relevant to whether this is the solution to your original question.
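For what it's worth, an apples-to-apples check on the Python side could be a pandas one-liner like the sketch below. This is purely illustrative and not from the original post; it assumes df has columns id and date_eval and that date_eval is already a datetime, whereas your GROUP BY id query instead returns one row per individual.
import pandas as pd
# hypothetical check: number of distinct ids with an evaluation before the
# cutoff used in the Stata code, mdy(5,4,2020)
cutoff = pd.Timestamp(2020, 5, 4)
n_individuals = df.loc[df['date_eval'] < cutoff, 'id'].nunique()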
How can I left join between two df by multiple conditions and dynamic probabilities?
DF A and DF B: (the example tables are not included here)
I would like to left join 3 times between tables A and B and the conditions should be dynamic.
It's a bit complex so I'll try to explain with an example:
In table A -
Column A should contain one number from table B by this logic-
All the numbers in table B that are equal/less than the amount we have in the amount column in table A.
So for instance: 120 (table A) should get one of the following in table B - 120,110,100,90,80.
The number should be selected by probabilities - I want to be able to define probabilities for those numbers (for example, 120 - 20%, 110 - 5%, 100 - 50%, 90 - 10%, 80 - 15%).
Column B should contain one number from table B by this logic-
All the numbers in table B that are greater than the amount we have in the amount column in table A, and not equal to 999.
So for instance: 120 (table A) should get one of the following in table B - 130,140,150,160,170.
The number should be selected by probabilities - I want to be able to define probabilities for those numbers (for example, 130 - 20%, 140 - 5%, 150 - 50%, 160 - 10%, 170 - 15%).
Column C in table A should always get 999.
Hopefully I managed to explain myself.
Thanks in advance.
It's a bit difficult to understand exactly what you want, but I'll provide some tips:
Way to select based on the given probability distribution (a short sketch follows these steps):
Create an array with the cumulative-probability intervals, e.g.
prob_in = [0,0.2,0.25,0.75,0.85,1] #20%+5%+50%+10%+15%
and an array with your numbers corresponding to the probabilities
num_pick = [120,110, 100, 90, 80]
Draw a number from the uniform distribution on [0,1], called u (say you get u=0.35).
Check which interval u falls in; here it is the third interval [0.25,0.75], so you pick the third element in num_pick (which is 100).
Do what you want with that number
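Put together, a minimal sketch of those steps (prob_in and num_pick as defined above; numpy.searchsorted is just one convenient way to find the interval):
import numpy as np
prob_in = np.array([0, 0.2, 0.25, 0.75, 0.85, 1.0])  # cumulative probability edges
num_pick = [120, 110, 100, 90, 80]
u = 0.35                                              # in practice: u = np.random.uniform()
idx = np.searchsorted(prob_in, u, side='right') - 1   # 0.35 falls in [0.25, 0.75) -> index 2
picked = num_pick[idx]                                # num_pick[2] == 100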
This might be a lot of work; luckily you can use the answer from this SO question:
from numpy.random import choice
draw = choice(list_of_candidates, number_of_items_to_pick,
p=probability_distribution)
Remember to decide whether replace=True or replace=False should be used, i.e. whether the same item can be chosen multiple times.
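For example, a concrete version of that snippet with the numbers and probabilities from the question (the variable names are just placeholders):
from numpy.random import choice
list_of_candidates = [120, 110, 100, 90, 80]
probability_distribution = [0.20, 0.05, 0.50, 0.10, 0.15]  # must sum to 1
# one draw; replace=True means the same item may be drawn again in later draws
draw = choice(list_of_candidates, 1, replace=True, p=probability_distribution)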
Hello guys! I am struggling to calculate the mean of certain rows from
an Excel sheet using Python. In particular, I would like to calculate the mean for every three rows, starting with the first three and then moving on to the next three, and so on. My Excel sheet contains 156 rows of data.
My data sheet looks like this:
And this is my code:
import numpy
import pandas as pd
df = pd.read_excel("My Excel.xlsx")
x = df.iloc[[0,1,2], [9,10,11]].mean()
print(x)
To sum up, I am trying to calculate the mean of Part 1 Measurement 1 (rows 1, 2, 3) and the mean of Part 2
Measurement 1 (rows 9, 10, 11) using one line of code, or some kind of index. I am expecting to receive two lists of numbers: one for the mean of Part 1 Measurement 1 (rows 1, 2, 3) and the other for the mean of Part 2 Measurement 1 (rows 10, 11, 12). I am also aware that Python counts the first row as 0. The index should have the form n+1.
Thank you in advance.
You could (e.g.) generate a list for each mean you want to calculate:
x1, x2 = list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())
Or you could also generate a list of lists:
x = [list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())]
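If you actually want the mean of every consecutive block of three rows across the whole sheet, one common pattern (just a sketch, assuming the default 0-based integer index) is to group on the row number integer-divided by 3:
# rows 0-2 -> group 0, rows 3-5 -> group 1, and so on
block_means = df.groupby(df.index // 3).mean(numeric_only=True)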
I have sorted a roughly 1 million row dataframe by a certain column. I would like to assign groups to each observation based on equal sums of another column but I'm not sure how to do this.
Example below:
import pandas as pd
value1 = [25,27,20,22,28,20]
value2 = [.34,.43,.54,.43,.5,.7]
df = pd.DataFrame({'value1':value1,'value2':value2})
df = df.sort_values('value1', ascending=False)
df['wanted_result'] = [1,1,1,2,2,2]
Like this example, I want to sum my column (for example, column value1) and assign groups so that the value1 sums of the groups are as close to equal as possible. Is there a built-in function for this?
Greedy Loop
Using Numba's JIT to speed it up.
from numba import njit

@njit
def partition(c, n):
    # c: cumulative sums of the sorted values, n: number of groups
    delta = c[-1] / n        # ideal sum per group
    group = 1
    indices = [group]
    total = delta            # running group boundary on the cumulative sum
    for left, right in zip(c, c[1:]):
        left_diff = abs(total - left)
        right_diff = abs(total - right)
        if right > total and right_diff > left_diff:
            # the next point is past the boundary and farther from it than
            # the current point, so start the next group here
            group += 1
            total += delta
        indices.append(group)
    return indices
df.assign(result=partition(df.value1.to_numpy().cumsum(), n=2))
   value1  value2  result
4      28    0.50       1
1      27    0.43       1
0      25    0.34       1
3      22    0.43       2
2      20    0.54       2
5      20    0.70       2
This is NOT optimal. This is a greedy heuristic. It goes through the list and finds where we step over to the next group. At that point it decides whether it's better to include the current point in the current group or the next group.
This should behave pretty well except in cases with huge disparity in values with the larger values coming towards the end. This is because this algorithm is greedy and only looks at what it knows at the moment and not everything at once.
But like I said, it should be good enough.
I think this is a kind of (non-linear) optimization problem,
and Pandas is definitely not a good candidate for solving it.
The basic idea to solve the problem can be as follows:
Definitions:
n - number of elements,
groupNo - the number of groups to divide into.
Start by generating an initial solution, e.g. put consecutive
groups of n / groupNo elements into each bin.
Define the goal function, e.g. sum of squares of differences between
sum of each group and sum of all elements / groupNo.
Perform an iteration:
for each pair of elements a and b from different bins,
calculate the new goal function value, if these elements were moved
to the other bin,
select the pair that gives the greater improvement of the goal function
and perform the move (move a from its present bin to the bin, where b is,
and vice versa).
If no such pair can be found, then we have the final result.
Maybe someone will propose a better solution, but at least this gives a
concept to start with; a rough sketch of the idea follows.
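A rough, unoptimized Python sketch of that idea (all names here are illustrative; the goal function is the sum of squared deviations from the ideal group sum, as described above):
import numpy as np

def balance_groups(values, groupNo):
    values = np.asarray(values, dtype=float)
    n = len(values)
    # initial solution: consecutive chunks of roughly n / groupNo elements each
    groups = np.arange(n) * groupNo // n
    target = values.sum() / groupNo

    def goal(g):
        sums = np.array([values[g == k].sum() for k in range(groupNo)])
        return ((sums - target) ** 2).sum()

    best = goal(groups)
    while True:
        best_pair, best_goal = None, best
        # try swapping every pair of elements that sit in different bins
        for i in range(n):
            for j in range(i + 1, n):
                if groups[i] == groups[j]:
                    continue
                groups[i], groups[j] = groups[j], groups[i]   # tentative swap
                g = goal(groups)
                groups[i], groups[j] = groups[j], groups[i]   # undo
                if g < best_goal:
                    best_pair, best_goal = (i, j), g
        if best_pair is None:        # no swap improves the goal: done
            return groups + 1        # 1-based labels, as in the question
        i, j = best_pair
        groups[i], groups[j] = groups[j], groups[i]
        best = best_goal
Usage would be something like df['result'] = balance_groups(df['value1'].to_numpy(), 2). Note that this exact pairwise search is O(n^2) per pass, so for a million rows it only illustrates the concept and would need a smarter neighbourhood or sampling.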
The code I've written works correctly but I have a speed problem...
One function (below) is called nearly 10,000 times and takes 0.4+ seconds each time on my machine, meaning the script itself takes about 66 minutes, which is too long to be useful. Is there a significantly faster way of writing this 'countifs'-like function for Python 3.x? (Excel equivalent, for context.)
I have an input of c. 800,000 rows and 50 columns, which is read into a pandas dataframe (df). All well and good so far. I'm only interested in four columns: 'dateA', 'dateB', 'theme' and 'category'.
I feed the function individual dates over and over again (generated elsewhere), for example between 2013-01-01 and 2017-12-31 ('specifiedDate'); this is the source of c. 2,000 calls to the function. For each 'specifiedDate' there are five categories (provided by 'a'), multiplying the 2,000 calls by 5. I'm attempting to rapidly count the number of rows in df which match the provided criteria (in np.where()) for each date and category.
import numpy as np
import pandas as pd

def loopthroughdates(specifiedDate, a):
    df['calc'] = np.where((df['category'] == a)
                          & (df['dateA'] < specifiedDate)
                          & (df['dateB'] > specifiedDate)
                          | (df['category'] == a)
                          & (df['theme'] == "Blue")
                          & (df['dateA'] < specifiedDate), 1, 0)
    total = df['calc'].sum()
    return total
The function returns an integer equal to the number of rows which match the criteria in np.where() for each date and category. This integer is used in the rest of the script to build up a table which looks like the below:
Date,cat1,cat2,cat3,cat4,cat5
2015-04-10,100,300,80,30,250
2015-04-11,101,300,70,35,248
2015-04-12,102,298,72,38,247
I've tried a number of approaches using bits and pieces from other questions on this site, but can't find one quicker than this. I feel there must be one; can you help?
Edit
The function is called by a nested for loop:
for specifiedDate in datelist:
    for a in categorylist:
        total = loopthroughdates(specifiedDate, a)
A sample of df (5 rows) excluding the irrelevant(?) columns - remember this is over 800 000 rows and 50 cols:
dateA,dateB,category,theme
2015-01-01,2015-05-10,cat2,blue
2015-04-11,2015-04-13,cat2,blue
2015-02-25,2015-06-01,cat5,red
2015-08-01,2015-08-15,cat1,blue
2014-10-10,2015-09-03,cat4,blue
Thanks!
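For context (this is a sketch, not from the original post): a common way to speed up this kind of count is to filter the category once and reuse plain NumPy arrays for all the dates, instead of rebuilding a 'calc' column on the full 800,000-row dataframe on every call. Assuming dateA and dateB are already parsed as dates, something like:
import numpy as np

def counts_for_category(df, a, datelist):
    # filter to one category once, then reuse plain arrays for every date
    sub = df[df['category'] == a]
    dateA = sub['dateA'].to_numpy()
    dateB = sub['dateB'].to_numpy()
    blue = (sub['theme'] == "Blue").to_numpy()
    totals = []
    for specifiedDate in datelist:
        d = np.datetime64(specifiedDate)
        match = ((dateA < d) & (dateB > d)) | (blue & (dateA < d))
        totals.append(int(match.sum()))
    return totals
Each call then returns a whole column of the desired table for one category, e.g. cat1 = counts_for_category(df, 'cat1', datelist).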
I have a script that does things for me, but very inefficiently. I asked for some help on Code Review and was told to try Pandas instead. This is what I've done, but I'm having some difficulty understanding how it works. I've tried to read the documentation and other questions here, but I can't find any answer.
So, I've got a dataframe with a small amount of rows (20 to couple of hundred) and a smaller number of columns. I've used the read_table pandas function to get at the original data in .txt form, which looks like this:
[ID1, Gene1, Sequence1, Ratio1, Ratio2, Ratio3]
[ID1, Gene1, Sequence2, Ratio1, Ratio2, Ratio3]
[ID2, Gene2, Sequence3, Ratio1, Ratio2, Ratio3]
[ID2, Gene3, Sequence4, Ratio1, Ratio2, Ratio3]
[ID3, Gene3, Sequence5, Ratio1, Ratio2, Ratio3]
... along with a whole bunch of unimportant columns.
What I want to be able to do is to select all the ratios from each Sequence and perform some calculations and statistics on them (all 3 ratios for each sequence, that is). I've tried
df.groupby('Sequence')
for col in df:
    # do something / print(col) / print(col[0])
... but that only makes me more confused. If I pass print(col), I get some kind of df construct printed, whereas if I pass print(col[0]), I only get the sequences. As far as I can see in the construct, I should still have all the other columns and their data, since groupby() doesn't remove any data, it just groups it by some input column. What am I doing wrong?
Although I haven't gotten that far yet, due to the problems above, I also want my script to be able to select all the ratios for every ID and perform the same calculations on them, but this time every ratio by itself (i.e. Ratio1 for all rows of ID1, the same for Ratio2, etc.). And, lastly, do the same thing for every gene.
EDIT:
So, say I want to perform this calculation on every ratio in the row, and then take the median of the three resulting values:
df['Value1'] = spike[data['ID']] / float(data['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value2'] = spike[data['ID']] / float(data['Ratio 2']) * (10**-12) * (6.022*10**23) / (1*10**6)
df['Value3'] = spike[data['ID']] / float(data['Ratio 3']) * (10**-12) * (6.022*10**23) / (1*10**6)
... where spike is a dictionary, and the keys are the IDs. Ignoring the dict part, I can make calculations (thanks!), but how do I access the dictionary using the dataframe IDs? With the above code, I just get a "Unhashable type: Series" error.
Here's some real data:
ID Gene Sequence Ratio1 Ratio2 Ratio3
1 KRAS SFEDXXYR 15.822 14.119 14.488
2 KRAS VEDAXXXLVR 9.8455 8.9279 16.911
3 ELK4 IEXXXCESLNK 15.745 7.9122 9.5966
3 ELK4 IEGXXXSLNKR 1.177 NaN 12.073
df.groupby() does not modify/group df in place, so you have to assign the result to a new variable to use it further. E.g.:
grouped = df.groupby('Sequence')
BTW, in the example data you give, all data in the Sequence column are unique, so grouping on that column will not do much.
Furthermore, you normally don't need to 'iterate over the df' as you do here. To apply a function to all groups, you can do that directly on the groupby result, e.g. df.groupby().apply(..) or df.groupby().aggregate(..).
Can you give a more specific example of what kind of function you want to apply to the ratios?
To calculate the median of the three ratio's for each sequence (each row), you can do:
df[['Ratio1', 'Ratio2', 'Ratio3']].median(axis=1)
The axis=1 means that you do not want to take the median of one column (over the rows), but for each row (over the columns).
Another example: to calculate the median of all Ratio1's for each ID, you can do:
df.groupby('ID')['Ratio1'].median()
Here you group by ID, select column Ratio1 and calculate the median value for each group.
UPDATE: you should probably split the questions into separate ones, but as an answer to your new question:
data['ID'] will give you the ID column, so you cannot use it as a key. You want one specific value of that column. To apply a function on each row of a dataframe, you can use apply:
def my_func(row):
    return spike[row['ID']] / float(row['Ratio 1']) * (10**-12) * (6.022*10**23) / (1*10**6)

df['Value1'] = df.apply(my_func, axis=1)
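As a side note, if spike is a plain dict keyed by ID, a vectorized alternative to apply (just a sketch, using the same column names as in my_func above) would be:
# look up the spike value for every row's ID, then do the arithmetic column-wise
df['Value1'] = df['ID'].map(spike) / df['Ratio 1'].astype(float) * (10**-12) * (6.022*10**23) / (1*10**6)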