I have a Pandas dataframe containing case-control data that can be represented by the following structure:
caseA caseN catA
0 y 1 a
1 y 1 a
2 y 1 b
3 y 1 b
4 y 1 c
5 y 1 d
6 y 1 a
7 y 1 c
8 n 0 c
9 n 0 d
10 n 0 a
11 n 0 b
12 n 0 c
13 n 0 a
14 n 0 d
15 n 0 a
16 n 0 b
17 n 0 c
18 n 0 a
19 n 0 d
The caseA and caseN variables represent cases and controls as strings and integers, respectively.
I can calculate a 2x2 table to facilitate the calculation of odds and odds ratios using the pandas crosstab method. The default order of the columns is control-case but I change this to case-control which, to my way of thinking, is a bit more intuitive.
I then slice the dataframe to print just a select number of rows with columns in the order case - control. This works exactly as expected.
However, if I add a new column to the dataframe (e.g. a column containing the odds values) and then slice the dataframe in exactly the same way, the cases and controls are printed in the wrong order.
The following code snippet illustrates this point:
import pandas as pd

df = pd.DataFrame({'caseN': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                   'caseA': ['y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n'],
                   'catA': ['a', 'a', 'b', 'b', 'c', 'd', 'a', 'c', 'c', 'd', 'a', 'b', 'c', 'a', 'd', 'a', 'b', 'c', 'a', 'd']})

print('\nCross tabulation\n')
continTab = pd.crosstab(df['catA'], df['caseN'])
print(continTab)

print('\nReordered cross tabulation\n')
continTab = continTab[[1, 0]]
print(continTab)

#print('\n<-- An extra column containing odds has been entered here -->')
#continTab['odds'] = continTab[1]/continTab[0]

print('\nPrint just a slice containing rows a and c only, with 1 - 0 column order\n')
print(continTab.loc[['a', 'c'], [1, 0]])
On the first run through the sliced table produced is just as expected:
caseN 1 0
catA
a 3 4
c 2 3
But if you uncomment the code that calculates the odds column and then re-run the exact same code, the sliced table produced is:
caseN 0 1
catA
a 4 3
c 3 2
I can think of no reason why this should happen. Is this a bug?
(Interestingly, repeating the process using the case-control data described as strings (in variable caseA) produces the correct expected results.)
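One way to sidestep the problem (it forces the order rather than explaining it) is to ask for the column order explicitly with reindex, which returns columns in exactly the order requested. A minimal sketch against the continTab built above, after the odds column has been added:

# Force the 1 - 0 order explicitly before taking the row slice
print(continTab.reindex(columns=[1, 0]).loc[['a', 'c']])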
So I have a DataFrame that is the same data 348 times over, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows that are within 20 miles, using a lat/lon column and geopy.
My frame is like this:
What I am looking to do is something like an apply function that takes all of the rows whose identifying date equals the column's date and then runs this:
geopy.distance.vincenty(x, y).miles
x would be the location's lat/lon and y would be each other row's lat/lon in turn. I want the count of locations for which the value above is < 20, and I'd then like to store this count as a column in the initial DataFrame.
I'm ok with Pandas, but this is just outside my comfort zone. Thanks.
I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):
df
Index Number la ID
0 0 1 [43.3948, -23.9483] 1/1/90
1 1 2 [22.8483, -34.3948] 1/1/90
2 2 3 [44.9584, -14.4938] 1/1/90
3 3 4 [22.39458, -55.34924] 1/1/90
4 4 5 [33.9383, -23.4938] 1/1/90
5 5 6 [22.849, -34.397] 1/1/90
Now I introduced an artificial column which is only there to help us get the cartesian product of the distances
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
The next step is to apply the vincenty function via .apply and store the result in an extra column:
from geopy import distance  # vincenty lives in geopy.distance
df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, 1)
Now we have the cartesian product of the original frame, which means we also have the comparison of each city with itself. We account for that in the next step by subtracting 1. We group by Index_x and count all the distances smaller than 20 miles.
df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum(x.distance < 20)) - 1
df.drop('join', axis=1)
Index Number la ID num_close_cities
0 0 1 [43.3948, -23.9483] 1/1/90 0
1 1 2 [22.8483, -34.3948] 1/1/90 1
2 2 3 [44.9584, -14.4938] 1/1/90 0
3 3 4 [22.39458, -55.34924] 1/1/90 0
4 4 5 [33.9383, -23.4938] 1/1/90 0
5 5 6 [22.849, -34.397] 1/1/90 1
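If you are running this against a current geopy, note that vincenty has since been removed in favour of geodesic, so a self-contained sketch of the same approach (column names and the 20-mile threshold are taken from the example above; treat it as an illustration, not a drop-in answer) might look like this:

import pandas as pd
from geopy import distance

# Rebuild the example frame from above
df = pd.DataFrame({'Index': range(6),
                   'Number': range(1, 7),
                   'la': [[43.3948, -23.9483], [22.8483, -34.3948], [44.9584, -14.4938],
                          [22.39458, -55.34924], [33.9383, -23.4938], [22.849, -34.397]],
                   'ID': ['1/1/90'] * 6})

# Constant join key gives the cartesian product of all location pairs
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join', 'Index']], on='join')

# geodesic is the current replacement for vincenty; .miles converts the result
df_c['distance'] = df_c.apply(lambda r: distance.geodesic(r.la_x, r.la_y).miles, axis=1)

# Count neighbours within 20 miles and subtract 1 for the self-comparison
df['num_close_cities'] = df_c.groupby('Index_x')['distance'].apply(lambda d: (d < 20).sum()) - 1
df = df.drop('join', axis=1)
print(df)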
I have a pandas data-frame where I am trying to replace/change duplicate values with 0 (I don't want to delete the values) within a certain range of days.
So, in the example given below, I want to replace duplicate values in all columns with 0 within a range of, let's say, 3 days (the number can be changed). The desired result is also given below.
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() for this: it looks at the value a given number of rows up or down (specified by the number x in .shift(x)).
You can use that in combination with .loc to select all rows that have a value identical to one of the rows above and then replace it with a 0.
Something like this should work (the code is written to be flexible for any number of columns and for the number of days):
numberOfDays = 3  # number of days to compare
for col in df.columns:
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0

print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
I couldn't find anything better than looping over all columns, because every column leads to a different grouping.
First define a function which does what you want at grouped level, i.e. setting all but the first entry to zero:
def set_zeros(g):
    g.values[1:] = 0
    return g

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function can be applied to each group, which is defined by a time range (freq='3D') and equal values of a column within this period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for your other considerations.
For a detailed description of how to define the time period see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
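For completeness, here is a sketch of how the frame from the question could be set up so that pd.Grouper (which needs a DatetimeIndex) can bin by date; the values come from the question, set_zeros is rewritten with .iloc rather than writing through .values (which amounts to the same thing), and as_index=False is dropped since transform ignores it:

import pandas as pd

# Rebuild the example with a DatetimeIndex so pd.Grouper(freq='3D') can bin by date
idx = pd.date_range('2011-01-01', periods=9, freq='D')
df = pd.DataFrame({'A': [2, 2, 2, 3, 5, 5, 4, 2, 1],
                   'B': [10, 12, 10, 11, 15, 23, 21, 21, 11],
                   'C': [0, 2, 0, 3, 0, 1, 4, 5, 0]}, index=idx)

def set_zeros(g):
    # keep the first occurrence in each (value, 3-day window) group, zero the rest
    g.iloc[1:] = 0
    return g

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')])[c].transform(set_zeros)

print(df)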
I am trying to add a running count to a pandas df.
For the values in Column A, I want to add '5' and for values in Column B I want to add '1'.
So for the df below I'm hoping to produce:
A B Total
0 0 0 0
1 0 0 0
2 1 0 5
3 1 1 6
4 1 1 6
5 2 1 11
6 2 2 12
So every increment of 1 in Column A adds 5 to the total, while every increment in Column B adds 1.
I tried:
df['Total'] = df['A'].cumsum(axis = 0)
But this doesn't include Column B
df['Total'] = df['A'] * 5 + df['B']
As far as I can tell, you are simply doing row-wise operations, not a cumulative sum. This snippet multiplies the row value of A by 5 and adds the row value of B, for each row. It doesn't need to be any more complicated than that.
What is a cumulative sum (also called running total)?
From Wikipedia:
Consider the sequence < 5 8 3 2 >. What is the total of this sequence?
Answer: 5 + 8 + 3 + 2 = 18. This is arrived at by simple summation of the sequence.
Now we insert the number 6 at the end of the sequence to get < 5 8 3 2 6 >. What is the total of that sequence?
Answer: 5 + 8 + 3 + 2 + 6 = 24. This is arrived at by simple summation of the sequence. But if we regarded 18 as the running total, we need only add 6 to 18 to get 24. So, 18 was, and 24 now is, the running total. In fact, we would not even need to know the sequence at all, but simply add 6 to 18 to get the new running total; as each new number is added, we get a new running total.
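To make the distinction concrete, here is a short sketch with the frame from the question rebuilt by hand, showing the row-wise expression next to what cumsum would give:

import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1, 1, 2, 2],
                   'B': [0, 0, 0, 1, 1, 1, 2]})

# Row-wise: every unit of A is worth 5, every unit of B is worth 1
df['Total'] = df['A'] * 5 + df['B']
print(df['Total'].tolist())       # [0, 0, 5, 6, 6, 11, 12] -- the desired Total column

# A cumulative sum, by contrast, is a running total down the column
print(df['A'].cumsum().tolist())  # [0, 0, 1, 2, 3, 5, 7]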
I am trying to do some data manipulations using pandas. I have an Excel file with two columns, x and y. The number of elements in x corresponds to the number of connections (n_arrows) it makes with an element in column y. The number of unique elements in column x corresponds to the number of unique points (n_nodes). What I want to do is generate a random data frame (10^4 times) with the unique elements of column x and the elements of column y. The code I was trying to work on is attached. Any suggestion will be appreciated.
import pandas as pd
import numpy as np
df = pd.read_csv('/home/amit/Desktop/playing_with_pandas.csv')
num_nodes = df.drop_duplicates(subset='x', keep="last")
n_arrows = [32] #32 rows corresponds to 32
n_nodes = [10]
n_arrows_random = np.random.randn(df.x)
Here are 2 methods:
Solution 1: If you need the x and y values to be independently random:
Given a sample df (thanks @AmiTavory):
df = pd.DataFrame({'x': [1, 1, 1, 2], 'y': [1, 2, 3, 4]})
Using numpy.random.choice, you can do this to select random values from your x column and random values from your y column:
def simulate_df(df, size_of_simulated_df):
    return pd.DataFrame({'x': np.random.choice(df.x, size_of_simulated_df),
                         'y': np.random.choice(df.y, size_of_simulated_df)})
>>> simulate_df(df, 10)
x y
0 1 3
1 1 3
2 1 4
3 1 4
4 2 1
5 2 3
6 1 2
7 1 4
8 1 2
9 1 3
The function simulate_df returns random values sampled from your original dataframe in the x and y columns. The size of your simulated dataframe can be controlled by the argument size_of_simulated_df, which should be an integer representing the number of rows you want.
Solution 2: As per your comments, based on your task, you might want to return a dataframe of random rows, maintaining the x->y correspondence. Here is a vectorized pandas way to do that:
def simulate_df(df=df, size_of_simulated_df=10):
    return df.sample(size_of_simulated_df, replace=True).reset_index(drop=True)
>>> simulate_df()
x y
0 1 2
1 2 4
2 2 4
3 2 4
4 1 1
5 1 3
6 1 3
7 1 1
8 1 1
9 1 3
Assigning your simulated Dataframes for future reference:
In the likely scenario you want to do some sort of calculation on your simulated dataframes, I'd recommend saving them to some sort of dictionary structure using a loop like this:
dict_of_dfs = {}
for i in range(100):
    dict_of_dfs['df' + str(i)] = simulate_df(df, len(df))
Or a dictionary comprehension like this:
dict_of_dfs = {'df'+str(i): simulate_df(df, (len(df))) for i in range(100)}
You can then access any one of your simulated dataframes in the same way you would access any dictionary value:
# Access the 48th simulated dataframe:
>>> dict_of_dfs['df47']
x y
0 1 4
1 2 1
2 1 4
3 2 3
I have the following dataframe df:
A B C D E
J 4 2 3 2 3
K 5 2 6 2 1
L 2 6 5 4 7
I would like to create an additional column that sums each row of the df excluding column A (whose values are also numbers), so what I have tried is:
df['summation'] = df.iloc[:, 1:4].sum(axis=0)
However, the summation column is added but it contains NaN values.
Desired output is:
A B C D E summation
J 4 2 3 2 3 10
K 5 2 6 2 1 11
L 2 6 5 4 7 22
That is, I want the sum along each row from column B to the last column.
As pointed out in the comments, you applied sum along the wrong axis. If you want to exclude columns from the sum, you can use drop (which also accepts a list of column names; that can be handy if you want to exclude columns at, e.g., index 0 and 3, where iloc might not be ideal):
df.drop('A', axis=1).sum(axis=1)
which yields
J 10
K 11
L 22
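To get the frame shown in the desired output, assign that result back as the new column:

df['summation'] = df.drop('A', axis=1).sum(axis=1)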
Also @ayhan's solution in the comments works fine.