In Pandas, I am trying to manually code a chi-square test. I am comparing row 0 with row 1 in the dataframe below.
data
       2      3      5     10     30
0      3      0      6      5      0
1  33324  15833  58305  54402  38920
For this, I need to calculate the expected count for each cell as: cell(i,j) = rowSum(i)*colSum(j) / sumAll. In R, I can do this simply by taking the outer() product:
Exp_counts <- outer(rowSums(data), colSums(data), "*")/sum(data) # Expected cell counts
I used numpy's outer product function to imitate the outcome of the above R code:
import numpy as np
import pandas as pd

pd.DataFrame(np.outer(data.sum(axis=1), data.sum(axis=0)) / data.sum().sum(),
             index=data.index, columns=data.columns.values)
       2      3      5     10     30
0      2      1      4      3      2
1  33324  15831  58306  54403  38917
Is it possible to achieve this with a Pandas function?
A complete solution using only Pandas built-in methods:
def outer_product(row):
    # row sums times the column's total, floor-divided by the grand total
    numerator = df.sum(1).mul(row.sum(0))
    denominator = df.sum(0).sum(0)
    return numerator.floordiv(denominator)

df.apply(outer_product)
Timings: for a DataFrame of 1 million rows.
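For reference, the same outer product can also be written with pandas only, via Series.to_frame() and DataFrame.dot(). This is a minimal sketch mirroring the R code above (true division; the data frame is reconstructed from the question's table):
import numpy as np
import pandas as pd

# Reconstructed from the question's table (assumption: same values)
data = pd.DataFrame([[3, 0, 6, 5, 0],
                     [33324, 15833, 58305, 54402, 38920]],
                    columns=[2, 3, 5, 10, 30])

row_sums = data.sum(axis=1)      # rowSum(i)
col_sums = data.sum(axis=0)      # colSum(j)
total = data.values.sum()        # sumAll

# Outer product expressed with pandas .dot(): (n x 1) @ (1 x m)
expected = row_sums.to_frame().dot(col_sums.to_frame().T) / total
print(expected)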
I need to count how many times a value appears in a column. I did something similar in Excel and I want to understand how to do it in pandas. Thanks
You can try something like this:
import pandas as pd
df = pd.DataFrame({'char_list':list('aabbbbssbbaaabdddcccsbcderfffrrcashhttyy')})
df = df['char_list'].value_counts().reset_index()
df.columns = ['char_list', 'count']
print(df)
Output:
char_list count
0 b 8
1 a 6
2 c 5
3 s 4
4 d 4
5 r 3
6 f 3
7 h 2
8 t 2
9 y 2
10 e 1
Do you want something like this:
df = pd.DataFrame({"a": [1, 2, 3, 1, 1, 4, 5, 6, 2, 1]})
oc = df.groupby("a").size()
df["count"] = df["a"].map(oc)
print(oc)
print()
print(df)
to get
a
1 4
2 2
3 1
4 1
5 1
6 1
dtype: int64
a count
0 1 4
1 2 2
2 3 1
3 1 4
4 1 4
5 4 1
6 5 1
7 6 1
8 2 2
9 1 4
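A slightly more compact variant of the same idea is to let transform("size") broadcast each group's size back onto every row in one step (a sketch, assuming the same df as above):
df["count"] = df.groupby("a")["a"].transform("size")
This produces the same count column as the map(oc) approach.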
Or do you prefer something like Pandas: Incrementally count occurrences in a column, where the count increments with each occurrence?
Clarify and describe your requirements
Count the occurrence of string X inside what?
Where to look, how to count?
What is X?
What does your Excel formula do?
Your Excel formula is doing a window-based aggregation, where the aggregation function is a count (function COUNTIF) and the window runs from the first row to the current row (the first parameter, of type range). The value to count (the given criteria) is specified per row (the second parameter, as a cell value).
See Excel's function COUNTIF:
Counts the number of cells within a range that meet the given criteria
Illustrate by example
Instead of "window-based" we could also say cumulative:
The formula counts the occurrences of the string key123 (the value in column A of the current row, e.g. row 1) in the rows from the first ($A$1) to the current ($A1).
Given a column with strings where the first string is key123, then
its first occurrence should have count 1,
the second should have count 2
and so on
Equivalent functions in pandas
So your Excel formula =COUNTIF($A$1:$A1; A1) would directly translate to pandas GroupBy.cumcount like
df.groupby("Column_A").cumcount()+1
as already answered in:
Pandas: Incrementally count occurrences in a column
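As a quick illustration, here is a minimal sketch with a made-up Column_A (not the asker's data):
import pandas as pd

df = pd.DataFrame({"Column_A": ["key123", "other", "key123", "key123"]})

# 1-based running count per value, mirroring =COUNTIF($A$1:$A1; A1)
df["running_count"] = df.groupby("Column_A").cumcount() + 1
print(df)
#   Column_A  running_count
# 0   key123              1
# 1    other              1
# 2   key123              2
# 3   key123              3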
Terminology
The cumulative count increases by one for each occurrence, similar to a cumulative sum, also known as a running total.
See also related SQL keywords/concepts:
GROUP BY: grouping records and applying aggregate-functions
COUNT: an aggregate-function like SUM, AVG, MAX, MIN
window functions: allow further fine-grained aggregation
Hi, I have a dataframe and I would like to find the index whenever one column's cumulative sum reaches a threshold. The cumulative sum then resets and starts again.
For example:
d = np.random.randn(10, 1) * 2
df = pd.DataFrame(d.astype(int), columns=['data'])
pd.concat([df,df.cumsum()],axis=1)
Output:
data data1
0 1 1
1 2 3
2 3 6
3 2 8
4 0 8
5 1 9
6 0 9
7 -1 8
8 1 9
9 2 11
So in the above sample data, data1 is the cumulative sum of the data column. If I set thres=5, this means that whenever the running sum of the data column is greater than or equal to 5, I save the index. After that happens, the running sum resets and starts again until the next running total greater than or equal to 5 is reached.
Right now I am doing a loop, keeping track of the running sum and manually resetting it. I was wondering if there is a fast vectorized way to do this in pandas, as my dataframe is millions of rows long.
Thanks
I'm not familiar with pandas but my understanding is that it is based on numpy. Using numpy you can define custom functions that can be used with accumulate.
Here is one that I think is close to what you're looking for:
import numpy as np

def capsum(array, cap):
    # Running sum that resets once the accumulated value reaches cap
    capAdd = np.frompyfunc(lambda a, b: a + b if a < cap else b, 2, 1)
    return capAdd.accumulate(array, dtype=object)

values = np.random.rand(1000000) * 3 // 1
result = capsum(values, 5)  # --> produces the result in about 0.17 sec
I believe (or I hope) you can use numpy functions on dataframes.
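Applied to the question's sample column, a sketch of the whole pipeline could look like this (assuming a threshold of 5 and that the rows to save are those where the capped running sum reaches the threshold):
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [1, 2, 3, 2, 0, 1, 0, -1, 1, 2]})

cap = 5
cap_add = np.frompyfunc(lambda a, b: a + b if a < cap else b, 2, 1)
running = cap_add.accumulate(df['data'].to_numpy(), dtype=object).astype(int)

# Rows where the capped running sum reached the threshold (it resets afterwards)
hit_indices = np.flatnonzero(running >= cap)
print(running)      # [1 3 6 2 2 3 3 2 3 5]
print(hit_indices)  # [2 9]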
I am trying to do some data manipulation using pandas. I have an Excel file with two columns, x and y. The number of elements in x corresponds to the number of connections (n_arrows) each one makes with an element in column y. The number of unique elements in column x corresponds to the number of unique points (n_nodes). What I want to do is generate a random dataframe (10^4 times) with the unique elements in column x and the elements in column y. The code I was trying to work on is attached. Any suggestion will be appreciated.
import pandas as pd
import numpy as np
df = pd.read_csv('/home/amit/Desktop/playing_with_pandas.csv')
num_nodes = df.drop_duplicates(subset='x', keep="last")
n_arrows = [32] #32 rows corresponds to 32
n_nodes = [10]
n_arrows_random = np.random.randn(df.x)
Here are 2 methods:
Solution 1: If you need the x and y values to be independently random:
Given a sample df (thanks @AmiTavory):
df = pd.DataFrame({'x': [1, 1, 1, 2], 'y': [1, 2, 3, 4]})
Using numpy.random.choice, you can do this to select random values from your x column and random values from your y column:
def simulate_df(df, size_of_simulated_df):
    return pd.DataFrame({'x': np.random.choice(df.x, size_of_simulated_df),
                         'y': np.random.choice(df.y, size_of_simulated_df)})
>>> simulate_df(df, 10)
x y
0 1 3
1 1 3
2 1 4
3 1 4
4 2 1
5 2 3
6 1 2
7 1 4
8 1 2
9 1 3
The function simulate_df returns random values sampled from your original dataframe in the x and y columns. The size of your simulated dataframe can be controlled by the argument size_of_simulated_df, which should be an integer representing the number of rows you want.
Solution 2: As per your comments, based on your task, you might want to return a dataframe of random rows, maintaining the x->y correspondence. Here is a vectorized pandas way to do that:
def simulate_df(df=df, size_of_simulated_df=10):
    return df.sample(size_of_simulated_df, replace=True).reset_index(drop=True)
>>> simulate_df()
x y
0 1 2
1 2 4
2 2 4
3 2 4
4 1 1
5 1 3
6 1 3
7 1 1
8 1 1
9 1 3
Assigning your simulated Dataframes for future reference:
In the likely scenario you want to do some sort of calculation on your simulated dataframes, I'd recommend saving them to some sort of dictionary structure using a loop like this:
dict_of_dfs = {}
for i in range(100):
    dict_of_dfs['df' + str(i)] = simulate_df(df, len(df))
Or a dictionary comprehension like this:
dict_of_dfs = {'df'+str(i): simulate_df(df, (len(df))) for i in range(100)}
You can then access any one of your simulated dataframes in the same way you would access any dictionary value:
# Access the 48th simulated dataframe:
>>> dict_of_dfs['df47']
x y
0 1 4
1 2 1
2 1 4
3 2 3
I have a boolean matrix of M x N, where M = 6000 and N = 1000
1 | 0 1 0 0 0 1  ----> 1000
2 | 1 0 1 0 1 0  ----> 1000
3 | 0 0 1 1 0 0  ----> 1000
|
V
6000
Now for each column, I want to find the first occurrence where the value is 1. For the six columns shown above, I want 2 1 2 3 2 1.
Now the code I have is
sig_matrix = list()
num_columns = df.columns
for col_name in num_columns:
    print('Processing column {}'.format(col_name))
    sig_index = df.filter(df[col_name] == 1).\
        select('perm').limit(1).collect()[0]['perm']
    sig_matrix.append(sig_index)
The above code is really slow; it takes 5~7 minutes to process 1000 columns. Is there a faster way to do this than what I am doing? I am also willing to use a pandas dataframe instead of a pyspark dataframe if that is faster.
Here is a numpy version that runs in under 1 s for me, so it should be preferable for this size of data:
arr=np.random.choice([0,1], size=(6000,1000))
[np.argwhere(arr[:,i]==1.)[0][0] for i in range(1000)]
There could well be more efficient numpy solutions.
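For example, a fully vectorized sketch (on random data, not the asker's dataframe) avoids the Python-level loop entirely by using argmax down each column:
import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))

# argmax returns the first index of the maximum; for a 0/1 array this is the
# first occurrence of 1 in each column (0-based).
first_ones = arr.argmax(axis=0)

# Caveat: a column containing no 1 at all also yields 0 here, so check
# arr.any(axis=0) first if that case can occur.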
I ended up solving my problem using numpy. Here is how I did it.
import numpy as np

sig_matrix = list()
columns = list(df)
for col_name in columns:
    sig_index = np.argmax(df[col_name]) + 1
    sig_matrix.append(sig_index)
As the values in my columns are 0 and 1, argmax will return the first occurrence of value 1.
I have a Pandas dataframe containing case-control data and can be represented by the following structure:
caseA caseN catA
0 y 1 a
1 y 1 a
2 y 1 b
3 y 1 b
4 y 1 c
5 y 1 d
6 y 1 a
7 y 1 c
8 n 0 c
9 n 0 d
10 n 0 a
11 n 0 b
12 n 0 c
13 n 0 a
14 n 0 d
15 n 0 a
16 n 0 b
17 n 0 c
18 n 0 a
19 n 0 d
The caseA and caseN variables represent cases and controls as strings and integers, respectively.
I can calculate a 2x2 table to facilitate the calculation of odds and odds ratios using the pandas crosstab method. The default order of the columns is control-case but I change this to case-control which, to my way of thinking, is a bit more intuitive.
I then slice the dataframe to print just a select number of rows with columns in the order case - control. This works exactly as expected.
However, if I add a new column to the dataframe (e.g. a column containing the odds values) and then slice the dataframe in exactly the same way, the cases and controls are printed in the wrong order.
The following code snippet illustrates this point:
df = pd.DataFrame({'caseN':[1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0],
'caseA':['y','y','y','y','y','y','y','y','n','n','n','n','n','n','n','n','n','n','n','n'],
'catA':['a','a','b','b','c','d','a','c','c','d','a','b','c','a','d','a','b','c','a','d']})
print('\nCross tabulation\n')
continTab = pd.crosstab(df['catA'],df['caseN'])
print(continTab)
print('\nReorderd cross tabulation\n')
continTab = continTab[[1,0]]
print(continTab)
#print('\n<-- An extra column containg odds has been entered here -->')
#continTab['odds'] = continTab[1]/continTab[0]
print('\nPrint just a slice contains rows a and c only with 1 - 0 column order\n')
print(continTab.loc[['a','c'],[1,0]])
On the first run through the sliced table produced is just as expected:
caseN 1 0
catA
a 3 4
c 2 3
But if you uncomment the code that calculates the odds column and then re-run the exact same code, the sliced table produced is:
caseN 0 1
catA
a 4 3
c 3 2
I can think of no reason why this should happen. Is this a bug?
(Interestingly, repeating the process using the case-control data described as strings (in variable caseA) produces the correct expected results.)
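For what it's worth, a sketch of that string-based version (assuming the same df as in the snippet above) would be:
# Cross tabulation on the string-coded cases (caseA), reordered to case-control
continTabA = pd.crosstab(df['catA'], df['caseA'])[['y', 'n']]
continTabA['odds'] = continTabA['y'] / continTabA['n']
print(continTabA.loc[['a', 'c'], ['y', 'n']])   # columns stay in y-n order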