Is there a Pandas equivalent to tidyr's uncount?

Let's assume we have a table with groupings of variable and their frequencies:
In R:
> df
# A tibble: 3 x 3
Cough Fever cases
<lgl> <lgl> <dbl>
1 TRUE FALSE 1
2 FALSE FALSE 2
3 TRUE TRUE 3
Then we could use tidyr::uncount to get a dataframe with the individual cases:
> uncount(df, cases)
# A tibble: 6 x 2
Cough Fever
<lgl> <lgl>
1 TRUE FALSE
2 FALSE FALSE
3 FALSE FALSE
4 TRUE TRUE
5 TRUE TRUE
6 TRUE TRUE
Is there an equivalent in Python/Pandas?

You can take the row index and repeat it according to the counts. For example, in R you could do:
df[rep(1:nrow(df),df$cases),]
First, to get data like yours in pandas:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 2, 2], 'y': [0, 1, 0, 1, 1, 1]})
counts = df.groupby(['x', 'y']).size().reset_index()
counts.columns = ['x', 'y', 'n']
x y n
0 1 0 1
1 1 1 1
2 2 0 1
3 2 1 3
Then:
counts.iloc[np.repeat(np.arange(len(counts)),counts.n),:2]
x y
0 1 0
1 1 1
2 2 0
3 2 1
3 2 1
3 2 1

I haven't found an equivalent function in Python, but this works:
df2 = df.pop('cases')
df = pd.DataFrame(df.values.repeat(df2, axis=0), columns=df.columns)
df.pop('cases') removes the cases column from df and returns it as df2; the new DataFrame then repeats each remaining row according to the count in df2. Please let me know if it helps.
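For reference, a minimal runnable sketch of this approach on the question's data (the names cases and out are mine; since both remaining columns are bool, .values keeps the bool dtype, whereas mixed dtypes would be upcast to object):
import pandas as pd

df = pd.DataFrame({'Cough': [True, False, True],
                   'Fever': [False, False, True],
                   'cases': [1, 2, 3]})

cases = df.pop('cases')  # remove the count column and keep it as a Series
out = pd.DataFrame(df.values.repeat(cases, axis=0), columns=df.columns)
print(out)  # six rows: one per individual case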

In addition to the other solutions, you could combine take, repeat and drop:
import pandas as pd
df = pd.DataFrame({'Cough': [True, False, True],
'Fever': [False, False, True],
'cases': [1, 2, 3]})
df.take(df.index.repeat(df.cases)).drop(columns="cases")
Cough Fever
0 True False
1 False False
1 False False
2 True True
2 True True
2 True True
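If you also want a fresh 0-to-n index like uncount returns, a small variation (a sketch, not part of the original answer) chains reset_index:
import pandas as pd

df = pd.DataFrame({'Cough': [True, False, True],
                   'Fever': [False, False, True],
                   'cases': [1, 2, 3]})

# Same repeat/drop as above, plus reset_index for a clean 0..n-1 index
out = (df.take(df.index.repeat(df.cases))
         .drop(columns='cases')
         .reset_index(drop=True))
print(out)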

You can use tidyr's API directly in Python with the datar package:
>>> from datar.all import f, tribble, uncount
>>> df = tribble(
... f.Cough, f.Fever, f.cases,
... True, False, 1,
... False, False, 2,
... True, True, 3
... )
>>> uncount(df, f.cases)
Cough Fever
<bool> <bool>
0 True False
1 False False
2 False False
3 True True
4 True True
5 True True
I am the author of the package. Feel free to submit issues if you have any questions.

Related

How to change the suffix on new df columns when merging iteratively

I have a temp df and a dflst.
temp has as columns the unique column names from the dataframes in dflst.
dflst has a dynamic length; my problem arises when len(dflst) >= 4.
All DFs (temp and all the ones in dflst) have columns with True/False values and a p column with numbers.
Code to recreate the data:
import itertools
import numpy as np
import pandas as pd

# making temp df
var_cols = ['a', 'b', 'c', 'd']
temp = pd.DataFrame(list(itertools.product([False, True], repeat=len(var_cols))), columns=var_cols)
# making dflst
df0 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'b']))), columns=['a', 'b'])
df0['p'] = np.random.randint(1, 5, df0.shape[0])
df1 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['c', 'd']))), columns=['c', 'd'])
df1['p'] = np.random.randint(1, 5, df1.shape[0])
df2 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'c']))), columns=['a', 'c'])
df2['p'] = np.random.randint(1, 5, df2.shape[0])
df3 = pd.DataFrame(list(itertools.product([False, True], repeat=len(['d']))), columns=['d'])
df3['p'] = np.random.randint(1, 5, df3.shape[0])
dflst = [df0, df1, df2, df3]
I want to merge the dfs in dflst so that the 'p' column values from each df in dflst end up in temp, in the rows whose values are compatible between the two.
I am currently doing it with pd.merge as follows:
for df in dflst:
    temp = temp.merge(df, on=list(df)[:-1], how='right')
but this results in a df that has the same name for different columns when dflst has 4 or more dfs. I understand that is due to merge's suffix behavior, but it creates problems with column indexing.
How can I get unique names on the new columns added to temp iteratively?
I don't fully understand what you want but IIUC:
for i, df in enumerate(dflst):
    temp = temp.merge(df.rename(columns={'p': f'p{i}'}),
                      on=df.columns[:-1].tolist(), how='right')
print(temp)
# Output:
a b c d p0 p1 p2 p3
0 False False False False 4 2 2 1
1 False True False False 3 2 2 1
2 False False True False 4 3 4 1
3 False True True False 3 3 4 1
4 True False False False 3 2 2 1
5 True True False False 3 2 2 1
6 True False True False 3 3 1 1
7 True True True False 3 3 1 1
8 False False False True 4 4 2 3
9 False True False True 3 4 2 3
10 False False True True 4 1 4 3
11 False True True True 3 1 4 3
12 True False False True 3 4 2 3
13 True True False True 3 4 2 3
14 True False True True 3 1 1 3
15 True True True True 3 1 1 3
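The same renaming trick can also be folded into a functools.reduce call (a sketch that assumes temp and dflst are built as in the question, with each frame's key columns first and its value column named 'p'):
from functools import reduce

# Give every value column a unique name up front, then chain the merges
renamed = [d.rename(columns={'p': f'p{i}'}) for i, d in enumerate(dflst)]
temp = reduce(
    lambda acc, d: acc.merge(d, on=d.columns[:-1].tolist(), how='right'),
    renamed,
    temp)
print(temp.columns.tolist())  # ['a', 'b', 'c', 'd', 'p0', 'p1', 'p2', 'p3']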

(Python) Selecting rows containing a string in ANY column?

I am trying to iterate through a dataframe and return the rows that contain a string "x" in any column.
This is what I have been trying
for col in df:
    rows = df[df[col].str.contains(searchTerm, case=False, na=False)]
However, it only returns a couple of rows, even when I search for a string that I know appears in more rows than that.
How do I make sure it is searching every row of every column?
Edit: My end goal is to get the row and column of the cell containing the string searchTerm
Welcome!
Agree with all the comments. It's generally best practice to find a way to accomplish what you want in Pandas/NumPy without iterating over rows/columns.
If the objective is to "find rows where any column contains the value 'x'", life is a lot easier than you think.
Below is some data:
import pandas as pd
df = pd.DataFrame({
'a': range(10),
'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
'c': [False, False, True, True, True, False, False, True, True, True],
'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10]
})
print(df)
a b c d
0 0 x False 1
1 1 b False x
2 2 c True 3
3 3 d True 4
4 4 x True 5
5 5 f False 6
6 6 g False 7
7 7 h True 8
8 8 i True x
9 9 x True 10
So clearly rows 0, 1, 4, 8 and 9 should be included.
If we just do df == 'x', pandas broadcasts the comparison across the whole dataframe:
df == 'x'
a b c d
0 False True False False
1 False False False True
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False True False False
But pandas also has the handy .any method to check for True along an axis. Since we want to check across all the columns within each row, we want axis=1:
rows = (df == 'x').any(axis=1)
print(rows)
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 True
Note that if you want your solution to be case-insensitive, matching the case=False you're using with the .str method, you might need something more like:
rows = (df.applymap(lambda x: str(x).lower() == 'x')).any(axis=1)
The correct rows are flagged without any looping. And you get a series back that can be used for indexing the original dataframe:
df.loc[rows]
a b c d
0 0 x False 1
1 1 b False x
4 4 x True 5
8 8 i True x
9 9 x True 10
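The edit asks for the row and column of each matching cell. Building on the same whole-frame comparison idea, one possible sketch (searchTerm, mask, hits and locations are illustrative names; the cells are cast to str so the substring search also works on non-string columns):
searchTerm = 'x'
# Case-insensitive substring match over every cell
mask = df.apply(lambda col: col.astype(str).str.contains(searchTerm, case=False, na=False))
hits = mask.stack()                    # Series indexed by (row label, column label)
locations = hits[hits].index.tolist()  # e.g. [(0, 'b'), (1, 'd'), (4, 'b'), (8, 'd'), (9, 'b')]
print(locations)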

Creating a matrix of 0 and 1 that maintains its shape

I am attempting to create a matrix of 1s where every 2nd column value is greater than the previous column value and 0s where it is less. When I use np.where it just flattens the result; I want to keep the first column, the last column, and the shape.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
newdf = pd.DataFrame()
for x in df.columns[1::2]:
    if bool(df.iloc[:, df.columns.get_loc(x)] <=
            df.iloc[:, df.columns.get_loc(x) - 1]):
        newdf.append(1)
    else:
        newdf.append(0)
This question was a little vague, but I will answer a question that I think gets at the heart of what you are asking:
Say you start with a matrix:
df1 = pd.DataFrame(np.random.randn(8, 4),columns=['A', 'B', 'C', 'D'])
Which creates:
A B C D
0 2.464130 0.796172 -1.406528 0.332499
1 -0.370764 -0.185119 -0.514149 0.158218
2 -2.164707 0.888354 0.214550 1.334445
3 2.019189 0.910855 0.582508 -0.861778
4 1.574337 -1.063037 0.771726 -0.196721
5 1.091648 0.407703 0.406509 -1.052855
6 -1.587963 -1.730850 0.168353 -0.899848
7 0.225723 0.042629 2.152307 -1.086585
Now you can use DataFrame.shift() to shift the entire frame, and then compare it against the original element by element in one step. For example:
df1.shift(-1)
Creates:
A B C D
0 -0.370764 -0.185119 -0.514149 0.158218
1 -2.164707 0.888354 0.214550 1.334445
2 2.019189 0.910855 0.582508 -0.861778
3 1.574337 -1.063037 0.771726 -0.196721
4 1.091648 0.407703 0.406509 -1.052855
5 -1.587963 -1.730850 0.168353 -0.899848
6 0.225723 0.042629 2.152307 -1.086585
7 NaN NaN NaN NaN
And now you can compare the shifted frame with the original to get a new Boolean matrix, like so:
df2 = df1.shift(-1) > df1
which returns:
A B C D
0 False False True False
1 False True True True
2 True True True False
3 False False True True
4 False True False False
5 False False False True
6 True True True False
7 False False False False
To complete your question, we convert the True/False values to 1/0:
df2 = df2.applymap(lambda x: 1 if x == True else 0)
Which returns:
A B C D
0 0 0 1 0
1 0 1 1 1
2 1 1 1 0
3 0 0 1 1
4 0 1 0 0
5 0 0 0 1
6 1 1 1 0
7 0 0 0 0
In one line:
df2 = (df1.shift(-1)>df1).replace({True:1,False:0})
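If the question is instead read literally (compare each even-numbered column with the column immediately before it), a different sketch gives one 0/1 column per pair; this is an alternative reading, not the approach above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

# Compare every 2nd column (B, D) with the column before it (A, C);
# the result keeps all 8 rows, with one 0/1 column per pair
pairs = (df.iloc[:, 1::2].to_numpy() > df.iloc[:, 0::2].to_numpy()).astype(int)
result = pd.DataFrame(pairs, index=df.index, columns=df.columns[1::2])
print(result)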

Label contiguous groups of True elements within a pandas Series

I have a pandas series of Boolean values, and I would like to label contiguous groups of True values. How is it possible to do this? Is it possible to do this in a vectorised manner? Any help would be hugely appreciated!
Data:
A
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 False
8 False
9 True
10 True
Desired:
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
Here's an unlikely but simple and working solution:
import numpy as np
import scipy.ndimage as mnts  # label() lives in scipy.ndimage; the measurements submodule is deprecated

labeled, clusters = mnts.label(df.A.values)
# labeled is what you want, clusters is the number of groups found
df['Label'] = labeled  # put it into df (plain attribute assignment would not create the column)
Tested as:
a = np.array([False, False, True, True, True, False, True, False, False,
              True, False, True, True, True, True, True, True, True,
              False, True])
labeled, clusters = mnts.label(a)
>>> labeled
array([0, 0, 1, 1, 1, 0, 2, 0, 0, 3, 0, 4, 4, 4, 4, 4, 4, 4, 0, 5], dtype=int32)
>>> clusters
5
With cumsum:
import numpy as np
import pandas as pd

a = df.A.values
z = np.zeros(a.shape, int)
# number the runs of consecutive True values, starting at 1, and write them
# back only at the True positions (False positions stay 0)
z[a] = pd.factorize((~a).cumsum()[a])[0] + 1
df.assign(Label=z)
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
You can use cumsum and groupby + ngroup to mark groups.
v = (~df.A).cumsum().where(df.A).bfill()
df['Label'] = (
    v.groupby(v).ngroup().add(1).where(df.A).fillna(0, downcast='infer'))
df
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
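Another common pandas-only idiom (a sketch, not taken from the answers above) marks the start of each True run and cumulatively sums those starts:
import pandas as pd

df = pd.DataFrame({'A': [False, True, True, True, False, False,
                         True, False, False, True, True]})

# A run starts where A is True but the previous value was not
starts = df.A & ~df.A.shift(fill_value=False)
df['Label'] = starts.cumsum().where(df.A, 0)
print(df)  # reproduces the desired 0/1/2/3 labels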

All rows within a given column must match, for all columns

I have a Pandas DataFrame of data in which all rows within a given column must match:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
'B': [2,2,2,2,2,2,2,2,2,2],
'C': [3,3,3,3,3,3,3,3,3,3],
'D': [4,4,4,4,4,4,4,4,4,4],
'E': [5,5,5,5,5,5,5,5,5,5]})
In [10]: df
Out[10]:
A B C D E
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
I would like a quick way to know if there is any variance anywhere in the DataFrame. At this point, I don't need to know which values have varied, since I will go in and handle those later. I just need a quick way to know whether the DataFrame needs further attention or if I can ignore it and move on to the next one.
I can check any given column using
(df.loc[:,'A'] != df.loc[0,'A']).any()
but my Pandas knowledge limits me to iterating through the columns (I understand iteration is frowned upon in Pandas) to compare all of them:
A B C D E
0 1 2 3 4 5
1 1 2 9 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
for col in df.columns:
    if (df.loc[:, col] != df.loc[0, col]).any():
        print("Found a fail in col %s" % col)
        break
Out: Found a fail in col C
Is there an elegant way to return a boolean if any row within any column of a dataframe does not match all the values in the column... possibly without iteration?
Given your example dataframe:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
'B': [2,2,2,2,2,2,2,2,2,2],
'C': [3,3,3,3,3,3,3,3,3,3],
'D': [4,4,4,4,4,4,4,4,4,4],
'E': [5,5,5,5,5,5,5,5,5,5]})
You can use the following:
df.apply(pd.Series.nunique) > 1
Which gives you:
A False
B False
C False
D False
E False
dtype: bool
If we then force a couple of errors:
df.loc[3, 'C'] = 0
df.loc[5, 'B'] = 20
You then get:
A False
B True
C True
D False
E False
dtype: bool
You can compare the entire DataFrame to the first row like this:
In [11]: df.eq(df.iloc[0], axis='columns')
Out[11]:
A B C D E
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
5 True True True True True
6 True True True True True
7 True True True True True
8 True True True True True
9 True True True True True
then test if all values are true:
In [13]: df.eq(df.iloc[0], axis='columns').all()
Out[13]:
A True
B True
C True
D True
E True
dtype: bool
In [14]: df.eq(df.iloc[0], axis='columns').all().all()
Out[14]: True
You can use apply to loop through the columns and check whether any element in a column differs from its first value:
df.apply(lambda col: (col != col[0]).any())
# A False
# B False
# C False
# D False
# E False
# dtype: bool
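For the single yes/no answer the question asks for (does this DataFrame need further attention?), any of these column-wise results collapses to one Boolean; a minimal sketch (note that nunique ignores NaN by default, so pass dropna=False if NaN versus a value should count as a mismatch):
# True if any column contains more than one distinct value
needs_attention = (df.nunique() > 1).any()

# Equivalent, building on the row-comparison answer
needs_attention = not df.eq(df.iloc[0], axis='columns').all().all()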
