I have a dataframe of unique strings and I want to find the row and column for a given string. I want these values because I'll eventually be exporting this dataframe to an Excel spreadsheet. The easiest way I've found so far to get these values is the following:
jnames = list(df.iloc[0].to_frame().index)
for i in jnames:
    for k in df[i]:
        if 'searchstring' in str(k):
            print('Column: {}'.format(jnames.index(i) + 1))
            print('Row: {}'.format(list(df[i]).index('searchstring')))
            break
Can anyone advise a solution that takes better advantage of the inherent capabilities of pandas?
Without reproducible code / data, I'm going to make up a dataframe and show one simple way:
Setup
import pandas as pd, numpy as np
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'b']])
The dataframe looks like this:
0 1 2
0 a b c
1 d e f
2 g h b
Solution
result = list(zip(*np.where(df.values == 'b')))
Result
[(0, 1), (2, 2)]
Explanation
df.values accesses the numpy array underlying the dataframe.
np.where creates an array of coordinates satisfying the provided condition.
zip(*...) transforms [x-coords-array, y-coords-array] into (x, y) coordinate pairs.
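Since the end goal is an Excel export, a hedged sketch converting those 0-based coordinates to 1-based row/column numbers (the exact offsets depend on whether to_excel writes a header row and an index column, so adjust as needed) could look like this:
# 1-based numbering is how Excel counts rows and columns; add further offsets
# if to_excel writes a header row or an index column.
for row, col in zip(*np.where(df.values == 'b')):
    print('Row: {}, Column: {}'.format(row + 1, col + 1))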
Try using contains. This will return a dataframe of the rows that contain the slice you are looking for.
df[df['<my_col>'].str.contains('<my_string_slice>')]
Similarly, you can use match for a direct match.
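For example, a quick sketch on the sample dataframe above (column 1 stands in for '<my_col>' and 'b' for the search string):
rows_containing = df[df[1].str.contains('b')]   # substring match
rows_matching = df[df[1].str.match('b')]        # match anchors the pattern at the start of each string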
This is my approach, which avoids writing nested for loops:
value_to_search = "c"
matching_cols = [x for x in df.columns if value_to_search in df[x].unique()]
print(matching_cols[0])                                       # the column holding the value
print(df.index[df[matching_cols[0]] == value_to_search][0])   # the row index of the value
The first print returns the column name and the second returns the row index; combined, they give you the index/column combination. Since you mentioned that all values in the df are unique, each line returns exactly one value.
You might need a try-except if value_to_search might not be in the dataframe.
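For instance, a minimal sketch of such a guard (assuming a value that may be absent):
value_to_search = "not_there"   # hypothetical missing value
try:
    matching_cols = [x for x in df.columns if value_to_search in df[x].unique()]
    col = matching_cols[0]                                   # IndexError if nothing matched
    row = df.index[df[col] == value_to_search][0]
    print(row, col)
except IndexError:
    print('{!r} not found in the dataframe'.format(value_to_search))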
By using stack (with the data from jpp's answer):
df[df=='b'].stack()
Out[211]:
0 1 b
2 2 b
dtype: object
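Since the stacked result carries the (row, column) pairs in its MultiIndex, the coordinates can be pulled out directly, for example:
df[df == 'b'].stack().index.tolist()
# [(0, 1), (2, 2)]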
Related
How do I drop row number i of a DataFrame?
I did the following but it is not working.
DF = DF.drop(i)
So I wonder what I'm missing.
You must pass a label to drop. Here drop tries to use i as a label and fails (with a KeyError) as your index probably has other values. Worse, if the index happens to be composed of integers in random order, you might drop an incorrect row without noticing it.
Use:
df.drop(df.index[i])
Example:
df = pd.DataFrame({'col': range(4)}, index=list('ABCD'))
out = df.drop(df.index[2])
output:
col
A 0
B 1
D 3
pitfall
In case of duplicated indices, you might remove unwanted rows!
df = pd.DataFrame({'col': range(4)}, index=list('ABAD'))
out = df.drop(df.index[2])
output (both rows labeled A are dropped, including the first one, which we did not want to remove!):
col
B 1
D 3
workaround:
import numpy as np
out = df[np.arange(len(df)) != i]
drop several indices by position:
import numpy as np
out = df[~np.isin(np.arange(len(df)), [i, j])]
You need to add square brackets:
df = df.drop([i])
Try This:
df.drop(df.index[i])
I am unable to comment on the original question as I don't have a high enough reputation, but I refer to this question DataFrames - Average Columns, specifically this line of code:
dfgrp= df.iloc[:,2:].groupby((np.arange(len(df.iloc[:,2:].columns)) // 2) + 1, axis=1).mean().add_prefix('ColumnAVg')
As I read it: take all rows from column 2 onwards, group by something involving the number of columns (this is the part I don't follow) applied to columns rather than rows, take the mean of each column group, then add the results as new columns called ColumnAVg1/2/3 etc.
I also know this takes the mean of columns 1&2, 3&4, 5&6 etc., but I don't know how it does it.
And so my question is, what needs to change in the above code to get the mean of columns 1&2, 2&3, 3&4, 4&5 etc. with the results in the same format?
df = pd.DataFrame(np.random.randn(2, 4), columns=['a', 'b', 'c', 'd'])
groups = [(1,2),(2,3),(2,3,4),(1,3)]
df2 = pd.DataFrame([df.iloc[:, i - 1] for z in groups for i in z]).T
labels = [str(z) for z in groups for _ in z]
result = df2.groupby(by=labels, axis=1).mean()
Probably not what you were looking for but something like this should work.
So unfortunately you cannot alter that code to get your result, because it achieves what it does by assigning a number to each column and grouping columns that share a number. However, you can do something cheeky: provide two groupings, get the average for each grouping, and combine them into a single frame.
df = pd.DataFrame(np.random.randn(2, 4), columns=['a', 'b', 'c', 'd'])
d1 = df.groupby((np.arange(len(df.columns)) // 2), axis=1).mean()
d2 = df.groupby((np.arange(len(df.columns) + 1) // 2)[1:], axis=1).mean()
dfo = pd.DataFrame()
for i in range(len(df.columns) - 1):
    c = f'average_{df.columns[i]}_{df.columns[i+1]}'
    if i % 2 == 0:
        dfo[c] = d1[d1.columns[i // 2]]        # integer division; i / 2 would give a float and fail as an index
    else:
        dfo[c] = d2[d2.columns[(i + 1) // 2]]
What the original code did is assign columns 1,2,3,4 to the groups 1,1,2,2. So in our code, d1 is grouped according to 0,0,1,1 and d2 according to 0,1,1,2. The for loop combines the results.
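To make those grouping keys concrete, here is a small sketch (assuming the same 4-column df as above) of the arrays each groupby call receives:
import numpy as np

n_cols = 4  # columns a, b, c, d
print(np.arange(n_cols) // 2)             # [0 0 1 1]  -> d1 groups (a, b) and (c, d)
print((np.arange(n_cols + 1) // 2)[1:])   # [0 1 1 2]  -> d2 groups (a,), (b, c) and (d,)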
How can I create a new column in a Pandas DataFrame that compresses/collapses multiple values at once from another column? Also, is it possible to use a default value so that you don't have to explicitly write out all the value mappings?
I'm referring to a process that is often called "variable recoding" in statistical software such as SPSS and Stata.
Example
Suppose I have a DataFrame with 1,000 observations. The only column in the DataFrame is called col1 and it has 26 unique values (the letters A through Z). Here's a reproducible example of my starting point:
import pandas as pd
import numpy as np
import string
np.random.seed(666)
df = pd.DataFrame({'col1':np.random.choice(list(string.ascii_uppercase),size=1000)})
I want to create a new column called col2 according to the following mapping:
If col1 is equal to either A, B or C, col2 should receive AA
If col1 is equal to either D, E or F, col2 should receive MM
For all other values in col1, col2 should receive ZZ
I know I can partially do this using Pandas' replace function, but it has two problems. The first is that the replace function doesn't allow you to condense multiple input values into one single response value. This forces me to write out df['col1'].replace({'A':'AA','B':'AA','C':'AA'}) instead of something simpler like df['col1'].replace({['A','B','C']:'AA'}).
The second problem is that the replace function doesn't have an all_other_values keyword or anything like that. This forces me to manually write out the ENTIRE value mappings like this df['col1'].replace({'A':'AA','B':'AA',...,'G':'ZZ','H':'ZZ','I':'ZZ',...,'X':'ZZ','Y':'ZZ','Z':'ZZ'}) instead of something simpler like df['col1'].replace(dict_for_abcdef, all_other_values='ZZ')
Is there another way to use the replace function that I'm missing that would allow me to do what I'm asking? Or is there another Pandas function that enables you to do similar things to what I describe above?
Dirty implementation
Here is a "dirty" implementation of what I'm looking for using loc:
df['col2'] = 'ZZ' # Initiate the column with the default "all_others" value
df.loc[df['col1'].isin(['A','B','C']),'col2'] = 'AA' # Mapping from "A","B","C" to "AA"
df.loc[df['col1'].isin(['D','E','F']),'col2'] = 'MM' # Mapping from "D","E","F" to "MM"
I find this solution a bit messy and was hoping something a bit cleaner existed.
You can try np.select, which takes a list of conditions, a list of values, and also a default:
conds = [df['col1'].isin(['A', 'B', 'C']),
df['col1'].isin(['D', 'E', 'F'])]
values = ['AA', 'MM']
df['col2'] = np.select(conds, values, default='ZZ')
You can also use between instead of isin:
conds = [df['col1'].between('A', 'C'),
df['col1'].between('D', 'F')]
values = ['AA', 'MM']
df['col2'] = np.select(conds, values, default='ZZ')
Sample Input and Output:
import string
import numpy as np
import pandas as pd
letters = string.ascii_uppercase
df = pd.DataFrame({'col1': list(letters)[:10]})
df after applying np.select:
col1 col2
0 A AA
1 B AA
2 C AA
3 D MM
4 E MM
5 F MM
6 G ZZ
7 H ZZ
8 I ZZ
9 J ZZ
np.select(condition, choice, alternative). For the conditions, check whether the letters fall within a defined range:
c = [df['col1'].between('A', 'C'), df['col1'].between('D', 'F')]
CH = ['AA', 'MM']
df = df.assign(col2=np.select(c, CH, 'ZZ'))
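Since the question asks for a dict-style mapping with a default, a map-based alternative is also worth sketching (this uses Series.map with fillna rather than replace, on the same df):
# Build the mapping from the group lists, then fill everything else with the default.
mapping = {k: 'AA' for k in ['A', 'B', 'C']}
mapping.update({k: 'MM' for k in ['D', 'E', 'F']})
df['col2'] = df['col1'].map(mapping).fillna('ZZ')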
I have two dataframes in pandas. DF "A" contains the start and end indexes of zone names. DF "B" contains the start and end indexes of subzones. The goal is to extract all subzones of all zones.
Example:
A:
start index | end index | zone name
-----------------------------------
1 | 10 | X
B:
start index | end index | subzone name
-----------------------------------
2 | 3 | Y
In the above example, Y is a subzone of X since its indexes fall within X's indexes.
The way I'm currently doing this is using iterrows to go through every row in A, and for every row (zone) I find the slice in B (subzone).
This solution is extremely slow in pandas since iterrows is not fast. How can I do this task without using iterrows in pandas?
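For illustration, a minimal sketch of that row-by-row approach (the column names 'start index', 'end index', 'zone name' and 'subzone name' are assumed from the tables above):
# Current approach: for each zone in A, find every row of B whose range falls inside it.
subzones_by_zone = {}
for _, zone in A.iterrows():
    inside = (B['start index'] >= zone['start index']) & (B['end index'] <= zone['end index'])
    subzones_by_zone[zone['zone name']] = B.loc[inside, 'subzone name'].tolist()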
Grouping with dicts and Series is possible.
Grouping information may exist in a form other than an array. Let's consider another example DataFrame (since your DataFrames don't contain any data, I'm making up my own: the mapping dict plays the role of DF A and the people frame plays the role of DF B, with values that have real-world interpretations):
import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
Now, suppose I have a group correspondence for the columns and want to sum
together the columns by group:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}
# mapping is a dictionary standing in for DF A (the zones)
You could construct an array from this dict to pass to groupby, but instead we can just pass the dict. (I'm sure you can convert a dictionary to a DataFrame and a DataFrame to a dictionary, so I'm skipping that step; otherwise you are welcome to ask in the comments.)
by_column = people.groupby(mapping, axis=1)
I am using sum() here, but you can use whatever aggregation you want. (In case you want to combine subzones with their parent zones, you can do that by concatenation; that is out of scope here, otherwise I would have gone into the details.)
by_column.sum()
The same functionality holds for Series, which can be viewed as a fixed-size mapping:
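For example, a minimal sketch of the Series version (reusing the people and mapping objects above):
map_series = pd.Series(mapping)
people.groupby(map_series, axis=1).count()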
Note: using functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally.
I'm having trouble understanding how looping through a dataframe works.
I found somewhere that if you write:
for row in df.iterrows()
you won't be able to access row['column1']; instead you'll have to use
for index, row in df.iterrows() and then it works.
Now I want to create a collection of the signals I found in the loop by adding each row to a new dataframe with newdf.append(row). This works, but it loses the ability to be referenced by a string. How do I have to add those rows to my dataframe for that to work?
Detailed code:
dataframe1 = DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
dataframe2 = DataFrame()
for index, row in dataframe1.iterrows():
    if row['a'] == 5:
        dataframe2.append(row)
print(dataframe2['b'])
This doesn't work, because it won't accept strings inside the brackets for dataframe2.
Yes, this could be done more easily, but for the sake of argument let's say it couldn't (the real logic is more complex than a single if).
In my real code there are about ten different ifs and elses determining what to do with that specific row (and doing other stuff from within the loop). I'm not talking about filtering, but just adding the row to a new dataframe in a way that preserves the index so I can reference it with the name of the column.
In pandas, it is pretty straightforward to filter and pass the results, if needed, to a new dataframe, just as @smci suggests for R.
import numpy as np
import pandas as pd
dataframe1 = pd.DataFrame(np.random.randn(10, 5), columns=['a','b','c', 'd', 'e'])
dataframe1.head()
a b c d e
0 -2.824391 -0.143400 -0.936304 0.056744 -1.958325
1 -1.116849 0.010941 -1.146384 0.034521 -3.239772
2 -2.026315 0.600607 0.071682 -0.925031 0.575723
3 0.088351 0.912125 0.770396 1.148878 0.230025
4 -0.954288 -0.526195 0.811891 0.558740 -2.025363
Then, to filter, you can do like so:
dataframe2 = dataframe1.loc[dataframe1.a > .5]
dataframe2.head()
a b c d e
0 0.708511 0.282347 0.831361 0.331655 -2.328759
1 1.646602 -0.090472 -0.074580 -0.272876 -0.647686
8 2.728552 -0.481700 0.338771 0.848957 -0.118124
EDIT
OP didn't want to use a filter, so here is an example iterating through rows instead:
np.random.seed(123)
dataframe1 = pd.DataFrame(np.random.randn(10, 5), columns=['a','b','c', 'd', 'e'])
## I declare the second df with the same structure
dataframe2 = pd.DataFrame(columns=['a','b','c', 'd', 'e'])
For the loop I use iterrows, and instead of appending to an empty dataframe, I use the index from the iterator to place the row at the same index position in the empty frame. Notice that I used > .5 instead of == 5, or else the resulting dataframe would almost certainly be empty.
for index, row in dataframe1.iterrows():
    if row['a'] > .5:
        dataframe2.loc[index] = row
dataframe2
a b c d e
1 1.651437 -2.426679 -0.428913 1.265936 -0.866740
4 0.737369 1.490732 -0.935834 1.175829 -1.253881
UPDATE:
Don't. Solution is:
dataframe1[dataframe1.a > .5]
# or, if you only want the 'b' column
dataframe1[dataframe1.a > .5]['b']
You only want to filter for rows where a==5 (and then select the b column?)
You have still shown zero reason whatsoever why you need to append to dataframe1. In fact you don't need to append anything; you just directly generate your filtered version.
ORIGINAL VERSION:
Don't.
If all you want to do is compute aggregations or summaries and they don't really belong in the parent dataframe, do a filter. Assign the result to a separate dataframe.
If you really insist on using iterate+append instead of filter, even knowing all the caveats, then create an empty summary dataframe, then append to that as you iterate. Only after you're finished iterating, append it (and only if you really need to) back to the parent dataframe.
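If you do go the iterate-and-collect route, one common pattern (a sketch, not the only way) is to gather the matching rows in a plain list and build the new frame once, after the loop, which also keeps the column names available for string indexing:
import numpy as np
import pandas as pd

dataframe1 = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

rows = []
for index, row in dataframe1.iterrows():
    if row['a'] > .5:          # same condition as in the examples above
        rows.append(row)       # collect the Series; its index holds the column names

dataframe2 = pd.DataFrame(rows)   # built once, after the loop
print(dataframe2['b'])            # works as long as at least one row matched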