I have a pandas dataframe, df, which contains a feature ('alpha') whose values are letters from the set {'A','B',...,'G'}.
I'd like to select from df all rows whose 'alpha' value belongs to a subset of this set, say {'A','B','C'}.
What's the most 'pythonic' way to do this?
I was thinking something along the lines of:
subset = {'A','B','C'}
df1 = df[df['alpha'] == subset]
...but this generates an error:
"need more than 0 values to unpack"
I think you want to use isin to test for membership. For example:
In [79]:
subset = {'a','b','c'}
df = pd.DataFrame({'a':list('abasbvggcgasgfdasgcdce')})
df[df['a'].isin(subset)]
Out[79]:
a
0 a
1 b
2 a
4 b
8 c
10 a
15 a
18 c
20 c
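Applied to the frame in the question, the same pattern would be (a sketch, assuming 'alpha' holds one letter per row):
subset = {'A', 'B', 'C'}
df1 = df[df['alpha'].isin(subset)]  # keep only the rows whose 'alpha' is in the subset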
I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns, plus an additional column whose values are the names of the other columns, how do I select values from the data columns using the additional column as the key?
It's simple to do with an explicit loop, but that is extremely slow with something like .iterrows() directly on the DataFrame. Converting to numpy arrays first is faster, but still not fast. Can I combine pandas methods to do it even faster?
Example: this is the kind of DataFrame structure, where columns A and B contain the data and the keys column says which of them to select from:
import pandas

df = pandas.DataFrame(
    {'A': [1, 2, 3, 4],
     'B': [5, 6, 7, 8],
     'keys': ['A', 'B', 'B', 'A']},
)
print(df)
output:
A B keys
0 1 5 A
1 2 6 B
2 3 7 B
3 4 8 A
Now I need some fast code that returns a DataFrame like
Out[2]:
val_keys
0 1
1 6
2 7
3 4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A','B'])
out = tmp.loc[tmp['keys'] == tmp['variable']]
which produces:
Out[2]:
keys variable value
0 A A 1
3 A A 4
5 B B 6
6 B B 7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you:
import numpy as np

df['val_keys'] = np.where(df['keys'] == 'A', df['A'], df['B'])
or
df['val_keys'] = np.select([df['keys'] == 'A', df['keys'] == 'B'], [df['A'], df['B']])
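If there are more data columns than two, the same idea generalizes to a vectorized positional lookup. A sketch using the question's df, assuming the data columns share a compatible dtype:
import numpy as np
import pandas as pd

data = df[['A', 'B']]                        # just the data columns
rows = np.arange(len(df))                    # positional row indices 0..n-1
cols = data.columns.get_indexer(df['keys'])  # positional column index for each row's key
df['val_keys'] = data.to_numpy()[rows, cols] # one value per row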
No need to specify the column names explicitly for the code below; each row looks up its own value via its index label and its key:
def value(row):
    label = row.name   # this row's index label
    col = row['keys']  # name of the column to read
    return df.loc[label, col]

df['val_keys'] = df.apply(value, axis=1)
Bear in mind this is a row-by-row Python loop, so it will be slow on the large frames the question is about.
Have you tried filtering each key's rows and then combining them on the index? Note that a plain dict(zip(df_A['keys'], df_A['A'])) would collapse repeated key values into a single entry, so the combination has to be keyed on the row index instead:
df_A = df.loc[df['keys'] == 'A', 'A']
df_B = df.loc[df['keys'] == 'B', 'B']
df['val_keys'] = pd.concat([df_A, df_B])  # assignment aligns on the index, restoring row order
Your df['val_keys'] column will now contain the result as in your val_keys output.
If you want, you can retain just that column, as in your expected output:
df = df[['val_keys']]
Hope this helps :))
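For completeness, the melt attempt from the question can also be finished by carrying the original index through. A sketch along the asker's own lines:
tmp = df.reset_index().melt(id_vars=['index', 'keys'], value_vars=['A', 'B'])
out = tmp.loc[tmp['keys'] == tmp['variable'], ['index', 'value']]
out = out.set_index('index').sort_index().rename(columns={'value': 'val_keys'})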
I am trying to edit values after making duplicate rows in Pandas.
I want to edit only one column ("code"), but I see that since the rows are duplicates, editing by value will affect the original rows as well.
Is there any method to first create the duplicates and then modify the data of only the duplicates created?
import pandas as pd

df = pd.read_excel('so.xlsx')
a = df['code'].isin([7, 0])
b = df[a]
df = df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
coun code name
0 A 123 AR
1 F 123 AD
2 N 7 AR
3 I 0 AA
4 T 10 AS
2 N 7 AR
3 I 0 AA
Now I want to change values only on the duplicates created, in this case the bottom two rows. But I see that the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument to append.
df=df.append(b, ignore_index=True)
You may also find it easier to modify your data in b before appending it to the frame.
import pandas as pd

df = pd.read_excel('so.xlsx')
a = df['code'].isin([7, 0])
b = df[a].copy()  # work on a copy so the edits don't touch the original rows
b['code'] = 999   # modify only the duplicated rows (999 is just an example value)
df = df.append(b, ignore_index=True)
print('\n\nafter replicate')
print(df)
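Note that DataFrame.append was removed in pandas 2.0; in newer versions the same step can be written with pd.concat:
# equivalent to df.append(b, ignore_index=True) in older pandas
df = pd.concat([df, b], ignore_index=True)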
Sorry, I know there are so many questions relating to indexing, and it's probably staring me in the face, but I'm having a little trouble with this. I am familiar with .loc, .iloc, and .index methods and with slicing in general. The method .reset_index may not have been (and may not be able to be) called on our dataframe, so the index labels may not be in order. The numpy array(s) are actually different-length subsets of the dataframe, but for this example I'll keep them the same size (I can handle offsetting once I have an example).
I can pull columns and rows from the dataframe based on some search criteria.
idxlbls = df.index[df['timestamp'] == dt]
stuff = df.loc[idxlbls, 'col3':'col5']
But how do I map those to row numbers (positional array indices, not index labels) to use as numpy array indices (assuming the same row length)?
stuffprime = array[?, ?]
The reason I need this is that the dataframe is much larger and more complete and contains the column search criteria, while the numpy arrays are subsets that were extracted and modified earlier in the pipeline (and do not contain the search criteria). I need to search the dataframe and pull the equivalent data from the numpy arrays. Basically, I need to correlate specific rows of the dataframe with the corresponding rows of a numpy array.
I would map pandas indices to numpy indices:
keys_dict = dict(zip(idxlbls, range(len(idxlbls))))
Then you may use the dictionary keys_dict to address the array elements by a pandas index: array[keys_dict[some_df_index], :]
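A quick runnable sketch of that mapping (the frame here is a made-up example):
import pandas as pd

df = pd.DataFrame({'timestamp': ['a', 'b', 'a']}, index=[10, 20, 30])
idxlbls = df.index[df['timestamp'] == 'a']           # Index([10, 30])
keys_dict = dict(zip(idxlbls, range(len(idxlbls))))  # {10: 0, 30: 1}
# these positions index an array that holds only the matched rows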
I believe you need get_indexer to get positions from the filtered column names; for the index you can use the same approach, or numpy.where for positions from the boolean mask:
df = pd.DataFrame({'timestamp': list('abadef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4]}, index=list('ABCDEF'))
print (df)
timestamp B C D E
A a 4 7 1 5
B b 5 8 3 3
C a 4 9 5 6
D d 5 4 7 9
E e 5 2 1 2
F f 4 3 0 4
idxlbls = df.index[df['timestamp'] == 'a']
stuff = df.loc[idxlbls, 'C':'E']
print (stuff)
C D E
A 7 1 5
C 9 5 6
a = df.index.get_indexer(stuff.index)
Or get positions by boolean mask:
a = np.where(df['timestamp'] == 'a')[0]
print (a)
[0 2]
b = df.columns.get_indexer(stuff.columns)
print (b)
[2 3 4]
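Putting the row and column positions together to slice a numpy array (a sketch, assuming the array is row- and column-aligned with df):
import numpy as np

array = np.arange(30).reshape(6, 5)  # hypothetical array shaped like df: 6 rows, 5 columns
stuffprime = array[np.ix_(a, b)]     # rows [0, 2] crossed with columns [2, 3, 4]
print(stuffprime)
# [[ 2  3  4]
#  [12 13 14]]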
I have a large data set in a pandas dataframe I would like to subset in order to further manipulate.
For example, I have a df that looks like this:
Sample Group AMP ADP ATP
1 A 0.3840396 0.55635504 0.5844648
2 A 0.3971521 0.57851902 -0.24603208
3 A 0.4578926 0.68118957 0.19129746
4 B 0.400222 0.58370811 0.01782915
5 B 0.4110945 0.60208593 -0.6285537
6 B 0.3307011 -0.82615087 -0.25354715
7 C 0.3485679 -0.79597002 -0.17294609
8 C 0.3408411 -0.8090222 0.76138965
9 C 0.3856457 -0.73333568 0.27364299
Let's say I want to make a new dataframe df2 from df that contains only the samples in group B and only the corresponding values for ATP. I should be able to do this with indexing alone(?).
I would like to do something like this:
df2 = df[(df['Group']=='B') & (df['ATP'])]
I know df['ATP'] obviously is not the correct way to do this. Printing df2 yields this:
Sample Group AMP ADP ATP
4 B 0.400222 0.583708 0.017829
5 B 0.411094 0.602086 -0.628554
6 B 0.330701 -0.826151 -0.253547
Ideally, df2 would look like this:
Sample Group ATP
4 B 0.017829
5 B -0.628554
6 B -0.253547
Is there a way to do this without resorting to some convoluted looping or simply manually deleting the unwanted columns and their values?
Thanks!!!
Use .loc, with the boolean mask selecting the rows and a list of labels selecting the columns:
df2 = df.loc[df['Group'] == 'B', ['Sample', 'Group', 'ATP']]
I have a problem with adding columns in pandas.
I have a DataFrame whose dimensions are n×k, and during processing I will need to add columns of dimension m×1, where m is in [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns with different lengths?
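The error occurs because a plain list has no index for pandas to align on, so its length must match the frame exactly; wrapping the list in a Series sidesteps that. A minimal sketch (names and values are just illustrative):
import pandas as pd

df = pd.DataFrame({'x': range(5)})   # n = 5 rows
data = [10, 20, 30]                  # m = 3 values
df['Name column'] = pd.Series(data)  # aligns on the index; rows 3 and 4 get NaN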
If you use the accepted answer, you'll lose your column names, as shown in that answer's example and as described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To keep the column names, use pandas.concat, but don't pass ignore_index (its default value is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas

# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age': [10, 12, 13],
    'Gender': ['M', 'F', 'F']
})

# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})

new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)

print(new.head())
#     Age Gender      Name
# 0  10.0      M    Nate A
# 1  12.0      F  Jessie A
# 2  13.0      F  Daniel H
# 3   NaN    NaN    John D
Notice how John D does not have an Age or a Gender (and how the NaN forces the Age column up to float).
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame by padding them all to the same length.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of each list:
la, lb, lc = len(a), len(b), len(c)
# now find the max
max_len = max(la, lb, lc)
Resize all lists to the determined max length by padding with empty strings (lists already at max length are left alone):
if la != max_len:
    a.extend([''] * (max_len - la))
if lb != max_len:
    b.extend([''] * (max_len - lb))
if lc != max_len:
    c.extend([''] * (max_len - lc))
Now all the lists have the same length; create the dataframe:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
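A shorter route to a similar frame (a sketch) is to let pandas align the lists itself by wrapping each one in a Series; the missing cells then come out as NaN rather than empty strings:
import pandas as pd

a, b, c = [0, 1, 2, 3], list(range(10)), [0, 1]  # the original, unpadded lists
pd.concat([pd.Series(a, name='A'),
           pd.Series(b, name='B'),
           pd.Series(c, name='C')], axis=1)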
I had the same issue: two different dataframes without a common column, and I just needed to put them beside each other in a csv file.
Merge:
In this case, merge does not work, even after adding a temporary column to both dfs and then dropping it, because merge forces both dfs to the same length: it repeats the rows of the shorter dataframe to match the longer one's length.
Concat:
The Red Pea's idea didn't work for me; it just appended the shorter df below the longer one (row-wise), leaving an empty column of NaNs above the shorter df's values.
Solution: You need to do the following:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
frames = [df1, df2]
df_final = pd.concat(frames, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length; drop=True keeps the old indexes from being inserted as extra columns.
If somebody would like to replace a specific column of a different size instead of adding one:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str) -> pd.DataFrame:
    dict_from_list = dict(enumerate(values))    # turn the list into {0: v0, 1: v1, ...}
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict of dicts
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    # rebuild the DataFrame from the dict and return it
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T
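Hypothetical usage (the frame, list, and column name are just examples):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df = fill_column(df, [10, 20, 30, 40], 'B')  # 'B' is longer than df; row 3 gets NaN in 'A'
print(df)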