I've got a data frame with column names like 'AH_AP' and 'AH_AS'.
Essentially all i want to do is swap the part before the underscore and the part after the underscore so that the column headers are 'AP_AH' and 'AS_AH'.
I can do that if the elements are in a list, but i've no idea how to get that to apply to column names.
My solution if it were a list goes like this:
columns = ['AH_AP','AS_AS']
def rejig_col_names():
elements_of_header = columns.split('_')
new_title = elements_of_header[-1] + "_" + elements_of_header[0]
return new_title
i'm guessing i need to apply this to something like the below, but i've no idea how, or how to reference a single column within df.columns:
df.columns = df.columns.map()
Any help appreciated. Thanks :)
You can do it this way:
Input:
df = pd.DataFrame(data=[['1','2'], ['3','4']], columns=['AH_PH', 'AH_AS'])
print(df)
AH_PH AH_AS
0 1 2
1 3 4
Output:
df.columns = df.columns.str.split('_').str[::-1].str.join('_')
print(df)
PH_AH AS_AH
0 1 2
1 3 4
Explained:
Use string accessor and the split method on '_'
Then using the str accessor with index slicing reversing, [::-1], you
can reverse the order of the list
Lastly, using the string accessor and join, we can concatenate the
list back together again.
You were almost there: you can do
df.columns = df.columns.map(rejig_col_names)
except that the function gets called with a column name as argument, so change it like this:
def rejig_col_names(col_name):
elements_of_header = col_name.split('_')
new_title = elements_of_header[-1] + "_" + elements_of_header[0]
return new_title
An alternative to the other answer. Using your function and DataFrame.rename
import pandas as pd
def rejig_col_names(columns):
elements_of_header = columns.split('_')
new_title = elements_of_header[-1] + "_" + elements_of_header[0]
return new_title
data = {
'A_B': [1, 2, 3],
'C_D': [4, 5, 6],
}
df = pd.DataFrame(data)
df.rename(rejig_col_names, axis='columns', inplace=True)
print(df)
str.replace is also an option via swapping capture groups:
Sample input borrowed from ScottBoston
df = pd.DataFrame(data=[['1', '2'], ['3', '4']], columns=['AH_PH', 'AH_AS'])
Then Capture everything before and after the '_' and swap capture group 1 and 2.
df.columns = df.columns.str.replace(r'^(.*)_(.*)$', r'\2_\1', regex=True)
PH_AH AS_AH
0 1 2
1 3 4
Related
All Im trying to do is pass a variable to a pandas .query function. I keep getting empty rows returned when I use a python string variable (even when its formatted).
This works
a = '1736_4_A1'
df = metaData.query("array_id == #a")
print(df)
output:
array_id wafer_id slide position array_no sample_id
0 1736_4_A1 1736 4 A1 1 Rat 2nd
But this does not work! I dont understand why
array = str(waferid) + '_' + str(slideid) + '_' + str(position)
a = f'{array}'
a = "{}_{}_{}".format(waferid, slideid, position)
print(a)
df = metaData.query("array_id == #a")
print(df)
output:
1736_4_a1
Empty DataFrame
Columns: [array_id, wafer_id, slide, position, array_no, sample_id]
Index: []
I've spent too many hours on this. I feel like this should be simple! What am I doing wrong here?
I have df where some of the records in the column contains prefix and some of them not. I would like to update records without prefix. Unfortunately, my script adds desired prefix to each record in df:
new_list = []
prefix = 'x'
for ids in df['ids']:
if ids.find(prefix) < 1:
new_list.append(prefix + ids)
How can I ommit records with the prefix?
I've tried with df[df['ids'].str.contains(prefix)], but I'm getting an error.
Use Series.str.startswith for mask and add values with numpy.where:
df = pd.DataFrame({'ids':['aaa','ssx','xwe']})
prefix = 'x'
df['ids'] = np.where(df['ids'].str.startswith(prefix), '', prefix) + df['ids']
print (df)
ids
0 xaaa
1 xssx
2 xwe
I have this simplified dataframe:
ID, Date
1 8/24/1995
2 8/1/1899 :00
How can I use the power of pandas to recognize any date in the dataframe which has extra :00 and removes it.
Any idea how to solve this problem?
I have tried this syntax but did not help:
df[df["Date"].str.replace(to_replace="\s:00", value="")]
The Output Should Be Like:
ID, Date
1 8/24/1995
2 8/1/1899
You need to assign the trimmed column back to the original column instead of doing subsetting, and also the str.replace method doesn't seem to have the to_replace and value parameter. It has pat and repl parameter instead:
df["Date"] = df["Date"].str.replace("\s:00", "")
df
# ID Date
#0 1 8/24/1995
#1 2 8/1/1899
To apply this to an entire dataframe, I'd stack then unstack
df.stack().str.replace(r'\s:00', '').unstack()
functionalized
def dfreplace(df, *args, **kwargs):
s = pd.Series(df.values.flatten())
s = s.str.replace(*args, **kwargs)
return pd.DataFrame(s.values.reshape(df.shape), df.index, df.columns)
Examples
df = pd.DataFrame(['8/24/1995', '8/1/1899 :00'], pd.Index([1, 2], name='ID'), ['Date'])
dfreplace(df, '\s:00', '')
rng = range(5)
df2 = pd.concat([pd.concat([df for _ in rng]) for _ in rng], axis=1)
df2
dfreplace(df2, '\s:00', '')
I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
splitter = dest_col.str.split(str(source_col))
print type(splitter)
print splitter
res = 'angry_' + str(source_col).join(splitter)
return res
df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from the source_col, then split on that string in the dest_col, then effect that change on the string in dest_col. Here I have it as a new series called Final but I would rather inplace. I think the main issue are the splitter variable, which isn't working and the application of the function.
Here's how result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
apply isn't really designed to apply to multiple columns in the same row. What you can do is to change your function so that it takes in a series instead and then assigns source_col, dest_col to the appropriate value in the series. One way of doing it is as below:
def trial(x):
source_col = x["OrigWord"]
dest_col = x['WordinUrl' ]
splitter = str(dest_col).split(str(source_col))
res = splitter[0] + 'angry_' + source_col + splitter[1]
return res
df['Final'] = df.apply(trial,axis = 1 )
here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
row.OrigWord = "angry_" + row.OrigWord
return row
df.apply(trial, axis = 1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html
I'm trying to reproduce my Stata code in Python, and I was pointed in the direction of Pandas. I am, however, having a hard time wrapping my head around how to process the data.
Let's say I want to iterate over all values in the column head 'ID.' If that ID matches a specific number, then I want to change two corresponding values FirstName and LastName.
In Stata it looks like this:
replace FirstName = "Matt" if ID==103
replace LastName = "Jones" if ID==103
So this replaces all values in FirstName that correspond with values of ID == 103 to Matt.
In Pandas, I'm trying something like this
df = read_csv("test.csv")
for i in df['ID']:
if i ==103:
...
Not sure where to go from here. Any ideas?
One option is to use Python's slicing and indexing features to logically evaluate the places where your condition holds and overwrite the data there.
Assuming you can load your data directly into pandas with pandas.read_csv then the following code might be helpful for you.
import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"
As mentioned in the comments, you can also do the assignment to both columns in one shot:
df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'
Note that you'll need pandas version 0.11 or newer to make use of loc for overwrite assignment operations. Indeed, for older versions like 0.8 (despite what critics of chained assignment may say), chained assignment is the correct way to do it, hence why it's useful to know about even if it should be avoided in more modern versions of pandas.
Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:
import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"
You can use map, it can map vales from a dictonairy or even a custom function.
Suppose this is your df:
ID First_Name Last_Name
0 103 a b
1 104 c d
Create the dicts:
fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}
And map:
df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)
The result will be:
ID First_Name Last_Name
0 103 Matt Jones
1 104 Mr X
Or use a custom function:
names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:
Creating a new column using data from other columns
Given the dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
['cat', 'ragdoll', 1]],
columns=['animal', 'type', 'age'])
In[1]:
Out[1]:
animal type age
----------------------
0 dog hound 5
1 cat ragdoll 1
Below we are adding a new description column as a concatenation of other columns by using the + operation which is overridden for series. Fancy string formatting, f-strings etc won't work here since the + applies to scalars and not 'primitive' values:
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
+ df.type + ' ' + df.animal
In [2]: df
Out[2]:
animal type age description
-------------------------------------------------
0 dog hound 5 A 5 years old hound dog
1 cat ragdoll 1 A 1 years old ragdoll cat
We get 1 years for the cat (instead of 1 year) which we will be fixing below using conditionals.
Modifying an existing column with conditionals
Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:
# append 's' to 'age' if it's greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')
In [3]: df
Out[3]:
animal type age
-------------------------------------
0 dog, hound, 5 years hound 5
1 cat, ragdoll, 1 year ragdoll 1
Modifying multiple columns with conditionals
A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:
def transform_row(r):
r.animal = 'wild ' + r.type
r.type = r.animal + ' creature'
r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
return r
df.apply(transform_row, axis=1)
In[4]:
Out[4]:
animal type age
----------------------------------------
0 wild hound dog creature 5 years
1 wild ragdoll cat creature 1 year
In the code above the transform_row(r) function takes a Series object representing a given row (indicated by axis=1, the default value of axis=0 will provide a Series object for each column). This simplifies processing since you can access the actual 'primitive' values in the row using the column names and have visibility of other cells in the given row/column.
This question might still be visited often enough that it's worth offering an addendum to Mr Kassies' answer. The dict built-in class can be sub-classed so that a default is returned for 'missing' keys. This mechanism works well for pandas. But see below.
In this way it's possible to avoid key errors.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
... def __missing__(self, key):
... return ''
...
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
The same thing can be done more simply in the following way. The use of the 'default' argument for the get method of a dict object makes it unnecessary to subclass a dict.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
df['FirstName']=df['ID'].apply(lambda x: 'Matt' if x==103 else '')
df['LastName']=df['ID'].apply(lambda x: 'Jones' if x==103 else '')
In case someone is looking for a way to change the values of multiple rows based on some logical condition of each row itself, using .apply() with a function is the way to go.
df = pd.DataFrame({'col_a':[0,0], 'col_b':[1,2]})
col_a col_b
0 0 1
1 0 2
def func(row):
if row.col_a == 0 and row.col_b <= 1:
row.col_a = -1
row.col_b = -1
return row
df.apply(func, axis=1)
col_a col_b
0 -1 -1 # Modified row
1 0 2
Although .apply() is typically used to add a new row/column to a dataframe, it can be used to modify the values of existing rows/columns.
I found it much easier to debut by printing out where each row meets the condition:
for n in df.columns:
if(np.where(df[n] == 103)):
print(n)
print(df[df[n] == 103].index)