Pandas query throws error when column name starts with a number - python

I'm trying to perform a query on the following dataframe:
data = {'ab': [1,2,3], 'c1': [1,2,3], 'd': [1,2,3], 'e_f': [1,2,3]}
df = pd.DataFrame(data)
for cl in df.columns:
    print(len(df.query('%s == 2' % cl)))
This works fine. However, if a column name starts with a number then it throws a syntax error.
data = {'ab': [1,2,3], 'c1': [1,2,3], '1d': [1,2,3], 'e_f': [1,2,3]}
df = pd.DataFrame(data)
for cl in df.columns:
    print(len(df.query('%s == 2' % cl)))
File "<unknown>", line 1
1 d ==2
^
SyntaxError: invalid syntax
I think that the problem is related to the format of the string. I was wondering what would be the correct way to form this query.

query uses pandas.eval, which is documented to "evaluate a Python expression as a string". Your query is not a valid Python expression, because 1d is not valid syntax in Python, so you can't use query to refer to this column that way.
Things in pandas are generally easier if you make sure all your columns are valid Python identifiers.
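If renaming isn't possible, newer pandas versions offer an escape hatch: backtick quoting was added to query/eval in pandas 0.25, and handling of awkward names such as 1d improved in later releases. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'ab': [1, 2, 3], '1d': [1, 2, 3]})

# backticks let query refer to columns that are not valid Python identifiers
n = len(df.query('`1d` == 2'))
print(n)  # one row has 1d == 2
```

This only helps on a sufficiently recent pandas; on older versions renaming the column remains the simplest fix.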

You could always get a list of the column names as strings and use those to index the dataframe directly.
data = {'ab': [1,2,3], 'c1': [1,2,3], '1d': [1,2,3], 'e_f': [1,2,3]}
df = pd.DataFrame(data)
cols = list(df)
So for example cols[0] would be 'ab' and cols[2] would be '1d'.
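A plain boolean mask sidesteps query's expression parser entirely, since df[cl] accepts any column label. A minimal sketch of the question's loop rewritten this way:

```python
import pandas as pd

df = pd.DataFrame({'ab': [1, 2, 3], 'c1': [1, 2, 3], '1d': [1, 2, 3], 'e_f': [1, 2, 3]})

# df[cl] works with any column label, valid identifier or not
counts = [len(df[df[cl] == 2]) for cl in df.columns]
print(counts)  # each column contains exactly one 2
```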

Related

Remap values in a Pandas column based on dictionary key/value pairs using RegEx in replace() function

I have the following Pandas dataframe:
foo = {
    "first_name": ["John", "Sally", "Mark", "Jane", "Phil"],
    "last_name": ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
    "salary": [101000, 50000, 56943, 330532, 92750],
}
df = pd.DataFrame(foo)
I'd like to be able to validate column data using a RegEx pattern, then replace with NaN if the validation fails.
To do this, I use the following hard-coded RegEx patterns in the .replace() method:
df[['first_name']] = df[['first_name']].replace('[^A-Za-z \/\-\.\']', np.NaN, regex=True)
df[['last_name']] = df[['last_name']].replace('[^A-Za-z \/\-\.\']', np.NaN, regex=True)
df[['salary']] = df[['salary']].replace('[^0-9 ]', np.NaN, regex=True)
This approach works, but I have 15-20 columns, so it is going to be difficult to maintain.
I'd like to set up a dictionary that looks as follows:
regex_patterns = {
    'last_name': '[^A-Za-z \/\-\.\']',
    'first_name': '[^A-Za-z \/\-\.\']',
    'salary': '[^0-9 ]'
}
Then, I'd like to pass a value to the .replace() function based on the name of the column in the df. It would look as follows:
df[['first_name']] = df[['first_name']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
df[['last_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
df[['salary']] = df[['salary']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
How would I reference the name of the df column, then use that to look up the key in the dictionary and get its associated value?
For example, look up first_name, then access its dictionary value [^A-Za-z \/\-\.\'] and pass this value into .replace()?
Thanks!
P.S. if there is a more elegant approach, I'm all ears.
One approach would be using the columns attribute:
regex_patterns = {
    'last_name': '[^A-Za-z \/\-\.\']',
    'first_name': '[^A-Za-z \/\-\.\']',
    'salary': '[^0-9 ]'
}
for column in df.columns:
    df[column] = df[[column]].replace(regex_patterns[column], np.NaN, regex=True)
You can actually pass a nested dictionary of the form {'col': {'match': 'replacement'}} to replace.
In your case:
d = {k: {v: np.nan} for k, v in regex_patterns.items()}
df.replace(d, regex=True)
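A runnable sketch of the nested-dictionary form, using the question's data plus one clearly hypothetical extra row containing digits, so the replacement is visible (on a regex match with a non-string replacement value, pandas replaces the whole cell):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_name': ["John", "Sally", "Mark", "Jane", "Phil"],
    'last_name': ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
    'salary': [101000, 50000, 56943, 330532, 92750],
})
regex_patterns = {
    'last_name': '[^A-Za-z \/\-\.\']',
    'first_name': '[^A-Za-z \/\-\.\']',
    'salary': '[^0-9 ]',
}

# hypothetical invalid row: digits in the names should fail validation
df.loc[5] = ['J0hn', 'Sm1th', 12345]

# {column: {pattern: replacement}} applies each pattern only to its own column
d = {k: {v: np.nan} for k, v in regex_patterns.items()}
cleaned = df.replace(d, regex=True)
```

The valid rows are untouched; 'J0hn' and 'Sm1th' become NaN because they contain characters outside the allowed set.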

Extract values from dictionary and conditionally assign them to columns in pandas

I am trying to extract values from a column of dictionaries in pandas and assign them to their respective columns that already exist. I have hardcoded an example below of the data set that I have:
df_have = pd.DataFrame({
    'value_column': [np.nan, np.nan, np.nan],
    'date': [np.nan, np.nan, np.nan],
    'string_column': [np.nan, np.nan, np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]],
})
df_have
df_want = pd.DataFrame({
    'value_column': [40, 30, 10],
    'date': ['2017-08-01', np.nan, '2016-12-01'],
    'string_column': [np.nan, 'abc', np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]],
})
df_want
I have managed to extract the values out of the dictionaries using loops:
for row in range(len(df_have)):
    row_holder = df_have.dict[row]
    number_of_dictionaries_in_the_row = len(row_holder)
    for dictionary in range(number_of_dictionaries_in_the_row):
        variable_holder = df_have.dict[row][dictionary].keys()
        variable = list(variable_holder)[0]
        value = df_have.dict[row][dictionary].get(variable)
I now need to somehow conditionally turn df_have into df_want. I am happy to take a completely new approach and recreate the whole thing from scratch. We could even assume that I only have a dataframe with the dictionaries and nothing else.
You could use pandas string methods to pull the data out, although nesting data structures inside a Pandas column is inefficient:
df_have.loc[:, "value_column"] = df_have["dict"].str.get(0).str.get("value_column")
df_have.loc[:, "date"] = df_have["dict"].str.get(-1).str.get("date")
df_have.loc[:, "string_column"] = df_have["dict"].str.get(-1).str.get("string_column")
value_column date string_column dict
0 40 2017-08-01 None [{'value_column': 40}, {'date': '2017-08-01'}]
1 30 None abc [{'value_column': 30}, {'string_column': 'abc'}]
2 10 2016-12-01 None [{'value_column': 10}, {'date': '2016-12-01'}]
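An alternative that doesn't depend on dict position: merge each row's list of single-key dicts into one dict, then expand the merged dicts into columns. A sketch, assuming the rows always hold lists of single-key dicts as in the question:

```python
import numpy as np
import pandas as pd

df_have = pd.DataFrame({
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]],
})

# flatten each row's list of dicts into a single dict per row
merged = df_have['dict'].apply(lambda ds: {k: v for d in ds for k, v in d.items()})

# expand to columns; keys missing from a row become NaN
expanded = pd.DataFrame(merged.tolist())
result = df_have.join(expanded)
```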

Pandas Apply with multiple columns as input

For a dataframe with 4 columns of coordinates (longitude, latitude), I would like to create a 5th column holding the distance between the two places in each row; the code below illustrates this:
dict = [{'x1': '1', 'y1': '1', 'x2': '3', 'y2': '2'},
        {'x1': '1', 'y1': '1', 'x2': '3', 'y2': '2'}]
data = pd.DataFrame(dict)
As an outcome I would like to have this:
dict1 = [{'x1': '1', 'y1': '1', 'x2': '3', 'y2': '2', 'distance': '2.6'},
         {'x1': '1', 'y1': '1', 'x2': '3', 'y2': '2', 'distance': '2.9'}]
data2 = pd.DataFrame(dict1)
Where distance is computed using great_circle from geopy.distance:
This is what I tried:
data['distance']=data[['x1','y1','x2','y2']].apply(lambda x1,y1,x2,y2: great_circle(x1,y1,x2,y2).miles, axis=1)
But that gives me a type error:
TypeError: <lambda>() missing 3 required positional arguments: 'y1', 'x2', and 'y2'
Any help is appreciated.
That is because with axis=1 the lambda receives a single argument, the whole row, rather than one argument per column, so you should modify it as follows. Hope this helps!
data['distance']=data[['x1','y1','x2','y2']].apply(lambda df: great_circle(df['x1'],df['y1'],df['x2'],df['y2']).miles, axis=1)
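If geopy isn't available, the same row-wise apply pattern works with a plain haversine implementation. A self-contained sketch (note that geopy's great_circle actually expects two (latitude, longitude) points rather than four separate scalars, so the arguments are paired accordingly here; the coordinate values are the question's, cast to floats):

```python
import math
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    # great-circle distance on a sphere of Earth's mean radius (~3958.8 miles)
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(a))

data = pd.DataFrame([{'x1': 1.0, 'y1': 1.0, 'x2': 3.0, 'y2': 2.0}])

# the lambda receives the whole row; pair the columns as (lat, lon) points
data['distance'] = data.apply(
    lambda row: haversine_miles(row['y1'], row['x1'], row['y2'], row['x2']),
    axis=1)
```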

New Dataframe column based on regex match of at least one of the element in each row

How would you implement the following using pandas?
part 1:
I want to create a new conditional column in input_dataframe. Each row in input_dataframe will be matched against a regex. If at least one element in the row matches, then the element for this row in the new column will contain the matched value(s).
part 2: A more complete version would be:
The source of the regex is the value of each element originating from another series. (i.e. I want to know whether each row in input_dataframe contains a value(s) from the passed series.)
part 3: An even more complete version would be:
Instead of passing a series, I'd pass another Dataframe, regex_dataframe. For each column in it, I would implement the same process as part 2 above. (Thus, The result would be a new column in the input_dataframe for each column in the regex_dataframe.)
example input:
input_df = pd.DataFrame({
    'a': ['hose', 'dog', 'baby'],
    'b': ['banana', 'avocado', 'mango'],
    'c': ['horse', 'dog', 'cat'],
    'd': ['chease', 'cucumber', 'orange']
})
example regex_dataframe:
regex_dataframe = pd.DataFrame({
    'e': ['ho', 'ddddd', 'ccccccc'],
    'f': ['wwwwww', 'ado', 'kkkkkkkk'],
    'g': ['fffff', 'mmmmmmm', 'cat'],
    'i': ['heas', 'ber', 'aaaaaaaa']
})
example result:
result_dataframe = pd.DataFrame({
    'a': ['hose', 'dog', 'baby'],
    'b': ['banana', 'avocado', 'mango'],
    'c': ['horse', 'dog', 'cat'],
    'd': ['chease', 'cucumber', 'orange'],
    'e': ['ho', '', ''],
    'f': ['', 'ado', ''],
    'g': ['', '', 'cat'],
    'i': ['heas', 'ber', '']
})
First of all, rename regex_dataframe's columns so that individual cells correspond to each other in both dataframes.
input_df = pd.DataFrame({
    'a': ['hose', 'dog', 'baby'],
    'b': ['banana', 'avocado', 'mango'],
    'c': ['horse', 'dog', 'cat'],
    'd': ['chease', 'cucumber', 'orange']
})
regex_dataframe = pd.DataFrame({
    'a': ['ho', 'ddddd', 'ccccccc'],
    'b': ['wwwwww', 'ado', 'kkkkkkkk'],
    'c': ['fffff', 'mmmmmmm', 'cat'],
    'd': ['heas', 'ber', 'aaaaaaaa']
})
Apply the method DataFrame.combine(other, func, fill_value=None, overwrite=True) to get pairs of corresponding columns (which are Series).
Apply Series.combine(other, func, fill_value=nan) to get pairs of corresponding cells.
Apply the regex to each pair of cells.
import re

def process_cell(text, reg):
    res = re.search(reg, text)
    return res.group() if res else ''

def process_column(col_t, col_r):
    return col_t.combine(col_r, lambda text, reg: process_cell(text, reg))

input_df.combine(regex_dataframe, lambda col_t, col_r: process_column(col_t, col_r))
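The steps above can be condensed into a short runnable sketch (using only the first two columns of the example to keep it compact):

```python
import re
import pandas as pd

input_df = pd.DataFrame({'a': ['hose', 'dog', 'baby'],
                         'b': ['banana', 'avocado', 'mango']})
regex_df = pd.DataFrame({'a': ['ho', 'ddddd', 'ccccccc'],
                         'b': ['wwwwww', 'ado', 'kkkkkkkk']})

def first_match(text, reg):
    # return the matched substring, or '' when the pattern is absent
    m = re.search(reg, text)
    return m.group() if m else ''

# outer combine pairs up columns; inner combine pairs up cells
result = input_df.combine(regex_df, lambda s, p: s.combine(p, first_match))
```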

Bokeh: Column DataSource part giving error

I am trying to create an interactive bokeh plot that holds multiple data and I am not sure why I am getting the error
ValueError: expected an element of ColumnData(String, Seq(Any)),got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
source = ColumnDataSource(data={
    'x': data.loc[1970].fertility,
    'y': data.loc[1970].life,
    'pop': (data.loc[1970].population / 20000000) + 2,
    'region': data.loc[1970].region,
})
I have tried two different data sets imported from Excel and have been running into issues figuring out exactly why this is happening.
As the name suggests, the ColumnDataSource is a data structure for storing columns of data. This means that the value of every key in .data must be a column, i.e. a Python list, a NumPy array, or a Pandas series. But you are trying to assign plain numbers as the values, which is what the error message is telling you:
expected an element of ColumnData(String, Seq(Any))
This is saying that the acceptable, expected values are dicts mapping strings to sequences. But what you passed is clearly not that:
got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
The value for x for instance is just the number 6.794 and not an array or list, etc.
You can easily do this:
source = ColumnDataSource({str(c): v.values for c, v in df.items()})
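The rule the error encodes can be sketched as a small check (a hypothetical helper for illustration, not part of the Bokeh API): every value must be a sequence, and all the sequences must have the same length:

```python
def looks_like_column_data(data):
    # every value must be list/array/Series-like, i.e. have a length
    try:
        lengths = [len(v) for v in data.values()]
    except TypeError:
        return False  # a bare scalar such as 6.794 has no len()
    # and all columns must be the same length
    return len(set(lengths)) <= 1

print(looks_like_column_data({'x': [1, 2], 'y': [3, 4]}))  # True
print(looks_like_column_data({'x': 6.794, 'y': 46.834}))   # False
```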
This would be a solution; I think the problem is in how the data is selected from the dataframe.
source = ColumnDataSource(data={
    'x': data[data['Year'] == 1970]['fertility'],
    'y': data[data['Year'] == 1970]['life'],
    'pop': (data[data['Year'] == 1970]['population'] / 20000000) + 2,
    'region': data[data['Year'] == 1970]['region']
})
I had this same problem using this same dataset.
My solution was to import the csv in pandas using "Year" as index column.
data = pd.read_csv(csv_path, index_col='Year')
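This helps because of how .loc behaves (a sketch with hypothetical data, since the original CSV isn't shown): when 'Year' is the index and several rows share a year, data.loc[1970] returns a DataFrame of all matching rows, so data.loc[1970].fertility is a Series, which is a valid ColumnDataSource column. Without that index, .loc[1970] selects a single row, and attribute access on it yields bare scalars, exactly the shape the ValueError complains about.

```python
import pandas as pd

# hypothetical data: several countries share the same year
raw = pd.DataFrame({'Year': [1970, 1970, 1971],
                    'fertility': [6.79, 5.20, 6.45],
                    'life': [46.8, 50.1, 47.0]})

# with 'Year' as the index, .loc[1970] selects *all* matching rows
data = raw.set_index('Year')
fert_1970 = data.loc[1970].fertility  # a Series with two entries
print(type(fert_1970).__name__)
```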
