Pandas String to Integer by Character - python

In a Pandas data frame column, I want to convert each character of a string to its integer code (as ord() does), zero-pad each code to three digits, and prepend 100 to the result. I know how to do this with a regular string:
st = "JOHNSMITH4817001141979"
a=[ord(x) for x in st]
b=[]
for x in a:
    b.append('{:03}'.format(x)) #Add leading zero, ensuring 3 digits
b=['100']+b
b=''.join([ "%s"%x for x in b])
b=int(b)
b
Result: 100074079072078083077073084072052056049055048048049049052049057055057
But what if I wanted to perform this operation on every cell of a column in a Pandas data frame like this one?
import pandas as pd
df = pd.DataFrame({'string':['JOHNSMITH4817001141979','JOHNSMYTHE4817001141979']})
df
                    string
0   JOHNSMITH4817001141979
1  JOHNSMYTHE4817001141979
I just need a separate column with the result as an integer for each cell in 'string'.
Thanks in advance!

First, wrap your processing chain in a function such as:
def get_it(st):
    a = [ord(x) for x in st]
    b = []
    for x in a:
        b.append('{:03}'.format(x))  # Add leading zeros, ensuring 3 digits
    b = ['100'] + b
    b = ''.join(b)
    return int(b)
Then call it for each element in the column and assign the resulting list as the new column:
df['result'] = [get_it(i) for i in df['string']]
Although this works, I think you can find a better solution by optimizing the body of get_it.
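For what it's worth, here is one way get_it could be tightened. Treat it as a sketch rather than a definitive version. The name encode_chars is made up for illustration, and it assumes every character has a code below 1000 (true for ASCII), since larger codes would break the fixed three-digit width:

```python
def encode_chars(text):
    """Prefix '100', then append each character's code zero-padded to 3 digits."""
    # Assumption: ord(ch) < 1000 for every character, so each code fits in 3 digits.
    return int('100' + ''.join(f'{ord(ch):03d}' for ch in text))

encoded = encode_chars('JOHNSMITH4817001141979')
print(encoded)
```

It can be used in place of get_it, for example df['result'] = df['string'].map(encode_chars).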

Also, you can do the following:
def get_it(st):
    a = [ord(x) for x in st]
    b = []
    for x in a:
        b.append('{:03}'.format(x))  # Add leading zeros, ensuring 3 digits
    b = ['100'] + b
    b = ''.join(b)
    return int(b)
df['result'] = df['string'].apply(get_it)

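If you want a one-liner (Python 3.6+):
import pandas as pd
df = pd.DataFrame({'string':['JOHNSMITH4817001141979','JOHNSMYTHE4817001141979']})
df['result'] = df['string'].apply(lambda x: int(''.join(['100'] + [f'{ord(i):03}' for i in x])))
For Python < 3.6, replace the f-string with '{:03}'.format(ord(i)). What I have done is turn your function into a lambda expression and apply it to the column. The conversion uses Python's int() inside the lambda rather than .astype(int), because the result has far more digits than a 64-bit integer can hold; the column therefore keeps object dtype.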
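A small check that may explain the dtype you end up with: the encoded value is much larger than the biggest 64-bit integer, so it has to be stored in the column as a Python int (object dtype). This is only a sketch, and INT64_MAX is a name introduced here for illustration:

```python
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold

text = 'JOHNSMITH4817001141979'
encoded = int(''.join(['100'] + [f'{ord(ch):03}' for ch in text]))

# Compare the encoded value against the int64 limit.
fits_in_int64 = encoded <= INT64_MAX
print(len(str(encoded)), fits_in_int64)
```

If a fixed-width numeric dtype is required, keeping the digits as a string column is probably the safer choice.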

Related

Subset dataframe based on integer in column name

I have a dataframe that has names such as these for its columns:
column_names=['c_12_2_heart',
              'c_29_4_lung',
              'c_21_21_stomach',
              'c_2_25_bladder',
              'c_40_1_kidney']
In Python, how can I return a list of only the dataframe columns where the number after the first underscore is greater than 20?
We can use a list comprehension with basic string splitting logic:
column_names = ["c_12_2_heart", "c_29_4_lung", "c_21_21_stomach", "c_2_25_bladder", "c_40_1_kidney"]
output = [x for x in column_names if int(x.split("_")[1]) > 20]
print(output) # ['c_29_4_lung', 'c_21_21_stomach', 'c_40_1_kidney']
As an alternative to what @Tim Biegeleisen wrote, you can use numpy.where after splitting the column names.
import numpy as np
numbers = np.array([int(c.split('_')[1]) for c in column_names])
inds = np.where(numbers > 20)[0]
column_names_filt = [column_names[i] for i in inds]
print(column_names_filt)
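If some columns might not follow the c_<number>_... naming scheme, the plain split would raise on them. A hedged variant using a regular expression is sketched below; columns_over and the extra "patient_id" column are hypothetical and only there to show the skipping behaviour:

```python
import re

column_names = ["c_12_2_heart", "c_29_4_lung", "c_21_21_stomach",
                "c_2_25_bladder", "c_40_1_kidney", "patient_id"]

def columns_over(names, threshold):
    """Return names whose first number after the leading 'c_' exceeds threshold."""
    pattern = re.compile(r"c_(\d+)_")
    selected = []
    for name in names:
        match = pattern.match(name)
        # Names that do not follow the c_<number>_... scheme are skipped.
        if match and int(match.group(1)) > threshold:
            selected.append(name)
    return selected

selected = columns_over(column_names, 20)
print(selected)
```

The resulting list can then be used to subset the dataframe, for example df[selected].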

pandas: create column conditioned on row containing a string in list

I have a dataframe with 20 columns, and 3 of those columns (always the same) may contain one or more of these strings ["fraction", "fractional", "1/x", "one fifth"].
I want to add a new column that says whether or not each row is "fractional" (in other words, contains one of those words). This column could have Y or N to indicate this.
I've tried to do it with iterrows, like so:
list_of_fractional_widgets = []
for index, row in df.iterrows():
    fractional_keywords = ["fraction", "fractional", "1/x", "one fifth", "Fraction"]
    # use str to remove offending nan values
    xx = str(row["HeaderX"])
    yy = str(row["HeaderY"])
    zz = str(row["HeaderZ"])
    widget_data = [xx, yy, zz]
    for word in fractional_keywords:
        found = [True for x in widget_data if word in x]
        if len(found)>0:
            list_of_fractional_widgets.append('Y')
            break
    if len(found) ==0:
        list_of_fractional_widgets.append('N')
df['Fractional?'] = list_of_fractional_widgets
However, I'm trying to understand whether there is a more efficient pandas / numpy way to do this. Something like:
np.where(df['HeaderX'].str.contains(fractional_keywords?)), True)
as described in this SO question, but using a list and different headers.
Create a single pattern by joining all the words with '|'. Then we check the condition in each column separately using Series.str.contains and create a single mask using np.logical_or.reduce.
Sample Data
import pandas as pd
import numpy as np
keywords = ["fraction", "fractional", "1/x", "one fifth", "Fraction"]
np.random.seed(45)
df = pd.DataFrame(np.random.choice(keywords+list('abcdefghijklm'), (4,3)),
                  columns=['HeaderX', 'HeaderY', 'HeaderZ'])
Code
pat = '|'.join(keywords)
df['Fractional?'] = np.logical_or.reduce([df[col].str.contains(pat)
                                          for col in ['HeaderX', 'HeaderY', 'HeaderZ']])
      HeaderX    HeaderY   HeaderZ  Fractional?
0           g  one fifth  fraction         True
1   one fifth   Fraction         k         True
2  fractional          j         d         True
3           j          d         h        False
As a bonus, Series.str.contains can accept a case=False argument to ignore case when matching so there is no need to separately specify both 'fraction' and 'Fraction' (or any arbitrary capitalization like 'FracTIOn').
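Putting the case=False idea together with the Y/N column asked for in the question might look like the sketch below. The sample rows are invented, re.escape is added on the assumption that keywords should be matched literally, and na=False is used so missing values count as "no match":

```python
import re

import numpy as np
import pandas as pd

keywords = ["fraction", "1/x", "one fifth"]
# Escape each keyword so regex metacharacters are treated as literal text.
pat = "|".join(re.escape(k) for k in keywords)

# Invented sample data, including a NaN cell.
df = pd.DataFrame({
    "HeaderX": ["FracTIOn of a unit", "a", np.nan],
    "HeaderY": ["b", "1/x", "c"],
    "HeaderZ": ["d", "e", "f"],
})
cols = ["HeaderX", "HeaderY", "HeaderZ"]

# One boolean mask per column, combined with a row-wise OR.
mask = np.logical_or.reduce(
    [df[c].str.contains(pat, case=False, na=False) for c in cols]
)
df["Fractional?"] = np.where(mask, "Y", "N")
print(df)
```

Since "fractional" already contains "fraction", listing both is not needed for a substring match.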

Remove all characters except alphabet in column rows

Let's say I have a dataset, and some of its columns contain lists. The first key problem is that there are many columns with such lists, where the strings can be separated by (';') or (';;'), and the string itself may start with whitespace or even (';').
For some cases of this problem I implemented this function:
g = [';','']
f = []
for index, row in data_a.iterrows():
    for x in row['column_1']:
        if (x in g):
            norm = row['column_1'].split(x)
            f.append(norm)
            print(norm)
        else:
            pass  # body missing in the original post
Actually it worked, but the problem is that it returned duplicated rows, and wasn't able to solve tasks with other separators.
Another problem is using dummies after I changed the way column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).sum(level=0)
Instead of getting 40 columns in my case, I get 50 or 60, because I am not able to write a function that removes everything except letters from the lists. I would like to understand how to implement such a function, because the same string value can be written in different ways:
name-Jack or name(Jack)
Desired output would look like this:
nameJack nameJack
I'm not sure I understood you correctly, but to remove all non-alphanumeric characters you can use a simple regex (note that \W keeps letters, digits and underscores).
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for pandas Series.
df = pd.DataFrame({'names': ['name-Jack','name(Jack)']})
df
# names
# 0 name-Jack
# 1 name(Jack)
df['names'] = df['names'].str.replace(r'\W+', '', regex=True)
df
# names
# 0 nameJack
# 1 nameJack
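Since the title asks for letters only, a stricter pattern may be closer to what is wanted. The sketch below assumes that "alphabet" means ASCII letters A-Z and a-z; the third sample string is invented to show the difference from \W+:

```python
import re

samples = ['name-Jack', 'name(Jack)', 'name_Jack2']

# Keep ASCII letters only: digits and underscores are removed as well.
letters_only = [re.sub(r'[^A-Za-z]+', '', s) for s in samples]

# For comparison: \W+ leaves digits and underscores in place.
word_chars = [re.sub(r'\W+', '', s) for s in samples]

print(letters_only)
print(word_chars)
```

On a pandas Series the same pattern can be passed to str.replace with regex=True.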

Optimize a for loop applied on all elements of a df

EDIT : here are the first lines :
df = pd.read_csv(os.path.join(path, file), dtype = str,delimiter = ';',error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x:dt.datetime.strptime(x,'%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains values, but under the form 1.700,35 for example, whereas in python I need 1700.35. When I read the csv, all values are under a str type.
To solve this I wrote this function :
def format_nombre(df):
    for i in range(length):
        for j in range(width):
            element = df.iloc[i,j]
            if (type(element) != type(df.iloc[1,0])):
                a = df.iloc[i,j].replace(".","")
                b = float(a.replace(",","."))
                df.iloc[i,j] = b
Basically, I select each intersection of all rows and columns, I replace the problematic characters, I turn the element into a float and I replace it in the dataframe. The if ensures that the function doesn't consider dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my csv would take a little less than 60h.
I realize this is far from being optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
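def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    else:
        # regex=False so that '.' is treated as a literal dot, not "any character"
        return column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)
df = df.apply(to_numeric)
That assumes every string is a valid number. Otherwise, wrap the cleaned strings in pd.to_numeric(..., errors='coerce') instead of calling astype(float), so that invalid values become NaN.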
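Another option that may avoid the post-processing entirely is to let read_csv interpret the number format through its decimal and thousands parameters. The sketch below uses a small invented in-memory file in place of the real CSV, so the column names A and B are assumptions:

```python
import io

import pandas as pd

# Invented sample standing in for the real file: ';' separated,
# '.' as thousands separator and ',' as decimal mark.
raw = (
    "CALDAY;A;B\n"
    "01/02/2020;1.700,35;2,5\n"
    "02/02/2020;10,00;1.000.000,1\n"
)

df = pd.read_csv(io.StringIO(raw), sep=";", decimal=",", thousands=".")
print(df.dtypes)
print(df)
```

Note that this only works if dtype=str is not passed, since forcing every column to str would disable the numeric parsing.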

Numbers with hyphens or strings of numbers with hyphens

I need to make a pandas DataFrame that has a column filled with hyphenated numbers. The only way I could think of to do this was to use strings. This all worked fine, until I needed to sort them to get them back into order after a regrouping. The problem is that strings sort like this:
['100-200','1000-1100','1100-1200','200-300']
This is clearly not how I want it sorted. I want it sorted numerically. How would I get this to work? I am willing to change anything. Keeping the hyphenated string as an integer or float would be the best, but I am unsure how to do that.
You could try something like this:
>>> t = ['100-200','1000-1100','1100-1200','200-300']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100-200', '200-300', '1000-1100', '1100-1200']
This would allow you to sort on integers, and if a hyphen exists, it will sort first by the first integer in the key list and then by the second. If no hyphen exists, you will sort just on the integer equivalent of the string:
>>> t = ['100-200','1000-1100','1100-1200','200-300', '100']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100', '100-200', '200-300', '1000-1100', '1100-1200']
If you have any float equivalents in any strings, simply change int to float like this:
>>> t = ['100-200.3','1000.5-1100','1100.76-1200','200-300.75', '100.35']
>>> t.sort(key=lambda x: [float(y) for y in x.split('-')])
>>> t
['100-200.3', '100.35', '200-300.75', '1000.5-1100', '1100.76-1200']
You could use sorted to construct a new ordering for the index, and then perform the sort (reordering) using df.take:
import pandas as pd
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']})
order = sorted(range(len(df)),
               key=lambda idx: [int(y) for y in df['foo'].iloc[idx].split('-')])
df = df.take(order)
print(df)
yields
         foo
0    100-200
3    200-300
1  1000-1100
2  1100-1200
This is similar to @275365's solution, but note that the sorting is done on range(len(df)), not on the strings. The strings are only used in the key parameter to determine the order in which range(len(df)) should be rearranged.
Using sorted works fine if the DataFrame is small. You can get better performance when the DataFrame is of moderate size (for example, a few hundred rows on my machine), by using numpy.argsort instead:
import pandas as pd
import numpy as np
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']*100})
arr = df['foo'].map(lambda item: [int(y) for y in item.split('-')]).values
order = np.argsort(arr)
df = df.take(order)
Alternatively, you could split your string column into two integer-valued columns, and then use df.sort_values:
import pandas as pd
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']})
df[['start', 'end']] = df['foo'].str.split('-', expand=True).astype(int)
df.sort_values(['start', 'end'], inplace=True)
print(df)
print(df)
yields
         foo  start   end
0    100-200    100   200
3    200-300    200   300
1  1000-1100   1000  1100
2  1100-1200   1100  1200
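If you would rather not keep the helper columns, a variant of the same idea is to compute the ordering from integer arrays with numpy.lexsort and reorder with df.take. This is a sketch under the assumption that every value has exactly one hyphen; the extra '100-150' row is invented to show the tie-break on the second number:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300', '100-150']})

# Split into two integer columns (labelled 0 and 1) without storing them in df.
parts = df['foo'].str.split('-', expand=True).astype(int)

# lexsort treats the LAST key as the primary one: sort by start, then by end.
order = np.lexsort((parts[1].to_numpy(), parts[0].to_numpy()))

sorted_df = df.take(order)
print(sorted_df)
```

Because the comparison happens on integer arrays rather than Python objects, this may scale better than sorting lists, although that has not been measured here.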
