Numbers with hyphens or strings of numbers with hyphens - python

I need to make a pandas DataFrame that has a column filled with hyphenated numbers. The only way I could think of to do this was to use strings. This all worked fine, until I needed to sort them to get them back into order after a regrouping. The problem is that strings sort like this:
['100-200','1000-1100','1100-1200','200-300']
This is clearly not what I want: the values should be sorted numerically. How can I get this to work? I am willing to change anything. Storing the hyphenated value as an integer or float would be best, but I am unsure how to do that.

You could try something like this:
>>> t = ['100-200','1000-1100','1100-1200','200-300']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100-200', '200-300', '1000-1100', '1100-1200']
This would allow you to sort on integers, and if a hyphen exists, it will sort first by the first integer in the key list and then by the second. If no hyphen exists, you will sort just on the integer equivalent of the string:
>>> t = ['100-200','1000-1100','1100-1200','200-300', '100']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100', '100-200', '200-300', '1000-1100', '1100-1200']
If you have any float equivalents in any strings, simply change int to float like this:
>>> t = ['100-200.3','1000.5-1100','1100.76-1200','200-300.75', '100.35']
>>> t.sort(key=lambda x: [float(y) for y in x.split('-')])
>>> t
['100-200.3', '100.35', '200-300.75', '1000.5-1100', '1100.76-1200']
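If the same key is needed in several places, it can be pulled out into a small named function. The sketch below is one way to do that; `range_key` is a hypothetical helper name, and it assumes every piece between hyphens parses as a non-negative number (a leading minus sign would be mistaken for a separator).

```python
def range_key(label):
    """Convert '100-200' to (100.0, 200.0) and '100' to (100.0,)."""
    return tuple(float(part) for part in label.split('-'))


labels = ['100-200', '1000-1100', '1100-1200', '200-300', '100']

# sorted() returns a new list and leaves `labels` untouched.
ordered = sorted(labels, key=range_key)
print(ordered)
```

Tuples compare element by element, so a bare '100' sorts ahead of '100-200' because the shorter tuple wins when the shared prefix is equal.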

You could use sorted to construct a new ordering for the index, and then perform the sort (reordering) using df.take:
import pandas as pd
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']})
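# df.ix has been removed from pandas, and in Python 3 map() returns an
# iterator that cannot be compared, so build a list of ints instead.
order = sorted(range(len(df)),
               key=lambda idx: [int(v) for v in df['foo'].iloc[idx].split('-')])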
df = df.take(order)
print(df)
yields
         foo
0    100-200
3    200-300
1  1000-1100
2  1100-1200
This is similar to #275365's solution, but note that the sorting is done on range(len(df)), not on the strings. The strings are only used in the key parameter to determine the order in which range(len(df)) should be rearranged.
Using sorted works fine if the DataFrame is small. You can get better performance when the DataFrame is of moderate size (for example, a few hundred rows on my machine), by using numpy.argsort instead:
import pandas as pd
import numpy as np
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']*100})
# Build lists of ints (in Python 3 a bare map() object cannot be compared)
arr = df['foo'].map(lambda item: [int(v) for v in item.split('-')]).values
order = np.argsort(arr)
df = df.take(order)
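A variation on the argsort idea, offered as a sketch rather than as the answer's own method: split the column into two integer arrays and let numpy.lexsort order them, so that no Python-level list comparison is involved. The names `parts` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})

# Split 'start-end' into two integer columns labelled 0 and 1.
parts = df['foo'].str.split('-', expand=True).astype(int)

# np.lexsort treats the LAST key as the primary one, so pass (end, start)
# to sort by start first and break ties on end.
order = np.lexsort((parts[1].to_numpy(), parts[0].to_numpy()))

result = df.take(order)
print(result)
```

This assumes every value has exactly two hyphen-separated integer parts; a value without a hyphen would leave a missing entry and make the integer cast fail.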
Alternatively, you could split your string column into two integer-valued columns, and then use df.sort_values (df.sort has been removed from pandas):
import pandas as pd
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']})
# Use a list rather than map(), which is a one-shot iterator in Python 3
df[['start', 'end']] = df['foo'].apply(lambda val: pd.Series([int(v) for v in val.split('-')]))
df.sort_values(['start', 'end'], inplace=True)
print(df)
yields
         foo  start   end
0    100-200    100   200
3    200-300    200   300
1  1000-1100   1000  1100
2  1100-1200   1100  1200
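If the helper columns are not wanted, pandas 1.1 and later accept a key argument in sort_values. The following is a minimal sketch under that version assumption; it orders rows by the integer before the hyphen only, which is enough when the start values are unique.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})

# `key` receives the whole column as a Series and must return a Series of
# the same length; here that is the integer before the hyphen.
result = df.sort_values(
    'foo',
    key=lambda col: col.str.split('-').str[0].astype(int),
)
print(result)
```

The original index labels are kept, so the reordering can be checked against the unsorted frame.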

Related

Subset dataframe based on integer in column name

I have a dataframe that has names such as these for its columns:
column_names = ["c_12_2_heart",
                "c_29_4_lung",
                "c_21_21_stomach",
                "c_2_25_bladder",
                "c_40_1_kidney"]
In Python, how can I return a list of only the dataframe columns where the number after the first underscore is greater than 20?
We can use a list comprehension with basic string splitting logic:
column_names = ["c_12_2_heart", "c_29_4_lung", "c_21_21_stomach", "c_2_25_bladder", "c_40_1_kidney"]
output = [x for x in column_names if int(x.split("_")[1]) > 20]
print(output) # ['c_29_4_lung', 'c_21_21_stomach', 'c_40_1_kidney']
As an alternative to @Tim Biegeleisen's answer, you can split the column names and then filter with numpy.where:
import numpy as np
numbers = np.array([int(c.split('_')[1]) for c in column_names])
inds = np.where(numbers > 20)[0]
column_names_filt = [column_names[i] for i in inds]
print(column_names_filt)
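Both answers work on a plain list of names. To subset an actual DataFrame, the same test can be run over df.columns; this is a small sketch with made-up one-row data, assuming every column name follows the c_<number>_<number>_<label> pattern.

```python
import pandas as pd

# Hypothetical one-row frame that uses the column names from the question.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=["c_12_2_heart", "c_29_4_lung", "c_21_21_stomach",
             "c_2_25_bladder", "c_40_1_kidney"],
)

# Keep columns whose number after the first underscore is greater than 20.
keep = [name for name in df.columns if int(name.split("_")[1]) > 20]
subset = df[keep]
print(subset)
```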

How can I efficiently and idiomatically filter rows of PandasDF based on multiple StringMethods on a single column?

I have a Pandas DataFrame df with many columns, of which one is:
col
---
abc:kk__LL-z12-1234-5678-kk__z
def:kk_A_LL-z12-1234-5678-kk_ss_z
abc:kk_AAA_LL-z12-5678-5678-keek_st_z
abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x
...
I am trying to fetch all records where col starts with abc: and has the first -num- between '1234' and '2345' (inclusive using a string search; the -num- parts are exactly 4 digits each).
In the case above, I'd return
col
---
abc:kk__LL-z12-1234-5678-kk__z
abc:kk_AA_LL-z12-2345-5678-ek__x
...
My current (working, I think) solution looks like:
df = df[df['col'].str.startswith('abc:')]
df = df[df['col'].str.extract(r'.*-(\d+)-(\d+)-.*')[0].ge('1234')]
df = df[df['col'].str.extract(r'.*-(\d+)-(\d+)-.*')[0].le('2345')]
What is a more idiomatic and efficient way to do this in Pandas?
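Complex string operations are not as efficient as numeric calculations, so the following approach might be more efficient. It assumes the number of interest is always the third hyphen-separated field:
m1 = df['col'].str.startswith('abc:')
# errors='coerce' turns non-numeric fields into NaN, which between() treats as False
m2 = pd.to_numeric(df['col'].str.split('-').str[2], errors='coerce').between(1234, 2345)
dfn = df[m1 & m2]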
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
One way would be to use regexp and apply function. I find it easier to play with regexp in a separate function than to crowd the pandas expression.
import pandas as pd
import re
def filter_rows(string):
    z = re.match(r"abc:.*-(\d+)-(\d+)-.*", string)
    if z:
        return 1234 <= (int(z.groups()[0])) <= 2345
    else:
        return False
Then use the defined function to select rows
df.loc[df['col'].apply(filter_rows)]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
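Another take using a regex. Note that this pattern matches only the exact values 1234 and 2345, not the full range between them:
# the string starts with abc, followed by a greedy match;
# a lookbehind then requires either '1234-' or '2345-'
# immediately before a 4-digit number, with anything after it
pattern = r'(^abc.*(?<=1234-|2345-)\d{4}.*)'
df.col.str.extract(pattern).dropna()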
0
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
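The answers above can also be combined into one vectorized pass. The sketch below is an assumption-laden variant, not any answerer's code: it takes the "first -num-" to be the first run of exactly four digits enclosed by hyphens, extracts it once, and compares it numerically.

```python
import pandas as pd

df = pd.DataFrame({'col': [
    'abc:kk__LL-z12-1234-5678-kk__z',
    'def:kk_A_LL-z12-1234-5678-kk_ss_z',
    'abc:kk_AAA_LL-z12-5678-5678-keek_st_z',
    'abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x',
]})

# First run of exactly four digits with a hyphen on each side.
# expand=False returns a Series instead of a one-column DataFrame.
first_num = df['col'].str.extract(r'-(\d{4})-', expand=False)

# Rows with no match become NaN, and between() evaluates to False for NaN.
in_range = pd.to_numeric(first_num).between(1234, 2345)
mask = df['col'].str.startswith('abc:') & in_range

result = df[mask]
print(result)
```

Extracting once and building a single boolean mask avoids running the regex twice and filtering the frame three times, as the question's version does.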

Sorting a pandas DataFrame: numbers first, then strings

I have a DataFrame whose columns contain values such as P123Y8O9, a mixture of digits and letters. When I sort such a Series, the strings are compared character by character: first character, then second, and so on. What I want instead is for all-numeric strings such as 32456789 to come first, followed by mixed strings such as 2AJ6JH67.
In this example, 2 (the first character of 2AJ6JH67) sorts before 3 (the first character of 32456789), but I want 32456789 first and 2AJ6JH67 second.
How can I sort a DataFrame this way?
One way is to sort numeric and non-numeric data separately.
Below are equivalent examples for a list or pd.Series.
lst = ['P123Y8O9', '32456789']
lst_sorted = list(map(str, sorted(int(x) for x in lst if x.isdigit()))) + \
sorted(x for x in lst if not x.isdigit())
# ['32456789', 'P123Y8O9']
s = pd.Series(lst)
s_sorted = pd.Series(list(map(str, sorted(int(x) for x in s if x.isdigit()))) + \
sorted(x for x in s if not x.isdigit()))
# 0 32456789
# 1 P123Y8O9
# dtype: object
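The same ordering can be expressed as a single sort key instead of two separate sorts. This is a sketch with a hypothetical helper, `numbers_first`; unlike the int/str round trip above, it leaves the original strings untouched, so leading zeros survive.

```python
def numbers_first(value):
    """Sort key: all-digit strings first (by numeric value), then the rest."""
    if value.isdigit():
        return (0, int(value), '')
    return (1, 0, value)


values = ['P123Y8O9', '2AJ6JH67', '32456789', '1000']
ordered = sorted(values, key=numbers_first)
print(ordered)
```

The first tuple element decides the group, the second orders the numeric strings by value, and the third orders the mixed strings lexicographically.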

How to check if float pandas column contains only integer numbers?

I have a dataframe
df = pd.DataFrame(data=np.arange(10),columns=['v']).astype(float)
How to make sure that the numbers in v are whole numbers?
I am very concerned about rounding/truncation/floating point representation errors
Comparison with astype(int)
Tentatively convert your column to int and test with np.array_equal:
np.array_equal(df.v, df.v.astype(int))
True
float.is_integer
You can use this python function in conjunction with an apply:
df.v.apply(float.is_integer).all()
True
Or, using python's all in a generator comprehension, for space efficiency:
all(x.is_integer() for x in df.v)
True
Here's a simpler, and probably faster, approach:
(df[col] % 1 == 0).all()
To ignore nulls:
(df[col].fillna(-9999) % 1 == 0).all()
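Rather than filling nulls with a sentinel value, they can be dropped before the check. A minimal sketch, with `is_whole` as a hypothetical helper name:

```python
import numpy as np
import pandas as pd


def is_whole(series):
    """Return True if every non-null value in a float Series is a whole number."""
    values = series.dropna()
    # x % 1 is 0.0 for whole numbers and NaN for infinities, so inf fails the test.
    return bool((values % 1 == 0).all())


whole = pd.Series([0.0, 1.0, np.nan, 3.0])
fractional = pd.Series([0.0, 1.5])
print(is_whole(whole), is_whole(fractional))
```

An empty or all-null Series returns True here, because all() over nothing is True; whether that is the desired answer depends on the use case.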
If you want to check multiple float columns in your dataframe, you can do the following:
col_should_be_int = df.select_dtypes(include=['float']).applymap(float.is_integer).all()
float_to_int_cols = col_should_be_int[col_should_be_int].index
df.loc[:, float_to_int_cols] = df.loc[:, float_to_int_cols].astype(int)
Keep in mind that a float column containing only whole numbers will not be selected if it also has np.nan values. To cast float columns with missing values to integer, you first need to fill or remove the missing values, for example with median imputation:
float_cols = df.select_dtypes(include=['float'])
float_cols = float_cols.fillna(float_cols.median().round()) # median imputation
col_should_be_int = float_cols.applymap(float.is_integer).all()
float_to_int_cols = col_should_be_int[col_should_be_int].index
df.loc[:, float_to_int_cols] = float_cols[float_to_int_cols].astype(int)
For completeness, pandas v1.0+ offers the convert_dtypes() utility, which (among 3 other conversions) performs the requested operation for all DataFrame columns (or a Series) containing only whole numbers.
If you wanted to limit the conversion to a single column only, you could do the following:
>>> df.dtypes # inspect previous dtypes
v float64
>>> df["v"] = df["v"].convert_dtypes()
>>> df.dtypes # inspect converted dtypes
v Int64
On 27,331,625 rows the following works well and took about 1.3 s:
df['is_float'] = df[field_fact_qty]!=df[field_fact_qty].astype(int)
This alternative took about 4.9 s:
df[field_fact_qty].apply(lambda x : (x.is_integer()))

Pandas String to Integer by Character

In a Pandas data frame column, I want to convert each character in a string to an integer (as is done with ord()) and add 100 to the left. I know how to do this with a regular string:
st = "JOHNSMITH4817001141979"
a=[ord(x) for x in st]
b=[]
for x in a:
    b.append('{:03}'.format(x)) #Add leading zero, ensuring 3 digits
b=['100']+b
b=''.join([ "%s"%x for x in b])
b=int(b)
b
Result: 100074079072078083077073084072052056049055048048049049052049057055057
But what if I wanted to perform this operation on every cell of a column in a Pandas data frame like this one?
import pandas as pd
df = pd.DataFrame({'string':['JOHNSMITH4817001141979','JOHNSMYTHE4817001141979']})
df
string
0 JOHNSMITH4817001141979
1 JOHNSMYTHE4817001141979
I just need a separate column with the result as an integer for each cell in 'string'.
Thanks in advance!
First, you transform your processing chain into a function such as:
def get_it(text):
    a = [ord(x) for x in text]  # use the argument, not the global `st`
    b = []
    for x in a:
        b.append('{:03}'.format(x))  # add leading zeros, ensuring 3 digits
    b = ['100'] + b
    b = ''.join(b)
    return int(b)
and then you call it iteratively for each element in the column and make this list the new column
df['result'] = [get_it(i) for i in df['string']]
Although this works, I think you could find a better solution by optimizing the body of get_it.
Also, you can do the following:
def get_it(text):
    a = [ord(x) for x in text]  # use the argument, not the global `st`
    b = []
    for x in a:
        b.append('{:03}'.format(x))  # add leading zeros, ensuring 3 digits
    b = ['100'] + b
    b = ''.join(b)
    return int(b)
df['result'] = df['string'].apply(get_it)
If you want a one-liner (Python 3.6+):
import pandas as pd
df = pd.DataFrame({'string':['JOHNSMITH4817001141979','JOHNSMYTHE4817001141979']})
df['string'].apply(lambda x: int(''.join(['100'] + [f'{ord(i):03}' for i in x])))
For Python < 3.6, replace the f-string with '{:03}'.format(ord(i)). This simply turns your function into a lambda expression and applies it to the column. The conversion uses Python's int inside the lambda because the resulting numbers are far too large for a 64-bit integer dtype, so .astype(int) would overflow.
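Putting the pieces together, here is a compact sketch of the whole operation. `encode` is a hypothetical name, and the three-digit padding assumes every character has a code point below 1000 (true for ASCII, not for all of Unicode).

```python
import pandas as pd


def encode(text):
    """Prefix '100', then append each character's code point as three digits."""
    return int('100' + ''.join(f'{ord(ch):03}' for ch in text))


df = pd.DataFrame({'string': ['JOHNSMITH4817001141979',
                              'JOHNSMYTHE4817001141979']})

# The results are far larger than a 64-bit integer, so the values stay
# arbitrary-precision Python ints rather than a NumPy integer dtype.
df['result'] = df['string'].map(encode)
print(df['result'].iloc[0])
```

Because the values are plain Python integers, vectorized numeric operations on the new column will be slower than on a native integer column.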
