Sorting a pandas dataframe: numbers first, then strings - python

I have a dataframe with columns containing values like P123Y8O9, a mixture of numbers and characters. If I apply the sort function to this series, it sorts the strings character by character: first digit, then second, and so on. What I want is to sort first all purely numeric strings, like 32456789, and then the mixed strings, like 2AJ6JH67.
You can see that in the above example 2 (the first digit of 2AJ6JH67) comes before 3 (the first digit of 32456789) numerically, but the desired order is 32456789 first and then 2AJ6JH67.
How do I sort dataframes this way?

One way is to sort the numeric and non-numeric data separately.
Below are equivalent examples for a list and a pd.Series.
import pandas as pd

lst = ['P123Y8O9', '32456789']

# Purely numeric strings first (compared as integers), then the rest (compared as strings).
lst_sorted = list(map(str, sorted(int(x) for x in lst if x.isdigit()))) + \
             sorted(x for x in lst if not x.isdigit())
# ['32456789', 'P123Y8O9']

s = pd.Series(lst)
s_sorted = pd.Series(list(map(str, sorted(int(x) for x in s if x.isdigit()))) +
                     sorted(x for x in s if not x.isdigit()))
# 0    32456789
# 1    P123Y8O9
# dtype: object
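If you want to reorder a whole DataFrame rather than a single column, one option is to sort the index with the same two-tier key. Below is a minimal sketch (the DataFrame and the column name 'code' are made up for illustration): purely numeric strings rank first and compare as integers, everything else ranks second and compares lexicographically.
import pandas as pd

df = pd.DataFrame({'code': ['P123Y8O9', '2AJ6JH67', '32456789']})

def sort_key(value):
    # Numeric strings get rank 0 and compare by integer value;
    # mixed strings get rank 1 and compare as plain strings.
    return (0, int(value), '') if value.isdigit() else (1, 0, value)

df_sorted = df.loc[sorted(df.index, key=lambda i: sort_key(df.at[i, 'code']))]
print(df_sorted)
#        code
# 2  32456789
# 1  2AJ6JH67
# 0  P123Y8O9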

Related

pandas: create column conditioned on row containing a string in list

I have a dataframe with 20 columns, and 3 of those columns (always the same) may contain one or more of these strings ["fraction", "fractional", "1/x", "one fifth"].
I want to add a new column that says whether or not each row is "fractional" (in other words, contains one of those words). This column could have Y or N to indicate this.
I've tried to do it with iterrows, like so:
list_of_fractional_widgets = []
for index, row in df.iterrows():
    fractional_keywords = ["fraction", "fractional", "1/x", "one fifth", "Fraction"]
    # use str to remove offending nan values
    xx = str(row["HeaderX"])
    yy = str(row["HeaderY"])
    zz = str(row["HeaderZ"])
    widget_data = [xx, yy, zz]
    for word in fractional_keywords:
        found = [True for x in widget_data if word in x]
        if len(found) > 0:
            list_of_fractional_widgets.append('Y')
            break
    if len(found) == 0:
        list_of_fractional_widgets.append('N')
df['Fractional?'] = list_of_fractional_widgets
however, I'm trying to understand if there is a more pandas / numpy efficient way to do so. Something like:
np.where(df['HeaderX'].str.contains(fractional_keywords?)), True)
as described in this SO question, but using a list and different headers.
Create a single pattern by joining all the words with '|'. Then we check the condition in each column separately using Series.str.contains and create a single mask using np.logical_or.reduce.
Sample Data
import pandas as pd
import numpy as np

keywords = ["fraction", "fractional", "1/x", "one fifth", "Fraction"]
np.random.seed(45)
df = pd.DataFrame(np.random.choice(keywords + list('abcdefghijklm'), (4, 3)),
                  columns=['HeaderX', 'HeaderY', 'HeaderZ'])
Code
pat = '|'.join(keywords)
df['Fractional?'] = np.logical_or.reduce([df[col].str.contains(pat)
                                          for col in ['HeaderX', 'HeaderY', 'HeaderZ']])
      HeaderX    HeaderY   HeaderZ  Fractional?
0           g  one fifth  fraction         True
1   one fifth   Fraction         k         True
2  fractional          j         d         True
3           j          d         h        False
As a bonus, Series.str.contains can accept a case=False argument to ignore case when matching so there is no need to separately specify both 'fraction' and 'Fraction' (or any arbitrary capitalization like 'FracTIOn').
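For example, the mask above could be built case-insensitively like this (a sketch reusing the sample df; note that 'fractional' also becomes redundant, since it contains 'fraction'):
pat = '|'.join(["fraction", "1/x", "one fifth"])
df['Fractional?'] = np.logical_or.reduce(
    [df[col].str.contains(pat, case=False)   # case=False ignores capitalization
     for col in ['HeaderX', 'HeaderY', 'HeaderZ']])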

Append a word after matching a string from looped list in data frame with external lists to new column in same dataframe

I want to loop over a pandas data frame where each row has a list of strings, and cross-reference each row against several predefined lists of strings. If a predefined string matches a string in the row, I want to append the matching string to a new column at the same index as that row. If no string matches, a generic string must be appended instead. Once all the rows (1207 to be exact) have been looped over, the new column must have one value per row.
#these are the predefined lists
traffic=['stationary','congest','traffic','slow','heavi','bumper','flow','spectate','emergm','jam','visibl'] #predefined list of strings
accident=['outsur','accid','avoid','crash','overturn','massiv','fatalmov','roll'] #predefined list of strings
crime=['shootout','lawnessness','robbery','fire','n1shoot','rob','mug','killed','kill','scene','lawness'] #predefined list of strings
#this is the code I have already tried
for x in test['text']:
    for y in x:
        if y in traffic:
            test['type1'] = 'traffic'
            break
        if y in crime:
            test['type1'] = 'crime'
            break
        if y in accident:
            test['type1'] = 'accident'
            break
        else:
            test['type1'] = 'ignore'
            break
Below is a sample of my data frame
The dataframe's name is test.
[original dataframe][1]
[1]: https://i.stack.imgur.com/aZML4.png
From what I have tried, this is the output:
[Output of code in dataframe][2]
[2]: https://i.stack.imgur.com/iwj1g.png
There might be a simpler way to run such a comparison. It was not clear which list should be compared first, but below is one way.
PS: created sample data:
import pandas as pd

x = [
    [['report', 'shootout', 'midrand', 'n1', 'north', 'slow']],
    [['jhbtraffic', 'lioght', 'out', 'citi', 'deep']],
    [['jhbtraffic', 'light', 'out', 'booysen', 'booysen']],
]
df = pd.DataFrame(x, columns=['text'])
df
Out[2]:
                                            text
0  [report, shootout, midrand, n1, north, slow]
1          [jhbtraffic, lioght, out, citi, deep]
2     [jhbtraffic, light, out, booysen, booysen]
Actual solution:
### get matched strings per row
matched = df['text'].apply(lambda x: [a for a in x for i in crime + accident + traffic if i == a])
### merge to the original dataset
df.join(pd.DataFrame(matched.tolist(), index=df.index)).fillna('ignored')
Out[1]:
                                            text         0        1
0  [report, shootout, midrand, n1, north, slow]  shootout     slow
1          [jhbtraffic, lioght, out, citi, deep]   ignored  ignored
2     [jhbtraffic, light, out, booysen, booysen]   ignored  ignored
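If what you ultimately want is a single category label per row, as in your original loop (traffic/crime/accident/ignore), a small variation on the same apply does it. A sketch, assuming the three predefined lists from the question and checking them in the same order your code did:
categories = {'traffic': traffic, 'crime': crime, 'accident': accident}

def label(words):
    # Return the first category containing any word from the row,
    # falling back to 'ignore' when nothing matches.
    for name, keywords in categories.items():
        if any(w in keywords for w in words):
            return name
    return 'ignore'

df['type1'] = df['text'].apply(label)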

How to apply calculation to column of text file?

I'm trying to apply a calculation to every value of every column in my csv file and replacing the old values with these new calculated values.
# temp_list is a list of lists, e.g. [['1.3','2.2','1.6'], ['1.2','4.5','2.3']]
for row in temp_list:
    minimum = min(row)  # find minimum value of values in column 2
    # y = every value in the 2nd column - minimum
    # for every value in the 2nd column, apply the y calculation to it
    # and replace the original values with these values
    row[1] = float(row[1])
I understand that if I did
row[1] = float(row[1]) * 3
for example, I would get each value in column 2 (index 1) to be multiplied by 3. How would I do that for my y calculation written above?
You can use zip to transpose the list of lists, convert the sequence to a list and then use [1] to get the values in the second row (originally second column), so that you can use the min function with float as a key function to get minimum of the values based on their values in floating point:
min(list(zip(*temp_list))[1], key=float)
This returns '2.2' (still a string; float is only used for the comparison).
Based on your comment, I think this is what you wanted.
Since your lists are of strings, there's a bit of casting back and forth between Decimal and string
from decimal import Decimal

temp_list = [['1.3', '2.2', '1.6'], ['1.2', '4.5', '2.3']]
for x in temp_list:
    x[1] = str(Decimal(x[1]) - min(Decimal(y) for y in x))
print(temp_list)
# [['1.3', '0.9', '1.6'], ['1.2', '3.3', '2.3']]
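Since the values ultimately come from a csv file, the same row-wise calculation can also be done directly in pandas. A sketch, assuming the file has no header row (the filename 'data.csv' is made up for illustration):
import pandas as pd

df = pd.read_csv('data.csv', header=None)         # e.g. rows like 1.3,2.2,1.6
df[1] = df[1] - df.min(axis=1)                    # subtract each row's minimum from column 2
df.to_csv('data.csv', header=False, index=False)  # write the new values back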

Remove all characters except alphabet in column rows

Let's say I have a dataset, and in some columns of this dataset I have lists. The first key problem is that there are many columns with such lists, where the strings can be separated by ';' or ';;', and a string itself may start with whitespace or even ';'.
For some cases of this problem I implemented this function:
g = [';', '']
f = []
for index, row in data_a.iterrows():
    for x in row['column_1']:
        if x in g:
            norm = row['column_1'].split(x)
            f.append(norm)
            print(norm)
        else:
Actually it worked, but the problem is that it returned duplicated rows and wasn't able to handle other separators.
Another problem is using dummies after I changed the way the column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).sum(level=0)
Instead of getting 40 columns in my case, I get 50 or 60, because I am not able to write a function that removes everything except letters from the lists. I would like to understand how to implement such a function, because the same string meaning can be written in different ways:
name-Jack or name(Jack)
The desired output would look like this:
nameJack nameJack
I'm not sure if I understood you correctly, but to remove all non-alphanumeric characters you can use a simple regex.
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for a pandas Series (recent pandas versions need regex=True to treat the pattern as a regex):
df = pd.DataFrame({'names': ['name-Jack', 'name(Jack)']})
df
#         names
# 0   name-Jack
# 1  name(Jack)
df['names'] = df['names'].str.replace(r'\W+', '', regex=True)
df
#       names
# 0  nameJack
# 1  nameJack
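Putting the two pieces together for your dummies problem: normalize the list column first, then build the dummies. A hedged sketch, assuming column_1 holds ';'-separated raw strings (the sample values are invented):
import re
import pandas as pd

data_a = pd.DataFrame({'column_1': [';name-Jack;;name(Jill)', 'name(Jack)']})

# Split on ';', strip non-word characters from each piece, drop empty pieces.
tokens = (data_a['column_1']
          .str.split(';')
          .apply(lambda parts: [re.sub(r'\W+', '', p) for p in parts if p]))

dummies = pd.get_dummies(tokens.apply(pd.Series).stack()).groupby(level=0).sum()
# columns: nameJack, nameJill -- one column per distinct cleaned token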

Numbers with hyphens or strings of numbers with hyphens

I need to make a pandas DataFrame that has a column filled with hyphenated numbers. The only way I could think of to do this was to use strings. This all worked fine, until I needed to sort them to get them back into order after a regrouping. The problem is that strings sort like this:
['100-200','1000-1100','1100-1200','200-300']
This is clearly not how I want it sorted. I want it sorted numerically. How would I get this to work? I am willing to change anything. Keeping the hyphenated values as integers or floats would be best, but I am unsure how to do that.
You could try something like this:
>>> t = ['100-200','1000-1100','1100-1200','200-300']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100-200', '200-300', '1000-1100', '1100-1200']
This would allow you to sort on integers, and if a hyphen exists, it will sort first by the first integer in the key list and then by the second. If no hyphen exists, you will sort just on the integer equivalent of the string:
>>> t = ['100-200','1000-1100','1100-1200','200-300', '100']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100', '100-200', '200-300', '1000-1100', '1100-1200']
If you have any float equivalents in any strings, simply change int to float like this:
>>> t = ['100-200.3','1000.5-1100','1100.76-1200','200-300.75', '100.35']
>>> t.sort(key=lambda x: [float(y) for y in x.split('-')])
>>> t
['100-200.3', '100.35', '200-300.75', '1000.5-1100', '1100.76-1200']
You could use sorted to construct a new ordering for the index, and then perform the sort (reordering) using df.take:
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})
order = sorted(range(len(df)),
               key=lambda idx: [int(p) for p in df.loc[idx, 'foo'].split('-')])
df = df.take(order)
print(df)
yields
         foo
0    100-200
3    200-300
1  1000-1100
2  1100-1200
This is similar to #275365's solution, but note that the sorting is done on range(len(df)), not on the strings. The strings are only used in the key parameter to determine the order in which range(len(df)) should be rearranged.
Using sorted works fine if the DataFrame is small. You can get better performance when the DataFrame is of moderate size (for example, a few hundred rows on my machine), by using numpy.argsort instead:
import pandas as pd
import numpy as np

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300'] * 100})
arr = df['foo'].map(lambda item: [int(p) for p in item.split('-')]).values
order = np.argsort(arr)
df = df.take(order)
Alternatively, you could split your string column into two integer-valued columns, and then use df.sort_values:
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})
df[['start', 'end']] = df['foo'].apply(
    lambda val: pd.Series([int(p) for p in val.split('-')]))
df.sort_values(['start', 'end'], inplace=True)
print(df)
yields
         foo  start   end
0    100-200    100   200
3    200-300    200   300
1  1000-1100   1000  1100
2  1100-1200   1100  1200
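On recent pandas versions (1.1+), sort_values also accepts a key function directly, which removes the need for the helper columns. A sketch of that variant, not from the original answers:
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})
# The key maps each string to a tuple of ints, so rows compare numerically.
df_sorted = df.sort_values(
    'foo', key=lambda s: s.map(lambda v: tuple(int(p) for p in v.split('-'))))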
