How can I detect rows and columns of a dataframe whose string elements contain any character other than a set of desired characters?
The desired characters are A, B, C, a, b, c, 1, 2, 3, &, %, =, /
dataframe -

   Col1 Col2 Col3
0  Abc  Øa   12
1  bbb  +    }
The output should be the elements Øa, +, } and their locations in the dataframe.
I find it really difficult to locate elements matching a condition directly in pandas, so I converted the dataframe to a nested list first and then worked with the list. Try this:
import pandas as pd
import numpy as np

# creating your sample dataframe
array = np.array([['Abc', 'Øa', '12'], ['bbb', '+', '}']])
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=array, columns=columns)

# convert dataframe to nested list
pd_list = df.values.tolist()

# build the set of characters to flag: start from a broad character set
# and strip out the desired characters listed in 'var'
all_chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;=>?#[\\]^_`{|}~Ø'
var = 'ABCabc123&%=/'
for a in var:
    all_chars = all_chars.replace(a, "")

# stores previously detected elements to prevent duplicates
temp_storage = []

# loop through the nested list to get the elements' indexes
for x in all_chars:
    for i in pd_list:
        for n in i:
            if x in n:
                # check if element was already reported
                if n not in temp_storage:
                    temp_storage.append(n)
                    print(f'found {n}: row={pd_list.index(i)}; col={i.index(n)}')
Output:
> found +: row=1; col=1
> found }: row=1; col=2
> found Øa: row=0; col=1
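For comparison, a vectorized sketch that stays in pandas, using a regular expression built from the allowed characters (same sample dataframe as above; the character class is an assumption matching the desired list in the question):

import numpy as np
import pandas as pd

array = np.array([['Abc', 'Øa', '12'], ['bbb', '+', '}']])
df = pd.DataFrame(data=array, columns=['Col1', 'Col2', 'Col3'])

# True wherever a cell contains a character outside the allowed set
mask = df.apply(lambda col: col.str.contains(r'[^ABCabc123&%=/]', regex=True))
for r, c in zip(*np.where(mask.to_numpy())):
    print(f"found {df.iat[r, c]}: row={r}; col={c}")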
I have a pandas DataFrame with two columns:
sentence - 'fo n bar'
annotations - [B-inv, B-inv, O, I-acc, O, B-com, I-com, I-com]
I want to insert an additional 'O' element in the annotations list in front of each annotation that starts with 'B', so that it looks like this:
[O, B-inv, O, B-inv, O, I-acc, O, O, B-com, I-com, I-com]
And then insert additional whitespace in the sentence in front of each character whose index equals a 'B' annotation index in the initial annotations, i.e. in front of each character with an index in the list [0, 1, 5], giving:
' f o n  bar' (note the two spaces before bar)
Maybe to make it more visually appealing, I should represent it this way:
Initial sentence:

Ind  Sentence char  Annot
0    f              B-inv
1    o              B-inv
2    whitespace     O
3    n              I-acc
4    whitespace     O
5    b              B-com
6    a              I-com
7    r              I-com
End sentence:

Ind  Sentence char  Annot
0    whitespace     O
1    f              B-inv
2    whitespace     O
3    o              B-inv
4    whitespace     O
5    n              I-acc
6    whitespace     O
7    whitespace     O
8    b              B-com
9    a              I-com
10   r              I-com
Updated answer (list comprehension)
from itertools import chain
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
annot, sent = list(map(lambda l: list(chain(*l)), list(zip(*[(['O', a], [' ', s]) if a.startswith('B') else ([a], [s]) for a,s in zip(annot, sent)]))))
print(annot)
print(''.join(sent))
chain from itertools allows you to chain together a list of lists to form a single list. The rest is some clumsy use of zip together with argument unpacking (the * prefix) to get it all in one line. map is only used to apply the same operation to both lists.
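For instance, a quick sketch of what chain does with a list of lists:

from itertools import chain

list(chain(*[['O', 'B-inv'], ['O', 'B-inv'], ['O']]))
# ['O', 'B-inv', 'O', 'B-inv', 'O']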
But a more readable version, so you can also follow the steps better, could be:
# find where in the annotations the element starts with 'B'
loc = [a.startswith('B') for a in annot]
# Use this locator to add an element and Merge the list of lists with `chain`
annot = list(chain.from_iterable([['O', a] if l else [a] for a,l in zip(annot, loc)]))
sent = ''.join(chain.from_iterable([[' ', a] if l else [a] for a,l in zip(sent, loc)])) # same on sentence
Note that above I do not use map, since each list is processed separately, and there is less zipping and casting to lists. So it is most probably the cleaner, and hence preferred, solution.
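For the sample data, the resulting annot and sent should look like this (a sketch of the expected output):

print(annot)
# ['O', 'B-inv', 'O', 'B-inv', 'O', 'I-acc', 'O', 'O', 'B-com', 'I-com', 'I-com']
print(sent)
# ' f o n  bar'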
Old answer (pandas)
I am not sure a DataFrame is the most convenient structure for this. It might be easier to do it on simple lists, before converting to a DataFrame.
But anyway, here is a way through it, assuming you don't really have meaningful indices in your DataFrame (so that the indices are simply the integer count of each row).
The trick is to use the .str string functions, such as startswith in this case, to find matching strings in the column Series of interest. Then you can loop over the matching indices ([0, 1, 5] in the example) and insert, at a dummy location (a half index, e.g. 0.5 to place a row before row 1), the row with the whitespace and 'O' data. Sorting by index with .sort_index() will then rearrange all rows in the way you want.
import pandas as pd
import numpy as np

annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
df = pd.DataFrame({'sent': sent, 'annot': annot})

idx = np.argwhere(df.annot.str.startswith('B').values)  # find rows where annotations start with 'B'
for i in idx.ravel():  # loop over the indices before which we want to insert a new row
    df.loc[i - 0.5] = [' ', 'O']  # made-up indices so that the subsequent sorting places the row where you want it
df.sort_index().reset_index(drop=True) # this will output the new DataFrame
There is a DataFrame df that holds lists of strings:
>> df
words
0 [a,b,c]
1 [a]
2 [x,c,c]
3 [a]
...
I want to count the number of rows that each value in words occurs in. For example:
a: 3
b: 1
c: 2
x: 1
I get a list of all unique words in the DataFrame using:
>> from collections import OrderedDict #using OrderedDict to keep word order
>> l = []
>> df.words.apply(lambda x: l.append(x)) #add list of words to a list
>> l = list(OrderedDict.fromkeys([j for i in l for j in i])) #merge list of lists and remove duplicates
>> print(l)
[a,b,c,x]
From here I go through the list l, check each row of df for whether the word exists, and then sum the Bool values for each word.
data = []
for w in l:
    tmp = []
    df.words.apply(lambda x: tmp.append(w in x))
    data.append(sum(tmp))
I can then create a dictionary of words and their count. This is, however, very inefficient, as it takes a long time (70,000+ words and 50,000+ rows). Is there a faster way of doing this?
You can use Series.explode with Series.value_counts:
df['words'].explode().value_counts(sort=False)
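Note that explode counts every occurrence, so c comes out as 3 here because [x,c,c] holds it twice; if you want the number of rows instead (c: 2, as in the question), a minimal sketch that de-duplicates each list first, assuming words holds Python lists:

df['words'].map(lambda ws: list(set(ws))).explode().value_counts(sort=False)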
One more alternative is using itertools.chain.from_iterable with collections.Counter:
from itertools import chain
from collections import Counter

counts = Counter(chain.from_iterable(df['words']))
pd.Series(counts)
a 3
b 1
c 3
x 1
dtype: int64
Convert each list to a set, then back to a list. Combine them using itertools, and then run collections.Counter on the result to get the dictionary.
# data is your list of lists, e.g. df['words'].tolist()
import itertools
import collections

data = [list(set(i)) for i in data]  # keep a single copy of each word per row
newData = list(itertools.chain.from_iterable(data))
# chain makes an iterator that returns elements from the first iterable until it is
# exhausted, then proceeds to the next iterable, until all of the iterables are
# exhausted.
dictVal = collections.Counter(newData)
I have a text file with letters (tab delimited) and a numpy array (obj) with a few letters (a single row). The rows of the text file have different numbers of columns. Some rows in the text file may have multiple copies of the same letter (I would like to consider only a single copy of a letter in each row). Letters in the same row of the text file are assumed to be similar to each other. Also, each letter of the numpy array obj is present in one or more rows of the text file.
Below is an example of the text file (you can download the file from here ):
b q a i m l r
j n o r o
e i k u i s
In the above example, the letter o appears twice in the second row, and the letter i appears twice in the third row. I would like to consider only single copies of letters in the rows of the text file.
This is an example of obj: obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
I want to compare obj with rows of the text file and form clusters from elements in obj.
This is how I want to do it. Corresponding to each row of the text file, I want to have a list which denotes a cluster (in the above example we will have three clusters, since the text file has three rows). For every given element of obj, I want to find the rows of the text file where the element is present. Then, I would like to assign the index of that element of obj to the cluster which corresponds to the row with the maximum length (the lengths of rows are decided after reducing each row to single copies of its letters).
Below is the Python code that I have written for this task:
import pandas as pd
import numpy as np

data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python').values[:, :].astype('<U1000')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])

for i in range(data.shape[0]):
    globals()['data_row' + str(i).zfill(3)] = []
    globals()['clust' + str(i).zfill(3)] = []
    for j in range(len(obj)):
        if obj[j] in set(data[i, :]):
            globals()['data_row' + str(i).zfill(3)] += [j]

for i in range(len(obj)):
    globals()['obj_lst' + str(i).zfill(3)] = [0] * data.shape[0]
    for j in range(data.shape[0]):
        if i in globals()['data_row' + str(j).zfill(3)]:
            globals()['obj_lst' + str(i).zfill(3)][j] = len(globals()['data_row' + str(j).zfill(3)])
    indx_max = globals()['obj_lst' + str(i).zfill(3)].index(max(globals()['obj_lst' + str(i).zfill(3)]))
    globals()['clust' + str(indx_max).zfill(3)] += [i]

for i in range(data.shape[0]):
    print(globals()['clust' + str(i).zfill(3)])
>> [0]
>> [3]
>> [1, 2, 4]
The above code gives me the right answer. But in my actual work, the text file has tens of thousands of rows and the numpy array has hundreds of thousands of elements, and the code above is not very fast. So I want to know if there is a better (faster) way to implement the above functionality (using Python).
You can do it using merge after a stack on data (in pandas), then some groupby with nunique and idxmax to get what you want:
import pandas as pd
import numpy as np

#keep data in pandas
data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])

#merge to keep only the letters from obj
df = (data.stack().reset_index(0, name='l')
          .merge(pd.DataFrame({'l': obj})).set_index('level_0'))

#get the number of unique elements of obj in each row of data
#and use transform to keep this length along each row of df
df['len'] = df.groupby('level_0')['l'].transform('nunique')

#get the result you want in a series
res = (pd.DataFrame({'data_row': df.groupby('l')['len'].idxmax().values})
         .groupby('data_row').apply(lambda x: list(x.index)))
print(res)
print(res)
data_row
0 [0]
1 [3]
2 [1, 2, 4]
dtype: object
res contains the clusters, with the index being the row in the original data.
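If plain Python lists per row are preferred (like the clust lists built in the question), the Series can be expanded, for example (a sketch; rows that received no element of obj get an empty list):

clusters = [res.get(i, []) for i in range(len(data))]
# [[0], [3], [1, 2, 4]]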
I have a DataFrame as below. I want to concatenate first 2 columns.
If the length of their concatenation is <13 then I would like to add 0s in between so that the length becomes 13.
If the length of their concatenation is >=13 then I just want to concatenate.
d = {'col1': [123456, 2, 1234567], 'col2': [1234567, 4, 1234567]}
df = pd.DataFrame(data=d)
df
df['var3'] = df.col1.astype(str) + df.col2.astype(str)
df
In the case of the second row, instead of '24' I want 11 0s between the 2 and the 4.
I would like to keep the third row as it is, since the length of its concatenation is already >= 13.
You may want to convert the numbers to strings before doing anything else, so I will assume that col1 and col2 are strings.
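If they are still integers, as in the sample d, they can be converted up front, for example:

df['col1'] = df['col1'].astype(str)
df['col2'] = df['col2'].astype(str)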
First, find the combined string lengths and how many zeros are missing:
pads = 13 - (df.col1.str.len() + df.col2.str.len())
Then generate the necessary paddings and concatenate the columns and the paddings:
df['var3'] = df.col1 + pads.apply(lambda x: x * '0') + df.col2
#0 1234561234567
#1 2000000000004
#2 12345671234567
For each row, make a tuple with 3 values:
string1
string2
the target padded length of string1, i.e. 13 (or whatever target length) minus the length of string2
x = pd.Series(list(zip(df['col1'].astype(str),
                       df['col2'].astype(str),
                       13 - df['col2'].astype(str).str.len())))
Then use the string method ljust to pad the left string with 0s up to that length and append the right string. Assign everything to the new column.
df['var3'] = x.apply(lambda x: x[0].ljust(x[2], '0') + x[1])
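With the sample frame from the question, this should give the same var3 values as the first approach (a sketch of the expected output):

print(df['var3'].tolist())
# ['1234561234567', '2000000000004', '12345671234567']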
I have the following list of combinations:
a = [(1,10),(2,8),(300,28),(413,212)]
b = [(8,28), (8,15),(10,21),(28,34),(413,12)]
I want to create a new combination list from these two lists, following these criteria:
A. List a and list b have common elements: the second element of a tuple in list a equals the first element of a tuple in list b.
B. Such combinations of list a and list b should form a new combination:
d = [(1,10,21),(2,8,28),(2,8,15),(300,28,34)]
C. All other tuples in both lists which do not satisfy the criteria get ignored.
QUESTIONS
Can I do this criteria-based combination using itertools?
What is the most elegant way to solve this problem with/without using modules?
How can one write the output to an Excel sheet so that each element of a tuple in list d goes into a separate column, such that:
d = [(1,10,21),(2,8,28),(2,8,15),(300,28,34)] is displayed in Excel as:
Col A = [1, 2, 2, 300]
Col B = [10,8,8,28]
Col C = [21,28,15,34]
pandas works like a charm for Excel.
Here is the code:
a = [(1,10),(2,8),(300,28),(413,212)]
b = [(8,28), (8,15),(10,21),(28,34),(413,12)]
# keep (x, y, t) whenever the second element of a tuple in a equals the first element of a tuple in b
c = [(x, y, t) for x, y in a for z, t in b if y == z]
import pandas as pd
df = pd.DataFrame(c)
df.to_excel('MyFile.xlsx', header=False, index=False)
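If the sheet should carry the Col A / Col B / Col C headings from the question, one could name the columns and keep the header, for example:

df = pd.DataFrame(c, columns=['Col A', 'Col B', 'Col C'])
df.to_excel('MyFile.xlsx', index=False)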