I have a text file with letters (tab delimited), and a numpy array (obj) with a few letters (a single row). The text file has rows with different numbers of columns. Some rows in the text file may have multiple copies of the same letter (I would like to consider only a single copy of a letter in each row). Letters in the same row of the text file are assumed to be similar to each other. Also, each letter of the numpy array obj is present in one or more rows of the text file.
Below is an example of the text file (you can download the file from here):
b q a i m l r
j n o r o
e i k u i s
In the above example, the letter o appears twice in the second row, and the letter i appears twice in the third row. I would like to consider only single copies of letters in each row of the text file.
This is an example of obj: obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
I want to compare obj with rows of the text file and form clusters from elements in obj.
This is how I want to do it. For each row of the text file, I want a list which denotes a cluster (in the above example we will have three clusters, since the text file has three rows). For every element of obj, I want to find the rows of the text file where that element is present. Then, I would like to assign the index of that element of obj to the cluster which corresponds to the row of maximum length (row lengths are computed after reducing each row to single copies of its letters).
Below is the Python code that I have written for this task:
import pandas as pd
import numpy as np

data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python').values[:,:].astype('<U1000')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])

for i in range(data.shape[0]):
    globals()['data_row' + str(i).zfill(3)] = []
    globals()['clust' + str(i).zfill(3)] = []
    for j in range(len(obj)):
        if obj[j] in set(data[i, :]): globals()['data_row' + str(i).zfill(3)] += [j]

for i in range(len(obj)):
    globals()['obj_lst' + str(i).zfill(3)] = [0]*data.shape[0]
    for j in range(data.shape[0]):
        if i in globals()['data_row' + str(j).zfill(3)]:
            globals()['obj_lst' + str(i).zfill(3)][j] = len(globals()['data_row' + str(j).zfill(3)])
    indx_max = globals()['obj_lst' + str(i).zfill(3)].index( max(globals()['obj_lst' + str(i).zfill(3)]) )
    globals()['clust' + str(indx_max).zfill(3)] += [i]

for i in range(data.shape[0]): print(globals()['clust' + str(i).zfill(3)])
>> [0]
>> [3]
>> [1, 2, 4]
The above code gives me the right answer. But in my actual work, the text file has tens of thousands of rows and the numpy array has hundreds of thousands of elements, and the code above is not very fast. So, I want to know if there is a better (faster) way to implement this functionality in Python.
You can do it with a merge after a stack on data (staying in pandas), then some groupby operations with nunique and idxmax to get what you want:
#keep data in pandas
data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
#merge to keep only the letters from obj
df = (data.stack().reset_index(0, name='l')
        .merge(pd.DataFrame({'l': obj})).set_index('level_0'))
#get the number of unique elements of obj in each row of data
# and use transform to keep this length along each row of df
df['len'] = df.groupby('level_0').transform('nunique')
#get the result you want in a series
res = (pd.DataFrame({'data_row': df.groupby('l')['len'].idxmax().values})
         .groupby('data_row').apply(lambda x: list(x.index)))
print(res)
data_row
0 [0]
1 [3]
2 [1, 2, 4]
dtype: object
res contains the clusters, with the index being the row in the original data.
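If you need the clusters as plain Python lists in row order (matching the clust lists from the question), a small follow-up sketch (the clusters name is my own, not part of the original answer) could be:

clusters = [res.get(i, []) for i in range(len(data))]
print(clusters)   # [[0], [3], [1, 2, 4]]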
I have a pandas data frame with two columns:
sentence - fo n bar
annotations [B-inv, B-inv, O, I-acc, O, B-com, I-com, I-com]
I want to insert additional 'O' elements in the annotations list in front of each annotation starting with 'B', which will look like this:
[O, B-inv, O, B-inv, O, I-acc, O, O, B-com, I-com, I-com]
' f o n  bar'
Then I want to insert additional whitespace in the sentence in front of each character whose index matches a 'B' annotation in the initial annotation list, i.e. in front of each character of the sentence with an index in the list [0, 1, 5].
Maybe to make it more visually appealing, I should represent it this way:
Initial sentence:
Ind   Sentence char   Annot
0     f               B-inv
1     o               B-inv
2     whitespace      O
3     n               I-acc
4     whitespace      O
5     b               B-com
6     a               I-com
7     r               I-com
End sentence:
Ind   Sentence char   Annot
0     whitespace      O
1     f               B-inv
2     whitespace      O
3     o               B-inv
4     whitespace      O
5     n               I-acc
6     whitespace      O
7     whitespace      O
8     b               B-com
9     a               I-com
10    r               I-com
Updated answer (list comprehension)
from itertools import chain
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
annot, sent = list(map(lambda l: list(chain(*l)), list(zip(*[(['O', a], [' ', s]) if a.startswith('B') else ([a], [s]) for a,s in zip(annot, sent)]))))
print(annot)
print(''.join(sent))
chain from itertools allows you to chain together a list of lists to form a single list. The rest is some clumsy use of zip together with argument unpacking (the * prefix) to get it in one line. map is only used to apply the same operation to both lists.
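For example, the chain step on its own (toy input, not the data above) looks like this:

from itertools import chain
print(list(chain(*[['O', 'B-inv'], ['O'], ['I-acc']])))   # ['O', 'B-inv', 'O', 'I-acc']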
But a more readable version, so you can also follow the steps better, could be:
# find where in the annotations the element starts with 'B'
loc = [a.startswith('B') for a in annot]
# Use this locator to add an element and Merge the list of lists with `chain`
annot = list(chain.from_iterable([['O', a] if l else [a] for a,l in zip(annot, loc)]))
sent = ''.join(chain.from_iterable([[' ', a] if l else [a] for a,l in zip(sent, loc)])) # same on sentence
Note that above, I do not use map, as we process each list separately, and there is less zipping and casting to lists. It is most probably a much cleaner, and hence preferable, solution.
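For reference, the readable version should reproduce the expected result from the question (the double space before 'bar' matches the End sentence table):

print(annot)   # ['O', 'B-inv', 'O', 'B-inv', 'O', 'I-acc', 'O', 'O', 'B-com', 'I-com', 'I-com']
print(sent)    # " f o n  bar"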
Old answer (pandas)
I am not sure it is the most convenient to do this on a DataFrame. It might be easier on a simple list, before converting to a DataFrame.
But anyway, here is a way through it, assuming you don't really have meaningful indices in your DataFrame (so that indices are simply the integer count of each row).
The trick is to use the .str string functions (startswith in this case) to find matching strings in the column Series of interest. You can then loop over the matching indices ([0, 1, 5] in the example) and insert the row with the whitespace and 'O' data at a dummy location (a half index, e.g. 0.5 to place a row before row 1). Sorting by index with .sort_index() will then rearrange all rows in the way you want.
import pandas as pd
import numpy as np

annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
df = pd.DataFrame({'sent': sent, 'annot': annot})
idx = np.argwhere(df.annot.str.startswith('B').values) # find rows where annotations start with 'B'
for i in idx.ravel(): # Loop over the indices before which we want to insert a new row
    df.loc[i-0.5] = [' ', 'O'] # made up indices so that the subsequent sorting will place the row where you want it
df.sort_index().reset_index(drop=True) # this will output the new DataFrame
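As a quick check (a sketch; the out name is mine), the rebuilt frame should match the End sentence table from the question:

out = df.sort_index().reset_index(drop=True)
print(''.join(out.sent))   # " f o n  bar" (note the double space before 'bar')
print(list(out.annot))     # ['O', 'B-inv', 'O', 'B-inv', 'O', 'I-acc', 'O', 'O', 'B-com', 'I-com', 'I-com']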
How can I detect the rows and columns of a dataframe whose elements contain any character other than the desired characters?
The desired characters are A, B, C, a, b, c, 1, 2, 3, &, %, =, /
dataframe -
Col1   Col2   Col3
Abc    Øa     12
bbb    +      }
The output should be the elements Øa, +, and }, and their locations in the dataframe.
I find it really difficult to locate an element for a condition directly in pandas, so I converted the dataframe to a nested list first, then proceeded to work with the list. Try this:
import pandas as pd
import numpy as np

#creating your sample dataframe
array = np.array([['Abc','Øa','12'],['bbb','+','}']])
columns = ['Col1','Col2','Col3']
df = pd.DataFrame(data=array, columns=columns)

#convert dataframe to nested list
pd_list = df.values.tolist()

#return any characters other than the ones in 'var'
all_chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;=>?#[\\]^_`{|}~Ø'
var = 'ABCabc123&%=//'
for a in var:
    all_chars = all_chars.replace(a, "")

#stores previously detected elements to prevent duplicates
temp_storage = []

#loops through the nested list to get the elements' indexes
for x in all_chars:
    for i in pd_list:
        for n in i:
            if x in n:
                #check if element is duplicate
                if n not in temp_storage:
                    temp_storage.append(n)
                    print(f'found {n}: row={pd_list.index(i)}; col={i.index(n)}')
Output:
> found +: row=1; col=1
> found }: row=1; col=2
> found Øa: row=0; col=1
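If the dataframe is large, a vectorized sketch along these lines (an alternative approach, not part of the original answer) may be faster: stack the frame into a Series and flag every cell containing a character outside the allowed set with a regex.

import pandas as pd

df = pd.DataFrame([['Abc', 'Øa', '12'], ['bbb', '+', '}']],
                  columns=['Col1', 'Col2', 'Col3'])

# negated character class: matches any character NOT in the allowed set
allowed = r'[^ABCabc123&%=/]'
stacked = df.stack()
bad = stacked[stacked.str.contains(allowed, regex=True)]
print(bad)  # the MultiIndex gives (row, column) for each offending element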
I'm making a commonality table for some data, and I need to have a subset table with the number of appearances of each value in descending order.
I have a table:
The end result I want is a table where the names do not appear, but the rows are the months ordered by commonality, something like this:
I need it to be in descending order both horizontally and vertically (by number of appearances).
In the end, what worked for me is:
import pandas as pd
import csv

# this function gets a string like "line,num%" and returns the num as a float
def take_second(elem):
    elem = float(elem[elem.find(",") + 1:elem.find("%")])
    return elem

# this function gets a list and sorts it in descending order according to the second variable of every cell
def sorting(lst):
    lst.sort(key=lambda x: take_second(x), reverse=True)
    return lst

data = pd.DataFrame.from_csv('commonality_input.csv')

# calculates the times each value appeared for a given name
a = pd.Series({col: data[col].value_counts() for col in data.columns})
for i, val in enumerate(a):
    # calculates the % of each, and rounds to 2 decimal places
    a[i] = round(100*val/val.sum(), 2)

# converts all of the objects to a list called 'b'
b = pd.DataFrame(a.tolist())

# get the data frame size, and lists of headers for future use
line = list(b.columns.values)
name = list(b.index.values)
cols = len(b.columns)
rows = len(b)

for c in b.columns:
    b[c] = b[c].apply(lambda t: "{},{}%".format(c, t))

# copy the DataFrame to a list of lists, and create a new list without the NaN values
lol = b.values.tolist()
data_list = []
for items in lol:
    data_list.append([x for x in items if 'nan%' not in x])

# sort list values in descending order in every row, add the names back
for l in range(rows):
    sorting(data_list[l])
    data_list[l].insert(0, name.pop(0))

# sort list by value of the highest percentage of the second column in descending order
data_list.sort(key=lambda x: take_second(x[1]), reverse=True)

# write the commonality table to a csv file
result_file = open("test.csv", 'w', newline='')
wr = csv.writer(result_file)
for item in data_list:
    wr.writerow(item)
Note that I'm pretty new to Python, so there must be a more efficient way to write this program.
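One possible simplification (a sketch, assuming commonality_input.csv is laid out as above and read with an explicit index column): value_counts(normalize=True) computes the percentages directly, which would replace the manual loop over a.

import pandas as pd

data = pd.read_csv('commonality_input.csv', index_col=0)
# one Series of rounded percentages per column, largest first
pct = {col: (100 * data[col].value_counts(normalize=True)).round(2)
       for col in data.columns}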
I have a problem running the code below.
data is my dataframe, X is the list of columns for the training data, and L is a list of categorical features with numeric values.
I want to one-hot encode my categorical features, so I do as follows. But a "ValueError: Columns must be same length as key" is thrown (for the last line), and even after long research I still don't understand why.
def turn_dummy(df, prop):
    dummies = pd.get_dummies(df[prop], prefix=prop, sparse=True)
    df.drop(prop, axis=1, inplace=True)
    return pd.concat([df, dummies], axis=1)

L = ['A', 'B', 'C']

for col in L:
    data_final[X] = turn_dummy(data_final[X], col)
It appears that this is a problem of dimensionality. It would be like the following:
Say I have a list like so:
mylist = [0, 0, 0, 0]
It is of length 4. If I wanted to do 1:1 mapping of elements of a new list into that one:
otherlist = ['a', 'b']
for i in range(len(mylist)):
    mylist[i] = otherlist[i]
Obviously this will throw an IndexError, because it's trying to get elements that otherlist just doesn't have.
Much the same is occurring here: you are trying to insert a string (len=1) into a column of length n>1. Try:
data_final[X] = turn_dummy(data_final[X], L)
Assuming len(L) = number_of_rows
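If that doesn't fit your setup, here is a minimal sketch of another way to sidestep the length mismatch (assuming X and L are as described in the question): let the dummy columns widen a separate frame instead of assigning back into the fixed column selection data_final[X].

import pandas as pd

def turn_dummy(df, prop):
    dummies = pd.get_dummies(df[prop], prefix=prop, sparse=True)
    return pd.concat([df.drop(prop, axis=1), dummies], axis=1)

train = data_final[X].copy()   # X: the original list of training columns
for col in L:                  # L: the categorical columns to encode
    train = turn_dummy(train, col)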
Main Problem:
Numpy arrays of the same type and size are not being column-stacked together by np.hstack, np.column_stack, or np.concatenate(axis=1).
Explanation:
I don't understand what properties of a numpy array can change such that numpy.hstack, numpy.column_stack and numpy.concatenate(axis=1) do not work properly. I am having a problem getting my real program to stack by column - it only appends to the rows. Is there some property of a numpy array which would cause this to be true? It doesn't throw an error, it just doesn't do the "right" or "normal" behavior.
I have tried a simple case which works as I would expect it to:
input:
a = np.array([['1', '2'], ['3', '4']], dtype=object)
b = np.array([['5', '6'], ['7', '8']], dtype=object)
np.hstack((a, b))
output:
np.array([['1', '2', '5', '6'], ['3', '4', '7', '8']], dtype=object)
That's perfectly fine by me, and what I want.
However, what I get from my program is this:
First array:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
..., ['908.791', '-0.015765'] ['908.073', '-0.0154842'] []]
Second array (to be added on in columns):
[['29.8989', '26.8556'] ['29.8659', '26.7969'] ['29.902', '29.0183'] ...,
['908.791', '943.621'] ['908.073', '940.529'] []]
What should be the two arrays side by side or in columns:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
..., ['908.791', '943.621'] ['908.073', '940.529'] []]
Clearly, this isn't the right answer.
The module creating this problem is rather long (I will give it at the bottom), but here is a simplification of it which still works (performs the right column stacking) like the first example:
import numpy as np

def contiguous_regions(condition):
    d = np.diff(condition)
    idx, = d.nonzero()
    idx += 1
    if condition[0]:
        idx = np.r_[0, idx]
    if condition[-1]:
        idx = np.r_[idx, condition.size]
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

total_array = np.array([['1', '2'], ['3', '4'], ['strings','here'], ['5', '6'], ['7', '8']], dtype=object)

where_number = np.array(map(is_number, total_array))
contig_ixs = contiguous_regions(where_number)
print contig_ixs

t = tuple(total_array[s[0]:s[1]] for s in contig_ixs)
print t
print np.hstack(t)
It basically looks through an array of lists and finds the longest set of contiguous numbers. I would like to column-stack those sets of data if they are of the same length.
Here is the real module providing the problem:
import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = np.array(map(lambda line: line.rstrip('\n').replace(',',' ').split(), file_data))

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, column_stacked_data_chain))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, file_data))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)

    try:
        # Data lengths (number of rows) for each set of data in the file
        data_lengths = contig[:,1] - contig[:,0]
        # Get the maximum length of data (max number of contiguous rows) in the file
        maxs = np.amax(data_lengths)
        # Find the indices for where this long list of data is (index within the indices array of the file)
        # If there are two equally long lists of data, get both indices
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])

    ###############################################################################################
    ###############################################################################################
    # PROBLEM ORIGINATES HERE

    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]

    # The file data with this longest contiguous chain of numbers
    # If there are multiple sets of data of the same length, they are added in columns
    longest_data_chains = tuple([file_data[i[0]:i[1]] for i in ss])

    print "First array:"
    print longest_data_chains[0]
    print
    print "Second array (to be added on in columns):"
    print longest_data_chains[1]

    column_stacked_data_chain = np.concatenate(longest_data_chains, axis=1)

    print
    print "What should be the two arrays side by side or in columns:"
    print column_stacked_data_chain

    ###############################################################################################
    ###############################################################################################

    xy = np.array(zip(*xy_array), dtype=float)
    return xy
#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit
    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False
UPDATE:
I got it to work with the help of @hpaulj. Apparently the fact that the data was structured like np.array([['1','2'],['3','4']]) in both cases was not sufficient, since in the real case the array had dtype=object and there were some strings in the lists. Therefore, numpy was seeing a 1D array instead of the required 2D array.
The fix was to apply map(float, ...) to every list produced from the readlines output.
Here is what I ended up with:
import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = map(lambda line: line.rstrip('\n').replace(',',' ').split(), file_data)

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, file_data))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, xy_array))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)

    try:
        # Data lengths
        data_lengths = contig[:,1] - contig[:,0]
        # All maximums in contiguous data
        maxs = np.amax(data_lengths)
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])

    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]
    print ss

    # The file data with this longest contiguous chain of numbers
    # Float must be cast to each value in the lists of the contiguous data and cast to a numpy array
    longest_data_chains = np.array([[map(float, n) for n in xy_array[i[0]:i[1]]] for i in ss])

    # If there are multiple sets of data of the same length, they are added in columns
    column_stacked_data_chain = np.hstack(longest_data_chains)

    xy = np.array(zip(*column_stacked_data_chain), dtype=float)
    return xy

#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit
    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False
This function will now take in a file and output the longest contiguous number type data found within it. If there are multiple data sets found with the same length, it column stacks them.
It's the empty list at the end of your arrays that's causing your problem:
>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[1, 2], [3, 4], []])
>>> a.shape
(2L, 2L)
>>> a.dtype
dtype('int32')
>>> b.shape
(3L,)
>>> b.dtype
dtype('O')
Because of that empty list at the end, instead of creating a 2D array numpy creates a 1D object array, with each item holding one of the row lists.
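A minimal sketch of the fix this implies (the toy rows are an assumption, not the real file data): drop the empty rows before building the array, so numpy sees a proper 2D array and hstack stacks by columns again.

import numpy as np

rows = [['1', '2'], ['3', '4'], []]                     # trailing empty row, as in the question
bad = np.array(rows, dtype=object)                      # 1D object array of lists, shape (3,)
good = np.array([r for r in rows if r], dtype=object)   # drop empty rows first -> shape (2, 2)
print(bad.shape)    # (3,)
print(good.shape)   # (2, 2)
print(np.hstack((good, good)))                          # now stacks by columns as expected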