Appending separate dataframes, each as a column - python

I am currently working on the following:
from sklearn.cluster import KMeans
import pandas as pd

# data - a dataframe with the correct index
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data_values)
    wcss.append(kmeans.inertia_)

kmeans = KMeans(n_clusters=2).fit(data_values)
y = kmeans.fit_predict(data_values)  # prediction of k
df = pd.DataFrame(y, index=data.index)
....
# at this point I have multiple dicts
Example of y:
[1 2 3 4 5 2 2 5 1 0 0 1 0 0 1 0 1 4 4 4 3 1 0 0 1 0 0 ...]
f = pd.DataFrame(y, columns=[buster])
f.to_csv('busters.csv', mode='a')
(y holds the cluster labels after determination.)
I don't know how I got stuck on this. I am iterating over 20 dataframes, each consisting of one column with values from 1-9. The index is irrelevant. I am trying to append all the frames together as columns, but instead they just get written one after the other. If I add ".T" to transpose, I still get rows with irrelevant values as the index, which I can't remove because they are actually the headers.
Needed result

If the dicts produced in each iteration look like {'Buster1': [0, 2, 2, 4, 5]}, {'Buster2': [1, 2, 3, 4, 5]}, ... (using 5 elements here for illustration), and all the lists, i.e., the values in the dicts, have the same number of elements (as is the case in your example), you could collect everything into a single dict and pass it to pd.DataFrame directly. (You may also want to take a look at pandas.DataFrame.from_dict.)
You may have lists with more than 5 elements, more than 3 dicts (and thus columns), and you will be generating the dicts with a loop, but the code below should be sufficient for getting the idea.
>>> import pandas as pd
>>>
>>> d = {}
>>> # update d in every iteration
>>> d.update({'Buster 1': [0, 2, 2, 4, 5]})
>>> d.update({'Buster 2': [1, 2, 3, 4, 5]})
>>> # ...
>>> d.update({'Buster n': [0, 9, 3, 0, 0]})
>>>
>>> pd.DataFrame(d, columns=d.keys())
   Buster 1  Buster 2  Buster n
0         0         1         0
1         2         2         9
2         2         3         3
3         4         4         0
4         5         5         0
If you have the keys, e.g., 'Buster 1', and values, e.g., [0, 2, 2, 4, 5], separated, as I believe is the case, you can simplify the above (and make it more efficient) by replacing d.update({'Buster 1': [0, 2, 2, 4, 5]}) with d['Buster 1']=[0, 2, 2, 4, 5].
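For instance, a minimal sketch of building the dict inside the loop (the names and label lists below are made-up placeholders, not your actual data):

import pandas as pd

names = ['Buster 1', 'Buster 2', 'Buster n']                        # hypothetical column names
label_lists = [[0, 2, 2, 4, 5], [1, 2, 3, 4, 5], [0, 9, 3, 0, 0]]   # hypothetical label lists

d = {}
for name, labels in zip(names, label_lists):
    d[name] = labels   # same effect as d.update({name: labels}), but cheaper

df = pd.DataFrame(d, columns=d.keys())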
I included columns=d.keys() because depending on your Python and pandas version the ordering of the columns may not be as you expect it to be. You can control the ordering of the columns by specifying the order in which you provide the keys. For example:
>>> pd.DataFrame(d, columns=sorted(d.keys(),reverse=True))
   Buster n  Buster 2  Buster 1
0         0         1         0
1         9         2         2
2         3         3         2
3         0         4         4
4         0         5         5
Although it may not apply to your use case, if you do not want to print the index, you can take a look at How to print pandas DataFrame without index.
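In case that link goes stale, a minimal sketch of one approach discussed there is DataFrame.to_string with index=False:

>>> print(pd.DataFrame(d, columns=d.keys()).to_string(index=False))

which prints the frame without the index column.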

Related

how to count the match number between two dataframe fast?

I'm writing a program to calculate the number of matching items between two dataframes.
For example,
A is the dataframe:
A = pd.DataFrame({'pick_num1':[1, 2, 3], 'pick_num2':[2, 3, 4], 'pick_num3':[4, 5, 6]})
B is the answer I want to match, like:
B = pd.DataFrame({'ans_num1':[1, 2, 3], 'ans_num2':[2, 3, 4], 'ans_num3':[4, 5, 6], 'ans_num4':[7, 8, 1], 'ans_num5':[9, 1, 9]})
DataFrame A:
   pick_num1  pick_num2  pick_num3  match_num
0          1          2          4          2
1          2          3          5          2
2          3          4          6          2
DataFrame B:
   ans_num1  ans_num2  ans_num3  ans_num4  ans_num5
0         1         2         4         7         9
1         2         3         5         8         1
2         3         4         6         1         9
and I want to append a new column 'match_num' at the end of A, as shown in DataFrame A above.
I have tried to write a mapping function to compare and count, but it is not fast when the dataframes are huge. The functions are below:
def win_prb_func(df1, p_name):
    df1['match_num'] += np.sum(pd.concat([df1[p_name]]*5, axis=1).values == df1[open_ball_name_ls].values, 1)
    return df1

def compute_win_prb(df1):
    return list(map(lambda p_name: win_prb_func(df1, p_name), pick_name_ls))

df1 = pd.concat([A, B], axis=1)
df1['match_num'] = 0  # initialise the count column that win_prb_func increments
result_df = compute_win_prb(df1)
where pick_name_ls is ['pick_num1', 'pick_num2', 'pick_num3'], and open_ball_name_ls is ['ans_num1', 'ans_num2', 'ans_num3', 'ans_num4', 'ans_num5'].
I'm wondering: is it possible to make the computation faster or smarter than what I did?
Currently the performance is about 0.015626192092895508 seconds.
Thank you for helping me!
You can use broadcasting instead of concatenating the columns:
def win_prb_func(df1, p_name):
    df1['match_num'] += np.sum(df1[p_name].values[:, np.newaxis] == df1[open_ball_name_ls].values, 1)
    return df1
Since df1[p_name].values returns a 1-D array, you have to turn it into a column vector by adding a new axis. It only takes me 0.004 seconds.
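To illustrate what the broadcasting does, here is a small sketch with made-up toy arrays (not your actual frames):

import numpy as np

picks = np.array([1, 2, 3])                   # one pick column, shape (3,)
answers = np.array([[1, 2, 4, 7, 9],
                    [2, 3, 5, 8, 1],
                    [3, 4, 6, 1, 9]])         # answer columns, shape (3, 5)

# picks[:, np.newaxis] has shape (3, 1), so the comparison broadcasts to (3, 5)
# without materialising five copies of the pick column.
print(np.sum(picks[:, np.newaxis] == answers, 1))   # [1 1 1]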

repeat indices with different repeat values in numpy

I'm looking for an efficient way to do the following with Numpy:
Given an array counts of non-negative integers containing, for instance:
[3, 1, 0, 6, 3, 2]
I would like to generate another array containing the indices of the first one, where the index i is repeated counts[i] times:
[0 0 0 1 3 3 3 3 3 3 4 4 4 5 5]
My problem is that this array is potentially very large, and I'm looking for a vectorized (or otherwise fast) way to do this.
You can do it with numpy.repeat:
import numpy as np
arr = np.array([3, 1, 0, 6, 3, 2])
repix = np.repeat(np.arange(arr.size), arr)
print(repix)
Output:
[0 0 0 1 3 3 3 3 3 3 4 4 4 5 5]

Groupwise sorting in pandas

I want to sort an array within the group boundaries defined in another array. The groups are not presorted in any way and need to remain unchanged after the sorting. In numpy terms it would look like this:
import numpy as np
def groupwise_sort(group_idx, a, reverse=False):
    sortidx = np.lexsort((-a if reverse else a, group_idx))
    # Reverse the sorting back into grouped order, but preserving groupwise sorting
    revidx = np.argsort(np.argsort(group_idx, kind='mergesort'), kind='mergesort')
    return a[sortidx][revidx]
group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
groupwise_sort(group_idx, a)
# >>> array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(group_idx, a, reverse=True)
# >>> array([3, 7, 1, 5, 4, 9, 2, 5, 1])
How can I do the same with pandas? I saw df.groupby() and df.sort_values(), though I couldn't find a straightforward way to achieve the same sorting, and, if possible, a fast one.
Let us first set the stage:
import pandas as pd
import numpy as np
group_idx = np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a = np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
df = pd.DataFrame({'group': group_idx, 'values': a})
df
# group values
#0 3 3
#1 2 2
#2 3 1
#3 2 7
#4 2 4
#5 1 5
#6 2 5
#7 1 9
#8 1 1
To get a dataframe sorted by group and values (within groups):
df.sort_values(["group", "values"])
# group values
#8 1 1
#5 1 5
#7 1 9
#1 2 2
#4 2 4
#6 2 5
#3 2 7
#2 3 1
#0 3 3
To sort the values in descending order, use ascending = False. To apply different orders to different columns, you can supply a list:
df.sort_values(["group", "values"], ascending = [True, False])
# group values
#7 1 9
#5 1 5
#8 1 1
#3 2 7
#6 2 5
#4 2 4
#1 2 2
#0 3 3
#2 3 1
Here, groups are sorted in ascending order, and the values within each group are sorted in descending order.
To only sort values for contiguous rows belonging to the same group, create a new group indicator:
(I keep this in here for reference since it might be helpful for others. I wrote this in an earlier version before the OP clarified his question in the comments.)
df['new_grp'] = (df.group.diff(1) != 0).astype('int').cumsum()
df
# group values new_grp
#0 3 3 1
#1 2 2 2
#2 3 1 3
#3 2 7 4
#4 2 4 4
#5 1 5 5
#6 2 5 6
#7 1 9 7
#8 1 1 7
We can then easily sort with new_grp instead of group, leaving the original order of groups untouched.
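For example, a small sketch of that, using the frame built above:

# sorts values only within each contiguous run of equal group values,
# leaving the runs themselves where they are
df.sort_values(["new_grp", "values"])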
Ordering within groups but keeping the group-specific row positions:
To sort the elements of each group but keep the group-specific positions in the dataframe, we need to keep track of the original row numbers. For instance, the following will do the trick:
# First, create an indicator for the original row-number:
df["ind"] = range(len(df))
# Now, sort the dataframe as before
df_sorted = df.sort_values(["group", "values"])
# sort the original row-numbers within each group
newindex = df.groupby("group").apply(lambda x: x.sort_values(["ind"]))["ind"].values
# assign the sorted row-numbers to the sorted dataframe
df_sorted["ind"] = newindex
# Sort based on the row-numbers:
sorted_asc = df_sorted.sort_values("ind")
# compare the resulting order of values with your desired output:
np.array(sorted_asc["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
This is easier to test and profile when written up in a function, so let's do that:
def sort_my_frame(frame, groupcol = "group", valcol = "values", asc = True):
    frame["ind"] = range(len(frame))
    frame_sorted = frame.sort_values([groupcol, valcol], ascending = [True, asc])
    ind_sorted = frame.groupby(groupcol).apply(lambda x: x.sort_values(["ind"]))["ind"].values
    frame_sorted["ind"] = ind_sorted
    frame_sorted = frame_sorted.sort_values(["ind"])
    return frame_sorted.drop(columns = "ind")
np.array(sort_my_frame(df, "group", "values", asc = True)["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
np.array(sort_my_frame(df, "group", "values", asc = False)["values"])
# array([3, 7, 1, 5, 4, 9, 2, 5, 1])
Note that the latter results match your desired outcome.
I am sure this can be written up in a more succinct way. For instance, if the index of your dataframe is already ordered, you can use it instead of the indicator ind I create (i.e., following #DJK's comment, we can use sort_index instead of sort_values and avoid assigning an additional column). In any case, the above highlights one possible solution and how to approach it. An alternative would be to use your numpy functions and wrap the output in a pd.DataFrame.
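For what it's worth, here is one hedged sketch of such a more succinct variant (my own attempt, not part of the original answer): stable-sort by group to find which row labels each group occupies, then drop the groupwise-sorted values back onto those labels.

def sort_within_groups(frame, groupcol="group", valcol="values", asc=True):
    # Row labels occupied by each group, in original within-group order
    # (mergesort keeps the sort stable).
    target_index = frame.sort_values(groupcol, kind="mergesort").index
    # Values sorted within each group.
    sorted_vals = frame.sort_values([groupcol, valcol],
                                    ascending=[True, asc])[valcol].values
    out = frame.copy()
    out[valcol] = pd.Series(sorted_vals, index=target_index).reindex(frame.index)
    return out

np.array(sort_within_groups(df)["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])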
Pandas is built on top of numpy. Assuming a dataframe like so:
df
Out[21]:
group values
0 3 3
1 2 2
2 3 1
3 2 7
4 2 4
5 1 5
6 2 5
7 1 9
8 1 1
Call your function.
groupwise_sort(df.group.values, df['values'].values)
Out[22]: array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(df.group.values, df['values'].values, reverse=True)
Out[23]: array([3, 7, 1, 5, 4, 9, 2, 5, 1])

Preparing variable-length data for sklearn

Since this is a complicated problem (at least for me), I will try to keep this as brief as possible.
My data is of the form
import pandas as pd
import numpy as np
# edit: a1 and a2 are linked as they are part of the same object
a1 = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]], dtype=object)  # dtype=object needed for ragged rows on recent numpy
a2 = np.array([[5, 6, 5], [2, 3], [3, 4, 8, 1]], dtype=object)
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a1=a1.tolist(),a2=a2.tolist(), b=b, y=y))
a1 a2 b y
0 [1, 2, 3] [5, 6, 5] 6 0
1 [4, 5] [2, 3] 15 1
2 [7, 8, 9, 10] [3, 4, 8, 1] 24 1
which I would like to use in sklearn for classification, e.g.
from sklearn import tree
X = df[['a1', 'a2', 'b']]
Y = df['y']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[2., 2.]]))
However, while pandas can handle lists as entries, sklearn, by design, cannot. In this example clf.fit will raise ValueError: setting an array element with a sequence, for which you can find plenty of answers.
But how do you deal with such data?
I tried to split the data up into multiple columns (i.e. a1[0] ... a1[3] - code for that is a bit lengthy), but a1[3] will be empty (NaN, 0 or whatever invalid value you think of). Imputation does not make sense here, since no value is supposed to be there.
Of course, such a procedure has an impact on the result of the classification as the algorithm might pick up the "zero" value as something meaningful.
If the dataset is large enough, so I thought, it might be worth splitting it up into subsets with equal lengths of a1. But this procedure can reduce the power of the classification algorithm, since the length of a1 might itself help to distinguish between classes.
I also thought of using warm start for algorithms that support it (e.g. Perceptron) and fitting it to data split by the length of a1. But this would surely fail, would it not? The datasets would have different numbers of features, so I assume something would go wrong.
Solutions to this problem surely must exist and I've simply not found the right place in the documentation.
Let's assume for a second that those numbers are numerical categories.
What you can do is transform column 'a' into a set of binary columns, each of which corresponds to a possible value of 'a'.
Taking your example code, we would:
import pandas as pd
import numpy as np
a = np.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]], dtype=object)  # dtype=object for the ragged rows
b = np.array([6, 15, 24])
y = np.array([0, 1, 1])
df = pd.DataFrame(dict(a=a.tolist(),b=b,y=y))
from sklearn.preprocessing import MultiLabelBinarizer
MLB = MultiLabelBinarizer()
df_2 = pd.DataFrame(MLB.fit_transform(df['a']), columns=MLB.classes_)
df_2
   1  2  3  4  5  7  8  9  10
0  1  1  1  0  0  0  0  0   0
1  0  0  0  1  1  0  0  0   0
2  0  0  0  0  0  1  1  1   1
Then, we can just concat the old and new data:
new_df = pd.concat([df_2, df.drop(columns='a')], axis=1)
   1  2  3  4  5  7  8  9  10   b  y
0  1  1  1  0  0  0  0  0   0   6  0
1  0  0  0  1  1  0  0  0   0  15  1
2  0  0  0  0  0  1  1  1   1  24  1
Please note that if you have a training and a test set, it would be wise to first concat them, do the transform, and then separate them again (a small sketch follows below). That's because one of the data sets can contain terms that do not appear in the other.
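A minimal sketch of that advice (train_df and test_df are hypothetical frames with the same 'a' column of lists, not part of the original example):

# fit the binarizer on the union of both sets so no label is unseen
combined = pd.concat([train_df['a'], test_df['a']], ignore_index=True)
MLB = MultiLabelBinarizer()
binarized = MLB.fit_transform(combined)
train_bin = binarized[:len(train_df)]     # first block goes back to the training set
test_bin = binarized[len(train_df):]      # remainder goes back to the test set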
Hope that helps
Edit:
If you are worried this might make your df too big, it's perfectly okay to apply PCA to the binarized variables. It will reduce cardinality while maintaining an arbitrary amount of variance/correlation.
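A minimal sketch of that idea (not part of the original answer; df_2 is the binarized frame built above, and the number of components is an arbitrary choice for illustration):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)            # arbitrary; pick what keeps enough variance
reduced = pca.fit_transform(df_2)    # shape (3, 2) instead of (3, 9)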
Sklearn likes the data as a 2-D array, i.e. of shape (batch_size, features).
The simplest solution is to prepare one feature vector per sample by concatenating the arrays with numpy.concatenate, and then pass that feature vector to sklearn. As long as the resulting length is the same for every row, this should work.

Pandas : determine mapping from unique rows to original dataframe

Given the following inputs:
In [18]: input
Out[18]:
   1  2   3  4
0  1  5   9  1
1  2  6  10  2
2  1  5   9  1
3  1  5   9  1
In [26]: df = input.drop_duplicates()
Out[26]:
   1  2   3  4
0  1  5   9  1
1  2  6  10  2
How would I go about getting an array that has the indices of the rows from the subset that are equivalent, eg:
resultant = [0, 1, 0, 0]
I.e. the '1' here is basically stating that (row[1] in input) == (row[1] in df). Since there are fewer unique rows, there will be multiple values in 'resultant' that point to the same row in df, i.e. (row[k] in input) == (row[k+N] in input) == (row[1] in df) could be a case.
I am looking for the actual row-number mapping from input to df.
While this example is trivial, in my case I have a ton of dropped rows that might map to one index.
Why do I want this? I am training an autoencoder-type system where the target sequence is non-unique.
One way would be to treat it as a groupby on all columns:
>>> input.groupby(list(input.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:
>>> ds = input.sort_values(list(input.columns))
>>> eqs = (ds != ds.shift()).all(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}
This seems like the right data structure to me, but if you really do want an array with the group ids, that's easy too, e.g.
>>> eqs.sort_index() - 1
0 0
1 1
2 0
3 0
dtype: int64
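A hedged alternative for getting that array directly (not in the original answer, and assuming a pandas version that has GroupBy.ngroup): with sort=False the groups are numbered in order of first appearance, which matches the row order of input.drop_duplicates().

>>> input.groupby(list(input.columns), sort=False).ngroup().values
array([0, 1, 0, 0])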
Don't have pandas installed on this computer, but I think you could use df.iterrows() like:
def find_matching_row(row, df_slimmed):
    for index, slimmed_row in df_slimmed.iterrows():
        if slimmed_row.equals(row[slimmed_row.index]):
            return index

def rows_mappings(df, df_slimmed):
    for _, row in df.iterrows():
        yield find_matching_row(row, df_slimmed)

list(rows_mappings(input, df))
This is if you are interested in generating the resultant list in your example; I don't quite follow the latter part of your reasoning.
