Using Python to find matching arrays and combine two arrays into one - python

I would like to use Python to find the arrays whose first elements (x[0]) match between set1 and set2, for example [4012642, 0.10869565] in set1 and [4012642, 2] in set2. Then I would like to combine them into one array, dividing set2[1] by set1[1], so it would become [4012642, (2/0.10869565)], i.e. [4012642, 18.40]. I want to do this for each pair in set1 and set2 and put the results into a new array. Any help is greatly appreciated; sorry if I have worded this confusingly.
set1 = [[4012640, 0.014925373], [4012642, 0.10869565], [4012644, 0.40298506], [4012646, 0.04477612], [4012616, 0.6264330499999999], [4012618, 1.128477924], [4012620, 0], [4012622, 0.12820514], [4012624, 0.16417910000000002], [4013328, 0.16666667], [4012626, 0.149253743], [4012658, 0], [4012628, 0.41791046], [4012630, 0.28493894000000003], [4012632, 1.999999953], [4012634, 0.08955224], [4012636, 0], [4012638, 0]]
set2 = [[4012640, 2], [4012642, 2], [4012644, 2], [4012646, 1], [4012616, 5], [4012618, 8], [4012620, 1], [4012622, 2], [4012624, 5], [4013328, 2], [4012626, 6], [4012658, 1], [4012628, 4], [4012630, 8], [4012632, 4], [4012634, 4], [4012636, 1], [4012638, 1]]

Personally, I prefer to use a DataFrame to handle this kind of 'join' question.
import pandas as pd

# build two dataframes from set1 and set2
df1 = pd.DataFrame(columns=['x0', 'x1'])
df1['x0'] = [x[0] for x in set1]
df1['x1'] = [x[1] for x in set1]
df2 = pd.DataFrame(columns=['x0', 'x2'])
df2['x0'] = [x[0] for x in set2]
df2['x2'] = [x[1] for x in set2]
Then call the merge method in pandas to match the two dataframes on column 'x0':
# merge the two dataframes on 'x0'
df = pd.merge(df1, df2, on=['x0'], how='left')
# calculate a new column as 'x2' / 'x1'
df['values'] = df['x2'] / df['x1']
Results:
        x0        x1  x2      values
0  4012640  0.014925   2  134.000001
1  4012642  0.108696   2   18.400000
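If you'd rather stay in plain Python, the same left-join can be sketched with a dict lookup. This is my own sketch, shown on a short subset of the data; entries where the set1 value is 0 are skipped here to avoid division by zero (pandas would give inf for those):

```python
# a plain-Python sketch of the same join, using a dict keyed on x[0]
set1 = [[4012640, 0.014925373], [4012642, 0.10869565], [4012644, 0.40298506]]
set2 = [[4012640, 2], [4012642, 2], [4012644, 2]]

lookup = dict(map(tuple, set2))  # {4012640: 2, 4012642: 2, 4012644: 2}
combined = [[k, lookup[k] / v] for k, v in set1 if k in lookup and v != 0]
print(combined)  # second entry is [4012642, 18.400000...]
```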


How to generate numeric mapping for categorical columns in pandas?

I want to manipulate categorical data using pandas data frame and then convert them to numpy array for model training.
Say I have the following data frame in pandas.
import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})
>>> df2
     c1 c2
0     a  d
1     b  e
2  None  f
And now I want "compress the categories" horizontally as the following:
compressed_categories
0 c1-a, c2-d <--- this could be a string, ex. "c1-a, c2-d" or array ["c1-a", "c2-d"] or categorical data
1 c1-b, c2-e
2 c1-nan, c2-f
Next I want to generate a dictionary/vocabulary based on the unique occurrences plus "nan" columns in compressed_categories, ex:
volcab = {
    "c1-a": 0,
    "c1-b": 1,
    "c1-c": 2,
    "c1-nan": 3,
    "c2-d": 4,
    "c2-e": 5,
    "c2-f": 6,
    "c2-nan": 7,
}
So I can further numerically encode them as follows:
compressed_categories_numeric
0 [0, 4]
1 [1, 5]
2 [3, 6]
So my ultimate goal is to make it easy to convert them to numpy array for each row and thus I can further convert it to tensor.
input_data = np.asarray(df['compressed_categories_numeric'].tolist())
then I can train my model using input_data.
Can anyone please show me an example how to make this series of conversion? Thanks in advance!
To build the volcab dictionary and compressed_categories_numeric, you can use:
import numpy as np

df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)
Output:
>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
c1 c2 compressed_categories_numeric
0 a d [0, 3]
1 b e [1, 4]
2 None f [2, 5]
>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
       [1, 4],
       [2, 5]])
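As a variant (my own suggestion, not part of the answer above): DataFrame.replace with a large mapping can be slow, and a column-wise Series.map performs the same dictionary lookup:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ['a', 'b', None], "c2": ['d', 'e', 'f']})
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
volcab = {k: v for v, k in enumerate(np.unique(df3))}

# Series.map looks each cell up in the dict, column by column, like replace
encoded = df3.apply(lambda col: col.map(volcab)).agg(list, axis=1)
print(encoded.tolist())  # [[0, 3], [1, 4], [2, 5]]
```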

How to group consecutive data in 2d array in python

I have a 2d NumPy array that looks like this:
array([[1, 1],
       [1, 2],
       [2, 1],
       [2, 2],
       [3, 1],
       [5, 1],
       [5, 2]])
and I want to group it and have an output that looks something like this:
Col1 Col2
group 1: 1-2, 1-2
group 2: 3-3, 1-1
group 3: 5-5, 1-2
I want to group the columns based on if they are consecutive.
So, for a unique value in column 1, group the data in the second column if the values are consecutive between rows. Then, for a unique grouping of column 2, group column 1 if it is consecutive between rows.
The result can be thought of as the corner points of a grid. In the above example, group 1 is a square grid, group 2 is a single point, and group 3 is a flat line.
My system won't allow me to use pandas, so I cannot use groupby from that library, but I can use other standard libraries.
Any help is appreciated. Thank you
Here you go ...
Steps are:
Get a list xUnique of unique column 1 values with sort order preserved.
Build a list xRanges of items of the form [col1_value, [col2_min, col2_max]] holding the column 2 ranges for each column 1 value.
Build a list xGroups of items of the form [[col1_min, col1_max], [col2_min, col2_max]] where the [col1_min, col1_max] part is created by merging the col1_value part of consecutive items in xRanges if they differ by 1 and have identical [col2_min, col2_max] value ranges for column 2.
Turn the ranges in each item of xGroups into strings and print with the required row and column headings.
Also package and print as a numpy.array to match the form of the input.
import numpy as np

data = np.array([
    [1, 1],
    [1, 2],
    [2, 1],
    [2, 2],
    [3, 1],
    [5, 1],
    [5, 2]])

xUnique = sorted({pair[0] for pair in data})  # unique column-1 values, in order
xRanges = list(zip(xUnique, [[0, 0] for _ in range(len(xUnique))]))
rows, cols = data.shape
iRange = -1
for i in range(rows):
    if i == 0 or data[i, 0] > data[i - 1, 0]:
        iRange += 1
        xRanges[iRange][1][0] = data[i, 1]
    xRanges[iRange][1][1] = data[i, 1]
xGroups = []
for i in range(len(xRanges)):
    if i and xRanges[i][0] - xRanges[i - 1][0] == 1 and xRanges[i][1] == xRanges[i - 1][1]:
        xGroups[-1][0][1] = xRanges[i][0]
    else:
        xGroups.append([[xRanges[i][0], xRanges[i][0]], xRanges[i][1]])
xGroupStrs = [[f'{a}-{b}' for a, b in row] for row in xGroups]
groupArray = np.array(xGroupStrs)
print(groupArray)
print()
print(f'{"":<10}{"Col1":<8}{"Col2":<8}')
for i, (col1, col2) in enumerate(xGroupStrs):
    print(f'{"group " + str(i) + ":":<10}{col1:<8}{col2:<8}')
Output:
[['1-2' '1-2']
 ['3-3' '1-1']
 ['5-5' '1-2']]
Col1 Col2
group 0: 1-2 1-2
group 1: 3-3 1-1
group 2: 5-5 1-2
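The same idea fits in fewer lines with itertools.groupby from the standard library. This is my own sketch, assuming the input rows are sorted by column 1 then column 2 as in the example:

```python
from itertools import groupby

data = [[1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [5, 1], [5, 2]]

# y-range per unique x value (input assumed sorted by x, then y)
ranges = []
for x, grp in groupby(data, key=lambda p: p[0]):
    ys = [p[1] for p in grp]
    ranges.append((x, [min(ys), max(ys)]))

# merge consecutive x values that share the same y-range
groups = []
for x, yr in ranges:
    if groups and x - groups[-1][0][1] == 1 and groups[-1][1] == yr:
        groups[-1][0][1] = x
    else:
        groups.append([[x, x], yr])

print([[f'{a}-{b}' for a, b in g] for g in groups])
# [['1-2', '1-2'], ['3-3', '1-1'], ['5-5', '1-2']]
```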

Merge and add duplicate integers from a multidimensional array

I have a multidimensional list where the first item is a date and the second is a datetime object that needs to be added together. For example (leaving the second as an integer for simplicity):
[[01/01/2019, 10], [01/01/2019, 3], [02/01/2019, 4], [03/01/2019, 2]]
The resulting array should be:
[[01/01/2019, 13], [02/01/2019, 4], [03/01/2019, 2]]
Does someone have a short way of doing this?
The background to this is vehicle tracking, I have a list of trips performed by vehicle and I want to have a summary by day with a count of total time driven per day.
You should change your data 01/01/2019 to '01/01/2019'.
#naivepredictor suggested a good sample; anyway, if you don't want to import pandas, use this:
my_list = [['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4], ['03/01/2019', 2]]
result_d = {}
for i in my_list:
    result_d[i[0]] = result_d.get(i[0], 0) + i[1]
print(result_d)  # {'01/01/2019': 13, '02/01/2019': 4, '03/01/2019': 2}
print([list(d) for d in result_d.items()])  # [['01/01/2019', 13], ['02/01/2019', 4], ['03/01/2019', 2]]
import pandas as pd
# create dataframe out of the given imput
df = pd.DataFrame(data=[['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4]], columns=['date', 'trip_len'])
# groupby date and sum values for each day
df = df.groupby('date').sum().reset_index()
# output result as list of lists
result = df.values.tolist()
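If the list is already sorted by date, the same summary also falls out of itertools.groupby from the standard library (a sketch of my own, not from either answer; it only merges rows that are adjacent, hence the sortedness assumption):

```python
from itertools import groupby
from operator import itemgetter

trips = [['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4], ['03/01/2019', 2]]

# group consecutive rows sharing a date and sum the second column
summary = [[day, sum(t for _, t in rows)]
           for day, rows in groupby(trips, key=itemgetter(0))]
print(summary)  # [['01/01/2019', 13], ['02/01/2019', 4], ['03/01/2019', 2]]
```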

pandas dataframe exponential decay summation

I have a pandas dataframe,
[[1, 3],
[4, 4],
[2, 8]...
]
I want to create a column that has this:
1*(a)^(3) # = x
1*(a)^(3 + 4) + 4 * (a)^4 # = y
1*(a)^(3 + 4 + 8) + 4 * (a)^(4 + 8) + 2 * (a)^8 # = z
...
Where "a" is some value.
The values 1, 4, 2 come from column one; the repeated 3, 4, 8 come from column two.
Is this possible using some form of transform/apply?
Essentially getting:
[[1, 3, x],
[4, 4, y],
[2, 8, z]...
]
Where x, y, z is the respective sums from the new column (I want them next to each other)
There is a "groupby" that is being done on the dataframe, and this is what I want to do for a given group
If I'm understanding your question correctly, this should work:
import pandas as pd

df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
a = 42
new_lst = []
for n in range(len(df)):
    z = 0
    i = 0
    while i <= n:
        z += df['a'][i] * a ** sum(df['b'][i:n + 1])
        i += 1
    new_lst.append(z)
df['new'] = new_lst
Update:
Saw that you are using pandas and updated with dataframe methods. I'm not sure there's an easy way to do this with apply, since you need a mix of values from different rows; I think this for loop is still the best route.
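That said, the nested loop is O(n²); the same column can be computed vectorised with cumulative sums, since z_n = base^(S_n) * sum_i(a_i * base^(-S_{i-1})), where S_n is the running sum of column b. This is my own reformulation, using base=2 as an example value for the question's constant "a"; note that base**(-S) can underflow or overflow for large exponents:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
base = 2.0  # the constant "a" from the question; example value

S = df['b'].cumsum().to_numpy(dtype=float)   # S_n = b_0 + ... + b_n
P = np.concatenate(([0.0], S[:-1]))          # P_i = S_{i-1}, with P_0 = 0
terms = df['a'].to_numpy() * base ** (-P)    # a_i * base^(-S_{i-1})
df['new'] = base ** S * np.cumsum(terms)
print(df['new'].tolist())  # [8.0, 192.0, 49664.0]
```

For base=2 this matches the loop: row 0 is 1*2^3 = 8, row 1 is 1*2^7 + 4*2^4 = 192, row 2 is 1*2^15 + 4*2^12 + 2*2^8 = 49664.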

produce a matrix of strings on the basis of a list

I want to produce a matrix on the basis of this data I have:
[[0, 1], [1, 0], [0, 2], [1, 1], [2, 0], [0, 3], [1, 2], [2, 1], [3, 0]]
What I want to do is: if the sum inside the square brackets is equal to 1, produce a string variable y_n, where n counts the lists meeting that condition,
and produce yxn if the sum is greater than one, where n counts the number of such strings produced.
So for my data it should produce:
y_1
y_2
yx1
yx2
up to
yx7
So my best attempt is:
if len(gcounter) != 0:
    hg = len(gcounter[0])
else:
    hg = 1
LHS = Matrix(hg, 1, lambda i, j: var('yx%d' % i))
print(LHS)
The data is called gcounter.
It's not giving me an error, but it's not filling LHS up with anything.
I'm not entirely sure I understand what you're doing, but I think this generator does what you want:
def gen_y_strings(data):
    counter_1 = counter_other = 0
    for item in data:
        if sum(item) == 1:
            counter_1 += 1
            yield "y_{}".format(counter_1)
        else:
            counter_other += 1
            yield "yx{}".format(counter_other)
You can run it like this:
for result in gen_y_strings(gcounter):
    print(result)
Which, given the example data, outputs what you wanted:
y_1
y_2
yx1
yx2
yx3
yx4
yx5
yx6
yx7
