I have two arrays (A and B) with about 50,000 values each. Every value represents an ID. I want to create a pandas dataframe with three columns: col1, values from array A; col2, values from array B; col3, a string with the label "unique" or "duplicate". Within each array the IDs are unique.
The arrays are of different lengths, so I can't do something like this to get started:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 6, 7, 8, 9, 10])
pd.DataFrame({'a': a, 'b': b})
I was then thinking of creating a different pandas dataframe, also with three columns: one for the ID, another for which array the ID comes from (a or b), and a third built by grouping on ID and counting occurrences; if the count is >= 2 we have a duplicate.
But I couldn't figure out how to stack two numpy arrays after one another in the same column (like rbind in R) and at the same time create the other column based on which array the value comes from.
Most likely there are far better solutions than those I have suggested above. Any ideas?
For finding duplicate elements in two arrays, use numpy.intersect1d:
In [458]: a = np.array([1, 2, 3, 4, 5])
In [459]: b = np.array([5, 6, 7, 8, 9, 10])
In [462]: np.intersect1d(a,b)
Out[462]: array([5])
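To get from there to the three-column layout the question describes (stacking the IDs like rbind and labelling duplicates), a minimal sketch; the column names id, source and label are placeholders:
import numpy as np
import pandas as pd

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 6, 7, 8, 9, 10])

# Stack both arrays into one ID column and record which array each ID came from.
ids = np.concatenate([a, b])
source = np.repeat(['a', 'b'], [len(a), len(b)])

# An ID is a duplicate if it occurs in both arrays.
dupes = np.intersect1d(a, b)
df = pd.DataFrame({'id': ids, 'source': source})
df['label'] = np.where(df['id'].isin(dupes), 'duplicate', 'unique')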
Convert the arrays into Series and then concat them to create the dataframe:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 6, 7, 8, 9, 10])
s1 = pd.Series(a, name='a')
s2 = pd.Series(b, name='b')
pd.concat([s1, s2], axis=1)
a b
0 1.0 5
1 2.0 6
2 3.0 7
3 4.0 8
4 5.0 9
5 NaN 10
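Note that column a is upcast to float because of the NaN padding. If keeping integers matters, a small variation using pandas' nullable Int64 dtype avoids the upcast:
s1 = pd.Series(a, name='a', dtype='Int64')
s2 = pd.Series(b, name='b', dtype='Int64')
pd.concat([s1, s2], axis=1)  # column a stays integer, with <NA> in the padded row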
Try with merge + indicator:
out = pd.DataFrame({'a': a}).merge(pd.DataFrame({'b': b}),
                                   left_on='a', right_on='b',
                                   indicator=True, how='outer')
Out[210]:
a b _merge
0 1.0 NaN left_only
1 2.0 NaN left_only
2 3.0 NaN left_only
3 4.0 NaN left_only
4 5.0 5.0 both
5 NaN 6.0 right_only
6 NaN 7.0 right_only
7 NaN 8.0 right_only
8 NaN 9.0 right_only
9 NaN 10.0 right_only
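From there, the question's label column is a simple mapping of the indicator (the name label is a placeholder):
out['label'] = np.where(out['_merge'] == 'both', 'duplicate', 'unique')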
I have a dataframe like the one below
d = {"to_explode": [[1, 2, 3], [4, 5], [6, 7, 8, 9]], "numbers": [3, 2, 4]}
df = pd.DataFrame(data=d)
to_explode numbers
0 [1, 2, 3] 3
1        [4, 5]        2
2  [6, 7, 8, 9]        4
I want to call DataFrame.explode on the list-like column, but I want to divide the data in the other column accordingly.
In this example, the value in the numbers column for the first row would be replaced with 1, i.e. 3 / 3 (the corresponding number of items in the to_explode column).
How would I do this please?
You need to perform the computation first (get each list's length with str.len, which also works on list values), then explode:
out = (df
       .assign(numbers=df['numbers'].div(df['to_explode'].str.len()))
       .explode('to_explode')
)
Output:
to_explode numbers
0 1 1.0
0 2 1.0
0 3 1.0
1 4 1.0
1 5 1.0
2 6 1.0
2 7 1.0
2 8 1.0
2 9 1.0
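An equivalent route, in case you prefer to explode first (a sketch): after explode, each original row's index repeats once per list element, so dividing by the per-index group size gives the same result.
out = df.explode('to_explode')
out['numbers'] = out['numbers'] / out.groupby(level=0)['numbers'].transform('size')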
I have a dataframe with these characteristics (the indexes are float values):
import pandas as pd
d = {'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'D': ['one', 'one', 'one', 'one', 'one', 'two', 'two', 'two', 'two', 'two']}
df = pd.DataFrame(data=d, index=[50.0, 50.2, 50.4, 50.6, 50.8,
                                 51.0, 51.2, 51.4, 51.6, 51.8])
df
A B C D
50.0 1 1 1 one
50.2 2 2 2 one
50.4 3 3 3 one
50.6 4 4 4 one
50.8 5 5 5 one
51.0 6 6 6 two
51.2 7 7 7 two
51.4 8 8 8 two
51.6 9 9 9 two
51.8 10 10 10 two
And a list of offsets with these values (they are also floats):
offsets = [[0.4, 0.6, 0.8], [0.2, 0.4, 0.6]]
I need to iterate through columns A, B and C of my dataframe, grouped by the categorical values in column D, replacing the last values of columns A, B and C with nan according to their indexes in relation to the offsets in my list, resulting in a dataframe like this:
A B C D
50.0 1 1 1 one
50.2 2 2 nan one
50.4 3 nan nan one
50.6 nan nan nan one
50.8 nan nan nan one
51.0 6 6 6 two
51.2 7 7 7 two
51.4 8 8 nan two
51.6 9 nan nan two
51.8 nan nan nan two
Each offset says how many values must be set to nan, counting from the bottom of the group up. For example, offsets[0][0] = 0.4, so for column A when D == 'one' the two bottom values must be set to nan (rows 3 and 4: 50.8 - 0.4 = 50.4, and 50.4 itself doesn't change). For A when D == 'two', offsets[1][0] = 0.2, so one bottom value must be set to nan (row 9: 51.8 - 0.2 = 51.6, and 51.6 doesn't change). offsets[0][1] = 0.6, so for column B when D == 'one' the three bottom values must be set to nan (rows 2, 3 and 4: 50.8 - 0.6 = 50.2, and 50.2 doesn't change). For B when D == 'two', offsets[1][1] = 0.4, so two bottom values must be set to nan (rows 8 and 9: 51.8 - 0.4 = 51.4, and 51.4 doesn't change). Column C works the same way.
Any idea how to do this? A quick comment: I want to replace these values in the dataframe itself, without creating a new one.
One approach is to use apply to set the last values of each column to NaN:
import pandas as pd

# toy data
df = pd.DataFrame(data={'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'D': ['one', 'one', 'one', 'one', 'one', 'two', 'two', 'two', 'two', 'two']})
offsets = [2, 3, 4]
offset_lookup = dict(zip(df.columns[:3], offsets))

def funny_shift(x, ofs=None):
    """Shift each column by the offset given in the ofs parameter."""
    for column, offset in ofs.items():
        # blank out the last `offset` rows of this column within the group
        x.loc[x.index[-1 * offset:], column] = None
    return x
df.loc[:, ["A", "B", "C"]] = df.groupby("D").apply(funny_shift, ofs=offset_lookup)
print(df)
Output
A B C D
0 1.0 1.0 1.0 one
1 2.0 2.0 NaN one
2 3.0 NaN NaN one
3 NaN NaN NaN one
4 NaN NaN NaN one
5 6.0 6.0 6.0 two
6 7.0 7.0 NaN two
7 8.0 NaN NaN two
8 NaN NaN NaN two
9 NaN NaN NaN two
UPDATE
If you have multiple updates per group, you could do:
offsets = [[2, 3, 4], [1, 2, 3]]
offset_lookup = (dict(zip(df.columns[:3], offset)) for offset in offsets)

def funny_shift(x, ofs=None):
    """Shift each column by the next offset mapping drawn from the ofs generator."""
    current = next(ofs)
    for column, offset in current.items():
        x.loc[x.index[-1 * offset:], column] = None
    return x

df.loc[:, ["A", "B", "C"]] = df.groupby("D").apply(funny_shift, ofs=offset_lookup)
print(df)
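Note that the generator version relies on groupby visiting the groups in the same order as the offsets list (and on apply calling the function exactly once per group). A sketch of an order-independent variant that keys the offsets by group label instead; funny_shift_named and offsets_by_group are names introduced here:
offsets_by_group = {'one': dict(zip(df.columns[:3], [2, 3, 4])),
                    'two': dict(zip(df.columns[:3], [1, 2, 3]))}

def funny_shift_named(x, ofs=None):
    """Look up this group's offsets via the group key exposed as x.name."""
    for column, offset in ofs[x.name].items():
        x.loc[x.index[-offset:], column] = None
    return x

df.loc[:, ["A", "B", "C"]] = df.groupby("D").apply(funny_shift_named, ofs=offsets_by_group)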
I'd like to cut/group a pandas DataFrame according to a start and a stop column, but only in the case of a start followed by a stop.
I would like the range of indexes from a non-zero 'start' value to a non-zero 'stop' value, but only if the non-zero 'start' value is followed next by a non-zero 'stop' value, running through the indices from top to bottom.
I attached some code creating a simplified version of the problem.
import numpy as np
import pandas as pd

col1 = np.zeros(10)
col2 = np.zeros(10)
col1[[0, 1, 5, 8]] = 1
col2[[3, 6, 7, 9]] = 1
df = pd.DataFrame({'start': col1, 'stop': col2})
The desired output would group the indexes somewhat like:
[(1,2,3), (5,6), (8,9)]
Additional info in case this would simplify things:
Merging the columns would be fine.
My original data frame has a pd.TimedeltaIndex.
First we need to look at the intervals of start and stop and find out which are "valid" interval ends:
>>> ends = df.index.to_series().where(df['stop'].ne(0))
>>> starts = df.index.to_series().where(df['start'].ne(0))
>>> ends
0 NaN
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
6 6.0
7 7.0
8 NaN
9 9.0
dtype: float64
>>> starts
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
6 NaN
7 NaN
8 8.0
9 NaN
dtype: float64
Now we can try to get for each valid start the next valid end:
>>> next_end = ends.bfill().rename('end')
>>> valid_starts = starts.dropna().rename('start')
>>> candidates = valid_starts.to_frame().join(next_end, how='left')
>>> candidates
start end
0 0.0 3.0
1 1.0 3.0
5 5.0 6.0
8 8.0 9.0
Here we see that there is an issue with the interval starting at 0: another interval starts later (at 1) so [0, 3] is not valid and we should only keep [1, 3]. This could be done with groupby + max for example:
>>> intervals = candidates.groupby('end')['start'].max().reset_index().astype(int)
>>> intervals
end start
0 3 1
1 6 5
2 9 8
Finally generating the list of indexes from the endpoints is easy:
>>> intervals.agg(lambda s: list(range(s['start'], s['end'] + 1)), axis='columns')
0 [1, 2, 3]
1 [5, 6]
2 [8, 9]
dtype: object
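Or, to match the tuple form from the question exactly:
>>> [tuple(range(s, e + 1)) for s, e in zip(intervals['start'], intervals['end'])]
[(1, 2, 3), (5, 6), (8, 9)]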
I am having trouble filling Pandas dataframes with values from lists of unequal lengths.
nx_lists_into_df is a list of numpy arrays.
I get the following error:
ValueError: Length of values does not match length of index
The code is below:
from itertools import zip_longest
from numpy import array
import pandas as pd

# Column headers
df_cols = ["f1", "f2"]

# Create one dataframe for each sheet
df1 = pd.DataFrame(columns=df_cols)
df2 = pd.DataFrame(columns=df_cols)

# Create list of dataframes to iterate through
df_list = [df1, df2]

# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]

# Loop through each sheet (i.e. each round of k folds)
for df, test_index_list in zip_longest(df_list, nx_lists_into_df):
    counter = -1
    # Loop through each column in that sheet (i.e. each fold)
    for col in df_cols:
        print(col)
        counter += 1
        # Add 1 to each index value to start indexing at 1
        df[col] = test_index_list[counter] + 1
Thank you for your help.
Edit: This is how the result should hopefully look:
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
We'll leverage pd.Series to attach an appropriate index, which lets us use the pd.DataFrame constructor without it complaining about unequal lengths.
df1, df2 = (
    pd.DataFrame(dict(zip(df_cols, map(pd.Series, d))))
    for d in nx_lists_into_df
)
print(df1)
f1 f2
0 0 2.0
1 1 5.0
2 3 6.0
3 4 8.0
4 7 NaN
print(df2)
f1 f2
0 0 3.0
1 1 4.0
2 2 5.0
3 6 8.0
4 7 NaN
Setup
from numpy import array
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
# Column headers
df_cols = ["f1","f2"]
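If the one-liner above feels dense, the same construction unrolled for df1 (a sketch; d0 is just a local name):
d0 = nx_lists_into_df[0]                      # the two arrays for the first sheet
df1 = pd.DataFrame({'f1': pd.Series(d0[0]),   # pd.Series supplies an index, so the
                    'f2': pd.Series(d0[1])})  # shorter column is padded with NaN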
You could predefine the size of your DataFrames by setting the index range to the length of the longest column you want to add (or any size bigger than the longest column), like so:
df1 = pd.DataFrame(columns=df_cols, index=range(5))
df2 = pd.DataFrame(columns=df_cols, index=range(5))
print(df1)
f1 f2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
(df2 is the same)
The DataFrame will be filled with NaNs automatically.
Then you use .loc to access each entry separately like so:
for x in range(len(nx_lists_into_df)):
    for col_idx, y in enumerate(nx_lists_into_df[x]):
        df_list[x].loc[range(len(y)), df_cols[col_idx]] = y
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
The first loop iterates over the first dimension of your array (or the number of DataFrames you want to create).
The second loop iterates over the column values for the DataFrame, where y are the values for the current column and df_cols[col_idx] is the respective column (f1 or f2).
Since the row & col indices are the same size as y, you don't get the length mismatch.
Also check out the enumerate(iterable, start=0) function to get around those "counter" variables.
Hope this helps.
If I understand correctly, this is possible via pd.concat.
But see #pir's solution for an extendable version.
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
df1 = pd.concat([pd.DataFrame({'A': nx_lists_into_df[0][0]}),
                 pd.DataFrame({'B': nx_lists_into_df[0][1]})],
                axis=1)
# A B
# 0 0 2.0
# 1 1 5.0
# 2 3 6.0
# 3 4 8.0
# 4 7 NaN
df2 = pd.concat([pd.DataFrame({'C': nx_lists_into_df[1][0]}),
                 pd.DataFrame({'D': nx_lists_into_df[1][1]})],
                axis=1)
# C D
# 0 0 3.0
# 1 1 4.0
# 2 2 5.0
# 3 6 8.0
# 4 7 NaN
I have two dataframes that are related via a hierarchical dictionary.
In[0]: import pandas as pd
d = {'levelA_1': ['sublevel_1', 'sublevel_2'],
     'levelA_2': ['sublevel_3', 'sublevel_4'],
     'levelA_3': ['sublevel_5', 'sublevel_6']}
datA = pd.DataFrame({'A': {'levelA_1': 4, 'levelA_2': 2, 'levelA_3': 2},
                     'B': {'levelA_1': 1, 'levelA_2': 3, 'levelA_3': 5},
                     'C': {'levelA_1': 2, 'levelA_2': 4, 'levelA_3': 6}})
datB = pd.DataFrame({'A': {'sublevel_1': 4, 'sublevel_2': 1, 'sublevel_3': 3, 'sublevel_4': 4},
                     'B': {'sublevel_1': 1, 'sublevel_2': 3, 'sublevel_3': 4, 'sublevel_4': 8},
                     'C': {'sublevel_1': 2, 'sublevel_2': 6, 'sublevel_3': 13, 'sublevel_4': 6}})
In[1]: datA
Out[1]:
A B C
levelA_1 4 1 2
levelA_2 2 3 4
levelA_3 2 5 6
In[2]: datB
Out[2]:
A B C
sublevel_1 4 1 2
sublevel_2 1 3 6
sublevel_3 3 4 13
sublevel_4 4 8 6
In[3]: x = 3
The first dataframe (datA) provides values for the keys of d, and the other (datB) provides values for the values of d.
Furthermore I have a base value of x. I want to multiply the matrix of datA by x, and then multiply each element of datB by the referenced value (from the dict).
So, for example, I want to get the following result for one cell:
x = 3
3 * datB['B']['sublevel_3'] * datA['B']['levelA_2']
# res = 3*4*3 = 36
Desired output for dataframe:
A B C
sublevel_1 48 3 12
sublevel_2 12 9 36
sublevel_3 18 36 156
sublevel_4 24 72 72
Is there a better way than to loop through each cell?
IIUC
# map the dict to build the connection between datA and datB
datA['New'] = datA.reset_index()['index'].map(d).values

# give datA and datB the same index, so we can do dataframe arithmetic
New_datA = datA.set_index(list('ABC'), append=True).New.apply(pd.Series).stack().reset_index(list('ABC'))
New_datA = New_datA.set_index(0)

datB * New_datA * 3
# you can add dropna at the end to remove the NaN rows
Out[95]:
A B C
sublevel_1 48.0 3.0 12.0
sublevel_2 12.0 9.0 36.0
sublevel_3 18.0 36.0 156.0
sublevel_4 24.0 72.0 72.0
sublevel_5 NaN NaN NaN
sublevel_6 NaN NaN NaN
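A possible alternative (a sketch, reusing d, datA, datB and x as defined in the question): invert the dictionary so each sublevel maps to its parent level, align datA onto datB's index, and multiply elementwise. This also avoids the all-NaN rows for sublevel_5 and sublevel_6:
# reverse mapping: sublevel -> parent level
parent = {sub: top for top, subs in d.items() for sub in subs}

# repeat datA's rows so they line up with datB's sublevel index
aligned = datA.reindex(datB.index.map(parent))
aligned.index = datB.index

x * datB * aligned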