Filling Pandas columns with lists of unequal lengths - python

I am having trouble filling Pandas dataframes with values from lists of unequal lengths.
nx_lists_into_df is a list of numpy arrays.
I get the following error:
ValueError: Length of values does not match length of index
The code is below:
# Column headers
df_cols = ["f1","f2"]
# Create one dataframe for each sheet
df1 = pd.DataFrame(columns=df_cols)
df2 = pd.DataFrame(columns=df_cols)
# Create list of dataframes to iterate through
df_list = [df1, df2]
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
# Loop through each sheet (i.e. each round of k folds)
for df, test_index_list in zip_longest(df_list, nx_lists_into_df):
    counter = -1
    # Loop through each column in that sheet (i.e. each fold)
    for col in df_cols:
        print(col)
        counter += 1
        # Add 1 to each index value to start indexing at 1
        df[col] = test_index_list[counter] + 1
Thank you for your help.
Edit: This is how the result should hopefully look:
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN

We'll leverage pd.Series, which attaches an appropriate index and lets us use the pd.DataFrame constructor without it complaining about unequal lengths.
df1, df2 = (
    pd.DataFrame(dict(zip(df_cols, map(pd.Series, d))))
    for d in nx_lists_into_df
)
print(df1)
f1 f2
0 0 2.0
1 1 5.0
2 3 6.0
3 4 8.0
4 7 NaN
print(df2)
f1 f2
0 0 3.0
1 1 4.0
2 2 5.0
3 6 8.0
4 7 NaN
Setup
import pandas as pd
from numpy import array

nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
# Column headers
df_cols = ["f1","f2"]

You could predefine the size of your DataFrames by setting the index range to the length of the longest column you want to add (or any size bigger than that), like so:
df1 = pd.DataFrame(columns=df_cols, index=range(5))
df2 = pd.DataFrame(columns=df_cols, index=range(5))
print(df1)
f1 f2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
(df2 is the same)
The DataFrame will be filled with NaNs automatically.
Then you use .loc to access each entry separately like so:
for x in range(len(nx_lists_into_df)):
    for col_idx, y in enumerate(nx_lists_into_df[x]):
        df_list[x].loc[range(len(y)), df_cols[col_idx]] = y
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
The first loop iterates over the first dimension of your array (or the number of DataFrames you want to create).
The second loop iterates over the column values for the DataFrame, where y are the values for the current column and df_cols[col_idx] is the respective column (f1 or f2).
Since the row selection range(len(y)) is exactly as long as y, you don't get the length mismatch.
Also check out the enumerate(iterable, start=0) function to get around those "counter" variables.
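For instance, a small sketch of that inner loop rewritten with enumerate (using the same df_cols as above):
for counter, col in enumerate(df_cols):
    print(counter, col)  # counter starts at 0, no manual increment needed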
Hope this helps.

If I understand correctly, this is possible via pd.concat.
But see #pir's solution for an extendable version.
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
df1 = pd.concat([pd.DataFrame({'A': nx_lists_into_df[0][0]}),
                 pd.DataFrame({'B': nx_lists_into_df[0][1]})],
                axis=1)
# A B
# 0 0 2.0
# 1 1 5.0
# 2 3 6.0
# 3 4 8.0
# 4 7 NaN
df2 = pd.concat([pd.DataFrame({'C': nx_lists_into_df[1][0]}),
                 pd.DataFrame({'D': nx_lists_into_df[1][1]})],
                axis=1)
# C D
# 0 0 3.0
# 1 1 4.0
# 2 2 5.0
# 3 6 8.0
# 4 7 NaN
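To avoid writing one pd.concat call per sheet, the same idea generalizes with a comprehension (a sketch I'm adding, assuming the df_cols and nx_lists_into_df from the setup above):
dfs = [pd.concat([pd.DataFrame({col: arr}) for col, arr in zip(df_cols, group)],
                 axis=1)
       for group in nx_lists_into_df]
df1, df2 = dfs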

Related

Replace specific values in a data frame with column mean

I have a dataframe and I want to replace the value 7 with the rounded mean of its column, computed without counting the other 7s in that column. Here is a simple example:
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3]
df['b'] =[3, 0, -1]
df['c'] = [4, 7, 6]
df['d'] = [7, 7, 6]
a b c d
0 1 3 4 7
1 2 0 7 7
2 3 -1 6 6
And here is the output I want:
a b c d
0 1 3 4 2
1 2 0 3 2
2 3 -1 6 6
For example, in row 1 the mean of column c is 3.33, which rounds to 3, and in column d it is 2 (since we do not count the other 7s in that column).
Can you please help me with that?
Here is one way to do it:
import numpy as np

# replace 7 with np.nan
df.replace(7, np.nan, inplace=True)
# fill NaN values with the mean of the column (NaNs counted as 0), then round
(df.fillna(df.apply(lambda x: x.replace(np.nan, 0)
                              .mean(skipna=False)))
   .round(0)
   .astype(int))
a b c d
0 1 3 4 2
1 2 0 3 2
2 3 -1 6 6
# Column means computed with the 7s zeroed out, truncated to int
temp = df.replace(to_replace=7, value=0, inplace=False).copy()
# Replace each 7 with its column's mean
df.replace(to_replace=7, value=temp.mean().astype(int), inplace=True)
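For completeness, a non-inplace sketch using mask (an alternative I'm adding, not from the answers above): zero out the 7s to compute the column means, then substitute those means back where the 7s were.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 0, -1],
                   'c': [4, 7, 6], 'd': [7, 7, 6]})
# Column means with 7s counted as 0, rounded to int
means = df.mask(df.eq(7), 0).mean().round().astype(int)
# Replace each 7 with its column's mean (axis=1 aligns the Series on columns)
result = df.mask(df.eq(7), means, axis=1)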

Apply different mathematical function in table in Python

I have two columns, Column A and Column B, holding some values (the table was attached as an image in the original question; sample data appears in the answer below).
Now, I want to apply a normal arithmetic function to each row and put the result in the next column, but a different arithmetic operator should be applied on each row, like:
A+B for the first row
A-B for the second row
A*B for the third row
A/B for the fourth row
and so on until the nth row, repeating the same cycle of operations.
Can someone please help me with this in Python?
We can use:
- row.name to access the index when using apply on a row
- a dictionary to map indexes to operations
Code
import operator as _operator
import pandas as pd

# Data
d = {"A": [5, 6, 7, 8, 9, 10, 11],
     "B": [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(d)
print(df)

# Mapping from index to mathematical operation
operator_map = {
    0: _operator.add,
    1: _operator.sub,
    2: _operator.mul,
    3: _operator.truediv,
}

# Use row.name % 4 so the operators cycle with period 4
df['new'] = df.apply(lambda row: operator_map[row.name % 4](*row), axis=1)
Output
Initial df
A B
0 5 1
1 6 2
2 7 3
3 8 4
4 9 5
5 10 6
6 11 7
New df
A B new
0 5 1 6.0
1 6 2 4.0
2 7 3 21.0
3 8 4 2.0
4 9 5 14.0
5 10 6 4.0
6 11 7 77.0
IIUC, you can try DataFrame.apply on the rows with operator:
import operator
operators = [operator.add, operator.sub, operator.mul, operator.truediv]
df['C'] = df.apply(lambda row: operators[row.name](*row), axis=1)
print(df)
A B C
0 5 1 6.0
1 6 2 4.0
2 7 3 21.0
3 8 4 2.0
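Note that this indexes operators directly by row.name, so it assumes at most four rows; for longer frames one could cycle with modulo, as in the answer above:
df['C'] = df.apply(lambda row: operators[row.name % len(operators)](*row), axis=1)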

Group dataframe according to start and stop columns

I'd like to cut/group a pandas DataFrame according to a start and a stop column, but only in the case of start -> stop.
I want the range of indexes from a non-zero 'start' value to the next non-zero 'stop' value, but only if no other non-zero 'start' comes in between, running through the indices from top to bottom.
I attached some code creating a simplified version of the problem and a corresponding image.
import numpy as np
import pandas as pd

col1 = np.zeros(10)
col2 = np.zeros(10)
col1[[0, 1, 5, 8]] = 1
col2[[3, 6, 7, 9]] = 1
df = pd.DataFrame({'start': col1, 'stop': col2})
The desired output would group the indexes somewhat like:
[(1,2,3), (5,6), (8,9)]
Additional info in case this would simplify things:
Merging the columns would be fine.
My original data frame has a pd.TimedeltaIndex.
(The original question included an image visually clarifying the desired result.)
First we need to look at the intervals of start and stop and find out which are “valid” interval ends:
>>> ends = df.index.to_series().where(df['stop'].ne(0))
>>> starts = df.index.to_series().where(df['start'].ne(0))
>>> ends
0 NaN
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
6 6.0
7 7.0
8 NaN
9 9.0
dtype: float64
>>> starts
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
6 NaN
7 NaN
8 8.0
9 NaN
dtype: float64
Now, for each valid start, we can try to get the next valid end:
>>> next_end = ends.bfill().rename('end')
>>> valid_starts = starts.dropna().rename('start')
>>> candidates = valid_starts.to_frame().join(next_end, how='left')
>>> candidates
start end
0 0.0 3.0
1 1.0 3.0
5 5.0 6.0
8 8.0 9.0
Here we see that there is an issue with the interval starting at 0: another interval starts later (at 1) so [0, 3] is not valid and we should only keep [1, 3]. This could be done with groupby + max for example:
>>> intervals = candidates.groupby('end')['start'].max().reset_index().astype(int)
>>> intervals
end start
0 3 1
1 6 5
2 9 8
Finally generating the list of indexes from the endpoints is easy:
>>> intervals.agg(lambda s: list(range(s['start'], s['end'] + 1)), axis='columns')
0 [1, 2, 3]
1 [5, 6]
2 [8, 9]
dtype: object
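Putting those steps together, a sketch of a reusable helper (my addition, assuming the df built in the question; the astype(int) presumes every kept start has a following stop):
def start_stop_groups(df):
    ends = df.index.to_series().where(df['stop'].ne(0))
    starts = df.index.to_series().where(df['start'].ne(0))
    # For each valid start, the next valid end
    candidates = (starts.dropna().rename('start').to_frame()
                        .join(ends.bfill().rename('end'), how='left'))
    # Keep only the last start before each end
    intervals = candidates.groupby('end')['start'].max().reset_index().astype(int)
    return [list(range(s, e + 1)) for e, s in zip(intervals['end'], intervals['start'])]

start_stop_groups(df)  # [[1, 2, 3], [5, 6], [8, 9]]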

Find duplicate values in two arrays, Python

I have two arrays (A and B) with about 50 000 values in each. Every value represents an ID. I want to create a pandas dataframe with three columns, col1: values from array A, col2: values from array B, col3: a string with the label "unique" or "duplicate". In each array, the IDs are unique.
The arrays are of different lengths, so I can't do something like this to get started:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 6, 7, 8, 9, 10])
pd.DataFrame({'a': a, 'b': b})
I was then thinking of creating a different pandas dataframe, also with three columns: one for the ID, another for which array the ID comes from (a or b). Then group on ID and count occurrences; if the count is >= 2, we have a duplicate.
But I couldn't figure out how to stack the numpy arrays one after another in the same column (like rbind in R) while at the same time creating the other column based on which array each value comes from.
Most likely there are far better solutions than those I have suggested above. Any ideas?
For finding duplicate elements in two arrays, use numpy.intersect1d:
In [458]: a = np.array([1, 2, 3, 4, 5])
In [459]: b = np.array([5, 6, 7, 8, 9, 10])
In [462]: np.intersect1d(a,b)
Out[462]: array([5])
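Building on that, a sketch (my addition, assuming pandas is imported as pd) that labels each value of a using np.isin:
dup = np.intersect1d(a, b)
df_a = pd.DataFrame({'id': a,
                     'label': np.where(np.isin(a, dup), 'duplicate', 'unique')})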
Convert the arrays into Series and then concat them to create the dataframe:
a = np.array([1, 2, 3, 4, 5,])
b = np.array([5, 6, 7, 8, 9, 10])
s1 = pd.Series(a, name = 'a')
s2 = pd.Series(b, name = 'b')
pd.concat([s1, s2], axis = 1)
a b
0 1.0 5
1 2.0 6
2 3.0 7
3 4.0 8
4 5.0 9
5 NaN 10
Try with merge + indicator:
out = pd.DataFrame({'a': a}).merge(pd.DataFrame({'b': b}),
                                   left_on='a', right_on='b',
                                   indicator=True, how='outer')
Out[210]:
a b _merge
0 1.0 NaN left_only
1 2.0 NaN left_only
2 3.0 NaN left_only
3 4.0 NaN left_only
4 5.0 5.0 both
5 NaN 6.0 right_only
6 NaN 7.0 right_only
7 NaN 8.0 right_only
8 NaN 9.0 right_only
9 NaN 10.0 right_only
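To get the question's 'unique'/'duplicate' labels from this, one could map the indicator column (a sketch building on the output above, assuming numpy is imported as np):
out['label'] = np.where(out['_merge'].eq('both'), 'duplicate', 'unique')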

Pandas index column by boolean

I want to keep columns that have 'n' or more values.
For example:
> df = pd.DataFrame({'a': [1,2,3], 'b': [1,None,4]})
a b
0 1 1
1 2 NaN
2 3 4
3 rows × 2 columns
> df[df.count()==3]
IndexingError: Unalignable boolean Series key provided
> df[:,df.count()==3]
TypeError: unhashable type: 'slice'
> df[[k for (k,v) in (df.count()==3).items() if v]]
a
0 1
1 2
2 3
Is that the best way to do this? It seems ridiculous.
You can use a conditional list comprehension to generate the columns that exceed your threshold (e.g. 3), then just select those columns from the data frame:
# Create sample DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, None, 4, None, 2],
                   'c': [5, 4, 3, 2, None]})
>>> df_new = df[[col for col in df if df[col].count() > 3]]
Out[82]:
a c
0 1 5
1 2 4
2 3 3
3 4 2
4 5 NaN
Use count to produce a boolean index and use this as a mask for the columns:
In [10]:
df[df.columns[df.count() > 2]]
Out[10]:
a
0 1
1 2
2 3
If you want to keep columns that have n or more values (for this example I am considering n to be 4):
df = pd.DataFrame({'a': [1, 2, 3, 4, 6],
                   'b': [1, None, 4, 5, 7],
                   'c': [1, 2, 3, 5, 8]})
print(df)
a b c
0 1 1 1
1 2 NaN 2
2 3 4 3
3 4 5 5
4 6 7 8
print(df.iloc[:, [i for i in range(len(df.columns))
                  if len(df.iloc[:, i]) - df.isnull().sum().iloc[i] > 4]])
a c
0 1 1
1 2 2
2 3 3
3 4 5
4 6 8
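For completeness, pandas can express "keep columns with at least n non-null values" directly via dropna with thresh (an alternative I'm adding, not shown in the answers above):
n = 4
df.dropna(axis='columns', thresh=n)  # keep columns with at least n non-null values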
