I have data of the following format:
Col1 Col2 Col3
1, 1424549456, "3 4"
2, 1424549457, "2 3 4 5"
and have successfully read it into pandas.
How can I turn Col3 into a NumPy matrix of the following form?
# each value needs to become a 1 at that column index
# i.e. in the above example 3 is the 4th value, thus
# it becomes [0 0 0 1] (0-indexing is included)
mtx = [0 0 0 1 1 0 # corresponds to first row
0 0 1 1 1 1]; # corresponds to second row
Thanks for any help you can provide!
Since 0.13.1 there's str.get_dummies:
In [11]: s = pd.Series(["3 4", "2 3 4 5"])
In [12]: s.str.get_dummies(sep=" ")
Out[12]:
2 3 4 5
0 0 1 1 0
1 1 1 1 1
You have to ensure the columns are integers (rather than strings) and reindex:
In [13]: df = s.str.get_dummies(sep=" ")
In [14]: df.columns = df.columns.map(int)
In [15]: df.reindex(columns=np.arange(6), fill_value=0)
Out[15]:
0 1 2 3 4 5
0 0 0 0 1 1 0
1 0 0 1 1 1 1
To get the numpy values use .values:
In [16]: df.reindex(columns=np.arange(6), fill_value=0).values
Out[16]:
array([[0, 0, 0, 1, 1, 0],
[0, 0, 1, 1, 1, 1]])
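Putting it together against the DataFrame from the question (a minimal sketch; the column name Col3 and the 0-5 label range are taken from the example above):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2],
                   "Col2": [1424549456, 1424549457],
                   "Col3": ["3 4", "2 3 4 5"]})

dummies = df["Col3"].str.get_dummies(sep=" ")
dummies.columns = dummies.columns.map(int)
mtx = dummies.reindex(columns=np.arange(6), fill_value=0).values
# array([[0, 0, 0, 1, 1, 0],
#        [0, 0, 1, 1, 1, 1]])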
If there's not a lot of data you can do something like:
res = []

def f(v):
    # set a 1 at each index listed in the space-separated string v
    r = np.zeros(6, dtype=int)
    r[[int(x) for x in v.split()]] = 1
    res.append(r)

df.Col3.apply(f)
mat = np.array(res)
# if you really want it to be a matrix, you can do
mat = np.matrix(res)
I'm a Python newbie and need help with a specific task. My main goal is to identify, within each row, all values greater than 0 together with their column names, and to collect these value/column pairs, one below the other, into another column of the same row.
Here is what I tried:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
# create a new placeholder column that will hold the summary
df['summary'] = 'NoData'
# print the header
print(df.columns.values)
A B C D summary
0 0 4 2 1 NoData
1 2 1 9 0 NoData
2 0 3 0 1 NoData
3 5 0 6 6 NoData
# get length of rows and columns
row = len(df.index)
column = len(df.columns)
# If a value at a specific index is greater than 0, take the column name
# and the value at that index and write it into the column 'summary'.
# Also write all values greater than 0 within a row below each other.
for i in range(row):
    for j in range(column):
        if df.iloc[i][j] > 0:
            df.at[i, 'summary'] = df.columns(df.iloc[i][j]) + '\n'
I hope it is a bit clear what I want to achieve. Here is a picture of how the result should look in the column 'summary'
You don't really need a for loop.
Starting with df:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
You can do:
# Define a helper function
def f(val, col_name):
    # You can modify this function in order to customize the summary string
    return "" if val == 0 else str(val) + col_name + "\n"

# Assign the summary column: apply f column-wise (x.name is the column name),
# then concatenate the per-column strings row-wise and strip the trailing "\n"
df["summary"] = df.apply(lambda x: x.apply(f, args=(x.name,))).sum(axis=1).str[:-1]
Output:
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
It works for longer column names as well:
one two three four summary
0 0 4 2 1 4two\n2three\n1four
1 2 1 9 0 2one\n1two\n9three
2 0 3 0 1 3two\n1four
3 5 0 6 6 5one\n6three\n6four
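The same summary can also be built with a single row-wise apply; a minimal sketch (it assumes df holds only the numeric attribute columns at this point):
import pandas as pd

df = pd.DataFrame({'A': [0, 2, 0, 5], 'B': [4, 1, 3, 0],
                   'C': [2, 9, 0, 6], 'D': [1, 0, 1, 6]})

# join "value + column name" for every positive value in the row
df['summary'] = df.apply(
    lambda row: "\n".join(f"{v}{c}" for c, v in row.items() if v > 0),
    axis=1,
)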
Try this:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
print(f'\n\n-------------BREAK-----------\n\n')
def func(line):
    templist = ''
    list_col = line.index.values.tolist()
    temp = line.values.tolist()
    for x in range(0, len(temp)):
        if temp[x] <= 0:
            pass
        else:
            # only prepend a newline once the summary string is non-empty
            if templist == '':
                templist = f"{temp[x]}{list_col[x]}"
            else:
                templist = f"{templist}\n{temp[x]}{list_col[x]}"
    return templist

df['summary'] = df.apply(func, axis=1)
print(df)
Output:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
-------------BREAK-----------
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
If we have a pandas data frame and a mapping dictionary for the values in the data frame, replacing the values in the data frame using the dictionary as a mapping can be done like so:
In: df
Out:
Col1 Col2
0 a c
1 b c
2 b c
In: key
Out: {'a': 1, 'b': 2, 'c': 3}
In: df.replace(key)
Out:
Col1 Col2
0 1 3
1 2 3
2 2 3
How can a similar transformation be accomplished when the mapping dictionary has lists as values? For example:
In: key
Out: {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 1]}
In: df.replace(key)
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 1 output values where the mask is true
In this example, the end goal would be to have a new data frame that has 3 rows and 6 columns:
1 0 0 0 0 1
0 1 0 0 0 1
0 1 0 0 0 1
IIUC, you can applymap+explode+reshape:
df2 = df.applymap(key.get).explode(list(df.columns))
df2 = (df2
       .set_index(df2.groupby(level=0).cumcount(), append=True)
       .unstack(level=1)
       )
output:
Col1 Col2
0 1 2 0 1 2
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
NB. to reset the columns: df2.columns = range(df2.shape[1])
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
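Note that on recent pandas (2.1 or later) applymap is deprecated in favour of DataFrame.map; the first line of the sketch above would then read:
df2 = df.map(key.get).explode(list(df.columns))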
You can use a combination of DataFrame.apply and Series.map to perform this substitution. From there, you can perform a DataFrame.sum to concatenate the lists and then cast your data back into a new DataFrame:
out = pd.DataFrame(
    df.apply(lambda s: s.map(key)).sum(axis=1).tolist()
)
print(out)
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
Semi-related testing of .sum vs .chain:
In [22]: %timeit tmp_df.sum(axis=1)
77.6 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [23]: %timeit tmp_df.apply(lambda row: list(chain.from_iterable(row)), axis=1)
197 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [24]: tmp_df
Out[24]:
Col1 Col2
0 [1, 0, 0] [0, 0, 1]
1 [0, 1, 0] [0, 0, 1]
2 [0, 1, 0] [0, 0, 1]
While I won't say that .sum is the best method for concatenating lists in a Series, .apply & chain.from_iterable doesn't seem to fare much better, at least on a very small sample like this.
Hmm, this is tricky.
One solution I came up with is to convert the lists to their string representation before replacing, because pandas treats lists specially. Then you can use itertools.chain.from_iterable on each row to combine all the lists into one big list, and create a dataframe out of that:
import ast
from itertools import chain
n = df.replace({k: str(v) for k, v in key.items()}).applymap(ast.literal_eval)
df = pd.DataFrame(n.apply(lambda x: list(chain.from_iterable(x)), axis=1).tolist())
Output:
>>> df
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
Here's a method of replacing the items with lists without looping or stringifying:
df[:] = pd.Series(key)[df.to_numpy().flatten()].to_numpy().reshape(df.shape)
Output:
>>> df
Col1 Col2
0 [1, 0, 0] [0, 0, 1]
1 [0, 1, 0] [0, 0, 1]
2 [0, 1, 0] [0, 0, 1]
Or, you can use explode and reshape to convert the data directly to a numpy array:
# 6 = length of one item of `key` * number of columns in df
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(-1, 6)
Output:
>>> arr
array([[1, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1]], dtype=object)
>>> pd.DataFrame(arr)
0 1 2 3 4 5
0 1 0 0 0 0 1
1 0 1 0 0 0 1
2 0 1 0 0 0 1
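If you prefer not to hard-code the width 6, NumPy can infer it from the row count; a small sketch of the same reshape:
arr = pd.Series(key)[df.to_numpy().flatten()].explode().to_numpy().reshape(len(df), -1)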
I have the following data
attr1_A attr1_B attr1_C attr1_D attr2_A attr2_B attr2_C
1 0 0 1 1 0 0
0 1 1 0 0 0 1
0 0 0 0 0 1 0
1 1 1 0 1 1 0
I want to retain attr1_A, attr1_B and combine attr1_C and attr1_D into attr1_others. As long as attr1_C and/or attr1_D is 1, then attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* into attr2_others. Like this:
attr1_A attr1_B attr1_others attr2_A attr2_others
1 0 1 1 0
0 1 1 0 1
0 0 0 0 1
1 1 1 1 1
In other words, for any group of attrs, I want to retain a few known columns and combine the remaining ones (I don't know in advance how many remaining attrs each group has).
I am thinking of handling each group separately: first all attr1_*, then attr2_*, because there are a limited number of groups in my dataset, but many attrs under each group.
What I can think of right now is to retrieve the "others" columns like:
# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]
# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]
And to combine them, I am thinking of using the any function, but I can't come up with the syntax. Could you help?
Updated attempt:
I tried this
# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns)
                            if "attr1_" in x
                            and "A" not in x
                            and "B" not in x]].any(axis='column')]
but got the below error:
ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>
DataFrames let you manipulate data in place, without having to write complex Python loops.
To create your attr1_others and attr2_others columns, you can combine the columns with OR conditions like this:
df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']
If instead, you wanted an and condition, you could use:
df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']
You can then delete the lingering original values using del:
del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
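Equivalently, instead of the repeated del statements, the leftover columns can be dropped in a single call; a small sketch using drop:
df = df.drop(columns=['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])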
Create a list of kept columns. Drop those columns and assign the left-over columns to a new dataframe df1. Group df1 by the column-name prefixes (the part before '_') along axis=1, call any, add_suffix '_others', and assign the result to df2. Finally, join and sort_index:
keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(columns=keep_cols)
df2 = (df1.groupby(df1.columns.str.split('_').str[0], axis=1)
          .any().add_suffix('_others').astype(int))
Out[512]:
attr1_others attr2_others
0 1 0
1 1 1
2 0 1
3 1 1
df_final = df[keep_cols].join(df2).sort_index(axis=1)
Out[514]:
attr1_A attr1_B attr1_others attr2_A attr2_others
0 1 0 1 1 0
1 0 1 1 0 1
2 0 0 0 0 1
3 1 1 1 1 1
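On newer pandas, where groupby(..., axis=1) is deprecated, the same grouping can be expressed over the transpose; a hedged sketch using the df1 defined above:
df2 = (df1.T.groupby(df1.columns.str.split('_').str[0]).any()
          .T.add_suffix('_others').astype(int))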
You can use a custom list to select columns, and then .any() with the axis=1 parameter. To convert to integer, use .astype(int).
For example:
import pandas as pd
df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})
cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
print(df)
Prints:
attr1_A attr1_B attr2_A attr1_others attr2_others
0 1 0 1 1 0
1 0 1 0 1 1
2 0 0 0 0 1
3 1 1 1 1 1
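If there are many groups, the same pattern can be wrapped in a loop over the group prefixes; a sketch, starting again from the original df and assuming the suffixes to keep for each group are known up front:
keep = {'attr1': ('A', 'B'), 'attr2': ('A',)}
for prefix, keep_suffixes in keep.items():
    others = [c for c in df.columns
              if c.startswith(prefix) and c.split('_')[1] not in keep_suffixes]
    df[f'{prefix}_others'] = df[others].any(axis=1).astype(int)
    df.drop(others, axis=1, inplace=True)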
I am having trouble with conditionals / boolean indexing. I am trying to populate a dataframe (dfp) with logic which is conditional on data from a similarly shaped dataframe (dfs) plus the previous row of itself (dfp).
This is my latest fail...
import pandas as pd
dfs = pd.DataFrame({'a':[1,0,-1,0,1,0,0,-1,0,0],'b':[0,1,0,0,-1,0,1,0,-1,0]})
In [171]: dfs
Out[171]:
a b
0 1 0
1 0 1
2 -1 0
3 0 0
4 1 -1
5 0 0
6 0 1
7 -1 0
8 0 -1
9 0 0
dfp = pd.DataFrame(index=dfs.index,columns=dfs.columns)
dfp[(dfs==1)|((dfp.shift(1)==1)&(dfs!=-1))] = 1
In [166]: dfp.fillna(0)
Out[166]:
a b
0 1.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 0.0
4 1.0 0.0
5 0.0 0.0
6 0.0 1.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
So I would like dfp to have a 1 in row n if either of two conditions is met:
1) dfs in the same row equals 1, or
2) dfp in the previous row equals 1 and dfs in the same row is not -1.
I would like my final output to look like this:
a b
0 1 0
1 1 1
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 0 1
8 0 0
9 0 0
UPDATE / EDIT:
Sometimes the visual is more helpful - below is how it would map out in Excel.
Thanks in advance, very grateful for your time.
Let's summarize the invariants:
If the dfs value is 1 then the dfp value is 1.
If the dfs value is -1 then the dfp value is 0.
If the dfs value is 0 then the dfp value is 1 if the previous dfp value is 1 otherwise it's 0.
Or to formulate in another way:
The dfp starts with 1 if the first value is 1, otherwise 0
The dfp values are 0 until there is a 1 in dfs.
The dfp values are 1 until there is a -1 in dfs.
This is very easy to formulate in python:
import numpy as np

def create_new_column(dfs_col):
    newcol = np.zeros_like(dfs_col)
    if dfs_col[0] == 1:
        last = 1
    else:
        last = 0
    for idx, val in enumerate(dfs_col):
        if last == 1 and val == -1:
            last = 0
        if last == 0 and val == 1:
            last = 1
        newcol[idx] = last
    return newcol
And the test:
>>> create_new_column(dfs.a)
array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)
>>> create_new_column(dfs.b)
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
However this is very inefficient in Python, because iterating over NumPy arrays (and pandas Series/DataFrames) is slow, and Python for-loops are inefficient as well.
But if you have numba or Cython you can compile this and it will (probably) be faster than any NumPy solution could be, because NumPy would require several rolling and/or accumulate operations.
For example with numba:
>>> import numba
>>> numba_version = numba.njit(create_new_column)  # JIT-wrap (compiled on first call)
>>> numba_version(np.asarray(dfs.a)) # need cast to np.array
array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)
>>> numba_version(np.asarray(dfs.b)) # need cast to np.array
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
Even if dfs has millions of rows the numba solution will take only milliseconds:
>>> dfs = pd.DataFrame({'a':np.random.randint(-1, 2, 1000000),'b':np.random.randint(-1, 2, 1000000)})
>>> %timeit numba_version(np.asarray(dfs.b))
100 loops, best of 3: 9.37 ms per loop
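To get back to the requested dfp for the original small dfs from the question, the per-column results can be assembled into a DataFrame; a minimal sketch using the helper above:
dfp = pd.DataFrame({col: create_new_column(dfs[col].to_numpy()) for col in dfs.columns},
                   index=dfs.index)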
Not the best way to do it but something that works.
dfs = pd.DataFrame({'a':[1,0,-1,0,1,0,0,-1,0,0],'b':[0,1,0,0,-1,0,1,0,-1,0]})
dfp = dfs.copy()
Define the function as follows. Usage of 'last' here is a little hacky.
last = [0]

def f(x):
    if x == 1:
        x = 1
    elif x != -1 and last[0] == 1:
        x = 1
    else:
        x = 0
    last[0] = x
    return x
Simply apply the function f to each column.
dfp.a = dfp.a.apply( f )
dfp
a b
0 1 0
1 1 1
2 0 0
3 0 0
4 1 -1
5 1 0
6 1 1
7 0 0
8 0 -1
9 0 0
Similarly for column b. Don't forget to re-initialize 'last'.
last[0] = 0
dfp.b = dfp.b.apply( f )
dfp
a b
0 1 0
1 1 1
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 0 1
8 0 0
9 0 0
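To avoid the module-level 'last' (and the manual re-initialization between columns), the same logic can be packed into a small factory that creates fresh state per column; a sketch:
def make_f():
    state = {'last': 0}
    def f(x):
        if x == 1 or (x != -1 and state['last'] == 1):
            x = 1
        else:
            x = 0
        state['last'] = x
        return x
    return f

dfp = dfs.copy()
dfp.a = dfp.a.apply(make_f())
dfp.b = dfp.b.apply(make_f())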
I've got a dataset with a big number of rows. Some of the values are NaN, like this:
In [91]: df
Out[91]:
1 3 1 1 1
1 3 1 1 1
2 3 1 1 1
1 1 NaN NaN NaN
1 3 1 1 1
1 1 1 1 1
And I want to count the number of NaN values in each row; the result would look like this:
In [91]: list = <somecode with df>
In [92]: list
Out[92]:
[0,
0,
0,
3,
0,
0]
What is the best and fastest way to do it?
You could first find whether each element is NaN or not with isnull() and then take the row-wise sum(axis=1):
In [195]: df.isnull().sum(axis=1)
Out[195]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
And, if you want the output as list, you can
In [196]: df.isnull().sum(axis=1).tolist()
Out[196]: [0, 0, 0, 3, 0, 0]
Or use count, like:
In [130]: df.shape[1] - df.count(axis=1)
Out[130]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
To count NaNs across a specific set of columns (still per row), use
cols = ['col1', 'col2']
df['number_of_NaNs'] = df[cols].isna().sum(axis=1)
or index the columns by position, e.g. count NaNs in the first 4 columns:
df['number_of_NaNs'] = df.iloc[:, :4].isna().sum(axis=1)
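If raw speed matters, the same per-row count can also be done at the NumPy level; a sketch that assumes the frame is all-numeric (so the underlying array is float and np.isnan applies):
import numpy as np

nan_counts = np.isnan(df.to_numpy()).sum(axis=1)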