I tried to convert a column of sets to a column of lists in a pandas DataFrame, but it failed. I'm not sure what the best way to do this is. Thanks.
Here is the example:
I tried to create a column 'c' that converts the set column 'b' to lists, but 'c' still contains sets.
data = [{'a': [1,2,3], 'b':{11,22,33}},{'a':[2,3,4],'b':{111,222}}]
tdf = pd.DataFrame(data)
tdf['c'] = list(tdf['b'])
tdf
a b c
0 [1, 2, 3] {33, 11, 22} {33, 11, 22}
1 [2, 3, 4] {222, 111} {222, 111}
You could do:
import pandas as pd
data = [{'a': [1,2,3], 'b':{11,22,33}},{'a':[2,3,4],'b':{111,222}}]
tdf = pd.DataFrame(data)
tdf['c'] = [list(e) for e in tdf.b]
print(tdf)
Use apply:
tdf['c'] = tdf['b'].apply(list)
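This works because list(tdf['b']) is applied to the whole column and just returns its elements (still sets), whereas apply calls list on each element one by one.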
Or do:
tdf['c'] = tdf['b'].map(list)
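To make the difference concrete, here is a minimal sketch using the data from the question. The name `outer` is only illustrative, and `sorted` is used instead of `list` purely so the element order is deterministic (sets are unordered).

```python
import pandas as pd

data = [{'a': [1, 2, 3], 'b': {11, 22, 33}}, {'a': [2, 3, 4], 'b': {111, 222}}]
tdf = pd.DataFrame(data)

# list() over the Series only iterates its rows: the result is a list of sets.
outer = list(tdf['b'])
print(type(outer[0]))  # <class 'set'>

# Converting element by element turns every set into a list.
# sorted() returns a list, like list(), but with a predictable order.
tdf['c'] = tdf['b'].apply(sorted)
print(type(tdf['c'].iloc[0]))  # <class 'list'>
```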
I have the following pandas dataframe
data = [{'a': 1, 'b': '[2,3,4,5,6' }, {'a': 10, 'b': '[54,3,40,5'}]
test = pd.DataFrame(data)
display(test)
a b
0 1 [2,3,4,5,6
1 10 [54,3,40,5
I want to turn the numbers in column b into a list, but each string has a "[" only at the beginning, which prevents me from creating the list. I'm trying to remove the "[" so I can extract the numbers, but I keep getting errors. What am I doing wrong?
This is how the numbers are stored
test.iloc[1,1]
'[54,3,40,5'
And this is what I've tried to remove the "[".
test.iloc[0,1].replace("[",'', regex=True).to_list()
test.iloc[0,1].str.replace("[\]\[]", "")
What I want to achieve is to have b as a proper list so I can apply other functions.
a b
0 1 [2,3,4,5,6]
1 10 [54,3,40,5]
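To make your 'b' column a list, first remove the opening square bracket at the beginning, then use the split method on each element of the 'b' column. Passing regex=False makes pandas treat '[' as a literal character rather than a regex pattern. Note that the list elements remain strings.
test['b'] = test['b'].str.replace('[', '', regex=False).map(lambda x: x.split(','))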
test
# a b
# 0 1 [2, 3, 4, 5, 6]
# 1 10 [54, 3, 40, 5]
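If the goal is lists of integers rather than lists of strings, one possible variation is to strip the bracket and convert each piece with int. This is a sketch that assumes every entry is a comma-separated run of integers, as in the question's data.

```python
import pandas as pd

test = pd.DataFrame([{'a': 1, 'b': '[2,3,4,5,6'}, {'a': 10, 'b': '[54,3,40,5'}])

# Strip bracket characters from both ends, split on commas,
# and convert each piece to int.
test['b'] = test['b'].map(lambda s: [int(n) for n in s.strip('[]').split(',')])
print(test['b'].tolist())  # [[2, 3, 4, 5, 6], [54, 3, 40, 5]]
```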
try it:
def func(col):
    return eval(col+']')
test['b'] = test['b'].apply(func)
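Since eval executes arbitrary code, a more cautious variant of the same idea is to use ast.literal_eval, which only accepts Python literals. The sketch below assumes, like the answer above, that only the closing bracket is missing; the helper name `parse_truncated_list` is made up for illustration.

```python
import ast
import pandas as pd

test = pd.DataFrame([{'a': 1, 'b': '[2,3,4,5,6'}, {'a': 10, 'b': '[54,3,40,5'}])

def parse_truncated_list(text):
    """Append the missing ']' and parse the result as a Python literal."""
    return ast.literal_eval(text + ']')

test['b'] = test['b'].apply(parse_truncated_list)
print(test['b'].tolist())  # [[2, 3, 4, 5, 6], [54, 3, 40, 5]]
```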
import pandas as pd
data = [{'a': 1, 'b': '[2,3,4,5,6' }, {'a': 10, 'b': '[54,3,40,5'}]
test = pd.DataFrame(data)
print(test['b'][0][1:])
for i in range(len(test['b'])):
    # assign through .loc to avoid chained assignment; drops the leading '['
    test.loc[i, 'b'] = test['b'][i][1:]
I have a DataFrame for which I want to create a column that represents the pattern of missing values in each row. For example, for the CSV file:
A,B,C,D
1,NaN,NaN,NaN
NaN,2,3,NaN
3,2,2,3
3,2,NaN,3
3,2,1,NaN
I want to create a column E whose value is set as follows:
If A, B, C, D are all missing, E = 4,
If A, B, C, D are all present, E = 0,
If only A and B are missing, E = 1, and so on. The encoding of E need not be exactly as I described; it only has to indicate the pattern. How can I approach this problem in pandas?
Use isnull in combination with sum(axis=1) to count the missing values in each row.
Example:
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3, 3, 3],
'B':[ None, None, 1, 1, 1]})
df['C'] = df.isnull().sum(axis=1)
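A count alone cannot distinguish rows that miss the same number of values in different columns. As a sketch of one possible encoding (the column names `n_missing` and `pattern` are invented here, and the data mirrors the CSV from the question), each row can also be given a string with one flag per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3, 3, 3],
                   'B': [np.nan, 2, 2, 2, 2],
                   'C': [np.nan, 3, 2, np.nan, 1],
                   'D': [np.nan, np.nan, 3, 3, np.nan]})

# Boolean mask: True where a value is missing.
missing = df[['A', 'B', 'C', 'D']].isnull()

# Number of missing values per row.
df['n_missing'] = missing.sum(axis=1)

# One character per column: '1' if missing, '0' if present.
df['pattern'] = missing.apply(
    lambda row: ''.join('1' if m else '0' for m in row), axis=1)

print(df[['n_missing', 'pattern']])
```

Rows 3 and 4 both have one missing value, but their patterns ('0010' and '0001') differ.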
I want to find duplicates in a selection of columns of a DataFrame:
# converts the sub df into matrix
mat = df[['idx', 'a', 'b']].values
str_dict = defaultdict(set)
for x in np.ndindex(mat.shape[0]):
    concat = ''.join(str(x) for x in mat[x][1:])
    # take idx as values of each key a + b
    str_dict[concat].update([mat[x][0]])
dups = {}
for key in str_dict.keys():
    dup = str_dict[key]
    if len(dup) < 2:
        continue
    dups[key] = dup
The code finds duplicates of the concatenation of a and b. Uses the concatenation as key for a set defaultdict (str_dict), updates the key with idx values; finally uses a dict (dups) to store any concatenation if the length of its value (set) is >= 2.
I am wondering if there is a better way to do that in terms of efficiency.
You can just concatenate and convert to set:
res = set(df['a'].astype(str) + df['b'].astype(str))
Example:
df = pd.DataFrame({'idx': [1, 2, 3],
'a': [4, 4, 5],
'b': [5, 5,6]})
res = set(df['a'].astype(str) + df['b'].astype(str))
print(res)
# {'56', '45'}
If you need to map indices too:
df = pd.DataFrame({'idx': [1, 2, 3],
'a': [41, 4, 5],
'b': [3, 13, 6]})
df['conc'] = (df['a'].astype(str) + df['b'].astype(str))
df = df.reset_index()
res = df.groupby('conc')['index'].apply(set).to_dict()
print(res)
# {'413': {0, 1}, '56': {2}}
You can select the columns you need before calling drop_duplicates:
df[['a','b']].drop_duplicates().astype(str).apply(np.sum,1).tolist()
Out[1027]: ['45', '56']
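Note that string concatenation can merge distinct pairs: in the example above, (41, 3) and (4, 13) both become '413'. A sketch that avoids this by comparing the columns directly, using DataFrame.duplicated and groupby on the original columns (the sample data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'idx': [1, 2, 3, 4],
                   'a': [41, 4, 5, 5],
                   'b': [3, 13, 6, 6]})

# keep=False marks every row of a duplicated (a, b) group, not just the repeats.
dup_rows = df[df.duplicated(subset=['a', 'b'], keep=False)]

# Map each duplicated (a, b) pair to the set of idx values that share it.
dups = {key: set(group['idx'].tolist())
        for key, group in dup_rows.groupby(['a', 'b'])}
print(dups)
```

Here only the pair (5, 6) is reported, with idx values 3 and 4; the rows (41, 3) and (4, 13) are correctly treated as different.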
I have column names in a dictionary and would like to select those columns from a dataframe.
In the example below, how do I select dictionary values 'b', 'c' and save it in to df1?
import pandas as pd
ds = {'cols': ['b', 'c']}
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)
print(ds)
print(df_in)
df_out = df_in[[ds['cols']]]
print(df_out)
TypeError: unhashable type: 'list'
Remove nested list - []:
df_out = df_in[ds['cols']]
print(df_out)
b c
0 3 4
1 4 5
According to ref, you just need to drop one set of brackets.
df_out = df_in[ds['cols']]
Just saw this:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Apparently the .ix indexer is now deprecated. I'm wondering how to do something like this:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=pd.DatetimeIndex(['2017-01-01', '2017-01-03', '2017-01-05']))
wanted_int_index = df.index.get_loc('2017-01-04', method='ffill') # position 1 (2017-01-03)
wanted_str_column = 'a'
value = df.ix[wanted_int_index, wanted_str_column] # value = 2
print(value)
# 2
My understanding is that .loc accepts labels (str) for both index and columns, while .iloc accepts positions (int) for both index and columns. Am I missing a usage here?
loc should work for non-datetime indexing.
import pandas as pd
import numpy as np
data = np.random.rand(10)
df = pd.DataFrame(data, index=range(10),columns=['A'])
print(df.loc[1,'A']) #this works
For a DatetimeIndex like yours, look up the label at the wanted position and pass that label to loc, i.e.:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
index=pd.DatetimeIndex(['2017-01-01', '2017-01-03', '2017-01-05']))
wanted_int_index = df.index.get_loc('2017-01-04', method='ffill') # position 1 (2017-01-03)
wanted_str_column = 'a'
value = df.loc[df.index[wanted_int_index], wanted_str_column] # value = 2
print(value) #this works
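An alternative sketch goes the other way and stays purely positional with iloc, translating the column label to a position via columns.get_loc. It also uses Index.get_indexer for the forward-fill lookup, on the assumption that newer pandas releases no longer accept the method argument of get_loc (as far as I know it was deprecated in 1.4 and removed in 2.0); the variable name `pos` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=pd.DatetimeIndex(['2017-01-01', '2017-01-03', '2017-01-05']))

# get_indexer takes a list of targets and returns an array of positions
# (-1 where nothing matches); ffill picks the last label <= the target.
pos = df.index.get_indexer([pd.Timestamp('2017-01-04')], method='ffill')[0]

# Mixed lookup without .ix: convert the column label to a position, then use iloc.
value = df.iloc[pos, df.columns.get_loc('a')]
print(pos, value)  # 1 2
```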