I have DataFrame with two columns: Type and Name. The values in each cell are lists of equal length, i.e we have pairs (Type, Name). I want to:
Group Name by it's Type
Create column Type with the values of Names
My current code is a for loop:
for idx, row in df.iterrows():
for t in list(set(row["Type"])):
df.at[idx, t] = [row["Name"][i] for i in range(len(row["Name"])) if row["Type"][i] == t]
but it works very slow. How can I speed up this code?
EDIT Here is the code example which ilustrates what I want to obtain but in a faster way:
import pandas as pd
df = pd.DataFrame({"Type": [["1", "1", "2", "3"], ["2","3"]], "Name": [["A", "B", "C", "D"], ["E", "F"]]})
unique = list(set(row["Type"]))
for t in unique:
df[t] = None
df[t] = df[t].astype('object')
for idx, row in df.iterrows():
for t in unique:
df.at[idx, t] = [row["Name"][i] for i in range(len(row["Name"])) if row["Type"][i] == t]
You could write a function my_function(param) and then do something like this:
df['type'] = df['name'].apply(lambda x: my_function(x))
There are likely better alternatives to using lambda functions, but lambdas are what I remember. If you post a simplified mock of your original data and what the desired output should look like, it may help you find the best answer to your question. I'm not certain I understand what you're trying to do. A literal group by should be done using Dataframes' groupby method.
If I understand correctly your dataframe looks something like this:
df = pd.DataFrame({'Name':['a,b,c','d,e,f,g'], 'Type':['3,3,2','1,2,2,1']})
Name Type
0 a,b,c 3,3,2
1 d,e,f,g 1,2,2,1
where the elements are lists of strings.
Start with running:
df['Name:Type'] = (df['Name']+":"+df['Type']).map(process)
using:
def process(x):
x_,y_ = x.split(':')
x_ = x_.split(','); y_ = y_.split(',')
s = zip(x_,y_)
str_ = ','.join(':'.join(y) for y in s)
return str_
Then you will get:
This reduces the problem to a single column.
Finally produce the dataframe required by:
l = ','.join(df['Name:Type'].to_list()).split(',')
pd.DataFrame([i.split(':') for i in l], columns=['Name','Type'])
Giving:
is it the result you want? (if not then add to your question an example of desired output):
res = df.explode(['Name','Type']).groupby('Type')['Name'].agg(list)
print(res)
'''
Type
1 [A, B]
2 [C, E]
3 [D, F]
Name: Name, dtype: object
UPD
df1 = df.apply(lambda x: pd.Series(x['Name'],x['Type']).groupby(level=0).agg(list).T,1)
res = pd.concat([df,df1],axis=1)
print(res)
'''
Type Name 1 2 3
0 [1, 1, 2, 3] [A, B, C, D] [A, B] [C] [D]
1 [2, 3] [E, F] NaN [E] [F]
Related
I have a pandas dataframe that has one long row as a result of a flattened json list.
I want to go from the example:
{'0_id': 1, '0_name': a, '0_address': USA, '1_id': 2, '1_name': b, '1_address': UK, '1_hobby': ski}
to a table like the following:
id
name
address
hobby
1
a
USA
2
b
UK
ski
Any help is greatly appreciated :)
There you go:
import json
json_data = '{"0_id": 1, "0_name": "a", "0_address": "USA", "1_id": 2, "1_name": "b", "1_address": "UK", "1_hobby": "ski"}'
arr = json.loads(json_data)
result = {}
for k in arr:
kk = k.split("_")
if int(kk[0]) not in result:
result[int(kk[0])] = {"id":"", "name":"", "hobby":""}
result[int(kk[0])][kk[1]] = arr[k]
for key in result:
print("%s %s %s" % (key, result[key]["name"], result[key]["address"]))
if you want to have field more dynamic, you have two choices - either go through all array and gather all possible names and then build template associated empty array, or just check if key exist in result when you returning results :)
This way only works if every column follows this pattern, but should otherwise be pretty robust.
data = {'0_id': '1', '0_name': 'a', '0_address': 'USA', '1_id': '2', '1_name': 'b', '1_address': 'UK', '1_hobby': 'ski'}
df = pd.DataFrame(data, index=[0])
indexes = set(x.split('_')[0] for x in df.columns)
to_concat = []
for i in indexes:
target_columns = [col for col in df.columns if col.startswith(i)]
df_slice = df[target_columns]
df_slice.columns = [x.split('_')[1] for x in df_slice.columns]
to_concat.append(df_slice)
new_df = pd.concat(to_concat)
I have a dataframe with strings and a dictionary which values are lists of strings.
I need to check if each string of the dataframe contains any element of every value in the dictionary. And if it does, I need to label it with the appropriate key from the dictionary. All I need to do is to categorize all the strings in the dataframe with keys from the dictionary.
For example.
df = pd.DataFrame({'a':['x1','x2','x3','x4']})
d = {'one':['1','aa'],'two':['2','bb']}
I would like to get something like this:
df = pd.DataFrame({
'a':['x1','x2','x3','x4'],
'Category':['one','two','x3','x4']})
I tried this, but it has not worked:
df['Category'] = np.nan
for k, v in d.items():
for l in v:
df['Category'] = [k if l in str(x).lower() else x for x in df['a']]
Any ideas appreciated!
Firstly create a function that do this for you:-
def func(val):
for x in range(0,len(d.values())):
if val in list(d.values())[x]:
return list(d.keys())[x]
Now make use of split() and apply() method:-
df['Category']=df['a'].str.split('',expand=True)[2].apply(func)
Finally use fillna() method:-
df['Category']=df['Category'].fillna(df['a'])
Now if you print df you will get your expected output:-
a Category
0 x1 one
1 x2 two
2 x3 x3
3 x4 x4
Edit:
You can also do this by:-
def func(val):
for x in range(0,len(d.values())):
if any(l in val for l in list(d.values())[x]):
return list(d.keys())[x]
then:-
df['Category']=df['a'].apply(func)
Finally:-
df['Category']=df['Category'].fillna(df['a'])
I've come up with the following heuristic, which looks really dirty.
It outputs what you desire, albeit with some warnings, since I've used indices to append values to dataframe.
import pandas as pd
import numpy as np
def main():
df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}
found = False
i = 0
df['Category'] = np.nan
for x in df['a']:
for k,v in d.items():
for item in v:
if item in x:
df['Category'][i] = k
found = True
break
else:
df['Category'][i] = x
if found:
found = False
break
i += 1
print(df)
main()
my question relates to using the values in a dataframe column as keys in order to return their respective values and run a conditional.
I have a dataframe, df, containing a column "count" that has integers from 1 to 8 and a column "category" that has values either "A", "B", or "C"
I have a dictionary, dct, containing pairs A:2, B:4, C:6
This is my (incorrect) code:
result = df[df["count"] >= dct.get(df["category"])]
So I want to return a dataframe where the "count" value for a given row is equal to more than the value retrieved from a dictionary using the "category" letter in the same row.
So if there were count values of (1, 2, 6, 6) and category values of (A, B, C, A), the third and forth row would be return in the resultant dataframe.
How do I modify the above code to achieve this?
A good way to go is to add your dictionary into a the existing dataframe and then apply a query on the new dataframe:
import pandas as pd
df = pd.DataFrame(data={'count': [4, 5, 6], 'category': ['A', 'B', 'C']})
dct = {'A':5, 'B':4, 'C':-1}
df['min_count'] = df['category'].map(dct)
df = df.query('count>min_count')
following your logic:
import pandas as pd
dct = {'A':2, 'B':4, 'C':6}
df = pd.DataFrame({'count':[1,2,5,6],
'category':['A','B','C','A']})
print('original dataframe')
print(df)
def process_row(x):
return True if x['count'] >= dct[x['category']] else False
f = df.apply(lambda row: process_row(row), axis=1)
df = df[f]
print('final output')
print(df)
output:
original dataframe
count category
0 1 A
1 2 B
2 5 C
3 6 A
final output
count category
3 6 A
A small modification to your code:
result = df[df['count'] >= df['category'].apply(lambda x: dct[x])]
You cannot directly use dct.get(df['category']) because df['category'] returns a mutable Series which cannot be used as a dictionary key (Dictionary keys need to be immutable objects)
So, apply and lambda to the rescue! :)
In the following example, how do I keep only rows that have "a" in the array present in column tags?
df = pd.DataFrame(columns=["val", "tags"], data=[[5,["a","b","c"]]])
df[3<df.val] # this works
df["a" in df.tags] # is there an equivalent for filtering on tags?
I think using sets is intuitive. Then you can use >= as set containment
df[df.tags.apply(set) >= {'a'}]
val tags
0 5 [a, b, c]
A Numpy alternative would be
tags = df['tags']
n = len(tags)
out = np.zeros(n, np.bool8)
i = np.arange(n).repeat(tags.str.len())
np.logical_or.at(out, i, np.concatenate(tags) == 'a')
df[out]
Per #JonClements
You can use set.issubset in a map (very clever)
df[df.tags.map({'a'}.issubset)]
val tags
0 5 [a, b, c]
Use list comprehension:
df1 = df[["a" in x for x in df.tags]]
you could use apply with a lambda function which tests if 'a' is in arg of lambda:
df.tags.apply(lambda x: 'a' in x)
Result:
0 True
Name: tags, dtype: bool
This can also be used to index your dataframe:
df[df.tags.apply(lambda x: 'a' in x)]
Result:
val tags
0 5 [a, b, c]
I have a DataFrame, and I want to select certain rows and columns from it. I know how to do this using loc. However, I want to be able to specify each criteria individually, rather than in one go.
import numpy as np
import pandas as pd
idx = pd.IndexSlice
index = [np.array(['foo', 'foo', 'qux', 'qux']),
np.array(['a', 'b', 'a', 'b'])]
columns = ["A", "B"]
df = pd.DataFrame(np.random.randn(4, 2), index=index, columns=columns)
print df
print df.loc[idx['foo', :], idx['A':'B']]
A B
foo a 0.676649 -1.638399
b -0.417915 0.587260
qux a 0.294555 -0.573041
b 1.592056 0.237868
A B
foo a -0.470195 -0.455713
b 1.750171 -0.409216
Requirement
I want to be able to achieve the same result with something like the following bit of code, where I specify each criteria one by one. It's also important that I'm able to use a slice_list to allow dynamic behaviour [i.e. the syntax should work whether there are two, three or ten different criteria in the slice_list].
slice_1 = 'foo'
slice_2 = ':'
slice_list = [slice_1, slice_2]
column_slice = "'A':'B'"
print df.loc[idx[slice_list], idx[column_slice]]
You can achieve this using the slice built-in function. You can't build slices with strings as ':' is a literal character and not a syntatical one.
slice_1 = 'foo'
slice_2 = slice(None)
column_slice = slice('A', 'B')
df.loc[idx[slice_1, slice_2], idx[column_slice]]
You might have to build your "slice lists" a little differently than you intended, but here's a relatively compact method using df.merge() and df.ix[]:
# Build a "query" dataframe
slice_df = pd.DataFrame(index=[['foo','qux','qux'],['a','a','b']])
# Explicitly name columns
column_slice = ['A','B']
slice_df.merge(df, left_index=True, right_index=True, how='inner').ix[:,column_slice]
Out[]:
A B
foo a 0.442302 -0.949298
qux a 0.425645 -0.233174
b -0.041416 0.229281
This method also requires you to be explicit about your second index and columns, unfortunately. But computers are great at making long tedious lists for you if you ask nicely.
EDIT - Example of method to dynamically built a slice list that could be used like above.
Here's a function that takes a dataframe and spits out a list that could then be used to create a "query" dataframe to slice the original by. It only works with dataframes with 1 or 2 indices. Let me know if that's an issue.
def make_df_slice_list(df):
if df.index.nlevels == 1:
slice_list = []
# Only one level of index
for dex in df.index.unique():
if input("DF index: " + dex + " - Include? Y/N: ") == "Y":
# Add to slice list
slice_list.append(dex)
if df.index.nlevels > 1:
slice_list = [[] for _ in xrange(df.index.nlevels)]
# Multi level
for i in df.index.levels[0]:
print "DF index:", i, "has subindexes:", [dex for dex in df.ix[i].index]
sublist = input("Enter a the indexes you'd like as a list: ")
# if no response, the first entry
if len(sublist)==0:
sublist = [df.ix[i].index[0]]
# Add an entry to the first index list for each sub item passed
[slice_list[0].append(i) for item in sublist]
# Add each of the second index list items
[slice_list[1].append(item) for item in sublist]
return slice_list
I'm not advising this as a way to communicate with your user, just an example. When you use it you have to pass strings (e.g. "Y" and "N") and lists of string (["a","b"]) and empty lists [] at prompts. Example:
In [115]: slice_list = make_df_slice_list(df)
DF index: foo has subindexes: ['a', 'b']
Enter a the indexes you'd like as a list: []
DF index: qux has subindexes: ['a', 'b']
Enter a the indexes you'd like as a list: ['a','b']
In [116]:slice_list
Out[116]: [['foo', 'qux', 'qux'], ['a', 'a', 'b']]
# Back to my original solution, but now passing the list:
slice_df = pd.DataFrame(index=slice_list)
column_slice = ['A','B']
slice_df.merge(df, left_index=True, right_index=True, how='inner').ix[:,column_slice]
Out[117]:
A B
foo a -0.249547 0.056414
qux a 0.938710 -0.202213
b 0.329136 -0.465999
Building up on the answer by Ted Petrou:
slices = [('foo', slice(None)), slice('A', 'B')]
print df.loc[tuple(idx[s] for s in slices)]
A B
foo a -0.465421 -0.591763
b -0.854938 1.221204
slices = [('foo', slice(None)), 'A']
print df.loc[tuple(idx[s] for s in slices)]
foo a -0.465421
b -0.854938
Name: A, dtype: float64
slices = [('foo', slice(None))]
print df.loc[tuple(idx[s] for s in slices)]
A B
foo a -0.465421 -0.591763
b -0.854938 1.221204
You have to use tuples when calling __getitem__ (loc[...]) with a 'dynamic' argument.
You could also avoid building the slice objects by hand:
def to_selector(s):
if isinstance(s, tuple) or isinstance(s, list):
return tuple(map(to_selector, s))
ps = [None if len(p) == 0 else p for p in s.split(':')]
assert len(ps) > 0 and len(ps) <= 2
if len(ps) == 1:
assert ps[0] is not None
return ps[0]
return slice(*ps)
query = [('foo', ':'), 'A:B']
df.loc[tuple(idx[to_selector(s)] for s in query)]
do you mean this?
import numpy as np
import pandas as pd
idx = pd.IndexSlice
index = [np.array(['foo', 'foo', 'qux', 'qux']),
np.array(['a', 'b', 'a', 'b'])]
columns = ["A", "B"]
df = pd.DataFrame(np.random.randn(4, 2), index=index, columns=columns)
print df
#
la1 = lambda df: df.loc[idx['foo', :], idx['A':'B']]
la2 = lambda df: df.loc[idx['qux', :], idx['A':'B']]
laList = [la1, la2]
result = map(lambda la: la(df), laList)
print result[0]
print result[1]
A B
foo a 0.162138 -1.382822
b -0.822986 -0.403766
qux a 0.191695 -1.125841
b 0.669254 -0.704894
A B
foo a 0.162138 -1.382822
b -0.822986 -0.403766
A B
qux a 0.191695 -1.125841
b 0.669254 -0.704894
Did you simply mean this?
df.loc[idx['foo',:], :].loc[idx[:,'a'], :]
In a slightly more general form, for example:
def multiindex_partial_row_slice(df, part_idx, criteria):
slc = idx[tuple([slice(None) if i != part_idx else criteria
for i in range(len(df.index.levels))])]
return df.loc[slc, :]
multiindex_partial_row_slice(df, 1, slice('a','b'))
Similarly you can always narrow your current column set by appending .loc[:, columns] to your currently sliced view.