Question regarding how to generate a new key in a pandas dataframe - python

I have a data frame like this:
df = pd.DataFrame({"item" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
And I want to create a new key (let's call it "range") that holds the values of "beg" and "end" for every item, i.e.:
  item  beg  end  range
0    a    1   10   1 10
1    b    2   11   2 11
I'm guessing I have to make a list of lists, but I can't figure out how.
I know that I should loop over the df like this:
for index, row in df.iterrows():
    df["range"] = # here goes the function
Thanks in advance.

You can try:
df['range'] = df[['beg','end']].astype(str).agg(' '.join, axis=1)
or:
df['range'] = df.apply(lambda x: f"{x['beg']} {x['end']}", axis=1)
Which gives you:
  item  beg  end range
0    a    1   10  1 10
1    b    2   11  2 11
where range is string type. If you want list type:
df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
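For the list version, taking the underlying values directly should also work (a minimal sketch, assuming the same df):
# equivalent to the zip approach above; returns one [beg, end] list per row
df['range'] = df[['beg', 'end']].values.tolist()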

Related

Matching value with column to retrieve index value

Please see example dataframe below:
I'm trying to match the values of column X with the column names and retrieve the value from the matched column,
so that:
A  B  C  X  result
1  2  3  B       2
5  6  7  A       5
8  9  1  C       1
Any ideas?
Here are a couple of methods:
# Apply Method:
df['result'] = df.apply(lambda x: df.loc[x.name, x['X']], axis=1)
# List comprehension Method:
df['result'] = [df.loc[i, x] for i, x in enumerate(df.X)]
# Pure Pandas Method:
df['result'] = (df.melt('X', ignore_index=False)
                  .loc[lambda x: x['X'].eq(x['variable']), 'value'])
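If the frame is large, a positional NumPy lookup should also work and is usually faster (a sketch, assuming df has columns A, B, C, X as above and every value in X names an existing column):
import numpy as np

cols = df.columns.get_indexer(df['X'])                  # column position to pick for each row
df['result'] = df.to_numpy()[np.arange(len(df)), cols]  # row-wise positional lookup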
Here I just build a dataframe from your example and call it df
data = {
    'A': (1, 5, 8),
    'B': (2, 6, 9),
    'C': (3, 7, 1),
    'X': ('B', 'A', 'C')}
df = pd.DataFrame(data)
You can extract the value from another column based on 'X' using the following code. There may be a better way to do this that avoids converting to a list first and retrieving the first element.
list(df.loc[df['X'] == 'B', 'B'])[0]
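One option that should avoid the list round-trip is to grab the first match with .iloc (a minimal sketch, assuming the same df):
df.loc[df['X'] == 'B', 'B'].iloc[0]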
I'm going to create a column called 'result', fill it with 'NA', and then replace the values based on your conditions. The loop below extracts the value and uses .loc to place it in your dataframe.
df['result'] = 'NA'
for idx, val in enumerate(df['X']):
    extracted = list(df.loc[df['X'] == val, val])[0]
    df.loc[idx, 'result'] = extracted
Here it is as a function:
def search_replace(dataframe, search_col='X', new_col_name='result'):
    dataframe[new_col_name] = 'NA'
    for idx, val in enumerate(dataframe[search_col]):
        extracted = list(dataframe.loc[dataframe[search_col] == val, val])[0]
        dataframe.loc[idx, new_col_name] = extracted
    return dataframe
and the output
>>> search_replace(df)
   A  B  C  X result
0  1  2  3  B      2
1  5  6  7  A      5
2  8  9  1  C      1

python for loop with if statement to divide numbers

if statement and for loop
I am stuck with the following code. I have a column in which I want to divide the value by 2 if it is above 10, and run this for all the rows. I have tried this code, but it gives the error that the truth value of a Series is ambiguous:
if df[x] > 10:
    df[x]/2
else:
    df[x]
I suppose that I need a for loop in combination with the if statement. However, I could not get it running; does anyone have any ideas?
The easiest approach, I think, is to use boolean indexing. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(            # create a dataframe of random values in [0, 20)
    20 * np.random.rand(6, 4),
    columns=list("ABCD"))
print(df)                     # print df before
df[df > 10] /= 2              # divide entries over 10 by 2
print(df)                     # print df after
Result:
           A          B          C          D
0   1.245686   1.443671  17.423559  17.617235
1  13.834285  10.482565   2.213459   9.581361
2   0.290626  14.082919   0.224327  11.033058
3   5.113568   5.305690  19.453723   3.260354
4  14.679005   8.761523   2.417432   4.843426
5  15.990754  12.421538   4.872804   5.577625

          A         B         C         D
0  1.245686  1.443671  8.711780  8.808617
1  6.917143  5.241283  2.213459  9.581361
2  0.290626  7.041459  0.224327  5.516529
3  5.113568  5.305690  9.726862  3.260354
4  7.339503  8.761523  2.417432  4.843426
5  7.995377  6.210769  4.872804  5.577625
It's not clear exactly what you're asking, but assuming you want to divide the element at index x if it is > 10, you can simply do
>>> df = [1, 2, 30, 40, 5, 6]
>>> if df[2] > 10:
...     df[2] /= 2
...
>>> df
[1, 2, 15.0, 40, 5, 6]
Note the /= instead of your /.
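If the goal is to apply the rule to every element of a plain Python list, a list comprehension should do it in one pass (a minimal sketch with the same values):
values = [1, 2, 30, 40, 5, 6]
values = [v / 2 if v > 10 else v for v in values]  # [1, 2, 15.0, 20.0, 5, 6]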

How to get new pandas dataframe with certain columns and rows depending on list elements?

I have such a list:
l = ['A','B']
And such a dataframe df
Name  x  y
A     1  2
B     2  1
C     2  2
I now want to get a new dataframe where only the rows whose Name is included in l are kept, and only the columns Name and x.
new_df should look like this:
Name  x
A     1
B     2
I was playing around with isin but did not solve this problem.
Use DataFrame.loc with Series.isin:
new_df = df.loc[df.Name.isin(l), ["Name", "x"]]
This should do it:
# assuming Name is the index
new_df = df[df.index.isin(l)]
# if you only want column x
new_df = df.loc[df.index.isin(l), "x"]
simple as that:
l = ['A','B']

def make_empty(row):
    # keep a value only if it is in l, otherwise blank it out
    for idx, value in enumerate(row):
        row.iloc[idx] = value if value in l else ''
    return row

df_new = df[df['Name'].isin(l) | df['x'].isin(l)][['Name','x']]
df_new = df_new.apply(make_empty, axis=1)
Output:
  Name x
0    A
1    B

Match entire list with values from another dataframe

I have a dataframe with a list in one column and want to match all items in this list with a second dataframe. The matched values should then be added (as a list) to a new column in the first dataframe.
data = {'froots': [['apple','banana'], ['apple','strawberry']]}
df1 = pd.DataFrame(data)

data = {'froot': ['apple','banana','strawberry'],
        'age': [2,3,5]}
df2 = pd.DataFrame(data)
DF1
index  froots
1      ['apple','banana']
2      ['apple','strawberry']
DF2
index  froot       age
1      apple       2
2      banana      3
3      strawberry  5
New DF1
index  froots                  age
1      ['apple','banana']      [2,3]
2      ['apple','strawberry']  [2,5]
I have a simple solution that takes way too long:
age = list()
for index, row in df1.iterrows():
    numbers = row.froots
    tmp = df2[['froot','age']].apply(lambda x: x['age'] if x['froot'] in numbers else None, axis=1).dropna().tolist()
    age.append(tmp)
df1['age'] = age
Is there maybe a faster solution to this problem?
Thanks in Advance!
Use a list comprehension with a dictionary created from df2, adding a value to the list only if it exists in the dictionary (tested with if):
d = df2.set_index('froot')['age'].to_dict()
df1['age'] = df1['froots'].apply(lambda x: [d[y] for y in x if y in d])
print(df1)
                froots     age
0      [apple, banana]  [2, 3]
1  [apple, strawberry]  [2, 5]
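An alternative sketch, assuming the df1/df2 defined above, uses explode and map so the lookup stays vectorised:
s = df1['froots'].explode().map(df2.set_index('froot')['age'])  # one row per fruit, mapped to its age
df1['age'] = s.groupby(level=0).agg(list)                       # collect back into a list per original row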

For each item in list L, find all of the corresponding items in a dataframe

I'm looking for a fast solution to this Python problem:
"For each item in list L, find all of the corresponding items in a dataframe column (df['col1'])."
The catch is that both L and df['col1'] may contain duplicate values, and all duplicates should be returned.
For example:
L = [1,4,1]
d = {'col1': [1,2,3,4,1,4,4], 'col2': ['a','b','c','d','e','f','g']}
df = pd.DataFrame(data=d)
The desired output would be a new DataFrame where df['col1'] contains the values:
[1,1,1,1,4,4,4]
and rows are duplicated accordingly. Note that 1 appears 4 times (twice in L * twice in df).
I have found that the obvious solutions like .isin() don't work because they drop duplicates.
A list comprehension does work, but it is too slow for my real-life problem, where len(df) = 16 million and len(L) = 150,000:
idx = [y for x in L for y in df.index[df['col1'].values == x]]
res = df.loc[idx].reset_index(drop=True)
This is basically just a problem of comparing two lists (with a bit of dataframe indexing difficulty tacked on). A clever and very fast solution by Mad Physicist almost works for this, except that duplicates in L are dropped: it returns [1, 4, 1, 4, 4] in the example above, i.e., it finds the duplicates in df but ignores the duplicates in L.
train = np.array([...]) # my df['col1']
keep = np.array([...]) # my list L
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
I'd be grateful for any ideas.
Initial data:
L = [1,4,1]
df = pd.DataFrame({'col':[1,2,3,4,1,4,4] })
You can create dataframe from L
df2 = pd.DataFrame({'col':L})
and merge it with initial dataframe:
result = df.merge(df2, how='inner', on='col')
print(result)
Result:
   col
0    1
1    1
2    1
3    1
4    4
5    4
6    4
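A small extension of the same idea (a sketch, assuming the two-column frame from the question): merging on col1 keeps col2 and repeats rows once per duplicate in L:
L = [1, 4, 1]
df = pd.DataFrame({'col1': [1, 2, 3, 4, 1, 4, 4],
                   'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
res = df.merge(pd.DataFrame({'col1': L}), on='col1')  # 7 rows: each 1 in df matches both 1s in L, each 4 matches once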
IIUC try:
L = [1,4,1]
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0)
(Not sure how you want the indexes; the above returns a somewhat raw format with the original index.)
Output:
0 1
4 1
3 4
5 4
6 4
0 1
4 1
Name: col, dtype: int64
Reindexed:
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0).reset_index(drop=True)
#output:
0 1
1 1
2 4
3 4
4 4
5 1
6 1
Name: col, dtype: int64
