Iterating over lists produces unexpected results - python

In the first example below, I am iterating over a list of dataframes. The for loop creates column 'c'. Printing each dataframe shows that both elements in the list were updated.
In the second example, I am iterating over a list of variables. The for loop applies some math to each element, but when printing, the list does not reflect the changes made in the for loop.
Please help me understand why the elements in the second example are not affected by the for loop the way they are in the first example.
import pandas as pd
df1 = pd.DataFrame([[1,2],[3,4]], columns=['a', 'b'])
df2 = pd.DataFrame([[3,4],[5,6]], columns=['a', 'b'])
dfs = [df1, df2]
for df in dfs:
    df['c'] = df['a'] + df['b']
print(df1)
print(df2)
result:
   a  b  c
0  1  2  3
1  3  4  7
   a  b   c
0  3  4   7
1  5  6  11
Second example:
a, b = 2, 3
test = [a, b]
for x in test:
    x = x * 2
print(test)
result: [2, 3]
expected result: [4, 6]

In your second example, test is a list of ints, which are not mutable. If you want an effect similar to your first snippet, you will have to store something mutable in your list:
a, b = 2, 3
test = [[a], [b]]
for x in test:
    x[0] = x[0] * 2
print(test)
Output: [[4], [6]]

When you iterate over a list like this, x takes the value at the current position.
for x in test:
    x = x * 2
When you assign a new value to x, you are not changing the element in the list; you are only rebinding the variable x to a new object.
To change the actual value in the list, iterate by index:
for i in range(len(test)):
    test[i] = test[i] * 2
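If you want both the index and the value, enumerate does the same thing a bit more readably, and a list comprehension is the usual idiom when you are building a new list anyway (a small sketch of both):
test = [2, 3]
for i, x in enumerate(test):
    test[i] = x * 2      # assignment through the index mutates the list
print(test)              # [4, 6]

# or, building a new list instead of mutating in place:
test = [x * 2 for x in [2, 3]]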

Related

python for loop with if statement to divide numbers

I am stuck with the following code. I have a column in which I want to divide the value by 2 if it is above 10, and run this for all the rows. I have tried the code below, but it raises the error that the truth value of a Series is ambiguous:
if df[x] > 10:
    df[x]/2
else:
    df[x]
I suppose that I need a for loop in combination with the if statement, but I could not get it running. Does anyone have some ideas?
The easiest approach, I think, is to use boolean indexing. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(            # create dataframe
    20 * np.random.rand(6, 4),
    columns=list("ABCD"))
print(df)                     # print df before
df[df > 10] /= 2              # divide entries over 10 by 2
print(df)                     # print df after
Result:
           A          B          C          D
0   1.245686   1.443671  17.423559  17.617235
1  13.834285  10.482565   2.213459   9.581361
2   0.290626  14.082919   0.224327  11.033058
3   5.113568   5.305690  19.453723   3.260354
4  14.679005   8.761523   2.417432   4.843426
5  15.990754  12.421538   4.872804   5.577625

          A         B         C         D
0  1.245686  1.443671  8.711780  8.808617
1  6.917143  5.241283  2.213459  9.581361
2  0.290626  7.041459  0.224327  5.516529
3  5.113568  5.305690  9.726862  3.260354
4  7.339503  8.761523  2.417432  4.843426
5  7.995377  6.210769  4.872804  5.577625
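If you would rather leave the original frame untouched, DataFrame.mask builds a new frame with the same rule applied (a sketch under the same setup; mask replaces values wherever the condition is True):
import numpy as np
import pandas as pd

df = pd.DataFrame(20 * np.random.rand(6, 4), columns=list("ABCD"))
halved = df.mask(df > 10, df / 2)  # new frame; df itself is unchanged
print((halved <= 10).all().all())  # True: no entry above 10 remains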
It's not clear exactly what you're asking, but assuming you want to divide the element at index x if it is > 10, you can simply do:
>>> df = [1, 2, 30, 40, 5, 6]
>>> if df[2] > 10:
...     df[2] /= 2
...
>>> df
[1, 2, 15.0, 40, 5, 6]
Note the /= instead of your /.
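If you do want the explicit loop over a plain list, iterating by index lets the if statement modify the elements in place (a small sketch combining both ideas):
nums = [1, 2, 30, 40, 5, 6]
for i in range(len(nums)):
    if nums[i] > 10:
        nums[i] /= 2
print(nums)  # [1, 2, 15.0, 20.0, 5, 6]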

how to write a function to filter rows based on a list values one by one and make analysis

I have a list containing more than 10 values, and I have a full dataframe. I'd like to filter the dataframe on each value from the list to get a sub-dataframe and do some analysis on each of them. How can I write a function so I don't need to copy-paste and change the value so many times?
e.g.:
list = ['A','B','C']
df1 = df[df['column1']=='A']
df2 = df[df['column1']=='B']
df3 = df[df['column1']=='C']
For each sub-dataframe, I will do a groupby and value count:
df1.groupby(['column2']).size()
df2.groupby(['column2']).size()
df3.groupby(['column2']).size()
First, many DataFrames are not necessary here.
You can filter only the necessary values of column1 and pass both columns to groupby:
L = ['A','B','C']
s = df1[df1['column1'].isin(L)].groupby(['column1', 'column2']).size()
Last, select by the values of the list:
s.loc['A']
s.loc['B']
s.loc['C']
If you want a function:
def f(df, x):
    return df[df['column1'].eq(x)].groupby(['column2']).size()

print(f(df1, 'A'))
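If you prefer an explicit loop, iterating over the groupby object yields each sub-dataframe without naming them individually (a sketch with hypothetical data standing in for the question's full dataframe):
import pandas as pd

df = pd.DataFrame({'column1': list('AABBC'), 'column2': list('XXYXZ')})
L = ['A', 'B', 'C']
for value, sub in df[df['column1'].isin(L)].groupby('column1'):
    print(value)
    print(sub.groupby('column2').size())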
You can use locals() to create variables dynamically (it's not really good practice, but it works well):
df = pd.DataFrame({'column1': list('ABC') * 3, 'column2': list('IKJIJKLNK')})
lst = ['A', 'B', 'C']
for idx, elmt in enumerate(lst, 1):
    locals()[f"df{idx}"] = df[df['column1'] == elmt]
>>> df3
  column1 column2
2       C       J
5       C       K
8       C       K
>>> df3.value_counts('column2')
column2
K    2
J    1
dtype: int64
>>> df
  column1 column2
0       A       I
1       B       K
2       C       J
3       A       I
4       B       J
5       C       K
6       A       L
7       B       N
8       C       K
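A dictionary comprehension is generally a safer container than dynamically created variables; a sketch reusing df and lst from the snippet above:
dfs = {elmt: df[df['column1'] == elmt] for elmt in lst}
print(dfs['C'])                         # same rows as df3 above
print(dfs['C'].value_counts('column2'))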

For each item in list L, find all of the corresponding items in a dataframe

I'm looking for a fast solution to this Python problem:
"For each item in list L, find all of the corresponding items in a dataframe column (df['col1'])."
The catch is that both L and df['col1'] may contain duplicate values, and all duplicates should be returned.
For example:
L = [1,4,1]
d = {'col1': [1,2,3,4,1,4,4], 'col2': ['a','b','c','d','e','f','g']}
df = pd.DataFrame(data=d)
The desired output would be a new DataFrame where df['col1'] contains the values:
[1,1,1,1,4,4,4]
and rows are duplicated accordingly. Note that 1 appears 4 times (twice in L × twice in df).
I have found that the obvious solutions like .isin() don't work, because they drop duplicates.
A list comprehension does work, but it is too slow for my real-life problem, where len(df) = 16 million and len(L) = 150,000:
idx = [y for x in L for y in df.index[df['col1'].values == x]]
res = df.loc[idx].reset_index(drop=True)
This is basically just a problem of comparing two lists (with a bit of dataframe indexing difficulty tacked on), and a clever and very fast solution by Mad Physicist almost works for this, except that duplicates in L are dropped (it returns [1, 4, 1, 4, 4] in the example above; i.e., it finds the duplicates in df but ignores the duplicates in L).
train = np.array([...]) # my df['col1']
keep = np.array([...]) # my list L
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
I'd be grateful for any ideas.
Initial data:
L = [1,4,1]
df = pd.DataFrame({'col':[1,2,3,4,1,4,4] })
You can create a dataframe from L:
df2 = pd.DataFrame({'col': L})
and merge it with the initial dataframe:
result = df.merge(df2, how='inner', on='col')
print(result)
Result:
   col
0    1
1    1
2    1
3    1
4    4
5    4
6    4
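An inner merge forms every pairing of duplicated keys, which is exactly why 1 appears 2 × 2 = 4 times here. The same call carries the other columns along (a sketch using the question's full data):
import pandas as pd

L = [1, 4, 1]
df = pd.DataFrame({'col1': [1, 2, 3, 4, 1, 4, 4],
                   'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
result = df.merge(pd.DataFrame({'col1': L}), how='inner', on='col1')
print(result)  # 7 rows; col2 values are duplicated along with col1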
IIUC try:
L = [1,4,1]
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0)
(Not sure how you want the indexes; the above returns a somewhat raw format.)
Output:
0    1
4    1
3    4
5    4
6    4
0    1
4    1
Name: col, dtype: int64
Reindexed:
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0).reset_index(drop=True)
# output:
0    1
1    1
2    4
3    4
4    4
5    1
6    1
Name: col, dtype: int64

replace empty list with NaN in pandas dataframe

I'm trying to replace some empty lists in my data with NaN values. But how do I represent an empty list in the expression?
import numpy as np
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3         []  4
d.loc[d['x'] == [],['x']] = d.loc[d['x'] == [],'x'].apply(lambda x: np.nan)
d
ValueError: Arrays were different lengths: 4 vs 0
Also, I want to select [text] by using d[d['x'] == ["text"]], but it gives ValueError: Arrays were different lengths: 4 vs 1, whereas selecting 3 by using d[d['y'] == 3] works correctly. Why?
If you wish to replace empty lists in the column x with numpy nan's, you can do the following:
d.x = d.x.apply(lambda y: np.nan if len(y)==0 else y)
If you want to subset the dataframe on rows equal to ['text'], try the following:
d[[y==['text'] for y in d.x]]
I hope this helps.
You can use the apply function to match the specified cell value whether it is a string, a list, and so on.
For example, in your case:
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3         []  4
If you use d == 3 to select the cells whose value is 3, it works fine:
       x      y
0  False  False
1  False  False
2  False   True
3  False  False
However, if you use the equals sign to match a list, e.g. d == [text], d == ['text'], or d == '[text]', the result may not be what you expect.
There are a couple of solutions:
Use apply() on the specified Series in your DataFrame, just like the answer at the top.
A more general method uses applymap() on the whole DataFrame, which can serve as a preprocessing step:
d.applymap(lambda x: x == [])
       x      y
0  False  False
1  False  False
2  False  False
3   True  False
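To turn such a mask into the actual replacement, one option is boolean indexing with .loc (a sketch; the isinstance check is the kind of type guard mentioned below):
import numpy as np
import pandas as pd

d = pd.DataFrame({'x': [[1, 2, 3], [1, 2], ["text"], []], 'y': [1, 2, 3, 4]})
mask = d['x'].apply(lambda v: isinstance(v, list) and len(v) == 0)
d.loc[mask, 'x'] = np.nan  # only the empty-list rows become NaN
print(d)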
I hope this helps you and future readers. It would also be better to add a type check in your applymap function, which could otherwise raise exceptions.
To answer your main question, just leave out the empty lists altogether. If you build the frame with pandas.concat instead of a dictionary, NaN is filled in automatically wherever one column has a value and the other does not.
>>> import pandas as pd
>>> ser1 = pd.Series([[1,2,3], [1,2], ["text"]], name='x')
>>> ser2 = pd.Series([1,2,3,4], name='y')
>>> result = pd.concat([ser1, ser2], axis=1)
>>> result
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2     [text]  3
3        NaN  4
About your second question, it seems that you can't search inside an element. Perhaps you should make that a separate question, since it's not really related to your main one.

Selecting rows from pandas DataFrame using a list

I have a list of lists as below:
[[1, 2], [1, 3]]
The DataFrame is similar to:
   A  B  C
0  1  2  4
1  0  1  2
2  1  3  0
I would like to keep a row if the value in column A is equal to the first element of any of the nested lists and the value in column B of that same row is equal to the second element of that nested list.
Thus the resulting DataFrame should be:
   A  B  C
0  1  2  4
2  1  3  0
The code below does what you need:
import pandas

tmp_filter = pandas.DataFrame(None)  # the dataframe you want

# Create your list and your dataframe
tmp_list = [[1, 2], [1, 3]]
tmp_df = pandas.DataFrame([[1,2,4],[0,1,2],[1,3,0]], columns=['A','B','C'])

# This function goes through the df column by column and
# only keeps the rows with the values you want
def pass_true_df(df, cond):
    for i, c in enumerate(cond):
        df = df[df.iloc[:, i] == c]
    return df

# Pass through your list and add the rows you want to keep
for i in tmp_list:
    tmp_filter = pandas.concat([tmp_filter, pass_true_df(tmp_df, i)])
print(tmp_filter)
import pandas
df = pandas.DataFrame([[1,2,4],[0,1,2],[1,3,0],[0,2,5],[1,4,0]],
                      columns=['A','B','C'])
filt = pandas.DataFrame([[1, 2], [1, 3], [0, 2]],
                        columns=['A', 'B'])
accum = []
# group the dataframe to filter against
data_g = df.groupby('A')
for k2, v2 in data_g:
    accum.append(v2[v2.B.isin(filt.B[filt.A == k2])])
print(pandas.concat(accum))
result:
   A  B  C
3  0  2  5
0  1  2  4
2  1  3  0
(I made the data and filter a little more complicated as a test.)
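As a side note, the same pairwise match can be written as an inner merge on both columns, much like the merge answer to the earlier question (a sketch with the question's data; note that merge resets the row index):
import pandas as pd

df = pd.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])
pairs = pd.DataFrame([[1, 2], [1, 3]], columns=['A', 'B'])
print(df.merge(pairs, on=['A', 'B']))  # keeps only the matching (A, B) rows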
