iterating large pandas DataFrame too slow - python

I have a large dataframe where I would like to make a new column based on existing columns.
import numpy as np
import pandas as pd

test = pd.DataFrame({'Test1': ["100", "4242", "3454", "2", "54"]})
test['Test2'] = ""
for i in range(len(test)):
    if len(test.iloc[i, 0]) == 4:
        test.iloc[i, -1] = test.iloc[i, 0][0:1]
    elif len(test.iloc[i, 0]) == 3:
        test.iloc[i, -1] = test.iloc[i, 0][0]
    elif len(test.iloc[i, 0]) < 3:
        test.iloc[i, -1] = 0
    else:
        test.iloc[i, -1] = np.nan
This works for a small dataframe, but with a large data set (10+ million rows) it takes far too long. How can I make this process faster?

Use the str.len method to find the lengths of the strings in the 'Test1' column, then feed that information to np.select to assign the relevant parts of the strings in 'Test1' (or default values) to 'Test2'.
import numpy as np
lengths = test['Test1'].str.len()
test['Test2'] = np.select(
    [lengths == 4, lengths == 3, lengths < 3],
    [test['Test1'].str[0:1], test['Test1'].str[0], 0],
    np.nan)
Output:
Test1 Test2
0 100 1
1 4242 4
2 3454 3
3 2 0
4 54 0
Note that [0:1] only returns the first character (same as [0]), so maybe you meant [0:2] (or something else); otherwise you can merge those two conditions and save one.
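A tiny plain-Python illustration of that note (not part of the answer's solution):
s = "4242"
s[0]    # '4'  -> first character
s[0:1]  # '4'  -> identical to s[0] for strings
s[0:2]  # '42' -> first two characters, in case that was the intent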

So, basically you want to extract the first character of the string if it is at least 3 characters long. (NB: for a string, [0] and [0:1] yield exactly the same thing.)
Just use a regex with a lookahead for that.
test['Test2'] = test['Test1'].str.extract('^(.)(?=..)').fillna(0)
Output:
Test1 Test2
0 100 1
1 4242 4
2 3454 3
3 2 0
4 54 0
How the regex works:
^ # match beginning of string
(.) # capture one character
(?=..) # only if it is followed by at least two characters

Related

Python - count successive leading digits on a pandas row string without counting non successive digits

I need to create a new column that counts the number of leading 0s; however, I am getting errors trying to do so.
I extracted data from Mongo using the regex ^0[0]*[1-9][0-9]* and saved it to a CSV file. These are all "Sequences" that start with a 0.
df['Sequence'].str.count('0')
and
df['Sequence'].str.count('0[0]*[1-9][0-9]')
give the results below. As you can see, both counts also include non-leading 0s; they simply return the total number of 0s in the string.
Sequence 0s
0 012312312 1
1 024624624 1
2 036901357 2
3 002486248 2
4 045074305 3
5 080666140 3
I also tried writing a loop, which worked when testing, but when using it on the data frame I encounter the following: IndexError: string index out of range
results = []
count = 0
index = 0
for item in df['Sequence']:
    count = 0
    index = 0
    while item[index] == "0":
        count = count + 1
        index = index + 1
    results.append(count)
df['0s'] = results
df
In short: if I can get 2 for the substring 001230 instead of 3, I can save the results in a column and do my stats on it.
You can use extract with the ^(0*) regex to match only the leading zeros. Then use str.len to get the length.
df['0s'] = df['sequence'].str.extract('^(0*)', expand = False).str.len()
Example input:
df = pd.DataFrame({'sequence': ['12040', '01230', '00010', '00120']})
Output:
sequence 0s
0 12040 0
1 01230 1
2 00010 3
3 00120 2
You can use this regex:
'^0+'
the ^ means: match only if the pattern occurs at the beginning of the string.
the + means: match the preceding character one or more times.
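If you want to apply it with pandas string methods, here's a minimal sketch assuming the same df['Sequence'] column as above; note that str.extract needs a capture group around the pattern:
# count the leading zeros: extract the run of 0s at the start of each string,
# take its length, and fill NaN (no leading zero) with 0
df['0s'] = (df['Sequence'].str.extract('(^0+)', expand=False)
                          .str.len()
                          .fillna(0)
                          .astype(int))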
IIUC, you want to count the number of leading 0s, right? Take advantage of the fact that leading 0s disappear when a numeric string is converted from str to int. Here's one solution:
df['leading 0s'] = df['Sequence'].str.len() - df['Sequence'].astype(int).astype(str).str.len()
Output:
Sequence leading 0s
0 012312312 1
1 024624624 1
2 036901357 1
3 002486248 2
4 045074305 1
5 080666140 1
Try str.findall:
df['0s'] = df['Sequence'].str.findall('^0*').str[0].str.len()
print(df)
# Output:
Sequence 0s
0 012312312 1
1 024624624 1
2 036901357 1
3 002486248 2
4 045074305 1
5 080666140 1

groupby name and position in the group

I would like to group by a column and then split one or more of the groups into two.
Example
This
np.random.seed(11)
df = pd.DataFrame({"animal":np.random.choice( ['panda','python','shark'], 10),
"number": 1})
df.sort_values("animal")
gives me this dataframe
animal number
1 panda 1
4 panda 1
7 panda 1
9 panda 1
0 python 1
2 python 1
3 python 1
5 python 1
8 python 1
6 shark 1
Now I would like to group by animal but also split the "pythons" into the first two and the rest of the "pythons". So that
df.groupby(your_magic).sum()
gives me
number
animal
panda 4
python_1 2
python_2 3
shark 1
What about
np.random.seed(11)
df = pd.DataFrame({"animal":np.random.choice( ['panda','python','shark'], 10),
"number": 1})
## find index on which you split python into python_1 and python_2
python_split_idx = df[df['animal'] == 'python'].iloc[2].name
## rename python according to index
df[df['animal'] == 'python'] = df[df['animal'] == 'python'].apply(
    lambda row: pd.Series(
        ['python_1' if row.name < python_split_idx else 'python_2', row.number],
        index=['animal', 'number']),
    axis=1)
## group according to all animals and sum the number
df.groupby('animal').agg({'number': sum})
Output:
number
animal
panda 4
python_1 2
python_2 3
shark 1
I ended up using something similar to Stefan's answer but slightly reformulated it to avoid having to use apply. This looks a bit cleaner to me.
idx1 = df[df['animal'] == 'python'].iloc[:2].index
idx2 = df[df['animal'] == 'python'].iloc[2:].index
df.loc[idx1, "animal"] = "python_1"
df.loc[idx2, "animal"] = "python_2"
df.groupby("animal").sum()
This is a quick and dirty (and somewhat inefficient) way to do it if you want to rename all your pythons before you group them.
indices = []
for i, v in enumerate(df['animal']):
    if v == 'python':
        if len(indices) < 2:
            indices.append(i)
            df.loc[i, 'animal'] = 'python_1'
        else:
            df.loc[i, 'animal'] = 'python_2'
grouped = df.groupby('animal').agg('sum')
print(grouped)
This provides your desired output exactly.
As an alternative, here's a totally different approach that creates another column to capture whether each animal is a member of the group of the top two pythons and then groups on both columns.
snakes = df[df['animal'] == 'python']
df['special_snakes'] = [1 if i not in snakes.index[:2] else 0 for i in df.index]
df.groupby(['animal', 'special_snakes']).agg('sum')
The output looks a bit different, but achieves the same outcome. This approach also has the advantage of capturing the condition on which you are grouping your animals without actually changing the values in the animal column.
number
animal special_snakes
panda 1 4
python 0 2
1 3
shark 1 1

How to convert (Not-One) Hot Encodings to a Column with Multiple Values on the Same Row

I basically want to reverse the process posed in this question.
>>> import pandas as pd
>>> example_input = pd.DataFrame({"one"   : [0,1,0,1,0],
...                               "two"   : [0,0,0,0,0],
...                               "three" : [1,1,1,1,0],
...                               "four"  : [1,1,0,0,0]})
>>> print(example_input)
one two three four
0 0 0 1 1
1 1 0 1 1
2 0 0 1 0
3 1 0 1 0
4 0 0 0 0
>>> desired_output = pd.DataFrame(["three, four", "one, three, four",
...                                "three", "one, three", ""])
>>> print(desired_output)
0
0 three, four
1 one, three, four
2 three
3 one, three
4
There are many questions (examples 1 & 2) about reversing one-hot encoding, but the answers rely on only one binary class being active per row, while my data can have multiple classes active in the same row.
This question comes close to addressing what I need, but its multiple classes are separated on different rows. I need my results to be strings joined by a separator (for example ", "), such that the output has the same number of rows as the input.
Using the ideas found in these two questions (1 & 2), I was able to come up with a solution, but it requires an ordinary python for loop to iterate through the rows, which I suspect will be slow compared to a solution which entirely uses pandas.
The input dataframe can use actual Boolean values instead of integer encoding if it makes things easier. The output can be a dataframe or a series; I'm eventually going to add the resulting column to a larger dataframe. I'm also open to using numpy if it allows for a better solution, but otherwise I would prefer to stick with pandas.
You can use DataFrame.dot, which is much faster than iterating over all the rows in the dataframe:
df.dot(df.columns + ', ').str.rstrip(', ')
0 three, four
1 one, three, four
2 three
3 one, three
4
dtype: object
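For context (a rough illustration of the mechanics, not part of the answer itself): multiplying a Python string by 0 or 1 yields '' or the string, and the dot product sums those per row, which concatenates the active column names:
# one row of the 0/1 frame times the padded column names, then summed
row = [0, 0, 1, 1]
labels = ['one, ', 'two, ', 'three, ', 'four, ']   # i.e. df.columns + ', '
print(''.join(n * s for n, s in zip(row, labels)).rstrip(', '))   # three, four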
Here's a solution using a python list comprehension to iterate through each row:
import pandas as pd
def reverse_hot_encoding(df, sep=', '):
    df = df.astype(bool)
    l = [sep.join(df.columns[row]) for _, row in df.iterrows()]
    return pd.Series(l)

if __name__ == '__main__':
    example_input = pd.DataFrame({"one"   : [0,1,0,1,0],
                                  "two"   : [0,0,0,0,0],
                                  "three" : [1,1,1,1,0],
                                  "four"  : [1,1,0,0,0]})
    print(reverse_hot_encoding(example_input))
print(reverse_hot_encoding(example_input))
And here's the output:
0 three, four
1 one, three, four
2 three
3 one, three
4
dtype: object
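One possible tweak, since the question mentions assigning the result back onto a larger dataframe: passing the original index to pd.Series keeps rows aligned even when that frame's index is not 0..n-1 (a small sketch, not part of the answer above):
# keep the source index so the new column lines up when assigned back
result = pd.Series([', '.join(example_input.columns[row])
                    for _, row in example_input.astype(bool).iterrows()],
                   index=example_input.index)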

Adding unique identifiers to duplicate values in pandas dataframe

I would like to create unique identifiers for values that are duplicates. The only duplicated values are 0's. The idea is to convert each zero to zero plus its position (0+1 for the first row, 0+2 for the second row, etc.). However, the column also contains other, non-duplicate values.
I have written this line of code to try to convert the zero values as stated, but I am getting this error message:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('
Here is my code
seller_customer['customer_id'] = np.where(seller_customer['customer_id']==0, seller_customer['customer_id'] + seller_customer.groupby(['customer_id']).cumcount().replace('0',''))
Here is a sample of my data
{0: '7e468d618e16c6e1373fb2c4a522c969',
1: '1c14a115bead8a332738c5d7675cca8c',
2: '434dee65d973593dbb8461ba38202798',
3: '4bbeac9d9a22f0628ba712b90862df28',
4: '578d5098cbbe40771e1229fea98ccafd',
5: 0,
6: 0,
7: 0}
If I understand correctly, you can just assign range values to those ids that are 0:
df.loc[df['id']==0, 'id'] = np.arange((df['id']==0).sum()) + 1
print(df)
Output:
id
0 7e468d618e16c6e1373fb2c4a522c969
1 1c14a115bead8a332738c5d7675cca8c
2 434dee65d973593dbb8461ba38202798
3 4bbeac9d9a22f0628ba712b90862df28
4 578d5098cbbe40771e1229fea98ccafd
5 1
6 2
7 3
Or, shorter but slightly slower:
df.loc[df['id']==0, 'id'] = (df['id']==0).cumsum()
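A self-contained sketch tying this to the sample data from the question, assuming it is loaded into a single column named 'id' (the column name is an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['7e468d618e16c6e1373fb2c4a522c969',
                          '1c14a115bead8a332738c5d7675cca8c',
                          '434dee65d973593dbb8461ba38202798',
                          '4bbeac9d9a22f0628ba712b90862df28',
                          '578d5098cbbe40771e1229fea98ccafd',
                          0, 0, 0]})

# replace the zero rows with 1, 2, 3, ... while leaving the hashes untouched
df.loc[df['id'] == 0, 'id'] = np.arange((df['id'] == 0).sum()) + 1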
You can do something like this:
from pandas.util import hash_pandas_object
import numpy as np
df.x = np.where(df.x == 0, hash_pandas_object(df.x), df.x)
df
Output:
x
0 7e468d618e16c6e1373fb2c4a522c969
1 1c14a115bead8a332738c5d7675cca8c
2 434dee65d973593dbb8461ba38202798
3 4bbeac9d9a22f0628ba712b90862df28
4 578d5098cbbe40771e1229fea98ccafd
5 593769213749726025
6 14559158595676751865
7 4575103004772269825
They won't be sequential like the index, but they will be unique (almost certainly, unless you encounter a hash collision).

pandas get position of a given index in DataFrame

Let's say I have a DataFrame like this:
df
A B
5 0 1
18 2 3
125 4 5
where 5, 18, 125 are the index
I'd like to get the line before (or after) a certain index. For instance, I have index 18 (e.g. obtained via df[df.A==2].index), and I want to get the line before it, without knowing in advance that that line has 5 as its index.
2 sub-questions:
How can I get the position of index 18? Something like df.loc[18].get_position(), which would return 1, so I could reach the line before with df.iloc[df.loc[18].get_position()-1]
Is there another solution, a bit like the -C, -A or -B options of grep?
For your first question:
base = df.index.get_indexer_for(df[df.A == 2].index)
or alternatively
base = df.index.get_loc(18)
To get the surrounding ones:
mask = pd.Index(base).union(pd.Index(base - 1)).union(pd.Index(base + 1))
I used Indexes and unions to remove duplicates. You may want to keep them, in which case you can use np.concatenate instead.
Be careful with matches on the very first or last rows :)
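A minimal sketch of the "line before" lookup itself, assuming the small df from the question and that the matched row is not the first one:
pos = df.index.get_loc(18)       # integer position of label 18 -> 1
previous_row = df.iloc[pos - 1]  # the row whose index label is 5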
If you need to convert more than 1 index, you can use np.where.
Example:
# df
A B
5 0 1
18 2 3
125 4 5
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [0,2,4], "B": [1,3,5]}, index=[5,18,125])
np.where(df.index.isin([18,125]))
Output:
(array([1, 2]),)
