I have a pandas DataFrame with a column of movie genres stored as pipe-separated strings. They looked like this:
Genre
Adventure|Animation|Children|Comedy|Fantasy
Comedy|Romance
...
I used str.split to turn each cell into a list, like this:
Genre
[Adventure, Animation, Children, Comedy, Fantasy]
[Adventure, Children, Fantasy]
[Comedy, Romance]
[Comedy, Drama, Romance]
[Comedy]
I want to get a count of all the genres. For example, how many times did Comedy appear? How many times did Adventure? And so on. I can't seem to figure this out.
The result would look like this:
Comedy 4
Adventure 2
Animation 1
(...and so on...)
As somebody from the for-loop club, I recommend using Python's C-accelerated routines, itertools.chain and collections.Counter, for performance.
import pandas as pd
from itertools import chain
from collections import Counter

pd.Series(
    Counter(chain.from_iterable(x.split('|') for x in df.Genre)))
Adventure 1
Animation 1
Children 1
Comedy 2
Fantasy 1
Romance 1
dtype: int64
Why do I think plain CPython functions beat pandas' "vectorised" string functions? Because string operations are inherently hard to vectorise. You can read more at For loops with pandas - When should I care?
If you have to deal with NaNs, you can call a function that handles exceptions gracefully:
def try_split(x):
    try:
        return x.split('|')
    except AttributeError:
        return []

pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))
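A quick sanity check on made-up data (the three-row demo frame below is hypothetical) shows that a NaN row is simply skipped:
import numpy as np
demo = pd.DataFrame({'Genre': ['Comedy|Romance', np.nan, 'Comedy']})
pd.Series(Counter(chain.from_iterable(try_split(x) for x in demo.Genre)))
# Comedy     2
# Romance    1
# dtype: int64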
Pandaically, you would do this with split, stack, and value_counts:
df['Genre'].str.split('|', expand=True).stack().value_counts()
Comedy 2
Romance 1
Children 1
Animation 1
Fantasy 1
Adventure 1
dtype: int64
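For what it's worth, an equivalent pandaic variant (a sketch assuming pandas 0.25+, where Series.explode is available) skips the wide intermediate frame:
df['Genre'].str.split('|').explode().value_counts()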
The timing difference is obvious even for tiny DataFrames.
%timeit df['Genre'].str.get_dummies(sep='|').sum()
%timeit df['Genre'].str.split('|', expand=True).stack().value_counts()
%%timeit
pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))
2.8 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
320 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I'm also in favor of using chain plus a for loop.
Just to document it, one more possible way is to use get_dummies:
df['Genre'].str.get_dummies(sep='|').sum()
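On the same two-row sample used above, this should produce roughly the following (get_dummies orders the columns alphabetically):
Adventure    1
Animation    1
Children     1
Comedy       2
Fantasy      1
Romance      1
dtype: int64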
I want to modify a single value in a DataFrame. The typical suggestion for doing this is to use df.at[] and reference the position as the index label and the column label, or to use df.iat[] and reference the position as the integer row and the integer column. But I want to reference the position as the integer row and the column label.
Assume this DataFrame:
                            apples oranges bananas
dateindex
2021-01-01 14:00:01.384624       1       X       3
2021-01-05 13:43:26.203773       4       5       6
2021-01-31 08:23:29.837238       7       8       9
2021-02-08 10:23:09.095632       0       1       2
import pandas as pd

data = [{'apples': 1, 'oranges': 'X', 'bananas': 3},
        {'apples': 4, 'oranges': 5, 'bananas': 6},
        {'apples': 7, 'oranges': 8, 'bananas': 9},
        {'apples': 0, 'oranges': 1, 'bananas': 2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
           pd.to_datetime('2021-01-05 13:43:26.203773'),
           pd.to_datetime('2021-01-31 08:23:29.837238'),
           pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
I want to change the value "X" to "2". I don't know the exact time; I just know that it's the first row. But I do know that I want to change the "oranges" column.
I want to do something like df.at[0,'oranges'], but I can't do that; I get a KeyError.
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Regarding:
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Yes, it is. And I agree, it is awkward. The old .ix indexer used to support these mixed label/positional cases better, but its behaviour depended on the dtype of the axis, which made it inconsistent. In the meantime...
The other options used in the other answers can all trigger the SettingWithCopy warning. It's not guaranteed to appear, but it might, depending on the indexing criteria and how the values are assigned.
Referencing Combining positional and label-based indexing and starting with this df, which has dateindex as the index:
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 X 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
Using both options:
with .loc or .at:
df.at[df.index[0], 'oranges'] = -50
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -50 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
with .iloc or .iat:
df.iat[0, df.columns.get_loc('oranges')] = -20
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -20 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
FWIW, I find approach #1 more consistent since it can handle multiple row indexes without changing the functions/methods used: df.loc[df.index[[0, 2]], 'oranges'] but approach #2 needs a different column indexer when there are multiple columns: df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])].
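A minimal sketch of that point, reusing the example df (the assigned values here are arbitrary):
# approach #1: the same .loc call works for one row or several
df.loc[df.index[[0, 2]], 'oranges'] = -50
# approach #2: rows stay positional, but columns need get_indexer once there is more than one
df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])] = -20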
Solution with Series.iat
If it doesn't seem more awkward to you, you can use the iat method of pandas Series:
df["oranges"].iat[0] = 2
Time performance comparison with other methods
As this method doesn't raise any warning, it can be interesting to compare its time performance with other proposed solutions.
%%timeit
df.at[df.index[0], 'oranges'] = 2
# > 9.91 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df.iat[0, df.columns.get_loc('oranges')] = 2
# > 13.5 µs ± 74.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df["oranges"].iat[0] = 2
# > 3.49 µs ± 16.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The pandas.Series.iat method seems to be the most performant one (I took the median of three runs).
Let's try again with huge DataFrames
With a DatetimeIndex
import numpy as np

# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
df_large.index = pd.date_range(start=0, periods=100000)
# 2070-01-01 to 2243-10-16, a bit unrealistic
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = -2
# > 10.1 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = -2
# > 13.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = -2
# > 3.31 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With a RangeIndex
# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = 2
# > 4.5 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = 2
# > 13.5 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = 2
# > 3.49 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Since this is simple indexing with essentially constant-time access, the size of the DataFrame doesn't change the results much, except for the "at + index" approach; strangely enough, it shows its worst performance on small DataFrames. Thanks to the author wfaulk for spotting that using a RangeIndex decreases the access time of the "at + index" method. With a DatetimeIndex, pd.Series.iat remains the fastest, and its timing stays constant.
You were actually quite close with your initial guess.
You would do it like this:
import pandas as pd

mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)
print(df)

# change the value of column a, row 2
df['a'][2] = 100

# print column a, row 2
print(df['a'][2])
There are lots of different variants such as loc and iloc, but this is one good method.
In the example, we discovered that loc was optimal, as df[][] throws an error:
import pandas as pd
data = [{'apples': 1, 'oranges': 'X', 'bananas': 3},
        {'apples': 4, 'oranges': 5, 'bananas': 6},
        {'apples': 7, 'oranges': 8, 'bananas': 9},
        {'apples': 0, 'oranges': 1, 'bananas': 2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
           pd.to_datetime('2021-01-05 13:43:26.203773'),
           pd.to_datetime('2021-01-31 08:23:29.837238'),
           pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
print(df)
df.loc['2021-01-01 14:00:01.384624','oranges'] = 10
# df['oranges'][0] = 10
print(df)
This works.
You can use the loc method. It receives the row and column you want to change.
Changing X to 2: df.loc[0, 'oranges'] = 2
See: pandas.DataFrame.loc
I have a very large knowledge graph in pandas dataframe format as follows.
This dataframe KG has more than 100 million rows:
pred subj obj
0 nationality BART USA
1 placeOfBirth BART NEWYORK
2 locatedIn NEWYORK USA
... ... ... ...
116390740 hasFather BART HOMMER
116390741 nationality HOMMER USA
116390743 placeOfBirth HOMMER NEWYORK
I tried to get the rows from this KG that have specific values for subj and obj.
a) I tried indexing into KG by generating a boolean series using isin() function:
KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])]
b) I also tried indexing the KG using query() function:
KG = KG.set_index(['subj','obj'], drop=True)
KG = KG.sort_index()
subj_substitution = ['BART', 'NEWYORK']
obj_substitution = ['USA', 'HOMMER']
KG.query(f"subj in {subj_substitution} & obj in {obj_substitution}")
c) And I also tried to join two DataFrames using a merge() as shown below.
subj_df
subj
0 BART
1 NEWYORK
obj_df
obj
0 USA
1 HOMMER
merge_result = pd.merge(KG, subj_df, on = ['subj']).drop_duplicates()
merge_result = pd.merge(merge_result, obj_df, on = ['obj']).drop_duplicates()
These methods result in the following:
pred subj obj
0 nationality BART USA
2 locatedIn NEWYORK USA
116390740 hasFather BART HOMMER
I used the timeit function to check the time for each as shown below.
timeit.timeit(lambda: KG[(KG['subj'].isin(['BART', 'NEWYORK']) & (KG['obj'].isin(['USA', 'HOMMER'])))] , number=10)
The runtimes were:
function   runtime
isin()     35.6 s
query()    155.2 s
merge()    288.9 s
I think isin() is the fastest way to index a very large DataFrame.
I would appreciate it if you could tell me a faster way than this.
I would personally go with isin or query with in.
Pandas doc says:
Performance of query()
DataFrame.query() using numexpr is slightly faster than Python for large frames.
Note: You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows.
Details about query can be found here
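DataFrame.query() also forwards an engine keyword to eval(), so the engine can be pinned explicitly; a quick sketch, assuming numexpr is installed:
KG.query("subj in ['BART', 'NEWYORK'] and obj in ['USA', 'HOMMER']", engine='numexpr')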
In your example, when I tested a KG DataFrame with shape (50331648, 3), i.e. 50M+ rows and 3 columns, using query and isin, the performance results were almost the same.
isin
%timeit KG[KG['subj'].isin(['BART', 'NEWYORK']) & KG['obj'].isin(['USA', 'HOMMER'])]
4.14 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
query with in operator
%timeit KG.query("(subj in ['BART', 'NEWYORK']) and (obj in ['USA', 'HOMMER'])")
4.08 s ± 82.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
query with isin
%timeit KG.query("(subj.isin(['BART', 'NEWYORK']))& (obj.isin(['USA', 'HOMMER']))")
4.99 s ± 210 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Test Data
d="""pred,subj,obj
nationality,BART,USA
placeOfBirth,BART,NEWYORK
locatedIn,NEWYORK,USA
hasFather,BART,HOMMER
nationality,HOMMER,USA
placeOfBirth,HOMMER,NEWYORK"""
from io import StringIO

KG = pd.read_csv(StringIO(d))
for i in range(23):
    KG = pd.concat([KG, KG])
KG.shape  # (50331648, 3)
If performance plus code readability (maintenance) is a concern, then at least for complex queries I would go with the query function.
Suppose I have a DataFrame "DS_df" containing strings and numbers. The three columns "LAultimateparentcountry", "borrowerultimateparentcountry" and "tot" form a relationship.
How can I create a dictionary out of those three columns (for the entire dataset, where order matters)? I would need to access the two countries as one variable and tot as another. I've tried the code below so far, but this merely yields me a list of separate items. For some reason, I am also not able to get .join to work, as the df is quite big (900k+ rows).
new_list = []
for i, row in DS_df.iterrows():
    new_list.append(row["LAultimateparentcountry"])
    new_list.append(row["borrowerultimateparentcountry"])
    new_list.append(row["tot"])
Preferred outcome would be a dictionary, where I could access "Germany_Switzerland": 56708 for example. Any help or advice is much appreciated.
Cheers
You can use a dict this way:
countries_map = {}
for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    countries_map[curr_rel] = row["tot"]
If you don't want to overwrite the values of existing keys (i.e., keep their first appearance):
countries_map = {}
for index, row in DS_df.iterrows():
    curr_rel = f'{row["LAultimateparentcountry"]}_{row["borrowerultimateparentcountry"]}'
    if curr_rel not in countries_map:
        countries_map[curr_rel] = row["tot"]
When performing operations on a DataFrame, it's always good to think of a column-wise solution rather than a row-wise one.
If your DataFrame has 900k+ rows, applying vectorized operations is likely the better option.
Below are two solutions:
Using pd.Series + to_dict():
pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
Using zip() + dict():
dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
Test DataFrame:
DS_df = pd.DataFrame({
    'LAultimateparentcountry': ['India', 'Germany', 'India'],
    'borrowerultimateparentcountry': ['France', 'Ireland', 'France'],
    'tot': [56708, 87902, 91211]
})
DS_df
LAultimateparentcountry borrowerultimateparentcountry tot
0 India France 56708
1 Germany Ireland 87902
2 India France 91211
Output of both solutions:
{'India_France': 91211, 'Germany_Ireland': 87902}
If the formed key has duplicates, the value is overwritten by the later row, so the last occurrence wins.
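If, as in the loop-based answer above, you would rather keep the first appearance of each key, a sketch of a vectorised variant is to take the first tot per key with groupby:
keys = DS_df['LAultimateparentcountry'].str.cat(DS_df['borrowerultimateparentcountry'], sep='_')
DS_df['tot'].groupby(keys).first().to_dict()
# {'Germany_Ireland': 87902, 'India_France': 56708}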
Which solution is more performant?
Short answer:
zip() + dict()          # if the rows are approx. below 1,000,000
pd.Series + to_dict()   # if the rows are approx. above 1,000,000
Long answer - Below are the tests:
Test with 30 rows and 3 columns
zip() + dict()
%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
297 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Series + to_dict():
%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
506 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Test with 6291456 rows and 3 columns
pd.Series + to_dict()
%timeit pd.Series(DS_df.tot.values, index=DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_")).to_dict()
3.92 s ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
zip + dict()
%timeit dict(zip(DS_df.LAultimateparentcountry.str.cat(DS_df.borrowerultimateparentcountry, sep="_"), DS_df.tot))
3.97 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have a tiny test dataset of student university applications in different majors.
It looks like this:
   student_id  gender      major  admitted
0       35377  female  Chemistry     False
1       56105    male    Physics      True
2       31441  female  Chemistry     False
3       51765    male    Physics      True
4       53714  female    Physics      True
The shape is (500, 4).
I need to get the admission rate for females and have solved it in three different ways. Each of them returns the correct result.
DONE
Using groupby
female_admitted_rate = df.groupby('gender').get_group('female')[df['admitted'] == True].count()/len(df.groupby('gender').get_group('female'))
[OUT]
student_id 0.287938
gender 0.287938
major 0.287938
admitted 0.287938
dtype: float64
Using plain pandas
len(df[(df['gender']=='female') & (df['admitted'])])/(len(df[df['gender']=='female']))
[Out] 0.28793774319066145
Using query
len(df.query("gender == 'female' & admitted"))/len(df.query("gender == 'female'"))
[Out] 0.28793774319066145
QUESTIONS
What would you use to get this information?
Is there a particular advantage to one of the shown approaches?
Is there an approach that makes absolutely no sense to you?
Is there a particular performance benefit to using one of the three above the others when it comes to big data sets?
I think you only need DataFrame.loc[] + Series.mean():
df.loc[df['gender'].eq('female'), 'admitted'].mean()
True is interpreted as 1 and False as 0 by Series.mean, so the mean of the boolean admitted column is exactly the admission rate.
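A tiny illustration of that point:
pd.Series([True, False, True, True]).mean()
# 0.75  (three of the four values are True)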
You can check with timeit
%%timeit
df.loc[df['gender'].eq('female'), 'admitted'].mean()
1.16 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
len(df[(df['gender']=='female') & (df['admitted'])])/(len(df[df['gender']=='female']))
3.45 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby('gender').get_group('female')[df['admitted'] == True].count()/len(df.groupby('gender').get_group('female'))
10.3 ms ± 718 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
len(df.query("gender == 'female' & admitted"))/len(df.query("gender == 'female'"))
11.1 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
These times are for the sample DataFrame; performance could vary greatly depending on the shape of your DataFrame. That said, I honestly believe the method I propose will be the fastest in most cases, in addition to providing clean and simple syntax.
I am using the pandas vectorized str.split() method to extract the first element returned from a split on "~". I also have also tried using df.apply() with a lambda and str.split() to produce equivalent results. When using %timeit, I'm finding that df.apply() is performing faster than the vectorized version.
Everything that I have read about vectorization seems to indicate that the first version should have better performance. Can someone please explain why I am getting these results? Example:
id facility
0 3466 abc~24353
1 4853 facility1~3.4.5.6
2 4582 53434_Facility~34432~cde
3 9972 facility2~FACILITY2~343
4 2356 Test~23 ~FAC1
The above dataframe has about 500,000 rows and I have also tested at around 1 million with similar results. Here is some example input and output:
Vectorization
In [1]: %timeit df['facility'] = df['facility'].str.split('~').str[0]
1.1 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Lambda Apply
In [2]: %timeit df['facility'] = df['facility'].astype(str).apply(lambda facility: facility.split('~')[0])
650 ms ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Does anyone know why I am getting this behavior?
Thanks!
Pandas string methods are only "vectorized" in the sense that you don't have to write the loop yourself. There isn't actually any parallelization going on, because string operations (especially regex ones) are inherently difficult, perhaps impossible, to parallelize. If you really want speed, you should fall back to plain Python here.
%timeit df['facility'].str.split('~', n=1).str[0]
%timeit [x.split('~', 1)[0] for x in df['facility'].tolist()]
411 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
132 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For more information on when loops are faster than pandas functions, take a look at For loops with pandas - When should I care?.
As for why apply is faster: I believe the function that apply is applying here (i.e., str.split on each value) is a lot more lightweight than the string-splitting machinery invoked in the bowels of Series.str.split.
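If the facility column can contain NaN, a guarded list comprehension (a sketch, assuming non-string values should pass through untouched) keeps the speed advantage while writing the result back:
df['facility'] = [x.split('~', 1)[0] if isinstance(x, str) else x
                  for x in df['facility'].tolist()]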