Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have an index set (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but only return those rows that meet multiple criteria defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to select out using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully, you can see that this would be a really useful 'template' for operating on large DataFrames, since the boolean indexing can then be defined in the boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of the df columns with text. The definition of df using random numbers was simply given as a starter, if required for testing.
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
   Col001  Col002  Col003  Col004  Col005
0       2       0       2       3       1
1       0       1       0       1       3
2       0       1       1       0       3
3       3       1       0       2       1
4       1       2       3       1       0
Unlike your example, all columns are filled with the ints 0, 1, 2 or 3.
Let's define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want the req columns for those rows where the filt columns all contain the value 2.
You can then get the result with:
# list(...) is needed on Python 3, where keys() and values() return views, not lists
df.loc[df[list(filt)].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Let's check the filter columns for those rows:
df[filt.keys()].iloc[[43,98]]
Col005 Col001 Col002
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[filt.keys()].iloc[[44,99]]
Col005 Col001 Col002
44 3 0 3
99 1 0 0
I'm starting to like Pandas more and more.
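The row-wise apply above can be slow on large frames. As a hedged sketch (not the answerer's method), the same mask can be built from the dict in a vectorized way, either by AND-ing one comparison per entry or by comparing the filter columns against a Series made from the dict. The seed and the variable names mask and mask2 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Example frame in the same shape as above (fixed seed only for reproducibility)
rng = np.random.default_rng(0)
cols = ["Col%03d" % (i + 1) for i in range(50)]
df = pd.DataFrame(rng.integers(0, 4, (100, 50)), columns=cols)

req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}

# Option 1: one boolean Series per dict entry, combined with a logical AND
mask = np.logical_and.reduce([df[c] == v for c, v in filt.items()])

# Option 2: compare the filter columns against a Series built from the dict;
# the Series index is aligned with the column labels
mask2 = (df[list(filt)] == pd.Series(filt)).all(axis=1)

result = df.loc[mask, req]
print(result)
```

Both options also work when the dict values are strings, as in the original question.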
Consider the following toy code that performs a simplified version of my actual question:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,2,3,4,5],
'some column': [0,1,2,3,4],
}
)
df = df.set_index(['n_event'])
print(df)
resampled_df = df.sample(frac=1, replace=True)
print(resampled_df)
The resampled_df is, as its name suggests, a resampled version of the original one (with replacement). This is exactly what I want. An example output of the previous code is
some column
n_event
1 0
2 1
3 2
4 3
5 4
some column
n_event
4 3
1 0
4 3
4 3
2 1
Now for my actual question I have the following dataframe:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,1,2,2,3,3,4,4,5,5],
'n_channel': [1,2,1,2,1,2,1,2,1,2],
'some column': [0,1,2,3,4,5,6,7,8,9],
}
)
df = df.set_index(['n_event','n_channel'])
print(df)
which looks like
                   some column
n_event n_channel
1       1                    0
        2                    1
2       1                    2
        2                    3
3       1                    4
        2                    5
4       1                    6
        2                    7
5       1                    8
        2                    9
I want to do exactly the same as before, resample with replacements, but treating each group of rows with the same n_event as a single entity. A hand-built example of what I want to do can look like this:
                   some column
n_event n_channel
2       1                    2
        2                    3
2       1                    2
        2                    3
3       1                    4
        2                    5
1       1                    0
        2                    1
5       1                    8
        2                    9
As seen, each n_event was treated as a whole and the rows within each event were not mixed up.
How can I do this without proceeding by brute force (i.e. without for loops, etc)?
I have tried df.sample(frac=1, replace=True, ignore_index=False) and a few things using groupby, without success.
Would a pivot()/melt() sequence work for you?
Use pivot() to go from long to wide (make each group a single row).
Do the sampling.
Then back from wide to long using melt().
Don't have time to work out a full answer but thought I would get this idea to you in case it might help you.
Following the suggestion of jch I was able to find a solution by combining pivot and stack:
import pandas
df = pandas.DataFrame(
{
'n_event': [1,1,2,2,3,3,4,4,5,5],
'n_channel': [1,2,1,2,1,2,1,2,1,2],
'some column': [0,1,2,3,4,5,6,7,8,9],
'other col': [5,6,4,3,2,5,2,6,8,7],
}
)
resampled_df = df.pivot(
index = 'n_event',
columns = 'n_channel',
# a list rather than a set: recent pandas versions reject sets as indexers
values = [c for c in df.columns if c not in ('n_event', 'n_channel')],
)
resampled_df = resampled_df.sample(frac=1, replace=True)
resampled_df = resampled_df.stack()
print(resampled_df)
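An alternative that avoids reshaping is to sample the event ids themselves and then pull in the rows of each sampled event with a merge. This is only a sketch under assumptions: the draw column, the fixed seed and the variable names are illustrative and not part of the original answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'n_event': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        'n_channel': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        'some column': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    }
)

rng = np.random.default_rng(0)  # fixed seed only for reproducibility
events = df['n_event'].unique()

# One row per draw; the same event may be drawn several times
draws = pd.DataFrame({
    'draw': np.arange(len(events)),
    'n_event': rng.choice(events, size=len(events), replace=True),
})

# The merge attaches every channel row of an event to each draw of that event
resampled_df = (
    draws.merge(df, on='n_event', how='left')
         .set_index(['draw', 'n_event', 'n_channel'])
)
print(resampled_df)
```

The extra draw level keeps repeated events distinguishable; drop it with resampled_df.droplevel('draw') if it is not needed.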
I have the df below, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise, please? I tried reindex and custom sorting but can't find a way.
Thanks!
here is the table
Since you can't insert the row and push the others back directly, a trick you can use is to create a new sort order:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
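The same reordering can also be sketched with reindex, by listing the wanted label first and then every other label in its existing order. This assumes the week labels are unique and stored as strings (the question says the index dtype is object); the names first and new_order are illustrative.

```python
import pandas as pd

# Small frame mirroring the "Before" table, with string week labels
df = pd.DataFrame(
    {"numbers": [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
    index=pd.Index(["1", "2", "3", "4", "53"], name="weeks"),
)

first = "53"  # label to move to the top; use 53 if the index holds ints
new_order = [first] + [w for w in df.index if w != first]

df_reordered = df.reindex(new_order)
print(df_reordered)
```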
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
# move the last index label to the front and keep the labels (no reset_index)
new_df = df.loc[[df.index[-1]] + list(df.index[:-1])]
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
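Alternate method, if the week is stored in a column (here "Year week") rather than in the index; replace 52 with the week you want to move to the top: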
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])
I have 2 DataFrames, each with a column whose values are sets of 8-digit integers.
df1 (contains around 200k rows)
id s1
0 0 {43649632, 95799329, 40649644, 23335890, 81779...
1 1 {69900026, 74441229}
2 2 {85195648, 55750338, 98936902, 82000264, 43544...
3 3 {21916700, 13627806}
4 4 {62929026, 38592365, 44179790, 38355127}
df2 (contains around 900k rows)
id s1
0 0 {58209736, 25405713, 28691898, 94682562}
1 1 {81089732, 82343077}
2 2 {59692896, 33234306, 40445479, 18728345, 24464...
3 3 {71406042, 69900026, 74441229}
4 4 {62929026}
I want to know the FASTEST way to find the pairs of ids from df1 and df2 that match ONE OF THESE conditions:
df1.s1 is a subset of df2.s1
OR
df2.s1 is a subset of df1.s1
For example,
id=1 of df1 is a subset of id=3 of df2, so (1, 3) is a valid pair
id=4 of df2 is a subset of id=4 of df1, so (4, 4) is a valid pair
I have tried this code below but it's going to take about 20 hours:
id_pairs = []
for i in tqdm(list(df2.itertuples(index=False))):
    for j in df1.itertuples(index=False):
        if i.s1.issubset(j.s1) or j.s1.issubset(i.s1):
            id_pairs.append((i.id, j.id))
Is there a faster or more efficient way to do this?
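You could do a cartesian join and then apply the condition. Note that with about 200k and 900k rows the full cross product has roughly 1.8e11 rows, so this only works as written on small frames (or chunk by chunk):
# rename so that the columns of the two frames stay distinguishable after the merge
left = df1.rename(columns={"id": "id1"}).assign(key=0)
right = df2.rename(columns={"id": "id2", "s1": "s2"}).assign(key=0)
merged = left.merge(right, how="outer", on="key")
def subset(row):
    if row.s1.issubset(row.s2) or row.s2.issubset(row.s1):
        return (row.id1, row.id2)
    else:
        return None
merged.apply(subset, axis=1).dropna()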
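For frames of the size in the question, one option that avoids comparing every pair is an inverted index: map each integer to the ids of the sets that contain it, then intersect those id sets to find all supersets of a given set. The following is a minimal sketch of that idea, not a benchmarked solution; the function name subset_pairs is hypothetical, empty sets are skipped by assumption, and the sample rows are the untruncated ones from the question.

```python
from collections import defaultdict

import pandas as pd


def subset_pairs(small, big):
    """Yield (small_id, big_id) whenever a set in `small` is contained in a set in `big`."""
    # Inverted index: element -> ids of the sets in `big` that contain it
    index = defaultdict(set)
    for big_id, s in zip(big["id"].tolist(), big["s1"]):
        for element in s:
            index[element].add(big_id)

    for small_id, s in zip(small["id"].tolist(), small["s1"]):
        if not s:
            continue  # assumption: empty sets are ignored
        # Supersets must contain every element, so intersect the id sets
        # (smallest first, so the candidate set shrinks quickly)
        postings = sorted((index.get(e, set()) for e in s), key=len)
        for big_id in postings[0].intersection(*postings[1:]):
            yield small_id, big_id


df1 = pd.DataFrame({
    "id": [1, 3, 4],
    "s1": [{69900026, 74441229},
           {21916700, 13627806},
           {62929026, 38592365, 44179790, 38355127}],
})
df2 = pd.DataFrame({
    "id": [0, 1, 3, 4],
    "s1": [{58209736, 25405713, 28691898, 94682562},
           {81089732, 82343077},
           {71406042, 69900026, 74441229},
           {62929026}],
})

# Pairs are stored as (df1 id, df2 id) for both directions
pairs = set(subset_pairs(df1, df2))                    # df1.s1 within df2.s1
pairs |= {(a, b) for b, a in subset_pairs(df2, df1)}   # df2.s1 within df1.s1
print(sorted(pairs))
```

The cost depends on how many sets share each integer rather than on the product of the two row counts, so how much it helps depends on the data.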
I am using dask dataframe.groupby().apply()
and get a dask series as a return value.
I am mapping each group to a list of triplets such as (a,b,1), and then wish to turn all the triplets into a single dask data frame.
I am using this code in the end of the mapping function to return the triplets as a dask df
# assume here that trip is a generator of triplets such as you would produce from itertools.product([l1,l2,l3])
trip = list(itertools.chain.from_iterable(trip))
df = pd.DataFrame.from_records(trip)
return dd.from_pandas(df,npartitions=1)
Then I try to combine the results with dd.concat, the dask counterpart of pandas concat.
Assume the result of the apply function is the variable result.
I am trying to use
import dask.dataframe as dd
dd.concat(result, axis=0)
and get the error
raise TypeError("dfs must be a list of DataFrames/Series objects")
TypeError: dfs must be a list of DataFrames/Series objects
But when I check for the type of result using
print(type(result))
I get
output: class 'dask.dataframe.core.Series'
What is the proper way to apply a function over groups of dask groupby object and get all the results into one dataframe?
Thanks
edit:--------------------------------------------------------------
in order to produce the use case, assume this fake data generation
import random
import pandas as pd
import dask.dataframe as dd
people = [[random.randint(1,3), random.randint(1,3), random.randint(1,3)] for i in range(1000)]
ddf = dd.from_pandas(pd.DataFrame.from_records(people, columns=["first name", "last name", "cars"]), npartitions=1)
Now my mission is to group people by first and last name (e.g. all the people with the same first name and the same last name), and then I need to get a new dask data frame which will contain how many cars each group had.
Assume that the apply function can return either a series of lists of tuples e.g [(name,name,cars count),(name,name,cars count)] or a data frame with the same columns - name, name, car count.
Yes, I know that this particular use case can be solved in another way, but please trust me, my use case is more complex. I cannot share the data and cannot generate any similar data, so let's use dummy data :-)
The challenge is to connect all the results of the apply into a single dask data frame (pandas data frame will be a problem here, data will not fit in memory - so transitions via a pandas data frame will be a problem)
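This works for me if the output of apply is a pandas DataFrame; at the end you can convert the result to a dask DataFrame if necessary:
def f(x):
    # dummy triplets for one group
    trip = ((1, 2, i) for i in range(3))
    # column names must match the meta passed to apply below
    df = pd.DataFrame.from_records(trip, columns=['x', 'y', 'z'])
    return df
df1 = ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()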
#only for remove MultiIndex
df1 = df1.reset_index()
print (df1)
cars level_1 x y z
0 1 0 1 2 0
1 1 1 1 2 1
2 1 2 1 2 2
3 2 0 1 2 0
4 2 1 1 2 1
5 2 2 1 2 2
6 3 0 1 2 0
7 3 1 1 2 1
8 3 2 1 2 2
ddf1 = dd.from_pandas(df1,npartitions=1)
print (ddf1)
cars level_1 x y z
npartitions=1
0 int64 int64 int64 int64 int64
8 ... ... ... ... ...
Dask Name: from_pandas, 1 tasks
EDIT:
import numpy as np
import dask.array as da
L = []
def f(x):
    trip = ((1, 2, i) for i in range(3))
    # append each group's triplets as a dask array
    L.append(da.from_array(np.array(list(trip)), chunks=(1, 3)))
ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
dar = da.concatenate(L, axis=0)
print (dar)
dask.array<concatenate, shape=(12, 3), dtype=int32, chunksize=(1, 3)>
For your edit:
In [8]: ddf.groupby(['first name', 'last name']).cars.count().compute()
Out[8]:
first name last name
1 1 107
2 107
3 110
2 1 117
2 120
3 99
3 1 119
2 103
3 118
Name: cars, dtype: int64
This is my pandas DataFrame with original column names.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
1 3 0 0
2 1 1 5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column for each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
1 3 0 0 2 0
2 1 1 5 2 1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How should I proceed with steps 2 and 3 so that the solution is automatic? I don't want to manually define the column names cm1 and cm2, because in the original data set I might have many cm variations.
You can use:
print(df)
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
0 1 3 0 0
1 2 1 1 5
First you can keep only the columns whose names contain the string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can rename the columns to the new values like cm1, cm2, cm3.
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
cm1 cm1 cm2
0 1 3 0
1 2 1 1
Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum, since True is treated as 1 and False as 0. You need the counts per unique column name, so group by the column names and sum the values.
df1 = df1.astype(bool)
print(df1)
cm1 cm1 cm2
0 True True False
1 True True True
# groupby(..., axis=1) is deprecated in recent pandas, so group the transposed frame instead
print(df1.T.groupby(level=0).sum().T)
cm1 cm2
0 2 0
1 2 1
You need the unique column names, which are added to the original df:
print(df1.columns.unique())
['cm1' 'cm2']
Finally, you can add the new columns (equivalent to df[['cm1','cm2']]) from the groupby result:
df[df1.columns.unique()] = df1.T.groupby(level=0).sum().T
print(df)
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
0 1 3 0 0 2 0
1 2 1 1 5 2 1
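The steps above can be condensed into a few lines. This is a sketch that assumes every cm label has the form cm followed by digits (the regex cm\d+ is my assumption, not something stated in the question); the names labels, cm_cols and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'old_gt': [0, 5],
})

# Extract the cm label from each column name (NaN where there is none)
labels = df.columns.str.extract(r'(cm\d+)', expand=False)
cm_cols = df.columns[labels.notna()]

# 1 where the original value is non-zero, then sum per cm label
nonzero = (df[cm_cols] != 0).astype(int)
counts = nonzero.T.groupby(labels.dropna().to_numpy()).sum().T

result = df.join(counts)
print(result)
```

Unlike taking a single character after cm, the regex keeps multi-digit labels such as cm10 intact.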
Once you know which columns have cm in them (the list ind from the question), you can map them with a dict to the desired new column:
col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard coding this in you might as well use 2
so that each column maps to cm plus the single character directly following it (note that a two-digit suffix such as cm10 would be cut to cm1). In this case it would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col,new_col in col_map.items():
    if new_col not in df:
        df[new_col] =[int(a!=0) for a in df[col]]
    else:
        df[new_col]+=[int(a!=0) for a in df[col]]
Note that int(a!=0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that dicts do not guarantee a sorted order (and were entirely unordered before Python 3.7), so it may be preferable to add the new columns in order according to the values:
import operator
for col,new_col in sorted(col_map.items(),key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col]+=[int(a!=0) for a in df[col]]
    else:
        df[new_col] =[int(a!=0) for a in df[col]]
to ensure the new columns are inserted in order.