Select rows in a dataframe based on two criteria - python

Based on dataframe (1) below, I wish to create a dataframe (2) containing the rows where either y or z is equal to 2. Is there a way to do this conveniently?
And if I then want to create a dataframe (3) that contains only the rows that are in dataframe (1) but not in dataframe (2), how should I approach it?
id x y z
0 324 1 2
1 213 1 1
2 529 2 1
3 347 3 2
4 109 2 2
...

df[df[['y','z']].eq(2).any(axis=1)]
Out[1205]:
id x y z
0 0 324 1 2
2 2 529 2 1
3 3 347 3 2
4 4 109 2 2

You can create df2 easily enough using a condition:
df2 = df1[df1.y.eq(2) | df1.z.eq(2)]
df2
x y z
id
0 324 1 2
2 529 2 1
3 347 3 2
4 109 2 2
Given df2 and df1, you can perform a set difference operation on the index, like this:
df3 = df1.loc[df1.index.difference(df2.index)]
df3
x y z
id
1 213 1 1

You can do the following:
import pandas as pd
df = pd.read_csv('data.csv')
df2 = df[(df.y == 2) | (df.z == 2)]
print(df2)
Results:
id x y z
0 0 324 1 2
2 2 529 2 1
3 3 347 3 2
4 4 109 2 2
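For dataframe (3), the complement of the same mask gives the rows of dataframe (1) that are not in dataframe (2); a minimal sketch, reusing df from above:
df3 = df[~((df.y == 2) | (df.z == 2))]   # rows where neither y nor z equals 2
print(df3)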

Related

Pandas or other Python application to generate a column with 1 to n value based on other two columns with rules

I hope I can explain the question properly.
In basic terms, imagine the df below:
print(df)
year id
1 16100
1 150
1 150
2 66
2 370
2 370
2 530
3 41
3 43
3 61
I need df.seq to be a value cycling from 1 to n while the year rows are identical, restarting when the year changes.
df.seq2 should stay at n, instead of moving to n+1, if the row above has an identical id value.
So in Excel-like formula terms it would be something like
df.seq2 = IF(A2=A1,IF(B2=B1,F1,F1+1),1)
which would make the desired output seq and seq2 below:
year id seq seq2
1 16100 1 1
1 150 2 2
1 150 3 2
2 66 1 1
2 370 2 2
2 370 3 2
2 530 4 3
3 41 1 1
3 43 2 2
3 61 3 3
I did test a couple of things, like (assuming I've already generated df.seq):
comb_df['match'] = comb_df.year.eq(comb_df.year.shift())
comb_df['match2'] = comb_df.id.eq(comb_df.id.shift())
comb_df["seq2"] = np.where((comb_df["match"].shift(+1) == True) & (comb_df["match2"].shift(+1) == True), comb_df["seq"] - 1, comb_df["seq2"])
But the problem is this doesn't really work out if there are multiple duplicates in a row, etc.
Perhaps the issue can't be resolved in a purely numpy/vectorized way and I'd have to iterate over the rows?
There are 2-3 million rows, so performance will be an issue if the solution is very slow.
I would need to generate both df.seq and df.seq2.
Any ideas would be extremely helpful!
We can do this with groupby using cumcount and factorize:
df['seq'] = df.groupby('year').cumcount()+1
df['seq2'] = df.groupby('year')['id'].transform(lambda x : x.factorize()[0]+1)
df
Out[852]:
year id seq seq2
0 1 16100 1 1
1 1 150 2 2
2 1 150 3 2
3 2 66 1 1
4 2 370 2 2
5 2 370 3 2
6 2 530 4 3
7 3 41 1 1
8 3 43 2 2
9 3 61 3 3
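One caveat: factorize numbers ids by first appearance within each year, while the Excel formula increments whenever the id changes, so the two differ if an earlier id reappears non-adjacently. A sketch that matches the formula exactly, using shift and cumsum:
df['seq2'] = df.groupby('year')['id'].transform(lambda x: x.ne(x.shift()).cumsum())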

Convert a column with list values into separate rows, grouped by specific columns

I'm trying to convert a column whose values are lists into separate rows, grouped by specific columns.
That's the dataframe I have:
id rooms bathrooms facilities
111 1 2 [2, 3, 4]
222 2 3 [4, 5, 6]
333 2 1 [2, 3, 4]
That's the dataframe I need:
id rooms bathrooms facility
111 1 2 2
111 1 2 3
111 1 2 4
222 2 3 4
222 2 3 5
222 2 3 6
333 2 1 2
333 2 1 3
333 2 1 4
I tried converting the facilities column to a list first:
facilities = pd.DataFrame(df.facilities.tolist())
and later joining it back by columns, following another suggested solution:
df[['id', 'rooms', 'bathrooms']].join(facilities).melt(id_vars=['id', 'rooms', 'bathrooms']).drop('variable', 1)
Unfortunately, it didn't work for me.
Is there another solution?
Thanks in advance!
You need explode:
df.explode('facilities')
# id rooms bathrooms facilities
#0 111 1 2 2
#0 111 1 2 3
#0 111 1 2 4
#1 222 2 3 4
#1 222 2 3 5
#1 222 2 3 6
#2 333 2 1 2
#2 333 2 1 3
#2 333 2 1 4
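Note that explode repeats the original index for each unpacked row; on pandas 1.1+ you can pass ignore_index=True (or chain reset_index) to get a fresh 0..n-1 index:
df.explode('facilities', ignore_index=True)
# equivalent on older versions:
df.explode('facilities').reset_index(drop=True)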
It is a bit awkward to have lists as values in a dataframe, so one way to work around this is to unpack the lists, store each element in its own column, and then use the melt function. Note this assumes every list has the same length (three here).
import pandas as pd

# recreate your data
d = {"id": [111, 222, 333],
     "rooms": [1, 2, 2],
     "bathrooms": [2, 3, 1],
     "facilities": [[2, 3, 4], [4, 5, 6], [2, 3, 4]]}
df = pd.DataFrame(d)

# unpack the lists (assumes each list has exactly three elements)
f0, f1, f2 = [], [], []
for row in df.itertuples():
    f0.append(row.facilities[0])
    f1.append(row.facilities[1])
    f2.append(row.facilities[2])
df["f0"] = f0
df["f1"] = f1
df["f2"] = f2

# melt the wide columns back into rows
df = pd.melt(df, id_vars=['id', 'rooms', 'bathrooms'],
             value_vars=["f0", "f1", "f2"], value_name="facilities")

# optionally sort the values and drop the "variable" column
df.sort_values(by=['id'], inplace=True)
df = df[['id', 'rooms', 'bathrooms', 'facilities']]
I think that should get you the dataframe you need.
id rooms bathrooms facilities
0 111 1 2 2
3 111 1 2 3
6 111 1 2 4
1 222 2 3 4
4 222 2 3 5
7 222 2 3 6
2 333 2 1 2
5 333 2 1 3
8 333 2 1 4
The following will give the desired output ("df" is the input dataframe and "df_final" the desired dataframe):
def changeDf(x):
    df_m = pd.DataFrame(columns=['id', 'rooms', 'bathrooms', 'facilities'])
    for index, fc in enumerate(x['facilities']):
        df_m.loc[index] = [x['id'], x['rooms'], x['bathrooms'], fc]
    return df_m

df_modified = df.apply(changeDf, axis=1)
df_final = pd.concat([i for i in df_modified])
print(df_final)
Try this (it assumes all the facilities lists have the same length, since np.array(...).ravel() only flattens a rectangular array):
import numpy as np

reps = [len(x) for x in df.facilities]          # number of facilities per row
facilities = pd.Series(np.array(df.facilities.tolist()).ravel())
df = df.loc[df.index.repeat(reps)].reset_index(drop=True)
df.facilities = facilities
df
id rooms bathrooms facilities
0 111 1 2 2
1 111 1 2 3
2 111 1 2 4
3 222 2 3 4
4 222 2 3 5
5 222 2 3 6
6 333 2 1 2
7 333 2 1 3
8 333 2 1 4
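If the lists can have different lengths, the rectangular-array trick above breaks; here is a sketch using np.concatenate instead, starting again from the original df:
reps = df.facilities.str.len()                   # per-row list lengths
out = df.loc[df.index.repeat(reps)].reset_index(drop=True)
out.facilities = np.concatenate(df.facilities.to_list())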

Join to dataframe on X.n = Y.n + 1

I have a data frame X with data like:
n val
------------
1 4
2 3
3 0
and another, Y, with the same columns, like:
n val
------------
1 288
2 12
3 130
4 1230
How can I create an additional column in X containing Y's val for the following n (i.e. n + 1)?
Expected output is:
n val val2
------------
1 4 12
2 3 130
3 0 1230
Apologies, as I'm sure this has been asked before; I'm just having trouble finding it, and I can't figure it out using join or merge, since those seem to take only column names as inputs.
We can do this with merge, shifting Y's n down by one first (here df1 is X and df2 is Y):
df = df1.merge(df2.assign(n=df2.n - 1), on='n')
n val_x val_y
0 1 4.0 12
1 2 3.0 130
2 3 0.0 1230
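A variant of the same idea that yields the val/val2 names from the expected output, assuming the same df1/df2:
df3 = df1.merge(df2.rename(columns={'val': 'val2'}).assign(n=lambda d: d.n - 1), on='n')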

Adding entries in dataframe for reverse index order

I have to make a bar plot of data from a multiindex pandas dataframe. This dataframe has the following structure:
value
1 2 25
3 96
4 -12
...
2 3 -25
4 -30
...
3 4 541
5 396
6 14
...
Note that there is a value for index entry (1,2) but no value for (2,1). There is always an index entry (x,y) with y > x, and I'd like to create an entry (y,x) with the same value for every entry (x,y). Basically, I'd like to make my dataframe matrix symmetric. I've tried switching the levels of the indexes and then concatenating the results into a new dataframe, but I can't obtain the result I want. Maybe I could do it with a for loop, but I'm pretty sure there is a better way. Do you know how to do this efficiently?
Try using pd.concat and swaplevel:
pd.concat([df, df.swaplevel(0,1)])
Output:
value
x y
1 2 25
3 96
4 -12
2 3 -25
4 -30
3 4 541
5 396
6 14
2 1 25
3 1 96
4 1 -12
3 2 -25
4 2 -30
4 3 541
5 3 396
6 3 14
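The swapped rows land at the bottom of the concatenated frame; if you want them interleaved in index order, sort afterwards:
pd.concat([df, df.swaplevel(0, 1)]).sort_index()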
You can unstack, transpose, stack again, and concat to the original series:
new_df = pd.concat( (df.value, df.value.unstack(level=1).T.stack()))
Toy data:
idx = [(a,b) for b in range(1,4) for a in range(1, b)]
idx = pd.MultiIndex.from_tuples(idx)
np.random.seed(10)
df = pd.DataFrame({'value': np.random.randint(-100,100, len(idx))}, index=idx)
df.sort_index(inplace=True)
# df:
# value
# 1 2 -91
# 3 25
# 2 3 -85
Output (new_df):
1 2 -91.0
3 25.0
2 3 -85.0
1 -91.0
3 1 25.0
2 -85.0
dtype: float64
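Note the values come back as float because unstack introduces NaNs for the missing combinations; if integer dtype is wanted, cast at the end (safe here, since stack drops the NaNs):
new_df = pd.concat((df.value, df.value.unstack(level=1).T.stack())).astype(int)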

Python Pandas operate on row

Hi, my dataframe looks like:
Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264
Simply, I need to add another column called '_id' as the concatenation of Store, Dept, and Date, like "1_1_2010-02-05". I assumed I could do it through df['id'] = df['Store'] +'' +df['Dept'] +'_'+df['Date'], but it turned out not to work.
Similarly, I also need to add a new column with the log of Sales. I tried df['logSales'] = math.log(df['Sales']); again, it did not work.
You can first convert the integer columns to strings before concatenating with +:
In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']
In [26]: df
Out[26]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
For the log, it's better to use the numpy function, which is vectorized (math.log only works on single scalar values):
In [34]: df['logSales'] = np.log(df['Sales'])
In [35]: df
Out[35]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
Summarizing the comments: for a dataframe of this size, using apply will not differ much in performance from using vectorized functions (which work on the full column), but when your real dataframe becomes larger, it will.
Apart from that, I think the solution above also has simpler syntax.
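Another vectorized way to build the id column is pandas' Series.str.cat, which concatenates several string columns with a separator; a sketch, assuming the same frame:
df['id'] = df['Store'].astype(str).str.cat([df['Dept'].astype(str), df['Date']], sep='_')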
In [153]:
import pandas as pd
import io
temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
Store Dept Date Sales
0 1 1 2010-02-05 245
1 1 1 2010-02-12 449
2 1 1 2010-02-19 455
3 1 1 2010-02-26 154
4 1 1 2010-03-05 29
5 1 1 2010-03-12 239
6 1 1 2010-03-19 264
[7 rows x 4 columns]
In [154]:
# apply a lambda function row-wise; Store and Dept must be converted to strings to build the new string
df['id'] = df.apply(lambda x: str(x['Store']) + '_' + str(x['Dept']) + '_' + x['Date'], axis=1)
df
Out[154]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
[7 rows x 5 columns]
In [155]:
import math
# now apply log to sales to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
[7 rows x 6 columns]
