Get a new dataframe by multiple conditions with pd.DataFrame.isin() - python

I am trying to write a function that takes a df and a dictionary mapping columns to lists of values. The function slices rows (indexes) so that it returns only the rows whose values match the 'criteria' values in every listed column.
For example:
df_isr13 = filterby_criteria(df, {"Area": ["USA"], "Year": [2013]})
Only rows with "Year" = 2013 and "Area" = "USA" are included in the output.
I tried:
def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        return df[df[key].isin(values)]
but I get only the first criterion, because the function returns on the first loop iteration.
How can I get a new dataframe that applies all the criteria with pd.DataFrame.isin()?

You can use a for loop and add every criterion with the pandas merge function:
def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        df = pd.merge(df[df[key].isin(values)], df, how='inner')
    return df
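For comparison, the criteria can also be combined without merge by narrowing the frame inside the loop, one isin() mask per criterion (a minimal sketch with a made-up two-column frame):

```python
import pandas as pd

# A minimal sketch: narrow the frame with one isin() mask per criterion,
# returning only after the loop has applied them all
def filterby_criteria(df, criteria):
    for key, values in criteria.items():
        df = df[df[key].isin(values)]
    return df

df = pd.DataFrame({'Area': ['USA', 'USA', 'IL'], 'Year': [2013, 2014, 2013]})
out = filterby_criteria(df, {'Area': ['USA'], 'Year': [2013]})
print(out)  # one row: Area USA, Year 2013
```

Moving the return outside the loop is the key difference from the attempt in the question.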

Consider a simple merge of two data frames, since by default merge joins on all matching column names:
from itertools import product
import pandas as pd

def filterby_criteria(df, criteria):
    # EXTRACT DICT ITEMS
    k, v = criteria.keys(), criteria.values()

    # BUILD DF OF ALL POSSIBLE MATCHES
    all_matches = (pd.DataFrame(product(*v))
                     .set_axis(list(k), axis='columns', inplace=False)
                  )

    # RETURN MERGED DF
    return df.merge(all_matches)
To demonstrate with random, seeded data:
Data
import numpy as np
import pandas as pd

np.random.seed(61219)

tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
years = list(range(2013, 2019))

random_df = pd.DataFrame({'Tool': np.random.choice(tools, 500),
                          'Int': np.random.randint(1, 10, 500),
                          'Num': np.random.uniform(1, 100, 500),
                          'Year': np.random.choice(years, 500)})

print(random_df.head(10))
# Tool Int Num Year
# 0 spss 4 96.465327 2016
# 1 sas 7 23.455771 2016
# 2 r 5 87.349825 2014
# 3 julia 4 18.214028 2017
# 4 julia 7 17.977237 2016
# 5 stata 3 41.196579 2013
# 6 stata 8 84.943676 2014
# 7 python 4 60.576030 2017
# 8 spss 4 47.024075 2018
# 9 stata 3 87.271072 2017
Function call
criteria = {"Tool": ["python", "r"], "Year": [2013, 2015]}

def filterby_criteria(df, criteria):
    k, v = criteria.keys(), criteria.values()
    all_matches = (pd.DataFrame(product(*v))
                     .set_axis(list(k), axis='columns', inplace=False)
                  )
    return df.merge(all_matches)

final_df = filterby_criteria(random_df, criteria)
Output
print(final_df)
# Tool Int Num Year
# 0 python 8 96.611384 2015
# 1 python 7 66.782828 2015
# 2 python 9 73.638629 2015
# 3 python 4 70.763264 2015
# 4 python 2 28.311917 2015
# 5 python 3 69.888967 2015
# 6 python 8 97.609694 2015
# 7 python 3 59.198276 2015
# 8 python 3 64.497017 2015
# 9 python 8 87.672138 2015
# 10 python 9 33.605467 2015
# 11 python 8 25.225665 2015
# 12 r 3 72.202364 2013
# 13 r 1 62.192478 2013
# 14 r 7 39.264766 2013
# 15 r 3 14.599786 2013
# 16 r 4 22.963723 2013
# 17 r 1 97.647922 2013
# 18 r 5 60.457344 2013
# 19 r 5 15.711207 2013
# 20 r 7 80.273330 2013
# 21 r 7 74.190107 2013
# 22 r 7 37.923396 2013
# 23 r 2 91.970678 2013
# 24 r 4 31.489810 2013
# 25 r 1 37.580665 2013
# 26 r 2 9.686955 2013
# 27 r 6 56.238919 2013
# 28 r 6 72.820625 2015
# 29 r 3 61.255351 2015
# 30 r 4 45.690621 2015
# 31 r 5 71.143601 2015
# 32 r 6 54.744846 2015
# 33 r 1 68.171978 2015
# 34 r 5 8.521637 2015
# 35 r 7 87.027681 2015
# 36 r 3 93.614377 2015
# 37 r 7 37.918881 2015
# 38 r 3 7.715963 2015
# 39 python 1 42.681928 2013
# 40 python 6 57.354726 2013
# 41 python 1 48.189897 2013
# 42 python 4 12.201131 2013
# 43 python 9 1.078999 2013
# 44 python 9 75.615457 2013
# 45 python 8 12.631277 2013
# 46 python 9 82.227578 2013
# 47 python 7 97.802213 2013
# 48 python 1 57.103964 2013
# 49 python 1 1.941839 2013
# 50 python 3 81.981437 2013
# 51 python 1 56.869551 2013

Related

Importing CSV with time losing format

I am trying to import a CSV with a time column, but it does not load in the correct format. How can I fix the format of the imported times?
import pandas as pd

# load the data from the CSV file
data = pd.read_csv('H_data - Copy.csv')
print(data.head(5))
Result:
0 0 24:39.5
1 1 25:20.4
2 2 25:56.1
3 3 26:36.1
4 4 27:21.0
The CSV looks like this when I copy and paste it (not sure how to upload it here):
time
0 24:39.5
1 25:20.4
2 25:56.1
3 26:36.1
4 27:21.0
5 27:57.1
6 28:34.2
7 29:11.0
8 29:47.6
9 30:27.4
10 31:06.6
11 31:46.9
12 32:22.9
13 32:58.4
14 33:30.3
15 34:13.2
16 34:51.8
17 35:32.8
18 36:04.5
19 36:46.4
20 37:27.0
21 37:58.2
22 38:43.1
23 39:23.5
24 39:54.6
25 40:39.5
26 41:15.1
27 41:55.6
28 42:27.8
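The MM:SS.s values load as plain strings. One fix (an assumption, since the question does not state the desired target type) is to prefix each value with "00:" so it parses as HH:MM:SS.s and convert the column to Timedelta; the CSV below is a small stand-in for 'H_data - Copy.csv':

```python
import io
import pandas as pd

# Stand-in for 'H_data - Copy.csv' with the same MM:SS.s strings
csv = io.StringIO("time\n24:39.5\n25:20.4\n25:56.1\n")
data = pd.read_csv(csv)

# read_csv leaves these values as plain strings; prefixing "00:" makes
# them parse as HH:MM:SS.s, yielding proper Timedelta values
data['time'] = pd.to_timedelta('00:' + data['time'])
print(data['time'].dtype)  # timedelta64[ns]
```

With a Timedelta column, sorting and arithmetic behave as times rather than as strings.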

How do I convert date (YYYY-MM-DD) to Month-YY and groupby on some other column to get minimum and maximum month?

I have created a data frame which has a rolling quarter mapping, using this code:
abcd = pd.DataFrame()
abcd['Month'] = pd.date_range(start='2020-04-01', end='2022-04-01', freq='MS')
abcd['Time_1'] = np.arange(1, abcd.shape[0] + 1)
abcd['Time_2'] = np.arange(0, abcd.shape[0])
abcd['Time_3'] = np.arange(-1, abcd.shape[0] - 1)
db_nd_ad_unpivot = pd.melt(abcd, id_vars=['Month'],
                           value_vars=['Time_1', 'Time_2', 'Time_3'],
                           var_name='Time_name', value_name='Time')
abcd_map = db_nd_ad_unpivot[(db_nd_ad_unpivot['Time'] > 0) &
                            (db_nd_ad_unpivot['Time'] < abcd.shape[0] + 1)]
abcd_map = abcd_map[['Month', 'Time']]
The output of the code looks like this:
Now I have created an additional column, Time_Period, that gives the name of the month and year in Mon'YY format, using the code
abcd_map['Month'] = pd.to_datetime(abcd_map.Month)
# abcd_map['Month'] = abcd_map['Month'].astype(str)
abcd_map['Time_Period'] = abcd_map['Month'].apply(lambda x: x.strftime("%b'%y"))
Now I want to see, for a specific time, the minimum and maximum in the Month column. E.g., for time instance 17, the simple groupby results in:
Time Time_Period
17 Aug'21-Sept'21
The desired output is:
Time Time_Period
17 Aug'21-Oct'21
I think this happens because strftime converts the column to string/object type, so min and max are taken on the formatted strings rather than on the Month dates.
How about converting to string after finding the min and max:
New_df = (abcd_map.groupby('Time')['Month']
                  .agg(['min', 'max'])
                  .apply(lambda x: x.dt.strftime("%b'%y"))
                  .agg(' '.join, axis=1)
                  .reset_index())
Do this:
abcd_map['Time_Period'] = abcd_map['Month'].apply(lambda x: x.strftime("%b'%y"))
df = abcd_map.groupby(['Time']).agg(
    sum_col=('Time', np.sum),
    first_date=('Time_Period', np.min),
    last_date=('Time_Period', np.max)
).reset_index()
df['TimePeriod'] = df['first_date'] + '-' + df['last_date']
df = df.drop(['first_date', 'last_date'], axis=1)
df
which returns
Time sum_col TimePeriod
0 1 3 Apr'20-May'20
1 2 6 Jul'20-May'20
2 3 9 Aug'20-Jun'20
3 4 12 Aug'20-Sep'20
4 5 15 Aug'20-Sep'20
5 6 18 Nov'20-Sep'20
6 7 21 Dec'20-Oct'20
7 8 24 Dec'20-Nov'20
8 9 27 Dec'20-Jan'21
9 10 30 Feb'21-Mar'21
10 11 33 Apr'21-Mar'21
11 12 36 Apr'21-May'21
12 13 39 Apr'21-May'21
13 14 42 Jul'21-May'21
14 15 45 Aug'21-Jun'21
15 16 48 Aug'21-Sep'21
16 17 51 Aug'21-Sep'21
17 18 54 Nov'21-Sep'21
18 19 57 Dec'21-Oct'21
19 20 60 Dec'21-Nov'21
20 21 63 Dec'21-Jan'22
21 22 66 Feb'22-Mar'22
22 23 69 Apr'22-Mar'22
23 24 48 Apr'22-Mar'22
24 25 25 Apr'22-Apr'22

Creating dataframe with all jumbled combinations from previous dataframe

I have a dataframe:
John
Kelly
Jay
Max
Kert
I want to create a new dataframe such that the output is as follows:
John_John
John_Kelly
John_Jay
John_Max
John_Kert
Kelly_John
Kelly_Kelly
Kelly_Jay
Kelly_Max
Kelly_Kert
...
Kert_Max
Kert_Kert
Assuming "name" is the column, you can use a cross merge:
df2 = df.merge(df, how='cross')
out = (df2['name_x']+'_'+df2['name_y']).to_frame('name')
Or, with itertools.product:
from itertools import product
out = pd.DataFrame({'name': [f'{a}_{b}' for a,b in product(df['name'], repeat=2)]})
output:
name
0 John_John
1 John_Kelly
2 John_Jay
3 John_Max
4 John_Kert
5 Kelly_John
6 Kelly_Kelly
7 Kelly_Jay
8 Kelly_Max
9 Kelly_Kert
10 Jay_John
11 Jay_Kelly
12 Jay_Jay
13 Jay_Max
14 Jay_Kert
15 Max_John
16 Max_Kelly
17 Max_Jay
18 Max_Max
19 Max_Kert
20 Kert_John
21 Kert_Kelly
22 Kert_Jay
23 Kert_Max
24 Kert_Kert

How to change a list of synsets to list elements?

I have tried out the following snippet of code for my project:
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
df=[]
hypo = wn.synset('science.n.01').hyponyms()
hyper = wn.synset('science.n.01').hypernyms()
mero = wn.synset('science.n.01').part_meronyms()
holo = wn.synset('science.n.01').part_holonyms()
ent = wn.synset('science.n.01').entailments()
df = df+hypo+hyper+mero+holo+ent
df_agri_clean = pd.DataFrame(df)
df_agri_clean.columns=["Items"]
print(df_agri_clean)
pd.set_option('display.expand_frame_repr', False)
It has given me this output of a dataframe:
Items
0 Synset('agrobiology.n.01')
1 Synset('agrology.n.01')
2 Synset('agronomy.n.01')
3 Synset('architectonics.n.01')
4 Synset('cognitive_science.n.01')
5 Synset('cryptanalysis.n.01')
6 Synset('information_science.n.01')
7 Synset('linguistics.n.01')
8 Synset('mathematics.n.01')
9 Synset('metallurgy.n.01')
10 Synset('metrology.n.01')
11 Synset('natural_history.n.01')
12 Synset('natural_science.n.01')
13 Synset('nutrition.n.03')
14 Synset('psychology.n.01')
15 Synset('social_science.n.01')
16 Synset('strategics.n.01')
17 Synset('systematics.n.01')
18 Synset('thanatology.n.01')
19 Synset('discipline.n.01')
20 Synset('scientific_theory.n.01')
21 Synset('scientific_knowledge.n.01')
The underlying data is already a list of synsets, as printing df shows:
[Synset('agrobiology.n.01'), Synset('agrology.n.01'), Synset('agronomy.n.01'), Synset('architectonics.n.01'), Synset('cognitive_science.n.01'), Synset('cryptanalysis.n.01'), Synset('information_science.n.01'), Synset('linguistics.n.01'), Synset('mathematics.n.01'), Synset('metallurgy.n.01'), Synset('metrology.n.01'), Synset('natural_history.n.01'), Synset('natural_science.n.01'), Synset('nutrition.n.03'), Synset('psychology.n.01'), Synset('social_science.n.01'), Synset('strategics.n.01'), Synset('systematics.n.01'), Synset('thanatology.n.01'), Synset('discipline.n.01'), Synset('scientific_theory.n.01'), Synset('scientific_knowledge.n.01')]
I wish to change every word under "Items" like so :
Synset('agrobiology.n.01') => agrobiology.n.01
or
Synset('agrobiology.n.01') => 'agrobiology'
Any answer associated will be appreciated! Thanks!
To access the name of each synset, call its .name() method. You can use a list comprehension to update the column as follows:
df_agri_clean['Items'] = [df_agri_clean['Items'][i].name() for i in range(len(df_agri_clean))]
df_agri_clean
The output will be as you expected
Items
0 agrobiology.n.01
1 agrology.n.01
2 agronomy.n.01
3 architectonics.n.01
4 cognitive_science.n.01
5 cryptanalysis.n.01
6 information_science.n.01
7 linguistics.n.01
8 mathematics.n.01
9 metallurgy.n.01
10 metrology.n.01
11 natural_history.n.01
12 natural_science.n.01
13 nutrition.n.03
14 psychology.n.01
15 social_science.n.01
16 strategics.n.01
17 systematics.n.01
18 thanatology.n.01
19 discipline.n.01
20 scientific_theory.n.01
21 scientific_knowledge.n.01
To further strip the ".n.01" suffix from the strings, you could do the following:
df_agri_clean['Items'] = [df_agri_clean['Items'][i].name().replace('.n.01', '') for i in range(len(df_agri_clean))]
df_agri_clean
Output (just like your second expected output)
Items
0 agrobiology
1 agrology
2 agronomy
3 architectonics
4 cognitive_science
5 cryptanalysis
6 information_science
7 linguistics
8 mathematics
9 metallurgy
10 metrology
11 natural_history
12 natural_science
13 nutrition.n.03
14 psychology
15 social_science
16 strategics
17 systematics
18 thanatology
19 discipline
20 scientific_theory
21 scientific_knowledge
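Note that .replace('.n.01', '') leaves other sense numbers untouched (row 13 above is still nutrition.n.03). Splitting on the first dot strips any sense suffix; a small sketch with plain stand-in strings, so it runs without the NLTK download:

```python
import pandas as pd

# Stand-in strings for the synset names; real code would call .name() first
df_agri_clean = pd.DataFrame({'Items': ['agrobiology.n.01', 'nutrition.n.03']})

# Keep only the lemma before the first dot, whatever the sense number
df_agri_clean['Items'] = df_agri_clean['Items'].str.split('.').str[0]
print(df_agri_clean['Items'].tolist())  # ['agrobiology', 'nutrition']
```

This handles every row uniformly, including entries whose sense is not .n.01.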

extracting upper and lower row if a condition is met

Regards.
I have the following coordinate dataframe, divided into blocks. Each block starts with a header row (seq0_leftend seq0_rightend, seq1_leftend seq1_rightend, seq2_leftend seq2_rightend, seq3_leftend seq3_rightend, and so on). For each block, whenever a row's coordinates are negative, I would like to extract that row together with the row above and the row below it. An example of my dataframe file:
seq0_leftend seq0_rightend
0 7 107088
1 107089 108940
2 108941 362759
3 362760 500485
4 500486 509260
5 509261 702736
seq1_leftend seq1_rightend
0 1 106766
1 106767 108619
2 108620 355933
3 355934 488418
4 488419 497151
5 497152 690112
6 690113 700692
7 700693 721993
8 721994 722347
9 722348 946296
10 946297 977714
11 977715 985708
12 -985709 -990725
13 991992 1042023
14 1042024 1259523
15 1259524 1261239
seq2_leftend seq2_rightend
0 1 109407
1 362514 364315
2 109408 362513
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
13 1015418 1069976
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
0 1 112030
1 112031 113882
2 113883 381662
3 381663 519575
4 519576 528317
5 528318 724500
6 724501 735077
7 735078 759456
8 759457 763157
9 763158 996929
10 996931 1034492
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
14 1125427 1353901
15 1353902 1356209
16 1356210 1392818
seq4_leftend seq4_rightend
0 1 105722
1 105723 107575
2 107576 355193
3 355194 487487
4 487488 496220
5 496221 689560
6 689561 700139
7 700140 721438
8 721458 721497
9 721498 947183
10 947184 978601
11 978602 986595
12 -986596 -991612
13 994605 1046245
14 1046247 1264692
15 1264693 1266814
Finally, I want to write a new CSV with the data of interest. An example of the final result I would like is this:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
seq4_leftend seq4_rightend
11 978602 986595
12 -986596 -991612
13 994605 1046245
I assume that you have a list of DataFrames, let's call it src.
To convert a single DataFrame, define the following function:
def findRows(df):
    col = df.iloc[:, 0]
    if col.lt(0).any():
        return df[col.lt(0) | col.shift(1).lt(0) | col.shift(-1).lt(0)]
    else:
        return None
Note that this function starts by reading column 0 from the source
DataFrame, so it is independent of that column's name.
Then it checks whether any element in this column is < 0.
If so, it returns a DataFrame with the rows that contain a value < 0:
either in the row itself,
or in the previous row,
or in the next row.
If not, the function returns None (from your expected result
I see that in such a case you don't want even an empty DataFrame).
The first stage is to collect results of this function called on each
DataFrame from src:
result = [findRows(df) for df in src]
And the last part is to filter out the elements which are None:
result = list(filter(None.__ne__, result))
To see the result, run:
for df in result:
    print(df)
For src containing first 3 of your DataFrames, I got:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
As you can see, the resulting list contains only results
originating from the second and third source DataFrame.
The first was filtered out, since findRows returned
None from its processing.
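To see findRows on a single tiny block, here is a self-contained run with a made-up frame shaped like the seq1 data around its negative row:

```python
import pandas as pd

def findRows(df):
    # Column 0 holds the left coordinates; keep rows that are negative
    # or are direct neighbours of a negative row
    col = df.iloc[:, 0]
    if col.lt(0).any():
        return df[col.lt(0) | col.shift(1).lt(0) | col.shift(-1).lt(0)]
    return None

# A made-up block resembling the seq1 data around its negative row
block = pd.DataFrame({'seq1_leftend': [977715, -985709, 991992, 1042024],
                      'seq1_rightend': [985708, -990725, 1042023, 1259523]})
print(findRows(block))  # rows 0, 1 and 2: the negative row plus neighbours
```

Row 3 is dropped because it is neither negative nor adjacent to the negative row.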
