IF(A2>A3, 1, 0) Excel formula in Pandas - python

I am trying to create a column of zeros and ones based on the values of the first column.
If the upper cell's value is bigger than the one below it, write 1, otherwise 0.
Example code would look like this:
df = pd.DataFrame({'col1': [1, 2, 1, 3, 0]})
df['col2'] = ...python version of excel formula IF(A2>A3, 1, 0)...
expected output:
col2: [0, 1, 0, 1, 0]
I have tried:
while True:
    for index, rows in df.iterrows():
        df['col1'] = np.where(df['col1'] > df['col1'][index+1], 1, 0)
but this is very slow and gives wrong results.
Thanks in advance!

You can use
df['col2'] = df['col1'].shift().lt(df['col1']).astype(int)
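On the sample frame above this evaluates to:
print(df['col1'].shift().lt(df['col1']).astype(int).tolist())
# [0, 1, 0, 1, 0]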

Here is the final solution I came up with:
df['col2'] = (df['col1'] < df['col1'].shift()).astype(int).shift(periods=-1).fillna(0)
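For reference, the Excel rule (compare each row with the row below it) can also be written with a negative shift; this is just an equivalent sketch, not taken from either answer above:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 3, 0]})
# shift(-1) pulls the next row's value up; the last row compares against NaN,
# which is False, so astype(int) leaves a 0 there without an extra fillna.
df['col2'] = df['col1'].gt(df['col1'].shift(-1)).astype(int)
# col2 -> [0, 1, 0, 1, 0]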

Related

Python dataframe merge on condition

I have two data frames and three conditions for building a new data frame:
1) df1["Product"] == df2["Product"] and df2["Date"] >= df1["Date"]
2) For each df1 row that meets condition 1, sum df2["Count"] over the matching df2 rows and, on each iteration, compare the total to df1["Count"]; if df2["Count"] sums to df1["Count"], add the row to the new data frame.
Example:
df1["Product"][2] = "147326.A", df1["Date"][2] = "1/03/22" and df1["Count"][2] = 4.
We check df2 for a match: df2["Product"][1] == df1["Product"][2] and df2["Date"][1] >= df1["Date"][2]. Once the first condition is met, we sum() df2["Count"] over the matching rows and on each iteration compare it to df1["Count"]; if df1["Count"] == df2["Count"], the row is added to the new data frame.
df1 = pd.DataFrame({"Date": ["11/01/22", "1/02/22", "1/03/22", "1/04/22", "2/02/22"],
                    "Product": ["315114.A", "147326.A", "147326.A", "91106.A", "283214.A"],
                    "Count": [3, 1, 4, 1, 2]})
df2 = pd.DataFrame({"Date": ["15/01/22", "4/02/22", "7/03/22", "1/04/22", "2/02/22", "15/01/22", "1/06/22", "1/06/22"],
                    "Product": ["315114.A", "147326.A ", "147326.A", "91106.A", "283214.A", "315114.A", "147326.A", "147326.A"],
                    "Count": [1, 1, 2, 1, 2, 2, 1, 1]})
The following data should be a match:
df1 = pd.DataFrame({"Date" : ["01/03/2022"],"Product":["91106.A"],"Count":[2]})
df2 = pd.DataFrame({"Date" : ["01/03/2022", "7/03/2022", "7/03/2022", "7/03/2022","7/03/2022", "7/03/2022"],"Product" : ["91106.A", "91106.A","91106.A", "91106.A", "91106.A", "91106.A"],"Count" : [1, 1, 1, 1, 1, 1]})
You could solve this in a list comprehension (within a pd.DataFrame):
df3 = pd.DataFrame([j.to_dict() for i, j in df1.iterrows() if
                    j["Count"] == df2[(df2["Product"] == j["Product"]) &
                                      (df2["Date"] >= j["Date"])]["Count"].sum()])
Splitting this up into lots of lines would look like this:
l = []
for i, j in df1.iterrows():
    if j["Count"] == df2[(df2["Product"] == j["Product"]) &
                         (df2["Date"] >= j["Date"])]["Count"].sum():
        x = j.to_dict()
        l.append(x)
df3 = pd.DataFrame(l)
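One caveat: the Date columns here are plain strings, so df2["Date"] >= j["Date"] compares them lexicographically. Parsing the dates first (assuming a day-first format, i.e. "1/03/22" = 1 March 2022) makes the comparison chronological:
# Parse the string dates before comparing them.
df1["Date"] = pd.to_datetime(df1["Date"], dayfirst=True)
df2["Date"] = pd.to_datetime(df2["Date"], dayfirst=True)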

maintaining pandas df index with selection & groupby (python)

I am having an issue with returning the original df index of a row given a groupby condition after subselecting some of the df. It's easier to understand through code.
So if we start with a toy dataframe:
headers = ['a','b']
nrows = 8
df = pd.DataFrame(columns = headers)
df['a'] = [0]*(nrows//2) + [1]*(nrows//2)
df['b'] = [2]*(nrows//4) + [4]*(nrows//4) + [2]*(nrows//4) + [4]*(nrows//4)
print(df)
then I select the subset of data I am interested in and checking that the index is retained:
sub_df = df[df['a']==1] ## selects for only group 1 (indices 4-7)
print(sub_df.index) ## looks good so far
sub_df.index returns
Int64Index([4, 5, 6, 7], dtype='int64')
Which seems great! I would like to group data from that subset and extract the original df index and that is where the issue occurs:
For example:
g_df = sub_df.groupby('b')
g_df_idx = g_df.indices
print(g_df_idx) ## bad!
when I print(g_df_idx) I want it to return:
{2: array([4,5]), 4: array([6,7])}
Due to the way I will be using this code I can't just groupby(['a','b'])
I'm going nuts with this thing. Here are some of the many solutions I have tried:
## 1
e1_idx = sub_df.groupby('b').indices
# print(e1_idx) ## issue persists
## 2
e2 = sub_df.groupby('b', as_index = True) ## also tried as_index = False
e2_idx = e2.indices
# print(e2_idx) ## issue persists
## 3
e3 = sub_df.reset_index()
e3_idx = e3.groupby('b').indices
# print(e3_idx) ## issue persists
I'm sure there must be some simple solution I'm just overlooking. Would be very grateful for any advice.
You can do it like this:
g_df_idx = g_df.apply(lambda x: x.index).to_dict()
print(g_df_idx)
# {2: Int64Index([4, 5], dtype='int64'), 4: Int64Index([6, 7], dtype='int64')}
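A related shortcut, if it fits your use case, is the groupby object's groups attribute, which maps each group key to the original index labels (whereas indices gives positions within the sub-frame):
print(sub_df.groupby('b').groups)
# {2: Int64Index([4, 5], dtype='int64'), 4: Int64Index([6, 7], dtype='int64')}
# (the exact repr of the index objects depends on the pandas version)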

How to use np.where() to divide elements of an array into categories?

I'm trying to use np.where() to classify elements of an array into three categories. My array is mean_house_value = [200.000, 120.000, 111.765, 326.234, 700.090, 99.345, 150.232, 250.000, 940.000, 177.000, 45.000, 42.000, 620.654]. The dataset is called housing. house_value_cat is the new column in the dataset where I want to save my new classification. The classification is the following:
mean_house_value < 200.000
200.000 < mean_house_value < 400.000
400.000 < mean_house_value
My code so far is the following:
housing["house_value_cat"] = np.ceil(housing["mean_house_value"]/3)
housing["house_value_cat"].where((housing["house_value_cat"]<200.000) &(housing["house_value_cat"]>400.000))
print(housing["house_value_cat"])
How can I implement the second condition (200.000 < mean_house_value < 400.000) in my code?
NumPy has a function, digitize(), that does what you want:
>>> import numpy as np
>>> mean_house_value = [200.000, 120.000, 111.765, 326.234, 700.090, 99.345, 150.232, 250.000, 940.000, 177.000, 45.000, 42.000, 620.654]
>>> np.digitize(mean_house_value,[0.,200.,400.])
array([2, 1, 1, 2, 3, 1, 1, 2, 3, 1, 1, 1, 3])
You can create a new column in a dataframe with this result. Assuming you have already defined a dataframe called housing:
housing["house_value_cat"] = np.digitize(mean_house_value,[0.,200.,400.])

convert a dataframe column from string to List of numbers

I have created the following dataframe from a csv file:
id marks
5155 1,2,3,,,,,,,,
2156 8,12,34,10,4,3,2,5,0,9
3557 9,,,,,,,,,,
7886 0,7,56,4,34,3,22,4,,,
3689 2,8,,,,,,,,
It is indexed on id. The values in the marks column are strings. I need to convert them to lists of numbers so that I can iterate over them and use them as index numbers for another dataframe. How can I convert them from strings to lists? I tried to add a new column and convert them based on "Add a columns in DataFrame based on other column", but it failed:
df = df.assign(new_col_arr=lambda x: np.fromstring(x['marks'].values[0], sep=',').astype(int))
Here's a way to do it:
df = df.assign(new_col_arr=df['marks'].str.split(','))
# convert to int
df['new_col'] = df['new_col_arr'].apply(lambda x: list(map(int, [i for i in x if i != ''])))
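On the sample data (with id as the index) this should leave plain Python lists of ints in new_col, e.g.:
print(df['new_col'].tolist())
# [[1, 2, 3], [8, 12, 34, 10, 4, 3, 2, 5, 0, 9], [9], [0, 7, 56, 4, 34, 3, 22, 4], [2, 8]]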
I presume that you want to create a NEW dataframe, since the number of items differs from the number of rows. I suggest the following:
# source data
df = pd.DataFrame({'id': [5155, 2156, 7886],
                   'marks': ['1,2,3,,,,,,,,', '8,12,34,10,4,3,2,5,0,9', '0,7,56,4,34,3,22,4,,,']})
# create dictionary from df:
dd = {row[0]: np.fromstring(row[1], dtype=int, sep=',') for _, row in df.iterrows()}
# dd now contains:
# {5155: array([1, 2, 3]),
#  2156: array([ 8, 12, 34, 10, 4, 3, 2, 5, 0, 9]),
#  7886: array([ 0, 7, 56, 4, 34, 3, 22, 4])}
# here you pad the lists inside dictionary so that they have equal length
...
# convert dd to DataFrame:
df2 = pd.DataFrame(dd)
I found two similar alternatives:
1.
df['marks'] = df['marks'].str.split(',').map(lambda num_str_list: [int(num_str) for num_str in num_str_list if num_str])
2.
df['marks'] = df['marks'].map(lambda arr_str: [int(num_str) for num_str in arr_str.split(',') if num_str])
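With either alternative the marks column holds plain Python lists of ints, so you can iterate over them directly; if one row per mark is more convenient for indexing into the other dataframe, Series.explode (available from pandas 0.25) flattens the lists:
# One row per mark, keeping the id index
marks_long = df['marks'].explode()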

Python: Efficient looping in dataframe to find duplicates for multiple columns

I am using python and I want to go through a dataset and highlight the most used locations.
This is my dataset (but with 300,000+ records):
Longitude Latitude
14.28586 48.3069
14.28577 48.30687
14.28555 48.30678
14.28541 48.30673
First I add a density column:
df['Density'] = 0
And this is the code that I am using to increase the density value for each record:
for index in range(0, len(df)):
    for index2 in range(index + 1, len(df)):
        if df['Longitude'].loc[index] == df['Longitude'].loc[index2] and df['Latitude'].loc[index] == df['Latitude'].loc[index2]:
            df['Density'].loc[index] += 1
            df['Density'].loc[index2] += 1
            print("match")
    print(str(index) + "/" + str(len(df)))
The code above simply iterates through the dataframe, comparing each record against all later records (inner loop); when a match is found, both density values are incremented.
I want to find the longitude/latitude pairs that match and increase their density value.
The code is obviously very slow, and I am sure Python has a neat technique for doing something like this. Any ideas?
You can use duplicated, groupby, transform & sum to achieve this:
Let's create a sample dataset that actually has duplicates:
df = pd.DataFrame({'lat': [0, 0, 0, 1, 1, 2, 2, 2],
                   'lon': [1, 1, 2, 1, 0, 2, 2, 2]})
First flag the duplicate rows based on lat & lon, then apply the transform to create a new column:
df['is_dup'] = df[['lat', 'lon']].duplicated()
df['dups'] = df.groupby(['lat','lon']).is_dup.transform(np.sum)
# df outputs:
#    lat  lon  is_dup  dups
# 0    0    1   False     1
# 1    0    1    True     1
# 2    0    2   False     0
# 3    1    1   False     0
# 4    1    0   False     0
# 5    2    2   False     2
# 6    2    2    True     2
# 7    2    2    True     2
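Applied to the original column names from the question, a per-coordinate occurrence count can also be computed directly with a size transform; a sketch, using the question's Longitude/Latitude columns:
# Count how many rows share each (Longitude, Latitude) pair, aligned to the original rows;
# subtract 1 to get the question's "number of other matching rows" semantics.
df['Density'] = df.groupby(['Longitude', 'Latitude'])['Longitude'].transform('size') - 1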
