Compute columns of data frames and add them to new columns (python)

I want to make a new column by calculating existing columns.
For example, df looks like:
no data1 data2
1 10 15
2 51 46
3 36 20
...
I want to make this:
new_df
no data1 data2 data1/-2 data1/2 data2/-2 data2/2
1 10 15 -5 5 -7.5 7.5
2 51 46 -25.5 25.5 -23 23
3 36 20 -18 18 -9 9
but I don't know how to do this as efficiently as possible.

To create a new df column based on calculations over two or more other columns, define the new column and assign your expression to it. For example:
df['new_col'] = df['col_1'] * df['col_2']
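For the frame in the question, a minimal sketch (column names taken from the example) that builds all four derived columns with a loop rather than four separate statements:

```python
import pandas as pd

df = pd.DataFrame({"no": [1, 2, 3], "data1": [10, 51, 36], "data2": [15, 46, 20]})

# Derive the /-2 and /2 columns from each data column; the division is vectorized
for col in ["data1", "data2"]:
    df[f"{col}/-2"] = df[col] / -2
    df[f"{col}/2"] = df[col] / 2
```

Each assignment operates on the whole column at once, so this stays efficient even for large frames.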

Is this what you mean?
import pandas as pd
number = [[1,2],[3,4],[5,6],[7,8],[9,10]]
df = pd.DataFrame(number)
df['Data 1/2'] = df[0] / df[1]
And the output :
0 1 Data 1/2
0 1 2 0.500000
1 3 4 0.750000
2 5 6 0.833333
3 7 8 0.875000
4 9 10 0.900000

Related

Counting repeating words with numpy and pandas Python

I want to write code that outputs, for each distinct value in a, the number of times that value is repeated. Then I want to put the result in a pandas DataFrame and print it. The sums code below does not work; how can I make it work and get the expected output?
import numpy as np
import pandas as pd
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
uniques = np.unique(a)
sums = np.sum(uniques[:-1]==a[:-1])
Expected Output:
Value Repetition Count
1 1
3 3
12 3
22 1
43 4
Define a dataframe df based on the array a. Then, use .groupby() + .size() to get the size/count of unique values, as follows:
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
df = pd.DataFrame({'Value': a})
df.groupby('Value').size().reset_index(name='Repetition Count')
Result:
Value Repetition Count
0 1 1
1 3 3
2 12 3
3 22 1
4 43 4
Edit
If you want also the percentages of counts, you can use:
(df.groupby('Value', as_index=False)
   .agg(**{'Repetition Count': ('Value', 'size'),
           'Percent': ('Value', lambda x: round(x.size / len(a) * 100, 2))})
)
Result:
Value Repetition Count Percent
0 1 1 8.33
1 3 3 25.00
2 12 3 25.00
3 22 1 8.33
4 43 4 33.33
or use .value_counts with normalize=True
pd.Series(a).value_counts(normalize=True).mul(100)
Result:
43 33.333333
12 25.000000
3 25.000000
22 8.333333
1 8.333333
dtype: float64
You can use groupby:
>>> pd.Series(a).groupby(a).count()
1 1
3 3
12 3
22 1
43 4
dtype: int64
Or value_counts():
>>> pd.Series(a).value_counts().sort_index()
1 1
3 3
12 3
22 1
43 4
dtype: int64
It's easiest if you make a pandas DataFrame from the np.array and then use value_counts().
df = pd.DataFrame(data=a, columns=['col1'])
print(df.col1.value_counts())
43 4
12 3
3 3
22 1
1 1
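As yet another option, np.unique can return the counts directly, without building a Series first; a small sketch:

```python
import numpy as np
import pandas as pd

a = np.array([12, 12, 12, 3, 43, 43, 43, 22, 1, 3, 3, 43])

# return_counts=True yields the count for each sorted unique value
values, counts = np.unique(a, return_counts=True)
result = pd.DataFrame({"Value": values, "Repetition Count": counts})
```

This matches the expected output, since np.unique sorts the unique values ascending.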

adding column to df that calculates count of different column using groupby

I'm trying to create a new column in a df. I want the new column to equal the count of rows for each unique mother_ID, which is a different column in the df.
This is what I'm currently doing. It makes the new column, but the new column is filled with NaNs:
df.columns = ['mother_ID', 'date_born', 'mother_mass_g', 'hatchling_masses_g']
df.to_numpy()
count = df.groupby('mother_ID').hatchling_masses_g.count()
df['count'] = count
The new df has NaN in every row of the count column, although if I simply print(count) I get the correct counts for each mother_ID. Does anyone know what I'm doing wrong?
Use groupby with transform('count'):
df['count'] = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
Notice the difference between groupby count and groupby transform with 'count'.
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame({
    'mother_ID': np.random.choice(['a', 'b'], 10),
    'hatchling_masses_g': np.random.randint(1, 100, 10)
})
mother_ID hatchling_masses_g
0 b 63
1 a 28
2 b 31
3 b 81
4 a 8
5 a 77
6 a 16
7 b 54
8 a 81
9 a 28
groupby.count
counts = df.groupby('mother_ID')['hatchling_masses_g'].count()
mother_ID
a 6
b 4
Name: hatchling_masses_g, dtype: int64
Notice how there are only 2 rows, while the DataFrame has 10. When assigning back to the DataFrame, pandas cannot align the counts to the rows by index, which results in NaNs indicating missing data:
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 NaN
1 a 28 NaN
2 b 31 NaN
3 b 81 NaN
4 a 8 NaN
5 a 77 NaN
6 a 16 NaN
7 b 54 NaN
8 a 81 NaN
9 a 28 NaN
Pandas tries to find the labels 'a' and 'b' in the DataFrame's index, and since it cannot, it fills the column entirely with NaN values.
groupby.transform('count')
transform, on the other hand, will populate the entire group with the count:
counts = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
counts:
0 4
1 6
2 4
3 4
4 6
5 6
6 6
7 4
8 6
9 6
Name: hatchling_masses_g, dtype: int64
Notice 10 rows were created (one for every row in the DataFrame):
This assigns back to the dataframe nicely (since the indexes align):
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
If needed counts can be done via groupby count, then join back to the DataFrame on the group key:
counts = df.groupby('mother_ID')['hatchling_masses_g'].count().rename('count')
df = df.join(counts, on='mother_ID')
counts:
mother_ID
a 6
b 4
Name: count, dtype: int64
df:
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
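An equivalent without groupby is to map each mother_ID to its frequency via value_counts. Note this counts rows per mother_ID rather than non-null hatchling masses, so it only differs from transform('count') if that column contains NaNs. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "mother_ID": ["b", "a", "b", "a", "a"],
    "hatchling_masses_g": [63, 28, 31, 8, 77],
})

# Map each ID to its frequency in the column; indexes align row by row
df["count"] = df["mother_ID"].map(df["mother_ID"].value_counts())
```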

pandas dataframe sort columns according to column totals

I was able to sort rows according to the last column. However, I also have a row at the bottom of the dataframe which has the totals of each column. I couldn't find a way to sort the columns according to the totals in the last row. The table looks like the following:
A B C T
0 9 9 9 27
1 9 10 4 23
2 7 4 8 19
3 2 6 9 17
T 27 29 30
I want this table to be sorted so that the order of columns will be from left to right C, B, A from highest total to lowest. How can this be done?
Use DataFrame.sort_values by index value T with axis=1:
df = df.sort_values('T', axis=1, ascending=False)
print (df)
C B A T
0 9 9 9 27.0
1 4 10 9 23.0
2 8 4 7 19.0
3 9 6 2 17.0
T 30 29 27 NaN
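A runnable sketch of the whole flow, rebuilding the table from the question with the totals computed rather than typed in:

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 9, 7, 2], "B": [9, 10, 4, 6], "C": [9, 4, 8, 9]})
df["T"] = df.sum(axis=1)                 # row totals (27, 23, 19, 17)
df.loc["T"] = df[["A", "B", "C"]].sum()  # column totals; T stays NaN in this row

# Sort columns by the values in the totals row; the T column (NaN) sorts last
df = df.sort_values("T", axis=1, ascending=False)
```

Because NaN is placed last by default, the T column conveniently ends up on the right.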

Populate column in data frame based on a range found in another dataframe

I'm attempting to populate a column in a data frame based on whether the index value of that record falls within a range defined by two columns in another data frame.
df1 looks like:
a
0 4
1 45
2 7
3 5
4 48
5 44
6 22
7 89
8 45
9 44
10 23
and df2 is:
START STOP CLASS
0 2 3 1
1 5 7 2
2 8 8 3
what I want would look like:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
The START column in df2 is the minimum value of the range and the STOP column is the max.
You can use IntervalIndex (requires pandas 0.20.0 or later).
First construct the index:
df2.index = pd.IntervalIndex.from_arrays(df2['START'], df2['STOP'], closed='both')
df2
Out:
START STOP CLASS
[2, 3] 2 3 1
[5, 7] 5 7 2
[8, 8] 8 8 3
Now if you index into the second DataFrame it will lookup the value in the intervals. For example,
df2.loc[6]
Out:
START 5
STOP 7
CLASS 2
Name: [5, 7], dtype: int64
returns the second class. I don't know if it can be used with merge or with merge_asof but as an alternative you can use map:
df1['CLASS'] = df1.index.to_series().map(df2['CLASS'])
Note that I first converted the index to a Series to be able to use the Series.map method. This results in
df1
Out:
a CLASS
0 4 NaN
1 45 NaN
2 7 1.0
3 5 1.0
4 48 NaN
5 44 2.0
6 22 2.0
7 89 2.0
8 45 3.0
9 44 NaN
10 23 NaN
Alternative solution:
classdict = df2.set_index("CLASS").to_dict("index")
rangedict = {}
for key, value in classdict.items():
    # get all items in range and assign value (the key)
    for item in range(value["START"], value["STOP"] + 1):
        rangedict[item] = key
rangedict now contains:
{2: 1, 3: 1, 5: 2, 6: 2, 7: 2, 8: 3}
Now map and, if desired, format:
df1['CLASS'] = df1.index.to_series().map(rangedict)
df1.applymap("{0:.0f}".format)
outputs:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
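If the intervals don't overlap, IntervalIndex.get_indexer can do the lookup in one vectorized call instead of a per-value map; a sketch using the data from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [4, 45, 7, 5, 48, 44, 22, 89, 45, 44, 23]})
df2 = pd.DataFrame({"START": [2, 5, 8], "STOP": [3, 7, 8], "CLASS": [1, 2, 3]})

intervals = pd.IntervalIndex.from_arrays(df2["START"], df2["STOP"], closed="both")
idx = intervals.get_indexer(df1.index)  # -1 where the index falls in no interval
df1["CLASS"] = np.where(idx >= 0, df2["CLASS"].to_numpy()[idx], np.nan)
```

get_indexer raises if the intervals overlap, so this approach assumes disjoint START/STOP ranges, as in the example.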
import pandas as pd
import numpy as np
# Here is your existing dataframe
df_existing = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
# Create a new empty dataframe with specific column names and data types
df_new = pd.DataFrame(index=None)
columns = ['field01', 'field02', 'field03', 'field04']
dtypes = [str, int, int, int]
for c, d in zip(columns, dtypes):
    df_new[c] = pd.Series(dtype=d)
# Set the index on the new dataframe to same as existing
df_new['new_index'] = df_existing.index
df_new.set_index('new_index', inplace=True)
# Fill the new dataframe with specific fields from the existing dataframe
df_new[['field02', 'field03']] = df_existing[['B', 'C']]
print(df_new)

Best way to join / merge by range in pandas

I frequently use pandas to merge (join) with a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id,B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so A_value would be between B_low and B_high.
Using SQL syntax that would be:
SELECT *
FROM A,B
WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile in pandas the only way (that's not using loops that I found), is by creating a dummy column in both tables, join on it (equivalent to cross-join) and then filter out unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
Another solution I had is applying to each A value a search function on B using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that sounds inefficient as well and might require index optimization.
Is there a more elegant and/or efficient way to perform this action?
Setup
Consider the dataframes A and B
A = pd.DataFrame(dict(
    A_id=range(10),
    A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1).append(
    A[~np.in1d(np.arange(len(A)), np.unique(i))],
    ignore_index=True, sort=False
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Not sure that this is more efficient; however, you can use SQL directly with pandas (via the sqlite3 module, for instance, inspired by this question), like:
import sqlite3
conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 AND df1.col1 < 0.5"
tt = pd.read_sql_query(qry, conn)
You can adapt the query as needed in your application
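For the actual range condition from the question, the same sqlite3 approach would look something like this; the small A/B frames and their values are made up for illustration:

```python
import sqlite3

import pandas as pd

A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 95]})
B = pd.DataFrame({"B_id": [0, 1], "B_low": [0, 30], "B_high": [10, 40]})

conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)

# Inner join on the range condition, mirroring the SQL from the question
result = pd.read_sql_query(
    "SELECT * FROM A JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high",
    conn,
)
```

Here A_value 5 falls in [0, 10] and 35 in [30, 40], while 95 matches no interval, so the inner join returns two rows.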
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.
conditional_join from pyjanitor may be helpful for abstraction/convenience:
# pip install pyjanitor
import pandas as pd
import janitor
inner join
A.conditional_join(
    B,
    ('A_value', 'B_low', '>='),
    ('A_value', 'B_high', '<=')
)
A_id A_value B_id B_low B_high
0 0 5 0 0 10
1 3 35 1 30 40
2 3 35 2 30 50
3 4 45 2 30 50
left join
A.conditional_join(
    B,
    ('A_value', 'B_low', '>='),
    ('A_value', 'B_high', '<='),
    how='left'
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 1 15 NaN NaN NaN
2 2 25 NaN NaN NaN
3 3 35 1.0 30.0 40.0
4 3 35 2.0 30.0 50.0
5 4 45 2.0 30.0 50.0
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Let's take a simple example:
df = pd.DataFrame([2, 3, 4, 5, 6], columns=['A'])
returns
A
0 2
1 3
2 4
3 5
4 6
Now let's define a second dataframe:
df2 = pd.DataFrame([1, 6, 2, 3, 5], columns=['B_low'])
df2['B_high'] = [2, 8, 4, 6, 6]
results in
B_low B_high
0 1 2
1 6 8
2 2 4
3 3 6
4 5 6
Here we go; we want the output to be index 3 with A value 5:
df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()
results in
A
3 5.0
I know this is an old question, but for newcomers there is now the pandas.merge_asof function, which performs a join based on the closest match.
In case you want to do a merge so that a column of one DataFrame (df_right) is between 2 columns of another DataFrame (df_left) you can do the following:
df_left = pd.DataFrame({
    "time_from": [1, 4, 10, 21],
    "time_to": [3, 7, 15, 27]
})
df_right = pd.DataFrame({
    "time": [2, 6, 16, 25]
})
df_left
time_from time_to
0 1 3
1 4 7
2 10 15
3 21 27
df_right
time
0 2
1 6
2 16
3 25
First, find matches of the right DataFrame that are closest to, but not smaller than, the left boundary (time_from) of the left DataFrame:
merged = pd.merge_asof(
    left=df_left,
    right=df_right.rename(columns={"time": "candidate_match_1"}),
    left_on="time_from",
    right_on="candidate_match_1",
    direction="forward"
)
merged
time_from time_to candidate_match_1
0 1 3 2
1 4 7 6
2 10 15 16
3 21 27 25
As you can see the candidate match in index 2 is wrongly matched, as 16 is not between 10 and 15.
Then, find matches of the right DataFrame that are closest to, but not larger than, the right boundary (time_to) of the left DataFrame:
merged = pd.merge_asof(
    left=merged,
    right=df_right.rename(columns={"time": "candidate_match_2"}),
    left_on="time_to",
    right_on="candidate_match_2",
    direction="backward"
)
merged
time_from time_to candidate_match_1 candidate_match_2
0 1 3 2 2
1 4 7 6 6
2 10 15 16 6
3 21 27 25 25
Finally, keep the matches where the two candidate matches agree, meaning the value of the right DataFrame lies between the values of the 2 columns of the left DataFrame:
merged["match"] = None
mask = merged["candidate_match_1"] == merged["candidate_match_2"]
merged.loc[mask, "match"] = merged.loc[mask, "candidate_match_1"]
merged
time_from time_to candidate_match_1 candidate_match_2 match
0 1 3 2 2 2
1 4 7 6 6 6
2 10 15 16 6 None
3 21 27 25 25 25
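To finish with an inner-join-style result, you could keep only the rows where both candidate matches agree and tidy up the columns; a minimal sketch starting from the merged table above (values copied from its output):

```python
import pandas as pd

merged = pd.DataFrame({
    "time_from": [1, 4, 10, 21],
    "time_to": [3, 7, 15, 27],
    "candidate_match_1": [2, 6, 16, 25],
    "candidate_match_2": [2, 6, 6, 25],
})

# Keep rows where the forward and backward candidates agree, i.e. real matches
inner = merged[merged["candidate_match_1"] == merged["candidate_match_2"]].copy()
inner = inner.rename(columns={"candidate_match_1": "match"}).drop(columns="candidate_match_2")
```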
