Subtracting two columns to form a new column - Pandas - python

My df looks like this;
no_1 no_2 no_3
2022-10-12 4 5 53
2022-10-13 48 4 34
2022-10-14 0 43 93
2022-10-15 0 3 43
.
.
.
2022-10-22 8 34 4
I'm simply trying to add a new column whose value is the result of subtracting two other columns, which should be easy, but for some reason it keeps failing.
I have the following;
no_data['no_4'] = no_data['no_3'] - no_data['no_1']
but I keep getting this error:
TypeError: only integer scalar arrays can be converted to a scalar index
I'm afraid my trouble-shooting hasn't helped me with this so any help is much appreciated!
Thanks

My assumption is that this is a MultiIndex issue. Try reassigning the column names and let me know if it works:
no_data.columns=['no_1', 'no_2', 'no_3']
no_data['no_4'] = no_data['no_3'] - no_data['no_1']
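If the columns really are a MultiIndex (for example after a pivot or a CSV read with two header rows), you can also drop the extra level instead of retyping every name. A minimal sketch, assuming an invented outer level called 'vals':
import pandas as pd

# Toy frame with MultiIndex columns to illustrate the assumed situation
no_data = pd.DataFrame(
    [[4, 5, 53], [48, 4, 34]],
    index=pd.to_datetime(["2022-10-12", "2022-10-13"]),
    columns=pd.MultiIndex.from_product([["vals"], ["no_1", "no_2", "no_3"]]),
)

# Keep only the innermost level so plain labels like 'no_3' select a single column again
no_data.columns = no_data.columns.get_level_values(-1)

no_data["no_4"] = no_data["no_3"] - no_data["no_1"]
print(no_data)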

Related

Pandas Confusion - My apply() is returning a series and I don't understand why

As part of my ongoing quest to get my head around pandas I am confronted by a surprise series. I don't understand how and why the output is a series - I was expecting a dataframe. If someone could explain what is happening here it would be much appreciated.
ta, Andrew
Some data:
hash email date subject subject_length
0 65319af6e jbrockmendel#gmail.com 2020-11-28 REF-IntervalIndex._assert_can_do_setop-38112 44
1 0bf58d8a9 simonjayhawkins#gmail.com 2020-11-28 DOC-add-contibutors-to-1.2.0-release-notes-38132 48
2 d16df293c 45562402+rhshadrach#users.noreply.github.com 2020-11-28 TYP-Add-cast-to-ABC-Index-like-types-38043 42
...
Some Code:
def my_function(row):
    output = row['email'].value_counts().sort_values(ascending=False).head(3)
    return output

top_three = dataframe.groupby(pd.Grouper(key='date', freq='1M')).apply(my_function)
Some Output:
date
2020-01-31 jbrockmendel#gmail.com 159
50263213+MomIsBestFriend#users.noreply.github.com 44
TomAugspurger#users.noreply.github.com 41
...
2020-10-31 jbrockmendel#gmail.com 170
2658661+dsaxton#users.noreply.github.com 23
61934744+phofl#users.noreply.github.com 21
2020-11-30 jbrockmendel#gmail.com 134
61934744+phofl#users.noreply.github.com 36
41443370+ivanovmg#users.noreply.github.com 19
Name: email, dtype: int64
It depends on what your groupby is returning.
In your case, you are applying the function to row['email'] and returning a single value_counts, while all the other columns end up in the index. In other words, after the groupby you get a single column of values under a MultiIndex, which pandas represents as a Series rather than a DataFrame. A reset_index() would therefore give you what you need.
For more clarity on which data structure is returned, we can do a toy experiment.
For example, in the first case the apply function applies the lambda to groups where each group is a dataframe (check [i for i in df.groupby(['a'])] to see what each group contains).
df = pd.DataFrame({'a':[1,1,2,2,3], 'b':[4,5,6,7,8]})
print(df.groupby(['a']).apply(lambda x:x**2))
#dataframe
a b
0 1 16
1 1 25
2 4 36
3 4 49
4 9 64
For the second case, we are applying the lambda function to a Series object, so only a single Series is returned; it therefore comes back as a series rather than a dataframe.
print(df.groupby(['a'])['b'].apply(lambda x:x**2))
#series
0 16
1 25
2 36
3 49
4 64
Name: b, dtype: int64
This can be solved simply by selecting with a list of columns, so that apply receives (and returns) a DataFrame:
print(df.groupby(['a'])[['b']].apply(lambda x:x**2))
#dataframe
b
0 16
1 25
2 36
3 49
4 64
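Applied back to the original question, a small sketch (toy data; column names as in the post, the n_commits name is just illustrative) of turning the multi-index Series into a flat DataFrame with reset_index():
import pandas as pd

# Toy data shaped like the question: one date and one email per row
dataframe = pd.DataFrame({
    "date": pd.to_datetime(["2020-11-01", "2020-11-02", "2020-11-03", "2020-12-01"]),
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
})

def my_function(group):
    return group["email"].value_counts().head(3)

top_three = (
    dataframe.groupby(pd.Grouper(key="date", freq="1M"))
    .apply(my_function)          # multi-index Series, as in the question
    .rename("n_commits")
    .reset_index()               # back to a flat DataFrame
)
print(top_three)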

Pandas Dataframe: Filter by column value startswith int

I have a dataframe which looks like this:
ID Unit Semester Note BNF
0 3537 143066.0 4010 2.3 5
1 3537 143067.0 4010 m.E. E
2 75 113142.0 4011 5.0 5
3 3726 113142.0 4011 3.3 5
4 5693 113142.0 4011 5.0 5
This dataframe contains three categories, based on the values in the "Semester" column: the values start with either 113, 143 or 153.
Now I want to split the whole dataframe so that I get three new dataframes, one for each category.
I tried to convert the column to string and work with 'startswith'.
mi = df[df['Unit'].apply(str)]
mi = df[df['Unit'].startswith('143')]
but that didn't work.
I hope someone could help me. Thanks a lot!
Isn't your target meant to be Semester rather than Unit in mi = df[df['Unit'].apply(str)]? If so, I would suggest making a new column (or using a multi-level index) with the following approach:
df["Semester_Start"] = df["Semester"].apply(lambda x: str(x)[:3])
#Take sub-sections
df[df["Semester_Start"] == "143"]
https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
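Building on that helper column, a short sketch that produces all three dataframes at once (the prefixes 113, 143 and 153 are taken from the question):
# One dataframe per category, keyed by its Semester prefix
categories = ["113", "143", "153"]
frames = {c: df[df["Semester_Start"] == c] for c in categories}
df_113, df_143, df_153 = frames["113"], frames["143"], frames["153"]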
This should do the trick:
dfs=[df.loc[df.Unit.astype(str).str.startswith(el)] for el in df.groupby(df["Unit"].astype("str").str[:3]).groups]
In short: you get the list of all possible first three digits of Unit, then iterate over that list in a list comprehension, filtering on each element with the str.startswith(...) string method.
Hope this helps!
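Spelled out for readability, the same idea looks roughly like this (a sketch; as noted above, you may want Semester rather than Unit):
# The one-liner above, step by step
unit_str = df["Unit"].astype(str)
prefixes = unit_str.str[:3].unique()                    # every distinct leading-3-digit group
dfs = [df.loc[unit_str.str.startswith(p)] for p in prefixes]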

How to find inappropriate datatype in pandas data frame for each column?

main_df:
Name Age Id DOB
0 Tom 20 A4565 22-07-1993
1 nick 21 G4562 11-09-1996
2 krish AKL F4561 15-03-1997
3 636A 18 L5624 06-07-1995
4 mak 20 K5465 03-09-1997
5 nits 55 56541 45aBc
6 444 66 NIT 09031992
column_info_df:
Column_Name Column_Type
0 Name string
1 Age integer
2 Id string
3 DOB Date
How can I find the values with datatype errors in main_df? For example, from column_info_df we can see that 'Name' is a string column, so in main_df the 'Name' column should contain only string or alphanumeric values; anything else is an error. I need to collect those datatype-error values in a separate df.
error output df:
Column_Name Current_Value Exp_Dtype Index_No.
0 Name 444 string 6
1 Age AKL integer 2
2 Id 56541 string 5
3 DOB 45aBc Date 5
4 DOB 09031992 Date 6
I tried this:
for i, r in column_info_df.iterrows():
    if r['Column_Type'] == 'string':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^a-z|A-Z]+')]
    elif r['Column_Type'] == 'integer':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^0-9]+')]
    elif r['Column_Type'] == 'Date':
        pass  # stuck here
I'm stuck here because these regexes aren't catching every error, and I don't know how to go further.
Here is one way of using df.eval().
Note: this checks values against a pattern and returns the non-matching ones, so it cannot verify semantic validity. For example, if the date column has an entry that looks like a date but is an invalid date, this will not catch it:
d={"string":".str.contains(r'[a-z|A-Z]')","integer":".str.contains('^[0-9]*$')",
"Date":".str.contains('\d\d-\d\d-\d\d\d\d')"}
m=df.eval([f"~{a}{b}"
for a,b in zip(column_info_df['Column_Name'],column_info_df['Column_Type'].map(d))]).T
final=(pd.DataFrame(np.where(m,df,np.nan),columns=df.columns)
.reset_index().melt('index',var_name='Column_Name',
value_name='Current_Value').dropna())
final['Expected_dtype']=(final['Column_Name']
.map(column_info_df.set_index('Column_Name')['Column_Type']))
print(final)
Output:
index Column_Name Current_Value Expected_dtype
6 6 Name 444 string
9 2 Age AKL integer
19 5 Id 56541 string
26 5 DOB 45aBc Date
27 6 DOB 09031992 Date
I agree there can be better regex patterns for this job, but the idea stays the same.
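As a hedged complement to the regex approach (and one way to catch invalid dates, which a pattern alone cannot), a coercion-based sketch using pd.to_numeric and pd.to_datetime with errors='coerce':
import pandas as pd

# Values that refuse to convert are the datatype errors
bad_age = main_df.loc[pd.to_numeric(main_df["Age"], errors="coerce").isna(), "Age"]
bad_dob = main_df.loc[
    pd.to_datetime(main_df["DOB"], format="%d-%m-%Y", errors="coerce").isna(), "DOB"
]
print(bad_age)   # e.g. 'AKL' at index 2
print(bad_dob)   # e.g. '45aBc' at index 5 and '09031992' at index 6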
If I understood correctly, you created separate dataframes which contain information about your main one.
What I suggest would be instead to use the build-in methods offered by pandas to deal with dataframes.
For instance, if you have a dataframe main, then:
main.info()
will give you the dtype of each column. Note that a column holds a single dtype, since it is a Series, which is itself backed by an ndarray.
So your Name column cannot contain anything other than strings that you might have missed; what it can contain is NaN values. You can check for them with the help of
main.describe()
I hope that helped :-)
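For reference, a tiny sketch of what those built-ins report on a frame shaped like main_df (toy data, not the poster's):
import pandas as pd

# Mixed content in a column forces dtype 'object'
main = pd.DataFrame({
    "Name": ["Tom", "nick", "444"],
    "Age": ["20", "21", "AKL"],
})

main.info()                             # dtypes and non-null counts per column
print(main.describe(include="all"))     # counts, unique values and top value per column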

Aggregations for Timedelta values in the Python DataFrame

I have big DataFrame (df) which looks like:
Acc_num date_diff
0 29 0:04:43
1 29 0:01:43
2 29 2:22:45
3 29 0:16:21
4 29 0:58:20
5 30 0:00:35
6 34 7:15:26
7 34 4:40:01
8 34 0:56:02
9 34 6:53:44
10 34 1:36:58
.....
Acc_num int64
date_diff timedelta64[ns]
dtype: object
I need to calculate 'date_diff' mean (in timedelta format) for each account number.
df.date_diff.mean() works correctly, but when I try the following:
df.groupby('Acc_num').date_diff.mean()
it raises an exception:
"DataError: No numeric types to aggregate"
I also tried the df.pivot_table() method, but didn't achieve anything.
Could someone help me with this stuff. Thank you in advance!
Weird limitation indeed. But a simple solution would be:
df.groupby('Acc_num').date_diff.agg(lambda g:g.sum()/g.count())
Edit:
Pandas will actually attempt to aggregate non-numeric columns if you pass numeric_only=False
df.groupby('Acc_num').date_diff.mean(numeric_only=False)
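A minimal reproduction of both approaches, on toy data with the column names from the question:
import pandas as pd

df = pd.DataFrame({
    "Acc_num": [29, 29, 30],
    "date_diff": pd.to_timedelta(["0:04:43", "0:01:43", "0:00:35"]),
})

# Workaround: compute the mean per group by hand
manual = df.groupby("Acc_num").date_diff.agg(lambda g: g.sum() / g.count())

# On versions where the limitation applies, numeric_only=False lets mean() through
direct = df.groupby("Acc_num").date_diff.mean(numeric_only=False)

print(manual)
print(direct)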

Searching one Python dataframe / dictionary for fuzzy matches in another dataframe

I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):
df1:
PRODUCT_ID PRODUCT_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce"
1 185965653252 "Chicken Salad with Dressing"
2 165958565556 "Pork and Honey Rissoles"
3 655262522233 "Cheese, Ham and Tomato Sandwich"
4 857485966653 "Coleslaw with Yoghurt Dressing"
5 524156285551 "Lemon and Raspberry Cheesecake"
I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:
df2 (also saved as dict_2)
PROD_ID PROD_DESCRIPTION
0 548576 "Fish Burger"
1 156956 "Chckn Salad w/Ranch Dressing"
2 257848 "Rissoles - Lamb & Rosemary"
3 298770 "Lemn C-cake"
4 651452 "Potato Salad with Bacon"
5 100256 "Cheese Cake - Lemon Raspberry Coulis"
What I want to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to help with the heavy lifting. I would then need to check the matches manually, but it would be a lot quicker. The ideal outcome would look like this, e.g. with one or more partial matches noted:
PRODUCT_ID PRODUCT_DESCRIPTION PROD_ID PROD_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce" 548576 "Fish Burger"
1 185965653252 "Chicken Salad with Dressing" 156956 "Chckn Salad w/Ranch Dressing"
2 165958565556 "Pork and Honey Rissoles" 257848 "Rissoles - Lamb & Rosemary"
3 655262522233 "Cheese, Ham and Tomato Sandwich" NaN NaN
4 857485966653 "Coleslaw with Yoghurt Dressing" NaN NaN
5 524156285551 "Lemon and Raspberry Cheesecake" 298770 "Lemn C-cake"
6 524156285551 "Lemon and Raspberry Cheesecake" 100256 "Cheese Cake - Lemon Raspberry Coulis"
I have already completed a join which has identified the exact matches. It's not important that the index is retained as the Product ID's in each df are unique. The results can also be saved into a new dataframe as this will then be applied to a third dataframe that has around 14 million rows.
I've used the following questions and answers (amongst others):
Is it possible to do fuzzy match merge with python pandas
Fuzzy merge match with duplicates including trying jellyfish module as suggested in one of the answers
Python fuzzy matching fuzzywuzzy keep only the best match
Fuzzy match items in a column of an array
and also various loops/functions/mapping etc. but have had no success, either getting the first "fuzzy match" which has a low score or no matches being detected.
I like the idea of a matching/distance score column being generated as per here as it would then allow me to speed up the manual checking process.
I'm using Python 2.7, pandas and have fuzzywuzzy installed.
Using fuzz.ratio as my distance metric, I calculate my distance matrix like this:
df3 = pd.DataFrame(index=df.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
        vj = df2.get_value(j, 'PROD_DESCRIPTION')
        df3.set_value(i, j, fuzz.ratio(vi, vj))

print(df3)
0 1 2 3 4 5
0 63 15 24 23 34 27
1 26 84 19 21 52 32
2 18 31 33 12 35 34
3 10 31 35 10 41 42
4 29 52 32 10 42 12
5 15 28 21 49 8 55
Set a threshold for acceptable distance; I set it to 50. Then find the index value (from df2) that has the maximum value in every row.
threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)
Make assignments
df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df
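As a quick usage check after those assignments (a sketch; column names as in the question's desired output):
# Matched rows next to the original descriptions; unmatched rows stay NaN
result = df[['PRODUCT_ID', 'PRODUCT_DESCRIPTION', 'PROD_ID', 'PROD_DESCRIPTION']]
print(result[result['PROD_ID'].notna()])
print(result[result['PROD_ID'].isna()])    # these still need manual checking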
You should be able to iterate over both dataframes and populate either a dict or a 3rd dataframe with your desired information:
d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}

for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...

df3 = pd.DataFrame.from_dict(d)
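Filled in under assumptions (appending to the lists, fuzz.ratio as the similarity score, and a hypothetical cut-off of 50), the inner loop could look like this:
from fuzzywuzzy import fuzz

for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        sim = fuzz.ratio(df1_row['PRODUCT_DESCRIPTION'], df2_row['PROD_DESCRIPTION'])
        if sim > 50:                                   # keep only plausible candidates
            d['df1_id'].append(df1_row['PRODUCT_ID'])
            d['df1_prod_desc'].append(df1_row['PRODUCT_DESCRIPTION'])
            d['df2_id'].append(df2_row['PROD_ID'])
            d['df2_prod_desc'].append(df2_row['PROD_DESCRIPTION'])
            d['fuzzywuzzy_sim'].append(sim)

df3 = pd.DataFrame.from_dict(d)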
I don't have enough reputation to comment on the answer from @piRSquared, hence this answer.
The definitions of 'vi' and 'vj' failed with an error (AttributeError: 'DataFrame' object has no attribute 'get_value'). They worked when I inserted an underscore, e.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION').
The same issue came up for 'set_value', and the same fix worked there too, e.g. df3._set_value(i, j, fuzz.ratio(vi, vj)).
Generating idxmax raised another error (TypeError: reduction operation 'argmax' not allowed for this dtype), because the contents of df3 (the fuzzy ratios) were of type 'object'. I converted them all to numeric just before defining the threshold and it worked, e.g. df3 = df3.apply(pd.to_numeric).
A million thanks to @piRSquared for the solution. For a Python novice like me, it worked like a charm. I am posting this answer to make it easy for other newbies like me.
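For current pandas versions, a hedged rewrite of the distance-matrix loop that uses the public .at accessor instead of the private _get_value/_set_value:
from fuzzywuzzy import fuzz
import pandas as pd

# Same distance matrix as above, built with .at
df3 = pd.DataFrame(index=df.index, columns=df2.index, dtype=float)

for i in df3.index:
    for j in df3.columns:
        df3.at[i, j] = fuzz.ratio(df.at[i, 'PRODUCT_DESCRIPTION'],
                                  df2.at[j, 'PROD_DESCRIPTION'])

threshold = df3.max(1) > 50    # df3 is already float, so no pd.to_numeric step is needed
idxmax = df3.idxmax(1)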
