I have two columns in my CSV: Address of New Home and Can (short for Cancelled). When an address is cancelled, True has to be written under Can, but sometimes the end user forgets to write True and the same address appears twice. I want Python to tell me (not remove) the addresses that appear twice without the first occurrence being cancelled.
Example:
Date_Booked Address of New Home Can
01/07/2017 1234 SO Drive True
02/14/2017 4321 Python Court
03/17/2017 1234 SO Drive
03/23/2017 4321 Python Court
As you can see from the example above, 1234 SO Drive was cancelled and True was written, which is what we want. 4321 Python Court was also cancelled (that is why it was written twice), but since True was never written under Can, the address shows up twice in our CSV and causes all sorts of issues.
import pandas as pd
first = pd.read_csv('Z:PCR.csv')
df = pd.DataFrame(first)
non_cancelled = df['Can'].apply(lambda x: x != 'True')
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
if not dup_addresses.empty:
    raise Exception('Same address written twice without cancellation')
I am getting the following error:
Traceback (most recent call last):
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8543)
TypeError: an integer is required
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
KeyError: 'Address of New Home'
Any assistance would be greatly appreciated.
This should update your Can column by keeping the True values that are already there and filling in the ones that were missed.
can = df.duplicated(subset=['Address of New Home'], keep='last')
df['Can'] = df.Can.combine_first(can.where(can, ''))
print(df)
Date_Booked Address of New Home Can
0 01/07/2017 1234 SO Drive True
1 02/14/2017 4321 Python Court True
2 03/17/2017 1234 SO Drive
3 03/23/2017 4321 Python Court
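If it helps to see the intermediate step: with keep='last', duplicated marks every occurrence of an address except the last one, so on the sample data the can mask looks like this (assuming the blank Can cells are read in as NaN, which is why combine_first can fill them):
print(can)
0     True
1     True
2    False
3    False
dtype: bool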
Per request
import numpy as np

can = df.duplicated(subset=['Address of New Home'], keep='last')
df['Can'] = df.Can.combine_first(pd.Series(np.where(can, 'Missed', ''), df.index))
print(df)
Date_Booked Address of New Home Can
0 01/07/2017 1234 SO Drive True
1 02/14/2017 4321 Python Court Missed
2 03/17/2017 1234 SO Drive
3 03/23/2017 4321 Python Court
Your column is Address_of_New_Home, not Address of New Home. You just forgot the underscores.
The problem is in this statement:
non_cancelled = df['Can'].apply(lambda x: x != 'True')
When you use apply here, you are applying the lambda to the Series df['Can'], so the method returns a Series, not the full DataFrame. To illustrate, here is some code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.arange(0,5), 'b': np.arange(5,10), 'c': np.arange(10,15)})
print(df)
The output is this
a b c
0 0 5 10
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
But when I do this:
a = df['a'].apply(lambda x: x*20)
print(a)
I get:
0 0
1 20
2 40
3 60
4 80
To do what you would like to do, try doing this instead:
non_cancelled = df[df['Can'] != True]
This gives us all rows in the DataFrame where the condition (df['Can'] != True) evaluates to True.
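Putting it together for the original goal (report, don't remove), a minimal sketch might look like the following. It assumes the blank Can cells are read in as NaN (if they come through as empty strings, adjust the mask accordingly) and reuses the file path and column names from the question:
import pandas as pd

df = pd.read_csv('Z:PCR.csv')

# rows that were never marked as cancelled
non_cancelled = df[df['Can'].isna()]

# addresses that appear more than once without a cancellation
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len(x) > 1)

if not dup_addresses.empty:
    print('Same address written twice without cancellation:')
    print(dup_addresses['Address of New Home'].drop_duplicates().tolist())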
Related
Using the following logic, I am reading multiple PDF files that have certain highlighted portions (presume that these are tables).
After pushing them to a list, I am saving them to a dataframe.
Here's the code:
import glob

import fitz  # PyMuPDF
import pandas as pd

try:
    filepath = [file for file in glob.glob("Folder/*.pdf")]
    for file in filepath:
        doc = fitz.open(file)
        print(file)
        highlights = []
        for page in doc:
            highlights += handle_page(page)
        #print(highlights)
        highlights_alt = highlights[0].split(',')
        df = pd.DataFrame(highlights_alt, columns=['Security Name'])
        #print(df.columns.tolist())
        df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
        df.drop_duplicates(keep='first', inplace=True)
        print(df.head())
        print(df.shape)
except IndexError:
    print('file {} is not highlighted'.format(file))
Using this logic I get the dataframes; however, if the folder has 5 PDFs, then this logic creates 5 different dataframes, something like this:
Folder\z.pdf
Security Name Weights
0 UTILITIES (5.96
1 %*) None
(2, 2)
Folder\y.pdf
Security Name Weights
0 Quantity/ Market Value % of Net Investments Cu... 1.125
1 % 01
2 /07 None
3 /2027 None
4 EUR 230
(192, 2)
Folder\x.pdf
Security Name Weights
0 Holding £740
1 000 None
2 Leeds Building Society 3.75
3 % variable 25
4 /4 None
(526, 2)
However, I want a single dataframe with all of the above records in it, with a shape of (720, 2), something like:
Security Name Weights
0 Holding £740
1 000 None
2 Leeds Building Society 3.75
3 % variable 25
4 /4 None
.
.
720 xyz 3.33
(720, 2)
I tried using pandas' concat and append but have been unsuccessful so far. Please let me know an efficient way of doing this, since in the future there will be more than 1000 PDFs.
Please help!
A quick way is to use pd.concat:
big_df = pd.concat(list_of_dfs, axis=0)
If this creates an error it would be helpful to know what the error is.
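Applied to the loop from the question, that means collecting each per-file frame in a list and concatenating once at the end. A sketch under those assumptions (it reuses the question's handle_page helper, drops the try/except for brevity, and uses ignore_index=True so the combined frame gets a clean 0..n-1 index):
import glob

import fitz  # PyMuPDF
import pandas as pd

all_dfs = []
for file in glob.glob("Folder/*.pdf"):
    doc = fitz.open(file)
    highlights = []
    for page in doc:
        highlights += handle_page(page)   # the question's own helper
    highlights_alt = highlights[0].split(',')
    df = pd.DataFrame(highlights_alt, columns=['Security Name'])
    df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
    df.drop_duplicates(keep='first', inplace=True)
    all_dfs.append(df)

big_df = pd.concat(all_dfs, axis=0, ignore_index=True)
print(big_df.shape)   # e.g. (720, 2) for the three files shown above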
I'm working on an ML project to predict answer times on Stack Overflow based on tags. Sample data:
Unnamed: 0 qid i qs qt tags qvc qac aid j as at
0 1 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563372 67183.0 2 1235000501
1 2 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563374 66554.0 0 1235000551
2 3 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563358 15842.0 3 1235000177
3 4 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563413 893.0 18 1235001545
4 5 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563454 11649.0 4 1235002457
I'm stuck at the data cleaning process. I intend to create a new column named 'time_taken' which stores the difference between the at and qt columns.
Code:
import pandas as pd
import numpy as np
df = pd.read_csv("answers.csv")
df['time_taken'] = 0
print(type(df.time_taken))
for i in range(0, 263541):
    val = df.qt[i]
    qtval = val.item()
    val = df.at[i]
    atval = val.item()
    df.time_taken[i] = qtval - atval
I'm getting this error:
Traceback (most recent call last):
File "<ipython-input-39-9384be9e5531>", line 1, in <module>
val = df.at[0]
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2080, in __getitem__
return super().__getitem__(key)
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2027, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() missing 1 required positional argument: 'col'
The problem here lies in the indexing with df.at.
The types of df.qt and df.at are
<class 'pandas.core.series.Series'> and <class 'pandas.core.indexing._AtIndexer'>, respectively.
I'm an absolute beginner in data science and do not have enough experience with pandas and numpy.
There is, to put it mildly, an easier way to do this.
df['time_taken'] = df['at'] - df.qt
The AtIndexer issue comes up because .at is a built-in pandas indexer (for fast scalar access), so df.at refers to that indexer rather than your at column. For this reason, avoid giving columns names that are the same as existing Python/pandas attributes or methods. You can get around it by indexing with df['at'] instead of df.at.
Besides that, this operation — if I'm understanding it — can be done with one short line vs. a long for loop.
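As a side note, the qt and at values in the sample look like Unix timestamps in seconds; if that is the case (an assumption on my part), you can turn the difference into a proper timedelta:
df['time_taken'] = pd.to_datetime(df['at'], unit='s') - pd.to_datetime(df['qt'], unit='s')
# for the first sample row: answer 563372 came 420 seconds (7 minutes) after question 563355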
I'm having trouble creating a new column based on the columns 'language_1' and 'language_2' in a pandas DataFrame. I want to create a 'bilingual' column where a 1 represents a user who speaks both English and Spanish (bilingual) and a 0 represents a non-bilingual speaker. Ultimately I want to compare their average ratings to each other, but I want to categorize them first. I tried using if statements, but I'm not sure how to write an if statement that combines multiple conditions to produce a single value. Thank you for any help.
===============================================================================================
name language_1 language_2 rating bilingual
Kevin English Null 4.25
Miguel English Spanish 4.56
Carlos English Spanish 4.61
Aaron Null Spanish 4.33
===============================================================================================
Here is the code I've tried to use to append the new column to my dataframe.
def label_bilingual(row):
    if row('language_english') == row['English'] and row('language_spanish') == 'Spanish':
        val = 1
    else:
        val = 0

df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
Here is the error I'm getting.
----> 1 df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
'Series' object is not callable
You have a few issues with your function: one is causing your error, and a few more will cause problems after that.
1 - You have tried to call the row with parentheses, as in row('language_english'), but that object is not callable. The same thing happens if you try to call a DataFrame:
df('row')
Traceback (most recent call last):
File "<pyshell#30>", line 1, in <module>
df('row')
TypeError: 'DataFrame' object is not callable
2 - You have tried to compare row['column'] to row['English'] which will not work, as a column named English does not exist
KeyError: 'English'
3 - You do not return any values
val = 1
val = 0
You need to modify your function as below to resolve these errors.
def label_bilingual(row):
    if row['language_1'] == 'English' and row['language_2'] == 'Spanish':
        return 1
    else:
        return 0
Output
>>> df['bilingual'] = df.apply(label_bilingual, axis=1)
>>> df
name language_1 language_2 rating bilingual
0 Kevin English Null 4.25 0
1 Miguel English Spanish 4.56 1
2 Carlos English Spanish 4.61 1
3 Aaron Null Spanish 4.33 0
To make it simpler, I'd suggest recording missing values in either column as numpy.nan. For example, if missing values were recorded as np.nan:
bilingual = np.where(df[['language_1', 'language_2']].isna().any(axis=1), 0, 1)
df['bilingual'] = bilingual
Here np.where evaluates the condition, which checks whether the value in either of the language columns is missing. If it is, the person is not bilingual and gets a 0, otherwise a 1.
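A quick usage sketch on the sample data (assuming the Null cells really are np.nan rather than the string 'Null'):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Kevin', 'Miguel', 'Carlos', 'Aaron'],
    'language_1': ['English', 'English', 'English', np.nan],
    'language_2': [np.nan, 'Spanish', 'Spanish', 'Spanish'],
    'rating': [4.25, 4.56, 4.61, 4.33],
})
df['bilingual'] = np.where(df[['language_1', 'language_2']].isna().any(axis=1), 0, 1)
print(df)
#      name language_1 language_2  rating  bilingual
# 0   Kevin    English        NaN    4.25          0
# 1  Miguel    English    Spanish    4.56          1
# 2  Carlos    English    Spanish    4.61          1
# 3   Aaron        NaN    Spanish    4.33          0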
I have the following code, and a problem with the new_df["SUM"] line:
import pandas as pd
df = pd.read_excel(r"D:\Tesina\Proteoma Humano\Tablas\uno - copia.xlsx")
#df = pd.DataFrame({'ID': ['C9JLR9','O95391', 'P05114',"P14866"], 'SEQ': ['1..100,182..250,329..417,490..583', '1..100,206..254,493..586', '1..100', "1..100,284..378" ]})
df2 = pd.DataFrame
df["SEQ"] = df["SEQ"].replace("\.\."," ", regex =True)
new_df = df.assign(SEQ=df.SEQ.str.split(',')).explode('SEQ')
for index, row in df.iterrows():
    new_df['delta'] = new_df['SEQ'].map(lambda x: (int(x.split()[1])+1)-int(x.split()[0]) if x.split()[0] != '1' else (int(x.split()[1])+1))

new_df["SUM"] = new_df.groupby(["ID"]).sum().reset_index(drop=True) #Here's the error, even though I can't see where
df2 = new_df.groupby(["ID","SUM"], sort=False)["SEQ"].apply((lambda x: ','.join(x.astype(str)))).reset_index(name="SEQ")
To give some context, what it does is the following: it grabs every line with the same ID, separates the numbers with a "," in between, does some math with those numbers (that's where the "delta" line gets involved; I know it's not really a delta), and finally sums up all the "delta" values for each ID, grouping them by their original ID so I keep the same number of rows.
And when I use a sample of the data (the one that's commented out at the beginning), it works perfectly, giving me the output that I want:
ID SUM SEQ
0 C9JLR9 353 1 100,182 250,329 417,490 583
1 O95391 244 1 100,206 254,493 586
2 P05114 101 1 100
3 P14866 196 1 100,284 378
But when I apply it to my Excel file (which has 10471 rows), the groupby.sum() line doesn't work as it's supposed to (I've already checked everything else; I know the error is within that line).
This is the output that I receive:
ID SUM SEQ
0 C9JLR9 39 1 100,182 250,329 417,490 583
1 O95391 20 1 100,206 254,493 586
2 P05114 33 1 100
4 P98177 21 1 100,176 246
You can clearly see that the SUM values differ (and are not correct at all). I also haven't been able to figure out where those numbers come from. It's really weird.
If anyone is interested, the solution was provided in the comments: I had to change the line to the following:
new_df["SUM"] = new_df.groupby("ID")["delta"].transform("sum")
I'm having a hard time debugging my code, which worked fine when testing on a small subset of the entire data. I could double-check the types to be sure, but the error message is already informative enough: the list I made apparently ended up containing a float. But how?
The last three lines which ran:
diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments = all_treatments.drop(['DIAGNOS','INDATUMA','date'], axis=1)
all_treatments['tobacco'] = tobacco(diagnoses)
The error:
Traceback (most recent call last):
File "treatments2_noiopro.py", line 97, in <module>
all_treatments['tobacco'] = tobacco(diagnoses)
File "treatments2_noiopro.py", line 13, in tobacco
for codes in codes_column]
TypeError: 'float' object is not iterable
FWIW, the function itself is:
def tobacco(codes_column):
    return [any('C30' <= code < 'C40' or
                'F17' <= code < 'F18'
                for code in codes) if codes else False
            for codes in codes_column]
I am using versions pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0 under Linux.
You can use str.split on the series and apply a function to the result:
def tobacco(codes):
    return any(['C30' <= code < 'C40' or 'F17' <= code < 'F18' for code in codes])
data = [('C35 C50'), ('C36'), ('C37'), ('C50 C51'), ('F1 F2'), ('F17'), ('F3 F17'), ('')]
df = pd.DataFrame(data=data, columns=['DIAGNOS'])
df
DIAGNOS
0 C35 C50
1 C36
2 C37
3 C50 C51
4 F1 F2
5 F17
6 F3 F17
7
df.DIAGNOS.str.split(' ').apply(tobacco)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 False
dtype: bool
edit:
Seems like using str.contains is significantly faster than both methods.
tobacco_codes = '|'.join(["C{}".format(i) for i in range(30, 40)] + ["F17"])
data = [('C35 C50'), ('C36'), ('C37'), ('C50 C51'), ('F1 F2'), ('F17'), ('F3 F17'), ('C3')]
df = pd.DataFrame(data=data, columns=['DIAGNOS'])
df.DIAGNOS.str.contains(tobacco_codes)
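To attach the result as a column, mirroring what the original function produced (a quick usage note using the names from the snippet above):
df['tobacco'] = df.DIAGNOS.str.contains(tobacco_codes)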
I guess diagnoses is a generator, and since you drop something in line 2 of your code, this changes the generator. I can't test anything right now, but let me know whether it works when you comment out line 2 of your code.