Find value that is a subset to a row in Pandas dataframe - python

This is actually a follow-up solution / question to one of my other questions: Python Pandas compare two dataframes to assign country to phone number
We have two data frames:
df1 = pd.DataFrame({"TEL": ["49123410", "49123411","49123412","49123413","49123414","49123710", "49123810"]})
df2 = pd.DataFrame({"BASE_NR": ["491234","491237","491238"],"NAME": ["A","B","C"]})
What I want to do is to assign one of the df2 NAME values to each df1 TEL. If we take the first value "491234", we see that the first five entries in df1 start with exactly this string. This should result in something like this:
| | TEL | PREFIX |
| --- | -------- | ------ |
| 0 | 49123410 | 491234 |
| 1 | 49123411 | 491234 |
| 2 | 49123412 | 491234 |
| 3 | 49123413 | 491234 |
| 4 | 49123414 | 491234 |
| 5 | 49123710 | 491237 |
| 6 | 49123810 | 491238 |
Unlike in Python Pandas compare two dataframes to assign country to phone number,
I developed another approach that works much faster:
for i, s in df2.iterrows():
    df1.loc[df1["TEL"].str.startswith(s[0], na=False), "PREFIX"] = s[0]
So far, it has worked perfectly and I have been using it over and over again, as I have to match many different sources of phone numbers and their subsets. But lately, I am experiencing more and more issues: the PREFIX column gets created but stays empty. No matches are found any longer, where I had about 150,000 before.
Is there something fundamental that I am missing, and was it only luck that it worked this way? Input files (I am reading them in from a CSV) and data types have not changed. I also have not changed the Pandas version (22).
PS: What would also be helpful is an idea of how to debug what happens in this part:
df1.loc[df1["TEL"].str.startswith(s[0], na=False), "PREFIX"] = s[0]

Well, if it is speed you are after, this should be faster:
mapping = dict(zip(df2['BASE_NR'].tolist(), df2['NAME'].tolist()))

def getName(tel):
    # Return the first (prefix, name) pair whose prefix starts the number.
    for k, v in mapping.items():
        if tel.startswith(k):
            return k, v
    return '', ''

df1['BASE_NR'], df1['NAME'] = zip(*df1['TEL'].apply(getName))
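As for debugging the startswith line from the question: a common cause of a silently empty PREFIX column (an assumption here, since the input files aren't shown) is numeric values sneaking into the TEL column when the CSV is read. .str.startswith returns NaN for non-string values, and na=False quietly turns every NaN into "no match". A minimal checking sketch:
print(df1["TEL"].dtype)                      # 'object' dtype can still hide ints
print(df1["TEL"].map(type).value_counts())   # how many rows are actually str

mask = df1["TEL"].str.startswith("491234", na=False)
print(mask.sum())                            # rows that would get this prefix

# Forcing everything to string usually restores the old behaviour:
df1["TEL"] = df1["TEL"].astype(str)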

Related

delete duplicates between two rows Tableau

How can I delete duplicates between two values and keep only the first value in Tableau, for each user ID?
For example, for a certain user:
| status | date |
| -------- | -------------- |
| success | 1/1/2022 |
| fail | 1/2/2022 |
| fail | 1/3/2022 |
| fail | 1/4/2022 |
| success | 1/5/2022 |
I want the results to be:
| status | date |
| -------- | -------------- |
| success | 1/1/2022 |
| fail | 1/2/2022 |
| success | 1/5/2022 |
In Python it would be like this:
edited_data = []
for key in d:
    dup = [True]  # always keep the first row
    total_len = len(d[key].index)
    for i in range(1, total_len):
        # Keep a row only if its status differs from the previous row's.
        if d[key].iloc[i]['status'] == d[key].iloc[i - 1]['status']:
            dup.append(False)
        else:
            dup.append(True)
    edited_data.append(d[key][dup])
One way you could do this is with the LOOKUP() function. Since this particular problem requires each row to know what came before it, it will be important to make sure your dates are sorted correctly and that the table calculation is computed correctly. Something like this should work:
IF LOOKUP(MIN([Status]),-1) = MIN([Status]) THEN "Hide" ELSE "Show" END
And then simply hide or exclude the "Hide" rows.
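For reference, the Python loop in the question collapses to a single vectorized comparison with shift(); a minimal sketch for one user, assuming the column names from the example:
import pandas as pd

# Keep a row only when its status differs from the row directly above it;
# the first row always survives because shift() yields NaN there.
df = pd.DataFrame({
    "status": ["success", "fail", "fail", "fail", "success"],
    "date": ["1/1/2022", "1/2/2022", "1/3/2022", "1/4/2022", "1/5/2022"],
})
deduped = df[df["status"] != df["status"].shift()]
print(deduped)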

Efficient way to write Pandas groupby codes by eliminating repetition

I have a DataFrame as below.
df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'City': ['C 1', 'C 1', 'C 1', 'B 2', 'B 2', 'B 2', 'C 1', 'C 1', 'C 1'],
    'Date': ['7/1/2020', '7/2/2020', '7/3/2020', '7/1/2020', '7/2/2020', '7/3/2020', '7/1/2020', '7/2/2020', '7/3/2020'],
    'Value': [46, 90, 23, 84, 89, 98, 31, 84, 41]
})
I need to create two averages:
Firstly, with both Country and City as the criteria.
Secondly, an average for the Country only.
In order to achieve this, we can easily write the code below:
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A | B 2 | 90.33 |
| +------+-------+
| | C 1 | 53 |
+---------+------+-------+
| B | C 1 | 52 |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A | 71.67 |
+---------+-------+
| B | 52 |
+---------+-------+
The only change between the above two snippets is the groupby criterion City; apart from that, everything is the same, so there's a clear repetition/duplication of code (especially when it comes to complex scenarios).
Now my question is: is there any way we could write one piece of code that incorporates both scenarios at once? DRY - Don't Repeat Yourself.
What I have in mind is something like below:
Choice = 'City'  # <-- here I type either 'City' or None based on the requirement; if None, the code below should ignore that criterion
df.groupby(['Country', Choice]).agg('mean')
Is this possible? Or what is the best way to write the above code efficiently, without repetition?
I am not sure what you want to accomplish, but why not just use an if?
columns = ['Country']
if Choice:
    columns.append(Choice)
df.groupby(columns).agg('mean')
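A minimal sketch of the same idea wrapped in a helper, so each call site stays a one-liner (the function name and the extra parameter are made up for illustration):
def grouped_mean(df, extra=None):
    # Group by Country plus an optional second key.
    keys = ['Country'] + ([extra] if extra else [])
    return df.groupby(keys)['Value'].mean()

print(grouped_mean(df, 'City'))  # Country + City averages
print(grouped_mean(df))          # Country-only averages
Selecting the Value column explicitly also keeps the non-numeric Date column out of the mean.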

Using pandas module to assign an excel cell value to a variable [Python 3.8, Linux]

Foreword:
I'd prefer to avoid lengthy processes if possible. As a beginner, one line with lots of syntax is a little overwhelming; if I need to use something similar, please give a basic note on what it does. It's not vital that I know, it just takes the edge off. Please point out where I'm using inefficient code and suggest better functions and/or modules; as I said, I have little knowledge of Python.
Situation:
I'm a newbie to pandas, but I've taken the time to play around with x.iloc[y, x] and x.loc[y, x] (where x is pd.read_excel('/my/file/name.xlsx', sheet_name='sheet1')) and, at least for the given formats, I understand what makes them tick. I know that these are going to be useful for my macros. I'm on Linux, so VBA isn't an easy option, and PyUNO for LibreOffice is a project I'm putting off for a while. I'm expecting that the above functions aren't the best way to select an Excel cell from Python.
What I've found:
Too much. For a beginner like me, most of the tutorials are very complex with little explanation; I can make the code there work, I just have no clue why it works that way. I've mostly found information relating to standard 'in-house' Python databases, and it seems that the Excel-related articles are few and far between; the ones I've read unfortunately relate to more advanced functions. I could probably learn them, but I'm not currently interested.
The issue:
Let's take a look at this code I wrote earlier, with a little help from pythonbasics.org:
import pandas as pd
import xlrd  # not sure if this is needed; the Thonny assistant says it's not, the website says it is
df = pd.read_excel('/home/myname/Desktop/sheetname.xlsx', sheet_name='sheet1')
p = df.loc[5, 5]
p = str(p)  # unsure if this does anything; I haven't got a write to A.txt either way
path = "/home/myname/Desktop/A.txt"
text_file = open(path, "w")
text_file.write('%s' % p)
text_file.close
Let's get rid of the mess.
First, I read sheetname.xlsx and assign it to df
df = pd.read_excel('/home/myname/Desktop/sheetname.xlsx', sheet_name='sheet1')
Now I try reading cell F6; let's keep the string conversion for p:
p = df.loc[5, 5]
p = str(p)
Now that we've got p, let's open up a text file on my desktop:
path = "/home/myname/Desktop/A.txt"
text_file = open(path, "w")
All that's left is to 'paste' p into the text file. We opened it with 'w', so we can write over the file; p is a string, so we write with ('%s' % p):
text_file.write('%s' % p)
text_file.close
Now we should have the value of F6 (let's say it's "hello") in A.txt! Let's see:
A.txt:
..Oh
What I know:
All the write stuff works in a second program I have; the only difference is that p is replaced by another string variable, so I would guess that isn't the issue. However, when I call print(p) after converting with p = str(p), it gives me what I want, with the headers in place. I would like to remove the headers, but that's for later.
My question:
Given spreadsheet 'sheetname.xlsx' and worksheet 'sheet1', using pandas (or a better module for spreadsheet work, if there is one), how can I assign the value of cell F6 (or any cell; switching up my selection is easy) to the variable p?
Solution to your problem:
You're going to be livid at how silly the fix is: you forgot to put () after text_file.close, so you're not executing the .close() function. It doesn't throw a runtime error because that line simply evaluates the .close method object and discards it, then moves on to the following lines of code.
Please try this:
path = "/home/myname/Desktop/A.txt"
text_file = open(path, "w")
text_file.write('%s' % p)
text_file.close()
Additional:
For the functionality you're using, you must have the xlrd module installed in your environment, but it doesn't need to be imported.
If you want to use integer indices freely on both dimensions with df.loc[], I suggest you pass the argument header=None to pd.read_excel().
Excel File:
+---+----+----+----+
|   | A  | B  | C  |
+---+----+----+----+
| 1 | hi | hi | hi |
+---+----+----+----+
| 2 | hi | hi | hi |
+---+----+----+----+
| 3 | hi | hi | hi |
+---+----+----+----+
With automatic headers (these are strings; pandas mangles the duplicate names to hi, hi.1, hi.2, so you must do something like df.loc[0, "hi.1"]):
import pandas as pd
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1')
df.head()
Output:
+---+----+------+------+
|   | hi | hi.1 | hi.2 |
+---+----+------+------+
| 0 | hi | hi   | hi   |
+---+----+------+------+
| 1 | hi | hi   | hi   |
+---+----+------+------+
Without headers (these are integers; you can safely do df.loc[0, 2]):
import pandas as pd
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', header=None)
df.head()
Output:
+---+----+----+----+
|   | 0  | 1  | 2  |
+---+----+----+----+
| 0 | hi | hi | hi |
+---+----+----+----+
| 1 | hi | hi | hi |
+---+----+----+----+
| 2 | hi | hi | hi |
+---+----+----+----+
The reason I tried to show that difference is this: when you said "when I call print(p) after converting p = str(p) it gives me what I want, with the headers in place", I am worried about what you mean by 'with headers'. It is supposed to be a plain string. If you're getting the headers in this pattern, then maybe you're reading the first row only.
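Putting the pieces together, a minimal sketch of what the question asks for, assuming the path and sheet name from the question, with header=None so cell F6 is simply row 5, column 5 (zero-indexed):
import pandas as pd

# header=None keeps both axes as plain integer positions.
df = pd.read_excel('/home/myname/Desktop/sheetname.xlsx',
                   sheet_name='sheet1', header=None)
p = str(df.iloc[5, 5])  # .iloc selects purely by position, no labels involved

# The with-block closes the file for you, sidestepping the .close bug entirely.
with open('/home/myname/Desktop/A.txt', 'w') as text_file:
    text_file.write(p)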
You can do it without pandas, using the module xlrd:
import xlrd

workbook = xlrd.open_workbook('sheetname.xlsx')
worksheet = workbook.sheet_by_name('sheet1')
# Read a specific cell's value and store it in a variable;
# row and column are zero-indexed, as Python does, so cell 'A1' is (0, 0).
value = worksheet.cell_value(row, column)

Creating new column from API lookup using groupby

I have a dataframe of weather data that looks like this:
+----+------------+----------+-----------+
| ID | Station_ID | Latitude | Longitude |
+----+------------+----------+-----------+
| 0 | 6010400 | 52.93 | -82.43 |
| 1 | 6010400 | 52.93 | -82.43 |
| 2 | 6010400 | 52.93 | -82.43 |
| 3 | 616I001 | 45.07 | -77.88 |
| 4 | 616I001 | 45.07 | -77.88 |
| 5 | 616I001 | 45.07 | -77.88 |
+----+------------+----------+-----------+
I want to create a new column called postal_code using an API lookup based on the latitude and longitude values. I cannot perform a lookup for each row in the dataframe, as that would be inefficient: there are over 500,000 rows but only 186 unique Station_IDs. It's also infeasible due to rate limiting on the API I need to use.
I believe I need to perform a groupby transform but can't quite figure out how to get it to work correctly.
Any help with this would be greatly appreciated.
I believe you can use groupby only for aggregations, which is not what you want.
First, combine 'Latitude' and 'Longitude'. This gives a new column of tuples:
df['coordinates'] = list(zip(df['Latitude'],df['Longitude']))
Then you can use this 'coordinates' column to collect all unique (Latitude, Longitude) pairs with a set, so there are no duplicates:
unique_coords = set(df['coordinates'])
Then fetch the postal codes of these coordinates using API calls, as you said, and store them as a dict.
Then you can use this dict to populate the postal code for each row:
postal_code_dict = {'key': 'value'}  # sample dictionary
df['postal_code'] = df['coordinates'].apply(lambda x: postal_code_dict[x])
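For completeness, a sketch of the whole flow; fetch_postal_code is a hypothetical stand-in for the rate-limited API call:
def fetch_postal_code(lat, lon):
    # Hypothetical placeholder: call the geocoding API here.
    raise NotImplementedError

# One API call per unique coordinate pair (a handful instead of 500,000 rows).
postal_code_dict = {c: fetch_postal_code(*c) for c in unique_coords}

# Map the results back onto every row.
df['postal_code'] = df['coordinates'].map(postal_code_dict)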
Hope this helps.

Combining two dataframes in Pandas Python [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I would like to combine two dataframes.
I would like to combine both dataframes in such a way that the accts line up. For example, acct 10 should have values in CME and NISSAN while the rest are zeros.
I think you can use df.combine_first():
It will update null elements with the value in the same location in the other DataFrame.
df2.combine_first(df1)
Also, you can try:
pd.concat([df1.set_index('acct'), df2.set_index('acct')], axis=1).reset_index()
It looks like what you're trying to do is merge these two DataFrames.
You can use df.merge to merge the two. Since you want to match on the acct column, set the on keyword arg to "acct" and set how to "inner" to keep only those rows that appear in both DataFrames.
For example:
merged = df1.merge(df2, how="inner", on="acct")
Output:
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
| acct | GOODM | KIS | NISSAN | CME | HKEX | OSE | SGX |
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
| 10 | | | 1397464.227495019 | 1728005.0 | 0.0 | | |
| 30 | 30569.300965712766 | 4299649.75104102 | | 6237.0 | | | |
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
If you want to fill empty values with zeroes, you can use df.fillna(0).
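Chained together (a small sketch, assuming zeros rather than NaN are wanted in the gaps):
merged = df1.merge(df2, how="inner", on="acct").fillna(0)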
