Python Pandas - Creating a function to replace repetitive DataFrames

I am new to Python and have managed to build out the following code, which produces the desired results in four separate dataframes:
import pandas as pd
x2019 = df.Date.between('2015-06-28','2015-07-04') #Transaction Dates we want to analyze
y2019 = df.First_Purchase_Date.between('2014-01-01','2015-07-04') #customer first purchase dates we want to include in the dataset
TABLE_2019_USA_XX = df.loc[x2019 & y2019 & (df['Region'] == 'USA')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2019_USA_XX['TotalCusts'] = TABLE_2019_USA_XX['New Customer'] + TABLE_2019_USA_XX['Existing Customer']
TABLE_2019_CANADA_XX = df.loc[x2019 & y2019 & (df['Region'] == 'Canada')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2019_CANADA_XX['TotalCusts'] = TABLE_2019_CANADA_XX['New Customer'] + TABLE_2019_CANADA_XX['Existing Customer']
x2018 = df.Date.between('2014-07-23','2014-07-28') #Transaction Dates we want to analyze
y2018 = df.First_Purchase_Date.between('2014-01-01','2014-07-30') #customer first purchase dates we want to include in the dataset
TABLE_2018_USA_XX = df.loc[x2018 & y2018 & (df['Region'] == 'USA')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2018_USA_XX['TotalCusts'] = TABLE_2018_USA_XX['New Customer'] + TABLE_2018_USA_XX['Existing Customer']
TABLE_2018_CANADA_XX = df.loc[x2018 & y2018 & (df['Region'] == 'Canada')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2018_CANADA_XX['TotalCusts'] = TABLE_2018_CANADA_XX['New Customer'] + TABLE_2018_CANADA_XX['Existing Customer']
print(TABLE_2018_USA_XX)
print(TABLE_2019_USA_XX)
print(TABLE_2018_CANADA_XX)
print(TABLE_2019_CANADA_XX)
Output
FPYear New Customer Existing Customer revenue TotalCusts
2014 0 23 134 23
2015 12 32 432 44
FPYear New Customer Existing Customer revenue TotalCusts
2014 432 421 4315 853
2015 3415 452 2341 3867
FPYear New Customer Existing Customer revenue TotalCusts
2014 22 432 4312 454
2015 33 345 3415 378
FPYear New Customer Existing Customer revenue TotalCusts
2014 5 35 4312 40
2015 432 32 6131 464
Based on what I've read and the feedback I got when building this script, I know I should be able to build out the above using a function, but I can't figure out exactly how to do that. Can someone please provide a suggestion to get me started? I'm essentially trying to cut down my script and make it more efficient.

Just define a function and pass the dates and region you use as filters as parameters:
import pandas as pd
def process(df, start_dt, end_dt, purch_start, purch_end, region):
    mask_date = df['Date'].between(start_dt, end_dt)
    mask_purch_date = df['First_Purchase_Date'].between(purch_start, purch_end)
    mask_region = df['Region'] == region
    temp_df = df[mask_date & mask_purch_date & mask_region].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum()
    temp_df['TotalCusts'] = temp_df['New Customer'] + temp_df['Existing Customer']
    return temp_df
TABLE_2019_USA_XX = process(df,'2015-06-28','2015-07-04', '2014-01-01','2015-07-04', 'USA')
TABLE_2019_CANADA_XX = process(df,'2015-06-28','2015-07-04', '2014-01-01','2015-07-04', 'Canada')
TABLE_2018_USA_XX = process(df,'2014-07-23','2014-07-28', '2014-01-01','2014-07-30', 'USA')
TABLE_2018_CANADA_XX = process(df,'2014-07-23','2014-07-28','2014-01-01','2014-07-30', 'Canada')
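If more year/region combinations get added later, one possible extension (just a sketch; the configs and tables names are made up here, not from the original post) is to drive the same process function from a small lookup of date windows:
configs = {
    ('2019', 'USA'):    ('2015-06-28', '2015-07-04', '2014-01-01', '2015-07-04'),
    ('2019', 'Canada'): ('2015-06-28', '2015-07-04', '2014-01-01', '2015-07-04'),
    ('2018', 'USA'):    ('2014-07-23', '2014-07-28', '2014-01-01', '2014-07-30'),
    ('2018', 'Canada'): ('2014-07-23', '2014-07-28', '2014-01-01', '2014-07-30'),
}

# Build every table in one pass instead of one assignment per table.
tables = {
    (year, region): process(df, start, end, p_start, p_end, region)
    for (year, region), (start, end, p_start, p_end) in configs.items()
}

print(tables[('2018', 'USA')])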

IIUC, you have repeating columns in your dataframes and you are doing the same operation over and over?
dfs = [TABLE_2019_CANADA_XX, TABLE_2018_CANADA_XX, TABLE_2018_USA_XX, TABLE_2019_USA_XX]
df = pd.concat(dfs)
df.groupby(['FPYear','Region'])[['New Customer', 'Existing Customer', 'revenue']].sum()

Related

Matching thousands of data takes too much time with Pandas

I receive a report with some values every day, and I have to match postal codes from countries all over the world to get the right region. Then I upload the result into my Django app.
Here's a look at my report:
Order Number   Date         City     Postal code
930276         27/09/2022   Madrid   cp: 28033
929670         27/09/2022   Lisboa   cp: 1600-812
I have thousands of rows like this. The objective is to retrieve the region in ISO 3166-2 format. To help me, I accessed the Geonames page and downloaded all the countries' information (for example "FR.txt", "ES.txt", ...).
Because this is a huge txt file, I chose to store it on an S3 server.
Here is what I tried:
def access_scaleway(region_name, endpoint_url, access_key, secret_key):
    """ Accessing Scaleway Bucket """
    scaleway = boto3.client('s3', region_name=region_name, endpoint_url=endpoint_url, aws_access_key_id=access_key,
                            aws_secret_access_key=secret_key)
    return scaleway

def get_region_code_accessing_scaleway(countries, regions):
    ''' Retrieves the region code from the region name. '''
    list_countries = countries
    list_regions = regions
    list_regions_codes = []
    scaleway_session = access_scaleway(region_name=settings.SCALEWAY_S3_REGION_NAME,
                                       endpoint_url=settings.SCALEWAY_S3_ENDPOINT_URL,
                                       access_key=settings.SCALEWAY_ACCESS_KEY_ID,
                                       secret_key=settings.SCALEWAY_SECRET_ACCESS_KEY)
    for country, region in zip(list_countries, list_regions):
        try:
            obj = scaleway_session.get_object(Bucket=settings.SCALEWAY_STORAGE_BUCKET_NAME, Key=f'countries/{country}.txt')
            df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='\t', header=None)
            df.columns = ['country code', 'postal code', 'place name', 'admin name1', 'admin code1', 'admin name2', 'admin code2', 'admin name3', 'admin code3', 'latitude', 'longitude', 'accuracy']
            df['postal code'] = df['postal code'].astype(str)
            df['postal code'] = df['postal code'].str.zfill(5)
            # Removing all spaces and special characters
            postal_code = re.sub("[^0-9^-]", '', region).strip()
            region_code = country + "-" + df[df['postal code'] == postal_code]['admin code1'].values[0]
            list_regions_codes.append(region_code)
        except AttributeError:
            list_regions_codes.append(None)
        except ValueError:
            list_regions_codes.append(None)
    return list_regions_codes
But it is way too slow. For a simple report of 1,000 rows, it takes about 30 minutes.
My second try was to go with the OpenDataSoft public API. Here is what I tried:
def fetch_data(url, params, headers=None):
    response = requests.get(url=url, params=params, headers=headers)
    return response

def get_region_code_accessing_scaleway(countries, regions):
    ''' Retrieves the region code from the region name. '''
    list_countries = countries
    list_regions = regions
    list_regions_codes = []
    for country, region in zip(list_countries, list_regions):
        try:
            # Get response from API
            postal_code = re.sub("[^0-9^-]", '', region).strip()
            response = fetch_data(
                url="https://data.opendatasoft.com/api/v2/catalog/datasets/geonames-postal-code%40public/records?",
                params="select=country_code%2C%20postal_code%2C%20admin_code1&where=country_code%3D%22" + country + "%22%20and%20postal_code%3D%22" + postal_code + "%22")
            if response.status_code == 200:
                data = response.json()
                if len(data['records']) > 0:
                    list_regions_codes.append(country + "-" + data['records'][0]['record']['fields']['admin_code1'])
                else:
                    list_regions_codes.append(None)
            else:
                print('Error: ' + str(response.status_code))
                list_regions_codes.append(None)
        except (AttributeError, ValueError, KeyError):
            list_regions_codes.append(None)
    return list_regions_codes
But once again, it takes forever to get matching values.
The last thing I tried was pgeocode, but it was also too slow.
I don't understand why it takes so long, because the desired output is just this:
Order Number   Date         City     Postal code    Region code
930276         27/09/2022   Madrid   cp: 28033      ES-MD
929670         27/09/2022   Lisboa   cp: 1600-812   PT-08
Do you have any idea to speed up the process?
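One direction that might speed this up considerably (a sketch only, not a tested solution: the column list follows the Geonames layout above, while report_df, its 'Country' column and load_country_file are hypothetical names) is to read each country file once and resolve all postal codes for that country with a single vectorized merge, instead of one S3 read and one lookup per row:
import re
import pandas as pd

GEONAMES_COLS = ['country code', 'postal code', 'place name', 'admin name1', 'admin code1',
                 'admin name2', 'admin code2', 'admin name3', 'admin code3',
                 'latitude', 'longitude', 'accuracy']

def clean_postal_code(raw):
    # Same cleaning as in the code above: keep only digits and dashes.
    return re.sub("[^0-9^-]", '', raw).strip()

def add_region_codes(report_df, load_country_file):
    # report_df is assumed to have 'Country' and 'Postal code' columns;
    # load_country_file(country) is assumed to return the Geonames table for that
    # country as a DataFrame (read once from S3 or from a local cache).
    report_df = report_df.copy()
    report_df['clean_postal'] = report_df['Postal code'].map(clean_postal_code)
    parts = []
    for country, group in report_df.groupby('Country'):
        geo = load_country_file(country)  # one file read per country, not one per row
        geo.columns = GEONAMES_COLS
        geo['postal code'] = geo['postal code'].astype(str).str.zfill(5)
        geo = geo.drop_duplicates(subset='postal code')  # keep one region per postal code
        merged = group.merge(geo[['postal code', 'admin code1']],
                             left_on='clean_postal', right_on='postal code', how='left')
        merged['Region code'] = country + '-' + merged['admin code1'].astype('string')
        parts.append(merged.drop(columns=['postal code', 'admin code1', 'clean_postal']))
    return pd.concat(parts, ignore_index=True)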

Pandas Dataframe - Conditional Column Creation

I am attempting to create a new column based on conditional logic from another column. I've tried searching and haven't been able to find anything that addresses my issue.
I have imported a CSV into a pandas dataframe; it is structured like this. I edited a few of the descriptions for this post, but other than that everything is the same:
#code used to load dataframe:
df = pd.read_csv(r"C:\filepath\filename.csv")
#output from print(type(df)):
#<class 'pandas.core.frame.DataFrame'>
#output from print(df.columns.values):
#['Type' 'Trans Date' 'Post Date' 'Description' 'Amount']
#output from print(df.columns):
Index(['Type', 'Trans Date', 'Post Date', 'Description', 'Amount'], dtype='object')
#output from print
Type Trans Date Post Date Description Amount
0 Sale 01/25/2018 01/25/2018 DESC1 -13.95
1 Sale 01/25/2018 01/26/2018 AMAZON MKTPLACE PMTS -6.99
2 Sale 01/24/2018 01/25/2018 SUMMIT BISTRO -5.85
3 Sale 01/24/2018 01/25/2018 DESC3 -9.13
4 Sale 01/24/2018 01/26/2018 DYNAMIC VENDING INC -1.60
I then write the following code:
def criteria(row):
    if row.Description.find('SUMMIT BISTRO')>0:
        return 'Lunch'
    elif row.Description.find('AMAZON MKTPLACE PMTS')>0:
        return 'Amazon'
    elif row.Description.find('Aldi')>0:
        return 'Groceries'
    else:
        return 'NotWorking'

df['Category'] = df.apply(criteria, axis=0)
Errors:
Traceback (most recent call last):
File "C:\Users\Test_BankReconcile2.py", line 44, in <module>
df['Category'] = df.apply(criteria, axis=0)
File "C:\Users\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "C:\Users\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "C:\Users\OneDrive\Documents\finance\Test_BankReconcile2.py", line 35, in criteria
if row.Description.find('SUMMIT BISTRO')>0:
File "C:\Users\Anaconda3\lib\site-packages\pandas\core\generic.py", line 3081, in __getattr__
return object.__getattribute__(self, name)
AttributeError: ("'Series' object has no attribute 'Description'", 'occurred at index Type')
I'm able to successfully execute this same sort of command on a very similar csv file from a different bank (this example is from my credit card), so I don't know what is going on. Possibly I need to define the dataframe in some way that I'm not doing? Or is it something very obvious that I'm not seeing? Thank you all in advance for helping me solve this.
Yes, your problem is that you need to pass axis=1 to .apply:
In [52]: df
Out[52]:
Type Trans Date Post Date Description Amount
0 Sale 01/25/2018 01/25/2018 DESC1 -13.95
1 Sale 01/25/2018 01/26/2018 AMAZON MKTPLACE PMTS -6.99
2 Sale 01/24/2018 01/25/2018 SUMMIT BISTRO -5.85
3 Sale 01/24/2018 01/25/2018 DESC3 -9.13
4 Sale 01/24/2018 01/26/2018 DYNAMIC VENDING INC -1.60
In [53]: def criteria(row):
...: if row.Description.find('SUMMIT BISTRO')>0:
...: return 'Lunch'
...: elif row.Description.find('AMAZON MKTPLACE PMTS')>0:
...: return 'Amazon'
...: elif row.Description.find('Aldi')>0:
...: return 'Groceries'
...: else:
...: return 'NotWorking'
...:
In [54]: df.apply(criteria, axis=1)
Out[54]:
0 NotWorking
1 NotWorking
2 NotWorking
3 NotWorking
4 NotWorking
dtype: object
The second problem is a logic error: instead of .find(x) > 0 you want .find(x) >= 0, because find returns 0 when the match starts at the beginning of the string and -1 only when there is no match. Better yet, use some_string in some_other_string.
For a more general solution, omit Description inside the function and instead use df['Description'].apply(criteria) with Series.apply.
Also, to check for a substring in a string, use in:
def criteria(row):
    if 'SUMMIT BISTRO' in row:
        return 'Lunch'
    elif 'AMAZON MKTPLACE PMTS' in row:
        return 'Amazon'
    elif 'Aldi' in row:
        return 'Groceries'
    else:
        return 'NotWorking'

df['Category'] = df['Description'].apply(criteria)
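If you need to do this on a much larger frame, a vectorized alternative (just a sketch, assuming the same descriptions and category labels as above) is numpy.select combined with str.contains:
import numpy as np

conditions = [
    df['Description'].str.contains('SUMMIT BISTRO', na=False),
    df['Description'].str.contains('AMAZON MKTPLACE PMTS', na=False),
    df['Description'].str.contains('Aldi', na=False),
]
choices = ['Lunch', 'Amazon', 'Groceries']
# Rows matching none of the conditions fall back to the default label.
df['Category'] = np.select(conditions, choices, default='NotWorking')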

pandas dataframe apply function over column creating multiple columns

I have the pandas df below, with a few columns, one of which is ip_address.
df.head()
my_id someother_id created_at ip_address state
308074 309115 2859690 2014-09-26 22:55:20 67.000.000.000 rejected
308757 309798 2859690 2014-09-30 04:16:56 173.000.000.000 approved
309576 310619 2859690 2014-10-02 20:13:12 173.000.000.000 approved
310347 311390 2859690 2014-10-05 04:16:01 173.000.000.000 approved
311784 312827 2859690 2014-10-10 06:38:39 69.000.000.000 approved
For each ip_address I'm trying to return the description, city, and country.
I wrote the function below and tried to apply it:
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return [description, city, country]

# ---
suspect['ipwhois'] = suspect['ip_address'].apply(returnIP)
My problem is that this returns a list, but I want three separate columns.
Any help is greatly appreciated. I'm new to Pandas/Python, so if there's a better way to write the function and use Pandas, that would be very helpful.
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return (description, city, country)

suspect['description'], suspect['city'], suspect['country'] = \
    suspect['ip_address'].apply(returnIP)
I was able to solve it with another stackoverflow solution:
cols = ['description', 'city', 'country']
for n, col in enumerate(cols):
    suspect[col] = suspect['ipwhois'].apply(lambda ipwhois: ipwhois[n])
If there's a more elegant way to solve this, please share!
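One option that is often considered more elegant (a sketch, reusing the suspect and ipwhois names from above and assuming each entry is a 3-item list or tuple) is to expand the column into a DataFrame in one step and join it back:
# Build a 3-column frame from the list column, aligned on the original index.
ipinfo = pd.DataFrame(suspect['ipwhois'].tolist(),
                      columns=['description', 'city', 'country'],
                      index=suspect.index)
suspect = suspect.join(ipinfo)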

Try to include a column based on input and file name in Pandas Dataframe in Python

I have several csv files which have the following structure:
Erster Hoch Tief Schlusskurs Stuecke Volumen
Datum
14.02.2017 151.55 152.35 151.05 152.25 110.043 16.687.376
13.02.2017 149.85 152.20 149.25 151.25 415.76 62.835.200
10.02.2017 149.00 150.05 148.65 149.40 473.664 70.746.088
09.02.2017 144.75 148.45 144.35 148.00 642.175 94.348.392
Erster Hoch Tief Schlusskurs Stuecke Volumen
Datum
14.02.2017 111.454 111.776 111.454 111.776 44 4.918
13.02.2017 110.570 110.989 110.570 110.989 122 13.535
10.02.2017 109.796 110.705 109.796 110.705 0 0
09.02.2017 107.993 108.750 107.993 108.750 496 53.933
They differ only in the WKN that is part of the file name:
wkn_A1EWWW_historic.csv
wkn_A0YAQA_historic.csv
I want to have the following Output:
Date wkn Open High low Close pieced Volume
14.02.2017 A1EWWW 151.55 152.35 151.05 152.25 110.043 16.687.376
13.02.2017 A1EWWW 149.85 152.20 149.25 151.25 415.76 62.835.200
10.02.2017 A1EWWW 149.00 150.05 148.65 149.40 473.664 70.746.088
09.02.2017 A1EWWW 144.75 148.45 144.35 148.00 642.175 94.348.392
Date wkn Open High low Close pieced Volume
14.02.2017 A0YAQA 111.454 111.776 111.454 111.776 44 4.918
13.02.2017 A0YAQA 110.570 110.989 110.570 110.989 122 13.535
10.02.2017 A0YAQA 109.796 110.705 109.796 110.705 0 0
09.02.2017 A0YAQA 107.993 108.750 107.993 108.750 496 53.933
The code looks like the following:
import pandas as pd

wkn_list_dummy = {'A0YAQA','A1EWWW'}
for w_list in wkn_list_dummy:
    url = 'C:/wkn_'+str(w_list)+'_historic.csv'
    df = pd.read_csv(url, encoding='cp1252', sep=';', decimal=',', index_col=0)
    print(df)
I tried using melt, but it did not work.
You can add a column by just assigning a value to it:
df['new_column'] = 'string'
All together:
import pandas as pd

wkn_list_dummy = {'A0YAQA','A1EWWW'}
final_df = pd.DataFrame()
for w_list in wkn_list_dummy:
    url = 'C:/wkn_'+str(w_list)+'_historic.csv'
    df = pd.read_csv(url, encoding='cp1252', sep=';', decimal=',', index_col=0)
    df['wkn'] = w_list
    final_df = final_df.append(df)
final_df.reset_index(inplace=True)
print(final_df)
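Note that DataFrame.append was deprecated and removed in pandas 2.0, so on current pandas the same idea is usually written by collecting the frames in a list and concatenating them once; a sketch based on the loop above:
import pandas as pd

wkn_list_dummy = {'A0YAQA', 'A1EWWW'}
frames = []
for w_list in wkn_list_dummy:
    url = 'C:/wkn_' + str(w_list) + '_historic.csv'
    df = pd.read_csv(url, encoding='cp1252', sep=';', decimal=',', index_col=0)
    df['wkn'] = w_list  # add the WKN taken from the file name
    frames.append(df)
final_df = pd.concat(frames).reset_index()
print(final_df)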

Organize by Twitter unique identifier using python

I have a CSV file with each line containing information pertaining to a particular tweet (i.e. each line contains Lat, Long, User_ID, tweet and so on). I need to read the file and organize the tweets by the User_ID. I am trying to end up with a given User_ID attached to all of the tweets with that specific ID.
Here is what I want:
user_id: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
user_id2: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
and so on...
This is a snippet of my code that reads in the CSV file and creates a list:
UID = []
myID = []
ID = []
f = None
with open(csv_in,'rU') as f:
    myreader = csv.reader(f, delimiter=',')
    for row in myreader:
        # Assign columns in csv to variables.
        latitude = row[0]
        longitude = row[1]
        user_id = row[2]
        user_name = row[3]
        date = row[4]
        time = row[5]
        tweet = row[6]
        flag = row[7]
        compound = row[8]
        Vote = row[9]
        # Read variables into separate lists.
        UID.append(user_id + ', ' + latitude + ', ' + longitude + ', ' + user_name + ', ' + date + ', ' + time + ', ' + tweet + ', ' + flag + ', ' + compound)
        myID = ', '.join(UID)
        ID = myID.split(', ')
I'd suggest you use pandas for this. It will allow you not only to list your tweets by user_id, as in your question, but also to do many other manipulations quite easily.
As an example, take a look at this python notebook from NLTK. At the end of it, you see an operation very close to yours, reading a csv file containing tweets:
In [25]:
import pandas as pd

tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
You can also find a simple operation: looking for the tweets of a certain user:
In [26]:
tweets.loc[tweets['user.id'] == 557422508]['text']
Out[26]:
id
593891099548094465 VIDEO: Sturgeon on post-election deals http://...
593891101766918144 SNP leader faces audience questions http://t.c...
Name: text, dtype: object
For listing the tweets by user_id, you would simply do something like the following (this is not in the original notebook):
In [9]:
tweets.set_index('user.id')[0:4]
Out[9]:
created_at favorite_count in_reply_to_status_id in_reply_to_user_id retweet_count retweeted text truncated
user.id
107794703 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #KirkKus: Indirect cost of the UK being in ... False
557422508 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False VIDEO: Sturgeon on post-election deals http://... False
3006692193 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #LabourEoin: The economy was growing 3 time... False
455154030 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #GregLauder: the UKIP east lothian candidat... False
Hope it helps.
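To get output shaped like the question's example (a sketch only: it assumes the csv has a header row with columns named latitude, longitude, user_id and tweet, which the original file may not have), you could also group on the user id directly:
import pandas as pd

df = pd.read_csv(csv_in, header=0)  # csv_in as in the question's code
for user_id, group in df.groupby('user_id'):
    # Print the id once, then every lat/long/tweet row belonging to it.
    print(user_id)
    print(group[['latitude', 'longitude', 'tweet']].to_string(index=False, header=False))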
