I am adding a column "state" to an existing dataframe that does not share a common column with my other data frame. Therefore, I need to convert zipcodes into states (for example, 00704 would be PR) to load into the dataframe that has the new "state" column.
import pandas as pd

reviewers = pd.read_csv('reviewers.txt',
                        sep='|',
                        header=None,
                        names=['user id','age','gender','occupation','zipcode'])
reviewers['state'] = ""
user id age gender occupation zipcode state
0 1 24 M technician 85711
1 2 53 F other 94043
zipcodes = pd.read_csv('zipcodes.txt',
                       usecols=[1,4],
                       converters={'Zipcode': str})
Zipcode State
0 00704 PR
1 00704 PR
2 00704 PR
3 00704 PR
4 00704 PR
zipcodes1 = zipcodes.set_index('Zipcode') ###Setting the index to zipcode
dfzip = zipcodes1
print(dfzip)
State
Zipcode
00704 PR
00704 PR
00704 PR
zips = (pd.Series(dfzip.values.tolist(), index = zipcodes1['State'].index))
states = []
for zipcode in reviewers['Zipcode']:
    if re.search('[a-zA-Z]+', zipcode):
        append.states['canada']
    elif zipcode in zips.index:
        append.states(zips['zipcode'])
    else:
        append.states('unkown')
I am not sure if my loop is correct either. I have to sort the zipcodes into U.S. zipcodes (numerical), Canadian zip codes (alphabetical), and other zip codes, which we define as unknown. Let me know if you need the data file.
Use:
import numpy as np

# remove duplicates and create Series for mapping
zips = zipcodes.drop_duplicates().set_index('Zipcode')['State']

# get mask for Canada zip codes
# if lowercase letters are possible, change the pattern to [a-zA-Z]+
mask = reviewers['zipcode'].str.match('[A-Z]+')

# new column by mask
reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))

# NaNs are replaced
reviewers['state'] = reviewers['state'].fillna('unknown')
Loop version with apply:
import re

def f(code):
    res = "unknown"
    # if lowercase letters are possible, change the pattern to [a-zA-Z]+
    if re.match('[A-Z]+', code):
        res = 'canada'
    elif code in zips.index:
        res = zips[code]
    return res

reviewers['State1'] = reviewers['zipcode'].apply(f)
print (reviewers.tail(10))
user id age gender occupation zipcode state State1
933 934 61 M engineer 22902 VA VA
934 935 42 M doctor 66221 KS KS
935 936 24 M other 32789 FL FL
936 937 48 M educator 98072 WA WA
937 938 38 F technician 55038 MN MN
938 939 26 F student 33319 FL FL
939 940 32 M administrator 02215 MA MA
940 941 20 M student 97229 OR OR
941 942 48 F librarian 78209 TX TX
942 943 22 M student 77841 TX TX
#test if same output
print ((reviewers['State1'] == reviewers['state']).all())
True
Timings:
In [56]: %%timeit
...: mask = reviewers['zipcode'].str.match('[A-Z]+')
...: reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
...: reviewers['state'] = reviewers['state'].fillna('unknown')
...:
100 loops, best of 3: 2.08 ms per loop
In [57]: %%timeit
...: reviewers['State1'] = reviewers['zipcode'].apply(f)
...:
100 loops, best of 3: 17 ms per loop
Your loop needs to be fixed:
states = []
for zipcode in reviewers['zipcode']:
    # letters indicate a Canadian postal code
    if re.match(r'[A-Za-z]+', zipcode):
        states.append('Canada')
    elif zipcode in zips.index:
        states.append(zips[zipcode])
    else:
        states.append('Unknown')
Also, I am assuming you want the states list to be plugged back into the dataframe. In that case you don't need the for loop. You can use pandas apply on the dataframe to get a new column:
def findState(code):
    res = 'Unknown'
    # letters indicate a Canadian postal code
    if re.match(r'[A-Za-z]+', code):
        res = 'Canada'
    elif code in zips.index:
        res = zips[code]
    return res

reviewers['State'] = reviewers['zipcode'].apply(findState)
I have the following data set:
          CustomerID        Date  Amount  Department DaysSincePurchase
0             395134  2019-01-01     199        Home          986 days
1             395134  2019-01-01     279        Home          986 days
2            1356012  2019-01-07     279        Home          980 days
3            1921374  2019-01-08     269        Home          979 days
4             395134  2019-01-01     279        Home          986 days
...              ...         ...     ...         ...               ...
18926474     1667426  2021-06-30     349  Womenswear           75 days
18926475     1667426  2021-06-30     299  Womenswear           75 days
18926476      583105  2021-06-30     349  Womenswear           75 days
18926477      538137  2021-06-30     279  Womenswear           75 days
18926478      825382  2021-06-30    2499        Home           75 days
I want to do some feature engineering and add a few columns after aggregating (using groupby) by CustomerID. The Date column is unimportant and can easily be dropped. I want a data set where every row is one unique CustomerID (just integers 1, 2, ... in the first column) and the other columns are:
Total amount of purchasing
Days since the last purchase
Number of total departments
This is what I've done, and it works. However, when I time it, it takes about 1.5 hours. Is there another, more efficient way of doing this?
customer_group = joinedData.groupby(['CustomerID'])
n = originalData['CustomerID'].nunique()

# First arrange the data in a matrix.
matrix = np.zeros((n,5))  # Pre-allocate matrix

for i in range(0,n):
    matrix[i,0] = i+1
    matrix[i,1] = sum(customer_group.get_group(i+1)['Amount'])
    matrix[i,2] = min(customer_group.get_group(i+1)['DaysSincePurchase']).days
    matrix[i,3] = customer_group.get_group(i+1)['Department'].nunique()
# The above loop takes approx. 6300 sec

# convert matrix to dataframe and name columns
newData = pd.DataFrame(matrix)
newData = newData.rename(columns = {0:"CustomerID"})
newData = newData.rename(columns = {1:"TotalDemand"})
newData = newData.rename(columns = {2:"DaysSinceLastPurchase"})
newData = newData.rename(columns = {3:"nrDepartments"})
Use agg:
>>> df.groupby('CustomerID').agg(TotalDemand=('Amount', sum),
DaysSinceLastPurchase=('DaysSincePurchase', min),
nrDepartments=('Department', 'nunique'))
I ran this function over a dataframe of 20,000,000 records. It took a few seconds to execute:
>>> %timeit df.groupby('CustomerID').agg(...)
14.7 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Generated data:
N = 20000000
df = pd.DataFrame(
    {'CustomerID': np.random.randint(1000, 10000, N),
     'Date': np.random.choice(pd.date_range('2020-01-01', '2020-12-31'), N),
     'Amount': np.random.randint(100, 1000, N),
     'Department': np.random.choice(['Home', 'Sport', 'Food', 'Womenswear',
                                     'Menswear', 'Furniture'], N)})
df['DaysSincePurchase'] = pd.Timestamp.today().normalize() - df['Date']
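If, as described in the question, CustomerID should end up as an ordinary first column rather than the index, reset_index() can be chained onto the same aggregation (a small sketch building on the answer above, not part of the original):
newData = (df.groupby('CustomerID')
             .agg(TotalDemand=('Amount', 'sum'),
                  DaysSinceLastPurchase=('DaysSincePurchase', 'min'),
                  nrDepartments=('Department', 'nunique'))
             .reset_index())  # CustomerID becomes a regular column again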
I have the following piece of code:
my_list = ["US", "IT", "ES", "NL"]
for i in my_list:
    A = sum_products_by_country(world_level, i)
    df = pd.DataFrame({'value': A})
    Descending = df.sort_values(by='value', ascending=False)
    Top_5 = Descending[0:5]
    print(Top_5)
The "sum_products_by_country" is a created function which takes as arguments a data frame ( in my case is named "world_level") and a country name and returns the sum by product for this country. Using this loop I find the top5 products and the sums for each country of my_list. Here is the output of this loop:
US value
Product
B 1492
H 455
BB 351
C 119
F 117
IT value
Product
P 346
U 331
A 379
Q 190
D 1389
ES value
Product
P 3046
U3 331
A 379
Q 1390
DD 10389
NL value
Product
P 3465
U 3313
AA 379
2Q 190
D 189
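For reference, since sum_products_by_country itself is not shown in the post, here is a minimal sketch of what such a function might look like, assuming world_level has columns named 'Country', 'Product' and 'value' (all hypothetical names):
def sum_products_by_country(world_level, country):
    # hypothetical column names; adjust to the real layout of world_level
    rows = world_level[world_level['Country'] == country]
    return rows.groupby('Product')['value'].sum()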
I want to write this output to an Excel sheet using:
writer = pd.ExcelWriter('top products.xlsx', engine='xlsxwriter')
Top_5.to_excel(writer, sheet_name='Sheet1')
writer.save()
Could you tell me where I should put the code above in order to get the required Excel document?
Is there also a way to get the column names (Country, Product, value) only once at the top of my Excel document, and not repeated for each country? So I want something like this:
Country Product value
US
B 1492
H 455
BB 351
C 119
F 117
IT
P 346
U 331
A 379
Q 190
D 1389
ES
P 3046
U3 331
A 379
Q 1390
DD 10389
NL
P 3465
U 3313
AA 379
2Q 190
D 189
Thank you
This script should help you:
import openpyxl

# Create workbook object
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'Products by country'

# Generate data (build the Country, Product and Values lists here)

# Add titles in the first row of each column
sheet.cell(row=1, column=1).value = 'country'
sheet.cell(row=1, column=2).value = 'product'
sheet.cell(row=1, column=3).value = 'value'

# Loop to set the value of each cell.
# Country, Product and Values are lists of equal length; if you have
# 5 values for one country, repeat the country name five times.
for i in range(0, len(Country)):
    sheet.cell(row=i+2, column=1).value = Country[i]
    sheet.cell(row=i+2, column=2).value = Product[i]
    sheet.cell(row=i+2, column=3).value = Values[i]

# Finally, save the file and give it a name
wb.save('NameFile.xlsx')
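Alternatively, staying with pandas and the ExcelWriter already shown in the question, one option (a sketch, assuming sum_products_by_country returns a Series indexed by product) is to collect the per-country top 5 into a single frame with one header row and write it once:
import pandas as pd

frames = []
for country in ["US", "IT", "ES", "NL"]:
    A = sum_products_by_country(world_level, country)
    top5 = A.sort_values(ascending=False).head(5).rename('value').reset_index()
    top5.insert(0, 'Country', country)  # repeat the country name on each row
    frames.append(top5)

result = pd.concat(frames, ignore_index=True)

# one header row, all countries in a single sheet
with pd.ExcelWriter('top products.xlsx', engine='xlsxwriter') as writer:
    result.to_excel(writer, sheet_name='Sheet1', index=False)
This gives the Country/Product/value header once at the top; if the country name should appear only on the first row of each block, the repeated values can be blanked out afterwards.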
I have a dataframe of samples, with a country column. The number of records in each country is:
d1.groupby("country").size()
country
Australia 21
Cambodia 58
China 280
India 133
Indonesia 195
Malaysia 138
Myanmar 51
Philippines 49
Singapore 1268
Taiwan 47
Thailand 273
Vietnam 288
How do I select, say, 100 random samples from each country, if that country has > 100 samples? (if the country has <= 100 samples, do nothing). Currently, I do this for, say, Singapore:
names_nonsg_ls = []
names_sg_ls = []

# if the country is not SG, add it to names_nonsg_ls.
# else, add it to names_sg_ls, which will be subsampled later.
for index, row in d0.iterrows():
    if str(row["country"]) != "Singapore":
        names_nonsg_ls.append(str(row["header"]))
    else:
        names_sg_ls.append(str(row["header"]))

# Select 100 random names from names_sg_ls
names_sg_ls = random.sample(names_sg_ls, 100)

# Form the list of names to retain
names_ls = names_nonsg_ls + names_sg_ls

# create new dataframe
d1 = d0.loc[d0["header"].isin(names_ls)]
But manually creating a new list for each country that has >100 names is just poor form, not to mention that I first have to manually pick out the countries with >100 names.
You can group by country, then sample based on the group size:
d1.groupby("country", group_keys=False).apply(lambda g: g.sample(100) if len(g) > 100 else g)
Example:
df = pd.DataFrame({
    'A': ['a','a','a','a','b','b','b','c','d'],
    'B': list(range(9))
})
df.groupby('A', group_keys=False).apply(lambda g: g.sample(3) if len(g) > 3 else g)
# A B
#2 a 2
#0 a 0
#1 a 1
#4 b 4
#5 b 5
#6 b 6
#7 c 7
#8 d 8
The copy has to be done for rows where the 'CITY' column starts with 'BH'. The copied df.index should be the same as in the original. E.g.:
STATE CITY
315 KA BLR
423 WB CCU
554 KA BHU
557 TN BHY
# state_df is new dataframe, df is existing
state_df = pd.DataFrame(columns=['STATE', 'CITY'])
for index, row in df.iterrows():
    city = row['CITY']
    if city.startswith('BH'):
        append row from df to state_df  # pseudocode
Being new to pandas and Python, I need help turning this pseudocode into the most efficient approach.
Solution with startswith and boolean indexing:
print (df['CITY'].str.startswith('BH'))
315 False
423 False
554 True
557 True
state_df = df[df['CITY'].str.startswith('BH')]
print (state_df)
STATE CITY
554 KA BHU
557 TN BHY
If you need to copy only some columns, add loc:
state_df = df.loc[df['CITY'].str.startswith('BH'), ['STATE']]
print (state_df)
STATE
554 KA
557 TN
Timings:
#len (df) = 400k
df = pd.concat([df]*100000).reset_index(drop=True)
In [111]: %timeit (df.CITY.str.startswith('BH'))
10 loops, best of 3: 151 ms per loop
In [112]: %timeit (df.CITY.str.contains('^BH'))
1 loop, best of 3: 254 ms per loop
try this:
In [4]: new = df[df['CITY'].str.contains(r'^BH')].copy()
In [5]: new
Out[5]:
STATE CITY
554 KA BHU
557 TN BHY
What if I need to copy only some columns of the row and not the entire row?
cols_to_copy = ['STATE']
new = df.loc[df.CITY.str.contains(r'^BH'), cols_to_copy].copy()
In [7]: new
Out[7]:
STATE
554 KA
557 TN
I removed the for loop and finally wrote this:
state_df = df.loc[df['CTYNAME'].str.startswith('Washington'), cols_to_copy]
The for loop may be slower, but I need to check on that.
I have a csv file that contains trade data for some countries. The data has a format as follows:
rep par commodity value
USA GER 1 700
USA GER 2 100
USA GER 3 400
USA GER 5 100
USA GER 80 900
GER USA 2 300
GER USA 4 500
GER USA 5 700
GER USA 97 450
GER UK 50 300
UK USA 4 1100
UK USA 80 200
UK GER 50 200
UK GER 39 650
I intend to build a dictionary and, using it, calculate the total value of commodities traded in common between countries.
For example, consider trade between USA-GER: I intend to check whether GER-USA is in the data and, if it exists, sum the values of the common commodities, and do the same for all country pairs. The dictionary should look like:
Dic_c1c2_products =
{('USA','GER'): {('1','700'), ('2','100'), ('3','400'), ('5','100'), ('80','900')},
 ('GER','USA'): {('2','300'), ('4','500'), ('5','700'), ('97','450')},
 ('GER','UK'):  {('50','300')},
 ('UK','USA'):  {('4','1100'), ('80','200')},
 ('UK','GER'):  {('50','200'), ('39','650')}}
As you can see, USA-GER and GER-USA have commodities 2 and 5 in common, and the value of these goods is (100+300)+(100+700).
For the pair USA-UK / UK-USA there are no common commodities, so total trade will be 0 as well. For GER-UK and UK-GER, commodity 50 is in common and total trade is 300+200.
At the end, I want to have something like:
Dic_c1c2_summation = {('USA','GER'): 1200, ('GER','UK'): 500, ('UK','USA'): 0}
Any help would be appreciated.
In addition to my post, I have written the following lines:
import csv
from collections import defaultdict

rfile = csv.reader(open("filepath", 'r'))
rfile.next()
dic_c1c2_products = defaultdict(set)
dic_c_products = {}
country = set()
for row in rfile:
    c1 = row[0]
    c2 = row[1]
    p = row[2]
    country.add(c1)
for i in country:
    dic_c_products[i] = set()
rfile = csv.reader(open("filepath"))
rfile.next()
for i in rfile:
    c1 = i[0]
    c2 = i[1]
    p = i[2]
    v = i[3]
    dic_c_products[c1].add((p, v))
    if not dic_c1c2_products.has_key((c1, c2)):
        dic_c1c2_products[(c1, c2)] = set()
        dic_c1c2_products[(c1, c2)].add((p, v))
    else:
        dic_c1c2_products[(c1, c2)].add((p, v))
c_list = dic_c_products.keys()
dic_c1c2_productsummation = set()
for i in dic_c1c2_products.keys():
    if dic_c1c2_products.has_key((i[1], i[0])):
        for p1, v1 in dic_c1c2_products[(i[0], i[1])]:
            for p2, v2 in dic_c1c2_products[(i[1], i[0])]:
                if p1 == p2:
                    summation = v1 + v2
                    if i not in dic_c1c2_productsum.keys():
                        dic_c1c2_productsum[(i[0], i[1])] = (p1, summation)
                    else:
                        dic_c1c2_productsum[(i[0], i[1])].add((p1, summation))
    else:
        dic_c1c2_productsn[i] = " "
# save your data in a file called data
import pandas as pd

data = pd.read_csv('data', delim_whitespace=True)
data['par_rep'] = data.apply(lambda x: '_'.join(sorted([x['par'], x['rep']])), axis=1)
result = (data.groupby(['par_rep', 'commodity'])
              .filter(lambda x: len(x) >= 2)
              .groupby('par_rep')['value'].sum()
              .to_dict())
At the end, result is {'GER_UK': 500, 'GER_USA': 1200}.
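One gap worth noting: the question also wanted pairs with no common commodities (e.g. UK/USA) to appear with a total of 0, and the filter above drops those entirely. A small follow-up sketch (not part of the original answer, reusing the data frame built above) that fills them back in with reindex:
# total value of commodities reported in both directions, per unordered pair
common = (data.groupby(['par_rep', 'commodity'])
              .filter(lambda x: len(x) >= 2)
              .groupby('par_rep')['value'].sum())

# reindex over every unordered pair in the file, filling missing pairs with 0
all_pairs = data['par_rep'].unique()
result_with_zeros = common.reindex(all_pairs, fill_value=0).to_dict()
# e.g. {'GER_USA': 1200, 'GER_UK': 500, 'UK_USA': 0}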