Python multiple for loop question (pandas, dataframe)

idx_list = []
for idx, row in df_quries_copy.iterrows():
    for brand in brand_name:
        if row['user_query'].contains(brand):
            idx_list.append(idx)
        else:
            continue
The brand_name list looks like this:
brand_name = ['Apple', 'Lenovo', 'Samsung', ... ]
I have a df_queries data frame that holds the queries users ran.
The table looks like this:
user_query      user_id
Apple Laptop    A
Lenovo 5GB      B
I also have the brand names as a list.
I want to find the users whose queries mention a brand, such as 'Apple Laptop'.
But when I run the script, I get this error:
'str' object has no attribute 'contains'
How should I write the nested for loops?
Thank you in advance.
for brand in brand_name[:100]:
    if len(copy_df[copy_df['user_query'].str.contains(brand)]) > 0:
        ls.append(copy_df[copy_df['user_query'].str.contains(brand)].index)
    else:
        continue
I tried it the way the answer suggests, but the result suddenly contained the whole dataframe.

You can use df_quries_copy[df_quries_copy['user_query'].str.contains(brand)].index to get the matching index directly:
for brand in brand_name:
    df_quries_copy[df_quries_copy['user_query'].str.contains(brand)].index
Or, in your own code, use brand in row['user_query'], since row['user_query'] is a plain string value and Python strings have no contains method.
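If the goal is one list of rows that mention any brand, the nested loop can be replaced by a single vectorized check. The following is a minimal sketch, assuming a small made-up df_queries frame and brand list (not the asker's real data) and assuming case-insensitive matching is wanted. It escapes each brand with re.escape and joins them into one pattern for str.contains. An empty string in brand_name would match every row, which is one possible reason for the whole dataframe coming back, so empty entries are skipped here.

```python
import re

import pandas as pd

# Made-up sample data for illustration only.
df_queries = pd.DataFrame({
    "user_query": ["Apple Laptop", "Lenovo 5GB", "cheap phone case"],
    "user_id": ["A", "B", "C"],
})
brand_name = ["Apple", "Lenovo", "Samsung", ""]

# Escape each brand so characters such as '+' or '.' match literally,
# skip empty strings (they would match every row), and join the rest
# into one alternation pattern: Apple|Lenovo|Samsung
pattern = "|".join(re.escape(b) for b in brand_name if b)

# Boolean mask: True where the query mentions at least one brand.
mask = df_queries["user_query"].str.contains(pattern, case=False, na=False)

idx_list = df_queries[mask].index.tolist()
matched_users = df_queries.loc[mask, "user_id"].tolist()
print(idx_list, matched_users)
```

The mask can be reused to select any column, so no separate list of indices has to be built by hand.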

Related

Python Contains in Panda DATAFRAME

In each iteration I group rows with the same price, add their quantities together, and concatenate the exchange names, like this:
asks_price asks_qty exchange_name_ask bids_price bids_qty exchange_name_bid
0 20156.51 0.000745 Coinbase 20153.28 0.000200 Coinbase
1 20157.52 0.050000 Coinbase 20152.27 0.051000 Coinbase
2 20158.52 0.050745 CoinbaseFTX 20151.28 0.051200 KrakenCoinbase
But to build the order book, I have to drop one provider's rows each time I update it, so I do:
self.global_orderbook = self.global_orderbook[
    self.global_orderbook.exchange_name_ask != name]
With Coinbase, for example, I am then left with:
asks_price asks_qty exchange_name_ask bids_price bids_qty exchange_name_bid
0 20158.52 0.050745 CoinbaseFTX 20151.28 0.051200 KrakenCoinbase
But I want the row containing KrakenCoinbase to be removed as well,
so I want to do something like:
self.global_orderbook = self.global_orderbook[name not in self.global_orderbook.exchange_name_ask]
It doesn't work.
I already tried contains, but I can't call it on a Series:
self.global_orderbook = self.global_orderbook[self.global_orderbook.exchange_name_ask.contains(name)]
which fails with
'Series' object has no attribute 'contains'
Thanks for the help.
To do that, we can convert the column with astype(str) and then go through the .str accessor,
like this:
self.global_orderbook = self.global_orderbook[self.global_orderbook.exchange_name_ask.astype(str).str.contains(name, regex=False)]
That works: contains is available through .str on a column of strings.
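Note that the mask above keeps the rows that contain name. If the intent is to drop those rows, as the question describes, the mask can be negated with ~. A minimal sketch, assuming a trimmed-down order book with made-up values (the fourth row is invented so that something remains after filtering):

```python
import pandas as pd

# Hypothetical, simplified order book for illustration only.
orderbook = pd.DataFrame({
    "asks_price": [20156.51, 20157.52, 20158.52, 20159.10],
    "exchange_name_ask": ["Coinbase", "Coinbase", "CoinbaseFTX", "Kraken"],
})
name = "Coinbase"

# True for every row whose exchange label contains `name` as a substring.
# regex=False treats `name` as literal text rather than a pattern.
contains_name = (
    orderbook["exchange_name_ask"].astype(str).str.contains(name, regex=False)
)

# ~ inverts the mask, so only rows that do NOT mention `name` are kept.
remaining = orderbook[~contains_name]
print(remaining)
```

The same negated mask could be built for exchange_name_bid and combined with & if both columns need to be checked.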

How to update a dynamic list in python? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
I have a list, Business = ['Company name','Mycompany','Revenue','1000','Income','2000','employee','3000','Facilities','4000','Stock','5000']. The structure of the list is shown below:
Company Mycompany
Revenue 1000
Income 2000
employee 3000
Facilities 4000
Stock 5000
The dynamic list gets updated on every iteration, and some of the items in the list may be missing. For example, in execution 1 the list gets updated as below:
Company Mycompany
Income 2000 #revenue is missing
employee 3000
Facilities 4000
Stock 5000
In the list above, Revenue has been removed because the company has no revenue. In the second example:
Company Mycompany
Revenue 1000
Income 2000
Facilities 4000 #Employee is missing
Stock 5000
In example 2 above, employee is missing. How can I create an output list that replaces the missing values with 0? In example 1 Revenue is missing, so the output list needs ['Revenue','0'] at its position. For better understanding, please see below.
Output list created for example 1 (Revenue replaced with 0):
Company Mycompany| **Revenue 0**| Income 2000| employee 3000| Facilities 4000| Stock 5000
Output list for example 2 (employee replaced with 0):
Company Mycompany| Revenue 1000| Income 2000| **employee 0**| Facilities 4000| Stock 5000
How can I build the output list with 0 in place of the missing items, without changing the structure of the list? My code so far:
for line in Business:
    if 'Company' not in line:
        Business.insert( 0, 'company')
        Business.insert( 1, '0')
    if 'Revenue' not in line:
        #got stuck here
    if 'Income' not in line:
        #got stuck here
    if 'Employee' not in line:
        #got stuck here
    if 'Facilities' not in line:
        #got stuck here
    if 'Stock' not in line:
        #got stuck here
Thanks a lot in advance
If you receive the input as a list, you can convert it into a dict like this, which makes the data much easier to work with. Receiving it as a dictionary in the first place would be a better choice, though.
Business = ['Company name','Mycompany','Revenue',1000,'Income',2000,'employee',3000,'Facilities',4000,'Stock',5000]
BusinessDict = {Business[i]:Business[i+1] for i in range(0,len(Business)-1,2)}
print(BusinessDict)
As said in the comments, a dict is a much better data structure for the problem. If you really need the list, you could use a temporary dict like this:
example = ['Company name','Mycompany','Income','2000','employee','3000','Facilities','4000','Stock','5000']
template = ['Company name', 'Revenue', 'Income', 'employee', 'Facilities', 'Stock']
# build a temporary dict
exDict = dict(zip(example[::2], example[1::2]))
# work on it
result = []
for i in template:
    result.append(i)
    if i in exDict.keys():
        result.append(exDict[i])
    else:
        result.append(0)
A bit more efficient (but harder to understand for beginners) would be to create the temporary dict like this:
i = iter(example)
example_dict = dict(zip(i, i))
This works because zip pulls alternately from the same iterator, so each pair it yields consists of two consecutive items of the list.
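Putting the pieces of this answer together, here is a minimal sketch of a reusable helper. The function name fill_missing and the default parameter are hypothetical names chosen for this illustration; the template and sample lists follow the ones in the question.

```python
def fill_missing(flat, template, default="0"):
    """Return a flat key/value list that has every key in `template`.

    `flat` alternates keys and values; keys that are absent from it
    get `default` as their value. The order follows `template`.
    """
    it = iter(flat)
    # zip(it, it) pairs consecutive items: (key, value), (key, value), ...
    pairs = dict(zip(it, it))
    result = []
    for key in template:
        result.extend([key, pairs.get(key, default)])
    return result


template = ["Company name", "Revenue", "Income", "employee", "Facilities", "Stock"]

# Example 1 from the question: Revenue is missing.
example1 = ["Company name", "Mycompany", "Income", "2000", "employee", "3000",
            "Facilities", "4000", "Stock", "5000"]
# Example 2 from the question: employee is missing.
example2 = ["Company name", "Mycompany", "Revenue", "1000", "Income", "2000",
            "Facilities", "4000", "Stock", "5000"]

print(fill_missing(example1, template))
print(fill_missing(example2, template))
```

Using dict.get with a default avoids the explicit if/else branch, and the template list is what keeps the output in a fixed order.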
You can use a dictionary like this:
d = {'Company': 0, 'Revenue': 0, 'Income': 0, 'employee': 0, 'Facilities': 0, 'Stock': 0}
given = [['Company', 'Mycompany'], ['Income', 2000], ['employee', 3000], ['Facilities', 4000], ['Stock', 5000]]
for i in given:
    d[i[0]] = i[1]
ans = []
for key, value in d.items():
    ans.append([key, value])

Python: Extract unique sentences from string and place them in new column concatenated with ;

Afternoon Stackoverflowers,
I have been challenged with a data extract that I am trying to prepare for some users.
I was advised that this is very hard to do in SQL because there is no clear pattern, so I tried some things in Python, but without success (I am still learning Python).
Problem statement:
My SQL query output is either an Excel or a text file (it depends on how I publish it, but I can do it both ways). I have a field (the fourth column in the Excel or text file) that contains one or more rejection reasons (see the example below), separated by commas. At the same time, a comma is sometimes used within the errors themselves.
Field example without any modification
INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]
Desired output:
Invalid Address information; Company Id on the Invoice does not match with the Company Id registered for the Code in System; Incorrect Purchase order
Python code:
import pandas
excel_data_df = pandas.read_excel('rejections.xlsx')
# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())
excel_data_df['Invoice_Issues'].split(":", 1)
Output:
INVOICE_CREATION_FAILED[Invalid Address information:
I tried splitting the string, but it doesn't work properly. It deletes everything after the colon, which is expected because it is coded that way; however, I need to trim the unnecessary data from the string and keep only the desired output for each line.
I would be very thankful for any code suggestions on how to trim the output so that I extract only what is needed from that string, if the substring is present.
In Excel, I would normally use a list of errors and nested IF functions with FIND and MATCH. But I am not sure how to do it in Python...
Many thanks,
Greg
This isn't the fastest way to do this, but in Python, speed is rarely the most important thing.
Here, we manually create a dictionary of the errors and what you would want them to map to, then we iterate over the values in Invoice_Issues, and use the dictionary to see if that key is present.
import pandas
# create a standard dataframe with some errors
excel_data_df = pandas.DataFrame({
    'Invoice_Issues': [
        'INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]']
})
# build a dictionary
# (just like "words" to their "meanings", this maps
# "error keys" to their "error descriptions").
errors_dictionary = {
    'E01': 'Invalid Address information',
    'E02': 'Incorrect Purchase order',
    'E03': 'Invalid VAT ID',
    # ....
    'E39': 'No tax line available'
}
def extract_errors(invoice_issue):
    # using the entry in the Invoice_Issues column,
    # loop over the dictionary to see if each known error is in there.
    # If so, add it to the list_of_errors.
    list_of_errors = []
    for error_number, error_description in errors_dictionary.items():
        if error_description in invoice_issue:
            list_of_errors.append(error_description)
    return ';'.join(list_of_errors)
# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())
# for every row in the Invoice_Issues column, run the extract_errors function, and store the value in the 'Errors' column.
excel_data_df['Errors'] = excel_data_df['Invoice_Issues'].apply(extract_errors)
# display the dataframe with the extracted errors
print(excel_data_df.to_string())
excel_data_df.to_excel('extracted_errors.xlsx')
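If maintaining a dictionary of every known error is not practical, the reasons can also be pulled out by pattern. The following is only a sketch, and it rests on an assumption taken from the single example in the question: every entry has the shape "reason: details at line N", and entries are separated by a comma after "at line N". The function name extract_reasons is hypothetical. Real data may well contain entries that break this pattern.

```python
import re


def extract_reasons(issue):
    """Return the unique rejection reasons in `issue`, joined with '; '."""
    # Drop the leading status code and its opening bracket, if present.
    body = issue.partition("[")[2] or issue
    # Drop the closing bracket and any trailing spaces.
    body = body.rstrip("] ")
    # Each entry ends with "at line N", optionally followed by a comma.
    entries = re.split(r"\s+at line \d+\s*,?\s*", body)
    # The reason is the text before the first colon of each entry.
    reasons = [e.split(":", 1)[0].strip() for e in entries if e.strip()]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return "; ".join(dict.fromkeys(reasons))


sample = (
    "INVOICE_CREATION_FAILED[Invalid Address information: Company Name, "
    "Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , "
    "Company Id on the Invoice does not match with the Company Id registered "
    "for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , "
    "Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]"
)
print(extract_reasons(sample))
```

With pandas, the function could be applied per row in the same way as extract_errors above, for example excel_data_df['Invoice_Issues'].apply(extract_reasons).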

How to iterate over a data frame

I have a dataset of users, books and ratings. I want to find the users who rated a particular book highly, and then find which other books those users liked too.
My data looks like:
df.sample(5)
        User-ID        ISBN  Book-Rating
49064    102967  0449244741            8
60600    251150  0452264464            9
376698    52853  0373710720            7
454056   224764  0590416413            7
54148     25409  0312421273            9
I did so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line failed with
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7 and then, from the matrix, find the other books those users liked too.
Help would be appreciated. Thanks.
In your like_lotr dataframe, 'User-ID' is the name of the index, so you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError: it is not a column.
Moreover, ix is deprecated (and was removed in pandas 1.0), so it is better to use loc here. And don't put quotes around the label: it needs to be an integer, since 'ISBN' came out as a column of integers (at least when loading your sample).
Try it like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the ids you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines should then look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back into the columns.
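The second half of the question (which other books those users liked) can also be answered without the pivot table, by filtering the long-format frame directly. A minimal sketch, assuming made-up ratings (not the asker's data), ISBNs stored as strings, and "liked" meaning a rating above 7:

```python
import pandas as pd

# Made-up ratings for illustration only.
df = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "ISBN": ["0345339703", "0449244741", "0345339703", "0452264464", "0345339703"],
    "Book-Rating": [9, 8, 10, 6, 5],
})
target = "0345339703"  # the book we start from

# Users who rated the target book above 7.
fans = df.loc[(df["ISBN"] == target) & (df["Book-Rating"] > 7), "User-ID"]

# Other books that those same users also rated above 7.
also_liked = df[
    df["User-ID"].isin(fans)
    & (df["ISBN"] != target)
    & (df["Book-Rating"] > 7)
]
print(fans.tolist())
print(also_liked)
```

Working on the long format avoids fillna(0) on the pivot, where a 0 cannot be told apart from a missing rating.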

Python CSV search for specific values after reading into dictionary

I need a little help reading specific values into a dictionary using Python. I have a CSV file of users (user 1, 2, 3, ...). Each user is in a specific department (1, 2, 3, ...) and each department is in a specific building (1, 2, 3, ...). I need to know how to list all the users in department 1 of building 1, then department 2 of building 1, and so on. I have read everything into one massive dictionary using csv.DictReader, but that would only help if I could search through the entries in each dictionary of dictionaries. Any ideas for how to sort through this file? The CSV has over 150,000 rows. Each row is a new user and lists 3 attributes: user_name, department number, department building. There are 100 departments, 100 buildings and 150,000 users. Any ideas for a short script to sort them all out? Thanks in advance for your help.
A brute-force approach would look like
import csv
csvFile = csv.reader(open('myfile.csv'))
data = list(csvFile)
data.sort(key=lambda x: (x[2], x[1], x[0]))
It could then be extended to
import csv
import collections
# building -> department -> list of user names
data = collections.defaultdict(lambda: collections.defaultdict(list))
with open('myfile.csv', newline='') as f:
    for name, dept, building in csv.reader(f):
        data[building][dept].append(name)
for building in sorted(data):
    print("Building {0}".format(building))
    for dept in sorted(data[building]):
        print("  Dept {0}".format(dept))
        for name in sorted(data[building][dept]):
            print("    ", name)
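One thing to watch with the grouping above: values read by the csv module are strings, so department "10" sorts before department "2". A minimal sketch of the same grouping with numeric keys, assuming the column order name, department, building used in the answer and a tiny made-up in-memory file in place of myfile.csv:

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for the real file: name, department, building.
sample = io.StringIO("alice,2,1\nbob,10,1\ncarol,2,1\ndave,1,2\n")

# building number -> department number -> list of user names
groups = defaultdict(lambda: defaultdict(list))
for name, dept, building in csv.reader(sample):
    # Convert to int so that sorting is numeric, not alphabetical.
    groups[int(building)][int(dept)].append(name)

# Flatten into (building, department, sorted names) in numeric order.
report = []
for building in sorted(groups):
    for dept in sorted(groups[building]):
        report.append((building, dept, sorted(groups[building][dept])))

for building, dept, names in report:
    print(f"Building {building}, Dept {dept}: {', '.join(names)}")
```

If the identifiers are not purely numeric, the int conversion would have to be dropped or replaced by a suitable sort key.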
