How to search a list of values in another list - Python

I'm new to Python.
Is there a way to search a list of values (words and phrases) in another list (a CSV table), and get only the matched rows?
Example:
ListOfValues = ['smoking', 'hard smoker', 'alcoholic']
ListfromCSV =
ID,TYPE,STRING1,NUMBER
1, a,'this is hard smoker man',4
2, b,'this one likes to drink',5
3, c,'dont like sigarets',6
4, e,'this one is smoking',7
I want to search for ListOfValues in each row and return only the matched rows.
The Output:
Output=
ID,TYPE,STRING1,NUMBER
1, a,'this is hard smoker man',4
4, e,'this one is smoking',7
I have tried this:
import csv

ListfromCSV = "ListfromCSV.txt"
ListOfValues = ['smoking', 'hard smoker', 'alcoholic', 'smoker']
with open(ListfromCSV, 'r') as f:
    LineReader = csv.reader(f, delimiter=',')
    for i in LineReader:
        if value in i[2]:  # NameError: 'value' is never defined here
            print(i)

Try this. It assumes your csv is not nested and contains strings; if it is nested, you can convert the inner lists to strings first:
[row for row in csv if any(map(lambda x: x in row, ListOfValues))]
This should give you a list of the matched rows (it does not include the header row unless the header itself matches).
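Put together, a runnable sketch (assuming ListfromCSV.txt holds the rows shown above, header included):
import csv

ListOfValues = ['smoking', 'hard smoker', 'alcoholic']

with open('ListfromCSV.txt', 'r') as f:
    LineReader = csv.reader(f, delimiter=',')
    header = next(LineReader)              # keep the header row out of the matching
    matched = [row for row in LineReader
               if any(value in row[2] for value in ListOfValues)]

print(header)
for row in matched:
    print(row)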

Related

Compare data between two csv files and count how many rows have the same data

Let's say I have list of all OUs (AllOU.csv):
NEWS
STORE
SPRINKLES
ICECREAM
I want to look through a csv file (samplefile.csv) at the third column, called 'column3', and check each row to see whether it matches one of the OUs in AllOU.csv.
Then I want to sort them and count how many rows each one has.
This is how the column looks:
column3
CN=Clark Kent,OU=news,dc=company,dc=com
CN=Mary Poppins,OU=ice cream, dc=company,dc=com
CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com
CN=Pepper Jack,OU=store,OU=tv,dc=company,dc=com
CN=Monty Python,OU=store,dc=company,dc=com
CN=Anne Potts,OU=sprinkles,dc=company,dc=com
I want to sort them out like this (or a list):
CN=Clark Kent,OU=news,dc=company,dc=com
CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com
CN=Pepper Jack,OU=tv,OU=store,dc=company,dc=com
CN=Monty Python,OU=store,dc=company,dc=com
CN=Mary Poppins,OU=ice cream, dc=company,dc=com
CN=Anne Potts,OU=sprinkles,dc=company,dc=com
This is what the final output should be:
2, news
2, store
1, icecream
1, sprinkles
Maybe a list would be a good way of sorting them? Like this?
holdingList = [['CN=Clark Kent,OU=news,dc=company,dc=com', 'CN=Mary Jane,OU=news,OU=tv,dc=company,dc=com'],
               ['CN=Pepper Jack,OU=tv,OU=store,dc=company,dc=com', 'CN=Monty Python,OU=store,dc=company,dc=com'],
               ['CN=Mary Poppins,OU=ice cream, dc=company,dc=com'],
               ['CN=Anne Potts,OU=sprinkles,dc=company,dc=com']]
I had something like this so far:
import pandas as pd

file = open('samplefile.csv')
df = pd.read_csv(file, usecols=['column3'])
# file of all OUs
file2 = open('ALLOU.csv')
OUList = pd.read_csv(file2, header=None)
for OU in OUList[0]:
    df_dept = df[df['column3'].str.contains(f'OU={OU}')].count()
    print(OU, df_dept)
Read your file first and create a list of dicts:
[{'CN': 'Clark Kent', 'OU': 'news', 'dc': 'com'}, ... {...}]
(Note that a plain dict cannot hold repeated keys such as the two dc= parts; the last value wins.) Once you have created the list you can convert it to a data frame and then apply all the grouping, sorting and other abilities of pandas.
To achieve this, first read your file contents into a variable, say filedata, then split it into lines: lines = filedata.split('\n'). Now loop over each line:
dataList = []
for line in lines:
    item = dict()
    elements = line.split(',')
    for element in elements:
        key_value = element.split('=')
        item[key_value[0]] = key_value[1]  # repeated keys (OU, dc) overwrite earlier ones
    dataList.append(item)
print(dataList)
Now you can load this into a pandas dataframe and apply sorting and grouping. Once you have structured the data frame, you can simply search for each key from the other file in this dataframe and get your numbers.
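For instance, a minimal sketch of the counting step with pandas alone (file names taken from the question; it lower-cases and strips spaces so ICECREAM can match OU=ice cream):
import pandas as pd

df = pd.read_csv('samplefile.csv', usecols=['column3'])
ou_list = pd.read_csv('AllOU.csv', header=None)[0]

# normalize case and spaces so 'ICECREAM' can match 'OU=ice cream'
col = df['column3'].str.lower().str.replace(' ', '', regex=False)
counts = {ou: int(col.str.contains('ou=' + ou.lower(), regex=False).sum())
          for ou in ou_list}

for ou, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{n}, {ou.lower()}')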

Pandas DF.output write to columns (current data is written all to one row or one column)

I am using Selenium to extract data from the HTML body of a webpage and am writing the data to a .csv file using pandas.
The data is extracted and written to the file; however, I would like to control the formatting so the data is written to specified columns. After reading many threads and docs I have not been able to understand how to do this.
The current CSV file output is as follows, all data in one row or one column:
0,
B09KBFH6HM,
dropdownAvailable,
90,
1,
B09KBNJ4F1,
dropdownAvailable,
100,
2,
B09KBPFPCL,
dropdownAvailable,
110
or, if I use the count += 1 method, it all ends up in one row:
0,B09KBFH6HM,dropdownAvailable,90,1,B09KBNJ4F1,dropdownAvailable,100,2,B09KBPFPCL,dropdownAvailable,110
I would like the output to be formatted as follows:
/col1 /col2 /col3 /col4
0, B09KBFH6HM, dropdownAvailable, 90,
1, B09KBNJ4F1, dropdownAvailable, 100,
2, B09KBPFPCL, dropdownAvailable, 110
I have tried using the columns= option but get errors in the terminal, and I don't understand from the append docs which feature I should be using to achieve this:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html?highlight=append#pandas.DataFrame.append
A simplified version is as follows:
from selenium import webdriver
import pandas as pd

price = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    price.append(element.get_attribute("value"))
    price.append(element.get_attribute("class"))
    price.append(element.get_attribute("data-a-html-content"))
output = pd.DataFrame(price)
output.to_csv("Data.csv", encoding='utf-8-sig')
driver.close()
Do I need to parse each item separately and append?
I would like each of the .get_attribute values to be written to a new column.
Is there any advice you can offer for a solution to this, as I am not very proficient with pandas? Thank you for your help.
Approach similar to #user17242583's, but a little shorter:
data = [[e.get_attribute("value"), e.get_attribute("class"), e.get_attribute("data-a-html-content")] for e in options]
df = pd.DataFrame(data, columns=['ASIN', 'dropdownAvailable', 'size'])  # the third column is perhaps the product size
df.to_csv("Data.csv", encoding='utf-8-sig')
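Each inner list becomes one row of the frame and columns= supplies the headers, so the CSV comes out in the four-column layout asked for (the dataframe index provides the leading 0, 1, 2 column).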
Adding all your items to the price list is going to cause them all to be in one column. Instead, store a separate list for each column, in a dict, like this (name them whatever you want):
data = {
    'values': [],
    'classes': [],
    'data_a_html_contents': [],
}
...
for element in options:
    data['values'].append(element.get_attribute("value"))
    data['classes'].append(element.get_attribute("class"))
    data['data_a_html_contents'].append(element.get_attribute("data-a-html-content"))
...
output = pd.DataFrame(data)
output.to_csv("Data.csv", encoding='utf-8-sig')
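This works because pd.DataFrame treats each dict key as a column header and each equal-length list as that column's values, so the three attributes land in three separate columns.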
You were collecting the value, class and data-a-html-content attributes and appending them all to the same list, price. Hence the list becomes:
price = [value1, class1, data-a-html-content1, value2, class2, data-a-html-content2, ...]
Hence, within the dataframe everything lands in a single column, one attribute per row.
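Conceptually (using the sample values from the question), the single-column frame looks like this:
                   0
0         B09KBFH6HM
1  dropdownAvailable
2                 90
3         B09KBNJ4F1
4  dropdownAvailable
5                100
...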
Solution
To get value, class and data-a-html-content in separate columns you can adopt either of the two approaches below:
Pass a dictionary to the dataframe.
Pass a list of lists to the dataframe.
While #user17242583 and #h.devillefletcher suggest a dictionary, you can still achieve the same using a list of lists, as follows:
from selenium import webdriver
import pandas as pd

values = []
classes = []
data_a_html_contents = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
    values.append(element.get_attribute("value"))
    classes.append(element.get_attribute("class"))
    data_a_html_contents.append(element.get_attribute("data-a-html-content"))
df = pd.DataFrame(data=list(zip(values, classes, data_a_html_contents)),
                  columns=['Value', 'Class', 'Data-a-Html-Content'])
df.to_csv("Data.csv", encoding='utf-8-sig')
References
You can find a couple of relevant detailed discussions in:
Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe
Python Selenium: How do I print the values from a website in a text file?

Using Counter() function in python

I have an excel file with many rows and columns. I want to do the following: first, filter the rows based on a text match; second, choose a particular column and generate word frequencies for all the words in that column; third, graph the words and frequencies.
I have figured out the first part. My question is how to apply Counter() to a dataframe, since plain Counter(df) returns an error. So I used the following code to convert each row into a list and then applied Counter. When I do this I get a word frequency for each row separately if I use Counter within the for loop, and the frequency for just one row otherwise. However, I want a word count for all the rows put together. I appreciate any inputs. Thanks!
The following is an example data.
product  review
a        Great Product
a        Delivery was fast
a        Product received in good condition
a        Fast delivery but useless product
b        Dont recommend
b        I love it
b        Please dont buy
b        Second purchase
My desired output is like this for product a: (product, 3), (delivery, 2), (fast, 2), etc. My current output is like (great, 1), (product, 1) for the first row.
This is the code I used.
strdata = column.values.tolist()
tokens = [tokenizer.tokenize(str(i)) for i in strdata]
cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words]
    stemmed = [stemmer.stem(i) for i in stopped]
    cleaned_list.append(stemmed)  # append stemmed words to the list
    count = Counter(stemmed)      # counts one row at a time
    print(count.most_common(10))
First, use groupby to concatenate the strings from the same group. Second, apply Counter() on the joined strings.
joined = df.groupby('product', as_index=False).agg({'review': ' '.join})
joined['count'] = joined.apply(lambda x: collections.Counter(x['review'].split(' ')), axis=1)
# print(joined)
product review count
0 a Great Product Delivery was fast Product receiv... {'Great': 1, 'Product': 2, 'Delivery': 1, 'was...
1 b Dont recommend I love it Please dont buy Secon... {'Dont': 1, 'recommend': 1, 'I': 1, 'love': 1,...
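Note: the desired output counts Product and product together, so if you want case-insensitive frequencies, lower-case before splitting, e.g. collections.Counter(x['review'].lower().split(' ')).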
You can use the following function. The idea is to:
group your data by byvar and combine all the words in yvar into a list,
apply Counter and, if you want, select the most common,
explode to get a long-formatted dataframe (easier to analyze afterwards),
keep just the relevant columns (word and count) in a new dataframe:
from collections import Counter
import pandas as pd

def count_words_by(data, yvar, byvar):
    cw = pd.DataFrame({'counter': data
                       .groupby(byvar)
                       .apply(lambda s: ' '.join(s[yvar]).split())
                       .apply(lambda s: Counter(s).most_common())
                       # use .most_common(10) instead if you want only the top 10 words
                       .explode()}
                      )
    cw[['word', 'count']] = pd.DataFrame(cw['counter'].tolist(), index=cw.index)
    cw_red = cw[['word', 'count']].reset_index()
    return cw_red

count_words_by(data=df, yvar="review", byvar="product")
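Since the question also asks for a graph, the long format plugs straight into pandas plotting; a minimal sketch (assuming matplotlib is installed and df holds the review data):
cw = count_words_by(data=df, yvar="review", byvar="product")
# bar chart of word frequencies for product 'a'
cw[cw['product'] == 'a'].plot.bar(x='word', y='count')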
where I assume you start from this:
product  review
a        Great Product
a        Delivery was fast
a        Product received in good condition
a        Fast delivery but useless product
b        Dont recommend
b        I love it
b        Please dont buy
b        Second purchase

Pandas: Each row take a string, separate by commas, and add unique word to list

Sample df:
filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
df = pd.DataFrame(filldata, columns=['Tags', 'Count'])
I want a unique list of the words used in the Tags column. So I'm trying to loop through df, pull each row's Tags, split on ',' and add the words to a list. I could either check and add only unique words, or add them all and then pull the unique ones. I would like a solution for both methods if possible, to see which is faster.
So expected output should be:
5, Blue, Football, Baseball, College, 1993, Green.
I have tried these:
tagslist = df['Tags'][0].split(',')  # to give me initial starting words

def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = tagslist.extend(thesetags)
    return tagslist

tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
and
tagslist = df['Tags'][0].split(',')

def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    for word in thesetags:
        if word not in tagslist:
            tagslist.append(word)

tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
These two are essentially the same with one looking only for unique words. Both of these return a list of 'None'.
I have also tried this:
tagslist = df['Tags'][0].split(',')

def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = list(set(tagslist + thesetags))
    return tagslist

tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
This one is adding unique values from each row, but not the individual words in each row. So even though I tried to split on the comma, it is still treating the entire string as one value instead of using the individual words from it.
Use Series.str.split to split the strings, then use np.hstack to horizontally stack all the lists in column Tags, and finally use np.unique on the stacked array to find the unique elements.
lst = np.unique(np.hstack(df['Tags'].str.split(','))).tolist()
Another possible idea using Series.explode + Series.unique:
lst = df['Tags'].str.split(',').explode().unique().tolist()
Result:
['1993', '5', 'Baseball', 'Blue', 'College', 'Football', 'Green']
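If you want to compare against a plain-Python approach as well, a set comprehension gives the same unique words (order is not preserved, hence the sorted):
unique_tags = sorted({tag for tags in df['Tags'] for tag in tags.split(',')})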

Output row number along with the value in it

I have this code that looks for certain values in a huge csv file: 223.2516 in column 2 of the file, denoted row[2], and 58.053 in column 3, denoted row[3]. The code is set up so that I can find anything close to those values within an established limit. I know the value 223.2516 doesn't exist in the file, so I'm looking for everything that is relatively close, as you can see in the code. The last two commands output all the matching values:
In [54]: [row[2] for row in data if abs(row[2]-223.25)<0.001]
Out[54]:
[223.24945646,
223.25013049,
223.25093125999999,
223.24943973000001,
223.24924296,
223.24958522]
and
In [55]: [row[3] for row in data if abs(row[3]-58.053)<0.001]
Out[55]:
[58.052124569999997,
58.052942659999999,
58.053108100000003,
58.053536250000001,
58.05346918,
58.053109259999999,
58.052188620000003,
58.052528559999999,
58.053201559999998,
58.052009560000002,
58.052036010000002,
58.053623790000003,
58.052450120000003,
58.052405720000003,
58.053431590000002,
58.053709660000003,
58.053117569999998,
58.052511709999997]
The problem I have is that I need both values to be within the same row; I'm not looking for the values independent of each other. The 223 value and the 58.0 value both have to be in the same row, since they're coordinates.
Is there a way to output only those values that are in the same row, or at least print the row number each value is in, along with the value?
Here's my code:
import numpy as np
from matplotlib import *
from pylab import *

data = np.genfromtxt('result.csv', delimiter=',', skip_header=1, dtype=float)
[row[2] for row in data if abs(row[2] - 223.25) < 0.001]
[row[3] for row in data if abs(row[3] - 58.053) < 0.001]
Question looks familiar. Use enumerate. As an example:
data = [[3, 222], [8, 223], [1, 224], [5, 223]]
A = [[ind, row[0], row[1]] for ind, row in enumerate(data) if abs(row[1] - 223) < 1]
print(A)
[[1, 8, 223], [3, 5, 223]]
This way, you get the index and you get the pair of values you want.
Take the idea and convert back to your example. So something like:
[[ind, row] for ind, row in enumerate(data) if abs(row[2] - 223.25) < 0.001]
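And since both coordinates have to match in the same row, just combine the two tests in one condition:
[[ind, row[2], row[3]] for ind, row in enumerate(data)
 if abs(row[2] - 223.25) < 0.001 and abs(row[3] - 58.053) < 0.001]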
#brechmos has it. More explicitly, enumerate allows you to track the implicit index of any iterator as you iterate through it.
phrase = 'green eggs and spam'
for ii, word in enumerate(phrase.split()):
    print("Word %d is %s" % (ii, word))
Output…
Word 0 is green
Word 1 is eggs
Word 2 is and
Word 3 is spam
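Since data comes back from np.genfromtxt as a 2-D array, a vectorized boolean mask is another option; a sketch using the thresholds from the question:
import numpy as np

data = np.genfromtxt('result.csv', delimiter=',', skip_header=1, dtype=float)

# True only where both coordinates are within tolerance on the same row
mask = (np.abs(data[:, 2] - 223.25) < 0.001) & (np.abs(data[:, 3] - 58.053) < 0.001)

print(np.where(mask)[0])   # row numbers of the matches
print(data[mask])          # the matching rows themselves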
