Check whether data columns are the same or not with Pandas - python

I want to check whether the columns 'tokenizing' and 'lemmatization' are the same or not, as in the table below. But my code gives me an error.
tokenizing                                     lemmatization                          check
[pergi, untuk, melakukan, penanganan, banjir]  [pergi, untuk, laku, tangan, banjir]   False
[baca, buku, itu, asik]                        [baca, buku, itu, asik]                True
from spacy.lang.id import Indonesian
import pandas as pd

nlp = Indonesian()
nlp.add_pipe('lemmatizer')
nlp.initialize()

data = [
    'pergi untuk melakukan penanganan banjir',
    'baca buku itu asik'
]
df = pd.DataFrame({'text': data})

# Tokenization
def tokenizer(words):
    return [token for token in nlp(words)]

# Lemmatization
def lemmatizer(token):
    return [lem.lemma_ for lem in token]

df['tokenizing'] = df['text'].apply(tokenizer)
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

# Check similarity
df.to_clipboard(sep='\s\s+')
df['check'] = df['tokenizing'].eq(df['lemmatization'])
df
How to compare?
Result before the error (df.to_clipboard()):
                                      text                                     tokenizing                         lemmatization
0  pergi untuk melakukan penanganan banjir  [pergi, untuk, melakukan, penanganan, banjir]  [pergi, untuk, laku, tangan, banjir]
1                       baca buku itu asik                        [baca, buku, itu, asik]               [baca, buku, itu, asik]
Update
The error is fixed; it was caused by a typo. But after fixing the typo, the check column is all False. What I want is the result shown in the table above.

Based on your code, you forgot the i in df['lemmatizaton'].
So change
df['lemmatizaton'] = df['tokenizing'].apply(lemmatizer)
to
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)
Then it should work.
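As for the result being all False after the typo fix: df['tokenizing'] holds spaCy Token objects while df['lemmatization'] holds plain strings, so an elementwise equality check can never match. A minimal sketch of one way to compare them, converting the tokens to their text first:
# Compare the token texts (strings) against the lemma strings, row by row
df['check'] = df.apply(
    lambda row: [t.text for t in row['tokenizing']] == row['lemmatization'],
    axis=1
)
This yields False for the first row (melakukan != laku, penanganan != tangan) and True for the second, matching the desired table.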

Related

Create datasets based on authors from another dataset

I have a dataset in the following format
dt =
    text     author     title
    -------------------------------------
    text0    author0    title0
    text1    author1    title1
    ...      ...        ...
and I would like to create separate datasets, each containing only the texts of one author. For example, the dataset dt1 contains the texts of author1, dt2 contains the texts of author2, etc.
I would be grateful if you could help me with this using Python.
Update:
dt =
   text                             author    title
   -------------------------------------------------
0  I would like to go to the beach  George    Beach
1  I was in park few days ago       Nick      Park
2  I would like to go in uni        Peter     University
3  I have be in the airport at 8    Maria     Airport
Please try this; it is what I understand you require.
import pandas as pd

data = {
    'text':   ['text0', 'text1', 'text2'],
    'author': ['author0', 'author1', 'author1'],
    'title':  ['Comunicación', 'Administración', 'Ventas']
}
df = pd.DataFrame(data)

df1 = df[df["author"]=="author0"]
df2 = df[df["author"]=="author1"]
print(df1)
print(df2)
Update:
import pandas as pd

data = {
    'text':   ['text0', 'text1', 'text2'],
    'author': ['author0', 'author1', 'author1'],
    'title':  ['Comunicación', 'Administración', 'Ventas']
}
df = pd.DataFrame(data)

df1 = df[df["author"]=="author0"]
df2 = df[df["author"]=="author1"]

list_author = df['author'].unique().tolist()
for x in list_author:
    a = df[df["author"]==x]
    print(a)
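If separate variables are not strictly needed, a common alternative (a sketch, not the answer's exact approach) is to keep one DataFrame per author in a dict built with groupby:
# One DataFrame per author, keyed by the author name
dfs_by_author = {author: group for author, group in df.groupby('author')}
print(dfs_by_author['author1'])
This scales to any number of authors without creating df1, df2, ... by hand.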

Pandas: create a function to delete links

I need a function to delete links from my oldText column (more than 1000 rows) in a pandas DataFrame.
I've created it using regex, but it doesn't work. This is my code:
def remove_links(text):
    text = re.sub(r'http\S+', '', text)
    text = text.strip('[link]')
    return text

df['newText'] = df['oldText'].apply(remove_links)
I get no error; the code just does nothing.
Your code is working for me:
CSV:
oldText
https://abc.xy/oldText asd
https://abc.xy/oldTe asd
https://abc.xy/oldT
https://abc.xy/old
https://abc.xy/ol
Code:
import pandas as pd
import re

def remove_links(text):
    text = re.sub(r'http\S+', '', text)
    text = text.strip('[link]')
    return text

df = pd.read_csv('test2.csv')
df['newText'] = df['oldText'].apply(remove_links)
print(df)
Result:
oldText newText
0 https://abc.xy/oldText asd asd
1 https://abc.xy/oldTe asd asd
2 https://abc.xy/oldT
3 https://abc.xy/old
4 https://abc.xy/ol
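One caveat about the original function: str.strip('[link]') does not remove the literal substring "[link]"; it strips any of the characters [, l, i, n, k, ] from both ends of the string, which can also eat legitimate leading or trailing letters. If the goal is to drop a literal [link] marker, a regex substitution is safer (a sketch):
import re

def remove_links(text):
    text = re.sub(r'http\S+', '', text)    # remove URLs
    text = re.sub(r'\[link\]', '', text)   # remove literal "[link]" markers
    return text.strip()                    # trim leftover whitespace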

Trying to create a streamlit app that uses user-provided URLs to scrape and return a downloadable df

I'm trying to use this create_df() function in Streamlit to gather a list of user-provided URLs called "recipes" and loop through each URL to return a df I've labeled "res" towards the end of the function. I've tried several approaches with the Streamlit syntax, but I just cannot get this to work, as I'm getting this error message:
recipe_scrapers._exceptions.WebsiteNotImplementedError: recipe-scrapers exception: Website (h) not supported.
Have a look at my entire repo here. The main.py script works just fine once you've installed all requirements locally, but when I try running the same script with Streamlit syntax in the streamlit.py script, I get the above error. Once you run streamlit run streamlit.py in your terminal and have a look at the UI I've created, it should be quite clear what I'm aiming at: providing the user with a csv of all ingredients in the recipe URLs they provided, for a convenient grocery shopping list.
Any help would be greatly appreciated!
def create_df(recipes):
    """
    Description:
        Creates one df with all recipes and their ingredients
    Arguments:
        * recipes: list of recipe URLs provided by user
    Comments:
        Note that ingredients with qualitative amounts e.g., "scheutje melk",
        "snufje zout" have been omitted from the ingredient list
    """
    df_list = []
    for recipe in recipes:
        scraper = scrape_me(recipe)
        recipe_details = replace_measurement_symbols(scraper.ingredients())

        recipe_name = recipe.split("https://www.hellofresh.nl/recipes/", 1)[1]
        recipe_name = recipe_name.rsplit('-', 1)[0]
        print("Processing data for " + recipe_name + " recipe.")

        for ingredient in recipe_details:
            try:
                df_temp = pd.DataFrame(columns=['Ingredients', 'Measurement'])
                df_temp[str(recipe_name)] = recipe_name

                ing_1 = ingredient.split("2 * ", 1)[1]
                ing_1 = ing_1.split(" ", 2)

                item = ing_1[2]
                measurement = ing_1[1]
                quantity = float(ing_1[0]) * 2

                df_temp.loc[len(df_temp)] = [item, measurement, quantity]
                df_list.append(df_temp)
            except (ValueError, IndexError) as e:
                pass

    df = pd.concat(df_list)

    print("Renaming duplicate ingredients e.g., Kruimige aardappelen, Voorgekookte halve kriel met schil -> Aardappelen")
    ingredient_dict = {
        'Aardappelen': ('Dunne frieten', 'Half kruimige aardappelen', 'Voorgekookte halve kriel met schil',
                        'Kruimige aardappelen', 'Roodschillige aardappelen', 'Opperdoezer Ronde aardappelen'),
        'Ui': ('Rode ui',),  # trailing commas keep single entries as tuples, not strings
        'Kipfilet': ('Kipfilet met tuinkruiden en knoflook',),
        'Kipworst': ('Gekruide kipworst',),
        'Kipgehakt': ('Gemengd gekruid gehakt', 'Kipgehakt met Mexicaanse kruiden', 'Half-om-halfgehakt met Italiaanse kruiden',
                      'Kipgehakt met tuinkruiden'),
        'Kipshoarma': ('Kalkoenshoarma',)
    }
    reverse_label_ing = {x: k for k, v in ingredient_dict.items() for x in v}
    df["Ingredients"].replace(reverse_label_ing, inplace=True)

    print("Assigning ingredient categories")
    category_dict = {
        'brood': ('Biologisch wit rozenbroodje', 'Bladerdeeg', 'Briochebroodje', 'Wit platbrood'),
        'granen': ('Basmatirijst', 'Bulgur', 'Casarecce', 'Cashewstukjes',
                   'Gesneden snijbonen', 'Jasmijnrijst', 'Linzen', 'Maïs in blik',
                   'Parelcouscous', 'Penne', 'Rigatoni', 'Rode kidneybonen',
                   'Spaghetti', 'Witte tortilla'),
        'groenten': ('Aardappelen', 'Aubergine', 'Bosui', 'Broccoli',
                     'Champignons', 'Citroen', 'Gele wortel', 'Gesneden rodekool',
                     'Groene paprika', 'Groentemix van paprika, prei, gele wortel en courgette',
                     'IJsbergsla', 'Kumato tomaat', 'Limoen', 'Little gem',
                     'Paprika', 'Portobello', 'Prei', 'Pruimtomaat',
                     'Radicchio en ijsbergsla', 'Rode cherrytomaten', 'Rode paprika', 'Rode peper',
                     'Rode puntpaprika', 'Rode ui', 'Rucola', 'Rucola en veldsla', 'Rucolamelange',
                     'Semi-gedroogde tomatenmix', 'Sjalot', 'Sperziebonen', 'Spinazie', 'Tomaat',
                     'Turkse groene peper', 'Veldsla', 'Vers basilicum', 'Verse bieslook',
                     'Verse bladpeterselie', 'Verse koriander', 'Verse krulpeterselie', 'Wortel', 'Zoete aardappel'),
        'kruiden': ('Aïoli', 'Bloem', 'Bruine suiker', 'Cranberrychutney', 'Extra vierge olijfolie',
                    'Extra vierge olijfolie met truffelaroma', 'Fles olijfolie', 'Gedroogde laos',
                    'Gedroogde oregano', 'Gemalen kaneel', 'Gemalen komijnzaad', 'Gemalen korianderzaad',
                    'Gemalen kurkuma', 'Gerookt paprikapoeder', 'Groene currykruiden', 'Groentebouillon',
                    'Groentebouillonblokje', 'Honing', 'Italiaanse kruiden', 'Kippenbouillonblokje', 'Knoflookteen',
                    'Kokosmelk', 'Koreaanse kruidenmix', 'Mayonaise', 'Mexicaanse kruiden', 'Midden-Oosterse kruidenmix',
                    'Mosterd', 'Nootmuskaat', 'Olijfolie', 'Panko paneermeel', 'Paprikapoeder', 'Passata',
                    'Pikante uienchutney', 'Runderbouillonblokje', 'Sambal', 'Sesamzaad', 'Siciliaanse kruidenmix',
                    'Sojasaus', 'Suiker', 'Sumak', 'Surinaamse kruiden', 'Tomatenblokjes', 'Tomatenblokjes met ui',
                    'Truffeltapenade', 'Ui', 'Verse gember', 'Visbouillon', 'Witte balsamicoazijn', 'Wittewijnazijn',
                    'Zonnebloemolie', 'Zwarte balsamicoazijn'),
        'vlees': ('Gekruide runderburger', 'Half-om-half gehaktballetjes met Spaanse kruiden', 'Kipfilethaasjes', 'Kipfiletstukjes',
                  'Kipgehaktballetjes met Italiaanse kruiden', 'Kippendijreepjes', 'Kipshoarma', 'Kipworst', 'Spekblokjes',
                  'Vegetarische döner kebab', 'Vegetarische kaasschnitzel', 'Vegetarische schnitzel'),
        'zuivel': ('Ei', 'Geraspte belegen kaas', 'Geraspte cheddar', 'Geraspte grana padano', 'Geraspte oude kaas',
                   'Geraspte pecorino', 'Karnemelk', 'Kruidenroomkaas', 'Labne', 'Melk', 'Mozzarella',
                   'Parmigiano reggiano', 'Roomboter', 'Slagroom', 'Volle yoghurt')
    }
    reverse_label_cat = {x: k for k, v in category_dict.items() for x in v}
    df["Category"] = df["Ingredients"].map(reverse_label_cat)
    col = "Category"
    first_col = df.pop(col)
    df.insert(0, col, first_col)
    df = df.sort_values(['Category', 'Ingredients'], ascending=[True, True])

    print("Merging ingredients by row across all recipe columns using justify()")
    gp_cols = ['Ingredients', 'Measurement']
    oth_cols = df.columns.difference(gp_cols)
    arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False).apply(
        lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, axis=0, side='up')))

    # Reconstruct DataFrame
    # Remove entirely NaN rows based on the non-grouping columns
    res = (pd.DataFrame(arr, columns=df.columns)
           .dropna(how='all', subset=oth_cols, axis=0))
    res = res.fillna(0)
    res['Total'] = res.drop(['Ingredients', 'Measurement'], axis=1).sum(axis=1)
    res = res[res['Total'] != 0]  # To drop rows that are being duplicated with 0 for some reason; will check later

    print("Processing complete!")
    return res
Your function create_df needs a list as an argument, but st.text_input always returns a string.
In your streamlit.py, replace df_download = create_df(recs) with df_download = create_df([recs]). But if you need to handle multiple URLs, you should use str.split like this:
def create_df(recipes):
    recipes = recipes.split(",")  # <--- add this line to make a list from the user input
    ### rest of the code ###

if download:
    df_download = create_df(recs)
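For reference, a minimal sketch of the input handling on the Streamlit side (widget labels and variable names here are hypothetical, not taken from the repo):
import streamlit as st

# Hypothetical widget: one URL per line, split into a list before calling create_df
urls_raw = st.text_area("Recipe URLs, one per line")
recipes = [u.strip() for u in urls_raw.splitlines() if u.strip()]

if st.button("Create shopping list") and recipes:
    res = create_df(recipes)  # create_df as defined in the question
    st.download_button("Download CSV", res.to_csv(index=False), file_name="groceries.csv")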

How to use values from a PANDAS data frame as filter params in regex?

I would like to use the values from a Pandas df as filter params in a SPARQL query.
By reading the data from an Excel file, I create the pandas dataframe:
xls = pd.ExcelFile ('excel/dataset_nuovo.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)
Here is the resulting dataframe:
oggetto descrizione lenght label tipologia
0 #iccd4580759# Figure: putto. Oggetti: ghirlanda di fiori 6 Bad OpereArteVisiva
1 #iccd3636719# Decorazione plastica. 2 Bad OpereArteVisiva
2 #iccd3641475# Scultura.. Figure: angelo 3 Bad OpereArteVisiva
3 #iccd8282504# Custodia di reliquiario in legno intagliato e ... 8 Good OpereArteVisiva
4 #iccd3019633# Portale. 1 Bad OpereArteVisiva
... ... ... ... ... ...
59995 #iccd2274873# Ciotola media a larga tesa. Decorazione in cob... 35 Good OpereArteVisiva
59996 #iccd11189887# Il medaglione bronzeo, sormontato da un'aquila... 85 Good OpereArteVisiva
59997 #iccd4545324# Tessuto di fondo rosaceo. Disegno a fiori e fo... 49 Good OpereArteVisiva
59998 #iccd2934870# Sculture a tutto tondo in legno dipinto di bia... 28 Good OpereArteVisiva
59999 #iccd2685205# Calice con piede a base circolare e nodo ovoid... 14 Bad OpereArteVisiva
Then I need to use the values from the oggetto column as a filter to retrieve, for each record, the related subject from a SPARQL endpoint.
By using this SPARQL query:
SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
WHERE { ?object a crm:E22_Man-Made_Object;
crm:P3_has_note ?description;
crm:P129_is_about ?concept;
crm:P2_has_type ?type.
?concept a crm:E28_Conceptual_Object;
rdfs:label ?subject.
filter( regex(str(?object), "#iccd4580759#" ))
}
I'm able to filter one single record.
object.type object.value ... subjects.type subjects.value
0 uri http://dati.culturaitalia.it/resource/oai-oaic... ... literal Putto con ghirlanda di fiori|Putto con ghirlan..
Since the dataset is 60k records, I would like to automate the process by looping through the dataframe and using each value as a filter, to obtain a new df with a related subject column.
oggetto descrizione subject lenght label tipologia
0 #iccd4580759# Figure: putto. Oggetti: ghirlanda di fiori Putto con ghirlanda di fiori|Putto con ghirlan.. 6 Bad OpereArteVisiva
Here is the entire script I wrote:
import xlrd
import pandas as pd
from pandas import json_normalize
from SPARQLWrapper import SPARQLWrapper, JSON

xls = pd.ExcelFile('excel/dataset_nuovo.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)

def query_ci(sparql_query, sparql_service_url):
    sparql = SPARQLWrapper(sparql_service_url)
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(JSON)
    # ask for the result
    result = sparql.query().convert()
    return json_normalize(result["results"]["bindings"])

sparql_query = """ SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
WHERE { ?object a crm:E22_Man-Made_Object;
        crm:P3_has_note ?description;
        crm:P129_is_about ?concept;
        crm:P2_has_type ?type.
        ?concept a crm:E28_Conceptual_Object;
        rdfs:label ?subject.
        filter( regex(str(?object), "#iccd4580759#" ))
}
"""
sparql_service_url = "http://dati.culturaitalia.it/sparql"
result_table = query_ci(sparql_query, sparql_service_url)
print(result_table)
result_table.to_excel("output.xlsx")
Is it possible to do that?
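One possible way (a sketch, reusing query_ci and sparql_service_url from the script above, and assuming each oggetto value can be interpolated into the filter as-is) is to build the query per row and concatenate the results:
frames = []
for oggetto in df1['oggetto']:
    q = """SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
    WHERE { ?object a crm:E22_Man-Made_Object;
            crm:P3_has_note ?description;
            crm:P129_is_about ?concept;
            crm:P2_has_type ?type.
            ?concept a crm:E28_Conceptual_Object;
            rdfs:label ?subject.
            filter( regex(str(?object), "%s" ))
    }""" % oggetto  # substitute the current oggetto value into the filter
    frames.append(query_ci(q, sparql_service_url))

subjects = pd.concat(frames, ignore_index=True)
Note that this fires one query per row; for 60k records it may be worth batching several identifiers into a single query (e.g. with a VALUES block) to reduce the number of round trips to the endpoint.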

Reading XML file in pandas

I have an XML file that I want to load into pandas. I have tried other XML files, but for this one the shape is unclear; every time it throws an error and I don't end up with the pandas dataframe I need. Any suggestions?
<tweet id="591154373696323584">
  Diga cuanto nos van a costar las
  <sentiment polarity="N" entity="Partido_Popular" aspect="Economia">autovías</sentiment>
  de sus amiguetes ¿4500 millones o más ? #EsperanzAguirre #PPopular
</tweet>
<tweet id="591154532362670080">
  #lhermoso_ #sanchezcastejon
  <sentiment polarity="N" entity="Partido_Socialista_Obrero_Espanol" aspect="Propio_partido">#DobleMoral</sentiment>
  Castilla antes que Aragón...
</tweet>
I am using the code below.
import xml.etree.cElementTree as et
import pandas as pd

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

def main():
    parsed_xml = et.parse("stompol-train-tagged.xml")
    dfcols = ['tweet id', 'tweet', 'sentiment_polarity', 'entity', 'aspect', 'sentiment']
    df_xml = pd.DataFrame(columns=dfcols)
    for node in parsed_xml.getroot():
        tweetid = node.attrib.get('tweetid')
        tweet = node.find('tweet')
        sentiment_polarity = node.find('polarity')
        entity = node.find('entity')
        aspect = node.find('aspect')
        sentiment = node.find('sentiment')
        df_xml = df_xml.append(
            pd.Series([tweetid, tweet, sentiment_polarity, entity, aspect, sentiment], index=dfcols),
            ignore_index=True)
    print(df_xml)

main()
All I get is None values.
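The None values come from looking in the wrong places: the attribute is id (not tweetid), the root's children are the <tweet> elements themselves, and polarity, entity, and aspect are attributes of the nested <sentiment> element, not child elements. A sketch of one way to read it (attribute and tag names taken from the XML sample above):
import xml.etree.ElementTree as et
import pandas as pd

rows = []
root = et.parse("stompol-train-tagged.xml").getroot()
for tweet in root.iter('tweet'):
    sent = tweet.find('sentiment')  # first sentiment annotation in the tweet
    rows.append({
        'tweet id': tweet.get('id'),
        'tweet': ''.join(tweet.itertext()).strip(),  # full text incl. the sentiment span
        'sentiment_polarity': sent.get('polarity') if sent is not None else None,
        'entity': sent.get('entity') if sent is not None else None,
        'aspect': sent.get('aspect') if sent is not None else None,
        'sentiment': sent.text if sent is not None else None,
    })

df_xml = pd.DataFrame(rows)
print(df_xml)
Note this keeps only the first <sentiment> per tweet; if a tweet can carry several annotations, use tweet.findall('sentiment') and emit one row per annotation instead.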
