I am currently working on a project: the idea is to extract tweets (with geo enabled) from a hashtag and to plot a map (with Folium). On the map, I should have markers at the user locations, and when I click a marker I should see the text of the tweet. But currently I only have the map and the markers.
This is my code :
import pandas as pd
import folium, ast
locations = pd.read_csv('tweets.csv', usecols=[3]).dropna()
l_locations = []
for loc in locations.values:
    l_locations.append(ast.literal_eval(loc[0])['coordinates'])
print_tweet_map = folium.Map(location=[48.864716, 2.349014], zoom_start=8, tiles='Mapbox Bright')
for geo in l_locations:
    folium.Marker(location=geo).add_to(print_tweet_map)
print_tweet_map.save('index.html')
Can you help me display the markers along with the text of the tweets?
Thanks in advance.
PS: here are some lines of the CSV file I currently have:
created_at,user,text,geo
2017-09-30 15:28:56,yanceustie,"RT #ChelseaFC: .#AlvaroMorata and #marcosalonso03 have been checking out the pitch ahead of kick-off..., null
2017-09-30 15:48:18,trendinaliaVN,#CHEMCI just started trending with 17632 tweets. More trends at ... #trndnl,"{'type': 'Point', 'coordinates': [21.0285, 105.8048]}"
Take a closer look at read_csv and use it to retrieve the tweet text as well. Reading the folium documentation, the popup argument seems like the most relevant place to put each tweet's text.
Also, you iterate over the same data twice. You can reduce that to a single loop that places each tweet on the map (think of the map as the empty list you were appending to). There is no need to be that sequential.
import pandas as pd
import folium, ast

# The file has a header row (created_at,user,text,geo), so select the columns by name.
frame = pd.read_csv('tweets.csv', usecols=['text', 'geo']).dropna()

print_tweet_map = folium.Map(location=[48.864716, 2.349014], zoom_start=8, tiles='Mapbox Bright')

for index, item in frame.iterrows():
    # item['geo'] is a string like "{'type': 'Point', 'coordinates': [21.0285, 105.8048]}"
    geo = ast.literal_eval(item['geo'])['coordinates']
    folium.Marker(location=geo, popup=item['text']).add_to(print_tweet_map)

print_tweet_map.save('index.html')
This should work, or very close to work, but I don't have a proper computer handy for testing.
I am scraping data with Python. I get a CSV file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve: sometimes the scraped items have two statuses and sometimes just one. The second status shifts the other values to the right, and as a result the dates are not all in the same column, which would be useful for sorting the rows.
Do you have any idea how to merge the columns if there are two statuses, for example, or any other solutions?
Maybe it is also an issue that I still need to separate the values into columns manually in Excel.
Here is my code:
#call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)
# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page='+str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)
#create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
"make the columns merge if there are two statuses for example or other solutions"
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow function (shown further down):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }

    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()

    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu{link}'
    iiarDets['link'] = link

    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like:
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever-present fields as the first three columns, followed by the status + date range columns as pairs. Finally, I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
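As a rough sketch of what that could look like (the CSS selectors here are borrowed from the other answer above and are assumptions, not verified against the live page), each record could be read field by field into a dict, so every value stays in its named column no matter how many statuses appear:
from selenium.webdriver.common.by import By

def extract_item(anchor):
    # anchor is assumed to be one initivative-item's inner <a> element
    def texts(sel):
        return [e.text.strip() for e in anchor.find_elements(By.CSS_SELECTOR, sel)]

    titles = texts('div.search-result-title')
    labels = texts('div.field span.label')
    return {
        'title': titles[0] if titles else None,
        # join multiple statuses into one cell instead of letting them shift columns
        'status': ', '.join(labels),
    }
A DataFrame built from a list of such dicts then keeps the columns aligned regardless of how many statuses an initiative carries.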
I believe Python is the best choice, but I may be wrong.
Below is a sample from a data source in text format in Linux:
TUI,39832020:09:01,10.56| TUI,39832020:10:53,11.23| TUI,39832020:15:40,23.20
DIAN,39832020:09:04,11.56| TUI,39832020:11:45,11.23| DIAN,39832020:12:30,23.20| SLD,39832020:11:45,11.22
The size is unknown, let's presume a million rows.
Each line contains three or more sets delimited by |, and each set has fields separated by ,.
The first field in each set is the product ID. For example, in the sample above, TUI, DIAN, and SLD are product IDs.
I need to find out how many types of products I have in the file. For example, the first line contains 1 unique product (TUI), and the second line contains 3 (DIAN, TUI, and SLD).
In total, on those two lines, we can see there are three unique products.
Can anyone help?
Thank you very much. Any guidance is appreciated.
UPDATE
I would prefer a solution based on Python with Spark, i.e. PySpark.
I'm also looking for statistics like:
total amount of each product;
all records for a given time (the second field in each set, like 39832020:09:01);
minimum and maximum price for each product.
UPDATE 2
Thank you all for the code, I really appreciate it. I wonder if anyone can load the data into an RDD and/or a DataFrame. I know that in Spark SQL it is very simple to obtain those statistics.
Thanks a lot in advance.
Thank you very much.
Similar to Accdias' answer: use a dictionary, read your file line by line, split the data by | and then by ,, and total up the counts in your dictionary.
myFile = "lines_to_read.txt"
productCounts = dict()
with open(myFile, 'r') as linesToRead:
    for thisLine in linesToRead:
        for myItem in thisLine.split("|"):
            productCode = myItem.split(",")
            productCode = productCode[0].strip()
            if productCode in productCounts:
                productCounts[productCode] += 1
            else:
                productCounts[productCode] = 1
print(productCounts)
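The original question about how many types of products there are then falls out of the same dictionary:
# number of distinct product IDs across the whole file
print(len(productCounts))

# products sorted by how often they appear, most frequent first
print(sorted(productCounts.items(), key=lambda kv: kv[1], reverse=True))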
**** Update ****
DataFrame use with pandas so that we can query stats on the data afterwards:
import pandas as pd

myFile = "lines_to_read.txt"
rows = []
with open(myFile, 'r') as linesToRead:
    for thisLine in linesToRead:
        for myItem in thisLine.split("|"):
            thisItem = myItem.strip().split(",")
            # convert the price to float so min/max work numerically
            rows.append({'prodID': thisItem[0], 'timeStamp': thisItem[1], 'prodPrice': float(thisItem[2])})
myData = pd.DataFrame(rows, columns=['prodID', 'timeStamp', 'prodPrice'])

print(myData)                                                        # Full table
print(myData.groupby('prodID').agg({'prodID': 'count'}))             # Total of prodID's
print(myData.loc[myData['timeStamp'] == '39832020:11:45'])           # all lines where time = 39832020:11:45
print(myData.groupby('prodID').agg({'prodPrice': ['min', 'max']}))   # min/max prices
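Since the update mentions preferring PySpark, here is a rough, untested sketch of the same idea with a Spark DataFrame (the SparkSession setup and column names are my assumptions):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("products").getOrCreate()

# read each line as a single string, explode on '|' into sets, then split each set on ','
lines = spark.read.text("lines_to_read.txt")
sets = lines.select(F.explode(F.split("value", r"\|")).alias("item"))
parts = F.split(F.trim(F.col("item")), ",")
df = sets.select(
    parts.getItem(0).alias("prodID"),
    parts.getItem(1).alias("timeStamp"),
    parts.getItem(2).cast("double").alias("prodPrice"),
)

df.groupBy("prodID").count().show()                                      # total amount of each product
df.filter(F.col("timeStamp") == "39832020:11:45").show()                 # all records for a given time
df.groupBy("prodID").agg(F.min("prodPrice"), F.max("prodPrice")).show()  # min/max price per product
Once the data is in that DataFrame, the same statistics can also be obtained through Spark SQL after registering it with df.createOrReplaceTempView("products").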
I'm attempting to create a word cloud based on the scraped text from a specific website. The issue I'm having is with the web scraping portion of this. I've attempted two different ways, and in both attempts I get stuck on how to proceed further.
First method:
Scrape the data for each specific tag into its own data frame
main_content= soup.find("div", attrs= {"class" : "col-md-4"})
main_content2= soup.find("article", attrs= {"class" : "col-lg-7 mid_info"})
comp_service= soup.find("div", attrs= {"class" : "col-md-6 col-lg-4"})
Here I'm stuck on how to combine the three dataframes in order to create the word cloud. This works fine if I use only one of them and put it into 'lists', but I'm unsure how to add the other two into a single one so I can then run the rest of the code. The following is the rest of the code for the word cloud portion:
str = ""
for list in lists:
    info = list.text
    str += info

mask = np.array(Image.open("Desktop/big.png"))
color = ImageColorGenerator(mask)
wordcloud = WordCloud(width=1200, height=1000,
                      max_words=400, mask=mask,
                      stopwords=STOPWORDS,
                      background_color="white",
                      random_state=42).generate(str)
plt.imshow(wordcloud.recolor(color_func=color), interpolation="bilinear")
plt.axis("off")
plt.show()
Attempt 2
I found a piece of code that will extract all of the data from specific tags and put it into text:
i = 0
for lists in soup.find_all(['article','div']):
    print(lists.text)
However, when I attempt to run the rest of the code,
mask = np.array(Image.open("Desktop/big.png"))
color = ImageColorGenerator(mask)
wordcloud = WordCloud(width=1200, height=1000,
                      max_words=400, mask=mask,
                      stopwords=STOPWORDS,
                      background_color="white",
                      random_state=42).generate(str)
plt.imshow(wordcloud.recolor(color_func=color), interpolation="bilinear")
plt.axis("off")
plt.show()
I get 'ValueError: We need at least 1 word to plot a word cloud, got 0.' after running the wordcloud DF code.
I'm essentially just trying to pull all of the data from a website, store that information into a text file, then transform that data into a word cloud.
Please let me know any suggestions or clarifications I can provide.
Thank you.
This ended up working for me
lists = soup.find_all(['article', 'div'])

str = ""
for list in lists:
    info = list.text
    str += info
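If you prefer the first, more targeted approach, the same accumulation should work once the three results are put into one list first (a sketch reusing the variable names from the question):
# combine the three targeted searches, skipping any tag that was not found on the page
sections = [el for el in (main_content, main_content2, comp_service) if el is not None]

text = ""
for section in sections:
    text += section.text
text can then be passed to .generate() in place of str, which also avoids shadowing the built-in names str and list.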
So I'm a beginner but I want to visualise the mentions of a user using networkx in Python. I already collected all of the tweets I want to look at using the Twitter API and put them into a data frame.
The data frame has all sorts of data about the tweets but I am most interested in the user (5 users in the DF) and who was mentioned in the tweet of the user.
+-------+---------------------+
|user |mentioned_user |
+-------+---------------------+
|user1 |jack,peter,anne |
|user2 |sophie |
|user2 |anne,user1 |
+-------+---------------------+
I realize I can extract the data I need using from_pandas_edgelist like so:
test = nx.from_pandas_edgelist(
    df,
    source='user',
    target='mentioned_user',
    edge_attr=True,
    create_using=nx.DiGraph()
)
But what do I do next? I would like to have a plot for each user where the user and mentioned_user are nodes.
Any help is greatly appreciated!
With your code you will create edges from the strings in the 'user' column to the strings in the 'mentioned_user' column as-is, without splitting them into individual users. So you should split the 'mentioned_user' column and iterate through the dataframe manually:
import pandas as pd
import networkx as nx

df = pd.DataFrame({
    'user': ['user1', 'user2', 'user2'],
    'mentioned_user': ['jack,peter,anne', 'sophie', 'anne,user1']
})
df['splitted_users'] = df['mentioned_user'].apply(lambda x: x.split(','))

G = nx.DiGraph()
for r in df.iterrows():
    for user in r[1]['splitted_users']:
        G.add_edge(r[1]['user'], user)

nx.draw(G, with_labels=True)
will draw you the graph, with an edge from each user to every individual mentioned user.
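If you want a separate plot for each user, as mentioned in the question, one possible follow-up (a sketch, assuming matplotlib is installed) is to draw each user's ego graph, i.e. the user node plus everyone they mention:
import matplotlib.pyplot as plt

# one figure per source user: the user plus all of the users they mentioned
for user in df['user'].unique():
    plt.figure()
    nx.draw(nx.ego_graph(G, user), with_labels=True)
    plt.title(user)
plt.show()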
I'm learning how to use scikit-learn (sklearn) to do some machine learning.
I was wondering how to import this as data?
This is a dataset from the million song genre dataset.
How can I make data.target[0] and data.target[1] equal to 0 (which corresponds to "classic pop and rock") and data.target[640] equal to 1 (which corresponds to "folk")?
And how can I make data.data[0,:] equal to -8.697, 155.007, 1, 9, and so forth (all the numerical values after the title column)?
As others have mentioned, it was a little unclear what shape you were looking for, but as a general starting point for getting the data into a very flexible format, you could read the text file into Python and convert it to a pandas DataFrame. I am certain there are more compact ways of doing this, but just to provide clear steps we could start with:
import pandas as pd
import re

file = 'filepath'  # this is the file path to the saved text file
music = open(file, 'r')
lines = music.readlines()

# split the lines by comma
lines = [line.split(',') for line in lines]

# capturing the column line
columns = lines[9]

# capturing the actual content of the data, and dismissing the header info
content = lines[10:]

musicdf = pd.DataFrame(content)

# assign the column names to our dataframe
musicdf.columns = columns

# preview the dataframe
musicdf.head(10)

# the final column had formatting issues, so wanted to provide code to get rid
# of the "\n" in both the column title and the column values
def cleaner(txt):
    txt = re.sub(r'[\n]+', '', txt)
    return txt

# rename the column of issue
musicdf = musicdf.rename(columns={'var_timbre12\n': 'var_timbre12'})

# applying the column cleaning function above to the column of interest
musicdf['var_timbre12'] = musicdf['var_timbre12'].apply(lambda p: cleaner(p))

# checking the top and bottom of dataframe for column var_timbre12
musicdf['var_timbre12'].head(10)
musicdf['var_timbre12'].tail(10)
the result of this would be the following:
%genre track_id artist_name
0 classic pop and rock TRFCOOU128F427AEC0 Blue Oyster Cult
1 classic pop and rock TRNJTPB128F427AE9F Blue Oyster Cult
By having the data in this format, you can now do lots of grouping tasks, finding certain genres and their relative attributes, etc. using pandas groupby function.
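To get something closer to the data.target / data.data shape asked about in the question, one possible follow-up is to encode the genre column as integer codes and take the numeric columns as a matrix (the column name '%genre' and the starting position of the numeric features are assumptions about this file's layout):
# encode each genre as an integer code, e.g. "classic pop and rock" -> 0, "folk" -> 1
target, genre_names = pd.factorize(musicdf['%genre'])

# treat everything after the metadata columns as the numeric feature matrix;
# the starting position (4 here) may need adjusting for this file
data = musicdf.iloc[:, 4:].astype(float).values

print(genre_names[target[0]])   # genre name of the first track
print(data[0, :])               # numeric feature vector of the first track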
Hope this helps!