So I'm a beginner, but I want to visualise the mentions of a user using networkx in Python. I already collected all of the tweets I want to look at using the Twitter API and put them into a data frame.
The data frame has all sorts of data about the tweets, but I am most interested in the user (5 users in the DF) and who was mentioned in each of that user's tweets.
+-------+---------------------+
|user |mentioned_user |
+-------+---------------------+
|user1 |jack,peter,anne |
|user2 |sophie |
|user2 |anne,user1 |
+-------+---------------------+
I realize I can extract the data I need using from_pandas_edgelist like so:
test = nx.from_pandas_edgelist(
    df,
    source='user',
    target='mentioned_user',
    edge_attr=True,
    create_using=nx.DiGraph()
)
But what do I do next? I would like to have a plot for each user where the user and mentioned_user are nodes.
Any help is greatly appreciated!
With your code you will create nodes from the strings in the 'user' column and the strings in the 'mentioned_user' column as-is, without splitting the latter into individual users. So you should split the 'mentioned_user' column and iterate through the dataframe manually:
import pandas as pd
import networkx as nx

df = pd.DataFrame({
    'user': ['user1', 'user2', 'user2'],
    'mentioned_user': ['jack,peter,anne', 'sophie', 'anne,user1']
})

# split the comma-separated mentions into a list of users per row
df['splitted_users'] = df['mentioned_user'].apply(lambda x: x.split(','))

G = nx.DiGraph()
for r in df.iterrows():
    for user in r[1]['splitted_users']:
        G.add_edge(r[1]['user'], user)

nx.draw(G, with_labels=True)
will draw you the combined mention graph.
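The question also asks for a separate plot per user. A minimal sketch of one way to do that, assuming matplotlib is available: draw the subgraph formed by each user and the accounts they mention.

import matplotlib.pyplot as plt

# one figure per user: the user node plus everyone that user mentions
for u in df['user'].unique():
    mentioned = list(G.successors(u))   # nodes the user points to
    sub = G.subgraph([u, *mentioned])
    plt.figure()
    plt.title(u)
    nx.draw(sub, with_labels=True)
plt.show()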
I need to summarize data using Python based on parameters from user input and output a summarized file. The actual data have several columns and groups, and the user should be able to configure which summary statistics (mean, median, max, min, quartiles) he/she wants to request for any column and group he/she chooses to summarize. The user input can be a JSON file (my_json_param).
Here is a toy example of what I am trying to achieve: compute summary statistics (mean, median) of columns (Height, Weight) by group (HighSchool). I need to iterate through my_json_param with multiple combinations of parameters and generate one summary file (df_sum) and a separate mapping file (sum_map). Basically, it should be a Python function that takes in df and my_json_param and outputs df_sum.csv plus a mapping file for the summary statistics (sum_map.csv); the headers in sum_map.csv are combinations of the column name and summary statistic, for example sum1 is meanHeight.
import pandas as pd
import json
# Generate toy data to demonstrate
df = pd.DataFrame({
    'HighSchool': ['HS1', 'HS2', 'HS3', 'HS1', 'HS2', 'HS3',
                   'HS1', 'HS2', 'HS3', 'HS1', 'HS2', 'HS3'],
    'Height': [5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 5.4, 6, 6.1],
    'Weight': [110, 120, 130, 140, 150, 160, 180, 190, 120, 135, 155, 185]
})
df
# The summary is based on user input. The user should be able to input column names, HighSchool names and summary statistics in a JSON file like this
my_json_param = json.dumps({'Biometric':["Height","Weight"] , 'HighSchoolName':["HS1","HS2","HS3"],'Summarymeasure':["mean","median"]})
my_json_param
# Output a csv file that looks like this
df_sum = pd.DataFrame({'HighSchool': ['HS1', 'HS2', 'HS3'],
'sum1': [5.400,5.625,5.725],
'sum2': [141.25,153.75,148.75],
'sum3': [5.40,5.65,5.75],
'sum4': [137.5,152.5,145.0],
})
df_sum.style.hide_index()
# Also output a mapping file that maps each summary measure to meaningful headers
sum_map = pd.DataFrame({'sum1': ['meanHeight'],
                        'sum2': ['meanWeight'],
                        'sum3': ['medianHeight'],
                        'sum4': ['medianWeight'],
                        })
sum_map.style.hide_index()
# I know how to compute a single summary measure separately in Python.
df_mean = df.groupby(['HighSchool'])[['Height', 'Weight']].mean()
df_mean
# I need help in reading a JSON file with multiple combinations of parameters and generating one summary file and a separate mapping file.
def createsummary(dataframe, jsonfile, outputfile, mappingfile):
    ...
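One possible sketch of such a function, assuming the JSON parameter string has exactly the keys shown above ('Biometric', 'HighSchoolName', 'Summarymeasure') and that the sumN columns are numbered statistic by statistic, as in the desired df_sum:

import json
import pandas as pd

def createsummary(dataframe, json_param, outputfile, mappingfile):
    # parse the JSON parameter string (keys assumed as in my_json_param above)
    params = json.loads(json_param)
    cols = params['Biometric']
    groups = params['HighSchoolName']
    stats = params['Summarymeasure']

    # keep only the requested groups, then compute every (statistic, column) combination
    grouped = dataframe[dataframe['HighSchool'].isin(groups)].groupby('HighSchool')
    pieces, mapping, i = {}, {}, 1
    for stat in stats:
        for col in cols:
            name = f'sum{i}'
            pieces[name] = grouped[col].agg(stat)
            mapping[name] = [f'{stat}{col}']   # e.g. sum1 -> meanHeight
            i += 1

    df_sum = pd.DataFrame(pieces).reset_index()
    sum_map = pd.DataFrame(mapping)
    df_sum.to_csv(outputfile, index=False)
    sum_map.to_csv(mappingfile, index=False)
    return df_sum, sum_map

# example call with the toy data above
# createsummary(df, my_json_param, 'df_sum.csv', 'sum_map.csv')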
First time on Stack Overflow, so bear with me. Code is below. Basically, df_history is a dataframe with different variables. I am trying to pull the 'close' variable and sort it based on the categorical type of the currency.
When I pull data over using the .query command, it gives me one object with all the individual observations together, separated by a space. I know how to separate that back into independent data, but the issue is that it is pulling the index count along with the observations. In the image you can see 179, 178, 177 etc. in the BTC object. I don't want that there and didn't intend to pull it. How do I get rid of it?
additional_rows = []
for currency in selected_coins:
    df_history = df_history.sort_values(['date'], ascending=True)
    row_data = [currency,
                df_history.query("granularity == 'daily' and currency == @currency")['close'],
                df_history.query("granularity == 'daily' and currency == @currency").head(180)['close'].pct_change(),
                df_history['date']
                ]
    additional_rows.append(row_data)

df_additional_info = pd.DataFrame(additional_rows, columns=['currency',
                                                            'close',
                                                            'returns',
                                                            'df_history'])
df_additional_info.set_index('currency').transpose()
import ast
list_of_lists = df_additional_info.close.to_list()
flat_list = [i for sublist in list_of_lists for i in ast.literal_eval(sublist)]
uniq_list = list(set(flat_list))
len(uniq_list),len(flat_list)
I was trying to pull data from one data frame to the next and sort it based on a categorical input from the currency variable. It is not transferring over well
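A likely cause, sketched under the assumption that df_history and selected_coins are as described: storing a pandas Series inside a DataFrame cell keeps the Series' index, which is where the 179, 178, 177 labels come from. Converting the queried values to plain lists before appending avoids that:

additional_rows = []
for currency in selected_coins:
    df_history = df_history.sort_values(['date'], ascending=True)
    daily = df_history.query("granularity == 'daily' and currency == @currency")
    row_data = [currency,
                daily['close'].to_list(),                          # plain values, no index labels
                daily.head(180)['close'].pct_change().to_list(),   # same for the returns
                df_history['date'].to_list()]
    additional_rows.append(row_data)

df_additional_info = pd.DataFrame(additional_rows,
                                  columns=['currency', 'close', 'returns', 'df_history'])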
I am new to Python. Can I please seek some help from the experts here?
I wish to construct a dataframe from the https://api.cryptowat.ch/markets/summaries JSON response,
based on the following filter criteria:
Kraken-listed currency pairs (please take note, there are kraken-futures; I don't want those)
Currency paired with USD only, i.e. aaveusd, adausd...
The ideal dataframe I am looking for is shown below (somehow Excel loads this JSON perfectly; screenshot below):
Dataframe_Excel_Screenshot
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
kraken_assets = resp.json()
df = pd.json_normalize(kraken_assets)
print(df)
Output:
result.binance-us:aaveusd.price.last result.binance-us:aaveusd.price.high ...
0 264.48 267.32 ...
[1 rows x 62688 columns]
When I just paste the link in the browser, the JSON response has double quotes ("), but when I get it via Python code, all double quotes (") are changed to single quotes ('). Any idea why? I tried to handle it with json_normalize, but then the response turns into [1 rows x 62688 columns], and I am not sure how to even go about working with 1 row and 62k columns. I don't know how to extract the exact info in the dataframe format I need (please see the Excel screenshot).
Any help is much appreciated. thank you!
The result JSON is a dict. Load this into a dataframe, decode the columns into products & measures, then filter down to the required data:
import requests
import pandas as pd
import numpy as np
# load results into a data frame
df = pd.json_normalize(requests.get("https://api.cryptowat.ch/markets/summaries").json()["result"])
# columns are encoded as product and measure. decode columns and transpose into rows that include product and measure
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product","measure"])
df = df.T
# finally filter down to required data and structure measures as columns
df.loc[df.index.get_level_values("product").str[:7]=="kraken:"].unstack("measure").droplevel(0,1)
sample output:

product           price.last   price.high   price.low    price.change.percentage   price.change.absolute   volume        volumeQuote
kraken:aaveaud    347.41       347.41       338.14       0.0274147                 9.27                    1.77707       613.281
kraken:aavebtc    0.008154     0.008289     0.007874     0.0219326                 0.000175                403.506       3.2797
kraken:aaveeth    0.1327       0.1346       0.1327       -0.00673653               -0.0009                 287.113       38.3549
kraken:aaveeur    219.87       226.46       209.07       0.0331751                 7.06                    1202.65       259205
kraken:aavegbp    191.55       191.55       179.43       0.030559                  5.68                    6.74476       1238.35
kraken:aaveusd    259.53       267.48       246.64       0.0339841                 8.53                    3623.66       929624
kraken:adaaud     1.61792      1.64602      1.563        0.0211692                 0.03354                 5183.61       8366.21
kraken:adabtc     3.757e-05    3.776e-05    3.673e-05    0.0110334                 4.1e-07                 252403        9.41614
kraken:adaeth     0.0006108    0.00063      0.0006069    -0.0175326                -1.09e-05               590839        367.706
kraken:adaeur     1.01188      1.03087      0.977345     0.0209986                 0.020811                1.99104e+06   1.98693e+06
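The question also asks to keep only the pairs quoted in USD on kraken (and to exclude kraken-futures). A hedged tweak of the last filter line, assuming the product strings keep the kraken:<pair> shape shown above:

products = df.index.get_level_values("product")
# the kraken: prefix excludes kraken-futures markets; the usd suffix keeps USD-quoted pairs only
kraken_usd = df.loc[products.str.startswith("kraken:") & products.str.endswith("usd")]
kraken_usd.unstack("measure").droplevel(0, 1)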
Hello, try the code below. I have understood the structure of the dataset and modified the code to get the desired output.
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
a=resp.json()
a['result']
#creating Dataframe froom key=result
da=pd.DataFrame(a['result'])
#using Transpose to get required Columns and Index
da=da.transpose()
#price columns contains a dict which need to be seperate Columns on the data frame
db=da['price'].to_dict()
da.drop('price', axis=1, inplace=True)
#intialising seperate Data frame for price
z=pd.DataFrame({})
for i in db.keys():
i=pd.DataFrame(db[i], index=[i])
z=pd.concat([z,i], axis=0 )
da=pd.concat([z, da], axis=1)
da.to_excel('nex.xlsx')`
I have a 12GB JSON file where every line contains information about a scientific paper.
I want to parse it and create 3 pandas dataframes that contain information about venues, authors, and how many times an author has published in a venue. Below you can see the code I have written. My problem is that this code needs many days to run. Is there a way to make it faster?
import json
import ijson
import pandas as pd

venues = pd.DataFrame(columns=['id', 'raw', 'type'])
authors = pd.DataFrame(columns=['id', 'name'])
main = pd.DataFrame(columns=['author_id', 'venue_id', 'number_of_times'])

with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    papers = ijson.items(infile, 'item')
    for paper in papers:
        if 'id' not in paper["venue"]:
            if 'type' not in paper["venue"]:
                venues = venues.append({'raw': paper["venue"]["raw"]}, ignore_index=True)
            else:
                venues = venues.append({'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        else:
            venues = venues.append({'id': paper["venue"]["id"], 'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        paper_authors = paper["authors"]
        paper_authors_json = json.dumps(paper_authors)
        obj = ijson.items(paper_authors_json, 'item')
        for author in obj:
            authors = authors.append({'id': author["id"], 'name': author["name"]}, ignore_index=True)
            main = main.append({'author_id': author["id"], 'venue_raw': venues.iloc[-1]['raw'], 'number_of_times': 1}, ignore_index=True)

authors = authors.drop_duplicates(subset=None, keep='first', inplace=False)
venues = venues.drop_duplicates(subset=None, keep='first', inplace=False)
main = main.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False).sum()
Apache Spark allows reading JSON files in multiple chunks in parallel to make this faster:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
For a regular multi-line JSON file, set the multiLine parameter to True.
If you're not familiar with Spark, you can use a pandas-compatible layer on top of Spark called Koalas:
https://koalas.readthedocs.io/en/latest/
Koalas read_json call:
https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_json.html
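A minimal sketch, assuming Koalas is installed on top of a working Spark setup and that the file is newline-delimited JSON at the path used in the question:

import databricks.koalas as ks

# Spark reads the newline-delimited JSON in parallel; the result behaves like a pandas DataFrame
papers = ks.read_json(r'C:\Users\dintz\Documents\test.json')
print(papers.head())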
You are using the wrong tool to accomplish this task; do not use pandas for this scenario.
Let's look at the last 3 lines of your code: they are simple and clean, but how to fill these data into a pandas dataframe is not so easy when you cannot use a pandas input function such as read_json() or read_csv().
I prefer to use pure Python for this simple task. If your PC has sufficient memory, use dicts to get the unique authors and venues, use itertools.groupby for grouping, and use more_itertools.ilen to calculate the count. A sketch of the counting step follows after the snippet below.
authors = {}
venues = {}
for paper in papers:
    venues[paper["venue"]["id"]] = (paper["venue"]["raw"], paper["venue"]["type"])
    for author in paper["authors"]:
        authors[author["id"]] = author["name"]
I am currently working on a project. The idea is to extract tweets (with geo enabled) from a hashtag and to print a map (with Folium). Inside the map, I should have markers according to the user locations, and when I click on a marker I should see the text of the tweet. But currently I only have a map and the markers.
This is my code:
import pandas as pd
import folium, ast

locations = pd.read_csv('tweets.csv', usecols=[3]).dropna()
l_locations = []
for loc in locations.values:
    l_locations.append(ast.literal_eval(loc[0])['coordinates'])

print_tweet_map = folium.Map(location=[48.864716, 2.349014], zoom_start=8, tiles='Mapbox Bright')

for geo in l_locations:
    folium.Marker(location=geo).add_to(print_tweet_map)

print_tweet_map.save('index.html')
Can you guys help me to print the markers and the text details of the tweets?
Thanks in advance.
PS: here are some lines of the CSV file:
created_at,user,text,geo
2017-09-30 15:28:56,yanceustie,"RT #ChelseaFC: .#AlvaroMorata and #marcosalonso03 have been checking out the pitch ahead of kick-off..., null
2017-09-30 15:48:18,trendinaliaVN,#CHEMCI just started trending with 17632 tweets. More trends at ... #trndnl,"{'type': 'Point', 'coordinates': [21.0285, 105.8048]}"
Examine read_csv closely and then use it to retrieve the tweet text as well. Reading the Folium documentation, the popup argument seems the most relevant place to put each tweet's text.
Also, you iterate over the same things twice. You can reduce those into one loop that places a tweet on the map (imagine the map as the empty list you were appending to); no need to be overly sequential.
import pandas as pd
import folium, ast

# use the header row from the CSV (created_at,user,text,geo) and keep only the two needed columns
frame = pd.read_csv('tweets.csv', usecols=['text', 'geo']).dropna()

print_tweet_map = folium.Map(location=[48.864716, 2.349014], zoom_start=8, tiles='Mapbox Bright')

for index, item in frame.iterrows():
    loc = item.get('geo')
    text = item.get('text')
    geo = ast.literal_eval(loc)['coordinates']   # loc is already the geo string, not a list
    folium.Marker(location=geo, popup=text) \
        .add_to(print_tweet_map)

print_tweet_map.save('index.html')
This should work, or very close to work, but I don't have a proper computer handy for testing.