Twitterscraper: Adding tweet country info to scraped dataframe - python

I am using twitterscraper from https://github.com/taspinar/twitterscraper to scrape around 20k tweets created since 2018. Tweet locations are not extracted with the default settings. However, tweets written from a location can be searched with advanced queries placed within quotes, e.g. "#hashtagofinterest near:US".
So I plan to loop through a list of country codes (alpha-2) to filter the tweets by country and add that country info to my search result. Initial attempts were made on small samples of tweets from the past 10 days.
#set arguments
begin_date = dt.date(2020, 4, 1)
end_date = dt.date(2020, 4, 11)
lang = 'en'
#define queries
queries = [(f'(#hashtagA OR #hashtagB near:{country})', country) for country in alpha_2]
#initiate queries
dfs = []
for query, country in queries[:10]:  # trying on first 10 countries
    temp = query_tweets(query, begindate=begin_date, enddate=end_date, lang=lang)
    temp = pd.DataFrame(t.__dict__ for t in temp)
    temp["country"] = [country] * len(temp)
    dfs.append((temp, country))
I managed to add the country info as a new column in each per-country DataFrame. However, I am stuck at combining the query results into one DataFrame: pd.concat() fails with a shape error about 22 columns versus passed data of 2 columns, because each element of dfs has 2 items (a tuple) rather than being a DataFrame. My intended result is a single DataFrame with a new country column added to the default 21 columns (22 columns in total).

Since dfs is a list of tuples, with each tuple being (DataFrame, str), you only want to concatenate the first element of each tuple.
You may achieve this using:
concat_df = pd.concat([df for df, _ in dfs], ignore_index=True)
which will create a new list of only the DataFrames and concatenate those. I have added ignore_index=True so that the rows will be re-indexed in the concatenated DataFrame.
Since the country is already stored in the DataFrame, you could also not add this to dfs and only append temp instead:
dfs = []
for query, country in queries[:10]:  # trying on first 10 countries
    temp = query_tweets(query, begindate=begin_date, enddate=end_date, lang=lang)
    temp = pd.DataFrame(t.__dict__ for t in temp)
    temp["country"] = [country] * len(temp)
    dfs.append(temp)
concat_df = pd.concat(dfs, ignore_index=True)
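To see the difference on toy data (hypothetical values, no scraping required), the following sketch mimics the structure of dfs and shows that concatenating only the DataFrame part of each tuple produces the single combined frame:

```python
import pandas as pd

# Toy stand-ins for the per-country query results (hypothetical data,
# 3 columns instead of the real 22)
dfs = []
for country in ["US", "GB"]:
    temp = pd.DataFrame({"text": [f"tweet from {country}"], "likes": [1]})
    temp["country"] = country
    dfs.append((temp, country))

# Concatenating only the first element of each tuple works;
# passing dfs itself would fail because the tuples are not DataFrames.
concat_df = pd.concat([df for df, _ in dfs], ignore_index=True)
print(concat_df.shape)  # (2, 3)
```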


parse xlsx file having merged cells using python or pyspark

I want to parse an xlsx file. Some of the cells in the file are merged and act as headers for the values beneath them, but I don't know which approach to take to parse the file.
Should I convert the file from xlsx to JSON format and then perform the pivoting or transformation of the dataset?
Or should I proceed with the xlsx format and try to read specific cell values? I believe this approach will not make the code scalable and dynamic.
I tried to parse the file and convert it to JSON, but it did not load all the records; unfortunately, it does not throw any exception.
from json import dumps
from xlrd import open_workbook

# load excel file
wb = open_workbook('/dbfs/FileStore/tables/filename.xlsx')
# get sheet by using sheet name
sheet = wb.sheet_by_name('Input Format')
# get total rows and columns
total_rows = sheet.nrows
total_columns = sheet.ncols
# convert each row of the sheet into a dictionary and append to a list
lst = []
for i in range(0, total_rows):
    row = {}
    for j in range(0, total_columns):
        if i + 1 < total_rows:
            column_name = sheet.cell(rowx=0, colx=j)
            row_data = sheet.cell_value(rowx=i + 1, colx=j)
            row.update({column_name.value: row_data})
    if len(row):
        lst.append(row)
# convert into json
json_data = dumps(lst)
print(json_data)
After executing the above code I received following type of output:
{
"Analysis": "M000000000000002001900000000000001562761",
"KPI": "FELIX PARTY.MIX",
"": 2.9969042460942
},
{
"Analysis": "M000000000000002001900000000000001562761",
"KPI": "FRISKIES ESTERILIZADOS",
"": 2.0046260994622
},
Once the data is in good shape, Spark on Databricks will be used for the transformation.
I have tried multiple approaches without success, so I am seeking help from the community.
For more clarity on the question, I had attached sample input/output screenshots and a link to download the actual dataset and expected output.
To get the month column as per the requirement, you can use the following code:
import pandas as pd
for_cols = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl', skiprows=2,nrows=1)
main_cols = [for_cols[req][0] for req in for_cols if type(for_cols[req][0])==type('x')] #getting main header column names
#print(main_cols)
for_dates = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl',skiprows=4,usecols="C:R")
dates = for_dates.columns.to_list() #getting list of month names to be used
#print(dates)
pdf = pd.read_excel('/dbfs/FileStore/HHP.xlsx', engine='openpyxl',skiprows=4) #reading the file without main headers
#pdf
#all the columns i.e., 2021 Jan will be labeled differently like 2021 Jan, 2021 Jan.1, 2021 Jan.2 and so on. So the following code will create an array of arrays where each of the child array will be used to create a new small dataframe. All these new dataframes will be combined to a single dataframe (union).
req_cols = []
for i in range(len(main_cols)):
    current_dates = ['Market', 'Product']
    if i != 0:
        for d in dates:
            current_dates.append(d + f'.{i}')
    else:
        current_dates.extend(dates)
    req_cols.append(current_dates)
print(req_cols)
#the following code will combine the dataframe to remove multiple yyyy MMM columns. Also added a column `stype` whose name would help identify to which main header column does the month belongs to for each product.
mydf = pdf[req_cols[0]]
mydf['stype'] = main_cols[0]
#display(mydf)
for i in range(1, len(req_cols)):
    temp = pdf[req_cols[i]]
    #print(temp.columns)
    temp['stype'] = main_cols[i]
    # renaming columns, i.e., changing 2021 Jan.1 and such to just 2021 Jan
    rename_cols = {'Market': 'Market', 'Product': 'Product', 'stype': 'stype'}
    for j in req_cols[i][2:]:
        rename_cols[j] = j[:8]  # if j is 2021 Jan.3, then j[:8] gives the actual name (2021 Jan)
    #print(rename_cols)
    temp.rename(columns=rename_cols, inplace=True)
    mydf = pd.concat([mydf, temp])  # combining the child dataframes into the main dataframe
mydf
tp = mydf[['Market','Product','2021 Jan','stype']]
req_df = tp.pivot(index=['Product','Market'],columns='stype', values='2021 Jan') #now pivoting the `stype` column
req_df['month'] = ['2021 Jan']*len(req_df) #initialising the month column
req_df.reset_index(inplace=True) #converting index columns to actual columns.
req_df #required data format for 2021 Jan.
#using the following code to get the required result. Do it separately for each of the dates and then combine into `req_df`
for dt in dates[1:]:
    tp = mydf[['Market', 'Product', dt, 'stype']]
    tp1 = tp.pivot(index=['Product', 'Market'], columns='stype', values=dt)
    tp1['month'] = [dt] * len(tp1)
    tp1.reset_index(inplace=True)
    req_df = pd.concat([req_df, tp1])
display(req_df[(req_df['Product'] != 'Nestle Purina')])  # selecting only data where product name is not Nestle Purina
To create a new column called Nestle Purina for one of the main columns (Penetration) you can use the following code:
nestle_purina = req_df[(req_df['Product'] == 'Nestle Purina')] #where product name is Nestle Purina
b = req_df[(req_df['Product'] != 'Nestle Purina')] #where product name is not nestle purina
a = b[['Product','Market','month','Penetration % (% of Households who bought a product atleast once in the given time period)']] #selecting required columns along with main column Penetration
n = nestle_purina[['month','Penetration % (% of Households who bought a product atleast once in the given time period)']] #selecting only required columns from nestle_purina df.
import numpy as np
a['Nestle Purina'] = np.nan  # creating an empty column to populate using the code below
for dt in dates:
    # getting the corresponding Nestle Purina value for the Penetration column
    val = [i for i in n[(n['month'] == dt)]['Penetration % (% of Households who bought a product atleast once in the given time period)']]
    a.loc[a['month'] == dt, 'Nestle Purina'] = val[0]  # updating the `Nestle Purina` column value from NaN to the value extracted above
a
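As an alternative to juggling suffixed column names, the merged header rows can also be repaired directly: read the two header rows separately, forward-fill the merged top row, and combine the two levels into flat names. Below is a minimal sketch on in-memory rows standing in for the sheet (the header values are made-up stand-ins for the actual file):

```python
import pandas as pd

# Rows standing in for the sheet: a merged top cell surfaces as one value
# followed by blanks, so forward-filling row 0 recovers each column's group.
raw = pd.DataFrame([
    ["Penetration", None, "Frequency", None],          # merged header row
    ["2021 Jan", "2021 Feb", "2021 Jan", "2021 Feb"],  # sub-header row
    [1.1, 1.2, 2.1, 2.2],                              # data row
])
top = raw.iloc[0].ffill()      # fill the gaps left by merged cells
bottom = raw.iloc[1]
cols = [f"{t}|{b}" for t, b in zip(top, bottom)]

data = raw.iloc[2:].reset_index(drop=True)
data.columns = cols
print(list(data.columns))
# ['Penetration|2021 Jan', 'Penetration|2021 Feb', 'Frequency|2021 Jan', 'Frequency|2021 Feb']
```

With real files the same idea applies by reading the header rows with pd.read_excel(..., header=None, nrows=2) and the data with skiprows=2.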

How to add a new row with new header information in same dataframe

I have written code to retrieve JSON data from a URL. It works fine: I give a start and end date, and it loops through the date range and appends everything to a dataframe.
The columns are populated with the JSON sensor data and its corresponding values, hence column names like sensor_1. When I request the data from the URL, it sometimes happens that there are new sensors while the old ones are switched off and deliver no data anymore, and often the set of columns changes. In that case my code just adds new columns.
What I want instead of new columns is a new header within the ongoing dataframe.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
import datetime
import json
from datetime import timedelta

import numpy as np
import pandas as pd
import requests

start_date = datetime.datetime(2023, 1, 1, 0, 0)
end_date = datetime.datetime(2023, 1, 6, 0, 0)
sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        # avoid shadowing the datetime module with the parsed timestamps
        timestamps = pd.to_datetime(np.array(j['result']['data'])[:, 0])
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:, 1:]
        df_new = pd.DataFrame(data, index=timestamps, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += timedelta(days=1)
If two DataFrames will do for you, you can simply split using the column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the double brackets ([[) used, and use .dropna() to drop the NaN rows.
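If the sensor set changes more than once, the split can also be derived automatically from the NaN pattern: group consecutive rows by which sensor columns are populated, then drop the all-NaN columns inside each group. A sketch on toy data mirroring the question (values are made up, and only two sensor columns are used for brevity):

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the question's output (hypothetical values)
df = pd.DataFrame({
    "datetime": pd.date_range("2023-01-01", periods=6),
    "sensor_1": [23.2, 13.2, 26.2, np.nan, np.nan, np.nan],
    "new_sensor_8": [np.nan, np.nan, np.nan, 75, 23, 79],
})

# A new group starts whenever the set of populated sensor columns changes
mask = df.drop(columns="datetime").notna()
group_id = (mask != mask.shift()).any(axis=1).cumsum()

# Within each group, drop the columns that are entirely NaN
blocks = [g.dropna(axis=1, how="all") for _, g in df.groupby(group_id)]
print(len(blocks))  # 2
```

Each element of blocks then carries exactly the header that applies to its rows and can be written out one block after another.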

Pandas DF, DateOffset, creating new column

So I'm working with the JHU COVID-19 data, and they've let their recovered dataset go; they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table contains the confirmed cases and deaths for every country for every date, sorted by date. My getRecovered function below attempts to take the date of a row, find the date two weeks before that for the country of that row, and return a 'Recovered' column, which is the confirmed count of two weeks ago minus the deaths today.
Maybe a pointless exercise, but I would still like to know how to do it. I know it's a big dataset and there are a lot of operations, but it has been running for 20 minutes now and is still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
urls = [
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]
confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
dates = confirmed.columns[4:]
confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)
deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)
full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)
def getRecovered(row):
    ts = pd.Timestamp(row['Date'])
    country = row['Country/Region']
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(getRecovered, axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised functions. For example you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n = 14)
You can then self merge the table on date==recovery_date (plus any other keys) and subtract the numbers.
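A minimal sketch of that self-merge on toy data (country, dates, and counts are made up; column names follow the question):

```python
import pandas as pd

# Toy stand-in for full_table (hypothetical numbers)
full_table = pd.DataFrame({
    "Country/Region": ["A", "A", "A"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-15", "2020-03-29"]),
    "Confirmed": [100, 300, 900],
    "Deaths": [1, 10, 50],
})
full_table["Recovery_date"] = full_table["Date"] - pd.tseries.offsets.DateOffset(n=14)

# One vectorised merge instead of a per-row lookup: join each row to the
# row 14 days earlier for the same country, then subtract.
merged = full_table.merge(
    full_table[["Country/Region", "Date", "Confirmed"]],
    left_on=["Country/Region", "Recovery_date"],
    right_on=["Country/Region", "Date"],
    how="left",
    suffixes=("", "_14d_ago"),
)
merged["Recovered"] = merged["Confirmed_14d_ago"] - merged["Deaths"]
print(merged["Recovered"].tolist())  # [nan, 90.0, 250.0]
```

Rows with no match 14 days earlier come out as NaN, which matches the fact that no recovery figure can be computed for them.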

Using Panda, Update column values based on a list of ID and new Values

I have a df with ID and Sell columns. I want to update the Sell column using a list of new Sell values (not all rows need to be updated, just some of them). In all examples I have seen, the value is always the same or comes from a column. In my case, I have a dynamic value.
This is what I would like:
file = 'something.csv'  # has 300 rows
IDList = ['453164259', '453106168', '453163869', '453164463']  # IDs
SellList = [120, 270, 350, 410]  # Sell values
path = os.path.join(os.getcwd(), file)
df = pd.read_csv(path)
df.loc[df['Id'].isin(IDList[x]), 'Sell'] = SellList[x]  # pseudocode: update the rows with the corresponding Sell value of the ID
df.to_csv(file)
Any ideas?
Thanks in advance
Assuming 'id' is a string (as in IDList) and is not the index of your df:

IDList = ['453164259', '453106168', '453163869', '453164463']
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if row['id'] in id_dict:
        df.loc[index, 'Sell'] = id_dict[row['id']]

If id is the index:

IDList = ['453164259', '453106168', '453163869', '453164463']
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if index in id_dict:
        df.loc[index, 'Sell'] = id_dict[index]
What I did is create a dictionary from IDList and SellList, and then loop over the df using iterrows().
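The iterrows() loop can also be replaced by a vectorised lookup with Series.map, which keeps the old Sell value wherever the id is not in the dictionary (a sketch on toy data; column names follow the question):

```python
import pandas as pd

# Toy frame: one id is not in the update list and keeps its old value
df = pd.DataFrame({"id": ["453164259", "111", "453106168"], "Sell": [10, 20, 30]})
IDList = ["453164259", "453106168", "453163869", "453164463"]
SellList = [120, 270, 350, 410]
id_dict = dict(zip(IDList, SellList))

# map yields NaN for unknown ids; fillna restores the original Sell there
df["Sell"] = df["id"].map(id_dict).fillna(df["Sell"])
print(df["Sell"].tolist())  # [120.0, 20.0, 270.0]
```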
df = pd.read_csv('something.csv')
IDList= ['453164259','453106168','453163869','453164463']
SellList=[120,270,350,410]
This will work efficiently, especially for large files:

df.set_index('id', inplace=True)
df.loc[IDList, 'Sell'] = SellList
df.reset_index(inplace=True)  # not mandatory, just in case you need 'id' back as a column
df.to_csv(file)

How do I filter columns of multiple DataFrames stored in a dictionary in an efficient way?

I am working with stock data and I want to make my data sets have equal length of data when performing certain types of analysis.
Problem
If I load data for Apple I get daily data since 1985, but if I load data for a Natural Gas ETF it might only go back to 2012. I then want to filter Apple to only show history going back to 2012. The same goes for the end date: some of my datasets may not be up to date. For example, the Apple data ranges from 1985 to 1-20-17, while the Natural Gas ETF data ranges from 2012 to 12-23-16, so I also want a filter that sets the max date. My Apple dataset is then filtered to dates ranging between 2012 and 12-23-16, and my datasets are of equal length.
Approach
I have a dictionary called Stocks which stores all of my dataframes. All the dataframes have a column named D, which is the date column.
I wrote a function that populates a dictionary with the dataframes and also takes the min and max dates of each df. I store all those min and max dates in two other dictionaries, DatesMax and DatesMin, and then take the min and the max of those two dictionaries to get the dates that will be used as the filter values for all the dataframes.
The function below works: it gets the min and max dates of multiple dataframes and returns them in a dictionary named DatesMinMax.
def MinMaxDates(FileName):
    DatesMax = {}; DatesMin = {}
    stocks = {}
    with open(FileName) as file_object:
        Current_indicators = file_object.read()
    tickers = Current_indicators.split('\n')
    for i in tickers:
        if '/' in i:
            x = i.find("/") + 1
            df = pd.read_csv(str(i[x:]) + '_data.csv')
        else:
            df = pd.read_csv(i + '_data.csv')
        stocks[i] = df
        DatesMax[i] = max(df.D)
        DatesMin[i] = min(df.D)
    x = min(DatesMax.values())
    y = max(DatesMin.values())
    DatesMinMax = {'MaxDate': x, 'MinDate': y}
    return DatesMinMax

print(MinMaxDates(FileName))
# {'MinDate': '2012-02-08', 'MaxDate': '2017-01-20'}
Question
Now I will have to loop over all the dataframes in the dict named Stocks to filter their date columns. It seems inefficient to loop again, but I can't think of any other way to apply the filter.
Actually, you may not need to capture min and max for later filtering (since 2016-12-30 < 2017-01-20), but can simply run a full inner-join merge across all dataframes on the 'D' (date) column.
Consider doing so with a chained merge, which ensures equal lengths across all dataframes, and then slice the resulting master dataframe by ticker columns to build the Stocks dictionary. Of course, you can also use the wide master dataframe for analysis:
from functools import reduce

with open(FileName) as file_object:
    Current_indicators = file_object.read()
    tickers = Current_indicators.split('\n')

# DATA FRAME LIST BUILD
dfs = []
for i in tickers:
    if '/' in i:
        x = i.find("/") + 1
        df = pd.read_csv(str(i[x:]) + '_data.csv')
    else:
        df = pd.read_csv(i + '_data.csv')
    # PREFIX ALL NON-DATE COLS WITH TICKER PREFIX
    df.columns = [col if col == 'D' else i + '_' + str(col) for col in df.columns]
    dfs.append(df)

# CHAIN MERGE (INNER JOIN) ACROSS ALL DFS
masterdf = reduce(lambda left, right: pd.merge(left, right, on=['D']), dfs)

# DATA FRAME DICT BUILD
stocks = {}
for i in tickers:
    # SLICE CURRENT TICKER COLUMNS
    df = masterdf[['D'] + [col for col in masterdf.columns if col.startswith(i + '_')]]
    # REMOVE TICKER PREFIXES
    df.columns = [col.replace(i + '_', '') for col in df.columns]
    stocks[i] = df
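If you prefer to keep the original min/max approach rather than the inner join, the second pass does not need to be expensive: a single dict comprehension over the already-loaded stocks dict applies both bounds at once. A sketch on toy frames (tickers and prices are made up; ISO date strings compare correctly without parsing):

```python
import pandas as pd

# Toy stand-ins for the stocks dict (hypothetical data)
stocks = {
    "AAPL": pd.DataFrame({"D": ["1985-01-02", "2012-02-08", "2017-01-20"],
                          "Close": [1, 2, 3]}),
    "GAS": pd.DataFrame({"D": ["2012-02-08", "2016-12-23"],
                         "Close": [4, 5]}),
}

min_date = max(df["D"].min() for df in stocks.values())  # latest start date
max_date = min(df["D"].max() for df in stocks.values())  # earliest end date

# One pass over the dict trims every frame to the common date range
stocks = {
    t: df[(df["D"] >= min_date) & (df["D"] <= max_date)].reset_index(drop=True)
    for t, df in stocks.items()
}
print(len(stocks["AAPL"]), len(stocks["GAS"]))  # 1 2
```

Note that unlike the inner join, this keeps rows for dates that exist in only some of the frames; the join is the stricter guarantee of equal lengths.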
