Pandas Dataframe - Conditional Column Creation - python

I am attempting to create a new column based on conditional logic from another column. I've tried searching and haven't been able to find anything that addresses my issue.
I have imported a CSV into a pandas dataframe; it is structured like this. I edited a few of the descriptions for this post, but otherwise everything is the same:
#code used to load dataframe:
df = pd.read_csv(r"C:\filepath\filename.csv")
#output from print(type(df)):
#<class 'pandas.core.frame.DataFrame'>
#output from print(df.columns.values):
#['Type' 'Trans Date' 'Post Date' 'Description' 'Amount']
#output from print(df.columns):
Index(['Type', 'Trans Date', 'Post Date', 'Description', 'Amount'], dtype='object')
#output from print
Type Trans Date Post Date Description Amount
0 Sale 01/25/2018 01/25/2018 DESC1 -13.95
1 Sale 01/25/2018 01/26/2018 AMAZON MKTPLACE PMTS -6.99
2 Sale 01/24/2018 01/25/2018 SUMMIT BISTRO -5.85
3 Sale 01/24/2018 01/25/2018 DESC3 -9.13
4 Sale 01/24/2018 01/26/2018 DYNAMIC VENDING INC -1.60
I then write the following code:
def criteria(row):
    if row.Description.find('SUMMIT BISTRO')>0:
        return 'Lunch'
    elif row.Description.find('AMAZON MKTPLACE PMTS')>0:
        return 'Amazon'
    elif row.Description.find('Aldi')>0:
        return 'Groceries'
    else:
        return 'NotWorking'
df['Category'] = df.apply(criteria, axis=0)
Errors:
Traceback (most recent call last):
File "C:\Users\Test_BankReconcile2.py", line 44, in <module>
df['Category'] = df.apply(criteria, axis=0)
File "C:\Users\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "C:\Users\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "C:\Users\OneDrive\Documents\finance\Test_BankReconcile2.py", line 35, in criteria
if row.Description.find('SUMMIT BISTRO')>0:
File "C:\Users\Anaconda3\lib\site-packages\pandas\core\generic.py", line 3081, in __getattr__
return object.__getattribute__(self, name)
AttributeError: ("'Series' object has no attribute 'Description'", 'occurred at index Type')
I'm able to successfully execute this same sort of command on a very similar CSV file from a different bank (this example is from my credit card), so I don't know what is going on. Possibly I need to define the dataframe in some way that I'm not doing, or it's something very obvious that I'm not seeing? Thank you all in advance for helping me solve this.

Yes, your problem is that you need to pass axis=1 to .apply:
In [52]: df
Out[52]:
Type Trans Date Post Date Description Amount
0 Sale 01/25/2018 01/25/2018 DESC1 -13.95
1 Sale 01/25/2018 01/26/2018 AMAZON MKTPLACE PMTS -6.99
2 Sale 01/24/2018 01/25/2018 SUMMIT BISTRO -5.85
3 Sale 01/24/2018 01/25/2018 DESC3 -9.13
4 Sale 01/24/2018 01/26/2018 DYNAMIC VENDING INC -1.60
In [53]: def criteria(row):
    ...:     if row.Description.find('SUMMIT BISTRO')>0:
    ...:         return 'Lunch'
    ...:     elif row.Description.find('AMAZON MKTPLACE PMTS')>0:
    ...:         return 'Amazon'
    ...:     elif row.Description.find('Aldi')>0:
    ...:         return 'Groceries'
    ...:     else:
    ...:         return 'NotWorking'
    ...:
In [54]: df.apply(criteria, axis=1)
Out[54]:
0 NotWorking
1 NotWorking
2 NotWorking
3 NotWorking
4 NotWorking
dtype: object
The second problem is a logic error: instead of .find(x) > 0 you want .find(x) >= 0, or better yet, some_string in some_other_string.
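To see why > 0 misfires: str.find returns the 0-based position of the match, so a match at the very start of the string returns 0, and no match returns -1. A quick illustration:

'SUMMIT BISTRO'.find('SUMMIT BISTRO')   # 0  -> '> 0' wrongly treats this as no match
'DESC1'.find('SUMMIT BISTRO')           # -1 -> genuinely no match
'SUMMIT BISTRO' in 'SUMMIT BISTRO'      # True, and reads more clearly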

For a more general solution, omit Description inside the function and instead pass the column itself with df['Description'].apply(criteria), i.e. Series.apply. Also, to check whether a string contains a substring, use in.
def criteria(row):
    if 'SUMMIT BISTRO' in row:
        return 'Lunch'
    elif 'AMAZON MKTPLACE PMTS' in row:
        return 'Amazon'
    elif 'Aldi' in row:
        return 'Groceries'
    else:
        return 'NotWorking'

df['Category'] = df['Description'].apply(criteria)
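As a side note, on a large frame a vectorized approach avoids calling a Python function per row. A minimal sketch with numpy.select and str.contains, reusing the labels above:

import numpy as np

conditions = [
    df['Description'].str.contains('SUMMIT BISTRO'),
    df['Description'].str.contains('AMAZON MKTPLACE PMTS'),
    df['Description'].str.contains('Aldi'),
]
choices = ['Lunch', 'Amazon', 'Groceries']
# rows matching no condition fall through to the default
df['Category'] = np.select(conditions, choices, default='NotWorking')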

Related

KeyError when trying to access the mode of DataFrame columns

I am trying to run the following code:
import time
import pandas as pd
import numpy as np

CITY_DATA = {'chicago': 'chicago.csv',
             'new york city': 'new_york_city.csv',
             'washington': 'washington.csv'}

def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.
    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """
    print('Hello! Let\'s explore some US bikeshare data!')
    # get user input for city (chicago, new york city, washington). HINT: Use a while loop to handle invalid inputs
    while True:
        city = input('Which city you would like to explore : "chicago" , "new york city" , or "washington" :')
        if city not in ('chicago', 'new york city', 'washington'):
            print(" You entered wrong choice , please try again")
            continue
        else:
            break
    # get user input for month (all, january, february, ... , june)
    while True:
        month = input('Enter "all" for all data or chose a month : "january" , "february" , "march", "april" , "may" or "june " :')
        if month not in ("all", "january", "february", "march", "april", "may", "june"):
            print(" You entered wrong choice , please try again")
            continue
        else:
            break
    # get user input for day of week (all, monday, tuesday, ... sunday)
    while True:
        day = input('Enter "all" for all days or chose a day : "saturday", "sunday", "monday", "tuesday", "wednesday", "thursday", "friday": ')
        if day not in ("all", "saturday", "sunday", "monday", "tuesday", "wednesday", "thursday", "friday"):
            print(" You entered wrong choice , please try again")
            continue
        else:
            break
    print('-'*60)
    return city, month, day

def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.
    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """
    df = pd.read_csv(CITY_DATA[city])
    # convert the Start Time column to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    # extract month , day of week , and hour from Start Time to new columns
    df['month'] = df['Start Time'].dt.month
    df['day_of_week'] = df['Start Time'].dt.day_name
    df['hour'] = df['Start Time'].dt.hour
    # filter by month if applicable
    if month != 'all':
        # use the index of the month_list to get the corresponding int
        months = ['january', 'february', 'march', 'april', 'may', 'june']
        month = months.index(month) + 1
        # filter by month to create the new dataframe
        df = df[df['month'] == month]
    # filter by day of week if applicable
    if day != 'all':
        # filter by day of week to create the new dataframe
        df = df[df['day_of_week'] == day.title()]
    return df

def time_stats(df):
    """Displays statistics on the most frequent times of travel."""
    print('\nCalculating The Most Frequent Times of Travel...\n')
    start_time = time.time()
    # display the most common month
    popular_month = df['month'].mode()[0]
    print('\n The most popular month is : \n', popular_month)
    # display the most common day of week
    popular_day = df['day_of_week'].mode()[0]
    print('\n The most popular day of the week is : \n', str(popular_day))
    # display the most common start hour
    popular_hour = df['hour'].mode()[0]
    print('\n The most popular hour of the day is :\n ', popular_hour)
    print("\nThis took %s seconds.\n" % (time.time() - start_time))
    print('-'*60)

def station_stats(df):
    """Displays statistics on the most popular stations and trip."""
    print('\nCalculating The Most Popular Stations and Trip...\n')
    start_time = time.time()
    # display most commonly used start station
    start_station = df['Start Station'].value_counts().idxmax()
    print('\n The most commonly used start station is : \n', start_station)
    # display most commonly used end station
    end_station = df['End Station'].value_counts().idxmax()
    print('\nThe most commonly used end station is: \n', end_station)
    # display most frequent combination of start station and end station trip
    combination = df.groupby(['Start Station','End Station']).value_counts().idxmax()
    print('\nThe most frequent combination of start station and end station are: \n', combination)
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""
    start_time = time.time()
    travel_time = sum(df['Trip Duration'])
    print('Total travel time:', travel_time / 86400, " Days")
    # display total travel time
    total_time = sum(df['Trip Duration'])
    print('\nThe total travel time is {} seconds: \n', total_time)
    # display mean travel time
    mean_time = df['Trip Duration'].mean()
    print('\n The average travel time is \n', mean_time)
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

def user_stats(df):
    """Displays statistics on bikeshare users."""
    print('\nCalculating User Stats...\n')
    start_time = time.time()
    # TO DO: Display counts of user types
    user_types = df['User Type'].value_counts()
    #print(user_types)
    print('User Types:\n', user_types)
    # TO DO: Display counts of gender
    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

def main():
    while True:
        city, month, day = get_filters()
        df = load_data(city, month, day)
        time_stats(df)
        station_stats(df)
        trip_duration_stats(df)
        user_stats(df)
        restart = input('\nWould you like to restart? Enter yes or no.\n')
        if restart.lower() != 'yes':
            break

if __name__ == "__main__":
    main()
and I am receiving the following errors; can someone assist please?
the errors:
> Traceback (most recent call last):
File "C:\Users\DELL\PycharmProjects\Professional\venv\Lib\site-packages\pandas\core\indexes\range.py", line 391, in get_loc
return self._range.index(new_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 0 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\DELL\PycharmProjects\Professional\Bikeshare.py", line 203, in <module>
main()
File "C:\Users\DELL\PycharmProjects\Professional\Bikeshare.py", line 192, in main
time_stats(df)
File "C:\Users\DELL\PycharmProjects\Professional\Bikeshare.py", line 100, in time_stats
popular_month = df['month'].mode()[0]
~~~~~~~~~~~~~~~~~~^^^
File "C:\Users\DELL\PycharmProjects\Professional\venv\Lib\site-packages\pandas\core\series.py", line 981, in __getitem__
return self._get_value(key)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\DELL\PycharmProjects\Professional\venv\Lib\site-packages\pandas\core\series.py", line 1089, in _get_value
loc = self.index.get_loc(label)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\DELL\PycharmProjects\Professional\venv\Lib\site-packages\pandas\core\indexes\range.py", line 393, in get_loc
raise KeyError(key) from err
KeyError: 0
I am expecting to filter the pandas DataFrame by month, day of week, and hour and then compute some statistics.
KeyError means that the key isn't valid because it doesn't exist. In this case, one reason to get a KeyError when asking for the first mode is that the 'month' column of the dataframe is empty: mode() then returns an empty collection, so you get KeyError: 0 when trying to access its first element.
To avoid this, you could replace:
popular_month = df['month'].mode()[0]
With:
try:
    # try to get first mode of column 'month'
    popular_month = df['month'].mode()[0]
except KeyError:
    # if there's no data on column 'month'
    popular_month = "unknown"
Because if there's no data on 'month' column, there's no point in trying to get its mode.
More about handling exceptions: https://docs.python.org/3/tutorial/errors.html#handling-exceptions
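An equivalent guard without exceptions is to check whether mode() came back empty before indexing; a sketch of the same idea:

modes = df['month'].mode()
# mode() returns an empty Series when the column has no data
popular_month = modes.iloc[0] if not modes.empty else "unknown"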
Also, when I tried not to use the filters, by choosing "all" for the second and third inputs, I get the following result:
Calculating The Most Frequent Times of Travel...
The most popular month is :
6
The most popular day of the week is :
<bound method PandasDelegate._add_delegate_accessors.<locals>._create_delegator_method.<locals>.f of <pandas.core.indexes.accessors.DatetimeProperties object at 0x0000022B7CD5E890>>
The most popular hour of the day is :
17
This took 0.0260775089263916 seconds.
Calculating The Most Popular Stations and Trip...
The most commonly used start station is :
Streeter Dr & Grand Ave
The most commonly used end station is:
Streeter Dr & Grand Ave
The most frequent combination of start station and end station are:
('2112 W Peterson Ave', '2112 W Peterson Ave', 1064651, Timestamp('2017-06-02 07:59:13'), '2017-06-02 08:25:42', 1589, 'Subscriber', 'Female', 1963.0, 6, <bound method PandasDelegate._add_delegate_accessors.<locals>._create_delegator_method.<locals>.f of <pandas.core.indexes.accessors.DatetimeProperties object at 0x0000022B7CD5E890>>, 7)
This took 2.1254045963287354 seconds.
Total travel time: 3250.8308680555556 Days
The total travel time is {} seconds:
280871787
The average travel time is
936.23929
This took 0.06502270698547363 seconds.
Calculating User Stats...
User Types:
Subscriber 238889
Customer 61110
Dependent 1
Name: User Type, dtype: int64
This took 0.022009611129760742 seconds.
Would you like to restart? Enter yes or no.
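The <bound method ...> lines in that output point to a separate bug in load_data: dt.day_name is a method, and the code assigns it without calling it, so every row stores the bound method object instead of a weekday string (which also makes the day filter compare against method objects). The likely one-line fix:

# note the (): without it each row holds the method itself, not the weekday name
df['day_of_week'] = df['Start Time'].dt.day_name()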

Python Pandas - Creating a function to replace repetitive DataFrames

I am new to Python and have managed to build out the following code, which produces the desired results in four separate dataframes:
import pandas as pd
x2019 = df.Date.between('2015-06-28','2015-07-04') #Transaction Dates we want to analyze
y2019 = df.First_Purchase_Date.between('2014-01-01','2015-07-04') #customer first purchase dates we want to include in the dataset
TABLE_2019_USA_XX = df.loc[x2019 & y2019 & (df['Region'] == 'USA')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2019_USA_XX['TotalCusts'] = TABLE_2019_USA_XX['New Customer'] + TABLE_2019_USA_XX['Existing Customer']
TABLE_2019_CANADA_XX = df.loc[x2019 & y2019 & (df['Region'] == 'Canada')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2019_CANADA_XX['TotalCusts'] = TABLE_2019_CANADA_XX['New Customer'] + TABLE_2019_CANADA_XX['Existing Customer']
x2018 = df.Date.between('2014-07-23','2014-07-28') #Transaction Dates we want to analyze
y2018 = df.First_Purchase_Date.between('2014-01-01','2014-07-30') #customer first purchase dates we want to include in the dataset
TABLE_2018_USA_XX = df.loc[x2018 & y2018 & (df['Region'] == 'USA')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2018_USA_XX['TotalCusts'] = TABLE_2018_USA_XX['New Customer'] + TABLE_2018_USA_XX['Existing Customer']
TABLE_2018_CANADA_XX = df.loc[x2018 & y2018 & (df['Region'] == 'Canada')].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum() #with date filters for table
TABLE_2018_CANADA_XX['TotalCusts'] = TABLE_2018_CANADA_XX['New Customer'] + TABLE_2018_CANADA_XX['Existing Customer']
print(TABLE_2018_USA_XX)
print(TABLE_2019_USA_XX)
print(TABLE_2018_CANADA_XX)
print(TABLE_2019_CANADA_XX)
Output
FPYear New Customer Existing Customer revenue TotalCusts
2014 0 23 134 23
2015 12 32 432 44
FPYear New Customer Existing Customer revenue TotalCusts
2014 432 421 4315 853
2015 3415 452 2341 3867
FPYear New Customer Existing Customer revenue TotalCusts
2014 22 432 4312 454
2015 33 345 3415 378
FPYear New Customer Existing Customer revenue TotalCusts
2014 5 35 4312 40
2015 432 32 6131 464
Based on what I've read and the feedback I got when building this script, I know I should be able to build out the above using a function, but I can't figure out exactly how to do that. Can someone please provide a suggestion to get me started? I'm essentially trying to cut down my script and make it more efficient.
Just define a function and pass as parameters the dates and region you use as filters:
import pandas as pd
def process(df, start_dt, end_dt, purch_start, purch_end, region):
    mask_date = df['Date'].between(start_dt, end_dt)
    mask_purch_date = df['First_Purchase_Date'].between(purch_start, purch_end)
    mask_region = df['Region'] == region
    temp_df = df[mask_date & mask_purch_date & mask_region].groupby(df['FPYear'])[['New Customer', 'Existing Customer', 'revenue']].sum()
    temp_df['TotalCusts'] = temp_df['New Customer'] + temp_df['Existing Customer']
    return temp_df
TABLE_2019_USA_XX = process(df,'2015-06-28','2015-07-04', '2014-01-01','2015-07-04', 'USA')
TABLE_2019_CANADA_XX = process(df,'2015-06-28','2015-07-04', '2014-01-01','2015-07-04', 'Canada')
TABLE_2018_USA_XX = process(df,'2014-07-23','2014-07-28', '2014-01-01','2014-07-30', 'USA')
TABLE_2018_CANADA_XX = process(df,'2014-07-23','2014-07-28','2014-01-01','2014-07-30', 'Canada')
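If more year/region combinations come along later, a further step (purely a sketch; the names are just the ones above) is to drive the same function from a dict of parameter tuples:

filters = {
    'TABLE_2019_USA_XX':    ('2015-06-28', '2015-07-04', '2014-01-01', '2015-07-04', 'USA'),
    'TABLE_2019_CANADA_XX': ('2015-06-28', '2015-07-04', '2014-01-01', '2015-07-04', 'Canada'),
    'TABLE_2018_USA_XX':    ('2014-07-23', '2014-07-28', '2014-01-01', '2014-07-30', 'USA'),
    'TABLE_2018_CANADA_XX': ('2014-07-23', '2014-07-28', '2014-01-01', '2014-07-30', 'Canada'),
}
tables = {name: process(df, *args) for name, args in filters.items()}
print(tables['TABLE_2018_USA_XX'])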
IIUC, you have repeating columns in your dataframes and you are doing the same operation over and over?
dfs = [TABLE_2019_CANADA_XX, TABLE_2018_CANADA_XX, TABLE_2018_USA_XX, TABLE_2019_USA_XX]
df = pd.concat(dfs)
df.groupby(['FPYear','Region'])[['New Customer', 'Existing Customer', 'revenue']].sum()

Pandas - Filter DataFrame with Logics

I have a Pandas DataFrame like this,
Employee ID ActionCode ActionReason ConcatenatedOutput
1 TER DEA TER_DEA
1 RET ABC RET_ABC
1 RET DEF RET_DEF
2 TER DEA TER_DEA
2 ABC ABC ABC_ABC
2 DEF DEF DEF_DEF
3 RET FGH RET_FGH
3 RET EFG RET_EFG
4 PLA ABC PLA_ABC
4 TER DEA TER_DEA
And I want to filter it with the below logics and change it to something like this,
Employee ID ConcatenatedOutput Context
1 RET_ABC RET or TER Found
2 TER_DEA RET or TER Found
3 RET_FGH RET or TER Found
4 PLA_ABC RET or TER Not Found
Logics:-
1) If the first record for an employee is TER_DEA, then we look at that employee's other records; if the employee has another RET record, we pick the first available RET record, otherwise we stick with the TER_DEA record.
2) If the first record for an employee is anything other than TER_DEA, we stick with that record.
3) Context is conditional: if the chosen record has a RET or TER, we say RET or TER Found; otherwise it is Not Found.
Note:- The final output will have only one record for an employee ID.
The data below,
employee_id = [1,1,1,2,2,2,3,3,4,4]
action_code = ['TER','RET','RET','TER','ABC','DEF','RET','RET','PLA','TER']
action_reason = ['DEA','ABC','DEF','DEA','ABC','DEF','FGH','EFG','ABC','DEA']
concatenated_output = ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC', 'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA']
df = pd.DataFrame({
    'Employee ID': employee_id,
    'ActionCode': action_code,
    'ActionReason': action_reason,
    'ConcatenatedOutput': concatenated_output,
})
I'd recommend you rather go with a Bool in that field. To get the test data I used this:
import pandas as pd
employee_id = [1,1,1,2,2,2,3,3,4,4]
action_code = ['TER','RET','RET','TER','ABC','DEF','RET','RET','PLA','TER']
action_reason = ['DEA','ABC','DEF','DEA','ABC','DEF','FGH','EFG','ABC','DEA']
concatenated_output = ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC', 'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA']
df = pd.DataFrame({
    'Employee ID': employee_id,
    'ActionCode': action_code,
    'ActionReason': action_reason,
    'ConcatenatedOutput': concatenated_output,
})
You can then do a group by on the employee ID and apply a function to perform your specific program logic in there.
def myfunc(data):
    if data.iloc[0]['ConcatenatedOutput'] == 'TER_DEA':
        if len(data.loc[data['ActionCode'] == 'RET']) > 0:
            located_record = data.loc[data['ActionCode'] == 'RET'].iloc[[0]]
        else:
            located_record = data.iloc[[0]]
    else:
        located_record = data.iloc[[0]]
    located_record['RET or TER Context'] = data['ActionCode'].str.contains('|'.join(['RET', 'TER']))
    return located_record
df.groupby(['Employee ID']).apply(myfunc)
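Note that because myfunc returns one-row frames, the groupby-apply result carries Employee ID in the index as a group key. If you prefer a flat result, one option (a sketch) is:

# group_keys=False keeps the original row index instead of prepending the group key
result = df.groupby('Employee ID', group_keys=False).apply(myfunc)
result = result.reset_index(drop=True)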

pandas dataframe apply function over column creating multiple columns

I have the pandas df below, with a few columns, one of which is ip_addresses
df.head()
my_id someother_id created_at ip_address state
308074 309115 2859690 2014-09-26 22:55:20 67.000.000.000 rejected
308757 309798 2859690 2014-09-30 04:16:56 173.000.000.000 approved
309576 310619 2859690 2014-10-02 20:13:12 173.000.000.000 approved
310347 311390 2859690 2014-10-05 04:16:01 173.000.000.000 approved
311784 312827 2859690 2014-10-10 06:38:39 69.000.000.000 approved
For each ip_address, I'm trying to return the description, city, and country. I wrote the function below and tried to apply it:
from ipwhois import IPWhois
def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return [description, city, country]
# ---
suspect['ipwhois'] = suspect['ip_address'].apply(returnIP)
My problem is that this returns a list; I want three separate columns. Any help is greatly appreciated. I'm new to Pandas/Python, so a better way to write the function and use Pandas would be very helpful.
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return (description, city, country)

# zip(*...) transposes the Series of tuples into three sequences, one per
# column; unpacking the Series directly into three names does not work
suspect['description'], suspect['city'], suspect['country'] = \
    zip(*suspect['ip_address'].apply(returnIP))
I was able to solve it with another stackoverflow solution:
cols = ['description', 'city', 'country']
for n, col in enumerate(cols):
    suspect[col] = suspect['ipwhois'].apply(lambda ipwhois: ipwhois[n])
If there's a more elegant way to solve this, please share!
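One arguably cleaner alternative, assuming the 'ipwhois' list column created above, is to expand it into three columns in a single step:

import pandas as pd

cols = ['description', 'city', 'country']
# each 3-item list becomes one row; index= keeps rows aligned with suspect
suspect[cols] = pd.DataFrame(suspect['ipwhois'].tolist(), index=suspect.index)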

Organize by Twitter unique identifier using python

I have a CSV file with each line containing information pertaining to a particular tweet (i.e. each line contains Lat, Long, User_ID, tweet and so on). I need to read the file and organize the tweets by the User_ID. I am trying to end up with a given User_ID attached to all of the tweets with that specific ID.
Here is what I want:
user_id: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
user_id2: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
and so on...
This is a snippet of my code that reads in the CSV file and creates a list:
UID = []
myID = []
ID = []
f = None
with open(csv_in, 'rU') as f:
    myreader = csv.reader(f, delimiter=',')
    for row in myreader:
        # Assign columns in csv to variables.
        latitude = row[0]
        longitude = row[1]
        user_id = row[2]
        user_name = row[3]
        date = row[4]
        time = row[5]
        tweet = row[6]
        flag = row[7]
        compound = row[8]
        Vote = row[9]
        # Read variables into separate lists.
        UID.append(user_id + ', ' + latitude + ', ' + longitude + ', ' + user_name + ', ' + date + ', ' + time + ', ' + tweet + ', ' + flag + ', ' + compound)
myID = ', '.join(UID)
ID = myID.split(', ')
I'd suggest you use pandas for this. It will allow you not only to list your tweets by user_id, as in your question, but also to do many other manipulations quite easily. As an example, take a look at this python notebook from NLTK. At the end of it, you see an operation very close to yours: reading a csv file containing tweets,
In [25]:
import pandas as pd
tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
You can also find a simple operation: looking for the tweets of a certain user,
In [26]:
tweets.loc[tweets['user.id'] == 557422508]['text']
Out[26]:
id
593891099548094465 VIDEO: Sturgeon on post-election deals http://...
593891101766918144 SNP leader faces audience questions http://t.c...
Name: text, dtype: object
For listing the tweets by user_id, you would simply do something like the following (this is not in the original notebook),
In [9]:
tweets.set_index('user.id')[0:4]
Out[9]:
created_at favorite_count in_reply_to_status_id in_reply_to_user_id retweet_count retweeted text truncated
user.id
107794703 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT @KirkKus: Indirect cost of the UK being in ... False
557422508 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False VIDEO: Sturgeon on post-election deals http://... False
3006692193 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT @LabourEoin: The economy was growing 3 time... False
455154030 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT @GregLauder: the UKIP east lothian candidat... False
Hope it helps.
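Applied to the CSV from the question (it has no header row, so the column names below are assumptions taken from the row[0]..row[9] assignments), the same idea might look like:

import pandas as pd

cols = ['lat', 'long', 'user_id', 'user_name', 'date', 'time',
        'tweet', 'flag', 'compound', 'vote']
df = pd.read_csv(csv_in, names=cols)

# print one block of lat/long/tweet per user, keyed by user_id
for user_id, group in df.groupby('user_id'):
    print(user_id)
    print(group[['lat', 'long', 'tweet']].to_string(index=False))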
