Using boolean masks in Pandas - python

This is probably a trivial query but I can't work it out.
Essentially, I want to filter noisy tweets out of the dataframe below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 140381 entries, 0 to 140380
Data columns:
text 140381 non-null values
created_at 140381 non-null values
id 140381 non-null values
from_user 140381 non-null values
geo 5493 non-null values
dtypes: float64(1), object(4)
I can create a dataframe based on unwanted keywords thus:
junk = df[df.text.str.contains("Swans")]
But what's the best way to use this to see what's left?

df[~df.text.str.contains("Swans")]

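You can also invert the mask with NumPy:
import numpy as np
df[np.invert(df.text.str.contains("Swans"))]
Older answers suggest df[-df.text.str.contains("Swans")], but unary minus on a boolean Series raises a TypeError in recent NumPy and pandas versions, so prefer ~.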
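As a rough illustration of the mask-and-invert approach, here is a small sketch. The sample rows are hypothetical stand-ins for the tweets frame, and passing na=False is an assumption that rows with missing text should count as "no match" rather than break the mask.

```python
import pandas as pd

# Hypothetical sample data standing in for the tweets dataframe
df = pd.DataFrame({"text": ["Swans win again", "Lovely weather", None, "Go Swans!"]})

# na=False treats missing text as "no match", so the mask is strictly boolean
is_junk = df["text"].str.contains("Swans", na=False)

junk = df[is_junk]    # rows that mention the keyword
clean = df[~is_junk]  # everything else, including rows with missing text

print(len(junk), len(clean))
```

Keeping the mask in its own variable lets the same condition produce both the junk rows and what is left.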

Related

Why am I getting an empty index?

All I'm being asked to do is write code that shows whether there are any missing values where it is not the customer's first order. I have provided the DataFrame. Should I use the column 'order_number' instead? Is my code wrong?
I named the DataFrame df_orders.
I thought my code would find the rows that have missing values and an order number greater than 1.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478967 entries, 0 to 478966
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 478967 non-null int64
1 user_id 478967 non-null int64
2 order_number 478967 non-null int64
3 order_dow 478967 non-null int64
4 order_hour_of_day 478967 non-null int64
5 days_since_prior_order 450148 non-null float64
dtypes: float64(1), int64(5)
memory usage: 21.9 MB
None
# Are there any missing values where it's not a customer's first order?
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna() > 1]
print(m_v_fo.head())
Empty DataFrame
Columns: [order_id, user_id, order_number, order_dow, order_hour_of_day,
days_since_prior_order]
Index: []
Calling .isna() returns a Series of True/False values. Compared with 1, True counts as 1 and False as 0, so the result is never > 1 and the mask selects no rows.
Adding .sum() does not help either: .isna().sum() > 1 is a single True/False value, not a row-by-row mask, so it cannot be used to filter the DataFrame.
What you want is two conditions combined: days_since_prior_order is missing, and order_number (the column that says whether this is a first order) is greater than 1.
m_v_fo = df_orders[df_orders['days_since_prior_order'].isna() & (df_orders['order_number'] > 1)]
If m_v_fo comes back empty, every missing value belongs to a first order.
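A minimal sketch of combining the two conditions (value missing, and not a first order) into a single row mask. The four sample rows are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the question
df_orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "user_id": [1, 1, 2, 2],
    "order_number": [1, 2, 1, 2],
    "days_since_prior_order": [np.nan, 7.0, np.nan, np.nan],
})

missing = df_orders["days_since_prior_order"].isna()  # True where the value is NaN
not_first = df_orders["order_number"] > 1             # True for second and later orders

# & combines the two masks row by row
m_v_fo = df_orders[missing & not_first]
print(m_v_fo)
```

In this toy data only order 13 is selected, because it is the one row that is both missing a value and not a first order.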

Python: Why do my changed datatypes go back to their former datatype after I save them to a .csv?

.info of Initial dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2851191 entries, 0 to 3168737
Data columns (total 6 columns):
READ_TIME object
SVC_PT_ID float64
CUSTOMER_ID object
SIGNAL_NAME object
SIGNAL_DESCRIPTION object
VALUE float64
dtypes: float64(2), object(4)
memory usage: 152.3+ MB
Then I change datatypes of "CUSTOMER_ID", "SIGNAL_NAME" and "SIGNAL_DESCRIPTION" to "category"
test = new_inkl_24.astype({"CUSTOMER_ID": "category", "SIGNAL_NAME":"category", "SIGNAL_DESCRIPTION":"category"})
I check if everything worked and the file size is reduced
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2851191 entries, 0 to 2851190
Data columns (total 7 columns):
Unnamed: 0 int64
READ_TIME object
SVC_PT_ID float64
CUSTOMER_ID category
SIGNAL_NAME category
SIGNAL_DESCRIPTION category
VALUE float64
dtypes: category(3), float64(2), int64(1), object(1)
memory usage: 95.2+ MB
Everything worked perfectly. So now I save my dataframe to a .csv file
test.to_csv('test.csv')
And here's the problem:
The file size is no longer reduced. The csv is exactly as large as the initial one with the old datatypes for the columns.
Also, when I import the file into my Jupyter notebook again, the datatypes are switched back to the initial ones.
# Load data
df = pd.read_csv("test.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2851191 entries, 0 to 2851190
Data columns (total 8 columns):
Unnamed: 0 int64
Unnamed: 0.1 int64
READ_TIME object
SVC_PT_ID float64
CUSTOMER_ID object
SIGNAL_NAME object
SIGNAL_DESCRIPTION object
VALUE float64
dtypes: float64(2), int64(2), object(4)
memory usage: 174.0+ MB
What am I doing wrong?
CSV is a text format that doesn't have data types; in effect, it's all strings, as you get from the csv module. Pandas does have a bunch of data types and will try to select appropriate ones per column when loading a CSV, but you can specify data types to guide it. You can also choose to save in a file format that supports types, such as HDF or Pickle (which also means dataframes can be stored using shelve).
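The two routes mentioned above can be sketched roughly as follows. The tiny frame is hypothetical, the CSV is kept in memory with io.StringIO for brevity, and a temporary directory is used for the pickle file. Writing with index=False is an added suggestion, not part of the original answer; it avoids the extra "Unnamed: 0" column seen in the question.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical small frame using some of the question's column names
test = pd.DataFrame({
    "CUSTOMER_ID": ["a", "b", "a"],
    "SIGNAL_NAME": ["x", "x", "y"],
    "VALUE": [1.0, 2.0, 3.0],
}).astype({"CUSTOMER_ID": "category", "SIGNAL_NAME": "category"})

# index=False skips the index, which would otherwise come back as "Unnamed: 0"
csv_text = test.to_csv(index=False)

# CSV carries no dtype information, so the categories must be requested again on load
from_csv = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"CUSTOMER_ID": "category", "SIGNAL_NAME": "category"},
)

# Pickle stores the dtypes alongside the data, so nothing needs to be re-specified
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pkl")
    test.to_pickle(path)
    from_pickle = pd.read_pickle(path)

print(from_csv.dtypes)
print(from_pickle.dtypes)
```

Note that the category dtype reduces memory use inside pandas; the CSV on disk still contains every value written out as text, which is why the file does not shrink.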

Filtering pandas based on value tuples for multiple columns [duplicate]

I have 2 dataframes:
restaurant_ids_dataframe
Data columns (total 13 columns):
business_id 4503 non-null values
categories 4503 non-null values
city 4503 non-null values
full_address 4503 non-null values
latitude 4503 non-null values
longitude 4503 non-null values
name 4503 non-null values
neighborhoods 4503 non-null values
open 4503 non-null values
review_count 4503 non-null values
stars 4503 non-null values
state 4503 non-null values
type 4503 non-null values
dtypes: bool(1), float64(3), int64(1), object(8)
and
restaurant_review_frame
Int64Index: 158430 entries, 0 to 229905
Data columns (total 8 columns):
business_id 158430 non-null values
date 158430 non-null values
review_id 158430 non-null values
stars 158430 non-null values
text 158430 non-null values
type 158430 non-null values
user_id 158430 non-null values
votes 158430 non-null values
dtypes: int64(1), object(7)
I would like to join these two DataFrames to make them into a single dataframe using the DataFrame.join() command in pandas.
I have tried the following line of code:
#the following line of code creates a left join of restaurant_ids_frame and restaurant_review_frame on the column 'business_id'
restaurant_review_frame.join(other=restaurant_ids_dataframe,on='business_id',how='left')
But when I try this I get the following error:
Exception: columns overlap: Index([business_id, stars, type], dtype=object)
I am very new to pandas and have no clue what I am doing wrong as far as executing the join statement is concerned.
any help would be much appreciated.
You can use merge to combine two dataframes into one:
import pandas as pd
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer')
where on specifies a field name that exists in both dataframes to join on, and how defines whether it's an inner/outer/left/right join, with outer using 'union of keys from both frames (SQL: full outer join).' Since you have a 'stars' column in both dataframes, this will by default create two columns, stars_x and stars_y, in the combined dataframe. As @DanAllan mentioned for the join method, you can modify the suffixes for merge by passing them as a kwarg. The default is suffixes=('_x', '_y'). If you wanted something like stars_restaurant_id and stars_restaurant_review, you can do:
pd.merge(restaurant_ids_dataframe, restaurant_review_frame, on='business_id', how='outer', suffixes=('_restaurant_id', '_restaurant_review'))
The parameters are explained in detail in this link.
Joining fails if the DataFrames have some column names in common. The simplest way around it is to include an lsuffix or rsuffix keyword like so:
restaurant_review_frame.join(restaurant_ids_dataframe, on='business_id', how='left', lsuffix="_review")
This way, the columns have distinct names. The documentation addresses this very problem.
Or, you could get around this by simply deleting the offending columns before you join. If, for example, the stars in restaurant_ids_dataframe are redundant to the stars in restaurant_review_frame, you could del restaurant_ids_dataframe['stars'].
In case anyone needs to merge two dataframes on the index (instead of another column), this also works!
T1 and T2 are dataframes that have the same indices
import pandas as pd
T1 = pd.merge(T1, T2, left_index=True, right_index=True, how='outer')
P.S. I had to use merge because append would fill in NaNs unnecessarily.
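If you want to combine two DataFrames horizontally (side by side, aligned on the index), use this code:
df3 = pd.concat([df1, df2], axis=1, sort=False)
Note that adding ignore_index=True with axis=1 replaces the column names with 0, 1, 2, ..., so leave it out if you want to keep the original column labels.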
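To make the overlapping-column behaviour concrete, here is a small sketch with hypothetical frames named restaurants and reviews, cut down to a few of the columns from the question. It shows merge with custom suffixes, and join with the right-hand frame indexed by business_id. That indexing is an assumption about the intent, since DataFrame.join matches the on column against the other frame's index.

```python
import pandas as pd

# Hypothetical, cut-down versions of the two frames in the question
restaurants = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Cafe A", "Diner B"],
    "stars": [4.0, 3.5],
})
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "review_id": ["r1", "r2", "r3"],
    "stars": [5, 3, 4],
})

# merge: the shared key is named once, other overlapping columns get suffixes
merged = pd.merge(
    reviews, restaurants,
    on="business_id", how="left",
    suffixes=("_review", "_restaurant"),
)

# join: matches reviews["business_id"] against the index of the other frame
joined = reviews.join(
    restaurants.set_index("business_id"),
    on="business_id", how="left",
    lsuffix="_review", rsuffix="_restaurant",
)

print(merged.columns.tolist())
print(joined.columns.tolist())
```

Both calls give each review row its restaurant's name and rating, with the two stars columns kept apart by suffix.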

stack/unstack/pivot dataframe on python/pandas

I have a dataframe which looks like this:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 198300 entries, 0 to 198299
Data columns (total 3 columns):
var 198300 non-null values
period 198300 non-null values
value 141492 non-null values
dtypes: float64(1), object(2)
I'd like to change it from having three columns (var, period, value) to having all values of the period variable as columns and the values in var as rows. I tried using:
X.pivot(index='var', columns='period', values='value')
But I get this error:
raise ReshapeError('Index contains duplicate entries, '
pandas.core.reshape.ReshapeError: Index contains duplicate entries, cannot reshape
But I've checked in excel, there are no duplicate entries... Any help out there? Thanks
To give this question an answer: usually when pandas complains that there are duplicate entries, it's right. To check this I often use
someseries.value_counts().head()
to see if one found its way in there.
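For the pivot case specifically, what must be unique is the (var, period) pair, so a check on that pair may be more direct than counting a single column. This is a sketch on made-up data. Using pivot_table with a mean as the fallback is an assumption about how duplicates should be resolved, not something the question specifies.

```python
import pandas as pd

# Hypothetical long-format data with one duplicated (var, period) pair
X = pd.DataFrame({
    "var": ["a", "a", "a", "b"],
    "period": ["2001", "2001", "2002", "2001"],
    "value": [1.0, 3.0, 5.0, 7.0],
})

# keep=False marks every row that takes part in a duplicated pair
dupes = X[X.duplicated(subset=["var", "period"], keep=False)]
print(dupes)

# pivot_table aggregates duplicates instead of raising; here they are averaged
wide = X.pivot_table(index="var", columns="period", values="value", aggfunc="mean")
print(wide)
```

If dupes is empty, X.pivot(index='var', columns='period', values='value') should succeed. If it is not, the printed rows show exactly which entries collide.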
