Filtering data with pandas - python

I'm a newbie to Pandas and I'm trying to apply it to a script that I have already written.
I have a csv file from which I extract the data, and use the columns 'candidate', 'final track' and 'status' for my data frame.
My problem is, I would like to filter the data, perhaps using the method shown in Wes McKinney's 10-minute tutorial (http://nbviewer.ipython.org/urls/gist.github.com/wesm/4757075/raw/a72d3450ad4924d0e74fb57c9f62d1d895ea4574/PandasTour.ipynb). In section In [80] he uses aapl_bars.close_price['2009-10-15'].
I would like to use a similar method to select all the rows whose status is *. Data from the other columns should also be dropped if there is no * in that row.
My code at the moment:
import pandas as pd

def establish_current_tracks(filename):
    df = pd.read_csv(filename)
    # columns 0, 10 and 11 hold 'candidate', 'final track' and 'status'
    cols = [df.iloc[:, 0], df.iloc[:, 10], df.iloc[:, 11]]
    current_tracks = pd.concat(cols, axis=1)
    return current_tracks
My DataFrame:
>>> current_tracks
<class 'pandas.core.frame.DataFrame'>
Int64Index: 707 entries, 0 to 706
Data columns (total 3 columns):
candidate 695 non-null values
final track 670 non-null values
status 670 non-null values
dtypes: float64(1), object(2)
I would like to use something such as current_tracks.status['*'], but this does not work
Apologies if this is obvious; I'm struggling a little to get my head around it.

Since the data you want to filter on is a regular column rather than part of the DataFrame's index, you need to do something like this:
current_tracks[current_tracks.status == '*']
Full example:
import pandas as pd
current_tracks = pd.DataFrame({'candidate': ['Bob', 'Jim', 'Alice'],
                               'final_track': [10, 15, 13],
                               'status': ['*', '.', '*']})
current_tracks
Out[3]:
  candidate  final_track status
0       Bob           10      *
1       Jim           15      .
2     Alice           13      *
current_tracks[current_tracks.status == '*']
Out[4]:
  candidate  final_track status
0       Bob           10      *
2     Alice           13      *
If status were part of your DataFrame's index, your original syntax would have worked:
current_tracks = current_tracks.set_index('status')
current_tracks.candidate['*']
Out[8]:
status
*      Bob
*    Alice
Name: candidate, dtype: object
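Folding that filter back into the question's function, a minimal sketch (assuming the csv header labels the column at position 11 'status', as in the DataFrame summary above):

import pandas as pd

def establish_current_tracks(filename):
    df = pd.read_csv(filename)
    # keep columns 0, 10 and 11: 'candidate', 'final track' and 'status'
    current_tracks = df.iloc[:, [0, 10, 11]]
    # the boolean mask keeps only rows whose status is '*'; rows without
    # a '*' (including NaN) are dropped from all three columns at once
    return current_tracks[current_tracks['status'] == '*']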

Related

How to convert a column's dtype from object to float? [duplicate]

I have the following data in pandas dataframe:
        state          1st           2nd         3rd
0  California  $11,593,820  $109,264,246  $8,496,273
1    New York  $10,861,680   $45,336,041  $6,317,300
2     Florida   $7,942,848   $69,369,589  $4,697,244
3       Texas   $7,536,817   $61,830,712  $5,736,941
I want to perform some simple analysis (e.g., sum, groupby) with three columns (1st, 2nd, 3rd), but the data type of those three columns is object (or string).
So I used the following code for data conversion:
data = data.convert_objects(convert_numeric=True)
But the conversion does not work, perhaps due to the dollar sign. Any suggestions?
@EdChum's answer is clever and works well. But since there's more than one way to bake a cake... why not use a regex? For example:
df[df.columns[1:]] = df[df.columns[1:]].replace(r'[\$,]', '', regex=True).astype(float)
To me, that is a little bit more readable.
You can use the vectorised str methods to replace the unwanted characters and then cast the type to int:
In [81]:
import numpy as np
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str.replace('$','')).apply(lambda x: x.str.replace(',','')).astype(np.int64)
df
Out[81]:
        state       1st        2nd      3rd
index
0  California  11593820  109264246  8496273
1    New York  10861680   45336041  6317300
2     Florida   7942848   69369589  4697244
3       Texas   7536817   61830712  5736941
dtype change is now confirmed:
In [82]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
state 4 non-null object
1st 4 non-null int64
2nd 4 non-null int64
3rd 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 160.0+ bytes
Another way:
In [108]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str[1:].str.split(',').str.join('')).astype(np.int64)
df
Out[108]:
        state       1st        2nd      3rd
index
0  California  11593820  109264246  8496273
1    New York  10861680   45336041  6317300
2     Florida   7942848   69369589  4697244
3       Texas   7536817   61830712  5736941
You can also use locale, as follows:
import locale
import pandas as pd
locale.setlocale(locale.LC_ALL, '')
df['1st'] = df['1st'].map(lambda x: locale.atof(x.strip('$')))
Note: the above code was tested in Python 3 on Windows.
To convert into integer, use:
carSales["Price"] = carSales["Price"].replace("[$,]", "", regex=True).astype(int)
You can use the method str.replace with the regex '\D' to remove all non-digit characters, or '[^-.0-9]' to keep minus signs, decimal points and digits (pass regex=True explicitly on recent pandas):
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col].str.replace('[^-.0-9]', '', regex=True))
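The snippets above date from older pandas: convert_objects has since been removed, and in recent versions Series.str.replace no longer treats its pattern as a regex by default. A minimal sketch of the same clean-up against current pandas (two rows shown; column names as in the question):

import pandas as pd

df = pd.DataFrame({'state': ['California', 'New York'],
                   '1st': ['$11,593,820', '$10,861,680'],
                   '2nd': ['$109,264,246', '$45,336,041'],
                   '3rd': ['$8,496,273', '$6,317,300']})

# Strip '$' and ',' in one regex pass, then let to_numeric choose a dtype.
cleaned = df[df.columns[1:]].replace(r'[\$,]', '', regex=True)
df[df.columns[1:]] = cleaned.apply(pd.to_numeric)
print(df.dtypes)  # 'state' stays object, the three money columns become int64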

Why does Pivot table return Int64 Type Error?

I'm trying to pivot a dataframe but it keeps returning an Int64 Error. A similar question was not actually answered - What causes these Int64 columns to cause a TypeError?
Here's the type of my dataframe:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 30159 non-null Int64
1 type 30159 non-null object
2 size 30155 non-null Int64
3 location 30159 non-null object
4 neighborhood 30159 non-null object
dtypes: Int64(2), object(3)
the pivot table code:
pfraw = pd.pivot_table(pfraw, values = 'price', index = 'neighborhood', columns = 'type')
and the last bit of the error message:
273 dtype = np.dtype(dtype)
275 if not isinstance(dtype, np.dtype):
276 # enforce our signature annotation
--> 277 raise TypeError(dtype) # pragma: no cover
279 converted = maybe_downcast_numeric(result, dtype, do_round)
280 if converted is not result:
TypeError: Int64
I don't understand why it would return an error with Int64.
First of all, let's create a df similar to the one the OP has:
import pandas as pd
df = pd.DataFrame({'price': [10, 12, 18, 10, 12],
                   'type': ['A', 'A', 'A', 'B', 'B'],
                   'size': [10, 12, 18, 10, 12],
                   'location': ['A', 'A', 'A', 'B', 'B'],
                   'neighborhood': ['A', 'A', 'A', 'B', 'B']})
If one prints the df, one will see that it has int64 and not Int64 (as opposed to the OP's). Note: my answer here explains the difference between the two dtypes.
print(df.info(verbose=True))
[Out]:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 5 non-null int64
1 type 5 non-null object
2 size 5 non-null int64
3 location 5 non-null object
4 neighborhood 5 non-null object
And, with an int64 one will be able to create the pivot table with index "neighborhood", columns "type", and values "price", with the following
df_pivot = df.pivot_table(index='neighborhood', columns='type', values='price')
This is the output
type                  A     B
neighborhood
A             13.333333   NaN
B                   NaN  11.0
However, with Int64 the pivot table can generate an error.
In order to handle that, one will need to convert the type to int64:
df[['price', 'size']] = df[['price', 'size']].astype('int64')
or
import numpy as np
df[['price', 'size']] = df[['price', 'size']].astype(np.int64)
Also, most likely, OP has missing values. The fastest way to handle that is to remove the rows with missing values. In order to find and remove the missing values, my answer here may be of help.
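A minimal sketch of that clean-up (the frame below is a stand-in for the OP's, with a couple of <NA> values injected into the nullable columns):

import pandas as pd

df = pd.DataFrame({'price': pd.array([10, 12, None, 15], dtype='Int64'),
                   'type': ['A', 'A', 'B', 'B'],
                   'size': pd.array([10, 11, 12, None], dtype='Int64'),
                   'neighborhood': ['A', 'A', 'B', 'B']})

# Drop the rows holding <NA> first; casting Int64 -> int64 raises otherwise.
df = df.dropna(subset=['price', 'size'])
df[['price', 'size']] = df[['price', 'size']].astype('int64')

# The pivot now behaves as in the int64 example above.
pfraw = pd.pivot_table(df, values='price', index='neighborhood', columns='type')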
For reference, this is a direct link to the module maybe_downcast_to_dtype that raises the error the OP is having.

Identify the column name on the basis of column data in .CSV file

I have a list of column names for a csv file, like [email, null, password, ip_address, user_name, phone_no]. Consider I have a csv with the data:
03-Sep-14,foo2@yahoo.co.jp,,
20-Jan-13,foo3@gmail.com,,
20-Feb-15,foo4@yahoo.co.jp,,
12-May-16,foo5@hotmail.co.jp,,
25-May-16,foo6@hotmail.co.jp,,
Now I want to identify the column names of this csv file on the basis of the data, e.g. col_1 is a date and col_2 is an email.
I tried to use pandas, e.g. getting all the values from col_1 and then identifying whether it is an email or something else, but couldn't get far.
I tried something like this:
df = pd.read_csv('demo.csv', header=None)
df[df[1].str.contains("@")]
but it's not helping me.
Thank you.
Have you tried using Pandas dataframe.infer_objects()?
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A": ["alpha", 15, 81, 1, 100],
                   "B": [2, 78, 7, 4, 12],
                   "C": ["beta", 21, 14, 61, 5]})
# data frame info and data
df.info()
print(df)
# slice all rows except first into a new frame
df_temp = df[1:]
# print it
print(df_temp)
df_temp.info()
# infer the object types
df_inferred = df_temp.infer_objects()
# print inferred
print(df_inferred)
df_inferred.info()
Here's the output from the above py script.
Initially df is inferred as object, int64 and object for A, B and C respectively.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 5 non-null object
1 B 5 non-null int64
2 C 5 non-null object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
       A   B     C
0  alpha   2  beta
1     15  78    21
2     81   7    14
3      1   4    61
4    100  12     5
     A   B   C
1   15  78  21
2   81   7  14
3    1   4  61
4  100  12   5
After removing the first row, the one containing the strings, the data frame still shows the same dtypes.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null object
1 B 4 non-null int64
2 C 4 non-null object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
     A   B   C
1   15  78  21
2   81   7  14
3    1   4  61
4  100  12   5
After infer_objects(), the types have been correctly inferred as int64.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null int64
1 B 4 non-null int64
2 C 4 non-null int64
dtypes: int64(3)
memory usage: 228.0 bytes
Is this what you need?
The OP has clarified that s/he needs to determine if the column contains one of the following:
email
password
ip_address
user_name
phone_no
null
There are a couple of approaches we could use:
Approach #1: Take a random sample of rows and analyze their column contents using heuristics
We could use the following heuristic rules to identify column content type.
email: Use a regex to check for presence of a valid email.
[Stackoverflow - How to validate an email address]
https://www.regular-expressions.info/email.html
https://emailregex.com/
ip_address: Use a regex to match an ip_address.
Stackoverflow - Validating IPv4 addresses with Regex
Stackoverflow - Regular expression that matches valid IPv6 addresses
user_name: Use a table of common first or last names and search for them within the value
phone_no: Strip +, SPACE, -, (, ) -- alternatively, all special characters. If you are left with all digits, we have a potential phone number
null: All column contents in sample are null
password: If it doesn't satisfy rules 1 through 5, we identify it as password
We should do the analysis independently on each column and keep track of how many sample items in the column matched each heuristic. Then we could pick the classification with the maximum number of matches.
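As a rough sketch of approach #1 (the patterns and the 0.8 threshold are illustrative placeholders, and only the email, ip_address and phone_no rules are shown; the date and user_name rules would extend the dict):

import pandas as pd

# Hypothetical, deliberately loose patterns; real rules should be stricter.
PATTERNS = {
    'email': r'[^@\s]+@[^@\s]+\.[^@\s]+',
    'ip_address': r'(?:\d{1,3}\.){3}\d{1,3}',
    'phone_no': r'\+?[\d\s()\-]{7,}',
}

def guess_column_type(series):
    values = series.dropna().astype(str)
    if values.empty:
        return 'null'  # every sampled value in the column was null
    # Fraction of values fully matching each heuristic; best match wins.
    scores = {name: values.str.fullmatch(pat).mean()
              for name, pat in PATTERNS.items()}
    best = max(scores, key=scores.get)
    # Rule 6: nothing matched convincingly, so fall back to password.
    return best if scores[best] > 0.8 else 'password'

df = pd.read_csv('demo.csv', header=None)
print({col: guess_column_type(df[col]) for col in df.columns})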
Approach #2: Train a classifier using training data (obtained from real system) and use it to determine column content type
This is a machine learning classification task. A naive approach would be to take each column's data mapped to the content type as the training input.
Using the OP's sample set:
03-Sep-14,foo2@yahoo.co.jp,,
20-Jan-13,foo3@gmail.com,,
20-Feb-15,foo4@yahoo.co.jp,,
12-May-16,foo5@hotmail.co.jp,,
25-May-16,foo6@hotmail.co.jp,,
We would have:
data_content, content_type
03-Sep-14, date
20-Jan-13, date
20-Feb-15, date
12-May-16, date
25-May-16, date
foo2@yahoo.co.jp, email
foo3@gmail.com, email
foo4@yahoo.co.jp, email
foo5@hotmail.co.jp, email
foo6@hotmail.co.jp, email
We can then use machine learning to build a text-to-class multi-class classifier. Some references are given below:
Multi-Class Text Classification from Start to Finish
Multi-Class Text Classification with Scikit-Learn
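As a toy version of approach #2, a sketch using scikit-learn (assuming it is available), with character n-grams as features and the pairs above as training data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training pairs taken directly from the table above.
train_values = ['03-Sep-14', '20-Jan-13', '20-Feb-15', '12-May-16',
                '25-May-16', 'foo2@yahoo.co.jp', 'foo3@gmail.com',
                'foo4@yahoo.co.jp', 'foo5@hotmail.co.jp', 'foo6@hotmail.co.jp']
train_labels = ['date'] * 5 + ['email'] * 5

# Character n-grams pick up the separators ('@', '-') that split the classes.
clf = make_pipeline(TfidfVectorizer(analyzer='char', ngram_range=(1, 3)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_values, train_labels)

print(clf.predict(['01-Jan-20', 'bar@example.com']))  # expect ['date' 'email']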

Python Pandas Dataframe with list elements reports wrong type when calling `.info()`?

I have a dataframe in Python 3 that looks like
>>> df_people
  name Information 1 Information 2
0   P1      [20, 21]      [50, 52]
1   P2      [30, 20]      [52, 55]
2   P3      [25, 33]      [60, 54]
created from the following code:
people = {"name":["P1", "P2", "P3"],"Information 1":[[20, 21],[30, 20],[25,33]],"Information 2":[[50, 52],[52, 55],[60,54]]}
df_people= pd.DataFrame(people)
df_people
Now, if I call df_people.info() I get:
>>> df_people.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
name 3 non-null object
Information 1 3 non-null object
Information 2 3 non-null object
dtypes: object(3)
memory usage: 152.0+ bytes
The non-null object part worries me. Should it report something else?
Nope, you should not worry about the non-null objects.
This method shows you information about a DataFrame, including the index dtype, column dtypes, non-null values and memory usage.
It is just telling you that the data you filled in is not null, i.e. not blank.
You can also visit the official pandas documentation for DataFrame.info() here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
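If you want to convince yourself, a quick sketch that counts genuinely missing values per column; all zeros here, matching the '3 non-null' figures that .info() reported:

import numpy as np
import pandas as pd

people = {"name": ["P1", "P2", "P3"],
          "Information 1": [[20, 21], [30, 20], [25, 33]],
          "Information 2": [[50, 52], [52, 55], [60, 54]]}
df_people = pd.DataFrame(people)

print(df_people.isna().sum())  # 0 missing values in every column
df_people.loc[1, "name"] = np.nan
df_people.info()               # 'name' now reports 2 non-null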
