Python Assign Value to New Column If Contains() Is True - python

How can I use the str.contains() method to check a column if it contains specific strings and assign a value if true in a different column? Essentially, I'm trying to mimic a CASE WHEN LIKE THEN syntax in SQL but in pandas. Really new to python and pandas and would appreciate any help! Essentially, I want to search 'Source' for either video, audio, default, and if found, then Type would be video, audio, default accordingly. I hope this makes sense!
Source Type
video1393x2352_high video
audiowefxwrwf_low audio
default2325_none default
23234_audio audio

Use the str.extract method ... takes a regular expression as an argument ... returns matched group as a string ...
df['Type'] = df.Source.str.extract('(video|audio|default)')
For some case sensitivity you could add ...
df['Type'] = df.Source.str.lower().str.extract('(video|audio|default)')
Example, including a non match follows ...
In [24]: %paste
import pandas as pd
data = """
Source
video1393x2352_high
audiowefxwrwf_low
default2325_none
23234_audio
complete_crap
AUDIO_upper_case_test"""
from StringIO import StringIO # import from io for python 3
df = pd.read_csv(StringIO(data), header=0, index_col=None)
df['Type'] = df.Source.str.lower().str.extract('(video|audio|default)')
## -- End pasted text --
In [25]: df
Out[25]:
Source Type
0 video1393x2352_high video
1 audiowefxwrwf_low audio
2 default2325_none default
3 23234_audio audio
4 complete_crap NaN
5 AUDIO_upper_case_test audio

Try using numpy.where or pandas.DataFrame.where. Both take a boolean array and conditionally assign based on that.
In [4]: np.where([True, False, True], 3, 4)
Out[4]: array([3, 4, 3])
http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html
You would construct the boolean array using str.contains, and then pass it to the where method.

Try something like:-
import re
input_values = ['video1393x2352_high', 'audiowefxwrwf_low', 'default2325_none', '23234_audio']
pattern = re.compile('audio|video|default')
res_dict = {}
for input_val in input_values:
type = pattern.findall(input_val)
if type:
res_dict[input_val] = type[0]
print res_dict #{'23234_audio': 'audio', 'audiowefxwrwf_low': 'audio', 'video1393x2352_high': 'video', 'default2325_none': 'default'}

Related

h2o frame from pandas casting

I am using h2o to perform predictive modeling from python.
I have loaded some data from a csv using pandas, specifying some column types:
dtype_dict = {'SIT_SSICCOMP':'object',
'SIT_CAPACC':'object',
'PTT_SSIRMPOL':'object',
'PTT_SPTCLVEI':'object',
'cap_pad':'object',
'SIT_SADNS_RESP_PERC':'object',
'SIT_GEOCODE':'object',
'SIT_TIPOFIRMA':'object',
'SIT_TPFRODESI':'object',
'SIT_CITTAACC':'object',
'SIT_INDIRACC':'object',
'SIT_NUMCIVACC':'object'
}
date_cols = ["SIT_SSIDTSIN","SIT_SSIDTDEN","PTT_SPTDTEFF","PTT_SPTDTSCA","SIT_DTANTIFRODE","PTT_DTELABOR"]
columns_to_drop = ['SIT_TPFRODESI','SIT_CITTAACC',
'SIT_INDIRACC', 'SIT_NUMCIVACC', 'SIT_CAPACC', 'SIT_LONGITACC',
'SIT_LATITACC','cap_pad','SIT_DTANTIFRODE']
comp='mycomp'
file_completo = os.path.join(dataDir,"db4modelrisk_"+comp+".csv")
db4scoring = pd.read_csv(filepath_or_buffer=file_completo,sep=";", encoding='latin1',
header=0,infer_datetime_format =True,na_values=[''], keep_default_na =False,
parse_dates=date_cols,dtype=dtype_dict,nrows=500e3)
db4scoring.drop(labels=columns_to_drop,axis=1,inplace =True)
Then, after I set up a h2o cluster I import it in h2o using db4scoring_h2o = H2OFrame(db4scoring) and I convert categorical predictors in factor for example:
db4scoring_h2o["SIT_SADTPROV"]=db4scoring_h2o["SIT_SADTPROV"].asfactor()
db4scoring_h2o["PTT_SPTFRAZ"]=db4scoring_h2o["PTT_SPTFRAZ"].asfactor()
When I check data types using db4scoring.dtypes I notice that they are properly set but when I import it in h2o I notice that h2oframe performs some unwanted conversions to enum (eg from float or from int). I wonder if is is a way to specify the variable format in H2OFrame.
Yes, there is. See the H2OFrame doc here: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
You just need to use the column_types argument when you cast.
Here's a short example:
# imports
import h2o
import numpy as np
import pandas as pd
# create small random pandas df
df = pd.DataFrame(np.random.randint(0,10,size=(10, 2)),
columns=list('AB'))
print(df)
# A B
#0 5 0
#1 1 3
#2 4 8
#3 3 9
# ...
# start h2o, convert pandas frame to H2OFrame
# use column_types dict to set data types
h2o.init()
h2o_df = h2o.H2OFrame(df, column_types={'A':'numeric', 'B':'enum'})
h2o_df.describe() # you should now see the desired data types
# A B
# type int enum
# ...
# Filter a dictionary to keep elements only whose keys are even
newDict = filterTheDict(dictOfNames, lambda elem : elem[0] % 2 == 0)
print('Filtered Dictionary : ')
print(newDict)`enter code here`

Extracting value from JSON column very slow

I've got a CSV with a bunch of data. One of the columns, ExtraParams contains a JSON object. I want to extract a value using a specific key, but it's taking quite a while to get through the 60.000something rows in the CSV. Can it be sped up?
counter = 0 #just to see where I'm at
order_data['NewColumn'] = ''
for row in range(len(total_data)):
s = total_data['ExtraParams'][row]
try:
data = json.loads(s)
new_data = data['NewColumn']
counter += 1
print(counter)
order_data['NewColumn'][row] = new_data
except:
print('NewColumn not in row')
I use a try-except because a few of the rows have what I assume is messed up JSON, as they crash the program with a "expecting delimiter ','" error.
When I say "slow" I mean ~30mins for 60.000rows.
EDIT: It might be worth nothing each JSON contains about 35 key/value pairs.
You could use something like pandas and make use of the apply method. For some simple sample data in test.csv
Col1,Col2,ExtraParams
1,"a",{"dog":10}
2,"b",{"dog":5}
3,"c",{"dog":6}
You could use something like
In [1]: import pandas as pd
In [2]: import json
In [3]: df = pd.read_csv("test.csv")
In [4]: df.ExtraParams.apply(json.loads)
Out[4]:
0 {'dog': 10}
1 {'dog': 5}
2 {'dog': 6}
Name: ExtraParams, dtype: object
If you need to extract a field from the json, assuming the field is present in each row you can write a lambda function like
In [5]: df.ExtraParams.apply(lambda x: json.loads(x)['dog'])
Out[5]:
0 10
1 5
2 6
Name: ExtraParams, dtype: int64

Does pandas support read `set` paramater using read_csv

I save set parameter using to_csv.
csv file as below.
1,59,"set([17122, 196, 26405, 13032, 39657, 12427, 25133, 35951,
38928, 2 6088, 10258, 49235, 10326, 13176, 30450, 41787, 14084,
46149])",18,19.0,1 1,5.36363649368
Can I use read_csv and return a set type but str
users = pd.read_csv(DATA_PATH + "users_match.csv", dtype={
})
The answer is yes. Your solution
users = pd.read_csv(DATA_PATH + "users_match.csv", header = None)
will already return column 2 as a string as long as you have double quotes around set([...]).
Then use
users[2].apply(lambda x: eval(x))
to convert it back to set
To convert the DataFrame's str object (the string starting with the characters "set") into a built-in Python set object, here is one way:
>>> import pandas as pd
>>> df = pd.read_csv('users_match.csv', header=None)
>>> type(df[2][0])
str
>>> df.set_value(0, 2, eval(df[2][0]))
>>> type(df[2][0])
set

Splitting Regex response column on python

I am receiving an object array after applying re.findall for link and hashtags on Tweets data. My data looks like
b=['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
Now I want to split it in columns, I am using following
df = pd.DataFrame(b.str.split(',',1).tolist(),columns = ['flips','row'])
But it is not working because of weird datatype I guess, I tried few other solutions as well. Nothing worked.And this is what I am expecting, two separate columns
https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
https://t.co/CJZWjaBfJU
https://t.co/4GMhoXhBQO https://t.co/0V
https://t.co/Erutsftlnq
https://t.co/86VvLJEzvG
It's not clear from your question what exactly is part of your data. (Does it include the square brackets and single quotes?). In any case, the pandas read_csv function is very versitile and can handle ragged data:
import StringIO
import pandas as pd
raw_data = """
['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
"""
# You'll probably replace the StringIO part with the filename of your data.
df = pd.read_csv(StringIO.StringIO(raw_data), header=None, names=('flips','row'))
# Get rid of the square brackets and single quotes
for col in ('flips', 'row'):
df[col] = df[col].str.strip("[]'")
df
Output:
flips row
0 https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
1 https://t.co/CJZWjaBfJU NaN
2 https://t.co/4GMhoXhBQO https://t.co/0V
3 https://t.co/Erutsftlnq NaN
4 https://t.co/86VvLJEzvG https://t.co/zCYv5WcFDS

Get the name of a pandas DataFrame

How do I get the name of a DataFrame and print it as a string?
Example:
boston (var name assigned to a csv file)
import pandas as pd
boston = pd.read_csv('boston.csv')
print('The winner is team A based on the %s table.) % boston
You can name the dataframe with the following, and then call the name wherever you like:
import pandas as pd
df = pd.DataFrame( data=np.ones([4,4]) )
df.name = 'Ones'
print df.name
>>>
Ones
Sometimes df.name doesn't work.
you might get an error message:
'DataFrame' object has no attribute 'name'
try the below function:
def get_df_name(df):
name =[x for x in globals() if globals()[x] is df][0]
return name
In many situations, a custom attribute attached to a pd.DataFrame object is not necessary. In addition, note that pandas-object attributes may not serialize. So pickling will lose this data.
Instead, consider creating a dictionary with appropriately named keys and access the dataframe via dfs['some_label'].
df = pd.DataFrame()
dfs = {'some_label': df}
From here what I understand DataFrames are:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
And Series are:
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Series have a name attribute which can be accessed like so:
In [27]: s = pd.Series(np.random.randn(5), name='something')
In [28]: s
Out[28]:
0 0.541
1 -1.175
2 0.129
3 0.043
4 -0.429
Name: something, dtype: float64
In [29]: s.name
Out[29]: 'something'
EDIT: Based on OP's comments, I think OP was looking for something like:
>>> df = pd.DataFrame(...)
>>> df.name = 'df' # making a custom attribute that DataFrame doesn't intrinsically have
>>> print(df.name)
'df'
DataFrames don't have names, but you have an (experimental) attribute dictionary you can use. For example:
df.attrs['name'] = "My name" # Can be retrieved later
attributes are retained through some operations.
Here is a sample function:
'df.name = file` : Sixth line in the code below
def df_list():
filename_list = current_stage_files(PATH)
df_list = []
for file in filename_list:
df = pd.read_csv(PATH+file)
df.name = file
df_list.append(df)
return df_list
I am working on a module for feature analysis and I had the same need as yours, as I would like to generate a report with the name of the pandas.Dataframe being analyzed. To solve this, I used the same solution presented by #scohe001 and #LeopardShark, originally in https://stackoverflow.com/a/18425523/8508275, implemented with the inspect library:
import inspect
def aux_retrieve_name(var):
callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
return [var_name for var_name, var_val in callers_local_vars if var_val is var]
Note the additional .f_back term since I intend to call it from another function:
def header_generator(df):
print('--------- Feature Analyzer ----------')
print('Dataframe name: "{}"'.format(aux_retrieve_name(df)))
print('Memory usage: {:03.2f} MB'.format(df.memory_usage(deep=True).sum() / 1024 ** 2))
return
Running this code with a given dataframe, I get the following output:
header_generator(trial_dataframe)
--------- Feature Analyzer ----------
Dataframe name: "trial_dataframe"
Memory usage: 63.08 MB

Categories