I have the following pandas dataframe with only one column:
column_name
0 cc_call_center_sk
1 cc_call_center_id
2 cc_rec_start_date
3 cc_rec_end_date
What I want to do is extract each element of that column and put them into a string like this:
my_string = ['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date',
'cc_rec_end_date']
I tried to do this with the following code:
my_list = column_names.values.tolist()
However, the output is a list of lists, which is not what I want:
[['cc_call_center_sk'], ['cc_call_center_id'], ['cc_rec_start_date'], ['cc_rec_end_date']]
df.names.tolist() generates the expected result:
>>> df.names.tolist()
['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date']
For example:
>>> df=pd.DataFrame([['cc_call_center_sk'], ['cc_call_center_id'], ['cc_rec_start_date'], ['cc_rec_end_date']], columns=['names'])
>>> df
names
0 cc_call_center_sk
1 cc_call_center_id
2 cc_rec_start_date
3 cc_rec_end_date
>>> df.names.tolist()
['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date']
Are you sure you do not "group" values or perform other preprocessing before obtaining df.names?
You can use the tolist method on the 'column_name' series. Note that my_string is then a list of strings, not a string, so the name you have assigned is misleading.
>>> import pandas as pd
>>> df = pd.DataFrame(['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date'],
... columns=['column_name'])
>>> df
column_name
0 cc_call_center_sk
1 cc_call_center_id
2 cc_rec_start_date
3 cc_rec_end_date
>>>
>>> df['column_name'].tolist()
['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date']
If you prefer dot notation, the following does the same (this works only when the column name is a valid Python identifier that doesn't clash with an existing DataFrame attribute).
>>> df.column_name.tolist()
['cc_call_center_sk', 'cc_call_center_id', 'cc_rec_start_date', 'cc_rec_end_date']
Let's say you have a data frame named df that looks like this:
df
column_name
0 cc_call_center_sk
1 cc_call_center_id
2 cc_rec_start_date
3 cc_rec_end_date
then:
my_string = df.column_name.values.tolist()
or:
my_string = df['column_name'].values.tolist()
would give you the result that you want. Here is the result when you print my_string:
['cc_call_center_sk',
'cc_call_center_id',
'cc_rec_start_date',
'cc_rec_end_date']
What your original code does is this:
my_strings = df.values.tolist()
This gives you a list of lists, with one inner list per row (observation) of your data frame; each inner list contains all the feature values for that row.
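If you already have that nested list, a quick sketch of flattening it into the flat list you wanted:
from itertools import chain

nested = df.values.tolist()               # [['cc_call_center_sk'], ['cc_call_center_id'], ...]
flat = list(chain.from_iterable(nested))  # ['cc_call_center_sk', 'cc_call_center_id', ...]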
I'm trying to execute a filter in Python, but I'm stuck at the end, when I need to group the result.
I have a json, which is this one: https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3
What I'm trying to do is filter "ApiFamily" for "payments-ted" or "payments-doc". If I find a match, I then must verify that the "ApiEndpoints" column has at least two endpoints in it.
My ultimate goal is to combine both "ApiFamily" values into one row and all the "ApiEndpoints" into another, something like this:
"ApiFamily": [
"payments-ted",
"payments-doc"
]
"ApiEndpoints": [
"/ted",
"/electronic-ted",
"/phone-ted",
"/banking-ted",
"/shared-automated-teller-machines-ted"
"/doc",
"/electronic-doc",
"/phone-doc",
"/banking-doc",
"/shared-automated-teller-machines-doc"
]
I have managed to achieve partial success, searching for a single condition:
ApiFilter = df[(df['ApiFamily'] == 'payments-pix') & (df['ApiEndpoints'].apply(lambda x: len(x)) >= 2)]
This extracts only the payments-pix rows that contain two or more ApiEndpoints.
I can check both conditions if I try this:
ApiFilter = df[((df['ApiFamily'] == 'payments-ted') | (df['ApiFamily'] == 'payments-doc')) & (df['ApiEndpoints'].apply(lambda x: len(x)) >= 2)]
I will get the correct rows, but it will obviously list the two families in separate rows.
When I try to groupby the result, all I get is this:
TypeError: unhashable type: 'Series'
My question is: how do I avoid this error? I assume I must somehow convert the columns that hold multiple items per row (lists are unhashable, so they cannot be grouped on directly), but what is the best method?
I have tried this solution; it is kind of roundabout, but it gets the final result you want.
First, get the data into a dictionary:
>>> import requests
>>> url = 'https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3'
>>> response = requests.get(url)
>>> d = response.json()
We just need the ApiFamily and ApiEndpoints into a new dictionary
>>> dNew = {}
>>> for item in d['data']:
...     if item['ApiFamily'] in ['payments-ted', 'payments-doc']:
...         dNew[item['ApiFamily']] = item['ApiEndpoints']
Change dNew into a dataframe and transpose it.
>>> import pandas as pd
>>> df1 = pd.DataFrame(dNew)
>>> df1 = df1.applymap(lambda x: "'" + x + "'")
>>> df2 = df1.transpose()
At this stage df2 looks like this:
>>> print(df2)
0 1 2 3 \
payments-ted '/ted' '/electronic-ted' '/phone-ted' '/banking-ted'
payments-doc '/doc' '/electronic-doc' '/phone-doc' '/banking-doc'
4
payments-ted '/shared-automated-teller-machines-ted'
payments-doc '/shared-automated-teller-machines-doc'
Now join all the columns with a comma:
>>> df2['final'] = df2.apply(','.join, axis=1)
Finally
>>> df2 = df2[['final']]
>>> print(df2)
final
payments-ted '/ted','/electronic-ted','/phone-ted','/bankin...
payments-doc '/doc','/electronic-doc','/phone-doc','/bankin...
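For completeness, here is a rough pandas-only sketch of the same filtering and aggregation; it assumes, as above, that the payload keeps its records under a 'data' key whose items carry an 'ApiFamily' string and an 'ApiEndpoints' list:
import requests
import pandas as pd

url = 'https://api.jsonbin.io/b/62300664a703bb67492bd3fc/3'
df = pd.DataFrame(requests.get(url).json()['data'])

# Keep the two families, each with at least two endpoints
mask = df['ApiFamily'].isin(['payments-ted', 'payments-doc']) & (df['ApiEndpoints'].str.len() >= 2)
filtered = df[mask]

# Lists are unhashable, so aggregate them instead of grouping by them
result = {'ApiFamily': filtered['ApiFamily'].tolist(),
          'ApiEndpoints': [e for endpoints in filtered['ApiEndpoints'] for e in endpoints]}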
I need to extract numeric values from a string inside a pandas DataFrame.
Let's say the DataFrame cell is as follows (stored as a string):
[1.234,2.345]
I can get the first value with the following:
print(df['column_name'].str.extract(r'(\d+.\d+)').astype('float'))
Output:
1.234
My thought for finding both values was to do the following:
print(df['column_name'].str.extract(r'(\d+.\d+),(\d+.\d+)').astype('float'))
but the output is then as follows:
NaN NaN
Expected output:
1.234 2.345
Why not just pd.eval:
>>> df['Float'] = pd.eval(df['String'])
>>> df
String Float
0 [1.234, 2.345] [1.234, 2.345]
1 [1.234, 2.345] [1.234, 2.345]
>>>
If you want to use a regex to extract floats, you can use str.findall:
>>> df['String'].str.findall(r'(-?\d+\.?\d+)').str.join(' ')
0 1.234 2.345
Name: String, dtype: object
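If you want actual floats rather than a joined string, a small sketch building on the same findall call:
df['Floats'] = df['String'].str.findall(r'-?\d+\.?\d+').apply(lambda ms: [float(m) for m in ms])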
Old answer:
Use ast.literal_eval:
import ast
import pandas as pd

df = pd.DataFrame({'String': ['[1.234, 2.345]']})
df['Float'] = df['String'].apply(ast.literal_eval)
Output:
>>> df
String Float
0 [1.234, 2.345] [1.234, 2.345]
>>> type(df.at[0, 'String'][0])
str
>>> type(df.at[0, 'Float'][0])
float
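If you then want each number in its own column, here is a sketch that expands the parsed lists; the column names 'a' and 'b' are placeholders:
# Assumes every list holds exactly two values
df[['a', 'b']] = pd.DataFrame(df['Float'].tolist(), index=df.index)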
You can use Series.str.split with expand=True to get one value per column. Since the cell is stored as a string like "[1.234,2.345]", strip the brackets first so each part converts cleanly to float. The result might look like:
your_dataframe['your_column_name'].str.strip('[]').str.split(",", expand=True).astype(float)
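A runnable sketch of that on the sample value (the DataFrame and column names are placeholders):
import pandas as pd

df = pd.DataFrame({'column_name': ['[1.234,2.345]']})
parts = df['column_name'].str.strip('[]').str.split(',', expand=True).astype(float)
print(parts)
#        0      1
# 0  1.234  2.345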
Consider a Pandas Dataframe like:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com', 'http://www.url2.com','http://www.url3.com','http://www.url1.com']))
>>> df
Giving:
url
0 http://url1.com
1 http://www.url1.com
2 http://www.url2.com
3 http://www.url3.com
4 http://www.url1.com
I want to remove all rows containing url1.com or url2.com, to obtain a dataframe like:
url
0 http://www.url3.com
I tried this:
domainToCheck = ('url1.com', 'url2.com')
goodUrl = df['url'].apply(lambda x: any(domain in x for domain in domainToCheck))
But this gives me no result.
Any idea how to solve the above problem?
Edit: Solution
import pandas as pd
import tldextract
df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com','http://www.url2.com','http://www.url3.com','http://www.url1.com']))
domainToCheck = ['url1', 'url2']
s = df.url.map(lambda x: tldextract.extract(x).domain).isin(domainToCheck)
df = df[~s].reset_index(drop=True)
If we are checking domains, we should match the domain exactly rather than use a string-contains check, since a subdomain may contain the same keyword as the domain:
import tldextract
s = df.url.map(lambda x: tldextract.extract(x).domain).isin(['url1', 'url2'])
Out[594]:
0 True
1 True
2 True
3 False
4 True
Name: url, dtype: bool
df = df[~s]
Use Series.str.contains to create a boolean mask m, then filter the dataframe df with this mask:
m = df['url'].str.contains('|'.join(domainToCheck))
df = df[~m].reset_index(drop=True)
Result:
url
0 http://www.url3.com
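One caveat: the joined pattern is interpreted as a regular expression, so the dot in 'url1.com' matches any character. If that matters, a small sketch that escapes the domains first:
import re

pattern = '|'.join(re.escape(domain) for domain in domainToCheck)
m = df['url'].str.contains(pattern)
df = df[~m].reset_index(drop=True)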
You can use pd.Series.str.contains here:
df[~df.url.str.contains('|'.join(domainToCheck))]
url
3 http://www.url3.com
If you want to reset the index, use this:
df[~df.url.str.contains('|'.join(domainToCheck))].reset_index(drop=True)
url
0 http://www.url3.com
I have a giant list of values that I've downloaded, and I want to insert them into a dataframe.
I thought it would be as easy as:
import pandas as pd
df = pd.DataFrame()
records = [...]  # a giant list of dictionaries
df['var1'] = records[0]['key1']
df['var2'] = records[0]['key2']
and I would get a dataframe such as
var1 var2
val1 val2
However, my dataframe appears to be empty, even though I can print individual values from records without a problem.
Simple Example:
t = [{'v1': 100, 'v2': 50}]
df['var1'] = t[0]['v1']
df['var2'] = t[0]['v2']
I would like the result to be:
var1 var2
100 50
One entry of your list of dictionaries looks like something you'd pass to the pd.Series constructor. You can turn that into a pd.DataFrame with the pd.Series.to_frame method; I transpose at the end because I assume you want the dictionary to represent one row.
pd.Series(t[0]).to_frame().T
v1 v2
0 100 50
Pandas does exactly that for you!
>>> import pandas as pd
>>> t = [{'v1': 100, 'v2': 50}]
>>> df = pd.DataFrame(t)
>>> df
v1 v2
0 100 50
EDIT
>>> import pandas as pd
>>> t = [{'v1': 100, 'v2': 50}]
>>> df = pd.DataFrame([t[0]['v1']], columns=['var1'])
>>> df
   var1
0   100
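Back to the original problem: assuming records really is a flat list of dictionaries, a sketch that builds the whole frame in one call (the rename mapping is only illustrative):
import pandas as pd

records = [{'v1': 100, 'v2': 50}, {'v1': 200, 'v2': 75}]  # stand-in for the giant list
df = pd.DataFrame(records)  # one row per dictionary
df = df.rename(columns={'v1': 'var1', 'v2': 'var2'})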
I have this type of DataFrame I wish to use, but because the data I imported uses the letter i for the imaginary part of the complex numbers, Python doesn't let me parse the values as complex numbers.
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
How can I change the i to j in each row of the DataFrame?
Thank you.
If you have a string like this: complexStr = "0.015291+0.0075383i", you could do:
complexFloat = complex(complexStr[:-1] + 'j')
If your data is a string like this: row = "5.0 0.01511+0.0035769i", you have to split off the first part (avoid naming the variable str, which would shadow the built-in):
number, complexStr = row.split()
complexFloat = complex(complexStr[:-1] + 'j')
>>> complexFloat
(0.015291+0.0075383j)
>>> type(complexFloat)
<class 'complex'>
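If the values already sit in a DataFrame column of strings, a minimal sketch of the same idea applied column-wise (the column name 'b' matches the next answer's example):
# Swap the trailing 'i' for 'j', then parse each string as a Python complex
df['b'] = df['b'].str.replace('i', 'j').apply(complex)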
I'm not sure how you obtain your dataframe, but if you're reading it from a text file with a suitable header, then you can use a converter function to sort out the 'i' -> 'j' replacement so that your dtype is created properly:
For file test.df:
a b
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
the code
import pandas as pd
df = pd.read_table('test.df', delimiter=r'\s+',
                   converters={'b': lambda v: complex(v.replace('i', 'j'))})
gives df as:
a b
0 5.0000 (0.01511+0.0035769j)
1 5.0298 (0.015291+0.0075383j)
2 5.0594 (0.015655+0.0094534j)
3 5.0874 (0.012456+0.011908j)
4 5.1156 (0.015332+0.011174j)
5 5.1458 (0.015758+0.0095832j)
with column dtypes:
a float64
b complex128
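Once the column is complex128, pulling out the real and imaginary parts is straightforward; a short sketch (the new column names are made up):
vals = df['b'].to_numpy()   # complex128 array
df['b_real'] = vals.real
df['b_imag'] = vals.imag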