String function on a pandas series - python

I wanted to use the string function below, text.lower(), on a Pandas series instead of on a text file. I tried different methods to convert the series to a list and then to a string, but no luck. I am still not able to use the function below directly. Help is much appreciated.
import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())
WORDS = Counter(words(open('some.txt').read()))

I think you need to apply your function:
import re
import pandas as pd

s = pd.Series(['Aasa dsad d', 'GTH rr', 'SSD'])
print(s)
0    Aasa dsad d
1         GTH rr
2            SSD
dtype: object

def words(text):
    return re.findall(r'\w+', text.lower())

print(s.apply(words))
0    [aasa, dsad, d]
1          [gth, rr]
2              [ssd]
dtype: object
But in pandas it is better to use str.lower and str.findall, because they also work with NaNs:
print (s.str.lower().str.findall(r'\w+'))
0 [aasa, dsad, d]
1 [gth, rr]
2 [ssd]
dtype: object
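For example, here is a quick sketch (assuming a variant s2 of the same series with a missing value) of why the str accessor is safer here:
s2 = pd.Series(['Aasa dsad d', None, 'SSD'])
# s2.apply(words) would raise AttributeError, because None has no .lower()
print(s2.str.lower().str.findall(r'\w+'))
0    [aasa, dsad, d]
1                NaN
2              [ssd]
dtype: object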

Something like this?
from collections import Counter
import pandas as pd
series = pd.Series(['word', 'Word', 'WORD', 'other_word'])
counter = Counter(series.apply(lambda x: x.lower()))
print(counter)
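A pandas-native alternative (a sketch, assuming you only need the counts) avoids the lambda entirely by combining the str accessor with value_counts:
counts = series.str.lower().value_counts()
print(counts)  # word -> 3, other_word -> 1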

Related

Pandas how to create unique 4 character string from longer string

I have a pandas dataframe with strings. I would like to shorten them, so I have decided to remove the vowels. My next step is to take the first four characters of each string, but I am running into collisions. Is there a smarter way to do this so that I avoid repeated strings while still keeping to 4-character strings?
import pandas as pd
import re
d = {'test': ['gregorypolanco','franciscoliriano','chrisarcher', 'franciscolindor']}
df = pd.DataFrame(data=d)
def remove_vowels(r):
    result = re.sub(r'[AEIOU]', '', r, flags=re.IGNORECASE)
    return result
no_vowel = pd.DataFrame(df['test'].apply(remove_vowels))
no_vowel['test'].str[0:4]
Output:
0 grgr
1 frnc
2 chrs
3 frnc
Name: test, dtype: object
From the above you can see that 'franciscoliriano' and 'franciscolindor' are the same when shortened.
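One possible approach (a sketch, not from the original thread): keep three consonants and spend the fourth character on a digest of the full string, so near-identical names usually diverge. With only four characters, collisions become unlikely but are still not impossible.
import hashlib
import pandas as pd
import re

def short_code(s):
    # first three characters after removing vowels...
    head = re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)[:3]
    # ...plus one character derived from a hash of the whole string
    tail = hashlib.md5(s.encode()).hexdigest()[0]
    return head + tail

df = pd.DataFrame({'test': ['franciscoliriano', 'franciscolindor']})
print(df['test'].apply(short_code))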

Dropping duplicate values in a column

I have a frame like:
df = pd.DataFrame({'America':["24,23,24,24","10","AA,AA, XY"]})
I tried to convert it to a list, a set, etc., but couldn't handle it.
How can I drop the duplicates?
Use a custom function with split and set:
df['America'] = df['America'].apply(lambda x: set(x.split(',')))
Another solution is to use a list comprehension:
df['America'] = [set(x.split(',')) for x in df['America']]
print (df)
America
0 {23, 24}
1 {10}
2 {AA, XY}
This is one approach using str.split.
Ex:
import pandas as pd
df = pd.DataFrame({'America':["24,23,24,24","10","AA,AA, XY"]})
print(df["America"].str.split(",").apply(set))
Output:
0 {24, 23}
1 {10}
2 {AA, XY}
Name: America, dtype: object
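Note that "AA,AA, XY" contains a space after the second comma, so the resulting set actually holds ' XY' rather than 'XY'. A sketch of one way to strip each part while splitting (an addition, not from the original answers):
cleaned = df["America"].str.split(",").apply(lambda parts: {p.strip() for p in parts})
print(cleaned)  # row 2 now holds {'AA', 'XY'} with no leading space on 'XY'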

extract number using regular expression

I want to extract numbers using a regular expression.
df['price'][0]
has
'[<em class="letter" id="infoJiga">3,402,000</em>]'
And I want to extract 3402000.
How can I get this in a pandas dataframe?
Since the value is a string, try the code below.
# df['price'][0] returns '[<em class="letter" id="infoJiga">3,402,000</em>]'
x = df['price'][0]
# keep only the digits after the first '>'
y = ''.join(c for c in x.split('>')[1] if c.isdigit())
print(y)
# output: 3402000
Hope it works.
The simplest regex, assuming nothing about the environment, may be ([\d,]*). Then you can use pandas' to_numeric function.
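A sketch of that idea (the price_num column name is illustrative) using str.extract and to_numeric:
extracted = df['price'].str.extract(r'([\d,]+)', expand=False)
df['price_num'] = pd.to_numeric(extracted.str.replace(',', '', regex=False))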
Are all your values formatted the same way? If so, you can use a simple regular expression to extract the numeric values then convert them to int.
import pandas as pd
import re
test_data = ['[<em class="letter" id="infoJiga">3,402,000</em>]','[<em class="letter" id="infoJiga">3,401,000</em>]','[<em class="letter" id="infoJiga">3,400,000</em>]','[<em class="letter" id="infoJiga">2,000</em>]']
df = pd.DataFrame(test_data)
>>> df[0]
0 [<em class="letter" id="infoJiga">3,402,000</em>]
1 [<em class="letter" id="infoJiga">3,401,000</em>]
2 [<em class="letter" id="infoJiga">3,400,000</em>]
3 [<em class="letter" id="infoJiga">2,000</em>]
Name: 0, dtype: object
Define a method that extracts the value and converts it to an integer:
def get_numeric(data):
    match = re.search('>(.+)<', data)
    if match:
        return int(match.group(1).replace(',', ''))
    return None
Apply it to the DataFrame:
df[1] = df[0].apply(get_numeric)
>>> df[1]
0 3402000
1 3401000
2 3400000
3 2000
Name: 1, dtype: int64
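The same extraction can also be done without a Python-level function; a sketch using vectorized string methods, under the same assumptions about the data shape:
numbers = (df[0].str.extract(r'>([\d,]+)<', expand=False)
                .str.replace(',', '', regex=False)
                .astype(int))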

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the stringlike cells, leaving the other cells unchanged in Python 2.7.
Here is what I'm doing:
def remove_whitespace(x):
    if isinstance(x, basestring):
        return x.strip()
    else:
        return x

my_data = my_data.applymap(remove_whitespace)
Is there a better or more Pandas-idiomatic way to do this?
Is there a more efficient way (perhaps by doing things column-wise)?
I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.
Stumbled onto this question while looking for a quick and minimalistic snippet I could use. Had to assemble one myself from posts above. Maybe someone will find it useful:
data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
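One caveat worth knowing (a hedged aside, not from the original answer): .str.strip() returns NaN for non-string elements, so an object column that mixes strings and numbers will have its numbers silently replaced:
import pandas as pd
mixed = pd.Series([' a ', 1], dtype=object)
print(mixed.str.strip())
0      a
1    NaN
dtype: object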
You could use pandas' Series.str.strip() method to do this quickly for each string-like column:
>>> data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
>>> data
values
0 ABC
1 DEF
2 GHI
>>> data['values'].str.strip()
0 ABC
1 DEF
2 GHI
Name: values, dtype: object
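Note that str.strip() returns a new Series; to keep the result, you must assign it back (a one-line sketch):
data['values'] = data['values'].str.strip()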
We want to:
1. Apply our function to each element in our dataframe: use applymap.
2. Use type(x) == str (versus x.dtype == 'object'), because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).
3. Maintain the datatype of each element (we don't want to convert everything to str and then strip whitespace).
Therefore, I've found the following to be the easiest:
df.applymap(lambda x: x.strip() if type(x) == str else x)
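In pandas 2.1 and later, DataFrame.applymap was deprecated in favor of the equivalent DataFrame.map, so on a recent version the same idea would be (hedged on your pandas version):
df.map(lambda x: x.strip() if type(x) == str else x)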
When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.
For example, here's "data.csv":
In [19]: !cat data.csv
1.5, aaa, bbb , ddd , 10 , XXX
2.5, eee, fff , ggg, 20 , YYY
(The first line ends with three spaces after XXX, while the second line ends at the last Y.)
The following uses pandas.read_csv() to read the files, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)
In [20]: import pandas as pd
In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
In [22]: df
Out[22]:
0 1 2 3 4 5
0 1.5 aaa bbb ddd 10 XXX
1 2.5 eee fff ggg 20 YYY
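A related built-in option (a partial alternative, offered here as an aside): read_csv's skipinitialspace=True drops spaces that follow the delimiter, though unlike the regex it does not remove spaces before the delimiter or at the end of a line.
df = pd.read_csv('data.csv', header=None, skipinitialspace=True)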
The "data['values'].str.strip()" answer above did not work for me, but I found a simple work around. I am sure there is a better way to do this. The str.strip() function works on Series. Thus, I converted the dataframe column into a Series, stripped the whitespace, replaced the converted column back into the dataframe. Below is the example code.
import pandas as pd

data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
print('-----')
print(data)

# calling str.strip() without assigning does not change the dataframe
data['values'].str.strip()
print('-----')
print(data)

# assign the stripped Series back into the dataframe
new = data['values'].str.strip()
data['values'] = new
print('-----')
print(new)
Here is a column-wise solution with pandas apply:
import numpy as np

def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)
This will convert values in object-type columns to strings. Take caution with mixed-type columns: for example, if your column holds the zip codes 20001 and ' 21110 ', you will end up with '20001' and '21110'.
This worked for me - applies it to the whole dataframe:
def panda_strip(x):
    r = []
    for y in x:
        if isinstance(y, str):
            y = y.strip()
        r.append(y)
    # note: the rebuilt Series gets a default index
    return pd.Series(r)

df = df.apply(lambda x: panda_strip(x))
I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.
import pandas as pd

def remove_whitespace(x):
    try:
        # remove spaces inside and outside of the string
        x = "".join(x.split())
    except Exception:
        pass
    return x

# sample frame (hypothetical) so the snippet runs standalone
df = pd.DataFrame({'orderId': [' 1 2 3 ', ' A B ']})

# apply remove_whitespace to one column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)

# apply remove_whitespace to the entire DataFrame
df = df.applymap(remove_whitespace)
print(df)

How to unpack a Series of tuples in Pandas?

Sometimes I end up with a series of tuples/lists when using Pandas. This is common when, for example, doing a group-by and passing a function that has multiple return values:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(dict(x=np.random.randn(100),
                       y=np.repeat(list("abcd"), 25)))
out = df.groupby("y").x.apply(stats.ttest_1samp, 0)
print out
y
a (1.3066417476, 0.203717485506)
b (0.0801133382517, 0.936811414675)
c (1.55784329113, 0.132360504653)
d (0.267999459642, 0.790989680709)
dtype: object
What is the correct way to "unpack" this structure so that I get a DataFrame with two columns?
A related question is how I can unpack either this structure or the resulting dataframe into two Series/array objects. This almost works:
t, p = zip(*out)
but then t is
(array(1.3066417475999257),
array(0.08011333825171714),
array(1.557843291126335),
array(0.267999459641651))
and one needs to take the extra step of squeezing it.
Maybe this is the most straightforward (most pythonic, I guess):
df = out.apply(pd.Series)
If you want to rename the columns to something more meaningful, then:
df.columns = ['Kstats', 'Pvalue']
If you do not want the default name for the index:
df.index.name = None
maybe:
>>> pd.DataFrame(out.tolist(), columns=['out-1','out-2'], index=out.index)
out-1 out-2
y
a -1.9153853424536496 0.067433
b 1.277561889173181 0.213624
c 0.062021492729736116 0.951059
d 0.3036745009819999 0.763993
[4 rows x 2 columns]
I believe you want this:
df=pd.DataFrame(out.tolist())
df.columns=['KS-stat', 'P-value']
result:
KS-stat P-value
0 -2.12978778869 0.043643
1 3.50655433879 0.001813
2 -1.2221274198 0.233527
3 -0.977154419818 0.338240
I have met a similar problem. The two ways I found to solve it are exactly the answers of @CT Zhu and @Siraj S.
Here is some supplementary information you might be interested in:
I compared the two ways and found that @CT Zhu's way performs much faster as the input size grows.
Example:
# Python 3
import time
from statistics import mean
import pandas as pd

df_a = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# function to test
def func1(x):
    c = str(x) * 3
    d = int(x) + 100
    return c, d

# Siraj S's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = df_a['b'].apply(lambda x: func1(x)).apply(pd.Series)
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.14907703161239624

# CT Zhu's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = pd.DataFrame(df_a['b'].apply(lambda x: func1(x)).tolist())
    end = time.time()
    time_difference.append(end - start)
print(mean(time_difference))
# 0.0014058423042297363
PS: Please forgive my ugly code.
Not sure if t and r are predefined somewhere, but if not, I get the two tuples passed to t and r by:
>>> t, r = zip(*out)
>>> t
(-1.776982300308175, 0.10543682705459552, -1.7206831272759038, 1.0062163376448068)
>>> r
(0.08824925924534484, 0.9169054844258786, 0.09817788453771065, 0.3243492942246433)
Thus, you could do this:
>>> df = pd.DataFrame(columns=['t', 'r'])
>>> df.t, df.r = zip(*out)
>>> df
t r
0 -1.776982 0.088249
1 0.105437 0.916905
2 -1.720683 0.098178
3 1.006216 0.324349
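Under the same assumptions, a slightly more direct construction (a sketch) builds the frame in one step instead of assigning the columns one by one:
t, r = zip(*out)
df = pd.DataFrame({'t': t, 'r': r})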
