I'm having a problem trying to get a character count column of the string values in another column, and haven't figured out how to do it efficiently.
for index in range(len(df)):
    df['char_length'][index] = len(df['string'][index])
This apparently involves first creating a column of nulls and then rewriting it, and it takes a really long time on my data set. So what's the most effective way of getting something like
'string' 'char_length'
abcd 4
abcde 5
I've checked around quite a bit, but I haven't been able to figure it out.
Pandas has a vectorised string method for this: str.len(). To create the new column you can write:
df['char_length'] = df['string'].str.len()
For example:
>>> df
string
0 abcd
1 abcde
>>> df['char_length'] = df['string'].str.len()
>>> df
string char_length
0 abcd 4
1 abcde 5
This should be considerably faster than looping over the DataFrame with a Python for loop.
Many other familiar string methods from Python have been introduced to Pandas. For example, lower (for converting to lowercase letters), count for counting occurrences of a particular substring, and replace for swapping one substring with another.
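For instance, a minimal sketch of a few of those vectorised methods on the same df (the new column names here are just for illustration):
# illustrative column names; each call works element-wise on the 'string' column
df['lowercase'] = df['string'].str.lower()           # convert to lowercase
df['a_count'] = df['string'].str.count('a')          # count occurrences of the substring 'a'
df['replaced'] = df['string'].str.replace('a', 'x')  # swap one substring for another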
Here's one way to do it.
In [3]: df
Out[3]:
string
0 abcd
1 abcde
In [4]: df['len'] = df['string'].str.len()
In [5]: df
Out[5]:
string len
0 abcd 4
1 abcde 5
I am trying to detect values that contain some specific characters, e.g. '?', '/', etc. Below you can see a small sample with some data.
import pandas as pd
import numpy as np

data = {
    'artificial_number': ['000100000', '000010000', '00001000/1', '00001000?', '0?00/10000'],
}
df1 = pd.DataFrame(data, columns=['artificial_number'])
Now I want to detect the values containing specific characters that are not numbers ('00001000/1', '00001000?', '0?00/10000'). I tried the lines below:
import re
clean = re.sub(r'[^a-zA-Z0-9\._-]', '', df1['artificial_number'])
But this code is not working as I expected. So can anybody help me solve this problem?
# replace every non-digit character with an empty string
df1['artificial_number'].str.replace(r'([^\d])','', regex=True)
0 000100000
1 000010000
2 000010001
3 00001000
4 00010000
Name: artificial_number, dtype: object
If you'd like to list the rows with non-digit values:
df1.loc[df1['artificial_number'].str.extract(r'([^\d])')[0].notna()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
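As another minimal sketch, str.contains with a non-digit pattern gives you a boolean mask directly (the mask variable is just for illustration):
# True where the value contains at least one non-digit character
mask = df1['artificial_number'].str.contains(r'\D', regex=True)
df1.loc[mask]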
Assuming a number in your case is an integer, to find the values containing non-numbers you can simply count the digits and compare with the length of the string:
rows = [len(re.findall('[0-9]', s)) != len(s) for s in df1.artificial_number]
df1.loc[rows]
# artificial_number
#2 00001000/1
#3 00001000?
#4 0?00/10000
To detect which of the values aren't interpretable as numeric, you can also use str.isnumeric:
df1.loc[~df1.artificial_number.str.isnumeric()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
If you want to require that every character is a plain digit, you can use str.isdigit; for example, after changing the first value to '000100000.0', that row is flagged as well:
df1.iloc[0, 0] = '000100000.0'
df1.loc[~df1.artificial_number.str.isdigit()]
artificial_number
0 000100000.0
2 00001000/1
3 00001000?
4 0?00/10000
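If you instead want to treat anything parseable as a number (including decimals such as 10.0) as valid, a pd.to_numeric sketch works too; note that its semantics differ from the character-based checks above:
# rows whose value cannot be parsed as a number at all
df1.loc[pd.to_numeric(df1['artificial_number'], errors='coerce').isna()]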
I am sure someone has asked a question like this before but my current attempts to search have not yielded a solution.
I have a column of text values, for example:
import pandas as pd
df2 = pd.DataFrame({'text':['a','bb','cc','4','m','...']})
print(df2)
text
0 a
1 bb
2 cc
3 4
4 m
5 ...
The 'text' column contains strings, ints, floats, and NaN values.
I am trying to combine all the text values between each number (int/float) in the text column, joined with a space (' ') between each value and ignoring NaN values, so that each concatenated set becomes a separate row.
What would be the most efficient way to accomplish this?
I thought I could read all the values into one string, strip the NaNs, then split it whenever a number is encountered, but this seems highly inefficient.
Thank you for your help!
edit:
desired sample output
text
0 'a bb cc'
1 'm ...'
You can convert the column to numeric and test for non-missing values, which gives True for the numeric rows. Then invert the mask with ~, filter the non-numeric rows with DataFrame.loc, group them by the cumulative sum of the mask (Series.cumsum), and aggregate each group with ' '.join:
# drop NaNs before applying the solution
df2 = df2.dropna(subset=['text'])
m = pd.to_numeric(df2['text'], errors='coerce').notna()
df = df2.loc[~m, 'text'].groupby(m.cumsum()).agg(' '.join).reset_index(drop=True).to_frame()
print(df)
text
0 a bb cc
1 m ...
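To see why the cumulative-sum grouping works, here is a small sketch of the intermediate mask for the sample data, with its values written out as comments:
m = pd.to_numeric(df2['text'], errors='coerce').notna()
# m:          [False, False, False, True, False, False]   -> True only for the numeric row '4'
# m.cumsum(): [0, 0, 0, 1, 1, 1]                          -> rows before '4' form group 0, rows after form group 1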
I would avoid pandas for this operation altogether. Instead, use the split_at() function from the more_itertools library:
import more_itertools as mit

def test(x):
    # test whether x can be interpreted as a number (or NaN)
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

result = [" ".join(x) for x in mit.split_at(df2['text'].dropna(), test)]
# ['a bb cc', 'm ...']
df3 = pd.DataFrame(result, columns=['text'])
P.S. On a dataframe of 13,000 rows with an average group length of 10, this solution is 2 times faster than the pandas solution proposed by jezrael (0.00087 sec vs 0.00156 sec). Not a huge difference, indeed.
I am cleaning up some data and the raw dataset has entries as ['Field1.1', 'Field2.1', 'Field1.2', 'Field2.2']. For the dataset, either 'Field1' XOR 'Field2' will have a non-empty string. I'd like to create a single field 'Field.1' that will extract the non-empty string from 'Field1.1' XOR 'Field2.1' and place it in 'Field.1'. Similarly, I'd like to do this for 'Field1.2' and 'Field2.2' as 'Field.2'.
I am not sure how to select matching fields, i.e. 'X.1' with 'Y.1' and 'X.2' with 'Y.2', in order to do this.
My logic is that once I can select the correct pairs I can simply use a concat statement to add them and thereby extract the non-empty string for later use. If this logic is incorrect or there is a better way that does not rely on extracting the non-empty string in this way to concatenate them then please let me know.
If the logic is sound, please explain how I might do this extraction given the indexing problem.
To be clearer, I want to go from:
df = pd.DataFrame({'field1.1': ['string1',''], 'field2.1': ['','string3'],
                   'field1.2': ['string2',''], 'field2.2': ['','string4']})
df
Out[1]:
  field1.1 field2.1 field1.2 field2.2
0  string1           string2
1           string3            string4
to:
df2 = pd.DataFrame({'field.1': ['string1','string3'], 'field.2':['string2','string4']})
df2
Out[2]:
field.1 field.2
0 string1 string2
1 string3 string4
You can use wide_to_long, bfill, and then pivot back:
(pd.wide_to_long(df.where(df.ne('')).reset_index(),
                 stubnames=['Field1', 'Field2'],
                 i='index',
                 j='group',
                 sep='.')
   .bfill(axis=1)
   .reset_index()
   .pivot(values='Field1', index='index', columns='group')
)
Sample data:
df = pd.DataFrame([
    ['a', '', 'b', ''],
    ['c', '', '', 'd'],
    ['', 'e', '', 'f'],
    ['', 'g', 'h', '']],
    columns=['Field1.1', 'Field2.1', 'Field1.2', 'Field2.2'])
group 1 2
index
0 a b
1 c d
2 e f
3 g h
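If the columns really do come in fixed pairs as in the sample data above, a simpler alternative sketch is to combine each pair directly with where (this assumes exactly one of the two columns is non-empty in every row):
# for each pair, keep Field1.x where it is non-empty, otherwise fall back to Field2.x
out = pd.DataFrame({
    f'Field.{i}': df[f'Field1.{i}'].where(df[f'Field1.{i}'].ne(''), df[f'Field2.{i}'])
    for i in (1, 2)
})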
I am trying to manipulate a large list of strings, so I cannot do this manually. I am new to Python, so I am having trouble figuring this out.
I have a dataframe with columns:
df = pd.read_csv('filename.csv')
df
A B
0 big_apples
1 big_oranges
2 small_pears
3 medium_grapes
and I need it to look more like:
A B
0 apples(big)
1 oranges(big)
2 pears(small)
3 grapes(medium)
I was thinking of using a startswith() function and .replace()/concatenating everything. But then I would have to create columns for each of these, and I need it to recognize the unique prefixes. Is there a more efficient method?
You can do some string formatting and apply it to the Series:
df.B.apply(lambda x: '{}({})'.format(*x.split('_')[::-1]))
0 apples(big)
1 oranges(big)
2 pears(small)
3 grapes(medium)
Here apply calls the formatting on each item of the Series: each string is split on '_', the list of parts is reversed with [::-1], and * "unpacks" that list into the format call.
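A vectorised alternative sketch that avoids apply is a single regex replacement over the column (the pattern assumes each value has exactly one prefix before the first underscore):
# turn 'prefix_name' into 'name(prefix)' in one pass
df['B'] = df['B'].str.replace(r'^([^_]+)_(.+)$', r'\2(\1)', regex=True)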
Given a simple Pandas Series that contains some strings which can consist of more than one sentence:
In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])
Out:
0 This is a long text. It has multiple sentences.
1 Do you see? More than one sentence!
2 This one has only one sentence though.
dtype: object
I use the pandas string method split with a regex pattern to split each row into its individual sentences (which produces unnecessary empty list elements - any suggestions on how to improve the regex?).
In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')
Out:
0 [, This is a long text., , It has multiple se...
1 [, Do you see?, , More than one sentence!, ]
2 [, This one has only one sentence though., ]
dtype: object
This converts each row into lists of strings, each element holding one sentence.
Now, my goal is to use the string method contains to check each element of each row separately against a specific regex pattern, and to create a new Series accordingly that stores the returned boolean values, each signalling whether the regex matched at least one of the list elements.
I would expect something like:
In:
s.str.contains('you')
Out:
0 False
1 True
2 False
<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.
However, when doing the above, the return is
0 NaN
1 NaN
2 NaN
dtype: float64
I also tried a list comprehension which does not work:
result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'
Any suggestions on how this can be achieved?
You can use Python's find() method:
>>> s.apply(lambda x : any((i for i in x if i.find('you') >= 0)))
0 False
1 True
2 False
dtype: bool
I guess s.str.contains('you') is not working because the elements of your Series are not strings but lists. But you can also do something like this:
>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0 False
1 True
2 False
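Another option, sketched on the same data: since every element is a list of strings, you can join each list back into a single string with str.join and then use str.contains as originally intended:
>>> s.str.join(' ').str.contains('you')
0    False
1     True
2    False
dtype: bool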