How many rows in dataframe contain question mark symbol - python

I have a dataframe made from a CSV in which missing data is represented by the ? symbol. I want to check how many rows contain a ?, along with the number of occurrences.
So far I have made this, but it shows the number of all rows, not only the ones in which ? occurs.
print(sum([True for idx, row in df.iterrows()
           if any(row.str.contains('[?]'))]))

You can use apply + str.contains, assuming all your columns are strings.
c = np.sum(df.apply(lambda x: x.str.contains(r'\?')).values)
If you need to select string columns only, use select_dtypes -
i = df.select_dtypes(exclude=['number']).apply(lambda x: x.str.contains(r'\?'))
c = np.sum(i.values)
Alternatively, to find the number of rows containing ? in them, use
c = df.apply(lambda x: x.str.contains(r'\?')).any(axis=1).sum()
Demo -
df
      A      B
0   aaa   ?xyz
1   bbb  que!?
2     ?    ddd
3  foo?    fff
df.apply(lambda x: x.str.contains(r'\?')).any(axis=1).sum()
4
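If the frame also contains missing values, str.contains returns NaN for those cells, which can skew both counts; passing na=False keeps the result boolean. A minimal, self-contained sketch reproducing the demo above (the frame construction here is assumed, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['aaa', 'bbb', '?', 'foo?'],
                   'B': ['?xyz', 'que!?', 'ddd', 'fff']})

# na=False maps missing cells to False so the reductions stay boolean
mask = df.apply(lambda x: x.str.contains(r'\?', na=False))
cells_with_qm = np.sum(mask.values)      # cells containing '?' -> 4
rows_with_qm = mask.any(axis=1).sum()    # rows containing '?'  -> 4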

Related

How to keep dataframe rows containing any of a list of specific strings?

I have a dataframe with a column level
                 level
0                   HH
1                   FF
2                   FF
3     C,NN-FRAC,W-PROC
4                  C,D
...                ...
8433          C,W-PROC
8434               C,D
8435                 D
8436               C,Q
8437              C,HH
I would like to keep only the rows which contain one of the specific strings:
searchfor = ['W','W-OFFSH','W-ONSH','W-GB','W-PROC','W-NGTC','W-TRANS','W-UNSTG','W-LNGSTG','W-LNGIE','W-LDC','X','Y','LL','MM','MM – REF','MM – IMP','MM – EXP','NN','NN-FRAC','NN-LDC','OO']
which should give me (from the above extract):
              level
1  C,NN-FRAC,W-PROC
2          C,W-PROC
I tried these two different string filters, but neither gives me the expected result.
df = df[df['industrytype'].str.contains(searchfor)]
df = df[df['industrytype'].str.contains(','.join(searchfor))]
It might not be behaving the expected way because of the commas in the column values. You can write a simple function which splits at the comma and checks each part, then use the apply method to run that function on the column.
def filter(x):
    # return True if any comma-separated part of x is in searchfor
    x = x.split(',')
    for i in x:
        if i in searchfor:
            return True
    return False

df = df[df.industrytype.apply(filter)]
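If you prefer to stay vectorized, a possible alternative (a sketch, assuming pandas >= 0.25 for Series.explode and the level column name from the question's sample) is to split on the comma, explode the parts, and test membership per original row:
parts = df['level'].str.split(',').explode()           # one part per row, original index kept
mask = parts.isin(searchfor).groupby(level=0).any()    # True if any part is in searchfor
df = df[mask]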

Python: Splitting a Column into concatenated rows based on specific Values

I am sure someone has asked a question like this before but my current attempts to search have not yielded a solution.
I have a column of text values, for example:
import pandas as pd
df2 = pd.DataFrame({'text':['a','bb','cc','4','m','...']})
print(df2)
text
0 a
1 bb
2 cc
3 4
4 m
5 ...
The 'text' column is made up of strings, ints, floats, and NaN-type data.
I am trying to combine (with a space ' ' between each text value) all the text values in between each number (int/float) in the text column, ignoring NaN values, and making each concatenated set a separate row.
What would be the most efficient way to accomplish this?
I thought about reading all values into one string, stripping the NaNs, then splitting it successively whenever a number is encountered, but this seems highly inefficient.
Thank you for your help!
edit:
desired sample output
text
0 'a bb cc'
1 'm ...'
You can convert the column to numeric and test for non-missing values, which gives True for the numeric rows. Then filter out those rows with the inverted mask ~ in DataFrame.loc, group by the cumulative sum of the mask (Series.cumsum), and aggregate with ' '.join:
# remove NaNs before applying the solution
df2 = df2.dropna(subset=['text'])
m = pd.to_numeric(df2['text'], errors='coerce').notna()
df = df2.loc[~m, 'text'].groupby(m.cumsum()).agg(' '.join).reset_index(drop=True).to_frame()
print (df)
text
0 a bb cc
1 m ...
I would avoid pandas for this operation altogether. Instead, use the more_itertools library, namely the split_at() function:
import more_itertools as mit
def test(x):  # Test if x is a number of some sort or a nan
    try:
        float(x); return True
    except (TypeError, ValueError):
        return False
result = [" ".join(x) for x in mit.split_at(df2['text'].dropna(), test)]
# ['a bb cc', 'm ...']
df3 = pd.DataFrame(result, columns=['text',])
P.S. On a dataframe of 13,000 rows with an average group length of 10, this solution is 2 times faster than the pandas solution proposed by jezrael (0.00087 sec vs 0.00156 sec). Not a huge difference, indeed.

Drop column that starts with

I have a data frame that has multiple columns, example:
   Prod_A  Prod_B  Prod_C  State  Region
1       1       0       1      1       1
I would like to drop all columns that start with Prod_ (I can't select or drop by name because the data frame has 200 variables).
Is it possible to do this ?
Thank you
Use startswith to build a mask, then keep only the other columns with loc and boolean indexing:
df = df.loc[:, ~df.columns.str.startswith('Prod')]
print (df)
   State  Region
1      1       1
First, select all columns to be deleted:
unwanted = df.columns[df.columns.str.startswith('Prod_')]
Then, drop them all:
df.drop(unwanted, axis=1, inplace=True)
We can also use a negative RegEx (negative lookahead):
In [269]: df.filter(regex=r'^(?!Prod_).*$')
Out[269]:
   State  Region
1      1       1
Drop all rows where the path column starts with /var:
df = df[~df['path'].map(lambda x: (str(x).startswith('/var')))]
This can be further simplified to:
df = df[~df['path'].str.startswith('/var')]
map + lambda offers more flexibility because the lambda can handle the raw values themselves (not just string scalars). In the example below, rows are removed when they start with /var or are empty (NaN, None, etc.).
df = df[~df['path'].map(lambda x: (str(x).startswith('/var') or not x))]
Drop all rows where the path column starts with /var or /tmp (you can also pass a tuple to startswith):
df = df[~df['path'].map(lambda x: (str(x).startswith(('/var', '/tmp'))))]
The tilde ~ is used for negation; if you instead wanted to keep all rows starting with /var, just remove the ~.
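Coming back to the original question about columns, another possible one-liner (a sketch, assuming the same Prod_ prefix) is to select the unwanted columns with DataFrame.filter and drop them:
df = df.drop(columns=df.filter(regex=r'^Prod_').columns)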

Create a string column based on values of columns and index in Pandas

I have the following dataframe:
             A         B
Tenor
1      15.1726  0.138628
2      15.1726  0.147002
3      15.1726  0.155376
4      15.1726  0.163749
5      15.1726  0.172123
I want to be able to create another column that holds a string built by concatenating the previous columns, including the index. For instance, the first row of this new column would be: XXXX1XXXX15.1726XXXX0.138628
How can I do that in Pandas? If I try to use df[ColumnName] in the string formula, Pandas always brings the index along, which messes up my string.
You can use apply:
df['NewCol'] = df.apply(lambda x: "XXXX" + str(x.name) + "XXXX" + str(x.A) + "XXXX" + str(x.B), axis=1)
Also, a bit shorter, from @Abdou:
df['joined'] = df.apply(lambda x: 'XXXX'+'XXXX'.join(map(str,x)),axis=1)
You can try something like this:
df['newColumn'] = [("XXXX"+str(id)+"XXXX" +str(entry.A) + "XXXX" +str(entry.B)) for id,entry in df.iterrows()]
I thought this was interesting:
df.reset_index().applymap('XXXX{}'.format).sum(axis=1)
0 XXXX1XXXX15.1726XXXX0.138628
1 XXXX2XXXX15.1726XXXX0.147002
2 XXXX3XXXX15.1726XXXX0.155376
3 XXXX4XXXX15.1726XXXX0.163749
4 XXXX5XXXX15.1726XXXX0.172123
dtype: object
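A vectorized alternative that avoids apply (a sketch, assuming columns A and B as in the question): cast the index and the columns to strings and concatenate them directly:
idx = pd.Series(df.index.astype(str), index=df.index)
df['NewCol'] = ('XXXX' + idx
                + 'XXXX' + df['A'].astype(str)
                + 'XXXX' + df['B'].astype(str))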

Getting substring based on another column in a pandas dataframe

Hi, is there a way to get a substring of a column based on another column?
import pandas as pd
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x
   digit     name
0      2  bernard
1      3  brenden
2      3     bern
What I would expect is something like:
for row in x.itertuples():
    print(row[2][:row[1]])
be
bre
ber
where the result is the substring of name based on digit.
I know that, if I really wanted to, I could build a list with the itertuples function, but that does not seem right, and I always try to find a vectorized method.
Appreciate any feedback.
Use apply with axis=1 for a row-wise operation with a lambda, so you can access each column for slicing:
In [68]:
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x.apply(lambda x: x['name'][:x['digit']], axis=1)
Out[68]:
0 be
1 bre
2 ber
dtype: object
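If apply over rows turns out to be slow on a large frame, a plain list comprehension over the zipped columns is a possible alternative (a sketch using the same frame x):
x['sub'] = [name[:digit] for name, digit in zip(x['name'], x['digit'])]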
