How do I split multiple columns? - python

I would like to split each of the columns in my dataset.
The idea is to extract the number between the slashes and the string between the second slash and "#", and put these values into new columns.
I tried something like this:
new_df = dane['1: Brandenburg'].str.split('/', n=1)
and then creating new columns from it. But I don't want to do this by hand for all 60 columns.
first column
1: Brandenburg :
ES-NL-10096/1938/X1#hkzydzon.dk/6749
BE-BR-6986/3551/B1#oqk.bf/39927
PH-SA-39552610/2436/A1#venagi.hr/80578
PA-AE-59691/4881/X1#zhicksl.cl/25247
second column
2: Achon :
DE-JP-20082/2066/A2#qwier.cu/68849
NL-LK-02276/2136/A1#ozmdpfts.de/73198
OM-PH-313/3671/Z1#jtqy.ml/52408
AE-ID-9632/3806/C3#lhbt.ar/83484
etc,etc...

As I understood, you want to extract two parts from each cell.
E.g. from ES-NL-10096/1938/X1#hkzydzon.dk/6749 there should be
extracted:
1938 - the number between slashes,
X1 - the string between the second slash and #.
To do this, you can run:
df.stack().str.extract(r'/(?P<num>\d+)/(?P<txt>[A-Z\d]+)#')\
.stack().unstack([1, 2])
You will get a MultiIndex on columns:
top level - the name of the "source" column,
second level - num and txt - the two extracted "parts".
For your sample data, the result is:
1: Brandenburg 2: Achon
num txt num txt
0 1938 X1 2066 A2
1 3551 B1 2136 A1
2 2436 A1 3671 Z1
3 4881 X1 3806 C3
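If flat column names are easier to work with afterwards, here is a minimal sketch (it assumes the extracted frame from the code above is saved in a variable res; the joined names such as '1: Brandenburg num' are just one possible convention):
res = df.stack().str.extract(r'/(?P<num>\d+)/(?P<txt>[A-Z\d]+)#')\
    .stack().unstack([1, 2])
# collapse the two column levels into single flat names, e.g. '1: Brandenburg num'
res.columns = [' '.join(pair) for pair in res.columns]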

You can use df.apply() to iterate over all the columns of your DataFrame and apply a given function. Here is an example:
def fn(col):
    return col.str.split('/', n=1)
new_df = dane.apply(fn)
With the default axis=0, apply passes each column to fn in turn, so the split runs over all 60 columns at once. Hope this helps!
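If the goal is to end up with the split pieces as real new columns for every source column, a minimal sketch built on the same split idea (the frame dane and its column names are taken from the question; the (source column, piece) column layout is just one possible choice):
import pandas as pd
dane = pd.DataFrame({
    '1: Brandenburg': ['ES-NL-10096/1938/X1#hkzydzon.dk/6749',
                       'BE-BR-6986/3551/B1#oqk.bf/39927'],
    '2: Achon': ['DE-JP-20082/2066/A2#qwier.cu/68849',
                 'NL-LK-02276/2136/A1#ozmdpfts.de/73198'],
})
# split every column once on '/' and expand the pieces into their own columns
parts = {col: dane[col].str.split('/', n=1, expand=True) for col in dane.columns}
new_df = pd.concat(parts, axis=1)  # columns become a MultiIndex: (source column, piece)
print(new_df)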

Related

How to conserve dataframe rows containing a list of specific strings?

I have a dataframe with a column level
level
0 HH
1 FF
2 FF
3 C,NN-FRAC,W-PROC
4 C,D
...
8433 C,W-PROC
8434 C,D
8435 D
8436 C,Q
8437 C,HH
I would like to conserve only the rows which contain one of these specific strings:
searchfor = ['W','W-OFFSH','W-ONSH','W-GB','W-PROC','W-NGTC','W-TRANS','W-UNSTG','W-LNGSTG','W-LNGIE','W-LDC','X','Y','LL','MM','MM – REF','MM – IMP','MM – EXP','NN','NN-FRAC','NN-LDC','OO']
which should give me (from the above extract):
level
3 C,NN-FRAC,W-PROC
8433 C,W-PROC
I tried these 2 different string filters, but neither one gives me the expected result.
df = df[df['industrytype'].str.contains(searchfor)]
df = df[df['industrytype'].str.contains(','.join(searchfor))]
It might not be behaving the expected way because of the presence of commas in the column. You can write a simple function which splits at the commas and checks each of the resulting parts, then use the apply method to run that function on the column.
def filter(x):
    x = x.split(',')
    for i in x:
        if i in searchfor:
            return True
    return False

df = df[df.industrytype.apply(filter)]
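If you prefer to avoid the explicit loop, a minimal sketch of the same idea using a set intersection (assuming, as in the question's own attempts, that the column is called industrytype):
searchset = set(searchfor)
# keep rows whose comma-separated tokens share at least one entry with the search list
mask = df['industrytype'].str.split(',').apply(lambda parts: bool(searchset & set(parts)))
df = df[mask]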

Python: Splitting a Column into concatenated rows based on specific Values

I am sure someone has asked a question like this before but my current attempts to search have not yielded a solution.
I have a column of text values, for example:
import pandas as pd
df2 = pd.DataFrame({'text':['a','bb','cc','4','m','...']})
print(df2)
text
0 a
1 bb
2 cc
3 4
4 m
5 ...
The 'text' column is made up of strings, ints, floats, and NaN values.
I am trying to combine (with a space [' '] between each text value) all the text values in between each number (int/float) in the text column, ignoring NaN values, and making each concatenated set a separate row.
What would be the most efficient way to accomplish this?
I thought to possibly read all values into a string, strip the NaNs, then split it each time a number is encountered, but this seems highly inefficient.
Thank you for your help!
edit:
desired sample output
text
0 'a bb cc'
1 'm ...'
You can convert the column to numeric and test for non-missing values, which gives True for the numeric rows. Then keep only the non-numeric rows with the inverted mask ~ in DataFrame.loc, group them by the cumulative sum of the mask (Series.cumsum), and aggregate each group with a join:
#for remove NaNs before solution
df2 = df2.dropna(subset=['text'])
m = pd.to_numeric(df2['text'], errors='coerce').notna()
df = df2.loc[~m, 'text'].groupby(m.cumsum()).agg(' '.join).reset_index(drop=True).to_frame()
print (df)
text
0 a bb cc
1 m ...
I would avoid pandas for this operation altogether. Instead, use the library module more_itertools - namely, the split_at() function:
import more_itertools as mit
def test(x):  # test whether x is a number of some sort or a NaN
    try: float(x); return True
    except: return False

result = [" ".join(x) for x in mit.split_at(df2['text'].dropna(), test)]
# ['a bb cc', 'm ...']
df3 = pd.DataFrame(result, columns=['text'])
P.S. On a dataframe of 13,000 rows with an average group length of 10, this solution is 2 times faster than the pandas solution proposed by jezrael (0.00087 sec vs 0.00156 sec). Not a huge difference, indeed.

Dropping row in pandas by index - have to add to index?

I have a pandas dataframe of 182 rows that comes from read_csv. The first column, sys_code, contains various alphanumeric codes. I want to drop ones that start with 'FB' (there are 14 of these). I loop through the dataframe, adding what I assume would be the index to a list, then try to drop by index using the list. But this doesn't work unless I add 18 to each index number.
Without adding 18, I get a list containing numbers from 84 - 97. When I try to drop the rows using this list for indexes, I get KeyError: '[84] not found in axis'. But when I add 18 to each number, it works fine, at least for this particular dataset. But why is this? Shouldn't i be the same as the index number?
fb = []
i = 0
df.reset_index(drop=True)
for x in df['sys_code']:
    if x[:2] == 'FB':
        fb.append(i+18)  # works
        fb.append(i)     # doesn't work
    i += 1
df.drop(fb, axis=0, inplace=True)
You could use Series.str.startswith. Here's an example:
df = pd.DataFrame({'col1':['some string', 'FBsomething', 'FB', 'etc']})
print(df)
col1
0 some string
1 FBsomething
2 FB
3 etc
You could remove the rows whose strings start with FB (keeping the rest) using:
df[~df.col1.str.startswith('FB')]
col1
0 some string
3 etc
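Back in the question's frame, a minimal sketch of the same idea (assuming the column is sys_code, as in the question; note that reset_index returns a new frame, so it has to be assigned back for the positional indexes to line up):
df = df[~df['sys_code'].str.startswith('FB')].reset_index(drop=True)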

How many rows in dataframe contain question mark symbol

I have a dataframe made from a csv in which missing data is represented by the ? symbol. I want to check how many rows there are in which ? occurs, i.e. the number of such rows.
So far I made this, but it shows the number of all rows, not only the ones in which ? occurs.
print(sum([True for idx, row in df.iterrows()
           if any(row.str.contains('[?]'))]))
You can use apply + str.contains, assuming all your columns are strings.
c = np.sum(df.apply(lambda x: x.str.contains(r'\?')).values)
If you need to select string columns only, use select_dtypes -
i = df.select_dtypes(exclude=['number']).apply(lambda x: x.str.contains(r'\?'))
c = np.sum(i.values)
Alternatively, to find the number of rows containing ? in them, use
c = df.apply(lambda x: x.str.contains(r'\?')).any(axis=1).sum()
Demo -
df
A B
0 aaa ?xyz
1 bbb que!?
2 ? ddd
3 foo? fff
df.apply(lambda x: x.str.contains(r'\?')).any(axis=1).sum()
4
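If some cells can be missing, str.contains returns NaN for them; passing na=False (a standard str.contains parameter) treats missing cells as non-matches and keeps the row count well defined:
c = df.apply(lambda x: x.str.contains(r'\?', na=False)).any(axis=1).sum()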

how to reorder pandas data frame rows based on external index

I would like to reorder the rows in a dataframe based on an external mapping. So for example, if the list is (2,1,3), I want the first row of the old df to become the second row of the new df. I thought my question was the same as this: How to reorder indexed rows based on a list in Pandas data frame, but that solution is not working. Here's what I've tried:
a = list(sampleinfo.filename)
b = list(exprs.columns)
matchIndex2 = [a.index(x) for x in b]
(1)
sampleinfo2 = sampleinfo[matchIndex2,]
(2)
sampleinfo2 = sampleinfo
sampleinfo2.reindex(matchIndex2)
Neither solution errors out, but the order doesn't change - it's like I haven't done anything.
I am trying to make sure that the columns in exprs and the filename values of the rows in sampleinfo are in the same order. In the solution I found online, I see I can sort the columns of exprs instead:
a = list(sampleinfo.filename)
b = list(exprs.columns)
matchIndex = [b.index(x) for x in a]
exprs = exprs[matchIndex]
But I'd like to be able to sort by row. How can I do this?
The dataframes I am working with are too large to paste, but here's the general scenario:
exprs:
a1 a2 a3 a4 a5
1 2 2 2 1
4 3 2 1 1
sampleinfo:
filename otherstuff
a1 fwsegs
a5 gsgers
a3 grsgs
a2 gsgs
a4 sgs
Here's a function to re-order rows using an external list that is tied to a particular column in the data frame:
def reorder(A, column, values):
    """Re-order data frame based on a column (given in the parameter
    column, which must have unique values)."""
    if set(A[column]) != set(values):
        raise Exception("ERROR missing values for re-ordering")
    at_position = {}
    index = 0
    for v in A[column]:
        at_position[v] = index
        index += 1
    re_position = [at_position[v] for v in values]
    return A.iloc[re_position]
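For the question's frames, a usage sketch with the names taken from the question:
# reorder sampleinfo rows so that the 'filename' column follows the column order of exprs
sampleinfo2 = reorder(sampleinfo, 'filename', list(exprs.columns))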
