Extract all the following rows in pandas - python

I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row where column B starts with "a" is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with "a" and every row after it. My desired result is below:
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?

One option is to create a mask and use it for selection:
mask = df.B.str.startswith("a")
mask = mask.where(mask)  # False -> NaN, so ffill can carry the first True forward
df[mask.ffill().fillna(False).astype(bool)]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
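A more concise variant of the mask idea (a sketch, not from the original answers): cummax turns everything from the first True onward into True, so no fill step is needed.
# cummax propagates the first True down the boolean column,
# selecting the first "a" row and every row after it.
df[df.B.str.startswith("a").cummax()]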

Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
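One caveat worth noting (an addition, not part of the answer above): if nothing in B starts with "a", idxmax on the all-False series returns the first index label, so the slice silently returns the whole frame. A hedged guard:
starts_a = df.B.str[0].eq('a')
# idxmax gives the label of the first True; fall back to an empty frame if none.
result = df.loc[starts_a.idxmax():] if starts_a.any() else df.iloc[0:0]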

If I understand your question correctly, here is how you do it:
import pandas as pd

df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# index label of the first row whose B starts with "a"
index = df[df.B.str.startswith("a")].index[0]
desired_df = pd.concat([df.A.loc[index:], df.B.loc[index:]], axis=1)
print(desired_df)
and you get:
   A   B
1  2  a0
2  3  c0
3  5  c1
4  6  a1
5  7  b1
6  8  b2

Extract TLDs, SLDs from a dataframe column into new columns

I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I split the column into lists and then expand them with tolist, but since this appends sequentially, the mapping gets messed up when a URL has 3 levels instead of 2.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com, pro, fr) and C1 to always contain the SLD.
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
You can also use a regex with a negative lookahead and pandas' built-in str.split with expand=True:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv

How to combine 2 dataframes, using the dot product

I have 2 dataframes:
df_1 = pd.DataFrame({"c1": [2, 3, 5, 0],
                     "c2": [1, 0, 5, 2],
                     "c3": [8, 1, 5, 1]},
                    index=[1, 2, 3, 4])
df_2 = pd.DataFrame({"u1": [1, 0, 1, 0],
                     "u2": [-1, 0, 1, 1]},
                    index=[1, 2, 3, 4])
For every combination of "c" and "u", I want to calculate the dot product, e.g. with np.dot().
For example, the value of c1-u1 is calculated like this: 2*1 + 3*0 + 5*1 + 0*0 = 7
The resulting dataframe should look like this:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
Is there an "elegant" way of solving this or is iterating through the 2 dataframes the only way?
Do you mean:
df_1.T @ df_2
# or equivalently
# df_1.T.dot(df_2)
Output:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
We can do matrix multiplication using the pandas dot function:
df_1.T.dot(df_2)
Output:
u1 u2
c1 7 3
c2 6 6
c3 13 -2
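For completeness, the same result can be built with NumPy directly (a sketch; note that the pandas dot above aligns on the index labels, while np.dot assumes the rows of df_1 and df_2 are already aligned):
import numpy as np
import pandas as pd

# Multiply the raw arrays, then restore the row/column labels.
result = pd.DataFrame(np.dot(df_1.T.to_numpy(), df_2.to_numpy()),
                      index=df_1.columns, columns=df_2.columns)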

Faster copying of pandas data with some conditions

I have a dataframe (df_main) into which I want to copy data, based on finding the necessary columns in another dataframe (df_data).
df_data
name Index par_1 par_2 ... par_n
0 A1 1 a0 b0
1 A1 2 a1
2 A1 3 a2
3 A1 4 a3
4 A2 2 a4
...
df_main
name Index_0 Index_1
0 A1 1 2
1 A1 1 3
2 A1 1 4
3 A1 2 3
4 A1 2 4
5 A1 3 4
...
I want to copy the parameter columns from df_data into df_main, with the condition that each df_main row receives the parameters of the df_data row that has the same name and index.
I have made the following implementation using for loops, which is far too slow to be practical:
import tqdm

def data_copy(df, df_data, indice):
    '''indice: whether Index_0 or Index_1 is being checked'''
    # We get all the different names in the dataset to loop over
    names = df['name'].unique()
    for name in tqdm.tqdm(names):
        # Get the unique indexes for a specific name
        indexes = df[df['name'] == name][indice].unique()
        # Looping over all indexes
        for index in indexes:
            # From df_data, get the data of all cols for a specific name and index
            data = df_data[(df_data['Index'] == index) & (df_data['name'] == name)]
            # columns (defined elsewhere): only the parameter columns
            req_data = data[columns]
            for col in columns:
                # For each col (e.g. par_1, par_2, etc.), get the value at a specific index
                val = df_data.loc[(df_data['Index'] == index) & (df_data['name'] == name), col]
                df.loc[(df[indice] == index) & (df['name'] == name), col] = val[val.index.item()]
    return df
df_main = data_copy(df_main, df_data, 'Index_0')
This gives me what I required as:
df_main
name Index_0 Index_1 par_1 par_2 ...
0 A1 1 2 a0
1 A1 1 3 a0
2 A1 1 4 a0
3 A1 2 3 a1
4 A1 2 4 a1
5 A1 3 4 a2
However, running it on really big data takes a lot of time. What's the best way to avoid the for loops and get a faster implementation?
For each data frame, you can create a new column that concatenates name and index. See below:
import pandas as pd
df1 = {'name':['A1','A1'],'index':['1','2'],'par_1':['a0','a1']}
df1 = pd.DataFrame(data=df1)
df1['new'] = df1['name'] + df1['index']
df1
df2 = {'name':['A1','A1'],'index_0':['1','2'],'index_1':['2','3']}
df2 = pd.DataFrame(data=df2)
df2['new'] = df2['name'] + df2['index_0']
df2
for i, row in df1.iterrows():
    df2.loc[df2['new'] == row['new'], 'par_1'] = row['par_1']
df2
Result :
name index_0 index_1 new par_1
0 A1 1 2 A11 a0
1 A1 2 3 A12 a1
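A fully vectorized alternative (a sketch, not part of the original answer): a single left merge on the key columns avoids Python-level loops entirely and scales far better than iterrows:
import pandas as pd

# Match df_main's (name, Index_0) against df_data's (name, Index) in one pass,
# pulling all the par_* columns across at once.
out = df_main.merge(
    df_data.rename(columns={"Index": "Index_0"}),
    on=["name", "Index_0"],
    how="left",
)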

Clean way of slicing + stacking pandas dataframe

I have a Pandas DataFrame, say df, which is 1099 rows by 33 columns. I need the original file to be processed by another piece of software, but it is not in the proper format. This is why I'm trying to get it into the right format with pandas.
The problem is very simple: df consists of identifier columns (7 in the real case, only 3 in the following example), followed by the corresponding results by month. To be clear, it's like
A B C date1result date2result date3result
a1 b1 c1 12 15 17
a2 b2 c3 5 8 3
But to be processed, I would need it to have one line per result, adding a column for the date. In the given example, it would be
A B C result date
a1 b1 c1 12 date1
a1 b1 c1 15 date2
a1 b1 c1 17 date3
a2 b2 c3 5 date1
a2 b2 c3 8 date2
a2 b2 c3 3 date3
To be more precise, I have manually edited all the date column names (after read_excel they looked like '01/01/2015 0:00:00' or something like that, and I was unable to access them... as a secondary question, does anyone know how to access columns imported from a date field in an .xlsx?), so the date column names are now 2015_01, 2015_02, ..., 2015_12, 2016_01, ..., 2016_12, and the first 5 columns are 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep'. So I tried the following code:
core = df.loc[:, ('Account', 'Customer Name', 'Postcode', 'segment', 'Rep')]
df_final = pd.Series([])
for year in [2015, 2016]:
    for month in range(1, 13):
        label = "%i_%02i" % (year, month)
        date = []
        for i in range(core.shape[0]):
            date.append("01/%02i/%i" % (month, year))
        df_date = pd.Series(date)  # I don't know how to create this 1xn df
        df_final = df_final.append(pd.concat([core, df[label], df_date], axis=1))
That works roughly, but it is very unclean: I get a (26376, 30) shaped df_final, the first column being the dates, then the results, but of course with '2015_01' as the column name, then all of '2015_02' through '2016_12' filled with NaN, and at last my 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep' columns. Does anyone know how I could do such a "slicing + stacking" in a clean way?
Thank you very much.
Edit: it is roughly the reverse of this question: Stacking and shaping slices of DataFrame (pandas) without looping
I think you need melt:
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
print (df)
A B C date result
0 a1 b1 c1 date1result 12
1 a2 b2 c3 date1result 5
2 a1 b1 c1 date2result 15
3 a2 b2 c3 date2result 8
4 a1 b1 c1 date3result 17
5 a2 b2 c3 date3result 3
And then convert the date column with to_datetime. For example, with date-style column names:
print (df)
A B C 2015_01 2016_10 2016_12
0 a1 b1 c1 12 15 17
1 a2 b2 c3 5 8 3
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
df.date = pd.to_datetime(df.date, format='%Y_%m')
print (df)
A B C date result
0 a1 b1 c1 2015-01-01 12
1 a2 b2 c3 2015-01-01 5
2 a1 b1 c1 2016-10-01 15
3 a2 b2 c3 2016-10-01 8
4 a1 b1 c1 2016-12-01 17
5 a2 b2 c3 2016-12-01 3
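Applied to the question's actual identifier columns, the whole transformation is a few lines (a sketch, assuming every date column is named like 2015_01 ... 2016_12):
id_cols = ['Account', 'Customer Name', 'Postcode', 'segment', 'Rep']
df_final = df.melt(id_vars=id_cols, value_name='result', var_name='date')
df_final['date'] = pd.to_datetime(df_final['date'], format='%Y_%m')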

problems with MultiIndex

I'm having problems with MultiIndex and stack(). The following example is based on a solution from Calvin Cheung on Stack Overflow.
=== multi.csv ===
h1,main,h3,sub,h5
a,A,1,A1,1
b,B,2,B1,2
c,B,3,A1,3
d,A,4,B2,4
e,A,5,B3,5
f,B,6,A2,6
=== multi.py ===
#!/usr/bin/env python
import pandas as pd
df1 = pd.read_csv('multi.csv')
df2 = df1.pivot(index='main', columns='sub').stack()
print(df2)
=== output ===
h1 h3 h5
main sub
A A1 a 1 1
B2 d 4 4
B3 e 5 5
B A1 c 3 3
A2 f 6 6
B1 b 2 2
This works as long as the entries in the sub column are unique with respect to the corresponding entry in the main column. But if we change the sub column entry in row e to B2, then B2 is no longer unique in the group of A rows and we get an error message: "pandas.core.reshape.ReshapeError: Index contains duplicate entries, cannot reshape".
I expected the sub level of the index to behave like the primary level, where duplicates are indicated by blank entries under the first occurrence.
=== expected output ===
h1 h3 h5
main sub
A A1 a 1 1
B2 d 4 4
e 5 5
B A1 c 3 3
A2 f 6 6
B1 b 2 2
So my question is, how can I structure a MultiIndex in a way that allows duplicates in sub-levels?
Rather than do a pivot*, just set_index directly (this works for both examples):
In [11]: df
Out[11]:
h1 main h3 sub h5
0 a A 1 A1 1
1 b B 2 B1 2
2 c B 3 A1 3
3 d A 4 B2 4
4 e A 5 B2 5
5 f B 6 A2 6
In [12]: df.set_index(['main', 'sub'])
Out[12]:
h1 h3 h5
main sub
A A1 a 1 1
B B1 b 2 2
A1 c 3 3
A B2 d 4 4
B2 e 5 5
B A2 f 6 6
*You're not really doing a pivot here anyway, it just happens to work in the above case.
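To get the grouped display from the expected output above (duplicate labels blanked under the first occurrence), it should be enough to sort the MultiIndex after setting it:
# sort_index orders the main/sub levels so repeated labels render blank.
df.set_index(['main', 'sub']).sort_index()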
