How to use a string to set iloc in pandas - python

I understand the general usage of iloc as follows.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df_ = df.iloc[:, 1:4]
On the other hand, even if only for limited use cases, is it possible to drive iloc with a string?
Below is pseudo code that does not work properly but is what I would like to do.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df.columns = ["money","job","fruits","animals","height"]
tests = ["1:2","2:3", "1:4"]
for i in tests:
    print(df.iloc[:, i])
Is there a better way to split the string into "start_col" and "end_col" using a function?

You can just create a converter function:
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
ranges = ["1:2", "2:3", "1:4"]
def as_int_range(ranges):
    # expand each "start:stop" string into the integers of that half-open range
    return [i for rng in ranges for i in range(*map(int, rng.split(':')))]
df.iloc[as_int_range(ranges), :]
0 1 2 3 4
1 4 5 6 4 5
2 7 8 9 4 5
1 4 5 6 4 5
2 7 8 9 4 5
3 10 11 12 4 5
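Since the question sliced columns rather than rows, the same helper works on the column axis too:
df.iloc[:, as_int_range(ranges)]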

iloc[ ] slices by integer position. For label-based (string) slicing, you can use loc[ ] the same way you used iloc[ ] with numbers. Here is the official pandas documentation for loc[ ]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
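For example, a minimal sketch using the column labels from the question (note that with loc, unlike iloc, both endpoints of the slice are included):
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]],
                  columns=["money","job","fruits","animals","height"])
# label-based slicing: selects "job", "fruits" and "animals"
print(df.loc[:, "job":"animals"])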

I didn't mention it in my original question, but I wrote a program that also supports examples like ["1:3, 4"].
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df.columns = ["a", "b", "c" , "d", "e"]
def args_to_list(string):
    # split a spec like "1:2,3" into column indices [1, 3]
    strings = string.split(",")
    column_list = []
    for each_string in strings:
        each_string = each_string.strip()
        if ":" in each_string:
            start_, end_ = each_string.split(":")
            for i in range(int(start_), int(end_)):
                column_list.append(i)
        else:
            column_list.append(int(each_string))
    return column_list
tests = ["1:2", "1,2,3,4", "1:2,3", "1,2:3,4"]
for i in tests:
    list_ = args_to_list(i)
    print(list_)
    print(df.iloc[:, list_])
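For reference, under this scheme the four test strings should parse to [1], [1, 2, 3, 4], [1, 3], and [1, 2, 4] respectively (ranges are half-open, like Python's range).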

Related

Explode a string with random length equally to next empty columns pandas

Let's say I have a df like this:
                                           string  some_col
0  But were so TESSA tell me a little bit more t...        10
1                                                           15
2                                                           14
3                      Some other text xxxxxxxxxx           20
How can I split the string column so that the long string is exploded into chunks spread equally across the following empty cells? It should look like this after fitting:
                         string  some_col
0   But were so TESSA tell me .        10
1  little bit more t seems like        15
2              you pretty upset        14
Reproducible:
import pandas as pd
data = [['But were so TESSA tell me a you pretty upset.', 10], ['', 15], ['', 14]]
df = pd.DataFrame(data, columns=['string', 'some_col'])
print(df)
I have no idea how to even get started. I'm looking for execution steps so that I can implement it on my own; any reference would be great!
You need to create groups with one non-empty row plus all consecutive empty rows (the group length gives the number of chunks), then use np.array_split to create n lists of words:
import numpy as np

# split the first row's words into len(group) roughly equal chunks
wrap = lambda x: [' '.join(l) for l in np.array_split(x.iloc[0].split(), len(x))]
df['string2'] = (df.groupby(df['string'].str.len().ne(0).cumsum())['string']
                   .apply(wrap).explode().to_numpy())
Output:
string some_col string2
0 But were so TESSA tell me a you pretty upset. 10 But were so TESSA
1 15 tell me a
2 14 you pretty upset.
3 Some other text xxxxxxxxxx 20 Some other text xxxxxxxxxx
This works in your case:
import pandas as pd
import numpy as np
from math import ceil
data = [['But were so TESSA tell me a you pretty upset.', 10], ['', 15], ['', 14],
['Some other long string that you need..', 10], ['', 15]]
df = pd.DataFrame(data, columns=['string', 'some_col'])
df['string'] = np.where(df['string'] == '', None, df['string'])
df.ffill(inplace=True)
df['group_id'] = df.groupby('string').cumcount() + 1
df['max_group_id'] = df.groupby('string').transform('count')['group_id']
df['string'] = df['string'].str.split(' ')
df['string'] = df.apply(
    func=lambda r: r['string'][int(ceil(len(r['string']) / r['max_group_id']) * (r['group_id'] - 1)):
                               int(ceil(len(r['string']) / r['max_group_id']) * r['group_id'])],
    axis=1)
df.drop(columns=['group_id', 'max_group_id'], inplace=True)
print(df)
Result:
string some_col
0 [But, were, so, TESSA] 10
1 [tell, me, a, you] 15
2 [pretty, upset.] 14
3 [Some, other, long, string] 10
4 [that, you, need..] 15
You can customize the number of rows you want with this code:
import pandas as pd
import random
df = pd.read_csv('text.csv')
string = df.at[0, 'string']
# the number of rows you want
num_of_rows = 4
endLineLimits = random.sample(range(1, string.count(' ')), num_of_rows - 1)
count = 1
for i in range(len(string)):
    if string[i] == ' ':
        if count in endLineLimits:
            string = string[:i] + ';' + string[i+1:]
        count += 1
newStrings = string.split(';')
for i in range(len(df)):
    df.at[i, 'string'] = newStrings[i]
print(df)
Example result:
string some_col
0 But were so TESSA tell 10
1 me a little bit more t 15
2 seems like you pretty 14
3 upset 20

Pandas MultiIndex columns slice: use combination of all and precise select

Input: a dataframe with hierarchical headers (MultiIndex columns).
Ask: select a combination of specific column(s) [level0, level1] and a broadcast [level0, :].
Example:
import numpy as np
import pandas as pd
index=pd.MultiIndex.from_product([["A", "B"], ["x", "y", "z"]])
df = pd.DataFrame(np.random.randn(8,6), columns=index)
The desired result is to select ('A','y') and everything under 'B'.
I've managed to achieve this using the solution below:
df[[x for x in df.columns if x == ('A','y') or x[0]=='B']]
I've tried to use .loc[] and slice(None), but that did not work.
Is there a more elegant solution to iterate over the columns' tuples?
Cheers.
Your list comprehension should work:
cols = [ent for ent in df if ent == ('A','y') or ent[0] == 'B']
df.loc[:, cols]
A B
y x y z
0 0.069915 2.563734 1.034784 0.659189
1 -0.240847 1.924626 1.241827 0.973155
2 -1.091353 -1.003005 1.648075 -1.162863
3 -0.747503 -0.211539 1.861991 -1.011261
4 -0.354648 -0.117533 0.524876 0.884997
5 0.786158 -2.073479 1.374893 1.428770
6 0.597740 0.853482 -0.187112 0.000626
7 -0.749839 -1.084576 -0.327888 -0.286908
Another option is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.select_columns(('A','y'), 'B')
A B
y x y z
0 0.069915 2.563734 1.034784 0.659189
1 -0.240847 1.924626 1.241827 0.973155
2 -1.091353 -1.003005 1.648075 -1.162863
3 -0.747503 -0.211539 1.861991 -1.011261
4 -0.354648 -0.117533 0.524876 0.884997
5 0.786158 -2.073479 1.374893 1.428770
6 0.597740 0.853482 -0.187112 0.000626
7 -0.749839 -1.084576 -0.327888 -0.286908
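For completeness, a plain-pandas sketch that avoids iterating over the tuples (assuming the df defined in the question): select the two pieces with loc and concatenate them.
import pandas as pd

part_a = df.loc[:, [("A", "y")]]  # the one specific column; MultiIndex kept
part_b = df.loc[:, ["B"]]         # everything under top level 'B'
result = pd.concat([part_a, part_b], axis=1)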

combine lines in dataframe with same identifier

I have this dataframe as a list:
l = [["a",1,2,"","",""],["a",1,2,3,"",""], ["a","",2,"3","4",""],["a",1,"","",4,5]]
I would like to combine all those lines to obtain this final line: ["a", 1, 2, 3, 4, 5].
Ideally, I would flatten the list of lists to fill in the blank values where needed. What would be the most pythonic way to do that?
Try:
import pandas as pd
import numpy as np
l = [["a",1,2,"","",""],["a",1,2,3,"",""], ["a","",2,"3","4",""],["a",1,"","",4,5]]
df = pd.DataFrame(l)
df_out = df.replace('', np.nan).ffill().tail(1)
print(df_out)
Output:
0 1 2 3 4 5
3 a 1.0 2.0 3 4 5.0
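If the frame can hold more than one identifier, ffill().tail(1) only recovers the last group; a hedged alternative is to take the first non-null value per column within each group:
import pandas as pd
import numpy as np

l = [["a",1,2,"","",""],["a",1,2,3,"",""],["a","",2,"3","4",""],["a",1,"","",4,5]]
df = pd.DataFrame(l)
# first non-null value per column, grouped by the identifier in column 0
out = df.replace('', np.nan).groupby(0, as_index=False).first()
print(out)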

Data Preprocessing in Python using Pandas

I am trying to preprocess one of the columns in my data frame. The issue is that I have [[content1], [content2], [content3]] in the relations column, and I want to remove the brackets.
I have tried the following:
df['value'] = df['value'].str[0]
The output that I get is:
[content 1]
Printing the dataframe gives:
id value
1 [[str1],[str2],[str3]]
2 [[str4],[str5]]
3 [[str1]]
4 [[str8]]
5 [[str9]]
6 [[str4]]
The expected output should look like:
id value
1 str1,str2,str3
2 str4,str5
3 str1
4 str8
5 str9
6 str4
It looks like you have lists of lists. You can try to unnest and join:
df['value'] = df['value'].apply(lambda x: ','.join([e for l in x for e in l]))
Or:
from itertools import chain
df['value'] = df['value'].apply(lambda x: ','.join(chain.from_iterable(x)))
NB. If you get an error, please provide it and the type of the column (df.dtypes)
As far as I can see, your values are stored as strings; here is sample data reproducing that:
Sample Data:
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':['[[str1],[str2],[str3]]', '[[str4],[str5]]', '[[str1]]', '[[str8]]', '[[str9]]', '[[str4]]']})
print(df)
id value
0 1 [[str1],[str2],[str3]]
1 2 [[str4],[str5]]
2 3 [[str1]]
3 4 [[str8]]
4 5 [[str9]]
5 6 [[str4]]
Result:
df['value'] = df['value'].astype(str).str.replace('[', '', regex=False).str.replace(']', '', regex=False)
print(df)
id value
0 1 str1,str2,str3
1 2 str4,str5
2 3 str1
3 4 str8
4 5 str9
5 6 str4
Note: the error AttributeError: Can only use .str accessor with string values means pandas is not treating the column as str, so cast it with astype(str) before doing the replace.
You can use Python's built-in re module for regular expressions.
This is the solution:
import pandas as pd
import re

# make the test data
data = [
    [1, '[[str1],[str2],[str3]]'],
    [2, '[[str4],[str5]]'],
    [3, '[[str1]]'],
    [4, '[[str8]]'],
    [5, '[[str9]]'],
    [6, '[[str4]]']
]

# convert the data to a DataFrame
df = pd.DataFrame(data, columns=['id', 'value'])
print(df)

# remove '[' and ']' from the 'value' column
df['value'] = df.apply(lambda x: re.sub(r"[\[\]]", "", x['value']), axis=1)
print(df)

join two columns of different dataframes into another dataframe

I have two dataframes:
one:
[A]
1
2
3
two:
[B]
7
6
9
How can I join two columns of different dataframes into another dataframe?
Like that:
[A][B]
1 7
2 6
3 9
I already tried that:
result = A
result = result.rename(columns={'employee_id': 'A'})
result['B'] = pd.Series(B['employee_id'])
and
B_column = B["employee_id"]
result = pd.concat([result,B_column], axis = 1)
result
but I still couldn't get it to work.
import pandas as pd
df1 = pd.DataFrame(data = {"A" : range(1, 4)})
df2 = pd.DataFrame(data = {"B" : range(7, 10)})
df = df1.join(df2)
Gives:
   A  B
0  1  7
1  2  8
2  3  9
While there are various ways to accomplish this, one would be to merge them on the index.
Something like this:
dfResult = dfA.merge(dfB, left_on=dfA.index, right_on=dfB.index, how='inner')
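As a design note, merging on the index is usually spelled with left_index/right_index, which avoids the extra key column that left_on/right_on would produce; a minimal sketch:
import pandas as pd

dfA = pd.DataFrame({"A": [1, 2, 3]})
dfB = pd.DataFrame({"B": [7, 8, 9]})
# align rows by index; 'inner' keeps only indices present in both frames
dfResult = dfA.merge(dfB, left_index=True, right_index=True, how="inner")
print(dfResult)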
