I have a DataFrame as below. I want to concatenate the first 2 columns.
If the length of their concatenation is <13, I would like to add 0s in between so that the length becomes 13.
If the length of their concatenation is >=13, I just want to concatenate them.
d = {'col1': [123456, 2, 1234567], 'col2': [1234567, 4, 1234567]}
df = pd.DataFrame(data=d)
df
df['var3'] = df.col1.astype(str) + df.col2.astype(str)
df
In case of the second row, instead of '24' I want 11 0s between 2 and 4.
I would like to keep the third row as it is, since the length of its concatenation is already >=13.
You may want to convert the numbers to strings before doing anything else, so I assume that col1 and col2 are strings.
First, find the combined string lengths and how many zeros are missing:
pads = 13 - (df.col1.str.len() + df.col2.str.len())
Then generate the necessary paddings and concatenate the columns and the paddings:
df['var3'] = df.col1 + pads.apply(lambda x: x * '0') + df.col2
#0 1234561234567
#1 2000000000004
#2 12345671234567
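A compact row-wise alternative (a sketch of mine, not from the answer above), relying on the fact that str.zfill never truncates when the requested width is smaller than the string:

# zero-fill col2 up to whatever is left of the 13-character budget after col1
df['var3'] = df.apply(
    lambda r: str(r.col1) + str(r.col2).zfill(13 - len(str(r.col1))),
    axis=1
)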
For each row, make a tuple with 3 values:
string1
string2
13 (or whatever the target length is) minus the length of string2, which is the width to which string1 must be padded with 0s
x = pd.Series(list(zip(df['col1'].astype(str),
                       df['col2'].astype(str),
                       13 - df['col2'].astype(str).str.len())))
Then use the string method ljust to right-pad the left string with 0s to that width and append the right string. Assign everything to the new column.
df['var3'] = x.apply(lambda x: x[0].ljust(x[2], '0') + x[1])
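For the sample frame this yields the same var3 values as the first answer:
#0 1234561234567
#1 2000000000004
#2 12345671234567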
Related
Is there a way to groupby only groups with 2 or more rows?
Or can I delete groups from a grouped dataframe that contain only 1 row?
Thank you very much for your help!
Yes, there is a way. Here is an example below:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['A','A','B','C','C'],[1,2,1,1,2]]).T,
    columns=['type','value']
)
groups = df.groupby('type')
groups_without_single_row_df = [g for g in groups if len(g[1]) > 1]
Iterating over a groupby yields (name, sub-dataframe) tuples.
Here, 'type' (A, B or C) is the first element of the tuple and the sub-dataframe is the second element.
You can check the length of each sub-dataframe with len(), as in [g for g in groups if len(g[1]) > 1], where we check the length of the second element of the tuple.
If len() is greater than 1, the group is included in the output list.
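A more idiomatic alternative (not in the original answer, though it is standard pandas) keeps the result as a single DataFrame via GroupBy.filter:

# keep only the rows whose group has 2 or more members
df_filtered = df.groupby('type').filter(lambda g: len(g) > 1)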
Hope it helps
How to convert a comma-separated value into a list of integers in a pandas dataframe?
Input (for example):
     qty
0    1,2
1      5
2  3,4,7
Desired output:
         qty
0     [1, 2]
1        [5]
2  [3, 4, 7]
There are 2 steps, split and then convert to integers, because after the split the values are lists of strings. This solution also works well if the lists have different lengths (no Nones are added):
df['qty'] = df['qty'].apply(lambda x: [int(y) for y in x.split(',')])
Or:
df['qty'] = df['qty'].apply(lambda x: list(map(int, x.split(','))))
Alternative solutions:
df['qty'] = [[int(y) for y in x.split(',')] for x in df['qty']]
df['qty'] = [list(map(int, x.split(','))) for x in df['qty']]
Or try expand=True (note that this assumes every row has the same number of values; with different lengths the missing cells become NaN and astype(int) fails):
df['qty'] = df['qty'].str.split(',', expand=True).astype(int).agg(list, axis=1)
Another approach: wrap each value in brackets and parse it with ast.literal_eval:
import ast
df["qty"] = ("[" + df["qty"].astype(str) + "]").apply(ast.literal_eval)
How to detect the rows and columns of dataframe elements that contain characters other than the desired ones?
The desired characters are A, B, C, a, b, c, 1, 2, 3, &, %, =, /
dataframe -

Col1  Col2  Col3
Abc   Øa    12
bbb   +     }
The output should be the elements Øa, +, } and their locations in the dataframe.
I find it really difficult to locate an element for a condition directly in pandas, so I converted the dataframe to a nested list first and then worked with the list. Try this:
import pandas as pd
import numpy as np
#creating your sample dataframe
array = np.array([['Abc','Øa','12'],['bbb','+','}']])
columns = ['Col1','Col2','Col3']
df = pd.DataFrame(data=array, columns=columns)
#convert dataframe to nested list
pd_list = df.values.tolist()
#return any characters other than the ones in 'var'
all_chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;=>?#[\\]^_`{|}~Ø'
var = 'ABCabc123&%=/'
for a in var:
    all_chars = all_chars.replace(a, "")

#stores previously detected elements to prevent duplicates
temp_storage = []

#loops through the nested list to get the elements' indexes
for x in all_chars:
    for i in pd_list:
        for n in i:
            if x in n:
                #check if the element is a duplicate
                if n not in temp_storage:
                    temp_storage.append(n)
                    print(f'found {n}: row={pd_list.index(i)}; col={i.index(n)}')
Output:
> found +: row=1; col=1
> found }: row=1; col=2
> found Øa: row=0; col=1
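For a more pandas-native take (a sketch of mine, not from the answer above), build a boolean mask with a regex character class of the allowed characters, then extract the offending coordinates with numpy:

import numpy as np
import pandas as pd

df = pd.DataFrame([['Abc','Øa','12'], ['bbb','+','}']],
                  columns=['Col1','Col2','Col3'])

# True wherever an element contains any character outside the allowed set
mask = df.apply(lambda col: col.str.contains(r'[^ABCabc123&%=/]'))
for r, c in zip(*np.where(mask)):
    print(f'found {df.iat[r, c]}: row={r}; col={c}')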
I have a DataFrame with two columns that look something like this:
A B
0 '_______' [2,3]
1 '_______' [0]
2 '_______' [1,4,6]
where one column is a string with 7 "_" and the other column contains a numpy array with different lengths. My goal is to change column A using B as indexes so it looks like this:
A B
0 '__23___' [2,3]
1 '0______' [0]
2 '_1__4_6' [1,4,6]
My code seems to work but I keep getting the error:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
I don't understand how I fix this error. My code is:
for i in range(len(df)):
    row = df.iloc[i,:].copy()
    numbers = row['B']
    for j in numbers:
        loop_string = df['A'][i]
        df['A'][i] = loop_string[:j] + str(j) + loop_string[j+1:]
Also, the fact that I need two for loops bothers me; there must be a more efficient way. Can anyone help me?
You can use apply to use a custom function on the B column:
df['A'] = df['B'].apply(lambda l: ''.join([str(i) if i in l else '_' for i in range(7)]))
The above does not consider the original value of A but instead creates an entirely new string column.
Result:
A B
0 __23___ [2, 3]
1 0______ [0]
2 _1__4_6 [1, 4, 6]
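If the existing contents of A should be preserved rather than rebuilt from underscores, a sketch of an alternative (my own, under that assumption) edits each string with a comprehension; assigning a whole new column also avoids the SettingWithCopyWarning:

# overwrite only the positions listed in B, keeping every other character of A
df['A'] = [
    ''.join(str(j) if j in set(idx) else ch for j, ch in enumerate(a))
    for a, idx in zip(df['A'], df['B'])
]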
How can I join the multiple lists in a Pandas column 'B' and get the unique values only:
A B
0 10 [x50, y-1, sss00]
1 20 [x20, MN100, x50, sss00]
2 ...
Expected output:
[x50, y-1, sss00, x20, MN100]
You can do this simply with a list comprehension and the sum() method (summing a column of lists concatenates them):
result = list(set(df['B'].sum()))
Now if you print result you will get your desired output:
['y-1', 'x20', 'sss00', 'x50', 'MN100']
If the input data are not lists but strings, first create the lists:
df.B = df.B.str.strip('[]').str.split(',')
Or:
import ast
df.B = df.B.apply(ast.literal_eval)
If order is important, use Series.explode to flatten the lists into one Series, then Series.unique to remove duplicates:
L = df.B.explode().unique().tolist()
#alternative
#L = df.B.explode().drop_duplicates().tolist()
print(L)
['x50', 'y-1', 'sss00', 'x20', 'MN100']
Another idea, if order is not important: use set with a flattening comprehension:
L = list(set([y for x in df.B for y in x]))
print(L)
['x50', 'MN100', 'x20', 'sss00', 'y-1']
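One more order-agnostic option (my own addition): take the union of the per-row lists with set.union:

# the order of the result is arbitrary, like the set comprehension above
L = list(set().union(*df.B))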