How to convert a column's unique values into a comma-separated string? - python

Here is a sample dataframe; a is my column name.
   a  b  x
0  1  3  a
1  2  4  a
2  1  3  b
3  2  5  b
4  2  4  c
I need the column's unique values separated in this way.
Required output: '1','2'
Below is my code:
x = x1['id'].unique()
x2 = ','.join("'" + str(i) + "'" for i in x)
With this code I'm getting output something like this:
output: "'1','2'"
**2nd approach:**
x2 = ','.join("'" + x1['id'] + "'")
If I do this, the count of ids increases, since it joins every row rather than just the unique values.
I need to pass the output into a SQL query like select * from abc where a in (x2), and for that reason I need output something like this:
x2 --> '1','2'
but I'm getting
x2 ---> " '1','2'"

Try using your first approach with f-strings to make things easier.
x2 = ','.join(f"'{i}'" for i in x)
query = rf"""
SELECT
*
FROM
abc
WHERE
a in ({x2})
"""
If you try print(query), it gives
SELECT
*
FROM
abc
WHERE
a in ('1','2')
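For reference, a minimal self-contained sketch of the whole flow, assuming a DataFrame x1 with an id column as in the question:
import pandas as pd

# hypothetical data mirroring the question's id column
x1 = pd.DataFrame({'id': [1, 2, 1, 2, 2]})

x = x1['id'].unique()               # array([1, 2])
x2 = ','.join(f"'{i}'" for i in x)  # "'1','2'"

query = f"SELECT * FROM abc WHERE a IN ({x2})"
print(query)  # SELECT * FROM abc WHERE a IN ('1','2')
If the values come from untrusted input, a parameterized query is safer than string interpolation.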

Related

Filter dataframe based on matching values from two columns

I have a dataframe as shown below:
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Label': [1, 2, 3, 0, 0]})
I would like to filter the dataframe based on the below criteria
cdf['Id']==cdf['Label'] # first 3 rows are matching for both columns in cdf
I tried the below
flag = np.where[cdf['Id'].eq(cdf['Label'])==True,1,0]
final_df = cdf[cdf['flag']==1]
but I got the below error
TypeError: 'function' object is not subscriptable
I expect my output to be as shown below
   Id  Label
0   1      1
1   2      2
2   3      3
I think you're overthinking this. Just compare the columns:
>>> cdf[cdf['Id'] == cdf['Label']]
   Id  Label
0   1      1
1   2      2
2   3      3
Your particular error though is coming from the fact that you're using square brackets to call np.where, e.g. np.where[...], which is wrong. You should be using np.where(...) instead, but the above solution is bound to be as fast as it gets ;)
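For completeness, a sketch of the flag approach from the question with the call fixed (the extra column is unnecessary here, but it shows the corrected np.where usage):
import numpy as np
import pandas as pd

cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Label': [1, 2, 3, 0, 0]})

# np.where is a function, so it is called with parentheses
cdf['flag'] = np.where(cdf['Id'].eq(cdf['Label']), 1, 0)
final_df = cdf[cdf['flag'] == 1]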
You can also use query:
cdf.query('Id == Label')
Out[248]:
   Id  Label
0   1      1
1   2      2
2   3      3

Saving small sub-dataframes containing all values associated to a specific 'key' string

I need a little suggestion on a procedure using pandas. I have a 2-column dataset that looks like this:
A  0.4533
B  0.2323
A  1.2343
A  1.2353
B  4.3521
C  3.2113
C  2.1233
.. ...
where the first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings, so that I have the minimum associated with A, B, and C. Does anybody have any suggestions on that? It would also help me to somehow store all the values associated with each string.
Many thanks,
James
Input data:
>>> df
   0       1
0  A  0.4533
1  B  0.2323
2  A  1.2343
3  A  1.2353
4  B  4.3521
5  C  3.2113
6  C  2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
        1
0
A  0.4533
B  0.2323
C  2.1233
Update: to filter out all the values in the original dataset that are more than 20% away from the group minimum (transform aligns the group minimum with the original index, so it can be used directly as a mask):
out = df[df[1] <= df.groupby(0)[1].transform('min') * 1.2]
>>> out
   0       1
0  A  0.4533
1  B  0.2323
6  C  2.1233
You can simply do it with:
min_A = min(df[df["column_1"] == "A"]["value"])
min_B = min(df[df["column_1"] == "B"]["value"])
min_C = min(df[df["column_1"] == "C"]["value"])
where df is the dataframe, and column_1 and value are the names of its columns.
You can also do it using the predefined pandas function groupby():
>>> df.groupby(["column_1"]).min()
The above will give the same result.
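Since the question also asks about keeping all the values associated with each string, here is a small self-contained sketch (using the positional column names 0 and 1 from the demo above):
import pandas as pd

df = pd.DataFrame({0: ['A', 'B', 'A', 'A', 'B', 'C', 'C'],
                   1: [0.4533, 0.2323, 1.2343, 1.2353, 4.3521, 3.2113, 2.1233]})

# minimum per group: {'A': 0.4533, 'B': 0.2323, 'C': 2.1233}
mins = df.groupby(0)[1].min().to_dict()

# all values per group, if you also want to keep them around
groups = df.groupby(0)[1].apply(list).to_dict()
print(groups['A'])  # [0.4533, 1.2343, 1.2353]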

Select all dataframe rows containing a specific integer

My dataframe looks something like this:
x  y
1  a
1  b
2  c
3  d
4  e
5  f
1  g
All I want is to count the number of rows containing the number 1 in column 'x'. I know how this works for strings, but I can't find anything similar for numbers. The printed output in this case would be:
3
df.loc[df.x == 1, 'x'].count()
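A short self-contained sketch for context; summing a boolean mask gives the same count:
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 3, 4, 5, 1],
                   'y': list('abcdefg')})

print(df.loc[df.x == 1, 'x'].count())  # 3
print((df['x'] == 1).sum())            # 3; each True counts as 1 in a sum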

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
In R this is easily done with data.table:
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
Thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
implies that there is a name nunique in the namespace that performs some function. transform will take a function or a string that it maps to a known function, and 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred over passing your own functions. This is true even for numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead:
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
  id track type
0  X     1    A
1  X     1    A
2  Y     1    A
3  Z     1    A
4  W     2    B
5  W     2    B
6  W     2    B
7  W     2    B
df.groupby(['track', 'type'])['id'].transform('nunique')
0    3
1    3
2    3
3    3
4    1
5    1
6    1
7    1
Name: id, dtype: int64
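Continuing the demo above, assigning the transform result back gives the same number of rows with one new column, as the question asked:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')
print(df)
  id track type  n_unique_id
0  X     1    A            3
1  X     1    A            3
2  Y     1    A            3
3  Z     1    A            3
4  W     2    B            1
5  W     2    B            1
6  W     2    B            1
7  W     2    B            1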

Python Pandas. Delete cells whose value is contained in another cell in the same column

I have a dataframe like this:
A        B
exa      3
example  6
exam     4
hello    4
hell     3
I want to delete the rows that are substrings of another row and keep the longest one (notice that B is already the length of A).
I want my table to look like this:
A        B
example  6
hello    4
I thought about the following boolean filter but it does not work :(
df['Check'] = df.apply(lambda row: df.count(row['A'] in row['A'])>1, axis=1)
This is non-trivial. But we can take advantage of B to sort the data and compare each value only against strings longer than itself, for a solution slightly better than O(N^2).
df = df.sort_values('B')
v = df['A'].tolist()
df[[not any(b.startswith(a) for b in v[i + 1:]) for i, a in enumerate(v)]].sort_index()
         A  B
1  example  6
3    hello  4
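A self-contained version of the above for reference. Note that startswith checks prefixes, which is enough for this sample; swap in (a in b) if true substring matching is needed:
import pandas as pd

df = pd.DataFrame({'A': ['exa', 'example', 'exam', 'hello', 'hell'],
                   'B': [3, 6, 4, 4, 3]})

df = df.sort_values('B')
v = df['A'].tolist()
# keep a row only if no longer string starts with it
keep = [not any(b.startswith(a) for b in v[i + 1:]) for i, a in enumerate(v)]
print(df[keep].sort_index())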
Like the solution cold provided, mine is O(m*n) as well (in your case m = n):
df[np.sum(np.array([[y in x for x in df.A.values] for y in df.A.values]), 1) == 1]
Out[30]:
         A  B
1  example  6
3    hello  4
