How to Count No. of Occurrences after Union All - python

I'm new to SQLite and am figuring out how to implement the following using Python.
The sample database table is as below:
from  to
1     2
1     3
3     4
4     1
How do I implement the UNION ALL and count the number of occurrences of each integer?
My code is as such below:
combine_query = "SELECT * FROM (SELECT col1 FROM table UNION ALL SELECT col2 FROM table)"
c.execute(combine_query)
df = pd.DataFrame(c.fetchall(), columns=['Integers', 'Occurrences'])
The expected output is:
Integers  No of Occurrences
1         3
2         1
3         2
4         2
Thank you.

We can try the following union approach:
SELECT val, COUNT(*) AS num
FROM
(
    SELECT col1 AS val FROM yourTable
    UNION ALL
    SELECT col2 FROM yourTable
) t
GROUP BY val
ORDER BY val;
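If you want the count done in SQL and loaded straight into pandas, here is a minimal sketch, assuming a sqlite3 database file and the table/column names used in the answer above (yourTable, col1, col2):
import sqlite3
import pandas as pd

conn = sqlite3.connect("my_database.db")  # database path is an assumption
query = """
    SELECT val AS Integers, COUNT(*) AS Occurrences
    FROM (
        SELECT col1 AS val FROM yourTable
        UNION ALL
        SELECT col2 FROM yourTable
    ) t
    GROUP BY val
    ORDER BY val;
"""
# read_sql_query runs the aggregation in SQLite and returns a ready-made DataFrame
df = pd.read_sql_query(query, conn)
print(df)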

Related

Inverse a column in pyspark

a b
'1' 1
'2' 2
'3' 3
'4' 4
I would like to insert a new column which is the inverse of the b column while keeping the other columns constant.
Example:
a b c
'1' 1 4
'2' 2 3
'3' 3 2
'4' 4 1
We use temp['b'][::-1] to achieve this result in pandas. Is this transformation possible in pyspark as well?
Let's say your dataframe is ordered by column a.
You could try performing a self-join on a generated column that reverses the order. Such a column, rn, can be generated with row_number, e.g.:
Using pyspark api
from pyspark.sql import functions as F
from pyspark.sql import Window
output_df = (
    df.withColumn(
        "rn",
        F.row_number().over(Window.orderBy("a"))
    )
    .alias("df1")
    .join(
        df.withColumn(
            "rn",
            F.row_number().over(Window.orderBy(F.col("a").desc()))
        ).alias("df2"),
        ["rn"],
        "inner"
    )
    .selectExpr("df1.a", "df1.b", "df2.b as c")
)
Using spark sql
select
    df1.a,
    df1.b,
    df2.b as c
from (
    select
        *,
        row_number() over (order by a) rn
    from
        df
) df1
INNER JOIN (
    select
        b,
        row_number() over (order by a desc) rn
    from
        df
) df2 on df1.rn = df2.rn;
a  b  c
1  1  4
2  2  3
3  3  2
4  4  1
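For completeness, a self-contained usage sketch of the pyspark approach above, assuming the sample data from the question and a local Spark session:
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", 1), ("2", 2), ("3", 3), ("4", 4)], ["a", "b"])

# ascending and descending row numbers, then join them on rn
df1 = df.withColumn("rn", F.row_number().over(Window.orderBy("a"))).alias("df1")
df2 = df.withColumn("rn", F.row_number().over(Window.orderBy(F.col("a").desc()))).alias("df2")
df1.join(df2, ["rn"], "inner").selectExpr("df1.a", "df1.b", "df2.b as c").show()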

Rename a count column in a group by pandas dataframe

My data looks like this:
EID TERM
1 2
1 2
2 2
3 2
1 3
2 3
3 3
What I would like to do is something similar to this SQL
SELECT COUNT(EID) AS EVENT_COUNT, EID, TERM
FROM TABLE1
GROUP BY EID, TERM
The results would look like
EVENT_COUNT EID TERM
2 1 2
1 2 2
1 3 2
1 1 3
1 2 3
1 3 3
My Python Code looks like this:
new_df = df.groupby(["EID","TERM"]).agg({"EID":"count"})
I get the counts fine but can't rename the column, so when I write this data frame to a csv my output header looks like this:
EID,TERM,EID
How do I rename the count column (last EID) to say EVENT_COUNT?
You can aggregate on the EID column only and use to_frame to rename.
(
    df.groupby(["EID", "TERM"]).EID.count()
    .to_frame('EVENT_COUNT')
    .reset_index()
    [['EVENT_COUNT', 'EID', 'TERM']]
)
Or you can use named aggregation (available in pandas 0.25 and later):
(
    df.groupby(["EID", "TERM"])
    .agg(EVENT_COUNT=("EID", "count"))
    .reset_index()
    [['EVENT_COUNT', 'EID', 'TERM']]
)
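Another option, sketched here on the assumption that you simply want a row count per group, is groupby(...).size() with reset_index(name=...), which names the count column in one step:
new_df = (
    df.groupby(["EID", "TERM"])
    .size()                                # row count per (EID, TERM) group
    .reset_index(name="EVENT_COUNT")       # turn the Series into a named column
    [["EVENT_COUNT", "EID", "TERM"]]
)
new_df.to_csv("output.csv", index=False)   # output path is just an example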

Find FIRST row that satisfies conditions using either Numpy array or Pandas dataframe. (possibly tricky)

I will ask the question in a library-agnostic manner, as one library may be better than the other in this instance. Or maybe another magical library exists?
I have a DB table of about 10,000 records and I know how to create a numpy array or dataframe from it. The data is like so.
...
20,25,1,5
20,25,2,3
20,25,4,21
20,25,5,1
20,25,9,19
...
45,47,6,20
45,47,10,2
45,47,11,56
45,47,21,41
...
In the example search criteria below I am after the value '20' in col4 of this row.
45,47,6,20
Notice the first 2 columns have the same values and define a group.
Col2 will always be >= to col1 in a row.
The values in col3 will always be in ascending sequence within a group and not necessarily contiguous.
I am after the value of the cell in col4 using the following search criteria.
I know how to use a mask in numpy to find all the rows whose values are eg 'col1 >= 45 AND col2 <= 47'.
I have a third search value of eg '8' that is to be used to search col3 within the above group (col1, col2, 45 -> 47)
I need to find the FIRST row whose value in col3 <= 8.
Therefore I need to search the rows that have 'col1 >= 45 AND col2 <= 47' in a col3 DESCENDING sequence until row '45,47,6,20' is found. I am after the value '20' in col4.
There will only ever be at most 1 row that will match. It is possible that no row will match the criteria (eg if col3 search value was '3').
I need to do 100s of 1000s of searches at a time so would prefer that no new arrays or dataframes be created unless that has a minimal impact on resources.
I would:
filter the dataframe to only keep the lines matching the criteria
groupby on first 2 columns
apply tail(1) on each group to find the relevant line per group, if any
Code would be:
df[(df['col1'] >= 45) & (df['col2'] <= 47) & (df['col3'] <= 8)].groupby(
    ['col1', 'col2']
).tail(1)
With your sample, it gives the expected result:
col1 col2 col3 col4
5 45 47 6 20
The good news is that you can search multiple groups in one pass, and it still gives expected results if no rows match the criteria. Demo:
>>> df[(df['col1']>=20)&(df['col2']<=47)&(df['col3']<=8)].groupby(['col1', 'col2']).tail(1)
col1 col2 col3 col4
3 20 25 5 1
5 45 47 6 20
>>> df[(df['col1']>=20)&(df['col2']<=47)&(df['col3']<=3)].groupby(['col1', 'col2']).tail(1)
col1 col2 col3 col4
1 20 25 2 3
>>> df[(df['col1']>=45)&(df['col2']<=47)&(df['col3']<=3)].groupby(['col1', 'col2']).tail(1)
Empty DataFrame
Columns: [col1, col2, col3, col4]
Index: []
I suggest using a multiindex for the first three columns and a mask on this multiindex, as follows:
# I reproduce a similar dataframe
import pandas as pd
import numpy as np
np.random.seed(123)
v1 = np.random.randint(0, 10, 10)
v2 = v1 + 2
v3 = np.random.randint(0, 10, 10)
v4 = np.random.randint(0, 10, 10)
df = pd.DataFrame({"v1": v1,
                   "v2": v2,
                   "v3": v3,
                   "v4": v4})
# and sort it according to your comments
df = df.sort_values(by=["v1", "v2", "v3"])
df.head()
I get the following dataframe:
v1 v2 v3 v4
8 0 2 4 0
7 1 3 0 8
9 1 3 1 7
3 1 3 9 4
1 2 4 0 3
# parameters for research
val1 = 1 # the equivalent of your 45
val2 = 3 # the equivalent of your 47
val3 = 2 # the equivalent of your 8
# Set the multiindex
hdf = df.set_index(["v1", "v2", "v3"]).sort_index(ascending=False)
hdf.tail()
Your dataframe now looks as follows:
          v4
v1 v2 v3
2  4  0    3
1  3  9    4
      1    7
      0    8
0  2  4    0
# Define the mask
mask = (hdf.index.get_level_values("v1") >= val1) & \
       (hdf.index.get_level_values("v2") <= val2) & \
       (hdf.index.get_level_values("v3") <= val3)
# Select only the first row returned by the selection using cumsum on mask
print(hdf.loc[mask & (mask.cumsum() == 1), ["v4"]])
And you get:
v4
v1 v2 v3
1 3 1 7
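Since the question also allows a plain NumPy array, here is a minimal NumPy sketch, under the assumption that the data sits in a 2-D integer array arr with the same four columns and that, as in the example, the col1/col2 filter isolates a single group:
import numpy as np

arr = np.array([
    [20, 25, 1, 5],
    [20, 25, 2, 3],
    [20, 25, 4, 21],
    [20, 25, 5, 1],
    [20, 25, 9, 19],
    [45, 47, 6, 20],
    [45, 47, 10, 2],
    [45, 47, 11, 56],
    [45, 47, 21, 41],
])

# boolean mask for col1 >= 45, col2 <= 47, col3 <= 8
mask = (arr[:, 0] >= 45) & (arr[:, 1] <= 47) & (arr[:, 2] <= 8)
idx = np.nonzero(mask)[0]
if idx.size:
    # col3 is ascending within the group, so the last matching row is the
    # first one encountered when scanning col3 in descending order
    print(arr[idx[-1], 3])  # -> 20
else:
    print("no matching row")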

Replacing missing values while comparing 2 tables

I have 2 tables in my database and would like to compare them and fill in the missing values in one table using the other. For example:
TABLE 1
column 1 column 2
ab 3
ab -
a 1
a -
b -
b 2
ab 3
ab -
a 1
a -
b 2
TABLE 2
column 1 column 2
ab 3
a 1
b 2
I want to compare both tables on column 1 and fill in only the missing values in column 2, without touching the values that are already there.
Is this possible in SQL or using pandas in Python? Any solution would be helpful.
## SQL Query ##
This query replaces the NULL column2 values in table1 with the corresponding column2 values from table2:
UPDATE table1
SET table1.col2 = table2.col2
FROM table1
JOIN table2
  ON table1.col1 = table2.col1
WHERE table1.col2 IS NULL;
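For the pandas side of the question, a minimal sketch, assuming the two tables are already loaded as DataFrames and that the missing values are NaN rather than the literal '-' (column names here are assumptions):
import pandas as pd
import numpy as np

table1 = pd.DataFrame({
    "column1": ["ab", "ab", "a", "a", "b", "b", "ab", "ab", "a", "a", "b"],
    "column2": [3, np.nan, 1, np.nan, np.nan, 2, 3, np.nan, 1, np.nan, 2],
})
table2 = pd.DataFrame({
    "column1": ["ab", "a", "b"],
    "column2": [3, 1, 2],
})

# Build a lookup Series from table2 and fill only the missing values in table1
lookup = table2.set_index("column1")["column2"]
table1["column2"] = table1["column2"].fillna(table1["column1"].map(lookup))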

python pandas groupby then count rows satisfying condition

I am trying to do a groupby on the id column so that I can show the number of rows where col1 is equal to 1.
df:
id col1 col2 col3
a 1 1 1
a 0 1 1
a 1 1 1
b 1 0 1
my code:
df.groupby(['id'])['col1'].count()[1]
The output I got was 2. It didn't show me the values for other ids like b.
I want:
id col1
a 2
b 1
If possible, can the total rows per id also be displayed as a new column?
Example:
id col1 total
a 2 3
b 1 1
Assuming you have only 1 and 0 in col1, you can use agg:
df.groupby('id', as_index=False)['col1'].agg({'col1': 'sum', 'total': 'count'})
# id total col1
#0 a 3 2
#1 b 1 1
That is because id 'a' has three rows in total: two of them have col1 equal to 1 (which is why the count of 1s is 2), and the remaining row has a 0 in col1, so it is not included in that count; rows with different col1 values are not grouped together.
Yes, you can add the total to your output: alongside the count of 1s, also aggregate a plain row count per id and include it as an extra column.
If you want to generalize the solution to handle values in col1 other than 0 and 1, you can do the following. This also orders the columns correctly.
df.set_index('id')['col1'].eq(1).groupby(level=0).agg([('col1', 'sum'), ('total', 'count')]).reset_index()
id col1 total
0 a 2.0 3
1 b 1.0 1
Using a tuple in the agg method where the first value is the column name and the second the aggregating function is new to me. I was just experimenting and it seemed to work. I don't remember seeing it in the documentation so use with caution.
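If that tuple syntax feels risky, here is a sketch of the same idea using named aggregation (assuming pandas 0.25 or later), which avoids it:
out = (
    df.assign(is_one=df['col1'].eq(1))          # boolean flag for col1 == 1
    .groupby('id')
    .agg(col1=('is_one', 'sum'),                # number of rows where col1 == 1
         total=('is_one', 'count'))             # total rows per id
    .reset_index()
)
print(out)
#   id  col1  total
# 0  a     2      3
# 1  b     1      1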
