I'm trying to filter a pandas pivot table, but I'm not sure of the correct syntax for filtering on the "group by" (index) columns. I've been trying the standard df['column_name'] but I receive a KeyError.
Here's the code for the table
pivot = pd.pivot_table(q5, values='ENTRIES', index=['DATE', 'STATION', 'ID'], aggfunc='sum')
Here is what my pivot table looks like:
                      ENTRIES
DATE    STATION  ID
1/1/13  1 AVE    1         12
                 2         60
                 3          0
                 4        111
                 5        123
...
The desired result is to return the Dates and Stations where at least one ID had < 10 Entries, but not all IDs had < 10 Entries.
Thanks
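One hedged sketch of that filter (not from the original post), assuming q5 has the DATE, STATION, ID and ENTRIES columns shown above:

import pandas as pd

pivot = pd.pivot_table(q5, values='ENTRIES',
                       index=['DATE', 'STATION', 'ID'], aggfunc='sum')

# Keep only the (DATE, STATION) groups where at least one ID is below
# 10 entries but not every ID is.
result = pivot.groupby(level=['DATE', 'STATION']).filter(
    lambda g: (g['ENTRIES'] < 10).any() and not (g['ENTRIES'] < 10).all()
)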
Related
I have a dataset with about 3m rows, and I want to know which ids have more than one unique value for a column, let's call it company_id. (I don't want to remove them; I need to identify these rows for analysis.)
Table:
id  company_id
 1          10
 2          11
 1          13
 2          11
 3          31
 3          31
 3          33
In this example it would store the ids 1 and 3 because they have two different unique values for company_id, but it wouldn't store the id 2 because it has only one unique value for company_id (11).
I don't want to know how many rows each company_id has, I just need the ids (or their index). Thanks in advance.
Group the dataframe by id, then calculate the nunique aggregate of company_id for each group:
>>> df.groupby('id')['company_id'].agg('nunique')
id
1 2
2 1
3 2
Name: company_id, dtype: int64
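To keep only the ids with more than one unique company_id, filter that result; a short follow-up sketch, assuming the same df:

counts = df.groupby('id')['company_id'].nunique()
wanted_ids = counts[counts > 1].index.tolist()   # [1, 3] for the sample data

# If you need the rows themselves rather than the ids:
rows = df[df['id'].isin(wanted_ids)]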
I have a pandas dataframe with multiple IDs and, among the other columns, one date column, say 'date1'. For every ID I want to get the row with the minimum date. The other column values should also be retained.
What I have:
ID date1 value
1 1/1/2013 a
1 4/1/2013 a
1 8/3/2014 b
2 11/4/2013 a
2 19/5/2016 b
2 8/4/2017 b
The output I want :
ID date1 value
1 1/1/2013 a
2 11/4/2013 a
Thank you
Convert to datetime:
df = df.assign(date1 = pd.to_datetime(df.date1))
Get the label index of the minimum and subset:
df.loc[df.groupby("ID").date1.idxmin()]
ID date1 value
0 1 2013-01-01 a
3 2 2013-11-04 a
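Note that the sample dates look day-first (19/5/2016 cannot be month-first), so it may be safer to parse with dayfirst=True; the rows picked by idxmin are the same here, only the displayed dates change:

df = df.assign(date1=pd.to_datetime(df.date1, dayfirst=True))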
Assuming you have IDs in ID and dates in DATE:
df.groupby('ID')['DATE'].min()
Groups by your ID and then selects the minimum in each group. Returns a Series; if you want a data frame instead, call .reset_index() on the output.
If you instead want to select only the minimum rows, I would use that output as keys and then new_df.join(old_df.set_index(['ID', 'DATE'])) rather than dealing with index-based shenanigans.
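A minimal sketch of that join approach, assuming the columns are named ID and DATE:

# Minimum date per ID, kept as regular columns.
mins = df.groupby('ID', as_index=False)['DATE'].min()

# Join the original rows on (ID, DATE) to recover the other columns.
result = mins.join(df.set_index(['ID', 'DATE']), on=['ID', 'DATE'])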
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes and I want to filter one of them, keeping a row only if its pair of values is also present in the other dataframe. Both dataframes share the same column names.
For example, dataframe A has:
col1,col2
1 5
-10 15
6 7
and dataframe B has:
col1,col2
6 7
-10 15
-1 5
So in this example, I would like to pick the value pair in A and see if it is present in B.
First row of A has value pair 1,5 and since 1,5 is not present in B that row would be excluded from A.
The second and third rows of A have the pairs -10,15 and 6,7, and since both are present in B I would like to keep them.
So the desired output of the filtered table A would be:
col1,col2
-10 15
6 7
How can I achieve this?
EDIT: One of the first things I tried was a merge, but the resulting dataframe was actually bigger than the original. Since merge and the Merging 101 topic were suggested, I will add the real dataframes here.
Dataframe A has latitude, longitude and id columns (id is not the index). It has 363 rows:
id lat lon
0 0 -33.252192 -70.765291
1 1 -33.224300 -70.780249
2 2 -33.251651 -70.797289
3 3 -33.298574 -70.770133
4 4 -33.214315 -70.787822
... ... ... ...
358 499 -33.227614 -70.770126
359 501 -33.299217 -70.770685
360 502 -33.191476 -70.801492
361 503 -33.239037 -70.780278
362 504 -33.263893 -70.762674
Dataframe B has 73096 rows and it also has an id, latitude and longitude. I'm putting here only lat and lon.
lat lon
1 -33.260415 -70.713767
2 -33.461718 -70.853525
3 -33.258741 -70.638032
4 -33.544858 -70.578624
8 -33.535512 -70.574188
... ... ...
97724 -33.451817 -70.847999
97725 -33.452225 -70.846520
97726 -33.450841 -70.841494
97729 -33.461407 -70.856090
97730 -33.457633 -70.822085
So I want to see if the lat,lon pair in A is present in B and if not then exclude it from A.
When I do A.merge(B) I get a dataframe that is 1108 rows long.
You can try pandas.merge. Something like df1.merge(df2, how='inner', left_on=['col1','col2'], right_on=['col1','col2']), or simply on=['col1','col2'] since both frames share the column names.
(To help you remember, the how='inner' argument is named after the inner join in database terminology.)
A simple merge will do
df_out = dfA.merge(dfB)
Output
col1 col2
0 -10 15
1 6 7
df.merge does an inner join by default.
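If the merged result comes out larger than A (as in the edit above), B most likely contains repeated key pairs, and every repeat adds a row. A hedged workaround, assuming the key columns are named lat and lon, is to de-duplicate B on the keys before merging, or to test membership directly:

import pandas as pd

keys = ['lat', 'lon']

# Option 1: drop duplicate key pairs in B, so each matching row of A
# appears at most once in the inner merge.
filtered = A.merge(B[keys].drop_duplicates(), on=keys, how='inner')

# Option 2: a plain membership test, no merge involved.
mask = pd.MultiIndex.from_frame(A[keys]).isin(pd.MultiIndex.from_frame(B[keys]))
filtered_alt = A[mask]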
I have a dataframe something like the one below:
carrier_plan_identifier ... hios_issuer_identifier
1 AUSK ... 99806.0
2 AUSM ... 99806.0
3 AUSN ... 99806.0
4 AUSS ... 99806.0
5 AUST ... 99806.0
I need to pick a particular column, let's say wellthie_issuer_identifier.
I need to query the database based on this column's values. My select query will look something like:
select id, wellthie_issuer_identifier from issuers where wellthie_issuer_identifier in(....)
I need to add the id column back to my existing dataframe, matched on wellthie_issuer_identifier.
I have searched a lot but am not clear on how this can be done.
Try this:
1.) Pick a particular column, let's say wellthie_issuer_identifier:
t = tuple(df.wellthie_issuer_identifier)
This will give you a tuple like (1,0,1,1)
2.) Query the database based on this column's values.
You need to substitute the above tuple into your query:
query = """select id, wellthie_issuer_identifier from issuers
where wellthie_issuer_identifier in{} """
Create a cursor to the database, execute this query, and create a DataFrame from the result:
cur.execute(query.format(t))
df_new = pd.DataFrame(cur.fetchall())
df_new.columns = ['id','wellthie_issuer_identifier']
Now your df_new will have the columns id and wellthie_issuer_identifier. You need to add this id column back to the existing df.
Do this:
df = pd.merge(df,df_new, on='wellthie_issuer_identifier',how='left')
It will add an id column to df which will have values where a match is found on wellthie_issuer_identifier, and NaN otherwise.
Let me know if this helps.
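As a hedged alternative to formatting the tuple into the SQL string (which is open to injection and can break on a one-element tuple), the identifiers can be passed as query parameters; this sketch assumes a DB-API connection conn whose driver uses %s placeholders (e.g. psycopg2):

import pandas as pd

ids = list(df['wellthie_issuer_identifier'].unique())
placeholders = ', '.join(['%s'] * len(ids))
query = ("select id, wellthie_issuer_identifier from issuers "
         "where wellthie_issuer_identifier in ({})".format(placeholders))

# read_sql builds the result DataFrame directly from the query.
df_new = pd.read_sql(query, conn, params=ids)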
You can add another column to a dataframe directly if the column is not too long. For example:
import pandas as pd
df = pd.read_csv('just.csv')
df
id user_id name
0 1 1 tolu
1 2 5 jb
2 3 6 jbu
3 4 7 jab
4 5 9 jbb
#to add new column to the data above
df['new_column']=['jdb','biwe','iuwfb','ibeu','igu']#new values
df
id user_id name new_column
0 1 1 tolu jdb
1 2 5 jb biwe
2 3 6 jbu iuwfb
3 4 7 jab ibeu
4 5 9 jbb igu
#this should help if the dataset is not too large
Then you can go on querying your database.
This will not take the id values from the database, but since, as you said, all the wellthie_issuer_identifier values are there, the line below should work for you:
df1 = df.assign(id=(df['wellthie_issuer_identifier']).astype('category').cat.codes)
Working with dataframe df:
User_ID | Transaction_ID | Transaction_Row | Category
3824739 123 -1 A
3824739 123 -1 A
2398473 345 0 A
1230984 567 1 C
I need to pivot the above data by Category and sum Transaction_Row. However, I need to groupby Transaction ID, so that for Transaction ID 123 above, I only count the -1 once.
Can I do this with a pandas pivot table or only with a groupby?
pd.pivot_table(df,index=["Category"],values=["Transaction_Row"],aggfunc=np.sum)
Current Output:
Category | Sum of Transaction_Row
A -2
C 1
Desired Output:
Category | Sum of Transaction_Row
A -1
C 1
I don't know how to edit the statement above to fix the double-counting issue.
Thank You!
I hope I got your question right.
First, drop duplicates based on Transaction_ID and Transaction_Row only. Then do the pivot.
df_2 = df.drop_duplicates(subset=['Transaction_ID', 'Transaction_Row'])
pd.pivot_table(df_2, index=["Category"], values=["Transaction_Row"], aggfunc=np.sum)
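The same result can also be written with a plain groupby, assuming duplicates really should be collapsed per (Transaction_ID, Transaction_Row):

out = (df.drop_duplicates(subset=['Transaction_ID', 'Transaction_Row'])
         .groupby('Category')['Transaction_Row']
         .sum())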