Simplify query to just 'where' by making new column with pandas? - python

I have SQL queries that need to be run through a function called Select_analysis.
Form:
Select_analysis(input_shapefile, output_name, {where_clause}) # it only accepts up to a where clause.
Example of the query I need to express:
SELECT * FROM OT # OT is a dataset
GROUP BY OT.CA # CA is a number that may exist many times, therefore we group by that field.
HAVING ((Count(OT.OBJECTID)) > 1) # keep the groups whose objectid appears more than once.
OT dataset:
objectid   CA
1          125
2          342
3          263
1          125
We group by CA. The HAVING clause keeps the groups where an objectid appears more than once, which in this example is objectid 1 (CA = 125).
My idea is to make another column that stores the group count, so that the result can be selected with a simple where clause in the Select_analysis function.
Example: OT dataset
objectid   CA    count_of_objectid_aftergroupby
1          125   2
2          342   1
3          263   1
1          125   2
So the call can then be:
Select_analysis(roads.shp, output.shp, count_of_objectid_aftergroupby > '1')
Note: whatever the approach, the Select_analysis function has to be used at the end.

Assuming that you are pulling the data into pandas (since the question is tagged pandas), here's one possible solution:
import pandas as pd

df = pd.DataFrame({'objectID': [1, 2, 3, 1], 'CA': [125, 342, 463, 125]}).set_index('objectID')
objectID CA
1 125
2 342
3 463
1 125
df['count_of_objectid_aftergroupby'] = [df['CA'].value_counts().loc[x] for x in df['CA']]
objectID CA count_of_objectid_aftergroupby
1 125 2
2 342 1
3 463 1
1 125 2
The list comprehension does basically the following:
pull the value counts of df['CA'] as a Series
use .loc to index into that Series at each value of 'CA' to find the count of that value
put each count into a list
assign that list as the new column
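A vectorized alternative (my own sketch, not part of the original answer) is to let groupby/transform compute each group's size directly, which avoids the Python-level loop:
import pandas as pd

df = pd.DataFrame({'objectID': [1, 2, 3, 1], 'CA': [125, 342, 463, 125]}).set_index('objectID')
# transform('count') broadcasts each CA group's row count back onto the original rows
df['count_of_objectid_aftergroupby'] = df.groupby('CA')['CA'].transform('count')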

Related

Pandas filtering based on minimum data occurrences across multiple columns

I have a dataframe like this
country data_fingerprint organization
US 111 Tesco
UK 222 IBM
US 111 Yahoo
PY 333 Tesco
US 111 Boeing
CN 333 TCS
NE 458 Yahoo
UK 678 Tesco
I want the data_fingerprint values for rows where both the organization and the country are among the top 2 by count.
So, for organization the top 2 occurrences are Tesco and Yahoo, and for country we have US and UK.
Based on that, the output for data_fingerprint should be:
data_fingerprint
111
678
What I have tried, to check whether the organization exists in my complete dataframe, is this:
# First find top 2 occurrences of organization
nd = df['organization'].value_counts().groupby(level=0, group_keys=False).head(2)
# Then check whether the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
But I am not getting any data here. Once I get this working, I can do the same for country.
Can someone please help me get the output? I have a small amount of data, so I am using pandas.
Here is one way to do it:
df[
    df['organization'].isin(df['organization'].value_counts().head(2).index) &
    df['country'].isin(df['country'].value_counts().head(2).index)
]['data_fingerprint'].unique()
array([111, 678], dtype=int64)
Annotated code
# find top 2 most occurring country and organization
i1 = df['country'].value_counts().index[:2]
i2 = df['organization'].value_counts().index[:2]
# Create boolean mask to select the rows having top 2 country and org.
mask = df['country'].isin(i1) & df['organization'].isin(i2)
# filter the rows using the mask and drop dupes in data_fingerprint
df.loc[mask, ['data_fingerprint']].drop_duplicates()
Result
data_fingerprint
0 111
7 678
You can do
# First find top 2 occurrences of organization
nd = df['organization'].value_counts().head(2).index
# Then check whether the organization exists in the complete dataframe and filter those rows
new = df["organization"].isin(nd)
Output - Only Tesco and Yahoo left
df[new]
country data_fingerprint organization
0 US 111 Tesco
2 US 111 Yahoo
3 PY 333 Tesco
6 NE 458 Yahoo
7 UK 678 Tesco
You can do the same for country; a sketch of that step follows below.
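For completeness, a minimal sketch (my addition, assuming the same df and the 'new' mask defined above) of repeating the step for country and combining both masks:
# Top 2 most frequent countries
nc = df['country'].value_counts().head(2).index
new_country = df['country'].isin(nc)
# Combine both masks and take the distinct fingerprints
df.loc[new & new_country, 'data_fingerprint'].unique()   # array([111, 678])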

How to create a rank from a df with Pandas

I have a table that is chronologically sorted, with a state and an amount for each date. The table looks as follows:
Date        State  Amount
01/01/2022  1      1233.11
02/01/2022  1      16.11
03/01/2022  2      144.58
04/01/2022  1      298.22
05/01/2022  2      152.34
06/01/2022  2      552.01
07/01/2022  3      897.25
To generate the dataset:
pd.DataFrame({'date': ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022", "05/08/2022", "06/08/2022", "07/08/2022", "08/08/2022", "09/08/2022", "10/08/2022", "11/08/2022"],
              'state': [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
              'amount': [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166]})
I want to add a column called "Rank" that is incremented each time a given state reappears after a different state. If state 1 occurs twenty times in a row, its rank is still 1; once another state appears and state 1 shows up again, the rank for state 1 becomes 2. In other words, Rank counts how many separate consecutive runs of that state have occurred so far. An example would be as follows:
Date        State  Amount   Rank
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58   1
04/01/2022  1      298.22   2
05/01/2022  2      152.34   2
06/01/2022  2      552.01   2
07/01/2022  3      897.25   1
This could also be understood as follows:
Date        State  Amount   Rank_State1  Rank_State2  Rank_State3
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58                1
04/01/2022  1      298.22   2
05/01/2022  2      152.34                2
06/01/2022  2      552.01                2
07/01/2022  3      897.25                             1
Does anyone know how to build that Rank column starting from the previous table?
Your problem is in the general category of state change accumulation, which suggests an approach using cumulative sums and booleans.
Here's one way you can do it - maybe not the most elegant, but I think it does what you need
import pandas as pd

someDF = pd.DataFrame({'date': ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022", "05/08/2022", "06/08/2022", "07/08/2022", "08/08/2022", "09/08/2022", "10/08/2022", "11/08/2022"],
                       'state': [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
                       'amount': [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166]})

someDF["StateAccumulator"] = someDF["state"].apply(str).cumsum()

def groupOccurrence(someRow):
    sa = someRow["StateAccumulator"]
    s = str(someRow["state"])
    stateRank = len("".join([i if i != '' else " " for i in sa.split(s)]).split()) \
        + int((sa.split(s)[0] == '') or (int(sa.split(s)[-1] == '')) and sa[-1] != s)
    return stateRank

someDF["Rank"] = someDF.apply(lambda x: groupOccurrence(x), axis=1)
If I understand correctly, this is the result you want - "Rank" represents how many contiguous runs of a given state have appeared so far:
date state amount StateAccumulator Rank
0 01/08/2022 1 144 1 1
1 02/08/2022 1 142 11 1
2 03/08/2022 2 166 112 1
3 04/08/2022 2 144 1122 1
4 05/08/2022 3 142 11223 1
5 06/08/2022 1 166 112231 2
6 07/08/2022 1 144 1122311 2
7 08/08/2022 2 142 11223112 2
8 09/08/2022 2 166 112231122 2
9 10/08/2022 2 142 1122311222 2
10 11/08/2022 1 166 11223112221 3
Notes:
Instead of the somewhat hacky string-cumsum method used here, you could probably use a list accumulation function and a pandas split-apply-combine method to do the counting.
Alternatively, you can build a state-change boolean and take a cumulative sum of it, filtered/grouped on the state value, so that for a given state at a given row you count how many runs of that state have started so far (a sketch follows below).
The state-change boolean is computed like this:
someDF["StateChange"] = someDF["state"] != someDF["state"].shift()
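A minimal sketch of that cleaner approach (my own addition, not the answerer's code), using the someDF built above; it reproduces the Rank column shown in the result table:
# True on every row where the state differs from the previous row, i.e. a new run starts
state_change = someDF["state"] != someDF["state"].shift()
# Within each state, count how many runs of that state have started so far
someDF["Rank"] = state_change.astype(int).groupby(someDF["state"]).cumsum()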

Python/Pandas: Delete rows where value in one column is not present in another column in the same data frame

I have a data frame containing a multi-parent hierarchy of employees. node (int64) is a key that identifies each unique combination of employee and manager. parent (float64) is a key that refers to the manager's node.
Due to some source-data anomalies, there are parent keys present in the data frame that do not exist as nodes. I would like to delete all rows where this occurs.
empId  empName    mgrId  mgrName  node  parent
111    Alice      222    Bob      1     3
111    Alice      333    Charlie  2     4
222    Bob        444    Dave     3     5
333    Charlie    444    Dave     4     5
444    Dave                       5
555    Elizabeth  333    Charlie  6     7
In the above sample, the row to delete is the one for employee ID 555, because parent key 7 is not present anywhere in the node column.
Here's what I tried so far:
This removes some rows but does not remove all of them, and I am not sure why:
df1 = df[df['parent'].isin(df['node'])]
I thought maybe it was because parent is float and node is int64, so I converted and tried again, but got the same result as before:
df1 = df[df['parent'].astype('int64').isin(df['node'])]
Something to consider is that the data frame contains around 1.5 million rows.
I also tried this, but it just keeps running forever - I assume this is because .map loops through the entire data frame (which is around 1.5 million rows):
df[df['parent'].map(lambda x: np.isin(x, df['node']).all())]
What especially perplexes me about the first two snippets is that they consistently filter out only a small subset of the rows that fail the condition, rather than all of them.
Again, 'parent' is float64 and has empty values. 'node' is int64 and has no empty values. A more realistic example of node and parent keys is as follows:
Node - 13192210227
Parent - 12668210227.0
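For reference, a minimal sketch of the intended filter on the sample above (my own addition; column names abbreviated, and it is not a diagnosis of why the filter misbehaves on the full 1.5-million-row frame). Casting parent to the nullable Int64 dtype lets the empty value for employees with no manager survive the comparison:
import pandas as pd

df = pd.DataFrame({
    "empId":  [111, 111, 222, 333, 444, 555],
    "node":   [1, 2, 3, 4, 5, 6],
    "parent": [3.0, 4.0, 5.0, 5.0, None, 7.0],
})
parent_int = df["parent"].astype("Int64")
# Keep rows whose parent exists as a node; also keep rows with no parent at all (an assumption)
mask = parent_int.isin(df["node"]) | parent_int.isna()
print(df[mask])   # the empId 555 row (parent 7) is dropped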

Counting the number of customers by values in a second series

I have imported a list of customers into Python to run some RFM analysis. This adds a new field to the data for the RFM class, so now my data looks like this:
customer RFMClass
0 0001914f-4655-4148-a1dc-1f25ca6d1f15 343
1 0002e50a-5551-4d9a-8734-76307dfe2131 341
2 00039977-512e-47ad-b929-170f18a1b14a 442
3 000693ff-2c61-425c-97c1-0286c874dd2f 443
4 00095dc2-7f37-48b0-894f-910d90cbbee2 142
5 000b748b-7ea0-48f2-a875-5f6cb95561d9 141
...
I'd like to plot a histogram showing the number of customers in each RFM class. How can I get a count of the number of distinct customer IDs per class?
I tried adding a 1 to every row with summary['number'] = 1, thinking that it might be easier to count these rather than the customer IDs (which have already been de-duplicated in my code), but I can't figure out how to sum these per RFM class either.
Any thoughts on how I could do this?
I worked this out by using .groupby on my RFM class and summing the 'number' I assigned to each row:
byhour = df.groupby(['Hour']).agg({'Orders': 'sum'})
print(byhour)
This then produces the desired output:
Orders
Hour
0 902
1 438
2 307
3 162
4 149
5 233
6 721
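Applied to the question's own columns, the same idea looks roughly like this (a sketch; df is assumed to be the customer/RFMClass frame shown above, and plotting needs matplotlib):
# Count distinct customer IDs per RFM class and plot the counts as a bar chart
class_counts = df.groupby('RFMClass')['customer'].nunique()
class_counts.plot(kind='bar')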

Why did I get dtype 'object' on reading in my data frame?

I am new to Python and I want to determine the type of each column in a data frame. I wrote the code below, but the results are not as expected: I only get 'object' as the type.
This is my data frame (showing just the first rows):
IDINDUSANALYSE IDINDUS IDINDUSEFFLUENT DATEANALYSE IDTYPEECHANTILLON IDPRELEVEUR IDLABO IDORIGINEVAL CONFORME CONFCALC IDINDDOSS CONFFORCE
672 635 6740 10/01/13 2 1 3 1 1 1 531 0
673 635 6740 11/01/13 2 1 3 1 1 1 531 0
674 635 6740 14/01/13 2 1 3 1 1 1 531 0
675 635 6740 15/01/13 2 1 3 1 1 1 531 0
676 635 6740 16/01/13 2 1 3 1 1 1 531 0
677 635 6740 18/01/13 2 1 3 1 1 1 531 0
This is my code:
import pandas as pd
import csv

with open("/home/***/Documents/Table3.csv") as f:
    r = csv.reader(f)
    df = pd.DataFrame().from_records(r)

for index, row in df.iterrows():
    print(df.dtypes)
As a result I get this:
0 object
1 object
2 object
3 object
4 object
Please tell me what I did wrong?
Try this
import pandas as pd
df = pd.read_csv("/home/***/Documents/Table3.csv")
types = [df['{0}'.format(i)].dtype for i in df.columns]
print(types)
which results as
[dtype('float64'), dtype('O'), dtype('O')]
Also, your actual dataframe has more columns than that, yet you got 'object' as a result only 5 times - that was your first hint.
types = df.columns.to_series().groupby(df.dtypes).groups
Then print out types, and you would get all of the column types (grouped by type).
Also, you can open the .csv file directly to a data frame using: pd.read_csv(filepath)
If you want a specific column's type - df.column.dtype or df['column'].dtype
Please show your actual CSV file. If all columns came back as object, they were most likely detected as strings, probably because your CSV file quotes each field.
To read in quoted fields in pandas and convert them back to their types (numeric/categorical), do either of the following (the quoting constants come from the standard csv module):
pd.read_csv(..., quoting=csv.QUOTE_ALL)
pd.read_csv(..., quoting=csv.QUOTE_NONNUMERIC)
and read the section 'quoting' in https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
It is also good practice to explicitly pass pd.read_csv(..., dtype={...}) a dictionary telling it which type to use for each column name,
e.g. {'a': np.float64, 'b': np.int32}
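Putting that together, a minimal sketch (my addition; the column names come from the sample above, and the chosen types are assumptions to adjust to your data):
import numpy as np
import pandas as pd

df = pd.read_csv("/home/***/Documents/Table3.csv",
                 dtype={"IDINDUS": np.int64, "CONFORME": np.int8},  # assumed types for illustration
                 parse_dates=["DATEANALYSE"],                       # parse the date column as datetime
                 dayfirst=True)                                     # dates look like dd/mm/yy
print(df.dtypes)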
