I want to create a system in Python where the observations in a variable are mapped to numbers. Together, the numbers from the (in this case) 5 different variables form a unique code. The first number corresponds to the first variable, and when an observation in a later row is the same as an earlier one, it gets the same number. As illustrated in the example, if Apple appears in rows 1 and 3, both IDs get a '1' as the first number.
The output should be a new column with the ID. If all the observations in two rows are the same, their IDs will be the same. In the example below you see 5 variables leading to the unique ID on the right, which should be the output.
You can use pd.factorize:
df['UniqueID'] = (df.apply(lambda x: (1+pd.factorize(x)[0]).astype(str))
.agg(''.join, axis=1))
print(df)
# Output
Fruit Toy Letter Car Country UniqueID
0 Apple Bear A Ferrari Brazil 11111
1 Strawberry Blocks B Peugeot Chile 22222
2 Apple Blocks C Renault China 12333
3 Orange Bear D Saab China 31443
4 Orange Bear D Ferrari India 31414
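One caveat: joining single digits can produce colliding IDs once a column has 10 or more distinct values (for example, codes 1 then 12 and codes 11 then 2 both join to '112'). A minimal sketch of the same idea that avoids this, using a hypothetical '-' separator and the column names from the output above:
import pandas as pd

# Factorize each column separately (codes start at 0, so add 1), then join
# the per-column codes with a separator so multi-digit codes cannot collide.
cols = ['Fruit', 'Toy', 'Letter', 'Car', 'Country']
codes = df[cols].apply(lambda col: (pd.factorize(col)[0] + 1).astype(str))
df['UniqueID'] = codes.agg('-'.join, axis=1)
print(df)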
I have a table that is chronologically sorted, with a state and an amount for each date. The table looks as follows:
Date        State  Amount
01/01/2022  1      1233.11
02/01/2022  1      16.11
03/01/2022  2      144.58
04/01/2022  1      298.22
05/01/2022  2      152.34
06/01/2022  2      552.01
07/01/2022  3      897.25
To generate the dataset:
pd.DataFrame({'date': ["01/08/2022","02/08/2022","03/08/2022","04/08/2022","05/08/2022","06/08/2022","07/08/2022","08/08/2022","09/08/2022","10/08/2022","11/08/2022"], 'state' : [1,1,2,2,3,1,1,2,2,2,1],'amount': [144,142,166,144,142,166,144,142,166,142,166]})
I want to add a column called "Rank" that increments each time a state reappears after a different state. If state 1 occurs twenty times in a row, the rank stays 1 for all of those rows. When another state appears and then state 1 comes back, its rank increments to 2. In other words, Rank counts how many separate runs of consecutive rows a given state has had so far. An example would be as follows:
Date        State  Amount   Rank
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58   1
04/01/2022  1      298.22   2
05/01/2022  2      152.34   2
06/01/2022  2      552.01   2
07/01/2022  3      897.25   1
This could also be understood as follows:
Date        State  Amount   Rank_State1  Rank_State2  Rank_State3
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58                1
04/01/2022  1      298.22   2
05/01/2022  2      152.34                2
06/01/2022  2      552.01                2
07/01/2022  3      897.25                             1
Does anyone know how to build that Rank column starting from the previous table?
Your problem is in the general category of state change accumulation, which suggests an approach using cumulative sums and booleans.
Here's one way you can do it - maybe not the most elegant, but I think it does what you need:
import pandas as pd

someDF = pd.DataFrame({
    'date': ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022",
             "05/08/2022", "06/08/2022", "07/08/2022", "08/08/2022",
             "09/08/2022", "10/08/2022", "11/08/2022"],
    'state': [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
    'amount': [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166],
})

# Running string of all states seen so far, e.g. "1", "11", "112", ...
someDF["StateAccumulator"] = someDF["state"].apply(str).cumsum()

def groupOccurrence(someRow):
    # Count the separate runs of this row's state within the accumulated
    # state string (the second term corrects the count when the accumulator
    # begins with this state).
    sa = someRow["StateAccumulator"]
    s = str(someRow["state"])
    stateRank = len("".join([i if i != '' else " " for i in sa.split(s)]).split()) \
        + int((sa.split(s)[0] == '') or (int(sa.split(s)[-1] == '')) and sa[-1] != s)
    return stateRank

someDF["Rank"] = someDF.apply(lambda x: groupOccurrence(x), axis=1)
If I understand correctly, this is the result you want - "Rank" is intended to represent the number of separate runs of contiguous rows in which a given state has appeared so far:
date state amount StateAccumulator Rank
0 01/08/2022 1 144 1 1
1 02/08/2022 1 142 11 1
2 03/08/2022 2 166 112 1
3 04/08/2022 2 144 1122 1
4 05/08/2022 3 142 11223 1
5 06/08/2022 1 166 112231 2
6 07/08/2022 1 144 1122311 2
7 08/08/2022 2 142 11223112 2
8 09/08/2022 2 166 112231122 2
9 10/08/2022 2 142 1122311222 2
10 11/08/2022 1 166 11223112221 3
Notes:
Instead of the somewhat hacky string-cumsum method I'm using here, you could probably accumulate the states into a list and then use a pandas split-apply-combine method to do the counting in the lambda function.
You would then compute a state-change boolean and take a cumulative sum of that boolean, filtered/grouped on the state value (i.e. how many state changes have occurred for any given state).
The state-change boolean is computed like this:
someDF["StateChange"] = someDF["state"] != someDF["state"].shift()
So, for a given state at a given row, you'd count how many state changes had occurred for that state in the previous rows; a sketch of this cleaner approach follows below.
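Here is a minimal sketch of that cleaner approach, assuming the same someDF as above: the state-change boolean marks the start of each contiguous run, and a cumulative sum of those starts within each state gives the rank directly.
import pandas as pd

someDF = pd.DataFrame({
    'date': ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022",
             "05/08/2022", "06/08/2022", "07/08/2022", "08/08/2022",
             "09/08/2022", "10/08/2022", "11/08/2022"],
    'state': [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
    'amount': [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166],
})

# True on every row where a new contiguous run of a state begins
# (the first row always starts a run, since shift() yields NaN there).
stateChange = someDF["state"] != someDF["state"].shift()

# Within each state, count how many runs of that state have started so far.
someDF["Rank"] = stateChange.groupby(someDF["state"]).cumsum()
On the sample data this reproduces the Rank column shown above (1 1 1 1 1 2 2 2 2 2 3).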
In a pandas DataFrame: I want to count how many 1 values there are in the Stroke column for each value in the Residence_type column. To do the counting, I convert the Stroke column to a list, which I think is easier.
So, for example, the value Rural in Residence_type has 300 occurrences of 1 in the Stroke column, and so on.
The data is something like this:
Residence_type Stroke
0 Rural 1
1 Urban 1
2 Urban 0
3 Rural 1
4 Rural 0
5 Urban 0
6 Urban 0
7 Urban 1
8 Rural 0
9 Rural 1
The code:
grpby_variable = data.groupby('stroke')
grpby_variable['Residence_type'].tolist().count(1)
The final goal is to find the difference between the number of times the value 1 appears for each value in the Residence_type column (Rural or Urban).
Am I doing it right? What is this error?
I'm not sure I fully understood what you need, but try filtering on Stroke == 1, then grouping by Residence_type and counting:
df.query("Stroke==1").groupby('Residence_type')['Stroke'].agg('count').to_frame('Stroke_Count')
Stroke_Count
Residence_type
Rural 3
Urban 2
You could try the following if you also need the difference between the categories:
df1 =df.query("Stroke==1").groupby('Residence_type')['Stroke'].agg('count').to_frame('Stroke_Count')
df1.loc['Diff'] = abs(df1.loc['Rural']-df1.loc['Urban'])
print(df1)
Stroke_Count
Residence_type
Rural 3
Urban 2
Diff 1
Assuming that Stroke only contains 1 or 0, you can do:
result_df = df.groupby('Residence_type').sum()
>>> result_df
Stroke
Residence_type
Rural 3
Urban 2
>>> result_df.Stroke['Rural'] - result_df.Stroke['Urban']
1
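As a hedged alternative, pd.crosstab tabulates every Stroke value per residence type, which also works if Stroke ever contains values other than 0 and 1; the column labelled 1 then holds the counts you want:
import pandas as pd

# Count every Stroke value per Residence_type; the column labelled 1
# holds how many rows with Stroke == 1 each residence type has.
counts = pd.crosstab(df['Residence_type'], df['Stroke'])

# Difference between the two residence types for Stroke == 1.
diff = abs(counts.loc['Rural', 1] - counts.loc['Urban', 1])
print(counts)
print(diff)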
I need to read in a .csv file which contains a distance matrix, so it has identical row names and column names, and it's important to have both. However, the code below only gives me a dataframe where the row names end up in an extra "Unnamed: 0" column and the index becomes plain integers again, which is very inconvenient for indexing later.
DATA = pd.read_csv("https://raw.githubusercontent.com/PawinData/UC/master/DistanceMatrix_shortestnetworks.csv")
I did check the documentation of pandas.read_csv and played with index_col, header, names, etc., but none seemed to work. Can anybody help me out here?
Use the index_col=0 parameter to make the first column the index:
url = "https://raw.githubusercontent.com/PawinData/UC/master/DistanceMatrix_shortestnetworks.csv"
DATA = pd.read_csv(url, index_col=0)
print (DATA.head())
Imperial Kern Los Angeles Orange Riverside San Bernardino \
Imperial 0 3 3 2 1 2
Kern 3 0 1 2 2 1
Los Angeles 3 1 0 1 2 1
Orange 2 2 1 0 1 1
Riverside 1 2 2 1 0 1
San Diego San Luis Obispo Santa Barbara Ventura
Imperial 1 4 4 4
Kern 3 1 1 1
Los Angeles 2 2 2 1
Orange 1 3 3 2
Riverside 1 3 3 3
This issue most likely occurs because your CSV was saved along with its RangeIndex, which usually doesn't have a name. The fix would actually need to be done when saving the DataFrame: data.to_csv('file.csv', index=False)
To read the unnamed column as the index, specify index_col=0 in pd.read_csv; this reads the first column in as the index.
data = pd.read_csv("https://raw.githubusercontent.com/PawinData/UC/master/DistanceMatrix_shortestnetworks.csv",index_col = 0)
And to drop the unnamed column use data.drop(data.filter(regex="Unname"),axis=1, inplace=True)
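As a small sketch of the round trip (the copy filename here is hypothetical): keep the index when writing the matrix so the row names survive, and read it back with index_col=0 so rows and columns can both be addressed by name.
import pandas as pd

url = "https://raw.githubusercontent.com/PawinData/UC/master/DistanceMatrix_shortestnetworks.csv"
DATA = pd.read_csv(url, index_col=0)

# Both axes are now labelled by place names, so cells can be looked up by name.
print(DATA.loc["Imperial", "Kern"])

# Keep the index when saving (the default), so the row names survive the round trip.
DATA.to_csv("DistanceMatrix_copy.csv")
restored = pd.read_csv("DistanceMatrix_copy.csv", index_col=0)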
I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains strings, that row should be deleted from the data frame. I tried:
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors = 'coerce')
but this only converts those values to NaN.
Let's start from a note concerning your sample data: it contains Nan strings, which are not among the strings automatically recognized as NaN. To treat them as NaN, I read the source text with read_fwf, passing na_values=['Nan'].
And now get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
if pd.isna(cell) or cell == 'unknown':
return True
return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also accept a cell if it contains only the word unknown, but not if that word is enclosed in e.g. quotes.
If you change your mind about what is / is not acceptable, change the above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
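For comparison, here is a sketch of an alternative mask built with pd.to_numeric; it mirrors the isAcceptable logic above, assuming (as above) that the literal 'Nan' strings were read in as real NaN:
import pandas as pd

cols = ['Leaves', 'Salary', 'Performance']

# Cells that parse as numbers become numeric, everything else becomes NaN.
numeric = df[cols].apply(pd.to_numeric, errors='coerce')

# Explicitly re-allow the values treated as acceptable: real NaN and the
# bare string 'unknown'.
allowed = df[cols].isna() | df[cols].eq('unknown')

# Keep rows where every one of the three columns is numeric or allowed.
mask = (numeric.notna() | allowed).all(axis=1)
clean = df[mask]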