Extracting only the percent value in a column in pandas - python

I have a column whose cells are strings containing percentages, e.g. XX: (+2, 30%); (-5, 20%); (+17, 50%).
I need to extract the highest % value for each such string and perform this on the whole column.
Any advice will be highly appreciated!
Thanks

In my understanding, each cell in column XX is a string which contains some percentages. I have included a small test DataFrame I used:
import pandas as pd
import re
df = pd.DataFrame({"XX":["(+2, 30%), (-5, 20%), (+17, 50%)","(+2, 70%), (-5, 20%), (+17, 50%)", ""]})
pattern = re.compile(r"([0-9.]+)%")
# Convert the matches to float so the max is numeric, not lexicographic
# (as strings, "9" would rank above "10"); default=-1 covers empty cells.
df["XX"].apply(lambda x: max(map(float, pattern.findall(x)), default=-1))
OUTPUT
0    50.0
1    70.0
2    -1.0
Name: XX, dtype: float64
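If the column is large, the same idea can be vectorized (a sketch of mine, assuming the df and pattern above): str.extractall pulls every percent match into a long Series, and a per-row max replaces the Python-level lambda.
pcts = df["XX"].str.extractall(r"([0-9.]+)%")[0].astype(float)
# Per-row maximum; reindex restores rows with no match and fills them with -1
df["max_pct"] = pcts.groupby(level=0).max().reindex(df.index, fill_value=-1)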

This code returns the largest percent value in a column of percent strings:
import pandas as pd
import numpy as np
data = [['2.3%', 1], ['5.3%', 3]]
data = pd.DataFrame(data)
first_column = data.iloc[:, 0]
percent_list = []
for val in first_column:
    percent_list.append(float(val[:-1]))  # strip the trailing '%' and parse
print(percent_list[np.argmax(percent_list)])
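For what it's worth, pandas string methods get the same answer in one line (a sketch, assuming the data frame above):
# Strip the trailing '%', parse as float, take the maximum
print(data.iloc[:, 0].str.rstrip('%').astype(float).max())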


I want to use the outputs as data and sum them

import numpy as np
import pandas as pd
df = pd.read_csv('test_python.csv')
print(df.groupby('fifth').sum())
This is my data.
And I am summing the first three columns for every word in fifth.
The result is this, and it is correct.
The next thing I want to do is take those results and sum them together, for example:
buy = 6
cheese = 8
file = 12
.
.
.
word = 13
How can I do this? How can I use the results?
And also, I now want to use the column second as a new column named second2, with the results as data. How can I do that?
For summing, you can use apply with a lambda:
df = pd.DataFrame({"first":[1]*14,
"second":np.arange(1,15),
"third":[0]*14,
"forth":["one","two","three","four"]*3+["one","two"],
"fifth":["hello","no","hello","hi","buy","hello","cheese","water","hi","juice","file","word","hi","red"]})
df1 = df.groupby(['fifth'])['first','second','third'].agg('sum').reset_index()
df1["sum_3_Col"] = df1.apply(lambda x: x["first"] + x["second"] + x["third"],axis=1)
df1.rename(columns={'second':'second2'}, inplace=True)
Output of df1:
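As a side note (my addition, not part of the original answer), the apply/lambda line can be replaced with a vectorized row-wise sum, which is faster on larger frames:
df1["sum_3_Col"] = df1[["first", "second", "third"]].sum(axis=1)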

pandas str.contains match exact substring not working with regex boundary

I have two dataframes, and trying to find out a way to match the exact substring from one dataframe to another dataframe.
First DataFrame:
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl', 'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'],
'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
print(dataframe)
Second DataFrame
test_data = {'code name': ['PB', 'PB', 'PB'],
'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Approach
for k, l, m in zip(test_dataframe.iloc[:, 0], test_dataframe.iloc[:, 1], test_dataframe.iloc[:, 2]):
    dataframe['Site'] = np.select([dataframe['Place Name'].str.contains(r'\b{}~{}\b'.format(k, m), regex=False)], [l],
                                  default=dataframe['Site'])
The current output is below; I expect an exact substring match, but the code above does not produce it.
Current Output:
Place Name Site
TS~HOT_MD~h_PB~progra_VV~gogl programmatic-mechanics
FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev
Expected Output:
Place Name Site
TS~HOT_MD~h_PB~progra_VV~gogl programmatic me
FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev
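Two things are worth noting about the approach above (my observation, not from the original thread): with regex=False the whole pattern, including \b, is treated as a literal substring, so it can never match; and even with regex=True, \b fails here because the underscore counts as a word character, so there is no word boundary around PB~progra. A small demo:
import re
s = 'TS~HOT_MD~h_PB~progra_VV~gogl'
print(re.search(r'\bPB~progra\b', s))        # None: '_' is a word character, no boundary
print(re.search(r'PB~progra(?=_|\W|$)', s))  # matches: a lookahead instead of \b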
Data
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Map the test_dataframe columns code and Actual into a dictionary as keys and values respectively:
keys=test_dataframe['code'].values.tolist()
dicto=dict(zip(test_dataframe.code, test_dataframe.Actual))
dicto
Join the keys with | so the regex can match any of the phrases:
k = '|'.join(r"{}".format(x) for x in dicto.keys())
k
Extract the substring matching any of the phrases in k and map it through the dictionary:
dataframe['Site'] = dataframe['Place Name'].str.extract('('+ k + ')', expand=False).map(dicto)
dataframe
Output (matches the expected output above).
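One caveat (my addition): alternation order matters. The regex engine tries alternatives left to right, so if 'prog' appeared in the pattern before 'progra', the shorter key would win. Sorting the keys longest-first avoids this:
k = '|'.join(sorted(dicto.keys(), key=len, reverse=True))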
Not the most elegant solution, but this does the trick.
Set up data
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Solution
Create a column in test_dataframe with the substring to match:
test_dataframe['match_str'] = test_dataframe['code name'] + '~' + test_dataframe.code
print(test_dataframe)
code name Actual code match_str
0 PB programmatic me progra PB~progra
1 PB emoteev emo PB~emo
2 PB programmatic-mechanics prog PB~prog
Define a function to apply to test_dataframe:
def match_string(row, dataframe):
    ind = row.name
    try:
        if row['match_str'] in dataframe.loc[ind, 'Place Name']:
            return row['Actual']
        else:
            return dataframe.loc[ind, 'Site']
    except KeyError:
        # More rows in test_dataframe than there are in dataframe
        pass
# Apply match_string and assign back to dataframe
dataframe['Site'] = test_dataframe.apply(match_string, args=(dataframe,), axis=1)
Output:
Place Name Site
0 TS~HOT_MD~h_PB~progra_VV~gogl programmatic me
1 FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev

ngroups in groupby object not matching nunique() in same column

I have a DataFrame consisting of Ids and Serial Numbers. I want to create a new DataFrame with the Ids as index and the serial numbers as column values, zero-padded where the lengths are not equal.
My problem is that when I try to group by Id, the number of groups in my groupby("Id") object does not match the number of nunique("Id") values, which is counterintuitive. For every example I tried with smaller DataFrames the numbers match. Any suggestions why?
import pandas as pd
import numpy as np
# data example (real df is shape (188225, 2))
hu = pd.DataFrame({'Id': ['1','12','123','1234','12345'],
                   'Serial': ['A','AB','ABC','ABC','ABC']},
                  dtype='category')
max_len = hu.groupby('Id')['Serial'].size().max()  # Find the max length
grouped = hu.groupby('Id')
from io import StringIO
from csv import writer
output = StringIO()
csv_writer = writer(output)
for key, vals in grouped.groups.items():
    # Write [key, serial_1, ..., serial_n, 0, ..., 0], zero-padded out to max_len
    csv_writer.writerow(np.append(np.append(key, vals.values), np.array([0] * (max_len - len(vals)))))
output.seek(0) #goes to the start of the IO file
dfdiscrete = pd.read_csv(output,
header=None,
index_col=0,
dtype=str)
print("\Discrete Serials:", len(grouped.groups), "nunique ids", hu['Id'].nunique())
I expect these two to be:
Shape discrete devices: (29840, 50) nunique citizen ids 29840,
but the actual output is
Shape discrete devices: (56674, 50) nunique citizen ids 29840
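A likely explanation (my addition; no answer is shown in the thread above): with dtype='category', groupby keeps every category, including ones that no longer occur in the data, so ngroups can exceed nunique(). In older pandas observed=False was the default; passing observed=True restricts the groups to observed categories:
hu_small = hu[hu['Id'] != '1']                        # category '1' is now unobserved
print(hu_small.groupby('Id').ngroups)                 # 5 -- includes the stale category
print(hu_small['Id'].nunique())                       # 4
print(hu_small.groupby('Id', observed=True).ngroups)  # 4 -- matches nunique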

Count occurrences of a number in a specific column in Python

I am trying to do the equivalent of a COUNTIF() function in Excel. I am stuck on how to tell the .count() function to read from a specific column.
I have
df = pd.read_csv('testdata.csv')
df.count('1')
but this does not work, and even if it did it is not specific enough.
I am thinking I may have to use read_csv to read specific columns individually.
Example:
Column name
4
4
3
2
4
1
The function would output that there is one '1', and I could run it again to find that there are three '4' answers, etc.
I got it to work! Thank you
I used:
print(df.col.value_counts().loc['x'])
Here is an example of a simple 'countif' recipe you could try:
import pandas as pd
def countif(rng, criteria):
    # Element-wise equality gives booleans; summing them counts the matches
    return rng.eq(criteria).sum()
Example use
df = pd.DataFrame({'column1': [4,4,3,2,4,1],
                   'column2': [1,2,3,4,5,6]})
countif(df['column1'], 1)  # -> 1
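The same idea extends to a whole DataFrame at once (my addition): eq plus sum counts matches per column.
print(df.eq(4).sum())  # column1: 3, column2: 1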
If all else fails, why not try something like this?
import numpy as np
import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])
counters = {}
for i in range(len(df)):
    if df.iloc[i]["col1"] in counters:
        counters[df.iloc[i]["col1"]] += 1
    else:
        counters[df.iloc[i]["col1"]] = 1
print(counters)
plt.bar(counters.keys(), counters.values())
plt.show()
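For what it's worth (my addition), value_counts collapses that loop into one call and feeds the same bar chart:
counts = df["col1"].value_counts().sort_index()
plt.bar(counts.index, counts.values)
plt.show()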

KeyError: 'column_name'

I am writing Python code that should read the values of columns, but I am getting a KeyError: 'column_name' error. Can anyone please tell me how to fix this issue?
import numpy as np
from sklearn.cluster import KMeans
import pandas as pd
### For the purposes of this example, we store feature data from our
### dataframe `df`, in the `f1` and `f2` arrays. We combine this into
### a feature matrix `X` before entering it into the algorithm.
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
print (df)
#df = pd.read_csv(csv_file)
"""
saved_column = df.Distance_Feature
saved_column = df.Speeding_Feature
print(saved_column)
"""
f1 = df['Distance_Feature'].tolist()
f2 = df['Speeding_Feature'].tolist()
print(f1)
print(f2)
X = np.array(list(zip(f1, f2)))  # np.matrix(zip(...)) fails on Python 3; the zip must be materialized
print(X)
kmeans = KMeans(n_clusters=2).fit(X)
Can anyone please help me.
Assuming 'C:\Users\Desktop\data.csv' contains the following data:
Distance_Feature Speeding_Feature
1 2
3 4
5 6
...
Change
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
to
df = pd.read_csv("data.txt",names=["Distance_Feature","Speeding_Feature"],sep= "\s+|\t+|\s+\t+|\t+\s+",header=1)
# Here it is assumed white space separator, if another separator is used change `sep`.
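A quick sanity check (my addition): print the parsed column names. A wrong separator typically leaves everything fused into a single column, which is exactly what raises the KeyError later.
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
print(df.columns.tolist())  # e.g. ['Distance_Feature Speeding_Feature'] when the separator is wrong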
