I want to convert my dataset's columns to a numeric type, but I can't because some values contain two dots. I am using df.apply(pd.to_numeric). The error I get is as follows:
ValueError: Unable to parse string "1.232.2" at position 1
My dataset looks like this:
Price Value
1.232.2 1.235.3
2.345.2 1.234.2
3.343.5 5.433.3
I need to remove the first dot, for example:
Price Value
1232.2 1235.3
2345.2 1234.2
3343.5 5433.3
I would appreciate any help. Thank you.
Here's a way to do this.
Convert string to float format (multiple dots to single dot)
You can solve this with a regex.
Regex: r'\.(?=.*\.)'
Explanation:
\. --> matches a literal dot
(?=.*\.) --> a lookahead requiring another dot later in the string, so every dot except the last one matches
Each matched dot is replaced with '' (i.e. removed).
The code for this is:
df['Price'] = df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True)
df['Value'] = df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True)
If you also want to convert to numeric, you can do it directly:
df['Price'] = pd.to_numeric(df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True))
df['Value'] = pd.to_numeric(df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True))
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 1232.2 1235.3
1 2345.2 1234.2
2 3343.5 5433.3
3 123.45 45625.5
4 0.825 00.0
5 000.2 55.5
6 1234 4567
7 NaN NaN
The pd.to_numeric() version of the solution will look like this:
After Cleansing DataFrame:
Note: pandas displays every value in the Price column with 3 decimal places because one of them (0.825) has 3 decimal places; the underlying floats are unchanged.
Price Value
0 1232.200 1235.3
1 2345.200 1234.2
2 3343.500 5433.3
3 123.450 45625.5
4 0.825 0.0
5 0.200 55.5
6 1234.000 4567.0
7 NaN NaN
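If the padded decimals are only a display nuisance, pandas' display.float_format option can be adjusted; a minimal sketch (the '{:g}' format string is just an example choice, not something required by the solution):
import pandas as pd
# Purely a display setting: show floats in a general format instead of fixed decimal padding.
pd.set_option('display.float_format', '{:g}'.format)
print(df)   # 0.825 still prints as 0.825, 1234.000 prints as 1234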
Discard data if more than one period (.) in data
If you want to process all the columns in the dataframe, you can use applymap(); if you want to process a specific column, use apply(). Also use pd.isnull() to check for NaN so those values are skipped.
The code below handles NaN, numbers without a decimal point, numbers with one period, and numbers with multiple periods. It assumes the columns contain only NaNs or strings made up of digits and periods (no letters or other non-digit characters apart from dots). If you need the code to validate digits as well, let me know.
The code also assumes that you want to discard the leading numbers: for example, 1.2345.67 becomes 2345.67 (the leading 1 is discarded), and 1.2.3.4.5 becomes 4.5 (discarding 1.2.3). If you would rather concatenate the digits instead, a different solution is needed (a sketch of that variant appears at the end of this answer).
You can do the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Price': ['1.232.2', '2.345.2', '3.343.5', '123.45', '0.825', '0.0.0.2', '1234', np.NaN],
                   'Value': ['1.235.3', '1.234.2', '5.433.3', '456.25.5', '0.0.0', '5.5.5', '4567', np.NaN]})
print(df)

def remove_dots(x):
    # Keep only the last two dot-separated pieces, e.g. '1.232.2' -> '232.2'.
    return x if pd.isnull(x) else '.'.join(x.rsplit('.', 2)[-2:])

df = df.applymap(remove_dots)
print(df)
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 232.2 235.3
1 345.2 234.2
2 343.5 433.3
3 123.45 25.5
4 0.825 0.0
5 0.2 5.5
6 1234 4567
7 NaN NaN
If you want to change specific columns only, then you can use apply.
df['Price'] = df['Price'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
df['Value'] = df['Value'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
print(df)
The before and after output will be the same as above:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 232.2 235.3
1 345.2 234.2
2 343.5 433.3
3 123.45 25.5
4 0.825 0.0
5 0.2 5.5
6 1234 4567
7 NaN NaN
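If you would rather concatenate the digits than discard the leading ones (the variant mentioned above, which matches what the regex approach in the first part of this answer does), here is a minimal sketch under the same assumptions about the data:
def keep_all_digits(x):
    # Keep every digit: remove all dots except the last one,
    # e.g. '1.2345.67' -> '12345.67' and '0.0.0.2' -> '000.2'.
    if pd.isnull(x) or '.' not in x:
        return x
    head, tail = x.rsplit('.', 1)
    return head.replace('.', '') + '.' + tail

df = df.applymap(keep_all_digits)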
Consider two dataframes as follows:
import pandas as pd
df_rp = pd.DataFrame({'id':[1,2,3,4,5,6,7,8], 'res': ['a','b','c','d','e','f','g','h']})
df_cdr = pd.DataFrame({'id':[1,2,5,6,7,1,2,3,8,9,3,4,8],
'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89, -22.84, -22.98,
-22.14, -22.28, -22.42, -22.56, -22.70, -22.13],
'LONGITUDE':[-43.19, -43.39, -43.24, -43.28, -43.67, -43.11, -43.22,
-43.33, -43.44, -43.55, -43.66, -43.77, -43.88]})
What I have to do:
Compare each df_rp['id'] element with each df_cdr['id'] element;
If they match, I need to collect, in some data structure (list, Series, etc.), the latitudes and longitudes that are on the same rows as that id, without repeating the id.
Below is an example of how I need the data to be grouped:
1:[-22.98,-43.19],[-22.84,-43.11]
2:[-22.97,-43.39],[-22.98,-43.22]
3:[-22.14,-43.33],[-22.56,-43.66]
4:[-22.70,-43.77]
5:[-22.92,-43.24]
6:[-22.87,-43.28]
7:[-22.89,-43.67]
8:[-22.28,-43.44],[-22.13,-43.88]
I'm having a hard time choosing which data structure is best for this situation (the example above looks like a dictionary, but there would be several of them) and figuring out how to add the latitude and longitude pairs without repeating the id. I appreciate any help.
We need to aggregate the second dataframe, then reindex and assign it back:
df_rp['L$L'] = df_cdr.drop(columns='id').apply(tuple, axis=1).groupby(df_cdr.id).agg(list).reindex(df_rp.id).to_numpy()
df_rp
Out[59]:
id res L$L
0 1 a [(-22.98, -43.19), (-22.84, -43.11)]
1 2 b [(-22.97, -43.39), (-22.98, -43.22)]
2 3 c [(-22.14, -43.33), (-22.56, -43.66)]
3 4 d [(-22.7, -43.77)]
4 5 e [(-22.92, -43.24)]
5 6 f [(-22.87, -43.28)]
6 7 g [(-22.89, -43.67)]
7 8 h [(-22.28, -43.44), (-22.13, -43.88)]
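The same chain, broken into steps for readability (a sketch using the df_rp/df_cdr frames from the question):
coords = df_cdr.drop(columns='id').apply(tuple, axis=1)   # one (LATITUDE, LONGITUDE) tuple per row
per_id = coords.groupby(df_cdr['id']).agg(list)           # list of coordinate tuples for each id
df_rp['L$L'] = per_id.reindex(df_rp['id']).to_numpy()     # align to df_rp's ids; ids only in df_cdr (e.g. 9) are dropped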
Another option is to build a lat_long column first and then group it by id:
df_cdr['lat_long'] = df_cdr.apply(lambda x: [x['LATITUDE'], x['LONGITUDE']], axis=1)
df_cdr = df_cdr.drop(columns=['LATITUDE', 'LONGITUDE'])
df_cdr = df_cdr.groupby('id').agg(lambda x: x.tolist())
Output
lat_long
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
9 [[-22.42, -43.55]]
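If what you ultimately want is the id -> list of [lat, lon] mapping shown in the question, the grouped column above converts directly to a dictionary (a small follow-up sketch; id 9 appears because it exists in df_cdr but not in df_rp):
pairs = df_cdr['lat_long'].to_dict()
print(pairs[1])   # [[-22.98, -43.19], [-22.84, -43.11]]
# Keep only the ids that actually appear in df_rp, if desired:
pairs = {k: v for k, v in pairs.items() if k in set(df_rp['id'])}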
Assuming df_rp.id is unique and sorted as in your sample, here is a solution that uses set_index and loc to drop the ids that are in df_cdr but not in df_rp, then calls groupby with a lambda that returns arrays:
s = (df_cdr.set_index('id').loc[df_rp.id]
       .groupby(level=0)
       .apply(lambda x: x.to_numpy()))
Out[709]:
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
dtype: object
I have to clean data from a CSV file. The data I am trying to clean is below.
Condition: I have to add #myclinic.com.au at the end of every string where it is missing.
douglas#myclinic.com.au
mildura
broadford#myclinic.com.au
officer#myclinic.com.au
nowa nowa#myclinic.com.au
langsborough#myclinic.com.au
brisbane#myclinic.com.au
robertson#myclinic.com.au
logan village
ipswich#myclinic.com.au
The code for this is
DataFrame = pandas.read_csv(ClinicCSVFile)
DataFrame['Email'] = DataFrame['Email'].apply(lambda x: x if '#' in str(x) else str(x)+'#myclinic.com.au')
DataFrameToCSV = DataFrame.to_csv('Temporary.csv', index = False)
print(DataFrameToCSV)
But the output I am getting is None, and I could not work on the later part of the problem because it generates the error below,
TypeError: 'NoneType' object is not iterable
which originates from the data frame code above.
Please help me with this.
Use endswith for the condition, invert it with ~, and append the string to the end:
df.loc[~df['Email'].str.endswith('#myclinic.com.au'), 'Email'] += '#myclinic.com.au'
#if need check only #
#df.loc[~df['Email'].str.contains('#'), 'Email'] += '#myclinic.com.au'
print (df)
Email
0 douglas#myclinic.com.au
1 mildura#myclinic.com.au
2 broadford#myclinic.com.au
3 officer#myclinic.com.au
4 nowa nowa#myclinic.com.au
5 langsborough#myclinic.com.au
6 brisbane#myclinic.com.au
7 robertson#myclinic.com.au
8 logan village#myclinic.com.au
9 ipswich#myclinic.com.au
For me it works fine:
df = pd.DataFrame({'Email': ['douglas#myclinic.com.au', 'mildura', 'broadford#myclinic.com.au',
                             'officer#myclinic.com.au', 'nowa nowa#myclinic.com.au',
                             'langsborough#myclinic.com.au', 'brisbane#myclinic.com.au',
                             'robertson#myclinic.com.au', 'logan village', 'ipswich#myclinic.com.au']})
df.loc[~df['Email'].str.contains('#'), 'Email'] += '#myclinic.com.au'
print (df)
Email
0 douglas#myclinic.com.au
1 mildura#myclinic.com.au
2 broadford#myclinic.com.au
3 officer#myclinic.com.au
4 nowa nowa#myclinic.com.au
5 langsborough#myclinic.com.au
6 brisbane#myclinic.com.au
7 robertson#myclinic.com.au
8 logan village#myclinic.com.au
9 ipswich#myclinic.com.au
Using apply and endswith
Ex:
import pandas as pd
df = pd.read_csv(filename, names=["Email"])
print(df["Email"].apply(lambda x: x if x.endswith("#myclinic.com.au") else x+"#myclinic.com.au"))
Output:
0 douglas#myclinic.com.au
1 mildura#myclinic.com.au
2 broadford#myclinic.com.au
3 officer#myclinic.com.au
4 nowa nowa#myclinic.com.au
5 langsborough#myclinic.com.au
6 brisbane#myclinic.com.au
7 robertson#myclinic.com.au
8 logan village#myclinic.com.au
9 ipswich#myclinic.com.au
Name: Email, dtype: object
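Regarding the TypeError in the question: DataFrame.to_csv returns None when it is given a file path, so storing that return value and trying to print or iterate over it is what fails. A minimal sketch of the fix, keeping the question's variable names:
DataFrame['Email'] = DataFrame['Email'].apply(lambda x: x if '#' in str(x) else str(x) + '#myclinic.com.au')
DataFrame.to_csv('Temporary.csv', index=False)   # writes the file; the return value is None
print(DataFrame)                                 # inspect the DataFrame itself instead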
I have two columns in my dataset:
1) Supplier_code
2) Item_code
I have grouped them using:
data.groupby(['supplier_code', 'item_code']).size()
I get a result like this:
supplier_code item_code
591495 127018419 9
547173046 1
3024466 498370473 1
737511044 1
941755892 1
6155238 875189969 1
13672569 53152664 1
430351453 1
573603000 1
634275342 1
18510135 362522958 6
405196476 6
441901484 12
29222428 979575973 1
31381089 28119319 2
468441742 3
648079349 18
941387936 1
I have my top 15 suppliers using:
supCounter = collections.Counter(datalist[3])
supDic = dict(sorted(supCounter.items(), key=operator.itemgetter(1), reverse=True)[:15])
print(supDic.keys())
This is my list of top 15 suppliers:
[723223131, 687164888, 594473706, 332379250, 203288669, 604236177,
533512754, 503134099, 982883317, 147405879, 151212120, 737780569, 561901243,
786265866, 79886783]
Now I want to join the two, i.e. group by and get only the top 15 suppliers and their item counts.
Kindly help me in figuring this out.
IIUC, you can groupby supplier_code and then sum and sort_values. Take the top 15 and you're done.
For example, with:
gb_size = data.groupby(['supplier_code', 'item_code']).size()
Then:
N = 3 # change to 15 for actual data
gb_size.groupby("supplier_code").sum().sort_values(ascending=False).head(N)
Output:
supplier_code
31381089 24
18510135 24
591495 10
dtype: int64
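If you also need the per-item counts restricted to those top suppliers (as the question asks), the index of the result above can be used to filter gb_size; a sketch:
top = gb_size.groupby("supplier_code").sum().sort_values(ascending=False).head(N)
gb_size[gb_size.index.get_level_values("supplier_code").isin(top.index)]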
I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the percent change of the 'questions' column, and then to sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...
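For example, to measure the change over two rows instead of one (a small illustration of the periods argument):
status['change_2'] = status['questions'].pct_change(periods=2)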