I have to clean data in a CSV file. The data I am trying to clean is below.
Condition: I have to add #myclinic.com.au at the end of every string where it is missing.
douglas#myclinic.com.au
mildura
broadford#myclinic.com.au
officer#myclinic.com.au
nowa nowa#myclinic.com.au
langsborough#myclinic.com.au
brisbane#myclinic.com.au
robertson#myclinic.com.au
logan village
ipswich#myclinic.com.au
The code for this is:
DataFrame = pandas.read_csv(ClinicCSVFile)
DataFrame['Email'] = DataFrame['Email'].apply(lambda x: x if '#' in str(x) else str(x)+'#myclinic.com.au')
DataFrameToCSV = DataFrame.to_csv('Temporary.csv', index = False)
print(DataFrameToCSV)
But the output that I am getting is None, and I cannot work on the later part of the problem because it generates the error below,
TypeError: 'NoneType' object is not iterable
which originates from the DataFrame code above.
Please help me with this.
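A likely cause, as a hedged aside: DataFrame.to_csv returns None when a file path is passed, so assigning its result and printing it shows None. A minimal sketch of the write step, keeping the question's names:
DataFrame.to_csv('Temporary.csv', index=False)  # writes the file; the call itself returns None
print(DataFrame)  # print the frame itself, not to_csv's return value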
Use endswith for the condition, invert it with ~, and append the string to the end:
df.loc[~df['Email'].str.endswith('#myclinic.com.au'), 'Email'] += '#myclinic.com.au'
# if you only need to check for '#':
#df.loc[~df['Email'].str.contains('#'), 'Email'] += '#myclinic.com.au'
print (df)
Email
0 douglas#myclinic.com.au
1 mildura#myclinic.com.au
2 broadford#myclinic.com.au
3 officer#myclinic.com.au
4 nowa nowa#myclinic.com.au
5 langsborough#myclinic.com.au
6 brisbane#myclinic.com.au
7 robertson#myclinic.com.au
8 logan village#myclinic.com.au
9 ipswich#myclinic.com.au
For me it works fine:
df = pd.DataFrame({'Email': ['douglas#myclinic.com.au', 'mildura', 'broadford#myclinic.com.au', 'officer#myclinic.com.au', 'nowa nowa#myclinic.com.au', 'langsborough#myclinic.com.au', 'brisbane#myclinic.com.au', 'robertson#myclinic.com.au', 'logan village', 'ipswich#myclinic.com.au']})
df.loc[~df['Email'].str.contains('#'), 'Email'] += '#myclinic.com.au'
print (df)
Email
0 douglas#myclinic.com.au
1 mildura#myclinic.com.au
2 broadford#myclinic.com.au
3 officer#myclinic.com.au
4 nowa nowa#myclinic.com.au
5 langsborough#myclinic.com.au
6 brisbane#myclinic.com.au
7 robertson#myclinic.com.au
8 logan village#myclinic.com.au
9 ipswich#myclinic.com.au
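One caveat, as a hedged aside: if the Email column can contain missing values, str.endswith returns NaN for them and the boolean mask fails. Passing na=True treats missing entries as already complete and leaves them untouched:
mask = df['Email'].str.endswith('#myclinic.com.au', na=True)  # NaN rows count as complete
df.loc[~mask, 'Email'] += '#myclinic.com.au'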
Using apply and endswith
Example:
import pandas as pd
df = pd.read_csv(filename, names=["Email"])
print(df["Email"].apply(lambda x: x if x.endswith("#myclinic.com.au") else x+"#myclinic.com.au"))
Output:
0 douglas#myclinic.com.au
1 mildura#myclinic.com.au
2 broadford#myclinic.com.au
3 officer#myclinic.com.au
4 nowa nowa#myclinic.com.au
5 langsborough#myclinic.com.au
6 brisbane#myclinic.com.au
7 robertson#myclinic.com.au
8 logan village#myclinic.com.au
9 ipswich#myclinic.com.au
Name: Email, dtype: object
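To persist the change, assign the result back and write it out; a short sketch reusing the names above:
df["Email"] = df["Email"].apply(lambda x: x if x.endswith("#myclinic.com.au") else x + "#myclinic.com.au")
df.to_csv("Temporary.csv", index=False)  # save the cleaned data; to_csv returns None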
I've scraped the crypto.com website to get the current prices of crypto coins in DataFrame form. It worked perfectly with pandas, but the 'Price' values are mixed.
Here's the output:
Name Price 24H CHANGE
0 BBitcoinBTC 16.678,36$16.678,36+0,32% +0,32%
1 EEthereumETH $1.230,40$1.230,40+0,52% +0,52%
2 UTetherUSDT $1,02$1,02-0,01% -0,01%
3 BBNBBNB $315,46$315,46-0,64% -0,64%
4 UUSD CoinUSDC $1,00$1,00+0,00% +0,00%
5 BBinance USDBUSD $1,00$1,00+0,00% +0,00%
6 XXRPXRP $0,4067$0,4067-0,13% -0,13%
7 DDogecoinDOGE $0,1052$0,1052+13,73% +13,73%
8 ACardanoADA $0,3232$0,3232+0,98% +0,98%
9 MPolygonMATIC $0,8727$0,8727+1,20% +1,20%
10 DPolkadotDOT $5,48$5,48+0,79% +0,79%
I created a regex to filter the mixed data:
import re
pattern = re.compile(r'(\$.*)(\$)')
for value in df['Price']:
    value = pattern.search(value)
    print(value.group(1))
output:
$16.684,53
$1.230,25
$1,02
$315,56
$1,00
$1,00
$0,4078
$0,105
$0,3236
$0,8733
But I couldn't find a way to change the values in the DataFrame. What is the best way to do it? Thanks.
If your regex is good, this would work:
df['Price'] = df['Price'].apply(lambda x: pattern.search(x).group(1))
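As a vectorized alternative (a sketch, assuming every Price value contains the amount twice, delimited by '$', as in the sample), Series.str.extract applies the same pattern without a Python-level loop:
# capture from the first '$' up to, but not including, the last '$'
df['Price'] = df['Price'].str.extract(r'(\$.*)\$', expand=False)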
Can you try this:
df['price_v2']=df['Price'].apply(lambda x: '$' + x.split('$')[1])
'''
0 $16.678,36+0,32%
1 $1.230,40
2 $1,02
3 $315,46
4 $1,00
5 $1,00
6 $0,4067
7 $0,1052
8 $0,3232
9 $0,8727
10 $5,48
Name: price_v2, dtype: object
'''
Also, the BTC row looks different from the others. Is this a typo you made, or is this the response from the API? If there are pairs that look like BTC's, we can add an if/else block to the code:
df['price']=df['Price'].apply(lambda x: '$' + x.split('$')[1] if x.startswith('$') else '$' + x.split('$')[0])
'''
0 $16.678,36
1 $1.230,40
2 $1,02
3 $315,46
4 $1,00
5 $1,00
6 $0,4067
7 $0,1052
8 $0,3232
9 $0,8727
10 $5,48
'''
Detail:
string = '$1,02$1,02-0,01%'
values = string.split('$')  # output --> ['', '1,02', '1,02-0,01%']
final_value = values[1]  # we only need the price, so take the second element and apply this across the DataFrame
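For what it's worth, the same split can be written without apply via the pandas string accessor, under the same assumption that the price is the second '$'-separated piece:
# vectorized equivalent: take the piece after the first '$' and restore the '$' sign
df['price_v2'] = '$' + df['Price'].str.split('$').str[1]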
Considering two dataframes as follows:
import pandas as pd
df_rp = pd.DataFrame({'id':[1,2,3,4,5,6,7,8], 'res': ['a','b','c','d','e','f','g','h']})
df_cdr = pd.DataFrame({'id':[1,2,5,6,7,1,2,3,8,9,3,4,8],
'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89, -22.84, -22.98,
-22.14, -22.28, -22.42, -22.56, -22.70, -22.13],
'LONGITUDE':[-43.19, -43.39, -43.24, -43.28, -43.67, -43.11, -43.22,
-43.33, -43.44, -43.55, -43.66, -43.77, -43.88]})
What I have to do:
Compare each df_rp['id'] element with each df_cdr['id'] element;
If they are the same, I need to add to a data structure (list, Series, etc.) the latitudes and longitudes that are on the same row as the id, without repeating the id.
Below is an example of how I need the data to be grouped:
1:[-22.98,-43.19],[-22.84,-43.11]
2:[-22.97,-43.39],[-22.98,-43.22]
3:[-22.14,-43.33],[-22.56,-43.66]
4:[-22.70,-43.77]
5:[-22.92,-43.24]
6:[-22.87,-43.28]
7:[-22.89,-43.67]
8:[-22.28,-43.44],[-22.13,-43.88]
I'm having a hard time choosing which data structure is best for the situation (what I did in the example looks like a dictionary, but there would be several dictionaries) and how to add latitude and longitude pairs without repeating the id. I appreciate any help.
We need to aggregate the second DataFrame, then reindex and assign it back:
df_rp['L$L'] = df_cdr.drop(columns='id').apply(tuple, axis=1).groupby(df_cdr.id).agg(list).reindex(df_rp.id).to_numpy()
df_rp
Out[59]:
id res L$L
0 1 a [(-22.98, -43.19), (-22.84, -43.11)]
1 2 b [(-22.97, -43.39), (-22.98, -43.22)]
2 3 c [(-22.14, -43.33), (-22.56, -43.66)]
3 4 d [(-22.7, -43.77)]
4 5 e [(-22.92, -43.24)]
5 6 f [(-22.87, -43.28)]
6 7 g [(-22.89, -43.67)]
7 8 h [(-22.28, -43.44), (-22.13, -43.88)]
df_cdr['lat_long'] = df_cdr.apply(lambda x: [x['LATITUDE'], x['LONGITUDE']], axis=1)
df_cdr = df_cdr.drop(columns=['LATITUDE', 'LONGITUDE'])
df_cdr = df_cdr.groupby('id').agg(lambda x: x.tolist())
Output
lat_long
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
9 [[-22.42, -43.55]]
Assume df_rp.id is unique and sorted as in your sample. I came up with a solution using set_index and loc to filter out the ids that are in df_cdr but not in df_rp. Next, call groupby with a lambda that returns arrays:
s = (df_cdr.set_index('id').loc[df_rp.id]
     .groupby(level=0)
     .apply(lambda x: x.to_numpy()))
Out[709]:
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
dtype: object
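If a single dict keyed by id (matching the layout in the question) is preferred over a DataFrame column, a minimal sketch building one from the frames above:
# keep only ids that also appear in df_rp, then collect [lat, lon] pairs per id
pairs = (df_cdr[df_cdr['id'].isin(df_rp['id'])]
         .groupby('id')[['LATITUDE', 'LONGITUDE']]
         .apply(lambda g: g.to_numpy().tolist())
         .to_dict())
# pairs[1] -> [[-22.98, -43.19], [-22.84, -43.11]]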
As the title says, I want to know how to set every n-th value in a Python list to None. I looked for a solution in a lot of forums but didn't find much.
I also don't want to overwrite existing values with None; instead I want to insert new entries with the value None.
The list contains dates (12 dates = 1 year), and every 13th value should be empty because that row will hold the average, so I don't need a date there.
Here is how I generated the dates with pandas:
import pandas as pd
numdays = 370  # I have 370 values, one per month, starting from 1990 until June 2019
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2,).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]
The expected Output:
01.01.1990
01.02.1990
01.03.1990
01.04.1990
01.05.1990
01.06.1990
01.07.1990
01.08.1990
01.09.1990
01.10.1990
01.11.1990
01.12.1990
None
01.01.1991
.
.
.
If I understood correctly:
import pandas as pd
numdays = 370
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2,).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]
for i in range(12, len(mydates), 13):  # add this; the step is 13 because each insert shifts later items
    mydates.insert(i, None)
I saw some of the answers above, but there's a way of doing this without looping over the complete list:
date_lst[12::12] = [None] * len(date_lst[12::12])
The first 12 in [12::12] means the first item to change is the one at index 12. The second 12 means that from then on every 12th item is changed. (Note that this overwrites existing items in place rather than inserting new ones.)
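A tiny self-contained illustration of that slice assignment, using plain integers:
lst = list(range(1, 26))                 # 25 dummy items
lst[12::12] = [None] * len(lst[12::12])  # overwrites the items at indexes 12 and 24
print(lst)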
You can add a step in iloc and set values this way.
Let's generate some dummy data:
df = pd.DataFrame({'Vals': pd.date_range('01-01-19', '02-02-19', freq='D')})
print(df)
print(df)
Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 2019-01-06
6 2019-01-07
7 2019-01-08
Now you can choose your step:
step = 5
new_df = df.iloc[step::step]
print(new_df)
Vals
5 2019-01-06
10 2019-01-11
15 2019-01-16
20 2019-01-21
25 2019-01-26
30 2019-01-31
Now, if you want to write a value into a specific column:
df.iloc[step::step, df.columns.get_loc('Vals')] = pd.NaT  # positional write avoids chained-assignment issues
print(df)
Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 NaT
Here is an example of setting None at every 3rd position; you can make it every 13th position by changing the condition to ((index + 1) % 13 == 0):
data = [1,2,3,4,5,6,7,8,9]
data = [None if ((index+1)%3 == 0) else d for index, d in enumerate(data)]
print(data)
output:
[1, 2, None, 4, 5, None, 7, 8, None]
Applied to your code, try this:
date_lst = list(date_all)
dateWithNone = [None if ((index+1)%13 == 0) else d for index, d in enumerate(date_lst)]
print(dateWithNone)
I am working on the DataFrame below but am unable to apply a filter on the percentage field, although it works in normal Excel.
I need to apply the filter condition > 100.00% to that particular field using pandas.
I tried reading the data from HTML, CSV, and Excel in pandas but was unable to apply the condition.
It requires a float conversion, but that is not working with the given data.
I am assuming that the values you have are read as strings in Pandas:
data = ['4,700.00%', '3,900.00%', '1,500.00%', '1,400.00%', '1,200.00%', '0.15%', '0.13%', '0.12%', '0.10%', '0.08%', '0.07%']
df = pd.DataFrame(data)
df.columns = ['data']
Printing the df:
data
0 4,700.00%
1 3,900.00%
2 1,500.00%
3 1,400.00%
4 1,200.00%
5 0.15%
6 0.13%
7 0.12%
8 0.10%
9 0.08%
10 0.07%
Then:
df['data'] = df['data'].str.rstrip('%').str.replace(',','').astype('float')
df_filtered = df[df['data'] > 100]
Results:
data
0 4700.0
1 3900.0
2 1500.0
3 1400.0
4 1200.0
I have used this code as well: .str.rstrip('%') and .str.replace(',', '').astype('float'). It is working fine.
I have two columns in a dataset:
1) Supplier_code
2) Item_code
I have grouped them using:
data.groupby(['supplier_code', 'item_code']).size()
I get a result like this:
supplier_code item_code
591495 127018419 9
547173046 1
3024466 498370473 1
737511044 1
941755892 1
6155238 875189969 1
13672569 53152664 1
430351453 1
573603000 1
634275342 1
18510135 362522958 6
405196476 6
441901484 12
29222428 979575973 1
31381089 28119319 2
468441742 3
648079349 18
941387936 1
I got my top 15 suppliers using:
supCounter = collections.Counter(datalist[3])
supDic = dict(sorted(supCounter.iteritems(), key=operator.itemgetter(1), reverse=True)[:15])
print supDic.keys()
This is my list of top 15 suppliers:
[723223131, 687164888, 594473706, 332379250, 203288669, 604236177,
533512754, 503134099, 982883317, 147405879, 151212120, 737780569, 561901243,
786265866, 79886783]
Now I want to join the two, i.e. group by and get only the top 15 suppliers and their item counts.
Kindly help me figure this out.
IIUC, you can groupby supplier_code and then sum and sort_values. Take the top 15 and you're done.
For example, with:
gb_size = data.groupby(['supplier_code', 'item_code']).size()
Then:
N = 3 # change to 15 for actual data
gb_size.groupby("supplier_code").sum().sort_values(ascending=False).head(N)
Output:
supplier_code
31381089 24
18510135 24
591495 10
dtype: int64
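To also keep the per-item counts for just those suppliers, a sketch filtering the original groupby result by the top index (nlargest(N) is equivalent to sort_values(ascending=False).head(N)):
top = gb_size.groupby("supplier_code").sum().nlargest(N)  # top-N suppliers by total size
print(gb_size.loc[top.index])  # their per-item counts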