I need help grouping data - python

I have got this far. When I run it, it turns the string numbers into floats and even gets rid of the timestamps for me. How do I then take these numbers and put them into new groups? My goal is for each line to become its own list, containing only the numbers between the timestamp and the last 0 of that line.
import pandas as pd
import numpy as np
ver = '''
2018.12.0400:00,0.73572,0.73614,0.73544,0.73550,520,0
2018.12.0401:00,0.73550,0.73594,0.73545,0.73553,1181,0
2018.12.0402:00,0.73553,0.73606,0.73510,0.73539,1960,0
2018.12.0403:00,0.73539,0.73621,0.73481,0.73608,2898,0
'''
number = ver.split(',')
for num in number:
    try:
        new = float(num)
        print(new)
    except ValueError:  # anything that isn't a plain number
        print('this one messed up')

You could split the data by line first, then split by ',' again:
ver = '''2018.12.0400:00,0.73572,0.73614,0.73544,0.73550,520,0
2018.12.0401:00,0.73550,0.73594,0.73545,0.73553,1181,0
2018.12.0402:00,0.73553,0.73606,0.73510,0.73539,1960,0
2018.12.0403:00,0.73539,0.73621,0.73481,0.73608,2898,0'''
ver = [i.split(',') for i in ver.split('\n')]
df = pd.DataFrame(ver)
df
output:
0 1 2 3 4 5 6
0 2018.12.0400:00 0.73572 0.73614 0.73544 0.73550 520 0
1 2018.12.0401:00 0.73550 0.73594 0.73545 0.73553 1181 0
2 2018.12.0402:00 0.73553 0.73606 0.73510 0.73539 1960 0
3 2018.12.0403:00 0.73539 0.73621 0.73481 0.73608 2898 0
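Building on that split, here is a minimal sketch of the grouping step itself. It assumes every line has the same shape as the sample (timestamp first, a trailing 0 last), so the wanted numbers can be taken by position:

```python
ver = '''2018.12.0400:00,0.73572,0.73614,0.73544,0.73550,520,0
2018.12.0401:00,0.73550,0.73594,0.73545,0.73553,1181,0
2018.12.0402:00,0.73553,0.73606,0.73510,0.73539,1960,0
2018.12.0403:00,0.73539,0.73621,0.73481,0.73608,2898,0'''

rows = [line.split(',') for line in ver.split('\n')]
# keep only the values between the timestamp (row[0]) and the trailing 0 (row[-1])
groups = [[float(x) for x in row[1:-1]] for row in rows]
print(groups[0])  # [0.73572, 0.73614, 0.73544, 0.7355, 520.0]
```

Each line is now its own list of floats, which is what the question asked for; the same slicing also works on the DataFrame with `df.iloc[:, 1:-1]`.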

Use previous value of Pandas series with mix of strings and ints

I'm trying to overwrite the times with the date of that day. This list is ~100 rows long, below is a sample:
Date
0 May-21-20 #Gets passed
1 02:51PM #(should read May-21-20)
2 01:59PM #(should read May-21-20)
3 01:29PM #etc
4 12:45PM #etc
5 12:42PM
6 11:55AM
7 10:02AM
8 09:37AM #(should read May-21-20)
9 May-20-20 #gets passed
10 02:47PM #(should read May-20-20)
11 02:30PM #(should read May-20-20)
12 02:29PM #(should read May-20-20)
13 02:01PM #(should read May-20-20)
Here's where I'm currently at with my code:
for i in headline_table['Date']:
    date_list = headline_table['Date'].tolist()  # make the pd Series a list
    index_value = date_list.index(i)  # now a list, so I can reference the index value
    previous = index_value - 1  # index of current minus one = previous value
    if re.search(r'^[A-Z]', i):
        pass
    else:
        headline_table['Date'][i] = headline_table.loc[previous, 'Date']
I've tried a bunch of different ways to go about this but can't seem to figure it out. I do not get any errors with the code, but the times do not get overwritten with the date, instead it seems nothing happens.
We can do where with ffill:
df['Date1']=df.Date.where(df.Date.str.contains('-')).ffill()
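A quick sketch of what that one-liner does, using a hypothetical df built from a few of the sample rows:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['May-21-20', '02:51PM', '01:59PM',
                            'May-20-20', '02:47PM', '02:30PM']})
# where() keeps values containing '-' (the dates) and turns the times
# into NaN; ffill() then copies the last seen date forward over them
df['Date1'] = df.Date.where(df.Date.str.contains('-')).ffill()
print(df['Date1'].tolist())
# ['May-21-20', 'May-21-20', 'May-21-20', 'May-20-20', 'May-20-20', 'May-20-20']
```

Assign to `df['Date']` instead of `df['Date1']` to overwrite the column in place.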

While loop on a dataframe column?

I have a small dataframe comprised of two columns, an ORG column and a percentage column. The dataframe is sorted largest to smallest based on the percentage column.
I'd like to create a while loop that adds up the values in the percentage column up until it hits a value of .80 (80%).
So far I've tried:
retail_pareto = 0
counter = 0
while retail_pareto < .80:
    retail_pareto += retailerDF[counter]['RETAILER_PCT_OF_CHANGE']
    counter += 1
This does not work; both the counter and the retail_pareto value remain at zero, with no real error message to help me troubleshoot what I'm doing incorrectly. Ideally, I'd like to end up with a list of the orgs with the largest percentages that together add up to 80%.
I'm not exactly sure what to try next. I've searched these forums, but haven't found anything similar in the forums yet.
Any advice or help is much appreciated. Thank you.
Example Dataframe:
ORG PCT
KST 0.582561
ISL 0.290904
BOV 0.254456
BRH 0.10824
GNT 0.0913631
DSH 0.023441
RDM -0.0119665
JBL -0.0348893
JBD -0.071883
WEG -0.232227
The output that I would expect would be something along the lines of:
ORG PCT
KST 0.582561
ISL 0.290904
Use:
df_filtered = df.loc[df['PCT'].shift(fill_value=0).cumsum().le(0.80),:]
#if you don't want to include the row where cumsum exceeds 0.80
#df_filtered = df.loc[df['PCT'].cumsum().le(0.80),:]
print(df_filtered)
ORG PCT
0 KST 0.582561
1 ISL 0.290904
Can you use this example to help you?
import pandas as pd
retail_pareto = 0
orgs = []
for i, row in retailerDF.iterrows():
    if retail_pareto <= .80:
        retail_pareto += row['RETAILER_PCT_OF_CHANGE']
        orgs.append(row)
    else:
        break
new_df = pd.DataFrame(orgs)
Edit: made it more like your example and added the new DataFrame.
Instead of your loop, take a more pandasonic approach.
Start with computing an additional column containing cumulative sum
of RETAILER_PCT_OF_CHANGE:
df['pct_cum'] = df.RETAILER_PCT_OF_CHANGE.cumsum()
For your data, the result is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
2 BOV 0.254456 1.127921
3 BRH 0.108240 1.236161
4 GNT 0.091363 1.327524
5 DSH 0.023441 1.350965
6 RDM -0.011967 1.338999
7 JBL -0.034889 1.304109
8 JBD -0.071883 1.232226
9 WEG -0.232227 0.999999
And now, to print the rows that together cover 80% of the change,
ending on the first row that crosses the limit, run:
df[df.pct_cum.shift(1).fillna(0) < 0.8]
The result, together with the cumulated sum, is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
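A self-contained sketch of those two steps on a few rows of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'ORG': ['KST', 'ISL', 'BOV', 'BRH'],
    'RETAILER_PCT_OF_CHANGE': [0.582561, 0.290904, 0.254456, 0.108240],
})
df['pct_cum'] = df.RETAILER_PCT_OF_CHANGE.cumsum()
# keep rows whose *previous* cumulative sum is still below 0.8, so the
# first row that pushes the total over the limit is included
result = df[df.pct_cum.shift(1).fillna(0) < 0.8]
print(result['ORG'].tolist())  # ['KST', 'ISL']
```

BOV is excluded because by the time it is reached the cumulative sum is already 0.873465, above the 0.8 limit.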

Not able to print only the date from a date/time column

Hello experts, I have a program that reads a CSV file containing several columns. Its main purpose is to convert each string into a sequence number, with duplicate strings getting the same number; that part works. But I want the Date/Time column to contain only the date. I applied a slicing method that works in the console, but I'm not able to write the result to my output CSV file. Please tell me what to do.
This is the program I have written:
import csv

with open("input1.csv", 'r', encoding="utf8") as csvfile:
    # creating a csv reader object
    reader = csv.DictReader(csvfile, delimiter=',')
    '''We then restructure the data to be a set of keys with list of values {key_1: [], key_2: []}:'''
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

'''Next we want to give each value in each list a unique identifier.'''
# Loop through all keys
for key in data.keys():
    values = data[key]
    things = list(sorted(set(values), key=values.index))
    for i, x in enumerate(data[key]):
        if key == "Date/Time":
            # slicing works here in the console, but never reaches the output file
            var = data[key]
            iter_obj1 = iter(var)
            while True:
                try:
                    element1 = next(iter_obj1)
                    date = element1[0:10]
                    print("date-", date)
                except StopIteration:
                    break
            break
        else:
            data[key][i] = things.index(x) + 1
    print('data[key]-', data[key])

"""Since csv.writerows() takes a list but treats it as a row, we need to restructure our
data so that each row is one value from each list. This can be accomplished using zip():"""
with open("ram3.csv", "w") as outfile:
    writer = csv.writer(outfile)
    # Write headers
    writer.writerow(data.keys())
    # Make one row equal to one value from each list
    rows = zip(*data.values())
    # Write rows
    writer.writerows(rows)
Note: I can't use a pandas DataFrame; that's why I have written the code this way. Please tell me where I need to change the code so that the Date/Time column prints only the date. Thanks.
Input:
job_Id Name Address Email Date/Time
1 snehil singh marathalli ss#gmail.com 12/10/2011:02:03:20
2 salman marathalli ss#gmail.com 12/11/2011:03:10:20
3 Amir HSR ar#gmail.com 11/02/2009:09:03:20
4 Rakhesh HSR rakesh#gmail.com 09/12/2010:02:03:55
5 Ram marathalli r#gmail.com 01/10/2014:12:03:20
6 Shyam BTM ss#gmail.com 12/11/2012:01:03:20
7 salman HSR ss#gmail.com 11/08/2016:15:03:20
8 Amir BTM ar#gmail.com 07/10/2013:04:02:30
9 snehil singh Majestic sne#gmail.com 03/03/2018:02:03:20
Csv file:
job_Id Name Address Email Date/Time
1 1 1 1 12/10/2011:02:03:20
2 2 1 1 12/11/2011:03:10:20
3 3 2 2 11/02/2009:09:03:20
4 4 2 3 09/12/2010:02:03:55
5 5 1 4 01/10/2014:12:03:20
6 6 3 1 12/11/2012:01:03:20
7 2 2 1 11/08/2016:15:03:20
8 3 3 2 07/10/2013:04:02:30
9 1 4 5 03/03/2018:02:03:20
In this output, everything is correct except the Date/Time column. I want to print the date only, not the time.
if key == "Date/Time":
    var = data[key]
    iter_obj1 = iter(var)
    while True:
        try:
            element1 = next(iter_obj1)
            date = element1[0:10]
            print("date-", date)
        except StopIteration:
            break
I got it: I shouldn't use all of that. Adding one line inside the for loop prints the desired output:
if key == "Date/Time":
    data[key][i] = data[key][i][0:10]
That's it, it's done; everything else stays the same.
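A quick sketch of what that slice does, assuming the MM/DD/YYYY:HH:MM:SS layout shown in the input:

```python
dates = ['12/10/2011:02:03:20', '11/02/2009:09:03:20']
# the first 10 characters of 'MM/DD/YYYY:HH:MM:SS' are exactly the date part
dates = [d[0:10] for d in dates]
print(dates)  # ['12/10/2011', '11/02/2009']
```

Because the assignment writes the sliced value back into `data[key][i]`, the later `csv.writerows()` call picks up the shortened dates automatically.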

Run into "KeyError: 2L" assigning to dictionary

I have a pandas data frame and I'm trying to add each acct_id_adj number to a dictionary and search for all phone numbers associated with that id through the notes.
Example of the dataframe
Index RowNum acct_id_adj NOTE_DT NOTE_TXT
0 1 A20000000113301111 5/2/2017 t/5042222222 lm w/ 3rd jn
1 2 A20000000038002222 5/4/2017 OB CallLeft Message
3 4 A20000000107303333 5/4/2017 8211116411 FOR $18490 MLF
import pandas
import re

PhNum = pandas.read_csv('C:/PhoneNumberSearch.csv')
PhNum = PhNum[PhNum['NOTE_TXT'].notnull()]
D = {}
# for i in xrange(PhNum.shape[0]):
for i in xrange(3):
    ID = PhNum['acct_id_adj'][i]
    Note = re.sub(r'\W+', ' ', PhNum['NOTE_TXT'][i])
    print(Note)
    Numbers = [int(s) for s in Note.split() if s.isdigit()]
    print(Numbers)
    for j in xrange(len(Numbers)):
        if Numbers[j] > 1000000000:
            D[ID] = Numbers[j]
print(D)
Out = pandas.DataFrame(D.items(), columns=['acct_id_adj', 'Phone_Number'])
However, at the third row I keep running into "KeyError: 2L" at ID = PhNum['acct_id_adj'][i]. I'm not finding good documentation and can't figure out why the issue waits until then to arise.
All help appreciated in cluing me into what might be causing this error or if I'm thinking about dictionaries in the wrong way.
Analysis:
It seems that your PhoneNumberSearch.csv file is malformed; if so, pandas.read_csv will use the first column as the index. For example:
if csv file is:
Index,RowNum,acct_id_adj,NOTE_DT,NOTE_TXT
0,1,A20000000113301111,5/2/2017,t/5042222222 lm w/ 3rd jn,
1,2,A20000000038002222,5/4/2017,OB CallLeft Message,
3,4,A20000000107303333,5/4/2017,8211116411 FOR $18490 MLF,
The PhNum will be like this:
Index RowNum acct_id_adj NOTE_DT NOTE_TXT
0 1 A20000000113301111 5/2/2017 t/5042222222 lm w/ 3rd jn NaN
1 2 A20000000038002222 5/4/2017 OB CallLeft Message NaN
3 4 A20000000107303333 5/4/2017 8211116411 FOR $18490 MLF NaN
As you can see, there is no index 2 (it jumps to 3), which is why ID = PhNum['acct_id_adj'][2] raises a KeyError.
Solution:
You might consider passing index_col=False to force pandas not to use the first column as the index; refer to the official docs:
PhNum = pandas.read_csv('C:/PhoneNumberSearch.csv',index_col=False)
PhNum will then have the correct index:
Index RowNum acct_id_adj NOTE_DT NOTE_TXT
0 0 1 A20000000113301111 5/2/2017 t/5042222222 lm w/ 3rd jn
1 1 2 A20000000038002222 5/4/2017 OB CallLeft Message
2 3 4 A20000000107303333 5/4/2017 8211116411 FOR $18490 MLF
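The behaviour can be reproduced in memory (a sketch with `io.StringIO` standing in for the real file, and shortened made-up ids):

```python
import io
import pandas as pd

# trailing commas give each data row one more field than the header has
raw = "Index,RowNum,acct_id_adj\n0,1,A111,\n1,2,A222,\n3,4,A333,\n"

# with the extra field, pandas promotes the first column to the index,
# so the labels are 0, 1, 3 and positional label 2 is missing
bad = pd.read_csv(io.StringIO(raw))
print(bad.index.tolist())  # [0, 1, 3]

# index_col=False keeps the default 0..n-1 index instead
good = pd.read_csv(io.StringIO(raw), index_col=False)
print(good.index.tolist())  # [0, 1, 2]
```

With the default index, `bad['anything'][2]` raises the KeyError seen in the question (reported as `2L` under Python 2).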

Formatting list data by table

I am trying to analyse some data, but my data contains letters which require standardising. For every datatable in the data (this csv data contains 3 datatables), I would like to replace the letter T, or any other letter for that matter, with the next highest integer for that table. The first table contains no errors, the second table contains one T, and the third contains two anomalous letters.
DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T
I am expecting this to be a relatively easy thing to code, however whilst I know how to replace all T's with a number, within a particular column or a particular row, I do not know how to replace each T with a different number depending on the Datatable it is in. Essentially I am looking to produce the following from the above:
DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,7
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,6
DatatableC,5
DatatableC,6
Here nothing happened in DatatableA; in DatatableB the only T was replaced with the next highest integer, in this case 7; and in DatatableC the two anomalous data points were both replaced with the next highest integer, which was 6.
If anyone can point me in the right direction or provide a snippet of something, It would be greatly appreciated. As always constructive comments are also appreciated.
Edit in reply to elyase
I attempted to run the code:
import pandas as pd

df = pd.read_csv('test.csv', sep=',', header=None, names=['datatable', 'col'])

def replace_letter(group):
    letters = group.isin(['T', 'Q'])  # select letters
    group[letters] = int(group[~letters].max()) + 1  # replace by next max
    return group

df['col'] = df.groupby('datatable').transform(replace_letter)
print df
and I received the traceback:
Traceback (most recent call last):
File "C:/test.py", line 11, in <module>
df['col'] = df.groupby('datatable').transform(replace_letter)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 1981, in transform
res = path(group)
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 2006, in <lambda>
slow_path = lambda group: group.apply(lambda x: func(x, *args, **kwargs), axis=self.axis)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4416, in apply
return self._apply_standard(f, axis)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4491, in _apply_standard
raise e
ValueError: ("invalid literal for int() with base 10: 'col'", u'occurred at index col')
Is there something I have used incorrectly? I could use AEA's answer, but I have been meaning to use pandas more, as the library seems so useful for data manipulation.
Pandas is ideal for this kind of tasks:
Read your csv:
>>> import pandas as pd
>>> df = pd.read_csv('data.csv', sep=',', header=None, names=['datatable', 'col'])
>>> df.head()
datatable col
0 DatatableA 1
1 DatatableA 2
2 DatatableA 3
3 DatatableA 4
4 DatatableA 5
Group, select and replace max:
def replace_letter(group):
    letters = group.isin(['T', 'Q'])  # select letters
    group[letters] = int(group[~letters].max()) + 1  # replace by next max
    return group
>>> df['col'] = df.groupby('datatable').transform(replace_letter)
>>> df
datatable col
0 DatatableA 1
1 DatatableA 2
2 DatatableA 3
3 DatatableA 4
4 DatatableA 5
5 DatatableB 1
6 DatatableB 6
7 DatatableB 7
8 DatatableB 3
9 DatatableB 4
10 DatatableB 5
11 DatatableB 2
12 DatatableC 3
13 DatatableC 4
14 DatatableC 2
15 DatatableC 1
16 DatatableC 6
17 DatatableC 5
18 DatatableC 6
Write to csv:
df.to_csv('result.csv', index=None, header=None)
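As a side note on the traceback in the edit above: it seems to come from transform being applied to the whole DataFrame, so on the old slow path the function also sees the 'col' label itself. Selecting the single column before transform sidesteps that. A sketch in Python 3 pandas, with a str.isdigit test standing in for the hard-coded ['T', 'Q'] list:

```python
from io import StringIO
import pandas as pd

raw = """DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T"""

# the mixed column is read as strings because of the letters
df = pd.read_csv(StringIO(raw), header=None, names=['datatable', 'col'])

def replace_letter(group):
    group = group.copy()
    letters = ~group.str.isdigit()  # any non-numeric entry (T, Q, ...)
    group[letters] = str(group[~letters].astype(int).max() + 1)
    return group

# transform on the single column avoids the DataFrame-level slow path
df['col'] = df.groupby('datatable')['col'].transform(replace_letter)
```

Both anomalies in DatatableC become '6', matching the expected output in the question.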
I suppose I have to answer the question asked by my own alter-ego. Seriously, does StackExchange not sanitize usernames?
Here's a solution; I'm not guaranteeing it's efficient or simple, but the logic is straightforward. First iterate the dataset, check for anything that isn't an integer string, and record the largest integer per table. Then iterate again and replace the non-integer strings.
I am using StringIO as a replacement for a file just for convenience sake.
import csv
import string
from StringIO import StringIO

raw = """DatatableA,1
DatatableA,2
DatatableA,3
DatatableA,4
DatatableA,5
DatatableB,1
DatatableB,6
DatatableB,T
DatatableB,3
DatatableB,4
DatatableB,5
DatatableB,2
DatatableC,3
DatatableC,4
DatatableC,2
DatatableC,1
DatatableC,Q
DatatableC,5
DatatableC,T"""

fp = StringIO()
fp.write(raw)
fp.seek(0)

reader = csv.reader(fp)
data = []
mapping = {}
for row in reader:
    if row[0] not in mapping:
        mapping[row[0]] = float("-inf")
    if row[1] in string.digits:
        x = int(row[1])
        if x > mapping[row[0]]:
            mapping[row[0]] = x
    data.append(row)

for i, row in enumerate(data):
    if row[1] not in string.digits:
        mapping[row[0]] += 1
        row[1] = str(mapping[row[0]])

fp.close()
fp = StringIO()
writer = csv.writer(fp)
writer.writerows(data)
print fp.getvalue()
