I have a dataframe detailing information on store location and revenue. I would like to iterate over this information, breaking it down by location and machine number, and then export it to Excel. My current dataframe looks like this
Location Machine Number Net Funds Net Revenue
0 Location 1 123456 123 76
1 Location 1 325462 869 522
2 Location 1 569183 896 234
3 Location 2 129756 535 542
4 Location 2 234515 986 516
5 Location 2 097019 236 512
6 Location 3 129865 976 251
Ideally, the output would look something like this
Machine Number Net Funds Net Revenue
Location 1
123456 123 76
325462 869 522
569183 896 234
Machine Number Net Funds Net Revenue
Location 2
129756 535 542
234515 986 516
097019 236 512
Machine Number Net Funds Net Revenue
Location 3
129865 976 251
While I have been able to iterate over this data into the format that I like using
grouped = df.groupby('Location')
for name, group in grouped:
    print(name)
    print(group)
I cannot figure out how to write it out with xlsxwriter.
Any guidance would be appreciated.
For this, you can use to_csv to create a csv string, then adjust the column headers. You can open the CSV file in Excel.
Try this code:
import pandas as pd
cols = ['Location','Machine Number','Net Funds','Net Revenue']
lst = [
['Location 1','123456',123, 76],
['Location 1','325462',869,522],
['Location 1','569183',896,234],
['Location 2','129756',535,542],
['Location 2','234515',986,516],
['Location 2','097019',236,512],
['Location 3','129865',976,251]]
df = pd.DataFrame(lst, columns=cols)
loclst = df['Location'].unique().tolist()
cc = ""
for loc in loclst:
dfl = df[df['Location']==loc][cols[1:]]
cc += ','.join(cols[1:]) + '\n' + loc+',,\n' + dfl.to_csv(index=False, header=False)
print(cc)
with open('out.csv','w') as f:
f.write(cc.replace('\r\n','\n'))
Output (out.csv)
Machine Number,Net Funds,Net Revenue
Location 1,,
123456,123,76
325462,869,522
569183,896,234
Machine Number,Net Funds,Net Revenue
Location 2,,
129756,535,542
234515,986,516
097019,236,512
Machine Number,Net Funds,Net Revenue
Location 3,,
129865,976,251
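If you want a real .xlsx file rather than a CSV, one option is pandas' ExcelWriter with the xlsxwriter engine. This is a minimal sketch, assuming the same df as above; the sheet name, the blank row between blocks, and the column order are just choices:
import pandas as pd

with pd.ExcelWriter('out.xlsx', engine='xlsxwriter') as writer:
    row = 0
    for name, group in df.groupby('Location', sort=False):
        block = group[['Machine Number', 'Net Funds', 'Net Revenue']]
        # head(0) writes only the column names for this block
        block.head(0).to_excel(writer, sheet_name='Sheet1', startrow=row, index=False)
        # location label on its own row, then the data rows below it
        writer.sheets['Sheet1'].write(row + 1, 0, name)
        block.to_excel(writer, sheet_name='Sheet1', startrow=row + 2,
                       index=False, header=False)
        row += len(block) + 3  # leave one blank row between location blocks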
Is there a website or a function that creates example DataFrame code so that it can be used in tutorials?
Something like this
df = pd.DataFrame({'age': [3, 29],
                   'height': [94, 170],
                   'weight': [31, 115]})
or
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
or
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
You can get over 750 datasets from pydataset
pip install pydataset
To see a list of the datasets:
from pydataset import data
# To see a list of the datasets
print(data())
Output:
dataset_id title
0 AirPassengers Monthly Airline Passenger Numbers 1949-1960
1 BJsales Sales Data with Leading Indicator
2 BOD Biochemical Oxygen Demand
3 Formaldehyde Determination of Formaldehyde
4 HairEyeColor Hair and Eye Color of Statistics Students
.. ... ...
752 VerbAgg Verbal Aggression item responses
753 cake Breakage Angle of Chocolate Cakes
754 cbpp Contagious bovine pleuropneumonia
755 grouseticks Data on red grouse ticks from Elston et al. 2001
756 sleepstudy Reaction times in a sleep deprivation study
[757 rows x 2 columns]
Usage
And to use one of the example datasets in a dataframe, it is as simple as using the dataset_id:
from pydataset import data
df = data('cake')
print(df)
Output:
replicate recipe temperature angle temp
1 1 A 175 42 175
2 1 A 185 46 185
3 1 A 195 47 195
4 1 A 205 39 205
5 1 A 215 53 215
.. ... ... ... ... ...
266 15 C 185 28 185
267 15 C 195 25 195
268 15 C 205 25 205
269 15 C 215 31 215
270 15 C 225 25 225
[270 rows x 5 columns]
Note:
There are other packages with their own functionality. Or you can create your own.
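For the "create your own" route, a minimal sketch of a reusable helper (the function name, columns, and value ranges here are just illustrative):
import numpy as np
import pandas as pd

def sample_df(rows=5, seed=0):
    # small, reproducible DataFrame for tutorials and examples
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'name': [f'person_{i}' for i in range(rows)],
        'age': rng.integers(18, 65, size=rows),
        'score': rng.random(rows).round(2),
    })

print(sample_df())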
You can get over 17000 datasets from the datasets package:
pip install datasets
To list all of the datasets:
from datasets import list_datasets
# Print all the available datasets
print(list_datasets())
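And to pull one of those datasets into pandas, load_dataset plus Dataset.to_pandas should work; a sketch (the dataset id 'imdb' is just an example):
from datasets import load_dataset

ds = load_dataset('imdb')        # downloads on first use
df = ds['train'].to_pandas()     # convert one split to a DataFrame
print(df.head())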
I have a dataset that I've created from merging 2 df's together on the "NAME" column and now I have a larger dataset. To finish the DF, I want to perform some logic to it to clean it up.
Requirements:
I want to select each unique 'NAME' and match it with its highest-Sales row. If, after going through the Sales column, all of that name's rows are less than 10, move to the Calls column and select the row with the highest Calls; if all of the Calls are also less than 10, move to the Target column and select the row with the highest Target. No rows are summed.
Here's my DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 84 170 265
1 OFFICE 1 2222278 26 103 287
2 OFFICE 1 2222278 97 167 288
3 OFFICE 2 2222289 7 167 288
4 OFFICE 2 2222289 3 130 295
5 OFFICE 2 2222289 9 195 257
6 OFFICE 3 1111111 1 2 286
7 OFFICE 3 1111111 5 2 287
8 OFFICE 3 1111112 9 7 230
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
Here's what I want to show in the final DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
I was thinking of solving this by using df.iterrows().
Here's what I've tried:
for n, v in df.iterrows():
    if int(v['Sales']) > 10:
        calls = df.loc[(v['NAME'] == v) & (int(v['Calls'].max()))]
        if int(calls['Calls']) > 10:
            target = df.loc[(v['NAME'] == v) & (int(v['Target'].max()))]
        else:
            print("No match found")
    else:
        sales = df.loc[(v['NAME'] == v) & (int(v['Sales'].max()))]
However, I keep getting KeyError: False error messages. Any thoughts on what I'm doing wrong?
This is not optimized, but it should meet your needs. The code snippet sends each NAME group to eval_group(), which checks the row holding each column's maximum, in order, until the Sales, Calls, Target criterion is met.
If you wanted to optimize, you could apply vectorization or parallelism principles to eval_group so it is called against all groups at once instead of sequentially.
A couple of notes: this returns the first row on a tie (i.e. when multiple records share the maximum during the idxmax() call). Also, I believe the first row in your desired answer should have OFFICE 1 at row 2, not row 0.
import pandas as pd

df = pd.read_csv('./data.txt')

def eval_group(df, keys):
    # walk the columns in priority order and return the index of the
    # max as soon as it reaches 10, falling back to the last column
    for key in keys:
        row_id = df[key].idxmax()
        if df.loc[row_id, key] >= 10 or key == keys[-1]:
            return row_id

row_ids = []
keys = ['Sales', 'Calls', 'Target']
for name in df['NAME'].unique().tolist():
    condition = df['NAME'] == name
    row_ids.append(eval_group(df[condition], keys))

df = df[df.index.isin(row_ids)]
df
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
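As mentioned, the per-name loop can be folded into a single groupby-apply pass. A sketch reusing the same eval_group and keys, assuming df still holds the full, unfiltered data:
# one pass instead of a manual loop over unique names
row_ids = df.groupby('NAME', sort=False).apply(lambda g: eval_group(g, keys))
result = df.loc[row_ids]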
This takes a couple of steps: build intermediate dataframes, compute the conditions, and filter based on their result:
import numpy as np

# per NAME, the index of each column's maximum
temp = (df
        .drop(columns='CUSTOMER_SUPPLIER_NUMBER')
        .groupby('NAME', sort=False)
        .idxmax()
        )
# get the booleans for rows less than 10
bools = df.loc(axis=1)['Sales':'Target'].lt(10)
# groupby for each NAME
bools = bools.groupby(df.NAME, sort=False).all()
# conditions buildup
condlist = [~bools.Sales, ~bools.Calls, ~bools.Target]
choicelist = [temp.Sales, temp.Calls, temp.Target]
# you might have to figure out what to use for default
indices = np.select(condlist, choicelist, default=temp.Sales)
# get matching rows
df.loc[indices]
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
I am trying to add column names to my pandas df but I am failing. I want the two columns to be named "Job department" and "Amount".
df["sales"].value_counts()
>>>>>>output
sales 4140
technical 2720
support 2229
IT 1227
product_mng 902
marketing 858
RandD 787
accounting 767
hr 739
management 630
Name: sales, dtype: int64
Then I do:
job_frequency = pd.DataFrame(df["sales"].value_counts(), columns=['Job department','Amount'])
print(job_frequency)
but I get:
Empty DataFrame
Columns: [Job department, Amount]
Index: []
Use Series.rename_axis to set the index name, then
Series.reset_index to convert the Series to a DataFrame:
job_frequency = (df["sales"].value_counts()
.rename_axis('Job department')
.reset_index(name='Amount'))
print(job_frequency)
Job department Amount
0 sales 4140
1 technical 2720
2 support 2229
3 IT 1227
4 product_mng 902
5 marketing 858
6 RandD 787
7 accounting 767
8 hr 739
9 management 630
Alternatively, build the DataFrame directly from the index and values of value_counts:
counts = df["sales"].value_counts()  # compute once and reuse
job_frequency = pd.DataFrame(
    data={
        'Job department': counts.index,
        'Amount': counts.values
    }
)
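On recent pandas (2.x), value_counts() names its values count and its index after the original column, so a plain reset_index plus rename also works; a sketch:
job_frequency = (df["sales"].value_counts()
                 .reset_index()
                 .rename(columns={'sales': 'Job department',
                                  'count': 'Amount'}))
print(job_frequency)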
I have written a program (code below) that gives me a data frame for each file in a folder. The data frame holds the quarters of the year found in the file and the counts (how often each quarter occurs in the file). For one file in the loop, the output looks, for example, like:
2008Q4 230
2009Q1 186
2009Q2 166
2009Q3 173
2009Q4 246
2010Q1 341
2010Q2 336
2010Q3 200
2010Q4 748
2011Q1 625
2011Q2 690
2011Q3 970
2011Q4 334
2012Q1 573
2012Q2 53
How can I create a big data frame where the counts for the quarters are summed up for all files in the folder?
path = "crisisuser"
os.chdir(path)
result = [i for i in glob.glob('*.{}'.format("csv"))]
os.chdir("..")
for i in result:
df = pd.read_csv("crisisuser/"+i)
df['quarter'] = pd.PeriodIndex(df.time, freq='Q')
df=df['quarter'].value_counts().sort_index()
I think you need to append all the Series to a list, then use concat and sum per index values:
out = []
for i in result:
    df = pd.read_csv("crisisuser/" + i)
    df['quarter'] = pd.PeriodIndex(df.time, freq='Q')
    out.append(df['quarter'].value_counts().sort_index())
# sum matching quarters across files; .sum(level=0) is the older
# spelling, and pd.concat(out).groupby(level=0).sum() also works
s = pd.concat(out).sum(level=0)
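An alternative sketch that avoids keeping all the Series around: accumulate a running total, letting Series.add align on the quarter labels (missing quarters count as 0):
total = pd.Series(dtype='int64')
for i in result:
    df = pd.read_csv("crisisuser/" + i)
    counts = pd.PeriodIndex(df.time, freq='Q').value_counts()
    total = total.add(counts, fill_value=0)
total = total.sort_index().astype(int)  # fill_value promotes to float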
I am trying to convert the following data structure to the format below in Python 3.
if your data looks like:
array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
         ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]
You can do this:
Step 1: Use regular expressions to parse your data, because it is a string.
(See the re module documentation for more about regular expressions.)
import re

raws = list()
for index in range(0, len(array)):
    raws.append(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(array[index])))
Output:
[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
Step 2: Extract the raw values and the column names.
columns = np.array(raws)[0,:,0]
raws = np.array(raws)[:,:,1]
Output:
raws -
[['123' '222' '124' 'Sea']
['456' '555' '678' 'Chi']]
columns -
['PIN' 'COD' 'LOA' 'LOC']
Step 3: Now we can just create the df.
df = pd.DataFrame(raws, columns=columns)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 456 555 678 Chi
Is this what you want?
I hope it helps; I'm not sure about your input format.
And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)
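Putting the three steps together into one runnable sketch (same example data and the same PIN/COD/LOA/LOC keys as above):
import re
import numpy as np
import pandas as pd

array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
         ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]

# one list of (key, value) pairs per record
raws = [re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(row)) for row in array]
columns = np.array(raws)[0, :, 0]  # the keys, taken from the first record
values = np.array(raws)[:, :, 1]   # just the values, record by record
df = pd.DataFrame(values, columns=columns)
print(df)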
UPD: another way. I have created a log file like yours:
array = open('example.log').readlines()
Output:
['PIN: 123 COD: 222 \n',
'LOA: 124 LOC: Sea \n',
'PIN: 12 COD: 322 \n',
'LOA: 14 LOC: Se \n']
Then split by ' ', drop the '\n' and reshape:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)
In the reshape, the first number is the row count of your future dataframe, the second is the column count, and the last one you don't need to change. It won't work if you don't have whitespace between the value and the '\n' in each row; if you don't, I can change the example.
Output:
array([[['PIN:', '123'],
        ['COD:', '222'],
        ['LOA:', '124'],
        ['LOC:', 'Sea']],
       [['PIN:', '12'],
        ['COD:', '322'],
        ['LOA:', '14'],
        ['LOC:', 'Se']]],
      dtype='|S4')
And then take raws and columns:
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
Finally, create the dataframe (and cut the trailing colon off the column names):
pd.DataFrame(raws, columns=[i[:-1] for i in columns])
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
If you have many log files, you can do that for each one in a for-loop, save each dataframe in a list (for example, DF_array) and then use pd.concat to make one dataframe from the list of dataframes.
pd.concat(DF_array)
If you need, I can add an example.
UPD:
I have created a dir with log files and then made an array with all the files from PATH:
PATH = "logs_data/"
files = [PATH + i for i in os.listdir(PATH)]
Then do for-loop like in last update:
dfs = list()
for f in files:
array = open(f).readlines()
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array)/2, 4, 2)
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
dfs.append(df)
result = pd.concat(dfs)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
0 15673 2324 13464 Sss
1 12452 3122 11234 Se
2 11 132 4 Ses
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses