My excel data looks like this:
   A    B    C
1  123  534  576
2  456  745  345
3  234  765  285
In another excel spreadsheet, my data may look like this:
   B    C    A
1  123  534  576
2  456  745  345
3  234  765  285
How can I extract column C's contents from both spreadsheets?
My code is as follows:
import xlrd

#Open the first sheet of the workbook
ow = xlrd.open_workbook('export.xlsx').sheet_by_index(0)
#Store column 3's data (minus the header row) in a list
ips = ow.col_values(2, 1)
I would like something more like: ips = ow.col_values(C, 1)
How can I achieve the above?
Since I have two different spreadsheets, and the data I want sits in a different column in each, I have to search the first row by header name until I find it, then extract that column.
Here's how I did it:
ow = xlrd.open_workbook('export.xlsx').sheet_by_index(0)
#Scan the header row by name until the right column turns up
for x in range(0, 20):
    try:
        if ow.cell_value(0, x) == "IP Address":
            print("found it!")
            ips = ow.col_values(x, 1)
            break
    except IndexError:
        continue
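For what it's worth, the scan can be collapsed into a single lookup; a minimal sketch along the same lines, assuming the header cell is exactly "IP Address" (and the older, .xlsx-capable xlrd used above):
import xlrd

ow = xlrd.open_workbook('export.xlsx').sheet_by_index(0)
#Read the whole header row and look the column up by name;
#list.index raises ValueError if "IP Address" is missing
col = ow.row_values(0).index("IP Address")
ips = ow.col_values(col, 1)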
I am new to Python and pandas. I have a text file (data.txt) whose content looks like "123 456 789 101123 456 789 101 112 113 110 112 123 456 789 101 112 113 110 113 123 456 789 101 112 113 110 110 ............. " etc., and an Excel file (combination.xlsx) which carries some combinations (in the Excel sheet, cell A1 = 123 456, A2 = 456 789, A3 = 789 101123, .................). My problem is how to use each cell value from combination.xlsx to count its frequency of occurrence in data.txt and print the counts to another text file (final.txt). I want to make a loop which starts by picking the first cell value (A1); if its count is 1 or more it prints to final.txt, otherwise it picks the second cell value (A2), and so on until the cell value is empty.
It seems to me that you don't need an explicit while loop here. You can get each cell value using pd.read_excel, which returns a dataframe with all the cells. To count the frequency of occurrence, for each row of the dataframe you can take len over re.findall with the following regular expression: \b({x})\b. This regex ensures that the number sequence (x in this particular f-string) will match only between word boundaries. To print to another file, you can use df["Qnt"].to_csv.
import pandas as pd
import re
data_txt = "123 456 789 101123 456 789 101 112 113 110 112 123 456 789 101 112 113 110 113 123 456 789 101 112 113 110 110"
# read XLSX cells
df = pd.read_excel("combination.xlsx", header=None, names=["Comb"])
# count occurrences
find_qnt = lambda x: len(re.findall(rf"\b({x})\b", data_txt))
# apply to each row
df["Qnt"] = df["Comb"].apply(find_qnt)
print(df)
# print into another text file
df["Qnt"].to_csv("final.txt", index=False)
Output from df
Comb Qnt
0 123 456 3
1 456 789 4
2 789 101123 1
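One note: the sketch above hard-codes data_txt for demonstration; in practice you would presumably read it from data.txt first, e.g.:
# read the raw text from data.txt instead of hard-coding it
with open("data.txt") as fh:
    data_txt = fh.read()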
My program writes its output to a .txt file. There are 3 different tables in this output, and I need to convert these three tables into pandas dataframes. I'm not sure what the best way to approach this is.
This is what my .txt output file looks like:
column_header standard_content (Old) standard_content (New)
214 STAFF_ORIGIN_IND_NATIVE_AMER N Y
215 STAFF_ORIGIN_IND_PACIF_ISLND N Y
128 STUDENT_INFORMATION_RELEASE N Y
211 STAFF_ORIGIN_IND_ASIAN N Y
105 STUDENT_ORIGIN_IND_NATIVE_AMER N Y
104 STUDENT_ORIGIN_IND_HISPANIC N Y
160 STUDENT_OUTSIDE_CATCHMENT N Y
346 COURSE_EXTRA_POINT_ELIGIBLE N Y
528 SUBSTITUTE_REQUIRED N Y
527 STAFF_ABSENCE_AUTHORIZED N Y
column_header data_req (Old) data_req (New)
20 SCHOOL_SIZE_GROUP N Y
241 STAFF_CONTACT N Y
346 COURSE_EXTRA_POINT_ELIGIBLE N Y
434 DISCIPLINE_FED_OFFENSE_GROUP N Y
32 SCHOOL_ATTENDANCE_TYPE N Y
142 STUDENT_COUNTRY_OF_BIRTH N Y
74 FACILITY_COUNTY_CODE N Y
64 FACILITY_PARKING_SPACES N Y
436 DISCIPLINE_DIST_OFFENSE_GROUP N Y
321 STAFF_BARGAINING_UNIT N Y
column_header element_type (Old) element_type (New)
331 DISTRICT_CODE Key Local
511 DISTRICT_CODE Key Local
445 DISTRICT_CODE Key Local
2 DISTRICT_CODE Key Local
302 STAFF_ASSIGN_FINANCIAL_CODE Key Local
493 SCHEDULE_SEQUENCE Key Local
461 INCIDENT_ID Key Local
431 INCIDENT_ID Key Local
159 STUDENT_CATCHMENT_CODE Key Local
393 DISTRICT_CODE Key Local
I tried to use this in a loop but it creates a single dataframe and it gets messed up.
df = pd.read_fwf(io.StringIO(report))
df.to_csv('data.csv')
result_df = pd.read_csv('data.csv')
print("Final report", result_df)
Is there a way I can create a new dataframe based on a keyword, for example 'column_header', or any other way I can do this?
Do this in a few steps:
Slurp the entire file
split according to a delimiter (empty lines)
read each part into a separate dataframe
If we let raw_data be the content of your file, this could be done with:
from io import StringIO

import pandas as pd

dfs = [pd.read_fwf(StringIO(part),
                   header=None, skiprows=1,
                   names=['id', 'header', 'old', 'new'])
       for part in raw_data.strip().split('\n\n')]
The split looks for empty lines. The read_fwf call uses several pandas TextParser options to skip the header row and explicitly name the columns (the actual column headers throw off the fixed-width parser).
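For the slurp step itself, a minimal sketch, assuming the report sits in a file (report.txt is a placeholder name):
# slurp the entire report file into one string
with open('report.txt') as fh:  # placeholder filename
    raw_data = fh.read()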
The first frame will look like
id header old new
0 214 STAFF_ORIGIN_IND_NATIVE_AMER N Y
1 215 STAFF_ORIGIN_IND_PACIF_ISLND N Y
2 128 STUDENT_INFORMATION_RELEASE N Y
3 211 STAFF_ORIGIN_IND_ASIAN N Y
4 105 STUDENT_ORIGIN_IND_NATIVE_AMER N Y
5 104 STUDENT_ORIGIN_IND_HISPANIC N Y
6 160 STUDENT_OUTSIDE_CATCHMENT N Y
7 346 COURSE_EXTRA_POINT_ELIGIBLE N Y
8 528 SUBSTITUTE_REQUIRED N Y
9 527 STAFF_ABSENCE_AUTHORIZED N Y
I am trying to convert the following data structure to the format below, in Python 3.
If your data looks like:
array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
         ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]
You can do this:
Step 1: use regular expressions to parse your data, since it is a string (see the re module docs for more about regular expressions).
import re

raws = list()
for index in range(0, len(array)):
    raws.append(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(array[index])))
Output:
[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
Step 2: extract the raw values and the column names.
import numpy as np

columns = np.array(raws)[0,:,0]
raws = np.array(raws)[:,:,1]
Output:
raws -
[['123' '222' '124' 'Sea']
['456' '555' '678' 'Chi']]
columns -
['PIN' 'COD' 'LOA' 'LOC']
Step 3: now we can just create the df.
df = pd.DataFrame(raws, columns=columns)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 456 555 678 Chi
Is this what you want?
I hope it helps; I'm not sure about your input format.
And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)
UPD: here is another way. I have created a log file like yours:
array = open('example.log').readlines()
Output:
['PIN: 123 COD: 222 \n',
'LOA: 124 LOC: Sea \n',
'PIN: 12 COD: 322 \n',
'LOA: 14 LOC: Se \n']
Then split by ' ', drop the '\n', and reshape:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)
In reshape, the first number is the row count of your future dataframe, the second is the column count, and the last one you don't need to change. It won't work if you don't have whitespace between the value and the '\n' in each row. If you don't, I will change the example.
Output:
array([[['PIN:', '123'],
['COD:', '222'],
['LOA:', '124'],
['LOC:', 'Sea']],
[['PIN:', '12'],
['COD:', '322'],
['LOA:', '14'],
['LOC:', 'Se']]],
dtype='|S4')
And then take raws and columns:
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
Finally, create the dataframe (and cut the trailing ':' off the column names):
pd.DataFrame(raws, columns=[i[:-1] for i in columns])
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
If you have many log files, you can do this for each one in a for loop, collect each dataframe in a list (for example, a list called DF_array), and then use pd.concat to combine them into one dataframe.
pd.concat(DF_array)
If you need one, I can add an example.
UPD:
I have created a dir with log files and then built a list of all the files from PATH:
PATH = "logs_data/"
files = [PATH + i for i in os.listdir(PATH)]
Then do a for loop like in the last update:
dfs = list()
for f in files:
    array = open(f).readlines()
    # integer division: reshape needs an int row count in Python 3
    raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array) // 2, 4, 2)
    columns = np.array(raws)[:,:,0][0]
    raws = np.array(raws)[:,:,1]
    df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
    dfs.append(df)
result = pd.concat(dfs)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
0 15673 2324 13464 Sss
1 12452 3122 11234 Se
2 11 132 4 Ses
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
x y time
0 225 0 20.295270
1 225 1 21.134015
2 225 2 21.382298
3 225 3 20.704367
4 225 4 20.152735
5 225 5 19.213522
.......
900 437 900 27.748966
901 437 901 20.898460
902 437 902 23.347935
903 437 903 22.011992
904 437 904 21.231041
905 437 905 28.769945
906 437 906 21.662975
.... and so on
What I want to do is retrieve the rows which have the smallest time for a given x and y. Basically, for every element on the y, I want to find the x with the smallest time value, but I want to exclude those that have time 0.0; this happens when x has the same value as y.
So, for example, the fastest way to get to y-0 is by starting from x-225, and so on; it could therefore be the case that x repeats itself, but for a different y.
e.g.
x y time
225 0 20.295270
438 1 19.648954
27 20 4.342732
9 438 17.884423
225 907 24.560400
So far I have tried groupby, but I'm only getting the same x as y.
print(df.groupby('id_y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("id_y")["time"].idxmin()]
Just to point out one thing: I'm open to options, not just groupby; if there are other ways, that's very good too.
So you need to first remove the rows where time equals 0, using boolean indexing, and then apply your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
A similar alternative, filtering with query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')
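As a quick sanity check on a small invented frame (values made up), all three variants drop the time == 0 self-pairs and keep one fastest row per y:
import pandas as pd

df = pd.DataFrame({'x':    [0, 225, 437, 1, 225, 437],
                   'y':    [0, 0, 0, 1, 1, 1],
                   'time': [0.0, 20.3, 27.7, 0.0, 21.1, 20.9]})

# keep the fastest non-zero time per y
df2 = df[df['time'] != 0].sort_values(['y', 'time']).drop_duplicates('y')
print(df2)
#      x  y  time
# 1  225  0  20.3
# 5  437  1  20.9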
I have the following piece of code:
my_list = ["US", "IT", "ES", "NL"]
for i in my_list:
A = sum_products_by_country(world_level,i)
df = pd.DataFrame({'value':A})
Descending = df.sort_values( by='value', ascending = 0 )
Top_5 = Descending[0:5]
print(Top_5)
The "sum_products_by_country" is a created function which takes as arguments a data frame ( in my case is named "world_level") and a country name and returns the sum by product for this country. Using this loop I find the top5 products and the sums for each country of my_list. Here is the output of this loop:
US value
Product
B 1492
H 455
BB 351
C 119
F 117
IT value
Product
P 346
U 331
A 379
Q 190
D 1389
ES value
Product
P 3046
U3 331
A 379
Q 1390
DD 10389
NL value
Product
P 3465
U 3313
AA 379
2Q 190
D 189
I want to write this output to an Excel sheet using:
writer = pd.ExcelWriter('top products.xlsx', engine='xlsxwriter')
Top_5.to_excel(writer, sheet_name='Sheet1')
writer.save()
Could you tell me where I should put the code above in order to get the required Excel document?
Is there also a way to get the column names (Country, Product, value) only once at the top in my Excel document, and not for each country separately? So I want something like this:
Country Product value
US
B 1492
H 455
BB 351
C 119
F 117
IT
P 346
U 331
A 379
Q 190
D 1389
ES
P 3046
U3 331
A 379
Q 1390
DD 10389
NL
P 3465
U 3313
AA 379
2Q 190
D 189
Thank you
This script should help you:
import openpyxl

#Create workbook object
wb = openpyxl.Workbook()
sheet = wb.active  #get_active_sheet() is deprecated
sheet.title = 'Products by country'
#Generate data
#Add titles in the first row of each column
sheet.cell(row=1, column=1).value = 'country'
sheet.cell(row=1, column=2).value = 'product'
sheet.cell(row=1, column=3).value = 'value'
#Loop to set the value of each cell
for i in range(0, len(Country)):
    #Country is a list of country names; if a country has 5 values,
    #just repeat its name five times in the list
    sheet.cell(row=i+2, column=1).value = Country[i]
    sheet.cell(row=i+2, column=2).value = Product[i]  #list of products
    sheet.cell(row=i+2, column=3).value = Values[i]   #list of values
#Finally, save the file and give it a name
wb.save('NameFile.xlsx')
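If you'd rather stay in pandas, here is a sketch of an alternative (reusing my_list, world_level and sum_products_by_country from the question, so treat it as untested against your data): collect the per-country Top_5 frames in a dict and let pd.concat build a Country index level, so the headers are written only once.
import pandas as pd

tops = {}
for i in my_list:
    A = sum_products_by_country(world_level, i)
    tops[i] = pd.DataFrame({'value': A}).sort_values(by='value', ascending=False)[0:5]

# the dict keys become an outer 'Country' index level, so to_excel
# writes a single header row and one country label per block
result = pd.concat(tops, names=['Country'])
result.to_excel('top products.xlsx', sheet_name='Sheet1')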