VLOOKUP/ETL with Python

I have data that comes from MS SQL Server. The query returns a list of names straight from a public database. For instance, if I wanted records with the name "Microwave", something like this would happen:
Microwave
Microwvae
Mycrowwave
Microwavee
Microwave would be spelt in hundreds of ways. I currently solve this with a VLOOKUP in Excel: it looks up the value in the left-hand cell and returns the value on the right, for example:
VLOOKUP(A1,$A$1:$B$4,2,FALSE)
Table:

       A           B
  1    Microwave   Microwave
  2    Microwvae   Microwave
  3    Mycrowwave  Microwave
  4    Microwavee  Microwave
I would just copy the VLOOKUP formula down the CSV or Excel file and then use that information for my analysis.
Is there a way to solve this issue differently in Python?
I could make a long if/elif list, or even a replace list, and apply it to each line of the CSV, but that would save no more time than just using the VLOOKUP. There are thousands of company names spelt wrong and I do not have the clearance to change the database.
So Stack, any ideas on how to leverage Python in this scenario?

If you had data like this:
+-------------+-----------+
| typo | word |
+-------------+-----------+
| microweeve | microwave |
| microweevil | microwave |
| macroworv | microwave |
| murkeywater | microwave |
+-------------+-----------+
Save it as typo_map.csv
Then run (in the same directory):
import csv

def OpenToDict(path, index):
    """Read a CSV into a nested dict keyed by the `index` column."""
    with open(path) as f:
        reader = csv.reader(f)
        headings = next(reader)
        # map each heading to its column position
        heading_nums = {}
        for i, v in enumerate(headings):
            heading_nums[v] = i
        fields = [heading for heading in headings if heading != index]
        file_dictionary = {}
        for row in reader:
            file_dictionary[row[heading_nums[index]]] = {}
            for field in fields:
                file_dictionary[row[heading_nums[index]]][field] = row[heading_nums[field]]
    return file_dictionary

map = OpenToDict('typo_map.csv', 'typo')
print(map['microweevil']['word'])
The structure is slightly more complex than it needs to be for your situation, but that's because this function was originally written to look up more than one column. However, it will work for you, and you can simplify it yourself if you want.
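If you are already working from a CSV export, the same lookup can also be expressed with pandas. This is only a sketch: the raw_names.csv file and its name column are assumptions for illustration, not part of the question.
import pandas as pd

# build a typo -> canonical-word dict from the map file (assumed columns: typo, word)
typo_map = pd.read_csv('typo_map.csv').set_index('typo')['word'].to_dict()

# apply it to the raw export; names that are not in the map are kept unchanged
data = pd.read_csv('raw_names.csv')            # hypothetical export with a 'name' column
data['clean_name'] = data['name'].map(typo_map).fillna(data['name'])
data.to_csv('cleaned_names.csv', index=False)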

Related

using data from one column as part of an update for another column using Pandas from a very large csv file

I have a very large CSV file, where I generate a URL using data from one column [c] and update the corresponding column cell [f] with the new information. Although I program a lot in Python, I don't use Pandas that often, so I am unsure how to handle this problem.
F is the final output; I am using the C column as the image name, and the rest of the URL is the same.
| c | f |
| ------- | -------------------------- |
| 2134 | http://url.com/2134.jpg |
| 3e32 | http://url.com/3e32.jpg |
| jhknh | http://url.com/jhknh.jpg |
| 12.12.3 | http://url.com/12.12.3.jpg |
I have searched but I have not been able to find an implementable solution. I know I would probably have to use chunksize for this, as there could be upwards of 20,000 records.
Any assistance with this would be greatly appreciated. I have looked and tried a few things but I am unable to come up with a solution.
Thank you in advance
~ E
Load your CSV file into a dataframe and update column 'f':
import pandas as pd

df = pd.read_csv('yourdatafile.csv')
df['f'] = 'http://url.com/' + df.c + '.jpg'
df
Output
c f
0 2134 http://url.com/2134.jpg
1 3e32 http://url.com/3e32.jpg
2 jhnkhk http://url.com/jhnkhk.jpg
3 12.12.1 http://url.com/12.12.1.jpg
If your records don't fit in memory you can chunk your data and append every chunk to a new file.
header = True
for chunk in pd.read_csv('yourdatafile.csv', chunksize=1000):
    chunk['f'] = 'http://newurl.com/' + chunk.c + '.jpg'
    chunk.to_csv('newdata.csv', mode='a+', index=False, header=header)
    header = False

Pandas: Why are my headers being inserted into the first row of my dataframe?

I have a script that collates sets of tags from other dataframes, converts them into comma-separated strings, and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), then I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() reads a .csv file with the following headers and no additional information:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to avoid reading an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of using the first row as the header lives in read_csv(); if you create your dataframe without read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the dataframe generated by df_empty.
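A minimal sketch of that idea, with made-up rows that mirror the tables above: because there is no read_csv involved, pandas does no header detection, so any header row that slipped into your source data has to be dropped explicitly.
import pandas as pd

columns = ['Sheet', 'Cause', 'Initiator', 'Group', 'Effects']
rows = [
    columns,                                        # a header row mixed in with the data
    ['0001', 1, 'Tag_I1', 'DIG', 'Tag_O1, Tag_O2'],
    ['0001', 2, 'Tag_I2', 'DIG', 'Tag_O2, Tag_O4'],
]

# skip the first (header) row yourself; every remaining row is treated as data
df = pd.DataFrame(rows[1:], columns=columns)
print(df)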

Counting book inventory by user

There is a table which contains data from 2014. The structure is as follows:
Each user can issue a different number of book categories.
User-id|Book-Category
1 |Thrill
2 |Thrill
3 |Mystery
3 |Mystery
The requirement is to find, for each user, the total for each type of book category issued. This data is already there in CSV files, but it is available year-wise.
I have to add up all those values.
eg:
data for 2014
u-id|book|count
1 |b1 |2
1 |b2 |4
... ... ...
data for 2015
u-id|book|count
1 |b1 |21
2 |b3 |12
// like the above format, available till 2018 (user 1 with book b1 should end up with a count of 23)
Now, I wrote a Python script in which I just made a dictionary and iterated over each row: if the key (u-id + book-category) was present, I added the value of count; otherwise, I inserted the key-value pair into the dictionary. I did this for every year-wise file in that script. Since some files are larger than 1.5 GB, the script kept running for 7-8 hours and I had to stop it.
Code:
import requests
import csv
import pandas as pd

Dict = {}
with open('data_2012.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        key = row['a'] + row['b']
        if key not in Dict:
            Dict[key] = int(row['c'])
        else:
            # add the count when the (user, book) key is already present
            Dict[key] += int(row['c'])
# like this, iterating over the year-wise files and finally writing the data to a different file.
# 'a' and 'b' are mentioned on the first line of the data files for easy access.
Is there any way to achieve this functionality more elegantly in Python, or to write a Map-Reduce job for it?
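For what it's worth, here is a pandas sketch of the aggregation described above. It assumes the yearly files are named like data_2014.csv through data_2018.csv and all share the columns u-id, book and count; it reads them in chunks so the 1.5 GB files never have to fit in memory at once.
import glob
import pandas as pd

totals = {}
for path in glob.glob('data_20*.csv'):                 # assumed naming pattern
    for chunk in pd.read_csv(path, chunksize=500000):
        # sum the counts per (user, book) within the chunk, then fold into the running totals
        grouped = chunk.groupby(['u-id', 'book'])['count'].sum()
        for key, value in grouped.items():
            totals[key] = totals.get(key, 0) + value

result = pd.Series(totals, name='count')
result.index.names = ['u-id', 'book']
result.reset_index().to_csv('totals.csv', index=False)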

How to generate table using Python

I am quite struggling with this: I have tried many libraries to print a table but with no success, so I thought to post here and ask.
My data is in a text file (resource.txt) which looks like this (exactly the way it prints):
pipelined 8 8 0 17 0 0
nonpipelined 2 2 0 10 0 0
I want my data printed in the following manner:
Design name    LUT   Lut as m   Lut as I   FF   DSP   BRAM
----------------------------------------------------------
pipelined       8       8          0       17    0     0
Non piplined    2       2          0       10    0     0
Sometimes there may be more lines of data; the columns remain the same, but the rows may increase.
(I have Python version 2.7.)
I am using this part in my Python code; all of the code works, but I cannot print the data I extracted to the text file in tabular form. I can't use the pandas library, as it won't work with my Python 2.7 setup, but I can use tabulate and similar libraries. Can anyone please help me?
I tried using tabulate and the others, but I keep getting errors.
At the end I tried a simple print method, but it's not working (the same code works if I put it at the top of the code, but at the end it won't work). Does anyone have any idea?
q11 = open("resource.txt", "r")
for line in q11:
    print(line)
Here's a self contained function that makes a left-justified, technical paper styled table.
def makeTable(headerRow,columnizedData,columnSpacing=2):
    """Creates a technical paper style, left justified table
    Author: Christopher Collett
    Date: 6/1/2019"""
    from numpy import array,max,vectorize

    cols = array(columnizedData,dtype=str)
    colSizes = [max(vectorize(len)(col)) for col in cols]

    header = ''
    rows = ['' for i in cols[0]]

    for i in range(0,len(headerRow)):
        if len(headerRow[i]) > colSizes[i]: colSizes[i]=len(headerRow[i])
        headerRow[i]+=' '*(colSizes[i]-len(headerRow[i]))
        header+=headerRow[i]
        if not i == len(headerRow)-1: header+=' '*columnSpacing

        for j in range(0,len(cols[i])):
            if len(cols[i][j]) < colSizes[i]:
                cols[i][j]+=' '*(colSizes[i]-len(cols[i][j])+columnSpacing)
            rows[j]+=cols[i][j]
            if not i == len(headerRow)-1: rows[j]+=' '*columnSpacing

    line = '-'*len(header)
    print(line)
    print(header)
    print(line)
    for row in rows: print(row)
    print(line)
And here's an example using this function.
>>> header = ['Name','Age']
>>> names = ['George','Alberta','Frank']
>>> ages = [8,9,11]
>>> makeTable(header,[names,ages])
------------
Name     Age
------------
George   8
Alberta  9
Frank    11
------------
Since the number of columns remains the same, you could just print out the first line with ample spaces as required, e.g.:
print("Design name", ' ', "LUT", ' ', "Lut as m", ' ', "and continue like that")
Then read the csv file:
import csv

datafile = open('resource.csv', 'r')
reader = csv.reader(datafile)
for col in reader:
    print(col[0], ' ', col[1], ' ', col[2], ' ', "and continue depending on the number of columns")
This is not the optimized solution, but since it looks like you are new, this should help you understand better. Or else you can use row_format print options in Python 2.7.
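As a rough sketch of that last suggestion, the row_format idiom works on both Python 2.7 and 3; the field widths below are guesses, so adjust them to your longest values.
# pad the design name left-justified and the numeric columns right-justified
row_format = "{:<15}{:>6}{:>10}{:>10}{:>6}{:>6}{:>6}"
headers = ["Design name", "LUT", "Lut as m", "Lut as I", "FF", "DSP", "BRAM"]

print(row_format.format(*headers))
print("-" * 59)
with open("resource.txt") as f:
    for line in f:
        parts = line.split()
        if parts:                                   # skip blank lines
            print(row_format.format(*parts))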
Here is code to print a nice table using beautifultable: you transfer all your data into lists, append each list as a row of the table, and print it; or else you can transfer the data from each line of your text file into one list and print that.
from beautifultable import BeautifulTable
h0=["jkgjkg"]
h1=[2,3]
h2=[2,3]
h3=[2,3]
h4=[2,3]
h5=[2,3]
h0.append("FPGA resources")
table = BeautifulTable()
table.column_headers = h0
table.append_row(h1)
table.append_row(h2)
table.append_row(h3)
table.append_row(h4)
table.append_row(h5)
print(table)
Output:
+--------+----------------+
| jkgjkg | FPGA resources |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
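Since the question says tabulate is available, here is a minimal sketch with that library; the header names are taken from the desired output above, and resource.txt is assumed to be whitespace-separated exactly as shown.
from tabulate import tabulate

headers = ["Design name", "LUT", "Lut as m", "Lut as I", "FF", "DSP", "BRAM"]

rows = []
with open("resource.txt") as f:
    for line in f:
        parts = line.split()
        if parts:                     # skip blank lines
            rows.append(parts)

print(tabulate(rows, headers=headers))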

Best way to find shortest path with loading very specific data from Google Spreadsheet

Now, before you judge me for asking a FAQ, my problem is not as easy as 'the best shortest path algorithm' (at least I think).
I have a Google Spreadsheet.
Every row begins with a town name and is followed by the number of roads that go through it and by the names of those roads. Something like below:
Ex. Spreadsheet / First sheet:
A | B | C | D | E |
1 Town name | Number of roads you find here | road name | road name | road name |
2 Manchester| 3 | M1 | M2 | M3 |
3 Leeds | 1 | M3 | | |
4 Blackpool | 2 | M1 | M2 | |
Now this one Spreadsheet has many worksheets, one for every road name (in my case M1, M2, M3; M1 is the second worksheet, since the first one has the content from above, M2 is the third, etc.).
Ex. Spreadsheet / Second sheet:
A | B | C | D | E | F |
1 This road | Town name | Distance in km | type of road | other road | other road |
2 M1 | Manchester| 0 | M2 | M3 | |
3 M1 | Blackpool | 25 | M2 | | |
The third sheet is similar, and the following sheets have a similar structure. One town can be contained in many sheets depending on how many roads link to it. You can see it from the above example.
The Spreadsheet is not made by me. It's like this. It will not get any better.
I have no problem pulling the data from the google spreadsheet in the program. Reading spreadsheet data with python is not the question here.
What is the best way to write a programme in wxPython/Python where:
1. A user inputs a Starting Town and a Finishing Town.
2. The programme reads the spreadsheet and the appropriate worksheets.
3. It somehow finds the best path in this jungle of worksheets.
4. It additionally returns the total distance from Starting Town to Finishing Town, even if it has to go through more than 2-3 worksheets to get there.
5. It returns the results to the user's screen in a lovely form :)
I hope you find my problem challenging enough to deserve a question.
I beg you for help. Show me the way to go about this very specific problem.
What came of your previous attempt:
Way too slow wxPython application getting data from Google Spreadsheet and User input needs speed up solution
Did you find what was taking so long? What other issues did you encounter there?
I'm relatively new to Stack Overflow, but I've seen questions in this style, which can be interpreted as "Could you write this code for me?", being rejected pretty swiftly.
You might want to consider sharing some of the challenges from the above link and explaining a specific problem within the project.
UPDATED
1+5:
From the wx point of view, you'll want to keep the UI responsive whilst the search is going on. One way to do this is to kick off the search in a separate thread which calls wx.PostEvent once it has finished. Then in the main wx App you have an event handler which receives the event and processes it. In your case, it "shows the results on a lovely form".
See here for an example: http://wiki.wxpython.org/LongRunningTasks
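A minimal sketch of that pattern (the find_route function and the event attribute names are placeholders for your own search code, not part of the linked example):
import threading
import wx
import wx.lib.newevent

# custom event used to carry the search result back to the GUI thread
ResultEvent, EVT_RESULT = wx.lib.newevent.NewEvent()

class MainFrame(wx.Frame):
    def __init__(self):
        super(MainFrame, self).__init__(None, title="Route search")
        self.Bind(EVT_RESULT, self.on_result)

    def start_search(self, start_town, end_town):
        # run the slow spreadsheet search off the UI thread so the window stays responsive
        worker = threading.Thread(target=self.search, args=(start_town, end_town))
        worker.daemon = True
        worker.start()

    def search(self, start_town, end_town):
        route, distance = find_route(start_town, end_town)   # hypothetical slow lookup
        # wx.PostEvent is thread-safe; the bound handler runs back on the main thread
        wx.PostEvent(self, ResultEvent(route=route, distance=distance))

    def on_result(self, event):
        # safe to update widgets here, e.g. show event.route and event.distance on the form
        print(event.route, event.distance)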
