Cleaning .csv text data in Python
I recently created a Python program that imports my finances from a .csv file and transfers them to Google Sheets. However, I am struggling to figure out how to clean up the transaction names my bank gives me.
Example:
ME DC SI XXXXXXXXXXXXXXXX NETFLIX should just be NETFLIX,
POS XXXXXXXXXXXXXXXX STEAM PURCHASE should just be STEAM and so on
Forgive me if this is a stupid question; I am new to coding and am just looking to use it to automate certain parts of my life.
import csv
import gspread
import time

MONTH = 'June'  # set month name
file = f'HDFC_{MONTH}_2022.csv'  # the file we need to extract data from
transactions = []  # empty list to collect transaction rows

def hdfcFin(file):
    '''Read the bank CSV and build the list of rows to export to Google Sheets.'''
    with open(file, mode='r') as csv_file:
        csv_reader = csv.reader(csv_file)
        for row in csv_reader:
            date = row[0]
            name = row[1]
            expense = float(row[2])
            income = float(row[3])
            category = 'other'
            transaction = (date, name, expense, income, category)
            transactions.append(transaction)
    return transactions

sa = gspread.service_account()  # connect the service-account JSON to the API
sh = sa.open('Personal Finances')
wks = sh.worksheet(f'{MONTH}')

rows = hdfcFin(file)
for row in rows:
    wks.insert_row([row[0], row[1], row[4], row[2], row[3]], 8)
    time.sleep(2)  # time delay because of API rate restrictions
If you don't have a specific format to identify the name, you can use the logic below, which uses a key-value mapping: if a key appears in the name, it is replaced with its value.
d = {'ME DC SI XXXXXXXXXXXXXXXX NETFLIX': 'NETFLIX',
     'POS XXXXXXXXXXXXXXXX STEAM PURCHASE': 'STEAM'}
test = 'POS XXXXXXXXXXXXXXXX STEAM PURCHASE'
if test in d:
    test = d[test]
print(test)
Output:
STEAM
If the requirement is to fetch only the last word of the name, you can use the logic below.
test='ME DC SI XXXXXXXXXXXXXXXX NETFLIX'
test=test.split(" ")[-1]
print(test)
Output:
NETFLIX
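The two ideas above can be combined into one helper that checks an explicit mapping first and otherwise falls back to the last word. This is only a sketch: the `clean_name` helper name is my own, and the mapping entries are the two examples from the question.

```python
# Hypothetical helper combining both approaches: an explicit mapping
# first, then a last-word fallback for names not in the mapping.
NAME_MAP = {
    'ME DC SI XXXXXXXXXXXXXXXX NETFLIX': 'NETFLIX',
    'POS XXXXXXXXXXXXXXXX STEAM PURCHASE': 'STEAM',
}

def clean_name(raw):
    """Return a cleaned merchant name for a raw bank description."""
    if raw in NAME_MAP:           # exact match wins
        return NAME_MAP[raw]
    return raw.split()[-1]        # fallback: keep only the last word

print(clean_name('ME DC SI XXXXXXXXXXXXXXXX NETFLIX'))    # NETFLIX
print(clean_name('POS XXXXXXXXXXXXXXXX STEAM PURCHASE'))  # STEAM
```

In the main script, this helper could be applied to `row[1]` before appending the transaction tuple.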
Related
Why does Twint's "since" & "until" not work?
I'm trying to get all tweets from 2018-01-01 until now from various firms. My code works; however, I do not get the tweets from the requested time range. Sometimes I only get the tweets from today and yesterday, or from mid-April up to now, but not since the beginning of 2018. I then get the message: [!] No more data! Scraping will stop now.

ticker = []
# read in csv file with company ticker in a list
with open('C:\\Users\\veron\\Desktop\\Test.csv', newline='') as inputfile:
    for row in csv.reader(inputfile):
        ticker.append(row[0])

# Getting tweets for every ticker in the list
for i in ticker:
    searchstring = (f"{i} since:2018-01-01")
    c = twint.Config()
    c.Search = searchstring
    c.Lang = "en"
    c.Panda = True
    c.Custom["tweet"] = ["date", "username", "tweet"]
    c.Store_csv = True
    c.Output = f"{i}.csv"
    twint.run.Search(c)
    df = pd.read_csv(f"{i}.csv")
    df['company'] = i
    df.to_csv(f"{i}.csv", index=False)

Has anyone had the same issue and has some tips?
You need to add the configuration parameter Since separately. For example:

c.Since = "2018-01-01"

Similarly for Until:

c.Until = "2017-12-27"

The official documentation might be helpful:

Since (string) - Filter Tweets sent since date, works only with twint.run.Search (Example: 2017-12-27).
Until (string) - Filter Tweets sent until date, works only with twint.run.Search (Example: 2017-12-27).
How to read a CSV file and use it to look up input from a user?
I have a dataset about car accident statistics in a .csv file. I want the user to type in a state, and all of the information about that state gets displayed for the user to see. How can I do that?

Dataset:

State,Population,Vehicle Miles traveled (millions),Fatal Crashes,Deaths
Alabama,"4,903,185","71,735",856,930
Alaska,"731,545","5,881",62,67
Arizona,"7,278,717","70,281",910,981
Arkansas,"3,017,804","37,099",467,505
California,"39,512,223","340,836","3,316","3,606"
Colorado,"5,758,736","54,634",544,596
Connecticut,"3,565,287","31,601",233,249
Delaware,"973,764","10,245",122,132
District of Columbia,"705,749","3,756",22,23
Florida,"21,477,737","226,514","2,950","3,183"
Georgia,"10,617,423","133,128","1,377","1,491"
Hawaii,"1,415,872","11,024",102,108
Idaho,"1,787,065","18,058",201,224
Illinois,"12,671,821","107,525",938,"1,009"
Indiana,"6,732,219","82,719",751,809
Iowa,"3,155,070","33,537",313,336
Kansas,"2,913,314","31,843",362,411
Kentucky,"4,467,673","49,410",667,732
Louisiana,"4,648,794","51,360",681,727
Maine,"1,344,212","14,871",143,157
Maryland,"6,045,680","60,216",484,521
Massachusetts,"6,892,503","64,890",321,334
Michigan,"9,986,857","102,174",902,985
Minnesota,"5,639,632","60,731",333,364
Mississippi,"2,976,149","41,091",581,643
Missouri,"6,137,428","79,168",818,880
Montana,"1,068,778","12,892",166,184
Nebraska,"1,934,408","21,242",212,248
Nevada,"3,080,156","28,794",285,304
New Hampshire,"1,359,711","13,828",90,101
New Jersey,"8,882,190","78,205",525,559
New Mexico,"2,096,829","27,772",368,424
New York,"19,453,561","123,986",876,931
North Carolina,"10,488,084","122,475","1,284","1,373"
North Dakota,"762,062","9,826",91,100
Ohio,"11,689,100","114,694","1,039","1,153"
Oklahoma,"3,956,971","44,648",584,640
Oregon,"4,217,737","35,808",451,489
Pennsylvania,"12,801,989","102,864",990,"1,059"
Rhode Island,"1,059,361","7,581",53,57
South Carolina,"5,148,714","57,939",922,"1,001"
South Dakota,"884,659","9,922",88,102
Tennessee,"6,829,174","82,892","1,040","1,135"
Texas,"28,995,881","288,227","3,294","3,615"
Utah,"3,205,958","32,911",225,248
Vermont,"623,989","7,346",44,47
Virginia,"8,535,519","85,432",774,831
Washington,"7,614,893","62,530",494,519
West Virginia,"1,792,147","19,077",247,260
Wisconsin,"5,822,434","66,348",526,566
Wyoming,"578,759","10,208",120,147
U.S. total,"328,239,523","3,261,774","33,244","36,096"

I'm thinking of something like this:

crashes = 0
stateInput = input("Please enter a State: ")
print("The total number of crashes in", stateInput, "is:", crashes)
I suggest you use the standard library's csv module to read the file, specifically the csv.DictReader class, which converts each row of the CSV file into a Python dictionary, making it easier to process. The following code shows how to do that and also stores all the state dictionaries in a higher-level stateDB dictionary so that the name of each state is associated with the information about it, effectively making it a "database". Note that the state names are converted to all uppercase to simplify look-ups.

import csv
from pprint import pprint

accidents_filepath = 'car_accidents.csv'

stateDB = dict()
with open(accidents_filepath, 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    fieldnames = [name for name in reader.fieldnames if name != 'State']
    for state_info in reader:
        state = state_info['State']
        stateDB[state.upper()] = {key: state_info[key] for key in fieldnames}

#which_state = input("Please enter a state: ")
which_state = 'Mississippi'  # Hardcode for testing purposes.

print(f'Info for the state of {which_state}:')
pprint(stateDB[which_state.upper()])

Output:

Info for the state of Mississippi:
{'Deaths': '643',
 'Fatal Crashes': '581',
 'Population': '2,976,149',
 'Vehicle Miles traveled (millions)': '41,091'}
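One detail worth noting: DictReader leaves every field as a string, so values like '2,976,149' need their thousands separators stripped before you can do arithmetic on them (for example, to total crashes). A minimal sketch; the `to_int` helper name is my own, and the row values are taken from the Mississippi output above:

```python
def to_int(value):
    """Convert a comma-grouped numeric string like '2,976,149' to an int."""
    return int(value.replace(',', ''))

row = {'Population': '2,976,149', 'Fatal Crashes': '581', 'Deaths': '643'}
print(to_int(row['Population']))    # 2976149
print(to_int(row['Fatal Crashes']) + to_int(row['Deaths']))  # 1224
```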
How can I alter the value in a specific column of a certain row in Python without using pandas?
I was playing around with the code provided here: https://www.geeksforgeeks.org/update-column-value-of-csv-in-python/ and couldn't figure out how to change the value in a specific column of a row without it raising an error. Say I wanted to change the status of the row belonging to the name Molly Singh, how would I go about it? I've tried the code below, only to get an error and the CSV file turning out empty. I'd also prefer the solution be without the use of pandas, tysm.

For example, the row in the csv file will originally be:

Sno  Registration Number  Name         RollNo     Status
1    11913907             Molly Singh  RK19TSA01  P

What I want the outcome to be:

Sno  Registration Number  Name         RollNo     Status
1    11913907             Molly Singh  RK19TSA01  N

One more question: if I were to alter the value in column Sno by doing addition/subtraction etc., how would I go about that as well? Thanks! The error I get: as you can see, the name column is changed to true then false etc.

import csv

op = open("AllDetails.csv", "r")
dt = csv.DictReader(op)
print(dt)
up_dt = []
for r in dt:
    print(r)
    row = {'Sno': r['Sno'],
           'Registration Number': r['Registration Number'],
           'Name' == "Molly Singh": r['Name'],
           'RollNo': r['RollNo'],
           'Status': 'P'}
    up_dt.append(row)
print(up_dt)
op.close()

op = open("AllDetails.csv", "w", newline='')
headers = ['Sno', 'Registration Number', 'Name', 'RollNo', 'Status']
data = csv.DictWriter(op, delimiter=',', fieldnames=headers)
data.writerow(dict((heads, heads) for heads in headers))
data.writerows(up_dt)
op.close()
Issues

Your error is because the field name in the input file is misspelled as Regristation rather than Registration. The correction is to just read the header names from the input file and propagate them to the output file, as below. Alternatively, you can change your code to:

headers = ['Sno', 'Regristation Number', 'Name', 'RollNo', 'Status']

"One more question if I were to alter the value in column Sno by doing addition/subtraction etc how would I go about that as well"

I'm not sure what is meant by this. In the code below you would just have:

r['Sno'] = (some computed value)

Code

import csv

with open("AllDetails.csv", "r") as op:
    dt = csv.DictReader(op)
    headers = None
    up_dt = []
    for r in dt:
        # get header of input file
        if headers is None:
            headers = r
        # Change status of 'Molly Singh' record
        if r['Name'] == 'Molly Singh':
            r['Status'] = 'N'
        up_dt.append(r)

with open("AllDetails.csv", "w", newline='') as op:
    # Use headers from input file above
    data = csv.DictWriter(op, delimiter=',', fieldnames=headers)
    data.writerow(dict((heads, heads) for heads in headers))
    data.writerows(up_dt)
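To answer the arithmetic follow-up concretely: since csv reads everything as strings, convert Sno with int() before computing, then assign the result back before writing. A minimal runnable sketch, using io.StringIO as an in-memory stand-in for AllDetails.csv (the sample rows are made up, with only three columns for brevity):

```python
import csv
import io

# In-memory stand-in for AllDetails.csv (values are illustrative).
src = io.StringIO("Sno,Name,Status\n1,Molly Singh,P\n2,Alex Roy,P\n")
out = io.StringIO()

reader = csv.DictReader(src)
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
for r in reader:
    r['Sno'] = int(r['Sno']) + 100   # arithmetic on the Sno column
    if r['Name'] == 'Molly Singh':
        r['Status'] = 'N'
    writer.writerow(r)

print(out.getvalue())
```

With real files, the same loop body works unchanged inside the with-blocks shown in the answer above.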
How to prevent a for loop from erasing its output every time the loop restarts?
Background: I am creating a tweet scraper, using snscrape, to scrape tweets from sitting government representatives in the House and Senate. I am scanning the scraped tweets for keywords related to "cybersecurity" and "privacy", using a dictionary of words to scan for. Usually I would have many more members in the username list, but I am testing with a low number to get it working first.

The problem: I have set up nested for loops to run through each username to check and the dictionary of words to scan for. The output only shows the last person in my username list, and I can't find out why. It's as if every time the for loop restarts, it erases the last person it just checked.

The code:

import os
import pandas as pd

tweet_count = 500
username = ["SenShelby", "Ttuberville", "SenDanSullivan"]
text_query = ["cybersecurity", "cyber security", "internet privacy", "online privacy", "computer security", "health privacy", "privacy", "security breach", "firewall", "data"]
since_date = "2016-01-01"
until_date = "2021-10-14"

for person in username:
    for word in text_query:
        os.system("snscrape --jsonl --progress --max-results {} --since {} twitter-search '{} from:{} until:{}' > user-tweets.json".format(tweet_count, since_date, word, person, until_date))
        tweets_framework = pd.read_json('user-tweets.json', lines=True)
        tweets_framework.to_csv('user-tweets.csv', sep=',', index=False)

Any help would be greatly appreciated!
First, you should use a unique name for each user's JSON. Second, you need to run the JSON-to-CSV conversion for each user (if this is what you are trying to do):

for person in username:
    for word in text_query:
        filename = '{}-{}-tweets'.format(person, word)
        os.system("snscrape --jsonl --progress --max-results {} --since {} twitter-search '{} from:{} until:{}' > {}.json".format(tweet_count, since_date, word, person, until_date, filename))
        tweets_framework = pd.read_json('{}.json'.format(filename), lines=True)
        tweets_framework.to_csv('{}.csv'.format(filename), sep=',', index=False)
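The core of the fix is simply deriving a distinct filename per (user, keyword) pair, so no pass of the loop overwrites another's output. That naming scheme can be illustrated on its own, without any scraping (usernames and keywords are from the question, shortened to two keywords here):

```python
username = ["SenShelby", "Ttuberville", "SenDanSullivan"]
text_query = ["cybersecurity", "privacy"]

# One output name per (user, keyword) pair, so no pass overwrites another.
filenames = ['{}-{}-tweets'.format(person, word)
             for person in username
             for word in text_query]

print(filenames[0])    # SenShelby-cybersecurity-tweets
print(len(filenames))  # 6
```

Each name is then used for both the .json the shell command writes and the .csv pandas produces, so all results survive the loop.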
How to iterate through data present in a spreadsheet in Selenium
I am automating a web page where data like name and date of birth is entered into web pages from a spreadsheet. I am able to run it if there is a single record in the spreadsheet; the issue is I don't know how to iterate through the spreadsheet if I have n records. Below, as of now, I am able to fetch 1 record; I would like to fetch both records 1 & 2.

Example as in spreadsheet:

Record  Name   DOB
1       TEST1  05/06/2010
2       TEST2  06/05/2010

Please find the code I have tried so far:

driver = webdriver.Chrome('path')
driver.fullscreen_window()
driver.get('url')
time.sleep(5)

Workbook = xlrd.open_workbook("excelpath")
Customerdetails = Workbook.sheet_by_index(0)
Storagezipcode = Customerdetails.cell(1, 0)
text_area = driver.find_element_by_xpath('(//*[@id="QuoteDetails/PostalCode"])[2]')
text_area.send_keys(Storagezipcode.value)
Nextbutton = driver.find_element_by_xpath('//span[contains(text(),"Next")]')
Nextbutton.click()
time.sleep(10)

# carlink page
CarLink = driver.find_element_by_xpath('//span[@class="link"]')
CarLink.click()
time.sleep(30)

# ModelYear page
yeardetail = Customerdetails.cell(1, 14)
Yearlink = driver.find_element_by_xpath("//span[normalize-space(.)='{}']".format(yeardetail.value))
Yearlink.click()
time.sleep(10)

# Company name
SubModel = Customerdetails.cell(1, 15)
SubModellink = driver.find_element_by_xpath("//span[normalize-space(.)='{}']".format(SubModel.value))
SubModellink.click()
time.sleep(10)

# Company model
Bodymodel = Customerdetails.cell(1, 16)
Bodylink = driver.find_element_by_xpath("//span[normalize-space(.)='{}']".format(Bodymodel.value))
Bodylink.click()
time.sleep(10)
Use the row count to get all the rows, then iterate and use the column index:

import xlrd

Workbook = xlrd.open_workbook("excelpath")
Customerdetails = Workbook.sheet_by_index(0)

for i in range(1, Customerdetails.nrows):
    Record = Customerdetails.cell(i, 0)
    print(Record.value)
    Name = Customerdetails.cell(i, 1)
    print(Name.value)
    DOB = Customerdetails.cell(i, 2)
    print(DOB.value)
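Applied to the question, the whole Selenium form-filling sequence would go inside that loop, with `i` replacing the hard-coded row 1 in each `cell(1, ...)` call. The loop structure itself can be illustrated without xlrd or a browser, using a plain list of tuples as a stand-in for the sheet (the two records are taken from the question; appending to a list stands in for submitting the form):

```python
# Stand-in for the spreadsheet: row 0 is the header row, as in the question.
rows = [
    ("Record", "Name", "DOB"),
    (1, "TEST1", "05/06/2010"),
    (2, "TEST2", "06/05/2010"),
]

processed = []
for i in range(1, len(rows)):     # skip the header, like range(1, nrows)
    record, name, dob = rows[i]   # column indexes 0, 1, 2
    processed.append(name)        # here you would fill and submit the web form

print(processed)  # ['TEST1', 'TEST2']
```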