Cleaning .csv text data in Python

I recently created a Python program that imports my finances from a .csv file and transfers them onto Google Sheets. However, I am struggling to figure out how to clean up the names that my bank gives me.
Example:
ME DC SI XXXXXXXXXXXXXXXX NETFLIX should just be NETFLIX,
POS XXXXXXXXXXXXXXXX STEAM PURCHASE should just be STEAM, and so on.
Forgive me if this is a stupid question, as I am a newbie when it comes to coding and am just looking to use it to automate certain situations in my life.
import csv
import time

import gspread

MONTH = 'June'  # Set month name
file = f'HDFC_{MONTH}_2022.csv'  # The file we need to extract data from
transactions = []  # Create empty list to add data to

def hdfcFin(file):
    '''Create a function that allows us to export data to Google Sheets'''
    with open(file, mode='r') as csv_file:
        csv_reader = csv.reader(csv_file)
        for row in csv_reader:
            date = row[0]
            name = row[1]
            expense = float(row[2])
            income = float(row[3])
            category = 'other'
            transaction = (date, name, expense, income, category)
            transactions.append(transaction)
    return transactions
sa = gspread.service_account()
# Connect the JSON credentials to the API
sh = sa.open('Personal Finances')
wks = sh.worksheet(f'{MONTH}')
rows = hdfcFin(file)
for row in rows:
    wks.insert_row([row[0], row[1], row[4], row[2], row[3]], 8)
    time.sleep(2)  # Time delay because of API restrictions

If you don't have a specific format to identify the name, you can use the logic below, which keeps a key-value mapping: if a key matches the name, replace it with its value.
d = {'ME DC SI XXXXXXXXXXXXXXXX NETFLIX': 'NETFLIX',
     'POS XXXXXXXXXXXXXXXX STEAM PURCHASE': 'STEAM'}
test = 'POS XXXXXXXXXXXXXXXX STEAM PURCHASE'
if test in d:
    test = d[test]
print(test)
Output:
STEAM
If the requirement is to fetch only the last word of the name, you can use the logic below.
test = 'ME DC SI XXXXXXXXXXXXXXXX NETFLIX'
test = test.split(" ")[-1]
print(test)
Output:
NETFLIX
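In practice the masked portion of a bank description varies, so an exact-match dictionary may miss entries. A minimal sketch combining both ideas above (the keyword table and the fallback rule are assumptions, not part of the original answer): scan each description for a known merchant keyword, and fall back to the last word when nothing matches.

```python
# Map of merchant keywords to clean names; extend as needed.
KNOWN_MERCHANTS = {'NETFLIX': 'NETFLIX', 'STEAM': 'STEAM'}

def clean_name(description):
    """Return a clean merchant name for a raw bank description."""
    for keyword, clean in KNOWN_MERCHANTS.items():
        if keyword in description:      # substring match survives the XXXX masks
            return clean
    return description.split()[-1]      # fallback: keep the last word

print(clean_name('ME DC SI XXXXXXXXXXXXXXXX NETFLIX'))    # NETFLIX
print(clean_name('POS XXXXXXXXXXXXXXXX STEAM PURCHASE'))  # STEAM
```

This keeps the explicit mapping authoritative while still producing something reasonable for unseen descriptions.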


Why does Twint's "since" & "until" not work?

I'm trying to get all tweets from 2018-01-01 until now for various firms.
My code runs, but I do not get the tweets from the full time range. Sometimes I only get the tweets from today and yesterday, or from mid-April up to now, but never since the beginning of 2018. I then get the message: [!] No more data! Scraping will stop now.
ticker = []
# Read the csv file with company tickers into a list
with open('C:\\Users\\veron\\Desktop\\Test.csv', newline='') as inputfile:
    for row in csv.reader(inputfile):
        ticker.append(row[0])

# Getting tweets for every ticker in the list
for i in ticker:
    searchstring = f"{i} since:2018-01-01"
    c = twint.Config()
    c.Search = searchstring
    c.Lang = "en"
    c.Panda = True
    c.Custom["tweet"] = ["date", "username", "tweet"]
    c.Store_csv = True
    c.Output = f"{i}.csv"
    twint.run.Search(c)
    df = pd.read_csv(f"{i}.csv")
    df['company'] = i
    df.to_csv(f"{i}.csv", index=False)
Has anyone had the same issue and have some tips?
You need to add the configuration parameter Since separately, rather than embedding it in the search string. For example:
c.Since = "2018-01-01"
Similarly for Until:
c.Until = "2017-12-27"
The official documentation might be helpful.
Since (string) - Filter Tweets sent since date, works only with twint.run.Search (Example: 2017-12-27).
Until (string) - Filter Tweets sent until date, works only with twint.run.Search (Example: 2017-12-27).
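Applied to the asker's loop, a corrected configuration might look like the sketch below (the ticker value and output name are placeholders; this is an untested configuration fragment, since running it requires twint and network access):

```python
import twint

c = twint.Config()
c.Search = "AAPL"         # the ticker alone; no "since:" inside the query
c.Lang = "en"
c.Since = "2018-01-01"    # date range set through the config, per the docs
c.Until = "2021-10-14"
c.Store_csv = True
c.Output = "AAPL.csv"
twint.run.Search(c)
```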

How to use a CSV file and use the CSV file to have an input from a user?

I have a dataset of car accident statistics in a .csv file. I want the user to type in a State and have all of the information about that State displayed. How can I do that?
Dataset:
State,Population,Vehicle Miles traveled (millions),Fatal Crashes,Deaths
Alabama,"4,903,185","71,735",856,930
Alaska,"731,545","5,881",62,67
Arizona,"7,278,717","70,281",910,981
Arkansas,"3,017,804","37,099",467,505
California,"39,512,223","340,836","3,316","3,606"
Colorado,"5,758,736","54,634",544,596
Connecticut,"3,565,287","31,601",233,249
Delaware,"973,764","10,245",122,132
District of Columbia,"705,749","3,756",22,23
Florida,"21,477,737","226,514","2,950","3,183"
Georgia,"10,617,423","133,128","1,377","1,491"
Hawaii,"1,415,872","11,024",102,108
Idaho,"1,787,065","18,058",201,224
Illinois,"12,671,821","107,525",938,"1,009"
Indiana,"6,732,219","82,719",751,809
Iowa,"3,155,070","33,537",313,336
Kansas,"2,913,314","31,843",362,411
Kentucky,"4,467,673","49,410",667,732
Louisiana,"4,648,794","51,360",681,727
Maine,"1,344,212","14,871",143,157
Maryland,"6,045,680","60,216",484,521
Massachusetts,"6,892,503","64,890",321,334
Michigan,"9,986,857","102,174",902,985
Minnesota,"5,639,632","60,731",333,364
Mississippi,"2,976,149","41,091",581,643
Missouri,"6,137,428","79,168",818,880
Montana,"1,068,778","12,892",166,184
Nebraska,"1,934,408","21,242",212,248
Nevada,"3,080,156","28,794",285,304
New Hampshire,"1,359,711","13,828",90,101
New Jersey,"8,882,190","78,205",525,559
New Mexico,"2,096,829","27,772",368,424
New York,"19,453,561","123,986",876,931
North Carolina,"10,488,084","122,475","1,284","1,373"
North Dakota,"762,062","9,826",91,100
Ohio,"11,689,100","114,694","1,039","1,153"
Oklahoma,"3,956,971","44,648",584,640
Oregon,"4,217,737","35,808",451,489
Pennsylvania,"12,801,989","102,864",990,"1,059"
Rhode Island,"1,059,361","7,581",53,57
South Carolina,"5,148,714","57,939",922,"1,001"
South Dakota,"884,659","9,922",88,102
Tennessee,"6,829,174","82,892","1,040","1,135"
Texas,"28,995,881","288,227","3,294","3,615"
Utah,"3,205,958","32,911",225,248
Vermont,"623,989","7,346",44,47
Virginia,"8,535,519","85,432",774,831
Washington,"7,614,893","62,530",494,519
West Virginia,"1,792,147","19,077",247,260
Wisconsin,"5,822,434","66,348",526,566
Wyoming,"578,759","10,208",120,147
U.S. total,"328,239,523","3,261,774","33,244","36,096"
I'm thinking of something like this:
crashes = 0
stateInput = input("Please enter a State: ")
print("The total number of crashes in", stateInput, "is:", crashes)
I suggest you use the standard library's csv module to read the file, specifically the csv.DictReader class, which converts each row of the CSV file into a Python dictionary and makes processing easier.
The following code shows how to do that, and also stores all the state dictionaries in a higher-level stateDB dictionary so that the name of each state is associated with the information about it, effectively making it a "database". Note that the state names are converted to all uppercase to simplify look-ups.
import csv
from pprint import pprint

accidents_filepath = 'car_accidents.csv'

stateDB = dict()
with open(accidents_filepath, 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    fieldnames = [name for name in reader.fieldnames if name != 'State']
    for state_info in reader:
        state = state_info['State']
        stateDB[state.upper()] = {key: state_info[key] for key in fieldnames}

#which_state = input("Please enter a state: ")
which_state = 'Mississippi'  # Hardcode for testing purposes.
print(f'Info for the state of {which_state}:')
pprint(stateDB[which_state.upper()])
Output:
Info for the state of Mississippi:
{'Deaths': '643',
 'Fatal Crashes': '581',
 'Population': '2,976,149',
 'Vehicle Miles traveled (millions)': '41,091'}
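One caveat: if the user types a state that is not in the file, the plain dictionary look-up raises KeyError. A small guard (an addition, not part of the original answer) keeps the look-up friendly:

```python
def lookup_state(stateDB, name):
    """Return a state's record, or None if the name is unknown."""
    return stateDB.get(name.strip().upper())

# Tiny stand-in database for illustration.
stateDB = {'ALASKA': {'Deaths': '67', 'Fatal Crashes': '62'}}

print(lookup_state(stateDB, '  alaska '))   # {'Deaths': '67', 'Fatal Crashes': '62'}
print(lookup_state(stateDB, 'Atlantis'))    # None
```

The strip()/upper() normalization matches the uppercase keys used when stateDB was built.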

How can I alter the value in a specific column of a certain row in python without the use of pandas?

I was playing around with the code provided here: https://www.geeksforgeeks.org/update-column-value-of-csv-in-python/ and couldn't seem to figure out how to change the value in a specific column of a row without it raising an error.
Say I wanted to change the status of the row belonging to the name Molly Singh, how would I go about it? I've tried the following below, only to get an error and the CSV file turning out empty. I'd also prefer a solution without the use of pandas, tysm.
For example the row in the csv file will originally be
Sno Registration Number Name RollNo Status
1 11913907 Molly Singh RK19TSA01 P
What I want the outcome to be
Sno Registration Number Name RollNo Status
1 11913907 Molly Singh RK19TSA01 N
One more question: if I were to alter the value in the Sno column by doing addition/subtraction etc., how would I go about that as well? Thanks!
In the error I get, as you can see, the Name column is changed to True, then False, etc.
import csv

op = open("AllDetails.csv", "r")
dt = csv.DictReader(op)
print(dt)
up_dt = []
for r in dt:
    print(r)
    row = {'Sno': r['Sno'],
           'Registration Number': r['Registration Number'],
           'Name' == "Molly Singh": r['Name'],
           'RollNo': r['RollNo'],
           'Status': 'P'}
    up_dt.append(row)
print(up_dt)
op.close()
op = open("AllDetails.csv", "w", newline='')
headers = ['Sno', 'Registration Number', 'Name', 'RollNo', 'Status']
data = csv.DictWriter(op, delimiter=',', fieldnames=headers)
data.writerow(dict((heads, heads) for heads in headers))
data.writerows(up_dt)
op.close()
Issues
Your error is because the field name in the input file is misspelled as Regristation rather than Registration.
The correction is to just read the header names from the input file and propagate them to the output file, as below.
Alternatively, you can change your code to:
headers = ['Sno', 'Regristation Number', 'Name', 'RollNo', 'Status']
"One more question: if I were to alter the value in the Sno column by doing addition/subtraction etc., how would I go about that as well?"
I'm not sure exactly what is meant by this, but in the code below you would just have:
r['Sno'] = (some computed value)
Code
import csv

with open("AllDetails.csv", "r") as op:
    dt = csv.DictReader(op)
    headers = None
    up_dt = []
    for r in dt:
        # Get header names from the input file
        if headers is None:
            headers = r
        # Change status of 'Molly Singh' record
        if r['Name'] == 'Molly Singh':
            r['Status'] = 'N'
        up_dt.append(r)

with open("AllDetails.csv", "w", newline='') as op:
    # Use headers from the input file above
    data = csv.DictWriter(op, delimiter=',', fieldnames=headers)
    data.writerow(dict((heads, heads) for heads in headers))
    data.writerows(up_dt)
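On the follow-up about arithmetic on a numeric column such as Sno: the values DictReader yields are strings, so convert before computing. A sketch with an in-memory file standing in for AllDetails.csv (the +100 is an arbitrary example):

```python
import csv
import io

# In-memory stand-in for AllDetails.csv.
raw = "Sno,Name,Status\n1,Molly Singh,P\n2,Jane Doe,P\n"

rows = list(csv.DictReader(io.StringIO(raw)))
for r in rows:
    # Convert to int, compute, store back as text for writing out.
    r['Sno'] = str(int(r['Sno']) + 100)

print([r['Sno'] for r in rows])   # ['101', '102']
```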

How to prevent a for loop from erasing its output every time the loop restarts?

Background: I am creating a tweet scraper, using snscrape, to scrape tweets from sitting government representatives in the House and Senate. I scan the scraped tweets for keywords related to "cybersecurity" and "privacy", using a dictionary of words to scan for. Usually I would have many more members in the username list, but I am trying to test with a low number to get it working first.
The problem: I have set up nested for loops to run through each username and the dictionary of words to scan for. The output only shows the last person in my username list, and I can't find out why. It's like every time the for loop restarts, it erases the results for the person it just checked.
The code:
import os
import pandas as pd

tweet_count = 500
username = ["SenShelby", "Ttuberville", "SenDanSullivan"]
text_query = ["cybersecurity", "cyber security", "internet privacy", "online privacy", "computer security", "health privacy", "privacy", "security breach", "firewall", "data"]
since_date = "2016-01-01"
until_date = "2021-10-14"

for person in username:
    for word in text_query:
        os.system("snscrape --jsonl --progress --max-results {} --since {} twitter-search '{} from:{} until:{}' > user-tweets.json".format(tweet_count, since_date, word, person, until_date))
        tweets_framework = pd.read_json('user-tweets.json', lines=True)
        tweets_framework.to_csv('user-tweets.csv', sep=',', index=False)
Any help would be greatly appreciated!
First, you should use a unique file name for each user's JSON. Second, you need to run the JSON-to-CSV conversion for each user (if that is what you are trying to do):
for person in username:
    for word in text_query:
        filename = '{}-{}-tweets'.format(person, word)
        os.system("snscrape --jsonl --progress --max-results {} --since {} twitter-search '{} from:{} until:{}' > {}.json".format(tweet_count, since_date, word, person, until_date, filename))
        tweets_framework = pd.read_json('{}.json'.format(filename), lines=True)
        tweets_framework.to_csv('{}.csv'.format(filename), sep=',', index=False)
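The key change is that the output path now varies per (user, word) pair, so each run writes a fresh file instead of overwriting the last one. The naming scheme can be checked in isolation, without invoking snscrape:

```python
username = ["SenShelby", "Ttuberville"]
text_query = ["cybersecurity", "privacy"]

# One distinct base name per (user, word) pair.
filenames = ['{}-{}-tweets'.format(person, word)
             for person in username
             for word in text_query]

print(filenames[0])    # SenShelby-cybersecurity-tweets
print(len(filenames))  # 4 -- nothing gets overwritten
```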

How to iterate through data present in a spreadsheet in Selenium

I am automating a web page where data such as name and date of birth is entered from a spreadsheet. I am able to run it if there is a single record in the spreadsheet; the issue is I don't know how to iterate through the spreadsheet when it has n records.
Below, as of now, I am able to fetch record 1; I would like to fetch both records 1 and 2.
Example as in spread sheet:
Record Name DOB
1 TEST1 05/06/2010
2 TEST2 06/05/2010
Please find the code I have tried so far:
driver = webdriver.Chrome('path')
driver.fullscreen_window()
driver.get('url')
time.sleep(5)
Workbook = xlrd.open_workbook("excelpath")
Customerdetails = Workbook.sheet_by_index(0)
Storagezipcode = Customerdetails.cell(1, 0)
text_area = driver.find_element_by_xpath('(//*[@id="QuoteDetails/PostalCode"])[2]')
text_area.send_keys(Storagezipcode.value)
Nextbutton = driver.find_element_by_xpath('//span[contains(text(),"Next")]')
Nextbutton.click()
time.sleep(10)
# Car link page
CarLink = driver.find_element_by_xpath('//span[@class="link"]')
CarLink.click()
time.sleep(30)
# Model year page
yeardetail = Customerdetails.cell(1, 14)
Yearlink = driver.find_element_by_xpath("//span[normalize-space(.)='{}']".format(yeardetail.value))
Yearlink.click()
time.sleep(10)
# Company name
SubModel = Customerdetails.cell(1, 15)
SubModellink = driver.find_element_by_xpath("//span[normalize-space(.)='{}']".format(SubModel.value))
SubModellink.click()
time.sleep(10)
# Company model
Bodymodel = Customerdetails.cell(1, 16)
Bodylink = driver.find_element_by_xpath("//span[normalize-space(.)='{}']".format(Bodymodel.value))
Bodylink.click()
time.sleep(10)
Use the row count to get all the rows, then iterate and use the column index:
import xlrd

Workbook = xlrd.open_workbook("excelpath")
Customerdetails = Workbook.sheet_by_index(0)
for i in range(1, Customerdetails.nrows):
    Record = Customerdetails.cell(i, 0)
    print(Record.value)
    Name = Customerdetails.cell(i, 1)
    print(Name.value)
    DOB = Customerdetails.cell(i, 2)
    print(DOB.value)
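To drive the browser once per record, the page-filling steps can be wrapped in a function and called from the same loop. The structure is sketched below with a plain list standing in for the worksheet, since the real Selenium calls need a live browser; in the actual script the rows would come from Customerdetails.cell(i, col).value inside range(1, Customerdetails.nrows):

```python
def process_record(record, name, dob):
    """Stand-in for the form-filling steps; the real script would call
    send_keys()/click() here with this row's values."""
    return f"submitted {name} ({dob})"

# Plain rows standing in for the worksheet.
rows = [(1, "TEST1", "05/06/2010"), (2, "TEST2", "06/05/2010")]

results = [process_record(*row) for row in rows]
print(results[0])    # submitted TEST1 (05/06/2010)
print(len(results))  # 2 -- every record is handled, not just the first
```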
