I'm trying to create code that checks if the value in the index column of a CSV is equivalent in different rows, and if so, find the most occurring values in the other columns and use those as the final data. Not a very good explanation, basically I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn that if there are values with the same number of occurrences (Month and B for customer 1004) how can I choose which one I want to be outputted?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print df
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create this (I was ignoring month previously, but now I'd like to incorporate it so that I can learn how to not only find the mode of a column of numbers, but also the most occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC, it seems to be in the wrong variable format. I want each column to be outputted as just the 3 digit number.
Edit: I'm also having an issue that if the value in column A is 0 then the output becomes 2 digits and does not incorporate the leading 0.
What about something like this? This is not using Pandas though, I am not a Pandas expert.
from collections import Counter
dataDict = {}
# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
for line in dataFile:
# split the line by ',' since it is a csv file...
entry = line.split(',')
# Check to make sure that there is data in the line
if entry and len(entry[0])>0:
# if the customer_id is not in dataDict, add it
if entry[0] not in dataDict:
dataDict[entry[0]] = {'month':[entry[1]],
'time':[entry[2]],
'ABC':[''.join(entry[3:])],
}
# customer_id is already in dataDict, add values
else:
dataDict[entry[0]]['month'].append(entry[1])
dataDict[entry[0]]['time'].append(entry[2])
dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))
# Now write the output file
with open('out.csv','w') as f:
# Loop through sorted customers
for customer in sorted(dataDict.keys()):
# use Counter to find the most common entries
commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
# Write the line to the csv file
f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))
It generates a file called out.csv that looks like this:
1003,Jan,2:00,114,
1004,Feb,8:00,251,
customer_ID,month,time,ABC,
Related
I have a CSV file with 3 rows: "Username", "Date", "Energy saved" and I would like to sum the "Energy saved" of a specific user by date.
For example, if username = 'merrytan', how can I print all the rows with "merrytan" such that the total energy saved is aggregated by date? (Date: 24/2/2022 Total Energy saved = 1001 , Date: 24/2/2022 Total Energy saved = 700)
I am a beginner at python and typically, I would use pandas to resolve this issue but it is not allowed for this project so I am at a complete loss on where to even begin. I would appreciate any help and guidance. Thank you.
My alternative to opening csv files is to use csv module of native python. You read them as a "file" and just extract the values that you need. I filter using the first column and keep only keep the equal index values from the concerned column. (which is thrid and index 2.)
import csv
energy_saved = []
with open(r"D:\test_stack.csv", newline="") as csvfile:
file = csv.reader(csvfile)
for row in file:
if row[0]=="merrytan":
energy_saved.append(row[2])
energy_saved = sum(map(int, energy_saved))
Now you have a list of just concerned values, and you can sum them afterwards.
Edit - So, I just realized that I left out the time part of your request completely lol. Here's the update.
import csv
my_dict = {}
with open(r"D:\test_stack.csv", newline="") as file:
for row in csv.reader(file):
if row[0]=="merrytan":
my_dict[row[1]] = my_dict.get(row[1], 0) + int(row[2])
So, we need to get the date column of the file as well. We need to make a presentation of two "rows" but when Pandas has been prohibited, we will go to dictionary with date as keys and energy as values.
But your date column has repeated values (regardless intended or else) and Dictionaries require keys to be unique. So, we use a loop. You add one date value after another as key and corresponding energy as value to the new dictionary, but when it is already present, you will sum with the existing value instead.
I would turn your CSV file into a two-level dictionary, with username and then date as the keys
infile = open("data.csv", "r").readlines()
savings = dict()
# Skip the first line of the CSV, since that has the column names
# not data
for row in infile[1:]:
username, date_col, saved = row.strip().split(",")
saved = int(saved)
if username in savings:
if date_col in savings[username]:
savings[username][date_col] = savings[username][date_col] + saved
else:
savings[username][date_col] = saved
else:
savings[username] = {date_col: saved}
I have a csv file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like to have a code to list the variables in a specific column.
For example, I'd like it to return {Kish, Qeshm, Tabriz} for the 'city' column.
You need to first to import the csv module into your python file and read over each row in the file and save it in a list, so it'll be like
import csv
cities = []
with open("yourfile.csv", "r") as file:
reader = csv.DictReader(file) //This will save the values in the very top of the csv file as header so it will skip a line
for row in reader:
city = row["City"]
cities.append(city)
this will give you a list of cities=[Kish, Qesh, Tabriz, ....]
It appears you want to remove duplicates as well, which you can have by simply cast the finished list to set. Here's how to do it with pandas:
import pandas as pd
cities = pd.read_csv('yourfile.csv', usecols=['City'])['City']
# just cast to list if you want a plain list instead of a DataFrame
cities_list = list(cities)
# use set to remove the duplicates
unique_cities = set(cities)
In case you have need to preserve ordering, you might use an ordered dict with just keys.
Also, in case you're short on memory trying to read 5M rows in one go, you can read them in chuncks:
import pandas as pd
cities_chunks_list = [chunck['City'] for chunck in pd.read_csv('yourfile.csv', usecols=['City'], chunksize = 1000)]
#let's flatten the list
cities_list = [city for cities_chunk in cities_chunks_list for city in cities_chunk]
Hope I helped.
Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
dataframe from a csv file
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because the , the pandas dataframe has a single column, value, which has entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So, the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to read in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
f = open(file,'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
## remove obfuscating characters
for c in '"{}\n':
line = line.replace(c,'')
line = line.split(',')
## extract values to two lists
id_vals.append(line[0][3:])
currency.append(line[1][9:])
You just need to clean up the CSV file a little and you are good. Here is every step:
# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
my_csv_text = f.read()
# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:','value']
replace_str = ''
for i in find_str:
my_csv_text = re.sub(i, replace_str, my_csv_text)
# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv' # or whatever path and name you want
with open(new_csv_path, 'w') as f:
f.write(my_csv_text)
# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to extract each row of your dataframe using json.loads() or eval()
something like this:
import json
for row in df.iteritems():
print(json.loads(row.value)["id"])
# OR
print(eval(row.value)["id"])
I wish to read a csv file which has the following format using pandas:
atrrth
sfkjbgksjg
airuqghlerig
Name Roll
airuqgorqowi
awlrkgjabgwl
AAA 67
BBB 55
CCC 07
As you can see, if I use pd.read_csv, I get the fairly obvious error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
But I wish to get the entire data into a dataframe. Using error_bad_lines = False will remove the important stuff and leave only the garbage values
These are the 2 of the possible column names as given below :
Name : [Name , NAME , Name of student]
Roll : [Rollno , Roll , ROLL]
How to achieve this?
Open the csv file and find a row from where the column name starts:
with open(r'data.csv') as fp:
skip = next(filter(
lambda x: x[1].startswith(('Name','NAME')),
enumerate(fp)
))[0]
The value will be stored in skip parameter
import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)
Works in Python 3.X
I would like to suggest a slight modification/simplification to #RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:
with open(r'data.csv') as fp:
names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
df = pd.read_csv(fp, names=names.strip().split())
This has an advantage for files with large numbers of trash lines.
A more detailed check could be something like this:
def isheader(line):
items = line.strip().split()
if len(items) != 2:
return False
items = sorted(map(str.casefold, items))
return items[0].startswith('name') and items[1].startswith('roll')
This function will handle all your possibilities, in any order, but also currently skip trash lines with spaces in them. You would use it as a filter:
names = next(line for line in fp if isheader(line))
If that's indeed the structure (and not just an example of what sort of garbage one can get), you can simply use skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:
import pandas as pd
df = pd.read_csv('your.csv', skiprows=3)
Mind that skiprows can do much more. Check the docs.
I'm currently trying to run through my csv file and identify the rows in a column.
The output should be something like "This column contains alpha characters only".
My code currently:
Within a method I have:
print('\nREGULAR EXPRESSIONS\n' +
'----------------------------------')
for x in range(0, self.tot_col):
print('\n' + self.file_list[0][x] +
'\n--------------') # Prints the column name
for y in range(0, self.tot_rows + 1):
if regex.re_alpha(self.file_list[y][x]) is True:
true_count += 1
else:
false_count += 1
if true_count > false_count:
percentage = (true_count / self.tot_rows) * 100
print(str(percentage) + '% chance that this column is alpha only')
true_count = 0
false_count = 0
self.file_list is the csv file in list format.
self.tot_rows & self.tot_col are the total rows and total columns respectively which has been calculated earlier within the program.
regex.re_alpha has been imported from a file and the method looks like:
def re_alpha(column):
# Checks alpha characters
alpha_valid = alpha.match(column)
if alpha_valid:
return True
else:
return False
This currently works, however I am unable to add my other regex checks such as alpha, numeric etc
I have tried to duplicate the if statement with a different regex check but it doesn't work.
I've also tried to do the counts in the regex.py file however the count stops at '1' and returns the wrong information..
I thought creating a class in the regex.py file would help however no avail.
Summary:
I would like to run multiple regex checks against my csv file and have them ordered via columns.
Thanks in advance.
From the code above, the first line of the CSV contains the column names. This means you could make a dictionary to contain your result where the keys are the column names.
from csv import DictReader
reader = DictReader(open(filename)) # filename is the name of the CSV file
results = {}
for row in reader:
for col_name, value in row.items():
results.setdefault(col_name, []).append(regex.re_alpha(value))
Now you have a dictionary called 'results' which has the output from the regex checks stored by column name. You can then output statistics. Or you could save the rows as you read them in a list and once you decide on an order you can go back and output rows to a new CSV file by outputting the items in each dictionary using the keys in the new order.
csv_writer = csv.writer(open(output_filename, 'w'))
new_order = [list of key names in the right order]
for row in saved_data:
new_row = map(row.get, new_order)
csv_writer.writerow(new_row)
Admittedly this is a bit of a sketch but it should get you going.