Python parsing large CSV file for usernames [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 1 year ago.
I have a very large CSV file (50k+ lines).
This file contains IRC logs and here's the data format:
1st column: Message type (1 for message, 2 for system)
2nd column: Timestamps (numbers of seconds since a precise date)
3rd column: Username of the one writing the message
4th column: Message
Here's an example of the data:
1,1382445487956,"bob","i don't know how to do such a task"
1,1382025765196,"alice","bro ask stackoverflow"
1,1382454875476,"_XxCoder_killerxX_","I'm pretty sure it can be done with python, bob"
2,1380631520410,"helloman","helloman_ join the chan."
For example, _XxCoder_killerxX_ mentioned bob.
So, knowing all of this, I want to know which pair of usernames mentioned each other the most.
I only want actual messages to be counted, so I only need to work on lines starting with the number "1" (there are a bunch of lines starting with "2" and other irrelevant numbers).
I know it can be done with the csv Python module, but I've never worked with such large files, so I really don't know how to start.

You should perform two passes over the CSV: one to capture all sender usernames, and a second to find those usernames mentioned in messages.
import csv

users = set()
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    for line in reader:
        if line[0] != "1":
            continue  # only type-1 lines are actual messages
        users.add(line[2])

mentions = {}
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    for line in reader:
        if line[0] != "1":
            continue  # only type-1 lines are actual messages
        sender, message = line[2], line[3]
        for recipient in users:
            if recipient == sender:
                continue  # can't mention yourself
            if recipient in message:
                key = (sender, recipient)
                mentions[key] = mentions.get(key, 0) + 1

for mention, times in mentions.items():
    print(f"{mention[0]} mentioned {mention[1]} {times} time(s)")

totals = {}
for mention, times in mentions.items():
    key = tuple(sorted(mention))
    totals[key] = totals.get(key, 0) + times

for names, times in totals.items():
    print(f"{names[0]} and {names[1]} mentioned each other {times} time(s)")
This example is naive, as it's performing simple substring matches. So, if there's someone named "foo" and someone mentions "food" in a message, it will indicate a match.
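One way to tighten that up (just a sketch, assuming usernames consist of word characters like those in the sample) is to match whole words with a compiled regex instead of a plain substring test:

import re

# one word-boundary pattern per user; re.escape guards against
# regex metacharacters in names
patterns = {user: re.compile(r"\b" + re.escape(user) + r"\b") for user in users}

# replaces the "if recipient in message" test in the second pass
if patterns[recipient].search(message):
    key = (sender, recipient)
    mentions[key] = mentions.get(key, 0) + 1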

Here is a solution using pandas and sets. The use of pandas significantly simplifies the import and manipulation of csv data, and the use of sets allows one to count {'alice', 'bob'} and {'bob', 'alice'} as two occurrences of the same combination.
import pandas as pd

df = pd.read_csv('sample.csv', header=None)
df.columns = ['id', 'timestamp', 'username', 'message']

lst = []
for name in df.username:
    for i, m in enumerate(df.message):
        author = df.iloc[i, 2]
        if name != author and name in m:  # skip self-mentions
            lst.append({author, name})

most_freq = max(lst, key=lst.count)
print(most_freq)
# {'bob', '_XxCoder_killerxX_'}
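As a side note, max(lst, key=lst.count) rescans the whole list for every element, which is quadratic in the number of mentions. For large logs, a one-pass count with collections.Counter over hashable frozensets is a possible alternative (a sketch under the same assumptions as above):

from collections import Counter

# frozenset is hashable, so {'alice', 'bob'} and {'bob', 'alice'}
# collapse into a single key
counts = Counter(frozenset(pair) for pair in lst)
most_freq, times = counts.most_common(1)[0]
print(set(most_freq), times)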

Related

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
list.append(line)
if b in line:
1. Concatenate this line with each line after it, until i reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.<br>
2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching Google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
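A two-line illustration of that shadowing problem:

list = []      # rebinds the name "list", hiding the built-in class
list("abc")    # TypeError: 'list' object is not callable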
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
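For what it's worth, a minimal sketch of that iterator approach (untested, and assuming the same Title/Full text/Subject layout) could look like:

texts = []
with open('sample.txt', encoding="utf8") as f:
    i = iter(f)
    try:
        line = next(i)
        while True:
            if line.startswith("Full text:"):
                # gather lines until the "Subject:" marker
                full_text = line
                line = next(i)
                while not line.startswith("Subject:"):
                    full_text += line
                    line = next(i)
                texts.append(full_text)
            line = next(i)
    except StopIteration:
        pass  # reached end of file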
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np

# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# split text on keys
chunks = re.split(regex, text)[1:]

# reshape the flat list of records to group each key/value pair and the
# fields of the same article together
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

Python - Dictionary - If loop variable not changing

The project is about converting short forms into long descriptions, read from a CSV file.
Example: the user enters LOL and it should respond with 'Laugh of Laughter'.
Expectation: until the user enters a wrong keyword, the program keeps asking for a short form and answers with its long description from the CSV file.
I treated each row of the CSV file as a dictionary and broke it down into keys and values.
Logic used: a while loop, so that it keeps asking until the short column hits a space or an empty cell. But the issue is that after the first successful attempt, the comparison in the if statement stops working, because readitems['short'] is not updated on each cycle.
AlisonList.csv Values are:
short,long
lol,laugh of laughter
u, you
wid, with
import csv
from lib2to3.fixer_util import Newline
from pip._vendor.distlib.util import CSVReader
from _overlapped import NULL

READ = "r"
WRITE = 'w'
APPEND = 'a'

# Reading the CSV file and converting it into a dictionary
with open("AlisonList.csv", READ) as csv_file:
    readlist = csv.DictReader(csv_file)
    # Reading the short description and showing results
    for readitems in readlist:
        readitems['short'] == ' '
        while readitems['short'] != '':
            # Taking input of short description
            smsLang = str(input("Enter SMS Language : "))
            if smsLang == readitems['short']:
                print(readitems['short'], ("---Means---"), readitems['long'])
            else:
                break
Try this:
import csv

READ = "r"
WRITE = 'w'
APPEND = 'a'

# Read the CSV file and build a short -> long lookup dictionary from it
with open("AlisonList.csv", READ) as csv_file:
    readlist = csv.DictReader(csv_file)
    word_lookup = {x['short'].strip(): x['long'].strip() for x in readlist}

while True:
    # Taking input of short description
    smsLang = str(input("Enter SMS Language : ")).lower()
    normalWord = word_lookup.get(smsLang)
    if normalWord is not None:
        print(f"{smsLang} ---Means--- {normalWord}")
    else:
        print(f"Sorry, '{smsLang}' is not in my dictionary.")
Sample output:
Enter SMS Language : lol
lol ---Means--- laugh of laughter
Enter SMS Language : u
u ---Means--- you
Enter SMS Language : wid
wid ---Means--- with
Enter SMS Language : something that won't be in the dictionary
Sorry, 'something that won't be in the dictionary' is not in my dictionary.
Basically, we compile a dictionary from the csv file, using the short words as the keys and the long words as the values. This allows us in the loop to just call word_lookup.get(smsLang) to find the longer version. If such a key does not exist, we get None, so a simple if statement can handle the case where there is no longer version.
Hope this helps.

Formatting an unstructured csv in pandas

I'm having an issue reading in accurate information from archived 4chan comments. Since the structure of a 4chan thread doesn't (seem to) translate very well into a rectangular dataframe, I'm having issues actually getting the appropriate comments from each thread into a single row in pandas.
To exacerbate the problem, the dataset is 54 GB in size. I asked a similar question on how to just read the data into a pandas dataframe (the solution to that problem is what made me realize this issue), which makes diagnosing every problem tedious.
The code I use to read in portions of the data is as follows:
import pandas as pd

def Four_pleb_chunker():
    """
    :return: 4pleb data is over 54 GB so this chunks it into something manageable
    """
    with open('pol.csv') as f:
        with open('pol_part.csv', 'w') as g:
            for i in range(1000):
                g.write(f.readline())

    name_cols = ['num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired', 'preview_orig', 'preview_w', 'preview_h',
                 'media_filename', 'media_w', 'media_h', 'media_size', 'media_hash', 'media_orig', 'spoiler', 'deleted', 'capcode',
                 'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked', 'poster_hash', 'poster_country', 'exif']
    cols = ['num', 'timestamp', 'email', 'name', 'title', 'comment', 'poster_country']

    df_chunk = pd.read_csv('pol_part.csv',
                           names=name_cols,
                           delimiter=None,
                           usecols=cols,
                           skip_blank_lines=True,
                           engine='python',
                           error_bad_lines=False)
    df_chunk = df_chunk.rename(columns={"comment": "Comments"})
    df_chunk = df_chunk.dropna(subset=['Comments'])
    df_chunk['Comments'] = df_chunk['Comments'].str.replace('[^0-9a-zA-Z]+', ' ')
    df_chunk.to_csv('pol_part_df.csv')
    return df_chunk
This code works fine; however, due to the structure of each thread, a parser that I wrote sometimes returns nonsensical results. In CSV form, this is what the first few rows of the dataset look like (pardon the screenshot; it's extremely difficult to actually write all those lines out using this UI).
As can be seen, the comments per thread are split by '\', but each comment doesn't get its own row. My goal is at least to get each comment into its own row so I can parse through it correctly. However, the function I'm using to parse the data cuts off after 1000 iterations, regardless of whether it's a new line or not.
Fundamentally, my questions are: how can I structure this data to read the comments accurately, and how can I read in a complete sample dataframe as opposed to a truncated one? As for solutions, I've tried:
df_chunk = pd.read_csv('pol_part.csv',
                       names=name_cols,
                       delimiter='',
                       usecols=cols,
                       skip_blank_lines=True,
                       engine='python',
                       error_bad_lines=False)
If I get rid of/change the argument delimiter I get this error:
Skipping line 31473: ',' expected after '"'
That makes sense, because the data isn't separated by commas, so it skips every line that doesn't fit that condition, in this case the whole dataframe. Inputting '\' as the argument gives me a syntax error. I'm kind of at a loss for what to do next, so if anyone has experience dealing with an issue like this, you'd be a lifesaver. Let me know if there's something I haven't included here and I'll update the post.
Update, here are some sample lines from the CSV for testing:
2 23594708 1385716767 \N Anonymous \N Example: not identifying the fundamental scarcity of resources which underlies the entire global power structure, or the huge, documented suppression of any threats to that via National Security Orders. Or that EVERY left/right ideology would be horrible in comparison to ANY in which energy scarcity and the hierarchical power structures dependent upon it had been addressed.
3 23594754 1385716903 \N Anonymous \N ">>23594701\
\
No, /pol/ is bait. That's the point."
4 23594773 1385716983 \N Anonymous \N ">>23594754
\
Being a non-bait among baits is equal to being a bait among non-baits."
5 23594795 1385717052 \N Anonymous \N Don't forget how heavily censored this board is! And nobody has any issues with that.
6 23594812 1385717101 \N Anonymous \N ">>23594773\
\
Clever. The effect is similar. But there are minds on /pol/ who don't WANT to be bait, at least."
Here's a sample script that converts your csv into separate lines for each comment:
import csv

# open input and output files and create the csv reader/writer;
# the input is tab-delimited, as in the sample data
with open('test.csv') as f, open('out.csv', 'w', newline='') as f_out:
    r = csv.reader(f, delimiter='\t')
    w = csv.writer(f_out)
    for l in r:
        # skip empty lines
        if not l:
            continue
        # split the last field on backslash-newline
        # and loop over each resulting string
        for s in l[-1].split('\\\n'):
            # copy all fields except the last one
            output = l[:-1]
            # add a single comment
            output.append(s)
            w.writerow(output)
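Since the end goal is a DataFrame, the normalized out.csv (one comment per row, comma-delimited) can then be loaded the usual way; this is just a sketch, as the exact columns depend on your data:

import pandas as pd

# header=None because the script above writes no header row
df = pd.read_csv('out.csv', header=None)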

Add input to an existing row in csv in python 3.6

I am working in python 3.6 on the following structure:
import csv

aircraft = input("Please insert the aircraft type : ")
characteristics = input("Please insert the respective aircraft characteristics: ")

with open("aircraft_list.csv", "a", newline="") as output:
    if aircraft not in open("aircraft_list.csv").read():
        wr = csv.writer(output)
        # square brackets needed, since otherwise each letter is written
        # as a separate string to the csv, separated by commas
        wr.writerow([aircraft + "," + characteristics])
    else:
        for row in enumerate(output):
            data = row.split(",")
            if data[0] == aircraft:
                wr = csv.writer(output)
                wr.writerow([characteristics], 1)
I want to write the inputs to a csv in the following format:
B737,Boeing,1970, etc
A320,Airbus,EU, etc
As long as an aircraft entry (e.g. B737) does not yet exist, it is easy to write it to the CSV. However, as soon as B737 already exists in the CSV, I want to add the characteristics (not the aircraft) to the entry already made for it. The order of the characteristics does not matter.
I want the additional input characteristics to be added to the correct row in my CSV. How would I do that?
Since I'm new to coding, I tried the basics and combined them with code I found on Stack Overflow, but unfortunately I cannot get it working.
Your help would be great, thank you!
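A common pattern for this kind of update (a sketch only, reusing the file name from the question) is to read the whole file into memory, modify or append the matching row, and rewrite the file, since the csv module cannot edit a row in place:

import csv

aircraft = input("Please insert the aircraft type : ")
characteristics = input("Please insert the respective aircraft characteristics: ")

# read all existing rows into memory
with open("aircraft_list.csv", newline="") as f:
    rows = list(csv.reader(f))

for row in rows:
    if row and row[0] == aircraft:
        row.append(characteristics)  # extend the existing aircraft entry
        break
else:
    rows.append([aircraft, characteristics])  # no match found: new entry

# rewrite the whole file with the updated rows
with open("aircraft_list.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)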

mapping repeating ID's for an email [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have a file with columns ID and MAIL (20 million lines):
000000#0000.com 0xE618EF6B90AG
000000#0000.com 0xE618EF6B90AF
00000#00000.com 0xE618EFBCC83D
00000#00000.com 0xE618EFBCC83C
#000000000 0xE618F02C223E432CEA
00000#0000.com 0x01010492A
0000#00000.com 0x52107A
# 0xE618F032F829432CE04343307C570906A
00000#0000.com 0xE618F032F829432CEB
000000#000.com 0xE618F032FE7B432CEC
000000#000.com 0xE618F032FE7B432CED
#hotmail.com 0x41970588
# 0x52087617
I need to map the IDs registered to each email, so we can find which IDs have been registered to a given mail. An email may have several IDs registered to it.
Here is the function I made, but it turns out that I need to exclude non-valid emails like #.com, #, etc.
The first version of the script works almost perfectly, with one little thing: my parser breaks down if the email has a space somewhere between symbols.
So I added a regexp to check the email value, but I get an error I don't know how to handle:
import re

def duplicates(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()
        query = (line.replace('\n', '') for line in lines)
        split_query = (line.split(' ') for line in query)
        result_mail = {}
        for line in split_query:
            # added if statement to validate email, remove to check w/o
            if re.match(r"[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+", line[0]):
                if line[0] not in result_mail:
                    result_mail[line[0]] = []
            result_mail[line[0]].append(line[1])
        for mail, ids in result_mail.iteritems():
            if len(ids) > 1:
                with open('MAIL_ids.txt', 'a') as r_mail:
                    r_mail.write(str(mail) + '\n')
                    r_mail.write(str(ids) + '\n')

if __name__ == '__main__':
    import sys
    filename = sys.argv[1]
    duplicates(filename)
After running the script I get an error about KeyError: ''. Why is this happening?
File ".\dup_1.2.py", line 44, in <module>
duplicates(filename)
File ".\dup_1.2.py", line 32, in duplicates
result_mail[line[0]].append(line[1])
KeyError: ''
I would also like to rewrite the part where I add keys and values to the dictionary. I'd like to use a defaultdict(), something like:
result_mail = defaultdict(list)
for line in lines:
    if line[0] not in result_mail:
        result_mail[line[0]].append(line[1])
It seems you just put the line result_mail[line[0]].append(line[1]) at the wrong level of indentation, so it is executed even when the if re.match condition does not apply.
Also, you might want to use collections.defaultdict to get rid of that if line[0] not in result_mail check.
import collections
import re

# split_query as defined in the question (the mail comes first, then the ID)
result_mail = collections.defaultdict(list)
for (mail, id_) in split_query:
    if re.match(r"[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+", mail):
        result_mail[mail].append(id_)
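As for the parser breaking when an email contains a space (mentioned in the question), one hedged fix is to split each line only once, from the right, so that everything before the last space stays part of the address:

split_query = (line.rsplit(' ', 1) for line in query)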
