CSc 120: Fake News
The primary purpose of this assignment is to work with linked lists.
Restrictions
-
For the long problems, your code should follow the style guidelines
for the class.
-
You may not use concepts and/or short-hand syntax not yet covered in class. The restrictions include the following:
-
list (or dictionary) comprehensions, e.g., [n * 2 for i in range(10)]
-
with open (explicitly open and close the file instead)
-
the ternary operator (use an if instead)
-
nested functions (using def within a function to define another function)
-
recursion
-
exceptions and asssertions
-
type annotations
-
lambda expressions
-
generators and user defined iterators
-
default arguments
-
importing libraries (except for this assignment you will import the csv and string modules)
-
New restriction: Using try/except to avoid having to check for None when iterating through a linked list is banned.
Background
Fake news
has become a topic of interest in current times, but it is certainly nothing new. Creating and using "fake news" was a tactic used in the Roman era;
recently, the U of A special collections exhibit demonstrated the use of fake news during the Reformation by showing how the bible was altered to conform to the
politics of the era.
This programming assignment involves writing a
program to analyze some data about possibly "fake" news articles (from 2016) and identify
what they most commonly focus on.
The input data for this assignment are derived from
www.kaggle.com/mrisdal/fake-news. For use in this
assignment, the original data file has been modified as follows:
-
In order to simplify processing, non-ASCII characters have been replaced by blanks.
-
In order to reduce file sizes and make the analysis more manageable, the
bodies of the articles have been removed. The analysis in this assignment
is therefore limited to the titles (i.e., headlines) of articles.
-
Only US news sources have been considered.
-
For each discussion thread, only the first posting has been retained, i.e.,
comments and replies have been omitted.
If you would like to examine or work with the original un-edited data, see
the website mentioned above.
Expected Behavior
Write a program, in a file fake_news.py, that behaves as follows.
-
Read in the name of an input file with a silent prompt:
input()
-
Read in the data and process it as specified below. You must
use the csv module to read the input data (see
Using Python's csv module below, under Miscellaneous).
-
Extract the title from each input line;
-
Process each title as follows:
-
Divide the title into a Python list of words by processing the title
in such a way that no word in the resulting
list contains any whitespace or any punctuation characters.
Use string.punctuation for your punctuation characters.
You will have to import the string module for this.
(see Processing titles below)
-
From the resulting list, discard words with length ≤ 2. This
gets rid of small fragments of words that are created by the
splitting process in the previous step (e.g., the word
"Tom's"
gets split into "Tom" and "s",
and this step gets rid of "s") as well as
common words, like "a", "an",
and "is", that convey little information here.
(Not all common words are excluded, but we'll leave that be as a
concession to keeping things from getting overly complex.)
-
Convert all words to lower case to make counting word occurrences
case-insensitive.
We will refer to the resulting list of words as the
"cleaned list".
-
Use a linked list to maintain a count of the number of times
each word occurs across all of the input data (but remember that this
program limits itself to just the title of each news item). To this end,
for each word w in the cleaned list:
-
if w occurs in the linked list, update its count appropriately;
-
otherwise, add w to the linked list with count
initialized to 1.
-
After the data file has been read in and processed, sort the linked list
of words in descending order by count. The relative order of different
words that have the same count is up to you. Use your solution for the
sort() method from the previous long assignment.
-
Read in an integer value n using a silent prompt:
n = input()
n will be ≥ 0.
-
Traverse the sorted list to find the count for the node in the sorted list
at position nth (the very first node is at
position 0). We refer to this value
as k.
For example, suppose the linked list has the following values for
word counts (in descending order of count):
monsoon: 10
heat: 7
dry: 7
chili: 5
canyon: 5
bicycle: 3
room: 2
and n = 3, then the word at position 3 is chili,
so k = 5 (because chili has count 5).
-
Print out all the words in the sorted list that have count ≥ k
(see Output format below). The order in which they are printed
out is up to you.
Data Structures
You should use linked lists to keep track of words encountered in the
input data and their occurrence counts. You are free to either adapt the code
shown in the class lecture notes for this, e.g., by using any additional
attributes you may need, or to write your own. You may also use any code provided in
the short problems and the long problem that covered linked lists.
Since the primary purpose of this assignment is to practice using linked
lists, code that tries to sidestep linked lists, e.g., by copying data
to or from Python lists and doing some or all of the actual work of the
program in Python lists, will not get credit. (Python's csv module returns
a Python list when reading from a file. This is OK. Your code should
then process the strings in that list and insert them into your linked list,
and subsequent processing should be done using linked lists.)
Input format
Each line in the input file represents data about a particular news post.
Lines have the following format:
-
A line that begins with the character "#"
is a comment and should be ignored.
-
A non-comment line contains the following fields. (Here,
"..." indicates a field that has no bearing on this
programming problem and whose description has been omitted to
reduce clutter.)
0
identifier
|
1
...
|
2
author
|
3
date published
|
4
title of story
|
5
language
|
6
...
|
7
site URL
|
8
country
|
9–19
...
|
The highlighted field (field 4, "title of story") is the
field this program will analyze.
Output format
Print out the top k words by count from the sorted list using the
statement
print("{} : {:d}".format(word, count))
where word is a string (a word occurring in some title in the
input file) and count is an integer value giving the number
of occurrences of that string in the title field of posts in the input file.
Programming Requirements
Your program should implement (at least) the following classes along with the
methods specified.
-
class Node
-
An object of this class represents information about a word. It should
contain (at least) the following attributes:
-
_word: the word string.
-
_count: a count of the number of occurrences of the word.
-
_next: a reference to the next node in the linked list.
It should implement (at least) the following methods:
-
__init__(self, word): initializes the object's
attributes as follows: _word is set to
word; _count is set to 1;
_next is set to None.
-
word(self): returns the value of _word.
-
count(self): returns the value of _count.
-
next(self): returns the value of _next.
-
set_next(self, target): sets the _next
attribute to target.
-
incr(self): increments the value of _count.
-
__str__(self) and, optionally,
__repr__(self).
-
class LinkedList
-
An object of this class represents a linked list. It should contain
(at least) the following attributes:
-
_head: a reference to the first node in the list.
It should implement (at least) the following methods:
-
__init__(self): initializes _head
to None.
-
is_empty(self): returns a Boolean indicating whether
or not the list contains any nodes.
-
head(self): returns a reference to the first node
in the linked list, None if the list is empty.
-
update_count(self, word): if word
is present in the list, increments its count; otherwise, adds a node
for it with count initialized to 1.
-
rm_from_hd(self): removes the first node in the
linked list (updates the list's _head attribute
appropriately) and returns this node. It generates an error if this
method is invoked on an empty list.
-
insert_after(self, node1, node2):
node1 and node2 are references
to nodes. Inserts node2 into the list containing
node1 so that the node that comes after
node1 is node2.
-
sort(self): sorts the linked list in descending order
by _count.
-
get_nth_highest_count(self, n): returns the
count associated with the node in the linked list at position
n.
-
print_upto_count(self, n): print out all the words
that have count at least n.
-
__str__(self) and, optionally,
__repr__(self).
Note It is OK to use or adapt the code you were given for
the short problems and the previous long problem for this assignment.
If you do so, however, then for
each such method that you did not write yourself, you must add a
comment above the method indicating that it is taken from, or based on,
code you were given as part of the short problem.
Processing titles
When considering how to parse the titles, it isn't sufficient to simply split on whitespace. For example, for the title,
"It's a dog's life, full of treats!"
splitting on whitespace would give the list
["It's", 'a', "dog's", 'life,', 'full', 'of', 'treats!']
You might then attempt to iterate through this list and separate each string element by punctuation characters using element.split(string.punctuation), however, the split(arg) method does not split by any character in the argument string arg. The argument string must be contiguous in order to be considered a split point.
A better approach would be to first create a new string where all punctuation has been replaced by spaces.
Using that approach, for the title
"It's a dog's life, full of treats!", the newly created string would be "It s a dog s life full of treats ".
When that string is then split on whitespaces, the resulting list is
['It', 's', 'a', 'dog', 's', 'life', 'full', 'of', 'treats']
Errors
No error checking is required for this assignment.
Examples
Some examples of the behavior of this program are given
Input/Output examples.
Commenting
-
Your code should follow the style guidelines
for the class.
-
Write docstrings for the classes and methods that are not setters and getters.
-
You are allowed to use code provided in the short problems, the previous long problem,
or in the 120 class slides for methods that
you might use in the LinkedList and Node classes.
However, if you do, in the docstring for the method, include the origin of the source
in your comments.
Miscellaneous
Using Python's csv module
In general, CSV files can have a complex structure such that they
cannot be parsed using split(","). The file
csv-example.py,
which uses a small example input file
dummy.csv, shows how
Python's csv module can be used to read CSV files.
Sorting a linked list
You should use your solution for sorting a linked list from the previous long problem.
However, for reference, the algorithm that must be used for sorting linked lists in this
assignment is given
sorting a linked list.