University of Arizona, Department of Computer Science

CSc 120: Fake News

The primary purpose of this assignment is to work with linked lists.

Restrictions

For the long problems, your code should follow the style guidelines for the class.
You may not use concepts and/or short-hand syntax not yet covered in class. The restrictions include the following:
- list (or dictionary) comprehensions, e.g., [n * 2 for i in range(10)]
- with open (explicitly open and close the file instead)
- the ternary operator (use an if instead)
- nested functions (using def within a function to define another function)
- recursion
- exceptions and asssertions
- type annotations
- lambda expressions
- generators and user defined iterators
- default arguments
- importing libraries (except for this assignment you will import the csv and string modules)
- New restriction: Using try/except to avoid having to check for None when iterating through a linked list is banned.

Background

Fake news has become a topic of interest in current times, but it is certainly nothing new. Creating and using "fake news" was a tactic used in the Roman era; recently, the U of A special collections exhibit demonstrated the use of fake news during the Reformation by showing how the bible was altered to conform to the politics of the era.

This programming assignment involves writing a program to analyze some data about possibly "fake" news articles (from 2016) and identify what they most commonly focus on.

The input data for this assignment are derived from www.kaggle.com/mrisdal/fake-news. For use in this assignment, the original data file has been modified as follows:

In order to simplify processing, non-ASCII characters have been replaced by blanks.
In order to reduce file sizes and make the analysis more manageable, the bodies of the articles have been removed. The analysis in this assignment is therefore limited to the titles (i.e., headlines) of articles.
Only US news sources have been considered.
For each discussion thread, only the first posting has been retained, i.e., comments and replies have been omitted.

If you would like to examine or work with the original un-edited data, see the website mentioned above.

Expected Behavior

Write a program, in a file fake_news.py, that behaves as follows.

Read in the name of an input file with a silent prompt:
input()
Read in the data and process it as specified below. You must use the csv module to read the input data (see Using Python's csv module below, under Miscellaneous).
- Extract the title from each input line;
- Process each title as follows:
  - Divide the title into a Python list of words by processing the title in such a way that no word in the resulting list contains any whitespace or any punctuation characters. Use string.punctuation for your punctuation characters. You will have to import the string module for this. (see Processing titles below)
  - From the resulting list, discard words with length ≤ 2. This gets rid of small fragments of words that are created by the splitting process in the previous step (e.g., the word "Tom's" gets split into "Tom" and "s", and this step gets rid of "s") as well as common words, like "a", "an", and "is", that convey little information here. (Not all common words are excluded, but we'll leave that be as a concession to keeping things from getting overly complex.)
  - Convert all words to lower case to make counting word occurrences case-insensitive.
  We will refer to the resulting list of words as the "cleaned list".
- Use a linked list to maintain a count of the number of times each word occurs across all of the input data (but remember that this program limits itself to just the title of each news item). To this end, for each word w in the cleaned list:
  - if w occurs in the linked list, update its count appropriately;
  - otherwise, add w to the linked list with count initialized to 1.
- After the data file has been read in and processed, sort the linked list of words in descending order by count. The relative order of different words that have the same count is up to you. Use your solution for the sort() method from the previous long assignment.
Read in an integer value n using a silent prompt:
n = input()
n will be ≥ 0.
Traverse the sorted list to find the count for the node in the sorted list at position n^th (the very first node is at position 0). We refer to this value as k.
For example, suppose the linked list has the following values for word counts (in descending order of count):
monsoon: 10 heat: 7 dry: 7 chili: 5 canyon: 5 bicycle: 3 room: 2
and n = 3, then the word at position 3 is chili, so k = 5 (because chili has count 5).
Print out all the words in the sorted list that have count ≥ k (see Output format below). The order in which they are printed out is up to you.

Data Structures

You should use linked lists to keep track of words encountered in the input data and their occurrence counts. You are free to either adapt the code shown in the class lecture notes for this, e.g., by using any additional attributes you may need, or to write your own. You may also use any code provided in the short problems and the long problem that covered linked lists.

Since the primary purpose of this assignment is to practice using linked lists, code that tries to sidestep linked lists, e.g., by copying data to or from Python lists and doing some or all of the actual work of the program in Python lists, will not get credit. (Python's csv module returns a Python list when reading from a file. This is OK. Your code should then process the strings in that list and insert them into your linked list, and subsequent processing should be done using linked lists.)

Input format

Each line in the input file represents data about a particular news post. Lines have the following format:

A line that begins with the character "#" is a comment and should be ignored.

A non-comment line contains the following fields. (Here, "..." indicates a field that has no bearing on this programming problem and whose description has been omitted to reduce clutter.)

0
identifier 1
... 2
author 3
date published 4
title of story 5
language 6
... 7
site URL 8
country 9–19
...

The highlighted field (field 4, "title of story") is the field this program will analyze.

Output format

Print out the top k words by count from the sorted list using the statement

print("{} : {:d}".format(word, count))

where word is a string (a word occurring in some title in the input file) and count is an integer value giving the number of occurrences of that string in the title field of posts in the input file.

Programming Requirements

Your program should implement (at least) the following classes along with the methods specified.

class Node

An object of this class represents information about a word. It should contain (at least) the following attributes:

_word: the word string.

_count: a count of the number of occurrences of the word.

_next: a reference to the next node in the linked list.

It should implement (at least) the following methods:

__init__(self, word): initializes the object's attributes as follows: _word is set to word; _count is set to 1; _next is set to None.

word(self): returns the value of _word.

count(self): returns the value of _count.

next(self): returns the value of _next.

set_next(self, target): sets the _next attribute to target.

incr(self): increments the value of _count.

__str__(self) and, optionally, __repr__(self).

class LinkedList

An object of this class represents a linked list. It should contain (at least) the following attributes:

_head: a reference to the first node in the list.

It should implement (at least) the following methods:

__init__(self): initializes _head to None.

is_empty(self): returns a Boolean indicating whether or not the list contains any nodes.

head(self): returns a reference to the first node in the linked list, None if the list is empty.

update_count(self, word): if word is present in the list, increments its count; otherwise, adds a node for it with count initialized to 1.

rm_from_hd(self): removes the first node in the linked list (updates the list's _head attribute appropriately) and returns this node. It generates an error if this method is invoked on an empty list.

insert_after(self, node1, node2): node1 and node2 are references to nodes. Inserts node2 into the list containing node1 so that the node that comes after node1 is node2.

sort(self): sorts the linked list in descending order by _count.

get_nth_highest_count(self, n): returns the count associated with the node in the linked list at position n.

print_upto_count(self, n): print out all the words that have count at least n.

__str__(self) and, optionally, __repr__(self).

Note It is OK to use or adapt the code you were given for the short problems and the previous long problem for this assignment. If you do so, however, then for each such method that you did not write yourself, you must add a comment above the method indicating that it is taken from, or based on, code you were given as part of the short problem.

Processing titles

When considering how to parse the titles, it isn't sufficient to simply split on whitespace. For example, for the title, "It's a dog's life, full of treats!" splitting on whitespace would give the list

["It's", 'a', "dog's", 'life,', 'full', 'of', 'treats!']

You might then attempt to iterate through this list and separate each string element by punctuation characters using element.split(string.punctuation), however, the split(arg) method does not split by any character in the argument string arg. The argument string must be contiguous in order to be considered a split point. A better approach would be to first create a new string where all punctuation has been replaced by spaces. Using that approach, for the title "It's a dog's life, full of treats!", the newly created string would be "It s a dog s life full of treats ". When that string is then split on whitespaces, the resulting list is

['It', 's', 'a', 'dog', 's', 'life', 'full', 'of', 'treats']

Errors

No error checking is required for this assignment.

Examples

Some examples of the behavior of this program are given Input/Output examples.

Commenting

Your code should follow the style guidelines for the class.
Write docstrings for the classes and methods that are not setters and getters.
You are allowed to use code provided in the short problems, the previous long problem, or in the 120 class slides for methods that you might use in the LinkedList and Node classes. However, if you do, in the docstring for the method, include the origin of the source in your comments.

Miscellaneous

Using Python's csv module

split(",")

csv-example.py

dummy.csv

Sorting a linked list

sorting a linked list