University of Arizona, Department of Computer Science

CSc 120: Fake News

The primary purpose of this assignment is to work with linked lists.

Restrictions

  1. For the long problems, your code should follow the style guidelines for the class.
  2. You may not use concepts and/or short-hand syntax not yet covered in class. The restrictions include the following:

Background

Fake news has become a topic of interest in current times, but it is certainly nothing new. Creating and using "fake news" was a tactic used in the Roman era; recently, the U of A special collections exhibit demonstrated the use of fake news during the Reformation by showing how the bible was altered to conform to the politics of the era.

This programming assignment involves writing a program to analyze some data about possibly "fake" news articles (from 2016) and identify what they most commonly focus on.

The input data for this assignment are derived from www.kaggle.com/mrisdal/fake-news. For use in this assignment, the original data file has been modified as follows:

  1. In order to simplify processing, non-ASCII characters have been replaced by blanks.
  2. In order to reduce file sizes and make the analysis more manageable, the bodies of the articles have been removed. The analysis in this assignment is therefore limited to the titles (i.e., headlines) of articles.
  3. Only US news sources have been considered.
  4. For each discussion thread, only the first posting has been retained, i.e., comments and replies have been omitted.

If you would like to examine or work with the original un-edited data, see the website mentioned above.

Expected Behavior

Write a program, in a file fake_news.py, that behaves as follows.

  1. Read in the name of an input file with a silent prompt:
    input()

  2. Read in the data and process it as specified below. You must use the csv module to read the input data (see Using Python's csv module below, under Miscellaneous).

  3. Read in an integer value n using a silent prompt:
    n = input()
    n will be ≥ 0.

  4. Traverse the sorted list to find the count for the node in the sorted list at position nth (the very first node is at position 0). We refer to this value as k.

    For example, suppose the linked list has the following values for word counts (in descending order of count):

    monsoon: 10
    heat: 7
    dry: 7
    chili: 5
    canyon: 5
    bicycle: 3
    room: 2
    and n = 3, then the word at position 3 is chili, so k = 5 (because chili has count 5).

  5. Print out all the words in the sorted list that have count ≥ k (see Output format below). The order in which they are printed out is up to you.

Data Structures

You should use linked lists to keep track of words encountered in the input data and their occurrence counts. You are free to either adapt the code shown in the class lecture notes for this, e.g., by using any additional attributes you may need, or to write your own. You may also use any code provided in the short problems and the long problem that covered linked lists.

Since the primary purpose of this assignment is to practice using linked lists, code that tries to sidestep linked lists, e.g., by copying data to or from Python lists and doing some or all of the actual work of the program in Python lists, will not get credit. (Python's csv module returns a Python list when reading from a file. This is OK. Your code should then process the strings in that list and insert them into your linked list, and subsequent processing should be done using linked lists.)

Input format

Each line in the input file represents data about a particular news post. Lines have the following format:

Output format

Print out the top k words by count from the sorted list using the statement
print("{} : {:d}".format(word, count))
where word is a string (a word occurring in some title in the input file) and count is an integer value giving the number of occurrences of that string in the title field of posts in the input file.

Programming Requirements

Your program should implement (at least) the following classes along with the methods specified.
class Node
An object of this class represents information about a word. It should contain (at least) the following attributes:
  • _word: the word string.
  • _count: a count of the number of occurrences of the word.
  • _next: a reference to the next node in the linked list.

It should implement (at least) the following methods:

  • __init__(self, word): initializes the object's attributes as follows: _word is set to word; _count is set to 1; _next is set to None.
  • word(self): returns the value of _word.
  • count(self): returns the value of _count.
  • next(self): returns the value of _next.
  • set_next(self, target): sets the _next attribute to target.
  • incr(self): increments the value of _count.
  • __str__(self) and, optionally, __repr__(self).

class LinkedList
An object of this class represents a linked list. It should contain (at least) the following attributes:
  • _head: a reference to the first node in the list.
It should implement (at least) the following methods:
  • __init__(self): initializes _head to None.
  • is_empty(self): returns a Boolean indicating whether or not the list contains any nodes.
  • head(self): returns a reference to the first node in the linked list, None if the list is empty.
  • update_count(self, word): if word is present in the list, increments its count; otherwise, adds a node for it with count initialized to 1.
  • rm_from_hd(self): removes the first node in the linked list (updates the list's _head attribute appropriately) and returns this node. It generates an error if this method is invoked on an empty list.
  • insert_after(self, node1, node2): node1 and node2 are references to nodes. Inserts node2 into the list containing node1 so that the node that comes after node1 is node2.
  • sort(self): sorts the linked list in descending order by _count.
  • get_nth_highest_count(self, n): returns the count associated with the node in the linked list at position n.
  • print_upto_count(self, n): print out all the words that have count at least n.
  • __str__(self) and, optionally, __repr__(self).

Note It is OK to use or adapt the code you were given for the short problems and the previous long problem for this assignment. If you do so, however, then for each such method that you did not write yourself, you must add a comment above the method indicating that it is taken from, or based on, code you were given as part of the short problem.

Processing titles

When considering how to parse the titles, it isn't sufficient to simply split on whitespace. For example, for the title, "It's a dog's life, full of treats!" splitting on whitespace would give the list

["It's", 'a', "dog's", 'life,', 'full', 'of', 'treats!']
You might then attempt to iterate through this list and separate each string element by punctuation characters using element.split(string.punctuation), however, the split(arg) method does not split by any character in the argument string arg. The argument string must be contiguous in order to be considered a split point. A better approach would be to first create a new string where all punctuation has been replaced by spaces. Using that approach, for the title "It's a dog's life, full of treats!", the newly created string would be "It s a dog s life full of treats ". When that string is then split on whitespaces, the resulting list is
['It', 's', 'a', 'dog', 's', 'life', 'full', 'of', 'treats']

Errors

No error checking is required for this assignment.

Examples

Some examples of the behavior of this program are given Input/Output examples.

Commenting

  1. Your code should follow the style guidelines for the class.
  2. Write docstrings for the classes and methods that are not setters and getters.
  3. You are allowed to use code provided in the short problems, the previous long problem, or in the 120 class slides for methods that you might use in the LinkedList and Node classes. However, if you do, in the docstring for the method, include the origin of the source in your comments.
  4. Miscellaneous

    Using Python's csv module

    In general, CSV files can have a complex structure such that they cannot be parsed using split(","). The file csv-example.py, which uses a small example input file dummy.csv, shows how Python's csv module can be used to read CSV files.

    Sorting a linked list

    You should use your solution for sorting a linked list from the previous long problem. However, for reference, the algorithm that must be used for sorting linked lists in this assignment is given sorting a linked list.