Newspaper Web Scraper and Dashboard

Customer Need

The client needed a way to track newspaper articles that mention the company’s name. The initial scope was two major Venezuelan publications, El Nacional and El Universal, which were sufficient for an MVP, but the architecture had to be extensible so that additional newspapers could be integrated later. The company also wanted to record how often it was mentioned and to preserve the full content of each relevant article. Finally, it wanted this monitoring capability available through its internal WhatsApp chatbot to handle user inquiries.

Solution Design and Development

A Python application, run by a daily cron job, scrapes the specified newspaper websites for articles published that day and searches them for configurable keywords (e.g., the company name). Core components include:

  • Web Scraping Layer: Developed with BeautifulSoup and structured around the Dependency Inversion principle, so the application depends on a common scraper interface rather than on any individual site. Each newspaper’s scraper is a class implementing that interface, and the appropriate implementation is selected polymorphically at runtime.
  • Data Persistence: Matches are stored in a PostgreSQL database, capturing metadata such as publication date, source name, URL, and article content.
  • Reporting Dashboard: A Power BI Desktop solution was created to visualize the data. Two primary views were developed:
    • Line Chart: Displays the number of article matches over time, aggregated monthly.
    • Detail Table: Presents individual records with full article text and metadata for in‑depth review.
  • Chatbot Integration: The company’s existing tree‑based WhatsApp chatbot was extended to interact with the scraper backend, enabling users to:
    1. Query matches within a specified date range.
    2. Trigger an on‑demand scrape for the current day based on selected keywords.
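The persistence and query paths described above (storing matches, answering date-range queries from the chatbot, and counting matches per month for the line chart) could look roughly like the sketch below. It is illustrative only: the table name, column names, and function names are assumptions, and it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server.

```python
import sqlite3
from datetime import date

# Hypothetical schema: the actual PostgreSQL table and column names are not
# given in this write-up.
SCHEMA = """
CREATE TABLE IF NOT EXISTS article_match (
    id           INTEGER PRIMARY KEY,
    source_name  TEXT NOT NULL,
    url          TEXT NOT NULL UNIQUE,
    published_on TEXT NOT NULL,
    content      TEXT NOT NULL
)
"""


def save_match(conn, source_name, url, published_on, content):
    """Store one match; a URL that is already stored is skipped, so
    re-running a scrape for the same day does not create duplicates."""
    conn.execute(
        "INSERT OR IGNORE INTO article_match "
        "(source_name, url, published_on, content) VALUES (?, ?, ?, ?)",
        (source_name, url, published_on.isoformat(), content),
    )


def matches_between(conn, start, end):
    """Return (published_on, source_name, url) rows for an inclusive date range."""
    cur = conn.execute(
        "SELECT published_on, source_name, url FROM article_match "
        "WHERE published_on BETWEEN ? AND ? ORDER BY published_on, url",
        (start.isoformat(), end.isoformat()),
    )
    return cur.fetchall()


def monthly_counts(conn):
    """Return (YYYY-MM, count) rows, i.e. the monthly aggregation."""
    cur = conn.execute(
        "SELECT substr(published_on, 1, 7) AS month, COUNT(*) "
        "FROM article_match GROUP BY month ORDER BY month"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    save_match(conn, "El Nacional", "https://example.com/a", date(2024, 1, 10), "...")
    save_match(conn, "El Universal", "https://example.com/b", date(2024, 2, 5), "...")
    print(matches_between(conn, date(2024, 1, 1), date(2024, 1, 31)))
    print(monthly_counts(conn))
```

Against PostgreSQL, the duplicate guard would be written as `INSERT ... ON CONFLICT (url) DO NOTHING`, and the monthly grouping would typically use `date_trunc('month', published_on)` on a real `date` column.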

Result and Conclusion

The solution was successfully deployed on the client’s on‑premises servers, together with documentation covering setup instructions and developer guidance. Because each newspaper is encapsulated behind the common scraper interface, a new source can be added by implementing one more scraper class, which lets the system adapt as monitoring requirements evolve.
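As an illustration of that extension point, the sketch below shows one way a common scraper interface with per-newspaper classes might be arranged. It is not the delivered code. The class names, the `register` decorator, the `find_matches` helper, and the HTML tags assumed for each site are all hypothetical. To stay free of third-party dependencies, it parses HTML with the standard library's `html.parser` instead of BeautifulSoup, and it omits page fetching.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Article:
    source_name: str
    title: str
    content: str


class _TagTextCollector(HTMLParser):
    """Collects the text inside each occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag, self.depth, self.texts = tag, 0, []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


def tag_texts(html, tag):
    collector = _TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.texts]


class NewspaperScraper(ABC):
    """Common interface; the rest of the application depends only on this."""
    source_name: str

    @abstractmethod
    def parse_article(self, html: str) -> Article: ...


SCRAPERS = {}  # source name -> scraper class


def register(cls):
    SCRAPERS[cls.source_name] = cls
    return cls


@register
class ElNacionalScraper(NewspaperScraper):
    source_name = "El Nacional"

    def parse_article(self, html):
        # Assumed markup: headline in <h1>, body in <p> tags.
        return Article(self.source_name, tag_texts(html, "h1")[0],
                       " ".join(tag_texts(html, "p")))


@register
class ElUniversalScraper(NewspaperScraper):
    source_name = "El Universal"

    def parse_article(self, html):
        # Assumed markup: headline in <h2>, body in <p> tags.
        return Article(self.source_name, tag_texts(html, "h2")[0],
                       " ".join(tag_texts(html, "p")))


def find_matches(pages, keywords):
    """pages: iterable of (source_name, html) pairs.
    Returns the articles whose title or content mentions any keyword."""
    lowered = [keyword.casefold() for keyword in keywords]
    matches = []
    for source_name, html in pages:
        article = SCRAPERS[source_name]().parse_article(html)  # polymorphic call
        text = f"{article.title} {article.content}".casefold()
        if any(keyword in text for keyword in lowered):
            matches.append(article)
    return matches
```

Under these assumptions, supporting a third newspaper means adding one more `@register`-decorated class; `find_matches` and everything downstream of it stay unchanged.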