
DATA DOUBLE CONFIRM

Scraping Twitter data - Process - Python

This is Part I of a four-part post. Part I covers collecting text data from Twitter, Part II discusses analysis of that text data (i.e. text mining), Part III outlines how to present the data using Tableau, and Part IV delves into insights from the analysis.

I used the twitterscraper package, which is really easy to use. For each tweet, it scrapes the following information:

  • Username and Full Name

  • Tweet-id

  • Tweet-url

  • Tweet text

  • Tweet timestamp

  • No. of likes

  • No. of replies

  • No. of retweets
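To make the shape of one scraped record concrete, here is a minimal sketch of a container for the fields listed above. The class name and field names are my own illustrative choices, not necessarily the names twitterscraper uses in its output, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ScrapedTweet:
    """One scraped tweet. Field names are illustrative, not the package's own."""
    username: str
    full_name: str
    tweet_id: str
    tweet_url: str
    text: str
    timestamp: datetime
    likes: int
    replies: int
    retweets: int


# A made-up record, only to show what one row of data would hold.
example = ScrapedTweet(
    username="some_user",
    full_name="Some User",
    tweet_id="1",
    tweet_url="/some_user/status/1",
    text="Hello, world",
    timestamp=datetime(2018, 1, 1, 12, 0, 0),
    likes=3,
    replies=1,
    retweets=2,
)

print(example.username, example.likes, example.retweets)
```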

Installation instructions and crawling examples are provided in the link, but getting the package installed and working was full of obstacles, one after another (not unusual with Python!). Here's an account of my exasperating debugging process.

1. AttributeError: class Tweet has no attribute '__mro__'

This happened because I was using Python 2 (instead of Python 3) to run twitterscraper. My laptop has both Anaconda 2 (Python 2.7) and Anaconda 3 (Python 3.4), and the package apparently got installed under Python 2 by default rather than Python 3. [I did the installation via the second option, i.e. cloned the repo and ran python setup.py install within the folder containing setup.py.] So I uninstalled twitterscraper from Python 2 and tried to install it under Python 3.

py -2.7 -m pip uninstall twitterscraper

py -3.4 -m pip install twitterscraper
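When two Python installations live side by side, it helps to confirm which interpreter is actually running and whether it can see the package. The snippet below is a small sketch of such a check. The function names are hypothetical helpers of mine, and the idea is to run the script once with each interpreter and compare the output.

```python
import sys


def describe_interpreter():
    """Report which Python executable is running and its major.minor version."""
    major, minor = sys.version_info[0], sys.version_info[1]
    return {
        "executable": sys.executable,
        "version": "%d.%d" % (major, minor),
        "is_python3": major >= 3,
    }


def package_importable(name):
    """Return True/False if `name` can be located, or None when unknown (Python 2)."""
    try:
        from importlib.util import find_spec  # available on Python 3.4+
    except ImportError:
        return None
    return find_spec(name) is not None


info = describe_interpreter()
for key in sorted(info):
    print(key, "=", info[key])
print("twitterscraper importable:", package_importable("twitterscraper"))
```

If the two interpreters report different answers for the last line, the package was installed under only one of them.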

2. Another error message: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers

Apparently this is due to a very old version of setuptools, so I had to run

pip install --upgrade setuptools

3. Yet another error: Cannot remove entries from nonexistent file d:\anaconda32\envs\tst\lib\site-packages\easy-install.pth

I used the following workaround: pip install --upgrade --ignore-installed setuptools

4. Next error: Could not find .egg-info directory in install record for ...

Solution: pip install --upgrade setuptools pip
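Since errors 2 to 4 all trace back to the installed setuptools and pip, it can be useful to print their versions before and after upgrading. This is a minimal sketch that assumes a newer Python than the one in this post (importlib.metadata needs Python 3.8 or later); installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version  # Python 3.8+


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for name in ("setuptools", "pip", "twitterscraper"):
    print(name, installed_version(name) or "not installed")
```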

Finally, the installation command py -3.4 -m pip install twitterscraper ran without errors, and I managed to crawl some Twitter data with twitterscraper! To collect tweets from a particular person, add from%3A in front of the username. For example, to scrape tweets from Barack Obama's Twitter account, run the following code: twitterscraper from%3ABarackObama -o tweets.json
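The from%3A prefix is simply the URL-encoded form of from: (the colon becomes %3A). A short sketch of how that query string could be built in Python follows. build_command is a hypothetical helper, and it only assembles the command shown above without running it.

```python
from urllib.parse import quote


def build_command(username, output_file="tweets.json"):
    """Assemble the twitterscraper command line for one user's tweets."""
    query = quote("from:" + username)  # ':' is percent-encoded as %3A
    return ["twitterscraper", query, "-o", output_file]


command = build_command("BarackObama")
print(" ".join(command))
# prints: twitterscraper from%3ABarackObama -o tweets.json
```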

[Edited on 22 Oct 2018] The package has since been updated on GitHub. Do check the link for the latest code. The code to collect tweets from Barack Obama, for example, is now: twitterscraper BarackObama -u -o tweets.json

Note: All of these commands were run within the Anaconda command line interface (i.e. the Anaconda command prompt). The interface looks something like this. twitterscraper from%3ABarackObama -o tweets.json should be run outside Python, within the Anaconda environment.
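Once the crawl has written tweets.json, the file can be read back inside Python. The sketch below assumes the output is a single JSON array with one object per tweet, which may not hold for every version of the package. load_tweets is a hypothetical helper, and a small stand-in file with made-up content is used so the snippet runs without a real crawl.

```python
import json
import os
import tempfile


def load_tweets(path):
    """Load a JSON file that is assumed to hold a list of tweet objects."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of tweets")
    return data


# Stand-in data; a real crawl would produce tweets.json instead.
sample = [{"text": "hello"}, {"text": "world"}]
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tweets.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    tweets = load_tweets(path)

print(len(tweets), "tweets loaded")
print(sorted(tweets[0].keys()))  # inspect the real field names this way
```

Printing the keys of the first record is a safe way to find out what the fields are actually called before relying on any of them.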

Resources:

https://pip.pypa.io/en/stable/reference/pip_uninstall/

https://stackoverflow.com/questions/2812520/pip-dealing-with-multiple-python-versions

https://github.com/bigchaindb/bigchaindb/issues/236

https://github.com/ContinuumIO/anaconda-issues/issues/542

https://stackoverflow.com/questions/26091641/what-does-a-could-not-find-egg-info-directory-in-install-record-from-pip-mean
