The Scirate Python Package

I recently cobbled together scirate, a Python package that can extract information from the website Scirate. If you are not familiar with Scirate, it’s a website that aggregates content from arXiv and allows users to “scite” papers. Giving a paper a “scite” is very similar to the “like” or “thumbs up” button on other social media platforms. Giving a paper a “scite” can range from an endorsement of the content, to interest in the research, to “Hey, I know the authors and they are nice people”.

When I was a Ph.D. student, Scirate was, and indeed still is, primarily used by researchers in the quantum information community. Since I am most familiar with this category, along with the fact that this community is most active on Scirate, we will focus on applying the scirate Python package with a focus on quantum physics postings.

For more information on the scirate Python package I wrote along with documentation, examples, etc., please consult the following Github link or official release page on PyPI.

Example Usage

In this post we will take a look at some examples of how one may make use of this Python package. Before embarking on these examples, you must have Python installed on your machine. Currently, the scirate package supports Python 2.7 and 3.5. You will also require the pip module installed to install Python packages. The scirate package can then be installed via pip as

pip install scirate

Once installed, you will be ready to go. First, we will create a client to interface with the Scirate website:

# Import the Scirate client module and create a client object.
from scirate.client import ScirateClient
client = ScirateClient()

The client will be responsible for requesting information from Scirate and will serve as the intermediary for requesting and obtaining data.

For basic examples and usage, you may consult the README file which lists some of the more basic functionality and examples of what one can use the scirate Python package for. In the next immediate sections, we will be replicating much of what is already contained in the README to make this post as self-contained as possible. The final sections will be building on that to explore some slightly more involved examples that may be of interest.

Basic Usage

In this basic usage section we will be replicating much of what is present in the README of the scirate Python package. If you have already consulted the README, you can skip over this section.

Papers

Let us access a paper on Scirate via the arXiv identifier. Say we want to access information via Scirate on the following listing 1509.01147.

We can grab some of the basic information, such as the authors, title, abstract, arXiv category, etc.

from scirate.paper import SciratePaper
paper = client.paper("1509.01147")
print(paper)
	>>> "The Information Paradox for Black Holes"
print(paper.authors)
	>>> ['S. W. Hawking']
print(paper.abstract[0:50])
	>>> "I propose that the information loss paradox can be"
print(paper.category)
	>>> "hep-th"

We can also grab some of the more Scirate-specific metrics. Such as the number of scites for a given article, who scited the article, etc.

print(paper.scites)
	>>> 6
print(paper.scitors)
	>>> ['Andrew Childs', 'Jonny', 'Mehdi Ahmadi', 'Noon van der Silk', 'Ryan L. Mann', 'Tom Wong']

Consult the documentation for further examples of information that can be obtained from a paper.

Authors

You can get information about an author as well.

from scirate.author import ScirateAuthor
author = client.author("Terrance", "Tao", "math.CO")
print(author)
	>>> "Terrance Tao"
print(author.papers[0])
    >>> "An inverse theorem for an inequality of Kneser"
print(author.arxiv_ids[0])
    >>> "1711.04337"

Using the arXiv identifier along with what we did in the Papers
section, we can obtain further information about that paper if 
we wish

```python
paper = client.paper(author.arxiv_ids[0])
print(paper.scites)
    >>> 0

Note that the mathematician Terrance Tao published on multiple arXiv categories. We can look up his papers under the math.NT category as well.

author = client.author("Terrance", "Tao", "math.NT")
print(author.papers[0])
    >>> "Long gaps in sieved sets"
print(author.category)
    >>> math.NT

Highest Scited Papers of the Year

As an example, say we want to determine what the highest scited papers were in the “quant-ph” category for 2017. We can use Python’s built-in datetime module to define objects referring to the start and end date range:

# Use Python's datetime module to set a range of dates. Our range
# starts on the first day of 2017 and ends at the last day of 2017.
import datetime

start_date = datetime.date(2017, 1, 1)
end_date = datetime.date(2017, 12, 31)

We now want to cycle through all of the days within this range and extract the highest number of scites received for each day. Both of these things are accomplished with the following snippet of code.

# Loop through all days in 2017. Save information on each paper with the 
# highest number of scites for each day and store in respective lists.
from scirate.category import ScirateCategory
date = start_date
delta = datetime.timedelta(days=1)

max_scites = []
max_scites_dates = []
max_scites_arxiv_ids = [] 
while d <= end_date:
	category = client.category("quant-ph", date.strftime("%Y-%m-%d"))
	if category.scites != []:
		daily_max_scite = category.scites[0]
		daily_max_arxiv_id = category.arxiv_ids[0]

		max_scites.append(daily_max_scite)
		max_scites_arxiv_ids.append(daily_max_arxiv_id)
		max_scites_dates.append(date)
	d += delta

The above while loop goes through each day and checks whether or not there were postings on that day. If so, we extract the listing with the highest number of scites. This may be obtained by consulting the zeroth index, as listings are sorted from the highest number of scites to the lowest. We then store the number of scites that the highest ranked article has for that day along with the date it was posted and also the arXiv identifier for later processing.

This code doesn’t take all that long to compute, however, one may wish to only run this once and store the resulting lists for easy retrieval. To accomplish this, we can make use of the pickle module in Python like so:

# Pickle lists to byte files for faster retrieval and to
# avoid excess computation.
import pickle

with open('max_scites.pkl', 'wb') as f:
	pickle.dump(max_scites, f)

with open('max_scites_arxiv_ids.pkl', 'wb') as f:
	pickle.dump(max_scites_arxiv_ids, f)

with open('max_scites_dates.pkl', 'wb') as f:
	pickle.dump(max_scites_dates, f)

Once the pickled files have been stored, you may load them directly into your script as opposed to computing the contents from scratch again.

# Retrieve pickled files.
max_scites = pickle.load(open('max_scites.pkl', 'rb'))
max_scites_arxiv_ids = pickle.load(open('max_scites_arxiv_ids.pkl', 'rb'))
max_scites_dates = pickle.load(open('max_scites_dates.pkl', 'rb'))

Now we can determine what the highest number of scites was for 2017.

# Print the maximum number of scites over all 
# papers of 2017.
max(max_scites)
	>>> 114

The most scited paper on Scirate in 2017 received 114 scites. We can determine what paper this was by consulting the other list of items that held the arXiv identifiers of each paper.

# Use the paper module in the scirate package to 
# extract further information about the highest
# scited paper.
from scirate.paper import SciratePaper

arxiv_id = max_scites_arxiv_ids[max_scites.index(max(max_scites))]
print(arxiv_id)
	>>> 1704.00690
	
paper = client.paper(arxiv_id)
print(paper.title)
	>>> Quantum advantage with shallow circuits

print(paper.authors)
	>>> ['Sergey Bravyi', 'David Gosset', 'Robert Koenig']

Total Daily Postings

In this section, we will take a look at producing a bar chat that displays the cumulative total of postings for each day of the week over the entire year of 2017. We start off by creating an empty ordered dictionary that will store the labels for each weekday along with the cumulative counts of papers published on those days.

weekday_dict = collections.OrderedDict()
weekday_dict["MON"] = 0
weekday_dict["TUE"] = 0
weekday_dict["WED"] = 0
weekday_dict["THU"] = 0
weekday_dict["FRI"] = 0

Now we will range over all of the dates for which postings were made available. We actually already computed this in the max_scites_dates. By ranging over these dates and extracting the year, month, and day, we can determine the number of papers posted on that particular day with the following snippet of code.

# Loop over all dates with postings on Scirate and 
# accumulate respective total postings for each
# day of the week.
for date in max_scites_dates:
	s = date.split("-")
	year, month, day = int(s[0]), int(s[1]), int(s[2])
	d = datetime.date(year, month, day)
	category = client.category("quant-ph", d.strftime("%Y-%m-%d"))
	print(date)
	print(category.papers)
	weekday_dict[weekday_list[d.weekday()]] += len(category.papers)

with open('weekday_dict.pkl', 'wb') as f:
	pickle.dump(weekday_dict, f)

Just as we did before, we will pickle the result so we can refer to it later instead of computing this every single time. Once we have computed the content to be stored in weekly_dict, we can load it in from the pickeled file and produce our bar chart.

# Load the pickled file and create a bar chart that
# displays the cumulative number of postings on each
# day of the week.
weekday_dict = pickle.load(open('weekday_dict.pkl', 'rb'))
weekday_list = ["MON", "TUE", "WED", "THU", "FRI"]
x = np.arange(5)
y = list(weekday_dict.values())
plt.xticks(np.arange(5), weekday_list)
plt.bar(x, y)
plt.xlabel("Day of Week")
plt.ylabel("Total Number of Scites")
plt.show()

Running the above code will generate the following bar chart.

Conclusion

This post was just a brief look at the scirate Python package I wrote and an example of some of the things you may want to do with it. By all means, if you make use of this module to do anything interesting, I would most certainly like to see what you have done. Thanks again for reading, and happy coding!

The Scirate Python Package