The Scirate Python Package
I recently cobbled together
scirate, a Python package that can extract information
from the website Scirate. If you are not familiar with Scirate, it’s a website that
aggregates content from arXiv and allows users to “scite” papers. Giving a paper a
“scite” is very similar to the “like” or “thumbs up” button on other social media platforms.
Giving a paper a “scite” can range from an endorsement of the content, to interest in the
research, to “Hey, I know the authors and they are nice people”.
When I was a Ph.D. student, Scirate was, and indeed still is, primarily used by researchers
in the quantum information community. Since I am most familiar with this category, along
with the fact that this community is most active on Scirate, we will focus on applying the
scirate Python package with a focus on quantum physics postings.
In this post we will take a look at some examples of how one may make use of this Python
package. Before embarking on these examples, you must have Python installed on your machine.
scirate package supports Python 2.7 and 3.5. You will also require the
module installed to install Python packages. The
scirate package can then be
pip install scirate
Once installed, you will be ready to go. First, we will create a client to interface with the Scirate website:
# Import the Scirate client module and create a client object. from scirate.client import ScirateClient client = ScirateClient()
The client will be responsible for requesting information from Scirate and will serve as the intermediary for requesting and obtaining data.
For basic examples and usage, you may consult the README file which lists some of
the more basic functionality and examples of what one can use the
package for. In the next immediate sections, we will be replicating much of what is
already contained in the README to make this post as self-contained as possible. The
final sections will be building on that to explore some slightly more involved examples
that may be of interest.
In this basic usage section we will be replicating much of what is present in the README
scirate Python package. If you have already consulted the README, you can
skip over this section.
Let us access a paper on Scirate via the arXiv identifier. Say we want to access information via Scirate on the following listing 1509.01147.
We can grab some of the basic information, such as the authors, title, abstract, arXiv category, etc.
from scirate.paper import SciratePaper paper = client.paper("1509.01147") print(paper) >>> "The Information Paradox for Black Holes" print(paper.authors) >>> ['S. W. Hawking'] print(paper.abstract[0:50]) >>> "I propose that the information loss paradox can be" print(paper.category) >>> "hep-th"
We can also grab some of the more Scirate-specific metrics. Such as the number of scites for a given article, who scited the article, etc.
print(paper.scites) >>> 6 print(paper.scitors) >>> ['Andrew Childs', 'Jonny', 'Mehdi Ahmadi', 'Noon van der Silk', 'Ryan L. Mann', 'Tom Wong']
Consult the documentation for further examples of information that can be obtained from a paper.
You can get information about an author as well.
from scirate.author import ScirateAuthor author = client.author("Terrance", "Tao", "math.CO") print(author) >>> "Terrance Tao" print(author.papers) >>> "An inverse theorem for an inequality of Kneser" print(author.arxiv_ids) >>> "1711.04337" Using the arXiv identifier along with what we did in the Papers section, we can obtain further information about that paper if we wish ```python paper = client.paper(author.arxiv_ids) print(paper.scites) >>> 0
Note that the mathematician Terrance Tao published on multiple arXiv categories. We can look up his papers under the math.NT category as well.
author = client.author("Terrance", "Tao", "math.NT") print(author.papers) >>> "Long gaps in sieved sets" print(author.category) >>> math.NT
One may also wish to look at papers under various arXiv identifier listings on Scirate. For instance, one may wish to find all of the papers posted under the ‘quant-ph’ category posted on September 7, 2017.
from scirate.category import ScirateCategory category = client.category("quant-ph", "09-07-2017") print(category.papers[0:2]) >>> ['Quantum Advantage from Conjugated Clifford Circuits', 'Extended Nonlocal Games from Quantum-Classical Games']
Highest Scited Papers of the Year
As an example, say we want to determine what the highest scited papers were in the “quant-ph”
category for 2017. We can use Python’s built-in
datetime module to define objects
referring to the start and end date range:
# Use Python's datetime module to set a range of dates. Our range # starts on the first day of 2017 and ends at the last day of 2017. import datetime start_date = datetime.date(2017, 1, 1) end_date = datetime.date(2017, 12, 31)
We now want to cycle through all of the days within this range and extract the highest number of scites received for each day. Both of these things are accomplished with the following snippet of code.
# Loop through all days in 2017. Save information on each paper with the # highest number of scites for each day and store in respective lists. from scirate.category import ScirateCategory date = start_date delta = datetime.timedelta(days=1) max_scites =  max_scites_dates =  max_scites_arxiv_ids =  while d <= end_date: category = client.category("quant-ph", date.strftime("%Y-%m-%d")) if category.scites != : daily_max_scite = category.scites daily_max_arxiv_id = category.arxiv_ids max_scites.append(daily_max_scite) max_scites_arxiv_ids.append(daily_max_arxiv_id) max_scites_dates.append(date) d += delta
while loop goes through each day and checks whether or not there were postings
on that day. If so, we extract the listing with the highest number of scites. This may be
obtained by consulting the zeroth index, as listings are sorted from the highest number of
scites to the lowest. We then store the number of scites that the highest ranked article
has for that day along with the date it was posted and also the arXiv identifier for
This code doesn’t take all that long to compute, however, one may wish to only run this
once and store the resulting lists for easy retrieval. To accomplish this, we can make
use of the
pickle module in Python like so:
# Pickle lists to byte files for faster retrieval and to # avoid excess computation. import pickle with open('max_scites.pkl', 'wb') as f: pickle.dump(max_scites, f) with open('max_scites_arxiv_ids.pkl', 'wb') as f: pickle.dump(max_scites_arxiv_ids, f) with open('max_scites_dates.pkl', 'wb') as f: pickle.dump(max_scites_dates, f)
Once the pickled files have been stored, you may load them directly into your script as opposed to computing the contents from scratch again.
# Retrieve pickled files. max_scites = pickle.load(open('max_scites.pkl', 'rb')) max_scites_arxiv_ids = pickle.load(open('max_scites_arxiv_ids.pkl', 'rb')) max_scites_dates = pickle.load(open('max_scites_dates.pkl', 'rb'))
Now we can determine what the highest number of scites was for 2017.
# Print the maximum number of scites over all # papers of 2017. max(max_scites) >>> 114
The most scited paper on Scirate in 2017 received 114 scites. We can determine what paper this was by consulting the other list of items that held the arXiv identifiers of each paper.
# Use the paper module in the scirate package to # extract further information about the highest # scited paper. from scirate.paper import SciratePaper arxiv_id = max_scites_arxiv_ids[max_scites.index(max(max_scites))] print(arxiv_id) >>> 1704.00690 paper = client.paper(arxiv_id) print(paper.title) >>> Quantum advantage with shallow circuits print(paper.authors) >>> ['Sergey Bravyi', 'David Gosset', 'Robert Koenig']
Total Daily Postings
In this section, we will take a look at producing a bar chat that displays the cumulative total of postings for each day of the week over the entire year of 2017. We start off by creating an empty ordered dictionary that will store the labels for each weekday along with the cumulative counts of papers published on those days.
weekday_dict = collections.OrderedDict() weekday_dict["MON"] = 0 weekday_dict["TUE"] = 0 weekday_dict["WED"] = 0 weekday_dict["THU"] = 0 weekday_dict["FRI"] = 0
Now we will range over all of the dates for which postings were made available. We actually
already computed this in the
max_scites_dates. By ranging over these dates and extracting
the year, month, and day, we can determine the number of papers posted on that particular
day with the following snippet of code.
# Loop over all dates with postings on Scirate and # accumulate respective total postings for each # day of the week. for date in max_scites_dates: s = date.split("-") year, month, day = int(s), int(s), int(s) d = datetime.date(year, month, day) category = client.category("quant-ph", d.strftime("%Y-%m-%d")) print(date) print(category.papers) weekday_dict[weekday_list[d.weekday()]] += len(category.papers) with open('weekday_dict.pkl', 'wb') as f: pickle.dump(weekday_dict, f)
Just as we did before, we will pickle the result so we can refer to it later instead of computing
this every single time. Once we have computed the content to be stored in
weekly_dict, we can
load it in from the pickeled file and produce our bar chart.
# Load the pickled file and create a bar chart that # displays the cumulative number of postings on each # day of the week. weekday_dict = pickle.load(open('weekday_dict.pkl', 'rb')) weekday_list = ["MON", "TUE", "WED", "THU", "FRI"] x = np.arange(5) y = list(weekday_dict.values()) plt.xticks(np.arange(5), weekday_list) plt.bar(x, y) plt.xlabel("Day of Week") plt.ylabel("Total Number of Scites") plt.show()
Running the above code will generate the following bar chart.
This post was just a brief look at the
scirate Python package I wrote and an example of some
of the things you may want to do with it. By all means, if you make use of this module to do
anything interesting, I would most certainly like to see what you have done. Thanks again for
reading, and happy coding!