Social Media Analysis - Dr. Sharon G. Small

Social media has exploded in both the number of platforms and overall usage over the last decade. Usage has been analyzed retrospectively following violent behavior by people with social media accounts - could their actions have been predicted?

Social media is also being examined to see how it affected our most recent presidential election; it is being used in politics in ways no one could have predicted.

We are currently exploring what can be determined automatically from an analysis of social media posts. For example, can we automatically determine a person's education level, age, race, etc.? Can we determine whether posts have changed in tone or sentiment over time? Can we determine whether posts imply a violent or pacifist nature? Can posts clustered in geographic areas be used to predict social events?
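
As a concrete illustration of the sentiment-over-time question, the sketch below scores a handful of hypothetical posts with NLTK's VADER sentiment analyzer and checks for a negative drift. The posts, dates, and drift threshold are invented for illustration; they are not project data or the project's actual method.

```python
# Minimal sketch: check whether a user's posts drift in sentiment over time.
# Assumes NLTK's VADER lexicon is installed (nltk.download('vader_lexicon')).
# The posts below are hypothetical examples, not project data.
from datetime import date
from nltk.sentiment import SentimentIntensityAnalyzer

posts = [
    (date(2016, 1, 5), "Had a great time at the game with friends!"),
    (date(2016, 6, 2), "People never listen. So tired of all of this."),
    (date(2016, 11, 20), "Everything about this city makes me angry."),
]

sia = SentimentIntensityAnalyzer()
scores = [(d, sia.polarity_scores(text)["compound"]) for d, text in posts]

# A crude trend check: compare mean sentiment of the earlier and later posts.
half = len(scores) // 2
early = sum(s for _, s in scores[:half]) / half
late = sum(s for _, s in scores[half:]) / (len(scores) - half)
print(f"early mean: {early:+.2f}, late mean: {late:+.2f}")
if late < early - 0.25:  # threshold is an arbitrary illustration
    print("Posts appear to be trending more negative over time.")
```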

TREC Tasks Track - Dr. Sharon G. Small

The primary goals of this track are to evaluate a system's understanding of the tasks users aim to achieve and to evaluate the relevance of retrieved documents with respect to the underlying task behind a query.

Research in Information Retrieval has traditionally focused on serving the best results for a single query, ignoring the reasons (or the task) that might have motivated the user to submit that query. Search engines are often used to complete complex tasks (information needs), and achieving these tasks with current search engines requires users to issue multiple queries. For example, booking travel to a location such as London could require the user to submit various queries such as "flights to London," "hotels in London," "points of interest around London," etc. Similarly, a person who is trying to organize a wedding would need to issue separate queries to locate stores selling wedding gowns, arrange catering, book a honeymoon, etc. In some cases users may not even be aware of all the subtasks they need to complete to satisfy their information need, which makes search an even more difficult experience. Ideally, a search engine should be able to understand the reason that caused the user to submit a query (i.e., the actual task behind it) and, rather than just showing results relevant to the query submitted, guide the user toward completing that task by incorporating information about the actual information need.

The goal of this track is to devise evaluation methodologies for measuring the quality of task-based information retrieval systems. We have completed a system and submitted our results to TREC 2017.
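
One way to make task-level quality measurable is to credit a ranked result list for how many distinct subtasks of the underlying task it covers, rather than scoring each query in isolation. The sketch below is a hypothetical illustration of such a coverage measure, not the track's official evaluation methodology; the subtasks and judgments are invented.

```python
# Hypothetical sketch: score a result list by subtask coverage rather than
# per-query relevance. Subtasks and judgments are invented examples and do
# not reflect the track's official evaluation methodology.

# Subtasks of the task "book travel to London".
subtasks = {"flights", "hotels", "points_of_interest"}

# Which subtask(s) each retrieved document was judged relevant to.
doc_judgments = {
    "doc1": {"flights"},
    "doc2": {"flights"},  # redundant: same subtask as doc1
    "doc3": {"hotels"},
    "doc4": set(),        # not relevant to any subtask
}

def subtask_coverage(ranked_docs, judgments, subtasks):
    """Fraction of the task's subtasks covered anywhere in the ranking."""
    covered = set()
    for doc in ranked_docs:
        covered |= judgments.get(doc, set()) & subtasks
    return len(covered) / len(subtasks)

print(subtask_coverage(["doc1", "doc2", "doc3", "doc4"],
                       doc_judgments, subtasks))  # 2/3, since no POI doc
```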

Completed Projects

STIRS: Siena's Twitter Information Retrieval System

There has been increasing interest, from both the research community and businesses, in microblogs as a source of viable information for a variety of tasks. NIST (the National Institute of Standards and Technology) added a microblog retrieval track to TREC (the Text REtrieval Conference) for the first time in 2011, with Twitter selected as the source of microblog data. Teams participating in the track were provided with the code necessary to download the Twitter corpus, consisting of approximately 16 million tweets from a two-week period, January 24, 2011 to February 8, 2011, inclusive. Teams were also provided with a training set of 12 example topics and a test set of 50 topics; our team expanded on the training set by creating 22 more topics. As this was the first year TREC ran the track, no judgment sets were provided. For each topic the system was required to search the Twitter corpus and return a ranked list of the top 30 relevant tweets.

It was not yet clear how traditional Information Retrieval (IR) would perform on microblogs, so our first logical step was to run an experiment using traditional simple keyword retrieval. We used the open-source IR system Lucene to index the NIST-supplied Twitter corpus and to run our baseline experiments. This Lucene version became the first version of our STIRS system, and its results served as our baseline. We then experimented with a variety of Computational Linguistics techniques to improve the precision of the baseline; based on our team's judgments, we achieved a ~33% increase in precision over the baseline. (Paper to appear in the Proceedings of the Twentieth Annual TREC Conference.)
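
The baseline used Lucene, a Java library; the sketch below shows the same style of simple keyword baseline in Python using the rank_bm25 package, with three toy "tweets" standing in for the 16-million-tweet corpus. It is an analogous illustration, not the STIRS code itself.

```python
# Illustrative keyword-retrieval baseline in the spirit of the STIRS Lucene
# setup, using the rank_bm25 package. The three "tweets" stand in for the
# ~16-million-tweet TREC corpus, which the real system indexed with Lucene.
from rank_bm25 import BM25Okapi

tweets = [
    "BBC reports protests spreading across Cairo today",
    "Just had the best coffee of my life #morning",
    "Cairo protests continue for a third straight day",
]
tokenized = [t.lower().split() for t in tweets]
bm25 = BM25Okapi(tokenized)

query = "protests in cairo".lower().split()
scores = bm25.get_scores(query)

# Rank tweets by score and keep the top 30 (the track's required cutoff).
ranked = sorted(zip(scores, tweets), reverse=True)[:30]
for score, tweet in ranked:
    print(f"{score:.2f}  {tweet}")
```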

SAWUS: Siena's Automated Wikipedia Update System

The National Institute of Standards and Technology (NIST) has been running an annual Text REtrieval Conference (TREC) since 1992. This is a premier conference that offers researchers in the field of Computational Linguistics the opportunity to showcase their work and compare their results against other leading researchers. Our Siena research team will participate in the TREC Knowledge Base Acceleration track, which is being offered for the first time this year. The objective of this track is to drive research into the automatic acquisition of knowledge, such as automatically updating Wikipedia from online news. Specifically, teams of researchers will develop systems that filter a stream of content (~500,000 English articles from online news) for information that should be included on a given Wikipedia page. In processing this stream automatically, systems must identify relevant content and recommend edits to the corresponding Wikipedia pages to human curators.
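
As a minimal illustration of the filtering step, the sketch below flags streamed news articles whose TF-IDF cosine similarity to a target Wikipedia page crosses a threshold. The texts and the threshold are hypothetical, and this is a simplified stand-in rather than the actual SAWUS pipeline.

```python
# Minimal sketch of the stream-filtering step: flag news articles whose
# TF-IDF cosine similarity to a target Wikipedia page exceeds a threshold.
# Texts and the 0.2 threshold are hypothetical illustrations only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

wikipedia_page = ("Boris Johnson is a British politician who served as "
                  "Mayor of London.")
news_stream = [
    "Mayor Boris Johnson announced a new cycling plan for London today.",
    "Quarterly earnings at the tech giant beat analyst expectations.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([wikipedia_page] + news_stream)

# Similarity of each streamed article to the target page.
sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
for article, sim in zip(news_stream, sims):
    if sim > 0.2:  # arbitrary threshold for illustration
        print(f"Recommend for curation ({sim:.2f}): {article}")
```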

SERP - Siena Environmental Review Project

This research project will provide an interdisciplinary undergraduate research experience for several Siena students. Part of the Siena Environmental Review Project (SERP), this work will explore methods for improved understanding of the environmental review documents generated in updating regulatory oversight of natural gas extraction using hydraulic fracturing of horizontal wells from the Marcellus shale formation. The key innovation is the use of automated computational techniques to process and “understand” specific content in the documents and their related public comments.

We will use Computational Linguistics (CL) techniques to automatically process and index our data collection. The project anticipates three significant outcomes. First, the project will demonstrate the automated extraction of geographic information from the extensive public comments on the draft Supplemental Generic Environmental Impact Statement (SGEIS) for the Oil, Gas and Solution Mining Regulatory Program. Second, the project will identify, in both the SGEIS documents and the public comments, the frequency with which specific environmental and economic impacts are mentioned. The final task will be to demonstrate an automated approach to measuring the frequency with which alternative regulatory approaches (e.g., prohibition, technological standards, disclosure, bonding, and compensation) are identified and advocated.
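
The geographic-extraction step could, for example, be approached with off-the-shelf named entity recognition. The sketch below uses spaCy to pull place names (GPE entities) out of comment text and tally their frequencies; the comments are invented, and the project's actual pipeline may differ.

```python
# Illustrative sketch of the geographic-extraction step using spaCy's
# named entity recognizer to tally place names (GPE entities) in comments.
# Requires: python -m spacy download en_core_web_sm
# The comments are invented examples, not actual SGEIS submissions.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

comments = [
    "Drilling near Binghamton would threaten the water supply of New York City.",
    "Landowners in Broome County deserve the economic benefits seen in Pennsylvania.",
]

place_counts = Counter()
for doc in nlp.pipe(comments):
    place_counts.update(ent.text for ent in doc.ents if ent.label_ == "GPE")

print(place_counts.most_common())
```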

SCAPE - Siena College Automated Predictor of Extremism

A decade ago, intelligence analyst Chris Dishman advanced a persuasive theory that situated terrorist and guerrilla groups on a transformational spectrum ranging from ideologically committed to criminally motivated. Despite improved international anti-terror cooperation and information sharing, it remains extremely difficult to pinpoint where specific organizations fall today on this spectrum from political violence to criminality. In part, the lack of certainty reflects the secretive nature of guerrilla groups and their affiliated networks, which often effectively obfuscate their operations. It is also a result of the proliferation of communications technologies that have become widely available to such groups.

Our research team consists of a lead social scientist with knowledge of a range of guerrilla organizations, a lead researcher in the field of Computational Linguistics as applied to terrorism, and several students with expertise in programming and computational linguistics. The team will develop an analysis and forecasting model that aims to identify typical patterns associated with criminally and ideologically motivated political violence. If successful, this model can be incorporated into a system that predicts an organization's focus, making it possible for governments to devise more effective policy responses to guerrilla structures and their criminal networks, and ultimately to weaken them.
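
As a toy illustration of the kind of pattern classifier such a model might incorporate, the sketch below trains a TF-IDF logistic-regression classifier to separate criminally flavored from ideologically flavored statements. The texts and labels are invented, and this is not the SCAPE team's actual model.

```python
# Toy illustration of one possible component of such a forecasting model:
# a text classifier separating criminally motivated from ideologically
# motivated statements. Texts, labels, and features are invented; this is
# not the SCAPE team's actual model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Profits from the smuggling route fund our operations",
    "Ransom payments keep the organization running",
    "Our struggle is for the liberation of our people",
    "We fight to defend the faith against the oppressor",
]
labels = ["criminal", "criminal", "ideological", "ideological"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["extortion money paid to the group"]))
```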

MIRS - Siena Medical Information Retrieval System

The National Institute of Standards and Technology (NIST) has been running an annual Text REtrieval Conference (TREC) since 1992. This is a premier conference that offers researchers in the field of Computational Linguistics the opportunity to showcase their work and compare their results against other leading researchers. Our Siena research team will participate in the TREC 2012 Medical Records Track. The goal of this track is to develop search technologies that meet the needs of health professionals engaging in effective discovery over digital document collections. Specifically, our MIRS research team will develop a system that accepts a given topic of interest, searches a medical report corpus, and returns a ranked list of relevant documents, for example with the goal of setting up clinical trials.
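
TREC tracks of this kind typically score a submitted ranking against human relevance judgments (qrels). The sketch below computes two standard measures, precision at 10 and average precision, for a single topic; the document IDs and judgments are invented examples.

```python
# Sketch of how a TREC-style ranked list is typically scored against
# relevance judgments (qrels): precision@10 and average precision for
# one topic. Document IDs and judgments are invented examples.

relevant = {"rep12", "rep40", "rep77"}          # judged-relevant reports
ranking = ["rep12", "rep03", "rep40", "rep55",  # system's ranked output
           "rep77", "rep08", "rep21", "rep30",
           "rep14", "rep99"]

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

print(f"P@10 = {precision_at_k(ranking, relevant):.2f}")   # 0.30
print(f"AP   = {average_precision(ranking, relevant):.2f}")  # 0.76
```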

The ACCESS Protocol in SCALE-UP Classroom Learning Environments

A current research project is "Bridging the Expert-Novice Problem-Solving Gap with the ACCESS Protocol," funded by an NSF CCLI grant now in its third year. The work is in collaboration with Prof. Gerald Feldman at George Washington University (GWU). The project involves the use of a general problem-solving method in college (algebra-based) physics courses at GWU in a SCALE-UP classroom active-learning environment. The study of the expert-novice problem builds on advances in cognitive science and artificial intelligence in an attempt to understand and mitigate the difficulty people have learning physics, with the goal of gaining insight into the mechanisms of general problem solving.

Applications of Independent Component Analysis

This work is based on research in the areas of Artificial Intelligence and Neural Computing. Current projects use a method related to neural computing called Independent Component Analysis (ICA). The FastICA software in MATLAB is used to apply ICA to data from physics experiments. Preliminary findings are published in the conference proceedings of the International Neural Network Society (INNS). Preliminary results from accelerator-based nuclear physics data show superior performance in identifying resonance effects and variations of reaction cross-sections as a function of measurement angle.
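
The project's implementation uses the FastICA package in MATLAB; the sketch below reproduces the classic blind-source-separation setup with scikit-learn's FastICA in Python, with synthetic signals standing in for the experimental physics data.

```python
# Python analogue of the project's MATLAB FastICA workflow using
# scikit-learn: recover two independent source signals from linear
# mixtures. The synthetic signals stand in for the nuclear-physics data.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)

s1 = np.sin(2 * t)                       # sinusoidal source
s2 = np.sign(np.sin(3 * t))              # square-wave source
S = np.c_[s1, s2] + 0.1 * rng.standard_normal((2000, 2))  # add noise

A = np.array([[1.0, 0.5], [0.5, 2.0]])   # mixing matrix
X = S @ A.T                              # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # estimated independent components

# ICA recovers components only up to sign and scale, so check the
# correlation between true and estimated sources instead of raw values.
corr = np.corrcoef(S.T, S_est.T)[:2, 2:]
print(np.round(np.abs(corr), 2))
```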

LIRS - Siena Legal Information Retrieval System

The National Institute of Standards and Technology (NIST) has been running an annual Text REtrieval Conference (TREC) since 1992. This is a premier conference that offers researchers in the field of Computational Linguistics the opportunity to showcase their work and compare their results against other leading researchers. Our Siena research team will participate in the TREC 2012 Legal Track. The goal of the legal track is to develop search technologies that meet the needs of lawyers engaging in effective discovery over digital document collections. Specifically, our research team will develop a system that accepts a given topic of interest, searches a legal corpus, and returns a ranked list of relevant documents.