Monday, July 6. 2009
I'm retrieving the search engine count estimates for queries from Bing, Google and Yahoo, using each engine's API and getting back JSON-formatted results. You'd think it's easy, but think again. The problem is that each search engine uses a different format, with different interpretations of the JSON spec. For example, the total number of hits (the search engine count estimate) for a query is called "Total" by Bing, "estimatedResultCount" by Google and "deephits" by Yahoo. It gets worse: Bing returns an integer value (correct), while Yahoo and Google return a string containing an integer (not so correct). Hopefully in the future there will be a single standard format for interacting with a search engine, but that's a very, very long way off.
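One way to hide these differences is a small lookup table that maps each engine to its count field and always converts the value to an integer. This is only a sketch: the field names come from the post above, but the function name, the engine keys and the assumption that the count sits at the top level of the parsed JSON are mine (the real responses nest it deeper).

```python
import json

# Field that holds the hit count estimate for each engine (names from the post).
COUNT_FIELDS = {
    "bing": "Total",
    "google": "estimatedResultCount",
    "yahoo": "deephits",
}


def hit_count(engine, payload):
    """Return the hit count estimate as an int, whichever engine it came from.

    `payload` is assumed to be the already parsed JSON object that directly
    contains the count field (hypothetical: real responses are nested).
    """
    field = COUNT_FIELDS[engine]
    # int() accepts both a real integer (Bing) and a numeric string (Google, Yahoo).
    return int(payload[field])


if __name__ == "__main__":
    print(hit_count("bing", {"Total": 1200}))
    print(hit_count("yahoo", json.loads('{"deephits": "98"}')))
```

Converting at the edge like this means the rest of the code never has to care which engine a number came from.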
Monday, June 29. 2009
Some observations on using the search APIs for the 3 major search engines, Google, Bing and Yahoo. By far the best I've found is Bing, followed by Yahoo, and lastly Google.
To perform a search for Ireland:
For Bing and Yahoo you need to sign up for an API key. It only takes 2 minutes. Google doesn't need an API key. All can return JSON-formatted search results (also XML), however each has a proprietary format. All search engines have removed limits on the number of queries you can submit. Bing and Yahoo don't place limits on the number of results that can be returned from a single query, however Google limits you to 64 results for a general search (other searches are more limited). The 'rsz' parameter for Google can be small (4 results) or large (8 results). To retrieve more results for all you can apply an offset, which is the last parameter for each. Overall the search APIs have improved massively over the past year. All that's really missing is a unified search syntax and result set.
Tuesday, June 16. 2009
Using data I collected from Reddit for RedditTrends I found some interesting spikes in the number of votes submitted articles receive.

This graph shows the sum total of UP votes received per day by submitted links.

This graph shows the sum total of DOWN votes received per day by submitted links.
As you can see, the graphs are normally quite static, but have huge spikes every now and again, which are massively out of kilter with the normal everyday average.

This graph shows the total score achieved, which should be UP votes minus DOWN votes. On one positive note, users of Reddit are twice as likely to UP vote a link as to DOWN vote it.

When this is compared to the total number of comments, it's clear that the spike around January is a natural one. It seems to correspond with Obama's election and inauguration.
The huge spike on the 24th of April was effectively a revolt on Reddit over the number of duplicate stories. Dozens of links got vast numbers of UP and DOWN votes, mostly with the same title.
Sunday, May 3. 2009
As an extension to PredictReddit I've created RedditTrends.com. It's kind of like Google Trends, except for Reddit submissions. It's just an early work in progress but looks interesting and produces pretty graphs.
Pandemic
Obama
Mexico
Graphs are produced using flot (jQuery). The backend runs the Zend Framework (PHP) and MySQL.
Saturday, May 2. 2009
A huge number of links are submitted to Reddit, but few ever get enough votes to make it to the front page of the site, where most Reddit users would see them. A submission's success depends greatly on its title, and you only have one shot to get it right.
This is where PredictReddit comes in. It allows you to test out your proposed title, giving you an estimate of the number of votes it is likely to get. So you can fine tune it before you submit it to Reddit. It uses past submissions to predict future votes. This of course assumes that the Reddit community is interested in similar recurring topics (and it seems to be).
There can be some confusion over the results it gives back. For example, if you type in a title that you know got a high number of votes and you get a low number of votes back, you might assume that PredictReddit is broken. In reality, if a story is very successful, you often find numerous other submissions trying to piggyback on its success (and failing to get many votes). This gives a low predicted number of votes. Best to play around with it yourself and try it out. And remember, it's just for fun.
It works by using a k-Nearest Neighbours algorithm. It was written in PHP using the Zend Framework, uses MySQL for data storage, and pulls its data from Reddit using their JSON interface.
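To give a feel for how a k-Nearest Neighbours prediction like this can work, here is a minimal sketch in Python. It is not the PredictReddit code: the word-overlap (Jaccard) similarity, the plain average of the neighbours' votes and all the names are assumptions made for illustration.

```python
def tokens(title):
    """Lower-case a title and return its set of words."""
    return set(title.lower().split())


def jaccard(a, b):
    """Word overlap between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_votes(title, history, k=3):
    """Average the votes of the k past titles most similar to `title`.

    `history` is a list of (title, votes) pairs (hypothetical data layout).
    """
    query = tokens(title)
    ranked = sorted(
        history,
        key=lambda item: jaccard(query, tokens(item[0])),
        reverse=True,
    )
    nearest = ranked[:k]
    if not nearest:
        return 0.0
    return sum(votes for _, votes in nearest) / len(nearest)


if __name__ == "__main__":
    history = [
        ("obama wins election", 900),
        ("obama wins the election", 10),
        ("obama election win", 5),
        ("cute cat picture", 300),
    ]
    print(predict_votes("obama wins election", history, k=3))
```

The sample data also shows the piggyback effect: the one big hit is averaged together with its low-scoring imitators, so the prediction comes out well below 900.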
Wednesday, November 14. 2007
The logical operator for NOT equals in Matlab is ~= rather than !=
So NOT is ~ in Matlab, compared to ! in Java/C/Perl etc.
Saturday, July 21. 2007
I implemented an automatic tagging system a while ago for phpBB, which calculates the most commonly occurring words in a thread and saves them as tags. A user can then click on a tag to find other threads tagged with the same word.
They can then see all of the threads with this tag on a nice interactive timeline, and can click on a thread on the timeline to see the 1st post, or drag it to see a different point in time.
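The tagging idea itself is simple enough to sketch in a few lines. The Python below is only an illustration of "most commonly occurring words in a thread", not the phpBB code: the stop word list, the minimum word length and the function name are all assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "that", "this"}


def suggest_tags(posts, count=5, min_length=3):
    """Return the `count` most frequent words across the posts of a thread."""
    words = []
    for post in posts:
        for word in re.findall(r"[a-z']+", post.lower()):
            # Skip very short words and common filler words.
            if len(word) >= min_length and word not in STOPWORDS:
                words.append(word)
    return [word for word, _ in Counter(words).most_common(count)]


if __name__ == "__main__":
    thread = [
        "The election is on Thursday",
        "Election turnout and the election result",
        "Turnout was low",
    ]
    print(suggest_tags(thread, count=2))
```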
Continue reading "Visualising threads on a timeline"
Friday, May 4. 2007
With the upcoming elections Politics.ie has seen its traffic explode, tripling in 2 weeks from 30k pageviews per day to 100k pageviews per day. Six months ago it was only 15k pageviews per day, and when it got to the 30k mark we got ourselves a new dual Xeon to keep up.
So here are some things I've done over the past while to reduce the load on the server and cope with the huge increase in traffic.
Installed eXtreme styles template caching mod - this precompiles the templates phpBB uses.
Moved the CSS that was included inline in each page into a standalone style file, saving 12kb per page.
Caching of pages - the most viewed pages are the latest discussions page, the landing page and the index.php page. By taking a snapshot of each every few minutes and serving those out, I cut down on the number of pages that have to be dynamically generated. Very simply, I just put a check at the top of the PHP scripts to see if the URL matched, e.g. '/latest.php', and served out a copy if it did. Only a few pages are suitable for caching. Nobody has complained so far, so it must be working fine.
Turned on mod_expires in Apache and set images & JavaScript to expire in 1 month, so they will be cached along the way. mod_mem_cache isn't suitable because of the dynamic nature of most of the pages on the site.
Logged into Google's webmaster tools and asked Google to slow down their crawling (they were asking for 80k pages per day).
I've also optimized the MySQL queries, but that's for another day. All in all the above changes reduced the load a fair bit, but traffic is continuously rising, so the savings are soon used up.
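The page caching step above can be sketched roughly as follows. This is a Python illustration of the idea rather than the PHP that runs on the site: the five minute lifetime, the in-memory dictionary and the `render` callback are assumptions, and only '/latest.php' and the index page are named in the post.

```python
import time

# Paths that are safe to serve from a snapshot (hypothetical list).
CACHEABLE = {"/latest.php", "/index.php"}
MAX_AGE = 300  # seconds a snapshot stays fresh; "every few minutes" in the post

_snapshots = {}  # path -> (timestamp, html)


def serve(path, render, now=None):
    """Return the page for `path`, reusing a recent snapshot when allowed.

    `render` is whatever function builds the page dynamically.
    """
    now = time.time() if now is None else now
    if path not in CACHEABLE:
        return render(path)
    cached = _snapshots.get(path)
    if cached and now - cached[0] < MAX_AGE:
        return cached[1]  # fresh snapshot: no database work at all
    html = render(path)
    _snapshots[path] = (now, html)
    return html
```

Everything outside the cacheable list still goes through the normal dynamic path, which is why only a handful of pages benefit.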
Monday, February 12. 2007
Firefox and IE 7 have a search box in the top bar, and you can choose from a list of search engines to perform a search from. Wouldn't it be nice if a person could visit your site and your site's search facility was available in this drop-down list, and your site could be permanently added as a search engine in a person's browser?
Well, it's very simple. Create an XML file such as this:
<?xml version="1.0"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
<ShortName>Bigulo</ShortName>
<Description>Search for people</Description>
<Image height="16" width="16"
type="image/x-icon">http://bigulo.com/favicon.ico</Image>
<Url type="text/html" method="get"
template="http://www.bigulo.com/index.php?namesearch={searchTerms}"/>
</OpenSearchDescription>
Obviously you replace the various lines with your own URLs. Then in your HTML header add in this:
<link rel="search" type="application/opensearchdescription+xml" href="/bigulosearch.xml" title="Bigulo" />
where the XML file is the one from above. That's all. Now in IE and Firefox a person will be able to use your site's search feature from their built-in search box.
Thanks to Niall.
Monday, January 8. 2007
Someone decided to put Dez's latest blog post on Reddit. Unfortunately the MySQL server was having trouble keeping up, resulting in slow responses for visitors. This has happened twice before (it's a popular blog). Anyway, the problem came from a bit of sloppy coding in Serendipity, and one quick change later the site is back running at full speed.
Continue reading "The value of indexes in SQL tables"
Monday, December 4. 2006
Conor and I (mostly Conor) used 500 Bebo profile pictures to generate a mosaic of The Blizzards' latest album cover. It was generated using Matlab. The original Blizzards image is only 60x60 pixels, but as you can see it turned out really well.
Continue reading "Bebo picture mosaic of The Blizzards"
Sunday, December 3. 2006
Previously I gave an overview of an auto tagging and suggested thread system I created for phpBB. Here is some of the code to implement it.
Continue reading "Auto tagging and suggesting related threads in phpBB"
Friday, December 1. 2006
Since last night, every page that the Google AdSense bot (Mediapartner) requested from my site has resulted in a 404. This affects not only this site but also another site I admin, which is in a different country and runs different software and a different operating system.
For example the adsense bot requests:
http://bigulo.com/index.php%3Fnamesearch=george+bush
http://www.politics.ie/viewforum.php%3Ff%3D2
When it should be:
http://bigulo.com/index.php?namesearch=george+bush
http://www.politics.ie/viewforum.php?f=2
The reason for this is that the bot converts all question marks in a URL to %3F (and, in the second example, the equals sign to %3D). This only affects the Google AdSense bot; all other user agents work perfectly, and the Google search engine bot works fine.
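A short Python sketch shows why the encoded form fails. Once the question mark is percent-encoded it is no longer a separator, so the whole thing is treated as a path to a file that doesn't exist. How the bot actually produced these URLs is unknown; using `quote` here is only an assumption that happens to reproduce the second example.

```python
from urllib.parse import quote, unquote, urlsplit

good = "http://www.politics.ie/viewforum.php?f=2"

# Percent-encode everything except ':' and '/', which reproduces the bad URL.
broken = quote(good, safe=":/")

print(broken)
print(urlsplit(good).path, urlsplit(good).query)
print(urlsplit(broken).path, repr(urlsplit(broken).query))

# Decoding once turns the broken URL back into the intended one.
print(unquote(broken) == good)
```

For the good URL the server sees the script viewforum.php plus a query string; for the broken one it sees a single long path and no query string at all, hence the 404.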
At first the incorrect URLs appeared intermittently, starting at [30/Nov/2006:20:42:23 -0600], with correct URLs appearing in between. Then from [01/Dec/2006:05:58:03 -0600] all further URLs from the AdSense bot were incorrect, and have been ever since.
Since the AdSense bot cannot view any pages, it cannot serve out context-related ads, so public service ads are appearing, resulting in lost revenue. Judging by the times, I would say they tested a new version of the AdSense bot, then deployed it across all of their AdSense bot machines at 12pm GMT. If you run AdSense on your site, I'd suggest checking your logs and contacting Google.
UPDATE: They fixed it at [01/Dec/2006:19:41:03 -0600].
Tuesday, November 28. 2006
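The problem: I want to access individual characters in a string in PHP, but I don't have the PHP modules installed to use explode. Here's an alternative (and inefficient) way of doing it using split:
// Split $str into pieces of $nr characters each.
function string_split($str, $nr){
// chunk_split() also appends the separator after the last chunk,
// so strip it off, otherwise split() returns an extra empty element.
$chunked = chunk_split($str, $nr, '-l-');
return split("-l-", substr($chunked, 0, -3));
}
string_split('1234', 1); // array('1', '2', '3', '4')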
Sunday, September 3. 2006
Introduction
I run a few websites on a 1GHz PIII with 512MB RAM. It's an old server, but it's all I've got, so I have to make the best use of it. Since the last restart of MySQL 58 days ago there have been 210 million queries processed, which works out at roughly 40 per second. The databases contain 12 million rows and 600MB of data. The server gets most hits during working hours on weekdays, so at peak times there are over 100 queries per second which need to be serviced. With so many queries it's vital that they are optimized to maximize throughput. So here are some simple things you can do to optimize your MySQL queries. Most of these tips are generalizable to any database, but MySQL has some proprietary syntax, so I can't guarantee they will work on any other database software.
Indexing
The simplest way to improve the performance of your queries is to index some fields in the table you're looking up. So if you use ORDER BY age DESC to sort results by age, an index on the age field will dramatically speed up the query. Essentially, when rows are inserted, the index on the age field presorts the age field for you. An index does take up additional space in your database though, so don't go overboard. It also increases the insert time. If you're using phpMyAdmin you just click the lightning icon beside the field you want to index.
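You can watch an index change how a sort is carried out by asking the database for its query plan. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since it needs no server; MySQL has its own EXPLAIN with different output, and the table contents and index name are made up for the example.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption); the idea carries over.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("andrew", 24), ("john", 25), ("jack", 34)],
)


def plan(sql):
    """Return SQLite's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


query = "SELECT name, age FROM users ORDER BY age DESC"

before = plan(query)  # no index yet: the rows have to be sorted at query time
conn.execute("CREATE INDEX idx_users_age ON users (age)")
after = plan(query)   # the index already holds the rows in age order

print(before)
print(after)
```

The exact wording of the plan differs between SQLite versions, but the second plan should mention the index while the first cannot.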
Multiple inserts
Rather than using multiple different queries to insert multiple rows into the same table, group them together. For example:
insert into users(name,age) values ('andrew',24),('john',25),('jack',34);
By grouping the inserts together you reduce the per-query overhead of sending each statement to the database and having it parsed and executed.
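From application code the grouped insert can be built with placeholders, so the values are still escaped properly. This is a sketch using Python's built-in sqlite3 module as a stand-in for MySQL; the function name is hypothetical, and a real version would also split very large batches, since databases cap the size of a single statement.

```python
import sqlite3


def insert_users(conn, users):
    """Insert all (name, age) pairs with a single multi-row INSERT."""
    if not users:
        return 0
    # One "(?, ?)" group per row, e.g. "(?, ?), (?, ?), (?, ?)".
    placeholders = ", ".join(["(?, ?)"] * len(users))
    # Flatten [(name, age), ...] into [name, age, name, age, ...].
    params = [value for user in users for value in user]
    cursor = conn.execute(
        "INSERT INTO users (name, age) VALUES " + placeholders, params
    )
    return cursor.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    print(insert_users(conn, [("andrew", 24), ("john", 25), ("jack", 34)]))
```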
Delayed inserts
If your data doesn't need to be inserted into the database immediately (so readers can briefly see slightly stale data), then use DELAYED when inserting. This allows MySQL to queue the inserts in memory and group them together for more efficient processing. For example:
insert DELAYED into users(name,age) values ('andrew',24),('john',25),('jack',34);
Counting
So you want to find out how many rows there are in a table. The most common mistake I see is where people select every row. This means MySQL has to read in the data from every row in the table (probably off disk), which is extremely inefficient.
Wrong Way: select * from users
The right way to do it is to use count. For MyISAM tables MySQL already stores the number of rows in the table, so the query below only reads a single value, which is far more efficient.
Right Way: select count(*) from users
If you have a where clause, using count is still more efficient, because it only reads what it needs into memory.
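Here is what the two approaches look like from application code. Again this is a sketch in Python with the built-in sqlite3 module standing in for MySQL, and the function names are made up. Both return the same number, but the first drags every row across to the application just to count them.

```python
import sqlite3


def count_users_slow(conn):
    """Wrong way: fetch every row, then count them in the application."""
    return len(conn.execute("SELECT * FROM users").fetchall())


def count_users(conn, min_age=None):
    """Right way: let the database count, optionally with a WHERE clause."""
    if min_age is None:
        row = conn.execute("SELECT COUNT(*) FROM users").fetchone()
    else:
        row = conn.execute(
            "SELECT COUNT(*) FROM users WHERE age >= ?", (min_age,)
        ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO users (name, age) VALUES (?, ?)",
        [("andrew", 24), ("john", 25), ("jack", 34)],
    )
    print(count_users_slow(conn), count_users(conn), count_users(conn, min_age=25))
```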