In order to do this, I'd need to...
- Scrape the HTML from the web page and find the event data
- Connect to Google Calendar and add the events found
Because I like programming in Python, the first thing I did was get the latest copy of BeautifulSoup, a library that is unbelievably handy for scraping data out of HTML, and also Google GData, which lets me talk to Google Calendar.
So I began...
import urllib, urlparse, gdata, time, datetime
from bs4 import BeautifulSoup
import atom
import gdata.calendar
import gdata.calendar.service
... and loaded the libraries. Then I connected to Google Calendar, like this...
print "Connecting to Google Calendar"
calendar_service = gdata.calendar.service.CalendarService()
calendar_service.email = '*********@york.ac.uk'
calendar_service.password = '**********'
calendar_service.source = 'Google-Calendar_Python_Sample-1.0'
calendar_service.ProgrammaticLogin()
.... then got the web page with the Festival of Ideas events on it like this...
url = 'http://yorkfestivalofideas.com/talks/'
print "reading ", url
u = urllib.urlopen( url )
html = u.read()
... At this point, I knew I wanted to create a separate calendar, so I made one in Google Calendar ( IMPORTANT! Set the timezone of your newly created calendar!!! ). Once I'd done this, I could then find what's called the calendar link which you use to specify which calendar you want events to go into...
def get_my_calendars_url(cal_name):
    feed = calendar_service.GetOwnCalendarsFeed()
    for i, a_calendar in enumerate(feed.entry):
        name = a_calendar.title.text
        print i, a_calendar.title.text, a_calendar.link[0].href
        if name == cal_name:
            return a_calendar.link[0].href

calendar_link = get_my_calendars_url("Festival of Ideas")
So, now I have some HTML with useful information in it and a way of connecting to my chosen calendar... I need to use BeautifulSoup to fish out the data I need. I begin like this...
soup = BeautifulSoup( html )
events = soup.find_all("div", {'class':'event'})
... Now the HTML has been turned into a "soup", which means I can do fancy things with it... like the 2nd line above, where I grab any DIV of class "event" from markup that looks like this...
<div class="event">
  <div class="eventdate">
    <div class="day">
      Thu
    </div>
    <div class="date">
      14
    </div>
    <div class="month">
      Jun
    </div>
  </div>
  <div class="eventdetails">
    <p class="eventtitle">
      <a href="/talks/2012/frenck/">
        Where it all began: The Big Bang
      </a>
    </p>
    <p class="eventteaser">
      Professor Carlos Frenk will open this year's York Festival of Ideas with a talk on the biggest metamorphosis of all - that of the universe as a whole, from the simplicity of the Big Bang to the complexity of the universe of galaxies, stars, and the planet on which we live.
    </p>
  </div>
  <div class="clear"></div>
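One thing worth noticing in that markup: the link is relative (/talks/2012/frenck/), so it has to be resolved against the page's URL before it can be fetched. Here's a minimal sketch of that step on its own, using only the standard library. The variable names are just for illustration, and the import fallback is there because the function lives in urlparse on Python 2 and in urllib.parse on Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in this post

page_url = 'http://yorkfestivalofideas.com/talks/'
relative_href = '/talks/2012/frenck/'

# A leading slash means "from the site root", so the path part of
# page_url is replaced rather than extended.
absolute_href = urljoin(page_url, relative_href)
print(absolute_href)
```

This prints http://yorkfestivalofideas.com/talks/2012/frenck/, which is an address that can be opened directly.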
...Once I've got a list of events, I can loop over them like this... finding each event's title and teaser text, then following its link to read the start time from the event's own page...
for event in events:
    try:
        title = event.find('p', {'class':'eventtitle'}).find('a').contents[0].strip()
        href = event.find('p', {'class':'eventtitle'}).find('a')['href']
        href = urlparse.urljoin(url, href)  # make the relative link absolute
        #Get the actual page in the href!
        u = urllib.urlopen( href )
        event_html = u.read()
        small_soup = BeautifulSoup(event_html)
        start_time = small_soup.find('abbr', {'class':'dtstart'})['title']
        st = time.strptime(start_time, "%Y-%m-%dT%H:%M")
        # Assume each event lasts two hours; a timedelta also copes with
        # starts after 22:00, where adding 2 to the hour would be invalid
        end_dt = datetime.datetime(*st[:6]) + datetime.timedelta(hours=2)
        end_time = end_dt.strftime("%Y-%m-%dT%H:%M:%S")
        start_time = start_time + ":00" #HACK UG!
        teaser = event.find('p', {'class':'eventteaser'}).contents[0].strip()
        teaser = teaser + "\n\n" + href
        print "creating event:", title
        print create_event(title, teaser, "York, UK", start_time, end_time)
        print "_" * 80
    except Exception as err:
        # Report the problem and carry on with the next event
        print err
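Guessing the end time is the fiddly part of that loop, so here it is pulled out on its own. This is only a minimal sketch: it assumes start times always arrive as "%Y-%m-%dT%H:%M" and that two hours is a fair guess at the length of a talk, and event_window is a hypothetical helper, not something from the script above. It adds a timedelta, so an event starting late in the evening rolls over to the next day instead of producing an invalid hour, and it takes the year from the page rather than assuming 2012.

```python
import datetime

def event_window(start_text, hours=2):
    """Return (start, end) strings with seconds added.

    start_text is assumed to look like '2012-06-14T18:30'.
    The duration is a guess, because only the start time is read.
    """
    start = datetime.datetime.strptime(start_text, "%Y-%m-%dT%H:%M")
    end = start + datetime.timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%S"
    return start.strftime(fmt), end.strftime(fmt)

# A late start crosses midnight without raising an error
print(event_window("2012-06-14T23:00"))
# → ('2012-06-14T23:00:00', '2012-06-15T01:00:00')
```

Unlike rounding the end time to the hour, this keeps the minutes, so an 18:30 start ends at 20:30.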
.... and the create_event code, which uses that calendar_link mentioned earlier, is...
def create_event( title='A lovely event',
                  content='Some text about it',
                  where='York, UK', start_time=None, end_time=None):
    event = gdata.calendar.CalendarEventEntry()
    event.title = atom.Title(text=title)
    event.content = atom.Content(text=content)
    #time_zone = 'Europe/London'
    #event.timezone = gdata.calendar.data.TimeZoneProperty(value=time_zone)
    event.where.append(gdata.calendar.Where(value_string=where))
    if start_time is None:
        # Use current time for the start_time and have the event last 1 hour
        start_time = time.strftime('%Y-%m-%dT%H:%M:%S.000Z', time.gmtime())
        end_time = time.strftime('%Y-%m-%dT%H:%M:%S.000Z', time.gmtime(time.time() + 3600))
    event.when.append(gdata.calendar.When(start_time=start_time, end_time=end_time))
    new_event = calendar_service.InsertEvent(event, calendar_link)
    return new_event
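It's worth seeing that two slightly different time formats are in play here. The fallback branch builds UTC timestamps ending in ".000Z", whereas the scraped times are passed in with no offset at all, which is presumably why the calendar's own timezone setting matters so much. A minimal sketch of the fallback on its own, where default_window is a hypothetical name and the now argument is only there so the result can be checked:

```python
import time

def default_window(now=None, seconds=3600):
    """Return (start, end) UTC timestamps for an event starting at `now`.

    `now` is seconds since the epoch; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    fmt = '%Y-%m-%dT%H:%M:%S.000Z'  # the trailing Z marks the time as UTC
    start = time.strftime(fmt, time.gmtime(now))
    end = time.strftime(fmt, time.gmtime(now + seconds))
    return start, end

# Using the epoch as a fixed starting point
print(default_window(0))
# → ('1970-01-01T00:00:00.000Z', '1970-01-01T01:00:00.000Z')
```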
... Putting it all together, I got events that can be displayed in a fairly rubbishy widget (go to June 2012 to see the events!) or in a calendar that anyone can browse here.
https://www.google.com/calendar/embed?src=york.ac.uk_9d9et5aruukobiaqpgke4n63rk@group.calendar.google.com&ctz=Europe/London&gsessionid=OK
The End Result?
To be honest, presentation isn't Google Calendar's strong point, is it? It's fugly. It's all about the utility though... and I suppose making sure you get to those events.
I guess my point was, and is, that more of this sort of data should be ending up in places where I can use it, i.e. in Google Calendar rather than hiding on a web page somewhere. Maybe this little bit of code will help someone to get their events into a more usable form.